Why the SFR framework must be tested before it can be trusted.
A proposed standard that produces no independent evaluation record is an assertion. A proposed standard that demonstrates repeatable classifications across independent evaluators has begun to build the evidence base that earns trust and advances toward ratification. This section explains why validation matters, what the validation infrastructure consists of, and what must exist before the framework can advance from Stage 1 Community Review to Stage 2 Independent Evaluation.
A classification framework has one non-negotiable requirement: two independent evaluators applying the same methodology to the same system with the same evidence must reach the same classification outcome. If they do not, the framework is not measuring what it claims to measure: either the classification criteria are ambiguous, the evidence hierarchy is not being applied consistently, or the methodology produces evaluator-dependent results. None of these conditions is acceptable in a framework that aspires to standards status.
Repeatability is not a procedural nicety. It is the structural property that distinguishes a standard from an opinion. An evaluation of a simulation system conducted by one evaluator is evidence. An evaluation conducted by two independent evaluators that reaches the same result is reproducible evidence. A body of such evidence, accumulated across multiple systems and evaluators, is the foundation on which a formal standard is built.
The SFR framework's evaluation methodology defines three criteria (Causative Accuracy, Temporal Coherence, Human Response Relevance) and a four-tier evidence hierarchy. These definitions are the mechanism by which repeatability is intended to be achieved. But whether they actually produce repeatable results in practice — across real systems, real evidence, and real evaluators who were not involved in writing the framework — is an empirical question that has not yet been answered.
Repeatability is not claimed. It is demonstrated. The validation infrastructure exists to generate the demonstration.
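As a rough illustration of how a tiered evidence hierarchy can make conflicting evidence resolve the same way for every evaluator, the sketch below applies a "higher-tier evidence prevails" rule. The `EvidenceItem` type, its field names, the source labels, and the convention that tier 1 is the strongest tier are all assumptions made for this example; the framework's own tier definitions govern in practice.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EvidenceItem:
    """One piece of evidence bearing on a single criterion (hypothetical shape)."""
    tier: int        # 1..4; ASSUMPTION: tier 1 is the strongest tier
    supports: bool   # True if the item supports the criterion being met
    source: str      # free-text label; the values used below are made up


def prevailing_finding(items: Sequence[EvidenceItem]) -> Optional[bool]:
    """Return the finding backed by the strongest tier present.

    Returns None when items within the strongest tier contradict each
    other, so the case is flagged as unresolved instead of being decided
    silently.
    """
    if not items:
        raise ValueError("at least one evidence item is required")
    for item in items:
        if not 1 <= item.tier <= 4:
            raise ValueError(f"tier must be 1-4, got {item.tier}")
    strongest = min(item.tier for item in items)
    findings = {item.supports for item in items if item.tier == strongest}
    return findings.pop() if len(findings) == 1 else None


evidence = [
    EvidenceItem(tier=3, supports=True, source="vendor description"),
    EvidenceItem(tier=1, supports=False, source="instrumented test"),
]
print(prevailing_finding(evidence))  # False: the tier-1 item prevails
```

A rule of this kind matters for repeatability because it removes evaluator discretion from the common case and surfaces the remaining conflicts explicitly.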
An evaluation conducted by the framework's authors is not independent. It may still be technically rigorous, but it carries an inherent limitation: the authors know what the framework is intended to produce. Their application of the criteria may be unconsciously guided by that knowledge. Confirmation bias in evaluation is not a personal failing — it is a structural risk that applies to any framework whose evaluation is conducted only by those who designed it.
Independent evaluation means evaluation conducted by parties who had no involvement in the design of the framework, who are applying the criteria from the published documentation alone, and who receive no guidance from the authors during the evaluation process. The result of independent evaluation is more informative than any internally conducted evaluation because it tests whether the published framework is clear enough, complete enough, and unambiguous enough for someone who did not write it to apply it correctly.
Where independent evaluators apply the methodology and reach different conclusions from each other, the disagreement itself is data. It identifies the specific criteria, conditions, or evidence-type combinations where the methodology is ambiguous. Those are the gaps that must be resolved before the framework can claim repeatability. Without independent evaluation, those gaps stay in the methodology undetected.
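One way to treat disagreement as data is to compare two evaluators' determinations criterion by criterion and report where they diverge. The sketch below is a minimal example under stated assumptions: the three criterion names come from the framework, while the function name and the determination labels ("meets", "does not meet") are hypothetical placeholders.

```python
from typing import Dict, List

# Criterion names are taken from the framework; the determination labels
# used in the example data are hypothetical placeholders.
CRITERIA = (
    "Causative Accuracy",
    "Temporal Coherence",
    "Human Response Relevance",
)


def disagreeing_criteria(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """List the criteria on which two evaluators' determinations differ."""
    missing = [c for c in CRITERIA if c not in a or c not in b]
    if missing:
        raise ValueError(f"determination missing for: {missing}")
    return [c for c in CRITERIA if a[c] != b[c]]


evaluator_1 = {
    "Causative Accuracy": "meets",
    "Temporal Coherence": "meets",
    "Human Response Relevance": "does not meet",
}
evaluator_2 = {
    "Causative Accuracy": "meets",
    "Temporal Coherence": "does not meet",
    "Human Response Relevance": "does not meet",
}
print(disagreeing_criteria(evaluator_1, evaluator_2))  # ['Temporal Coherence']
```

An empty result means the two evaluations are repeatable at the criterion level for that system; a non-empty result points to the criterion whose wording or evidence rules may need attention.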
Independent evaluation tests three things: whether the framework criteria are clear enough to apply without author guidance, whether the evidence hierarchy resolves ambiguous cases consistently, and whether two evaluators who follow the methodology agree on what the evidence shows.
Independent evaluation does not certify the system being evaluated. It does not endorse the evaluator. It does not produce a score or ranking. It produces a classification determination, an evidence tier, and a record, nothing more.
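The limited scope of an evaluation's output can be made concrete as a data structure that carries a classification determination and an evidence tier and deliberately has no score or ranking field. This is only a sketch: the class name, field names, and identifier formats are hypothetical, and the required fields of a real record are whatever the framework's record format specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationRecord:
    """Hypothetical minimal record; not the framework's official template."""
    system_id: str        # identifier of the evaluated system (assumed format)
    evaluator_id: str     # identifier of the independent evaluator (assumed)
    classification: str   # the classification determination
    evidence_tier: int    # 1..4, matching the four-tier evidence hierarchy
    # Deliberately no score, rank, certification, or endorsement field.

    def __post_init__(self) -> None:
        if not self.classification:
            raise ValueError("a classification determination is required")
        if not 1 <= self.evidence_tier <= 4:
            raise ValueError(f"evidence tier must be 1-4, got {self.evidence_tier}")


record = ClassificationRecord(
    system_id="sys-001",
    evaluator_id="eval-A",
    classification="example-class",
    evidence_tier=2,
)
print(record.classification, record.evidence_tier)  # example-class 2
```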
Standards development is not a design process — it is an evidence accumulation process. The normative documents define what the standard requires. The validation process tests whether the standard's requirements are measurable, consistent, and reproducible. Evidence accumulated through validation is the feedback mechanism that closes the loop between normative intent and practical reality.
Specifically, a growing body of evaluation records does three things for the SFR framework:
The evaluation record infrastructure exists to accumulate evidence. Evidence is what earns a standard its standing.
The SFR Validation & Evidence layer consists of five documents, each addressing a distinct aspect of the evidence accumulation process.
Standalone expansion of the four-tier evidence hierarchy. Why higher-tier evidence prevails and how to apply the hierarchy in practice.
How organizations may volunteer simulation systems for independent classification under the framework. No certification, no endorsement.
The standards-report format used for each completed evaluation. Defines all required fields for a classification record.
The methodology for tracking agreement across completed evaluations. Structure for logging disagreements and unresolved determinations.
Public registry of completed evaluations, their classification outcomes, evidence tier, and current status. Currently awaiting first entries.
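Tracking agreement across completed evaluations can be sketched as a small aggregation over evaluation outcomes: group them by system, compare classifications wherever at least two evaluators assessed the same system, and log the cases that diverge. The tuple layout, function name, and identifiers below are assumptions for illustration and the data is invented; none of it is drawn from an actual registry.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (system_id, evaluator_id, classification); a hypothetical layout
Outcome = Tuple[str, str, str]


def agreement_summary(outcomes: Iterable[Outcome]) -> Dict[str, object]:
    """Summarize classification agreement per system across evaluators."""
    by_system: Dict[str, Dict[str, str]] = defaultdict(dict)
    for system_id, evaluator_id, classification in outcomes:
        by_system[system_id][evaluator_id] = classification

    compared, agreed, disagreements = 0, 0, []
    for system_id, by_evaluator in sorted(by_system.items()):
        if len(by_evaluator) < 2:
            continue  # a single evaluation cannot demonstrate repeatability
        compared += 1
        if len(set(by_evaluator.values())) == 1:
            agreed += 1
        else:
            disagreements.append((system_id, dict(by_evaluator)))
    return {"compared": compared, "agreed": agreed, "disagreements": disagreements}


# Invented example data, for illustration only.
outcomes = [
    ("sys-001", "eval-A", "class-X"),
    ("sys-001", "eval-B", "class-X"),
    ("sys-002", "eval-A", "class-X"),
    ("sys-002", "eval-B", "class-Y"),
    ("sys-003", "eval-A", "class-Y"),  # only one evaluator so far
]
summary = agreement_summary(outcomes)
print(summary["agreed"], "of", summary["compared"])  # 1 of 2
```

Keeping the disagreement entries alongside the counts preserves the detail needed to trace each divergence back to a specific system and pair of evaluators.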
The Adoption Roadmap defines Stage 1 as Community Review and Stage 2 as Independent Evaluation. Advancing from Stage 1 to Stage 2 requires specific conditions to be met. These conditions are listed below with their current status.
The framework currently satisfies none of the Stage 2 advancement conditions in full. The infrastructure required to support Stage 2 activity is now in place. Advancement depends on community engagement — specifically, organizations from the implementation pathway types reviewing the normative corpus, providing feedback, and in at least one case, volunteering a system for independent evaluation through the Pilot Validation Program.
Stage 2 cannot begin through declaration. It begins when evidence accumulation begins.
A framework that asks organizations to reference it in procurement documents, research publications, and policy instruments is asking for a form of trust. That trust should not be granted on the basis of the framework's internal consistency alone — it should be earned through a demonstrated record of independent, repeatable application. The validation infrastructure defined here is the mechanism by which that record is built. It is not complete. It does not yet contain results. But its existence is the first structural step toward a framework that has earned the standing it is proposing to occupy.