Criterion assessment, evidence hierarchy, repeatability, and evaluation output format.
This document defines how SFR evaluations are conducted: how each criterion is assessed, how borderline and incomplete cases are handled, how evidence is weighted, what repeatability requires, and what the standardized evaluation output must contain. No numerical scores are assigned. Classification is structural: Pass, Fail, or Insufficient Data per criterion, producing a final system classification.
Each of the three fundamental criteria is assessed independently against the required inputs collected during the reference test. For each criterion, the evaluator assigns one of three outcomes: Pass, Fail, or Insufficient Data. These outcomes are determined by the presence, absence, or ambiguity of required input data — not by subjective impression.
| Criterion | Outcome | Condition |
|---|---|---|
| A Causative Accuracy | PASS | Motion telemetry (R1) and physics telemetry (R2) confirm that motion output during reference events is derived from live physics state. Actuator telemetry (R3) shows no post-processed or scripted motion profiles. The motion event corresponds directly to the physics event that caused it. |
| A Causative Accuracy | FAIL | Motion telemetry reveals scripted, canned, or post-processed motion that does not correspond to the live physics state. Or: motion is present but actuator telemetry confirms it is generated from a source other than the physics model output. Or: motion at the cockpit reference point does not originate at the vehicle's center of mass. |
| A Causative Accuracy | INSUFFICIENT DATA | Physics telemetry (R2) or actuator telemetry (R3) is unavailable, below quality threshold, or contains gaps that prevent comparison with motion telemetry. Causative source cannot be confirmed or denied from available evidence. |
| B Temporal Coherence | PASS | Synchronization measurements (R4) confirm that the motion cue arrives within the correct temporal relationship to the physics event across all reference events. Combined Axis event (Event 3) data confirms that independent axes do not corrupt each other's timing under simultaneous demand. |
| B Temporal Coherence | FAIL | Synchronization data reveals that the motion cue arrives after a delay that exceeds the valid temporal relationship for the channel. Or: visual cue precedes the motion cue by a measurable interval under Event 2 conditions. Or: simultaneous axis demand in Event 3 produces timing degradation in one or both axes. |
| B Temporal Coherence | INSUFFICIENT DATA | Synchronization measurements (R4) are missing, use different time bases without verified offset correction, or contain gaps during reference events. Temporal relationship between channels cannot be established from available evidence. |
| C Human Response Relevance | PASS | Control system measurements (R5) confirm that the participant's control corrections occur at Event 4 (Limit-State Threshold) in a pattern consistent with physical sensation response rather than visual anticipation. The correction onset timing and pattern are consistent with a vestibularly driven response. |
| C Human Response Relevance | FAIL | Control system measurements reveal that the participant's control corrections at the limit state are delayed relative to the physics event in a pattern consistent with visual response rather than physical sensation response. Or: participant completes the limit-state event without any measurable correction, indicating the physical cue is insufficient to trigger a response. |
| C Human Response Relevance | INSUFFICIENT DATA | Control system measurements (R5) are unavailable or do not capture the relevant response window. The response pattern cannot be classified as physical or visual from available evidence alone. |
A system classification requires a determination on each criterion. Insufficient Data on any criterion prevents a final In-the-Loop classification, regardless of Pass results on other criteria.
Borderline cases arise when a system's architecture or available data does not produce a clear Pass or Fail determination on one or more criteria. The following guidance applies to four common borderline scenarios.
A hybrid architecture is a system that meets some but not all structural requirements of the In-the-Loop Standard, for example physics-driven motion (Req. 1 met) but rotation not resolved at the center of mass (Req. 2 not met). In this case, Criterion A is evaluated against what the data shows, not what the system's documentation claims. A hybrid architecture does not receive partial credit. Criteria are assessed on evidence, not on architectural intent.
When one or more required inputs are unavailable for a specific criterion but available for others, the evaluation proceeds on the criteria for which required data exists. The criterion with missing data is assigned Insufficient Data. The evaluator must not infer a Pass from evidence that does not directly address the criterion in question.
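The missing-input rule can be sketched as a lookup from criteria to the inputs they depend on. This is a minimal illustration, not part of the standard: the mapping below is read off the criterion table above (A uses R1 to R3, B uses R4, C uses R5), and the names `REQUIRED_INPUTS` and `assessable_criteria` are hypothetical.

```python
# Assumed mapping of criteria to required inputs, read from the criterion
# table. The Evaluation Inputs document remains the authority on this.
REQUIRED_INPUTS = {
    "A": {"R1", "R2", "R3"},
    "B": {"R4"},
    "C": {"R5"},
}


def assessable_criteria(available_inputs):
    """Split criteria into those that can be assessed and those that cannot.

    Returns (assessable, insufficient), where `insufficient` maps each
    criterion that must be assigned Insufficient Data to its missing inputs.
    """
    available = set(available_inputs)
    assessable = []
    insufficient = {}
    for criterion, required in REQUIRED_INPUTS.items():
        missing = required - available
        if missing:
            # No Pass is inferred from evidence that addresses other criteria.
            insufficient[criterion] = sorted(missing)
        else:
            assessable.append(criterion)
    return assessable, insufficient
```

With actuator telemetry (R3) absent, Criteria B and C remain assessable while Criterion A is marked Insufficient Data. The sketch only checks availability. It does not check quality thresholds or gaps.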
When actuator or motion telemetry is available for some axes but not others during a reference event, the evaluator may complete the assessment for the axes with available data. For Criterion B assessment, if the key axis for the reference event (e.g., yaw for Event 2) has missing telemetry, that event is invalid for Criterion B purposes and must be repeated or noted as Insufficient Data for that criterion.
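As a rough sketch of the partial-telemetry rule, the check below separates the axes that can still be assessed from the question of whether the event remains valid for Criterion B. The function name and the axis labels are illustrative assumptions. The source names only yaw as the key axis for Event 2.

```python
def review_partial_telemetry(key_axis, demanded_axes, axes_with_telemetry):
    """Return (assessable_axes, valid_for_criterion_b) for one reference event.

    key_axis: the axis the reference event depends on (e.g. "yaw" for Event 2).
    demanded_axes: axes exercised by the event, in reporting order.
    axes_with_telemetry: axes for which telemetry was actually captured.
    """
    available = set(axes_with_telemetry)
    # Assessment may proceed on any axis that has data.
    assessable = [axis for axis in demanded_axes if axis in available]
    # The event counts for Criterion B only if its key axis was captured.
    return assessable, key_axis in available
```

If the second value is `False`, the event is repeated or Criterion B is recorded as Insufficient Data, as described above.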
A system may Pass Criteria A and B but Fail Criterion C, or any other combination. The overall classification reflects the weakest criterion result. A system that Passes two criteria but Fails one cannot be classified as In-the-Loop. The final classification is determined by the classification logic defined in Section 5.
When evaluating a system, evidence is weighted according to its source. Higher-tier evidence takes precedence over lower-tier evidence. Lower-tier evidence cannot override higher-tier evidence, even where the two conflict. Discrepancies between tiers are recorded in the evidence summary.
Tier 1: Calibrated, timestamped data captured by instrumentation during reference events. Includes motion telemetry (R1), physics telemetry (R2), actuator telemetry (R3), synchronization measurements (R4), and control system measurements (R5). Tier 1 is the primary basis for all criterion assessments. It cannot be overridden by any other tier.
Tier 2: Manufacturer or developer-supplied specifications, block diagrams, physics model documentation, and actuator specifications. Used when Tier 1 data is unavailable for a specific parameter. May support an assessment but cannot produce a Pass determination on its own if Tier 1 data is absent. Establishes the basis against which Tier 1 measurements are interpreted.
Tier 3: Structured observation of system behavior during evaluation, without instrumentation. Includes evaluator notes on motion timing, visual synchronization, and driver response patterns during reference events. May support Insufficient Data determinations or provide context for Tier 1 and Tier 2 evidence. Cannot produce a Pass determination. May support a Fail determination when behavior clearly contradicts a Pass condition.
Tier 4: Marketing materials, product descriptions, claimed specifications, testimonials, and unverified assertions about system performance. Tier 4 evidence carries no weight in classification determinations. It may be cited in the evidence summary for context, but it cannot support, modify, or override any criterion assessment at any other tier. Absence of a Tier 4 claim is not evidence of absence of the condition.
Tier 4 evidence cannot override Tier 1. A manufacturer's claim that a system is in-the-loop is not evidence that it is. Tier 1 measurement determines the outcome.
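The precedence rules can be illustrated with a small resolver for a single criterion. This is a sketch under stated assumptions rather than the standard's own procedure: `resolve_criterion` is a hypothetical name, each tier is reduced to a single "Pass" or "Fail" indication, and treating a Tier 2 Fail indication as sufficient for a Fail outcome is an assumption the source does not confirm.

```python
def resolve_criterion(indications):
    """Resolve one criterion from per-tier indications.

    indications maps a tier number (1 to 4) to "Pass" or "Fail", meaning
    what that tier's evidence suggests on its own. Tiers with no evidence
    are simply omitted. Returns (outcome, discrepancies).
    """
    # Tier 4 carries no weight in the determination.
    weighted = {tier: view for tier, view in indications.items() if tier != 4}
    if not weighted:
        return "Insufficient Data", []

    governing_tier = min(weighted)  # the highest tier present takes precedence
    view = weighted[governing_tier]

    if view == "Pass" and governing_tier != 1:
        # Only Tier 1 measurement can produce a Pass determination.
        outcome = "Insufficient Data"
    else:
        # Assumption: a Fail indication from Tier 2 is treated like the
        # Tier 3 case, which the text allows to support a Fail.
        outcome = view

    # Conflicts with the governing tier go to the evidence summary,
    # including Tier 4 claims, which are recorded for context only.
    discrepancies = [
        (governing_tier, tier)
        for tier, other in sorted(indications.items())
        if tier != governing_tier and other != view
    ]
    return outcome, discrepancies
```

For example, a Tier 1 Fail alongside a Tier 4 claim of Pass resolves to Fail, with the conflict recorded as the pair `(1, 4)`.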
Repeatability is not a quality goal for SFR evaluations. It is a structural requirement. An evaluation that cannot be reproduced under the same conditions is not a valid evaluation — it is a single observation. For a classification to be credible, it must satisfy the following three repeatability conditions simultaneously.
The repeatability requirement is what distinguishes a classification standard from an opinion. It requires that the methodology, not the evaluator, determines the outcome.
A classification that changes depending on who performs the evaluation is not a classification. It is a judgment call. The repeatability requirement eliminates judgment calls from the classification process.
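One way to check for evaluator dependence is to compare per-criterion outcomes across independent evaluations of the same system. The sketch below covers only that agreement check, not the three repeatability conditions themselves. The function names are hypothetical.

```python
def disagreeing_criteria(evaluations):
    """Return the criteria on which independent evaluations differ.

    evaluations: list of dicts, one per evaluation, mapping a criterion
    label ("A", "B", "C") to its outcome string.
    """
    criteria = set().union(*evaluations)  # every criterion seen in any evaluation
    return sorted(
        c for c in criteria
        if len({evaluation.get(c) for evaluation in evaluations}) > 1
    )


def outcomes_agree(evaluations):
    """True only if at least two evaluations exist and all outcomes match."""
    if len(evaluations) < 2:
        return False  # a single evaluation is a single observation
    return not disagreeing_criteria(evaluations)
```

A non-empty result from `disagreeing_criteria` points to the criteria where the methodology, as applied, still leaves room for judgment.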
Every SFR evaluation produces a standardized output record in the following format. No numerical scores are assigned. The output contains three components: a Classification Result, Supporting Findings per criterion, and an Evidence Summary.
One result is assigned. In-the-Loop requires Pass on all three criteria. Surface-Level applies when one or more criteria Fail and the system has physics-derived motion present. Out-of-the-Loop applies when no physics-derived motion is delivered to the participant.
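The assignment rule can be sketched as a short function. It is illustrative only, and Section 5 remains the authority on classification logic. The `physics_motion_present` flag is a hypothetical input. Returning `None` when a criterion has Insufficient Data and none has failed is an assumption, since the text says only that such a result prevents an In-the-Loop classification.

```python
from typing import Mapping, Optional


def classify(outcomes: Mapping[str, str], physics_motion_present: bool) -> Optional[str]:
    """Assign one Classification Result from three per-criterion outcomes.

    outcomes maps "A", "B", "C" to "Pass", "Fail", or "Insufficient Data".
    """
    if not physics_motion_present:
        # No physics-derived motion is delivered to the participant.
        return "Out-of-the-Loop"

    values = list(outcomes.values())
    if len(values) == 3 and all(v == "Pass" for v in values):
        return "In-the-Loop"
    if "Fail" in values:
        # One or more criteria Fail while physics-derived motion is present.
        return "Surface-Level"
    # Assumption: Insufficient Data without any Fail leaves the result open.
    return None
```

A call such as `classify({"A": "Pass", "B": "Pass", "C": "Fail"}, True)` returns `"Surface-Level"`, reflecting the weakest criterion result.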
The output record contains no numerical scores. A Classification Result of In-the-Loop means all three criteria passed. Any other result specifies which criteria failed or produced Insufficient Data so that the nature and location of the limitation is clear.
The output format is standardized so that results from different evaluators and different sessions can be compared on a common basis.
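A record with the three components named above might be represented as follows. The field names and the `limitations` helper are hypothetical, chosen for illustration. The standard specifies the components, not this layout, and a real record would likely carry fuller findings text per criterion.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvaluationRecord:
    """Standardized output: result, per-criterion findings, evidence summary."""

    classification_result: str             # e.g. "In-the-Loop"
    supporting_findings: Dict[str, str]    # criterion label -> outcome
    evidence_summary: List[str] = field(default_factory=list)

    def limitations(self) -> Dict[str, str]:
        """Criteria that failed or produced Insufficient Data."""
        return {
            criterion: outcome
            for criterion, outcome in self.supporting_findings.items()
            if outcome != "Pass"
        }
```

Because every record has the same shape and holds no numerical scores, two records from different evaluators or sessions can be compared field by field.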
The SFR evaluation process at v0.9 avoids numerical scoring. This is a deliberate sequencing decision, not a limitation. Before numbers are assigned to a classification system, the classification logic must be repeatable without them. A framework that cannot produce consistent Pass/Fail determinations will not produce consistent scores. The goal of this sprint is to establish repeatable classification before introducing numerical scoring in a future version.
When the three reference methodology documents (Reference Test Methodology, Evaluation Inputs, and this document) produce consistent, repeatable classifications across independent evaluators, the framework is ready to define the numerical scoring layer that sits above it. Until then, structural classification is the foundation.