SFR Evaluation Process

Criterion assessment, evidence hierarchy, repeatability, and evaluation output format.

This document defines how SFR evaluations are conducted: how each criterion is assessed, how borderline and incomplete cases are handled, how evidence is weighted, what repeatability requires, and what the standardized evaluation output must contain. No numerical scores are assigned. Classification is structural: Pass, Fail, or Insufficient Data per criterion, producing a final system classification.

Criterion Assessment Structure


Each of the three fundamental criteria is assessed independently against the required inputs collected during the reference test. For each criterion, the evaluator assigns one of three outcomes: Pass, Fail, or Insufficient Data. These outcomes are determined by the presence, absence, or ambiguity of required input data — not by subjective impression.

Criterion A: Causative Accuracy

  • PASS: Motion telemetry (R1) and physics telemetry (R2) confirm that motion output during reference events is derived from live physics state. Actuator telemetry (R3) shows no post-processed or scripted motion profiles. The motion event corresponds directly to the physics event that caused it.
  • FAIL (any one of the following): (1) motion telemetry reveals scripted, canned, or post-processed motion that does not correspond to the live physics state; (2) motion is present, but actuator telemetry confirms it is generated from a source other than the physics model output; (3) motion at the cockpit reference point does not originate at the vehicle's center of mass.
  • INSUFFICIENT DATA: Physics telemetry (R2) or actuator telemetry (R3) is unavailable, below quality threshold, or contains gaps that prevent comparison with motion telemetry. The causative source cannot be confirmed or denied from available evidence.

Criterion B: Temporal Coherence

  • PASS: Synchronization measurements (R4) confirm that the motion cue arrives within the correct temporal relationship to the physics event across all reference events. Combined Axis event (Event 3) data confirms that independent axes do not corrupt each other's timing under simultaneous demand.
  • FAIL (any one of the following): (1) synchronization data reveals that the motion cue arrives after a delay that exceeds the valid temporal relationship for the channel; (2) the visual cue precedes the motion cue by a measurable interval under Event 2 conditions; (3) simultaneous axis demand in Event 3 produces timing degradation in one or both axes.
  • INSUFFICIENT DATA: Synchronization measurements (R4) are missing, use different time bases without verified offset correction, or contain gaps during reference events. The temporal relationship between channels cannot be established from available evidence.

Criterion C: Human Response Relevance

  • PASS: Control system measurements (R5) confirm that the participant's control corrections occur at Event 4 (Limit-State Threshold) in a pattern consistent with physical sensation response rather than visual anticipation. The correction onset timing and pattern are consistent with a vestibularly driven response.
  • FAIL (either of the following): (1) control system measurements reveal that the participant's control corrections at the limit state are delayed relative to the physics event in a pattern consistent with visual response rather than physical sensation response; (2) the participant completes the limit-state event without any measurable correction, indicating the physical cue is insufficient to trigger a response.
  • INSUFFICIENT DATA: Control system measurements (R5) are unavailable or do not capture the relevant response window. The response pattern cannot be classified as physical or visual from available evidence alone.

A system classification requires a determination on each criterion. Insufficient Data on any criterion prevents a final In-the-Loop classification, regardless of Pass results on other criteria.
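The data-availability side of these outcomes can be sketched in a few lines of Python. This is only an illustration: the mapping of criteria to inputs is read off the outcome conditions above (the authoritative list belongs to the Evaluation Inputs document), and the names `Outcome`, `REQUIRED_INPUTS`, `precheck`, and `blocks_in_the_loop` are hypothetical. Deciding Pass versus Fail still requires the evaluator to examine the evidence itself.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    INSUFFICIENT_DATA = "Insufficient Data"


# Assumption: inputs each criterion draws on, as cited in the outcome conditions.
REQUIRED_INPUTS = {
    "A": {"R1", "R2", "R3"},  # motion, physics, and actuator telemetry
    "B": {"R4"},              # synchronization measurements
    "C": {"R5"},              # control system measurements
}


def precheck(criterion: str, usable_inputs: set[str]) -> Optional[Outcome]:
    """Return INSUFFICIENT_DATA if a required input is missing, else None.

    None means the data is complete enough for the evaluator to go on and
    determine Pass or Fail from the evidence.
    """
    missing = REQUIRED_INPUTS[criterion] - usable_inputs
    return Outcome.INSUFFICIENT_DATA if missing else None


def blocks_in_the_loop(outcomes: dict[str, Outcome]) -> bool:
    """True if any criterion outcome rules out an In-the-Loop classification."""
    return any(outcome is not Outcome.PASS for outcome in outcomes.values())
```

An input that is present but below quality threshold or gapped would simply be left out of `usable_inputs`, so it is treated the same way as a missing input.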

Borderline System Handling


Borderline cases arise when a system's architecture or available data does not produce a clear Pass or Fail determination on one or more criteria. The following guidance applies to four common borderline scenarios.

Hybrid Architectures

A hybrid architecture is a system that meets some but not all structural requirements of the In-the-Loop Standard, for example physics-driven motion (Req. 1 met) with rotation not resolved at the center of mass (Req. 2 not met). In this case, Criterion A is evaluated against what the data shows, not what the system's documentation claims. A hybrid architecture does not receive partial credit. Criteria are assessed on evidence, not on architectural intent.

Rule: Evaluate on evidence. Partial structural compliance does not produce a partial Pass.

Incomplete Required Data

When one or more required inputs are unavailable for a specific criterion but available for others, the evaluation proceeds on the criteria for which required data exists. The criterion with missing data is assigned Insufficient Data. The evaluator must not infer a Pass from evidence that does not directly address the criterion in question.

Rule: Insufficient Data is the correct outcome for missing required inputs. It is not a soft Fail — it is an open determination pending evidence.

Missing Telemetry for One Axis

When actuator or motion telemetry is available for some axes but not others during a reference event, the evaluator may complete the assessment for the axes with available data. For Criterion B assessment, if the key axis for the reference event (e.g., yaw for Event 2) has missing telemetry, that event is invalid for Criterion B purposes and must be repeated or noted as Insufficient Data for that criterion.

Rule: Partial axis telemetry supports a partial assessment. The criterion defaults to Insufficient Data only if the missing axis is required for the specific criterion being assessed.
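The axis rule can be illustrated with a small helper. It is a sketch under stated assumptions: the function name, the returned field names, and every axis label other than yaw (the key axis the text gives for Event 2) are hypothetical.

```python
def assess_axis_coverage(required_axes: set[str], available_axes: set[str]) -> dict:
    """Report which axes can be assessed and whether the criterion can proceed.

    required_axes: axes the criterion needs for this reference event.
    available_axes: axes with usable actuator or motion telemetry.
    """
    missing_required = required_axes - available_axes
    return {
        # Axes with data can still be assessed (partial assessment).
        "assessable_axes": sorted(available_axes),
        # Only a missing *required* axis forces Insufficient Data.
        "missing_required_axes": sorted(missing_required),
        "insufficient_data": bool(missing_required),
    }
```

For example, `assess_axis_coverage({"yaw"}, {"pitch", "roll"})` flags Insufficient Data because the key axis is absent, while a missing roll channel would not affect an assessment whose only required axis is yaw.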

Partially Compliant Systems

A system may Pass Criteria A and B but Fail Criterion C, or produce any other combination of outcomes. The overall classification reflects the weakest criterion result: a system that Passes two criteria but Fails one cannot be classified as In-the-Loop. The final classification is determined by the classification logic defined in Section 5.

Rule: In-the-Loop requires Pass on all three criteria. Any Fail on any criterion results in Surface-Level or Out-of-the-Loop classification depending on the structural nature of the failure.
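As a rough sketch, the rule can be written as a single function. It uses the conditions this document gives for each Classification Result in the output format: Surface-Level when at least one criterion Fails and physics-derived motion is present, Out-of-the-Loop when no physics-derived motion reaches the participant. The function name and the `physics_derived_motion` flag are hypothetical, and returning `None` when only Insufficient Data stands in the way is an assumption made here, since the text states only that such a case cannot be classified In-the-Loop.

```python
from typing import Optional


def classify(outcomes: dict[str, str], physics_derived_motion: bool) -> Optional[str]:
    """Map per-criterion outcomes to a system classification.

    outcomes: criterion letter -> "Pass", "Fail", or "Insufficient Data".
    physics_derived_motion: whether the evidence shows physics-derived
        motion being delivered to the participant.
    """
    if set(outcomes) != {"A", "B", "C"}:
        raise ValueError("a determination is required on each of criteria A, B, and C")
    if not physics_derived_motion:
        return "Out-of-the-Loop"
    values = list(outcomes.values())
    if all(v == "Pass" for v in values):
        return "In-the-Loop"
    if any(v == "Fail" for v in values):
        return "Surface-Level"
    # Assumption: no Fail, but at least one Insufficient Data. The
    # determination stays open until the missing evidence is supplied.
    return None
```

There is no branch that counts Passes, which mirrors the rule that partial compliance earns no partial credit.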

Evidence Hierarchy


When evaluating a system, evidence is weighted according to its source. Higher-tier evidence takes precedence over lower-tier evidence. Lower-tier evidence cannot override higher-tier evidence, even if it contradicts it. Discrepancies between tiers are recorded in the evidence summary.

Tier 4 evidence cannot override Tier 1. A manufacturer's claim that a system is in-the-loop is not evidence that it is. Tier 1 measurement determines the outcome.
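The precedence rule can be sketched as follows. Only the ordering is used (a lower tier number outranks a higher one, as implied by Tier 4 being unable to override Tier 1); what each tier contains is not modeled. The function name and the one-finding-per-tier input shape are assumptions made for illustration.

```python
def resolve_by_tier(findings: dict[int, str]) -> tuple[str, list[str]]:
    """Pick the outcome from the highest-ranked tier and list discrepancies.

    findings: tier number -> outcome indicated by that tier's evidence.
    Returns (determining outcome, discrepancy notes for the evidence summary).
    """
    if not findings:
        raise ValueError("no evidence supplied")
    top_tier = min(findings)  # Tier 1 outranks Tier 4
    outcome = findings[top_tier]
    discrepancies = [
        f"Tier {tier} indicates {claim}; overridden by Tier {top_tier} ({outcome})"
        for tier, claim in sorted(findings.items())
        if tier != top_tier and claim != outcome
    ]
    return outcome, discrepancies
```

With `{1: "Fail", 4: "Pass"}` the result is Fail, and the contradicting Tier 4 claim is kept as a note rather than discarded, so it can be recorded in the evidence summary.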

Repeatability Requirement


Repeatability is not a quality goal for SFR evaluations. It is a structural requirement. An evaluation that cannot be reproduced under the same conditions is not a valid evaluation; it is a single observation. For a classification to be credible, it must satisfy the following three repeatability conditions simultaneously. (The labels R1 to R3 below denote repeatability conditions and are distinct from the required inputs R1 to R5 cited in the criterion assessment.)

  • R1: Same Methodology. The evaluation must be conducted according to the Reference Test Methodology (reference vehicle, reference events, reference conditions, measurement procedures). Any deviation from the reference methodology invalidates the repeatability requirement for that evaluation session.
  • R2: Same Test Conditions. The system under evaluation must be in the same operational configuration across all sessions. Software version, hardware configuration, calibration state, and environmental conditions must be documented and held constant. A classification conducted on a non-standard configuration does not apply to the standard configuration.
  • R3: Same Classification Outcome. When the same system is evaluated by a different evaluator under the same methodology and test conditions, the classification must be the same. If two evaluators reach different classifications from the same evidence, the evidence is insufficient for a definitive determination and the system is classified as Insufficient Data pending resolution of the discrepancy.

The repeatability requirement is what distinguishes a classification standard from an opinion. It requires that the methodology, not the evaluator, determines the outcome.

A classification that changes depending on who performs the evaluation is not a classification. It is a judgment call. The repeatability requirement eliminates judgment calls from the classification process.
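The three repeatability conditions can be checked mechanically once each session is documented. The sketch below is illustrative only: the `Session` fields and function names are hypothetical, and reducing "followed the Reference Test Methodology" to a single boolean is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    evaluator: str
    followed_reference_methodology: bool  # repeatability condition R1
    configuration: tuple[str, ...]        # R2: software, hardware, calibration, environment
    classification: str                   # R3 compares this across sessions


def check_repeatability(sessions: list[Session]) -> dict[str, bool]:
    """Evaluate each repeatability condition across two or more sessions."""
    if len(sessions) < 2:
        raise ValueError("a single session is an observation, not a repeatable evaluation")
    return {
        "R1_same_methodology": all(s.followed_reference_methodology for s in sessions),
        "R2_same_conditions": len({s.configuration for s in sessions}) == 1,
        "R3_same_outcome": len({s.classification for s in sessions}) == 1,
    }


def is_repeatable(sessions: list[Session]) -> bool:
    """All three conditions must hold simultaneously."""
    return all(check_repeatability(sessions).values())
```

When `R3_same_outcome` is false, the text above calls for Insufficient Data pending resolution rather than a choice between the two evaluators' results.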

Evaluation Output Format


Every SFR evaluation produces a standardized output record in the following format. No numerical scores are assigned. The output contains three components: a Classification Result, Supporting Findings per criterion, and an Evidence Summary.

SFR Evaluation Output Record: Standard Format

Classification Result

One of: In-the-Loop, Surface-Level, Out-of-the-Loop.

One result is assigned. In-the-Loop requires Pass on all three criteria. Surface-Level applies when one or more criteria Fail and the system has physics-derived motion present. Out-of-the-Loop applies when no physics-derived motion is delivered to the participant.

Supporting Findings, Per Criterion

  • Criterion A, Causative Accuracy: PASS, FAIL, or INSUFFICIENT DATA, plus a one-sentence finding statement
  • Criterion B, Temporal Coherence: PASS, FAIL, or INSUFFICIENT DATA, plus a one-sentence finding statement
  • Criterion C, Human Response Relevance: PASS, FAIL, or INSUFFICIENT DATA, plus a one-sentence finding statement

Evidence Summary

  • Tier level used for each criterion assessment
  • Data quality notes
  • Required inputs present/absent
  • Optional inputs used (if any)
  • Discrepancies between evidence tiers (if any)
  • Date of evaluation, reference event set used, and system configuration at time of evaluation

The output record contains no numerical scores. A Classification Result of In-the-Loop means all three criteria passed. Any other result specifies which criteria failed or produced Insufficient Data so that the nature and location of the limitation is clear.

The output format is standardized so that results from different evaluators and different sessions can be compared on a common basis.
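One possible way to hold this record in code is a pair of dataclasses with a plain-text renderer. This is a minimal sketch, not a prescribed schema: the class names, field names, and line layout are assumptions, and the fields simply follow the three components listed in the standard format.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionFinding:
    name: str            # e.g. "Causative Accuracy"
    outcome: str         # "Pass", "Fail", or "Insufficient Data"
    statement: str       # one-sentence finding statement
    evidence_tier: int   # tier level used for this assessment


@dataclass
class EvaluationRecord:
    classification: str
    findings: dict[str, CriterionFinding]  # keyed by criterion letter
    evaluation_date: str
    reference_event_set: str
    system_configuration: str
    inputs_present: list[str] = field(default_factory=list)
    inputs_absent: list[str] = field(default_factory=list)
    optional_inputs_used: list[str] = field(default_factory=list)
    data_quality_notes: list[str] = field(default_factory=list)
    tier_discrepancies: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text; no numerical score appears anywhere."""
        lines = [f"Classification Result: {self.classification}"]
        for key in sorted(self.findings):
            f = self.findings[key]
            lines.append(
                f"Criterion {key} {f.name}: {f.outcome.upper()} "
                f"(Tier {f.evidence_tier}). {f.statement}"
            )
        lines.append(f"Required inputs present: {', '.join(self.inputs_present) or 'none'}")
        lines.append(f"Required inputs absent: {', '.join(self.inputs_absent) or 'none'}")
        lines.append(f"Optional inputs used: {', '.join(self.optional_inputs_used) or 'none'}")
        lines.append(f"Data quality notes: {'; '.join(self.data_quality_notes) or 'none'}")
        lines.append(f"Tier discrepancies: {'; '.join(self.tier_discrepancies) or 'none'}")
        lines.append(
            f"Evaluated {self.evaluation_date}; events: {self.reference_event_set}; "
            f"configuration: {self.system_configuration}"
        )
        return "\n".join(lines)
```

Keeping the record as structured fields rather than free text is what lets results from different evaluators and sessions be compared field by field.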

Classification Without Numbers

The SFR evaluation process at v0.9 avoids numerical scoring. This is not a limitation but a deliberate sequencing decision: before numbers are assigned to a classification system, the classification logic must be repeatable without them. A framework that cannot produce consistent Pass/Fail determinations will not produce consistent scores. The goal of this sprint is to establish repeatable classification before introducing numerical scoring in a future version.

When the three reference methodology documents (Reference Test Methodology, Evaluation Inputs, and this document) produce consistent, repeatable classifications across independent evaluators, the framework is ready to define the numerical scoring layer that sits above it. Until then, structural classification is the foundation.

Repeatability before scoring. Structure before numbers. Classification before rating.