Tracking classification consistency across independent evaluators.
This page tracks inter-evaluator agreement across completed SFR evaluations. Agreement data is the primary empirical measure of the framework's repeatability. The structure below defines how agreement is calculated, which disagreements are logged, and the current status of the agreement record. No evaluations have been completed at this stage, so the structure awaits its first entries.
The following statistics are derived from completed evaluations in which two independent evaluators assessed the same system using the same evidence. Agreement is measured at the criterion level (A, B, C) and at the final classification level.
Statistics will be populated as evaluations are completed through the Pilot Validation Program. Current status: awaiting first evaluation pair.
Agreement is assessed by comparing the independent evaluation records completed by two evaluators for the same system. The following methodology applies.
Agreement is measured primarily at the criterion level, not at the final classification level. Two evaluators may reach the same final classification through different criterion-level determinations if those combinations produce the same classification outcome under the derivation rules. For this reason, criterion-level agreement is the primary measure. Final classification agreement is a secondary measure: it follows from criterion agreement, but it does not imply criterion agreement.
Two evaluators agree on a criterion if their determination entries are identical: both Pass, both Fail, or both Insufficient Data. A determination of Pass by one evaluator and Fail by the other is a disagreement. A determination of Pass by one evaluator and Insufficient Data by the other is also a disagreement. Insufficient Data is not a neutral position; it is a specific determination about the evidentiary state.
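The comparison rule above can be sketched in a few lines of Python. This is only an illustration: the names `Determination`, `CRITERIA`, and `criterion_agreement`, and the two example records, are hypothetical and do not come from any completed evaluation or from the framework's tooling.

```python
from enum import Enum


class Determination(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    INSUFFICIENT_DATA = "Insufficient Data"


# The three criteria named on this page.
CRITERIA = ("A", "B", "C")


def criterion_agreement(record_a, record_b):
    """Map each criterion to True only when both determinations are identical."""
    return {c: record_a[c] is record_b[c] for c in CRITERIA}


# Hypothetical records for illustration only; no evaluations have been completed.
evaluator_a = {
    "A": Determination.PASS,
    "B": Determination.FAIL,
    "C": Determination.PASS,
}
evaluator_b = {
    "A": Determination.PASS,
    "B": Determination.FAIL,
    "C": Determination.INSUFFICIENT_DATA,
}

agreement = criterion_agreement(evaluator_a, evaluator_b)
# Criteria on which the two records differ, in criterion order.
disputed = [c for c, same in agreement.items() if not same]
```

In this sketch, Pass versus Insufficient Data on criterion C is counted as a disagreement, in the same way as Pass versus Fail would be.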
Separately from determination agreement, evaluators' evidence tier assignments for each criterion are also compared. Two evaluators who reach the same determination but record different evidence tiers have made different judgments about the quality of the evidence. This is a soft disagreement — it does not affect the classification result but may indicate ambiguity in the evidence quality assessment criteria.
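The distinction between a disagreement and a soft disagreement might be expressed as follows. This is a minimal sketch under stated assumptions: the `Entry` type, the `compare_entries` function, and the use of integers as evidence tier labels are all hypothetical, since this page does not define how tiers are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    determination: str  # "Pass", "Fail", or "Insufficient Data"
    evidence_tier: int  # hypothetical numeric tier label (assumption)


def compare_entries(a, b):
    """Classify one criterion-level comparison between two evaluators."""
    if a.determination != b.determination:
        # Different determinations: a disagreement that must be logged.
        return "disagreement"
    if a.evidence_tier != b.evidence_tier:
        # Same determination, different judgment of evidence quality.
        return "soft disagreement"
    return "agreement"


same_result_different_tier = compare_entries(Entry("Pass", 1), Entry("Pass", 2))
```

The determination is checked first, so a difference in evidence tier is only reported as a soft disagreement when the determinations already match.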
Where evaluators disagree on a criterion determination, the disagreement is logged (see Section 3), the criterion determination is marked Disputed in the Results Registry, and the framework authors conduct a review to identify the methodological source of the disagreement. The review may result in clarification of the criterion, an update to the evaluation guidance, or a finding that the evidence was insufficient for either evaluator to make a definitive determination.
The framework does not specify a minimum numerical agreement rate for Stage 2 advancement. The requirement is that evaluations have been conducted and disagreements have been reviewed. A perfect agreement rate across zero evaluations is meaningless. Agreement data requires actual evaluations. The first evaluation pair — regardless of outcome — constitutes progress.
Disagreements are not failures. They are methodological feedback. A disagreement that is properly logged and reviewed produces more value for the framework than an agreement that is never recorded.
The following table logs all criterion-level disagreements from completed evaluations. For each disagreement, the table records the evaluation ID, the criterion on which disagreement occurred, the two conflicting determinations, the evidence tier involved, and the resolution status.
| Eval ID | Criterion | Evaluator A | Evaluator B | Evidence Tier | Status |
|---|---|---|---|---|---|
| No evaluations completed. This log awaits its first entry. | | | | | |
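
As an illustration of how a row for this log could be produced once evaluations exist, the following sketch formats one disagreement as a table row with the six columns above. The function name `log_row`, the initial status value `Unresolved`, and the example identifier and tier label are assumptions made for this example, not framework-defined values.

```python
def log_row(eval_id, criterion, det_a, det_b, evidence_tier, status="Unresolved"):
    """Format one criterion-level disagreement as a row of the log table."""
    if det_a == det_b:
        # Identical determinations are an agreement and do not belong in this log.
        raise ValueError("determinations are identical; nothing to log")
    cells = [eval_id, criterion, det_a, det_b, evidence_tier, status]
    return "| " + " | ".join(cells) + " |"


# Hypothetical values for illustration only; no evaluations have been completed.
row = log_row("EVAL-0001", "C", "Pass", "Insufficient Data", "Tier 2")
print(row)
```

The sketch rejects identical determinations because only disagreements are recorded in this table.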
An unresolved determination is a criterion-level disagreement that has been logged but not yet reviewed by the framework authors. Unresolved determinations are tracked separately because they represent open questions about the framework's clarity — questions that must be answered before the framework can claim full methodological consistency.
| Eval ID | Criterion | Disagreement Summary | Days Open | Assigned To |
|---|---|---|---|---|
| No unresolved determinations. No evaluations completed. | | | | |
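
The Days Open column can be derived from the date a disagreement was logged. The sketch below assumes that the count is in calendar days and starts on the logging date; this page does not state the convention, and the function name `days_open` and the example dates are hypothetical.

```python
from datetime import date


def days_open(logged_on, today):
    """Calendar days between the logging date and the reference date."""
    if today < logged_on:
        raise ValueError("reference date precedes the logging date")
    return (today - logged_on).days


# Hypothetical dates for illustration only.
open_for = days_open(date(2025, 1, 10), date(2025, 1, 24))
```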
When a disagreement is resolved, the resolution is added to the framework's evaluation guidance documents and the corresponding entry is marked Resolved in this log. Resolutions that require normative changes are processed through the revision policy defined in the Governance Framework.
The inter-evaluator agreement record is not a performance indicator. It is a diagnostic instrument. Agreement tells the framework that its criteria are clear enough for independent evaluators to apply consistently. Disagreement tells it where they are not. Both outcomes are data. The only outcome that provides no data is no evaluation at all.