Inter-Evaluator Agreement

Tracking classification consistency across independent evaluators.

This page tracks inter-evaluator agreement across completed SFR evaluations. Agreement data is the primary empirical measure of the framework's repeatability. The structure below defines how agreement is calculated, what disagreements are logged, and what the current status of the agreement record is. No evaluations have been completed at this stage. The structure awaits its first entries.

Agreement Summary


The following statistics are derived from completed evaluations in which two independent evaluators assessed the same system using the same evidence. Agreement is measured at the criterion level (A, B, C) and at the final classification level.

Current Agreement Record — SFR v0.9 Draft
Total Evaluations: 0 (completed dual-evaluator assessments entered in the record)
Agreement Rate: no value yet (percentage of evaluations with full criterion-level agreement between both evaluators)
Disagreement Rate: no value yet (percentage of evaluations with at least one criterion-level disagreement)
Unresolved Determinations: 0 (criterion-level disagreements not yet resolved by framework review)

Statistics will be populated as evaluations are completed through the Pilot Validation Program. Current status: awaiting first evaluation pair.
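
As an illustration of how the summary statistics above could be derived, the following is a minimal sketch rather than the framework's own tooling. The names `EvaluationPair`, `full_agreement`, and `summarize` are hypothetical, and determinations are assumed to be stored as plain strings keyed by criterion. Rates are left undefined (`None`) when the record is empty, since a rate over zero evaluations carries no information.

```python
from dataclasses import dataclass

CRITERIA = ("A", "B", "C")  # criterion labels as named on this page


@dataclass
class EvaluationPair:
    """One dual-evaluator assessment (hypothetical record shape)."""
    eval_id: str
    evaluator_a: dict  # criterion -> determination string
    evaluator_b: dict  # criterion -> determination string


def full_agreement(pair: EvaluationPair) -> bool:
    """True only if both evaluators match on every criterion."""
    return all(pair.evaluator_a[c] == pair.evaluator_b[c] for c in CRITERIA)


def summarize(pairs: list) -> dict:
    """Compute the summary figures; rates are None for an empty record."""
    total = len(pairs)
    if total == 0:
        return {"total": 0, "agreement_rate": None, "disagreement_rate": None}
    agreed = sum(1 for p in pairs if full_agreement(p))
    return {
        "total": total,
        "agreement_rate": 100.0 * agreed / total,
        "disagreement_rate": 100.0 * (total - agreed) / total,
    }
```

With the record in its current state, `summarize([])` reports a total of 0 and no rates, which matches the empty summary shown above.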

How Agreement Is Measured


Agreement is assessed by comparing the independent evaluation records completed by two evaluators for the same system. The following methodology applies.

Unit of Agreement

Agreement is measured primarily at the criterion level, with final classification agreement as a secondary measure. Two evaluators may reach the same final classification through different criterion-level determinations if those combinations produce the same classification outcome under the derivation rules, so matching classifications can conceal criterion-level disagreement. For this reason, criterion-level agreement is the primary measure. Full criterion-level agreement implies final classification agreement, but the reverse does not hold.

Agreement Definition

Two evaluators agree on a criterion if their determination entries are identical: both Pass, both Fail, or both Insufficient Data. A determination of Pass by one evaluator and Fail by another is a disagreement. A determination of Pass or Fail by one evaluator and Insufficient Data by another is also a disagreement. Insufficient Data is not a neutral position; it is a specific determination about the evidentiary state.
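
The definition above reduces to an identity check over three possible values. A minimal sketch, assuming determinations are modeled as an enumeration (the names `Determination`, `criterion_agrees`, and `disagreeing_criteria` are illustrative and not part of the framework):

```python
from enum import Enum


class Determination(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    INSUFFICIENT_DATA = "Insufficient Data"


def criterion_agrees(a: Determination, b: Determination) -> bool:
    """Agreement means identical entries; Insufficient Data gets no special treatment."""
    return a is b


def disagreeing_criteria(record_a: dict, record_b: dict) -> list:
    """Return the criteria (in sorted order) on which two records differ."""
    return [c for c in sorted(record_a) if not criterion_agrees(record_a[c], record_b[c])]


# Example: the evaluators differ only on criterion B.
rec_a = {"A": Determination.PASS, "B": Determination.PASS, "C": Determination.FAIL}
rec_b = {"A": Determination.PASS, "B": Determination.INSUFFICIENT_DATA, "C": Determination.FAIL}
differences = disagreeing_criteria(rec_a, rec_b)
```

Because Insufficient Data is an ordinary member of the enumeration, a Pass paired with Insufficient Data is counted as a disagreement in the same way as a Pass paired with a Fail.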

Evidence Tier Agreement

Separately from determination agreement, evaluators' evidence tier assignments for each criterion are also compared. Two evaluators who reach the same determination but record different evidence tiers have made different judgments about the quality of the evidence. This is a soft disagreement: it does not affect the classification result, but it may indicate ambiguity in the criteria used to assess evidence quality.
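
The distinction between a disagreement and a soft disagreement can be sketched as a three-way comparison. This is an assumption-laden illustration: `Entry` and `compare_entries` are hypothetical names, and the tier labels used in the example ("Tier 1", "Tier 2") are placeholders, since this page does not define the tier scale.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    determination: str  # "Pass", "Fail", or "Insufficient Data"
    evidence_tier: str  # placeholder labels; the real tier scale is defined elsewhere


def compare_entries(a: Entry, b: Entry) -> str:
    """Classify one criterion comparison between two evaluators."""
    if a.determination != b.determination:
        return "disagreement"       # differing determinations are a disagreement
    if a.evidence_tier != b.evidence_tier:
        return "soft disagreement"  # same determination, different evidence tier
    return "agreement"


outcome = compare_entries(Entry("Pass", "Tier 1"), Entry("Pass", "Tier 2"))
```

The determination is checked first, so a tier mismatch is only ever reported when the determinations already match.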

Disagreement Response

Where evaluators disagree on a criterion determination, the disagreement is logged (see Section 3, Criterion-Level Disagreements), the criterion determination is marked Disputed in the Results Registry, and the framework authors conduct a review to identify the methodological source of the disagreement. The review may result in clarification of the criterion, an update to the evaluation guidance, or a finding that the evidence was insufficient for either evaluator to make a definitive determination.
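
The logging step can be sketched as a function that turns two evaluation records into log rows. This is a hypothetical helper, not the framework's tooling: `log_disagreements` and the dictionary keys are illustrative, each record is assumed to map a criterion to a `(determination, evidence_tier)` pair, and recording both evaluators' tiers in the evidence tier field is an assumption about how that column is filled.

```python
CRITERIA = ("A", "B", "C")


def log_disagreements(eval_id: str, record_a: dict, record_b: dict) -> list:
    """Build one log row per criterion on which the determinations differ."""
    rows = []
    for criterion in CRITERIA:
        det_a, tier_a = record_a[criterion]
        det_b, tier_b = record_b[criterion]
        if det_a != det_b:
            rows.append({
                "eval_id": eval_id,
                "criterion": criterion,
                "evaluator_a": det_a,
                "evaluator_b": det_b,
                "evidence_tier": f"{tier_a} / {tier_b}",  # assumption: keep both tiers
                "status": "Disputed",  # pending review by the framework authors
            })
    return rows


# "EXAMPLE-001" and the tier labels are made-up values for illustration only.
rows = log_disagreements(
    "EXAMPLE-001",
    {"A": ("Pass", "Tier 1"), "B": ("Pass", "Tier 2"), "C": ("Fail", "Tier 1")},
    {"A": ("Pass", "Tier 1"), "B": ("Fail", "Tier 2"), "C": ("Fail", "Tier 1")},
)
```

Criteria on which the evaluators agree produce no row, so a fully agreeing pair yields an empty list.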

Threshold for Stage 2 Advancement

The framework does not specify a minimum numerical agreement rate for Stage 2 advancement. The requirement is that evaluations have been conducted and disagreements have been reviewed. A perfect agreement rate across zero evaluations is meaningless. Agreement data requires actual evaluations. The first evaluation pair — regardless of outcome — constitutes progress.

Disagreements are not failures. They are methodological feedback. A disagreement that is properly logged and reviewed produces more value for the framework than an agreement that is never recorded.

Criterion-Level Disagreements


The following table logs all criterion-level disagreements from completed evaluations. For each disagreement, the table records the evaluation ID, the criterion on which disagreement occurred, the two conflicting determinations, the evidence tier involved, and the resolution status.

Eval ID | Criterion | Evaluator A | Evaluator B | Evidence Tier | Status
No evaluations completed. This log awaits its first entry.

Unresolved Criterion Determinations


An unresolved determination is a criterion-level disagreement that has been logged but not yet resolved through review by the framework authors. Unresolved determinations are tracked separately because they represent open questions about the framework's clarity, and those questions must be answered before the framework can claim full methodological consistency.

Eval ID | Criterion | Disagreement Summary | Days Open | Assigned To
No unresolved determinations. No evaluations completed.

When a disagreement is resolved, the resolution is added to the framework's evaluation guidance documents and the criterion is marked Resolved in this log. Resolutions that require normative changes are processed through the revision policy defined in the Governance Framework.
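
One way to model an entry in this log, including the Days Open figure and the transition to Resolved, is sketched below. The class `UnresolvedDetermination` and its field names are hypothetical, the status label "Unresolved" is an assumed name for the open state, and counting days from the logging date is an assumption; the page does not state how Days Open is calculated.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class UnresolvedDetermination:
    eval_id: str
    criterion: str
    summary: str
    assigned_to: str
    opened_on: date                      # assumed: the date the disagreement was logged
    resolved_on: Optional[date] = None   # set once the review concludes

    @property
    def status(self) -> str:
        return "Resolved" if self.resolved_on is not None else "Unresolved"

    def days_open(self, today: date) -> int:
        """Days between logging and resolution, or until today if still open."""
        end = self.resolved_on if self.resolved_on is not None else today
        return (end - self.opened_on).days

    def resolve(self, on: date) -> None:
        """Mark the entry Resolved as of the given date."""
        self.resolved_on = on


# All values below are made up for illustration.
entry = UnresolvedDetermination(
    eval_id="EXAMPLE-001",
    criterion="B",
    summary="Pass vs Fail",
    assigned_to="framework authors",
    opened_on=date(2025, 1, 1),
)
open_days = entry.days_open(today=date(2025, 1, 11))
entry.resolve(on=date(2025, 1, 15))
```

After resolution the day count stops advancing, so a resolved entry keeps reporting how long the question stayed open.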

Agreement Data Is the Test

The inter-evaluator agreement record is not a performance indicator. It is a diagnostic instrument. Agreement tells the framework that its criteria are clear enough for independent evaluators to apply consistently. Disagreement tells it where they are not. Both outcomes are data. The only outcome that provides no data is no evaluation at all.

The record structure is ready. The methodology is defined. Pilot program participation is the action that begins filling it in.