ADR-001: Tier-A/Tier-B labeling protocol for the gold set | LLM Evaluation Framework

Context

A working LLM eval framework is worth nothing without trustworthy labeled examples to score against. The gold set is the foundation everything else gates on: multi-judge calibration in M03, regression detection in M04, RAGAS in M05, and the human-in-the-loop review cycle in M07.

We had three constraints and they fight each other:

Coverage. We need ≥ 1,000 labeled cases across at least 4 task types (factual QA, summarization, code generation, RAG synthesis) to make per-task accuracy claims defensible.
Quality. Inter-annotator agreement on the gold set has to clear κ ≥ 0.7 — below that, we can't tell whether judge errors are model failures or label noise.
Throughput. A single labeler at full speed handles ~80 cases/day on the harder tasks. Dual-labeler-with-adjudication is ~30/day. Hitting 1,000 cases with dual-labeling is roughly 33 labeler-days; we have ~10 in the budget for v1.

Three options on the table:

Option A: Dual-labeler everything. Highest quality, doesn't fit the budget.
Option B: Single-labeler everything. Fits the budget; agreement signal is unmeasurable, so we can't catch labeling drift.
Option C: Stratified — single-labeler triage on the larger tier, dual-labeler + adjudication on a smaller gold tier used for calibration.

Decision

Adopt Option C. Two tiers:

Tier-A (triage, 60% of cases): single annotator with weekly 10% spot-check by a second labeler. Used for bulk regression detection and CI-gate sampling. Accept ~5% noise floor.
Tier-B (gold, 40%): two independent annotators per case with adjudication on disagreements. Krippendorff's α tracked per task type, target ≥ 0.75. This is what judge calibration runs against; this is what we ship as gold/*.jsonl in the starter kit.

# llm_eval/labeling/protocol.py
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"   # single-annotator + weekly spot-check
    GOLD   = "tier_b"   # dual-annotator + adjudication


# A case goes to GOLD if it matches any GOLD rule; TRIAGE is the fallback.
ROUTING_RULES = {
    LabelingTier.GOLD: {
        "task_types": {"factual_qa", "rag_synthesis"},   # high-stakes
        "uncertainty_threshold": 0.65,                   # if model is unsure
        "adversarial_flag": True,                        # known hard cases
    },
    LabelingTier.TRIAGE: "default",
}

Cases route to GOLD when ANY rule matches; everything else goes to TRIAGE.
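The any-rule-matches behavior can be sketched as a small routing function. The Case dataclass and its field names (task_type, model_confidence, adversarial) are hypothetical. The sketch also assumes that "if model is unsure" means a model confidence below the 0.65 threshold; the source does not say how uncertainty is measured.

```python
from dataclasses import dataclass
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"
    GOLD = "tier_b"


# Values copied from ROUTING_RULES; repeated so the sketch runs on its own.
GOLD_TASK_TYPES = {"factual_qa", "rag_synthesis"}
UNCERTAINTY_THRESHOLD = 0.65


@dataclass
class Case:
    """Hypothetical case record; field names are assumptions."""
    task_type: str
    model_confidence: float   # assumed to lie in [0, 1]
    adversarial: bool = False


def route(case: Case) -> LabelingTier:
    """Return GOLD if any rule matches, otherwise fall back to TRIAGE."""
    if (
        case.task_type in GOLD_TASK_TYPES
        or case.model_confidence < UNCERTAINTY_THRESHOLD  # assumed reading of "unsure"
        or case.adversarial
    ):
        return LabelingTier.GOLD
    return LabelingTier.TRIAGE
```

For example, a confident summarization case stays in TRIAGE, while the same case flagged as adversarial is promoted to GOLD.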

Tradeoffs we accept

Lever	Alternative	Chosen
Quality bar	Uniform dual-annotator	Stratified — quality where it gates, throughput where it doesn't
Labeler cost	$0.40/case dual on everything (~$400/1k)	$0.40 × 0.4 + $0.16 × 0.6 = $0.26/case avg (~$260/1k)
Drift detection	Per-case agreement on every label	10% spot-check on tier-A; full agreement on tier-B
Inter-annotator metric	Cohen's κ (2 raters)	Krippendorff's α (handles ≥ 2 raters + missing data)
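The blended labeler cost in the table is a weighted average of the two per-case rates. A short sketch of that arithmetic follows; the function name is illustrative, and the cost of the weekly tier-A spot-check is left out because the table does not price it.

```python
def blended_cost_per_case(gold_share: float, gold_cost: float, triage_cost: float) -> float:
    """Average cost per case when `gold_share` of cases are dual-labeled."""
    return gold_share * gold_cost + (1.0 - gold_share) * triage_cost


# Figures from the table: 40% tier-B at $0.40, 60% tier-A at $0.16.
avg = blended_cost_per_case(gold_share=0.4, gold_cost=0.40, triage_cost=0.16)
per_thousand = avg * 1000  # about $256, which the table rounds to ~$260
```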

Consequences (positive)

We can ship a 1,000-case gold set in ~14 labeler-days (within budget) with measurable Krippendorff's α on the 400 high-stakes cases.
Tier-B doubles as the multi-judge calibration set in M03: judges that fall outside α-based agreement bounds are deweighted.
Tier-A's spot-check signal is enough to catch labeler drift week-over-week without burning the budget.
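One way the weekly 10% tier-A spot-check could be drawn is a seeded random sample, so the same week's selection can be reproduced later. This is a sketch under assumptions: the function name, the string case IDs, and the per-week seed are illustrative, and the source does not say whether the sample is uniform or stratified by task type.

```python
import random
from typing import List, Sequence


def weekly_spot_check(
    case_ids: Sequence[str], fraction: float = 0.10, seed: int = 0
) -> List[str]:
    """Pick a reproducible uniform sample of tier-A cases for second labeling."""
    if not case_ids:
        return []
    k = max(1, round(fraction * len(case_ids)))  # always check at least one case
    return random.Random(seed).sample(list(case_ids), k)


# Hypothetical week: 300 tier-A cases, seeded with the ISO week number.
week_cases = [f"case-{i:04d}" for i in range(300)]
to_recheck = weekly_spot_check(week_cases, seed=27)
```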

Consequences (negative)

Accuracy claims that rest on tier-A labels inherit the ~5% label-noise floor. We disclose this explicitly in the eval report.
The adjudication queue can back up: if two adjudicators are out, tier-B throughput drops to zero. Mitigation: keep a third labeler in rotation as a backup adjudicator.
The 10% spot-check is a sample, not a guarantee. We accept the residual risk and promote a case from tier-A to tier-B if a regression-blocking eval depends on it.

Reversal plan

If team size grows past 8 labelers, revisit and ramp tier-A → dual-annotator on cases that gate releases. Estimated effort: ~2 engineer-weeks (routing + adjudication queue scaling). Triggers:

Tier-A spot-check disagreement rate > 12% for 2 consecutive weeks.
A regression-detection bug traced back to tier-A label noise.
SLA changes that require measured inter-annotator agreement on every gating case.
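The first trigger can be checked mechanically from the weekly spot-check results. A minimal sketch, assuming disagreement rates are stored as fractions in chronological order and that "> 12%" is a strict comparison; the function name is hypothetical.

```python
from typing import Sequence


def drift_triggered(
    weekly_rates: Sequence[float], threshold: float = 0.12, consecutive: int = 2
) -> bool:
    """True if the rate exceeded `threshold` for `consecutive` weeks in a row."""
    streak = 0
    for rate in weekly_rates:
        # Extend the streak on a breach, reset it on any week at or below threshold.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

A single bad week followed by a recovery does not fire the trigger; two breaches back to back do.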

References

llm_eval/labeling/protocol.py — routing rules
llm_eval/labeling/agreement.py — Krippendorff α calculation
gold/*.jsonl — tier-B labeled cases shipped with the starter kit
data/sample_test_cases.json — tier-A samples
ADR-004 (multi-judge calibration depends on this gold set)