Context
A working LLM eval framework is worth nothing without trustworthy labeled examples to score against. The gold set is the foundation everything else depends on: multi-judge calibration in M03, regression detection in M04, RAGAS in M05, and the human-in-the-loop workflow in M07.
We have three constraints, and they pull against each other:
- Coverage. We need ≥ 1,000 labeled cases across at least 4 task types (factual QA, summarization, code generation, RAG synthesis) to make per-task accuracy claims defensible.
- Quality. Inter-annotator agreement on the gold set has to clear κ ≥ 0.7 — below that, we can't tell whether judge errors are model failures or label noise.
- Throughput. A single labeler at full speed handles ~80 cases/day on the harder tasks. Dual-labeler-with-adjudication is ~30/day. Hitting 1,000 cases with dual-labeling is roughly 33 labeler-days; we have ~10 in the budget for v1.
Three options on the table:
- Option A: Dual-labeler everything. Highest quality, doesn't fit the budget.
- Option B: Single-labeler everything. Fits the budget; agreement signal is unmeasurable, so we can't catch labeling drift.
- Option C: Stratified — single-labeler triage on the larger tier, dual-labeler + adjudication on a smaller gold tier used for calibration.
Decision
Adopt Option C. Two tiers:
- Tier-A (triage, 60% of cases): single annotator with weekly 10% spot-check by a second labeler. Used for bulk regression detection and CI-gate sampling. Accept ~5% noise floor.
- Tier-B (gold, 40%): two independent annotators per case with adjudication on disagreements. Krippendorff's α tracked per task type, target ≥ 0.75. This is what judge calibration runs against; this is what we ship as gold/*.jsonl in the starter kit.
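To make the Tier-B agreement target concrete, here is a minimal sketch of Krippendorff's α for nominal labels, computed from a coincidence matrix. It is an illustration only, not the contents of llm_eval/labeling/agreement.py; the function name and the input layout (one tuple of labels per case, `None` for a missing label) are assumptions.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> float:
    """Krippendorff's alpha for nominal labels.

    units[i] holds the labels given to case i, one entry per annotator;
    None marks a missing label. Cases with fewer than two labels are skipped.
    """
    coincidences: Counter = Counter()  # (label_c, label_k) -> weighted count
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # a single label cannot be paired with anything
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for a, b in permutations(range(m), 2):
            coincidences[(present[a], present[b])] += 1.0 / (m - 1)

    totals: Counter = Counter()  # marginal count per label value
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one case with two labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Four cases, two annotators, one disagreement.
cases = [("a", "a"), ("b", "b"), ("a", "b"), ("b", "b")]
alpha = krippendorff_alpha_nominal(cases)  # 8/15, about 0.533
```

In this toy example a single disagreement in four cases already puts α below the 0.75 target, which is why the target is tracked per task type rather than on the pooled set.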
```python
# llm_eval/labeling/protocol.py
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"  # single-annotator + weekly spot-check
    GOLD = "tier_b"    # dual-annotator + adjudication


# Rules that decide which tier a case is labeled in.
ROUTING_RULES = {
    LabelingTier.GOLD: {
        "task_types": {"factual_qa", "rag_synthesis"},  # high-stakes
        "uncertainty_threshold": 0.65,                  # if model is unsure
        "adversarial_flag": True,                       # known hard cases
    },
    LabelingTier.TRIAGE: "default",  # fallback when no GOLD rule matches
}
```
Cases route to GOLD when ANY rule matches; everything else goes to TRIAGE.
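The ANY-rule-matches behaviour could be sketched as follows. The `Case` dataclass, its field names, and the `route` function are hypothetical. The sketch also assumes each case carries a model uncertainty score in [0, 1] and that "unsure" means a score at or above the threshold; the rule table itself does not say which way the comparison goes.

```python
from dataclasses import dataclass
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"
    GOLD = "tier_b"


ROUTING_RULES = {
    LabelingTier.GOLD: {
        "task_types": {"factual_qa", "rag_synthesis"},
        "uncertainty_threshold": 0.65,
        "adversarial_flag": True,
    },
    LabelingTier.TRIAGE: "default",
}


@dataclass
class Case:
    """Hypothetical case record; field names are assumptions."""
    task_type: str
    uncertainty: float = 0.0   # assumed: higher means the model is less sure
    adversarial: bool = False  # assumed: set upstream for known hard cases


def route(case: Case, rules: dict = ROUTING_RULES) -> LabelingTier:
    """Return GOLD if any gold rule matches, otherwise TRIAGE."""
    gold = rules[LabelingTier.GOLD]
    matches = (
        case.task_type in gold["task_types"],
        case.uncertainty >= gold["uncertainty_threshold"],
        gold["adversarial_flag"] and case.adversarial,
    )
    return LabelingTier.GOLD if any(matches) else LabelingTier.TRIAGE


route(Case("summarization"))                     # LabelingTier.TRIAGE
route(Case("code_generation", uncertainty=0.7))  # LabelingTier.GOLD
```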
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Quality bar | Uniform dual-annotator | Stratified — quality where it gates, throughput where it doesn't |
| Labeler cost | $0.40/case dual on everything (~$400/1k) | $0.40 × 0.4 + $0.16 × 0.6 = $0.26/case avg (~$260/1k) |
| Drift detection | Per-case agreement on every label | 10% spot-check on tier-A; full agreement on tier-B |
| Inter-annotator metric | Cohen's κ (2 raters) | Krippendorff's α (handles ≥ 2 raters + missing data) |
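The blended labeler cost in the table is a weighted average of the two per-case rates. A small sketch, using the table's figures; the function and constant names are illustrative:

```python
DUAL_COST_PER_CASE = 0.40    # dual-annotator + adjudication, from the table
SINGLE_COST_PER_CASE = 0.16  # single-annotator, from the table


def blended_cost_per_case(
    gold_share: float,
    dual: float = DUAL_COST_PER_CASE,
    single: float = SINGLE_COST_PER_CASE,
) -> float:
    """Average cost per case when gold_share of cases are dual-labeled."""
    if not 0.0 <= gold_share <= 1.0:
        raise ValueError("gold_share must be between 0 and 1")
    return dual * gold_share + single * (1.0 - gold_share)


per_case = blended_cost_per_case(0.40)  # 0.256, i.e. about $0.26
per_thousand = per_case * 1000          # about $256, quoted as ~$260
```

Rerunning this with a different `gold_share` shows how quickly the average approaches the $0.40 dual-labeling rate as more cases are promoted to tier-B.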
Consequences (positive)
- We can ship a 1,000-case gold set in ~14 labeler-days, against ~33 for dual-labeling everything, with measurable agreement (α) on the 400 high-stakes cases.
- Tier-B doubles as the multi-judge calibration set in M03: judges that fall outside α-based agreement bounds are down-weighted.
- Tier-A's spot-check signal is enough to catch labeler drift week-over-week without burning the budget.
Consequences (negative)
- Accuracy claims based on tier-A carry the ~5% label-noise floor accepted above, so differences smaller than that cannot be attributed to the model. We disclose this explicitly in the eval report.
- The adjudication queue can back up: if two adjudicators are out, tier-B throughput drops to zero. Mitigation: keep a third labeler in rotation as a backup adjudicator.
- The 10% spot-check is a sample, not a guarantee. We accept the residual risk and promote cases from tier-A to tier-B if a regression-blocking eval depends on them.
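One way the weekly 10% spot-check sample could be drawn is sketched below. The function name, the string case IDs, and the use of a per-week seed for reproducibility are assumptions, not part of the protocol.

```python
import random
from typing import List, Optional, Sequence


def weekly_spot_check(
    case_ids: Sequence[str],
    fraction: float = 0.10,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a random subset of this week's tier-A cases for a second labeler."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not case_ids:
        return []
    # Always review at least one case, even in a very small week.
    k = max(1, round(len(case_ids) * fraction))
    rng = random.Random(seed)  # seeded so the draw can be reproduced
    return sorted(rng.sample(list(case_ids), k))


week_ids = [f"case-{i:03d}" for i in range(150)]
to_review = weekly_spot_check(week_ids, seed=7)  # 15 of the 150 cases
```

Sampling uniformly keeps the disagreement rate an unbiased estimate for the week; stratifying by task type would be a reasonable variation if one task dominates the volume.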
Reversal plan
If team size grows past 8 labelers, revisit and ramp tier-A → dual-annotator on cases that gate releases. Estimated effort: ~2 engineer-weeks (routing + adjudication queue scaling). Triggers:
- Tier-A spot-check disagreement rate > 12% for 2 consecutive weeks.
- A regression-detection bug traced back to tier-A label noise.
- SLA changes that require measured inter-annotator agreement on every gating case.
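The first trigger is mechanical enough to check automatically. A minimal sketch, assuming spot-check results arrive as (first label, second label) pairs and that weekly rates are stored oldest first; both function names are hypothetical:

```python
from typing import Hashable, Sequence, Tuple


def disagreement_rate(pairs: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """Fraction of spot-checked cases where the two labelers disagree."""
    if not pairs:
        raise ValueError("no spot-checked cases this week")
    return sum(1 for first, second in pairs if first != second) / len(pairs)


def drift_trigger_fired(
    weekly_rates: Sequence[float],
    threshold: float = 0.12,
    consecutive_weeks: int = 2,
) -> bool:
    """True if the rate exceeded the threshold for N weeks in a row."""
    streak = 0
    for rate in weekly_rates:
        # Strictly greater than, matching "> 12%" in the trigger.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive_weeks:
            return True
    return False


drift_trigger_fired([0.05, 0.13, 0.08, 0.14])  # False: never two in a row
drift_trigger_fired([0.05, 0.13, 0.14])        # True
```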
References
- llm_eval/labeling/protocol.py: routing rules
- llm_eval/labeling/agreement.py: Krippendorff α calculation
- gold/*.jsonl: tier-B labeled cases shipped with the starter kit
- data/sample_test_cases.json: tier-A samples
- ADR-004 (multi-judge calibration depends on this gold set)