Tier-A/Tier-B labeling protocol for the gold set

✓ Accepted · LLM Evaluation Framework · 02: Test Management System
By AI-DE Engineering Team·Stakeholders: ML engineer, data labeler lead, eng manager

Context

A working LLM eval framework is worth nothing without trustworthy labeled examples to score against. The gold set is the foundation everything else gates on: multi-judge calibration in M03, regression detection in M04, RAGAS in M05, and the human-in-the-loop workflow in M07.

We had three constraints, and they pull against each other:

  1. Coverage. We need ≥ 1,000 labeled cases across at least 4 task types (factual QA, summarization, code generation, RAG synthesis) to make per-task accuracy claims defensible.
  2. Quality. Inter-annotator agreement on the gold set has to clear κ ≥ 0.7 — below that, we can't tell whether judge errors are model failures or label noise.
  3. Throughput. A single labeler at full speed handles ~80 cases/day on the harder tasks. Dual-labeler-with-adjudication is ~30/day. Hitting 1,000 cases with dual-labeling is roughly 33 labeler-days; we have ~10 in the budget for v1.
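
The throughput gap can be checked with a few lines of arithmetic. This is a minimal sketch using only the rates quoted above; the function and constant names are illustrative, not part of the project.

```python
def labeler_days(n_cases: int, cases_per_day: float) -> float:
    """Labeler-days needed to label n_cases at a given daily rate."""
    return n_cases / cases_per_day


TARGET_CASES = 1_000
DUAL_RATE = 30      # cases/day, dual-labeler with adjudication (from the text)
BUDGET_DAYS = 10    # approximate v1 budget in labeler-days (from the text)

dual_days = labeler_days(TARGET_CASES, DUAL_RATE)
shortfall = dual_days - BUDGET_DAYS
print(f"dual-labeling: {dual_days:.1f} labeler-days, {shortfall:.1f} over budget")
```

Dual-labeling all 1,000 cases overshoots the budget by more than a factor of three, which is what rules out labeling everything twice.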

Three options on the table:

  • Option A: Dual-labeler everything. Highest quality, doesn't fit the budget.
  • Option B: Single-labeler everything. Fits the budget; agreement signal is unmeasurable, so we can't catch labeling drift.
  • Option C: Stratified — single-labeler triage on the larger tier, dual-labeler + adjudication on a smaller gold tier used for calibration.

Decision

Adopt Option C. Two tiers:

  • Tier-A (triage, 60% of cases): single annotator with weekly 10% spot-check by a second labeler. Used for bulk regression detection and CI-gate sampling. Accept ~5% noise floor.
  • Tier-B (gold, 40%): two independent annotators per case with adjudication on disagreements. Krippendorff's α tracked per task type, target ≥ 0.75. This is what judge calibration runs against; this is what we ship as gold/*.jsonl in the starter kit.

```python
# llm_eval/labeling/protocol.py
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"   # single annotator + weekly spot-check
    GOLD = "tier_b"     # dual annotator + adjudication


# A case is routed to GOLD if it matches any of these rules.
ROUTING_RULES = {
    LabelingTier.GOLD: {
        "task_types": {"factual_qa", "rag_synthesis"},   # high-stakes
        "uncertainty_threshold": 0.65,                   # if model is unsure
        "adversarial_flag": True,                        # known hard cases
    },
    LabelingTier.TRIAGE: "default",   # fallback when no GOLD rule matches
}
```

Cases route to GOLD when ANY rule matches; everything else goes to TRIAGE.
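
The "any rule matches" logic could be applied roughly as follows. This is a sketch under stated assumptions, not the project's router: the `Case` record, its field names, and the reading of `uncertainty_threshold` as "uncertainty at or above 0.65 means the model is unsure" are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"
    GOLD = "tier_b"


GOLD_TASK_TYPES = {"factual_qa", "rag_synthesis"}
UNCERTAINTY_THRESHOLD = 0.65


@dataclass
class Case:
    """Hypothetical record shape; the real schema is not shown in this ADR."""
    task_type: str
    model_uncertainty: float = 0.0   # assumed scale: 0 = confident, 1 = unsure
    adversarial: bool = False


def route(case: Case) -> LabelingTier:
    """Return GOLD if any routing rule matches, otherwise TRIAGE."""
    matches_gold = (
        case.task_type in GOLD_TASK_TYPES
        or case.model_uncertainty >= UNCERTAINTY_THRESHOLD
        or case.adversarial
    )
    return LabelingTier.GOLD if matches_gold else LabelingTier.TRIAGE


print(route(Case("summarization")).value)                         # tier_a
print(route(Case("summarization", model_uncertainty=0.8)).value)  # tier_b
```

Because the rules are OR-ed, a summarization case still lands in tier-B when the model is unsure or the case is flagged adversarial.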

Tradeoffs we accept

| Lever | Alternative | Chosen |
|---|---|---|
| Quality bar | Uniform dual-annotator | Stratified: quality where it gates, throughput where it doesn't |
| Labeler cost | $0.40/case dual on everything (~$400/1k) | $0.40 × 0.4 + $0.16 × 0.6 = $0.26/case avg (~$260/1k) |
| Drift detection | Per-case agreement on every label | 10% spot-check on tier-A; full agreement on tier-B |
| Inter-annotator metric | Cohen's κ (2 raters) | Krippendorff's α (handles ≥ 2 raters + missing data) |
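
The blended labeler cost in the table can be reproduced directly. This is a small sketch using the per-case rates and tier shares quoted above; the constant names are illustrative.

```python
DUAL_COST = 0.40      # $/case, dual-annotator (from the table)
SINGLE_COST = 0.16    # $/case, single-annotator rate used in the table
GOLD_SHARE = 0.40     # tier-B share of cases
TRIAGE_SHARE = 0.60   # tier-A share of cases

blended = DUAL_COST * GOLD_SHARE + SINGLE_COST * TRIAGE_SHARE
per_1k_blended = blended * 1_000
savings = 1 - blended / DUAL_COST
print(f"blended: ${blended:.3f}/case, ${per_1k_blended:.0f}/1k, {savings:.0%} cheaper")
# prints "blended: $0.256/case, $256/1k, 36% cheaper"
```

The unrounded figure is $0.256 per case, which the table rounds to $0.26 and ~$260 per 1,000 cases.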

Consequences (positive)

  • We can ship a 1,000-case labeled set in ~14 labeler-days, well under the ~33 that dual-labeling everything would need, with measurable α on the 400 high-stakes cases.
  • Tier-B doubles as the multi-judge calibration set in M03; judges that fall outside α-based agreement bounds are down-weighted.
  • Tier-A's spot-check signal is enough to catch labeler drift week-over-week without burning the budget.
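
For reference, Krippendorff's α on categorical labels can be computed from a coincidence matrix. The sketch below is a minimal nominal-data version that tolerates missing ratings; it is not the contents of llm_eval/labeling/agreement.py, and the function name and input shape are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of ratings per case; use None for a missing rating.
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a case with fewer than two ratings cannot be paired
        # Every ordered pair of ratings from different annotators counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one case with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("only one label value observed; alpha is undefined")
    return 1 - observed / expected


# Four hypothetical cases, two annotators each, one disagreement.
example = [["pass", "pass"], ["pass", "fail"], ["fail", "fail"], ["fail", "fail"]]
alpha = krippendorff_alpha_nominal(example)
print(f"alpha = {alpha:.3f}")   # prints "alpha = 0.533"
```

On this toy input α is 8/15, below the 0.75 target. Running the same function per task type would give the per-task tracking described in the decision.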

Consequences (negative)

  • Accuracy claims based on tier-A carry the ~5% label-noise floor accepted in the decision. We disclose this explicitly in the eval report.
  • The adjudication queue can back up: if two adjudicators are out, tier-B throughput drops to zero. Mitigation: rotate in a third labeler as backup.
  • The 10% spot-check is a sample, not a guarantee. We accept the residual risk and promote a case from tier-A to tier-B if a regression-blocking eval depends on it.
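
To make the residual risk of a sampled spot-check concrete, one can put a confidence interval around the observed disagreement rate. The sketch below uses a Wilson score interval; the weekly counts are hypothetical, and the ADR does not say the project computes an interval at all.

```python
import math


def wilson_interval(disagreements: int, sampled: int, z: float = 1.96):
    """Wilson score interval for a disagreement rate (z=1.96 is roughly 95%)."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    p = disagreements / sampled
    denom = 1 + z**2 / sampled
    center = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return center - half, center + half


# Hypothetical week: 40 tier-A cases spot-checked, 3 disagreements (7.5%).
lo, hi = wilson_interval(disagreements=3, sampled=40)
print(f"observed 7.5%, interval {lo:.1%} to {hi:.1%}")
```

With these hypothetical counts the interval runs from roughly 3% to 20%, so a single small weekly sample cannot rule out a disagreement rate well above what was observed.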

Reversal plan

If the team grows past 8 labelers, revisit this decision and move tier-A cases that gate releases to dual-annotator labeling. Estimated effort: ~2 engineer-weeks (routing + adjudication queue scaling). Triggers:

  1. Tier-A spot-check disagreement rate > 12% for 2 consecutive weeks.
  2. A regression-detection bug traced back to tier-A label noise.
  3. SLA changes that require κ on every gating case.
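
The first trigger is mechanical enough to automate. This is a minimal sketch, assuming the weekly tier-A spot-check disagreement rates are available as a chronological list of fractions; the function name and input format are hypothetical.

```python
def drift_trigger_fired(weekly_rates, threshold=0.12, consecutive=2):
    """True if the rate exceeds `threshold` for `consecutive` weeks in a row."""
    streak = 0
    for rate in weekly_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


print(drift_trigger_fired([0.05, 0.13, 0.08, 0.14]))  # False: never two in a row
print(drift_trigger_fired([0.05, 0.13, 0.14]))        # True
```

The comparison is strict, matching the "> 12%" wording: a week at exactly 12% resets the streak.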

References

  • llm_eval/labeling/protocol.py — routing rules
  • llm_eval/labeling/agreement.py — Krippendorff's α calculation
  • gold/*.jsonl — tier-B labeled cases shipped with the starter kit
  • data/sample_test_cases.json — tier-A samples
  • ADR-004 (multi-judge calibration depends on this gold set)