# ADR-001 — Tier-A/Tier-B labeling protocol for the gold set

- **Status:** Accepted
- **Date:** 2026-04-12
- **Module:** 02 — Test Management System
- **Stakeholders:** ML engineer, data labeler lead, eng manager

## Context

A working LLM eval framework is worth nothing without trustworthy labeled examples to score against. The gold set is the foundation everything else gates on: multi-judge calibration in M03, regression detection in M04, RAGAS in M05, and the human-in-the-loop workflow in M07.

We have three constraints, and they pull against each other:

1. **Coverage.** We need ≥ 1,000 labeled cases across at least 4 task types (factual QA, summarization, code generation, RAG synthesis) to make per-task accuracy claims defensible.
2. **Quality.** Inter-annotator agreement on the gold set has to clear κ ≥ 0.7 — below that, we can't tell whether judge errors are model failures or label noise.
3. **Throughput.** A single labeler at full speed handles ~80 cases/day on the harder tasks. Dual-labeler-with-adjudication is ~30/day. Hitting 1,000 cases with dual-labeling is roughly 33 labeler-days; we have ~10 in the budget for v1.
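
The throughput arithmetic above can be reproduced with a short sketch. The function name `labeler_days` and the constant names are hypothetical; the rates are the approximate figures quoted in the constraint. The single-labeler rate is the one quoted for harder tasks, so it probably overstates the time needed on easier ones.

```python
# Approximate throughput figures from the Throughput constraint (cases per labeler-day).
SINGLE_LABELER_RATE = 80   # single labeler, quoted for the harder tasks
DUAL_LABELER_RATE = 30     # dual-labeler with adjudication


def labeler_days(n_cases: int, dual_fraction: float) -> float:
    """Estimate labeler-days when `dual_fraction` of cases are dual-labeled."""
    if not 0.0 <= dual_fraction <= 1.0:
        raise ValueError("dual_fraction must be between 0 and 1")
    dual_cases = n_cases * dual_fraction
    single_cases = n_cases - dual_cases
    return single_cases / SINGLE_LABELER_RATE + dual_cases / DUAL_LABELER_RATE


if __name__ == "__main__":
    # Dual-labeling all 1,000 cases: 1000 / 30, roughly 33 labeler-days.
    print(f"dual everything:   {labeler_days(1000, 1.0):.1f} labeler-days")
    # Single-labeling all 1,000 cases at the harder-task rate: 1000 / 80.
    print(f"single everything: {labeler_days(1000, 0.0):.1f} labeler-days")
```

Passing a fraction between 0 and 1 gives the estimate for any stratified split.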

Three options on the table:

- **Option A:** Dual-labeler everything. Highest quality, doesn't fit the budget.
- **Option B:** Single-labeler everything. Fits the budget; agreement signal is unmeasurable, so we can't catch labeling drift.
- **Option C:** Stratified — single-labeler triage on the larger tier, dual-labeler + adjudication on a smaller gold tier used for calibration.

## Decision

**Adopt Option C.** Two tiers:

- **Tier-A (triage, 60% of cases):** single annotator, with a weekly 10% spot-check by a second labeler. Used for bulk regression detection and CI-gate sampling. We accept a ~5% label-noise floor.
- **Tier-B (gold, 40%):** two independent annotators per case with adjudication on disagreements. Krippendorff's α tracked per task type, target ≥ 0.75. This is what judge calibration runs against; this is what we ship as `gold/*.jsonl` in the starter kit.
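
For readers unfamiliar with the metric, here is a minimal sketch of Krippendorff's α for nominal labels, using the standard coincidence-matrix formulation. It is an illustration only and is not the contents of `llm_eval/labeling/agreement.py`; the function name and the input layout (one tuple of ratings per case, `None` for a missing rating) are assumptions.

```python
from collections import defaultdict
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> float:
    """Krippendorff's alpha for nominal data; `None` marks a missing rating."""
    # Coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in that unit.
    coincidences: dict = defaultdict(float)
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two ratings cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label value, and the total number of pairable ratings.
    totals: dict = defaultdict(float)
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Four cases, two annotators each; the annotators disagree on one case.
    sample = [("pass", "pass"), ("pass", "fail"), ("fail", "fail"), ("fail", "fail")]
    print(f"alpha = {krippendorff_alpha_nominal(sample):.3f}")
```

To track α per task type, the same function would be called once per group of tier-B cases sharing a task type, and each result compared against the 0.75 target.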

```python
# llm_eval/labeling/protocol.py
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"   # single-annotator + weekly spot-check
    GOLD   = "tier_b"   # dual-annotator + adjudication

ROUTING_RULES = {
    LabelingTier.GOLD: {
        "task_types": {"factual_qa", "rag_synthesis"},   # high-stakes
        "uncertainty_threshold": 0.65,                   # if model is unsure
        "adversarial_flag": True,                        # known hard cases
    },
    LabelingTier.TRIAGE: "default",
}
```

Cases route to GOLD when ANY rule matches; everything else goes to TRIAGE.
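
The any-rule-matches behavior can be sketched as a small routing function. This is an illustration, not the shipped router: the `Case` dataclass, its field names, and the `route` function are hypothetical, and the sketch assumes a case counts as uncertain when its uncertainty score is at or above the 0.65 threshold.

```python
from dataclasses import dataclass
from enum import Enum


class LabelingTier(str, Enum):
    TRIAGE = "tier_a"   # single-annotator + weekly spot-check
    GOLD = "tier_b"     # dual-annotator + adjudication


# Values mirror ROUTING_RULES in protocol.py.
GOLD_TASK_TYPES = {"factual_qa", "rag_synthesis"}
UNCERTAINTY_THRESHOLD = 0.65


@dataclass(frozen=True)
class Case:
    """Hypothetical case record; field names are assumptions for this sketch."""
    task_type: str
    model_uncertainty: float = 0.0
    adversarial: bool = False


def route(case: Case) -> LabelingTier:
    """Send a case to GOLD if ANY gold rule matches, otherwise to TRIAGE."""
    matches_gold = (
        case.task_type in GOLD_TASK_TYPES
        # Assumed direction: higher score means the model is less sure.
        or case.model_uncertainty >= UNCERTAINTY_THRESHOLD
        or case.adversarial
    )
    return LabelingTier.GOLD if matches_gold else LabelingTier.TRIAGE


if __name__ == "__main__":
    print(route(Case("summarization")))                         # no rule matches
    print(route(Case("summarization", model_uncertainty=0.7)))  # uncertainty rule
    print(route(Case("code_generation", adversarial=True)))     # adversarial rule
```

Because the rules are OR-ed together, the share of cases landing in tier-B depends on the task mix and on how often the uncertainty and adversarial rules fire, so the 60/40 split is a target to monitor rather than something the router guarantees.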

## Tradeoffs we accept

| Lever                  | Alternative                              | Chosen                                                           |
| ---------------------- | ---------------------------------------- | ---------------------------------------------------------------- |
| Quality bar            | Uniform dual-annotator                   | Stratified — quality where it gates, throughput where it doesn't |
| Labeler cost           | $0.40/case dual on everything (~$400/1k) | $0.40 × 0.4 + $0.16 × 0.6 = $0.26/case avg (~$260/1k)            |
| Drift detection        | Per-case agreement on every label        | 10% spot-check on tier-A; full agreement on tier-B               |
| Inter-annotator metric | Cohen's κ (2 raters)                     | Krippendorff's α (handles ≥ 2 raters + missing data)             |
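
The labeler-cost row can be checked with a few lines of arithmetic. The per-case prices come from the table; the function name is hypothetical, and the sketch leaves out the cost of the tier-A spot-check, as the table does.

```python
# Per-case labeling prices from the tradeoff table (USD).
DUAL_COST_PER_CASE = 0.40     # two annotators + adjudication
SINGLE_COST_PER_CASE = 0.16   # single annotator


def blended_cost_per_case(gold_fraction: float) -> float:
    """Average cost per case when `gold_fraction` of cases are dual-labeled."""
    if not 0.0 <= gold_fraction <= 1.0:
        raise ValueError("gold_fraction must be between 0 and 1")
    return (
        DUAL_COST_PER_CASE * gold_fraction
        + SINGLE_COST_PER_CASE * (1.0 - gold_fraction)
    )


if __name__ == "__main__":
    avg = blended_cost_per_case(0.40)
    print(f"${avg:.3f} per case, ${avg * 1000:.0f} per 1,000 cases")
```

With a 40% gold tier this comes to $0.256 per case, which the table rounds to $0.26 per case and about $260 per 1,000 cases.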

## Consequences (positive)

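- We can ship 1,000 labeled cases in roughly 21 labeler-days at the Context throughput figures (600 tier-A cases at ~80/day ≈ 7.5 days, plus 400 tier-B cases at ~30/day ≈ 13.3 days), against ~33 for dual-labeling everything, with measurable α on the 400 tier-B cases.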
- Tier-B doubles as the multi-judge calibration set in M03: judges that fall outside α-based agreement bounds are down-weighted.
- Tier-A's spot-check signal is enough to catch labeler drift week-over-week without burning the budget.

## Consequences (negative)

- Tier-A accuracy claims carry a ~5% label-noise floor, so accuracy differences smaller than that cannot be separated from labeling error. We disclose this explicitly in the eval report.
- The adjudication queue can back up: if two adjudicators are out, tier-B throughput drops to zero. Mitigation: keep a third labeler in rotation as a backup adjudicator.
- The 10% spot-check is a sample, not a guarantee. We accept the residual risk and promote tier-A cases to tier-B when a regression-blocking eval depends on them.

## Reversal plan

If the team grows past 8 labelers, revisit this decision and move release-gating tier-A cases to dual annotation. Estimated effort: ~2 engineer-weeks (routing + adjudication queue scaling). Triggers for an earlier revisit:

1. Tier-A spot-check disagreement rate > 12% for 2 consecutive weeks.
2. A regression-detection bug traced back to tier-A label noise.
3. SLA changes that require a measured agreement score on every gating case.
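
The first trigger is mechanical enough to automate. A minimal sketch, assuming weekly spot-check disagreement rates are available as a chronological list of fractions; the function name and input format are hypothetical.

```python
from typing import Sequence


def drift_trigger_fired(
    weekly_disagreement_rates: Sequence[float],
    threshold: float = 0.12,
    consecutive_weeks: int = 2,
) -> bool:
    """True if the rate exceeded `threshold` for `consecutive_weeks` weeks in a row."""
    streak = 0
    for rate in weekly_disagreement_rates:
        # Strictly greater than the threshold, matching "> 12%" in trigger 1.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive_weeks:
            return True
    return False


if __name__ == "__main__":
    print(drift_trigger_fired([0.05, 0.13, 0.14]))  # two bad weeks in a row
    print(drift_trigger_fired([0.13, 0.05, 0.13]))  # bad weeks are not consecutive
```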

## References

- `llm_eval/labeling/protocol.py` — routing rules
- `llm_eval/labeling/agreement.py` — Krippendorff α calculation
- `gold/*.jsonl` — tier-B labeled cases shipped with the starter kit
- `data/sample_test_cases.json` — tier-A samples
- ADR-004 (multi-judge calibration depends on this gold set)
