Context
The eval framework gates RAG releases on retrieval quality. The single biggest cause of "the LLM hallucinated" turns out to be the retriever returning the wrong chunk — so the metric we put on the CI gate matters more than the metric we put on a slide.
Three families of retrieval metrics were on the table:
- recall@k: "did the relevant chunk appear in the top k?" Binary, easy to explain, threshold-able.
- MRR (mean reciprocal rank): average 1/rank-of-first-relevant-result. Cares where the relevant chunk landed in the top-k.
- nDCG@k (normalized discounted cumulative gain): graded relevance with logarithmic position discount. Industry standard for IR papers.
Constraints:
- Stakeholders need to read it. The PR comment bot in M04 has to say something like "retrieval recall@5 dropped from 87% to 81%." Reviewers who aren't ML engineers must understand the number.
- CI must threshold-gate it. A pass/fail decision is required, not a vibe.
- Gold-set budget. Per ADR-001, we have ~1k tier-A + ~400 tier-B labeled cases. Graded relevance (3+ levels) for nDCG would roughly double labeling cost on tier-B.
Decision
Adopt recall@5 as the gating metric and MRR as the diagnostic metric. Reject nDCG for v1.
# llm_eval/metrics/retrieval.py
class RetrievalMetrics:
"""Pair of metrics: recall@5 gates CI, MRR debugs failures."""
@staticmethod
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
top_k = set(retrieved_ids[:k])
return len(top_k & gold_ids) / len(gold_ids) if gold_ids else 0.0
@staticmethod
def mrr(retrieved_ids: list[str], gold_ids: set[str]) -> float:
for rank, doc_id in enumerate(retrieved_ids, start=1):
if doc_id in gold_ids:
return 1.0 / rank
return 0.0
Quality gate (M04 check_quality_gates.py):
# config/ship-criteria.yaml
gates:
- name: retrieval_recall_at_5
threshold: 0.85 # absolute floor
regression_max: 0.03 # may not drop more than 3pp from baseline
severity: error # blocks merge
- name: retrieval_mrr
threshold: 0.62
regression_max: 0.05
severity: warning # PR comment, no block
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Gating signal | nDCG@10 | recall@5 — readable in PR comments, binary labels match our gold set |
| Position sensitivity | nDCG penalises bad ranking | MRR carries that signal as a diagnostic; it's not on the gate |
| Graded relevance | nDCG with 4-level relevance | Binary relevance — fits ADR-001's labeling budget |
| Threshold floor | Per-task-type thresholds | Single global threshold for v1; per-task split deferred |
Consequences (positive)
- The PR comment bot (M04) says one number reviewers read at a glance.
- Gold-set labeling stays binary — no nDCG-driven cost doubling on tier-B.
- recall@k generalises trivially when k changes; we can ship recall@1/5/10 in the dashboard without re-labeling.
- MRR still catches "we put the right doc at rank 7 not rank 1" failures — visible in the dashboard, not on the gate.
Consequences (negative)
- We're blind to fine-grained relevance differences (e.g. "highly relevant" vs "tangentially relevant"). The 5% of cases this matters for don't ship through the gate cleanly.
- Stakeholders reading IR literature will ask "where's nDCG?". The answer is "in the dashboard later". We accept the social cost.
- Single global threshold can mask per-task regressions. Mitigation: tag-level breakdowns in the PR comment.
Reversal plan
Add nDCG@10 to the metric registry whenever any of these triggers fire:
- A graded-relevance use case lands (e.g. legal RAG where "exactly relevant" vs "topically relevant" gates differently).
- recall@5 regressions stop correlating with user-visible quality drops.
- The gold set crosses 5,000 cases (the labeling-cost argument weakens at scale).
Implementation effort: ~3 days. The metric registry is a Protocol-based plugin (Part 1 design); adding nDCG is one new class + a config flag. Gold-set re-labeling for graded relevance is the bigger cost, ~2 labeler-weeks for the existing ~1.4k cases.
References
llm_eval/metrics/retrieval.py— implementationconfig/ship-criteria.yaml— gate configscripts/check_quality_gates.py— Part 4 gate runner- ADR-001 (gold-set tier protocol — drives our labeling-cost constraint)
- IR text: Manning, Raghavan, Schütze — Introduction to Information Retrieval, ch. 8