ADR-002: recall@k + MRR for retrieval gating; nDCG rejected for v1 | LLM Evaluation Framework

Context

The eval framework gates RAG releases on retrieval quality. The single biggest cause of "the LLM hallucinated" turns out to be the retriever returning the wrong chunk — so the metric we put on the CI gate matters more than the metric we put on a slide.

Three families of retrieval metrics were on the table:

recall@k: "what fraction of the relevant chunks appear in the top k?" Binary per query when there is a single gold chunk; easy to explain, threshold-able.
MRR (mean reciprocal rank): the average over queries of 1/rank of the first relevant result. Sensitive to where the first relevant chunk lands in the ranking.
nDCG@k (normalized discounted cumulative gain): graded relevance with logarithmic position discount. Industry standard for IR papers.
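For reference, the three candidates can be written out under the usual textbook conventions. The notation is introduced here and is not taken from the codebase: $Q$ is the query set, $G_q$ the gold chunk IDs for query $q$, $R_q^{(k)}$ the top-$k$ retrieved IDs, $\mathrm{rank}_q$ the position of the first relevant result (with $1/\mathrm{rank}_q = 0$ if none is retrieved), and $\mathrm{rel}_i$ the graded relevance at rank $i$. nDCG is shown with linear gain; an exponential-gain variant ($2^{\mathrm{rel}_i} - 1$) is also common.

```latex
\begin{align*}
\mathrm{recall@}k(q) &= \frac{\lvert R_q^{(k)} \cap G_q \rvert}{\lvert G_q \rvert} \\
\mathrm{MRR} &= \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q} \\
\mathrm{DCG@}k &= \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)},
\qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
\end{align*}
```

Here $\mathrm{IDCG@}k$ is the DCG of the ideal ordering (results sorted by descending relevance). Only nDCG needs $\mathrm{rel}_i$ to take more than two values, which is where the extra labeling cost comes from.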

Constraints:

Stakeholders need to read it. The PR comment bot in M04 has to say something like "retrieval recall@5 dropped from 87% to 81%." Reviewers who aren't ML engineers must understand the number.
CI must threshold-gate it. A pass/fail decision is required, not a vibe.
Gold-set budget. Per ADR-001, we have ~1k tier-A + ~400 tier-B labeled cases. Graded relevance (3+ levels) for nDCG would roughly double labeling cost on tier-B.

Decision

Adopt recall@5 as the gating metric and MRR as the diagnostic metric. Reject nDCG for v1.

# llm_eval/metrics/retrieval.py
class RetrievalMetrics:
    """Pair of metrics: recall@5 gates CI, MRR debugs failures."""

    @staticmethod
    def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
        """Fraction of gold chunks found in the top k; 0.0 when there are no gold chunks."""
        top_k = set(retrieved_ids[:k])
        return len(top_k & gold_ids) / len(gold_ids) if gold_ids else 0.0

    @staticmethod
    def mrr(retrieved_ids: list[str], gold_ids: set[str]) -> float:
        """Reciprocal rank for one query; average over queries to get MRR."""
        for rank, doc_id in enumerate(retrieved_ids, start=1):  # ranks are 1-based
            if doc_id in gold_ids:
                return 1.0 / rank
        return 0.0  # no gold chunk was retrieved
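Both methods score a single query, so the numbers that reach the gate are presumably means over the gold set. A minimal sketch of that aggregation follows; `mean_over_queries` and the stand-in metric `hit_at_1` are hypothetical names used for illustration, not part of `llm_eval`.

```python
from statistics import mean
from typing import Callable, Iterable

# A per-query metric takes (retrieved_ids, gold_ids) and returns a float.
QueryMetric = Callable[[list[str], set[str]], float]


def mean_over_queries(
    metric: QueryMetric,
    cases: Iterable[tuple[list[str], set[str]]],
) -> float:
    """Average a per-query metric over (retrieved_ids, gold_ids) pairs."""
    scores = [metric(retrieved, gold) for retrieved, gold in cases]
    return mean(scores) if scores else 0.0  # empty gold set scores 0.0 (assumption)


def hit_at_1(retrieved: list[str], gold: set[str]) -> float:
    """Stand-in per-query metric: 1.0 if the top result is a gold chunk."""
    return 1.0 if retrieved and retrieved[0] in gold else 0.0


cases = [
    (["a", "b"], {"a"}),  # gold chunk at rank 1
    (["c", "d"], {"d"}),  # gold chunk at rank 2
]
print(mean_over_queries(hit_at_1, cases))
```

In the framework itself the same helper would be called with `RetrievalMetrics.mrr`, or with `RetrievalMetrics.recall_at_k` for the gated number.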

Quality gate (M04 check_quality_gates.py):

# config/ship-criteria.yaml
gates:
  - name: retrieval_recall_at_5
    threshold: 0.85 # absolute floor
    regression_max: 0.03 # may not drop more than 3pp from baseline
    severity: error # blocks merge
  - name: retrieval_mrr
    threshold: 0.62
    regression_max: 0.05
    severity: warning # PR comment, no block
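The ADR does not show how the gate runner combines the two numeric fields. One plausible reading, sketched here as an assumption, is that the absolute floor and the regression limit are checked independently and only `severity: error` blocks the merge. `Gate`, `check_gate`, and `blocks_merge` are hypothetical names, not the contents of check_quality_gates.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    threshold: float       # absolute floor
    regression_max: float  # largest allowed drop from baseline
    severity: str          # "error" blocks merge, "warning" only comments


def check_gate(gate: Gate, current: float, baseline: float) -> list[str]:
    """Return human-readable violations; an empty list means the gate passed."""
    violations = []
    if current < gate.threshold:
        violations.append(
            f"{gate.name}: {current:.2f} is below the floor {gate.threshold:.2f}"
        )
    drop = baseline - current
    # Small tolerance so a drop of exactly regression_max is not flagged
    # because of floating-point rounding.
    if drop > gate.regression_max + 1e-9:
        violations.append(
            f"{gate.name}: dropped {drop:.2f} from baseline {baseline:.2f} "
            f"(max {gate.regression_max:.2f})"
        )
    return violations


def blocks_merge(gate: Gate, violations: list[str]) -> bool:
    """Only error-severity gates with at least one violation block the merge."""
    return bool(violations) and gate.severity == "error"


recall_gate = Gate("retrieval_recall_at_5", 0.85, 0.03, "error")
problems = check_gate(recall_gate, current=0.81, baseline=0.87)
print(problems, blocks_merge(recall_gate, problems))
```

With the example from the Context section (recall@5 falling from 87% to 81%), both checks fail under this reading: the value is under the 0.85 floor and the 6pp drop exceeds the 3pp limit.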

Tradeoffs we accept

Lever	Alternative	Chosen
Gating signal	nDCG@10	recall@5 — readable in PR comments, binary labels match our gold set
Position sensitivity	nDCG penalises bad ranking	MRR carries that signal as a diagnostic; it's not on the gate
Graded relevance	nDCG with 4-level relevance	Binary relevance — fits ADR-001's labeling budget
Threshold floor	Per-task-type thresholds	Single global threshold for v1; per-task split deferred

Consequences (positive)

The PR comment bot (M04) says one number reviewers read at a glance.
Gold-set labeling stays binary — no nDCG-driven cost doubling on tier-B.
recall@k generalises trivially when k changes; we can ship recall@1/5/10 in the dashboard without re-labeling.
MRR still catches ranking failures that recall@5 cannot see, such as "we put the right doc at rank 4 not rank 1"; these are visible in the dashboard, not on the gate.
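The claim that recall@k needs no re-labeling when k changes can be illustrated with a short sketch: the same ranked list and the same binary gold set yield every cutoff. `recall_curve` is a hypothetical helper, and the chunk IDs are made up.

```python
def recall_curve(
    retrieved_ids: list[str],
    gold_ids: set[str],
    ks: tuple[int, ...] = (1, 5, 10),
) -> dict[int, float]:
    """recall@k for several cutoffs from one ranked list and one binary gold set."""
    if not gold_ids:
        return {k: 0.0 for k in ks}
    return {
        k: len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)
        for k in ks
    }


# The only gold chunk ("c8") sits at rank 7.
retrieved = ["c3", "c9", "c1", "c7", "c4", "c2", "c8"]
print(recall_curve(retrieved, {"c8"}))  # {1: 0.0, 5: 0.0, 10: 1.0}
```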

Consequences (negative)

We're blind to fine-grained relevance differences (e.g. "highly relevant" vs "tangentially relevant"). In the roughly 5% of cases where this matters, a binary gate cannot tell a strong hit from a weak one.
Stakeholders reading IR literature will ask "where's nDCG?". The answer is "in the dashboard later". We accept the social cost.
Single global threshold can mask per-task regressions. Mitigation: tag-level breakdowns in the PR comment.
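A small sketch of the tag-level mitigation, assuming each gold case carries a single tag and a per-query recall@5 score. `recall_by_tag` and the tag names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def recall_by_tag(results: list[tuple[str, float]]) -> dict[str, float]:
    """Mean per-query recall@5 grouped by tag, from (tag, score) pairs."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for tag, score in results:
        buckets[tag].append(score)
    return {tag: mean(scores) for tag, scores in sorted(buckets.items())}


results = [
    ("billing", 1.0), ("billing", 1.0), ("billing", 1.0),
    ("legal", 0.0), ("legal", 1.0),
]
overall = mean(score for _, score in results)
print(overall, recall_by_tag(results))
```

In this made-up sample the overall mean is 0.8 while the "legal" tag sits at 0.5, which is the kind of per-task regression a single global number hides.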

Reversal plan

Add nDCG@10 to the metric registry when any of these triggers fires:

A graded-relevance use case lands (e.g. legal RAG where "exactly relevant" vs "topically relevant" gates differently).
recall@5 regressions stop correlating with user-visible quality drops.
The gold set crosses 5,000 cases (the labeling-cost argument weakens at scale).

Implementation effort: ~3 days. The metric registry is a Protocol-based plugin (Part 1 design); adding nDCG is one new class + a config flag. Gold-set re-labeling for graded relevance is the bigger cost, ~2 labeler-weeks for the existing ~1.4k cases.
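To make the "one new class" estimate concrete, here is a sketch of what an nDCG plugin could look like. The actual registry interface is not shown in this ADR, so the `RetrievalMetric` Protocol, the `score` signature, `REGISTRY`, and `register` are all assumptions; the nDCG variant uses linear gain with a log2 discount and assumes retrieved IDs are unique.

```python
import math
from typing import Protocol


class RetrievalMetric(Protocol):
    """Hypothetical plugin interface; the real Protocol may differ."""

    name: str

    def score(self, retrieved_ids: list[str], relevance: dict[str, int]) -> float: ...


class NDCGAtK:
    """nDCG@k with linear gain and log2 position discount."""

    def __init__(self, k: int = 10) -> None:
        self.k = k
        self.name = f"retrieval_ndcg_at_{k}"

    @staticmethod
    def _dcg(gains: list[int]) -> float:
        # Rank i (1-based) is discounted by log2(i + 1).
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    def score(self, retrieved_ids: list[str], relevance: dict[str, int]) -> float:
        # Unlabeled chunks count as relevance 0.
        gains = [relevance.get(doc_id, 0) for doc_id in retrieved_ids[: self.k]]
        ideal = sorted(relevance.values(), reverse=True)[: self.k]
        idcg = self._dcg(ideal)
        return self._dcg(gains) / idcg if idcg > 0 else 0.0


REGISTRY: dict[str, RetrievalMetric] = {}


def register(metric: RetrievalMetric) -> None:
    """Add a metric to the registry under its own name."""
    REGISTRY[metric.name] = metric


register(NDCGAtK(k=10))
grades = {"a": 3, "b": 1}  # graded labels: 3 = highly relevant, 1 = tangential
print(REGISTRY["retrieval_ndcg_at_10"].score(["b", "a", "c"], grades))
```

The `relevance` mapping is the part that needs graded labels, which is why the re-labeling effort dominates the code change.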

References

llm_eval/metrics/retrieval.py — implementation
config/ship-criteria.yaml — gate config
scripts/check_quality_gates.py — Part 4 gate runner
ADR-001 (gold-set tier protocol — drives our labeling-cost constraint)
IR text: Manning, Raghavan, Schütze — Introduction to Information Retrieval, ch. 8