ADR-003: Reranker is a CPU cross-encoder, not LLM-as-judge | AI Retrieval Platform

Context

The hybrid retriever (ADR-001 + ADR-002) returns ~50 candidate documents per query. The reranker's job is to score each candidate's relevance to the query and surface a top-K (typically 5–10) for the agent or the downstream LLM. It is the last and most expensive step on the read path. The classic options:

Cross-encoder (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2): a small transformer that takes (query, doc) pairs and emits a relevance score per pair. Runs on CPU in batches.
LLM-as-judge: call GPT-4 / Claude with the 50 candidates and a prompt asking it to rank them. Strong quality, but slow and expensive.
Cohere Rerank API: managed reranker service. No local infra, pay per search.
Skip reranking: return the RRF-fused top-10 directly.

Decision

Adopt cross-encoder ms-marco-MiniLM-L-6-v2 (CPU-only, batch inference).

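# api/reranker.py
from sentence_transformers import CrossEncoder

# Doc is the project's document type; it must expose a `content: str` attribute.
# max_length=512 makes the tokenizer truncate each (query, doc) pair to 512 tokens.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(query: str, docs: list[Doc], top_n: int = 10) -> list[Doc]:
    # Pass full content and let max_length handle truncation in tokens, not characters.
    pairs = [(query, doc.content) for doc in docs]
    # One relevance score per (query, doc) pair, computed in CPU batches of 32.
    scores = model.predict(pairs, batch_size=32, show_progress_bar=False)
    # Highest score first; keep only the top_n documents.
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]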

Tradeoffs we accept

Lever	Cross-encoder (chosen)	LLM-as-judge	Cohere Rerank	Skip rerank
Recall@10 (Module 02 measured)	~0.81 (hybrid + rerank)	~0.85 (estimated)	~0.82	~0.75 (RRF only)
Latency for top-50 → top-10	<50 ms (CPU batch=32)	1.5–4 s	100–250 ms	0 ms
Cost per query	~$0 (model is free, CPU compute amortized)	$0.01–$0.03 (GPT-4)	$0.001	$0
Cost at 1M queries / mo	$0 marginal	$10k–$30k	$1k	$0
Operational footprint	One Python dep	Vendor account + retry logic	Vendor account	None
Tutorial reproducibility	`pip install` + first-run model download	Cloud account	Cloud account	None
Quality on adversarial queries	Good (MS MARCO trained)	Best	Good	Poor
Long-document handling	Truncates at 512 tokens	Native	Truncates	N/A
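
For reference, the Recall@10 row can be read against a definition like the one below. This is a minimal sketch, assuming recall@k means the fraction of a query's labeled relevant documents that appear in the top k results. The function name and the ID-based inputs are illustrative, and Module 02 or scripts/eval.py may compute the metric differently.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of relevant documents that appear in the top-k ranked results."""
    if not relevant_ids:
        # Assumption: a query with no labeled relevant documents scores 0.
        return 0.0
    # Use a set so a duplicated ID in the ranking is not counted twice.
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)
```

Averaging this value over all evaluation queries gives a single number comparable to the figures in the table.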

We optimize for cost-quality at production volume. The recall@10 gap between the cross-encoder and LLM-as-judge is real, about 4 points (~0.81 vs. an estimated ~0.85), but the cost gap is far larger. At 1M queries/month, an LLM reranker adds $10k–$30k/mo (~$20k at the midpoint) versus ~$0 marginal for a CPU cross-encoder; that buys a lot of hyperparameter sweeps to close the recall gap.
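
The monthly figures follow directly from the per-query costs in the tradeoff table. The sketch below just redoes that multiplication; the dictionary keys and the function name are illustrative, and the cross-encoder's amortized CPU cost is treated as $0 marginal, as in the table.

```python
QUERIES_PER_MONTH = 1_000_000

# Per-query cost in USD as (low, high), copied from the tradeoff table.
COST_PER_QUERY = {
    "cross_encoder": (0.0, 0.0),      # marginal cost only; CPU compute is amortized
    "llm_as_judge": (0.01, 0.03),
    "cohere_rerank": (0.001, 0.001),
    "skip_rerank": (0.0, 0.0),
}

def monthly_cost(option: str, queries: int = QUERIES_PER_MONTH) -> tuple[float, float]:
    """Return the (low, high) monthly cost in USD for one reranking option."""
    low, high = COST_PER_QUERY[option]
    return (low * queries, high * queries)

for option in COST_PER_QUERY:
    low, high = monthly_cost(option)
    print(f"{option}: ${low:,.0f} to ${high:,.0f} per month")
```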

Cohere Rerank is the right answer for teams that don't want to manage a Python ML serving dependency. It is documented as the swap path in the Reversal plan.

Consequences (positive)

The reranker fits in <50 ms for 50 candidates on a single t4g.medium CPU. No GPU. No GPU autoscaling. No GPU bills.
The model is checkpoint-pinned in requirements.txt — Module 02's recall benchmark is reproducible across machines.
Adding a custom-trained reranker later is a model-file swap, not an architecture change. The interface stays (query, docs) → ranked.
The cross-encoder is itself the eval signal — Module 02 uses cross-encoder/ms-marco-MiniLM-L-12-v2 (the larger sibling) as a quality oracle for cross-checking the production model.
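
One way to picture the stable (query, docs) → ranked interface is a small protocol with the scorer injected. This is a sketch under assumptions, not the code in api/reranker.py: the `Doc` dataclass, `Reranker`, `ScoreFnReranker`, and the toy `word_overlap` scorer are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence

@dataclass
class Doc:
    # Stand-in for the project's document type; only `content` is assumed to exist.
    doc_id: str
    content: str

class Reranker(Protocol):
    def rerank(self, query: str, docs: list[Doc], top_n: int = 10) -> list[Doc]: ...

# Hypothetical scorer type: maps (query, text) pairs to one score per pair.
ScoreFn = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

class ScoreFnReranker:
    """Ranks documents with whatever scoring function it is given."""

    def __init__(self, score_fn: ScoreFn) -> None:
        self.score_fn = score_fn

    def rerank(self, query: str, docs: list[Doc], top_n: int = 10) -> list[Doc]:
        scores = self.score_fn([(query, d.content) for d in docs])
        ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]

def word_overlap(pairs: Sequence[tuple[str, str]]) -> list[float]:
    # Toy scorer for demonstration only: number of shared lowercase words.
    return [float(len(set(q.lower().split()) & set(t.lower().split()))) for q, t in pairs]

docs = [
    Doc("a", "standing desk with motor"),
    Doc("b", "ergonomic office chair"),
    Doc("c", "ergonomic chair for office workers"),
]
top = ScoreFnReranker(word_overlap).rerank("ergonomic chair", docs, top_n=2)
print([d.doc_id for d in top])
```

With this shape, swapping in a fine-tuned model would mean passing a different scoring function, for example one that wraps `model.predict`, while callers keep using `rerank(query, docs, top_n)`.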

Consequences (negative)

~4-point recall@10 gap vs LLM-as-judge (~0.81 vs. an estimated ~0.85). A team with the budget for $20k/mo of reranker compute would see better quality. Mitigation: fine-tune the cross-encoder to close the gap; data/labeled_pairs.csv ships 2k labeled triples for exactly this purpose.
512-token truncation. Long documents are reranked on their first ~250 words. Mitigation: chunk-then-merge at the document level (out of scope for v1; design noted in DESIGN.md).
CPU latency scales linearly with the number of candidates. Reranking 100 candidates takes ~100 ms; at 200 candidates the latency budget breaks. Mitigation: cap the candidate list at 50 in api/main.py.
No instruction-tuned quality. A "rank these by relevance to a user who is buying ergonomic furniture" intent-aware rerank is out of reach without LLM-as-judge or fine-tuning.
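
The chunk-then-merge mitigation for long documents could look roughly like the sketch below. It is not the design in DESIGN.md. It assumes word-based windows as a rough proxy for token windows, and it merges by taking each document's best chunk score (max-pooling), which is one common choice among several. The scoring function is injected, and all names here are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: maps (query, text) pairs to one score per pair.
ScoreFn = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return [""]  # keep one empty chunk so the document still gets a score
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def score_long_docs(query: str, texts: list[str], score_fn: ScoreFn,
                    **chunk_kwargs: int) -> list[float]:
    """Score every chunk, then merge per document by keeping the best chunk score."""
    pairs: list[tuple[str, str]] = []
    owners: list[int] = []
    for i, text in enumerate(texts):
        for chunk in chunk_words(text, **chunk_kwargs):
            pairs.append((query, chunk))
            owners.append(i)
    chunk_scores = score_fn(pairs)
    merged = [float("-inf")] * len(texts)
    for owner, score in zip(owners, chunk_scores):
        merged[owner] = max(merged[owner], float(score))
    return merged
```

Note that chunking multiplies the number of pairs scored per query, so it trades directly against the latency budget described above.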

Reversal plan

The reranker interface is rerank(query, docs, top_n). Replacement is bounded:

Cohere Rerank swap: replace model.predict(pairs) with a Cohere rerank call (query=query, documents=[d.content for d in docs]). ~30 lines including retry logic.
LLM-as-judge swap: replace with an Anthropic Claude call that returns a JSON array of {doc_id, score}. Add response validation; track cost via the Module 04 cost panel.
Fine-tune the cross-encoder: train against data/labeled_pairs.csv (2k labeled triples ship in the starter kit); swap the model name in api/reranker.py.

Estimated effort: 2–5 engineer-days for any one of the three. Reversible.
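
For the LLM-as-judge path, the response validation step might be sketched as follows. It assumes the judge is prompted to return a JSON array of {doc_id, score} objects covering every candidate. The function name and the strictness rules (reject unknown, duplicate, or missing IDs) are illustrative choices, not a spec from this ADR.

```python
import json

def parse_judge_response(raw: str, expected_ids: set[str]) -> dict[str, float]:
    """Validate a JSON array of {doc_id, score} objects and return doc_id -> score."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge response is not valid JSON: {exc}") from exc
    if not isinstance(items, list):
        raise ValueError("judge response must be a JSON array")

    scores: dict[str, float] = {}
    for item in items:
        if not isinstance(item, dict) or "doc_id" not in item or "score" not in item:
            raise ValueError(f"malformed entry: {item!r}")
        doc_id, score = item["doc_id"], item["score"]
        if not isinstance(doc_id, str) or doc_id not in expected_ids:
            raise ValueError(f"unknown doc_id: {doc_id!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"non-numeric score for {doc_id!r}")
        if doc_id in scores:
            raise ValueError(f"duplicate doc_id: {doc_id!r}")
        scores[doc_id] = float(score)

    missing = expected_ids - set(scores)
    if missing:
        raise ValueError(f"judge omitted doc_ids: {sorted(missing)}")
    return scores
```

A caller could catch `ValueError` and either retry the LLM call or fall back to the RRF order, depending on the latency budget.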

References

api/reranker.py (the cross-encoder runner)
api/main.py (/search/hybrid/reranked endpoint)
data/labeled_pairs.csv (2k triples for fine-tuning)
scripts/eval.py (recall@10 + MRR + nDCG validation)
ADR-002 (RRF — feeds the candidate list the reranker scores)
Cost-model CSV (cross-encoder compute is the largest CPU lever; budget impact)