ADR-002: Cross-encoder reranking on top-K (precision lever vs latency cost) | Enterprise RAG

Context

The hybrid retriever (ADR-001) returns the fused top-50 chunks. The LLM's context window can hold maybe top-10 of those without saturating cost or diluting attention. The question is: which 10?

Bi-encoder retrievers (Pinecone HNSW + BM25) score every chunk independently and rank by score. They are fast (sub-100 ms over millions of chunks) but relatively imprecise — they have never compared the query and the chunk in the same forward pass. Cross-encoders do exactly that: feed (query, chunk) into a transformer that outputs a single relevance score conditioned on both. Higher precision, much higher cost (per-pair inference).

Three options:

Skip reranking. Pass the fused top-10 from ADR-001 directly to the LLM. Simplest, lowest latency. On the seed bench, ~74% hit-rate@10.
LLM-as-reranker. Send candidates to a small LLM with a "rate this chunk" prompt. Effective but per-call cost (~$0.01) on every query crushes the cost model.
Cross-encoder model. Use a small encoder (ms-marco-MiniLM-L-6-v2 class) to re-score top-50 → top-10 in a single batched forward pass. ~150-200 ms p95 on CPU, ~30-50 ms on GPU.

Option 3 is the production-grade lever — used by every reranking-aware RAG system in the wild (Cohere, sentence-transformers cross-encoders, etc.).

Decision

We adopt a cross-encoder reranker that takes the hybrid retriever's top-50 and emits the top-10 the LLM actually sees.

# backend/app/services/rag_service.py
async def retrieve(query: str, top_k_final: int = 10) -> list[Chunk]:
    candidates = await hybrid_retrieve(query, top_k=50)         # ADR-001

    pairs = [(query, c.content) for c in candidates]
    scores = reranker.predict(pairs)                            # CrossEncoder

    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k_final]]

Model: cross-encoder/ms-marco-MiniLM-L-6-v2 (Sentence-Transformers default). 22M parameters, runs on CPU with ~150 ms p95 over 50 pairs; swappable to a larger cross-encoder via RERANK_MODEL env var when GPU inference is available.

The reranker is always-on by default but disable-able via RETRIEVAL_RERANKER_ENABLED=false for low-latency endpoints (e.g. an auto-complete suggest API).

Tradeoffs we accept

Lever	Alternative	Chosen
Precision	Bi-encoder only	Cross-encoder rerank stage
Latency	~80 ms p95 retrieval	~250 ms p95 (retrieval + rerank)
Cost	$0/query	CPU compute amortised; ~$0.0002/query equiv.
Tunability	Top-K knob only	Top-K + reranker model + score floor
Maintenance	One model surface	Two: embedder + reranker

The largest concrete cost is the ~150-200 ms latency hit. We accept it because the precision lift is large enough to change downstream prompt quality (and therefore LLM cost — better context = shorter, more grounded prompts).

Consequences (positive)

Hit-rate@10 lift demonstrable. On the chunking-A/B bench, hybrid retrieval + cross-encoder reranking gets to ~88% hit-rate@10 vs ~74% for hybrid alone. The 14pp delta directly shows up as faithfulness lift in RAGAS evaluation (faithfulness > 0.85 floor in sla_definition.py).
Tunable per endpoint. /chat runs full reranking; /suggest runs bi-encoder-only with a cached top-3. The endpoint chooses the latency budget.
Independent of vendor. Cross-encoder runs locally; not dependent on Cohere or any reranker SaaS. ADR-004's vendor-fallback story stays clean.
Score floor support. A configurable minimum score (e.g. 0.05) lets the reranker say "none of these are good enough" — the API can return a "I don't have enough context to answer" response instead of hallucinating against weak chunks.

Consequences (negative)

CPU cost grows linearly with traffic. At 10k qpd and ~50 pairs scored per call, that's 500k forward passes/day. We pin the reranker to a dedicated CPU pool so it doesn't compete with the API workers.
Two failure modes to handle. Reranker model load failure or timeout must fall back to bi-encoder ranking gracefully. The RAGRetrievalError handler in app/explainability.py swaps to bi-encoder + emits a Prometheus counter increment.
Score floor is workload-dependent. Calibrated against the seed corpus; new domains will need re-calibration. Documented in the runbook ("recalibrate when faithfulness drops > 5pp on the canary set").

Reversal plan

If the cross-encoder's marginal precision lift drops below ~5pp on the canary RAGAS set (e.g. embedding models improve enough that bi-encoder ranking is sufficient), reversal is:

Set RETRIEVAL_RERANKER_ENABLED=false globally.
rag_service.retrieve returns hybrid top-K directly.
Decommission the dedicated reranker CPU pool.
Re-run RAGAS canary; verify faithfulness floor still holds.

Estimated effort: ~2 engineer-days. The reversal is per-endpoint capable, so this can roll out gradually.

References

backend/app/services/rag_service.py — orchestrator wiring rerank stage
prompts.py — context-block assembly that consumes reranked output
app/explainability.py — bi-encoder fallback + Prometheus counters
app/metrics.py — rag_rerank_latency, rag_rerank_hits
eval/approach_decision.md — cost-precision tradeoff doc
ADR-001 (the hybrid retriever this reranker sits on top of)
ADR-003 (the chunking that produces the candidates being reranked)