Context
The hybrid retriever (ADR-001) returns the fused top-50 chunks. The LLM's context window can take roughly the top 10 of those before cost climbs and attention is diluted. The question is: which 10?
The first-stage retrievers (Pinecone HNSW over bi-encoder embeddings, plus
BM25) score every chunk independently and rank by score. They are fast
(sub-100 ms over millions of chunks) but relatively imprecise, because the
query and the chunk are never compared in the same forward pass.
Cross-encoders do exactly that: they feed (query, chunk) into a transformer
that outputs a single relevance score conditioned on both. The result is
higher precision at much higher cost (per-pair inference).
Three options:
- Skip reranking. Pass the fused top-10 from ADR-001 directly to the LLM. Simplest, lowest latency. On the seed bench, ~74% hit-rate@10.
- LLM-as-reranker. Send candidates to a small LLM with a "rate this chunk" prompt. Effective but per-call cost (~$0.01) on every query crushes the cost model.
- Cross-encoder model. Use a small encoder (ms-marco-MiniLM-L-6-v2 class) to re-score top-50 → top-10 in a single batched forward pass. ~150-200 ms p95 on CPU, ~30-50 ms on GPU.
Option 3, the cross-encoder, is the production-grade lever: it is the same rerank-after-retrieve pattern offered by Cohere and the Sentence-Transformers cross-encoders, among others.
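For reference, the hit-rate@10 figure quoted for these options can be read as the share of bench queries for which at least one known-relevant chunk lands in the top 10. The sketch below assumes that usual definition and assumes each bench query comes with a set of relevant chunk IDs. The function and argument names are illustrative and are not taken from the bench code.

```python
from typing import Sequence, Set


def hit_rate_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k.

    ranked_ids[i] is the ranked list of chunk IDs returned for query i;
    relevant_ids[i] is the (assumed) ground-truth set for the same query.
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("expected one ranking and one relevance set per query")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(chunk_id in relevant for chunk_id in ranking[:k])
    )
    return hits / len(ranked_ids)
```

Running the same query set through the retriever with and without the rerank stage and comparing the two values gives the kind of percentage-point delta this ADR reasons about.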
Decision
We adopt a cross-encoder reranker that takes the hybrid retriever's top-50 and emits the top-10 the LLM actually sees.
# backend/app/services/rag_service.py
async def retrieve(query: str, top_k_final: int = 10) -> list[Chunk]:
    candidates = await hybrid_retrieve(query, top_k=50)  # ADR-001
    pairs = [(query, c.content) for c in candidates]
    scores = reranker.predict(pairs)  # CrossEncoder
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_k_final]]
Model: cross-encoder/ms-marco-MiniLM-L-6-v2 (Sentence-Transformers
default). 22M parameters, runs on CPU with ~150 ms p95 over 50 pairs;
swappable to a larger cross-encoder via RERANK_MODEL env var when GPU
inference is available.
The reranker is on by default but can be disabled via
RETRIEVAL_RERANKER_ENABLED=false for low-latency endpoints (e.g. an
auto-complete suggest API).
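As an illustration of how the two environment variables named above could be read at startup, here is a minimal sketch. The RerankerSettings class, the load_reranker_settings helper, and the extra "off" spellings accepted besides false are assumptions made for the example; only the variable names and the default model come from this ADR.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"

# Assumption: besides "false", a few common "off" spellings are accepted.
_FALSE_VALUES = {"0", "false", "no", "off"}


@dataclass(frozen=True)
class RerankerSettings:
    enabled: bool
    model_name: str


def load_reranker_settings(env: Mapping[str, str] = os.environ) -> RerankerSettings:
    """Read reranker config; the reranker stays on unless explicitly disabled."""
    raw = env.get("RETRIEVAL_RERANKER_ENABLED", "true").strip().lower()
    return RerankerSettings(
        enabled=raw not in _FALSE_VALUES,
        model_name=env.get("RERANK_MODEL", DEFAULT_RERANK_MODEL),
    )
```

Passing the environment in as a mapping keeps the helper easy to exercise in tests, for example load_reranker_settings({"RETRIEVAL_RERANKER_ENABLED": "false"}).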
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Precision | Bi-encoder only | Cross-encoder rerank stage |
| Latency | ~80 ms p95 retrieval | ~250 ms p95 (retrieval + rerank) |
| Cost | $0/query | CPU compute amortised; ~$0.0002/query equiv. |
| Tunability | Top-K knob only | Top-K + reranker model + score floor |
| Maintenance | One model surface | Two: embedder + reranker |
The largest concrete cost is the ~150-200 ms latency hit. We accept it because the precision lift is large enough to change downstream prompt quality (and therefore LLM cost: better context means shorter, more grounded prompts).
Consequences (positive)
- Hit-rate@10 lift demonstrable. On the chunking-A/B bench, hybrid retrieval + cross-encoder reranking gets to ~88% hit-rate@10 vs ~74% for hybrid alone. The 14pp delta shows up directly as faithfulness lift in RAGAS evaluation (faithfulness > 0.85 floor in sla_definition.py).
- Tunable per endpoint. /chat runs full reranking; /suggest runs bi-encoder-only with a cached top-3. The endpoint chooses the latency budget.
- Independent of vendor. The cross-encoder runs locally; it does not depend on Cohere or any reranker SaaS. ADR-004's vendor-fallback story stays clean.
- Score floor support. A configurable minimum score (e.g. 0.05) lets the reranker say "none of these are good enough", so the API can return an "I don't have enough context to answer" response instead of hallucinating against weak chunks.
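The score floor can be sketched as a filter applied after sorting. This is a minimal illustration, not the code in rag_service.py: the apply_score_floor name is hypothetical, 0.05 is only the example value from the text, and a sensible floor depends on the scale of the scores the chosen reranker emits.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def apply_score_floor(
    candidates: Sequence[T],
    scores: Sequence[float],
    top_k: int = 10,
    min_score: float = 0.05,  # example value from the ADR; needs calibration
) -> List[Tuple[T, float]]:
    """Keep the top_k highest-scoring candidates that clear min_score.

    An empty result means "none of these are good enough"; the caller can
    then return a no-context response instead of prompting the LLM.
    """
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(cand, score) for cand, score in ranked[:top_k] if score >= min_score]
```

Returning the scores alongside the candidates leaves room to log how close a query came to the floor, which may help when recalibrating for a new corpus.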
Consequences (negative)
- CPU cost grows linearly with traffic. At 10k qpd and ~50 pairs scored per call, that's 500k forward passes/day. We pin the reranker to a dedicated CPU pool so it doesn't compete with the API workers.
- Two failure modes to handle. Reranker model load failure or timeout must fall back to bi-encoder ranking gracefully. The RAGRetrievalError handler in app/explainability.py swaps to bi-encoder + emits a Prometheus counter increment.
- Score floor is workload-dependent. Calibrated against the seed corpus; new domains will need re-calibration. Documented in the runbook ("recalibrate when faithfulness drops > 5pp on the canary set").
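One way to picture the graceful fallback described above is the sketch below. It is not the RAGRetrievalError handler in app/explainability.py. The rerank_with_fallback function, the 0.5 s timeout, and the in-process Counter standing in for a Prometheus counter are all assumptions made for illustration. Candidates are plain strings here to keep the example self-contained.

```python
import asyncio
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Stand-in for a Prometheus counter (assumption); keyed by exception type.
fallback_events: Counter = Counter()


async def rerank_with_fallback(
    query: str,
    candidates: Sequence[str],
    predict: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 10,
    timeout_s: float = 0.5,  # hypothetical budget, not a value from the ADR
) -> List[str]:
    """Rerank candidates; on any reranker failure keep the incoming order."""
    pairs = [(query, text) for text in candidates]
    try:
        # predict is blocking CPU work, so run it in a worker thread
        # (asyncio.to_thread needs Python 3.9+). On timeout the thread is
        # not interrupted; we simply stop waiting for its result.
        scores = await asyncio.wait_for(
            asyncio.to_thread(predict, pairs), timeout=timeout_s
        )
    except Exception as exc:
        fallback_events[type(exc).__name__] += 1
        return list(candidates[:top_k])  # first-stage ranking, unchanged
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

Because the candidates arrive already ordered by the hybrid retriever, truncating them to top_k is enough to reproduce the no-rerank behaviour when the reranker is unavailable.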
Reversal plan
If the cross-encoder's marginal precision lift drops below ~5pp on the canary RAGAS set (e.g. embedding models improve enough that bi-encoder ranking is sufficient), reversal is:
- Set RETRIEVAL_RERANKER_ENABLED=false globally. rag_service.retrieve returns hybrid top-K directly.
- Decommission the dedicated reranker CPU pool.
- Re-run RAGAS canary; verify faithfulness floor still holds.
Estimated effort: ~2 engineer-days. The reversal can be applied per endpoint, so it can roll out gradually.
References
- backend/app/services/rag_service.py: orchestrator wiring rerank stage
- prompts.py: context-block assembly that consumes reranked output
- app/explainability.py: bi-encoder fallback + Prometheus counters
- app/metrics.py: rag_rerank_latency, rag_rerank_hits
- eval/approach_decision.md: cost-precision tradeoff doc
- ADR-001 (the hybrid retriever this reranker sits on top of)
- ADR-003 (the chunking that produces the candidates being reranked)