Cross-encoder reranking on top-K (precision lever vs latency cost)

Status: Accepted · Enterprise RAG · 03: Querying, Prompting & Generation
By AI-DE Engineering Team · Stakeholders: retrieval engineer, ML platform lead, product owner

Context

The hybrid retriever (ADR-001) returns the fused top-50 chunks. The LLM's context window can reasonably take about 10 of those before token cost climbs and attention across the context is diluted. The question is: which 10?

First-stage retrieval (dense bi-encoder vectors in Pinecone HNSW, fused with BM25) scores every chunk against the query independently and ranks by score. It is fast (sub-100 ms over millions of chunks) but relatively imprecise: the bi-encoder embeds the query and the chunk separately, so the two are never compared in the same forward pass. Cross-encoders do exactly that: they feed the (query, chunk) pair into a transformer that outputs a single relevance score conditioned on both. The result is higher precision at much higher cost, because inference runs once per pair.

Three options:

  1. Skip reranking. Pass the fused top-10 from ADR-001 directly to the LLM. Simplest, lowest latency. On the seed bench, ~74% hit-rate@10.
  2. LLM-as-reranker. Send candidates to a small LLM with a "rate this chunk" prompt. Effective but per-call cost (~$0.01) on every query crushes the cost model.
  3. Cross-encoder model. Use a small encoder (ms-marco-MiniLM-L-6-v2 class) to re-score top-50 → top-10 in a single batched forward pass. ~150-200 ms p95 on CPU, ~30-50 ms on GPU.

Option 3 is the production-grade lever. It is the standard pattern in reranking-aware RAG systems (Cohere Rerank, Sentence-Transformers cross-encoders, etc.).
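The hit-rate@10 figures quoted for the options can be read as the share of benchmark queries for which at least one known-relevant chunk appears in the top 10 results. The sketch below assumes that definition (this ADR does not spell it out); `hit_rate_at_k` and the sample chunk IDs are hypothetical.

```python
def hit_rate_at_k(
    ranked_ids: list[list[str]],
    relevant_ids: list[set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries whose top-k results contain at least one relevant chunk."""
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("expected one ranked list and one relevant set per query")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant.intersection(ranked[:k])
    )
    return hits / len(ranked_ids)


# Illustrative data only: two queries, one of which has a relevant chunk in its top 2.
ranked = [["c1", "c7", "c3"], ["c9", "c2", "c4"]]
relevant = [{"c7"}, {"c4"}]
print(hit_rate_at_k(ranked, relevant, k=2))  # 0.5
```

Under this reading, a reranker improves the metric only by moving relevant chunks that already sit somewhere in the candidate pool up into the top 10; it cannot recover chunks the first stage never returned.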

Decision

We adopt a cross-encoder reranker that takes the hybrid retriever's top-50 and emits the top-10 the LLM actually sees.

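# backend/app/services/rag_service.py
import asyncio

async def retrieve(query: str, top_k_final: int = 10) -> list[Chunk]:
    candidates = await hybrid_retrieve(query, top_k=50)         # ADR-001

    # Score every (query, chunk) pair in one batched CrossEncoder call.
    # predict() is synchronous, so run it in a worker thread to keep the event loop free.
    pairs = [(query, c.content) for c in candidates]
    scores = await asyncio.to_thread(reranker.predict, pairs)

    # Sort by descending relevance score and keep the top_k_final chunks.
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k_final]]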

Model: cross-encoder/ms-marco-MiniLM-L-6-v2, a commonly used Sentence-Transformers cross-encoder checkpoint. It has about 22M parameters and runs on CPU at ~150-200 ms p95 over 50 pairs. It can be swapped for a larger cross-encoder via the RERANK_MODEL env var when GPU inference is available.

The reranker is on by default and can be disabled with RETRIEVAL_RERANKER_ENABLED=false for low-latency endpoints (e.g. an auto-complete suggest API).
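The two switches could be read at startup along these lines. This is a sketch rather than the project's actual settings code: `RerankSettings`, `load_rerank_settings`, and the accepted truthy spellings are assumptions; only the variable names and the default model come from this ADR.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"


@dataclass(frozen=True)
class RerankSettings:
    enabled: bool
    model_name: str


def load_rerank_settings(env: Optional[Mapping[str, str]] = None) -> RerankSettings:
    """Read the reranker flags from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("RETRIEVAL_RERANKER_ENABLED", "true")
    # Assumption: any value outside these spellings disables the rerank stage.
    enabled = raw.strip().lower() in {"1", "true", "yes", "on"}
    model_name = env.get("RERANK_MODEL", DEFAULT_RERANK_MODEL)
    return RerankSettings(enabled=enabled, model_name=model_name)


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_rerank_settings({"RETRIEVAL_RERANKER_ENABLED": "false"})
print(settings.enabled, settings.model_name)
```

Defaulting to enabled when the variable is absent matches the "on by default" behaviour described above.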

Tradeoffs we accept

| Lever       | Alternative          | Chosen                                       |
|-------------|----------------------|----------------------------------------------|
| Precision   | Bi-encoder only      | Cross-encoder rerank stage                   |
| Latency     | ~80 ms p95 retrieval | ~250 ms p95 (retrieval + rerank)             |
| Cost        | $0/query             | CPU compute amortised; ~$0.0002/query equiv. |
| Tunability  | Top-K knob only      | Top-K + reranker model + score floor         |
| Maintenance | One model surface    | Two: embedder + reranker                     |

The largest concrete cost is the ~150-200 ms latency hit. We accept it because the precision lift is large enough to change downstream prompt quality (and therefore LLM cost — better context = shorter, more grounded prompts).

Consequences (positive)

  • Hit-rate@10 lift demonstrable. On the chunking-A/B bench, hybrid retrieval + cross-encoder reranking gets to ~88% hit-rate@10 vs ~74% for hybrid alone. The 14pp delta directly shows up as faithfulness lift in RAGAS evaluation (faithfulness > 0.85 floor in sla_definition.py).
  • Tunable per endpoint. /chat runs full reranking; /suggest runs bi-encoder-only with a cached top-3. The endpoint chooses the latency budget.
  • Independent of vendor. Cross-encoder runs locally; not dependent on Cohere or any reranker SaaS. ADR-004's vendor-fallback story stays clean.
  • Score floor support. A configurable minimum score (e.g. 0.05) lets the reranker say "none of these are good enough", so the API can return an "I don't have enough context to answer" response instead of hallucinating against weak chunks.
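The score-floor behaviour in the last bullet can be sketched as a filter applied after reranking. The helper names and the abstain wiring are hypothetical, and the 0.05 default is only the example value from the bullet; the floor is meaningful only on the score scale it was calibrated on.

```python
from typing import Sequence

NO_CONTEXT_MESSAGE = "I don't have enough context to answer"


def apply_score_floor(
    ranked: Sequence[tuple[str, float]],
    floor: float = 0.05,
    top_k: int = 10,
) -> list[str]:
    """Keep at most top_k chunks whose rerank score is at or above the floor.

    `ranked` is expected to be sorted by descending score already.
    """
    kept = [chunk for chunk, score in ranked if score >= floor]
    return kept[:top_k]


def answer_or_abstain(ranked: Sequence[tuple[str, float]], floor: float = 0.05) -> str:
    chunks = apply_score_floor(ranked, floor=floor)
    if not chunks:
        # Abstain instead of prompting the LLM with weak context.
        return NO_CONTEXT_MESSAGE
    return f"would prompt the LLM with {len(chunks)} chunk(s)"


print(answer_or_abstain([("chunk-a", 0.91), ("chunk-b", 0.02)]))
print(answer_or_abstain([("chunk-a", 0.03), ("chunk-b", 0.01)]))
```

The first call keeps one chunk; the second keeps none and returns the abstain message.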

Consequences (negative)

  • CPU cost grows linearly with traffic. At 10k queries per day and ~50 pairs scored per call, that's 500k (query, chunk) pair inferences per day, issued as 10k batched calls. We pin the reranker to a dedicated CPU pool so it doesn't compete with the API workers.
  • Two failure modes to handle. A reranker model load failure or a timeout must fall back gracefully to bi-encoder ranking. The RAGRetrievalError handler in app/explainability.py swaps to the bi-encoder order and increments a Prometheus counter.
  • Score floor is workload-dependent. Calibrated against the seed corpus; new domains will need re-calibration. Documented in the runbook ("recalibrate when faithfulness drops > 5pp on the canary set").

Reversal plan

If the cross-encoder's marginal precision lift drops below ~5pp on the canary RAGAS set (e.g. embedding models improve enough that bi-encoder ranking is sufficient), reversal is:

  1. Set RETRIEVAL_RERANKER_ENABLED=false globally.
  2. rag_service.retrieve returns hybrid top-K directly.
  3. Decommission the dedicated reranker CPU pool.
  4. Re-run RAGAS canary; verify faithfulness floor still holds.

Estimated effort: ~2 engineer-days. The flag can be flipped per endpoint, so the reversal can roll out gradually.
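The reversal trigger is a percentage-point comparison between the pipeline with and without the reranker. The sketch below uses the hit-rate@10 figures from this ADR as sample inputs; the function names are hypothetical, and which canary metric is fed in is an assumption left to the runbook.

```python
def marginal_lift_pp(with_reranker: float, without_reranker: float) -> float:
    """Lift in percentage points between two rates given as fractions in [0, 1]."""
    return (with_reranker - without_reranker) * 100.0


def should_consider_reversal(
    with_reranker: float,
    without_reranker: float,
    threshold_pp: float = 5.0,
) -> bool:
    """True when the reranker's marginal lift has dropped below the threshold."""
    return marginal_lift_pp(with_reranker, without_reranker) < threshold_pp


print(round(marginal_lift_pp(0.88, 0.74), 1))   # 14.0
print(should_consider_reversal(0.88, 0.74))     # False
print(should_consider_reversal(0.90, 0.87))     # True
```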

References

  • backend/app/services/rag_service.py — orchestrator wiring rerank stage
  • prompts.py — context-block assembly that consumes reranked output
  • app/explainability.py — bi-encoder fallback + Prometheus counters
  • app/metrics.py (rag_rerank_latency, rag_rerank_hits)
  • eval/approach_decision.md — cost-precision tradeoff doc
  • ADR-001 (the hybrid retriever this reranker sits on top of)
  • ADR-003 (the chunking that produces the candidates being reranked)