Skip to content
Back to AI Retrieval Platform

Reranker is a CPU cross-encoder, not LLM-as-judge

✓ AcceptedAI Retrieval Platform02 — Hybrid Search & Precision Tuning
By AI-DE Engineering Team·Stakeholders: retrieval engineer, infra reviewer, finance

Context

The hybrid retriever (ADR-001 + ADR-002) returns ~50 candidate documents per query. The reranker's job is to score absolute quality across those 50 and surface a top-K (typically 5–10) for the agent or the LLM downstream. The reranker is the last and most expensive step on the read path. The classic options:

  1. Cross-encoder (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2) — a small transformer that takes (query, doc) pairs and emits a relevance score. Runs on CPU in batch.
  2. LLM-as-judge — call GPT-4 / Claude with the 50 candidates and a prompt asking it to rank them. Strong quality, slow + expensive.
  3. Cohere Rerank API — managed reranker service. No local infra, pay per token.
  4. Skip reranking — return RRF-fused top-10 directly.

Decision

Adopt cross-encoder ms-marco-MiniLM-L-6-v2 (CPU-only, batch inference).

# api/reranker.py
from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def rerank(query: str, docs: list[Doc], top_n: int = 10) -> list[Doc]:
    pairs = [(query, doc.content[:512]) for doc in docs]
    scores = model.predict(pairs, batch_size=32, show_progress_bar=False)
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

Tradeoffs we accept

LeverCross-encoder (chosen)LLM-as-judgeCohere RerankSkip rerank
Recall@10 (Module 02 measured)~0.81 (hybrid + rerank)~0.85 (estimated)~0.82~0.75 (RRF only)
Latency for top-50 → top-10<50 ms (CPU batch=32)1.5–4 s100–250 ms0 ms
Cost per query~$0 (model is free, CPU compute amortized)$0.01–$0.03 (GPT-4)$0.001$0
Cost at 1M queries / mo$0 marginal$10k–$30k$1k$0
Operational footprintOne Python depVendor account + retry logicVendor accountNone
Tutorial reproducibilitypip install + first-run model downloadCloud accountCloud accountNone
Quality on adversarial queriesGood (MS MARCO trained)BestGoodPoor
Long-document handlingTruncates at 512 tokensNativeTruncatesN/A

We optimize for cost-quality at production volume. The 2-percent recall gap between cross-encoder and LLM-as-judge is real but the cost gap is 5 orders of magnitude. At 1M queries/month, an LLM reranker adds ~$20k/mo vs $0 for a CPU cross-encoder; that buys a lot of hyperparameter sweeps to close the recall gap.

Cohere Rerank is the right answer for teams that don't want to manage a Python ML serving dep. It's documented as the swap path.

Consequences (positive)

  • The reranker fits in <50 ms for 50 candidates on a single t4g.medium CPU. No GPU. No GPU autoscaling. No GPU bills.
  • The model is checkpoint-pinned in requirements.txt — Module 02's recall benchmark is reproducible across machines.
  • Adding a custom-trained reranker later is a model-file swap, not an architecture change. The interface stays (query, docs) → ranked.
  • The cross-encoder is itself the eval signal — Module 02 uses cross-encoder/ms-marco-MiniLM-L-12-v2 (the larger sibling) as a quality oracle for cross-checking the production model.

Consequences (negative)

  • 2-percent recall ceiling vs LLM-as-judge. A team with the budget for $20k/mo of reranker compute would see better quality. Mitigation: the gap closes with a fine-tuned cross-encoder; data/labeled_pairs.csv ships 2k labeled triples for exactly this purpose.
  • 512-token truncation. Long documents are reranked on their first ~250 words. Mitigation: chunk-then-merge at the document level (out of scope for v1; design noted in DESIGN.md).
  • CPU latency scales linearly with batch size. Reranking 100 candidates takes ~100 ms; at 200 candidates the latency budget breaks. Mitigation: cap candidate-list size to 50 in api/main.py.
  • No instruction-tuned quality. A "rank these by relevance to a user who is buying ergonomic furniture" intent-aware rerank is out of reach without LLM-as-judge or fine-tuning.

Reversal plan

The reranker interface is rerank(query, docs, top_n). Replacement is bounded:

  1. Cohere Rerank swap — replace model.predict(pairs) with a cohere.rerank(query=query, documents=[d.content for d in docs]) call. ~30 lines including retry logic.
  2. LLM-as-judge swap — replace with an Anthropic Claude call that returns a JSON array of {doc_id, score}. Add response validation; handle cost via the Module 04 cost panel.
  3. Fine-tune the cross-encoder — train against data/labeled_pairs.csv (2k labeled triples ship in starter kit); swap the model name in api/reranker.py.

Estimated effort: 2-5 engineer-days for any of the three. Reversible.

References

  • api/reranker.py (the cross-encoder runner)
  • api/main.py (/search/hybrid/reranked endpoint)
  • data/labeled_pairs.csv (2k triples for fine-tuning)
  • scripts/eval.py (recall@10 + MRR + nDCG validation)
  • ADR-002 (RRF — feeds the candidate list the reranker scores)
  • Cost-model CSV (cross-encoder compute is the largest CPU lever; budget impact)
Built into the project

This decision shipped as part of AI Retrieval Platform — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open