Context
The hybrid retriever (ADR-001 + ADR-002) returns ~50 candidate documents per query. The reranker's job is to score absolute quality across those 50 and surface a top-K (typically 5–10) for the agent or the LLM downstream. The reranker is the last and most expensive step on the read path. The classic options:
- Cross-encoder (e.g.
cross-encoder/ms-marco-MiniLM-L-6-v2) — a small transformer that takes(query, doc)pairs and emits a relevance score. Runs on CPU in batch. - LLM-as-judge — call GPT-4 / Claude with the 50 candidates and a prompt asking it to rank them. Strong quality, slow + expensive.
- Cohere Rerank API — managed reranker service. No local infra, pay per token.
- Skip reranking — return RRF-fused top-10 directly.
Decision
Adopt cross-encoder ms-marco-MiniLM-L-6-v2 (CPU-only, batch
inference).
# api/reranker.py
from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)
def rerank(query: str, docs: list[Doc], top_n: int = 10) -> list[Doc]:
pairs = [(query, doc.content[:512]) for doc in docs]
scores = model.predict(pairs, batch_size=32, show_progress_bar=False)
ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
return [doc for doc, _ in ranked[:top_n]]
Tradeoffs we accept
| Lever | Cross-encoder (chosen) | LLM-as-judge | Cohere Rerank | Skip rerank |
|---|---|---|---|---|
| Recall@10 (Module 02 measured) | ~0.81 (hybrid + rerank) | ~0.85 (estimated) | ~0.82 | ~0.75 (RRF only) |
| Latency for top-50 → top-10 | <50 ms (CPU batch=32) | 1.5–4 s | 100–250 ms | 0 ms |
| Cost per query | ~$0 (model is free, CPU compute amortized) | $0.01–$0.03 (GPT-4) | $0.001 | $0 |
| Cost at 1M queries / mo | $0 marginal | $10k–$30k | $1k | $0 |
| Operational footprint | One Python dep | Vendor account + retry logic | Vendor account | None |
| Tutorial reproducibility | pip install + first-run model download | Cloud account | Cloud account | None |
| Quality on adversarial queries | Good (MS MARCO trained) | Best | Good | Poor |
| Long-document handling | Truncates at 512 tokens | Native | Truncates | N/A |
We optimize for cost-quality at production volume. The 2-percent recall gap between cross-encoder and LLM-as-judge is real but the cost gap is 5 orders of magnitude. At 1M queries/month, an LLM reranker adds ~$20k/mo vs $0 for a CPU cross-encoder; that buys a lot of hyperparameter sweeps to close the recall gap.
Cohere Rerank is the right answer for teams that don't want to manage a Python ML serving dep. It's documented as the swap path.
Consequences (positive)
- The reranker fits in <50 ms for 50 candidates on a single
t4g.mediumCPU. No GPU. No GPU autoscaling. No GPU bills. - The model is checkpoint-pinned in
requirements.txt— Module 02's recall benchmark is reproducible across machines. - Adding a custom-trained reranker later is a model-file swap, not an
architecture change. The interface stays
(query, docs) → ranked. - The cross-encoder is itself the eval signal — Module 02 uses
cross-encoder/ms-marco-MiniLM-L-12-v2(the larger sibling) as a quality oracle for cross-checking the production model.
Consequences (negative)
- 2-percent recall ceiling vs LLM-as-judge. A team with the budget
for $20k/mo of reranker compute would see better quality. Mitigation:
the gap closes with a fine-tuned cross-encoder;
data/labeled_pairs.csvships 2k labeled triples for exactly this purpose. - 512-token truncation. Long documents are reranked on their first
~250 words. Mitigation: chunk-then-merge at the document level (out
of scope for v1; design noted in
DESIGN.md). - CPU latency scales linearly with batch size. Reranking 100
candidates takes ~100 ms; at 200 candidates the latency budget breaks.
Mitigation: cap candidate-list size to 50 in
api/main.py. - No instruction-tuned quality. A "rank these by relevance to a user who is buying ergonomic furniture" intent-aware rerank is out of reach without LLM-as-judge or fine-tuning.
Reversal plan
The reranker interface is rerank(query, docs, top_n). Replacement is
bounded:
- Cohere Rerank swap — replace
model.predict(pairs)with acohere.rerank(query=query, documents=[d.content for d in docs])call. ~30 lines including retry logic. - LLM-as-judge swap — replace with an Anthropic Claude call that
returns a JSON array of
{doc_id, score}. Add response validation; handle cost via the Module 04 cost panel. - Fine-tune the cross-encoder — train against
data/labeled_pairs.csv(2k labeled triples ship in starter kit); swap the model name inapi/reranker.py.
Estimated effort: 2-5 engineer-days for any of the three. Reversible.
References
api/reranker.py(the cross-encoder runner)api/main.py(/search/hybrid/rerankedendpoint)data/labeled_pairs.csv(2k triples for fine-tuning)scripts/eval.py(recall@10 + MRR + nDCG validation)- ADR-002 (RRF — feeds the candidate list the reranker scores)
- Cost-model CSV (cross-encoder compute is the largest CPU lever; budget impact)