
Hybrid retrieval uses Reciprocal Rank Fusion, not score averaging or learned-sparse

✓ Accepted · AI Retrieval Platform · 02 - Hybrid Search & Precision Tuning
By AI-DE Engineering Team · Stakeholders: retrieval engineer, search relevance reviewer

Context

Hybrid retrieval merges a dense (semantic) candidate list with a sparse (lexical) candidate list. The dense list comes from cosine similarity over pgvector; the sparse list comes from Postgres ts_rank over a tsvector GIN index (BM25-shaped). The merge step is non-trivial because the two scores live on incompatible scales:

  • pgvector cosine similarity: bounded (mathematically [-1, 1]; in practice [0, 1] for these embeddings), distribution skewed high (~0.6–0.95 for real queries)
  • ts_rank (BM25-shaped): no fixed scale across queries, observed ~[0.001, 0.99], distribution log-shaped

Naive options that don't work:

  1. Score averaging ((cosine + ts_rank) / 2): ts_rank is ~10× smaller than cosine on most queries, so averaging effectively ignores the BM25 signal.
  2. Min-max normalization before averaging: sensitive to query distribution outliers; one rare-term query can warp the scale for the rest of the batch.
  3. Score multiplication: amplifies noise in the smaller distribution, because any relative error in ts_rank carries through to the whole product.
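To make the first failure mode concrete, here is a minimal sketch using made-up scores. The values are illustrative only: they are chosen to sit in the ranges quoted above and are not measurements from the project. It shows that averaging reproduces the dense ordering even when the sparse list disagrees completely.

```python
# Illustrative scores only (not measured): cosine in ~[0.6, 0.95],
# ts_rank roughly an order of magnitude smaller.
cosine = {"a": 0.91, "b": 0.88, "c": 0.74, "d": 0.66}
ts_rank = {"a": 0.010, "b": 0.020, "c": 0.090, "d": 0.075}


def order_by(scores: dict[str, float]) -> list[str]:
    """Return document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Naive fusion: arithmetic mean of the two raw scores.
averaged = {doc_id: (cosine[doc_id] + ts_rank[doc_id]) / 2 for doc_id in cosine}

dense_order = order_by(cosine)       # ['a', 'b', 'c', 'd']
sparse_order = order_by(ts_rank)     # ['c', 'd', 'b', 'a']
averaged_order = order_by(averaged)  # same as dense_order

print(averaged_order == dense_order)  # True: the lexical signal never changes the ranking
```

In this toy case the sparse list ranks the documents in nearly the opposite order, yet the averaged ranking is identical to the dense one.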

Options that work:

  1. Reciprocal Rank Fusion (RRF): ignore scores entirely; combine on rank position with 1 / (k + rank), k=60 by convention.
  2. Learned retrievers (SPLADE for learned-sparse, ColBERT for late interaction): replace BM25 with a learned representation that produces vector-compatible scores natively.
  3. A learned ranker on top of both lists: train a ranker over [cosine_score, ts_rank, query_features].

Decision

Adopt Reciprocal Rank Fusion (RRF) with k=60.

# api/rrf.py
# Doc is assumed to be the project's document model; the fuser only reads its `id`.
def rrf_fuse(dense_results: list[Doc], sparse_results: list[Doc], k: int = 60) -> list[Doc]:
    scores: dict[str, float] = {}
    docs: dict[str, Doc] = {}  # first-seen Doc per id, so each document is returned once
    for rank, doc in enumerate(dense_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
        docs.setdefault(doc.id, doc)
    for rank, doc in enumerate(sparse_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
        docs.setdefault(doc.id, doc)
    return sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)

-- The two source queries the fuser merges:
-- 1. Dense: SELECT id FROM documents ORDER BY embedding <=> :query_vec LIMIT 50;
-- 2. Sparse: SELECT id, ts_rank(ts_content, plainto_tsquery(:q)) AS rank
--           FROM documents WHERE ts_content @@ plainto_tsquery(:q)
--           ORDER BY rank DESC LIMIT 50;
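As a usage illustration, the sketch below restates the fuser so that it runs on its own. The Doc dataclass is a hypothetical stand-in for the project's document model, and the dict-based de-duplication is one reasonable way to return each document once; neither is confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for the project's document model."""
    id: str


def rrf_fuse(dense_results: list[Doc], sparse_results: list[Doc], k: int = 60) -> list[Doc]:
    """Rank-only fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    docs: dict[str, Doc] = {}  # first-seen Doc per id (assumed de-dup strategy)
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
            docs.setdefault(doc.id, doc)
    return sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)


# Toy candidate lists: "a" and "c" appear in both, "b" and "d" in one each.
dense = [Doc("a"), Doc("b"), Doc("c")]
sparse = [Doc("c"), Doc("a"), Doc("d")]

fused_ids = [doc.id for doc in rrf_fuse(dense, sparse)]
print(fused_ids)  # ['a', 'c', 'b', 'd']
```

Documents found by both retrievers ("a" and "c") outrank documents found by only one, which is the behavior the fusion step is meant to produce.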

Tradeoffs we accept

| Lever | RRF (chosen) | Score averaging | SPLADE/ColBERT | Learned ranker |
| --- | --- | --- | --- | --- |
| Day-1 setup | 12 lines of Python | 4 lines of Python | New model + serving infra | Train + retrain pipeline |
| Sensitivity to score scale | None (rank-only) | Catastrophic | Native | Depends on features |
| Recall@10 on 50-query bench | ~0.75 (vs 0.67 dense-only baseline) | ~0.69 | ~0.78 | ~0.80 |
| With cross-encoder rerank on top | ~0.81 (Part 2 measured) | ~0.71 | ~0.83 | ~0.84 |
| Latency overhead | ~5 ms in Python | <1 ms | +20–40 ms inference | +10 ms inference |
| Tunable | k parameter (lab-default 60) | Weighting | Model retraining | Feature engineering |
| Cost | $0 | $0 | GPU serving | GPU training + serving |
| Tutorial reproducibility | Pure Python | Pure Python | New stack | New stack |

We optimize for mathematical correctness + near-zero overhead + tutorial reproducibility. RRF is the de facto industry default because it solves the scale-mismatch problem with zero training data and zero new infrastructure. Learned-sparse retrievers buy 2–3 percentage points of recall at the cost of a separate model server, which is not the right tradeoff at v1.

Consequences (positive)

  • v1 ships in 12 lines of Python (api/rrf.py).
  • Adding a third source (e.g. fuzzy matching, query expansion via HyDE) is one additional for rank, doc in enumerate(...) loop.
  • The k=60 default is robust across most query distributions, so v1 needs no per-deployment tuning.
  • Score-free fusion means a sparse-only result with rank=1 contributes the same boost regardless of whether ts_rank was 0.99 or 0.05.
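The point about adding a third source can be sketched by generalizing the fuser to any number of ranked lists. This is an illustrative variant, not the project's code: the function name rrf_fuse_ids and the use of plain id strings instead of Doc objects are assumptions made to keep the example short.

```python
from collections import defaultdict
from collections.abc import Iterable, Sequence


def rrf_fuse_ids(ranked_lists: Iterable[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse any number of ranked id lists with Reciprocal Rank Fusion."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.__getitem__, reverse=True)


dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
fuzzy = ["d", "c", "e"]  # hypothetical third source

fused = rrf_fuse_ids([dense, sparse, fuzzy])
print(fused)  # ['c', 'a', 'd', 'b', 'e']
```

Here "c" wins because it appears in all three lists, even though it is never ranked first in the dense list.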

Consequences (negative)

  • Information loss. We discard the magnitude of confidence (a dense-cosine of 0.95 is treated identically to a dense-cosine of 0.65 if both are at the same rank). Mitigation: cross-encoder reranker (ADR-003) restores absolute-quality scoring at the top of the merged list.
  • No automatic weighting. A poorly-targeted BM25 query and a high-quality semantic match contribute equally if both are at rank 1. Mitigation: rare in practice; covered by the cross-encoder rerank.
  • k tuning is opaque. The k=60 default works on most corpora; we document the parameter but don't expose a per-query knob. A team with skewed corpus distributions might want k=20 or k=100; the swap is one line.
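To see what the k parameter actually changes, the following sketch compares two hypothetical documents under k=20, k=60, and k=100. The rank positions are invented for illustration. A smaller k rewards a single very high rank; a larger k rewards consistent mid-list agreement between the two retrievers.

```python
def rrf_score(ranks: list[int], k: int) -> float:
    """RRF score of one document, given its 1-based rank in each list."""
    return sum(1.0 / (k + rank) for rank in ranks)


# Hypothetical documents:
#   A is rank 1 in one list but only rank 20 in the other.
#   B is rank 8 in both lists.
ranks_a = [1, 20]
ranks_b = [8, 8]

winners = {
    k: "A" if rrf_score(ranks_a, k) > rrf_score(ranks_b, k) else "B"
    for k in (20, 60, 100)
}
print(winners)  # {20: 'A', 60: 'B', 100: 'B'}
```

With k=20 the top-ranked outlier A comes out ahead; with k=60 or k=100 the consistently ranked B does. That is the kind of shift a team with a skewed corpus would be tuning for.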

Reversal plan

The fuser interface is rrf_fuse(dense, sparse, k). Replacement is bounded:

  1. Add api/learned_fuser.py with the same signature, returning a ranked merged list.
  2. Switch api/main.py's /search/hybrid endpoint to dispatch via feature flag.
  3. Re-run the Module 02 eval harness — recall@10 + MRR assertions in scripts/eval.py validate the swap.
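The first two steps could look roughly like the sketch below. Everything specific here is an assumption for illustration: the HYBRID_FUSER environment variable, the select_fuser helper, the use of id strings rather than Doc objects, and the placeholder standing in for a real learned fuser. The source does not say how the feature flag is implemented.

```python
import os
from typing import Callable, Optional, Sequence

# A fuser maps (dense_ids, sparse_ids, k) to one merged ranking of ids.
Fuser = Callable[[Sequence[str], Sequence[str], int], list[str]]


def rrf_fuser(dense: Sequence[str], sparse: Sequence[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over two ranked id lists."""
    scores: dict[str, float] = {}
    for ranked in (dense, sparse):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)


def learned_fuser_placeholder(dense: Sequence[str], sparse: Sequence[str], k: int = 60) -> list[str]:
    """Placeholder only: dense order, then sparse-only ids. A trained model would go here."""
    return list(dict.fromkeys([*dense, *sparse]))


FUSERS: dict[str, Fuser] = {"rrf": rrf_fuser, "learned": learned_fuser_placeholder}


def select_fuser(flag: Optional[str] = None) -> Fuser:
    """Pick a fuser by flag value; unknown values fall back to RRF."""
    name = flag if flag is not None else os.environ.get("HYBRID_FUSER", "rrf")
    return FUSERS.get(name, rrf_fuser)


merged = select_fuser("rrf")(["a", "b"], ["b", "c"], 60)
print(merged)  # ['b', 'a', 'c']
```

Falling back to RRF on an unrecognized flag value keeps the endpoint serving the known-good path if the flag is misconfigured.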

If you also want to replace BM25 with SPLADE: drop the ts_rank sparse query, add a SPLADE serving sidecar, and feed its output rank list into the same fuser interface. The fuser doesn't care.

Estimated effort: 1 engineer-week for a tested RRF → learned-fuser swap. Reversible.
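For readers unfamiliar with the two metrics the eval step asserts on, here is a minimal sketch of recall@10 and MRR using their standard definitions. It is not the contents of scripts/eval.py, and the shape of the input (a ranked id list plus a set of relevant ids per query) is an assumption; the actual schema of data/golden_queries.json is not shown in this document.

```python
from collections.abc import Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Two toy queries with hand-labeled relevant ids.
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d6"], {"d5"}),              # first relevant hit at rank 1
]
first_recall = recall_at_k(*runs[0])
mrr = mean_reciprocal_rank(runs)
print(first_recall, mrr)  # 0.5 0.75
```

Running both fusers through the same functions on the same labeled queries is what makes the swap comparable.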

References

  • api/rrf.py (the fuser)
  • api/main.py (/search/hybrid endpoint composing the two lists)
  • migrations/add_bm25.sql (tsvector GENERATED + GIN index)
  • scripts/eval.py (recall@10 + MRR + nDCG validation)
  • data/golden_queries.json (50 labeled queries the fuser is measured against)
  • ADR-001 (pgvector — single-store assumption that makes the fan-out cheap)
  • ADR-003 (cross-encoder reranker — restores score magnitude post-fusion)
