ADR-002: Hybrid retrieval uses Reciprocal Rank Fusion, not score averaging or learned-sparse | AI Retrieval Platform

Context

Hybrid retrieval merges a dense (semantic) candidate list with a sparse (lexical) candidate list. The dense list comes from cosine similarity over pgvector; the sparse list comes from Postgres ts_rank over a tsvector GIN index (BM25-shaped). The merge step is non-trivial because the two scores live on incompatible scales:

pgvector cosine similarity: bounded (mathematically [-1, 1]; [0, 1] in practice here), distribution skewed high (~0.6–0.95 for real queries)
ts_rank (BM25-shaped): not normalized to a fixed scale; observed range ~[0.001, 0.99], distribution log-shaped

Naive options that don't work:

Score averaging ((cosine + ts_rank) / 2): ts_rank is ~10× smaller than cosine on most queries, so averaging effectively ignores the BM25 signal.
Min-max normalization before averaging: sensitive to outliers in the query distribution; one rare-term query can warp the scale for the rest of the batch.
Score multiplication: amplifies noise in the smaller distribution.
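To make the scale mismatch concrete, here is a minimal sketch using invented scores. The values are illustrative, not measured; they only mimic the ranges quoted above. The sparse list ranks the documents in exactly the opposite order to the dense list, yet the plain average reproduces the dense order unchanged.

```python
# Illustrative scores only: cosine in ~0.6-0.95, ts_rank roughly 10x smaller.
cosine = {"a": 0.92, "b": 0.85, "c": 0.78, "d": 0.70}
ts_rank = {"a": 0.01, "b": 0.02, "c": 0.05, "d": 0.09}  # sparse prefers d > c > b > a


def average_order(dense: dict[str, float], sparse: dict[str, float]) -> list[str]:
    """Rank ids by the naive (dense + sparse) / 2 average, best first."""
    avg = {doc_id: (dense[doc_id] + sparse[doc_id]) / 2 for doc_id in dense}
    return sorted(avg, key=avg.get, reverse=True)


dense_order = sorted(cosine, key=cosine.get, reverse=True)
sparse_order = sorted(ts_rank, key=ts_rank.get, reverse=True)
averaged_order = average_order(cosine, ts_rank)

print(dense_order)     # ['a', 'b', 'c', 'd']
print(sparse_order)    # ['d', 'c', 'b', 'a']
print(averaged_order)  # ['a', 'b', 'c', 'd']
```

With these numbers the sparse signal would need to be several times larger before it could flip even one adjacent pair.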

Options that work:

Reciprocal Rank Fusion (RRF): ignore scores entirely and combine on rank position, with each list contributing 1 / (k + rank) per document; k=60 by convention.
Learned retrievers (SPLADE for learned-sparse, ColBERT for late interaction): replace BM25 with a learned representation that produces vector-compatible scores natively.
A learned ranker on top of both lists: train a ranker over [cosine_score, ts_rank, query_features].
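For reference, the rank-based combination can be written out as below. The notation is ours rather than taken from the codebase: R is the set of ranked lists (here dense and sparse), rank_r(d) is the 1-based position of document d in list r, and a document absent from a list contributes nothing for that list.

```latex
\mathrm{RRF}(d) \;=\; \sum_{\substack{r \in R \\ d \in r}} \frac{1}{k + \operatorname{rank}_r(d)}, \qquad k = 60
```

As a worked case, a document at rank 1 in the dense list and rank 3 in the sparse list scores 1/61 + 1/63 ≈ 0.01639 + 0.01587 ≈ 0.03227, while a document at rank 1 in only one list scores 1/61 ≈ 0.01639. Appearing in both lists roughly doubles the fused score, which is the agreement signal RRF rewards.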

Decision

Adopt Reciprocal Rank Fusion (RRF) with k=60.

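# api/rrf.py (Doc is the project's document type, defined elsewhere; only .id is used)
def rrf_fuse(dense_results: list[Doc], sparse_results: list[Doc], k: int = 60) -> list[Doc]:
    scores: dict[str, float] = {}
    docs: dict[str, Doc] = {}  # first-seen Doc per id, so each document appears once
    for rank, doc in enumerate(dense_results, start=1):  # ranks are 1-based
        scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
        docs.setdefault(doc.id, doc)
    for rank, doc in enumerate(sparse_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
        docs.setdefault(doc.id, doc)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)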

-- The two source queries the fuser merges:
-- 1. Dense: SELECT id FROM documents ORDER BY embedding <=> :query_vec LIMIT 50;
-- 2. Sparse: SELECT id, ts_rank(ts_content, plainto_tsquery(:q)) AS rank
--           FROM documents WHERE ts_content @@ plainto_tsquery(:q)
--           ORDER BY rank DESC LIMIT 50;
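A small usage sketch may help show what the fuser does with the two result lists. The Doc dataclass below is a hypothetical stand-in for the project's type, and rrf_fuse is restated locally (same rank-only behavior, written compactly) so the snippet runs on its own; the document ids are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for the project's Doc type; only .id is needed here."""
    id: str


def rrf_fuse(dense_results: list[Doc], sparse_results: list[Doc], k: int = 60) -> list[Doc]:
    """Local copy of the rank-only fusion so this snippet runs standalone."""
    scores: dict[str, float] = {}
    docs: dict[str, Doc] = {}
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
            docs.setdefault(doc.id, doc)
    return sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)


dense = [Doc("a"), Doc("b"), Doc("c")]   # order returned by the dense query
sparse = [Doc("b"), Doc("d"), Doc("a")]  # order returned by the sparse query

fused_ids = [doc.id for doc in rrf_fuse(dense, sparse)]
print(fused_ids)  # ['b', 'a', 'd', 'c']
```

Document b (ranks 2 and 1) edges out a (ranks 1 and 3), and both beat d and c, which each appear in only one list.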

Tradeoffs we accept

Lever	RRF (chosen)	Score averaging	SPLADE/ColBERT	Learned ranker
Day-1 setup	12 lines of Python	4 lines of Python	New model + serving infra	Train + retrain pipeline
Sensitivity to score scale	None (rank-only)	Catastrophic	Native	Depends on features
Recall@10 on 50-query bench	~0.75 (vs 0.67 dense-only baseline)	~0.69	~0.78	~0.80
With cross-encoder rerank on top	~0.81 (Part 2 measured)	~0.71	~0.83	~0.84
Latency overhead	~5 ms in Python	<1 ms	+20–40 ms inference	+10 ms inference
Tunable	k parameter (lab-default 60)	Weighting	Model retraining	Feature engineering
Cost	$0	$0	GPU serving	GPU training + serving
Tutorial reproducibility	Pure Python	Pure Python	New stack	New stack

We optimize for mathematical correctness + near-zero overhead + tutorial reproducibility. RRF is the de facto default for hybrid retrieval because it solves the scale-mismatch problem with zero training data and zero new infrastructure. Learned-sparse retrievers buy 2–3 percentage points of recall at the cost of a separate model server, which is not the right tradeoff at v1.

Consequences (positive)

v1 ships in 12 lines of Python (api/rrf.py).
Adding a third source (e.g. fuzzy matching, query expansion via HyDE) is one additional for rank, doc in enumerate(...) loop.
The k=60 default is robust across most query distributions, so per-deployment tuning is rarely required.
Score-free fusion means a sparse-only result with rank=1 contributes the same boost regardless of whether ts_rank was 0.99 or 0.05.
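The extensibility and score-free points above can be sketched with a variant that accepts any number of ranked id lists. This is an assumption-laden illustration, not the shipped api/rrf.py: rrf_fuse_ids and the fuzzy source are hypothetical names, and the function takes ids only, so no raw score can influence the result.

```python
from collections import defaultdict
from collections.abc import Sequence


def rrf_fuse_ids(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse any number of ranked id lists; each list adds 1 / (k + rank) per id."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranked in ranked_lists:  # one pass per source
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # dict keeps first-seen order and sorted() is stable, so ties are deterministic.
    return sorted(scores, key=scores.__getitem__, reverse=True)


dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
fuzzy = ["e", "d"]  # hypothetical third source, e.g. fuzzy matching

two_way = rrf_fuse_ids([dense, sparse])
three_way = rrf_fuse_ids([dense, sparse, fuzzy])
print(two_way)    # ['b', 'a', 'd', 'c']
print(three_way)  # ['b', 'a', 'd', 'e', 'c']
```

Adding the third list lifts d (now present in two lists) close to a, and introduces e above the dense-only c, without any change to the scoring rule.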

Consequences (negative)

Information loss. We discard the magnitude of confidence (a dense-cosine of 0.95 is treated identically to a dense-cosine of 0.65 if both are at the same rank). Mitigation: cross-encoder reranker (ADR-003) restores absolute-quality scoring at the top of the merged list.
No automatic weighting. A poorly-targeted BM25 query and a high-quality semantic match contribute equally if both are at rank 1. Mitigation: rare in practice; covered by the cross-encoder rerank.
k tuning is opaque. The k=60 default works on most corpora; we document the parameter but don't expose a per-query knob. A team with skewed corpus distributions might want k=20 or k=100; the swap is one line.
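One way to make the effect of k less opaque is to compare how much a rank-1 hit outweighs a rank-10 hit within a single list. The sketch below only evaluates the 1 / (k + rank) term; rank_weight is a hypothetical helper name, and the three k values are the ones mentioned above.

```python
def rank_weight(rank: int, k: int) -> float:
    """RRF contribution of a single list entry at a 1-based rank."""
    return 1.0 / (k + rank)


ratios = {k: rank_weight(1, k) / rank_weight(10, k) for k in (20, 60, 100)}

for k, ratio in ratios.items():
    print(f"k={k}: rank 1 counts {ratio:.2f}x as much as rank 10")
# k=20: rank 1 counts 1.43x as much as rank 10
# k=60: rank 1 counts 1.15x as much as rank 10
# k=100: rank 1 counts 1.09x as much as rank 10
```

A smaller k makes the fusion more top-heavy, trusting each list's first few positions more; a larger k flattens the curve so that agreement between lists matters more than position within one list.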

Reversal plan

The fuser interface is rrf_fuse(dense, sparse, k). Replacement is bounded:

Add api/learned_fuser.py with the same signature, returning a ranked merged list.
Switch api/main.py's /search/hybrid endpoint to dispatch via feature flag.
Re-run the Module 02 eval harness — recall@10 + MRR assertions in scripts/eval.py validate the swap.
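The metrics named in the last step can be sketched as follows. scripts/eval.py is not shown in this ADR, so the function names, the data shape, and the toy queries below are assumptions made for illustration; only the metric definitions themselves are standard.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values: list[float]) -> float:
    """Arithmetic mean, 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical labeled queries: (fused ranking, set of relevant ids).
runs = [
    (["a", "b", "c"], {"b"}),  # relevant doc found at rank 2
    (["x", "y", "z"], {"q"}),  # relevant doc never retrieved
]
recall = mean([recall_at_k(ranked, rel) for ranked, rel in runs])
mrr = mean([reciprocal_rank(ranked, rel) for ranked, rel in runs])
print(recall, mrr)  # 0.5 0.25
```

Running the same two functions over the output of the old and new fuser gives a like-for-like comparison, which is what makes the swap checkable.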

If you also want to replace BM25 with SPLADE: drop the ts_rank sparse query, add a SPLADE serving sidecar, and feed its output rank list into the same fuser interface. The fuser doesn't care.
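A minimal sketch of the flag-based dispatch, assuming a shared fuser signature over ranked id lists. Everything here is hypothetical: the HYBRID_FUSER environment variable, the registry, and the two stub fusers, which are deliberately trivial stand-ins rather than RRF or a learned model. The point is only that callers depend on the signature, not on the implementation.

```python
import os
from typing import Callable, Optional

# Shared interface: (dense_ids, sparse_ids, k) -> merged ranking of ids.
Fuser = Callable[[list[str], list[str], int], list[str]]


def rank_only_stub(dense: list[str], sparse: list[str], k: int) -> list[str]:
    """Stand-in for the existing fuser: dense order, then unseen sparse ids."""
    return list(dict.fromkeys(dense + sparse))


def learned_stub(dense: list[str], sparse: list[str], k: int) -> list[str]:
    """Stand-in for a future learned fuser: sparse order, then unseen dense ids."""
    return list(dict.fromkeys(sparse + dense))


FUSERS: dict[str, Fuser] = {"rrf": rank_only_stub, "learned": learned_stub}


def select_fuser(flag: Optional[str] = None) -> Fuser:
    """Pick a fuser by flag value; unknown values fall back to the default."""
    name = flag if flag is not None else os.environ.get("HYBRID_FUSER", "rrf")
    return FUSERS.get(name, FUSERS["rrf"])


fuse = select_fuser("learned")
print(fuse(["a", "b"], ["b", "c"], 60))  # ['b', 'c', 'a']
```

Falling back to the default on an unknown flag value keeps a misconfigured deployment on the known-good path instead of failing the request.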

Estimated effort: 1 engineer-week for a tested RRF → learned-fuser swap. Reversible.

References

api/rrf.py (the fuser)
api/main.py (/search/hybrid endpoint composing the two lists)
migrations/add_bm25.sql (tsvector GENERATED + GIN index)
scripts/eval.py (recall@10 + MRR + nDCG validation)
data/golden_queries.json (50 labeled queries the fuser is measured against)
ADR-001 (pgvector — single-store assumption that makes the fan-out cheap)
ADR-003 (cross-encoder reranker — restores score magnitude post-fusion)