Context
Hybrid retrieval merges a dense (semantic) candidate list with a
sparse (lexical) candidate list. The dense list comes from cosine
similarity over pgvector; the sparse list comes from Postgres
ts_rank over a tsvector GIN index (BM25-shaped). The merge step
is non-trivial because the two scores live on incompatible scales:
- pgvector cosine similarity: bounded [0, 1], distribution skewed high (~0.6–0.95 for real queries)
- ts_rank BM25: unbounded ~[0.001, 0.99], distribution log-shaped
Naive options that don't work:
- Score averaging, (cosine + ts_rank) / 2: ts_rank is ~10× smaller than cosine on most queries, so averaging effectively ignores the BM25 signal.
- Min-max normalization before averaging: sensitive to outliers in the query distribution; one rare-term query can warp the scale for the rest of the batch.
- Score multiplication: amplifies noise in the smaller distribution.
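To make the averaging failure concrete, here is a small sketch with made-up scores in the ranges described above. The numbers are illustrative assumptions, not measurements from the bench; they are chosen so that the lexical ranking is the exact reverse of the dense ranking.

```python
# Illustrative scores only (assumed, not measured): cosine sits in the
# ~0.6-0.95 band, ts_rank is roughly 10x smaller.
cosine = {"a": 0.91, "b": 0.84, "c": 0.78, "d": 0.66}
ts_rank = {"a": 0.01, "b": 0.02, "c": 0.05, "d": 0.06}


def order_by(score: dict[str, float]) -> list[str]:
    """Return document ids sorted by descending score."""
    return sorted(score, key=lambda doc_id: score[doc_id], reverse=True)


# Naive fusion: arithmetic mean of the two raw scores.
averaged = {doc_id: (cosine[doc_id] + ts_rank[doc_id]) / 2 for doc_id in cosine}

dense_order = order_by(cosine)        # ranking by cosine alone
sparse_order = order_by(ts_rank)      # ranking by ts_rank alone
averaged_order = order_by(averaged)   # ranking by the naive average

print(dense_order, sparse_order, averaged_order)
```

With these inputs the averaged ranking is identical to the dense ranking even though the sparse list disagrees on every position, which is the sense in which averaging ignores the BM25 signal.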
Options that work:
- Reciprocal Rank Fusion (RRF): ignore scores entirely; combine on rank position with 1 / (k + rank), k=60 by convention.
- Learned retrievers (SPLADE for learned-sparse, ColBERT for late interaction): replace BM25 with a learned representation that produces vector-compatible scores natively.
- A learned ranker on top of both lists: train a ranker over [cosine_score, ts_rank, query_features].
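For reference, the RRF score written out in full, with one worked value at k = 60. The symbols R (the set of ranked lists) and rank_r(d) (the 1-based position of document d in list r) are notation introduced here; a list that does not contain d contributes nothing to the sum.

```latex
\[
\mathrm{RRF}(d) = \sum_{r \in R,\; d \in r} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60
\]
% Worked example: d is at rank 1 in the dense list and rank 3 in the sparse list.
\[
\mathrm{RRF}(d) = \frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0164 + 0.0159 = 0.0323
\]
```

A document that appears at rank 1 in only one list scores about 0.0164, so appearing in both lists roughly doubles the fused score.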
Decision
Adopt Reciprocal Rank Fusion (RRF) with k=60.
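```python
# api/rrf.py
from __future__ import annotations  # Doc is the project's document type; only .id is read here


def rrf_fuse(dense_results: list[Doc], sparse_results: list[Doc], k: int = 60) -> list[Doc]:
    """Merge two ranked lists by Reciprocal Rank Fusion: each hit adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    docs: dict[str, Doc] = {}  # first-seen Doc per id, so each document is returned once
    for rank, doc in enumerate(dense_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
        docs.setdefault(doc.id, doc)
    for rank, doc in enumerate(sparse_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
        docs.setdefault(doc.id, doc)
    return sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)
```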
```sql
-- The two source queries the fuser merges:
-- 1. Dense
SELECT id FROM documents ORDER BY embedding <=> :query_vec LIMIT 50;
-- 2. Sparse
SELECT id, ts_rank(ts_content, plainto_tsquery(:q)) AS rank
FROM documents WHERE ts_content @@ plainto_tsquery(:q)
ORDER BY rank DESC LIMIT 50;
```
Tradeoffs we accept
| Lever | RRF (chosen) | Score averaging | SPLADE/ColBERT | Learned ranker |
|---|---|---|---|---|
| Day-1 setup | 12 lines of Python | 4 lines of Python | New model + serving infra | Train + retrain pipeline |
| Sensitivity to score scale | None (rank-only) | Catastrophic | Native | Depends on features |
| Recall@10 on 50-query bench | ~0.75 (vs 0.67 dense-only baseline) | ~0.69 | ~0.78 | ~0.80 |
| With cross-encoder rerank on top | ~0.81 (Part 2 measured) | ~0.71 | ~0.83 | ~0.84 |
| Latency overhead | ~5 ms in Python | <1 ms | +20–40 ms inference | +10 ms inference |
| Tunable | k parameter (lab-default 60) | Weighting | Model retraining | Feature engineering |
| Cost | $0 | $0 | GPU serving | GPU training + serving |
| Tutorial reproducibility | Pure Python | Pure Python | New stack | New stack |
We optimize for mathematical correctness + near-zero overhead + tutorial reproducibility. RRF is a widely adopted default for hybrid search because it solves the scale-mismatch problem with zero training data and zero new infrastructure. Learned-sparse retrievers buy 2–3 percentage points of recall@10 on our bench (0.78 vs 0.75, or 0.83 vs 0.81 with rerank) at the cost of a separate model server, which is not the right tradeoff at v1.
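The recall@10 figures in the table are per-query values averaged over the bench. As a reminder of what the metric measures, here is a minimal sketch. It is not the project's scripts/eval.py: the function names and the shape of the golden data (a ranked id list paired with a set of relevant ids per query) are assumptions made for illustration, and recall is taken here to mean the fraction of relevant documents found in the top k.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant ids that appear among the top k results."""
    if not relevant_ids:
        return 0.0
    found = set(ranked_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two hypothetical queries: the first finds 1 of 2 relevant docs, the second 1 of 1.
runs = [
    (["d3", "d7", "d1", "d9"], {"d1", "d2"}),
    (["x1", "x2"], {"x1"}),
]
bench_recall = mean_recall_at_k(runs, k=10)
```

On this toy input `bench_recall` is 0.75: the per-query values are 0.5 and 1.0.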
Consequences (positive)
- v1 ships in 12 lines of Python (api/rrf.py).
- Adding a third source (e.g. fuzzy matching, query expansion via HyDE) is one additional for rank, doc in enumerate(...) loop.
- The k=60 default is robust across query distributions; no per-deployment tuning required.
- Score-free fusion means a sparse-only result with rank=1 contributes the same boost regardless of whether ts_rank was 0.99 or 0.05.
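The third-source point can be sketched by generalizing the fuser to any number of ranked lists. This is an illustration rather than the contents of api/rrf.py: `rrf_fuse_ids` and the `fuzzy` list are hypothetical names, and plain id strings stand in for the project's Doc type. Because the function only ever sees positions, it also shows why the raw ts_rank value cannot influence the result.

```python
from collections import defaultdict
from collections.abc import Sequence


def rrf_fuse_ids(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse any number of ranked id lists with Reciprocal Rank Fusion."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Best fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
fuzzy = ["d", "c"]  # hypothetical third source

two_way = rrf_fuse_ids([dense, sparse])           # ['a', 'c', 'b', 'd']
three_way = rrf_fuse_ids([dense, sparse, fuzzy])  # ['c', 'a', 'd', 'b']
```

Adding the third list promotes "c" and "d", which it ranks, without any change to the scoring rule.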
Consequences (negative)
- Information loss. We discard the magnitude of confidence (a dense-cosine of 0.95 is treated identically to a dense-cosine of 0.65 if both are at the same rank). Mitigation: cross-encoder reranker (ADR-003) restores absolute-quality scoring at the top of the merged list.
- No automatic weighting. A poorly targeted BM25 query and a high-quality semantic match contribute equally if both are at rank 1. Mitigation: rare in practice; covered by the cross-encoder rerank.
- k tuning is opaque. The k=60 default works on most corpora; we document the parameter but don't expose a per-query knob. A team with skewed corpus distributions might want k=20 or k=100; the swap is one line.
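One way to make the effect of k less opaque is to look at how much more a rank-1 hit contributes than a rank-10 hit. The sketch below uses only the 1 / (k + rank) formula; the helper names are hypothetical.

```python
def rank_contribution(rank: int, k: int) -> float:
    """RRF contribution of one list entry at the given 1-based rank."""
    return 1.0 / (k + rank)


def top_vs_tenth(k: int) -> float:
    """How many times more a rank-1 hit contributes than a rank-10 hit."""
    return rank_contribution(1, k) / rank_contribution(10, k)


for k in (20, 60, 100):
    print(f"k={k}: rank 1 counts {top_vs_tenth(k):.2f}x as much as rank 10")
# k=20: rank 1 counts 1.43x as much as rank 10
# k=60: rank 1 counts 1.15x as much as rank 10
# k=100: rank 1 counts 1.09x as much as rank 10
```

A smaller k sharpens the preference for top-ranked hits in each list; a larger k flattens it, so agreement between lists matters more than position within one list.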
Reversal plan
The fuser interface is rrf_fuse(dense, sparse, k). Replacement is
bounded:
- Add api/learned_fuser.py with the same signature, returning a ranked merged list.
- Switch api/main.py's /search/hybrid endpoint to dispatch via feature flag.
- Re-run the Module 02 eval harness; recall@10 + MRR assertions in scripts/eval.py validate the swap.
If you also want to replace BM25 with SPLADE: drop the ts_rank
sparse query, add a SPLADE serving sidecar, and feed its output rank
list into the same fuser interface. The fuser doesn't care.
Estimated effort: 1 engineer-week for a tested RRF → learned-fuser swap. Reversible.
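The feature-flag dispatch could look roughly like the sketch below. Everything here is an assumption for illustration: the `Fuser` alias, the `FUSERS` registry, `select_fuser`, and the flag values are hypothetical, plain id strings stand in for the Doc type, and `learned_fuser` is a trivial stand-in (dense order first, then unseen sparse ids), not a trained model.

```python
from collections.abc import Callable, Sequence

# A fuser takes (dense, sparse, k) and returns one merged ranking of ids.
Fuser = Callable[[Sequence[str], Sequence[str], int], list[str]]


def rrf_fuser(dense: Sequence[str], sparse: Sequence[str], k: int = 60) -> list[str]:
    """Rank-only RRF over two id lists."""
    scores: dict[str, float] = {}
    for ranked in (dense, sparse):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def learned_fuser(dense: Sequence[str], sparse: Sequence[str], k: int = 60) -> list[str]:
    """Stand-in with the same signature; k is accepted but unused here."""
    merged = list(dense)
    merged.extend(doc_id for doc_id in sparse if doc_id not in dense)
    return merged


FUSERS: dict[str, Fuser] = {"rrf": rrf_fuser, "learned": learned_fuser}


def select_fuser(flag_value: str) -> Fuser:
    """Resolve a feature-flag value to a fuser, falling back to RRF."""
    return FUSERS.get(flag_value, rrf_fuser)


fused = select_fuser("rrf")(["a", "b"], ["c", "a"], 60)
```

Falling back to RRF on an unknown flag value keeps the endpoint on the known-good path if the flag is misconfigured.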
References
- api/rrf.py (the fuser)
- api/main.py (/search/hybrid endpoint composing the two lists)
- migrations/add_bm25.sql (tsvector GENERATED + GIN index)
- scripts/eval.py (recall@10 + MRR + nDCG validation)
- data/golden_queries.json (50 labeled queries the fuser is measured against)
- ADR-001 (pgvector: single-store assumption that makes the fan-out cheap)
- ADR-003 (cross-encoder reranker: restores score magnitude post-fusion)