ADR-001: Hybrid retrieval (BM25 + dense) with Reciprocal Rank Fusion | Enterprise RAG

Context

A RAG system's retrieval layer is the single largest determinant of answer quality. Module 02 ships a Pinecone HNSW index over text-embedding-3-small (1536-dim, cosine). On the seed corpus (4 documents, ~80 chunks) dense-only retrieval works well for paraphrased questions but fails predictably on exact-match queries: an SKU like MX-9920-W, a function name like load_index, or a section heading like "Q4 OKRs" is matched to the wrong document because semantic similarity blurs lexical specificity.

We had three options for closing the lexical gap:

1. Stay dense-only. Lean on better embeddings, query rewriting, and chunk metadata. This keeps the index simple but leaves a measurable recall gap on exact-match queries: empirically a ~10-15pp lower hit-rate on lookup-style questions in the seed dataset.
2. Switch to BM25-only. Solid for exact matches, but it loses the paraphrase-tolerant recall that makes dense retrieval useful in the first place. This is the mirror image of the problem in (1).
3. Hybrid: run both, fuse the rankings. Keep the dense index and add a BM25 index. At query time, query both, fuse the two rankings, and send the fused top-K to the reranker (ADR-002).

Naive hybrid implementations either average raw scores (which doesn't work because BM25 and cosine live on incompatible scales) or interpolate them with a hand-tuned alpha (which drifts as the corpus grows). Both fail in practice at any non-trivial scale.
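To make the scale mismatch concrete, here is a minimal sketch using invented scores (the chunk ids and numbers are illustrative assumptions, not measurements from the seed corpus). BM25 scores are unbounded while cosine similarity lies in [-1, 1], so a plain mean simply reproduces the BM25 ordering.

```python
# Toy scores, invented for illustration: BM25 is unbounded, cosine lies in [-1, 1].
bm25_scores = {"chunk_a": 14.2, "chunk_b": 9.7, "chunk_c": 0.0}
cosine_scores = {"chunk_a": 0.31, "chunk_b": 0.35, "chunk_c": 0.92}


def rank_by(scores: dict[str, float]) -> list[str]:
    """Chunk ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def average_raw(bm25: dict[str, float], cosine: dict[str, float]) -> list[str]:
    """Rank chunk ids by the plain mean of the two raw scores."""
    ids = bm25.keys() | cosine.keys()
    mean = {i: (bm25.get(i, 0.0) + cosine.get(i, 0.0)) / 2 for i in ids}
    return rank_by(mean)


averaged = average_raw(bm25_scores, cosine_scores)

# The averaged ranking is identical to the BM25 ranking: the dense signal is drowned out,
# and chunk_c (the strongest semantic match) ends up last.
print(averaged)
print(rank_by(bm25_scores))
```

Fusing on rank position sidesteps this, because ranks from both retrievers share the same scale by construction.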

Decision

We adopt hybrid retrieval with Reciprocal Rank Fusion (RRF) over the two ranked lists.

# backend/app/services/rag_service.py: pseudocode of the merge
from collections import defaultdict


def hybrid_retrieve(query: str, top_k: int = 50) -> list[Chunk]:
    dense = vector_store.search(query, top_k=top_k)         # Pinecone HNSW
    sparse = bm25_index.search(query, top_k=top_k)          # in-process BM25

    return reciprocal_rank_fusion(dense, sparse, k=60)[:top_k]


def reciprocal_rank_fusion(*ranklists, k=60):
    scores = defaultdict(float)   # chunk id -> fused RRF score
    docs = {}                     # chunk id -> chunk object, so we return chunks rather than ids
    for ranklist in ranklists:
        for rank, doc in enumerate(ranklist, start=1):
            scores[doc.id] += 1.0 / (k + rank)
            docs.setdefault(doc.id, doc)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]

RRF treats both retrievers as black-box rankers and fuses on rank position, not raw score: each chunk's fused score is the sum of 1/(k + rank) over the lists it appears in. The k=60 constant is the value used in the original RRF paper (Cormack et al., 2009). It flattens the score curve at the top of each list, so a single first-place finish in one ranker cannot outweigh solid placement in both, while low-ranked items still contribute a small, diminishing amount.
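A small worked run may help show how the rank-based scores behave. The sketch below works on plain chunk ids rather than Chunk objects, and the ids and list contents are hypothetical.

```python
from collections import defaultdict


def rrf_scores(*ranklists: list[str], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list in which an id appears (ranks start at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranklist in ranklists:
        for rank, doc_id in enumerate(ranklist, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical chunk ids, best match first in each list.
dense = ["c7", "c2", "c9", "c4"]
sparse = ["c4", "c2", "c1", "c8"]

fused = rrf_scores(dense, sparse)
ranking = sorted(fused, key=fused.get, reverse=True)

# c2 is second in both lists:           1/62 + 1/62 ~= 0.0323
# c4 is fourth in dense, first in BM25: 1/64 + 1/61 ~= 0.0320
# c7 is first in dense only:            1/61        ~= 0.0164
print(ranking[:3])
```

With k=60, appearing in both lists roughly doubles a chunk's score, so agreement between the retrievers outweighs a top placement in only one of them.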

The fused top-50 is handed to the cross-encoder reranker (ADR-002), which pares it down to the top-10 the LLM actually sees.

Tradeoffs we accept

Lever	Alternative	Chosen
Index complexity	One index	Two indexes (dense + BM25), two query paths
Query latency	Single retrieval call	Parallel retrieval + RRF merge
Maintenance	One re-index pipeline	Lockstep dense + sparse re-indexing
Tunability	One alpha knob	k=60 (paper default) + per-retriever top-K
Score interpretability	Raw cosine / BM25 scores	RRF fused scores (rank-based, not magnitudes)

The largest concrete cost is the lockstep re-indexing requirement: every document update has to land in both indexes or recall degrades on whichever side is stale. The streaming embedder in streaming_embedder.py was extended in M02 to fan out updates to both indexes; the dead-letter queue catches half-applied updates.
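The fan-out idea can be sketched as follows. This is not the code in streaming_embedder.py: FanOutWriter, the writer callables, and the in-memory dead-letter list are hypothetical stand-ins for whatever the pipeline and its queue actually use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FanOutWriter:
    """Apply each update to every index; record failures in a dead-letter list."""
    writers: dict[str, Callable[[str, str], None]]      # index name -> upsert(doc_id, text)
    dead_letters: list[tuple[str, str, str]] = field(default_factory=list)

    def upsert(self, doc_id: str, text: str) -> bool:
        """Return True only if every index accepted the update."""
        ok = True
        for name, write in self.writers.items():
            try:
                write(doc_id, text)
            except Exception as exc:   # any backend failure leaves the update half-applied
                self.dead_letters.append((name, doc_id, repr(exc)))
                ok = False
        return ok


# Usage with toy backends: the dense write succeeds, the BM25 write fails.
dense_store: dict[str, str] = {}


def write_dense(doc_id: str, text: str) -> None:
    dense_store[doc_id] = text


def write_sparse(doc_id: str, text: str) -> None:
    raise ConnectionError("bm25 index unavailable")


writer = FanOutWriter({"dense": write_dense, "bm25": write_sparse})
applied = writer.upsert("doc-1", "MX-9920-W spec sheet")
print(applied, writer.dead_letters)
```

Recording the index name alongside the document id means a replay only needs to retry the side that is stale.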

Consequences (positive)

Recall coverage stacks. Dense catches paraphrases; BM25 catches exact matches. Hit-rate@10 on the chunking-A/B bench moves from ~74% (dense only) to ~88% (hybrid + reranker).
No alpha to tune. The k=60 constant works across our seed corpus variants (long_handbook, short_faq, structured_pricing, noisy_release_notes) without per-corpus calibration. Drift risk is low.
Reranker has better candidates to work with. The cross-encoder (ADR-002) is precision-heavy but expensive; feeding it a top-50 mix of lexical and semantic candidates outperforms feeding it 50 dense-only results.
Failure mode is graceful. If BM25 indexing breaks, we serve dense-only with a Prometheus alert; if dense breaks, we serve BM25-only. Either is better than serving neither.
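The graceful-degradation behaviour might look roughly like the sketch below, which runs the retrievers in parallel and keeps whichever ranked lists come back. The function name, the Retriever signature, and the toy retrievers are assumptions for illustration; alerting is reduced to a returned list of failed retriever names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Retriever = Callable[[str, int], list[str]]   # (query, top_k) -> ranked chunk ids


def retrieve_with_fallback(
    query: str, retrievers: dict[str, Retriever], top_k: int = 50
) -> tuple[dict[str, list[str]], list[str]]:
    """Run all retrievers in parallel; return the surviving ranked lists and the failed names."""
    results: dict[str, list[str]] = {}
    failed: list[str] = []
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = {name: pool.submit(fn, query, top_k) for name, fn in retrievers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                failed.append(name)    # a real service would also emit a metric or alert here
    if not results:
        raise RuntimeError("all retrievers failed")
    return results, failed


# Usage with toy retrievers: BM25 is down, so only the dense list survives.
def dense(query: str, top_k: int) -> list[str]:
    return ["c7", "c2"][:top_k]


def bm25(query: str, top_k: int) -> list[str]:
    raise TimeoutError("bm25 index down")


lists, failed = retrieve_with_fallback("load_index", {"dense": dense, "bm25": bm25})
print(lists, failed)
```

Because RRF accepts any number of ranked lists, the surviving lists can be passed straight to the fusion step; with a single list the fused order is just that list's order.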

Consequences (negative)

Two indexes to keep consistent. A document update has to land in both. We mitigate with the streaming embedder fan-out + a nightly reconciliation job.
RRF is opaque. "Why was this chunk ranked above that one?" is answerable but requires showing both source ranks plus the fusion math. The explainability route (app/explainability.py) returns both ranks plus the fused score on every retrieval.
Per-query latency increases ~30-40 ms. The parallel retrieval + in-process fusion adds work on every call. Caching at the LLM-gateway layer (ADR-004) absorbs most of it for repeat queries.
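As an illustration of the "both source ranks plus the fusion math" breakdown mentioned above, a per-chunk explanation could be assembled as below. This is a sketch, not the payload that app/explainability.py returns; the function name and field names are hypothetical.

```python
def explain_fusion(dense: list[str], sparse: list[str], k: int = 60) -> list[dict]:
    """Per chunk: its rank in each list (None if absent) and the fused RRF score."""
    dense_rank = {doc_id: r for r, doc_id in enumerate(dense, start=1)}
    sparse_rank = {doc_id: r for r, doc_id in enumerate(sparse, start=1)}
    rows = []
    for doc_id in dict.fromkeys(dense + sparse):   # de-duplicate, keep first-seen order
        d, s = dense_rank.get(doc_id), sparse_rank.get(doc_id)
        fused = sum(1.0 / (k + r) for r in (d, s) if r is not None)
        rows.append({"id": doc_id, "dense_rank": d, "bm25_rank": s, "rrf_score": fused})
    return sorted(rows, key=lambda row: row["rrf_score"], reverse=True)


rows = explain_fusion(["c7", "c2"], ["c2", "c1"])
for row in rows:
    print(row)
```

Reading a row answers the ranking question directly: c2 leads because it appears in both lists (dense rank 2, BM25 rank 1), while c7 and c1 each appear in only one.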

Reversal plan

If the BM25 retriever's marginal hit-rate contribution drops below ~5pp (i.e. dense retrieval is catching everything BM25 catches), the cost of running and re-indexing it exceeds the benefit. Reversal is mechanical:

Set RETRIEVAL_HYBRID_ENABLED=false in .env.
rag_service.hybrid_retrieve short-circuits to dense-only.
Stop the BM25 fan-out in streaming_embedder.py.
Drop the BM25 index after a 30-day freeze.
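The flag-driven short-circuit in the first two steps could be sketched like this. Only the RETRIEVAL_HYBRID_ENABLED name comes from this ADR; the helper names, the accepted false values, and the default of "hybrid on when unset" are assumptions.

```python
import os
from typing import Callable, Mapping, Optional


def hybrid_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read RETRIEVAL_HYBRID_ENABLED; only an explicit false value turns hybrid off."""
    source = os.environ if env is None else env
    raw = source.get("RETRIEVAL_HYBRID_ENABLED", "true")   # assumed default: enabled
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def retrieve(
    query: str,
    dense_search: Callable[[str], list[str]],
    hybrid_search: Callable[[str], list[str]],
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Short-circuit to dense-only when the flag is off."""
    if not hybrid_enabled(env):
        return dense_search(query)
    return hybrid_search(query)


# Usage with stub search functions that just label which path ran.
off = retrieve("q", lambda q: ["dense"], lambda q: ["hybrid"],
               env={"RETRIEVAL_HYBRID_ENABLED": "false"})
on = retrieve("q", lambda q: ["dense"], lambda q: ["hybrid"], env={})
print(off, on)
```

Passing the environment in as a mapping keeps the switch testable without touching the process environment.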

Estimated effort: ~3 engineer-days.
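The reversal trigger depends on measuring BM25's marginal hit-rate contribution. One way to compute it is sketched below, assuming hit-rate@k means the fraction of queries with at least one relevant chunk in the top k; the query ids, chunk ids, and result lists are toy data, not bench results.

```python
def hit_rate_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    hits = sum(1 for query, ranked in results.items() if relevant[query] & set(ranked[:k]))
    return hits / len(results)


def marginal_contribution_pp(
    dense_results: dict[str, list[str]],
    hybrid_results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Hybrid hit-rate minus dense-only hit-rate, in percentage points."""
    dense_rate = hit_rate_at_k(dense_results, relevant, k)
    hybrid_rate = hit_rate_at_k(hybrid_results, relevant, k)
    return 100.0 * (hybrid_rate - dense_rate)


# Toy evaluation set: dense misses the exact-match query q4, hybrid finds it.
relevant = {"q1": {"c1"}, "q2": {"c2"}, "q3": {"c3"}, "q4": {"c4"}}
dense_results = {"q1": ["c1"], "q2": ["c2"], "q3": ["c3"], "q4": ["c9"]}
hybrid_results = {"q1": ["c1"], "q2": ["c2"], "q3": ["c3"], "q4": ["c4", "c9"]}

delta = marginal_contribution_pp(dense_results, hybrid_results, relevant)
print(delta)
```

Both result sets should be scored at the same pipeline stage (for example, both before the reranker) so the difference isolates what BM25 adds.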

References

backend/app/services/rag_service.py — hybrid orchestrator
backend/app/services/vector_store.py — Pinecone HNSW backend
backend/app/routes/search.py — query API entry
streaming_embedder.py — fan-out re-index pipeline
app/explainability.py — exposes per-retriever ranks for debugging
ADR-002 (cross-encoder reranker fed by this hybrid output)
ADR-003 (chunking strategy that produces the chunks both retrievers see)
Cormack, Clarke, Buettcher 2009, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods", SIGIR 2009 (the original Reciprocal Rank Fusion paper)