Context
A RAG system's retrieval layer is the single largest determinant of answer
quality. Module 02 ships a Pinecone HNSW index over text-embedding-3-small
(1536-dim, cosine). On the seed corpus (4 documents, ~80 chunks) dense-only
retrieval works well for paraphrased questions but fails predictably on
exact-match queries: an SKU like MX-9920-W, a function name like
load_index, or a section heading like "Q4 OKRs" gets sent to the wrong
document because semantic similarity blurs lexical specificity.
We had three options for closing the lexical gap:
- Stay dense-only. Lean on better embeddings, query rewriting, and chunk metadata. Embedding-only retrieval keeps the index simple but leaves a measurable recall gap on exact-match queries — empirically ~10-15pp lower hit-rate on lookup-style questions in the seed dataset.
- Switch to BM25-only. Solid for exact-match queries, but it loses the paraphrase-tolerant recall that makes dense retrieval useful in the first place. This is the mirror image of the dense-only problem.
- Hybrid. Keep both the dense index and a BM25 index. At query time, run both retrievers, fuse the two rankings, and send the fused top-K to the reranker (ADR-002).
Naive hybrid implementations either average raw scores (which doesn't work because BM25 and cosine live on incompatible scales) or interpolate them with a hand-tuned alpha (which drifts as the corpus grows). Both fail in practice at any non-trivial scale.
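To make the scale mismatch concrete, here is a minimal sketch with invented scores (the chunk ids and numbers are illustrative assumptions, not measurements from our corpus). Cosine similarity is bounded in [-1, 1] while BM25 is unbounded, so a plain average of the raw scores tends to follow the BM25 ordering and largely ignore the dense signal.

```python
# Illustrative only: the scores below are invented to show the scale problem.
cosine = {"chunk_a": 0.82, "chunk_b": 0.79, "chunk_c": 0.31}  # bounded in [-1, 1]
bm25 = {"chunk_a": 2.1, "chunk_b": 0.0, "chunk_c": 14.7}      # unbounded


def average_raw(dense: dict[str, float], sparse: dict[str, float]) -> list[str]:
    """Rank ids by the plain mean of the two raw scores."""
    ids = dense.keys() | sparse.keys()
    mean = {i: (dense.get(i, 0.0) + sparse.get(i, 0.0)) / 2 for i in ids}
    return sorted(mean, key=mean.get, reverse=True)


naive_order = average_raw(cosine, bm25)
bm25_order = sorted(bm25, key=bm25.get, reverse=True)
dense_order = sorted(cosine, key=cosine.get, reverse=True)
```

In this toy case `naive_order` equals `bm25_order`, and the dense retriever's weakest candidate (`chunk_c`) ends up first.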
Decision
We adopt hybrid retrieval with Reciprocal Rank Fusion (RRF) over the two ranked lists.
# backend/app/services/rag_service.py: pseudocode of the merge
from collections import defaultdict

def hybrid_retrieve(query: str, top_k: int = 50) -> list[Chunk]:
    dense = vector_store.search(query, top_k=top_k)   # Pinecone HNSW
    sparse = bm25_index.search(query, top_k=top_k)    # in-process BM25
    return reciprocal_rank_fusion(dense, sparse, k=60)[:top_k]

def reciprocal_rank_fusion(*ranklists, k=60):
    scores = defaultdict(float)   # doc.id -> fused RRF score
    docs = {}                     # doc.id -> doc, so we return chunks, not bare ids
    for ranklist in ranklists:
        for rank, doc in enumerate(ranklist, start=1):
            scores[doc.id] += 1.0 / (k + rank)
            docs.setdefault(doc.id, doc)
    # Highest fused score first.
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]
RRF treats both retrievers as black-box rankers and fuses on rank position
(not raw score), so the incompatible score scales never meet. The k=60
constant is the default from the original RRF paper (Cormack et al., 2009):
it keeps a first-place result from a single ranker from dominating the fused
score, so a chunk that both rankers place reasonably high can outrank a chunk
that only one ranker places first.
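A small worked example may help show what the constant does. The sketch below is self-contained and uses hypothetical chunk ids (`faq`, `intro`, `pricing`, `sku`); it fuses plain id lists rather than the `Chunk` objects used in the service, and `k=0` appears only as a contrast, not as a setting we use.

```python
from collections import defaultdict


def rrf_scores(ranklists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list an id appears in (ranks start at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranklist in ranklists:
        for rank, doc_id in enumerate(ranklist, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


def fuse(ranklists: list[list[str]], k: int = 60) -> list[str]:
    """Return ids ordered by fused score, best first."""
    scores = rrf_scores(ranklists, k=k)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ids: "sku" is found only by BM25, "intro" only by dense.
dense = ["faq", "intro", "pricing"]
sparse = ["sku", "faq", "pricing"]

fused_k60 = fuse([dense, sparse], k=60)  # faq, pricing, sku, intro
fused_k0 = fuse([dense, sparse], k=0)    # faq, sku, pricing, intro
```

With k=60, `pricing` (third in both lists, 2/63 ≈ 0.0317) outranks `sku` (first in one list, 1/61 ≈ 0.0164): agreement between rankers counts for more than a single first place. With k=0 the order of those two flips, because a lone first place then contributes a full 1.0.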
The fused top-50 is handed to the cross-encoder reranker (ADR-002), which pares it down to the top-10 the LLM actually sees.
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Index complexity | One index | Two indexes (dense + BM25), two query paths |
| Query latency | Single retrieval call | Parallel retrieval + RRF merge |
| Maintenance | One re-index pipeline | Lockstep dense + sparse re-indexing |
| Tunability | One alpha knob | k=60 (paper default) + per-retriever top-K |
| Score interpretability | Raw cosine / BM25 scores | RRF fused scores (rank-based, not magnitudes) |
The largest concrete cost is the lockstep re-indexing requirement: every
document update has to land in both indexes or recall degrades on whichever
side is stale. The streaming embedder in streaming_embedder.py was
extended in M02 to fan out updates to both indexes; the dead-letter queue
catches half-applied updates.
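The fan-out pattern can be sketched as follows. This is not the code in streaming_embedder.py: `FanOut`, `MemoryIndex`, and the in-memory dead-letter list are hypothetical stand-ins chosen to show the idea that a write which fails on one index is recorded for replay instead of being silently dropped.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Index(Protocol):
    def upsert(self, doc_id: str, text: str) -> None: ...


@dataclass
class FanOut:
    """Write each update to every index; park failures for later replay."""

    indexes: dict[str, Index]
    dead_letters: list[tuple[str, str, str]] = field(default_factory=list)

    def apply(self, doc_id: str, text: str) -> bool:
        ok = True
        for name, index in self.indexes.items():
            try:
                index.upsert(doc_id, text)
            except Exception:
                # Record (index name, doc id, text) so the write can be retried.
                self.dead_letters.append((name, doc_id, text))
                ok = False
        return ok


class MemoryIndex:
    """Hypothetical in-memory index used only to exercise the sketch."""

    def __init__(self, fail: bool = False) -> None:
        self.docs: dict[str, str] = {}
        self.fail = fail

    def upsert(self, doc_id: str, text: str) -> None:
        if self.fail:
            raise RuntimeError("index unavailable")
        self.docs[doc_id] = text


dense_index, bm25_stub = MemoryIndex(), MemoryIndex(fail=True)
fan_out = FanOut({"dense": dense_index, "bm25": bm25_stub})
applied = fan_out.apply("doc-1", "hello")
```

Here the dense write succeeds, the BM25 write fails, and the half-applied update ends up in `fan_out.dead_letters` for a later retry.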
Consequences (positive)
- Recall coverage stacks. Dense catches paraphrases; BM25 catches exact matches. Hit-rate@10 on the chunking-A/B bench moves from ~74% (dense only) to ~88% (hybrid + reranker).
- No alpha to tune. The k=60 constant works across our seed corpus variants (long_handbook, short_faq, structured_pricing, noisy_release_notes) without per-corpus calibration. Drift risk is low.
- Reranker has better candidates to work with. The cross-encoder (ADR-002) is precision-heavy but expensive; feeding it a top-50 mix of lexical and semantic candidates outperforms feeding it 50 dense results.
- Failure mode is graceful. If BM25 indexing breaks, we serve dense-only with a Prometheus alert; if dense breaks, we serve BM25-only. Either is better than serving neither.
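The graceful-degradation behaviour in the last bullet could look roughly like the sketch below. The function name, the `Retriever` alias, and the stub retrievers are assumptions for illustration; the alerting hook is reduced to a log call, and a metrics counter would be incremented at the same point.

```python
import logging
from typing import Callable

logger = logging.getLogger("rag.retrieval")

# Hypothetical signature: (query, top_k) -> ranked list of chunk ids.
Retriever = Callable[[str, int], list[str]]


def retrieve_with_fallback(
    query: str, top_k: int, dense: Retriever, sparse: Retriever
) -> list[list[str]]:
    """Return the ranked lists from whichever retrievers succeeded."""
    ranklists: list[list[str]] = []
    for name, retriever in (("dense", dense), ("bm25", sparse)):
        try:
            ranklists.append(retriever(query, top_k))
        except Exception:
            # Degrade rather than fail the request; an alert would fire here.
            logger.exception("retriever %s failed; serving degraded results", name)
    if not ranklists:
        raise RuntimeError("both retrievers failed")
    return ranklists


def working_dense(query: str, top_k: int) -> list[str]:
    return ["a", "b", "c"][:top_k]


def broken_bm25(query: str, top_k: int) -> list[str]:
    raise ConnectionError("BM25 index unavailable")


degraded = retrieve_with_fallback("Q4 OKRs", 3, working_dense, broken_bm25)
```

Because RRF over a single ranked list preserves that list's order, the fusion step downstream needs no special case for the degraded path.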
Consequences (negative)
- Two indexes to keep consistent. A document update has to land in both. We mitigate with the streaming embedder fan-out + a nightly reconciliation job.
- RRF is opaque. "Why was this chunk ranked above that one?" is answerable but requires showing both source ranks plus the fusion math. The explainability route (app/explainability.py) returns both ranks plus the fused score on every retrieval.
- Per-query latency increases ~30-40 ms. The parallel retrieval + in-process fusion adds work on every call. Caching at the LLM-gateway layer (ADR-004) absorbs most of it for repeat queries.
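As an illustration of what "both ranks plus the fused score" can look like, here is a minimal sketch. The `RankExplanation` record and `explain` function are hypothetical and do not reflect the actual response shape of app/explainability.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RankExplanation:
    chunk_id: str
    dense_rank: Optional[int]   # None if the dense retriever did not return it
    sparse_rank: Optional[int]  # None if BM25 did not return it
    fused_score: float


def explain(dense: list[str], sparse: list[str], k: int = 60) -> list[RankExplanation]:
    """Pair each chunk's per-retriever ranks with its RRF score, best first."""
    dense_rank = {cid: r for r, cid in enumerate(dense, start=1)}
    sparse_rank = {cid: r for r, cid in enumerate(sparse, start=1)}
    rows = []
    for cid in dense_rank.keys() | sparse_rank.keys():
        d, s = dense_rank.get(cid), sparse_rank.get(cid)
        score = sum(1.0 / (k + r) for r in (d, s) if r is not None)
        rows.append(RankExplanation(cid, d, s, score))
    return sorted(rows, key=lambda row: row.fused_score, reverse=True)


rows = explain(dense=["faq", "intro"], sparse=["sku", "faq"])
```

Each row answers the "why above that one?" question directly: `faq` leads because it has a rank in both lists, while `sku` and `intro` each carry a single term of the sum.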
Reversal plan
If the BM25 retriever's marginal hit-rate contribution drops below ~5pp (i.e. dense retrieval is catching everything BM25 catches), the cost of running and re-indexing it exceeds the benefit. Reversal is mechanical:
- Set RETRIEVAL_HYBRID_ENABLED=false in .env. rag_service.hybrid_retrieve short-circuits to dense-only.
- Stop the BM25 fan-out in streaming_embedder.py.
- Drop the BM25 index after a 30-day freeze.
Estimated effort: ~3 engineer-days.
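The reversal trigger can be checked mechanically. The sketch below assumes a hypothetical evaluation format (one gold chunk id per query and a ranked id list per query); the toy data is invented, and only the ~5pp threshold comes from the criterion above.

```python
REVERSAL_THRESHOLD_PP = 5.0  # threshold from the reversal criterion


def hit_rate(results: dict[str, list[str]], relevant: dict[str, str], top_n: int = 10) -> float:
    """Fraction of queries whose gold chunk id appears in the top_n results."""
    hits = sum(1 for q, gold in relevant.items() if gold in results.get(q, [])[:top_n])
    return hits / len(relevant)


def bm25_marginal_pp(
    dense_only: dict[str, list[str]],
    hybrid: dict[str, list[str]],
    relevant: dict[str, str],
    top_n: int = 10,
) -> float:
    """Hit-rate gain of hybrid over dense-only, in percentage points."""
    return 100.0 * (hit_rate(hybrid, relevant, top_n) - hit_rate(dense_only, relevant, top_n))


# Toy evaluation set (invented): dense-only misses the exact-match query q2.
relevant = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
dense_only = {"q1": ["a"], "q2": ["x"], "q3": ["c"], "q4": ["d"]}
hybrid = {"q1": ["a"], "q2": ["b"], "q3": ["c"], "q4": ["d"]}

gain_pp = bm25_marginal_pp(dense_only, hybrid, relevant)
should_reverse = gain_pp < REVERSAL_THRESHOLD_PP
```

On this toy set the gain is 25pp, so the check does not recommend reversal. Running the same comparison on the real bench at a fixed cadence would turn the reversal criterion into a monitored number rather than a judgment call.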
References
- backend/app/services/rag_service.py: hybrid orchestrator
- backend/app/services/vector_store.py: Pinecone HNSW backend
- backend/app/routes/search.py: query API entry
- streaming_embedder.py: fan-out re-index pipeline
- app/explainability.py: exposes per-retriever ranks for debugging
- ADR-002 (cross-encoder reranker fed by this hybrid output)
- ADR-003 (chunking strategy that produces the chunks both retrievers see)
- Cormack, Clarke, Buettcher 2009: "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (SIGIR 2009), the original Reciprocal Rank Fusion paper