# ADR-001 — Hybrid retrieval (BM25 + dense) with Reciprocal Rank Fusion

- **Status:** Accepted
- **Date:** 2026-04-12
- **Module:** 03 — Querying, Prompting & Generation
- **Stakeholders:** retrieval engineer, ML platform lead, product owner

## Context

A RAG system's retrieval layer is the single largest determinant of answer
quality. Module 02 ships a Pinecone HNSW index over `text-embedding-3-small`
(1536-dim, cosine). On the seed corpus (4 documents, ~80 chunks) dense-only
retrieval works well for paraphrased questions but fails predictably on
exact-match queries: an SKU like `MX-9920-W`, a function name like
`load_index`, or a section heading like "Q4 OKRs" gets matched to chunks
from the wrong document, because embedding similarity rewards overall
meaning and blurs exact tokens.

We had three options for closing the lexical gap:

1. **Stay dense-only.** Lean on better embeddings, query rewriting, and
   chunk metadata. Embedding-only retrieval keeps the index simple but
   leaves a measurable recall gap on exact-match queries — empirically
   ~10-15pp lower hit-rate on lookup-style questions in the seed dataset.
2. **Switch to BM25-only.** Solid for exact-match but loses the
   paraphrase-tolerant recall that makes dense retrieval useful in the
   first place. Symmetric problem to (1).
3. **Hybrid — run both, fuse the rankings.** Keep both the dense index and
   a BM25 index. At query time, run both, fuse the result rankings, and
   send the fused top-K to the reranker (ADR-002).

Naive hybrid implementations either average raw scores, which fails because
BM25 scores are unbounded and corpus-dependent while cosine similarity is
bounded in [-1, 1], or interpolate them with a hand-tuned alpha, which needs
re-tuning as the corpus grows and the BM25 score distribution shifts.
Neither holds up beyond a small, static corpus.
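
To make the scale problem concrete, the sketch below averages raw scores for three chunks. The chunk ids and score values are invented for illustration; they are not measurements from the seed corpus.

```python
# Illustrative only: chunk ids and scores are made up to show the scale mismatch.
bm25_scores = {"chunk_a": 14.2, "chunk_b": 3.1, "chunk_c": 0.0}      # unbounded, corpus-dependent
cosine_scores = {"chunk_a": 0.31, "chunk_b": 0.52, "chunk_c": 0.89}  # bounded in [-1, 1]


def average_raw_scores(sparse: dict[str, float], dense: dict[str, float]) -> list[str]:
    """Naive fusion: average the raw scores per chunk and sort descending."""
    ids = sparse.keys() | dense.keys()
    fused = {i: (sparse.get(i, 0.0) + dense.get(i, 0.0)) / 2 for i in ids}
    return sorted(fused, key=fused.get, reverse=True)


naive_order = average_raw_scores(bm25_scores, cosine_scores)
bm25_order = sorted(bm25_scores, key=bm25_scores.get, reverse=True)
```

With these numbers `naive_order` is identical to `bm25_order`: the strongest dense match (`chunk_c`) lands last because the BM25 magnitudes swamp the cosine values.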

## Decision

We adopt **hybrid retrieval with Reciprocal Rank Fusion (RRF)** over the
two ranked lists.

```python
# backend/app/services/rag_service.py: sketch of the merge
from collections import defaultdict


def hybrid_retrieve(query: str, top_k: int = 50) -> list[Chunk]:
    dense = vector_store.search(query, top_k=top_k)   # Pinecone HNSW
    sparse = bm25_index.search(query, top_k=top_k)    # in-process BM25

    return reciprocal_rank_fusion(dense, sparse, k=60)[:top_k]


def reciprocal_rank_fusion(*ranklists, k=60):
    scores = defaultdict(float)   # chunk id -> fused RRF score
    by_id = {}                    # chunk id -> chunk, so we return chunks, not ids
    for ranklist in ranklists:
        for rank, doc in enumerate(ranklist, start=1):   # ranks are 1-based
            scores[doc.id] += 1.0 / (k + rank)
            by_id.setdefault(doc.id, doc)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [by_id[doc_id] for doc_id in ranked_ids]
```

RRF treats both retrievers as black-box rankers and fuses on rank position,
not raw score: each list contributes `1 / (k + rank)` per document. The
`k=60` constant is the value used in the original RRF paper (Cormack et
al., 2009). It flattens the curve so that a single first-place rank from one
retriever cannot dominate the fused list, while documents ranked highly by
both retrievers still rise to the top.
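
A short worked example of the arithmetic may help. The helper name `rrf_term` is hypothetical; it only evaluates the `1 / (k + rank)` term used in the merge above for a few ranks.

```python
def rrf_term(rank: int, k: int = 60) -> float:
    """Contribution of one retriever that placed a document at `rank` (1-based)."""
    return 1.0 / (k + rank)


# A chunk ranked 1st by one retriever and missing from the other list.
single_top = rrf_term(1)

# A chunk ranked 5th by both retrievers.
both_mid = rrf_term(5) + rrf_term(5)

# How much more a rank-1 hit is worth than a rank-10 hit, with and without smoothing.
ratio_k60 = rrf_term(1, k=60) / rrf_term(10, k=60)
ratio_k0 = rrf_term(1, k=0) / rrf_term(10, k=0)
```

Agreement wins: `both_mid` (2/65) is larger than `single_top` (1/61). The smoothing effect of `k` shows in the ratios: with `k=60` a rank-1 hit is worth about 1.15 times a rank-10 hit, whereas with `k=0` it would be worth 10 times as much.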

The fused top-50 is handed to the cross-encoder reranker (ADR-002), which
pares it down to the top-10 the LLM actually sees.

## Tradeoffs we accept

| Lever                  | Alternative              | Chosen                                        |
| ---------------------- | ------------------------ | --------------------------------------------- |
| Index complexity       | One index                | Two indexes (dense + BM25), two query paths   |
| Query latency          | Single retrieval call    | Parallel retrieval + RRF merge                |
| Maintenance            | One re-index pipeline    | Lockstep dense + sparse re-indexing           |
| Tunability             | One alpha knob           | k=60 (paper default) + per-retriever top-K    |
| Score interpretability | Raw cosine / BM25 scores | RRF fused scores (rank-based, not magnitudes) |

The largest concrete cost is the lockstep re-indexing requirement: every
document update has to land in both indexes or recall degrades on whichever
side is stale. The streaming embedder in `streaming_embedder.py` was
extended in Module 02 to fan out updates to both indexes; the dead-letter
queue catches half-applied updates.
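
The fan-out pattern can be sketched as follows. This is not the code in `streaming_embedder.py`: `FanOutResult`, `fan_out_update`, and the list standing in for the dead-letter queue are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FanOutResult:
    """Outcome of applying one document update to every index (hypothetical type)."""
    doc_id: str
    failed_targets: list[str] = field(default_factory=list)


def fan_out_update(
    doc_id: str,
    writers: dict[str, Callable[[str], None]],
    dead_letters: list[FanOutResult],
) -> FanOutResult:
    """Apply the update to each index writer and record any partial failure."""
    result = FanOutResult(doc_id)
    for name, write in writers.items():
        try:
            write(doc_id)
        except Exception:
            # Any writer failure means the update is only half-applied.
            result.failed_targets.append(name)
    if result.failed_targets:
        # A plain list stands in for a real dead-letter queue here.
        dead_letters.append(result)
    return result
```

A consumer of the dead-letter entries would retry only the indexes named in `failed_targets`, so the healthy side is not re-written.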

## Consequences (positive)

- **Recall coverage stacks.** Dense catches paraphrases; BM25 catches exact
  matches. Hit-rate@10 on the chunking-A/B bench moves from ~74% (dense
  only) to ~88% (hybrid + reranker).
- **No alpha to tune.** The `k=60` constant works across our seed corpus
  variants (long_handbook, short_faq, structured_pricing, noisy_release_notes)
  without per-corpus calibration. Drift risk is low.
- **Reranker has better candidates to work with.** The cross-encoder
  (ADR-002) is precision-heavy but expensive; feeding it a top-50 mix of
  lexical and semantic candidates outperforms feeding it 50 dense-only
  results.
- **Failure mode is graceful.** If BM25 indexing breaks, we serve dense-only
  with a Prometheus alert; if dense breaks, we serve BM25-only. Either is
  better than serving neither.
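
The graceful-degradation behaviour could be sketched as below, assuming each retriever is a callable that returns ranked chunk ids. `retrieve_with_fallback` and the `Retriever` alias are hypothetical names, and the Prometheus alert is only marked by a comment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, top_k) -> ranked chunk ids


def retrieve_with_fallback(
    query: str, top_k: int, retrievers: dict[str, Retriever]
) -> tuple[dict[str, list[str]], list[str]]:
    """Run all retrievers in parallel; return surviving ranklists and failed names."""
    ranklists: dict[str, list[str]] = {}
    failed: list[str] = []
    with ThreadPoolExecutor(max_workers=max(1, len(retrievers))) as pool:
        futures = {name: pool.submit(fn, query, top_k) for name, fn in retrievers.items()}
        for name, future in futures.items():
            try:
                ranklists[name] = future.result()
            except Exception:
                # A real service would also raise its alert metric here.
                failed.append(name)
    if not ranklists:
        raise RuntimeError("all retrievers failed")
    return ranklists, failed
```

Whatever ranklists survive can be passed straight to the fusion step; with a single list, RRF simply preserves that list's order.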

## Consequences (negative)

- **Two indexes to keep consistent.** A document update has to land in
  both. We mitigate with the streaming embedder fan-out + a nightly
  reconciliation job.
- **RRF is opaque.** "Why was this chunk ranked above that one?" is
  answerable but requires showing both source ranks plus the fusion math.
  The explainability route (`app/explainability.py`) returns both ranks
  plus the fused score on every retrieval.
- **Per-query latency increases ~30-40 ms.** The parallel retrieval +
  in-process fusion adds work on every call. Caching at the LLM-gateway
  layer (ADR-004) absorbs most of it for repeat queries.
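
To illustrate what "both ranks plus the fused score" might look like, here is a minimal sketch. It is not the contents of `app/explainability.py`; the function name `explain_fusion` and the record field names are assumptions.

```python
from typing import Optional


def explain_fusion(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[dict]:
    """Return one record per chunk: its rank in each list and its fused RRF score."""
    dense_rank = {cid: r for r, cid in enumerate(dense_ids, start=1)}
    sparse_rank = {cid: r for r, cid in enumerate(sparse_ids, start=1)}
    records = []
    for cid in dense_rank.keys() | sparse_rank.keys():
        d: Optional[int] = dense_rank.get(cid)   # None if absent from the dense list
        s: Optional[int] = sparse_rank.get(cid)  # None if absent from the BM25 list
        score = sum(1.0 / (k + r) for r in (d, s) if r is not None)
        records.append({"chunk_id": cid, "dense_rank": d, "bm25_rank": s, "rrf_score": score})
    # Highest fused score first; chunk id breaks ties deterministically.
    return sorted(records, key=lambda rec: (-rec["rrf_score"], rec["chunk_id"]))


report = explain_fusion(["a", "b"], ["b", "c"])
```

In `report`, chunk `b` comes first because it appears in both lists (ranks 2 and 1), followed by `a` (dense rank 1 only) and `c` (BM25 rank 2 only).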

## Reversal plan

If the BM25 retriever's marginal hit-rate contribution drops below ~5pp
(i.e. dense retrieval is catching nearly everything BM25 catches), the cost
of running and re-indexing it exceeds the benefit. Reversal is mechanical:

1. Set `RETRIEVAL_HYBRID_ENABLED=false` in `.env`.
2. `rag_service.hybrid_retrieve` short-circuits to dense-only.
3. Stop the BM25 fan-out in `streaming_embedder.py`.
4. Drop the BM25 index after a 30-day freeze.

Estimated effort: ~3 engineer-days.
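
Steps 1 and 2 could be wired up roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and treating an unset flag as "hybrid on" is a choice made for this example, not something the ADR specifies.

```python
import os
from typing import Callable, Mapping, Optional

Retriever = Callable[[str, int], list]  # (query, top_k) -> ranked results


def hybrid_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True unless RETRIEVAL_HYBRID_ENABLED is explicitly set to 'false'."""
    source = os.environ if env is None else env
    # Assumption: an unset flag leaves hybrid retrieval on.
    return source.get("RETRIEVAL_HYBRID_ENABLED", "true").strip().lower() != "false"


def retrieve(
    query: str,
    top_k: int,
    dense: Retriever,
    hybrid: Retriever,
    env: Optional[Mapping[str, str]] = None,
) -> list:
    """Short-circuit to the dense-only path when the flag is off."""
    if not hybrid_enabled(env):
        return dense(query, top_k)
    return hybrid(query, top_k)
```

Passing `env` explicitly keeps the flag testable without mutating the process environment.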

## References

- `backend/app/services/rag_service.py` — hybrid orchestrator
- `backend/app/services/vector_store.py` — Pinecone HNSW backend
- `backend/app/routes/search.py` — query API entry
- `streaming_embedder.py` — fan-out re-index pipeline
- `app/explainability.py` — exposes per-retriever ranks for debugging
- ADR-002 (cross-encoder reranker fed by this hybrid output)
- ADR-003 (chunking strategy that produces the chunks both retrievers see)
- Cormack, Clarke, Buettcher 2009 — the original Reciprocal Rank Fusion paper
