Context
The retrieval layer is the single highest-leverage decision in a RAG platform — get it wrong and every downstream layer (LLM generation, eval, observability) is debugging the wrong problem. Vendor choices were on the table at v0:
- Pinecone (managed). Best-in-class HNSW + filtering. ~$70-200/mo at our scale. Vendor lock; data leaves our perimeter.
- Weaviate (self-hosted). Open source. Schema-driven. Adds a service to operate.
- Qdrant (self-hosted). Open source. Lean. Adds a service to operate.
- pgvector (Postgres extension). Open source. Lives in Postgres we already operate. ANN via HNSW or IVFFlat. Native joins with the rest of our data.
Constraints driving the pick:
- One database, not two. The platform already needs Postgres for tenants, prompt registry, eval dataset, traces, user feedback. Adding a second database doubles ops surface.
- Hybrid retrieval is non-negotiable. Pure semantic loses on rare-token queries (proper nouns, error codes); pure BM25 loses on synonym-heavy queries. We need both, then a fusion step.
- Reranking matters more than retrieval. A cross-encoder on top-50 candidates beats raw retrieval@5 quality by 8-12pp in our gold-set eval. The retrieval index just has to surface the candidate pool.
- Compliance: documents stay in our VPC. Pinecone's managed offering is a non-starter for tenants on the compliance track.
Decision
Three-stage retrieval, all on pgvector + Postgres:
- Index:
chunkstable withvector(384)column (sentence-transformersall-MiniLM-L6-v2), HNSW index,tsvcolumn for BM25. - Hybrid retrieve: semantic (
<=>cosine distance, top-50) + BM25 (Postgres FTS, top-50) → reciprocal rank fusion to merge. Returns top-20. - Rerank: cross-encoder
ms-marco-MiniLM-L-6-v2over the top-20 → top-5 cited candidates with grounded scores.
# retrieval/hybrid.py
def reciprocal_rank_fusion(
semantic_hits: list[Chunk], keyword_hits: list[Chunk], k: int = 60
) -> list[ScoredChunk]:
"""Merge two ranked lists via RRF — robust to score-scale mismatch."""
scores: dict[str, float] = defaultdict(float)
for rank, chunk in enumerate(semantic_hits):
scores[chunk.id] += 1.0 / (k + rank + 1)
for rank, chunk in enumerate(keyword_hits):
scores[chunk.id] += 1.0 / (k + rank + 1)
return sorted(scores.items(), key=lambda x: -x[1])
-- HNSW index on chunks
CREATE INDEX chunks_embedding_idx ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- BM25 via tsv column
ALTER TABLE chunks ADD COLUMN tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX chunks_tsv_idx ON chunks USING gin(tsv);
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Vendor risk | Pinecone managed | pgvector — accept slightly worse recall@1 for one-database simplicity |
| Operational burden | Weaviate / Qdrant as a service | pgvector — Postgres we already operate |
| Recall vs filter cost | Two indexes + intersect | Hybrid + RRF — accept the rerank step as the precision gate |
| Compliance | Managed cloud vector DB | All retrieval inside our VPC |
| Reranking | None (faster) | Cross-encoder on top-20 — accept ~80ms latency for 8-12pp recall@5 gain |
Consequences (positive)
- Single database, single backup, single monitoring stack. Ops cost ~$0 over baseline Postgres.
- Retrieval queries can JOIN against tenants, prompts, traces —
WHERE tenant_id = ?without cross-system reconciliation. - Cross-encoder rerank delivers the precision win Pinecone alone wouldn't have given us.
- HNSW on
m=16, ef_construction=64returns p95 < 30ms on ~2k-chunk corpora — within the M01 SystemContract latency floor. - Hybrid + RRF is robust to score-scale mismatch between semantic and keyword paths — no tuning headaches.
Consequences (negative)
- pgvector recall@1 is ~2-3pp behind Pinecone on adversarial queries. Mitigation: cross-encoder rerank closes the gap on top-5.
- HNSW build time scales with corpus size. At 100k+ chunks, ANALYZE + VACUUM windows have to be planned — we've measured this on a sample 50k corpus and it's manageable through M03; will revisit at scale.
- BM25 tsv column adds ~25% storage per chunk. We accept this for the hybrid retrieval gain.
- Cross-encoder is a separate model load (~80MB). On cold containers, first query takes ~1s longer. Mitigation: warmup ping in M04 health check.
Reversal plan
Pinecone swap (full vendor migration): ~2 engineer-weeks. The retrieval interface is a Protocol — workers don't know they're hitting pgvector. Migration:
- Stand up Pinecone index in parallel (~3 days).
- Write Pinecone implementation of the
RetrieverProtocol (~2 days). - Dual-write phase (~3 days).
- Cut over and decommission pgvector path (~2 days).
Trigger: corpus exceeds 5M chunks per tenant and cross-encoder rerank stops closing the recall gap.
Weaviate / Qdrant swap: ~1.5 engineer-weeks. Similar pattern, lower vendor lock. Use when compliance prefers self-hosted-but-not-Postgres.
Drop hybrid (semantic-only): ~2 engineer-days. Use only if BM25 path consistently underperforms in eval (we don't see this — keyword path saves ~15% of queries on rare-token recall).
References
retrieval/semantic.py— semantic search via<=>retrieval/keyword.py— BM25 via Postgres FTSretrieval/hybrid.py— RRF fusionretrieval/rerank.py— cross-encoderembeddings/schema.sql— HNSW + tsv column- ADR-001 (SystemContract
correctness_floor— what retrieval gates against) - ADR-005 (DEPRECATED — multi-tenant via row-filter on shared index, reverted to per-tenant index)