ADR-002: pgvector + HNSW + hybrid RRF + cross-encoder rerank, not a dedicated vector DB | Full-Stack AI Platform

Context

The retrieval layer is the single highest-leverage decision in a RAG platform — get it wrong and every downstream layer (LLM generation, eval, observability) is debugging the wrong problem. Vendor choices were on the table at v0:

Pinecone (managed). Best-in-class HNSW + filtering. ~$70-200/mo at our scale. Vendor lock; data leaves our perimeter.
Weaviate (self-hosted). Open source. Schema-driven. Adds a service to operate.
Qdrant (self-hosted). Open source. Lean. Adds a service to operate.
pgvector (Postgres extension). Open source. Lives in Postgres we already operate. ANN via HNSW or IVFFlat. Native joins with the rest of our data.

Constraints driving the pick:

One database, not two. The platform already needs Postgres for tenants, prompt registry, eval dataset, traces, user feedback. Adding a second database doubles ops surface.
Hybrid retrieval is non-negotiable. Pure semantic loses on rare-token queries (proper nouns, error codes); pure BM25 loses on synonym-heavy queries. We need both, then a fusion step.
Reranking matters more than retrieval. A cross-encoder on top-50 candidates beats raw retrieval@5 quality by 8-12pp in our gold-set eval. The retrieval index just has to surface the candidate pool.
Compliance: documents stay in our VPC. Pinecone's managed offering is a non-starter for tenants on the compliance track.

Decision

Three-stage retrieval, all on pgvector + Postgres:

Index: chunks table with vector(384) column (sentence-transformers all-MiniLM-L6-v2), HNSW index, tsv column for BM25.
Hybrid retrieve: semantic (<=> cosine distance, top-50) + BM25 (Postgres FTS, top-50) → reciprocal rank fusion to merge. Returns top-20.
Rerank: cross-encoder ms-marco-MiniLM-L-6-v2 over the top-20 → top-5 cited candidates with grounded scores.

# retrieval/hybrid.py
def reciprocal_rank_fusion(
    semantic_hits: list[Chunk], keyword_hits: list[Chunk], k: int = 60
) -> list[ScoredChunk]:
    """Merge two ranked lists via RRF — robust to score-scale mismatch."""
    scores: dict[str, float] = defaultdict(float)
    for rank, chunk in enumerate(semantic_hits):
        scores[chunk.id] += 1.0 / (k + rank + 1)
    for rank, chunk in enumerate(keyword_hits):
        scores[chunk.id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])

-- HNSW index on chunks
CREATE INDEX chunks_embedding_idx ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- BM25 via tsv column
ALTER TABLE chunks ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX chunks_tsv_idx ON chunks USING gin(tsv);

Tradeoffs we accept

Lever	Alternative	Chosen
Vendor risk	Pinecone managed	pgvector — accept slightly worse recall@1 for one-database simplicity
Operational burden	Weaviate / Qdrant as a service	pgvector — Postgres we already operate
Recall vs filter cost	Two indexes + intersect	Hybrid + RRF — accept the rerank step as the precision gate
Compliance	Managed cloud vector DB	All retrieval inside our VPC
Reranking	None (faster)	Cross-encoder on top-20 — accept ~80ms latency for 8-12pp recall@5 gain

Consequences (positive)

Single database, single backup, single monitoring stack. Ops cost ~$0 over baseline Postgres.
Retrieval queries can JOIN against tenants, prompts, traces — WHERE tenant_id = ? without cross-system reconciliation.
Cross-encoder rerank delivers the precision win Pinecone alone wouldn't have given us.
HNSW on m=16, ef_construction=64 returns p95 < 30ms on ~2k-chunk corpora — within the M01 SystemContract latency floor.
Hybrid + RRF is robust to score-scale mismatch between semantic and keyword paths — no tuning headaches.

Consequences (negative)

pgvector recall@1 is ~2-3pp behind Pinecone on adversarial queries. Mitigation: cross-encoder rerank closes the gap on top-5.
HNSW build time scales with corpus size. At 100k+ chunks, ANALYZE + VACUUM windows have to be planned — we've measured this on a sample 50k corpus and it's manageable through M03; will revisit at scale.
BM25 tsv column adds ~25% storage per chunk. We accept this for the hybrid retrieval gain.
Cross-encoder is a separate model load (~80MB). On cold containers, first query takes ~1s longer. Mitigation: warmup ping in M04 health check.

Reversal plan

Pinecone swap (full vendor migration): ~2 engineer-weeks. The retrieval interface is a Protocol — workers don't know they're hitting pgvector. Migration:

Stand up Pinecone index in parallel (~3 days).
Write Pinecone implementation of the Retriever Protocol (~2 days).
Dual-write phase (~3 days).
Cut over and decommission pgvector path (~2 days).

Trigger: corpus exceeds 5M chunks per tenant and cross-encoder rerank stops closing the recall gap.

Weaviate / Qdrant swap: ~1.5 engineer-weeks. Similar pattern, lower vendor lock. Use when compliance prefers self-hosted-but-not-Postgres.

Drop hybrid (semantic-only): ~2 engineer-days. Use only if BM25 path consistently underperforms in eval (we don't see this — keyword path saves ~15% of queries on rare-token recall).

References

retrieval/semantic.py — semantic search via <=>
retrieval/keyword.py — BM25 via Postgres FTS
retrieval/hybrid.py — RRF fusion
retrieval/rerank.py — cross-encoder
embeddings/schema.sql — HNSW + tsv column
ADR-001 (SystemContract correctness_floor — what retrieval gates against)
ADR-005 (DEPRECATED — multi-tenant via row-filter on shared index, reverted to per-tenant index)