Skip to content
Back to Full-Stack AI Platform

pgvector + HNSW + hybrid RRF + cross-encoder rerank, not a dedicated vector DB

✓ AcceptedFull-Stack AI Platform02 — Retrieval System & Knowledge Layer
By AI-DE Engineering Team·Stakeholders: ML engineer, platform owner, infra lead

Context

The retrieval layer is the single highest-leverage decision in a RAG platform — get it wrong and every downstream layer (LLM generation, eval, observability) is debugging the wrong problem. Vendor choices were on the table at v0:

  • Pinecone (managed). Best-in-class HNSW + filtering. ~$70-200/mo at our scale. Vendor lock; data leaves our perimeter.
  • Weaviate (self-hosted). Open source. Schema-driven. Adds a service to operate.
  • Qdrant (self-hosted). Open source. Lean. Adds a service to operate.
  • pgvector (Postgres extension). Open source. Lives in Postgres we already operate. ANN via HNSW or IVFFlat. Native joins with the rest of our data.

Constraints driving the pick:

  1. One database, not two. The platform already needs Postgres for tenants, prompt registry, eval dataset, traces, user feedback. Adding a second database doubles ops surface.
  2. Hybrid retrieval is non-negotiable. Pure semantic loses on rare-token queries (proper nouns, error codes); pure BM25 loses on synonym-heavy queries. We need both, then a fusion step.
  3. Reranking matters more than retrieval. A cross-encoder on top-50 candidates beats raw retrieval@5 quality by 8-12pp in our gold-set eval. The retrieval index just has to surface the candidate pool.
  4. Compliance: documents stay in our VPC. Pinecone's managed offering is a non-starter for tenants on the compliance track.

Decision

Three-stage retrieval, all on pgvector + Postgres:

  1. Index: chunks table with vector(384) column (sentence-transformers all-MiniLM-L6-v2), HNSW index, tsv column for BM25.
  2. Hybrid retrieve: semantic (<=> cosine distance, top-50) + BM25 (Postgres FTS, top-50) → reciprocal rank fusion to merge. Returns top-20.
  3. Rerank: cross-encoder ms-marco-MiniLM-L-6-v2 over the top-20 → top-5 cited candidates with grounded scores.
# retrieval/hybrid.py
def reciprocal_rank_fusion(
    semantic_hits: list[Chunk], keyword_hits: list[Chunk], k: int = 60
) -> list[ScoredChunk]:
    """Merge two ranked lists via RRF — robust to score-scale mismatch."""
    scores: dict[str, float] = defaultdict(float)
    for rank, chunk in enumerate(semantic_hits):
        scores[chunk.id] += 1.0 / (k + rank + 1)
    for rank, chunk in enumerate(keyword_hits):
        scores[chunk.id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])
-- HNSW index on chunks
CREATE INDEX chunks_embedding_idx ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- BM25 via tsv column
ALTER TABLE chunks ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX chunks_tsv_idx ON chunks USING gin(tsv);

Tradeoffs we accept

LeverAlternativeChosen
Vendor riskPinecone managedpgvector — accept slightly worse recall@1 for one-database simplicity
Operational burdenWeaviate / Qdrant as a servicepgvector — Postgres we already operate
Recall vs filter costTwo indexes + intersectHybrid + RRF — accept the rerank step as the precision gate
ComplianceManaged cloud vector DBAll retrieval inside our VPC
RerankingNone (faster)Cross-encoder on top-20 — accept ~80ms latency for 8-12pp recall@5 gain

Consequences (positive)

  • Single database, single backup, single monitoring stack. Ops cost ~$0 over baseline Postgres.
  • Retrieval queries can JOIN against tenants, prompts, traces — WHERE tenant_id = ? without cross-system reconciliation.
  • Cross-encoder rerank delivers the precision win Pinecone alone wouldn't have given us.
  • HNSW on m=16, ef_construction=64 returns p95 < 30ms on ~2k-chunk corpora — within the M01 SystemContract latency floor.
  • Hybrid + RRF is robust to score-scale mismatch between semantic and keyword paths — no tuning headaches.

Consequences (negative)

  • pgvector recall@1 is ~2-3pp behind Pinecone on adversarial queries. Mitigation: cross-encoder rerank closes the gap on top-5.
  • HNSW build time scales with corpus size. At 100k+ chunks, ANALYZE + VACUUM windows have to be planned — we've measured this on a sample 50k corpus and it's manageable through M03; will revisit at scale.
  • BM25 tsv column adds ~25% storage per chunk. We accept this for the hybrid retrieval gain.
  • Cross-encoder is a separate model load (~80MB). On cold containers, first query takes ~1s longer. Mitigation: warmup ping in M04 health check.

Reversal plan

Pinecone swap (full vendor migration): ~2 engineer-weeks. The retrieval interface is a Protocol — workers don't know they're hitting pgvector. Migration:

  1. Stand up Pinecone index in parallel (~3 days).
  2. Write Pinecone implementation of the Retriever Protocol (~2 days).
  3. Dual-write phase (~3 days).
  4. Cut over and decommission pgvector path (~2 days).

Trigger: corpus exceeds 5M chunks per tenant and cross-encoder rerank stops closing the recall gap.

Weaviate / Qdrant swap: ~1.5 engineer-weeks. Similar pattern, lower vendor lock. Use when compliance prefers self-hosted-but-not-Postgres.

Drop hybrid (semantic-only): ~2 engineer-days. Use only if BM25 path consistently underperforms in eval (we don't see this — keyword path saves ~15% of queries on rare-token recall).

References

  • retrieval/semantic.py — semantic search via <=>
  • retrieval/keyword.py — BM25 via Postgres FTS
  • retrieval/hybrid.py — RRF fusion
  • retrieval/rerank.py — cross-encoder
  • embeddings/schema.sql — HNSW + tsv column
  • ADR-001 (SystemContract correctness_floor — what retrieval gates against)
  • ADR-005 (DEPRECATED — multi-tenant via row-filter on shared index, reverted to per-tenant index)
Built into the project

This decision shipped as part of Full-Stack AI Platform — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open