ADR-001: Use pgvector + HNSW over Qdrant / Pinecone / Weaviate | AI Retrieval Platform

Context

The platform serves hybrid (semantic + lexical + reranked) search over document corpora of up to roughly 1M vectors. The vector index is on the hot path of every query: it has to meet the <100 ms P99 budget while staying operable by a single team. The options considered:

Pinecone — managed, opinionated, fastest path to a hosted index.
Qdrant — open-source-with-managed, broader query feature set (payload filters as first-class), strong gRPC story.
Weaviate — open-source, multi-modal, schema-first.
pgvector — Postgres extension. SQL is the API; the vector index is just another table with an HNSW or IVFFlat index.

We are building a reference platform for tutorial purposes — the choice has to be reproducible by a learner on a laptop in <15 minutes and survive a real production deploy.

Decision

Adopt pgvector + HNSW, with Qdrant kept as a documented alternative in docker-compose.qdrant.yml.

-- seed/01_create_tables.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id UUID PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata JSONB,
  ts_content tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- migrations/tune_hnsw.sql
CREATE INDEX documents_embedding_hnsw_idx
  ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

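-- Query-time accuracy/latency knob. SET is session-scoped: running it in a
-- migration does not persist, so the application must issue it on each
-- connection (or use SET LOCAL inside the query's transaction).
SET hnsw.ef_search = 40;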

# Module 02's recall benchmark sweeps ef_search 10–200 vs p50 latency
# Result: ef_search=40 hits ~0.81 recall@10 with hybrid+rerank
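
To make the recall@10 figure concrete, here is a minimal sketch of how recall@k can be computed for one query and averaged over a sweep. It is not the code in scripts/benchmark_hnsw.py; the function names are hypothetical, and it assumes the ground truth is the exact (brute-force) top-k ids, whereas the real harness may use labelled relevant documents instead.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def recall_at_k(approx_ids: Sequence[Hashable],
                exact_ids: Sequence[Hashable],
                k: int = 10) -> float:
    """Fraction of the exact top-k ids that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[Hashable], Sequence[Hashable]]],
                     k: int = 10) -> float:
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Running this once per ef_search value, alongside the measured latency, yields the points of a recall-vs-latency curve.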

Tradeoffs we accept

Lever	Pinecone	Qdrant	Weaviate	pgvector (chosen)
Day-1 setup	Vendor account	Self-host or managed	Self-host or managed	`CREATE EXTENSION vector` — 30 seconds
Single-store JOIN with metadata	Build it	Payload filter (good)	Limited	Native SQL `WHERE` + JSONB
Tutorial reproducibility	Cloud account	Docker container	Docker container	Same container as the rest of the app
P99 latency at 1M vectors	<50 ms	<50 ms	<100 ms	<100 ms (with HNSW + ef_search tuned)
Full-text search (BM25)	Build it / external	Limited	Built-in	Native (`tsvector` + GIN; core Postgres ranks with `ts_rank`, which approximates but is not true BM25; see ADR-002)
Operational footprint	Zero (managed)	One container	One container	Zero new infra (already running Postgres)
Vendor lock-in	High	None	None	None
Cost at <10M vectors	$$	$	$	No added cost (runs on the existing RDS db.t4g.medium baseline)

We optimize for single-store hybrid retrieval and operational parsimony. The hybrid + rerank pipeline (ADR-002 + ADR-003) reads both the vector index and the BM25 GIN index in a single SQL statement, which is only possible when both live in Postgres. A separate vector store would force the application layer to fan out to two stores and merge the results with reciprocal rank fusion (RRF), which raises cross-store consistency questions.
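
For comparison, this is roughly what the application-layer fusion step would look like if the vector and lexical results came from two separate stores. It is a minimal sketch of reciprocal rank fusion, not the ADR-002 implementation; the function name is hypothetical, and the constant k=60 is the value commonly used in the RRF literature, assumed here because ADR-002's setting is not shown.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse several ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))


vector_hits = ["a", "b", "c"]    # ids from the vector store, best first
lexical_hits = ["b", "d", "a"]   # ids from the lexical store, best first
fused = rrf_fuse([vector_hits, lexical_hits])
```

The fusion itself is short. The cost of the two-store design lies in the two network round trips and in keeping both stores consistent with each other.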

Pinecone is the right answer if the team has zero Postgres operations expertise. Qdrant is the right answer if vector-native payload filters are a frequent workload. Both are documented as exit ramps.

Consequences (positive)

Single SQL query can run vector + BM25 + JSONB metadata filter and return ranked results — no fan-out, no cross-store joins.
HNSW + GIN indexes live next to the source-of-truth row, so cascade-delete (GDPR — see Module 05) is one transaction.
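Backups are pg_dump. Per-tenant restore reduces to a filtered export and re-import (WHERE tenant_id = ...); this assumes a tenant_id column, which the schema excerpt above does not show.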
Local development is one container (the same pgvector:pg16 image as production).
Bench harness (scripts/benchmark_hnsw.py) sweeps ef_search 10–200 and produces a recall-vs-latency curve — Module 02's tuning artifact.
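
The single-query claim above can be sketched as follows. This is an illustration under assumptions, not the query in api/main.py: the helper name, the candidate pool size, and the RRF constant are hypothetical, and ts_rank stands in for the lexical score because ADR-002's scoring is not reproduced here. The sketch only builds the statement and its parameters, so it runs without a database.

```python
import json
from typing import Any, Dict, Optional, Sequence, Tuple

# One statement: vector candidates, lexical candidates, optional JSONB filter,
# then reciprocal rank fusion over the two ranked lists.
_HYBRID_SQL = """
WITH vec AS (
    SELECT id,
           row_number() OVER (ORDER BY embedding <=> %(qvec)s::vector) AS rnk
    FROM documents
    WHERE TRUE {filter_clause}
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT %(pool)s
),
lex AS (
    SELECT id,
           row_number() OVER (ORDER BY ts_rank(ts_content, q) DESC) AS rnk
    FROM documents, plainto_tsquery('english', %(qtext)s) AS q
    WHERE ts_content @@ q {filter_clause}
    ORDER BY ts_rank(ts_content, q) DESC
    LIMIT %(pool)s
)
SELECT id,
       COALESCE(1.0 / (%(rrf_k)s + vec.rnk), 0.0)
     + COALESCE(1.0 / (%(rrf_k)s + lex.rnk), 0.0) AS score
FROM vec
FULL OUTER JOIN lex USING (id)
ORDER BY score DESC
LIMIT %(k)s
"""


def build_hybrid_query(query_text: str,
                       query_vector: Sequence[float],
                       k: int = 10,
                       pool: int = 50,
                       metadata_filter: Optional[Dict[str, Any]] = None,
                       rrf_k: int = 60) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params) suitable for a psycopg2 cursor.execute(sql, params)."""
    # An empty or missing filter adds no clause, so rows with NULL metadata still match.
    filter_clause = "AND metadata @> %(filter)s::jsonb" if metadata_filter else ""
    params: Dict[str, Any] = {
        # pgvector accepts the text form '[x1,x2,...]' cast to vector.
        "qvec": "[" + ",".join(repr(float(x)) for x in query_vector) + "]",
        "qtext": query_text,
        "pool": pool,
        "rrf_k": rrf_k,
        "k": k,
    }
    if metadata_filter:
        params["filter"] = json.dumps(metadata_filter)
    return _HYBRID_SQL.format(filter_clause=filter_clause), params
```

With an HNSW index the metadata filter is applied to the candidates the index returns, so a selective filter can yield fewer than pool rows; that is one reason the pool is kept larger than k.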

Consequences (negative)

No native gRPC streaming. Postgres protocol is the only on-ramp; high-fanout services that need binary streaming would prefer Qdrant.
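Index build is slow on small instances. A 1M-vector HNSW build takes ~10–15 minutes on db.t4g.medium. pgvector 0.6.0 and later can parallelize HNSW builds (governed by max_parallel_maintenance_workers), and builds are fastest when the graph fits in maintenance_work_mem, but a 2-vCPU, 4 GiB instance leaves little headroom for either. Mitigation: batch during off-hours; ADR-004 makes incremental updates cheap.
No managed UI for browsing the vector store. A learner queries via SQL or psql directly. This is fine for the tutorial, but production teams typically front it with a dashboard.
Memory ceiling. HNSW search is fast only while the index pages stay cached in memory (shared_buffers plus the OS page cache). 1M × 1536-dim float32 vectors are about 6 GB before graph overhead, which already exceeds the 4 GiB of a db.t4g.medium, so RDS instance sizing matters at scale.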
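
The memory ceiling can be estimated with back-of-the-envelope arithmetic. The sketch below counts only the raw vector payload and is therefore a lower bound: HNSW graph links, page overhead, and the heap copy of each row are deliberately left out. The helper names are hypothetical. The 2-byte case corresponds to half-precision storage such as pgvector's halfvec type (available from 0.7.0), which this ADR does not adopt.

```python
def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Bytes needed for the vector components alone (float32 by default)."""
    return n_vectors * dims * bytes_per_dim


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


full_precision_gib = to_gib(raw_vector_bytes(1_000_000, 1536))
half_precision_gib = to_gib(raw_vector_bytes(1_000_000, 1536, bytes_per_dim=2))
print(f"float32: {full_precision_gib:.2f} GiB, float16: {half_precision_gib:.2f} GiB")
# prints "float32: 5.72 GiB, float16: 2.86 GiB"
```

Against the 4 GiB of RAM on a db.t4g.medium, the full-precision payload alone does not fit, so part of the index would be read from disk on that instance class.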

Reversal plan

The retrieval interface is the /search and /search/hybrid endpoints in api/main.py, both of which query the Postgres index through psycopg2. Replacement is bounded to four steps:

Add api/qdrant_client.py (or pinecone_client.py) with the same search(query_vector, k, filter) signature.
Switch the search endpoint behind a feature flag.
Re-run Module 02's eval harness — recall@10 and MRR assertions in scripts/eval.py will fail loud if the swap regresses quality.
Cut over after a 1-week soak with shadow traffic.
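
The steps above can be sketched as a small backend interface plus a flag. This is a hedged illustration, not the contents of api/main.py: the Protocol name, the VECTOR_BACKEND environment variable, the in-memory stand-in, and the overlap helper are all hypothetical. Only the search(query_vector, k, filter) signature comes from this ADR, which is why the parameter keeps the name filter even though it shadows the Python builtin.

```python
import os
from typing import Any, Dict, List, Mapping, Optional, Protocol, Sequence, Tuple

Hit = Tuple[str, float]  # (document id, score)


class VectorBackend(Protocol):
    """Shape that a pgvector, Qdrant, or Pinecone client would each satisfy."""

    def search(self, query_vector: Sequence[float], k: int,
               filter: Optional[Dict[str, Any]] = None) -> List[Hit]: ...


class InMemoryBackend:
    """Stand-in backend so the sketch runs without a database."""

    def __init__(self, hits: List[Hit]) -> None:
        self._hits = hits

    def search(self, query_vector: Sequence[float], k: int,
               filter: Optional[Dict[str, Any]] = None) -> List[Hit]:
        return self._hits[:k]


def pick_backend(backends: Mapping[str, VectorBackend],
                 env: Mapping[str, str] = os.environ) -> VectorBackend:
    """Feature flag: choose the backend named by VECTOR_BACKEND (default pgvector)."""
    name = env.get("VECTOR_BACKEND", "pgvector")
    if name not in backends:
        raise ValueError(f"unknown vector backend: {name!r}")
    return backends[name]


def shadow_overlap(primary: List[Hit], shadow: List[Hit]) -> float:
    """Share of the primary result ids that the shadow backend also returned."""
    primary_ids = {doc_id for doc_id, _ in primary}
    if not primary_ids:
        return 1.0
    shadow_ids = {doc_id for doc_id, _ in shadow}
    return len(primary_ids & shadow_ids) / len(primary_ids)
```

During the soak, each request would be answered by the primary backend while the same query is replayed against the shadow backend and the overlap is logged; a falling overlap is a cue to investigate before cutting over.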

The starter kit ships docker-compose.qdrant.yml for exactly this — Qdrant is one docker compose up away if you outgrow pgvector.

Estimated effort: 1–2 engineer-weeks for a tested swap. Reversible.
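
The quality gate in the reversal steps relies on MRR alongside recall@10. As a minimal sketch of what such an assertion might look like (this is not the code in scripts/eval.py; the function names and the 0.02 tolerance are assumptions), mean reciprocal rank averages 1/rank of the first relevant result over all evaluation queries:

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
        results: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in results]
    return sum(scores) / len(scores) if scores else 0.0


def check_no_regression(baseline_mrr: float, candidate_mrr: float,
                        tolerance: float = 0.02) -> None:
    """Fail loudly if the candidate backend's MRR falls below baseline minus tolerance."""
    if candidate_mrr < baseline_mrr - tolerance:
        raise AssertionError(
            f"MRR regressed: baseline={baseline_mrr:.3f}, candidate={candidate_mrr:.3f}"
        )
```

The baseline would be the MRR recorded for pgvector on the evaluation set, and the candidate the MRR measured after switching the flag to the replacement store.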

References

seed/01_create_tables.sql (schema with HNSW + GIN)
migrations/tune_hnsw.sql (M / ef_construction / ef_search tuning)
scripts/benchmark_hnsw.py (recall vs latency sweep)
docker-compose.qdrant.yml (alternative vector store)
ADR-002 (BM25 + RRF — depends on single-store assumption)
ADR-005 (Deprecated single-embedding-version — orthogonal to vector store choice)