Context
The platform serves hybrid (semantic + lexical + reranked) search over document corpora that can grow to 1M vectors. The vector index is on the hot path of every query: it has to meet the <100 ms P99 budget while staying operable by a single team. The candidates:
- Pinecone — managed, opinionated, fastest path to a hosted index.
- Qdrant — open-source-with-managed, broader query feature set (payload filters as first-class), strong gRPC story.
- Weaviate — open-source, multi-modal, schema-first.
- pgvector — Postgres extension. SQL is the API; the vector index is just another table with an HNSW or IVFFlat index.
We are building a reference platform for tutorial purposes — the choice has to be reproducible by a learner on a laptop in <15 minutes and survive a real production deploy.
Decision
Adopt pgvector + HNSW, with Qdrant kept as a documented alternative
in docker-compose.qdrant.yml.
-- seed/01_create_tables.sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(1536) NOT NULL,
    metadata JSONB,
    ts_content tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    created_at TIMESTAMPTZ DEFAULT now()
);
-- migrations/tune_hnsw.sql
CREATE INDEX documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 40; -- query-time accuracy/latency knob (session-scoped: applies only to the current connection)
-- Module 02's recall benchmark sweeps ef_search 10–200 vs p50 latency
-- Result: ef_search=40 hits ~0.81 recall@10 with hybrid+rerank
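For readers new to the metric: one common definition of recall@k is the share of the true top-k neighbours (from an exact scan) that the approximate search also returns. The sketch below assumes that definition; the benchmark script may build its ground truth differently (for example from labelled relevant documents), and the function name and sample ids are illustrative only.

```python
from typing import Hashable, Sequence


def recall_at_k(exact: Sequence[Hashable], approx: Sequence[Hashable], k: int = 10) -> float:
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact[:k])
    if not truth:
        return 0.0
    hits = len(truth & set(approx[:k]))
    return hits / len(truth)


# Illustrative ids only: 4 of the 5 exact neighbours were retrieved.
exact_ids = ["a", "b", "c", "d", "e"]
approx_ids = ["a", "c", "b", "x", "e"]
print(recall_at_k(exact_ids, approx_ids, k=5))  # 0.8
```

Averaging this value over a query set at each ef_search setting gives one point on a recall-vs-latency curve.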
Tradeoffs we accept
| Lever | Pinecone | Qdrant | Weaviate | pgvector (chosen) |
|---|---|---|---|---|
| Day-1 setup | Vendor account | Self-host or managed | Self-host or managed | CREATE EXTENSION vector — 30 seconds |
| Single-store JOIN with metadata | Build it | Payload filter (good) | Limited | Native SQL WHERE + JSONB |
| Tutorial reproducibility | Cloud account | Docker container | Docker container | Same container as the rest of the app |
| P99 latency at 1M vectors | <50 ms | <50 ms | <100 ms | <100 ms (with HNSW + ef_search tuned) |
| Full-text search (BM25) | Build it / external | Limited | Built-in | Native (tsvector + GIN — see ADR-002) |
| Operational footprint | Zero (managed) | One container | One container | Zero new infra (already running Postgres) |
| Vendor lock-in | High | None | None | None |
| Cost at <10M vectors | $$ | $ | $ | No extra cost (extension on the existing RDS db.t4g.medium baseline) |
We optimize for single-store hybrid retrieval + operational parsimony. The hybrid + rerank pipeline (ADR-002 + ADR-003) reads both the vector index AND the BM25 GIN index in the same SQL query plan — that's only possible when both are in Postgres. A separate vector store would force a fan-out + RRF in the application layer with cross-store consistency questions.
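To make the alternative concrete, here is a minimal sketch of the application-layer fusion a separate vector store would require: two ranked id lists fetched independently, then merged with Reciprocal Rank Fusion. The constant k = 60 is a commonly used default and an assumption here, as are the function name and the sample ids; ADR-002 may specify different values.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse ranked id lists with Reciprocal Rank Fusion: score = sum of 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by the id's string form for determinism.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


vector_hits = ["d1", "d2", "d3"]   # hypothetical result from a vector store
lexical_hits = ["d3", "d1", "d4"]  # hypothetical result from a lexical index
print(rrf_fuse([vector_hits, lexical_hits]))  # ['d1', 'd3', 'd2', 'd4']
```

The fusion itself is cheap; the cost of the two-store design is that the two input lists come from systems that can disagree about which documents exist at query time.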
Pinecone is the right answer if the team has zero Postgres operations expertise. Qdrant is the right answer if vector-native payload filters are a frequent workload. Both are documented as exit ramps.
Consequences (positive)
- Single SQL query can run vector + BM25 + JSONB metadata filter and return ranked results — no fan-out, no cross-store joins.
- HNSW + GIN indexes live next to the source-of-truth row, so cascade-delete (GDPR — see Module 05) is one transaction.
- Backups are pg_dump. Per-tenant restore is WHERE tenant_id.
- Local development is one container (the same pgvector:pg16 image as production).
- Bench harness (scripts/benchmark_hnsw.py) sweeps ef_search 10–200 and produces a recall-vs-latency curve, which is Module 02's tuning artifact.
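The single-query claim in the first bullet could look roughly like the sketch below, written against the documents table from the Decision section. It is not the query from ADR-002, which is not shown here: the CTE names, the candidate pool size, the RRF constant 60, and the use of ts_rank_cd as the lexical score are all assumptions. The builder returns the SQL plus a parameter dict in psycopg2's named-placeholder style and does not open a connection.

```python
import json
from typing import Any, Dict, Optional, Sequence, Tuple

# Sketch only: one statement that ranks by vector distance and by full-text
# score, applies the same JSONB containment filter to both, and fuses with RRF.
HYBRID_SQL = """
WITH semantic AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s::vector) AS rank
    FROM documents
    WHERE COALESCE(metadata, '{}'::jsonb) @> %(filter)s::jsonb
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT %(pool)s
),
lexical AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(ts_content, q) DESC) AS rank
    FROM documents, plainto_tsquery('english', %(qtext)s) AS q
    WHERE ts_content @@ q
      AND COALESCE(metadata, '{}'::jsonb) @> %(filter)s::jsonb
    ORDER BY ts_rank_cd(ts_content, q) DESC
    LIMIT %(pool)s
)
SELECT COALESCE(s.id, l.id) AS id,
       COALESCE(1.0 / (%(rrf_k)s + s.rank), 0.0)
     + COALESCE(1.0 / (%(rrf_k)s + l.rank), 0.0) AS rrf_score
FROM semantic s
FULL OUTER JOIN lexical l ON s.id = l.id
ORDER BY rrf_score DESC
LIMIT %(k)s
"""


def to_vector_literal(values: Sequence[float]) -> str:
    """Render floats in pgvector's text input format, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_hybrid_query(
    query_text: str,
    query_vector: Sequence[float],
    k: int = 10,
    metadata_filter: Optional[Dict[str, Any]] = None,
    pool: int = 50,   # assumed candidate pool per branch
    rrf_k: int = 60,  # assumed RRF constant
) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params); an empty filter matches every row."""
    params = {
        "qtext": query_text,
        "qvec": to_vector_literal(query_vector),
        "filter": json.dumps(metadata_filter or {}),
        "k": k,
        "pool": pool,
        "rrf_k": rrf_k,
    }
    return HYBRID_SQL, params


sql, params = build_hybrid_query(
    "hnsw tuning", [0.1, 0.2], k=5, metadata_filter={"tenant_id": "t1"}
)
print(params["qvec"], params["filter"])
```

With a psycopg2 cursor the pair is passed straight through as cur.execute(sql, params). The demo vector has two dimensions for brevity; against the real table it must have 1536.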
Consequences (negative)
- No native gRPC streaming. Postgres protocol is the only on-ramp; high-fanout services that need binary streaming would prefer Qdrant.
- Index build is single-threaded by default. A 1M-vector HNSW build takes ~10–15 minutes on db.t4g.medium. Mitigation: batch during off-hours; ADR-004 makes incremental updates cheap.
- No managed UI for browsing the vector store. A learner queries via SQL or psql directly. This is fine for the tutorial, but in production teams typically front it with a dashboard.
- Memory ceiling. A 1M × 1536-dim HNSW index lives in shared_buffers. RDS instance sizing matters at scale.
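A back-of-the-envelope check on the memory ceiling: pgvector stores each vector as 4 bytes per dimension plus an 8-byte header, so the raw vectors alone set a floor on the working set. The sketch below computes only that floor; HNSW graph links and page overhead come on top and are not estimated here. The function name is illustrative.

```python
def raw_vector_bytes(n_vectors: int, dims: int) -> int:
    """Lower bound on storage: 4 bytes per dimension plus an 8-byte header per vector."""
    return n_vectors * (4 * dims + 8)


size = raw_vector_bytes(1_000_000, 1536)
print(f"{size:,} bytes = {size / 2**30:.2f} GiB")  # 6,152,000,000 bytes = 5.73 GiB
```

Comparing that figure with the memory of the chosen instance class indicates how much of the index can stay cached at 1M vectors.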
Reversal plan
The retrieval interface is api/main.py's /search + /search/hybrid
endpoints, both of which call psycopg2 against the local index.
Replacement is bounded:
- Add api/qdrant_client.py (or pinecone_client.py) with the same search(query_vector, k, filter) signature.
- Switch the search endpoint behind a feature flag.
- Re-run Module 02's eval harness: the recall@10 and MRR assertions in scripts/eval.py will fail loudly if the swap regresses quality.
- Cut over after a 1-week soak with shadow traffic.
The starter kit ships docker-compose.qdrant.yml for exactly this —
Qdrant is one docker compose up away if you outgrow pgvector.
Estimated effort: 1-2 engineer-weeks for a tested swap. Reversible.
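One way to keep the swap bounded is to have both backends satisfy a single interface and select between them with a flag. The sketch below assumes an environment variable named VECTOR_BACKEND and uses stand-in classes that return canned hits; none of these names come from the repository, and the real clients would wrap psycopg2 and the Qdrant SDK respectively.

```python
import os
from typing import Any, Dict, List, Mapping, Optional, Protocol, Sequence, Tuple

Hit = Tuple[str, float]  # (document id, score)


class SearchBackend(Protocol):
    """The shared signature from the reversal plan: search(query_vector, k, filter)."""

    def search(
        self,
        query_vector: Sequence[float],
        k: int,
        filter: Optional[Dict[str, Any]] = None,
    ) -> List[Hit]: ...


class PgvectorBackend:
    """Stand-in for the existing psycopg2 path; returns canned hits in this sketch."""

    def search(self, query_vector, k, filter=None):
        return [("pg-doc-1", 0.92), ("pg-doc-2", 0.88)][:k]


class QdrantBackend:
    """Stand-in for a hypothetical api/qdrant_client.py implementation."""

    def search(self, query_vector, k, filter=None):
        return [("qd-doc-1", 0.91), ("qd-doc-2", 0.87)][:k]


def pick_backend(env: Mapping[str, str] = os.environ) -> SearchBackend:
    """Choose the backend from the (assumed) VECTOR_BACKEND flag; default is pgvector."""
    name = env.get("VECTOR_BACKEND", "pgvector")
    if name == "pgvector":
        return PgvectorBackend()
    if name == "qdrant":
        return QdrantBackend()
    raise ValueError(f"unknown vector backend: {name!r}")


backend = pick_backend({"VECTOR_BACKEND": "qdrant"})
print(backend.search([0.0, 0.1], k=1))  # [('qd-doc-1', 0.91)]
```

Because the endpoints depend only on the interface, the eval harness can be run against either backend by flipping the flag, and shadow traffic can call both.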
References
- seed/01_create_tables.sql (schema with HNSW + GIN)
- migrations/tune_hnsw.sql (M / ef_construction / ef_search tuning)
- scripts/benchmark_hnsw.py (recall vs latency sweep)
- docker-compose.qdrant.yml (alternative vector store)
- ADR-002 (BM25 + RRF; depends on single-store assumption)
- ADR-005 (Deprecated single-embedding-version — orthogonal to vector store choice)