Use pgvector + HNSW over Qdrant / Pinecone / Weaviate

✓ Accepted · AI Retrieval Platform · 01: Embed, Store & Search
By AI-DE Engineering Team · Stakeholders: retrieval engineer, data infra lead, security reviewer

Context

The platform serves hybrid search (semantic + lexical + reranked) over document corpora that can reach 1M documents. The vector index sits on the hot path of every query: it has to meet the <100 ms P99 latency budget while remaining operable by a single team. The candidate stores:

  1. Pinecone — managed, opinionated, fastest path to a hosted index.
  2. Qdrant — open-source-with-managed, broader query feature set (payload filters as first-class), strong gRPC story.
  3. Weaviate — open-source, multi-modal, schema-first.
  4. pgvector — Postgres extension. SQL is the API; the vector index is just another table with an HNSW or IVFFlat index.

We are building a reference platform for tutorial purposes — the choice has to be reproducible by a learner on a laptop in <15 minutes and survive a real production deploy.

Decision

Adopt pgvector + HNSW, with Qdrant kept as a documented alternative in docker-compose.qdrant.yml.

-- seed/01_create_tables.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id UUID PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata JSONB,
  ts_content tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- migrations/tune_hnsw.sql
CREATE INDEX documents_embedding_hnsw_idx
  ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

SET hnsw.ef_search = 40;  -- query-time accuracy/latency knob (session-scoped; 40 is also pgvector's default)
-- Module 02's recall benchmark sweeps ef_search 10–200 vs p50 latency
-- Result: ef_search=40 hits ~0.81 recall@10 with hybrid+rerank
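
The recall@10 figure above depends on how recall is defined. Below is a minimal Python sketch of the metric, assuming it is measured against exact (brute-force) top-k neighbours. The actual harness in scripts/benchmark_hnsw.py is not shown here and may instead score against labelled relevant documents. The function name and document IDs are hypothetical.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], exact: Sequence[str], k: int = 10) -> float:
    """Share of the exact top-k IDs that the approximate search also returned.

    retrieved: IDs returned by the HNSW query, best first.
    exact:     IDs from an exact scan of the same query, best first
               (assumption: this is the ground truth the benchmark uses).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact[:k])
    if not truth:
        return 0.0
    # Set intersection so a duplicated ID in `retrieved` is not counted twice.
    return len(set(retrieved[:k]) & truth) / len(truth)


# Hypothetical IDs: the approximate search found 2 of the 3 true neighbours.
score = recall_at_k(["a", "b", "x"], ["a", "b", "c"], k=3)
```

Sweeping ef_search then amounts to computing this score per setting and plotting it against the measured latency.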

Tradeoffs we accept

| Lever | Pinecone | Qdrant | Weaviate | pgvector (chosen) |
|---|---|---|---|---|
| Day-1 setup | Vendor account | Self-host or managed | Self-host or managed | CREATE EXTENSION vector (about 30 seconds) |
| Single-store JOIN with metadata | Build it | Payload filter (good) | Limited | Native SQL WHERE + JSONB |
| Tutorial reproducibility | Cloud account | Docker container | Docker container | Same container as the rest of the app |
| P99 latency at 1M vectors | <50 ms | <50 ms | <100 ms | <100 ms (with HNSW + ef_search tuned) |
| Full-text search (BM25) | Build it / external | Limited | Built-in | Native (tsvector + GIN; see ADR-002) |
| Operational footprint | Zero (managed) | One container | One container | Zero new infra (already running Postgres) |
| Vendor lock-in | High | None | None | None |
| Cost at <10M vectors | $$ | $ | $ | Free (RDS db.t4g.medium baseline) |

We optimize for single-store hybrid retrieval and operational parsimony. The hybrid + rerank pipeline (ADR-002 + ADR-003) reads both the HNSW vector index and the BM25 (tsvector + GIN) index in a single SQL statement, which is only possible when both live in Postgres. A separate vector store would force the application layer to fan out to two stores and fuse the results with RRF (reciprocal rank fusion), and would raise cross-store consistency questions.
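
For comparison, the application-layer fusion that a separate vector store would require can be sketched in a few lines of Python. This is a generic reciprocal rank fusion, not the project's code. The constant k=60 is the conventional default from the RRF literature and is an assumption here, as are the document IDs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse several ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Ties are broken by ID so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Hypothetical results from two separate stores for the same query.
vector_hits = ["d1", "d2", "d3"]   # e.g. from an external vector store
lexical_hits = ["d3", "d1", "d4"]  # e.g. from Postgres full-text search
fused = rrf_fuse([vector_hits, lexical_hits])
```

Note that the two input lists come from stores that may not agree on which documents exist at query time, which is the consistency question raised above.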

Pinecone is the right answer if the team has zero Postgres operations expertise. Qdrant is the right answer if vector-native payload filters are a frequent workload. Both are documented as exit ramps.

Consequences (positive)

  • Single SQL query can run vector + BM25 + JSONB metadata filter and return ranked results — no fan-out, no cross-store joins.
  • HNSW + GIN indexes live next to the source-of-truth row, so cascade-delete (GDPR — see Module 05) is one transaction.
  • Backups are plain pg_dump. Per-tenant restore reduces to a WHERE tenant_id filter, assuming rows carry a tenant_id column or metadata key.
  • Local development is one container (the same pgvector/pgvector:pg16 image as production).
  • Bench harness (scripts/benchmark_hnsw.py) sweeps ef_search 10–200 and produces a recall-vs-latency curve — Module 02's tuning artifact.
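
The single-query claim in the first bullet could look roughly like the sketch below, written for psycopg2-style named parameters against the documents table defined earlier. The CTE names, the candidate pool size, the RRF constant, and the "tenant" metadata key are illustrative assumptions; ADR-002 defines the real query.

```python
import json
from typing import Any, Dict, Optional, Sequence, Tuple

# Vector and lexical candidates are ranked separately, then fused with RRF,
# all inside one statement. Rows with NULL metadata do not pass the @> filter.
HYBRID_SQL = """
WITH vec AS (
    SELECT id, RANK() OVER (ORDER BY embedding <=> %(qvec)s::vector) AS rnk
    FROM documents
    WHERE metadata @> %(filter)s::jsonb
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT %(pool)s
),
lex AS (
    SELECT id, RANK() OVER (ORDER BY ts_rank(ts_content, q) DESC) AS rnk
    FROM documents, plainto_tsquery('english', %(qtext)s) AS q
    WHERE ts_content @@ q AND metadata @> %(filter)s::jsonb
    ORDER BY ts_rank(ts_content, q) DESC
    LIMIT %(pool)s
)
SELECT COALESCE(vec.id, lex.id) AS id,
       COALESCE(1.0 / (%(rrf_k)s + vec.rnk), 0.0)
     + COALESCE(1.0 / (%(rrf_k)s + lex.rnk), 0.0) AS score
FROM vec
FULL OUTER JOIN lex ON vec.id = lex.id
ORDER BY score DESC
LIMIT %(k)s;
"""


def build_hybrid_query(
    query_text: str,
    query_vector: Sequence[float],
    metadata_filter: Optional[Dict[str, Any]] = None,
    k: int = 10,
    pool: int = 50,
    rrf_k: int = 60,
) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params) for cursor.execute(); nothing is sent to a database here."""
    params = {
        # pgvector accepts the text form '[x1,x2,...]' cast to vector.
        "qvec": "[" + ",".join(repr(float(x)) for x in query_vector) + "]",
        "qtext": query_text,
        "filter": json.dumps(metadata_filter or {}),
        "pool": pool,
        "rrf_k": rrf_k,
        "k": k,
    }
    return HYBRID_SQL, params


sql, params = build_hybrid_query("reset password", [0.1, 0.2], {"tenant": "acme"})
```

With a live connection the pair would be passed straight to a cursor, for example cur.execute(sql, params). The two-element vector is only for illustration; the real column expects 1536 dimensions.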

Consequences (negative)

  • No native gRPC streaming. Postgres protocol is the only on-ramp; high-fanout services that need binary streaming would prefer Qdrant.
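  • Index build is slow. A 1M-vector HNSW build takes ~10–15 minutes on db.t4g.medium. Before pgvector 0.6.0 the build is single-threaded; later versions can parallelize it via max_parallel_maintenance_workers. Mitigation: batch during off-hours; ADR-004 makes incremental updates cheap.
  • No managed UI for browsing the vector store. A learner queries via SQL or psql directly. This is fine for the tutorial, but in production teams typically front it with a dashboard.
  • Memory ceiling. HNSW queries are fast only while the index stays resident in memory (shared_buffers plus the OS page cache). For 1M × 1536-dim float4 vectors the raw vectors alone are about 6 GB, which exceeds the 4 GiB of RAM on a db.t4g.medium, so RDS instance sizing matters at scale.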
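
To make the memory ceiling concrete, here is a back-of-envelope estimate in Python. It counts only the raw float4 vectors plus layer-0 graph links, and the 8 bytes per link is an assumed round number rather than pgvector's actual on-disk layout, so treat the result as a lower bound that ignores page and tuple overhead.

```python
def hnsw_lower_bound_bytes(
    n_vectors: int,
    dims: int,
    m: int = 16,
    bytes_per_float: int = 4,
    bytes_per_link: int = 8,  # assumption, not pgvector's real link size
) -> int:
    """Rough lower bound on HNSW index size: vectors plus layer-0 links."""
    vector_bytes = n_vectors * dims * bytes_per_float
    # HNSW allows up to 2 * m neighbours per node on the bottom layer.
    link_bytes = n_vectors * 2 * m * bytes_per_link
    return vector_bytes + link_bytes


estimate = hnsw_lower_bound_bytes(1_000_000, 1536)
print(f"{estimate / 2**30:.1f} GiB")  # prints "6.0 GiB"
```

Even this optimistic figure is larger than the memory of the baseline instance, so the working set will spill to disk unless the instance is sized up or the vector dimension is reduced.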

Reversal plan

The retrieval interface is api/main.py's /search + /search/hybrid endpoints, both of which call psycopg2 against the local index. Replacement is bounded:

  1. Add api/qdrant_client.py (or pinecone_client.py) with the same search(query_vector, k, filter) signature.
  2. Switch the search endpoint behind a feature flag.
  3. Re-run Module 02's eval harness — recall@10 and MRR assertions in scripts/eval.py will fail loud if the swap regresses quality.
  4. Cut over after a 1-week soak with shadow traffic.
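
The steps above can be sketched as a small interface plus a flag check. This is an illustration under stated assumptions, not the code in api/main.py: the Hit type, the StubBackend class, the use_candidate_store flag name, and the overlap metric for shadow traffic are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Hit:
    id: str
    score: float


class VectorSearch(Protocol):
    """The shared signature from step 1: search(query_vector, k, filter)."""

    def search(
        self,
        query_vector: Sequence[float],
        k: int,
        filter: Optional[Dict[str, Any]] = None,
    ) -> List[Hit]: ...


class StubBackend:
    """Stand-in that returns a fixed ranking; a real backend would wrap
    psycopg2 (pgvector) or a Qdrant/Pinecone client."""

    def __init__(self, ranked_ids: Sequence[str]) -> None:
        self._ranked_ids = list(ranked_ids)

    def search(self, query_vector, k, filter=None):
        return [
            Hit(id=doc_id, score=1.0 / (rank + 1))
            for rank, doc_id in enumerate(self._ranked_ids[:k])
        ]


def pick_backend(flags: Dict[str, bool], pgvector: VectorSearch,
                 candidate: VectorSearch) -> VectorSearch:
    """Step 2: route through a feature flag, defaulting to pgvector."""
    return candidate if flags.get("use_candidate_store", False) else pgvector


def shadow_overlap(primary: VectorSearch, shadow: VectorSearch,
                   query_vector: Sequence[float], k: int = 10) -> float:
    """Step 4: share of the primary top-k that the shadow store also returns."""
    primary_ids = {hit.id for hit in primary.search(query_vector, k)}
    shadow_ids = {hit.id for hit in shadow.search(query_vector, k)}
    return len(primary_ids & shadow_ids) / max(len(primary_ids), 1)


pg = StubBackend(["a", "b", "c"])
qd = StubBackend(["a", "c", "d"])
active = pick_backend({"use_candidate_store": False}, pg, qd)
overlap = shadow_overlap(pg, qd, query_vector=[0.0], k=3)
```

Keeping both backends behind one Protocol means the eval harness in step 3 can run unchanged against either store.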

The starter kit ships docker-compose.qdrant.yml for exactly this — Qdrant is one docker compose up away if you outgrow pgvector.

Estimated effort: 1-2 engineer-weeks for a tested swap. Reversible.

References

  • seed/01_create_tables.sql (schema with HNSW + GIN)
  • migrations/tune_hnsw.sql (M / ef_construction / ef_search tuning)
  • scripts/benchmark_hnsw.py (recall vs latency sweep)
  • docker-compose.qdrant.yml (alternative vector store)
  • ADR-002 (BM25 + RRF — depends on single-store assumption)
  • ADR-005 (Deprecated single-embedding-version — orthogonal to vector store choice)