ai-de.net/Projects/P14 · AI retrieval platform — pgvector + hybrid + RRF + cross-encoder
Last updated By AI-DE Engineering Team
EXPERT-tier · PRO unlocks Modules 01-02 · AI & vectors track · P14

Build a production retrieval platform with pgvector + RRF

Ship a real retrieval platform with pgvector + HNSW, BM25 + GIN, Reciprocal Rank Fusion, a cross-encoder reranker, hash-based incremental updates, an OpenAI function-calling agent, Redis semantic cache, and an SLA verifier that runs 1k queries with p50/p95/p99 latency and recall@10 checks. Modules 01-02 unlock with PRO; the full platform with EXPERT.

Timeline: 16 hours
Difficulty: Senior+
Stack: pgvector · BM25 · RRF · cross-encoder · Redis · OpenAI

The retrieval system-design portfolio piece for staff AI infra roles: 5 committed ADRs, a runnable cost-model CSV, a multi-region replication design, and a runbook with 6 P1/P2 failure scenarios you can defend in an architecture review.

By the end you will have wired
  • pgvector + HNSW index with M=16, ef_construction=64, ef_search tuned via a recall-vs-latency sweep
  • Hybrid retriever with BM25 + RRF (k=60) + cross-encoder reranker hitting ~0.81 recall@10 on 50 golden queries
  • Hash-based incremental embedding pipeline (~1% daily churn vs 100% nightly) with versioned model migration
  • OpenAI function-calling agent with short-term + long-term memory + cosine-similarity context retrieval
  • Prometheus + Grafana observability with semantic cache (50%+ hit-rate), A/B router, drift diagnosis
  • 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
PREREQ · SENIOR+ · Built for engineers shipping production search / RAG. Comfortable with Python (async, typing), SQL, and Postgres basics, plus at least one of: vector retrieval, IR metrics (recall / MRR / nDCG), or embedding-model migration. Not a “what is RAG” course.
ai-retrieval.platform · 5 modules · 1M-capable · hybrid + rerank · semantic cache · multi-region
recall + cost armed
Ingest + embed
embedding_pipeline.py: resume-on-failure · jobs table
incremental_manager: SHA-256 hash · ~1%/day churn
embedding_version: EmbeddingVersion enum · zero-downtime migrate
text-embedding-3-small: 1M-doc-capable batch
Versioned + reproducible · see ADR-004 + ADR-005

Hybrid retrieve
pgvector + HNSW: M=16 · ef_construction=64 · ef_search=40
BM25 + GIN: tsvector GENERATED · ts_rank
RRF fuser · k=60: 1.0 / (k + rank)
cross-encoder rerank: ms-marco-MiniLM-L-6-v2 · CPU <50ms
~0.81 recall@10 · see ADR-001 + ADR-002 + ADR-003

Cache + serve
Redis semantic cache: 0.95 cosine · 1h TTL · 50%+ hit-rate
ab_router: stable hash · t-test analysis
FastAPI · /search/hybrid: P99 < 100ms target
retrieval_agent: OpenAI tool_choice=auto + memory
A/B routed serving · see semantic_cache.py

Observe + scale
Prometheus + Grafana: p50/p95/p99 · recall · cost
sla_verifier · drift_diagnosis: 1k-query bench · 2 scenarios
infra/replication_manager: multi-region async · lag → PagerDuty P2
runbook YAML: 6 scenarios · P1/P2 · diagnosis + fix
Production-shape · see DESIGN.md + runbook
# Recall + precision — hybrid+rerank ~0.81 measured
dense-only baseline ~0.67 on 50 golden queries
hybrid (RRF k=60) ~0.75 — +12% recall, no infra
hybrid + cross-encoder rerank ~0.81 — +21% over dense baseline
→ ms-marco-MiniLM-L-6-v2 reranks top-20 in <50ms (CPU only)
# Cost — $275 → $177 / mo at 1M queries (−36%)
1-yr Reserved Instances on RDS + ElastiCache + EC2 fleet
Semantic-cache hit-rate 50%+ measured · halves embed + query API
Hash-based incremental embedding (ADR-004) · ~$0.20/mo at 1% churn
→ ADR-005 documents the embedding-version reversal
~0.81 recall@10 · hybrid+rerank · 50 golden
P99 < 100ms · SLA target · sla_verifier.py
5 ADRs · committed in starter kit
Curriculum · 5 modules · 16 hours

Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.

Modules 01-02 (~6h) ship a working hybrid retriever with BM25 + RRF + cross-encoder hitting ~0.81 recall@10 on the bundled 50 golden queries — included with PRO. Modules 03-05 (~10h additional) layer on the scale, observability, and production-platform story and unlock with EXPERT.

P14 · 5 modules · 16 hours · ~100 lessons
Free preview EXPERT required
M01
Embed, Store & Search — pgvector + HNSW + FastAPI
Set up pgvector with HNSW indexing, embed 1,000 tech documents (scale-up to 1M via the bundled script), build a FastAPI semantic search endpoint, and compare results against keyword search side-by-side.
3h · 18 lessons · PRO TIER
Unlock with PRO →
M02
Hybrid Search & Precision Tuning — BM25 + RRF + Reranker + Eval
Add BM25 + GIN, implement Reciprocal Rank Fusion (k=60), wire a cross-encoder reranker (ms-marco-MiniLM-L-6-v2), build the eval harness with recall@10 + MRR + nDCG, sweep HNSW ef_search, and beat BM25-only by about 23% on 50 golden queries.
3h · 22 lessons · PRO TIER
Unlock with PRO →
M03
Scale & Agent Integration — Incremental + Agent Function Calling
Build a 1M-doc-capable async batch pipeline with resume-on-failure, hash-based incremental updates, and zero-downtime embedding-model migration. Wire retrieval as an OpenAI function-calling agent tool, with short-term + long-term agent memory.
3.5h · 22 lessons · EXPERT TIER
Unlock with EXPERT →
M04
Observability, Cost & Resilience — Prometheus + Cache + Drift
Build Prometheus dashboards (recall@10, p99 latency, cost-per-query), add Redis semantic caching (0.95 cosine threshold, 50%+ hit-rate), wire an A/B router with t-test significance, run the SLA verifier against 1k queries, and diagnose 2 drift scenarios.
3.5h · 21 lessons · EXPERT TIER
Unlock with EXPERT →
M05
Own a Retrieval Platform — Multi-region + GDPR + RBAC + Runbook
Design multi-region async replication with lag monitoring + PagerDuty alerting, implement GDPR-compliant cascade deletion + compliance report, wire PII detection + RBAC redaction, ship a 5-layer DESIGN.md and a runbook YAML with 6 P1/P2 failure scenarios.
3h · 17 lessons · EXPERT TIER
Unlock with EXPERT →
Modules 01-02 with PRO ($29/mo) · Modules 03-05 with EXPERT ($79/mo)
See plans →
Backed by curriculum
Vector Databases & Retrieval Infrastructure
15 modules · ~28 hours · pgvector · HNSW · RRF · Cross-encoder
Open curriculum
The Vector Databases curriculum is the foundation for this project, not a sales add-on. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Embed. Retrieve. Scale.

Each phase ends with a tagged release, a passing eval suite, and a runbook drill. No ambiguity about where you are.

01 · ~6h
Foundation (Modules 01-02)

Working hybrid retriever with BM25 + RRF + cross-encoder hitting ~0.81 recall@10 on 50 golden queries. HNSW tuned, eval harness green.

  • pgvector + HNSW index (recall-vs-latency tuned)
  • Hybrid /search endpoint with RRF + reranker
  • Eval harness · recall@10 + MRR + nDCG
02 · ~7h
Production (Modules 03-04)

1M-capable pipeline with hash-based incremental + versioned embeddings + OpenAI agent. Prometheus + Grafana + semantic cache + drift diagnosis + SLA verifier all green.

  • Incremental embedding · resume-on-failure
  • Agent with function calling + memory
  • Semantic cache · A/B router · drift diagnosis
03 · ~3h
Platform (Module 05)

Multi-region replication code, GDPR + PII RBAC, sharding, DESIGN.md, runbook YAML with 6 P1/P2 failure scenarios.

  • Multi-region async replication + lag monitoring
  • GDPR cascade delete + compliance report
  • Runbook YAML · 6 scenarios · P1/P2
Project setup · 15 minutes

One command. Local pgvector + Redis + FastAPI + cross-encoder.

What lives in the repo

You get the real platform on day one — pgvector + HNSW + GIN indexes inside Postgres, a FastAPI service for semantic + hybrid + agent endpoints, Redis for the semantic cache, sentence-transformers cross-encoder for reranking, and Prometheus + Grafana for the dashboards.

  • seed/ + migrations/ — 5-table schema, BM25 GENERATED column, HNSW tune migrations
  • api/ + scripts/ — FastAPI endpoints, RRF fuser, reranker, ingest, eval, HNSW benchmark
  • embedding_pipeline.py + incremental_manager.py + embedding_version.py — async batch + hash-incremental + zero-DT model migration
  • retrieval_agent.py + agent_memory.py — OpenAI function calling + short/long-term memory
  • metrics.py + semantic_cache.py + ab_router.py + drift_diagnosis.py + sla_verifier.py — Prometheus instrumentation + cache + A/B + drift + SLA
  • infra/ + runbook/ + DESIGN.md — multi-region + GDPR + PII RBAC + sharding + 6-scenario runbook YAML
  • docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 68 files · 224 KB

AI Retrieval Platform Starter Kit

Pre-built retrieval platform: 5 tutorial modules of source, Docker compose (pgvector + Redis), 5K seeded docs, 50 golden queries, 2K labeled pairs, 1K drift corpus, 7 pytest gates. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 68 files · ADRs + cost model bundled · last updated 2026-05-09
~/projects/ai-retrieval-platform — zsh
1. Unzip and start the stack
$ unzip ai-retrieval-platform-starter.zip
$ cd ai-retrieval-platform && cp .env.example .env
$ docker compose up -d # pgvector + Redis
2. Ingest the bundled 5K corpus + run hybrid search
$ python scripts/ingest.py # 5K docs from data/documents.jsonl
$ curl -X POST localhost:8000/search/hybrid \
    -H 'Content-Type: application/json' \
    -d '{"q": "how does RRF work", "top_k": 10}'
3. Run the eval harness · recall@10 + MRR + nDCG
$ python scripts/eval.py # against 50 golden queries
4. Sweep HNSW ef_search + verify SLA
$ python scripts/benchmark_hnsw.py # ef_search 10–200 vs p50 latency
$ python sla_verifier.py # 1k-query bench · p50/p95/p99 + recall + cost
5k docs · 3 tenants × 3 topics
50 golden queries
2k labeled pairs · cross-encoder
1M-capable · scale-up script
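
For orientation, here is a minimal sketch of the three metrics the eval harness reports (recall@10, MRR, nDCG), assuming binary relevance labels per golden query. The function names are hypothetical and are not taken from scripts/eval.py.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-gain nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR is the mean of reciprocal_rank over all golden queries; recall@10 and nDCG are averaged the same way.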
Production hardening

The same RAG retrieval — but built for the production case.

Most retrieval tutorials show you a flat NumPy scan against a pickled index. This shows what changes when 4 tenants share infrastructure, on-call owns the recall dashboard, and finance asks for cost-per-1k-queries.

Notebook RAG · What most teams ship
× Vector index: Flat scan or pickled NumPy
× Hybrid: Cosine only; keyword queries miss
× Updates: Re-embed everything nightly
× Agent integration: String-concat retrieved text
× Caching: None; every query embeds + queries
× Observability: Print latencies in the notebook

Your retrieval platform · Modules 01–05
Vector index: pgvector HNSW with M / ef_search tuned + recall benchmark
Hybrid: BM25 + RRF + cross-encoder reranker (+23% over BM25-only)
Updates: Hash-based incremental (incremental_manager.py, ~1%/day)
Agent integration: OpenAI function calling + agent_memory short/long-term
Caching: Redis semantic cache @ 0.95 cosine · 50%+ hit-rate measured
Observability: Prometheus + sla_verifier (p50/p95/p99 + recall + cost)
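
As a rough illustration of the p50/p95/p99 check, the sketch below computes latency percentiles from a list of per-query timings and compares p99 against the 100ms target. It uses only the standard library; the function names and the inclusive interpolation method are assumptions, not the behavior of sla_verifier.py.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50/p95/p99 using inclusive linear interpolation."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def meets_sla(latencies_ms: Sequence[float], p99_budget_ms: float = 100.0) -> bool:
    """True when the measured p99 is under the budget (100ms on this page)."""
    return latency_percentiles(latencies_ms)["p99"] < p99_budget_ms
```

A 1k-query bench would collect one timing per query and pass the whole list in; with fewer than about 100 samples the p99 estimate is mostly interpolation.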
EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the single-fixed-embedding-model reversal. The kind of doc that travels with you to your next role. Preview ADR-001 →

ADR-001 · Accepted

pgvector + HNSW over Qdrant / Pinecone / Weaviate

Context: Single-store hybrid retrieval + operational parsimony at <10M vectors
Decision: CREATE EXTENSION vector + HNSW(M=16, ef_construction=64); Qdrant kept as alternative
Tradeoff: No native gRPC streaming + index build single-threaded vs SQL-native + zero new infra
Reversal: docker-compose.qdrant.yml ships in starter kit; swap is ~1-2 engineer-weeks
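
To make the ADR-001 decision concrete, here is a small sketch that builds the pgvector statements for an HNSW index with M=16 and ef_construction=64, plus the session-level ef_search setting. The table name `documents`, the column name `embedding`, and the choice of cosine distance are assumptions for illustration; the starter kit's migrations may differ.

```python
from typing import List


def hnsw_index_sql(table: str, column: str, m: int = 16, ef_construction: int = 64) -> str:
    """Build the pgvector HNSW index DDL (cosine distance operator class)."""
    # Identifiers are interpolated directly, so only pass trusted names here.
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw_idx "
        f"ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search: int = 40) -> str:
    """Query-time candidate list size; higher trades latency for recall."""
    return f"SET hnsw.ef_search = {ef_search};"


def setup_statements() -> List[str]:
    """Statements in the order a migration would run them (hypothetical names)."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        hnsw_index_sql("documents", "embedding"),
        ef_search_sql(40),
    ]
```

A recall-vs-latency sweep would re-issue the SET statement with different ef_search values and re-run the same query set each time.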
ADR-002 · Accepted

Hybrid retrieval is RRF, not score averaging or learned-sparse

Context: Cosine ∈ [0,1] vs ts_rank log-shaped; score-average ignores BM25 signal
Decision: 1.0 / (k + rank), k=60; rank-only fusion, score-free
Tradeoff: Discards confidence magnitude vs zero training data + zero new infra
Reversal: Learned-sparse (SPLADE) or learned-ranker swap is ~1 engineer-week behind the fuser interface
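
The ADR-002 formula can be sketched in a few lines: each ranked list contributes 1.0 / (k + rank) per document, and the contributions are summed. The function name `rrf_fuse` and the alphabetical tie-break are assumptions for illustration, not the starter kit's fuser.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit of each list adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]
```

Calling `rrf_fuse([dense_ids, bm25_ids])` needs only the two ID lists, which is why the fusion works even though cosine and ts_rank scores are on different scales.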
ADR-003 · Accepted

Reranker is a CPU cross-encoder, not LLM-as-judge

Context: Cost-quality at production volume · LLM-as-judge adds $20k/mo at 1M queries
Decision: cross-encoder/ms-marco-MiniLM-L-6-v2, batch 32 on CPU, <50ms top-50→top-10
Tradeoff: 2-percent recall ceiling vs LLM-as-judge · 512-token truncation vs $0 marginal cost
Reversal: Cohere Rerank / LLM-as-judge / fine-tuned model swap is 2-5 engineer-days
ADR-004 · Accepted

Index updates use hash-based incremental, not nightly full re-embed

Context: Nightly full re-embed costs ~$600/mo at 1M docs · per-write embed adds latency to write path
Decision: compute_hash(content) SHA-256 · re-embed only when hash changed (~1%/day)
Tradeoff: Embedding-model migration requires forced re-embed (see ADR-005) vs $20/mo → $0.20/mo
Reversal: Nightly full / per-write / CDC + Debezium swap is 0.5-5 engineer-days
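
A minimal sketch of the ADR-004 idea, assuming document content is available as text and the previously stored hashes live in a dict. `compute_hash` is the name used in the ADR; `docs_to_reembed` and the dict-based storage are hypothetical stand-ins for whatever incremental_manager.py actually does.

```python
import hashlib
from typing import Dict, List


def compute_hash(content: str) -> str:
    """SHA-256 hex digest of the document content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def docs_to_reembed(docs: Dict[str, str], stored_hashes: Dict[str, str]) -> List[str]:
    """Return IDs of documents that are new or whose content hash changed."""
    return [
        doc_id
        for doc_id, content in docs.items()
        if stored_hashes.get(doc_id) != compute_hash(content)
    ]
```

Only the returned IDs are sent to the embedding API; unchanged documents keep their existing vectors, which is where the ~1%/day figure comes from.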
ADR-005 · Deprecated

Single fixed embedding model assumption

Context: v1 schema embedded vector(1536) without an embedding_version column
Decision: Reverted in M03; added embedding_version + embedding_status + zero-downtime re-embed
Why reversed: text-embedding-3-large requires vector(3072) · multi-tenant model selection broke the schema
Replaced by: embedding_version.py · EmbeddingVersion enum · query-time preference filter
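
One way to picture the replacement is an enum that ties each model to its vector dimension, plus a query-time helper that picks the most-preferred version a tenant actually has embeddings for. The member names and the `pick_version` helper below are hypothetical; only the class name `EmbeddingVersion` and the two model dimensions come from this page.

```python
from enum import Enum
from typing import List, Optional, Set


class EmbeddingVersion(Enum):
    """Model name and vector dimension per embedding version (hypothetical members)."""

    SMALL_V3 = ("text-embedding-3-small", 1536)
    LARGE_V3 = ("text-embedding-3-large", 3072)

    @property
    def model(self) -> str:
        return self.value[0]

    @property
    def dimensions(self) -> int:
        return self.value[1]


def pick_version(
    available: Set[EmbeddingVersion], preference: List[EmbeddingVersion]
) -> Optional[EmbeddingVersion]:
    """Return the first preferred version that has embeddings, else None."""
    for version in preference:
        if version in available:
            return version
    return None
```

During a migration, `available` grows as the re-embed job finishes, so queries can switch to the new version without a cutover window.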
EXPERT-only · cost model

Read the FinOps story for the platform you actually ship.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 1-tenant beta load (~1M queries/mo, 1M-vector corpus), real AWS RDS + EC2 + ElastiCache + OpenAI list prices, with the 1-yr Reserved Instance, semantic-cache hit-rate, and hash-based incremental embedding levers wired up. Preview the CSV →

Component | Baseline / mo | Optimized / mo | Delta
Postgres + pgvector (RDS): db.t4g.medium · 100GB gp3 · 1M × 1536-dim ~ 6GB | $98 | $68 | −$30
ElastiCache Redis · semantic cache: cache.t4g.small · primary + replica · 50%+ hit-rate | $54 | $36 | −$18
FastAPI serving · 3 replicas: EC2 t4g.medium × 3 · /search + /agent · cross-encoder on CPU | $90 | $66 | −$24
OpenAI embeddings · one-shot at ingest: 1M docs · ADR-004 hash-incremental drops to ~$0.20/mo at 1% churn | $20 | $1 | −$19
OpenAI agent function calls (M05): ~5M tok/mo · semantic-cache hit-rate halves call volume | $10 | $3 | −$7
S3 + egress + Grafana free tier: embedding backups + eval results · 50GB free observability | $3 | $3 | $0
Total · 1 tenant @ 1M queries: ~$0.30 per 1k queries at baseline | $275 | $177 | −$98 (−36%)
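
The totals in the table can be re-derived from the component rows. The sketch below does that arithmetic with the dollar figures copied from the table; the variable names are illustrative and it is not the bundled cost-model CSV.

```python
# (baseline $/mo, optimized $/mo) per component, copied from the table above.
COMPONENTS = {
    "Postgres + pgvector (RDS)": (98, 68),
    "ElastiCache Redis": (54, 36),
    "FastAPI serving": (90, 66),
    "OpenAI embeddings": (20, 1),
    "OpenAI agent function calls": (10, 3),
    "S3 + egress + Grafana": (3, 3),
}

baseline_total = sum(base for base, _ in COMPONENTS.values())    # 275
optimized_total = sum(opt for _, opt in COMPONENTS.values())     # 177
saving = baseline_total - optimized_total                        # 98
saving_pct = round(100 * saving / baseline_total)                # 36

# 1M queries per month is 1,000 blocks of 1k queries.
baseline_per_1k = baseline_total / 1000    # 0.275, shown as ~$0.30 in the table
optimized_per_1k = optimized_total / 1000  # 0.177
```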

Optimization levers

1-yr Reserved Instances
Commit to 12-month reserved capacity on RDS + ElastiCache + the 3 EC2 instances once load is stable for 30 days. Standard ~30% off list.
−$72 / mo
Semantic cache hit-rate
Module 04's semantic_cache.py, with a 0.95 cosine threshold and a 1h TTL, hits 50%+ on real query distributions, which halves embedding + query API costs.
−$15 / mo
Hash-based incremental embedding (ADR-004)
incremental_manager.py SHA-256 of content; re-embed only on change (~1%/day vs 100% nightly). Turns embedding from monthly recurring to amortized one-shot.
−$15 / mo
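
To show what the semantic-cache lever means mechanically, here is an in-memory sketch: a lookup returns a cached result when a stored query embedding is at least 0.95 cosine-similar to the new one and is younger than the 1h TTL. It deliberately uses a plain Python list instead of Redis, and the class and method names are assumptions rather than the semantic_cache.py API.

```python
import math
import time
from typing import Any, Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory stand-in for a Redis-backed semantic cache (illustrative only)."""

    def __init__(
        self,
        threshold: float = 0.95,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: List[Tuple[List[float], Any, float]] = []

    def put(self, embedding: Sequence[float], result: Any) -> None:
        """Store a result together with its expiry time."""
        self._entries.append((list(embedding), result, self._clock() + self.ttl_seconds))

    def get(self, embedding: Sequence[float]) -> Optional[Any]:
        """Return the most similar unexpired result at or above the threshold."""
        now = self._clock()
        self._entries = [entry for entry in self._entries if entry[2] > now]
        best, best_sim = None, self.threshold
        for cached_embedding, result, _ in self._entries:
            sim = cosine(embedding, cached_embedding)
            if sim >= best_sim:
                best, best_sim = result, sim
        return best
```

A hit skips both the vector query and any downstream API call, which is the mechanism behind the cost lever; the linear scan here would be replaced by an indexed similarity lookup at real traffic volumes.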
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your recall-vs-latency benchmark. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a recall benchmark.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira K.
Ex-staff · search relevance · web-scale vector search
pgvector / Qdrant tuning at 100M+ vectors, hybrid retrieval, RRF + reranker depth
Send the diff. I'll go line-by-line through your HNSW parameters and your fuser logic and pick out the recall regressions.
Daniel T.
Principal · RAG infra · public AI company
Agent function calling, retrieval-as-tool patterns, semantic cache design, drift detection
Send your worst recall report. We'll walk it backwards from the eval harness to the embedding-version mismatch.
Anya S.
Eng manager · ML platform · public Series-D
Org design for retrieval teams, hiring rubrics, staff-MLE interview prep, scope-of-work for retrieval platforms
If you're prepping for staff-MLE promo, send your ADR draft. We'll work backwards from the rubric.
Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + arch review
Request a slot
What your tier unlocks

PRO unlocks Modules 01-02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get | FREE | PRO | EXPERT
Modules 01-02 of P14: Embed + Search + Hybrid + Reranker (~6h) | - | Included | Included
Modules 03-05 of P14: Scale + Agent + Observability + Platform (~10h) | - | - | Included
5 committed ADRs + cost-model CSV: Starter kit docs/adr/ + docs/cost-model/ | - | - | Included
PRO project catalog: Production-grade builds | 2 | All current | All current + this one
Curriculum: All 7 tracks | Phase 1 only | All | All + bonus modules
Code review: Senior+ reviewers | - | 4 / month | Unlimited
Cohort-beta architecture review: Async · 7-day turnaround · 12-member cap | - | - | Included
Certificate: Verifiable on LinkedIn | - | Yes | Yes + LinkedIn rec
$79/mo
billed monthly · open enrollment · cancel anytime
or annual
$699/yr save 26%
Unlock EXPERT
Who this is for

Pick this if you own the recall dashboard, not just a query.


Senior retrieval engineers

You've shipped semantic search. Now you own the eval harness, the recall dashboard, the embedding-version migration plan, and the architecture review with platform.


AI infra engineers

You absorb new RAG features without absorbing new vendors. pgvector + Redis + FastAPI + Prometheus — tools your platform team already operates.


Engineering managers · search / RAG

You need a reference architecture for the retrieval-quality + cost questions your CTO will ask before the AI team gets headcount or a model-serving budget.


Founding engineers · AI startups

Your investors will ask about retrieval quality and unit economics before they ask about scale. The 5 ADRs + cost model + recall benchmark is the answer.

FAQ · EXPERT tier

Quick answers.

Modules 01-02 (pgvector + HNSW + BM25 + RRF + cross-encoder reranker + eval harness) are included with PRO at $29/mo. The rest of the platform — Modules 03-05 (scale + agent + observability + multi-region/GDPR/RBAC), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you the hybrid retriever; EXPERT gets you the platform you'd defend in an architecture review.
ADR-001 lays out the full tradeoff. Short version: pgvector wins on single-store hybrid (the BM25 + GIN index lives next to the vector index, so RRF can fan-out within a single SQL query plan), zero-new-infra, and tutorial reproducibility. Pinecone is right when the team has zero Postgres ops expertise; Qdrant is right when vector-native payload filters dominate your workload. Both are documented as exit ramps; the starter kit ships docker-compose.qdrant.yml.
Hybrid + cross-encoder rerank hits ~0.81 on the bundled 50 golden queries (data/golden_queries.json). Dense-only baseline is ~0.67; RRF without rerank is ~0.75. The page lists 0.90 as an aspirational SLA target — that's a fine-tuning ceiling, not a measured number on the bundled corpus. Module 02's eval harness validates the actual number against your test set.
The pipeline is 1M-capable; the bundled corpus is 5K. Module 03 ships scripts/scale_up_to_full.py to generate documents on demand, and the embedding pipeline (embedding_pipeline.py) is async + resume-on-failure + hash-based incremental — designed for 1M scale. Module 03's checkpoint validates the resume behavior, but the full 1M run is your local reproduction (takes ~2 hours + ~$20 of OpenAI API).
16 hours of focused work across 5 modules. Most learners spread it across 4-5 weeks alongside a day job. Modules 01-02 alone (~6 hours) get you a working hybrid retriever with reranking — included with PRO at $29/mo.
It's a strong forcing function. Staff retrieval interviews lean heavily on system design (hybrid retrieval, recall vs latency tradeoffs, embedding-model migration, multi-region) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated documenting the single-fixed-embedding-model reversal, with receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have a portfolio.
Related projects

Paired with this project

P06 · PAID · ai
Enterprise RAG — retrieval-quality build

EXPERT-tier retrieval-quality RAG: 4-strategy chunking A/B (62/78/85%), hybrid BM25 + dense + RRF, cross-encoder reranker, RAGAS 4-metric canary, LLM gateway with fallback. 5 ADRs + cost-model CSV bundled.

Explore project →
P17 · PAID · analytics
Full-stack AI platform — full RAG system + production hardening

EXPERT-tier full-stack RAG: pgvector + HNSW + hybrid retrieval + RRF + cross-encoder rerank, 4-class query router with confidence threshold, 3-level failure cascade (RAG → LLM-only → cached), per-tenant index isolation, eval gates, cost guardrails, 6-mode incident simulator, 5 committed ADRs (one Deprecated), runnable cost-model CSV. 6 modules · 20-22h. Modules 01-03 with PRO.

Explore project →

Ready to ship a real retrieval platform?

Start with PRO ($29/mo) for Modules 01-02 — pgvector + hybrid retrieval + reranker + eval harness. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

P14 · AI retrieval platform · EXPERT · PRO unlocks M01-M02 · Unlock EXPERT →