ai-de.net/Projects/P14 · AI retrieval platform — pgvector + hybrid + RRF + cross-encoder

Last updated 2026-05-22 · By AI-DE Engineering Team

EXPERT-tier · PRO unlocks Modules 01-02 · AI & vectors track · P14

Build a production retrieval platform with pgvector + RRF

Ship a real retrieval platform with pgvector + HNSW, BM25 + GIN, Reciprocal Rank Fusion, a cross-encoder reranker, hash-based incremental updates, an OpenAI function-calling agent, Redis semantic cache, and an SLA verifier that runs 1k queries with p50/p95/p99 latency and recall@10 checks. Modules 01-02 unlock with PRO; the full platform with EXPERT.

Timeline

16 hours

Difficulty

Senior+

Stack

pgvector · BM25 · RRF · cross-encoder · Redis · OpenAI

See EXPERT benefits

The retrieval system-design portfolio piece for staff AI infra roles: 5 committed ADRs, a runnable cost-model CSV, a multi-region replication design, and a runbook with 6 P1/P2 failure scenarios you can defend in an architecture review.

By the end you will have wired

pgvector + HNSW index with M=16, ef_construction=64, ef_search tuned via a recall-vs-latency sweep
Hybrid retriever with BM25 + RRF (k=60) + cross-encoder reranker hitting ~0.81 recall@10 on 50 golden queries
Hash-based incremental embedding pipeline (~1% daily churn vs 100% nightly) with versioned model migration
OpenAI function-calling agent with short-term + long-term memory + cosine-similarity context retrieval
Prometheus + Grafana observability with semantic cache (50%+ hit-rate), A/B router, drift diagnosis
5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV

PREREQ · SENIOR+ · Built for engineers shipping production search / RAG. Comfortable with Python (async, typing), SQL, and Postgres basics, plus at least one of: vector retrieval, IR metrics (recall / MRR / nDCG), or embedding-model migration. Not a “what is RAG” course.

ai-retrieval.platform · 5 modules · 1M-capable · hybrid + rerank · semantic cache · multi-region

recall + cost armed

Ingest + embed
embedding_pipeline.py · resume-on-failure · jobs table
incremental_manager · SHA-256 hash · ~1%/day churn
embedding_version · EmbeddingVersion enum · zero-DT migrate
text-embedding-3-small · 1M-doc-capable batch
Versioned + reproducible · see ADR-004 + ADR-005

Hybrid retrieve
pgvector + HNSW · M=16 · ef_construction=64 · ef_search=40
BM25 + GIN · tsvector GENERATED · ts_rank
RRF fuser · k=60 · 1.0 / (k + rank)
cross-encoder rerank · ms-marco-MiniLM-L-6-v2 · CPU <50ms
~0.81 recall@10 · see ADR-001 + ADR-002 + ADR-003

Cache + serve
Redis semantic cache · 0.95 cosine · 1h TTL · 50%+ hit-rate
ab_router · stable hash · t-test analysis
FastAPI · /search/hybrid · P99 < 100ms target
retrieval_agent · OpenAI tool_choice=auto + memory
A/B routed serving · see semantic_cache.py

Observe + scale
Prometheus + Grafana · p50/p95/p99 · recall · cost
sla_verifier · drift_diagnosis · 1k-query bench · 2 scenarios
infra/replication_manager · multi-region async · lag → PD P2
runbook YAML · 6 scenarios · P1/P2 · diagnosis + fix
Production-shape · see DESIGN.md + runbook

# Recall + precision — hybrid+rerank ~0.81 measured

dense-only baseline ~0.67 on 50 golden queries

hybrid (RRF k=60) ~0.75 — +12% recall, no infra

hybrid + cross-encoder rerank ~0.81 — +21% over dense baseline

→ ms-marco-MiniLM-L-6-v2 reranks top-20 in <50ms (CPU only)

# Cost — $275 → $177 / mo at 1M queries (−36%)

1-yr Reserved Instances on RDS + ElastiCache + EC2 fleet

Semantic-cache hit-rate 50%+ measured · halves embed + query API

Hash-based incremental embedding (ADR-004) · ~$0.20/mo at 1% churn

→ ADR-005 documents the embedding-version reversal

~0.81

recall@10 · hybrid+rerank · 50 golden

P99 < 100ms

SLA target · sla_verifier.py

5 ADRs

committed in starter kit

Curriculum · 5 modules · 16 hours

Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.

Modules 01-02 (~6h) ship a working hybrid retriever with BM25 + RRF + cross-encoder hitting ~0.81 recall@10 on the bundled 50 golden queries — included with PRO. Modules 03-05 (~10h additional) layer on the scale, observability, and production-platform story and unlock with EXPERT.

P14 · 5 modules · 16 hours · ~100 lessons


M01 · Embed, Store & Search: pgvector + HNSW + FastAPI

Set up pgvector with HNSW indexing, embed 1,000 tech documents (scale-up to 1M via the bundled script), build a FastAPI semantic search endpoint, and compare results against keyword search side-by-side.

3h · 18 lessons · PRO TIER

Unlock with PRO →
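The HNSW settings this module names (M=16, ef_construction=64, a session-level ef_search) correspond to pgvector's index options. Here is a minimal sketch that only builds the SQL strings. The table name `documents` and column name `embedding` are assumptions for illustration, not the starter kit's schema.

```python
def hnsw_index_sql(table: str = "documents", column: str = "embedding",
                   m: int = 16, ef_construction: int = 64) -> str:
    """Build a CREATE INDEX statement for a pgvector HNSW index (cosine distance).

    `table` and `column` are hypothetical names; swap in your own schema.
    """
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw "
        f"ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search: int = 40) -> str:
    """Build the session-level setting that trades recall for query latency."""
    return f"SET hnsw.ef_search = {ef_search};"


print(hnsw_index_sql())
print(ef_search_sql(40))
```

A recall-vs-latency sweep would re-run the same golden queries after each `SET hnsw.ef_search` value and record both numbers.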

M02 · Hybrid Search & Precision Tuning: BM25 + RRF + Reranker + Eval

Add BM25 + GIN, implement Reciprocal Rank Fusion (k=60), wire a cross-encoder reranker (ms-marco-MiniLM-L-6-v2), build the eval harness with recall@10 + MRR + nDCG, sweep HNSW ef_search, and beat BM25-only by about 23% on 50 golden queries.

3h · 22 lessons · PRO TIER

Unlock with PRO →
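Reciprocal Rank Fusion scores each document as the sum of 1.0 / (k + rank) over every ranked list it appears in, with k=60 here. This is a minimal sketch of that formula, not the starter kit's fuser. The function name `rrf_fuse` and the document ids are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank), rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


dense_ranking = ["d1", "d2", "d3"]  # e.g. from the vector index
bm25_ranking = ["d3", "d1", "d4"]   # e.g. from the keyword index
fused = rrf_fuse([dense_ranking, bm25_ranking])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses only ranks, it needs no score normalization between the dense and keyword retrievers.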

M03 · Scale & Agent Integration: Incremental + Agent Function Calling

Build a 1M-doc-capable async batch pipeline with resume-on-failure, hash-based incremental updates, and zero-downtime embedding-model migration. Wire retrieval as an OpenAI function-calling agent tool, with short-term + long-term agent memory.

3.5h · 22 lessons · EXPERT TIER

Unlock with EXPERT →
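Exposing retrieval as a function-calling tool comes down to a JSON-schema tool definition plus a dispatcher that runs whatever call the model asks for. The sketch below makes no API call and is not the kit's retrieval_agent.py. The tool name `search_documents`, its parameters, and `fake_search` are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

# Tool definition in the shape the OpenAI Chat Completions API accepts via
# `tools=[SEARCH_TOOL], tool_choice="auto"`. Name and parameters are hypothetical.
SEARCH_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Hybrid search over the document corpus.",
        "parameters": {
            "type": "object",
            "properties": {
                "q": {"type": "string", "description": "Natural-language query."},
                "top_k": {"type": "integer", "minimum": 1, "maximum": 50},
            },
            "required": ["q"],
        },
    },
}


def dispatch_tool_call(name: str, arguments_json: str,
                       registry: Dict[str, Callable[..., Any]]) -> str:
    """Run the function the model requested and return a JSON string for the tool message."""
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps(registry[name](**args))


def fake_search(q: str, top_k: int = 10) -> List[Dict[str, Any]]:
    """Stand-in for the real retriever so the sketch runs offline."""
    return [{"id": "doc-1", "score": 0.9}][:top_k]


result = dispatch_tool_call("search_documents", '{"q": "rrf", "top_k": 1}',
                            {"search_documents": fake_search})
print(result)
```

In a real agent loop the returned string would be appended as a `tool` message before asking the model for its next turn.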

M04 · Observability, Cost & Resilience: Prometheus + Cache + Drift

Build Prometheus dashboards (recall@10, p99 latency, cost-per-query), add Redis semantic caching (0.95 cosine threshold, 50%+ hit-rate), wire an A/B router with t-test significance, run the SLA verifier against 1k queries, and diagnose 2 drift scenarios.

3.5h · 21 lessons · EXPERT TIER

Unlock with EXPERT →
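The semantic-cache rule described here (serve a cached answer when a new query's embedding is within 0.95 cosine similarity of a cached one, and expire entries after 1h) can be sketched in memory. This is an illustration of the lookup logic only, assuming a plain Python list in place of Redis. The class and method names are hypothetical.

```python
import math
import time
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SemanticCache:
    threshold: float = 0.95      # minimum cosine similarity for a hit
    ttl_seconds: float = 3600.0  # 1h TTL
    _entries: List[Tuple[List[float], str, float]] = field(default_factory=list)

    def put(self, embedding: Sequence[float], value: str,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries.append((list(embedding), value, now + self.ttl_seconds))

    def get(self, embedding: Sequence[float],
            now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Drop expired entries, then return the most similar entry above threshold.
        self._entries = [e for e in self._entries if e[2] > now]
        best_value, best_sim = None, self.threshold
        for cached_emb, value, _expires in self._entries:
            sim = cosine(embedding, cached_emb)
            if sim >= best_sim:
                best_value, best_sim = value, sim
        return best_value


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer", now=0.0)
print(cache.get([0.99, 0.05], now=10.0))  # near-duplicate query
```

A linear scan is fine for a sketch. A production cache would need an indexed similarity lookup to stay inside the latency budget.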

M05 · Own a Retrieval Platform: Multi-region + GDPR + RBAC + Runbook

Design multi-region async replication with lag monitoring + PagerDuty alerting, implement GDPR-compliant cascade deletion + compliance report, wire PII detection + RBAC redaction, ship a 5-layer DESIGN.md and a runbook YAML with 6 P1/P2 failure scenarios.

3h · 17 lessons · EXPERT TIER

Unlock with EXPERT →

Modules 01-02 with PRO ($29/mo) · Modules 03-05 with EXPERT ($79/mo)

See plans →

Backed by curriculum

Vector Databases & Retrieval Infrastructure

15 modules · ~28 hours · pgvector · HNSW · RRF · Cross-encoder

Open curriculum

The Vector Databases curriculum is the foundation for this project, not a sales add-on. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Embed. Retrieve. Scale.

Each phase ends with a tagged release, a passing eval suite, and a runbook drill. No ambiguity about where you are.

01 · ~6h

Foundation (Modules 01-02)

Working hybrid retriever with BM25 + RRF + cross-encoder hitting ~0.81 recall@10 on 50 golden queries. HNSW tuned, eval harness green.

✓ pgvector + HNSW index (recall-vs-latency tuned)
✓ Hybrid /search endpoint with RRF + reranker
✓ Eval harness · recall@10 + MRR + nDCG
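The three eval-harness metrics can be written down in a few lines. This is a minimal sketch with binary relevance labels, not the bundled scripts/eval.py. The function names and the sample ids are assumptions.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none). MRR is the mean over queries."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG with binary gains: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}
print(recall_at_k(ranked, relevant), reciprocal_rank(ranked, relevant),
      ndcg_at_k(ranked, relevant))
```

Averaging each function over the 50 golden queries gives the per-run numbers a harness like this would report.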

02 · ~7h

Production (Modules 03-04)

1M-capable pipeline with hash-based incremental + versioned embeddings + OpenAI agent. Prometheus + Grafana + semantic cache + drift diagnosis + SLA verifier all green.

✓ Incremental embedding · resume-on-failure
✓ Agent with function calling + memory
✓ Semantic cache · A/B router · drift diagnosis

03 · ~3h

Platform (Module 05)

Multi-region replication code, GDPR + PII RBAC, sharding, DESIGN.md, runbook YAML with 6 P1/P2 failure scenarios.

✓ Multi-region async replication + lag monitoring
✓ GDPR cascade delete + compliance report
✓ Runbook YAML · 6 scenarios · P1/P2

Project setup · 15 minutes

One command. Local pgvector + Redis + FastAPI + cross-encoder.

What lives in the repo

You get the real platform on day one — pgvector + HNSW + GIN indexes inside Postgres, a FastAPI service for semantic + hybrid + agent endpoints, Redis for the semantic cache, sentence-transformers cross-encoder for reranking, and Prometheus + Grafana for the dashboards.

seed/ + migrations/ — 5-table schema, BM25 GENERATED column, HNSW tune migrations
api/ + scripts/ — FastAPI endpoints, RRF fuser, reranker, ingest, eval, HNSW benchmark
embedding_pipeline.py + incremental_manager.py + embedding_version.py — async batch + hash-incremental + zero-DT model migration
retrieval_agent.py + agent_memory.py — OpenAI function calling + short/long-term memory
metrics.py + semantic_cache.py + ab_router.py + drift_diagnosis.py + sla_verifier.py — Prometheus instrumentation + cache + A/B + drift + SLA
infra/ + runbook/ + DESIGN.md — multi-region + GDPR + PII RBAC + sharding + 6-scenario runbook YAML
docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV

Download · Starter Kit · 68 files · 224 KB

AI Retrieval Platform Starter Kit

Pre-built retrieval platform: 5 tutorial modules of source, Docker compose (pgvector + Redis), 5K seeded docs, 50 golden queries, 2K labeled pairs, 1K drift corpus, 7 pytest gates. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 68 files · ADRs + cost model bundled · last updated 2026-05-09

~/projects/ai-retrieval-platform — zsh

1. Unzip and start the stack

$ unzip ai-retrieval-platform-starter.zip

$ cd ai-retrieval-platform && cp .env.example .env

$ docker compose up -d # pgvector + Redis

2. Ingest the bundled 5K corpus + run hybrid search

$ python scripts/ingest.py # 5K docs from data/documents.jsonl

$ curl -X POST localhost:8000/search/hybrid \
    -H 'Content-Type: application/json' \
    -d '{"q": "how does RRF work", "top_k": 10}'

3. Run the eval harness · recall@10 + MRR + nDCG

$ python scripts/eval.py # against 50 golden queries

4. Sweep HNSW ef_search + verify SLA

$ python scripts/benchmark_hnsw.py # ef_search 10–200 vs p50 latency

$ python sla_verifier.py # 1k-query bench · p50/p95/p99 + recall + cost
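The latency half of an SLA check like step 4 reduces to percentiles over the recorded per-query latencies plus a comparison against the P99 < 100ms target. This is a minimal sketch using the nearest-rank method. It is not the kit's sla_verifier.py, and the function names are assumptions.

```python
import math
from typing import Dict, Sequence, Union


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    if not sorted_values:
        raise ValueError("no latency samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_report(latencies_ms: Sequence[float],
                   p99_target_ms: float = 100.0) -> Dict[str, Union[float, bool]]:
    """Compute p50/p95/p99 and flag whether p99 is under the target."""
    ordered = sorted(latencies_ms)
    report: Dict[str, Union[float, bool]] = {
        f"p{p}": percentile(ordered, p) for p in (50, 95, 99)
    }
    report["p99_within_target"] = report["p99"] < p99_target_ms
    return report


# Synthetic samples for illustration: 1ms, 2ms, ..., 100ms.
samples = [float(i) for i in range(1, 101)]
report = latency_report(samples)
print(report)
```

With 1k real queries the same functions apply unchanged; only the sample list grows.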

5K docs · 3 tenants × 3 topics

50 golden queries

2K labeled pairs · cross-encoder

1M-capable · scale-up script

Production hardening

The same RAG retrieval — but built for the production case.

Most retrieval tutorials show you a flat NumPy scan against a pickled index. This shows what changes when 4 tenants share infrastructure, on-call owns the recall dashboard, and finance asks for cost-per-1k-queries.

Notebook RAG · What most teams ship

Vector index: Flat scan or pickled NumPy
Hybrid: Cosine only; keyword queries miss
Updates: Re-embed everything nightly
Agent integration: String-concat retrieved text
Caching: None; every query embeds + queries
Observability: Print latencies in the notebook

Your retrieval platform · Modules 01–05

✓ Vector index: pgvector HNSW with M / ef_search tuned + recall benchmark
✓ Hybrid: BM25 + RRF + cross-encoder reranker (+23% over BM25-only)
✓ Updates: Hash-based incremental (incremental_manager.py, ~1%/day)
✓ Agent integration: OpenAI function calling + agent_memory short/long-term

Read the FinOps story for the platform you actually ship.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a 1-tenant beta load (~1M queries/mo, 1M-vector corpus) at real AWS RDS + EC2 + ElastiCache + OpenAI list prices, with three levers wired up: 1-yr Reserved Instances, semantic-cache hit-rate, and hash-based incremental embedding. Preview the CSV →

Postgres + pgvector (RDS)
db.t4g.medium · 100GB gp3 · 1M × 1536-dim ~ 6GB
Baseline $98 / mo · Optimized $68 / mo · Delta −$30

ElastiCache Redis · semantic cache
cache.t4g.small · primary + replica · 50%+ hit-rate
Baseline $54 / mo · Optimized $36 / mo · Delta −$18

FastAPI serving · 3 replicas
EC2 t4g.medium × 3 · /search + /agent · cross-encoder on CPU
Baseline $90 / mo · Optimized $66 / mo · Delta −$24

OpenAI embeddings · one-shot at ingest
1M docs · ADR-004 hash-incremental drops to ~$0.20/mo at 1% churn
Baseline $20 / mo · Delta −$19

OpenAI agent function calls (M05)
~5M tok/mo · semantic-cache hit-rate halves call volume
Baseline $10 / mo · Delta −$7

S3 + egress + Grafana free tier
embedding backups + eval results · 50GB free observability
—

Total · 1 tenant @ 1M queries
~$0.30 per 1k queries at baseline
Baseline $275 / mo · Optimized $177 / mo · Delta −$98 (−36%)
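The totals in the table can be checked with a few lines of arithmetic. This sketch simply re-derives the headline numbers from the figures printed above and does not read the bundled CSV. The variable and function names are assumptions.

```python
# Monthly figures copied from the cost table (USD).
BASELINE_TOTAL = 275
OPTIMIZED_TOTAL = 177
LINE_ITEM_DELTAS = [30, 18, 24, 19, 7]  # RDS, ElastiCache, EC2, embeddings, agent calls
MONTHLY_QUERIES = 1_000_000


def cost_per_1k_queries(monthly_usd: float, monthly_queries: int) -> float:
    """Monthly spend divided by the number of 1k-query blocks served."""
    return monthly_usd / (monthly_queries / 1000)


savings = BASELINE_TOTAL - OPTIMIZED_TOTAL
savings_pct = 100 * savings / BASELINE_TOTAL

print(f"savings: ${savings}/mo ({savings_pct:.1f}%)")
print(f"baseline:  ${cost_per_1k_queries(BASELINE_TOTAL, MONTHLY_QUERIES):.3f} per 1k queries")
print(f"optimized: ${cost_per_1k_queries(OPTIMIZED_TOTAL, MONTHLY_QUERIES):.3f} per 1k queries")
```

The per-line deltas sum to the $98 total saving, and the baseline works out to roughly $0.28 per 1k queries, which the table rounds to ~$0.30.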

Optimization levers

1-yr Reserved Instances

Commit to 12-month reserved capacity on RDS + ElastiCache + the 3 EC2 instances once load is stable for 30 days. Standard ~30% off list.

−$72 / mo

Semantic cache hit-rate

Module 04's semantic_cache.py, with a 0.95 cosine threshold and a 1h TTL, reaches a 50%+ hit-rate on real query distributions, which halves embedding + query API costs.

−$15 / mo

Hash-based incremental embedding (ADR-004)

incremental_manager.py stores a SHA-256 hash of each document's content and re-embeds only when the hash changes (~1%/day vs 100% nightly). This turns embedding from a monthly recurring cost into an amortized one-shot cost.

−$15 / mo
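The hash-based incremental lever rests on one comparison: hash the current content, compare it with the hash stored at the last embedding run, and re-embed only on mismatch. This is a minimal sketch of that check, not the kit's incremental_manager.py. The function names and the in-memory dicts are assumptions.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_reembed(docs: Dict[str, str], stored_hashes: Dict[str, str]) -> List[str]:
    """Return ids of docs that are new or whose content hash changed since the last run."""
    return [doc_id for doc_id, text in docs.items()
            if stored_hashes.get(doc_id) != content_hash(text)]


# Hashes recorded at the previous embedding run (hypothetical data).
stored = {"a": content_hash("hello"), "c": content_hash("old text")}
current = {"a": "hello", "b": "brand new doc", "c": "edited text"}

print(docs_to_reembed(current, stored))
```

Only the new and edited documents are sent to the embedding API; unchanged ones keep their existing vectors.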

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your recall-vs-latency benchmark. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a recall benchmark.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira K.

Ex-staff · search relevance · web-scale vector search

pgvector / Qdrant tuning at 100M+ vectors, hybrid retrieval, RRF + reranker depth

“Send the diff. I'll go line-by-line through your HNSW parameters and your fuser logic and pick out the recall regressions.”

Daniel T.

Principal · RAG infra · public AI company

Agent function calling, retrieval-as-tool patterns, semantic cache design, drift detection

“Send your worst recall report. We'll walk it backwards from the eval harness to the embedding-version mismatch.”

Anya S.

Eng manager · ML platform · public Series-D

Org design for retrieval teams, hiring rubrics, staff-MLE interview prep, scope-of-work for retrieval platforms

“If you're prepping for staff-MLE promo, send your ADR draft. We'll work backwards from the rubric.”

Format

Async

Turnaround

7 days

Cohort

12 members

Scope

ADR + arch review

Request a slot →

What your tier unlocks

PRO unlocks Modules 01-02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get · FREE · PRO · EXPERT

Modules 01-02 of P14

Embed + Search + Hybrid + Reranker (~6h)

—

Included

Modules 03-05 of P14

Scale + Agent + Observability + Platform (~10h)

—

Included

5 committed ADRs + cost-model CSV

Starter kit docs/adr/ + docs/cost-model/

—

Included

PRO project catalog

Production-grade builds

All current

All current + this one

Curriculum

All 7 tracks

Phase 1 only

All

All + bonus modules

Code review

Senior+ reviewers

—

4 / month

Unlimited

Cohort-beta architecture review

Async · 7-day turnaround · 12-member cap

—

Included

Certificate

Verifiable on LinkedIn

—

Yes

Yes + LinkedIn rec

$79/mo

billed monthly · open enrollment · cancel anytime

or annual

$699/yr save 26%

Unlock EXPERT →

Who this is for

Pick this if you own the recall dashboard, not just a query.

Senior retrieval engineers

You've shipped semantic search. Now you own the eval harness, the recall dashboard, the embedding-version migration plan, and the architecture review with platform.

AI infra engineers

You absorb new RAG features without absorbing new vendors. pgvector + Redis + FastAPI + Prometheus — tools your platform team already operates.

Engineering managers · search / RAG

You need a reference architecture for the retrieval-quality + cost questions your CTO will ask before the AI team gets headcount or a model-serving budget.

Founding engineers · AI startups

Your investors will ask about retrieval quality and unit economics before they ask about scale. The 5 ADRs + cost model + recall benchmark is the answer.

Related curriculum

Going deeper? Four tracks back this project.

The Vector Databases curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.

FAQ · EXPERT tier

Quick answers.

How is this different from PRO?

Modules 01-02 (pgvector + HNSW + BM25 + RRF + cross-encoder reranker + eval harness) are included with PRO at $29/mo. The rest of the platform — Modules 03-05 (scale + agent + observability + multi-region/GDPR/RBAC), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you the hybrid retriever; EXPERT gets you the platform you'd defend in an architecture review.

Why pgvector over Pinecone or Qdrant?

ADR-001 lays out the full tradeoff. Short version: pgvector wins on single-store hybrid (the BM25 + GIN index lives next to the vector index, so RRF can fan-out within a single SQL query plan), zero-new-infra, and tutorial reproducibility. Pinecone is right when the team has zero Postgres ops expertise; Qdrant is right when vector-native payload filters dominate your workload. Both are documented as exit ramps; the starter kit ships docker-compose.qdrant.yml.

What's the actual recall@10 you measured?

Hybrid + cross-encoder rerank hits ~0.81 on the bundled 50 golden queries (data/golden_queries.json). Dense-only baseline is ~0.67; RRF without rerank is ~0.75. The page lists 0.90 as an aspirational SLA target — that's a fine-tuning ceiling, not a measured number on the bundled corpus. Module 02's eval harness validates the actual number against your test set.

Does the 1M-document scale actually work?

The pipeline is 1M-capable; the bundled corpus is 5K. Module 03 ships scripts/scale_up_to_full.py to generate documents on demand, and the embedding pipeline (embedding_pipeline.py) is async + resume-on-failure + hash-based incremental — designed for 1M scale. Module 03's checkpoint validates the resume behavior, but the full 1M run is your local reproduction (takes ~2 hours + ~$20 of OpenAI API).

How long until I can finish this project?

16 hours of focused work across 5 modules. Most learners spread it across 4-5 weeks alongside a day job. Modules 01-02 alone (~6 hours) get you a working hybrid retriever with reranking — included with PRO at $29/mo.

Is this enough to interview for staff retrieval / AI infra roles?

It's a strong forcing function. Staff retrieval interviews lean heavily on system design (hybrid retrieval, recall vs latency tradeoffs, embedding-model migration, multi-region) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated documenting the single-fixed-embedding-model reversal, with receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have a portfolio.

Related projects

Paired with this project

P06·PAID·ai

Enterprise RAG — retrieval-quality build

EXPERT-tier retrieval-quality RAG: 4-strategy chunking A/B (62/78/85%), hybrid BM25 + dense + RRF, cross-encoder reranker, RAGAS 4-metric canary, LLM gateway with fallback. 5 ADRs + cost-model CSV bundled.

Explore project →

P17·PAID·analytics

Full-stack AI platform — full RAG system + production hardening

EXPERT-tier full-stack RAG: pgvector + HNSW + hybrid retrieval + RRF + cross-encoder rerank, 4-class query router with confidence threshold, 3-level failure cascade (RAG → LLM-only → cached), per-tenant index isolation, eval gates, cost guardrails, 6-mode incident simulator, 5 committed ADRs (one Deprecated), runnable cost-model CSV. 6 modules · 20-22h. Modules 01-03 with PRO.

Explore project →

Ready to ship a real retrieval platform?

Start with PRO ($29/mo) for Modules 01-02 — pgvector + hybrid retrieval + reranker + eval harness. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

See EXPERT benefits


Write the ADRs staff engineers actually get judged on.

pgvector + HNSW over Qdrant / Pinecone / Weaviate

Hybrid retrieval is RRF, not score averaging or learned-sparse

Reranker is a CPU cross-encoder, not LLM-as-judge

Index updates use hash-based incremental, not nightly full re-embed

Single fixed embedding model assumption (Deprecated)


Going deeper? Four tracks back this project.

RAG Learning Path

Agentic Workflows

Data Observability & Quality

API & External System Integration
