Skip to content
ai-de.net/Projects/P06 · Enterprise RAG — retrieval-quality build
EXPERT-tier · PRO unlocks Module 01AI & vectors trackP06

Build a
retrieval-quality
RAG platform — that survives a relevance review

Ship a production RAG platform anchored on retrieval quality: 4 chunking strategies with measured 62/78/85% accuracy A/B, hybrid BM25 + dense + RRF retrieval, cross-encoder reranking, RAGAS 4-metric eval harness, and a multi-provider LLM gateway. Module 01 unlocks with PRO; the retrieval stack unlocks with EXPERT.

Timeline
16.5 hours
Difficulty
Senior+
Stack
FastAPI · OpenAI · Pinecone · Qdrant · Redis · Prometheus

The retrieval-quality portfolio piece for staff AI / ML platform roles — 5 committed ADRs, a runnable cost-model CSV, a chunking-A/B benchmark with cited numbers, and a hybrid + reranker pipeline you can defend in a relevance-review.

By the end you will have wired
  • Multi-format parsers (PDF / DOCX / Markdown / Text) + 4-strategy ChunkingFactory
  • Pinecone HNSW index over text-embedding-3-small (1536-dim) with namespace per tenant
  • Hybrid retriever — BM25 + dense fused via Reciprocal Rank Fusion (k=60)
  • Cross-encoder reranker (top-50 → top-10) with disable-able score floor
  • RAGAS 4-metric harness (faithfulness / answer_relevancy / context_precision / context_recall)
  • LLM gateway with fallback chain, response cache (1h TTL), per-tenant cost tracker
  • 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
PREREQ · STAFF+Built for engineers who debug retrieval failures, not RAG demos. Comfortable with Python services, vector indexes, and at least one of: BM25 / hybrid retrieval, cross-encoders, or RAGAS-style eval harnesses. Not a “hello LangChain” tutorial.
enterprise-rag.platform · tenant=acme · · 4-stage pipeline
RAGAS canary green
Ingest
Retrieve
Generate
Operate
PDFParserpypdf · page markers
DOCXParserpython-docx · tables
ChunkingFactoryRecursive default · Semantic opt-in
/upload + workerFastAPI background task
Multi-format parsers + 4-strategy chunker
Pinecone HNSW1536-dim · namespace/tenant
BM25 indexin-process · lexical
RRF mergek=60 (paper default)
Cross-encodertop-50 → top-10
Hybrid BM25 + dense + reranker
LLMGatewaygpt-4o → gpt-4o-mini fallback
Response cache1h TTL · 20% hit rate
Query rewriterHyDE + multi-query
/chat streamcitations + scores
Gateway + fallback chain + streaming
RAGAS canary4 metrics · faithfulness > 0.85
Prometheusrag_query_latency · cache_hits
TenantManagernamespace + quota
Circuit breakerdegraded mode runbook
RAGAS + Prometheus + SLA guardrails
# Chunking A/B is the headline retrieval-quality artifact
Fixed-size (1000c, 200 overlap) → 62% accuracy@10 — cuts mid-sentence (ADR-005 Deprecated)
Recursive (paragraph → sentence → word) → 78% — +16pp default
Semantic (embedding-boundary detection) → 85% — opt-in for legal/medical/contract docs
→ ChunkingFactory routes by document tag; defaults are honest, upgrades are paid
# 5 ADRs + cost-model CSV bundled in the kit
docs/adr/001-hybrid-retrieval-rrf.md — BM25 + dense via RRF k=60
docs/adr/005-fixed-size-chunking-deprecated.md — the chunking-A/B reversal
docs/cost-model/enterprise-rag-cost-model.csv — 5 tenants × 300k req/mo
→ commit the artifacts a relevance review actually opens
70 files
in starter zip · ADRs bundled
−75%
model spend at the reference scenario
RAGAS × 4
metrics wired in CI
Curriculum · 5 modules · 16.5 hours · 3 phases

Module 01 unlocks with PRO. The retrieval stack with EXPERT.

Module 01 (~3h) ships multi-format parsers, the 4-strategy ChunkingFactory, the FastAPI upload API, and a React drag-and-drop uploader — included with PRO. Modules 02–05 (~13.5h) layer the Pinecone HNSW index, hybrid retriever, cross-encoder reranker, RAGAS harness, multi-provider LLM gateway, and multi-tenant blueprint — and unlock with EXPERT.

P06 · 5 modules · 16.5 hours · 35+ lessons
Free preview EXPERT required
M01
Document Ingestion Pipeline
Multi-format parsers (PDF / DOCX / Markdown / Text via pypdf + python-docx) + 4 chunking strategies on a factory pattern (Fixed / Recursive / Semantic / Paragraph) with measured retrieval-accuracy A/B (62/78/85%). FastAPI upload route with background processing + React drag-and-drop uploader.
Phase 1: Foundation3h6 lessonsPRO TIER
Unlock with PRO →
M02
Vector Database & Embeddings
Pinecone HNSW index over text-embedding-3-small (1536-dim, cosine), namespace-per-tenant isolation, batch embedding service with cost tracking, streaming embedder with content hashing + dead-letter queue, HNSW vs IVF deep-dive (95-99% recall, 100-1000× faster than brute force).
Phase 2: Retrieval quality3h7 lessonsEXPERT TIER
Unlock with EXPERT →
M03
Querying, Prompting & Generation
Hybrid retrieval (BM25 + dense) fused via Reciprocal Rank Fusion, cross-encoder reranker (top-50 → top-10), multi-provider LLM client (OpenAI + Anthropic) with token counting, query rewriter + HyDE + multi-query expansion, streaming chat with citations + score breakdowns.
Phase 2: Retrieval quality3.5h8 lessonsEXPERT TIER
Unlock with EXPERT →
M04
Production Hardening & Deployment
RAGAS evaluation harness (faithfulness / answer_relevancy / context_precision / context_recall), Prometheus metrics, multi-stage Dockerfile + non-root user + health checks, retrieval explainability route, API-key auth + per-tenant rate limiting, render.yaml + GitHub Actions deploy.
Phase 3: Production3.5h7 lessonsEXPERT TIER
Unlock with EXPERT →
M05
AI Platform Orchestration & Multi-Tenancy
LLM Gateway with fallback chain (gpt-4o → gpt-4o-mini), response caching (1h TTL), per-tenant cost tracker. TenantManager with namespace-per-tenant + defense-in-depth filtering. SLA + circuit breaker + degraded-mode runbook. Event-driven sync via Kafka. Agentic RAG scaffold + feedback loop.
Phase 3: Production3.5h8 lessonsEXPERT TIER
Unlock with EXPERT →
Module 01 with PRO ($29/mo) · Modules 02–05 with EXPERT ($79/mo)
See plans →
Backed by curriculum
RAG Learning Path
8 modules35 hoursRAG patterns · Retrieval · Reranking · Eval
Open curriculum
iThe RAG curriculum is the foundation for the project — same stack you build here, taught from chunking and retrieval-quality first principles. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Foundation. Retrieval quality. Platform.

Each phase ends with a tagged release, a passing RAGAS canary on the 4-metric harness, and a measurable retrieval-quality delta on the seed corpus. No ambiguity about where you are.

01~3h
Foundation (Module 01)

Multi-format ingest live locally. ChunkingFactory routing by document tag; recursive default benchmarked at +16pp over fixed-size.

  • PDF + DOCX + Markdown + Text parsers with page-marker / table-aware extraction
  • ChunkingFactory with 4 strategies + measured A/B vs golden set
  • FastAPI upload + React uploader + background processing with status polling
02~6.5h
Retrieval quality (Modules 02–04)

Hybrid retrieval + cross-encoder reranker live. RAGAS 4-metric canary green; faithfulness > 0.85 enforced as quality floor.

  • Pinecone HNSW (1536-dim) + BM25 + RRF merge with k=60 fusion constant
  • Cross-encoder rerank (top-50 → top-10) with disable-able score floor
  • RAGAS canary in CI on the 40-question golden set + Prometheus metrics
03~3.5h
Platform (Module 05)

LLM Gateway with fallback chain + per-tenant cost tracker. Multi-tenant namespace isolation, SLA + circuit breaker, runbook drilled.

  • LLM gateway library with response cache (1h TTL) + per-tenant rate limit
  • TenantManager with namespace-per-tenant + defense-in-depth filtering
  • Failure-mode runbook + circuit-breaker degraded mode in M05 capstone
Project setup · 15 minutes

One command. Local FastAPI + Qdrant + Redis (no API key needed).

What lives in the repo

You get the unified production code on day one — FastAPI as the gateway, Qdrant for local vector storage (Pinecone-compatible interface), Redis for caching + rate-limit fallback, sentence-transformers cross-encoder for reranking, plus the lazy-init OpenAI/Anthropic clients so imports stay clean even without keys.

  • docker-compose.yml — Qdrant + Redis + API container with health checks
  • app/services/document_parser.py + chunking.py — pypdf + python-docx + 4-strategy ChunkingFactory
  • app/services/vector_store.py + rag_service.py — Pinecone/Qdrant + hybrid retrieve + cross-encoder reranker
  • scripts/ragas_eval.py — RAGAS canary harness with 4-metric scoring
  • gateway/llm_gateway.py + tenant/tenant_manager.py — fallback chain + response cache + per-tenant quota
  • docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 70 files · 75 KB

Enterprise RAG Starter Kit

Pre-built RAG platform with parsers, ChunkingFactory, hybrid retriever, RAGAS harness, LLM gateway, and the 4-document seed corpus that powered the chunking A/B benchmark. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 70 files · ADRs + cost model bundled · last updated 2026-05-09
~/projects/enterprise-rag — zsh
1. Unzip and bring up Qdrant + Redis (no API key needed)
$ unzip enterprise-rag-starter.zip -d enterprise-rag
$ cd enterprise-rag && cp .env.example .env
$ docker-compose up -d
2. Install Python deps and boot the FastAPI app
$ python -m venv .venv && source .venv/bin/activate
$ pip install -r requirements.txt
$ uvicorn app.main:app --reload --port 8000
3. Upload a sample doc and run a hybrid query
$ curl -F file=@data/sample_docs/long_handbook.md http://localhost:8000/upload
$ curl -X POST http://localhost:8000/chat \
$ -d '{"query":"what is the parental leave policy"}'
4. Read the ADRs and re-run the cost model
$ cat docs/adr/001-hybrid-retrieval-rrf.md
$ open docs/cost-model/enterprise-rag-cost-model.csv
4
seed docs · long_handbook + 3
~80
chunks · recursive default
40
golden questions · A/B set
RAGAS × 4
metrics in CI
Production hardening

The same RAG demo — but built for the relevance-review case.

Most RAG tutorials stop at vector search + a prompt template. This shows what changes when retrieval failures land in your inbox, not the LLM’s — and a relevance reviewer wants to see the chunking-A/B numbers behind your defaults.

Notebook RAGWhat most teams ship
×
Chunking
Fixed-size 1000 chars, hope for the best
×
Retrieval
Vector similarity only, top-K from one index
×
Reranking
None — trust the bi-encoder ordering
×
Eval
Eyeball a few outputs, ship
×
LLM call
Hardcoded gpt-4o, no fallback
×
Failure mode
Bad answer, no signal, no recovery
Your retrieval-quality platformModules 02–05
Chunking
ChunkingFactory routes by tag; recursive default (78%), semantic opt-in for legal/medical (85%) — ADR-003
Retrieval
BM25 + Pinecone HNSW fused via RRF k=60 — ADR-001
Reranking
Cross-encoder (ms-marco-MiniLM-L-6-v2) over top-50 → top-10 with disable-able score floor — ADR-002
Eval
RAGAS canary on 40-question golden set — faithfulness floor > 0.85 from sla.yaml
LLM call
LLMGateway with gpt-4o → gpt-4o-mini fallback + 1h response cache — ADR-004
Failure mode
Circuit breaker + degraded mode (LLM down → chunks-only; vector-store down → BM25-only) + on-call runbook
EXPERT-only · architecture decision records

Write the ADRs staff retrieval engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the fixed-size-chunking → recursive reversal that the chunking-A/B benchmark forced. Preview ADR-001 →

ADR-001Accepted

Hybrid retrieval (BM25 + dense) with Reciprocal Rank Fusion

Context
Dense-only misses exact-match queries (SKUs, function names, headings) by ~10-15pp; BM25-only loses paraphrase recall
Decision
Run both retrievers in parallel; fuse rankings via RRF k=60; feed top-50 to the reranker
Tradeoff
Two indexes to keep consistent + ~30-40ms parallel latency vs hit-rate@10 lift from ~74% to ~88%
Reversal
Drop BM25 when its marginal hit-rate < 5pp; ~3 engineer-days to disable
ADR-002Accepted

Cross-encoder reranking (top-50 → top-10) precision lever

Context
Bi-encoders score chunks independently — never compare query and chunk in same forward pass; LLM-as-reranker is too expensive on every query
Decision
ms-marco-MiniLM-L-6-v2 cross-encoder on dedicated CPU pool; rerank top-50 → top-10 with score floor
Tradeoff
+150-200ms p95 latency + CPU compute vs +14pp hit-rate@10 (74% → 88%) on the chunking-A/B bench
Reversal
Disable per-endpoint via .env flag if precision lift drops < 5pp on RAGAS canary
ADR-003Accepted

Recursive chunking default; semantic for high-value docs only

Context
Chunking A/B on 4-doc seed: Fixed 62% / Paragraph 71% / Recursive 78% / Semantic 85% retrieval accuracy@10
Decision
ChunkingFactory.from_metadata — recursive default; semantic for legal/medical/contract tags; paragraph for clean Markdown
Tradeoff
Operational complexity (factory + tag policy + 2 thresholds) vs avoiding either-or (~$30k/yr semantic-everywhere or 16pp accuracy regression)
Reversal
ENV flag drops factory to recursive-only if mix stops earning its complexity; ~3 engineer-days
ADR-004Accepted

LLM Gateway with fallback chain (gpt-4o → gpt-4o-mini)

Context
Per-handler LLM calls fail vendor outages, repeat per-tenant cost-attribution code, and miss 20% cache opportunity
Decision
LLMGateway singleton — fallback chain, 1h response cache, per-tenant rate-limit + cost cap, in-process v1
Tradeoff
Single point of failure + test-state cleanup vs centralised outage / cost / cache surface
Reversal
Extract as out-of-process service (~5 engineer-days, interface preserved)
EXPERT-only · cost model

Read the per-query story, not just the latency one.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 5-tenant reference load (10k qpd · 300k req/mo), real OpenAI + Pinecone serverless + AWS list prices, with the model-cascade and 20%-cache-hit levers wired up. The version you defend to a CFO. Preview the CSV →

ComponentBaseline / moOptimized / moDelta
OpenAI GPT-4o (premium tier)
100% baseline · 20% optimized via router · 2K in / 300 out / req
$2,400
$384
−$2,016
OpenAI GPT-4o-mini (mini tier)
80% optimized via router · ~17× cheaper than premium
$92
OpenAI text-embedding-3-small
query embeddings + amortised ingest
$5
$5
Pinecone serverless · HNSW
2M vectors · 5 namespaces (one per tenant)
$70
$70
ElastiCache Redis (cache.t4g.small + replica)
response cache + BM25 cache + rate-limit fallback
$54
$54
Reranker compute (cross-encoder, self-hosted)
ms-marco-MiniLM on t4g.medium · ~150ms p95
$30
$30
Total · 5 tenants · 300k req/mo
$0.0085 per req baseline → $0.0021 optimized
$2,559
$635
−$1,924 (−75%)

Optimization levers

Model cascade (gpt-4o-mini default + gpt-4o escalation)
ADR-004. LLM gateway routes 80% of traffic to gpt-4o-mini; only escalates the 20% that need premium reasoning. Largest single lever.
−$1,901 / mo (alone)
Semantic response cache (1h TTL)
ADR-004 again — gateway hashes prompt + tenant + retrieved-chunks fingerprint. 20% hit rate at the reference workload. Stacks with cascade.
−$480 / mo (alone)
Cross-encoder reranking → tighter prompts
ADR-002. Top-50 → top-10 means the LLM sees only highest-precision chunks, reducing average input ~5%.
−$30 / mo (additional)
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your RAGAS canary numbers. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a RAGAS regression report.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

PR
Priya R.
Ex-staff · retrieval platform · top-3 search co.
Hybrid retrieval, reranker tuning, chunking strategy A/B, eval-driven retrieval design
Send the diff. I'll go line-by-line through your RRF fusion and your reranker score floor and pick out the cases where lexical and semantic disagree.
JK
Jamal K.
Principal · AI infra · public Series-D
LLM gateway design, per-tenant cost attribution, fallback chain tradeoffs, cache strategy
Send your CSV and your gateway code. We'll walk it backwards from the totals row to the assumption that breaks first when traffic 3×s.
RW
Rachel W.
Eng manager · AI · public mid-cap
Org design for retrieval teams, RAGAS regression gates, hiring rubrics, staff promo packets
If you're prepping for staff, send your ADR-005 + your chunking A/B writeup. We'll work backwards from the rubric.
Format
Async
Turnaround
7 days
Cohort
12 members
Scope
ADR + RAGAS review
Request a slot
What your tier unlocks

PRO unlocks Module 01. EXPERT unlocks the full retrieval stack.

PRO is the entry point — Module 01 (multi-format parsers + 4-strategy chunker + upload API + uploader UI) plus the rest of the PRO catalog. EXPERT unlocks Modules 02–05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.

What you getFREEPROEXPERT
Module 01 of P06
Document Ingestion + ChunkingFactory (~3h)
Included
Included
Modules 02–05 of P06
Retrieval / Production / Platform (~13.5h)
Included
5 committed ADRs + cost-model CSV
Starter kit docs/adr/ + docs/cost-model/
Included
PRO project catalog
Production-grade builds
2
All current
All current + this one
Curriculum
All 7 tracks
Phase 1 only
All
All + bonus modules
Code review
Senior+ reviewers
4 / month
Unlimited
Cohort-beta architecture review
Async · 7-day turnaround · 12-member cap
Included
Certificate
Verifiable on LinkedIn
Yes
Yes + LinkedIn rec
$79/mo
billed monthly · open enrollment · cancel anytime
or annual
$699/yr save 26%
Unlock EXPERT
Who this is for

Pick this if you own retrieval quality, not just the prompt.

AI

AI engineers shipping RAG to staff/principal

You debug retrieval failures, you tune reranker thresholds, and your manager asks 'what's our hit-rate@10?' on review day. The chunking A/B + RAGAS harness is the answer.

PL

ML platform leads building retrieval infra

You absorb retrieval as a platform service. Pinecone + BM25 + reranker + LLM gateway as a single library that the rest of the org calls — that's the playbook.

FR

Founding engineers · AI startups

Your first paying customer will ask 'why did the bot say that?' before they ask about scale. Citations + explainability + RAGAS faithfulness is the answer in one repo.

EM

Engineering managers · AI

You need a relevance-review rubric for the AI roadmap your VP will ask about. The 5 ADRs (one Deprecated) are exactly the artifacts a panel actually opens.

FAQ · EXPERT tier

Quick answers.

Different lanes. P06 owns retrieval *quality* — hybrid BM25 + dense + RRF, cross-encoder reranking, 4-strategy chunking A/B (62/78/85% accuracy), RAGAS 4-metric eval harness, HNSW vs IVF tradeoffs, query rewriting / HyDE / multi-query. P30 owns retrieval *governance* — Postgres RLS multi-tenancy, Presidio PII redaction, prompt-injection / jailbreak guardrails, GDPR data-deletion, audit logs. Pick P06 if your manager asks 'what's our hit-rate@10?' Pick P30 if your CISO asks 'what about the audit log?'
Module 01 (multi-format parsers + 4-strategy ChunkingFactory + upload API + React uploader) is included with PRO at $29/mo — a working ingestion pipeline you can ship a demo on. Modules 02–05 (Pinecone HNSW + hybrid retriever + cross-encoder reranker + RAGAS canary + LLM gateway + multi-tenant blueprint), plus the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review, unlock with EXPERT at $79/mo. PRO gets you the foundation; EXPERT gets you the retrieval-quality system you'd defend in a relevance review.
Real and reproducible on the seed corpus. 4 documents (long_handbook + short_faq + structured_pricing + noisy_release_notes) × 40 hand-built golden questions × 4 strategies = the benchmark in the kit. ADR-005 documents the reversal from fixed-size (62%) to recursive (78%) when the benchmark exposed mid-sentence cuts in the long-handbook content. Real workloads will move these numbers; the methodology is what travels.
Yes. The starter kit ships with Qdrant in docker-compose as the default vector store — Pinecone-compatible interface, no API key needed. The Pinecone code path is enabled by an env flag if you want to run against the managed service. The chunking, retrieval, rerank, and RAGAS code are vector-store-agnostic.
Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training or AI upskilling budgets. The cost-model CSV is itself a defensible business artifact for that conversation.
It's a strong forcing function. Staff AI / ML platform interviews lean heavily on retrieval system design (chunking, hybrid retrieval, reranking, eval) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the chunking A/B receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta async review on your final repo and you have a portfolio piece that survives a staff promo packet.

Ready to ship a retrieval-quality RAG platform?

Start with PRO ($29/mo) for Module 01 — multi-format parsers + 4-strategy ChunkingFactory. Or unlock the full 5-module retrieval stack plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

P06 · Enterprise RAG · EXPERT · PRO unlocks M01Unlock EXPERT →
Press Cmd+K to open