Build an
LLM training-data
pipeline with crawl + dedup + RAG
Ship a real training-data platform: aiohttp crawler with rate-limit + extractor fallback chain, MinHash + LSH dedup at 95% precision, multi-signal quality scoring, hybrid tokenization (tiktoken + custom BPE), pgvector or Pinecone retrieval (Protocol), vLLM batch serving, RAGAS + ensemble-judge eval with CI regression gate, Airflow + Ray orchestration, and a per-tenant cost tracker. Modules 01-02 unlock with PRO; the full platform with EXPERT.
The dataset-engineering portfolio piece for staff AI infra roles — 5 committed ADRs, a runnable cost-model CSV, a dual-backend retrieval Protocol, an Airflow + Ray orchestration design, and a CI eval gate that blocks merges on regression.
- aiohttp crawler with TokenBucket rate limiting + 4-extractor fallback chain (1M-doc-capable)
- MinHash + LSH dedup at 95% precision (datasketch tuned, validated against labeled fixture)
- Hybrid tokenization with tiktoken default + pedagogical custom BPE + sequence packing
- Dual-backend RAG (pgvector or Pinecone via Protocol) with chunking + MRR/NDCG eval harness
- vLLM serving + Locust load test + RAGAS + ensemble judge (multi-doc synthesis)
- Airflow DAGs (ingest + eval) + Ray fan-out + GitHub Actions CI gate (10% regression cap)
- 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.
Modules 01-02 (~12h) ship a working RAG system from raw web data — the same pattern behind ChatGPT-style applications. Included with PRO. Modules 03-05 (~12h additional) layer on the production serving + evaluation + LLMOps story and unlock with EXPERT.
Crawl. Train-data. Operate.
Each phase ends with a tagged release, a passing eval suite, and a runbook drill. No ambiguity about where you are.
Working RAG system from raw web data. aiohttp crawler · MinHash + LSH dedup · quality scoring · tokenization · pgvector or Pinecone.
- ✓1M-doc-capable async crawler + 4-extractor chain
- ✓MinHash + LSH dedup @ 95% precision (validated)
- ✓Dual-backend RAG with MRR/NDCG eval harness
vLLM batch serving with Locust load test, RAGAS + ensemble judge eval, GitHub Actions CI gate that blocks merge on regression.
- ✓vLLM service · AWQ quantization · P99 < 200ms
- ✓Ensemble judge · multi-doc synthesis · failure analysis
- ✓CI eval gate · 10% regression cap · PR comment posting
Airflow + Ray orchestration, Prometheus cost tracking, runbook + architecture document for staff capstone.
- ✓Airflow ingest DAG + eval DAG
- ✓Ray fan-out · @ray.remote(num_gpus=0.5)
- ✓Cost tracker + 6-scenario runbook
Two-tier requirements. Local aiohttp + Postgres + Ray + Airflow + vLLM (GPU optional).
What lives in the repo
You get the real platform on day one — aiohttp + extractors for the crawler, MinHash via datasketch for dedup, tiktoken + custom BPE for tokenization, pgvector OR Pinecone via a retrieval Protocol, vLLM for GPU serving, Airflow + Ray for orchestration, and Prometheus + Locust for observability + load testing.
- base_crawler.py + crawl/ + extractors.py — aiohttp async crawler + TokenBucket rate limiter + 4-extractor fallback
- minhash.py + dedup/ + distributed_dedup.py + pipeline/dedup.py — MinHash + LSH (datasketch) + optional pyspark Tier-2 scaling
- tokenizers/bpe.py + tokenize/ + packing.py + augmentation/ — pedagogical custom BPE + tiktoken production path + sequence packing
- rag/ (pgvector + Pinecone clients via Protocol) + retrieval/ — dual-backend RAG with chunking + evaluate_retrieval (MRR + NDCG)
- serve/ + serving/ + inference/ + models/ — vLLM launchers + FastAPI async + BatchInferenceEngine
- evaluation/ + eval/ + bench/locustfile.py — RAGAS + ensemble judge + Locust 50-user P50/P95/P99 load test
- dags/ + .github/workflows/eval.yml + app/metrics.py — Airflow ingest + eval DAGs · CI gate (10% regression cap) · Prometheus
- docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
LLM Training-Data Pipeline Starter Kit
Pre-built dataset-engineering platform: 9 sub-tutorials of source, Docker compose (pgvector + Redis), 120 raw_corpus fixtures, 10.3 MB dedup_input + 5.6 MB quality_input, 100 golden eval Q's, Tier-1 and Tier-2 requirements, Airflow DAGs, GitHub Actions CI gate. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read straight from the repo.
The same dataset demo — but built for the production case.
Most LLM training-data tutorials show you a notebook scraping a few URLs into a single file. This shows what changes when you crawl 1M pages, on-call owns the dedup precision dashboard, and finance asks for cost-per-1M-documents.
aiohttp + TokenBucket + 4 extractors + retry policyWrite the ADRs staff engineers actually get judged on.
Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the Pinecone-only → dual-backend retrieval reversal. The kind of doc that travels with you to your next role. Preview ADR-001 →
aiohttp + custom crawler over Scrapy / requests-html
aiohttp + TokenBucket + 4-extractor fallback (BS4 · readability · trafilatura · pypdf)MinHash + LSH (datasketch) over hash-only / embedding similarity dedup
datasketch.MinHashLSH · shingle k=5 · num_perm=128 · 0.85 Jaccard thresholdHybrid tokenization — tiktoken default + custom BPE pedagogical
tiktoken cl100k_base for production · tokenizers/bpe.py from-scratch for the merge-loop walkthroughRay + Airflow for orchestration over Spark / Dask / single-node cron
@ray.remote(num_gpus=0.5) · Airflow DAGs in dags/{ingest,eval}_dag.pyPinecone-only vector backend
RetrievalBackend Protocol with both pgvector and Pinecone clientsRead the FinOps story for the pipeline you actually ship.
Module 05 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. Single team @ 1M docs/quarter, real AWS RDS + EC2 + GPU + OpenAI list prices, with the spot-batch + 1-yr Reserved Instance + LLM-judge sampling levers wired up. Preview the CSV →
Optimization levers
Async architecture review with a staff-level reviewer (cohort beta).
Submit your repo, your ADR draft, or your dedup-precision benchmark. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.
Bring a diff, an ADR draft, or a dedup benchmark.
The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.
PRO unlocks Modules 01-02. EXPERT unlocks the full platform.
PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.
Pick this if you own the dedup dashboard, not just a notebook.
Senior data engineers
You've shipped batch pipelines. Now you own the dedup precision dashboard, the embedding-model migration plan, and the architecture review with platform.
AI infra engineers
You absorb new dataset shapes without absorbing new vendors. aiohttp + datasketch + pgvector + Airflow + Ray — tools your platform team already operates.
Engineering managers · ML platform
You need a reference architecture for the dataset-quality + cost questions your CTO will ask before the AI team gets headcount or a GPU budget.
Founding engineers · AI startups
Your investors will ask about dataset quality + unit economics before they ask about scale. The 5 ADRs + cost model is the answer.
Going deeper? Four tracks back this project.
The Dataset Engineering curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.
Quick answers.
Ready to ship a real training-data pipeline?
Start with PRO ($29/mo) for Modules 01-02 — crawl + dedup + quality + tokenization + RAG. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).