EXPERT-tier · PRO unlocks Modules 01-02 · AI & vectors track · P16

Build an LLM training-data pipeline with crawl + dedup + RAG

Ship a real training-data platform: aiohttp crawler with rate-limit + extractor fallback chain, MinHash + LSH dedup at 95% precision, multi-signal quality scoring, hybrid tokenization (tiktoken + custom BPE), pgvector or Pinecone retrieval (Protocol), vLLM batch serving, RAGAS + ensemble-judge eval with CI regression gate, Airflow + Ray orchestration, and a per-tenant cost tracker. Modules 01-02 unlock with PRO; the full platform with EXPERT.

Timeline
24 hours
Difficulty
Senior+
Stack
aiohttp · MinHash + LSH · tiktoken · Ray · vLLM · Airflow

The dataset-engineering portfolio piece for staff AI infra roles — 5 committed ADRs, a runnable cost-model CSV, a dual-backend retrieval Protocol, an Airflow + Ray orchestration design, and a CI eval gate that blocks merges on regression.

By the end you will have wired
  • aiohttp crawler with TokenBucket rate limiting + 4-extractor fallback chain (1M-doc-capable)
  • MinHash + LSH dedup at 95% precision (datasketch tuned, validated against labeled fixture)
  • Hybrid tokenization with tiktoken default + pedagogical custom BPE + sequence packing
  • Dual-backend RAG (pgvector or Pinecone via Protocol) with chunking + MRR/NDCG eval harness
  • vLLM serving + Locust load test + RAGAS + ensemble judge (multi-doc synthesis)
  • Airflow DAGs (ingest + eval) + Ray fan-out + GitHub Actions CI gate (10% regression cap)
  • 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
PREREQ · SENIOR+
Built for engineers shipping production data pipelines / RAG. Comfortable with Python (async, typing), pandas / large files, and ML fundamentals (embeddings, tokenization). Distributed-compute experience helpful but not required. Not a “what is RAG” course.
llm-ingestion.platform · 5 modules · 9 sub-parts · 1M-doc-capable · CI eval gate · dual-backend RAG
ingest DAG + eval DAG armed
Crawl + clean
  • aiohttp + TokenBucket: 1M-doc-capable batch · politeness rate limit
  • extractors fallback: BS4 · readability · trafilatura · pypdf
  • raw_corpus: 120 fixtures · 5 topic domains
  • robots.txt aware: URL dedup · retry policy
  • aiohttp + extractors: see ADR-001
Dedup + score
  • MinHash + LSH (datasketch): shingle k=5 · num_perm=128 · 95% precision
  • DocumentScorer: length · language · regex · toxicity
  • 10.3 MB dedup_input fixture: engineered exact + near-dupe clusters
  • optional pyspark: Tier-2 distributed dedup
  • Dedup at scale: see ADR-002
Tokenize + RAG
  • tiktoken (default): cl100k_base · production encoding
  • custom BPE (pedagogical): tokenizers/bpe.py · merge loop visible
  • PackingStrategy: first-fit · max_len 4096
  • RAG · pgvector OR Pinecone: Protocol · MRR/NDCG eval
  • Versioned dataset: see ADR-003 + ADR-005
Serve + eval + ops
  • vLLM + AWQ/GPTQ: FastAPI async · KV cache · prefix cache
  • RAGAS + ensemble judge: single + multi-doc synthesis · failure_analysis
  • Airflow DAGs + Ray: ingest + eval · @ray.remote(num_gpus=0.5)
  • CI eval gate: .github/workflows/eval.yml · 10% regression cap
  • Production LLMOps: see ADR-004
# Recall + dedup — 95% precision · MRR/NDCG measured
1k+ docs crawled async with TokenBucket rate limiting
MinHash + LSH @ 95% precision on labeled validation set
Ensemble judge (single + multi-doc synthesis) catches semantic dupes
→ data/eval/{eval_questions,golden_examples,llm_judge_examples}.jsonl bundled in zip
# Cost — $926 → $543 / mo at 1M docs/quarter (−41%)
Spot instances for crawl + dedup + embed batch (Ray fault-tolerant)
1-yr Reserved Instances on persistent serving + RDS + Airflow
LLM-judge sampling: 5% of docs vs 100% on every CI run
→ ADR-005 documents the Pinecone-only → dual-backend reversal
95% precision · MinHash dedup · 100K validated
5 ADRs · committed in starter kit
P99 < 200ms · vLLM serving · Locust validated
Curriculum · 5 modules · 24 hours · 9 sub-parts

Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.

Modules 01-02 (~12h) ship a working RAG system from raw web data — the same pattern behind ChatGPT-style applications. Included with PRO. Modules 03-05 (~12h additional) layer on the production serving + evaluation + LLMOps story and unlock with EXPERT.

P16 · 5 modules · 24 hours · 9 sub-parts
M01
Data Foundation — Crawl + Dedup + Quality
aiohttp crawler with TokenBucket rate limiting and 4-extractor fallback chain (part 1). MinHash + LSH dedup via datasketch at 95% precision (part 2). Multi-signal quality scoring with rejection reasons (part 3). 1M-doc-capable batch; 10.3 MB dedup fixture + 5.6 MB quality fixture bundled.
6h · 28 lessons · PRO TIER
Unlock with PRO →
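As a rough illustration of the multi-signal scoring with rejection reasons described for part 3, here is a minimal sketch. It is not the kit's DocumentScorer: the function name `score_document`, the `ScoreResult` type, the reason strings, the 200-character floor, the boilerplate regex, and the pluggable `is_target_language` / `is_toxic` callables are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ScoreResult:
    accepted: bool
    reasons: List[str] = field(default_factory=list)


# Hypothetical boilerplate patterns; a real pipeline would tune these.
BOILERPLATE = re.compile(r"(lorem ipsum|click here to subscribe)", re.IGNORECASE)


def score_document(
    text: str,
    min_chars: int = 200,  # assumed length floor
    is_target_language: Callable[[str], bool] = lambda t: True,  # plug in a language detector
    is_toxic: Callable[[str], bool] = lambda t: False,  # plug in a toxicity classifier
) -> ScoreResult:
    """Run every signal and collect a reason for each one that fails."""
    reasons: List[str] = []
    if len(text) < min_chars:
        reasons.append("too_short")
    if not is_target_language(text):
        reasons.append("wrong_language")
    if BOILERPLATE.search(text):
        reasons.append("boilerplate_regex")
    if is_toxic(text):
        reasons.append("toxic")
    return ScoreResult(accepted=not reasons, reasons=reasons)
```

Collecting every failing signal, rather than stopping at the first, keeps the rejection log useful when one document fails for several reasons at once.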
M02
Dataset → RAG System — Tokenization + RAG
Hybrid tokenization: tiktoken (default production path) + custom from-scratch BPE (pedagogical merge loop) + PackingStrategy (part 4). Dual-backend retrieval via Protocol — pgvector OR Pinecone — with chunking strategies, MRR/NDCG eval harness on 50 golden queries (part 5). Output as HuggingFace-compatible Parquet shards.
6h · 30 lessons · PRO TIER
Unlock with PRO →
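The PackingStrategy named above is listed elsewhere on this page as first-fit with max_len 4096. A minimal sketch of first-fit packing over token lengths might look like this. The function name `first_fit_pack`, the choice to return index lists, and the decision to raise on oversized sequences (rather than truncate or split them) are assumptions of this example, not the kit's API.

```python
from typing import List


def first_fit_pack(lengths: List[int], max_len: int = 4096) -> List[List[int]]:
    """Pack sequences into bins of at most max_len tokens.

    Returns bins of indices into `lengths`. Each sequence goes into the
    first existing bin with enough free space, else into a new bin.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity per bin
    for idx, n in enumerate(lengths):
        if n > max_len:
            # Assumption: oversized sequences are rejected, not split.
            raise ValueError(f"sequence {idx} has {n} tokens, above max_len={max_len}")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            bins.append([idx])
            free.append(max_len - n)
    return bins
```

First-fit is a greedy heuristic: it does not guarantee the minimum number of bins, but it runs in one pass over the input and keeps padding low for typical length distributions.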
M03
Model & Serving — Integration + vLLM + Locust
Local model loading via transformers + bitsandbytes 4-bit quantization, BatchInferenceEngine with OOM recovery + Parquet checkpointing (part 6). vLLM with AWQ/GPTQ quantization, FastAPI async serving layer, KV cache + prefix caching, P50/P95/P99 latency benchmarking via Locust at 50 concurrent users (part 7). <200ms P99 target.
6h · 24 lessons · EXPERT TIER
Unlock with EXPERT →
M04
Evaluation & Feedback — RAGAS + ensemble judge + CI gate
Build an evaluation dataset with single-doc + multi-doc synthesis questions. RAGAS scaffolding for context relevance / faithfulness / answer relevance. Ensemble judge with 3D scoring (factual accuracy, relevance, completeness). Failure analysis with error categorization. GitHub Actions CI gate that blocks merge on >10% regression with PR comment posting.
3h · 14 lessons · EXPERT TIER
Unlock with EXPERT →
M05
Production LLMOps — Airflow + Ray + cost tracker + runbook
Airflow ingest DAG (weekly: crawl → dedup → quality → embed → vector store) + eval DAG (daily: eval suite → baseline compare → alert on regression). Ray fan-out via @ray.remote(num_gpus=0.5) for distributed embedding generation. Prometheus metrics + cost tracking + budget alerts. Architecture document + 6-scenario runbook + capstone artifact.
3h · 14 lessons · EXPERT TIER
Unlock with EXPERT →
Modules 01-02 with PRO ($29/mo) · Modules 03-05 with EXPERT ($79/mo)
See plans →
Backed by curriculum
Data Curation & Dataset Engineering
13 modules · ~38 hours · MinHash + LSH · tiktoken + BPE · dual-backend RAG · Airflow + Ray
Open curriculum
The Dataset Engineering curriculum is the foundation for this project, not a sales add-on. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Crawl. Train-data. Operate.

Each phase ends with a tagged release, a passing eval suite, and a runbook drill. No ambiguity about where you are.

01 · ~12h
Foundation (Modules 01-02)

Working RAG system from raw web data. aiohttp crawler · MinHash + LSH dedup · quality scoring · tokenization · pgvector or Pinecone.

  • 1M-doc-capable async crawler + 4-extractor chain
  • MinHash + LSH dedup @ 95% precision (validated)
  • Dual-backend RAG with MRR/NDCG eval harness
02 · ~9h
Production (Modules 03-04)

vLLM batch serving with Locust load test, RAGAS + ensemble judge eval, GitHub Actions CI gate that blocks merge on regression.

  • vLLM service · AWQ quantization · P99 < 200ms
  • Ensemble judge · multi-doc synthesis · failure analysis
  • CI eval gate · 10% regression cap · PR comment posting
03 · ~3h
Operate (Module 05)

Airflow + Ray orchestration, Prometheus cost tracking, runbook + architecture document for staff capstone.

  • Airflow ingest DAG + eval DAG
  • Ray fan-out · @ray.remote(num_gpus=0.5)
  • Cost tracker + 6-scenario runbook
Project setup · 20 minutes

Two-tier requirements. Local aiohttp + Postgres + Ray + Airflow + vLLM (GPU optional).

What lives in the repo

You get the real platform on day one — aiohttp + extractors for the crawler, MinHash via datasketch for dedup, tiktoken + custom BPE for tokenization, pgvector OR Pinecone via a retrieval Protocol, vLLM for GPU serving, Airflow + Ray for orchestration, and Prometheus + Locust for observability + load testing.

  • base_crawler.py + crawl/ + extractors.py — aiohttp async crawler + TokenBucket rate limiter + 4-extractor fallback
  • minhash.py + dedup/ + distributed_dedup.py + pipeline/dedup.py — MinHash + LSH (datasketch) + optional pyspark Tier-2 scaling
  • tokenizers/bpe.py + tokenize/ + packing.py + augmentation/ — pedagogical custom BPE + tiktoken production path + sequence packing
  • rag/ (pgvector + Pinecone clients via Protocol) + retrieval/ — dual-backend RAG with chunking + evaluate_retrieval (MRR + NDCG)
  • serve/ + serving/ + inference/ + models/ — vLLM launchers + FastAPI async + BatchInferenceEngine
  • evaluation/ + eval/ + bench/locustfile.py — RAGAS + ensemble judge + Locust 50-user P50/P95/P99 load test
  • dags/ + .github/workflows/eval.yml + app/metrics.py — Airflow ingest + eval DAGs · CI gate (10% regression cap) · Prometheus
  • docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
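The retrieval bullet above mentions an evaluate_retrieval step that reports MRR and NDCG. As a hedged sketch of what those two metrics compute, assuming binary relevance labels and hypothetical function names (the kit's own harness may grade relevance differently):

```python
import math
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR rewards putting the first relevant chunk near the top, while NDCG also credits additional relevant chunks further down the list, so the two can disagree when chunking strategies change.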
Download · Starter Kit · 219 files · 1.1 MB compressed

LLM Training-Data Pipeline Starter Kit

Pre-built dataset-engineering platform: 9 sub-tutorials of source, Docker compose (pgvector + Redis), 120 raw_corpus fixtures, 10.3 MB dedup_input + 5.6 MB quality_input, 100 golden eval Q's, Tier-1 and Tier-2 requirements, Airflow DAGs, GitHub Actions CI gate. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read straight from the repo.

EXPERT project · 219 files · ADRs + cost model bundled · last updated 2026-05-09
~/projects/llm-ingestion-pipeline — zsh
1. Unzip and start the local stack
$ unzip llm-ingestion-pipeline-starter.zip
$ cd llm-ingestion-pipeline && cp .env.example .env
$ make up # pgvector + Redis + Airflow standalone
2. Run the ingest pipeline (Tier-1 path · CPU only)
$ pip install -r requirements-tier1.txt
$ python pipeline.py --seeds data/raw_corpus --tier 1
3. Run the eval suite + CI gate
$ python eval/run_comparison.py --baseline main --candidate HEAD
$ # Same script GitHub Actions calls in eval.yml
4. Locust load test the vLLM serving layer
$ bash bench/run_loadtest.sh # 50 users · P50/P95/P99 + CSV report
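Step 3 compares a candidate run against a baseline, and the CI gate is described on this page as a 10% regression cap. The sketch below shows one way such a check could work. It assumes higher-is-better metrics and a relative (not absolute) drop, and the function names and metric names are hypothetical. It is not the contents of eval/run_comparison.py.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.10,  # 10% cap, interpreted here as a relative drop
) -> Dict[str, float]:
    """Return {metric: relative_drop} for metrics that regress past the cap."""
    failed: Dict[str, float] = {}
    for name, base in baseline.items():
        if base == 0:
            continue  # no meaningful relative drop from a zero baseline
        if name not in candidate:
            failed[name] = 1.0  # a missing metric counts as a full regression
            continue
        drop = (base - candidate[name]) / base
        if drop > max_drop:
            failed[name] = drop
    return failed


def gate_exit_code(failed: Dict[str, float]) -> int:
    """Non-zero exit code fails the CI job and therefore blocks the merge."""
    return 1 if failed else 0
```

In a workflow, the script would typically load both metric sets from JSON, call `find_regressions`, print the offending metrics for the PR comment, and exit with `gate_exit_code(...)`.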
120 · raw_corpus fixtures
10.3 MB · dedup_input · labeled clusters
5.6 MB · quality_input · seeded issues
100 · eval golden Q's (90 + 10 hand-curated)
Production hardening

The same dataset demo — but built for the production case.

Most LLM training-data tutorials show you a notebook scraping a few URLs into a single file. This shows what changes when you crawl 1M pages, on-call owns the dedup precision dashboard, and finance asks for cost-per-1M-documents.

| Stage | Notebook dataset (what most teams ship) | Your production pipeline (Modules 01–05) |
|---|---|---|
| Crawl | `requests.get` in a loop, no politeness | aiohttp + TokenBucket + 4 extractors + retry policy |
| Dedup | `set()` of hashes, exact only | MinHash + LSH @ 95% precision (datasketch tuned, validated) |
| Quality | "skip short docs" | Multi-signal scorer (length + language + regex + toxicity) with rejection reasons |
| Tokenization | tiktoken-only, no packing | tiktoken (default) + custom BPE for vocab study + greedy packing |
| Retrieval | sentence-transformers + cosine in Python | Dual-backend (pgvector OR Pinecone) Protocol + chunking + MRR/NDCG eval |
| Serving + Ops | `for q in queries: ...` in a notebook | vLLM + FastAPI + Locust + Airflow DAGs + CI eval gate (10% cap) |
EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the Pinecone-only → dual-backend retrieval reversal. The kind of doc that travels with you to your next role. Preview ADR-001 →

ADR-001 · Accepted

aiohttp + custom crawler over Scrapy / requests-html

Context: Tutorial reproducibility + control over the request loop at 1M-doc-capable scale
Decision: aiohttp + TokenBucket + 4-extractor fallback (BS4 · readability · trafilatura · pypdf)
Tradeoff: No JS rendering + reimplement politeness vs zero Scrapy onramp + smaller memory footprint
Reversal: Scrapy swap is ~1 engineer-week behind the BaseCrawler.fetch interface
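To make the TokenBucket idea in ADR-001 concrete, here is a minimal asyncio sketch. Only the name TokenBucket comes from this page. The `rate` and `capacity` parameters, the `acquire` method, and the `polite_fetch_all` helper are assumptions for illustration, and `fetch` stands in for whatever aiohttp-based coroutine the crawler actually uses.

```python
import asyncio
import time


class TokenBucket:
    """Async token bucket: refills at `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int) -> None:
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        self._tokens = min(self.capacity, self._tokens + (now - self._updated) * self.rate)
        self._updated = now

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        async with self._lock:
            self._refill()
            while self._tokens < 1.0:
                # Sleep roughly long enough for the missing fraction to refill.
                await asyncio.sleep((1.0 - self._tokens) / self.rate)
                self._refill()
            self._tokens -= 1.0


async def polite_fetch_all(urls, fetch, bucket):
    """Run the placeholder coroutine `fetch(url)` for each URL, gated by the bucket."""

    async def one(url):
        await bucket.acquire()
        return await fetch(url)

    return await asyncio.gather(*(one(u) for u in urls))
```

A per-host bucket (one instance keyed by domain) is the usual way to turn this into a politeness limit; a single global bucket only caps total throughput.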
ADR-002 · Accepted

MinHash + LSH (datasketch) over hash-only / embedding similarity dedup

Context: Catch exact + near-dupes (95% similar) at 100K-doc scale on a single laptop
Decision: datasketch.MinHashLSH · shingle k=5 · num_perm=128 · 0.85 Jaccard threshold
Tradeoff: Misses paraphrased dupes (50% similar) vs $0 + 95% precision on the labeled fixture
Reversal: Embedding-similarity hybrid swap is 1 engineer-week if paraphrase recall matters
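The parameters in ADR-002 (shingle k=5, num_perm=128, 0.85 Jaccard threshold) can be illustrated with a small pure-Python MinHash. This is a standard-library stand-in for the idea, not datasketch and not the kit's minhash.py. It omits the LSH index entirely, and it assumes word-level shingles, which this page does not specify.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Word-level k-shingles (assumption: the kit may shingle differently)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # blake2b accepts salts up to 16 bytes
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.85) -> bool:
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

Comparing every pair of signatures is quadratic in the number of documents. The LSH index is what avoids that by only comparing documents that collide in at least one band of the signature.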
ADR-003 · Accepted

Hybrid tokenization: tiktoken default + pedagogical custom BPE

Context: Production-grade dataset output AND learner-readable BPE algorithm
Decision: tiktoken cl100k_base for production · tokenizers/bpe.py from-scratch for the merge-loop walkthrough
Tradeoff: Two ways to do the same thing vs both production speed AND pedagogical depth
Reversal: Drop custom BPE if pedagogical content moves elsewhere; ~50 lines deleted
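For readers who have not seen a BPE merge loop, the following sketch shows the core of the algorithm that a from-scratch implementation walks through. It is an independent illustration, not the contents of tokenizers/bpe.py. The function names and the tie-breaking rule are assumptions, and it works on characters of whitespace-split words rather than bytes.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace each left-to-right occurrence of `pair` with the concatenated symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules, most frequent adjacent pair first."""
    # Each distinct word starts as a tuple of characters, weighted by its count.
    vocab = Counter(tuple(word) for line in corpus for word in line.split())
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Ties are broken by the pair itself so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode_word(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply the learned merges in training order to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, training on `["abab abab ab"]` first merges `a` + `b` (the most frequent adjacent pair) and then `ab` + `ab`, after which no pairs remain and the loop stops early.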
ADR-004 · Accepted

Ray + Airflow for orchestration over Spark / Dask / single-node cron

Context: GPU-first fan-out (embedding) + DAG scheduling + tutorial reproducibility
Decision: Ray @ray.remote(num_gpus=0.5) · Airflow DAGs in dags/{ingest,eval}_dag.py
Tradeoff: Two systems to operate vs first-class GPU placement + industry-default scheduler
Reversal: Dask swap ~3 days; full Spark migration ~2 weeks; both reversible
ADR-005 · Deprecated

Pinecone-only vector backend

Context: v1 design used the Pinecone managed API, the fastest path to a working RAG demo
Decision: Reverted in M02: added RetrievalBackend Protocol with both pgvector and Pinecone clients
Why reversed: Tutorial reproducibility (no vendor account) + Module 09 vendor-lock concerns + cost defense at <100K and >10M scale
Replaced by: rag/pgvector_client.py + rag/pinecone_client.py (same Protocol, default Tier-1 = pgvector)
EXPERT-only · cost model

Read the FinOps story for the pipeline you actually ship.

Module 05 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. Single team @ 1M docs/quarter, real AWS RDS + EC2 + GPU + OpenAI list prices, with the spot-batch + 1-yr Reserved Instance + LLM-judge sampling levers wired up. Preview the CSV →

| Component | Details | Baseline / mo | Optimized / mo | Delta |
|---|---|---|---|---|
| Ray cluster (crawl + dedup + embed batch) | EC2 c7g.xlarge × 4 · interruptible (spot for batch) | $220 | $66 | −$154 |
| Postgres + pgvector (RDS) | db.t4g.medium · 100GB gp3 · 1M × 1536-dim | $98 | $68 | −$30 |
| OpenAI embeddings · one-shot at ingest | 1M docs · text-embedding-3-small · hash-incremental amortizes | $20 | $1 | −$19 |
| OpenAI / Anthropic LLM-judge | ~5M tok/mo · CI eval gate + ensemble judge | $15 | $5 | −$10 |
| vLLM serving · 2 GPU replicas | EC2 g5.xlarge × 2 · 24/7 · partial commit | $520 | $364 | −$156 |
| Airflow + S3 + observability | scheduler + worker + Parquet output + Grafana free tier | $53 | $39 | −$14 |
| Total · single team @ 1M docs/quarter | ~$0.93 per 1k docs at baseline | $926 | $543 | −$383 (−41%) |
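The totals in the table can be re-derived with a few lines of arithmetic. The sketch below hard-codes the six rows from this page rather than reading the CSV in docs/cost-model/, whose column names are not shown here. The `ROWS` list and `summarize` function are names invented for this example.

```python
# (component, baseline $/mo, optimized $/mo), copied from the table on this page
ROWS = [
    ("Ray cluster (crawl + dedup + embed batch)", 220, 66),
    ("Postgres + pgvector (RDS)", 98, 68),
    ("OpenAI embeddings", 20, 1),
    ("OpenAI / Anthropic LLM-judge", 15, 5),
    ("vLLM serving (2 GPU replicas)", 520, 364),
    ("Airflow + S3 + observability", 53, 39),
]


def summarize(rows):
    """Sum both columns and express the saving as a whole-number percentage."""
    baseline = sum(b for _, b, _ in rows)
    optimized = sum(o for _, _, o in rows)
    saving = baseline - optimized
    return {
        "baseline": baseline,
        "optimized": optimized,
        "saving": saving,
        "saving_pct": round(100 * saving / baseline),
    }


if __name__ == "__main__":
    print(summarize(ROWS))
```

The rows sum to $926 and $543, a $383 saving or about 41%. Dividing the $926 monthly baseline by 1,000 (1M docs expressed in thousands) gives the ~$0.93 per 1k docs figure quoted in the total row.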

Optimization levers

  • Spot instances for batch jobs (−$154 / mo): Crawl + dedup + tokenize + embed are interruptible. Ray fault tolerance + the resume-on-failure jobs table absorb spot reclaims. ~70% spot capacity at avg 70% discount.
  • 1-yr Reserved Instances (−$200 / mo): Commit to 12-month reserved capacity on RDS + Airflow + 1 GPU replica running 24/7. Standard ~30% off list.
  • LLM-judge sampling (−$10 / mo): Sample 5% of docs randomly per CI run instead of judging every doc. Eval signal still statistically valid for 100-question suite.
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your dedup-precision benchmark. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a dedup benchmark.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira K.
Ex-staff · dataset platform · AI lab
Web-scale crawling, MinHash tuning, multi-tenant dataset versioning, dedup at 100M+
Send the diff. I'll go line-by-line through your TokenBucket and your LSH thresholds and pick out the false-positive cluster.
Daniel T.
Principal · LLM training data · public AI company
Training-data quality, RAGAS calibration, judge ensembles, regression gating
Send your worst eval report. We'll walk it backwards from the failed CI gate to the embedding-version mismatch upstream.
Anya S.
Eng manager · ML platform · public Series-D
Org design for dataset teams, hiring rubrics, staff-MLE interview prep, scope-of-work for dataset-engineering roles
If you're prepping for staff-MLE promo, send your ADR draft. We'll work backwards from the rubric.
Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + arch review
Request a slot
What your tier unlocks

PRO unlocks Modules 01-02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

| What you get | FREE | PRO | EXPERT |
|---|---|---|---|
| Modules 01-02 of P16: Crawl + Dedup + Quality + Tokenization + RAG (~12h) | - | Included | Included |
| Modules 03-05 of P16: Serve + Eval + Production LLMOps (~12h) | - | - | Included |
| 5 committed ADRs + cost-model CSV: starter kit docs/adr/ + docs/cost-model/ | - | - | Included |
| PRO project catalog: production-grade builds | 2 | All current | All current + this one |
| Curriculum: all 7 tracks | Phase 1 only | All | All + bonus modules |
| Code review: Senior+ reviewers | - | 4 / month | Unlimited |
| Cohort-beta architecture review: async · 7-day turnaround · 12-member cap | - | - | Included |
| Certificate: verifiable on LinkedIn | - | Yes | Yes + LinkedIn rec |
$79/mo · billed monthly · open enrollment · cancel anytime
or annual: $699/yr (save 26%)
Unlock EXPERT
Who this is for

Pick this if you own the dedup dashboard, not just a notebook.

Senior data engineers

You've shipped batch pipelines. Now you own the dedup precision dashboard, the embedding-model migration plan, and the architecture review with platform.

AI infra engineers

You absorb new dataset shapes without absorbing new vendors. aiohttp + datasketch + pgvector + Airflow + Ray — tools your platform team already operates.

Engineering managers · ML platform

You need a reference architecture for the dataset-quality + cost questions your CTO will ask before the AI team gets headcount or a GPU budget.

Founding engineers · AI startups

Your investors will ask about dataset quality + unit economics before they ask about scale. The 5 ADRs + cost model is the answer.

FAQ · EXPERT tier

Quick answers.

Modules 01-02 (crawl + dedup + quality + tokenization + RAG) are included with PRO at $29/mo — you build a working RAG system from raw web data, the same pattern behind ChatGPT-style applications. The rest of the platform — Modules 03-05 (vLLM serving + RAGAS evaluation + Airflow + Ray production LLMOps), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you a Q&A system; EXPERT gets you the platform you'd defend in an architecture review.
ADR-001 lays out the full tradeoff. Short version: aiohttp wins on tutorial reproducibility (no Scrapy mental model overhead) and tight async control over the request loop. Scrapy is the right answer for teams that already operate it; for the tutorial path aiohttp + TokenBucket + a 4-extractor fallback chain (BeautifulSoup, readability-lxml, trafilatura, pypdf) gets you to a working crawl in 30 lines. Reversal to Scrapy is ~1 engineer-week behind the BaseCrawler.fetch interface.
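The fallback chain described in this answer can be sketched generically: try each extractor in order and keep the first result that looks usable. The function name `extract_with_fallback`, the `min_chars` acceptance rule, and the (name, callable) pairing are assumptions of this example. The real chain would plug in wrappers around BeautifulSoup, readability-lxml, trafilatura, and pypdf, none of which are called here.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes raw bytes and returns text, or None if it cannot parse them.
Extractor = Callable[[bytes], Optional[str]]


def extract_with_fallback(
    payload: bytes,
    extractors: Iterable[Tuple[str, Extractor]],
    min_chars: int = 200,  # assumed threshold for "usable" output
) -> Tuple[Optional[str], Optional[str]]:
    """Return (extractor_name, text) from the first extractor that yields enough text."""
    for name, extractor in extractors:
        try:
            text = extractor(payload)
        except Exception:
            # A failing extractor should not abort the chain; move to the next one.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None, None
```

Returning the winning extractor's name alongside the text makes it easy to log which stage of the chain handled each document, which helps when tuning the order.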
ADR-002 has the full algorithm. At 100K-doc scale (the tutorial validation set), datasketch's MinHashLSH index hits 95% precision on a labeled near-dupe fixture with shingle_size=5, num_perm=128, threshold=0.85. At 1M+ scale the LSH index doesn't fit in 16 GB RAM — that's where Tier-2's distributed_dedup.py via pyspark comes in, sharding the LSH across workers. The data/processed/dedup_input.jsonl fixture (10.3 MB, engineered exact + near-dupe clusters) ships in the zip so you can reproduce the precision number.
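Reproducing a precision number against a labeled fixture comes down to comparing predicted duplicate pairs with the pairs implied by the labels. The sketch below assumes the labels can be loaded as a mapping from cluster id to document ids. That shape, and the function names, are hypothetical, since the schema of dedup_input.jsonl is not described on this page.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Pair = FrozenSet[str]


def pairs_from_clusters(clusters: Dict[str, List[str]]) -> Set[Pair]:
    """Every unordered pair of documents inside the same labeled cluster."""
    return {
        frozenset(pair)
        for ids in clusters.values()
        for pair in combinations(sorted(ids), 2)
    }


def precision_recall(
    predicted: Iterable[Tuple[str, str]], truth: Set[Pair]
) -> Tuple[float, float]:
    """Precision = TP / predicted pairs; recall = TP / labeled pairs."""
    predicted_pairs = {frozenset(p) for p in predicted}
    true_pos = len(predicted_pairs & truth)
    precision = true_pos / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall
```

Precision alone can be inflated by raising the threshold until only exact duplicates match, so it is worth tracking recall on the same fixture when tuning.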
vLLM in Module 03 needs a GPU (g5.xlarge minimum on AWS). Modules 01-02 run CPU-only — and the Tier-1 path uses an offline mock embedder so the first-run RAG demo doesn't even download sentence-transformers. The cost-model CSV documents both regimes.
24 hours of focused work across 5 modules (9 sub-parts). Most learners spread it across 6-8 weeks alongside a day job. Modules 01-02 alone (~12 hours) get you a working RAG system from raw web data — included with PRO at $29/mo.
It's a strong forcing function. Staff dataset interviews lean heavily on system design (crawl politeness, dedup precision/recall, tokenization tradeoffs, retrieval-backend choice, orchestration) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated documenting the Pinecone-only → dual-backend reversal, with two parallel client files as receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have a portfolio.

Ready to ship a real training-data pipeline?

Start with PRO ($29/mo) for Modules 01-02 — crawl + dedup + quality + tokenization + RAG. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

P16 · LLM training-data pipeline · EXPERT · PRO unlocks M01-M02 · Unlock EXPERT →