ai-de.net/Projects/P16 · LLM training-data pipeline — crawl + dedup + RAG + LLMOps

Last updated 2026-05-22 · By AI-DE Engineering Team

EXPERT-tier · PRO unlocks Modules 01-02 · AI & vectors track · P16

Build an LLM training-data pipeline with crawl + dedup + RAG

Ship a real training-data platform: aiohttp crawler with rate-limit + extractor fallback chain, MinHash + LSH dedup at 95% precision, multi-signal quality scoring, hybrid tokenization (tiktoken + custom BPE), pgvector or Pinecone retrieval (Protocol), vLLM batch serving, RAGAS + ensemble-judge eval with CI regression gate, Airflow + Ray orchestration, and a per-tenant cost tracker. Modules 01-02 unlock with PRO; the full platform with EXPERT.

Timeline

24 hours

Difficulty

Senior+

Stack

aiohttp · MinHash + LSH · tiktoken · Ray · vLLM · Airflow

See EXPERT benefits

The dataset-engineering portfolio piece for staff AI infra roles — 5 committed ADRs, a runnable cost-model CSV, a dual-backend retrieval Protocol, an Airflow + Ray orchestration design, and a CI eval gate that blocks merges on regression.

By the end you will have wired

aiohttp crawler with TokenBucket rate limiting + 4-extractor fallback chain (1M-doc-capable)
MinHash + LSH dedup at 95% precision (datasketch tuned, validated against labeled fixture)
Hybrid tokenization with tiktoken default + pedagogical custom BPE + sequence packing
Dual-backend RAG (pgvector or Pinecone via Protocol) with chunking + MRR/NDCG eval harness
vLLM serving + Locust load test + RAGAS + ensemble judge (multi-doc synthesis)
Airflow DAGs (ingest + eval) + Ray fan-out + GitHub Actions CI gate (10% regression cap)
5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV

PREREQ · SENIOR+ · Built for engineers shipping production data pipelines / RAG. Comfortable with Python (async, typing), pandas / large files, and ML fundamentals (embeddings, tokenization). Distributed-compute experience helpful but not required. Not a “what is RAG” course.

llm-ingestion.platform · 5 modules · 9 sub-parts · 1M-doc-capable · CI eval gate · dual-backend RAG

ingest DAG + eval DAG armed

Crawl + clean

Dedup + score

Tokenize + RAG

Serve + eval + ops

aiohttp + TokenBucket: 1M-doc-capable batch · politeness rate limit

extractors fallback: BS4 · readability · trafilatura · pypdf

raw_corpus: 120 fixtures · 5 topic domains

robots.txt aware: URL dedup · retry policy

aiohttp + extractors — see ADR-001

MinHash + LSH (datasketch): shingle k=5 · num_perm=128 · 95% precision

DocumentScorer: length · language · regex · toxicity

10.3 MB dedup_input fixture: engineered exact + near-dupe clusters

optional pyspark: Tier-2 distributed dedup

Dedup at scale — see ADR-002

tiktoken (default): cl100k_base · production encoding

custom BPE (pedagogical): tokenizers/bpe.py · merge loop visible

PackingStrategy: first-fit · max_len 4096

RAG · pgvector OR Pinecone: Protocol · MRR/NDCG eval

Versioned dataset — see ADR-003 + ADR-005

vLLM + AWQ/GPTQ: FastAPI async · KV cache · prefix cache

RAGAS + ensemble judge: single + multi-doc synthesis · failure_analysis

Airflow DAGs + Ray: ingest + eval · @ray.remote(num_gpus=0.5)

CI eval gate: .github/workflows/eval.yml · 10% regression cap

Production LLMOps — see ADR-004

# Recall + dedup — 95% precision · MRR/NDCG measured

1k+ docs crawled async with TokenBucket rate limiting

MinHash + LSH @ 95% precision on labeled validation set

Ensemble judge (single + multi-doc synthesis) catches semantic dupes

→ data/eval/{eval_questions,golden_examples,llm_judge_examples}.jsonl bundled in zip

# Cost — $926 → $543 / mo at 1M docs/quarter (−41%)

Spot instances for crawl + dedup + embed batch (Ray fault-tolerant)

1-yr Reserved Instances on persistent serving + RDS + Airflow

LLM-judge sampling: 5% of docs vs 100% on every CI run

→ ADR-005 documents the Pinecone-only → dual-backend reversal

95% precision

MinHash dedup · 100K validated

5 ADRs

committed in starter kit

P99 < 200ms

vLLM serving · Locust validated

Curriculum · 5 modules · 24 hours · 9 sub-parts

Modules 01-02 unlock with PRO. Modules 03-05 with EXPERT.

Modules 01-02 (~12h) ship a working RAG system from raw web data — the same pattern behind ChatGPT-style applications. Included with PRO. Modules 03-05 (~12h additional) layer on the production serving + evaluation + LLMOps story and unlock with EXPERT.

P16 · 5 modules · 24 hours · 9 sub-parts

Free preview EXPERT required

M01

⊘Data Foundation — Crawl + Dedup + Quality

aiohttp crawler with TokenBucket rate limiting and 4-extractor fallback chain (part 1). MinHash + LSH dedup via datasketch at 95% precision (part 2). Multi-signal quality scoring with rejection reasons (part 3). 1M-doc-capable batch; 10.3 MB dedup fixture + 5.6 MB quality fixture bundled.

6h · 28 lessons · PRO TIER

Unlock with PRO →
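For readers unfamiliar with the rate limiter Module 01 names, here is a minimal token-bucket sketch in plain asyncio. It is not the starter kit's `TokenBucket`; the constructor arguments, the injectable `clock`, and the method names are assumptions chosen for illustration.

```python
import asyncio
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so tests can fake time (assumption)
        self._tokens = float(capacity)   # start full: an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    async def acquire(self) -> None:
        """Wait until a token is available, then take it."""
        while not self.try_acquire():
            # Sleep about as long as the missing fraction of a token takes to refill.
            await asyncio.sleep((1.0 - self._tokens) / self.rate)
```

In a crawler, each worker would `await bucket.acquire()` before issuing its HTTP request, typically with one bucket per host so politeness is enforced per domain.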

M02

⊘Dataset → RAG System — Tokenization + RAG

Hybrid tokenization: tiktoken (default production path) + custom from-scratch BPE (pedagogical merge loop) + PackingStrategy (part 4). Dual-backend retrieval via Protocol — pgvector OR Pinecone — with chunking strategies, MRR/NDCG eval harness on 50 golden queries (part 5). Output as HuggingFace-compatible Parquet shards.

6h · 30 lessons · PRO TIER

Unlock with PRO →
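The sequence-packing step in Module 02 can be pictured as first-fit bin packing over token counts. The sketch below is a generic first-fit routine under that assumption; `pack_first_fit` is a hypothetical name, and the project's `PackingStrategy` may handle ordering, truncation, and separators differently.

```python
from typing import List


def pack_first_fit(lengths: List[int], max_len: int = 4096) -> List[List[int]]:
    """Return bins of sequence indices; each bin's total token count is <= max_len."""
    bins: List[List[int]] = []
    remaining: List[int] = []            # free space left in each bin
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sequence {idx} has {n} tokens, exceeds max_len={max_len}")
        for b, free in enumerate(remaining):
            if n <= free:                # the first bin with room wins
                bins[b].append(idx)
                remaining[b] -= n
                break
        else:                            # no existing bin fits: open a new one
            bins.append([idx])
            remaining.append(max_len - n)
    return bins


print(pack_first_fit([3000, 2000, 1000, 2000, 96]))  # [[0, 2, 4], [1, 3]]
```

Five sequences fit in two 4096-token rows instead of five, which is the padding saving that packing is meant to buy.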

M03

⊘Model & Serving — Integration + vLLM + Locust

Local model loading via transformers + bitsandbytes 4-bit quantization, BatchInferenceEngine with OOM recovery + Parquet checkpointing (part 6). vLLM with AWQ/GPTQ quantization, FastAPI async serving layer, KV cache + prefix caching, P50/P95/P99 latency benchmarking via Locust at 50 concurrent users (part 7). <200ms P99 target.

6h · 24 lessons · EXPERT TIER

Unlock with EXPERT →

M04

⊘Evaluation & Feedback — RAGAS + ensemble judge + CI gate

Build an evaluation dataset with single-doc + multi-doc synthesis questions. RAGAS scaffolding for context relevance / faithfulness / answer relevance. Ensemble judge with 3D scoring (factual accuracy, relevance, completeness). Failure analysis with error categorization. GitHub Actions CI gate that blocks merge on >10% regression with PR comment posting.

3h · 14 lessons · EXPERT TIER

Unlock with EXPERT →
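The ">10% regression" rule in Module 04 comes down to a relative-drop check per metric. The sketch below assumes higher-is-better scores; `find_regressions` and the sample scores are illustrative, not taken from the project's eval script.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_relative_drop: float = 0.10,
) -> List[str]:
    """Return one message per metric that dropped by more than max_relative_drop."""
    failures = []
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None:
            failures.append(f"{name}: missing from candidate run")
            continue
        if base <= 0:
            continue                     # relative drop is undefined; skip this metric
        drop = (base - cand) / base
        if drop > max_relative_drop:
            failures.append(f"{name}: {base:.3f} -> {cand:.3f} ({drop:.1%} drop)")
    return failures


# Hypothetical scores for illustration only.
baseline = {"faithfulness": 0.80, "answer_relevance": 0.90}
candidate = {"faithfulness": 0.70, "answer_relevance": 0.89}
failures = find_regressions(baseline, candidate)
```

A CI job would exit non-zero when `failures` is non-empty and could post the messages as the PR comment.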

M05

⊘Production LLMOps — Airflow + Ray + cost tracker + runbook

Airflow ingest DAG (weekly: crawl → dedup → quality → embed → vector store) + eval DAG (daily: eval suite → baseline compare → alert on regression). Ray fan-out via @ray.remote(num_gpus=0.5) for distributed embedding generation. Prometheus metrics + cost tracking + budget alerts. Architecture document + 6-scenario runbook + capstone artifact.

3h · 14 lessons · EXPERT TIER

Unlock with EXPERT →

Modules 01-02 with PRO ($29/mo) · Modules 03-05 with EXPERT ($79/mo)

See plans →

Backed by curriculum

Data Curation & Dataset Engineering

13 modules · ~38 hours · MinHash + LSH · tiktoken + BPE · dual-backend RAG · Airflow + Ray

Open curriculum

The Dataset Engineering curriculum is the foundation for this project, not a sales add-on. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Crawl. Train-data. Operate.

Each phase ends with a tagged release, a passing eval suite, and a runbook drill. No ambiguity about where you are.

01 · ~12h

Foundation (Modules 01-02)

Working RAG system from raw web data. aiohttp crawler · MinHash + LSH dedup · quality scoring · tokenization · pgvector or Pinecone.

✓ 1M-doc-capable async crawler + 4-extractor chain
✓ MinHash + LSH dedup @ 95% precision (validated)
✓ Dual-backend RAG with MRR/NDCG eval harness
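The two retrieval metrics named above are small enough to write out. The sketch below uses binary relevance (a document is either relevant or not) and standard-library math only. The function names are illustrative and are not the project's `evaluate_retrieval` API.

```python
import math
from typing import List, Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], relevants: List[Set[str]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in zip(runs, relevants)) / len(runs)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

MRR rewards putting the first correct chunk near the top. NDCG also credits every additional relevant chunk, discounted by its position.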

02 · ~9h

Production (Modules 03-04)

vLLM batch serving with Locust load test, RAGAS + ensemble judge eval, GitHub Actions CI gate that blocks merge on regression.

✓ vLLM service · AWQ quantization · P99 < 200ms
✓ Ensemble judge · multi-doc synthesis · failure analysis
✓ CI eval gate · 10% regression cap · PR comment posting

03 · ~3h

Operate (Module 05)

Airflow + Ray orchestration, Prometheus cost tracking, runbook + architecture document for staff capstone.

✓ Airflow ingest DAG + eval DAG
✓ Ray fan-out · @ray.remote(num_gpus=0.5)
✓ Cost tracker + 6-scenario runbook

Project setup · 20 minutes

Two-tier requirements. Local aiohttp + Postgres + Ray + Airflow + vLLM (GPU optional).

What lives in the repo

You get the real platform on day one — aiohttp + extractors for the crawler, MinHash via datasketch for dedup, tiktoken + custom BPE for tokenization, pgvector OR Pinecone via a retrieval Protocol, vLLM for GPU serving, Airflow + Ray for orchestration, and Prometheus + Locust for observability + load testing.

base_crawler.py + crawl/ + extractors.py — aiohttp async crawler + TokenBucket rate limiter + 4-extractor fallback
minhash.py + dedup/ + distributed_dedup.py + pipeline/dedup.py — MinHash + LSH (datasketch) + optional pyspark Tier-2 scaling
tokenizers/bpe.py + tokenize/ + packing.py + augmentation/ — pedagogical custom BPE + tiktoken production path + sequence packing
rag/ (pgvector + Pinecone clients via Protocol) + retrieval/ — dual-backend RAG with chunking + evaluate_retrieval (MRR + NDCG)
serve/ + serving/ + inference/ + models/ — vLLM launchers + FastAPI async + BatchInferenceEngine
evaluation/ + eval/ + bench/locustfile.py — RAGAS + ensemble judge + Locust 50-user P50/P95/P99 load test
dags/ + .github/workflows/eval.yml + app/metrics.py — Airflow ingest + eval DAGs · CI gate (10% regression cap) · Prometheus
docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
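The "pgvector OR Pinecone via a retrieval Protocol" idea can be sketched with `typing.Protocol`: pipeline code depends on a method shape, and any backend with that shape plugs in. The method names below and the in-memory backend are assumptions for illustration; the clients in `rag/` may expose a different interface.

```python
from typing import Dict, List, Protocol, Sequence, Tuple


class Retriever(Protocol):
    """Any object with these two methods can serve as a retrieval backend."""

    def upsert(self, doc_id: str, vector: Sequence[float]) -> None: ...

    def search(self, query: Sequence[float], top_k: int) -> List[Tuple[str, float]]: ...


class InMemoryRetriever:
    """Toy backend (cosine similarity over a dict) standing in for pgvector or Pinecone."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, vector: Sequence[float]) -> None:
        self._vectors[doc_id] = list(vector)

    def search(self, query: Sequence[float], top_k: int) -> List[Tuple[str, float]]:
        def cosine(a: Sequence[float], b: Sequence[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
            return dot / norm if norm else 0.0

        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self._vectors.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def top_ids(backend: Retriever, query: Sequence[float], top_k: int = 3) -> List[str]:
    """Pipeline code sees only the Protocol, never a concrete client."""
    return [doc_id for doc_id, _ in backend.search(query, top_k)]


backend = InMemoryRetriever()
backend.upsert("a", [1.0, 0.0])
backend.upsert("b", [0.0, 1.0])
backend.upsert("c", [0.9, 0.1])
print(top_ids(backend, [1.0, 0.0], top_k=2))  # ['a', 'c']
```

Because the Protocol is structural, swapping backends needs no shared base class, which keeps a backend reversal cheap.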

Download · Starter Kit · 219 files · 1.1 MB compressed

LLM Training-Data Pipeline Starter Kit

Pre-built dataset-engineering platform: 9 sub-tutorials of source, Docker compose (pgvector + Redis), 120 raw_corpus fixtures, 10.3 MB dedup_input + 5.6 MB quality_input, 100 golden eval Q's, Tier-1 and Tier-2 requirements, Airflow DAGs, GitHub Actions CI gate. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read straight from the repo.

EXPERT project · 219 files · ADRs + cost model bundled · last updated 2026-05-09

~/projects/llm-ingestion-pipeline — zsh

1. Unzip and start the local stack

$ unzip llm-ingestion-pipeline-starter.zip

$ cd llm-ingestion-pipeline && cp .env.example .env

$ make up # pgvector + Redis + Airflow standalone

2. Run the ingest pipeline (Tier-1 path · CPU only)

$ pip install -r requirements-tier1.txt

$ python pipeline.py --seeds data/raw_corpus --tier 1

3. Run the eval suite + CI gate

$ python eval/run_comparison.py --baseline main --candidate HEAD

$ # Same script GitHub Actions calls in eval.yml

4. Locust load test the vLLM serving layer

$ bash bench/run_loadtest.sh # 50 users · P50/P95/P99 + CSV report

120

raw_corpus fixtures

10.3 MB

dedup_input · labeled clusters

5.6 MB

quality_input · seeded issues

100

eval golden Q's (90 + 10 hand-curated)

Production hardening

The same dataset demo — but built for the production case.

Most LLM training-data tutorials show you a notebook scraping a few URLs into a single file. This shows what changes when you crawl 1M pages, on-call owns the dedup precision dashboard, and finance asks for cost-per-1M-documents.

Notebook dataset · What most teams ship

Crawl

`requests.get` in a loop, no politeness

Dedup

`set()` of hashes — exact only

Quality

"skip short docs"

Tokenization

tiktoken-only, no packing

Retrieval

sentence-transformers + cosine in Python

Serving + Ops

`for q in queries: ...` in a notebook

Your production pipeline · Modules 01–05

✓ Crawl

aiohttp + TokenBucket + 4 extractors + retry policy

✓ Dedup

MinHash + LSH @ 95% precision (datasketch tuned, validated)

✓ Quality

Multi-signal scorer (length + language + regex + toxicity) with rejection reasons

✓ Tokenization

tiktoken (default) + custom BPE for vocab study + greedy packing

Module 05 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a single team at 1M docs/quarter using real AWS RDS, EC2, and GPU list prices plus OpenAI list prices, with the spot-batch, 1-yr Reserved Instance, and LLM-judge sampling levers wired up. Preview the CSV →

| Component | Baseline / mo | Optimized / mo | Delta |
|---|---|---|---|
| Ray cluster (crawl + dedup + embed batch): EC2 c7g.xlarge × 4 · interruptible (spot for batch) | $220 | $66 | −$154 |
| Postgres + pgvector (RDS): db.t4g.medium · 100GB gp3 · 1M × 1536-dim | $98 | $68 | −$30 |
| OpenAI embeddings · one-shot at ingest: 1M docs · text-embedding-3-small · hash-incremental amortizes | $20 | $1 | −$19 |
| OpenAI / Anthropic LLM-judge: ~5M tok/mo · CI eval gate + ensemble judge | $15 | $5 | −$10 |
| vLLM serving · 2 GPU replicas: EC2 g5.xlarge × 2 · 24/7 · partial commit | $520 | $364 | −$156 |
| Airflow + S3 + observability: scheduler + worker + Parquet output + Grafana free tier | $53 | $39 | −$14 |
| Total · single team @ 1M docs/quarter: ~$0.93 per 1k docs at baseline | $926 | $543 | −$383 (−41%) |
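The totals above can be re-derived from the per-component rows. This is only a check on the arithmetic, with the baseline and delta figures copied from the table; it does not reproduce the cost-model CSV, whose columns are not shown here.

```python
# (baseline $/mo, delta $/mo) per component, copied from the table above
rows = {
    "Ray cluster": (220, -154),
    "Postgres + pgvector (RDS)": (98, -30),
    "OpenAI embeddings": (20, -19),
    "LLM-judge": (15, -10),
    "vLLM serving": (520, -156),
    "Airflow + S3 + observability": (53, -14),
}

baseline = sum(base for base, _ in rows.values())
delta = sum(d for _, d in rows.values())
optimized = baseline + delta
saving_pct = round(-delta / baseline * 100)

print(baseline, optimized, delta, f"{saving_pct}%")  # 926 543 -383 41%
```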

Optimization levers

Spot instances for batch jobs

Crawl + dedup + tokenize + embed are interruptible. Ray fault tolerance + the resume-on-failure jobs table absorb spot reclaims. ~70% spot capacity at avg 70% discount.

−$154 / mo

1-yr Reserved Instances

Commit to 12-month reserved capacity on RDS + Airflow + 1 GPU replica running 24/7. Standard ~30% off list.

−$200 / mo

LLM-judge sampling

Sample 5% of docs randomly per CI run instead of judging every doc. The eval signal remains statistically valid for the 100-question suite.

−$10 / mo

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your dedup-precision benchmark. A staff or principal-level reviewer who has shipped this exact stack at scale responds within 7 days with line-by-line comments + a Loom walkthrough. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a dedup benchmark.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira K.

Ex-staff · dataset platform · AI lab

Web-scale crawling, MinHash tuning, multi-tenant dataset versioning, dedup at 100M+

“Send the diff. I'll go line-by-line through your TokenBucket and your LSH thresholds and pick out the false-positive cluster.”

Daniel T.

Principal · LLM training data · public AI company

Training-data quality, RAGAS calibration, judge ensembles, regression gating

“Send your worst eval report. We'll walk it backwards from the failed CI gate to the embedding-version mismatch upstream.”

Anya S.

Eng manager · ML platform · public Series-D

Org design for dataset teams, hiring rubrics, staff-MLE interview prep, scope-of-work for dataset-engineering roles

“If you're prepping for staff-MLE promo, send your ADR draft. We'll work backwards from the rubric.”

Format

Async

Turnaround

7 days

Cohort

12 members

Scope

ADR + arch review

Request a slot →

What your tier unlocks

PRO unlocks Modules 01-02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-02 plus the rest of the PRO catalog. EXPERT unlocks Modules 03-05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get · FREE · PRO · EXPERT

Modules 01-02 of P16

Crawl + Dedup + Quality + Tokenization + RAG (~12h)

—

Included

Modules 03-05 of P16

Serve + Eval + Production LLMOps (~12h)

—

Included

5 committed ADRs + cost-model CSV

Starter kit docs/adr/ + docs/cost-model/

—

Included

PRO project catalog

Production-grade builds

All current

All current + this one

Curriculum

All 7 tracks

Phase 1 only

All

All + bonus modules

Code review

Senior+ reviewers

—

4 / month

Unlimited

Cohort-beta architecture review

Async · 7-day turnaround · 12-member cap

—

Included

Certificate

Verifiable on LinkedIn

—

Yes

Yes + LinkedIn rec

$79/mo

billed monthly · open enrollment · cancel anytime

or annual

$699/yr save 26%

Unlock EXPERT →

Who this is for

Pick this if you own the dedup dashboard, not just a notebook.

Senior data engineers

You've shipped batch pipelines. Now you own the dedup precision dashboard, the embedding-model migration plan, and the architecture review with platform.

AI infra engineers

You absorb new dataset shapes without absorbing new vendors. aiohttp + datasketch + pgvector + Airflow + Ray — tools your platform team already operates.

Engineering managers · ML platform

You need a reference architecture for the dataset-quality + cost questions your CTO will ask before the AI team gets headcount or a GPU budget.

Founding engineers · AI startups

Your investors will ask about dataset quality + unit economics before they ask about scale. The 5 ADRs + cost model is the answer.

Related curriculum

Going deeper? Four tracks back this project.

The Dataset Engineering curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.

FAQ · EXPERT tier

Quick answers.

How is this different from PRO?

Modules 01-02 (crawl + dedup + quality + tokenization + RAG) are included with PRO at $29/mo — you build a working RAG system from raw web data, the same pattern behind ChatGPT-style applications. The rest of the platform — Modules 03-05 (vLLM serving + RAGAS evaluation + Airflow + Ray production LLMOps), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you a Q&A system; EXPERT gets you the platform you'd defend in an architecture review.

Why aiohttp over Scrapy?

ADR-001 lays out the full tradeoff. Short version: aiohttp wins on tutorial reproducibility (no Scrapy mental model overhead) and tight async control over the request loop. Scrapy is the right answer for teams that already operate it; for the tutorial path aiohttp + TokenBucket + a 4-extractor fallback chain (BeautifulSoup, readability-lxml, trafilatura, pypdf) gets you to a working crawl in 30 lines. Reversal to Scrapy is ~1 engineer-week behind the BaseCrawler.fetch interface.
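The fallback-chain idea is simple to sketch: try each extractor in order and keep the first result that looks usable. The code below is a library-free illustration. `extract_with_fallback`, the `min_chars` cutoff, and the two stand-in extractors are assumptions; the real chain would wrap the four libraries named above.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[bytes], Optional[str]]


def extract_with_fallback(
    raw: bytes,
    extractors: List[Tuple[str, Extractor]],
    min_chars: int = 200,
) -> Tuple[Optional[str], Optional[str]]:
    """Try extractors in order; return (text, extractor_name) for the first usable result."""
    for name, extract in extractors:
        try:
            text = extract(raw)
        except Exception:
            continue                     # a crashing parser just means "try the next one"
        if text and len(text.strip()) >= min_chars:
            return text.strip(), name
    return None, None


# Hypothetical stand-ins for illustration only.
def strict_parser(raw: bytes) -> Optional[str]:
    raise ValueError("malformed markup")


def plain_text(raw: bytes) -> Optional[str]:
    return raw.decode("utf-8", errors="replace")


chain = [("strict_parser", strict_parser), ("plain_text", plain_text)]
text, used = extract_with_fallback(b"word " * 100, chain)
print(used)  # plain_text
```

Returning the extractor name alongside the text makes it easy to track which parser handled each document.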

How does MinHash + LSH actually work at 1M scale?

ADR-002 has the full algorithm. At 100K-doc scale (the tutorial validation set), datasketch's MinHashLSH index hits 95% precision on a labeled near-dupe fixture with shingle_size=5, num_perm=128, threshold=0.85. At 1M+ scale the LSH index doesn't fit in 16 GB RAM — that's where Tier-2's distributed_dedup.py via pyspark comes in, sharding the LSH across workers. The data/processed/dedup_input.jsonl fixture (10.3 MB, engineered exact + near-dupe clusters) ships in the zip so you can reproduce the precision number.
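To make the mechanics concrete, here is a from-scratch MinHash with LSH banding, using only the standard library. It is not datasketch and not the project's implementation. The word-level shingles, the blake2b hashing, and the 16-band split are assumptions; only k=5 and num_perm=128 come from the text above.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

NUM_PERM = 128   # signature length, as quoted above
SHINGLE_K = 5    # words per shingle (word-level shingling is an assumption)
BANDS = 16       # illustrative: 16 bands x 8 rows = 128


def shingles(text: str, k: int = SHINGLE_K) -> Set[str]:
    """Overlapping k-word windows; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text: str, num_perm: int = NUM_PERM) -> List[int]:
    """One minimum per seeded hash function; similar shingle sets share many minima."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def lsh_candidates(sigs: Dict[str, List[int]], bands: int = BANDS) -> Set[Tuple[str, ...]]:
    """Pairs that collide in at least one band; only these need a full comparison."""
    buckets = defaultdict(list)
    for doc_id, sig in sigs.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Banding is what avoids comparing every pair: documents are compared only if some band of their signatures matches exactly, and the band/row split sets where the similarity cutoff effectively falls.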

Do I need a GPU?

vLLM in Module 03 needs a GPU (g5.xlarge minimum on AWS). Modules 01-02 run CPU-only — and the Tier-1 path uses an offline mock embedder so the first-run RAG demo doesn't even download sentence-transformers. The cost-model CSV documents both regimes.

How long until I can finish this project?

24 hours of focused work across 5 modules (9 sub-parts). Most learners spread it across 6-8 weeks alongside a day job. Modules 01-02 alone (~12 hours) get you a working RAG system from raw web data — included with PRO at $29/mo.

Is this enough to interview for staff dataset / LLM-platform roles?

It's a strong forcing function. Staff dataset interviews lean heavily on system design (crawl politeness, dedup precision/recall, tokenization tradeoffs, retrieval-backend choice, orchestration) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated documenting the Pinecone-only → dual-backend reversal, with two parallel client files as receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have a portfolio.

Related projects

Paired with this project

P13·PAID·ai

Agentic data pipeline — LangGraph supervisor + HITL + ADRs

EXPERT-tier agent platform: LangGraph supervisor + 4 worker agents, RBAC tool registry, Redis checkpointing + 24h time-travel, HITL via interrupt_before + Slack actionable buttons, FailureDetector + ToolCallGuard, multi-tenant platform-design capstone, 5 committed ADRs (one Deprecated), runnable cost-model CSV. 6 modules · 17-18h. Modules 01-03 with PRO.

Explore project →

Ready to ship a real training-data pipeline?

Start with PRO ($29/mo) for Modules 01-02 — crawl + dedup + quality + tokenization + RAG. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

See EXPERT benefits


Write the ADRs staff engineers actually get judged on.

ADR-001 · aiohttp + custom crawler over Scrapy / requests-html

ADR-002 · MinHash + LSH (datasketch) over hash-only / embedding similarity dedup

ADR-003 · Hybrid tokenization: tiktoken default + custom BPE pedagogical

ADR-004 · Ray + Airflow for orchestration over Spark / Dask / single-node cron

ADR-005 · Pinecone-only vector backend (Deprecated; superseded by the dual-backend Protocol)

Read the FinOps story for the pipeline you actually ship.


Going deeper? Four tracks back this project.

Data Curation & Dataset Engineering

LLM Data Pipelines Deep Dive

MLOps for Data Engineers

Python for Data Engineers
