What is an LLM Pipeline?
A data engineering system that collects, cleans, deduplicates, tokenizes, and packages text at scale — producing the training datasets that power large language models.
Quick Answer
An LLM data pipeline is a data engineering system that prepares training data for large language models. It crawls web sources, filters low-quality content, removes PII, deduplicates near-identical documents, tokenizes text into integer sequences, and packages datasets in formats like Parquet or Arrow for training jobs. Unlike RAG (which operates at inference time), an LLM pipeline runs offline and produces static datasets — it is a throughput problem at billion-token scale.
What is an LLM Pipeline?
Training a large language model requires trillions of tokens of high-quality text. Assembling that dataset is a data engineering problem — not a machine learning problem. GPT-4, LLaMA, and Mistral were all trained on datasets built by pipelines that processed petabytes of raw web crawl data (Common Crawl, GitHub, books, Wikipedia) through the same core stages.
The same pipeline architecture applies at smaller scale for fine-tuning: taking a pre-trained model and training it further on a domain-specific corpus (medical records, legal documents, code). The pipeline is smaller, but the deduplication, quality filtering, and tokenization steps are identical.
Pre-training Pipeline
- Crawl: Common Crawl, GitHub, books, Wikipedia
- Scale: trillions of tokens
- Distributed: Ray / Dask across many machines
- Output: packed token sequences in Parquet/Arrow
Fine-tuning Pipeline
- Crawl: domain-specific sources (docs, PDFs, APIs)
- Scale: millions to billions of tokens
- Compute: single machine or small cluster
- Output: instruction-formatted JSONL for SFT
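As a concrete picture of that last item, here is a minimal sketch of instruction-formatted JSONL for SFT, using only the standard library. The records and the `sft_train.jsonl` filename are invented for illustration:

```python
import json

# Hypothetical instruction/input/output records; a real SFT set has
# thousands of curated pairs.
examples = [
    {"instruction": "Summarize the clause.", "input": "The lessee shall...",
     "output": "The tenant must..."},
    {"instruction": "Extract the diagnosis.", "input": "Patient presents with...",
     "output": "Acute bronchitis"},
]

# JSONL: one self-contained JSON object per line, the shape most SFT
# trainers accept directly.
with open("sft_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

with open("sft_train.jsonl", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))  # one line per example
```

Because each line parses independently, the file streams through downstream tools without loading the whole dataset into memory.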
Why LLM Pipelines Matter
Without a Proper Pipeline
- Near-duplicates cause models to memorize instead of generalize
- PII in training data creates legal and compliance risk
- Toxic or low-quality text degrades model behavior
- Mismatched tokenizers cause silent data corruption
- No lineage — impossible to audit or reproduce datasets
With a Proper Pipeline
- MinHash/LSH removes 30–60% of web crawl as near-duplicates
- PII masking with NER keeps training data compliant
- Perplexity and toxicity filters retain high-quality text
- Sequence packing maximizes GPU utilization during training
- Dataset cards and HuggingFace Hub provide full lineage
What You Can Build
Pre-training Dataset
Crawl and process web data at billion-token scale for training a base language model from scratch.
Fine-tuning Corpus
Collect domain-specific PDFs, docs, and web pages to adapt a pre-trained model to a vertical (legal, medical, code).
Instruction Dataset
Format question-answer pairs in JSONL for supervised fine-tuning (SFT) to make models follow instructions.
RLHF Data Pipeline
Collect human preference labels and format comparison pairs for reinforcement learning from human feedback.
Evaluation Dataset
Build held-out benchmark datasets with known answers to measure model quality across tasks.
RAG Knowledge Base
Process documents into chunks and embeddings as input to a vector store for retrieval-augmented generation.
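The RAG knowledge base use case above hinges on chunking. A pure-Python sketch of fixed-size chunking with overlap follows; the `chunk_text` helper and its default sizes are illustrative choices, not a standard API, and the embedding step is out of scope here:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so a
    sentence cut at one boundary still appears whole in the next chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 1200
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks: chars 0-500, 450-950, 900-1200
```

Production pipelines usually chunk on token counts and sentence boundaries rather than raw characters, but the overlap idea is the same.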
How an LLM Pipeline Works
Every LLM data pipeline has four core stages: collect raw text, filter and clean it, tokenize into integer sequences, and package into training-ready datasets. Each stage outputs to the next via Parquet files or Arrow streams — enabling restartable, auditable processing.
COLLECT (crawl + extract) → CLEAN (filter + dedupe + PII) → TOKENIZE (BPE + pack) → PACKAGE (Parquet + HF Hub)
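The four-stage handoff can be sketched as a simple function composition. Every function below is a toy stand-in for the real stage (the names are illustrative, not a library); in production each stage reads the previous stage's Parquet output and writes its own, which is what makes the pipeline restartable:

```python
def collect() -> list[str]:
    # stand-in for crawling + HTML extraction
    return ["same article text", "same article text", "another document"]

def clean(docs: list[str]) -> list[str]:
    # stand-in for quality filtering + exact deduplication
    return list(dict.fromkeys(d.strip() for d in docs if d.strip()))

def tokenize(docs: list[str]) -> list[list[int]]:
    # stand-in for BPE: a toy word-to-ID vocabulary built on the fly
    vocab: dict[str, int] = {}
    return [[vocab.setdefault(w, len(vocab)) for w in d.split()] for d in docs]

def package(seqs: list[list[int]]) -> dict:
    # stand-in for Parquet export + HuggingFace Hub push
    return {"input_ids": seqs, "num_sequences": len(seqs)}

dataset = package(tokenize(clean(collect())))
print(dataset["num_sequences"])  # 2 documents survive exact dedup
```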
MinHash near-duplicate detection with datasketch
```python
from datasketch import MinHash, MinHashLSH

# Documents with estimated Jaccard similarity >= 0.8 count as near-duplicates
lsh = MinHashLSH(threshold=0.8, num_perm=128)

def get_minhash(text: str) -> MinHash:
    m = MinHash(num_perm=128)
    for word in text.split():
        m.update(word.encode("utf8"))
    return m

duplicate_ids = set()
for doc_id, text in documents:
    m = get_minhash(text)
    if lsh.query(m):  # near-duplicate of an already-indexed document
        duplicate_ids.add(doc_id)
    else:
        lsh.insert(doc_id, m)  # first occurrence: index it and keep the doc
```
BPE tokenization + sequence packing with tiktoken
```python
import tiktoken
import pyarrow as pa
import pyarrow.parquet as pq  # pyarrow.parquet must be imported explicitly

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_LEN = 2048

def pack_sequences(texts, ctx_len=CONTEXT_LEN):
    """Concatenate documents, separated by the end-of-text token,
    then slice the stream into fixed-length training sequences."""
    buffer, packed = [], []
    for text in texts:
        tokens = enc.encode(text)
        buffer.extend(tokens + [enc.eot_token])
        while len(buffer) >= ctx_len:
            packed.append(buffer[:ctx_len])
            buffer = buffer[ctx_len:]
    return packed

# Write packed sequences to Parquet via PyArrow
sequences = pack_sequences(clean_texts)
table = pa.table({"input_ids": sequences})
pq.write_table(table, "dataset.parquet")
```
LLM Pipeline vs Other Approaches
LLM Pipeline vs RAG
LLM Data Pipeline
- Runs offline, before training
- Produces token sequences on disk
- Throughput measured in GB/hour
- Modifies model weights permanently
RAG Pipeline
- Runs at inference time, per query
- Produces embeddings in a vector store
- Latency measured in milliseconds
- Knowledge updated without retraining
LLM Pipeline vs Standard ETL
LLM Data Pipeline
- Unstructured text → token integer arrays
- Semantic deduplication (MinHash/LSH)
- NLP-specific filters (perplexity, toxicity)
- Output: Parquet token files or HF datasets
Standard ETL
- Structured rows → cleaned, typed columns
- Key-based deduplication
- Schema validation and type coercion
- Output: tables in a data warehouse
Scrapy vs Playwright for Crawling
Scrapy
- High-throughput async crawling framework
- Built-in robots.txt, rate limiting, retries
- Static HTML only — no JS rendering
- Best for large-scale static content
Playwright
- Headless browser — renders JavaScript
- Slower, heavier resource usage
- Required for SPAs and dynamic content
- Best for JS-heavy sites and auth flows
| Stage | Tool | Purpose | Scale |
|---|---|---|---|
| Crawling | Scrapy / Playwright | Collect raw text from web | Millions of pages |
| Extraction | trafilatura / BeautifulSoup | Extract clean text from HTML | Millions of docs |
| Deduplication | MinHash + LSH (datasketch) | Remove near-duplicates | Billions of docs |
| PII removal | SpaCy NER / Presidio | Mask names, emails, SSNs | All docs |
| Quality filter | perplexity (kenlm) | Remove low-quality text | All docs |
| Tokenization | tiktoken / sentencepiece | Convert text → token IDs | All docs |
| Packaging | PyArrow + HuggingFace Hub | Publish training dataset | Final dataset |
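The deduplication row above covers near-duplicates; exact duplicates are cheaper to catch first with URL normalization and content hashing, as a pre-pass before MinHash. A standard-library sketch (the helper names and normalization rules are illustrative, and real pipelines normalize URLs more aggressively):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Drop query string, fragment, and trailing slash so mirror URLs collide."""
    parts = urlsplit(url.lower())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

def content_hash(text: str) -> str:
    """Hash whitespace-normalized text so trivial reflows still match."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_exact_duplicate(url: str, text: str) -> bool:
    key_u, key_c = normalize_url(url), content_hash(text)
    if key_u in seen or key_c in seen:
        return True
    seen.update((key_u, key_c))
    return False

print(is_exact_duplicate("https://a.com/page/", "Hello  world"))     # False
print(is_exact_duplicate("https://A.com/page?ref=x", "Hello world"))  # True
```

This catches byte-identical mirrors in O(1) per document; MinHash/LSH then handles the harder case of documents that differ by a few words.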
Common Mistakes
Skipping deduplication
Web crawl data contains massive near-duplication — the same article on thousands of mirror sites. Without MinHash/LSH deduplication, models memorize repeated patterns instead of learning language. Deduplication typically removes 30–60% of a raw crawl.
Mismatching tokenizers between training and inference
If you tokenize training data with tiktoken cl100k_base but run inference with a LLaMA sentencepiece tokenizer, token IDs are incompatible and model output is garbage. The tokenizer used to produce training data must exactly match the tokenizer at inference.
No sequence packing — wasting context window
If documents are shorter than the context window, naive batching pads each one to full length, wasting compute on padding tokens. Sequence packing concatenates multiple short documents into a single 2048-token sequence separated by end-of-text tokens. This can double training throughput.
Storing token arrays as JSON or CSV
Token ID arrays are large integers best stored in columnar binary formats (Parquet, Arrow). Storing as JSON inflates file size 3–5× and serialization/deserialization becomes the throughput bottleneck.
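The overhead is easy to measure with the standard library. The sketch below compares a JSON encoding of token IDs against fixed-width integers (the in-memory layout Arrow uses); the token values are random stand-ins, and on real token streams Parquet's columnar compression widens the gap further toward the 3–5× figure:

```python
import json
import random
from array import array

# 2048 token IDs drawn from a cl100k-sized vocabulary (values are illustrative)
random.seed(0)
tokens = [random.randrange(100_256) for _ in range(2048)]

json_bytes = len(json.dumps(tokens).encode("utf-8"))  # ~6-7 text bytes per ID
binary_bytes = len(array("i", tokens).tobytes())      # fixed-width ints, as in Arrow

print(f"JSON: {json_bytes} B, binary: {binary_bytes} B, "
      f"ratio: {json_bytes / binary_bytes:.1f}x")
```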
Who Should Learn LLM Pipelines?
Data Engineer
You build ETL pipelines and want to apply those skills to LLM training data. Crawling, deduplication, and Parquet packaging are natural extensions.
ML Engineer
You fine-tune models and need to assemble domain-specific datasets. Understanding the pipeline ensures your training data is clean and correctly tokenized.
AI Platform Engineer
You're building the data infrastructure that feeds LLM training and RAG systems. This is the foundational data layer for every AI workload in your organization.
FAQs
- What is an LLM data pipeline?
- An LLM data pipeline is a data engineering system that collects, cleans, deduplicates, tokenizes, and packages large volumes of text data for training or fine-tuning large language models. Unlike a standard ETL pipeline that produces structured rows, an LLM pipeline produces token sequences — fixed-length integer arrays that a model can learn from. Key stages: web crawling → quality filtering → deduplication → PII removal → tokenization → dataset packaging.
- What is the difference between an LLM pipeline and RAG?
- An LLM data pipeline prepares training data offline — it runs before the model is trained and produces datasets on disk (Parquet, Arrow). RAG (Retrieval-Augmented Generation) operates at inference time — it retrieves relevant documents from a vector store and injects them into the prompt at query time. LLM pipelines are a data engineering problem (throughput, deduplication, tokenization). RAG is an ML application problem (retrieval quality, reranking, context length).
- What is tokenization in an LLM pipeline?
- Tokenization converts raw text into integer token IDs that a model can process. Byte-Pair Encoding (BPE) — used by GPT-4 and LLaMA — merges frequent character pairs iteratively to build a vocabulary of ~50K–128K tokens. A tokenizer also handles sequence packing (combining short documents to fill a context window) and truncation (splitting long documents into fixed-length chunks). The choice of tokenizer affects model performance and must match the tokenizer used during inference.
- What is near-duplicate detection in LLM pipelines?
- Training data scraped from the web contains massive near-duplication — the same article republished across thousands of sites. Near-duplicates degrade model quality by over-fitting to repeated patterns. MinHash + LSH (Locality-Sensitive Hashing) identifies near-duplicate documents in O(n) time using probabilistic hash sketches. For exact deduplication, URL normalization and content hashing are sufficient. Both should be applied before tokenization.
- What tools are used to build LLM data pipelines?
- Common tools: Scrapy or Playwright for web crawling, Ray or Dask for distributed processing, PyArrow/Parquet for storage, tiktoken (OpenAI) or sentencepiece (Google) for tokenization, MinHash/LSH (datasketch library) for deduplication, SpaCy or Presidio for PII detection and masking, and HuggingFace Datasets for packaging and publishing training datasets.
What You'll Build with AI-DE
The LLM Ingestion Pipeline project takes you through all four stages of a production LLM data pipeline — from async web crawling to pushing a tokenized dataset to HuggingFace Hub.
- Async Scrapy crawler with robots.txt compliance and rate limiting
- MinHash/LSH deduplication removing 40%+ of crawl as near-duplicates
- SpaCy NER-based PII masking for names, emails, and phone numbers
- BPE tokenization with tiktoken + sequence packing to fill 2048-token windows
- Parquet export and HuggingFace Hub push with dataset card and lineage