
What is an LLM Pipeline?

A data engineering system that collects, cleans, deduplicates, tokenizes, and packages text at scale — producing the training datasets that power large language models.

Quick Answer

An LLM data pipeline is a data engineering system that prepares training data for large language models. It crawls web sources, filters low-quality content, removes PII, deduplicates near-identical documents, tokenizes text into integer sequences, and packages datasets in formats like Parquet or Arrow for training jobs. Unlike RAG (which operates at inference time), an LLM pipeline runs offline and produces static datasets — it is a throughput problem at billion-token scale.

What is an LLM Pipeline?

Training a large language model requires trillions of tokens of high-quality text. Assembling that dataset is a data engineering problem — not a machine learning problem. GPT-4, LLaMA, and Mistral were all trained on datasets built by pipelines that processed petabytes of raw web crawl data (Common Crawl, GitHub, books, Wikipedia) through the same core stages.

The same pipeline architecture applies at smaller scale for fine-tuning: taking a pre-trained model and training it further on a domain-specific corpus (medical records, legal documents, code). The pipeline is smaller, but the deduplication, quality filtering, and tokenization steps are identical.

Pre-training Pipeline

  • Crawl: Common Crawl, GitHub, books, Wikipedia
  • Scale: trillions of tokens
  • Distributed: Ray / Dask across many machines
  • Output: packed token sequences in Parquet/Arrow

Fine-tuning Pipeline

  • Crawl: domain-specific sources (docs, PDFs, APIs)
  • Scale: millions to billions of tokens
  • Single machine or small cluster
  • Output: instruction-formatted JSONL for SFT

Why LLM Pipelines Matter

Without a Proper Pipeline

  • Near-duplicates cause models to memorize instead of generalize
  • PII in training data creates legal and compliance risk
  • Toxic or low-quality text degrades model behavior
  • Mismatched tokenizers cause silent data corruption
  • No lineage — impossible to audit or reproduce datasets

With a Proper Pipeline

  • MinHash/LSH removes 30–60% of web crawl as near-duplicates
  • PII masking with NER keeps training data compliant
  • Perplexity and toxicity filters retain high-quality text
  • Sequence packing maximizes GPU utilization during training
  • Dataset cards and HuggingFace Hub provide full lineage

What You Can Build

Pre-training Dataset

Crawl and process web data at billion-token scale for training a base language model from scratch.

Fine-tuning Corpus

Collect domain-specific PDFs, docs, and web pages to adapt a pre-trained model to a vertical (legal, medical, code).

Instruction Dataset

Format question-answer pairs in JSONL for supervised fine-tuning (SFT) to make models follow instructions.
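As a minimal sketch, formatting QA pairs into JSONL for SFT can look like the following. The `messages` schema with `role`/`content` fields is the common chat format, but the exact fields your trainer expects may differ — check its documentation before adopting this layout.

```python
import json

# Hypothetical QA pairs — stand-ins for a real collected corpus.
qa_pairs = [
    {"question": "What is BPE?", "answer": "Byte-Pair Encoding, a subword tokenization scheme."},
    {"question": "What is MinHash?", "answer": "A probabilistic sketch used for near-duplicate detection."},
]

def to_sft_record(pair: dict) -> str:
    """Serialize one QA pair as a single JSON line in chat-messages form."""
    record = {
        "messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]
    }
    return json.dumps(record)

# One JSON object per line — the JSONL convention most SFT trainers read.
with open("sft.jsonl", "w") as f:
    for pair in qa_pairs:
        f.write(to_sft_record(pair) + "\n")
```

JSONL (rather than a single JSON array) lets training jobs stream records without loading the whole file.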

RLHF Data Pipeline

Collect human preference labels and format comparison pairs for reinforcement learning from human feedback.

Evaluation Dataset

Build held-out benchmark datasets with known answers to measure model quality across tasks.

RAG Knowledge Base

Process documents into chunks and embeddings as input to a vector store for retrieval-augmented generation.
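Chunking for a RAG knowledge base can be sketched as overlapping character windows — production pipelines often chunk by tokens or sentences instead, and the sizes here are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.

    Overlap keeps context that straddles a chunk boundary retrievable
    from either neighboring chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk then gets embedded and written to the vector store alongside its source-document ID.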

How an LLM Pipeline Works

Every LLM data pipeline has four core stages: collect raw text, filter and clean it, tokenize into integer sequences, and package into training-ready datasets. Each stage outputs to the next via Parquet files or Arrow streams — enabling restartable, auditable processing.

COLLECT (crawl + extract) → CLEAN (filter + dedupe + PII) → TOKENIZE (BPE + pack) → PACKAGE (Parquet + HF Hub)

MinHash near-duplicate detection with datasketch

from datasketch import MinHash, MinHashLSH

# LSH index: Jaccard similarity >= 0.8 counts as a near-duplicate
lsh = MinHashLSH(threshold=0.8, num_perm=128)

def get_minhash(text: str) -> MinHash:
    m = MinHash(num_perm=128)
    for word in text.split():
        m.update(word.encode('utf8'))
    return m

# documents: iterable of (doc_id, text) pairs from the extraction stage
duplicate_ids = set()
for doc_id, text in documents:
    m = get_minhash(text)
    if lsh.query(m):  # near-duplicate of an already-indexed doc
        duplicate_ids.add(doc_id)
    else:
        lsh.insert(doc_id, m)

BPE tokenization + sequence packing with tiktoken

import tiktoken
import pyarrow as pa
import pyarrow.parquet as pq

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_LEN = 2048

def pack_sequences(texts, ctx_len=CONTEXT_LEN):
    """Concatenate tokenized docs, separated by EOT, into fixed-length sequences."""
    buffer, packed = [], []
    for text in texts:
        tokens = enc.encode(text)
        buffer.extend(tokens + [enc.eot_token])
        while len(buffer) >= ctx_len:
            packed.append(buffer[:ctx_len])
            buffer = buffer[ctx_len:]
    return packed

# Write to Parquet via PyArrow; clean_texts comes from the cleaning stage
sequences = pack_sequences(clean_texts)
table = pa.table({"input_ids": sequences})
pq.write_table(table, "dataset.parquet")

LLM Pipeline vs Other Approaches

LLM Pipeline vs RAG

LLM Data Pipeline

  • Runs offline, before training
  • Produces token sequences on disk
  • Throughput measured in GB/hour
  • Modifies model weights permanently

RAG Pipeline

  • Runs at inference time, per query
  • Produces embeddings in a vector store
  • Latency measured in milliseconds
  • Knowledge updated without retraining

Verdict: Use an LLM pipeline to bake knowledge into model weights (expensive, permanent). Use RAG to inject up-to-date facts at query time (cheaper, updateable). Most production systems need both.

LLM Pipeline vs Standard ETL

LLM Data Pipeline

  • Unstructured text → token integer arrays
  • Semantic deduplication (MinHash/LSH)
  • NLP-specific filters (perplexity, toxicity)
  • Output: Parquet token files or HF datasets

Standard ETL

  • Structured rows → cleaned, typed columns
  • Key-based deduplication
  • Schema validation and type coercion
  • Output: tables in a data warehouse

Verdict: LLM pipelines extend ETL principles to unstructured text. The core skills (distributed processing, data quality, lineage) transfer — but the transformations are NLP-specific.

Scrapy vs Playwright for Crawling

Scrapy

  • High-throughput async crawling framework
  • Built-in robots.txt, rate limiting, retries
  • Static HTML only — no JS rendering
  • Best for large-scale static content

Playwright

  • Headless browser — renders JavaScript
  • Slower, heavier resource usage
  • Required for SPAs and dynamic content
  • Best for JS-heavy sites and auth flows

Verdict: Scrapy for throughput (thousands of pages/minute). Playwright for JavaScript-rendered content. Most LLM dataset pipelines use Scrapy + a fallback Playwright pass for JS sites.
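Whichever crawler you choose, robots.txt compliance can be checked with the standard library before fetching. A minimal sketch — the robots.txt rules and crawler name below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (normally fetched from the site root).
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# Check each URL before queuing it for download.
print(rp.can_fetch("MyCrawler", "https://example.com/articles/post1"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/data"))    # False
print(rp.crawl_delay("MyCrawler"))  # 2 — seconds to wait between requests
```

Scrapy does this automatically when `ROBOTSTXT_OBEY = True`; a custom Playwright pass has to do it itself.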
Stage | Tool | Purpose | Scale
Crawling | Scrapy / Playwright | Collect raw text from web | Millions of pages
Extraction | trafilatura / BeautifulSoup | Extract clean text from HTML | Millions of docs
Deduplication | MinHash + LSH (datasketch) | Remove near-duplicates | Billions of docs
PII removal | SpaCy NER / Presidio | Mask names, emails, SSNs | All docs
Quality filter | perplexity (kenlm) | Remove low-quality text | All docs
Tokenization | tiktoken / sentencepiece | Convert text → token IDs | All docs
Packaging | PyArrow + HuggingFace Hub | Publish training dataset | Final dataset
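The PII-removal stage can be sketched with regex patterns for structured identifiers. This is a simplified stand-in for the SpaCy/Presidio NER approach — regexes catch emails, SSNs, and phone numbers, but names still require NER:

```python
import re

# Placeholder tokens and patterns are illustrative; a production pipeline
# pairs these with an NER model for person names.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace pattern-matchable PII with placeholder tokens."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(mask_pii("Contact jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
# → Contact [EMAIL] or [PHONE], SSN [SSN].
```

Masking with stable placeholders (rather than deleting) preserves sentence structure so the text still reads naturally during training.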

Common Mistakes

Skipping deduplication

Web crawl data contains massive near-duplication — the same article on thousands of mirror sites. Without MinHash/LSH deduplication, models memorize repeated patterns instead of learning language. Deduplication typically removes 30–60% of a raw crawl.

Mismatching tokenizers between training and inference

If you tokenize training data with tiktoken cl100k_base but run inference with a LLaMA sentencepiece tokenizer, token IDs are incompatible and model output is garbage. The tokenizer used to produce training data must exactly match the tokenizer at inference.
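A toy illustration of the mismatch, using word-level vocabularies in place of real BPE tables: the same token IDs decode to different text under a different tokenizer, and real subword vocabularies diverge the same way.

```python
# Two toy "tokenizers" with the same words but different ID assignments.
vocab_a = {"the": 0, "cat": 1, "sat": 2}
vocab_b = {"sat": 0, "the": 1, "cat": 2}

def encode(text: str, vocab: dict) -> list[int]:
    return [vocab[w] for w in text.split()]

def decode(ids: list[int], vocab: dict) -> str:
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

ids = encode("the cat sat", vocab_a)   # [0, 1, 2]
print(decode(ids, vocab_a))            # the cat sat  — matched tokenizer
print(decode(ids, vocab_b))            # sat the cat  — mismatched: scrambled
```

With real 100K-entry BPE vocabularies the mismatched decode is not merely scrambled words but unrelated subword fragments.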

No sequence packing — wasting context window

If documents are shorter than the context window, naive batching pads with zeros — wasting compute. Sequence packing concatenates multiple short documents into a single 2048-token sequence separated by end-of-text tokens. This can double training throughput.
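A back-of-envelope calculation with hypothetical document lengths shows the waste:

```python
# Hypothetical document lengths (in tokens) against a 2048-token context.
doc_lengths = [300, 512, 950, 1200, 400, 700, 128, 2048, 60, 850]
CTX = 2048

# Naive batching: one document per sequence, padded up to CTX.
padded_total = len(doc_lengths) * CTX
real_tokens = sum(doc_lengths)
waste = 1 - real_tokens / padded_total
print(f"padding waste without packing: {waste:.0%}")

# Packing: concatenate docs (+1 end-of-text token each) back to back,
# so nearly every context slot holds a real token.
packed_seqs = -(-sum(l + 1 for l in doc_lengths) // CTX)  # ceiling division
print(f"sequences needed: {packed_seqs} packed vs {len(doc_lengths)} padded")
```

For this mix, padding wastes roughly two thirds of the compute, and packing cuts the sequence count from 10 to 4 — the source of the throughput gains described above.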

Storing token arrays as JSON or CSV

Token ID arrays are large arrays of integers, best stored in columnar binary formats (Parquet, Arrow). Storing them as JSON inflates file size 3–5×, and serialization/deserialization becomes the throughput bottleneck.
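A quick stdlib comparison illustrates the size gap even before Parquet's columnar compression, which widens it further (the token values here are random stand-ins):

```python
import array
import json
import random

random.seed(0)
# One packed 2048-token sequence with IDs drawn from a ~100K vocabulary.
tokens = [random.randrange(100_000) for _ in range(2048)]

json_bytes = len(json.dumps(tokens).encode())
# 4-byte integers, like an Arrow int32 column.
binary_bytes = len(array.array("i", tokens).tobytes())

print(f"JSON: {json_bytes} B, binary int32: {binary_bytes} B, "
      f"ratio: {json_bytes / binary_bytes:.1f}x")
```

The raw-bytes ratio alone favors binary; in practice JSON's per-record field names, text parsing cost, and the loss of columnar compression push the real-world gap to the 3–5× range.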

Who Should Learn LLM Pipelines?

Data Engineer

You build ETL pipelines and want to apply those skills to LLM training data. Crawling, deduplication, and Parquet packaging are natural extensions.

ML Engineer

You fine-tune models and need to assemble domain-specific datasets. Understanding the pipeline ensures your training data is clean and correctly tokenized.

AI Platform Engineer

You're building the data infrastructure that feeds LLM training and RAG systems. This is the foundational layer for everything AI in your organization.


FAQs

What is an LLM data pipeline?
An LLM data pipeline is a data engineering system that collects, cleans, deduplicates, tokenizes, and packages large volumes of text data for training or fine-tuning large language models. Unlike a standard ETL pipeline that produces structured rows, an LLM pipeline produces token sequences — fixed-length integer arrays that a model can learn from. Key stages: web crawling → quality filtering → deduplication → PII removal → tokenization → dataset packaging.
What is the difference between an LLM pipeline and RAG?
An LLM data pipeline prepares training data offline — it runs before the model is trained and produces datasets on disk (Parquet, Arrow). RAG (Retrieval-Augmented Generation) operates at inference time — it retrieves relevant documents from a vector store and injects them into the prompt at query time. LLM pipelines are a data engineering problem (throughput, deduplication, tokenization). RAG is an ML application problem (retrieval quality, reranking, context length).
What is tokenization in an LLM pipeline?
Tokenization converts raw text into integer token IDs that a model can process. Byte-Pair Encoding (BPE) — used by GPT-4 and LLaMA — merges frequent character pairs iteratively to build a vocabulary of ~50K–128K tokens. A tokenizer also handles sequence packing (combining short documents to fill a context window) and truncation (splitting long documents into fixed-length chunks). The choice of tokenizer affects model performance and must match the tokenizer used during inference.
What is near-duplicate detection in LLM pipelines?
Training data scraped from the web contains massive near-duplication — the same article republished across thousands of sites. Near-duplicates degrade model quality by over-fitting to repeated patterns. MinHash + LSH (Locality-Sensitive Hashing) identifies near-duplicate documents in O(n) time using probabilistic hash sketches. For exact deduplication, URL normalization and content hashing are sufficient. Both should be applied before tokenization.
What tools are used to build LLM data pipelines?
Common tools: Scrapy or Playwright for web crawling, Ray or Dask for distributed processing, PyArrow/Parquet for storage, tiktoken (OpenAI) or sentencepiece (Google) for tokenization, MinHash/LSH (datasketch library) for deduplication, SpaCy or Presidio for PII detection and masking, and HuggingFace Datasets for packaging and publishing training datasets.

What You'll Build with AI-DE

The LLM Ingestion Pipeline project takes you through all four stages of a production LLM data pipeline — from async web crawling to pushing a tokenized dataset to HuggingFace Hub.

  • Async Scrapy crawler with robots.txt compliance and rate limiting
  • MinHash/LSH deduplication removing 40%+ of crawl as near-duplicates
  • SpaCy NER-based PII masking for names, emails, and phone numbers
  • BPE tokenization with tiktoken + sequence packing to fill 2048-token windows
  • Parquet export and HuggingFace Hub push with dataset card and lineage