LLM Data Pipelines at Scale

How OpenAI and Anthropic process trillions of tokens for GPT and Claude

Why These Case Studies Matter

The quality of an LLM is largely determined by its training data. OpenAI and Anthropic have built sophisticated data pipelines that process trillions of tokens, filtering for quality, safety, and diversity at unprecedented scale.

These case studies reveal the complete data curation stack: web crawling, deduplication, quality filtering, toxicity detection, and human feedback collection. You'll learn techniques that apply whether you're training a 7B parameter model or fine-tuning an existing LLM.
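The stages listed above (deduplication, quality filtering, and so on) can be pictured as composable filters over a document stream. Below is a minimal sketch assuming a list-of-dicts document format; the hash-based exact dedup and word-count threshold are illustrative stand-ins for the far more sophisticated filters these labs actually use:

```python
import hashlib


def exact_dedup(docs, seen=None):
    """Drop documents whose normalized text has already been seen."""
    seen = set() if seen is None else seen
    for doc in docs:
        digest = hashlib.sha256(doc["text"].strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def quality_filter(docs, min_words=50):
    """Keep documents long enough to carry signal (threshold is arbitrary here)."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def run_pipeline(docs):
    # Stages compose lazily, so the stream is filtered in a single pass.
    return list(quality_filter(exact_dedup(docs)))
```

Because each stage is a generator, new filters (toxicity, language ID, PII scrubbing) slot in without buffering the whole corpus in memory.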

Learning Path: After reading these case studies, build your own LLM data pipeline with the LLM Data Pipeline Project, then follow the step-by-step walkthrough.

Note on Metrics: These case studies are based on publicly available information from engineering blogs, conference talks, and open-source documentation. While we've verified core architectural patterns and technologies, some specific numbers (especially cost figures and exact scale metrics) are estimates for educational purposes. Where possible, we've updated unverified claims to reflect documented information or general ranges.

Featured Case Studies

Deep dives into OpenAI's GPT and Anthropic's Claude data pipelines

OpenAI

Case Study #1

The Problem

Training GPT-3/4 required processing data at internet scale: trillions of tokens from web pages, books, code repositories, and conversations. The pipeline had to filter low-quality data, remove duplicates, detect toxicity, and ensure diverse representation, all while respecting copyright and privacy.

Scale

Raw Data Crawled: 45 TB+
Tokens Processed: 13 trillion+
Web Pages: 100 billion+
Training Tokens: 1-2 trillion
Deduplication Rate: ~40%
Pipeline Duration: 6-12 months
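
A ~40% deduplication rate at web scale generally implies near-duplicate detection, commonly done with MinHash signatures rather than exact hashing. Here is a stdlib-only sketch; the shingle size, signature length, and MD5-based hash family are illustrative choices, not OpenAI's documented setup:

```python
import hashlib


def shingles(text, k=3):
    """Break text into overlapping k-word shingles for set comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash(text, num_hashes=64):
    """Signature of per-seed minimum hash values over the shingle set."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for seed in range(num_hashes)
    ]


def similarity(a, b):
    """Fraction of matching signature slots, estimating Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

In production, signatures are banded into an LSH index so candidate pairs are found without the quadratic all-pairs comparison this sketch would require.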

Anthropic

Case Study #2

The Problem

Training Claude required curating high-quality, safe, and diverse training data while implementing Constitutional AI principles. The data pipeline had to prioritize helpfulness, harmlessness, and honesty (HHH) while filtering harmful content at scale.

Scale

Data Sources: 100+ domains
Tokens Processed: 10+ trillion
Quality Checks: 50+ automated
Human Reviewers: 1,000+
RLHF Comparisons: 10 million+
Pipeline Iterations: 100+ versions