How to Build a Training Dataset: Step-by-Step Guide
Build a training dataset in 6 steps: define requirements → collect with provenance → deduplicate (exact then near) → quality filter → version with data card → create reproducible splits. Skipping any step creates a dataset that is either unusable for experiments (no versioning), untrustworthy (no dedup), or legally problematic (no provenance).
Define dataset requirements
Before collecting any data, define: task type (classification, generation, regression), target distribution, minimum size, quality criteria, and compliance constraints (PII, license, IP clearance). Without requirements, you cannot evaluate whether your dataset is ready.
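One lightweight way to make these requirements checkable is a small spec dict validated against dataset statistics. A sketch only: the field names and thresholds below are illustrative, not a standard.

```python
# Illustrative requirements spec -- field names are an example, not a standard
REQUIREMENTS = {
    "task_type": "classification",
    "min_examples": 10_000,
    "max_label_imbalance": 0.8,   # no class may exceed 80% of examples
    "allowed_licenses": {"CC-BY-4.0", "CC0-1.0", "internal"},
    "languages": {"en"},
    "pii_allowed": False,
}

def dataset_is_ready(stats: dict, req: dict = REQUIREMENTS) -> bool:
    """Check collected-dataset statistics against the requirements spec."""
    if stats["example_count"] < req["min_examples"]:
        return False
    if stats["majority_class_fraction"] > req["max_label_imbalance"]:
        return False
    if not stats["licenses"] <= req["allowed_licenses"]:
        return False
    return True
```

Running this check at the end of the pipeline turns "is the dataset ready?" from a judgment call into a pass/fail gate.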
Collect raw data with provenance
Ingest raw data and track provenance metadata at collection time. Source URL, collection timestamp, and license cannot be reconstructed after the fact. Use a structured metadata schema from the start.
# Collect with provenance tracking
import hashlib
from datetime import datetime, timezone

def collect_with_provenance(source_url: str, text: str) -> dict:
    return {
        "text": text,
        "source_url": source_url,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
        "license": "CC-BY-4.0",  # track license at collection
    }
Deduplicate: exact then near
Run exact deduplication (hash match) first — it is free. Then run near-deduplication with MinHash LSH to catch paraphrases and boilerplate. Always dedup before quality filtering to avoid wasting compute.
# Two-stage deduplication
from datasketch import MinHash, MinHashLSH

# Stage 1: exact dedup (content hash)
seen_hashes = set()
unique_docs = []
for d in docs:
    if d["content_hash"] not in seen_hashes:
        seen_hashes.add(d["content_hash"])
        unique_docs.append(d)

# Stage 2: near-dedup (MinHash LSH, threshold=0.8)
lsh = MinHashLSH(threshold=0.8, num_perm=128)
deduped = []
for doc in unique_docs:
    mh = MinHash(num_perm=128)
    for token in doc["text"].split():
        mh.update(token.encode())
    if not lsh.query(mh):  # no existing near-duplicate
        lsh.insert(doc["content_hash"], mh)
        deduped.append(doc)
Apply quality filters
Filter on heuristics that correlate with quality. Remove very short or very long texts, high repetition ratios, wrong language, and toxic content. For LLMs, add perplexity filtering against a small reference model.
import langdetect
from collections import Counter

def quality_filter(doc: dict) -> bool:
    text = doc["text"]
    tokens = text.split()
    # Heuristic filters
    if len(tokens) < 50 or len(tokens) > 100_000:
        return False  # too short or too long
    # Repetition: most frequent word > 20% of all words
    word_counts = Counter(tokens)
    if max(word_counts.values()) / len(tokens) > 0.20:
        return False  # highly repetitive
    # Language detection
    try:
        if langdetect.detect(text) != "en":
            return False
    except Exception:
        return False  # detection failed; treat as unusable
    return True

filtered = [d for d in deduped if quality_filter(d)]
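The heuristics above do not include the perplexity filter mentioned for LLM data. A minimal sketch of the wiring, assuming you supply a `perplexity(text)` scoring function backed by a small reference model (for example a KenLM n-gram model or a small transformer); the dummy scorer below is a placeholder for illustration only.

```python
from typing import Callable

def perplexity_filter(docs: list[dict],
                      perplexity: Callable[[str], float],
                      max_ppl: float = 1000.0) -> list[dict]:
    """Drop documents whose perplexity under a reference model is too high.

    High perplexity means the text is unlike the reference corpus:
    often garbled, machine-generated, or badly off-domain.
    """
    return [d for d in docs if perplexity(d["text"]) <= max_ppl]

# Placeholder scorer -- swap in a real reference model in practice
def dummy_perplexity(text: str) -> float:
    return 100.0 if text.isascii() else 2000.0
```

Tune `max_ppl` on a labeled sample: too low and you drop valid but unusual text, too high and the filter does nothing.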
Version the dataset with a data card
Commit the processed dataset to DVC or Hugging Face Hub. Write a data card documenting sources, filters, known limitations, and example counts. Future engineers (and future you) cannot safely use a dataset without this documentation.
# Version with DVC
dvc add data/training/v2_filtered.parquet
git add data/training/v2_filtered.parquet.dvc
git commit -m "dataset: v2 — 847k examples after dedup+filter (was 2.1M raw)"
# data_card.yaml (committed alongside the dataset)
# dataset_name: instruction-finetuning-v2
# version: 2.0.0
# sources: [internal-annotations, public-alpaca]
# filters_applied: [exact-dedup, near-dedup-0.8, quality-heuristics, lang-en]
# example_count: 847_412
# known_limitations: "English only. No math or code examples."
# pii_handling: "Presidio PII scrubbing applied to all examples"
Create reproducible train/val/test splits
Split on the correct unit for your task (user, conversation, or document — not row index). Use a fixed random seed. Hold out the test set and do not touch it until final evaluation.
import random
from collections import defaultdict
random.seed(42) # fixed seed for reproducibility
# Group by user_id to prevent user leakage across splits
by_user = defaultdict(list)
for ex in dataset:
    by_user[ex["user_id"]].append(ex)
user_ids = list(by_user.keys())
random.shuffle(user_ids)
n = len(user_ids)
train_users = user_ids[:int(n * 0.80)]
val_users = user_ids[int(n * 0.80):int(n * 0.90)]
test_users = user_ids[int(n * 0.90):]
train = [ex for uid in train_users for ex in by_user[uid]]
val = [ex for uid in val_users for ex in by_user[uid]]
test = [ex for uid in test_users for ex in by_user[uid]]
When to Apply This Process
- Fine-tuning any LLM on a new task or domain
- Building a training dataset from production feedback (data flywheel)
- Migrating from ad-hoc experiment datasets to a versioned dataset registry
- Any ML project where training data quality is suspected as the bottleneck
Common Issues
Splits leak through shared identifiers
If two examples share a user_id or conversation_id and one lands in train while the other lands in test, the model has effectively trained on test-time data, and evaluation metrics will be inflated. Always split on the correct unit of independence for your task.
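A cheap guard is to assert split disjointness on the grouping key before training starts. A sketch, assuming examples carry a `user_id` field as in the split code above:

```python
def assert_no_leakage(*splits: list[dict], key: str = "user_id") -> None:
    """Fail fast if any grouping-key value appears in more than one split."""
    id_sets = [{ex[key] for ex in split} for split in splits]
    for i in range(len(id_sets)):
        for j in range(i + 1, len(id_sets)):
            overlap = id_sets[i] & id_sets[j]
            if overlap:
                raise ValueError(
                    f"{len(overlap)} {key} values appear in both split {i} and split {j}"
                )
```

Call it as `assert_no_leakage(train, val, test)` right after splitting, so a leaky split fails loudly instead of silently inflating your metrics.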
Quality filters remove too much data
Aggressive quality filtering can remove 90%+ of raw data. Tune thresholds on a sample first. Track how many examples each filter removes. If one filter removes >50%, investigate whether it is actually capturing quality or just filtering rare-but-valid examples.
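One way to make per-filter removal visible is to run filters as named predicates and count what each one drops. A sketch; the `(name, predicate)` structure is illustrative, not a library API.

```python
from collections import Counter

def filter_with_attrition(docs, filters):
    """Apply named filters in order, counting how many docs each removes.

    `filters` is a list of (name, predicate) pairs; a predicate
    returns True to keep the document. The first failing filter
    gets credited with the removal.
    """
    removed = Counter()
    kept = []
    for doc in docs:
        for name, pred in filters:
            if not pred(doc):
                removed[name] += 1
                break
        else:  # no filter rejected the doc
            kept.append(doc)
    return kept, removed
```

If `removed` shows a single filter accounting for more than half the drops, sample what it removed before trusting it.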
Near-dedup threshold too tight
A Jaccard threshold of 0.5 will remove near-duplicates but also many valid paraphrases. Start with 0.8, inspect a sample of removed examples, then tune. Too-tight dedup can remove diversity that helps generalization.
FAQ
- How large should a training dataset be?
- Fine-tuning LLMs: 1k–100k high-quality examples. Pre-training from scratch: billions to trillions of tokens. Classical ML: a common rule of thumb is ~10× the number of features. Quality beats quantity: a curated 10k-example dataset often outperforms 1M noisy examples.
- What is the train/val/test split ratio?
- 80/10/10 for datasets under 100k examples. 98/1/1 for very large datasets. More important than ratio: no leakage between splits, fixed seed, and never use test set for hyperparameter tuning.
- What is a data card?
- Documentation for a dataset: sources, collection method, filters applied, known limitations, PII handling, example count, and license. A dataset without a data card is a liability — the next user has no way to evaluate whether it is appropriate for their task.