How to Build a Training Dataset: Step-by-Step Guide
Build a training dataset in 6 steps: define requirements → collect with provenance → deduplicate (exact then near) → quality filter → version with data card → create reproducible splits. Skipping any step creates a dataset that is either unusable for experiments (no versioning), untrustworthy (no dedup), or legally problematic (no provenance).
Define dataset requirements
Before collecting any data, define: task type (classification, generation, regression), target distribution, minimum size, quality criteria, and compliance constraints (PII, license, IP clearance). Without requirements, you cannot evaluate whether your dataset is ready.
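One lightweight way to make these requirements checkable is a small spec dict validated against dataset statistics. A sketch only: the field names and thresholds below are illustrative, not a standard.

```python
# Illustrative requirements spec -- field names are an example, not a standard
REQUIREMENTS = {
    "task_type": "classification",
    "min_examples": 10_000,
    "max_label_imbalance": 0.8,   # no class may exceed 80% of examples
    "allowed_licenses": {"CC-BY-4.0", "CC0-1.0", "internal"},
    "languages": {"en"},
    "pii_allowed": False,
}

def dataset_is_ready(stats: dict, req: dict = REQUIREMENTS) -> bool:
    """Check collected-dataset statistics against the requirements spec."""
    if stats["example_count"] < req["min_examples"]:
        return False
    if stats["majority_class_fraction"] > req["max_label_imbalance"]:
        return False
    if not stats["licenses"] <= req["allowed_licenses"]:
        return False
    return True
```

Running this check at the end of the pipeline turns "is the dataset ready?" from a judgment call into a pass/fail gate.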
Collect raw data with provenance
Ingest raw data and track provenance metadata at collection time. Source URL, collection timestamp, and license cannot be reconstructed after the fact. Use a structured metadata schema from the start.
# Collect with provenance tracking
import hashlib
from datetime import datetime, timezone

def collect_with_provenance(source_url: str, text: str) -> dict:
    return {
        "text": text,
        "source_url": source_url,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
        "license": "CC-BY-4.0",  # track license at collection
    }
Deduplicate: exact then near
Run exact deduplication (hash match) first — it is free. Then run near-deduplication with MinHash LSH to catch paraphrases and boilerplate. Always dedup before quality filtering to avoid wasting compute.
# Two-stage deduplication
from datasketch import MinHash, MinHashLSH

# Stage 1: exact dedup (content hash)
seen_hashes = set()
unique_docs = []
for d in docs:
    if d["content_hash"] not in seen_hashes:
        seen_hashes.add(d["content_hash"])
        unique_docs.append(d)

# Stage 2: near-dedup (MinHash LSH, threshold=0.8)
lsh = MinHashLSH(threshold=0.8, num_perm=128)
deduped = []
for doc in unique_docs:
    mh = MinHash(num_perm=128)
    for token in doc["text"].split():
        mh.update(token.encode())
    if not lsh.query(mh):  # no existing near-duplicate
        lsh.insert(doc["content_hash"], mh)
        deduped.append(doc)
Apply quality filters
Filter on heuristics that correlate with quality. Remove very short or very long texts, high repetition ratios, wrong language, and toxic content. For LLMs, add perplexity filtering against a small reference model.
import langdetect
from collections import Counter

def quality_filter(doc: dict) -> bool:
    text = doc["text"]
    tokens = text.split()
    # Heuristic filters
    if len(tokens) < 50 or len(tokens) > 100_000:
        return False  # too short or too long
    # Repetition: most frequent word > 20% of all words
    word_counts = Counter(tokens)
    if max(word_counts.values()) / len(tokens) > 0.20:
        return False  # highly repetitive
    # Language detection
    try:
        if langdetect.detect(text) != "en":
            return False
    except Exception:
        return False  # detection failed; treat as unusable
    return True

filtered = [d for d in deduped if quality_filter(d)]
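The heuristics above do not include the perplexity filter mentioned for LLM data. A minimal sketch of the wiring, assuming you supply a `perplexity(text)` scoring function backed by a small reference model (for example a KenLM n-gram model or a small transformer); the dummy scorer below is a placeholder for illustration only.

```python
from typing import Callable

def perplexity_filter(docs: list[dict],
                      perplexity: Callable[[str], float],
                      max_ppl: float = 1000.0) -> list[dict]:
    """Drop documents whose perplexity under a reference model is too high.

    High perplexity means the text is unlike the reference corpus:
    often garbled, machine-generated, or badly off-domain.
    """
    return [d for d in docs if perplexity(d["text"]) <= max_ppl]

# Placeholder scorer -- swap in a real reference model in practice
def dummy_perplexity(text: str) -> float:
    return 100.0 if text.isascii() else 2000.0
```

Tune `max_ppl` on a labeled sample: too low and you drop valid but unusual text, too high and the filter does nothing.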
Version the dataset with a data card
Commit the processed dataset to DVC or Hugging Face Hub. Write a data card documenting sources, filters, known limitations, and example counts. Future engineers (and future you) cannot safely use a dataset without this documentation.
# Version with DVC
dvc add data/training/v2_filtered.parquet
git add data/training/v2_filtered.parquet.dvc
git commit -m "dataset: v2 — 847k examples after dedup+filter (was 2.1M raw)"
# data_card.yaml (committed alongside the dataset)
# dataset_name: instruction-finetuning-v2
# version: 2.0.0
# sources: [internal-annotations, public-alpaca]
# filters_applied: [exact-dedup, near-dedup-0.8, quality-heuristics, lang-en]
# example_count: 847_412
# known_limitations: "English only. No math or code examples."
# pii_handling: "Presidio PII scrubbing applied to all examples"
Create reproducible train/val/test splits
Split on the correct unit for your task (user, conversation, or document — not row index). Use a fixed random seed. Hold out the test set and do not touch it until final evaluation.
import random
from collections import defaultdict
random.seed(42) # fixed seed for reproducibility
# Group by user_id to prevent user leakage across splits
by_user = defaultdict(list)
for ex in dataset:
    by_user[ex["user_id"]].append(ex)
user_ids = list(by_user.keys())
random.shuffle(user_ids)
n = len(user_ids)
train_users = user_ids[:int(n * 0.80)]
val_users = user_ids[int(n * 0.80):int(n * 0.90)]
test_users = user_ids[int(n * 0.90):]
train = [ex for uid in train_users for ex in by_user[uid]]
val = [ex for uid in val_users for ex in by_user[uid]]
test = [ex for uid in test_users for ex in by_user[uid]]
When to Apply This Process
- Fine-tuning any LLM on a new task or domain
- Building a training dataset from production feedback (data flywheel)
- Migrating from ad-hoc experiment datasets to a versioned dataset registry
- Any ML project where training data quality is suspected as the bottleneck
Common Issues
Splits leak through shared identifiers
If two examples share a user_id or conversation_id and one lands in train while the other lands in test, the model has effectively trained on test-time data, and evaluation metrics will be inflated. Always split on the correct unit of independence for your task.
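A cheap guard is to assert split disjointness on the grouping key before training starts. A sketch, assuming examples carry a `user_id` field as in the split code above:

```python
def assert_no_leakage(*splits: list[dict], key: str = "user_id") -> None:
    """Fail fast if any grouping-key value appears in more than one split."""
    id_sets = [{ex[key] for ex in split} for split in splits]
    for i in range(len(id_sets)):
        for j in range(i + 1, len(id_sets)):
            overlap = id_sets[i] & id_sets[j]
            if overlap:
                raise ValueError(
                    f"{len(overlap)} {key} values appear in both split {i} and split {j}"
                )
```

Call it as `assert_no_leakage(train, val, test)` right after splitting, so a leaky split fails loudly instead of silently inflating your metrics.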
Quality filters remove too much data
Aggressive quality filtering can remove 90%+ of raw data. Tune thresholds on a sample first. Track how many examples each filter removes. If one filter removes >50%, investigate whether it is actually capturing quality or just filtering rare-but-valid examples.
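One way to make per-filter removal visible is to run filters as named predicates and count what each one drops. A sketch; the `(name, predicate)` structure is illustrative, not a library API.

```python
from collections import Counter

def filter_with_attrition(docs, filters):
    """Apply named filters in order, counting how many docs each removes.

    `filters` is a list of (name, predicate) pairs; a predicate
    returns True to keep the document. The first failing filter
    gets credited with the removal.
    """
    removed = Counter()
    kept = []
    for doc in docs:
        for name, pred in filters:
            if not pred(doc):
                removed[name] += 1
                break
        else:  # no filter rejected the doc
            kept.append(doc)
    return kept, removed
```

If `removed` shows a single filter accounting for more than half the drops, sample what it removed before trusting it.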
Near-dedup threshold too tight
A Jaccard threshold of 0.5 will remove near-duplicates but also many valid paraphrases. Start with 0.8, inspect a sample of removed examples, then tune. Too-tight dedup can remove diversity that helps generalization.
FAQ
- How large should a training dataset be?
- Fine-tuning LLMs: 1k–100k high-quality examples. Pre-training from scratch: billions to trillions of tokens. Classical ML: a common rule of thumb is ~10× the number of features. Quality beats quantity: a curated 10k-example dataset often outperforms 1M noisy examples.
- What is the train/val/test split ratio?
- 80/10/10 for datasets under 100k examples. 98/1/1 for very large datasets. More important than ratio: no leakage between splits, fixed seed, and never use test set for hyperparameter tuning.
- What is a data card?
- Documentation for a dataset: sources, collection method, filters applied, known limitations, PII handling, example count, and license. A dataset without a data card is a liability — the next user has no way to evaluate whether it is appropriate for their task.