Context
LLM training-data tokenization serves two distinct purposes in this project: (1) production token counting and sequence packing for dataset construction, and (2) a pedagogical understanding of how BPE works under the hood. The candidate options:
- tiktoken-only: OpenAI's BPE encoder, fast and well-tested. Production-ready but opaque (no learner-readable internals).
- HuggingFace tokenizers: Rust-backed, supports many encoder types. Production-ready but heavy.
- sentencepiece: Google's BPE/unigram library. Used by Llama 2, Mistral 7B, and Gemma. Wide adoption.
- Custom from-scratch BPE: pedagogical, slow, not production-ready.
We need both a working production encoder AND a learner-readable implementation. The constraint: a tutorial that only ships the custom BPE forces learners through an O(N²) merge loop on real data; one that only ships tiktoken hides the algorithm.
Decision
Adopt a hybrid: ship tokenizers/bpe.py as a from-scratch
pedagogical BPE (Counter-based pair frequency, merge loop, vocab serialization)
and use tiktoken as the Tier-1 default for actual dataset
construction. Sequence packing (greedy + first-fit) lives in
packing.py and is encoder-agnostic.
# tokenizers/bpe.py: pedagogical implementation
from collections import Counter


class BPETokenizer:
    def __init__(self, vocab_size: int = 8000):
        self.vocab_size = vocab_size
        self.merges: list[tuple[str, str]] = []  # learned merges, in training order
        self.vocab: dict[str, int] = {}  # symbol -> token ID

    def train(self, texts: list[str]) -> None:
        # 1. start with character vocab
        # 2. count adjacent pair frequencies
        # 3. merge the most frequent pair → new symbol
        # 4. repeat until vocab_size reached
        # word -> frequency across the corpus (whitespace pre-tokenization)
        word_freqs = Counter(word for t in texts for word in t.split())
        ...

    def encode(self, text: str) -> list[int]: ...

    def decode(self, tokens: list[int]) -> str: ...
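The four training steps listed in the comments above can be sketched as standalone functions. This is a minimal illustration, not the contents of tokenizers/bpe.py: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`, `encode_word`), the `num_merges` stopping rule (instead of a target `vocab_size`), and the lexicographic tie-break are all assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(word_freqs: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs: Counter = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace each non-overlapping occurrence of `pair` with one joined symbol."""
    out: list[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(texts: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from whitespace-split words."""
    word_counts = Counter(word for t in texts for word in t.split())
    # Step 1: every word starts as a tuple of single characters.
    word_freqs = {tuple(word): n for word, n in word_counts.items()}
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)  # step 2
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: highest count wins; ties go to the lexicographically
        # smallest pair so the result is deterministic (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_freqs: dict[tuple[str, ...], int] = {}
        for symbols, n in word_freqs.items():
            merged = merge_pair(symbols, best)
            new_freqs[merged] = new_freqs.get(merged, 0) + n
        word_freqs = new_freqs  # step 4: repeat on the merged corpus
    return merges


def encode_word(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

On the one-line corpus `["low low low lower lowest"]` with three merges, this sketch learns `("l", "o")`, `("lo", "w")`, `("low", "e")`, and `encode_word("lowest", merges)` yields `["lowe", "s", "t"]`. Recounting every pair on each iteration is what makes the loop slow on real data, and it is also what makes it easy to step through in a debugger.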
# packing.py: encoder-agnostic
class PackingStrategy:
    def __init__(self, max_len: int = 2048):
        self.max_len = max_len  # maximum tokens per packed sequence

    def greedy_pack(self, sequences: list[list[int]]) -> list[list[int]]:
        # First-fit decreasing: sort by length desc, pack into bins
        ...
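As a rough illustration of the first-fit decreasing idea named in the comment above, the following standalone function packs token lists into bins of at most `max_len` tokens. It is a sketch under assumptions, not the body of `greedy_pack`: the function name `first_fit_decreasing` is hypothetical, sequences are concatenated without separator tokens, and over-length sequences raise an error rather than being truncated or split.

```python
def first_fit_decreasing(
    sequences: list[list[int]], max_len: int
) -> list[list[int]]:
    """Pack token sequences into bins holding at most `max_len` tokens each."""
    bins: list[list[int]] = []
    # Longest sequences first: they are the hardest to place.
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) > max_len:
            # Assumption: the caller chunks long sequences beforehand.
            raise ValueError(f"sequence of {len(seq)} tokens exceeds max_len={max_len}")
        for current in bins:
            # First fit: use the earliest bin with enough room left.
            if len(current) + len(seq) <= max_len:
                current.extend(seq)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append(list(seq))
    return bins
```

A real packer would typically also insert an end-of-sequence token between documents or track boundaries for attention masking; that is left out here to keep the bin logic visible.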
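# Production path (Tier-1 default)
import tiktoken

from packing import PackingStrategy  # assumes packing.py is importable as a top-level module

enc = tiktoken.get_encoding("cl100k_base")
text = "Example document."  # placeholder input
tokens = enc.encode(text)
packed = PackingStrategy(max_len=4096).greedy_pack([tokens])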
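The call above passes a single token list to the packer. The ADR does not say how `greedy_pack` treats a document that encodes to more than `max_len` tokens, so one option is to chunk such documents before packing. The helper below is a hypothetical sketch of that pre-step; the name `chunk_tokens` and the plain fixed-size split (no overlap, no sentence awareness) are assumptions.

```python
def chunk_tokens(tokens: list[int], max_len: int) -> list[list[int]]:
    """Split one token list into consecutive chunks of at most `max_len` tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i : i + max_len] for i in range(0, len(tokens), max_len)]
```

Each resulting chunk is then short enough to be handed to any packer that enforces the same `max_len`.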
Tradeoffs we accept
| Lever | tiktoken-only | HF tokenizers | sentencepiece | Custom BPE | Hybrid (chosen) |
|---|---|---|---|---|---|
| Production encoding speed | Fast (Rust) | Fast (Rust) | Fast (C++) | Slow (Python) | Fast (tiktoken default) |
| Learner-readable internals | Opaque | Opaque | Opaque | Yes (from-scratch) | Yes (custom BPE) |
| Vocab serialization format | OpenAI-specific | HF-specific | sentencepiece-specific | JSON (custom) | Both |
| Multilingual / code-aware | Yes (cl100k_base) | Yes | Yes | Build it | Yes |
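| Works without first-run download | No (fetches and caches the encoding on first use) | Yes (with a local tokenizer file) | Yes (with a local model file) | Yes | Custom BPE yes; tiktoken path needs one download |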
| Tutorial reproducibility | Easy | Easy | Easy | Educational + slow | Best of both |
We optimize for production-grade dataset output + learner-readable
algorithm. The custom BPE in tokenizers/bpe.py is intentionally
not production-grade — it's there so a learner can step through the
merge loop in a debugger. The dataset construction path uses tiktoken
because that's what production teams actually use.
Consequences (positive)
- Module 04 ships in <3 hours of learner time: tiktoken handles the encoding heavy-lifting; custom BPE explains the algorithm.
- The PackingStrategy class in packing.py is encoder-agnostic: first-fit decreasing greedy packing works on any token list, regardless of encoder.
- dataset_builder.py shards output as Parquet, so the storage format does not depend on the encoder choice.
- A learner who wants to use sentencepiece (for Llama-shaped vocabs) can swap import tiktoken for import sentencepiece without touching the packing or dataset-building code.
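The encoder-agnostic claim can be made concrete with a structural type. The sketch below is illustrative only: the `Encoder` protocol, the toy `ByteEncoder`, and `count_tokens` are hypothetical names that do not come from the project. Any object exposing `encode(text) -> list[int]` and `decode(tokens) -> str` satisfies the protocol, and downstream code depends on nothing else.

```python
from typing import Protocol


class Encoder(Protocol):
    """Minimal interface that packing and dataset code would rely on."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, tokens: list[int]) -> str: ...


class ByteEncoder:
    """Toy stand-in encoder: one token per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return bytes(tokens).decode("utf-8")


def count_tokens(encoder: Encoder, docs: list[str]) -> list[int]:
    """Token count per document, for any object matching `Encoder`."""
    return [len(encoder.encode(doc)) for doc in docs]
```

Because the protocol is structural, the object returned by `tiktoken.get_encoding(...)` can be passed to `count_tokens` without subclassing anything.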
Consequences (negative)
- Two ways to do the same thing. Some learners will be confused by the dual presence of a custom BPE and tiktoken. Mitigation: Module 04 explicitly opens with "tiktoken is the production path; custom BPE is for understanding". Documented in tokenizers/__init__.py.
- Custom BPE is slow. Training the from-scratch BPE on the full bundled corpus takes ~20 minutes; tiktoken needs no training and is ready in seconds. Mitigation: Module 04 trains custom BPE on a 1k-doc subset for the pedagogical exercise.
- No multilingual coverage in custom BPE. The from-scratch implementation handles ASCII and basic Unicode; complex scripts would need pre-tokenization. Mitigation: documented as out-of-scope.
- Vocab compatibility. A custom BPE vocab is not interchangeable with tiktoken's. Mitigation: dataset shards include the vocab file in the same directory; downstream consumers know which to load.
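One way to realize the "vocab file in the same directory" mitigation is to write a small JSON file next to the shards that records which encoder produced them. The sketch below is an assumption-laden example: the file name `vocab.json`, the JSON layout, and the function names `save_vocab` and `load_vocab` are hypothetical and not taken from the project.

```python
import json
from pathlib import Path


def save_vocab(
    shard_dir: Path,
    encoder_name: str,
    merges: list[tuple[str, str]],
    vocab: dict[str, int],
) -> Path:
    """Write encoder name, merges, and vocab beside the dataset shards."""
    path = shard_dir / "vocab.json"  # hypothetical file name
    payload = {
        "encoder": encoder_name,
        "merges": [list(pair) for pair in merges],  # JSON has no tuple type
        "vocab": vocab,
    }
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def load_vocab(shard_dir: Path) -> tuple[str, list[tuple[str, str]], dict[str, int]]:
    """Read the vocab file back, restoring merges as tuples."""
    payload = json.loads((shard_dir / "vocab.json").read_text(encoding="utf-8"))
    merges = [(left, right) for left, right in payload["merges"]]
    return payload["encoder"], merges, payload["vocab"]
```

Storing the encoder name alongside the vocab lets a downstream consumer check which encoder to load before decoding any shard.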
Reversal plan
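The encoder interface (encode(text) -> list[int], decode(tokens) -> str) has nearly the same shape for tiktoken / HF / sentencepiece / custom. The one difference is that HF's Tokenizer.encode returns an Encoding object whose .ids attribute holds the token list, so it needs a thin wrapper. Replacement is bounded: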
- HuggingFace tokenizers: replace import tiktoken with from tokenizers import Tokenizer. ~5-line change. Gains multilingual + custom-vocab support. If the local tokenizers/ directory is importable as a top-level package, it shadows the HuggingFace package of the same name and would need renaming first.
- sentencepiece: replace with import sentencepiece. Required for Llama-shaped vocabs.
- Drop the custom BPE: if pedagogical content moves to a separate "How tokenizers work" module, the project can drop tokenizers/bpe.py entirely. ~50 lines deleted.
Estimated effort: 0.5–2 engineer-days for any swap. Reversible.
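A swap of this kind could be confined to small adapter classes that present the shared encode/decode shape. The sketch below is one possible approach, with hypothetical class names (`HFAdapter`, `SentencePieceAdapter`); it takes an already-constructed tokenizer object, so it does not import either library itself. It relies on two library behaviors: a HuggingFace `Tokenizer.encode(text)` returns an `Encoding` with an `.ids` list, and a `SentencePieceProcessor.encode(text, out_type=int)` returns token IDs directly.

```python
from typing import Any


class HFAdapter:
    """Presents a HuggingFace `tokenizers.Tokenizer` as encode/decode."""

    def __init__(self, tokenizer: Any) -> None:
        self._tok = tokenizer

    def encode(self, text: str) -> list[int]:
        # Tokenizer.encode returns an Encoding; the token IDs live in `.ids`.
        return list(self._tok.encode(text).ids)

    def decode(self, tokens: list[int]) -> str:
        return self._tok.decode(tokens)


class SentencePieceAdapter:
    """Presents a `sentencepiece.SentencePieceProcessor` as encode/decode."""

    def __init__(self, processor: Any) -> None:
        self._sp = processor

    def encode(self, text: str) -> list[int]:
        # out_type=int requests token IDs rather than string pieces.
        return list(self._sp.encode(text, out_type=int))

    def decode(self, tokens: list[int]) -> str:
        return self._sp.decode(tokens)
```

In use, the wrapped object would come from something like `Tokenizer.from_file(...)` or `SentencePieceProcessor(model_file=...)`, with the file paths supplied by the project. The packing and dataset-building code would then see only `encode` and `decode`.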
References
- tokenizers/bpe.py (pedagogical from-scratch BPE)
- tokenize/ (tiktoken-based production tokenization helpers)
- packing.py (encoder-agnostic sequence packing)
- export/dataset_builder.py (Parquet shard writer)
- augmentation/data_augmentation.py (synthetic data generation, runs after tokenization)
- ADR-002 (MinHash dedup: produces input to tokenization)
- ADR-005 (Pinecone-only deprecated: vector index is downstream of tokenization)