ADR-003: Hybrid tokenization (tiktoken by default, custom BPE for teaching) | LLM Ingestion Pipeline

Context

LLM training-data tokenization serves two distinct purposes in this project: (1) production token counting and sequence packing for dataset construction, and (2) teaching how BPE works under the hood. The options considered:

tiktoken-only — OpenAI's BPE encoder, fast, well-tested. Production-ready but opaque (no learner-readable internals).
HuggingFace tokenizers — Rust-backed, supports many encoder types. Production-ready but heavy.
sentencepiece — Google's BPE/unigram library. Used by Llama / Mistral / Gemma. Wide adoption.
Custom from-scratch BPE — pedagogical, slow, not production ready.

We need both a working production encoder AND a learner-readable implementation. The constraint: a tutorial that only ships the custom BPE forces learners through an O(N²) merge loop on real data; one that only ships tiktoken hides the algorithm.

Decision

Adopt a hybrid: ship tokenizers/bpe.py as a from-scratch pedagogical BPE (Counter-based pair frequency, merge loop, vocab serialization) and use tiktoken as the Tier-1 default for actual dataset construction. Sequence packing (greedy + first-fit) lives in packing.py and is encoder-agnostic.

# tokenizers/bpe.py — pedagogical implementation
from collections import Counter

class BPETokenizer:
    def __init__(self, vocab_size: int = 8000):
        self.vocab_size = vocab_size
        self.merges: list[tuple[str, str]] = []
        self.vocab: dict[str, int] = {}

    def train(self, texts: list[str]) -> None:
        # 1. start with character vocab
        # 2. count adjacent pair frequencies
        # 3. merge the most frequent pair → new symbol
        # 4. repeat until vocab_size reached
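        # Count individual words; Counter keys must be hashable, so iterate
        # over the words themselves rather than over the lists from split().
        word_freqs = Counter(word for t in texts for word in t.split())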
        ...

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...

# packing.py — encoder-agnostic
class PackingStrategy:
    def __init__(self, max_len: int = 2048):
        self.max_len = max_len

    def greedy_pack(self, sequences: list[list[int]]) -> list[list[int]]:
        # First-fit decreasing: sort by length desc, pack into bins
        ...

# Production path (Tier-1 default)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "Example document to tokenize."  # placeholder input
tokens = enc.encode(text)
packed = PackingStrategy(max_len=4096).greedy_pack([tokens])
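
The merge loop outlined in the train() comments above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of tokenizers/bpe.py: the function names (get_pair_counts, merge_pair, train_bpe, encode) are hypothetical, words are split on whitespace only, ties between equally frequent pairs are broken lexicographically, and encode raises KeyError on characters never seen in training.

```python
from collections import Counter


def get_pair_counts(word_freqs: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Fuse every non-overlapping occurrence of `pair`, scanning left to right."""
    merged: list[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(texts: list[str], vocab_size: int):
    """Return (merges, vocab) learned from whitespace-split words."""
    # 1. Start with a character vocabulary; each word is a tuple of characters.
    word_freqs = Counter(tuple(w) for t in texts for w in t.split())
    chars = sorted({ch for word in word_freqs for ch in word})
    vocab = {ch: i for i, ch in enumerate(chars)}
    merges: list[tuple[str, str]] = []
    # 4. Repeat until vocab_size is reached or nothing is left to merge.
    while len(vocab) < vocab_size:
        # 2. Count adjacent pair frequencies.
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        # 3. Merge the most frequent pair (ties broken lexicographically).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        vocab.setdefault(best[0] + best[1], len(vocab))
        new_freqs: Counter = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges, vocab


def encode(text: str, merges: list[tuple[str, str]], vocab: dict[str, int]) -> list[int]:
    """Apply the learned merges in training order, then map symbols to IDs."""
    ids: list[int] = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        ids.extend(vocab[s] for s in symbols)
    return ids
```

Recounting every pair on each iteration is what makes this version slow on real data, and it is also what makes it easy to step through in a debugger. Because whitespace is discarded by split(), this sketch does not round-trip text exactly.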

Tradeoffs we accept

Lever	tiktoken-only	HF tokenizers	sentencepiece	Custom BPE	Hybrid (chosen)
Production encoding speed	Fast (Rust)	Fast (Rust)	Fast (C++)	Slow (Python)	Fast (tiktoken default)
Learner-readable internals	Opaque	Opaque	Opaque	Yes (from-scratch)	Yes (custom BPE)
Vocab serialization format	OpenAI-specific	HF-specific	sentencepiece-specific	JSON (custom)	Both
Multilingual / code-aware	Yes (cl100k_base)	Yes	Yes	Build it	Yes
Works without first-run download	No (encoding tables are downloaded and cached on first use)	Yes (after download)	Yes	Yes	Custom BPE only (tiktoken downloads on first use)
Tutorial reproducibility	Easy	Easy	Easy	Educational + slow	Best of both

We optimize for production-grade dataset output + learner-readable algorithm. The custom BPE in tokenizers/bpe.py is intentionally not production-grade — it's there so a learner can step through the merge loop in a debugger. The dataset construction path uses tiktoken because that's what production teams actually use.

Consequences (positive)

Module 04 fits in under 3 hours of learner time: tiktoken handles the heavy lifting of encoding; the custom BPE explains the algorithm.
The PackingStrategy class in packing.py is encoder-agnostic: first-fit decreasing packing works on any token list, regardless of encoder.
dataset_builder.py writes output shards as Parquet, so the storage format does not depend on the encoder choice.
A learner who wants to use sentencepiece (for Llama-style vocabs) can swap the tiktoken encoder for a sentencepiece one (the import plus the line that loads the model) without touching the packing or dataset-building code.
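
To make the encoder-agnostic claim in the list above concrete, here is a minimal sketch of first-fit decreasing packing over plain lists of token IDs. It is an illustration under assumptions rather than the code in packing.py: the function name pack_first_fit_decreasing is hypothetical, sequences longer than max_len are split into chunks, and no separator or end-of-sequence token is inserted between packed sequences.

```python
def pack_first_fit_decreasing(
    sequences: list[list[int]], max_len: int
) -> list[list[int]]:
    """Pack token sequences into bins holding at most max_len tokens each."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    # Assumption: over-long sequences are chunked so every piece fits in a bin.
    # Empty sequences produce no pieces and are dropped.
    pieces = [
        seq[i : i + max_len]
        for seq in sequences
        for i in range(0, len(seq), max_len)
    ]

    bins: list[list[int]] = []
    # Decreasing: place the longest pieces first.
    for piece in sorted(pieces, key=len, reverse=True):
        # First fit: use the first existing bin with enough room.
        for current in bins:
            if len(current) + len(piece) <= max_len:
                current.extend(piece)
                break
        else:
            # No bin had room, so open a new one.
            bins.append(list(piece))
    return bins
```

The function only looks at sequence lengths and never at what the IDs mean, which is why the same packing step can follow tiktoken, sentencepiece, or the custom BPE.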

Consequences (negative)

Two ways to do the same thing. Some learners will be confused by the dual presence of a custom BPE and tiktoken. Mitigation: Module 04 explicitly opens with "tiktoken is the production path; custom BPE is for understanding". Documented in tokenizers/__init__.py.
Custom BPE is slow. Training the from-scratch BPE on the full bundled corpus takes ~20 minutes, whereas tiktoken needs no training and is ready in seconds. Mitigation: Module 04 trains custom BPE on a 1k-doc subset for the pedagogical exercise.
No multilingual coverage in custom BPE. The from-scratch implementation handles ASCII and basic Unicode; complex scripts would need pre-tokenization. Mitigation: documented as out-of-scope.
Vocab compatibility. A custom BPE vocab is not interchangeable with tiktoken's. Mitigation: dataset shards include the vocab file in the same directory; downstream consumers know which to load.
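
As an illustration of the vocab-file mitigation above, the custom BPE's state could be written next to a shard as JSON and read back later. This is a hedged sketch: the file layout, the "encoder" tag, and the function names save_vocab and load_vocab are assumptions made for this example, not the project's actual schema.

```python
import json
from pathlib import Path


def save_vocab(path: str, vocab: dict[str, int], merges: list[tuple[str, str]]) -> None:
    """Write vocab and merges to a JSON file (hypothetical layout)."""
    payload = {
        "encoder": "custom-bpe",  # assumed tag so consumers know what to load
        "vocab": vocab,
        "merges": [list(pair) for pair in merges],  # JSON has no tuple type
    }
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_vocab(path: str) -> tuple[dict[str, int], list[tuple[str, str]]]:
    """Read the file back, restoring merges as tuples in their original order."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    if payload.get("encoder") != "custom-bpe":
        raise ValueError("vocab file was not written by the custom BPE")
    merges = [(left, right) for left, right in payload["merges"]]
    return payload["vocab"], merges
```

Merge order matters when encoding, so the merges are stored as an ordered list rather than as a set or mapping.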

Reversal plan

The encoder interface (encode(text) -> list[int], decode(tokens) -> str) is the same shape for tiktoken / HF / sentencepiece / custom. Replacement is bounded:

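HuggingFace tokenizers — replace import tiktoken with from tokenizers import Tokenizer. ~5-line change, plus a thin adapter: Tokenizer.encode returns an Encoding object whose .ids attribute holds the token IDs rather than a bare list[int]. The project's own tokenizers/ package shares its name with the HuggingFace library and can shadow it on import, so one of the two has to be renamed or imported under a different path. Gains multilingual and custom-vocab support.
sentencepiece — replace import tiktoken with import sentencepiece and load a trained model file. Needed for Llama-style sentencepiece vocabs.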
Drop the custom BPE — if pedagogical content moves to a separate "How tokenizers work" module, the project can drop tokenizers/bpe.py entirely. ~50 lines deleted.

Estimated effort: 0.5–2 engineer-days for any swap. Reversible.
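
The shared encoder shape described in this reversal plan can be written down as a typing.Protocol. The sketch below is illustrative: Encoder, ByteEncoder, HFEncoder, load_encoder, and count_tokens are hypothetical names, ByteEncoder is a toy stand-in used only to exercise the interface, and the third-party imports are deferred so the module loads even when those libraries are not installed. It also assumes the HuggingFace tokenizers library is importable under its own name, which requires resolving the clash with the local tokenizers/ package.

```python
from typing import Protocol


class Encoder(Protocol):
    """The shape shared by every encoder the pipeline accepts."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


class ByteEncoder:
    """Toy stand-in: one token per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return bytes(tokens).decode("utf-8")


class HFEncoder:
    """Adapter for HuggingFace tokenizers, whose encode() returns an Encoding."""

    def __init__(self, tokenizer_json_path: str):
        from tokenizers import Tokenizer  # deferred third-party import

        self._tok = Tokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, tokens: list[int]) -> str:
        return self._tok.decode(tokens)


def load_encoder(kind: str, source: str) -> Encoder:
    """Build an encoder; `source` is an encoding name or a file path."""
    if kind == "tiktoken":
        import tiktoken

        return tiktoken.get_encoding(source)
    if kind == "sentencepiece":
        import sentencepiece as spm

        return spm.SentencePieceProcessor(model_file=source)
    if kind == "hf":
        return HFEncoder(source)
    if kind == "bytes":
        return ByteEncoder()
    raise ValueError(f"unknown encoder kind: {kind}")


def count_tokens(encoder: Encoder, docs: list[str]) -> int:
    """Total token count across documents, independent of the encoder used."""
    return sum(len(encoder.encode(doc)) for doc in docs)
```

With this shape in place, a swap would touch only the load_encoder call site; packing and dataset-building code receive lists of integers either way.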

References

tokenizers/bpe.py (pedagogical from-scratch BPE)
tokenize/ (tiktoken-based production tokenization helpers)
packing.py (encoder-agnostic sequence packing)
export/dataset_builder.py (Parquet shard writer)
augmentation/data_augmentation.py (synthetic data generation, runs after tokenization)
ADR-002 (MinHash dedup — produces input to tokenization)
ADR-005 (Pinecone-only deprecated — vector index is downstream of tokenization)