Hybrid tokenization: tiktoken as the default + a custom pedagogical BPE

✓ Accepted · LLM Ingestion Pipeline · 02 — Dataset → RAG System (sub-part: Tokenization & Dataset Construction)
By AI-DE Engineering Team·Stakeholders: ML engineer, data engineer, curriculum lead

Context

Tokenization of LLM training data serves two distinct purposes in this project: (1) production token counting and sequence packing for dataset construction, and (2) teaching how BPE works internally. The options considered:

  1. tiktoken-only — OpenAI's BPE encoder, fast, well-tested. Production-ready but opaque (no learner-readable internals).
  2. HuggingFace tokenizers — Rust-backed, supports many encoder types. Production-ready but heavy.
  3. sentencepiece — Google's BPE/unigram library. Used by Llama 1 and 2, early Mistral models, and Gemma (Llama 3 moved to a tiktoken-based tokenizer). Wide adoption.
  4. Custom from-scratch BPE — pedagogical, slow, not production ready.

We need both a working production encoder AND a learner-readable implementation. The constraint: a tutorial that only ships the custom BPE forces learners through an O(N²) merge loop on real data; one that only ships tiktoken hides the algorithm.

Decision

Adopt a hybrid: ship tokenizers/bpe.py as a from-scratch pedagogical BPE (Counter-based pair frequency, merge loop, vocab serialization) and use tiktoken as the Tier-1 default for actual dataset construction. Sequence packing (greedy + first-fit) lives in packing.py and is encoder-agnostic.

# tokenizers/bpe.py — pedagogical implementation
from collections import Counter

class BPETokenizer:
    def __init__(self, vocab_size: int = 8000):
        self.vocab_size = vocab_size
        self.merges: list[tuple[str, str]] = []
        self.vocab: dict[str, int] = {}

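    def train(self, texts: list[str]) -> None:
        # 1. start with character vocab
        # 2. count adjacent pair frequencies
        # 3. merge the most frequent pair → new symbol
        # 4. repeat until vocab_size reached
        # Count whole words. Counter needs hashable items, so flatten the
        # per-text word lists instead of passing the lists themselves.
        word_freqs = Counter(word for t in texts for word in t.split())
        ...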

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...

# packing.py — encoder-agnostic
class PackingStrategy:
    def __init__(self, max_len: int = 2048):
        self.max_len = max_len

    def greedy_pack(self, sequences: list[list[int]]) -> list[list[int]]:
        # First-fit decreasing: sort by length desc, pack into bins
        ...
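
# Production path (Tier-1 default)
import tiktoken
from packing import PackingStrategy  # assumes packing.py is importable from the working directory

text = "Tokenize this document before packing."  # placeholder input
enc = tiktoken.get_encoding("cl100k_base")  # fetches and caches the encoding file on first use
tokens = enc.encode(text)
packed = PackingStrategy(max_len=4096).greedy_pack([tokens])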

Tradeoffs we accept

| Lever | tiktoken-only | HF tokenizers | sentencepiece | Custom BPE | Hybrid (chosen) |
|---|---|---|---|---|---|
| Production encoding speed | Fast (Rust) | Fast (Rust) | Fast (C++) | Slow (Python) | Fast (tiktoken default) |
| Learner-readable internals | Opaque | Opaque | Opaque | Yes (from-scratch) | Yes (custom BPE) |
| Vocab serialization format | OpenAI-specific | HF-specific | sentencepiece-specific | JSON (custom) | Both |
| Multilingual / code-aware | Yes (cl100k_base) | Yes | Yes | Build it | Yes |
| Works without first-run download | No (encoding file is fetched and cached on first use) | Yes (after download) | Yes | Yes | Custom BPE: yes; tiktoken: after the first-use fetch |
| Tutorial reproducibility | Easy | Easy | Easy | Educational + slow | Best of both |

We optimize for production-grade dataset output + learner-readable algorithm. The custom BPE in tokenizers/bpe.py is intentionally not production-grade — it's there so a learner can step through the merge loop in a debugger. The dataset construction path uses tiktoken because that's what production teams actually use.
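
The merge loop that a learner would step through can be sketched roughly as follows. This is not the contents of tokenizers/bpe.py; the function name train_bpe, the num_merges parameter, and the tie-breaking rule are assumptions made for illustration, and whitespace splitting stands in for real pre-tokenization.

```python
from collections import Counter


def train_bpe(texts: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merges from whitespace-split words (illustrative only)."""
    # Represent each distinct word as a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for t in texts for word in t.split())
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol

        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary but deterministic choice made for this sketch).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)

        # Rewrite every word, replacing each occurrence of the pair with one symbol.
        merged: Counter = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged

    return merges


if __name__ == "__main__":
    print(train_bpe(["low low low lower lowest"], num_merges=3))
```

Every iteration rescans the whole word table to recount pairs, which is the main reason a loop like this is slow on real data and easy to follow in a debugger.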

Consequences (positive)

  • Module 04 fits in under 3 hours of learner time: tiktoken handles the encoding heavy lifting; custom BPE explains the algorithm.
  • The PackingStrategy class in packing.py is encoder-agnostic: the first-fit decreasing greedy pack works on any list of token IDs, regardless of encoder.
  • dataset_builder.py writes output shards as Parquet, so the storage format does not depend on the encoder choice.
  • A learner who wants to use sentencepiece (for Llama-shaped vocabs) can swap import tiktoken for import sentencepiece, along with the encoder construction that goes with it, without touching the packing or dataset-building code.
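
To make the encoder-agnostic claim concrete, here is a minimal sketch of first-fit decreasing packing over plain lists of token IDs. It is not the body of PackingStrategy.greedy_pack; the function name is hypothetical, and the handling of oversize sequences (raising an error) and the absence of separator tokens between packed sequences are assumptions of this sketch.

```python
def first_fit_decreasing(sequences: list[list[int]], max_len: int) -> list[list[int]]:
    """Pack token-ID sequences into bins of at most max_len tokens each."""
    bins: list[list[int]] = []
    # Longest sequences first, so large items claim bins before small ones fill gaps.
    for seq in sorted(sequences, key=len, reverse=True):
        if len(seq) > max_len:
            # Assumption: a real pipeline might truncate or chunk instead of failing.
            raise ValueError(f"sequence of length {len(seq)} exceeds max_len={max_len}")
        for b in bins:
            # First fit: use the earliest bin that still has room.
            if len(b) + len(seq) <= max_len:
                b.extend(seq)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append(list(seq))
    return bins


if __name__ == "__main__":
    toy = [[1] * 3, [2] * 5, [3] * 2, [4] * 4]
    for packed in first_fit_decreasing(toy, max_len=6):
        print(len(packed), packed)
```

Nothing here inspects the token values, only the lengths, which is why the same routine works for IDs produced by tiktoken, sentencepiece, or the custom BPE.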

Consequences (negative)

  • Two ways to do the same thing. Some learners will be confused by the dual presence of a custom BPE and tiktoken. Mitigation: Module 04 explicitly opens with "tiktoken is the production path; custom BPE is for understanding". Documented in tokenizers/__init__.py.
  • Custom BPE is slow. Training the from-scratch BPE on the full bundled corpus takes ~20 minutes, whereas tiktoken needs no training step and is ready in seconds. Mitigation: Module 04 trains custom BPE on a 1k-doc subset for the pedagogical exercise.
  • No multilingual coverage in custom BPE. The from-scratch implementation handles ASCII and basic Unicode; complex scripts would need pre-tokenization. Mitigation: documented as out-of-scope.
  • Vocab compatibility. A custom BPE vocab is not interchangeable with tiktoken's. Mitigation: dataset shards include the vocab file in the same directory; downstream consumers know which to load.
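
One way the custom BPE vocab file could sit next to the dataset shards is a small JSON document, sketched below. The file layout (a "vocab" mapping plus an ordered "merges" list) and the helper names save_vocab and load_vocab are assumptions for illustration, not the project's actual format.

```python
import json
from pathlib import Path


def save_vocab(path: str, vocab: dict[str, int], merges: list[tuple[str, str]]) -> None:
    """Write vocab and merges to one JSON file (hypothetical layout)."""
    # Merges are stored as an ordered list because merge order matters when encoding.
    payload = {"vocab": vocab, "merges": [list(pair) for pair in merges]}
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_vocab(path: str) -> tuple[dict[str, int], list[tuple[str, str]]]:
    """Read the file back, restoring merges as tuples."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    merges = [(left, right) for left, right in payload["merges"]]
    return payload["vocab"], merges


if __name__ == "__main__":
    save_vocab("vocab.json", {"l": 0, "o": 1, "lo": 2}, [("l", "o")])
    print(load_vocab("vocab.json"))
```

JSON has no tuple type, so the pairs round-trip through lists; ensure_ascii=False keeps non-ASCII symbols readable in the file.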

Reversal plan

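The encoder interface (encode(text) -> list[int], decode(tokens) -> str) has the same shape for tiktoken, sentencepiece, and the custom BPE. HuggingFace tokenizers is close but not identical: Tokenizer.encode returns an Encoding object whose .ids attribute holds the token IDs, so it needs a thin adapter. Replacement is bounded:

  1. HuggingFace tokenizers — replace import tiktoken with from tokenizers import Tokenizer and read .ids from the encode result. ~5-line change. Gains multilingual + custom-vocab support. If the project's own tokenizers/ directory is importable as a top-level package, it can shadow the HuggingFace package of the same name, so check the import path first.
  2. sentencepiece — replace with import sentencepiece and load a trained model through SentencePieceProcessor. Required for Llama-shaped vocabs.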
  3. Drop the custom BPE — if pedagogical content moves to a separate "How tokenizers work" module, the project can drop tokenizers/bpe.py entirely. ~50 lines deleted.

Estimated effort: 0.5–2 engineer-days for any swap. Reversible.
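
The shared interface can be written down as a typing.Protocol, which keeps the swap points explicit. The sketch below uses only the standard library: Encoder, ByteEncoder, IdsAdapter, and count_tokens are hypothetical names, and ByteEncoder is a toy stand-in so the example runs without tiktoken installed. IdsAdapter illustrates the thin wrapper needed for any tokenizer whose encode() returns an object with an .ids attribute.

```python
from typing import Protocol


class Encoder(Protocol):
    """The interface shape the packing and dataset code rely on."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, tokens: list[int]) -> str: ...


class ByteEncoder:
    """Toy encoder: one token per UTF-8 byte. Stand-in for a real encoder."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        return bytes(tokens).decode("utf-8")


class IdsAdapter:
    """Wrap a tokenizer whose encode() result exposes token IDs as .ids."""

    def __init__(self, inner) -> None:
        self._inner = inner

    def encode(self, text: str) -> list[int]:
        return list(self._inner.encode(text).ids)

    def decode(self, tokens: list[int]) -> str:
        return self._inner.decode(tokens)


def count_tokens(encoder: Encoder, docs: list[str]) -> int:
    """Total token count across docs, for any object matching Encoder."""
    return sum(len(encoder.encode(doc)) for doc in docs)


if __name__ == "__main__":
    enc = ByteEncoder()
    print(count_tokens(enc, ["ab", "cde"]))
    print(enc.decode(enc.encode("round trip")))
```

Because Protocol uses structural typing, an object such as the one returned by tiktoken.get_encoding can be passed to count_tokens without inheriting from Encoder, as long as it exposes matching encode and decode methods.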

References

  • tokenizers/bpe.py (pedagogical from-scratch BPE)
  • tokenize/ (tiktoken-based production tokenization helpers)
  • packing.py (encoder-agnostic sequence packing)
  • export/dataset_builder.py (Parquet shard writer)
  • augmentation/data_augmentation.py (synthetic data generation, runs after tokenization)
  • ADR-002 (MinHash dedup — produces input to tokenization)
  • ADR-005 (Pinecone-only deprecated — vector index is downstream of tokenization)