LLM Tokenization Explained
Tokenization converts raw text into integer token IDs that a language model can process. A tokenizer splits text into subword units (tokens), maps each to an integer from a fixed vocabulary, and outputs sequences the model trains on. The tokenizer used during training must exactly match the tokenizer used at inference — mixing them corrupts model output silently.
Tokenization in Practice
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization converts text into integers."
tokens = enc.encode(text)
# tokens is a list of 7 integer IDs, one per subword piece
# (exact IDs depend on the vocabulary)
# pieces ≈ ["Token", "ization", " converts", " text", " into", " integers", "."]
print(f'{len(text)} chars → {len(tokens)} tokens')
# 41 chars → 7 tokens (rule of thumb: 1 token ≈ 0.75 English words)
# Decode back to text
decoded = enc.decode(tokens)
assert decoded == text # lossless round-trip
Core Concepts
Byte-Pair Encoding (BPE)
BPE starts with individual bytes, then iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches its target size (typically 32K–128K tokens). Common words become single tokens; rare words decompose into subword units. Because single bytes form the base vocabulary, BPE can represent any language, code, or Unicode text with no out-of-vocabulary failures.
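The merge loop above can be sketched in a few lines of plain Python. This is a toy character-level trainer on a made-up corpus, not the byte-level production algorithm (function name, corpus, and merge count are illustrative):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE trainer: each word starts as a sequence of characters."""
    words = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for seq, freq in words.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_words = Counter()
        for seq, freq in words.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, words

merges, vocab = bpe_train(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Note how the frequent stem "low" becomes a single symbol after two merges, while the rarer suffixes stay decomposed — exactly the common-word/rare-word behavior described above.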
Vocabulary and Token IDs
A tokenizer vocabulary maps every token string to a unique integer. Special tokens are added for structural purposes: <|endoftext|> marks document boundaries, <|pad|> fills shorter sequences to a fixed length, and instruction-tuned models add <|im_start|> / <|im_end|> for message formatting.
Sequence Packing
Most training documents are shorter than the model context window (2048–8192 tokens). Naive batching pads short documents with zeros — wasting GPU compute. Sequence packing concatenates multiple documents into one full-length sequence separated by <|endoftext|> tokens. This eliminates padding and can double training throughput on short-document corpora.
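A minimal sketch of packing (the function name and toy token IDs are illustrative; real pipelines also carry the leftover tail into the next batch rather than dropping it):

```python
def pack_sequences(tokenized_docs, context_len, eot_id):
    """Concatenate docs separated by eot_id, then cut into fixed-length rows."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eot_id)  # document boundary marker
    # Emit only full-length rows; the ragged tail is dropped in this sketch.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]

rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]],
                      context_len=4, eot_id=0)
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that a document can straddle a row boundary (the second row ends with the start of the third document); training setups often reset positions or mask attention across the <|endoftext|> separator so tokens don't attend into a neighboring document.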
tiktoken vs sentencepiece
| Property | tiktoken | sentencepiece |
|---|---|---|
| Algorithm | BPE with regex pre-tokenizer | BPE or Unigram LM |
| Developer | OpenAI | Google |
| Used by | GPT-3.5, GPT-4, o1 | LLaMA, Gemma, T5, Mistral |
| Speed | Very fast (Rust impl) | Fast (C++ impl) |
| Unicode handling | Byte-level BPE (bytes are the base vocabulary) | Optional byte fallback |
| Python install | pip install tiktoken | pip install sentencepiece |
| Interchangeable? | ✗ No | ✗ No |
Rule: always use the tokenizer that was used to train the base model. Check the model card on HuggingFace Hub.
Common Mistakes
Mixing tokenizers between training and inference
Training with tiktoken cl100k_base and running inference with a LLaMA sentencepiece tokenizer produces garbage — the integer IDs map to completely different tokens. Always load the tokenizer from the same model checkpoint used for training.
Not accounting for special tokens in sequence length
A 2048-token context window is consumed by both content tokens AND special tokens (system prompt, <|im_start|>, <|im_end|>). If your training sequences are packed to exactly 2048 without leaving room for inference-time system prompt tokens, the model never sees your prompt format during training.
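A back-of-the-envelope budget makes the problem concrete. All counts below are illustrative assumptions, not measured values — measure your own template with the real tokenizer:

```python
context_len = 2048

# Assumed inference-time overhead for a 3-message ChatML-style conversation:
chat_wrapper = 3 * 4   # ~4 tokens per message for <|im_start|>, role, <|im_end|>
system_prompt = 30     # assumed system-prompt length in tokens

overhead = chat_wrapper + system_prompt
content_budget = context_len - overhead
print(content_budget)  # 2006 — pack training sequences to this, not to 2048
```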
Counting tokens by word count or character count
Token counts and word/character counts diverge significantly for code, non-English text, and URLs. Always measure dataset size in tokens using the actual tokenizer — not word count. A 1GB text file can be anywhere from 100M to 300M+ tokens depending on content.
FAQ
- What is tokenization in LLMs?
- Tokenization converts text into integer token IDs a model can process. A tokenizer splits text into subword units using BPE, maps each to an integer from a fixed vocabulary, and outputs sequences for training or inference.
- What is Byte-Pair Encoding (BPE)?
- BPE starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token. Common words become single tokens; rare words decompose into subword units. Used by GPT-4 (tiktoken) and LLaMA (sentencepiece).
- What is sequence packing?
- Packing concatenates multiple short documents into one full context-length sequence to eliminate zero-padding. Improves training throughput 1.5–2× on short-document corpora.
- What is the difference between tiktoken and sentencepiece?
- tiktoken is OpenAI's BPE tokenizer (GPT models). sentencepiece is Google's tokenizer (LLaMA, Gemma, T5). They are not interchangeable — always use the tokenizer matching your base model.