LLM Tokenization Explained
Tokenization converts raw text into integer token IDs that a language model can process. A tokenizer splits text into subword units (tokens), maps each to an integer from a fixed vocabulary, and outputs sequences the model trains on. The tokenizer used during training must exactly match the tokenizer used at inference — mixing them corrupts model output silently.
Tokenization in Practice
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization converts text into integers."
tokens = enc.encode(text)
# tokens is a list of 7 integer IDs, one per subword piece
# (exact IDs depend on the vocabulary)
# pieces ≈ ["Token", "ization", " converts", " text", " into", " integers", "."]
print(f'{len(text)} chars → {len(tokens)} tokens')
# 41 chars → 7 tokens (rule of thumb: 1 token ≈ 0.75 English words)
# Decode back to text
decoded = enc.decode(tokens)
assert decoded == text # lossless round-trip
Core Concepts
Byte-Pair Encoding (BPE)
BPE starts with individual bytes, then iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches its target size (typically 32K–128K tokens). Common words become single tokens; rare words decompose into subword units. Because single bytes form the base vocabulary, BPE can represent any language, code, or Unicode text with no out-of-vocabulary failures.
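The merge loop above can be sketched in a few lines of plain Python. This is a toy character-level trainer on a made-up corpus, not the byte-level production algorithm (function name, corpus, and merge count are illustrative):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE trainer: each word starts as a sequence of characters."""
    words = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for seq, freq in words.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_words = Counter()
        for seq, freq in words.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, words

merges, vocab = bpe_train(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Note how the frequent stem "low" becomes a single symbol after two merges, while the rarer suffixes stay decomposed — exactly the common-word/rare-word behavior described above.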
Vocabulary and Token IDs
A tokenizer vocabulary maps every token string to a unique integer. Special tokens are added for structural purposes: <|endoftext|> marks document boundaries, <|pad|> fills shorter sequences to a fixed length, and instruction-tuned models add <|im_start|> / <|im_end|> for message formatting.
Sequence Packing
Most training documents are shorter than the model context window (2048–8192 tokens). Naive batching pads short documents with zeros — wasting GPU compute. Sequence packing concatenates multiple documents into one full-length sequence separated by <|endoftext|> tokens. This eliminates padding and can double training throughput on short-document corpora.
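A minimal sketch of packing (the function name and toy token IDs are illustrative; real pipelines also carry the leftover tail into the next batch rather than dropping it):

```python
def pack_sequences(tokenized_docs, context_len, eot_id):
    """Concatenate docs separated by eot_id, then cut into fixed-length rows."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eot_id)  # document boundary marker
    # Emit only full-length rows; the ragged tail is dropped in this sketch.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]

rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]],
                      context_len=4, eot_id=0)
print(rows)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that a document can straddle a row boundary (the second row ends with the start of the third document); training setups often reset positions or mask attention across the <|endoftext|> separator so tokens don't attend into a neighboring document.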
tiktoken vs sentencepiece
| Property | tiktoken | sentencepiece |
|---|---|---|
| Algorithm | BPE with regex pre-tokenizer | BPE or Unigram LM |
| Developer | OpenAI | Google |
| Used by | GPT-3.5, GPT-4, o1 | LLaMA, Gemma, T5, Mistral |
| Speed | Very fast (Rust impl) | Fast (C++ impl) |
| Unicode handling | Byte-level BPE (bytes are the base vocabulary) | Optional byte fallback |
| Python install | pip install tiktoken | pip install sentencepiece |
| Interchangeable? | ✗ No | ✗ No |
Rule: always use the tokenizer that was used to train the base model. Check the model card on HuggingFace Hub.
Common Mistakes
Mixing tokenizers between training and inference
Training with tiktoken cl100k_base and running inference with a LLaMA sentencepiece tokenizer produces garbage — the integer IDs map to completely different tokens. Always load the tokenizer from the same model checkpoint used for training.
Not accounting for special tokens in sequence length
A 2048-token context window is consumed by both content tokens AND special tokens (system prompt, <|im_start|>, <|im_end|>). If your training sequences are packed to exactly 2048 without leaving room for inference-time system prompt tokens, the model never sees your prompt format during training.
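A back-of-the-envelope budget makes the problem concrete. All counts below are illustrative assumptions, not measured values — measure your own template with the real tokenizer:

```python
context_len = 2048

# Assumed inference-time overhead for a 3-message ChatML-style conversation:
chat_wrapper = 3 * 4   # ~4 tokens per message for <|im_start|>, role, <|im_end|>
system_prompt = 30     # assumed system-prompt length in tokens

overhead = chat_wrapper + system_prompt
content_budget = context_len - overhead
print(content_budget)  # 2006 — pack training sequences to this, not to 2048
```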
Counting tokens by word count or character count
Token counts and word/character counts diverge significantly for code, non-English text, and URLs. Always measure dataset size in tokens using the actual tokenizer — not word count. A 1GB text file can be anywhere from 100M to 300M+ tokens depending on content.
FAQ
- What is tokenization in LLMs?
- Tokenization converts text into integer token IDs a model can process. A tokenizer splits text into subword units using BPE, maps each to an integer from a fixed vocabulary, and outputs sequences for training or inference.
- What is Byte-Pair Encoding (BPE)?
- BPE starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token. Common words become single tokens; rare words decompose into subword units. Used by GPT-4 (tiktoken) and LLaMA (sentencepiece).
- What is sequence packing?
- Packing concatenates multiple short documents into one full context-length sequence to eliminate zero-padding. Improves training throughput 1.5–2× on short-document corpora.
- What is the difference between tiktoken and sentencepiece?
- tiktoken is OpenAI's BPE tokenizer (GPT models). sentencepiece is Google's tokenizer (LLaMA, Gemma, T5). They are not interchangeable — always use the tokenizer matching your base model.