RAG Chunking Explained: What It Is and How It Works
Chunking splits source documents into smaller text segments before embedding. Each chunk becomes a retrievable unit in the vector database. Chunk size and strategy are the single biggest lever for RAG retrieval quality: chunks that are too large dilute relevance, while chunks that are too small lose sentence context. The sweet spot for most corpora is 256–512 tokens with 10–20% overlap.
The Chunk Size Trade-off
Too small (< 128 tokens):
[chunk: "The policy"] [chunk: "states that"] [chunk: "refunds are"]
→ Fragments sentence context → low generation quality
Sweet spot (256-512 tokens):
[chunk: "The refund policy states that all purchases are eligible
for a 30-day refund if returned in original condition."]
→ Complete thought, precise retrieval ✓
Too large (> 1024 tokens):
[chunk: "Section 3. Returns and Exchanges. 3.1 General Policy...
3.2 International Orders... 3.3 Damaged Items..."]
→ Multiple topics → diluted embedding → noisy retrieval
Chunking Strategies
Fixed-Size
Fastest & Simplest
Split every N tokens with overlap. Fast, predictable, easy to tune. Can split mid-sentence. Good default for most use cases.
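A minimal sketch of the idea, using whitespace splitting as a stand-in for a real tokenizer (such as tiktoken):

```python
def fixed_size_chunks(text, chunk_size=512, overlap=64):
    """Split text into chunks of chunk_size tokens, with overlap tokens shared between neighbours."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With chunk_size=512 and overlap=64, each chunk repeats the last 64 tokens of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.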
Recursive
Best Default
Splits on paragraphs → sentences → words. Respects natural document structure. LangChain's RecursiveCharacterTextSplitter implements this.
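A simplified sketch of the recursive idea (the real RecursiveCharacterTextSplitter also merges with configurable length functions and finer fallbacks):

```python
def recursive_split(text, max_len=400, separators=("\n\n", ". ", " ")):
    """Recursively split on the coarsest separator that yields pieces under max_len."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # no separator left: hard-split at max_len
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # piece itself is too long: recurse with finer separators
                chunks.extend(recursive_split(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are tried before sentences, and sentences before words, chunks tend to end at natural boundaries instead of mid-sentence.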
Semantic
Highest Quality
Embeds sentences and splits where embedding cosine distance jumps (topic change). Slower but produces the most coherent, topically consistent chunks.
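A minimal sketch of the boundary-detection step. It assumes you have already embedded each sentence with some embedding model; the threshold value is illustrative, not a fixed API:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; start a new chunk when the distance between adjacent sentence embeddings jumps above threshold."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append([])  # distance jump = likely topic change
        chunks[-1].append(sentences[i])
    return [" ".join(c) for c in chunks]
```

In practice the threshold is often set from a percentile of the observed distances rather than a fixed constant.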
Strategy Comparison
| Strategy | Speed | Quality | Use when |
|---|---|---|---|
| Fixed-size | Fast | Good | Prototyping, homogeneous docs |
| Recursive character | Fast | Better | Most production cases |
| Semantic | Slow | Best | High-stakes, heterogeneous docs |
| Parent-document | Medium | Best for generation | When context window matters |
| Markdown/HTML aware | Fast | Better | Structured web/wiki content |
Parent-Document Chunking
Index small chunks for precise retrieval, but return the full parent paragraph (or page) to the LLM for generation. Gives the precision of small-chunk retrieval with the context richness of large-chunk generation.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Small chunks for retrieval (~128 tokens; chunk_size counts characters by
# default, so use .from_tiktoken_encoder for token-based sizing)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=128)
# Large chunks returned to the LLM (~512 tokens)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=512)
retriever = ParentDocumentRetriever(
    vectorstore=vectordb,  # any initialized LangChain vector store
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
Common Mistakes
Using one chunk size for all document types
PDFs, markdown files, HTML pages, and code files all have different structure. Use a document-type-aware splitter (MarkdownTextSplitter, HTMLHeaderTextSplitter) for non-prose content.
Not using overlap
Without overlap, answers that span a chunk boundary get split across two chunks — neither contains the full answer. Use 10–20% overlap (e.g. 64 tokens overlap on a 512-token chunk).
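A toy illustration of the failure mode, assuming whitespace tokens and a chunk size of 4:

```python
tokens = "refunds are allowed within 30 days of purchase".split()

# No overlap: the fact "within 30 days" is split across the boundary,
# so neither chunk contains the full answer
no_overlap = [tokens[0:4], tokens[4:8]]

# 1-token (25%) overlap: boundary tokens appear in two chunks,
# so "within 30 days" survives intact in the middle chunk
step = 3  # chunk_size 4 - overlap 1
with_overlap = [tokens[i:i + 4] for i in range(0, len(tokens), step)]
```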
Never benchmarking chunk strategy
Most teams pick a chunk size and never measure if it is optimal. Track hit rate (was the correct chunk in top-K?) on a golden eval set. Even a 10-question eval catches obvious problems.
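Hit rate takes only a few lines to measure. Here `retrieve` is a stand-in for your retriever, and each eval item pairs a question with the id of the chunk that contains the answer (both names are illustrative):

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k retrieved ids."""
    hits = 0
    for question, gold_chunk_id in eval_set:
        if gold_chunk_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(eval_set)
```

Rerun this after every chunking change; a drop in hit rate tells you the new strategy is burying the right chunks before the LLM ever sees them.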
Chunking code the same as prose
Code has function and class boundaries that are semantically meaningful. Use code-aware chunking (split on function definitions, not characters) to preserve code context.
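For Python sources, one sketch is to split on top-level definitions with the standard ast module (a simplification: production splitters also handle nested defs, module-level code, and oversized functions):

```python
import ast

def chunk_python_source(source):
    """Split Python source into one chunk per top-level function/class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based, inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```

Each chunk is now a complete, syntactically valid unit, so its embedding represents one function's behavior instead of two half-functions.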
FAQ
- What is chunking in RAG?
- Chunking splits source documents into smaller text segments before embedding. Each chunk becomes a retrievable unit in the vector database. Chunk size and strategy directly determine retrieval quality.
- What is the best chunk size for RAG?
- 256–512 tokens with 10–20% overlap is a good default. Smaller chunks improve precision for factual Q&A; larger chunks preserve context for complex reasoning. Always benchmark on your corpus.
- What is the difference between fixed-size and semantic chunking?
- Fixed-size chunking splits at a fixed token count — fast but can split mid-sentence. Semantic chunking detects topic boundaries using embeddings and splits at natural breaks — slower but produces more coherent chunks.
- What is parent-document chunking in RAG?
- Parent-document chunking indexes small chunks for precise retrieval but returns the larger parent segment to the LLM. This combines the precision of small-chunk retrieval with the context richness of large-chunk generation.