RAG Chunking Explained: What It Is and How It Works

Chunking splits source documents into smaller text segments before embedding. Each chunk becomes a retrievable unit in the vector database. Chunk size and strategy are the single biggest lever for RAG retrieval quality: chunks that are too large dilute relevance, while chunks that are too small lose sentence context. A good default is 256–512 tokens with 10–20% overlap.

The Chunk Size Trade-off

Too small (< 128 tokens):
  [chunk: "The policy"] [chunk: "states that"] [chunk: "refunds are"]
  → Fragments sentence context → low generation quality

Sweet spot (256-512 tokens):
  [chunk: "The refund policy states that all purchases are eligible
    for a 30-day refund if returned in original condition."]
  → Complete thought, precise retrieval ✓

Too large (> 1024 tokens):
  [chunk: "Section 3. Returns and Exchanges. 3.1 General Policy...
    3.2 International Orders... 3.3 Damaged Items..."]
  → Multiple topics → diluted embedding → noisy retrieval

Chunking Strategies

Fixed-Size

Fastest & Simplest

Split every N tokens with a sliding overlap. Fast, predictable, and easy to tune, but can split mid-sentence. Good for prototyping and homogeneous documents.
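A minimal sketch of fixed-size splitting, using whitespace words as a stand-in for tokenizer tokens (a real pipeline would count tokens with the embedding model's tokenizer, e.g. tiktoken):

```python
def fixed_size_chunks(text, chunk_size=256, overlap=32):
    """Split text every `chunk_size` tokens, overlapping neighbours by
    `overlap` tokens. Whitespace words stand in for real tokens here."""
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), step)]

doc = "word " * 600
chunks = fixed_size_chunks(doc, chunk_size=256, overlap=32)
print(len(chunks))  # 600 tokens, stride 224 -> 3 chunks
```

The stride is `chunk_size - overlap`, so each chunk repeats the last `overlap` tokens of its predecessor.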

Recursive

Best Default

Splits on paragraphs → sentences → words. Respects natural document structure. LangChain's RecursiveCharacterTextSplitter implements this.
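The idea can be sketched in a few lines. This is a simplified version: LangChain's splitter additionally merges small adjacent pieces back together up to the chunk size, which this sketch omits.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present; recurse on any piece
    that is still longer than `max_len`."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            out = []
            for piece in text.split(sep):
                out.extend(recursive_split(piece, max_len, separators))
            return out
    # No separator left: fall back to a hard character cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

doc = ("Refund policy.\n\n"
       "All purchases are eligible for a 30-day refund. "
       "Items must be in original condition.\n\n"
       "Contact support for exchanges.")
print(len(recursive_split(doc, max_len=80)))  # 4: paragraphs first, then sentences
```

The short paragraphs survive whole; only the long middle paragraph is further split at a sentence boundary.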

Semantic

Highest Quality

Embeds sentences and splits where embedding cosine distance jumps (topic change). Slower but produces the most coherent, topically consistent chunks.
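A toy sketch of the idea, using a bag-of-words embedding as a stand-in for a real sentence-embedding model (in practice you would use something like sentence-transformers):

```python
import numpy as np

def bow_embed(sentence, vocab):
    """Toy bag-of-words embedding, normalized so the dot product is
    cosine similarity. A real system would use a sentence embedding model."""
    v = np.zeros(len(vocab))
    for word in sentence.lower().split():
        if word in vocab:
            v[vocab[word]] += 1.0
    return v / np.linalg.norm(v)

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk wherever cosine
    similarity between neighbours drops below `threshold`."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if float(embed(prev) @ embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "the refund policy applies to all purchases",
    "the refund policy also covers all purchases",
    "international shipping takes five business days",
    "international shipping usually takes five days",
]
vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s.split()}))}
chunks = semantic_chunks(sentences, lambda s: bow_embed(s, vocab))
print(len(chunks))  # 2: the refund sentences and the shipping sentences
```

The similarity drop between the second and third sentences (no shared vocabulary) is where the topic changes, so the split lands exactly there.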

Strategy Comparison

Strategy              Speed    Quality              Use when
Fixed-size            Fast     Good                 Prototyping, homogeneous docs
Recursive character   Fast     Better               Most production cases
Semantic              Slow     Best                 High-stakes, heterogeneous docs
Parent-document       Medium   Best for generation  When context window matters
Markdown/HTML aware   Fast     Better               Structured web/wiki content

Parent-Document Chunking

Index small chunks for precise retrieval, but return the full parent paragraph (or page) to the LLM for generation. Gives the precision of small-chunk retrieval with the context richness of large-chunk generation.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for retrieval (128 tokens)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=128)
# Large chunks returned to LLM (512 tokens)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=512)

retriever = ParentDocumentRetriever(
    vectorstore=vectordb,  # an existing vector store (e.g. Chroma, FAISS)
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

Common Mistakes

Using one chunk size for all document types

PDFs, markdown files, HTML pages, and code files all have different structure. Use a document-type-aware splitter (MarkdownTextSplitter, HTMLHeaderTextSplitter) for non-prose content.

Not using overlap

Without overlap, answers that span a chunk boundary get split across two chunks — neither contains the full answer. Use 10–20% overlap (e.g. 64 tokens overlap on a 512-token chunk).
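The failure mode is easy to demonstrate: below, an answer phrase spans the boundary between two chunks, and only the overlapping split keeps it intact (whitespace words stand in for tokens):

```python
def chunk_words(words, size, overlap):
    """Split a word list into chunks of `size` words, stepping by size - overlap."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

words = ("the policy states that all purchases are eligible "
         "for a 30-day refund if returned in original condition").split()
answer = "eligible for a 30-day refund"  # spans the boundary at word 8

no_overlap = chunk_words(words, size=8, overlap=0)
with_overlap = chunk_words(words, size=8, overlap=4)

print(any(answer in c for c in no_overlap))    # False: answer split across chunks
print(any(answer in c for c in with_overlap))  # True: overlap keeps it whole
```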

Never benchmarking chunk strategy

Most teams pick a chunk size and never measure if it is optimal. Track hit rate (was the correct chunk in top-K?) on a golden eval set. Even a 10-question eval catches obvious problems.
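A hit-rate check needs only a few lines. `retrieve` here is a toy stand-in for your vector store's search, which would return ranked chunk ids for a query:

```python
def hit_rate(eval_set, retrieve, k=5):
    """eval_set: list of (question, gold_chunk_id) pairs.
    retrieve(question) returns a ranked list of chunk ids.
    Reports the fraction of questions whose gold chunk is in the top-k."""
    hits = sum(gold in retrieve(q)[:k] for q, gold in eval_set)
    return hits / len(eval_set)

# Toy retriever standing in for real vector search:
index = {"refund": ["c1", "c7"], "shipping": ["c3", "c2"]}
retrieve = lambda q: next(ids for key, ids in index.items() if key in q)

golden = [("what is the refund window", "c1"),
          ("how long does shipping take", "c3"),
          ("refund for damaged items", "c9")]
print(hit_rate(golden, retrieve, k=2))  # 2/3: the c9 question misses
```

Rerun the same golden set after every chunking change; a drop in hit rate flags a regression before it reaches users.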

Chunking code the same as prose

Code has function and class boundaries that are semantically meaningful. Use code-aware chunking (split on function definitions, not characters) to preserve code context.
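For Python source, the standard-library `ast` module exposes function and class boundaries directly, so chunks never cut through a definition. A minimal sketch:

```python
import ast

def code_chunks(source):
    """Split Python source into one chunk per top-level function or class,
    using the AST so a chunk never cuts through a definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    return ["\n".join(lines[node.lineno - 1:node.end_lineno])
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]

src = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
chunks = code_chunks(src)
print(len(chunks))  # 2: one chunk for add(), one for Greeter
```

The same idea extends to other languages via tree-sitter or language-specific parsers.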

FAQ

What is chunking in RAG?
Chunking splits source documents into smaller text segments before embedding. Each chunk becomes a retrievable unit in the vector database. Chunk size and strategy directly determine retrieval quality.
What is the best chunk size for RAG?
256–512 tokens with 10–20% overlap is a good default. Smaller chunks improve precision for factual Q&A; larger chunks preserve context for complex reasoning. Always benchmark on your corpus.
What is the difference between fixed-size and semantic chunking?
Fixed-size chunking splits at a fixed token count — fast but can split mid-sentence. Semantic chunking detects topic boundaries using embeddings and splits at natural breaks — slower but produces more coherent chunks.
What is parent-document chunking in RAG?
Parent-document chunking indexes small chunks for precise retrieval but returns the larger parent segment to the LLM. This combines the precision of small-chunk retrieval with the context richness of large-chunk generation.
