Context
Chunking is the most consequential decision in a RAG system. The chunk you
emit at ingest time is the chunk the LLM sees at query time — there is no
recovery from a bad cut. We benchmarked 4 strategies on the seed corpus
(long_handbook.md, short_faq.md, structured_pricing.md,
noisy_release_notes.md) using a hand-built golden set of 40
question/answer pairs:
| Strategy | Retrieval accuracy@10 | Cost per doc | Notes |
|---|---|---|---|
| Fixed-size (1000 chars, 200 overlap) | 62% | $0 (mechanical) | Cuts mid-sentence; mid-paragraph; mid-table |
| Recursive (paragraph → sentence → word) | 78% | $0 (mechanical) | Respects natural boundaries; ~16pp lift |
| Semantic (embedding-similarity boundaries) | 85% | ~$0.01/doc | Best accuracy; pays per ingest |
Paragraph (\n\n split) | 71% | $0 (mechanical) | Works on Markdown; collapses on PDF/DOCX |
Two forces in tension:
- Quality: higher accuracy is monotonic with how well the chunker respects the document's meaning — paragraphs > sentences > characters. Semantic chunking is the upper bound because it respects topic shifts, not just punctuation.
- Cost at ingest: semantic chunking pays an embedding call per ~100 sentences. At 10k docs/day that's ~$10/day in embeddings just for ingest, before indexing. Many docs (release notes, FAQs, user posts) don't justify the cost.
A single-strategy choice forces an either-or: pay for quality on every doc, or save money and accept lower retrieval accuracy on the documents that need it most.
Decision
We adopt recursive chunking as the default, with semantic chunking opt-in via metadata for documents that justify the per-doc cost.
# backend/app/services/chunking.py
class ChunkingFactory:
@staticmethod
def from_metadata(meta: DocumentMetadata) -> ChunkingStrategy:
# High-value docs (legal, medical, contracts) opt into semantic
if meta.tags & {"legal", "medical", "contract", "compliance"}:
return SemanticChunking(threshold=0.75)
# Markdown with clean paragraph structure → paragraph
if meta.file_type == "markdown" and meta.has_clean_paragraphs:
return ParagraphChunking()
# Default: recursive (paragraph → sentence → word)
return RecursiveChunking(chunk_size=1000, chunk_overlap=200)
The four strategies remain in the codebase. The factory is the policy; adding a strategy or changing the policy doesn't require touching every call site.
Why not fixed-size as the default? ADR-005 (Deprecated) covers the historical answer: we shipped fixed-size in v1, the 62% retrieval accuracy benchmark on the chunking-A/B bench is what killed it.
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Default quality floor | One strategy everywhere | Recursive default + semantic upgrade path |
| Ingest cost | Single fixed cost | Variable: $0 default + ~$0.01/high-value doc |
| Operational complexity | One pipeline | Factory + tag policy + opt-in semantic |
| Calibration | None | Threshold + tag list + chunk-size tunables |
| New-format support | Add a parser | Add a parser + a chunking-strategy mapping |
We accept the operational complexity (a factory, a tag policy, two threshold knobs) because the alternative (semantic-everywhere = ~$30k/yr in embedding bills at our reference scenario, or fixed-size-everywhere = 16pp accuracy regression) is worse on either margin.
Consequences (positive)
- Headline retrieval accuracy stays at 78% out of the box — recursive is the default, so a learner with no tags gets the +16pp lift over fixed-size automatically.
- Compliance/legal/medical docs hit 85% without a global cost spike — semantic is opt-in, paid on the documents where the lift matters.
- Markdown gets paragraph chunking, which honors the author's intentional structure — release notes, blog posts, ADRs themselves get a third strategy without code changes.
- Factory pattern is testable.
ChunkingFactory.from_metadatais a pure function overDocumentMetadata; we test the policy directly. - The 4-strategy A/B is part of the ingestion narrative. The 62/78/85 numbers anchor the M01 lesson and recur in the ADR-005 deprecation rationale.
Consequences (negative)
- Tag policy drift.
legal/medical/contract/compliancetags must be applied at ingest. If teams forget to tag, semantic doesn't fire for them. Mitigated with a default-tag rule on known sources (e.g. ingestion from the legal SharePoint always carrieslegal). - Semantic threshold drift. 0.75 calibrated against the seed corpus. The runbook entry "recalibrate when retrieval accuracy regresses > 5pp" documents the procedure.
- Mixed-strategy debugging. The reranker (ADR-002) sees chunks of different sizes from different strategies; dense and BM25 indexes need to handle the variance. We accept; the index already handles variable-length chunks.
Reversal plan
If retrieval-accuracy regression on the canary set drops below 70% (i.e. the strategy mix is no longer earning its complexity), reversal is:
- Set
CHUNKING_FACTORY_MODE=recursive_onlyin.env. ChunkingFactory.from_metadatashort-circuits to recursive.- Re-ingest affected high-value documents on the new strategy.
- Document the reversal in
runbook/failure_modes.mdand re-run RAGAS.
Estimated effort: ~3 engineer-days for the cutover; re-ingest depends on corpus size.
References
backend/app/services/chunking.py—ChunkingStrategyABC + 4 implementations +ChunkingFactorybackend/app/services/document_parser.py— parser → chunker handoff with metadatabackend/app/routes/documents.py— ingestion API + tag policydata/sample_docs/— the 4-doc seed corpus the benchmarks ran against- ADR-005 (the Deprecated fixed-size predecessor)
eval/approach_decision.md— when-to-use decision tree