Context
When the M01 ingestion pipeline first shipped, the chunker was a single fixed-size strategy: split every document into 1000-character windows with 200-character overlap. The rationale was reasonable on paper:
- Mechanical and predictable. No surprises, no edge cases — the chunker emits the same output for the same input every time.
- Cheap. Zero embedding calls at ingest time, zero per-chunker tuning, zero per-document policy.
- The default in many tutorials. LangChain's
CharacterTextSplitter, LlamaIndex'sTokenTextSplitter, every "build a RAG in 5 lines" demo ships with fixed-size as the starter. The pattern is familiar. - Future-proof framing. "We can always swap chunkers later" was the plan — and we did, but later turned out to be 11 days later.
This held through ingestion testing on short_faq.md and
structured_pricing.md. Both have short, regular content; fixed-size at
1000 chars happens to align with paragraph boundaries by accident on
those two files. Smoke tests passed, retrieval looked fine.
What changed
We added two more documents to the seed corpus: long_handbook.md (a
12-page employee-handbook style document) and noisy_release_notes.md (a
release-notes file with mid-paragraph version numbers, code blocks, and
inline tables). Built a 40-question golden set across all 4 documents.
Ran the chunking A/B benchmark.
The fixed-size results were a regression we couldn't ignore:
| Strategy | Retrieval accuracy@10 |
|---|---|
| Fixed-size (1000 chars, 200 overlap) | 62% |
| Paragraph (\n\n split) | 71% (+9pp) |
| Recursive (paragraph → sentence → word) | 78% (+16pp) |
| Semantic (embedding-boundary detection) | 85% (+23pp) |
The 16pp gap between fixed-size and recursive isn't a tuning artefact —
it's the difference between cutting mid-sentence (which fixed-size does
unconditionally) and respecting natural boundaries (which recursive does
by design). On long_handbook.md specifically, fixed-size cut the answer
to "what is the parental leave policy" right through the middle of the
sentence that defined it; the chunk that contained the answer was not the
chunk that ranked highest.
That's the regression that killed this ADR.
What we got wrong (and what we'd do again)
Wrong: picking the most familiar default rather than the most correct one. Fixed-size feels sensible because every tutorial uses it. Tutorials use it because it's easy to explain, not because it's the correct production default. The cost of the familiarity bias was 16pp of retrieval accuracy and a re-ingestion cycle.
Wrong: under-testing the chunker against varied document shapes. Two short, regular files don't cover the cases where chunking matters. The lesson: build the chunking benchmark first, ship the chunker second. The 4-document seed corpus + 40-question golden set is now the gate, not the post-hoc validation.
Right: building the chunking infrastructure as a strategy pattern from
day 1. The ChunkingStrategy ABC and ChunkingFactory shipped in M01;
swapping out fixed-size for recursive was a 1-line change in the factory
default, not a rewrite. If we'd hardcoded fixed-size into a single
function, the reversal would have cost a week instead of a day.
Right: reversing fast. We caught the regression on the chunking A/B within a sprint, deprecated the ADR within 11 days, re-ingested the seed corpus on recursive within a day. The cost of the wrong default was bounded by how quickly we could prove it was wrong.
How we reversed it
The reversal was a one-day cycle:
- Switch the default.
ChunkingFactory.from_metadatareturnedRecursiveChunking(chunk_size=1000, chunk_overlap=200)instead ofFixedSizeChunking(...)(single line). - Re-ingest seed corpus. The streaming embedder fan-out (ADR-001) made re-ingestion idempotent — kick off the re-ingest, wait for the dead-letter queue to drain, verify the new chunk count matches the expected ~80 chunks across 4 docs.
- Re-run the A/B benchmark. Recursive at 78% vs the old 62%
confirmed in the canary RAGAS run. Faithfulness floor (
>0.85persla_definition.py) cleared.
The FixedSizeChunking class is still in the codebase. We didn't delete
it — it's a useful comparison point, an explicit anti-pattern reference
for new contributors who otherwise inherit the "just use 1000 chars"
muscle memory.
Why we reversed it (in one sentence)
Fixed-size chunking cuts mid-sentence by design and dropped retrieval accuracy 16pp below recursive on a 4-document benchmark; the ChunkingFactory pattern made the reversal a 1-line default-swap plus a one-day re-ingest.
What this ADR replaces
- The original M01 design assumed
FixedSizeChunkingwas the production default. ADR-003 (recursive default + semantic upgrade path) supersedes that. - The fixed-size class is preserved as an anti-pattern reference, not as a callable default. Any future call site that needs it must request it by name.
References
backend/app/services/chunking.py—FixedSizeChunkingclass (kept)backend/app/services/chunking.py::ChunkingFactory— the factory whose default reverted fromfixed_sizetorecursivedata/sample_docs/long_handbook.md— the document whose answer was getting cut mid-sentenceeval/approach_decision.md— chunking A/B numbers (62 / 71 / 78 / 85)scripts/ragas_eval.py— the canary that gated the re-ingest- ADR-003 (live design)
- ADR-001 (the hybrid retriever the chunks feed into)