ADR-005: Fixed-size chunking (DEPRECATED) | Enterprise RAG

Context

When the M01 ingestion pipeline first shipped, the chunker was a single fixed-size strategy: split every document into 1000-character windows with 200-character overlap. The rationale was reasonable on paper:

Mechanical and predictable. No surprises, no edge cases — the chunker emits the same output for the same input every time.
Cheap. Zero embedding calls at ingest time, zero per-chunker tuning, zero per-document policy.
The default in many tutorials. LangChain's CharacterTextSplitter, LlamaIndex's TokenTextSplitter, every "build a RAG in 5 lines" demo ships with fixed-size as the starter. The pattern is familiar.
Future-proof framing. "We can always swap chunkers later" was the plan — and we did, but later turned out to be 11 days later.

This held through ingestion testing on short_faq.md and structured_pricing.md. Both have short, regular content; fixed-size at 1000 chars happens to align with paragraph boundaries by accident on those two files. Smoke tests passed, retrieval looked fine.

What changed

We added two more documents to the seed corpus: long_handbook.md (a 12-page employee-handbook style document) and noisy_release_notes.md (a release-notes file with mid-paragraph version numbers, code blocks, and inline tables). Built a 40-question golden set across all 4 documents. Ran the chunking A/B benchmark.

The fixed-size results were a regression we couldn't ignore:

Strategy	Retrieval accuracy@10
Fixed-size (1000 chars, 200 overlap)	62%
Paragraph (\n\n split)	71% (+9pp)
Recursive (paragraph → sentence → word)	78% (+16pp)
Semantic (embedding-boundary detection)	85% (+23pp)

The 16pp gap between fixed-size and recursive isn't a tuning artefact — it's the difference between cutting mid-sentence (which fixed-size does unconditionally) and respecting natural boundaries (which recursive does by design). On long_handbook.md specifically, fixed-size cut the answer to "what is the parental leave policy" right through the middle of the sentence that defined it; the chunk that contained the answer was not the chunk that ranked highest.

That's the regression that killed this ADR.

What we got wrong (and what we'd do again)

Wrong: picking the most familiar default rather than the most correct one. Fixed-size feels sensible because every tutorial uses it. Tutorials use it because it's easy to explain, not because it's the correct production default. The cost of the familiarity bias was 16pp of retrieval accuracy and a re-ingestion cycle.

Wrong: under-testing the chunker against varied document shapes. Two short, regular files don't cover the cases where chunking matters. The lesson: build the chunking benchmark first, ship the chunker second. The 4-document seed corpus + 40-question golden set is now the gate, not the post-hoc validation.

Right: building the chunking infrastructure as a strategy pattern from day 1. The ChunkingStrategy ABC and ChunkingFactory shipped in M01; swapping out fixed-size for recursive was a 1-line change in the factory default, not a rewrite. If we'd hardcoded fixed-size into a single function, the reversal would have cost a week instead of a day.

Right: reversing fast. We caught the regression on the chunking A/B within a sprint, deprecated the ADR within 11 days, re-ingested the seed corpus on recursive within a day. The cost of the wrong default was bounded by how quickly we could prove it was wrong.

How we reversed it

The reversal was a one-day cycle:

Switch the default. ChunkingFactory.from_metadata returned RecursiveChunking(chunk_size=1000, chunk_overlap=200) instead of FixedSizeChunking(...) (single line).
Re-ingest seed corpus. The streaming embedder fan-out (ADR-001) made re-ingestion idempotent — kick off the re-ingest, wait for the dead-letter queue to drain, verify the new chunk count matches the expected ~80 chunks across 4 docs.
Re-run the A/B benchmark. Recursive at 78% vs the old 62% confirmed in the canary RAGAS run. Faithfulness floor (>0.85 per sla_definition.py) cleared.

The FixedSizeChunking class is still in the codebase. We didn't delete it — it's a useful comparison point, an explicit anti-pattern reference for new contributors who otherwise inherit the "just use 1000 chars" muscle memory.

Why we reversed it (in one sentence)

Fixed-size chunking cuts mid-sentence by design and dropped retrieval accuracy 16pp below recursive on a 4-document benchmark; the ChunkingFactory pattern made the reversal a 1-line default-swap plus a one-day re-ingest.

What this ADR replaces

The original M01 design assumed FixedSizeChunking was the production default. ADR-003 (recursive default + semantic upgrade path) supersedes that.
The fixed-size class is preserved as an anti-pattern reference, not as a callable default. Any future call site that needs it must request it by name.

References

backend/app/services/chunking.py — FixedSizeChunking class (kept)
backend/app/services/chunking.py::ChunkingFactory — the factory whose default reverted from fixed_size to recursive
data/sample_docs/long_handbook.md — the document whose answer was getting cut mid-sentence
eval/approach_decision.md — chunking A/B numbers (62 / 71 / 78 / 85)
scripts/ragas_eval.py — the canary that gated the re-ingest
ADR-003 (live design)
ADR-001 (the hybrid retriever the chunks feed into)