Skip to content
Back to Enterprise RAG

Recursive chunking as default; semantic for high-value docs

✓ AcceptedEnterprise RAG01 — Document Ingestion Pipeline
By AI-DE Engineering Team·Stakeholders: retrieval engineer, ingestion lead, cost owner

Context

Chunking is the most consequential decision in a RAG system. The chunk you emit at ingest time is the chunk the LLM sees at query time — there is no recovery from a bad cut. We benchmarked 4 strategies on the seed corpus (long_handbook.md, short_faq.md, structured_pricing.md, noisy_release_notes.md) using a hand-built golden set of 40 question/answer pairs:

StrategyRetrieval accuracy@10Cost per docNotes
Fixed-size (1000 chars, 200 overlap)62%$0 (mechanical)Cuts mid-sentence; mid-paragraph; mid-table
Recursive (paragraph → sentence → word)78%$0 (mechanical)Respects natural boundaries; ~16pp lift
Semantic (embedding-similarity boundaries)85%~$0.01/docBest accuracy; pays per ingest
Paragraph (\n\n split)71%$0 (mechanical)Works on Markdown; collapses on PDF/DOCX

Two forces in tension:

  1. Quality: higher accuracy is monotonic with how well the chunker respects the document's meaning — paragraphs > sentences > characters. Semantic chunking is the upper bound because it respects topic shifts, not just punctuation.
  2. Cost at ingest: semantic chunking pays an embedding call per ~100 sentences. At 10k docs/day that's ~$10/day in embeddings just for ingest, before indexing. Many docs (release notes, FAQs, user posts) don't justify the cost.

A single-strategy choice forces an either-or: pay for quality on every doc, or save money and accept lower retrieval accuracy on the documents that need it most.

Decision

We adopt recursive chunking as the default, with semantic chunking opt-in via metadata for documents that justify the per-doc cost.

# backend/app/services/chunking.py
class ChunkingFactory:
    @staticmethod
    def from_metadata(meta: DocumentMetadata) -> ChunkingStrategy:
        # High-value docs (legal, medical, contracts) opt into semantic
        if meta.tags & {"legal", "medical", "contract", "compliance"}:
            return SemanticChunking(threshold=0.75)

        # Markdown with clean paragraph structure → paragraph
        if meta.file_type == "markdown" and meta.has_clean_paragraphs:
            return ParagraphChunking()

        # Default: recursive (paragraph → sentence → word)
        return RecursiveChunking(chunk_size=1000, chunk_overlap=200)

The four strategies remain in the codebase. The factory is the policy; adding a strategy or changing the policy doesn't require touching every call site.

Why not fixed-size as the default? ADR-005 (Deprecated) covers the historical answer: we shipped fixed-size in v1, the 62% retrieval accuracy benchmark on the chunking-A/B bench is what killed it.

Tradeoffs we accept

LeverAlternativeChosen
Default quality floorOne strategy everywhereRecursive default + semantic upgrade path
Ingest costSingle fixed costVariable: $0 default + ~$0.01/high-value doc
Operational complexityOne pipelineFactory + tag policy + opt-in semantic
CalibrationNoneThreshold + tag list + chunk-size tunables
New-format supportAdd a parserAdd a parser + a chunking-strategy mapping

We accept the operational complexity (a factory, a tag policy, two threshold knobs) because the alternative (semantic-everywhere = ~$30k/yr in embedding bills at our reference scenario, or fixed-size-everywhere = 16pp accuracy regression) is worse on either margin.

Consequences (positive)

  • Headline retrieval accuracy stays at 78% out of the box — recursive is the default, so a learner with no tags gets the +16pp lift over fixed-size automatically.
  • Compliance/legal/medical docs hit 85% without a global cost spike — semantic is opt-in, paid on the documents where the lift matters.
  • Markdown gets paragraph chunking, which honors the author's intentional structure — release notes, blog posts, ADRs themselves get a third strategy without code changes.
  • Factory pattern is testable. ChunkingFactory.from_metadata is a pure function over DocumentMetadata; we test the policy directly.
  • The 4-strategy A/B is part of the ingestion narrative. The 62/78/85 numbers anchor the M01 lesson and recur in the ADR-005 deprecation rationale.

Consequences (negative)

  • Tag policy drift. legal / medical / contract / compliance tags must be applied at ingest. If teams forget to tag, semantic doesn't fire for them. Mitigated with a default-tag rule on known sources (e.g. ingestion from the legal SharePoint always carries legal).
  • Semantic threshold drift. 0.75 calibrated against the seed corpus. The runbook entry "recalibrate when retrieval accuracy regresses > 5pp" documents the procedure.
  • Mixed-strategy debugging. The reranker (ADR-002) sees chunks of different sizes from different strategies; dense and BM25 indexes need to handle the variance. We accept; the index already handles variable-length chunks.

Reversal plan

If retrieval-accuracy regression on the canary set drops below 70% (i.e. the strategy mix is no longer earning its complexity), reversal is:

  1. Set CHUNKING_FACTORY_MODE=recursive_only in .env.
  2. ChunkingFactory.from_metadata short-circuits to recursive.
  3. Re-ingest affected high-value documents on the new strategy.
  4. Document the reversal in runbook/failure_modes.md and re-run RAGAS.

Estimated effort: ~3 engineer-days for the cutover; re-ingest depends on corpus size.

References

  • backend/app/services/chunking.pyChunkingStrategy ABC + 4 implementations + ChunkingFactory
  • backend/app/services/document_parser.py — parser → chunker handoff with metadata
  • backend/app/routes/documents.py — ingestion API + tag policy
  • data/sample_docs/ — the 4-doc seed corpus the benchmarks ran against
  • ADR-005 (the Deprecated fixed-size predecessor)
  • eval/approach_decision.md — when-to-use decision tree
Built into the project

This decision shipped as part of Enterprise RAG — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open