Three-level failure cascade: RAG → LLM-only → cached → honest error

✓ Accepted · Full-Stack AI Platform · 04: Serving Layer & Production API
By AI-DE Engineering Team · Stakeholders: platform owner, on-call engineer, eng manager

Context

In production, RAG components fail. The retriever can time out. The LLM API can rate-limit. The reranker can run out of memory (OOM). The vector index can lock during a VACUUM. Every one of these failures has happened in our systems. The question is what users see when they do.

The bad outcomes:

  • HTTP 500 with no body. Users see a blank error. Worst possible UX.
  • Confidently-wrong answer. The system silently picks an unrelated chunk and the LLM grounds its answer in it. Users trust the answer because it looks confident.
  • Eternal retry. The system keeps retrying the failing component forever, eating the budget.

We need a cascade that degrades gracefully and tells the user what happened.

Three patterns considered:

  • Single-shot with HTTP 500. Simplest. Ships the bad UX above.
  • Synchronous retry-with-backoff. Retries the same path with exponential backoff. Eats the budget on persistent failures, and returns the same failure if the path is actually broken.
  • Cascade through alternative paths. RAG fails → fall back to LLM-only with no retrieval; LLM-only fails → fall back to a cached answer if available; cached miss → return an honest error.
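
To make the budget cost of the retry-with-backoff pattern concrete, here is a small sketch that sums the sleep time of an exponential backoff schedule. The base delay, multiplier, and retry count are illustrative assumptions, not values taken from our services.

```python
def backoff_delays(base_s: float, factor: float, retries: int) -> list[float]:
    """Sleep durations before each retry: base, base*factor, base*factor^2, ..."""
    return [base_s * factor**i for i in range(retries)]


def total_backoff_wait(base_s: float, factor: float, retries: int) -> float:
    """Total time spent sleeping, excluding the duration of the calls themselves."""
    return sum(backoff_delays(base_s, factor, retries))


# Hypothetical schedule: 0.5s base delay, doubling, 4 retries.
delays = backoff_delays(0.5, 2.0, 4)       # [0.5, 1.0, 2.0, 4.0]
wasted = total_backoff_wait(0.5, 2.0, 4)   # 7.5 seconds of waiting alone
```

Under these assumed numbers, a persistently broken path costs 7.5 seconds of sleep before the request fails anyway, which is why the cascade moves to a different path instead of repeating the same one.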

Decision

Adopt the cascade. Three levels, each with an explicit degradation marker on the response.

# serving/failure_router.py
from dataclasses import dataclass, replace

# Tenant and the exception types used below (RetrieverTimeout, RerankFailure,
# LLMRateLimit, LLMTimeout) are defined elsewhere in the codebase.


@dataclass
class CascadeResult:
    text: str
    is_degraded: bool                  # True if we fell back from RAG
    degradation_level: int             # 0=RAG, 1=LLM-only, 2=cached, 3=error
    disclaimer: str | None             # User-visible note about degradation
    cache_age_hours: float | None = None  # Populated only by the cache path

class FailureRouter:
    async def execute(self, query: str, tenant: Tenant) -> CascadeResult:
        # Level 0: full RAG
        try:
            return await self._rag_path(query, tenant)
        except (RetrieverTimeout, RerankFailure) as e:
            self._record_degradation(0, e)

        # Level 1: LLM-only (no retrieval)
        try:
            result = await self._llm_only_path(query)
            # dataclasses.replace returns a copy with the given fields overridden
            # (dataclasses have no _replace method; that belongs to namedtuples).
            return replace(
                result,
                is_degraded=True,
                degradation_level=1,
                disclaimer="Generated without retrieval (search temporarily unavailable). Verify against your records.",
            )
        except (LLMRateLimit, LLMTimeout) as e:
            self._record_degradation(1, e)

        # Level 2: cached answer (similar query in last 24h)
        cached = await self._cache_path(query, tenant)
        if cached:
            return replace(
                cached,
                is_degraded=True,
                degradation_level=2,
                disclaimer=f"Showing a cached answer from {cached.cache_age_hours}h ago; services degraded.",
            )

        # Level 3: honest error
        return CascadeResult(
            text="The system is degraded. We can't generate a reliable answer right now.",
            is_degraded=True,
            degradation_level=3,
            disclaimer="Please retry in a few minutes. If urgent, contact support.",
        )

Each component in the cascade is wrapped in a CircuitBreaker: after repeated failures at a level, the breaker opens and the cascade skips that level for ~30s rather than burning budget on it.
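
The contents of serving/circuit_breaker.py are not shown in this ADR, so the following is only a minimal sketch of the skip-for-~30s behavior. The failure threshold of 3, the method names, and the injectable clock are assumptions made for illustration; only the 30-second cooldown comes from the text above.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Sketch: opens after N consecutive failures, stays open for a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value, not from the ADR
        cooldown_s: float = 30.0,        # the ~30s skip window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    def allow(self) -> bool:
        """Return True if the guarded level should be attempted."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and try the level again.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        """A successful call resets the consecutive-failure count."""
        self._failures = 0

    def record_failure(self) -> None:
        """Count a failure and open the breaker once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In a router built this way, each level would check `allow()` before its `try` block and fall through to the next level when it returns False.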

Tradeoffs we accept

Lever | Alternative | Chosen
Latency on degraded path | Retry the same path | Cascade: accept ~50% extra latency on degraded queries to deliver something
User-visible degradation | Silent fallback | Explicit disclaimer: accept the UX cost of "we're degraded" framing for honesty
Cache freshness | Reject stale cache | Show cache up to 24h with disclaimer: accept stale answer over no answer
Cascade depth | 5+ levels | 3 levels: accept that beyond cache, there's nothing else to try
Per-tenant tuning | Same cascade for everyone | Same default; per-tenant overrides via TenantConfig (e.g. compliance tenants disable cache fallback)
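
The cache-freshness tradeoff can be sketched as a small eligibility check. This is an illustration, not the code behind `_cache_path`: the function names are hypothetical, and treating an entry of exactly 24h as still servable is an assumption.

```python
from datetime import datetime, timedelta, timezone

MAX_CACHE_AGE = timedelta(hours=24)  # hard cap from this ADR


def cache_age_hours(cached_at: datetime, now: datetime) -> float:
    """Age of a cache entry in hours, rounded to one decimal for the disclaimer."""
    return round((now - cached_at).total_seconds() / 3600, 1)


def is_servable(cached_at: datetime, now: datetime, cache_disabled: bool = False) -> bool:
    """A cached answer is usable only if the tenant allows it and it is within the cap."""
    if cache_disabled:
        # e.g. compliance tenants that opt out of cache fallback
        return False
    return now - cached_at <= MAX_CACHE_AGE


# Hypothetical usage with a fixed "now" so the result is deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = is_servable(now - timedelta(hours=6), now)    # True
stale = is_servable(now - timedelta(hours=25), now)   # False
```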

Consequences (positive)

  • Users never see a blank error. Every degraded response carries a disclaimer that names the degradation.
  • The cascade fires on real failures, not synthetic ones — verified by chaos testing in M06 (incident_simulations.py covers all three failure modes).
  • M05's eval gates on is_degraded rate. A spike in degradation rate (over 5%) blocks releases — the cascade is a signal, not a free pass.
  • Cache fallback is rare (~0.5% of queries) but high-value: it covers exactly the worst case where everything else has failed.
  • Trace data identifies the failure point: degradation_level=1 traces point to retrieval or reranking as the problem; degradation_level=2 traces mean the LLM-only path failed as well. The debug surface is automatic.
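
As an illustration of the release gate and the trace breakdown described in the list above, here is a minimal sketch. It assumes the eval run yields a plain list of `degradation_level` values, one per response; the function names are hypothetical and M05's actual gate may be structured differently.

```python
from collections import Counter

RELEASE_BLOCK_THRESHOLD = 0.05  # degradation rate above 5% blocks releases


def degradation_rate(levels: list[int]) -> float:
    """Fraction of responses served from any degraded level (level > 0)."""
    if not levels:
        return 0.0
    return sum(1 for lvl in levels if lvl > 0) / len(levels)


def release_blocked(levels: list[int], threshold: float = RELEASE_BLOCK_THRESHOLD) -> bool:
    """True when the degradation rate is strictly over the threshold."""
    return degradation_rate(levels) > threshold


def level_histogram(levels: list[int]) -> dict[int, int]:
    """Responses per degradation level, to see which component is failing."""
    return dict(Counter(levels))


# Hypothetical eval run: 94 healthy, 4 LLM-only, 2 cached responses.
sample = [0] * 94 + [1] * 4 + [2] * 2
blocked = release_blocked(sample)   # 6% degraded, so the release is blocked
```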

Consequences (negative)

  • LLM-only path produces ungrounded answers. We mitigate with a stricter prompt that says "no retrieval results — answer cautiously"; M05's grounding_score eval catches when this drifts into hallucination territory.
  • Cache path has freshness risk. Mitigation: 24h hard cap; compliance tenants disable cache fallback entirely.
  • Cascade adds operational complexity. The CircuitBreaker has its own state machine; on-call engineers need a runbook (runbooks/cascade-debugging.md).
  • "Honest error" path requires status communication. We log to Slack on first cascade-to-error of the hour so on-call can investigate before the user does.
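
The "first cascade-to-error of the hour" rule from the last item can be sketched as a small gate in front of the notifier. This assumes "hour" means clock hour rather than a rolling 60-minute window, and the class name is hypothetical; the actual Slack call is left to the caller.

```python
from datetime import datetime, timezone


class HourlyAlertGate:
    """Lets the first level-3 event of each clock hour through; suppresses the rest."""

    def __init__(self) -> None:
        self._last_alerted_hour: datetime | None = None

    def should_alert(self, now: datetime) -> bool:
        # Truncate to the start of the hour so all events in that hour compare equal.
        hour = now.replace(minute=0, second=0, microsecond=0)
        if hour == self._last_alerted_hour:
            return False
        self._last_alerted_hour = hour
        return True


# Hypothetical usage: only the first event in the 10:00 hour triggers an alert.
gate = HourlyAlertGate()
t = datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc)
first = gate.should_alert(t)                       # True
second = gate.should_alert(t.replace(minute=50))   # False, same hour
```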

Reversal plan

Drop cascade (single-shot + HTTP 500): ~2 engineer-days. Trigger: degradation rate > 15% sustained (signal that the cascade is masking a real systemic failure).

Add a 4th level (RAG-from-different-region): ~2 engineer-weeks. Use only when we have multi-region retrieval indexes (we don't, at v1).

Per-tenant cascade depth: ~3 engineer-days. Already partially supported via TenantConfig.cache_disabled. Extend to per-tenant degradation level caps.
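
A possible shape for the per-tenant level cap, as a sketch only. `cache_disabled` is the existing flag named above; `max_degradation_level` and `allowed_levels` are hypothetical names for the proposed extension, and always keeping level 3 as the terminal state is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantConfig:
    cache_disabled: bool = False      # existing flag per this ADR
    max_degradation_level: int = 3    # hypothetical: deepest fallback level allowed


def allowed_levels(config: TenantConfig) -> list[int]:
    """Cascade levels this tenant may be served from, in order.

    Level 3 (honest error) is always included so the cascade has a terminal state.
    """
    levels = [lvl for lvl in (0, 1, 2) if lvl <= config.max_degradation_level]
    if config.cache_disabled and 2 in levels:
        levels.remove(2)
    return levels + [3]


default_levels = allowed_levels(TenantConfig())                         # [0, 1, 2, 3]
compliance_levels = allowed_levels(TenantConfig(cache_disabled=True))   # [0, 1, 3]
```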

References

  • serving/failure_router.py — cascade implementation
  • serving/circuit_breaker.py — per-component breaker
  • tests/test_failure_cascade.py — chaos tests
  • runbooks/cascade-debugging.md — on-call playbook
  • incident_simulations.py — M06 chaos simulator covers all three levels
  • ADR-001 (SystemContract latency_p95_seconds — cascade respects total budget)
  • ADR-002 (RAG path uses pgvector retrieval)
  • ADR-005 (DEPRECATED — early row-filter tenant model that the cascade had to work around)