ADR-004: Three-level failure cascade: RAG → LLM-only → cached → honest error | Full-Stack AI Platform

Context

In production, RAG components fail. The retriever can time out. The LLM API can rate-limit. The reranker can OOM. The vector index can lock during a VACUUM. Every one of these failures has happened in our systems. The question is what users see when they do.

The bad outcomes:

HTTP 500 with no body. Users see a blank error. Worst possible UX.
Confidently wrong answer. The system silently picks an unrelated chunk and the LLM grounds its answer in it. Users trust the answer because it looks confident.
Eternal retry. The system keeps retrying the failing component forever, eating the budget.

We need a cascade that degrades gracefully and tells the user what happened.

Three patterns considered:

Single-shot with HTTP 500. Simplest. Ships the bad UX above.
Synchronous retry-with-backoff. Retries the same path with exponential backoff. Eats the budget on persistent failures, and returns the same failure if the path is actually broken.
Cascade through alternative paths. RAG fails → fall back to LLM-only with no retrieval; LLM-only fails → fall back to a cached answer if available; cached miss → return an honest error.
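
To make the budget cost of retry-with-backoff concrete, here is a small worked sketch. The base delay, multiplier, and retry count are illustrative assumptions, not values taken from our systems.

```python
def backoff_delays(base_seconds: float, factor: float, retries: int) -> list[float]:
    """Delay before each retry under plain exponential backoff (no jitter)."""
    return [base_seconds * factor**attempt for attempt in range(retries)]


def total_backoff_wait(base_seconds: float, factor: float, retries: int) -> float:
    """Time spent sleeping alone, before counting any request time."""
    return sum(backoff_delays(base_seconds, factor, retries))


# Illustrative numbers only: 0.5 s base delay, doubling, 5 retries.
delays = backoff_delays(0.5, 2.0, 5)     # [0.5, 1.0, 2.0, 4.0, 8.0]
total = total_backoff_wait(0.5, 2.0, 5)  # 15.5 seconds of waiting
```

Under these assumed numbers, a persistently broken path costs 15.5 seconds of sleep and still ends in the same failure, which is the case the cascade is meant to avoid.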

Decision

Adopt the cascade. Three levels, each with an explicit degradation marker on the response.

# serving/failure_router.py
from __future__ import annotations

from dataclasses import dataclass, replace

# Tenant, the exception types, and the _*_path helpers are defined elsewhere
# and omitted here.


@dataclass
class CascadeResult:
    text: str
    is_degraded: bool                   # True if we fell back from RAG
    degradation_level: int              # 0=RAG, 1=LLM-only, 2=cached, 3=error
    disclaimer: str | None              # User-visible note about degradation
    cache_age_hours: int | None = None  # Set only on cache hits (read at level 2)


class FailureRouter:
    async def execute(self, query: str, tenant: Tenant) -> CascadeResult:
        # Level 0: full RAG
        try:
            return await self._rag_path(query, tenant)
        except (RetrieverTimeout, RerankFailure) as e:
            self._record_degradation(0, e)

        # Level 1: LLM-only (no retrieval)
        try:
            result = await self._llm_only_path(query)
            # dataclasses.replace returns a copy with the given fields overridden
            return replace(
                result,
                is_degraded=True,
                degradation_level=1,
                disclaimer="Generated without retrieval (search temporarily unavailable). Verify against your records.",
            )
        except (LLMRateLimit, LLMTimeout) as e:
            self._record_degradation(1, e)

        # Level 2: cached answer (similar query in last 24h)
        cached = await self._cache_path(query, tenant)
        if cached is not None:
            return replace(
                cached,
                is_degraded=True,
                degradation_level=2,
                disclaimer=f"Showing a cached answer from {cached.cache_age_hours}h ago; services are degraded.",
            )

        # Level 3: honest error
        return CascadeResult(
            text="The system is degraded. We can't generate a reliable answer right now.",
            is_degraded=True,
            degradation_level=3,
            disclaimer="Please retry in a few minutes. If urgent, contact support.",
        )

The cascade is wrapped by a CircuitBreaker per component: after repeated failures of a component, the router skips the corresponding level for ~30s rather than burning budget on it.
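
The real breaker lives in serving/circuit_breaker.py and is not reproduced here. As a rough illustration of the behaviour described above, the following is a minimal sketch. The failure threshold, the method names, and the injectable clock are assumptions made for the example; only the ~30s skip window comes from this ADR.

```python
from __future__ import annotations

import time
from typing import Callable


class CircuitBreaker:
    """Minimal sketch: opens after N consecutive failures, closes after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,          # assumed value, not from the ADR
        cooldown_seconds: float = 30.0,      # the ~30s skip window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: float | None = None

    def allow(self) -> bool:
        """Return True if the guarded level should be attempted."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and let the next call probe.
            self._opened_at = None
            self._consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

In a router, each level would check `allow()` before its `try` block and call `record_failure()` or `record_success()` afterwards, so an open breaker sends the query straight to the next level.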

Tradeoffs we accept

Lever	Alternative	Chosen
Latency on degraded path	Retry the same path	Cascade — accept ~50% extra latency on degraded queries to deliver something
User-visible degradation	Silent fallback	Explicit `disclaimer` — accept the UX cost of "we're degraded" framing for honesty
Cache freshness	Reject stale cache	Show cache up to 24h with disclaimer — accept stale answer over no answer
Cascade depth	5+ levels	3 levels — accept that beyond cache, there's nothing else to try
Per-tenant tuning	Same cascade for everyone	Same default; per-tenant overrides via `TenantConfig` (e.g. compliance tenants disable cache fallback)
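
The cache-freshness and per-tenant rows combine into a single eligibility check. The sketch below is one possible shape for it; the function name and signature are hypothetical, and only the 24h cap and the per-tenant cache opt-out come from this ADR.

```python
MAX_CACHE_AGE_HOURS = 24  # hard cap on cached answers


def cache_fallback_allowed(cache_age_hours: float, cache_disabled: bool = False) -> bool:
    """Return True if a cached answer may be served at level 2.

    `cache_disabled` stands in for the tenant-level opt-out
    (e.g. compliance tenants); how it is looked up is not shown.
    """
    if cache_disabled:
        return False
    return 0 <= cache_age_hours <= MAX_CACHE_AGE_HOURS
```

A tenant with the opt-out set never reaches the cache level, so a retrieval failure followed by an LLM failure goes directly to the honest error.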

Consequences (positive)

Users never see a blank error. Every degraded response carries a disclaimer that names the degradation.
The cascade fires on real failures, not synthetic ones — verified by chaos testing in M06 (incident_simulations.py covers all three failure modes).
M05's eval gates on the is_degraded rate. A degradation rate above 5% blocks releases, so the cascade is a signal, not a free pass.
Cache fallback is rare (~0.5% of queries) but high-value: it covers exactly the worst case where everything else has failed.
Trace data identifies the failure point: degradation_level=1 traces tell us "retrieval is the problem"; degradation_level=2 traces tell us "LLM is the problem". The debugging signal comes for free with every response.
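
The release gate and the per-level trace breakdown can be sketched as follows. How M05 actually computes the rate is not described in this ADR, so the function names and the input shape (a list of `degradation_level` values, one per response) are assumptions.

```python
from collections import Counter

DEGRADATION_RATE_LIMIT = 0.05  # releases are blocked above 5%


def degradation_rate(levels: list[int]) -> float:
    """Fraction of responses served from any level other than 0 (full RAG)."""
    if not levels:
        return 0.0
    return sum(1 for level in levels if level > 0) / len(levels)


def release_blocked(levels: list[int]) -> bool:
    """True when the degradation rate is strictly above the limit."""
    return degradation_rate(levels) > DEGRADATION_RATE_LIMIT


def level_histogram(levels: list[int]) -> Counter:
    """Count of responses per degradation level, for locating the failing component."""
    return Counter(levels)
```

For example, 100 responses with 4 at level 1, 1 at level 2, and 1 at level 3 give a 6% rate and block the release; the histogram then shows that most degraded responses stopped at level 1, which points at retrieval.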

Consequences (negative)

LLM-only path produces ungrounded answers. We mitigate with a stricter prompt that says "no retrieval results — answer cautiously"; M05's grounding_score eval catches when this drifts into hallucination territory.
Cache path has freshness risk. Mitigation: 24h hard cap; compliance tenants disable cache fallback entirely.
Cascade adds operational complexity. The CircuitBreaker has its own state machine; on-call engineers need a runbook (runbooks/cascade-debugging.md).
"Honest error" path requires status communication. We log to Slack on first cascade-to-error of the hour so on-call can investigate before the user does.

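Alerting only on the first cascade-to-error of each hour can be sketched as a small gate in front of the notifier. This is an illustration under assumptions: the class name is hypothetical, "hour" is taken to mean a clock-hour bucket rather than a rolling window, and `notify` is a plain callable standing in for whatever posts to Slack.

```python
from __future__ import annotations

import time
from typing import Callable


class HourlyAlertGate:
    """Forward at most one alert per clock hour."""

    def __init__(
        self,
        notify: Callable[[str], None],           # stand-in for the Slack poster
        clock: Callable[[], float] = time.time,  # seconds since the epoch
    ) -> None:
        self._notify = notify
        self._clock = clock
        self._last_alerted_hour: int | None = None

    def on_cascade_to_error(self, message: str) -> bool:
        """Send the alert if none was sent this hour. Returns True if sent."""
        current_hour = int(self._clock() // 3600)
        if current_hour == self._last_alerted_hour:
            return False
        self._last_alerted_hour = current_hour
        self._notify(message)
        return True
```

The gate keeps state in memory, so in a multi-process deployment each process would alert once per hour unless the last-alerted hour is stored somewhere shared.
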
Reversal plan

Drop cascade (single-shot + HTTP 500): ~2 engineer-days. Trigger: degradation rate > 15% sustained (signal that the cascade is masking a real systemic failure).

Add a 4th level (RAG-from-different-region): ~2 engineer-weeks. Use only when we have multi-region retrieval indexes (we don't, at v1).

Per-tenant cascade depth: ~3 engineer-days. Already partially supported via TenantConfig.cache_disabled. Extend to per-tenant degradation level caps.
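
One possible shape for the per-tenant level cap is sketched below. `cache_disabled` is the field named above; `max_degradation_level` and `allowed_levels` are hypothetical names introduced for the example, and the real TenantConfig may look different.

```python
from dataclasses import dataclass

LEVEL_RAG, LEVEL_LLM_ONLY, LEVEL_CACHED, LEVEL_ERROR = 0, 1, 2, 3


@dataclass(frozen=True)
class TenantConfig:
    cache_disabled: bool = False               # existing override
    max_degradation_level: int = LEVEL_CACHED  # hypothetical: deepest fallback allowed


def allowed_levels(config: TenantConfig) -> list[int]:
    """Levels the cascade may try for this tenant, in order.

    The honest error (level 3) is always last and cannot be disabled.
    """
    levels = [
        level
        for level in (LEVEL_RAG, LEVEL_LLM_ONLY, LEVEL_CACHED)
        if level <= config.max_degradation_level
    ]
    if config.cache_disabled and LEVEL_CACHED in levels:
        levels.remove(LEVEL_CACHED)
    return levels + [LEVEL_ERROR]
```

With this shape, a tenant that must never see ungrounded answers could set the cap to 0 and get either a full RAG answer or the honest error.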

References

serving/failure_router.py — cascade implementation
serving/circuit_breaker.py — per-component breaker
tests/test_failure_cascade.py — chaos tests
runbooks/cascade-debugging.md — on-call playbook
incident_simulations.py — M06 chaos simulator covers all three levels
ADR-001 (SystemContract latency_p95_seconds — cascade respects total budget)
ADR-002 (RAG path uses pgvector retrieval)
ADR-005 (DEPRECATED — early row-filter tenant model that the cascade had to work around)