Context
In production, RAG components fail. The retriever can time out. The LLM API can rate-limit. The reranker can OOM. The vector index can lock during a VACUUM. Every one of these failures has happened in our systems. The question is what users see when they do.
The bad outcomes:
- HTTP 500 with no body. Users see a blank error. Worst possible UX.
- Confidently-wrong answer. The system silently picks an unrelated chunk and the LLM grounds its answer in it. Users trust the answer because it looks confident.
- Eternal retry. The system keeps retrying the failing component forever, eating the budget.
We need a cascade that degrades gracefully and tells the user what happened.
Three patterns considered:
- Single-shot with HTTP 500. Simplest. Ships the bad UX above.
- Synchronous retry-with-backoff. Retries the same path with exponential backoff. Eats the budget on persistent failures, and every retry returns the same failure if the path is actually broken.
- Cascade through alternative paths. RAG fails → fall back to LLM-only with no retrieval; LLM-only fails → fall back to a cached answer if available; cached miss → return an honest error.
Decision
Adopt the cascade. Three levels, each with an explicit degradation marker on the response.

    # serving/failure_router.py
    # Tenant, the exception types, and the _*_path helpers are defined elsewhere.
    from dataclasses import dataclass, replace


    @dataclass
    class CascadeResult:
        text: str
        is_degraded: bool  # True if we fell back from RAG
        degradation_level: int  # 0=RAG, 1=LLM-only, 2=cached, 3=error
        disclaimer: str | None  # User-visible note about degradation
        # Assumed field: set by _cache_path on a cache hit, None otherwise
        cache_age_hours: int | None = None


    class FailureRouter:
        async def execute(self, query: str, tenant: Tenant) -> CascadeResult:
            # Level 0: full RAG
            try:
                return await self._rag_path(query, tenant)
            except (RetrieverTimeout, RerankFailure) as e:
                self._record_degradation(0, e)

            # Level 1: LLM-only (no retrieval)
            try:
                result = await self._llm_only_path(query)
                # dataclasses.replace returns a modified copy
                # (dataclasses have no namedtuple-style _replace method)
                return replace(
                    result,
                    is_degraded=True,
                    degradation_level=1,
                    disclaimer="Generated without retrieval (search temporarily unavailable). Verify against your records.",
                )
            except (LLMRateLimit, LLMTimeout) as e:
                self._record_degradation(1, e)

            # Level 2: cached answer (similar query in last 24h)
            cached = await self._cache_path(query, tenant)
            if cached:
                return replace(
                    cached,
                    is_degraded=True,
                    degradation_level=2,
                    disclaimer=f"Showing a cached answer from {cached.cache_age_hours}h ago (services degraded).",
                )

            # Level 3: honest error
            return CascadeResult(
                text="The system is degraded. We can't generate a reliable answer right now.",
                is_degraded=True,
                degradation_level=3,
                disclaimer="Please retry in a few minutes. If urgent, contact support.",
            )

The cascade is wrapped by a CircuitBreaker per component — repeated failures of the same level skip that level for ~30s rather than burning budget.
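The skip-after-repeated-failures behavior can be sketched as a small breaker with an injectable clock. This is an illustrative sketch, not the contents of `serving/circuit_breaker.py`: the method names, the `max_failures` threshold of 3, and the decision to reset fully after the cooldown (with no half-open probing state) are all assumptions. Only the roughly 30s skip window comes from the text.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skips a component for `cooldown_s` after `max_failures` consecutive failures."""

    def __init__(
        self,
        max_failures: int = 3,      # assumed threshold, not stated in the ADR
        cooldown_s: float = 30.0,   # the ~30s skip window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if the component may be tried right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset the counters and allow attempts again.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        """A success clears the consecutive-failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Open the breaker once the consecutive-failure threshold is reached."""
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

In a router like the one above, each level would check `allow()` before attempting its path and fall straight through to the next level when it returns `False`, so a component that is known to be down costs no latency budget during the cooldown.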
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Latency on degraded path | Retry the same path | Cascade — accept ~50% extra latency on degraded queries to deliver something |
| User-visible degradation | Silent fallback | Explicit disclaimer — accept the UX cost of "we're degraded" framing for honesty |
| Cache freshness | Reject stale cache | Show cache up to 24h with disclaimer — accept stale answer over no answer |
| Cascade depth | 5+ levels | 3 levels — accept that beyond cache, there's nothing else to try |
| Per-tenant tuning | Same cascade for everyone | Same default; per-tenant overrides via TenantConfig (e.g. compliance tenants disable cache fallback) |
Consequences (positive)
- Users never see a blank error. Every degraded response carries a disclaimer that names the degradation.
- The cascade fires on real failures, not synthetic ones: verified by chaos testing in M06 (`incident_simulations.py` covers all three failure modes).
- M05's eval gates on `is_degraded` rate. A spike in degradation rate (over 5%) blocks releases, so the cascade is a signal, not a free pass.
- Cache fallback is rare (~0.5% of queries) but high-value: it covers exactly the worst case where everything else has failed.
- Trace data identifies the failure point: `degradation_level=1` traces tell us "retrieval is the problem"; `level=2` traces tell us "LLM is the problem". Debug surface is automatic.
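The release gate and the level-to-cause reading of traces can be sketched together. This is a hedged illustration rather than M05's actual eval code: the function names and the input shape (a flat list of `degradation_level` values from a sample of responses) are assumptions. The 5% threshold and the meaning of levels 1 and 2 are taken from the bullets above.

```python
from collections import Counter
from typing import Dict, Iterable

MAX_DEGRADED_RATE = 0.05  # releases are blocked above a 5% degradation rate

# Component that a trace at each level points to, per the list above.
LIKELY_CAUSE = {1: "retrieval", 2: "LLM"}


def degraded_rate(levels: Iterable[int]) -> float:
    """Fraction of responses with degradation_level > 0 (is_degraded=True)."""
    levels = list(levels)
    if not levels:
        return 0.0
    return sum(1 for lv in levels if lv > 0) / len(levels)


def release_blocked(levels: Iterable[int], threshold: float = MAX_DEGRADED_RATE) -> bool:
    """True when the sampled degradation rate exceeds the gate threshold."""
    return degraded_rate(levels) > threshold


def failure_breakdown(levels: Iterable[int]) -> Dict[str, int]:
    """Count degraded responses by the component their level implicates."""
    return dict(Counter(LIKELY_CAUSE[lv] for lv in levels if lv in LIKELY_CAUSE))
```

For example, a sample of 94 healthy responses, 4 at level 1, and 2 at level 2 has a 6% degradation rate, so `release_blocked` returns `True`, and `failure_breakdown` attributes 4 of the degraded responses to retrieval and 2 to the LLM.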
Consequences (negative)
- LLM-only path produces ungrounded answers. We mitigate with a stricter prompt that says "no retrieval results — answer cautiously"; M05's grounding_score eval catches when this drifts into hallucination territory.
- Cache path has freshness risk. Mitigation: 24h hard cap; compliance tenants disable cache fallback entirely.
- Cascade adds operational complexity. The CircuitBreaker has its own state machine; on-call engineers need a runbook (`runbooks/cascade-debugging.md`).
- "Honest error" path requires status communication. We log to Slack on first cascade-to-error of the hour so on-call can investigate before the user does.
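The "first cascade-to-error of the hour" rule amounts to a once-per-hour throttle on level-3 responses. A minimal sketch follows, assuming a hypothetical `notify` callable that stands in for whatever posts to Slack, and assuming that "of the hour" means a fixed clock-hour bucket rather than a sliding 60-minute window. The class and method names are illustrative only.

```python
import time
from typing import Callable, Optional


class HourlyErrorAlert:
    """Calls `notify` at most once per clock hour for level-3 (honest error) results."""

    def __init__(
        self,
        notify: Callable[[str], None],              # hypothetical hook, e.g. a Slack poster
        clock: Callable[[], float] = time.time,     # seconds since the epoch
    ) -> None:
        self._notify = notify
        self._clock = clock
        self._last_hour: Optional[int] = None       # hour bucket of the last alert

    def on_result(self, degradation_level: int) -> bool:
        """Return True if an alert was sent for this result."""
        if degradation_level != 3:
            return False
        hour = int(self._clock() // 3600)
        if hour == self._last_hour:
            return False  # already alerted in this hour bucket
        self._last_hour = hour
        self._notify("Cascade reached level 3 (honest error); first occurrence this hour.")
        return True
```

Keeping the throttle in-process like this only works for a single serving instance; with several replicas, the last-alert marker would need to live in shared storage, which this sketch does not attempt.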
Reversal plan
Drop cascade (single-shot + HTTP 500): ~2 engineer-days. Trigger: degradation rate > 15% sustained (signal that the cascade is masking a real systemic failure).
Add a 4th level (RAG-from-different-region): ~2 engineer-weeks. Use only when we have multi-region retrieval indexes (we don't, at v1).
Per-tenant cascade depth: ~3 engineer-days. Already partially supported via TenantConfig.cache_disabled. Extend to per-tenant degradation level caps.
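A per-tenant degradation cap could look roughly like the following. Only `TenantConfig.cache_disabled` is named in this ADR; the `max_degradation_level` field and the `allowed_levels` helper are hypothetical, shown to illustrate how a cap might combine with the existing cache switch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TenantConfig:
    cache_disabled: bool = False     # existing switch, e.g. for compliance tenants
    max_degradation_level: int = 3   # hypothetical cap on how far the cascade may fall


def allowed_levels(cfg: TenantConfig) -> List[int]:
    """Levels the cascade may attempt for this tenant, in order.

    Level 3 (honest error) is always kept as the terminal step so that
    a capped tenant still gets an explicit response.
    """
    levels = []
    for level in (0, 1, 2):
        if level > cfg.max_degradation_level:
            continue  # beyond this tenant's cap
        if level == 2 and cfg.cache_disabled:
            continue  # cache fallback is switched off for this tenant
        levels.append(level)
    levels.append(3)
    return levels
```

With defaults this yields `[0, 1, 2, 3]`; a tenant with `cache_disabled=True` gets `[0, 1, 3]`, and a tenant capped at `max_degradation_level=0` goes from full RAG straight to the honest error.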
References
- `serving/failure_router.py`: cascade implementation
- `serving/circuit_breaker.py`: per-component breaker
- `tests/test_failure_cascade.py`: chaos tests
- `runbooks/cascade-debugging.md`: on-call playbook
- `incident_simulations.py`: M06 chaos simulator covers all three levels
- ADR-001 (SystemContract `latency_p95_seconds`: cascade respects total budget)
- ADR-002 (RAG path uses pgvector retrieval)
- ADR-005 (DEPRECATED — early row-filter tenant model that the cascade had to work around)