Context
The vLLM engine has three predictable failure modes under production load:
- GPU OOM under bursty long-context traffic. When 50 concurrent 3500-token requests arrive, KV cache utilization saturates at 100% and subsequent requests fail with CUDA out of memory. Documented in M04's chaos scenario #1.
- Model load timeout on cold start. Spinning a new Ray Serve replica from scratch takes 30–90s for model load + warm-up. Requests that hit a freshly-spawned replica time out at the FastAPI client's default 30s timeout.
- Vendor outage (Anthropic / OpenAI fallback model). When the LLM gateway falls through to a vendor for cases vLLM can't serve, that vendor can itself be down. Anthropic and OpenAI average roughly one multi-hour outage per quarter.
Two patterns for handling these failures on the request-handler hot path:
- Retry with exponential backoff. Naive: catch the exception, sleep backoff_base * 2^retry + jitter, retry up to N times. Simple, easy to reason about. Failure mode: under GPU OOM, retries pile *more* load on an already-saturated replica. Under cold-start, retries compound the problem because every retry is also waiting for the cold replica. Retry storms turn 1× failures into 10× cascades.
- Circuit breaker with state machine. Track consecutive failures per downstream. After N consecutive failures, open the circuit: fast-fail subsequent requests with a cached fallback or a 503 degraded response instead of retrying. After a recovery timeout, allow probe requests (half-open state). On probe success, close the circuit. On probe failure, re-open.
Circuit breakers are the standard pattern for this exact problem class (see Hystrix, Resilience4j, Polly). Retry-without-circuit is an anti-pattern under the failure modes we have.
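To make the retry-storm arithmetic concrete, here is a minimal sketch of the naive backoff delay and of the load it generates when every attempt fails. The function names and the default values for backoff_base and max_jitter are assumptions for illustration; they are not taken from the codebase.

```python
import random


def backoff_delay(retry: int, backoff_base: float = 0.5, max_jitter: float = 0.1) -> float:
    """Delay before retry number `retry` (0-based): backoff_base * 2**retry + jitter.

    backoff_base and max_jitter defaults are illustrative assumptions.
    """
    jitter = random.uniform(0.0, max_jitter)
    return backoff_base * (2 ** retry) + jitter


def total_attempts(failing_requests: int, max_retries: int) -> int:
    """Calls that reach the downstream when every attempt fails.

    Each request makes one initial attempt plus max_retries retries.
    """
    return failing_requests * (1 + max_retries)


if __name__ == "__main__":
    # With jitter disabled, the delays double on each retry.
    print([backoff_delay(r, backoff_base=0.5, max_jitter=0.0) for r in range(4)])
    # 50 failing requests with 9 retries each hit the saturated replica 500 times.
    print(total_attempts(failing_requests=50, max_retries=9))
```

Backoff spaces the attempts out but does not reduce their number, which is why a saturated replica still sees the full multiple of the original load.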
Decision
We adopt a circuit breaker (resilience/circuit_breaker.py) with a
3-state machine wrapping the vLLM call. State transitions:
- CLOSED → CLOSED: success (always)
- CLOSED → OPEN: failure_count >= failure_threshold
- OPEN → HALF_OPEN: recovery_timeout elapsed
- HALF_OPEN → CLOSED: success_count >= success_threshold
- HALF_OPEN → OPEN: failure on a probe request
Configuration on the vLLM downstream:
# resilience/circuit_breaker.py
vllm_circuit_breaker = ServingCircuitBreaker(
    name="vllm_inference",
    failure_threshold=5,    # 5 consecutive failures → OPEN
    recovery_timeout=30,    # 30s in OPEN before HALF_OPEN probe
    success_threshold=2,    # 2 consecutive successes in HALF_OPEN → CLOSED
    excluded_exceptions=(
        ValidationError,    # client errors don't count toward circuit
    ),
)
When the circuit is OPEN, the RAG pipeline raises CircuitBreakerOpen
with a retry_after hint. The handler returns a 503 with a gracefully
degraded response (cached fallback if available, otherwise an honest "I
don't have enough context to answer right now" message).
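The state machine and the fast-fail behaviour can be sketched as follows. This is not the ServingCircuitBreaker implementation; SketchCircuitBreaker, its attributes, and the injectable clock are assumptions made so the example is self-contained, and only the parameter names and the CircuitBreakerOpen / retry_after pairing come from the text above.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


class CircuitBreakerOpen(Exception):
    """Raised instead of calling the downstream while the circuit is OPEN."""

    def __init__(self, retry_after: float):
        super().__init__(f"circuit open, retry after {retry_after:.1f}s")
        self.retry_after = retry_after


class SketchCircuitBreaker:
    """Hypothetical 3-state breaker; a sketch, not the production class."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 success_threshold=2, excluded_exceptions=(),
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.excluded_exceptions = excluded_exceptions
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.opened_at = 0.0

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failure_count = 0
        self.success_count = 0

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()  # a failed probe re-opens immediately
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._open()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = State.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0  # failures must be consecutive

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            remaining = self.recovery_timeout - (self.clock() - self.opened_at)
            if remaining > 0:
                raise CircuitBreakerOpen(retry_after=remaining)  # fast-fail
            self.state = State.HALF_OPEN  # timeout elapsed: allow probes
            self.success_count = 0
        try:
            result = fn(*args, **kwargs)
        except self.excluded_exceptions:
            raise  # client errors do not count toward the circuit
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result
```

A handler would catch CircuitBreakerOpen around the call, return the cached fallback or the degraded message with status 503, and pass retry_after on to the client, for example as a Retry-After header.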
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Failure recovery | Retry-with-backoff | Circuit breaker state machine |
| Behaviour under GPU OOM | More load on a saturated replica | Fast-fail with degraded response |
| Time-to-recover after vendor outage | Each request waits for vendor timeout | Circuit absorbs the wait once; probes resume |
| User-visible failure | Slow timeouts (~30s) | Fast 503 with retry_after |
| Tunability | retries/backoff | failure_threshold + recovery_timeout |
| Implementation cost | A while-loop | A state machine + Prometheus exporter |
The largest concrete cost is the state machine itself — ~120 lines of code + Prometheus state gauge + per-call overhead (~10 microseconds per call to check current state). We accept the cost because the failure modes the circuit breaker prevents (retry storms, cascading timeouts) are production-incident-grade.
Consequences (positive)
- GPU OOM doesn't cascade. Once the circuit opens after 5 failures, subsequent requests fast-fail with CircuitBreakerOpen instead of piling on the saturated GPU. The replica has 30s of breathing room to recover before the half-open probe.
- Cold-start cascade is bounded. ADR-005's chaos scenario #3 shows TTFT > 90s on cold start. With the circuit breaker, subsequent requests during the cold phase fast-fail; they don't wait 90s each.
- Vendor outage is observable. The Prometheus circuit_breaker_state{name="vllm_inference"} gauge moves from 0 (CLOSED) to 1 (OPEN) and surfaces on the M04 Grafana dashboard. Alert rule CircuitBreakerOpen fires immediately.
- Graceful degradation path. The handler can return a cached response or an honest 503 instead of an opaque server error. Users see an actionable error message, not a hang.
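As an illustration of the gauge described above, the sketch below maps breaker states to numeric values and renders one sample line in the Prometheus text exposition format. The values 0 and 1 come from this ADR; the value 2 for HALF_OPEN and the helper name render_state_gauge are assumptions, and the real exporter in observability/metrics.py may encode states differently.

```python
# Numeric encoding of breaker states for the gauge.
# CLOSED=0 and OPEN=1 are stated in the ADR; HALF_OPEN=2 is an assumption.
STATE_VALUES = {"CLOSED": 0, "OPEN": 1, "HALF_OPEN": 2}


def render_state_gauge(name: str, state: str) -> str:
    """Render one sample line: metric name, label set, then the value."""
    return f'circuit_breaker_state{{name="{name}"}} {STATE_VALUES[state]}'


if __name__ == "__main__":
    print(render_state_gauge("vllm_inference", "OPEN"))
```

With an encoding like this, an alert on the gauge being equal to 1 fires as soon as the breaker opens, without waiting for an error-rate window to fill.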
Consequences (negative)
- Threshold tuning matters. failure_threshold=5 is a heuristic. Too low, the circuit flaps on transient GPU contention; too high, the circuit opens too late to save the cascade. The runbook documents re-tuning when alert volume changes.
- Excluded exceptions are workload-specific. Client validation errors shouldn't count toward the circuit; we add them to excluded_exceptions. Adding a new validation type means updating the exclusion list.
- Half-open probe is real traffic. A user request gets routed to a recovering replica during half-open. The user might see one slow response. We accept this; alternatives (synthetic probes) cost more.
- Multiple downstreams need multiple breakers. vLLM, RAG retrieval, embedding API, vendor LLM fallback — each gets its own breaker. We end up with 3–4 circuit_breaker_state gauges; documented in the Grafana dashboard panel.
Reversal plan
If circuit-breaker false-positives become a recurring problem (e.g. tuning the threshold ends up oscillating), the reversal is:
- Set CIRCUIT_BREAKER_ENABLED=false in .env. vllm_circuit_breaker.call() short-circuits to a direct invocation with retry-with-backoff fallback.
- Re-add a Prometheus alert on raw error rate (error_rate{...} > 0.05 for 5m) since we lose the breaker's signal.
- Document the new failure mode in the runbook.
Estimated effort: ~2 engineer-days.
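The flag check in the first reversal step could look roughly like the sketch below. It assumes the .env file has already been loaded into the process environment; breaker_enabled, guarded_call, and the two callables passed in are hypothetical names, not the actual contents of vllm_circuit_breaker.call().

```python
import os


def breaker_enabled(env=os.environ) -> bool:
    """Read CIRCUIT_BREAKER_ENABLED; the breaker stays on unless explicitly disabled."""
    value = env.get("CIRCUIT_BREAKER_ENABLED", "true")
    return value.strip().lower() not in ("false", "0", "no")


def guarded_call(fn, breaker_call, retry_call, env=os.environ):
    """Route through the breaker when enabled, else through retry-with-backoff.

    breaker_call and retry_call are hypothetical wrappers that each take
    the downstream function and invoke it with their own failure handling.
    """
    if breaker_enabled(env):
        return breaker_call(fn)
    return retry_call(fn)


if __name__ == "__main__":
    result = guarded_call(
        lambda: "response",
        breaker_call=lambda f: ("breaker", f()),
        retry_call=lambda f: ("retry", f()),
        env={"CIRCUIT_BREAKER_ENABLED": "false"},
    )
    print(result)
```

Keeping the switch at the call site means the reversal is a configuration change plus a restart rather than a code revert.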
References
- resilience/circuit_breaker.py: ServingCircuitBreaker state machine
- api/rag/pipeline.py: circuit-breaker integration on vLLM call
- observability/alert_rules.yml: CircuitBreakerOpen alert rule
- observability/metrics.py: circuit_breaker_state gauge export
- chaos/trigger_failures.py: scenario #1 (GPU OOM) exercises the breaker
- runbooks/finsight_failure_runbook.md: "circuit breaker open" detection PromQL + immediate action + root-cause investigation
- ADR-001 (the vLLM engine the breaker wraps)
- ADR-002 (the Ray Serve replicas the breaker scopes per-replica)