Circuit breaker over naive retry-with-backoff

✓ Accepted · AI Serving Platform · 04 - Production LLMOps: Monitor, Break & Harden
By AI-DE Engineering Team · Stakeholders: platform engineer, on-call SRE, ML platform lead

Context

The vLLM engine has three predictable failure modes under production load:

  1. GPU OOM under bursty long-context traffic. When 50 concurrent 3500-token requests arrive, KV cache utilization saturates at 100% and subsequent requests fail with CUDA out of memory. Documented in M04's chaos scenario #1.
  2. Model load timeout on cold start. Spinning up a new Ray Serve replica from scratch takes 30–90s for model load and warm-up. Requests that hit a freshly spawned replica exceed the FastAPI client's default 30s timeout.
  3. Vendor outage (Anthropic / OpenAI fallback model). When the LLM gateway falls through to a vendor for cases vLLM can't serve, that vendor can itself be down. Anthropic and OpenAI average roughly one multi-hour outage per quarter.

Two patterns for handling these failures on the request-handler hot path:

  1. Retry with exponential backoff. The naive approach: catch the exception, sleep for backoff_base * 2^retry + jitter, and retry up to N times. It is simple and easy to reason about. Failure mode: under GPU OOM, retries pile more load onto an already-saturated replica. Under cold start, every retry also waits on the same cold replica, so retries compound the problem. Retry storms turn 1× failures into 10× cascades.
  2. Circuit breaker with state machine. Track failures per downstream. After N consecutive failures, open the circuit: fast-fail subsequent requests with a cached fallback or a 503 degraded response instead of retrying. After a recovery timeout, allow probe requests (half-open state). On probe success, close the circuit. On probe failure, re-open.

Circuit breakers are the standard pattern for this exact problem class (see Hystrix, Resilience4j, Polly). Retry-without-circuit is an anti-pattern under the failure modes we have.
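
To make the retry-storm arithmetic concrete, here is a minimal sketch of the backoff formula and of the call volume a hard-failing backend sees under each pattern. The function names, default values, and the cap parameter are illustrative assumptions, not taken from the project code, and the breaker estimate ignores in-flight concurrency and half-open probes.

```python
import random


def backoff_delay(retry: int, base: float = 0.5, cap: float = 30.0,
                  jitter: float = 0.1) -> float:
    """Delay before retry number `retry` (0-indexed): base * 2**retry + jitter.

    The cap and the default values are assumptions made for this sketch.
    """
    delay = min(cap, base * (2 ** retry))
    return delay + random.uniform(0, jitter)


def attempts_with_retries(requests: int, max_retries: int) -> int:
    """Calls reaching a hard-failing backend when every request retries to exhaustion."""
    return requests * (1 + max_retries)


def attempts_with_breaker(requests: int, failure_threshold: int) -> int:
    """Calls reaching the backend before the breaker opens; the rest fast-fail.

    Simplification: requests arrive one after another within a single OPEN window.
    """
    return min(requests, failure_threshold)


if __name__ == "__main__":
    # 50 failing requests with 9 retries each become 500 backend calls (a 10x cascade),
    # while a breaker with failure_threshold=5 lets only 5 through.
    print(attempts_with_retries(50, 9), attempts_with_breaker(50, 5))
```

Under these assumptions the retry loop multiplies load by (1 + max_retries), while the breaker bounds it by failure_threshold regardless of how many requests arrive.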

Decision

We adopt a circuit breaker (resilience/circuit_breaker.py) with a 3-state machine wrapping the vLLM call. State transitions:

                     failure_count >= failure_threshold
        ┌─────────┐ ───────────────────────────────────► ┌─────────┐
        │ CLOSED  │                                      │  OPEN   │ ◄──────────────┐
        └─────────┘                                      └────┬────┘                │
             ▲                                                │ recovery_timeout    │ failure
             │                                                │ elapsed             │ (re-open)
             │                                                ▼                     │
             │   success_count >= success_threshold      ┌──────────┐               │
             └───────────────────────────────────────────┤ HALF_OPEN├───────────────┘
                                                         └──────────┘

In CLOSED, a success keeps the circuit CLOSED and resets the consecutive-failure count.
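
The transitions above can be sketched as a small class. This is an illustration under stated assumptions, not the project's ServingCircuitBreaker: the class name SketchBreaker, the injectable clock, and the single-threaded design (no locking) are all hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpen(Exception):
    def __init__(self, retry_after: float):
        super().__init__(f"circuit open, retry after {retry_after:.1f}s")
        self.retry_after = retry_after


class SketchBreaker:
    """Hypothetical 3-state breaker; not thread-safe, for illustration only."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 success_threshold=2, excluded_exceptions=(), clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.excluded_exceptions = excluded_exceptions
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.opened_at = 0.0

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failure_count = 0
        self.success_count = 0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            remaining = self.recovery_timeout - (self.clock() - self.opened_at)
            if remaining > 0:
                raise CircuitBreakerOpen(retry_after=remaining)  # fast-fail
            self.state = State.HALF_OPEN  # timeout elapsed: admit a probe
        try:
            result = fn(*args, **kwargs)
        except self.excluded_exceptions:
            raise  # client errors pass through without touching the counters
        except Exception:
            if self.state is State.HALF_OPEN:
                self._open()  # failed probe re-opens the circuit
            else:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self._open()
            raise
        if self.state is State.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = State.CLOSED
        else:
            self.failure_count = 0  # a success resets the consecutive count
        return result
```

Injecting the clock keeps the recovery timeout testable without sleeping; a production version would also need a lock around the state transitions.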

Configuration on the vLLM downstream:

# resilience/circuit_breaker.py
vllm_circuit_breaker = ServingCircuitBreaker(
    name="vllm_inference",
    failure_threshold=5,        # 5 consecutive failures → OPEN
    recovery_timeout=30,         # 30s in OPEN before HALF_OPEN probe
    success_threshold=2,         # 2 consecutive successes in HALF_OPEN → CLOSED
    excluded_exceptions=(
        ValidationError,         # client errors don't count toward circuit
    ),
)

When the circuit is OPEN, the RAG pipeline raises CircuitBreakerOpen with a retry_after hint. The handler returns a 503 with a graceful degraded response (cached fallback if available, otherwise an honest "I don't have enough context to answer right now" message).
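
A rough sketch of that handler path follows, written framework-free so it runs on its own. The handle_query function, the dict-based cache, the response shape, and the use of the HTTP Retry-After header to carry the retry_after hint are assumptions for illustration; the real handler lives in the project's API layer.

```python
import math


class CircuitBreakerOpen(Exception):
    """Stand-in for the project's exception; carries the retry_after hint in seconds."""

    def __init__(self, retry_after: float):
        super().__init__("circuit open")
        self.retry_after = retry_after


DEGRADED_MESSAGE = "I don't have enough context to answer right now"


def handle_query(query, run_pipeline, cache):
    """Return (status, headers, body) for one request.

    `run_pipeline` is any callable that may raise CircuitBreakerOpen;
    `cache` is any mapping from query to a previously served answer.
    """
    try:
        answer = run_pipeline(query)
        return 200, {}, {"answer": answer, "degraded": False}
    except CircuitBreakerOpen as exc:
        # Retry-After takes whole seconds, so round the hint up.
        headers = {"Retry-After": str(math.ceil(exc.retry_after))}
        cached = cache.get(query)
        body = {
            "answer": cached if cached is not None else DEGRADED_MESSAGE,
            "degraded": True,
            "from_cache": cached is not None,
        }
        return 503, headers, body
```

The status stays 503 even when a cached answer is served, matching the description above, so clients and dashboards can still tell degraded responses from healthy ones.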

Tradeoffs we accept

Lever                               | Alternative                           | Chosen
------------------------------------|---------------------------------------|---------------------------------------------
Failure recovery                    | Retry-with-backoff                    | Circuit breaker state machine
Behaviour under GPU OOM             | More load on a saturated replica      | Fast-fail with degraded response
Time-to-recover after vendor outage | Each request waits for vendor timeout | Circuit absorbs the wait once; probes resume
User-visible failure                | Slow timeouts (~30s)                  | Fast 503 with retry_after
Tunability                          | retries/backoff                       | failure_threshold + recovery_timeout
Implementation cost                 | A while-loop                          | A state machine + Prometheus exporter

The largest concrete cost is the state machine itself — ~120 lines of code + Prometheus state gauge + per-call overhead (~10 microseconds per call to check current state). We accept the cost because the failure modes the circuit breaker prevents (retry storms, cascading timeouts) are production-incident-grade.

Consequences (positive)

  • GPU OOM doesn't cascade. Once the circuit opens after 5 failures, subsequent requests fast-fail with CircuitBreakerOpen instead of piling on the saturated GPU. The replica has 30s of breathing room to recover before the half-open probe.
  • Cold-start cascade is bounded. ADR-005's chaos scenario #3 shows TTFT > 90s on cold start. With the circuit breaker, subsequent requests during the cold phase fast-fail; they don't wait 90s each.
  • Vendor outage is observable. The Prometheus circuit_breaker_state{name="vllm_inference"} gauge moves from 0 (CLOSED) to 1 (OPEN) and surfaces on the M04 Grafana dashboard. Alert rule CircuitBreakerOpen fires immediately.
  • Graceful degradation path. The handler can return a cached response or an honest 503 instead of an opaque server error. Users see an actionable error message, not a hang.

Consequences (negative)

  • Threshold tuning matters. failure_threshold=5 is a heuristic. Too low, the circuit flaps on transient GPU contention; too high, the circuit opens too late to save the cascade. The runbook documents re-tuning when alert volume changes.
  • Excluded exceptions are workload-specific. Client validation errors shouldn't count toward the circuit; we add them to excluded_exceptions. Adding a new validation type means updating the exclusion list.
  • Half-open probe is real traffic. A user request gets routed to a recovering replica during half-open. The user might see one slow response. We accept this; alternatives (synthetic probes) cost more.
  • Multiple downstreams need multiple breakers. vLLM, RAG retrieval, embedding API, vendor LLM fallback — each gets its own breaker. We end up with 3–4 circuit_breaker_state gauges; documented in the Grafana dashboard panel.
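
To illustrate what the per-downstream gauges look like when scraped, the sketch below renders sample lines in the Prometheus text exposition format using plain string formatting. The value 2 for HALF_OPEN and every breaker name other than vllm_inference are assumptions; the source only defines 0 (CLOSED) and 1 (OPEN), and the project's observability/metrics.py may export the gauge differently.

```python
from enum import Enum


class State(Enum):
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2  # assumption: the ADR only specifies 0 and 1


def gauge_line(name: str, state: State) -> str:
    """One sample line: metric{label="value"} number."""
    return f'circuit_breaker_state{{name="{name}"}} {state.value}'


def render(breakers: dict) -> str:
    """Render all breaker states, one gauge sample per downstream."""
    lines = ["# TYPE circuit_breaker_state gauge"]
    lines += [gauge_line(name, state) for name, state in sorted(breakers.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Breaker names other than vllm_inference are hypothetical.
    print(render({
        "vllm_inference": State.OPEN,
        "rag_retrieval": State.CLOSED,
        "embedding_api": State.CLOSED,
        "vendor_llm_fallback": State.HALF_OPEN,
    }))
```

Keeping one metric name with a name label, rather than one metric per downstream, lets a single alert rule and a single dashboard panel cover every breaker.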

Reversal plan

If circuit-breaker false positives become a recurring problem (e.g. threshold tuning keeps oscillating between too sensitive and too slow), the reversal is:

  1. Set CIRCUIT_BREAKER_ENABLED=false in .env.
  2. vllm_circuit_breaker.call() short-circuits to a direct invocation with retry-with-backoff fallback.
  3. Re-add a Prometheus alert on raw error rate (error_rate{...} > 0.05 for 5m) since we lose the breaker's signal.
  4. Document the new failure mode in the runbook.

Estimated effort: ~2 engineer-days.
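
Steps 1 and 2 of the reversal could look roughly like the following. This is a sketch under assumptions: the helper names, the default of "enabled" when the variable is unset, the accepted falsy spellings, and the retry parameters are all hypothetical, and breaker_call stands in for vllm_circuit_breaker.call.

```python
import os
import time


def breaker_enabled(env=os.environ) -> bool:
    """Read CIRCUIT_BREAKER_ENABLED; assumed to default to enabled when unset."""
    value = env.get("CIRCUIT_BREAKER_ENABLED", "true")
    return value.strip().lower() not in {"false", "0", "no"}


def call_with_retry(fn, retries=3, base=0.5, sleep=time.sleep):
    """Direct invocation with exponential backoff (jitter omitted for brevity)."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base * 2 ** attempt)


def guarded_call(fn, breaker_call, env=os.environ, sleep=time.sleep):
    """Route through the breaker when enabled, else fall back to plain retries."""
    if breaker_enabled(env):
        return breaker_call(fn)
    return call_with_retry(fn, sleep=sleep)
```

Because the flag is checked on every call in this sketch, flipping it only requires the process to see the new environment; whether the real service re-reads .env at runtime or needs a restart is not stated in this ADR.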

References

  • resilience/circuit_breaker.py: ServingCircuitBreaker state machine
  • api/rag/pipeline.py: circuit-breaker integration on vLLM call
  • observability/alert_rules.yml: CircuitBreakerOpen alert rule
  • observability/metrics.py: circuit_breaker_state gauge export
  • chaos/trigger_failures.py: scenario #1 (GPU OOM) exercises the breaker
  • runbooks/finsight_failure_runbook.md: "circuit breaker open" detection PromQL + immediate action + root-cause investigation
  • ADR-001 (the vLLM engine the breaker wraps)
  • ADR-002 (the Ray Serve replicas; the breaker is scoped per replica)