Context
When M01 first shipped, the deployment was deliberately simple: one vLLM replica behind FastAPI, a single docker-compose service, and a single A10G GPU.
The reasoning at the time was sound:
- Get a working endpoint shipping fast. M01's goal is "deploy Mistral-7B behind a production API and benchmark it under real load." The simplest path to that goal is one vLLM container, one GPU, one Locust harness validating <500ms p99.
- vLLM continuous batching makes single-instance impressive. With max_num_seqs=256, a single A10G handles ~500–1000 qpd at the p99 target. That covers the M01 reference workload and a meaningful fraction of M02's optimized scenario.
- Cluster orchestration is a separate problem. Ray Serve, K8s, K3s: all valid choices, all M03 problems. M01 should ship without deciding them.
- Local-prod parity for the demo. A learner can run M01 on a single GPU laptop or a single cloud A10G. Multi-replica deployments require a cluster, which raises the local-dev bar.
This held through M01 and M02. M02's optimization work (RAG, semantic cache, batching tuning) is per-replica work — there's no benefit to multi-replica for those features, and they keep the single-instance story coherent.
What changed
M03 introduced the chaos engineering work. Scenario #3
(chaos/trigger_failures.py::cold_start_cascade) exercises this exact
question: "what happens when traffic spikes hit a cold deployment?"
We ran the scenario on a single-instance setup:
- Scale to 0 replicas (ray serve scale finsight-llm --num-replicas=0).
- Send 40 simultaneous requests through Nginx.
- Measure TTFT (time-to-first-token) per request.
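The measurement steps above can be sketched as a small burst harness. This is a minimal illustration, not the contents of chaos/trigger_failures.py: the names measure_ttft, run_burst, and fake_first_token are hypothetical, and the fake sender stands in for a real streaming request through Nginx. The percentile is nearest-rank and is computed only over requests that returned a first token before the client timeout.

```python
import asyncio
import math
import time
from typing import Awaitable, Callable, List, Optional


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile over an already-sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


async def measure_ttft(send: Callable[[int], Awaitable[None]],
                       request_id: int,
                       timeout_s: float) -> Optional[float]:
    """Seconds until `send` returns its first token, or None on client timeout."""
    start = time.perf_counter()
    try:
        await asyncio.wait_for(send(request_id), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
    return time.perf_counter() - start


async def run_burst(send: Callable[[int], Awaitable[None]],
                    n_requests: int = 40,
                    timeout_s: float = 30.0) -> dict:
    """Fire all requests at once and summarise TTFT for those that completed."""
    results = await asyncio.gather(
        *(measure_ttft(send, i, timeout_s) for i in range(n_requests))
    )
    ok = sorted(r for r in results if r is not None)
    return {
        "sent": n_requests,
        "failures": n_requests - len(ok),
        "p50": percentile(ok, 50) if ok else None,
        "p95": percentile(ok, 95) if ok else None,
        "p99": percentile(ok, 99) if ok else None,
    }


async def fake_first_token(request_id: int) -> None:
    """Hypothetical stand-in for a streaming call: every fourth request stalls."""
    await asyncio.sleep(0.5 if request_id % 4 == 0 else 0.01)


# Scaled-down demo: a 0.1 s timeout instead of the 30 s client timeout.
summary = asyncio.run(run_burst(fake_first_token, n_requests=40, timeout_s=0.1))
print(summary)
```

In a real run, the sender would open a streaming request against the Nginx hostname and return as soon as the first token arrives; the summary shape stays the same.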
Result on the single-instance baseline:
Cold start TTFT distribution (40 concurrent requests, single replica):
p50: 92 s <-- model load + warm-up + first batch
p95: 134 s
p99: 168 s
Failures: 12/40 (FastAPI client timeout at 30s)
A 168-second p99 TTFT is roughly 840 times (nearly three orders of magnitude) over our <200ms p99 SLA.
12 of 40 requests outright timed out. On the FinSight workload — where
analysts ask analytical questions and expect answers within seconds — this
is a user-visible outage.
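The size of the miss and the timeout rate follow directly from the measured numbers. A quick check, using only the figures reported in the run above (the variable names are illustrative):

```python
import math

p99_ttft_ms = 168_000   # cold-start p99 TTFT, 168 s
sla_p99_ms = 200        # p99 SLA target, 200 ms
failures, sent = 12, 40

ratio = p99_ttft_ms / sla_p99_ms          # how many times over the SLA
orders_of_magnitude = math.log10(ratio)   # same gap on a log scale
timeout_rate = failures / sent

print(f"p99 is {ratio:.0f}x the SLA ({orders_of_magnitude:.1f} orders of magnitude)")
print(f"timeout rate: {timeout_rate:.0%}")
```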
Two scenarios that trigger this in production:
- Replica restart after failure. A replica crashes, autoscale spins a new one. The 30–90s model-load gap is when the cascade hits.
- Off-hours → market-hours scaling event. With min_replicas=0 off-hours, the first market-open request triggers cold start. 40+ concurrent analyst queries arrive within the first minute of trading.
The single-instance design has no answer to either. Both are common. That's the regression that killed this ADR.
What we got wrong (and what we'd do again)
Wrong: treating "single-instance is fine for v1" as a steady-state decision. It was a fine starting position for M01's "ship a working endpoint" goal. It was wrong as a steady-state production answer because the cold-start cascade is a real failure mode, not a hypothetical one.
Wrong: under-instrumenting the warm/cold transition. We measured steady-state p99 (200ms) and shipped. We didn't measure cold-start behaviour until M03's chaos scenario forced the question. Lesson: the chaos scenarios should run as part of M01's gate, not as M03's exercise.
Right: keeping single-instance through M01 and M02. Multi-replica adds operational surface (Ray cluster, replica health probes, load-balancer config) that doesn't earn its keep on a single-tenant prototype. We got real production value from single-instance through the optimization work; we paid the orchestration cost only when the workload demanded it.
Right: building the orchestration interface as Ray Serve from the
start. The reversal landed in M03 as a pure addition: ray_serve_app.py
and autoscaling_policy.py are new files; the M01 vLLM container is
unchanged, just deployed as a Ray Serve actor instead of as a
stand-alone container.
How we reversed it
The reversal was a one-week sprint:
- Define the autoscaling policy. serving/autoscaling_policy.py with DEFAULT (min=1, max=4) and MARKET_HOURS (min=2, max=4) configs.
- Add the cron-driven cutover. A Kubernetes CronJob (or a system cron in the v1 deploy) turns MARKET_HOURS_AUTOSCALING on at 9am ET and off at 4pm ET to keep min=2 during trading hours. This eliminates the off-hours → market-open cold start.
- Wrap vLLM as a Ray Serve actor. serving/ray_serve_app.py declares the deployment. Health probes (serving/health_probe.py) validate replicas before routing traffic to them.
- Add Nginx in front. nginx/nginx.conf defines a round-robin upstream with proxy_connect_timeout set to 5s. Stable client-facing hostname.
- Re-run the chaos scenario. Same cold_start_cascade test against the multi-replica deployment.
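The two policy configs and the time-based cutover from the steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual serving/autoscaling_policy.py: the AutoscalingPolicy class, select_policy, and the half-open 9am to 4pm window are hypothetical, and the caller is assumed to pass a wall-clock time already converted to US Eastern. The min_replicas and max_replicas keys match the fields Ray Serve's autoscaling_config accepts.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class AutoscalingPolicy:
    min_replicas: int
    max_replicas: int

    def as_autoscaling_config(self) -> dict:
        """Dict shaped for a Ray Serve deployment's autoscaling_config option."""
        return {
            "min_replicas": self.min_replicas,
            "max_replicas": self.max_replicas,
        }


# Values taken from the ADR text.
DEFAULT = AutoscalingPolicy(min_replicas=1, max_replicas=4)
MARKET_HOURS = AutoscalingPolicy(min_replicas=2, max_replicas=4)

MARKET_OPEN = time(9, 0)     # 9am ET
MARKET_CLOSE = time(16, 0)   # 4pm ET


def select_policy(now_et: datetime) -> AutoscalingPolicy:
    """Pick the policy for a time already expressed in US Eastern.

    Assumption: the window is half-open, [9am, 4pm), so the 4pm cron tick
    falls back to DEFAULT.
    """
    if MARKET_OPEN <= now_et.time() < MARKET_CLOSE:
        return MARKET_HOURS
    return DEFAULT


print(select_policy(datetime(2024, 1, 8, 9, 30)).as_autoscaling_config())
print(select_policy(datetime(2024, 1, 8, 20, 0)).as_autoscaling_config())
```

Keeping the selection a pure function of the clock lets the CronJob (or system cron) simply re-apply whatever select_policy returns, and makes the cutover testable without a running cluster.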
After-reversal numbers:
Cold start TTFT (40 concurrent requests, min_replicas=2 + 1 cold replica
spinning up to handle excess load):
p50: 198 ms <-- existing replica handles the request
p95: 412 ms <-- queue tail at peak burst
p99: 875 ms <-- still inside SLA-degraded budget
Failures: 0/40
Why we reversed it (in one sentence)
Single-instance vLLM has no answer to the cold-start cascade — 168 s
p99 TTFT and 30% timeout rate on a 40-request burst — and the
market-hours min_replicas=2 Ray Serve policy brings TTFT back inside
budget without changing the engine, the model, or the optimization
work.
What this ADR replaces
- The original M01 design assumed a single vLLM container was the production deployment shape. ADR-002 (Ray Serve multi-replica with market-hours autoscale) supersedes that.
- The single-replica docker-compose.serving.yml stays in the repo as the M01 dev-only topology, useful for first-time setup and laptop GPUs. The M03 docker-compose.scaling.yml is the production-shape reference.
References
- serving/autoscaling_policy.py: DEFAULT + MARKET_HOURS_AUTOSCALING
- serving/ray_serve_app.py: multi-replica deployment
- chaos/trigger_failures.py: cold_start_cascade scenario (the regression test that pinned this)
- docker-compose.serving.yml: M01 single-instance (kept as dev-only)
- docker-compose.scaling.yml: M03 multi-replica (production shape)
- runbooks/finsight_failure_runbook.md: "cold-start cascade detected" runbook entry
- ADR-002 (live design)
- ADR-004 (the circuit breaker that absorbs the in-flight cold-start failures)