ADR-005: Single-instance vLLM serving (DEPRECATED) | AI Serving Platform

Context

When M01 first shipped, the deployment was deliberately simple: one vLLM replica behind FastAPI, single docker-compose service, single A10G GPU.

The reasoning at the time was sound:

Get a working endpoint shipping fast. M01's goal is "deploy Mistral-7B behind a production API and benchmark it under real load." The simplest path to that goal is one vLLM container, one GPU, one Locust harness validating <500ms p99.
vLLM continuous batching makes single-instance impressive. With max_num_seqs=256, a single A10G handles ~500–1000 qpd at the p99 target. That covers the M01 reference workload and a meaningful fraction of M02's optimized scenario.
Cluster orchestration is a separate problem. Ray Serve, K8s, K3s — all valid choices, all M03 problems. M01 should ship without deciding them.
Local-prod parity for the demo. A learner can run M01 on a single-GPU laptop or a single cloud A10G. Multi-replica deployments require a cluster, which raises the local-dev bar.

This held through M01 and M02. M02's optimization work (RAG, semantic cache, batching tuning) is per-replica work — there's no benefit to multi-replica for those features, and they keep the single-instance story coherent.

What changed

M03 introduced the chaos engineering work. Scenario #3 (chaos/trigger_failures.py::cold_start_cascade) exercises this exact question: "what happens when traffic spikes hit a cold deployment?"

We ran the scenario on a single-instance setup:

Scale to 0 replicas (ray serve scale finsight-llm --num-replicas=0).
Send 40 simultaneous requests through Nginx.
Measure TTFT (time-to-first-token) per request.
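
The steps above could be driven by a small harness along these lines. This is a minimal sketch, not the code in chaos/trigger_failures.py: fake_stream, measure_ttft, and run_burst are hypothetical names, and fake_stream simulates the cold-start delay with a sleep instead of streaming from the real endpoint through Nginx.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, List, Optional


async def fake_stream(prompt: str, cold_start_s: float = 0.05) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming client call.

    A real harness would open a streaming HTTP request and yield tokens
    as they arrive; here the cold-start delay is simulated with a sleep.
    """
    await asyncio.sleep(cold_start_s)
    for token in ("hello", "world"):
        yield token


async def measure_ttft(
    stream_fn: Callable[[str], AsyncIterator[str]],
    prompt: str,
    timeout_s: float,
) -> Optional[float]:
    """Return seconds until the first token, or None on client timeout."""
    start = time.perf_counter()

    async def first_token() -> Optional[float]:
        async for _ in stream_fn(prompt):
            return time.perf_counter() - start  # stop at the first token
        return None  # stream ended without producing a token

    try:
        return await asyncio.wait_for(first_token(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def run_burst(
    stream_fn: Callable[[str], AsyncIterator[str]],
    n_requests: int = 40,
    timeout_s: float = 30.0,
) -> List[Optional[float]]:
    """Fire n_requests concurrently and collect one TTFT (or None) each."""
    tasks = [measure_ttft(stream_fn, f"query {i}", timeout_s) for i in range(n_requests)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    results = asyncio.run(run_burst(fake_stream))
    failures = sum(1 for r in results if r is None)
    print(f"Failures: {failures}/{len(results)}")
```

Recording a timeout as None rather than as a capped latency keeps failed requests out of the percentile math, so they can be reported separately as a failure count.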

Result on the single-instance baseline:

Cold start TTFT distribution (40 concurrent requests, single replica):
  p50:  92 s   <-- model load + warm-up + first batch
  p95: 134 s
  p99: 168 s
  Failures: 12/40 (FastAPI client timeout at 30s)

A 168-second p99 TTFT is roughly 840× our <200ms p99 SLA, nearly three orders of magnitude over budget. 12 of 40 requests timed out outright. On the FinSight workload, where analysts expect answers within seconds, this is a user-visible outage.
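
The size of the gap between the measured p99 and the SLA can be checked with a few lines of arithmetic. The sketch below uses only the figures reported above (168 s, 200 ms, 12 of 40); sla_overshoot is a hypothetical helper name.

```python
import math
from typing import Tuple


def sla_overshoot(measured_s: float, sla_s: float) -> Tuple[float, float]:
    """Return (ratio, orders_of_magnitude) of a measurement over its SLA."""
    ratio = measured_s / sla_s
    return ratio, math.log10(ratio)


if __name__ == "__main__":
    ratio, orders = sla_overshoot(measured_s=168.0, sla_s=0.200)
    timeout_rate = 12 / 40  # failures / total requests in the burst
    print(f"{ratio:.0f}x over SLA, ~{orders:.1f} orders of magnitude, "
          f"{timeout_rate:.0%} timed out")
    # prints: 840x over SLA, ~2.9 orders of magnitude, 30% timed out
```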

Two scenarios that trigger this in production:

Replica restart after failure. A replica crashes, autoscale spins a new one. The 30–90s model-load gap is when the cascade hits.
Off-hours → market-hours scaling event. With min_replicas=0 off-hours, the first market-open request triggers cold start. 40+ concurrent analyst queries arrive within the first minute of trading.

The single-instance design has no answer to either. Both are common. That's the regression that killed this ADR.

What we got wrong (and what we'd do again)

Wrong: treating "single-instance is fine for v1" as a steady-state decision. It was a fine starting position for M01's "ship a working endpoint" goal. It was wrong as a steady-state production answer because the cold-start cascade is a real failure mode, not a hypothetical one.

Wrong: under-instrumenting the warm/cold transition. We measured steady-state p99 (200ms) and shipped. We didn't measure cold-start behaviour until M03's chaos scenario forced the question. Lesson: the chaos scenarios should run as part of M01's gate, not as M03's exercise.

Right: keeping single-instance through M01 and M02. Multi-replica adds operational surface (Ray cluster, replica health probes, load-balancer config) that doesn't earn its keep on a single-tenant prototype. We got real production value from single-instance through the optimization work; we paid the orchestration cost only when the workload demanded it.

Right: building the orchestration interface as Ray Serve from the start. The reversal landed in M03 as a pure addition: ray_serve_app.py and autoscaling_policy.py are new files; the M01 vLLM container is unchanged, just deployed as a Ray Serve actor instead of as a stand-alone container.

How we reversed it

The reversal was a one-week sprint:

Define the autoscaling policy. serving/autoscaling_policy.py with DEFAULT (min=1, max=4) and MARKET_HOURS (min=2, max=4) configs.
Add the cron-driven cutover. A Kubernetes CronJob (or a system cron in the v1 deploy) switches MARKET_HOURS_AUTOSCALING on at 9am ET and off at 4pm ET, keeping min=2 during trading hours. This eliminates the off-hours → market-open cold start.
Wrap vLLM as a Ray Serve actor. serving/ray_serve_app.py declares the deployment. Health probes (serving/health_probe.py) validate replicas before routing traffic to them.
Add Nginx in front. nginx/nginx.conf defines a round-robin upstream with proxy_connect_timeout set to 5s, giving clients a stable hostname regardless of how many replicas sit behind it.
Re-run the chaos scenario. Same cold_start_cascade test against the multi-replica deployment.
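
The first two steps above can be sketched as a pair of policy constants plus a time-window check. This is an illustration under stated assumptions, not the contents of serving/autoscaling_policy.py: the DEFAULT and MARKET_HOURS values come from the list above, while AutoscalingPolicy, policy_for, and as_autoscaling_config are hypothetical names. The function expects a wall-clock time already expressed in ET, so time-zone and DST conversion is left to the caller (for example, the cron job).

```python
from dataclasses import dataclass
from datetime import time
from typing import Dict


@dataclass(frozen=True)
class AutoscalingPolicy:
    min_replicas: int
    max_replicas: int

    def as_autoscaling_config(self) -> Dict[str, int]:
        # Assumption: the dict is passed as a Ray Serve autoscaling_config,
        # which accepts min_replicas and max_replicas keys.
        return {"min_replicas": self.min_replicas, "max_replicas": self.max_replicas}


# Values taken from the ADR text.
DEFAULT = AutoscalingPolicy(min_replicas=1, max_replicas=4)
MARKET_HOURS = AutoscalingPolicy(min_replicas=2, max_replicas=4)

# Window boundaries as stated in the ADR (9am ET to 4pm ET).
MARKET_OPEN = time(9, 0)
MARKET_CLOSE = time(16, 0)


def policy_for(et_wall_clock: time) -> AutoscalingPolicy:
    """Pick the policy for a wall-clock time expressed in ET.

    The window is half-open: 9:00 is inside, 16:00 is outside.
    """
    if MARKET_OPEN <= et_wall_clock < MARKET_CLOSE:
        return MARKET_HOURS
    return DEFAULT


if __name__ == "__main__":
    print(policy_for(time(8, 59)))   # off-hours policy
    print(policy_for(time(9, 0)))    # market-hours policy
```

Keeping min_replicas at 2 inside the window means a single replica crash still leaves one warm replica serving while the replacement loads the model.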

After-reversal numbers:

Cold start TTFT (40 concurrent requests, min_replicas=2 + 1 cold replica
                 spinning up to handle excess load):
  p50: 198 ms  <-- existing replica handles the request
  p95: 412 ms  <-- queue tail at peak burst
  p99: 875 ms  <-- still inside SLA-degraded budget
  Failures: 0/40
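
For reading these summaries, it helps to see how a percentile falls out of only 40 samples. The sketch below uses the nearest-rank definition; whether the chaos harness uses this definition or an interpolating one is not stated in this ADR, and nearest_rank_percentile is a hypothetical name.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of samples for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Placeholder data: 40 synthetic TTFT values, 1..40, for illustration only.
    ttfts = [float(i) for i in range(1, 41)]
    for pct in (50, 95, 99):
        print(f"p{pct}: {nearest_rank_percentile(ttfts, pct)}")
    # prints p50: 20.0, p95: 38.0, p99: 40.0
```

Under this definition, the p99 of a 40-request burst is simply the slowest request, so a single straggler sets the reported tail.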

Why we reversed it (in one sentence)

Single-instance vLLM has no answer to the cold-start cascade — 168 s p99 TTFT and 30% timeout rate on a 40-request burst — and the market-hours min_replicas=2 Ray Serve policy brings TTFT back inside budget without changing the engine, the model, or the optimization work.

What this ADR replaces

The original M01 design assumed a single vLLM container was the production deployment shape. ADR-002 (Ray Serve multi-replica with market-hours autoscale) supersedes that.
The single-replica docker-compose.serving.yml stays in the repo as the M01 dev-only topology — useful for first-time setup and laptop GPUs. The M03 docker-compose.scaling.yml is the production-shape reference.

References

serving/autoscaling_policy.py — DEFAULT + MARKET_HOURS_AUTOSCALING
serving/ray_serve_app.py — multi-replica deployment
chaos/trigger_failures.py — cold_start_cascade scenario (the regression test that pinned this)
docker-compose.serving.yml — M01 single-instance (kept as dev-only)
docker-compose.scaling.yml — M03 multi-replica (production shape)
runbooks/finsight_failure_runbook.md — "cold-start cascade detected" runbook entry
ADR-002 (live design)
ADR-004 (the circuit breaker that absorbs the in-flight cold-start failures)