Context
By M03 we need the inference layer to scale beyond a single GPU replica. The reference scenario is 5 tenants × 2k qpd = 10k qpd with bursty arrival patterns concentrated 9am–4pm ET (market hours). A single A10G handles ~500–1000 qpd at the p99 SLA target; we need 1–4 replicas with automatic scale-up on queue depth and scale-down on idle.
Three orchestration options:
- Kubernetes + KEDA (or HPA + custom metrics). Industry standard. Replica count driven by custom Prometheus metrics (queue depth, p99 latency). Cost: a full K8s control plane (eks/gke/aks), a custom metrics adapter, manifests for Service/Deployment/HPA/PDB, image registry plumbing, plus the ops cost of running and patching K8s.
- Plain Docker Compose with manual scaling. Simplest. No autoscale — set N replicas manually, restart compose to change. Fine for v1, fails the bursty-load requirement.
- Ray Serve. Python-native serving framework. Autoscaling policy as
a Python config object, deployed via
ray serveCLI on a Ray cluster (head + workers). Per-replica health probes, request load balancing, model load orchestration built-in. Local dev runs indocker compose; prod runs on the same cluster topology, no manifest rewrites.
The K8s path is the right answer at large scale. At our scale (1–4 replicas, single-region, stateless replicas), the operational overhead of running K8s plus the manifest sprawl plus the custom-metrics adapter plumbing exceeds the benefit. The autoscaling policy is the same information either way; we'd be paying K8s tax to express it.
Decision
We adopt Ray Serve as the orchestration layer for v1. Replica count
driven by target_ongoing_requests=10 per replica with explicit
scale-up/down delays, plus a market-hours minimum-replicas policy.
# serving/autoscaling_policy.py
from ray.serve.config import AutoscalingConfig
DEFAULT_AUTOSCALING = AutoscalingConfig(
min_replicas=1,
max_replicas=4,
target_ongoing_requests=10, # scale-up trigger
upscale_delay_s=30, # debounce against burst-induced flapping
downscale_delay_s=300, # 5min idle before scale-down (preserve warmth)
smoothing_factor=0.5, # dampens oscillation
)
# Market-hours override (cron-driven)
MARKET_HOURS_AUTOSCALING = AutoscalingConfig(
min_replicas=2, # always-on during 9am-4pm ET
max_replicas=4,
target_ongoing_requests=10,
upscale_delay_s=30,
downscale_delay_s=300,
)
Nginx sits in front of Ray Serve as the L7 load balancer (round-robin across replicas), so the client always hits a stable hostname.
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Operational surface | Full K8s + KEDA + manifests | Ray cluster (head + workers) + Python config |
| Local-prod parity | K8s in Kind/Minikube locally | Same docker-compose.scaling.yml for dev + prod-shape |
| Autoscale policy expression | YAML manifests + custom metrics adapter | Python AutoscalingConfig object |
| Health probe design | K8s liveness/readiness | Ray Serve built-in health_probe (1-token ping) |
| Long-term scale ceiling | K8s scales to 1000s of pods cleanly | Ray Serve fine to ~50 replicas; K8s migration documented |
The largest concrete cost is the scale ceiling: Ray Serve is fine for
single-region 1–50 replica deployments but doesn't have K8s' multi-region,
pod-disruption-budget, or fine-grained scheduling primitives. We accept
the ceiling because we're nowhere near it; we document the K8s migration
path in runbooks/finsight_failure_runbook.md for when we cross it.
Consequences (positive)
- Same topology dev + prod.
docker-compose.scaling.ymlboots a single-host Ray head + 2 workers + 2 Serve replicas locally; the prod cluster is the same shape, just on bigger nodes. - Autoscaling-as-Python. The policy lives in
autoscaling_policy.pynext to the serving code. Diffs are PR-reviewable like any other code change. No round-trip through ops to update YAML. - Built-in health probes. Ray Serve evicts unhealthy replicas via
health_probe.py(1-token ping with 10s timeout, 2 consecutive failures → eviction). We didn't have to invent a custom probe. - Market-hours policy is a single cron-driven cutover between two AutoscalingConfig objects; we set min=2 during 9am–4pm ET to avoid the cold-start cascade ADR-005 documents.
Consequences (negative)
- Ray cluster is a separate operational surface. Ray head node failure takes the cluster down; we mitigate with a sidecar restart policy in the compose file and a Prometheus alert on the dashboard's heartbeat.
- No multi-region story. Ray Serve clusters are single-region by design. If we expand to multi-region we'll need either federated Ray clusters (experimental) or the K8s migration path.
- Autoscaling parameters are coupled. Changing
target_ongoing_requestswithout re-tuningupscale_delay_scauses oscillation. The runbook documents the recalibration procedure. - Cold start is real. Spinning a new replica = ~30–90s model load. ADR-005 documents the cold-start cascade we hit and the market-hours-min=2 mitigation.
Reversal plan
If we hit the Ray Serve scale ceiling (>50 replicas, multi-region traffic, or per-tenant SLA isolation), the reversal is to migrate to K8s + KEDA:
- Wrap each Ray Serve replica's vLLM container as a K8s Deployment.
- Translate
AutoscalingConfig→ KEDA ScaledObject (target metric = queue depth from Prometheus). - Replace Nginx with a K8s Service + Ingress.
- Cut over per-tenant via Ingress annotation; keep Ray for legacy tenants until cutover is complete.
Estimated effort: ~3–4 engineer-weeks. The ServingCircuitBreaker (ADR-004) absorbs the cutover so users don't see failures during migration.
References
serving/autoscaling_policy.py— DEFAULT + MARKET_HOURS configsserving/ray_serve_app.py— Ray Serve deployment definitionserving/health_probe.py— replica health-check logicnginx/nginx.conf— round-robin upstreamdocker-compose.scaling.yml— local-prod-parity topologyrunbooks/finsight_failure_runbook.md— runbook entries for cluster failure modes- ADR-001 (the vLLM engine that Ray Serve orchestrates)
- ADR-005 (the cold-start cascade that forced market-hours min=2)