Ray Serve over Kubernetes for autoscaling

Status: Accepted · AI Serving Platform · 03 - Scale & Stream: Multi-Replica Ray Serve + SSE
By AI-DE Engineering Team · Stakeholders: platform engineer, infra lead, on-call SRE

Context

By M03 we need the inference layer to scale beyond a single GPU replica. The reference scenario is 5 tenants × 2k qpd = 10k qpd with bursty arrival patterns concentrated 9am–4pm ET (market hours). A single A10G handles ~500–1000 qpd at the p99 SLA target; we need 1–4 replicas with automatic scale-up on queue depth and scale-down on idle.

Three orchestration options:

  1. Kubernetes + KEDA (or HPA + custom metrics). Industry standard. Replica count is driven by custom Prometheus metrics (queue depth, p99 latency). Cost: a full K8s control plane (EKS/GKE/AKS), a custom metrics adapter, manifests for Service/Deployment/HPA/PDB, image registry plumbing, plus the ops cost of running and patching K8s.
  2. Plain Docker Compose with manual scaling. Simplest. No autoscale — set N replicas manually, restart compose to change. Fine for v1, fails the bursty-load requirement.
  3. Ray Serve. Python-native serving framework. The autoscaling policy is a Python config object, deployed with the Ray Serve CLI (the serve command) on a Ray cluster (head + workers). Per-replica health probes, request load balancing, and model-load orchestration are built in. Local dev runs in Docker Compose; prod runs on the same cluster topology, with no manifest rewrites.

The K8s path is the right answer at large scale. At our scale (1–4 replicas, single-region, stateless replicas), the combined cost of running K8s, maintaining the manifests, and wiring up the custom-metrics adapter exceeds the benefit. The autoscaling policy carries the same information either way; we'd be paying the K8s tax just to express it.

Decision

We adopt Ray Serve as the orchestration layer for v1. Replica count is driven by target_ongoing_requests=10 per replica with explicit scale-up and scale-down delays, plus a market-hours policy that raises the minimum replica count.

# serving/autoscaling_policy.py
from ray.serve.config import AutoscalingConfig

DEFAULT_AUTOSCALING = AutoscalingConfig(
    min_replicas=1,
    max_replicas=4,
    target_ongoing_requests=10,        # target avg in-flight requests per replica
    upscale_delay_s=30,                # debounce against burst-induced flapping
    downscale_delay_s=300,             # 5min idle before scale-down (preserve warmth)
    smoothing_factor=0.5,              # dampens oscillation
)

# Market-hours override (cron-driven)
MARKET_HOURS_AUTOSCALING = AutoscalingConfig(
    min_replicas=2,                    # always-on during 9am-4pm ET
    max_replicas=4,
    target_ongoing_requests=10,
    upscale_delay_s=30,
    downscale_delay_s=300,
)
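
To make the effect of target_ongoing_requests and smoothing_factor more concrete, here is a minimal sketch of a target-tracking rule of the kind these parameters describe. It is a simplification written for illustration, not Ray Serve's actual implementation: the function name desired_replicas is hypothetical, and the upscale/downscale delays are deliberately not modelled.

```python
import math


def desired_replicas(
    current_replicas: int,
    total_ongoing_requests: float,
    target_ongoing_requests: float = 10,
    min_replicas: int = 1,
    max_replicas: int = 4,
    smoothing_factor: float = 0.5,
) -> int:
    """Simplified target-tracking rule (illustrative assumption, not Ray's code)."""
    if current_replicas <= 0:
        # Nothing running yet: start from the configured floor.
        return min_replicas

    # How far the average per-replica load is from the target (1.0 = on target).
    per_replica = total_ongoing_requests / current_replicas
    error_ratio = per_replica / target_ongoing_requests

    # The smoothing factor shrinks the correction, which dampens oscillation.
    smoothed_ratio = 1 + (error_ratio - 1) * smoothing_factor

    desired = math.ceil(current_replicas * smoothed_ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # One replica carrying 30 in-flight requests against a target of 10.
    print(desired_replicas(current_replicas=1, total_ongoing_requests=30))
```

With smoothing_factor=0.5, a replica running at three times the target asks for one extra replica per decision rather than jumping straight to three, which is the damping the config comment refers to.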

Nginx sits in front of Ray Serve as the L7 load balancer (round-robin across replicas), so the client always hits a stable hostname.

Tradeoffs we accept

Lever | Alternative | Chosen
Operational surface | Full K8s + KEDA + manifests | Ray cluster (head + workers) + Python config
Local-prod parity | K8s in Kind/Minikube locally | Same docker-compose.scaling.yml for dev + prod-shape
Autoscale policy expression | YAML manifests + custom metrics adapter | Python AutoscalingConfig object
Health probe design | K8s liveness/readiness | Ray Serve built-in health_probe (1-token ping)
Long-term scale ceiling | K8s scales to 1000s of pods cleanly | Ray Serve fine to ~50 replicas; K8s migration documented

The largest concrete cost is the scale ceiling: Ray Serve is fine for single-region 1–50 replica deployments but doesn't have K8s' multi-region, pod-disruption-budget, or fine-grained scheduling primitives. We accept the ceiling because we're nowhere near it; we document the K8s migration path in runbooks/finsight_failure_runbook.md for when we cross it.

Consequences (positive)

  • Same topology dev + prod. docker-compose.scaling.yml boots a single-host Ray head + 2 workers + 2 Serve replicas locally; the prod cluster is the same shape, just on bigger nodes.
  • Autoscaling-as-Python. The policy lives in autoscaling_policy.py next to the serving code. Diffs are PR-reviewable like any other code change. No round-trip through ops to update YAML.
  • Built-in health probes. Ray Serve evicts unhealthy replicas via health_probe.py (1-token ping with 10s timeout, 2 consecutive failures → eviction). We didn't have to invent a custom probe.
  • Market-hours policy is a single cron-driven cutover between two AutoscalingConfig objects; we set min=2 during 9am–4pm ET to avoid the cold-start cascade ADR-005 documents.
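
The cron-driven cutover in the last bullet can be sketched as a small selector that the scheduled job calls to decide which config to apply. This is a hedged illustration only: select_policy_name is a hypothetical helper, the weekday-only rule is an assumption (the ADR states only 9am–4pm ET), and how the chosen config is pushed to the running deployment is left out.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

MARKET_TZ = ZoneInfo("America/New_York")  # "ET" in the ADR
MARKET_OPEN = time(9, 0)
MARKET_CLOSE = time(16, 0)


def select_policy_name(now: datetime) -> str:
    """Return the name of the autoscaling config that should be active at `now`.

    `now` must be timezone-aware; it is converted to Eastern Time before comparing.
    """
    local = now.astimezone(MARKET_TZ)
    # Assumption: markets are closed on weekends, so only Mon-Fri qualify.
    is_weekday = local.weekday() < 5
    # Open is inclusive, close is exclusive.
    in_window = MARKET_OPEN <= local.time() < MARKET_CLOSE
    if is_weekday and in_window:
        return "MARKET_HOURS_AUTOSCALING"
    return "DEFAULT_AUTOSCALING"


if __name__ == "__main__":
    print(select_policy_name(datetime.now(tz=ZoneInfo("UTC"))))
```

Converting to America/New_York rather than hard-coding a UTC offset keeps the window correct across daylight-saving changes.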

Consequences (negative)

  • Ray cluster is a separate operational surface. Ray head node failure takes the cluster down; we mitigate with a sidecar restart policy in the compose file and a Prometheus alert on the dashboard's heartbeat.
  • No multi-region story. Ray Serve clusters are single-region by design. If we expand to multi-region we'll need either federated Ray clusters (experimental) or the K8s migration path.
  • Autoscaling parameters are coupled. Changing target_ongoing_requests without re-tuning upscale_delay_s causes oscillation. The runbook documents the recalibration procedure.
  • Cold start is real. Spinning up a new replica takes ~30–90s of model load time. ADR-005 documents the cold-start cascade we hit and the market-hours min=2 mitigation.

Reversal plan

If we hit the Ray Serve scale ceiling (>50 replicas, multi-region traffic, or per-tenant SLA isolation), the reversal is to migrate to K8s + KEDA:

  1. Wrap each Ray Serve replica's vLLM container as a K8s Deployment.
  2. Translate AutoscalingConfig → KEDA ScaledObject (target metric = queue depth from Prometheus).
  3. Replace Nginx with a K8s Service + Ingress.
  4. Cut over per-tenant via Ingress annotation; keep Ray for legacy tenants until cutover is complete.
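
Step 2 could look roughly like the sketch below, which builds a KEDA ScaledObject manifest as a plain Python dict from the same numbers the AutoscalingConfig carries. Several things here are assumptions for illustration: the helper name to_keda_scaled_object, the Deployment name, the Prometheus address, and the queue-depth query are all hypothetical. Mapping the Ray delays onto HPA stabilization windows is an approximation, not an exact semantic match.

```python
import json


def to_keda_scaled_object(
    deployment_name: str,
    min_replicas: int,
    max_replicas: int,
    target_ongoing_requests: int,
    upscale_delay_s: int,
    downscale_delay_s: int,
    prometheus_url: str = "http://prometheus:9090",   # hypothetical address
    query: str = "sum(serving_queue_depth)",          # hypothetical metric name
) -> dict:
    """Build a KEDA ScaledObject manifest (as a dict) from autoscaling parameters."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment_name}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment_name},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "advanced": {
                "horizontalPodAutoscalerConfig": {
                    "behavior": {
                        # Approximate stand-ins for the Ray Serve delays.
                        "scaleUp": {"stabilizationWindowSeconds": upscale_delay_s},
                        "scaleDown": {"stabilizationWindowSeconds": downscale_delay_s},
                    }
                }
            },
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        # KEDA expects trigger metadata values as strings.
                        "threshold": str(target_ongoing_requests),
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = to_keda_scaled_object("vllm-serving", 1, 4, 10, 30, 300)
    print(json.dumps(manifest, indent=2))
```

The dict can be serialized to JSON or YAML and applied like any other manifest; the smoothing_factor has no direct KEDA equivalent in this sketch and would need separate tuning.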

Estimated effort: ~3–4 engineer-weeks. The ServingCircuitBreaker (ADR-004) absorbs the cutover so users don't see failures during migration.

References

  • serving/autoscaling_policy.py — DEFAULT + MARKET_HOURS configs
  • serving/ray_serve_app.py — Ray Serve deployment definition
  • serving/health_probe.py — replica health-check logic
  • nginx/nginx.conf — round-robin upstream
  • docker-compose.scaling.yml — local-prod-parity topology
  • runbooks/finsight_failure_runbook.md — runbook entries for cluster failure modes
  • ADR-001 (the vLLM engine that Ray Serve orchestrates)
  • ADR-005 (the cold-start cascade that forced market-hours min=2)