Run a cost-aware AI serving platform that survives a Locust storm
Ship a production inference platform anchored on serving quality: vLLM continuous batching with measured 1800ms → 127ms latency cuts, Ray Serve autoscale with market-hours policy, semantic cache (35% hit rate), ServingCircuitBreaker, 5 chaos failure scenarios with a runbook, and a runnable cost model. Module 01 unlocks with PRO; the platform unlocks with EXPERT.
The inference-serving portfolio piece for staff AI infra / ML platform roles — 5 committed ADRs, a runnable cost-model CSV with break-even-vs-OpenAI math, a 4-config latency-cost tradeoff bench, and a chaos runbook you can defend in an architecture review.
- vLLM endpoint serving Mistral-7B with PagedAttention + prefix caching (max_num_seqs=256, A10G)
- FastAPI gateway with auth + rate limit + Prometheus middleware + multi-stage Docker
- Redis semantic cache (HNSW · cosine ≥ 0.92) with volatility-aware TTL
- Ray Serve autoscale (1–4 replicas) with market-hours policy + Nginx load balancer
- SSE streaming + agent session management (Redis · TTL 3600s · max 20 turns)
- Prometheus + Grafana 7-panel dashboard + 6 alert rules + OpenTelemetry tracing
- ServingCircuitBreaker state machine + 5 chaos scenarios + on-call runbook
- 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
Module 01 unlocks with PRO. The platform unlocks with EXPERT.
Module 01 (~2h) ships a working vLLM endpoint with Mistral-7B, FastAPI gateway, Docker multi-stage build, Locust harness, and 4 smoke tests — included with PRO. Modules 02–04 (~9h) layer the semantic cache + RAG (M02), Ray Serve autoscale + SSE streaming + sessions (M03), Prometheus + circuit breaker + chaos engineering (M04) — and unlock with EXPERT.
Foundation. Optimization. Production.
Each phase ends with a tagged release, a passing Locust run, and a measurable latency-or-cost delta on the FinSight workload. No ambiguity about where you are.
Working vLLM endpoint live locally. Mistral-7B + FastAPI + Locust harness on a single A10G; <500ms p99 baseline measured.
- ✓ vLLM engine with PagedAttention + prefix caching configured
- ✓ FastAPI gateway + multi-stage Docker + 4 smoke tests passing
- ✓ Locust harness with p50/p95/p99 latency reporting at 20 concurrent users
Latency p99 cut from 1800ms baseline to 127ms cached on the FinSight workload. Ray Serve autoscale live with market-hours policy; SSE streaming endpoint with TTFT < 100ms.
- ✓ pgvector RAG + Redis semantic cache (35% hit rate measured)
- ✓ 4-config tradeoff bench (no cache / +batching / +RAG / +cache)
- ✓ Ray Serve multi-replica deployment with autoscaling_policy.py
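To make the semantic-cache idea concrete, here is a minimal sketch of a similarity-threshold lookup with a volatility-aware TTL. It is not the repo's SemanticCache: the class name `ToySemanticCache`, the linear scan (standing in for a Redis HNSW index), and the hand-written embeddings are all assumptions for illustration. The 0.92 cosine threshold and the 5min / 1day TTL split come from this page; mapping "volatile" answers to the 5-minute TTL is assumed.

```python
import math
import time
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.92      # cosine threshold quoted on this page
TTL_VOLATILE_S = 5 * 60          # assumed: fast-moving answers live 5 minutes
TTL_STABLE_S = 24 * 60 * 60      # assumed: slow-moving answers live 1 day


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: list
    answer: str
    expires_at: float


class ToySemanticCache:
    """Hypothetical in-memory stand-in; a real deployment would query a vector index."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self.entries = []

    def put(self, embedding, answer, volatile, now=None):
        now = time.time() if now is None else now
        ttl = TTL_VOLATILE_S if volatile else TTL_STABLE_S
        self.entries.append(Entry(list(embedding), answer, now + ttl))

    def get(self, embedding, now=None):
        now = time.time() if now is None else now
        # Drop expired entries before searching.
        self.entries = [e for e in self.entries if e.expires_at > now]
        best, best_score = None, self.threshold
        for entry in self.entries:
            score = cosine(embedding, entry.embedding)
            if score >= best_score:
                best, best_score = entry, score
        return best.answer if best else None
```

A near-duplicate question returns the cached answer, a dissimilar one misses, and a volatile entry stops matching once its 5-minute TTL has passed.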
Prometheus + Grafana + OTel observability stack live; ServingCircuitBreaker wired to vLLM call site; 5 chaos scenarios drilled.
- ✓ Grafana dashboard with 7 panels + 6 alert rules wired in
- ✓ ServingCircuitBreaker state machine + cached fallback in RAG pipeline
- ✓ Cost-model CLI + chaos runbook + architecture diagram (staff capstone)
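The circuit-breaker item above can be pictured as a small three-state machine. The sketch below is an illustration under assumptions, not the repo's ServingCircuitBreaker: the class name `CircuitBreakerSketch`, the `call(primary, fallback)` signature, and the injectable clock are hypothetical. The three states and the failure_threshold=5 / recovery_timeout=30s defaults are the values quoted elsewhere on this page.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, calls go to the primary
    OPEN = "open"            # primary is skipped, fallback is served
    HALF_OPEN = "half_open"  # one trial call is allowed through


class CircuitBreakerSketch:
    """Hypothetical breaker; thresholds mirror the values quoted on this page."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, fallback):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                return fallback()          # still cooling down: serve degraded answer
            self.state = State.HALF_OPEN   # timeout elapsed: allow a trial call
        try:
            result = primary()
        except Exception:
            self._on_failure()
            return fallback()
        # Any success closes the breaker and resets the failure count.
        self.state = State.CLOSED
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

In this sketch `primary` would wrap the vLLM call and `fallback` would return a cached answer, so callers get a degraded response instead of an error while the breaker is open.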
One command. Local FastAPI + Redis (no GPU needed for smoke).
What lives in the repo
You get the unified production code on day one — vLLM as the inference engine (toggleable to MockLLM for CPU-only smoke), Ray Serve as the orchestration layer, FastAPI gateway, pgvector for RAG, Redis for semantic cache + sessions + rate-limit fallback, plus Prometheus + Grafana + OpenTelemetry for the full observability stack.
- docker-compose.serving.yml + scaling.yml — single-instance (M01 dev) and multi-replica (M03 prod) topologies
- serving/vllm_config.py + ray_serve_app.py — engine tuning (max_num_seqs, gpu_memory_utilization) + autoscale policy
- api/cache/semantic_cache.py + rag/pipeline.py — Redis HNSW semantic cache + pgvector RAG pipeline (4 versions)
- observability/ + resilience/ — Prometheus + Grafana dashboard + alert rules + circuit breaker
- chaos/trigger_failures.py + runbooks/ — 5 failure scenarios + markdown runbook (M04 capstone)
- docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
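As a rough picture of what a market-hours autoscale policy decides, the sketch below computes a replica count from in-flight requests and the local time. It is plain Python, not Ray Serve's API and not the repo's autoscaling_policy.py. The 1–4 replica range, the market-hours minimum of 2, and target_ongoing_requests=10 come from this page; the function names and the 09:30–16:00 weekday trading window are assumptions.

```python
import math
from datetime import datetime, time

MIN_REPLICAS_OFF_HOURS = 1    # lower bound of the 1-4 range quoted on this page
MIN_REPLICAS_MARKET = 2       # market-hours floor quoted on this page
MAX_REPLICAS = 4
TARGET_ONGOING_REQUESTS = 10  # in-flight requests each replica should carry
MARKET_OPEN = time(9, 30)     # assumption: exchange-local trading window
MARKET_CLOSE = time(16, 0)    # assumption


def is_market_hours(local_now):
    """True on weekdays inside the assumed trading window (exchange-local time)."""
    return local_now.weekday() < 5 and MARKET_OPEN <= local_now.time() < MARKET_CLOSE


def desired_replicas(ongoing_requests, local_now):
    """Scale to load, clamped between the time-dependent floor and the max."""
    floor = MIN_REPLICAS_MARKET if is_market_hours(local_now) else MIN_REPLICAS_OFF_HOURS
    wanted = math.ceil(ongoing_requests / TARGET_ONGOING_REQUESTS)
    return max(floor, min(MAX_REPLICAS, wanted))
```

The raised floor during market hours trades a little idle GPU cost for not paying a cold start when the opening burst arrives.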
AI Serving Platform Starter Kit
Pre-built serving platform with vLLM engine config, Ray Serve autoscale policy, semantic cache, RAG pipeline, Prometheus + Grafana stack, ServingCircuitBreaker, 5 chaos scenarios + runbook, and a 200-row FinSight financial-Q&A corpus + 100-row pgvector seed + 50-prompt Locust deck. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.
The same vLLM endpoint — but built for the SLA case.
Most serving tutorials stop at "I deployed a model." This shows what changes when 100 concurrent users are sending bursty traffic, the bill is real, and you’re on call when the GPU OOMs.
- vLLM continuous batching + PagedAttention (max_num_seqs=256): ADR-001
- Ray Serve autoscale (1–4 replicas, market-hours min=2) + Nginx LB: ADR-002
- ServingCircuitBreaker state machine (CLOSED/OPEN/HALF_OPEN) + cached fallback: ADR-004
- Chaos runbook (finsight_failure_runbook.md)
Write the ADRs staff serving engineers actually get judged on.
Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the single-instance → Ray-Serve-multi-replica reversal that the cold-start cascade chaos test forced (TTFT 168s → 875ms p99). Preview ADR-001 →
vLLM continuous batching + PagedAttention over Triton/TensorRT
vLLM with max_num_seqs=256, gpu_memory_utilization=0.85, prefix caching, block_size=16
Ray Serve over Kubernetes for autoscaling
Ray Serve with AutoscalingConfig (target_ongoing_requests=10) + market-hours min=2 cron
Redis semantic cache (HNSW + threshold=0.92) over exact-match LRU
SemanticCache with all-MiniLM-L6-v2 embedder + cosine 0.92 + volatility TTL split (5min / 1day)
Circuit breaker over naive retry-with-backoff
ServingCircuitBreaker 3-state machine (CLOSED/OPEN/HALF_OPEN) with failure_threshold=5, recovery_timeout=30s
Read the GPU economics story, not just the latency one.
Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 5-tenant reference load (10k qpd · 300k req/mo), real AWS A10G + RDS + ElastiCache list prices + OpenAI gpt-4o break-even, with the autoscale and semantic-cache levers wired up. The version you defend to a CFO. Preview the CSV →
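The break-even question reduces to simple arithmetic: at what monthly request volume does a fixed self-hosted bill equal a per-request API bill? The sketch below shows that calculation only. The function names are hypothetical, and every price in the usage note is a placeholder chosen for round numbers, not an AWS or OpenAI list price and not a figure from the bundled CSV. Only the 300k req/mo reference load comes from this page.

```python
def self_hosted_monthly_usd(gpu_hourly_usd, avg_replicas, fixed_monthly_usd, hours_per_month=730):
    """GPU replicas billed by the hour, plus fixed monthly costs (database, cache, etc.)."""
    return gpu_hourly_usd * avg_replicas * hours_per_month + fixed_monthly_usd


def api_monthly_usd(requests_per_month, usd_per_request):
    """A hosted API billed purely per request."""
    return requests_per_month * usd_per_request


def break_even_requests(self_hosted_usd, usd_per_request):
    """Monthly volume above which the self-hosted bill is the cheaper one."""
    return self_hosted_usd / usd_per_request
```

With placeholder inputs of $1.00 per GPU-hour, 2 average replicas, $200 of fixed costs, and $0.01 per API request, the self-hosted bill is $1,660 a month and break-even sits at roughly 166,000 requests a month. At the 300k req/mo reference load the same placeholder API price would come to $3,000, so self-hosting would win under those assumed numbers. Real conclusions depend on the actual prices in the cost model.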
Optimization levers
Async architecture review with a staff-level reviewer (cohort beta).
Submit your repo, your ADR draft, or your Locust regression report. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.
Bring a diff, an ADR draft, or a chaos-runbook walkthrough.
The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.
PRO unlocks Module 01. EXPERT unlocks the full serving platform.
PRO is the entry point — Module 01 (vLLM endpoint + FastAPI gateway + Locust harness) plus the rest of the PRO catalog. EXPERT unlocks Modules 02–04 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.
Pick this if you own the SLA, not just the model.
Staff / principal AI infra engineers
You own the inference SLA, the GPU budget, and the chaos runbook your CISO will read after the next incident. The 5 ADRs are exactly what a staff promo panel asks for.
Engineering managers · AI
You need a defensible GPU spend model + an on-call playbook before next-quarter headcount. The cost-model CSV (with break-even-vs-OpenAI math) is the answer with citations.
ML platform leads
You absorb LLM serving without absorbing 6 new vendors. vLLM + Ray + Redis + Prometheus on existing infra — that's the playbook for production inference on your stack.
Founding engineers · AI startups
Your first paying customer will ask 'why did the bot say that?' before they ask about scale. Latency p99 + cost-per-request + circuit-breaker degraded mode is the answer in one repo.
Going deeper? Four tracks back this project.
The AI Inference & Serving Systems curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.
Paired with this project
EXPERT-tier full-stack RAG: pgvector + HNSW + hybrid retrieval + RRF + cross-encoder rerank, 4-class query router with confidence threshold, 3-level failure cascade (RAG → LLM-only → cached), per-tenant index isolation, eval gates, cost guardrails, 6-mode incident simulator, 5 committed ADRs (one Deprecated), runnable cost-model CSV. 6 modules · 20-22h. Modules 01-03 with PRO.
Cost-aware LLM platform: token tracking, dual-tier cache, 4-strategy router, three-tier budget governance. 5 ADRs + cost-model CSV bundled.
Ready to ship a cost-aware serving platform?
Start with PRO ($29/mo) for Module 01 — vLLM endpoint + FastAPI gateway + Locust baseline. Or unlock the full 4-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).