ai-de.net/Projects/P15 · AI Serving Platform — vLLM + Ray Serve under SLA
Last updated By AI-DE Engineering Team
EXPERT-tier · PRO unlocks Module 01 · AI & vectors track · P15

Run a cost-aware AI serving platform that survives a Locust storm

Ship a production inference platform anchored on serving quality: vLLM continuous batching with measured 1800ms → 127ms latency cuts, Ray Serve autoscale with market-hours policy, semantic cache (35% hit rate), ServingCircuitBreaker, 5 chaos failure scenarios with a runbook, and a runnable cost model. Module 01 unlocks with PRO; the platform unlocks with EXPERT.

Timeline: 11 hours
Difficulty: Senior+
Stack: vLLM · Ray Serve · FastAPI · pgvector · Redis · Prometheus

The inference-serving portfolio piece for staff AI infra / ML platform roles — 5 committed ADRs, a runnable cost-model CSV with break-even-vs-OpenAI math, a 4-config latency-cost tradeoff bench, and a chaos runbook you can defend in an architecture review.

By the end you will have wired
  • vLLM endpoint serving Mistral-7B with PagedAttention + prefix caching (max_num_seqs=256, A10G)
  • FastAPI gateway with auth + rate limit + Prometheus middleware + multi-stage Docker
  • Redis semantic cache (HNSW · cosine ≥ 0.92) with volatility-aware TTL
  • Ray Serve autoscale (1–4 replicas) with market-hours policy + Nginx load balancer
  • SSE streaming + agent session management (Redis · TTL 3600s · max 20 turns)
  • Prometheus + Grafana 7-panel dashboard + 6 alert rules + OpenTelemetry tracing
  • ServingCircuitBreaker state machine + 5 chaos scenarios + on-call runbook
  • 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
PREREQ · STAFF+ · Built for engineers who run inference under SLA, not in notebooks. Comfortable with Python services, Docker, and at least one of: vLLM / TGI / Triton, Ray / K8s, or production observability stacks. Not a “hello vLLM” tutorial.
ai-serving-platform.platform · tenant=finsight · 4-stage pipeline
Ray Serve autoscale armed

Ingest: Auth + Nginx LB + OpenAI-compat gateway
  • Auth + RateLimit: API key · 600 rpm
  • FastAPI gateway: X-Request-Id · X-Response-Time
  • Nginx LB: round-robin upstream
  • OpenAI-API shape: /v1/chat/completions
Serve: vLLM continuous batching + Ray Serve autoscale
  • vLLM engine: Mistral-7B · 8k context
  • Continuous batching: max_num_seqs=256
  • PagedAttention: block_size=16 · 55% mem savings
  • Ray Serve replicas: 1–4 · market-hours min=2
Augment: Semantic cache + pgvector RAG + SSE stream
  • Redis semantic cache: HNSW · cosine ≥ 0.92
  • pgvector RAG: FinSight Q&A · top-5
  • Query rewriter: TTL 5min volatile · 1d default
  • SSE streaming: TTFT < 100ms target
Operate: Prometheus + circuit breaker + chaos runbook
  • Prometheus: p50/p95/p99 + 7 panels
  • OpenTelemetry: OTLP gRPC :4317
  • ServingCircuitBreaker: CLOSED/OPEN/HALF_OPEN
  • Chaos runbook: 5 scenarios · Locust harness
# 4-config cost-accuracy tradeoff bench (M02)
Baseline (no cache, no batching): 1800ms p99, $0.48/req, quality 6.2/10
+ vLLM batching + KV-cache: 420ms p99, $0.31/req — 35% cost cut at same quality
+ pgvector RAG: 195ms p99, $0.38/req — quality leap to 8.7/10
+ semantic cache: 127ms p99 (cached 8ms), $0.25/req — ship config
→ measured on FinSight workload; 4 configs are recipes, not benchmarks
# 5 ADRs + cost-model CSV bundled in the kit
docs/adr/001-vllm-continuous-batching.md — engine choice over Triton/TGI
docs/adr/005-single-instance-vllm-deprecated.md — the cold-start cascade reversal
docs/cost-model/ai-serving-platform-cost-model.csv — 5 tenants × 300k req/mo
→ commit the artifacts a staff promo panel will actually open
73 files in starter zip · ADRs bundled
−66% monthly platform spend at the reference scenario
5 chaos failure scenarios + runbook
Curriculum · 4 modules · 11 hours · 3 phases

Module 01 unlocks with PRO. The platform unlocks with EXPERT.

Module 01 (~2h) ships a working vLLM endpoint with Mistral-7B, FastAPI gateway, Docker multi-stage build, Locust harness, and 4 smoke tests — included with PRO. Modules 02–04 (~9h) layer the semantic cache + RAG (M02), Ray Serve autoscale + SSE streaming + sessions (M03), Prometheus + circuit breaker + chaos engineering (M04) — and unlock with EXPERT.

P15 · 4 modules · 11 hours · 24+ lessons
Free preview EXPERT required
M01
Build Your First AI Serving Layer
Stand up vLLM serving Mistral-7B with PagedAttention + prefix caching (max_num_seqs=256, gpu_memory_utilization=0.85). FastAPI gateway with auth + Prometheus middleware + multi-stage Dockerfile (non-root user). Locust load test harness with p50/p95/p99 reporting against <500ms p99 baseline.
Phase 1: Foundation · 2h · 5 lessons · PRO TIER
M02
Optimize: RAG + Cost vs Accuracy
Cut p99 from 1800ms baseline to 127ms cached. Tune dynamic batching (scheduler_delay_factor=0.1), prefix-cache the 312-token FinSight system prompt, add pgvector RAG (top-5), Redis semantic cache (HNSW · threshold=0.92 · 35% hit). Measure every tradeoff via the 4-config A/B bench.
Phase 2: Optimization · 3h · 7 lessons · EXPERT TIER
M03
Scale & Stream: Ray Serve + SSE + Sessions
Ray Serve autoscale (min=1, max=4, target_ongoing_requests=10) with market-hours policy (min=2 during 9am–4pm ET). SSE streaming endpoint with TTFT < 100ms target. Agent session management (Redis TTL=3600s, max 20 turns, prune_context). Nginx round-robin load balancer.
Phase 2: Scale · 3h · 6 lessons · EXPERT TIER
M04
Production LLMOps: Monitor, Break, Harden
Prometheus + Grafana 7-panel dashboard, 6 alert rules (p99, KV cache, circuit breaker, cost spike, Redis drop, cache hit drop). OpenTelemetry tracing (OTLP gRPC). ServingCircuitBreaker state machine. 5 chaos failure scenarios + markdown runbook. Cost-model CLI with break-even-vs-OpenAI math.
Phase 3: Production · 3h · 8 lessons · EXPERT TIER
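The session rules described for Module 03 (Redis TTL=3600s, max 20 turns, prune_context) can be illustrated with a small in-memory sketch. This is not the kit's implementation: the `InMemorySessionStore` class is a hypothetical stand-in for Redis, and it assumes one "turn" is a user message plus its assistant reply and that system messages are always kept.

```python
import time
from dataclasses import dataclass, field

SESSION_TTL_S = 3600  # TTL from the module description
MAX_TURNS = 20        # turn cap from the module description


def prune_context(messages: list, max_turns: int = MAX_TURNS) -> list:
    """Keep system messages plus the most recent `max_turns` turns.

    Assumption: a turn is one user message and one assistant reply,
    so we keep the last 2 * max_turns non-system messages.
    """
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    return system + dialogue[-2 * max_turns:]


@dataclass
class InMemorySessionStore:
    """Hypothetical stand-in for Redis: a session expires ttl_s after its last write."""
    ttl_s: int = SESSION_TTL_S
    _data: dict = field(default_factory=dict)

    def get(self, session_id: str, now: float = None) -> list:
        now = time.time() if now is None else now
        expires_at, history = self._data.get(session_id, (0.0, []))
        return list(history) if now < expires_at else []

    def append(self, session_id: str, message: dict, now: float = None) -> list:
        now = time.time() if now is None else now
        history = prune_context(self.get(session_id, now) + [message])
        # Every write refreshes the expiry, like SET with EX in Redis.
        self._data[session_id] = (now + self.ttl_s, history)
        return history
```

Passing `now` explicitly keeps the expiry logic testable without sleeping; with a real Redis backend the TTL would be enforced by the server instead.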
Module 01 with PRO ($29/mo) · Modules 02–04 with EXPERT ($79/mo)
Backed by curriculum
AI Inference & Serving Systems
8 modules · 9 hours · Inference · Batching · Autoscale · GPU econ
The AI Inference & Serving Systems curriculum is the foundation for the project: same engine choices, same latency/throughput tradeoffs, taught from first principles. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Foundation. Optimization. Production.

Each phase ends with a tagged release, a passing Locust run, and a measurable latency-or-cost delta on the FinSight workload. No ambiguity about where you are.

01 · ~2h
Foundation (Module 01)

Working vLLM endpoint live locally. Mistral-7B + FastAPI + Locust harness on a single A10G; <500ms p99 baseline measured.

  • vLLM engine with PagedAttention + prefix caching configured
  • FastAPI gateway + multi-stage Docker + 4 smoke tests passing
  • Locust harness with p50/p95/p99 latency reporting at 20 concurrent users
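As a rough illustration of the engine configuration this phase describes, the sketch below builds a launch command for vLLM's OpenAI-compatible server from the values quoted on this page (max_num_seqs=256, gpu_memory_utilization=0.85, block_size=16, 8k context, prefix caching). The exact model id, the port, and the `EngineSettings` / `build_vllm_command` names are assumptions for illustration; the flag names follow the vLLM CLI as I understand it, so check them against your installed version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineSettings:
    # Assumption: the page only says "Mistral-7B"; this exact checkpoint is a guess.
    model: str = "mistralai/Mistral-7B-Instruct-v0.2"
    max_num_seqs: int = 256               # cap on sequences batched together
    gpu_memory_utilization: float = 0.85  # fraction of GPU memory vLLM may claim
    block_size: int = 16                  # PagedAttention KV-cache block size (tokens)
    max_model_len: int = 8192             # "8k context" from the pipeline card
    enable_prefix_caching: bool = True


def build_vllm_command(s: EngineSettings, port: int = 8001) -> list:
    """Return an argv list for `vllm serve` (port 8001 is a placeholder)."""
    argv = [
        "vllm", "serve", s.model,
        "--port", str(port),
        "--max-num-seqs", str(s.max_num_seqs),
        "--gpu-memory-utilization", str(s.gpu_memory_utilization),
        "--block-size", str(s.block_size),
        "--max-model-len", str(s.max_model_len),
    ]
    if s.enable_prefix_caching:
        argv.append("--enable-prefix-caching")
    return argv


if __name__ == "__main__":
    print(" ".join(build_vllm_command(EngineSettings())))
```

Keeping the settings in one frozen dataclass makes it easy to diff tuning changes between Locust runs.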
02 · ~6h
Optimization (Modules 02–03)

Latency p99 cut from 1800ms baseline to 127ms cached on the FinSight workload. Ray Serve autoscale live with market-hours policy; SSE streaming endpoint with TTFT < 100ms.

  • pgvector RAG + Redis semantic cache (35% hit rate measured)
  • 4-config tradeoff bench (no cache / +batching / +RAG / +cache)
  • Ray Serve multi-replica deployment with autoscaling_policy.py
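To make the SSE streaming target concrete, here is a minimal sketch of how generated tokens might be framed as Server-Sent Events in the OpenAI streaming shape, plus a helper that times the first frame. The function names and the `mistral-7b` model label are hypothetical; the kit's own endpoint may structure this differently.

```python
import json
import time
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str], model: str = "mistral-7b") -> Iterator[str]:
    """Wrap tokens as SSE events using OpenAI-style chat.completion.chunk payloads."""
    for token in tokens:
        chunk = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}}],
        }
        # An SSE event is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps(chunk)}\n\n"
    # OpenAI-compatible streams end with a literal [DONE] sentinel.
    yield "data: [DONE]\n\n"


def time_to_first_frame_ms(frames: Iterator[str]) -> float:
    """Measure how long the first frame takes to arrive (a local proxy for TTFT)."""
    start = time.perf_counter()
    next(frames)
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    for frame in sse_frames(["Q4 ", "revenue ", "rose."]):
        print(frame, end="")
```

In a FastAPI handler, a generator like this would typically be returned through `StreamingResponse(..., media_type="text/event-stream")`.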
03 · ~3h
Production (Module 04)

Prometheus + Grafana + OTel observability stack live; ServingCircuitBreaker wired to vLLM call site; 5 chaos scenarios drilled.

  • Grafana dashboard with 7 panels + 6 alert rules wired in
  • ServingCircuitBreaker state machine + cached fallback in RAG pipeline
  • Cost-model CLI + chaos runbook + architecture diagram (staff capstone)
Project setup · 15 minutes

One command. Local FastAPI + Redis (no GPU needed for smoke).

What lives in the repo

You get the unified production code on day one — vLLM as the inference engine (toggleable to MockLLM for CPU-only smoke), Ray Serve as the orchestration layer, FastAPI gateway, pgvector for RAG, Redis for semantic cache + sessions + rate-limit fallback, plus Prometheus + Grafana + OpenTelemetry for the full observability stack.

  • docker-compose.serving.yml + scaling.yml — single-instance (M01 dev) and multi-replica (M03 prod) topologies
  • serving/vllm_config.py + ray_serve_app.py — engine tuning (max_num_seqs, gpu_memory_utilization) + autoscale policy
  • api/cache/semantic_cache.py + rag/pipeline.py — Redis HNSW semantic cache + pgvector RAG pipeline (4 versions)
  • observability/ + resilience/ — Prometheus + Grafana dashboard + alert rules + circuit breaker
  • chaos/trigger_failures.py + runbooks/ — 5 failure scenarios + markdown runbook (M04 capstone)
  • docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 73 files · 185 KB

AI Serving Platform Starter Kit

Pre-built serving platform with vLLM engine config, Ray Serve autoscale policy, semantic cache, RAG pipeline, Prometheus + Grafana stack, ServingCircuitBreaker, 5 chaos scenarios + runbook, and a 200-row FinSight financial-Q&A corpus + 100-row pgvector seed + 50-prompt Locust deck. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 73 files · ADRs + cost model bundled · last updated 2026-05-09
~/projects/ai-serving-platform — zsh
1. Unzip and bring up the platform (MockLLM, no GPU needed)
$ unzip ai-serving-platform-starter.zip
$ cd ai-serving-platform-starter && cp .env.example .env
$ docker compose -f docker-compose.serving.yml up -d
2. Run the smoke tests against the mocked LLM
$ python -m venv .venv && source .venv/bin/activate
$ pip install -r requirements-core.txt
$ pytest tests/test_smoke.py -v
3. Send a request and watch the latency middleware
$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H 'Authorization: Bearer dev-key' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"summarize Q4 earnings"}]}'
4. Read the ADRs and re-run the cost model
$ cat docs/adr/001-vllm-continuous-batching.md
$ open docs/cost-model/ai-serving-platform-cost-model.csv
200 Q&A corpus rows · FinSight
100 pgvector embeddings seed
50 Locust prompt deck (seed=42)
7 smoke gates · pass without GPU
Production hardening

The same vLLM endpoint — but built for the SLA case.

Most serving tutorials stop at "I deployed a model." This shows what changes when 100 concurrent users are sending bursty traffic, the bill is real, and you’re on call when the GPU OOMs.

Notebook serving (what most teams ship)
  • Engine: Single-request inference; static batching
  • Cache: None; every call hits the model
  • Scaling: 1 instance always; manual add when slow
  • Failure mode: Naive retry-with-backoff or hang
  • Observability: Print logs; check vendor invoice
  • On-call: Trust GPU not to OOM

Your serving platform (Modules 02–04)
  • Engine: vLLM continuous batching + PagedAttention (max_num_seqs=256) · ADR-001
  • Cache: Redis HNSW semantic cache (cosine ≥ 0.92, 35% hit rate, volatility TTL split) · ADR-003
  • Scaling: Ray Serve autoscale (1–4 replicas, market-hours min=2) + Nginx LB · ADR-002
  • Failure mode: ServingCircuitBreaker state machine (CLOSED/OPEN/HALF_OPEN) + cached fallback · ADR-004
  • Observability: Prometheus (p99 + KV-cache + cost) + OTel tracing + Grafana 7-panel dashboard + 6 alert rules
  • On-call: 5 chaos failure scenarios + markdown runbook (finsight_failure_runbook.md)
EXPERT-only · architecture decision records

Write the ADRs staff serving engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build. One is a real Deprecated ADR: it documents the reversal from single-instance vLLM to multi-replica Ray Serve that the cold-start cascade chaos test forced (TTFT 168s → 875ms p99).

ADR-001 · Accepted

vLLM continuous batching + PagedAttention over Triton/TensorRT

Context
Triton needs TensorRT compilation step + adapter for OpenAI-API; TGI ~70-80% of vLLM throughput on Mistral-7B
Decision
vLLM with max_num_seqs=256, gpu_memory_utilization=0.85, prefix caching, block_size=16
Tradeoff
Younger engine vs continuous batching + 55% memory savings + zero-compile model swap
Reversal
Engine swap to Triton is ~2 engineer-weeks; OpenAI-API surface preserved
ADR-002 · Accepted

Ray Serve over Kubernetes for autoscaling

Context
5-tenant 1–4 replica scale; full K8s + KEDA + custom metrics adapter overhead exceeds benefit at this size
Decision
Ray Serve with AutoscalingConfig (target_ongoing_requests=10) + market-hours min=2 cron
Tradeoff
Single-region scale ceiling (~50 replicas) vs Python-native autoscale + same dev/prod topology
Reversal
K8s + KEDA migration is ~3-4 engineer-weeks; circuit breaker absorbs the cutover
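The ADR-002 decision can be sketched as a small replica calculator: a target of 10 ongoing requests per replica, a 1–4 replica range, and a floor of 2 during market hours. This is a simplified model, not Ray Serve's actual controller. It assumes the timestamp passed in is already in US Eastern time and that market hours apply on weekdays only; both function names are hypothetical.

```python
import math
from datetime import datetime


def min_replicas_at(local_et: datetime) -> int:
    """Floor on replicas. Assumes `local_et` is already expressed in US Eastern time."""
    is_weekday = local_et.weekday() < 5   # assumption: no market hours on weekends
    in_window = 9 <= local_et.hour < 16   # 9am-4pm ET, per the module description
    return 2 if (is_weekday and in_window) else 1


def desired_replicas(
    ongoing_requests: int,
    local_et: datetime,
    target_ongoing_requests: int = 10,
    max_replicas: int = 4,
) -> int:
    """Replicas needed to keep per-replica load near the target, clamped to the range."""
    wanted = math.ceil(ongoing_requests / target_ongoing_requests)
    return max(min_replicas_at(local_et), min(max_replicas, wanted))


if __name__ == "__main__":
    monday_10am = datetime(2026, 5, 11, 10, 0)
    print(desired_replicas(35, monday_10am))  # ceil(35 / 10) = 4 replicas
```

In Ray Serve these values would normally be expressed through a deployment's `autoscaling_config` (`min_replicas`, `max_replicas`, `target_ongoing_requests`), with the market-hours floor applied by a scheduled job as the ADR describes.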
ADR-003 · Accepted

Redis semantic cache (HNSW + threshold=0.92) over exact-match LRU

Context
Exact-match catches ~10% on FinSight (paraphrase-heavy); semantic catches ~35% — and removes the most expensive (long-context) repeats
Decision
SemanticCache with all-MiniLM-L6-v2 embedder + cosine 0.92 + volatility TTL split (5min / 1day)
Tradeoff
Per-miss embedding cost (~5-10ms + $0.00002) vs 35% hit rate eliminating GPU work entirely
Reversal
Disable via .env flag if hit rate drops below 15% on workload shift
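The lookup rule in ADR-003 (serve a cached answer when cosine similarity is at least 0.92, with a 5-minute TTL for volatile answers and 1 day otherwise) can be sketched in memory. This swaps the Redis HNSW index for a brute-force scan and takes precomputed embeddings as plain lists, so it illustrates the thresholding and TTL split only. The class name and the `volatile` flag are assumptions, since the page does not say how volatility is detected.

```python
import math
import time
from dataclasses import dataclass, field

VOLATILE_TTL_S = 5 * 60       # "5min volatile"
DEFAULT_TTL_S = 24 * 60 * 60  # "1d default"


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticCacheSketch:
    """Brute-force stand-in for the Redis HNSW index (hypothetical class)."""
    threshold: float = 0.92
    _entries: list = field(default_factory=list)  # (embedding, answer, expires_at)

    def put(self, embedding, answer, volatile=False, now=None):
        now = time.time() if now is None else now
        ttl = VOLATILE_TTL_S if volatile else DEFAULT_TTL_S
        self._entries.append((embedding, answer, now + ttl))

    def get(self, embedding, now=None):
        now = time.time() if now is None else now
        # Drop expired entries, then pick the nearest remaining neighbour.
        self._entries = [e for e in self._entries if e[2] > now]
        best = max(self._entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None  # miss: caller falls through to RAG + the model
```

A real deployment would embed the query first (the ADR names all-MiniLM-L6-v2) and let the vector index do the nearest-neighbour search; the threshold check stays the same.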
ADR-004 · Accepted

Circuit breaker over naive retry-with-backoff

Context
GPU OOM under burst + cold-start replica + vendor outages — retries pile load on already-saturated systems
Decision
ServingCircuitBreaker 3-state machine (CLOSED/OPEN/HALF_OPEN) with failure_threshold=5, recovery_timeout=30s
Tradeoff
120 lines of state-machine code + Prometheus state gauge vs preventing retry-storm cascades
Reversal
Disable via .env flag; lose the breaker signal but preserve raw error-rate alerts
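A three-state breaker with the ADR-004 parameters (failure_threshold=5, recovery_timeout=30s) might look roughly like the sketch below. It is not the kit's ServingCircuitBreaker: the class and exception names are hypothetical, the clock is injectable for testing, and metrics and the cached fallback are left to the caller.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the backend."""


class BreakerSketch:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; serve the cached fallback")
            self.state = State.HALF_OPEN  # timeout elapsed: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self.state = State.CLOSED
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed probe, or too many consecutive failures, (re)opens the circuit.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

Callers would wrap the vLLM request in `breaker.call(...)` and catch `CircuitOpenError` to return the cached fallback instead of retrying into a saturated GPU.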
EXPERT-only · cost model

Read the GPU economics story, not just the latency one.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a 5-tenant reference load (10k queries per day, 300k req/mo) using AWS list prices for A10G, RDS, and ElastiCache, includes an OpenAI gpt-4o break-even, and wires in the autoscale and semantic-cache levers. This is the version you defend to a CFO.

Component | Baseline / mo | Optimized / mo | Delta
GPU compute (NVIDIA A10G · g5.xlarge): baseline 4 replicas always-on · optimized 1.2 avg via Ray Serve autoscale | $2,899 | $869 | −$2,030
PostgreSQL + pgvector (RDS db.t4g.medium): 100GB gp3 · FinSight Q&A corpus + embeddings | $90 | $90 | $0
ElastiCache Redis (cache.t4g.small + replica): semantic cache + agent sessions + circuit-breaker fallback | $54 | $54 | $0
Ray head node (t4g.small): cluster coordinator + dashboard · port 8265 | $15 | $15 | $0
Observability (Prometheus + Grafana + OTel collector): self-hosted on t4g.small · 7 panels + 6 alert rules | $20 | $20 | $0
Nginx load balancer (in-compose): round-robin upstream across Ray Serve replicas | $0 | $0 | $0
Total · 5 tenants · 300k req/mo ($0.0103 per req baseline → $0.0035 per req optimized) | $3,078 | $1,048 | −$2,030 (−66%)
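The totals row can be re-derived from the line items with a few lines of Python. The figures below are copied from the table; the `summarize` helper and the dictionary keys are hypothetical names for illustration, not the kit's cost-model CLI.

```python
REQUESTS_PER_MONTH = 300_000  # 5 tenants, reference load

# (component, baseline $/mo, optimized $/mo), copied from the cost table
LINE_ITEMS = [
    ("GPU compute (A10G, g5.xlarge)", 2899, 869),
    ("PostgreSQL + pgvector (RDS)", 90, 90),
    ("ElastiCache Redis", 54, 54),
    ("Ray head node", 15, 15),
    ("Observability", 20, 20),
    ("Nginx load balancer", 0, 0),
]


def summarize(items, requests=REQUESTS_PER_MONTH):
    """Sum both scenarios and derive the delta and per-request cost."""
    baseline = sum(b for _, b, _ in items)
    optimized = sum(o for _, _, o in items)
    return {
        "baseline": baseline,
        "optimized": optimized,
        "delta": optimized - baseline,
        "delta_pct": round(100 * (optimized - baseline) / baseline),
        "baseline_per_req": round(baseline / requests, 4),
        "optimized_per_req": round(optimized / requests, 4),
    }


if __name__ == "__main__":
    for key, value in summarize(LINE_ITEMS).items():
        print(f"{key}: {value}")
```

Note that the whole delta comes from the GPU line; the other components cost the same in both scenarios.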

Optimization levers

Ray Serve autoscale (market-hours min=2 / off-hours min=1)
ADR-002. Replica count scales with queued requests; market-hours policy prevents 90s+ TTFT cold-start cascade. Avg ~1.2 GPU-eq over 24h vs 4 always-on baseline.
−$2,029 / mo (alone)
Semantic cache (HNSW · threshold=0.92 · 35% hit rate)
ADR-003. An HNSW nearest-neighbour lookup over query embeddings answers about 35% of requests from cache, so those calls never reach the GPU. Cached p99 = 8ms vs miss p99 = 195ms.
−$501 / mo (alone, stacks with autoscale)
vLLM continuous batching + PagedAttention (foundational)
ADR-001. max_num_seqs=256 fits in the A10G's 24GB via PagedAttention; ~8× per-GPU throughput vs static batching. Without this, baseline would need 24+ GPUs.
−$15,400 / mo (foundational)
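The autoscale lever is mostly GPU-hours arithmetic, which a short sketch can reproduce. It uses the $1.006/hr g5.xlarge price quoted elsewhere on this page and assumes a 720-hour (30-day) month; the function name is hypothetical, and the result lands within a few dollars of the published figures rather than matching them exactly.

```python
A10G_HOURLY_USD = 1.006  # g5.xlarge on-demand price quoted on this page
HOURS_PER_MONTH = 720    # assumption: 30-day month


def gpu_monthly_cost(avg_replicas: float,
                     hourly: float = A10G_HOURLY_USD,
                     hours: int = HOURS_PER_MONTH) -> float:
    """Monthly GPU spend for a given average replica count."""
    return avg_replicas * hourly * hours


if __name__ == "__main__":
    always_on = gpu_monthly_cost(4)     # baseline: 4 replicas, 24/7
    autoscaled = gpu_monthly_cost(1.2)  # optimized: ~1.2 GPU-equivalents on average
    print(f"always-on:  ${always_on:,.0f}")
    print(f"autoscaled: ${autoscaled:,.0f}")
    print(f"saved:      ${always_on - autoscaled:,.0f}")
```

Changing `avg_replicas` is the quickest way to see how sensitive the saving is to the off-hours floor.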
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your Locust regression report. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a chaos-runbook walkthrough.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Tomas R.
Ex-staff · inference platform · top-3 cloud
vLLM tuning, Ray Serve autoscale design, GPU economics, KV-cache strategy
Send the diff. I'll go line-by-line through your max_num_seqs and your prefix-cache TTL and pick out the edge cases that bite at 100 RPS.
Jamie L.
Principal · cost engineering · enterprise SaaS
FinOps for AI workloads, break-even-vs-vendor math, reserved-instance modeling, on-call cost defense
Send your CSV and your autoscale policy. We'll walk it backwards from the totals row to the assumption that breaks first when traffic 3×s.
Riya W.
Eng manager · AI infra · public mid-cap
Org design for serving teams, on-call rotation, hiring rubrics, staff promo packets
If you're prepping for staff, send your ADR-005 + your chaos runbook. We'll work backwards from the rubric.
Format: Async · Turnaround: 7 days · Cohort: 12 members · Scope: ADR + chaos review
What your tier unlocks

PRO unlocks Module 01. EXPERT unlocks the full serving platform.

PRO is the entry point — Module 01 (vLLM endpoint + FastAPI gateway + Locust harness) plus the rest of the PRO catalog. EXPERT unlocks Modules 02–04 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.

What you get | FREE | PRO | EXPERT
Module 01 of P15: vLLM endpoint + Locust harness (~2h) | - | Included | Included
Modules 02–04 of P15: Optimize / Scale / Production (~9h) | - | - | Included
5 committed ADRs + cost-model CSV: starter kit docs/adr/ + docs/cost-model/ | - | - | Included
PRO project catalog: production-grade builds | 2 | All current | All current + this one
Curriculum (all 7 tracks) | Phase 1 only | All | All + bonus modules
Code review: Senior+ reviewers | - | 4 / month | Unlimited
Cohort-beta architecture review: async · 7-day turnaround · 12-member cap | - | - | Included
Certificate: verifiable on LinkedIn | - | Yes | Yes + LinkedIn rec
EXPERT: $79/mo, billed monthly · open enrollment · cancel anytime
or annual: $699/yr (save 26%)
Who this is for

Pick this if you own the SLA, not just the model.

Staff / principal AI infra engineers

You own the inference SLA, the GPU budget, and the chaos runbook your CISO will read after the next incident. The 5 ADRs are exactly what a staff promo panel asks for.

Engineering managers · AI

You need a defensible GPU spend model + an on-call playbook before next-quarter headcount. The cost-model CSV (with break-even-vs-OpenAI math) is the answer with citations.

ML platform leads

You absorb LLM serving without absorbing 6 new vendors. vLLM + Ray + Redis + Prometheus on existing infra — that's the playbook for production inference on your stack.

Founding engineers · AI startups

Your first paying customer will ask 'why did the bot say that?' before they ask about scale. Latency p99 + cost-per-request + circuit-breaker degraded mode is the answer in one repo.

FAQ · EXPERT tier

Quick answers.

How is P15 different from P14?
Different lanes. P15 owns inference serving infrastructure: vLLM continuous batching + PagedAttention, Ray Serve autoscale with market-hours policy, semantic cache, ServingCircuitBreaker, 5 chaos scenarios, GPU economics. P14 owns retrieval infrastructure: pgvector HNSW + BM25 + RRF + cross-encoder reranker, OpenAI function-calling agent, drift detection. Pick P15 if your manager asks 'what's our p99 under burst?' Pick P14 if your manager asks 'what's our hit-rate@10?'

What do PRO and EXPERT each include?
Module 01 (vLLM endpoint + FastAPI gateway + Docker + Locust harness + 4 smoke tests) is included with PRO at $29/mo: a working cost-tracked /chat behind a Locust baseline. Modules 02–04 (semantic cache + RAG + 4-config tradeoff bench in M02; Ray Serve autoscale + SSE + sessions in M03; Prometheus + circuit breaker + 5 chaos scenarios in M04), plus the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review, unlock with EXPERT at $79/mo.

Do I need a GPU?
Not for smoke testing. The kit ships a MockLLM stand-in selected with MODEL_BACKEND=mock. The 7 smoke gates pass against MockLLM with no API key, no model download, no GPU. Toggle to real vLLM (MODEL_BACKEND=vllm) when you have access to an A10G or larger. Production scenarios assume A10G g5.xlarge ($1.006/hr on AWS); the cost-model CSV documents alternatives.

Are the latency numbers real?
Real and reproducible on the FinSight workload. The 4-config A/B bench in M02 measures: baseline (no cache, naive) at 1800ms p99, +vLLM batching + KV cache at 420ms p99, +pgvector RAG at 195ms p99, +semantic cache at 127ms cached p99 (8ms on hit). The starter zip ships the Locust deck + the prompt set + the bench harness so you can re-run on your hardware. Real workloads will move the numbers; the methodology travels.

How long does the project take?
11 hours of focused work across 4 modules. Most learners spread it across 3–4 weeks alongside a day job. Module 01 alone is ~2 hours and ships a working vLLM endpoint with Locust baseline. That is a meaningful PRO deliverable on its own if you want to gauge the project before unlocking EXPERT.

Will this help with staff-level interviews or a promo packet?
It's a strong forcing function. Staff AI infra interviews lean heavily on serving system design (latency budgets, autoscale, circuit breakers, GPU economics) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the cold-start cascade receipts) plus the chaos runbook are exactly the artifacts a panel asks about. Pair with the cohort-beta async review and you have a portfolio piece that survives a staff promo packet.
Related projects

Paired with this project

P17 · PAID · analytics
Full-stack AI platform — full RAG system + production hardening

EXPERT-tier full-stack RAG: pgvector + HNSW + hybrid retrieval + RRF + cross-encoder rerank, 4-class query router with confidence threshold, 3-level failure cascade (RAG → LLM-only → cached), per-tenant index isolation, eval gates, cost guardrails, 6-mode incident simulator, 5 committed ADRs (one Deprecated), runnable cost-model CSV. 6 modules · 20-22h. Modules 01-03 with PRO.

P09 · PAID · ai
AI cost optimization (CostGuard)

Cost-aware LLM platform: token tracking, dual-tier cache, 4-strategy router, three-tier budget governance. 5 ADRs + cost-model CSV bundled.


Ready to ship a cost-aware serving platform?

Start with PRO ($29/mo) for Module 01 — vLLM endpoint + FastAPI gateway + Locust baseline. Or unlock the full 4-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).
