ai-de.net/Projects/P15 · AI Serving Platform — vLLM + Ray Serve under SLA

Last updated 2026-05-22 · By AI-DE Engineering Team

EXPERT-tier · PRO unlocks Module 01 · AI & vectors track · P15

Run a cost-aware AI serving platform that survives a Locust storm

Ship a production inference platform anchored on serving quality: vLLM continuous batching with measured 1800ms → 127ms latency cuts, Ray Serve autoscale with market-hours policy, semantic cache (35% hit rate), ServingCircuitBreaker, 5 chaos failure scenarios with a runbook, and a runnable cost model. Module 01 unlocks with PRO; the platform unlocks with EXPERT.

Timeline: 11 hours
Difficulty: Senior+
Stack: vLLM · Ray Serve · FastAPI · pgvector · Redis · Prometheus

The inference-serving portfolio piece for staff AI infra / ML platform roles — 5 committed ADRs, a runnable cost-model CSV with break-even-vs-OpenAI math, a 4-config latency-cost tradeoff bench, and a chaos runbook you can defend in an architecture review.

By the end you will have wired

vLLM endpoint serving Mistral-7B with PagedAttention + prefix caching (max_num_seqs=256, A10G)
FastAPI gateway with auth + rate limit + Prometheus middleware + multi-stage Docker
Redis semantic cache (HNSW · cosine ≥ 0.92) with volatility-aware TTL
Ray Serve autoscale (1–4 replicas) with market-hours policy + Nginx load balancer
SSE streaming + agent session management (Redis · TTL 3600s · max 20 turns)
Prometheus + Grafana 7-panel dashboard + 6 alert rules + OpenTelemetry tracing
ServingCircuitBreaker state machine + 5 chaos scenarios + on-call runbook
5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV

PREREQ · STAFF+

Built for engineers who run inference under SLA, not in notebooks. Comfortable with Python services, Docker, and at least one of: vLLM / TGI / Triton, Ray / K8s, or production observability stacks. Not a “hello vLLM” tutorial.

ai-serving-platform.platform · tenant=finsight · 4-stage pipeline · Ray Serve autoscale armed

Ingest: Auth + Nginx LB + OpenAI-compat gateway
Auth + RateLimit: API key · 600 rpm
FastAPI gateway: X-Request-Id · X-Response-Time
Nginx LB: round-robin upstream
OpenAI-API shape: /v1/chat/completions

Serve: vLLM continuous batching + Ray Serve autoscale
vLLM engine: Mistral-7B · 8k context
Continuous batching: max_num_seqs=256
PagedAttention: block_size=16 · 55% mem savings
Ray Serve replicas: 1–4 · market-hours min=2

Augment: Semantic cache + pgvector RAG + SSE stream
Redis semantic cache: HNSW · cosine ≥ 0.92
pgvector RAG: FinSight Q&A · top-5
Query rewriter: TTL 5min volatile · 1d default
SSE streaming: TTFT < 100ms target

Operate: Prometheus + circuit breaker + chaos runbook
Prometheus: p50/p95/p99 + 7 panels
OpenTelemetry: OTLP gRPC :4317
ServingCircuitBreaker: CLOSED/OPEN/HALF_OPEN
Chaos runbook: 5 scenarios · Locust harness
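The 600 rpm per-API-key limit listed under Ingest could be enforced in several ways. The sketch below shows one of them, a sliding-window log, as an in-memory illustration only. The class name `SlidingWindowRateLimiter`, the injectable `clock`, and the choice of algorithm are assumptions; the kit's actual limiter and its Redis-backed state are not shown on this page.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per API key in any `window_s`-second span."""

    def __init__(self, limit=600, window_s=60.0, clock=time.monotonic):
        self._limit = limit
        self._window_s = window_s
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # api_key -> timestamps of accepted requests

    def allow(self, api_key):
        now = self._clock()
        hits = self._hits[api_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window_s:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # a gateway would typically answer HTTP 429 here
        hits.append(now)
        return True
```

A gateway dependency would call `allow(api_key)` once per request and reject when it returns `False`. A sliding window avoids the burst-at-the-boundary behaviour of fixed one-minute buckets, at the cost of storing one timestamp per accepted request.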

# 4-config cost-accuracy tradeoff bench (M02)

Baseline (no cache, no batching): 1800ms p99, $0.48/req, quality 6.2/10

+ vLLM batching + KV-cache: 420ms p99, $0.31/req — 35% cost cut at same quality

+ pgvector RAG: 195ms p99, $0.38/req — quality leap to 8.7/10

+ semantic cache: 127ms p99 (cached 8ms), $0.25/req — ship config

→ measured on FinSight workload; 4 configs are recipes, not benchmarks
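As a quick check on the headline figures, the relative deltas against the baseline can be recomputed from the four rows above. This is a minimal sketch: the tuples simply restate the published numbers, and the names `CONFIGS`, `pct_change`, and `deltas_vs_baseline` are illustrative, not part of the kit's bench harness.

```python
# (config name, p99 latency in ms, cost in $ per request), copied from the bench rows.
CONFIGS = [
    ("baseline", 1800, 0.48),
    ("+ vLLM batching + KV-cache", 420, 0.31),
    ("+ pgvector RAG", 195, 0.38),
    ("+ semantic cache", 127, 0.25),
]


def pct_change(old, new):
    """Signed percentage change from old to new (negative means a reduction)."""
    return (new - old) / old * 100.0


def deltas_vs_baseline(configs):
    """Return (name, latency change %, cost change %) for every non-baseline config."""
    _, base_p99, base_cost = configs[0]
    return [
        (name, pct_change(base_p99, p99_ms), pct_change(base_cost, cost))
        for name, p99_ms, cost in configs[1:]
    ]


if __name__ == "__main__":
    for name, d_lat, d_cost in deltas_vs_baseline(CONFIGS):
        print(f"{name:28s} p99 {d_lat:+.1f}%  cost {d_cost:+.1f}%")
```

Note that the RAG row raises cost per request relative to the batching row while lowering p99 and raising quality, which is the tradeoff the bench is designed to surface.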

# 5 ADRs + cost-model CSV bundled in the kit

docs/adr/001-vllm-continuous-batching.md — engine choice over Triton/TGI

docs/adr/005-single-instance-vllm-deprecated.md — the cold-start cascade reversal

docs/cost-model/ai-serving-platform-cost-model.csv — 5 tenants × 300k req/mo

→ commit the artifacts a staff promo panel will actually open

73 files in starter zip · ADRs bundled
−66% monthly platform spend at the reference scenario
5 chaos failure scenarios + runbook

Curriculum · 4 modules · 11 hours · 3 phases

Module 01 unlocks with PRO. The platform unlocks with EXPERT.

Module 01 (~2h) ships a working vLLM endpoint with Mistral-7B, FastAPI gateway, Docker multi-stage build, Locust harness, and 4 smoke tests — included with PRO. Modules 02–04 (~9h) layer the semantic cache + RAG (M02), Ray Serve autoscale + SSE streaming + sessions (M03), Prometheus + circuit breaker + chaos engineering (M04) — and unlock with EXPERT.

P15 · 4 modules · 11 hours · 24+ lessons

M01 · Build Your First AI Serving Layer

Stand up vLLM serving Mistral-7B with PagedAttention + prefix caching (max_num_seqs=256, gpu_memory_utilization=0.85). FastAPI gateway with auth + Prometheus middleware + multi-stage Dockerfile (non-root user). Locust load test harness with p50/p95/p99 reporting against <500ms p99 baseline.

Phase 1: Foundation · 2h · 5 lessons · PRO TIER

M02 · Optimize: RAG + Cost vs Accuracy

Cut p99 from the 1800ms baseline to 127ms with the semantic cache enabled (8ms on a cache hit). Tune dynamic batching (scheduler_delay_factor=0.1), prefix-cache the 312-token FinSight system prompt, add pgvector RAG (top-5) and a Redis semantic cache (HNSW · threshold=0.92 · 35% hit). Measure every tradeoff via the 4-config A/B bench.

Phase 2: Optimization · 3h · 7 lessons · EXPERT TIER

M03 · Scale & Stream: Ray Serve + SSE + Sessions

Ray Serve autoscale (min=1, max=4, target_ongoing_requests=10) with market-hours policy (min=2 during 9am–4pm ET). SSE streaming endpoint with TTFT < 100ms target. Agent session management (Redis TTL=3600s, max 20 turns, prune_context). Nginx round-robin load balancer.

Phase 2: Scale · 3h · 6 lessons · EXPERT TIER

M04 · Production LLMOps: Monitor, Break, Harden

Prometheus + Grafana 7-panel dashboard, 6 alert rules (p99, KV cache, circuit breaker, cost spike, Redis drop, cache hit drop). OpenTelemetry tracing (OTLP gRPC). ServingCircuitBreaker state machine. 5 chaos failure scenarios + markdown runbook. Cost-model CLI with break-even-vs-OpenAI math.

Phase 3: Production · 3h · 8 lessons · EXPERT TIER
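The session limits named in Module 03 (TTL 3600s, max 20 turns, prune_context) can be illustrated without Redis. The sketch below is an in-memory stand-in under stated assumptions: a "turn" is one stored message, `prune_context` keeps only the most recent turns, and every write refreshes the TTL. Only the name `prune_context` and the two limits come from this page; `SessionStore` and its methods are hypothetical.

```python
import time

SESSION_TTL_S = 3600  # from the module description
MAX_TURNS = 20        # from the module description


def prune_context(turns, max_turns=MAX_TURNS):
    """Keep only the most recent `max_turns` entries (assumed pruning rule)."""
    return turns[-max_turns:]


class SessionStore:
    """In-memory stand-in for Redis; expiry is checked lazily on read."""

    def __init__(self, ttl_s=SESSION_TTL_S, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._data = {}  # session_id -> (expires_at, turns)

    def get(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(session_id, None)  # expired or never existed
            return []
        return list(entry[1])

    def append(self, session_id, role, content):
        turns = self.get(session_id) + [{"role": role, "content": content}]
        turns = prune_context(turns)
        # Assumption: each write refreshes the TTL (sliding expiry).
        self._data[session_id] = (self._clock() + self._ttl_s, turns)
        return turns
```

Capping the turn count bounds the prompt length that reaches the engine, which in turn keeps KV-cache pressure per session predictable.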

Module 01 with PRO ($29/mo) · Modules 02–04 with EXPERT ($79/mo)

Backed by curriculum

AI Inference & Serving Systems

8 modules · 9 hours · Inference · Batching · Autoscale · GPU econ

The AI Inference & Serving Systems curriculum is the foundation for the project: same engine choices, same latency/throughput tradeoffs, taught from first principles. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Foundation. Optimization. Production.

Each phase ends with a tagged release, a passing Locust run, and a measurable latency-or-cost delta on the FinSight workload. No ambiguity about where you are.

01 · ~2h · Foundation (Module 01)

Working vLLM endpoint live locally. Mistral-7B + FastAPI + Locust harness on a single A10G; <500ms p99 baseline measured.

✓ vLLM engine with PagedAttention + prefix caching configured
✓ FastAPI gateway + multi-stage Docker + 4 smoke tests passing
✓ Locust harness with p50/p95/p99 latency reporting at 20 concurrent users
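To make the p50/p95/p99 gate in the checklist above concrete, here is a minimal sketch of a nearest-rank percentile report checked against the <500ms p99 baseline. Locust computes its own percentile estimates internally, so this is not its implementation; the function names `percentile` and `latency_report` are illustrative.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms, p99_budget_ms=500.0):
    """Summarise latencies and flag whether p99 stays under the budget."""
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
    report["within_budget"] = report["p99"] < p99_budget_ms
    return report
```

For example, `latency_report([120, 140, 135, 480, 150])` returns the three percentiles plus a boolean that a CI step could assert on after each load run.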

02 · ~6h · Optimization (Modules 02–03)

Latency p99 cut from 1800ms baseline to 127ms cached on the FinSight workload. Ray Serve autoscale live with market-hours policy; SSE streaming endpoint with TTFT < 100ms.

✓ pgvector RAG + Redis semantic cache (35% hit rate measured)
✓ 4-config tradeoff bench (no cache / +batching / +RAG / +cache)
✓ Ray Serve multi-replica deployment with autoscaling_policy.py

03 · ~3h · Production (Module 04)

Prometheus + Grafana + OTel observability stack live; ServingCircuitBreaker wired to vLLM call site; 5 chaos scenarios drilled.

✓ Grafana dashboard with 7 panels + 6 alert rules wired in
✓ ServingCircuitBreaker state machine + cached fallback in RAG pipeline
✓ Cost-model CLI + chaos runbook + architecture diagram (staff capstone)

Project setup · 15 minutes

One command. Local FastAPI + Redis (no GPU needed for smoke).

What lives in the repo

You get the unified production code on day one — vLLM as the inference engine (toggleable to MockLLM for CPU-only smoke), Ray Serve as the orchestration layer, FastAPI gateway, pgvector for RAG, Redis for semantic cache + sessions + rate-limit fallback, plus Prometheus + Grafana + OpenTelemetry for the full observability stack.

docker-compose.serving.yml + scaling.yml — single-instance (M01 dev) and multi-replica (M03 prod) topologies
serving/vllm_config.py + ray_serve_app.py — engine tuning (max_num_seqs, gpu_memory_utilization) + autoscale policy
api/cache/semantic_cache.py + rag/pipeline.py — Redis HNSW semantic cache + pgvector RAG pipeline (4 versions)
observability/ + resilience/ — Prometheus + Grafana dashboard + alert rules + circuit breaker
chaos/trigger_failures.py + runbooks/ — 5 failure scenarios + markdown runbook (M04 capstone)
docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV

Download · Starter Kit · 73 files · 185 KB

AI Serving Platform Starter Kit

Pre-built serving platform with vLLM engine config, Ray Serve autoscale policy, semantic cache, RAG pipeline, Prometheus + Grafana stack, ServingCircuitBreaker, 5 chaos scenarios + runbook, and a 200-row FinSight financial-Q&A corpus + 100-row pgvector seed + 50-prompt Locust deck. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 73 files · ADRs + cost model bundled · last updated 2026-05-09

~/projects/ai-serving-platform — zsh

1. Unzip and bring up the platform (MockLLM, no GPU needed)

$ unzip ai-serving-platform-starter.zip

$ cd ai-serving-platform-starter && cp .env.example .env

$ docker compose -f docker-compose.serving.yml up -d

2. Run the smoke tests against the mocked LLM

$ python -m venv .venv && source .venv/bin/activate

$ pip install -r requirements-core.txt

$ pytest tests/test_smoke.py -v

3. Send a request and watch the latency middleware

$ curl -X POST http://localhost:8000/v1/chat/completions \
    -H 'Authorization: Bearer dev-key' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"summarize Q4 earnings"}]}'

4. Read the ADRs and re-run the cost model

$ cat docs/adr/001-vllm-continuous-batching.md

$ open docs/cost-model/ai-serving-platform-cost-model.csv

200 Q&A corpus rows · FinSight

100 pgvector embeddings seed

50-prompt Locust deck (seed=42)

7 smoke gates · pass without GPU

Production hardening

The same vLLM endpoint — but built for the SLA case.

Most serving tutorials stop at "I deployed a model." This shows what changes when 100 concurrent users are sending bursty traffic, the bill is real, and you’re on call when the GPU OOMs.

Notebook serving (what most teams ship)
Engine: Single-request inference; static batching
Cache: None; every call hits the model
Scaling: 1 instance always; manual add when slow
Failure mode: Naive retry-with-backoff or hang
Observability: Print logs; check vendor invoice
On-call: Trust GPU not to OOM

Your serving platform (Modules 02–04)
✓ Engine: vLLM continuous batching + PagedAttention (max_num_seqs=256) · ADR-001
✓ Cache: Redis HNSW semantic cache (cosine ≥ 0.92, 35% hit rate, volatility TTL split) · ADR-003
✓ Scaling: Ray Serve autoscale (1–4 replicas, market-hours min=2) + Nginx LB · ADR-002
✓ Failure mode: ServingCircuitBreaker state machine (CLOSED/OPEN/HALF_OPEN) + cached fallback · ADR-004
✓ Observability: Prometheus (p99 + KV-cache + cost) + OTel tracing + Grafana 7-panel dashboard + 6 alert rules
✓ On-call: 5 chaos failure scenarios + markdown runbook (finsight_failure_runbook.md)

EXPERT-only · architecture decision records

Write the ADRs staff serving engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build. One is a real Deprecated ADR: it documents the reversal from single-instance vLLM to Ray Serve multi-replica that the cold-start cascade chaos test forced (TTFT 168s → 875ms p99).

ADR-001 · Accepted · vLLM continuous batching + PagedAttention over Triton/TensorRT

Context: Triton needs TensorRT compilation step + adapter for OpenAI-API; TGI ~70-80% of vLLM throughput on Mistral-7B
Decision: vLLM with max_num_seqs=256, gpu_memory_utilization=0.85, prefix caching, block_size=16
Tradeoff: Younger engine vs continuous batching + 55% memory savings + zero-compile model swap
Reversal: Engine swap to Triton is ~2 engineer-weeks; OpenAI-API surface preserved
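The block_size=16 setting in ADR-001 is easier to reason about with a little arithmetic. PagedAttention allocates KV-cache in fixed-size blocks as a sequence grows, so at most one partially filled block is wasted per sequence. The sketch below compares that with reserving the full context window per sequence up front. It does not reproduce the 55% figure quoted above, which depends on the measured workload; `MAX_CONTEXT = 8192` is an assumed reading of "8k context", and the function names are illustrative.

```python
import math

BLOCK_SIZE = 16     # tokens per KV-cache block, from ADR-001
MAX_CONTEXT = 8192  # assumed value for the "8k context" engine setting


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks required to hold `num_tokens` tokens."""
    return math.ceil(num_tokens / block_size)


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Token slots held when each sequence only owns the blocks it has filled."""
    return sum(blocks_needed(n, block_size) * block_size for n in seq_lens)


def contiguous_slots(seq_lens, max_context=MAX_CONTEXT):
    """Token slots held when every sequence reserves the full context up front."""
    return len(seq_lens) * max_context
```

With the 312-token FinSight system prompt as an example, `blocks_needed(312)` is 20 blocks (320 slots), so only 8 slots sit unused, whereas a full-context reservation would hold 8192 slots for the same sequence.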

ADR-002 · Accepted · Ray Serve over Kubernetes for autoscaling

Context: 5-tenant 1–4 replica scale; full K8s + KEDA + custom metrics adapter overhead exceeds benefit at this size
Decision: Ray Serve with AutoscalingConfig (target_ongoing_requests=10) + market-hours min=2 cron
Tradeoff: Single-region scale ceiling (~50 replicas) vs Python-native autoscale + same dev/prod topology
Reversal: K8s + KEDA migration is ~3-4 engineer-weeks; circuit breaker absorbs the cutover
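The ADR-002 decision combines a per-replica load target with a time-of-day floor. The sketch below shows only the steady-state arithmetic: replicas wanted for the current load, clamped between the market-hours minimum and the maximum of 4. Ray Serve's real autoscaler also applies delays and smoothing that are not modelled here, and the function names and the hour-as-integer interface are assumptions for illustration.

```python
import math

TARGET_ONGOING_REQUESTS = 10  # per-replica target, from ADR-002
MAX_REPLICAS = 4


def min_replicas_for_hour(et_hour):
    """Market-hours floor: 2 replicas from 09:00 up to 16:00 ET, otherwise 1."""
    return 2 if 9 <= et_hour < 16 else 1


def desired_replicas(total_ongoing_requests, et_hour):
    """Replicas needed to keep each one near the target, clamped to [floor, max]."""
    wanted = math.ceil(total_ongoing_requests / TARGET_ONGOING_REQUESTS)
    floor = min_replicas_for_hour(et_hour)
    return max(floor, min(MAX_REPLICAS, wanted))
```

Keeping a floor of 2 during market hours means a burst never has to wait on the first cold replica, which is the failure the Deprecated ADR documents.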

ADR-003 · Accepted · Redis semantic cache (HNSW + threshold=0.92) over exact-match LRU

Context: Exact-match catches ~10% on FinSight (paraphrase-heavy); semantic catches ~35% and removes the most expensive (long-context) repeats
Decision: SemanticCache with all-MiniLM-L6-v2 embedder + cosine 0.92 + volatility TTL split (5min / 1day)
Tradeoff: Per-miss embedding cost (~5-10ms + $0.00002) vs 35% hit rate eliminating GPU work entirely
Reversal: Disable via .env flag if hit rate drops below 15% on workload shift
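The lookup rule in ADR-003 (return a cached answer when cosine similarity is at least 0.92, with a shorter TTL for volatile answers) can be sketched in a few lines. This is not the kit's `SemanticCache`: it scans entries linearly instead of using a Redis HNSW index, it expects the caller to supply embeddings rather than running all-MiniLM-L6-v2, and the class name `SemanticCacheSketch` and the `volatile` flag are hypothetical.

```python
import math
import time

SIM_THRESHOLD = 0.92           # cosine threshold from ADR-003
TTL_VOLATILE_S = 5 * 60        # 5 minutes for volatile answers
TTL_DEFAULT_S = 24 * 60 * 60   # 1 day otherwise


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCacheSketch:
    def __init__(self, threshold=SIM_THRESHOLD, clock=time.monotonic):
        self._threshold = threshold
        self._clock = clock
        self._entries = []  # (embedding, answer, expires_at)

    def put(self, embedding, answer, volatile=False):
        ttl = TTL_VOLATILE_S if volatile else TTL_DEFAULT_S
        self._entries.append((list(embedding), answer, self._clock() + ttl))

    def get(self, embedding):
        now = self._clock()
        # Purge expired entries, then pick the most similar survivor.
        self._entries = [e for e in self._entries if e[2] > now]
        best = max(self._entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self._threshold:
            return best[1]
        return None  # miss: caller falls through to RAG + model
```

The linear scan is fine for a demonstration but grows with cache size, which is the reason an approximate index such as HNSW is used in the real design.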

ADR-004 · Accepted · Circuit breaker over naive retry-with-backoff

Context: GPU OOM under burst + cold-start replica + vendor outages: retries pile load on already-saturated systems
Decision: ServingCircuitBreaker 3-state machine (CLOSED/OPEN/HALF_OPEN) with failure_threshold=5, recovery_timeout=30s
Tradeoff: 120 lines of state-machine code + Prometheus state gauge vs preventing retry-storm cascades
Reversal: Disable via .env flag; lose the breaker signal but preserve raw error-rate alerts
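The three-state machine in ADR-004 can be sketched compactly. The version below is a simplified illustration, not the kit's ServingCircuitBreaker: it counts consecutive failures, opens at failure_threshold=5, fails fast to a fallback while OPEN, and lets one trial call through after recovery_timeout=30s. The class name `CircuitBreakerSketch`, the `call(fn, fallback)` interface, and the injectable `clock` are assumptions; the Prometheus state gauge is omitted.

```python
import time


class CircuitBreakerSketch:
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                return fallback()        # fail fast, no load on the engine
            self.state = self.HALF_OPEN  # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._record_failure()
            return fallback()
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self.state = self.CLOSED
        return result

    def _record_failure(self):
        self._failures += 1
        if self.state == self.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = self.OPEN
            self._opened_at = self._clock()
```

In the serving path, `fn` would wrap the vLLM call and `fallback` would return a cached or degraded answer, so a saturated GPU sees no retry traffic while the breaker is OPEN.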

EXPERT-only · cost model

Read the GPU economics story, not just the latency one.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a 5-tenant reference load (10k qpd · 300k req/mo) using real AWS A10G, RDS, and ElastiCache list prices plus an OpenAI gpt-4o break-even, with the autoscale and semantic-cache levers wired up. It is the version you defend to a CFO.

| Component | Notes | Baseline / mo | Optimized / mo | Delta |
|---|---|---|---|---|
| GPU compute (NVIDIA A10G · g5.xlarge) | baseline 4 replicas always-on · optimized 1.2 avg via Ray Serve autoscale | $2,899 | $869 | −$2,030 |
| PostgreSQL + pgvector (RDS db.t4g.medium) | 100GB gp3 · FinSight Q&A corpus + embeddings | $90 | $90 | - |
| ElastiCache Redis (cache.t4g.small + replica) | semantic cache + agent sessions + circuit-breaker fallback | $54 | $54 | - |
| Ray head node (t4g.small) | cluster coordinator + dashboard · port 8265 | $15 | $15 | - |
| Observability (Prometheus + Grafana + OTel collector) | self-hosted on t4g.small · 7 panels + 6 alert rules | $20 | $20 | - |
| Nginx load balancer (in-compose) | round-robin upstream across Ray Serve replicas | - | - | - |
| Total · 5 tenants · 300k req/mo | $0.0103 per req baseline → $0.0035 per req optimized | $3,078 | $1,048 | −$2,030 (−66%) |
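The totals row can be re-derived from the component rows, which is a useful sanity check before changing any assumption. The sketch below restates the published monthly figures and recomputes the totals, the delta, and the per-request cost at 300k requests per month. It assumes the non-GPU components cost the same in both scenarios, which is what the published totals imply; the names `ROWS` and `summarize` are illustrative and are not the kit's cost-model CLI.

```python
REQUESTS_PER_MONTH = 300_000

# (component, baseline $/mo, optimized $/mo), copied from the table.
ROWS = [
    ("GPU compute", 2899, 869),
    ("PostgreSQL + pgvector", 90, 90),
    ("ElastiCache Redis", 54, 54),
    ("Ray head node", 15, 15),
    ("Observability", 20, 20),
    ("Nginx load balancer", 0, 0),  # runs in-compose, no separate line item
]


def summarize(rows, requests=REQUESTS_PER_MONTH):
    """Recompute totals, delta, and cost per request from the component rows."""
    baseline = sum(b for _, b, _ in rows)
    optimized = sum(o for _, _, o in rows)
    return {
        "baseline": baseline,
        "optimized": optimized,
        "delta": optimized - baseline,
        "delta_pct": (optimized - baseline) / baseline * 100.0,
        "baseline_per_req": baseline / requests,
        "optimized_per_req": optimized / requests,
    }
```

Because only the GPU row changes between scenarios, the 66% reduction in total spend corresponds to a somewhat larger reduction in GPU spend alone.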

Optimization levers

Ray Serve autoscale (market-hours min=2 / off-hours min=1)

ADR-002. Replica count scales with queued requests; market-hours policy prevents 90s+ TTFT cold-start cascade. Avg ~1.2 GPU-eq over 24h vs 4 always-on baseline.

−$2,030 / mo (alone)

Semantic cache (HNSW · threshold=0.92 · 35% hit rate)

ADR-003. Embedding similarity lookup (HNSW) eliminates 35% of GPU-bound model calls outright. Cached p99 = 8ms vs miss p99 = 195ms.

−$501 / mo (alone, stacks with autoscale)

vLLM continuous batching + PagedAttention (foundational)

ADR-001. max_num_seqs=256 fits in the 24GB A10G via PagedAttention; ~8× per-GPU throughput vs static batching. Without this, baseline would need 24+ GPUs.

−$15,400 / mo (foundational)

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your Locust regression report. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a chaos-runbook walkthrough.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Tomas R.

Ex-staff · inference platform · top-3 cloud

vLLM tuning, Ray Serve autoscale design, GPU economics, KV-cache strategy

“Send the diff. I'll go line-by-line through your max_num_seqs and your prefix-cache TTL and pick out the edge cases that bite at 100 RPS.”

Jamie L.

Principal · cost engineering · enterprise SaaS

FinOps for AI workloads, break-even-vs-vendor math, reserved-instance modeling, on-call cost defense

“Send your CSV and your autoscale policy. We'll walk it backwards from the totals row to the assumption that breaks first when traffic 3×s.”

Riya W.

Eng manager · AI infra · public mid-cap

Org design for serving teams, on-call rotation, hiring rubrics, staff promo packets

“If you're prepping for staff, send your ADR-005 + your chaos runbook. We'll work backwards from the rubric.”

Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + chaos review

What your tier unlocks

PRO unlocks Module 01. EXPERT unlocks the full serving platform.

PRO is the entry point — Module 01 (vLLM endpoint + FastAPI gateway + Locust harness) plus the rest of the PRO catalog. EXPERT unlocks Modules 02–04 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.

| What you get | FREE | PRO | EXPERT |
|---|---|---|---|
| Module 01 of P15: vLLM endpoint + Locust harness (~2h) | - | Included | Included |
| Modules 02–04 of P15: Optimize / Scale / Production (~9h) | - | - | Included |
| 5 committed ADRs + cost-model CSV: starter kit docs/adr/ + docs/cost-model/ | - | - | Included |
| PRO project catalog: production-grade builds | - | All current | All current + this one |
| Curriculum: all 7 tracks | Phase 1 only | All | All + bonus modules |
| Code review: Senior+ reviewers | - | 4 / month | Unlimited |
| Cohort-beta architecture review: async · 7-day turnaround · 12-member cap | - | - | Included |
| Certificate: verifiable on LinkedIn | - | Yes | Yes + LinkedIn rec |

$79/mo · billed monthly · open enrollment · cancel anytime

or annual: $699/yr (save 26%)

Who this is for

Pick this if you own the SLA, not just the model.

Staff / principal AI infra engineers

You own the inference SLA, the GPU budget, and the chaos runbook your CISO will read after the next incident. The 5 ADRs are exactly what a staff promo panel asks for.

Engineering managers · AI

You need a defensible GPU spend model + an on-call playbook before next-quarter headcount. The cost-model CSV (with break-even-vs-OpenAI math) is the answer with citations.

ML platform leads

You absorb LLM serving without absorbing 6 new vendors. vLLM + Ray + Redis + Prometheus on existing infra — that's the playbook for production inference on your stack.

Founding engineers · AI startups

Your first paying customer will ask 'why did the bot say that?' before they ask about scale. Latency p99 + cost-per-request + circuit-breaker degraded mode is the answer in one repo.

Related curriculum

Going deeper? Four tracks back this project.

The AI Inference & Serving Systems curriculum is the foundation. These four tracks let you go deeper on the parts that matter most for your role.

FAQ · EXPERT tier

Quick answers.

How is this different from P14 ai-retrieval-platform?

Different lanes. P15 owns inference *serving infrastructure* — vLLM continuous batching + PagedAttention, Ray Serve autoscale with market-hours policy, semantic cache, ServingCircuitBreaker, 5 chaos scenarios, GPU economics. P14 owns retrieval *infrastructure* — pgvector HNSW + BM25 + RRF + cross-encoder reranker, OpenAI function-calling agent, drift detection. Pick P15 if your manager asks 'what's our p99 under burst?' Pick P14 if your manager asks 'what's our hit-rate@10?'

How is this different from PRO?

Module 01 (vLLM endpoint + FastAPI gateway + Docker + Locust harness + 4 smoke tests) is included with PRO at $29/mo — a working cost-tracked /chat behind a Locust baseline. Modules 02–04 (semantic cache + RAG + 4-config tradeoff bench in M02; Ray Serve autoscale + SSE + sessions in M03; Prometheus + circuit breaker + 5 chaos scenarios in M04), plus the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review, unlock with EXPERT at $79/mo.

Do I need a real GPU to run this?

Not for smoke testing. The kit ships a MockLLM stand-in selected with MODEL_BACKEND=mock, and the 7 smoke gates pass against it with no API key, no model download, and no GPU. Toggle to real vLLM (MODEL_BACKEND=vllm) when you have access to an A10G or larger. Production scenarios assume an A10G on g5.xlarge ($1.006/hr on AWS); the cost-model CSV documents alternatives.
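The MODEL_BACKEND toggle described in this answer could be wired roughly as follows. This is a hedged sketch of the pattern only: `MODEL_BACKEND` and the name MockLLM come from this page, while the `complete` method, the `select_backend` function, the default of `mock`, and the error handling are assumptions and may differ from the kit's code.

```python
import os


class MockLLM:
    """Deterministic CPU-only stand-in; the interface here is hypothetical."""

    def complete(self, prompt):
        return f"[mock] {prompt[:40]}"


def select_backend(env=None):
    """Pick the inference backend from MODEL_BACKEND (assumed default: mock)."""
    env = os.environ if env is None else env
    backend = env.get("MODEL_BACKEND", "mock")
    if backend == "mock":
        return MockLLM()
    if backend == "vllm":
        # The real engine needs a GPU host, so it is left unwired in this sketch.
        raise NotImplementedError("vLLM backend is not wired in this sketch")
    raise ValueError(f"unknown MODEL_BACKEND: {backend!r}")
```

Resolving the backend once at startup keeps the gateway code identical between CPU smoke runs and GPU deployments.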

Are the latency numbers (1800ms → 127ms) reproducible?

Real and reproducible on the FinSight workload. The 4-config A/B bench in M02 measures: baseline (no cache, naive) at 1800ms p99, +vLLM batching + KV cache at 420ms p99, +pgvector RAG at 195ms p99, +semantic cache at 127ms cached p99 (8ms on hit). The starter zip ships the Locust deck + the prompt set + the bench harness so you can re-run on your hardware. Real workloads will move the numbers; the methodology travels.

How long until I can finish this project?

11 hours of focused work across 4 modules. Most learners spread it across 3–4 weeks alongside a day job. Module 01 alone is ~2 hours and ships a working vLLM endpoint with Locust baseline — that's a meaningful PRO deliverable on its own if you want to gauge the project before unlocking EXPERT.

Is this enough to interview for staff AI infra / inference-platform roles?

It's a strong forcing function. Staff AI infra interviews lean heavily on serving system design (latency budgets, autoscale, circuit breakers, GPU economics) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the cold-start cascade receipts) plus the chaos runbook are exactly the artifacts a panel asks about. Pair with the cohort-beta async review and you have a portfolio piece that survives a staff promo packet.

Related projects

Paired with this project

P17 · PAID · analytics

Full-stack AI platform — full RAG system + production hardening

EXPERT-tier full-stack RAG: pgvector + HNSW + hybrid retrieval + RRF + cross-encoder rerank, 4-class query router with confidence threshold, 3-level failure cascade (RAG → LLM-only → cached), per-tenant index isolation, eval gates, cost guardrails, 6-mode incident simulator, 5 committed ADRs (one Deprecated), runnable cost-model CSV. 6 modules · 20-22h. Modules 01-03 with PRO.

P09 · PAID · ai

AI cost optimization (CostGuard)

Cost-aware LLM platform: token tracking, dual-tier cache, 4-strategy router, three-tier budget governance. 5 ADRs + cost-model CSV bundled.

Ready to ship a cost-aware serving platform?

Start with PRO ($29/mo) for Module 01 — vLLM endpoint + FastAPI gateway + Locust baseline. Or unlock the full 4-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

Tracks backing this project: AI Inference & Serving Systems · MLOps for Data Engineers · Data Observability & Quality · Cost Optimization for Data Engineers
