Run a cost-aware AI platform that survives a CFO review
Ship a cost-aware LLM platform with per-request token tracking, dual-tier caching, a 4-strategy router, three-tier budget governance, anomaly detection, and 5 committed ADRs. Modules 01–02 unlock with PRO; the optimization stack unlocks with EXPERT.
The FinOps + system-design portfolio piece for staff AI infra roles — 5 committed ADRs, a runnable cost-model CSV, a three-tier budget engine with fail-open, and an incident runbook you can defend in an architecture review.
- Per-request lifecycle tracer + cost recorder (Postgres llm_requests + cost_daily_summary)
- Dual-tier cache: hashed exact-match in front of embedding-based semantic lookup
- 4-strategy router (keyword / complexity / confidence / fallback) on a cost-latency-quality triangle
- Three-tier budget hierarchy (org → team → user) with fail-open enforcement
- Anomaly detection (spike + IQR + pattern-deviation) with severity-routed alerts
- 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
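The router's fallback behaviour ("step up on failure, never down") can be sketched roughly as follows. This is a minimal illustration, not the code in src/routing/: the tier names and the `call_model` callable are hypothetical placeholders.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical tiers, ordered cheapest to most capable (not the repo's model names).
TIERS: Sequence[str] = ("small", "medium", "large")


def route_with_fallback(
    prompt: str,
    start_tier: str,
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try start_tier first, then escalate to more capable tiers on failure.

    Returns (tier_used, response). A cheaper tier is never retried.
    """
    start = TIERS.index(start_tier)
    last_error: Optional[Exception] = None
    for tier in TIERS[start:]:
        try:
            return tier, call_model(tier, prompt)
        except Exception as exc:  # a real router would catch narrower errors
            last_error = exc
    raise RuntimeError("all tiers failed") from last_error
```

Recording the returned tier next to the request is one way to get the route attribution the page describes for each llm_requests row.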
Modules 01–02 unlock with PRO. The optimization stack unlocks with EXPERT.
Modules 01–02 (~4h) ship a working cost-tracked LLM service with instrumentation, schema, aggregation queries, and a working /chat — included with PRO. Modules 03–05 (~10h) layer the dual-tier cache, the 4-strategy router, the three-tier budget engine, and the anomaly detector on top — and unlock with EXPERT.
Foundation. Optimization. Governance.
Each phase ends with a tagged release, a passing smoke-test run, and a measurable cost delta on the seed workload. No ambiguity about where you are.
Cost-tracked /chat live locally. Per-request lifecycle tracer, Postgres schema with detail + rollup tables, hourly aggregator.
- ✓ Working /chat with X-Cost headers on every response
- ✓ 11k+ seed rows + working cost dashboards via SQL
- ✓ ON-CONFLICT upsert tested against duplicate aggregator runs
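Why the upsert survives duplicate aggregator runs can be shown with a small sketch. It uses SQLite from the standard library as a stand-in for Postgres (both accept this `ON CONFLICT ... DO UPDATE` form), and the column set beyond the conflict key is an assumption, not the repo's schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the detail and rollup tables (columns are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE llm_requests (
            date TEXT, model TEXT, endpoint TEXT, user_id TEXT, cost_usd REAL
        );
        CREATE TABLE cost_daily_summary (
            date TEXT, model TEXT, endpoint TEXT, user_id TEXT,
            request_count INTEGER, total_cost_usd REAL,
            PRIMARY KEY (date, model, endpoint, user_id)
        );
    """)
    return conn


# Totals are recomputed from the detail rows and overwritten on conflict,
# so running the rollup twice leaves the summary unchanged.
# "WHERE 1 = 1" avoids SQLite's parsing ambiguity between a join ON and ON CONFLICT.
ROLLUP_SQL = """
INSERT INTO cost_daily_summary
    (date, model, endpoint, user_id, request_count, total_cost_usd)
SELECT date, model, endpoint, user_id, COUNT(*), SUM(cost_usd)
FROM llm_requests
WHERE 1 = 1
GROUP BY date, model, endpoint, user_id
ON CONFLICT (date, model, endpoint, user_id) DO UPDATE SET
    request_count = excluded.request_count,
    total_cost_usd = excluded.total_cost_usd
"""


def run_rollup(conn: sqlite3.Connection) -> None:
    conn.execute(ROLLUP_SQL)
    conn.commit()
```

The design choice worth noting: the update assigns recomputed totals instead of adding to the stored ones, which is what makes a repeated run harmless.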
Dual-tier cache + 4-strategy router live. End-to-end cost cut on the seed dataset; A/B evaluation harness runs in CI.
- ✓ Exact + semantic cache with shared TTL + threshold runbook
- ✓ Router with route_decision attribution on every llm_requests row
- ✓ A/B harness comparing tier choices on a 50-prompt eval set
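The dual-tier lookup order (hashed exact match first, semantic match second) can be sketched in memory like this. It is an illustration under assumptions: the real build uses Redis and an embedding model, while here the stores are plain Python containers and `embed` is a caller-supplied, hypothetical function. Only the SHA-256 key and the 0.92 cosine threshold come from the page.

```python
import hashlib
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.92  # cosine cut-off quoted on the page


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DualTierCache:
    """Exact-match tier in front of a semantic tier (in-memory stand-in)."""

    def __init__(self, embed: Callable[[str], Sequence[float]]) -> None:
        self._embed = embed  # hypothetical embedding function
        self._exact: Dict[str, str] = {}
        self._semantic: List[Tuple[Sequence[float], str]] = []

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self._exact[self._key(prompt)] = response
        self._semantic.append((self._embed(prompt), response))

    def get(self, prompt: str) -> Optional[Tuple[str, str]]:
        """Return (tier, response) on a hit, or None on a miss."""
        hit = self._exact.get(self._key(prompt))
        if hit is not None:
            return "exact", hit  # cheap path: no embedding call needed
        vec = self._embed(prompt)
        best_score, best = 0.0, None
        for stored_vec, response in self._semantic:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best = score, response
        if best is not None and best_score >= SIMILARITY_THRESHOLD:
            return "semantic", best
        return None
```

Putting the hash lookup first means identical prompts never pay for an embedding; the linear scan in the semantic tier is only for illustration and would be a vector index in practice.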
Three-tier budget with fail-open enforcement, anomaly detector with severity-routed alerts, on-call runbook drilled.
- ✓ asyncpg-backed budget check on the hot path (< 8 ms p95)
- ✓ Anomaly detector wired to Prometheus + alert routing rules
- ✓ Cost incident runbook + sla.yaml targets ready to defend
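The IQR part of the anomaly detector can be illustrated with Tukey's fences. This is a sketch under assumptions, not the repo's anomaly.py: the multiplier `k = 1.5` is the conventional default, only the upper fence is checked because the concern here is cost spikes, and the function names are made up for the example.

```python
import statistics
from typing import Sequence


def iqr_upper_fence(history: Sequence[float], k: float = 1.5) -> float:
    """Upper Tukey fence: Q3 + k * (Q3 - Q1)."""
    q1, _median, q3 = statistics.quantiles(history, n=4)
    return q3 + k * (q3 - q1)


def is_cost_anomaly(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Flag value if it lies above the upper fence of the historical costs."""
    return value > iqr_upper_fence(history, k)
```

For example, with hourly costs of 10 through 17 as history, a reading of 30 is flagged while 18 is not.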
One command. Local FastAPI + Postgres + Redis + Prometheus.
What lives in the repo
You get the real cost-aware platform on day one — FastAPI as the gateway, PostgreSQL for cost tracking + budget storage, Redis for exact-match cache + semantic embeddings + rate-limit fallback, Prometheus for metrics, and pytest with a mocked-LLM toggle so the smoke tests run end-to-end without an API key.
- docker-compose.cost.yml — Postgres 15 + Redis 7 + Prometheus + optional API container
- tracker.py + pricing.py + models.py — SimpleTracker around OpenAI; per-1M cost math; SQLAlchemy ORM
- src/cache/ — exact + semantic cache + optimizer + savings tracker
- src/routing/ — triangle, 4 strategies, fallback chain, eval harness
- src/cost/ — budget engine, governance, anomaly detector, alerts, failures
- docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
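The per-1M cost math mentioned for pricing.py comes down to one formula: tokens times price per million, divided by one million. A minimal sketch follows; the model name and prices are placeholders chosen for easy arithmetic, not real list prices and not the contents of pricing.py.

```python
from decimal import Decimal

# Placeholder prices in USD per 1M tokens; NOT real list prices.
PRICES_PER_1M = {
    "example-model": {"input": Decimal("1.00"), "output": Decimal("2.00")},
}

ONE_MILLION = Decimal(1_000_000)


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD for one request, priced separately for input and output tokens."""
    price = PRICES_PER_1M[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / ONE_MILLION
```

With these placeholder prices, 1,000 input tokens and 500 output tokens cost $0.002. Decimal is used here so that summing many small per-request costs does not accumulate float rounding error.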
AI Cost Optimization Starter Kit
Pre-built cost-aware platform with seeded Postgres (11k+ rows of llm_requests + daily summaries + budgets + anomalies), Redis, Prometheus, and a 13-test smoke suite that runs against a mocked LLM. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.
The same LLM client — but built for the bill-defending case.
Most cost tutorials show you a token counter wrapped around a single OpenAI call. This shows what changes when 8 tenants share infra, finance owns the org cap, and SRE owns the on-call when the budget DB blips.
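What "the budget DB blips" means for the hot path can be sketched as a timeout plus a fail-open fallback. The 50 ms timeout and the 30 req/min fallback limit are the figures quoted on this page; everything else is an assumption for illustration: `lookup` stands in for the asyncpg query, and the limiter is an in-memory substitute for the Redis one.

```python
import asyncio
import time
from collections import deque
from typing import Awaitable, Callable, Deque

BUDGET_TIMEOUT_S = 0.050  # 50 ms budget-lookup timeout
FALLBACK_LIMIT = 30       # requests per minute allowed while failing open


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis rate limiter."""

    def __init__(self, limit: int = FALLBACK_LIMIT, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: Deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self._hits and now - self._hits[0] >= self.window_s:
            self._hits.popleft()
        if len(self._hits) >= self.limit:
            return False
        self._hits.append(now)
        return True


async def check_budget(
    lookup: Callable[[], Awaitable[bool]],
    limiter: SlidingWindowLimiter,
) -> bool:
    """Return True if the request may proceed."""
    try:
        # Normal path: the budget store answers within the timeout.
        return await asyncio.wait_for(lookup(), timeout=BUDGET_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Fail open: do not block traffic on a slow DB, but cap it.
        return limiter.allow()
```

The trade-off is explicit: a slow budget store never takes /chat down, and the rate limit bounds how much unbudgeted spend can slip through while it is degraded.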
- tracker.py writes llm_requests with route attribution; hourly ON CONFLICT rollup
- cosine ≥ 0.92 semantic; ~50% hit at reference workload (ADR-001)
- triangle.py (ADR-003)
- asyncpg 50 ms timeout + Redis fail-open (ADR-002)
- anomaly.py; severity-routed alerts to Prometheus + on-call
- asyncpg.Pool for hot-path budget reads (ADR-005 Deprecated)
Write the ADRs staff engineers actually get judged on.
Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the sync-SQLAlchemy → asyncpg reversal that production load forced.
Dual-tier caching: exact-match in front of semantic
SHA-256(prompt) Redis hash → fall through to cosine ≥ 0.92 on the embedding store
Three-tier budget hierarchy with fail-open enforcement
org → team → user caps; asyncpg 50 ms timeout falls open to Redis 30 req/min rate limiter
Cost-latency-quality routing triangle with fallback chain
keyword → complexity → confidence → fallback_chain; step up on failure, never down
Per-request detail + daily rollup with ON CONFLICT upsert
llm_requests (detail) + cost_daily_summary (rollup); hourly ON CONFLICT (date, model, endpoint, user_id) upsert
Read the FinOps story, not just the latency one.
Module 03 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models an 8-tenant reference load (50k req/day · 1.5M req/mo) at real OpenAI + AWS list prices, with the dual-tier cache and model-cascade levers wired up. This is the version you defend to a CFO.
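The arithmetic behind the reference load and the cache lever can be sketched in a few lines. Assumptions are labelled loudly: a 30-day month (which is what turns 50k req/day into 1.5M req/mo), a placeholder average cost per request that is not a real price, and a simplification that cache hits incur no LLM spend (embedding and Redis costs are ignored). The CSV itself may model these differently.

```python
from decimal import Decimal


def monthly_requests(requests_per_day: int, days: int = 30) -> int:
    """Assumes a 30-day month: 50,000 req/day -> 1,500,000 req/mo."""
    return requests_per_day * days


def monthly_llm_cost(
    requests_per_day: int,
    avg_cost_per_request: Decimal,  # placeholder input, not a list price
    cache_hit_rate: Decimal,        # fraction of requests served from cache
    days: int = 30,
) -> Decimal:
    """LLM spend per month, assuming cache hits cost nothing (a simplification)."""
    billable = monthly_requests(requests_per_day, days) * (1 - cache_hit_rate)
    return billable * avg_cost_per_request
```

At a placeholder $0.002 per request, the 1.5M req/mo load costs $3,000 with no cache and $1,500 at a 50% hit rate, which is the shape of the saving the cache lever is meant to expose.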
Optimization levers
Async architecture review with a staff-level reviewer (cohort beta).
Submit your repo, your ADR draft, or your cost-model CSV. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.
Bring a diff, an ADR draft, or a cost-defense deck.
The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.
PRO unlocks Modules 01–02. EXPERT unlocks the full platform.
PRO is the entry point — Modules 01–02 (cost-tracked /chat with instrumentation + schema + aggregator) plus the rest of the PRO catalog. EXPERT unlocks Modules 03–05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.
Pick this if you defend the bill, not just write the prompt.
Staff / principal AI engineers
You own the inference bill, the cache architecture, and the cost-attribution story your CFO will pull apart at QBR. The 5 ADRs are exactly the artifacts a staff promo panel asks about.
Engineering managers · AI
You need a cost defense for the AI roadmap your VP will ask about before next-quarter headcount. The cost-model CSV is the answer with citations.
Platform / infra leads
You absorb AI without absorbing 6 new vendors. Postgres, Redis, Prometheus, FastAPI — tools you already operate. This is the FinOps playbook for your existing stack.
Founding engineers · AI startups
Your investors will ask about unit economics before they ask about scale. The cost model, the runbook, and the budget engine's fail-open are the answer, in one repo.
Going deeper? Four tracks back this project.
The Cost Optimization curriculum is the FinOps foundation. These four tracks let you go deeper on the parts that matter most for your role.
Quick answers.
Paired with this project
EXPERT-tier inference build: vLLM continuous batching + PagedAttention, Ray Serve autoscale (market-hours min=2), Redis semantic cache (35% hit), ServingCircuitBreaker, 5 chaos scenarios + runbook, runnable cost-model CSV with break-even-vs-OpenAI math. Module 01 with PRO.
Cut a $300K Snowflake bill 60% — forensics, right-size, compact, govern.
Ready to ship a cost-aware AI platform?
Start with PRO ($29/mo) for Modules 01–02 — architecture + cost tracking. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).