ai-de.net/Projects/P09 · AI Cost Optimization — CostGuard platform
Last updated By AI-DE Engineering Team
EXPERT-tier · PRO unlocks Modules 01–02 · AI & vectors track · P09

Run a cost-aware AI platform that survives a CFO review

Ship a cost-aware LLM platform with per-request token tracking, dual-tier caching, a 4-strategy router, three-tier budget governance, anomaly detection, and 5 committed ADRs. Modules 01–02 unlock with PRO; the optimization stack unlocks with EXPERT.

Timeline
14 hours
Difficulty
Senior+
Stack
FastAPI · OpenAI · Redis · Postgres · asyncpg · Prometheus

The FinOps + system-design portfolio piece for staff AI infra roles — 5 committed ADRs, a runnable cost-model CSV, a three-tier budget engine with fail-open, and an incident runbook you can defend in an architecture review.

By the end you will have wired
  • Per-request lifecycle tracer + cost recorder (Postgres llm_requests + cost_daily_summary)
  • Dual-tier cache: hashed exact-match in front of embedding-based semantic lookup
  • 4-strategy router (keyword / complexity / confidence / fallback) on a cost-latency-quality triangle
  • Three-tier budget hierarchy (org → team → user) with fail-open enforcement
  • Anomaly detection (spike + IQR + pattern-deviation) with severity-routed alerts
  • 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
PREREQ · SENIOR+
Built for engineers running AI in production with a real bill to defend. Comfortable with Python services, Postgres, Redis, and at least one of: vendor LLM APIs, request-tracing, or FinOps. Not a “hello tiktoken” tutorial.
costguard.platform · tenant=acme · 22 components wired · budget engine armed

Track: per-request lifecycle tracer
  • /chat request: FastAPI middleware
  • tracker.py: tiktoken token count
  • trace.py: 6-stage lifecycle
  • cost_recorder: INSERT llm_requests

Optimize: dual-tier cache + 4-strategy router
  • exact_cache: Redis hash · ~3 ms p50
  • semantic_cache: embedding · cosine ≥ 0.92
  • optimizer.py: trim + summarise history
  • router.py: MINI → STD → PREMIUM

Control: three-tier budget + anomaly
  • budget.py: org → team → user
  • governance.py: fail-open via Redis
  • anomaly.py: spike + IQR + pattern
  • alerts.py: INFO / WARN / CRITICAL

Persist: Postgres + Redis + Prometheus scrape
  • llm_requests: BIGSERIAL · 11k+ seed
  • cost_daily_summary: ON CONFLICT upsert
  • budget_tiers: + overage_requests
  • Redis: cache + embeddings + RL
# Live cost tracking on every /chat call
tiktoken counts input + output → pricing.py per-1M math
INSERT into llm_requests with route_decision attribution
Hourly aggregator UPSERTs cost_daily_summary on (date, model, endpoint, user)
→ dashboards refresh in < 500 ms; per-request debug stays auditable
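
The per-1M math in the panel above can be sketched as follows. This is a minimal illustration, not the kit's pricing.py: the function name, the table layout, and the rates are assumptions (the rates are illustrative list prices and should be checked against the vendor's current price sheet).

```python
from decimal import Decimal

# Assumed, illustrative USD rates per 1M tokens; not taken from the kit.
PRICES_PER_1M = {
    "gpt-4o": {"input": Decimal("2.50"), "output": Decimal("10.00")},
    "gpt-4o-mini": {"input": Decimal("0.15"), "output": Decimal("0.60")},
}
ONE_MILLION = Decimal(1_000_000)


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Price one request from its token counts (hypothetical helper)."""
    rates = PRICES_PER_1M[model]
    cost = (input_tokens * rates["input"] + output_tokens * rates["output"]) / ONE_MILLION
    # Keep micro-dollar precision so small requests do not round to zero.
    return cost.quantize(Decimal("0.000001"))


premium_cost = request_cost_usd("gpt-4o", input_tokens=1500, output_tokens=500)
mini_cost = request_cost_usd("gpt-4o-mini", input_tokens=1500, output_tokens=500)
```

Decimal is used instead of float so that per-request values sum cleanly in a rollup; whether the kit makes the same choice is not stated in the source.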
# 5 ADRs + cost-model CSV bundled in the kit
docs/adr/001-dual-tier-caching.md — exact + semantic cache decision
docs/adr/005-sync-sqlalchemy-deprecated.md — the M02→M05 reversal
docs/cost-model/ai-cost-optimization-cost-model.csv — 8 tenants × 1.5M req/mo
→ commit the artifacts a staff promo panel will actually open
60 files in starter zip · ADRs bundled
−83% model spend at the reference scenario
13 smoke tests · mocked LLM mode
Curriculum · 5 modules · 14 hours · 3 phases

Modules 01–02 unlock with PRO. The optimization stack with EXPERT.

Modules 01–02 (~4h) ship a working cost-tracked LLM service with instrumentation, schema, aggregation queries, and a working /chat — included with PRO. Modules 03–05 (~10h) layer the dual-tier cache, the 4-strategy router, the three-tier budget engine, and the anomaly detector on top — and unlock with EXPERT.

P09 · 5 modules · 14 hours · 30+ lessons
Free preview EXPERT required
M01
System Architecture & Request Lifecycle
Map the cost-aware platform end-to-end. Trace a request from FastAPI middleware through token counting, cost recording, caching, routing, and budget enforcement. Build the lifecycle tracer and a cost-flow analyzer that names where each cent goes.
Phase 1: Foundation · 1.5h · 6 lessons · PRO TIER
Unlock with PRO →
M02
Token Tracking & Cost Visibility
Build the persistence layer: llm_requests + cost_daily_summary tables, the SimpleTracker wrapper around the OpenAI client, the pricing.py per-1M math, and an aggregator with ON CONFLICT upserts. Working /chat with cost-on-every-response by the end.
Phase 1: Foundation · 2.5h · 8 lessons · PRO TIER
Unlock with PRO →
M03
Caching, Prompt Optimization & Cost Reduction
Dual-tier cache: SHA-256 exact-match in front of embedding-based semantic lookup. Prompt optimizer trims history, summarises conversation, and dedups duplicate context. Cost-savings tracker compares before/after on the seed dataset.
Phase 2: Optimization · 3h · 7 lessons · EXPERT TIER
Unlock with EXPERT →
M04
Model Routing, Quality & Optimization
Cost-latency-quality triangle (MINI / STANDARD / PREMIUM). 4-strategy router (keyword + complexity + confidence + fallback chain). A/B evaluation harness, latency budget enforcement, exponential-backoff circuit breaker. Step-up-on-failure resilience pattern.
Phase 2: Optimization · 3.5h · 7 lessons · EXPERT TIER
Unlock with EXPERT →
M05
Platform Operations & Cost Governance
Three-tier budget hierarchy (org → team → user) with fail-open + Redis rate-limit fallback (asyncpg, 50ms timeout). Anomaly detector (spike + IQR + pattern-deviation) with severity-routed alerts. Cost incident runbook + SLA targets in sla.yaml.
Phase 3: Production · 3.5h · 8 lessons · EXPERT TIER
Unlock with EXPERT →
Modules 01–02 with PRO ($29/mo) · Modules 03–05 with EXPERT ($79/mo)
See plans →
Backed by curriculum
Cost Optimization for Data Engineers
10 modules · 14 hours · FinOps · Cost levers · Unit economics · Budgets
Open curriculum
The Cost Optimization curriculum is the foundation for the FinOps mindset this project applies to LLMs: same levers (orphaned compute, query tuning, budget guardrails), different layer of the stack. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Foundation. Optimization. Governance.

Each phase ends with a tagged release, a passing smoke-test run, and a measurable cost delta on the seed workload. No ambiguity about where you are.

01 · ~4h
Foundation (Modules 01–02)

Cost-tracked /chat live locally. Per-request lifecycle tracer, Postgres schema with detail + rollup tables, hourly aggregator.

  • Working /chat with X-Cost headers on every response
  • 11k+ seed rows + working cost dashboards via SQL
  • ON-CONFLICT upsert tested against duplicate aggregator runs
02 · ~6.5h
Optimization (Modules 03–04)

Dual-tier cache + 4-strategy router live. End-to-end cost cut on the seed dataset; A/B evaluation harness runs in CI.

  • Exact + semantic cache with shared TTL + threshold runbook
  • Router with route_decision attribution on every llm_requests row
  • A/B harness comparing tier choices on a 50-prompt eval set
03 · ~3.5h
Governance (Module 05)

Three-tier budget with fail-open enforcement, anomaly detector with severity-routed alerts, on-call runbook drilled.

  • asyncpg-backed budget check on the hot path (< 8 ms p95)
  • Anomaly detector wired to Prometheus + alert routing rules
  • Cost incident runbook + sla.yaml targets ready to defend
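
As a rough sketch of the spike and IQR checks behind the anomaly detector, assuming hourly spend per tenant as the input series: the function names, the 3× spike multiplier, the 1.5× IQR fence, and the severity mapping are all illustrative assumptions, and the pattern-deviation check is left out.

```python
import statistics
from typing import Sequence


def is_spike(history: Sequence[float], value: float, multiplier: float = 3.0) -> bool:
    """Flag a value well above the trailing mean (assumed 3x multiplier)."""
    return bool(history) and value > multiplier * statistics.fmean(history)


def is_iqr_outlier(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Flag a value above the upper Tukey fence, Q3 + k * IQR."""
    if len(history) < 4:
        return False  # too little data for stable quartiles
    q1, _, q3 = statistics.quantiles(history, n=4)
    return value > q3 + k * (q3 - q1)


def severity(history: Sequence[float], value: float) -> str:
    """Hypothetical mapping: both checks fire -> CRITICAL, one -> WARN."""
    spike = is_spike(history, value)
    outlier = is_iqr_outlier(history, value)
    if spike and outlier:
        return "CRITICAL"
    if spike or outlier:
        return "WARN"
    return "INFO"


hourly_spend = [10, 11, 9, 10, 12, 10, 11]
```

Requiring two independent checks before paging is one way to keep CRITICAL alerts rare; the kit's actual routing rules may differ.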
Project setup · 15 minutes

One command. Local FastAPI + Postgres + Redis + Prometheus.

What lives in the repo

You get the real cost-aware platform on day one — FastAPI as the gateway, PostgreSQL for cost tracking + budget storage, Redis for exact-match cache + semantic embeddings + rate-limit fallback, Prometheus for metrics, and pytest with a mocked-LLM toggle so the smoke tests run end-to-end without an API key.

  • docker-compose.cost.yml — Postgres 15 + Redis 7 + Prometheus + optional API container
  • tracker.py + pricing.py + models.py — SimpleTracker around OpenAI; per-1M cost math; SQLAlchemy ORM
  • src/cache/ — exact + semantic cache + optimizer + savings tracker
  • src/routing/ — triangle, 4 strategies, fallback chain, eval harness
  • src/cost/ — budget engine, governance, anomaly detector, alerts, failures
  • docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 60 files · 1.05 MB

AI Cost Optimization Starter Kit

Pre-built cost-aware platform with seeded Postgres (11k+ rows of llm_requests + daily summaries + budgets + anomalies), Redis, Prometheus, and a 13-test smoke suite that runs against a mocked LLM. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 60 files · ADRs + cost model bundled · last updated 2026-05-09
~/projects/ai-cost-optimization — zsh
1. Unzip and bring up the platform (mocked LLM, no API key needed)
$ unzip ai-cost-optimization-starter.zip
$ cd ai-cost-optimization-starter && cp .env.example .env
$ docker compose -f docker-compose.cost.yml up -d
2. Run the 13 smoke tests against the mocked LLM
$ python -m venv .venv && source .venv/bin/activate
$ pip install -r requirements.txt
$ pytest tests/test_smoke.py -v
3. Send a cost-tracked request and inspect the row
$ curl -X POST http://localhost:8000/chat \
    -H 'Content-Type: application/json' -H 'X-User-ID: u_42' \
    -d '{"prompt":"Summarize Q4 earnings"}'
$ psql $DATABASE_URL -c "SELECT model,cost_usd,cached,route_decision FROM llm_requests ORDER BY id DESC LIMIT 1"
4. Read the ADRs and re-run the cost model
$ cat docs/adr/001-dual-tier-caching.md
$ open docs/cost-model/ai-cost-optimization-cost-model.csv
11,311 seed rows total
10k llm_requests · 50 users · 5 models
100 cost_daily_summary rows
200 + 5 anomalies + overage requests
Production hardening

The same LLM client — but built for the bill-defending case.

Most cost tutorials show you a token counter wrapped around a single OpenAI call. This shows what changes when 8 tenants share infra, finance owns the org cap, and SRE owns the on-call when the budget DB blips.

Notebook tracker (what most teams ship)
  • Token counter: print statement after each call
  • Cache: none; every call hits the model
  • Routing: one model for everything
  • Budget: manual monthly invoice review
  • Anomaly: notice next month when finance asks
  • Persistence: sync everywhere; threadpool eats blocking I/O

Your cost-aware platform (Modules 03–05)
  • Token counter: tracker.py writes llm_requests with route attribution; hourly ON CONFLICT rollup
  • Cache: dual-tier, SHA-256 exact in front of cosine ≥ 0.92 semantic; ~50% hit at reference workload (ADR-001)
  • Routing: 4-strategy chain (keyword/complexity/confidence/fallback) on a cost-latency-quality triangle (triangle.py) (ADR-003)
  • Budget: org → team → user cap with asyncpg 50 ms timeout + Redis fail-open (ADR-002)
  • Anomaly: spike + IQR + pattern-deviation in anomaly.py; severity-routed alerts to Prometheus + on-call
  • Persistence: sync SQLAlchemy for ORM/aggregator; asyncpg.Pool for hot-path budget reads (ADR-005 Deprecated)
EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the sync-SQLAlchemy → asyncpg reversal that real production load forced. Preview ADR-001 →

ADR-001 · Accepted

Dual-tier caching: exact-match in front of semantic

Context
FAQ-shape repeats want O(1) lookup; paraphrases want embedding similarity. Either alone leaves money on the table.
Decision
SHA-256(prompt) Redis hash → fall through to cosine ≥ 0.92 on the embedding store
Tradeoff
Two TTL knobs + 1 embedding call per cache miss vs ~50% combined hit rate at the reference workload
Reversal
Drop semantic tier when marginal hit rate < 15%; ~2 engineer-days to disable
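
The ADR-001 lookup order can be sketched in a few lines. This is an in-memory illustration under stated assumptions: plain dicts and lists stand in for the Redis hash and the embedding store, `embed` is a caller-supplied function, and the class and method names are hypothetical. Only the SHA-256 key and the 0.92 cosine threshold come from the ADR.

```python
import hashlib
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DualTierCache:
    """Exact-match tier in front of a semantic tier (in-memory stand-in)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92) -> None:
        self.embed = embed
        self.threshold = threshold
        self.exact = {}      # sha256(prompt) -> response
        self.semantic = []   # (embedding, response) pairs

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> tuple[Optional[str], str]:
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"
        # Exact tier missed: pay for one embedding call, then scan by similarity.
        query = self.embed(prompt)
        best = max(self.semantic, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        self.semantic.append((self.embed(prompt), response))


# Toy embeddings, for illustration only.
toy_vectors = {
    "reset my password": [1.0, 0.0],
    "how do I reset my password": [0.98, 0.2],
    "weather tomorrow": [0.0, 1.0],
}
cache = DualTierCache(embed=lambda prompt: toy_vectors[prompt])
cache.put("reset my password", "Use the reset link.")
```

A linear scan is fine for a sketch; a real semantic tier would use a vector index, and TTLs are omitted here.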
ADR-002 · Accepted

Three-tier budget hierarchy with fail-open enforcement

Context
Finance, team leads, and engineers each need a knob; Postgres outage cannot freeze the LLM platform
Decision
org → team → user caps; asyncpg 50ms timeout falls open to Redis 30 req/min rate limiter
Tradeoff
Bounded overage during DB outage (~$0.75 / 5 min) vs false-rejecting paying users
Reversal
Hard-reject mode is a single .env flag; SLA drops to 99.5%
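
A minimal sketch of the ADR-002 fail-open path, assuming the budget lookup is an awaitable that returns remaining USD per tier. The 50 ms timeout and the 30 req/min fallback come from the ADR; the function names, the return shape, and the in-memory limiter standing in for Redis are assumptions.

```python
import asyncio
import time
from collections import defaultdict, deque
from typing import Awaitable, Callable, Optional


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis rate limiter (30 requests per 60 s)."""

    def __init__(self, max_requests: int = 30, window_s: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop hits that left the window
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


async def check_budget(
    user_id: str,
    cost_usd: float,
    fetch_remaining: Callable[[str], Awaitable[dict]],
    limiter: SlidingWindowLimiter,
    timeout_s: float = 0.05,
) -> tuple[bool, str]:
    try:
        remaining = await asyncio.wait_for(fetch_remaining(user_id), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Budget store slow or down: fail open, bounded by the rate limiter.
        return limiter.allow(user_id), "fail_open"
    for tier in ("org", "team", "user"):
        if remaining[tier] < cost_usd:
            return False, f"{tier}_cap"
    return True, "within_budget"


async def healthy_db(user_id: str) -> dict:
    return {"org": 100.0, "team": 10.0, "user": 0.001}


async def slow_db(user_id: str) -> dict:
    await asyncio.sleep(0.5)  # exceeds the 50 ms budget
    return {"org": 100.0, "team": 10.0, "user": 1.0}


async def down_db(user_id: str) -> dict:
    raise ConnectionError("budget store unavailable")
```

The limiter is what bounds the overage during an outage: once a user exhausts the window, requests are rejected even in fail-open mode.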
ADR-003 · Accepted

Cost-latency-quality routing triangle with fallback chain

Context
30× price gap MINI vs PREMIUM; static keyword routing only catches ~30%; classify-first adds 200ms p95 to every call
Decision
4-strategy chain: keyword → complexity → confidence → fallback_chain; step up on failure, never down
Tradeoff
Confidence router runs MINI on ~15% of requests (added cost) for ~5× ROI on routing decision
Reversal
Disable confidence router via .env flag; re-tune complexity threshold; ~3 engineer-days
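
One way to picture the ADR-003 chain is the sketch below. The tier names and the step-up-only rule follow the ADR; the keyword lists, the 120-word complexity threshold, the 0.7 confidence floor, and every function name are hypothetical placeholders rather than the kit's router.py.

```python
from enum import IntEnum
from typing import Callable, Optional


class Tier(IntEnum):
    MINI = 0
    STANDARD = 1
    PREMIUM = 2


# Hypothetical rules, for illustration only.
PREMIUM_KEYWORDS = ("contract", "diagnose", "prove")
MINI_KEYWORDS = ("translate", "summarize", "classify")
COMPLEXITY_WORDS = 120
CONFIDENCE_FLOOR = 0.7


def by_keyword(prompt: str) -> Optional[Tier]:
    text = prompt.lower()
    if any(word in text for word in PREMIUM_KEYWORDS):
        return Tier.PREMIUM
    if any(word in text for word in MINI_KEYWORDS):
        return Tier.MINI
    return None


def by_complexity(prompt: str) -> Optional[Tier]:
    return Tier.STANDARD if len(prompt.split()) > COMPLEXITY_WORDS else None


def route(prompt: str, mini_confidence: Callable[[str], float]) -> tuple[Tier, str]:
    """Return (tier, route_decision); the first strategy with an opinion wins."""
    tier = by_keyword(prompt)
    if tier is not None:
        return tier, "keyword"
    tier = by_complexity(prompt)
    if tier is not None:
        return tier, "complexity"
    # Confidence strategy: score a MINI draft, step up if it looks unsure.
    confident = mini_confidence(prompt) >= CONFIDENCE_FLOOR
    return (Tier.MINI if confident else Tier.STANDARD), "confidence"


def call_with_fallback(start: Tier, call: Callable[[Tier], str]) -> tuple[Tier, str]:
    """Fallback chain: retry on the next tier up, never on a lower one."""
    last_error: Optional[Exception] = None
    for tier in Tier:
        if tier < start:
            continue
        try:
            return tier, call(tier)
        except RuntimeError as exc:
            last_error = exc
    raise RuntimeError("all tiers failed") from last_error


def flaky_model(tier: Tier) -> str:
    if tier is Tier.MINI:
        raise RuntimeError("simulated MINI timeout")
    return f"answer from {tier.name}"
```

The second element of the tuple is the kind of value that could populate route_decision on each llm_requests row.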
ADR-004 · Accepted

Per-request detail + daily rollup with ON CONFLICT upsert

Context
One table can't answer both 'why did req_a31f cost $0.42' and 'sum cost by team / month' under 500ms
Decision
llm_requests (detail) + cost_daily_summary (rollup); hourly ON CONFLICT (date,model,endpoint,user_id) upsert
Tradeoff
~12% storage overhead + lockstep schema migrations vs 35 ms dashboard latency from the rollup instead of 12 s from the detail table
Reversal
Deprecate rollup when detail-table latency clears the budget; freeze 90 days then drop
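
The idempotent rollup in ADR-004 can be tried without a database server. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres (an assumption made so it runs anywhere; it needs SQLite 3.24 or newer for upsert support). Column names such as `day` and `request_count` are illustrative, not the kit's schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE llm_requests (
    id INTEGER PRIMARY KEY,
    day TEXT NOT NULL,
    model TEXT NOT NULL,
    endpoint TEXT NOT NULL,
    user_id TEXT NOT NULL,
    cost_usd REAL NOT NULL
);
CREATE TABLE cost_daily_summary (
    day TEXT NOT NULL,
    model TEXT NOT NULL,
    endpoint TEXT NOT NULL,
    user_id TEXT NOT NULL,
    request_count INTEGER NOT NULL,
    total_cost_usd REAL NOT NULL,
    PRIMARY KEY (day, model, endpoint, user_id)
);
"""

# Recompute each group from the detail table and overwrite on conflict,
# so running the aggregator twice gives the same totals.
# "WHERE true" keeps SQLite from parsing ON CONFLICT as a join clause.
ROLLUP = """
INSERT INTO cost_daily_summary
    (day, model, endpoint, user_id, request_count, total_cost_usd)
SELECT day, model, endpoint, user_id, COUNT(*), SUM(cost_usd)
FROM llm_requests
WHERE true
GROUP BY day, model, endpoint, user_id
ON CONFLICT (day, model, endpoint, user_id) DO UPDATE SET
    request_count = excluded.request_count,
    total_cost_usd = excluded.total_cost_usd
"""


def run_rollup(conn: sqlite3.Connection) -> None:
    with conn:  # commit on success, roll back on error
        conn.execute(ROLLUP)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO llm_requests (day, model, endpoint, user_id, cost_usd) VALUES (?, ?, ?, ?, ?)",
    [
        ("2026-05-01", "gpt-4o", "/chat", "u_42", 0.01),
        ("2026-05-01", "gpt-4o", "/chat", "u_42", 0.02),
        ("2026-05-01", "gpt-4o-mini", "/chat", "u_7", 0.001),
    ],
)
run_rollup(conn)
run_rollup(conn)  # duplicate aggregator run must not double-count
summary = conn.execute(
    "SELECT user_id, request_count, total_cost_usd FROM cost_daily_summary ORDER BY user_id"
).fetchall()
```

Overwriting with `excluded` values rather than adding to the existing row is what makes the duplicate run safe.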
EXPERT-only · cost model

Read the FinOps story, not just the latency one.

Module 03 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 8-tenant reference load (50k req/day · 1.5M req/mo), real OpenAI + AWS list prices, with the dual-tier cache and model-cascade levers wired up. The version you defend to a CFO. Preview the CSV →

Component | Baseline / mo | Optimized / mo | Delta
OpenAI GPT-4o (premium tier): 100% naive · 25% optimized via router · ~$0.006 avg/req | $9,000 | $1,125 | −$7,875
OpenAI GPT-4o-mini (mini tier): 75% optimized via router · ~30× cheaper than premium | - | $281 | -
OpenAI text-embedding-3-small: semantic cache miss embedding · 750k calls/mo @ $0.00002 | - | $15 | -
PostgreSQL (RDS db.t4g.medium): 100GB gp3 · llm_requests + cost_daily_summary + budget_tiers | $90 | $90 | -
ElastiCache Redis (cache.t4g.small + replica): exact cache + semantic embeddings + budget rate-limit fallback | $54 | $54 | -
Observability (Prometheus + OTel self-hosted): compose-only · no Grafana Cloud dependency | $0 | $0 | -
Total · 8 tenants · 1.5M req/mo: $0.0010 per req optimized vs $0.0061 baseline | $9,144 | $1,565 | −$7,579 (−83%)

Optimization levers

Dual-tier caching (exact + semantic)
ADR-001. Hash lookup in front of embedding lookup; 50% combined hit rate at the reference workload removes half the model calls outright.
−$3,937 / mo
Model cascade (mini-first router)
ADR-003. 4-strategy router sends 75% of cache-miss traffic to GPT-4o-mini; only escalates the 25% that need it.
−$3,938 / mo
Prompt optimization (history compression + token trim)
Module 03 optimizer.py. Trims average input from 1500 to ~900 tokens on retrieval-heavy prompts.
−$300 / mo
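
The totals row can be re-derived from the stated assumptions. The sketch below is not the bundled CSV: variable names are illustrative, and the GPT-4o-mini line ($281) is taken as given from the table rather than recomputed.

```python
REQUESTS_PER_MONTH = 1_500_000
PREMIUM_COST_PER_REQ = 0.006       # ~$0.006 avg/req on the premium tier
CACHE_HIT_RATE = 0.50              # combined exact + semantic hit rate
PREMIUM_SHARE_OF_MISSES = 0.25     # router escalates 25% of cache misses
EMBEDDING_COST_PER_CALL = 0.00002
MINI_SPEND = 281                   # taken as-is from the table
POSTGRES = 90
REDIS = 54


def monthly_cost_model() -> dict:
    """Rebuild the baseline and optimized totals from the table's inputs."""
    baseline_model = REQUESTS_PER_MONTH * PREMIUM_COST_PER_REQ
    cache_misses = REQUESTS_PER_MONTH * (1 - CACHE_HIT_RATE)
    premium = cache_misses * PREMIUM_SHARE_OF_MISSES * PREMIUM_COST_PER_REQ
    embeddings = cache_misses * EMBEDDING_COST_PER_CALL
    baseline_total = baseline_model + POSTGRES + REDIS
    optimized_total = premium + MINI_SPEND + embeddings + POSTGRES + REDIS
    saving = baseline_total - optimized_total
    return {
        "baseline_total": round(baseline_total),
        "premium": round(premium),
        "embeddings": round(embeddings),
        "optimized_total": round(optimized_total),
        "saving": round(saving),
        "saving_pct": round(100 * saving / baseline_total),
        "cost_per_req_optimized": round(optimized_total / REQUESTS_PER_MONTH, 4),
    }


model = monthly_cost_model()
```

Changing `CACHE_HIT_RATE` or `PREMIUM_SHARE_OF_MISSES` shows which assumption the savings figure is most sensitive to.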
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your cost-model CSV. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a cost-defense deck.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira R.
Ex-staff · LLM platform · top-3 cloud
Routing strategies, cache tuning, vendor lock-in tradeoffs, model-promotion canaries
Send the diff. I'll go line-by-line through your router and the cache thresholds and pick out the strategies that just memorise the seed dataset.
Daniel K.
Principal · cost engineering · enterprise SaaS
Cost-model defense, FinOps for AI workloads, unit-economics for staff promo packets
Send your CSV. We'll walk it backwards from the totals row to the assumption that breaks first when load doubles.
Anya S.
Eng manager · AI infra · public Series-D
Org design for AI cost teams, hiring rubrics, staff-engineer interview prep, scoping
If you're prepping for staff promo, send your ADR draft and your cost model. We'll work backwards from the rubric.
Format: Async · Turnaround: 7 days · Cohort: 12 members · Scope: ADR + cost model
Request a slot
What your tier unlocks

PRO unlocks Modules 01–02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01–02 (cost-tracked /chat with instrumentation + schema + aggregator) plus the rest of the PRO catalog. EXPERT unlocks Modules 03–05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.

What you get | FREE | PRO | EXPERT
Modules 01–02 of P09: Architecture + cost tracking (~4h) | - | Included | Included
Modules 03–05 of P09: Caching / Routing / Governance (~10h) | - | - | Included
5 committed ADRs + cost-model CSV: Starter kit docs/adr/ + docs/cost-model/ | - | - | Included
PRO project catalog: Production-grade builds | 2 | All current | All current + this one
Curriculum: All 7 tracks | Phase 1 only | All | All + bonus modules
Code review: Senior+ reviewers | - | 4 / month | Unlimited
Cohort-beta architecture review: Async · 7-day turnaround · 12-member cap | - | - | Included
Certificate: Verifiable on LinkedIn | - | Yes | Yes + LinkedIn rec
$79/mo
billed monthly · open enrollment · cancel anytime
or annual
$699/yr save 26%
Unlock EXPERT
Who this is for

Pick this if you defend the bill, not just write the prompt.

Staff / principal AI engineers

You own the inference bill, the cache architecture, and the cost-attribution story your CFO will pull apart at QBR. The 5 ADRs are exactly the artifacts a staff promo panel asks about.

Engineering managers · AI

You need a cost defense for the AI roadmap your VP will ask about before next-quarter headcount. The cost-model CSV is the answer with citations.

Platform / infra leads

You absorb AI without absorbing 6 new vendors. Postgres, Redis, Prometheus, FastAPI — tools you already operate. This is the FinOps playbook for your existing stack.

Founding engineers · AI startups

Your investors will ask about unit economics before they ask about scale. The cost model + the runbook + the budget-engine fail-open is the answer in one repo.

FAQ · EXPERT tier

Quick answers.

Modules 01–02 (architecture + token tracking + Postgres schema + aggregator) are included with PRO at $29/mo — you get a working cost-tracked /chat. Modules 03–05 (dual-tier cache, 4-strategy router, three-tier budget engine, anomaly detector), plus the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review, unlock with EXPERT at $79/mo. PRO gets you cost visibility; EXPERT gets you the system you'd defend in an architecture review.
Pricing parity — the cost math, the routing tiers, and the cost-model CSV all work for either vendor. The live code path uses the OpenAI client because that's the most common case at the reference workload, but the pipeline interface is one method call (see ADR-001 reversal section). Swapping to Anthropic Claude is ~1 engineer-week behind the router; the math in the CSV is identical with the Anthropic columns substituted.
Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training or AI cost-optimization budgets specifically. The cost-model CSV is itself a defensible business artifact for that conversation.
14 hours of focused work across 5 modules. Most learners spread it across 4–6 weeks alongside a day job. Modules 01–02 alone are ~4 hours and ship a working cost-tracked /chat — that's a meaningful PRO deliverable on its own if you want to gauge the project before unlocking EXPERT.
It's a strong forcing function. Staff AI interviews lean heavily on system design (cost, multi-tenancy, observability, FinOps) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the sync→async receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta async review on your final repo and you have a portfolio piece that survives a staff promo packet.
We instrument with Prometheus + OpenTelemetry but stop short of bundling Grafana — keeps the docker-compose lean and lets you wire your own dashboard tool (Grafana, Datadog, New Relic) against the same metrics. The runbook entry for adding Grafana Cloud is ~20 minutes; the ADRs document the metrics surface so the dashboard work is mechanical.
Related projects

Paired with this project

P15 · PAID · ai
AI serving platform — vLLM + Ray Serve under SLA

EXPERT-tier inference build: vLLM continuous batching + PagedAttention, Ray Serve autoscale (market-hours min=2), Redis semantic cache (35% hit), ServingCircuitBreaker, 5 chaos scenarios + runbook, runnable cost-model CSV with break-even-vs-OpenAI math. Module 01 with PRO.

Explore project →
P26 · PAID · platform
Cloud cost optimization

Cut a $300K Snowflake bill 60% — forensics, right-size, compact, govern.

Explore project →

Ready to ship a cost-aware AI platform?

Start with PRO ($29/mo) for Modules 01–02 — architecture + cost tracking. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).
