ai-de.net/Projects/P09 · AI Cost Optimization — CostGuard platform

Last updated 2026-05-22By AI-DE Engineering Team

EXPERT-tier · PRO unlocks Modules 01–02AI & vectors trackP09

Run a
cost-aware
AI platform — that survives a CFO review

Ship a cost-aware LLM platform with per-request token tracking, dual-tier caching, a 4-strategy router, three-tier budget governance, anomaly detection, and 5 committed ADRs. Modules 01–02 unlock with PRO; the optimization stack unlocks with EXPERT.

Timeline

14 hours

Difficulty

Senior+

Stack

FastAPI · OpenAI · Redis · Postgres · asyncpg · Prometheus

See EXPERT benefits

The FinOps + system-design portfolio piece for staff AI infra roles — 5 committed ADRs, a runnable cost-model CSV, a three-tier budget engine with fail-open, and an incident runbook you can defend in an architecture review.

By the end you will have wired

Per-request lifecycle tracer + cost recorder (Postgres llm_requests + cost_daily_summary)
Dual-tier cache: hashed exact-match in front of embedding-based semantic lookup
4-strategy router (keyword / complexity / confidence / fallback) on a cost-latency-quality triangle
Three-tier budget hierarchy (org → team → user) with fail-open enforcement
Anomaly detection (spike + IQR + pattern-deviation) with severity-routed alerts
5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV

PREREQ · SENIOR+Built for engineers running AI in production with a real bill to defend. Comfortable with Python services, Postgres, Redis, and at least one of: vendor LLM APIs, request-tracing, or FinOps. Not a “hello tiktoken” tutorial.

costguard.platform · tenant=acme · · 22 components wired

budget engine armed

Track

Optimize

Control

Persist

/chat requestFastAPI middleware

tracker.pytiktoken token count

trace.py6-stage lifecycle

cost_recorderINSERT llm_requests

Per-request lifecycle tracer

exact_cacheRedis hash · ~3 ms p50

semantic_cacheembedding · cosine ≥ 0.92

optimizer.pytrim + summarise history

router.pyMINI → STD → PREMIUM

Dual-tier cache + 4-strategy router

budget.pyorg → team → user

governance.pyfail-open via Redis

anomaly.pyspike + IQR + pattern

alerts.pyINFO / WARN / CRITICAL

Three-tier budget + anomaly

llm_requestsBIGSERIAL · 11k+ seed

cost_daily_summaryON CONFLICT upsert

budget_tiers+ overage_requests

Rediscache + embeddings + RL

Postgres + Redis + Prometheus scrape

# Live cost tracking on every /chat call

tiktoken counts input + output → pricing.py per-1M math

INSERT into llm_requests with route_decision attribution

Hourly aggregator UPSERTs cost_daily_summary on (date, model, endpoint, user)

→ dashboards refresh in < 500 ms; per-request debug stays auditable

# 5 ADRs + cost-model CSV bundled in the kit

docs/adr/001-dual-tier-caching.md — exact + semantic cache decision

docs/adr/005-sync-sqlalchemy-deprecated.md — the M02→M05 reversal

docs/cost-model/ai-cost-optimization-cost-model.csv — 8 tenants × 1.5M req/mo

→ commit the artifacts a staff promo panel will actually open

60 files

in starter zip · ADRs bundled

−83%

model spend at the reference scenario

smoke tests · mocked LLM mode

Curriculum · 5 modules · 14 hours · 3 phases

Modules 01–02 unlock with PRO. The optimization stack with EXPERT.

Modules 01–02 (~4h) ship a working cost-tracked LLM service with instrumentation, schema, aggregation queries, and a working /chat — included with PRO. Modules 03–05 (~10h) layer the dual-tier cache, the 4-strategy router, the three-tier budget engine, and the anomaly detector on top — and unlock with EXPERT.

P09 · 5 modules · 14 hours · 30+ lessons

Free preview EXPERT required

M01

⊘System Architecture & Request Lifecycle

Map the cost-aware platform end-to-end. Trace a request from FastAPI middleware through token counting, cost recording, caching, routing, and budget enforcement. Build the lifecycle tracer and a cost-flow analyzer that names where each cent goes.

Phase 1: Foundation1.5h6 lessonsPRO TIER

Unlock with PRO →

M02

⊘Token Tracking & Cost Visibility

Build the persistence layer: llm_requests + cost_daily_summary tables, the SimpleTracker wrapper around the OpenAI client, the pricing.py per-1M math, and an aggregator with ON CONFLICT upserts. Working /chat with cost-on-every-response by the end.

Phase 1: Foundation2.5h8 lessonsPRO TIER

Unlock with PRO →

M03

⊘Caching, Prompt Optimization & Cost Reduction

Dual-tier cache: SHA-256 exact-match in front of embedding-based semantic lookup. Prompt optimizer trims history, summarises conversation, and dedups duplicate context. Cost-savings tracker compares before/after on the seed dataset.

Phase 2: Optimization3h7 lessonsEXPERT TIER

Unlock with EXPERT →

M04

⊘Model Routing, Quality & Optimization

Cost-latency-quality triangle (MINI / STANDARD / PREMIUM). 4-strategy router (keyword + complexity + confidence + fallback chain). A/B evaluation harness, latency budget enforcement, exponential-backoff circuit breaker. Step-up-on-failure resilience pattern.

Phase 2: Optimization3.5h7 lessonsEXPERT TIER

Unlock with EXPERT →

M05

⊘Platform Operations & Cost Governance

Three-tier budget hierarchy (org → team → user) with fail-open + Redis rate-limit fallback (asyncpg, 50ms timeout). Anomaly detector (spike + IQR + pattern-deviation) with severity-routed alerts. Cost incident runbook + SLA targets in sla.yaml.

Phase 3: Production3.5h8 lessonsEXPERT TIER

Unlock with EXPERT →

Modules 01–02 with PRO ($29/mo) · Modules 03–05 with EXPERT ($79/mo)

See plans →

Backed by curriculum

Cost Optimization for Data Engineers

10 modules14 hoursFinOps · Cost levers · Unit economics · Budgets

Open curriculum

iThe Cost Optimization curriculum is the foundation for the FinOps mindset this project applies to LLMs — same levers (orphaned compute, query tuning, budget guardrails), different layer of the stack. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Foundation. Optimization. Governance.

Each phase ends with a tagged release, a passing smoke-test run, and a measurable cost delta on the seed workload. No ambiguity about where you are.

01~4h

Foundation (Modules 01–02)

Cost-tracked /chat live locally. Per-request lifecycle tracer, Postgres schema with detail + rollup tables, hourly aggregator.

✓Working /chat with X-Cost headers on every response
✓11k+ seed rows + working cost dashboards via SQL
✓ON-CONFLICT upsert tested against duplicate aggregator runs

02~6.5h

Optimization (Modules 03–04)

Dual-tier cache + 4-strategy router live. End-to-end cost cut on the seed dataset; A/B evaluation harness runs in CI.

✓Exact + semantic cache with shared TTL + threshold runbook
✓Router with route_decision attribution on every llm_requests row
✓A/B harness comparing tier choices on a 50-prompt eval set

03~3.5h

Governance (Module 05)

Three-tier budget with fail-open enforcement, anomaly detector with severity-routed alerts, on-call runbook drilled.

✓asyncpg-backed budget check on the hot path (< 8 ms p95)
✓Anomaly detector wired to Prometheus + alert routing rules
✓Cost incident runbook + sla.yaml targets ready to defend

Project setup · 15 minutes

One command. Local FastAPI + Postgres + Redis + Prometheus.

What lives in the repo

You get the real cost-aware platform on day one — FastAPI as the gateway, PostgreSQL for cost tracking + budget storage, Redis for exact-match cache + semantic embeddings + rate-limit fallback, Prometheus for metrics, and pytest with a mocked-LLM toggle so the smoke tests run end-to-end without an API key.

docker-compose.cost.yml — Postgres 15 + Redis 7 + Prometheus + optional API container
tracker.py + pricing.py + models.py — SimpleTracker around OpenAI; per-1M cost math; SQLAlchemy ORM
src/cache/ — exact + semantic cache + optimizer + savings tracker
src/routing/ — triangle, 4 strategies, fallback chain, eval harness
src/cost/ — budget engine, governance, anomaly detector, alerts, failures
docs/adr/ + docs/cost-model/ — 5 ADRs (one Deprecated) + the runnable cost-model CSV

Download · Starter Kit · 60 files · 1.05 MB

AI Cost Optimization Starter Kit

Pre-built cost-aware platform with seeded Postgres (11k+ rows of llm_requests + daily summaries + budgets + anomalies), Redis, Prometheus, and a 13-test smoke suite that runs against a mocked LLM. Now bundled: 5 committed ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 60 files · ADRs + cost model bundled · last updated 2026-05-09

~/projects/ai-cost-optimization — zsh

1. Unzip and bring up the platform (mocked LLM, no API key needed)

$ unzip ai-cost-optimization-starter.zip

$ cd ai-cost-optimization-starter && cp .env.example .env

$ docker compose -f docker-compose.cost.yml up -d

2. Run the 13 smoke tests against the mocked LLM

$ python -m venv .venv && source .venv/bin/activate

$ pip install -r requirements.txt

$ pytest tests/test_smoke.py -v

3. Send a cost-tracked request and inspect the row

$ curl -X POST http://localhost:8000/chat \

$ -H 'X-User-ID: u_42' -d '{"prompt":"Summarize Q4 earnings"}'

$ psql $DATABASE_URL -c "SELECT model,cost_usd,cached,route_decision FROM llm_requests ORDER BY id DESC LIMIT 1"

4. Read the ADRs and re-run the cost model

$ cat docs/adr/001-dual-tier-caching.md

$ open docs/cost-model/ai-cost-optimization-cost-model.csv

11,311

seed rows total

10k

llm_requests · 50 users · 5 models

100

cost_daily_summary rows

200 + 5

anomalies + overage requests

Production hardening

The same LLM client — but built for the bill-defending case.

Most cost tutorials show you a token counter wrapped around a single OpenAI call. This shows what changes when 8 tenants share infra, finance owns the org cap, and SRE owns the on-call when the budget DB blips.

Notebook trackerWhat most teams ship

Token counter

Print statement after each call

Cache

None — every call hits the model

Routing

One model for everything

Budget

Manual monthly invoice review

Anomaly

Notice next month when finance asks

Persistence

Sync everywhere; threadpool eats blocking I/O

Your cost-aware platformModules 03–05

✓

Token counter

tracker.py writes llm_requests with route attribution; hourly ON CONFLICT rollup

✓

Cache

Dual-tier — SHA-256 exact in front of cosine ≥ 0.92 semantic; ~50% hit at reference workload (ADR-001)

✓

Routing

4-strategy chain (keyword/complexity/confidence/fallback) on a cost-latency-quality triangle.py (ADR-003)

✓

Budget

Org → team → user cap with asyncpg 50 ms timeout + Redis fail-open (ADR-002)

✓

Anomaly

Spike + IQR + pattern-deviation in anomaly.py; severity-routed alerts to Prometheus + on-call

✓

Persistence

Sync SQLAlchemy for ORM/aggregator; asyncpg.Pool for hot-path budget reads (ADR-005 Deprecated)

EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the sync-SQLAlchemy → asyncpg reversal that real production load forced. Preview ADR-001 →

ADR-001Accepted

Dual-tier caching: exact-match in front of semantic

Context

FAQ-shape repeats want O(1) lookup; paraphrases want embedding similarity. Either alone leaves money on the table.

Decision

SHA-256(prompt) Redis hash → fall through to cosine ≥ 0.92 on the embedding store

Tradeoff

Two TTL knobs + 1 embedding call per cache miss vs ~50% combined hit rate at the reference workload

Reversal

Drop semantic tier when marginal hit rate < 15%; ~2 engineer-days to disable

ADR-002Accepted

Three-tier budget hierarchy with fail-open enforcement

Context

Finance, team leads, and engineers each need a knob; Postgres outage cannot freeze the LLM platform

Decision

org → team → user caps; asyncpg 50ms timeout falls open to Redis 30 req/min rate limiter

Tradeoff

Bounded overage during DB outage (~$0.75 / 5 min) vs false-rejecting paying users

Reversal

Hard-reject mode is a single .env flag; SLA drops to 99.5%

ADR-003Accepted

Cost-latency-quality routing triangle with fallback chain

Context

30× price gap MINI vs PREMIUM; static keyword routing only catches ~30%; classify-first adds 200ms p95 to every call

Decision

4-strategy chain — keyword → complexity → confidence → fallback_chain; step up on failure, never down

Tradeoff

Confidence router runs MINI on ~15% of requests (added cost) for ~5× ROI on routing decision

Reversal

Disable confidence router via .env flag; re-tune complexity threshold; ~3 engineer-days

ADR-004Accepted

Per-request detail + daily rollup with ON CONFLICT upsert

Context

One table can't answer both 'why did req_a31f cost $0.42' and 'sum cost by team / month' under 500ms

Decision

llm_requests (detail) + cost_daily_summary (rollup); hourly ON CONFLICT (date,model,endpoint,user_id) upsert

Tradeoff

~12% storage overhead + lockstep schema migrations vs 35ms vs 12s dashboard latency

Reversal

Deprecate rollup when detail-table latency clears the budget; freeze 90 days then drop

EXPERT-only · cost model

Read the FinOps story, not just the latency one.

Module 03 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 8-tenant reference load (50k req/day · 1.5M req/mo), real OpenAI + AWS list prices, with the dual-tier cache and model-cascade levers wired up. The version you defend to a CFO. Preview the CSV →

ComponentBaseline / moOptimized / moDelta

OpenAI GPT-4o (premium tier)

100% naive · 25% optimized via router · ~$0.006 avg/req

$9,000

$1,125

−$7,875

OpenAI GPT-4o-mini (mini tier)

75% optimized via router · ~30× cheaper than premium

—

$281

—

OpenAI text-embedding-3-small

semantic cache miss embedding · 750k calls/mo @ $0.00002

—

$15

—

PostgreSQL (RDS db.t4g.medium)

100GB gp3 · llm_requests + cost_daily_summary + budget_tiers

$90

—

ElastiCache Redis (cache.t4g.small + replica)

exact cache + semantic embeddings + budget rate-limit fallback

$54

—

Observability (Prometheus + OTel self-hosted)

compose-only · no Grafana Cloud dependency

—

Total · 8 tenants · 1.5M req/mo

$0.0010 per req optimized vs $0.0061 baseline

$9,144

$1,565

−$7,579 (−83%)

Optimization levers

Dual-tier caching (exact + semantic)

ADR-001. Hash lookup in front of embedding lookup; 50% combined hit rate at the reference workload removes half the model calls outright.

−$3,937 / mo

Model cascade (mini-first router)

ADR-003. 4-strategy router sends 75% of cache-miss traffic to GPT-4o-mini; only escalates the 25% that need it.

−$3,938 / mo

Prompt optimization (history compression + token trim)

Module 03 optimizer.py. Trims average input from 1500 to ~900 tokens on retrieval-heavy prompts.

−$300 / mo

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your cost-model CSV. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a cost-defense deck.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira R.

Ex-staff · LLM platform · top-3 cloud

Routing strategies, cache tuning, vendor lock-in tradeoffs, model-promotion canaries

“Send the diff. I'll go line-by-line through your router and the cache thresholds and pick out the strategies that just memorise the seed dataset.”

Daniel K.

Principal · cost engineering · enterprise SaaS

Cost-model defense, FinOps for AI workloads, unit-economics for staff promo packets

“Send your CSV. We'll walk it backwards from the totals row to the assumption that breaks first when load doubles.”

Anya S.

Eng manager · AI infra · public Series-D

Org design for AI cost teams, hiring rubrics, staff-engineer interview prep, scoping

“If you're prepping for staff promo, send your ADR draft and your cost model. We'll work backwards from the rubric.”

Format

Async

Turnaround

7 days

Cohort

12 members

Scope

ADR + cost model

Request a slot →

What your tier unlocks

PRO unlocks Modules 01–02. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01–02 (cost-tracked /chat with instrumentation + schema + aggregator) plus the rest of the PRO catalog. EXPERT unlocks Modules 03–05 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async architecture review.

What you getFREEPROEXPERT

Modules 01–02 of P09

Architecture + cost tracking (~4h)

—

Included

Modules 03–05 of P09

Caching / Routing / Governance (~10h)

—

Included

5 committed ADRs + cost-model CSV

Starter kit docs/adr/ + docs/cost-model/

—

Included

PRO project catalog

Production-grade builds

All current

All current + this one

Curriculum

All 7 tracks

Phase 1 only

All

All + bonus modules

Code review

Senior+ reviewers

—

4 / month

Unlimited

Cohort-beta architecture review

Async · 7-day turnaround · 12-member cap

—

Included

Certificate

Verifiable on LinkedIn

—

Yes

Yes + LinkedIn rec

$79/mo

billed monthly · open enrollment · cancel anytime

or annual

$699/yr save 26%

Unlock EXPERT →

Who this is for

Pick this if you defend the bill, not just write the prompt.

Staff / principal AI engineers

You own the inference bill, the cache architecture, and the cost-attribution story your CFO will pull apart at QBR. The 5 ADRs are exactly the artifacts a staff promo panel asks about.

Engineering managers · AI

You need a cost defense for the AI roadmap your VP will ask about before next-quarter headcount. The cost-model CSV is the answer with citations.

Platform / infra leads

You absorb AI without absorbing 6 new vendors. Postgres, Redis, Prometheus, FastAPI — tools you already operate. This is the FinOps playbook for your existing stack.

Founding engineers · AI startups

Your investors will ask about unit economics before they ask about scale. The cost model + the runbook + the budget-engine fail-open is the answer in one repo.

Related curriculum

Going deeper? Four tracks back this project.

The Cost Optimization curriculum is the FinOps foundation. These four tracks let you go deeper on the parts that matter most for your role.

FAQ · EXPERT tier

Quick answers.

How is this different from PRO?+

Modules 01–02 (architecture + token tracking + Postgres schema + aggregator) are included with PRO at $29/mo — you get a working cost-tracked /chat. Modules 03–05 (dual-tier cache, 4-strategy router, three-tier budget engine, anomaly detector), plus the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review, unlock with EXPERT at $79/mo. PRO gets you cost visibility; EXPERT gets you the system you'd defend in an architecture review.

The code uses OpenAI but pricing.py also lists Anthropic. Is this Anthropic-friendly?+

Pricing parity — the cost math, the routing tiers, and the cost-model CSV all work for either vendor. The live code path uses the OpenAI client because that's the most common case at the reference workload, but the pipeline interface is one method call (see ADR-001 reversal section). Swapping to Anthropic Claude is ~1 engineer-week behind the router; the math in the CSV is identical with the Anthropic columns substituted.

Can my company expense it?+

Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training or AI cost-optimization budgets specifically. The cost-model CSV is itself a defensible business artifact for that conversation.

How long until I can finish this project?+

14 hours of focused work across 5 modules. Most learners spread it across 4–6 weeks alongside a day job. Modules 01–02 alone are ~4 hours and ship a working cost-tracked /chat — that's a meaningful PRO deliverable on its own if you want to gauge the project before unlocking EXPERT.

Is this enough to interview for staff AI infra roles?+

It's a strong forcing function. Staff AI interviews lean heavily on system design (cost, multi-tenancy, observability, FinOps) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the sync→async receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta async review on your final repo and you have a portfolio piece that survives a staff promo packet.

Why no Grafana dashboards in the kit?+

We instrument with Prometheus + OpenTelemetry but stop short of bundling Grafana — keeps the docker-compose lean and lets you wire your own dashboard tool (Grafana, Datadog, New Relic) against the same metrics. The runbook entry for adding Grafana Cloud is ~20 minutes; the ADRs document the metrics surface so the dashboard work is mechanical.

Related projects

Paired with this project

P15·PAID·ai

AI serving platform — vLLM + Ray Serve under SLA

EXPERT-tier inference build: vLLM continuous batching + PagedAttention, Ray Serve autoscale (market-hours min=2), Redis semantic cache (35% hit), ServingCircuitBreaker, 5 chaos scenarios + runbook, runnable cost-model CSV with break-even-vs-OpenAI math. Module 01 with PRO.

Explore project →

P26·PAID·platform

Cloud cost optimization

Cut a $300K Snowflake bill 60% — forensics, right-size, compact, govern.

Explore project →

Ready to ship a cost-aware AI platform?

Start with PRO ($29/mo) for Modules 01–02 — architecture + cost tracking. Or unlock the full 5-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

See EXPERT benefits

P09 · AI Cost Optimization · EXPERT · PRO unlocks M01–M02Unlock EXPERT →

Run acost-awareAI platform — that survives a CFO review

Modules 01–02 unlock with PRO. The optimization stack with EXPERT.

Foundation. Optimization. Governance.

One command. Local FastAPI + Postgres + Redis + Prometheus.

What lives in the repo

AI Cost Optimization Starter Kit

The same LLM client — but built for the bill-defending case.

Write the ADRs staff engineers actually get judged on.

Dual-tier caching: exact-match in front of semantic

Three-tier budget hierarchy with fail-open enforcement

Cost-latency-quality routing triangle with fallback chain

Per-request detail + daily rollup with ON CONFLICT upsert

Read the FinOps story, not just the latency one.

Optimization levers

Async architecture review with a staff-level reviewer (cohort beta).

Bring a diff, an ADR draft, or a cost-defense deck.

PRO unlocks Modules 01–02. EXPERT unlocks the full platform.

Pick this if you defend the bill, not just write the prompt.

Staff / principal AI engineers

Engineering managers · AI

Platform / infra leads

Founding engineers · AI startups

Going deeper? Four tracks back this project.

API & External System Integration

Data Observability & Quality

AI Inference & Serving Systems

LLM Evaluation

Quick answers.

Paired with this project

Ready to ship a cost-aware AI platform?

Run a
cost-aware
AI platform — that survives a CFO review