Context
By Module 04 the platform tracks every token and routes every prompt — but nothing prevents a runaway team or a runaway model from burning the monthly budget in a weekend. Three forces drive the budget design:
- Multi-level accountability. Finance owns the org-wide cap. Team leads own per-team allocations (each team is a P&L line in the company plan). Engineers own per-user safety nets (so one bad agent loop doesn't take down a team). Any single-level system makes one of these stakeholders unhappy.
- Latency-critical hot path. Budget checks sit on every /chat call (Module 02 instrumented latency_ms end-to-end: p95 ≈ 380 ms). A budget-check pattern that blocks for 30 ms is an 8% latency hit; one that blocks for 200 ms is a regression we would revert before shipping.
- Database is a shared dependency. The same Postgres instance handles llm_requests writes, cost_daily_summary upserts (ADR-004), and budget reads. A Postgres outage must not freeze the LLM platform; that is an availability target written into sla.yaml.
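The naive design (single global cap, hard-reject on breach) fails the first two forces, multi-level accountability and the latency-critical hot path. The "freeze on Postgres outage" pattern fails the third, the shared database dependency. We needed a design that handles all three.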
Decision
We adopt a three-tier hierarchical budget with fail-open enforcement.
Hierarchy: Org cap (top) → Team cap (mid) → User cap (bottom). Each level
has its own budget_tiers row keyed by (tier_id, period_start) with a
UNIQUE constraint. Spending is recorded at request time and rolled up daily
via the same ON CONFLICT pattern as cost_daily_summary (ADR-004).
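A minimal sketch of the budget_tiers table and a nightly rollup upsert, using Python's built-in sqlite3 as a stand-in for Postgres: the INSERT ... ON CONFLICT (...) DO UPDATE SET col = excluded.col form is shared by both (SQLite 3.24+). The column names cap_usd and spent_usd, the tier_id format, and the rollup helper are hypothetical; the ADR only fixes the UNIQUE (tier_id, period_start) key and the tier_kind column, and this is one possible reading of the ADR-004 pattern, not its actual SQL.

```python
import sqlite3

# Hypothetical columns: the ADR fixes only the UNIQUE (tier_id, period_start)
# key and the tier_kind column; cap_usd and spent_usd are assumed names.
SCHEMA = """
CREATE TABLE IF NOT EXISTS budget_tiers (
    tier_id      TEXT NOT NULL,
    tier_kind    TEXT NOT NULL CHECK (tier_kind IN ('org', 'team', 'user')),
    period_start TEXT NOT NULL,
    cap_usd      REAL NOT NULL,
    spent_usd    REAL NOT NULL DEFAULT 0,
    UNIQUE (tier_id, period_start)
)
"""

# Insert the period row if it is missing; otherwise overwrite only the
# period-to-date spend. cap_usd is deliberately absent from DO UPDATE.
ROLLUP_SQL = """
INSERT INTO budget_tiers (tier_id, tier_kind, period_start, cap_usd, spent_usd)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (tier_id, period_start) DO UPDATE SET spent_usd = excluded.spent_usd
"""


def rollup(conn, tier_id, tier_kind, period_start, default_cap_usd, spent_to_date_usd):
    """Hypothetical nightly-rollup step for one tier and one budget period."""
    with conn:  # commit on success, roll back on error
        conn.execute(ROLLUP_SQL, (tier_id, tier_kind, period_start,
                                  default_cap_usd, spent_to_date_usd))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
rollup(conn, "team:search", "team", "2024-06-01", 500.0, 120.0)
# A team lead raises the cap between rollups.
with conn:
    conn.execute("UPDATE budget_tiers SET cap_usd = 800.0 WHERE tier_id = 'team:search'")
rollup(conn, "team:search", "team", "2024-06-01", 500.0, 135.5)
print(conn.execute("SELECT cap_usd, spent_usd FROM budget_tiers").fetchall())
# [(800.0, 135.5)]
```

Writing the recomputed period-to-date total, rather than adding a daily delta, makes a re-run of the nightly job harmless. Leaving cap_usd out of the DO UPDATE clause means the rollup never overwrites a cap set by finance or a team lead.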
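```python
# src/cost/budget.py
# asyncpg_conn (connection or pool), _BUDGET_SQL and BudgetVerdict are
# module-level names defined elsewhere; "..." marks the elided query args.
async def check_budget(org_id, team_id, user_id, est_cost_usd) -> BudgetVerdict:
    # Three reads, one round-trip via UNNEST; _BUDGET_SQL must return one row
    # per tier, ordered org, team, user. fetch() returns every row (fetchrow()
    # would return only the first).
    org, team, user = await asyncpg_conn.fetch(_BUDGET_SQL, ...)
    # asyncpg Records support item access by column name, not attributes.
    if user["spent"] + est_cost_usd > user["cap"]:
        return BudgetVerdict.OVER_USER
    if team["spent"] + est_cost_usd > team["cap"]:
        return BudgetVerdict.OVER_TEAM
    if org["spent"] + est_cost_usd > org["cap"]:
        return BudgetVerdict.OVER_ORG
    return BudgetVerdict.OK
```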
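Fail-open: if Postgres is unreachable, or the budget check overruns its
50 ms timeout, enforcement falls back to a Redis-backed sliding-window rate
limiter (30 req/min/user). Requests within the limit return OK, and each
one is recorded for overage_requests so it can be reconciled post hoc once
Postgres is reachable again. A request is hard-rejected only when Postgres
is unavailable and the limiter refuses it: the user has exhausted the
30 req/min window, or Redis is unreachable as well.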
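The limiter's behaviour can be sketched in memory. The class below is a hypothetical sliding-window-log limiter (SlidingWindowLimiter and its parameters are illustrative names) defaulting to the 30 req/min/user figure. It keeps timestamps in a Python deque where the production limiter keeps them in Redis, and it takes an injectable clock so the window can be tested without sleeping; it is a sketch, not the project's Redis implementation.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests per user in any
    trailing `window_s` seconds. In-memory stand-in for the Redis limiter."""

    def __init__(self, limit: int = 30, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Evict timestamps that have slid out of the trailing window.
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # denied requests are not logged, so they do not extend the block
        hits.append(now)
        return True


# Usage with a fake clock: 30 requests pass, the 31st is refused,
# and the window frees up 60 s after the first request.
fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])
print(sum(limiter.allow("u42") for _ in range(31)))  # 30
fake_now[0] = 60.0
print(limiter.allow("u42"))  # True
```

In Redis, the same log is commonly kept as one sorted set per user: ZREMRANGEBYSCORE trims expired entries, ZCARD counts the rest, and ZADD records the new request, typically run as a single Lua script so the trim, count, and add happen atomically and two concurrent requests cannot both take the last slot.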
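```python
# src/cost/governance.py (enforce_budget is an illustrative wrapper name)
import asyncio
import asyncpg

async def enforce_budget(org_id, team_id, user_id, est_cost_usd) -> BudgetVerdict:
    try:
        return await asyncio.wait_for(
            check_budget(org_id, team_id, user_id, est_cost_usd), timeout=0.05)
    # Refused or dropped connections raise OSError/InterfaceError; a stall times out.
    except (asyncpg.PostgresError, asyncpg.InterfaceError, OSError, asyncio.TimeoutError):
        # Fail open within the per-user rate limit; record_overage is assumed
        # to buffer the row, since Postgres is down.
        if rate_limiter.allow(user_id):
            await record_overage(user_id, est_cost_usd, reason="db_unreachable")
            return BudgetVerdict.OK
        return BudgetVerdict.HARD_REJECT
```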
The Postgres timeout is 50 ms, chosen as roughly 1/8 of the ~380 ms p95 latency (380 / 8 ≈ 47.5 ms), so even a slow-Postgres day adds < 6 ms to the median path.
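The timeout-driven fallback can be exercised without Postgres or Redis. The sketch below is a self-contained stand-in for the governance.py logic: guarded_check, slow_db and down_db are hypothetical names, the limiter and overage log are injected as a callable and a list, and a plain OSError stands in for the asyncpg errors. One fake database stalls for 200 ms, past the 50 ms ceiling; the other refuses the connection.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable, List, Tuple

Overage = Tuple[str, float, str]


class BudgetVerdict(Enum):
    OK = "ok"
    OVER_USER = "over_user_cap"
    HARD_REJECT = "hard_reject"


async def guarded_check(check: Callable[[], Awaitable[BudgetVerdict]],
                        limiter_allow: Callable[[str], bool],
                        overages: List[Overage],
                        user_id: str, est_cost_usd: float,
                        timeout_s: float = 0.05) -> BudgetVerdict:
    try:
        # wait_for cancels a check that overruns timeout_s and raises TimeoutError.
        return await asyncio.wait_for(check(), timeout=timeout_s)
    except (OSError, asyncio.TimeoutError):
        if limiter_allow(user_id):
            overages.append((user_id, est_cost_usd, "db_unreachable"))
            return BudgetVerdict.OK
        return BudgetVerdict.HARD_REJECT


async def slow_db() -> BudgetVerdict:
    await asyncio.sleep(0.2)  # a 200 ms stall, well past the 50 ms ceiling
    return BudgetVerdict.OVER_USER


async def down_db() -> BudgetVerdict:
    raise ConnectionRefusedError("postgres unreachable")  # an OSError subclass


async def main() -> Tuple[BudgetVerdict, BudgetVerdict, List[Overage]]:
    overages: List[Overage] = []
    stalled = await guarded_check(slow_db, lambda u: True, overages, "u42", 0.004)
    refused = await guarded_check(down_db, lambda u: False, overages, "u42", 0.004)
    return stalled, refused, overages


print(asyncio.run(main()))
```

The first call exposes the trade-off this ADR accepts: the stalled check would have answered OVER_USER, but because it missed the 50 ms ceiling the request is admitted under the rate limit and logged as an overage instead.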
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Fairness model | Single global cap | 3-tier with org/team/user caps |
| Outage behaviour | Hard-reject if budget DB down | Fail-open with rate limit + overage log |
| Reconciliation | Synchronous strict | Eventual via overage_requests |
| Hot-path block | Per-call DB write | Per-call DB read + nightly rollup write |
| Data model | One table per scope | One budget_tiers table with tier_kind column |
We accept the overage risk because the worst case is bounded: 30 req/min × 5 minutes of Postgres downtime × ~$0.005/req ≈ $0.75 per user saturating the per-user limiter. That is materially less than the cost of false-rejecting paying users during a Postgres blip.
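Because the rate limiter is per user, the fleet-wide exposure multiplies the single-user figure by the number of users saturating the limiter and by the outage length. The hypothetical helper below makes that scaling explicit with the ADR's 30 req/min and ~$0.005/req figures; the 1,000-user case is illustrative, not a measured population.

```python
import math


def worst_case_overage_usd(active_users: int, outage_min: float,
                           limit_per_min: int = 30,
                           usd_per_req: float = 0.005) -> float:
    """Upper bound on fail-open spend: every counted user saturates the
    per-user rate limiter for the whole outage."""
    return active_users * outage_min * limit_per_min * usd_per_req


print(round(worst_case_overage_usd(1, 5), 2))      # one user, 5-minute outage
print(round(worst_case_overage_usd(1_000, 5), 2))  # 1,000 users (illustrative)
# Saturating users needed to reach $10k of overage in a 5-minute outage:
print(math.ceil(10_000 / worst_case_overage_usd(1, 5)))
```

At these rates, reaching the $10k threshold named in the reversal plan within a 5-minute outage would take 13,334 users all saturating the limiter; a longer outage lowers that number proportionally (a 50-minute outage needs roughly a tenth as many).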
Consequences (positive)
- Latency intact. The p95 budget-check latency is 8 ms (one async query, a single round-trip via Postgres UNNEST). Hot-path impact is about 2% of the ~380 ms p95.
- Three stakeholders, three knobs. Finance updates org caps in governance.py; team leads update team caps via the admin UI; users see their per-user spent/cap on every response in X-Cost-Headers.
- Postgres outage is recoverable, not catastrophic. The fail-open path is exercised in tests/test_smoke.py::test_budget_fails_open_when_db_down; the reconciliation flow has a runbook entry.
- Reconciliation is auditable. Every fail-open decision lands in overage_requests with (user_id, amount_usd, reason, requested_at, approval_status), so finance can review it weekly.
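Because record_overage runs exactly when Postgres is unreachable, its rows have to be held somewhere until the database returns. The sketch below is one hypothetical way to do that: OverageRequest mirrors the overage_requests columns listed above (the field types and the "pending" initial approval_status are assumptions), and OverageBuffer keeps rows in process memory as a stand-in for a durable store, flushing them through an injected writer once Postgres is back.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class OverageRequest:
    # Mirrors the overage_requests columns; the types are assumptions.
    user_id: str
    amount_usd: float
    reason: str
    requested_at: dt.datetime
    approval_status: str = "pending"  # hypothetical initial status


class OverageBuffer:
    """Holds fail-open decisions while Postgres is down and hands them to a
    writer once it is reachable. In-memory only: a hypothetical stand-in."""

    def __init__(self) -> None:
        self._pending: List[OverageRequest] = []

    def record(self, user_id: str, amount_usd: float, reason: str) -> None:
        self._pending.append(OverageRequest(
            user_id, amount_usd, reason, dt.datetime.now(dt.timezone.utc)))

    def flush(self, write_rows: Callable[[List[OverageRequest]], None]) -> int:
        """Write all buffered rows; if write_rows raises, keep them for the next try."""
        batch = list(self._pending)
        if batch:
            write_rows(batch)  # e.g. a batched INSERT into overage_requests
            self._pending.clear()
        return len(batch)


buf = OverageBuffer()
buf.record("u42", 0.004, "db_unreachable")
written: List[OverageRequest] = []
print(buf.flush(written.extend), written[0].approval_status)  # 1 pending
```

An in-process list loses rows if the worker restarts mid-outage, so a production version would need a durable buffer; keeping rows when a flush fails is the part that carries over.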
Consequences (negative)
- Three ways to overspend. A team can hit its cap while the org cap is fine; a user can hit their cap while the team cap is fine. Surface the reason on the 429 response (X-Budget-Reason: over_team_cap) so debugging is obvious; the alternative, a single opaque "budget exceeded", is worse.
- Eventual consistency on reconciliation. Overage rows can sit unreconciled for up to 24 h before the nightly job picks them up. We accept the lag because the absolute amounts are small and the alternative is freezing the platform.
- More moving parts in incident response. The runbook entry "Postgres unreachable" now has a sub-step: "Check the overage_requests table for the fail-open volume; if > $X/hour, switch enforcement to hard-reject via BUDGET_FAIL_OPEN=false."
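Surfacing the tier on the 429 response can be as small as a mapping from verdict to status and header. In this sketch the BudgetVerdict values are assumptions except over_team_cap, which the ADR quotes; rejection is a hypothetical helper that returns None when the request may proceed, and sending HARD_REJECT as a 429 with X-Budget-Reason: hard_reject is likewise an assumption.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class BudgetVerdict(Enum):
    OK = "ok"
    OVER_USER = "over_user_cap"   # assumed value
    OVER_TEAM = "over_team_cap"   # value quoted in the ADR
    OVER_ORG = "over_org_cap"     # assumed value
    HARD_REJECT = "hard_reject"   # assumed value


def rejection(verdict: BudgetVerdict) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) for a rejected request, or None if it may proceed."""
    if verdict is BudgetVerdict.OK:
        return None
    # Name the tier that tripped instead of an opaque "budget exceeded".
    return 429, {"X-Budget-Reason": verdict.value}


print(rejection(BudgetVerdict.OVER_TEAM))  # (429, {'X-Budget-Reason': 'over_team_cap'})
```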
Reversal plan
If finance loses tolerance for fail-open behaviour (e.g. a single bad outage costs > $10k overage), the reversal is:
- Set BUDGET_FAIL_OPEN=false in .env. src/cost/governance.py then switches the except branch to BudgetVerdict.HARD_REJECT instead of the rate-limit fallback.
- Document the SLA change in sla.yaml: budget enforcement availability becomes bound to Postgres availability (dropping from 99.9% to 99.5%).
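The reversal hinges on one flag. A minimal sketch, assuming the .env value reaches the process environment and that an unset BUDGET_FAIL_OPEN keeps today's fail-open behaviour; fail_open_enabled and db_unreachable_verdict are hypothetical names, and overage recording is left out.

```python
import os
from enum import Enum
from typing import Callable, Mapping, Optional


class BudgetVerdict(Enum):
    OK = "ok"
    HARD_REJECT = "hard_reject"


_FALSE_VALUES = {"false", "0", "no", "off"}


def fail_open_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read BUDGET_FAIL_OPEN; unset or any non-false value keeps fail-open on."""
    env = os.environ if env is None else env
    return env.get("BUDGET_FAIL_OPEN", "true").strip().lower() not in _FALSE_VALUES


def db_unreachable_verdict(fail_open: bool,
                           limiter_allows: Callable[[], bool]) -> BudgetVerdict:
    """Verdict for the except branch when the budget check cannot reach Postgres."""
    if not fail_open:
        # Reverted mode: enforcement now follows Postgres availability.
        return BudgetVerdict.HARD_REJECT
    return BudgetVerdict.OK if limiter_allows() else BudgetVerdict.HARD_REJECT


print(db_unreachable_verdict(fail_open_enabled({"BUDGET_FAIL_OPEN": "false"}), lambda: True))
```

Passing the limiter as a zero-argument callable means it is not consulted at all once fail-open is disabled, so a reverted deployment does not consume rate-limit slots for requests it is going to reject anyway.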
Estimated effort: 1 engineer-day, plus a coordinated SLA update with the on-call team.
References
- migrations/005_budget_tables.sql: budget_tiers, overage_requests, cost_anomalies schemas
- src/cost/budget.py: three-tier budget check
- src/cost/governance.py: fail-open enforcement + reconciliation
- src/cost/failures.py: recovery strategies + retry/backoff
- runbook/cost-incident-response.py: "Postgres unreachable" playbook
- sla.yaml: budget engine availability target (99.9%)
- ADR-004 (per-request detail + daily rollup pattern this design rolls up into)