ADR-002: Three-tier budget hierarchy with fail-open enforcement

Context

By Module 04 the platform tracks every token and routes every prompt — but nothing prevents a runaway team or a runaway model from burning the monthly budget in a weekend. Three forces drive the budget design:

1. Multi-level accountability. Finance owns the org-wide cap. Team leads own per-team allocations (each team is a P&L line in the company plan). Engineers own per-user safety nets (so one bad agent loop doesn't take down a team). Any single-level system makes one of these stakeholders unhappy.
2. Latency-critical hot path. Budget checks sit on every /chat call (Module 02 instrumented latency_ms end-to-end at p95 ~ 380 ms). A budget-check pattern that blocks for 30 ms is an 8% latency hit (30 / 380); one that blocks for 200 ms is a regression we'd revert before shipping.
3. Database is a shared dependency. The same Postgres instance handles llm_requests writes, cost_daily_summary upserts (ADR-004), and budget reads. A Postgres outage cannot freeze the LLM platform; that's an availability target written into sla.yaml.

The naive design (single global cap, hard-reject on breach) fails (1) and (2). The "freeze on Postgres outage" pattern fails (3). We needed a design that handles all three.

Decision

We adopt a three-tier hierarchical budget with fail-open enforcement.

Hierarchy: Org cap (top) → Team cap (mid) → User cap (bottom). Each level has its own budget_tiers row keyed by (tier_id, period_start) with a UNIQUE constraint. Spending is recorded at request time and rolled up daily via the same ON CONFLICT pattern as cost_daily_summary (ADR-004).
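The keyed row and the ON CONFLICT rollup can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 so it runs without a Postgres instance; it is not the contents of migrations/005_budget_tables.sql. The columns cap_usd and spent_usd and the helpers record_spend and spent_for are assumptions; tier_kind is taken from the tradeoffs table.

```python
import sqlite3

# Hypothetical schema: only tier_id, period_start and tier_kind appear in the
# ADR; cap_usd and spent_usd are illustrative assumptions.
SCHEMA = """
CREATE TABLE budget_tiers (
    tier_id      TEXT NOT NULL,
    tier_kind    TEXT NOT NULL CHECK (tier_kind IN ('org', 'team', 'user')),
    period_start TEXT NOT NULL,
    cap_usd      REAL NOT NULL,
    spent_usd    REAL NOT NULL DEFAULT 0,
    UNIQUE (tier_id, period_start)
)
"""

# Insert the row for a new period, or add to the running total if the
# (tier_id, period_start) key already exists.
UPSERT = """
INSERT INTO budget_tiers (tier_id, tier_kind, period_start, cap_usd, spent_usd)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (tier_id, period_start)
DO UPDATE SET spent_usd = spent_usd + excluded.spent_usd
"""


def record_spend(conn, tier_id, tier_kind, period_start, cap_usd, amount_usd):
    """Roll one amount into the tier's row for the given period."""
    conn.execute(UPSERT, (tier_id, tier_kind, period_start, cap_usd, amount_usd))


def spent_for(conn, tier_id, period_start):
    """Return the accumulated spend for one tier and period."""
    row = conn.execute(
        "SELECT spent_usd FROM budget_tiers WHERE tier_id = ? AND period_start = ?",
        (tier_id, period_start),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_spend(conn, "team-a", "team", "2024-06-01", 500.0, 1.25)
record_spend(conn, "team-a", "team", "2024-06-01", 500.0, 0.75)
```

The second call hits the UNIQUE constraint and takes the DO UPDATE branch, so the table still holds a single row for team-a in that period.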

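# src/cost/budget.py
async def check_budget(org_id, team_id, user_id, est_cost_usd) -> BudgetVerdict:
    # Three reads, async, single round-trip via UNNEST.
    # fetch() returns all three rows (fetchrow() would return only the first);
    # _BUDGET_SQL is expected to yield them in org, team, user order.
    org, team, user = await asyncpg_conn.fetch(_BUDGET_SQL, ...)

    # asyncpg Records are read by key, not by attribute.
    # Narrowest scope first, so the verdict names the most specific cap hit.
    if user["spent"] + est_cost_usd > user["cap"]:   return BudgetVerdict.OVER_USER
    if team["spent"] + est_cost_usd > team["cap"]:   return BudgetVerdict.OVER_TEAM
    if org["spent"]  + est_cost_usd > org["cap"]:    return BudgetVerdict.OVER_ORG
    return BudgetVerdict.OK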
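The ordering of the three comparisons is easier to test when it is separated from the database read. The sketch below isolates that decision as a pure function. TierState and decide are hypothetical names, and the enum values other than over_team_cap (which appears in the X-Budget-Reason example) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetVerdict(Enum):
    OK = "ok"
    OVER_USER = "over_user_cap"   # assumed value
    OVER_TEAM = "over_team_cap"
    OVER_ORG = "over_org_cap"     # assumed value


@dataclass(frozen=True)
class TierState:
    """Spend so far and the cap for one tier in the current period."""
    spent: float
    cap: float

    def would_exceed(self, est_cost_usd: float) -> bool:
        # A request that lands exactly on the cap is still allowed.
        return self.spent + est_cost_usd > self.cap


def decide(org: TierState, team: TierState, user: TierState,
           est_cost_usd: float) -> BudgetVerdict:
    """Check the narrowest scope first and report the first cap breached."""
    if user.would_exceed(est_cost_usd):
        return BudgetVerdict.OVER_USER
    if team.would_exceed(est_cost_usd):
        return BudgetVerdict.OVER_TEAM
    if org.would_exceed(est_cost_usd):
        return BudgetVerdict.OVER_ORG
    return BudgetVerdict.OK
```

Checking the user tier first means a request that breaches several caps at once is reported against the most specific one.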

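Fail-open: if Postgres is unreachable or the check exceeds its timeout, the request is not rejected outright. It falls back to a Redis-backed sliding-window rate limiter (30 req/min/user); requests within the limit are allowed, and the overage is recorded in overage_requests for post-hoc reconciliation when Postgres is reachable again. Hard-reject triggers only when that fallback also refuses the request: the user has exceeded the rate limit or, assuming the limiter treats a Redis failure as a denial, Redis is unreachable too.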

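# src/cost/governance.py
import asyncio
import asyncpg

# Fragment from inside the async request handler.
try:
    # 50 ms ceiling on the budget check; on success the verdict is used as-is.
    verdict = await asyncio.wait_for(check_budget(...), timeout=0.05)
    return verdict
except (asyncpg.PostgresError, OSError, asyncio.TimeoutError):
    # Fail-open path. OSError is included on the assumption that a refused or
    # dropped connection surfaces as one rather than as a PostgresError.
    if rate_limiter.allow(user_id):
        await record_overage(user_id, est_cost_usd, reason="db_unreachable")
        return BudgetVerdict.OK
    return BudgetVerdict.HARD_REJECT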

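The Postgres timeout is 50 ms, chosen as roughly 1/8 of the 380 ms p95 latency (380 / 8 ≈ 47.5 ms, rounded up). It caps the time any single call can spend waiting on the budget check, so even a slow-Postgres day adds < 6 ms to the median path.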
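For readers unfamiliar with the fallback, here is a minimal in-memory sketch of a per-user sliding-window limiter with the same 30 requests per 60 seconds shape. It is not the Redis-backed implementation; the class name SlidingWindowLimiter and the injectable clock parameter are assumptions made so the example runs and can be tested without Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis-backed limiter (illustration only)."""

    def __init__(self, limit=30, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self._clock = clock                 # injectable so tests can control time
        self._hits = defaultdict(deque)     # user_id -> timestamps of allowed calls

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False                    # over the limit: caller hard-rejects
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()            # 30 req/min/user, as in the ADR
first_call_allowed = limiter.allow("user-1")
```

Only allowed calls are recorded, so a user who keeps retrying while blocked does not extend their own lockout.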

Tradeoffs we accept

Lever	Alternative	Chosen
Fairness model	Single global cap	3-tier with org/team/user caps
Outage behaviour	Hard-reject if budget DB down	Fail-open with rate limit + overage log
Reconciliation	Synchronous strict	Eventual via `overage_requests`
Hot-path block	Per-call DB write	Per-call DB read + nightly rollup write
Data model	One table per scope	One `budget_tiers` table with `tier_kind` column

We accept the overage risk because the worst case is bounded per user: 30 req/min × 5 minutes of Postgres downtime × ~$0.005/req ≈ $0.75 for each user who stays at the rate limit until Postgres recovers. Total exposure scales with the number of users active during the outage. That's materially less than the cost of false-rejecting paying users during a Postgres blip.
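The bound above is a straight product, and because the limiter is keyed per user it multiplies by the number of users who hit the limit for the whole outage. A small sketch makes that scaling explicit; the function name and the active_users parameter are assumptions, and the 200-user figure in the usage line is purely illustrative.

```python
def worst_case_overage_usd(req_per_min: float, outage_min: float,
                           cost_per_req_usd: float, active_users: int = 1) -> float:
    """Upper bound on fail-open spend if every active user saturates the limiter."""
    return req_per_min * outage_min * cost_per_req_usd * active_users


# The ADR's single-user figure: 30 req/min x 5 min x $0.005/req.
per_user = worst_case_overage_usd(30, 5, 0.005)

# Hypothetical fleet of 200 users, all rate-limited for the full outage.
fleet = worst_case_overage_usd(30, 5, 0.005, active_users=200)
```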

Consequences (positive)

Latency intact. The p95 budget-check latency is 8 ms (one async query, single round-trip via Postgres UNNEST). Hot-path impact is about 2% of the 380 ms p95 (8 / 380).
Three stakeholders, three knobs. Finance updates org caps in governance.py; team leads update team caps via the admin UI; users see their per-user spent/cap on every response in X-Cost-Headers.
Postgres outage is recoverable, not catastrophic. The fail-open path is exercised in tests/test_smoke.py::test_budget_fails_open_when_db_down; the reconciliation flow has a runbook entry.
Reconciliation is auditable. Every fail-open decision lands in overage_requests with (user_id, amount_usd, reason, requested_at, approval_status) — finance can review weekly.
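As an illustration of what the weekly review could aggregate, the sketch below models one overage_requests row with the five fields listed above and totals the unreviewed amounts per user. The Python types, the "pending" and "approved" status values, and the helper pending_totals_by_user are assumptions, not the schema in migrations/005_budget_tables.sql.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OverageRequest:
    user_id: str
    amount_usd: float
    reason: str
    requested_at: datetime
    approval_status: str = "pending"    # assumed default and vocabulary


def pending_totals_by_user(rows):
    """Sum the not-yet-reviewed overage per user."""
    totals = defaultdict(float)
    for row in rows:
        if row.approval_status == "pending":
            totals[row.user_id] += row.amount_usd
    return dict(totals)


now = datetime(2024, 6, 3, 12, 0, tzinfo=timezone.utc)
rows = [
    OverageRequest("u1", 0.25, "db_unreachable", now),
    OverageRequest("u1", 0.50, "db_unreachable", now),
    OverageRequest("u2", 1.00, "db_unreachable", now, approval_status="approved"),
]
totals = pending_totals_by_user(rows)
```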

Consequences (negative)

Three ways to over-spend. A team can hit its cap while the org cap is fine; a user can hit their cap while the team cap is fine. Surface the reason on the 429 response (X-Budget-Reason: over_team_cap) so debug is obvious; the alternative — a single opaque "budget exceeded" — is worse.
Eventual consistency on reconciliation. Overage rows can sit unreconciled for up to 24 h before the nightly job picks them up. We accept the lag because the absolute amounts are small and the alternative is freezing the platform.
More moving parts in incident response. The runbook entry "Postgres unreachable" now has a sub-step: "Check overage_requests table for the fail-open volume; if > $X/hour, switch enforcement to hard-reject via BUDGET_FAIL_OPEN=false."
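One way to surface the breach reason on the response is a small mapping from verdict to status code and headers, sketched below. The function to_http is hypothetical, and the enum values other than over_team_cap are assumptions.

```python
from enum import Enum
from typing import Dict, Tuple


class BudgetVerdict(Enum):
    OK = "ok"
    OVER_USER = "over_user_cap"   # assumed value
    OVER_TEAM = "over_team_cap"
    OVER_ORG = "over_org_cap"     # assumed value


def to_http(verdict: BudgetVerdict) -> Tuple[int, Dict[str, str]]:
    """Translate a verdict into (status code, extra response headers)."""
    if verdict is BudgetVerdict.OK:
        return 200, {}
    # Every cap breach is a 429; the header says which cap was hit.
    return 429, {"X-Budget-Reason": verdict.value}


status, headers = to_http(BudgetVerdict.OVER_TEAM)
```

Keeping the reason in the enum value means a new tier only needs one new enum member rather than a second lookup table.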

Reversal plan

If finance loses tolerance for fail-open behaviour (e.g. a single bad outage costs > $10k overage), the reversal is:

Set BUDGET_FAIL_OPEN=false in .env.
src/cost/governance.py switches the except branch to BudgetVerdict.HARD_REJECT instead of the rate-limit fallback.
Document the SLA change in sla.yaml: budget enforcement availability is now bound to Postgres availability (drops from 99.9% to 99.5%).
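The first two steps amount to a flag read plus one branch. The sketch below shows one possible shape; it is not the code in src/cost/governance.py. The helper names, the set of values treated as false, and the choice to default to fail-open when the variable is unset are all assumptions.

```python
import os


def fail_open_enabled(env=os.environ) -> bool:
    """Read BUDGET_FAIL_OPEN; unset is assumed to mean fail-open stays on."""
    raw = env.get("BUDGET_FAIL_OPEN", "true")
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def on_budget_db_failure(fail_open: bool, limiter_allows: bool) -> str:
    """Decide what the except branch returns when the budget check fails."""
    if fail_open and limiter_allows:
        return "OK"            # allowed; the caller also records the overage
    return "HARD_REJECT"       # flag off, or the rate limiter said no


# With the flag set to false, the rate-limit fallback is never consulted.
reverted = on_budget_db_failure(
    fail_open_enabled({"BUDGET_FAIL_OPEN": "false"}), limiter_allows=True
)
```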

Estimated effort: 1 engineer-day, plus a coordinated SLA update with the on-call team.

References

migrations/005_budget_tables.sql — budget_tiers, overage_requests, cost_anomalies schemas
src/cost/budget.py — three-tier budget check
src/cost/governance.py — fail-open enforcement + reconciliation
src/cost/failures.py — recovery strategies + retry/backoff
runbook/cost-incident-response.py — "Postgres unreachable" playbook
sla.yaml — budget engine availability target (99.9%)
ADR-004 (per-request detail + daily rollup pattern this design rolls up into)