Skip to content
Back to AI Cost Optimization

Per-request detail + daily rollup with `ON CONFLICT` upserts

✓ AcceptedAI Cost Optimization02 — Token Tracking & Cost Visibility
By AI-DE Engineering Team·Stakeholders: platform engineer, finance partner, dashboard owner

Context

Module 02 has to answer two questions on the same data:

  1. Per-request debug. "Why did request req_a31f cost $0.42?" — the answer needs every detail (input tokens, output tokens, model, endpoint, user, cache hit, route decision, latency).
  2. Aggregate dashboards. "What did Team B spend on /chat last month?" — the answer needs a fast scan across millions of rows.

A single table that answers both well doesn't exist at scale. The forces:

  • Detail volume. At 50k req/day, the detail table grows by ~18M rows/year. Even with indexes, "sum cost grouped by team for last 30 days" is a 1.5M-row scan.
  • Dashboard latency budget. Grafana panels target < 500 ms refresh; a 1.5M-row aggregate over the detail table runs ~12 s on the seed workload. That's a regression no on-call wants to ship.
  • Idempotency requirement. The aggregator is a nightly job that occasionally retries on partial failure. Re-aggregation must not double-count.

We considered three patterns:

  1. Detail table only, with materialized views. Re-running the view on every dashboard render is the 12-second regression. Refreshing on a cron means stale dashboards.
  2. Rollup table only. Loses the per-request detail; debugging Q1 becomes "go look at the LLM provider's logs," which doesn't include our route-decision attribution.
  3. Both tables, kept consistent. The pattern below.

Decision

We adopt two tables plus an ON CONFLICT upsert for idempotent rollup.

Detail (llm_requests): one row per request. UUID primary key; columns include model, endpoint, user_id, team_id, org_id, input_tokens, output_tokens, cost_usd, cached (bool), route_decision (enum string), latency_ms, created_at.

Rollup (cost_daily_summary): one row per (date, model, endpoint, user_id) tuple. Columns are sums + p95 latency. Backed by a UNIQUE constraint on the same 4-tuple.

The upsert (migrations/001_create_cost_tables.sql + aggregator.py):

INSERT INTO cost_daily_summary
    (date, model, endpoint, user_id,
     total_requests, total_input_tokens, total_output_tokens,
     total_cost_usd, avg_latency_ms, p95_latency_ms, cache_hit_count)
SELECT
    DATE(created_at) AS date,
    model, endpoint, user_id,
    COUNT(*),
    SUM(input_tokens), SUM(output_tokens),
    SUM(cost_usd),
    AVG(latency_ms),
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms),
    COUNT(*) FILTER (WHERE cached = true)
FROM llm_requests
WHERE created_at >= $1 AND created_at < $2
GROUP BY DATE(created_at), model, endpoint, user_id

ON CONFLICT (date, model, endpoint, user_id)
DO UPDATE SET
    total_requests       = EXCLUDED.total_requests,
    total_input_tokens   = EXCLUDED.total_input_tokens,
    total_output_tokens  = EXCLUDED.total_output_tokens,
    total_cost_usd       = EXCLUDED.total_cost_usd,
    avg_latency_ms       = EXCLUDED.avg_latency_ms,
    p95_latency_ms       = EXCLUDED.p95_latency_ms,
    cache_hit_count      = EXCLUDED.cache_hit_count;

The aggregator runs hourly (not nightly — finance dashboards needed sub-day freshness) over a sliding window (now() - INTERVAL '2 hours' to now()) to catch late-arriving rows. Re-running the same window yields the same final state because of the ON CONFLICT clause.

Indexes:

  • Detail: (user_id, created_at DESC), (model, created_at DESC), (endpoint, created_at DESC), plus (prompt_hash) for ADR-001's exact cache lookup.
  • Rollup: the UNIQUE constraint doubles as the primary lookup index.

Tradeoffs we accept

LeverAlternativeChosen
Storage costOne tableTwo tables (detail + rollup)
Query latencyMaterialized viewPre-aggregated rollup
IdempotencyManual dedupON CONFLICT upsert
Aggregator cadenceNightlyHourly sliding window
Late arrivalsDrop on latenessSliding-window re-aggregation

We accept the storage cost (~12% overhead for the rollup vs detail) because the alternative is dashboards that don't load fast enough to use.

Consequences (positive)

  • Dashboard latency intact. Grafana panels query the rollup; p95 query latency is 35 ms vs 12 s on the detail table.
  • Per-request debug intact. Engineers query the detail table by UUID or user/time range with the existing indexes.
  • Idempotent re-aggregation. A retried hourly run leaves the rollup in the same state as a successful run.
  • Late-arriving rows tolerated. A 90-minute clock skew on a tenant's ingestion path still lands in the rollup correctly.

Consequences (negative)

  • Two writes per request. Detail is per-request synchronous; rollup is per-hour batch. The rollup write is amortised, but at 50k req/day it's a ~150-row hourly insert.
  • Two schemas to evolve in lockstep. A new column on the detail table often needs a corresponding column on the rollup. The migration template in migrations/ documents the lockstep pattern.
  • Aggregator is a job, not a query. A failure in the aggregator doesn't block the platform but does stale the dashboards. We mitigate with a Prometheus alert on cost_aggregator_lag_seconds > 7200.

Reversal plan

If detail-table query latency drops below the dashboard budget (e.g. via a column-store extension like Citus, or a Postgres 17+ improvement), we can deprecate the rollup:

  1. Switch dashboards to query llm_requests with the existing indexes.
  2. Verify p95 query latency < 500 ms on the live workload.
  3. Stop the hourly aggregator; freeze (don't drop) cost_daily_summary for 90 days as a fallback.
  4. Drop the rollup table after the freeze period.

Estimated effort: 1 engineer-week, plus Grafana panel rewrites.

References

  • migrations/001_create_cost_tables.sql — both tables + indexes
  • models.py — SQLAlchemy ORM definitions
  • aggregator.py — hourly sliding-window aggregator
  • seed/01_create_tables.sql — consolidated DDL with the UNIQUE constraint on (date, model, endpoint, user_id)
  • seed/02_sample_data_clean.sql — 10,000 detail rows + 100 rollup rows used in M02–M05 examples
  • ADR-001 (the prompt_hash index this table feeds for the exact-match cache)
  • ADR-002 (the budget engine that reads from the rollup for per-team caps)
Built into the project

This decision shipped as part of AI Cost Optimization — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open