Context
Module 02 has to answer two questions on the same data:
- Per-request debug. "Why did request
req_a31fcost $0.42?" — the answer needs every detail (input tokens, output tokens, model, endpoint, user, cache hit, route decision, latency). - Aggregate dashboards. "What did Team B spend on
/chatlast month?" — the answer needs a fast scan across millions of rows.
A single table that answers both well doesn't exist at scale. The forces:
- Detail volume. At 50k req/day, the detail table grows by ~18M rows/year. Even with indexes, "sum cost grouped by team for last 30 days" is a 1.5M-row scan.
- Dashboard latency budget. Grafana panels target < 500 ms refresh; a 1.5M-row aggregate over the detail table runs ~12 s on the seed workload. That's a regression no on-call wants to ship.
- Idempotency requirement. The aggregator is a nightly job that occasionally retries on partial failure. Re-aggregation must not double-count.
We considered three patterns:
- Detail table only, with materialized views. Re-running the view on every dashboard render is the 12-second regression. Refreshing on a cron means stale dashboards.
- Rollup table only. Loses the per-request detail; debugging Q1 becomes "go look at the LLM provider's logs," which doesn't include our route-decision attribution.
- Both tables, kept consistent. The pattern below.
Decision
We adopt two tables plus an ON CONFLICT upsert for idempotent rollup.
Detail (llm_requests): one row per request. UUID primary key; columns
include model, endpoint, user_id, team_id, org_id, input_tokens,
output_tokens, cost_usd, cached (bool), route_decision (enum string),
latency_ms, created_at.
Rollup (cost_daily_summary): one row per (date, model, endpoint, user_id) tuple. Columns are sums + p95 latency. Backed by a UNIQUE
constraint on the same 4-tuple.
The upsert (migrations/001_create_cost_tables.sql + aggregator.py):
INSERT INTO cost_daily_summary
(date, model, endpoint, user_id,
total_requests, total_input_tokens, total_output_tokens,
total_cost_usd, avg_latency_ms, p95_latency_ms, cache_hit_count)
SELECT
DATE(created_at) AS date,
model, endpoint, user_id,
COUNT(*),
SUM(input_tokens), SUM(output_tokens),
SUM(cost_usd),
AVG(latency_ms),
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms),
COUNT(*) FILTER (WHERE cached = true)
FROM llm_requests
WHERE created_at >= $1 AND created_at < $2
GROUP BY DATE(created_at), model, endpoint, user_id
ON CONFLICT (date, model, endpoint, user_id)
DO UPDATE SET
total_requests = EXCLUDED.total_requests,
total_input_tokens = EXCLUDED.total_input_tokens,
total_output_tokens = EXCLUDED.total_output_tokens,
total_cost_usd = EXCLUDED.total_cost_usd,
avg_latency_ms = EXCLUDED.avg_latency_ms,
p95_latency_ms = EXCLUDED.p95_latency_ms,
cache_hit_count = EXCLUDED.cache_hit_count;
The aggregator runs hourly (not nightly — finance dashboards needed sub-day
freshness) over a sliding window (now() - INTERVAL '2 hours' to now())
to catch late-arriving rows. Re-running the same window yields the same
final state because of the ON CONFLICT clause.
Indexes:
- Detail:
(user_id, created_at DESC),(model, created_at DESC),(endpoint, created_at DESC), plus(prompt_hash)for ADR-001's exact cache lookup. - Rollup: the UNIQUE constraint doubles as the primary lookup index.
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Storage cost | One table | Two tables (detail + rollup) |
| Query latency | Materialized view | Pre-aggregated rollup |
| Idempotency | Manual dedup | ON CONFLICT upsert |
| Aggregator cadence | Nightly | Hourly sliding window |
| Late arrivals | Drop on lateness | Sliding-window re-aggregation |
We accept the storage cost (~12% overhead for the rollup vs detail) because the alternative is dashboards that don't load fast enough to use.
Consequences (positive)
- Dashboard latency intact. Grafana panels query the rollup; p95 query latency is 35 ms vs 12 s on the detail table.
- Per-request debug intact. Engineers query the detail table by UUID or user/time range with the existing indexes.
- Idempotent re-aggregation. A retried hourly run leaves the rollup in the same state as a successful run.
- Late-arriving rows tolerated. A 90-minute clock skew on a tenant's ingestion path still lands in the rollup correctly.
Consequences (negative)
- Two writes per request. Detail is per-request synchronous; rollup is per-hour batch. The rollup write is amortised, but at 50k req/day it's a ~150-row hourly insert.
- Two schemas to evolve in lockstep. A new column on the detail table
often needs a corresponding column on the rollup. The migration template
in
migrations/documents the lockstep pattern. - Aggregator is a job, not a query. A failure in the aggregator doesn't
block the platform but does stale the dashboards. We mitigate with a
Prometheus alert on
cost_aggregator_lag_seconds > 7200.
Reversal plan
If detail-table query latency drops below the dashboard budget (e.g. via a column-store extension like Citus, or a Postgres 17+ improvement), we can deprecate the rollup:
- Switch dashboards to query
llm_requestswith the existing indexes. - Verify p95 query latency < 500 ms on the live workload.
- Stop the hourly aggregator; freeze (don't drop)
cost_daily_summaryfor 90 days as a fallback. - Drop the rollup table after the freeze period.
Estimated effort: 1 engineer-week, plus Grafana panel rewrites.
References
migrations/001_create_cost_tables.sql— both tables + indexesmodels.py— SQLAlchemy ORM definitionsaggregator.py— hourly sliding-window aggregatorseed/01_create_tables.sql— consolidated DDL with the UNIQUE constraint on(date, model, endpoint, user_id)seed/02_sample_data_clean.sql— 10,000 detail rows + 100 rollup rows used in M02–M05 examples- ADR-001 (the
prompt_hashindex this table feeds for the exact-match cache) - ADR-002 (the budget engine that reads from the rollup for per-team caps)