Context
The agent pipeline writes two distinct kinds of state to disk:
- Orchestrator state — current node, intermediate results, error context, message history. High write rate (every node transition), short retention (24-72h), tolerant of small data loss on a hard crash.
- Business data — ingested rows, validated records, transformed output. Low-to-medium write rate, long retention, intolerant of data loss.
Conflating them in one store turned out to be a real production problem in v1, where we tried LangGraph + PostgresSaver first. Three issues surfaced under load:
- Write contention. The orchestrator writes on every node transition (~10/run, sometimes 50+ on retries). At 5k runs/day, that's ~50k row-level locks per day on the same `langgraph_checkpoints` table. Read-heavy business queries (M04 dashboards) started timing out.
- Schema churn. LangGraph minor versions occasionally bump the checkpoint schema. Running a Postgres migration on every minor LangGraph release wasn't sustainable; we had two production incidents from migration-on-deploy gone wrong.
- Retention pressure. 30-day retention on checkpoints + 12-month retention on business records put both on the same disk; we hit storage thresholds twice in 8 weeks.
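The write-contention figures above are simple multiplication, sketched here so the numbers can be re-run as load changes. The function name is illustrative, and the 50-transition case assumes every run takes the retry-heavy path, which is an upper bound rather than a measured value.

```python
def checkpoint_writes_per_day(runs_per_day: int, transitions_per_run: int) -> int:
    """One checkpoint write per node transition (per the Context section)."""
    return runs_per_day * transitions_per_run


# Typical case from the text: 5k runs/day at ~10 transitions per run.
typical = checkpoint_writes_per_day(5_000, 10)

# Hypothetical worst case: every run hits 50 transitions because of retries.
retry_heavy = checkpoint_writes_per_day(5_000, 50)

print(typical, retry_heavy)
```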
Decision
Two-tier persistence:
- Redis (`cache.t4g.small` in production): orchestrator checkpoints + agent message history + recovery state. 24-hour TTL on checkpoints (M05 TimeTravel keeps a 24-hour replay window). `RedisSaver` is the LangGraph integration.
- Postgres (`db.t4g.medium`): ingested + validated + transformed business data. Standard SQLAlchemy ORM, normal application retention.
# src/memory/checkpointing.py
from langgraph.checkpoint.redis import RedisSaver
def make_checkpointer() -> RedisSaver:
    return RedisSaver(
        redis_client=redis_pool,
        ttl_seconds=86400,  # 24h replay window
        key_prefix="agent_chk:",
    )
# src/agents/supervisor.py
graph = compile_graph(checkpointer=make_checkpointer())
Module 05's TimeTravel reads the same Redis checkpoints — no separate replay store.
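To make the replay-window behaviour concrete, here is a minimal sketch of a TTL-bound checkpoint store. It is an in-memory stand-in, not `RedisSaver` and not the code in `src/memory/checkpointing.py`; the class name `TTLCheckpointStore` and the injectable clock are assumptions made so the expiry can be exercised without waiting 24 hours.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCheckpointStore:
    """Hypothetical in-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, ttl_seconds: int, clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, bytes]] = {}

    def put(self, run_id: str, blob: bytes) -> None:
        # Store the blob together with its absolute expiry time.
        self._data[f"agent_chk:{run_id}"] = (self._clock() + self._ttl, blob)

    def get(self, run_id: str) -> Optional[bytes]:
        key = f"agent_chk:{run_id}"
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, blob = entry
        # A real TTL store drops the key itself; here expiry is checked lazily on read.
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return blob


# Drive the clock by hand to simulate time passing.
now = [0.0]
store = TTLCheckpointStore(ttl_seconds=86400, clock=lambda: now[0])
store.put("run_id_42", b"snapshot")

now[0] = 23 * 3600
within_window = store.get("run_id_42")  # still inside the 24h window

now[0] = 25 * 3600
after_window = store.get("run_id_42")  # past the TTL, no longer replayable
```

The point of the sketch is that the replay window and the retention policy are the same number: anything TimeTravel needs after the TTL has to come from another store.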
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Durability | Postgres ACID writes for checkpoints | Redis AOF + RDB backups every 6h — accept risk of <6h checkpoint loss in disaster recovery |
| Single-store simplicity | Postgres for everything | Two stores — accept ops complexity for write isolation + schema isolation |
| Replay window | Indefinite checkpoint retention | 24h TTL — replay before then or it's gone |
| Cost | Postgres-only (~$50/mo) | Redis + Postgres (~$85/mo) |
| Consistency | Cross-store transactions | Eventual consistency — workers commit to Postgres before reporting "done" to the orchestrator |
Consequences (positive)
- Orchestrator writes don't contend with business reads. M04 dashboard latency dropped from p95=4.2s to p95=0.8s after the split.
- LangGraph minor version bumps require no Postgres migration. Redis treats checkpoints as opaque blobs; the schema bump only matters at deserialization time, which is per-process.
- M05 TimeTravel works against the same key-value store: `redis-cli HGETALL agent_chk:run_id_42` returns the snapshot directly. Debugging is `redis-cli`, not SQL.
- Storage retention pressure decoupled. Business retention is set once in Postgres; checkpoint TTL is set once in Redis.
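The "opaque blobs" point can be illustrated with a small sketch: the store only ever sees bytes, and a schema change is handled in the reader. The envelope format, the version numbers, and the `node` to `current_node` rename are all invented for illustration; LangGraph's actual checkpoint serialization is different.

```python
import json
from typing import Dict


def dump_checkpoint(state: dict, schema_version: int) -> bytes:
    """Wrap the state in a versioned envelope and serialize it to bytes."""
    return json.dumps({"v": schema_version, "state": state}).encode("utf-8")


def load_checkpoint(blob: bytes) -> dict:
    """Deserialize a checkpoint, upgrading the hypothetical v1 layout to v2."""
    envelope = json.loads(blob.decode("utf-8"))
    version = envelope["v"]
    state = dict(envelope["state"])
    if version == 1:
        # Hypothetical schema bump: v1 called the field "node", v2 calls it "current_node".
        state["current_node"] = state.pop("node")
    elif version != 2:
        raise ValueError(f"unsupported checkpoint schema version: {version}")
    return state


# The key-value store never inspects the payload, so no store-side migration is needed.
kv: Dict[str, bytes] = {}
kv["agent_chk:old_run"] = dump_checkpoint({"node": "validate"}, schema_version=1)
kv["agent_chk:new_run"] = dump_checkpoint({"current_node": "validate"}, schema_version=2)

old_state = load_checkpoint(kv["agent_chk:old_run"])
new_state = load_checkpoint(kv["agent_chk:new_run"])
```

Both blobs load to the same shape, even though they were written under different schema versions and the store was never told about either.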
Consequences (negative)
- Two stores to operate. Backups, monitoring, alerting, and runbook coverage all double. Mitigation: Redis is `cache.t4g.small`, so the surface is small and the ops cost low.
- 24h replay window means M05 TimeTravel can't replay a 3-day-old run. Mitigation: Module 06's compliance audit log shipped to S3 covers long-term forensics.
- Cross-store consistency is the worker's responsibility. The Protocol contract: workers commit to Postgres before returning `{ status: "done" }` to the orchestrator. Tested in `tests/integration/test_worker_consistency.py`.
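The commit-before-done contract can be sketched as follows. This is not the project's worker code: `sqlite3` stands in for Postgres so the example runs anywhere, and `run_worker` and the `validated_records` table are hypothetical names.

```python
import sqlite3
from typing import Dict, List, Tuple


def run_worker(conn: sqlite3.Connection, records: List[Tuple[str, str]]) -> Dict[str, str]:
    """Persist a batch, and only report "done" after the commit has succeeded."""
    try:
        # The connection context manager commits on success and rolls back on error.
        with conn:
            conn.executemany(
                "INSERT INTO validated_records (id, payload) VALUES (?, ?)",
                records,
            )
    except sqlite3.Error as exc:
        # Nothing was committed, so the orchestrator must not see "done".
        return {"status": "failed", "error": str(exc)}
    return {"status": "done"}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE validated_records (id TEXT PRIMARY KEY, payload TEXT)")
conn.commit()

ok = run_worker(conn, [("r1", "a"), ("r2", "b")])

# Second batch violates the primary key on "r1", so the whole batch rolls back.
dup = run_worker(conn, [("r3", "c"), ("r1", "again")])

row_count = conn.execute("SELECT COUNT(*) FROM validated_records").fetchone()[0]
```

Because the status is derived from the commit outcome, a checkpoint that records "done" always implies the business rows exist. The reverse does not hold: rows can be committed and the checkpoint lost, which is the small-loss case the orchestrator tier is designed to tolerate.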
Reversal plan
Postgres-only: Re-introduce LangGraph's PostgresSaver if/when one of these triggers fires:
- Run rate < 500/day (write-contention argument vanishes; ~$35/mo savings dominate).
- Compliance requirement to retain checkpoints > 24h with ACID guarantees.
- Redis ops burden becomes the bottleneck (would surprise me — but flagged for monitoring).
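The triggers above could be encoded as a periodic check, roughly as below. The signal names and the `PipelineSignals` type are assumptions for illustration; how each signal is actually measured (for example, what counts as Redis ops being the bottleneck) is left open here, as it is in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineSignals:
    """Hypothetical inputs to the reversal decision."""
    runs_per_day: int
    required_checkpoint_retention_hours: int
    needs_acid_checkpoints: bool
    redis_ops_is_bottleneck: bool


def reversal_triggers(signals: PipelineSignals) -> List[str]:
    """Return the names of the reversal triggers that currently fire."""
    fired = []
    if signals.runs_per_day < 500:
        fired.append("low_run_rate")
    if signals.required_checkpoint_retention_hours > 24 and signals.needs_acid_checkpoints:
        fired.append("compliance_retention")
    if signals.redis_ops_is_bottleneck:
        fired.append("redis_ops_burden")
    return fired


current = PipelineSignals(
    runs_per_day=5_000,
    required_checkpoint_retention_hours=24,
    needs_acid_checkpoints=False,
    redis_ops_is_bottleneck=False,
)
shrunk = PipelineSignals(
    runs_per_day=300,
    required_checkpoint_retention_hours=24,
    needs_acid_checkpoints=False,
    redis_ops_is_bottleneck=False,
)
```

With today's load, `reversal_triggers(current)` is empty; a drop to 300 runs/day fires only the run-rate trigger.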
Estimated migration cost: ~3 engineer-days. The Saver swap is one config line; the risk is the schema-migration discipline we wanted to escape in the first place.
Hybrid (write to both): Mirror checkpoints to Postgres for a 30-day archival window. ~1 engineer-week. Use when compliance asks for long retention without losing the write-isolation benefits.
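One possible shape for the hybrid option is a writer that treats Redis as authoritative and the Postgres archive as best-effort. This is a sketch under assumptions, not a design the ADR commits to: the `CheckpointSink` protocol, the in-memory sinks, and the choice to swallow archive failures (so an archive outage never blocks the orchestrator) are all illustrative.

```python
from typing import Dict, List, Protocol


class CheckpointSink(Protocol):
    def put(self, key: str, blob: bytes) -> None: ...


class DictSink:
    """In-memory stand-in for either store."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> None:
        self.data[key] = blob


class FailingSink:
    """Stand-in for an archive that is currently unreachable."""

    def put(self, key: str, blob: bytes) -> None:
        raise ConnectionError("archive unavailable")


class MirroredWriter:
    """Write to the primary first, then mirror to the archive on a best-effort basis."""

    def __init__(self, primary: CheckpointSink, archive: CheckpointSink) -> None:
        self._primary = primary
        self._archive = archive
        self.archive_failures: List[str] = []

    def put(self, key: str, blob: bytes) -> None:
        # A primary failure propagates: the checkpoint was not saved.
        self._primary.put(key, blob)
        try:
            self._archive.put(key, blob)
        except Exception:
            # Record the miss for later backfill instead of failing the node transition.
            self.archive_failures.append(key)


primary, archive = DictSink(), DictSink()
writer = MirroredWriter(primary, archive)
writer.put("agent_chk:run_id_42", b"snapshot")

degraded_primary = DictSink()
degraded = MirroredWriter(degraded_primary, FailingSink())
degraded.put("agent_chk:run_id_43", b"snapshot")
```

If compliance requires every checkpoint to reach the archive, the best-effort branch would have to become a hard failure or a durable queue, which brings back some of the coupling the split removed.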
References
- `src/memory/checkpointing.py`: RedisSaver wiring
- `src/state/time_travel.py`: M05 TimeTravel reads here
- `src/observability/recovery.py`: recovery-on-restart consumes checkpoints
- `tests/integration/test_worker_consistency.py`: cross-store consistency tests
- `runbooks/incident-2026-03-14-checkpoint-contention.md`: the incident that drove the split
- ADR-001 (LangGraph orchestrator the Saver plugs into)
- ADR-005 (DEPRECATED — original ToolRegistry stored config in Postgres; reverted to Redis with hot-reload)