ADR-002: Redis for orchestrator checkpoints; Postgres only for business data | Agentic Data Pipeline

Context

The agent pipeline writes two distinct kinds of state to disk:

Orchestrator state — current node, intermediate results, error context, message history. High write rate (every node transition), short retention (24-72h), tolerant of small data loss on a hard crash.
Business data — ingested rows, validated records, transformed output. Low-to-medium write rate, long retention, intolerant of data loss.

Conflating them in one store turned out to be a real production problem in v1, where we first tried LangGraph with PostgresSaver. Three issues surfaced under load:

Write contention. The orchestrator writes a checkpoint on every node transition (~10/run, sometimes 50+ on retries). At 5k runs/day, that's ~50k row-level locks per day on the same langgraph_checkpoints table, and read-heavy business queries (M04 dashboards) started timing out.
Schema churn. LangGraph minor versions occasionally bump the checkpoint schema. Running a Postgres migration on every minor LangGraph release wasn't sustainable; we had two production incidents from migration-on-deploy going wrong.
Retention pressure. Checkpoints (30-day retention) and business records (12-month retention) shared the same disk; we hit storage thresholds twice in 8 weeks.
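The write-volume figure in the contention item is simple arithmetic, sketched here with the numbers quoted above. The names are illustrative only, and the retry-heavy case is an upper bound that assumes every run reaches 50 transitions, which the incident data does not claim.

```python
RUNS_PER_DAY = 5_000
TYPICAL_TRANSITIONS_PER_RUN = 10  # "~10/run"
RETRY_HEAVY_TRANSITIONS = 50      # "sometimes 50+ on retries"


def checkpoint_writes_per_day(runs_per_day: int, transitions_per_run: int) -> int:
    """One checkpoint write (and one row-level lock) per node transition."""
    return runs_per_day * transitions_per_run


typical = checkpoint_writes_per_day(RUNS_PER_DAY, TYPICAL_TRANSITIONS_PER_RUN)
upper_bound = checkpoint_writes_per_day(RUNS_PER_DAY, RETRY_HEAVY_TRANSITIONS)

print(typical)      # 50000
print(upper_bound)  # 250000
```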

Decision

Two-tier persistence:

Redis (cache.t4g.small in production) — orchestrator checkpoints + agent message history + recovery state. 24-hour TTL on checkpoints (M05 TimeTravel keeps a 24-hour replay window). RedisSaver is the LangGraph integration.
Postgres (db.t4g.medium) — ingested + validated + transformed business data. Standard SQLAlchemy ORM, normal application retention.

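# src/memory/checkpointing.py
import os

import redis
from langgraph.checkpoint.redis import RedisSaver

# Shared Redis client; REDIS_URL is an assumed environment variable name.
redis_pool = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

def make_checkpointer() -> RedisSaver:
    # Single place where the checkpoint TTL and key namespace are set.
    return RedisSaver(
        redis_client=redis_pool,
        ttl_seconds=86400,             # 24h replay window
        key_prefix="agent_chk:",
    )

# src/agents/supervisor.py
from src.memory.checkpointing import make_checkpointer  # import path assumed from the file layout

graph = compile_graph(checkpointer=make_checkpointer())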

Module 05's TimeTravel reads the same Redis checkpoints — no separate replay store.

Tradeoffs we accept

Lever	Alternative	Chosen
Durability	Postgres ACID writes for checkpoints	Redis AOF + RDB backups every 6h — accept risk of <6h checkpoint loss in disaster recovery
Single-store simplicity	Postgres for everything	Two stores — accept ops complexity for write isolation + schema isolation
Replay window	Indefinite checkpoint retention	24h TTL — replay before then or it's gone
Cost	Postgres-only (~$50/mo)	Redis + Postgres (~$85/mo)
Consistency	Cross-store transactions	Eventual consistency — workers commit to Postgres before reporting "done" to the orchestrator
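The replay-window row is easiest to see with a toy model of key expiry. The sketch below is an in-memory stand-in for Redis TTL behaviour, not the RedisSaver implementation; TtlStore and its injected clock are hypothetical names used only for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

CHECKPOINT_TTL_SECONDS = 86_400  # 24h, the value used in make_checkpointer()


@dataclass
class TtlStore:
    """In-memory stand-in for Redis key expiry (illustrative only)."""

    clock: Callable[[], float] = time.time
    _data: dict = field(default_factory=dict)

    def set(self, key: str, value: bytes, ttl_seconds: int) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self.clock() + ttl_seconds)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: the checkpoint is gone for good
            return None
        return value


now = [0.0]  # fake clock so the example does not have to wait 24 hours
store = TtlStore(clock=lambda: now[0])
store.set("agent_chk:run_id_42", b"snapshot", CHECKPOINT_TTL_SECONDS)

now[0] = 23 * 3600
inside_window = store.get("agent_chk:run_id_42")  # still replayable

now[0] = 25 * 3600
after_window = store.get("agent_chk:run_id_42")   # expired, returns None
```

Anything that needs a run older than the TTL has to come from a different source, which is why the long-term forensics path matters.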

Consequences (positive)

Orchestrator writes don't contend with business reads. M04 dashboard latency dropped from p95=4.2s to p95=0.8s after the split.
LangGraph minor version bumps require no Postgres migration. Redis treats checkpoints as opaque blobs; the schema bump only matters at deserialization time, which is per-process.
M05 TimeTravel works against the same key-value store — redis-cli HGETALL agent_chk:run_id_42 returns the snapshot directly. Debugging is redis-cli, not SQL.
Storage retention pressure decoupled. Business retention is set once in Postgres; checkpoint TTL is set once in Redis.
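The "opaque blobs" point can be illustrated with a small sketch: the store only ever sees bytes, and version handling lives entirely in the reader. The JSON envelope, the "v" field, and the v1-to-v2 field rename below are all hypothetical; LangGraph's real serialization format is different and is not reproduced here.

```python
import json


def encode_checkpoint(state: dict, schema_version: int) -> bytes:
    """Wrap state in a versioned envelope (hypothetical format)."""
    return json.dumps({"v": schema_version, "state": state}).encode("utf-8")


def decode_checkpoint(blob: bytes) -> dict:
    """Upgrade old envelopes at read time; the store is never migrated."""
    envelope = json.loads(blob.decode("utf-8"))
    version, state = envelope["v"], dict(envelope["state"])
    if version == 1:
        # Hypothetical schema bump: v1 called the field "node".
        state["current_node"] = state.pop("node")
    elif version != 2:
        raise ValueError(f"unsupported checkpoint schema version: {version}")
    return state


# A plain dict stands in for Redis: it stores bytes and never inspects them.
store: dict[str, bytes] = {}
store["agent_chk:run_a"] = encode_checkpoint({"node": "validate"}, 1)
store["agent_chk:run_b"] = encode_checkpoint({"current_node": "transform"}, 2)

decoded = {key: decode_checkpoint(blob) for key, blob in store.items()}
```

Both entries decode to the current shape even though they were written under different schema versions, and no store-side migration ran.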

Consequences (negative)

Two stores to operate. Backups, monitoring, alerting, and runbook coverage all double. Mitigation: Redis runs on a cache.t4g.small, so the surface is small and the ops cost is low.
24h replay window means M05 TimeTravel can't replay a 3-day-old run. Mitigation: Module 06's compliance audit log shipped to S3 covers long-term forensics.
Cross-store consistency is the worker's responsibility. The Protocol contract: workers commit to Postgres before returning { status: "done" } to the orchestrator. Tests live in tests/integration/test_worker_consistency.py.
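A minimal sketch of the commit-before-done contract follows, assuming "Protocol" refers to a typing.Protocol interface. The stdlib sqlite3 module stands in for Postgres so the example runs anywhere; Worker, IngestWorker, and the records table are hypothetical names, not the project's actual code.

```python
import sqlite3
from typing import Protocol, Sequence


class Worker(Protocol):
    """Hypothetical worker interface; the real Protocol may differ."""

    def run(self, conn: sqlite3.Connection, rows: Sequence[tuple[int, str]]) -> dict: ...


class IngestWorker:
    def run(self, conn: sqlite3.Connection, rows: Sequence[tuple[int, str]]) -> dict:
        try:
            # The connection context manager commits on success and rolls
            # back if the block raises, so "done" is only reached after commit.
            with conn:
                conn.executemany(
                    "INSERT INTO records (id, payload) VALUES (?, ?)", rows
                )
        except sqlite3.Error as exc:
            return {"status": "failed", "error": str(exc)}
        return {"status": "done"}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
worker: Worker = IngestWorker()

ok = worker.run(conn, [(1, "a"), (2, "b")])
dup = worker.run(conn, [(3, "c"), (1, "duplicate id")])  # violates the primary key
row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

The failed batch is rolled back as a whole, so the orchestrator never sees "done" for rows that are not durably stored.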

Reversal plan

Postgres-only: Re-introduce LangGraph's PostgresSaver if/when one of these triggers fires:

Run rate < 500/day (write-contention argument vanishes; ~$35/mo savings dominate).
Compliance requirement to retain checkpoints > 24h with ACID guarantees.
Redis ops burden becomes the bottleneck (this would surprise us, but it is flagged for monitoring).

Estimated migration cost: ~3 engineer-days. The Saver swap is one config line; the risk is the schema-migration discipline we wanted to escape in the first place.
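The first two Postgres-only triggers are mechanical enough to check in code. The sketch below is one possible encoding under that assumption; PipelineFacts and fired_reversal_triggers are hypothetical names, and the third trigger (ops burden) is left out because it is a judgment call rather than a threshold.

```python
from dataclasses import dataclass

RUN_RATE_FLOOR_PER_DAY = 500  # below this, the write-contention argument vanishes
CHECKPOINT_TTL_HOURS = 24     # current Redis checkpoint TTL


@dataclass(frozen=True)
class PipelineFacts:
    runs_per_day: int
    required_checkpoint_retention_hours: int
    needs_acid_checkpoints: bool


def fired_reversal_triggers(facts: PipelineFacts) -> list[str]:
    """Return the names of the threshold-based triggers that currently fire."""
    fired = []
    if facts.runs_per_day < RUN_RATE_FLOOR_PER_DAY:
        fired.append("run-rate")
    if (
        facts.required_checkpoint_retention_hours > CHECKPOINT_TTL_HOURS
        and facts.needs_acid_checkpoints
    ):
        fired.append("compliance-retention")
    return fired


today = PipelineFacts(
    runs_per_day=5_000,
    required_checkpoint_retention_hours=24,
    needs_acid_checkpoints=False,
)
quiet = PipelineFacts(
    runs_per_day=300,
    required_checkpoint_retention_hours=24,
    needs_acid_checkpoints=False,
)

print(fired_reversal_triggers(today))  # []
print(fired_reversal_triggers(quiet))  # ['run-rate']
```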

Hybrid (write to both): Mirror checkpoints to Postgres for a 30-day archival window. ~1 engineer-week. Use when compliance asks for long retention without losing the write-isolation benefits.

References

src/memory/checkpointing.py — RedisSaver wiring
src/state/time_travel.py — M05 TimeTravel reads here
src/observability/recovery.py — recovery-on-restart consumes checkpoints
tests/integration/test_worker_consistency.py — cross-store consistency tests
runbooks/incident-2026-03-14-checkpoint-contention.md — the incident that drove the split
ADR-001 (LangGraph orchestrator the Saver plugs into)
ADR-005 (DEPRECATED — original ToolRegistry stored config in Postgres; reverted to Redis with hot-reload)