Redis for orchestrator checkpoints; Postgres only for business data

✓ Accepted · Agentic Data Pipeline · 03: Multi-Agent Pipeline (carries through M05 TimeTravel)
By AI-DE Engineering Team·Stakeholders: platform owner, data engineer, eng manager

Context

The agent pipeline writes two distinct kinds of state to disk:

  1. Orchestrator state — current node, intermediate results, error context, message history. High write rate (every node transition), short retention (24-72h), tolerant of small data loss on a hard crash.
  2. Business data — ingested rows, validated records, transformed output. Low-to-medium write rate, long retention, intolerant of data loss.

Conflating them in one store turned out to be a real production problem in v1, where we first tried LangGraph + PostgresSaver. Three issues surfaced under load:

  • Write contention. Orchestrator writes every node transition (~10/run, sometimes 50+ on retries). At 5k runs/day, that's ~50k row-level locks per day on the same langgraph_checkpoints table. Read-heavy business queries (M04 dashboards) started timing out.
  • Schema churn. LangGraph minor versions occasionally bump the checkpoint schema. Running a Postgres migration on every minor LangGraph release wasn't sustainable; we had two production incidents from migration-on-deploy gone wrong.
  • Retention pressure. 30-day retention on checkpoints + 12-month retention on business records put both on the same disk; we hit storage thresholds twice in 8 weeks.
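
The write-volume figure in the first bullet is plain arithmetic, and a small sketch makes it easy to redo for other run rates. The function name is illustrative, and the retry-heavy case is a pessimistic upper bound that assumes every run hits 50 transitions, which the text does not claim.

```python
def checkpoint_writes_per_day(runs_per_day: int, transitions_per_run: int) -> int:
    """One checkpoint write per node transition."""
    return runs_per_day * transitions_per_run


# Figures quoted above: ~5k runs/day, ~10 transitions per run.
typical = checkpoint_writes_per_day(5_000, 10)

# Assumed worst case: every run retries its way up to 50 transitions.
retry_heavy = checkpoint_writes_per_day(5_000, 50)

print(typical, retry_heavy)  # 50000 250000
```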

Decision

Two-tier persistence:

  • Redis (cache.t4g.small in production) — orchestrator checkpoints + agent message history + recovery state. 24-hour TTL on checkpoints (M05 TimeTravel keeps a 24-hour replay window). RedisSaver is the LangGraph integration.
  • Postgres (db.t4g.medium) — ingested + validated + transformed business data. Standard SQLAlchemy ORM, normal application retention.
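
# src/memory/checkpointing.py
from langgraph.checkpoint.redis import RedisSaver

# redis_pool is the shared Redis client, defined elsewhere (not shown here).
def make_checkpointer() -> RedisSaver:
    """Build the Redis-backed checkpointer used by the orchestrator graph."""
    return RedisSaver(
        redis_client=redis_pool,
        ttl_seconds=86400,             # 24h replay window
        key_prefix="agent_chk:",       # keys look like agent_chk:<run_id>
    )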

# src/agents/supervisor.py
graph = compile_graph(checkpointer=make_checkpointer())

Module 05's TimeTravel reads the same Redis checkpoints — no separate replay store.
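
To make the replay-window behaviour concrete, here is a minimal in-memory sketch of a TTL-bounded checkpoint store. It is not RedisSaver and does not talk to Redis: `TtlCheckpointStore`, its methods, and the injected clock are hypothetical names used only to illustrate what a 24-hour TTL means for a reader such as TimeTravel.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCheckpointStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(
        self,
        ttl_seconds: int = 86_400,
        key_prefix: str = "agent_chk:",
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._ttl = ttl_seconds
        self._prefix = key_prefix
        self._clock = clock
        self._data: Dict[str, Tuple[float, bytes]] = {}

    def put(self, run_id: str, blob: bytes) -> None:
        # Store the blob together with its absolute expiry time.
        self._data[self._prefix + run_id] = (self._clock() + self._ttl, blob)

    def get(self, run_id: str) -> Optional[bytes]:
        key = self._prefix + run_id
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, blob = entry
        if self._clock() >= expires_at:
            # Expired: behave as if the key never existed.
            del self._data[key]
            return None
        return blob


# A fake clock lets us step through time without sleeping.
now = [0.0]
store = TtlCheckpointStore(clock=lambda: now[0])
store.put("run_id_42", b"snapshot")

now[0] = 23 * 3600
within_window = store.get("run_id_42")   # still replayable

now[0] = 25 * 3600
after_window = store.get("run_id_42")    # past the 24h TTL
```

Under these assumptions a replay requested at hour 23 still finds the snapshot, while one requested at hour 25 finds nothing.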

Tradeoffs we accept

Lever | Alternative | Chosen
Durability | Postgres ACID writes for checkpoints | Redis AOF + RDB backups every 6h; accept risk of <6h checkpoint loss in disaster recovery
Single-store simplicity | Postgres for everything | Two stores; accept ops complexity for write isolation + schema isolation
Replay window | Indefinite checkpoint retention | 24h TTL; replay before then or it's gone
Cost | Postgres-only (~$50/mo) | Redis + Postgres (~$85/mo)
Consistency | Cross-store transactions | Eventual consistency; workers commit to Postgres before reporting "done" to the orchestrator

Consequences (positive)

  • Orchestrator writes don't contend with business reads. M04 dashboard latency dropped from p95=4.2s to p95=0.8s after the split.
  • LangGraph minor version bumps require no Postgres migration. Redis treats checkpoints as opaque blobs; the schema bump only matters at deserialization time, which is per-process.
  • M05 TimeTravel works against the same key-value store — redis-cli HGETALL agent_chk:run_id_42 returns the snapshot directly. Debugging is redis-cli, not SQL.
  • Storage retention pressure decoupled. Business retention is set once in Postgres; checkpoint TTL is set once in Redis.
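
The "opaque blobs" point can be illustrated with a small sketch: the store holds only bytes, and any knowledge of the payload layout lives in the reading process. The two payload layouts below are invented for illustration and are not LangGraph's real checkpoint format; the dict stands in for Redis.

```python
import json
from typing import Any, Dict


def serialize_v1(node: str) -> bytes:
    # Hypothetical older layout.
    return json.dumps({"v": 1, "node": node}).encode()


def serialize_v2(node: str) -> bytes:
    # Hypothetical newer layout: renamed field plus a new one.
    return json.dumps({"v": 2, "current_node": node, "errors": []}).encode()


def deserialize(blob: bytes) -> Dict[str, Any]:
    """The only place that knows about payload versions."""
    raw = json.loads(blob)
    if raw["v"] == 1:
        return {"node": raw["node"], "errors": []}
    if raw["v"] == 2:
        return {"node": raw["current_node"], "errors": raw["errors"]}
    raise ValueError(f"unknown checkpoint version: {raw['v']}")


# The store never inspects the payload, so no migration runs when v2 appears.
store: Dict[str, bytes] = {}
store["agent_chk:run_a"] = serialize_v1("validate")
store["agent_chk:run_b"] = serialize_v2("transform")

states = {key: deserialize(blob) for key, blob in store.items()}
```

Both generations of payload coexist in the same keyspace, and each reader process decides for itself how to interpret them.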

Consequences (negative)

  • Two stores to operate. Backups, monitoring, alerting, and runbook coverage all double. Mitigation: Redis runs on a cache.t4g.small, so the surface is small and the ops cost low.
  • 24h replay window means M05 TimeTravel can't replay a 3-day-old run. Mitigation: Module 06's compliance audit log shipped to S3 covers long-term forensics.
  • Cross-store consistency is the worker's responsibility. The Protocol contract: workers commit to Postgres before returning { status: "done" } to the orchestrator. Covered by tests/integration/test_worker_consistency.py.
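
The commit-before-done contract in the last bullet can be sketched as follows. This is an illustration only: it uses the standard-library `sqlite3` module as a stand-in for Postgres so that it runs anywhere, and `run_worker`, the `validated_records` table, and the `"failed"` status are hypothetical names, not the project's actual worker code.

```python
import sqlite3
from typing import Dict, List, Optional


def run_worker(
    conn: sqlite3.Connection, run_id: str, rows: List[Optional[str]]
) -> Dict[str, str]:
    """Persist business rows, then report status to the orchestrator."""
    try:
        # The connection context manager commits on success and
        # rolls back if the block raises.
        with conn:
            conn.executemany(
                "INSERT INTO validated_records (run_id, payload) VALUES (?, ?)",
                [(run_id, row) for row in rows],
            )
    except sqlite3.Error as exc:
        return {"status": "failed", "error": str(exc)}
    # Reached only after the commit has succeeded.
    return {"status": "done"}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE validated_records (run_id TEXT NOT NULL, payload TEXT NOT NULL)"
)

ok = run_worker(conn, "run_id_42", ["a", "b"])
bad = run_worker(conn, "run_id_43", ["c", None])  # NULL violates the constraint
committed = conn.execute("SELECT COUNT(*) FROM validated_records").fetchone()[0]
```

The second call fails partway through its batch, so the whole batch is rolled back and the orchestrator never sees "done" for data that is not in the store.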

Reversal plan

Postgres-only: Re-introduce LangGraph's PostgresSaver if/when one of these triggers fires:

  1. Run rate < 500/day (write-contention argument vanishes; ~$35/mo savings dominate).
  2. Compliance requirement to retain checkpoints > 24h with ACID guarantees.
  3. Redis ops burden becomes the bottleneck (this would surprise us, but it is flagged for monitoring).
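
The three triggers above could be encoded as a periodic check, for example in a monitoring job. The sketch below is one possible shape; `PipelineSignals`, its field names, and `reversal_triggers` are hypothetical, and the third signal is a human judgement passed in as a flag rather than something measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineSignals:
    runs_per_day: int
    required_retention_hours: int   # longest checkpoint retention compliance asks for
    requires_acid: bool             # whether that retention must be ACID-backed
    redis_ops_is_bottleneck: bool   # set by the on-call team, not measured


def reversal_triggers(signals: PipelineSignals) -> List[str]:
    """Return the reversal triggers that currently fire."""
    fired: List[str] = []
    if signals.runs_per_day < 500:
        fired.append("run rate below 500/day")
    if signals.required_retention_hours > 24 and signals.requires_acid:
        fired.append("ACID checkpoint retention beyond 24h required")
    if signals.redis_ops_is_bottleneck:
        fired.append("Redis ops burden is the bottleneck")
    return fired


today = reversal_triggers(PipelineSignals(5_000, 24, False, False))
shrunk = reversal_triggers(PipelineSignals(400, 72, True, False))
```

With the current figures no trigger fires; a drop to 400 runs/day combined with a 72-hour ACID retention requirement would fire the first two.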

Estimated migration cost: ~3 engineer-days. The Saver swap is one config line; the risk is the schema-migration discipline we wanted to escape in the first place.

Hybrid (write to both): Mirror checkpoints to Postgres for a 30-day archival window. ~1 engineer-week. Use when compliance asks for long retention without losing the write-isolation benefits.
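
A rough sketch of the hybrid option, under stated assumptions: a dict stands in for Redis, `sqlite3` stands in for Postgres, and `MirroredCheckpoints`, the `checkpoint_archive` table, and `prune_archive` are hypothetical names. The mirror here is synchronous for brevity; a real implementation would probably batch or write asynchronously so that archival does not bring the write contention back.

```python
import sqlite3
from typing import Dict

ARCHIVE_WINDOW_SECONDS = 30 * 24 * 3600  # 30-day archival window


class MirroredCheckpoints:
    """Write each checkpoint to a hot store and mirror it to an archive."""

    def __init__(self, archive: sqlite3.Connection) -> None:
        self.hot: Dict[str, bytes] = {}  # stand-in for Redis
        self.archive = archive
        with self.archive:
            self.archive.execute(
                "CREATE TABLE IF NOT EXISTS checkpoint_archive "
                "(key TEXT NOT NULL, written_at REAL NOT NULL, blob BLOB NOT NULL)"
            )

    def put(self, key: str, blob: bytes, now: float) -> None:
        self.hot[key] = blob
        with self.archive:
            self.archive.execute(
                "INSERT INTO checkpoint_archive VALUES (?, ?, ?)", (key, now, blob)
            )

    def prune_archive(self, now: float) -> int:
        """Delete archived rows older than the window; return how many."""
        with self.archive:
            cursor = self.archive.execute(
                "DELETE FROM checkpoint_archive WHERE written_at < ?",
                (now - ARCHIVE_WINDOW_SECONDS,),
            )
        return cursor.rowcount


DAY = 24 * 3600
mirror = MirroredCheckpoints(sqlite3.connect(":memory:"))
mirror.put("agent_chk:run_old", b"old", now=0.0)
mirror.put("agent_chk:run_new", b"new", now=31.0 * DAY)

pruned = mirror.prune_archive(now=31.0 * DAY)
remaining = mirror.archive.execute(
    "SELECT key FROM checkpoint_archive"
).fetchall()
```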

References

  • src/memory/checkpointing.py — RedisSaver wiring
  • src/state/time_travel.py — M05 TimeTravel reads here
  • src/observability/recovery.py — recovery-on-restart consumes checkpoints
  • tests/integration/test_worker_consistency.py — cross-store consistency tests
  • runbooks/incident-2026-03-14-checkpoint-contention.md — the incident that drove the split
  • ADR-001 (LangGraph orchestrator the Saver plugs into)
  • ADR-005 (DEPRECATED — original ToolRegistry stored config in Postgres; reverted to Redis with hot-reload)