ADR-001: LangGraph chosen over CrewAI / AutoGen / custom orchestrator | Agentic Data Pipeline

Context

The agent orchestrator is the most expensive decision in this build to reverse — every worker, every tool call, every checkpoint, every failure-recovery path runs through it. Pick wrong and the M05 hardening layer (HITL, time-travel, failure detection) becomes a rewrite instead of an addition.

Four families on the table at v1 design review:

LangGraph (LangChain). Graph-based state machine. Native checkpointing. Streaming + interrupts as first-class primitives. Strong typing via TypedDict state. Heavy dependency surface (LangChain ecosystem).
CrewAI. Role-based agents with delegation. Declarative + readable. Less flexible state model — checkpointing is bolted-on, not native. Weaker typing.
AutoGen (Microsoft). Conversational multi-agent. Excellent for chat-driven workflows, weaker fit for data-pipeline orchestration where determinism + replay matter.
Custom orchestrator. Minimal dependency, full control. ~3 engineer-weeks to replicate LangGraph's checkpoint + interrupt features at production quality.

Constraints driving the pick:

HITL must be first-class. Module 05 needs interrupt_before semantics — pause mid-graph, persist state, resume after human approval. Building this on a non-graph orchestrator means reimplementing the checkpoint serializer.
Time-travel replay must work. Module 05's TimeTravel feature replays a graph from any checkpoint. The orchestrator's checkpoint format has to be stable + serializable.
State must be typed. Workers mutate shared state (run id, intermediate results, error context). Untyped dicts produce silent breakage.
Streaming is required for observability. Module 04's LangSmith + OpenTelemetry tracing depends on per-node event emission.
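
To make the typed-state constraint concrete, here is a minimal sketch of what a typed PipelineState could look like. The field names (run_id, results, error) and the ErrorContext shape are assumptions inferred from the examples in the list above, not the project's actual schema.

```python
from typing import Any, Optional, TypedDict


class ErrorContext(TypedDict):
    """Hypothetical shape for the error context carried in state."""
    node: str      # worker that failed
    message: str   # human-readable failure description


class PipelineState(TypedDict):
    """Assumed fields; the real schema lives in the project source."""
    run_id: str
    results: dict[str, Any]         # intermediate results keyed by worker name
    error: Optional[ErrorContext]   # None while the run is healthy


def record_result(state: PipelineState, worker: str, value: Any) -> PipelineState:
    """Return a new state with one worker's result added; the input is not mutated."""
    return PipelineState(
        run_id=state["run_id"],
        results={**state["results"], worker: value},
        error=state["error"],
    )


initial = PipelineState(run_id="run-001", results={}, error=None)
updated = record_result(initial, "ingestion", {"rows": 120})
```

A TypedDict is a plain dict at runtime, so the protection is static: a type checker flags a misspelled key such as state["run_idd"] before the code runs, which is the silent breakage the constraint is aimed at.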

Decision

Adopt LangGraph as the orchestrator. Build worker agents as LangGraph nodes; supervisor routes via conditional edges; state persists via MemorySaver (in-memory, dev) or RedisSaver (production, M03+).

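# src/orchestration/graph.py
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver

# PipelineState, supervisor_agent, the worker callables, route_decision, and
# the `redis` client are defined elsewhere in src/ (see References).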

graph = StateGraph(PipelineState)
graph.add_node("supervisor", supervisor_agent)
graph.add_node("ingestion", ingestion_worker)
graph.add_node("quality", quality_worker)
graph.add_node("transform", transform_worker)

graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route_decision, {
    "ingest": "ingestion",
    "validate": "quality",
    "transform": "transform",
    "end": END,
})
graph.add_edge("ingestion", "supervisor")
graph.add_edge("quality", "supervisor")
graph.add_edge("transform", "supervisor")

compiled = graph.compile(
    checkpointer=RedisSaver(redis_client=redis),
    interrupt_before=["transform"],   # HITL gate before destructive writes
)
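
The conditional edges above expect route_decision to return one of the four keys in the mapping. The real routing logic is not shown in this ADR; the sketch below assumes a fixed ingest, validate, transform order and a simplified state shape, purely to illustrate the contract between the supervisor and the edge map.

```python
from typing import Any, Literal, Optional, TypedDict

# The four keys used in the add_conditional_edges mapping.
Route = Literal["ingest", "validate", "transform", "end"]


class PipelineState(TypedDict):
    """Simplified, assumed state shape for this illustration."""
    run_id: str
    results: dict[str, Any]   # keyed by node name once a worker has finished
    error: Optional[str]


def route_decision(state: PipelineState) -> Route:
    """Pick the next edge: stop on error, otherwise run the first unfinished stage."""
    if state["error"] is not None:
        return "end"
    done = state["results"]
    if "ingestion" not in done:
        return "ingest"
    if "quality" not in done:
        return "validate"
    if "transform" not in done:
        return "transform"
    return "end"


fresh = PipelineState(run_id="run-001", results={}, error=None)
after_ingest = PipelineState(run_id="run-001", results={"ingestion": 1}, error=None)
failed = PipelineState(run_id="run-001", results={}, error="source unreachable")
```

Returning a Literal type means a type checker can catch a route name that has no matching entry in the edge mapping.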

The interface is a Protocol — nothing in the worker code knows it's running inside LangGraph, so the orchestrator can be swapped if the economics change.

from typing import Protocol

class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...
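
As an illustration of the portability claim, the sketch below shows a worker written as a plain coroutine function that satisfies the Protocol structurally, driven by a trivial stand-in runner instead of LangGraph. The state fields, the worker body, and run_sequence are hypothetical; only the WorkerAgent shape comes from this ADR.

```python
import asyncio
from typing import Any, Optional, Protocol, TypedDict


class PipelineState(TypedDict):
    """Simplified, assumed state shape for this illustration."""
    run_id: str
    results: dict[str, Any]
    error: Optional[str]


class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...


async def ingestion_worker(state: PipelineState) -> PipelineState:
    """Plain coroutine function: no orchestrator imports, just state in, state out."""
    rows = [{"id": 1}, {"id": 2}]  # stand-in for a real source read
    return PipelineState(
        run_id=state["run_id"],
        results={**state["results"], "ingestion": {"rows": len(rows)}},
        error=state["error"],
    )


async def run_sequence(workers: list[WorkerAgent], state: PipelineState) -> PipelineState:
    """Hypothetical minimal runner: call each worker in order and thread the state through."""
    for worker in workers:
        state = await worker(state)
    return state


start = PipelineState(run_id="run-001", results={}, error=None)
final = asyncio.run(run_sequence([ingestion_worker], start))
```

Because the Protocol is structural, the worker never inherits from or imports anything orchestrator-specific; the same function can be registered as a graph node or called by a different runner.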

Tradeoffs we accept

Lever	Alternative	Chosen
Dependency surface	Custom orchestrator (~3 eng-weeks build)	LangGraph (~1 day to wire) — accept LangChain dependency
HITL semantics	Custom interrupt-then-resume protocol	`interrupt_before=[...]` as first-class primitive
Vendor risk	Build everything ourselves	LangGraph is open source under the MIT licence; vendor-replaceable
Learning curve	CrewAI's declarative DSL	LangGraph's graph-construction API — steeper but more expressive
Determinism	AutoGen's conversational model	LangGraph's deterministic state-machine — necessary for replay

Consequences (positive)

M05's HITL feature is one config flag (interrupt_before=[...]), not a custom subsystem.
M05's TimeTravel works against the same RedisSaver checkpoint we already use for crash recovery — no separate infrastructure.
LangSmith tracing integrates natively (langgraph + langsmith from the same vendor).
Workers stay framework-agnostic via the Protocol — orchestrator swap is contained to src/orchestration/.

Consequences (negative)

LangChain dependency surface is large (~200 transitive deps). We accept it for v1; mitigated by requirements-core.txt excluding optional providers.
LangGraph's API is still evolving: minor versions occasionally introduce breaking changes, so we pin the version in requirements.txt.
Custom orchestrator would have ~30% lower memory overhead at scale; we accept that until run rate crosses ~50k/mo.

Reversal plan

Custom orchestrator swap: ~3 engineer-weeks (~15 working days in total). The Protocol shape means worker code is portable; the work is in:

Reimplementing RedisSaver checkpoint format (~5 days)
Implementing interrupt_before semantics + resume protocol (~5 days)
Reimplementing conditional edges + state mutation guards (~3 days)
Migration window: old + new orchestrators run in parallel for 1 sprint (~2 days of engineering effort)
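
To give a feel for what the checkpoint and interrupt items involve, here is a deliberately tiny sketch of a runner with per-step checkpoints, an interrupt-before gate, and resume from any stored checkpoint. It is a toy under loud assumptions (linear node list, JSON-serializable dict state, in-memory storage, hypothetical class and method names); it is not LangGraph's checkpoint format and not the planned replacement.

```python
import json
from typing import Callable, Optional

Node = Callable[[dict], dict]


class TinyOrchestrator:
    """Toy linear runner: checkpoint after every node, pause before gated nodes."""

    def __init__(self, nodes: list[tuple[str, Node]], interrupt_before: set[str]):
        self.nodes = nodes
        self.interrupt_before = interrupt_before
        self.checkpoints: dict[str, list[str]] = {}  # thread_id -> JSON snapshots

    def _save(self, thread_id: str, next_index: int, state: dict) -> None:
        snapshot = json.dumps({"next": next_index, "state": state})
        self.checkpoints.setdefault(thread_id, []).append(snapshot)

    def run(self, thread_id: str, state: Optional[dict] = None,
            from_checkpoint: int = -1) -> dict:
        """Start a run (state given) or resume from a stored checkpoint (state=None)."""
        if state is None:
            saved = json.loads(self.checkpoints[thread_id][from_checkpoint])
            index, state, resuming = saved["next"], saved["state"], True
        else:
            index, resuming = 0, False
            self._save(thread_id, index, state)
        while index < len(self.nodes):
            name, node = self.nodes[index]
            # Pause before a gated node unless this call is the explicit resume.
            if name in self.interrupt_before and not resuming:
                return {"status": "interrupted", "next": name, "state": state}
            resuming = False  # the resume only waives the first gate it meets
            state = node(state)
            index += 1
            self._save(thread_id, index, state)
        return {"status": "done", "next": None, "state": state}


def ingest(state: dict) -> dict:
    return {**state, "ingested": True}


def transform(state: dict) -> dict:
    return {**state, "rows": state["rows"] * 2}


orch = TinyOrchestrator([("ingestion", ingest), ("transform", transform)],
                        interrupt_before={"transform"})
paused = orch.run("run-001", {"rows": 3})           # stops before "transform"
resumed = orch.run("run-001")                       # human approved: continue
replayed = orch.run("run-001", from_checkpoint=0)   # replay from the first checkpoint
```

Even at this size the sketch has to decide how snapshots are serialized, what resuming means for a gate, and how a replay interacts with existing history (here it simply appends). Those decisions, made durable and concurrent-safe, are where the estimated days go.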

CrewAI swap: ~1 engineer-week if we accept worse HITL ergonomics. Not worth it unless LangChain dependency surface becomes a deal-breaker.

Trigger conditions:

LangGraph licence change (currently MIT; no current risk).
Run rate crosses ~50k/mo and orchestrator overhead becomes the bottleneck (we monitor via the Module 04 cost-tracking integration).
A LangGraph minor version breaks the Saver protocol contract a third time (we've had two minor breaks; a third triggers the rewrite).
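
The three triggers can be expressed as a small check so they are evaluated the same way at each review. This is a sketch only: the OrchestratorHealth fields and the sample values are illustrative assumptions, while the 50k/mo threshold and the third-break rule are taken from the list above.

```python
from dataclasses import dataclass, replace

RUN_RATE_THRESHOLD = 50_000  # runs per month, from the trigger list
SAVER_BREAK_LIMIT = 3        # the third Saver contract break triggers the rewrite


@dataclass(frozen=True)
class OrchestratorHealth:
    """Hypothetical inputs gathered at review time."""
    licence_changed: bool
    runs_per_month: int
    orchestrator_is_bottleneck: bool
    saver_contract_breaks: int


def reversal_triggers(health: OrchestratorHealth) -> list[str]:
    """Return the names of every trigger condition that currently holds."""
    fired = []
    if health.licence_changed:
        fired.append("licence change")
    if health.runs_per_month > RUN_RATE_THRESHOLD and health.orchestrator_is_bottleneck:
        fired.append("run rate")
    if health.saver_contract_breaks >= SAVER_BREAK_LIMIT:
        fired.append("saver contract")
    return fired


# Illustrative values only; two Saver breaks matches the count stated above.
today = OrchestratorHealth(
    licence_changed=False,
    runs_per_month=12_000,
    orchestrator_is_bottleneck=False,
    saver_contract_breaks=2,
)
after_third_break = replace(today, saver_contract_breaks=3)
```

An empty list means no reversal is warranted; any non-empty result is a prompt to revisit this ADR rather than an automatic decision.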

References

src/orchestration/graph.py — graph construction + compilation
src/agents/supervisor.py — supervisor node implementation
src/agents/workers.py — worker Protocol + 4 implementations
src/memory/checkpointing.py — RedisSaver wiring
ADR-002 (Redis-as-checkpoint depends on this)
ADR-003 (supervisor-worker topology depends on this)
LangGraph docs: https://langchain-ai.github.io/langgraph/