LangGraph chosen over CrewAI / AutoGen / custom orchestrator

✓ Accepted · Agentic Data Pipeline · 01: Agent Foundation (carries through M03 multi-agent orchestration)
By AI-DE Engineering Team · Stakeholders: ML engineer, platform owner, eng manager

Context

The agent orchestrator is the most expensive decision in this build to reverse — every worker, every tool call, every checkpoint, every failure-recovery path runs through it. Pick wrong and the M05 hardening layer (HITL, time-travel, failure detection) becomes a rewrite instead of an addition.

Four families on the table at v1 design review:

  • LangGraph (LangChain). Graph-based state machine. Native checkpointing. Streaming + interrupts as first-class primitives. Strong typing via TypedDict state. Heavy dependency surface (LangChain ecosystem).
  • CrewAI. Role-based agents with delegation. Declarative + readable. Less flexible state model — checkpointing is bolted-on, not native. Weaker typing.
  • AutoGen (Microsoft). Conversational multi-agent. Excellent for chat-driven workflows, weaker fit for data-pipeline orchestration where determinism + replay matter.
  • Custom orchestrator. Minimal dependency, full control. ~3 engineer-weeks to replicate LangGraph's checkpoint + interrupt features at production quality.

Constraints driving the pick:

  1. HITL must be first-class. Module 05 needs interrupt_before semantics — pause mid-graph, persist state, resume after human approval. Building this on a non-graph orchestrator means reimplementing the checkpoint serializer.
  2. Time-travel replay must work. Module 05's TimeTravel feature replays a graph from any checkpoint. The orchestrator's checkpoint format has to be stable + serializable.
  3. State must be typed. Workers mutate shared state (run id, intermediate results, error context). Untyped dicts produce silent breakage.
  4. Streaming is required for observability. Module 04's LangSmith + OpenTelemetry tracing depends on per-node event emission.
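
Constraint 3 can be made concrete with a small sketch. The field names below (`run_id`, `results`, `error`) are assumptions taken from the parenthetical above, not the project's actual schema, and `record_error` is a hypothetical helper. Note that `TypedDict` is checked by static type checkers such as mypy, not enforced at runtime.

```python
from typing import Optional, TypedDict


class ErrorContext(TypedDict):
    node: str        # which node recorded the failure
    message: str     # human-readable description


class PipelineState(TypedDict):
    # Hypothetical shape; the real PipelineState may carry more fields.
    run_id: str
    results: dict[str, object]       # intermediate results keyed by worker name
    error: Optional[ErrorContext]    # None while the run is healthy


def record_error(state: PipelineState, node: str, message: str) -> PipelineState:
    """Return a new state with the error context filled in (no in-place mutation)."""
    return {**state, "error": {"node": node, "message": message}}


state: PipelineState = {"run_id": "run-001", "results": {}, "error": None}
failed = record_error(state, "quality", "null rate above threshold")
```

With a typed state, a misspelled key such as `state["runid"]` is flagged by the type checker instead of surfacing as a `KeyError` in the middle of a run.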

Decision

Adopt LangGraph as the orchestrator. Build worker agents as LangGraph nodes; supervisor routes via conditional edges; state persists via MemorySaver (in-memory, dev) or RedisSaver (production, M03+).

# src/orchestration/graph.py
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver

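# PipelineState, the agent callables, route_decision, and the `redis` client
# are defined elsewhere in the project and must be in scope here.
graph = StateGraph(PipelineState)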
graph.add_node("supervisor", supervisor_agent)
graph.add_node("ingestion", ingestion_worker)
graph.add_node("quality", quality_worker)
graph.add_node("transform", transform_worker)

graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route_decision, {
    "ingest": "ingestion",
    "validate": "quality",
    "transform": "transform",
    "end": END,
})
graph.add_edge("ingestion", "supervisor")
graph.add_edge("quality", "supervisor")
graph.add_edge("transform", "supervisor")

compiled = graph.compile(
    checkpointer=RedisSaver(redis_client=redis),
    interrupt_before=["transform"],   # HITL gate before destructive writes
)
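
The conditional edges above expect `route_decision` to return one of the keys `"ingest"`, `"validate"`, `"transform"`, or `"end"`. The project's implementation is not shown in this ADR, so the following is only a rule-based sketch under assumed progress flags (`ingested`, `validated`, `transformed`, `error` are hypothetical fields); the real supervisor may well consult a model instead.

```python
from typing import Literal, Optional, TypedDict

Route = Literal["ingest", "validate", "transform", "end"]


class RoutingState(TypedDict):
    # Hypothetical progress flags; the real PipelineState may differ.
    ingested: bool
    validated: bool
    transformed: bool
    error: Optional[str]


def route_decision(state: RoutingState) -> Route:
    """Pick the branch key that the supervisor's conditional edges map to a node."""
    if state["error"] is not None:
        return "end"            # stop on any recorded error
    if not state["ingested"]:
        return "ingest"
    if not state["validated"]:
        return "validate"
    if not state["transformed"]:
        return "transform"
    return "end"


fresh: RoutingState = {
    "ingested": False, "validated": False, "transformed": False, "error": None,
}
print(route_decision(fresh))  # ingest
```

Keeping the router a pure function of state is what makes replay from a checkpoint reproducible: the same state always yields the same branch.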

The interface is a Protocol — nothing in the worker code knows it's running inside LangGraph, so the orchestrator can be swapped if the economics change.

from typing import Protocol

class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...
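
To illustrate the portability claim, here is a minimal sketch in which a worker that satisfies the Protocol is driven by a plain asyncio loop with no LangGraph import at all. The reduced `PipelineState`, the toy `ingestion_worker`, and the `run_workers` stand-in orchestrator are all hypothetical.

```python
import asyncio
from typing import Protocol, TypedDict


class PipelineState(TypedDict):
    # Reduced, hypothetical stand-in for the project's state type.
    run_id: str
    rows_ingested: int


class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...


async def ingestion_worker(state: PipelineState) -> PipelineState:
    """Toy worker: pretends to ingest 100 rows and returns the updated state."""
    return {**state, "rows_ingested": state["rows_ingested"] + 100}


async def run_workers(workers: list[WorkerAgent], state: PipelineState) -> PipelineState:
    """Stand-in orchestrator: call each worker in order, threading state through."""
    for worker in workers:
        state = await worker(state)
    return state


final = asyncio.run(
    run_workers(
        [ingestion_worker, ingestion_worker],
        {"run_id": "run-001", "rows_ingested": 0},
    )
)
print(final["rows_ingested"])  # 200
```

Any async callable with this signature type-checks as a `WorkerAgent`, so the same functions can be registered as graph nodes or called directly in unit tests.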

Tradeoffs we accept

| Lever | Alternative | Chosen |
|---|---|---|
| Dependency surface | Custom orchestrator (~3 eng-weeks build) | LangGraph (~1 day to wire); accept LangChain dependency |
| HITL semantics | Custom interrupt-then-resume protocol | interrupt_before=[...] as first-class primitive |
| Vendor risk | Build everything ourselves | LangGraph is open source (MIT-licensed); vendor-replaceable |
| Learning curve | CrewAI's declarative DSL | LangGraph's graph-construction API; steeper but more expressive |
| Determinism | AutoGen's conversational model | LangGraph's deterministic state machine; necessary for replay |

Consequences (positive)

  • M05's HITL feature is one config flag (interrupt_before=[...]), not a custom subsystem.
  • M05's TimeTravel works against the same RedisSaver checkpoint we already use for crash recovery — no separate infrastructure.
  • LangSmith tracing integrates natively (langgraph + langsmith from the same vendor).
  • Workers stay framework-agnostic via the Protocol — orchestrator swap is contained to src/orchestration/.

Consequences (negative)

  • LangChain dependency surface is large (~200 transitive deps). We accept it for v1; mitigated by requirements-core.txt excluding optional providers.
  • LangGraph's API is still evolving — minor versions occasionally break (we pin in requirements.txt).
  • Custom orchestrator would have ~30% lower memory overhead at scale; we accept that until run rate crosses ~50k/mo.

Reversal plan

Custom orchestrator swap: ~3 engineer-weeks. The Protocol shape means worker code is portable; the work is in:

  1. Reimplementing RedisSaver checkpoint format (~5 days)
  2. Implementing interrupt_before semantics + resume protocol (~5 days)
  3. Reimplementing conditional edges + state mutation guards (~3 days)
  4. Migration window — old + new orchestrators in parallel for 1 sprint (~2 days)
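
To give a feel for items 1 and 2, the sketch below is a toy, standard-library-only orchestrator that checkpoints state as JSON before each node, pauses at a gated node, and resumes or replays from a stored checkpoint. It is not LangGraph's implementation or checkpoint format; `ToyOrchestrator` and every name in it are hypothetical, and a production version would also need durable storage, versioned serialization, and concurrency control.

```python
import json
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]


class ToyOrchestrator:
    """Runs nodes in order, checkpointing before each one as a JSON string."""

    def __init__(self, nodes: list[tuple[str, Node]], interrupt_before: set[str]):
        self.nodes = nodes
        self.interrupt_before = interrupt_before
        self.checkpoints: list[str] = []   # stand-in for a Redis-backed store

    def run(self, state: State, start: int = 0, approved: bool = False) -> Optional[State]:
        """Return the final state, or None if paused at an interrupt gate."""
        for index in range(start, len(self.nodes)):
            name, node = self.nodes[index]
            # Persist the state and the index of the node about to run.
            self.checkpoints.append(json.dumps({"next": index, "state": state}))
            gate_open = approved and index == start   # approval covers the resumed node only
            if name in self.interrupt_before and not gate_open:
                return None                 # paused; state is already persisted
            state = node(state)
        return state

    def resume(self, checkpoint: int = -1) -> Optional[State]:
        """Resume (or replay) from a stored checkpoint after human approval."""
        saved = json.loads(self.checkpoints[checkpoint])
        return self.run(saved["state"], start=saved["next"], approved=True)


def ingest(state: State) -> State:
    return {**state, "rows": 3}


def transform(state: State) -> State:
    return {**state, "rows": state["rows"] * 2}


orch = ToyOrchestrator([("ingest", ingest), ("transform", transform)], {"transform"})
paused = orch.run({"run_id": "run-001"})   # stops before "transform"
final = orch.resume()                      # human approved; continue from last checkpoint
```

Replaying from the first checkpoint with `orch.resume(0)` re-runs `ingest` and pauses at the gate again, which is the behaviour time-travel replay relies on: nodes must be deterministic functions of the checkpointed state.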

CrewAI swap: ~1 engineer-week if we accept worse HITL ergonomics. Not worth it unless LangChain dependency surface becomes a deal-breaker.

Trigger conditions:

  • LangGraph licence change (currently MIT, so no current risk).
  • Run rate crosses ~50k/mo and orchestrator overhead becomes the bottleneck (we monitor via the Module 04 cost-tracking integration).
  • A LangGraph minor version breaks the Saver protocol contract a third time (we've had two minor breaks; a third triggers the rewrite).
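
The three triggers can be expressed as a small check that a monitoring job might run. This is a sketch only: `OrchestratorHealth`, its fields, and `reversal_triggers` are hypothetical names, and the thresholds are the ones stated in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrchestratorHealth:
    # All field names are hypothetical; values would come from monitoring.
    licence: str
    runs_per_month: int
    orchestrator_is_bottleneck: bool
    saver_contract_breaks: int


def reversal_triggers(
    health: OrchestratorHealth,
    accepted_licence: str,
    run_rate_limit: int = 50_000,
    max_saver_breaks: int = 3,
) -> list[str]:
    """Return the names of the trigger conditions that currently hold."""
    fired = []
    if health.licence != accepted_licence:
        fired.append("licence_change")
    if health.runs_per_month > run_rate_limit and health.orchestrator_is_bottleneck:
        fired.append("run_rate_bottleneck")
    if health.saver_contract_breaks >= max_saver_breaks:
        fired.append("saver_contract_breaks")
    return fired


today = OrchestratorHealth("permissive-oss", 12_000, False, 2)
print(reversal_triggers(today, accepted_licence="permissive-oss"))  # []
```

An empty list means no trigger has fired; any non-empty result would be the prompt to revisit this ADR.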

References

  • src/orchestration/graph.py — graph construction + compilation
  • src/agents/supervisor.py — supervisor node implementation
  • src/agents/workers.py — worker Protocol + 4 implementations
  • src/memory/checkpointing.py — RedisSaver wiring
  • ADR-002 (Redis-as-checkpoint depends on this)
  • ADR-003 (supervisor-worker topology depends on this)
  • LangGraph docs: https://langchain-ai.github.io/langgraph/