Context
The agent orchestrator is the most expensive decision in this build to reverse — every worker, every tool call, every checkpoint, every failure-recovery path runs through it. Pick wrong and the M05 hardening layer (HITL, time-travel, failure detection) becomes a rewrite instead of an addition.
Four families on the table at v1 design review:
- LangGraph (LangChain). Graph-based state machine. Native checkpointing. Streaming + interrupts as first-class primitives. Strong typing via TypedDict state. Heavy dependency surface (LangChain ecosystem).
- CrewAI. Role-based agents with delegation. Declarative + readable. Less flexible state model — checkpointing is bolted-on, not native. Weaker typing.
- AutoGen (Microsoft). Conversational multi-agent. Excellent for chat-driven workflows, weaker fit for data-pipeline orchestration where determinism + replay matter.
- Custom orchestrator. Minimal dependency, full control. ~3 engineer-weeks to replicate LangGraph's checkpoint + interrupt features at production quality.
Constraints driving the pick:
- HITL must be first-class. Module 05 needs `interrupt_before` semantics: pause mid-graph, persist state, resume after human approval. Building this on a non-graph orchestrator means reimplementing the checkpoint serializer.
- Time-travel replay must work. Module 05's TimeTravel feature replays a graph from any checkpoint. The orchestrator's checkpoint format has to be stable + serializable.
- State must be typed. Workers mutate shared state (run id, intermediate results, error context). Untyped dicts produce silent breakage.
- Streaming is required for observability. Module 04's LangSmith + OpenTelemetry tracing depends on per-node event emission.
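The typed-state constraint can be illustrated with a minimal sketch. The field names (`run_id`, `results`, `error`) and the `new_state` helper are assumptions taken from the examples in the bullet above, not the project's actual `PipelineState` definition.

```python
from typing import Any, Optional, TypedDict


class PipelineState(TypedDict):
    """Hypothetical shared state; field names are illustrative assumptions."""

    run_id: str
    results: dict[str, Any]  # intermediate results keyed by worker name
    error: Optional[str]     # error context, None while the run is healthy


def new_state(run_id: str) -> PipelineState:
    # Populate every key up front so workers never hit a missing field.
    return PipelineState(run_id=run_id, results={}, error=None)


state = new_state("run-001")
state["results"]["ingestion"] = {"rows": 120}
```

`TypedDict` is not enforced at runtime; the benefit comes from a static type checker flagging a misspelled key or a wrong value type before the code ships.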
Decision
Adopt LangGraph as the orchestrator. Build worker agents as LangGraph nodes; supervisor routes via conditional edges; state persists via MemorySaver (in-memory, dev) or RedisSaver (production, M03+).
# src/orchestration/graph.py
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver

# PipelineState, the agent callables, route_decision, and the `redis` client
# are defined elsewhere in the project.
graph = StateGraph(PipelineState)
graph.add_node("supervisor", supervisor_agent)
graph.add_node("ingestion", ingestion_worker)
graph.add_node("quality", quality_worker)
graph.add_node("transform", transform_worker)

graph.set_entry_point("supervisor")

# The supervisor picks the next worker; route_decision returns one of these keys.
graph.add_conditional_edges("supervisor", route_decision, {
    "ingest": "ingestion",
    "validate": "quality",
    "transform": "transform",
    "end": END,
})

# Every worker hands control back to the supervisor.
graph.add_edge("ingestion", "supervisor")
graph.add_edge("quality", "supervisor")
graph.add_edge("transform", "supervisor")

compiled = graph.compile(
    checkpointer=RedisSaver(redis_client=redis),
    interrupt_before=["transform"],  # HITL gate before destructive writes
)
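The routing function passed to `add_conditional_edges` is not shown above. One possible shape, as a sketch only: the state flags (`ingested`, `validated`, `transformed`, `error`) are hypothetical, and the real `route_decision` may use different inputs entirely. The only fixed part is that it returns one of the mapping keys (`"ingest"`, `"validate"`, `"transform"`, `"end"`).

```python
from typing import Optional, TypedDict


class PipelineState(TypedDict):
    """Hypothetical state shape; the real PipelineState is defined elsewhere."""

    ingested: bool
    validated: bool
    transformed: bool
    error: Optional[str]


def route_decision(state: PipelineState) -> str:
    # Return one of the keys in the conditional-edge mapping.
    if state["error"] is not None:
        return "end"  # stop on failure; recovery is out of scope for this sketch
    if not state["ingested"]:
        return "ingest"
    if not state["validated"]:
        return "validate"
    if not state["transformed"]:
        return "transform"
    return "end"
```

Keeping the router a pure function of state makes each routing choice reproducible from a checkpoint, which is what replay depends on.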
The interface is a Protocol — nothing in the worker code knows it's running inside LangGraph, so the orchestrator can be swapped if the economics change.
from typing import Protocol

class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...
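To show what framework-agnostic means in practice, here is a minimal sketch of a worker that satisfies the Protocol and runs with nothing but `asyncio`. The state fields, the worker body, and the `run_sequentially` helper are hypothetical stand-ins, not the project's code.

```python
import asyncio
from typing import Any, Protocol, TypedDict


class PipelineState(TypedDict):
    """Hypothetical state shape used only for this sketch."""

    run_id: str
    results: dict[str, Any]


class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...


async def ingestion_worker(state: PipelineState) -> PipelineState:
    # Return a new state instead of mutating the input in place.
    results = {**state["results"], "ingestion": {"rows": 3}}
    return {"run_id": state["run_id"], "results": results}


async def run_sequentially(
    workers: list[WorkerAgent], state: PipelineState
) -> PipelineState:
    # Stand-in for an orchestrator: call each worker in order.
    for worker in workers:
        state = await worker(state)
    return state


final_state = asyncio.run(
    run_sequentially([ingestion_worker], {"run_id": "run-001", "results": {}})
)
```

Any async callable with this signature matches the Protocol structurally, so the same worker can be registered as a graph node or driven by a plain loop like the one above.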
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Dependency surface | Custom orchestrator (~3 eng-weeks build) | LangGraph (~1 day to wire) — accept LangChain dependency |
| HITL semantics | Custom interrupt-then-resume protocol | interrupt_before=[...] as first-class primitive |
| Vendor risk | Build everything ourselves | LangGraph is open source (MIT licence); vendor-replaceable |
| Learning curve | CrewAI's declarative DSL | LangGraph's graph-construction API — steeper but more expressive |
| Determinism | AutoGen's conversational model | LangGraph's deterministic state-machine — necessary for replay |
Consequences (positive)
- M05's HITL feature is one config flag (`interrupt_before=[...]`), not a custom subsystem.
- M05's TimeTravel works against the same `RedisSaver` checkpoint we already use for crash recovery, so no separate infrastructure is needed.
- LangSmith tracing integrates natively (`langgraph` + `langsmith` from the same vendor).
- Workers stay framework-agnostic via the Protocol, so an orchestrator swap is contained to `src/orchestration/`.
Consequences (negative)
- LangChain dependency surface is large (~200 transitive deps). We accept it for v1; mitigated by `requirements-core.txt` excluding optional providers.
- LangGraph's API is still evolving, and minor versions occasionally break (we pin in `requirements.txt`).
- Custom orchestrator would have ~30% lower memory overhead at scale; we accept that until run rate crosses ~50k/mo.
Reversal plan
Custom orchestrator swap: ~3 engineer-weeks. The Protocol shape means worker code is portable; the work is in:
- Reimplementing `RedisSaver` checkpoint format (~5 days)
- Implementing `interrupt_before` semantics + resume protocol (~5 days)
- Reimplementing conditional edges + state mutation guards (~3 days)
- Migration window — old + new orchestrators in parallel for 1 sprint (~2 days)
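To give a feel for the first two items, here is a toy sketch of checkpoint-then-pause semantics using only the standard library. It is not LangGraph's implementation and not the `RedisSaver` format: the `MiniOrchestrator` class, its JSON checkpoint layout, and the in-memory store are all assumptions for illustration, and it ignores conditional edges, concurrency, and Redis.

```python
import json
from typing import Callable, Optional

State = dict
Node = Callable[[State], State]


class MiniOrchestrator:
    """Toy linear orchestrator with JSON checkpoints and an interrupt gate."""

    def __init__(self, nodes: list[tuple[str, Node]], interrupt_before: set[str]):
        self.nodes = nodes
        self.interrupt_before = interrupt_before
        self.checkpoints: dict[str, str] = {}  # run id -> JSON checkpoint

    def _save(self, run_id: str, index: int, state: State) -> None:
        # JSON keeps the checkpoint serializable and easy to inspect.
        self.checkpoints[run_id] = json.dumps({"next": index, "state": state})

    def run(
        self, run_id: str, state: Optional[State] = None, approved: bool = False
    ) -> Optional[State]:
        """Start a run (state given) or resume one (state=None).

        Returns None when the run pauses at an interrupt gate.
        """
        if state is None:
            saved = json.loads(self.checkpoints[run_id])
            index, state = saved["next"], saved["state"]
        else:
            index = 0
        while index < len(self.nodes):
            name, node = self.nodes[index]
            if name in self.interrupt_before and not approved:
                self._save(run_id, index, state)  # persist, then pause
                return None
            approved = False  # an approval covers a single gate only
            state = node(state)
            index += 1
            self._save(run_id, index, state)
        return state


orch = MiniOrchestrator(
    nodes=[
        ("ingestion", lambda s: {**s, "rows": 3}),
        ("transform", lambda s: {**s, "written": True}),
    ],
    interrupt_before={"transform"},
)
paused = orch.run("run-001", {"run_id": "run-001"})  # pauses before transform
resumed = orch.run("run-001", approved=True)         # human approved; resume
```

Even this toy version has to decide what a checkpoint contains and how approval is scoped, which suggests where the estimated days would go once durability and branching are added.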
CrewAI swap: ~1 engineer-week if we accept worse HITL ergonomics. Not worth it unless LangChain dependency surface becomes a deal-breaker.
Trigger conditions:
- LangGraph licence change (currently MIT, so no current risk).
- Run rate crosses ~50k/mo and orchestrator overhead becomes the bottleneck (we monitor via the Module 04 cost-tracking integration).
- A LangGraph minor version breaks the Saver protocol contract a third time (we've had two minor breaks; a third triggers the rewrite).
References
- `src/orchestration/graph.py`: graph construction + compilation
- `src/agents/supervisor.py`: supervisor node implementation
- `src/agents/workers.py`: worker Protocol + 4 implementations
- `src/memory/checkpointing.py`: RedisSaver wiring
- ADR-002 (Redis-as-checkpoint depends on this)
- ADR-003 (supervisor-worker topology depends on this)
- LangGraph docs: https://langchain-ai.github.io/langgraph/