# ADR-001 — LangGraph chosen over CrewAI / AutoGen / custom orchestrator

- **Status:** Accepted
- **Date:** 2026-04-04
- **Module:** 01 — Agent Foundation (carries through M03 multi-agent orchestration)
- **Stakeholders:** ML engineer, platform owner, eng manager

## Context

The agent orchestrator is the most expensive decision in this build to reverse — every worker, every tool call, every checkpoint, every failure-recovery path runs through it. Pick wrong and the M05 hardening layer (HITL, time-travel, failure detection) becomes a rewrite instead of an addition.

Four families on the table at v1 design review:

- **LangGraph (LangChain).** Graph-based state machine. Native checkpointing. Streaming + interrupts as first-class primitives. Strong typing via TypedDict state. Heavy dependency surface (LangChain ecosystem).
- **CrewAI.** Role-based agents with delegation. Declarative + readable. Less flexible state model — checkpointing is bolted-on, not native. Weaker typing.
- **AutoGen (Microsoft).** Conversational multi-agent. Excellent for chat-driven workflows, weaker fit for data-pipeline orchestration where determinism + replay matter.
- **Custom orchestrator.** Minimal dependency, full control. ~3 engineer-weeks to replicate LangGraph's checkpoint + interrupt features at production quality.

Constraints driving the pick:

1. **HITL must be first-class.** Module 05 needs `interrupt_before` semantics — pause mid-graph, persist state, resume after human approval. Building this on a non-graph orchestrator means reimplementing the checkpoint serializer.
2. **Time-travel replay must work.** Module 05's TimeTravel feature replays a graph from any checkpoint. The orchestrator's checkpoint format has to be stable + serializable.
3. **State must be typed.** Workers mutate shared state (run id, intermediate results, error context). Untyped dicts produce silent breakage.
4. **Streaming is required for observability.** Module 04's LangSmith + OpenTelemetry tracing depends on per-node event emission.
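
Constraint 3 can be made concrete with a typed state schema. The following is a minimal sketch, assuming `PipelineState` is a `TypedDict`. The field names (`run_id`, `results`, `error`) and the `missing_keys` helper are hypothetical, inferred from the "run id, intermediate results, error context" description above; the real schema may differ.

```python
from typing import Any, Optional, TypedDict


class ErrorContext(TypedDict):
    """Hypothetical shape for a worker failure record."""
    worker: str
    message: str


class PipelineState(TypedDict):
    """Assumed shared state; field names are illustrative, not the project's schema."""
    run_id: str
    results: dict[str, Any]          # intermediate results keyed by worker name
    error: Optional[ErrorContext]    # None while the run is healthy


def missing_keys(state: dict[str, Any]) -> set[str]:
    """Return the declared PipelineState keys that are absent from `state`."""
    return set(PipelineState.__required_keys__) - set(state)


state: PipelineState = {"run_id": "run-001", "results": {}, "error": None}
```

`TypedDict` is enforced by static checkers such as mypy or pyright, not at runtime, so a small guard like `missing_keys` is one way to catch a malformed state at a node boundary.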

## Decision

**Adopt LangGraph as the orchestrator.** Build worker agents as LangGraph nodes; supervisor routes via conditional edges; state persists via `MemorySaver` (in-memory, dev) or `RedisSaver` (production, M03+).

```python
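# src/orchestration/graph.py
# PipelineState, supervisor_agent, the three workers, route_decision, and the
# `redis` client are defined elsewhere in the codebase and omitted here.
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.redis import RedisSaver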

graph = StateGraph(PipelineState)
graph.add_node("supervisor", supervisor_agent)
graph.add_node("ingestion", ingestion_worker)
graph.add_node("quality", quality_worker)
graph.add_node("transform", transform_worker)

graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route_decision, {
    "ingest": "ingestion",
    "validate": "quality",
    "transform": "transform",
    "end": END,
})
graph.add_edge("ingestion", "supervisor")
graph.add_edge("quality", "supervisor")
graph.add_edge("transform", "supervisor")

compiled = graph.compile(
    checkpointer=RedisSaver(redis_client=redis),
    interrupt_before=["transform"],   # HITL gate before destructive writes
)
```
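
The conditional edges above depend on `route_decision` returning one of the four mapping keys. Its implementation is not shown in this ADR, so the following is only a sketch under stated assumptions: the `completed` and `error` state fields and the fixed `STEP_ORDER` are hypothetical, and the real supervisor may route dynamically.

```python
from typing import Any, Literal

Route = Literal["ingest", "validate", "transform", "end"]

# Assumed step order; the labels match the keys of the conditional-edge mapping.
STEP_ORDER: tuple[Route, ...] = ("ingest", "validate", "transform")


def route_decision(state: dict[str, Any]) -> Route:
    """Pick the next route label from the hypothetical `completed` and `error` fields."""
    if state.get("error") is not None:
        return "end"          # stop routing once a worker has reported a failure
    completed = state.get("completed", [])
    for step in STEP_ORDER:
        if step not in completed:
            return step       # first step that has not run yet
    return "end"
```

Keeping the router a pure function of state means the same checkpoint always yields the same next node, which is the property replay relies on.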

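The worker interface is a `typing.Protocol`: nothing in the worker code knows it is running inside LangGraph, so the orchestrator can be swapped if the economics change. Because `__call__` is `async`, the compiled graph has to be driven through LangGraph's async entry points (`ainvoke` / `astream`) rather than `invoke`.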

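```python
from typing import Protocol


class WorkerAgent(Protocol):
    # Any async callable that maps state to state satisfies this interface.
    async def __call__(self, state: PipelineState) -> PipelineState: ...
```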
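
To illustrate the portability claim, here is a minimal sketch of a worker that satisfies the Protocol and runs with no LangGraph import at all. The stand-in `PipelineState` fields, the toy `ingestion_worker` body, and the `run_sequentially` driver are all hypothetical and exist only for this example.

```python
import asyncio
from typing import Any, Protocol, TypedDict


class PipelineState(TypedDict):
    """Stand-in state for this example; field names are hypothetical."""
    run_id: str
    results: dict[str, Any]


class WorkerAgent(Protocol):
    async def __call__(self, state: PipelineState) -> PipelineState: ...


async def ingestion_worker(state: PipelineState) -> PipelineState:
    """Toy worker: records a row count and returns a new state object."""
    results = {**state["results"], "ingestion": {"rows": 3}}
    return {"run_id": state["run_id"], "results": results}


async def run_sequentially(workers: list[WorkerAgent], state: PipelineState) -> PipelineState:
    """Framework-free driver: await each worker in order, threading state through."""
    for worker in workers:
        state = await worker(state)
    return state


final = asyncio.run(
    run_sequentially([ingestion_worker], {"run_id": "run-001", "results": {}})
)
```

The same `ingestion_worker` could be passed to `graph.add_node(...)` unchanged, which is the point of keeping the Protocol free of framework types.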

## Tradeoffs we accept

| Lever              | Alternative                              | Chosen                                                           |
| ------------------ | ---------------------------------------- | ---------------------------------------------------------------- |
| Dependency surface | Custom orchestrator (~3 eng-weeks build) | LangGraph (~1 day to wire) — accept LangChain dependency         |
| HITL semantics     | Custom interrupt-then-resume protocol    | `interrupt_before=[...]` as first-class primitive                |
| Vendor risk        | Build everything ourselves               | LangGraph is open source (MIT licence); vendor-replaceable       |
| Learning curve     | CrewAI's declarative DSL                 | LangGraph's graph-construction API — steeper but more expressive |
| Determinism        | AutoGen's conversational model           | LangGraph's deterministic state-machine — necessary for replay   |

## Consequences (positive)

- M05's HITL feature is one config flag (`interrupt_before=[...]`), not a custom subsystem.
- M05's TimeTravel works against the same `RedisSaver` checkpoint we already use for crash recovery — no separate infrastructure.
- LangSmith tracing integrates natively (`langgraph` + `langsmith` from the same vendor).
- Workers stay framework-agnostic via the Protocol — orchestrator swap is contained to `src/orchestration/`.

## Consequences (negative)

- LangChain dependency surface is large (~200 transitive deps). We accept it for v1; mitigated by `requirements-core.txt` excluding optional providers.
- LangGraph's API is still evolving, and minor versions occasionally introduce breaking changes (we pin in `requirements.txt`).
- A custom orchestrator would have ~30% lower memory overhead at scale; we accept that cost until the run rate crosses ~50k runs/mo.

## Reversal plan

**Custom orchestrator swap:** ~3 engineer-weeks. The Protocol shape means worker code is portable; the work is in:

1. Reimplementing `RedisSaver` checkpoint format (~5 days)
2. Implementing `interrupt_before` semantics + resume protocol (~5 days)
3. Reimplementing conditional edges + state mutation guards (~3 days)
4. Migration window: old + new orchestrators run in parallel for 1 sprint (~2 days of engineering effort)
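
To give a feel for what steps 1 and 2 involve, the following is a toy sketch of checkpoint-then-interrupt semantics in a hand-rolled orchestrator. It is not a drop-in replacement for LangGraph: `ToyOrchestrator`, its linear node list, the in-memory `checkpoints` dict (standing in for Redis), and the `approved` flag are all hypothetical simplifications.

```python
import json
from typing import Any, Callable, Optional

State = dict[str, Any]
Node = Callable[[State], State]


class ToyOrchestrator:
    """Runs nodes in a fixed order, checkpointing before each one (illustrative only)."""

    def __init__(self, nodes: list[tuple[str, Node]], interrupt_before: set[str]) -> None:
        self.nodes = nodes
        self.interrupt_before = interrupt_before
        self.checkpoints: dict[str, str] = {}   # run_id -> JSON blob; stands in for Redis

    def _save(self, run_id: str, index: int, state: State) -> None:
        self.checkpoints[run_id] = json.dumps({"next_index": index, "state": state})

    def run(
        self, run_id: str, state: Optional[State] = None, approved: bool = False
    ) -> tuple[str, State]:
        """Start a run (state given) or resume one (state=None) from its checkpoint."""
        if state is None:
            saved = json.loads(self.checkpoints[run_id])
            index, state = saved["next_index"], saved["state"]
        else:
            index = 0
        while index < len(self.nodes):
            name, node = self.nodes[index]
            self._save(run_id, index, state)
            if name in self.interrupt_before and not approved:
                return "paused", state      # a human must approve before `name` runs
            state = node(state)
            approved = False                # an approval covers one gate only
            index += 1
        self._save(run_id, index, state)
        return "done", state


orch = ToyOrchestrator(
    nodes=[
        ("ingestion", lambda s: {**s, "rows": 3}),
        ("transform", lambda s: {**s, "written": True}),
    ],
    interrupt_before={"transform"},
)
status, state = orch.run("run-001", {"rows": 0})
resumed_status, resumed_state = orch.run("run-001", approved=True)
```

The first call stops before `transform` with the state persisted; the second call reloads the checkpoint and finishes the run. A production version would additionally need a durable store, conditional routing, and a serializer for state that is not plain JSON.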

**CrewAI swap:** ~1 engineer-week if we accept worse HITL ergonomics. Not worth it unless LangChain dependency surface becomes a deal-breaker.

**Trigger conditions:**

- LangGraph licence change (currently MIT, so no current risk).
- Run rate crosses ~50k/mo and orchestrator overhead becomes the bottleneck (we monitor via the Module 04 cost-tracking integration).
- A LangGraph minor version breaks the Saver protocol contract a third time (we've had two minor breaks; a third triggers the rewrite).
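
The trigger conditions can be written down as an explicit check so they are reviewed against the same thresholds each time. This is only a sketch: `OrchestratorHealth`, its field names, and `reversal_triggers` are hypothetical, and how "orchestrator overhead is the bottleneck" gets measured is left as an assumed boolean input.

```python
from dataclasses import dataclass

RUN_RATE_THRESHOLD = 50_000      # runs per month, from this ADR
SAVER_BREAK_THRESHOLD = 3        # the third break triggers the rewrite


@dataclass(frozen=True)
class OrchestratorHealth:
    """Hypothetical inputs; none of these names exist in the codebase."""
    licence_changed: bool
    runs_per_month: int
    orchestrator_is_bottleneck: bool
    saver_contract_breaks: int


def reversal_triggers(health: OrchestratorHealth) -> list[str]:
    """Return the identifiers of the trigger conditions that currently hold."""
    fired: list[str] = []
    if health.licence_changed:
        fired.append("licence_change")
    if health.runs_per_month > RUN_RATE_THRESHOLD and health.orchestrator_is_bottleneck:
        fired.append("run_rate_bottleneck")
    if health.saver_contract_breaks >= SAVER_BREAK_THRESHOLD:
        fired.append("saver_contract_breaks")
    return fired
```

With the figures stated in this ADR (no licence change, two Saver breaks so far, run rate below the threshold), the function returns an empty list.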

## References

- `src/orchestration/graph.py` — graph construction + compilation
- `src/agents/supervisor.py` — supervisor node implementation
- `src/agents/workers.py` — worker Protocol + 4 implementations
- `src/memory/checkpointing.py` — RedisSaver wiring
- ADR-002 (Redis-as-checkpoint depends on this)
- ADR-003 (supervisor-worker topology depends on this)
- LangGraph docs: https://langchain-ai.github.io/langgraph/
