ai-de.net/Projects/P13 · Agentic data pipeline — LangGraph supervisor + HITL + ADRs

Last updated 2026-05-22 · By AI-DE Engineering Team

EXPERT-tier · PRO unlocks Modules 01-03 · AI & vectors track · P13

Build the agent execution layer of modern data platforms

Ship a production agentic pipeline with a LangGraph supervisor + 4 worker agents (Ingestion / Quality / Transform / Loading), Redis checkpointing for time-travel replay, HITL via interrupt_before with Slack actionable approval, ToolCallGuard budget enforcement, RestrictedPython sandbox, and 5 committed ADRs. Modules 01-03 unlock with PRO; the platform unlocks with EXPERT.

Timeline

17-18 hours

Difficulty

Senior+

Stack

LangGraph · FastAPI · Postgres · Redis · LangSmith · Slack Bolt

See EXPERT benefits

The agent-platform system-design portfolio piece for staff AI roles — 5 committed ADRs (one Deprecated documenting a real RBAC reversal), a working LangGraph supervisor-worker pipeline, and a cost model that defends the cascade vs Sonnet-only baseline.

By the end you will have wired

LangGraph supervisor + 4 worker agents with conditional-edge routing and Redis checkpointing
RBAC-aware ToolRegistry with per-agent scoped views (post-ADR-005 reversal)
Production observability: Prometheus + OpenTelemetry + LangSmith + per-agent cost attribution
HITL approval flow via interrupt_before + Slack actionable buttons (Approve / Deny / Escalate)
FailureDetector + ToolCallGuard + ContextWindowManager + 24h TimeTravel checkpoint replay
5 ADRs (one Deprecated) + cost-model CSV + RestrictedPython sandbox + GitHub PR bot

PREREQ · SENIOR+ · Built for engineers shipping agents in production. You should be comfortable with Python services and async / asyncio, plus at least one of: LangGraph or an equivalent agent framework, FastAPI, or production observability. Not a “what is an agent” course.

agentic_pipeline.platform · 6 modules · supervisor + 4 workers armed · LangGraph + Redis checkpoint

interrupt_before ✓

Inputs

postgres · QueryDatabaseTool · pooled
rest_api · APIClientTool · retry + rate-limit
s3 / files · FileProcessorTool · chunked read
kafka · stream consumer · idempotent
Tool registry: see ADR-005 (Deprecated)

Agents

Supervisor · LangGraph routing · weighted conf
IngestionWorker · RBAC-scoped tool view
QualityWorker · validation + anomaly
TransformWorker · schema-evolution-safe
LoadingWorker · idempotent writes
Supervisor-worker topology: see ADR-003

Memory

RedisSaver · checkpoint · 24h TTL
SemanticMemory · embedding + similarity
TimeTravel · replay from any checkpoint
AuditLog · tool-call + approval trail
Two-tier persistence: see ADR-002

Surfaces

interrupt_before · HITL gate · Slack approve
LangSmith trace · per-agent + per-tool
Cost tracker · USD per run · per agent
GitHub PR bot · PyGithub · auto-PR
HITL approval flow: see ADR-004

# Judge cascade — 35% cost cut

Haiku handles 70% of worker LLM calls (USD 0.80/M in)

Sonnet only on supervisor routing + escalation paths

Cascade saves ~USD 75/mo at 5k runs/mo

→ ~USD 0.036 per run at optimized load

# HITL gate — interrupt_before

Destructive writes pause the LangGraph mid-run

State persists in Redis (24h TTL) — no re-execution on resume

Slack actionable buttons (Approve / Deny / Escalate)

→ graph.aupdate_state() resumes from exact checkpoint
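The pause, persist, resume flow described above can be illustrated without any framework. What follows is a minimal sketch under stated assumptions, not the project's LangGraph code: `CheckpointStore`, `start`, `resume`, the step names, and the in-memory dict standing in for Redis are all hypothetical. It only shows why a resumed run does not re-execute the steps that ran before the gate.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

TTL_SECONDS = 24 * 60 * 60  # mirrors the 24h checkpoint TTL described above
STEPS = ["ingest", "validate", "transform", "load"]
GATED = {"load"}  # assumption: only the destructive write needs approval


@dataclass
class CheckpointStore:
    """In-memory stand-in for a Redis-backed checkpoint store (hypothetical)."""

    _data: dict = field(default_factory=dict)

    def save(self, run_id: str, state: dict, now: Optional[float] = None) -> None:
        saved_at = time.time() if now is None else now
        self._data[run_id] = (dict(state), saved_at)

    def load(self, run_id: str, now: Optional[float] = None) -> dict:
        state, saved_at = self._data[run_id]
        current = time.time() if now is None else now
        if current - saved_at > TTL_SECONDS:
            raise TimeoutError(f"checkpoint for {run_id!r} expired")
        return dict(state)


def _advance(run_id: str, state: dict, store: CheckpointStore, executed: list) -> str:
    """Run remaining steps; checkpoint and pause before an unapproved gated step."""
    while state["next_index"] < len(STEPS):
        step = STEPS[state["next_index"]]
        if step in GATED and not state["approved"]:
            store.save(run_id, state)
            return "paused"
        executed.append(step)
        state["next_index"] += 1
    return "done"


def start(run_id: str, store: CheckpointStore, executed: list) -> str:
    """Begin a fresh run from the first step."""
    return _advance(run_id, {"next_index": 0, "approved": False}, store, executed)


def resume(run_id: str, decision: str, store: CheckpointStore, executed: list) -> str:
    """Reload the checkpoint and continue from the gate; earlier steps are not re-run."""
    state = store.load(run_id)
    if decision != "approve":
        return "rejected"
    state["approved"] = True
    return _advance(run_id, state, store, executed)
```

In this sketch the approval decision is a plain string; in the build described here it would arrive from a Slack button and the state would live in Redis.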

4 + 1 · 4 workers + 1 supervisor
5 ADRs · committed in starter kit
−35% · cost vs Sonnet-only baseline

Curriculum · 6 modules · 17-18 hours · 3 phases

Modules 01-03 unlock with PRO. The full platform with EXPERT.

Modules 01-03 (~10h) ship a complete working multi-agent pipeline — foundation, 4 production tools, and a LangGraph supervisor + worker system with Redis checkpointing. Included with PRO. Modules 04-06 (~7.5h additional) layer on production observability, hardening (HITL + failure detection + time-travel), and a multi-tenant platform-design capstone. Unlock with EXPERT.

P13 · 6 modules · 17-18 hours · 55 lessons

Free preview EXPERT required

M01 · Agent Foundation

Project scaffolding with LangGraph, base agent class with logging hooks, BaseDataTool abstraction, PipelineState TypedDict, and a Pydantic Settings configuration layer. The honest baseline before any working agent exists.

Phase 1 · 3h · 8 lessons · PRO TIER

Unlock with PRO →

M02 · Tool Implementation

Four production tools — QueryDatabaseTool (connection pooling + SQL safety), APIClientTool (retry + rate-limit + idempotency), FileProcessorTool (chunked read + format detect), ValidationTool (quality rules) — wired through an RBAC-aware registry per ADR-005's reversal.

Phase 1 · 3h · 9 lessons · PRO TIER

Unlock with PRO →

M03 · Multi-Agent Pipeline

LangGraph StateGraph with supervisor + 4 workers, conditional-edge routing, semantic memory store, RedisSaver checkpointing, RBAC scoping at the registry, and end-to-end execution. The PRO finish line: a working agentic pipeline.

Phase 1 · 4h · 11 lessons · PRO TIER

Unlock with PRO →

M04 · Production Ready

Prometheus metrics + OpenTelemetry tracing + LangSmith integration + FastAPI + per-agent cost tracking + Docker + Kubernetes manifests + task queue for horizontal scaling + agent evaluation framework.

Phase 2 · 2h · 8 lessons · EXPERT TIER

Unlock with EXPERT →

M05 · Harden the Pipeline

HITL via LangGraph interrupt_before + Slack actionable buttons, FailureDetector (loop + cascade), ToolCallGuard (budget enforcement), ContextWindowManager (rolling window + summarization), TimeTravel (24h Redis checkpoint replay), system-prompt tool-use rules.

Phase 2 · 2.5h · 9 lessons · EXPERT TIER

Unlock with EXPERT →

M06 · Design and Extend an AI Data Platform

Unstructured extraction (pdfplumber + BeautifulSoup), dynamic schema inference (Pandas + Pydantic), RestrictedPython sandbox, multi-tenant TenantContext + RBAC, feedback-driven improvement loop, GitHub PR bot (PyGithub), SLA YAML, and a capstone platform-design exercise.

Phase 3 · 3h · 10 lessons · EXPERT TIER

Unlock with EXPERT →

Modules 01-03 with PRO ($29/mo) · Modules 04-06 with EXPERT ($79/mo)

See plans →

Backed by curriculum

Agentic Workflows

7 modules · 24 hours · LangGraph · Tool-use · HITL · Multi-agent

Open curriculum

This curriculum is the foundation for the project, not a sales add-on. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Foundation. Production. Platform.

Each phase ends with a tagged release, a passing integration test, and a passing red-team failure-injection run. No ambiguity about where you are.

01 · Foundation (Modules 01-03) · ~10h

Working multi-agent pipeline running locally. LangGraph supervisor + 4 worker agents + 4 tools + Redis checkpointing, end-to-end execution on the seeded sample data.

✓ RBAC-aware ToolRegistry with 4 tools (post-ADR-005 reversal)
✓ LangGraph supervisor + workers with conditional-edge routing
✓ RedisSaver checkpointing + recovery on restart

02 · Production (Module 04) · ~2h

Observable, deployable, costable. Prometheus + OTel + LangSmith tracing, per-agent cost attribution, FastAPI service, Docker + Kubernetes manifests, agent eval framework.

✓ Prometheus + OpenTelemetry + LangSmith integration
✓ Per-agent + per-tool cost attribution
✓ Docker + Kubernetes deploy + agent eval suite

03 · Platform (Modules 05-06) · ~5.5h

Hardening + multi-tenant platform design. HITL approval flow, failure detection + recovery, time-travel replay, multi-tenant TenantContext, RestrictedPython sandbox, capstone design.

✓ HITL via interrupt_before + Slack actionable buttons (ADR-004)
✓ FailureDetector + ToolCallGuard + TimeTravel replay
✓ Multi-tenant + RestrictedPython sandbox + GitHub PR bot
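Loop detection, one of the two failure classes the FailureDetector is described as covering, can be sketched as a sliding-window counter over tool-call signatures. This is an illustrative sketch only: the class name `LoopDetector`, the window size, and the repeat threshold are assumptions, and the starter kit's detector may work differently.

```python
import json
from collections import Counter, deque


class LoopDetector:
    """Flags a loop when the same (tool, args) call repeats within a sliding window."""

    def __init__(self, window: int = 10, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        # deque(maxlen=...) drops the oldest call once the window is full
        self._recent: deque = deque(maxlen=window)

    def record(self, tool_name: str, args: dict) -> bool:
        """Record one call; return True if it has now repeated max_repeats times."""
        # sort_keys makes the signature independent of argument order
        signature = (tool_name, json.dumps(args, sort_keys=True))
        self._recent.append(signature)
        return Counter(self._recent)[signature] >= self.max_repeats
```

A caller would typically stop or escalate the run when `record` returns True, rather than letting the agent continue until its token budget is exhausted.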

Project setup · 10 minutes

One command. Local LangGraph + Postgres + Redis + LangSmith.

What lives in the repo

You get the real agent platform on day one — LangGraph supervisor + 4 worker agents, RBAC-aware ToolRegistry with 4 tools, RedisSaver checkpointing, FastAPI gateway, Prometheus + OpenTelemetry + LangSmith instrumentation, Docker + Kubernetes manifests, plus the M05 hardening (HITL / FailureDetector / TimeTravel) and M06 platform features.

src/agents/ — supervisor + 4 workers + HITL + FailureDetector + ContextManager
src/tools/ — 4 production tools + RBAC-aware registry (post-ADR-005)
src/orchestration/ + src/memory/ — LangGraph StateGraph + RedisSaver + semantic memory
src/observability/ — Prometheus + OTel + LangSmith + cost-tracking + Slack alerts
src/api/ + src/scaling/ + Dockerfile + k8s/ — FastAPI + task queue + container + Kubernetes manifests
docs/adr/ + docs/cost-model/ — 5 committed ADRs (one Deprecated) + the runnable cost-model CSV

Download · Starter Kit · 91 files · 104 KB

Agentic Data Pipeline Starter Kit

Pre-built LangGraph supervisor + 4 worker agents + RBAC tool registry + Redis checkpointing + FastAPI + Docker + Kubernetes. Now bundled: 5 ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 91 files · ADRs + cost model bundled · last updated 2026-05-08

~/projects/agentic-data-pipeline — zsh

1. Unzip and start the platform

$ unzip agentic-data-pipeline-starter.zip

$ cd agentic-data-pipeline-starter && cp .env.example .env

$ docker compose up -d

2. Run the demo end-to-end

$ export ANTHROPIC_API_KEY=...

$ python scripts/demo.py \
    --source postgres --checkpoint redis --trace langsmith

3. Open ADR-001 + the cost model

$ less docs/adr/001-langgraph-vs-crewai-vs-custom.md

$ open docs/cost-model/agentic-data-pipeline-cost-model.csv

4. Run the failure-injection suite

$ pytest tests/test_failure_detector.py tests/test_time_travel.py -v

4 + 1 · workers + supervisor
91 · files in starter kit
5 · ADRs (incl. 1 Deprecated)
sample run traces

Production hardening

The same agent demo — but built for the auditable case.

Most agent tutorials show you a single LLM call in a loop. This shows what changes when the supervisor decides for 4 workers, every tool call hits an audit log, the cost model is defensible to a CFO, and a compliance reviewer asks which agent decided that.

Notebook agent demo · What most teams ship

Orchestration: While-loop calling an LLM
Tool access: Every agent can call every tool
Failure mode: Loop until token budget; pray
HITL: Manual code edit, redeploy
Cost: Whatever the bill says next month
Replay: Re-run from scratch on every retry

Your agent platform · Modules 04–06

✓ Orchestration: LangGraph StateGraph + supervisor-worker conditional edges (ADR-001 + ADR-003)
✓ Tool access: ScopedToolView per agent role (RBAC at registry, post-ADR-005 reversal)
✓ Failure mode: FailureDetector + ToolCallGuard + budget cap; cascade detection in M05
✓ HITL: interrupt_before + Slack approve / deny / escalate (ADR-004)
✓ Cost: Per-agent token attribution + judge cascade −35%; CSV in docs/cost-model/
✓ Replay: TimeTravel reads any checkpoint in last 24h via RedisSaver (ADR-002)
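The budget cap in the failure-mode row can be pictured as a small per-agent counter that refuses a call once a limit would be crossed. The sketch below is a guess at the shape of such a guard, not the starter kit's `ToolCallGuard`: the method name `charge`, the exception `BudgetExceeded`, and the two limits are assumptions made for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent would exceed its call or cost budget."""


class ToolCallGuard:
    """Tracks calls and spend per agent and rejects anything over budget."""

    def __init__(self, max_calls: int, max_cost_usd: float) -> None:
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self._calls: dict = {}
        self._spent: dict = {}

    def charge(self, agent: str, cost_usd: float) -> None:
        """Account for one tool call; raise before recording if it breaks the budget."""
        calls = self._calls.get(agent, 0) + 1
        spent = self._spent.get(agent, 0.0) + cost_usd
        if calls > self.max_calls or spent > self.max_cost_usd:
            raise BudgetExceeded(
                f"{agent}: {calls} calls / ${spent:.2f} exceeds "
                f"{self.max_calls} calls / ${self.max_cost_usd:.2f}"
            )
        # Only commit the new totals when the call is allowed
        self._calls[agent] = calls
        self._spent[agent] = spent

    def spent(self, agent: str) -> float:
        """Total spend recorded so far for one agent."""
        return self._spent.get(agent, 0.0)
```

Checking before committing means a rejected call leaves the counters untouched, so one runaway agent cannot consume another agent's budget.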

EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build. They include a Deprecated ADR that documents the v0 ToolRegistry → RBAC reversal after a Quality-agent overreach incident. It is the kind of document that travels with you to your next role.

ADR-001 · Accepted

LangGraph chosen over CrewAI / AutoGen / custom orchestrator

Context: HITL + time-travel replay + typed state + streaming all need first-class primitives; pipeline / mesh / chain alternatives don't fit
Decision: Adopt LangGraph as the orchestrator. Workers wrap a Protocol (WorkerAgent), which keeps them orchestrator-replaceable
Tradeoff: Heavy LangChain dependency surface (~200 transitive deps); minor versions occasionally break Saver protocol
Reversal: Custom orchestrator swap is ~3 engineer-weeks; Protocol shape keeps workers portable
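The Protocol idea in ADR-001 can be shown with Python's structural typing. This is a minimal sketch: the ADR names the `WorkerAgent` Protocol but not its members, so the `run(state) -> dict` method, the `QualityWorker` body, and the `execute` helper are assumptions chosen for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class WorkerAgent(Protocol):
    """Structural contract: anything with a matching run() counts as a worker."""

    def run(self, state: dict) -> dict:
        ...


class QualityWorker:
    """Does not inherit from WorkerAgent; it satisfies the Protocol by shape."""

    def run(self, state: dict) -> dict:
        rows = state.get("rows", [])
        valid = [row for row in rows if row.get("id") is not None]
        return {**state, "valid_rows": valid}


def execute(worker: WorkerAgent, state: dict) -> dict:
    """Orchestrator-side call site that depends only on the Protocol."""
    if not isinstance(worker, WorkerAgent):
        raise TypeError(f"{type(worker).__name__} does not implement WorkerAgent")
    return worker.run(state)
```

Because the call site depends on the Protocol rather than on a framework base class, swapping the orchestrator leaves worker code untouched, which is the portability argument the ADR makes.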

ADR-002 · Accepted

Redis for orchestrator checkpoints; Postgres only for business data

Context: PostgresSaver caused row-lock contention with M04 dashboards (p95=4.2s → 0.8s after split) + LangGraph version bumps required Postgres migrations
Decision: Two-tier persistence, with RedisSaver for checkpoints (24h TTL) and Postgres for business data
Tradeoff: Two stores to operate (~$35/mo extra) and 24h replay window vs indefinite
Reversal: Postgres-only re-introduction is ~3 engineer-days when run rate < 500/day

ADR-003 · Accepted

Hierarchical supervisor-worker topology, not peer-to-peer agents

Context: Per-agent cost attribution + deterministic replay + linear LLM cost scaling all need a single decision-maker
Decision: Supervisor + 4 workers; supervisor holds routing logic, workers execute tools and return state via add_conditional_edges
Tradeoff: Supervisor on critical path of every step; per-step concurrency lost in exchange for auditability
Reversal: Peer-to-peer mesh is ~2 engineer-weeks if real-time streaming becomes a requirement
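The single-decision-maker topology in ADR-003 can be sketched as one routing function that inspects state and names the next worker. This is a framework-free illustration, not LangGraph code: the worker bodies, the state keys, and the `run_pipeline` loop are hypothetical, and in the actual build the routing function would be attached through conditional edges.

```python
from typing import Callable, Dict, List, Tuple

END = "END"


def ingestion(state: dict) -> dict:
    return {**state, "rows": [{"id": 1, "amount": "10"}, {"id": 2, "amount": "5"}]}


def quality(state: dict) -> dict:
    return {**state, "valid": all(row["id"] is not None for row in state["rows"])}


def transform(state: dict) -> dict:
    rows = [{**row, "amount": int(row["amount"])} for row in state["rows"]]
    return {**state, "transformed": rows}


def loading(state: dict) -> dict:
    return {**state, "loaded": len(state["transformed"])}


WORKERS: Dict[str, Callable[[dict], dict]] = {
    "ingestion": ingestion,
    "quality": quality,
    "transform": transform,
    "loading": loading,
}


def supervisor(state: dict) -> str:
    """Only the supervisor decides what runs next; workers never call each other."""
    if "rows" not in state:
        return "ingestion"
    if "valid" not in state:
        return "quality"
    if not state["valid"]:
        return END  # halt instead of transforming bad data
    if "transformed" not in state:
        return "transform"
    if "loaded" not in state:
        return "loading"
    return END


def run_pipeline(state: dict) -> Tuple[dict, List[str]]:
    """Alternate supervisor decisions and worker steps, recording the route taken."""
    trace: List[str] = []
    while (next_worker := supervisor(state)) != END:
        trace.append(next_worker)
        state = WORKERS[next_worker](state)
    return state, trace
```

The recorded trace is what makes the tradeoff in the ADR concrete: every step passes through the supervisor, so steps run one at a time, but the route is fully reconstructable afterwards.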

ADR-004 · Accepted

HITL via LangGraph `interrupt_before` + Slack actionable buttons

Context: Pause + persist + resume + audit + timeout-then-reject are all required; message-queue + DB-flag patterns add too much infra; native graph pause wins
Decision: interrupt_before=[...] at compile time + Slack Approve / Deny / Escalate buttons → graph.aupdate_state resumes
Tradeoff: Slack on critical path; 24h hard timeout (ties to Redis TTL); approve/deny/escalate only, no partial state edit
Reversal: SQS-backed approver web UI is ~2 engineer-weeks; LangGraph contract unchanged

ADR-005 · Deprecated

Single global ToolRegistry without RBAC scoping (v0)

Context: Day-2 MVP: all agents shared one registry; nothing prevented Quality agent from calling Transform tools
Decision: Reverted in M03: added ScopedToolView with per-agent-role scopes (default-deny) at the registry layer
Why reversed: Quality agent silently applied a schema migration during a validation run on 2026-04-12, which broke the M04 dashboard for 23 min
Replaced by: RBAC-aware registry; ~4.5 engineer-day reversal cost
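A default-deny scoped view of the kind ADR-005's replacement describes might look like the sketch below. Only the names `ToolRegistry` and `ScopedToolView` come from the text above; the methods (`register`, `grant`, `view_for`, `call`) and the example tools are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, FrozenSet, List


class ToolRegistry:
    """Holds every tool plus the per-role allow-lists; roles start with no access."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._scopes: Dict[str, FrozenSet[str]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def grant(self, role: str, *tool_names: str) -> None:
        unknown = set(tool_names) - set(self._tools)
        if unknown:
            raise KeyError(f"unknown tools: {sorted(unknown)}")
        self._scopes[role] = self.allowed(role) | frozenset(tool_names)

    def allowed(self, role: str) -> FrozenSet[str]:
        # Default-deny: a role that was never granted anything gets an empty set
        return self._scopes.get(role, frozenset())

    def invoke(self, role: str, tool_name: str, **kwargs: Any) -> Any:
        if tool_name not in self.allowed(role):
            raise PermissionError(f"role {role!r} may not call {tool_name!r}")
        return self._tools[tool_name](**kwargs)

    def view_for(self, role: str) -> "ScopedToolView":
        return ScopedToolView(self, role)


class ScopedToolView:
    """What an agent receives instead of the registry itself."""

    def __init__(self, registry: ToolRegistry, role: str) -> None:
        self._registry = registry
        self._role = role

    def names(self) -> List[str]:
        return sorted(self._registry.allowed(self._role))

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        return self._registry.invoke(self._role, tool_name, **kwargs)
```

With this shape, the incident described in the ADR would surface as a `PermissionError` at the registry layer instead of a silently applied migration.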

EXPERT-only · cost model

Read the FinOps story, not just the latency one.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a load of 5,000 agent runs per month using Anthropic, AWS RDS, ElastiCache, and LangSmith list prices, with model-cascade and reserved-instance levers wired up. It is the version you’ll defend to a CFO.

Component · Baseline / mo · Optimized / mo · Delta

Anthropic Claude Sonnet (planner + supervisor) · $144 · $43 · −$101
100% baseline → 30% optimized · 24M in / 6M out tok/mo

Anthropic Claude Haiku (worker agents) · — · $26 · —
70% of mix in optimized · ~17M in / 4M out tok/mo

AWS RDS Postgres (db.t4g.medium) · $50 · $35 · −$15
100GB gp3 · business data store (per ADR-002)

AWS ElastiCache Redis (cache.t4g.small) · $35 · $26 · −$9
checkpoint + recovery state (per ADR-002)

LangSmith observability (Plus tier) · $39 · $39 · —
30k traces/mo budget · agent + tool spans

GitHub Actions + container registry · $12 · $12 · —
~150 PR runs/mo × 6 min × Linux runners + GHCR

Total · 5k runs/mo · $280 · $181 · −$99 (−35%)
~$0.056 per run at baseline · ~$0.036 optimized
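The totals in the table can be re-derived in a few lines, which is a useful sanity check before defending the numbers. The sketch below hard-codes the monthly figures from the table; treating Haiku as absent from the baseline and LangSmith and GitHub Actions as unchanged between columns are assumptions, chosen because they are the only values that reproduce the stated $280 and $181 totals.

```python
RUNS_PER_MONTH = 5_000

# (component, baseline USD/mo, optimized USD/mo), taken from the table above
ROWS = [
    ("Claude Sonnet", 144, 43),
    ("Claude Haiku", 0, 26),            # assumption: not used in the Sonnet-only baseline
    ("RDS Postgres", 50, 35),
    ("ElastiCache Redis", 35, 26),
    ("LangSmith", 39, 39),              # assumption: unchanged between columns
    ("GitHub Actions + GHCR", 12, 12),  # assumption: unchanged between columns
]

baseline = sum(base for _, base, _ in ROWS)
optimized = sum(opt for _, _, opt in ROWS)
delta = optimized - baseline
delta_pct = 100 * delta / baseline

per_run_baseline = baseline / RUNS_PER_MONTH
per_run_optimized = optimized / RUNS_PER_MONTH

if __name__ == "__main__":
    print(f"baseline  ${baseline}/mo  (${per_run_baseline:.3f} per run)")
    print(f"optimized ${optimized}/mo  (${per_run_optimized:.3f} per run)")
    print(f"delta     ${delta}/mo  ({delta_pct:.0f}%)")
```

The same arithmetic shows where the saving comes from: the Sonnet reduction (−$101) net of the added Haiku spend (+$26) accounts for $75 of the $99, and the two reserved-instance rows account for the remaining $24.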

Optimization levers

Model cascade (Haiku for workers · Sonnet for supervisor)

Route 70% of worker LLM calls to Haiku ($0.80/M in). Supervisor + escalation paths stay on Sonnet for routing quality. ADR-001 + ADR-003.

−$75 / mo · −27%

Idempotent tool-call cache

SHA-256 cache on (tool_name, args) for read-only tools. Redis-backed, 1h TTL on quality-sensitive, 24h on static reads. ~18% hit rate on regression suites.

−$14 / mo · grows with workload stability

RDS + ElastiCache 1-yr reserved

Commit to 12-month reserved capacity once load is stable for 30 days. ~30% off RDS, ~26% off ElastiCache. Break-even at month 4.

−$24 / mo · −28% on store cost
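The idempotent tool-call cache lever can be illustrated as a SHA-256 key over the tool name and its arguments, with a TTL on each entry. This is a sketch under assumptions: the names `cache_key` and `ToolCallCache` are hypothetical, a dict stands in for Redis, and the injectable clock exists only to make expiry easy to exercise.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


def cache_key(tool_name: str, args: dict) -> str:
    """Stable SHA-256 over (tool_name, args); key order in args does not matter."""
    payload = json.dumps(
        {"tool": tool_name, "args": args}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolCallCache:
    """TTL cache for read-only tools; a dict stands in for the Redis backend."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_call(self, tool_name: str, args: dict, fn: Callable[..., Any]) -> Any:
        """Return a fresh cached result, or call the tool and store its result."""
        key = cache_key(tool_name, args)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = fn(**args)
        self._entries[key] = (now, value)
        return value
```

Two cache instances with different `ttl_seconds` would model the 1h and 24h tiers mentioned above. The cache should only wrap read-only tools, since replaying a cached result for a write would skip the write.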

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your supervisor-routing prompt. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or an HITL flow.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira R.

Ex-staff · agent platform · top-3 cloud

Multi-agent topology, supervisor design, RBAC at the tool layer, agent eval

“Send the diff. I'll go line-by-line through your supervisor prompt and the ScopedToolView wiring and pick out the agent-overreach paths.”

Daniel K.

Principal · LLM platform · enterprise SaaS

HITL design, approval-flow auditability, time-travel replay, compliance + audit log

“Send your worst stuck-approval. We'll walk it backwards from the Slack event log to the LangGraph checkpoint state and figure out where the protocol broke.”

Anya S.

Eng manager · AI platform · public Series-D

Org design for agent teams, hiring rubrics, staff-engineer interview prep, ADR review

“If you're prepping for staff promo, send your ADR draft. We'll work backwards from the rubric.”

Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + arch review

Request a slot →

What your tier unlocks

PRO unlocks Modules 01-03. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-03 (a working multi-agent pipeline) plus the rest of the PRO catalog. EXPERT unlocks Modules 04-06 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get · FREE · PRO · EXPERT

Modules 01-03 of P13

Foundation + tools + multi-agent pipeline (~10h)

—

Included

Modules 04-06 of P13

Production + hardening + platform design (~7.5h)

—

Included

5 committed ADRs + cost-model CSV

Starter kit docs/adr/ + docs/cost-model/

—

Included

PRO project catalog

Production-grade builds

All current

All current + this one

Curriculum

All 7 tracks

Phase 1 only

All

All + bonus modules

Code review

Senior+ reviewers

—

4 / month

Unlimited

Cohort-beta architecture review

Async · 7-day turnaround · 12-member cap

—

Included

Certificate

Verifiable on LinkedIn

—

Yes

Yes + LinkedIn rec

$79/mo

billed monthly · open enrollment · cancel anytime

or annual

$699/yr save 26%

Unlock EXPERT →

Who this is for

Pick this if you own the supervisor, not just a feature.

Staff / principal engineers · agent platform

You own the supervisor prompt, the RBAC boundary, and the answer to 'which agent decided this?' that your VP asks at the next incident review.

Engineering managers · AI

You need a reference architecture for the agent platform your CTO will ask about before the AI team gets headcount or a budget for production deployment.

Platform / infra leads

You absorb LangGraph without absorbing 4 new vendors. Postgres, Redis, Prometheus, Slack — tools you already operate. This is the playbook.

Founding engineers · AI startups

Your investors will ask 'how do you know agents are safe to ship?' before they ask about scale. The 5 ADRs + HITL gate + RBAC registry is the answer.

Related curriculum

Going deeper? Four tracks back this project.

The Agentic Workflows curriculum is the foundation. These four tracks let you go deeper on agent eval, retrieval, production ops, and the platform-design discipline you'll need at staff level.

FAQ · EXPERT tier

Quick answers.

How is this different from PRO?

Modules 01-03 (the working multi-agent pipeline — foundation, 4 tools, supervisor + workers + Redis checkpointing) are included with PRO at $29/mo. The rest of the platform — Modules 04-06 (production observability, HITL + hardening, multi-tenant platform design), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you a working pipeline; EXPERT gets you the platform you'd defend in an architecture review.

Is this still useful if I'm using CrewAI / AutoGen instead of LangGraph?

Yes — most of the value is in the design decisions, not the framework. ADR-001 lays out exactly when LangGraph wins vs CrewAI / AutoGen / custom; the supervisor-worker topology in ADR-003 is framework-agnostic; the RBAC ToolRegistry in ADR-005 is a Python pattern. The orchestrator-specific code is contained to one directory (~200 lines) and is documented as a swap target.

Is the cohort-beta mentor program 1:1 video calls?

Not for v1. The cohort beta runs as async review: you submit a diff / ADR / HITL flow / supervisor prompt, a staff-level reviewer responds within 7 days with inline comments + a Loom walkthrough. Cohort is capped at 12 members so reviewers can keep the SLA. We'll evaluate live 1:1 sessions once the cohort signal is solid.

How long until I can finish this project?

17-18 hours of focused work across 6 modules. Most learners spread it across 4-6 weeks alongside a day job. Modules 01-03 alone are ~10 hours and get you a working multi-agent pipeline you can deploy locally.

Is this enough to interview for staff AI / agent-platform roles?

It's a strong forcing function. Staff agent-platform interviews lean heavily on system design (multi-agent topology, HITL, audit, cost) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the RBAC reversal incident) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have the portfolio piece.

Can my company expense it?

Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training, AI upskilling, or platform-tooling budgets.

What is NOT in scope?

Model fine-tuning. Pre-training. RAG retrieval pipelines (we use them; we don't build them — see /projects/enterprise-rag for that). Agent training via reinforcement learning. This is an agent execution platform — you ship the system that runs agents in production, not the system that creates them.

Related projects

Paired with this project

P16·PAID·ai

LLM training-data pipeline — crawl + dedup + RAG + LLMOps

EXPERT-tier dataset-engineering build: aiohttp crawler + MinHash/LSH dedup + quality scoring + tokenizer + pgvector/Pinecone RAG + vLLM serving + Airflow DAGs + Locust load tests + CI eval gate. Modules 01-02 with PRO; full platform with EXPERT.

Explore project →

Ready to ship the system that runs agents in production?

Start with PRO ($29/mo) for Modules 01-03 — the working multi-agent pipeline. Or unlock the full 6-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

See EXPERT benefits


Related curriculum tracks: LLM Evaluation · RAG Learning Path · MLOps for Data Engineers · System Design for Data Engineers
