ai-de.net/Projects/P13 · Agentic data pipeline — LangGraph supervisor + HITL + ADRs
EXPERT-tier · PRO unlocks Modules 01-03 · AI & vectors track · P13

Build the agent execution layer of modern data platforms

Ship a production agentic pipeline with a LangGraph supervisor + 4 worker agents (Ingestion / Quality / Transform / Loading), Redis checkpointing for time-travel replay, HITL via interrupt_before with Slack actionable approval, ToolCallGuard budget enforcement, RestrictedPython sandbox, and 5 committed ADRs. Modules 01-03 unlock with PRO; the platform unlocks with EXPERT.

Timeline: 17-18 hours
Difficulty: Senior+
Stack: LangGraph · FastAPI · Postgres · Redis · LangSmith · Slack Bolt

The agent-platform system-design portfolio piece for staff AI roles — 5 committed ADRs (one Deprecated documenting a real RBAC reversal), a working LangGraph supervisor-worker pipeline, and a cost model that defends the cascade against a Sonnet-only baseline.

By the end you will have wired
  • LangGraph supervisor + 4 worker agents with conditional-edge routing and Redis checkpointing
  • RBAC-aware ToolRegistry with per-agent scoped views (post-ADR-005 reversal)
  • Production observability: Prometheus + OpenTelemetry + LangSmith + per-agent cost attribution
  • HITL approval flow via interrupt_before + Slack actionable buttons (Approve / Deny / Escalate)
  • FailureDetector + ToolCallGuard + ContextWindowManager + 24h TimeTravel checkpoint replay
  • 5 ADRs (one Deprecated) + cost-model CSV + RestrictedPython sandbox + GitHub PR bot
PREREQ · SENIOR+ · Built for engineers shipping agents in production. You should be comfortable with Python services, async / asyncio, and at least one of: LangGraph (or an equivalent agent framework), FastAPI, or production observability. Not a “what is an agent” course.
agentic_pipeline.platform · 6 modules · supervisor + 4 workers armed · LangGraph + Redis checkpoint
interrupt_before ✓
Inputs
Agents
Memory
Surfaces
postgres: QueryDatabaseTool · pooled
rest_api: APIClientTool · retry + rate-limit
s3 / files: FileProcessorTool · chunked read
kafka: stream consumer · idempotent
Tool registry — see ADR-005 (Deprecated)
Supervisor: LangGraph routing · weighted conf
IngestionWorker: RBAC-scoped tool view
QualityWorker: validation + anomaly
TransformWorker: schema-evolution-safe
LoadingWorker: idempotent writes
Supervisor-worker topology — see ADR-003
RedisSaver: checkpoint · 24h TTL
SemanticMemory: embedding + similarity
TimeTravel: replay from any checkpoint
AuditLog: tool-call + approval trail
Two-tier persistence — see ADR-002
interrupt_before: HITL gate · Slack approve
LangSmith trace: per-agent + per-tool
Cost tracker: USD per run · per agent
GitHub PR bot: PyGithub · auto-PR
HITL approval flow — see ADR-004
# Judge cascade — −27% alone, −35% with reserved instances
Haiku handles 70% of worker LLM calls (USD 0.80/M in)
Sonnet only on supervisor routing + escalation paths
Cascade saves ~USD 75/mo at 5k runs/mo
→ ~USD 0.036 per run at optimized load
# HITL gate — interrupt_before
Destructive writes pause the LangGraph mid-run
State persists in Redis (24h TTL) — no re-execution on resume
Slack actionable buttons (Approve / Deny / Escalate)
→ graph.aupdate_state() records the decision; re-invoking the graph resumes from the exact checkpoint
4 + 1
workers + supervisor
5 ADRs
committed in starter kit
−35%
cost vs Sonnet-only baseline
Curriculum · 6 modules · 17-18 hours · 3 phases

Modules 01-03 unlock with PRO. The full platform with EXPERT.

Modules 01-03 (~10h) ship a complete working multi-agent pipeline — foundation, 4 production tools, and a LangGraph supervisor + worker system with Redis checkpointing. Included with PRO. Modules 04-06 (~7.5h additional) layer on production observability, hardening (HITL + failure detection + time-travel), and a multi-tenant platform-design capstone. Unlock with EXPERT.

P13 · 6 modules · 17-18 hours · 60+ lessons
Free preview · EXPERT required
M01
Agent Foundation
Project scaffolding with LangGraph, base agent class with logging hooks, BaseDataTool abstraction, PipelineState TypedDict, and a Pydantic Settings configuration layer. The honest baseline before any working agent exists.
Phase 1 · 3h · 8 lessons · PRO TIER
Unlock with PRO →
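
To make M01 concrete, here is a minimal sketch of what a PipelineState TypedDict and a BaseDataTool abstraction could look like. The field names, the result envelope, and the RowCountTool example are illustrative assumptions, not the starter kit's actual definitions.

```python
from abc import ABC, abstractmethod
from typing import Any, TypedDict


class PipelineState(TypedDict, total=False):
    """Shared state passed between supervisor and workers (fields are assumed)."""
    run_id: str
    source: str
    records: list[dict[str, Any]]
    errors: list[str]
    next_agent: str


class BaseDataTool(ABC):
    """Common interface so every agent calls every tool the same way."""
    name: str = "base"

    @abstractmethod
    def run(self, **kwargs: Any) -> dict[str, Any]:
        ...

    def __call__(self, **kwargs: Any) -> dict[str, Any]:
        # Turn a tool failure into data the supervisor can route on,
        # instead of an unhandled exception that kills the run.
        try:
            return {"ok": True, "tool": self.name, "result": self.run(**kwargs)}
        except Exception as exc:
            return {"ok": False, "tool": self.name, "error": str(exc)}


class RowCountTool(BaseDataTool):
    """Hypothetical example tool: counts the records it is given."""
    name = "row_count"

    def run(self, **kwargs: Any) -> dict[str, Any]:
        return {"rows": len(kwargs["records"])}


state: PipelineState = {
    "run_id": "demo-1",
    "source": "postgres",
    "records": [{"id": 1}, {"id": 2}],
}
outcome = RowCountTool()(records=state["records"])
failed = RowCountTool()()  # missing argument is captured, not raised
```

Returning an envelope rather than raising is one possible design choice; it keeps failure handling in the graph's routing logic rather than in try/except blocks scattered across workers.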
M02
Tool Implementation
Four production tools — QueryDatabaseTool (connection pooling + SQL safety), APIClientTool (retry + rate-limit + idempotency), FileProcessorTool (chunked read + format detect), ValidationTool (quality rules) — wired through an RBAC-aware registry per ADR-005's reversal.
Phase 1 · 3h · 9 lessons · PRO TIER
Unlock with PRO →
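
The "retry + rate-limit" behaviour listed for APIClientTool in M02 can be illustrated with a small retry helper. This is a sketch under assumptions: TransientError, call_with_retry, and the backoff constants are hypothetical names and values, not the course's implementation.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, HTTP 429/503)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget of attempts exhausted: surface the error
            # Exponential backoff with jitter: base * 2^(attempt-1),
            # scaled by a random factor in [0.5, 1.0] to avoid thundering herds.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.0)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")


# Usage: a call that fails twice, then succeeds. The sleep function is
# injected so the example records delays instead of actually waiting.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503")
    return "ok"


delays: list[float] = []
result = call_with_retry(flaky, sleep=delays.append)
```

Retrying is only safe for idempotent requests, which is presumably why the module pairs retry with idempotency.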
M03
Multi-Agent Pipeline
LangGraph StateGraph with supervisor + 4 workers, conditional-edge routing, semantic memory store, RedisSaver checkpointing, RBAC scoping at the registry, and end-to-end execution. The PRO finish line: a working agentic pipeline.
Phase 1 · 4h · 11 lessons · PRO TIER
Unlock with PRO →
M04
Production Ready
Prometheus metrics + OpenTelemetry tracing + LangSmith integration + FastAPI + per-agent cost tracking + Docker + Kubernetes manifests + task queue for horizontal scaling + agent evaluation framework.
Phase 2 · 2h · 8 lessons · EXPERT TIER
Unlock with EXPERT →
M05
Harden the Pipeline
HITL via LangGraph interrupt_before + Slack actionable buttons, FailureDetector (loop + cascade), ToolCallGuard (budget enforcement), ContextWindowManager (rolling window + summarization), TimeTravel (24h Redis checkpoint replay), system-prompt tool-use rules.
Phase 2 · 2.5h · 9 lessons · EXPERT TIER
Unlock with EXPERT →
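
As an illustration of the budget enforcement that M05 attributes to ToolCallGuard, the sketch below caps both the number of tool calls and the spend per run. The class shape, the limits, and the BudgetExceeded exception are assumptions made for this example.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a tool call would push the run past its budget."""


@dataclass
class ToolCallGuard:
    max_calls: int = 20          # assumed per-run limit
    max_cost_usd: float = 0.50   # assumed per-run limit
    calls: int = 0
    cost_usd: float = 0.0
    log: list = field(default_factory=list)

    def charge(self, agent: str, tool: str, cost_usd: float) -> None:
        # Check before committing, so a rejected call leaves the counters untouched.
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"{agent}: call limit {self.max_calls} reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"{agent}: cost limit ${self.max_cost_usd:.2f} reached")
        self.calls += 1
        self.cost_usd += cost_usd
        self.log.append((agent, tool, cost_usd))


guard = ToolCallGuard(max_calls=3, max_cost_usd=0.10)
guard.charge("quality", "validate", 0.04)
guard.charge("quality", "validate", 0.04)
try:
    guard.charge("transform", "query", 0.04)  # would total 0.12 > 0.10
    blocked = False
except BudgetExceeded:
    blocked = True
```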
M06
Design and Extend an AI Data Platform
Unstructured extraction (pdfplumber + BeautifulSoup), dynamic schema inference (Pandas + Pydantic), RestrictedPython sandbox, multi-tenant TenantContext + RBAC, feedback-driven improvement loop, GitHub PR bot (PyGithub), SLA YAML, and a capstone platform-design exercise.
Phase 3 · 3h · 10 lessons · EXPERT TIER
Unlock with EXPERT →
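
One way to carry the multi-tenant TenantContext from M06 through async code is Python's contextvars module, which keeps a separate value per asyncio task. The sketch below is an assumption about how such a context could be wired; the field names and the table_for helper are hypothetical.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    roles: frozenset


# No default: code that runs without a bound tenant should fail loudly.
_current: contextvars.ContextVar = contextvars.ContextVar("tenant_context")


@contextmanager
def tenant_scope(ctx: TenantContext):
    token = _current.set(ctx)
    try:
        yield ctx
    finally:
        _current.reset(token)  # restore whatever was bound before


def current_tenant() -> TenantContext:
    try:
        return _current.get()
    except LookupError:
        raise RuntimeError("no tenant bound to this execution context") from None


def table_for(base: str) -> str:
    # Hypothetical helper: prefix table names so tenants never share a table.
    return f"{current_tenant().tenant_id}__{base}"


with tenant_scope(TenantContext("acme", frozenset({"quality"}))):
    scoped_name = table_for("orders")

try:
    current_tenant()
    leaked = True
except RuntimeError:
    leaked = False
```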
Modules 01-03 with PRO ($29/mo) · Modules 04-06 with EXPERT ($79/mo)
See plans →
Backed by curriculum
Agentic Workflows
7 modules · 24 hours · LangGraph · Tool-use · HITL · Multi-agent
Open curriculum
This curriculum is the foundation for the project — it’s not a sales add-on. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Foundation. Production. Platform.

Each phase ends with a tagged release, a passing integration test, and a passing red-team failure-injection run. No ambiguity about where you are.

01 · ~10h
Foundation (Modules 01-03)

Working multi-agent pipeline running locally. LangGraph supervisor + 4 worker agents + 4 tools + Redis checkpointing, end-to-end execution on the seeded sample data.

  • RBAC-aware ToolRegistry with 4 tools (post-ADR-005 reversal)
  • LangGraph supervisor + workers with conditional-edge routing
  • RedisSaver checkpointing + recovery on restart
02 · ~2h
Production (Module 04)

Observable, deployable, costable. Prometheus + OTel + LangSmith tracing, per-agent cost attribution, FastAPI service, Docker + Kubernetes manifests, agent eval framework.

  • Prometheus + OpenTelemetry + LangSmith integration
  • Per-agent + per-tool cost attribution
  • Docker + Kubernetes deploy + agent eval suite
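
Per-agent cost attribution comes down to multiplying token counts by a price table and grouping by agent. The sketch below shows that arithmetic. The Haiku input price ($0.80 per million tokens) is the figure quoted on this page; the other prices and the call-record shape are placeholders assumed for illustration.

```python
from collections import defaultdict

# USD per million tokens. Only the Haiku input price comes from this page;
# treat the other three values as placeholders and check current list prices.
PRICES = {
    "haiku": {"in": 0.80, "out": 4.00},
    "sonnet": {"in": 3.00, "out": 15.00},
}


def attribute_costs(calls: list, prices: dict = PRICES) -> dict:
    """Sum the USD cost of each LLM call under the agent that made it."""
    totals = defaultdict(float)
    for call in calls:
        price = prices[call["model"]]
        cost = (call["tokens_in"] * price["in"] + call["tokens_out"] * price["out"]) / 1_000_000
        totals[call["agent"]] += cost
    return dict(totals)


# Hypothetical call records, as a tracing layer might emit them.
run_calls = [
    {"agent": "supervisor", "model": "sonnet", "tokens_in": 2000, "tokens_out": 200},
    {"agent": "quality", "model": "haiku", "tokens_in": 5000, "tokens_out": 500},
    {"agent": "quality", "model": "haiku", "tokens_in": 5000, "tokens_out": 500},
]
per_agent = attribute_costs(run_calls)
run_total = sum(per_agent.values())
```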
03 · ~5.5h
Platform (Modules 05-06)

Hardening + multi-tenant platform design. HITL approval flow, failure detection + recovery, time-travel replay, multi-tenant TenantContext, RestrictedPython sandbox, capstone design.

  • HITL via interrupt_before + Slack actionable buttons (ADR-004)
  • FailureDetector + ToolCallGuard + TimeTravel replay
  • Multi-tenant + RestrictedPython sandbox + GitHub PR bot
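
Loop detection, one of the two failure modes attributed to FailureDetector, can be approximated by counting identical tool calls inside a sliding window. This is a simplified sketch with assumed names and thresholds; the course's detector may work differently.

```python
from collections import deque


class LoopDetector:
    """Flags an agent that keeps issuing the same tool call."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.recent = deque(maxlen=window)  # old calls fall out automatically
        self.max_repeats = max_repeats

    def observe(self, agent: str, tool: str, args: dict) -> bool:
        # Signature is order-independent with respect to argument names.
        signature = (agent, tool, repr(sorted(args.items())))
        self.recent.append(signature)
        return self.recent.count(signature) >= self.max_repeats


detector = LoopDetector(window=6, max_repeats=3)
flags = [
    detector.observe("quality", "validate", {"table": "orders"}),
    detector.observe("quality", "validate", {"table": "orders"}),
    detector.observe("quality", "validate", {"table": "users"}),
    detector.observe("quality", "validate", {"table": "orders"}),
]
```

The fourth call is the third repeat of the same signature within the window, so it is the one that gets flagged.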
Project setup · 10 minutes

One command. Local LangGraph + Postgres + Redis + LangSmith.

What lives in the repo

You get the real agent platform on day one — LangGraph supervisor + 4 worker agents, RBAC-aware ToolRegistry with 4 tools, RedisSaver checkpointing, FastAPI gateway, Prometheus + OpenTelemetry + LangSmith instrumentation, Docker + Kubernetes manifests, plus the M05 hardening (HITL / FailureDetector / TimeTravel) and M06 platform features.

  • src/agents/ — supervisor + 4 workers + HITL + FailureDetector + ContextManager
  • src/tools/ — 4 production tools + RBAC-aware registry (post-ADR-005)
  • src/orchestration/ + src/memory/ — LangGraph StateGraph + RedisSaver + semantic memory
  • src/observability/ — Prometheus + OTel + LangSmith + cost-tracking + Slack alerts
  • src/api/ + src/scaling/ + Dockerfile + k8s/ — FastAPI + task queue + container + Kubernetes manifests
  • docs/adr/ + docs/cost-model/ — 5 committed ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 91 files · 104 KB

Agentic Data Pipeline Starter Kit

Pre-built LangGraph supervisor + 4 worker agents + RBAC tool registry + Redis checkpointing + FastAPI + Docker + Kubernetes. Now bundled: 5 ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 91 files · ADRs + cost model bundled · last updated 2026-05-08
~/projects/agentic-data-pipeline — zsh
1. Unzip and start the platform
$ unzip agentic-data-pipeline-starter.zip
$ cd agentic-data-pipeline-starter && cp .env.example .env
$ docker compose up -d
2. Run the demo end-to-end
$ export ANTHROPIC_API_KEY=...
$ python scripts/demo.py \
    --source postgres --checkpoint redis --trace langsmith
3. Open ADR-001 + the cost model
$ less docs/adr/001-langgraph-vs-crewai-vs-custom.md
$ open docs/cost-model/agentic-data-pipeline-cost-model.csv
4. Run the failure-injection suite
$ pytest tests/test_failure_detector.py tests/test_time_travel.py -v
4 + 1
workers + supervisor
91
files in starter kit
5
ADRs (incl. 1 Deprecated)
30
sample run traces
Production hardening

The same agent demo — but built for the auditable case.

Most agent tutorials show you a single LLM call in a loop. This shows what changes when the supervisor decides for 4 workers, every tool call hits an audit log, the cost model is defensible to a CFO, and a compliance reviewer asks 'which agent decided that?'

Notebook agent demo · What most teams ship
  • Orchestration: While-loop calling an LLM
  • Tool access: Every agent can call every tool
  • Failure mode: Loop until token budget; pray
  • HITL: Manual code edit, redeploy
  • Cost: Whatever the bill says next month
  • Replay: Re-run from scratch on every retry
Your agent platform · Modules 04–06
  • Orchestration: LangGraph StateGraph + supervisor-worker conditional edges (ADR-001 + ADR-003)
  • Tool access: ScopedToolView per agent role (RBAC at registry, post-ADR-005 reversal)
  • Failure mode: FailureDetector + ToolCallGuard + budget cap; cascade detection in M05
  • HITL: interrupt_before + Slack approve / deny / escalate (ADR-004)
  • Cost: Per-agent token attribution + judge cascade −35%; CSV in docs/cost-model/
  • Replay: TimeTravel reads any checkpoint in last 24h via RedisSaver (ADR-002)
EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the v0 ToolRegistry → RBAC reversal after a real Quality-agent overreach incident. The kind of doc that travels with you to your next role. Preview ADR-001 →

ADR-001 · Accepted

LangGraph chosen over CrewAI / AutoGen / custom orchestrator

Context
HITL + time-travel replay + typed state + streaming all need first-class primitives — pipeline / mesh / chain alternatives don't fit
Decision
Adopt LangGraph as the orchestrator. Workers wrap a Protocol (WorkerAgent) — orchestrator-replaceable
Tradeoff
Heavy LangChain dependency surface (~200 transitive deps); minor versions occasionally break Saver protocol
Reversal
Custom orchestrator swap is ~3 engineer-weeks; Protocol shape keeps workers portable
ADR-002 · Accepted

Redis for orchestrator checkpoints; Postgres only for business data

Context
PostgresSaver caused row-lock contention with M04 dashboards (p95=4.2s → 0.8s after split) + LangGraph version bumps required Postgres migrations
Decision
Two-tier persistence — RedisSaver for checkpoints (24h TTL), Postgres for business data
Tradeoff
Two stores to operate (~$35/mo extra) and 24h replay window vs indefinite
Reversal
Postgres-only re-introduction is ~3 engineer-days when run rate < 500/day
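
The tradeoff in ADR-002 (a 24h replay window instead of indefinite history) can be seen in a toy checkpoint store. The sketch below is an in-memory stand-in with an injectable clock, written only to show the TTL and replay semantics; it is not RedisSaver, and a real Redis deployment would expire keys itself.

```python
import copy
import time


class TTLCheckpointStore:
    """In-memory stand-in for a checkpoint store with a fixed TTL."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.time) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # thread_id -> list of (saved_at, state)

    def save(self, thread_id: str, state: dict) -> int:
        entries = self._data.setdefault(thread_id, [])
        entries.append((self.clock(), copy.deepcopy(state)))
        return len(entries) - 1  # checkpoint index within this thread

    def load(self, thread_id: str, index: int = -1) -> dict:
        saved_at, state = self._data[thread_id][index]
        if self.clock() - saved_at > self.ttl:
            raise KeyError(f"checkpoint {index} for {thread_id} has expired")
        return copy.deepcopy(state)  # callers cannot mutate stored history


now = [0.0]  # fake clock so the example does not wait 24 hours
store = TTLCheckpointStore(clock=lambda: now[0])
first = store.save("run-1", {"step": "ingestion"})
now[0] = 3600.0
second = store.save("run-1", {"step": "quality"})

replayed = store.load("run-1", first)      # time travel to the earlier step
now[0] = 24.5 * 3600
still_valid = store.load("run-1", second)  # 23.5h old, inside the window
try:
    store.load("run-1", first)             # 24.5h old, outside the window
    expired = False
except KeyError:
    expired = True
```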
ADR-003 · Accepted

Hierarchical supervisor-worker topology, not peer-to-peer agents

Context
Per-agent cost attribution + deterministic replay + linear LLM cost scaling all need a single decision-maker
Decision
Supervisor + 4 workers; supervisor holds routing logic (wired with add_conditional_edges), workers execute tools and return state
Tradeoff
Supervisor on critical path of every step; per-step concurrency lost in exchange for auditability
Reversal
Peer-to-peer mesh is ~2 engineer-weeks if real-time streaming becomes a requirement
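
The single decision-maker in ADR-003 can be pictured as one routing function that reads state and names the next node. In LangGraph, a callable of this shape is what gets passed as the path argument of StateGraph.add_conditional_edges. The version below is a deterministic stand-in for the supervisor's decision; the state fields and node names are assumptions.

```python
from typing import TypedDict


class RouteState(TypedDict, total=False):
    ingested: bool
    quality_passed: bool
    transformed: bool
    loaded: bool
    errors: list


def route(state: RouteState) -> str:
    """Return the name of the next node; every run passes through here."""
    if state.get("errors"):
        return "end"  # stop on failure rather than let a worker improvise
    if not state.get("ingested"):
        return "ingestion"
    if not state.get("quality_passed"):
        return "quality"
    if not state.get("transformed"):
        return "transform"
    if not state.get("loaded"):
        return "loading"
    return "end"


# Walk a run forward by marking each step done, recording the route taken.
state: RouteState = {}
done_flag = {
    "ingestion": "ingested",
    "quality": "quality_passed",
    "transform": "transformed",
    "loading": "loaded",
}
path = []
while (step := route(state)) != "end":
    path.append(step)
    state[done_flag[step]] = True
```

Because every transition goes through one function, the route taken by a run can be logged and replayed, which is the auditability the ADR trades concurrency for.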
ADR-004 · Accepted

HITL via LangGraph `interrupt_before` + Slack actionable buttons

Context
Pause + persist + resume + audit + timeout-then-reject — message-queue + DB-flag patterns add too much infra; native graph pause wins
Decision
interrupt_before=[...] at compile time + Slack Approve / Deny / Escalate buttons → graph.aupdate_state records the decision, then the graph is re-invoked to resume
Tradeoff
Slack on critical path; 24h hard timeout (ties to Redis TTL); approve/deny/escalate only — no partial state edit
Reversal
SQS-backed approver web UI is ~2 engineer-weeks; LangGraph contract unchanged
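
The approval contract in ADR-004 (three decisions, a 24h hard timeout that rejects, and an audit entry each time) can be modelled without any graph library. The sketch below shows that decision logic only; the class, field names, and outcome strings are assumptions, and in the real flow the outcome would decide whether the paused graph is resumed.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Decision(str, Enum):
    APPROVE = "approve"
    DENY = "deny"
    ESCALATE = "escalate"


@dataclass
class PendingApproval:
    run_id: str
    action: str
    requested_at: float
    timeout_s: float = 24 * 3600  # mirrors the 24h checkpoint TTL
    audit: list = field(default_factory=list)

    def resolve(self, decision: Decision, actor: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        if now - self.requested_at > self.timeout_s:
            outcome = "rejected_timeout"  # timeout-then-reject, never auto-approve
        elif decision is Decision.APPROVE:
            outcome = "resumed"
        elif decision is Decision.DENY:
            outcome = "rejected"
        else:
            outcome = "escalated"
        self.audit.append({
            "run_id": self.run_id, "action": self.action, "actor": actor,
            "decision": decision.value, "outcome": outcome, "at": now,
        })
        return outcome


on_time = PendingApproval("run-1", "overwrite table orders", requested_at=0.0)
on_time_outcome = on_time.resolve(Decision.APPROVE, "alice", now=60.0)

late = PendingApproval("run-2", "overwrite table users", requested_at=0.0)
late_outcome = late.resolve(Decision.APPROVE, "bob", now=25 * 3600.0)
```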
ADR-005 · Deprecated

Single global ToolRegistry without RBAC scoping (v0)

Context
Day-2 MVP: all agents shared one registry; nothing prevented Quality agent from calling Transform tools
Decision
Reverted in M03 — added ScopedToolView with per-agent-role scopes (default-deny) at the registry layer
Why reversed
Quality agent silently applied a schema migration during a validation run on 2026-04-12 — broke M04 dashboard for 23 min
Replaced by
RBAC-aware registry; ~4.5 engineer-day reversal cost
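
A default-deny scoped view, as described in the ADR-005 reversal, might look like the sketch below: each agent role receives a view that can only call the tools named in its scope. The class internals, tool names, and scope table are assumptions made for illustration.

```python
class ScopedToolView:
    """What an agent actually holds: the registry filtered by its role's scope."""

    def __init__(self, tools: dict, allowed) -> None:
        self._tools = tools
        self._allowed = frozenset(allowed)

    def call(self, name: str, **kwargs):
        # Default-deny: anything not explicitly granted is refused.
        if name not in self._allowed:
            raise PermissionError(f"tool {name!r} is outside this agent's scope")
        return self._tools[name](**kwargs)


class ToolRegistry:
    def __init__(self) -> None:
        self._tools = {}

    def register(self, name: str, fn) -> None:
        self._tools[name] = fn

    def view_for(self, role: str, scopes: dict) -> ScopedToolView:
        # An unknown role gets an empty scope, not the full registry.
        return ScopedToolView(self._tools, scopes.get(role, frozenset()))


registry = ToolRegistry()
registry.register("validate", lambda table: f"validated {table}")
registry.register("apply_migration", lambda table: f"migrated {table}")

# Hypothetical scope table: role -> tools it may call.
SCOPES = {"quality": {"validate"}, "transform": {"validate", "apply_migration"}}

quality_view = registry.view_for("quality", SCOPES)
allowed_result = quality_view.call("validate", table="orders")
try:
    quality_view.call("apply_migration", table="orders")
    overreach_blocked = False
except PermissionError:
    overreach_blocked = True
```

With this shape, the Quality agent from the incident above would have hit a PermissionError instead of applying the migration.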
EXPERT-only · cost model

Read the FinOps story, not just the latency one.

Module 04 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 5,000 agent-runs/mo load, real Anthropic + AWS RDS + ElastiCache + LangSmith list prices, with model-cascade and reserved-instance levers wired up. The version you’ll defend to a CFO. Preview the CSV →

Component | Baseline / mo | Optimized / mo | Delta
Anthropic Claude Sonnet (planner + supervisor) · 100% baseline → 30% optimized · 24M in / 6M out tok/mo | $144 | $43 | −$101
Anthropic Claude Haiku (worker agents) · 70% of mix in optimized · ~17M in / 4M out tok/mo | $0 | $26 |
AWS RDS Postgres (db.t4g.medium) · 100GB gp3 · business data store (per ADR-002) | $50 | $35 | −$15
AWS ElastiCache Redis (cache.t4g.small) · checkpoint + recovery state (per ADR-002) | $35 | $26 | −$9
LangSmith observability (Plus tier) · 30k traces/mo budget · agent + tool spans | $39 | $39 |
GitHub Actions + container registry · ~150 PR runs/mo × 6 min × Linux runners + GHCR | $12 | $12 |
Total · 5k runs/mo · ~$0.056 per run at baseline · ~$0.036 optimized | $280 | $181 | −$99 (−35%)
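
The totals above can be checked with a few lines of arithmetic. The figures in this sketch are copied from the table; the short component keys are labels chosen for the example.

```python
# (baseline, optimized) in USD per month, copied from the cost table.
COSTS = {
    "sonnet": (144, 43),
    "haiku": (0, 26),
    "rds": (50, 35),
    "elasticache": (35, 26),
    "langsmith": (39, 39),
    "ci": (12, 12),
}
RUNS_PER_MONTH = 5000

baseline = sum(b for b, _ in COSTS.values())
optimized = sum(o for _, o in COSTS.values())
saving = baseline - optimized
saving_pct = saving / baseline * 100

per_run_baseline = baseline / RUNS_PER_MONTH
per_run_optimized = optimized / RUNS_PER_MONTH

# Split the saving by lever: the model cascade nets Sonnet's drop against
# Haiku's new spend; reserved instances cover the two data stores.
cascade_delta = (COSTS["sonnet"][1] + COSTS["haiku"][1]) - (COSTS["sonnet"][0] + COSTS["haiku"][0])
stores_delta = (COSTS["rds"][1] + COSTS["elasticache"][1]) - (COSTS["rds"][0] + COSTS["elasticache"][0])

print(f"baseline ${baseline}, optimized ${optimized}, saving ${saving} ({saving_pct:.0f}%)")
print(f"per run: ${per_run_baseline:.3f} -> ${per_run_optimized:.3f}")
print(f"cascade {cascade_delta}, stores {stores_delta}")
```

The cascade (−$75) and the reserved-instance discount (−$24) together account for the full −$99.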

Optimization levers

Model cascade (Haiku for workers · Sonnet for supervisor)
Route 70% of worker LLM calls to Haiku ($0.80/M in). Supervisor + escalation paths stay on Sonnet for routing quality. ADR-001 + ADR-003.
−$75 / mo · −27%
Idempotent tool-call cache
SHA-256 cache on (tool_name, args) for read-only tools. Redis-backed, 1h TTL on quality-sensitive, 24h on static reads. ~18% hit rate on regression suites.
−$14 / mo · grows with workload stability
RDS + ElastiCache 1-yr reserved
Commit to 12-month reserved capacity once load is stable for 30 days. ~30% off RDS, ~26% off ElastiCache. Break-even at month 4.
−$24 / mo · −28% on store cost
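
The idempotent tool-call cache lever depends on a stable key for (tool_name, args). Below is a minimal sketch of one way to build it: serialize the pair as canonical JSON, hash it with SHA-256, and store the result with a per-call TTL. The in-memory dict stands in for Redis, and all class and function names are assumptions.

```python
import hashlib
import json
import time


def cache_key(tool_name: str, args: dict) -> str:
    # sort_keys makes the key independent of argument order;
    # default=str lets dates and similar values serialize.
    payload = json.dumps([tool_name, args], sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolCallCache:
    """In-memory stand-in for a Redis-backed cache of read-only tool calls."""

    def __init__(self, clock=time.time) -> None:
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_call(self, tool_name: str, args: dict, fn, ttl_s: float):
        key = cache_key(tool_name, args)
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: the tool is not called again
        value = fn(**args)
        self._store[key] = (now + ttl_s, value)
        return value


now = [0.0]  # fake clock so expiry can be shown without waiting
cache = ToolCallCache(clock=lambda: now[0])
tool_calls = {"n": 0}


def count_rows(table: str) -> int:
    tool_calls["n"] += 1
    return 42


first = cache.get_or_call("count_rows", {"table": "orders"}, count_rows, ttl_s=3600)
second = cache.get_or_call("count_rows", {"table": "orders"}, count_rows, ttl_s=3600)
now[0] = 3601.0
third = cache.get_or_call("count_rows", {"table": "orders"}, count_rows, ttl_s=3600)
```

Only read-only tools belong behind a cache like this; caching a write would silently skip the side effect.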
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your supervisor-routing prompt. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or an HITL flow.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira R.
Ex-staff · agent platform · top-3 cloud
Multi-agent topology, supervisor design, RBAC at the tool layer, agent eval
Send the diff. I'll go line-by-line through your supervisor prompt and the ScopedToolView wiring and pick out the agent-overreach paths.
Daniel K.
Principal · LLM platform · enterprise SaaS
HITL design, approval-flow auditability, time-travel replay, compliance + audit log
Send your worst stuck-approval. We'll walk it backwards from the Slack event log to the LangGraph checkpoint state and figure out where the protocol broke.
Anya S.
Eng manager · AI platform · public Series-D
Org design for agent teams, hiring rubrics, staff-engineer interview prep, ADR review
If you're prepping for staff promo, send your ADR draft. We'll work backwards from the rubric.
Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + arch review
Request a slot
What your tier unlocks

PRO unlocks Modules 01-03. EXPERT unlocks the full platform.

PRO is the entry point — Modules 01-03 (a working multi-agent pipeline) plus the rest of the PRO catalog. EXPERT unlocks Modules 04-06 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get | FREE | PRO | EXPERT
Modules 01-03 of P13 · Foundation + tools + multi-agent pipeline (~10h) |  | Included | Included
Modules 04-06 of P13 · Production + hardening + platform design (~7.5h) |  |  | Included
5 committed ADRs + cost-model CSV · Starter kit docs/adr/ + docs/cost-model/ |  |  | Included
PRO project catalog · Production-grade builds | 2 | All current | All current + this one
Curriculum · All 7 tracks | Phase 1 only | All | All + bonus modules
Code review · Senior+ reviewers |  | 4 / month | Unlimited
Cohort-beta architecture review · Async · 7-day turnaround · 12-member cap |  |  | Included
Certificate · Verifiable on LinkedIn |  | Yes | Yes + LinkedIn rec
$79/mo
billed monthly · open enrollment · cancel anytime
or annual
$699/yr save 26%
Unlock EXPERT
Who this is for

Pick this if you own the supervisor, not just a feature.

Staff / principal engineers · agent platform

You own the supervisor prompt, the RBAC boundary, and the answer to 'which agent decided this?' that your VP asks at the next incident review.

Engineering managers · AI

You need a reference architecture for the agent platform your CTO will ask about before the AI team gets headcount or a budget for production deployment.

Platform / infra leads

You absorb LangGraph without absorbing 4 new vendors. Postgres, Redis, Prometheus, Slack — tools you already operate. This is the playbook.

Founding engineers · AI startups

Your investors will ask 'how do you know agents are safe to ship?' before they ask about scale. The 5 ADRs + HITL gate + RBAC registry is the answer.

FAQ · EXPERT tier

Quick answers.

Modules 01-03 (the working multi-agent pipeline — foundation, 4 tools, supervisor + workers + Redis checkpointing) are included with PRO at $29/mo. The rest of the platform — Modules 04-06 (production observability, HITL + hardening, multi-tenant platform design), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you a working pipeline; EXPERT gets you the platform you'd defend in an architecture review.
Yes — most of the value is in the design decisions, not the framework. ADR-001 lays out exactly when LangGraph wins vs CrewAI / AutoGen / custom; the supervisor-worker topology in ADR-003 is framework-agnostic; the RBAC ToolRegistry in ADR-005 is a Python pattern. The orchestrator-specific code is contained to one directory (~200 lines) and is documented as a swap target.
Not for v1. The cohort beta runs as async review: you submit a diff / ADR / HITL flow / supervisor prompt, a staff-level reviewer responds within 7 days with inline comments + a Loom walkthrough. Cohort is capped at 12 members so reviewers can keep the SLA. We'll evaluate live 1:1 sessions once the cohort signal is solid.
17-18 hours of focused work across 6 modules. Most learners spread it across 4-6 weeks alongside a day job. Modules 01-03 alone are ~10 hours and get you a working multi-agent pipeline you can deploy locally.
It's a strong forcing function. Staff agent-platform interviews lean heavily on system design (multi-agent topology, HITL, audit, cost) and on having opinions backed by real tradeoffs. The 5 ADRs you commit (one Deprecated, with the RBAC reversal incident) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have the portfolio piece.
Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training, AI upskilling, or platform-tooling budgets.
Model fine-tuning. Pre-training. RAG retrieval pipelines (we use them; we don't build them — see /projects/enterprise-rag for that). Agent training via reinforcement learning. This is an agent execution platform — you ship the system that runs agents in production, not the system that creates them.

Ready to ship the system that runs agents in production?

Start with PRO ($29/mo) for Modules 01-03 — the working multi-agent pipeline. Or unlock the full 6-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

P13 · Agentic data pipeline · EXPERT · PRO unlocks M01-M03 · Unlock EXPERT →