HITL via LangGraph `interrupt_before` + Slack actionable buttons

✓ Accepted · Agentic Data Pipeline · 05: Harden the Pipeline
By AI-DE Engineering Team · Stakeholders: platform owner, ML engineer, on-call engineer

Context

Some agent decisions require a human in the loop:

  • Destructive writes — bulk DELETE, schema mutation, data deletion for GDPR.
  • High-cost operations — agent decides to spawn a 200-iteration retry loop.
  • Compliance gates — multi-tenant data export to a non-default region.

We need a pause-and-wait pattern that:

  1. Persists the run's full state until a human responds.
  2. Resumes from the exact point of pause when approval arrives (no re-running the prefix).
  3. Routes the approval request to a channel an on-call engineer actually watches (Slack, not email).
  4. Auto-rejects after a timeout — runs that don't get approval shouldn't sit forever.
  5. Has a usable audit trail.

Three patterns considered:

  • Message-queue approval. Worker writes a request to a queue (e.g. SQS); another service picks it up; approver responds via API; the queue notifies the worker. Battle-tested. Heavy infrastructure (queue + approver service + reconciliation logic).
  • DB-flag polling. Worker writes "waiting" status to Postgres; approver toggles a flag; worker polls. Simple. Polling load + flaky resume semantics.
  • LangGraph interrupt_before. Native graph-level pause. State persists in Redis (per ADR-002). Resume is graph.aupdate_state(...) with the approver's input, followed by re-invoking the graph on the same thread_id. Lean infrastructure, deeply integrated with the orchestrator.

Decision

Use LangGraph interrupt_before for the pause; Slack actionable buttons for the approval surface.

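# src/orchestration/graph.py
# `builder` is the StateGraph assembled earlier in this module (name assumed here).
graph = builder.compile(
    checkpointer=RedisSaver(...),
    # Pause before any of these nodes runs; state is checkpointed at the pause.
    interrupt_before=["transform_destructive", "schema_mutate", "compliance_export"],
)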

# src/agents/hitl.py
async def request_approval(thread_id: str, action: PendingAction) -> None:
    """Post Slack message with Approve / Deny / Escalate buttons."""
    await slack.chat_postMessage(
        channel=APPROVAL_CHANNEL,
        text=f"Agent run {thread_id} needs approval for: {action.summary}",
        blocks=[
            section_block(action.detail),
            actions_block([
                button("Approve", value=f"{thread_id}:approve"),
                button("Deny", value=f"{thread_id}:deny"),
                button("Escalate", value=f"{thread_id}:escalate"),
            ]),
        ],
    )
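The helpers `section_block`, `actions_block`, and `button` are not shown in this ADR. Below is a minimal sketch of what they might return, assuming plain Slack Block Kit dictionaries. Deriving each button's `action_id` from its lowercased label is an assumption made here so that a handler registered with `@app.action("approve")` has something to match; `approval_blocks` is a hypothetical convenience wrapper.

```python
from typing import Any

Block = dict[str, Any]


def section_block(text: str) -> Block:
    """A Block Kit section with mrkdwn text."""
    return {"type": "section", "text": {"type": "mrkdwn", "text": text}}


def button(label: str, value: str) -> Block:
    """A Block Kit button element.

    Assumption: action_id is the lowercased label ("Approve" -> "approve").
    """
    return {
        "type": "button",
        "text": {"type": "plain_text", "text": label},
        "action_id": label.lower(),
        "value": value,
    }


def actions_block(elements: list[Block]) -> Block:
    """A Block Kit actions block wrapping interactive elements."""
    return {"type": "actions", "elements": elements}


def approval_blocks(thread_id: str, detail: str) -> list[Block]:
    """Hypothetical wrapper: the block list passed to chat_postMessage."""
    return [
        section_block(detail),
        actions_block([
            button("Approve", value=f"{thread_id}:approve"),
            button("Deny", value=f"{thread_id}:deny"),
            button("Escalate", value=f"{thread_id}:escalate"),
        ]),
    ]
```

Keeping the helpers as pure functions over dictionaries means the message layout can be unit-tested without a Slack client.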

# src/alerting/slackbot.py: handles the button click
@app.action("approve")  # matches the button's action_id
async def handle_approve(ack, body):
    await ack()  # acknowledge first; Slack expects an ack within 3 seconds
    # Split from the right so a thread_id that contains ":" still parses.
    thread_id, _ = body["actions"][0]["value"].rsplit(":", 1)
    config = {"configurable": {"thread_id": thread_id}}
    await graph.aupdate_state(
        config=config,
        values={"approval": "granted", "approver": body["user"]["id"]},
    )
    await graph.ainvoke(None, config)  # None input resumes from the checkpoint
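The button value packs the thread ID and the decision into one string, so each handler has to split it apart again. The sketch below is a stricter parser that could be shared by the approve, deny, and escalate handlers. `parse_button_value`, `ButtonClick`, and `VALID_DECISIONS` are hypothetical names, not part of the project as described.

```python
from typing import NamedTuple

# Decisions that the three buttons can carry, per the value format above.
VALID_DECISIONS = frozenset({"approve", "deny", "escalate"})


class ButtonClick(NamedTuple):
    thread_id: str
    decision: str


def parse_button_value(value: str) -> ButtonClick:
    """Split "<thread_id>:<decision>" on the last colon.

    Splitting from the right keeps thread IDs that contain ":" intact.
    Raises ValueError for a missing separator, an empty thread ID,
    or a decision that is not one of VALID_DECISIONS.
    """
    thread_id, sep, decision = value.rpartition(":")
    if not sep or not thread_id or decision not in VALID_DECISIONS:
        raise ValueError(f"malformed approval button value: {value!r}")
    return ButtonClick(thread_id, decision)
```

Rejecting unknown decisions at the parsing step means a malformed or tampered payload fails before any graph state is touched.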

The 24-hour timeout is enforced by the Redis TTL on the checkpoint (per ADR-002). After expiry, M05's failure-recovery scan auto-rejects the run with a "stale" reason and emits a metric.
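The auto-reject step can be sketched as a pure function over pending approvals. This is an illustration under assumptions, not the code in src/observability/recovery.py: it assumes the scan has a record of each paused run and when it was interrupted, kept somewhere that outlives the checkpoint. `PendingApproval`, `Rejection`, and `find_stale` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

APPROVAL_TIMEOUT = timedelta(hours=24)  # the 24h hard cap from this ADR


@dataclass(frozen=True)
class PendingApproval:
    thread_id: str
    interrupted_at: datetime  # timezone-aware, UTC


@dataclass(frozen=True)
class Rejection:
    thread_id: str
    reason: str


def find_stale(
    pending: list[PendingApproval],
    now: datetime,
    timeout: timedelta = APPROVAL_TIMEOUT,
) -> list[Rejection]:
    """Return a "stale" rejection for every run paused for `timeout` or longer."""
    return [
        Rejection(p.thread_id, "stale")
        for p in pending
        if now - p.interrupted_at >= timeout
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    runs = [PendingApproval("run-1", now - timedelta(hours=30))]
    print(find_stale(runs, now))
```

Passing `now` in as an argument keeps the scan deterministic in tests; the caller would emit the metric once per returned rejection.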

Tradeoffs we accept

| Lever | Alternative | Chosen |
| --- | --- | --- |
| Pause primitive | Custom message-queue + worker pattern | LangGraph interrupt_before (accept LangGraph as the pause-mechanism vendor) |
| Approval surface | Email / web dashboard | Slack actionable buttons (accept Slack vendor lock for the convenience) |
| Timeout | Indefinite wait | 24h hard cap (ties to Redis TTL from ADR-002) |
| Approver identity | OAuth + dedicated approver app | Slack user identity from the click event |
| Audit trail | Custom audit table | M06 audit log captures every approval/denial via the same Slack event hook |

Consequences (positive)

  • Pause-and-resume is one LangGraph config + 3 lines of graph.aupdate_state(...). No new services.
  • Approver experience is a button click in Slack — on-call engineers approve in ~10s without leaving the channel.
  • Audit trail is automatic: Slack event log + M06 audit log + LangGraph checkpoint history. Three independent records that can be cross-checked against each other.
  • Resume is exact — same state, same next node, same context. Zero re-execution.
  • Timeout aligns with the 24h Redis TTL — no separate timeout subsystem.

Consequences (negative)

  • Slack as critical-path infrastructure. If Slack is down, approvals don't happen. Mitigation: M05's failure-recovery treats "interrupted >2h with no approval" as a metric → on-call page via Prometheus AlertManager.
  • 24h timeout is sometimes too short (overnight runs that interrupt at 11pm UTC). Mitigation: timezone-aware approval routing in M05 — runs interrupted near EOD route to APAC region's Slack channel.
  • Slack user identity isn't a real auth boundary. Anyone with channel access can approve. Mitigation: APPROVAL_CHANNEL is a private channel with explicit membership; M06's compliance audit cross-references the Slack user ID against an approved-approvers list.
  • The Slack approval surface can't carry edits to the pending action. LangGraph can update state while a run is paused (that is what graph.aupdate_state does), but the three buttons only let an approver approve, deny, or escalate; they can't modify the action. Mitigation: M05's "Escalate" button returns the action to the supervisor for redesign instead of granting partial approval.
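The cross-reference against an approved-approvers list, mentioned in the third mitigation above, can be sketched as a simple filter. The record shape and the names `ApprovalEvent` and `unauthorized_events` are assumptions for illustration; the ADR does not show M06's audit schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalEvent:
    thread_id: str
    slack_user_id: str
    decision: str  # "approve", "deny", or "escalate"


def unauthorized_events(
    events: list[ApprovalEvent],
    approved_approvers: frozenset[str],
) -> list[ApprovalEvent]:
    """Return every event whose Slack user is not on the approved list.

    Intended for an after-the-fact audit; it does not block the click itself.
    """
    return [e for e in events if e.slack_user_id not in approved_approvers]


if __name__ == "__main__":
    log = [
        ApprovalEvent("run-1", "U111", "approve"),
        ApprovalEvent("run-2", "U999", "approve"),
    ]
    for event in unauthorized_events(log, frozenset({"U111", "U222"})):
        print(f"unapproved approver {event.slack_user_id} on {event.thread_id}")
```

The same membership check could also run inside the Slack handler before the state update, which would turn the audit into a gate. Doing so is a design choice this ADR does not make.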

Reversal plan

Message-queue approval (full SQS flow): ~2 engineer-weeks. Triggers:

  1. Slack as a vendor becomes unacceptable (compliance, regional restrictions).
  2. Approval volume crosses ~500/day, the point where Slack ergonomics break down.
  3. Approvers need richer state-modification capabilities than Approve/Deny/Escalate.

Implementation: SQS queue + dedicated approver web UI + reconciliation poller. Replaces the Slack handler in src/alerting/slackbot.py; LangGraph contract stays the same (interrupt_before + graph.aupdate_state).
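One way to keep that swap cheap is to put the approval surface behind a small interface and keep the translation into graph state in one place. The sketch below is hypothetical throughout: `ApprovalSurface`, `InMemoryQueueSurface`, and `decision_to_state` are illustrative names. Only the "granted" value appears in this ADR; "denied" and "escalated" are assumed.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class ApprovalRequest:
    thread_id: str
    summary: str


class ApprovalSurface(Protocol):
    """Anything that can deliver an approval request to a human."""

    def request(self, req: ApprovalRequest) -> None: ...


@dataclass
class InMemoryQueueSurface:
    """Stand-in for a queue-backed surface; a real one would call a queue client."""

    outbox: list[ApprovalRequest] = field(default_factory=list)

    def request(self, req: ApprovalRequest) -> None:
        self.outbox.append(req)


# Assumption: only "granted" is shown in the ADR; the other two values are guesses.
_DECISION_TO_APPROVAL = {
    "approve": "granted",
    "deny": "denied",
    "escalate": "escalated",
}


def decision_to_state(decision: str, approver: str) -> dict[str, str]:
    """Build the `values` dict for graph.aupdate_state, whatever the surface."""
    try:
        approval = _DECISION_TO_APPROVAL[decision]
    except KeyError:
        raise ValueError(f"unknown decision: {decision!r}") from None
    return {"approval": approval, "approver": approver}
```

With this split, a Slack handler and a queue consumer would both end by calling `decision_to_state(...)`, so the graph sees the same state update from either surface.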

Email + magic-link approval: ~1 engineer-week. Use when on-call engineers don't live in Slack (rare in 2026).

References

  • src/agents/hitl.py — interrupt + approval-request flow
  • src/alerting/slackbot.py — Slack handler
  • src/observability/recovery.py — timeout + auto-reject scan
  • tests/test_hitl_approval.py — approve/deny/timeout integration tests
  • runbooks/hitl-approval-stuck.md — what to do when an approval is missing
  • ADR-001 (LangGraph orchestrator)
  • ADR-002 (Redis checkpoint persists the paused state)
  • ADR-005 (DEPRECATED — earlier ToolRegistry RBAC mistake; HITL approval is partly a defense-in-depth response)