ADR-004: HITL via LangGraph `interrupt_before` + Slack actionable buttons | Agentic Data Pipeline

Context

Some agent decisions require a human in the loop:

Destructive writes — bulk DELETE, schema mutation, data deletion for GDPR.
High-cost operations — agent decides to spawn a 200-iteration retry loop.
Compliance gates — multi-tenant data export to a non-default region.

We need a pause-and-wait pattern that:

Persists the run's full state until a human responds.
Resumes from the exact point of pause when approval arrives (no re-running the prefix).
Routes the approval request to a channel an on-call engineer actually watches (Slack, not email).
Auto-rejects after a timeout — runs that don't get approval shouldn't sit forever.
Leaves an audit trail that records who approved or denied which action, and when.

Three patterns considered:

Message-queue approval. Worker writes a request to an SQS/queue; another service picks it up; approver responds via API; queue notifies the worker. Battle-tested. Heavy infrastructure (queue + approver service + reconciliation logic).
DB-flag polling. Worker writes "waiting" status to Postgres; approver toggles a flag; worker polls. Simple. Polling load + flaky resume semantics.
LangGraph interrupt_before. Native graph-level pause. State persists in Redis (per ADR-002). Resume is graph.update_state(...) with the approver's input, followed by re-invoking the graph with no new input. Lean infrastructure, deeply integrated with the orchestrator.

Decision

Use LangGraph interrupt_before for the pause; Slack actionable buttons for the approval surface.

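# src/orchestration/graph.py
# `builder` (assumed name) is the StateGraph assembled earlier in this module; compile() returns the runnable graph.
graph = builder.compile(
    checkpointer=RedisSaver(...),
    # Pause before each node that needs human sign-off.
    interrupt_before=["transform_destructive", "schema_mutate", "compliance_export"],
)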

# src/agents/hitl.py
async def request_approval(thread_id: str, action: PendingAction) -> None:
    """Post Slack message with Approve / Deny / Escalate buttons."""
    await slack.chat_postMessage(
        channel=APPROVAL_CHANNEL,
        text=f"Agent run {thread_id} needs approval for: {action.summary}",
        blocks=[
            section_block(action.detail),
            actions_block([
                button("Approve", value=f"{thread_id}:approve"),
                button("Deny", value=f"{thread_id}:deny"),
                button("Escalate", value=f"{thread_id}:escalate"),
            ]),
        ],
    )

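# src/alerting/slackbot.py: handles the button click
# Bolt routes on action_id, so the Approve button must be built with action_id="approve".
@app.action("approve")
async def handle_approve(ack, body):
    await ack()  # acknowledge within Slack's 3-second window before doing any work
    # rsplit keeps thread IDs that themselves contain ":" intact
    thread_id, _ = body["actions"][0]["value"].rsplit(":", 1)
    config = {"configurable": {"thread_id": thread_id}}
    await graph.aupdate_state(
        config,
        {"approval": "granted", "approver": body["user"]["id"]},
    )
    await graph.ainvoke(None, config)  # input=None resumes from the checkpoint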
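
The handler above covers only the Approve button. As a minimal sketch of how all three button values might map to the state update passed to graph.aupdate_state, assuming the "thread_id:decision" encoding used in request_approval: the value "granted" comes from the handler above, while "denied", "escalated", and the helper names parse_button_value and state_update_for are hypothetical and not part of the codebase described here.

```python
# Hypothetical helpers: a sketch, not the contents of src/alerting/slackbot.py.
# Maps each button decision to the value written to the "approval" state key.
# "granted" matches handle_approve; "denied" and "escalated" are assumed names.
APPROVAL_VALUES = {
    "approve": "granted",
    "deny": "denied",
    "escalate": "escalated",
}


def parse_button_value(value: str) -> tuple[str, str]:
    """Split "thread_id:decision" on the last colon, so thread IDs may contain ":"."""
    thread_id, sep, decision = value.rpartition(":")
    if not sep or not thread_id or decision not in APPROVAL_VALUES:
        raise ValueError(f"malformed button value: {value!r}")
    return thread_id, decision


def state_update_for(value: str, slack_user_id: str) -> tuple[str, dict[str, str]]:
    """Return (thread_id, values) in the shape graph.aupdate_state expects."""
    thread_id, decision = parse_button_value(value)
    return thread_id, {
        "approval": APPROVAL_VALUES[decision],
        "approver": slack_user_id,
    }
```

Rejecting malformed values up front means a tampered or truncated button payload never reaches the checkpoint.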

The 24-hour timeout is enforced by the Redis TTL on the checkpoint (per ADR-002). After expiry, M05's failure-recovery scan auto-rejects the run with a "stale" reason and emits a metric.
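
The staleness check behind the auto-reject could look roughly like the sketch below. It assumes the recovery scan can list paused runs together with the time each was interrupted; PausedRun, find_stale, and APPROVAL_TIMEOUT are illustrative names, and the actual logic in src/observability/recovery.py is not shown in this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

APPROVAL_TIMEOUT = timedelta(hours=24)  # the 24h cap chosen in this ADR


@dataclass(frozen=True)
class PausedRun:
    """Hypothetical record of a run waiting on approval."""
    thread_id: str
    interrupted_at: datetime  # timezone-aware timestamp of the pause


def find_stale(runs: list[PausedRun], now: datetime) -> list[str]:
    """Return thread IDs whose approval window has fully elapsed."""
    return [r.thread_id for r in runs if now - r.interrupted_at >= APPROVAL_TIMEOUT]
```

Passing `now` in as an argument keeps the check deterministic under test. Each returned thread ID would then be rejected with the "stale" reason and counted in the emitted metric.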

Tradeoffs we accept

Lever	Alternative	Chosen
Pause primitive	Custom message-queue + worker pattern	LangGraph `interrupt_before` — accept LangGraph as the pause-mechanism vendor
Approval surface	Email / web dashboard	Slack actionable buttons — accept Slack vendor lock for the convenience
Timeout	Indefinite wait	24h hard cap (ties to Redis TTL from ADR-002)
Approver identity	OAuth + dedicated approver app	Slack user identity from the click event
Audit trail	Custom audit table	M06 audit log captures every approval/denial via the same Slack event hook

Consequences (positive)

Pause-and-resume is one LangGraph config + 3 lines of graph.aupdate_state(...). No new services.
Approver experience is a button click in Slack — on-call engineers approve in ~10s without leaving the channel.
Audit trail is automatic: Slack event log + M06 audit log + LangGraph checkpoint history. Three sources of truth that agree.
Resume is exact — same state, same next node, same context. Zero re-execution.
Timeout aligns with the 24h Redis TTL — no separate timeout subsystem.

Consequences (negative)

Slack as critical-path infrastructure. If Slack is down, approvals don't happen. Mitigation: M05's failure-recovery exports "interrupted >2h with no approval" as a metric, which pages on-call through Prometheus Alertmanager.
24h timeout is sometimes too short (overnight runs that interrupt at 11pm UTC). Mitigation: timezone-aware approval routing in M05 — runs interrupted near EOD route to APAC region's Slack channel.
Slack user identity isn't a real auth boundary. Anyone with channel access can approve. Mitigation: APPROVAL_CHANNEL is a private channel with explicit membership; M06's compliance audit cross-references the Slack user ID against an approved-approvers list.
The approval surface doesn't support modifying the action mid-pause. LangGraph's update_state can write state while the run is paused, but the Slack buttons carry only a decision, so the approver can approve or deny and cannot edit the action. Mitigation: M05's "Escalate" button returns the action to the supervisor for redesign instead of granting partial approval.
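
Two of the mitigations above, the approved-approvers cross-reference and the timezone-aware routing, can be sketched as follows. Every constant here is a placeholder rather than a value from this ADR: the user IDs, the channel names, and the 21:00 UTC cut-off standing in for "near EOD" are all assumptions, as are the function names.

```python
from datetime import datetime, timezone

# Placeholders only: none of these values come from the ADR.
APPROVED_APPROVERS = frozenset({"U01EXAMPLE", "U02EXAMPLE"})
DEFAULT_CHANNEL = "#agent-approvals"
APAC_CHANNEL = "#agent-approvals-apac"
EOD_START_HOUR_UTC = 21  # assumed reading of "near EOD"


def is_authorized(slack_user_id: str) -> bool:
    """Check the clicking user against the approved-approvers list."""
    return slack_user_id in APPROVED_APPROVERS


def approval_channel(interrupted_at: datetime) -> str:
    """Route late-evening UTC interrupts to the APAC channel.

    Expects a timezone-aware datetime; it is normalized to UTC first.
    """
    hour = interrupted_at.astimezone(timezone.utc).hour
    return APAC_CHANNEL if hour >= EOD_START_HOUR_UTC else DEFAULT_CHANNEL
```

The ADR places the allowlist check in M06's audit, after the click. The same predicate could also run inside the Slack handler before the state update, which would turn it from a detective control into a preventive one.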

Reversal plan

Message-queue approval (full SQS flow): ~2 engineer-weeks. Triggers:

Slack as a vendor becomes unacceptable (compliance, regional restrictions).
Approval volume crosses ~500/day where Slack ergonomics break down.
Approvers need richer state-modification capabilities than Approve/Deny/Escalate.

Implementation: SQS queue + dedicated approver web UI + reconciliation poller. Replaces the Slack handler in src/alerting/slackbot.py; LangGraph contract stays the same (interrupt_before + graph.aupdate_state).
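
One way to keep that swap cheap is to have the orchestrator depend on a small interface rather than on Slack directly. The sketch below is an assumption about how such a seam might look, not a description of the current code: ApprovalSurface, InMemorySurface, and pause_for_approval are hypothetical names, and the in-memory class merely stands in for a Slack- or SQS-backed implementation.

```python
import asyncio
from typing import Protocol


class ApprovalSurface(Protocol):
    """Anything that can deliver an approval request to a human."""

    async def request_approval(self, thread_id: str, summary: str) -> None: ...


class InMemorySurface:
    """Stand-in for a Slack- or SQS-backed surface; records requests in a list."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    async def request_approval(self, thread_id: str, summary: str) -> None:
        self.sent.append((thread_id, summary))


async def pause_for_approval(surface: ApprovalSurface, thread_id: str, summary: str) -> None:
    """Orchestrator-side call; identical whichever surface is plugged in."""
    await surface.request_approval(thread_id, summary)


if __name__ == "__main__":
    surface = InMemorySurface()
    asyncio.run(pause_for_approval(surface, "run-42", "bulk DELETE"))
    print(surface.sent)
```

With this shape, replacing Slack would mean writing one new class that satisfies the protocol, while the interrupt and resume calls stay untouched.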

Email + magic-link approval: ~1 engineer-week. Use when on-call engineers don't live in Slack (rare in 2026).

References

src/agents/hitl.py — interrupt + approval-request flow
src/alerting/slackbot.py — Slack handler
src/observability/recovery.py — timeout + auto-reject scan
tests/test_hitl_approval.py — approve/deny/timeout integration tests
runbooks/hitl-approval-stuck.md — what to do when an approval is missing
ADR-001 (LangGraph orchestrator)
ADR-002 (Redis checkpoint persists the paused state)
ADR-005 (DEPRECATED — earlier ToolRegistry RBAC mistake; HITL approval is partly a defense-in-depth response)