Context
Some agent decisions require a human in the loop:
- Destructive writes — bulk DELETE, schema mutation, data deletion for GDPR.
- High-cost operations — agent decides to spawn a 200-iteration retry loop.
- Compliance gates — multi-tenant data export to a non-default region.
We need a pause-and-wait pattern that:
- Persists the run's full state until a human responds.
- Resumes from the exact point of pause when approval arrives (no re-running the prefix).
- Routes the approval request to a channel an on-call engineer actually watches (Slack, not email).
- Auto-rejects after a timeout — runs that don't get approval shouldn't sit forever.
- Has a usable audit trail.
Three patterns considered:
- Message-queue approval. Worker writes a request to a queue (e.g. SQS); another service picks it up; approver responds via API; queue notifies the worker. Battle-tested. Heavy infrastructure (queue + approver service + reconciliation logic).
- DB-flag polling. Worker writes "waiting" status to Postgres; approver toggles a flag; worker polls. Simple. Polling load + flaky resume semantics.
- LangGraph `interrupt_before`. Native graph-level pause. State persists in Redis (per ADR-002). Resume is `graph.update_state(...)` from the approver's input. Lean infrastructure, deeply integrated with the orchestrator.
Decision
Use LangGraph interrupt_before for the pause; Slack actionable buttons for the approval surface.
```python
# src/orchestration/graph.py
# The graph pauses before each node listed in interrupt_before.
graph = compiled_graph.compile(
    checkpointer=RedisSaver(...),
    interrupt_before=["transform_destructive", "schema_mutate", "compliance_export"],
)
```
```python
# src/agents/hitl.py
async def request_approval(thread_id: str, action: PendingAction) -> None:
    """Post a Slack message with Approve / Deny / Escalate buttons."""
    await slack.chat_postMessage(
        channel=APPROVAL_CHANNEL,
        # Fallback text for notifications and clients that don't render blocks.
        text=f"Agent run {thread_id} needs approval for: {action.summary}",
        blocks=[
            section_block(action.detail),
            actions_block([
                # Each value carries the thread_id so the handler can find the run.
                button("Approve", value=f"{thread_id}:approve"),
                button("Deny", value=f"{thread_id}:deny"),
                button("Escalate", value=f"{thread_id}:escalate"),
            ]),
        ],
    )
```
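```python
# src/alerting/slackbot.py: handles the Approve button click
@app.action("approve")
async def handle_approve(ack, body):
    await ack()  # acknowledge within Slack's 3-second window before doing any work
    # Split on the last ":" so a thread_id that itself contains ":" stays intact.
    thread_id, _ = body["actions"][0]["value"].rsplit(":", 1)
    config = {"configurable": {"thread_id": thread_id}}
    # Record the decision on the paused checkpoint, then resume from it.
    await graph.aupdate_state(
        config=config,
        values={"approval": "granted", "approver": body["user"]["id"]},
    )
    await graph.ainvoke(None, config)  # None input resumes from the checkpoint
```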
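Only the Approve handler is shown above. The following is a minimal sketch of how all three button values might map to state updates, assuming the `thread_id:decision` value format used in `request_approval`. The `denied` and `escalated` state values and the helper names `parse_button_value` and `state_update_for` are hypothetical; they are not taken from the codebase.

```python
from typing import Dict, Tuple

# Hypothetical mapping from button decision to the value written into graph state.
# Only "granted" appears in the documented handler; the other two are assumptions.
APPROVAL_STATES: Dict[str, str] = {
    "approve": "granted",
    "deny": "denied",
    "escalate": "escalated",
}


def parse_button_value(value: str) -> Tuple[str, str]:
    """Split '<thread_id>:<decision>' on the last colon and validate both parts."""
    thread_id, sep, decision = value.rpartition(":")
    if not sep or not thread_id:
        raise ValueError(f"malformed button value: {value!r}")
    if decision not in APPROVAL_STATES:
        raise ValueError(f"unknown decision: {decision!r}")
    return thread_id, decision


def state_update_for(value: str, approver_id: str) -> Tuple[str, Dict[str, str]]:
    """Return (thread_id, values) in the shape graph.aupdate_state(config, values) takes."""
    thread_id, decision = parse_button_value(value)
    return thread_id, {"approval": APPROVAL_STATES[decision], "approver": approver_id}
```

Validating the decision before touching the checkpoint means a malformed or replayed payload fails in the handler rather than writing an unexpected value into run state.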
24-hour timeout enforced by Redis TTL on the checkpoint (per ADR-002). After expiry, M05's failure-recovery scan auto-rejects with a "stale" reason and emits a metric.
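The timeout behaviour can be sketched as a pure classification step, assuming the recovery scan can list paused runs together with the time each was interrupted. `PausedRun`, `classify`, `scan`, and the constant names are hypothetical, chosen for illustration; the real scan lives in M05 and may differ. The 2-hour threshold is the paging threshold this ADR uses for approvals that have gone unanswered.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List

ALERT_AFTER = timedelta(hours=2)    # page on-call if there is no decision by then
REJECT_AFTER = timedelta(hours=24)  # hard cap, matching the checkpoint TTL


@dataclass(frozen=True)
class PausedRun:
    thread_id: str
    interrupted_at: datetime  # timezone-aware UTC timestamp


def classify(run: PausedRun, now: datetime) -> str:
    """Return 'stale', 'alert', or 'pending' for one paused run."""
    waited = now - run.interrupted_at
    if waited >= REJECT_AFTER:
        return "stale"
    if waited > ALERT_AFTER:
        return "alert"
    return "pending"


def scan(runs: List[PausedRun], now: datetime) -> Dict[str, List[str]]:
    """Group thread IDs by classification so the caller can reject or page."""
    buckets: Dict[str, List[str]] = {"stale": [], "alert": [], "pending": []}
    for run in runs:
        buckets[classify(run, now)].append(run.thread_id)
    return buckets
```

A caller would auto-reject everything in `stale` with a "stale" reason and increment a metric. One caveat worth checking: once the checkpoint has expired from Redis, the scan needs its own record of which runs were paused, so this sketch assumes such a record exists.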
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Pause primitive | Custom message-queue + worker pattern | LangGraph interrupt_before — accept LangGraph as the pause-mechanism vendor |
| Approval surface | Email / web dashboard | Slack actionable buttons — accept Slack vendor lock for the convenience |
| Timeout | Indefinite wait | 24h hard cap (ties to Redis TTL from ADR-002) |
| Approver identity | OAuth + dedicated approver app | Slack user identity from the click event |
| Audit trail | Custom audit table | M06 audit log captures every approval/denial via the same Slack event hook |
Consequences (positive)
- Pause-and-resume is one LangGraph config + 3 lines of `graph.aupdate_state(...)`. No new services.
- Approver experience is a button click in Slack: on-call engineers approve in ~10s without leaving the channel.
- Audit trail is automatic: Slack event log + M06 audit log + LangGraph checkpoint history. Three sources of truth that agree.
- Resume is exact — same state, same next node, same context. Zero re-execution.
- Timeout aligns with the 24h Redis TTL — no separate timeout subsystem.
Consequences (negative)
- Slack as critical-path infrastructure. If Slack is down, approvals don't happen. Mitigation: M05's failure-recovery treats "interrupted >2h with no approval" as a metric → on-call page via Prometheus AlertManager.
- 24h timeout is sometimes too short (overnight runs that interrupt at 11pm UTC). Mitigation: timezone-aware approval routing in M05 — runs interrupted near EOD route to APAC region's Slack channel.
- Slack user identity isn't a real auth boundary. Anyone with channel access can approve. Mitigation: `APPROVAL_CHANNEL` is a private channel with explicit membership; M06's compliance audit cross-references the Slack user ID against an approved-approvers list.
- LangGraph `interrupt_before`, as wired here, doesn't let the approver change the pending action mid-pause. The approver can approve or deny; they can't modify the action. Mitigation: M05's "Escalate" button returns the action to the supervisor for redesign instead of granting partial approval.
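The allowlist cross-reference mentioned above might look like the following sketch. The event shape (`thread_id`, `user_id`, `decision`) and the function name `unauthorized_approvals` are assumptions for illustration; M06's actual audit schema is not shown in this ADR.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical audit record: one dict per Slack button click.
ApprovalEvent = Dict[str, str]


def unauthorized_approvals(
    events: Iterable[ApprovalEvent],
    approved_approvers: FrozenSet[str],
) -> List[ApprovalEvent]:
    """Return every approve event whose Slack user ID is not on the allowlist."""
    return [
        event
        for event in events
        if event["decision"] == "approve"
        and event["user_id"] not in approved_approvers
    ]
```

Because a check like this runs after the fact, it detects an unauthorized approval rather than preventing one. Applying the same allowlist inside the Slack handler, before `graph.aupdate_state(...)` is called, would turn it into a gate.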
Reversal plan
Message-queue approval (full SQS flow): ~2 engineer-weeks. Triggers:
- Slack as a vendor becomes unacceptable (compliance, regional restrictions).
- Approval volume crosses ~500/day where Slack ergonomics break down.
- Approvers need richer state-modification capabilities than Approve/Deny/Escalate.
Implementation: SQS queue + dedicated approver web UI + reconciliation poller. Replaces the Slack handler in src/alerting/slackbot.py; LangGraph contract stays the same (interrupt_before + graph.aupdate_state).
Email + magic-link approval: ~1 engineer-week. Use when on-call engineers don't live in Slack (rare in 2026).
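One way to keep either reversal cheap is to hide the approval channel behind a small interface, so only the transport changes while the `interrupt_before` + `graph.aupdate_state` contract stays put. This is a sketch under that assumption; `ApprovalSurface`, `InMemorySurface`, and `request_approval_via` are hypothetical names, not existing modules.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class ApprovalSurface(Protocol):
    """Anything that can show a pending action to a human approver."""

    def request(self, thread_id: str, summary: str) -> None: ...


@dataclass
class InMemorySurface:
    """Stand-in for a Slack, SQS, or email transport; records requests for tests."""

    sent: List[Tuple[str, str]] = field(default_factory=list)

    def request(self, thread_id: str, summary: str) -> None:
        self.sent.append((thread_id, summary))


def request_approval_via(surface: ApprovalSurface, thread_id: str, summary: str) -> None:
    """Orchestrator-side call site; it never names a concrete transport."""
    surface.request(thread_id, summary)
```

With this shape, swapping Slack for a queue-backed or email-backed surface is a new class that satisfies `ApprovalSurface`; the code that writes the decision into graph state and resumes the run would not need to change.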
References
- `src/agents/hitl.py`: interrupt + approval-request flow
- `src/alerting/slackbot.py`: Slack handler
- `src/observability/recovery.py`: timeout + auto-reject scan
- `tests/test_hitl_approval.py`: approve/deny/timeout integration tests
- `runbooks/hitl-approval-stuck.md`: what to do when an approval is missing
- ADR-001 (LangGraph orchestrator)
- ADR-002 (Redis checkpoint persists the paused state)
- ADR-005 (DEPRECATED — earlier ToolRegistry RBAC mistake; HITL approval is partly a defense-in-depth response)