Context
The original v0 design (ADR-005, originally accepted) had one global ToolRegistry instance shared across all agents. Workers would call registry.get("query_database") and execute the returned tool. This was the simplest possible design and shipped on day 2 of the M02 build-out.
It worked fine until M03 wired in a Quality agent and a Transform agent. Then we hit two production-shaped problems:
- Quality agent gained access to Transform tools. The Quality agent's job is to validate, not to mutate. With a global registry, nothing prevented quality_agent.registry.call("transform_apply_schema_change"); the registry didn't know which agent was asking. We caught this in code review of an unrelated PR; the agent had been "helpfully" applying schema fixes during a routine validation run for ~2 weeks.
- No audit trail of which agent invoked which tool. When a destructive write happened, the audit log said "Tool transform_drop_table executed at 14:22:11". The next question was always "which agent decided to do that?" and the answer was a 20-minute git-blame + LangSmith trace excavation. Useless during incidents.
What was originally decided
# DEPRECATED: see "What we got wrong" below
class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        self._tools[name] = tool

    def get(self, name: str) -> Tool:
        return self._tools[name]

# At startup
registry = ToolRegistry()
registry.register("query_database", DatabaseTool())
registry.register("transform_apply_schema_change", SchemaTool())
# ... 20 more tools

# In every worker
class IngestionWorker:
    def __init__(self, registry: ToolRegistry):
        self.registry = registry  # has access to ALL tools
What we reversed to
RBAC-aware registry with per-agent scope:
# src/tools/registry.py: v2, current
# Tool and ToolNotPermitted are defined elsewhere; AgentRole lives in src/security/rbac.py.
class ToolRegistry:
    def __init__(self, audit_logger):
        self._tools: dict[str, Tool] = {}
        self._scopes: dict[str, set[AgentRole]] = {}  # tool name -> allowed roles
        self._audit = audit_logger  # audit sink, passed through to every scoped view

    def register(self, name: str, tool: Tool, allowed_roles: set[AgentRole]) -> None:
        self._tools[name] = tool
        self._scopes[name] = allowed_roles

    def for_agent(self, role: AgentRole) -> "ScopedToolView":
        """Return a view that only sees tools allowed for this agent role."""
        return ScopedToolView(
            tools={n: t for n, t in self._tools.items() if role in self._scopes[n]},
            audit_logger=self._audit,
            agent_role=role,
        )


class ScopedToolView:
    def __init__(self, tools: dict[str, Tool], audit_logger, agent_role: AgentRole):
        self._tools = tools
        self._audit_logger = audit_logger
        self._agent_role = agent_role

    async def call(self, name: str, *args, **kwargs):
        # Default-deny: a tool not granted to this role is rejected before it runs.
        if name not in self._tools:
            raise ToolNotPermitted(f"Agent {self._agent_role} cannot use {name}")
        self._audit_logger.log_tool_call(self._agent_role, name, args, kwargs)
        return await self._tools[name](*args, **kwargs)
Workers receive a scoped view, not the global registry:
# src/agents/workers.py
class QualityWorker:
    def __init__(self, scoped_tools: ScopedToolView):
        self.tools = scoped_tools  # only validation tools

# Wiring
quality_view = registry.for_agent(AgentRole.QUALITY)      # only validation tools
transform_view = registry.for_agent(AgentRole.TRANSFORM)  # only transform tools
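To make the scoping behavior concrete, here is a minimal, self-contained sketch of the same pattern that can be run on its own. It is not the production module: the two tool functions, the ListAuditLogger class, and the tool name validate_row_counts are hypothetical stand-ins, and the registry constructor is assumed to take the audit logger as an argument.

```python
import asyncio
from enum import Enum


class AgentRole(Enum):
    QUALITY = "quality"
    TRANSFORM = "transform"


class ToolNotPermitted(Exception):
    """Raised when a role calls a tool it was never granted."""


class ListAuditLogger:
    """Hypothetical in-memory audit sink standing in for the real logger."""

    def __init__(self):
        self.entries = []

    def log_tool_call(self, role, name, args, kwargs):
        self.entries.append((role.name, name))


class ScopedToolView:
    def __init__(self, tools, audit_logger, agent_role):
        self._tools = tools
        self._audit_logger = audit_logger
        self._agent_role = agent_role

    async def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotPermitted(f"Agent {self._agent_role} cannot use {name}")
        self._audit_logger.log_tool_call(self._agent_role, name, args, kwargs)
        return await self._tools[name](*args, **kwargs)


class ToolRegistry:
    def __init__(self, audit_logger):
        self._tools = {}
        self._scopes = {}
        self._audit = audit_logger

    def register(self, name, tool, allowed_roles):
        self._tools[name] = tool
        self._scopes[name] = allowed_roles

    def for_agent(self, role):
        visible = {n: t for n, t in self._tools.items() if role in self._scopes[n]}
        return ScopedToolView(visible, self._audit, role)


# Hypothetical tools; any async callable works for this sketch.
async def validate_row_counts(table):
    return f"validated {table}"


async def apply_schema_change(table):
    return f"altered {table}"


async def demo():
    audit = ListAuditLogger()
    registry = ToolRegistry(audit)
    registry.register("validate_row_counts", validate_row_counts, {AgentRole.QUALITY})
    registry.register(
        "transform_apply_schema_change", apply_schema_change, {AgentRole.TRANSFORM}
    )

    quality_view = registry.for_agent(AgentRole.QUALITY)
    allowed = await quality_view.call("validate_row_counts", "orders")
    try:
        await quality_view.call("transform_apply_schema_change", "orders")
        denied = False
    except ToolNotPermitted:
        denied = True
    return allowed, denied, audit.entries


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the denied call raises before the audit logger is reached, so only permitted calls appear in the audit entries. Whether denied attempts should also be logged is a separate design choice that the sketch does not settle.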
Why reversed
Two concrete incidents drove the reversal:
- 2026-04-12: Quality agent applied a schema migration during a validation run because the supervisor's routing decision had wedged on a high-confidence validate-then-fix pattern. The schema change broke the M04 dashboard for 23 minutes. Logged in runbooks/incident-2026-04-12-quality-agent-overreach.md.
- 2026-04-13: Code review caught the Quality agent's tools.call("transform_apply_schema_change") line in a routine PR. The reviewer flagged it; investigation revealed it had been silently working in the regression suite for ~2 weeks, masked by idempotent reapplication.
The first incident was the gating one. The second was the credibility one: global access invites accidents that code review cannot reliably catch.
What we got wrong (and what we'd do again)
Got wrong: treated the tool registry as a directory (one canonical lookup table) when it's actually a capability boundary. Capabilities should be granted to roles, not held in a global namespace anyone can read from.
Would do again: the Tool Protocol shape (async def __call__(self, ...) -> ToolResult) and the registration-by-name pattern. Both still hold. Only the access model changed.
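For readers unfamiliar with the Protocol shape mentioned above, the following is a rough sketch of how it could look with typing.Protocol. Only the call signature comes from this ADR; the ToolResult fields (ok, payload) and the EchoTool class are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class ToolResult:
    # Hypothetical fields; the ADR names the type but not its contents.
    ok: bool
    payload: Any = None


@runtime_checkable
class Tool(Protocol):
    """Anything with a matching async __call__ counts as a Tool."""

    async def __call__(self, *args: Any, **kwargs: Any) -> ToolResult: ...


class EchoTool:
    """Hypothetical tool; satisfies the protocol structurally, no base class."""

    async def __call__(self, *args: Any, **kwargs: Any) -> ToolResult:
        return ToolResult(ok=True, payload={"args": args, "kwargs": kwargs})


# Registration by name: a plain mapping from tool name to tool instance.
tools: dict[str, Tool] = {"echo": EchoTool()}
result = asyncio.run(tools["echo"]("hello", limit=1))
```

Because the protocol is structural, new tools do not need to inherit from anything, which is part of why this shape survived the change in access model.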
Reversal cost
- Schema rewrite (src/tools/registry.py): 1 day
- Worker wiring updates (src/agents/workers.py × 4): 1 day
- Test rewrite + per-agent test fixtures: 1 day
- Audit-log integration: 0.5 day
- Per-tool allowed_roles enumeration (audit each of 20+ tools): 1 day
- Total: ~4.5 engineer-days
Lessons
- "Default-deny" is a feature, not a polish item. v0 was default-allow with a global namespace. v1 is default-deny via the scoped view. The default flips the cognitive load: you have to explicitly grant access in register(...), not explicitly forbid it later.
- Capability boundaries belong at the registry, not the tool. We considered putting role checks inside each tool's __call__. That's per-tool, so it is easy to forget on the next tool added. Registry-level scoping enforces it once for all tools.
- Audit log naming has to identify the agent. The v0 audit log said "Tool X executed". v1 says "Agent Q (role: QUALITY) called Tool X". One word saves 20 minutes of incident response.
- Document the deprecation. The next engineer who proposes "let's just use one global registry" needs to find this ADR and the incident runbook before they re-introduce the bug.
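The audit-log lesson can be illustrated with a small record type that always carries the agent and its role. This is a sketch only: the ToolCallRecord class, its field names, and the agent id "quality-1" are hypothetical, and the timestamp is an arbitrary illustrative value rather than one taken from a real log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ToolCallRecord:
    """Hypothetical audit record: who called what, and when."""

    agent_id: str
    role: str
    tool: str
    at: datetime

    def message(self) -> str:
        # Agent and role come first so the caller is visible at a glance.
        return (
            f"Agent {self.agent_id} (role: {self.role}) "
            f"called Tool {self.tool} at {self.at:%H:%M:%S}"
        )


record = ToolCallRecord(
    agent_id="quality-1",
    role="QUALITY",
    tool="transform_drop_table",
    at=datetime(2026, 1, 1, 14, 22, 11, tzinfo=timezone.utc),  # illustrative only
)
print(record.message())
```

Keeping the agent and role as separate fields, rather than only inside the formatted string, would also let an incident responder filter the log by role.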
References
- src/tools/registry.py: current v2 RBAC-aware implementation
- src/security/rbac.py: AgentRole enum + role assignments
- runbooks/incident-2026-04-12-quality-agent-overreach.md: the incident that drove the reversal
- ADR-001 (LangGraph orchestrator)
- ADR-003 (supervisor-worker topology — workers receive scoped views, not the registry)
- ADR-004 (HITL approval is the second line of defense; RBAC is the first)