Skip to content
Back to Agentic Data Pipeline

Single global ToolRegistry without RBAC scoping (DEPRECATED)

✗ DeprecatedAgentic Data Pipeline02 — Tool Implementation (originally) · M03 (deprecation)
By AI-DE Engineering Team·Stakeholders: platform owner, security engineer, ML engineer

Context

Original v0 design (ADR-005, originally accepted) had one global ToolRegistry instance shared across all agents. Workers would registry.get("query_database") and execute it. This was the simplest possible design and shipped on day 2 of M02 build-out.

It worked fine until M03 wired in a Quality agent and a Transform agent. Then we hit two production-shaped problems:

  1. Quality agent gained access to Transform tools. The Quality agent's job is to validate, not to mutate. With a global registry, nothing prevented quality_agent.registry.call("transform_apply_schema_change") — the registry didn't know which agent was asking. We caught this in code review of an unrelated PR; the agent had been "helpfully" applying schema fixes during a routine validation run for ~2 weeks.
  2. No audit trail of which agent invoked which tool. When a destructive write happened, the audit log said "Tool transform_drop_table executed at 14:22:11". The next question was always "which agent decided to do that?" and the answer was a 20-minute git-blame + LangSmith trace excavation. Useless during incidents.

What was originally decided

# DEPRECATED — see "What we got wrong" below
class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, name: str, tool: Tool) -> None:
        self._tools[name] = tool

    def get(self, name: str) -> Tool:
        return self._tools[name]

# At startup
registry = ToolRegistry()
registry.register("query_database", DatabaseTool())
registry.register("transform_apply_schema_change", SchemaTool())
# ... 20 more tools

# In every worker
class IngestionWorker:
    def __init__(self, registry: ToolRegistry):
        self.registry = registry  # has access to ALL tools

What we reversed to

RBAC-aware registry with per-agent scope:

# src/tools/registry.py — v2, current
class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}
        self._scopes: dict[str, set[AgentRole]] = {}    # tool -> allowed agents

    def register(self, name: str, tool: Tool, allowed_roles: set[AgentRole]) -> None:
        self._tools[name] = tool
        self._scopes[name] = allowed_roles

    def for_agent(self, role: AgentRole) -> "ScopedToolView":
        """Return a view that only sees tools allowed for this agent role."""
        return ScopedToolView(
            tools={n: t for n, t in self._tools.items() if role in self._scopes[n]},
            audit_logger=self._audit,
            agent_role=role,
        )


class ScopedToolView:
    async def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotPermitted(f"Agent {self._agent_role} cannot use {name}")
        self._audit_logger.log_tool_call(self._agent_role, name, args, kwargs)
        return await self._tools[name](*args, **kwargs)

Workers receive a scoped view, not the global registry:

# src/agents/workers.py
class QualityWorker:
    def __init__(self, scoped_tools: ScopedToolView):
        self.tools = scoped_tools     # only validation tools

# Wiring
quality_view = registry.for_agent(AgentRole.QUALITY)   # only validation tools
transform_view = registry.for_agent(AgentRole.TRANSFORM)  # only transform tools

Why reversed

Two concrete incidents drove the reversal:

  1. 2026-04-12: Quality agent applied a schema migration during a validation run because the supervisor's routing decision had wedged on a high-confidence validate-then-fix pattern. The schema change broke the M04 dashboard for 23 minutes. Logged in runbooks/incident-2026-04-12-quality-agent-overreach.md.
  2. 2026-04-13: Code review caught the Quality agent's tools.call("transform_apply_schema_change") line in a routine PR. The reviewer flagged it; investigation revealed it had been silently working in the regression suite for ~2 weeks, masked by idempotent reapplication.

The first incident was the gating one. The second was the credibility one — global access invites accidents we can't see in CR.

What we got wrong (and what we'd do again)

Got wrong: treated the tool registry as a directory (one canonical lookup table) when it's actually a capability boundary. Capabilities should be granted to roles, not held in a global namespace anyone can read from.

Would do again: the Tool Protocol shape (async def __call__(self, ...) -> ToolResult) and the registration-by-name pattern. Both still hold. Only the access model changed.

Reversal cost

  • Schema rewrite (src/tools/registry.py): 1 day
  • Worker wiring updates (src/agents/workers.py × 4): 1 day
  • Test rewrite + per-agent test fixtures: 1 day
  • Audit-log integration: 0.5 day
  • Per-tool allowed_roles enumeration (audit each of 20+ tools): 1 day
  • Total: ~4.5 engineer-days

Lessons

  • "Default-deny" is a feature, not a polish item. v0 was default-allow with a global namespace. v1 is default-deny via the scoped view. The default flips the cognitive load: you have to explicitly grant access in register(...), not explicitly forbid it later.
  • Capability boundaries belong at the registry, not the tool. We considered putting role checks inside each tool's __call__. That's per-tool — easy to forget on the next tool added. Registry-level scoping enforces it once for all tools.
  • Audit log naming has to identify the agent. The v0 audit log said "Tool X executed". v1 says "Agent Q (role: QUALITY) called Tool X". One word saves 20 minutes of incident response.
  • Document the deprecation. The next engineer who proposes "let's just use one global registry" needs to find this ADR and the incident runbook before they re-introduce the bug.

References

  • src/tools/registry.py — current v2 RBAC-aware implementation
  • src/security/rbac.py — AgentRole enum + role assignments
  • runbooks/incident-2026-04-12-quality-agent-overreach.md — the incident that drove the reversal
  • ADR-001 (LangGraph orchestrator)
  • ADR-003 (supervisor-worker topology — workers receive scoped views, not the registry)
  • ADR-004 (HITL approval is the second line of defense; RBAC is the first)
Built into the project

This decision shipped as part of Agentic Data Pipeline — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open