The Agent Watchtower, Part 2: Anatomy of an Agent Control Plane

The technical architecture for unified agent governance. Registry, observability, policy, and control - how to build the infrastructure that makes multi-cloud agent governance possible.

In Part 1, we identified the problem: fragmented agent deployments across AWS, Azure, GCP, and open-source frameworks create governance blind spots. The solution is an Agent Control Plane - a unified layer that sits above individual platforms.

This post gets technical. We’ll cover the four core components of an agent control plane, how they work together, and what to consider when building or buying.

The Four Pillars

An effective agent control plane needs four capabilities:

  1. Registry - Know what agents exist
  2. Observability - See what agents do
  3. Policy - Define what agents should do
  4. Control - Enforce boundaries at runtime

Each pillar is necessary. None is sufficient alone.

Pillar 1: The Agent Registry

You can’t govern what you don’t know exists. The registry is the foundation - a single source of truth for all agents across all platforms.

What the Registry Captures

Identity:

  • Agent ID (unique, platform-agnostic)
  • Name, version, description
  • Platform (AWS/Azure/GCP/OSS/Internal)
  • Deployment environment (prod/staging/dev)

Ownership:

  • Owning team and business unit
  • Technical contacts
  • Escalation path

Classification:

  • Risk tier (critical/high/medium/low)
  • Data sensitivity level
  • Regulatory scope (GDPR, SOX, etc.)

Capabilities:

  • Tools/actions the agent can invoke
  • Data sources it can access
  • External APIs it can call

Lifecycle:

  • Creation date, last update
  • Current status (active/deprecated/suspended)
  • Approval chain and audit trail
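The fields above fit naturally into a single registry record. A minimal sketch using Python dataclasses; the field names and enum values are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class RiskTier(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class AgentRecord:
    # Identity
    agent_id: str                       # unique, platform-agnostic
    name: str
    version: str
    platform: str                       # "aws" | "azure" | "gcp" | "oss" | "internal"
    environment: str                    # "prod" | "staging" | "dev"
    # Ownership
    owning_team: str
    technical_contacts: list[str] = field(default_factory=list)
    # Classification
    risk_tier: RiskTier = RiskTier.MEDIUM
    regulatory_scope: list[str] = field(default_factory=list)  # e.g. ["GDPR", "SOX"]
    # Capabilities
    tools: list[str] = field(default_factory=list)
    data_sources: list[str] = field(default_factory=list)
    # Lifecycle
    status: str = "active"              # "active" | "deprecated" | "suspended"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

A real registry adds the audit trail and approval chain on top; the point is that every pillar downstream keys off `agent_id`.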

Registration Patterns

How do agents get into the registry?

Push registration: Agents register themselves at startup. Works for agents you control, but requires code changes.

Pull discovery: The control plane scans platforms for agents. AWS Bedrock agents, Azure AI deployments, LangChain processes - each requires a different discovery mechanism.

CI/CD integration: Registration happens as part of the deployment pipeline. The agent doesn’t deploy unless it’s registered.

Manual entry: Fallback for edge cases. Better to have manual registration than unregistered agents.

Most production implementations use a combination. CI/CD integration for new deployments, pull discovery for existing agents, manual entry for vendor-embedded agents you can’t auto-discover.
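Push registration in particular can be a single idempotent call at agent startup. A sketch with an in-memory dict standing in for the real registry service (a production agent would POST the same payload to the registry's HTTP API; the payload shape is an assumption):

```python
# In-memory stand-in for the registry service.
REGISTRY: dict[str, dict] = {}

def register_agent(agent_id: str, platform: str, version: str) -> dict:
    """Idempotent push registration, intended to run at agent startup."""
    record = {"agent_id": agent_id, "platform": platform,
              "version": version, "status": "active"}
    existing = REGISTRY.get(agent_id)
    if existing and existing["version"] == version:
        return existing          # already registered at this version: no-op
    REGISTRY[agent_id] = record  # new agent, or a version bump worth recording
    return record

register_agent("support-bot", "aws", "1.4.0")
```

Idempotency matters: agents restart often, and re-registration should update the record rather than duplicate it.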

Registry Anti-Patterns

Over-indexing on manual processes. If registration requires filling out a 50-field form, teams won’t do it. Automate everything you can.

Ignoring vendor agents. That Salesforce copilot counts. That embedded chatbot in your ITSM tool counts. If it makes decisions using AI, it belongs in the registry.

Treating the registry as static. Agent configurations change. Capabilities expand. The registry needs continuous sync, not one-time registration.

Pillar 2: Observability

Once you know what agents exist, you need to see what they’re doing. This goes beyond traditional APM - you need to capture agent-specific signals.

The Observability Stack

Telemetry collection: Every agent interaction generates telemetry. Inputs, outputs, tool calls, reasoning traces, latency, token counts, confidence scores.

Trace correlation: When Agent A calls Agent B, you need to follow the thread. Distributed tracing adapted for agent architectures.

Behavioral metrics: Not just “did it respond” but “how did it respond.” Refusal rates, escalation patterns, confidence distributions, topic clustering.

Anomaly detection: ML models that learn normal behavior and flag deviations. When an agent starts behaving differently, you know immediately.

What to Capture

Every interaction:

  • Request ID and timestamp
  • Input (with PII masking)
  • Output (with PII masking)
  • Latency breakdown
  • Token counts (input/output)
  • Model used
  • Confidence score (if available)

Tool invocations:

  • Which tools were called
  • Parameters passed
  • Results returned
  • Whether the call succeeded

Reasoning traces:

  • Chain-of-thought (if using that pattern)
  • Decision points
  • Why alternatives were rejected

Escalations and failures:

  • When did the agent defer to humans?
  • What errors occurred?
  • What was the recovery path?
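The interaction-level fields above can be captured as one event type, with PII masking applied before the event ever leaves the process. A minimal sketch; the crude email regex is a placeholder for a real PII detector:

```python
import re
from dataclasses import dataclass, field
from typing import Optional

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    """Very rough PII masking; real systems use dedicated detectors."""
    return EMAIL.sub("<EMAIL>", text)

@dataclass
class InteractionEvent:
    request_id: str
    timestamp: float
    model: str
    input_text: str
    output_text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    tool_calls: list[dict] = field(default_factory=list)
    confidence: Optional[float] = None   # not every platform exposes one

    def __post_init__(self):
        # Mask at the edge, before the event is shipped anywhere.
        self.input_text = mask_pii(self.input_text)
        self.output_text = mask_pii(self.output_text)
```

Tool invocations and reasoning traces attach to the same `request_id`, which is what makes cross-agent trace correlation possible.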

Cross-Platform Challenges

Each platform has its own telemetry format. AWS Bedrock traces look different from Azure AI logs, which look different from LangChain callbacks.

The control plane needs adapters for each platform - translating native telemetry into a unified schema. This is unglamorous plumbing work, but it’s essential.

Consider:

  • Schema normalization (common fields across platforms)
  • Timestamp synchronization (platforms may have clock drift)
  • Sampling strategies (you can’t store everything at scale)
  • Retention policies (balancing cost vs. audit requirements)
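Schema normalization is the core of that plumbing: per-platform field maps into one common schema. A sketch in which the native field names are invented for illustration, not the actual Bedrock or Azure telemetry keys:

```python
# Per-platform mappings from (assumed) native field names to the unified schema.
FIELD_MAPS = {
    "bedrock": {"traceId": "request_id", "inputTokenCount": "input_tokens"},
    "azure":   {"operation_Id": "request_id", "prompt_tokens": "input_tokens"},
}

def normalize(platform: str, raw: dict) -> dict:
    """Translate a native telemetry record into the unified schema,
    dropping fields the common schema doesn't know about."""
    mapping = FIELD_MAPS[platform]
    out = {"platform": platform}
    for native_key, common_key in mapping.items():
        if native_key in raw:
            out[common_key] = raw[native_key]
    return out
```

The unglamorous part is keeping these maps current as each platform's telemetry evolves.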

Pillar 3: Policy Engine

Observability tells you what happened. Policy defines what should happen. The policy engine translates governance requirements into enforceable rules.

Policy Hierarchy

Enterprise policies: Apply everywhere. Non-negotiable minimums that every agent must meet regardless of platform or business unit.

Domain policies: Apply to specific business domains. Retail banking might have different requirements than internal operations.

Agent policies: Apply to individual agents. Custom rules for specific use cases.

Policies cascade: enterprise → domain → agent. Lower levels can add restrictions but can’t override higher-level prohibitions.
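The cascade rule can be made concrete: merging only ever adds prohibitions and tightens limits. A sketch over two illustrative policy fields; real policies carry many more:

```python
def merge_policies(enterprise: dict, domain: dict, agent: dict) -> dict:
    """Cascade enterprise -> domain -> agent.

    Lower levels may extend 'denied_topics' and tighten numeric limits,
    but can never remove an enterprise-level prohibition or loosen a limit.
    """
    merged = {
        "denied_topics": set(enterprise.get("denied_topics", [])),
        "max_requests_per_min": enterprise.get("max_requests_per_min", float("inf")),
    }
    for layer in (domain, agent):
        merged["denied_topics"] |= set(layer.get("denied_topics", []))
        # A limit can only be tightened, never loosened.
        merged["max_requests_per_min"] = min(
            merged["max_requests_per_min"],
            layer.get("max_requests_per_min", float("inf")),
        )
    return merged
```

An agent-level policy asking for a higher rate limit than the enterprise allows simply has no effect.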

Policy Types

Behavioral boundaries:

  • Topics the agent can/cannot discuss
  • Actions the agent can/cannot take
  • Response formats and constraints

Data access controls:

  • Which data sources are permitted
  • PII handling requirements
  • Cross-border data flow restrictions

Escalation rules:

  • When must the agent defer to humans?
  • What confidence thresholds trigger escalation?
  • How are edge cases handled?

Rate limits:

  • Maximum requests per time period
  • Token budgets (cost control)
  • Concurrent execution limits

Audit requirements:

  • What must be logged?
  • How long must logs be retained?
  • What requires explicit consent?
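Escalation rules are the most mechanical of these to express. A sketch combining a confidence threshold with an always-escalate topic list; the threshold value and topic names are illustrative:

```python
from typing import Optional

def should_escalate(confidence: Optional[float], topic: str,
                    threshold: float = 0.7,
                    always_escalate: frozenset = frozenset({"complaint", "fraud"})) -> bool:
    """Escalate to a human on low confidence or a sensitive topic."""
    if topic in always_escalate:
        return True
    if confidence is None:   # no confidence signal: be conservative
        return True
    return confidence < threshold
```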

Policy Enforcement Points

Policies are only useful if they’re enforced. Where does enforcement happen?

Pre-invocation: Before the agent processes a request. Check whether the request is allowed, whether the caller is authorized, and whether rate limits have been exceeded.

During execution: Monitor tool calls, data access, external API usage. Intervene if policies are violated.

Post-execution: Validate outputs before returning to users. Check for PII leakage, policy violations, confidence thresholds.

The best architectures enforce at all three points. Defense in depth.
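Wrapping the agent call is one way to get pre- and post-execution checks around it. A simplified sketch that elides in-flight tool-call monitoring and uses naive substring matching where a real engine would classify topics:

```python
class PolicyViolation(Exception):
    pass

def enforce(request: str, handler, denied_topics: set) -> str:
    """Defense in depth around a single agent invocation."""
    # Pre-invocation: reject disallowed requests before the agent runs.
    if any(t in request.lower() for t in denied_topics):
        raise PolicyViolation("request touches a denied topic")

    output = handler(request)  # execution (tool-call monitoring elided)

    # Post-execution: validate the output before it reaches the user.
    if any(t in output.lower() for t in denied_topics):
        raise PolicyViolation("output touches a denied topic")
    return output
```

The post-execution check matters even when the pre-check passes: the agent can produce a violating output from an innocuous request.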

Pillar 4: Runtime Control

Observability and policy define intent. The control layer executes that intent - actually intervening in agent behavior when needed.

Control Capabilities

Kill switches: Immediately halt a specific agent, all agents of a type, or all agents in a domain. When something goes wrong, you need to stop the bleeding fast.

Behavior modification: Adjust agent behavior without redeployment. Change confidence thresholds, enable/disable tools, modify escalation rules.

Traffic management: Route requests to different agents based on load, risk, or policy. Canary deployments, A/B testing, gradual rollouts.

Fallback orchestration: When an agent fails, route to alternatives. Human escalation, simpler rule-based fallback, different model.
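Kill switches at multiple scopes reduce to fast set-membership checks consulted on every request. A toy sketch covering agent- and domain-level halts (a real implementation adds agent-type scope, propagation, and audit logging):

```python
class ControlPlane:
    """Runtime kill switches: halt one agent or a whole domain."""

    def __init__(self):
        self.halted_agents: set = set()
        self.halted_domains: set = set()

    def kill(self, agent_id: str) -> None:
        self.halted_agents.add(agent_id)

    def kill_domain(self, domain: str) -> None:
        self.halted_domains.add(domain)

    def may_run(self, agent_id: str, domain: str) -> bool:
        # Checked on every request, so this must stay O(1).
        return (agent_id not in self.halted_agents
                and domain not in self.halted_domains)
```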

Real-Time vs. Near-Real-Time

Some controls must be real-time: kill switches, pre-invocation policy checks. These sit in the request path, so they must be fast.

Other controls can be near-real-time. Anomaly detection might process telemetry with a few seconds delay. Behavioral analysis might run on batched data.

Design for the latency budget your use case allows. Customer-facing agents might need sub-100ms policy checks. Internal batch processing might tolerate longer delays.

The Control Loop

The four pillars form a continuous loop:

  1. Registry tells you what agents exist
  2. Observability shows what they’re doing
  3. Policy defines what they should do
  4. Control enforces boundaries when reality diverges from policy

And then:

  1. Observability captures the intervention
  2. Registry updates agent status
  3. Policy might be refined based on learnings

It’s not a one-time setup. It’s an ongoing operation.

Integration Architecture

How does the control plane connect to agents across platforms?

Adapter Pattern

Each platform gets an adapter that handles:

  • Agent discovery
  • Telemetry collection
  • Policy translation
  • Control execution

The AWS Bedrock adapter knows how to list Bedrock agents, parse Bedrock traces, translate policies into Bedrock guardrails, and invoke Bedrock APIs for control.

The Azure AI adapter knows the equivalent for Azure. And so on.

The control plane core is platform-agnostic. Adapters handle platform-specific concerns.
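The adapter contract can be pinned down as an abstract interface covering the four responsibilities above. A sketch with a toy in-memory adapter standing in for a real Bedrock or Azure implementation; the method names are assumptions:

```python
from abc import ABC, abstractmethod

class PlatformAdapter(ABC):
    """Every platform adapter implements the same four responsibilities."""

    @abstractmethod
    def discover_agents(self) -> list: ...          # agent discovery

    @abstractmethod
    def collect_telemetry(self, agent_id: str) -> list: ...  # telemetry

    @abstractmethod
    def apply_policy(self, agent_id: str, policy: dict) -> None: ...  # policy

    @abstractmethod
    def halt(self, agent_id: str) -> None: ...      # control execution

class InMemoryAdapter(PlatformAdapter):
    """Toy adapter standing in for a real platform adapter."""

    def __init__(self):
        self.agents = {"demo-agent": {"policy": None, "halted": False}}

    def discover_agents(self):
        return list(self.agents)

    def collect_telemetry(self, agent_id):
        return []

    def apply_policy(self, agent_id, policy):
        self.agents[agent_id]["policy"] = policy

    def halt(self, agent_id):
        self.agents[agent_id]["halted"] = True
```

The control plane core only ever talks to `PlatformAdapter`, which is what keeps it platform-agnostic.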

Deployment Options

Sidecar: Deploy a control plane agent alongside each AI agent. Maximum visibility, but operationally complex.

Proxy: Route all agent traffic through the control plane. Centralized enforcement, but potential bottleneck.

SDK: Integrate control plane libraries into agent code. Minimal latency, but requires code changes.

Hybrid: Different patterns for different platforms. Proxy for platforms that support it, SDK for others.

Most enterprise deployments end up hybrid. There’s no one-size-fits-all.

Build vs. Buy

Should you build your own control plane or buy one?

Build considerations

Pros:

  • Full control over architecture
  • Custom integration with existing systems
  • No vendor dependency

Cons:

  • Significant engineering investment
  • Ongoing maintenance burden
  • Opportunity cost

Buy considerations

Pros:

  • Faster time to value
  • Vendor handles updates and maintenance
  • Potentially better than what you’d build

Cons:

  • Vendor lock-in
  • May not fit your exact needs
  • Another system to manage

The realistic middle ground

Most organizations land somewhere in between:

  • Buy core capabilities (registry, basic observability)
  • Build custom adapters for internal platforms
  • Build custom policies for specific requirements
  • Integrate with existing tools (SIEM, ITSM, IAM)

The control plane doesn’t have to be monolithic. It’s a collection of capabilities that can be assembled from different sources.

What Comes Next

We’ve covered the technical architecture. But architecture alone doesn’t create governance. You need an operating model - who decides what, how autonomy is balanced, how policies evolve.

In Part 3: The Autonomy Spectrum, we’ll tackle the human side. How do you give business units enough freedom to innovate while maintaining enough control to satisfy regulators? Hint: it’s not a binary choice.
