Office Hours — How do you architect systems where AI agents can safely execute code or access tools without human review on every action?

This is the core tension of production agentic systems right now. You can’t review every action, but you also can’t let agents run completely loose. The answer isn’t a single safeguard; it’s a layered system in which you trade autonomy against safety depending on what the agent is actually trying to do.

Capability-Gated Execution

Start by being ruthlessly specific about what an agent is allowed to do. Instead of giving Claude Opus 4.8 or GPT-5.5 blanket access to your infrastructure, define discrete capabilities and gate them behind verified request patterns.

A real example: if you’re deploying a code-writing agent, don’t let it run git push directly. Instead, create a wrapper that:

Parses the agent’s intent from structured output (using function calling or JSON mode).
Validates the request against a whitelist of allowed operations.
Executes only the validated subset.
Returns the result back to the agent.

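# Simplified pattern
import subprocess


class CapabilityGate:
    ALLOWED_GIT_OPERATIONS = {"status", "diff", "add", "commit"}

    def _parse_operation(self, agent_request: str) -> str:
        # Illustrative stand-in: treat the first token as the git subcommand.
        # A real gate would parse structured (JSON / function-call) output.
        tokens = agent_request.split()
        return tokens[0] if tokens else ""

    def execute_git(self, agent_request: str) -> str:
        operation = self._parse_operation(agent_request)

        if operation not in self.ALLOWED_GIT_OPERATIONS:
            return f"Operation {operation!r} not allowed"

        # Only the validated subcommand runs; any extra arguments are dropped
        # here, and a fuller gate would validate them separately.
        return subprocess.run(
            ["git", operation],
            capture_output=True,
            text=True,
            timeout=30,
        ).stdout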

The goal: agents can’t accidentally (or maliciously) run git reset --hard origin/main or deploy to production. You’ve reduced the surface area to specific, auditable commands.

Observability as the Real Safeguard

You can’t prevent all mistakes, but you can detect them fast. The teams getting reliable autonomous agents in production right now are treating observability as a first-class safeguard, not an afterthought.

Log every agent action with enough context to understand what it was trying to do and what actually happened. Include:

The agent’s stated intent (what it said it wanted to do).
The actual request it made (the parsed, validated command).
The result it got back.
Any divergence between intent and outcome.
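A minimal sketch of what one such log record could look like, using only the standard library. The field names, the `agent.audit` logger name, and the caller-supplied `diverged` flag are assumptions for illustration, not a fixed schema:

```python
import json
import logging
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical logger name; point it at whatever backend you search.
logger = logging.getLogger("agent.audit")


@dataclass
class ActionRecord:
    agent_id: str
    stated_intent: str      # what the agent said it wanted to do
    validated_request: str  # the parsed, validated command that actually ran
    result: str             # what came back
    diverged: bool          # set by the caller when outcome != intent


def log_action(record: ActionRecord) -> str:
    """Serialize one action as a single JSON line and emit it."""
    payload = asdict(record)
    payload["timestamp"] = datetime.now(timezone.utc).isoformat()
    line = json.dumps(payload, sort_keys=True)
    logger.info(line)
    return line
```

One JSON object per line keeps the records easy to filter later, for example by `diverged` or by `agent_id`.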

Then set up alerts for:

Repeated failures on the same task (sign of a failure loop).
Operations on sensitive resources (database writes, credential access, external API calls to unfamiliar endpoints).
Unusual patterns (an agent suddenly making 100 API calls when it normally makes 5).

This is where tools like Arize, Datadog, or even just structured logging to a searchable backend become essential. You’re not preventing the agent from failing; you’re ensuring you catch it within seconds, not hours.
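If you are rolling your own before adopting a vendor, two of the alerts above (failure loops and unusual call volume) can be sketched in a few lines. The class name, thresholds, and alert strings here are hypothetical, and a real setup would push alerts to a pager or dashboard rather than return them:

```python
from collections import Counter
from typing import List


class AnomalyMonitor:
    """Flags repeated failures on one task and call-volume spikes."""

    def __init__(self, max_repeat_failures: int = 3,
                 baseline_calls: int = 5, spike_factor: float = 10.0):
        self.max_repeat_failures = max_repeat_failures
        self.baseline_calls = baseline_calls
        self.spike_factor = spike_factor
        self.failures = Counter()      # consecutive failures per task
        self.calls_this_session = 0

    def record(self, task_id: str, succeeded: bool) -> List[str]:
        alerts = []
        self.calls_this_session += 1

        if succeeded:
            self.failures[task_id] = 0  # a success resets the streak
        else:
            self.failures[task_id] += 1
            if self.failures[task_id] >= self.max_repeat_failures:
                alerts.append(f"failure loop suspected on task {task_id}")

        # Compare session volume against the expected baseline.
        if self.calls_this_session > self.baseline_calls * self.spike_factor:
            alerts.append("call volume far above baseline")
        return alerts
```

Call `record()` once per agent action, right next to wherever you write the log entry, so detection lags the action by one call at most.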

Cost and Rate Limiting as a Practical Brake

This is unglamorous but essential: rate limits and cost budgets are real safeguards. If an agent starts looping (repeatedly trying the same failing operation), it will eventually hit a cost ceiling or request limit before it causes catastrophic damage.

Set per-action costs and per-session budgets. If a code-writing agent normally uses 500 tokens per task, give it a 2000-token budget per session. If it burns through that without completing the task, stop and escalate to human review.

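class AgentBudget:
    """Tracks the tokens and API calls an agent has left in one session."""

    def __init__(self, max_tokens: int, max_api_calls: int):
        self.tokens_remaining = max_tokens
        self.calls_remaining = max_api_calls

    def deduct(self, tokens_used: int, calls_used: int = 1) -> bool:
        # Refuse the action outright if either budget would go negative;
        # a False return is the caller's signal to stop and escalate.
        if tokens_used > self.tokens_remaining:
            return False
        if calls_used > self.calls_remaining:
            return False

        # Both checks passed, so charge the action against the session.
        self.tokens_remaining -= tokens_used
        self.calls_remaining -= calls_used
        return True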

Qwen3.7-Max achieved 35 hours of autonomous operation on chip optimization in May 2026, but even that had explicit checkpoints where humans could intervene. The difference between a safe autonomous system and a runaway loop is often just good budget instrumentation.

Deterministic State Machines for High-Stakes Decisions

For operations that matter (API calls to production systems, database writes, credential access), don’t rely on pure LLM reasoning. Use the agent to propose an action, then validate it against a deterministic state machine that enforces the rules you actually care about.

Example: an agent wants to deploy code. Instead of letting it call your deployment API directly, route it through a gate that:

Parses the deployment request.
Checks preconditions (tests passing, no uncommitted changes, branch policy met).
Generates a dry-run plan.
Compares the plan against known-safe patterns.
Executes only if the plan is deterministically safe.

This is what GitHub’s infrastructure team figured out when they started scaling Copilot’s agentic capabilities. You can’t make the agent smarter about safety; you make the infrastructure smarter and reduce the agent’s degrees of freedom.
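To make the deployment gate concrete, here is a rough sketch of the propose-then-validate flow as a small state machine. Everything named here (`DeployRequest`, the branch allowlist, the set of safe plan steps) is a hypothetical stand-in for your own policy, not any particular vendor's implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple


class State(Enum):
    RECEIVED = auto()
    PRECONDITIONS_OK = auto()
    PLAN_APPROVED = auto()
    REJECTED = auto()


@dataclass(frozen=True)
class DeployRequest:
    branch: str
    tests_passing: bool
    working_tree_clean: bool
    planned_steps: List[str]  # output of the dry run


# Hypothetical policy values; replace with your own rules.
ALLOWED_BRANCHES = {"main", "release"}
SAFE_STEPS = {"build_image", "run_migrations_dry", "rolling_restart"}


def evaluate(request: DeployRequest) -> Tuple[State, str]:
    """Walk the request through fixed checks; no LLM judgment involved."""
    state = State.RECEIVED

    # RECEIVED -> PRECONDITIONS_OK, or reject on the first failed check.
    if not request.tests_passing:
        return State.REJECTED, "tests are not passing"
    if not request.working_tree_clean:
        return State.REJECTED, "uncommitted changes present"
    if request.branch not in ALLOWED_BRANCHES:
        return State.REJECTED, f"branch {request.branch!r} violates policy"
    state = State.PRECONDITIONS_OK

    # PRECONDITIONS_OK -> PLAN_APPROVED only if every step is known-safe.
    unknown = [s for s in request.planned_steps if s not in SAFE_STEPS]
    if not request.planned_steps or unknown:
        return State.REJECTED, f"plan has unrecognized steps: {unknown}"
    state = State.PLAN_APPROVED
    return state, "plan matches known-safe steps"
```

The agent only ever fills in a `DeployRequest`; the decision to execute depends entirely on checks you can read, test, and audit.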

Segregation by Consequence

Not all actions are equal. Writing to a staging database is low-consequence. Modifying production infrastructure is high-consequence. Route them through different gates.

Low-consequence operations can be fully autonomous. High-consequence operations require explicit human approval, even if it means breaking the autonomy loop.

Most production deployments use a tiered approach:

Tier 1 (fully autonomous): read operations, test execution, code analysis, drafting outputs.
Tier 2 (async human review): opening PRs, creating issues, staging deployments.
Tier 3 (synchronous approval): production deploys, credential rotation, billing changes.

This keeps the agent fast for the 80% of tasks that don’t need human judgment, while ensuring humans stay in the loop where it matters.
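As a sketch of how the three tiers might be wired up, assuming a simple lookup table. The action names and return strings are made up for illustration; the one design choice worth copying is that anything unrecognized falls into the strictest tier:

```python
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = 1     # Tier 1: run without review
    ASYNC_REVIEW = 2   # Tier 2: run, then queue for human review
    SYNC_APPROVAL = 3  # Tier 3: block until a human approves


# Hypothetical action names mapped to the tiers described above.
ACTION_TIERS = {
    "read_file": Tier.AUTONOMOUS,
    "run_tests": Tier.AUTONOMOUS,
    "open_pr": Tier.ASYNC_REVIEW,
    "deploy_staging": Tier.ASYNC_REVIEW,
    "deploy_production": Tier.SYNC_APPROVAL,
    "rotate_credentials": Tier.SYNC_APPROVAL,
}


def route(action: str, approved_by_human: bool = False) -> str:
    """Decide what the gate should do with a requested action."""
    # Unknown actions default to the strictest tier.
    tier = ACTION_TIERS.get(action, Tier.SYNC_APPROVAL)

    if tier is Tier.AUTONOMOUS:
        return "execute"
    if tier is Tier.ASYNC_REVIEW:
        return "execute_and_queue_review"
    return "execute" if approved_by_human else "block_until_approved"
```

Defaulting unknown actions to synchronous approval means a newly added tool is safe until someone deliberately classifies it.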

The Real Failure Mode

The thing that actually breaks production agent systems isn’t usually a catastrophic one-off mistake. It’s silent degradation: an agent successfully executes actions, but those actions are wrong in ways that aren’t immediately obvious.

Meta’s incident with their support bot hijacking Instagram accounts happened because the agent had access to change account email addresses, and nobody thought to add a confirmation step. It wasn’t a runaway loop or a budget overrun. It was a single unauthorized action that succeeded too easily.

This is why observability wins over pure prevention. You can’t think of every edge case. But if you log everything and alert on anomalies, you’ll catch the degradation before it becomes a disaster.

Bottom line: Gate agent capabilities to discrete operations, log everything, set cost/rate limits as a practical brake, use deterministic validation for high-stakes decisions, and tier your approval requirements by consequence. This layered approach beats trying to make a single safeguard foolproof.

Question via Hacker News