Office Hours — How do you prevent AI agents from taking unintended actions when they have access to external tools and APIs?

This is the real problem nobody talks about until something breaks in production. You’ve got an agent with access to your infrastructure—maybe it’s calling Stripe, modifying a database, or deploying code—and the question isn’t “will it mess up,” it’s “when will it mess up, and what’s the damage radius?”

The honest answer: no single guardrail is enough. You need layers, and each layer catches different failure modes.

Define a Hard Perimeter Around What the Agent Can Touch

Start by being explicit about what APIs the agent can even see. Don’t give it a generic “execute any HTTP request” tool and hope for the best. Instead, build a tool schema where each capability is intentional.

# Bad: agent has access to everything
agent_tools = [
  {
    "name": "http_request",
    "description": "Make any HTTP request",
    "execute": lambda url, method, body: requests.request(method, url, json=body)
  }
]

# Better: constrained tool definitions
agent_tools = [
  {
    "name": "fetch_user_data",
    "description": "Look up user by ID. Returns name, email, created_at. Read-only.",
    "parameters": {
      "type": "object",
      "properties": {"user_id": {"type": "string"}},
      "required": ["user_id"]
    },
    # Parameterized query: user_id is bound by the driver, never interpolated into the SQL string
    "execute": lambda user_id: db.query("SELECT name, email, created_at FROM users WHERE id = ?", user_id)
  },
  {
    "name": "update_user_email",
    "description": "Change a user's email. Requires explicit confirmation before execution.",
    "parameters": {...},
    "execute": lambda user_id, new_email: update_email_with_audit_log(user_id, new_email),
    "requires_confirmation": True
  }
]

The second approach takes more work up front, but it is predictable. The agent can’t accidentally discover new endpoints or bypass your intent. Every tool has a specific contract.
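To make the "specific contract" idea concrete, here is a minimal sketch of a dispatcher that only runs registered tools and rejects unexpected arguments. The names `ToolRegistry`, `UnknownTool`, and the stubbed `fetch_user_data` handler are assumptions for illustration, not part of any particular agent framework.

```python
from typing import Any, Callable, Dict, List


class UnknownTool(Exception):
    """Raised when the model asks for a tool that was never registered."""


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, fn: Callable[..., Any], required: List[str]) -> None:
        self._tools[name] = {"fn": fn, "required": set(required)}

    def dispatch(self, name: str, args: Dict[str, Any]) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            # Anything outside the allowlist is refused, not guessed at
            raise UnknownTool(name)
        if set(args) != tool["required"]:
            # Missing or extra arguments are a sign of hallucinated parameters
            raise ValueError(
                f"{name} expects exactly {sorted(tool['required'])}, got {sorted(args)}"
            )
        return tool["fn"](**args)


registry = ToolRegistry()
# Stub handler standing in for a real read-only lookup
registry.register(
    "fetch_user_data",
    lambda user_id: {"user_id": user_id, "name": "example"},
    ["user_id"],
)

print(registry.dispatch("fetch_user_data", {"user_id": "42"}))
```

A call such as `registry.dispatch("http_request", {...})` raises `UnknownTool`, which is the point: the model cannot reach anything you did not register.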

Require Explicit Approval for Mutation Operations

Read-only operations can ship fast. Write operations need friction. When an agent wants to modify state—delete data, charge a card, deploy code—it should ask for confirmation first, and that confirmation should be easy to audit.

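import hmac
import secrets

class ConfirmableAction:
  def __init__(self, description, execute_fn, risk_level="medium"):
    self.description = description
    self.execute_fn = execute_fn
    self.risk_level = risk_level
    self.confirmation_token = None

  def request_confirmation(self):
    """Agent asks for approval; human reviews the description and decides."""
    token = secrets.token_urlsafe(32)  # unguessable, so approval cannot be forged
    self.confirmation_token = token
    # Route this dict to the human reviewer, not back into the model's context
    return {
      "status": "awaiting_confirmation",
      "action": self.description,
      "risk": self.risk_level,
      "confirmation_token": token
    }

  def execute_if_confirmed(self, token):
    expected = self.confirmation_token
    # Without the None check, execute_if_confirmed(None) would pass before any request was made
    if expected is None or not isinstance(token, str) \
        or not hmac.compare_digest(token.encode(), expected.encode()):
      raise ValueError("Invalid confirmation token")
    self.confirmation_token = None  # single use: one approval, one execution
    return self.execute_fn()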

This isn’t foolproof, but it creates a checkpoint. You can log every confirmation, review it afterward, and catch patterns where the agent keeps requesting the same risky action.
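As a rough sketch of that after-the-fact review, the function below counts how often each action description shows up in a confirmation log and flags the repeats. The log format (a list of dicts with an `action` key) and the `threshold` default are assumptions chosen to match the dict returned by the confirmation step.

```python
from collections import Counter
from typing import Dict, List


def repeated_requests(confirmation_log: List[Dict[str, str]], threshold: int = 3) -> List[str]:
    """Return action descriptions that were requested at least `threshold` times."""
    counts = Counter(entry["action"] for entry in confirmation_log)
    return sorted(action for action, n in counts.items() if n >= threshold)


# Hypothetical log entries for illustration
log = [
    {"action": "delete user 17", "risk": "high"},
    {"action": "refund order 9", "risk": "medium"},
    {"action": "delete user 17", "risk": "high"},
    {"action": "delete user 17", "risk": "high"},
]
flagged = repeated_requests(log)
print(flagged)  # ['delete user 17']
```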

Sandbox Code Execution Ruthlessly

If your agent is writing or executing code (which is increasingly common with Claude Code and GitHub Copilot), isolation is non-negotiable. The agent shouldn’t be able to read your .env file, access the host filesystem, or make arbitrary system calls.

A Docker container is the baseline. A full sandbox (like Daytona or Firecracker) is better. What you’re defending against: an agent prompt-injected by user input, or an agent that tries to exfiltrate secrets because it got confused about what “read the config” means.

# Minimal agent execution environment
FROM python:3.11-slim
WORKDIR /sandbox
RUN chmod 000 /root /home
COPY --chown=nobody:nobody ./agent_runner.py .
USER nobody
ENTRYPOINT ["python", "agent_runner.py"]

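The agent runs as a non-root user with no access to sensitive directories. If it tries to read /root or /etc/shadow, it gets a permission error (/etc/passwd stays world-readable, but it holds no password hashes). If it tries to exfiltrate data via DNS, the Dockerfile alone won’t stop it; you catch that with a network policy applied when the container runs. Sandboxing isn’t novel, but it’s the first thing teams skip when they’re in a hurry.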
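The image is only half of the sandbox; the rest is decided at launch. Below is a sketch of a helper that builds a locked-down `docker run` command. The flags are standard Docker CLI options, while the image name `agent-sandbox:latest` and the memory and process limits are placeholder assumptions.

```python
from typing import List


def sandbox_run_command(image: str, memory: str = "512m", pids: int = 128) -> List[str]:
    """Build a `docker run` argv that restricts the container at launch."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # no network at all, so no DNS exfiltration path
        "--read-only",              # root filesystem cannot be modified
        "--tmpfs", "/tmp",          # scratch space that disappears with the container
        "--cap-drop", "ALL",        # drop every Linux capability
        "--security-opt", "no-new-privileges",
        "--memory", memory,
        "--pids-limit", str(pids),  # cap the process count to blunt fork bombs
        "--user", "nobody",
        image,
    ]


cmd = sandbox_run_command("agent-sandbox:latest")  # hypothetical image name
print(" ".join(cmd))
```

You would hand `cmd` to `subprocess.run(cmd, check=True, timeout=...)`. Note that `--network none` also blocks legitimate API calls; if the agent needs the network, an egress allowlist is the usual alternative.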

Set Hard Limits on Cost and Rate

Runaway agent loops are your most likely failure mode. An agent gets stuck in a retry loop, calling an expensive API repeatedly, or it hallucinates an endpoint and hammers it. Without cost limits, you wake up to a $50K bill.

Set per-agent budgets and per-action costs. Enforce them in middleware before the agent even makes the call.

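import time

class BudgetExceeded(Exception): pass
class RateLimitExceeded(Exception): pass

class AgentBudget:
  def __init__(self, max_spend_per_hour=500, max_requests_per_minute=60):
    self.max_spend = max_spend_per_hour
    self.max_rpm = max_requests_per_minute
    self.spent_this_hour = 0
    self.requests_this_minute = 0
    self.hour_start = self.minute_start = time.monotonic()

  def check_before_action(self, action_name, estimated_cost):
    now = time.monotonic()
    if now - self.hour_start >= 3600:  # start a new hourly window
      self.hour_start, self.spent_this_hour = now, 0
    if now - self.minute_start >= 60:  # start a new per-minute window
      self.minute_start, self.requests_this_minute = now, 0
    if self.spent_this_hour + estimated_cost > self.max_spend:
      raise BudgetExceeded(f"Action would exceed hourly budget. Spent: ${self.spent_this_hour}, Requested: ${estimated_cost}")
    if self.requests_this_minute >= self.max_rpm:
      raise RateLimitExceeded(f"Hit {self.max_rpm} requests/min limit")
    self.spent_this_hour += estimated_cost  # record usage so later checks see it
    self.requests_this_minute += 1
    return True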

You’ll tune these thresholds based on your risk tolerance and typical workload. But having them means a misconfigured agent can’t empty your account.
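One way to enforce a budget "in middleware" is a decorator that charges the budget before the wrapped tool runs. This is a sketch under assumptions: `SimpleBudget` is a stripped-down stand-in that tracks total spend only, and `call_expensive_api` is a hypothetical tool.

```python
import functools
from typing import Any, Callable, List


class BudgetExceeded(Exception):
    pass


class SimpleBudget:
    """Hypothetical stand-in for a budget tracker; tracks total spend only."""

    def __init__(self, max_spend: float) -> None:
        self.max_spend = max_spend
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        if self.spent + cost > self.max_spend:
            raise BudgetExceeded(f"would spend {self.spent + cost}, limit {self.max_spend}")
        self.spent += cost


def budgeted(budget: SimpleBudget, estimated_cost: float) -> Callable:
    """Decorator: charge the budget first, so an over-budget call never executes."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            budget.charge(estimated_cost)  # raises before fn runs
            return fn(*args, **kwargs)
        return wrapper
    return decorator


budget = SimpleBudget(max_spend=1.0)
calls: List[str] = []


@budgeted(budget, estimated_cost=0.4)
def call_expensive_api(prompt: str) -> str:
    calls.append(prompt)  # stands in for the real, billable request
    return f"response to {prompt}"
```

With these numbers the first two calls go through and the third raises `BudgetExceeded` without ever reaching the API, which is the property you want from a hard limit.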

Monitor Agent Behavior in Real Time

Log every action the agent takes. Not just successes, but what it tried, what it asked for confirmation on, what it skipped. When something goes wrong, you need a full trace.

Look for anomalies: is the agent making requests to APIs it doesn’t normally use? Is it retrying the same action dozens of times? Is it exfiltrating data in batches? These patterns show up in logs before they cause damage.

import time
from contextlib import contextmanager

# audit_log, alert, and is_suspicious_pattern are assumed to be defined elsewhere.
@contextmanager
def track_agent_action(agent_id, action_name, params):
  start = time.time()
  log_entry = {
    "agent_id": agent_id,
    "action": action_name,
    "params": params,
    "timestamp": start,
    "status": "pending"
  }

  try:
    yield log_entry
    log_entry["status"] = "success"
  except Exception as e:
    log_entry["status"] = "failed"
    log_entry["error"] = str(e)
    raise  # re-raise so the caller sees the failure instead of a silent success
  finally:
    # Runs on success and failure alike, so every attempt is logged
    log_entry["duration_ms"] = int((time.time() - start) * 1000)
    audit_log.write(log_entry)

    # Alert on anomalies
    if log_entry["duration_ms"] > 30000:
      alert(f"Agent {agent_id} slow request: {action_name}")
    if log_entry["status"] == "failed" and is_suspicious_pattern(agent_id):
      alert(f"Possible runaway loop: {agent_id}")

Real-time monitoring isn’t glamorous, but it’s your last line of defense when the agent starts doing something weird.
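The "retrying the same action dozens of times" check can be as simple as a sliding window over recent actions. The sketch below is one possible shape for a helper like is_suspicious_pattern; the class name `RepeatDetector` and its thresholds are assumptions, and timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import deque
from typing import Deque, Tuple


class RepeatDetector:
    def __init__(self, max_repeats: int = 5, window_seconds: float = 60.0) -> None:
        self.max_repeats = max_repeats
        self.window = window_seconds
        self._events: Deque[Tuple[float, str]] = deque()

    def record(self, action_key: str, now: float) -> bool:
        """Record an action; return True if it exceeded max_repeats within the window."""
        self._events.append((now, action_key))
        # Discard events that have aged out of the window
        while self._events and now - self._events[0][0] > self.window:
            self._events.popleft()
        count = sum(1 for _, key in self._events if key == action_key)
        return count > self.max_repeats


detector = RepeatDetector(max_repeats=3, window_seconds=10.0)
results = [detector.record("charge_card:order-9", now=float(t)) for t in range(4)]
print(results)  # [False, False, False, True]
```

In production you would pass `time.monotonic()` as `now` and key on the action name plus a hash of its parameters, so identical retries are grouped and distinct calls are not.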

Accept That Some Failures Are Baked In

Tool use is where agent reliability hits its floor. Claude Opus 4.7 and GPT-5.5 are both good at deciding which tool to call, but they hallucinate parameters, misunderstand return values, and retry failed calls incorrectly. That’s not going away with better prompting.

The question isn’t “how do I make the agent never fail?” It’s “how do I make failures cheap and visible?” A failed database query should be a logged error, not $500 in API calls. A misunderstood API response should trigger a human review queue, not a cascade of bad decisions.
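A minimal sketch of "cheap and visible": cap the retries, then park the failure in a review queue instead of letting the agent keep trying. The in-memory `review_queue` list and the `run_with_review` helper are hypothetical; a real system would more likely use a database table or a ticketing queue.

```python
from typing import Any, Callable, Dict, List, Optional

# In-memory stand-in for a human review queue
review_queue: List[Dict[str, Any]] = []


def run_with_review(action_name: str, fn: Callable[[], Any], max_attempts: int = 2) -> Optional[Any]:
    """Try fn a bounded number of times; on repeated failure, escalate to a human."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: any failure counts against the cap
            last_error = exc
    review_queue.append(
        {"action": action_name, "error": str(last_error), "attempts": max_attempts}
    )
    return None


def flaky_query() -> str:
    raise TimeoutError("database did not respond")


result = run_with_review("fetch_user_data", flaky_query)
print(result, review_queue)
```

The failure costs two attempts and one queue entry, and a person sees it, rather than the agent improvising its way into a cascade.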

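Bottom line: Prevent unintended actions by designing a bounded tool set, requiring confirmation for writes, sandboxing code execution, capping cost and rate, and logging every action so that failures stay cheap and visible.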

Question via Hacker News