Office Hours — What concrete guardrails and constraints do you need to put on LLM agents before deploying them to production?

Agents are moving from demo to production, and your liability moves with them. A Claude Code system opening a PR that deletes your database, or a coding agent quietly committing an API key, is no longer an abstract risk; it is an operational problem your team will debug at 2 AM. The guardrails you need depend on what actions your agent can take, what data it touches, and whether you can catch failures before they compound.

Input and Output Isolation

Start with a hard boundary on what enters and leaves your agent. Your agent shouldn’t see production credentials, and production systems shouldn’t receive unchecked agent outputs directly.

For input, this means scrubbing secrets from context before the model sees them. Parse environment variables, remove API keys from code snippets, and strip PII from logs or files the agent reads. A coding agent doesn’t need to see your actual database password to understand the schema—pass a placeholder instead.
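
As a rough illustration of the scrubbing step, here is a minimal Python sketch. The regular expressions and the scrub function are hypothetical examples, not a complete secret detector; a real deployment would likely pair something like this with a dedicated secret-scanning tool.

```python
import re

# Hypothetical patterns for illustration; extend them for the formats your stack uses.
# Matches assignments such as DB_PASSWORD=..., api_key: ..., AUTH_TOKEN = ...
ASSIGNMENT = re.compile(
    r"(?i)\b([A-Z0-9_]*(?:password|passwd|secret|token|api_key|apikey)[A-Z0-9_]*)"
    r"(\s*[=:]\s*)(\S+)"
)
# AWS access key IDs start with AKIA followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
# Very loose email matcher, used here as a stand-in for PII detection.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def scrub(text: str) -> str:
    """Replace likely secrets and email addresses with placeholders."""
    # Keep the variable name and separator so the agent still sees the structure.
    text = ASSIGNMENT.sub(lambda m: f"{m.group(1)}{m.group(2)}<REDACTED>", text)
    text = AWS_ACCESS_KEY.sub("<REDACTED_AWS_KEY>", text)
    text = EMAIL.sub("<REDACTED_EMAIL>", text)
    return text
```

Run every file, log excerpt, and environment dump through a function like this before it is added to the agent’s context. Pattern-based scrubbing misses anything it has no pattern for, so treat it as one layer rather than the whole defense.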

For output, never execute agent-generated code directly in production. Run it in a sandbox first. If your agent writes a migration or modifies infrastructure, that output goes to a staging environment, or requires explicit human approval before touching production. Instagram’s chatbot breach (mentioned in Daily Signal 2026-06-08) exposed 20,000+ accounts because a password-reset agent had direct access to user account operations without rate limiting or verification—the agent’s output went straight to users without a checkpoint.

A concrete pattern: pipe agent outputs through a validation layer that checks syntax, verifies the change is within expected scope, and logs every action. If a coding agent generates a SQL command, parse it and confirm it touches only the tables and statement types you expect. If it generates a shell command, check it against an allowlist of safe operations.
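
The shell-command check could look something like the sketch below. The allowed commands, the allowed git subcommands, and the function name are assumptions made up for this example; choose your own list based on what the agent actually needs.

```python
import shlex

# Hypothetical allowlist for a read-mostly coding agent.
ALLOWED_COMMANDS = {"ls", "cat", "grep", "git"}
ALLOWED_GIT_SUBCOMMANDS = {"status", "diff", "log"}
# Characters that enable chaining, redirection, or substitution in a shell.
SHELL_METACHARACTERS = set(";|&<>`$\n")


def validate_shell_command(command: str) -> list[str]:
    """Return the parsed argv if the command is allowed, else raise ValueError."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # for example, an unbalanced quote
        raise ValueError(f"unparseable command: {exc}") from exc
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"command not on the allowlist: {argv[0]}")
    if argv[0] == "git":
        if len(argv) < 2 or argv[1] not in ALLOWED_GIT_SUBCOMMANDS:
            raise ValueError("git subcommand not on the allowlist")
    return argv
```

Execute the returned argv without a shell (for instance, subprocess.run(argv) with the default shell=False) so the string is never reinterpreted. This sketch does not inspect arguments or paths, which a production validator would also need to do.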

Cost and Rate Limiting

Agents consume tokens like nothing you’ve seen before. Qwen3.7-Max demonstrated 35 hours of autonomous operation on a single task—that’s thousands of dollars if you’re on a metered API. You need hard stops before your bill explodes.

Set a token budget per task and per month. Many frameworks let you cap the tokens an agent run may use; configure that cap and fail gracefully as the run approaches it. For agentic workflows, also set a cost ceiling per operation: if a task is budgeted at $0.50 and the agent has already spent $2 on tokens, stop and escalate to a human.
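
A per-run budget can be enforced with very little code. The sketch below is one possible shape; RunBudget, BudgetExceeded, and the pricing parameter are hypothetical names, and the price per 1,000 tokens is something you would take from your provider’s rate card.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its token or cost ceiling."""


@dataclass
class RunBudget:
    max_tokens: int
    max_cost_usd: float
    tokens_used: int = 0
    cost_usd: float = 0.0

    def charge(self, tokens: int, usd_per_1k_tokens: float) -> None:
        """Record usage from one model call and stop the run if over budget."""
        self.tokens_used += tokens
        self.cost_usd += tokens / 1000 * usd_per_1k_tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens_used}")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost ceiling exceeded: ${self.cost_usd:.2f}")
```

Call charge after every model response and let BudgetExceeded propagate to whatever code escalates to a human. Because the check runs after the call, a run can overshoot by one call’s worth of spend, so set the ceiling slightly below the amount you are actually willing to pay.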

Rate limiting matters because agents amplify mistakes. If your agent hits an API 100 times in a second, or retries the same failing operation infinitely, you’ll feel it immediately. Implement exponential backoff in your agent’s tool calls, and add a circuit breaker: if a tool fails N times in a row, the agent stops using it and reports the failure rather than spinning.
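
Exponential backoff and a circuit breaker can live in one small wrapper around each tool. This is a sketch under assumptions: ToolBreaker and CircuitOpen are invented names, and the delays and failure threshold are placeholder values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpen(RuntimeError):
    """Raised when a tool has failed too many times in a row."""


class ToolBreaker:
    def __init__(self, max_failures: int = 3, base_delay: float = 0.5,
                 max_delay: float = 8.0, sleep: Callable[[float], None] = time.sleep):
        self.max_failures = max_failures
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.sleep = sleep  # injectable so tests do not have to wait
        self.failures = 0   # consecutive failures so far

    def call(self, tool: Callable[[], T]) -> T:
        """Call the tool, retrying with backoff until the breaker opens."""
        while True:
            if self.failures >= self.max_failures:
                raise CircuitOpen("tool disabled after repeated failures")
            try:
                result = tool()
            except Exception as exc:
                self.failures += 1
                if self.failures >= self.max_failures:
                    raise CircuitOpen("tool disabled after repeated failures") from exc
                # Delay doubles with each consecutive failure, up to max_delay.
                self.sleep(min(self.max_delay, self.base_delay * 2 ** (self.failures - 1)))
            else:
                self.failures = 0  # any success closes the breaker again
                return result
```

Keep one breaker per tool so a flaky search API does not disable file access. Once the breaker is open, every later call fails immediately, which is the signal for the agent to report the failure instead of retrying.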

A real example: Perplexity’s approach (Daily Signal 2026-06-07) of letting agents write their own Python search routines in a sandbox cuts token costs 85% compared to agents calling fixed APIs. That’s not just clever architecture—it’s cost control. Your agents should similarly have mechanisms to be efficient, not just powerful.

Tool and API Access Control

Your agent shouldn’t have a master key. Implement role-based access for every tool it can call. If your agent can modify code, it shouldn’t be able to delete databases. If it can read logs, it shouldn’t be able to deploy infrastructure.

Create a separate API key scoped only to what the agent needs. If it’s a coding agent for a specific repo, give it write access to that repo only, not your entire organization. Use per-task credentials that expire after the task completes, so a compromised key has limited blast radius.

For database access, use read-only replicas for agents that only need to read. If an agent must write, use transactions with automatic rollback on errors, not bare INSERT/UPDATE statements. Require agent-initiated writes to hit a validation service that checks the change is sensible before committing.
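
For the write path, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for your real database driver. The function name and the row-count limit are assumptions; the point is that all of the agent’s statements commit together or not at all.

```python
import sqlite3
from typing import Iterable, Sequence, Tuple


def apply_agent_writes(conn: sqlite3.Connection,
                       statements: Iterable[Tuple[str, Sequence]],
                       max_rows_per_statement: int = 100) -> int:
    """Apply every statement in one transaction; roll back all of them on any error."""
    total = 0
    # Used as a context manager, a sqlite3 connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        for sql, params in statements:
            cursor = conn.execute(sql, params)
            # Crude sanity check: refuse statements that touch too many rows.
            if cursor.rowcount > max_rows_per_statement:
                raise ValueError(f"statement changed {cursor.rowcount} rows")
            total += max(cursor.rowcount, 0)
    return total
```

The row-count guard is only one example of a sanity check. A fuller validation service might also restrict which tables and statement types an agent may use.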

Document every tool your agent has access to. Make it explicit in your agent system prompt: “You can call these 7 tools. You cannot make HTTP requests outside this list. You cannot execute arbitrary shell commands.” Simon Willison’s work on MCP servers (Daily Signal 2026-06-06) shows how to structure agent tool access cleanly—define a protocol, let the agent see only what you expose, and audit every call.

Monitoring and Observability

You can’t control what you can’t see. Instrument every agent action: log every tool invocation, every decision point, every failure. Store these logs where they survive longer than your agent run, so you can investigate 3 days later when something goes wrong.
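
One lightweight way to instrument tool calls is a decorator that writes one JSON line per invocation. This is a sketch: logged_tool and the record fields are assumptions, and the sink would normally be a file or log shipper that outlives the agent run.

```python
import functools
import json
import time
from typing import Any, Callable, TextIO


def logged_tool(sink: TextIO, run_id: str) -> Callable:
    """Decorator factory: log every call to the wrapped tool as one JSON line."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            started = time.time()
            record = {"run_id": run_id, "tool": fn.__name__,
                      "args": repr(args), "kwargs": repr(kwargs),
                      "started_at": started}
            try:
                result = fn(*args, **kwargs)
                record["status"] = "ok"
                return result
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                # Written on success and on failure, so failed calls are never lost.
                record["duration_s"] = time.time() - started
                sink.write(json.dumps(record) + "\n")
        return wrapper
    return decorator
```

Arguments are stored with repr, so scrub or truncate them first if tools can receive secrets or large payloads.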

Set up real-time alerts for anomalies. If your agent spends 10x the usual tokens on a task, alert. If it makes 100 API calls in 5 minutes, alert. If it tries to access a tool outside its allowed set, alert and kill the run.
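
The two numeric alerts above reduce to simple threshold checks. The sketch below uses invented names (CallRateMonitor, token_anomaly) and takes timestamps as arguments so it can be tested without waiting; the wiring to your alerting system is left out.

```python
from collections import deque
from typing import Deque


class CallRateMonitor:
    """Sliding-window counter: flags more than max_calls within window_s seconds."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self._timestamps: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record a call at time now; return True if the limit is exceeded."""
        self._timestamps.append(now)
        # Drop calls that have aged out of the window.
        while now - self._timestamps[0] > self.window_s:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_calls


def token_anomaly(tokens_used: int, baseline_tokens: int, factor: float = 10.0) -> bool:
    """True when a task has used more than factor times its usual token count."""
    return tokens_used > factor * baseline_tokens
```

For example, CallRateMonitor(max_calls=100, window_s=300) corresponds to the limit of 100 API calls in 5 minutes. The baseline for token_anomaly has to come from your own historical runs.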

Implement a dead man’s switch. If your agent runs for longer than expected without completing, it stops and escalates to a human. Long-running agents drift—they hallucinate their way into corners and keep trying to fix it. A timer forces the question: is this actually still solving the problem, or is it lost?
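
A dead man’s switch can be as simple as a deadline that the agent loop checks between steps. Deadline and DeadlineExceeded are hypothetical names in this sketch, and the clock is injectable so the behavior can be tested without real waiting.

```python
import time
from typing import Callable


class DeadlineExceeded(TimeoutError):
    """Raised when an agent run outlives its time budget."""


class Deadline:
    def __init__(self, seconds: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self) -> float:
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self) -> None:
        """Call between agent steps; raises once the time budget is spent."""
        if self._clock() >= self._expires_at:
            raise DeadlineExceeded("agent run exceeded its time budget")
```

A cooperative check like this only fires between steps. If a single tool call can hang, also pass remaining() as that call’s timeout, or run the agent in a separate process that a supervisor can kill.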

Create an audit trail that shows exactly what happened, in order. This is critical for compliance and for your own debugging. When your agent corrupts something, you need to replay the exact sequence of events.

Determinism and State Boundaries

Agents are stateful creatures—they remember context across multiple steps. That memory can grow unbounded or become inconsistent. Define clear state boundaries.

If your agent is solving a multi-step task, specify the maximum number of steps. Define what constitutes “done”—a concrete success criterion the agent must achieve, not a vague goal. “Write a function that passes these three tests” is good. “Improve the codebase” is not.
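
A step cap and a concrete success criterion fit naturally into the loop that drives the agent. In this sketch, run_bounded, step, and is_done are placeholder names: step would wrap one model call plus tool execution, and is_done would run something objective such as the three tests.

```python
from typing import Callable


class StepLimitExceeded(RuntimeError):
    """Raised when the agent uses all of its steps without meeting the criterion."""


def run_bounded(step: Callable[[int], None],
                is_done: Callable[[], bool],
                max_steps: int) -> int:
    """Run step until is_done() is true; return the number of steps taken."""
    for i in range(1, max_steps + 1):
        step(i)
        # The success criterion is checked by code, not by asking the agent.
        if is_done():
            return i
    raise StepLimitExceeded(f"not done after {max_steps} steps")
```

The important design choice is that is_done is evaluated outside the model. An agent that reports its own success is not a success criterion.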

Use immutable snapshots of context where possible. When your agent reads a file, snapshot that file state so the agent isn’t operating on a moving target. If the file changes mid-task, the agent sees the version it started with, not inconsistent intermediate states.
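
One way to sketch a file snapshot is to capture the bytes and a hash at read time, then work only from the captured copy. FileSnapshot, take_snapshot, and has_changed are names made up for this example.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass(frozen=True)  # frozen: the snapshot cannot be mutated after creation
class FileSnapshot:
    path: str
    content: bytes
    sha256: str


def take_snapshot(path: Union[str, Path]) -> FileSnapshot:
    """Read the file once and record its content together with a hash."""
    data = Path(path).read_bytes()
    return FileSnapshot(str(path), data, hashlib.sha256(data).hexdigest())


def has_changed(snapshot: FileSnapshot) -> bool:
    """True if the file on disk no longer matches the snapshot."""
    current = hashlib.sha256(Path(snapshot.path).read_bytes()).hexdigest()
    return current != snapshot.sha256
```

The agent reads snapshot.content for the whole task. Before writing results back, check has_changed and stop or re-plan if the file moved underneath it, rather than overwriting someone else’s edit.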

Recreate agents from scratch for each new task rather than reusing a single agent instance across multiple problems. This prevents context bleed and makes the agent’s state manageable.

Guardrails for Sensitive Operations

Some operations shouldn’t be autonomous. Deletion, deployment, and credential rotation are examples. Implement mandatory approval gates for these.

A pattern that works: the agent proposes the action, logs it with full context, and a human approves or rejects. If approved, the agent executes. If rejected, the agent logs that and tries an alternative. This gives you the speed of automation without the risk of autonomous destruction.
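
The propose, approve, execute flow might be sketched as follows. ApprovalGate, ProposedAction, and the set of sensitive action kinds are assumptions for illustration; the approver callable stands in for whatever channel reaches a human, such as a chat prompt or a ticket.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical list; mirror whatever your team treats as sensitive.
SENSITIVE_KINDS = {"delete", "deploy", "rotate_credentials"}


class ActionRejected(RuntimeError):
    """Raised when a human declines a proposed action."""


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    target: str
    reason: str


class ApprovalGate:
    def __init__(self, approver: Callable[[ProposedAction], bool]):
        self._approver = approver
        self.log: List[Tuple[ProposedAction, str]] = []

    def execute(self, action: ProposedAction, run: Callable[[], Any]) -> Any:
        """Run the action only if it is non-sensitive or a human approved it."""
        if action.kind in SENSITIVE_KINDS:
            approved = self._approver(action)
            self.log.append((action, "approved" if approved else "rejected"))
            if not approved:
                raise ActionRejected(f"{action.kind} on {action.target} was rejected")
        else:
            self.log.append((action, "auto"))
        return run()
```

The agent loop catches ActionRejected, records it, and looks for an alternative. The side effect lives in the run callable, so nothing happens until the gate has decided.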

For operations that can’t wait for human approval, use conservative defaults. An agent deciding whether to delete something should refuse by default unless the request meets explicit, checkable conditions you defined in advance. An agent deciding whether to deploy should require explicit enabling in your config.

Test your safeguards. Intentionally trigger them—write a test where your agent tries to delete something, and verify your approval gate fires. Test that cost limits actually stop the run, that tool access restrictions are actually enforced. Don’t rely on the framework’s promises.

Credential Management

This is where teams consistently fail. Your agent will eventually have access to credentials, and a prompt injection or model hallucination could exfiltrate them.

Never embed credentials in prompts or system messages. Use a secrets manager (Vault, AWS Secrets Manager, whatever). When your agent needs a credential, it requests it at runtime—the agent code sees a function call like get_database_password(), not the password itself.

Rotate credentials frequently. API keys accessed by agents should expire daily or hourly if possible. Prefer short-lived tokens (for example, JWTs issued with a short expiry) over long-lived API keys.
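
To show what “short-lived” means mechanically, here is a simplified expiring token built from the standard library only. It is a teaching stand-in, not a JWT implementation; issue_token and verify_token are hypothetical names, and in production you would more likely rely on a maintained JWT library or your secrets manager’s own token mechanism.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_token(key: bytes, subject: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Create a signed token for subject that expires ttl_s seconds from now."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": subject, "exp": now + ttl_s}).encode()
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(signature).decode())


def verify_token(key: bytes, token: str, now: Optional[float] = None) -> str:
    """Return the subject if the token is authentic and unexpired, else raise ValueError."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    claims = json.loads(payload)
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims["sub"]
```

Issue one token per task with a lifetime close to the task’s expected duration, so that a leaked token is useless shortly after the task ends.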

Audit credential access. Log every time a secret is retrieved, what retrieved it, and from where. If an agent requests a credential it shouldn’t have access to, your audit log catches it.
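
Runtime lookup and access auditing can be combined in a thin broker that sits in front of your secrets manager. In this sketch, SecretBroker and its grants table are hypothetical, and fetch is a placeholder for the real client call to Vault, AWS Secrets Manager, or similar.

```python
import time
from typing import Any, Callable, Dict, List, Set


class SecretAccessDenied(PermissionError):
    """Raised when a requester asks for a secret it has not been granted."""


class SecretBroker:
    def __init__(self, fetch: Callable[[str], str], grants: Dict[str, Set[str]]):
        self._fetch = fetch    # placeholder for the real secrets-manager client
        self._grants = grants  # requester name -> secret names it may read
        self.audit_log: List[Dict[str, Any]] = []

    def get(self, secret_name: str, requester: str) -> str:
        """Return the secret if requester is granted it; log the attempt either way."""
        allowed = secret_name in self._grants.get(requester, set())
        self.audit_log.append({"secret": secret_name, "requester": requester,
                               "allowed": allowed, "at": time.time()})
        if not allowed:
            raise SecretAccessDenied(f"{requester} may not read {secret_name}")
        return self._fetch(secret_name)
```

Call the broker from inside tool implementations, for example when a tool opens its database connection, so the secret value is used by your code and never placed in the model’s context. The audit log records names and outcomes only, never the secret itself.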

On the broader point: OpenAI’s Lockdown Mode (Daily Signal 2026-06-07) disables web access and Agent Mode for sensitive operations, trading autonomy for safety. That’s a model worth copying. For high-stakes operations, disable your agent’s external capabilities entirely and have it work within a confined sandbox.

Testing Before Production

Don’t deploy an agent that hasn’t been stress-tested. Run it against adversarial inputs—prompts designed to trick it, malformed data, contradictory instructions. See what it does when tools fail, when it runs out of context, when it receives conflicting guidance.

Build your own agent benchmark before shipping. The Daily Signal (2026-06-08) notes that marketing claims about agent performance rarely match real-world results. Test your specific agent against your specific tasks in your actual environment. Don’t trust benchmarks published by model vendors—run the test

Question via Hacker News