Office Hours — How do you reliably evaluate AI agent behavior in production when testing in development doesn't catch all failure modes?

The hard truth: your test suite will never be comprehensive enough. Development environments are too clean, too constrained, and too predictable. Real production agents encounter edge cases you can’t manufacture in a lab—malformed data, inconsistent third-party APIs, novel user inputs that didn’t exist in your training set, and state combinations that take weeks to manifest.

The good news is you don’t need perfect foresight. You need a tiered evaluation strategy that catches failures as they happen, classifies them quickly, and feeds that signal back into your system before the failures cascade.

Why Development Testing Fails for Agents

Agents aren’t deterministic functions. They’re stateful systems that accumulate context over time, make decisions based on uncertain inputs, and interact with external services that can fail or behave unexpectedly. Your test suite probably covers happy paths and a few known error cases. But agents fail in ways test suites can’t predict: they hallucinate tool arguments, get stuck in retry loops, call APIs in the wrong order, or accumulate bad state across multiple steps.

The gap is structural. Development testing is synchronous, bounded, and repeatable. Production agents are asynchronous, unbounded, and dependent on live services. A model that scored 95% on your eval benchmark might systematically fail on your actual production workflow because the benchmark doesn’t match the distribution of real requests.

This gap is one reason so many agent projects stall: according to recent industry data, 40% fail before reaching production. The teams that ship successfully don’t assume development evals are sufficient. They instrument production comprehensively from day one.

Layered Evaluation in Production

Start with fast, lightweight signals you can run on every execution. Then add deeper analysis on a sample or when you detect anomalies.

Layer 1: Execution-level signals (100% coverage, real-time). Monitor whether the agent completed its task, how many steps it took, whether it called the right tools, and how long it ran. This is your canary layer. Example:

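from dataclasses import asdict, dataclass
from typing import Optional

import sentry_sdk
from datadog import statsd  # DogStatsD client; assumed here because the metrics carry tags


@dataclass
class AgentExecution:
    task_id: str
    steps_taken: int
    tools_called: list[str]
    duration_seconds: float
    completed: bool
    final_state: str
    error: Optional[str]


def log_execution(execution: AgentExecution) -> None:
    """Emit per-execution metrics and report errors."""
    tags = [f"final_state:{execution.final_state}"]

    # A gauge call takes one metric name and one value, so emit each metric separately.
    statsd.gauge("agent.steps", execution.steps_taken, tags=tags)
    statsd.gauge("agent.duration", execution.duration_seconds, tags=tags)
    statsd.gauge("agent.completed", int(execution.completed), tags=tags)

    # Count every tool invocation so unexpected tool usage shows up in dashboards.
    for tool in execution.tools_called:
        statsd.increment(f"agent.tool.{tool}", tags=tags)

    if execution.error:
        # Attach the full execution record to the error event.
        sentry_sdk.set_context("agent_execution", asdict(execution))
        sentry_sdk.capture_message(f"Agent error: {execution.error}", level="error")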

Watch for anomalies: agents completing in unusually few steps (might be giving up early), taking way too many steps (might be looping), calling unexpected tool sequences, or erroring on particular input types. Set alerts on these. They’re not failures yet, but they’re signals that something doesn’t match your mental model.
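The anomaly checks described above can be sketched as a small function that labels each execution. This is a minimal illustration, not a tuned detector: the thresholds, the label names, and the `expected_tools` parameter are all assumptions introduced here, and you would calibrate them against your own traffic.

```python
from collections import Counter

# Hypothetical thresholds; tune them against your own traffic.
MIN_EXPECTED_STEPS = 2
MAX_EXPECTED_STEPS = 25
MAX_REPEATS_PER_TOOL = 5


def execution_anomalies(steps_taken: int, tools_called: list[str],
                        expected_tools: set[str]) -> list[str]:
    """Return anomaly labels for one agent execution (empty list if none)."""
    flags = []
    if steps_taken < MIN_EXPECTED_STEPS:
        flags.append("too_few_steps")       # agent may have given up early
    if steps_taken > MAX_EXPECTED_STEPS:
        flags.append("too_many_steps")      # agent may be looping
    # The same tool called many times in one run is a common looping signature.
    counts = Counter(tools_called)
    if any(n > MAX_REPEATS_PER_TOOL for n in counts.values()):
        flags.append("repeated_tool_calls")
    # Any tool outside the set expected for this task type.
    if set(tools_called) - expected_tools:
        flags.append("unexpected_tool")
    return flags


flags = execution_anomalies(
    steps_taken=40,
    tools_called=["search"] * 8 + ["delete_record"],
    expected_tools={"search", "fetch_user"},
)
print(flags)
```

Returning labels rather than raising keeps this layer cheap: the labels can be attached as metric tags and alerted on in aggregate, since a single flagged run is a signal, not necessarily a failure.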

Layer 2: Output-level validation (sample or triggered). When an agent returns a result, validate it against basic constraints. Does it have the right structure? Are the values in the right range? Does it answer the user’s question?

For coding agents, this is easier: you can run tests on the generated code. For summarization or analysis tasks, you need heuristics or LLM-as-a-judge calls on a sample.

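import random

EVAL_SAMPLE_RATE = 0.05  # fraction of outputs sent to the LLM judge
JUDGE_THRESHOLD = 0.6    # scores below this are flagged


def evaluate_with_model(model: str, prompt: str) -> float:
    """Placeholder: call your judge model and return a score between 0 and 1."""
    raise NotImplementedError


def validate_agent_output(output: dict, context: dict) -> dict:
    issues = []

    # Structural validation
    if "result" not in output:
        issues.append("missing_result_field")
    elif not output["result"]:
        issues.append("empty_result")

    if "reasoning" not in output:
        issues.append("missing_reasoning")

    # Sample-based semantic validation (LLM-as-judge): did the agent address the request?
    # Skip the judge when there is no result to rate; this also avoids a KeyError.
    if output.get("result") and random.random() < EVAL_SAMPLE_RATE:
        judge_score = evaluate_with_model(
            model="claude-opus-4.8",
            prompt=(
                "Rate if this output answers the request. "
                f"Request: {context['original_request']}. Output: {output['result']}"
            ),
        )
        if judge_score < JUDGE_THRESHOLD:
            issues.append("judge_low_confidence")

    return {
        "is_valid": len(issues) == 0,
        "issues": issues,
        "output": output,
    }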

The key is keeping the sample rate conservative. At 5-10% sampling, you catch systematic issues without running an LLM call on every execution, which would hurt both latency and cost. Bias the sampling toward edge cases: new user types, unusual request structures, and runs whose steps failed in the execution layer.
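One way to bias sampling toward edge cases is to keep a low base rate for ordinary traffic and raise it when a run carries an edge-case flag. The sketch below assumes three boolean flags computed upstream; the flag names and the 50% boosted rate are illustrative choices, not recommendations.

```python
import random

BASE_SAMPLE_RATE = 0.05       # floor applied to ordinary traffic
EDGE_CASE_SAMPLE_RATE = 0.50  # hypothetical boosted rate for flagged traffic


def judge_sample_rate(new_user_type: bool, unusual_request: bool,
                      execution_anomaly: bool) -> float:
    """Pick the sampling rate for one execution based on its edge-case flags."""
    if new_user_type or unusual_request or execution_anomaly:
        return EDGE_CASE_SAMPLE_RATE
    return BASE_SAMPLE_RATE


def should_judge(rate: float, rng: random.Random) -> bool:
    """Decide whether this execution is sent to the LLM judge."""
    return rng.random() < rate


rng = random.Random(0)  # seeded only to make the demonstration repeatable
ordinary = sum(
    should_judge(judge_sample_rate(False, False, False), rng) for _ in range(10_000)
)
flagged = sum(
    should_judge(judge_sample_rate(False, False, True), rng) for _ in range(10_000)
)
print(ordinary, flagged)  # flagged runs are judged far more often than ordinary ones
```

If you later estimate an overall failure rate from judged runs, weight each judged run by the inverse of the rate it was sampled at. Otherwise the oversampled edge cases will make the system look worse than it is.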

Layer 3: Comparative evals (continuous, offline). Daily or weekly, take a batch of production executions and compare it against a recent baseline, your test suite, or alternate model configurations. This catches drift, where an agent’s behavior subtly degrades over time.

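from collections import Counter
from statistics import mean


def comparative_eval_batch() -> dict:
    # fetch_executions_since, fetch_baseline_metrics, and alert are your own
    # storage and paging helpers.
    recent_executions = fetch_executions_since(hours=24)
    if not recent_executions:
        return {"total_executions": 0}  # nothing to compare; avoids dividing by zero

    results = {
        "total_executions": len(recent_executions),
        "success_rate": sum(1 for e in recent_executions if e.completed) / len(recent_executions),
        "avg_steps": mean(e.steps_taken for e in recent_executions),
        "error_types": Counter(e.error for e in recent_executions if e.error),
    }

    # Compare against baseline from last week
    baseline = fetch_baseline_metrics(weeks_ago=1)

    for key in ["success_rate", "avg_steps"]:
        if not baseline.get(key):
            continue  # no usable baseline for this metric
        delta = (results[key] - baseline[key]) / baseline[key]
        if abs(delta) > 0.1:  # 10% relative drift threshold
            # Report the signed change: drift can be a rise or a fall.
            alert(f"{key} drifted by {delta:+.1%} vs. last week's baseline")

    return results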

Handling Failure Modes You Can’t Test

Some failure modes only show up in production because they depend on live data, race conditions, or state that persists across multiple agent runs.

When you catch a novel failure in production, don’t just fix it in code. Add it to your evaluation suite immediately. Implement it as both a unit test and an execution replay test: re-run the exact sequence of steps that led to the failure and verify your fix prevents it.
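An execution replay test can be sketched as follows: record the tool responses from the failing production run, then serve them back in order from a stub instead of calling live services. Everything named here (`RecordedStep`, `ReplayTools`, `fetch_with_retries`) is hypothetical scaffolding for illustration; your agent framework will have its own way to inject tools.

```python
from dataclasses import dataclass


@dataclass
class RecordedStep:
    tool: str
    args: dict
    response: object  # value the tool returned in production, or the exception it raised


class ReplayTools:
    """Serves recorded responses in order instead of calling live services."""

    def __init__(self, steps):
        self._steps = list(steps)
        self.calls = []

    def call(self, tool, args):
        self.calls.append((tool, args))
        if not self._steps:
            raise AssertionError(f"agent made an unrecorded call to {tool}")
        step = self._steps.pop(0)
        if step.tool != tool:
            raise AssertionError(f"expected a call to {step.tool}, got {tool}")
        if isinstance(step.response, Exception):
            raise step.response
        return step.response


def fetch_with_retries(tools, max_attempts):
    """Stand-in for the fixed agent logic: give up after max_attempts failures."""
    for _ in range(max_attempts):
        try:
            return {"result": tools.call("fetch_data", {})}
        except ConnectionError:
            continue
    return {"error": "api_failed_after_retries"}


# Recorded incident: the API timed out on every attempt.
recorded = [RecordedStep("fetch_data", {}, ConnectionError("timeout")) for _ in range(5)]
tools = ReplayTools(recorded)
result = fetch_with_retries(tools, max_attempts=2)
print(result, len(tools.calls))
```

Because the stub replays the exact responses seen in production, the test stays deterministic even when the original failure depended on a flaky live service.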

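import unittest
from unittest.mock import Mock

from requests import HTTPError  # assumes the agent's API client raises requests' HTTPError

# Agent and fetch_user_tool come from the application under test.


class TestAgentFailureModes(unittest.TestCase):
    def test_retry_loop_on_flaky_api(self):
        """Production incident: agent retried flaky API indefinitely."""
        # Two failures, then a success the agent should never reach.
        mock_api = Mock(side_effect=[HTTPError(), HTTPError(), {"success": True}])

        agent = Agent(api=mock_api, max_retries=2)
        result = agent.execute(task="fetch_data")

        # Verify the agent makes no more than max_retries calls and reports the failure
        self.assertLessEqual(mock_api.call_count, 2)
        self.assertEqual(result["error"], "api_failed_after_retries")

    def test_hallucinated_tool_argument(self):
        """Production incident: agent called tool with invalid arg type."""
        # This test runs against a live model, so treat it as a regression eval
        # rather than a deterministic unit test.
        agent = Agent(model="claude-opus-4.8", tools=[fetch_user_tool])

        # This input caused the agent to invent a user ID
        result = agent.execute(task="Find user named 'John Doe'")

        # Verify every fetch_user call passed a numeric user_id
        for call in agent.tool_calls:
            if call.tool == "fetch_user":
                self.assertIsInstance(call.args["user_id"], int)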

The Cost-Signal Tradeoff

Running LLM-as-a-judge on every output is expensive and slow. But running it on zero outputs means you’re flying blind. The right sampling rate depends on your error tolerance and cost budget.

If you’re at $10k/month on Claude Opus calls, adding 5% sampling evals (500 extra model calls/day at 2 cents each) costs an extra ~$300/month.
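The arithmetic behind an estimate like this is easy to parameterize so you can test other sampling rates. In the sketch below, the 10,000 executions per day is inferred from the figures above (5% sampling yielding 500 judge calls per day), and the 30-day month is an assumption.

```python
def monthly_judge_cost(executions_per_day: int, sample_rate: float,
                       cost_per_call: float, days_per_month: int = 30) -> float:
    """Estimated monthly spend on LLM-as-judge calls, in dollars."""
    judge_calls_per_day = executions_per_day * sample_rate
    return judge_calls_per_day * cost_per_call * days_per_month


# 10,000 executions/day is inferred: 5% sampling yields the 500 judge calls/day.
cost = monthly_judge_cost(executions_per_day=10_000, sample_rate=0.05, cost_per_call=0.02)
share = cost / 10_000  # fraction of a $10k/month model budget

print(f"${cost:,.0f}/month, {share:.0%} of budget")
```

Doubling the sample rate to 10% doubles the judge spend, so the rate is a direct lever between cost and how quickly systematic issues surface.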

Question via Hacker News