
Office Hours — How do you actually monitor AI agents in production when there's no standard playbook yet?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: one question from the trenches, one opinionated answer.

How do you actually monitor AI agents in production when there’s no standard playbook yet?

You’re monitoring three separate things and most people conflate them. First, the mechanical stuff: latency, error rates, token usage, cost per inference. That’s vanilla observability. Throw it in Datadog or your existing stack.
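
If you want a starting point for that mechanical layer, here's a minimal Python sketch using the Datadog statsd client, since Datadog came up above. The metric names and tags are placeholders, and the run result is assumed to be a dict exposing token counts and cost.

import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def record_run_metrics(agent_name, run_fn):
    """Wrap one agent run and emit latency, error, token, and cost metrics."""
    start = time.monotonic()
    try:
        result = run_fn()  # assumed to return a dict with token counts and cost
        statsd.increment("agent.runs", tags=[f"agent:{agent_name}", "status:ok"])
        statsd.histogram("agent.input_tokens", result["input_tokens"], tags=[f"agent:{agent_name}"])
        statsd.histogram("agent.output_tokens", result["output_tokens"], tags=[f"agent:{agent_name}"])
        statsd.histogram("agent.cost_usd", result["cost_usd"], tags=[f"agent:{agent_name}"])
        return result
    except Exception:
        statsd.increment("agent.runs", tags=[f"agent:{agent_name}", "status:error"])
        raise
    finally:
        statsd.histogram("agent.latency_ms", (time.monotonic() - start) * 1000, tags=[f"agent:{agent_name}"])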

Second, the model behavior layer. This is where it gets weird. You need traces of what the agent decided to do at each step, not just the final output. Tools like LangSmith or custom logging capture the agent’s reasoning chain, tool calls, and intermediate outputs. That’s how you spot when an agent starts hallucinating tool parameters or calling the wrong function repeatedly.
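
A homegrown version of that behavior layer can be as simple as a tracer object that records every tool call the agent makes. This is a sketch, not LangSmith's API; the field names just mirror the trace JSON later in this post.

import json
import time
import uuid

class AgentTracer:
    """Collects per-step tool call records for one agent run."""

    def __init__(self, agent_name):
        self.trace = {
            "trace_id": f"{agent_name}-{uuid.uuid4().hex[:8]}",
            "tool_calls": [],
        }

    def record_tool_call(self, tool_name, args, call_fn):
        """Run a tool and record what the agent asked for, what happened, and how long it took."""
        start = time.monotonic()
        status, output = "failed", None
        try:
            output = call_fn(**args)
            status = "success"
            return output
        finally:
            self.trace["tool_calls"].append({
                "tool": tool_name,
                "args": args,                          # the parameters the model chose
                "status": status,
                "output_preview": repr(output)[:200],  # intermediate output, truncated
                "latency_ms": int((time.monotonic() - start) * 1000),
            })

    def flush(self, path):
        """Append the finished trace as one JSON line."""
        with open(path, "a") as f:
            f.write(json.dumps(self.trace) + "\n")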

Third, the business outcome layer. Did the agent actually solve the user’s problem? A successful API call with perfect syntax might still be wrong. You need a feedback loop, ideally human-in-the-loop for critical paths. If you’re using Claude Opus 4.7 or GPT-5.5 for agent tasks, tag a sample of runs for human review. Start with 5-10% and adjust based on error patterns.
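
The tagging step itself is trivial; here's a sketch assuming your traces look like the JSON example below. The 7% rate and the field names are placeholders you'd tune.

import random

REVIEW_RATE = 0.07  # start inside the 5-10% band, adjust on observed error patterns

def tag_for_review(trace):
    """Flag a run for human review; failed runs always get reviewed."""
    trace["human_reviewed"] = False
    trace["needs_review"] = (
        trace.get("outcome") != "success" or random.random() < REVIEW_RATE
    )
    return trace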

Concrete example: token cost drift

Here’s what you’re actually looking for. Say you’re running a multi-step coding agent that clones a repo, reads files, runs tests, and opens a PR. On Claude Opus 4.7, that’s roughly 50-80K input tokens per run if you’re caching file reads. At $3 per million input tokens, that’s $0.15-0.24 per run, and one failure loop that reruns the entire chain roughly doubles it to $0.30-0.48.
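
The arithmetic, written out. The $3 per million rate is the one quoted above; swap in whatever your model actually charges for input and for cached reads.

INPUT_RATE_USD_PER_MILLION = 3.0  # the rate quoted above, not a universal price

def input_cost_usd(input_tokens, full_chain_reruns=0):
    """Input-token cost for one run plus any full-chain reruns."""
    return (1 + full_chain_reruns) * input_tokens * INPUT_RATE_USD_PER_MILLION / 1_000_000

print(input_cost_usd(50_000), input_cost_usd(80_000))        # 0.15 0.24
print(input_cost_usd(50_000, 1), input_cost_usd(80_000, 1))  # 0.3 0.48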

Wire this into your trace logs:

{
  "trace_id": "agent-pr-4521",
  "timestamp": "2026-05-02T14:32:11Z",
  "input_tokens": 74200,
  "output_tokens": 12400,
  "cost_usd": 0.31,
  "steps": 7,
  "tool_calls": [
    {"tool": "git_clone", "status": "success", "latency_ms": 2840},
    {"tool": "read_files", "status": "success", "latency_ms": 1920},
    {"tool": "run_tests", "status": "failed", "retries": 2, "final_latency_ms": 8200}
  ],
  "outcome": "success",
  "human_reviewed": false
}

Pull this weekly. If you see input tokens creeping up 20% month-over-month, the agent is either seeing larger codebases or getting into retry loops. Both are fixable, but you only see them if you instrument token usage at the trace level, not just aggregate it.
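
Here's what that weekly pull could look like, assuming the traces are appended to a JSONL file with the fields above. The 20% threshold is the one from this paragraph; the baseline is whatever last month's average was.

import json
from statistics import mean

def weekly_report(trace_path, baseline_input_tokens):
    """Flag the two drift signals: rising input tokens and runs stuck in retries."""
    with open(trace_path) as f:
        traces = [json.loads(line) for line in f]

    avg_input = mean(t["input_tokens"] for t in traces)
    retry_heavy = [
        t["trace_id"] for t in traces
        if any(call.get("retries", 0) >= 2 for call in t["tool_calls"])
    ]

    if avg_input > 1.2 * baseline_input_tokens:
        print(f"input tokens up {avg_input / baseline_input_tokens - 1:.0%} vs baseline")
    print(f"{len(retry_heavy)} runs with retry loops: {retry_heavy[:10]}")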

The trap is waiting for perfect metrics

You won’t get them. Instead, instrument everything, pull weekly samples, and look for anomalies in the trace data. When an agent starts making weird decisions, the traces will show you where it went sideways faster than any aggregate metric.

Watch for two failure modes. First, cascading tool failures: the agent calls tool A successfully, then tool B with parameters derived from A’s output, and tool B fails because A actually returned something unexpected. Second, token explosion: the agent hits a retry loop and balloons token usage on successive attempts. Both show up in trace inspection before they hit your error budget.
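
Both can be flagged mechanically from the trace shape above. These checks are crude proxies (a success followed by a failure isn't proof of a cascade, and the per-step token budget is a guess you'd calibrate), but they surface the traces worth reading.

def looks_like_cascade(trace):
    """A successful call immediately followed by a failed one that consumed its output."""
    calls = trace["tool_calls"]
    return any(
        a["status"] == "success" and b["status"] == "failed"
        for a, b in zip(calls, calls[1:])
    )

def looks_like_token_explosion(trace, per_step_budget=15_000):
    """Input tokens far beyond what the step count should need, i.e. retry bloat."""
    return trace["input_tokens"] > trace["steps"] * per_step_budget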

Human review isn’t a checkbox. Tag 5-10% of runs for review, but bias toward outliers: runs that took more than three retries, runs that exceeded your cost per task baseline by 2x, runs where the agent switched tools three times in a row. That’s where the signal is. Random sampling catches nothing.
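
As a sketch, that outlier bias turns into a filter like this. The retry, cost, and tool-switch thresholds are the ones named above (tool switching is approximated as a total count rather than strictly consecutive), and cost_baseline_usd is your own per-task baseline.

def review_queue(traces, cost_baseline_usd):
    """Pick the runs worth a human's time: heavy retries, cost blowouts, tool thrashing."""
    def is_outlier(trace):
        retries = sum(call.get("retries", 0) for call in trace["tool_calls"])
        tools = [call["tool"] for call in trace["tool_calls"]]
        switches = sum(1 for a, b in zip(tools, tools[1:]) if a != b)
        return (
            retries > 3
            or trace["cost_usd"] > 2 * cost_baseline_usd
            or switches >= 3
        )
    return [t for t in traces if is_outlier(t)]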

Bottom line: Log full execution traces with token counts, tool calls, retries, and latency at each step. Pull and manually inspect weekly samples, biased toward cost or latency outliers. Mechanical monitoring catches infrastructure problems. Trace inspection catches model drift and agent behavior degradation before users see it.

Question via Hacker News