
Office Hours — What kinds of tasks are AI agents actually reliable at in production today?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

One question from the trenches, one opinionated answer.

What kinds of tasks are AI agents actually reliable at in production today?

Honestly, the reliable ones are narrower than people want to admit. Agents that work well tend to have a few things in common: they operate in constrained domains, the tools they call have predictable outputs, and failure modes are easy to detect and recover from.

Tasks That Actually Work

Classification and routing tasks work. Using Claude Opus 4.7 or GPT-5.4 to sort incoming support tickets into categories, then dispatching them to the right queue? That’s solid. The output space is bounded, you can validate the decision easily, and a bad call just means a human reviews it faster.
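
Here’s a minimal sketch of what “bounded output space” means in practice. The model call itself is a stand-in (pass in whatever client you already use); the reliability comes from the validation and fallback wrapped around it, and the category names are purely illustrative.

```python
# Sketch of bounded-output ticket routing. `classify` is whatever LLM call
# you already have; the reliable part is validating its output against a
# fixed label set and falling back to human triage on anything else.
CATEGORIES = {"billing", "bug", "account", "feature_request", "other"}
QUEUES = {
    "billing": "queue:billing",
    "bug": "queue:engineering",
    "account": "queue:support",
    "feature_request": "queue:product",
    "other": "queue:triage",  # humans look at this one
}

def route_ticket(ticket_text: str, classify) -> str:
    """`classify` is any callable that asks a model for a single category label."""
    label = classify(ticket_text).strip().lower()
    if label not in CATEGORIES:
        # A bad call is easy to detect and cheap to recover from.
        return QUEUES["other"]
    return QUEUES[label]
```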

Code generation in sandboxed environments works when the spec is clear: generating SQL queries, Kubernetes manifests, or Terraform configs against a schema you control, where you can execute and validate the output before it touches production. The same goes for simple document summarization when the source material is clean.
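
For SQL, “validate before it touches production” can be as simple as planning the generated query against a scratch copy of the schema. This is a sketch using SQLite’s in-memory mode with an illustrative schema; the same idea applies to `terraform validate` or a dry-run `kubectl apply` for config generation.

```python
# Minimal sketch: plan a model-generated query against an in-memory copy of
# the schema you control. It never touches real data, and a syntax or schema
# error surfaces immediately instead of in production.
import sqlite3

SCHEMA_DDL = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
"""

def validate_generated_sql(query: str) -> tuple[bool, str]:
    """Return (ok, message) for a model-generated query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_DDL)
        # EXPLAIN QUERY PLAN parses and plans the statement without running it.
        conn.execute("EXPLAIN QUERY PLAN " + query)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()
```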

The more interesting case: autonomous coding agents are now genuinely reliable for multi-step tasks. Claude Code with Opus 4.7, Cursor Agent, and GitHub Copilot (which now supports GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro across the same interface) can clone a repo, run tests, fix failures, and open PRs without human intervention on each step. The key difference from other agentic work is that code has an objective verification signal. A test either passes or it fails. A linter either complains or it doesn’t. That tight feedback loop is what lets agents recover from mistakes and iterate. A developer can now queue up a task like “add pagination to the user list endpoint and ensure all tests pass” and get a working PR hours later without touching the code themselves.
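
Stripped down, that loop is simple. The sketch below isn’t any particular tool’s implementation; `propose_patch` and `apply_patch` stand in for the agent’s model call and file edits, and the objective signal is just the exit code of the test runner.

```python
# Rough sketch of the loop that makes coding agents reliable: run the suite,
# feed the exact failure output back to the model, apply its patch, repeat
# until green or out of budget.
import subprocess

def run_tests() -> tuple[bool, str]:
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def fix_until_green(propose_patch, apply_patch, max_iterations: int = 10) -> bool:
    for _ in range(max_iterations):
        passed, output = run_tests()
        if passed:
            return True          # objective signal: the suite is green
        patch = propose_patch(output)  # model sees the exact failure text
        apply_patch(patch)             # e.g. write files, then loop again
    return False  # budget exhausted, hand back to a human
```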

The Feedback Loop Matters More Than Model Size

Here’s what actually separates the reliable agents from the ones that drift: fast, unambiguous feedback. When an agent can run a command and see results in under a minute, it can chain together dozens of steps. Claude Opus 4.6 holds the production record for longest autonomous operation window at 14.5 hours on a single task, but that was because the agent was continuously executing tests, reading output, and adjusting. Each iteration had a clear signal.

Compare this to an agent trying to refactor a legacy system’s authentication layer. Same capability tier, same model. But now each decision lives in a gray zone. “Should this validation happen in middleware or in the service?” The agent can write the code. It cannot tell whether the architectural decision was sound until weeks later, when the team is in production and hitting edge cases.

This is why the same model that excels at fixing failing tests will struggle with multi-step workflows in unstructured domains. It’s not capability. It’s signal.

Where Agents Still Fail

What’s still messy is anything requiring real-time reasoning across multiple tool calls in unstructured environments. Multi-step workflows where each step’s output becomes fuzzy input for the next one. That’s where agents still hallucinate, loop infinitely, or produce plausible-sounding but incorrect results.

Consider an agent trying to migrate a legacy monolith to microservices. Each decision (should this service be separate? what should the API contract look like?) requires judgment calls without clear pass/fail criteria. An agent can’t detect whether it’s gone off track until much later in the process, if at all. Compare that to an agent fixing a failing unit test. It knows immediately whether its change worked.

Similarly, agentic RAG across heterogeneous sources still drifts. An agent pulling from multiple documentation sites, Slack archives, and internal wikis will confidently synthesize contradictory information. The problem isn’t hallucination in the classic sense. It’s that the agent has no way to know it’s wrong until a human catches it. You’ll notice the problem only after the agent has already made decisions downstream.

The Verification Signal Tradeoff

This is the core tradeoff in production agentic systems. Tasks with fast, objective verification signals (code compilation, test execution, schema validation) can sustain longer autonomous runs. Tasks without them should either have shorter decision chains or explicit human checkpoints between steps.
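
One way to make the checkpoint idea concrete: model each step as either objectively verifiable or requiring sign-off, and stop the chain the moment a check fails or a human says no. A sketch with illustrative names, not a framework:

```python
# Checkpoint pattern sketch: steps with an objective check run autonomously;
# steps without one pause for explicit human approval before the chain continues.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    run: Callable[[], str]                           # produces the step's output
    verify: Optional[Callable[[str], bool]] = None   # objective check, if one exists

def execute(steps: list[Step], ask_human: Callable[[str, str], bool]) -> None:
    for step in steps:
        output = step.run()
        if step.verify is not None:
            if not step.verify(output):
                raise RuntimeError(f"{step.name}: verification failed")
        elif not ask_human(step.name, output):
            raise RuntimeError(f"{step.name}: human rejected, stopping chain")
```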

Coding agents can run for hours when there’s continuous feedback. An agent trying to restructure documentation or decide on API design patterns will drift. A 20-step workflow where step 15 requires subjective judgment but you don’t realize it until reviewing the full output is a waste of compute and a source of bugs.

The Real Pattern

The pattern that’s actually holding up: use agents for orchestration within tight guardrails, not as autonomous decision makers in ambiguous domains. Deploy them where success is verifiable in seconds or minutes, not where it requires hours of downstream validation.

That means: classification systems, code generation with automated testing, ticket routing, basic document summarization. It also means the newer breed of coding agents, because they have that constant objective signal.

It does not mean: agents making architectural decisions, agents doing exploratory data analysis without human interpretation, agents managing long-form creative or strategic work, or agents reasoning through novel problems where you can’t easily tell if they’re right.

Bottom line: Deploy agents where success is objectively verifiable. Autonomous coding tasks with test suites, classification and routing, constrained code generation. For anything requiring subjective judgment, architectural decision-making, or hours-long verification cycles, keep humans in the loop or use explicit workflow choreography instead of agentic autonomy. The model tier matters less than the feedback signal.

Question via Hacker News