Office Hours — What are some actual use cases of AI Agents right now?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
What are some actual use cases of AI Agents right now?
The honest answer is that most “agents” in production are still pretty narrow. Real wins are happening in customer support automation, where an agent built on Claude Opus 4.7 or GPT-5.4 can handle tier-1 tickets, look up account info via APIs, and escalate when needed. Companies like Intercom and Zendesk have customers actually shipping this.
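In sketch form, that loop is simple: pull real account data, draft an answer, and escalate anything the model can’t confidently resolve. A minimal illustration, not any vendor’s SDK; every name here (lookup_account, answer, escalate) is hypothetical.

def handle_ticket(ticket, model, crm):
    # Hypothetical tier-1 support loop; all APIs here are illustrative stand-ins.
    account = crm.lookup_account(ticket["customer_id"])  # real data via API, not guesses
    draft = model.answer(ticket["body"], context=account)
    # The escalation check is the core design decision: the agent only
    # closes tickets it can resolve with high confidence.
    if draft.confidence < 0.8 or ticket["tier"] > 1:
        crm.escalate(ticket["id"], note=draft.text)      # human takes over
        return "escalated"
    crm.reply(ticket["id"], draft.text)
    return "resolved"

The threshold is the whole trick: the agent’s value isn’t answering everything, it’s clearing the mechanical tickets and routing the rest.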
Code generation agents have moved well past IDE assists. Tools like Claude Code, Cursor Agent, and Devin can handle genuine multi-step tasks autonomously: cloning a repo, running tests, fixing failures, and opening a PR, with no human in the loop for the individual steps. GitHub Copilot now supports multiple models (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro) and can tackle tasks that used to require context-switching between tools. JetBrains AI still owns the IDE-assist category, but the fully autonomous end is real and in active use. Reliability still drops off on architectural decisions and anything requiring sustained context across many files, but the progress here is genuine.
When Agents Actually Work
The clearest pattern: tightly scoped agents with an objective success signal and a fast feedback loop. Code is the proof point because it’s verifiable. A test suite tells you immediately if the agent succeeded. A linter catches style issues. CI pipelines block bad PRs. That tight loop is what makes autonomous agents viable.
Consider a concrete example: an agent tasked with “upgrade this dependency and keep tests green.” The success signal is binary and fast.
# Agent workflow
git clone <repo> && cd <repo>
git checkout -b upgrade/lodash-4.18.2
npm install
npm test                      # baseline: all pass
npm install lodash@^4.18.2    # bump from ^4.17.0
npm test                      # agent sees 3 failures in utils.test.js
# agent reads error output, modifies code, reruns
npm test                      # passes
git commit -am "upgrade lodash to ^4.18.2"
git push --set-upstream origin upgrade/lodash-4.18.2
The entire cycle completes in minutes. A human then reviews the diff before merge. That review is lightweight because the agent has already proven the code works. The agent isn’t replacing judgment. It’s eliminating the mechanical work so the review becomes a spot-check rather than a full investigation.
Claude Opus 4.7 holds the record for longest autonomous operation window at 14.5 hours, which matters for tasks like large-scale refactors across thousands of files. GPT-5.4 adds native computer use in the API, meaning agents can navigate GUIs, not just shell commands. Both expand the surface area of what “tightly scoped” can include.
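Whatever the provider, computer use reduces to the same loop: capture the screen, ask the model for one action, execute it, repeat until done. A minimal sketch; the client and its methods are hypothetical placeholders, not any specific vendor API.

def run_gui_task(client, screen, goal, max_steps=50):
    # Generic computer-use loop; `client.next_action` is a hypothetical
    # stand-in for whatever the model API actually exposes.
    for _ in range(max_steps):
        shot = screen.capture()                          # current pixels
        action = client.next_action(goal=goal, screenshot=shot)
        if action.kind == "done":
            return action.result
        screen.execute(action)                           # click, type, scroll
    raise TimeoutError("agent exceeded its step budget")

The step budget matters more than it looks: a GUI loop has no test suite to stop it, so the cap is the only hard brake.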
Research automation is running in production at consulting firms and finance teams. Agents that crawl docs, pull data from APIs, and summarize findings operate reliably because the errors are catchable. A human reviews the output before it ships. The loop is tight enough that problems surface before they matter.
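A sketch of why that works, with hypothetical fetch and summarize helpers standing in for the agent’s tools; the one structural decision that matters is writing to a review queue instead of publishing directly.

def research_brief(sources, fetch, summarize, review_queue):
    # Gather, summarize with attribution, queue for sign-off.
    # `fetch` and `summarize` are stand-ins for the agent's actual tools.
    findings = []
    for url in sources:
        doc = fetch(url)                                  # crawl or API pull
        findings.append({"source": url, "summary": summarize(doc)})
    # Nothing ships until a human approves it; errors surface here, cheaply.
    review_queue.put({"brief": findings, "status": "pending_review"})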
Where Agents Still Drift
What’s unreliable: agents that need to reason about ambiguous trade-offs or maintain coherent intent across a genuinely novel codebase tend to drift. Agentic RAG over multiple heterogeneous data sources still gets confused. Pulling from Slack, Jira, internal wikis, and cloud docs simultaneously creates enough noise that even frontier models lose the thread. The agent conflates old decisions with current context, pulls contradictory guidance from different sources, and ends up confident but wrong.
This is partly an architecture problem. Most teams reach for naive RAG, dump everything into a vector store, and expect the model to sort it out. It can’t. Heterogeneous sources with different update frequencies, conflicting terminology, and varying reliability levels require metadata filtering, recency weighting, and explicit source attribution before retrieval even starts. Fixing the pipeline helps. It doesn’t fully solve the problem.
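Concretely, that means re-scoring retrieved chunks on more than embedding similarity before the model ever sees them. A minimal sketch; the trust weights and the 90-day half-life are illustrative numbers to tune per corpus, not recommendations.

import time

SOURCE_TRUST = {"wiki": 0.9, "jira": 0.7, "slack": 0.4}  # illustrative weights

def rerank(candidates, now=None, half_life_days=90):
    # Each candidate: {"text", "embedding_score", "source", "updated_at"}.
    # Recency decay halves a chunk's weight every `half_life_days`.
    now = now or time.time()
    scored = []
    for c in candidates:
        age_days = (now - c["updated_at"]) / 86400
        recency = 0.5 ** (age_days / half_life_days)      # exponential decay
        trust = SOURCE_TRUST.get(c["source"], 0.5)        # unknown source: neutral
        scored.append((c["embedding_score"] * recency * trust, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep attribution attached so conflicting sources stay visible downstream.
    return [{"text": c["text"], "source": c["source"], "score": s}
            for s, c in scored]

Explicit attribution in the output is the underrated part: when Slack and the wiki disagree, the model can at least see that they disagree instead of silently averaging them.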
Anything requiring judgment calls isn’t something you want running fully unsupervised. “Is this refactor safe?” “Should we deprecate this API?” “Does this architectural change align with our strategy?” These require product sense, risk assessment, and knowledge of organizational constraints. An agent can measure whether tests pass. It cannot measure whether a design choice aligns with unstated company values. That asymmetry matters.
When you can’t define success objectively and measure it fast, keep a human in the loop. The human becomes the verification layer. That’s not a limitation. It’s the right place to put judgment that carries real consequences. Autonomous agents run tests in seconds. They cannot run product judgment in seconds.
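That boundary can be encoded directly. A hedged sketch: route each task by whether a machine-checkable verifier exists for it; the registry and the queue are illustrative plumbing.

def dispatch(task, verifiers, run_agent, human_queue):
    # Run autonomously only when success is machine-checkable.
    verifier = verifiers.get(task["kind"])     # e.g. a test-suite runner
    if verifier is None:
        human_queue.put(task)                  # judgment call: a human decides
        return "queued_for_human"
    result = run_agent(task)
    return "accepted" if verifier(result) else "rejected"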
Bottom line: If there’s a fast, objective signal telling the agent whether it succeeded, autonomous agents are worth trying today. If success requires human judgment to evaluate, keep a human in the loop. The boundary between those two cases is where most of the interesting engineering work is happening right now.
Question via Hacker News