Office Hours: What are developers actually building with AI coding agents right now?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

2026-05-22



What are developers actually building with AI coding agents right now?

The answer is concrete, revenue-generating work: autonomous testing frameworks that find bugs faster than humans, multi-file refactoring that preserves architectural intent, ticket-to-PR pipelines that run with no human in the loop until review, and infrastructure scaffolding for bootstrapping new services. This isn’t theoretical. Railway is seeing individual customers spend $200K+ monthly on coding agent workloads. Ramp integrated Codex into code review and saw feedback cycles collapse from hours to minutes. OpenAI and Dell are shipping on-premise agent deployments for enterprises that need code generation without data exfiltration.

The pattern is clear: developers aren’t using agents for ideation or learning. They’re using them for mechanical tasks where success is verifiable—test generation, boilerplate elimination, dependency updates, documentation generation, and the kind of cross-file refactoring that would take a human developer a day per file.

What Actually Works

Coding agents excel when the task has a fast feedback loop: run tests, watch them pass or fail, adjust code. What fails is the inverse: open-ended design decisions, questions like “is this refactor safe?”, architecture reviews, or anything requiring subjective judgment. When an agent hits a test failure, it can iterate. When it hits a question about whether something should exist at all, it gets stuck.
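The loop is simple enough to sketch. This is a minimal illustration, not anyone's production harness: `propose` stands in for a model call and `run_tests` for a real test runner, and both names are hypothetical. The point is that a failing test becomes the next prompt.

```python
from typing import Callable, Optional


def iterate_until_green(
    propose: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes, or None if attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose(feedback)    # hypothetical agent call
        feedback = run_tests(candidate)  # None means every test passed
        if feedback is None:
            return candidate
    return None


def toy_propose(feedback: Optional[str]) -> str:
    # Stand-in for a model: the first draft is wrong, the retry fixes it.
    if feedback is None:
        return "def add(a, b): return a - b"
    return "def add(a, b): return a + b"


def toy_run_tests(source: str) -> Optional[str]:
    # Stand-in for a test suite: one assertion, reported as text on failure.
    namespace: dict = {}
    exec(source, namespace)
    result = namespace["add"](2, 3)
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"


accepted = iterate_until_green(toy_propose, toy_run_tests)
```

Here the first draft fails, the failure message goes back in, and the second draft is accepted. A design question has no `run_tests` to call, which is why the loop has nothing to iterate on.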

The real wins are in automating the tedious. A team at a mid-scale startup built an agent that watches their GitHub issues, generates a test suite from the issue description, writes implementation code to pass those tests, and opens a PR, all without human validation until the PR review stage. The agent fails maybe 20% of the time, typically on edge cases that a junior developer would also miss. The other 80% of the time, it ships code that works.

Another pattern: agents that own specific, well-defined subsystems outperform generalist agents. A coding agent that only touches your test suite, or only generates Terraform, or only refactors migrations is dramatically more reliable than an agent trying to reason about your entire codebase at once. The narrow scope matters more than the model quality.
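One cheap way to enforce that ownership is to reject any patch that touches files outside the agent's lane. A minimal sketch, assuming each agent is configured with a list of glob patterns; the pattern lists and paths below are made up for illustration.

```python
from fnmatch import fnmatchcase
from typing import List


def out_of_scope(changed_paths: List[str], allowed_patterns: List[str]) -> List[str]:
    """Return the changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_paths
        # fnmatch-style "*" also matches "/", so "tests/*" covers subdirectories.
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical scope for a test-only agent.
TEST_AGENT_SCOPE = ["tests/*", "conftest.py"]

violations = out_of_scope(
    ["tests/unit/test_invoice.py", "src/billing/invoice.py"],
    TEST_AGENT_SCOPE,
)
```

A non-empty `violations` list is a reason to discard the patch before it ever reaches review.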

The Infrastructure Tax You Pay

Autonomous coding requires serious sandboxing. You can’t let an agent run arbitrary code on your laptop. Teams are building (or adopting) container-based execution environments that:

  • Clone the repo into an ephemeral container
  • Run the agent’s code changes inside that container
  • Execute test suites and linters as the objective signal
  • Destroy the container regardless of outcome
  • Keep the code changes but discard the runtime environment

This is non-trivial infrastructure. You need to manage container lifecycle, cap execution time (agents can get stuck in infinite loops), handle resource limits, and log everything for debugging when things go wrong. A few teams tried to skip this and let agents modify production codebases directly. None of them did it twice.
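The lifecycle itself (copy in, run with a time cap, always tear down) can be sketched without a container runtime. To be clear about the swap: this uses a temporary directory and a subprocess, which gives you none of the isolation a real container does. It only illustrates the shape, and the function name and return fields are assumptions.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List, Union


def run_in_scratch_copy(
    repo_dir: Union[str, Path], command: List[str], timeout_s: float = 60.0
) -> Dict[str, object]:
    """Run a check command against a throwaway copy of the repo."""
    with tempfile.TemporaryDirectory() as scratch:
        workdir = Path(scratch) / "repo"
        shutil.copytree(repo_dir, workdir)  # the original checkout is never touched
        try:
            proc = subprocess.run(
                command,
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # hard cap for agents stuck in a loop
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "timed_out": True, "output": ""}
        # Leaving the with-block deletes the scratch copy, pass or fail.
        return {
            "passed": proc.returncode == 0,
            "timed_out": False,
            "output": proc.stdout + proc.stderr,
        }


# Example check command; a real setup would run the project's test suite here.
EXAMPLE_COMMAND = [sys.executable, "-c", "print('checks would run here')"]
```

A real version would swap the subprocess for a container with CPU, memory, and network limits, and would export the diff before teardown so the code changes survive while the runtime does not.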

Cost is the other surprise. An agent that makes ten API calls to reason about a single file change isn’t that expensive. But an agent that runs for 30 minutes analyzing a large codebase, making small changes, running tests, retrying? Check your invoice. Teams are implementing token budgets per task and hard stops when an agent crosses a threshold without making progress. One team set a $50 limit per ticket. Their agents learn fast under that constraint.
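A per-task budget with a no-progress stop is a few lines of bookkeeping. This sketch assumes you can attribute a dollar cost to each agent step and have some signal for "made progress" (for example, fewer failing tests than the previous step). The class and field names are hypothetical; the $50 cap mirrors the per-ticket limit mentioned above.

```python
from dataclasses import dataclass


@dataclass
class TaskBudget:
    max_usd: float            # hard spending cap for one ticket
    max_stalled_steps: int    # consecutive steps allowed without progress
    spent_usd: float = 0.0
    stalled_steps: int = 0

    def record(self, cost_usd: float, made_progress: bool) -> None:
        """Account for one agent step."""
        self.spent_usd += cost_usd
        self.stalled_steps = 0 if made_progress else self.stalled_steps + 1

    def should_stop(self) -> bool:
        """True once the cap is hit or the agent has stalled too long."""
        return (
            self.spent_usd >= self.max_usd
            or self.stalled_steps >= self.max_stalled_steps
        )


budget = TaskBudget(max_usd=50.0, max_stalled_steps=3)
budget.record(cost_usd=4.0, made_progress=True)
for _ in range(3):
    budget.record(cost_usd=1.5, made_progress=False)
stopped_on_stall = budget.should_stop()
```

In this run the agent has spent $8.50 of $50 but is stopped anyway, because three steps in a row changed nothing.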

What Breaks

Agents hallucinate code structure. They’ll confidently reference functions that don’t exist, import modules that aren’t in your dependencies, or assume architectural patterns you don’t actually use. A well-structured codebase with clear naming, high test coverage, and minimal magic is the baseline requirement. If your codebase is a mess, agents make it worse faster.
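Hallucinated imports are the easiest of these to catch mechanically. A minimal sketch for Python output, assuming you can build a set of known top-level module names from your lockfile, the standard library, and your own packages; the set and the sample source below are invented for illustration.

```python
import ast
from typing import Set


def unknown_imports(source: str, known_modules: Set[str]) -> Set[str]:
    """Return top-level modules imported by `source` that are not known."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports, which resolve inside the repo.
            found.add(node.module.split(".")[0])
    return found - known_modules


generated = (
    "import os\n"
    "import requests\n"
    "from fastjsonx.core import loads\n"  # made-up package name
    "from . import helpers\n"
)
suspicious = unknown_imports(generated, known_modules={"os", "requests"})
```

This only covers imports. Calls to functions that don't exist need a type checker or the test suite to surface them.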

Long-horizon planning breaks. An agent can handle a single file or a tightly coupled pair of files. Ask it to refactor across ten services and it drifts. It loses sight of the original constraint. This is why ticket-to-PR works: the ticket is the constraint. The agent’s job is bounded. The moment you ask it to reason about “the whole system,” success becomes ambiguous and the agent makes arbitrary choices.

Agent context explosion is real. If you dump your entire codebase into the context window, you’re wasting tokens and degrading performance. Teams using agents are learning to feed them just enough context: the specific files they’re modifying, the tests that validate their changes, and nothing else. This requires tooling to extract relevant context before handing it to the agent.
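The extraction tooling can start out very crude. This sketch ranks files by word overlap with the task text and packs them greedily under a token cap. Both the overlap scoring and the four-characters-per-token estimate are rough assumptions chosen to keep the example self-contained, and the file contents are invented.

```python
import re
from typing import Dict, List, Set


def _words(text: str) -> Set[str]:
    # Lowercase alphabetic runs, so snake_case names split into their parts.
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def select_context(task: str, files: Dict[str, str], max_tokens: int) -> List[str]:
    """Pick the most task-relevant files that fit within a token budget."""
    task_words = _words(task)
    ranked = sorted(
        files.items(),
        key=lambda item: len(task_words & _words(item[1])),
        reverse=True,
    )
    selected: List[str] = []
    used = 0
    for path, content in ranked:
        overlap = len(task_words & _words(content))
        cost = len(content) // 4 + 1  # rough assumption: ~4 characters per token
        if overlap == 0 or used + cost > max_tokens:
            continue  # irrelevant, or would blow the budget
        selected.append(path)
        used += cost
    return selected


repo_files = {
    "billing/invoice.py": "def compute_invoice_total(items): return sum(items)",
    "tests/test_invoice.py": "def test_invoice_total(): assert compute_invoice_total([1, 2]) == 3",
    "auth/login.py": "def login(user, password): pass",
}
context = select_context("Fix invoice total rounding", repo_files, max_tokens=1000)
```

With a generous budget this picks the invoice module and its test and leaves `auth/login.py` out. Shrink `max_tokens` and the lower-ranked file is dropped first.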

How Teams Are Shipping This

The winning pattern right now is Claude Opus 4.7 or GPT-5.5 paired with tight task scoping. One team uses Claude Code with a custom harness that:

  1. Extracts the issue description
  2. Queries the codebase for relevant files (using embeddings)
  3. Pulls the test suite that validates the change
  4. Feeds that bounded context to the agent
  5. Runs tests in a sandbox
  6. Opens a PR with the agent’s output and a summary of what it tried
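The six steps above map onto a small orchestration function. This is a sketch of the shape, not that team's harness: every callable is a hypothetical stand-in you would wire to your own retrieval, agent, sandbox, and Git host. One assumption worth flagging is that it only opens a PR when the sandboxed tests pass, which the steps above don't specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Harness:
    find_files: Callable[[str], List[str]]                            # step 2
    find_tests: Callable[[List[str]], List[str]]                      # step 3
    run_agent: Callable[[str, List[str], List[str]], Dict[str, str]]  # step 4
    run_sandboxed_tests: Callable[[Dict[str, str], List[str]], bool]  # step 5
    open_pr: Callable[[Dict[str, str], str], str]                     # step 6


def handle_ticket(issue: Dict[str, str], harness: Harness) -> Dict[str, object]:
    # Step 1: reduce the issue to the text the agent will see.
    description = f"{issue.get('title', '')}\n\n{issue.get('body', '')}".strip()
    files = harness.find_files(description)
    tests = harness.find_tests(files)
    # The agent gets only the bounded context, and returns path -> new content.
    patch = harness.run_agent(description, files, tests)
    passed = harness.run_sandboxed_tests(patch, tests)
    summary = f"Touched {len(patch)} file(s); tests {'passed' if passed else 'failed'}."
    # Assumption: failing patches are held back rather than sent to reviewers.
    pr: Optional[str] = harness.open_pr(patch, summary) if passed else None
    return {"pr": pr, "tests_passed": passed, "summary": summary}
```

Keeping each step behind a plain callable makes it easy to swap the retrieval method or the sandbox without touching the flow.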

The agent fails on maybe 15% of tickets (usually because the issue description was ambiguous, not because the agent was dumb). The team’s human review still catches edge cases. But the velocity is 3-4x what they had with junior developers writing first drafts.

Another team is building specialized agents per language. One for Python/Django, one for TypeScript/React, one for Go services. They’re not trying to build a polyglot agent. Single-language agents have stronger priors about idioms, common patterns, and anti-patterns. They fail less often.
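Routing to a per-language agent can be as simple as looking at file extensions. A minimal sketch; the agent names and the extension table are invented, and sending mixed-language changes to a human is an assumption, not something that team described.

```python
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical mapping from file extension to specialist agent.
AGENT_BY_EXTENSION = {
    ".py": "python-django",
    ".ts": "typescript-react",
    ".tsx": "typescript-react",
    ".go": "go-services",
}


def route(paths: List[str]) -> Optional[str]:
    """Pick the one specialist that covers these files, or None for triage."""
    agents = set()
    for path in paths:
        agent = AGENT_BY_EXTENSION.get(PurePosixPath(path).suffix)
        if agent is not None:
            agents.add(agent)
    # Zero matches or a mix of languages: no single specialist fits.
    return agents.pop() if len(agents) == 1 else None


chosen = route(["app/views.py", "app/models.py", "README.md"])
```

Returning `None` for a mixed change keeps the specialists inside the language they are good at.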

The Cost Reality

Agents are not cheaper than hiring once you count the whole system. They’re faster. A task that would take a developer 4 hours costs maybe $20 in API calls and compute. That’s attractive. But the infrastructure, monitoring, and guardrails aren’t free. Teams building production agent systems are spending engineering time on orchestration, not just prompt engineering.
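A back-of-the-envelope break-even makes the trade-off concrete. Only the 4 hours and $20 come from the paragraph above. The hourly rate, the monthly overhead, and the 85% success rate are illustrative assumptions, and the model ignores review time entirely.

```python
import math
from typing import Optional


def break_even_tasks_per_month(
    fixed_monthly_usd: float,
    dev_hours_per_task: float,
    dev_hourly_usd: float,
    agent_usd_per_task: float,
    success_rate: float,
) -> Optional[int]:
    """Tasks per month needed before savings cover the fixed overhead."""
    # Every attempt pays the agent cost; only successes save developer time.
    saving_per_task = (
        success_rate * dev_hours_per_task * dev_hourly_usd - agent_usd_per_task
    )
    if saving_per_task <= 0:
        return None  # the agent never pays for itself at these numbers
    return math.ceil(fixed_monthly_usd / saving_per_task)


needed = break_even_tasks_per_month(
    fixed_monthly_usd=20_000,  # assumed: infra, monitoring, engineer time
    dev_hours_per_task=4,      # from the text
    dev_hourly_usd=75,         # assumed fully loaded rate
    agent_usd_per_task=20,     # from the text
    success_rate=0.85,         # assumed
)
# needed == 86
```

Under these made-up inputs you need roughly 86 suitable tickets a month before the system pays for its own overhead, which is the arithmetic behind "high-volume" being a requirement.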

The ROI only works if you have high-volume, low-ambiguity tasks. Ticketing systems with thousands of small issues. Refactoring campaigns. Dependency updates. Boilerplate generation. If your work is mostly novel architecture and design decisions, agents won’t move the needle.

Bottom line: Autonomous coding agents are shipping in production right now, but almost entirely for bounded, verifiable tasks like test generation, boilerplate elimination, and tightly scoped refactoring. Build tight task scoping and strong sandboxing, and use agents to augment velocity on mechanical work, not to replace human judgment on design.

Question via Hacker News