Paper of the Week

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

Priyansh Trivedi, Olivier Schmitt (SonarSource). Published 2026-05-21. arXiv:2605.20049

One-sentence summary

Code quality — not just correctness — meaningfully affects coding agent task success rates, and this study is the first to isolate that variable with controlled minimal-pair experiments.

Why this paper

Every team shipping agentic coding workflows benchmarks task completion, but almost nobody measures whether the codebase the agent works in is a hidden variable. As Devstral 2 and GPT-5.3 Codex get deployed on legacy repos, this question has real budget implications.

What they did

The SonarSource team constructed matched pairs of code problems — same functionality, same tests, but one version “clean” (well-named, low complexity, no code smells) and one “unclean.” They then ran coding agents against both versions of each task, holding everything else constant, to isolate whether structural and stylistic quality affects agent performance independently of task difficulty.
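A minimal sketch of how results from a minimal-pair design could be summarized, assuming each task yields one pass/fail outcome per variant. The `PairResult` schema and the choice of an exact McNemar test are illustrative assumptions, not the paper's published analysis. Because the two variants of a task share functionality and tests, the pairs where exactly one variant passes are the ones that carry evidence about the cleanliness effect.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class PairResult:
    """Outcome of one agent on both variants of a task (hypothetical schema)."""
    task_id: str
    clean_passed: bool
    unclean_passed: bool


def exact_mcnemar(b, c):
    """Two-sided exact McNemar test on the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    # Under the null, each discordant pair favors either variant with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_summary(results):
    """Pass rates per variant plus discordant counts and a paired p-value."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one pair")
    clean_only = sum(r.clean_passed and not r.unclean_passed for r in results)
    unclean_only = sum(r.unclean_passed and not r.clean_passed for r in results)
    return {
        "clean_rate": sum(r.clean_passed for r in results) / n,
        "unclean_rate": sum(r.unclean_passed for r in results) / n,
        "clean_only": clean_only,
        "unclean_only": unclean_only,
        "p_value": exact_mcnemar(clean_only, unclean_only),
    }
```

Comparing the two raw pass rates alone would discard the pairing; counting discordant pairs keeps task difficulty held constant within each comparison.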

Key findings

Agents showed measurably lower task-completion rates on unclean code variants compared to their clean counterparts, even when the underlying task logic was identical
The effect was consistent across multiple agent configurations, suggesting this is a property of how LLMs process noisy context rather than an artifact of any single model
Failure modes on unclean code skewed toward the agent misidentifying the right location to edit, not toward generating syntactically incorrect patches
Clean code reduced the number of agent turns needed to complete tasks, which directly lowers inference cost
The dataset of minimal-pair problems is released, enabling reproducible benchmarking on this dimension

Why it matters for practitioners

If you’re evaluating a coding agent against your internal codebase, your benchmark numbers are confounded by your codebase’s quality: agents will appear weaker on legacy repos not because the task is harder but because the context is noisier. This also extends the usual justification for code cleanup beyond human maintainability to agent performance and cost per task.
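To make the cost-per-task point concrete, here is a back-of-the-envelope sketch. It assumes retries are independent and each attempt costs the same, so the expected number of attempts until the first success is 1 / p. The function name and every number in the usage note are hypothetical and are not taken from the paper.

```python
def expected_cost_per_success(cost_per_attempt, success_rate):
    """Expected spend until the first success, assuming independent retries."""
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must be non-negative")
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Geometric distribution: expected attempts until first success is 1 / p.
    return cost_per_attempt / success_rate
```

With made-up inputs, `expected_cost_per_success(0.60, 0.5)` is double the per-attempt cost, while the same call at a success rate of 1.0 returns the per-attempt cost unchanged. A lower success rate and a higher turn count per attempt compound, and this estimate still leaves out human review time.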

What you can use today

Before deploying a coding agent on a new repo, run a quick static analysis pass (SonarQube, Ruff, or similar) and treat the output as a proxy for expected agent degradation — the dirtier the repo, the more you should budget for retries and human review
If you’re building evals for coding agents, pull the released minimal-pair dataset to add a code-quality axis to your benchmark rather than treating codebase quality as a constant
Prioritize cleanup of the files agents touch most often (for example, the files your issues and recent commits reference most) before rolling out agentic workflows; the paper’s findings suggest targeted cleanup may yield disproportionate reliability gains
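The first and third suggestions above can be combined into a rough prioritization sketch. It assumes findings come from `ruff check --output-format json`, where each finding is taken to carry a `filename` field, and that you supply a hypothetical `touches` mapping of file path to how often that file is changed or referenced. The issue-count times touch-count score is an illustrative heuristic, not a metric from the paper.

```python
import json
from collections import Counter


def issue_counts(ruff_json):
    """Count findings per file from Ruff's JSON report (a list of findings)."""
    findings = json.loads(ruff_json)
    return Counter(f["filename"] for f in findings)


def rank_cleanup_targets(issues, touches, top=10):
    """Rank files by issue count times touch count, highest score first."""
    scores = {path: issues[path] * touches.get(path, 0) for path in issues}
    # Sort by descending score, then by path so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Files that are never touched score zero and are dropped.
    return [(path, score) for path, score in ranked if score > 0][:top]
```

A file with many findings that nobody touches drops out of the ranking, while a moderately messy file on a busy path rises to the top, which matches the idea of cleaning where agents will actually work.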