Your Code Passes Tests But Will It Survive?
Based on research by Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu
Modern AI coding agents face a troubling reality: they can pass tests today, but their code crumbles under pressure tomorrow. New research shows that these tools degrade rapidly when asked to iterate on their own work over long horizons, failing to maintain the quality that software architects demand.
SlopCodeBench exposes this hidden flaw by evaluating 11 models on 20 complex problems in which agents must repeatedly extend their own software without tight constraints on its internal structure. Unlike previous benchmarks that check single-shot solutions, this study tracks how code rots over time using two metrics: verbosity and structural erosion. The results are stark: no agent solved any problem from start to finish, and the best model passed only 17.2% of checks at its peak.
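The paper's exact metric definitions are not reproduced here, but a verbosity measure of this kind can be sketched as the ratio of an agent solution's source lines to a human-written reference implementation's. The snippet below is a minimal illustration under that assumption; the function names and the toy `agent`/`reference` snippets are hypothetical, not taken from the benchmark:

```python
def count_sloc(source: str) -> int:
    """Count non-blank, non-comment lines (a crude SLOC measure)."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )

def verbosity_ratio(agent_source: str, reference_source: str) -> float:
    """Agent SLOC relative to a human reference; values > 1.0 mean more verbose."""
    return count_sloc(agent_source) / count_sloc(reference_source)

# Hypothetical agent output: an unnecessary helper and a manual loop.
agent = """
# agent-generated helper
def add(a, b):
    result = a + b
    return result

def add_many(values):
    total = 0
    for v in values:
        total = add(total, v)
    return total
"""

# Hypothetical human-written reference for the same task.
reference = """
def add_many(values):
    return sum(values)
"""

print(verbosity_ratio(agent, reference))
```

A real benchmark would compute this over whole repositories and normalize for functionality, but even this toy ratio captures the pattern the study reports: agent code doing the same work in several times the lines.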
As agents worked through evolving specifications, their code grew redundant in nearly 90% of cases, and structural complexity shifted toward difficult-to-manage functions in 80% of instances. Compared with real-world open-source repositories, agent-generated code was more than twice as verbose and significantly more eroded than human-written equivalents. Tracking those human-maintained projects over time showed their code quality holding stable, while agent code worsened with every iteration. Even prompting the agents to fix quality issues up front could not halt the decline.
The takeaway is clear: current benchmarks mislead by measuring only pass rates, ignoring that software must evolve. Without better design discipline, today's coding agents cannot build the durable systems that long-term development requires.
Source: SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks by Gabriel Orlanski et al., https://arxiv.org/abs/2603.24755