SlopCodeBench Reveals Why AI Coders Crumble Under Pressure
Based on research by Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu
While artificial intelligence tools promise to revolutionize software development, a new study exposes a disturbing truth: coding agents get worse the longer they work on a single project. Instead of building robust systems that improve over time, these models struggle with basic architectural decisions and rapidly degrade in quality during iterative tasks.
Researchers introduced SlopCodeBench to test how language models handle long-horizon programming challenges in which specifications evolve naturally. Unlike standard benchmarks that check whether code works once, this one forces agents to repeatedly extend their own earlier solutions, with internal design choices left to the agent rather than strictly specified. The results were sobering: not a single model solved any of the 20 problems end to end, and even the best agents managed only a 17.2% completion rate at any given checkpoint.
The study tracked two specific signs of decay: verbosity and structural erosion. Duplicate code appeared in nearly 90% of generated trajectories, and complexity concentrated in critical functions in 80% of cases. When AI-generated code was compared with human-written open-source repositories over time, the gap widened with every iteration: human code stayed stable, while agent code steadily deteriorated. Attempts to counteract this by improving initial prompts showed only marginal gains and could not stop the downward slide. The findings suggest that current benchmarks fail to measure how well code holds up when extended, and that today's agents lack the discipline required for real-world software engineering, where requirements change constantly.
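To make the two decay signals concrete, here is a minimal sketch of how duplicate-code and complexity signals might be measured over a codebase. This is purely illustrative and is not the paper's methodology: the function names, the AST-based duplicate check, and the branch-count complexity proxy are all assumptions for the example.

```python
import ast
from collections import Counter

def branch_count(func: ast.FunctionDef) -> int:
    # Crude cyclomatic-style proxy (an assumption, not the paper's metric):
    # 1 plus the number of branching constructs inside the function.
    branches = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    return 1 + sum(isinstance(node, branches) for node in ast.walk(func))

def duplicate_ratio(source: str) -> float:
    # Fraction of function bodies that are exact structural duplicates
    # of another function's body (the functions' own names are ignored,
    # since only their bodies are compared).
    tree = ast.parse(source)
    bodies = [
        ast.dump(ast.Module(body=f.body, type_ignores=[]))
        for f in ast.walk(tree)
        if isinstance(f, ast.FunctionDef)
    ]
    counts = Counter(bodies)
    dupes = sum(c for c in counts.values() if c > 1)
    return dupes / len(bodies) if bodies else 0.0

# Hypothetical agent output: two structurally identical functions plus one more.
sample = """
def add_user(x):
    if x:
        return x + 1
    return 0

def add_admin(x):
    if x:
        return x + 1
    return 0

def report(x):
    for i in range(x):
        if i % 2:
            print(i)
"""

funcs = [n for n in ast.walk(ast.parse(sample)) if isinstance(n, ast.FunctionDef)]
print(round(duplicate_ratio(sample), 2))        # → 0.67 (2 of 3 bodies duplicated)
print(max(branch_count(f) for f in funcs))      # → 3 (the for + if in report)
```

Tracking metrics like these at each checkpoint, rather than only whether the tests pass, is what distinguishes this kind of evaluation from a one-shot correctness check.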
Source: SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks by Gabriel Orlanski et al., https://arxiv.org/abs/2603.24755