The time software developers spend dealing with the quality and risk issues spawned by LLMs has not made them faster.
The first wave of AI adoption in software development was about productivity. For the past few
years, AI has felt like a magic trick for software developers: We ask a question, and seemingly
perfect code appears. The productivity gains are undeniable, and a generation of developers is
now growing up with an AI assistant as their constant companion. This is a huge leap forward in
the software development world, and it’s here to stay.
The next — and far more critical — wave will be about managing risk. While developers have
embraced large language models (LLMs) for their remarkable ability to solve coding challenges,
it’s time for a conversation about the quality, security, and long-term cost of the code these
models produce. The challenge is no longer about getting AI to write code that works. It’s about
ensuring AI writes code that lasts.
And so far, the time developers have spent dealing with the quality and risk issues spawned
by LLMs has not made them faster. In fact, it has slowed their overall work by nearly 20%,
according to research from METR.
The first and most widespread risk of the current AI approach is the creation of massive,
long-term technical debt in code quality. The industry’s focus on performance benchmarks
incentivizes models to find a correct answer at any cost, regardless of the quality of the code
itself. While models can achieve high pass rates on functional tests, those scores say nothing
about the code’s structure or maintainability.
In fact, a deep analysis of their output in our research report, “The Coding Personalities of
Leading LLMs,” shows that for every model, over 90% of the issues found were “code smells” —
the raw material of technical debt. These aren’t functional bugs but indicators of poor
structure and high complexity that lead to a higher total cost of ownership.
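To make the distinction concrete, here is a hypothetical sketch (not taken from the report) of code that passes a functional test yet would be flagged by a static analyzer: the first version is correct but relies on deep nesting and magic numbers, the kind of structural smell that raises the cost of every future change.

```python
# Hypothetical illustration: both functions return the same answers,
# so a functional benchmark scores them identically. Only one is
# cheap to maintain.

def shipping_cost(weight, express, international):
    # Smell: deeply nested conditionals instead of early returns,
    # and unexplained "magic number" rates scattered through branches.
    if weight > 0:
        if international:
            if express:
                return weight * 12.5
            else:
                return weight * 7.5
        else:
            if express:
                return weight * 5.0
            else:
                return weight * 2.5
    else:
        return 0.0

# A cleaner equivalent: named rate table, single early return.
RATES = {
    (True, True): 12.5,   # (international, express)
    (True, False): 7.5,
    (False, True): 5.0,
    (False, False): 2.5,
}

def shipping_cost_clean(weight, express, international):
    if weight <= 0:
        return 0.0
    return weight * RATES[(international, express)]
```

Both versions would earn the same benchmark pass rate; only a quality analysis distinguishes them.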
For some models, the most common issue is leaving behind “Dead/unused/redundant code,”
which can account for over 42% of their quality problems.
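For readers unfamiliar with the category, the following contrived snippet (an assumption for illustration, not model output from the report) shows what “dead/unused/redundant code” typically looks like: the function works, but it carries baggage an analyzer would flag.

```python
import math  # Smell: unused import — nothing below references it

def circle_area(radius):
    pi = 3.14159           # Smell: redundant local re-definition of math.pi
    unused_cache = {}      # Smell: variable assigned but never read
    if radius < 0:
        return 0.0
    return pi * radius * radius
    print("done")          # Smell: unreachable statement after return
```

Each line of this dead weight is harmless in isolation, which is exactly why it accumulates unnoticed into technical debt.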