TL;DR. LLMs made the coding phase cheaper and the validation phase more expensive. Estimation models still measure only the part that got cheaper. Until you split generation effort from validation effort, your sprints will keep ending in surprise.
Two studies, the same class of tool, opposite results.
In a controlled experiment, developers using GitHub Copilot completed a JavaScript task 55.8% faster than the control group [1]. In a separate METR study, sixteen experienced developers working on day-to-day tasks were 19% slower with AI assistance. The most revealing detail: they believed they were faster [2].
The natural impulse is to figure out which study is wrong. But the question is already mis-framed.
Both studies are right. The divergence is the conclusion.
There is no shortage of data on LLM productivity. The problem is that we keep using estimation models built for a world that no longer exists, and trying to patch the mismatch with a single "average AI uplift" multiplier that means nothing without knowing what kind of task it applies to.
## Variance is the signal, not the noise
Before interpreting, the raw data:
| Study | Context | Result |
|---|---|---|
| Peng et al., 2023 [1] | Controlled greenfield JavaScript task | 55.8% faster |
| Google RCT [2] | Controlled experiment | 21% faster |
| Microsoft / Accenture [2] | 4,867 developers in production | 26% more tasks completed |
| METR [2] | 16 experienced OSS developers, real tasks | 19% slower |
The gap between +55.8% and -19% is not statistical noise. It is the single most important variable in any project estimate: the type of task.
Greenfield vs. real production work
The Copilot experiment measured a well-defined greenfield task: complete requirements, no historical context, no implicit dependencies. METR measured experienced developers working on what they work on every day: code with history, undocumented constraints, architectural decisions that exist for reasons no one remembers but that are still true.
Different contexts. Different results. Inevitably.
The direct conclusion:
- The more controlled and greenfield the task, the larger the AI gain.
- The closer it gets to real production work (legacy code, ambiguity, cross-system integration), the smaller the gain. It can turn negative.
Applying a uniform "X% faster with AI" multiplier to a project estimate is not an acceptable simplification. It is a category error.
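To make the category error concrete, here is a minimal sketch that estimates the same backlog two ways. The backlog, its hours, and the 25% blanket multiplier are invented for illustration. The per-type factors simply reuse the two endpoint results from the table as bounds, which is itself a simplification, since real tasks fall somewhere in between.

```python
# Hypothetical backlog: (task, pre-AI estimate in hours, task type).
BACKLOG = [
    ("scaffold new service", 10.0, "greenfield"),
    ("document public API", 5.0, "greenfield"),
    ("integrate with legacy billing", 30.0, "production"),
    ("cross-system schema change", 20.0, "production"),
]

# Time factors taken from the two endpoint studies, used here only as bounds:
# 55.8% faster means 0.442x the time [1]; 19% slower means 1.19x the time [2].
TIME_FACTOR = {"greenfield": 1 - 0.558, "production": 1 + 0.19}

# Assumed blanket claim of "25% faster with AI".
UNIFORM_FACTOR = 0.75


def uniform_estimate(backlog, factor):
    """Scale every task by the same multiplier."""
    return sum(hours for _, hours, _ in backlog) * factor


def per_type_estimate(backlog, factors):
    """Scale each task by the multiplier for its own task type."""
    return sum(hours * factors[kind] for _, hours, kind in backlog)


baseline = sum(hours for _, hours, _ in BACKLOG)
uniform = uniform_estimate(BACKLOG, UNIFORM_FACTOR)
per_type = per_type_estimate(BACKLOG, TIME_FACTOR)

print(f"baseline: {baseline:.2f} h")
print(f"uniform:  {uniform:.2f} h")
print(f"per type: {per_type:.2f} h")
```

With this invented mix, the blanket multiplier predicts roughly 49 hours, while the per-type calculation lands at roughly 66 hours, slightly above the 65-hour pre-AI baseline. The numbers are illustrative, but the direction of the error depends entirely on the task mix.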
A 2025 systematic review of 39 studies, published on arXiv [5], confirms this pattern. The heterogeneity between studies is not explained by the tool used or by developer experience. It is explained by task type and requirement specificity. Control for that factor and the results stop contradicting each other.
## The cost moved, it did not disappear
The paradox becomes clearer when you look at organisational metrics rather than individual ones.
Faros AI analysed data from more than 10,000 developers for the DORA 2025 report [3]. The individual metrics are impressive, and these are the numbers most AI-productivity conversations cite:
- +21% tasks completed
- +98% pull requests merged
- +66% epics per developer
Here is what happened downstream:
- PR review time: +441%
- Bugs per developer: +54%
- Incidents per pull request: +242.7%
- Tasks stalled with no activity for a week or more: +26%
The mechanism is not mysterious
LLMs generate code faster, so developers open more PRs. Those PRs are larger and more often "almost right, but not quite". 66% of developers name code that fits that description as their biggest time sink [2].
The generated code passes unit tests but contains subtle assumptions about state, external API behaviour, or system invariants the test suite does not cover. Reviewers spend more time validating code they did not write and whose reasoning they have to reconstruct. More bugs reach production. Debugging and rework absorb the time saved at generation.
DORA names this the AI Productivity Paradox: individual metrics visibly improve while organisational delivery metrics stagnate.
The savings concentrate at the start of the process. The costs accumulate in review, debugging, and stabilisation.
The implication for estimation is direct. COCOMO, story points, and most planning tools estimate the coding phase. That is exactly the phase where gains are largest and most consistent, and exactly the phase that has become least representative of total cost. Estimating only what got cheaper while ignoring what got more expensive systematically produces estimates below reality.
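The arithmetic behind that underestimate can be sketched in a few lines. The function name `projected_hours`, the 10 h / 5 h split, and both multipliers are assumptions chosen for illustration, not measured values.

```python
def projected_hours(gen_hours, val_hours, gen_factor, val_factor):
    """Return (coding-only projection, full projection) in hours.

    gen_factor and val_factor are assumed multipliers on generation and
    validation time after LLM adoption; they are not measured values.
    """
    # A coding-only model discounts generation and leaves validation untouched.
    coding_only = gen_hours * gen_factor + val_hours
    # A full model also accounts for the growth in validation work.
    full = gen_hours * gen_factor + val_hours * val_factor
    return coding_only, full


# Hypothetical task: 10 h of coding and 5 h of review/debugging before AI.
coding_only, full = projected_hours(10.0, 5.0, gen_factor=0.5, val_factor=1.8)

print(f"coding-only projection: {coding_only:.1f} h")
print(f"full projection:        {full:.1f} h")
print(f"underestimate:          {full - coding_only:.1f} h")
```

Under these assumed factors the coding-only projection says 10 hours and the full projection says 14. The gap is exactly the validation growth the model never looked at.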
This is not an argument against adopting LLMs. It is an argument against how most teams account for their effects.
## The structural problem in estimation models
Story points were designed to capture perceived difficulty and human uncertainty. COCOMO uses lines of code as a proxy for effort. Function Points measure functional complexity.
They all share an assumption:
The dominant cost of software development is the act of writing code.
That assumption no longer holds as a general rule.
A first academic framework
A 2026 paper in Frontiers in Artificial Intelligence [4] proposes the first academic framework specifically designed for effort estimation in LLM-assisted contexts. The central argument: the failure of existing models is not a calibration issue, it is structural. The numbers do not need adjusting. They are measuring the wrong dimensions.
The framework identifies five effort drivers that current models do not parameterise:
- LLM reasoning complexity: how far the output diverges from real intent.
- Context completeness: how much of the surrounding system the model can see.
- Transformation reach: how many system boundaries the change crosses.
- Refinement cycles: how many iterations before the output is usable.
- Human supervision effort: validation, correction, integration.
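One way to start using these drivers before any formula exists is simply to record them per task. The sketch below is not from the paper: the `EffortDrivers` record, the three-level scale, and the thresholds in `discussion_flags` are all hypothetical choices meant to prompt a planning conversation, not to compute an estimate.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-point ordinal scale; the paper does not prescribe one."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class EffortDrivers:
    """Per-task record of the five drivers (field scales are assumptions)."""
    reasoning_complexity: Level   # how far output may diverge from real intent
    context_completeness: Level   # how much of the surrounding system is visible
    transformation_reach: Level   # how many system boundaries the change crosses
    refinement_cycles: int        # expected iterations before output is usable
    supervision_effort: Level     # validation, correction, integration

    def discussion_flags(self):
        """Return the drivers worth raising in planning (illustrative thresholds)."""
        flags = []
        if self.reasoning_complexity == Level.HIGH:
            flags.append("reasoning_complexity")
        if self.context_completeness == Level.LOW:
            flags.append("context_completeness")
        if self.transformation_reach == Level.HIGH:
            flags.append("transformation_reach")
        if self.refinement_cycles >= 3:
            flags.append("refinement_cycles")
        if self.supervision_effort == Level.HIGH:
            flags.append("supervision_effort")
        return flags


# Hypothetical task: simple-looking change with poor context and wide reach.
task = EffortDrivers(
    reasoning_complexity=Level.LOW,
    context_completeness=Level.LOW,
    transformation_reach=Level.HIGH,
    refinement_cycles=4,
    supervision_effort=Level.HIGH,
)
print(task.discussion_flags())
```

The point of the record is visibility: a task can score low on perceived difficulty and still raise four of five flags.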
Complexity and effort have decoupled
The exploratory data from the same paper deserves attention:
- 78% of tasks historically classified as high complexity were completed with less than 25% of the expected human effort.
- 22% of tasks classified as low complexity required more than 180% of anticipated effort, almost entirely due to validation and integration overhead [4].
A task that looks simple can blow up in validation cost if the generated code touches multiple system boundaries, or if requirements are underspecified. A task that looks hard can become trivial if it maps directly to patterns the LLM has mastered.
This asymmetry is invisible to story points, invisible to COCOMO, and invisible to any model that treats effort as a function of perceived difficulty or size.
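Teams can check whether the same decoupling shows up in their own data by comparing actual to expected effort per task. The sketch below reuses the paper's two thresholds (under 25% and over 180%) as bucket boundaries. The task log and the bucket names are invented for illustration.

```python
from collections import Counter

UNDER = 0.25  # "less than 25% of the expected human effort"
OVER = 1.80   # "more than 180% of anticipated effort"


def bucket(expected_hours, actual_hours):
    """Classify a task by its actual-to-expected effort ratio."""
    if expected_hours <= 0:
        raise ValueError("expected_hours must be positive")
    ratio = actual_hours / expected_hours
    if ratio < UNDER:
        return "far_under"
    if ratio > OVER:
        return "far_over"
    return "within_band"


# Hypothetical task log: (complexity label, expected hours, actual hours).
TASK_LOG = [
    ("high", 40.0, 8.0),
    ("high", 24.0, 20.0),
    ("low", 4.0, 9.0),
    ("low", 2.0, 2.5),
]

counts = Counter((label, bucket(exp, act)) for label, exp, act in TASK_LOG)
for (label, outcome), n in sorted(counts.items()):
    print(label, outcome, n)
```

If high-complexity tasks cluster in `far_under` and low-complexity tasks in `far_over`, the complexity label has stopped predicting effort for that team.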
## What to do today
The honest answer is that the field still does not have a validated formal model. The Frontiers 2026 framework is the most rigorous attempt to date, but it deliberately stops short of parametric formulas. Those require calibration with post-LLM project data that is still being collected.
There are practical adjustments that make sense to adopt now, without waiting for that model.
1. Decompose by task type before estimating
Before any estimate, explicitly separate:
- LLM-friendly: boilerplate, scaffolding, documentation, well-specified features with complete requirements.
- LLM-hostile: legacy integration, ambiguous requirements, cross-system changes, undocumented domain.
Estimating both with the same method produces meaningless numbers.
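A minimal sketch of that split, assuming each backlog item carries four yes/no signals drawn from the two lists above. The `Task` fields and the rule "any hostile signal makes the task hostile" are assumptions; a team may reasonably weigh the signals differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical backlog item with the signals used for the split."""
    name: str
    touches_legacy: bool
    requirements_complete: bool
    crosses_systems: bool
    domain_documented: bool


def task_type(task):
    """Label a task LLM-hostile if any single hostile signal is present."""
    hostile = (
        task.touches_legacy
        or not task.requirements_complete
        or task.crosses_systems
        or not task.domain_documented
    )
    return "llm_hostile" if hostile else "llm_friendly"


def split_backlog(tasks):
    """Group task names by type so each group can be estimated separately."""
    groups = {"llm_friendly": [], "llm_hostile": []}
    for task in tasks:
        groups[task_type(task)].append(task.name)
    return groups


backlog = [
    Task("CRUD scaffolding", False, True, False, True),
    Task("billing system integration", True, False, True, False),
    Task("API docs refresh", False, True, False, True),
]
print(split_backlog(backlog))
```

The strict any-signal rule errs on the side of caution: a task has to be clean on every dimension before it is estimated with the optimistic method.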
2. Make validation overhead visible
For tasks involving legacy code or incomplete requirements, add 30 to 50% to the estimate to account for review and integration. This is not a safety margin. It captures a cost that was previously hidden in the process and that AI-assisted development makes more salient.
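Expressed as a range rather than a single number, the adjustment could look like the sketch below. The function name and the 20-hour example are hypothetical. The 30% and 50% bounds are the ones given above.

```python
def with_validation_overhead(base_hours, needs_overhead, low=0.30, high=0.50):
    """Return a (low, high) estimate range in hours.

    Tasks flagged as legacy or under-specified get 30 to 50% added for
    review and integration; other tasks are returned unchanged.
    """
    if not needs_overhead:
        return (base_hours, base_hours)
    return (base_hours * (1 + low), base_hours * (1 + high))


# Hypothetical 20-hour task touching legacy code.
low_est, high_est = with_validation_overhead(20.0, needs_overhead=True)
print(f"{low_est:.1f} to {high_est:.1f} h")
```

Reporting the range keeps the overhead visible as its own line item instead of folding it back into one padded number.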
3. Drop pre-LLM baselines
If the team's reference velocity comes from projects predating AI assistant adoption, it reflects a different effort distribution. Using it introduces systematic error. Build a new baseline with recent data.
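Rebuilding the baseline can be as simple as filtering sprint history by the adoption date. Everything concrete in this sketch is invented: the adoption date, the sprint data, and the rule of requiring at least three post-adoption sprints before trusting the average.

```python
from datetime import date
from statistics import mean

# Hypothetical date on which the team adopted AI assistants.
ADOPTION_DATE = date(2024, 9, 1)

# Hypothetical sprint history: (sprint start date, work items completed).
SPRINTS = [
    (date(2024, 6, 3), 18),
    (date(2024, 7, 1), 20),
    (date(2024, 9, 30), 27),
    (date(2024, 10, 28), 25),
    (date(2024, 11, 25), 26),
]


def baseline_throughput(sprints, since, min_sprints=3):
    """Average completed items over sprints that started on or after `since`."""
    recent = [done for start, done in sprints if start >= since]
    if len(recent) < min_sprints:
        raise ValueError("not enough post-adoption sprints for a baseline")
    return mean(recent)


all_time = mean(done for _, done in SPRINTS)
post_adoption = baseline_throughput(SPRINTS, ADOPTION_DATE)
print(f"all-history baseline:   {all_time:.1f} items/sprint")
print(f"post-adoption baseline: {post_adoption:.1f} items/sprint")
```

The minimum-sprint guard is deliberate: a baseline built from one or two sprints is noise, and raising an error is more honest than returning a number.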
4. Move to flow metrics for tracking
Cycle time, lead time, and throughput measure the full delivery cycle regardless of how code is produced. Story points measure estimated coding effort, which is becoming the least informative part of the process.
Flow metrics tell you whether you are actually delivering faster. Story points tell you less and less.
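All three metrics fall out of three timestamps per work item. The sketch below assumes one common set of definitions (lead time from creation to delivery, cycle time from work start to delivery, throughput as items delivered in a window). Teams define these slightly differently, and the work items are invented.

```python
from datetime import date
from statistics import median

# Hypothetical work items: (created, work started, delivered).
ITEMS = [
    (date(2025, 3, 3), date(2025, 3, 4), date(2025, 3, 7)),
    (date(2025, 3, 3), date(2025, 3, 10), date(2025, 3, 12)),
    (date(2025, 3, 5), date(2025, 3, 6), date(2025, 3, 14)),
]


def lead_time_days(item):
    """Days from the item being created to being delivered."""
    created, _, delivered = item
    return (delivered - created).days


def cycle_time_days(item):
    """Days from work starting to the item being delivered."""
    _, started, delivered = item
    return (delivered - started).days


def throughput(items, window_start, window_end):
    """Number of items delivered within the window, bounds inclusive."""
    return sum(1 for _, _, delivered in items
               if window_start <= delivered <= window_end)


print("median lead time:", median(lead_time_days(i) for i in ITEMS), "days")
print("median cycle time:", median(cycle_time_days(i) for i in ITEMS), "days")
print("throughput:", throughput(ITEMS, date(2025, 3, 10), date(2025, 3, 14)))
```

Because the clock runs until delivery, review queues and rework show up in these numbers whether or not anyone estimated them.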
5. Estimate conservatively
In real, diversified projects, a net gain of 20 to 30% is a defensible expectation [2]. Claims of gains above 50% in production contexts almost always derive from greenfield or controlled experiments that do not transfer.
## Conclusion
Story points are not dead. They are measuring what got cheaper while ignoring what got more expensive. Most teams have not noticed because the costs are hidden where they do not track: review queues, debugging cycles, integration rework.
The useful step now is to make the separation explicit. In every estimate, distinguish between:
- Generation effort: how long to produce the code.
- Validation effort: how long to make it correct, integrated, and sustainable.
These two numbers are moving in opposite directions with LLM adoption. Treating them as a single conflated estimate is why teams keep getting surprised at the end of every sprint.
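Keeping the two numbers apart also makes retrospectives more informative, because the variance can be attributed to one side or the other. A minimal sketch, with a hypothetical `SplitEffort` record and invented hours:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitEffort:
    """Effort for one task, kept as two numbers instead of one."""
    generation_hours: float  # producing the code
    validation_hours: float  # making it correct, integrated, and sustainable

    @property
    def total(self):
        return self.generation_hours + self.validation_hours


def variance(estimate, actual):
    """Actual minus estimate, per component and in total (hours)."""
    return {
        "generation": actual.generation_hours - estimate.generation_hours,
        "validation": actual.validation_hours - estimate.validation_hours,
        "total": actual.total - estimate.total,
    }


# Hypothetical task: generation came in under, validation well over.
estimate = SplitEffort(generation_hours=8.0, validation_hours=4.0)
actual = SplitEffort(generation_hours=3.0, validation_hours=12.0)
print(variance(estimate, actual))
```

In this invented case the single-number view shows a modest 3-hour overrun. The split view shows a 5-hour saving on generation hidden behind an 8-hour overrun on validation, which is the more useful thing to know when planning the next sprint.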
If you are using the same estimation process you used in 2022, it is not that everything is fine. It is that you have not yet measured what is changing.
## References
[1] Peng et al., 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv:2302.06590. https://arxiv.org/abs/2302.06590
[2] Addyo, 2025. The Reality of AI-Assisted Software Engineering Productivity. https://addyo.substack.com/p/the-reality-of-ai-assisted-software
[3] Faros AI / DORA Report 2025. Key Takeaways from the DORA Report 2025. https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025
[4] Frontiers in Artificial Intelligence, 2026. Toward LLM-aware software effort estimation: a conceptual framework. https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2026.1772418/full
[5] arXiv, 2025. The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Review and Mapping Study. arXiv:2507.03156. https://arxiv.org/html/2507.03156v2