research high credibility
Dark Factories

METR (Model Evaluation & Threat Research)

The most rigorous study of AI coding tool productivity in production contexts. Key findings:

  • 19% slower: Experienced developers using AI tools took longer, not shorter, to complete tasks
  • Belief gap: Even after finishing, those same developers believed AI had sped them up by 20% — wrong about both direction and magnitude
  • Economists and ML specialists independently predicted 38–39% speedups — also wrong

Methodology

  • Design: Randomized controlled trial (RCT)
  • Participants: 16 experienced open-source developers from large, mature repositories
  • Repository scale: Averaging 22,000+ GitHub stars, 1M+ lines of code
  • Developer experience: Average 5 years prior contribution to their own project
  • Tasks: 246 total issues (bug fixes, features, refactors) drawn from each developer’s own project queue
  • Tools (AI condition): Primarily Cursor Pro with Claude 3.5 / Claude 3.7 Sonnet
  • Compensation: $150/hour

The Perception Gap

Before the study, developers forecast AI would reduce completion time by 24%. After completing all tasks, they estimated AI had sped them up by 20%. The actual result was a 19% slowdown. They were wrong on both direction and magnitude.
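The size of that gap is easier to see as completion-time multipliers. This is illustrative arithmetic only (derived from the headline percentages, not the study's raw data):

```python
# Convert each headline percentage into a completion-time multiplier
# relative to the no-AI baseline (< 1.0 = faster, > 1.0 = slower).

forecast = 1 - 0.24   # pre-study forecast: 24% less time  -> 0.76x baseline
post_hoc = 1 - 0.20   # post-study estimate: 20% less time -> 0.80x baseline
actual   = 1 + 0.19   # measured result: 19% more time     -> 1.19x baseline

# Belief vs. reality: developers thought tasks took ~0.80x baseline time,
# but they actually took ~1.19x -- nearly a 1.5x misestimate.
gap = actual / post_hoc
print(round(forecast, 2), round(post_hoc, 2), round(actual, 2), round(gap, 2))
# → 0.76 0.8 1.19 1.49
```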

Why This Matters

The study is not evidence that AI tools don’t work. The authors explicitly state the results don’t generalize beyond experienced developers in complex, familiar codebases. What it captures is the J-curve: bolting AI tools onto existing workflows creates integration overhead that outweighs the speed gains from code generation.

Teams that redesign their entire workflow around AI — different ticket structures, different review processes, different meetings — see 25–30% gains. The METR study captures teams that haven’t made that transition.

The Lab vs. Production Contrast

The Microsoft/GitHub lab study found developers completed an unfamiliar JavaScript task 55% faster with Copilot. METR found experienced developers working in their own mature codebases were 19% slower. Both can be true simultaneously: AI excels at unfamiliar, scoped tasks; it creates overhead in complex, familiar systems, where the disruption to existing workflows isn’t compensated by generation speed.
