The Productivity Panic Is Coming for Senior Engineers Next

Three waves of productivity metrics. Story points, DORA, AI output. Each one optimized for a proxy until it broke. Each one made senior engineers less visible.

engineering-leadership · productivity · career · metrics
May 2, 2026
6 min read

Three metrics in nine years. Each one promised to fix the last. Each one made the same mistake.

I sat in a sprint review last year where the velocity chart showed 40% quarter-over-quarter growth. The VP of Engineering was thrilled. She pulled it up on the big screen, and I watched six senior engineers exchange the same look. We all knew. The numbers were real. The velocity was up. And the team had shipped exactly one feature customers actually asked for.

Nobody said anything. The chart was going up. That's what mattered.

I've been writing production software for nine years. In that time I've watched the industry cycle through three waves of productivity measurement. Each wave arrived with research to back it up, optimized for a proxy until that proxy broke, and progressively made the work senior engineers actually do less visible.

Wave 1: Story Points

Story points were supposed to fix time estimation. Instead of guessing hours (and being consistently wrong), teams would compare tasks to each other. Relative sizing. A login page is a 3, a payment integration is an 8.

It worked for about six months. Then the chart showed up.

The moment velocity appeared in a sprint review slide deck, every estimate in the room went up by 2 points. Nobody agreed to this. Nobody sent an email. It just happened. Last quarter's 3 became this quarter's 5. Velocity went up. Everyone celebrated. Nothing changed.
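The arithmetic of that drift is worth spelling out. Here's a toy sketch (the tasks and point values are invented, not from any real backlog): identical work two quarters in a row, every estimate nudged up by 2, and velocity "improves" without a single extra feature shipping.

```python
# Toy illustration of estimate inflation: the tasks are identical
# across quarters; only the story points attached to them change.
last_quarter = {"login page": 3, "payment integration": 8, "bug triage": 2}
this_quarter = {task: points + 2 for task, points in last_quarter.items()}

velocity_before = sum(last_quarter.values())  # 13
velocity_after = sum(this_quarter.values())   # 19

growth = 100 * (velocity_after - velocity_before) / velocity_before
print(f"velocity: {velocity_before} -> {velocity_after} (+{growth:.0f}%), same work shipped")
```

A 46% velocity gain, and not one additional unit of work behind it.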

In the 2025 DORA report, 42% of teams admitted to manipulating velocity metrics when they were tied to performance reviews. I'm surprised the number is that low.

I watched teams break one feature into four sub-tasks to inflate the count. Sat through planning poker sessions where the first person to speak set the anchor and everyone followed. The Fibonacci scale gave the theater of estimation a mathematical costume, but the numbers were fiction.

Here's what nobody talks about: the senior engineer who spent three days investigating a production memory leak and fixed it with a one-line config change? One story point. Maybe two. The junior who scaffolded a CRUD page from a template? Five points. The dashboard rewarded the scaffolding.

Wave 2: DORA Metrics

DORA metrics arrived to fix the mess. Deployment frequency. Lead time for changes. Change failure rate. Mean time to recovery. Four signals backed by real research from Nicole Forsgren's "Accelerate."
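For concreteness, the four signals reduce to simple arithmetic over deploy records. A minimal sketch, assuming you already capture merge, deploy, failure, and recovery timestamps (the field and function names here are invented, not from any DORA tooling):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Deploy:
    merged_at: datetime      # when the change was merged
    deployed_at: datetime    # when it reached production
    failed: bool             # did it degrade production?
    restored_at: datetime | None = None  # when service recovered, if it failed

def dora_signals(deploys: list[Deploy], window_days: int) -> dict:
    """Reduce a window of deploy records to the four DORA signals."""
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Deployment frequency: deploys per day across the window.
        "deploy_frequency": len(deploys) / window_days,
        # Lead time for changes: merge -> production, median.
        "lead_time_for_changes": median(d.deployed_at - d.merged_at for d in deploys),
        # Change failure rate: share of deploys that degraded production.
        "change_failure_rate": len(failures) / len(deploys),
        # Time to restore: failure -> recovery, median (None if no failures).
        "time_to_restore": median(restores) if restores else None,
    }
```

Note that nothing in this arithmetic can tell a real release from a whitespace fix. Hold that thought.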

Better than story points. For about two years.

Then Goodhart's Law came for them too. Teams started breaking deployments into micro-releases. A whitespace fix. A comment update. A config tweak. Deployment frequency looked great on paper. But the DORA team's own 2025 research found that the "fail fast, fix fast" hypothesis they'd expected to validate with AI-assisted development actually failed. Software delivery instability kept rising even as deployment frequency climbed.

JetBrains surveyed 24,534 developers and found 66% don't believe current metrics accurately reflect their work. Since AI entered the picture, 73% of engineering managers have called these metrics "unreliable or actively misleading."

I spent two weeks last month auditing 269 database tables at work. Nine analytical lenses. A 5,782-line report. Fifteen findings, five of them critical, including cross-tenant data leaks hiding behind a DEFAULT '1' on 95 columns. On every productivity dashboard, I vanished for those two weeks. No PRs. No deployments. No velocity signal at all. The fact that I probably prevented the company's first multi-tenant data breach doesn't register in Jira, in DORA, or in any standup summary.
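To make that finding concrete: a tenant-scoping column declared with a literal default means any INSERT that forgets to set it silently files the row under tenant 1 instead of failing loudly. Here's a minimal sketch of the kind of check that surfaces those columns; it assumes Postgres, the psycopg 3 driver, and a tenant-naming convention, none of which describe my actual audit tooling:

```python
import psycopg  # assumes the psycopg 3 driver

# Columns whose names suggest they scope rows to a tenant (an assumption).
TENANT_COLUMN_PATTERN = "%tenant%"

def risky_tenant_defaults(conninfo: str) -> list[tuple[str, str, str]]:
    """Find tenant-scoping columns that carry a literal default.

    A literal default (e.g. DEFAULT '1') means an INSERT that omits the
    tenant id lands in tenant 1 instead of erroring, which is how
    cross-tenant leaks hide in plain sight.
    """
    query = """
        SELECT table_name, column_name, column_default
        FROM information_schema.columns
        WHERE table_schema = 'public'
          AND column_name ILIKE %s
          AND column_default IS NOT NULL
        ORDER BY table_name
    """
    with psycopg.connect(conninfo) as conn:
        return conn.execute(query, (TENANT_COLUMN_PATTERN,)).fetchall()

for table, column, default in risky_tenant_defaults("dbname=app"):
    print(f"{table}.{column} defaults to {default!r} -- verify every INSERT sets it")
```

The remediation is usually to drop the default so unscoped writes fail loudly. The detection is a few seconds of SQL that no productivity dashboard will ever see.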

That's the kind of work senior engineers do. And it's been invisible to the economics of how we measure software since before I started my career.

Wave 3: AI Output

Now we're in the third wave. PR counts. Lines generated. Tasks completed. The AI productivity metrics.

The headline numbers look staggering. Faros AI measured over 10,000 developers and found 98% more PRs merged, 21% more tasks completed. My own numbers track. I shipped 27 commits in one week last April, 10 versions across two repos, 22 PRs merged.

But the numbers hide more than they show.

Simon Willison, the co-creator of Django, said something on Lenny's Podcast that stopped me cold: "Using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it's mentally exhausting. I can fire up four agents in parallel. By 11 a.m., I am wiped out for the day."

That's a 25-year veteran describing the compulsion loop that makes coding agents addictive. The tool is faster, but the cognitive load is higher. Nobody measures cognitive load.

The METR randomized controlled trial found experienced developers were 19% slower with AI tools on complex tasks. Not faster. Slower. Yet they perceived themselves as faster: even after finishing, they estimated AI had sped them up by about 20%, a 39-point gap between how productive they felt and how productive they were. AI didn't just make output easier to produce. It made the engineering around that output harder.

Organizations keep counting PRs anyway. Stack Overflow's 2026 survey found 72% of developers believe their team games metrics in some way. 55% reported metrics being used punitively. Trust in AI code accuracy dropped from 40% to 29% in a single year.

My 27-commit week? Some of those commits were hygiene debt that built up during a feature sprint. Work I could have avoided with better planning upfront. The most valuable day I had that month was an 8-hour planning session that produced 75 minutes of code. That session prevented three architectural mistakes the following week. No AI output metric would have captured it. Most would have scored it zero.

The Pattern

Three waves. Same mistake.

Story points measured estimation instead of judgment. DORA measured delivery speed instead of whether the delivery was valuable. AI output metrics measure generated code instead of the decisions that shaped what to generate.

Each wave makes the same things invisible: mentoring, investigation, architecture reviews, the three-day debugging session, the planning day, the audit that catches cross-tenant data leaks. The role has shifted from writing to editing, from producing artifacts to making decisions about which artifacts matter. Every metric still counts the artifacts.

Senior engineers do more of this judgment work than anyone else on the team. They also produce the fewest measurable outputs.

That's why the productivity panic is coming for them next. Not because they're unproductive. Because the metrics defining "productive" don't measure what they do. The job didn't go extinct. It just became invisible to the people holding the dashboards.

I'm Part of the Problem

I should be honest. I run a publishing pipeline that ships a blog post nearly every day. That's a velocity metric applied to content. I track eval harness scores on my own writing. I celebrated that 27-commit week.

I didn't expect to find myself on every side of this argument.

Metrics aren't the enemy. I don't think we should stop measuring. But three waves of evidence keep saying the same thing: reducing engineering judgment to a number ends with the number going up, the judgment going down, and the people doing the hardest thinking becoming the least visible.

The fix isn't a better metric. It's deciding what you refuse to reduce to one.
