I Asked My AI to Fix One Line. It Rewrote the Entire Function.


GPT-5.4 scores 0.395 on edit faithfulness while Claude Opus 4.6 scores 0.060. Over-editing isn't a bug. It's a training incentive. Models are rewarded for solving the task, not for touching the least code.

ai · software-engineering · code-review · developer-tools
April 24, 2026
8 min read

Last week I asked Claude Code to fix an off-by-one error in a loop boundary. The bug was a stray -1 that shouldn't have been there. The fix was deleting two characters.

Claude deleted the two characters. Then it renamed two variables. Added input validation I didn't ask for. Restructured the return statement into an early-return pattern. And extracted a helper function that existed solely to make the "fix" look cleaner.

The diff was 34 lines. The actual fix was one. This is AI over-editing: the model fixes your bug and rewrites the function on its way out.

I caught it because I review every change. But the thing is, the code compiled. The tests passed. If I'd been vibing, I'd have accepted the whole thing without noticing that my function had been quietly rewritten by a model that decided it knew better.

Someone Finally Measured AI Over-Editing

A researcher named nrehiew published a study called "Coding Models Are Doing Too Much." It's the first rigorous attempt I've seen to actually measure what every developer using AI coding tools already feels: the model fixes your bug but half the function is different when it's done.

The methodology is straightforward. Take 400 problems from BigCodeBench. Programmatically corrupt each one with a specific, minimal bug (flip an operator, swap a boolean, change a range boundary). The ground truth fix is the exact reversal of the corruption. Nothing more. Then measure how much each model changes beyond that minimum.

The metric is token-level Levenshtein distance, normalized so you can compare across functions of different sizes. Zero means the model made exactly the minimal fix. Higher means it rewrote more than it needed to.
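The idea can be sketched in a few lines. This is an illustration of the metric, not the study's implementation: the paper's tokenizer and exact normalization scheme may differ (here I tokenize on whitespace and normalize by the longer sequence).

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over token sequences.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (a[i - 1] != b[j - 1]),   # substitution
            )
        prev = cur
    return prev[n]

def normalized_edit_distance(ground_truth: str, model_output: str) -> float:
    # Whitespace tokenization stands in for a real code tokenizer.
    gt, out = ground_truth.split(), model_output.split()
    if not gt and not out:
        return 0.0
    return levenshtein(gt, out) / max(len(gt), len(out))
```

A score of 0.0 means the model's output is token-identical to the minimal fix; anything above that is extra rewriting.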

Here's what the numbers look like:

Claude Opus 4.6 (reasoning): 0.060 Levenshtein, 0.200 Added Cognitive Complexity, 91.2% Pass@1.

GPT-5.4 (reasoning): 0.395 Levenshtein, 2.313 Added Cognitive Complexity, 72.3% Pass@1.

Read those numbers again. GPT-5.4 edits 6.5x more code than necessary to fix the same bugs. And it gets fewer of them right. The model that changes the most is also the least accurate. Claude Opus 4.6 is both the most correct and the most minimal editor tested.

The study's example is perfect: a single off-by-one error where range(len(x) - 1) should be range(len(x)). One character fix. GPT-5.4's response: add None checks, introduce np.asarray conversions with dtype=float, add finite-value masking, validate array sizes, change the curve_fit call signature, and replace the plotting logic entirely. Functionally correct. Structurally unrecognizable.
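Here's a toy version of that corruption pattern. This is my own illustrative example, not one of the study's BigCodeBench problems; the point is that the ground-truth fix is the exact reversal of the corruption and nothing else.

```python
def total_buggy(xs):
    # Corrupted version: the "- 1" silently drops the last element.
    s = 0
    for i in range(len(xs) - 1):
        s += xs[i]
    return s

def total_fixed(xs):
    # Ground-truth fix: delete the "- 1". No renames, no validation,
    # no restructuring. Every other token is unchanged.
    s = 0
    for i in range(len(xs)):
        s += xs[i]
    return s
```

Any diff larger than those two characters is over-editing by the benchmark's definition.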

Why Reasoning Models Over-Edit More

Here's the counterintuitive finding. Reasoning models, the ones we're told are "better" at coding, over-edit more than their non-reasoning counterparts. Across almost every model family tested (DeepSeek, GPT-5, GPT-5.4, Gemini 3.1 Pro, Qwen 3.6, Kimi 2.5), the reasoning variant produces larger diffs than the base model.

The explanation makes sense once you think about it. Extended reasoning gives the model more room to notice things. More room to "improve" code that wasn't broken. More room to add validation, rename variables, refactor control flow. The model isn't malfunctioning. It's doing exactly what it was trained to do: produce the best possible solution. The problem is that "best possible solution" and "minimal fix to the actual bug" are different objectives, and no major model is trained on the second one.

This is why better prompts and rules don't fully solve it. I've written about agents ignoring CLAUDE.md instructions. That post was about the symptom. This is the cause. The model's reward function doesn't care about your diff size. It cares about whether the output is correct. If the output is correct AND the function is "better" by the model's internal standard, that's a strictly higher reward than just fixing the bug.

Tests Don't Catch AI Code Rewrites

The most common pushback I hear: "Just write more tests." If the tests pass, the code is fine.

The study directly addresses this. Over-editing is invisible to test suites. The modified code is functionally correct. It passes every test. But the function you committed to your repo is no longer the function you wrote. It's a function the model wrote, with the model's variable names, the model's control flow, the model's opinions about error handling.

This is verification debt compounding in real time. LinearB found AI-generated code has 1.7x more issues per PR and gets accepted only 32.7% of the time versus 84.4% for human code. A big chunk of that review burden comes from over-editing. Reviewers can't tell which parts of the diff are the actual fix and which parts are the model's unsolicited improvements. So they have to review everything.

A commenter on the HN thread put it sharply: "The over-editing cost is asymmetric when you're solo. If an agent rewrites 50 lines when you asked it to touch 5, there's no second reviewer behind you. You're the writer and the reviewer, and the reviewer is usually the one whose attention is already depleted."

One Prompt Helps. Training Is the Real Fix.

The study tested whether prompting makes a difference. Adding "IMPORTANT: Try to preserve the original code and the logic of the original code as much as possible" to the prompt reduced over-editing across every model. Reasoning models benefited most, which makes sense: the same capacity that lets them over-think a fix also lets them follow constraints more precisely.

But prompting is a workaround, not a solution. You have to remember to include it. The model has to attend to it consistently across long sessions. Anyone who's watched Claude Code ignore an explicit CLAUDE.md rule after 30 minutes of context accumulation knows how fragile that is.

The real fix is training. The study compared four approaches: supervised fine-tuning (SFT), rejection-sampled SFT, DPO (preference optimization), and RL with an edit-aware reward.

SFT looked perfect in-domain (0.002 Levenshtein, 93.2% Pass@1) but collapsed completely out-of-domain. Pass@1 dropped to 45.8%. The model had memorized corruption patterns, not learned minimal editing. It also suffered severe catastrophic forgetting, losing 43% of its general coding ability on LiveCodeBench.

RL was the only method that generalized. It improved all three metrics (correctness, edit minimality, cognitive complexity) out-of-domain AND preserved coding ability on LiveCodeBench with zero degradation. The edit-aware reward combined functional correctness with Levenshtein-based minimality. The model learned to fix bugs without rewriting everything around them.
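The shape of such a reward is simple to sketch. This is my own minimal formulation, not the paper's: `lam` is a placeholder trade-off weight, and the study's actual reward may combine the terms differently.

```python
def edit_aware_reward(passed: bool, norm_edit_distance: float,
                      lam: float = 0.5) -> float:
    # Correctness gates the reward: a wrong fix earns nothing,
    # so the model can't trade accuracy for small diffs.
    if not passed:
        return 0.0
    # Among correct fixes, smaller normalized edit distance
    # (0.0 = the minimal fix) earns strictly more reward.
    return 1.0 - lam * norm_edit_distance
```

Under this objective, a correct minimal fix beats a correct rewrite, which is exactly the preference no correctness-only reward expresses.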

A separate paper, PRepair (arXiv:2604.05963), reached the same conclusion independently. Their Edit-Aware Group Relative Policy Optimization improved repair precision by 31.4%. Same insight: the training objective has to include edit minimality, not just correctness.

What I Actually Do About This

I'm not training models. I'm an engineer who uses them 10+ hours a day. Here's what I've learned works.

I commit after every meaningful turn. Not after every session. After every turn where the model touches code. If Claude over-edits, I can git diff immediately, see the scope creep, and revert before the next prompt builds on the wrong foundation. The HN commenter who said "committing after every turn so I can revert cheaply" has the right idea.

I use git add -p religiously. Partial staging lets me accept the one-line fix and reject the 33 lines of unsolicited refactoring. This is the single most effective practice I've found. It turns the review from "did the whole diff make sense?" into "which hunks are the actual fix?"

I put negative constraints in CLAUDE.md. Not "fix bugs" but "fix only the specific bug described. Do not rename variables. Do not add validation. Do not restructure control flow. Do not extract functions unless explicitly asked." These constraints help. They don't always hold, especially in long sessions, but they reduce the frequency by maybe 60%.
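For reference, here's the shape of the rules section I keep in CLAUDE.md. The wording is mine; there's no canonical format, and as noted above, none of it is guaranteed to hold in long sessions.

```markdown
## Editing constraints

- Fix only the specific bug described in the prompt.
- Do not rename variables, parameters, or functions.
- Do not add validation, logging, or error handling unless asked.
- Do not restructure control flow or extract helpers unless asked.
- If you believe a larger change is warranted, propose it in prose first.
```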

And I run a harness that measures diff size. If a one-line fix produces a 30-line diff, the harness flags it. Not as a failure, but as a review checkpoint. Sometimes the model found a real improvement. Usually it didn't.
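The core of that check fits in a few lines using Python's standard `difflib`. The 5x threshold factor is my own heuristic, not a number from the study:

```python
import difflib

def diff_size(before: str, after: str) -> int:
    # Count changed lines (additions + deletions) in a unified diff,
    # skipping the "---"/"+++" file headers.
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""
    )
    return sum(
        1 for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    )

def flag_overedit(before: str, after: str,
                  expected_lines: int, factor: int = 5) -> bool:
    # Flag when the diff is far larger than the fix you asked for.
    return diff_size(before, after) > expected_lines * factor
```

Wire it to `git show` output and a one-line fix that produces a 34-line diff gets flagged before it lands.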

The Uncomfortable Part

Claude Opus 4.6 is the best performer in this benchmark. I use Claude Code daily. It would be easy to declare victory and move on.

But 0.060 Levenshtein is not zero. Claude still over-edits. Less than GPT-5.4, less than Gemini, less than most models tested. But it still renames variables I didn't ask it to rename. It still adds type annotations I didn't request. It still occasionally restructures a perfectly readable if-else into a ternary because the model's aesthetic preferences differ from mine.

And the raw Opus 4.6 in a benchmark is not the same thing as Claude Code in my terminal. Claude Code wraps the model in a harness with tools, permissions, context management, and system prompts that all shape behavior. The benchmark measures the model's instinct. My daily experience includes the harness, the context window pressure, the tool call overhead, and whatever CLAUDE.md rules survived the last compaction cycle.

The deeper issue is the one nrehiew identifies at the end of the paper: over-editing is a training objective problem. Models are trained to solve tasks. They're not trained to solve tasks minimally. Until that changes at the model level, every workaround I use (git add -p, commit-per-turn, negative constraints, diff-size monitoring) is just me manually compensating for a misaligned reward function.

Another HN commenter said something that stuck with me: "Over-editing is one of the biggest tells of junior engineers."

They're right. And right now, every frontier model is a junior engineer who rewrites your function because it saw a slightly better way to write it. The fix isn't telling the junior to stop. It's training them to understand that the best change is often the smallest one.
