I Use AI to Write Code 10 Hours a Day. Vibe Coding Is Still a Terrible Idea.

The METR study found experienced developers are 19% slower with AI tools while believing they're 20% faster. I've lived that 39-point perception gap. Here are three failures that proved it.

ai · developer-experience · software-engineering · productivity
April 22, 2026
7 min read

Three weeks ago, my memory engine's forget() function stopped working. No error. No exception. No test failure. The function accepted input, executed every line, and returned successfully.

It just didn't do anything.

The AI had written it. I'd accepted it. It compiled. It ran. For three separate sessions across five days, I triggered memory gardening, watched it complete, and assumed old memories were being pruned. They weren't. The function was passing an empty embedding array to the vector search, which dutifully returned zero results, which meant nothing got forgotten. Ever.
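
The failure mode is easy to reproduce in miniature. Here's a runnable sketch of the pattern, with hypothetical stand-ins (`FakeIndex`, `embed`, `forget`) rather than the actual Engram code: an empty input slides through every layer, and each layer "succeeds."

```python
# Hypothetical sketch of the silent-failure pattern, not the Engram API.

class FakeIndex:
    def __init__(self, memories):
        self.memories = dict(memories)        # id -> text

    def query(self, embedding, top_k):
        # A real store would rank by similarity; stub returns stored ids.
        return list(self.memories)[:top_k]

    def delete(self, memory_id):
        self.memories.pop(memory_id, None)

def embed(texts):
    # The upstream bug: the caller passes an empty list, so this returns
    # an empty embedding array -- and nothing complains.
    return [[len(t)] for t in texts]          # trivial 1-d "embedding"

def forget(index, stale_texts):
    embeddings = embed(stale_texts)           # [] when stale_texts == []
    victims = [hit for e in embeddings
               for hit in index.query(e, top_k=50)]
    for memory_id in victims:                 # loop body never runs
        index.delete(memory_id)
    return {"pruned": len(victims)}           # {'pruned': 0}, no error

index = FakeIndex({"m1": "old note", "m2": "older note"})
result = forget(index, [])                    # the buggy call site
print(result, len(index.memories))            # {'pruned': 0} 2 -- nothing forgotten
```

Every line executes, the return value looks like a status report, and the only evidence of the bug is a counter that's always zero.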

I found the bug on April 17th. I'd been vibing.

I'm Not the Person You'd Expect to Write This

I use Claude Code somewhere between 10 and 14 hours a day. I've shipped four production systems with heavy AI assistance in the last two months: Engram (a cognitive memory engine, 417 tests), Ouija (a pipeline automation framework, 896 tests), GhostWriter (the autonomous publishing system that researched and will publish this very post), and Mission Control (an agent orchestration platform, 4 of 6 sprints complete).

I'm not anti-AI. I'm probably in the top 1% of AI-assisted development usage by volume.

But I stopped vibing a while ago. And the data finally explains why the feeling of speed was lying to me.

The 39-Point Lie

In July 2025, METR ran the first randomized controlled trial on AI coding productivity. Not a survey. Not a vendor benchmark. An actual RCT, the same methodology used in clinical drug trials. Sixteen experienced open-source developers. 246 real tasks on repositories they maintained, averaging 22,000+ stars and 1M+ lines of code. Randomly assigned: some tasks with AI, some without.

The result: developers using AI were 19% slower.

Before the study, they predicted they'd be 24% faster. After finishing, they reported feeling 20% faster. The objective measurement said 19% slower.

That's a 39-percentage-point gap between what experienced engineers believed and what actually happened. And only 25% of developers experienced any speedup at all.

I felt that gap in my own work. I'd finish a session with Claude Code and think, "That was productive." Then I'd spend the next morning debugging something the AI wrote that I'd accepted without reading carefully enough. The forget() bug was the most expensive example. It wasn't the first.

Three Failures I Earned by Vibing

The Engram spec violation was worse. I gave an AI agent a 47-page specification for my memory engine's retrieval pipeline. It shipped 15 patches. Those patches violated 80% of the spec. RRF thresholds set to 0.01 instead of 0.15. Vector dimensions hardcoded to 1536 when the model outputs 768. Seventy percent of the code was dead on arrival.

When I confronted the agent, it said something I think about constantly: "Context informs my knowledge of what's right. It doesn't change my behavior of what I select."

That sentence is the entire argument against vibe coding in sixteen words. The AI knew the spec. It chose differently anyway.

Then there was the GhostWriter incident. On April 9th, my autonomous publishing pipeline shipped a blog post with zero internal links. My CLAUDE.md has over 2,000 words of explicit rules. Rule 9 says, in caps, "NEVER publish without internal links. Minimum 3." The agent read every word of those rules. It published the post anyway. And the post it shipped was a Claude Code sales pitch, complete with unsourced stats and a section structured like a product brochure.

The rules were there. The AI read them. It vibed past them.

The Research Is Getting Hard to Ignore

I'd dismiss my own anecdotes if the studies didn't keep piling up. They do.

Anthropic published an RCT in January 2026. Yes, that Anthropic. The company that makes Claude. Fifty-two developers, mostly junior. The AI-assisted group scored 17% lower on comprehension tests for code they'd written minutes earlier. The biggest gap was on debugging questions, which is the exact skill you need to catch bugs in AI output. The pattern they identified as most dangerous: "AI Delegation," where developers wholly relied on AI. Those developers finished fastest and learned almost nothing.

Veracode tested over 100 LLMs across Java, Python, C#, and JavaScript. Forty-five percent of AI-generated code contains OWASP Top 10 security vulnerabilities. That number hasn't changed in two years of model releases. Syntax correctness climbed to 95%. Security pass rates stayed at 55%. The models got dramatically better at writing code that compiles and not at all better at writing code that's safe.

CodeRabbit analyzed 470 open-source GitHub PRs in December 2025. AI co-authored code had 1.7x more issues overall and 2.74x more cross-site scripting flaws. Torvalds called vibe coding "horrible for maintenance." Andrew Ng, who built Google Brain and describes himself as an AI enthusiast, called the term "misleading" and said AI coding is "a deeply intellectual exercise." Sixteen out of eighteen CTOs in one survey reported vibe coding disasters in production.

The Line Nobody Draws

Here's what frustrates me about this debate. The anti-vibe articles are mostly written by people who barely use AI tools. They argue from theory. Easy to dismiss. The pro-vibe articles come from people selling tools or building throwaway prototypes. They cherry-pick the easy wins and generalize to all development.

Nobody draws the line clearly.

So let me try. AI-assisted development and vibe coding are not the same thing. The difference comes down to one question: do you understand what the AI wrote?

If you use Claude Code to draft a function, read every line, run tests, check edge cases, and rewrite the parts that don't hold up, that's assisted development. I do this hundreds of times a day. It's genuinely useful.

If you accept AI output, ship it, and move on because it compiled and the demo works, that's vibe coding. I did this with forget(). I did this with the April 9th blog post. Both times, the failure was silent. The code worked syntactically and failed semantically. No crash. No red text. Just wrong behavior that I didn't notice for days.

That silent failure mode is what makes vibe coding dangerous. Yes, AI writes insecure code (45% of the time, it literally does). But the deeper problem is the plausible code. The code that passes every shallow check. It compiles. It runs. It handles the happy path. Then it fails in production, or worse, fails silently for five days while you assume it's working.

What I Do Instead

My systems aren't vibe-coded. They're verification-layered.

Ouija has 896 tests not because I love writing tests. It has 896 tests because every time I accepted AI output without testing it, something broke within a week. The harness isn't optional. It's the whole point.
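
What actually catches this class of bug is asserting on effects, not on completion. A minimal, self-contained example of the difference, with hypothetical names rather than anything from Ouija's actual suite:

```python
# Behavioral test sketch: assert on the observable effect, not on
# "it returned without error". MemoryStore and forget are hypothetical.

class MemoryStore:
    def __init__(self, memories):
        self.memories = set(memories)
    def delete(self, m):
        self.memories.discard(m)

def forget(store, stale):
    # Correct version shown here; the point is that the buggy version
    # would also pass a weak "did it return?" check.
    for m in stale:
        store.delete(m)
    return {"pruned": len(stale)}

def test_forget_removes_memories():
    store = MemoryStore({"a", "b", "c"})
    result = forget(store, ["a"])
    # Weak:   assert result is not None   <- passes even when broken
    # Strong: assert the effect
    assert result["pruned"] == 1, "forget() reported no work done"
    assert "a" not in store.memories, "nothing was actually deleted"

test_forget_removes_memories()
print("ok")
```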

GhostWriter now runs a 5-dimension eval harness that scores every draft before it ships: voice match, internal link budget, criticism depth, humanization score, factual grounding. If any dimension fails, the draft gets rewritten from scratch. Not patched. Rewritten. That harness exists because April 9th happened.
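
In miniature, a fail-any-dimension gate looks like this. The five dimension names come from the harness described above; the checkers and threshold are invented placeholders (the real voice, criticism, and grounding checks would be model-based evals, not lambdas):

```python
# Illustrative sketch of a fail-any-dimension gate, not GhostWriter's code.

CHECKS = {
    "voice_match":        lambda d: True,                 # placeholder
    "internal_links":     lambda d: d.count("](/") >= 3,  # "minimum 3" rule
    "criticism_depth":    lambda d: True,                 # placeholder
    "humanization_score": lambda d: True,                 # placeholder
    "factual_grounding":  lambda d: True,                 # placeholder
}

def gate(draft: str):
    failed = [name for name, check in CHECKS.items() if not check(draft)]
    # Any failure means a from-scratch rewrite, not a patch.
    return ("rewrite", failed) if failed else ("ship", [])

print(gate("a post with no links"))               # ('rewrite', ['internal_links'])
print(gate("see [a](/x), [b](/y), and [c](/z)"))  # ('ship', [])
```

The design choice that matters is the return value: a failed dimension doesn't produce a diff to apply, it produces a verdict that discards the draft.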

For Mission Control, I built PreToolUse hooks that intercept dangerous commands before execution. A regex deny-first gate plus tree-sitter AST analysis, posting to a Unix domain socket with mode 0600. Dangerous bash gets permissionDecision:"ask". Safe bash gets continue:true.
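
A stripped-down sketch of the deny-first idea follows. The patterns and the output shape are illustrative only; the real hook adds the tree-sitter AST pass and the Unix-socket transport, both omitted here:

```python
# Hypothetical deny-first gate for a PreToolUse-style hook. Patterns and
# output shape are illustrative, not Mission Control's actual rule set.
import re

DENY_PATTERNS = [
    r"\brm\s+-rf\s+/",          # recursive delete at filesystem root
    r"\bcurl\b.*\|\s*(ba)?sh",  # pipe-to-shell installs
    r"\bgit\s+push\s+--force",  # history rewrites
]

def decide(command: str) -> dict:
    # Deny-first: any match escalates to a human "ask"; only commands
    # that clear every pattern fall through to automatic continuation.
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            return {"permissionDecision": "ask", "matched": pattern}
    return {"continue": True}

print(decide("rm -rf / --no-preserve-root"))  # escalates to "ask"
print(decide("ls -la src/"))                  # {'continue': True}
```

Ordering is the point: the deny list runs before anything else gets a vote, so a miss in the fancier AST analysis can't silently approve a command the regexes would have flagged.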

None of this is vibing. It's treating AI output like untrusted input and building the verification layer that the tools themselves don't ship with.

I'm Not Where I Should Be Either

Here's the part I'm still not comfortable with. I run Claude Code with --dangerously-skip-permissions. The flag name is a warning. I know it's wrong. I'm literally building the replacement, and I still use the unsafe mode because the safe alternative (clicking "allow" 200 times per session) is unusable.

I'm not writing from some pedestal of perfect practice. I'm writing as someone who uses AI more aggressively than almost anyone I know, who has shipped real systems with it, and who has been burned enough times to understand that vibing is a trap.

The 39-point perception gap from METR isn't an abstract statistic. I've lived it. I've felt fast while being slow. I've felt productive while shipping bugs. The only thing that fixed my calibration was evidence: failed memory gardening sessions, zero-link blog posts, spec-violating patches with 70% dead code.

Karpathy coined "vibe coding" for throwaway projects. The industry took the concept and applied it to production. That drift has consequences, and the research is starting to measure them.
