
I Spent 8 Hours Planning with an AI. The Code Took 75 Minutes.
Everyone says AI makes coding faster. Nobody publishes the actual hours. My planning-to-execution ratio was 6.4:1. That ratio is why the overnight build worked.
7am. Phone on the nightstand. I opened my terminal and saw 16 new commits across three repos. An Android app that didn't exist when I went to sleep was built, compiled, and ready to install. Claude Opus 4.7 had done the work overnight while I slept.
The execution took about 75 minutes of active agent time. But that number is a lie by omission. I'd spent the previous 8 hours planning, debating, and specifying what the agent should build. The planning-to-execution ratio was 6.4:1. Nobody talks about those hours.
I wrote about separating planning from execution back in February. I had no idea what the real ratio would look like.
What 8 Hours of Planning Actually Looks Like
I was building Ouija's dispatch system, a pipeline automation engine for managing AI agent work. The planning happened across 9 Build Log sessions in two days. I keep an append-only log with timestamps, decisions, and rationale. Not because I'm organized. Because Claude Code sessions hit context limits, and if the planning artifacts don't survive compaction, they're worthless.
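If you want the shape of it, here's an illustrative sketch of what a single entry carries. The field names are made up for this post, not the vault's actual schema; the point is that every decision travels with its timestamp, its rationale, and who proposed it.

```typescript
// Illustrative only: the shape of one append-only Build Log entry.
// Field names are invented for this post, not the actual vault schema.
interface BuildLogEntry {
  timestamp: string;          // when the decision was made, ISO 8601
  session: string;            // e.g. "1b" -- which planning session it belongs to
  decision: string;           // what was decided
  rationale: string;          // why, so the reasoning survives compaction
  source: "human" | "agent";  // attribution: who proposed it
}

// Append-only by convention: entries get added, never rewritten in place.
function appendEntry(log: BuildLogEntry[], entry: BuildLogEntry): void {
  log.push(entry);
}
```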
Session 1 was a 4-hour marathon. I wrote 19 vault notes: the Key (ignition prompt), the Operating Manual, Phase 1 task breakdowns, a Vision document, risk registers. No code. Just architecture.
Session 1b was a 90-minute debate round. Rex (my analysis agent) had reviewed everything I'd written and came back with 13 suggestions. I accepted 10. Rejected 3. The rejections mattered more than the acceptances because they defined boundaries: don't restructure the 5-phase spine, don't recommend specific libraries before code-time, don't inject marketing work into engineering phases.
Then came session 1c, where Rex corrected one of its own quantitative claims (R@K metric was 85%, not the 57.5% Rex had originally stated). That established a precedent: verify the AI's numbers, even when the AI is the one running analysis.
The debate rounds continued. The acceptance rates tell the story:
Round 1: 77% (10 of 13 suggestions applied).
Round 2: 58% (11 of 19, and I was filtering harder).
Round 3: 6% (1 of 16).
Most people would look at that 6% and think Rex was failing. The opposite. It meant the specification was converging. Rex's suggestions were becoming redundant because the vault was already well-specified. The Plan README went through 5 versions in a single day. The Phase 1 note grew to 28 sections. Rex called it "the most complete project specification Rex has seen."
This isn't prompt engineering. This is collaborative specification through structured disagreement. Rex proposes. I filter. The good ideas get applied with attribution. The Build Log documents every decision.
The 6.4:1 Ratio
After the first day, the ratio was clear. 8 hours of planning and debate. 75 minutes of execution (one task shipped, PR #54). That works out to 480 minutes against 75: a 6.4:1 planning-to-execution ratio.
Addy Osmani wrote about this inversion in "The 80% Problem in Agentic Coding." His observation: "The developers succeeding with this approach spend 70% of their time on problem definition and verification strategy, 30% on execution." My data is more extreme. And more honest about what the 70% actually contains.
It's not all "problem definition." A chunk of it is negotiating with your AI. Another chunk is building planning artifacts that survive context window resets. Another chunk is rejecting suggestions that feel reasonable but violate constraints the AI doesn't fully internalize. Rules are just suggestions without enforcement. Debate rounds are how you enforce them.
Here's the part nobody publishes. The ratio isn't static. It compresses.
After session 2 (first execution): 7:1. After session 2f (four execution sessions complete, four tasks shipped): 1.4:1.
The execution sessions averaged 54 minutes per task. Fast. Because every decision had already been made. The agent wasn't choosing between architectural options. It was following a blueprint I'd spent 8 hours sharpening.
The planning investment amortizes. The first task is expensive. Tasks 2 through 8 ride on the specification work that preceded them.
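Here's a minimal sketch of that amortization curve. The numbers are illustrative rather than my exact session accounting (the real ratio landed at 1.4:1 after four tasks, not 1.6:1), but the shape is the point: a fixed upfront cost divided across a growing number of tasks.

```typescript
// Illustrative amortization sketch, not my exact session accounting.
// Assumes a fixed upfront planning cost and a constant per-task execution time.
const planningMinutes = 480;  // ~8 hours of upfront planning
const minutesPerTask = 75;    // execution time, held constant for simplicity

function planningToExecutionRatio(tasksShipped: number): number {
  return planningMinutes / (tasksShipped * minutesPerTask);
}

for (const n of [1, 2, 4, 8]) {
  // 1 task: 6.4:1, 2 tasks: 3.2:1, 4 tasks: 1.6:1, 8 tasks: 0.8:1
  console.log(`${n} task(s): ${planningToExecutionRatio(n).toFixed(1)}:1`);
}
```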
Vishwas Navada K published "The Honest Math of Coding with AI Agents in Production" in February. Two developers, 86,000 lines of AI-generated code, 188 sessions, 2 months. His conclusion: "Same speed, with the effort redistributed. The writing got faster. The thinking didn't." My data confirms this. The thinking IS the work now. And the thinking is measurable if you bother to log it.
What Happens When You Don't Plan Enough
I have a control group. And it's embarrassing.
Mission Control is an Android app I built for approving agent actions from my phone. Its plan came together very differently from Ouija's. Rex reviewed it and flagged a concern I should have taken more seriously: the plan was "AI-drafted without Muhammad's iterative shaping." Ouija's vault had been through 14+ human revision cycles; Mission Control's plan was substantially agent-generated. I greenlit it with changes, but without the depth of the Ouija process.
The result worked. Sprint 1 went from plan to working phone app in about three days, with an overnight build on --dangerously-skip-permissions. But the edges showed.
The APK was 63MB. For a monitoring app with a feed and approval cards, that's bloated. Rex's LOC estimate of "~600" turned out to be 2-3K in practice. The original plan called for 35 Gradle modules (I cut that to 5 after Rex flagged it as over-engineered for a solo developer). No push notifications shipped, which means if something breaks at 3am, I don't know until morning.
Lighter planning produced a faster start and messier output. The agent optimized for correctness within its own context. It didn't optimize for size, for what I'd actually need at 3am, or for the gap between its LOC estimate and reality. Those are planning failures, not execution failures.
The METR randomized controlled trial puts numbers on this. Sixteen experienced open-source developers completed 246 tasks on their own codebases, with and without AI tools. The result: 19% slower with AI. Their self-perception: 24% faster. That's a 43-percentage-point gap between feeling productive and being productive. Skipping planning feels fast. The data says otherwise.
Scientific American covered the broader pattern in March: developers using AI are working longer hours, not shorter. Multitudes found a 27.2% increase in PRs merged alongside a 19.6% rise in out-of-hours commits. More output. Same or more work. The context infrastructure that should reduce total effort only works if you invest the upfront hours to build it.
The Uncomfortable Part
Here's my criticism of my own process.
6.4:1 is a solo developer ratio. I work alone on Ouija. One person planning, one agent debating, one agent executing. When you add a team, async communication, and merge conflicts, heavy upfront planning collides with reality. One developer who published about running 8-9 parallel AI sessions found that misdirection frequency spiked as parallelism increased. My ratio might be optimal for one. It might be catastrophic for eight.
I also can't fully defend the volume. Nineteen vault notes in one session. Five versions of the Plan README in one day. Fourteen revision cycles. Was every revision necessary? Or did the planning become its own form of productive procrastination? I wrote a 28-section Phase 1 note. Some of those sections haven't been referenced since.
And the 6.4:1 number is the startup cost, not the steady state. The ratio compressed to 1.4:1 as execution ramped. Publishing "6.4:1" without that context would be misleading. The headline is true. The full picture is more nuanced.
But the planning wasn't just documentation. It was the context infrastructure that made the overnight autonomous build possible. Without it, the agent would have succeeded at the wrong thing. Fast, confident, correct output aimed at the wrong target. I've seen that failure mode enough to know it's worse than slow delivery.
The industry keeps saying AI makes coding faster. The data, mine and everyone else's, says something different. AI makes planning more valuable. The coding IS faster. But the total time didn't shrink. It shifted. The 8 hours weren't overhead.
They were the product.
The code took 75 minutes because the plan took 8 hours. Not despite it. I still have the Build Log. Session timestamps, acceptance rates, rejection rationale, test growth from 851 to 884 across the sprint. The verification debt is real. But so is the amortization curve. Next time the ratio will be different. Better, probably. The vault already exists.
The planning is the work now. Vibe coding is the opposite of this, and I think the results speak for themselves.