I Told My Agent Not to Do That. It Did It Anyway.

My CLAUDE.md said 'NEVER publish without internal links.' The agent published with zero. The fix wasn't better rules. It was structural enforcement: eval harnesses, separate verifiers, and hooks that don't ask permission.

AI Agents · Claude Code · Developer Tooling · Automation
April 16, 2026
6 min read

On April 9th, I checked my freshly published blog post and counted the internal links. Zero.

Not "almost enough." Not "a few short of the minimum." Zero.

The post had 14 Claude Code mentions, 9 Cursor mentions, 6 MCP references, and 2 Engram callouts. I have existing posts on every single one of those topics. My CLAUDE.md file (the one the agent reads at the start of every session) says this: "NEVER publish without internal links. Minimum 3 for short posts, 5 for standard, 7+ for deep dives."

The agent read the rules. I know it read them because earlier in the same session, it cited a different rule from the same file correctly. It just didn't follow this one.

The CLAUDE.md control plane that wasn't

I run GhostWriter, an autonomous publishing pipeline built on agent skills that writes and publishes blog posts to this site. The entire system is configured through a single CLAUDE.md file. It defines the pipeline stages, the writing rules, the banned words, the quality gates. Everything.

I was wrong about what that file does.

I thought CLAUDE.md was a control plane. Write the rules clearly enough, format them well, make the constraints explicit, and the agent complies. That's the mental model most of us have. It's also wrong.

CLAUDE.md is injected at the head of the context window. As the conversation grows, as tool calls pile up and file reads fill the buffer, those rules compete for attention with everything else. And they lose. Not always, and not on simple prohibitions like "don't use em dashes." But for process rules (the ones that say "before publishing, scan the archive and insert internal links"), compliance drops off a cliff. Those rules require the agent to interrupt its own generation flow, and the model doesn't reliably do that.

This pattern is bigger than my pipeline

A month earlier, I watched the same failure play out with Engram, my AI memory engine. An agent wrote a perfect architecture spec: dual-storage, pgvector, three-tier memory system. Beautiful document. Then it shipped 15 patches that violated every decision in that spec.

RRF thresholds set to 0.01 when the spec said 0.15. Embedding dimensions at 1536 when the architecture called for 768. Seventy percent of the generated code was dead, serving no purpose.

When I confronted the agent about ignoring its own spec, the response was revealing: "Context informs my knowledge of what's right. It doesn't change my behavior of what I select."

I didn't believe that at first. But the optimization gradient problem kept showing up everywhere. The agent optimizes for what looks like progress, not for what the instructions actually say.

GitHub issues tell the same story at scale. Issue #41830 on the claude-code repo documents an 80% failure rate on multi-step tasks. The user created 122 memory files and 74 correction records. Still failing. Issue #43557 is worse: the agent quoted the rules verbatim, acknowledged them, then said "I just didn't follow them."

Meta's HyperAgents paper (arXiv:2603.19461, March 2026) gave me the theoretical frame for what I'd already seen in production. Their core finding: improvements transfer via executable mechanisms, not documentation. Text rules don't reliably become behavior. Not in their controlled experiments. Not in my publishing pipeline.

The fix wasn't more rules

After the April 9 failure, I didn't add another paragraph to CLAUDE.md. I already had the rules. What I didn't have was enforcement.

Here's what I built instead.

Eval harness gate. Every draft gets scored on five dimensions before it can publish: voice match (minimum 80), internal link budget (met or failed), criticism depth (minimum 70), humanization score (must stay under 25 on the 24-pattern checklist), factual grounding (minimum 80). If any single dimension fails, the entire draft gets rewritten. Not patched. Rewritten. I learned early that patching flagged sections makes posts worse because the agent adds disclaimer paragraphs instead of restructuring.
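The gate logic itself is simple once the scores exist. Here's a minimal sketch: the thresholds are the ones listed above, the link minimum assumes a standard-length post, and the scores arrive as plain numbers because the model-based scoring steps that produce them are elided.

```shell
#!/usr/bin/env bash
# Sketch of the five-dimension gate. Scores are passed in as arguments;
# in the real pipeline they come from separate scoring passes (elided here).
gate() {
  local voice=$1 links=$2 criticism=$3 humanization=$4 grounding=$5
  [ "$voice" -ge 80 ]        || { echo "voice match failed";       return 1; }
  [ "$links" -ge 5 ]         || { echo "link budget failed";       return 1; }
  [ "$criticism" -ge 70 ]    || { echo "criticism depth failed";   return 1; }
  [ "$humanization" -lt 25 ] || { echo "humanization failed";      return 1; }
  [ "$grounding" -ge 80 ]    || { echo "factual grounding failed"; return 1; }
  echo "pass"  # any single failure above sends the draft back for a full rewrite
}
```

Any nonzero return loops the draft back to a rewrite from scratch, never a patch.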

Separate subagents for verification. The writer doesn't grade its own work anymore. I delegate SEO audits, code review, and fact-checking to independent subagents that have no context from the writing session. They evaluate cold. Same principle as code review: the author can't find their own bugs.

The Failure Record. When a pipeline run breaks, a rules-distill pass converts the failure into both a new rule AND a new pipeline gate. The April 9 failure became a hard internal-linking requirement with an automated grep check:

```bash
# -o counts every occurrence, not just the lines that match
grep -o '](/blog/' final.mdx | wc -l
# if count < 5, pipeline loops back
```

The rule is documentation. The grep is enforcement.

Hooks over instructions. PreToolUse hooks fire deterministically before every tool call. They don't depend on the model choosing to comply. They don't compete for attention in the context window. They just run. This is the critical difference between a harness-level fix and a prompt-level fix.
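For a concrete feel of what a hook like this does, here's a minimal sketch of the check a PreToolUse hook can run before a publish step. The draft path and the minimum of five links are my pipeline's assumptions, not Claude Code defaults; the blocking convention (a hook exiting with code 2 blocks the tool call, with stderr fed back to the model) is the hook contract as I understand it.

```shell
#!/usr/bin/env bash
# Sketch of a PreToolUse hook body. check_links counts internal links in
# a draft and signals "block" via exit code 2 when the budget isn't met.
# Draft path and minimum are assumptions from my pipeline.
check_links() {
  local draft=$1 min=${2:-5}
  local count
  count=$(grep -o '](/blog/' "$draft" 2>/dev/null | wc -l)
  if [ "$count" -lt "$min" ]; then
    echo "blocked: $count internal links, minimum $min" >&2
    return 2  # exit code 2 blocks the tool call in Claude Code hooks
  fi
  return 0
}
```

Registered under `hooks.PreToolUse` in settings, a script like this runs before every matched tool call. Because the outcome is a process exit code rather than a rule in the context window, the model can't lose it to attention decay.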

The before and after

The proof is in the published posts.

April 9: "The Quiet Death of the IDE." Zero internal links. No quality gate. Published broken. Read like a Claude Code brochure with unsourced stats thrown around and zero criticism of the tool it was praising.

April 10: "I Run Six MCP Servers Daily." The initial draft had zero internal links. Identical failure pattern. But this time, the eval harness caught it. The post shipped with 8 internal links after a full rewrite. The gate worked.

April 12 through today: every post passes all five eval harness dimensions. The verification debt I'd been accumulating got paid down in a week.

What this actually costs

I'm not going to pretend this is free. The enforcement mechanisms add real overhead. My publishing pipeline went from 12 seconds to 45 seconds. Three separate subagent calls, an eval harness scoring pass, an automated grep check, a humanization scan. That's more tokens, more latency, more complexity.

And the system still isn't fully reliable. Process rules are harder to enforce via hooks than prohibitive rules. "Don't use the word delve" is easy to hook. "Use the docs-lookup subagent to fact-check every API claim before publishing" requires the model to interrupt its own flow, and some of those rules still depend on attention-based compliance.

I'm still not sure this was the right architecture. Maybe the harness should enforce all of this at the framework level instead of requiring every user to build their own eval pipeline. Maybe the model will get better at following instructions. But I've been building on "the model will get better" for months, and the Failure Record keeps growing.

The tools are emerging. Calx tracks corrections and compiles them to hooks. agents-lint detects instruction rot by cross-referencing your CLAUDE.md against the actual codebase. And an ETH Zurich study presented at ICSE 2026 quantified what happens when context files go stale: a 2-3% drop in agent task success and over a 20% increase in token costs. These are good signs. But the ecosystem is young, and most people are still in the "write better rules" phase.

I was too, until April 9.

I'm still adding entries to the Failure Record. If the system were working perfectly, that record would stop growing. It hasn't stopped.
