Context Engineering Replaced Prompt Engineering and Nobody Noticed

I've been doing context engineering for months without calling it that. A 547-line CLAUDE.md, subagent isolation, strategic compaction, six MCP servers. The term just caught up to the practice.

ai · software-engineering · developer-tools · agents
April 26, 2026
6 min read

My CLAUDE.md file is 547 lines long. It contains a three-stage publishing pipeline, a 24-pattern anti-detection checklist, five quality gate dimensions, a failure record that logs every time the system broke and why, and references to 14 skills that load on demand. My "prompt" this morning was the number 3. One digit. The context file did the rest. If there's a name for what I spend my time on now, it's context engineering. Not prompt engineering. I was doing it before anyone named it.

Birgitta Böckeler, writing on Martin Fowler's site, published "Context Engineering for Coding Agents" in February. She defined it simply: "curating what the model sees so that you get a better result." Karpathy called it "the delicate art and science" that separates real agent work from vibe coding. In Q1 2026, Neo4j, Elastic, ByteByteGo, and Firecrawl all independently published context engineering guides. The term arrived.

But the practice had been running for months. I didn't learn context engineering from a guide. I learned it the way I learn most things: something broke, and the fix wasn't a better prompt.

What I Actually Spend My Time On

I haven't written a traditional "prompt" in months. The work that determines whether my AI agents produce good output happens before they read any instruction at all.

CLAUDE.md is the primary artifact. Not a config file. Structured context that shapes every agent session: pipeline stages defining what the agent can do at each phase, writing rules governing voice and tone, a failure record where every past mistake becomes a future constraint, and skill references that load on demand. When my publishing agent runs, it reads this file first. Everything downstream flows from what's in it.
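
To make that concrete, here's a trimmed sketch of the file's shape. The section names and entries are illustrative, not excerpts from the real 547 lines:

```markdown
# CLAUDE.md (structural sketch; illustrative, not the actual file)

## Pipeline stages
1. Topic selection: read the backlog, pick one, stop.
2. Research: search and compare sources. No drafting.
3. Draft and publish: no new research; use Stage 2 output only.

## Writing rules
- First person. Concrete numbers. No filler.

## Failure record
- [date] what broke → the rule that now prevents it
- ...

## Skills (loaded on demand)
- publishing-pipeline, link-audit, humanization-check, ...
```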

Between pipeline stages, I run strategic compaction. Stage 2 produces research: web search results, source comparisons, competitor analysis, 15+ potential link targets. That context is useful during research. It's noise during drafting. So I compact between stages, deliberately clearing the research context before the writing phase loads. This is the part nobody talks about in the playbooks. Compaction is context curation at the pipeline level, not the prompt level.
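
Mechanically, the idea is small, even though my pipeline does it through Claude Code session compaction rather than code. A minimal Python sketch of the stage boundary, with stand-in names:

```python
from dataclasses import dataclass, field

# Minimal, hypothetical sketch of stage-boundary compaction. My real pipeline
# does this via Claude Code session compaction, not a Python API.

@dataclass
class Context:
    """Stands in for one agent context window."""
    messages: list[str] = field(default_factory=list)

def compact(research: Context, budget_chars: int = 800) -> str:
    """Distill raw research into a short brief. In my pipeline this is an
    LLM summarization step; here a truncation stands in for it."""
    return " ".join(research.messages)[:budget_chars]

def run_pipeline(topic: str) -> Context:
    research = Context()
    research.messages += [
        f"web search results for {topic}",
        "source comparison notes",
        "15+ candidate link targets",
    ]
    brief = compact(research)       # only the distilled brief crosses the boundary
    del research                    # raw research noise never reaches the writer
    writing = Context()             # Stage 3 starts with a clean window...
    writing.messages.append(brief)  # ...plus the brief
    return writing
```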

I delegate to subagents for context isolation. The SEO specialist runs in its own context window. The code reviewer runs in another. The docs-lookup agent verifies technical claims in a third. This isn't because those agents are better at their jobs when isolated. It's because the SEO keyword analysis shouldn't exist in the writing context. When the writer sees keyword density data, the prose contorts toward keywords. Isolation protects the writing, not the SEO.
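
Continuing the sketch above (same hypothetical Context stub), isolation means only a specialist's conclusion crosses back, never its working data:

```python
# Hypothetical sketch of subagent isolation. The specialist's context is
# created, used, and discarded inside its own function.

def seo_review(draft: str) -> str:
    ctx = Context()  # isolated window: the writer never sees what goes in here
    ctx.messages += [draft, "keyword density table", "competitor SERP notes"]
    # The raw keyword analysis lives and dies inside this context.
    return "verdict: reword 3 headings; add 1 internal link"  # only this escapes

def draft_post(brief: str) -> str:
    ctx = Context()
    ctx.messages.append(brief)
    draft = "...prose drafted from the brief..."
    ctx.messages.append(seo_review(draft))  # a verdict crosses back, not keyword data
    return draft
```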

I run six MCP servers: Engram, Exa, Zapier, Context7, GitHub, and an Obsidian vault connector. Each one loads tool definitions into the agent's context window on every session. That's capability. It's also a context tax. Claude Code GitHub issue #13700 proposes lazy-loading MCP servers only when their tools are actually called. Until that ships, I pay the overhead. Six servers means six invoices of context whether I use them in a given session or not.
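
Rough arithmetic on that tax. Every number below is a placeholder, not a measurement of these servers:

```python
# Illustrative back-of-envelope for the MCP context tax. Tool counts and
# per-definition token costs are made-up placeholders.
servers = {
    "engram": 6, "exa": 3, "zapier": 12,
    "context7": 2, "github": 20, "obsidian": 8,
}  # tools exposed per server (hypothetical)

TOKENS_PER_TOOL_DEF = 150  # name + description + JSON schema (rough guess)

tax = sum(servers.values()) * TOKENS_PER_TOOL_DEF
print(f"{sum(servers.values())} tool definitions ≈ {tax:,} tokens per session")
# → 51 tool definitions ≈ 7,650 tokens per session, paid whether or not a
#   single tool is called. Lazy loading would defer this until first use.
```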

And I built Engram because context windows aren't memory. A million-token window sounds enormous until you realize effective capacity is 60-70% of the advertised limit, models lose 30% accuracy on information positioned in the middle of the window, and the window resets every session. Engram persists facts, decisions, and corrections across sessions. It's the persistent context layer that the window can't be.
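
The shape of the idea is small, even if Engram's internals aren't. A conceptual sketch of a persistent context layer, not Engram's actual schema or API:

```python
import sqlite3

# Conceptual sketch of a persistent memory layer: facts survive the session
# even though the context window doesn't. Not Engram's real implementation.

class Memory:
    def __init__(self, path: str = "memory.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts "
            "(kind TEXT, body TEXT, ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
        )

    def remember(self, kind: str, body: str) -> None:
        self.db.execute("INSERT INTO facts (kind, body) VALUES (?, ?)", (kind, body))
        self.db.commit()

    def recall(self, kind: str, limit: int = 5) -> list[str]:
        rows = self.db.execute(
            "SELECT body FROM facts WHERE kind = ? ORDER BY ts DESC LIMIT ?",
            (kind, limit),
        )
        return [body for (body,) in rows]

# A new session starts with an empty window but a full memory:
mem = Memory()
mem.remember("correction", "publish at 9 PM, never on a topic pick alone")
print(mem.recall("correction"))
```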

I stopped spending time on prompts months ago. All the time goes into managing what the model sees.

When Context Engineering Fails

Two weeks ago, my publishing agent shipped a blog post at 4:30 AM instead of 9 PM. Every quality gate passed. The eval harness scored it: voice match 92, internal links 8 against a required minimum of 5, criticism depth 88, humanization 12, factual grounding 92. The post was good. The timing was wrong by 17 hours.

Root cause: context drift. A previous session's compacted summary had encoded "topic pick = run full pipeline" from a day where the 9 PM cron had already fired. My agent pattern-matched a morning "3" (picking a topic from a list) to that stale context without checking whether Stage 3 had actually been triggered.

Böckeler includes a caveat that most context engineering articles skip: "ultimately this is not really engineering... execution still depends on how well the LLM interprets them." She's right. My elaborate context infrastructure still failed to prevent the timing violation. A 547-line instruction file, a five-dimension eval harness, three isolated subagents, strategic compaction between stages, persistent memory via Engram, PreToolUse behavioral hooks. All green. Still wrong.

But here's what I didn't expect: the failure was debuggable. I traced it to a specific compacted summary. I added a HARD STOP rule to CLAUDE.md. The behavior changed on the next run. The model didn't get smarter. The context got better.
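
The rule itself is a few lines in CLAUDE.md. Paraphrased, since the exact wording in my file differs:

```markdown
## HARD STOP: publishing
- A topic pick (a bare number reply) NEVER triggers Stage 3.
- Stage 3 runs only after the 9 PM trigger has actually fired in this session.
- If a compacted summary implies otherwise, the summary is stale. Stop and ask.
```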

That's the argument for calling this engineering even when it's probabilistic. Not that it always works, but that failures have traceable causes and fixable remedies. A prompt failure is "I guess I should have worded it differently." A context failure is "this specific piece of stale information caused this specific wrong behavior, and removing it fixes the behavior."

I'm still not sure the HARD STOP rule will hold long-term. Compaction erodes rules over long sessions. But at least I know which layer to debug.

What the Playbooks Miss

Search "context engineering" right now and you'll find comparison tables. Prompt engineering on the left, context engineering on the right. Neat rows distinguishing "optimizes instruction" from "optimizes information environment." Every article published this quarter has one.

These tables aren't wrong. They're incomplete.

The distinction they miss: prompt engineering is craft that doesn't compound. Context engineering is infrastructure that does. A well-crafted prompt works once. A CLAUDE.md failure record accumulates every mistake into a constraint that prevents the next one. An eval harness adds dimensions as new failure modes surface. Subagent patterns, once established, run on every pipeline execution without additional effort. The compounding is the point.

One hour invested in CLAUDE.md compounds across every future session. One hour invested in prompt phrasing compounds across exactly one conversation.

But I have a criticism of my own discipline. "Context engineering" is already accumulating playbooks, certification courses, listicles, and conference talks. The same trajectory as "prompt engineering" in 2023. Within six months, someone will sell a masterclass that teaches people to write CLAUDE.md files and call it infrastructure.

It's not. Writing a CLAUDE.md file is configuration. Building a system with compaction cycles, isolated subagents, persistent memory, behavioral hooks, and eval sensors is infrastructure. One is a file. The other is a system. The playbooks flatten that distinction because systems don't fit in a seven-step guide.

The term caught up to the practice. That's fine. Terms are useful.

If you're using AI agents in production, you're already doing context engineering. The question is whether you're treating it as infrastructure or as configuration. One compounds. The other doesn't. The model doesn't fail because it's dumb. It fails because it saw the wrong things. Fix what it sees.
