Anthropic Shipped Agent Memory to Production While I Was Still Debugging Mine

Anthropic launched persistent memory for Claude Managed Agents. I've been building my own memory engine for months. Here's what their version solves, what it doesn't, and why the hard half of agent memory isn't storage.

ai-agents · memory-systems · engram · anthropic
April 27, 2026
8 min read

Last Tuesday, Anthropic announced persistent memory for Claude Managed Agents. I read the blog post at 2am, sitting in the same chair where I'd spent the previous three weeks debugging a forget() function that matched 73 memories when I searched for the word "banana."

The timing was brutal.

I've been building Engram, a memory engine for AI agents, since early 2026. Five engineering waves. pgvector, BM25, Neo4j graph database, cross-encoder reranking. Eight packages on npm. And here's Anthropic, shipping the same concept as a managed service with Netflix and Rakuten already on board.

My first reaction was the obvious one: why did I build this?

My second reaction, after actually reading the architecture, was different. They solved a different problem than I did. A useful one. But not the one that kept breaking my agents.

What Anthropic Actually Built

Persistent memory for Claude Managed Agents mounts directly onto a filesystem. Memories are files. Claude reads and writes them with the same bash tools it uses for everything else. It's clean, and I mean that genuinely.

The scoping model is solid. Stores are workspace-scoped with configurable read/write permissions, so you can have a shared knowledge base that agents read but don't modify alongside per-session stores they own. Multiple agents can work against the same store without stomping each other. Every change creates an immutable memory version, giving you point-in-time recovery. You can roll back to earlier snapshots if an agent writes something wrong.
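If you wanted to model that scoping yourself, it might look something like this. To be clear: this is a hypothetical TypeScript model of the concepts described above, not Anthropic's API, and every name in it is mine.

```typescript
// Hypothetical model of workspace-scoped stores with immutable versions.
// Illustrative only; not Anthropic's actual API or schema.
type Permission = "read" | "read-write";

interface MemoryStore {
  workspaceId: string;
  scope: "shared" | "session";             // shared knowledge base vs per-session store
  permissions: Record<string, Permission>; // agentId -> what that agent may do
}

interface MemoryVersion {
  storeId: string;
  version: number;               // every write appends; nothing mutates in place
  snapshotAt: Date;              // enables point-in-time rollback
  files: Record<string, string>; // path -> content (memories are files)
}
```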

Rakuten reports 97% fewer first-pass errors with workspace-scoped memory. Wisedocs saw 30% faster document verification. Netflix is carrying context across sessions including mid-conversation corrections from human reviewers.

Those numbers are real. Opus 4.7 is specifically optimized for this: the model writes better memories and reads them more effectively. For teams building on Claude today, this is the right answer.


But there's a gap between "persistent storage" and "useful agent memory," and I spent months falling into it.

What Broke When I Built My Own

Engram's architecture looks nothing like a filesystem. It's a graph-over-vector hybrid: pgvector for semantic search, BM25 for keyword retrieval, Neo4j for relationship traversal, and a cross-encoder reranker that scores candidates before they hit the context window. The thesis was simple. The graph IS the brain. Vector search finds candidates. Spreading activation does the actual remembering.
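Here's the shape of that pipeline in sketch form. Every declared function is an illustrative stand-in, not Engram's actual API:

```typescript
// A sketch of Engram-style hybrid retrieval. The declared functions are
// illustrative stand-ins, not Engram's actual API.
interface Candidate { id: string; text: string; score: number }

declare function vectorSearch(query: string, pool: number): Promise<Candidate[]>; // pgvector
declare function bm25Search(query: string, pool: number): Promise<Candidate[]>;   // keyword
declare function graphExpand(seeds: Candidate[], maxHops: number): Promise<Candidate[]>; // Neo4j
declare function rerank(query: string, cands: Candidate[]): Promise<Candidate[]>; // cross-encoder

function dedupeById(cands: Candidate[]): Candidate[] {
  const seen = new Map<string, Candidate>();
  for (const c of cands) if (!seen.has(c.id)) seen.set(c.id, c);
  return [...seen.values()];
}

async function recall(query: string, k = 10): Promise<Candidate[]> {
  // Stage 1: cast a wide, cheap net with both retrievers in parallel.
  const [semantic, lexical] = await Promise.all([
    vectorSearch(query, 500),
    bm25Search(query, 200),
  ]);
  // Stage 2: spread through the graph so related-but-dissimilar memories surface.
  const expanded = await graphExpand(dedupeById([...semantic, ...lexical]), 3);
  // Stage 3: the cross-encoder scores (query, memory) pairs; only top-k reach context.
  return (await rerank(query, expanded)).slice(0, k);
}
```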

Getting to 85% recall on the LoCoMo benchmark took all five waves. The early versions were embarrassing: 19.6% recall. Widening the vector scan pool from 90 to 500 rows bought a 26.6-point jump overnight. Contextual embedding (prepending the preceding conversation turns before embedding each one) added another 17 points. Cross-encoder reranking on top of that pushed the score further.
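Contextual embedding is worth a sketch because the change is so small relative to the gain. The idea: embed each turn together with the turns that preceded it, so the vector carries local conversational context. The window size and embed() client below are assumptions, not Engram's actual values:

```typescript
// A minimal sketch of contextual embedding: prepend the preceding turns
// before embedding. Window size and embed() are illustrative assumptions.
declare function embed(text: string): Promise<number[]>;

async function embedTurn(turns: string[], i: number, window = 2): Promise<number[]> {
  const context = turns.slice(Math.max(0, i - window), i).join("\n");
  // "What about the threshold?" embeds poorly on its own; with the prior
  // turns prepended, it lands near the memories it actually relates to.
  return embed(context ? `${context}\n${turns[i]}` : turns[i]);
}
```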

But the benchmark wasn't what taught me anything. Production use was.

My agents run on a VPS with six MCP servers. Engram is one of them. Every agent session ingests memories. Every analysis, every blog pipeline run, every code review generates episodes that flow into the memory store. After six weeks, I had close to a thousand episodes, 325 semantic facts, and 81 digests.

And my recall quality was degrading.

Not because the retrieval was wrong. Because the signal was drowning in noise. Superseded analyses kept resurfacing. Duplicate operational logs competed with consolidated state entries. Stale briefings from three weeks ago ranked higher than the current weekly retro because they had more semantic overlap with the query.

The fix wasn't a retrieval improvement. It was forgetting.

The Three Weeks I Couldn't Forget

Engram's forget() function was supposed to deprioritize memories by query. Search for "stale briefing April 15," match the relevant memories, lower their confidence scores. Simple.

Except it wasn't. Issue #7 on GitHub. The function used a "deep" recall strategy with no minimum similarity threshold. When I called forget("stale briefing April 15"), it returned 73 to 99 "affected" memories. Including completely unrelated ones. Including gibberish. Searching for "asdf" matched 82 memories.

I tried three consecutive memory gardening sessions (April 13, 15, 17). All three were blocked. The first threw TypeError: fetch failed on half the operations. The second silently matched everything and nearly carpet-bombed my entire memory store. The third combined both failure modes.

The fix was five lines of code. Switch from "deep" to "light" recall strategy. Add a minimum relevance score of 0.5. Ship v0.3.7.
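In sketch form, the fix looked something like this (names illustrative, not Engram's internals):

```typescript
// The shape of the v0.3.7 fix: constrain the match set before touching scores.
const MIN_RELEVANCE = 0.5; // below this, a "match" is noise, not a target

declare function recallLight(query: string): Promise<{ id: string; score: number }[]>;
declare function deprioritize(id: string): Promise<void>;

async function forget(query: string): Promise<number> {
  // "light" strategy: vector + keyword only. Deep recall's graph expansion
  // was what let "asdf" reach 82 unrelated memories.
  const matches = await recallLight(query);
  const targets = matches.filter((m) => m.score >= MIN_RELEVANCE);
  await Promise.all(targets.map((m) => deprioritize(m.id))); // lower confidence, never delete
  return targets.length;
}
```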

After that fix, I ran four consecutive clean gardening sessions. 516 memories deprioritized across eight days. Zero failures. The categories I pruned tell you what memory pollution actually looks like in production: superseded planning analyses (100+), duplicate GhostWriter operational logs, stale sprint summaries, heartbeat noise from cron jobs, old email snapshots that kept resurfacing.

Anthropic's memory doesn't have this problem yet. But it will. Every system that accumulates memories without a curation discipline hits this wall eventually.

What File-Based Memory Can't Do

Here's where the architectures diverge.

Anthropic stores memories as files. You can grep them. You can edit them in a console. That's great for transparency and for agents that follow rules predictably (which is its own open question). But files are flat.

When I ask Engram "what do I know about code review for the Ouija project?", the retrieval pipeline doesn't just find memories containing those keywords. It seeds Neo4j with entity matches, runs spreading activation across relationship edges, discovers that a memory about BullMQ job failures is three hops away from a memory about PR dispatch patterns, which connects to a reviewer preference memory about test coverage thresholds. Those memories never mention "code review" directly. A keyword search on files would miss them entirely.

This is the difference between semantic similarity and relational reasoning. Vector search (and file grep) finds "what sounds like the query." Graph traversal finds "what connects to the query through learned relationships." Both matter. But if you only have the first one, your agent's memory is functionally a search engine, not a brain.
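Spreading activation itself is easy to sketch. Engram runs it inside Neo4j; the in-memory version below, with an illustrative decay constant and hop limit, shows the mechanism:

```typescript
// A minimal in-memory sketch of spreading activation. Decay, hop limit, and
// the graph representation are illustrative; Engram does this in Neo4j.
type Graph = Map<string, { neighbor: string; weight: number }[]>;

function spreadActivation(
  graph: Graph,
  seeds: Map<string, number>, // memory id -> initial activation from vector search
  maxHops = 3,
  decay = 0.5,
): Map<string, number> {
  const activation = new Map(seeds);
  let frontier = [...seeds.keys()];
  for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const id of frontier) {
      const energy = (activation.get(id) ?? 0) * decay;
      for (const { neighbor, weight } of graph.get(id) ?? []) {
        const incoming = energy * weight;
        if (incoming > (activation.get(neighbor) ?? 0)) {
          activation.set(neighbor, incoming);
          next.push(neighbor); // keep propagating through newly activated nodes
        }
      }
    }
    frontier = next;
  }
  return activation; // includes memories that never matched the query text
}
```

The BullMQ-to-reviewer-preference chain above is exactly what the hop loop buys: activation reaches memories the query never names.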

Anthropic's filesystem approach also doesn't address three problems I've been fighting:

Correction Persistence

When a human reviewer tells my agent "actually, that config value should be X, not Y," that correction needs to supersede the original memory. Not just add a new file alongside the old one. The old fact needs to decay. In Engram, this is dedup threshold tuning (currently at 0.62) combined with temporal validity tracking. I'm still not sure the threshold is right.
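A sketch of the supersede mechanic, where only the 0.62 threshold comes from Engram and the rest of the shape is assumed:

```typescript
// Correction persistence sketch: a near-duplicate new fact supersedes the old
// one rather than coexisting with it. Only DEDUP_THRESHOLD comes from Engram.
const DEDUP_THRESHOLD = 0.62;

interface Fact {
  id: string;
  text: string;
  validFrom: Date;
  supersededBy?: string; // set means "decayed": retrieval filters these out
}

declare function similarity(a: string, b: string): number; // e.g. cosine over embeddings

function ingestFact(incoming: Fact, existing: Fact[]): Fact[] {
  for (const old of existing) {
    if (!old.supersededBy && similarity(old.text, incoming.text) >= DEDUP_THRESHOLD) {
      old.supersededBy = incoming.id; // the correction wins; the old fact decays
    }
  }
  return [...existing, incoming];
}
```

Too low a threshold and genuine updates absorb unrelated facts; too high and corrections pile up next to the facts they were meant to replace. Hence the uncertainty about 0.62.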

Behavioral Memory

My Ouija pipeline is getting Phase 4 integration with Engram: three memory types (code patterns, reviewer preferences, operational quirks like CI failures). The agent should recall that last time it touched auth.ts, the tests broke because of a missing mock. That's not a fact stored in a file. It's a behavioral pattern extracted from past dispatches.
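Hypothetically, those three types might be shaped like this (the field names are mine, not the Phase 4 schema):

```typescript
// Hypothetical shapes for the three Phase 4 memory types.
type BehavioralMemory =
  | { kind: "code-pattern"; file: string; lesson: string }          // "auth.ts tests need the mock"
  | { kind: "reviewer-preference"; reviewer: string; rule: string } // "coverage must not drop"
  | { kind: "operational-quirk"; system: string; symptom: string }; // "CI flakes on this runner"
```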

Cross-Session Contamination

Early Engram had a session_id isolation gap. The MCP server accepted session_id on writes but didn't pass it to the recall engine. Multi-agent setups with shared memory need read isolation too, or Agent A's deployment context bleeds into Agent B's code review. I fixed this, but the bug was subtle enough that it ran for weeks before I caught it.
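The fix, sketched with illustrative names rather than the MCP server's actual handlers, was to make the recall path honor the same session_id the write path accepts:

```typescript
// Session isolation sketch: reads must respect the session_id that writes
// accept, or one agent's context leaks into another's results.
interface Hit { id: string; sessionId?: string; score: number }

declare function searchMemories(query: string): Promise<Hit[]>;

async function recallScoped(query: string, sessionId?: string): Promise<Hit[]> {
  const hits = await searchMemories(query);
  // The original bug: writes tagged memories with session_id, but this
  // filter never existed, so the tag was write-only metadata.
  return sessionId ? hits.filter((h) => h.sessionId === sessionId) : hits;
}
```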

The Honest Comparison

I need to be fair about both sides.

Anthropic's memory is genuinely good infrastructure. If you're building on Claude and you need persistent context, use it. The scoping, versioning, concurrency model, and rollback support are more mature than anything I've built. The $0.08 per session-hour pricing is reasonable at moderate scale. And Sara Du from Ando nailed it: "Memory lets us stop building memory infra and focus on the product itself." That's a legitimate value proposition.

But the trade-offs are real. It's Claude-only. Multi-agent coordination is still in research preview. All memory flows through Anthropic's infrastructure, which matters for data residency. Session costs compound at scale. And if you ever migrate, you start from zero. That's vendor lock-in with a friendly face.

Engram has its own problems. 85% R@K on LoCoMo sounds fine until you realize the leaderboard leaders (MemoryLake, EverMemOS) report 92-94% on F1, which is a stricter metric than R@K. The numbers aren't directly comparable, and that ambiguity works against the indie project. There's a naming collision with another project called Engram (different codebase, same npm ecosystem, same LoCoMo claims). Cold CLI latency is 10.3 seconds. Neo4j is overkill for small repos. And I spent three weeks debugging a five-line fix while Anthropic shipped the whole feature.

The honest answer is: build if you need relationship reasoning, temporal coherence, or model-agnostic memory. Use Anthropic's if you need memory today and the Claude commitment works for you. Both paths cost more than they should.

The Part Nobody Talks About

The agent memory problem isn't storage. Every framework, every managed service, every MEMORY.md file in every repo can store memories. That's the easy half.

The hard half is knowing what to forget.

I've pruned 516 memories across four gardening sessions. Each one was a decision: is this superseded? Is this a duplicate phrased differently? Is this noise from a cron job, or signal from a real incident? Is this analysis still current, or did the weekly retro make it obsolete?
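Some of those questions compile down to checks. A hypothetical triage pass, where every predicate is an assumption about what "stale" means rather than Engram's gardening code:

```typescript
// Hypothetical gardening triage encoding the questions above as predicates.
interface Episode { id: string; kind: string; createdAt: Date; supersededBy?: string }

function shouldDeprioritize(e: Episode, now = new Date()): boolean {
  const ageDays = (now.getTime() - e.createdAt.getTime()) / 86_400_000;
  if (e.supersededBy) return true;                        // a newer analysis replaced it
  if (e.kind === "heartbeat") return true;                // cron noise, never signal
  if (e.kind === "briefing" && ageDays > 14) return true; // stale operational snapshot
  return false;                                           // still plausibly signal: keep
}
```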

Anthropic gave agents a hard drive. I'm still building the hippocampus. I don't know if the hippocampus is worth the effort yet. But I know the hard drive alone isn't enough.
