
I'm Wiring Graph Memory Into Code Review. Here's What Vectors Miss.
Your AI code reviewer gives the same feedback your team rejected three weeks ago. It can't know. I'm building the fix: two graphs, one structural and one cognitive, wired together through spreading activation.
Last week my agent flagged a function for violating the team's naming convention. The same function. The same convention. The same flag it raised three weeks earlier, when we decided to keep the name because it matched the upstream API.
The agent didn't know we'd had that conversation. It couldn't know. Every review session starts blank.
I've been running six MCP servers daily for months now. One of them is Engram, the cognitive memory engine I built on Neo4j. Another is code-review-graph, a structural mapping tool with 10,700 stars on GitHub. They're both graphs. They solve completely different problems. And neither one alone is enough to make graph memory actually work in code review.
Two Graphs, Two Problems
code-review-graph (by tirth8205) does something genuinely clever. It parses your codebase with Tree-sitter, builds an AST graph of every function, class, and import in NetworkX, and persists the relationships to SQLite. When you ask it to review a PR, it computes the blast radius of your changes and feeds only the affected files to your agent. 28 MCP tools. 6.8x fewer tokens per review. The tagline is accurate: it turns your agent from a "forgetful tourist" into someone who knows the map.
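Computing a blast radius from a dependency graph is, at its core, a transitive-closure query. A minimal sketch in NetworkX, with illustrative module names and edge direction that are my assumptions, not code-review-graph's actual schema:

```python
import networkx as nx

# Hypothetical dependency graph. Edges point caller -> callee:
# "api.handler" -> "db.session" means api.handler depends on db.session.
dep = nx.DiGraph()
dep.add_edges_from([
    ("api.handler", "auth.middleware"),
    ("api.handler", "db.session"),
    ("jobs.sync", "db.session"),
    ("auth.middleware", "auth.tokens"),
])

def blast_radius(graph: nx.DiGraph, changed: str) -> set[str]:
    """Everything that transitively depends on the changed node."""
    # In a caller->callee graph, the transitive callers are the ancestors.
    return nx.ancestors(graph, changed)

print(sorted(blast_radius(dep, "db.session")))  # ['api.handler', 'jobs.sync']
```

Feeding only this set to the agent, instead of the whole repo, is where the token savings come from.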
But knowing the map isn't the same as knowing the history.
Engram takes the opposite approach. It doesn't understand code structure at all. It has no AST parser, no call graph, no blast radius analysis. What it has is a cognitive memory graph with 8 node labels (Memory, Person, Topic, Entity, Emotion, Intent, Session, TimeContext) and 14 relationship types. When I recall a memory, it doesn't just do vector similarity search. It uses Neo4j spreading activation to walk the graph, following associative links the way human memory works. The vector search finds seed nodes. The graph does the actual remembering.
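Spreading activation itself is simple to sketch: vector search supplies the seed nodes, activation decays as it propagates hop by hop, and anything below a threshold is dropped. A toy in-memory version with made-up node names and parameters (Engram's real traversal runs as Cypher inside Neo4j, not in Python):

```python
from collections import defaultdict

def spread(edges, seeds, decay=0.5, threshold=0.1, max_hops=3):
    """Toy spreading activation: seeds come from vector search, activation
    halves per hop (decay=0.5), and sub-threshold nodes are dropped."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)  # associative links work in both directions
    activation = dict(seeds)  # node -> strongest activation seen so far
    frontier = dict(seeds)
    for _ in range(max_hops):
        nxt = {}
        for node, act in frontier.items():
            for nb in adj[node]:
                a = act * decay
                if a >= threshold and a > activation.get(nb, 0.0):
                    nxt[nb] = max(nxt.get(nb, 0.0), a)
        for n, a in nxt.items():
            activation[n] = max(activation.get(n, 0.0), a)
        frontier = nxt
        if not frontier:
            break
    return activation

# Hypothetical memory graph: the seed is what vector search matched.
edges = [("memory:naming-convention", "topic:api-design"),
         ("topic:api-design", "memory:keep-upstream-name"),
         ("memory:keep-upstream-name", "person:rex")]
print(spread(edges, {"memory:naming-convention": 1.0}))
```

The point of the walk: "memory:keep-upstream-name" is three characters of vector similarity away from the query, but one associative hop away from the seed, so it still surfaces.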
code-review-graph knows WHAT your code does. Engram knows WHY you decided to do it that way.
The Gap Nobody's Filling
Cloudflare announced Agent Memory yesterday (April 17). Their agentic code reviewer uses it, and they published my favorite line from any blog post this month: "The most useful thing it learned to do was stay quiet. The reviewer now remembers that a particular comment wasn't relevant in a past review." That's exactly right. The best code reviewer isn't the one that finds more issues. It's the one that knows which issues the team already dismissed.
But Cloudflare's system is locked to their infrastructure. Durable Objects, Vectorize, their internal OpenCode plugin. Not portable.
Augment Code published a framework the same day distinguishing three persistence layers: static context files, agent memory, and living specs. Their key insight: "If an agent re-derives the same architectural decisions every session, agent memory is missing." I've lived this. My agent Rex re-derives things constantly. I wrote explicit rules in CLAUDE.md telling it what to do. It read the rules and ignored them. Rules in text files aren't memory. Memory is something the system has internalized through experience.
The SA-RAG paper (arXiv 2512.15922) gave me the academic backing I needed. Pavlovic et al. showed that spreading activation on a knowledge graph performs comparably to HippoRAG 2's Personalized PageRank for multi-hop retrieval. Their framing is perfect: "retrieval as a structural graph problem, not a prompting problem." That's exactly what I've been arguing. You don't solve multi-hop context with better prompts or bigger context windows. You solve it with graph traversal.
What I'm Actually Building
I analyzed code-review-graph's architecture and identified five integration approaches with Engram. The strongest two:
A Claude Code hook that fires memory_recall automatically when the agent enters a code review session. Not per file (I tried that, and 20+ file reads per review drowned the context window with memory results). Once per session, batched, with the PR diff as the recall query. The hook pulls in past decisions, rejected suggestions, architectural conventions the team chose to enforce.
And a dual-graph MCP server where code-review-graph provides the structural context (what changed, what's affected, what calls what) and Engram provides the cognitive context (why we built it this way, what we tried before, what the reviewer flagged last time). Two graphs, one query, one response. The agent sees both the map and the history.
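The session-start hook reduces to one small function: collapse the PR's changed paths into a single recall query instead of firing a recall per file read. A hedged sketch, where `memory_recall` stands in for Engram's MCP tool and the hook wiring itself (how Claude Code invokes this) is omitted:

```python
def build_recall_query(changed_paths, max_terms=20):
    """One batched recall query per review session, built from the PR
    diff's changed paths, instead of one recall per file read."""
    terms = sorted(set(changed_paths))[:max_terms]  # dedupe, cap term count
    return " ".join(terms)

def on_session_start(changed_paths, memory_recall):
    # `memory_recall` is a stand-in for the Engram MCP tool call.
    # It fires exactly once, when the review session begins.
    return memory_recall(build_recall_query(changed_paths))

# Stubbed usage: a real hook would pass the actual recall tool here.
memories = on_session_start(["src/auth.py", "src/api.py"], lambda q: [q])
```

Capping the term count matters: this is exactly the context-window drowning the per-file version caused, moved to a single controlled choke point.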
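For the dual-graph server, the response shape is the interesting part: one query in, both graphs' answers out, kept in separate fields so the agent can tell map from history. A hypothetical sketch (none of these names are the real MCP tool surface of either project):

```python
from dataclasses import dataclass, field

@dataclass
class ReviewContext:
    """One response carrying both graphs' answers. Field names are
    illustrative, not a published schema."""
    structural: dict = field(default_factory=dict)  # from code-review-graph
    cognitive: list = field(default_factory=list)   # from Engram

def dual_query(pr_files, structure_graph, memory_recall):
    # Structural side: what changed and what each change touches.
    affected = {f: structure_graph.get(f, []) for f in pr_files}
    # Cognitive side: why it is the way it is -- one recall over the diff.
    memories = memory_recall(" ".join(sorted(pr_files)))
    return ReviewContext(structural=affected, cognitive=memories)
```

Keeping the two fields separate, rather than flattening them into one context blob, is deliberate: the agent should be able to cite "this is affected" and "this was decided" as different kinds of claims.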
I didn't expect the temporal coherence problem to be the hardest part. Structural graphs change on every commit. A function node updates hourly. Cognitive graphs change when humans make decisions. The node representing "team decided to use Redis for caching" hasn't been touched in six months. Merging them into one graph creates nodes with wildly different staleness profiles. I'm still not sure this is solvable without maintaining them as separate stores with a query-time bridge.
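One way to sketch the query-time bridge: leave the two stores separate, and at merge time decay structural hits on a short half-life while cognitive hits keep full weight but expose their age. Everything here (field names, the one-hour half-life) is a hypothetical illustration of the staleness split, not a settled design:

```python
import time

def bridge(structural_hits, cognitive_hits, now=None, half_life_s=3600.0):
    """Query-time merge over two separate stores with different
    staleness profiles: structural facts decay fast, decisions don't."""
    now = now or time.time()
    merged = []
    for node in structural_hits:
        age = now - node["updated_at"]
        # A function node from last week is probably wrong; halve its
        # weight every hour rather than trust it.
        merged.append({**node, "weight": 0.5 ** (age / half_life_s)})
    for node in cognitive_hits:
        # A six-month-old decision is still a decision. Full weight,
        # but surface its age so the reviewer can see how old it is.
        merged.append({**node, "weight": 1.0, "age_s": now - node["decided_at"]})
    return sorted(merged, key=lambda n: n["weight"], reverse=True)
```

This sidesteps the merge problem rather than solving it: the two graphs never share nodes, only a ranked result list.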
Where Both Tools Fall Short
I need to be honest about code-review-graph's limitations. Its search ranking MRR is 0.35. Finding code by structure works well. Finding the RIGHT code for a given question is still hit-or-miss. And NetworkX loads the entire graph into memory. Fine for repos under 10,000 nodes. At 100K+ nodes, that approach falls apart. The "memory loop" feature sounds promising but it's really just Q&A re-ingestion into flat markdown. It can't generalize from past reviews to new situations.
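For readers who don't track retrieval metrics: MRR averages the reciprocal rank of the first relevant result per query, so 0.35 loosely means the right answer tends to land around rank three. A minimal definition:

```python
def mrr(ranked_results, relevant):
    """Mean reciprocal rank: average of 1/rank of the first relevant
    hit per query (0 if it never appears)."""
    total = 0.0
    for query_results, rel in zip(ranked_results, relevant):
        for i, r in enumerate(query_results, start=1):
            if r == rel:
                total += 1.0 / i
                break
    return total / len(ranked_results)

# Query 1 finds the right file at rank 3, query 2 at rank 1.
print(mrr([["a", "b", "c"], ["x", "y", "z"]], ["c", "x"]))  # (1/3 + 1) / 2
```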
Engram has the opposite problem. It has zero structural code understanding. If you ask it "what's the blast radius of changing the auth middleware," it has no idea. It can tell you the team discussed auth middleware architecture on March 14th with three engineers and decided to keep the Express pattern. But it can't map function call chains. It needs the structural graph it doesn't have.
And Neo4j is overkill for small repos. I'm running a JVM-based graph database in Docker to remember that the team prefers functional components over class components. For a 500-file project, that overhead is absurd. The tool selection problem is real: sometimes the right answer is a simpler tool.
Spreading activation tuning is also brutal. When I first migrated from graphology (an in-process JS graph library) to Neo4j, I hit O(E) edge check performance and BFS queue blowup that forced me to rewrite Waves 1 through 3 of the implementation plan. Then MERGE-based node creation started multiplying duplicate nodes because I was matching on multiple properties without a unique identifier constraint. The fix was simple (always MERGE on a single fqn field, then SET additional properties) but the debugging wasn't.
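The fix translates to Cypher as roughly `MERGE (n:Function {fqn: $fqn}) SET n += $props`. Why matching on multiple mutable properties multiplies duplicates is easy to show with a toy MERGE simulation (plain Python, no database; the property names are illustrative):

```python
def merge_node(store, key_props, all_props):
    """Toy MERGE semantics: match on key_props, create if absent,
    then SET the remaining properties. `store` maps key -> node."""
    key = tuple(sorted(key_props.items()))
    node = store.setdefault(key, dict(key_props))
    node.update(all_props)
    return node

# Bug: MERGE on mutable properties. When line_count changes between
# ingests, the match fails and a second node appears for one function.
bad = {}
merge_node(bad, {"name": "recall", "line_count": 40}, {})
merge_node(bad, {"name": "recall", "line_count": 41}, {})  # duplicate!
assert len(bad) == 2

# Fix: MERGE on the single stable identifier, then SET everything else.
# Cypher equivalent: MERGE (n:Function {fqn: $fqn}) SET n += $props
good = {}
merge_node(good, {"fqn": "engram.recall"}, {"name": "recall", "line_count": 40})
merge_node(good, {"fqn": "engram.recall"}, {"name": "recall", "line_count": 41})
assert len(good) == 1
```

In real Neo4j you'd also add a uniqueness constraint on `fqn` so the database enforces this instead of the ingest code.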
What Code Review Becomes With Graph Memory
The multi-agent coordination problem in code review is that your reviewer, your linter, and your CI pipeline all operate in isolation. None of them share memory. Your reviewer doesn't know your linter already flagged the same issue. Your CI doesn't know the reviewer approved a pattern that looks like a test failure.
Wiring Engram into this loop changes what review means. Instead of "does this code follow the rules," the question becomes "does this code align with what the team has decided, changed its mind about, and learned over the past six months." The agent skill stack I've been building treats each capability as a composable layer. Code review with memory is just another layer.
Cloudflare figured this out. vexp figured this out. The SA-RAG paper has the math. I'm building the open-source version with Engram's graph architecture and code-review-graph's structural mapping, connected through a dual-MCP interface that keeps both graphs in sync without collapsing them into one.
The reviewer that remembers isn't the one with the biggest context window. It's the one that can traverse a graph of decisions and stop before it repeats a mistake the team already resolved.
I'll publish the integration as @engram-mem/plugin-code-review once the temporal bridging layer stabilizes. For now, the privacy questions around persisting review history across sessions need answers. Who owns the memory of a rejected code suggestion? What happens to the graph when an engineer leaves the team? The verification problem isn't just about checking agent output. It's about deciding what agents should be allowed to remember.