
GraphRAG Pilots Succeed. Production Deployments Fail Quietly.
Entity resolution errors compound exponentially. Graph decay runs 15-20% per quarter. Gradient Flow says they barely know of any production deployments offering real business value. The most hyped retrieval pattern of 2026 has a production problem nobody wants to own.
Last month I watched a demo where a GraphRAG system traced a 4-hop path through a supply chain knowledge graph and returned a dollar-amount impact estimate in under two seconds. The audience was sold. I was too, honestly. Then I thought about what happens when that same system ingests its 10,000th document and "IBM," "International Business Machines," and "Big Blue" are three separate nodes.
I run Neo4j in production. Not for document retrieval, but for a cognitive memory engine that uses spreading activation to retrieve context for AI agents. Different use case, same database, same maintenance nightmares. I've spent weeks debugging graph problems that don't show up in demos. So when I see GraphRAG pitched as the next evolution of retrieval, I believe the capability claim. I don't believe the production readiness claim.
The Benchmark Numbers Are Real. The Context Is Missing.
GraphRAG's accuracy advantage on multi-hop queries is genuine. On the MultiHop-RAG dataset, it hits 71.17% accuracy versus standard RAG's 65.77%. On cross-document reasoning ("which customers are affected by this supplier's incident given their regional overlap?"), GraphRAG wins by 4x: 33% versus 8%. Those numbers are from tianpan.co's production analysis, and I have no reason to doubt them.
But here's what the benchmark comparisons leave out: for specific document search (find the exact policy, locate the contract clause), vector RAG actually wins. The 3.4x overall accuracy advantage that gets cited in every GraphRAG pitch is real, but it's misleading without knowing your query distribution. If 80% of your queries are single-hop lookups, you're paying for a graph you don't need.
Gradient Flow's assessment is the one nobody quotes in their pitch deck: "We barely know of any examples of production deployments that are offering real business value." That's Ben Lorica and Prashanth Rao, two of the most respected voices in data infrastructure. And that was their conclusion after surveying the GraphRAG landscape.
Entity Resolution Is the Load-Bearing Wall
Sowmith Mandadi wrote the piece everyone building GraphRAG should read before writing a line of code. His core insight: entity resolution errors don't add up. They compound.
If your entity extraction is 85% accurate (which is optimistic for specialized domains, where tianpan.co measured 60-85%), and your query requires 5 hops, your answer reliability is (0.85)^5 = 44%. Fewer than half your multi-hop answers are trustworthy. At 3 hops it's 61%. At 2 hops, 72%.
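The compounding math above is worth making concrete. A minimal sketch, assuming errors are independent per hop (a simplification: in practice a single bad merge correlates failures across every path through it):

```python
# Reliability of a multi-hop answer when every hop depends on correct
# entity resolution. Assumes per-hop errors are independent.

def multihop_reliability(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every entity on an n-hop path was resolved correctly."""
    return per_hop_accuracy ** hops

for hops in (2, 3, 5):
    r = multihop_reliability(0.85, hops)
    print(f"{hops} hops at 85% extraction accuracy -> {r:.0%} reliable")
```

Run it and the cliff is visible: 72% at 2 hops, 61% at 3, 44% at 5. The accuracy number barely moves; the trust collapses.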
A single misidentified entity doesn't produce one wrong answer. It poisons every path that traverses through it. "Dr. Smith" appears 847 times across your documents. Is it the same person? If your system guesses wrong on 15% of those merges, every downstream query inherits the corruption.
I've lived a version of this problem. When I first wired MERGE-based node creation into my graph layer, I was matching on multiple properties simultaneously. If any property differed between two nodes that should have been the same entity, Neo4j created a duplicate. The result: exponential node proliferation that made the graph look richly connected but was actually fragmented. The fix was embarrassingly simple (MERGE on a single unique identifier, then SET everything else). The debugging was not. I spent days staring at a graph that looked right in the visualizer but returned garbage on queries.
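The bug generalizes beyond Neo4j, so here's a toy reproduction in plain Python (a hypothetical in-memory node store, not the Neo4j API). Matching on every property means two records for the same entity with any differing field become separate nodes; the fix mirrors the Cypher pattern `MERGE (n {id: $id}) SET n += $props`:

```python
# Toy reproduction of the duplicate-node bug: multi-property matching
# vs. matching on a single unique identifier.

def merge_multi_prop(store: dict, props: dict) -> dict:
    """Buggy: match on the full property tuple, like MERGE (n {a: .., b: ..})."""
    key = tuple(sorted(props.items()))
    return store.setdefault(key, dict(props))

def merge_on_id(store: dict, props: dict) -> dict:
    """Fix: MERGE on the unique id alone, then SET everything else."""
    node = store.setdefault(props["id"], {"id": props["id"]})
    node.update(props)  # the SET step
    return node

records = [
    {"id": "ibm", "name": "IBM"},
    {"id": "ibm", "name": "International Business Machines"},
]

buggy, fixed = {}, {}
for r in records:
    merge_multi_prop(buggy, r)
    merge_on_id(fixed, r)

print(len(buggy), "nodes with multi-property matching")  # duplicate created
print(len(fixed), "node with merge-on-id plus update")   # one canonical node
```

Same two records, two different graphs. Scale the record count up and the buggy version is exactly the "richly connected but fragmented" graph described above.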
And that was with 8 node labels and 14 relationship types. Microsoft's GraphRAG extracts entities with open-ended LLM calls. The extraction consumes 58% of total indexing tokens. It produces noisy graphs at scale, and the noise is systematic, not random. The same extraction errors appear consistently across similar document types, creating coverage gaps you can't see until a query falls through them.
The Cost Nobody Adds Up
The indexing cost comparison everyone publishes, for a 500-page corpus:
- Vector RAG: under $5, minutes
- LightRAG: ~$0.50, about 3 minutes
- Microsoft GraphRAG: $50-200, roughly 45 minutes
Index construction runs 40-57x slower than standard RAG. Per-query latency averages 14,434ms versus 1,724ms for vector RAG. Those numbers are from systematic benchmarks, not edge cases.
But indexing cost is only the first layer. Nobody adds up the real total:
Schema design. You need an ontology. Which entity types matter? Which relationship types should you extract? How granular? Goldman Sachs and Deloitte deployed GraphRAG with hand-designed ontologies. That's weeks of domain expert time before you write any retrieval code.
Triple-index synchronization. Production GraphRAG needs three indexes: text (full-text search), vector (semantic similarity), and graph (entity relationships). Every document change must propagate to all three. Every entity update triggers re-evaluation of existing merges. This is a data engineering problem that most tutorials pretend doesn't exist.
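To make the synchronization problem concrete, here's a minimal sketch of the write path (all names hypothetical). The point is that one document change is three writes, and a failure partway through leaves the indexes disagreeing unless you track what landed:

```python
# Hypothetical triple-index write path for one document change.
from dataclasses import dataclass, field

@dataclass
class TripleIndex:
    text: dict = field(default_factory=dict)     # full-text index
    vectors: dict = field(default_factory=dict)  # embedding index
    graph: dict = field(default_factory=dict)    # doc -> extracted entities

    def upsert(self, doc_id: str, body: str, embedding: list, entities: list) -> list:
        applied = []  # record each write so a partial failure can be reconciled
        self.text[doc_id] = body
        applied.append("text")
        self.vectors[doc_id] = embedding
        applied.append("vector")
        # The expensive step: new entities can invalidate earlier merge decisions,
        # so a real pipeline would queue re-resolution here, not just overwrite.
        self.graph[doc_id] = entities
        applied.append("graph")
        return applied

idx = TripleIndex()
applied = idx.upsert("doc-1", "Acme acquired Initech.", [0.1, 0.9], ["Acme", "Initech"])
print(applied)  # three index writes for a single document update
```

Every "update a document" code path in your system has to look like this, including the retry and reconciliation logic this sketch waves away.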
Graph decay. Production knowledge graphs without automated refresh drift 15-20% from ground truth per quarter. That's from tianpan.co's analysis, and it matches what I've seen. Community summaries built at index time reflect the document state when they were generated. They don't auto-update. Four months after launch, your graph is answering questions based on a reality that's already moved on.
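Back-of-envelope on what that decay rate means over a year, assuming the drift compounds quarter over quarter (my assumption; the source reports the per-quarter figure, not the compounding model):

```python
# Fraction of graph facts still matching ground truth after q quarters,
# given a per-quarter drift rate and no automated refresh.

def still_accurate(quarterly_drift: float, quarters: int) -> float:
    return (1 - quarterly_drift) ** quarters

for drift in (0.15, 0.20):
    print(f"{drift:.0%}/quarter -> {still_accurate(drift, 4):.0%} accurate after a year")
```

At 15% per quarter you're at roughly half accuracy after four quarters; at 20%, closer to 40%. "Four months after launch" is not an exaggeration.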
Re-indexing. When documents change (they always change), you either re-run the full extraction pipeline or maintain incremental update logic. Microsoft is still refining their incremental approach. If Microsoft hasn't solved this cleanly, your team probably won't either.
What I Learned Running a Graph Database in Production
My graph isn't a GraphRAG system. It's a cognitive memory graph with different goals. But the maintenance problems are identical, and the things I've learned translate directly.
Graph connectivity can be meaningless. I had 28 TimeContext singleton nodes connecting thousands of memories. The graph looked densely connected in every visualization tool. The connections carried zero semantic value. Degree centrality was useless. You need to audit what your edges mean, not just count them.
Silent graph operations are the worst failure mode. Unlike SQL where a failed DELETE throws an error, Cypher operations can match zero nodes and return success. I had a memory pruning function that returned success for three weeks while deleting nothing. No error. No log. Just a graph that couldn't forget. Amarnath Byakod documented the equivalent problem for GraphRAG: LLMs generate invalid Cypher 15-30% of the time, and the failures are often silent.
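The defense I'd want is a guard that fails loudly when a destructive write matched nothing. The neo4j Python driver exposes write counts on the result summary (`SummaryCounters.nodes_deleted`); the stub below stands in for that so the sketch is self-contained:

```python
# Guard against silent zero-match deletes. FakeCounters mimics the shape
# of the neo4j driver's SummaryCounters for a self-contained example.

class SilentNoOpError(RuntimeError):
    pass

def assert_deleted(counters, expected_min: int = 1) -> None:
    """Raise if a DELETE 'succeeded' while deleting fewer nodes than expected."""
    if counters.nodes_deleted < expected_min:
        raise SilentNoOpError(
            f"delete matched {counters.nodes_deleted} nodes, expected >= {expected_min}"
        )

class FakeCounters:
    nodes_deleted = 0  # what my pruning function silently saw for three weeks

try:
    assert_deleted(FakeCounters())
except SilentNoOpError as e:
    print("caught:", e)
```

Three lines of guard code would have turned three weeks of silent failure into an alert on day one.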
Architecture audits find more problems than you expect. My 5-wave implementation plan went through automated architect-reviewer audits. They found 14 critical issues and 22 high-severity bugs. Causal edge discovery was fundamentally broken (it looked for matching episode IDs across sessions, which was impossible since episodes have unique UUIDs). If a carefully designed, audited system had 36 significant issues, what's hiding in a weekend GraphRAG prototype that got promoted to production because the demo looked good?
Graph algorithm tuning is non-trivial. PageRank-based decay sounds elegant until you implement it. A binary top-10% cliff (keep the top 10%, decay everything else) causes catastrophic information loss. You need a gradient. Collins & Loftus figured this out in 1975 with spreading activation. Most GraphRAG implementations still haven't caught up to 50-year-old cognitive science.
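Here's the cliff-versus-gradient difference on toy scores (illustrative numbers only; the half-life gradient is one reasonable shape, not the only one):

```python
# Binary top-10% cliff vs. a smooth rank-based gradient over
# PageRank-style importance scores.

def cliff_decay(scores: list, keep_fraction: float = 0.10) -> list:
    """Keep the top fraction at full strength, zero out everything else."""
    cutoff = sorted(scores, reverse=True)[max(1, int(len(scores) * keep_fraction)) - 1]
    return [s if s >= cutoff else 0.0 for s in scores]

def gradient_decay(scores: list, half_life_rank: int = 3) -> list:
    """Halve retained strength every few ranks instead of cutting off."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = [0.0] * len(scores)
    for rank, i in enumerate(ranked):
        out[i] = scores[i] * 0.5 ** (rank / half_life_rank)
    return out

scores = [0.9, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02]
print(sum(1 for s in cliff_decay(scores) if s > 0), "node survives the cliff")
print(sum(1 for s in gradient_decay(scores) if s > 0.01), "retain signal with a gradient")
```

The cliff keeps one node out of ten and discards the rest wholesale; the gradient keeps most of the graph contributing at reduced strength, which is the spreading-activation intuition.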
The Alternative Nobody Wants to Talk About
LightRAG won best paper at EMNLP 2025. It achieves comparable accuracy to full GraphRAG at 0.1% of the cost, using a flat graph structure instead of hierarchical communities. For most teams, that's the answer.
SA-RAG (Pavlovic et al., arXiv 2512.15922) showed that spreading activation on a knowledge graph performs comparably to HippoRAG 2's Personalized PageRank for multi-hop retrieval. No LLM-guided traversal required. KG-Infused RAG (Wu et al.) showed a 39% absolute improvement over naive RAG using spreading activation on pre-existing knowledge graphs.
The pattern: the next generation of graph-based retrieval may not look like GraphRAG at all. Instead of LLM-powered entity extraction and community summarization, it uses graph traversal algorithms that don't need an LLM at every step. Cheaper, faster, and the failure modes are more predictable.
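To show how little machinery the traversal side needs, here's a minimal spreading-activation sketch: my rendering of the general technique, not the SA-RAG or KG-Infused RAG implementation. Seed entities get activation, a decaying signal propagates along edges, and nodes are ranked by accumulated energy:

```python
# Minimal spreading activation over a directed adjacency-list graph.
from collections import defaultdict

def spread(graph: dict, seeds: dict, decay: float = 0.5, hops: int = 3) -> dict:
    """Propagate decaying activation from seed nodes for a fixed hop budget."""
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor in graph.get(node, []):
                next_frontier[neighbor] += energy * decay
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)

# The supply-chain shape from the demo: supplier -> part -> product -> customer.
graph = {
    "supplier_x": ["part_a"],
    "part_a": ["product_p"],
    "product_p": ["customer_emea"],
}
ranked = sorted(spread(graph, {"supplier_x": 1.0}).items(), key=lambda kv: -kv[1])
print(ranked)  # the supplier incident reaches customers, no LLM in the loop
```

Every retrieval is a few dictionary operations. No per-hop LLM call means the latency and cost are bounded by graph size, and the failure modes are ordinary graph bugs instead of generation errors.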
When GraphRAG Actually Makes Sense
I'm not arguing GraphRAG is bad. Multi-hop reasoning across entity relationships is a real capability that vector search genuinely cannot provide. The supply chain demo I saw was solving a problem that no amount of chunk optimization would fix.
GraphRAG makes sense when:
- More than 10-20% of your queries require cross-document reasoning (audit your logs first)
- Your domain is inherently relational (supply chains, legal precedent, compliance)
- You have the engineering team to maintain the graph as a first-class data product, not a demo artifact
- You've already exhausted metadata filters and hybrid search on your vector index
It doesn't make sense when:
- Most queries are single-hop lookups (vector RAG is cheaper and faster)
- Your team can't commit to ongoing graph maintenance
- You're adding it because benchmarks look better without checking if those benchmarks match your query distribution
The honest recommendation: start with vector RAG. Get it working well. Monitor your query logs for multi-hop patterns. If fewer than 10% of questions require cross-document reasoning, the engineering investment isn't justified yet. If you do add a graph, start with a narrow domain (contracts, compliance, product specs) where entity relationships are dense and well-defined. Don't try to graph your entire corpus on day one.
The teams that succeed with GraphRAG in production treat the knowledge graph as a first-class data product, with schema governance, freshness SLAs, and entity resolution monitoring. The teams that fail treat it as something you build once and query forever. Graphs aren't databases. They're living systems that decay the moment you stop maintaining them.