
AI Made Microservices More Expensive. Nobody Wants to Admit It.
AI workloads punish the exact things microservices add: network hops, duplicated caches, serialization overhead. The monolith argument isn't about simplicity anymore. It's about money.
I run four AI agents on a single VPS. They share one PostgreSQL database, one Redis instance, one Node.js process. Last month a friend asked when I was going to split this into proper microservices.
I did the math instead.
The Consolidation Nobody's Arguing About Anymore
Amazon Prime Video's monitoring team consolidated its audio/video quality service from distributed microservices into a monolith and cut infrastructure costs by 90%. Segment went from 140 microservices to one. Shopify runs a modular monolith. Stack Overflow handles 6,000 requests per second on architecture old enough to drive a car.
According to a 2026 survey, 42% of organizations that adopted microservices are actively consolidating them back into larger units. That number would've been heresy in 2020.
This isn't new. Engineers have been quietly walking back microservice decisions for years. What IS new is why the conversation flipped. The original arguments against microservices were about complexity, debugging difficulty, operational overhead. Valid, but abstract. Hard to quantify in a budget meeting.
AI made the argument financial.
Three Ways AI Workloads Punish Distribution
Traditional backend requests are cheap. A REST call hits a database, returns JSON, completes in 50-100ms, costs fractions of a cent in compute. You can afford network hops between services because each hop adds maybe 5ms. Cache misses? Just hit the database again. The cost is negligible.
AI workloads break every one of those assumptions.
An LLM inference call takes 2-10 seconds, not milliseconds. A prompt cache miss costs roughly 10x what a cache hit costs. Embedding generation runs about $0.13 per million tokens with text-embedding-3-large. These aren't rounding errors. They're line items on a bill that arrives monthly.
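To make that concrete with invented but realistic numbers: re-embedding a 10,000-document corpus at 500 tokens per document is 5 million tokens, or about $0.65 per pass at that rate. Pay that once per service boundary instead of once total and it stops being a footnote.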
Problem 1: Duplicated computation. Service A needs embeddings for a document. Service B needs the same embeddings for search. In a monolith, you compute once and share the result from one pgvector table. In microservices, you either pay for those embeddings twice, or you build a shared embedding service. Now you have three services for what was a two-service problem, plus a network hop that didn't exist before.
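Sketched in code, the compute-once path looks something like this. It's a minimal sketch, not my actual Engram code: the embeddings table, its unique doc_id index, and the column names are stand-in assumptions.

```typescript
import { Pool } from "pg";
import OpenAI from "openai";

const pool = new Pool();      // connection details from PG* env vars
const openai = new OpenAI();  // API key from OPENAI_API_KEY

// Compute an embedding once, then serve every caller from the same
// pgvector table. Ingest and search share this function, so the same
// document never gets embedded (or billed) twice.
async function getEmbedding(docId: string, text: string): Promise<number[]> {
  const cached = await pool.query(
    "SELECT embedding FROM embeddings WHERE doc_id = $1",
    [docId]
  );
  if (cached.rows.length > 0) {
    // pgvector returns the vector as a '[0.1,0.2,...]' string,
    // which happens to be valid JSON.
    return JSON.parse(cached.rows[0].embedding);
  }

  const res = await openai.embeddings.create({
    model: "text-embedding-3-large",
    input: text,
  });
  const embedding = res.data[0].embedding;

  // pgvector accepts the same bracketed literal on the way in.
  await pool.query(
    `INSERT INTO embeddings (doc_id, embedding)
     VALUES ($1, $2) ON CONFLICT (doc_id) DO NOTHING`,
    [docId, JSON.stringify(embedding)]
  );
  return embedding;
}
```

Split this across two services and that SELECT either hits a duplicated table or crosses the network.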
Problem 2: Cache fragmentation. Centralized LLM gateways with semantic caching show 40-60% cost reductions in production. That only works when every service routes through the same cache. Split your services, split your cache. Each one maintains its own smaller, colder cache with worse hit rates. Every service boundary is a cache boundary. Every cache boundary is a cost boundary.
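Here's the shape of that shared gateway, as an exact-match sketch. A production semantic cache matches on embedding similarity rather than a hash, and callModel is a stand-in, but the economics are identical:

```typescript
import { createHash } from "node:crypto";
import Redis from "ioredis";

const redis = new Redis(); // the one shared instance, all agents

// Exact-match prompt cache. A semantic cache keys on embedding
// similarity instead of a hash, but the cost math is the same:
// one shared cache stays warm; four fragmented caches stay cold.
async function cachedCompletion(
  prompt: string,
  callModel: (p: string) => Promise<string>
): Promise<string> {
  const key = "llm:" + createHash("sha256").update(prompt).digest("hex");

  const hit = await redis.get(key);
  if (hit !== null) return hit; // ~free

  const completion = await callModel(prompt); // the expensive path
  await redis.set(key, completion, "EX", 60 * 60 * 24); // 24h TTL
  return completion;
}
```

Four agents hitting one keyspace keep it warm. Four services with four Redis instances each see a quarter of the traffic and a fraction of the hit rate.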
Problem 3: Serialization of context. When Service A calls Service B with a traditional payload, you're serializing a few hundred bytes of JSON. When an AI-integrated Service A calls Service B with a prompt context, you're serializing 2-5KB of instructions, retrieved documents, and conversation history. Multiply that by every service hop in the chain. The overhead that was invisible at 200 bytes becomes very visible at 5,000.
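You can see the gap with two Buffer.byteLength calls. The payloads below are invented and the chunks are stubs; real retrieved chunks run hundreds of tokens each, which is where the 2-5KB comes from:

```typescript
// A traditional inter-service payload: a couple of IDs and flags.
const restPayload = JSON.stringify({
  userId: "u_123",
  action: "search",
  limit: 20,
});

// An AI-integrated payload: instructions, retrieved chunks, history.
const promptPayload = JSON.stringify({
  system: "You are a support agent. Answer only from the provided context.",
  context: [
    "stub chunk 1 -- a real retrieved chunk is a few hundred tokens",
    "stub chunk 2",
    "stub chunk 3",
  ],
  history: [
    { role: "user", content: "How do I rotate my API key?" },
    { role: "assistant", content: "Settings > API Keys > Rotate." },
  ],
  question: "Does rotating invalidate existing sessions?",
});

console.log(Buffer.byteLength(restPayload));   // ~47 bytes
console.log(Buffer.byteLength(promptPayload)); // ~500 bytes even with stubs
```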
What My Stack Accidentally Got Right
I didn't plan for any of this. My architecture evolved from pragmatism, not foresight. I run Engram (memory retrieval), Ouija (job dispatch), GhostWriter (content pipeline), and mc-agent (monitoring) on a single $7 VPS. PM2 manages the processes. BullMQ handles the job queue. One pgvector table for all embeddings. One Redis instance for all caching.
When a BullMQ worker needs to retrieve context from Engram, it's an in-process function call. No serialization. No network latency. The "retrieve context, then generate response" loop completes without a single network hop between its steps.
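Roughly, the worker looks like this. The queue name is made up, and retrieveContext and generate are simplified stand-ins for the real Engram and model-call entry points:

```typescript
import { Worker } from "bullmq";

// Stand-ins for the real entry points: in the monolith these are plain
// functions in the same process, not services behind HTTP.
async function retrieveContext(topic: string): Promise<string[]> {
  return [`context for ${topic}`]; // Engram's pgvector lookup goes here
}
async function generate(prompt: string, context: string[]): Promise<string> {
  return `response to "${prompt}" using ${context.length} chunks`; // LLM call
}

const worker = new Worker(
  "content-jobs",
  async (job) => {
    // In-process calls: no serialization, no network hop between the
    // "retrieve context" and "generate response" steps.
    const context = await retrieveContext(job.data.topic);
    return generate(job.data.prompt, context);
  },
  { connection: { host: "127.0.0.1", port: 6379 } }
);
```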
If I split this into microservices, that in-process call becomes an HTTP request. The context blob gets serialized, transmitted, deserialized at every boundary. The prompt cache that currently serves four agents from one Redis instance becomes four separate caches, each colder than the original. The token cost multiplies by a factor I don't want to calculate.
When my agent pipeline broke in April, it published a post with zero internal links because the agent skipped the archive scan step. I found the bug in one PM2 log in about ten minutes. In a distributed architecture, that failure mode looks like "Service A didn't call Service B." You need distributed tracing across service boundaries just to identify it, let alone fix it.
The Part Where I Admit This Is Survivorship Bias
I should be honest. My monolith works because I'm one developer on one box with no organizational reason to split anything apart. I haven't had 50 engineers stepping on each other's code. I haven't needed GPU nodes for inference separated from CPU nodes for orchestration. I haven't needed hard isolation between services for GDPR or SOC 2.
Calling my infrastructure decisions an "architecture" is generous. It's pragmatism that happened to be right.
But it was right for a specific reason. AI workloads punish distribution harder than traditional workloads do. Network hops aren't free when the payload is a prompt context. Cache misses aren't cheap when the recomputation is an LLM call. The overhead that was negligible at $0.001 per request becomes real at $0.05.
The Threshold Shifted
The argument was never "monoliths are always better." It was always about thresholds. At what team size, what request volume, what isolation requirement does the operational tax of microservices become worth paying?
AI moved that threshold. The economics changed. If you're building AI features into a system with fewer than a few million daily requests and fewer than 20 engineers, you probably don't need service boundaries for AI workloads. You need a shared cache and a modular codebase inside one deployment.
The industry spent five years splitting everything apart. It'll spend the next two putting some of it back together. AI didn't create the monolith argument. It just made the bill visible.