
173 Agents, Zero Owners: Azure's Structural Rot
A former Azure Core engineer just published the most detailed infrastructure failure account since the Knight Capital postmortem. 173 unowned agents, 500K monthly crashes, and a trillion dollars gone.
A former Azure Core engineer just published a five-part account of what went wrong inside Microsoft's cloud platform. Not vague complaints. Names, numbers, dates, architecture diagrams described from memory. The kind of insider disclosure that makes cloud customers recalculate their risk exposure.
I read all five parts. Then I read the 487-comment HN thread. What I found wasn't a story about Microsoft. It was a pattern I've seen at three companies now, playing out at the largest possible scale.
The number that stopped me
173 agents.
That's how many software agents the Azure team identified as candidates for porting to their new Overlake accelerator card. The author, who rejoined Azure Core in May 2023 as a senior member of the Overlake R&D team, says nobody at Microsoft could explain why 173 agents existed, what they all did, or how they interacted with each other.
Azure sells VMs, networking, and storage. Add observability and servicing, and you need maybe a dozen orchestration processes. Not 173. The author puts it plainly: "it takes a serious amount of misunderstanding to get there."
Those 173 agents were collectively hammering the Hyper-V hypervisor through its WMI interface at 10,000 calls per second during peak bursts. The Hyper-V team had no visibility into which agents were responsible. The Azure team couldn't answer either.
This is what architectural bankruptcy looks like. Not a dramatic failure. A slow accumulation of components that nobody owns, nobody understands, and nobody can safely remove.
The crash economy
WireServer, a web server running on the host OS of every Azure node, was crashing 300,000 to 500,000 times per month across the fleet. Test coverage sat below 40%. Monthly releases introduced more new bugs than they fixed. Most rollouts ended in panicked rollbacks.
The author submitted bug fixes using smart pointers to address memory management issues. They were rejected. The reason: fear of breaking something.
I've been on teams where this happens. You know the codebase is rotting, but every fix carries more perceived risk than the bug it addresses. So you stop fixing things. The crash count becomes the new normal. Monthly newsletters report "glowing quality metrics" because nobody claimed ownership of their modules in the crash reporting system, so automated triage couldn't attribute crashes to specific teams.
Read that twice. The quality reports looked great because the crash reporting pipeline was misconfigured.
I initially assumed this was negligence. Some team ignoring their dashboards. But it's worse than that. Nobody had registered as the owner of their modules in the crash system. The bugs weren't being hidden. They were invisible by default. No owner, no attribution, no incident, no fix.
The security problem nobody wanted to hear
WireServer runs on the host OS, the secure side of the machine. It's directly reachable from any guest VM. The author discovered it was maintaining in-memory caches containing unencrypted tenant data from multiple tenants, mixed together in the same memory areas.
The host OS maps the memory pages of every VM running on that node. A compromise of the host gives an attacker access to the complete memory of every VM on the machine. Running a web server on that host, exposed to every guest, with unencrypted multi-tenant data in memory, is exactly the attack surface you'd design if you wanted to maximize blast radius.
The author described WireServer as a "walking security liability." A VP-level security architect (who the author says wrote a well-known threat modeling book) agreed without reservation.
The author was terminated.
Rewrite everything in Rust (and ship nothing)
Azure mandated Rust for all new software. The early prototypes pulled in nearly 1,000 third-party Rust crates, most unvetted.
Of 64 key work items identified to reengineer the VM management stack, none had been completed by end of 2024. Work hadn't started on approximately 60 of them. The list included foundational infrastructure: a key-value store, tracing, logging, observability.
Meanwhile, Microsoft publicly implied at Ignite conferences from 2023 through 2025 that the Boost offload and Rust rewrite were well underway. They weren't.
I've watched this pattern at smaller scale. A team drowning in technical debt announces a rewrite in a new language. Junior engineers get excited. Management gets a story for the roadmap. Nothing ships. The old system keeps crashing. But now you have two codebases to worry about, and the senior engineers who could actually fix the old one have left.
The HN comments nailed it: "Language choices won't save you here. The problem is organizational paralysis."
Where did the adults go?
In 2014, Nadella eliminated the dedicated SDET role across Microsoft. Hundreds of testers were retrained into software engineering positions. Some landed in Azure operations.
By 2023, roughly half the organization responsible for Compute Node Services consisted of junior engineers with one or two years of experience. The Group Engineering Manager's background was in web performance, specifically optimizing CSS for page load times. This group was now responsible for the software running on every Azure node on the planet.
One HN commenter put it simply: "For something like Azure, people are not fungible. You need to retain them for decades."
Dave Cutler, who originally built Azure's Fabric Controller, intended the system to operate without human intervention. By 2024, the reality was 14,209 Just-In-Time access requests in a 74-day period. That's roughly 192 manual interventions per day. On government clouds alone, ProPublica reported "hundreds of interactions" monthly, some involving $18/hour employees copy-pasting commands under direction from staff in foreign countries.
What OpenAI saw
OpenAI was running on Azure under a right-of-first-refusal agreement. In March 2025, they signed an $11.9 billion deal with CoreWeave. Sam Altman's statement: "Advanced AI systems require reliable compute."
If you need a translation: Azure wasn't reliable enough for our workloads.
Six months later, OpenAI expanded the CoreWeave agreement by another $6.5 billion and committed $300 billion to Oracle. Microsoft responded with 15,000 layoffs. The stock dropped 30% from its peak, wiping over a trillion dollars in market value.
OpenAI had visibility into Azure's operational reality that most customers don't. They could see the manual interventions, the crash rates, the deployment rollbacks. And they made the rational decision: diversify away from a platform that was on life support.
The seven-step collapse
I've seen this play out at three companies now. Different scale, same sequence:
- Ship fast under competitive pressure. Cut corners on reliability.
- The team that built it moves on. Knowledge walks out the door.
- Hire junior replacements. They inherit a system they don't understand.
- Bugs accumulate faster than fixes. Testing stays low because everyone's firefighting.
- Someone proposes a rewrite. It never ships.
- Management reports stable metrics because the measurement system is broken.
- A major customer notices the gap between the SLA and reality. They leave.
Azure is step 7. Most organizations get stuck at step 4 or 5 and never get big enough for the consequences to make headlines.
What to watch for in your own stack
If you're running production on any cloud provider, four signals from this story should trigger a risk review.
First: support ticket patterns. If your provider's support engineers are requesting manual access to your infrastructure regularly, the automation they promised isn't working. Cutler designed Azure for zero human touch. Reality: 192 manual interventions daily.
Second: management process count. Ask your cloud provider's architecture team how many agents run on the node hosting your VMs. If they can't answer definitively, you're in 173-agent territory.
Third: rollback rate. Azure's monthly releases were consistently rolling back. Frequent reverts in your provider's changelog mean the testing discipline is weak.
Fourth, and most important: talent continuity. When a platform's original architects leave and the replacements are junior, the system's complexity exceeds the team's ability to manage it. This is the most reliable leading indicator I know of for reliability decay.
The author's most pointed observation: Azure had "millions of crashes each month, the majority unattributed." That's not a technical problem. That's a cultural one. When crashes go unowned, they go unfixed. When they go unfixed long enough, fixing them becomes more dangerous than living with them.
That's the trap. And it's not unique to Microsoft.
Get new posts in your inbox
Architecture, performance, security. No spam.
Keep reading
My AI Reviewer Confabulated a Review of the Wrong Repo
My AI code reviewer wrote a structurally perfect review of code from a completely different repository. The fix wasn't better prompts. It was architectural constraints.
Four Costs of Building Full-Text Search Into SQLite
FTS5 looked like one virtual table and done. Then the dotted tokens broke MATCH parsing, the sync triggers needed manual wiring, the tokenizer rewrote my search model, and every write started touching two tables.
ServiceNow, SAP, and Workday Just Put a Meter on Your AI Agents
Three enterprise SaaS vendors shipped AI agent restrictions in the same two-week window. The governance case is real. The implementation is a trust problem.