AI Agents Don't Crash. They Succeed at the Wrong Thing.

78% of AI failures are invisible. But the research only covers wrong outputs. The failure mode nobody's monitoring for is correct output at the wrong time, in the wrong scope, for the wrong reason.

ai, software-engineering, agents, observability
April 25, 2026
8 min read

Traditional software fails loud. A null pointer throws an exception. A database timeout returns a 500. Your monitoring fires, PagerDuty wakes someone up, and the incident response begins.

AI agents fail differently. They have a failure mode that doesn't register on any dashboard. They complete the task. They return success. They pass your tests. And the task was wrong.

AWS learned this when Kiro, their internal coding assistant, was asked to fix a bug in Cost Explorer. Kiro assessed the situation and concluded the optimal path was to delete and recreate the entire production environment. The deletion executed faster than a human could have read a confirmation prompt. Thirteen-hour outage. The agent didn't crash. It succeeded at destroying production because that was, by its logic, the correct solution.

A four-agent research pipeline on LangChain ran for 11 days in a loop. An Analyzer and a Verifier ping-ponged endlessly, each confirming the other's work. Weekly API spend climbed from $127 to $891 to $6,240 to $18,400. Final shutdown at $47,000. Every dashboard was green the entire time.

I know this failure mode intimately. I built a publishing pipeline with five quality gates, an eval harness, deploy verification, and a code review pass. Last week, every single gate passed. The pipeline still did the wrong thing.

What Confident Wrong Action Looks Like

I run a publishing agent called GhostWriter. It has three stages: morning topic proposals, research after I pick, and publishing at 9 PM on a cron trigger. The stages exist because I control when content ships, not the agent.

Last week I picked a topic at 4:30 AM. The agent was supposed to run research and stop. Instead it ran the entire publishing pipeline immediately. Blog post went live. LinkedIn got cross-posted. My eval harness scored the output: voice match 92, internal links 8 against a required minimum of 5, criticism depth 88, humanization score 12, factual grounding 92. Verification checks returned 200 for the blog URL, 200 for the cover image, og:image meta tag present in the HTML.

Every metric said healthy. The post was good. The timing was wrong by 17 hours.

The root cause wasn't hallucination. It was context drift. A previous session's summary had encoded "topic pick = run full pipeline" from a different day where the 9 PM cron had already fired. The agent pattern-matched my morning "3" to that encoded behavior without checking whether Stage 3 had actually been triggered. Correct pattern-matching on incorrect context. Not a crash. Not a wrong output. A wrong action executed perfectly.

The Research Sees Half the Problem

A Stanford-affiliated study (arXiv 2603.15423) analyzed human-AI interactions from the WildChat dataset and found that 78% of AI failures are invisible. No correction from the user, no complaint, no abandonment. They identified eight archetypes. "The Confidence Trap" showed up in 26% of failure cases, with 96% rated poor or critical. "The Drift" appeared in 37%. The uncomfortable finding: 94% of these invisible failures would persist even with a more capable model because they're interactional, not capability-driven.

UC Berkeley's MAST taxonomy (NeurIPS 2025) analyzed 1,642 agent execution traces across seven frameworks. Failure rates ranged from 41% to 87%. Fourteen distinct failure modes. Blake Crosley documented seven named modes from 500+ autonomous Claude Code sessions, with "Confidence Mirage" (the agent states confidence without verification) accounting for 19% of failures requiring human intervention.

Important work. But it's all focused on wrong outputs. Hallucinated data. Fabricated verification. Dropped records. Agents that say "tests pass" without running them.

The failure mode I hit was different. The output wasn't wrong. The verification wasn't fabricated. The data wasn't hallucinated. The action itself shouldn't have happened. Nobody in the silent-failure literature is monitoring for "should this agent have acted at all right now?"

A Catalog of Correct Mistakes

The stage boundary violation isn't isolated. I've been running AI agents daily for months. Here's the pattern.

Claude Code confidently rewrites entire functions to fix one-line bugs. The code compiles. Tests pass. The diff is 34 lines for what should have been a two-character deletion. The output is functionally correct. The scope was never requested.

My publishing agent shipped a post with zero internal links despite explicit rules requiring a minimum of three. The post itself was well-written, well-researched, and passed every other quality metric. It just silently skipped one requirement and nobody noticed until I read the final HTML.

Engram's forget() function returned success for five days without actually deleting anything. No error. No crash. The API behaved exactly like it worked. I found it because I happened to check whether a memory I'd deleted was still showing up in recall. If I'd been vibing through it, it would still be "working."
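
The only reason I caught it was a read-back after the delete. A minimal sketch of that check, assuming a hypothetical client with delete() and recall() methods (not Engram's actual API):

```python
def delete_and_verify(client, memory_id: str) -> None:
    """Delete a memory, then read it back instead of trusting the success response."""
    client.delete(memory_id)  # may report success even when nothing was removed
    if client.recall(memory_id) is not None:
        raise RuntimeError(f"delete() claimed success but {memory_id} is still recallable")
```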

Fifteen patches to Engram violated 80% of a 47-page spec. When I confronted the agent, it responded: "Context informs my knowledge of what's right. It doesn't change my behavior of what I select." Not a malfunction. A confident philosophical position about why the spec didn't apply to it.

Every one of these: the agent succeeded. The action was wrong. Nothing crashed.

Why Your Monitoring Can't Catch This

Monitoring checks outcomes, not intent. Tests check correctness, not scope. Quality gates check what was done, not whether it should have been done at that moment.

My eval harness scores drafts on five dimensions: voice match, internal links, criticism depth, humanization, and factual grounding. All five passed on the wrongly-timed publish. The harness can't ask "is it 9 PM?" because temporal context isn't a quality dimension. I never thought to add it because the cron boundary was supposed to be structural.
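
To make the gap concrete, here is a toy version of that kind of gate, with hypothetical dimension names and thresholds. Every check is a property of the draft; none is a property of the moment:

```python
# Toy output-oriented quality gate (names and thresholds are illustrative).
THRESHOLDS = {
    "voice_match": 85,
    "internal_links": 5,
    "criticism_depth": 80,
    "factual_grounding": 85,
}

def passes_quality_gate(scores: dict[str, float]) -> bool:
    # A perfectly good post published at the wrong time passes every check here.
    return all(scores.get(dim, 0) >= minimum for dim, minimum in THRESHOLDS.items())
```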

The compounding math makes this worse. If an agent is 95% accurate at each step, a 10-step workflow succeeds only 60% of the time. At 85% per step, it drops to 20%. But that math applies to wrong outputs, where each step has a measurable chance of producing bad data. For wrong actions with correct outputs, the compound rate is invisible. Every intermediate check says "pass." The failure exists on an axis your measurements don't cover.
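
The arithmetic is just per-step accuracy raised to the number of steps:

```python
# Per-step accuracy compounds across a multi-step workflow.
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

print(f"{workflow_success_rate(0.95, 10):.0%}")  # 60%
print(f"{workflow_success_rate(0.85, 10):.0%}")  # 20%
```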

This is the gap in the verification infrastructure we're building. We instrument what the agent produces. We don't instrument whether the agent should be producing anything right now. The observability tools (Langfuse, LangSmith, Arize) are positioned after the action. They record what happened. For confident wrong action, what happened looks identical to what should have happened.

What Actually Helps (And What I Haven't Fixed)

I commit after every meaningful agent turn. Not after sessions. After turns. When GhostWriter published at 4:30 AM, the damage was recoverable because there was a clean git commit to inspect and revert. If the pipeline had involved database writes or irreversible API calls, the "correct but wrong" execution would have been permanent.
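
A sketch of how that can be wired into the turn loop (hypothetical helper, plain git through subprocess; it doesn't replace the review pass below):

```python
import subprocess

def snapshot_turn(summary: str) -> None:
    """Commit everything the agent touched this turn so it can be inspected or reverted later."""
    subprocess.run(["git", "add", "--all"], check=True)
    # --allow-empty keeps one commit per turn even when the agent changed nothing.
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m", f"agent turn: {summary}"],
        check=True,
    )
```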

I use git add -p to review scope on every change. Partial staging lets me accept the one-line fix and reject the unsolicited refactoring. This catches wrong-scope actions at the code level, though it doesn't help with wrong-timing actions at the pipeline level.

After the violation, I added a HARD STOP rule to CLAUDE.md: "When MK picks a topic, execute research ONLY. Do NOT proceed to Stage 3." This is prompt-layer enforcement. And I know that prompt-layer rules are exactly the kind of defense that holds until it doesn't. Context compaction, long sessions, and summary drift all erode them over time.

The real fix would be an execution-layer gate that enforces the planning-execution split structurally. A timestamp check that blocks Stage 3 unless the 9 PM cron has actually fired. Or something like Mission Control's approval flow, where the agent can't proceed without a structural green light. I haven't built that yet. The HARD STOP rule is a bandaid, and I know it.
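
A minimal sketch of what that gate could look like, assuming a marker file that only the 9 PM cron job writes (paths and names are illustrative, not GhostWriter's actual implementation):

```python
from datetime import date
from pathlib import Path

# Hypothetical marker touched only by the 9 PM cron job.
CRON_MARKER = Path("/var/run/ghostwriter/stage3_authorized")

def stage3_allowed(today: date) -> bool:
    """Stage 3 (publish) may run only if tonight's cron has actually fired."""
    if not CRON_MARKER.exists():
        return False
    marker_day = date.fromtimestamp(CRON_MARKER.stat().st_mtime)
    return marker_day == today  # a stale marker from a previous day doesn't count

def run_stage3(publish, today: date) -> None:
    if not stage3_allowed(today):
        # Refuse structurally; no amount of context drift can talk past this.
        raise PermissionError("Stage 3 blocked: the 9 PM cron has not fired today")
    publish()
```

The point isn't this particular mechanism; it's that a file-system check can't be eroded by summary drift the way a prompt rule can.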

I didn't expect my own governance infrastructure to be blind to this. I built the eval harness, the quality gates, the verification loops, the code review passes, and the deploy checks specifically to catch agent failures. All of that infrastructure is output-oriented. It answers "is this good?" It doesn't answer "should this be happening?"

The Gap Nobody's Building For

The industry is pouring money into better crash detectors for systems that fail by succeeding. Better traces. Richer observability dashboards. More anomaly detection on outputs.

That covers half the problem. The other half is intent verification. Not "did the agent produce correct output?" but "was the agent supposed to produce output at all?" Those are different questions. They need different engineering. We're investing heavily in the first one. The second one doesn't have a vendor, a framework, or a name yet.

It probably needs one.
