
Multi-Agent Development Is a Distributed Systems Problem. I Learned This the Hard Way.
I built a multi-agent pipeline with BullMQ, hit every distributed systems failure in the book, and learned most tasks don't need multi-agent.
The first time I ran three agents in parallel on Ouija, two of them wrote to the same file. Different content. Neither knew the other existed. The third agent read that file mid-write and got a half-finished version that parsed as valid JSON but contained nonsense.
No crash. No error. Just wrong output that looked right.
I stared at the logs for twenty minutes before I understood what happened. Then I laughed, because I'd spent years reading about distributed systems and somehow convinced myself that multi-agent development was different. It's not. They're processes. They share state. They race. They fail in exactly the ways that decades of distributed systems literature already documented.
I just had to rediscover all of it myself.
What I built and why multi-agent seemed right
Ouija is a pipeline engine that dispatches AI coding agents from a kanban board. TypeScript, Fastify, BullMQ, Postgres. Thirteen packages in a monorepo. 316 tests. The idea: you move a card to "In Progress," an agent clones the repo, writes code, opens a PR. The card moves to "Review." A human approves. The card moves to "Done."
Single-agent Ouija worked fine. One agent, one task, sequential. But I wanted parallelism. Three agents handling three cards simultaneously. Different repos, different tasks, shouldn't conflict. The pitch writes itself.
I also run GhostWriter (the terminal-first publishing pipeline behind this blog) with parallel subagents. An SEO specialist, a docs checker, and a code reviewer all run concurrently via warm NDJSON stdio sessions. Each task takes 3-4 seconds. Three in parallel should take 4 seconds total instead of 12.
Should.
Failure 1: BullMQ's FIFO is a suggestion
Ouija uses BullMQ for job orchestration. Jobs go into a queue, workers pick them up, first in, first out. Except FIFO in BullMQ guarantees pickup order, not completion order. At concurrency greater than 1, multiple workers grab jobs in sequence but finish them whenever they finish. A fast job that started second completes before a slow job that started first.
I discovered this when pipeline step 3 completed before step 2. An agent tried to write to a file that step 2's agent hadn't created yet. No error from BullMQ. No warning. The job ran, produced garbage, and reported success.
The docs spell this out. But "completion order is not guaranteed" isn't something you internalize until you're debugging a pipeline that intermittently produces wrong output and you can't figure out why, because each individual job succeeds.
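The effect is easy to reproduce without BullMQ at all. Here's a minimal deterministic simulation of FIFO pickup at concurrency 2 (job names and durations are invented for illustration): jobs are picked up in queue order, but each finishes at its own pace.

```typescript
type Job = { name: string; durationMs: number };

// Deterministic simulation: workers pick jobs FIFO, each job finishes
// at pickupTime + duration. Returns job names sorted by completion time.
function completionOrder(queue: Job[], concurrency: number): string[] {
  const workerFreeAt: number[] = new Array(concurrency).fill(0);
  const finished: { name: string; doneAt: number }[] = [];
  for (const job of queue) {
    // FIFO pickup: the next job goes to the earliest-free worker.
    const w = workerFreeAt.indexOf(Math.min(...workerFreeAt));
    const doneAt = workerFreeAt[w] + job.durationMs;
    workerFreeAt[w] = doneAt;
    finished.push({ name: job.name, doneAt });
  }
  return finished.sort((a, b) => a.doneAt - b.doneAt).map((f) => f.name);
}

// Pickup order: step-2, then step-3. Completion order flips.
console.log(
  completionOrder(
    [{ name: "step-2", durationMs: 50 }, { name: "step-3", durationMs: 10 }],
    2
  )
);
// → [ 'step-3', 'step-2' ]
```

Pickup order is honored; completion order is whatever the durations dictate. That's the whole bug.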
I switched to BullMQ's FlowProducer, which lets you define parent-child job dependencies. Children run first, parent runs after all children complete. Better. But failParentOnFailure bit me in a way I didn't expect. The docs say failure propagates upward through the job tree. In practice, my monitoring loop was checking parent status too frequently and catching stale state. The parent would show "waiting-children" while a child had already failed, because the propagation hadn't fully resolved yet. I was making decisions based on a status read that was milliseconds ahead of the truth.
I lost a full afternoon to that race condition. The parent looked alive. The children were failed. Downstream jobs had already started because my orchestrator trusted a status poll instead of waiting for a completion event.
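The shape that eventually worked, sketched with BullMQ's FlowProducer and QueueEvents (queue names, step names, and connection details here are illustrative, not Ouija's actual config): declare the dependency tree up front, and react to completion and failure events instead of polling job status.

```typescript
import { FlowProducer, QueueEvents } from "bullmq";

const connection = { host: "localhost", port: 6379 };

async function dispatchPipeline(): Promise<void> {
  const flow = new FlowProducer({ connection });

  // Children run first; failParentOnFailure propagates a child failure
  // up the tree instead of leaving the parent in waiting-children.
  await flow.add({
    name: "pipeline",
    queueName: "orchestrator",
    children: [
      { name: "step-1", queueName: "agents", opts: { failParentOnFailure: true } },
      { name: "step-2", queueName: "agents", opts: { failParentOnFailure: true } },
    ],
  });
}

// React to events instead of polling a status that can be milliseconds
// behind the truth while failure propagation resolves.
const events = new QueueEvents("orchestrator", { connection });
events.on("failed", ({ jobId, failedReason }) => {
  // Abort downstream dispatch here.
  console.error(`job ${jobId} failed: ${failedReason}`);
});
events.on("completed", ({ jobId }) => {
  // Only now is it safe to start downstream work for this job.
  console.log(`job ${jobId} completed`);
});
```

The event listener is the real fix: the orchestrator waits to be told, rather than asking and trusting a stale answer.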
Failure 2: Git worktrees don't isolate your agents from shared state
The filesystem problem was obvious: agents writing to the same files. Git worktrees solved it. Each agent gets its own worktree, its own branch, its own copy of the repo. Filesystem isolation, done.
Except shared state isn't just files.
Two agents querying Postgres for the "current" pipeline state got different answers depending on transaction timing. Agent A reads state, starts work. Agent B reads the same state, starts work. Agent A writes its result back. Agent B writes its result back, overwriting Agent A's changes. Classic lost update. Textbook distributed systems problem that I somehow didn't anticipate because I was thinking about "agents" instead of thinking about "concurrent processes accessing shared mutable state."
The fix was transactional isolation with proper locking. Which is exactly what you'd do in any concurrent system. Nothing agent-specific about it.
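The optimistic flavor of that fix is a compare-and-swap on a version column. Here's a minimal in-memory sketch of the idea (the Map stands in for a Postgres row; none of these names are Ouija's actual schema). The pessimistic alternative is `SELECT ... FOR UPDATE` inside a transaction.

```typescript
type PipelineState = { version: number; data: string };

// In-memory stand-in for a Postgres table keyed by card id.
const store = new Map<string, PipelineState>();

// Compare-and-swap: the write only succeeds if the version the agent
// read is still current. A stale writer must re-read and retry.
function commit(id: string, readVersion: number, data: string): boolean {
  const current = store.get(id);
  if (!current || current.version !== readVersion) return false; // stale read
  store.set(id, { version: readVersion + 1, data });
  return true;
}

store.set("card-7", { version: 1, data: "initial" });

// Agents A and B both read version 1, then both try to write.
const aOk = commit("card-7", 1, "agent A result"); // first writer wins
const bOk = commit("card-7", 1, "agent B result"); // stale: rejected
console.log(aOk, bOk); // → true false
```

Agent B's write is rejected instead of silently clobbering Agent A's. That's the lost update, gone.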
Failure 3: Webhook storms and the idempotency gap
Ouija ingests webhooks from kanban boards. Card moved? Webhook fires. Agent assigned? Webhook fires. Fastify handles the HTTP layer, and here's something the tutorials skip: Fastify has no built-in backpressure for webhooks. If an upstream service retries aggressively (and they do), you eat every retry.
I got duplicate dispatches. Two agents spinning up for the same card because the kanban board retried a webhook that Ouija had already processed. The first request succeeded but the response didn't reach the sender in time. Retry. Second dispatch. Two agents, same task, conflicting PRs.
The fix: accept webhooks immediately with a 202, queue the actual processing via BullMQ, and deduplicate. BullMQ's dedup has two modes: one that blocks duplicates until the original job completes or fails, and a TTL-based mode that blocks for a fixed time window. I started with completion-based dedup, but jobs that failed and got cleaned up would lose their dedup records, reopening the window for duplicates. Another afternoon gone.
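The TTL mode is the one that held up, and the core mechanism fits in a dozen lines. This is a standalone sketch of the idea, not BullMQ's implementation (in Ouija the real dedup is BullMQ's `deduplication` job option; the dedup id format here is invented):

```typescript
// TTL-based deduplication: a dedup id is blocked for a fixed window,
// regardless of whether the original job succeeds, fails, or is cleaned up.
const seen = new Map<string, number>(); // dedup id -> expiry timestamp (ms)

function shouldDispatch(dedupId: string, ttlMs: number, now = Date.now()): boolean {
  const expiry = seen.get(dedupId);
  if (expiry !== undefined && expiry > now) return false; // duplicate inside window
  seen.set(dedupId, now + ttlMs);
  return true;
}

// A retried webhook for the same card within the TTL is dropped.
console.log(shouldDispatch("card-42:moved", 60_000)); // → true
console.log(shouldDispatch("card-42:moved", 60_000)); // → false
```

The point of the fixed window: the dedup record's lifetime is decoupled from the job's lifetime, so a failed-and-cleaned-up job can't reopen the window early.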
Oh, and Fastify's default bodyLimit is 1 MiB. GitHub webhook payloads with large diffs exceed that. The request drops silently. No error response, no log entry. I found this one by accident while debugging something else entirely.
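Both fixes together are a few lines of Fastify setup. A sketch, with an invented route path and a stand-in for the real queue call:

```typescript
import Fastify from "fastify";

// Stand-in for a BullMQ queue.add call carrying a deduplication id.
async function enqueueWebhook(payload: unknown): Promise<void> {
  // queue.add("webhook", payload, { deduplication: { id, ttl } }) in the real thing
}

// Default bodyLimit is 1 MiB (1048576 bytes); large diff payloads need more.
const app = Fastify({ bodyLimit: 10 * 1024 * 1024 });

app.post("/webhooks/kanban", async (request, reply) => {
  await enqueueWebhook(request.body);
  return reply.code(202).send(); // accept immediately, process via the queue
});
```

The 202 tells the sender "received, not yet processed," which is exactly the truth and exactly what stops the retry storm.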
Failure 4: The coordination tax nobody talks about
Here's the number that broke my mental model.
GhostWriter's subagents each take 3-4 seconds via warm NDJSON sessions. Three agents in parallel: 4 seconds. Three agents sequentially: 12 seconds. The parallelism saves 8 seconds.
But coordinating three parallel agents (dispatching, collecting results, handling partial failures when one succeeds and two fail, deciding whether to retry or abort, merging outputs that might conflict) added 6-9 seconds of orchestration overhead. Net savings: somewhere between negative 5 seconds and positive 2 seconds.
For most tasks, sequential was faster.
The warm sessions introduced another problem I didn't expect. After about 15 dispatches, an agent starts hallucinating based on prior task context bleed. The NDJSON protocol keeps the process alive between tasks, which is great for latency. But context accumulates. By dispatch 15, the agent is responding to ghosts from tasks it handled an hour ago. I added session rotation every 10 tasks, which means periodically cold-starting a new process (8-12 seconds), partially negating the warm-session speed advantage.
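The rotation logic itself is trivial; the cost is the cold start it triggers. A sketch of the counter (the actual process spawn/kill is elided; `generation` stands in for "which process is currently serving dispatches"):

```typescript
// Rotate a warm agent session after N dispatches to bound context bleed.
class RotatingSession {
  private dispatches = 0;
  private generation = 0;

  constructor(private readonly maxDispatches = 10) {}

  dispatch(): { generation: number; rotated: boolean } {
    let rotated = false;
    if (this.dispatches >= this.maxDispatches) {
      // In the real pipeline: kill the warm NDJSON process and
      // cold-start a fresh one here (8-12 seconds).
      this.generation += 1;
      this.dispatches = 0;
      rotated = true;
    }
    this.dispatches += 1;
    return { generation: this.generation, rotated };
  }
}

const session = new RotatingSession(10);
for (let i = 0; i < 11; i++) session.dispatch();
// The 11th dispatch lands on generation 1: one rotation, one cold start.
```

Every rotation spends seconds you saved by keeping the session warm, which is how the parallelism math ends up so close to zero.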
The practical ceiling for concurrent code-writing agents is 2-3. Not because of CPU or memory. Because human review is the bottleneck. Five agents producing five branches means five review sessions. The parallelism savings evaporate at the merge step.
Failure 5: Cascading failures from stalled agents
BullMQ detects stalled jobs using a lock mechanism: if an agent doesn't renew its lock within the configured lockDuration, the job gets marked as stalled. maxStalledCount defaults to 1, meaning one stall and the job gets retried.
AI agents are slow. A code-generation task can easily take 60 seconds. I was hitting stalled-job detection on perfectly healthy agents that were just thinking. The retry would fire while the original was still running. Two agents, same task, same problem as the webhook duplication but from a different angle.
The cascading part: when a stalled job gets retried, downstream jobs have already started processing based on the original run's output. The retry produces different output (because the agent doesn't have deterministic output). Now you have two versions of the truth propagating through the pipeline.
I cranked lockDuration to 120 seconds and maxStalledCount to 2. That was a band-aid. The real fix was implementing proper heartbeats from the agent process back to the job lock, which required changes to how agent skills communicate with the orchestrator.
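The heartbeat idea, stripped of BullMQ specifics, is just liveness tracking: a slow agent that keeps beating stays alive, and only silence marks it stalled. A minimal sketch (in the real fix, each beat would extend the job's lock; timestamps here are explicit for clarity):

```typescript
type AgentStatus = "alive" | "stalled";

// Liveness via heartbeats instead of a fixed "must finish by" timeout.
class HeartbeatMonitor {
  private lastBeat: number;

  constructor(private readonly timeoutMs: number, now = Date.now()) {
    this.lastBeat = now;
  }

  // The agent process calls this while "thinking", however long that takes.
  beat(now = Date.now()): void {
    this.lastBeat = now;
  }

  status(now = Date.now()): AgentStatus {
    return now - this.lastBeat > this.timeoutMs ? "stalled" : "alive";
  }
}

const monitor = new HeartbeatMonitor(15_000, 0);
monitor.beat(60_000);                // a 60-second task, still heartbeating
console.log(monitor.status(65_000)); // → alive
console.log(monitor.status(90_000)); // → stalled (silent for 30 s)
```

The distinction this buys you: "slow" and "dead" stop being the same signal, so a healthy agent mid-generation never triggers a duplicate retry.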
The architecture that survived
Everything I described above happened in the first two weeks. What survived:
Ouija's core is a pure transition function. Zero I/O. Give it a current state and a trigger, it returns the next state and a list of side effects. The orchestrator handles the messy parts: loading state from Postgres, running side effects, persisting results. The state machine never touches the network.
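That shape can be sketched in a few lines. The state names, triggers, and effects below are illustrative, not Ouija's actual ones; what matters is the signature: state in, state and effects out, zero I/O.

```typescript
type CardState = "todo" | "in_progress" | "review" | "done";
type Trigger = "card_moved_in_progress" | "pr_opened" | "pr_approved";
type Effect =
  | { kind: "dispatch_agent" }
  | { kind: "notify_reviewer" }
  | { kind: "merge_pr" };

// Pure transition function: same inputs, same outputs, every time.
// The orchestrator interprets the returned effects; this function never does.
function transition(
  state: CardState,
  trigger: Trigger
): { next: CardState; effects: Effect[] } {
  if (state === "todo" && trigger === "card_moved_in_progress")
    return { next: "in_progress", effects: [{ kind: "dispatch_agent" }] };
  if (state === "in_progress" && trigger === "pr_opened")
    return { next: "review", effects: [{ kind: "notify_reviewer" }] };
  if (state === "review" && trigger === "pr_approved")
    return { next: "done", effects: [{ kind: "merge_pr" }] };
  return { next: state, effects: [] }; // invalid trigger: no-op, no side effects
}

console.log(transition("todo", "card_moved_in_progress"));
// → { next: 'in_progress', effects: [ { kind: 'dispatch_agent' } ] }
```

Because the function is pure, replaying a recorded sequence of states and triggers is just calling it in a loop, no agents, no network, no Redis.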
This made every failure debuggable. When agents produced wrong output, I could replay the exact sequence of states and triggers without running any agents. The state logic was correct. The failures were all in the I/O layer, the coordination layer, the timing layer. Exactly where distributed systems failures always live.
The EventBus and JobQueue are separated at the interface level. Events are fire-and-forget notifications. Jobs are work items with completion semantics. Mixing these two concepts (which every early prototype did) created most of the ordering bugs.
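The separation is visible in the types. A sketch of the two contracts (names are illustrative, not Ouija's actual interfaces), plus a minimal in-memory bus to show the fire-and-forget side:

```typescript
// Events: fire-and-forget notifications. No return value, no completion
// tracking; a handler failing is not the emitter's problem.
interface EventBus {
  emit(event: string, payload: unknown): void;
  on(event: string, handler: (payload: unknown) => void): void;
}

// Jobs: work items with completion semantics. Enqueueing returns a
// handle the orchestrator can await; retries and ordering live here.
interface JobQueue {
  enqueue<T>(name: string, payload: unknown): Promise<T>;
}

// Minimal in-memory EventBus demonstrating the fire-and-forget contract.
class InMemoryEventBus implements EventBus {
  private handlers = new Map<string, Array<(payload: unknown) => void>>();

  on(event: string, handler: (payload: unknown) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload: unknown): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}
```

The moment an "event" handler's result decides whether downstream work runs, it was never an event. It was a job, and it belongs in the queue.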
The plugin system (Kanban, Git, Agent, Notification) isolates side effects by domain. When the Kanban plugin breaks, it doesn't take down agent dispatch. When the Git plugin has a timeout, notifications still fire.
The honest take
Most tasks don't need multi-agent.
I know that's an uncomfortable claim when the entire industry is shipping multi-agent frameworks. But a single agent with good tools (MCP servers, structured prompts, proper context management) handles 80% or more of what people reach for multi-agent to solve. You're not getting 5x throughput from five agents. You're getting 2x throughput with 3x complexity. That's a bad trade for most workloads.
Ouija is stalled at Phase 1 after five days. The engine works. The state machine is solid. The 316 tests pass. But the coordination logic, the part that makes agents work together instead of alone, is harder to debug than the agent logic itself. That's the tell. When your coordination layer is more complex than the work being coordinated, you've optimized for the wrong thing.
I'm still not sure Ouija needs to be multi-agent. A single well-orchestrated agent with BullMQ for task sequencing might have been the right call. I built the distributed version because it was intellectually interesting, not because the problem demanded it. That's an honest admission I don't see enough in the multi-agent discourse.
When multi-agent is actually worth it
Genuinely independent tasks with no shared state. That's the sweet spot. Different repos, different databases, no overlapping files. The moment agents need to coordinate on shared resources, you're in distributed systems territory and you'd better know the literature.
If you're evaluating multi-agent, here's my checklist:
- Can agents work on completely isolated resources? If no, expect shared-state bugs.
- Is your job queue configured for the concurrency model you actually need? BullMQ's defaults assume you've read the docs. Most people haven't.
- Do you have idempotency at every boundary? Webhooks retry. Jobs retry. Agents are non-deterministic. Every handler must produce the same result regardless of how many times it runs.
- Is your coordination overhead less than 30% of the parallelism benefit? Measure it. Don't guess. I guessed wrong.
The distributed systems literature is fifty years deep. The FLP impossibility result, the Byzantine Generals problem, the CAP theorem. None of it is new. All of it applies to multi-agent AI systems exactly as written. The only thing that's new is that a generation of developers is discovering it for the first time, with agents instead of servers, and calling it novel.
It's not novel. It's just hard. And pretending otherwise is how you end up staring at logs at 2 AM wondering why two agents wrote to the same file.