GitHub Had 10 Incidents in April and Your Pipeline Felt Every One

10 degraded-service incidents in one month. Code Search down for 8 hours. Pages serving 17.5 million errors. Copilot at 97.5% failure rate. GitHub stopped being just a code host, and the blast radius proves it.

github · infrastructure · devops · platform-dependency
May 15, 2026
8 min read

GitHub published their April 2026 availability report on May 14. Ten degraded-service incidents in a single month. Code Search went fully down for over two hours, then took another six to recover. Pages served 17.5 million HTTP 500 errors in 39 minutes. The Copilot Coding Agent hit a 97.5% peak error rate across 22,700 workflows.

The same month, Mitchell Hashimoto pulled Ghostty off GitHub. Fifty thousand stars. The creator of Vagrant, Terraform, and Packer. GitHub user #1299 since February 2008. Eighteen years of daily use, and he walked.

Those two facts are connected.

The April Outage by the Numbers

Jakub Oleksy's availability report breaks it down incident by incident. The worst:

April 1: Code Search hit a 100% query failure rate for 2 hours 20 minutes. Root cause was "an automated infrastructure upgrade that applied changes too aggressively to the messaging system." Total recovery took 8 hours 43 minutes. Longest single outage of the month.

April 9: Copilot Coding Agent reached 97.5% peak error rate. 22,700 workflows affected. Four separate Copilot incidents happened that month. Four.

April 13: GitHub Pages generated 17.5 million HTTP 500 errors in 39 minutes. A deployment pipeline update cascaded across the serving layer.

April 23: A DNS configuration change in one datacenter knocked out six services simultaneously. Copilot, Webhooks, Git operations, Actions, Migrations, and Deployments all degraded at once. 7% of Copilot AI requests failed. 2.07% of Git operations errored. Root cause: "a recently introduced traffic-balancing mechanism" that caused DNS resolvers to fail.

April 27: A scraping attack from 600,000 unique IP addresses consumed 30% of daily search traffic in four hours. 65% of legitimate searches started timing out. The cascade spread to Issues, Pull Requests, Projects, Actions, the Package Registry, and Dependabot Alerts.

Four of the ten incidents were Copilot-related. Two were search-related. Two involved DNS. The longest multi-phase incident stretched to roughly 19 hours across Code Scanning and Projects on April 20-21.

Every number above comes from GitHub's own post-incident analysis. This isn't inferred from third-party monitors or Twitter complaints. When the platform vendor publishes ten incident writeups in a single monthly report, the question stops being whether the platform is reliable. It becomes how fast that reliability is eroding.

One Platform, Twelve Failure Points

GitHub isn't a code host anymore. Count the surfaces: Git hosting, CI/CD via Actions, package registry, code search, security scanning through Dependabot and CodeQL, release distribution, issue tracking through Issues and Projects, documentation via Pages and Wikis, identity and OAuth, and four AI surfaces in Copilot, Copilot Chat, Copilot Coding Agent, and Copilot Memory.

That's a dozen services under one domain. When the infrastructure underneath them fails, the blast radius doesn't respect service boundaries. April 23 showed this concretely. One misconfigured traffic-balancing rule in one datacenter. In a platform with isolated services, that's a localized DNS problem. In GitHub's architecture, it simultaneously degraded Copilot completions, webhook deliveries, Git operations, Actions workflows, repository migrations, and deployment status updates. Not because those services are related to each other. Because they share infrastructure that was never designed to carry all of them at once.

Juan Torchia counted six of those surfaces in his own stack after Hashimoto left. He put it plainly: "Six surfaces. Six failure points concentrated in a single vendor." A single git push origin main in his workflow triggers Actions, Pages, Releases, and the issue tracker simultaneously.

Andrew Nesbitt's analysis went deeper. 91% of PyPI packages reference GitHub Actions by mutable tag. Two-thirds of workflows have no permissions: block at all. And OIDC trusted publishing now routes through Actions for PyPI, npm, RubyGems, and crates.io. "We've spent a decade hardening package managers with lockfiles, 2FA mandates, signatures, audit logs and provenance attestations," Nesbitt wrote, "and the net effect of wiring all of that to OIDC has been to take trust we used to spread across thousands of individual maintainer credentials and concentrate it on one CI platform."
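Both of Nesbitt's findings map to two lines of workflow hygiene: pin actions by full commit SHA instead of a mutable tag, and declare an explicit least-privilege permissions block. A minimal sketch, where the SHA is an illustrative placeholder, not a real pin for actions/checkout:

```yaml
name: ci
on: [push]

permissions: {}            # default-deny; grant scopes per job

jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: read       # the only scope this job needs
    steps:
      # Pin to a full commit SHA, not a mutable tag like @v4.
      # (The SHA below is an illustrative placeholder.)
      - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567
      - run: make test
```

A mutable tag can be repointed by whoever controls the action; a SHA can't. The top-level `permissions: {}` means a compromised step in any job gets only what that job explicitly asked for.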

Vito Sartori compiled a GitHub uptime chart spanning April 2016 through January 2026. Before the Microsoft acquisition: effectively four nines. After: what he calls a "noisy descent that has never recovered." Worst months dipped below 99.6%. His explanation is structural. GitHub acquired Travis CI's function via Actions, absorbed npm into the package registry, bought Dependabot, bought Semmle for CodeQL. Each acquisition collapsed an independent failure domain into the same platform. When those tools worked independently, an Actions outage didn't take down your package registry. Now it can.

Vlad Fedorov, writing for GitHub's own engineering blog in March, confirmed the pattern: "Extremely rapid usage growth exposing scaling limitations in parts of our current architecture."

I run GitHub Actions for Ouija (1,195 tests), Engram (460 tests), and this portfolio site. My Dependabot CI has been stuck for twelve days straight. Zero checks running on any PR because a settings issue treats Dependabot[bot] as an outside collaborator. I ended up merging PRs without CI verification because the queue was untenable. Different scale than a ten-incident month. Same structural problem: when the platform handles everything, a failure in one surface blocks workflows that have nothing to do with it.

The Cost Nobody Tallies

Stack Overflow's 2025 survey puts GitHub at 81% adoption as a collaboration tool. GitLab sits at 36%. That 81% isn't market share. It's infrastructure dependency. More than four out of five developers have GitHub in their critical path.

Torchia's framing sticks: "That's not convenience. That's systemic dependency dressed up as convenience."

The supply chain angle compounds it. When Actions serves as both the build system and the OIDC trusted publisher for package registries, an Actions compromise isn't just a CI outage. It's a supply chain attack surface. The same platform where malicious actors inject payloads through GitHub Issues now acts as the trusted third party for every OIDC-signed package on npm and PyPI. Nesbitt's data exposes the scale: 91% of PyPI packages depend on a platform that had ten incidents in thirty days.

David Bushell put it bluntly: "Your CI pipeline is over-engineered and GitHub Actions are an abomination. Finding another solution is an absolute chore but do you trust GitHub to be reliable?"

The compounding problem is invisible. When Actions goes down, teams don't stop shipping. They merge without checks. When Dependabot Alerts can't reach the API, your vulnerability scanner goes quiet and nobody notices until a CVE lands in production three weeks later. The downstream quality cost of platform unreliability never shows up in an incident report. It shows up in the bugs you didn't catch.
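Silent degradation is at least detectable. GitHub's public status page is an Atlassian Statuspage instance, and the standard Statuspage v2 API exposes per-component health as JSON. A minimal sketch of a watcher that flags anything not reporting operational; the alerting hook is left out, and scheduling is assumed to come from cron or similar:

```python
import json
from urllib.request import urlopen

# GitHub's public Statuspage (standard Atlassian Statuspage v2 API).
STATUS_URL = "https://www.githubstatus.com/api/v2/components.json"

def degraded_components(payload):
    """Return (name, status) pairs for every component not reporting 'operational'."""
    return [(c["name"], c["status"])
            for c in payload.get("components", [])
            if c.get("status") != "operational"]

def check_github(url=STATUS_URL):
    # Network call; a real watcher would run this on a schedule
    # and page on any non-empty result.
    with urlopen(url, timeout=10) as resp:
        return degraded_components(json.load(resp))
```

Polling a vendor's own status page won't catch everything (status pages lag), but it turns "nobody notices for three weeks" into a same-hour signal for the components you depend on.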

Hashimoto leaving after eighteen years isn't a protest. It's a signal that the reliability contract changed. When user #1299 walks away from the platform he helped build, the question stops being "is GitHub getting worse" and becomes "what's your fallback when it gets worse at the wrong time."

When It Still Makes Sense

GitHub is still the rational default for most teams. The 81% isn't irrational. AWS, Cloudflare, and Vercel all publish Actions. The marketplace is unmatched. For a team shipping a web app that doesn't touch regulated infrastructure, GitHub does enough things at a good enough reliability level.

Alternatives have real costs. GitLab self-hosted needs sysadmin work most startups won't sign up for. Codeberg is community-led and fast but was under DDoS itself recently. Self-hosted Forgejo runs clean, but you're trading one dependency for a maintenance commitment.

Full migration is almost never the right answer.

Partial diversification is. Torchia's list: mirror critical repos to a second host, run self-hosted runners for critical CI pipelines so an Actions outage doesn't block deploys, keep docs in a portable format instead of GitHub Wiki, and don't adopt marketplace Actions that have no equivalent anywhere else. If your infrastructure routes through a single vendor, at minimum write down what you'd do when that vendor has a bad day.
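The first item on that list, mirroring, is an afternoon of scripting, not a migration. A minimal sketch driving plain git via subprocess; the workdir and URLs are placeholders, and mirror_url would point at Codeberg, GitLab, or an internal bare repo:

```python
import pathlib
import subprocess

def mirror_repo(source_url, mirror_url, workdir="mirrors"):
    """Maintain a local bare mirror of source_url and push every ref to mirror_url."""
    name = source_url.rstrip("/").split("/")[-1]
    if not name.endswith(".git"):
        name += ".git"
    path = pathlib.Path(workdir) / name
    if path.exists():
        # Refresh all refs, pruning branches deleted upstream.
        subprocess.run(["git", "-C", str(path), "remote", "update", "--prune"],
                       check=True)
    else:
        path.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "clone", "--mirror", source_url, str(path)],
                       check=True)
    # --mirror pushes branches, tags, and deletions to the second host.
    subprocess.run(["git", "-C", str(path), "push", "--mirror", mirror_url],
                   check=True)
```

Run it from a cron job or CI schedule on the second host, once per critical repo. The day GitHub has a bad morning, the mirror is at most one interval stale.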

Running CI, packages, security, docs, and identity through a platform that had ten incidents in April, with zero diversification, is a bet. Make it deliberately. Not by accident.

GitHub's problem isn't that it had ten incidents in April. Every platform at scale has bad months. The problem is that each incident cascaded across services that should have had independent failure boundaries. The cost of consolidation is a blast radius that grows every time they ship a new surface. They added four Copilot surfaces this year alone.

The question isn't whether to leave GitHub. It's whether you've mapped what breaks in your stack the next time it goes down.
