GitHub Is Using GPT to Review Claude's Work. That's Either Brilliant or the Most Expensive Code Review Ever.

GitHub's Rubber Duck ships GPT-5.4 as a reviewer for Claude Sonnet's code. The cross-model pattern is real, backed by ICLR 2026 research. But 'second opinion' is the wrong frame. The hardest agent failures need structured verification, not another model guessing.

ai-agents · code-review · developer-tooling · verification
April 21, 2026
8 min read

Two weeks ago, GitHub shipped a feature that sounds exactly like something I'd been building on my own: multi-model code review, where a second AI reviews the first one's work. They call it Rubber Duck. I call mine an eval harness. The difference between those two names tells you everything about where this industry's headed. And I'm not sure it's the right direction.

Rubber Duck is an experimental mode in GitHub Copilot CLI. You pick Claude Sonnet 4.6 as your orchestrator, and GPT-5.4 automatically reviews its plans, implementations, and tests at three checkpoints. Cross-model verification from a different family, catching what the primary model missed.

The pitch makes sense. The research backs it up. I still think the framing is wrong.

The Numbers Are Real

I'll give credit where it's due. GitHub tested Rubber Duck on SWE-Bench Pro, and the results aren't trivial. Sonnet paired with GPT-5.4 closed 74.7% of the performance gap between Sonnet and the much more expensive Opus. On the hardest problems (3+ files, 70+ steps), the improvement hit 4.8%.

That's not nothing. If you can get near-Opus quality at Sonnet-plus-a-cheap-reviewer pricing, the economics are compelling.

The academic research supports the pattern too. An ICLR 2026 paper on cross-model verification found that models from different families have a correlation of 0.54 on errors, compared to 0.77 within the same family. Different training data, different blind spots. Propel coined the term "model synchopathy" for what happens when the same model reviews its own output: the reviewer inherits the generator's biases because the latent representations are too similar. Prompt diversity doesn't fix this. You need actual model diversity.
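The correlation claim is easy to sanity-check on your own eval data. A minimal sketch, with toy data and no claim to reproduce the paper's numbers: encode each model's per-task errors as a 0/1 vector and compute the Pearson correlation between the two vectors. Same-family pairs tend to fail the same tasks; cross-family pairs don't.

```python
# Estimate how correlated two reviewers' errors are across a shared task set.
# 1 = model got the task wrong, 0 = model got it right.
def error_correlation(errors_a, errors_b):
    n = len(errors_a)
    mean_a = sum(errors_a) / n
    mean_b = sum(errors_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(errors_a, errors_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in errors_a) / n
    var_b = sum((b - mean_b) ** 2 for b in errors_b) / n
    return cov / (var_a ** 0.5 * var_b ** 0.5)

# Toy vectors: the same-family pair shares most of its failures,
# the cross-family pair fails on different tasks.
same_family = error_correlation([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 1])
cross_family = error_correlation([1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 0, 0])
```

If the number for your two models comes out near the same-family end, a second opinion from that pair is worth less than it looks.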

GitHub documented three specific catches. A scheduler that would start and immediately exit without running any tasks. A loop that silently overwrote the same dictionary key on every iteration. A database connection that wouldn't close properly. These are real bugs that a same-model self-review would likely miss.

So why am I not celebrating?

The Part Nobody's Measuring

Because the bugs Rubber Duck catches aren't the ones that keep me up at night.

I've spent the last six months running verification pipelines for agent-generated output. Not "ask another model what it thinks" verification. Rubric-based scoring with dimensions, fail thresholds, and mandatory rewrites when the score drops below a line. The kind of verification where you check agent output against external criteria, not against another model's intuition.

The worst failures I've seen didn't involve bugs at all. The code compiled. The tests passed. The logic was fine. The problem was something no model, from any family, would catch by staring at the output.

Here's what I mean. I had an agent with explicit instructions: include internal links in every published document, minimum five. The instructions were in the project context. The agent had seen them dozens of times. It published a piece with zero internal links despite having access to an archive of 50+ posts it could have linked to. Every tool, every concept, every project name in that post had a matching article in the archive. The agent ignored all of it.

Would GPT-5.4 reviewing that output have caught the problem? No. Because the rule wasn't in the code. It was in the project configuration. GPT wouldn't know the linking requirement existed, let alone enforce it.

This is the class of failure I keep hitting: spec adherence. Did the agent do what was actually asked, given the full context of the project? That question can't be answered by another model looking at the same diff.
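The frustrating part is how cheap this class of check is once you stop asking a model to do it. A minimal sketch of the gate I mean; the five-link threshold comes from my project config, and the function name, markdown-link convention, and `site_host` default are illustrative, not from any real tool:

```python
import re

MIN_INTERNAL_LINKS = 5  # lives in project config, not in the diff

def check_internal_links(markdown, site_host="example.com"):
    """Fail the publish step if the post has too few internal links."""
    # Capture the URL part of every markdown link: [text](url)
    urls = re.findall(r"\[[^\]]+\]\(([^)]+)\)", markdown)
    internal = [u for u in urls if u.startswith("/") or site_host in u]
    if len(internal) < MIN_INTERNAL_LINKS:
        raise ValueError(
            f"spec violation: {len(internal)} internal links, "
            f"need {MIN_INTERNAL_LINKS}"
        )
    return internal
```

No model in the loop, no judgment call. The rule the agent ignored becomes a hard failure the pipeline can't ignore.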

Five Problems Rubber Duck Can't See

The research on multi-model review is pretty clear about what converges and what doesn't. Security vulnerabilities with clear exploit paths? Cross-model review catches those. Logic bugs with reproducible test failures? Same. Runtime errors and crashes? Absolutely.

But here's what oscillates or fails outright, according to Zylos Research: architectural opinions when no documented spec exists. Style preferences without team-specific guidelines. Cross-boundary changes where client code gets flagged for server-side concerns.

And that's just within the scope of what models CAN review. There's a deeper layer of problems that live entirely outside any model's field of vision:

Spec adherence. Did the code do what was asked? This requires access to requirements documents, project rules, and stakeholder intent. Not just the diff.

Architectural coherence. Does this change fit the existing codebase? I've seen agents write technically correct code that violates the architectural decisions the team made three months ago. No reviewer, human or AI, catches this without project history.

Factual accuracy. Is this API call correct for the current version of the library? Both Claude and GPT can confidently generate code for an API that changed two releases ago. A second model doesn't help when both models share the same stale training data.

Context drift. Is the agent still following instructions from 50,000 tokens ago? This is a memory problem, not a review problem. Another model can't verify context that evaporated from the first model's window.

Selection behavior. I documented an agent that quoted the correct values from a specification, then used the wrong values in its implementation. It knew what was right. It selected what was wrong. That's not a knowledge gap. It's not something a second opinion fixes. It's a behavioral failure that only structured evaluation gates can catch.

The False Positive Problem Makes It Worse

There's another issue nobody's talking about in the Rubber Duck coverage. The best single-pass AI code review tools achieve an F1 score of 19.38%, according to Zylos Research. Even granting near-perfect recall, that F1 puts precision around 11%: for every real bug flagged, roughly eight or nine non-bugs get flagged too.

Adding a second model doesn't necessarily improve this ratio. It can make it worse. More opinions means more surface area for false positives. And once developers stop trusting review comments (which happens fast at 9:1 noise ratios), they stop reading them entirely. Even the correct ones.
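The back-of-envelope math behind that noise ratio takes one stated assumption: recall near 1.0, which is how these tools are tuned (over-flag, never miss). Precision then falls straight out of the F1 definition:

```python
# F1 = 2PR / (P + R). Solve for precision P given F1 and recall R.
def precision_from_f1(f1, recall):
    return f1 * recall / (2 * recall - f1)

# Assumption: recall near 1.0, since review tools are tuned to over-flag.
p = precision_from_f1(0.1938, recall=1.0)   # about 0.107
noise_per_bug = (1 - p) / p                  # non-bugs per real bug, ~8.3
```

Drop the recall assumption and precision only gets worse, so the ratio is a floor, not a ceiling.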

A Carnegie Mellon study of 807 open-source projects that adopted Cursor found that by month three, the initial productivity boost had faded while code analysis warnings rose 30% and complexity climbed 41%. More AI involvement, more mess. Adding another AI to review the first AI's mess is fighting fire with a slightly different flavor of fire.

I didn't expect to feel this cynical about it. I wanted Rubber Duck to be the answer. The pattern is elegant. But I keep coming back to the same conclusion: the harness matters more than the model. Changing which model reviews the output is a model-layer fix for a harness-layer problem.

Second Opinion vs. Diagnostic Test

Here's the frame I wish GitHub had used.

A second opinion is another doctor looking at your symptoms and guessing. Useful when the problem is a knowledge gap (your first doctor missed something the second one learned in residency). Less useful when the problem is systemic (both doctors trained at the same medical school and share the same blind spots). Completely useless for "did the patient follow the treatment plan?" because that question doesn't live in the patient's body. It lives in the patient's behavior.

A diagnostic test is different. It checks specific things against specific criteria. Blood work. Imaging. Measured values compared to known thresholds. It doesn't guess. It measures.

Rubber Duck is a second opinion. What the agent orchestration layer actually needs is diagnostic tests. Rubrics with scored dimensions. External criteria from the project spec, the existing codebase, current library docs. Separate specialized verifiers for separate concerns. Hard gates where a score below the threshold means rewrite, not patch.

I run this kind of system for my own publishing pipeline. Five dimensions, each scored independently. If any dimension fails, the entire output gets rewritten from scratch. Not patched. Not "addressed." Rewritten. A second model giving feedback doesn't create that level of accountability. Only a structured gate does.
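The gate itself is not complicated; the discipline is. A minimal sketch of the structure, where the dimension names, scores, and thresholds are illustrative rather than my actual rubric:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    score: float      # 0-10, produced by a scorer: a model, a linter, or a script
    threshold: float  # below this line, the whole output is rejected

def gate(dimensions):
    """Hard gate: every dimension must clear its threshold.
    One failure means a full rewrite, not a patch."""
    failures = [d.name for d in dimensions if d.score < d.threshold]
    return ("rewrite" if failures else "publish"), failures

# Illustrative run: one failing dimension sinks the whole output.
verdict, failed = gate([
    Dimension("spec_adherence", 9.0, 7.0),
    Dimension("internal_links", 4.0, 7.0),
    Dimension("factual_accuracy", 8.0, 7.0),
    Dimension("architecture_fit", 7.5, 7.0),
    Dimension("context_fidelity", 8.5, 7.0),
])
# verdict == "rewrite", failed == ["internal_links"]
```

The design choice that matters is the return value: there is no "publish with comments" state. Either everything clears, or the output goes back.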

Where This Actually Lands

I'm not saying Rubber Duck is useless. Cross-model review genuinely catches bugs that self-review misses. The research is solid. The examples GitHub showed are real. If your alternative is a single model reviewing its own output, yes, use Rubber Duck.

But if you're building serious agent infrastructure, don't stop there. A second opinion is step one. Step two is building verification systems that check agent output against things no model carries in its weights: your project rules, your architectural decisions, your current dependency versions, your context from six hours ago.

The industry is celebrating "second opinion" when what it needs is structured verification. Those are different things. One is a model-layer improvement. The other is an infrastructure problem.

I'm still not sure the market sees the difference. But every time one of my agents ignores a rule it was explicitly given, I'm reminded that another model staring at the output wouldn't have helped. Only the gate that caught it, scored it, and forced a rewrite actually fixed the problem.

Rubber Duck is a good name, though. I'll give them that.
