Your CI Pipeline Depends on a Model That Ships Breaking Changes Without a Changelog


Opus 4.7 shipped Tuesday. It removed temperature, killed budget_tokens, inflated token counts by up to 35% with a new tokenizer, and shifted how agents spawn subprocesses. My pipeline didn't break. I also didn't test for it. Neither did you.

AI Infrastructure · CI/CD · LLMs · Developer Tooling · Supply Chain · Claude Code
April 20, 2026
8 min read

Opus 4.7 shipped on Tuesday. Four days ago. Here's what it changed: budget_tokens removed (400 error if you try), temperature/top_p/top_k removed (400 error), thinking content omitted from responses by default (no error, just silently gone), and a new tokenizer that inflates token counts by up to 35% for the same text.

I run three Claude-powered agents in production. GhostWriter publishes this blog. Ouija dispatches tasks to coding agents. A code review integration catches regressions before they merge. All three depend on Claude's behavior staying consistent between Tuesday and Wednesday. None of them have a model version check. None of them would have caught a behavioral regression from Opus 4.7 if I hadn't already built an eval harness after a different failure.

I didn't test for Opus 4.7. I learned about the breaking changes from a blog post, not from my CI pipeline. That's what an unversioned LLM dependency looks like in practice.

What Actually Broke

Let me be specific about what Opus 4.7 changed, because the release notes bury the behavioral shifts below the API changes.

The API breaks are straightforward. If your code passes budget_tokens, it throws a 400. If it sets temperature to anything non-default, 400. These are loud failures. You find them immediately.
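One defensive pattern for loud failures like these is a pre-flight scrub that drops parameters the new model rejects before the request ever leaves your process. A minimal sketch, assuming the parameter removals described above; the model ID prefix and the helper itself are hypothetical, not part of any SDK:

```python
# Parameters Opus 4.7 reportedly rejects with a 400 (assumption based on
# the release changes described above).
REMOVED_PARAMS = {"budget_tokens", "temperature", "top_p", "top_k"}

def scrub_request(params: dict, model: str) -> dict:
    """Strip sampling knobs the target model no longer accepts."""
    if model.startswith("claude-opus-4-7"):  # hypothetical model ID prefix
        return {k: v for k, v in params.items() if k not in REMOVED_PARAMS}
    return params
```

This trades a 400 at runtime for a silent parameter drop, which is its own behavioral change; logging what was stripped is the least you'd want in practice.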

The behavioral changes are harder. Anthropic's own documentation describes Opus 4.7 as having "more literal instruction following," "fewer tool calls by default," "fewer subagents spawned by default," and a "more direct, opinionated tone with less validation-forward phrasing." That last one alone could shift the output of every agent prompt in your stack.

"More literal instruction following" sounds like an improvement. It means the model stopped inferring intent from context. If your CLAUDE.md said "keep output clean" and the model used to generalize that to "don't prepend JSON with explanatory text," it might not anymore. That's not a bug. It's a behavior change that no version number captures.

The tokenizer change is the one that worries me most. If your context window compaction fires at a token threshold, the same conversation now hits that threshold 35% sooner. No code changed. No config changed. The math moved. My warm session architecture keeps a persistent Claude process alive between tasks. When the underlying model updates, the session's behavior shifts mid-lifecycle. There's no "model version changed" event in the stream.
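The threshold math is worth making concrete. A sketch with illustrative numbers (the 35% inflation is the figure from the release notes; the window size and threshold are assumptions):

```python
def compaction_due(token_count: int, window: int, threshold: float = 0.8) -> bool:
    """Fire compaction once usage crosses a fraction of the context window."""
    return token_count >= window * threshold

# The same conversation, counted by the old vs. new tokenizer.
old_count = 120_000
new_count = int(old_count * 1.35)  # 162,000 tokens for identical text

# With a 200k window and an 0.8 threshold (160k tokens):
# compaction_due(old_count, 200_000) -> False  (120k < 160k)
# compaction_due(new_count, 200_000) -> True   (162k >= 160k) — no code changed
```

The conversation didn't grow and the config didn't change; only the denominator's meaning moved.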

And today, April 20, Claude 3 Haiku retires. If any production system still references claude-3-haiku-20240307, it stops working. Anthropic gave 60 days' notice. That's more than most teams got from OpenAI.
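Retirements, at least, are scannable ahead of time. A minimal sketch of a pre-deploy check that greps a code or config blob for date-stamped model IDs and flags any that have passed their retirement date; the retirement table here holds only the one date the post mentions, and a real version would be fed from the provider's deprecation page:

```python
import re
from datetime import date

# Illustrative retirement table (assumption: only the date from this post).
RETIREMENTS = {"claude-3-haiku-20240307": date(2026, 4, 20)}

def retired_models_in(source: str, today: date) -> list[str]:
    """Flag retired model IDs referenced anywhere in a code/config blob."""
    hits = re.findall(r"claude-[a-z0-9.-]+-\d{8}", source)
    return [m for m in set(hits) if RETIREMENTS.get(m, date.max) <= today]
```

Wired into CI, this turns a surprise outage into a failing check you see 60 days early.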

The Dependency You Can't Lock

Every other dependency in your stack has version management. npm has lockfiles. Go has checksums. Docker has digests. When Express ships a breaking change, it bumps the major version. Your CI sees the diff. Your lockfile holds the old version until you explicitly upgrade. If the upgrade breaks tests, you know before it hits production.

Models have none of this.

OpenAI silently updated GPT-5.2 on February 10 with a change described as "more measured and grounded in tone." Sounds harmless. It broke JSON extraction by prepending "Here is the JSON:" to structured output responses. It turned if response.strip() == "yes" into a silent failure by responding with "Yes, that is correct." Pinned version? Didn't matter. OpenAI has demonstrably changed behavior behind version-stamped endpoints. Even gpt-4o-2024-08-06, a date-locked "frozen" model, shifted behavior per multiple developer reports.
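Both failure modes above have cheap defenses: parse structured output out of whatever surrounds it, and match intent rather than exact strings. A sketch of both, assuming single-object JSON responses (the greedy brace match below would over-grab if a response contained multiple separate objects):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first JSON object from a response, tolerating preambles
    like 'Here is the JSON:' that a model update may start prepending."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))

def is_yes(text: str) -> bool:
    """Robust replacement for the brittle `response.strip() == "yes"`."""
    return bool(re.match(r"yes\b", text.strip(), re.IGNORECASE))
```

Neither function fixes the underlying problem, which is that the contract moved under you; they just widen the tolerance band.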

Anthropic is more transparent about the moving target. Their deprecation page shows 8 model retirements in 12 months. Each model generation collapses into the next: Opus becomes 4.1 becomes 4.5 becomes 4.6 becomes 4.7. Each jump carries behavioral shifts that no changelog fully describes.

Research tracking GPT-4's behavior found accuracy on certain tasks dropped from 84% to 51% between March and June. A 40% relative decline. Every system metric stayed green. The model was responsive, structured, and confidently wrong. tianpan.co reports that 91% of production LLMs experience measurable behavioral drift within 90 days. Most teams discover it when users complain.

Anthropic themselves built an internal tool called Petri that runs 300,000 automated test queries to track behavioral drift across their own models. They found "thousands of direct contradictions and interpretive ambiguities." If the people who build the model need 300K queries to monitor its consistency, what does that tell the rest of us?

The Compound Problem

Here's what makes this worse than a single dependency breaking: when your code review agent, your test generator, your deployment validator, and your content pipeline all run through the same model, a single update is a supply chain event that hits every stage simultaneously.

I'm not hypothesizing. GetOnStack ran a multi-agent system where behavioral drift caused agents to enter infinite conversation loops. Cost spiraled from $127/week to $47,000 over four weeks. Standard monitoring saw nothing wrong. The agents were running, responding, completing tasks. They were also burning tokens on loops that a previous model version would have terminated.

A healthcare provider migrating from Gemini 1.5 to 2.5 Flash (forced by deprecation) watched their model start offering unsolicited diagnostic opinions. Token usage exploded 5x. JSON parsing broke completely. Total remediation: 400 hours of prompt re-engineering. For a model "upgrade."

Rajasekar Venkatesan dissected this pattern: most production prompts are roughly 40% specification and 60% patches for specific model quirks. "Do not add preambles." "Be concise but thorough." "Cite only when directly referencing a fact." None of these describe what the system should do. They're band-aids for behaviors the previous model exhibited. Swap the model and you've orphaned every patch.

What You'd Need (And What Barely Exists)

The fix isn't complicated in theory. It's the same pattern software engineering solved decades ago: version your dependencies, test before upgrading, gate deployments on passing criteria.

In practice, nobody has this for models.

You'd need behavioral baselines: a golden dataset of representative inputs with expected output properties (not exact matches, since models are non-deterministic, but structural and semantic constraints). You'd need to run that dataset against your production model on a schedule, not just when you deploy. The verification problem isn't just about checking AI output. It's about detecting when the AI changed without telling you.
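A behavioral baseline can be surprisingly small. A minimal sketch: each golden case pairs an input with property checks (structural and semantic constraints), not exact-match outputs; `call_model` is a placeholder for your actual client call, and the example prompt and checks are illustrative:

```python
import json

GOLDEN = [
    {
        "prompt": "Summarize the incident report as JSON.",
        "checks": [
            lambda out: out.lstrip().startswith("{"),  # no preamble
            lambda out: "summary" in json.loads(out),  # required key present
            lambda out: len(out) < 2_000,              # length bound
        ],
    },
]

def _passes(check, out) -> bool:
    """A check that raises (e.g. on unparseable JSON) counts as a failure."""
    try:
        return bool(check(out))
    except Exception:
        return False

def run_baseline(call_model, cases=GOLDEN) -> list[tuple[str, int]]:
    """Return (prompt, number of failed checks) for each golden case."""
    results = []
    for case in cases:
        out = call_model(case["prompt"])
        failed = sum(1 for check in case["checks"] if not _passes(check, out))
        results.append((case["prompt"], failed))
    return results
```

Run on a schedule against the production model, a nonzero failure count is your "the model changed" alarm, independent of any deploy.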

A few teams are building pieces of this. DriftWatch runs standardized prompts against models every 6 hours, comparing outputs to stored baselines via cosine similarity. Braintrust gates PR merges on eval scores. Ragas scores faithfulness and answer relevance for RAG pipelines. Safjan documented a full migration framework with shadow testing and canary rollouts. The LLM harness problem is real: the infrastructure wrapping the model matters as much as the model itself, and right now that infrastructure has no version awareness.

But none of this is standard. None of it ships with the model SDK. When you pip install anthropic, you don't get a behavioral test suite. When Opus 4.7 lands, there's no npx anthropic migrate that runs your prompts against the new model and reports regressions. The benchmarks Anthropic publishes tell you SWE-bench went from 80.8% to 87.6%. They don't tell you whether your specific agent rules still work.

What I Actually Have (And What's Missing)

My pipeline has an eval harness that scores every blog post on five dimensions before it ships: voice match, internal link budget, criticism depth, humanization score, factual grounding. I built this after my agent drifted from its own architectural specifications and then published a post with zero internal links despite explicit CLAUDE.md rules requiring them. The eval harness catches quality regressions, including ones caused by model behavior shifts.

But here's what I don't have: model version awareness. My pipeline doesn't know which model it's running on. It doesn't check whether the model changed since the last run. It doesn't have a baseline comparison between "how this pipeline performed on Opus 4.6" versus "how it performs on Opus 4.7." If the new model spawns fewer subprocesses or follows instructions more literally or produces shorter outputs, the eval harness might catch the downstream effect. Might. But it can't tell me the cause.
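The minimum viable version of that awareness is tiny: persist the model ID each run and compare it to the last one. A sketch under the assumption that your client exposes the resolved model ID per response; the state file is hypothetical:

```python
import json
from pathlib import Path

STATE = Path("last_model.json")  # hypothetical state file for this sketch

def check_model_changed(current_model: str, state_path: Path = STATE) -> bool:
    """Compare this run's model ID against the previous run's, persist the
    new ID, and return True when the model changed underneath you."""
    previous = None
    if state_path.exists():
        previous = json.loads(state_path.read_text()).get("model")
    state_path.write_text(json.dumps({"model": current_model}))
    return previous is not None and previous != current_model
```

It won't catch silent behavior changes behind a stable ID, which is exactly the GPT-4o case above, but it at least turns "the model version changed" into an event your pipeline can see.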

The engineering complexity AI introduced isn't just in building with models. It's in maintaining systems that depend on infrastructure you don't control and can't version-lock.

Model updates should be treated with the same rigor as database migrations. Schema changes go through review. Package upgrades go through CI. Model updates go through nothing. They land when the provider ships them, and you find out the same way you find out about a SaaS outage: something downstream looks wrong and you spend hours tracing it back.

Anthropic did publish a migration guide for Opus 4.7. It covers the API breaks. It mentions the behavioral shifts. It suggests removing scaffolding prompts that compensated for older model weaknesses. That's more than most providers offer. It's still not a lockfile. It's still not semver. And the next behavioral update between point releases won't come with a guide at all.

The model is the most important dependency in your stack. It's also the only one you can't pin, can't diff, and can't roll back. If that doesn't concern you yet, you haven't been running agents long enough.
