Yes, the models got dumber
By Iain

In March 2023, GPT-4 could identify prime numbers with 97.6% accuracy. By June, that figure had cratered to 2.4%. Not a rounding error, not a minor regression, but a 95-point collapse on the same task with the same prompts. If a bridge lost 95% of its load-bearing capacity in three months, someone would go to prison. In AI, the vendor posts a changelog and moves on.
This pattern has repeated with depressing regularity across every frontier provider. Models ship to applause and enterprise contracts get signed on the strength of benchmark screenshots, and then something changes. The model you evaluated is no longer the model answering your customers, and nobody tells you until your production workflow starts producing garbage.
The evidence is not anecdotal
Researchers at Stanford and UC Berkeley tracked this drift formally, comparing GPT-3.5 and GPT-4 snapshots from March and June 2023 across seven tasks. The results were bad enough to make the researchers themselves flinch. GPT-4’s ability to generate directly executable code dropped from 52% to 10%. Its willingness to follow chain-of-thought prompting, one of the most widely used techniques for improving accuracy, degraded without explanation.
“The magnitude of the changes in the LLMs’ responses surprised us,” James Zou, a Stanford professor and co-author, told The Register. The team’s conclusion was blunt. The behaviour of the “same” LLM service can shift substantially in weeks, and nobody outside the provider knows when or why.
This wasn’t a one-off result that got debated and forgotten. The OpenAI developer forums have become a rolling graveyard of complaints. In September 2025, users running GPT-4.1 reported severe intelligence degradation within 30 days of launch, with complex tool calls and multi-step instructions suddenly failing. Similar threads appeared for GPT-4 Turbo in May 2025. The pattern never varies: the model works brilliantly at launch, degrades silently, and users scramble to figure out what broke.
Why this happens (and why the incentives encourage it)
There are at least four mechanisms that can degrade a deployed model, and most frontier providers are plausibly using several of them at once.
Quantisation is the most technically straightforward of the four. A model trained in 16-bit or 32-bit floating-point precision gets compressed to 8-bit or 4-bit integers for serving. The arithmetic is simple: a model stored in FP16 needs two bytes per parameter, so a 70-billion-parameter model demands about 140GB of VRAM just for weights. Quantise to 4-bit and you cut that to around 35GB, enough to run on hardware that costs a fraction as much.
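The memory arithmetic is easy to check. The snippet below is a minimal sketch that counts weights only (no activations, KV cache or quantisation metadata) and uses decimal gigabytes; the function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


# A 70-billion-parameter model at three common serving precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real deployments need headroom on top of these figures, so treat them as a floor rather than a sizing guide.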
The trade-off is supposed to be minimal, and Red Hat’s analysis of over 500,000 evaluations found that 8-bit and 4-bit quantised models showed “very competitive accuracy recovery” on most benchmarks, especially for larger models. But that phrase “most benchmarks” is doing heavy lifting. Quantisation works by rounding values onto a coarse grid, and outliers are what that grid handles worst: either they get clipped, or they stretch the grid so far that the small values around them collapse to zero. The weights that matter rarely but enormously, the ones behind edge-case reasoning, are exactly the ones that get flattened first. For standard tasks you barely notice the difference, but for the specific hard problems your production system was built to handle, the gap can be catastrophic. One developer reported that dynamic quantisation of a 3B-parameter model dropped accuracy from 65.6% to 32.3%, a halving that no benchmark average would predict.
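A toy example shows why outliers are the awkward case. This is a deliberately naive symmetric 4-bit scheme written for illustration, not the method any provider or library actually uses (production schemes quantise per group or per channel and treat outliers specially); the weights and clip ranges are invented.

```python
def quantise_int4(weights, clip):
    """Naive symmetric 4-bit quantisation: map [-clip, clip] onto integers -7..7."""
    scale = clip / 7
    restored = []
    for w in weights:
        q = round(w / scale)          # snap to the nearest grid point
        q = max(-7, min(7, q))        # saturate anything outside the clip range
        restored.append(q * scale)    # dequantise back to a float
    return restored


# Four ordinary weights and one large outlier (made-up values).
weights = [0.02, -0.05, 0.3, -0.4, 8.0]

tight = quantise_int4(weights, clip=1.0)  # grid sized for the ordinary weights
wide = quantise_int4(weights, clip=8.0)   # grid stretched to fit the outlier

print("tight:", [round(v, 3) for v in tight])
print("wide: ", [round(v, 3) for v in wide])
```

With the tight range the outlier saturates at the clip value; with the wide range the outlier survives but every ordinary weight rounds to zero. Either way, something is lost that an average over many weights would hide.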
Mixture-of-experts routing is the more interesting culprit, and the one providers talk about least. DeepSeek’s V3, for example, has 671 billion total parameters but only activates about 37 billion per token. The economics are irresistible because you get the capacity of a massive model with the inference cost of a much smaller one. But the router decides which experts handle which queries, and routing decisions are probabilistic. A query that activated your model’s strongest expert subnetwork at launch might get routed differently after an update to the routing logic, or after the provider adjusts load balancing to handle peak traffic. The user sees the same model name in the API response. The actual computation behind it may have changed entirely.
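To make the routing point concrete, here is a toy top-k router. It is a sketch under loud assumptions: real routers are learned networks operating on hidden states, and the scores and the load-balancing bias below are invented numbers, not anything taken from DeepSeek or another provider.

```python
def top_k_experts(scores, k=2):
    """Return the indices of the k highest-scoring experts, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical router scores for one token across four experts.
scores = [2.1, 1.9, 2.0, 0.3]

# Hypothetical per-expert bias, e.g. nudging traffic toward an underused expert.
bias = [0.0, 0.3, 0.0, 0.0]

before = top_k_experts(scores)
after = top_k_experts([s + b for s, b in zip(scores, bias)])

print("before:", before)  # experts 0 and 2
print("after: ", after)   # experts 1 and 0
```

A nudge of 0.3 to one expert changes which pair handles the token, while nothing about the request or the model name has changed.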
Distillation and model substitution is the elephant in the room that everyone suspects but nobody can prove definitively. Rumours have circulated since mid-2023 that OpenAI routes some queries to smaller, cheaper models behind the same API endpoint. The Gleech.org 2025 AI retrospective put it plainly: “True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantisation, low reasoning-token modes, routing to cheap models).” GPT-4.5 was retired after just three months, presumably because the inference costs were unsustainable, even though it still ranked in the top five on LMArena for hallucination reduction nine months later. The model that performed best got killed because it was too expensive to run.
Safety tuning and RLHF adjustments create the subtlest form of drift. When OpenAI tightens content filters or adjusts the model’s tendency to refuse certain queries, those changes ripple through the entire behaviour space. The Stanford study found that GPT-4 became less willing to explain why it refused sensitive questions, switching from detailed explanations to terse “Sorry, I can’t answer that” responses. The model may have become safer by one measure, but it simultaneously became less transparent and less useful for legitimate applications that happened to brush against the updated boundaries.
The economics are doing exactly what you would expect
Running frontier models is staggeringly expensive, and every provider is under pressure to reduce cost-per-token. The maths, as one industry analysis noted, resembles building more fuel-efficient engines and then using the efficiency gains to build monster trucks. Token prices have dropped by a factor of 1,000 in three years, but reasoning models now generate thousands of internal tokens before producing a single visible output, and 99% of demand shifts to the newest model the moment it ships.
Providers respond by doing what any business would do. They optimise for throughput and margin, quantising the weights and routing easy queries to cheaper subnetworks while distilling the flagship into something that passes the benchmarks but costs a tenth as much to serve. The individual techniques are all defensible, but stacked together and applied silently, they create a system where the model’s advertised performance diverges from its delivered performance over time.
DeepSeek made this trade-off explicit and turned it into a business strategy. Its V3 model serves inference at roughly 90% below comparable OpenAI and Anthropic rates, and the MoE architecture that enables this pricing is openly documented. Whatever you think of the approach, at least the engineering trade-offs are visible. The problem is worse when providers make the same trade-offs quietly, behind an API that returns the same model identifier regardless of what actually computed the response.
What this means if you build on top of these models
The practical upshot is unpleasant but straightforward. If your application depends on consistent model behaviour, you are building on sand that shifts without warning. The Stanford researchers recommended continuous monitoring, and they were right, but monitoring alone doesn’t solve the problem, because it tells you something broke without stopping it from breaking.
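Continuous monitoring can start small: a fixed set of prompts with known answers, re-run on a schedule and compared against the score recorded at evaluation time. The sketch below assumes the model is wrapped in a plain function that takes a prompt and returns text; the two stub models, the golden set and the 5-point tolerance are placeholders for illustration, not a recommended configuration.

```python
from typing import Callable, Sequence

Case = tuple[str, str]  # (prompt, expected answer in lower case)


def accuracy(model_fn: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases where the normalised model answer matches the expected one."""
    hits = sum(model_fn(prompt).strip().lower() == expected
               for prompt, expected in cases)
    return hits / len(cases)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when accuracy has fallen more than `tolerance` below the baseline."""
    return baseline - current > tolerance


GOLDEN_SET: list[Case] = [
    ("Is 17 prime? Answer yes or no.", "yes"),
    ("Is 21 prime? Answer yes or no.", "no"),
    ("Is 97 prime? Answer yes or no.", "yes"),
    ("Is 91 prime? Answer yes or no.", "no"),
]


# Stand-ins for real API calls (hypothetical): one answers correctly,
# the other has collapsed to always answering "No".
def launch_model(prompt: str) -> str:
    return dict(GOLDEN_SET)[prompt]


def drifted_model(prompt: str) -> str:
    return "No"


baseline = accuracy(launch_model, GOLDEN_SET)
current = accuracy(drifted_model, GOLDEN_SET)

if has_drifted(baseline, current):
    print(f"Drift detected: accuracy fell from {baseline:.0%} to {current:.0%}")
```

A check like this only raises the alarm. It does nothing to prevent the change, which is the limitation described above.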
Pinning to a specific model snapshot helps, where providers offer it, but even snapshots get deprecated. OpenAI maintains them for a few months and then requires developers to migrate. The careful evaluation you ran against the March snapshot becomes irrelevant when you’re forced onto the June version and nobody can tell you exactly what changed.
The deeper issue is one of trust and transparency. When a model provider updates a live model, they are unilaterally changing the behaviour of every application built on top of it. That is not a software update but an undocumented API change, the kind that would trigger outrage in any other engineering discipline. Imagine AWS silently swapping your database engine for a cheaper one that was “approximately equivalent” on standard benchmarks, and you begin to see how the AI industry has normalised something that would be career-ending negligence anywhere else.
Where this leaves us
The model you benchmarked, the one that earned the contract, that impressed the board, that your engineers spent weeks building prompts and evaluation harnesses around, is a snapshot of a moving target. Quantisation shaves off the edges while routing sends your queries to whichever expert subnetwork happens to be cheapest that millisecond, and safety updates redraw the boundaries of what the model will and won’t do. None of it shows up in the model name string your application receives in the API response.
Somewhere in a data centre, the accountants and the alignment researchers are both pulling the same model in different directions, one toward cheaper inference and the other toward tighter guardrails, and the engineers who built their products on last month’s version are left checking the forums to figure out why everything stopped working on a Tuesday.
(This article was originally published on iain.so.)
