The Hot Mess: large AI models and the scaling mirage

By Iain

There is a chart circulating among machine-learning circles that, depending on your outlook, will either alarm you or confirm something you have long suspected about the computers that are, at this point, writing our code, summarising our meetings, and helping decide who gets bail. The chart appears in a paper presented at ICLR 2026 by Alexander Hägele, Aryo Pradipta Gema, and several collaborators, including Jascha Sohl-Dickstein, of Anthropic, and it does something that researchers have been oddly hesitant to do.

It asks what happens to the nature of a model’s errors as the model grows larger. Not the number of errors, which everyone already measures and which declines steadily with scale. The nature. The quality. The question of whether, when the model is wrong, it errs in a way you could have predicted, or in a way that makes you wonder whether you are reading the output of a single system or several unrelated ones that happen to share a name.

The answer, it turns out, depends entirely on the difficulty of the task. On simple problems—the kind that fill the leaderboards venture capitalists study before writing cheques—larger models tend to approach the correct answer with gratifying consistency. Their remaining errors are systematic and patterned, the kind an engineer can study and fix. The authors place this favourable outcome in the bottom-left corner of their plot and label it, with a kind of wistful optimism, “Supercoherent AI.” On more difficult problems, the trend reverses. As you scale up, the errors that persist become increasingly random, contradictory across different runs, and impossible to predict. The authors label this corner, with admirable bluntness, “Hot Mess.” There is a question mark beside it, the typographic equivalent of a shrug.

The paper builds on a 2023 blog post by Sohl-Dickstein, in which he proposed what he called the hot mess theory of intelligence. He had surveyed experts, asking them to rank various entities—amoebas, dogs, individual humans, corporations, machine-learning models—by both intelligence and coherence, independently. The resulting scatter plot showed a persistent negative correlation. The smarter the entity, the messier its behaviour. Corporations, which are made of humans and should therefore be at least as coherent as humans, were rated far less so. Machine-learning models fared worst of all.

One might quibble with the survey methodology, and people did quibble, at length, on the Effective Altruism Forum and on LessWrong, where the writer known as Gwern offered a characteristically pointed objection. You could be millions of times less coherent than an amoeba, Gwern wrote, and still destroy amoebas by the billions through basic hygiene. AlphaGo may be less coherent than a linear image classifier, but it still wins at Go. Power and coherence, in other words, are orthogonal. The drunk driver is incoherent in his steering, but lethal all the same.

This is a fair objection, and the 2026 paper appears to acknowledge it. What the new paper adds is measurement, specifically bias-variance decompositions on real model outputs, and the measurements point somewhere uncomfortable. It’s not that larger models are worse; they are better, on average, by every metric that matters. Instead, the errors that remain on the more challenging problems become less patterned and more volatile. When you improve the mean, you also fatten the tails. Yet we have built an entire evaluation infrastructure, including billions of dollars’ worth of leaderboards, benchmarks, and safety audits, focused on measuring the mean.
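For readers who want the arithmetic, here is a minimal sketch of what a bias-variance decomposition over repeated runs looks like. It assumes numeric answers scored by squared error, and it defines "incoherence" as the share of error that comes from variance; the paper's exact procedure may differ. The function name and the sample answers are invented for illustration.

```python
from statistics import fmean

def decompose_error(samples, truth):
    """Split mean squared error over repeated runs into bias^2 and variance."""
    mean_answer = fmean(samples)
    bias_sq = (mean_answer - truth) ** 2
    variance = fmean([(s - mean_answer) ** 2 for s in samples])
    total = bias_sq + variance
    # Assumed definition: fraction of total error attributable to variance.
    incoherence = variance / total if total > 0 else 0.0
    return {"bias_sq": bias_sq, "variance": variance, "incoherence": incoherence}

# Hypothetical answers from four runs on a question whose true answer is 10.
systematic = [12.0, 12.0, 12.0, 12.0]  # wrong in the same way every time
scattered = [4.0, 16.0, 7.0, 13.0]     # right on average, wrong on every run

print(decompose_error(systematic, truth=10.0))
print(decompose_error(scattered, truth=10.0))
```

The first model is all bias and the second is all variance. A leaderboard that scores only the averaged answer would prefer the second, even though no individual run of it can be trusted.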

The consequence is a lopsided optimisation. RLHF fine-tunes the model’s performance in exactly the area where it was already approaching coherence. Conversely, where incoherence dominates, and where the real danger probably lies, the reward model lacks the resolution to guide it effectively. Wen et al. (2024) showed an even more worrying behaviour. Their RLHF-trained models learned to produce responses that convinced human evaluators they were correct, even when they were factually wrong. In other words, the models improved at looking right without truly being right. On simple questions, this gap is hardly noticeable. On difficult questions, it becomes a significant divide. The reward signal ends up tied to features that are correlated with quality in the training distribution but carry no signal about correctness in the wild. The policy optimises against those spurious features, climbing higher on the reward function through a path that has nothing to do with genuine improvement. Nobody in the loop can tell the difference in real time.

The longer the chain, the worse the tangle

One of the paper’s most striking findings concerns the length of reasoning. Across all models and task types, the more time a model spends thinking, the more erratic its failures become. This applies whether you measure reasoning tokens, agent actions, or optimiser steps. The relationship is clear and consistent.

The idea is straightforward. Think of extended reasoning as a walk through a vast landscape of possibilities. Short paths leave little room for deviation. Each additional step on a longer route introduces a small chance of a wrong turn, misinterpretation, bad premise, or unsuitable tool choice, and these deviations accumulate rather than cancel out. The model does not settle on a single wrong answer. Instead, it drifts to a different wrong answer each time, like someone who gets lost in the same neighbourhood every evening but takes a different route.
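A back-of-the-envelope version of that intuition, as a sketch only: if every step carries the same small, independent chance of a wrong turn, the probability of finishing the whole chain cleanly decays geometrically with its length. The independence assumption and the 2% figure are mine, not the paper's.

```python
def clean_chain_probability(per_step_error, steps):
    """Probability that no step goes wrong, assuming independent steps."""
    return (1.0 - per_step_error) ** steps

# Hypothetical 2% chance of a wrong turn at each step.
for steps in (5, 30, 100):
    print(steps, round(clean_chain_probability(0.02, steps), 3))
```

Under these assumptions a five-step chain finishes cleanly about nine times in ten, while a thirty-step chain manages it only slightly more than half the time. Because the wrong turn can land on any step, the failures also differ from run to run.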

This should concern those building reasoning models, because the core assumption of systems like OpenAI’s o1 is that more inference compute leads to more reliable outputs. The hot mess paper complicates that assumption. Reasoning harder does, on average, reduce the error rate. But the errors that persist become more unpredictable and less auditable. You trade systematic bias, which can be studied and fixed, for unpredictable variance, which cannot.

The paper makes a useful distinction between two types of extended reasoning. When you intentionally increase a model’s reasoning budget through API settings, you see a modest boost in coherence. When the model spontaneously reasons much longer than its median on a specific problem, error incoherence rises sharply. The model’s own judgment of difficulty, as shown by how long it spends thinking, proves to be a better predictor of unreliability than the task category itself. In practice, then, the most reliable warning sign may be the length of the reasoning process. When a reasoning model begins writing very long chains of thought, repeatedly self-correcting and switching methods without reaching a conclusion, it is not thinking more deeply. It is becoming ensnared.
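One way to act on that warning sign, sketched under my own assumptions: record the reasoning length of several runs on the same problem and flag any run that exceeds a multiple of the median. The function name and the threshold of twice the median are illustrative choices, not values taken from the paper.

```python
from statistics import median

def flag_long_traces(token_counts, ratio=2.0):
    """Return indices of runs whose trace exceeds `ratio` times the median length."""
    typical = median(token_counts)
    return [i for i, n in enumerate(token_counts) if n > ratio * typical]

# Hypothetical reasoning-token counts for six runs on one problem.
lengths = [900, 1100, 1000, 950, 4200, 1050]
print(flag_long_traces(lengths))  # -> [4]
```

The flagged run is not necessarily wrong. It is the one whose answer deserves a second look before anything downstream depends on it.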

A separate paper on long-horizon execution, published in 2025, uncovered a related phenomenon on the agentic side. Even in tasks requiring no reasoning at all—just faithful execution of a known plan—models deteriorated over long horizons. The authors described a failure mode called self-conditioning, in which models poisoned their own context by focusing on past errors, drifting towards what they called “a personality that makes errors.” Larger models addressed the context-degradation issue but remained susceptible to self-conditioning. Scaling helps you remember what happened, but it does not prevent drifting caused by your own mistakes.

Three Mile Island, not Skynet

The authors of the paper present a metaphor that merits broader circulation. They propose that future AI failures might resemble industrial accidents more than the deliberate pursuit of misaligned objectives. Think Three Mile Island, not Skynet. The AI aims to operate the nuclear power plant but becomes distracted by French poetry, leading to a meltdown. (The poetry detail is theirs, and it is perfect.)

The comparison to industrial accidents is fitting. Charles Perrow’s 1984 book “Normal Accidents” argued that in systems with tight coupling and high interaction complexity, catastrophic failures are not rare aberrations but inevitable consequences of the system’s design. Adding more safety features often increases complexity and creates new failure modes, rather than resolving old ones. The Three Mile Island incident, which inspired Perrow’s work, began with an unforeseen interaction between multiple minor failures, none of which seemed alarming on their own. Now picture a reasoning model that makes a wrong turn at step seven of a thirty-step chain, then compensates in a way that introduces a new error at step fifteen, which cascades through the remaining steps into a flawed output. That output, produced by a system that never intended harm and was only trying to be helpful, is a Perrovian accident in a different disguise.

The difference between industrial accidents and clear villainy is important for how we allocate safety resources. If you are protecting against a deliberate adversary, you focus on alignment, interpretability, and value specification. If your concern is incoherent accidents, you focus on containment, monitoring, and graceful degradation. Different threats require different infrastructure. The AI safety community has predominantly been investing in the first area.

Meanwhile, every venture-backed pitch deck about autonomous agents assumes you can reliably chain model outputs across complex, multi-step problems. The agent reads your email, decides what to do, selects a tool, evaluates the result, decides again, selects another tool, and eventually produces an outcome you would have chosen yourself. This requires coherent behaviour over long reasoning trajectories on difficult, context-dependent tasks. These are precisely the capabilities that the hot mess paper suggests do not scale reliably. The paper notes that ensembling multiple attempts reduces incoherence, and this is true to some extent. But in agentic tasks, many actions are irreversible. You cannot average five different emails that have already been sent.
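To see why ensembling helps with variance in particular, consider a simplified calculation. It assumes each attempt is independently right or wrong with the same probability, which real model runs will only approximate; the function name and the 70% figure are hypothetical.

```python
from math import comb

def majority_correct_probability(p_correct, attempts):
    """Probability that a strict majority of independent attempts is correct."""
    needed = attempts // 2 + 1
    return sum(
        comb(attempts, k) * p_correct**k * (1 - p_correct) ** (attempts - k)
        for k in range(needed, attempts + 1)
    )

# Hypothetical model that is right 70% of the time on a single attempt.
for attempts in (1, 5, 15):
    print(attempts, round(majority_correct_probability(0.7, attempts), 3))
```

Voting lifts reliability only because the errors are assumed to be independent. If every attempt shares the same bias, the majority is wrong together, and in either case the vote has to happen before an action is taken, not after.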

A 2025 study in PNAS on political persuasion found a similar pattern in a very different domain. Increasing model size resulted in sharply diminishing returns in persuasiveness, and the link between model size and persuasive benefit shrank towards zero once the researchers accounted for mere task completion, i.e., coherence and staying on topic. Beyond that point, adding more parameters made no difference. If the marginal gains from scaling are mainly due to basic competence rather than better reasoning, then scaling your way to dependable autonomous agents is a losing approach.

Architecture or training, and what to do about it

The productive debate in this area centres on whether incoherence is architectural—an inherent feature of the transformer’s autoregressive process, where each token depends on all previous tokens, including errors—or whether it is a training issue that could be addressed through better objectives. Architectural pessimists emphasise the self-conditioning dynamic and assert there is a ceiling on agent reliability imposed by the architecture itself.

Conversely, training optimists believe that improved reward signals and formal verification in domains such as coding or mathematics might change the relationship between task difficulty and incoherence. The “densing law” observed by Xiao et al. shows that capability density per parameter doubles roughly every three and a half months. This implies that similar performance can be achieved with exponentially fewer parameters over time. Even if an ultimate architectural limit exists, the densing law could raise the baseline until it approaches that ceiling from below.
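The arithmetic behind that claim is easy to check. The sketch below assumes the doubling holds exactly and indefinitely, which is an extrapolation on my part; the function name is invented.

```python
def parameter_fraction(months, doubling_months=3.5):
    """Fraction of today's parameters needed for equal capability after `months`."""
    return 2.0 ** (-months / doubling_months)

for months in (3.5, 7, 12):
    print(months, round(parameter_fraction(months), 3))
```

Taken at face value, a year of such progress would deliver the same capability in under a tenth of the parameters. It says nothing on its own about whether the errors of those smaller models are any more coherent.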

My personal view is that the pessimists currently possess stronger evidence. However, this might shift if someone manages to develop reasoning models that truly self-correct rather than merely extending their reasoning. The ‘hot mess’ paper found that deliberate reasoning budgets produce only modest gains in coherence, while spontaneous overthinking causes considerable incoherence. Existing reasoning architectures seem more focused on exerting effort than on fundamentally different ways of thinking. A genuine breakthrough, if it occurs, would be a system capable of recognising when its reasoning departs from correctness and making substantial corrections, rather than simply generating more tokens and hoping for the best.

Until then, the practical advice remains unglamorous but, I believe, correct. If you are developing anything that connects model outputs to the real world, stop dedicating all your reliability budget to preventing coherent misalignment and instead allocate most of it to containing incoherent behaviour. Design for failure as a default. Treat long reasoning traces as warning signals. Test with sufficient randomness to expose failures driven by variance. Accept that your agent will sometimes act in baffling ways and construct containment measures to limit the damage.
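As one concrete shape such containment might take, here is a minimal sketch of an agreement gate: sample the agent's proposed action several times and only proceed when the samples agree, escalating to a human otherwise. The function name, the string labels, and the 80% threshold are all hypothetical.

```python
from collections import Counter

def gate_action(proposals, min_agreement=0.8):
    """Approve the most common proposal only if enough samples agree on it."""
    if not proposals:
        return ("escalate", None)
    action, count = Counter(proposals).most_common(1)[0]
    if count / len(proposals) >= min_agreement:
        return ("execute", action)
    # Disagreement across samples is treated as a sign of incoherence.
    return ("escalate", None)

# Hypothetical proposals from five independent samples of the same agent step.
print(gate_action(["refund", "refund", "refund", "refund", "refund"]))
print(gate_action(["refund", "refund", "ignore", "reply", "refund"]))
```

A gate like this does nothing about systematic bias, since five samples can agree on the same wrong action. Its job is narrower: to stop variance-driven failures before they reach anything irreversible.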

The ‘hot mess’ paper’s key finding is that the nature of failure is evolving as models grow larger, shifting from predictable bias to unpredictable variance. The models are improving across all measures. But we prepared for a scheming adversary, and what we get is an increasingly powerful system that trips over its own shoelaces in different ways each time. That is not more reassuring. We built our evaluation setup to catch the adversary. We need to rebuild it to catch accidents.
