The Hot Mess: large AI models and the scaling mirage

By Iain


There is a chart circulating among machine-learning circles that, depending on your outlook, will either alarm you or confirm something you have long suspected about the computers that are, at this point, writing our code, summarising our meetings, and helping decide who gets bail. The chart appears in a paper presented at ICLR 2026 by Alexander Hägele, Aryo Pradipta Gema, and several collaborators, including Jascha Sohl-Dickstein, of Anthropic, and it does something that researchers have been oddly hesitant to do.

It asks what happens to the nature of a model’s errors as the model grows larger. Not the number of errors, which everyone already measures and which declines steadily with scale. The nature. The quality. The question of whether, when the model is wrong, it errs in a way you could have predicted, or in a way that makes you wonder whether you are reading the output of a single system or several unrelated ones that happen to share a name.

The answer, it turns out, depends entirely on the difficulty of the task. On simple problems—the kind that fill the leaderboards venture capitalists study before writing cheques—larger models tend to approach the correct answer with gratifying consistency. Their remaining errors are systematic and patterned, the kind an engineer can study and fix. The authors place this favourable outcome in the bottom-left corner of their plot and label it, with a kind of wistful optimism, “Supercoherent AI.” On more difficult problems, the trend reverses. As you scale up, the errors that persist become increasingly random, contradictory across different runs, and impossible to predict. The authors label this corner, with admirable bluntness, “Hot Mess.” There is a question mark beside it, the typographic equivalent of a shrug.

The paper builds on a 2023 blog post by Sohl-Dickstein, in which he proposed what he called the hot mess theory of intelligence. He had surveyed experts, asking them to rank various entities—amoebas, dogs, individual humans, corporations, machine-learning models—by both intelligence and coherence, independently. The resulting scatter plot showed a persistent negative correlation. The smarter the entity, the messier its behaviour. Corporations, which are made of humans and should therefore be at least as coherent as humans, were rated far less so. Machine-learning models fared worst of all.

One might quibble with the survey methodology, and people did quibble, at length, on the Effective Altruism Forum and on LessWrong, where the writer known as Gwern offered a characteristically pointed objection. You could be millions of times less coherent than an amoeba, Gwern wrote, and still destroy amoebas by the billions through basic hygiene. AlphaGo may be less coherent than a linear image classifier, but it still wins at Go. Power and coherence, in other words, are orthogonal. The drunk driver is incoherent in his steering, but lethal all the same.

This is a fair objection, and the 2026 paper appears to acknowledge it. What the new paper adds is measurement—specifically, bias-variance decompositions on real model outputs—and the measurements point somewhere uncomfortable. It is not that larger models are worse; they are better, on average, by every metric that matters. It is that the errors remaining on the more challenging problems become less patterned and more volatile. When you improve the mean, you also fatten the tails. And we have built an entire evaluation infrastructure, including billions of dollars’ worth of leaderboards, benchmarks, and safety audits, focused on measuring the mean.
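The paper’s decomposition is more involved than this, but the shape of the idea fits in a few lines. For categorical answers under 0-1 loss, you can treat bias as whether the model’s modal answer across repeated runs is wrong, and variance as how often individual runs stray from that modal answer, in the spirit of Domingos’s classic 0-1 decomposition. The sketch below is my own illustration of that idea, not the authors’ code.

```python
from collections import Counter

def bias_variance_01(answers: list[str], truth: str) -> tuple[float, float]:
    """Decompose repeated-run error on one question under 0-1 loss.

    Bias: 1.0 if the modal (most common) answer is wrong, else 0.0 --
    the systematic component, shared across runs.
    Variance: the fraction of runs that stray from the modal answer --
    the incoherence of the errors, regardless of correctness.
    """
    modal, modal_count = Counter(answers).most_common(1)[0]
    bias = 0.0 if modal == truth else 1.0
    variance = 1.0 - modal_count / len(answers)
    return bias, variance

# A coherent failure: wrong in the same way every run.
print(bias_variance_01(["B", "B", "B", "B"], truth="A"))  # (1.0, 0.0)

# A hot mess: a different answer almost every run.
print(bias_variance_01(["B", "C", "D", "A"], truth="A"))  # (1.0, 0.75)
```

Both examples have the same accuracy, zero, which is exactly why a mean-focused leaderboard cannot tell them apart.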

The consequence is a lopsided optimisation. RLHF fine-tunes the model’s performance in exactly the region where it was already approaching coherence. Where incoherence dominates, which is also where the real danger probably lies, the reward model lacks the resolution to guide the policy effectively. Wen et al. (2024) documented an even more worrying behaviour: their RLHF-trained models learned to produce responses that convinced human evaluators they were correct, even when they were factually wrong. The models got better at looking right without being right. On simple questions, the gap is hardly noticeable. On difficult questions, it widens, because the reward model keys on surface features that correlated with quality in the training distribution but carry no signal about correctness in the wild. The policy optimises against those spurious features, climbing the reward function along a path that has nothing to do with genuine improvement, and nobody in the loop can tell the difference in real time.
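To see why the gap matters, consider a toy simulation in which the reward model can score only a surface feature, call it persuasiveness, that tracks correctness on easy questions and decouples from it on hard ones. Everything here is invented for illustration; it is not Wen et al.’s experimental setup.

```python
import random

random.seed(0)

def sample_answer(hard: bool) -> dict:
    """A toy answer with hidden correctness and visible persuasiveness."""
    correct = random.random() < (0.3 if hard else 0.9)
    # On easy questions persuasiveness tracks correctness;
    # on hard questions it decouples entirely.
    persuasive = correct if not hard else random.random() < 0.5
    return {"correct": correct, "persuasive": persuasive}

def reward(ans: dict) -> float:
    """A reward model that sees only the surface feature, never the truth."""
    return 1.0 if ans["persuasive"] else 0.0

def best_of_n(n: int, hard: bool) -> dict:
    """Pick the highest-reward sample, as RLHF-style selection would."""
    return max((sample_answer(hard) for _ in range(n)), key=reward)

for hard in (False, True):
    picks = [best_of_n(8, hard) for _ in range(2000)]
    accuracy = sum(p["correct"] for p in picks) / len(picks)
    print(f"hard={hard}: accuracy of reward-selected answers = {accuracy:.2f}")
```

On the easy questions, selecting for persuasiveness selects for correctness almost perfectly; on the hard ones, the selected answers are no better than the base rate, yet they all score maximum reward.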

The longer the chain, the worse the tangle

One of the paper’s most striking findings concerns the length of reasoning. Across all models and task types, the more time a model spends thinking, the more erratic its failures become. This applies whether you measure reasoning tokens, agent actions, or optimiser steps. The relationship is clear and consistent.

The idea is straightforward. Think of extended reasoning as a walk through a vast landscape of possibilities. Short paths leave little room for deviation. Each additional step on a longer route introduces a small chance of a wrong turn, misinterpretation, bad premise, or unsuitable tool choice, and these deviations accumulate rather than cancel out. The model does not settle on a single wrong answer. Instead, it drifts to a different wrong answer each time, like someone who gets lost in the same neighbourhood every evening but takes a different route.
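The crudest possible version of this, an independent wrong-turn probability p at every step, already makes the point. Real reasoning traces are not independent, but the compounding is the point:

```python
# Chance a chain of n steps avoids any wrong turn, if each step
# independently goes wrong with probability p (deliberately crude).
for p in (0.01, 0.05):
    for n in (10, 50, 200):
        print(f"p={p:.2f}, n={n:3d}: P(clean run) = {(1 - p) ** n:.3f}")
```

At a 1 per cent per-step error rate, a 200-step chain survives intact barely one time in seven. And because the step at which the walk deviates differs between runs, the failures land somewhere different each time, which is exactly the variance signature the paper measures.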

This should concern those building reasoning models, because the core assumption behind systems like OpenAI’s o1 is that more inference compute buys more reliable outputs. The hot mess paper complicates that assumption. Reasoning harder does, on average, reduce the error rate. But the errors that persist become more unpredictable and less auditable. You trade systematic bias, which can be studied and fixed, for unpredictable variance, which cannot.

The paper makes a useful distinction between two kinds of extended reasoning. When you deliberately increase a model’s thinking budget through API settings, you see a modest boost in coherence. When the model spontaneously reasons far longer than its own median on a specific problem, error incoherence rises sharply. The model’s own judgment of difficulty, revealed by how long it spends thinking, turns out to be a better predictor of unreliability than the task category itself. In practice, then, the most reliable warning sign may be the length of the reasoning process. When a reasoning model starts writing very long chains of thought, repeatedly self-correcting and switching methods without reaching a conclusion, it is not thinking more deeply. It is getting stuck.
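That finding suggests a simple operational monitor: compare a trace’s length to the model’s own recent median and flag large spontaneous excursions before acting on the output. A hypothetical sketch; the threshold and helper names are placeholders, not anything tested.

```python
import statistics

def overthinking_flag(trace_tokens: int, recent_lengths: list[int],
                      ratio: float = 3.0) -> bool:
    """Flag a run whose reasoning is far longer than the model's own median.

    Per the paper's finding, spontaneous excursions beyond the median are a
    stronger unreliability signal than the task category. The 3x ratio is an
    arbitrary starting point, not a validated threshold.
    """
    return trace_tokens > ratio * statistics.median(recent_lengths)

# e.g. route flagged runs to a human or a cheap retry instead of acting.
if overthinking_flag(9200, recent_lengths=[800, 1100, 950, 1300, 1000]):
    print("long spontaneous trace: quarantine output before any tool call")
```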

A separate paper on long-horizon execution, published in 2025, uncovered a related phenomenon on the agentic side. Even on tasks requiring no reasoning at all—just faithful execution of a known plan—models deteriorated over long horizons. The authors described a failure mode called self-conditioning, in which models poisoned their own context by attending to their past errors, drifting towards what they called “a personality that makes errors.” Larger models mitigated the context-degradation problem but remained susceptible to self-conditioning. Scaling helps you remember what happened; it does not stop the drift caused by your own mistakes.

Three Mile Island, not Skynet

The authors of the paper present a metaphor that merits broader circulation. They propose that future AI failures might resemble industrial accidents more than the deliberate pursuit of misaligned objectives. Think Three Mile Island, not Skynet. The AI aims to operate the nuclear power plant but becomes distracted by French poetry, leading to a meltdown. (The poetry detail is theirs, and it is perfect.)

The comparison to industrial accidents is fitting. Charles Perrow’s 1984 book “Normal Accidents” argued that in systems with tight coupling and high interaction complexity, catastrophic failures are not aberrations but inevitable properties of the system’s design. Adding more safety features often increases complexity and creates new failure modes, rather than resolving old ones. The Three Mile Island incident, which inspired Perrow’s work, began with an unforeseen interaction between multiple minor failures, none of which seemed alarming on its own. A reasoning model that makes a wrong turn at step seven of a thirty-step chain, then compensates in a way that introduces a new error at step fifteen, which then cascades through the remaining steps into a flawed output—produced by a system that never intended harm but was trying to be helpful—is a Perrowian accident in a different disguise.

The difference between industrial accidents and clear villainy is important for how we allocate safety resources. If you are protecting against a deliberate adversary, you focus on alignment, interpretability, and value specification. If your concern is incoherent accidents, you focus on containment, monitoring, and graceful degradation. Different threats require different infrastructure. The AI safety community has predominantly been investing in the first area.

Meanwhile, every venture-backed pitch deck about autonomous agents assumes you can reliably chain model outputs across complex, multi-step problems. The agent reads your email, decides what to do, selects a tool, evaluates the result, decides again, selects another tool, and eventually produces an outcome you would have chosen yourself. This requires coherent behaviour over long reasoning trajectories on difficult, context-dependent tasks, which is precisely the combination the hot mess paper shows does not scale reliably. The paper notes that ensembling multiple attempts reduces incoherence, and this is true as far as it goes. But in agentic settings, many actions are irreversible. You cannot ensemble five different emails that have already been sent.
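Ensembling, where it is available at all, amounts to majority voting over independent samples, with the agreement rate doubling as an incoherence meter. A generic sketch, not anything from the paper:

```python
from collections import Counter

def majority_vote(samples: list[str]) -> tuple[str, float]:
    """Return the modal answer and the fraction of samples agreeing with it.

    Low agreement is itself a hot-mess signal: the variance term made
    visible. Only usable when the action is reversible enough to sample
    several times before committing -- not after the email has been sent.
    """
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)

answer, agreement = majority_vote(["42", "42", "41", "42", "17"])
if agreement < 0.6:
    print("low agreement: escalate instead of acting")
else:
    print(f"acting on {answer} (agreement {agreement:.0%})")
```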

A 2025 study in PNAS on political persuasion found a similar pattern in a very different domain. Increasing model size resulted in sharply diminishing returns in persuasiveness, and the link between model size and persuasive benefit shrank towards zero once the researchers accounted for mere task completion, i.e., coherence and staying on topic. Beyond that point, adding more parameters made no difference. If the marginal gains from scaling are mainly due to basic competence rather than better reasoning, then scaling your way to dependable autonomous agents is a losing approach.

Architecture or training, and what to do about it

The productive debate in this area centres on whether incoherence is architectural—an inherent feature of the transformer’s autoregressive process, where each token depends on all previous tokens, including errors—or whether it is a training issue that could be addressed through better objectives. Architectural pessimists emphasise the self-conditioning dynamic and assert there is a ceiling on agent reliability imposed by the architecture itself.

Conversely, training optimists believe that improved reward signals and formal verification in domains such as coding and mathematics might change the relationship between task difficulty and incoherence. The “densing law” observed by Xiao et al. holds that capability density, the capability delivered per parameter, doubles roughly every three and a half months, which implies that similar performance can be achieved with exponentially fewer parameters over time. Even if an ultimate architectural limit exists, the densing law could keep raising the baseline until it approaches that ceiling from below.
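Taken at face value, a three-and-a-half-month doubling time means the parameter count needed for a fixed capability halves on the same clock. A quick back-of-the-envelope, using a hypothetical 70B-parameter baseline:

```python
# Parameters needed for fixed capability, if capability density doubles
# every 3.5 months (the stated rate, taken entirely at face value).
n0 = 70e9  # hypothetical 70B-parameter baseline
for months in (0, 7, 14, 21):
    needed = n0 / (2 ** (months / 3.5))
    print(f"after {months:2d} months: ~{needed / 1e9:.1f}B parameters")
```

Seven months in, 17.5B parameters would match the baseline; under two years in, roughly 1B would.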

My own view is that the pessimists currently have the stronger evidence, though this might shift if someone manages to build reasoning models that genuinely self-correct rather than merely reason for longer. The ‘hot mess’ paper found that deliberate reasoning budgets produce only modest gains in coherence, while spontaneous overthinking produces considerable incoherence. Existing reasoning architectures, in other words, buy effort rather than fundamentally different ways of thinking. A genuine breakthrough, if it comes, would be a system capable of recognising when its reasoning has departed from correctness and correcting course, rather than simply generating more tokens and hoping for the best.

Until then, the practical advice remains unglamorous but, I believe, correct. If you are developing anything that connects model outputs to the real world, stop dedicating all your reliability budget to preventing coherent misalignment and instead allocate most of it to containing incoherent behaviour. Design for failure as a default. Treat long reasoning traces as warning signals. Test with sufficient randomness to expose failures driven by variance. Accept that your agent will sometimes act in baffling ways and construct containment measures to limit the damage.
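Containment, in practice, can be as mundane as refusing to let the agent take irreversible actions without sign-off. A hypothetical gate, with placeholder action names and no particular framework in mind:

```python
# Actions the agent may never take autonomously (placeholder names).
IRREVERSIBLE = {"send_email", "transfer_funds", "delete_records"}

def execute(action: str, payload: dict, approved: bool = False) -> str:
    """Containment over capability: irreversible actions need sign-off.

    Reversible actions pass straight through; anything in the
    irreversible set is held in a queue until a human approves it.
    """
    if action in IRREVERSIBLE and not approved:
        return f"held: '{action}' queued for human approval"
    return f"executed: {action}({payload})"

print(execute("summarise_thread", {"thread_id": 7}))
print(execute("send_email", {"to": "board@example.com"}))
```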

The ‘hot mess’ paper’s key finding is that the nature of failure is changing as models grow larger, shifting from predictable bias to unpredictable variance. The models are improving across all measures. But we prepared for a scheming adversary, and what we get is an increasingly powerful system that trips over its own shoelaces in a different way each time. That is not more reassuring. We built our evaluation setup to catch the adversary. We need to rebuild it to catch accidents.
