
The flatness of the machine

By Iain

You can feel it before you can name it. A paragraph arrives, fluent and frictionless, and something in the back of your reading brain flinches. The sentences are grammatically flawless, the structure orderly, the tone warm but not too warm, authoritative but not too authoritative. It reads the way a hotel room looks: everything is there, nothing is wrong, and yet the text has no texture, no grain, no evidence that a particular person with particular opinions sat down and hammered it out. It is prose that has been to finishing school and learned nothing except how to be inoffensive at dinner.

This is the uncanny valley of writing. Large language models now produce text that is, by most surface measures, competent, in the way that a Marriott breakfast buffet is, by most surface measures, food. They can mimic registers, follow instructions, and generate passable copy in seconds. What they cannot do, reliably, is sound like anyone in particular. The words arrive clean, centred, sanded smooth, and they are, in a precise technical sense, the most probable words. Probability, it turns out, is the enemy of voice.

Readers notice, even when they cannot articulate what they are noticing. A 2024 study from the University of Kansas found that when people suspected AI involvement in a piece of writing, their trust in the author dropped, even when the text quality was unchanged. The researchers called it a “transparency penalty.” Disclosure of AI authorship degraded perceptions of authenticity, effort, and sincerity. The interesting finding was that this penalty applied even when readers could not identify specific tells. They were not spotting bad grammar or factual errors but responding to an absence, some quality of personhood that should have been there and was not. The prose equivalent of talking to someone at a party who maintains perfect eye contact and says absolutely nothing of substance.

How the machine writes

An LLM does one thing. Given a sequence of tokens, it calculates a probability distribution over what comes next and samples from it. The training objective, next-token prediction, means getting the next word right, billions of times, across terabytes of internet text, until the resulting model develops what looks like an understanding of syntax, argument, tone, even humour. Whether this constitutes understanding or merely a very expensive statistical trick is, as we shall see, contested, but the mechanical fact is not. Every word an LLM writes is the output of a probability calculation.
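Stripped to essentials, the loop looks like this. The sketch below uses an invented four-word vocabulary and made-up scores; no real model is this small, but the mechanism is the same:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw model scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0):
    """Sample one token from the distribution; this is all generation is."""
    probs = softmax(logits, temperature)
    return random.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores a model might assign after "The results were"
vocab = ["significant", "crucial", "startling", "lurid"]
logits = [3.2, 2.9, 0.8, -1.5]

probs = softmax(logits)
next_word = sample_next_token(vocab, logits)
```

Run it a thousand times and "significant" dominates while "lurid" almost never surfaces. Lower the temperature and the distribution sharpens further toward the mode, which is the direction everything in this essay is about.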

The consequences for prose style are baked into the architecture. A model trained to predict the most likely next token will, absent intervention, gravitate toward the centre of its training distribution. It will favour common words over rare ones, conventional syntax over eccentric syntax, safe constructions over risky ones. The resulting text has what researchers call low perplexity, meaning it is highly predictable, and low burstiness, meaning its sentence lengths and structures cluster tightly together. Human writing, by contrast, is irregular.
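Burstiness is easy to make concrete: a common proxy is the coefficient of variation of sentence length. The two snippets below are invented for illustration, not drawn from any study:

```python
import re
from statistics import mean, pstdev

def sentence_lengths(text):
    """Split on terminal punctuation and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text):
    """Coefficient of variation of sentence length: higher means more
    human-like swing between long and short sentences."""
    lengths = sentence_lengths(text)
    return pstdev(lengths) / mean(lengths)

human = ("It rained. Then, for three hours, the sky did something "
         "I have no word for. Short again.")
machine = ("The weather was variable today. The sky showed unusual "
           "colours this evening. The rain continued for several hours.")
```

The human snippet swings from two words to thirteen and back; the machine-ish one sits at five, seven, six. The coefficient of variation captures that spread in a single number.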

People write long sentences and then short ones, use odd words because they like them, and break rules for emphasis. A 2024 study in Artificial Intelligence Review comparing human-written news text to LLM output across six models confirmed what any attentive reader already suspects, that humans exhibit more scattered sentence length distributions, more varied vocabulary, and shorter syntactic constituents. LLMs produce text that is more concise and uniform, with distributions that cluster around lower values and have tight interquartile ranges. In plain English, the machine writes within a narrower band, hedging, rounding off, converging on the middle. If you imagine the full range of human prose as a piano, the model is playing exclusively in the two octaves around middle C, and it has been told that those are the only octaves that exist. This is not a bug, but what the training objective selects for.

The alignment tax on style

The base model, the raw output of pre-training, is actually wilder than what you encounter in ChatGPT or Claude. It contains the full chaotic range of internet text, from Wikipedia to Reddit rants to Nigerian business correspondence to academic papers. It can mimic any of these registers, but unpredictably. You might ask it for a recipe and get a manifesto. This is where reinforcement learning from human feedback enters the process, and where things get both interesting and depressing.

RLHF works by having human annotators compare pairs of model outputs and choose which they prefer. These preferences are used to train a reward model, which then guides the base model toward outputs that receive high scores. The intention is to make the model helpful, harmless, and honest. The side effect is a flattened voice. RLHF introduces what the technical literature calls a mode-seeking mechanism, which narrows the range of outputs, pushing the model away from the tails of its distribution and toward a bland, deferential, faintly eager-to-please centre. The result is a prose style that reads like middle management wrote it, competent, cautious, and stripped of anything that might offend or surprise. Imagine a committee of well-meaning strangers voting on what makes a good sentence, and then imagine doing this millions of times. That is roughly what RLHF does to prose. You get the sentence that nobody actively dislikes, which is another way of saying you get the sentence that nobody remembers.
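The mechanics are worth seeing in miniature. Reward models are typically trained with a Bradley-Terry loss on pairwise comparisons, and averaging preferences across many annotators has a predictable consequence. The scores below are invented; the point is the arithmetic, not the data:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss used to train RLHF reward models:
    -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Hypothetical 1-10 annotator scores for three candidate sentences.
scores = {
    "vivid metaphor":  [9, 2, 9, 1, 8],   # polarising: loved and hated
    "odd but precise": [8, 3, 2, 9, 2],   # likewise
    "safe and clear":  [6, 6, 6, 6, 6],   # nobody's favourite, nobody's enemy
}
avg_reward = {k: sum(v) / len(v) for k, v in scores.items()}
winner = max(avg_reward, key=avg_reward.get)
```

"Safe and clear" tops no individual annotator's ranking, yet wins on average; a reward model trained on these comparisons learns to prefer it. The committee effect, in four lines of arithmetic.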

The annotators themselves leave fingerprints on this process. OpenAI outsourced much of its RLHF work to Sama, a data annotation company with operations in Kenya and Uganda, as well as to other vendors across sub-Saharan Africa. Alex Hern, writing in the Guardian, noted that the word “delve” is far more common in formal Nigerian and Kenyan English than in American or British usage. When RLHF annotators in Nairobi rated outputs, they naturally preferred phrasing that matched their own register, and the model learned accordingly. ChatGPT acquired a slight but measurable tilt toward West and East African business English that nobody designed, and nobody noticed until the word “delve” started showing up at an industrial scale in places it had no business being. It was, in hindsight, the world’s least intentional act of linguistic imperialism in reverse.

The vocabulary forensics

Dmitry Kobak and colleagues at the University of Tübingen studied vocabulary changes across 14 million PubMed abstracts from 2010 to 2024. They borrowed a technique from Covid-era epidemiology, excess mortality analysis, and applied it to words. Instead of counting surplus deaths, they counted surplus vocabulary, and the results were stark. The word “delves” appeared in 25 times as many 2024 papers as pre-LLM trends would predict. “Showcasing” and “underscores” surged ninefold, while “crucial” increased by 2.6 percentage points across the entire corpus. The excess was unprecedented, and the researchers estimated that at least 10 per cent of 2024 PubMed abstracts had been processed with LLMs. In some subfields, the figure reached 30 per cent. A generation of medical researchers, it seemed, had collectively decided that their findings were worth “delving into” at exactly the same moment.
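The excess-vocabulary method itself is simple enough to sketch: fit a trend to the pre-LLM years, extrapolate, and compare against what actually happened. The frequencies below are invented for illustration, not Kobak's data:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical occurrences of "delves" per 10,000 abstracts, 2010-2021.
years = list(range(2010, 2022))
freq = [1.0 + 0.05 * (y - 2010) for y in years]   # slow organic drift

a, b = linear_fit(years, freq)
expected_2024 = a + b * 2024    # what the pre-LLM trend predicts
observed_2024 = 40.0            # invented post-ChatGPT observation
excess_ratio = observed_2024 / expected_2024
```

An observed frequency of 40 against an expected 1.7 gives an excess ratio of roughly 23, the shape of the "delves" result, though the real analysis runs over fourteen million abstracts and many words at once.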

A follow-up study by Kei Matsui examined 135 potentially AI-influenced terms against 84 stable control phrases across PubMed records from 2000 to 2024. The control phrases, ordinary academic constructions like “all patients” and “results suggest,” held steady for two decades, doing their dull but honest work. The AI-influenced terms “meticulous,” “intricate,” “tapestry,” and “boast” spiked sharply after 2022, but the uncomfortable detail is that several of these words had already begun creeping upward in 2020, before ChatGPT launched. The lexical preferences of LLMs may have been seeded during the RLHF process, shaped by the vocabulary preferences of annotators whose influence preceded the public release of the tools. The contamination, in other words, ran deeper than anyone first assumed, and the timeline made it harder to draw a clean line between “human wrote this” and “machine wrote this.” Which, if you think about it, is precisely the problem.

The effect has a name now, AI-ese, and it is recognisable not because any single word gives the game away, but because the words arrive in concert. “Delve” on its own proves nothing, but “delve” alongside “crucial,” “underscore,” “intricate,” “foster,” and “tapestry” in a single abstract proves everything. The tells are combinatorial rather than atomic, like catching someone in a lie, not from one detail but from six details that are all a bit too perfect.

The irony is poisonous because Nigerian and Kenyan writers whose formal English naturally includes some of these terms are now being flagged by AI detection systems for writing in their own language. As Hern observed in the Guardian, if AI-ese sounds like African English, then African English sounds like AI-ese, and the stigma runs only one way. The celebrated writer Elnathan John put it sharply on X: “Imagine after being force-fed colonial languages, being forced to speak it better than its owners, then being told that no one used basic words like ‘delve’ in real life.” One might add that nobody asked the Kenyan annotators whether they wanted to teach ChatGPT to write like them, or whether they were comfortable becoming, in effect, the uncredited ghostwriters of the internet’s new house style.

What gets lost: semantic ablation and the flattening of prose

There is a subtler problem than bad vocabulary. LLMs do not merely favour certain words but systematically strip specificity from prose. Ask a model to write about a jazz record, and it will give you “complex harmonies” and “innovative arrangements” when what you needed was “McCoy Tyner comping in fourths behind a Coltrane solo that lasts eleven minutes and sounds like someone trying to describe a colour that doesn’t exist.” The first version is accurate, and the second version is writing.

This is what might be called semantic ablation, and it is the quietest form of damage a language model does. The model, trained on everything, defaults to the most generalised version of any idea, the way a politician defaults to “the hardworking people of this country” when they cannot remember which constituency they are in. Specific proper nouns get replaced with category labels. Precise technical vocabulary gives way to near-synonyms that carry less information. Vivid metaphors, the kind a human writer uses because they were up late and the phrase struck them and stuck, are smoothed into conventional similes. The texture that makes prose worth reading, the grit and grain of individual perspective, is exactly what the probability distribution selects against, because unusual phrasing has low probability, and the model avoids it.

You can see this in any domain where precision matters. A model asked to write about wine will produce “notes of dark fruit with a velvety finish,” when a sommelier would say “blackcurrant and pencil lead, tannins still gripping, needs three years.” A model writing about code architecture will reach for “robust and scalable solution” when the programmer meant “we sharded the database because the read latency was killing us at peak.” The model is not wrong, exactly, but genericised, having taken a specific observation and translated it into the most common way of expressing that category. This is the opposite of what good writing does. Good writing takes a general category and finds the one concrete detail that makes it real. The model takes a real detail and files it down until it fits in the category bin.

The structural patterns compound the effect further. Studies of AI-generated text consistently find the same architectural habits: an introductory sentence that frames the topic, three supporting points often bulleted, and a summative paragraph that begins “In conclusion” or “In summary.” Transitions are handled with “Furthermore,” “Moreover,” and “Additionally,” the verbal equivalent of a PowerPoint slide advancing. If you have ever read a corporate strategy document and thought, “A committee produced this,” you already know the feeling. Human prose, when it is working, builds momentum through rhythm and surprise, withholding and then delivering, setting up an expectation and then breaking it. These are the mechanics of attention, and they are exactly the patterns that RLHF selects against, because annotators working under time pressure prefer text that is immediately clear over text that rewards sustained reading.

The effect is compounding, and researchers call it model collapse. As more AI-generated text enters the internet in extraordinary volume, future models are trained on it. A 2023 study led by Ilia Shumailov at Oxford demonstrated that models trained on the outputs of other models progressively lose the tails of their distributions. Minority patterns, unusual phrasings, rare vocabulary, and distinctive syntactic structures all gradually disappear. Each generation of the model becomes slightly more generic than the last. The researchers compared it to a photocopier copying a page, the text getting blurrier with each pass. It is an apt metaphor, though one might prefer a culinary analogy. Imagine making stock from stock from stock, each generation thinner and less flavourful, until you are left with warm, faintly savoury water.
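The mechanism admits a crude caricature: suppose each training generation keeps only the patterns above some frequency floor and renormalises the rest. This is a deterministic toy, not the Shumailov setup, which involves actual retraining on sampled outputs, but the direction of travel is the same:

```python
def collapse_step(probs, floor=0.10):
    """Drop patterns rarer than the floor and renormalise the survivors:
    a deterministic caricature of losing the distribution's tails."""
    kept = {w: p for w, p in probs.items() if p >= floor}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

# An invented vocabulary distribution with a modest tail of rare words.
gen0 = {"the": 0.40, "said": 0.25, "old": 0.15, "lurid": 0.11,
        "gloaming": 0.06, "susurrus": 0.03}
gen1 = collapse_step(gen0)
gen2 = collapse_step(gen1)
```

After one pass, "gloaming" and "susurrus" are gone, and nothing in any later generation can bring them back; the surviving distribution is all the model's children will ever know.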

The internet is already thick with this stuff, and getting thicker by the hour. Estimates vary, but AI-generated content now constitutes a substantial and growing fraction of new text published online. Much of it is SEO spam, product descriptions, and content-farm filler designed to capture search traffic without providing anything worth reading, the textual equivalent of those shops at airports that exist only because you are trapped. Some of it is harder to spot, and this is the more troubling category. Blog posts that sound plausible but contain no original reporting, LinkedIn articles that string together received wisdom in fluent paragraphs, product reviews that describe features without having used the product. The dead internet theory, once a fringe conspiracy popular with the kind of people who also worry about chemtrails, is becoming a description of something uncomfortably real, not that bots have replaced all human activity online, but that the ratio of generated to authored text has shifted far enough to change what the average piece of writing on the internet looks and feels like.

The parrot and the professor

Whether any of this is fixable depends, in part, on what you think LLMs are doing when they write. Emily Bender’s position, laid out in her 2021 paper with Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell, is that they manipulate linguistic form without access to meaning, that they are, in her phrase, stochastic parrots. “The only thing a large language model can learn is about form, sequences of letters and punctuation, and what’s likely to go next,” she told an audience at Harvey Mudd College in November 2024. An LLM, in her view, “no more understands the texts it is processing than a toaster understands the toast it is making.” It is one of those analogies that is either devastating or slightly unfair, depending on your priors about toasters.

Geoffrey Hinton disagrees, which is what Geoffrey Hinton tends to do. He argues that predicting the next token at the level frontier models now achieve requires something functionally equivalent to understanding, and that this understanding is emergent rather than designed. A 2024 Scientific American investigation described a workshop at Berkeley where frontier models solved novel tier-four mathematics problems, producing coherent proofs that went beyond memorisation. If a parrot can do abstract mathematics, Hinton’s camp suggests, perhaps it is time to reconsider what we mean by parrot.

For the question of text quality, though, the debate matters more than it first appears. If Bender is right and LLMs process form without meaning, then the flatness of their prose is not a temporary limitation but a structural constraint, like asking a colour-blind person to arrange paint swatches. A system that does not understand what it is saying cannot develop a voice, because voice requires intention, and intention requires something to intend. The model will always converge on the statistically average way of expressing any given idea, because it has no reason, no internal reason, to prefer a specific or unusual expression over a generic one.

If Hinton is right and some form of understanding is emerging, the picture is different but not necessarily better, because understanding does not automatically produce good prose, and plenty of humans understand what they are saying and still write badly. The question becomes whether the training process, the combination of next-token prediction and RLHF, can be modified to reward stylistic distinction rather than punishing it. It would be as if a piano teacher had spent years drilling a student on scales and now wanted them to play with feeling. Technically possible, but you may have trained the feeling out.

What would need to change

For LLM text to become properly indistinguishable from human writing, rather than merely passable at first glance, several things would need to happen at once, and some of them may be impossible within the current architecture.

The training objective would need to tolerate surprise. Next-token prediction, by definition, rewards the most probable continuation. Good writing often does the opposite, surprising the reader by taking the less expected path when that path is more vivid, more precise, or more truthful. A 2025 study from the University of Mannheim measured this gap directly, comparing the entropy of LLM-generated story continuations against human-authored fiction. The models produced text with two to four times lower entropy than the human ground truth, and the gap widened further after RLHF alignment. Literary theorists would have predicted as much. Wolfgang Iser argued that the “gaps” in a text, its moments of indeterminacy, are what compel cognitive engagement. Roland Barthes distinguished between “readerly” texts, which deliver meaning passively, and “writerly” texts, which invite the reader to become a co-creator. By this framework, LLMs are relentless engines of readerly text, closing every gap, resolving every ambiguity, smoothing every rough surface, like an anxious host who fills every silence at a dinner party. The result is what the literary scholar Sianne Ngai called “stuplimity,” a synthesis of shock and boredom born from the accumulation of frictionless but creatively flattened content. Anyone who has asked ChatGPT to write a poem and then immediately wished they hadn’t will recognise the sensation.
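Entropy makes the gap measurable. A sketch with two invented next-token distributions, the numbers chosen for illustration rather than taken from the Mannheim study:

```python
import math

def shannon_entropy(probs):
    """Entropy in bits: low means the continuation is predictable."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

peaked = [0.90, 0.05, 0.03, 0.02]   # an aligned model, confident in the mode
spread = [0.40, 0.25, 0.20, 0.15]   # a more human-like hedge across options

h_peaked = shannon_entropy(peaked)
h_spread = shannon_entropy(spread)
```

The spread distribution carries roughly three times the entropy of the peaked one, the same two-to-four-fold gap the study reports between human fiction and aligned model output.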

A training signal that consistently rewards the predictable token will consistently produce predictable text. Some researchers are trying to break the cycle. Meta’s Large Concept Model operates above the token level, predicting semantic concepts rather than individual words, which is a bit like teaching someone to think in paragraphs rather than syllables. A team led by John Joon Young Chung has proposed “diversified DPO”, a modification to direct preference optimisation that rewards outputs for differing from the average response to the same prompt while maintaining quality, which is as close as anyone has come to formalising the instruction “be interesting.” Whether these produce more human-like prose remains to be seen, but they at least acknowledge that the statistical averaging built into current systems is a problem worth solving rather than a feature to celebrate.

RLHF would need to stop rewarding blandness. The current process trains for helpfulness and harmlessness, which in practice means training for a voice that offends nobody and interests nobody, the literary equivalent of hold music. Annotators, working under time pressure for low wages, tend to prefer clear, safe, and conventional outputs. They are not being asked to reward literary quality, idiosyncratic phrasing, or the kind of constructive difficulty that makes good writing worth the effort. The reward model, in other words, is optimised for customer service, not prose. It is as though you trained a restaurant critic by asking a thousand people whether they liked their meal at a chain restaurant, and then used the result to evaluate a Michelin-starred tasting menu.

Researchers at the University of Washington have documented this as “homogeneity-by-design”, arguing that the flattening of LLM output is an organisational decision, not a technical side effect. Companies optimise for the broadest possible user base, which means optimising for the blandest possible voice, the textual equivalent of painting every wall beige because nobody complains about beige. Changing this would require either a different class of annotator, a different set of instructions, or a different alignment mechanism entirely. It would also require companies to accept the commercial risk that a more distinctive model might alienate some users, and no publicly traded company has ever willingly chosen “alienate some users” as a product strategy.

The “telling instead of showing” problem would need a solution. Researchers at Columbia University have documented how LLM-generated creative writing consistently “tells instead of shows,” a failing that any undergraduate writing workshop would flag. The model states emotions rather than rendering them through action and detail, and summarises rather than dramatises, reaching for the abstract category when the specific instance would do the work. Tuhin Chakrabarty and colleagues found that LLM fiction is “hackneyed and rife with clichés, while failing to demonstrate rhetorical complexity.” A separate study found that LLM-generated stories are “homogeneously positive and lack tension,” a fair description of a corporate motivational poster. This is a structural problem rooted in the training objective, and it may be the hardest one to fix. Showing requires the writer to trust the reader to infer, and inference is uncertain. The model, trained to minimise uncertainty, reaches for the explicit statement every time. It is the prose equivalent of a comedian who explains the punchline.

Models would need something resembling a persistent perspective. Human voice in writing comes from accumulated experience, consistent opinions, and the willingness to be wrong in ways that reveal character, and an LLM has none of these. It generates each response from scratch, with no memory of having held a position before and no stakes in holding one now. It cannot be contrarian, because it has nothing to be contrarian against, and it cannot be personal, because there is no person to be personal about. The most it can do is simulate these qualities on instruction, which produces roughly the same effect as a method actor who has done extensive research into the role but has never actually experienced grief, or joy, or the specific indignity of being stuck on the M25 for three hours behind an overturned caravan.

A 2025 study in Nature Human Behaviour put numbers on what this means. Researchers tested whether LLMs could replicate human conceptual representations across nearly 4,500 word concepts. The models performed well on non-sensorimotor dimensions, the social, emotional, and abstract concepts, which are also, not coincidentally, the dimensions most heavily represented in internet text. They failed on motor-related dimensions, the concepts rooted in physical experience, the things you know because your body has done them. The researchers concluded that motor representations rely on embodied experiences that cannot be learned from text alone, and the implications for writing are blunt. The best prose is grounded in sensory particularity. It knows what rain sounds like on a tin roof, what a specific street smells like at 5 am, what it feels like to hold a conversation while angry and trying not to show it. These are not things that can be learned from statistical correlations in a text corpus. They are things that are known because someone lived them. No amount of training data about rain will give you the tin roof.

And perhaps most fundamentally, the text would need to carry a sense of cost. The feeling that a specific human spent time choosing these words over other words, not because a probability distribution favoured them, but because the writer believed they were the right ones and was willing to be judged for the choice. This is what readers detect, or fail to detect, when they flinch at AI-generated prose. It is not that the grammar is wrong or the facts are off. It is that no one is home, that the text is produced but not authored, arriving fully formed from nowhere in particular, addressed to no one in particular, about nothing that anyone in particular actually cares about. It is writing as room-temperature water, technically adequate, satisfying nothing.

Where this leaves the written word

The companies building these systems tend to frame the flatness as a solvable engineering problem, one that will yield to time, scale, better RLHF, better data, better prompting, and perhaps it will. The tech industry’s capacity for self-belief should never be underestimated. But there is a version of this story in which the flatness is not a bug to be fixed but a feature of what the technology is. A system designed to predict the most likely next token will, no matter how large or sophisticated it becomes, produce text that tends toward the average, and the average, in writing, is the death of style.

Meanwhile, the written internet fills with this stuff as model collapse proceeds. The tails of the distribution, where the weird, precise, distinctive, culturally specific, gloriously improbable writing lives, get thinner with each training cycle. A Kenyan postgraduate is accused of using ChatGPT because he writes in formal English. A medical researcher’s abstract is indistinguishable from a hundred others because they all passed through the same model. A reader scrolls past another paragraph of competent, textureless, faintly warm prose and does not bother to finish it. Nobody notices. There is always more where that came from.

We are building a technology that is very good at producing text and very bad at producing writing. The distinction between the two has never mattered more, and the thing that makes the distinction, the human willingness to put something at stake in a sentence, to be caught out, to be specific when being vague would be safer, is exactly the thing that cannot be back-propagated through a neural network.
