The machine that improves the machine

By Iain


In May 2025, Google DeepMind released AlphaEvolve, an AI system that discovers better algorithms by evolving code through thousands of iterations. Within months, it had already optimised parts of Google’s data centre operations, improved hardware chip designs, and, most tellingly, accelerated the training of the very language models that power it. That last detail deserves considerably more than a footnote, because when an AI system starts making the tools used to build AI systems faster and cheaper, you are looking at a feedback loop that will change how the next generation of large language models gets built.

This piece is about that loop, covering what AlphaEvolve actually does, how it differs from the LLMs you already use, and what it means for the models that come after it. (For the technically inclined, the full AlphaEvolve paper is available on arXiv.)

What AlphaEvolve does

The easiest way to understand AlphaEvolve is to compare it with a regular large language model like GPT-4 or Claude. When you ask an LLM to write code, it gives you one answer, maybe a good one, maybe a disaster. If the code is wrong, you tell it so, and it tries again. The whole process depends on your ability to spot problems and describe them clearly enough for the model to correct course.

AlphaEvolve works according to a completely different logic, because you give it a starting algorithm (a seed program that works, even if badly) and an evaluation function, which is just an automated test that scores how well any version of that algorithm performs. From there, it runs a continuous loop, using Google’s Gemini language models to generate mutated versions of the code, testing each one against the evaluation function, keeping the best performers, discarding the rest, and feeding the winners back in as parents for the next round of mutations. Think of it as selective breeding for software, with Gemini as the source of genetic variation and the evaluation function as natural selection.
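The loop above can be sketched in a few lines. Everything here is illustrative: the toy `evaluate` and `mutate` functions stand in for AlphaEvolve’s automated benchmark and its LLM-generated code mutations, which are far richer in practice.

```python
import random

random.seed(0)  # reproducible toy run

def evaluate(program):
    # Hypothetical fitness function: how close a list of coefficients
    # gets to a fixed target. In AlphaEvolve this would execute the
    # candidate code against an automated benchmark.
    target = [3, 1, 4, 1, 5]
    return -sum(abs(a - b) for a, b in zip(program, target))

def mutate(parent):
    # Stand-in for the LLM mutation step: perturb one element.
    child = parent.copy()
    i = random.randrange(len(child))
    child[i] += random.choice([-1, 1])
    return child

def evolve(seed, generations=200, children=8):
    population = [seed]
    for _ in range(generations):
        # Keep the best performers as parents for the next round.
        parents = sorted(population, key=evaluate, reverse=True)[:4]
        offspring = [mutate(random.choice(parents)) for _ in range(children)]
        # Selection: only candidates that score well survive.
        population = sorted(parents + offspring, key=evaluate, reverse=True)[:4]
    return max(population, key=evaluate)

best = evolve([0, 0, 0, 0, 0])
```

The structure, not the toy problem, is the point: generate variations, score them automatically, keep the winners, repeat.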

Two models work in tandem throughout this process, with Gemini Flash, the faster and cheaper model, generating a wide spread of variations, throwing lots of ideas at the wall to see what sticks. Gemini Pro, the more capable model, offers deeper, more considered suggestions when the search gets stuck on a good but not the best available solution (what mathematicians call a local optimum). The combination means the system explores broadly without losing the ability to reason carefully about promising candidates.
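A plateau-triggered escalation rule along these lines captures that division of labour. The switching logic below is entirely hypothetical (Google has not published how the two models are scheduled); it only illustrates the pattern of cheap broad search with occasional expensive deep search.

```python
def pick_mutator(history, patience=5):
    # `history` is the best score per generation so far.
    # Hypothetical rule: use the cheap, fast mutator by default, but
    # escalate to the stronger model when the best score has not
    # improved for `patience` generations (a suspected local optimum).
    if len(history) > patience and max(history[-patience:]) <= max(history[:-patience]):
        return "gemini_pro"    # deeper, more considered mutations
    return "gemini_flash"      # broad, cheap exploration
```

For example, a steadily improving run keeps using the fast model, while a stalled one escalates.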

This setup eliminates the hallucination bottleneck that plagues regular LLM code generation. “Hallucination” in this context means the model confidently producing output that looks right but is factually wrong or broken. When a regular LLM writes code, it can produce something that looks plausible but fails in ways that require human expertise to detect. AlphaEvolve’s evaluation function catches every failure automatically, and nothing gets promoted to the next generation unless it actually passes the tests. The hallucination problem doesn’t vanish entirely (the base models still generate plenty of bad ideas), but the evolutionary selection process filters out everything that doesn’t work, and only the verified survivors carry forward.

The difference between suggesting and discovering

This is the most important distinction, and the one that gets lost in some coverage of AlphaEvolve. An LLM, at bottom, is a suggestion engine that produces the most probable completion based on everything it has seen in training. It can only remix and recombine patterns from its training data, and if no human has ever written down a particular algorithm, the LLM is unlikely to produce it from scratch.

AlphaEvolve is a discovery engine, and the difference between those two categories is not merely semantic. Because it tests thousands of mutations over many generations, it can stumble onto solutions that no human has proposed and no training dataset contains. Google tested it on 50 open problems in mathematics, areas where the best human-devised solutions had been standing for years or decades. AlphaEvolve rediscovered the known best answer 75% of the time and found something better 20% of the time. That 20% figure is the one to pay attention to, because these were not incremental tweaks to existing methods but novel constructions that professional mathematicians hadn’t found, verified automatically as correct.

A suggestion engine tells you what it has already seen, while a discovery engine finds things nobody has seen yet. The LLM provides the creative variation, and the evolutionary process, together with the evaluation function, provides the rigour; neither alone could produce these results. Google’s own researchers have since used AlphaEvolve to push the boundaries of theoretical computer science, uncovering new mathematical structures that tighten what we know about approximation problems.

How it is changing the way LLMs get built

Set the mathematics aside for a moment, because the most concrete near-term impact is on the infrastructure that trains and runs large language models, and Google has been deploying AlphaEvolve-discovered improvements internally for over a year.

Faster matrix multiplication

Matrix multiplication is the computational bedrock of modern AI. To see why, picture a spreadsheet with rows and columns of numbers: a matrix is a grid of numbers like this, and “multiplying” two of them together is the basic mathematical operation that language models perform billions of times, both during training and every time they generate a response. Every time a language model processes a token (a word or a subword), it multiplies matrices. Every training step is dominated by these operations, which means even small improvements in multiplication speed ripple through the entire stack.
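To make “multiplying matrices” concrete, here is the naive triple loop in plain Python. Production systems run heavily optimised GPU kernels, but they compute exactly this:

```python
def matmul(A, B):
    # Naive matrix multiplication: the O(n^3) inner loop that
    # dominates both LLM training and inference.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# C is [[19.0, 22.0], [43.0, 50.0]]
```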

AlphaEvolve found a smarter way to divide large matrix multiplication operations into smaller, more manageable subproblems. The result was a 23% speedup in a critical kernel (a tightly optimised piece of code that runs on the GPU) in Gemini’s architecture, which translated into a 1% reduction in Gemini’s overall training time. One per cent sounds modest until you remember that a training run takes months on thousands of expensive specialised chips (called accelerators), at which point 1% represents millions of dollars and a measurable tonnage of carbon emissions.
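Tiling is one standard way to divide a large multiplication into smaller subproblems. The specific partition AlphaEvolve discovered has not been published; this pure-Python sketch only illustrates the blocking idea, which on real hardware improves data locality:

```python
def matmul_blocked(A, B, block=2):
    # Multiply A and B by splitting the work into block-by-block
    # subproblems, accumulating each small block product into C.
    n, m, k = len(A), len(B[0]), len(B)
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # One small block product.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        for p in range(p0, min(p0 + block, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
M = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
result = matmul_blocked(I4, M)   # identity times M returns M
```

The arithmetic is identical to the naive version; only the order of the work changes, which is exactly the kind of restructuring an automated search can score and select for.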

AlphaEvolve also optimised low-level GPU instructions for the FlashAttention mechanism. FlashAttention is a clever technique that enables language models to process long sequences of text without running out of memory, and it is a critical component of how modern transformers work. AlphaEvolve achieved a 32.5% speedup on that particular kernel, territory that human engineers typically leave to compilers (software that automatically translates high-level code into the low-level instructions that chips actually execute) because hand-optimising at that level is extraordinarily tedious. AlphaEvolve, of course, doesn’t get bored and doesn’t mind doing the same dull task ten thousand times in a row.
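For reference, the computation FlashAttention reorganises is softmax(Q K^T / sqrt(d)) V. The naive version below materialises every score directly; FlashAttention produces the same output while processing the scores in tiles, so it never holds the full n-by-n score matrix in memory at once.

```python
import math

def attention(Q, K, V):
    # Reference (naive) attention: softmax(Q K^T / sqrt(d)) V.
    # Only suitable for small inputs; real kernels tile this work.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]       # softmax over keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```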

Better hardware

Beyond software optimisation, AlphaEvolve proposed a change to the Verilog code (the programming language used to design computer chips) that describes an arithmetic circuit for matrix multiplication in Google’s Tensor Processing Units (TPUs), the custom chips Google builds specifically to run AI workloads. The modification removed unnecessary bits from the circuit while maintaining functional correctness, confirmed by formal verification methods, and that change was integrated into an upcoming generation of TPUs.

This is worth pausing on for a moment, because an AI system suggested a hardware improvement, written in the standard hardware description language that chip designers use, which was good enough to go into production silicon. The gap between AI-generated suggestions and decisions baked into physical chips has shrunk to zero in this specific case, and the implications for the semiconductor design cycle are hard to overstate.

More efficient data centres

Google’s data centres use a scheduling system called Borg to allocate computational tasks across vast fleets of machines. AlphaEvolve discovered a better scheduling heuristic that has been running in production for over a year, continuously recovering an average of 0.7% of Google’s global compute resources. At Google’s scale, 0.7% is an enormous amount of computing power freed up without buying a single new server, the kind of gain that infrastructure teams spend years chasing with conventional engineering.
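To ground what a “scheduling heuristic” is, here is a toy best-fit placement rule. Borg’s real policy, and the improvement AlphaEvolve evolved, are far more involved and unpublished; this only shows the kind of small, automatically scoreable function being optimised.

```python
def pick_machine(task, machines):
    # Toy "best fit" heuristic: place the task on the feasible
    # machine that leaves the least spare CPU, keeping large
    # machines free for large tasks and fragmentation low.
    cpu, mem = task
    feasible = [m for m in machines if m["cpu"] >= cpu and m["mem"] >= mem]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m["cpu"] - cpu)

machines = [{"id": 0, "cpu": 8, "mem": 32}, {"id": 1, "cpu": 2, "mem": 8}]
chosen = pick_machine((2, 4), machines)   # fits both; best fit is machine 1
```

Because a scheduler’s quality can be scored automatically (simulated utilisation over a task trace), functions like this are ideal targets for evolutionary search.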

Where this goes next

AlphaEvolve is already making LLM training faster, hardware more efficient, and data centre operations leaner. The question is how far this extends and what it means for the next generation of models.

Training optimisers

Every modern LLM is trained using an optimisation algorithm, something like AdamW, which controls how the model learns from data. During training, a language model has billions of numerical settings (called “weights”) that determine its behaviour, and the optimiser’s job is to adjust those weights, step by step, so the model gets better at predicting text. These optimisers were designed by human researchers through a combination of mathematical reasoning and experimental trial and error, and they work well. But there is no reason to believe they are perfect.

AlphaEvolve could develop new optimisation algorithms from scratch, starting by defining the evaluation function as training loss on a representative dataset, seeding it with AdamW or whatever the current best practice is, and letting the system explore variations across the entire design space. Learning rate schedules, momentum terms, weight decay strategies, and gradient clipping behaviour are all parameters of algorithms that can be mutated and selected based on a clear fitness signal. The result might be an optimiser that converges faster, generalises better, or handles certain data distributions more gracefully than anything a human researcher would have designed by hand.
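For concreteness, the standard AdamW update looks like this. Every constant in the signature (learning rate, the beta terms, epsilon, weight decay), and indeed the update rule itself, is exactly the kind of knob an evolutionary search could mutate and score against training loss.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    # One AdamW update over lists of weights `w` and gradients `g`.
    # `m`/`v` are running first/second moment estimates; `t` is the
    # 1-based step count used for bias correction.
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias-corrected moments
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    # Decoupled weight decay: the wd term acts on the weights directly.
    w = [wi - lr * (mh / (math.sqrt(vh) + eps) + wd * wi)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

w, m, v = adamw_step([1.0], [1.0], [0.0], [0.0], t=1)
```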

The transformer architecture, first described in a 2017 paper and still the backbone of every major LLM, was designed by humans. “Architecture” here means the blueprint that determines how data flows through the model, which mathematical operations occur in what order, and how different parts of the system communicate with each other. The various modifications since then (different attention mechanisms for deciding which parts of the input to focus on, activation functions that control how signals pass between layers, normalisation layers that keep numbers from blowing up during training, positional encoding schemes that tell the model where each word sits in a sentence) were all hand-designed too, through a labour-intensive process of propose, implement, train, evaluate, repeat.

AlphaEvolve can compress that entire cycle dramatically by seeding the search with the current transformer architecture, defining an evaluation function like “how well does this perform per unit of computing power used,” and evolving. The idea of using computers to search for better model architectures isn’t new (it is called neural architecture search), but previous approaches used random mutations or simple heuristics to propose changes, and using an LLM as the mutation engine means the proposed changes can be structurally informed rather than random, which should make the search far more efficient.

It is possible, maybe probable, that the attention mechanism we currently use is not the best one, and the same goes for layer normalisation, feed-forward block design, and the way residual connections are structured. A system that can explore architectural variations at scale and evaluate them rigorously could find configurations that beat the standard transformer on specific tasks or at specific scales, and we would have no way of predicting what those configurations look like in advance.

Data recipes

The ratio and sequencing of training data, how much code versus natural language versus mathematical text versus dialogue, and in what order the model sees it during training, have an outsized impact on final performance. Researchers at AI labs currently tune these “data recipes” through expensive ablation studies, training smaller stand-in models (called “proxy models”) with different mixtures and hoping the results transfer to the full-scale run.

This is what mathematicians call a combinatorial optimisation problem, meaning there are a huge number of possible combinations to try and a clear way to measure which combination works best, exactly the type of problem AlphaEvolve was built for. Evolving data mixture ratios and curriculum schedules against proxy training performance could identify optimal recipes faster and more reliably than the current approach, which often relies on accumulated institutional knowledge and educated guesses passed down like folklore within research teams.
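A minimal sketch of the idea, with a made-up proxy score standing in for “train a small proxy model on this mixture and measure validation loss” (the target mixture here is arbitrary, chosen only so the toy search has something to find):

```python
import random

def proxy_score(mix):
    # Hypothetical stand-in for proxy-model validation performance.
    # Peaks at a made-up optimum: 40% code, 40% prose, 20% maths.
    target = {"code": 0.4, "prose": 0.4, "maths": 0.2}
    return -sum((mix[k] - target[k]) ** 2 for k in target)

def mutate_mix(mix, step=0.05):
    # Move a little probability mass between two data sources,
    # keeping the mixture a valid set of ratios summing to one.
    a, b = random.sample(list(mix), 2)
    d = min(step, mix[a])
    child = dict(mix)
    child[a] -= d
    child[b] += d
    return child

random.seed(0)
mix = {"code": 1.0, "prose": 0.0, "maths": 0.0}   # seed recipe
for _ in range(300):
    child = mutate_mix(mix)
    if proxy_score(child) > proxy_score(mix):     # greedy selection
        mix = child
```

A real system would evaluate each mixture with an actual proxy training run, which is why the search has to be frugal with evaluations.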

Post-training and the question of measurement

After the initial training run, LLMs go through a second stage of refinement designed to make them actually useful and safe to talk to. This typically involves reinforcement learning from human feedback (RLHF), where human reviewers rate the model’s outputs and those ratings are used to steer the model toward better answers. It also involves constitutional AI methods (where the model is trained to follow a set of principles) and rejection sampling (where the model generates many candidate answers and keeps only the best). The many settings that govern these processes are all tuned by humans through iterative experimentation, and getting them right is more art than science.

AlphaEvolve could accelerate this process by evolving these parameters against automated evaluation benchmarks, and indeed, this may be one of the most consequential applications. The catch is that measuring how well a model “behaves” is harder to automate than measuring matrix-multiplication speed. You can’t just define a scoring function for “is this model trustworthy” the way you can for “is this algorithm faster.” But for components of the post-training process that do have measurable proxies (reward model accuracy, refusal rates on safety benchmarks, helpfulness scores on standardised evaluations), evolutionary optimisation could meaningfully speed up the research loop.

The flywheel we should watch

All of the applications above are interesting on their own, but taken together, they describe something more consequential than any single optimisation.

AlphaEvolve uses Gemini to discover improvements, and some of those improvements make Gemini’s training faster, which means a faster-trained Gemini becomes a better mutation engine for AlphaEvolve, which discovers better improvements, which makes training faster again.

Google has explicitly acknowledged this dynamic, noting that AlphaEvolve sped up a kernel in Gemini’s training while Gemini powers AlphaEvolve. The system already contains a recursive improvement loop and is running in production.

This is not recursive self-improvement in the runaway science-fiction sense, because every improvement still passes through a human-defined evaluation function, and the system cannot redefine its own objectives or decide to optimise for anything other than what the engineers asked for. The evaluation function is the leash, and a sturdy one. But within the bounds set by that evaluation function, the system can compound improvements over time in a way that purely human-driven research cannot match.

The best analogy is compound interest. Each individual improvement is modest on its own: a 1% training speedup here, a 0.7% resource recovery there, a 23% kernel acceleration somewhere else. None of them is world-shaking in isolation, but they compound in a context where the improved system is itself the tool used to discover the next round of improvements.
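The arithmetic behind the analogy, purely illustrative: treat each improvement cycle as a 1% throughput gain. Compounding multiplies the gains, so ten cycles yield slightly more than ten additive 1% gains would.

```python
# Ten compounding 1% throughput gains versus ten additive ones.
speedup = 1.0
for _ in range(10):
    speedup *= 1.01          # each cycle builds on the last

additive = 1.0 + 10 * 0.01   # 1.10, the non-compounding baseline
# speedup is about 1.1046: compounding edges ahead, and the gap
# widens with every further cycle.
```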

For labs outside Google, this raises an uncomfortable question about competitive dynamics. If Google has a flywheel that lets its AI improve the efficiency of training its own AI, and nobody else has an equivalent system, the gap between having and not having this advantage widens over time because the returns are recursive. Labs that can build their own version of this evolutionary optimisation loop will stay in the race, and labs that cannot will find the cost of staying competitive increasing faster than their ability to fund it. Several open-source implementations have already emerged, suggesting that the broader research community understands the stakes.

Where the ceiling is

Nobody knows exactly how far this goes, but there are structural constraints that limit the extent to which evolutionary optimisation of LLMs can progress, at least in its current form.

The biggest constraint is the evaluation function, because AlphaEvolve works brilliantly when you can define a clear, automated scoring metric, and questions like “is this matrix multiplication faster?” and “does this circuit produce correct output with fewer gates?” have unambiguous numerical answers. The hardest questions in AI development do not share that property. “Is this model more helpful?” and “does it reason more reliably?” and “is it safer?” are questions where the evaluation function is itself the subject of active research, where reasonable people disagree about what the right metric even is, and where optimising for a proxy metric can produce a model that games the proxy without improving on the underlying quality you care about. This is Goodhart’s Law in action, the old observation that when a measure becomes a target, it stops being a good measure.

The second constraint is the sheer size of what researchers call the “search space,” meaning the number of possible variations the system could try. Evolving a scheduling heuristic or a matrix multiplication kernel involves a relatively bounded set of possibilities. Evolving the full architecture, training procedure, data recipe, and post-training strategy for a frontier LLM involves a space so large that even thousands of generations might not explore a meaningful fraction of it. The evolutionary approach will likely work best when decomposed into well-defined subproblems rather than aimed at the grand challenge of “design the best possible AI” from scratch.

The third constraint is evaluation cost, because to score a candidate LLM architecture or training procedure, you have to actually train a model with it, or at least a small proxy model (a cheaper stand-in for the full-size version), which is not cheap even at reduced scale. The evolutionary loop requires many evaluations per generation, and many generations to converge, so the total computational cost of the search process might dwarf the savings it discovers unless carefully managed with proxy models and transfer assumptions.

What to take from all of this

AlphaEvolve signals a shift in how AI research is conducted: from a process where human researchers propose ideas, implement them, run experiments, and publish papers, to one where the loop includes an automated system that proposes, implements, evaluates, and iterates faster than any team of humans could manage alone.

This shift does not replace human researchers, because someone still needs to define the right evaluation function, interpret surprising results, and decide which discoveries to trust in production. The system found a better scheduling heuristic for Borg, but a human team had to verify it, stress-test it, and make the call to deploy it. AlphaEvolve accelerates the generation and filtering of ideas without eliminating the need for judgment about which ideas to adopt.

For the next generation of LLMs, the most likely impact is practical but powerful. Training will become cheaper, and hardware will become more efficient as the accumulated inefficiencies of suboptimal algorithms throughout the training and inference stack gradually diminish. Models of a given quality level will cost less to produce, meaning the frontier advances even if there are no significant breakthroughs in architecture or training methodology. The incremental gains accumulate, and they happen faster when the tool discovering those gains improves alongside what it is optimising.

So: the tool that builds the tool that builds the model. Google began turning that crank in 2024, and by the time most people heard about AlphaEvolve, the algorithms it had discovered were already in production, reducing costs, saving energy, and making the next training run marginally faster than the previous one. A flywheel that continuously turns, one generation at a time.
