The ten trillion dollar gamble

By Iain Harper

In November 2025, on stage at the Wall Street Journal’s Tech Live event, the chief financial officer of OpenAI was asked how her company planned to honor roughly $1.4 trillion in compute contracts on $13 billion of revenue. Sarah Friar said she was looking to assemble a network of banks, private equity, and a federal “backstop” or “guarantee.” By the following evening, she had posted to LinkedIn explaining that “backstop” had muddied the point, that what she meant was something more like a public-private partnership, and that the United States government has been “incredibly forward-leaning” on AI as a strategic asset. The post was titled in such a way as to suggest someone had explained, gently, what the original phrase implied.

A few weeks earlier, in a CNBC interview about Nvidia’s $100 billion investment in OpenAI, Friar had said the quiet part out loud in a different way. The money, she explained, would mostly come back to Nvidia. OpenAI was going to lease the chips. This is not, on its face, an unusual arrangement. It is also exactly the arrangement that, in a different century and a different industry, the SEC eventually charged Lucent Technologies with running.

Out in west Texas, sixty miles north of Abilene, on land that grew cotton for most of the twentieth century, a building roughly the size of the Pentagon has appeared. It has no signage. Its job is to convert electricity into tokens. The first phase draws 1.2 gigawatts. A second campus in New Mexico will draw more. Stargate, the joint venture, has identified roughly seven gigawatts of planned capacity, which is the load of Atlanta on a hot afternoon, dedicated to running language models. The five biggest American hyperscalers are now collectively committing to spend between $660 billion and $725 billion in 2026, nearly double 2025 levels. McKinsey, applying its house mix of confidence and arithmetic, puts the cumulative figure at $6.7 trillion globally through 2030. Add OpenAI, Anthropic, xAI, the Chinese state-backed projects, the Saudi PIF announcements, and the various sovereign AI funds, and the round number is easily ten trillion dollars by the end of the decade. The combined GDP of Japan, Germany, and the UK.

The wager being placed at Abilene is not, strictly, on artificial intelligence. It is on inference, the moment a trained model meets a user and produces an answer. Training is the expensive, headline-grabbing event. Inference is what happens for the other 99.9% of the time the model is in use. It runs every second of every day, scales with users, and sets the cost floor for whether the entire industry can ever turn a profit. Friar’s comment about the money going back to Nvidia is the financial side of the same wager. Both depend on the same bet: that demand for inference will arrive on schedule in the volume required to pay back the buildout. Nobody on a trading floor in London or a boardroom in Redmond can tell you with confidence that it will.

What follows is a guide to what inference is, what it costs, why it is the linchpin of the entire AI economy, and where the unknowns sit. The mechanics are important because the economics flow directly from them, and the economics matter because the entire stock market is currently betting that the mechanics will yield, predictably, to engineering.

What inference is, mechanically

At inference time, a large language model is essentially a frozen pile of numbers. These numbers were chosen during training to make the model good at predicting the next token in a sequence. A token is a word, a fragment of a word, or sometimes a single character. When you type a question into ChatGPT, three things happen in sequence. The text is broken into tokens. Those tokens are converted to vectors and fed through the model in a single parallel pass. The model then emits new tokens one at a time until it decides to stop or hits a limit.
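
The first of those steps is easy to see directly. A minimal sketch using OpenAI's open-source tiktoken library; the encoding name is the one used by GPT-4-class models, and the sample sentence is arbitrary:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by GPT-4-class models

text = "Inference converts electricity into tokens."
ids = enc.encode(text)

print(ids)                             # a short list of integers
print([enc.decode([i]) for i in ids])  # the text fragment behind each id
# Common words map to a single id; rarer words split into several fragments.
```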

These two phases, prefill and decode, have radically different cost structures. Almost every optimization in the field exists because of that asymmetry.

Prefill is fast. Your prompt arrives, is tokenized, and the model processes every input token in parallel. Each layer runs a self-attention computation that produces three vectors per token: Q, K, and V (query, key, and value). The keys and values are stored in the KV cache, a piece of GPU memory that grows linearly with the length of the prompt. Prefill is compute-bound. A modern GPU processes tens of thousands of input tokens per second because the work parallelizes beautifully.
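
What prefill actually computes can be shown at toy scale. A single-head, NumPy-only sketch; the dimensions and random weights are stand-ins, and a real model repeats this per layer across dozens of heads, keeping only K and V around afterward:

```python
import numpy as np

d = 64                                   # head dimension (toy)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

def prefill(x):                          # x: (tokens, d), the whole prompt at once
    Q, K, V = x @ Wq, x @ Wk, x @ Wv     # one parallel pass over every token
    mask = np.tril(np.ones((len(x), len(x))))   # causal: no token sees the future
    scores = np.where(mask == 1, Q @ K.T / np.sqrt(d), -np.inf)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V, (K, V)           # output, plus the KV cache to keep

out, kv_cache = prefill(np.random.randn(10, d))
print(out.shape, kv_cache[0].shape)      # (10, 64) (10, 64)
```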

Decoding is slow. After prefill produces the first new token, the model must produce the second by attending to every prior token, then the third by attending to every prior token plus the second, and so on. This is the autoregressive loop, and there is no parallelizing it. You cannot generate token 501 before token 500 because the input to token 501 is token 500. Each step pulls from two distinct memory pools. The KV cache, which has grown by one entry for every token generated so far, must be read in full. The model weights, fixed in size but very large, must also be read in full. Both reads happen for every output token, interleaved across each layer of the network. The first scales with conversation length. The second scales with model size. During decoding, the GPU spends most of its time waiting on memory bandwidth, performing a small amount of computation between reads. A frontier model on an Nvidia H100 will use perhaps 10% of the chip’s compute capacity during decoding and run at memory-bandwidth-bound speeds. This is why output tokens are priced 3 to 10 times higher than input tokens across major API providers. You are paying, in effect, for the GPU’s waiting time.
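
The bandwidth bound is easy to put numbers on. A back-of-envelope sketch for a single user, assuming an H100's published 3.35 TB/s of memory bandwidth and a dense 70-billion-parameter model held in FP16:

```python
# Single-stream decode: every output token requires streaming all of the
# model weights out of HBM, so bandwidth, not compute, sets the ceiling.
hbm_bytes_per_sec = 3.35e12     # H100 memory bandwidth (spec)
params = 70e9                   # dense 70B model (illustrative assumption)
bytes_per_param = 2             # FP16

bytes_per_token = params * bytes_per_param
print(f"{hbm_bytes_per_sec / bytes_per_token:.0f} tokens/s ceiling")  # ~24

# Serving many users in one batch amortizes that weight read across all of
# them, which is why batching, not faster arithmetic, is the first lever.
```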

The KV cache is the source of most of the strange behavior in inference economics. Doubling the context window doubles the cache, because the cache must hold every prior key and value. A 128,000-token context holds 16 times the cached state of an 8,000-token context, and the attention computation over it grows quadratically with sequence length, 256 times the work of the shorter prompt. When a model serves many users at once, each user’s KV cache occupies its own slice of GPU memory. Run out of memory, and you must evict cached state, recompute it from scratch, or refuse the request. There is a whole sub-industry devoted to packing more KV caches onto the same GPU, all of it invisible to the user and all of it determining whether unit economics work.
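
The cache arithmetic is worth doing once. A sketch with illustrative dimensions in the range of a 70B-class model with grouped-query attention; exact figures vary by architecture:

```python
# Per token, each layer stores one key and one value vector per KV head.
layers, kv_heads, head_dim = 80, 8, 128    # illustrative 70B-class dimensions
bytes_per_value = 2                        # FP16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
print(f"{per_token / 1024:.0f} KiB of cache per token")          # ~320 KiB

for context in (8_000, 128_000):
    print(f"{context:>7}-token context: {per_token * context / 2**30:5.1f} GiB per user")
# ~2.4 GiB versus ~39 GiB: sixteen times the memory, per concurrent user.
```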

Why does any of this matter? Because every cost reduction in the industry, and every quality regression that users complain about, operates on prefill, on decode, or on the cache that connects them.

Why inference dominates everything

For most of computing history, software has had a beautiful property called zero marginal cost. Once Microsoft Word was written, the millionth user cost nothing more than the first. Distribution was free, the buyer paid for the CD-ROM, and venture capital built an entire playbook on the back of it. SaaS, when it arrived, kept the basic shape. Salesforce’s marginal cost per seat was a sliver of bandwidth and a row in a database. You spent a lot to build the product, then printed money once it shipped.

LLM inference broke that. Every token a model generates is a small but real claim on a GPU somewhere, drawing power, dissipating heat, depreciating an asset that costs $30,000 to $40,000 to acquire. ChatGPT’s marginal cost per query is non-trivial. With around 810 million weekly active users generating roughly 2.5 billion queries a day, the inference bill is the dominant line on OpenAI’s income statement. Per-query cost runs in fractions of a cent. At scale, fractions of a cent become large numbers. OpenAI burned around $8 billion in cash in 2025 on $13.1 billion in revenue, and projects to burn $17 billion in 2026.
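
The multiplication is brutal even at generous assumptions. A sketch, taking a quarter of a cent per query as a stand-in; the true blended figure is not public and swings widely with query length and model choice:

```python
queries_per_day = 2.5e9     # reported ChatGPT volume
cost_per_query = 0.0025     # $0.0025, an illustrative assumption

daily = queries_per_day * cost_per_query
print(f"${daily / 1e6:.1f}M per day")            # ~$6.3M
print(f"${daily * 365 / 1e9:.2f}B per year")     # ~$2.3B, consumer chat alone
```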

The structural problem is that the unit cost of an LLM does not behave like the unit cost of a software seat. It behaves like the unit cost of a chemical. You can engineer it down, but it asymptotes to physical limits, including the cost of electricity, silicon fabrication, and water for cooling. Sam Altman can promise hundreds of billions in 2030 revenue, but the gross margin he eventually lands on depends on how cheaply electrons can be turned into tokens of text. That number today is around $0.40 per million tokens for GPT-4-equivalent quality, down from $20 in late 2022. The decline is real and dramatic, and it underpins every bullish argument in the sector. Whether the decline continues at that rate is the bet underneath everything else.

Open cost engineering

Public-facing inference optimization is one of the most active areas of engineering in computing. The techniques fall into roughly five families, and stacking them together can improve cost per token by an order of magnitude without changing the model.

Quantization reduces the precision of the numbers in which a model is stored. A model trained in FP16 can often run in INT8, INT4, or even FP4 with minor loss of quality. Each step halves memory bandwidth and roughly doubles tokens per second per GPU. Modern techniques like GPTQ and AWQ preserve the most important weights at higher precision and aggressively quantize the rest, achieving close to 4-bit average precision with under 1% accuracy loss on most benchmarks. NVIDIA’s NVFP4 format, baked into Blackwell, takes this further. Quantization is now standard. If a frontier API gives you an answer, the weights generating that answer are very rarely the same precision as the model was trained at.
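
The core move fits in a dozen lines. A minimal sketch of symmetric per-tensor INT8 quantization in NumPy; production schemes like GPTQ and AWQ are far more careful about which weights receive this treatment and at what granularity:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Store one FP32 scale plus INT8 values in place of FP32 weights."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # one weight matrix
q, scale = quantize_int8(w)

print(f"{w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")   # 64 -> 16
print(f"max absolute error: {np.abs(dequantize(q, scale) - w).max():.4f}")
```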

Distillation trains a smaller model to imitate a larger one. The smaller model is faster, cheaper, and usually a bit dumber. Most major providers maintain entire families of models with names like full, mini, nano, haiku, flash, and lite that are largely distilled descendants of a common parent. DeepSeek’s V3 family famously serves a 671-billion-parameter model at $0.14 per million input tokens, against GPT-4o’s roughly $3, by combining aggressive distillation with the next technique on the list.
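
Mechanically, distillation is a loss function, not an architecture. A sketch of the classic soft-label objective in PyTorch; the temperature and mixing weight are free hyperparameters, and the shapes are toy:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Match the teacher's softened next-token distribution...
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # rescale gradients for the temperature
    # ...plus ordinary cross-entropy against the ground-truth tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 32_000, requires_grad=True)   # batch of 4, 32k vocab
teacher = torch.randn(4, 32_000)
labels = torch.randint(0, 32_000, (4,))
print(distillation_loss(student, teacher, labels))
```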

Mixture of Experts (MoE) breaks a model into many smaller sub-networks, called experts, and only activates a subset for any given token. A 480-billion-parameter MoE with 35 billion active parameters per token has the storage cost of a 480B model and the inference cost of a 35B one. The trade-off is that you need enough memory to hold all the experts, even though only some run at once. MoE is why a Mac Studio with 128GB of unified memory can run Qwen 3.5 122B locally at usable speeds.
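
The routing step is the whole trick, and it is small. A toy top-k router in PyTorch, with the expert count and k chosen for illustration; real routers add load-balancing losses and capacity limits:

```python
import torch

n_experts, k, d = 8, 2, 512
router = torch.nn.Linear(d, n_experts)               # learned gate
experts = torch.nn.ModuleList(
    torch.nn.Linear(d, d) for _ in range(n_experts)  # stand-ins for expert FFNs
)

def moe_forward(x):                                  # x: (tokens, d)
    top_logits, chosen = router(x).topk(k, dim=-1)   # pick k experts per token
    weights = top_logits.softmax(dim=-1)             # renormalize over the chosen k
    out = torch.zeros_like(x)
    for e in range(n_experts):                       # all 8 must sit in memory...
        for slot in range(k):                        # ...but only k run per token
            mask = chosen[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out

print(moe_forward(torch.randn(16, d)).shape)         # torch.Size([16, 512])
```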

Speculative decoding uses a small, fast draft model to guess multiple tokens ahead, then has the big model verify them in parallel during a single forward pass. When the small model guesses correctly, you get multiple tokens for the cost of one. When it guesses wrong, you fall back to normal generation. The technique is lossless, meaning the output distribution is mathematically identical to running the big model alone, and routinely produces 2 to 3 times speedups on predictable text such as code or JSON.
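
The accept/reject loop is simple to state, even if the sampling math that keeps it lossless is subtle. A greedy-decoding sketch with stand-in callables for the two models; `draft_next` and `target_verify` are hypothetical names, not any library's API:

```python
def speculative_step(target_verify, draft_next, prefix, k=4):
    """One round of speculative decoding (greedy variant).

    draft_next(tokens) returns the draft model's next token. target_verify
    (prefix, guesses) returns the big model's own choice at each of the k
    guess positions, computed in one parallel pass. The output is identical
    to running the big model alone, one token at a time."""
    guesses = []
    for _ in range(k):
        guesses.append(draft_next(prefix + guesses))

    verdicts = target_verify(prefix, guesses)
    accepted = []
    for guess, verdict in zip(guesses, verdicts):
        if guess != verdict:
            accepted.append(verdict)   # first miss: keep the big model's token, stop
            break
        accepted.append(guess)         # hit: a token at draft-model prices
    return accepted

# Toy models: the target always counts up by one; the draft sometimes slips.
def target_verify(prefix, guesses):
    return [(prefix + guesses[:i])[-1] + 1 for i in range(len(guesses))]

def draft_next(tokens):
    nxt = tokens[-1] + 1
    return nxt if nxt % 5 else nxt + 1   # wrong whenever the next token is 5, 10, ...

print(speculative_step(target_verify, draft_next, [1, 2, 3]))   # [4, 5]
```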

Batching, paged attention, and continuous batching pack multiple users’ requests into a single GPU pass and reuse cached state across users with shared prefixes. PagedAttention, the technique behind the vLLM serving engine, reduces KV cache memory fragmentation by 55% and supports roughly 10 times more concurrent users on the same hardware. None of this changes the model. All of it changes the economics.
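
From the serving side, these techniques arrive bundled rather than as choices. A minimal vLLM sketch; the model name is only an example, and both PagedAttention and continuous batching are on by default:

```python
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # example model id
params = SamplingParams(temperature=0.8, max_tokens=128)

# A hundred requests at once: the engine interleaves them into shared GPU
# passes and pages their KV caches instead of reserving worst-case slabs.
prompts = [f"One sentence on inference cost driver #{i}." for i in range(100)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:80])
```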

Epoch AI’s analysis of the curves is the cleanest empirical work on the question. The price to reach GPT-4’s performance on PhD-level science questions fell by 40 times per year. Some benchmarks fell faster. The fastest trends, around 900 times per year, started after January 2024 and are heavily influenced by DeepSeek’s pricing aggression. Whether 40 times or 900 times per year is sustainable is the central uncertainty.

Surreptitious cost engineering

If you are an inference provider with hundreds of millions of users, all of whom signed up because of a benchmark you posted, you face a perpetual incentive. The benchmark was achieved on a specific model at a specific precision with a specific reasoning budget. Every user who hits your API after that wants the same quality. But quality costs money, and the market keeps demanding lower prices. So you optimize. You quantize more aggressively. You route easy queries to a cheaper model. You shorten the default reasoning budget. You silently swap weights from one checkpoint to the next, with no announcement and no version bump. Some users notice, eventually, but cannot prove anything because language models are inherently non-deterministic. Two identical prompts produce different outputs, so any quality regression looks like noise until enough people complain on Reddit.

The phenomenon has acquired a name. AI shrinkflation. The pattern, as documented by VentureBeat, is that users report a model getting worse, the company denies it, complaints accumulate, and eventually a postmortem appears blaming a combination of wrapper changes, prompt edits, and reasoning-effort defaults. Anthropic published one such postmortem in April 2026 after Stella Laurenzo, a senior director at AMD, audited 6,852 Claude Code session files and 234,000 tool calls and found a measurable drop in reasoning depth. Anthropic’s response identified three culprits: a default reasoning-effort downgrade from high to medium, a session-caching bug, and a system-prompt change instructing the model to stay under 100 words. None touched the model weights. All produced noticeable, measurable degradation. A BridgeMind benchmark slide cited in the same coverage showed Opus 4.6 accuracy on one test falling from 83.3% to 68.3% over the affected period.

OpenAI has been less transparent. In December 2025, Wired reported that ChatGPT’s free and Go users had been quietly defaulted to GPT-5.2 Instant, with the auto-routing system that used to escalate hard prompts to a thinking model disabled. The shift was framed as giving users more control. The cost savings across hundreds of millions of free users would be considerable. Sam Altman has publicly acknowledged dissatisfaction with how the auto-routing system performs. Whether silently demoting free users to a cheaper model is consistent with how the product was advertised is an open question.

To be clear, there is no public evidence that any major lab has ever swapped a paid API model for a stealth-distilled version of the same product. Accusations in this category are almost always traced to product-layer changes, wrapper updates, or non-determinism that users mistake for model changes. But the incentive to do so is large and growing, and the technical means are well understood. A provider could quietly serve a 4-bit quantized version of a model marketed as full-precision, and most users would not notice.

A provider could distil a frontier model into a smaller one with 95% of the original’s benchmark performance and serve the distilled version under the same name. Trust in the system rests entirely on the labs not doing these things. Nothing in the API contracts requires them not to. The house has every incentive to swap your steak for a cheaper cut and serve it in dim lighting. Whether it does depends on how much it values its reputation and how confident it is that no one is running a side-by-side comparison.
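
Running that comparison is cheaper than it sounds, which is part of what keeps the incentive in check. A sketch of a drift log against the OpenAI API; the model name, probe prompts, and filename are placeholders, and temperature 0 reduces run-to-run variance without eliminating it:

```python
# pip install openai
import datetime, json
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

PROBES = [
    "What is 17 * 23? Reply with the number only.",
    "List the first five prime numbers, comma-separated.",
]

def snapshot(model="gpt-4o"):                       # placeholder model name
    answers = []
    for p in PROBES:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": p}],
            temperature=0,       # reduces, but does not eliminate, variance
        )
        answers.append(r.choices[0].message.content)
    return {"date": str(datetime.date.today()), "model": model, "answers": answers}

# Append a snapshot per day; a diff of this file over weeks is the evidence
# that individual anecdotes can never be.
with open("drift_log.jsonl", "a") as f:
    f.write(json.dumps(snapshot()) + "\n")
```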

LLMflation and the limits of the curve

Andreessen Horowitz coined “LLMflation” to describe the observation that the cost of running an LLM at a given quality level falls by roughly 10 times per year, faster than compute costs in the PC era and faster than bandwidth costs in the dotcom era. The cheapest model to score 42 on MMLU in November 2021, when OpenAI opened GPT-3’s API to general availability, cost $60 per million tokens. By late 2024, the same score was available from Llama 3.2 3B at $0.06 per million. A factor of a thousand in three years. This progress is genuine. It is also the kind of curve that gets dressed up in the language of physical law and treated as inevitable.

The dressing of choice is Wright’s Law. Theodore Wright was an aeronautical engineer who, in 1936, observed that for every doubling of cumulative aircraft production, labor costs fell by roughly 20%. The pattern, he argued, generalized. Solar panels follow Wright’s Law at around 20% per doubling. Lithium batteries at 19%. Nuclear power, conspicuously, does not follow Wright’s Law at all. Costs have risen over the decades rather than fallen, because the binding constraints are regulatory and political rather than technical, and the production volume has been too low for learning effects to dominate. The relevance to AI is that not every technology gets cheaper just because more of it gets made.

Wright’s Law is an empirical regularity, not a law of physics. The technologies that follow it tend to share specific properties, including a modular production unit, learning-by-doing in manufacturing, demand growing fast enough to drive cumulative production through many doublings, and no hard physical floors approaching the cost. Bicycles and refrigerators do not get cheaper exponentially, even though we have made many of them. The 10-times-per-year LLMflation curve assumes cumulative tokens continue to double rapidly. If demand growth slows, the curve flattens. If it accelerates, the curve steepens. The curve is, in other words, a function of the demand it is supposed to forecast. A forecast that depends on its own conclusion is not a forecast.
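
The circularity can be made concrete with one line of algebra. Assuming a Wright-style 20% cost drop per doubling, a sustained 10-times-per-year price decline implies cumulative token volume doubling more than ten times a year:

```python
import math

learning_rate = 0.20    # cost drop per doubling of cumulative output (assumption)
annual_decline = 10     # the LLMflation claim: 10x cheaper every year

# Solve (1 - learning_rate) ** doublings == 1 / annual_decline
doublings = math.log(1 / annual_decline) / math.log(1 - learning_rate)
print(f"{doublings:.1f} doublings per year")                         # ~10.3
print(f"{2**doublings:,.0f}x annual growth in cumulative tokens")    # ~1,300x
```

A thousandfold annual growth in usage is not impossible, but it is an input the curve quietly assumes, not an output it predicts.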

If LLMflation is the bull’s argument, Jevons paradox is its more sophisticated cousin. Cheaper inference, the argument runs, will not reduce total compute demand because it unlocks new use cases that consume more compute than the efficiency saved. Satya Nadella posted exactly this argument on X within hours of DeepSeek’s R1 release, when the market had briefly panicked that a cheaper model would lead to lower chip demand. “Jevons paradox strikes again! As AI gets more efficient and accessible, we will see its use skyrocket.” The framing helped Microsoft considerably as it was about to disclose a $13 billion AI revenue run rate.

William Stanley Jevons first published the observation in 1865 in a book called The Coal Question. Watt’s more efficient steam engine, he noticed, had not reduced British coal consumption. It had increased because cheaper coal-fired power made it economically viable in industries where it had previously been too expensive. The new engines used less coal per unit of work. They did far more work. Total coal consumption rose. The paradox holds in any specific industry only when demand is elastic enough that the efficiency gain is more than offset by the demand response. Whether AI inference is in the elastic regime is an empirical question that no one has answered with much confidence.

There is a strong prima facie case that it is. Cheaper inference unlocks workloads that were previously uneconomic: every code completion in every IDE, every transcript summary, every email draft, every search query rewritten as a question. The demand curve is a stack of dozens of latent applications, each waiting for inference to fall below its viability threshold. There is also a counter-case that bulls underemphasize. Some inference demand is genuinely inelastic. A user asking ChatGPT to summarize a document does not want to summarize it ten times because the summary became cheaper. They want it summarized once.

Most consumer chat traffic looks like this, capped by the user’s attention rather than the model’s price. The central question is which of the two worlds we are in. In the first, agents, the LLM-driven systems that call tools and chain hundreds of inference calls per task, become the dominant mode of use, and Jevons holds. In the second, agents fizzle, the marginal demand curve is set by humans typing into chat boxes, and demand saturates. The capex curve is being drawn on the assumption that we live in the first world. If we live in the second, the curve is drawn on a fault line.

The customer side of the bet

The supply side of this argument is loud. Every hyperscaler quarterly call covers it. Every analyst note tabulates it. The demand side gets less attention, perhaps because it is more embarrassing. In March 2026, Goldman Sachs senior US economist Ronnie Walker published an analysis of fourth-quarter S&P 500 earnings calls. Half of all Russell 3000 companies discussed AI. Among S&P 500 companies, only 10% of management teams quantified AI’s impact on a specific use case. Only 1% quantified its impact on earnings. The Census Bureau’s Business Trends and Outlook Survey found that the share of US establishments using AI was under 20%, unchanged from the previous month. Walker’s headline conclusion was that there is no meaningful relationship between AI adoption and productivity at the economy-wide level.

There are localized exceptions. The same Goldman analysis found median productivity gains of around 30% in two narrow use cases, software engineering and customer service. WRITER’s 2026 enterprise survey found that 97% of executives report personal benefit from AI but only 29% see meaningful organizational ROI from generative AI, and only 23% from AI agents. A separate Fortune CFO survey found that executives privately expect AI-attributed layoffs in 2026 to be roughly nine times the publicly reported figures, even as many of those same CFOs acknowledged a gap between expected productivity gains and measured ones.

This is not the consumption pattern of a technology that is about to absorb $725 billion of annual capex. It is the consumption pattern of a technology that is in the early innings of an expensive, multi-year experiment whose ROI nobody has yet been able to measure. The Fortune 500 stat that gets cited everywhere, that 92% of large firms use ChatGPT, is impressive at face value. It is also exactly what you would expect to see in the early stages of any technology bubble, when buyers fear being left behind. Whether they are paying for value or for FOMO is the question.

The strange financing

The cumulative number is the one that ought to focus minds. Morgan Stanley’s central case is roughly $3 trillion in AI infrastructure investment through 2029. McKinsey’s goes higher, to $6.7 trillion globally by 2030. McKinsey’s upper scenario is $7.9 trillion. Add the various private commitments, and the total commitment by the end of the decade is somewhere between $5 trillion and $10 trillion. Pick the round number you prefer. It is all glorified guesswork.

This is the largest concentrated capital expenditure cycle in modern corporate history. For comparison, the entire US electric utility industry invested around $160 billion in 2024 on generation, transmission, and distribution. The technology sector is now outspending the utility sector on energy-adjacent infrastructure by more than 2x. The Manhattan Project cost about $30 billion in today’s dollars. Apollo cost $288 billion over thirteen years. OpenAI alone projects cumulative losses of $115 billion through 2029 before turning cash-flow positive.

The financing of all this has features that are usually compared, in the analyst notes, to the late-1990s telecom bust. The comparison is not wrong, exactly, but it is a comparison from a different generation of financial reporters reaching for the most recent crisis they covered, and it misses what is structurally novel about the present arrangement.

Lucent, the canonical 1999 cautionary tale, had Nortel, Ericsson, and a half-dozen other equipment vendors competing for the same customers in the competitive local exchange carrier (CLEC) market. When the demand miss came, the unwind was savage but distributed across multiple vendors, multiple supply chains, multiple national champions. NVIDIA has, depending on how you count, between 80% and 92% of the AI accelerator market. It has invested or committed to invest in OpenAI, CoreWeave, xAI, Lambda, Nebius, and a long tail of smaller neoclouds. Many of those entities then buy or lease Nvidia GPUs with the money Nvidia has put in. CoreWeave alone carries $10.45 billion in debt collateralized by GPUs. NVIDIA separately committed to buy $6.3 billion of cloud services from CoreWeave through 2032, which is to say, NVIDIA has agreed to pay CoreWeave for the use of GPUs that NVIDIA originally sold to CoreWeave with money NVIDIA originally lent to CoreWeave. The whole structure is a single counterparty in fancy dress.

The concentration risk of this shape was not present in the dotcom buildout. It is a property of the present arrangement. When every node in a system depends on every other node, and they all share an upstream dependency, you do not get graceful degradation. You get cascade failure, the kind that production engineers spend their careers designing out of distributed systems. If OpenAI’s revenue disappoints by 30% in 2027, that is a material problem for OpenAI, but it is also a material problem for Microsoft (45% of whose $625 billion cloud backlog is OpenAI), for Oracle (whose $300 billion contract is OpenAI), for Nvidia (whose 2027 earnings model leans heavily on OpenAI deployment), for CoreWeave (whose largest customer is OpenAI via Microsoft), and for the GPU-backed debt market whose collateral is suddenly worth less than the loans against it.

The first cracks have already shown. In March 2026, at Morgan Stanley’s technology conference, Jensen Huang publicly confirmed that the original $100 billion Nvidia investment in OpenAI was being reduced to $30 billion, and that any subsequent tranches were unlikely. The official explanation was OpenAI’s planned IPO. The unofficial explanation, circulating among the people whose job is to read between the lines, was that the loop had become more of a liability than an asset. A few months earlier, the WSJ had reported that Nvidia internally questioned OpenAI’s business discipline. Huang denied this, calling the report “nonsense.”

A few weeks ago, the same paper reported that Friar had warned OpenAI’s leadership that the company might fail to pay for its compute contracts if revenue did not accelerate. Friar and Altman issued a joint statement calling the report “ridiculous.” Friar had previously, when Altimeter Capital’s Brad Gerstner asked how OpenAI could honor $1.4 trillion in commitments on $13 billion of revenue, given the answer that included the word “backstop.” When the chief financial officer of the most prominent company in the sector starts gesturing toward federal guarantees, the structure has stopped pretending to be normal venture finance.

Then there is the depreciation question, which is where the accounting gets interesting. A GPU is a depreciable asset. Its cost is spread across its useful life. The choice of useful life determines whether billions flow through the expense line and into reported earnings. In 2020, Amazon extended its server depreciation schedule from three to four years. By 2023, all of the big three had moved to six. CoreWeave depreciates GPUs over 6 years, Nebius over 4, and Lambda over 5. Michael Burry accused the hyperscalers of overstating earnings by roughly $176 billion through 2028 because, in his view, GPUs become economically obsolete in two to three years, not six.
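
The stakes of the schedule choice are a single division. A sketch on an illustrative $100 billion fleet; the dollar figure is an assumption, not any company's disclosure:

```python
fleet_cost = 100e9    # illustrative GPU fleet (assumption)

for useful_life_years in (6, 3):
    annual = fleet_cost / useful_life_years    # straight-line depreciation
    print(f"{useful_life_years}-year life: ${annual / 1e9:.1f}B expensed per year")

# 6 years: $16.7B/yr. 3 years: $33.3B/yr. The gap flows straight into
# reported operating income; repeated across several hyperscalers and
# several years, Burry's $176B is the right order of magnitude.
```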

Burry is neither entirely right nor entirely wrong. The hyperscalers’ defense, that GPUs cascade from primary training to high-value inference to batch inference and finally to general compute, is plausible in steady state. The defense becomes less plausible at the speed at which NVIDIA is now releasing new architectures. Blackwell shipped in volume in 2025. Rubin is shipping in 2026. Each generation delivers roughly 30 times the inference performance of its predecessor on certain workloads. Jensen Huang himself joked, on stage, that “when Blackwell starts shipping in volume, you couldn’t give Hoppers away.” If H100 resale prices fall 70% in their first three years, as they roughly have, then a straight-line depreciation schedule over six years is hard to defend. Even Satya Nadella, on a recent earnings call, said he did not want to get “stuck with four or five years of depreciation on one generation.”

The depreciation question is non-cash, so it does not directly determine whether the buildout succeeds. It determines how the income statements of the people building it look in the meantime. If Burry is right, the hyperscalers are reporting $50 billion to $60 billion of additional annual operating income that will eventually have to be reconciled against reality. Whether through quiet impairments across multiple quarters or a single bad earnings call that forces a sector-wide rethink, the gap closes one way or another. None of the ways is pleasant.

The physical wall

There is a constraint on this enterprise that does not appear on any income statement, cannot be engineered around with software, and is increasingly the binding limit on how fast the buildout can proceed. Electricity.

A typical AI-optimized data center draws 20 to 30 megawatts, against 5 to 10 megawatts for a traditional facility. Gigawatt-class campuses are now standard. Microsoft, Meta, and OpenAI each have multiple sites either operational or under construction at the 1GW-plus scale, with one Meta campus in Louisiana planned to scale to 5GW. The IEA’s most recent figures show data center electricity consumption rose 17% in 2025, with AI-focused facilities growing far faster. Total US data center IT load is projected to roughly double from around 80GW in 2025 to 150GW by 2028.

The US grid was not built for this. Average lead times to connect new generation in primary markets exceed four years. Morgan Stanley estimates a 49-gigawatt generation shortfall by 2028. PJM, the grid operator covering the mid-Atlantic, has attributed a doubling of capacity prices to data center load growth. Northern Virginia, the world’s largest data center cluster, is now substantially constrained on grid availability. Ireland imposed a de facto cap on new connections in Dublin years ago. Texas is the new growth market because, like a man choosing his pub for the cheap beer rather than the cooking, it has more electricity than rules.

The industry’s response is to bypass the grid. Hyperscalers are signing direct power purchase agreements with nuclear plants, contracting with small modular reactor projects (the IEA’s queue of conditional offtake agreements for SMRs has grown from 25GW at the end of 2024 to 45GW today), buying gas turbine output decades in advance, and, in some cases, building entire on-site generation parks. GE Vernova’s gas turbine order book has an 80 GW backlog that extends into 2029. Two consequences flow from this.

First, the cost of inference is now directly coupled to the cost of electricity in an unhedgeable way. If electricity prices rise, as they are in regions with concentrated AI demand, the floor under inference costs rises with them. Wright’s Law and LLMflation can take a long time to make up for a doubling of electricity prices. Second, the buildout is now competing with the residential and industrial customers it shares the grid with. A March 2026 Consumer Reports analysis found that communities near major data center clusters in Virginia, Texas, and Georgia were already seeing residential rate increases of 8% to 15%. The political surface area of this is enormous, and so far, almost entirely unaddressed.
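
The coupling can be put in dollars per token. A sketch assuming an H100 board at 700 watts, a batched throughput of 1,000 tokens per second, a facility overhead (PUE) of 1.3, and industrial power at $0.08 per kilowatt-hour; every figure but the board power is an assumption:

```python
gpu_watts = 700            # H100 board power (spec)
tokens_per_sec = 1_000     # batched decode throughput (assumption)
pue = 1.3                  # cooling and conversion overhead (assumption)
usd_per_kwh = 0.08         # industrial rate (assumption)

joules_per_token = gpu_watts * pue / tokens_per_sec
kwh_per_million = joules_per_token * 1e6 / 3.6e6
print(f"${kwh_per_million * usd_per_kwh:.3f} of electricity per million tokens")
# ~$0.02: small against today's ~$0.40 price, but it is the one line item no
# quantization or batching trick can remove, and it doubles when power does.
```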

What runs on a laptop in 2026

The other side of the inference economy runs on consumer hardware. And in 2026, it has become genuinely interesting.

Apple’s M5 Max with 128GB of unified memory, in a MacBook Pro that draws 60 to 90 watts under sustained inference, can run a 70B-parameter model at Q4 quantization at 18 to 25 tokens per second. That is faster than a human can read. With a Qwen 3.5 122B mixture-of-experts model, the same machine produces around 15 tokens per second at usable quality. In late March, Ollama shipped MLX-backed inference with NVFP4 quantization on Apple Silicon, more than doubling decode performance on the same hardware. Three weeks later, Alibaba released Qwen 3.6-35B-A3B, a 35-billion-parameter MoE with only 3 billion active parameters per token, which landed on the upgraded stack as something of a regime change. The model file is around 20GB at Q4. Token generation feels like a 3B model. Quality is closer to 35B. Simon Willison ran it on his MacBook Pro, and a non-trivial number of developers stopped reaching for the API for routine work.
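
The arithmetic behind the regime change is the same bandwidth division that governs the data center. A sketch assuming the 3 billion active parameters described above, 4-bit weights, and a unified-memory bandwidth in the range recent Apple Silicon has shipped:

```python
active_params = 3e9       # active parameters per token (per the model above)
bits_per_weight = 4       # Q4 quantization
memory_bw = 400e9         # bytes/s unified memory (assumption, M-series range)

bytes_per_token = active_params * bits_per_weight / 8   # ~1.5 GB read per token
print(f"{memory_bw / bytes_per_token:.0f} tokens/s ceiling")   # ~267
# Real throughput lands well below the ceiling (routing, attention, cache
# traffic), but the headroom is why a 3B-active model feels instant locally.
```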

Two years ago, the local-model story was a curiosity. Llama 2 13B on a MacBook was a slow, hot experiment that produced output worse than ChatGPT for free. The ratio has flipped. A current-generation Mac Mini with 48GB RAM runs Qwen 3.6-35B-A3B comfortably, draws around 30W under load, and costs less in electricity per year than a single month of ChatGPT Plus. The frontier models still beat them, but the gap is now measured in quality on the hardest 10% of tasks rather than across the board.

The bull case argues that local models accelerate Jevons. Every developer who runs Qwen on their laptop for code completion is, paradoxically, generating more total inference demand because they are using AI more heavily, without restraint, without paying per token. Eventually, they graduate to more complex problems that require API calls, and those API calls are larger because the workflow has expanded.

The bear case argues that local models are a deflationary force on the API economy. Anything a 35B MoE running on a Mac Studio can do, the API providers cannot charge premium prices for. The frontier shrinks to whatever is genuinely beyond local reach. The middle of the API market, the millions of routine queries from developers, knowledge workers, and curious consumers, is what the local models are coming for, and it is a much larger fraction of total revenue than the frontier.

Both stories are true. Anthropic’s revenue grew from $9 billion ARR at the end of 2025 to $30 billion by April 2026, driven mostly by Claude Code and enterprise API usage in agentic workflows. That is consistent with the bull case. The demand for high-end inference is exploding because agents make it pay. OpenAI’s pricing on its lower tiers, meanwhile, has fallen 90% in 18 months because Llama, Qwen, and DeepSeek are nipping at the bottom. The frontier is becoming more lucrative, while the middle is becoming commoditized. The strategic question for any provider is whether it sits closer to the frontier or to the middle. The strategic question for the people writing the cheques is whether the frontier will grow fast enough to absorb the supply they are building.

Two inferences

The pattern that has held so far, where training and inference both run on broadly the same Nvidia silicon, is starting to break. Cerebras Systems launched its IPO on Nasdaq, the largest pure-play AI hardware listing of 2026. The deal was oversubscribed twenty times, pushing the price range from the original $115 to $125 up to $150 to $160 before pricing. At the top of the range, the valuation lands around $49 billion. The IPO is, in part, a referendum on whether the market believes that one chip is enough.

Cerebras builds wafer-scale silicon. The WSE-3, its latest, holds 44GB of SRAM on a single chip and reads from it at 21 petabytes per second. An Nvidia H100 has 80GB of HBM at 3.35 terabytes per second. Half the memory, six thousand times the bandwidth. The chip is extraordinary for workloads that fit in on-chip memory and pointless for workloads that do not. Yield is a constant fight because the entire wafer is the chip. Cost per unit reflects this. The market has nonetheless concluded that there is a workload worth paying the premium for.

What makes the IPO consequential is who the customer is. OpenAI signed a multi-year, $20 billion-plus Master Relationship Agreement with Cerebras in January, with 750 megawatts of capacity coming online in tranches through 2028. The prospectus describes the two companies as agreeing to co-design future Cerebras hardware models. Read against the rest of this article, that sentence is the interesting one. The company, whose $1.4 trillion in compute commitments anchors the entire NVIDIA trade, has separately decided that a meaningful share of its workload belongs on different silicon.

The reason is workload bifurcation. Inference is starting to split into two distinct shapes. The first is answer inference. A human asks a question, the model responds, and the time the user spends staring at a loading indicator is a direct cost. Speed wins. Cerebras and Groq are competitive here because they collapse the decode bottleneck into something that feels instantaneous. Voice interfaces, wearables, anything where the user is waiting for the next word, lives in this regime.

The second is agentic inference. A model receives a task, calls tools, queries databases, retries failed steps, and eventually produces a result. No human is waiting on any individual token. The latency budget is measured in minutes or hours rather than milliseconds. What wins here is the ability to hold the context and accumulated state that the agent needs to do its job. Memory capacity beats memory bandwidth. Slower DRAM beats expensive HBM, and older silicon at the right node beats the leading edge. The agent does not know or care how long any single decode step took, only whether the eventual answer was right.

NVIDIA knows this. The Dynamo framework, released last year, disaggregates inference itself so different parts of the work can run on different hardware tiers. The company is shipping standalone memory and CPU racks specifically to keep its GPUs from straining either bandwidth or compute. This is not the behavior of a company that thinks its current architecture is the right answer for the next decade. It is the behavior of a company that knows the workload is shifting and is trying to keep up.

The implication for the buildout is uncomfortable. A meaningful share of the $725 billion being committed in 2026 is being spent on hardware optimized for a workload mix the industry openly expects to shift. NVIDIA’s premium is largely a latency premium, and latency is the part of the equation that becomes optional once humans leave the loop. If agentic inference becomes the dominant use of compute, as the labs themselves now believe, then the GPUs being delivered to Abilene this year are priced for a customer who may, by 2028, prefer slower hardware at a fraction of the cost. The wager has always rested on the inference that the demand arrives on schedule. It now also rests on the assumption that the shape of that demand will remain the same.

What we do not know

We know inference costs have fallen at roughly 10 times per year since 2022, a faster learning curve than any prior technology has demonstrated.

We know the techniques driving the decline have meaningful runway left, but we do not know how much. Quantization below 4 bits per weight starts hurting accuracy on hard tasks. MoE sparsity has limits. Speculative decoding is most useful when the output is predictable and less useful as model outputs become more reasoning-heavy. The next 10 times will likely come from a combination of all of these, plus chip improvements. Anyone who tells you with confidence that the curve continues at the same slope through 2030 is selling something.

We know AI demand is growing fast, as measured on the supply side. Anthropic at 30 times in 16 months. OpenAI at 3 times per year sustained at scale. We do not know how much of that demand is durable end-user value versus enterprise budget on a multi-year experiment that, by Goldman’s reckoning, is not yet showing up in productivity statistics. Walker’s analysis is the part of the story that the people writing capex slides have decided not to read. They may be right that productivity follows adoption with a long lag. They may be wrong. That is the bet.

We do not know whether Jevons holds at the scale required. We know it holds for some workloads, including agentic coding, content generation, and RAG pipelines. We do not know whether those workloads absorb $725 billion in annual capex-worth of capacity, or merely $300 billion. The difference obviously matters.

We do not know whether the GPU depreciation schedules reflect economic reality, nor whether the reconciliation, when it comes, will be gradual or sudden. We do not know whether the circular financing structures hold up under stress. They have held up so far because the music has not stopped. Friar’s “backstop” comment was the moment the lyrics changed. Huang’s quiet retreat from $100 billion to $30 billion was a bridge.

We do not know whether the power constraint resolves. Building 49GW of new generation in three years is the kind of physical-world challenge that makes software people uncomfortable because it does not yield to engineering at the pace they are used to. We do not know how local models reshape the API market. We do not know (and the labs themselves disagree) whether silent quality regressions will erode the trust that the entire premium-API business model depends on.

A closing image

On the Las Vegas Strip, at the Bellagio, the highest-stakes private games happen in a room called Bobby’s Room, named after Bobby Baldwin, who won the World Series of Poker in 1978. The buy-in is undisclosed. The walls are lined with mirrors so the players can watch their opponents from every angle.

The participants in Bobby’s Room have one thing in common with the people writing the cheques for those data centers at Abilene. Both have studied the table and concluded that the expected value of the next hand is positive and that the size of the pot justifies the bet.

The difference is that in Bobby’s Room, when a hand goes wrong, the chips slide to a different side of the table. In the inference economy, when a hand goes wrong, the entire room is implicated. The hyperscalers, the model labs, the chip makers, the neoclouds, the utilities, the bondholders, and the residential customers paying 12% more for electricity all sit at the same felt. Microsoft, NVIDIA, and OpenAI have been dealt good cards, and have raised confidently. The pot is now somewhere between three and ten trillion dollars, depending on what you count.

The hand is not over. The cards are still being turned. The position, in May 2026, is that the people building data center infrastructure know more than the rest of us about whether the bet pays off, but not as much more as they would have you believe. They are betting their company. We are betting our pension funds. The cotton fields north of Abilene grew cotton because cotton paid. They will grow tokens because tokens pay. Whether they keep growing them, and at what price, is the only question.

Outside the building, the wind moves through the empty stretch of grass between the perimeter fence and the road. The transformers hum at 60 hertz. Inside, the GPUs that cost more per square foot than the most expensive office space in Manhattan are running, drawing power, dissipating heat, producing tokens at roughly the speed of human thought. There is no signage. There is no crowd. The bet is being made one query at a time, and the table is open all night.
