Out of context: strategies for managing agent memory
By Iain,
AI vendors are locked in a strange arms race: the relentless expansion of the context window, the maximum amount of input a large language model can take in at once. The race is driven by the persistent notion that a larger context equals greater intelligence and capability. Google’s Gemini 1.5 Pro supports a million tokens, Anthropic’s Claude can handle 200,000, and OpenAI keeps raising GPT-4’s limit. The promise is clear: feed the machine your entire codebase, your legal documents, or a complete novel, and it will analyse the material with flawless, almost divine understanding. The technology does not yet deliver on that promise.
The sales pitch versus the reality
The marketing version of context windows is clean and satisfying. A bigger window means the model can “see” more of your data at once and draw connections across larger bodies of information, so you can stop chopping your documents into fragments and hand over the entire lot. For anyone who has spent time wrestling with retrieval-augmented generation workflows (splitting text, embedding it, hoping the right chunk surfaces at the right moment), the promise of a large context window feels like salvation.
But the actual behaviour of these models under long-context conditions is something else entirely. Research from Chroma measured 18 LLMs and found that models do not use their context uniformly, and that their performance becomes increasingly unreliable as input length grows. Even on tasks as simple as retrieving a specific piece of information or copying text verbatim, which you would expect a computer to handle easily, results grew less consistent as the input got longer.
Think about what that means. The model isn’t just getting a little fuzzy around the edges; it’s getting selectively blind, and it doesn’t tell you where the blind spots are.
The U-shaped hole in the middle
The most well-documented version of this problem has a name that sounds like a lost children’s book: “lost in the middle.” A landmark study from Stanford and UC Berkeley tested what happens when you move the position of relevant information within a long context. Performance degrades sharply depending on where the information sits. Models are best at locating items at the very beginning of the input, decent at catching items at the very end, and progressively worse at noticing anything wedged in between.
The researchers plotted this as a U-shaped curve, with high performance at the edges, a valley in the centre, and the shape holding even for models explicitly designed and trained for long contexts. The model pays attention to what it reads first and what it reads last. Everything in the middle gets the treatment you give the terms and conditions when installing software.
This is not a minor quirk. If you’re feeding a 100-page contract into an LLM and the clause you care about is on page 47, you have a problem. The model will confidently summarise the preamble and the signature block, and the bit about indemnification that your lawyer specifically asked about will get smoothed into a vague generality or missed altogether. The worst part is the model won’t flag the gap. It will produce a fluent, well-structured answer that happens to be wrong in the one place you needed it to be right.
The 50% wall
Here’s a number that should make anyone building on these systems nervous. Studies have shown that LLMs experience a decline in reasoning performance when processing inputs that approach or exceed approximately 50% of their maximum context length. For GPT-4o, with its 128K-token window, that means trouble begins around 64,000 tokens, roughly 48,000 words, or about the length of The Great Gatsby — a long way from the theoretical maximum.
So when someone tells you their model supports 128K tokens, what they’re really saying is that it supports about 64K tokens before things start to become unreliable, and probably less than that for anything requiring actual reasoning rather than simple retrieval. The context window isn’t a hard container that functions uniformly up to its limit. It’s more like a swimming pool, with a deep end full of murky water and things that bite.
The gap between advertised and reliable capacity is enormous, and almost no one marketing these tools mentions it. It’s the tech equivalent of selling a car with a 200 mph speedometer and not mentioning that the engine overheats at 95.
Why this matters
The context window problem compounds in ways that aren’t obvious until you’re knee-deep in a production system. Consider a common workflow in which you combine a system prompt, a set of instructions, some reference documents, and a user query into a single call. The system prompt sits at the top, and the user query sits at the bottom. The reference documents, the actual information the model needs to reason about, sit in the middle.
You have just arranged your information in the exact configuration that the Stanford research says will produce the worst results.
This isn’t hypothetical; it’s the default architecture of most LLM applications. The retrieval-augmented generation pattern, the one everyone is building on, places retrieved context squarely in the middle of the prompt. The model sees the instructions first, sees the question last, and treats the evidence like something it vaguely remembers from a party it went to three weeks ago.
And the failure mode is silent. The model doesn’t say, “I couldn’t locate that in the context you gave me.” It fabricates an answer that sounds plausible because that’s what language models do. The confidence of the output is completely decoupled from the reliability of the retrieval. You get the same polished paragraph whether the model actually found the right information or whether it’s riffing on vibes.
The attention economy inside the model
At the core of this is the transformer’s attention architecture. Self-attention, the mechanism that makes these models work, scales quadratically with sequence length: each token attends to every other token, so doubling the input roughly quadruples the number of comparisons. Cost is only half the problem. The attention weights for each token must sum to one, so as the sequence lengthens they are spread more thinly and become less focused. The model has a limited capacity for “caring about things,” and dividing that capacity across 100,000 tokens means each token gets a smaller share.
It’s similar to asking someone to listen carefully to a five-minute conversation versus a five-hour lecture. The hardware remains the same, and so do the ears, but the quality of listening worsens because attention is finite, and the demand on it has increased dramatically.
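To put rough numbers on that dilution, here is a minimal sketch in plain Python. It is a toy, not a model of any real transformer: it assumes a single query where one relevant token scores a fixed amount higher than every distractor, and every distractor scores zero. The function names and the score gap are illustrative assumptions.

```python
import math


def relevant_token_weight(n_tokens: int, score_gap: float = 5.0) -> float:
    """Softmax weight on one relevant token among n_tokens - 1 distractors.

    Toy assumption: the relevant token scores `score_gap` and every
    distractor scores 0. Real attention scores are learned and vary.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    relevant = math.exp(score_gap)
    distractors = (n_tokens - 1) * math.exp(0.0)
    return relevant / (relevant + distractors)


def pairwise_scores(n_tokens: int) -> int:
    """Query-key scores that full self-attention computes per head."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        weight = relevant_token_weight(n)
        print(f"{n:>7} tokens  weight={weight:.4f}  scores={pairwise_scores(n):,}")
```

Running it shows both effects side by side: the share of attention on the one token that matters shrinks as distractors pile up, while the number of scores grows with the square of the length. Real models learn sharper scores than this toy, so treat it as intuition rather than measurement.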
Various engineering solutions have been devised to tackle this, including sparse attention patterns, sliding-window approaches, and retrieval-augmented architectures that aim to surface relevant chunks before they get lost in noise. These help, but they are merely patches on a fundamental structural limitation. The real issue—that attention is a limited resource spread across an expanding surface—does not disappear just because you get clever with the implementation.
Some newer methods, like ring attention and different forms of hierarchical context management, show potential. However, “shows promise in a research paper” and “works reliably in production at scale” remain different places, separated by a journey that often takes years.
The quiet cost of over-stuffing
There’s a secondary problem that receives even less attention than the lost-in-the-middle effect: noise. Every token you add to the context that isn’t directly relevant to the task is a token competing for the model’s limited attention. Including a 50-page background document when only three paragraphs are needed doesn’t give the model more information; it gives it more distractions.
This contradicts the instinct most people have when working with LLMs. The natural impulse is to provide more context, not less, “just in case it needs it.” However, research indicates that the opposite approach is more effective. Precision in what you feed the model is more important than volume. A surgeon doesn’t need access to the entire hospital to operate; they need the right instruments on the right tray.
Ironically, the push for larger context windows promotes the wrong behaviour. When the window is small, you’re forced to be selective and to consider what information genuinely matters for the task. When the window is vast, there’s a temptation to skip that careful thought and include everything. The constraint tends to produce better results than the freedom.
What to do about it
If you’re designing systems that utilise LLMs with long contexts, several points follow from the research.
First, don’t take the stated capacity at face value. Treat the effective context window as about half the advertised maximum, and less for complex reasoning tasks. Design your architecture around 50-60K tokens, even if the model technically supports 128K.
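A budget check along these lines can sit in front of every call. This is a minimal sketch: the 0.5 fraction is the rule of thumb above, and the 0.75 words-per-token ratio is the same rough conversion used earlier (64,000 tokens to about 48,000 words). Both are assumptions, and a real tokenizer will give different counts. The function names are hypothetical.

```python
import math


def effective_budget(advertised_tokens: int, reliable_fraction: float = 0.5) -> int:
    """Tokens to plan around, as a fraction of the advertised window."""
    if not 0 < reliable_fraction <= 1:
        raise ValueError("reliable_fraction must be in (0, 1]")
    return int(advertised_tokens * reliable_fraction)


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count. Not a tokenizer."""
    return math.ceil(len(text.split()) / words_per_token)


def fits(parts: list[str], advertised_tokens: int, reliable_fraction: float = 0.5) -> bool:
    """True if the combined prompt parts stay inside the effective budget."""
    total = sum(estimate_tokens(part) for part in parts)
    return total <= effective_budget(advertised_tokens, reliable_fraction)


if __name__ == "__main__":
    prompt_parts = ["You are a contract analyst.", "word " * 30_000, "What is clause 12?"]
    print(effective_budget(128_000), fits(prompt_parts, 128_000))
```

For reasoning-heavy tasks, pass a smaller `reliable_fraction` rather than trusting the default.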
Second, place the important information at the edges. If you can control where information appears within the context, position the most critical content at the beginning or the end. This isn’t a trick; it reflects a documented feature of how these models process information.
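One way to do that when you already have documents ranked by relevance is to deal them out alternately to the front and the back, so the weakest material ends up in the middle. The sketch below assumes the ranking comes from somewhere else (your retriever, a reranker, a human), and both function names are made up for illustration.

```python
def order_for_edges(ranked_docs: list[str]) -> list[str]:
    """Reorder docs (most relevant first) so the best sit at the edges.

    Even-ranked items fill the front in order, odd-ranked items fill the
    back in reverse, which pushes the least relevant toward the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, doc in enumerate(ranked_docs):
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


def build_prompt(instructions: str, ranked_docs: list[str], question: str) -> str:
    """Assemble instructions, edge-ordered documents, then the question."""
    docs = "\n\n".join(order_for_edges(ranked_docs))
    return f"{instructions}\n\n{docs}\n\n{question}"


if __name__ == "__main__":
    ranked = ["most relevant", "second", "third", "fourth", "least relevant"]
    print(order_for_edges(ranked))
```

This only rearranges what you send. It does nothing about sending too much in the first place.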
Third, don’t assume the model has read everything you’ve provided. Incorporate verification into your workflow. If you’ve asked the model to reason over a specific document section, have it cite where it found the information. If it can’t indicate the source, assume it didn’t actually use it.
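A cheap version of that check is to ask the model for verbatim quotes alongside its answer, then confirm each quote really appears in the source. The sketch below assumes you have already pulled the quotes out of the model’s response (how you do that depends on your output format) and the function names are hypothetical. It catches invented quotes, not misread ones.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(quotes: list[str], source: str) -> list[str]:
    """Return the quotes that do not appear verbatim in the source.

    Empty quotes count as unsupported: a citation with no text proves nothing.
    """
    haystack = normalise(source)
    missing = []
    for quote in quotes:
        needle = normalise(quote)
        if not needle or needle not in haystack:
            missing.append(quote)
    return missing


if __name__ == "__main__":
    contract = "The Supplier shall indemnify\n the Client against all losses."
    cited = ["Supplier shall indemnify the Client", "The Client shall pay within 30 days"]
    print(unsupported_quotes(cited, contract))
```

If the returned list is non-empty, treat the answer as unverified and either retry with a narrower context or send it to a human.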
Fourth, smaller and targeted prompts outperform large and careless ones. Instead of dumping an entire repository into the context and hoping the model finds what it needs, do the retrieval work first. Extract the relevant sections, place them intentionally, and keep the context concise. A well-curated 10K-token prompt will outperform a sloppy 100K-token one nearly every time.
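As a rough illustration of doing the retrieval work first, the sketch below ranks chunks by how many words they share with the query and keeps the best ones that fit a word budget. Word overlap is a deliberately crude stand-in for whatever retrieval you actually use (embeddings, BM25, a reranker), and the function names and budget unit are assumptions.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_chunks(query: str, chunks: list[str], budget_words: int) -> list[str]:
    """Pick the chunks that best overlap the query without exceeding the budget."""
    query_words = words(query)
    # Best overlap first. sorted() is stable, so ties keep their input order.
    ranked = sorted(chunks, key=lambda c: len(query_words & words(c)), reverse=True)
    picked: list[str] = []
    used = 0
    for chunk in ranked:
        if not query_words & words(chunk):
            break  # everything from here on shares no words with the query
        size = len(chunk.split())
        if used + size <= budget_words:
            picked.append(chunk)
            used += size
    return picked


if __name__ == "__main__":
    corpus = [
        "Payment is due within 30 days.",
        "The supplier shall indemnify the client against losses.",
        "Lunch is served at noon.",
    ]
    print(select_chunks("supplier indemnify client", corpus, budget_words=20))
```

The important design choice is that irrelevant chunks are dropped even when there is room for them. Spare budget is not a reason to add noise.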
Fifth, and this is the uncomfortable truth, accept that the problem might not be solvable with current architectures. The focus on context window size is partly a marketing tactic. Vendors compete on the number because it’s easy to measure and market. But raw capacity without reliable usage is a vanity metric. A 1-million-token context window where the model reliably uses 200,000 tokens is functionally the same as a 200,000-token window. The extra zeros are merely decorative.
The bigger picture
There’s something faintly amusing about the situation. We have created machines that can write poetry, pass bar exams, and explain quantum mechanics to children, yet they can’t reliably locate a paragraph in the middle of a lengthy document. The context window challenge serves as a reminder that these systems are, at their core, statistical pattern-matching tools operating under constraints that don’t always align with how we intend to use them.
The fix will probably come, eventually. Better attention mechanisms, smarter retrieval, and hybrid architectures that combine different approaches to memory and reasoning. But right now, in production, the context window is a leaky bucket. The vendors keep making it bigger, and nobody’s fixed the holes.
The responsible thing, the boring, unglamorous, career-limiting thing, is to design systems that account for these limitations rather than pretending they don’t exist. Test with information in different positions. Measure retrieval accuracy, not just generation fluency.
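A position test does not need much machinery. The sketch below plants a known fact at different depths in filler text and records how often the answer comes back. The `ask` callable is a placeholder for your own model call (no particular vendor API is assumed), and the function names are illustrative.

```python
from typing import Callable


def build_context(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    index = round(depth * len(filler))
    return "\n".join(filler[:index] + [needle] + filler[index:])


def accuracy_by_position(
    ask: Callable[[str, str], str],
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    depths: list[float],
    trials: int = 1,
) -> dict[float, float]:
    """Fraction of trials at each depth where the reply contains `expected`."""
    results: dict[float, float] = {}
    for depth in depths:
        context = build_context(filler, needle, depth)
        hits = sum(
            expected.lower() in ask(context, question).lower()
            for _ in range(trials)
        )
        results[depth] = hits / trials
    return results


if __name__ == "__main__":
    # Stand-in "model" that only notices the first and last two lines.
    def edge_reader(context: str, question: str) -> str:
        lines = context.split("\n")
        return " ".join(lines[:2] + lines[-2:])

    filler = [f"Filler sentence {i}." for i in range(10)]
    print(accuracy_by_position(edge_reader, filler, "The code is 7421.",
                               "What is the code?", "7421", [0.0, 0.5, 1.0]))
```

Swap the stub for a real call, use filler at the lengths you run in production, and raise `trials` if your model is sampled with a non-zero temperature. Substring matching is a blunt scorer, so tighten it for anything beyond short factual answers.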
