Out of context: strategies for managing agent memory

By Iain


There is a strange arms race under way in AI: the relentless expansion of the context window, the maximum amount of input a large language model can accept in a single pass. The race is driven by a persistent notion that a larger context equals greater intelligence and capability. Google's Gemini 1.5 Pro now supports a million tokens, Anthropic's Claude handles 200,000, and OpenAI keeps raising GPT-4's limit. The promise is clear: feed the machine your entire codebase, legal documents, or a complete novel, and it will analyse the material with flawless, almost divine understanding. The reality is that the technology does not yet keep that promise.

The sales pitch versus the reality

The marketing version of context windows appears clean and satisfying. A bigger window means the model can “see” more of your data at once, enabling it to draw connections across larger bodies of information. This means you can stop chopping your documents into smaller fragments and hand over the entire lot. For anyone who has spent time wrestling with retrieval-augmented generation workflows, splitting text, embedding it, hoping the right chunk surfaces at the right moment, the promise of a large context window feels like salvation.

But the actual behaviour of these models under long-context conditions is something else entirely. Research from Chroma measured 18 LLMs and found that models do not use their context uniformly: performance becomes increasingly unreliable as input length grows. Even on tasks as simple as retrieving a specific piece of information or copying text verbatim, which you would expect a computer to handle easily, results grew more inconsistent as the input lengthened.

Think about what that means. The model isn’t just getting a little fuzzy around the edges; it’s getting selectively blind, and it doesn’t tell you where the blind spots are.

The U-shaped hole in the middle

The most well-documented version of this problem has a name that sounds like a lost children’s book: “lost in the middle.” A landmark study from Stanford and UC Berkeley tested what happens when you move the position of relevant information within a long context. Performance degrades sharply depending on where the information sits. Models are best at locating items at the very beginning of the input, decent at catching items at the very end, and progressively worse at noticing anything wedged in between.

The researchers plotted this as a U-shaped curve, with high performance at the edges, a valley in the centre, and the shape holding even for models explicitly designed and trained for long contexts. The model pays attention to what it reads first and what it reads last. Everything in the middle gets the treatment you give the terms and conditions when installing software.
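You can probe this curve yourself with a simple needle-in-a-haystack test: plant one known fact at varying depths inside filler text and check whether the model retrieves it. The sketch below assumes a `call_model` function standing in for whichever LLM client you use; the filler and needle are illustrative.

```python
# Needle-in-a-haystack probe: place one known fact at varying relative
# depths inside filler text and record whether the model retrieves it.
# `call_model` is a hypothetical stand-in for your LLM client.

FILLER = "The sky was grey and nothing of note happened. " * 200
NEEDLE = "The vault code is 7381."

def build_prompt(depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:] + "\n\nWhat is the vault code?"

def position_sweep(call_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: retrieved?}. Plot the results and look for the valley."""
    return {d: "7381" in call_model(build_prompt(d)) for d in depths}
```

Run the sweep against your own model and the U-shape typically emerges at the middle depths, exactly where the research predicts.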

This is not a minor quirk. If you’re feeding a 100-page contract into an LLM and the clause you care about is on page 47, you have a problem. The model will confidently summarise the preamble and the signature block, and the bit about indemnification that your lawyer specifically asked about will get smoothed into a vague generality or missed altogether. The worst part is the model won’t flag the gap. It will produce a fluent, well-structured answer that happens to be wrong in the one place you needed it to be right.

The 50% wall

Here’s a number that should make anyone building on these systems nervous. Studies have shown that LLMs experience a decline in reasoning performance when processing inputs that approach or exceed approximately 50% of their maximum context length. For GPT-4o, with its 128K-token window, that means trouble begins around 64,000 tokens, roughly 48,000 words, or about the length of The Great Gatsby — a long way from the theoretical maximum.

So when someone tells you their model supports 128K tokens, what they’re really saying is that it supports about 64K tokens before things start to become unreliable, and probably less than that for anything requiring actual reasoning rather than simple retrieval. The context window isn’t a hard container that functions uniformly up to its limit. It’s more like a swimming pool, with a deep end full of murky water and things that bite.
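As a planning heuristic, that rule of thumb can be written down directly. The ratios below are assumptions drawn from the research discussed above (roughly half the advertised window, less for multi-step reasoning), not vendor-published figures.

```python
# Rule-of-thumb budget planner. The ratios are assumptions based on the
# ~50% degradation findings, not official per-model specifications.

def effective_budget(advertised_tokens: int, task: str = "retrieval") -> int:
    """Estimate a usable token budget for a given advertised window."""
    ratio = {"retrieval": 0.5, "reasoning": 0.4}[task]
    return int(advertised_tokens * ratio)

effective_budget(128_000)               # -> 64000
effective_budget(128_000, "reasoning")  # -> 51200
```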

The gap between advertised and reliable capacity is enormous, and almost no one marketing these tools mentions it. It’s the tech equivalent of selling a car with a 200 mph speedometer and not mentioning that the engine overheats at 95.

Why this matters

The context window problem compounds in ways that aren’t obvious until you’re knee-deep in a production system. Consider a common workflow in which you combine a system prompt, a set of instructions, some reference documents, and a user query into a single call. The system prompt sits at the top, and the user query sits at the bottom. The reference documents, the actual information the model needs to reason about, sit in the middle.

You have just arranged your information in the exact configuration that the Stanford research says will produce the worst results.

This isn’t hypothetical; it’s the default architecture of most LLM applications. The retrieval-augmented generation pattern, the one everyone is building on, places retrieved context squarely in the middle of the prompt. The model sees the instructions first, sees the question last, and treats the evidence like something it vaguely remembers from a party it went to three weeks ago.
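The default layout, and one possible mitigation, can be sketched in a few lines. The function names and the repeated-question tactic are illustrative, not a prescribed fix; the point is simply where the evidence lands relative to the high-attention edges.

```python
# Two ways to assemble the same RAG prompt. The default layout buries the
# evidence in the middle, the worst position per the research. The edge
# layout states the question before and after the documents so the evidence
# sits next to the parts of the prompt the model attends to most.

def default_layout(system: str, docs: list[str], question: str) -> str:
    # instructions first, question last, evidence lost in the middle
    return "\n\n".join([system, *docs, question])

def edge_layout(system: str, docs: list[str], question: str) -> str:
    # repeat the question on both sides of the evidence
    return "\n\n".join([system, f"Question: {question}", *docs,
                        f"Answer the question: {question}"])
```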

And the failure mode is silent. The model doesn’t say, “I couldn’t locate that in the context you gave me.” It fabricates an answer that sounds plausible because that’s what language models do. The confidence of the output is completely decoupled from the reliability of the retrieval. You get the same polished paragraph whether the model actually found the right information or whether it’s riffing on vibes.

The attention economy inside the model

At the core of this is the transformer's attention architecture. Self-attention, the mechanism that makes these models work, scales quadratically with sequence length. Each token must attend to every other token, and as the sequence lengthens, the computational cost rises sharply and the attention scores become less focused. The model has a limited capacity for “caring about things,” and dividing that capacity across 100,000 tokens means each token gets a smaller share.
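The quadratic growth is easy to make concrete: the number of pairwise attention scores is the square of the sequence length.

```python
# Self-attention compares every token with every other token, so the number
# of pairwise attention scores grows with the square of the sequence length.

def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score entries in full self-attention."""
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):>18,} score entries")
```

A 100x longer input costs 10,000x more attention computation, and each token's attention mass is spread across 100x more positions.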

It’s similar to asking someone to listen carefully to a five-minute conversation versus a five-hour lecture. The hardware remains the same, and so do the ears, but the quality of listening worsens because attention is finite, and the demand on it has increased dramatically.

Various engineering solutions have been devised to tackle this, including sparse attention patterns, sliding-window approaches, and retrieval-augmented architectures that aim to surface relevant chunks before they get lost in noise. These help, but they are merely patches on a fundamental structural limitation. The real issue—that attention is a limited resource spread across an expanding surface—does not disappear just because you get clever with the implementation.

Some newer methods, like ring attention and different forms of hierarchical context management, show potential. However, “shows promise in a research paper” and “works reliably in production at scale” remain different places, separated by a journey that often takes years.

The quiet cost of over-stuffing

There’s a secondary problem that receives even less attention than the lost-in-the-middle effect: noise. Every token you add to the context that isn’t directly relevant to the task is a token competing for the model’s limited attention. Including a 50-page background document when only three paragraphs are needed doesn’t give the model more information; it gives it more distractions.

This contradicts the instinct most people have when working with LLMs. The natural impulse is to provide more context, not less, “just in case it needs it.” However, research indicates that the opposite approach is more effective. Precision in what you feed the model is more important than volume. A surgeon doesn’t need access to the entire hospital to operate; they need the right instruments on the right tray.

Ironically, the push for larger context windows promotes the wrong behaviour. When the window is small, you are forced to be selective, weighing which information genuinely matters for the task. When the window is vast, the temptation is to skip that careful thought and include everything. The constraint, paradoxically, produced better results than the abundance.

What to do about it

If you’re designing systems that utilise LLMs with long contexts, several points follow from the research.

First, don’t rely solely on the stated capacity. Treat the effective context window as about half the advertised maximum, and even less for complex reasoning tasks. Arrange your architecture around 50-60K tokens, even if the model technically supports 128K.

Second, place the important information at the edges. If you can control where information appears within the context, position the most critical content at the beginning or the end. This isn’t a trick; it reflects a documented feature of how these models process information.
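One simple way to act on this, assuming you have documents ranked by relevance, is to interleave them so the best material lands at the edges and the weakest falls into the middle, the inverse of the U-shaped curve. The ordering scheme below is an illustrative sketch, not a standard algorithm.

```python
# Given documents sorted most-relevant-first, alternate them onto the
# front and back of the context so high-relevance items occupy the
# high-attention edges and low-relevance items fill the middle.

def edge_order(docs_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

edge_order(["1st", "2nd", "3rd", "4th", "5th"])
# -> ["1st", "3rd", "5th", "4th", "2nd"]
```

The most relevant document opens the context, the second most relevant closes it, and the least relevant sits where inattention does the least damage.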

Third, don’t assume the model has read everything you’ve provided. Incorporate verification into your workflow. If you’ve asked the model to reason over a specific document section, have it cite where it found the information. If it can’t indicate the source, assume it didn’t actually use it.
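A minimal version of that verification step: ask the model to quote the exact sentence it relied on, then check the quote actually appears in the source before trusting the answer. The loose whitespace normalisation here is an assumption to tolerate line-wrapping differences.

```python
import re

# Post-hoc grounding check: if the model's supporting quote cannot be
# found verbatim (modulo whitespace and case) in the source document,
# treat the answer as unsupported.

def normalise(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_is_grounded(quote: str, source: str) -> bool:
    return normalise(quote) in normalise(source)
```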

Fourth, smaller and targeted prompts outperform large and careless ones. Instead of dumping an entire repository into the context and hoping the model finds what it needs, do the retrieval work first. Extract the relevant sections, place them intentionally, and keep the context concise. A well-curated 10K-token prompt will outperform a sloppy 100K-token one nearly every time.
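The retrieval-first approach amounts to scoring chunks against the query and greedily keeping the best ones until a deliberate token budget, well under the advertised window, is spent. The word-overlap scorer below is a naive placeholder (in practice you would use an embedding model); the budgeting logic is the point.

```python
# Retrieval-first curation sketch: rank chunks by a relevance score and
# keep the best until the token budget is exhausted. Word overlap is a
# deliberately naive stand-in for a real embedding-based scorer.

def score(chunk: str, query: str) -> int:
    return len(set(chunk.lower().split()) & set(query.lower().split()))

def curate(chunks: list[str], query: str, token_budget: int = 10_000) -> list[str]:
    kept, spent = [], 0
    for chunk in sorted(chunks, key=lambda c: score(c, query), reverse=True):
        cost = len(chunk.split())  # crude token estimate
        if spent + cost <= token_budget:
            kept.append(chunk)
            spent += cost
    return kept
```

Even this crude version enforces the discipline the essay argues for: the model only ever sees a small, intentionally chosen slice of the corpus.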

Fifth, and this is the uncomfortable truth, accept that the problem might not be solvable with current architectures. The focus on context window size is partly a marketing tactic. Vendors compete on the number because it’s easy to measure and market. But raw capacity without reliable usage is a vanity metric. A 1-million-token context window where the model reliably uses 200,000 tokens is functionally the same as a 200,000-token window. The extra zeros are merely decorative.

The bigger picture

There’s something somewhat amusing about the situation. We have created machines that can write poetry, pass bar exams, and explain quantum mechanics to children, yet they can’t reliably locate a paragraph in the middle of a lengthy document. The context window challenge serves as a reminder that these systems are, at their core, statistical pattern-matching tools operating under constraints that don’t always align with how we intend to use them.

The fix will probably come, eventually. Better attention mechanisms, smarter retrieval, and hybrid architectures that combine different approaches to memory and reasoning. But right now, in production, the context window is a leaky bucket. The vendors keep making it bigger, and nobody’s fixed the holes.

The responsible thing, the boring, unglamorous, career-limiting thing, is to design systems that account for these limitations rather than pretending they don’t exist. Test with information in different positions. Measure retrieval accuracy, not just generation fluency.
