Why AI agents keep forgetting things, and the race to fix it
By Iain
Ask ChatGPT something on Monday and return on Wednesday, and it will greet you with the warmth of a stranger. It has no recollection of your project, preferences, or the three hours you spent refining a prompt together. This amnesia is not a flaw in the traditional sense but a constraint inherent to how large language models operate. They process text within a fixed-size window, and when that window closes, everything inside it disappears.
For a chatbot answering one-off questions, this limitation is acceptable. However, for an AI agent expected to oversee a multi-week project, coordinate with other agents, or learn from its own mistakes over time, it is debilitating. As the industry advances towards agentic AI (systems that act autonomously over extended periods), the memory issue has shifted from a minor annoyance to the single greatest bottleneck in the field.
A surge of research in late 2025 and early 2026 has started to confront this challenge directly. The papers are emerging faster than anyone can read them, and a curated list maintained alongside the “Memory in the Age of AI Agents” survey already contains hundreds of entries, with new ones appearing weekly. What follows is an overview of why agent memory is so complex, what the most interesting new approaches look like, and where the gaps still remain.
The goldfish problem
To grasp the challenge, you must understand how LLMs currently process information. A model like GPT-5 or Claude handles a “context window” of tokens — roughly, words and word parts — on each call. These windows have expanded considerably, sometimes reaching 128,000 tokens or more, but they are still finite. Everything the model can reason about must fit within this window, and once a conversation exceeds it, older information simply drops out.
This is roughly equivalent to giving a human employee perfect reading comprehension but severe anterograde amnesia. They can analyse any document you present, but if you remove it for five minutes and then show it again, they will have no memory of seeing it. The employee is brilliant but entirely stateless, unable to build up experience.
The initial solutions were crude. The most common, retrieval-augmented generation (RAG), bolts a search engine onto the model: documents are stored in a vector database, the most relevant segments are retrieved when a question is asked, and those segments are inserted into the context window. This works well for fetching facts from a static knowledge base. It does not suit an agent that needs to recall what it attempted three steps earlier, why it failed, and what it decided to do differently.
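The basic RAG loop is simple enough to sketch. The following is a minimal illustration, not any particular system's implementation; the bag-of-words `embed()` is a stand-in for a real neural embedding model.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. A real system uses a neural embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    def __init__(self):
        self.docs = []  # (embedding, text) pairs

    def add(self, text):
        self.docs.append((embed(text), text))

    def retrieve(self, query, k=2):
        # Return the k most similar stored documents to the query.
        q = embed(query)
        scored = sorted(self.docs, key=lambda d: cosine(q, d[0]), reverse=True)
        return [text for _, text in scored[:k]]

store = VectorStore()
store.add("PostgreSQL listens on port 5432 by default.")
store.add("The database credentials live in environment variables.")
store.add("The weather in Paris is mild in spring.")

question = "Which port does PostgreSQL use?"
context = store.retrieve(question, k=2)
# The retrieved text is pasted into the prompt ahead of the question.
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
```

Notice what this loop cannot do: it ranks by similarity to the current question, so "what did I try three steps ago and why did it fail" has no natural representation.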
This distinction is crucial because agentic work is sequential and cumulative. An agent developing a data processing workflow does not just need to remember that PostgreSQL uses port 5432. It must also recall that it already configured the database connection, that the first attempt timed out because the credentials were wrong, and that the user prefers environment variables over hardcoded secrets.
Short-term, long-term, and the gap between
Cognitive science has long distinguished between short-term (or working) memory and long-term memory, and the agent memory community has adopted the terminology wholesale. Short-term memory is what resides within the current context window, while long-term memory encompasses everything else, stored externally and retrieved as needed.
The issue is that most systems treat these two categories as entirely separate mechanisms, governed by different rules and managed by different code paths. Short-term memory might be handled by RAG-style summarisation, while long-term memory could be managed by a knowledge graph or vector store, and the two rarely coordinate.
AgeMem, a framework published in January 2026 by Yi Yu and colleagues, addresses this directly. Instead of treating memory types as independent modules, AgeMem exposes both short-term and long-term memory operations as tools the agent can invoke, then trains the agent via reinforcement learning to independently decide what to store, what to retrieve, what to summarise, and what to discard. The model develops a unified policy for managing both memory types, much like how a human does not consciously switch between “short-term mode” and “long-term mode” when working on a problem. In tests across five reasoning benchmarks, this unified approach consistently outperformed systems that handled each memory type separately.
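The "memory operations as tools" idea can be sketched in a few lines. This is an illustrative interface only; the method names and the division into operations are invented here, not taken from the AgeMem paper, and the real system trains a policy over these calls rather than invoking them by hand.

```python
class MemoryTools:
    """Hypothetical tool surface: memory operations the agent can invoke
    like any other tool call (illustrative, not AgeMem's actual API)."""

    def __init__(self):
        self.short_term = []   # lives inside the context window
        self.long_term = {}    # external store, keyed by label

    def stm_add(self, note):
        self.short_term.append(note)

    def ltm_store(self, label, note):
        # Promote something from the working context to the external store.
        self.long_term[label] = note

    def ltm_retrieve(self, label):
        return self.long_term.get(label)

    def stm_summarise(self, summary):
        # Replace the working context with a compressed summary to save tokens.
        self.short_term = [summary]

tools = MemoryTools()
tools.stm_add("Tried connecting with default credentials; timed out.")
tools.ltm_store("db-config", "User prefers environment variables for secrets.")
tools.stm_summarise("Connection attempt 1 failed (bad credentials).")
```

The point of the reinforcement learning step is that *when* to call `ltm_store` versus `stm_summarise` is learned from task reward, not hand-coded.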
A related line of work, A-MEM (presented at NeurIPS 2025), draws inspiration from an analogue source: the Zettelkasten method, a note-taking system loved by academics and productivity enthusiasts. Instead of placing memories into a flat list, A-MEM creates structured notes with tags, keywords, and contextual descriptions, then dynamically links them to related notes in the archive. When a new memory arrives, the system analyses the existing collection and forms connections based on semantic similarity. The result is closer to a personal wiki than a filing cabinet, and it outperformed six baseline models on long-term conversational tasks while using a fraction of the tokens.
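The dynamic-linking step is the interesting part, and it can be sketched as follows. This is a toy rendering of the idea, assuming tag overlap as the linking criterion; A-MEM itself uses LLM-generated descriptions and semantic similarity, not simple tags.

```python
class Note:
    def __init__(self, text, tags):
        self.text = text
        self.tags = set(tags)
        self.links = []  # related notes, discovered when the note arrives

class NoteArchive:
    """Zettelkasten-style archive in the spirit of A-MEM (illustrative)."""

    def __init__(self):
        self.notes = []

    def add(self, text, tags):
        note = Note(text, tags)
        # Link the new note to existing notes it relates to. Here: shared tags;
        # the real system compares semantic embeddings of note descriptions.
        for other in self.notes:
            if note.tags & other.tags:
                note.links.append(other)
                other.links.append(note)
        self.notes.append(note)
        return note

archive = NoteArchive()
a = archive.add("Configured the PostgreSQL connection.", ["database", "setup"])
b = archive.add("First connection attempt timed out.", ["database", "failure"])
c = archive.add("User prefers dark mode.", ["preferences"])
```

Retrieval can then walk `links` outward from any hit, which is what makes the result feel like a wiki rather than a flat search index.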
Learning to remember (and to forget)
One of the more provocative recent shifts is the application of reinforcement learning not to the agent’s primary task but specifically to its memory management. The argument boils down to this: if you train an agent to write code or answer questions, you are training the reasoning engine, but the decisions about what to keep in memory and what to discard are themselves a skill worth training separately.
MemRL, from researchers at Shanghai Jiao Tong University and others, makes this argument explicit. MemRL treats past experiences as episodes, stored with an “intent-experience-utility” structure that tracks not just what happened but how useful that memory turned out to be. Through a two-phase retrieval process, the system first filters memories by semantic relevance (does this past experience look like the current problem?) and then selects among the candidates based on learned Q-values (did this memory actually lead to a good outcome last time?). These utility scores continuously update based on environmental feedback, like a librarian gradually learning which books patrons actually read versus which gather dust.
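The two-phase retrieval can be sketched as follows. Everything here is illustrative: the word-overlap `relevance()` stands in for a real embedding similarity, and the single-number `q_value` update is a simplified stand-in for MemRL's learned utility estimates.

```python
class Episode:
    def __init__(self, intent, experience):
        self.intent = intent          # what the agent was trying to do
        self.experience = experience  # what happened
        self.q_value = 0.0            # learned utility, updated from feedback

def relevance(query, intent):
    # Toy semantic score: word overlap. A real system would use embeddings.
    q, i = set(query.lower().split()), set(intent.lower().split())
    return len(q & i) / max(len(q | i), 1)

def retrieve(query, episodes, k_candidates=3):
    # Phase 1: filter by semantic relevance to the current problem.
    candidates = sorted(episodes, key=lambda e: relevance(query, e.intent),
                        reverse=True)[:k_candidates]
    # Phase 2: among the look-alikes, pick the one that actually helped before.
    return max(candidates, key=lambda e: e.q_value)

def update(episode, reward, lr=0.5):
    # Utility drifts toward observed outcomes, Q-learning style.
    episode.q_value += lr * (reward - episode.q_value)

episodes = [
    Episode("connect to postgres database", "used wrong password; timed out"),
    Episode("connect to postgres database", "read credentials from env vars; succeeded"),
    Episode("parse a csv file", "used the csv module"),
]
update(episodes[1], reward=1.0)   # environment feedback: this memory helped
best = retrieve("connect to the postgres database", episodes)
```

The two database episodes look identical to a pure similarity search; only the utility signal separates the one that worked from the one that did not.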
The approach keeps the underlying model frozen, eliminating the risk of “catastrophic forgetting” (the tendency for neural networks to lose old skills when learning new ones). All adaptation occurs in the memory layer rather than the weights. In tests on benchmarks ranging from code generation to embodied navigation, MemRL outperformed both traditional RAG systems and fine-tuned models.
The context window as a scarce resource
Even as context windows have expanded from 4,000 tokens to 128,000 and beyond, they remain a limited resource for agents performing complex, multi-step tasks. Each tool call generates output; each reasoning step adds text; and an agent exploring a codebase or conducting research can consume 100,000 tokens in minutes, returning to the same goldfish problem with a slightly larger fishbowl.
Memex, published in March 2026, tackles this with an approach borrowed (by name, at least) from Vannevar Bush’s 1945 vision of a personal information machine. The system maintains a compact working context of concise summaries and stable indices while archiving full-fidelity records in an external store. When the agent needs to revisit something specific, it “dereferences” an index and recovers the exact original content. Unlike summarisation-based approaches that permanently lose information, Memex preserves everything and retrieves it losslessly when needed. The accompanying MemexRL training framework teaches the agent to decide what to compress, what to archive, how to label it, and when to pull it back — all while staying within a strict context budget.
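The compress/dereference split is easy to render in miniature. This sketch is illustrative only: the index format and class shape are invented here, and the real system's decisions about what to compress are learned rather than explicit calls.

```python
class Memex:
    """Indexed archiving in the spirit of Memex (illustrative): summaries stay
    in the working context, full records are archived losslessly."""

    def __init__(self):
        self.working_context = []  # (index, summary) pairs: what the LLM sees
        self.archive = {}          # index -> full-fidelity original record

    def compress(self, record, summary):
        idx = f"mem-{len(self.archive)}"   # stable index the agent can cite later
        self.archive[idx] = record
        self.working_context.append((idx, summary))
        return idx

    def dereference(self, idx):
        # Recover the exact original; nothing was lost, unlike summarisation.
        return self.archive[idx]

mx = Memex()
log = "10:00 connect db -> timeout; 10:02 retry; [400 lines of tool output]"
idx = mx.compress(log, "First DB connection attempt failed (full log archived).")
```

The context window now carries one short summary line plus a pointer, while `dereference(idx)` returns the verbatim log whenever the agent decides it needs the details back.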
Think of it as the difference between a student tearing pages out of their textbook (lossy summarisation) and one bookmarking them with Post-it notes (indexed archiving). Both keep the desk clear, but only one can reconstruct the original information on demand.
Memory as architecture, not afterthought
A separate strand of research has moved away from cognitive-science metaphors towards a more industrial approach, treating agent memory as a systems-engineering challenge.
Pancake, from a team working on LLM serving infrastructure, frames memory management as a performance issue. When agents store and retrieve memories using vector embeddings (the standard approach), they need to perform approximate nearest-neighbour searches across potentially millions of stored items. Pancake builds a multi-tier caching hierarchy for these searches, manages index coordination across multiple concurrent agents, and distributes work between GPU and CPU. In benchmarks, it achieved more than four times the throughput of existing frameworks. The name might be whimsical, but the problem is industrial: running dozens of agents simultaneously makes the memory infrastructure the bottleneck rather than the LLM itself.
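The tiering idea is familiar from classic systems work and can be sketched with two levels. This is a generic two-tier cache for query results, illustrating the concept only; it bears no relation to Pancake's actual design, GPU/CPU split, or index coordination.

```python
from collections import OrderedDict

class TieredQueryCache:
    """Two-tier cache for nearest-neighbour query results (illustrative)."""

    def __init__(self, hot_size=2):
        self.hot = OrderedDict()   # small, fast tier (think GPU-resident)
        self.cold = {}             # larger, slower tier (think CPU memory)
        self.hot_size = hot_size

    def get(self, query):
        if query in self.hot:
            self.hot.move_to_end(query)      # refresh LRU position
            return self.hot[query]
        if query in self.cold:
            result = self.cold.pop(query)
            self._promote(query, result)     # hot again: move it up a tier
            return result
        return None  # miss: caller must run the full ANN search

    def put(self, query, result):
        self._promote(query, result)

    def _promote(self, query, result):
        self.hot[query] = result
        self.hot.move_to_end(query)
        if len(self.hot) > self.hot_size:
            # Demote the least-recently-used entry instead of discarding it.
            evicted, value = self.hot.popitem(last=False)
            self.cold[evicted] = value

cache = TieredQueryCache(hot_size=2)
cache.put("q1", "r1")
cache.put("q2", "r2")
cache.put("q3", "r3")   # q1 is demoted to the cold tier
hit = cache.get("q1")   # served from cold, promoted back to hot
```

The production problem is everything this sketch omits: keeping such caches coherent across dozens of agents hitting the same indices concurrently.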
DeepSeek, the Chinese AI lab behind the V3 and R1 models, has moved even further in a different direction. Their Conditional Memory paper (January 2026) introduces “Engram”, a module that embeds memory directly into the model architecture through massive lookup tables. The idea is that LLMs waste an absurd amount of computation doing tasks that could be handled by a simple table lookup, like recognising that “Diana, Princess of Wales” is a single entity across multiple tokens. Engram offloads that static pattern recognition to a 27-billion-parameter embedding table with O(1) retrieval, freeing up the neural network’s computational depth for reasoning. The surprising result was that the biggest performance improvements appeared not in knowledge retrieval (where a lookup table would be expected to help) but in reasoning benchmarks. By relieving the model of grunt work, Engram made the rest of the network, in functional terms, deeper.
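The "table lookup instead of computation" intuition can be shown with a toy entity tagger. To be clear, Engram is an embedding table wired into the network, not a string dictionary; the greedy longest-match lookup below, and its contents, are invented purely to illustrate why O(1) retrieval can replace layers of pattern matching.

```python
# Hypothetical static-pattern table; real Engram stores learned embeddings.
ENGRAM_TABLE = {
    ("diana", "princess", "of", "wales"): "ENTITY:Diana_Princess_of_Wales",
    ("new", "york", "city"): "ENTITY:New_York_City",
}
MAX_NGRAM = max(len(k) for k in ENGRAM_TABLE)

def tag_entities(tokens):
    """Greedy longest-match lookup: one hash probe per candidate span,
    no neural computation spent on recognising fixed patterns."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in ENGRAM_TABLE:
                out.append(ENGRAM_TABLE[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no known pattern starts here
            i += 1
    return out

tagged = tag_entities(
    ["Diana", "princess", "of", "Wales", "visited", "New", "York", "City"])
```

Four tokens collapse into one entity symbol before the expensive machinery ever sees them, which is the sense in which the rest of the network gets "deeper" for free.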
When agents need to share memories
The most challenging frontier may be multi-agent memory. A single agent remembering things is hard enough, but when multiple agents need to coordinate through shared memory, the problems multiply in ways that feel familiar to anyone who has studied concurrent computing.
A position paper published in March 2026 by Zhongming Yu and colleagues reframes the entire challenge through the lens of computer architecture. Their argument is that multi-agent memory systems are approaching the same bottleneck that plagued multi-processor computers decades ago: memory consistency. When two CPU cores read and write to the same cache line simultaneously, you need protocols to prevent either from seeing stale data. Multi-agent systems face the same problem with semantic information as with bytes. Agent A updates a shared knowledge base with new results while Agent B is simultaneously reading from that same base to make a decision. Without a consistency model, Agent B may act on information that Agent A has already invalidated.
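The stale-read hazard maps directly onto a classic fix: optimistic concurrency with versioned writes. The sketch below is a generic compare-and-swap store, shown here as an analogy; the position paper proposes a vocabulary for such protocols rather than this particular mechanism.

```python
class SharedMemory:
    """Versioned shared store (illustrative): a stale write is rejected the
    way a coherence protocol rejects a write to an invalidated cache line."""

    def __init__(self):
        self.value = None
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, value, expected_version):
        # Compare-and-swap: fail if someone else committed since we read.
        if expected_version != self.version:
            return False  # caller must re-read and reconcile
        self.value = value
        self.version += 1
        return True

mem = SharedMemory()
_, va = mem.read()                        # Agent A reads at version 0
_, vb = mem.read()                        # Agent B reads at version 0
ok_a = mem.write("A's results", va)       # A commits first; version becomes 1
ok_b = mem.write("B's decision", vb)      # B's write is stale and rejected
```

For bytes this is a solved problem; the open question the authors raise is what "stale" and "reconcile" even mean when the shared state is semantic knowledge rather than a cache line.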
The authors propose a three-layer memory hierarchy (I/O, cache, and memory) analogous to the layers of a computer system, and identify two protocol gaps that have not yet been cleanly solved. There is no standard way for one agent to share its cached results with another, and there is no access control mechanism to prevent agents from reading or overwriting memories they should not touch. Any production multi-agent system running customer-facing tasks will encounter these problems, and the fact that we lack even a vocabulary to discuss them suggests how early we are.
What is still missing
For all the progress in the past year, agent memory remains far from solved, and at least four open problems loom large.
Evaluation is the first and most immediate problem, because most memory systems are tested on question-answering benchmarks where the task is to store some facts, retrieve them later, and check if the answers are right. This misses the harder aspects of memory, such as the ability to forget outdated information gracefully, to prioritise between conflicting memories, or to know when a past experience is misleading rather than helpful.
Multimodality is the second, and it is largely untouched, because almost all current memory systems operate on text, but agents increasingly work with images, audio, code, and structured data. A memory system that can recall a conversation but not the diagram that accompanied it is incomplete.
Privacy and trust form the third gap, and arguably the most politically charged. As agent memory grows more persistent and personal, the question of who controls that memory (and who can read it) becomes pressing. The multi-agent consistency problem has a direct analogue in access control, because if your personal agent shares memory with a company’s agent, which memories are shareable and which are private?
The fourth gap, and perhaps the deepest, sits between storage and comprehension. Current systems store what happened but do not, in any meaningful sense, grasp why it happened. An agent that remembers failing at a task and tries a different approach next time is performing pattern matching, not causal reasoning. The distance between those two things may turn out to be the distance between a useful tool and a genuinely intelligent one.
The flurry of papers landing on arXiv every week suggests that agent memory is where attention mechanisms were around 2018: a problem that everyone recognises as central, where the solution space is being explored aggressively, but where the winning architecture has not yet emerged. Your AI assistant, meanwhile, will continue to greet you on Wednesday as if it has never met you, staring out through a clean context window at a world it has already seen and forgotten.