Memory & Knowledge Management: what the agent remembers between steps.
An agent that forgets everything between steps can barely act. An agent that remembers everything, indiscriminately, is drowned in its own past. What agents actually need is what humans have spent millions of years evolving: a tiered memory architecture that stores the right things at the right granularity — keeping fresh context immediately accessible, condensing older experience into stable knowledge, and forgetting the rest. This chapter maps that architecture onto the technical substrate available to AI agents today.
Prerequisites
This chapter builds on the cognitive architecture introduced in LLM-Based Agents (Ch 02), particularly the perception/memory/planning/action decomposition. The RAG sections assume familiarity with dense retrieval and embedding models; the full treatment is in Retrieval-Augmented Generation (Part VI Ch 09). Vector store indexing references concepts from Linear Algebra (Part I Ch 01) — inner products and approximate nearest neighbours.
A Taxonomy of Memory
Cognitive science has long distinguished types of memory by their duration, capacity, and the kind of content they hold. The same distinctions turn out to map almost perfectly onto the technical mechanisms available to AI agents — which is not coincidental. The field borrowed the vocabulary deliberately, and the analogies are genuinely illuminating.
Working memory: the immediate scratchpad. In humans, roughly 7±2 chunks; in LLM agents, the full context window — everything currently "in view." Fast access, strictly limited capacity, cleared between sessions.
Episodic memory: time-stamped records of specific past events — prior conversations, completed tasks, observations during past runs. In agents, typically stored externally and retrieved on demand.
Semantic memory: factual world knowledge — concepts, relationships, procedures — not tied to any specific episode. In agents, it lives both in model weights and in external knowledge bases.
Human cognition also distinguishes procedural memory (how to perform a skill — riding a bike, touch-typing) and prospective memory (remembering to do something in the future). Both have agent analogues: procedural memory maps to fine-tuned skills or cached code routines; prospective memory maps to scheduled tasks and goal queues. These are covered in later chapters; this chapter focuses on the core three.
Why the Taxonomy Matters for System Design
Each memory type has different requirements for storage, retrieval speed, persistence, and update frequency. Conflating them leads to poor design: an agent that tries to cram all episodic history into the context window will run out of space; one that relies on parametric weights alone cannot update its knowledge without expensive retraining. A well-designed agent treats each type as a first-class concern and routes information through the appropriate channel.
Working Memory & the Context Window
The LLM's context window is its working memory. Everything the model can attend to in a single forward pass — the system prompt, recent conversation turns, retrieved documents, tool outputs, the current reasoning trace — lives here. It is the highest-bandwidth, lowest-latency memory available: any token in the window is equally accessible to every attention head without a retrieval step.
Context windows have grown dramatically: from GPT-3's 4K tokens (roughly 3,000 words) to GPT-4 Turbo's 128K, Claude's 200K, and Gemini 1.5 Pro's 1M. But the growth has not eliminated the working-memory bottleneck — it has changed its character. The question is no longer "does the information fit?" but "does the model pay attention to it once it's there?"
The Lost-in-the-Middle Problem
Liu et al. (2023) documented that LLM performance on fact retrieval from long contexts degrades sharply when the relevant information is placed in the middle of a long document. Models attend robustly to the beginning and end of the context but effectively lose track of content in between — a "lost in the middle" phenomenon analogous to the human serial position effect. This has concrete implications: agents should not assume that anything placed in context is equally accessible. Critical information should be placed at the start or end, or re-surfaced explicitly when needed.
Context Window Management Strategies
Practitioners have developed several patterns to manage working memory efficiently. Sliding windows retain the most recent \(k\) turns, discarding older ones — preserving recency at the cost of losing early context. Summarisation compresses older turns into a running summary before they leave the window, trading verbatim accuracy for space. Chunked retrieval keeps only the system prompt and the current exchange in context, retrieving relevant history on demand — effectively converting working memory into a retrieval problem.
The most sophisticated systems combine all three: a short active window for recent turns, a compressed summary of the session so far, and a retrieval layer for anything older. This mirrors the human memory architecture almost exactly.
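A minimal sketch of that combined pattern, in Python. The `ManagedContext` class and the `summarise` stub are illustrative names, not any library's API; a real implementation would replace the stub with an LLM call that folds evicted turns into the running summary.

```python
def summarise(old_summary, evicted_turns):
    # Stub: in practice, an LLM call that compresses evicted turns into the summary.
    return old_summary + " | " + " ".join(t["text"] for t in evicted_turns)

class ManagedContext:
    """Sliding window of recent turns plus a running summary of everything older."""
    def __init__(self, max_turns=4):
        self.max_turns = max_turns
        self.summary = ""
        self.turns = []

    def add_turn(self, role, text):
        self.turns.append({"role": role, "text": text})
        if len(self.turns) > self.max_turns:
            # Evict the oldest turns into the summary rather than discarding them.
            evicted, self.turns = self.turns[:-self.max_turns], self.turns[-self.max_turns:]
            self.summary = summarise(self.summary, evicted)

    def render(self):
        # Summary first, recent turns verbatim: old context is compressed, not lost.
        parts = [f"[summary] {self.summary}"] if self.summary else []
        parts += [f"{t['role']}: {t['text']}" for t in self.turns]
        return "\n".join(parts)
```

The retrieval layer for anything older than the summary would sit outside this class, queried on demand.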
A 200K-token context window does not mean 200K tokens of reliable working memory. Empirically, effective recall begins to degrade at around 32–64K tokens for most models, and many practical agent tasks saturate useful context at far less. Longer windows are most valuable for specific tasks — reading an entire codebase, processing a long document — not for general-purpose agent memory. Plan accordingly.
Episodic Memory
Episodic memory stores records of specific events: what the agent did, what it observed, what happened as a result, and when. Unlike semantic knowledge, episodes are time-stamped and contextually situated — they carry the fingerprint of their origin. For AI agents, episodic memory is the mechanism that allows learning from experience across sessions, without retraining the underlying model.
The Generative Agents paper (Park et al., 2023) provided the most influential demonstration. Each agent in their simulated social world maintained a memory stream — a flat, append-only log of natural-language observations ("Alice spoke to Bob about the party at 10:04 AM"). Memories were stored with a timestamp, an importance score (assigned by the LLM at write time), and later retrieved by a composite score combining recency, importance, and relevance to the current query.
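The composite score can be sketched as follows. The structure (recency decay, LLM-assigned importance, embedding relevance) follows the paper; this toy version simply sums the three components, assuming each has already been scaled to roughly [0, 1], where the paper min-max normalises and weights them.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieval_score(memory, query_vec, now_hours, decay=0.995):
    """Composite retrieval score: recency + importance + relevance."""
    recency = decay ** (now_hours - memory["last_access_hours"])
    importance = memory["importance"] / 10   # LLM-assigned 1-10 at write time
    relevance = cosine(memory["embedding"], query_vec)
    return recency + importance + relevance
```

Given two otherwise identical memories, the more recently accessed one scores higher, which is exactly the recency bias the architecture is designed to provide.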
Write-Time vs. Read-Time Processing
A critical design choice is how much processing to do when a memory is written versus when it is read. Minimal write-time processing (store the raw observation as text plus an embedding) keeps writes cheap and preserves fidelity, but puts the burden of interpretation on retrieval. Heavy write-time processing (extract entities, assign importance, summarise) makes retrieval faster and more precise, but is expensive and can introduce write-time errors that are hard to correct later.
Reflexion (Shinn et al., 2023) takes an extreme write-heavy approach: after each failed episode, the agent produces a long-form verbal reflection that is stored and prepended to future attempts. This is episodic memory as explicit self-narration — the stored memory is not a raw observation but an already-interpreted lesson. This works well for iterative task improvement but scales poorly to high-frequency event logging.
Episodic Memory as a Database
Practically, episodic memory is implemented as an append-only store — often a vector database — where each record contains a natural-language description of the event, a dense embedding for semantic search, and structured metadata (timestamp, importance, source, associated agent/session). Retrieval then becomes a hybrid search over semantic similarity and metadata filters. We return to the infrastructure details in §7.
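A toy version of such a store makes the record shape concrete. `Episode` and `EpisodicStore` are illustrative names; a production system would back this with a real vector database and ANN search rather than a linear scan.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Episode:
    text: str             # natural-language description of the event
    embedding: list       # dense vector for semantic search
    timestamp: datetime
    importance: int       # LLM-assigned at write time
    session_id: str

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class EpisodicStore:
    def __init__(self):
        self.records = []  # append-only log

    def write(self, episode):
        self.records.append(episode)

    def search(self, query_vec, session_id=None, top_k=3):
        # Hybrid retrieval: metadata filter first, then similarity ranking.
        pool = [r for r in self.records
                if session_id is None or r.session_id == session_id]
        return sorted(pool, key=lambda r: -dot(r.embedding, query_vec))[:top_k]
```

The metadata-then-similarity ordering matters at scale: filtering first shrinks the candidate pool before the expensive vector comparison.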
Semantic Memory
Semantic memory holds general-purpose knowledge: facts, concepts, relationships, and procedures that are not tied to any particular experience. For an agent helping with software engineering, semantic memory includes knowledge of programming languages, libraries, design patterns, and best practices. Unlike episodic memory, semantic memories are not time-stamped — they represent stable, decontextualised knowledge.
In AI agents, semantic memory lives in two places simultaneously. Parametric semantic memory is encoded in the model's weights during pretraining — the model "knows" things because it has seen them millions of times in training data. External semantic memory lives in knowledge bases, documents, and databases that the agent can query at runtime. The relationship between these two forms is a central tension in agent design.
Knowledge Graphs as Structured Semantic Memory
Knowledge graphs (KGs) provide a structured alternative to unstructured document retrieval. A KG represents knowledge as a set of typed triples: (subject, predicate, object) — e.g., (Python, is_a, programming_language), (Django, built_with, Python). Agents can query KGs with SPARQL or graph traversal, producing verifiable, structured answers rather than the fuzzy similarity-based results of vector retrieval.
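A minimal in-memory triple store makes the query model concrete. The `TripleStore` class and its wildcard-pattern `query` method are illustrative stand-ins for a real SPARQL engine's basic graph patterns.

```python
class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None):
        # Pattern match with None as wildcard, like a SPARQL basic graph pattern.
        return [(ts, tp, to) for (ts, tp, to) in self.triples
                if (s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)]

kg = TripleStore()
kg.add("Python", "is_a", "programming_language")
kg.add("Django", "built_with", "Python")
```

Unlike vector retrieval, every answer here is an exact, verifiable match against stored facts.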
The limitation is construction cost: building a high-quality KG requires either expensive human curation or imperfect automated extraction. Hybrid systems — LLM agents that retrieve from both a KG and a vector store, using the KG for structured facts and the vector store for unstructured context — have become a common production pattern.
Maintaining Semantic Memory Currency
Parametric knowledge becomes stale as the world changes. A model trained on data through 2024 knows nothing about events in 2025 unless they are injected at runtime via retrieval. Keeping semantic memory current requires either frequent retraining (expensive), retrieval augmentation (effective for factual updates), or continual learning — updating weights incrementally without catastrophic forgetting. Continual learning remains an active research area; most production systems default to retrieval augmentation for knowledge currency.
Parametric Memory
Parametric memory is the knowledge baked into a model's weights during training. Unlike other memory types, it has no explicit retrieval mechanism — the model does not "look something up" in its weights; it produces outputs consistent with what it absorbed during training. This makes parametric memory extraordinarily fast (no retrieval latency) and generalised (it applies pattern-matching across domains), but also opaque, uncorrectable without retraining, and bounded by the training data cutoff.
Petroni et al. (2019) famously probed BERT's parametric memory using cloze-style prompts ("The Eiffel Tower is located in ___") and found that even early language models stored surprising quantities of factual knowledge. Subsequent work by Meng et al. (ROME, 2022) showed that specific factual associations could be identified and surgically modified within the MLP layers of transformer models — suggesting that parametric knowledge has localisable structure, not just diffuse statistical associations.
Knowledge Neurons
Dai et al. (2022) identified "knowledge neurons" — specific neurons in the feed-forward layers of BERT whose activation correlates with the expression of particular factual knowledge. Suppressing these neurons reduces the model's ability to express the associated fact. This provides mechanistic evidence that parametric memory is not purely distributed: it has structure, though not the neat key-value structure of an explicit database.
Parametric vs. Contextual Memory Conflict
A well-documented failure mode arises when information retrieved into the context window conflicts with the model's parametric memory. Models often exhibit knowledge conflict — defaulting to their trained beliefs even when a retrieved document clearly states otherwise. This is particularly acute for facts that changed after the training cutoff. Mitigation strategies include explicit prompting ("Use only the information in the documents below, not your prior knowledge") and training on instruction datasets that reward contextual fidelity over parametric recall.
Parametric memory is best trusted for stable, widely-attested knowledge that would have appeared many times in training data: core scientific facts, standard programming APIs (pre-cutoff), mathematical relationships. It should not be trusted for recent events, niche domain facts, personal or proprietary information, or any claim that requires precision the model cannot verify. When in doubt, retrieve.
RAG as External Memory
Retrieval-Augmented Generation (RAG, Lewis et al., 2020) was originally conceived as a way to keep language models factually grounded by conditioning generation on retrieved documents. In the agent context, RAG has evolved into something broader: a general-purpose mechanism for connecting a model's working memory to any external corpus — documents, databases, conversation histories, code repositories, or other agents' notes. RAG is the most practical answer to the question "how does an agent know things it wasn't trained on?"
The basic pipeline — embed the query, retrieve the top-\(k\) documents by cosine similarity, stuff them into context, generate — is well understood. What distinguishes agent RAG from document-QA RAG is the diversity and volume of sources, the importance of metadata filtering, and the need for multi-hop retrieval: answers that require chaining across multiple retrieved documents.
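The basic pipeline fits in a few lines. Here `embed` and `llm` are caller-supplied functions rather than any particular library's API, and the corpus is scanned linearly; a real system would use an ANN index for the ranking step.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def rag_answer(query, corpus, embed, llm, k=3):
    """Naive RAG: embed the query, rank documents by cosine similarity,
    stuff the top-k into the prompt, and generate."""
    qv = embed(query)
    ranked = sorted(corpus, key=lambda doc: -cosine(embed(doc), qv))
    context = "\n---\n".join(ranked[:k])
    prompt = f"Answer using ONLY these documents:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
```

The explicit "ONLY these documents" instruction is one of the mitigations for parametric-contextual knowledge conflict discussed later in this chapter.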
RAG vs. Pure In-Context Memory
The choice between keeping information in the context window versus retrieving it on demand is a fundamental architectural decision. In-context memory is always available with zero latency, but consumes tokens and degrades attention for distant content. RAG retrieves only what is needed, preserving context space, but introduces retrieval latency and can miss relevant content if the query is poorly formed.
| Property | In-Context | RAG (External) | Parametric |
|---|---|---|---|
| Capacity | Context window (~100K–1M tokens) | Effectively unlimited | Fixed at training time |
| Latency | Zero — already in context | ms–s (embedding + ANN search) | Zero — implicit in forward pass |
| Updateable? | Yes — re-inject into context | Yes — update the index | Only via fine-tuning / retraining |
| Verifiable? | Yes — model can cite what's in context | Yes — sources are explicit | No — knowledge origin is opaque |
| Best for | Current task state, recent turns | Large corpora, long-term memory | Stable world knowledge, reasoning |
Advanced RAG Patterns
Hypothetical Document Embeddings (HyDE) generate a hypothetical answer to the query first, then embed and retrieve against that hypothetical — often finding better matches than embedding the question directly. Multi-query retrieval generates several paraphrases of the query and merges the retrieved sets, improving recall. Recursive retrieval uses retrieved documents to refine subsequent queries, enabling multi-hop reasoning across a corpus. Re-ranking applies a cross-encoder (which jointly embeds query and candidate) after initial retrieval to re-score top-\(k\) candidates before injecting them into context — trading latency for precision.
Vector Stores
A vector store is a database optimised for storing and searching dense vector embeddings. It is the infrastructure layer that makes external semantic and episodic memory practical at scale. Rather than exact lookup (which would require comparing a query against every stored vector), vector stores use approximate nearest-neighbour (ANN) algorithms that trade a small accuracy loss for orders-of-magnitude speed improvement.
Core Indexing Algorithms
HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph where each node is connected to its nearest neighbours. Search starts at the top layer (sparse, fast traversal) and progressively descends to finer layers, converging quickly on approximate neighbours. HNSW achieves very high recall at low latency and is the dominant algorithm in production vector stores (Weaviate, Qdrant, pgvector with HNSW).
IVF (Inverted File Index) partitions vectors into Voronoi cells. A query is assigned to the nearest cell(s), and search is restricted to those cells. IVF is more memory-efficient than HNSW and scales better to very large corpora, but requires a training phase to build the cell partition. Faiss (Facebook AI) is the standard IVF implementation.
PQ (Product Quantisation) compresses vectors by decomposing them into sub-vectors, each encoded with a small codebook. This dramatically reduces memory footprint at the cost of recall — useful when the corpus is too large to fit in RAM. IVF combined with PQ (Faiss's IndexIVFPQ) is a common configuration for billion-vector scale.
Hybrid Search: Vector + Keyword
Pure vector search excels at semantic similarity but performs poorly when the query contains rare or proper-noun terms that the embedding model doesn't represent well (a product name, a code snippet, an obscure acronym). Hybrid search combines a sparse retriever (BM25, TF-IDF) with the dense vector retriever and merges results via Reciprocal Rank Fusion (RRF) or a learned combination. Most production RAG systems use hybrid search as the default.
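Reciprocal Rank Fusion itself is simple enough to sketch exactly: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k = 60 being the commonly used constant.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids
    (e.g. one from BM25, one from dense vector search)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, not raw scores, it needs no calibration between the sparse and dense retrievers — the main reason it is the default fusion method.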
The Vector Store Landscape
Purpose-built vector databases (Pinecone, Weaviate, Qdrant, Chroma, Milvus) offer the richest feature sets — metadata filtering, multi-tenancy, real-time updates. Established databases have added vector support: pgvector brings ANN search to PostgreSQL, Redis has added a vector index module, and Elasticsearch supports dense retrieval via approximate kNN. For small-to-medium corpora (<10M vectors), any of these works well; for billion-scale, Milvus or Faiss with IVF+PQ is the standard choice.
Retrieval Strategies
How you retrieve matters as much as what you store. The gap between naive top-\(k\) retrieval and a well-tuned retrieval pipeline can account for 20–40 percentage points of task performance on complex knowledge-intensive tasks. This section covers the algorithmic choices that make the difference.
Chunking Strategy
Before a document can be stored, it must be split into retrievable units. Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is simple but breaks semantic units arbitrarily. Semantic chunking uses sentence boundaries, paragraph boundaries, or section headings as natural split points. Hierarchical chunking stores both fine-grained chunks (paragraphs) and their parent sections, retrieving both levels and injecting the most relevant fine chunks with their section context. This improves coherence significantly at moderate extra cost.
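Fixed-size chunking with overlap, the simplest of the three, can be sketched directly; `tokens` here is any pre-tokenised sequence.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token sequence into fixed-size chunks with a fixed overlap."""
    step = size - overlap
    # The upper bound avoids a final chunk that would consist only of the
    # previous chunk's overlap region.
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

The overlap exists so that a fact straddling a chunk boundary survives intact in at least one chunk — the main failure mode of non-overlapping splits.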
Query Transformation
Raw user queries are often poor retrieval queries — they may be too short, ambiguous, or use different vocabulary than the stored documents. Query transformation improves retrieval without touching the index. Query expansion appends synonyms or related terms. Step-back prompting first answers a more abstract question and uses that to refine the specific query. Multi-query generates 3–5 paraphrases and merges results. HyDE (Gao et al., 2022) generates a hypothetical ideal document then retrieves against its embedding.
def hyde_retrieve(query, k):
    hyp = LLM("Write a document that answers: " + query)
    vec = embed(hyp)  # embed the *answer*, not the question
    return vector_store.search(vec, top_k=k)
Re-ranking
Bi-encoder retrieval (embedding query and documents separately) is fast but imprecise — it can't capture fine-grained query-document interaction. A cross-encoder re-ranker processes the query and each retrieved candidate jointly, producing a precise relevance score at the cost of \(O(k)\) additional inference calls. The typical pipeline: retrieve top-50 with the bi-encoder, re-rank with the cross-encoder, pass top-5 to the model. Cross-encoders from the Cohere, Jina, or voyage-ai families are common choices.
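The two-stage pipeline can be sketched as a single function, with `bi_retrieve` and `cross_score` as caller-supplied callables rather than a specific vendor API.

```python
def retrieve_and_rerank(query, bi_retrieve, cross_score, fetch_k=50, final_k=5):
    """Two-stage retrieval: cheap bi-encoder recall over the whole corpus,
    then expensive cross-encoder precision over the candidates."""
    candidates = bi_retrieve(query, fetch_k)
    # cross_score(query, doc) scores the query-document pair jointly: O(fetch_k) calls.
    reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
    return reranked[:final_k]
```

The latency cost is bounded by `fetch_k`, which is the main tuning knob: a larger pool improves recall, a smaller one keeps the cross-encoder bill down.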
Multi-Hop Retrieval
Many questions require chaining across multiple documents: "What framework does the author of Project X typically recommend for database migrations?" — which requires finding the author of Project X, then finding their recommendations. Single-hop RAG fails here. Multi-hop strategies include iterative retrieval (retrieve, read, identify what else is needed, retrieve again) and decomposed retrieval (break the question into sub-questions, retrieve for each, combine). Both are implemented naturally in a ReAct loop where the agent issues multiple retrieve tool calls.
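A sketch of the iterative strategy, with `retrieve` and `llm` as injected dependencies and a hop budget to bound cost. The sufficiency check (asking the model whether it can answer yet, or what to look up next) is one common realisation; a ReAct loop with a retrieval tool is another.

```python
def multi_hop_answer(question, retrieve, llm, max_hops=3):
    """Iterative multi-hop retrieval: retrieve, check sufficiency, refine, repeat."""
    evidence, query = [], question
    for _ in range(max_hops):
        evidence += retrieve(query)
        verdict = llm(f"Given evidence {evidence}, can you answer '{question}'? "
                      "Reply DONE or a follow-up query.")
        if verdict == "DONE":
            break
        query = verdict  # the model's follow-up query drives the next hop
    return llm(f"Answer '{question}' using: {evidence}")
```

The hop budget matters in practice: without it, a confused model can loop on retrieval calls indefinitely.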
Memory Consolidation Across Sessions
A conversation agent that begins each session with no memory of past interactions is fundamentally limited — it cannot build relationships, accumulate user preferences, or improve from experience. Memory consolidation is the process by which the raw record of a session is processed, compressed, and stored in a form that will be useful in future sessions.
Human memory consolidation happens during sleep — the hippocampus replays recent experiences and the most important patterns are strengthened and transferred to long-term cortical storage. AI agents need an analogous process: at session end (or at regular intervals during long-running operations), compress and organise episodic records into reusable knowledge.
The Consolidation Pipeline
A practical consolidation pipeline has several stages. Session summarisation compresses the full conversation or task log into a structured summary — what was accomplished, what was learned, what failed and why. Entity extraction identifies important entities mentioned (people, products, preferences, constraints) and updates a structured entity store. Preference extraction notes user preferences revealed during the session ("they prefer concise responses," "they use metric units"). Importance filtering discards routine exchanges and retains unusual, surprising, or high-value content for retrieval.
def on_session_end(transcript):
    summary = LLM.summarise(transcript, format="structured")
    entities = LLM.extract_entities(transcript)
    prefs = LLM.extract_preferences(transcript)
    importance = LLM.score_importance(summary)  # 1-10
    if importance > 4:
        memory_store.write(summary, entities, timestamp=now())
        entity_store.upsert(entities)
    user_profile.update(prefs)
The MemGPT Architecture
Packer et al. (2023) proposed MemGPT — an agent architecture that explicitly manages a two-tier memory system: main context (what's currently in the LLM's context window) and external context (a database of prior information). The agent can issue memory management function calls — archival_memory_insert, archival_memory_search, core_memory_replace — to explicitly move information between tiers. The LLM itself decides what to remember, what to forget from the immediate context, and what to retrieve. This makes memory management a first-class reasoning task rather than an implicit side effect.
Not all information is worth consolidating. Useful heuristics: retain information that is surprising (contradicts priors), high-stakes (affected important decisions), persistent (a preference or constraint that will recur), or corrective (a mistake made and diagnosed). Discard routine confirmations, repetitions, and content that is easily regeneratable. An importance classifier — even a simple prompted LLM — can make this distinction automatically with good enough accuracy for most applications.
Consolidation at Scale
Long-running agents (personal assistants, enterprise workflow agents) accumulate thousands of sessions over months. Naive retrieval over the full history becomes slow and noisy. Hierarchical consolidation — sessions → weekly digests → monthly profiles → a persistent user model — mirrors how human autobiographical memory works: individual episodes fade, but their patterns are distilled into enduring knowledge. The weekly and monthly summaries are cheaper to retrieve, easier to read, and less sensitive to the specifics of any one session.
Forgetting & Relevance Decay
An agent that never forgets anything is not more capable — it is slower, noisier, and increasingly confused as its memory fills with outdated, contradictory, and irrelevant records. Forgetting is not failure; it is curation. The question is what to forget and when.
Time-Based Decay
The simplest forgetting mechanism is temporal decay: memories become less retrievable over time. The Generative Agents paper used exponential recency decay with \(\lambda = 0.995\) per hour — a memory accessed 24 hours ago is \(0.995^{24} \approx 0.89\) as retrievable as one accessed now. This is a weak signal by itself; combined with importance and relevance, it prevents the retrieval pool from being dominated by recent but trivial events.
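The decay arithmetic is trivial to make explicit:

```python
def recency_weight(hours_since_access, decay=0.995):
    """Exponential recency decay as in Generative Agents (Park et al., 2023)."""
    return decay ** hours_since_access
```

At 24 hours this gives roughly 0.89; at a week (168 hours) it has fallen to about 0.43, so week-old memories need substantially higher importance or relevance to compete.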
Staleness Detection
Time-based decay is agnostic to whether a memory's content is still true. A more principled approach detects staleness — memories that have been invalidated by subsequent events. An entity-update system that tracks facts like (user.preferred_language = Python) can overwrite or annotate old values when new ones are observed, preventing the agent from retrieving outdated preferences. Detecting contradiction automatically (rather than just overwriting on explicit instruction) is an open research problem; current systems handle it by keeping the most recent value and relying on recency bias in the retrieval score to surface it.
Controlled Forgetting in Practice
Production systems typically implement forgetting through a combination of TTL (time-to-live) policies on stored records, periodic pruning jobs that remove low-importance memories older than a threshold, and deduplication that collapses near-duplicate memories into a single canonical record. Vector databases that support metadata-based deletion (Pinecone, Weaviate, Qdrant all do) make TTL-based pruning straightforward to implement.
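A toy pruning policy combining a TTL with an importance floor; the thresholds are illustrative, not recommendations, and a real job would issue metadata-filtered deletes against the vector store rather than filtering in memory.

```python
from datetime import datetime, timedelta

def prune(records, now, ttl_days=90, min_importance=4):
    """Keep records that are either recent or important; drop old, low-value ones."""
    cutoff = now - timedelta(days=ttl_days)
    return [r for r in records
            if r["timestamp"] >= cutoff or r["importance"] >= min_importance]
```

Deduplication would run as a separate pass over the survivors, collapsing near-duplicate embeddings into a canonical record.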
Comparing Memory Architectures
No single memory type is universally superior — the right architecture depends on the task, the update frequency of the knowledge, the latency budget, and the cost tolerance. The following table synthesises the trade-offs across all approaches covered in this chapter.
| Architecture | Capacity | Latency | Update cost | Best suited for |
|---|---|---|---|---|
| In-context only | Low (window-bound) | Zero | Re-inject per call | Short sessions, small KBs |
| RAG (dense) | Very high | Low (ms–100ms) | Embedding + index update | Large document corpora |
| RAG (hybrid) | Very high | Low–medium | Dual index update | Mixed vocab / named-entity tasks |
| Knowledge graph | High (structured) | Low (SPARQL) | Triple insertion / update | Structured facts, relationships |
| Parametric (weights) | Baked in at training | Zero | Full fine-tune | Stable world knowledge |
| Episodic log + retrieval | Very high | Low (ANN) | Append only | Long-running agents, personalisation |
| Consolidated summaries | High (compressed) | Low | Periodic batch job | Multi-session continuity |
The Tiered Memory Stack
Most production agent systems end up with a tiered stack that combines several layers: (1) immediate context — current task state, system prompt, recent turns; (2) session episodic buffer — the full current-session history, summarised and chunked; (3) long-term external store — a vector database covering all past sessions and relevant documents, retrieved on demand; (4) structured entity store — a key-value or graph store for user preferences, persistent facts, and named entities; (5) parametric baseline — the underlying model's pretrained world knowledge. Each tier has a different update frequency, retrieval mechanism, and decay policy.
Embedding every query and hitting a vector store for every agent step adds up. At 50ms per retrieval and 10 agent steps per task, that's 500ms of retrieval latency on top of LLM inference — potentially doubling end-to-end latency. High-frequency agentic systems often cache embeddings, pre-filter by metadata before vector search, or retrieve only at step boundaries rather than every step. Latency profiling should be part of every agent memory architecture review.
Frontier: Persistent Agent Memory
The aspiration of persistent agent memory — a system that builds an accurate, updatable model of users, tasks, and the world that persists indefinitely — is closer than it has ever been but still not solved. Several active research and commercial developments are pushing the frontier.
ChatGPT Memory and Commercial Precedents
OpenAI's Memory feature (2024) was the first at-scale deployment of cross-session agent memory for a consumer assistant. The system uses a separate memory store populated by the model's own writes — when the model judges something worth remembering, it calls a save_memory tool that persists a natural-language fact. Users can inspect, edit, and delete memories. The design prioritises user control and explainability over raw recall capacity — a reasonable trade-off for consumer trust but limiting for enterprise applications where memory depth matters more.
MemoryOS and Lifelong Learning
Research prototypes are exploring more ambitious architectures. MemoryOS (2024) proposes a three-tier memory system with automatic promotion: raw observations enter short-term storage, frequently accessed or high-importance items are promoted to mid-term storage, and distilled patterns move to long-term storage — all managed automatically. This mirrors the hippocampal-neocortical transfer model from neuroscience and addresses the key limitation of flat episodic stores: they scale poorly with time.
Continual Learning vs. Retrieval
A deeper question is whether long-term memory should live in weights or in a retrieval system. Retrieval systems are interpretable, updatable, and reliable — but they require explicit retrieval calls and can fail if the retrieval query is malformed. Continual learning in weights is seamless at inference time — the knowledge just "is there" — but current methods suffer from catastrophic forgetting: updating weights for new information tends to degrade performance on old information. LoRA-based continual learning and experience replay are active mitigations, but no robust solution exists for high-update-frequency domains.
Memory as a Safety Surface
Persistent memory introduces new safety concerns. A malicious document injected into an agent's memory store can manipulate its behaviour across future sessions — a long-range prompt injection attack. Memory that is never forgotten or audited can become a vector for persistent bias, surveillance, or manipulation. Agent memory systems require the same adversarial hardening as any other attack surface: input validation at write time, provenance tracking for every stored record, and regular auditing of what the memory store contains. These are open engineering challenges, not solved problems.
Whose interests does persistent agent memory serve? A memory system that perfectly recalls every user preference and adapts accordingly could, over time, become deeply manipulative — reinforcing and amplifying whatever the user already believes, never challenging or correcting. The most capable memory system might not be the most beneficial one. This is the memory dimension of the broader alignment problem: memory design involves value choices about what to remember, what to forget, and whose account of events to privilege.
Key Papers
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020). The paper that coined RAG. Combines a dense retriever (DPR) with a seq2seq generator, trained end-to-end. Established the retrieval-augmented paradigm that now underlies most agent memory systems. The foundational reference for RAG.
- Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023). The Smallville agent simulation. Introduces the memory stream, composite retrieval score, and reflection mechanism that have become the canonical reference architecture for episodic agent memory. Read this before designing any long-horizon agent memory system.
- MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023). Frames the agent memory management problem as analogous to OS virtual memory. The model explicitly manages what lives in the context window (main context) vs. external storage, issuing function calls to swap content in and out. The most rigorous treatment of hierarchical agent memory management.
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023). Documents the primacy/recency effect in long-context LLMs — relevant content in the middle of a long context is significantly under-attended. Has practical implications for how agents should structure injected memory. Required reading before relying on full-context stuffing.
- Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2022). Hypothetical Document Embeddings (HyDE): generate a hypothetical ideal answer, embed it, retrieve against it. Outperforms direct query embedding on several zero-shot retrieval benchmarks. A practical retrieval improvement with a one-line implementation change.
- Language Models as Knowledge Bases? (Petroni et al., 2019). The first systematic evaluation of factual knowledge stored in pretrained LM weights using cloze-style prompts. Established the "parametric memory" framing and showed that even BERT stores surprising quantities of factual associations. The foundational paper on parametric memory.