22  Memory Systems

An agent that runs longer than a single context window has to remember what it did and survive losing the machine it did it on. This chapter is about the state an agent leaves behind: the durable record of a session, the workspace it mutated, and the memory it carries across sessions. By the end a reader can explain why a session log records intent separately from result, why branching a conversation is cheap but branching a workspace is not, and why a shared vector store is a data-breach risk rather than a relevance bug.

22.1 Problem

A harness has to survive any of four events: a harness crash, a replica reschedule, a sandbox loss, or a user resuming a session days later. None of these is exotic; all of them are the normal operating regime of a long-running agent. Surviving them means writing state to a durable store, and the hard question is not where to put that state but what to record.

The core difficulty is the gap between a side effect and the durable record of it. A durable log records intent (“call git commit”) and, separately, result (“commit succeeded”). If the harness crashes between those two writes, recovery reads the last durable state, which is pre-result, and the agent re-issues the operation. For an idempotent operation that is harmless. For a non-idempotent one (a second git commit, a duplicate ticket, a repeated payment) the store and the external world diverge silently, and nothing in the log can tell you it happened. Reported agent retry rates of 15 to 30 percent make this a regime to design for, not an edge case [1].

A second problem sits above the first. A session is bounded by context compaction (Chapter 20): the context window fills, gets summarized, and the early turns fall out of the model’s reach. An agent that is supposed to know a user across weeks therefore cannot rely on the context window alone. It needs a separate store, durable and queryable, that the harness loads into context on demand. That store is memory, and it is a different system from the session log: the log is append-only intent, memory is curated and retrievable.

The third problem is tenancy. The moment memory is shared infrastructure, a retrieval that crosses a tenant boundary stops being a wrong answer and becomes a data breach. The property that makes memory useful (retrieved content lands directly in the model’s context) is the same one that makes it dangerous, because that content is indistinguishable from instructions the user or harness put there.

22.2 Design

The durable-state primitive answers all three with one small contract (append, read, snapshot, fork) applied uniformly to state shapes that are anything but uniform: the session log (append-only intent), the workspace (mutable sandbox state), and long-term memory (curated, retrievable). The contract is shared; the durability guarantees are not, and the design discipline is being honest about what each store does not guarantee.
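To make the four operations concrete, here is a minimal sketch of the contract as a Python protocol, with a toy in-memory log behind it. The names `DurableStore` and `InMemoryLog` and all method signatures are assumptions made for illustration; the chapter does not prescribe an API, and a real store would persist every append rather than hold a list in memory.

```python
from dataclasses import dataclass, field
from typing import Protocol


class DurableStore(Protocol):
    """Hypothetical interface for the four-operation contract."""

    def append(self, record: dict) -> int: ...
    def read(self, start: int = 0) -> list[dict]: ...
    def snapshot(self) -> int: ...
    def fork(self, snapshot_id: int) -> "DurableStore": ...


@dataclass
class InMemoryLog:
    """Toy append-only log; a real store would persist each append."""

    records: list[dict] = field(default_factory=list)

    def append(self, record: dict) -> int:
        self.records.append(record)
        return len(self.records) - 1  # position of the new record

    def read(self, start: int = 0) -> list[dict]:
        return list(self.records[start:])

    def snapshot(self) -> int:
        # For an append-only log, a snapshot is just the current length.
        return len(self.records)

    def fork(self, snapshot_id: int) -> "InMemoryLog":
        # Copy the prefix up to the snapshot; later appends diverge.
        return InMemoryLog(records=list(self.records[:snapshot_id]))


log = InMemoryLog()
log.append({"type": "user", "text": "fix the build"})
checkpoint = log.snapshot()
log.append({"type": "tool_call", "name": "sandbox_exec"})

branch = log.fork(checkpoint)
branch.append({"type": "user", "text": "try a different approach"})
```

The same four verbs would mean something much more expensive for a workspace, where a snapshot is a filesystem capture rather than an integer. That asymmetry is what the rest of the chapter works through.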

22.2.1 Three durability models

The session log can be built three ways, and the meaningful difference between them is what they record.

The append-only event log writes every user message, model response, tool call, and tool result to a durable store. The log is the sole recovery artifact: any replica can serve any session by replaying it. This buys clean replay, clean audit, and transport-agnostic recovery, and it is the model that exhibits the intent-versus-result gap in its purest form. From the log alone, “requested but never ran” and “ran but never recorded” are identical: a tool call with no result.

sequenceDiagram
    participant LLM
    participant R as Runner
    participant SB as Sandbox
    participant SS as Session store

    R->>LLM: prompt
    LLM-->>R: tool_use: sandbox_exec("git commit -m 'feat'")
    R->>SB: gRPC Exec("git commit ...")
    SB-->>R: OK (commit created)
    Note over R: ✕ Runner crashes here
    Note over SS: Event NOT persisted

    Note over R,SS: Recovery
    R->>SS: GetSession
    SS-->>R: last known state (pre-commit)
    R->>LLM: replay from last checkpoint
    LLM-->>R: tool_use: sandbox_exec("git commit -m 'feat'")
    R->>SB: gRPC Exec("git commit ...")
    Note over SB: ✕ Duplicate commit or error:<br/>nothing to commit, working tree clean
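The ambiguity can be seen in a few lines. The sketch below scans a replayed log for tool calls that have no recorded result. The event shape (`type`, `call_id`, `cmd`) is an assumption for illustration, not a format the chapter defines.

```python
def dangling_tool_calls(events):
    """Return tool calls that have no matching result in the log."""
    pending = {}
    for event in events:
        if event["type"] == "tool_call":
            pending[event["call_id"]] = event
        elif event["type"] == "tool_result":
            pending.pop(event["call_id"], None)
    return list(pending.values())


log = [
    {"type": "user", "text": "commit the change"},
    {"type": "tool_call", "call_id": "c1", "cmd": "git status"},
    {"type": "tool_result", "call_id": "c1", "output": "1 file modified"},
    {"type": "tool_call", "call_id": "c2", "cmd": "git commit -m 'feat'"},
    # crash: no tool_result for c2 was ever persisted
]

unresolved = dangling_tool_calls(log)
```

The function can find the call whose fate is unknown, but nothing in the log says whether `c2` ran. That question has to be answered by a step boundary or by asking the external system.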

The durable step log closes the gap by adding a third record between intent and result: a durable step boundary written before the effect executes. Replay skips any step whose outcome is already durable, so “we asked to do X” and “X completed” become two separate, individually durable facts, and replay can ask “did this step already complete?” instead of assuming it did not.

The filesystem-as-state model takes the opposite extreme: the working context is ephemeral and the filesystem is the sole authoritative record of task state [6]. The event log is demoted to a secondary log of intent; the filesystem holds the truth, usually through an opinionated convention (progress.md, plan.md) that the agent maintains. This sidesteps the intent-versus-result gap for anything the filesystem can represent, because the file either has the new content or it does not. But it relocates the gap rather than removing it: a git push, an external API call, and an MCP server-side cursor advance all remain outside the filesystem’s reach.
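The “has the new content or does not” property is not automatic: a plain overwrite can be interrupted halfway. One common way to get it is to write a temporary file and rename it into place, sketched below with the standard library. The helper name `write_atomically` is an assumption, and the sketch skips fsyncing the parent directory, which a stricter durability guarantee would also need.

```python
import os
import tempfile


def write_atomically(path: str, content: str) -> None:
    """Replace the file at `path` so readers see old or new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
progress = os.path.join(workdir, "progress.md")
write_atomically(progress, "- [x] step 1\n- [ ] step 2\n")
write_atomically(progress, "- [x] step 1\n- [x] step 2\n")
```

Creating the temporary file in the target directory keeps the rename on one filesystem, which is what makes the swap atomic.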

22.2.2 The shape of memory

Memory is the store that outlives a session, and it has a structure worth naming. Short-term memory is the session context within one conversation. Long-term memory splits into episodic (past interactions), semantic (facts about the user or domain), and procedural (learned workflows). External knowledge, the corpora behind retrieval-augmented generation (Chapter 25) and knowledge graphs, sits alongside but is a different thing: memory is per-user, curated, and writable, where RAG retrieves from a shared corpus the agent does not own.

graph TB
    classDef short fill:#d6eaf8,stroke:#2980b9
    classDef long fill:#fdebd0,stroke:#e67e22
    classDef external fill:#d5f5e3,stroke:#27ae60

    subgraph "Short-term"
        ST[Session context<br/>within one conversation]
    end

    subgraph "Long-term"
        EM[Episodic memory<br/>past interactions]
        SM[Semantic memory<br/>facts about user / domain]
        PM[Procedural memory<br/>learned workflows]
    end

    subgraph "External knowledge"
        RAG[RAG indexes<br/>documentation, tickets, code]
        KG[Knowledge graphs<br/>entities, relationships]
    end

    class ST short
    class EM,SM,PM long
    class RAG,KG external

The base layer is a vector store plus embedding retrieval: text is embedded, stored, and recalled by similarity. Pinecone [27], pgvector [28], and Chroma [29] are representative substrates. Above them sit opinionated memory frameworks, Letta [30] (the successor to MemGPT [31]), Zep [32], and Mem0 [33], that decide what to write, when, and how to compact it. Knowledge graphs model entities and relationships explicitly, which suits compositional queries but is harder to populate automatically, and a temporal knowledge graph adds the time dimension so a fact can be valid for a period and then superseded.
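The temporal dimension is easiest to see as a validity interval on each fact. The sketch below is a toy model, not the schema of any framework named above; the `Fact` fields and the half-open interval convention (`valid_from` inclusive, `valid_to` exclusive) are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"


def value_at(facts, subject, predicate, when):
    """Return the value that was valid on `when`, or None if nothing was."""
    for fact in facts:
        if (fact.subject, fact.predicate) != (subject, predicate):
            continue
        ended = fact.valid_to is not None and when >= fact.valid_to
        if fact.valid_from <= when and not ended:
            return fact.value
    return None


facts = [
    Fact("user:42", "employer", "Acme", date(2023, 1, 1), date(2024, 6, 1)),
    Fact("user:42", "employer", "Globex", date(2024, 6, 1)),
]
```

Superseding a fact closes its interval instead of deleting it, so a query about the past still returns what was true then.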

22.3 Evolution

Each of the three stores reached its current shape by the same route: a simple model met a failure it could not represent, and the next model added exactly the missing record.

Durable execution is the clearest case. The plain event log gave clean replay but blurred intent and result. Durable-execution frameworks, Restate [2], Temporal [3], and Inngest [4], journal each step, run the effect, then record its outcome, formalizing the distinction the event log blurs. Temporal’s own framing is that probabilistic LLM behavior makes naive retry logic insufficient, which is exactly the case durable journaling is built for [5]. By 2025 this crossed into the mainstream: AWS Durable Functions, Cloudflare Workflows, and Vercel’s Workflow DevKit all shipped with AI-agent use cases as primary framing [4]. The pattern is no longer framework-specific.

Session structure evolved in parallel. A linear append-only log treats a session as a list, and to back out you start over. Production harnesses moved past that. Rewind and truncate jump back to a checkpoint and discard the tail, rolling the workspace back to match, as in Claude Code’s /rewind [13] and Cursor’s per-edit checkpoints [14]. A forkable tree goes further: multiple live branches from any checkpoint, all preserved. LangGraph’s time-travel model treats checkpoints as a DAG with explicit branch IDs [16], Claude Code’s /fork spawns a child session from a shared history point [13], and Replit Agent [15] and OpenAI’s Codex CLI [17] ship their own variants. The session stopped being a list and became a tree, or at least a log with a movable head.

Memory followed the same arc from a research artifact to a product category. MemGPT [31] framed the LLM as an operating system that pages memory in and out of a bounded context, and its successor Letta [30] turned that into a hosted primitive. Zep [32] built memory on a temporal knowledge graph, and Mem0 [33] optimized for low latency and token cost. The category is young enough that its benchmarks are still contested, as the box below describes.

Important: What’s contested

Cross-vendor memory benchmarks are not yet trustworthy. Mem0 reports a 26 percent LLM-as-Judge gain over OpenAI memory on LoCoMo [34], with large latency and token-cost reductions [33]. Letta’s counter-benchmark reports a higher LoCoMo score with disputed methodology [35]. Zep reports its temporal knowledge graph leading on a different benchmark [32]. The vendors do not agree on the metric, the dataset split, or the baseline, so treat any cross-vendor memory benchmark as contested until it is independently reproduced. The useful signal is the shape of the trade, recall quality against latency and token cost, not the headline number.

22.4 Trade-offs

Each store buys its guarantee with a specific sacrifice, and naming the knee matters more than the mechanism.

Workspace persistence is partial durability, not full recovery. The workspace, the sandbox’s mutable filesystem, outlives individual tool calls but is not on the durable log, and persisting it across pod lifecycle events has its own failure modes. The common Kubernetes pattern mounts a PersistentVolumeClaim at /workspace, and its dominant failure is zone scoping. A ReadWriteOnce PVC is typically backed by a zonal block device that is physically attachable only within its own availability zone, so when a pod reschedules across zones the volume cannot follow and recovery falls back to a full re-bootstrap [7] [8]. When a pod is deliberately destroyed (teardown, timeout, scale-down) the PVC is often deleted with it, and uncommitted work is gone. Gitpod issue #9544 is the archetype: a workspace timeout fired before the final sync completed and a user lost roughly two hours of work [9]. The alternative, checkpointing the complete state on a cadence, trades snapshot frequency directly against storage overhead, and the frequency knob is load-bearing because it sets the maximum work a crash can erase. Replit’s snapshot engine captures filesystem and conversation context together and reports recovering from hourly OOM crashes without data loss [11] [12]. Any design that claims “sessions survive indefinitely” owes an explicit account of which failure modes it covers and which it does not.
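The cadence trade can be put in numbers. The sketch below assumes full, undeduplicated snapshots of a fixed size kept for a fixed retention window. The function name, the 2 GB snapshot size, and the 24-hour retention are illustrative assumptions, not figures from any cited system.

```python
def checkpoint_tradeoff(interval_min, snapshot_gb, retention_hours):
    """Return (worst-case minutes of lost work, GB of retained snapshots)."""
    snapshots_kept = (retention_hours * 60) // interval_min
    return interval_min, snapshots_kept * snapshot_gb


for interval in (60, 15, 5):
    lost, stored = checkpoint_tradeoff(interval, snapshot_gb=2, retention_hours=24)
    print(f"every {interval:>2} min: lose <= {lost} min, store {stored} GB")
```

Under these assumptions, moving from hourly to five-minute snapshots cuts the worst-case loss by a factor of twelve and multiplies storage by the same factor. Incremental or deduplicated snapshots change the storage side of that sum, not the loss side.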

Branching splits cheaply on conversation and expensively on the workspace. Branching conversation state is a pointer operation: the log is a tree of event IDs and forking a tree is cheap. Branching the workspace is not, because the workspace is a real filesystem and two branches cannot both own it.

graph LR
    classDef easy fill:#d5f5e3,stroke:#27ae60
    classDef hard fill:#f9d6d6,stroke:#c0392b

    CK[Checkpoint]
    CK --> B1[Branch A: conversation]
    CK --> B2[Branch B: conversation]
    CK --> W[Workspace / sandbox state]
    W --> C{Shared or cloned?}
    C -->|shared| RACE[Branches interfere]
    C -->|cloned| COST[Clone cost per fork]

    class B1,B2 easy
    class RACE,COST hard

The choice collapses to a dilemma: forks that share the workspace race on it (branch A’s writes are visible to branch B), and forks that clone it pay the clone cost. Three failure modes follow. First, a branch cannot fork the external systems it touches (git remotes, issue trackers, MCP cursors), so branchable and unbranchable state drift apart. Second, merge-back is structurally unsolved: files have a common ancestor at the checkpoint, so a three-way merge applies, but two divergent conversations have no merge base because the content is a reasoning trajectory, not editable lines, which is why the agent, not the harness, has to resolve a file merge. Third, orphan branches pin workspace and session-log resources until a TTL reclaims them, so without one the leak accumulates silently. Rewind alone suffices for interactive coding; forkable trees pay off when users explore alternatives in parallel or an evaluation pipeline replays a session with a changed prompt.
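The cheap half of that split, the conversation fork, can be sketched as a tree of events linked by parent pointers. The class and field names are assumptions for illustration; the point is that a fork adds one node and copies nothing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    parent_id: Optional[str]
    text: str


class ConversationTree:
    def __init__(self):
        self.events = {}

    def append(self, event_id, parent_id, text):
        self.events[event_id] = Event(event_id, parent_id, text)
        return event_id

    def history(self, head_id):
        """Walk parent pointers from a branch head back to the root."""
        chain = []
        current = head_id
        while current is not None:
            event = self.events[current]
            chain.append(event.text)
            current = event.parent_id
        return list(reversed(chain))


tree = ConversationTree()
tree.append("e1", None, "user: refactor the parser")
checkpoint = tree.append("e2", "e1", "agent: plan drafted")

# Forking is one new event whose parent is the checkpoint; nothing is copied.
head_a = tree.append("a1", checkpoint, "user: use recursive descent")
head_b = tree.append("b1", checkpoint, "user: use a parser generator")
```

Both branches share the two events before the checkpoint. There is no equivalent trick for the workspace unless the storage layer provides one.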

Memory trades cost against simplicity, and write-time safety against read-time convenience. Memory must be either loaded ambiently into every prompt (simple, expensive) or exposed as a tool the agent queries on demand (cheaper, scales better, but the agent has to know when to ask). Changing the embedding model invalidates the existing index and forces an expensive reindex, so the embedding choice is a long-term commitment. The sharpest trade is tenancy. A top-k similarity query against a shared vector store without an enforced tenant filter can return another tenant’s data, and the defense has to live in the index, not in application code. Partition at write time (per-tenant namespaces, Postgres row-level security over pgvector tables, or separate schemas) so a query physically cannot reach another tenant’s vectors. Attributing every chunk and filtering after the query is weaker, because filter-at-read-time is one bug away from a leak, whereas partition-at-write-time fails closed. This is where memory consumes the identity model directly (Chapter 34): tenancy is sourced from identity, and partitioning enforces it.
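A toy version of partition-at-write-time shows why it fails closed. The sketch below keeps one in-memory index per tenant and ranks by cosine similarity. `PartitionedMemory` and its methods are hypothetical, and the two-dimensional vectors stand in for real embeddings.

```python
import math
from collections import defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class PartitionedMemory:
    """Toy store: one index per tenant, chosen at write time."""

    def __init__(self):
        self._namespaces = defaultdict(list)

    def write(self, tenant_id, vector, text):
        self._namespaces[tenant_id].append((vector, text))

    def query(self, tenant_id, vector, k=3):
        # Only the caller's namespace is searched; there is no code path
        # that iterates over another tenant's vectors.
        candidates = self._namespaces.get(tenant_id, [])
        ranked = sorted(
            candidates, key=lambda item: cosine(vector, item[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


memory = PartitionedMemory()
memory.write("tenant-a", [1.0, 0.0], "A prefers dark mode")
memory.write("tenant-b", [1.0, 0.0], "B's API key rotation schedule")
```

The two records have identical vectors, so a shared index with a forgotten filter would return both. Here a query for `tenant-a` cannot see the other record, and an unknown tenant gets an empty result rather than someone else's data.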

A second tenancy risk is specific to memory and has no analogue in the session log: because the agent writes memory based on user messages, a malicious user can craft messages that plant persistent misinformation. This is the prompt-injection problem moved into the memory store, where it survives across sessions. A misretrieved or poisoned record lands in the prompt and shifts the agent’s behavior in ways that are hard to debug, because the trace shows retrieval succeeding: the injected text is not an error, it is a correct retrieval of wrong content.

Tip: Constraint arrow

Whether branching and snapshotting are cheap or ruinous is decided one layer down, by the storage substrate, not by the harness. A PVC has no native cheap-clone primitive, so a fork on PVC must copy the whole workspace. Copy-on-write overlays and content-addressed snapshots duplicate only the divergent blocks, so a fork costs what diverged. Firecracker snapshots serialize an entire microVM and restore in tens to hundreds of milliseconds [22], which makes affordable the per-fork snapshotting that would be ruinous on a PVC. The session-tree feature in the layer above is therefore enabled or priced out by a storage choice made below it.
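“A fork costs what diverged” can be illustrated with a toy content-addressed store, where a snapshot is a manifest of path-to-hash entries and blobs are stored once per distinct content. This is a file-level simplification of what block-level substrates do, and the class name and layout are assumptions.

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store: blobs are keyed by the hash of their bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def snapshot(self, files: dict) -> dict:
        """Map each path to the digest of its content."""
        return {path: self.put(data) for path, data in files.items()}


store = SnapshotStore()
base = store.snapshot({"main.py": b"print('v1')\n", "README.md": b"# demo\n"})
blobs_before_fork = len(store.blobs)

# A fork starts as a copy of the manifest (paths -> digests), not of the data.
fork = dict(base)
fork["main.py"] = store.put(b"print('v2')\n")
new_blobs = len(store.blobs) - blobs_before_fork
```

The fork adds one blob for the one file that changed; the unchanged file is shared by reference. A substrate without this property pays for both files on every fork.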

22.5 Implementation

Two properties are worth nailing down before the first crash, because both are invisible until exactly the wrong moment.

The first is the reconciliation gap. If the sandbox holds mutable state outside the durable log, the harness must reconcile the two on recovery. For idempotent operations it does not matter; the result is the same whether the operation ran once or twice. For non-idempotent ones (append to a file, increment a counter, create a commit, call a side-effectful API) the harness needs either a step-log boundary or an idempotency key threaded through the tool. The journal records the step’s intent and its key before the effect fires; on replay the harness performs check-then-act against that key, or against the external system’s own dedup (git’s ref state, a payment processor’s idempotency header), rather than blindly re-issuing. The step log does not make external effects idempotent on its own. It gives you the hook to make them so.

# Idempotency-keyed step, conceptual sketch.
def run_step(journal, key, effect):
    if journal.has_outcome(key):     # replay: skip a completed step
        return journal.outcome(key)
    journal.record_intent(key)       # durable boundary BEFORE the effect
    result = effect()                # the non-idempotent side effect
    journal.record_outcome(key, result)
    return result
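The conceptual sketch re-runs the effect when an intent exists without an outcome, which is the crash window itself. A slightly fuller version adds the check-then-act probe for that case. Everything here is hypothetical scaffolding: the in-memory `Journal`, the `lookup_external` callback, and the simulated ticket system stand in for a durable journal and an external service that deduplicates by idempotency key.

```python
class Journal:
    """In-memory stand-in for a durable journal (hypothetical API)."""

    def __init__(self):
        self.intents = set()
        self.outcomes = {}

    def has_intent(self, key):
        return key in self.intents

    def has_outcome(self, key):
        return key in self.outcomes

    def record_intent(self, key):
        self.intents.add(key)

    def record_outcome(self, key, result):
        self.outcomes[key] = result


def run_step(journal, key, effect, lookup_external):
    """Run `effect` at most once per key, probing the external system on replay."""
    if journal.has_outcome(key):
        return journal.outcomes[key]  # completed earlier: skip
    if journal.has_intent(key):
        # Intent without outcome: the effect may or may not have run.
        found = lookup_external(key)  # check-then-act
        if found is not None:
            journal.record_outcome(key, found)
            return found
    journal.record_intent(key)  # durable boundary before the effect
    result = effect(key)
    journal.record_outcome(key, result)
    return result


# Simulated external system that remembers results by idempotency key.
external = {}


def create_ticket(key):
    external[key] = f"TICKET-{len(external) + 1}"
    return external[key]


journal = Journal()
journal.record_intent("step-7")  # simulate: intent journaled ...
create_ticket("step-7")  # ... effect ran, then the harness crashed
recovered = run_step(journal, "step-7", create_ticket, external.get)
```

On recovery the step finds the ticket that already exists instead of creating a second one. The probe only works because the external system can be asked about the key, which is the part the step log cannot supply on its own.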

The second is erasure reach. A right-to-erasure request has to cascade across every store that holds a user’s data, and memory is the painful case: framework-managed memory may not expose a delete API at all, so a store chosen for recall quality can quietly become a store you cannot erase. Partition-at-write-time helps here too, because a per-tenant namespace is a unit you can drop wholesale. The broader audit and compliance machinery (tamper-evident logs, data residency, the regulatory regimes) belongs to Chapter 34; the memory-specific obligation is to keep every record attributable to a tenant and erasable by tenant.
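One way to keep erasure reach visible is to run the cascade through a single function that also reports which stores could not comply. The sketch below is a toy under stated assumptions: `NamespacedStore`, `OpaqueStore`, and the `erase_tenant` method are hypothetical names, and the opaque store stands in for any component with no tenant-level delete.

```python
class NamespacedStore:
    """Toy store partitioned by tenant, so erasure is a namespace drop."""

    def __init__(self, name):
        self.name = name
        self.namespaces = {}

    def write(self, tenant_id, record):
        self.namespaces.setdefault(tenant_id, []).append(record)

    def erase_tenant(self, tenant_id):
        return len(self.namespaces.pop(tenant_id, []))


class OpaqueStore:
    """Stand-in for a store with no tenant-level delete."""

    def __init__(self, name):
        self.name = name


def erase_everywhere(stores, tenant_id):
    """Return (records erased per store, names of stores that cannot erase)."""
    erased, unerasable = {}, []
    for store in stores:
        erase = getattr(store, "erase_tenant", None)
        if erase is None:
            unerasable.append(store.name)
        else:
            erased[store.name] = erase(tenant_id)
    return erased, unerasable


session_log = NamespacedStore("session_log")
memory = NamespacedStore("memory")
session_log.write("tenant-a", {"type": "user", "text": "hi"})
memory.write("tenant-a", "prefers dark mode")
memory.write("tenant-b", "works in UTC+2")

erased, unerasable = erase_everywhere(
    [session_log, memory, OpaqueStore("vendor_memory")], "tenant-a"
)
```

A non-empty `unerasable` list is the signal worth alerting on: it names the stores where a request cannot be honored, before a real request arrives.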

For the memory store itself, the base implementation is a vector index (pgvector [28] inside an existing Postgres is the low-operational-overhead default; Pinecone [27] or Chroma [29] when a dedicated store is warranted), wrapped either by a framework (Letta [30], Zep [32], Mem0 [33]) that manages write and compaction policy, or by a thin custom layer when the policy is simple enough to own. The harness (Chapter 23) is what loads the result into context, ambiently or through a retrieval tool, closing the loop back to the session that the memory outlived.

22.6 Further reading

  • “AI Agent Reliability Report 2025.” LLM agent retry rate 15–30%. fast.io
  • S. Ewen, G. van Dongen, I. Shilman. “Durable AI Loops: Fault Tolerance across Frameworks.” Restate, 2025. Restate
  • C. Davis. “Durable Execution meets AI.” Temporal, 2025. Temporal
  • C. Poly. “Durable Execution: The Key to Harnessing AI Agents in Production.” Inngest, 2026. Inngest
  • WorkOS. “Maxim Fateev on Durable Execution for AI Agents.” 2025. workos.com
  • Y. Zhou et al. “Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering.” 2026. arXiv:2604.08224
  • Rack2Cloud. “Kubernetes Day 2 Failures: 5 Incidents & the Metrics That Predict Them.” 2026. Rack2Cloud
  • Kubernetes Issue #121436. “PVC Binding Prevents Pod Rescheduling.” GitHub
  • Gitpod Issue #9544. “Data loss when workspace timeout fires before sync completes.” 2022. GitHub
  • Replit. “Inside Replit’s Snapshot Engine.” December 2025. blog.replit.com
  • Replit. “Finding and Solving Memory Leaks.” Replit Blog
  • Anthropic. “Claude Code: Checkpointing and Session Forks.” Claude Code Docs. code.claude.com
  • Cursor. “Checkpoints in the Agent Workflow.” 2025. stevekinney.com notes
  • Replit. “Checkpoints and Rollbacks.” Replit Docs
  • LangChain. “LangGraph Time Travel and Branching.” 2025. LangChain Docs
  • OpenAI Codex CLI. Conversation rewind (Esc) ships today; a code-reverting rewind is proposed in Issue #11626, 2026. GitHub
  • AWS. “Firecracker Snapshotting.” Firecracker Docs
  • D. Ustiugov et al. “Benchmarking, Analysis, and Optimization of Serverless Function Snapshots (REAP).” ASPLOS 2021. arXiv:2101.09355
  • Pinecone. “Managed Vector Database for AI.” pinecone.io
  • pgvector. “Open-source Vector Similarity Search for Postgres.” github.com/pgvector/pgvector
  • Chroma. “Open-source Embedding Database.” trychroma.com
  • Letta. “Stateful Agents with Memory as a First-Class Primitive.” letta.com
  • C. Packer et al. “MemGPT: Towards LLMs as Operating Systems.” 2023. arXiv:2310.08560
  • Zep. “Temporal Knowledge Graph for Agent Memory.” getzep.com
  • “Mem0: Memory for the AI Era.” April 2025. arXiv:2504.19413
  • Snap Research. “LOCOMO: Long-Term Conversational Memory Benchmark.” snap-research.github.io
  • Letta. “Benchmarking AI Agent Memory.” 2025. letta.com