Every vendor selling agent infrastructure in 2025 and 2026 has a slide about context windows. The numbers are impressive: 128K tokens, 1 million tokens, claims of 100 million with efficiency improvements that seem to defy the physics of attention mechanisms. The implicit argument is that if you just buy enough context, the memory problem goes away. It does not. It gets more expensive and more opaque.

The uncomfortable finding from Chroma's July 2025 evaluation of 18 leading models, including every major frontier model, is that performance degrades gradually at first and then catastrophically as context grows. The effect is measurable around 10,000 tokens, accelerates past 50,000, and becomes severe by 200,000. In controlled benchmarks, hallucination rates run at 12% with 200K tokens of optimised retrieval, against 28–31% at full 1-million-token capacity, and the full-capacity approach costs roughly five times as much per query. Researchers call the phenomenon context rot: the model's attention is spread so thinly across a long sequence that early information effectively disappears, even though it is technically present in the window.

This is not a failure of the models. It is a failure of the architectural assumption that context and memory are the same thing. They are not. Context is the active window. Memory is the governed system that decides what enters that window, in what form, at what moment. The context window is just one component of a memory architecture. Building an agent that treats context as memory is building an agent with no memory at all — one that knows everything that happened in the last twenty minutes and nothing that happened before.


The problem in concrete terms: what breaks and when

To understand why this matters for production deployments, it helps to walk through what actually happens to an agent running a long-horizon task — say, an autonomous procurement agent managing a multi-vendor negotiation over several weeks. The agent starts with its original objective, constraints, and initial context. For the first dozen or so steps, everything works as designed. The context window is manageable, recent events are visible, and the agent's outputs are coherent.

Around step 20, the context window begins to fill. The agent framework, having no explicit memory architecture, handles this in the worst possible way: it stuffs everything into the prompt and hopes the model attends to all of it. The model does not. Attention dilutes. A 2024 study in the Journal of Machine Learning Research documents a 40% performance drop beyond 50,000 tokens due to this attention dilution effect. The agent begins to exhibit a specific failure mode: it answers correctly about things that appeared recently in its context, and incorrectly or inconsistently about things that appeared earlier.

By step 35, the failure mode becomes operationally significant. The agent has forgotten constraints it learned in the first week. It is recommending options it already rejected. It has lost track of the original goal framing. None of this appears in the model's outputs as an error message. The outputs are fluent, confident, and wrong. The agent that placed a $2.4M order for parts it already had, in the anecdote that opens almost every conversation about this problem, was not hallucinating in the technical sense. It was operating on a context from which the relevant constraint had been evicted.

Context rot is not a model problem. It is an architecture problem. The model is doing exactly what it was designed to do with the context it was given. The architecture failed to give it the right context.
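
One common way a constraint ends up evicted is drop-oldest truncation once the window is full. The following is a minimal sketch of that behaviour, not any particular framework's implementation: the function names, the whitespace-based token count, and the example messages are all illustrative assumptions.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def naive_window(history: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit in the budget, dropping the oldest first."""
    kept: list[str] = []
    used = 0
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break  # everything older than this point is silently evicted
        kept.append(message)
        used += cost
    return list(reversed(kept))


# Hypothetical history: one early constraint followed by 35 steps of tool output.
history = ["CONSTRAINT: never order parts that are already in stock"]
history += [f"step {i}: tool result with some vendor pricing detail" for i in range(1, 36)]

window = naive_window(history, budget=100)
constraint_visible = any(m.startswith("CONSTRAINT") for m in window)
print(len(window), constraint_visible)
```

Nothing in this path raises an error when the constraint falls out of the window. The model simply receives a shorter, more recent history and answers from it.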

Figure 1 — Context rot timeline: naive context stuffing versus governed memory architecture
[Figure: two horizontal timelines showing agent behaviour across 50 steps. Top timeline, "Naive: context stuffing (no memory architecture)": context window filling and attention beginning to dilute (degradation after step 15); early constraints evicted, goal drift beginning, rejected options re-recommended, silent failures (after step 30); critical failure, with the agent no longer coherent and costly (after step 40). Bottom timeline, "Governed: four-layer memory architecture": goal anchor always injected, episodic store queried at each step, working memory capped, performance stable throughout.]
The top line shows what happens to agent coherence when context is treated as memory: degradation is gradual at first, then catastrophic as attention dilutes across a growing context. The bottom line shows a governed four-layer memory architecture: performance is stable across 50 steps because the working memory is actively managed, not passively accumulated.

The cost dimension that nobody puts in the pitch deck

Context rot has a performance cost. It also has a financial cost that is structurally hidden in the way most teams budget for AI deployments. Token pricing is almost always evaluated at single-turn or short-session scale — the demo scope, the proof-of-concept scope. The numbers that emerge from that evaluation are then used to project production costs, which are multi-session, long-horizon, and potentially recursive. The projection is wrong by an order of magnitude.

The Databricks RAG benchmarks from 2025 quantified what teams discover empirically: naive long-prompt workarounds — stuffing context to avoid building a proper memory architecture — inflate costs by 300% while providing no improvement in recall accuracy. The Stevens Institute "Hidden Economics of AI Agents" study from early 2026 found that unconstrained agent tasks cost $5–8 each, with Reflexion-style loops consuming 50 times the token volume of a single-pass equivalent. At production volume, these numbers are not theoretical. They are line items.
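
The shape of that mis-projection can be sketched with simple arithmetic. Assume, purely for illustration, that each step adds a fixed number of tokens to the history and that a stuffing strategy resends the whole history on every turn, while a capped strategy never sends more than a fixed per-turn budget. The figures below are assumptions chosen to show the shape of the curve, not numbers from the cited studies.

```python
def stuffed_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn resends the full accumulated history."""
    # Turn k carries k * tokens_per_turn tokens, so the sum is quadratic in `turns`.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def capped_input_tokens(turns: int, tokens_per_turn: int, cap: int) -> int:
    """Total input tokens when the per-turn context is capped at `cap` tokens."""
    return sum(min(k * tokens_per_turn, cap) for k in range(1, turns + 1))


TOKENS_PER_TURN = 2_000   # assumed history growth per step
CAP = 20_000              # assumed per-turn context cap

for turns in (5, 50):
    stuffed = stuffed_input_tokens(turns, TOKENS_PER_TURN)
    capped = capped_input_tokens(turns, TOKENS_PER_TURN, CAP)
    print(turns, stuffed, capped, round(stuffed / capped, 2))
```

Under these assumptions the two strategies cost the same at five turns, which is demo scope. At fifty turns the stuffed approach has sent nearly three times the input tokens, and the ratio keeps widening as the session lengthens. A budget extrapolated from the five-turn case misses this entirely.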

The table below compares three approaches: naive context stuffing, RAG alone (misused as a memory substitute), and a proper four-layer memory architecture. The comparison is across the dimensions that matter in a production deployment: cost per task, hallucination rate, goal coherence at step 50, cross-session continuity, and auditability.

Figure 2 — Approach comparison: cost, recall, and coherence at production scale
| Dimension | Naive context stuffing | RAG only (no memory) | Four-layer memory architecture |
| --- | --- | --- | --- |
| Cost per long-horizon task | $5–8 unconstrained; grows quadratically with session length | $1–3; controlled if retrieval is disciplined | $0.3–0.8; working memory capped, retrieval targeted |
| Hallucination rate at 200K token equivalent | 28–31%; full-capacity attention dilution | 15–18%; better, but no episodic continuity | 10–13%; optimised retrieval, goal anchor protected |
| Goal coherence at step 50 | Degraded; original objective evictable | Partial; goal present if re-injected, often isn't | Maintained; goal anchor is structurally un-evictable |
| Cross-session continuity | None; every session starts from zero | None; RAG retrieves knowledge, not history | Full; episodic store persists across sessions |
| Auditability | None; no record of what was in context at each step | Partial; retrieval log exists, episodic context does not | Complete; every retrieval event logged, context reconstructable |

The auditability row deserves emphasis because it is the one that matters most in regulated environments. In financial services, healthcare, and legal applications — the domains where long-horizon agents create the most value — there is a compliance requirement that maps directly to memory architecture: the ability to reconstruct, for any decision the agent made, exactly what context it had access to at the moment it made it. A naive context-stuffing architecture cannot satisfy this requirement. An episodic memory store, with timestamped retrieval records, can. The memory architecture is not just a performance optimisation. In regulated deployments, it is a compliance requirement.
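
As a minimal sketch of what such a timestamped record might look like, the example below logs the exact context items injected for each decision and lets an auditor look them up later. The class names `ContextRecord` and `AuditLog`, the field names, and the in-memory list standing in for a durable store are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ContextRecord:
    """What the agent could see when it made one decision."""
    task_id: str
    step: int
    context_items: tuple[str, ...]   # exact strings injected into the prompt
    decision: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def digest(self) -> str:
        """Stable hash of the injected context, usable for tamper checks."""
        payload = json.dumps(list(self.context_items), ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log; an in-memory list stands in for a durable store."""

    def __init__(self) -> None:
        self._records: list[ContextRecord] = []

    def append(self, record: ContextRecord) -> None:
        self._records.append(record)

    def context_at(self, task_id: str, step: int) -> tuple[str, ...]:
        """Reconstruct the context for a given decision, or raise if unlogged."""
        for record in self._records:
            if record.task_id == task_id and record.step == step:
                return record.context_items
        raise KeyError(f"no context logged for {task_id} step {step}")


log = AuditLog()
log.append(
    ContextRecord(
        "po-17", 12, ("GOAL: renew vendor contract", "Option B quote"), "reject Option B"
    )
)
print(log.context_at("po-17", 12))
```

The point of the sketch is that the record is written by the orchestrator at decision time. It is not inferred afterwards from the model's output.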


The four-layer memory architecture: a working specification

The architectural pattern that solves the context rot problem is not new — it draws on decades of cognitive science research into how human memory operates. Human memory evolved as a layered system because holding everything in working memory is neurologically impossible. The same design principle, applied to agent systems, produces a four-layer architecture where each layer has a defined purpose, a defined capacity constraint, and a defined relationship to the others.

What follows is a working specification of each layer — not as an abstract framework, but as a set of engineering decisions with explicit constraints and implementation implications.

The four-layer agent memory architecture — specification
Layer 1: Working Memory ("The desk right now")
The active context window. Contains: the immutable goal anchor (always injected first, never evictable), the retrieved episodic and semantic context for the current step, and the immediate tool results. The discipline is in what you exclude, not what you include. A working memory that grows with session length is not working memory — it is a context-stuffing strategy with a different name.
Implementation: Firestore session state + structured context injection at each ReAct turn. Goal anchor stored separately with injection precedence enforced by the orchestrator.
Hard constraint: target under 20K tokens for most enterprise tasks. The turn limit (max 12 in the deal desk architecture) is a working memory constraint, not a performance parameter.
Layer 2: Episodic Memory ("The meeting notes")
A time-stamped record of what the agent experienced at each step: what decision was made, what data was referenced, what was tried and rejected, what constraints were discovered. Stored externally and retrieved via semantic search when a current step requires historical context. This is how an agent at step 50 can "remember" that Option B was rejected at step 12 — without carrying all 50 steps in its active window.
Implementation: AlloyDB pgvector or Vertex AI Vector Search. Step summaries written asynchronously by the orchestrator at each turn boundary. Retrieval: top-K semantic similarity against the current step's query embedding. Retention policy: task-scoped, expires on task completion unless flagged for long-term retention.
Critical: episodic memory is distinct from RAG. RAG retrieves knowledge; episodic memory retrieves history. Conflating them — using the same vector store for both — produces an agent that has no reliable sense of what it has already tried.
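
A minimal sketch of an episodic store follows, assuming a toy word-count "embedding" in place of a real embedding model and a Python list in place of a vector database. The step number stands in for the timestamp. The names `EpisodicStore`, `write`, and `recall` are illustrative, not the API of AlloyDB, pgvector, or Vertex AI Vector Search.

```python
import math
from collections import Counter
from dataclasses import dataclass


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass(frozen=True)
class Episode:
    step: int
    summary: str


class EpisodicStore:
    """Task-scoped history of what the agent did, kept apart from reference knowledge."""

    def __init__(self) -> None:
        self._episodes: list[Episode] = []

    def write(self, step: int, summary: str) -> None:
        self._episodes.append(Episode(step, summary))

    def recall(self, query: str, k: int = 3) -> list[Episode]:
        """Return the k episodes most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self._episodes, key=lambda e: cosine(q, embed(e.summary)), reverse=True
        )
        return ranked[:k]


store = EpisodicStore()
store.write(3, "requested quotes from three vendors")
store.write(12, "rejected option b because delivery exceeds deadline")
store.write(20, "confirmed budget ceiling with finance")

top = store.recall("should we choose option b", k=1)
print(top[0].step, top[0].summary)
```

Only the recalled summary enters working memory. The other steps stay in the store, which is what lets the active window stay small.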
Layer 3: Semantic Memory ("The reference library")
Stable domain knowledge: policy documents, product catalogues, regulatory requirements, contract precedents, customer profiles. Stored as vector embeddings. The agent queries this library at each step, retrieving only the 3–5 most relevant chunks for injection into working memory. This is where RAG belongs — not as a memory substitute, but as a knowledge retrieval mechanism for stable reference material.
Implementation: this is the AlloyDB pgvector + HNSW corpus described in the deal desk architecture. DLP-redacted at ingestion. SHAP-tracked at retrieval for explainability. Refresh cadence: asynchronous, triggered by corpus update events.
The critical distinction from Layer 2: semantic memory does not change with the agent's task history. It is the stable background knowledge. Episodic memory is the task-specific, time-ordered record. Both require vector retrieval, but they must be stored and queried separately.
Layer 4: Procedural Memory ("The playbook")
Encoded workflows, escalation rules, standard operating procedures, and autonomy boundaries. What the agent "knows how to do" independent of specific task content. In the deal desk architecture: the margin floor rule, the approval routing logic, the DLQ escalation procedure. These are not retrieved by semantic search — they are structurally enforced by the agent's tool manifest and the orchestrator's policy bindings.
Implementation: the Vertex AI Agent SDK tool manifest, the OPA policy bundle, and the Firestore configuration document. Not a vector store — procedural memory is deterministic and must be exactly reproducible. It should not be embedded and retrieved; it should be executed.
This is the layer most commonly missed in early enterprise deployments. Its absence is why agents that behave correctly on scripted demos break down on production edge cases: the procedural knowledge was implicit in the demo's setup, not explicitly encoded in the agent's architecture.
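
To illustrate "executed, not retrieved", here is a minimal sketch of a margin floor and an approval routing rule written as plain deterministic code. The thresholds, the `Policy` fields, and the three outcomes are hypothetical. In the architecture described here the equivalent logic would live in the policy bundle and tool manifest, not in a standalone function like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Procedural memory: fixed rules that are executed, never retrieved by similarity."""
    margin_floor: float          # hypothetical value, e.g. 0.15 means 15%
    auto_approve_limit: float    # hypothetical deal size the agent may approve alone


def route_deal(policy: Policy, deal_value: float, margin: float) -> str:
    """Deterministic routing: the same inputs always give the same outcome."""
    if margin < policy.margin_floor:
        return "reject"
    if deal_value > policy.auto_approve_limit:
        return "escalate"
    return "approve"


policy = Policy(margin_floor=0.15, auto_approve_limit=50_000)
print(route_deal(policy, deal_value=20_000, margin=0.22))
print(route_deal(policy, deal_value=80_000, margin=0.22))
print(route_deal(policy, deal_value=20_000, margin=0.08))
```

Because the rule is code, it can be versioned, tested, and replayed exactly. A rule retrieved by similarity search offers none of those guarantees.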

The goal anchor: the architectural primitive nobody talks about

Of all the components in the four layers, the one with the most direct impact on long-horizon task coherence is also the simplest: the goal anchor, which lives in working memory. It is an immutable record of the agent's original objective, injected into working memory at the start of every ReAct turn, with structural precedence over all other injected context. The goal anchor cannot be evicted. It is the first thing in the prompt, at every step, regardless of how full the working memory becomes.

This sounds almost trivially obvious. It is also violated in the majority of production agent deployments, for a reason that makes sense at development time and fails at production time: the goal is usually encoded in the system prompt, which is treated as static configuration. The system prompt is injected once. If it gets truncated — as prompts do when the context window fills — the goal goes with it. The model continues processing, now working from a context that contains the last twenty steps of tool results and no record of what it was supposed to accomplish. The outputs remain fluent. They are no longer goal-directed.

The architectural fix is to separate goal from system prompt. The goal anchor is a first-class data structure, stored in the agent's state, injected explicitly at the start of each turn's context construction, before the episodic and semantic retrievals, with a token budget reserved for it regardless of how much other material needs to fit in the window. It is not a string in a configuration file. It is a node in the orchestrator's state machine with protected injection semantics.
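
One way to picture the goal anchor as a first-class data structure is a frozen record that the context builder places first on every turn. This is a sketch under assumptions: the class name `GoalAnchor`, its fields, the default reserved budget, and the `build_context` helper are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class GoalAnchor:
    """Immutable record of the original objective, stored in agent state."""
    objective: str
    constraints: tuple[str, ...]
    reserved_tokens: int = 1_000   # assumed budget set aside for the anchor each turn

    def render(self) -> str:
        lines = [f"GOAL: {self.objective}"]
        lines += [f"CONSTRAINT: {c}" for c in self.constraints]
        return "\n".join(lines)


def build_context(anchor: GoalAnchor, other_items: list[str]) -> list[str]:
    """The anchor is placed first on every turn, whatever else is injected."""
    return [anchor.render(), *other_items]


anchor = GoalAnchor(
    objective="negotiate a three-year supply contract",
    constraints=("do not reorder parts already in stock",),
)
context = build_context(anchor, ["tool result: vendor A quote received"])
print(context[0])

try:
    anchor.objective = "something else"   # any attempt to mutate the anchor fails
except FrozenInstanceError:
    print("goal anchor is immutable")
```

Freezing the record makes accidental mutation fail loudly. Protection from eviction still has to come from the code that assembles the context, which is why the anchor is passed separately and never mixed into the other items.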

Figure 3 — Working memory assembly at each ReAct turn: injection order and budget allocation
[Figure: working memory assembly at each ReAct turn. Sources are injected in the following order, each with a per-turn token budget, and the total is capped at the working memory budget.]
Goal anchor (Firestore state, structurally protected): ~500–1,000 tokens, always first, cannot be truncated
Procedural constraints (OPA policy + tool manifest): ~1,000–2,000 tokens
Episodic memory (top-3 steps, semantic search against step history): ~2,000–4,000 tokens
Semantic knowledge (top-5 chunks, RAG corpus retrieval): ~3,000–6,000 tokens
Current tool results (most recent step outputs): ~2,000–5,000 tokens, budget-capped
Working memory budget: hard cap of ~15,000–20,000 tokens per turn
Key design rules:
① Goal anchor is injected first and has a reserved token budget. It is never truncated, regardless of other content pressure.
② Episodic and semantic retrievals are separate queries against separate stores. Never combined.
③ Tool results are budget-capped. If results exceed budget, they are summarised before injection.
④ Total is enforced by the orchestrator, not the model. The model does not decide what it sees. The architecture does.
Working memory is assembled by the orchestrator at the start of each ReAct turn, following a defined injection order and token budget allocation. The goal anchor occupies a reserved slot at the top of the assembled context and is never evictable. Episodic and semantic retrievals are separate operations against separate stores. Tool results are budget-capped and summarised if they exceed their allocation. The total is enforced by the orchestrator — not by hoping the model attends to what matters.
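
The assembly step can be sketched as follows, using the upper bounds of the per-slot budgets from Figure 3. Several things here are assumptions made to keep the example self-contained: the whitespace token count stands in for a real tokenizer, `truncate` stands in for summarisation, and the function and slot names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate(text: str, budget: int) -> str:
    """Placeholder for summarisation: keep the first `budget` words."""
    return " ".join(text.split()[:budget])


# Injection order and per-slot caps, taken from the upper bounds in Figure 3.
SLOTS = [
    ("procedural", 2_000),
    ("episodic", 4_000),
    ("semantic", 6_000),
    ("tool_results", 5_000),
]
TOTAL_BUDGET = 20_000


def assemble(goal_anchor: str, sources: dict[str, str], anchor_reserve: int = 1_000) -> list[str]:
    """Build one turn's context: anchor first and whole, everything else capped."""
    if count_tokens(goal_anchor) > anchor_reserve:
        raise ValueError("goal anchor exceeds its reserved budget")
    context = [goal_anchor]                  # never truncated
    remaining = TOTAL_BUDGET - anchor_reserve
    for name, cap in SLOTS:
        allowance = min(cap, remaining)
        piece = truncate(sources.get(name, ""), allowance)
        if piece:
            context.append(piece)
        remaining -= count_tokens(piece)
    return context


turn = assemble(
    "GOAL: negotiate a three-year supply contract",
    {"episodic": "step 12 rejected option b", "tool_results": "word " * 9_000},
)
print([count_tokens(part) for part in turn])
```

The oversized tool result is cut to its slot budget while the goal anchor passes through untouched. The caps are enforced in orchestrator code before the model sees anything.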

Why bigger context windows do not solve this

Meta's Llama 4 offers a 10-million-token context window. Magic's LTM-2-Mini claims 100 million tokens with efficiency improvements over traditional attention. These are genuinely impressive engineering achievements, and the research trajectory is real. They do not, however, solve the memory architecture problem.

The reason is structural. A larger context window reduces the frequency of eviction — the moment at which something falls out of the model's active attention. It does not eliminate the attention dilution problem: distributing attention across 10 million tokens produces a different curve, but the same fundamental degradation. The METR research finding from 2024–2025 — that AI task duration capability doubles approximately every seven months, but success rates fall below 10% for tasks exceeding four hours of equivalent human effort — is not primarily a context window limitation. It is a coherence limitation. Longer context just defers the point at which coherence breaks down.

More practically: filling a 10-million-token context window costs, at current pricing, approximately 50 times more per call than an optimised retrieval-based approach. At production volume, this is not a marginal difference. It is the difference between an economically viable system and one that is performing well in benchmarks while burning money in operation.

The organisations that will capture value from the agent capability curve over the next two to three years are not the ones that waited for context windows to get large enough to make memory architecture unnecessary. They are the ones that built the memory infrastructure now — the episodic store, the goal anchor mechanics, the retrieval pipelines — so that when the models become capable of managing week-long autonomous workflows, the system surrounding those models is already designed to hold and govern the context those workflows require.

The model is the engine. The memory architecture is the chassis, the steering, and the brakes. You cannot build a vehicle from an engine alone, and you cannot build a production-grade agent from a model alone — regardless of how large its context window is.


The four questions that reveal whether a system has been designed or just deployed

In any architecture review of an agent system intended for long-horizon tasks, four questions will distinguish a system that has been properly designed from one that will fail quietly in production.

Where is the goal anchor, and how is it protected from eviction? If the team cannot immediately point to a specific data structure, a specific injection step in the orchestrator, and a specific token budget allocation with a non-eviction guarantee, the goal anchor has not been designed. It may exist in a system prompt. System prompts can be truncated. That is not the same thing.

Where is the episodic memory store, and how is it separate from the semantic knowledge store? If both are in the same vector database with no logical separation, the agent has no reliable way to distinguish "what I tried last Tuesday" from "what the policy document says." These are fundamentally different retrieval problems, and conflating their stores produces retrieval results that are accurate for neither.

What happens to this agent at step 50 if the session is interrupted and resumed? If the answer is "it starts over," there is no cross-session continuity: whatever episodic store exists is per-session and in-memory. Any task that cannot be completed in a single uninterrupted session, which describes most enterprise tasks of meaningful scope, has a failure mode baked into the architecture.
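
A minimal sketch of the alternative follows: step history is written to durable storage and reloaded when a new session picks up the task. A JSON file in a temporary directory stands in for the real store, and the helper names and task identifier are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_episodes(path: Path, episodes: list[dict]) -> None:
    """Write the task's step history to durable storage (a JSON file here)."""
    path.write_text(json.dumps(episodes, indent=2), encoding="utf-8")


def load_episodes(path: Path) -> list[dict]:
    """Reload history on resume; an unknown task starts with an empty history."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def next_step(episodes: list[dict]) -> int:
    """Resume after the last recorded step instead of starting over."""
    return max((e["step"] for e in episodes), default=0) + 1


with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "task-po-17.json"

    # Session 1: the agent works through 50 steps, then is interrupted.
    session_one = [{"step": i, "summary": f"summary of step {i}"} for i in range(1, 51)]
    save_episodes(store_path, session_one)

    # Session 2: a fresh process reloads the history and continues.
    resumed = load_episodes(store_path)
    resume_at = next_step(resumed)

print(len(resumed), resume_at)
```

The resumed session knows it is at step 51 and can query everything recorded before the interruption. An in-memory store would have returned an empty history here.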

Can you reconstruct, for any decision this agent made, exactly what context it had at the moment of that decision? If the audit trail is the model's output, the answer is no. The model's output does not preserve what was in its context window. Only a governed episodic store with timestamped retrieval records can answer this question — and in any regulated environment, this is the question a compliance examination will ask.

None of these questions are about model quality. All of them are about architecture. The difference between a system that answers all four affirmatively and one that answers none is not the sophistication of the model it uses. It is the investment in the infrastructure that surrounds the model. That infrastructure is the memory architecture. And the memory architecture, not the context window, is the real bottleneck — and the real opportunity — in enterprise agent deployments.

References & Further Reading

Chroma Technical Report, July 2025 — Performance degradation study across 18 leading models. Documents the context rot phenomenon and the measurable performance cliff past 50,000 tokens. The hallucination rate comparison (12% at 200K optimised vs. 28–31% at 1M full capacity) is sourced from this report.

Journal of Machine Learning Research, 2024 — Agent performance degradation study documenting the 40% performance drop beyond 50,000 tokens due to attention dilution. The foundational empirical basis for the context rot concept.

Stevens Institute of Technology, "Hidden Economics of AI Agents," January 2026 — Unconstrained agent cost quantification ($5–8 per software engineering task) and the 50× token consumption multiple for Reflexion-style loops relative to single-pass equivalents.

Databricks RAG Benchmarks, 2025 — The 300% cost inflation finding for naive long-prompt workarounds, with documented absence of recall improvement relative to optimised retrieval architectures.

METR Research, 2024–2025 — AI task duration capability doubling rate (~7 months) and the success rate drop below 10% for tasks exceeding four hours. The research framing the agent capability trajectory and the memory infrastructure gap it implies.

Oracle Developer Blog, February 2026 — The four-memory-type framework (working, episodic, semantic, procedural) for AI agents, referenced in the specification section. Customer service as the leading enterprise use case at 26.5% of deployments (LangChain 2025 survey data).

Augment Code, "Context Window Wars," October 2025 — Head-to-head cost comparison ($0.08 vs. $0.38–0.42 per query for optimised retrieval vs. full-capacity context window usage) and the hallucination rate differential at equivalent context loads.
