Office Hours — How should you structure memory and context for AI agents so they can learn from past tasks without growing unbounded token usage?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
This is the core infrastructure problem for agents running long-lived workflows. You’re stuck between two bad options: lose context and watch agents repeat mistakes, or keep everything and watch your token bill explode. The answer isn’t to pick one. It’s to tier your memory strategically.
The Unbounded Context Problem Is Real
Every task an agent completes generates a log. Every interaction, every tool call, every failure. If you naively append all of this to the system prompt or context window, you hit token limits within hours of continuous operation. At frontier model pricing, even with caching, you’re paying for redundant context on every request. This is the hidden cost that nobody mentions when they demo an agent running 100 autonomous tasks.
The Daily Signal just covered this: token efficiency for agentic AI means caching, lazy-loading, routing, and compaction are now non-negotiable. These aren’t optional optimizations. They’re the difference between a prototype that runs for a week and a production system that runs for a month.
Three-Tier Memory Architecture
Structure your agent’s memory into layers, each with different retention logic.
Immediate context (current task window). This is the working memory for the active task. Keep only the last N tool calls, the current goal, and any relevant constraints. This might be 3-5 recent interactions, not 50. When a task completes, this gets flushed or summarized.
Episodic memory (recent task history). Store summaries of the last 5-10 completed tasks: what was attempted, what failed, what worked. Not full transcripts. Summaries. “Attempted to deploy using Docker, failed with permission error, switched to Kubernetes, succeeded.” This goes into a lightweight vector store or JSON log with timestamps. Use this to seed the agent’s context for the next similar task.
Semantic memory (learned patterns). This is harder and more valuable. After running 50 file parsing tasks, the agent should encode “when the PDF is scanned, OCR confidence below 85% means skip and flag for human review.” This isn’t logged interaction history. It’s compressed domain knowledge extracted from failure patterns. You build this through periodic summarization runs (offline, not per-request) where you ask a model to distill the last 20 task summaries into 3-4 key patterns. Store these as agent instructions or as retrievable rules.
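The three tiers can be sketched as a plain data structure. This is a minimal illustration, not a framework: the `AgentMemory` class, the cap of five immediate interactions, and the summarize-and-flush step are all assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Three-tier agent memory: bounded working set, task summaries, distilled rules."""
    immediate: list = field(default_factory=list)   # working memory for the active task
    episodic: list = field(default_factory=list)    # summaries of completed tasks
    semantic: list = field(default_factory=list)    # compressed rules extracted offline

    MAX_IMMEDIATE = 5  # keep only the last N interactions, per the tiering above

    def record(self, event: str) -> None:
        """Append an interaction, evicting anything beyond the last N."""
        self.immediate.append(event)
        self.immediate = self.immediate[-self.MAX_IMMEDIATE:]

    def complete_task(self, summary: str) -> None:
        """On task completion: persist a summary, flush working memory."""
        self.episodic.append(summary)
        self.immediate.clear()

mem = AgentMemory()
for i in range(8):
    mem.record(f"tool_call_{i}")
print(len(mem.immediate))  # 5 — older calls already evicted
mem.complete_task("Parsed 12 PDFs; 2 flagged for human OCR review")
print(len(mem.immediate), len(mem.episodic))  # 0 1
```

The key design property is that the immediate tier can never grow: eviction happens on every write, not in a periodic cleanup pass.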
Concrete Implementation: Token Budget Per Layer
Let’s say you’re running Claude Opus 4.7 with a 200K context window and you want sustainable multi-hour agent operation.
Reserve 60K tokens for immediate context (current task, recent tool calls, constraints). This covers most single-task workflows.
Reserve 40K tokens for episodic memory (last 10 task summaries, each ~4K tokens). Retrieve only summaries relevant to the current task using a simple embedding lookup.
Reserve 20K tokens for semantic memory (20-30 extracted rules and patterns). Keep these as structured text, not embeddings.
Reserve 80K tokens as overhead (system prompt, tool definitions, response buffer).
That leaves zero room for growth, which means you need to actually implement the compression. On task completion, automatically summarize the task log to 500 tokens max. Extract one rule if the task hit a new failure pattern. Discard the immediate context.
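The budget above can be enforced mechanically. A rough sketch, using a chars-to-tokens heuristic (the ~4-characters-per-token ratio is an approximation; swap in a real tokenizer for production):

```python
# Per-layer token budgets for a 200K-token context window, as laid out above.
BUDGETS = {
    "immediate": 60_000,
    "episodic": 40_000,
    "semantic": 20_000,
    "overhead": 80_000,
}

def within_budget(layer: str, texts: list[str], tokens_per_char: float = 0.25) -> bool:
    """Estimate token usage for a layer and check it against its budget.
    tokens_per_char ~0.25 approximates 4 chars/token for English text."""
    estimated = sum(len(t) for t in texts) * tokens_per_char
    return estimated <= BUDGETS[layer]

assert sum(BUDGETS.values()) == 200_000  # zero slack: compression is mandatory
print(within_budget("episodic", ["short task summary"] * 10))  # True
```

The assertion makes the "zero room for growth" point executable: any layer that overruns must steal from another, so the post-task compression step is load-bearing, not optional.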
Example: Compacting a Failed Task
Raw task log (Claude output, tool calls, feedback):
System: Deploy application to production
Model: I'll check the deployment configuration.
Tool: read_file("deploy.yaml") -> 15KB output
Model: Configuration looks good, running deploy script.
Tool: run_command("./deploy.sh") -> Error: insufficient disk space
Model: Freeing disk space...
Tool: cleanup_old_logs() -> Freed 2GB
Model: Retrying deployment.
Tool: run_command("./deploy.sh") -> Success
Compressed episodic memory (what goes into your vector store):
{
"task_id": "deploy_001",
"summary": "Deployment failed due to disk space. Freed old logs and retried successfully.",
"outcome": "success",
"learned_rule": "Check available disk before deploy steps",
"timestamp": "2026-05-04T14:32:00Z",
"tokens_in_log": 5200,
"tokens_in_summary": 180
}
Extracted semantic rule (goes into agent instructions):
If a deployment tool returns "insufficient disk space" error:
1. Run cleanup_old_logs() first
2. Check df -h to confirm >10GB free
3. Retry the deployment
This prevents 30% of deployment retries.
The raw log is discarded. The summary goes into your vector store. The rule gets folded into the system prompt (or kept separate and retrieved when relevant). The next time the agent runs a deployment task, it retrieves the summary and the rule, not the full context history. You’ve reduced a 5,200-token log to roughly 330 tokens of context: 180 for the summary plus about 150 for the rule.
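That retrieval step can be sketched without an embedding model. This toy version scores by word overlap purely for illustration; a real system would use embedding similarity (pgvector or similar). The `retrieve` function and its scoring are assumptions of the sketch, not a library API.

```python
def retrieve(query: str, episodic: list[dict], rules: list[str], k: int = 3):
    """Return the k most relevant task summaries plus any matching rules.
    Relevance here is naive word overlap; substitute embedding similarity
    in production."""
    q = set(query.lower().split())
    scored = sorted(
        episodic,
        key=lambda e: len(q & set(e["summary"].lower().split())),
        reverse=True,
    )
    hits = scored[:k]
    matched_rules = [r for r in rules if q & set(r.lower().split())]
    return hits, matched_rules

episodic = [
    {"task_id": "deploy_001",
     "summary": "Deployment failed due to disk space. Freed old logs and retried successfully."},
    {"task_id": "parse_014",
     "summary": "Parsed scanned PDF, OCR confidence 72%, flagged for review."},
]
rules = ["Check available disk before deployment steps"]

hits, matched = retrieve("run deployment to production", episodic, rules)
print(hits[0]["task_id"], matched)  # deploy_001 ['Check available disk before deployment steps']
```

Note what the agent receives: 330 tokens of distilled history instead of the 5,200-token transcript, selected at request time rather than carried on every request.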
The Caching Multiplier
Pair this with prompt caching. Your system prompt, tool definitions, and the latest extracted rules change slowly. Cache them. This reduces the per-request token cost of context by 80-90% (cached tokens cost 10% of standard tokens in most APIs). Compress your immediate context more aggressively because the fixed overhead is already cached.
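The caching win depends on prompt ordering: stable content must form a contiguous prefix, because prefix caches break at the first changed token. A sketch of that ordering (the message structure and `cacheable` flag here are illustrative, not a specific provider's API — providers typically expose this via cache-control markers on content blocks):

```python
def build_prompt(system: str, tools: str, rules: str,
                 episodic: str, immediate: str) -> list[dict]:
    """Order context from most stable to most volatile so prefix caching
    can reuse the slow-moving blocks across requests."""
    return [
        {"role": "system", "content": system,    "cacheable": True},   # rarely changes
        {"role": "system", "content": tools,     "cacheable": True},   # rarely changes
        {"role": "system", "content": rules,     "cacheable": True},   # changes on consolidation runs
        {"role": "user",   "content": episodic,  "cacheable": False},  # changes per task
        {"role": "user",   "content": immediate, "cacheable": False},  # changes per request
    ]

msgs = build_prompt("You are a deploy agent.", "tool definitions...",
                    "Rule: check disk before deploying.", "Last deploy summary...",
                    "Current task state...")
print(sum(m["cacheable"] for m in msgs))  # 3 stable blocks eligible for caching
```

Putting episodic memory after the rules (not before) matters: rules change weekly, episodic context changes per task, so this ordering maximizes the reusable prefix.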
When to Flush vs. Archive
After 20 or so completed tasks, the episodic memory store itself goes stale. Schedule periodic consolidation runs (weekly, or more often at high task volume) to re-summarize old summaries: “the last 5 deploy tasks all failed at the same step” becomes a rule, and the individual task records get archived to cold storage.
Implement a simple decay function: summaries older than 1 week drop to 50% retrieval priority unless they match the current task type. This keeps your active vector store lean while preserving long-tail learning.
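A minimal version of that decay function, assuming summaries carry a timestamp and a task type (the one-week threshold and 50% weight come from the text above; the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

def retrieval_weight(summary_time: datetime, current_task_type: str,
                     summary_task_type: str, now: datetime) -> float:
    """Halve retrieval priority for summaries older than one week,
    unless the summary matches the current task type."""
    if summary_task_type == current_task_type:
        return 1.0  # type match overrides age-based decay
    age = now - summary_time
    return 0.5 if age > timedelta(days=7) else 1.0

now = datetime(2026, 5, 4, tzinfo=timezone.utc)
old = now - timedelta(days=10)
print(retrieval_weight(old, "deploy", "parse", now))   # 0.5 — stale, different type
print(retrieval_weight(old, "deploy", "deploy", now))  # 1.0 — type match preserved
```

Multiply this weight into your similarity score at retrieval time; the vector store itself stays untouched, so "decay" is a query-time concern, not a storage migration.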
The Hard Part: Semantic Rule Extraction
The bottleneck isn’t logging or compression. It’s reliably extracting generalizable rules from task logs. You need a separate evaluation loop that runs offline:
- Every N tasks, sample 10-20 recent summaries.
- Ask a model (can be cheaper than your agent model) to identify patterns and propose rules.
- Tag each rule with confidence and a specific condition (e.g., “applies when tool = deploy_script”).
- Keep only rules that actually improve future agent success (which means you need a before/after test harness).
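The extraction loop above can be sketched as follows. The `llm_call` client, the JSON rule schema, and the 0.7 confidence cutoff are all hypothetical illustrations; the filtering step stands in for the before/after test harness, which this sketch does not implement.

```python
import json

def propose_rules(summaries: list[str], llm_call) -> list[dict]:
    """Offline extraction pass. `llm_call` is your (cheaper) model client:
    assumed signature is prompt str -> JSON string of candidate rules."""
    prompt = (
        "Distill these task summaries into at most 4 generalizable rules. "
        "Return JSON: [{'rule': ..., 'condition': ..., 'confidence': 0-1}]\n\n"
        + "\n".join(summaries)
    )
    proposed = json.loads(llm_call(prompt))
    # Keep only confident, scoped rules. A before/after harness should still
    # validate survivors before they reach the system prompt.
    return [r for r in proposed if r["confidence"] >= 0.7 and r.get("condition")]

# Stubbed model response for illustration:
fake_llm = lambda _: json.dumps([
    {"rule": "Check disk before deploy", "condition": "tool = deploy_script",
     "confidence": 0.9},
    {"rule": "Retry on timeout", "condition": "", "confidence": 0.8},  # unscoped: rejected
])
print(len(propose_rules(["...sampled summaries..."], fake_llm)))  # 1 rule survives
```

The unscoped rule gets dropped even at high confidence: a rule without a trigger condition is exactly the kind of stale generalization that degrades agents over time.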
Without this, you end up with stale, incorrect rules. With it, you’re building a learnable agent.
Tools and Patterns
Use LangGraph with its built-in state management to implement this. Graph checkpoints already give you task boundaries. Add a post-task summarization node that compacts logs before persisting state.
For episodic storage, use Postgres with pgvector or a lightweight vector DB like Milvus. For semantic rules, keep them in version-controlled YAML or in a structured rules engine.
Monitor your actual token usage per layer. If semantic memory rules aren’t actually being retrieved, delete them. If episodic memory is never used, you’re not doing task batching (which is fine, but you’ve paid for infrastructure you don’t need).
Bottom line: Structure memory in three tiers (immediate context, episodic summaries, semantic rules), compress task logs immediately after completion to 500 tokens max, and extract generalizable rules only when they measurably improve agent performance. Cache the slow-moving pieces aggressively. This keeps token usage bounded while agents actually learn.
Question via Hacker News