Office Hours — How do you give local LLMs persistent context without bloating prompts or losing information?

Running LLMs locally means you lose the infrastructure that hosted providers bake in. Claude Opus 4.7 or GPT-5.5 can juggle 200K context windows without flinching. A local Llama 4 or Mistral Large 3 instance on your hardware has real constraints, and naive approaches (stuffing everything into the system prompt, concatenating all prior conversations) tank performance fast. You need architecture.

The core problem

Context bloat has two failure modes. First, long prompts hurt latency and memory use, especially on consumer-grade GPUs where every 1K tokens added to context costs real seconds. Second, models stop paying attention to information buried in massive context windows. They “lose” the signal in the noise, a phenomenon documented across both frontier and open-weight models. You can’t just append conversation history indefinitely and expect coherent behavior.

The standard move is to split context into two separate concerns: working memory (the current task, recent conversation, immediate state) and reference material (persistent facts, documents, past decisions). Working memory stays tight. Reference material lives elsewhere.
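
One way to picture that split is a small sketch like the one below. The names `WorkingMemory` and `build_prompt` are hypothetical, and the cap of 8 messages is an assumption to tune, not a recommendation from any library.

```python
from collections import deque


class WorkingMemory:
    """Hypothetical helper: keeps only the most recent turns for the prompt."""

    def __init__(self, max_messages: int = 8):
        # A deque with maxlen silently drops the oldest message when full.
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def render(self) -> str:
        return "\n".join(f"{m['role']}: {m['content']}" for m in self.messages)


def build_prompt(reference: str, working: WorkingMemory, question: str) -> str:
    """Combine a small slice of reference material with the recent turns."""
    return (
        f"Reference material:\n{reference}\n\n"
        f"Recent conversation:\n{working.render()}\n\n"
        f"User: {question}"
    )
```

The reference string is whatever you pull from your durable store for this turn; the working memory never grows past its cap, no matter how long the session runs.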

Persistent context without the prompt

Use a file-backed or database-backed context store, separate from the prompt. When your local LLM needs to remember something durable, write it to a structured file or SQLite database, not the prompt itself.

Here’s a concrete pattern:

import sqlite3
import json
from datetime import datetime

class LocalLLMMemory:
    def __init__(self, db_path: str = "llm_memory.db"):
        self.db_path = db_path
        self._init_db()
    
    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('''
            CREATE TABLE IF NOT EXISTS facts (
                id INTEGER PRIMARY KEY,
                key TEXT UNIQUE,
                value TEXT,
                category TEXT,
                updated_at TIMESTAMP
            )
        ''')
        c.execute('''
            CREATE TABLE IF NOT EXISTS decisions (
                id INTEGER PRIMARY KEY,
                context TEXT,
                decision TEXT,
                reasoning TEXT,
                created_at TIMESTAMP
            )
        ''')
        conn.commit()
        conn.close()
    
    def store_fact(self, key: str, value: str, category: str = "general"):
        """Store a persistent fact about the user or context."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('''
            INSERT OR REPLACE INTO facts (key, value, category, updated_at)
            VALUES (?, ?, ?, ?)
        ''', (key, value, category, datetime.now().isoformat()))
        conn.commit()
        conn.close()
    
    def retrieve_facts(self, category: str = None) -> str:
        """Retrieve facts as a compact summary, not full context."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        
        if category:
            c.execute('SELECT key, value FROM facts WHERE category = ?', (category,))
        else:
            c.execute('SELECT key, value FROM facts')
        
        rows = c.fetchall()
        conn.close()
        
        if not rows:
            return ""
        
        # Return as compact JSON, not prose
        return json.dumps({key: value for key, value in rows})
    
    def store_decision(self, context: str, decision: str, reasoning: str):
        """Log decisions for future reference without storing full history."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('''
            INSERT INTO decisions (context, decision, reasoning, created_at)
            VALUES (?, ?, ?, ?)
        ''', (context, decision, reasoning, datetime.now().isoformat()))
        conn.commit()
        conn.close()
    
    def get_recent_decisions(self, limit: int = 5) -> str:
        """Retrieve only recent decisions, not entire history."""
        conn = sqlite3.connect(self.db_path)
        c = conn.cursor()
        c.execute('''
            SELECT decision, reasoning FROM decisions
            ORDER BY created_at DESC LIMIT ?
        ''', (limit,))
        rows = c.fetchall()
        conn.close()
        
        return json.dumps([
            {"decision": d, "reasoning": r} for d, r in rows
        ])

# Usage
memory = LocalLLMMemory()
memory.store_fact("user_role", "data engineer", category="profile")
memory.store_fact("project_deadline", "2026-06-15", category="project")
memory.store_decision(
    context="Choosing between Postgres and Snowflake for warehouse",
    decision="Chose Snowflake",
    reasoning="Query costs and scaling profile favor Snowflake for this dataset volume"
)

# When calling the local LLM, inject only relevant facts
system_prompt = f"""You are a helpful assistant with access to persistent context.

User profile:
{memory.retrieve_facts("profile")}

Recent decisions:
{memory.get_recent_decisions(limit=3)}

Current task: answer the user's question using this context."""

This keeps the prompt lean. Instead of appending the entire conversation history, you query the database for what’s relevant to the current task. The model sees maybe 500 tokens of facts instead of 50K tokens of raw transcript.

The retrieval problem

But here’s the catch: which facts do you actually need for the current task? You can’t ask the LLM “which facts do you need?” without adding latency (another round trip). A pragmatic middle ground is category-based retrieval. Tag facts when you store them, then inject facts matching the current conversation’s inferred category.

If the user asks about deadlines, fetch facts tagged “project”. If they ask about code style, fetch “engineering”. This is crude but effective and runs offline.
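
A minimal sketch of that inference step, assuming a hand-written keyword map. `CATEGORY_KEYWORDS` and `infer_category` are hypothetical names, and the keyword sets are placeholders you would replace with your own.

```python
import re

# Hypothetical keyword map; tune it to the categories you actually store.
CATEGORY_KEYWORDS = {
    "project": {"deadline", "milestone", "schedule", "launch"},
    "engineering": {"code", "style", "lint", "refactor", "test"},
    "profile": {"role", "team", "background"},
}


def infer_category(query: str, default: str = "general") -> str:
    """Pick the category whose keyword set overlaps the query the most."""
    # Exact word matching only: "deadlines" will not match "deadline".
    words = set(re.findall(r"[a-z]+", query.lower()))
    best, best_hits = default, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```

With the class above, the call would look something like memory.retrieve_facts(infer_category(user_query)). It misses plurals and synonyms, which is the price of staying offline and dependency-free.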

For more sophisticated scenarios, use embeddings. Store fact embeddings alongside the facts themselves, then similarity-search them against the current user query. This costs more compute upfront but scales to large fact stores.

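import sqlite3

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# db_path defaults to the same file LocalLLMMemory uses above
def retrieve_relevant_facts(query: str, db_path: str = "llm_memory.db", top_k: int = 5) -> str:
    """Retrieve facts by semantic similarity to the query."""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('SELECT key, value FROM facts')
    rows = c.fetchall()
    conn.close()
    
    if not rows:
        return ""
    
    # Normalize so the dot product below is cosine similarity.
    # For brevity this re-embeds every fact per query; cache the vectors for large stores.
    query_embedding = model.encode(query, normalize_embeddings=True)
    fact_texts = [f"{key}: {value}" for key, value in rows]
    fact_embeddings = model.encode(fact_texts, normalize_embeddings=True)
    
    scores = (query_embedding @ fact_embeddings.T)
    top_indices = scores.argsort()[-top_k:][::-1]
    
    # Drop weak matches; 0.3 is a starting threshold to tune
    relevant = [fact_texts[i] for i in top_indices if scores[i] > 0.3]
    return "\n".join(relevant) if relevant else ""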
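
The function above embeds every fact on each call. To actually store embeddings alongside the facts, one option is a BLOB column, sketched here with only the standard library. Everything named here is an assumption for illustration: the `fact_vectors` table, the helper names, and the `embed` callable, which stands in for whatever text-to-vector function you use.

```python
import math
import sqlite3
import struct
from typing import Callable, List, Sequence

Embedder = Callable[[str], Sequence[float]]  # hypothetical: text -> vector


def pack(vec: Sequence[float]) -> bytes:
    """Serialize a vector as float32 bytes for a BLOB column."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob: bytes) -> List[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_vectors ("
        "key TEXT PRIMARY KEY, value TEXT, embedding BLOB)"
    )


def store_fact_with_embedding(conn, key: str, value: str, embed: Embedder) -> None:
    """Embed once at write time so queries never re-embed the facts."""
    blob = pack(embed(f"{key}: {value}"))
    conn.execute(
        "INSERT OR REPLACE INTO fact_vectors VALUES (?, ?, ?)", (key, value, blob)
    )
    conn.commit()


def search_facts(conn, query: str, embed: Embedder,
                 top_k: int = 5, min_score: float = 0.3) -> List[str]:
    """Score the stored vectors against the query; only the query is embedded."""
    q = embed(query)
    rows = conn.execute("SELECT key, value, embedding FROM fact_vectors").fetchall()
    scored = [(cosine(q, unpack(blob)), f"{k}: {v}") for k, v, blob in rows]
    scored.sort(reverse=True)
    return [text for score, text in scored[:top_k] if score > min_score]
```

With sentence-transformers, `embed` could plausibly be lambda t: model.encode(t).tolist(). The pure-Python cosine loop is fine for a few thousand facts; past that, a vectorized or indexed search is the likelier choice.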

Handling conversation history

The other major context sink is conversation history. Don’t store the full dialogue. Instead, store a compressed summary after every 3-5 exchanges.

def compress_conversation(messages: list, llm_call) -> str:
    """Summarize recent exchanges into a single compact string."""
    history_text = "\n".join([
        f"{m['role']}: {m['content']}" for m in messages[-10:]  # Last 10 messages
    ])
    
    summary_prompt = f"""Summarize this conversation in 1-2 sentences, focusing on:
- What the user asked
- What was decided or learned
- Any follow-up action needed

Conversation:
{history_text}

Summary:"""
    
    summary = llm_call(summary_prompt)
    return summary

After compression, discard the raw messages and keep only the summary. Your next prompt includes the summary, not the full history.
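
To make the discard step concrete, here is a rough sketch of a rolling summary that folds new turns into the previous summary and then drops them. `RollingConversation` is a hypothetical wrapper, `llm_call` is any prompt-to-text function you supply, and the default of 8 messages (about 4 exchanges) is an assumption.

```python
from typing import Callable, Dict, List


class RollingConversation:
    """Hypothetical wrapper: one running summary plus a short raw tail."""

    def __init__(self, llm_call: Callable[[str], str], compress_every: int = 8):
        self.llm_call = llm_call  # any function: prompt -> completion
        self.compress_every = compress_every
        self.summary = ""
        self.pending: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.pending.append({"role": role, "content": content})
        if len(self.pending) >= self.compress_every:
            self._compress()

    def _compress(self) -> None:
        turns = "\n".join(f"{m['role']}: {m['content']}" for m in self.pending)
        prompt = (
            "Update the running summary with the new turns. "
            "Keep it under 3 sentences.\n\n"
            f"Current summary:\n{self.summary or '(none)'}\n\n"
            f"New turns:\n{turns}\n\nUpdated summary:"
        )
        self.summary = self.llm_call(prompt).strip()
        self.pending.clear()  # raw messages are dropped once summarized

    def context(self) -> str:
        """Text to place in the next prompt: summary plus unsummarized turns."""
        tail = "\n".join(f"{m['role']}: {m['content']}" for m in self.pending)
        return f"Summary so far: {self.summary}\n{tail}".strip()
```

Feeding the old summary back into each compression keeps earlier context from vanishing entirely, though details still erode over many rounds, so anything durable belongs in the fact store instead.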

Practical limits

Even with aggressive compression and fact storage, local L

Question via Hacker News