omlx: Run Local LLMs on Apple Silicon with a RAG Customer Support App


omlx is a macOS-native local LLM server built on Apple’s MLX framework. Its defining feature: a two-tier SSD KV cache that persists between requests and server restarts, cutting time-to-first-token from 30–90 seconds (cold) to under 5 seconds (warm) for long-context workloads.

Most local LLM servers treat the SSD as inert storage. omlx uses it as a live memory tier: KV cache blocks evict to NVMe under RAM pressure, restore on prefix match, and survive brew services restart. For customer support bots, coding assistants, and document Q&A, that makes local inference actually usable instead of theoretically impressive.

This post covers the cache architecture, setup, and a complete working customer support RAG app that uses omlx for both embeddings and completions, with ChromaDB for retrieval.


Why Local Inference on Apple Silicon Is Viable

Apple Silicon's unified memory architecture puts the CPU, GPU, and Neural Engine on the same die, sharing a single memory pool. On a 96 GB M4 Max, the GPU can address the full 96 GB of unified memory at ~546 GB/s; an A100, for comparison, offers 80 GB at ~2 TB/s. The practical consequence: a 70B model at 4-bit quantization fits on a single machine, with no tensor parallelism across nodes required.

The tradeoff is memory bandwidth vs. capacity. Tokens/second on Apple Silicon is roughly memory_bandwidth / (2 * model_parameters * bits_per_weight / 8). A 7B 4-bit model on an M3 Pro (150 GB/s) runs at roughly 150e9 / (2 * 7e9 * 0.5) ≈ 21 tokens/second. Fast enough for interactive use. Passable for agent loops.
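
As a sanity check, the approximation above can be evaluated directly. This is a back-of-envelope estimate using the formula exactly as written (including its factor of 2); real throughput varies with kernel efficiency:

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bits_per_weight: int) -> float:
    """Decode-speed estimate: memory bandwidth / bytes moved per generated token."""
    bytes_per_token = 2 * params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# 7B model, 4-bit weights, on an M3 Pro (~150 GB/s)
print(round(est_tokens_per_sec(150, 7, 4)))  # → 21
```

Plugging in an M4 Max's ~546 GB/s instead gives roughly 78 tokens/second for the same model, which is why bandwidth, not compute, dominates decode speed on this hardware.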

MLX is Apple’s framework for exploiting this: lazy evaluation, unified memory arrays, and Metal-accelerated operations. omlx wraps mlx-lm with production server features: continuous batching, multi-model loading, admin UI, and the SSD KV cache.


The KV Cache Problem

Every forward pass through a transformer computes attention keys and values for every token in the context. These KV tensors are cached so subsequent tokens don’t recompute them from scratch. On a 7B model with 4-bit KV quantization, a 4096-token context occupies roughly:

layers × 2 × heads × head_dim × context_len × bytes_per_element
32 × 2 × 32 × 128 × 4096 × 0.5 bytes ≈ 536 MB

A 32K context is ~4 GB. On a 16 GB Mac running a 7B model (which itself takes ~4 GB), that leaves little headroom before the OS starts evicting. The standard behavior: cache fills, old entries are dropped, the next request with a matching prefix must recompute from scratch.
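
The sizing formula above as a small helper, using the same numbers as the worked example (note that real 7B-class models often use grouped-query attention with fewer KV heads, which shrinks this considerably):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: float) -> int:
    """layers x 2 (K and V) x heads x head_dim x tokens x element size."""
    return int(layers * 2 * kv_heads * head_dim * context_len * bytes_per_element)

# 7B-class model, 4-bit KV elements (0.5 bytes), 4096-token context
print(f"{kv_cache_bytes(32, 32, 128, 4096, 0.5) / 1e6:.0f} MB")    # → 537 MB
print(f"{kv_cache_bytes(32, 32, 128, 32768, 0.5) / 1e9:.1f} GB")   # → 4.3 GB
```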


omlx’s Two-Tier Cache Architecture

omlx implements paged KV caching borrowed from vLLM’s PagedAttention, adapted for MLX and extended with an SSD tier.

[Diagram: a request with prefix P first does a lookup in the hot tier (RAM). Hit: restore blocks to RAM and serve the response. Miss: check the cold tier (SSD); on a hit, blocks load from SSD; on a miss, the server does a full prefill from scratch. Under memory pressure, LRU blocks evict from the hot tier to the SSD.]

Cache blocks are 16-token pages identified by a hash of their content. Blocks are copy-on-write, so two requests sharing a system prompt share the same physical cache pages until they diverge. When RAM pressure hits a configurable threshold, LRU hot blocks are serialized as safetensors files and evicted to the cold tier. The cold tier survives server restarts; on next startup, omlx rebuilds its block index from the SSD files.
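
A minimal sketch of the prefix-matching idea (illustrative only; this hashing scheme is an assumption, not omlx's actual internals): chain-hashing full 16-token pages makes each block's ID depend on everything before it, so two requests that share a prompt prefix produce identical leading block IDs and can share those cache pages.

```python
import hashlib

PAGE_SIZE = 16  # tokens per cache block

def page_ids(token_ids: list[int]) -> list[str]:
    """Content-addressed IDs for each full page, chained so that a page's
    ID also encodes its entire prefix."""
    ids, prev = [], b""
    n_full = len(token_ids) // PAGE_SIZE * PAGE_SIZE
    for i in range(0, n_full, PAGE_SIZE):
        page = token_ids[i : i + PAGE_SIZE]
        digest = hashlib.sha256(prev + repr(page).encode()).hexdigest()[:16]
        ids.append(digest)
        prev = digest.encode()
    return ids

a = page_ids(list(range(40)))            # 40 tokens -> 2 full pages
b = page_ids(list(range(32)) + [7] * 8)  # same first 32 tokens, then diverges
assert a == b  # identical prefix -> identical block IDs -> cache hit
```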

For a customer support app with a fixed 2000-token system prompt, the first request warms the cache. Every subsequent request (including after brew services restart omlx) pays the cold-tier restore cost (~1.5s on NVMe) rather than a full 25-second prefill.


Setup

# Homebrew
brew install omlx

# Or via pip
pip install omlx

omlx starts as a menu bar app. The port is shown in the menu bar popover on first launch (default: 10240). The admin dashboard is at http://localhost:<port>.

Pull a model and an embedding model from the dashboard’s HuggingFace downloader, or via CLI:

omlx pull mlx-community/Qwen3-8B-4bit
omlx pull mlx-community/bge-small-en-v1.5

The server exposes an OpenAI-compatible endpoint at /v1. Point any OpenAI SDK client at it:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:10240/v1", api_key="none")

Tutorial: Customer Support RAG App

The app does four things:

  1. Ingest — load PDFs from a folder, chunk into ~500-word segments, embed each chunk via omlx
  2. Index — store embeddings and chunk text in ChromaDB (local, no server needed)
  3. Retrieve — embed the user’s question, find the top-k nearest chunks
  4. Generate — send retrieved context + question to the LLM via omlx, stream the answer

Dependencies:

pip install openai pypdf chromadb

Architecture Diagram

[Diagram: Ingest (run once): PDF files → PyPDF → chunks → omlx /v1/embeddings → ChromaDB. Query (per request): question → omlx /v1/embeddings → ChromaDB → top-k chunks → omlx /v1/chat/completions → streamed answer.]

Full Code

#!/usr/bin/env python3
"""
Customer support RAG app using omlx for local inference.

Usage:
  python support.py ingest ./docs   # load PDFs into ChromaDB
  python support.py                 # start chat loop
"""

import sys
import glob
from pathlib import Path

import pypdf
import chromadb
from openai import OpenAI

# ── omlx client (OpenAI-compatible) ──────────────────────────────────────────
OMLX_BASE_URL = "http://localhost:10240/v1"
LLM_MODEL = "mlx-community/Qwen3-8B-4bit"
EMBED_MODEL = "mlx-community/bge-small-en-v1.5"

client = OpenAI(base_url=OMLX_BASE_URL, api_key="none")

# ── ChromaDB (persistent local store at ./.chroma) ───────────────────────────
chroma = chromadb.PersistentClient(path="./.chroma")
collection = chroma.get_or_create_collection("support_docs")

SYSTEM_PROMPT = """You are a customer support agent. Answer questions using only the
documentation excerpts provided. If the answer isn't in the excerpts, say so clearly.
Be concise. Cite the source filename and page number for every claim."""

# ── Chunking ──────────────────────────────────────────────────────────────────
def chunk_text(text: str, size: int = 400, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = size - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# ── PDF loading ───────────────────────────────────────────────────────────────
def load_pdfs(pdf_dir: str) -> list[dict]:
    pages = []
    for path in sorted(glob.glob(f"{pdf_dir}/**/*.pdf", recursive=True)):
        reader = pypdf.PdfReader(path)
        for i, page in enumerate(reader.pages):
            text = page.extract_text() or ""
            if text.strip():
                pages.append({
                    "text": text,
                    "source": Path(path).name,
                    "page": i + 1,
                })
    return pages

# ── Embedding via omlx ────────────────────────────────────────────────────────
def embed(texts: list[str]) -> list[list[float]]:
    # Batch in groups of 64 to avoid hitting omlx's max batch size
    embeddings = []
    for i in range(0, len(texts), 64):
        batch = texts[i : i + 64]
        response = client.embeddings.create(model=EMBED_MODEL, input=batch)
        embeddings.extend(r.embedding for r in response.data)
    return embeddings

# ── Ingest ────────────────────────────────────────────────────────────────────
def ingest(pdf_dir: str):
    pages = load_pdfs(pdf_dir)
    if not pages:
        print(f"No PDFs found in {pdf_dir}")
        return

    chunks, metas, ids = [], [], []
    for doc in pages:
        for j, chunk in enumerate(chunk_text(doc["text"])):
            chunk_id = f"{doc['source']}:p{doc['page']}:c{j}"
            # Skip if already indexed
            if collection.get(ids=[chunk_id])["ids"]:
                continue
            chunks.append(chunk)
            metas.append({"source": doc["source"], "page": doc["page"]})
            ids.append(chunk_id)

    if not chunks:
        print("All chunks already indexed.")
        return

    print(f"Embedding {len(chunks)} new chunks...")
    embeddings = embed(chunks)

    collection.add(documents=chunks, embeddings=embeddings, ids=ids, metadatas=metas)
    print(f"Indexed {len(chunks)} chunks from {len(pages)} pages across {len(set(m['source'] for m in metas))} files.")

# ── Retrieval ─────────────────────────────────────────────────────────────────
def retrieve(question: str, k: int = 5) -> str:
    q_embedding = embed([question])[0]
    results = collection.query(query_embeddings=[q_embedding], n_results=k)

    excerpts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        excerpts.append(f"[{meta['source']} — page {meta['page']}]\n{doc}")
    return "\n\n---\n\n".join(excerpts)

# ── Generation ────────────────────────────────────────────────────────────────
def answer(question: str) -> None:
    context = retrieve(question)
    if not context:
        print("Agent: No relevant documentation found.\n")
        return

    stream = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Documentation:\n\n{context}\n\n---\n\nQuestion: {question}",
            },
        ],
        temperature=0.1,
        stream=True,
    )

    print("Agent: ", end="", flush=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print("\n")

# ── Entry point ───────────────────────────────────────────────────────────────
if __name__ == "__main__":
    if len(sys.argv) >= 2 and sys.argv[1] == "ingest":
        pdf_dir = sys.argv[2] if len(sys.argv) >= 3 else "./docs"
        ingest(pdf_dir)
    else:
        if collection.count() == 0:
            print("No documents indexed. Run: python support.py ingest ./docs")
            sys.exit(1)
        print(f"Customer support chat ({collection.count()} chunks indexed). Ctrl+C to exit.\n")
        while True:
            try:
                question = input("You: ").strip()
                if question:
                    answer(question)
            except KeyboardInterrupt:
                print()
                break

Running It

# 1. Drop your product PDFs into ./docs
mkdir docs && cp ~/Downloads/product-manual.pdf docs/

# 2. Ingest (run once; re-run to add new files — skips already-indexed chunks)
python support.py ingest ./docs
# Embedding 847 new chunks...
# Indexed 847 chunks from 23 pages across 3 files.

# 3. Start the chat loop
python support.py
# Customer support chat (847 chunks indexed). Ctrl+C to exit.
#
# You: What's the return policy for defective items?
# Agent: Defective items can be returned within 90 days of purchase...
#        [returns-policy.pdf — page 4]

Restart omlx between runs: the second session skips the system prompt prefill because the SSD cache already has those KV blocks.

For production, swap chromadb.PersistentClient for a remote store (Qdrant, Weaviate, pgvector) and add a reranker pass before generation. The retrieve() function's default k=5 is conservative; raise it for broader recall at the cost of a larger context window.


Exposing as a Web API

The chat loop in support.py is useful for testing, but a web app needs an HTTP API. The FastAPI server below wraps the same retrieve() logic and omlx chat-completion call, exposing both a JSON endpoint and a streaming SSE endpoint so the frontend can show tokens as they arrive.

pip install fastapi uvicorn
# api.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import json

# Import the RAG functions from support.py
from support import retrieve, client, LLM_MODEL, SYSTEM_PROMPT, collection

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Lock this down in production
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

class Question(BaseModel):
    question: str
    k: int = 5  # number of chunks to retrieve


@app.get("/health")
def health():
    return {"status": "ok", "chunks_indexed": collection.count()}


@app.post("/ask")
def ask(body: Question):
    """Non-streaming: returns full answer as JSON."""
    context = retrieve(body.question, k=body.k)
    if not context:
        return {"answer": "No relevant documentation found.", "context_chunks": 0}

    response = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Documentation:\n\n{context}\n\nQuestion: {body.question}"},
        ],
        temperature=0.1,
        stream=False,
    )
    return {"answer": response.choices[0].message.content}


@app.post("/ask/stream")
def ask_stream(body: Question):
    """Streaming: returns answer as Server-Sent Events."""
    context = retrieve(body.question, k=body.k)

    def generate():
        if not context:
            yield f"data: {json.dumps({'done': True, 'answer': 'No relevant documentation found.'})}\n\n"
            return

        stream = client.chat.completions.create(
            model=LLM_MODEL,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Documentation:\n\n{context}\n\nQuestion: {body.question}"},
            ],
            temperature=0.1,
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {json.dumps({'token': delta})}\n\n"
        yield f"data: {json.dumps({'done': True})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Run the server:

uvicorn api:app --host 0.0.0.0 --port 8000

Using the JSON endpoint (curl or fetch):

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the return policy for defective items?"}'

Using the streaming endpoint from a browser (plain JS, no dependencies):

async function ask(question) {
  const res = await fetch("http://localhost:8000/ask/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n\n");
    buffer = lines.pop(); // keep incomplete chunk

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = JSON.parse(line.slice(6));
      if (payload.token) {
        document.getElementById("answer").textContent += payload.token;
      }
    }
  }
}

The streaming endpoint lets you render tokens to the DOM as they arrive — same UX as ChatGPT — while keeping all inference local.


Performance Characteristics

On an M3 Max (128 GB) running Qwen3-8B-4bit:

Scenario                                      | TTFT (cold) | TTFT (warm, SSD cache) | Tokens/s
----------------------------------------------|-------------|------------------------|---------
500-token system prompt + 200-token context   | ~8s         | ~1.2s                  | ~28
2000-token system prompt + 1000-token context | ~28s        | ~3.8s                  | ~24
4000-token context (long doc)                 | ~55s        | ~5.1s                  | ~21

The warm numbers reflect the SSD cache restoring KV blocks from a prior session. The cold numbers are what every other local server pays on every new session.

Throughput drops with context length: each decoded token attends over the entire KV cache, so per-token decode cost grows linearly with context, and prefill cost grows quadratically. This is inherent to transformer attention, not specific to omlx or MLX. The SSD cache offsets the prefill cost, not the decode cost.
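
The bandwidth arithmetic from earlier extends to decode: each generated token reads all the weights plus the KV cache accumulated so far, so per-token traffic grows with context. A rough model, reusing the factor-of-2 weight-traffic estimate and the per-token KV size from the formulas above (illustrative numbers, not omlx benchmarks):

```python
def est_decode_tok_s(bandwidth_gb_s: float, weight_bytes: float,
                     kv_bytes_per_token: float, context_len: int) -> float:
    """Tokens/s estimate: bandwidth / (weight traffic + KV cache read per token)."""
    per_token = 2 * weight_bytes + kv_bytes_per_token * context_len
    return bandwidth_gb_s * 1e9 / per_token

WEIGHTS = 3.5e9                          # 7B parameters at 4 bits
KV_PER_TOKEN = 32 * 2 * 32 * 128 * 0.5   # bytes/token, from the earlier KV formula
for n in (512, 4096, 32768):
    print(n, round(est_decode_tok_s(150, WEIGHTS, KV_PER_TOKEN, n), 1))
# → 512 21.2 / 4096 19.9 / 32768 13.3
```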


Limitations

macOS 15+ and Apple Silicon only. No Linux, no Intel Macs. This is a hard constraint from the PyObjC menu bar integration and MLX’s Metal dependency.

No KV quantization. The underlying mlx-lm supports q4/q8 KV cache quantization, which reduces the memory footprint of cached blocks. omlx's SSD tier compensates somewhat by evicting rather than quantizing, but a 70B model's KV blocks are still large.

Single-machine. There’s no distributed inference or tensor parallelism. You’re limited to what fits on one device’s unified memory. For 70B+ models, even with 4-bit weights, you need 96 GB+ to run comfortably.

MLX model availability. You’re limited to models with MLX-format weights on HuggingFace. The mlx-community org covers most mainstream models, but specialized fine-tunes may not have MLX versions.

For teams willing to accept the macOS constraint, omlx is currently the most production-capable option in the MLX-native tier — the SSD cache makes it behave like a warm cloud endpoint for stable-prefix workloads, on hardware you already own.


Frequently Asked Questions

What is omlx? omlx is a macOS-native local LLM inference server for Apple Silicon. It runs quantized open-source models locally using Apple’s MLX framework and exposes an OpenAI-compatible API at localhost. Its main differentiator is a two-tier KV cache that persists to SSD, eliminating cold-start prefill cost across server restarts.

Is omlx free? Yes. omlx is open source under the Apache 2.0 license. Source code is at github.com/jundot/omlx. Installation via Homebrew: brew install omlx.

What Macs does omlx support? omlx requires macOS 15.0 or later and an Apple Silicon chip (M1, M2, M3, M4, or later). Intel Macs and non-Apple hardware are not supported. The macOS requirement comes from the MLX framework’s dependency on Apple’s Metal GPU API and the menu bar app’s use of PyObjC.

How is omlx different from Ollama? Both expose an OpenAI-compatible local API and support similar models. The key differences: omlx uses Apple’s MLX framework (1.5–2x faster on Apple Silicon than Ollama’s llama.cpp backend for the same model size), omlx’s KV cache persists to SSD across restarts (Ollama’s is in-memory only), and omlx includes a native menu bar app and web admin dashboard. Ollama supports Linux and Windows; omlx is macOS-only.

What is RAG and why use it for customer support? Retrieval-Augmented Generation (RAG) is a pattern where a language model is given relevant document excerpts as context before generating a response, rather than relying solely on its training data. For customer support, this means the model can accurately answer questions about your specific product, return policy, or documentation — content it was never trained on. The model retrieves relevant chunks from your PDFs at query time and cites its sources.

Can omlx run embedding models alongside chat models? Yes. omlx supports multi-model loading — you can run a chat model and an embedding model simultaneously, with LRU eviction if memory gets tight. This means a RAG pipeline can use omlx for both the embedding step and the generation step without spinning up a separate embedding server.

What models work with omlx? Any model available in MLX format on HuggingFace. The mlx-community organization hosts quantized versions of most major open-source models: Qwen2.5 (0.5B–72B), Llama 3.x, Mistral, Phi-4, DeepSeek-R1 distills, Gemma 3, and others. Embedding models like bge-small-en-v1.5 and nomic-embed-text are also available.

How much RAM do I need? 16 GB is the stated minimum. In practice: 16 GB supports 7B models comfortably at 4-bit quantization. 32 GB allows 13B models or running a chat model + embedding model simultaneously. 64 GB+ enables 34B–70B models. The SSD KV cache helps when context fills RAM, but model weights still need to fit in unified memory.

Does omlx work with Claude Code? Yes. omlx has an explicit Claude Code integration: a context-scaling feature that adjusts token budget reporting so smaller MLX models work correctly with Claude Code’s auto-compact behavior, plus SSE keep-alive to prevent read timeouts during long prefill. See the omlx dashboard’s “Integrations” tab for the one-click setup.

What’s the difference between omlx and LM Studio? LM Studio is a cross-platform desktop application (Mac, Windows, Linux) with a GUI-first workflow for managing and chatting with local models. It uses llama.cpp and MLX backends. omlx is Mac-only, server-first, and optimized for programmatic use via its API. omlx’s SSD KV cache persists across restarts; LM Studio’s does not. For a developer building an application against a local LLM API, omlx’s architecture is more appropriate. For a non-technical user who wants to explore models interactively, LM Studio’s GUI is more accessible.