The Stack: Apple Silicon Local LLM Servers for Running Agents
Ollama, LM Studio, omlx, llama.cpp, MLX-LM, vMLX — compared on the specific requirements of local agent workloads on Apple Silicon.
Running an agent loop locally is harder than running a chat session. Agents make dozens of sequential LLM calls, each building on the last. Tool call parsing has to be reliable. Long context (accumulated tool outputs, intermediate reasoning, retrieved documents) has to be handled without ballooning TTFT on every turn. And the server has to stay alive and consistent across a multi-minute run without drifting or evicting state mid-task.
Most local LLM servers were built for the chat use case. They work for agents, but with friction.
What Agent Workloads Demand
| Requirement | Why it matters for agents |
|---|---|
| Tool/function calling | Agent frameworks (Claude Code, LangGraph, smolagents) rely on structured tool call responses |
| Streaming | Long tool outputs need incremental delivery; blocking until completion kills responsiveness |
| Consistent prefix caching | The system prompt + prior tool calls should be cached; re-prefilling them every turn is expensive |
| Low TTFT at long context | Turn 10 of an agent loop may have 8K tokens of prior context; TTFT compounds across turns |
| Concurrent requests | Some frameworks issue parallel tool calls; a single-queue server serializes them |
| Stable tool call parsing | Malformed JSON in a tool call response terminates the agent run |
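Why prefix caching and low TTFT compound: every turn re-sends the entire transcript, so without a warm prefix the server re-prefills a growing context on each call. A toy cost model (token counts are illustrative round numbers, and the cached case assumes the system prompt is already warm, e.g. from a prior turn):

```python
# Illustrative: cumulative prefill cost of an agent loop with and
# without prefix caching. Token counts are made-up round numbers.

SYSTEM_TOKENS = 2000  # tool definitions + instructions, same every turn
TURN_TOKENS = 600     # one tool call + its output, appended per turn

def prefill_tokens(turns: int, prefix_cached: bool) -> int:
    """Total tokens prefilled across a run of `turns` calls."""
    total = 0
    for t in range(1, turns + 1):
        context = SYSTEM_TOKENS + t * TURN_TOKENS  # full transcript this turn
        if prefix_cached:
            total += TURN_TOKENS  # only the new suffix is prefilled
        else:
            total += context      # everything is re-prefilled
    return total

print(prefill_tokens(10, prefix_cached=False))  # 53000
print(prefill_tokens(10, prefix_cached=True))   # 6000
```

Nearly a 9x difference in prefill work over a ten-turn run, and the gap widens with more turns. This is the single biggest lever the servers below differ on.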
The Servers
Ollama
Best for: quick setup, broad model compatibility, cross-platform teams
Ollama is the easiest entry point and has the largest model library. It wraps llama.cpp (with an optional MLX runner) behind a clean REST API and a CLI that feels like docker pull for models.
Tool calling works for models with built-in function calling templates (Llama 3.1+, Qwen2.5, Mistral Nemo). Ollama handles the formatting automatically if the model’s Modelfile has the right template; no hand-crafted tool call JSON needed.
```bash
ollama pull qwen2.5:7b
ollama serve  # starts at localhost:11434
```
The OpenAI-compatible endpoint at /v1 works with LangChain, LlamaIndex, and most agent frameworks without modification:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
```
Agent-specific gaps: Requests are processed sequentially per model with no continuous batching. Parallel tool calls from a framework that issues concurrent requests will queue. Prefix caching is in-memory only; restarting Ollama between sessions means the next agent run pays full prefill cost again. For single-agent runs on a local machine, neither of these matters. For multi-agent pipelines or long-running sessions, they do.
LM Studio
Best for: non-technical users, model discovery, Windows/Linux/Mac parity
LM Studio is the most polished desktop experience. Model downloads, quantization selection, and context length tuning are all in a GUI. The lms CLI and local server expose an OpenAI-compatible API identical to Ollama’s.
Continuous batching arrived in 0.4.0 (llama.cpp backend) and 0.4.2 (MLX backend), making it viable for parallel tool calls. Context shift on Llama models is handled automatically in the GUI, though the behavior in server mode requires manual configuration.
Agent-specific gaps: The server is tightly coupled to the GUI process; running headless on a remote Mac is possible but awkward. No persistent KV cache across sessions. Closed source, so you can’t inspect or modify the batching behavior.
llama.cpp server
Best for: maximum control, cross-platform deployment, GGUF ecosystem
llama-server (the HTTP server in llama.cpp) is what Ollama wraps, exposed directly. Every knob is accessible: context length, KV quantization type, batch size, number of parallel sequences, slot management.
Tool calling via Jinja2 templates works for any model with a chat template. The --jinja flag enables automatic tool call formatting from the model’s own template.
```bash
llama-server \
  --model ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --parallel 4 \
  --cache-type-k q8_0 \
  --port 8080
```
--parallel 4 enables 4 concurrent sequences. --cache-type-k q8_0 quantizes the KV cache to 8-bit, roughly halving its memory footprint and extending the effective context you can keep warm simultaneously.
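To see where that saving comes from, the KV cache size can be estimated from the model's attention geometry. A back-of-envelope calculation for Qwen2.5-7B (28 layers, 4 KV heads via GQA, head dim 128, per the model card), treating q8_0 as roughly 1 byte per element (slightly more in practice due to block scales):

```python
# Rough KV-cache sizing for Qwen2.5-7B at 32K context.
# Model shape (from the model card): 28 layers, 4 KV heads (GQA), head dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128
CTX = 32768

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # 2x for the K and V tensors, per layer, per KV head, per head dim, per token
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CTX * bytes_per_elem

print(kv_cache_bytes(2.0) / 2**30)  # f16:  1.75 GiB
print(kv_cache_bytes(1.0) / 2**30)  # q8_0: 0.875 GiB (ignoring scale overhead)
```

At 4 parallel sequences those numbers multiply, which is why KV quantization is often the difference between fitting a multi-agent workload in unified memory or not.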
KV slot save/restore (--slot-save-path) writes cache snapshots to disk, but it’s a manual operation: call the /slots API to save and restore. Not seamless, but available.
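A sketch of what that manual operation looks like, using only the standard library. The endpoint shape (`POST /slots/{id}?action=save|restore` with a JSON `filename` body, resolved against `--slot-save-path`) is based on the llama-server docs; verify it against your build, as the API has changed between versions:

```python
# Sketch: saving/restoring a llama.cpp server KV slot via the /slots API.
# Endpoint shape per llama-server docs; confirm against your build's version.
import json
import urllib.request

BASE = "http://localhost:8080"

def slot_action_url(slot_id: int, action: str) -> str:
    return f"{BASE}/slots/{slot_id}?action={action}"

def slot_request(slot_id: int, action: str, filename: str) -> urllib.request.Request:
    # Filenames resolve relative to the directory given via --slot-save-path.
    body = json.dumps({"filename": filename}).encode()
    return urllib.request.Request(
        slot_action_url(slot_id, action),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(slot_request(0, "save", "agent-prefix.bin"))
# ... restart the server ...
# urllib.request.urlopen(slot_request(0, "restore", "agent-prefix.bin"))
```

Wiring this into an agent harness (save after the system prompt is prefilled, restore at session start) is left to you, which is exactly the "not seamless" part.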
Agent-specific gaps: Metal acceleration on Apple Silicon works, but the MLX-based servers run 1.5–2x faster on the same hardware for the same model size. If you're Mac-only, llama.cpp trades away performance for portability you don't need.
MLX-LM serve
Best for: baseline MLX serving, Apple-maintained, minimal dependencies
mlx_lm.server is the official server for MLX models and the reference implementation: correct and maintained, but not production-hardened.
```bash
pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen2.5-7B-Instruct-4bit --port 8080
```
OpenAI-compatible. Tool calling works if the model supports it. Rotating fixed-size KV cache handles context overflow by dropping old tokens. Fine for chat, but disruptive for agent loops where early tool call results matter.
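The failure mode is easy to picture with a bounded buffer: once the window fills, the oldest entries are silently dropped, and in an agent loop the oldest entries are often the turn-1 tool results the final answer depends on. A trivial illustration (an 8-token window standing in for the real cache):

```python
# Why a rotating fixed-size KV cache hurts agent loops: once the window
# fills, the oldest entries (often early tool results) are silently dropped.
from collections import deque

cache = deque(maxlen=8)  # illustrative window of 8 tokens
for tok in range(12):
    cache.append(tok)
print(list(cache))  # [4, 5, 6, 7, 8, 9, 10, 11] — tokens 0–3 are gone
```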
Agent-specific gaps: No continuous batching, no SSD KV tier, no admin UI. Single-request serving. For development and testing it’s perfectly adequate. For a real agent workload running overnight, you want more.
omlx
Best for: Apple Silicon-only teams, persistent context, Claude Code integration
omlx wraps MLX with a production server layer: continuous batching, multi-model loading, web admin dashboard, and a two-tier SSD KV cache.
The SSD cache matters specifically for agents with long, stable system prompts. An agent's system prompt (tool definitions, instructions, persona) is typically 1,000–3,000 tokens and identical across every call in a run. With omlx, that prefix is cached as block-hashed pages in RAM, spills to SSD under memory pressure, and survives server restarts. Turn 15 of an agent loop pays the same TTFT as turn 1.
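The block-hashing idea is worth pausing on, because it's what makes the cache prefix-shareable rather than per-request. A conceptual sketch (block size and hash choice here are illustrative, not omlx internals): split the token sequence into fixed blocks and key each block by a hash chained over everything up to and including it, so two requests with identical prefixes map to identical page keys.

```python
# Conceptual sketch of block-hashed prefix caching: key each fixed-size
# block of tokens by a hash chained over the whole prefix so far, so
# identical prefixes resolve to identical cache pages.
# Block size and hashing are illustrative, not omlx internals.
import hashlib

BLOCK = 256  # tokens per cache page (illustrative)

def block_keys(token_ids: list[int]) -> list[str]:
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest()[:16])
    return keys

# Two runs sharing a 512-token system prompt share their first two page
# keys, so those pages are served from RAM (or SSD) instead of re-prefilled.
a = block_keys(list(range(512)) + [1, 2, 3] * 100)
b = block_keys(list(range(512)) + [9, 9, 9] * 100)
assert a[:2] == b[:2] and a[2] != b[2]
```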
Tool calling and streaming both work via the OpenAI-compatible API. The Claude Code integration is explicit: omlx has a context-scaling feature that adjusts token budget reporting so smaller MLX models work correctly with Claude Code’s auto-compact behavior.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="none")
response = client.chat.completions.create(
    model="mlx-community/Qwen2.5-7B-Instruct-4bit",
    tools=[{
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search internal documentation",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
    messages=[{"role": "user", "content": "Find our refund policy"}],
)
```
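The response carries `tool_calls` that the agent has to execute locally and feed back as `tool` messages. A minimal dispatch step, with a stub `search_docs` standing in for a real retrieval function:

```python
# Minimal dispatch step: execute the model's tool call locally and build
# the `tool` message for the next turn. search_docs is a stand-in here.
import json

def search_docs(query: str) -> str:
    return f"No results for {query!r} (stub)"

REGISTRY = {"search_docs": search_docs}

def dispatch(name: str, arguments_json: str, call_id: str) -> dict:
    args = json.loads(arguments_json)          # model emits arguments as JSON
    result = REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": result}

# With the response above:
# tc = response.choices[0].message.tool_calls[0]
# messages.append(dispatch(tc.function.name, tc.function.arguments, tc.id))
```

This loop shape is identical across every server in this article, which is what makes them swappable behind the OpenAI-compatible API.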
Agent-specific gaps: macOS 15+ and Apple Silicon only. No KV quantization (vMLX has this). If you need to run the same stack on Linux or Windows, omlx doesn’t port.
vMLX / MLX Studio
Best for: highest MLX throughput, KV quantization, 256 concurrent sequences
vMLX is the least-known option here and arguably the most technically ambitious. It implements a 5-layer caching stack: prefix cache → paged KV → KV quantization (q4/q8) → continuous batching → persistent disk cache.
The KV quantization is the feature omlx lacks. Quantizing the KV cache to 4-bit roughly quadruples the number of cached sequences you can hold in a given RAM budget, which matters if you’re running multiple parallel agent instances.
The jang-q inference engine claims 256 concurrent sequences. MLX Studio is the closed-source desktop app that sits on top; the engine itself is Apache 2.0.
Agent-specific gaps: Smaller community, less documentation. The persistent disk cache is a warm-start prefill cache, not live block-level SSD eviction under pressure. Closer to llama.cpp’s slot save/restore than to omlx’s hot/cold tier. Installation is less streamlined than Ollama or omlx.
Comparison Table
| | Ollama | LM Studio | llama.cpp server | MLX-LM serve | omlx | vMLX |
|---|---|---|---|---|---|---|
| Backend | llama.cpp + MLX | llama.cpp + MLX | ggml/GGUF | MLX | MLX | MLX (jang-q) |
| Tool calling | Yes | Yes | Yes (Jinja2) | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes |
| Continuous batching | No | Yes (0.4.0+) | Yes (--parallel) | No | Yes | Yes |
| KV cache — RAM | Prefix caching | Prefix caching | Slot-based | Rotating | Paged (hot tier) | Paged + quantized |
| KV cache — SSD | No | No | Manual snapshots | No | Live eviction | Warm-start only |
| Survives restart | No | No | Manual | No | Yes (SSD tier) | Partial |
| macOS-native app | Optional | Yes | No | No | Yes (PyObjC) | Yes (MLX Studio) |
| Cross-platform | Yes | Yes | Yes | macOS/Linux | macOS only | macOS only |
| Open source | MIT | No | MIT | Apache 2.0 | Apache 2.0 | Engine: Apache 2.0 |
| GitHub stars | ~166k | — | ~100k | ~4.3k | ~7.3k | ~80 |
Choosing for Your Workload
Single agent, short sessions, cross-platform team: Ollama. It’s the default for a reason. The model library is comprehensive, the API is stable, and the community has solved most integration problems already.
High-throughput parallel agents on Mac: vMLX’s KV quantization and 256-sequence concurrency make it worth the setup friction if you’re running many parallel instances.
Long-running agents with stable system prompts on Mac: omlx. The SSD KV cache eliminates per-session prefill cost in a way no other tool matches. Claude Code users get the context-scaling integration for free.
Maximum control over quantization and batching parameters: llama.cpp server. Every knob is exposed. The --cache-type-k flag lets you trade cache precision for capacity at a level the higher-level wrappers don’t offer.
Non-technical team members who need to run agents locally: LM Studio. The GUI handles model management, and the API is identical to Ollama’s — existing agent code works without changes.
One pattern that works well: Ollama for development (fast iteration, easy model swaps), omlx or vMLX for production runs that need persistent context. Both expose the same OpenAI-compatible API, so the swap is a one-line base_url change.
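That swap can live in a single environment variable. A sketch of the pattern (the variable name and `BACKENDS` map are conventions for this example, not anything the servers define; ports come from the examples above):

```python
# Pick the serving backend via environment; agent code is otherwise
# identical because all of these servers speak the OpenAI API.
# AGENT_LLM_BACKEND is a naming convention for this sketch, not a standard.
import os

BACKENDS = {
    "ollama": "http://localhost:11434/v1",  # development: fast model swaps
    "omlx": "http://localhost:10240/v1",    # production: persistent KV cache
}

def base_url() -> str:
    """Resolve the serving backend from AGENT_LLM_BACKEND (default: ollama)."""
    return BACKENDS[os.environ.get("AGENT_LLM_BACKEND", "ollama")]

# client = OpenAI(base_url=base_url(), api_key="local")
```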