The Stack: Apple Silicon Local LLM Servers for Running Agents
Ollama, LM Studio, omlx, llama.cpp, MLX-LM, vMLX — compared on the specific requirements of local agent workloads on Apple Silicon.
Running an agent loop locally is harder than running a chat session. Agents make dozens of sequential LLM calls, each building on the last. Tool call parsing has to be reliable. Long context (accumulated tool outputs, intermediate reasoning, retrieved documents) has to be handled without ballooning TTFT on every turn. And the server has to stay alive and consistent across a multi-minute run without drifting or evicting state mid-task.
Most local LLM servers were built for the chat use case. They work for agents, but with friction.
What Agent Workloads Demand
| Requirement | Why it matters for agents |
|---|---|
| Tool/function calling | Agent frameworks (Claude Code, LangGraph, smolagents) rely on structured tool call responses |
| Streaming | Long tool outputs need incremental delivery; blocking until completion kills responsiveness |
| Consistent prefix caching | The system prompt + prior tool calls should be cached; re-prefilling them every turn is expensive |
| Low TTFT at long context | Turn 10 of an agent loop may have 8K tokens of prior context; TTFT compounds across turns |
| Concurrent requests | Some frameworks issue parallel tool calls; a single-queue server serializes them |
| Stable tool call parsing | Malformed JSON in a tool call response terminates the agent run |
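Why prefix caching and low TTFT compound: every turn re-sends the entire transcript, so without a warm prefix the server re-prefills a growing context on each call. A toy cost model (token counts are illustrative round numbers, and the cached case assumes the system prompt is already warm, e.g. from a prior turn):

```python
# Illustrative: cumulative prefill cost of an agent loop with and
# without prefix caching. Token counts are made-up round numbers.

SYSTEM_TOKENS = 2000  # tool definitions + instructions, same every turn
TURN_TOKENS = 600     # one tool call + its output, appended per turn

def prefill_tokens(turns: int, prefix_cached: bool) -> int:
    """Total tokens prefilled across a run of `turns` calls."""
    total = 0
    for t in range(1, turns + 1):
        context = SYSTEM_TOKENS + t * TURN_TOKENS  # full transcript this turn
        if prefix_cached:
            total += TURN_TOKENS  # only the new suffix is prefilled
        else:
            total += context      # everything is re-prefilled
    return total

print(prefill_tokens(10, prefix_cached=False))  # 53000
print(prefill_tokens(10, prefix_cached=True))   # 6000
```

Nearly a 9x difference in prefill work over a ten-turn run, and the gap widens with more turns. This is the single biggest lever the servers below differ on.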
The Servers
Ollama
Best for: quick setup, broad model compatibility, cross-platform teams
Ollama is the easiest entry point and has the largest model library. It wraps llama.cpp (with an optional MLX runner) behind a clean REST API and a CLI that feels like docker pull for models.
Tool calling works for models with built-in function calling templates (Llama 3.1+, Qwen2.5, Mistral Nemo). Ollama handles the formatting automatically if the model’s Modelfile has the right template; no hand-crafted tool call JSON needed.
```bash
ollama pull qwen2.5:7b
ollama serve  # starts at localhost:11434
```
The OpenAI-compatible endpoint at /v1 works with LangChain, LlamaIndex, and most agent frameworks without modification:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
```
Agent-specific gaps: Requests are processed sequentially per model with no continuous batching. Parallel tool calls from a framework that issues concurrent requests will queue. Prefix caching is in-memory only; restarting Ollama between sessions means the next agent run pays full prefill cost again. For single-agent runs on a local machine, neither of these matters. For multi-agent pipelines or long-running sessions, they do.
LM Studio
Best for: non-technical users, model discovery, Windows/Linux/Mac parity
LM Studio is the most polished desktop experience. Model downloads, quantization selection, and context length tuning are all in a GUI. The lms CLI and local server expose an OpenAI-compatible API identical to Ollama’s.
Continuous batching arrived in 0.4.0 (llama.cpp backend) and 0.4.2 (MLX backend), making it viable for parallel tool calls. Context shift on Llama models is handled automatically in the GUI, though the behavior in server mode requires manual configuration.
Agent-specific gaps: The server is tightly coupled to the GUI process; running headless on a remote Mac is possible but awkward. No persistent KV cache across sessions. Closed source, so you can’t inspect or modify the batching behavior.
llama.cpp server
Best for: maximum control, cross-platform deployment, GGUF ecosystem
llama-server (the HTTP server in llama.cpp) is what Ollama wraps, exposed directly. Every knob is accessible: context length, KV quantization type, batch size, number of parallel sequences, slot management.
Tool calling via Jinja2 templates works for any model with a chat template. The --jinja flag enables automatic tool call formatting from the model’s own template.
```bash
llama-server \
  --model ./Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --parallel 4 \
  --cache-type-k q8_0 \
  --port 8080
```
--parallel 4 enables 4 concurrent sequences. --cache-type-k q8_0 quantizes the KV cache to 8-bit, roughly halving its memory footprint and extending the effective context you can keep warm simultaneously.
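To see where that saving comes from, the KV cache size can be estimated from the model's attention geometry. A back-of-envelope calculation for Qwen2.5-7B (28 layers, 4 KV heads via GQA, head dim 128, per the model card), treating q8_0 as roughly 1 byte per element (slightly more in practice due to block scales):

```python
# Rough KV-cache sizing for Qwen2.5-7B at 32K context.
# Model shape (from the model card): 28 layers, 4 KV heads (GQA), head dim 128.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128
CTX = 32768

def kv_cache_bytes(bytes_per_elem: float) -> float:
    # 2x for the K and V tensors, per layer, per KV head, per head dim, per token
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CTX * bytes_per_elem

print(kv_cache_bytes(2.0) / 2**30)  # f16:  1.75 GiB
print(kv_cache_bytes(1.0) / 2**30)  # q8_0: 0.875 GiB (ignoring scale overhead)
```

At 4 parallel sequences those numbers multiply, which is why KV quantization is often the difference between fitting a multi-agent workload in unified memory or not.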
KV slot save/restore (--slot-save-path) writes cache snapshots to disk, but it’s a manual operation: call the /slots API to save and restore. Not seamless, but available.
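A sketch of what that manual operation looks like, using only the standard library. The endpoint shape (`POST /slots/{id}?action=save|restore` with a JSON `filename` body, resolved against `--slot-save-path`) is based on the llama-server docs; verify it against your build, as the API has changed between versions:

```python
# Sketch: saving/restoring a llama.cpp server KV slot via the /slots API.
# Endpoint shape per llama-server docs; confirm against your build's version.
import json
import urllib.request

BASE = "http://localhost:8080"

def slot_action_url(slot_id: int, action: str) -> str:
    return f"{BASE}/slots/{slot_id}?action={action}"

def slot_request(slot_id: int, action: str, filename: str) -> urllib.request.Request:
    # Filenames resolve relative to the directory given via --slot-save-path.
    body = json.dumps({"filename": filename}).encode()
    return urllib.request.Request(
        slot_action_url(slot_id, action),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(slot_request(0, "save", "agent-prefix.bin"))
# ... restart the server ...
# urllib.request.urlopen(slot_request(0, "restore", "agent-prefix.bin"))
```

Wiring this into an agent harness (save after the system prompt is prefilled, restore at session start) is left to you, which is exactly the "not seamless" part.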
Agent-specific gaps: Metal acceleration on Apple Silicon works, but the MLX-based servers run 1.5–2x faster on the same hardware for the same model size. If you're Mac-only, llama.cpp trades away performance for portability you don't need.
MLX-LM serve
Best for: baseline MLX serving, Apple-maintained, minimal dependencies
mlx_lm.server is the official server for MLX models and the reference implementation: correct and maintained, but not production-hardened.
```bash
pip install mlx-lm
mlx_lm.server --model mlx-community/Qwen2.5-7B-Instruct-4bit --port 8080
```
OpenAI-compatible. Tool calling works if the model supports it. Rotating fixed-size KV cache handles context overflow by dropping old tokens. Fine for chat, but disruptive for agent loops where early tool call results matter.
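The failure mode is easy to picture with a bounded buffer: once the window fills, the oldest entries are silently dropped, and in an agent loop the oldest entries are often the turn-1 tool results the final answer depends on. A trivial illustration (an 8-token window standing in for the real cache):

```python
# Why a rotating fixed-size KV cache hurts agent loops: once the window
# fills, the oldest entries (often early tool results) are silently dropped.
from collections import deque

cache = deque(maxlen=8)  # illustrative window of 8 tokens
for tok in range(12):
    cache.append(tok)
print(list(cache))  # [4, 5, 6, 7, 8, 9, 10, 11] — tokens 0–3 are gone
```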
Agent-specific gaps: No continuous batching, no SSD KV tier, no admin UI. Single-request serving. For development and testing it’s perfectly adequate. For a real agent workload running overnight, you want more.
omlx
Best for: Apple Silicon-only teams, persistent context, Claude Code integration
omlx wraps MLX with a production server layer: continuous batching, multi-model loading, web admin dashboard, and a two-tier SSD KV cache.
The SSD cache matters specifically for agents with long, stable system prompts. An agent's system prompt (tool definitions, instructions, persona) is typically 1,000–3,000 tokens and identical across every call in a run. With omlx, that prefix is cached as block-hashed pages in RAM, spills to SSD under memory pressure, and survives server restarts. Turn 15 of an agent loop pays the same TTFT as turn 1.
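The block-hashing idea is worth pausing on, because it's what makes the cache prefix-shareable rather than per-request. A conceptual sketch (block size and hash choice here are illustrative, not omlx internals): split the token sequence into fixed blocks and key each block by a hash chained over everything up to and including it, so two requests with identical prefixes map to identical page keys.

```python
# Conceptual sketch of block-hashed prefix caching: key each fixed-size
# block of tokens by a hash chained over the whole prefix so far, so
# identical prefixes resolve to identical cache pages.
# Block size and hashing are illustrative, not omlx internals.
import hashlib

BLOCK = 256  # tokens per cache page (illustrative)

def block_keys(token_ids: list[int]) -> list[str]:
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest()[:16])
    return keys

# Two runs sharing a 512-token system prompt share their first two page
# keys, so those pages are served from RAM (or SSD) instead of re-prefilled.
a = block_keys(list(range(512)) + [1, 2, 3] * 100)
b = block_keys(list(range(512)) + [9, 9, 9] * 100)
assert a[:2] == b[:2] and a[2] != b[2]
```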
Tool calling and streaming both work via the OpenAI-compatible API. The Claude Code integration is explicit: omlx has a context-scaling feature that adjusts token budget reporting so smaller MLX models work correctly with Claude Code’s auto-compact behavior.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10240/v1", api_key="none")
response = client.chat.completions.create(
    model="mlx-community/Qwen2.5-7B-Instruct-4bit",
    tools=[{
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search internal documentation",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
    messages=[{"role": "user", "content": "Find our refund policy"}],
)
```
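The response carries `tool_calls` that the agent has to execute locally and feed back as `tool` messages. A minimal dispatch step, with a stub `search_docs` standing in for a real retrieval function:

```python
# Minimal dispatch step: execute the model's tool call locally and build
# the `tool` message for the next turn. search_docs is a stand-in here.
import json

def search_docs(query: str) -> str:
    return f"No results for {query!r} (stub)"

REGISTRY = {"search_docs": search_docs}

def dispatch(name: str, arguments_json: str, call_id: str) -> dict:
    args = json.loads(arguments_json)          # model emits arguments as JSON
    result = REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": result}

# With the response above:
# tc = response.choices[0].message.tool_calls[0]
# messages.append(dispatch(tc.function.name, tc.function.arguments, tc.id))
```

This loop shape is identical across every server in this article, which is what makes them swappable behind the OpenAI-compatible API.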
Agent-specific gaps: macOS 15+ and Apple Silicon only. No KV quantization (vMLX has this). If you need to run the same stack on Linux or Windows, omlx doesn’t port.
vMLX / MLX Studio
Best for: highest MLX throughput, KV quantization, 256 concurrent sequences
vMLX is the least-known option here and arguably the most technically ambitious. It implements a 5-layer caching stack: prefix cache → paged KV → KV quantization (q4/q8) → continuous batching → persistent disk cache.
The KV quantization is the feature omlx lacks. Quantizing the KV cache to 4-bit roughly quadruples the number of cached sequences you can hold in a given RAM budget, which matters if you’re running multiple parallel agent instances.
The jang-q inference engine claims 256 concurrent sequences. MLX Studio is the closed-source desktop app that sits on top; the engine itself is Apache 2.0.
Agent-specific gaps: Smaller community, less documentation. The persistent disk cache is a warm-start prefill cache, not live block-level SSD eviction under pressure. Closer to llama.cpp’s slot save/restore than to omlx’s hot/cold tier. Installation is less streamlined than Ollama or omlx.
Comparison Table
| | Ollama | LM Studio | llama.cpp server | MLX-LM serve | omlx | vMLX |
|---|---|---|---|---|---|---|
| Backend | llama.cpp + MLX | llama.cpp + MLX | ggml/GGUF | MLX | MLX | MLX (jang-q) |
| Tool calling | Yes | Yes | Yes (Jinja2) | Yes | Yes | Yes |
| Streaming | Yes | Yes | Yes | Yes | Yes | Yes |
| Continuous batching | No | Yes (0.4.0+) | Yes (--parallel) | No | Yes | Yes |
| KV cache — RAM | Prefix caching | Prefix caching | Slot-based | Rotating | Paged (hot tier) | Paged + quantized |
| KV cache — SSD | No | No | Manual snapshots | No | Live eviction | Warm-start only |
| Survives restart | No | No | Manual | No | Yes (SSD tier) | Partial |
| macOS-native app | Optional | Yes | No | No | Yes (PyObjC) | Yes (MLX Studio) |
| Cross-platform | Yes | Yes | Yes | macOS/Linux | macOS only | macOS only |
| Open source | MIT | No | MIT | Apache 2.0 | Apache 2.0 | Engine: Apache 2.0 |
| GitHub stars | ~166k | — | ~100k | ~4.3k | ~7.3k | ~80 |
Choosing for Your Workload
Single agent, short sessions, cross-platform team: Ollama. It’s the default for a reason. The model library is comprehensive, the API is stable, and the community has solved most integration problems already.
High-throughput parallel agents on Mac: vMLX’s KV quantization and 256-sequence concurrency make it worth the setup friction if you’re running many parallel instances.
Long-running agents with stable system prompts on Mac: omlx. The SSD KV cache eliminates per-session prefill cost in a way no other tool matches. Claude Code users get the context-scaling integration for free.
Maximum control over quantization and batching parameters: llama.cpp server. Every knob is exposed. The --cache-type-k flag lets you trade cache precision for capacity at a level the higher-level wrappers don’t offer.
Non-technical team members who need to run agents locally: LM Studio. The GUI handles model management, and the API is identical to Ollama’s — existing agent code works without changes.
One pattern that works well: Ollama for development (fast iteration, easy model swaps), omlx or vMLX for production runs that need persistent context. Both expose the same OpenAI-compatible API, so the swap is a one-line base_url change.
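That swap can live in a single environment variable. A sketch of the pattern (the variable name and `BACKENDS` map are conventions for this example, not anything the servers define; ports come from the examples above):

```python
# Pick the serving backend via environment; agent code is otherwise
# identical because all of these servers speak the OpenAI API.
# AGENT_LLM_BACKEND is a naming convention for this sketch, not a standard.
import os

BACKENDS = {
    "ollama": "http://localhost:11434/v1",  # development: fast model swaps
    "omlx": "http://localhost:10240/v1",    # production: persistent KV cache
}

def base_url() -> str:
    """Resolve the serving backend from AGENT_LLM_BACKEND (default: ollama)."""
    return BACKENDS[os.environ.get("AGENT_LLM_BACKEND", "ollama")]

# client = OpenAI(base_url=base_url(), api_key="local")
```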