Local LLM on a $550 AMD Mini PC: 28B Models at 20 tok/s
AMD 780M iGPU + 64GB DDR5 runs Gemma 4 28B at 19.5 tok/s. Setup guide, benchmarks, and cost breakdown vs. Mac Mini for local LLM inference under $600.
A post by Base Camp Bernie went viral: a MinisForum UM790 Pro running Gemma 4 28B at 19.5 tok/s, Qwen3.5-32B at 20.8 tok/s, 196k context window. The hardware is a $300 mini PC with an integrated GPU using system RAM as VRAM.
The setup: AMD Ryzen 9 7940HS with a Radeon 780M iGPU, 64GB DDR5-5600, llama.cpp compiled with Vulkan. This post covers why this works at a technical level, what the benchmark numbers actually mean, the full software setup, and an honest cost breakdown.
Quick answer: The AMD Radeon 780M has no dedicated VRAM. It shares DDR5 system RAM with the CPU, so 64GB of installed RAM becomes 64GB of effective “VRAM.” MoE models like Gemma 4 28B only activate ~4–5B parameters per token despite having 28B total, matching DDR5’s bandwidth ceiling to the task. The result is 19–22 tok/s at ~$550 all-in, usable for local coding assistants, agent loops, and long-context work without a cloud API.
Table of Contents
- Why the 780M Works: No VRAM Boundary
- Why Dense 28B Gets 4 tok/s and MoE 28B Gets 20 tok/s
- Benchmark Numbers
- How Much RAM Do You Need for Local LLM Inference?
- Architecture: The iGPU Inference Path
- Software Setup
- Agentic Workflows
- Real Cost
- Against the Alternatives
- Limitations
- Who This Is For
- FAQ
Why the 780M Works: No VRAM Boundary
To understand why a cheap mini PC can run 28B parameter models, you need to understand the usual bottleneck: VRAM.
When an LLM generates a token, it has to read all of its weights from memory to do the math. A 28B model at 4-bit quantization takes up about 14GB. For that to run on a GPU, the entire 14GB needs to fit inside the GPU’s dedicated video memory (VRAM). An RTX 3090 has 24GB of VRAM, so 28B fits. An RTX 3060 has 12GB, so it doesn’t.
The AMD Radeon 780M breaks this rule because it has no dedicated VRAM at all.
The 780M is an integrated GPU (iGPU) built into the same chip as the CPU. There’s no separate memory chip. Instead, both the CPU and the GPU share one unified pool of DDR5 system RAM. Install 64GB of DDR5 and the GPU can use all 64GB as its “VRAM.” Install 96GB and it can use that. There’s no hard ceiling beyond what your motherboard supports.
This is the same architectural insight behind Apple Silicon’s success for local AI. Apple M-series chips use the same unified memory design. The difference is that Apple’s version runs at 300–546 GB/s, while DDR5-5600 in dual-channel delivers ~80 GB/s. Apple’s is faster, but AMD’s is available on cheap x86 hardware that runs Linux and Windows natively.
The memory bandwidth math:
Memory bandwidth is measured in GB per second: how much data can move between memory and the processor per second. For LLM inference, this is the core constraint. Every token generated requires reading the entire set of active model weights. Faster memory bandwidth = more tokens per second.
| Memory Type | Bandwidth | Common Hardware |
|---|---|---|
| DDR5-5600 dual-channel (780M) | ~80 GB/s | AMD Ryzen mini PCs |
| GDDR6X (RTX 3090) | ~936 GB/s | Discrete GPU desktop |
| LPDDR5X (Apple M4 Max) | ~546 GB/s | Mac Mini, MacBook Pro |
| GDDR6 (RTX 4070) | ~504 GB/s | Discrete GPU desktop |
The 780M has by far the lowest bandwidth in this table, yet it still generates at ~20 tok/s on 28B models. The answer is in the model architecture.
The software unlock is Vulkan. AMD’s ROCm stack (their CUDA equivalent) targets discrete graphics cards and doesn’t recognize iGPUs. llama.cpp’s Vulkan backend works on any GPU that supports the Vulkan graphics API, including integrated ones. Without Vulkan, llama.cpp falls back to CPU-only inference at roughly 3–4x slower speeds. With Vulkan enabled at build time, the 780M’s 12 RDNA 3 compute units handle the matrix multiplications for every token generated.
Why Dense 28B Gets 4 tok/s and MoE 28B Gets 20 tok/s
Here’s the core formula for LLM token generation speed on memory-bandwidth-limited hardware:
tokens per second ≈ memory bandwidth / bytes read per token
bytes read per token ≈ active parameters × bytes per weight
For a dense model, “active parameters” equals all parameters. Every single weight is read for every single token. A dense 27B model at Q4_K quantization (4 bits per weight, so 0.5 bytes per weight) uses ~14GB of storage. Each token reads that entire 14GB:
14 GB ÷ 80 GB/s ≈ 0.175 seconds per token → ~5.7 tok/s theoretical
Real: 4–5 tok/s after system overhead
That lines up closely with the measured numbers: dense Qwen3.5-27B at Q4_K hits 5.8 tok/s. At Q8 (8-bit quantization, 1 byte per weight, 27GB total) it drops to 2.8 tok/s. The model got bigger, so each token reads more data, so it runs slower. The relationship is direct and predictable.
Mixture of Experts (MoE) breaks this relationship.
A dense transformer has one set of parameters that processes every token. A MoE transformer instead has many parallel sets of “expert” networks, and a small routing layer that decides which experts to activate for each token.
Gemma 4 28B has 28B total parameters, but those are split across roughly 64 expert blocks. Each token activates only 2–4 of those 64 experts. In practice, only about 4–5B parameters are active for any given token. The other 23B sit in RAM untouched.
At Q4, ~4–5B active parameters is about 2–2.5 GB of expert weights; always-active shared layers, routing, and KV-cache reads push the effective figure per token closer to 4 GB:
4 GB effective ÷ 80 GB/s ≈ 0.05 seconds per token → ~20 tok/s
That’s the 19.5 tok/s measured result. Five times faster than a dense model with the same parameter count, because each token reads five times less data. The full 28B is there for quality (the routing mechanism draws from a diverse pool of specialized experts), but the bandwidth cost per token is that of a 4B model.
This is why “28B MoE” and “28B dense” are not equivalent statements when it comes to inference hardware requirements.
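The dense and MoE calculations above collapse into one helper. A minimal sketch, treating decoding as purely bandwidth-bound (the theoretical ceiling; measured numbers land somewhat below it after system overhead):

```python
# Bandwidth-bound decoding estimate: tok/s = bandwidth / GB read per token.
# Theoretical ceiling only; real hardware lands a bit lower.

DDR5_DUAL_CHANNEL_GB_S = 80.0  # DDR5-5600, two channels

def decode_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Theoretical tokens/second when every token streams this many GB."""
    return bandwidth_gb_s / gb_read_per_token

# Dense 28B at Q4: all ~14 GB of weights read for every token.
dense = decode_tok_s(DDR5_DUAL_CHANNEL_GB_S, 14.0)  # ~5.7 tok/s
# MoE 28B: ~4 GB effective read per token (active experts + shared layers).
moe = decode_tok_s(DDR5_DUAL_CHANNEL_GB_S, 4.0)     # 20.0 tok/s

print(f"dense: {dense:.1f} tok/s, MoE: {moe:.1f} tok/s")
```

Swap in 936 GB/s for the numerator and the same formula predicts why an RTX 3090 runs the dense model an order of magnitude faster.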
Prefill vs. generation:
Token generation (decoding) is memory-bandwidth-bound, as described above. But loading your prompt into context (prefill) works differently. During prefill, the model processes all your input tokens simultaneously in parallel batches, which saturates the GPU’s compute cores rather than its memory bus. Gemma 4 28B hits ~110 tok/s during prefill on this setup. A 10,000-token document takes about 90 seconds to load. A 100k-token context takes roughly 15 minutes. Once loaded, generation runs at full 19.5 tok/s speed. Cold-start on long contexts is slow; steady-state generation is not.
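The two phases combine into a simple end-to-end latency estimate. A sketch using the community-measured rates from this post (not guarantees; both rates vary with context length and drivers):

```python
# Two-phase latency model: prefill is compute-bound at one rate,
# decoding bandwidth-bound at another. Rates are the measured
# community numbers for Gemma 4 28B on the 780M.

PREFILL_TOK_S = 110.0  # prompt ingestion
DECODE_TOK_S = 19.5    # steady-state generation

def response_time_s(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds until the full reply finishes."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S

# 10k-token document plus a 500-token summary:
print(f"{response_time_s(10_000, 500):.0f} s")  # ~117 s, most of it prefill
```

The asymmetry is the point: doubling the reply length adds ~26 seconds, but doubling the prompt adds ~91.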
Benchmark Numbers
Measured on UM790 Pro with 64GB DDR5-5600, llama.cpp + Vulkan:
| Model | Quant | Size | Prefill (t/s) | Generate (t/s) | Context |
|---|---|---|---|---|---|
| Gemma 4 28B MoE | Q4_0 | 14 GB | ~110 | 19.5 | 196k |
| Gemma 4 28B MoE | Q4_K_XL | 18 GB | ~80 | ~15 | 196k |
| Gemma 4 4B | Q4 | ~3 GB | — | 21.7 | — |
| Qwen3.5-27B (dense) | Q4_K | 14 GB | 32.3 | 5.8 | — |
| Qwen3.5-32B (dense) | Q4_K | 24 GB | — | 2.8 | — |
| Qwen3.5-32B-A3B (MoE) | — | — | — | 20.8 | — |
| Nemotron-Cascade | — | — | — | 24.8 | — |
Nemotron-Cascade is a dual-mode model that chains a 4B and a 56B network: routine tokens take the 4B path, and complex reasoning escalates to the 56B one. The 24.8 tok/s average reflects most tokens routing through the small model.
Numbers are community-measured, not official benchmarks. Your results will vary by context length, quantization, and driver version.
Speculative decoding also applies on this hardware. Running Gemma 4 2B as a draft model for Gemma 4 28B gives roughly 30–40% throughput improvement, depending on draft acceptance rate. The small model predicts several tokens ahead; the large model verifies them in a single pass. When the draft is right (often, for common continuations), multiple tokens are accepted per forward pass through the large model.
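The expected gain from speculative decoding can be modeled. A toy sketch, assuming each draft token is accepted independently with probability p (a simplification; real acceptance is correlated) and with illustrative parameter values chosen to land in the ~30% regime reported above:

```python
# Toy throughput model for speculative decoding.
# Assumes i.i.d. acceptance of draft tokens -- an idealization.

def tokens_per_verify_pass(p: float, k: int) -> float:
    """Expected tokens emitted per large-model pass, given a k-token
    draft and per-token acceptance probability p."""
    return sum(p**i for i in range(k + 1))  # = (1 - p^(k+1)) / (1 - p)

def speedup(p: float, k: int, draft_cost: float) -> float:
    """Throughput vs. plain decoding; draft_cost is the draft model's
    per-token cost relative to one large-model pass."""
    return tokens_per_verify_pass(p, k) / (1 + k * draft_cost)

# 3-token drafts, 50% acceptance, draft ~15% the cost of a large pass:
print(f"{speedup(0.5, 3, 0.15):.2f}x")  # ~1.29x, i.e. the ~30% regime
```

Higher acceptance rates (common when the draft and target share a tokenizer and training data, as with Gemma 4 2B/28B) push the gain toward the 40% end.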
Recommended Models for AMD 780M
These models are well-matched to the 780M’s bandwidth profile. MoE architectures are the priority; dense models above 13B are listed for completeness but are too slow for interactive use.
| Model | Disk Size | Generate (tok/s) | RAM Needed | Best Use |
|---|---|---|---|---|
| Gemma 4 28B Q4_0 | 14 GB | ~20 | 32GB+ | Coding, reasoning, 196k context |
| Qwen3.5-32B-A3B Q4 (MoE) | ~18 GB | ~21 | 32GB+ | Math, multilingual, coding |
| Gemma 4 4B Q8 | ~4 GB | ~40 | 8GB+ | Fast chat, speculative draft |
| Nemotron-Cascade Q4 | ~20 GB | ~25 | 32GB+ | Auto-routes simple/complex queries |
| Qwen2.5-14B Q6 | ~11 GB | ~18 | 16GB+ | General use, dense but fast enough |
| Mistral Nemo 12B Q8 | ~12 GB | ~18 | 16GB+ | Coding assistant, low latency |
For a local coding assistant, Gemma 4 28B Q4_0 is the current best-in-class option on this hardware. Qwen3.5-32B-A3B is the best alternative if you need stronger math or multilingual capability. Both are MoE and hit the 80 GB/s DDR5 bandwidth ceiling efficiently.
How Much RAM Do You Need for Local LLM Inference?
On unified memory hardware like the 780M, RAM is VRAM. The model must fully fit in RAM for efficient inference. Anything that spills to NVMe runs at SSD speeds (~3–7 GB/s), making inference 10–20x slower.
| Model Size | Min RAM | Recommended | Notes |
|---|---|---|---|
| 3B–7B dense | 8 GB | 16 GB | Fast at any DDR5 speed; good for edge |
| 13B dense | 16 GB | 32 GB | ~15–18 tok/s on 780M with 32GB DDR5 |
| 28B MoE (Gemma 4, Qwen3.5) | 32 GB | 64 GB | 32GB loads model; 64GB adds context headroom |
| 32B dense | 48 GB | 64 GB | Dense 32B = ~3 tok/s — use MoE equivalent instead |
| 70B MoE | 64 GB | 96 GB | ~10–12 tok/s on 780M; fits in 64GB at Q4 |
Why 64GB for the 28B use case: Gemma 4 28B Q4_0 takes up 14GB. At a 32k context, the KV cache adds another 4–8GB. OS and background processes take 4–8GB. That puts a 32GB system at the edge of comfortable operation. 64GB gives breathing room for 128k+ context windows and parallel processes.
The practical minimum for a useful local AI box is 32GB DDR5. 16GB limits you to 7B models and short contexts. 64GB is the target for 28B MoE models with long contexts.
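The budget in the table reduces to simple addition, plus one ratio for the NVMe-spill penalty. A sketch using this post's rough figures (KV-cache and OS overheads vary by context length and distro):

```python
# Back-of-envelope RAM budget for the 28B case, plus the penalty
# for letting weights spill to SSD.

DDR5_GB_S = 80.0  # dual-channel DDR5-5600
NVME_GB_S = 5.0   # mid-range PCIe 4.0 SSD, sequential read

def ram_needed_gb(model_gb: float, kv_cache_gb: float,
                  os_gb: float = 6.0) -> float:
    """Total RAM to hold model + context + system comfortably."""
    return model_gb + kv_cache_gb + os_gb

def spill_slowdown() -> float:
    """Each token re-reads the active weights, so the slowest
    memory tier sets the pace."""
    return DDR5_GB_S / NVME_GB_S

print(ram_needed_gb(14, 6))        # 26.0 GB: 32GB is tight, 64GB comfortable
print(f"{spill_slowdown():.0f}x")  # 16x slower if weights stream from NVMe
```

The 16x figure is why "almost fits in RAM" behaves like "does not fit": one spilled expert block drags every token that routes through it down to SSD speed.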
Architecture: The iGPU Inference Path
llama.cpp uses Vulkan compute shaders to run matrix multiplications on the 780M. The weights stay in DDR5 and are accessed by the iGPU’s memory controller through the unified address space.
One setup detail: the BIOS VRAM allocation setting (which reserves a fixed slice of RAM for the iGPU) caps at 8–16GB by default. llama.cpp with Vulkan can exceed this by using GTT (Graphics Translation Table) allocation, which lets the Mesa RADV driver page additional system RAM into the GPU’s address space. Setting the RADV_PERFTEST environment variable before running llama.cpp enables GTT-related driver fixes in the RADV stack and is commonly needed for models larger than the BIOS VRAM allocation.
Software Setup
Requirements: llama.cpp built with Vulkan (not ROCm), GGUF model files, 64GB DDR5 for 28B MoE.
Build llama.cpp with Vulkan support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=on
cmake --build build --config Release -j$(nproc)
Download Gemma 4 28B:
pip install huggingface_hub
# VERIFY: repo and filename — check HuggingFace for the actual GGUF upload path
huggingface-cli download google/gemma-4-27b-it-GGUF \
gemma-4-27b-it-Q4_0.gguf \
--local-dir ./models
Run with GPU offload:
# VERIFY: exact RADV_PERFTEST value — check Mesa RADV docs or llama.cpp AMD wiki for current recommendation
RADV_PERFTEST=gtt ./build/bin/llama-cli \
-m ./models/gemma-4-27b-it-Q4_0.gguf \
--n-gpu-layers 99 \
--cache-type-k q8 \
--threads 8 \
-c 32768 \
-p "Your prompt here"
--n-gpu-layers 99 offloads all layers to Vulkan. Without it, inference runs on CPU only at ~5 tok/s for a 28B model. --cache-type-k q8 uses 8-bit KV cache quantization, reducing memory pressure for longer contexts.
Ollama handles the Vulkan backend automatically on AMD hardware with recent Mesa drivers:
curl -fsSL https://ollama.ai/install.sh | sh
ollama run gemma4:28b
LM Studio’s Vulkan support also works and provides a GUI for model management and context configuration.
Ollama vs llama.cpp vs LM Studio on AMD iGPU
| Tool | AMD iGPU Support | Setup | Best For |
|---|---|---|---|
| llama.cpp (Vulkan) | Full, manual tuning | Medium | Max performance, RADV flags, scripting |
| Ollama | Auto-detects Vulkan | Easy | Quick start, REST API, model management |
| LM Studio | Vulkan | Easy | GUI, Windows/Linux/Mac, context tuning |
| Jan.ai | Vulkan | Easy | GUI with extension ecosystem |
| KoboldCpp | Vulkan | Medium | Long-context creative/roleplay workloads |
For AMD iGPU specifically: llama.cpp gives the highest throughput because you can pass the RADV_PERFTEST flag and tune --cache-type-k directly. Ollama is the fastest path to a working setup and runs a local REST API on port 11434, compatible with OpenAI’s API format. LM Studio is the best choice on Windows where environment variable tuning is inconvenient.
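Ollama's port-11434 API speaks the OpenAI chat format, so any OpenAI-style client works against it. A minimal stdlib-only sketch, assuming the gemma4:28b tag from the install step above is already pulled (adjust to whatever `ollama list` shows):

```python
# Minimal client for Ollama's OpenAI-compatible endpoint.
# Assumes Ollama is running locally with the "gemma4:28b" model pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model: str, prompt: str) -> bytes:
    """OpenAI-format chat request as JSON bytes."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def ask(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("gemma4:28b", "Write a one-line Python hello world."))
```

Because the format matches OpenAI's, existing tooling can usually be pointed at the local box by changing only the base URL.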
Agentic Workflows
For long-running agent loops, the local setup removes the three biggest frictions: API costs, rate limits, and per-request latency from network round-trips.
A Hermes-style agent with 40+ tools (shell execution, filesystem ops, HTTP fetches, search) runs entirely against the local Gemma 4 28B with no API keys needed. The 196k context window handles deep multi-tool conversations that would be expensive at cloud API pricing. The per-token cost is zero after hardware purchase.
At 19.5 tok/s, Gemma 4 28B is fast enough for interactive use in an agent loop. Response latency for a 200-token reply is about 10 seconds, comparable to cloud API calls under moderate load. For batch tasks running overnight, the hardware simply runs unattended.
Dense models above ~13B are not viable for interactive agentic use on this hardware. A dense 28B at 4–5 tok/s produces output too slowly for iteration-heavy workflows.
Real Cost
The “$300” headline is the base unit price, not the full 64GB config needed to run 28B models.
| Component | Cost (USD) |
|---|---|
| MinisForum UM790 Pro (base config) | ~$300–350 |
| 64GB DDR5-5600 SO-DIMM (2×32GB) | ~$120–180 |
| 1TB NVMe SSD (model storage) | ~$60–80 |
| Total | ~$480–610 |
The original post’s “$305” figure caused some confusion: the screenshots were priced in Canadian dollars. At current exchange rates, $300 CAD is roughly $215 USD. The UM790 Pro in various configurations spans $280–380 USD depending on bundled RAM and sale timing.
DDR5 SO-DIMM pricing has risen 3–4x from its 2025 lows per community reports. A 32GB config (2×16GB) costs around $50–70 and still loads Gemma 4 28B Q4_0 (needs ~14GB), but leaves little headroom for context or concurrent processes.
Local AI vs Cloud API: Break-Even Analysis
At ~$550 all-in, the AMD box pays for itself when cloud API costs exceed that threshold. Break-even depends entirely on usage volume.
| Monthly Token Usage | Cloud Cost (Claude Sonnet ~$3/1M in, $15/1M out) | Break-Even |
|---|---|---|
| 1M output tokens/mo | ~$15/mo | ~37 months |
| 5M output tokens/mo | ~$75/mo | ~7 months |
| 20M output tokens/mo | ~$300/mo | ~2 months |
| Agentic loop (100M+/mo) | ~$1,500/mo | <1 month |
The break-even is fast for agentic workloads. A coding agent that runs tests, reads files, and iterates on code generates tokens quickly. 100M tokens per month is plausible for a developer’s daily driver. At $1,500/month in API costs, $550 hardware pays for itself in under two weeks.
For light or infrequent use, cloud APIs are cheaper. The local hardware case strengthens as token volume grows, as privacy requirements tighten, or as you need offline operation.
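The break-even table is one division. A sketch using the Claude Sonnet-class output price quoted above; it ignores input-token costs, which makes the estimates slightly optimistic for the cloud side:

```python
# Break-even calculation behind the table above.
# Output-token pricing only; input tokens would shorten break-even further.

HARDWARE_COST = 550.0       # USD, all-in AMD box
OUTPUT_PRICE_PER_M = 15.0   # USD per 1M output tokens, Sonnet-class

def break_even_months(output_tokens_per_month_m: float) -> float:
    """Months of cloud spend equal to the hardware's one-time cost."""
    return HARDWARE_COST / (output_tokens_per_month_m * OUTPUT_PRICE_PER_M)

for volume in (1, 5, 20, 100):
    print(f"{volume:>4}M tok/mo -> {break_even_months(volume):5.1f} months")
```

The function is deliberately naive: it prices electricity, depreciation, and your setup time at zero, and cloud quality at parity. Adjust the constants to taste; the shape of the curve is what matters.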
Against the Alternatives
| Option | Est. Cost | RAM | 28B MoE tok/s | Dense 13B tok/s | Notes |
|---|---|---|---|---|---|
| AMD 780M mini PC (64GB DDR5) | ~$550 | 64GB | ~20 | ~15 | Vulkan only; no CUDA |
| Mac Mini M4 (16GB) | $599 | 16GB | ~25 | ~22 | 16GB won’t comfortably fit 28B Q4 |
| Mac Mini M4 (24GB) | $799 | 24GB | ~35 | ~30 | Best value under $1k for local inference |
| Mac Mini M4 Pro (48GB) | ~$1,399 | 48GB | ~60 | ~55 | Large context + fast dense models |
| RTX 3060 12GB build | ~$700 | 12GB VRAM | N/A (won’t fit) | ~60 | 14GB Q4 exceeds 12GB VRAM |
| RTX 3090 24GB build | ~$900 | 24GB VRAM | ~20 | ~80 | CUDA ecosystem; 24GB fits 28B Q4 |
| Cloud API (pay-as-you-go) | ~$0.07–$15/1M tok by model | — | — | — | No hardware cost; no offline or agent loops |
The Mac Mini comparison at the same memory tier
This is where the AMD case gets interesting. Matching 64GB of unified memory on a Mac requires the Mac Studio with M4 Max (which starts around $1,999 for 36GB, with the 64GB config closer to $2,399). The Mac Mini maxes out at 48GB with the M4 Pro at ~$1,399. Apple doesn’t sell a 64GB Mac Mini.
So the comparison at equivalent RAM capacity is:
| Option | Cost | RAM | Memory Bandwidth | 28B MoE tok/s |
|---|---|---|---|---|
| AMD 780M mini PC | ~$550 | 64GB | ~80 GB/s | ~20 |
| Mac Mini M4 Pro (48GB) | ~$1,399 | 48GB | ~273 GB/s | ~60 |
| Mac Studio M4 Max (64GB) | ~$2,399 | 64GB | ~546 GB/s | ~100+ |
The AMD box is 4–5x cheaper than Apple hardware at the same RAM tier. The Apple hardware is 3–7x faster per token. Whether the performance difference justifies the price difference depends entirely on your use case.
For a developer running a personal coding assistant 8 hours a day, 20 tok/s at $550 is perfectly usable. For a production workload handling many simultaneous requests, the Mac Studio’s bandwidth advantage compounds fast.
Apple’s unified memory also delivers meaningfully more bandwidth headroom. Even while an M4 Max streams a 28B MoE model’s active weights every token, most of its ~546 GB/s remains available to other processes; the AMD box’s ~80 GB/s is essentially fully committed once a 28B model is generating.
One practical point: the Mac Mini at 24GB ($799) and the AMD box at 64GB ($550) serve different needs. The Mac is faster but memory-constrained for very long contexts. The AMD box is slower but can hold a 28B model plus a 100k+ token context in memory simultaneously. If your use case requires large context windows, the AMD box’s RAM headroom matters more than the Mac’s bandwidth advantage.
Limitations
No CUDA. Fine-tuning, LoRA training, bitsandbytes, most quantization workflows, and the majority of GPU-accelerated research code all assume CUDA. This is an inference-only setup.
Dense models above ~13B are too slow for interactive use. 4–5 tok/s is readable but painful for iteration. The hardware is genuinely well-suited to MoE models and poor at dense ones above 13B.
Prefill speed on long contexts. At ~110 tok/s prefill, loading a 100k-token context takes roughly 15 minutes. Generation after that is fine, but cold-start for long-context workloads is slow.
Vulkan backend maturity. llama.cpp’s Vulkan path has improved quickly over the past year but still lags CUDA on some quantization formats, Flash Attention variants, and experimental sampling methods. Some community-quantized models (PolarQuant formats, GGUFs with newer tensor types) may fall back to CPU for unsupported ops.
DDR5 pricing. The current elevated DDR5 SO-DIMM prices make the 64GB config meaningfully more expensive than the base unit alone.
Who This Is For
Developers who want local inference running continuously (coding assistant, offline data processing, agent loops) and want to avoid API bills or need air-gap operation. At ~$550 all-in, the hardware breaks even against moderate API usage in roughly two months.
People who already own an AMD Ryzen mini PC with expandable DDR5 slots. Adding 64GB of RAM and pointing llama.cpp at it is a cheap path to running 28B models with no additional compute purchase.
For the highest performance per dollar under $1k, the Mac Mini M4 at $799 (24GB) has a stronger claim. For larger context windows, Linux-native operation, or a lower entry price into the 28B model tier, the AMD path is worth serious consideration. The gap between a $550 AMD mini PC and a $799 Mac is narrowest when the workload is MoE-heavy. Given the trajectory of model releases toward MoE architectures, that’s increasingly the relevant case.
FAQ
Can you run AI locally without a dedicated GPU? Yes. AMD Ryzen processors with Radeon integrated graphics (780M, 890M, and newer) share DDR5 system RAM with the iGPU, effectively giving the GPU access to all installed RAM as VRAM. With 64GB of DDR5, llama.cpp via Vulkan can run 28B MoE models at 18–22 tok/s with no discrete GPU required.
What is the best mini PC for running local AI in 2026? For budget local AI under $600, the MinisForum UM790 Pro (AMD Ryzen 9 7940HS, Radeon 780M) with 64GB DDR5-5600 is the leading option. For higher performance with a higher budget, the Mac Mini M4 (24GB, $799) or Mac Mini M4 Pro (48GB, ~$1,399) outperform it significantly due to higher memory bandwidth.
How much RAM do I need to run a 28B model locally? Gemma 4 28B Q4_0 requires ~14GB to load. A 32GB system loads it but leaves limited room for context and system overhead. 64GB is the recommended minimum for comfortable operation with 32k–128k context windows. For 70B models, 64GB is the minimum and 96GB is preferred.
Does Ollama support AMD integrated graphics? Ollama auto-detects Vulkan-capable GPUs, including AMD iGPUs like the Radeon 780M and 890M, on Linux. On Windows, Vulkan support in Ollama is available but may require a recent version of Ollama and up-to-date AMD drivers. ROCm (AMD’s CUDA equivalent) does not support iGPUs, only discrete Radeon cards.
What is a Mixture of Experts (MoE) model and why does it matter for cheap hardware? A MoE model splits its parameters across many specialized expert blocks and only activates a small subset for each token. Gemma 4 28B has 28B total parameters but activates only ~4–5B per token. Since inference speed on memory-bandwidth-limited hardware equals bandwidth divided by bytes read per token, MoE models run 4–5x faster than dense models of the same stated size. This is the key reason the AMD 780M can run “28B” models at interactive speeds.
Can I run AI models offline with no internet connection? Once models are downloaded to local storage, inference runs entirely on-device with no network connection needed. The AMD mini PC setup described here requires internet only for initial model downloads from HuggingFace or Ollama’s model registry. Inference, context storage, and agent tool calls all run locally.
Is the AMD 780M good for AI inference? The 780M is adequate for MoE models up to ~32B and dense models up to ~13B. Its 12 RDNA 3 compute units handle Vulkan-accelerated matrix multiplications at 18–25 tok/s on well-matched workloads. It is not suitable for fine-tuning, CUDA-dependent tooling, or dense models above 13B at interactive speeds. For inference-only use with modern MoE models, it works.
What is a GGUF file and why do local AI tools use it? GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and compatible tools (Ollama, LM Studio, Jan.ai) for storing quantized LLM weights. It packages model weights, tokenizer data, and metadata in a single binary file. The quantization (Q4_0, Q4_K, Q8_0, etc.) reduces weight precision to shrink the file and speed up inference at some quality cost. Q4_K_M offers the best quality-to-size tradeoff for most models; Q4_0 prioritizes speed.
How does local AI compare to cloud API for a coding assistant? A local 28B MoE model at 20 tok/s generates a 200-token code response in about 10 seconds, comparable to cloud API latency under moderate load. The local model has no rate limits, zero marginal cost per token, and can access your local filesystem directly for agentic coding tasks. Quality is below GPT-4 class models but in the range of GPT-3.5 to early GPT-4 for coding tasks, depending on the model.