At least for Gemma 4 26B-A4B, token-generation speed with oMLX is far worse than with llama-server on my M1 Max 64GB MacBook:
Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload
was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user
message):
llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0
--top-p 0.95 --top-k 64):
- pp ≈ 395 tok/s
- tg ≈ 40 tok/s
oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with
sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
- pp ≈ 350 tok/s
- tg ≈ 5–13 tok/s
Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is roughly 3–8x
slower than llama.cpp's Metal backend (40 vs. 5–13 tok/s). Counter-intuitive, given that MLX is Apple's native ML framework.
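For anyone who wants to sanity-check the arithmetic, here is a rough Python sketch that derives the slowdown range and a per-request wall time from the numbers quoted above. The 500-token reply length is my own illustrative assumption, not something I measured, and the estimate ignores prompt caching, so treat the request times as a ballpark only.

```python
# Throughput figures quoted in the benchmark above (tokens per second).
PROMPT_TOKENS = 35_000  # approximate Claude Code request size
LLAMA_PP, LLAMA_TG = 395.0, 40.0
OMLX_PP = 350.0
OMLX_TG_LOW, OMLX_TG_HIGH = 5.0, 13.0

# Assumption: illustrative reply length, not part of the benchmark.
REPLY_TOKENS = 500


def request_seconds(pp: float, tg: float,
                    prompt: int = PROMPT_TOKENS,
                    reply: int = REPLY_TOKENS) -> float:
    """Seconds to process an uncached prompt and then generate a reply."""
    return prompt / pp + reply / tg


# Token-generation slowdown of oMLX relative to llama.cpp.
slowdown_best = LLAMA_TG / OMLX_TG_HIGH   # oMLX at its fastest
slowdown_worst = LLAMA_TG / OMLX_TG_LOW   # oMLX at its slowest

print(f"tg slowdown: {slowdown_best:.1f}x to {slowdown_worst:.1f}x")
print(f"llama.cpp request:    {request_seconds(LLAMA_PP, LLAMA_TG):.0f} s")
print(f"oMLX request (best):  {request_seconds(OMLX_PP, OMLX_TG_HIGH):.0f} s")
print(f"oMLX request (worst): {request_seconds(OMLX_PP, OMLX_TG_LOW):.0f} s")
```

With a ~35K-token prompt, prompt processing dominates for short replies, so the generation gap hurts most on long outputs.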
Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max with 128GB is the sweet spot for me locally. The prompt decode caching keeps things coherent and fast even when contexts get north of 100k tokens.