OminiX-MLX leverages Apple’s unified memory architecture and Metal GPU acceleration. This guide covers optimization strategies for maximum performance.
## Quick wins

### 1. Use quantized models

Switch to 8-bit quantized weights for roughly 2x memory reduction and a 10-20% speedup from reduced memory-bandwidth pressure:
```bash
# Before: BF16 model (~8 GB)
huggingface-cli download mlx-community/Qwen3-4B-bf16 \
  --local-dir ./models/Qwen3-4B

# After: 8-bit quantized (~4.5 GB)
huggingface-cli download mlx-community/Qwen3-4B-8bit \
  --local-dir ./models/Qwen3-4B-8bit
```
See 8-bit quantization for details.
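As a sanity check on those sizes, weight memory scales directly with bits per parameter. A minimal sketch in plain Rust (no mlx-rs required; it ignores quantization-scale overhead, so real 8-bit files run slightly larger):

```rust
// Rough weight-memory estimate: params × bits / 8, reported in GiB.
// Ignores quantization scales and activations, so treat as a lower bound.
fn weight_gb(params: f64, bits: f64) -> f64 {
    params * bits / 8.0 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    let params = 4.0e9; // a 4B-parameter model like Qwen3-4B
    let bf16 = weight_gb(params, 16.0);
    let int8 = weight_gb(params, 8.0);
    // BF16 is 2 bytes/param, 8-bit is 1 byte/param → exactly a 2x reduction
    println!("BF16: {:.1} GB, 8-bit: {:.1} GB, ratio: {:.1}x", bf16, int8, bf16 / int8);
}
```

The measured sizes in the benchmark tables below (8 GB BF16, 4.5 GB 8-bit for Qwen3-4B) sit close to this estimate; the extra half-gigabyte is quantization metadata and runtime overhead.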
### 2. Use KVCache with pre-allocation

Switch from ConcatKeyValueCache to KVCache for long sequences:
```rust
use mlx_rs_core::{ConcatKeyValueCache, KVCache, KeyValueCache};

// Before: ConcatKeyValueCache allocates on every appended token
let mut cache: Vec<Option<ConcatKeyValueCache>> = vec![None; num_layers];

// After: KVCache pre-allocates in 256-token chunks
let mut cache: Vec<Option<KVCache>> = vec![None; num_layers];
```
Saves 15-25% generation time for sequences > 512 tokens.
See KV cache implementation for details.
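The idea behind the chunked pre-allocation can be sketched with a toy cache in plain Rust. This illustrates the amortization strategy only, not the actual `mlx_rs_core::KVCache` internals:

```rust
// Toy chunked cache: instead of growing by one token per step (one allocation
// each), grow capacity in 256-token chunks so most steps only write into
// already-allocated space.
struct ChunkedCache {
    data: Vec<f32>,
    len: usize,       // tokens stored so far
    step: usize,      // chunk size in tokens
    token_dim: usize, // floats per token (e.g. kv_heads * head_dim)
    allocations: usize,
}

impl ChunkedCache {
    fn new(step: usize, token_dim: usize) -> Self {
        Self { data: Vec::new(), len: 0, step, token_dim, allocations: 0 }
    }

    fn push_token(&mut self, token: &[f32]) {
        // Grow capacity only when the current chunk is exhausted.
        if self.len * self.token_dim == self.data.len() {
            self.data.resize(self.data.len() + self.step * self.token_dim, 0.0);
            self.allocations += 1;
        }
        let start = self.len * self.token_dim;
        self.data[start..start + self.token_dim].copy_from_slice(token);
        self.len += 1;
    }
}

fn main() {
    let mut cache = ChunkedCache::new(256, 8);
    let token = [0.0f32; 8];
    for _ in 0..1024 {
        cache.push_token(&token);
    }
    // 1024 tokens land in 256-token chunks → 4 allocations instead of 1024.
    println!("tokens={}, allocations={}", cache.len, cache.allocations);
}
```

Fewer allocations also means fewer tensor copies on the GPU side, which is where the quoted 15-25% saving comes from.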
### 3. Use fused SwiGLU

For MoE models, the fused SwiGLU kernel is 10-12x faster than separate operations:
```rust
use mlx_rs_core::fused_swiglu;

// Before: separate silu + multiply (two kernel launches, one intermediate tensor)
let gate_activated = gate.silu()?;
let output = gate_activated.multiply(&x)?;

// After: fused Metal kernel (~10x faster)
let output = fused_swiglu(&x, &gate)?;
```
From `mlx-rs-core/src/metal_kernels.rs`:

> Fused SwiGLU: Custom Metal kernel computes silu(gate) * x in a single pass. Critical for MoE models which have many SwiGLU calls per forward pass (~10-12x faster than separate ops).
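The math being fused is simple to state. A CPU sketch of silu(gate) * x computed in one pass (illustrative only; the real win is doing this in a single Metal kernel instead of two launches plus an intermediate tensor):

```rust
// silu(v) = v * sigmoid(v) = v / (1 + e^-v)
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

// One pass over the data: no intermediate silu(gate) buffer is materialized.
fn fused_swiglu_cpu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| silu(gi) * xi).collect()
}

fn main() {
    let x = [1.0f32, 2.0, -1.0];
    let gate = [0.0f32, 1.0, 2.0];
    // silu(0) = 0, silu(1) ≈ 0.731, silu(2) ≈ 1.762
    println!("{:?}", fused_swiglu_cpu(&x, &gate));
}
```

Because MoE layers invoke SwiGLU once per active expert per token, eliminating the intermediate buffer and the second kernel launch compounds quickly.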
## Memory optimization

### Unified memory architecture

Apple Silicon uses unified memory shared between the CPU and GPU:
```
┌─────────────────────────────────────┐
│       Unified Memory (128 GB)       │
│                                     │
│  ┌──────────┐      ┌─────────────┐  │
│  │   CPU    │◄────►│   GPU/ANE   │  │
│  │  Cores   │      │   (Metal)   │  │
│  └──────────┘      └─────────────┘  │
└─────────────────────────────────────┘
```
Benefits:
- Zero-copy data transfer between CPU and GPU
- No explicit memory management needed
- Models can use full system memory
Trade-offs:
- Memory pressure affects both CPU and GPU
- Large models reduce available CPU memory
- GPU performance degrades when memory is full
### Memory budget

Reserve headroom for the OS and other processes:
| System RAM | Model Budget | Recommended Max Model Size |
|---|---|---|
| 16 GB | 12 GB | 3-4B parameters (8-bit) |
| 32 GB | 24 GB | 7B parameters (8-bit) |
| 64 GB | 48 GB | 14B parameters (8-bit) |
| 128 GB | 100 GB | 70B parameters (3-bit) |
Running out of memory forces swapping to disk, causing severe slowdowns (100x+ slower). Always leave 4-8 GB free.
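A rough fit check against this budget can be scripted. The headroom and working-memory constants below are assumptions mirroring the guidance above, not measured values:

```rust
// Rough fit check: weights = params × bits/8, plus working memory for the KV
// cache and activations, must stay under (system RAM − OS headroom).
// The 6 GB headroom and 2 GB working set are assumed guidance figures.
fn fits(system_ram_gb: f64, params_b: f64, bits: f64) -> bool {
    let headroom_gb = 6.0; // leave 4-8 GB free for the OS
    let working_gb = 2.0;  // KV cache + activations, rough
    let weights_gb = params_b * bits / 8.0;
    weights_gb + working_gb <= system_ram_gb - headroom_gb
}

fn main() {
    println!("3.5B 8-bit on 16 GB: {}", fits(16.0, 3.5, 8.0)); // fits
    println!("14B  8-bit on 16 GB: {}", fits(16.0, 14.0, 8.0)); // would swap
    println!("7B   8-bit on 32 GB: {}", fits(32.0, 7.0, 8.0));  // fits
}
```

Anything that fails this check will spill into compressed memory or swap, which is where the 100x+ slowdowns come from.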
### Monitor memory usage

```bash
# Real-time free-memory monitoring (vm_stat reports 16 KB pages on Apple Silicon)
while true; do
  echo "$(date): $(vm_stat | grep 'Pages free' | awk '{print $3}' | sed 's/\.//') pages free"
  sleep 1
done

# Or use Activity Monitor (GUI)
```
## Inference speed optimization

### Batch size tuning

Larger batch sizes improve GPU utilization but increase memory use:
```rust
// Batch size 1 (default, interactive)
let input = prompt.index((0, NewAxis)); // [1, seq_len]

// Batch size 4 (offline processing)
let batch = stack(&[prompt1, prompt2, prompt3, prompt4], 0)?; // [4, seq_len]
```
| Batch Size | Memory | Tokens/sec | Step Latency | Use Case |
|---|---|---|---|---|
| 1 | 1.0x | 45 tok/s | 22ms | Interactive chat |
| 4 | 1.3x | 120 tok/s | 33ms | Offline processing |
| 8 | 1.7x | 180 tok/s | 44ms | Batch inference |
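The throughput column follows directly from the step latency: each decode step emits one token per sequence in the batch, so tokens/sec = batch_size / step_latency. A quick check using the latencies from the table:

```rust
// Aggregate throughput of batched decoding: one token per sequence per step.
fn tokens_per_sec(batch: u32, step_latency_ms: f64) -> f64 {
    batch as f64 / (step_latency_ms / 1000.0)
}

fn main() {
    // Latencies taken from the table above; outputs round to its tok/s column.
    for (batch, latency_ms) in [(1u32, 22.0), (4, 33.0), (8, 44.0)] {
        println!("batch {batch}: {:.0} tok/s", tokens_per_sec(batch, latency_ms));
    }
}
```

Note the trade: batch 8 gives 4x the aggregate throughput of batch 1, but each individual sequence sees 2x the per-token latency.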
### Temperature and sampling

Greedy decoding (temp=0) is fastest:
```rust
// Fastest: greedy (argmax)
let generator = Generate::new(&mut model, &mut cache, 0.0, &prompt);

// Slower: sampling with temperature
let generator = Generate::new(&mut model, &mut cache, 0.7, &prompt);

// Slowest: top-p/top-k sampling (not yet implemented)
```
| Method | Relative Speed | Use Case |
|---|---|---|
| Greedy (temp=0) | 1.0x | Factual QA, code completion |
| Temperature sampling | 0.95x | Balanced creativity |
| Top-p/top-k | 0.85x | Creative writing |
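To make the cost difference concrete, here is a minimal sketch of the two strategies over a logits vector. Greedy is a single argmax pass, while temperature sampling needs an extra exp/normalize pass before drawing a token (the random draw itself is omitted):

```rust
// Greedy decoding: one pass, pick the highest logit.
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Temperature sampling: scale logits by 1/T, then softmax (max-subtracted for
// numerical stability). Lower T sharpens the distribution toward the argmax.
fn softmax_with_temp(logits: &[f32], temp: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temp).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [1.0f32, 3.0, 2.0];
    println!("greedy pick: {}", argmax(&logits)); // index 1
    println!("probs @ T=0.7: {:?}", softmax_with_temp(&logits, 0.7));
}
```

The extra exp/normalize pass plus the random draw is small per token, which is why the table shows only a ~5% slowdown.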
### Lazy evaluation

MLX uses lazy evaluation: operations are recorded into a graph, then fused and optimized at execution time:
```rust
use mlx_rs::transforms::{async_eval, eval};

// Operations are queued, not executed
let a = x.multiply(&y)?;
let b = a.add(&z)?;
let c = b.relu()?;

// Execute all at once (fused where possible)
eval(&[&c])?;

// Or async for non-blocking execution
let _ = async_eval(&[&c]);
```
Manual eval() calls are rarely needed. MLX automatically evaluates when you call .item(), .as_slice(), or iterate. Let the framework handle fusion.
## Model-specific optimizations

### Grouped Query Attention (GQA)

Models with GQA use fewer KV heads for better memory efficiency:
```rust
// Standard MHA: as many KV heads as query heads
let q_heads = 32;
let kv_heads = 32; // same as Q
let memory_per_layer = batch * kv_heads * seq_len * head_dim * 4; // 2 bytes (BF16) × 2 (K+V)

// GQA: 8 KV heads shared across 32 query heads (4x reduction)
let q_heads = 32;
let kv_heads = 8; // 4x fewer
let memory_per_layer = batch * kv_heads * seq_len * head_dim * 4; // 4x less memory
```
| Architecture | KV Memory | Speed | Quality |
|---|---|---|---|
| MHA (32 heads) | 1.0x | Baseline | Baseline |
| GQA (8 KV heads) | 0.25x | 1.15x | ~Equal |
Most modern models use GQA: Qwen3, Moxin, GLM4, Mistral.
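Plugging numbers into the per-layer formula above makes the saving concrete; batch 1, a 4096-token context, and head_dim 128 are assumed example values:

```rust
// Per-layer KV cache size: batch × kv_heads × seq_len × head_dim
//   × 2 bytes (BF16) × 2 (K and V).
fn kv_bytes_per_layer(batch: usize, kv_heads: usize, seq_len: usize, head_dim: usize) -> usize {
    batch * kv_heads * seq_len * head_dim * 2 * 2
}

fn main() {
    let (batch, seq_len, head_dim) = (1, 4096, 128);
    let mha = kv_bytes_per_layer(batch, 32, seq_len, head_dim); // 32 KV heads
    let gqa = kv_bytes_per_layer(batch, 8, seq_len, head_dim);  // 8 KV heads
    println!("MHA: {} MB, GQA: {} MB, ratio: {}x", mha >> 20, gqa >> 20, mha / gqa);
}
```

At 4096 tokens that is 64 MB vs 16 MB per layer; multiplied across 30-40 layers, the 4x reduction is what lets GQA models hold much longer contexts in the same memory budget.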
### Mixture of Experts (MoE)

MoE models activate only 2-4 experts per token:
```rust
// 45 total experts, but only top-k=2 are active per token
let mut expert_outputs = Vec::new();
for expert_idx in top_k_indices {
    let expert = &self.experts[expert_idx];
    // Only 2 of 45 experts run → ~22x less expert compute than dense
    expert_outputs.push(expert.forward(x)?);
}
```
From `glm4-moe-mlx/README.md`:

> 45 Experts: Shared experts + routed experts architecture. Top-k routing selects 2-4 experts per token, providing 45x model capacity with ~2x compute cost.
| Model Type | Parameters | Active Params | Memory | Speed |
|---|---|---|---|---|
| Dense 9B | 9B | 9B | 18 GB | 35 tok/s |
| MoE 9B (45 experts) | 45B | 9B | 20 GB | 15-20 tok/s |
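The capacity-versus-compute trade can be sketched with hypothetical sizes; the 1B-shared / 1B-per-expert split below is illustrative, not GLM4-MoE's actual layout:

```rust
// Total vs per-token-active parameters in a sketch MoE layer stack:
// shared params always run; only top_k of the routed experts run per token.
fn params(shared: f64, num_experts: u32, expert_params: f64, top_k: u32) -> (f64, f64) {
    let total = shared + num_experts as f64 * expert_params;
    let active = shared + top_k as f64 * expert_params;
    (total, active)
}

fn main() {
    // Hypothetical: 1B shared + 45 routed experts of 1B each, top-2 routing
    let (total, active) = params(1.0e9, 45, 1.0e9, 2);
    println!("total {:.0}B, active {:.0}B per token", total / 1e9, active / 1e9);
}
```

The catch visible in the table: all 45 experts' weights must still reside in memory, so MoE trades memory footprint for compute, and routing overhead is why its decode speed trails a dense model of equal active size.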
### Vision-Language Models

Optimize image preprocessing:
```rust
// Pre-compute image embeddings once
let dino_features = dino_encoder.forward(&image)?;     // [1, 256, 1024]
let siglip_features = siglip_encoder.forward(&image)?; // [1, 256, 1152]
// `concat`: DINO and SigLIP features joined along the feature dim → [1, 256, 2176]
let visual_tokens = projector.forward(&concat)?;       // [1, 256, 4096]

// Reuse the embeddings for multiple text generations
for prompt in prompts {
    let input_ids = build_input(visual_tokens.clone(), prompt)?;
    let output = decoder.forward(&input_ids, &mut cache)?;
}
```
Vision encoding takes 100-200ms. Reuse embeddings when asking multiple questions about the same image.
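The amortization is easy to quantify. Assuming a 150 ms encode (midpoint of the range above) and a hypothetical 500 ms generation per question:

```rust
// Total latency for N questions about one image, with and without reusing
// the visual embeddings. encode_ms and decode_ms are assumed example values.
fn total_ms(questions: u32, encode_ms: f64, decode_ms: f64, reuse: bool) -> f64 {
    let encodes = if reuse { 1 } else { questions } as f64;
    encodes * encode_ms + questions as f64 * decode_ms
}

fn main() {
    let (encode, decode) = (150.0, 500.0);
    println!("5 questions, re-encode each time: {} ms", total_ms(5, encode, decode, false));
    println!("5 questions, reuse embeddings:    {} ms", total_ms(5, encode, decode, true));
}
```

The saving grows linearly with the number of questions, and the encode cost dominates more as generations get shorter.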
## Benchmarking

### Measure tokens per second
```rust
use std::time::Instant;

let start = Instant::now();
let mut num_tokens = 0;
for token in generator.take(100) {
    let _ = token?;
    num_tokens += 1;
}
let elapsed = start.elapsed().as_secs_f32();
let tok_per_sec = num_tokens as f32 / elapsed;
println!("Speed: {:.1} tok/s", tok_per_sec);
```
### Separate prefill and decode
```rust
let prefill_start = Instant::now();
let first_token = generator.next().unwrap()?; // prefill + first token
let prefill_time = prefill_start.elapsed();

let decode_start = Instant::now();
let remaining: Vec<_> = generator.take(99).collect();
let decode_time = decode_start.elapsed();

println!("Prefill: {}ms", prefill_time.as_millis());
println!("Decode: {:.1} tok/s", 99.0 / decode_time.as_secs_f32());
```
### Profile with Metal System Trace

```bash
# Record a trace
xctrace record --template 'Metal System Trace' \
  --output trace.trace \
  --target-stdout - \
  --launch -- cargo run --release --example generate

# Open in Instruments
open trace.trace
```
Look for:
- GPU utilization (should be > 80%)
- Memory bandwidth (watch for saturation)
- Kernel launch overhead (minimize small kernels)
Measured on an Apple M4 Max (128 GB):

### Language models

| Model | Precision | Memory | Prefill | Decode | Notes |
|---|---|---|---|---|---|
| Qwen3-0.5B | BF16 | 1.2 GB | 15ms | 120 tok/s | Fastest small |
| Qwen3-4B | BF16 | 8 GB | 45ms | 45 tok/s | Balanced |
| Qwen3-4B | 8-bit | 4.5 GB | 50ms | 42 tok/s | Memory-optimized |
| Moxin-7B VLM | 8-bit | 10 GB | 250ms | 30 tok/s | Includes vision |
| GLM4-9B | 4-bit | 6 GB | 55ms | 35 tok/s | MHA baseline |
| GLM4-MoE | 3-bit | 20 GB | 80ms | 15-20 tok/s | Variable (MoE) |
| MiniCPM-SALA-9B | 8-bit | 9.6 GB | 60ms | 28 tok/s | Hybrid attention |
### Audio models

| Model | Precision | Memory | Speed | Notes |
|---|---|---|---|---|
| Paraformer-large | BF16 | 500 MB | 18x RT | Chinese/English |
| Qwen3-ASR-0.6B | 8-bit | 1.0 GB | 50x RT | 30+ languages |
| Qwen3-ASR-1.7B | 8-bit | 2.5 GB | 30x RT | Best accuracy |
| GPT-SoVITS | BF16 | 2 GB | 4x RT | Voice cloning |
### Image models

| Model | Precision | Memory | Speed | Resolution |
|---|---|---|---|---|
| Z-Image | BF16 | 8 GB | ~3s/image | 512×512 |
| FLUX.2-klein | BF16 | 13 GB | ~5s/image | 1024×1024 |
## Hardware comparison

### Apple Silicon generations

| Chip | GPU Cores | Memory BW | Qwen3-4B Speed |
|---|---|---|---|
| M1 | 8 | 68 GB/s | 25 tok/s |
| M1 Max | 32 | 400 GB/s | 38 tok/s |
| M2 | 10 | 100 GB/s | 30 tok/s |
| M2 Max | 38 | 400 GB/s | 42 tok/s |
| M3 Max | 40 | 400 GB/s | 45 tok/s |
| M4 Max | 40 | 410 GB/s | 48 tok/s |
Memory bandwidth is the primary bottleneck for LLM inference. Max/Ultra variants offer 4-6x more bandwidth than base chips.
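This bandwidth bound can be checked with a roofline argument: each decoded token must stream the full weight set through the GPU at least once, so tokens/sec is capped near bandwidth divided by weight bytes. A sketch for 8 GB of BF16 weights on an M4 Max:

```rust
// Roofline upper bound on decode speed: every generated token reads all
// weights once, so tok/s <= memory bandwidth / weight bytes.
fn roofline_tok_per_sec(bandwidth_gb_s: f64, weights_gb: f64) -> f64 {
    bandwidth_gb_s / weights_gb
}

fn main() {
    // 410 GB/s over 8 GB of BF16 weights bounds decode near ~51 tok/s
    println!("bound: {:.1} tok/s", roofline_tok_per_sec(410.0, 8.0));
}
```

The measured 48 tok/s for Qwen3-4B BF16 sits just under this ~51 tok/s ceiling, which is why bandwidth, not compute, dominates single-stream decode, and why quantization speeds up decoding at all.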
### Memory configurations

| Configuration | Recommended Models |
|---|---|
| 16 GB (M1/M2 base) | 0.5-3B models (8-bit) |
| 32 GB (M1/M2 Pro) | 7B models (8-bit) |
| 64 GB (M1/M2 Max) | 14B models (8-bit), 7B (BF16) |
| 128 GB (M3/M4 Max) | 70B models (3-bit), 14B (BF16) |
## Release vs debug builds

Always use `--release` builds for benchmarking:
```bash
# Debug: ~10x slower, large memory overhead
cargo run --example generate

# Release: full optimizations
cargo run --release --example generate
```
| Build | Speed | Binary Size | Use Case |
|---|---|---|---|
| Debug | 0.1x | 50 MB | Development |
| Release | 1.0x | 5 MB | Production |
## Common pitfalls

Avoid:
- Running multiple large models simultaneously (memory pressure)
- Using unwrap() in hot paths (panic-path overhead)
- Excessive .item() calls (each one forces evaluation)
- Small batch sizes on large models (underutilizes the GPU)
- Debug builds for performance testing
Do:
- Use quantized models for production
- Pre-allocate buffers (KVCache with step size)
- Batch operations when possible
- Let MLX handle evaluation (lazy execution)
- Monitor memory usage
- Profile with Metal System Trace
## References