OminiX-MLX leverages Apple’s unified memory architecture and Metal GPU acceleration. This guide covers optimization strategies for maximum performance.

Quick wins

1. Use quantized models

Switch to 8-bit quantized weights for roughly 2x memory reduction and a 10-20% speedup from reduced memory bandwidth:
# Before: BF16 model (~8 GB)
huggingface-cli download mlx-community/Qwen3-4B-bf16 \
    --local-dir ./models/Qwen3-4B

# After: 8-bit quantized (~4.5 GB)
huggingface-cli download mlx-community/Qwen3-4B-8bit \
    --local-dir ./models/Qwen3-4B-8bit
See 8-bit quantization for details.

2. Use KVCache with pre-allocation

Switch from ConcatKeyValueCache to KVCache for long sequences:
use mlx_rs_core::{ConcatKeyValueCache, KVCache, KeyValueCache};

// Before: Allocates on every token
let mut cache: Vec<Option<ConcatKeyValueCache>> = vec![None; num_layers];

// After: Pre-allocates in 256-token chunks
let mut cache: Vec<Option<KVCache>> = vec![None; num_layers];
Saves 15-25% generation time for sequences > 512 tokens. See KV cache implementation for details.
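The chunked-growth strategy can be sketched in plain Rust. This is a simplified illustration with a `Vec<f32>` backing store and hypothetical names, not the actual KVCache internals (which hold MLX arrays); it only shows why allocations happen every 256 tokens instead of every token:

```rust
// Simplified sketch of chunked pre-allocation (hypothetical type).
const STEP: usize = 256;

struct ChunkedCache {
    buf: Vec<f32>,      // backing storage, grown in STEP-token chunks
    len: usize,         // tokens currently stored
    token_width: usize, // floats per token (kv_heads * head_dim)
}

impl ChunkedCache {
    fn new(token_width: usize) -> Self {
        Self { buf: Vec::new(), len: 0, token_width }
    }

    /// Append one token's values, reallocating only when a
    /// 256-token chunk boundary is crossed.
    fn push_token(&mut self, values: &[f32]) {
        assert_eq!(values.len(), self.token_width);
        let needed = (self.len + 1) * self.token_width;
        if needed > self.buf.len() {
            // Round capacity up to the next STEP-token multiple.
            let tokens = ((self.len + STEP) / STEP) * STEP;
            self.buf.resize(tokens * self.token_width, 0.0);
        }
        let start = self.len * self.token_width;
        self.buf[start..start + self.token_width].copy_from_slice(values);
        self.len += 1;
    }
}
```

Amortized over a long generation, only one reallocation occurs per 256 tokens, which is where the 15-25% saving comes from.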

3. Use fused Metal kernels

For MoE models, fused SwiGLU is 10-12x faster:
use mlx_rs_core::fused_swiglu;

// Before: Separate silu + multiply (slow)
let gate_activated = gate.silu()?;
let output = gate_activated.multiply(&x)?;

// After: Fused Metal kernel (10x faster)
let output = fused_swiglu(&x, &gate)?;
From mlx-rs-core/src/metal_kernels.rs:
Fused SwiGLU: Custom Metal kernel computes silu(gate) * x in a single pass. Critical for MoE models which have many SwiGLU calls per forward pass (~10-12x faster than separate ops).
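As a scalar reference for what the fused kernel computes per element (illustrative helper names, not the library API; the speedup comes from fusing, not from this math):

```rust
/// silu(g) = g * sigmoid(g) = g / (1 + e^(-g))
fn silu(g: f32) -> f32 {
    g / (1.0 + (-g).exp())
}

/// Element-wise silu(gate) * x, the operation the fused kernel
/// performs in a single pass over memory.
fn swiglu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| silu(gi) * xi).collect()
}
```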

Memory optimization

Unified memory architecture

Apple Silicon uses unified memory shared between CPU and GPU:
┌─────────────────────────────────────┐
│       Unified Memory (128 GB)       │
│                                     │
│  ┌──────────┐      ┌─────────────┐ │
│  │   CPU    │◄────►│   GPU/ANE   │ │
│  │  Cores   │      │   (Metal)   │ │
│  └──────────┘      └─────────────┘ │
└─────────────────────────────────────┘
Benefits:
  • Zero-copy data transfer between CPU and GPU
  • No explicit memory management needed
  • Models can use full system memory
Trade-offs:
  • Memory pressure affects both CPU and GPU
  • Large models reduce available CPU memory
  • GPU performance degrades when memory is full

Memory budget

Reserve headroom for OS and other processes:
System RAM   Model Budget   Recommended Max Model Size
16 GB        12 GB          3-4B parameters (8-bit)
32 GB        24 GB          7B parameters (8-bit)
64 GB        48 GB          14B parameters (8-bit)
128 GB       100 GB         70B parameters (3-bit)
Running out of memory forces swapping to disk, causing severe slowdowns (100x+ slower). Always leave 4-8 GB free.
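The budget rule of thumb is just parameter count times bits per weight; a small helper (hypothetical name) makes the arithmetic explicit. It ignores KV cache and activation overhead, which is why the budgets above leave extra slack:

```rust
/// Rough model-weight footprint in GB: parameters * bits_per_weight / 8.
/// KV cache and activations add more on top of this.
fn weight_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}
```

For example, a 70B model at 3-bit weighs in around 26 GB, comfortably inside the 100 GB budget of a 128 GB machine.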

Monitor memory usage

# Real-time free-memory monitoring (Apple Silicon uses 16 KiB VM pages)
while true; do
    free_pages=$(vm_stat | awk '/Pages free/ {gsub(/\./, ""); print $3}')
    echo "$(date): $((free_pages * 16384 / 1048576)) MB free"
    sleep 1
done

# Or use Activity Monitor (GUI)

Inference speed optimization

Batch size tuning

Larger batch sizes improve GPU utilization but increase memory:
// Batch size 1 (default, interactive)
let input = prompt.index((0, NewAxis));  // [1, seq_len]

// Batch size 4 (offline processing)
let batch = stack(&[prompt1, prompt2, prompt3, prompt4], 0)?;  // [4, seq_len]
Batch Size   Memory   Tokens/sec   Latency   Use Case
1            1.0x     45 tok/s     22ms      Interactive chat
4            1.3x     120 tok/s    33ms      Offline processing
8            1.7x     180 tok/s    44ms      Batch inference
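The latency column follows directly from throughput: one decode step yields `batch` tokens, so per-step latency is batch divided by tokens per second. A quick sketch (hypothetical helper):

```rust
/// Per-step latency implied by aggregate throughput: each decode
/// step produces `batch` tokens, so a step takes batch / tok_per_sec.
fn step_latency_ms(batch: u32, tok_per_sec: f32) -> f32 {
    batch as f32 / tok_per_sec * 1000.0
}
```

Larger batches raise throughput (45 → 180 tok/s) but each individual request waits longer per token (22ms → 44ms), which is why batch 1 stays the default for interactive chat.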

Temperature and sampling

Greedy decoding (temp=0) is fastest:
// Fastest: Greedy (argmax)
let generator = Generate::new(&mut model, &mut cache, 0.0, &prompt);

// Slower: Sampling with temperature
let generator = Generate::new(&mut model, &mut cache, 0.7, &prompt);

// Slowest: Top-p/top-k sampling (not yet implemented)
Method                 Relative Speed   Use Case
Greedy (temp=0)        1.0x             Factual QA, code completion
Temperature sampling   0.95x            Balanced creativity
Top-p/top-k            0.85x            Creative writing (not yet implemented)

Lazy evaluation

MLX uses lazy evaluation: operations are queued, then fused and optimized when evaluation is forced:
use mlx_rs::transforms::{eval, async_eval};

// Operations are queued, not executed
let a = x.multiply(&y)?;
let b = a.add(&z)?;
let c = b.relu()?;

// Execute all at once (fused into single kernel)
eval(&[&c])?;

// Or async for non-blocking
let _ = async_eval([&c]);
Manual eval() calls are rarely needed. MLX automatically evaluates when you call .item(), .as_slice(), or iterate. Let the framework handle fusion.

Model-specific optimizations

Grouped Query Attention (GQA)

Models with GQA use fewer KV heads for better efficiency:
// Standard MHA: num_heads = 32
let q_heads = 32;
let kv_heads = 32;  // Same as Q
let memory_per_layer = batch * kv_heads * seq_len * head_dim * 4;  // 2 bytes × 2 (K+V)

// GQA: num_kv_heads = 8 (4x reduction)
let q_heads = 32;
let kv_heads = 8;  // 4x fewer
let memory_per_layer = batch * kv_heads * seq_len * head_dim * 4;  // 4x less memory
Architecture       KV Memory   Speed      Quality
MHA (32 heads)     1.0x        Baseline   Baseline
GQA (8 KV heads)   0.25x       1.15x      ~Equal
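The 0.25x figure falls out of the memory formula above. A self-contained check (illustrative helper; BF16 assumed, so 2 bytes per element, times 2 for K and V):

```rust
/// KV cache bytes per layer: K and V tensors, each
/// [batch, kv_heads, seq_len, head_dim] at 2 bytes (BF16).
fn kv_bytes_per_layer(batch: usize, kv_heads: usize,
                      seq_len: usize, head_dim: usize) -> usize {
    2 * batch * kv_heads * seq_len * head_dim * 2 // 2 tensors x 2 bytes
}
```

At batch 1, 4096 tokens, head_dim 128, each MHA layer holds 64 MiB of KV cache; across 32 layers that is 2 GiB, which GQA with 8 KV heads cuts to 512 MiB.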
Most modern models use GQA: Qwen3, Moxin, GLM4, Mistral.

Mixture of Experts (MoE)

MoE models activate only 2-4 experts per token:
// 45 total experts, but only top-k=2 are active per token
let mut expert_outputs = Vec::new();
for expert_idx in top_k_indices {
    let expert = &self.experts[expert_idx];
    // Only 2 experts run → 22x less compute than dense
    expert_outputs.push(expert.forward(x)?);
}
From glm4-moe-mlx/README.md:
45 Experts: Shared experts + routed experts architecture. Top-k routing selects 2-4 experts per token, providing 45x model capacity with ~2x compute cost.
Model Type            Parameters   Active Params   Memory   Speed
Dense 9B              9B           9B              18 GB    35 tok/s
MoE 9B (45 experts)   45B          9B              20 GB    15-20 tok/s

Vision-Language Models

Optimize image preprocessing:
// Pre-compute image embeddings once
let dino_features = dino_encoder.forward(&image)?;  // [1, 256, 1024]
let siglip_features = siglip_encoder.forward(&image)?;  // [1, 256, 1152]
// Join the two feature sets along the channel dimension
// (assumes a concatenate-style op; the exact mlx-rs name may differ)
let concat = concatenate(&[&dino_features, &siglip_features], -1)?;  // [1, 256, 2176]
let visual_tokens = projector.forward(&concat)?;  // [1, 256, 4096]

// Reuse for multiple text generations
for prompt in prompts {
    let input_ids = build_input(visual_tokens.clone(), prompt)?;
    let output = decoder.forward(&input_ids, &mut cache)?;
}
Vision encoding takes 100-200ms. Reuse embeddings when asking multiple questions about the same image.
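Whether reuse pays off is simple arithmetic: n questions cost one encode plus n generations instead of n of each. A sketch with assumed illustrative timings (150ms encode, 500ms per generation; both hypothetical):

```rust
/// Total time (ms) for n questions about one image, with and
/// without re-encoding the image for every question.
fn total_ms(n: u32, encode_ms: f32, gen_ms: f32, reuse_embeddings: bool) -> f32 {
    let encodes = if reuse_embeddings { 1 } else { n };
    encodes as f32 * encode_ms + n as f32 * gen_ms
}
```

With five questions the saving is four encodes (600ms here); the more questions per image, the bigger the win.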

Benchmarking

Measure tokens per second

use std::time::Instant;

let start = Instant::now();
let mut num_tokens = 0;

for token in generator.take(100) {
    let _ = token?;
    num_tokens += 1;
}

let elapsed = start.elapsed().as_secs_f32();
let tok_per_sec = num_tokens as f32 / elapsed;
println!("Speed: {:.1} tok/s", tok_per_sec);

Separate prefill and decode

let prefill_start = Instant::now();
let first_token = generator.next().unwrap()?;  // Prefill
let prefill_time = prefill_start.elapsed();

let decode_start = Instant::now();
let remaining: Vec<_> = generator.take(99).collect();
let decode_time = decode_start.elapsed();

println!("Prefill: {:.0}ms", prefill_time.as_millis());
println!("Decode: {:.1} tok/s", 99.0 / decode_time.as_secs_f32());

Profile with Metal System Trace

# Record trace
xctrace record --template 'Metal System Trace' \
    --output trace.trace \
    --target-stdout - \
    --launch -- cargo run --release --example generate

# Open in Instruments
open trace.trace
Look for:
  • GPU utilization (should be > 80%)
  • Memory bandwidth (watch for saturation)
  • Kernel launch overhead (minimize small kernels)

Performance benchmarks

Measured on Apple M4 Max (128 GB):

Language models

Model             Precision   Memory   Prefill   Decode        Notes
Qwen3-0.5B        BF16        1.2 GB   15ms      120 tok/s     Fastest small
Qwen3-4B          BF16        8 GB     45ms      45 tok/s      Balanced
Qwen3-4B          8-bit       4.5 GB   50ms      42 tok/s      Memory-optimized
Moxin-7B VLM      8-bit       10 GB    250ms     30 tok/s      Includes vision
GLM4-9B           4-bit       6 GB     55ms      35 tok/s      MHA baseline
GLM4-MoE          3-bit       20 GB    80ms      15-20 tok/s   Variable (MoE)
MiniCPM-SALA-9B   8-bit       9.6 GB   60ms      28 tok/s      Hybrid attention

Audio models

Model              Precision   Memory   Speed    Notes
Paraformer-large   BF16        500 MB   18x RT   Chinese/English
Qwen3-ASR-0.6B     8-bit       1.0 GB   50x RT   30+ languages
Qwen3-ASR-1.7B     8-bit       2.5 GB   30x RT   Best accuracy
GPT-SoVITS         BF16        2 GB     4x RT    Voice cloning

Image models

Model          Precision   Memory   Speed       Resolution
Z-Image        BF16        8 GB     ~3s/image   512×512
FLUX.2-klein   BF16        13 GB    ~5s/image   1024×1024

Hardware comparison

Apple Silicon generations

Chip     GPU Cores   Memory BW   Qwen3-4B Speed
M1       8           68 GB/s     25 tok/s
M1 Max   32          400 GB/s    38 tok/s
M2       10          100 GB/s    30 tok/s
M2 Max   38          400 GB/s    42 tok/s
M3 Max   40          400 GB/s    45 tok/s
M4 Max   40          410 GB/s    48 tok/s
Memory bandwidth is the primary bottleneck for LLM inference. Max/Ultra variants offer 4-6x more bandwidth than base chips.
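Because decode streams every weight from memory once per generated token, bandwidth divided by model size gives a rough upper bound on decode speed (a roofline-style estimate; it ignores KV cache traffic and kernel overhead, so real numbers land somewhat below it):

```rust
/// Roofline-style upper bound on decode speed: each token must
/// read the full weight set from memory once, so the bound is
/// memory bandwidth / model size.
fn max_tok_per_sec(bandwidth_gb_s: f64, model_gb: f64) -> f64 {
    bandwidth_gb_s / model_gb
}
```

For Qwen3-4B BF16 (8 GB) on a 400 GB/s Max chip this predicts at most ~50 tok/s, consistent with the ~45 tok/s measured above.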

Memory configurations

Configuration        Recommended Models
16 GB (M1/M2 base)   0.5-3B models (8-bit)
32 GB (M1/M2 Pro)    7B models (8-bit)
64 GB (M1/M2 Max)    14B models (8-bit), 7B (BF16)
128 GB (M3/M4 Max)   70B models (3-bit), 14B (BF16)

Release vs debug builds

Always use --release for benchmarking:
# Debug: 10x slower, large memory overhead
cargo run --example generate

# Release: Full optimizations
cargo run --release --example generate
Build     Speed   Binary Size   Use Case
Debug     0.1x    50 MB         Development
Release   1.0x    5 MB          Production

Common pitfalls

Avoid:
  • Running multiple large models simultaneously (memory pressure)
  • Using unwrap() in hot paths (panic overhead)
  • Excessive .item() calls (forces evaluation)
  • Small batch sizes on large models (underutilizes GPU)
  • Debug builds for performance testing
Do:
  • Use quantized models for production
  • Pre-allocate buffers (KVCache with step size)
  • Batch operations when possible
  • Let MLX handle evaluation (lazy execution)
  • Monitor memory usage
  • Profile with Metal System Trace
