OminiX-MLX leverages Apple’s unified memory architecture and Metal GPU acceleration. This guide covers optimization strategies for maximum performance.
## Quick wins

### 1. Use quantized models

Switch to 8-bit quantized weights for roughly 2x memory reduction and a 10-20% speedup from reduced memory-bandwidth pressure:
```bash
# Before: BF16 model (~8 GB)
huggingface-cli download mlx-community/Qwen3-4B-bf16 \
  --local-dir ./models/Qwen3-4B

# After: 8-bit quantized (~4.5 GB)
huggingface-cli download mlx-community/Qwen3-4B-8bit \
  --local-dir ./models/Qwen3-4B-8bit
```
See 8-bit quantization for details.
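As a sanity check on those sizes, weight memory scales directly with bits per parameter. A minimal sketch in plain Rust (no mlx-rs required; it ignores quantization-scale overhead, so real 8-bit files run slightly larger):

```rust
// Rough weight-memory estimate: params × bits / 8, reported in GiB.
// Ignores quantization scales and activations, so treat as a lower bound.
fn weight_gb(params: f64, bits: f64) -> f64 {
    params * bits / 8.0 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    let params = 4.0e9; // a 4B-parameter model like Qwen3-4B
    let bf16 = weight_gb(params, 16.0);
    let int8 = weight_gb(params, 8.0);
    // BF16 is 2 bytes/param, 8-bit is 1 byte/param → exactly a 2x reduction
    println!("BF16: {:.1} GB, 8-bit: {:.1} GB, ratio: {:.1}x", bf16, int8, bf16 / int8);
}
```

The measured sizes in the benchmark tables below (8 GB BF16, 4.5 GB 8-bit for Qwen3-4B) sit close to this estimate; the extra half-gigabyte is quantization metadata and runtime overhead.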
### 2. Use KVCache with pre-allocation

Switch from ConcatKeyValueCache to KVCache for long sequences:
```rust
use mlx_rs_core::{ConcatKeyValueCache, KVCache, KeyValueCache};

// Before: ConcatKeyValueCache allocates on every appended token
let mut cache: Vec<Option<ConcatKeyValueCache>> = vec![None; num_layers];

// After: KVCache pre-allocates in 256-token chunks
let mut cache: Vec<Option<KVCache>> = vec![None; num_layers];
```
Saves 15-25% generation time for sequences > 512 tokens.
See KV cache implementation for details.
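The idea behind the chunked pre-allocation can be sketched with a toy cache in plain Rust. This illustrates the amortization strategy only, not the actual `mlx_rs_core::KVCache` internals:

```rust
// Toy chunked cache: instead of growing by one token per step (one allocation
// each), grow capacity in 256-token chunks so most steps only write into
// already-allocated space.
struct ChunkedCache {
    data: Vec<f32>,
    len: usize,       // tokens stored so far
    step: usize,      // chunk size in tokens
    token_dim: usize, // floats per token (e.g. kv_heads * head_dim)
    allocations: usize,
}

impl ChunkedCache {
    fn new(step: usize, token_dim: usize) -> Self {
        Self { data: Vec::new(), len: 0, step, token_dim, allocations: 0 }
    }

    fn push_token(&mut self, token: &[f32]) {
        // Grow capacity only when the current chunk is exhausted.
        if self.len * self.token_dim == self.data.len() {
            self.data.resize(self.data.len() + self.step * self.token_dim, 0.0);
            self.allocations += 1;
        }
        let start = self.len * self.token_dim;
        self.data[start..start + self.token_dim].copy_from_slice(token);
        self.len += 1;
    }
}

fn main() {
    let mut cache = ChunkedCache::new(256, 8);
    let token = [0.0f32; 8];
    for _ in 0..1024 {
        cache.push_token(&token);
    }
    // 1024 tokens land in 256-token chunks → 4 allocations instead of 1024.
    println!("tokens={}, allocations={}", cache.len, cache.allocations);
}
```

Fewer allocations also means fewer tensor copies on the GPU side, which is where the quoted 15-25% saving comes from.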
### 3. Use fused SwiGLU

For MoE models, the fused SwiGLU kernel is 10-12x faster than separate operations:
```rust
use mlx_rs_core::fused_swiglu;

// Before: separate silu + multiply (two kernel launches, one intermediate tensor)
let gate_activated = gate.silu()?;
let output = gate_activated.multiply(&x)?;

// After: fused Metal kernel (~10x faster)
let output = fused_swiglu(&x, &gate)?;
```
From `mlx-rs-core/src/metal_kernels.rs`:

> Fused SwiGLU: Custom Metal kernel computes silu(gate) * x in a single pass. Critical for MoE models which have many SwiGLU calls per forward pass (~10-12x faster than separate ops).
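The math being fused is simple to state. A CPU sketch of silu(gate) * x computed in one pass (illustrative only; the real win is doing this in a single Metal kernel instead of two launches plus an intermediate tensor):

```rust
// silu(v) = v * sigmoid(v) = v / (1 + e^-v)
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

// One pass over the data: no intermediate silu(gate) buffer is materialized.
fn fused_swiglu_cpu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter().zip(gate).map(|(&xi, &gi)| silu(gi) * xi).collect()
}

fn main() {
    let x = [1.0f32, 2.0, -1.0];
    let gate = [0.0f32, 1.0, 2.0];
    // silu(0) = 0, silu(1) ≈ 0.731, silu(2) ≈ 1.762
    println!("{:?}", fused_swiglu_cpu(&x, &gate));
}
```

Because MoE layers invoke SwiGLU once per active expert per token, eliminating the intermediate buffer and the second kernel launch compounds quickly.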
## Memory optimization

### Unified memory architecture

Apple Silicon uses unified memory shared between the CPU and GPU:
```
┌─────────────────────────────────────┐
│       Unified Memory (128 GB)       │
│                                     │
│  ┌──────────┐      ┌─────────────┐  │
│  │   CPU    │◄────►│   GPU/ANE   │  │
│  │  Cores   │      │   (Metal)   │  │
│  └──────────┘      └─────────────┘  │
└─────────────────────────────────────┘
```
Benefits:
- Zero-copy data transfer between CPU and GPU
- No explicit memory management needed
- Models can use full system memory
Trade-offs:
- Memory pressure affects both CPU and GPU
- Large models reduce available CPU memory
- GPU performance degrades when memory is full
### Memory budget

Reserve headroom for the OS and other processes:
| System RAM | Model Budget | Recommended Max Model Size |
|---|---|---|
| 16 GB | 12 GB | 3-4B parameters (8-bit) |
| 32 GB | 24 GB | 7B parameters (8-bit) |
| 64 GB | 48 GB | 14B parameters (8-bit) |
| 128 GB | 100 GB | 70B parameters (3-bit) |
Running out of memory forces swapping to disk, causing severe slowdowns (100x+ slower). Always leave 4-8 GB free.
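A rough fit check against this budget can be scripted. The headroom and working-memory constants below are assumptions mirroring the guidance above, not measured values:

```rust
// Rough fit check: weights = params × bits/8, plus working memory for the KV
// cache and activations, must stay under (system RAM − OS headroom).
// The 6 GB headroom and 2 GB working set are assumed guidance figures.
fn fits(system_ram_gb: f64, params_b: f64, bits: f64) -> bool {
    let headroom_gb = 6.0; // leave 4-8 GB free for the OS
    let working_gb = 2.0;  // KV cache + activations, rough
    let weights_gb = params_b * bits / 8.0;
    weights_gb + working_gb <= system_ram_gb - headroom_gb
}

fn main() {
    println!("3.5B 8-bit on 16 GB: {}", fits(16.0, 3.5, 8.0)); // fits
    println!("14B  8-bit on 16 GB: {}", fits(16.0, 14.0, 8.0)); // would swap
    println!("7B   8-bit on 32 GB: {}", fits(32.0, 7.0, 8.0));  // fits
}
```

Anything that fails this check will spill into compressed memory or swap, which is where the 100x+ slowdowns come from.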
### Monitor memory usage

```bash
# Real-time free-memory monitoring (vm_stat reports 16 KB pages on Apple Silicon)
while true; do
  echo "$(date): $(vm_stat | grep 'Pages free' | awk '{print $3}' | sed 's/\.//') pages free"
  sleep 1
done

# Or use Activity Monitor (GUI)
```
## Inference speed optimization

### Batch size tuning

Larger batch sizes improve GPU utilization but increase memory use:
```rust
// Batch size 1 (default, interactive)
let input = prompt.index((0, NewAxis)); // [1, seq_len]

// Batch size 4 (offline processing)
let batch = stack(&[prompt1, prompt2, prompt3, prompt4], 0)?; // [4, seq_len]
```
| Batch Size | Memory | Tokens/sec | Step Latency | Use Case |
|---|---|---|---|---|
| 1 | 1.0x | 45 tok/s | 22ms | Interactive chat |
| 4 | 1.3x | 120 tok/s | 33ms | Offline processing |
| 8 | 1.7x | 180 tok/s | 44ms | Batch inference |
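The throughput column follows directly from the step latency: each decode step emits one token per sequence in the batch, so tokens/sec = batch_size / step_latency. A quick check using the latencies from the table:

```rust
// Aggregate throughput of batched decoding: one token per sequence per step.
fn tokens_per_sec(batch: u32, step_latency_ms: f64) -> f64 {
    batch as f64 / (step_latency_ms / 1000.0)
}

fn main() {
    // Latencies taken from the table above; outputs round to its tok/s column.
    for (batch, latency_ms) in [(1u32, 22.0), (4, 33.0), (8, 44.0)] {
        println!("batch {batch}: {:.0} tok/s", tokens_per_sec(batch, latency_ms));
    }
}
```

Note the trade: batch 8 gives 4x the aggregate throughput of batch 1, but each individual sequence sees 2x the per-token latency.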
### Temperature and sampling

Greedy decoding (temp=0) is fastest:
```rust
// Fastest: greedy (argmax)
let generator = Generate::new(&mut model, &mut cache, 0.0, &prompt);

// Slower: sampling with temperature
let generator = Generate::new(&mut model, &mut cache, 0.7, &prompt);

// Slowest: top-p/top-k sampling (not yet implemented)
```
| Method | Relative Speed | Use Case |
|---|---|---|
| Greedy (temp=0) | 1.0x | Factual QA, code completion |
| Temperature sampling | 0.95x | Balanced creativity |
| Top-p/top-k | 0.85x | Creative writing |
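To make the cost difference concrete, here is a minimal sketch of the two strategies over a logits vector. Greedy is a single argmax pass, while temperature sampling needs an extra exp/normalize pass before drawing a token (the random draw itself is omitted):

```rust
// Greedy decoding: one pass, pick the highest logit.
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Temperature sampling: scale logits by 1/T, then softmax (max-subtracted for
// numerical stability). Lower T sharpens the distribution toward the argmax.
fn softmax_with_temp(logits: &[f32], temp: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temp).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [1.0f32, 3.0, 2.0];
    println!("greedy pick: {}", argmax(&logits)); // index 1
    println!("probs @ T=0.7: {:?}", softmax_with_temp(&logits, 0.7));
}
```

The extra exp/normalize pass plus the random draw is small per token, which is why the table shows only a ~5% slowdown.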
### Lazy evaluation

MLX uses lazy evaluation: operations are recorded into a graph, then fused and optimized at execution time:
```rust
use mlx_rs::transforms::{async_eval, eval};

// Operations are queued, not executed
let a = x.multiply(&y)?;
let b = a.add(&z)?;
let c = b.relu()?;

// Execute all at once (fused where possible)
eval(&[&c])?;

// Or async for non-blocking execution
let _ = async_eval(&[&c]);
```
Manual eval() calls are rarely needed. MLX automatically evaluates when you call .item(), .as_slice(), or iterate. Let the framework handle fusion.
## Model-specific optimizations

### Grouped Query Attention (GQA)

Models with GQA use fewer KV heads for better memory efficiency:
```rust
// Standard MHA: as many KV heads as query heads
let q_heads = 32;
let kv_heads = 32; // same as Q
let memory_per_layer = batch * kv_heads * seq_len * head_dim * 4; // 2 bytes (BF16) × 2 (K+V)

// GQA: 8 KV heads shared across 32 query heads (4x reduction)
let q_heads = 32;
let kv_heads = 8; // 4x fewer
let memory_per_layer = batch * kv_heads * seq_len * head_dim * 4; // 4x less memory
```
| Architecture | KV Memory | Speed | Quality |
|---|---|---|---|
| MHA (32 heads) | 1.0x | Baseline | Baseline |
| GQA (8 KV heads) | 0.25x | 1.15x | ~Equal |
Most modern models use GQA: Qwen3, Moxin, GLM4, Mistral.
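Plugging numbers into the per-layer formula above makes the saving concrete; batch 1, a 4096-token context, and head_dim 128 are assumed example values:

```rust
// Per-layer KV cache size: batch × kv_heads × seq_len × head_dim
//   × 2 bytes (BF16) × 2 (K and V).
fn kv_bytes_per_layer(batch: usize, kv_heads: usize, seq_len: usize, head_dim: usize) -> usize {
    batch * kv_heads * seq_len * head_dim * 2 * 2
}

fn main() {
    let (batch, seq_len, head_dim) = (1, 4096, 128);
    let mha = kv_bytes_per_layer(batch, 32, seq_len, head_dim); // 32 KV heads
    let gqa = kv_bytes_per_layer(batch, 8, seq_len, head_dim);  // 8 KV heads
    println!("MHA: {} MB, GQA: {} MB, ratio: {}x", mha >> 20, gqa >> 20, mha / gqa);
}
```

At 4096 tokens that is 64 MB vs 16 MB per layer; multiplied across 30-40 layers, the 4x reduction is what lets GQA models hold much longer contexts in the same memory budget.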
### Mixture of Experts (MoE)

MoE models activate only 2-4 experts per token:
```rust
// 45 total experts, but only top-k=2 are active per token
let mut expert_outputs = Vec::new();
for expert_idx in top_k_indices {
    let expert = &self.experts[expert_idx];
    // Only 2 of 45 experts run → ~22x less expert compute than dense
    expert_outputs.push(expert.forward(x)?);
}
```
From `glm4-moe-mlx/README.md`:

> 45 Experts: Shared experts + routed experts architecture. Top-k routing selects 2-4 experts per token, providing 45x model capacity with ~2x compute cost.
| Model Type | Parameters | Active Params | Memory | Speed |
|---|---|---|---|---|
| Dense 9B | 9B | 9B | 18 GB | 35 tok/s |
| MoE 9B (45 experts) | 45B | 9B | 20 GB | 15-20 tok/s |
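The capacity-versus-compute trade can be sketched with hypothetical sizes; the 1B-shared / 1B-per-expert split below is illustrative, not GLM4-MoE's actual layout:

```rust
// Total vs per-token-active parameters in a sketch MoE layer stack:
// shared params always run; only top_k of the routed experts run per token.
fn params(shared: f64, num_experts: u32, expert_params: f64, top_k: u32) -> (f64, f64) {
    let total = shared + num_experts as f64 * expert_params;
    let active = shared + top_k as f64 * expert_params;
    (total, active)
}

fn main() {
    // Hypothetical: 1B shared + 45 routed experts of 1B each, top-2 routing
    let (total, active) = params(1.0e9, 45, 1.0e9, 2);
    println!("total {:.0}B, active {:.0}B per token", total / 1e9, active / 1e9);
}
```

The catch visible in the table: all 45 experts' weights must still reside in memory, so MoE trades memory footprint for compute, and routing overhead is why its decode speed trails a dense model of equal active size.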
### Vision-Language Models

Optimize image preprocessing:
```rust
// Pre-compute image embeddings once
let dino_features = dino_encoder.forward(&image)?;     // [1, 256, 1024]
let siglip_features = siglip_encoder.forward(&image)?; // [1, 256, 1152]
// `concat`: DINO and SigLIP features joined along the feature dim → [1, 256, 2176]
let visual_tokens = projector.forward(&concat)?;       // [1, 256, 4096]

// Reuse the embeddings for multiple text generations
for prompt in prompts {
    let input_ids = build_input(visual_tokens.clone(), prompt)?;
    let output = decoder.forward(&input_ids, &mut cache)?;
}
```
Vision encoding takes 100-200ms. Reuse embeddings when asking multiple questions about the same image.
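The amortization is easy to quantify. Assuming a 150 ms encode (midpoint of the range above) and a hypothetical 500 ms generation per question:

```rust
// Total latency for N questions about one image, with and without reusing
// the visual embeddings. encode_ms and decode_ms are assumed example values.
fn total_ms(questions: u32, encode_ms: f64, decode_ms: f64, reuse: bool) -> f64 {
    let encodes = if reuse { 1 } else { questions } as f64;
    encodes * encode_ms + questions as f64 * decode_ms
}

fn main() {
    let (encode, decode) = (150.0, 500.0);
    println!("5 questions, re-encode each time: {} ms", total_ms(5, encode, decode, false));
    println!("5 questions, reuse embeddings:    {} ms", total_ms(5, encode, decode, true));
}
```

The saving grows linearly with the number of questions, and the encode cost dominates more as generations get shorter.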
## Benchmarking

### Measure tokens per second
```rust
use std::time::Instant;

let start = Instant::now();
let mut num_tokens = 0;
for token in generator.take(100) {
    let _ = token?;
    num_tokens += 1;
}
let elapsed = start.elapsed().as_secs_f32();
let tok_per_sec = num_tokens as f32 / elapsed;
println!("Speed: {:.1} tok/s", tok_per_sec);
```
### Separate prefill and decode
```rust
let prefill_start = Instant::now();
let first_token = generator.next().unwrap()?; // prefill + first token
let prefill_time = prefill_start.elapsed();

let decode_start = Instant::now();
let remaining: Vec<_> = generator.take(99).collect();
let decode_time = decode_start.elapsed();

println!("Prefill: {}ms", prefill_time.as_millis());
println!("Decode: {:.1} tok/s", 99.0 / decode_time.as_secs_f32());
```
### Profile with Metal System Trace

```bash
# Record a trace
xctrace record --template 'Metal System Trace' \
  --output trace.trace \
  --target-stdout - \
  --launch -- cargo run --release --example generate

# Open in Instruments
open trace.trace
```
Look for:
- GPU utilization (should be > 80%)
- Memory bandwidth (watch for saturation)
- Kernel launch overhead (minimize small kernels)
Measured on an Apple M4 Max (128 GB):

### Language models

| Model | Precision | Memory | Prefill | Decode | Notes |
|---|---|---|---|---|---|
| Qwen3-0.5B | BF16 | 1.2 GB | 15ms | 120 tok/s | Fastest small |
| Qwen3-4B | BF16 | 8 GB | 45ms | 45 tok/s | Balanced |
| Qwen3-4B | 8-bit | 4.5 GB | 50ms | 42 tok/s | Memory-optimized |
| Moxin-7B VLM | 8-bit | 10 GB | 250ms | 30 tok/s | Includes vision |
| GLM4-9B | 4-bit | 6 GB | 55ms | 35 tok/s | MHA baseline |
| GLM4-MoE | 3-bit | 20 GB | 80ms | 15-20 tok/s | Variable (MoE) |
| MiniCPM-SALA-9B | 8-bit | 9.6 GB | 60ms | 28 tok/s | Hybrid attention |
### Audio models

| Model | Precision | Memory | Speed | Notes |
|---|---|---|---|---|
| Paraformer-large | BF16 | 500 MB | 18x RT | Chinese/English |
| Qwen3-ASR-0.6B | 8-bit | 1.0 GB | 50x RT | 30+ languages |
| Qwen3-ASR-1.7B | 8-bit | 2.5 GB | 30x RT | Best accuracy |
| GPT-SoVITS | BF16 | 2 GB | 4x RT | Voice cloning |
### Image models

| Model | Precision | Memory | Speed | Resolution |
|---|---|---|---|---|
| Z-Image | BF16 | 8 GB | ~3s/image | 512×512 |
| FLUX.2-klein | BF16 | 13 GB | ~5s/image | 1024×1024 |
## Hardware comparison

### Apple Silicon generations

| Chip | GPU Cores | Memory BW | Qwen3-4B Speed |
|---|---|---|---|
| M1 | 8 | 68 GB/s | 25 tok/s |
| M1 Max | 32 | 400 GB/s | 38 tok/s |
| M2 | 10 | 100 GB/s | 30 tok/s |
| M2 Max | 38 | 400 GB/s | 42 tok/s |
| M3 Max | 40 | 400 GB/s | 45 tok/s |
| M4 Max | 40 | 410 GB/s | 48 tok/s |
Memory bandwidth is the primary bottleneck for LLM inference. Max/Ultra variants offer 4-6x more bandwidth than base chips.
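This bandwidth bound can be checked with a roofline argument: each decoded token must stream the full weight set through the GPU at least once, so tokens/sec is capped near bandwidth divided by weight bytes. A sketch for 8 GB of BF16 weights on an M4 Max:

```rust
// Roofline upper bound on decode speed: every generated token reads all
// weights once, so tok/s <= memory bandwidth / weight bytes.
fn roofline_tok_per_sec(bandwidth_gb_s: f64, weights_gb: f64) -> f64 {
    bandwidth_gb_s / weights_gb
}

fn main() {
    // 410 GB/s over 8 GB of BF16 weights bounds decode near ~51 tok/s
    println!("bound: {:.1} tok/s", roofline_tok_per_sec(410.0, 8.0));
}
```

The measured 48 tok/s for Qwen3-4B BF16 sits just under this ~51 tok/s ceiling, which is why bandwidth, not compute, dominates single-stream decode, and why quantization speeds up decoding at all.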
### Memory configurations

| Configuration | Recommended Models |
|---|---|
| 16 GB (M1/M2 base) | 0.5-3B models (8-bit) |
| 32 GB (M1/M2 Pro) | 7B models (8-bit) |
| 64 GB (M1/M2 Max) | 14B models (8-bit), 7B (BF16) |
| 128 GB (M3/M4 Max) | 70B models (3-bit), 14B (BF16) |
## Release vs debug builds

Always use `--release` builds for benchmarking:
```bash
# Debug: ~10x slower, large memory overhead
cargo run --example generate

# Release: full optimizations
cargo run --release --example generate
```
| Build | Speed | Binary Size | Use Case |
|---|---|---|---|
| Debug | 0.1x | 50 MB | Development |
| Release | 1.0x | 5 MB | Production |
## Common pitfalls

Avoid:
- Running multiple large models simultaneously (memory pressure)
- Using unwrap() in hot paths (panic-path overhead)
- Excessive .item() calls (each one forces evaluation)
- Small batch sizes on large models (underutilizes the GPU)
- Debug builds for performance testing
Do:
- Use quantized models for production
- Pre-allocate buffers (KVCache with step size)
- Batch operations when possible
- Let MLX handle evaluation (lazy execution)
- Monitor memory usage
- Profile with Metal System Trace
## References