OminiX-MLX provides production-ready implementations of popular large language models optimized for Apple Silicon. All models use Metal GPU acceleration through the MLX framework for fast, memory-efficient inference.

Supported model families

Qwen3

0.6B to 32B parameters. Fast inference with 4-bit quantization support. Best for general-purpose text generation.
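The 4-bit support stores weights as 4-bit integers with a per-group scale and bias. A minimal sketch of affine group quantization (the group size and layout here are illustrative, not OminiX-MLX's exact on-disk format):

```rust
// Quantize a group of f32 weights to 16 levels (4 bits) with an affine
// scale/bias, then reconstruct them. Illustrative only: MLX uses a similar
// per-group affine scheme, but the exact bit packing differs.
fn quantize_group(w: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = w.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = w.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 15.0; // 4 bits -> 16 levels (assumes max > min)
    let q = w.iter().map(|&x| ((x - min) / scale).round() as u8).collect();
    (q, scale, min)
}

fn dequantize_group(q: &[u8], scale: f32, bias: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale + bias).collect()
}

fn main() {
    let w = [0.10, -0.32, 0.57, 0.01, -0.44, 0.23, 0.9, -0.8];
    let (q, scale, bias) = quantize_group(&w);
    let w2 = dequantize_group(&q, scale, bias);
    // Round-trip error is bounded by half a quantization step.
    for (a, b) in w.iter().zip(&w2) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    println!("quantization step = {scale}");
}
```

The memory win comes from storing one scale/bias pair per group instead of per weight, which is why quantized model size is close to half a byte per parameter.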

GLM-4

9B-parameter models with a distinctive architecture: partial RoPE and a fused MLP for efficiency.
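Partial RoPE means rotary position embeddings are applied to only a fraction of each attention head's dimensions, leaving the rest position-independent. A hedged sketch of the idea (the dimension split and base are illustrative, not GLM-4's exact values):

```rust
// Apply rotary position embedding to only the first `rot_dims` entries of a
// head vector, leaving the remaining dims untouched -- the "partial RoPE"
// idea. Pairs (x[2i], x[2i+1]) are rotated by a position-dependent angle.
fn partial_rope(x: &mut [f32], pos: usize, rot_dims: usize, base: f32) {
    for i in 0..rot_dims / 2 {
        let theta = pos as f32 / base.powf(2.0 * i as f32 / rot_dims as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut head = vec![1.0f32; 8];
    partial_rope(&mut head, 3, 4, 10_000.0); // rotate only the first 4 of 8 dims
    // Dimensions beyond rot_dims carry no positional signal and are unchanged.
    assert!(head[4..].iter().all(|&v| v == 1.0));
    // Rotation preserves the norm of each rotated pair.
    let n = (head[0] * head[0] + head[1] * head[1]).sqrt();
    assert!((n - (2.0f32).sqrt()).abs() < 1e-5);
    println!("{head:?}");
}
```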

Mixtral

8x7B and 8x22B MoE models. Custom Metal kernels for 10-12x faster expert dispatch.
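Expert dispatch in a Mixtral-style MoE layer picks the top-k router logits per token and softmax-normalizes only those. The custom Metal kernels fuse this on the GPU; the sketch below shows just the routing logic (k = 2 matches Mixtral, the logit values are made up):

```rust
// Top-k expert routing: select the k experts with the highest router logits
// and softmax-normalize over just that subset. Returns (expert_index, weight)
// pairs whose weights sum to 1.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(k);
    // Softmax over the selected experts only (subtract max for stability).
    let max = logits[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps.into_iter().map(|e| e / sum)).collect()
}

fn main() {
    // 8 experts as in Mixtral-8x7B; each token activates only 2 of them.
    let routed = top_k_route(&[0.1, 2.0, -1.0, 1.5, 0.3, 0.0, 0.2, 0.9], 2);
    assert_eq!(routed[0].0, 1); // highest logit wins
    assert_eq!(routed[1].0, 3);
    let total: f32 = routed.iter().map(|(_, w)| w).sum();
    assert!((total - 1.0).abs() < 1e-6);
    println!("{routed:?}");
}
```

Because only k of the 8 expert MLPs run per token, compute scales with active parameters rather than total parameters.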

Mistral

7B models with sliding window attention. Efficient long-context processing with GQA.
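Sliding-window attention lets token i attend only to the most recent `window` tokens, so per-token attention cost stays bounded as the context grows. A toy mask construction (window of 3 for readability; Mistral uses 4096):

```rust
// Sliding-window causal mask: token i may attend to token j only when
// j <= i (causal) and i - j < window (recency). Each row therefore allows
// at most `window` positions regardless of sequence length.
fn sw_mask(seq_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| (0..seq_len).map(|j| j <= i && i - j < window).collect())
        .collect()
}

fn main() {
    let m = sw_mask(6, 3);
    // No row attends to more than `window` positions.
    assert!(m.iter().all(|row| row.iter().filter(|&&b| b).count() <= 3));
    // Token 5 sees tokens 3, 4, 5 but not token 2.
    assert!(m[5][3] && m[5][5] && !m[5][2]);
    println!("row 5: {:?}", m[5]);
}
```

The same bound applies to the KV cache: entries older than the window can be evicted, which is why long contexts stay memory-efficient.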

MiniCPM-SALA

9B hybrid attention model. Million-token context with lightning attention.
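Lightning attention is a linear-attention variant: instead of a KV cache that grows with context length, each key/value pair is folded into a fixed d x d state, which is what makes million-token contexts tractable. A bare-bones sketch of the underlying recurrence (the blocking and decay terms that lightning attention adds are omitted):

```rust
// One step of a linear-attention recurrence: accumulate S += k v^T into a
// fixed d x d state, then answer the query as q^T S. State size is constant
// in sequence length, unlike a softmax-attention KV cache.
fn linear_attn_step(
    state: &mut Vec<Vec<f32>>,
    k: &[f32],
    v: &[f32],
    q: &[f32],
) -> Vec<f32> {
    let d = k.len();
    for i in 0..d {
        for j in 0..d {
            state[i][j] += k[i] * v[j];
        }
    }
    (0..d).map(|j| (0..d).map(|i| q[i] * state[i][j]).sum()).collect()
}

fn main() {
    let d = 4;
    let mut state = vec![vec![0.0f32; d]; d];
    let k = [1.0, 0.0, 0.0, 0.0];
    let v = [0.5, 0.5, 0.0, 0.0];
    let out = linear_attn_step(&mut state, &k, &v, &k);
    // With a one-hot key, querying with the same vector recovers v exactly.
    assert_eq!(out, vec![0.5, 0.5, 0.0, 0.0]);
    // The state never grows, no matter how many tokens are processed.
    assert_eq!(state.len() * state[0].len(), d * d);
    println!("{out:?}");
}
```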

Common features

All language model implementations share these capabilities:
  • Metal GPU acceleration: Native Apple Silicon optimization with MLX framework
  • Quantization support: 4-bit and 8-bit quantized models for reduced memory usage
  • KV cache: Step-based key-value caching for efficient autoregressive generation
  • Streaming generation: Token-by-token output for interactive applications
  • Tokenizer integration: HuggingFace tokenizer support with chat templates
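The step-based KV cache grows its buffers in fixed-size increments, so a reallocation happens only once every `step` tokens rather than on every decode step. A minimal sketch of the bookkeeping (the step size of 256 is an assumption, not necessarily this library's default):

```rust
// Step-based KV cache growth: capacity is rounded up to the next multiple
// of `step`, so the backing buffers are reallocated at most once per step,
// not once per generated token.
struct StepCache {
    step: usize,
    len: usize,      // tokens cached so far
    capacity: usize, // allocated slots
}

impl StepCache {
    fn new(step: usize) -> Self {
        Self { step, len: 0, capacity: 0 }
    }

    /// Record `n_tokens` appended; returns true when a (re)allocation occurred.
    fn append(&mut self, n_tokens: usize) -> bool {
        self.len += n_tokens;
        if self.len > self.capacity {
            // Round the new capacity up to the next step boundary.
            self.capacity = (self.len + self.step - 1) / self.step * self.step;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cache = StepCache::new(256);
    let grew = cache.append(300); // prompt prefill
    assert!(grew && cache.capacity == 512);
    // The next 212 single-token decode steps reuse the same buffer.
    let reallocs = (0..212).filter(|_| cache.append(1)).count();
    assert_eq!(reallocs, 0);
    println!("capacity = {}", cache.capacity);
}
```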

Unified API

All models follow a consistent Rust API pattern:
use qwen3_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::NewAxis;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load model and tokenizer
    let tokenizer = load_tokenizer("path/to/model")?;
    let mut model = load_model("path/to/model")?;

    // Tokenize prompt
    let encoding = tokenizer.encode("Hello, I am", true)?;
    let prompt = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

    // Generate tokens
    let mut cache = Vec::new();
    let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.7, &prompt);

    // Stream tokens as they are generated
    use std::io::Write;
    for token in generator.take(100) {
        let token = token?;
        let text = tokenizer.decode(&[token.item::<u32>()], true)?;
        print!("{}", text);
        std::io::stdout().flush()?; // print! is buffered; flush each token
    }

    Ok(())
}

Performance comparison

Benchmarks on Apple M3 Max (40-core GPU):
Model                     Size      Prefill       Decode      Memory
Qwen3-4B (4-bit)          3 GB      250 tok/s     75 tok/s    3 GB
GLM-4-9B (4-bit)          6 GB      ~200 tok/s    ~50 tok/s   6 GB
Mixtral-8x7B (4-bit)      26 GB     80 tok/s      25 tok/s    26 GB
Mistral-7B (4-bit)        4 GB      ~220 tok/s    55 tok/s    4 GB
MiniCPM-SALA-9B (8-bit)   9.6 GB    443 tok/s     28 tok/s    9.6 GB
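As a sanity check on the Memory column: a quantized model needs roughly bits-per-weight / 8 bytes per parameter, plus overhead for group scales, higher-precision embeddings, and runtime buffers. A rough estimate (the 12% overhead factor is an assumption, not a measured OminiX-MLX number):

```rust
// Back-of-envelope weight memory for a quantized model: (bits / 8) bytes per
// parameter, inflated by ~12% for per-group scales, embeddings kept in higher
// precision, and runtime buffers. The overhead factor is a rough assumption.
fn approx_gb(params_billions: f64, bits: f64) -> f64 {
    params_billions * (bits / 8.0) * 1.12
}

fn main() {
    // Mixtral-8x7B has ~47B total parameters; at 4-bit this lands near the
    // 26 GB shown in the benchmark table.
    let mixtral = approx_gb(47.0, 4.0);
    assert!(mixtral > 24.0 && mixtral < 28.0);
    // MiniCPM-SALA-9B at 8-bit lands near the 9.6 GB figure.
    let minicpm = approx_gb(9.0, 8.0);
    assert!(minicpm > 9.0 && minicpm < 11.0);
    println!("mixtral ~ {mixtral:.1} GB, minicpm ~ {minicpm:.1} GB");
}
```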

Model selection guide

For interactive chat

  • Qwen3-4B (4-bit): Best balance of speed and quality for general chat
  • Mistral-7B (4-bit): Strong instruction following with sliding window attention

For long context

  • MiniCPM-SALA-9B: Million-token context capability with hybrid attention
  • Mistral-7B: 4096-token sliding window for efficient long sequences

For maximum quality

  • Mixtral-8x7B: 47B total parameters (roughly 13B active per token) with expert routing
  • Qwen3-32B: Largest dense model (requires 64GB+ memory)

For memory-constrained systems

  • Qwen3-0.6B: Smallest model at 1.2 GB
  • Qwen3-1.7B: Good quality with only 3.4 GB memory

Next steps

  • Download models: Get pre-converted MLX models from the HuggingFace Hub
  • API reference: Detailed API documentation for all model implementations
