
Overview

OminiX-MLX is a layered Rust ecosystem for ML inference on Apple Silicon. The architecture follows a bottom-up design: the lower layers wrap Apple’s MLX library in progressively safer abstractions, and the higher-level crates implement specific model families.

Architecture diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              User Application                                │
│                    (OminiX-API / Custom Rust Application)                   │
└───────────────────────────────┬─────────────────────────────────────────────┘

        ┌───────────────────────┼───────────────────────────┐
        │                       │                           │
        ▼                       ▼                           ▼
┌───────────────┐         ┌─────────────────┐         ┌─────────────────┐
│  LLM / VLM    │         │   Audio Crates  │         │  Image Crates   │
├───────────────┤         ├─────────────────┤         ├─────────────────┤
│ qwen3-mlx     │         │ funasr-mlx      │         │ flux-klein-mlx  │
│ glm4-mlx      │         │ funasr-nano-mlx │         │ zimage-mlx      │
│ glm4-moe-mlx  │         │ qwen3-asr-mlx   │         │ qwen-image-mlx  │
│ mixtral-mlx   │         │ gpt-sovits-mlx  │         │                 │
│ mistral-mlx   │         │                 │         │                 │
│ moxin-vlm-mlx │         │                 │         │                 │
│ minicpm-sala  │         │                 │         │                 │
└───────┬───────┘         └────────┬────────┘         └────────┬────────┘
        │                          │                           │
        └──────────────────────────┼───────────────────────────┘
                                   │
                                   ▼
                    ┌──────────────────────────┐
                    │       mlx-rs-core        │
                    ├──────────────────────────┤
                    │ • KV Cache Management    │
                    │ • RoPE Embeddings        │
                    │ • Attention (SDPA)       │
                    │ • Audio Processing       │
                    │ • Metal Kernels          │
                    │ • Speculative Decoding   │
                    └────────────┬─────────────┘
                                 │
                                 ▼
                    ┌──────────────────────────┐
                    │         mlx-rs           │
                    ├──────────────────────────┤
                    │ • Safe Rust API          │
                    │ • Array Operations       │
                    │ • Neural Network Layers  │
                    │ • Transforms (eval, jit) │
                    │ • Random/Ops/Indexing    │
                    └────────────┬─────────────┘
                                 │
                                 ▼
                    ┌──────────────────────────┐
                    │         mlx-sys          │
                    ├──────────────────────────┤
                    │ • FFI Bindings (bindgen) │
                    │ • mlx-c Submodule        │
                    └────────────┬─────────────┘
                                 │
                                 ▼
                    ┌──────────────────────────┐
                    │      Apple MLX (C++)     │
                    ├──────────────────────────┤
                    │ • Metal GPU Backend      │
                    │ • Accelerate Framework   │
                    │ • Unified Memory         │
                    │ • Lazy Evaluation        │
                    └──────────────────────────┘

Layer breakdown

Foundation layer (mlx-sys)

The lowest layer provides raw FFI bindings to Apple’s MLX C++ library:
  • Auto-generated bindings: Uses bindgen to generate raw FFI declarations from the mlx-c headers
  • mlx-c submodule: Git submodule tracking the upstream MLX C bindings
  • Zero-cost abstractions: Direct mapping to C functions with no runtime overhead

Core abstraction layer (mlx-rs)

Provides a safe, idiomatic Rust API over mlx-sys:
  • Array operations: N-dimensional arrays with automatic memory management
  • Neural network layers: Linear, convolution, attention, normalization
  • Function transforms: Automatic differentiation (grad), compilation
  • Device management: CPU/GPU device abstraction with unified memory
  • Random operations: Random number generation and distributions
  • Type safety: Compile-time shape and dtype validation where possible
Key modules in mlx-rs/src/:
  • array/mod.rs: Core Array type and operations
  • device.rs: Device abstraction (CPU/GPU) - mlx-rs/src/device.rs:11
  • stream.rs: Execution streams for parallel computation - mlx-rs/src/stream.rs:110
  • ops/: Mathematical and neural network operations
  • transforms/: Function transformations (grad, compile)
  • nn/: High-level neural network layers

Shared infrastructure layer (mlx-rs-core)

Common components shared across model implementations:

KV Cache Management
  • ConcatKeyValueCache: Simple concatenating cache for autoregressive generation
  • KeyValueCache trait: Interface for custom cache implementations
  • Used by all LLM/VLM crates for efficient token generation
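To illustrate the scheme, here is a minimal pure-Rust sketch of a concatenating cache. Flat `Vec<f32>` values stand in for MLX arrays, and `SimpleConcatCache` is a hypothetical name, not the actual `ConcatKeyValueCache` type:

```rust
/// Simplified concatenating KV cache: each decode step appends the new
/// token's keys/values, so earlier tokens are never recomputed.
/// Illustration only -- the real cache holds MLX arrays, not Vec<f32>.
struct SimpleConcatCache {
    keys: Vec<f32>,   // flattened [seq_len, head_dim]
    values: Vec<f32>, // flattened [seq_len, head_dim]
    head_dim: usize,
}

impl SimpleConcatCache {
    fn new(head_dim: usize) -> Self {
        Self { keys: Vec::new(), values: Vec::new(), head_dim }
    }

    /// Append one token's K/V and return the full cached sequences.
    fn update(&mut self, k: &[f32], v: &[f32]) -> (&[f32], &[f32]) {
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        self.keys.extend_from_slice(k);
        self.values.extend_from_slice(v);
        (&self.keys, &self.values)
    }

    /// Number of tokens currently cached.
    fn seq_len(&self) -> usize {
        self.keys.len() / self.head_dim
    }
}

fn main() {
    let mut cache = SimpleConcatCache::new(2);
    cache.update(&[1.0, 2.0], &[0.1, 0.2]);
    let (keys, _values) = cache.update(&[3.0, 4.0], &[0.3, 0.4]);
    // Attention for the new token now sees keys for both cached tokens.
    println!("tokens cached: {}, keys: {:?}", keys.len() / 2, keys);
}
```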
Attention Utilities
  • scaled_dot_product_attention(): Optimized SDPA with mask support
  • create_attention_mask(): Causal and sliding window mask generation
  • initialize_rope(): RoPE embeddings with scaling configurations
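As a sketch of what causal masking does, the following plain-Rust helper builds the additive mask applied to attention scores before the softmax (`causal_mask` is a hypothetical helper, not the actual `create_attention_mask()` signature):

```rust
/// Build a [seq_len, seq_len] causal mask: 0.0 where attention is allowed
/// (key position <= query position), -inf where it is not. Adding this to
/// the raw attention scores zeroes the disallowed positions after softmax.
fn causal_mask(seq_len: usize) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|q| {
            (0..seq_len)
                .map(|k| if k <= q { 0.0 } else { f32::NEG_INFINITY })
                .collect()
        })
        .collect()
}

fn main() {
    let mask = causal_mask(3);
    // Row 0 may only attend to position 0:
    println!("{:?}", mask[0]); // [0.0, -inf, -inf]
}
```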
Audio Processing
  • WAV I/O: Load/save 16/24/32-bit PCM audio
  • Resampling: High-quality sinc interpolation
  • Mel spectrograms: STFT-based feature extraction
  • HuBERT preprocessing: Specialized audio normalization
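For intuition about the resampling step, here is a linear-interpolation sketch of the rate conversion. The real resampler uses higher-quality sinc interpolation; `resample_linear` is a simplified stand-in:

```rust
/// Resample mono audio to a new rate with linear interpolation between
/// neighbouring samples. (The real resampler uses sinc interpolation;
/// this only illustrates the rate-conversion arithmetic.)
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;     // fractional source position
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = *input.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac              // interpolate between neighbours
        })
        .collect()
}

fn main() {
    // Downsample a 48 kHz ramp to 16 kHz: every third sample survives.
    let audio: Vec<f32> = (0..12).map(|i| i as f32).collect();
    println!("{:?}", resample_linear(&audio, 48_000, 16_000)); // [0.0, 3.0, 6.0, 9.0]
}
```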
Metal Kernels
  • fused_swiglu(): Fused SwiGLU activation (45x faster for MoE models)
  • Custom Metal shaders for specialized operations
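The work the fused kernel performs is simple element-wise math; fusing it avoids materializing intermediate tensors. A scalar sketch of the computation being fused (not the Metal kernel itself):

```rust
/// SwiGLU: silu(gate) * x, where silu(v) = v * sigmoid(v). The Metal
/// kernel fuses these element-wise steps into a single pass over the
/// data; this scalar version only shows the math being fused.
fn swiglu(x: &[f32], gate: &[f32]) -> Vec<f32> {
    x.iter()
        .zip(gate)
        .map(|(&xi, &gi)| {
            let silu = gi / (1.0 + (-gi).exp()); // == gi * sigmoid(gi)
            xi * silu
        })
        .collect()
}

fn main() {
    // A zero gate shuts the channel off completely: silu(0) = 0.
    println!("{:?}", swiglu(&[1.0, 2.0], &[0.0, 0.0])); // [0.0, 0.0]
}
```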

Model implementation layer

Model-specific crates implementing complete inference pipelines:

LLM/VLM Crates (qwen3-mlx, glm4-mlx, mixtral-mlx, etc.)
  • Model architecture definitions
  • Weight loading from safetensors/HuggingFace
  • Tokenizer integration
  • Generation loops with KV caching
  • Quantization support (4-bit, 8-bit)
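As a sketch of how affine 4-bit quantization works (the crates themselves load weights that were already quantized by MLX tooling; `quantize_4bit` is illustrative only):

```rust
/// Affine 4-bit quantization of one weight group: store a per-group
/// scale and minimum, and map each weight to an integer in 0..=15.
/// Sketch of the scheme only -- not the production weight loader.
fn quantize_4bit(group: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = group.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = group.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a constant group (max == min).
    let scale = ((max - min) / 15.0).max(f32::MIN_POSITIVE);
    let q = group
        .iter()
        .map(|&w| (((w - min) / scale).round() as u8).min(15))
        .collect();
    (q, scale, min)
}

fn dequantize_4bit(q: &[u8], scale: f32, min: f32) -> Vec<f32> {
    q.iter().map(|&qi| qi as f32 * scale + min).collect()
}

fn main() {
    let weights = [-1.0_f32, -0.5, 0.0, 0.5, 1.0];
    let (q, scale, min) = quantize_4bit(&weights);
    let restored = dequantize_4bit(&q, scale, min);
    // Each restored weight is within one quantization step of the original.
    println!("{:?} -> {:?}", q, restored);
}
```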
Audio Crates (funasr-mlx, qwen3-asr-mlx, gpt-sovits-mlx)
  • Audio frontend processing (mel spectrograms, STFT)
  • Encoder/decoder architectures (Paraformer, Whisper-style)
  • Vocabulary management
  • Real-time streaming support
Image Crates (flux-klein-mlx, zimage-mlx)
  • VAE encoders/decoders
  • Diffusion transformers (DiT, MMDiT)
  • Text encoder integration
  • Latent space manipulation

Application layer

User-facing applications and APIs:
  • ominix-api: Unified HTTP server with OpenAI-compatible endpoints
  • Custom applications: User code directly importing model crates
  • Example binaries: Reference implementations in each crate’s examples/ directory

Crate structure

OminiX-MLX/
├── mlx-rs/              # Core MLX Rust bindings
├── mlx-rs-core/         # Shared inference infrastructure
│
├── qwen3-mlx/           # Qwen2, Qwen3, Qwen3-MoE
├── glm4-mlx/            # GLM4
├── glm4-moe-mlx/        # GLM4-MoE (45 experts)
├── mixtral-mlx/         # Mixtral 8x7B/8x22B
├── mistral-mlx/         # Mistral 7B
├── moxin-vlm-mlx/       # Moxin-7B VLM (vision-language)
├── MiniCPM-SALA-MLX/    # MiniCPM-SALA 9B (hybrid attention, 1M context)
│
├── gpt-sovits-mlx/      # GPT-SoVITS voice cloning
├── funasr-mlx/          # FunASR Paraformer ASR
├── funasr-nano-mlx/     # FunASR-Nano (SenseVoice + Qwen)
├── qwen3-asr-mlx/       # Qwen3-ASR (30+ languages, 0.6B/1.7B)
│
├── ominix-api/          # Unified OpenAI-compatible API server
│
├── flux-klein-mlx/      # FLUX.2-klein image generation
├── zimage-mlx/          # Z-Image generation
└── qwen-image-mlx/      # Qwen image generation

Data flow patterns

LLM inference pipeline

┌─────────────────┐
│   Input Text    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Tokenizer     │  Convert text to token IDs
│  (tokenizers)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Embedding     │  token_ids → hidden_states [batch, seq_len, hidden_dim]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Transformer    │  Apply attention + MLP layers
│   Layers (N)    │  • Self-attention with KV cache
│                 │  • RoPE position embeddings
│                 │  • RMSNorm/LayerNorm
│                 │  • MLP with SwiGLU/GELU
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   LM Head       │  hidden → logits [batch, seq_len, vocab_size]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Sampling      │  logits → next_token (argmax/temperature)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Detokenizer    │  token_id → text
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Output Text    │
└─────────────────┘
Key optimization: KV cache stores past key/value tensors to avoid recomputing attention for previous tokens. Only the new token’s keys/values are computed each step.
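The sampling stage above reduces logits to a single token id. A minimal sketch of the two strategies mentioned (argmax and temperature), operating on plain slices rather than MLX arrays:

```rust
/// Greedy (argmax) sampling: pick the highest-logit token.
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Temperature scaling: divide logits by T before softmax. T < 1 sharpens
/// the distribution toward the argmax; T > 1 flattens it.
fn softmax_with_temperature(logits: &[f32], temp: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temp).collect();
    // Subtract the max for numerical stability before exponentiating.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let logits = [1.0, 3.0, 2.0];
    println!("next token: {}", argmax(&logits)); // next token: 1
    let probs = softmax_with_temperature(&logits, 0.5);
    println!("probs: {:?}", probs); // heavily weighted toward token 1
}
```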

ASR inference pipeline

┌─────────────────┐
│   Audio File    │  (WAV/MP3)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Audio Frontend │  
│                 │  • Load and decode audio
│                 │  • Resample to 16kHz
│                 │  • Extract mel spectrogram / STFT features
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Encoder       │  Features → encoder hidden states
│  (Paraformer/   │  • Conformer/Transformer blocks
│   SAN-M)        │  • Temporal convolution
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Decoder       │  Hidden states → token probabilities
│  (CTC / CIF)    │  • CTC: Frame-level prediction
│                 │  • CIF: Continuous integrate-and-fire
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vocabulary     │  Token IDs → characters/words
│   Mapping       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Transcript     │  Final text output
└─────────────────┘
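The CTC path in the decoder stage can be sketched with a greedy decoder: take the best token per frame, collapse consecutive repeats, and drop blanks (blank id assumed to be 0 here for illustration):

```rust
/// Greedy CTC decoding: take the best token per frame, collapse
/// consecutive repeats, then remove blank symbols.
fn ctc_greedy_decode(frame_ids: &[usize], blank: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev = None;
    for &id in frame_ids {
        // Emit only when the token changes and is not the blank.
        if Some(id) != prev && id != blank {
            out.push(id);
        }
        prev = Some(id);
    }
    out
}

fn main() {
    // Per-frame argmax ids; 0 is the blank. A blank between the two
    // runs of 7 lets the same token be emitted twice.
    let frames = [5, 5, 0, 2, 0, 7, 7, 0, 7, 9];
    println!("{:?}", ctc_greedy_decode(&frames, 0)); // [5, 2, 7, 7, 9]
}
```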

Image generation pipeline

┌─────────────────┐
│  Text Prompt    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Text Encoder   │  text → embeddings
│  (CLIP/T5/Qwen) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Random Noise   │  latent_shape ~ N(0, 1)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Diffusion      │  Iterative denoising
│  Transformer    │  • DiT/MMDiT architecture
│  (DiT/MMDiT)    │  • Attention with text conditioning
│                 │  • N denoising steps
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  VAE Decoder    │  latent → pixel space
│                 │  [batch, latent_dim, h/8, w/8] → [batch, 3, h, w]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  PNG/JPEG       │  tensor → image file
│  Encoding       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Output Image   │
└─────────────────┘
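The denoising stage above is an iterative loop. A toy sketch of its shape, where the closure stands in for the diffusion transformer and the linear schedule is illustrative, not the actual flow-matching/DDPM schedule:

```rust
/// Skeleton of the iterative denoising loop: start from noise and move
/// the latent toward the model's prediction over N steps. `predict`
/// stands in for the diffusion transformer; the linear schedule is a
/// toy choice so the sketch stays runnable.
fn denoise<F>(mut latent: Vec<f32>, steps: usize, predict: F) -> Vec<f32>
where
    F: Fn(&[f32], usize) -> Vec<f32>,
{
    for step in 0..steps {
        let target = predict(&latent, step);
        let alpha = 1.0 / (steps - step) as f32; // final step: alpha = 1
        for (l, t) in latent.iter_mut().zip(&target) {
            *l += alpha * (t - *l); // move latent toward the prediction
        }
    }
    latent
}

fn main() {
    // Toy "model" that always predicts the all-ones latent.
    let out = denoise(vec![0.0; 4], 8, |_, _| vec![1.0; 4]);
    println!("{:?}", out); // approaches [1.0, 1.0, 1.0, 1.0]
}
```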

Memory management

OminiX-MLX leverages Rust’s ownership system combined with MLX’s unified memory:
  • Reference counting: MLX arrays use internal reference counting; Rust’s Drop decrements the count
  • Zero-copy operations: Arrays can be shared between CPU and GPU without copying
  • Lazy materialization: Array data only allocated when eval() is called
  • Automatic cleanup: When Rust values go out of scope, MLX memory is freed
The Array type is thread-safe (Send) and uses MLX’s internal reference counting, similar to Arc<T> but managed by MLX.

Parallelism and streams

MLX uses streams to enable parallel execution:
  • Default stream: Operations use StreamOrDevice::default() which maps to GPU by default
  • Explicit streams: Create separate streams for parallel computation
  • No data races: MLX handles synchronization between operations on different streams
  • Device specification: Operations can target CPU or GPU via the stream parameter
Example parallel execution:
use mlx_rs::{StreamOrDevice, ops};

// Operations on different streams execute in parallel
let a = ops::add(&x, &y, StreamOrDevice::cpu())?;
let b = ops::mul(&x, &y, StreamOrDevice::gpu())?;

Build system

The project uses Cargo workspaces for efficient builds:
  • Workspace root: Top-level Cargo.toml defines all member crates
  • Shared dependencies: Common dependencies specified once at workspace level
  • Incremental compilation: Changing one model crate only rebuilds that crate
  • Feature flags: metal and accelerate features control MLX backend
# Build all crates
cargo build --release

# Build specific model crate
cargo build --release -p qwen3-mlx

# Build with specific features
cargo build --release --features metal,accelerate

Design principles

  • Modularity: Each model family is a separate crate with minimal dependencies
  • Type safety: Leverage Rust’s type system to catch errors at compile time
  • Zero-cost abstractions: Rust wrappers add no runtime overhead over raw MLX
  • Ergonomic APIs: Provide convenient builders, macros, and method chaining
  • Pure Rust inference: No Python runtime required; models run standalone
  • Production-ready: Focus on reliability, error handling, and performance

Next steps

  • MLX framework: Learn about the MLX framework and Metal acceleration
  • Unified memory: Understand Apple Silicon’s unified memory architecture
  • Lazy evaluation: Explore lazy evaluation and compute graph optimization
  • Core API: Browse the mlx-rs core API reference
