This guide will walk you through running your first inference with OminiX-MLX. We’ll use the Qwen3-4B language model as an example.
Make sure you’ve completed the installation guide before continuing.

Choose your path

  • Language models: text generation with Qwen3
  • Speech recognition: transcribe audio with Paraformer
  • Image generation: generate images with FLUX.2-klein

LLM example: Text generation

Let’s generate text using the Qwen3-4B language model.
Step 1: Download the model

Use the HuggingFace CLI to download a pre-trained model:
huggingface-cli download mlx-community/Qwen3-4B-bf16 \
  --local-dir ./models/Qwen3-4B
The model is approximately 8GB. Download time depends on your internet connection.
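Before building, you can sanity-check the download with a small std-only sketch that looks for the files the loader expects. The exact file set here is an assumption based on the troubleshooting checklist later in this guide, and large models often shard their weights across several .safetensors files, so the check accepts any of them:

```rust
use std::fs;
use std::path::Path;

/// Minimal completeness check for a downloaded model directory.
/// Returns the checklist items that are not satisfied.
/// NOTE: the required file names are an assumption, not OminiX-MLX API.
fn missing_items(file_names: &[&str]) -> Vec<&'static str> {
    let mut missing = Vec::new();
    if !file_names.contains(&"config.json") {
        missing.push("config.json");
    }
    if !file_names.contains(&"tokenizer.json") {
        missing.push("tokenizer.json");
    }
    // Large models often shard weights, so accept any .safetensors file.
    if !file_names.iter().any(|n| n.ends_with(".safetensors")) {
        missing.push("*.safetensors weights");
    }
    missing
}

fn main() {
    let dir = Path::new("./models/Qwen3-4B");
    match fs::read_dir(dir) {
        Err(e) => println!("Could not read {:?}: {}", dir, e),
        Ok(entries) => {
            let names: Vec<String> = entries
                .filter_map(|e| e.ok())
                .map(|e| e.file_name().to_string_lossy().into_owned())
                .collect();
            let refs: Vec<&str> = names.iter().map(String::as_str).collect();
            let missing = missing_items(&refs);
            if missing.is_empty() {
                println!("Model directory looks complete");
            } else {
                println!("Missing: {:?}", missing);
            }
        }
    }
}
```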
Step 2: Build the Qwen3 crate

If you haven’t already, build the Qwen3 crate:
cargo build --release -p qwen3-mlx
Step 3: Run text generation

Generate text with a simple prompt:
cargo run --release -p qwen3-mlx --example generate_qwen3 -- \
  ./models/Qwen3-4B "Explain quantum computing in simple terms:"
You should see output like:
Loading model from: ./models/Qwen3-4B
Model loaded in 2.3s
Prompt (8 tokens): Explain quantum computing in simple terms:
---
Quantum computing is a type of computing that uses quantum mechanics...
---
Generated 100 tokens in 2.2s (45.5 tok/s)
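The throughput figure on the last line is simply tokens generated divided by elapsed wall-clock time. A quick sketch reproducing it (the helper name is ours, not part of the crate):

```rust
/// Decode throughput in tokens per second, as printed by the example.
fn tokens_per_second(tokens: u32, seconds: f64) -> f64 {
    tokens as f64 / seconds
}

fn main() {
    // The sample run above: 100 tokens in 2.2 s.
    println!("{:.1} tok/s", tokens_per_second(100, 2.2)); // 45.5 tok/s
}
```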
Step 4: Try interactive chat (optional)

For a more interactive experience, run the chat example:
cargo run --release -p qwen3-mlx --example chat_qwen3 -- \
  ./models/Qwen3-4B
This opens an interactive REPL where you can have a conversation with the model.

Use in your own code

Here’s how to use Qwen3 in your Rust application:
use qwen3_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::{IndexOp, NewAxis};
use mlx_rs::transforms::eval;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load model and tokenizer
    let mut model = load_model("./models/Qwen3-4B")?;
    let tokenizer = load_tokenizer("./models/Qwen3-4B")?;

    // Tokenize prompt
    let prompt = "Hello, how are you?";
    let encoding = tokenizer.encode(prompt, true)?;
    let prompt_tokens = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

    // Generate tokens
    let mut cache = Vec::new();
    let temperature = 0.7;
    let generator = Generate::<KVCache>::new(
        &mut model,
        &mut cache,
        temperature,
        &prompt_tokens,
    );

    let mut tokens = Vec::new();
    for token in generator.take(100) {
        let token = token?;
        tokens.push(token.clone());

        // Decode and print every 10 tokens
        if tokens.len() % 10 == 0 {
            eval(&tokens)?;
            let slice: Vec<u32> = tokens.drain(..)
                .map(|t| t.item::<u32>())
                .collect();
            let text = tokenizer.decode(&slice, true)?;
            print!("{}", text);
        }
    }

    // Decode any remaining tokens from an incomplete chunk of 10
    if !tokens.is_empty() {
        eval(&tokens)?;
        let slice: Vec<u32> = tokens.drain(..)
            .map(|t| t.item::<u32>())
            .collect();
        print!("{}", tokenizer.decode(&slice, true)?);
    }
    println!();

    Ok(())
}
For better performance on smaller devices, use 4-bit or 8-bit quantized models like Qwen3-4B-8bit-mlx.

ASR example: Speech recognition

Transcribe audio files using the Paraformer ASR model.
Step 1: Download the model

Download a Paraformer model from HuggingFace:
huggingface-cli download moxin-org/paraformer-zh-mlx \
  --local-dir ./models/paraformer
Step 2: Build the FunASR crate

cargo build --release -p funasr-mlx
Step 3: Transcribe audio

Run transcription on a WAV file:
cargo run --release -p funasr-mlx --example transcribe -- \
  ./audio/test.wav ./models/paraformer
Output:
Loading audio: ./audio/test.wav
  44100 samples, 16000 Hz, 2.76s
Loading model from: ./models/paraformer
  8192 tokens loaded
Transcribing...

=== Results ===
Text: 今天天气真不错

Performance:
  Audio duration: 2.76s
  Inference time: 145 ms
  RTF: 0.0525x
  Speed: 19.0x real-time
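The RTF and speed figures are reciprocals of each other: the real-time factor is inference time divided by audio duration, and values below 1.0 mean faster than real time. A sketch reproducing the numbers above (helper names are ours, not part of the crate):

```rust
/// Real-time factor: inference time divided by audio duration.
/// Values below 1.0 mean the model runs faster than real time.
fn rtf(inference_s: f64, audio_s: f64) -> f64 {
    inference_s / audio_s
}

fn main() {
    // Numbers from the sample run above: 145 ms of inference for 2.76 s of audio.
    let r = rtf(0.145, 2.76);
    println!("RTF: {:.4}x", r);                   // 0.0525x
    println!("Speed: {:.1}x real-time", 1.0 / r); // 19.0x
}
```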

Use in your own code

use funasr_mlx::{load_model, transcribe, Vocabulary, parse_cmvn_file};
use funasr_mlx::audio::{load_wav, resample};
use mlx_rs::module::Module;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load audio and resample to 16kHz
    let (samples, sample_rate) = load_wav("audio.wav")?;
    let samples = if sample_rate != 16000 {
        resample(&samples, sample_rate, 16000)
    } else {
        samples
    };

    // Load model
    let mut model = load_model("./models/paraformer/paraformer.safetensors")?;
    model.training_mode(false);

    // Load CMVN normalization
    let (addshift, rescale) = parse_cmvn_file("./models/paraformer/am.mvn")?;
    model.set_cmvn(addshift, rescale);

    // Load vocabulary
    let vocab = Vocabulary::load("./models/paraformer/tokens.txt")?;

    // Transcribe
    let text = transcribe(&mut model, &samples, &vocab)?;
    println!("Transcription: {}", text);

    Ok(())
}
For multilingual support (30+ languages), use the Qwen3-ASR model instead.

Image generation example

Generate images from text prompts using FLUX.2-klein.
Step 1: Download the model

huggingface-cli download black-forest-labs/FLUX.2-klein-4B \
  --local-dir ./models/flux-klein
FLUX.2-klein requires approximately 13GB of RAM. Ensure you have sufficient memory.
Step 2: Build the FLUX crate

cargo build --release -p flux-klein-mlx
Step 3: Generate an image

cargo run --release -p flux-klein-mlx --example generate_klein -- \
  "a beautiful sunset over mountains"
The generated image will be saved as output_klein.ppm.

For faster generation, use 4 denoising steps (the default):
cargo run --release -p flux-klein-mlx --example generate_klein -- \
  --steps 4 "a cat sitting on a couch"
Step 4: Convert to PNG (optional)

Convert the PPM output to PNG:
# Using ImageMagick
convert output_klein.ppm output_klein.png

# Or using Python
python -c "from PIL import Image; Image.open('output_klein.ppm').save('output_klein.png')"
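If you'd rather inspect the output from Rust before converting it, a std-only sketch can parse the binary PPM (P6) header to confirm the image dimensions. This parser is illustrative: it skips `#` comments, which the generated files are assumed not to contain:

```rust
use std::fs;

/// Parse the header of a binary PPM (P6) file: width, height, maxval.
/// The header is four ASCII tokens ("P6", width, height, maxval)
/// separated by whitespace; binary pixel data follows immediately.
fn parse_p6_header(bytes: &[u8]) -> Option<(usize, usize, usize)> {
    let mut tokens: Vec<String> = Vec::new();
    let mut i = 0;
    while tokens.len() < 4 && i < bytes.len() {
        // Skip whitespace between tokens.
        while i < bytes.len() && bytes[i].is_ascii_whitespace() {
            i += 1;
        }
        let start = i;
        while i < bytes.len() && !bytes[i].is_ascii_whitespace() {
            i += 1;
        }
        if start < i {
            tokens.push(std::str::from_utf8(&bytes[start..i]).ok()?.to_string());
        }
    }
    if tokens.len() != 4 || tokens[0] != "P6" {
        return None;
    }
    Some((
        tokens[1].parse().ok()?,
        tokens[2].parse().ok()?,
        tokens[3].parse().ok()?,
    ))
}

fn main() {
    match fs::read("output_klein.ppm") {
        Ok(bytes) => match parse_p6_header(&bytes) {
            Some((w, h, maxval)) => println!("{}x{} image, maxval {}", w, h, maxval),
            None => println!("Not a valid P6 PPM header"),
        },
        Err(e) => println!("Could not read output_klein.ppm: {}", e),
    }
}
```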
Use the --quantize flag to reduce memory usage:
cargo run --release -p flux-klein-mlx --example generate_klein -- \
  --quantize "a beautiful sunset"

Next steps

Explore LLMs

Learn about all supported language models and their features

Vision-language models

Process images and text with Moxin-7B VLM

API server

Deploy OpenAI-compatible API endpoints

Advanced usage

Optimize performance with quantization and other techniques

Common patterns

Quantization for memory efficiency

Reduce memory usage with 8-bit or 4-bit quantization:
use moxin_vlm_mlx::load_model;

let vlm = load_model("./models/Moxin-7B-VLM-hf")?;
let quantized_vlm = vlm.quantize(64, 8)?;  // group size 64, 8 bits per weight
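As a back-of-envelope check on why quantization helps: weight memory scales linearly with bits per weight, so an 8-bit model needs roughly half the footprint of a bf16 one, and a 4-bit model roughly a quarter. A sketch of that arithmetic (it ignores activations, the KV cache, and quantization metadata):

```rust
/// Approximate weight memory in GiB for a model of the given size,
/// ignoring activations, KV cache, and quantization metadata.
fn weight_gib(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * 1e9 * bits_per_weight / 8.0 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // A 4B-parameter model at bf16, 8-bit, and 4-bit precision.
    for bits in [16.0, 8.0, 4.0] {
        println!("4B params @ {:>2} bits ~= {:.1} GiB", bits, weight_gib(4.0, bits));
    }
}
```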

Streaming output

Stream tokens as they’re generated for better UX:
use std::io::Write; // brings flush() into scope

for token in generator {
    let token = token?;
    let token_id = token.item::<u32>();
    
    // Check for EOS
    if token_id == eos_token_id {
        break;
    }
    
    // Decode and print immediately
    let text = tokenizer.decode(&[token_id], true)?;
    print!("{}", text);
    std::io::stdout().flush()?;
}

Batch processing

Process multiple inputs efficiently:
let prompts = vec!["Hello", "How are you?", "Goodbye"];
let encodings: Vec<_> = prompts.iter()
    .map(|p| tokenizer.encode(p, true))
    .collect::<Result<_, _>>()?;

// Process in batch...
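One common prerequisite for batching autoregressive models is padding the tokenized prompts to a common length and building an attention mask. A framework-agnostic sketch over plain Vec<u32> (the pad id and the left-padding convention are assumptions for illustration, not OminiX-MLX API):

```rust
/// Left-pad token sequences to a common length so they can be stacked
/// into one batch. Returns padded ids plus a 0/1 attention mask.
/// `pad_id` is a placeholder; use your tokenizer's actual pad token.
fn pad_batch(seqs: &[Vec<u32>], pad_id: u32) -> (Vec<Vec<u32>>, Vec<Vec<u8>>) {
    let max_len = seqs.iter().map(|s| s.len()).max().unwrap_or(0);
    let mut ids = Vec::new();
    let mut mask = Vec::new();
    for seq in seqs {
        let pad = max_len - seq.len();
        // Left padding keeps the most recent tokens aligned for decoding.
        let mut row = vec![pad_id; pad];
        row.extend_from_slice(seq);
        let mut m = vec![0u8; pad];
        m.extend(std::iter::repeat(1u8).take(seq.len()));
        ids.push(row);
        mask.push(m);
    }
    (ids, mask)
}

fn main() {
    let (ids, mask) = pad_batch(&[vec![1, 2, 3], vec![7]], 0);
    println!("{:?}", ids);  // [[1, 2, 3], [0, 0, 7]]
    println!("{:?}", mask); // [[1, 1, 1], [0, 0, 1]]
}
```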

Troubleshooting

If the HuggingFace download is slow, try:
  1. Using a VPN if you’re in a region with restricted access
  2. Downloading manually from huggingface.co and placing files in the expected directory
  3. Using the HF_HUB_ENABLE_HF_TRANSFER=1 environment variable for faster downloads
If you run out of memory, try these solutions:
  1. Use a smaller model (e.g., Qwen3-0.5B instead of Qwen3-4B)
  2. Use quantized models (8-bit or 4-bit)
  3. Reduce batch size or max_tokens parameter
  4. Close other applications to free up memory
If inference seems slow, first ensure you’re using release builds:
cargo run --release -p qwen3-mlx --example generate_qwen3 -- ...
Debug builds are 10-100x slower than release builds.
If a model fails to load, make sure its path is correct and the directory contains all required files:
  • config.json
  • tokenizer.json or tokenizer.model
  • model.safetensors or pytorch_model.bin
For ASR models, also check for:
  • am.mvn (CMVN normalization)
  • tokens.txt (vocabulary)
