Speech recognition overview

OminiX-MLX provides three state-of-the-art speech recognition models optimized for Apple Silicon. Through Metal GPU acceleration they transcribe at roughly 10x to 75x real-time speed, depending on the model.

Available models

Qwen3-ASR

Multilingual ASR supporting 30+ languages with 30-50x real-time speed

Paraformer

Non-autoregressive Chinese ASR with 18x+ real-time speed

FunASR-Nano

LLM-based 800M-parameter model supporting 31 languages

Performance comparison

Model	Languages	Speed	Architecture	Parameters
Qwen3-ASR-1.7B	30+ languages	30x RT	Encoder-decoder	1.7B
Qwen3-ASR-0.6B	30+ languages	22x RT	Encoder-decoder	0.6B
Paraformer	Chinese	18-75x RT	Non-autoregressive	220M
FunASR-Nano	31 languages	~10x RT	LLM-based	800M

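Speeds were measured on Apple M3/M4 series chips. RT = real time: 30x RT means audio is transcribed 30 times faster than it plays back, so 60 seconds of audio takes about 2 seconds.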
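
To make the speed column concrete, here is a minimal sketch of the arithmetic behind an "Nx RT" figure. The helper names `speed_multiple` and `estimated_processing` are illustrative and are not part of OminiX-MLX.

```rust
use std::time::Duration;

/// How many times faster than real time a transcription ran.
/// Hypothetical helper for illustration; not an OminiX-MLX API.
fn speed_multiple(audio: Duration, processing: Duration) -> f64 {
    audio.as_secs_f64() / processing.as_secs_f64()
}

/// Estimated wall-clock time to transcribe `audio` at a given speed multiple.
fn estimated_processing(audio: Duration, multiple: f64) -> Duration {
    Duration::from_secs_f64(audio.as_secs_f64() / multiple)
}
```

For example, `estimated_processing(Duration::from_secs(3600), 30.0)` gives two minutes for an hour of audio at 30x RT.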

Key features

Pure Rust implementation

All models are implemented in Rust with zero Python dependencies at runtime:

Native Metal GPU acceleration via MLX
Efficient memory management
Cross-platform binary distribution
Direct integration into Rust applications

Optimized for Apple Silicon

Metal GPU acceleration for neural network operations
Accelerate framework for audio processing (FFT, resampling)
8-bit quantization support for reduced memory usage
Efficient batch processing for long-form audio
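
As a rough illustration of why 8-bit quantization reduces memory usage, the sketch below estimates weight storage from a parameter count and a bit width. It is back-of-the-envelope arithmetic under stated assumptions: it counts weights only and ignores quantization scales, activations, and caches. `weight_gigabytes` is a hypothetical helper, not a library function.

```rust
/// Approximate weight storage in gigabytes (1 GB = 10^9 bytes).
/// ASSUMPTION: weights only; quantization scales/biases, activations
/// and decoder caches are not included.
fn weight_gigabytes(params: u64, bits_per_weight: u32) -> f64 {
    params as f64 * bits_per_weight as f64 / 8.0 / 1e9
}
```

Under these assumptions a 1.7B-parameter model needs about 3.4 GB at 16 bits per weight and about 1.7 GB at 8 bits.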

Production-ready API

Unified API server provides OpenAI-compatible endpoints:

# Start API server
cargo run --release -p ominix-api -- \
    --asr-model ~/.OminiX/models/qwen3-asr-1.7b --port 8080

# Transcribe audio (OpenAI Whisper-compatible)
curl http://localhost:8080/v1/audio/transcriptions \
    -F file=@audio.wav -F language=English
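
For callers that want to issue the same request from Rust, the following sketch builds the multipart/form-data body that the `-F` flags in the curl call produce. It uses only the standard library and does not send anything. The function name is hypothetical, and the `file` field name follows the OpenAI transcription API convention, which is an assumption about this server.

```rust
/// Builds a multipart/form-data body with an audio part and a language field.
/// Hypothetical helper; send the result with any HTTP client, using the header
/// `Content-Type: multipart/form-data; boundary=<boundary>`.
fn build_transcription_body(
    boundary: &str,
    filename: &str,
    audio: &[u8],
    language: &str,
) -> Vec<u8> {
    let mut body = Vec::new();

    // Part 1: the audio file. ASSUMPTION: the field is named `file`.
    body.extend_from_slice(format!("--{boundary}\r\n").as_bytes());
    body.extend_from_slice(
        format!("Content-Disposition: form-data; name=\"file\"; filename=\"{filename}\"\r\n")
            .as_bytes(),
    );
    body.extend_from_slice(b"Content-Type: application/octet-stream\r\n\r\n");
    body.extend_from_slice(audio);
    body.extend_from_slice(b"\r\n");

    // Part 2: the language field, as in `-F language=English`.
    body.extend_from_slice(format!("--{boundary}\r\n").as_bytes());
    body.extend_from_slice(b"Content-Disposition: form-data; name=\"language\"\r\n\r\n");
    body.extend_from_slice(language.as_bytes());
    body.extend_from_slice(b"\r\n");

    // Closing delimiter ends the form.
    body.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
    body
}
```

The boundary must be a string that does not occur in the audio bytes; HTTP client libraries normally generate one for you.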

Architecture overview

Qwen3-ASR architecture

Audio (16kHz) → 128-mel Spectrogram → Conv2d×3 (8× downsample)
             → Transformer Encoder → Linear Projector → Qwen3 Decoder → Text
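
The effect of the 8× downsampling on sequence length can be sketched with a little arithmetic. The source does not state the spectrogram hop length, so the code below assumes a 10 ms hop (160 samples at 16 kHz), as used by Whisper-style front ends, and assumes each of the three convolutions halves the time axis, rounding up. Both the constants and the function names are illustrative.

```rust
/// ASSUMPTION: 10 ms hop at 16 kHz; not confirmed by the source.
const HOP_LENGTH: usize = 160;

/// Number of mel frames produced for a given number of 16 kHz samples.
fn mel_frames(num_samples: usize) -> usize {
    num_samples / HOP_LENGTH
}

/// Sequence length after three stride-2 stages (8x downsampling overall).
/// ASSUMPTION: each stage rounds up, as a padded stride-2 convolution does.
fn encoder_positions(mel_frames: usize) -> usize {
    (0..3).fold(mel_frames, |n, _| (n + 1) / 2)
}
```

Under these assumptions a 30-second chunk (480,000 samples) becomes 3,000 mel frames and 375 encoder positions.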

Paraformer architecture

Audio (16kHz) → 80-mel Spectrogram → LFR 7/6
             → SAN-M Encoder (50 layers) → CIF Predictor
             → Bidirectional Decoder → Tokens (parallel)
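
The "LFR 7/6" step (low frame rate) stacks 7 consecutive feature frames and advances 6 frames at a time, so the encoder sees one 560-dimensional vector for every six 80-dimensional mel frames. The sketch below shows one common formulation that pads the edges by repeating the first and last frame. The padding details used by OminiX-MLX are not stated in the source, so treat them as assumptions; `apply_lfr` is a hypothetical name.

```rust
/// Low-frame-rate stacking: concatenate `m` frames, advance by `n` frames.
/// ASSUMPTION: the sequence is left-padded with (m - 1) / 2 copies of the
/// first frame and right-padded with copies of the last frame.
fn apply_lfr(frames: &[Vec<f32>], m: usize, n: usize) -> Vec<Vec<f32>> {
    assert!(m >= 1 && n >= 1, "m and n must be positive");
    if frames.is_empty() {
        return Vec::new();
    }
    let t = frames.len();
    let t_lfr = (t + n - 1) / n; // ceil(t / n) output frames
    let left = (m - 1) / 2;
    let last = t - 1;

    let mut out = Vec::with_capacity(t_lfr);
    for i in 0..t_lfr {
        let mut stacked = Vec::with_capacity(m * frames[0].len());
        for j in 0..m {
            // Position in the virtually padded sequence, mapped back to a
            // real frame index and clamped at both ends.
            let idx = (i * n + j).saturating_sub(left).min(last);
            stacked.extend_from_slice(&frames[idx]);
        }
        out.push(stacked);
    }
    out
}
```

With `m = 7` and `n = 6`, 100 frames of 80 mel bins become 17 frames of 560 values each.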

FunASR-Nano architecture

Audio (16kHz) → 80-mel Spectrogram → Whisper Encoder (frozen)
             → Audio Adaptor → Qwen LLM → Text

Supported audio formats

All models support:

WAV - Native support (any sample rate, mono/stereo)
MP3, M4A, FLAC, OGG, AAC - Automatic conversion via ffmpeg
Raw samples - Direct f32 array input at 16kHz

Automatic resampling to 16kHz is handled internally.
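
For raw-sample input you need mono f32 audio at 16 kHz. The sketch below shows what that preparation involves: averaging interleaved channels to mono, then resampling by linear interpolation. It is a simplified illustration with hypothetical function names, not the library's internal resampler; production resamplers use band-limited filters to avoid aliasing.

```rust
/// Averages interleaved multi-channel samples into a mono signal.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be positive");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resamples by linear interpolation between neighbouring input samples.
/// Illustrative only: no anti-aliasing filter is applied.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step; // position in input samples
            let i0 = pos.floor() as usize;
            let i1 = (i0 + 1).min(input.len() - 1);
            let frac = (pos - i0 as f64) as f32;
            input[i0] * (1.0 - frac) + input[i1] * frac
        })
        .collect()
}
```

A stereo 48 kHz buffer would go through `downmix_to_mono(&buf, 2)` and then `resample_linear(&mono, 48_000, 16_000)`.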

Model selection guide

Choose Qwen3-ASR when you need:

Multilingual support (30+ languages)
Best accuracy on Chinese, English, Japanese, Korean
Long-form audio transcription (automatic 30s chunking)
Production-grade quality and speed balance
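
The 30-second chunking mentioned for long-form audio can be pictured with the sketch below, which splits 16 kHz samples into fixed windows and joins the per-chunk text. This is an assumption about the general approach, not the library's implementation: real chunkers may split at silences or overlap windows. `chunk_audio` and `transcribe_long` are hypothetical names, and the closure stands in for a model call.

```rust
/// Sample rate the models expect, as stated in this overview.
const SAMPLE_RATE: usize = 16_000;
/// Chunk length taken from "automatic 30s chunking".
const CHUNK_SECONDS: usize = 30;

/// Splits mono 16 kHz samples into consecutive windows of at most 30 s.
/// The last window may be shorter.
fn chunk_audio(samples: &[f32]) -> Vec<&[f32]> {
    samples.chunks(SAMPLE_RATE * CHUNK_SECONDS).collect()
}

/// Transcribes each window with the supplied closure and joins the text.
/// `transcribe_chunk` is a placeholder for a real model call.
fn transcribe_long<F>(samples: &[f32], mut transcribe_chunk: F) -> String
where
    F: FnMut(&[f32]) -> String,
{
    chunk_audio(samples)
        .into_iter()
        .map(|chunk| transcribe_chunk(chunk))
        .collect::<Vec<_>>()
        .join(" ")
}
```

A 70-second recording yields three chunks of 30, 30, and 10 seconds.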

Choose Paraformer when you need:

Chinese-only transcription
Maximum speed (non-autoregressive)
Lower memory footprint
Extremely fast inference for short audio

Choose FunASR-Nano when you need:

Support for 31 languages, including dialects
Far-field/noisy environment robustness
Regional accent recognition
LLM-based semantic understanding

Quick start

Download a model

Download an ASR model from Hugging Face or ModelScope:

# Qwen3-ASR-1.7B (recommended)
huggingface-cli download mlx-community/Qwen3-ASR-1.7B-8bit \
    --local-dir ~/.OminiX/models/qwen3-asr-1.7b

# Paraformer
git clone https://modelscope.cn/models/damo/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git

# FunASR-Nano
huggingface-cli download mlx-community/Fun-ASR-Nano-2512-fp16 \
    --local-dir ~/.OminiX/models/funasr-nano

Transcribe audio

Use the command-line interface:

# Qwen3-ASR
cargo run --release --example transcribe -- audio.wav

# Paraformer (after conversion)
cargo run --release --example transcribe -- audio.wav /path/to/paraformer

# FunASR-Nano
cargo run --release --example transcribe -- audio.wav

Integrate into your application

Use the Rust API:

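use qwen3_asr_mlx::{Qwen3ASR, default_model_path};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the model from the default local model directory
    let mut model = Qwen3ASR::load(default_model_path())?;
    // Transcribe a file, passing the spoken language as a hint
    let text = model.transcribe_with_language("audio.wav", "English")?;
    println!("Transcription: {}", text);
    Ok(())
}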

Next steps

Qwen3-ASR

Learn about the multilingual Qwen3-ASR models

Paraformer

Explore the high-speed Paraformer Chinese ASR

FunASR-Nano

Discover the LLM-based FunASR-Nano

API Reference

View the unified API documentation
