Features
- 30+ languages - Chinese, English, Japanese, Korean, French, German, Spanish, and 23 more
- 30x-50x real-time on Apple Silicon (M-series) with 8-bit quantized models
- Long-form audio - Automatic 30-second chunking for files of any length
- Config-driven - One binary supports both 0.6B and 1.7B model sizes
- Zero Python - Pure Rust implementation, no Python runtime needed
- Auto-build tokenizer - Generates tokenizer.json from vocab.json + merges.txt if missing
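The 30-second chunking listed above can be pictured with a small sketch. It assumes 16 kHz mono input and fixed, non-overlapping boundaries; the helper name chunk_samples is hypothetical, and the actual implementation may overlap chunks or split on silence instead.

```rust
/// Sample rate assumed for this sketch (16 kHz mono).
const SAMPLE_RATE: usize = 16_000;
/// Chunk length in seconds, as stated in the feature list.
const CHUNK_SECONDS: usize = 30;

/// Split mono samples into consecutive chunks of at most 30 seconds.
/// Hypothetical helper: fixed boundaries, no overlap, last chunk may be shorter.
pub fn chunk_samples(samples: &[f32]) -> Vec<&[f32]> {
    samples.chunks(SAMPLE_RATE * CHUNK_SECONDS).collect()
}
```

Under these assumptions a 70-second recording yields two full chunks and one 10-second remainder, each of which can be transcribed in turn and the text concatenated.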
Model variants
Qwen3-ASR-1.7B-8bit
Recommended
- Size: 2.46 GB
- Speed: ~30x real-time on M4 Max
- Best accuracy across all benchmarks
- HuggingFace: mlx-community/Qwen3-ASR-1.7B-8bit
Qwen3-ASR-0.6B-8bit
Faster download
- Size: 1.01 GB
- Speed: ~22x real-time on M4 Max
- Good accuracy with smaller footprint
- HuggingFace: mlx-community/Qwen3-ASR-0.6B-8bit
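To make the "x real-time" figures concrete, the arithmetic is simply audio duration divided by the speedup. The function below is an illustrative helper, not part of the crate.

```rust
/// Estimated wall-clock seconds needed to transcribe `audio_seconds` of audio
/// at a given real-time speedup (for example 30.0 for "~30x real-time").
pub fn estimated_processing_seconds(audio_seconds: f64, speedup: f64) -> f64 {
    audio_seconds / speedup
}
```

At ~30x real-time, one hour of audio (3600 s) would take roughly two minutes; actual throughput depends on the hardware and the audio.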
Quick start
Architecture
Qwen3-ASR uses an encoder-decoder architecture optimized for multilingual speech recognition.
Component details
| Component | 0.6B | 1.7B |
|---|---|---|
| Encoder layers | 18 | 24 |
| Encoder d_model | 896 | 1024 |
| Encoder heads | 14 | 16 |
| Encoder FFN dim | 3584 | 4096 |
| Decoder layers | 28 | 28 |
| Decoder hidden | 1024 | 2048 |
| Decoder heads (Q/KV) | 16/8 | 16/8 |
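Since one binary serves both sizes, the table above maps naturally onto a single configuration type. The sketch below is illustrative only: the struct, field, and constant names are assumptions, and only the numbers come from the table.

```rust
/// Hyperparameters from the component table.
/// Type and field names are hypothetical, not the crate's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ModelDims {
    pub encoder_layers: usize,
    pub encoder_d_model: usize,
    pub encoder_heads: usize,
    pub encoder_ffn_dim: usize,
    pub decoder_layers: usize,
    pub decoder_hidden: usize,
    pub decoder_q_heads: usize,
    pub decoder_kv_heads: usize,
}

pub const QWEN3_ASR_0_6B: ModelDims = ModelDims {
    encoder_layers: 18,
    encoder_d_model: 896,
    encoder_heads: 14,
    encoder_ffn_dim: 3584,
    decoder_layers: 28,
    decoder_hidden: 1024,
    decoder_q_heads: 16,
    decoder_kv_heads: 8,
};

pub const QWEN3_ASR_1_7B: ModelDims = ModelDims {
    encoder_layers: 24,
    encoder_d_model: 1024,
    encoder_heads: 16,
    encoder_ffn_dim: 4096,
    decoder_layers: 28,
    decoder_hidden: 2048,
    decoder_q_heads: 16,
    decoder_kv_heads: 8,
};

impl ModelDims {
    /// Number of query heads sharing each key/value head under GQA.
    pub fn gqa_group_size(&self) -> usize {
        self.decoder_q_heads / self.decoder_kv_heads
    }

    /// Encoder feed-forward expansion ratio (FFN dim / d_model).
    pub fn encoder_ffn_ratio(&self) -> usize {
        self.encoder_ffn_dim / self.encoder_d_model
    }
}
```

Both sizes share a 4x encoder FFN expansion and two query heads per key/value head, so the variants differ mainly in depth and width.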
Architecture components
- Audio frontend - WhisperFeatureExtractor compatible (128 mels, n_fft=400, hop=160)
- Audio encoder - 3× Conv2d (stride 2, GELU) + sinusoidal position embeddings + Transformer with windowed block attention
- Projector - Linear(d_model → d_model, GELU) + Linear(d_model → decoder_hidden)
- Text decoder - Qwen3 with GQA, Q/K RMSNorm, SwiGLU MLP, RoPE (theta=1M), tied embeddings
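The frontend and encoder parameters above determine how many positions the Transformer sees for a given clip. The following is a rough sketch of that arithmetic. The hop length and stride come from the list above; one mel frame per hop, a convolution kernel of 3, and padding of 1 are assumptions made for illustration.

```rust
/// Hop length in samples, from the audio frontend description.
const HOP_LENGTH: usize = 160;

/// Number of mel frames for `num_samples` of 16 kHz audio,
/// assuming one frame per hop (padding details ignored).
pub fn mel_frames(num_samples: usize) -> usize {
    num_samples / HOP_LENGTH
}

/// Output length along the time axis of one stride-2 convolution.
/// Kernel size 3 and padding 1 are assumptions, not taken from the source.
pub fn conv_out_len(len: usize) -> usize {
    const KERNEL: usize = 3;
    const STRIDE: usize = 2;
    const PADDING: usize = 1;
    if len == 0 {
        return 0;
    }
    (len + 2 * PADDING - KERNEL) / STRIDE + 1
}

/// Time positions reaching the Transformer after the three conv layers.
pub fn encoder_positions(num_samples: usize) -> usize {
    (0..3).fold(mel_frames(num_samples), |len, _| conv_out_len(len))
}
```

Under these assumptions a 30-second chunk (480,000 samples) becomes 3000 mel frames and 375 encoder positions, an 8x reduction along the time axis.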
Supported languages
Qwen3-ASR supports 30+ languages with state-of-the-art accuracy:
Primary languages: Chinese, English, Cantonese, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Korean, Russian, Thai, Vietnamese, Japanese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Romanian, Hungarian, Macedonian
Chinese dialects (22 additional): Sichuan, Cantonese, Wu, Minnan, Hakka, and 17 more regional dialects
Benchmarks
Qwen3-ASR-1.7B outperforms Whisper-large-v3 on nearly every benchmark:
Chinese Mandarin (CER ↓)
| Dataset | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|
| WenetSpeech (meeting) | 19.11 | 6.88 | 5.88 |
| AISHELL-2 | 5.06 | 3.15 | 2.71 |
| SpeechIO | 7.56 | 3.44 | 2.88 |
English (WER ↓)
| Dataset | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|
| LibriSpeech (other) | 3.97 | 4.55 | 3.38 |
| GigaSpeech | 9.76 | 8.88 | 8.45 |
| CommonVoice-en | 9.90 | 9.92 | 7.39 |
Multilingual (WER ↓, averaged)
| Dataset | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|
| MLS | 8.62 | 13.19 | 8.55 |
| CommonVoice | 10.77 | 12.75 | 9.18 |
| Fleurs | 5.27 | 7.57 | 4.90 |
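One way to compare rows in the tables above is the relative error reduction against the Whisper-large-v3 baseline. The helper below is an illustrative calculation and not part of the project.

```rust
/// Relative error-rate reduction of `candidate` versus `baseline`, as a fraction.
/// Positive values mean the candidate makes fewer errors than the baseline.
pub fn relative_reduction(baseline: f64, candidate: f64) -> f64 {
    (baseline - candidate) / baseline
}
```

For WenetSpeech (meeting), going from 19.11 to 5.88 CER is a reduction of about 69%. For MLS, the 0.6B model's 13.19 against Whisper's 8.62 gives a negative value, which reflects the accuracy trade-off of the smaller model on multilingual data.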
API reference
Load model
Transcribe audio
Advanced configuration
Model path resolution
The model path is resolved in the following order:
- Explicit path passed to Qwen3ASR::load()
- QWEN3_ASR_MODEL_PATH environment variable
- ~/.OminiX/models/qwen3-asr-1.7b (default)
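The resolution order above can be sketched as a small pure function plus a thin wrapper that reads the environment. Both function names are hypothetical, and reading the home directory from HOME is an assumption; the crate's own logic may differ in detail.

```rust
use std::path::PathBuf;

/// Resolve the model directory using the documented precedence:
/// explicit path, then the environment variable value, then the default under home.
/// Hypothetical helper written for illustration.
pub fn resolve_model_path(
    explicit: Option<&str>,
    env_value: Option<&str>,
    home: &str,
) -> PathBuf {
    if let Some(path) = explicit {
        return PathBuf::from(path);
    }
    if let Some(path) = env_value {
        return PathBuf::from(path);
    }
    PathBuf::from(home).join(".OminiX/models/qwen3-asr-1.7b")
}

/// Same resolution, reading QWEN3_ASR_MODEL_PATH and HOME from the process environment.
pub fn resolve_from_environment(explicit: Option<&str>) -> PathBuf {
    let env_value = std::env::var("QWEN3_ASR_MODEL_PATH").ok();
    let home = std::env::var("HOME").unwrap_or_default();
    resolve_model_path(explicit, env_value.as_deref(), &home)
}
```

Keeping the precedence logic free of direct environment access makes it straightforward to unit test.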
Audio input formats
WAV files
Native support for WAV files with automatic resampling:
- Any sample rate (automatically resampled to 16kHz)
- Mono or stereo (stereo downmixed to mono)
- 16/24/32-bit integer or float
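The downmix and resampling steps listed above can be illustrated with a minimal sketch. It assumes interleaved stereo f32 input and uses plain linear interpolation, which is a simple stand-in; the resampler the project actually uses is not specified here and is likely higher quality. Both function names are hypothetical.

```rust
/// Average interleaved stereo frames (L, R, L, R, ...) into mono.
pub fn downmix_stereo(interleaved: &[f32]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|frame| (frame[0] + frame[1]) * 0.5)
        .collect()
}

/// Resample mono audio to 16 kHz by linear interpolation.
/// Illustrative only: no anti-aliasing filter is applied.
pub fn resample_to_16k(samples: &[f32], src_rate: u32) -> Vec<f32> {
    const TARGET_RATE: u64 = 16_000;
    if samples.is_empty() || src_rate == 0 {
        return Vec::new();
    }
    let out_len = (samples.len() as u64 * TARGET_RATE / src_rate as u64) as usize;
    let step = src_rate as f64 / TARGET_RATE as f64;
    let last = samples.len() - 1;
    (0..out_len)
        .map(|i| {
            // Fractional position of output sample i in the source signal.
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx.min(last)];
            let b = samples[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

One second of 48 kHz audio comes out as 16,000 samples, and input already at 16 kHz passes through unchanged.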
Other formats (MP3, M4A, FLAC, etc.)
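Formats other than WAV have to be decoded before transcription. As a minimal sketch of one way to do that from Rust, the code below shells out to the ffmpeg CLI and asks for 16 kHz mono WAV output. The function names are hypothetical and this is not necessarily how the library invokes ffmpeg; the flags themselves (-y, -i, -ar, -ac, -f) are standard ffmpeg options.

```rust
use std::process::Command;

/// Arguments asking ffmpeg to convert `input` into a 16 kHz mono WAV at `output`.
pub fn ffmpeg_args(input: &str, output: &str) -> Vec<String> {
    [
        "-y", // overwrite the output file if it exists
        "-i", input, // input file in any format ffmpeg can decode
        "-ar", "16000", // output sample rate
        "-ac", "1", // output channel count (mono)
        "-f", "wav", // output container
        output,
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}

/// Run ffmpeg and report whether it exited successfully.
/// Returns an error if the ffmpeg binary cannot be started.
pub fn convert_to_wav(input: &str, output: &str) -> std::io::Result<bool> {
    let status = Command::new("ffmpeg")
        .args(ffmpeg_args(input, output))
        .status()?;
    Ok(status.success())
}
```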
Automatic conversion via ffmpeg (requires ffmpeg installed).
Raw audio samples
Direct input of 16kHz mono f32 samples.
API server
Qwen3-ASR is available via the unified OminiX-API server with OpenAI-compatible endpoints.
Performance tips
Choose the right model size
- 1.7B - Best accuracy, suitable for production use with M3/M4 chips
- 0.6B - Faster inference, good for development or resource-constrained scenarios
Optimize for long audio
Use chunked processing for audio longer than 30 seconds.
Batch processing
Process multiple files efficiently.
Weight format
Models use safetensors format with two key prefixes:
- audio_tower.* - Audio encoder (unquantized, fp16)
- model.* - Text decoder (8-bit affine quantized, group_size=64)
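To illustrate what 8-bit affine quantization with a group size of 64 means, the sketch below classifies a tensor key by prefix and dequantizes values as w = scale * q + bias, with one scale and bias per group of 64. It is a simplification: quantized values are taken as plain u8 here, whereas the on-disk format may pack them differently, and all names are hypothetical.

```rust
/// Quantization group size stated for the text decoder weights.
pub const GROUP_SIZE: usize = 64;

/// Which part of the model a tensor key belongs to, based on its prefix.
#[derive(Debug, PartialEq, Eq)]
pub enum TensorOwner {
    AudioEncoder,
    TextDecoder,
    Unknown,
}

/// Classify a safetensors key by the two documented prefixes.
pub fn classify_key(key: &str) -> TensorOwner {
    if key.starts_with("audio_tower.") {
        TensorOwner::AudioEncoder
    } else if key.starts_with("model.") {
        TensorOwner::TextDecoder
    } else {
        TensorOwner::Unknown
    }
}

/// Dequantize affine-quantized values: w = scale * q + bias,
/// using one (scale, bias) pair per group of GROUP_SIZE values.
/// Assumes `q` is already unpacked to one u8 per weight.
pub fn dequantize(q: &[u8], scales: &[f32], biases: &[f32]) -> Vec<f32> {
    q.chunks(GROUP_SIZE)
        .zip(scales.iter().zip(biases))
        .flat_map(|(group, (&scale, &bias))| {
            group.iter().map(move |&value| scale * value as f32 + bias)
        })
        .collect()
}
```

Storing one scale and one bias per 64 weights keeps the overhead small while letting each group cover its own value range.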
Requirements
- macOS 13.5+ (Ventura or later)
- Apple Silicon (M1/M2/M3/M4)
- Rust 1.82.0+
- Optional: ffmpeg for non-WAV audio formats
Project structure
Example code
A complete transcription example is available in examples/transcribe.rs.
Credits
- Qwen3-ASR by the Qwen team at Alibaba Cloud
- mlx-community for quantized MLX model conversions
- mlx-rs for Rust MLX bindings