Features
- Fast inference: Metal GPU acceleration with async token pipelining
- Quantization support: 4-bit and bf16 models for flexible memory/quality tradeoffs
- Step-based KV cache: Memory-efficient autoregressive generation
- Chat templates: Native support for multi-turn conversations
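The "step-based KV cache" idea above can be illustrated in plain Rust: the cache grows its backing storage a whole step of tokens at a time instead of reallocating on every appended token. This is a minimal sketch of the concept, not this crate's actual API.

```rust
/// Minimal sketch of a step-based KV cache. Storage capacity is extended
/// in fixed-size steps, so the buffer is reallocated at most once per
/// `step` tokens rather than once per token. Illustrative only; the
/// crate's real cache lives on the GPU and this type name is hypothetical.
struct StepKvCache {
    keys: Vec<f32>, // flattened [tokens, head_dim]
    head_dim: usize,
    step: usize, // tokens of capacity added per extension
    len: usize,  // tokens currently stored
}

impl StepKvCache {
    fn new(head_dim: usize, step: usize) -> Self {
        Self { keys: Vec::new(), head_dim, step, len: 0 }
    }

    /// Append one token's key vector, growing capacity a step at a time.
    fn append(&mut self, key: &[f32]) {
        assert_eq!(key.len(), self.head_dim);
        if self.keys.len() + self.head_dim > self.keys.capacity() {
            // Reserve a whole step's worth of tokens in one allocation.
            self.keys.reserve_exact(self.step * self.head_dim);
        }
        self.keys.extend_from_slice(key);
        self.len += 1;
    }

    fn len(&self) -> usize {
        self.len
    }
}

fn main() {
    let mut cache = StepKvCache::new(4, 256);
    for t in 0..10 {
        cache.append(&[t as f32; 4]);
    }
    println!("cached tokens: {}", cache.len());
}
```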
Installation
Add to your `Cargo.toml`:
Quick start
Examples
Text generation
Generate text from a prompt.
Interactive chat
Multi-turn conversation with chat templates:
- Loading chat templates from `tokenizer_config.json`
- Building conversation history
- Streaming token output
- EOS token detection for Qwen3 (tokens 151643 and 151645)
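The stop-condition and prompt format above can be sketched in plain Rust. Qwen models use a ChatML-style template with `<|im_start|>`/`<|im_end|>` markers; the hand-rolled rendering below is an illustration of that wire format, not this crate's template loader (which reads the template from `tokenizer_config.json`).

```rust
/// Qwen3 EOS token ids: <|endoftext|> (151643) and <|im_end|> (151645).
const QWEN3_EOS: [u32; 2] = [151643, 151645];

/// Render one conversation turn in the ChatML-style format Qwen uses.
fn render_turn(role: &str, content: &str) -> String {
    format!("<|im_start|>{}\n{}<|im_end|>\n", role, content)
}

/// Render a whole conversation history and cue the model to answer
/// as the assistant. Hand-rolled sketch, not this crate's API.
fn render_prompt(history: &[(&str, &str)]) -> String {
    let mut prompt: String = history
        .iter()
        .map(|(role, content)| render_turn(role, content))
        .collect();
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}

/// Generation should stop as soon as a sampled token is an EOS token.
fn is_eos(token_id: u32) -> bool {
    QWEN3_EOS.contains(&token_id)
}

fn main() {
    let prompt = render_prompt(&[("user", "Hello!")]);
    print!("{}", prompt);
    assert!(is_eos(151645));
    assert!(!is_eos(42));
}
```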
Supported models
| Model | Size | Use case | HF path |
|---|---|---|---|
| Qwen3-0.6B | 1.2 GB | Embedded applications, testing | mlx-community/Qwen3-0.6B-bf16 |
| Qwen3-1.7B | 3.4 GB | Resource-constrained deployments | mlx-community/Qwen3-1.7B-bf16 |
| Qwen3-4B | 8 GB | General-purpose chat (recommended) | mlx-community/Qwen3-4B-bf16 |
| Qwen3-8B | 16 GB | Higher quality responses | mlx-community/Qwen3-8B-bf16 |
| Qwen3-14B | 28 GB | Advanced reasoning | mlx-community/Qwen3-14B-bf16 |
| Qwen3-32B | 64 GB | Maximum quality (requires M3 Max 128GB) | mlx-community/Qwen3-32B-bf16 |
Quantized variants
All models are available with 4-bit quantization for a 4x memory reduction: replace `-bf16` with `-4bit` in any HuggingFace path above.
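The 4x figure follows from bit widths alone (4-bit vs. 16-bit weights); as the benchmark table below shows, measured savings are somewhat smaller because per-group quantization scales and some unquantized layers add overhead. A back-of-envelope estimate:

```rust
/// Estimate raw weight memory: parameters x bits-per-weight.
/// Real 4-bit checkpoints are somewhat larger than this suggests
/// because of quantization scales and layers kept in higher precision.
fn weight_gb(params_billion: f64, bits_per_weight: f64) -> f64 {
    params_billion * 1e9 * bits_per_weight / 8.0 / 1e9
}

fn main() {
    // Qwen3-4B: roughly 4e9 parameters.
    println!("bf16:  {:.1} GB", weight_gb(4.0, 16.0)); // 8.0 GB
    println!("4-bit: {:.1} GB", weight_gb(4.0, 4.0));  // 2.0 GB
}
```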
Performance
Benchmark results (Apple M3 Max, 40-core GPU)
| Model | Precision | Prompt Speed | Decode Speed | Memory |
|---|---|---|---|---|
| Qwen3-4B | bf16 | 150 tok/s | 45 tok/s | 8 GB |
| Qwen3-4B | 4-bit | 250 tok/s | 75 tok/s | 3 GB |
Compared to bf16, the 4-bit model delivers:
- 1.67x faster prompt processing
- 1.67x faster token generation
- 2.67x lower memory usage
Speed vs sequence length
Prompt processing speed scales linearly with input length, while decode speed remains constant per token. For a 1000-token input with Qwen3-4B (4-bit):
- Prefill: ~4 seconds (1000 tokens at 250 tok/s)
- Decode: 75 tokens/second regardless of context length
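This gives a simple model for time-to-first-token and total latency, using the Qwen3-4B 4-bit throughput figures from the benchmark table above:

```rust
/// Prefill is linear in prompt length; decode is constant per token.
fn ttft_seconds(prompt_tokens: f64, prefill_tok_s: f64) -> f64 {
    prompt_tokens / prefill_tok_s
}

/// Total latency = prefill time + (generated tokens / decode speed).
fn total_seconds(prompt_tokens: f64, gen_tokens: f64, prefill_tok_s: f64, decode_tok_s: f64) -> f64 {
    ttft_seconds(prompt_tokens, prefill_tok_s) + gen_tokens / decode_tok_s
}

fn main() {
    // 1000-token prompt at 250 tok/s prefill -> 4 s to first token.
    println!("TTFT:  {:.1} s", ttft_seconds(1000.0, 250.0));
    // Plus 150 generated tokens at 75 tok/s -> 6 s total.
    println!("Total: {:.1} s", total_seconds(1000.0, 150.0, 250.0, 75.0));
}
```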
Converting models
Convert any Qwen3 model from HuggingFace. Converted weights are written to `./mlx_model` by default.
API reference
Core functions
The loader expects the following files in the model directory:
- `config.json` - Model configuration
- `model.safetensors` or `model-*.safetensors` - Model weights
- `tokenizer.json` - Tokenizer definition
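A loader typically validates that these files are present before reading weights. The helper below is a hypothetical illustration of that check, not a function this crate exports:

```rust
use std::path::Path;

/// Report which required model files are missing from a directory.
/// Hypothetical helper for illustration; not this crate's API.
fn missing_files(dir: &Path) -> Vec<&'static str> {
    let mut missing = Vec::new();
    if !dir.join("config.json").exists() {
        missing.push("config.json");
    }
    // Weights may be a single file or sharded model-*.safetensors shards.
    let single = dir.join("model.safetensors").exists();
    let sharded = std::fs::read_dir(dir)
        .map(|entries| {
            entries.flatten().any(|e| {
                let name = e.file_name().to_string_lossy().into_owned();
                name.starts_with("model-") && name.ends_with(".safetensors")
            })
        })
        .unwrap_or(false);
    if !single && !sharded {
        missing.push("model.safetensors");
    }
    if !dir.join("tokenizer.json").exists() {
        missing.push("tokenizer.json");
    }
    missing
}

fn main() {
    let missing = missing_files(Path::new("./mlx_model"));
    if missing.is_empty() {
        println!("model directory looks complete");
    } else {
        println!("missing: {:?}", missing);
    }
}
```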
Generation
KV cache types
Troubleshooting
Out of memory errors
Try these solutions in order:
- Use a 4-bit quantized model instead of bf16
- Use a smaller model (e.g., Qwen3-1.7B instead of Qwen3-4B)
- Reduce max token limit in generation
- Close other applications to free memory
Slow generation speed
- Ensure you're using `--release` build mode
- Verify Metal is enabled: check for GPU utilization in Activity Monitor
- Update to latest macOS version for best Metal performance
- Use quantized models for faster inference
Model download fails
Related models
- Qwen3-ASR - Speech recognition with Qwen3 backbone
- Qwen-Image - Vision-language model with Qwen architecture