Model Overview
Qwen3-TTS offers a family of models optimized for different use cases, from voice cloning to instruction-based voice design. All models are built on the same architecture but are fine-tuned for specific capabilities.
Model Comparison
| Model | Parameters | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Voice design from text descriptions | 10 languages | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | 9 premium voices with style control | 10 languages | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | 3-second voice cloning | 10 languages | ✅ | ❌ |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | 9 premium voices | 10 languages | ✅ | ❌ |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6B | 3-second voice cloning | 10 languages | ✅ | ❌ |
Model Variants
VoiceDesign
Create custom voices from natural language descriptions
CustomVoice
Use 9 premium pre-trained voices with style control
Base
Clone any voice from just 3 seconds of audio
Tokenizer
Encode/decode audio for compression and transport
1. Qwen3-TTS-12Hz-1.7B-VoiceDesign
Best for: Creating unique voices based on detailed descriptions
Capabilities:
- Generate voices from natural language descriptions
- Control timbre, age, gender, emotion, and speaking style
- Instruction-based prosody and tone adjustment
- Adaptive emotional expression based on text semantics
- Streaming generation support
2. Qwen3-TTS-12Hz-1.7B-CustomVoice
Best for: Production applications requiring consistent, high-quality voices
Capabilities:
- 9 premium pre-trained voices
- Style control via natural language instructions
- Covers various genders, ages, languages, and dialects
- Instruction-based emotion and prosody control
- Streaming generation support
3. Qwen3-TTS-12Hz-1.7B-Base
Best for: Voice cloning and fine-tuning for custom applications
Capabilities:
- Clone any voice from 3 seconds of audio
- Maintains speaker characteristics and style
- Support for reference audio in multiple formats (file, URL, bytes)
- Can be fine-tuned for specific domains or speakers
- Streaming generation support
4. Qwen3-TTS-12Hz-0.6B Models
Best for: Applications requiring faster inference or limited resources
The 0.6B variants offer:
- Faster inference: ~1.7x faster than 1.7B models
- Lower memory: Reduced GPU memory requirements
- Good quality: Maintains high quality with a slight trade-off versus the 1.7B models
- Same variants: CustomVoice and Base are available (no VoiceDesign variant; per the comparison table, the 0.6B CustomVoice model does not support instruction control)
The 0.6B models are ideal for edge devices, high-throughput services, or applications where speed is prioritized over maximum quality.
5. Qwen3-TTS-Tokenizer-12Hz
Best for: Audio compression, transport, and training data preparation
Capabilities:
- Encode audio to discrete codes (compression)
- Decode codes back to audio (reconstruction)
- 12Hz frame rate (12 frames per second)
- 16 residual quantizers
- 2048 codebook size per quantizer
- 24kHz sample rate
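The figures listed above imply a rough bitrate for the code stream. The sketch below is back-of-envelope arithmetic only, assuming each frame carries one index from each of the 16 quantizers and each index is stored in log2(2048) = 11 bits with no further entropy coding. The function names are illustrative and are not part of the tokenizer's API.

```python
import math

FRAME_RATE_HZ = 12        # frames per second, as listed above
NUM_QUANTIZERS = 16       # residual quantizers per frame
CODEBOOK_SIZE = 2048      # entries per quantizer codebook
SAMPLE_RATE_HZ = 24_000   # audio sample rate


def bits_per_code(codebook_size: int = CODEBOOK_SIZE) -> int:
    """Bits needed to store one codebook index without entropy coding."""
    return math.ceil(math.log2(codebook_size))


def code_bitrate_bps(
    frame_rate: int = FRAME_RATE_HZ,
    num_quantizers: int = NUM_QUANTIZERS,
    codebook_size: int = CODEBOOK_SIZE,
) -> int:
    """Bits per second of the discrete code stream."""
    return frame_rate * num_quantizers * bits_per_code(codebook_size)


def samples_per_frame(
    sample_rate: int = SAMPLE_RATE_HZ, frame_rate: int = FRAME_RATE_HZ
) -> int:
    """Audio samples summarized by a single frame of codes."""
    return sample_rate // frame_rate


if __name__ == "__main__":
    print(bits_per_code())       # 11
    print(code_bitrate_bps())    # 2112
    print(samples_per_frame())   # 2000
```

Under these assumptions the code stream is about 2.1 kbit/s, compared with 384 kbit/s for 16-bit mono PCM at 24 kHz.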
Model Selection Guide
I want to create unique character voices
Recommended: Qwen3-TTS-12Hz-1.7B-VoiceDesign
This model excels at generating voices from descriptions. You can specify exact characteristics like “teenage girl with cheerful tone” or “elderly professor with authoritative voice.”
I need consistent professional voices
Recommended: Qwen3-TTS-12Hz-1.7B-CustomVoice (or 0.6B for faster inference)
Use the 9 premium pre-trained voices for production applications. They’re highly consistent, and the 1.7B model supports instruction-based style control.
I want to clone a specific person's voice
Recommended: Qwen3-TTS-12Hz-1.7B-Base (or 0.6B for faster inference)
Provide 3 seconds of reference audio to clone any voice. Ideal for personal voice assistants or preserving specific speaker characteristics.
I need fast inference for high-throughput services
Recommended: Qwen3-TTS-12Hz-0.6B-CustomVoice or 0.6B-Base
The 0.6B models run roughly 1.7x faster than the 1.7B models while maintaining good quality. They suit real-time applications or services handling many concurrent requests.
I want to fine-tune for my specific use case
Recommended: Qwen3-TTS-12Hz-1.7B-Base or 0.6B-Base
Base models serve as excellent starting points for fine-tuning. See the Fine-Tuning Guide for details.
I need audio compression and codec functionality
Recommended: Qwen3-TTS-Tokenizer-12Hz
Use the tokenizer for encode/decode operations, audio compression, or preparing training data.
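The selection guide above can be condensed into a small lookup. This is a minimal sketch of the decision logic only: the `Need` enum and `recommend_model` helper are hypothetical names introduced for illustration and do not come from the Qwen3-TTS package.

```python
from enum import Enum


class Need(Enum):
    VOICE_DESIGN = "create unique character voices"
    PRESET_VOICES = "consistent professional voices"
    VOICE_CLONING = "clone a specific person's voice"
    FINE_TUNING = "fine-tune for a specific use case"
    CODEC = "audio compression and codec functionality"


# Variant suffix per need; None marks the tokenizer, which has no size variant.
_VARIANT = {
    Need.VOICE_DESIGN: "VoiceDesign",
    Need.PRESET_VOICES: "CustomVoice",
    Need.VOICE_CLONING: "Base",
    Need.FINE_TUNING: "Base",
    Need.CODEC: None,
}


def recommend_model(need: Need, prefer_speed: bool = False) -> str:
    """Return the model name suggested by the selection guide."""
    variant = _VARIANT[need]
    if variant is None:
        return "Qwen3-TTS-Tokenizer-12Hz"
    # VoiceDesign is only listed at 1.7B, so prefer_speed cannot change it.
    if prefer_speed and variant != "VoiceDesign":
        return f"Qwen3-TTS-12Hz-0.6B-{variant}"
    return f"Qwen3-TTS-12Hz-1.7B-{variant}"


if __name__ == "__main__":
    print(recommend_model(Need.VOICE_CLONING, prefer_speed=True))
    # Qwen3-TTS-12Hz-0.6B-Base
```

Passing `prefer_speed=True` corresponds to the high-throughput case in the guide, where the 0.6B CustomVoice or Base models are recommended.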
Performance Benchmarks
Content Consistency (WER ↓)
Word Error Rate on Seed-TTS test set:
| Model | Chinese | English |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
| Qwen3-TTS-12Hz-0.6B-Base | 0.92 | 1.32 |
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |
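For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows only that core calculation. It assumes whitespace tokenization and skips the speech recognition and text normalization steps a real benchmark would need; it is not the evaluation script behind the table, and whitespace splitting is not suitable for Chinese text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick red fox"))  # 0.25
```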
Speaker Similarity (Cosine Similarity ↑)
Speaker similarity on multilingual test set:
| Language | 1.7B-Base | 0.6B-Base |
|---|---|---|
| Chinese | 0.799 | 0.811 |
| English | 0.775 | 0.829 |
| Japanese | 0.788 | 0.798 |
| Korean | 0.799 | 0.812 |
Higher scores indicate better speaker similarity preservation. Notably, the 0.6B model scores higher than the 1.7B model in all four languages shown here.
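Cosine similarity compares the direction of two embedding vectors, here one from the reference speaker and one from the synthesized audio. The sketch below shows the formula on toy vectors. It assumes the embeddings have already been extracted by some speaker-verification model, which is outside the scope of this example, and the four-dimensional vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy stand-ins for speaker embeddings (hypothetical values).
    reference_embedding = [0.2, 0.7, -0.1, 0.4]
    synthesized_embedding = [0.25, 0.6, -0.05, 0.5]
    print(round(cosine_similarity(reference_embedding, synthesized_embedding), 3))
```

A value of 1.0 means the two vectors point in the same direction, and 0.0 means they are orthogonal.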
Hardware Requirements
GPU Memory Usage
| Model | FP16/BF16 | FP32 |
|---|---|---|
| 1.7B models | ~4-6 GB | ~8-12 GB |
| 0.6B models | ~2-3 GB | ~4-6 GB |
| Tokenizer | ~1-2 GB | ~2-4 GB |
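The FP16/BF16 and FP32 columns above can be sanity-checked against the size of the weights alone, which is the parameter count times the bytes per parameter. The sketch below computes that lower bound, using decimal gigabytes and nominal parameter counts of 1.7 billion and 0.6 billion. Real usage is higher because activations, caches, and framework overhead are not included, which is consistent with the table's ranges sitting above these figures.

```python
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp32": 4}

# Nominal parameter counts taken from the model names.
PARAM_COUNT = {"1.7B": 1_700_000_000, "0.6B": 600_000_000}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory for the weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in PARAM_COUNT.items():
        half = weight_memory_gb(params, "bf16")
        full = weight_memory_gb(params, "fp32")
        print(f"{name}: {half:.1f} GB (FP16/BF16), {full:.1f} GB (FP32)")
    # 1.7B: 3.4 GB (FP16/BF16), 6.8 GB (FP32)
    # 0.6B: 1.2 GB (FP16/BF16), 2.4 GB (FP32)
```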
Inference Speed
Typical inference times (per second of audio):
| Model | GPU | Speed (RTF) |
|---|---|---|
| 1.7B | A100 | ~0.1x |
| 1.7B | V100 | ~0.15x |
| 0.6B | A100 | ~0.06x |
| 0.6B | V100 | ~0.09x |
RTF (Real-Time Factor): 0.1x means generating 1 second of audio takes 0.1 seconds. Lower is faster.
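As a worked example of the definition, the sketch below converts the approximate RTF values from the table into wall-clock generation time and into a relative speedup between model sizes. The helper names are illustrative, and the numbers are only as precise as the rounded table entries.

```python
# Approximate real-time factors copied from the table above.
RTF = {
    ("1.7B", "A100"): 0.10,
    ("1.7B", "V100"): 0.15,
    ("0.6B", "A100"): 0.06,
    ("0.6B", "V100"): 0.09,
}


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to generate audio_seconds of speech."""
    return audio_seconds * rtf


def speedup(rtf_slow: float, rtf_fast: float) -> float:
    """How many times faster the model with rtf_fast is."""
    return rtf_slow / rtf_fast


if __name__ == "__main__":
    for (model, gpu), rtf in RTF.items():
        seconds = generation_seconds(60.0, rtf)
        print(f"{model} on {gpu}: {seconds:.1f} s for 60 s of audio")
    ratio = speedup(RTF[("1.7B", "A100")], RTF[("0.6B", "A100")])
    print(f"0.6B vs 1.7B on A100: {ratio:.1f}x faster")  # 1.7x
```

The ratio of 0.1 to 0.06 is where the "~1.7x faster" figure for the 0.6B models comes from.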
Model Downloads
Models are automatically downloaded when you call from_pretrained(). For manual downloads:
- Hugging Face
- ModelScope
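For the Hugging Face route, one option is the `snapshot_download` function from the `huggingface_hub` package. The sketch below rests on an assumption this page does not confirm: that the checkpoints are published under a `Qwen` organization with repository names matching the model names used here. Check the actual repository IDs before relying on it. The `repo_id` and `download_model` helpers are illustrative names.

```python
MODEL_NAMES = [
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "Qwen3-TTS-12Hz-1.7B-Base",
    "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    "Qwen3-TTS-12Hz-0.6B-Base",
    "Qwen3-TTS-Tokenizer-12Hz",
]


def repo_id(model_name: str, organization: str = "Qwen") -> str:
    """Build a hub repository ID. The organization name is an assumption."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name}")
    return f"{organization}/{model_name}"


def download_model(model_name: str, local_dir: str) -> str:
    """Download all files of one model and return the local directory path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(model_name), local_dir=local_dir)


# Example call (requires network access and `pip install huggingface_hub`):
# download_model("Qwen3-TTS-12Hz-0.6B-Base", "./Qwen3-TTS-12Hz-0.6B-Base")
```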
Next Steps
Quick Start
Install and start using Qwen3-TTS models
Architecture
Understand how the models work internally
Language Support
Learn about multilingual capabilities
Fine-Tuning
Customize models for your specific use case