Evaluation Setup
All models were evaluated with the following configuration:

- Dtype: torch.bfloat16
- Max new tokens: 2048
- Sampling parameters: defaults from the checkpoint's generate_config.json
- Language setting:
  - language="auto" for Seed-Test and InstructTTS-Eval
  - Explicit language for other test sets
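The configuration above can be captured in a small settings object. The sketch below is illustrative only: `EvalConfig`, `language_for`, and the example language string are hypothetical names, not part of any Qwen3-TTS API. The dtype is kept as a string so the snippet runs without PyTorch installed.

```python
from dataclasses import dataclass

# Test sets evaluated with automatic language detection, as listed above.
AUTO_LANGUAGE_TEST_SETS = frozenset({"Seed-Test", "InstructTTS-Eval"})


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the evaluation settings listed above."""

    dtype: str = "bfloat16"      # stands in for torch.bfloat16
    max_new_tokens: int = 2048
    # Sampling parameters are intentionally absent: the setup uses the
    # defaults shipped with each checkpoint's generate_config.json.


def language_for(test_set: str, explicit_language: str) -> str:
    """Return "auto" for the auto-detected test sets, else the explicit language."""
    if test_set in AUTO_LANGUAGE_TEST_SETS:
        return "auto"
    return explicit_language


if __name__ == "__main__":
    config = EvalConfig()
    print(config)
    print(language_for("Seed-Test", "Chinese"))      # auto-detected test set
    print(language_for("multilingual", "German"))    # hypothetical test-set name
```

Keeping the language choice in one helper makes it harder to evaluate a test set under the wrong setting by accident.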
Speech Generation Quality
Seed-TTS Test Set
Zero-shot speech generation measured by Word Error Rate (WER, ↓ lower is better).

| Model | test-zh | test-en |
|---|---|---|
| Seed-TTS | 1.12 | 2.25 |
| MaskGCT | 2.27 | 2.62 |
| E2 TTS | 1.97 | 2.19 |
| F5-TTS | 1.56 | 1.83 |
| Spark TTS | 1.20 | 1.98 |
| Llasa-8B | 1.59 | 2.97 |
| KALL-E | 0.96 | 1.94 |
| FireRedTTS 2 | 1.14 | 1.95 |
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |
| Qwen3-TTS-25Hz-0.6B-Base | 1.18 | 1.64 |
| Qwen3-TTS-25Hz-1.7B-Base | 1.10 | 1.49 |
| Qwen3-TTS-12Hz-0.6B-Base | 0.92 | 1.32 |
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
Qwen3-TTS-12Hz-1.7B-Base achieves the lowest English WER in the table (1.24) and the second-lowest Chinese WER (0.77, behind CosyVoice 3 at 0.71).
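For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance between a reference transcript and an ASR transcript of the generated audio, divided by the number of reference words. The following is a minimal sketch under stated assumptions: the scores in the table are read as percentages, the text normalization (lowercasing and whitespace splitting) is a simplification, and the ASR step is omitted. The benchmark's actual scoring pipeline may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent (assumed normalization: lowercase + split)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.2f}")  # 16.67
```

Because the denominator is the reference length, WER can exceed 100 when the hypothesis contains many inserted words.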
Multilingual Performance
Performance across 10 languages, measured by WER (↓) for content consistency and cosine similarity (↑) for speaker similarity.

Content Consistency (WER ↓)
| Language | Qwen3-TTS-25Hz-0.6B | Qwen3-TTS-25Hz-1.7B | Qwen3-TTS-12Hz-0.6B | Qwen3-TTS-12Hz-1.7B | MiniMax | ElevenLabs |
|---|---|---|---|---|---|---|
| Chinese | 1.108 | 0.777 | 1.145 | 0.928 | 2.252 | 16.026 |
| English | 1.048 | 1.014 | 0.836 | 0.934 | 2.164 | 2.339 |
| German | 1.501 | 0.960 | 1.089 | 1.235 | 1.906 | 0.572 |
| Italian | 1.169 | 1.105 | 1.534 | 0.948 | 1.543 | 1.743 |
| Portuguese | 2.046 | 1.778 | 2.254 | 1.526 | 1.877 | 1.331 |
| Spanish | 2.031 | 1.491 | 1.491 | 1.126 | 1.029 | 1.084 |
| Japanese | 4.189 | 5.121 | 6.404 | 3.823 | 3.519 | 10.646 |
| Korean | 2.852 | 2.631 | 1.741 | 1.755 | 1.747 | 1.865 |
| French | 2.852 | 2.631 | 2.931 | 2.858 | 4.099 | 5.216 |
| Russian | 5.957 | 4.535 | 4.458 | 3.212 | 4.281 | 3.878 |
Speaker Similarity (Cosine Similarity ↑)
| Language | Qwen3-TTS-25Hz-0.6B | Qwen3-TTS-25Hz-1.7B | Qwen3-TTS-12Hz-0.6B | Qwen3-TTS-12Hz-1.7B | MiniMax | ElevenLabs |
|---|---|---|---|---|---|---|
| Chinese | 0.797 | 0.796 | 0.811 | 0.799 | 0.780 | 0.677 |
| English | 0.811 | 0.815 | 0.829 | 0.775 | 0.756 | 0.613 |
| German | 0.749 | 0.737 | 0.769 | 0.775 | 0.733 | 0.614 |
| Italian | 0.722 | 0.718 | 0.792 | 0.817 | 0.699 | 0.579 |
| Portuguese | 0.790 | 0.783 | 0.794 | 0.817 | 0.805 | 0.711 |
| Spanish | 0.732 | 0.731 | 0.812 | 0.814 | 0.762 | 0.615 |
| Japanese | 0.810 | 0.807 | 0.798 | 0.788 | 0.776 | 0.738 |
| Korean | 0.824 | 0.814 | 0.812 | 0.799 | 0.779 | 0.700 |
| French | 0.698 | 0.703 | 0.700 | 0.714 | 0.628 | 0.535 |
| Russian | 0.734 | 0.744 | 0.781 | 0.792 | 0.761 | 0.676 |
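Speaker similarity of this kind is typically the cosine of the angle between two speaker embeddings, one from the reference audio and one from the generated audio. The sketch below shows only the similarity computation. The embedding extractor is not specified in this document, so the short vectors here are made-up placeholders rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder "embeddings"; a real extractor would return much longer vectors.
    reference_embedding = [0.2, 0.7, 0.1]
    generated_embedding = [0.25, 0.6, 0.2]
    print(round(cosine_similarity(reference_embedding, generated_embedding), 3))
```

The score is invariant to vector scale, so only the direction of the embedding matters, not its magnitude.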
Cross-Lingual Synthesis
Performance on cross-lingual tasks (e.g., an English speaker speaking Chinese). Mixed error rate (↓): WER for English, CER for other languages.

| Task | Qwen3-TTS-25Hz-1.7B | Qwen3-TTS-12Hz-1.7B | CosyVoice3 | CosyVoice2 |
|---|---|---|---|---|
| en-to-zh | 5.66 | 4.77 | 5.09 | 13.5 |
| ja-to-zh | 3.92 | 3.43 | 3.05 | 48.1 |
| ko-to-zh | 1.14 | 1.08 | 1.06 | 7.70 |
| zh-to-en | 2.91 | 2.77 | 2.98 | 6.47 |
| ja-to-en | 3.95 | 3.04 | 4.20 | 17.1 |
| ko-to-en | 3.48 | 3.09 | 4.19 | 11.2 |
| zh-to-ja | 9.29 | 8.40 | 7.08 | 13.1 |
| en-to-ja | 7.74 | 7.21 | 6.80 | 14.9 |
| ko-to-ja | 4.17 | 3.67 | 3.93 | 5.86 |
| zh-to-ko | 8.12 | 4.82 | 14.4 | 24.8 |
| en-to-ko | 6.83 | 5.14 | 5.87 | 21.9 |
| ja-to-ko | 6.86 | 5.59 | 7.92 | 21.5 |
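The mixed error rate used in this table can be sketched as a single edit-distance routine with language-dependent tokenization: words for English, characters otherwise. This is an illustration under assumptions, not the benchmark's scoring code. In particular, `mixed_error_rate` is a hypothetical name, the choice of metric is assumed to follow the target language of each task, and the normalization (lowercasing, dropping whitespace) is a simplification.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tokenize(text: str, target_lang: str):
    """Words for English (WER), non-space characters otherwise (CER)."""
    if target_lang == "en":
        return text.lower().split()
    return [ch for ch in text if not ch.isspace()]


def mixed_error_rate(reference: str, hypothesis: str, target_lang: str) -> float:
    """Error rate in percent: WER if target_lang is "en", CER otherwise."""
    ref = tokenize(reference, target_lang)
    hyp = tokenize(hypothesis, target_lang)
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # For a task such as en-to-zh, the assumed target language is "zh".
    print(mixed_error_rate("你好世界", "你好世", "zh"))               # 25.0
    print(mixed_error_rate("good morning", "good evening", "en"))  # 50.0
```

Character-level scoring avoids the need for a word segmenter in languages such as Chinese and Japanese, which is one common reason to report CER for them.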
Controllable Speech Generation
Performance on the InstructTTSEval benchmark, which scores three tasks (↑ higher is better): APS (Acoustic-Parameter Specification), DSD (Descriptive-Style Directive), and RP (Role-Play).

Target Speaker Control
| Model | ZH APS | ZH DSD | ZH RP | EN APS | EN DSD | EN RP |
|---|---|---|---|---|---|---|
| Gemini-flash | 88.2 | 90.9 | 77.3 | 92.3 | 93.8 | 80.1 |
| Gemini-pro | 89.0 | 90.1 | 75.5 | 87.6 | 86.0 | 67.2 |
| Qwen3-TTS-25Hz-1.7B-CustomVoice | 83.1 | 75.0 | 63.0 | 79.0 | 82.8 | 69.3 |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 83.0 | 77.8 | 61.2 | 77.3 | 77.1 | 63.7 |
| GPT-4o-mini-tts | 54.9 | 52.3 | 46.0 | 76.4 | 74.3 | 54.8 |
Voice Design
| Model | ZH APS | ZH DSD | ZH RP | EN APS | EN DSD | EN RP |
|---|---|---|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | - | - | - |
| Hume | - | - | - | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | - | - | - | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | - | - | - | 60.0 | 45.9 | 31.2 |
| PromptTTS | - | - | - | 64.3 | 47.2 | 31.4 |
| PromptStyle | - | - | - | 57.4 | 46.4 | 30.9 |
Qwen3-TTS-12Hz-1.7B-VoiceDesign scores highest on every voice design metric except English APS, where Hume is marginally ahead (83.0 vs. 82.9), indicating strong instruction following and voice controllability.
Key Insights
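- Best-in-class English: Qwen3-TTS-12Hz-1.7B-Base achieves the lowest English WER (1.24) on the Seed-TTS test set.
- Strong Multilingual: across the 10 evaluated languages, a Qwen3-TTS variant has the highest speaker similarity in every language and the lowest WER in 6 of them.
- Cross-lingual Excellence: Qwen3-TTS-12Hz-1.7B has the lowest error rate on 8 of the 12 cross-lingual tasks, including English-to-Korean (5.14).
- Voice Design Leader: Qwen3-TTS-12Hz-1.7B-VoiceDesign scores highest on five of the six InstructTTSEval voice design metrics, using natural language instructions for control.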