
Evaluation Setup

All models were evaluated with the following configuration:
  • Dtype: torch.bfloat16
  • Max new tokens: 2048
  • Sampling parameters: Defaults from checkpoint’s generate_config.json
  • Language setting:
    • language="auto" for the Seed-TTS test set and InstructTTSEval
    • Explicit language for other test sets
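
The configuration above can be captured in a small settings object. This is a minimal sketch, not the evaluation harness itself: `EvalConfig`, `AUTO_LANGUAGE_SETS`, and `language_for` are hypothetical names, the test-set labels are illustrative, and the dtype is kept as a string so the snippet runs without PyTorch.

```python
from dataclasses import dataclass

# Test sets evaluated with language="auto" (labels are illustrative, not official IDs).
AUTO_LANGUAGE_SETS = {"seed-tts", "instructttseval"}


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the settings listed above."""
    dtype: str = "bfloat16"      # stands in for torch.bfloat16
    max_new_tokens: int = 2048
    # Sampling parameters are deliberately absent: the checkpoint defaults are used.


def language_for(test_set: str, explicit_language: str) -> str:
    """Return "auto" for the auto-detected test sets, else the explicit language."""
    if test_set.lower() in AUTO_LANGUAGE_SETS:
        return "auto"
    return explicit_language


config = EvalConfig()
print(config.dtype, config.max_new_tokens, language_for("Seed-TTS", "English"))
```

Keeping the language rule in one function makes it harder to evaluate a test set with the wrong setting by accident.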

Speech Generation Quality

Seed-TTS Test Set

Zero-shot speech generation measured by Word Error Rate (WER, ↓ lower is better).

| Model | test-zh | test-en |
|---|---|---|
| Seed-TTS | 1.12 | 2.25 |
| MaskGCT | 2.27 | 2.62 |
| E2 TTS | 1.97 | 2.19 |
| F5-TTS | 1.56 | 1.83 |
| Spark TTS | 1.20 | 1.98 |
| Llasa-8B | 1.59 | 2.97 |
| KALL-E | 0.96 | 1.94 |
| FireRedTTS 2 | 1.14 | 1.95 |
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |
| Qwen3-TTS-25Hz-0.6B-Base | 1.18 | 1.64 |
| Qwen3-TTS-25Hz-1.7B-Base | 1.10 | 1.49 |
| Qwen3-TTS-12Hz-0.6B-Base | 0.92 | 1.32 |
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |

Qwen3-TTS-12Hz-1.7B-Base achieves the lowest English WER in this comparison (1.24 on test-en) and a competitive Chinese WER (0.77 on test-zh, behind only CosyVoice 3 at 0.71).
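
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference text and a transcript of the generated speech, divided by the number of reference words. The sketch below assumes the table reports WER as a percentage and starts from text that has already been transcribed. The ASR model and text normalization used in the actual evaluation are not described here and are not reproduced. The function names are illustrative.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion (reference word missing)
                            curr[j - 1] + 1,      # insertion (extra hypothesis word)
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word edits / reference words * 100."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One deleted word out of four reference words.
print(wer("a b c d", "a b d"))  # 25.0
```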

Multilingual Performance

Performance across 10 languages. WER (↓) for content consistency, Cosine Similarity (↑) for speaker similarity.

Content Consistency (WER ↓)

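| Language | Qwen3-TTS-25Hz-0.6B | Qwen3-TTS-25Hz-1.7B | Qwen3-TTS-12Hz-0.6B | Qwen3-TTS-12Hz-1.7B | MiniMax | ElevenLabs |
|---|---|---|---|---|---|---|
| Chinese | 1.108 | 0.777 | 1.145 | 0.928 | 2.252 | 16.026 |
| English | 1.048 | 1.014 | 0.836 | 0.934 | 2.164 | 2.339 |
| German | 1.501 | 0.960 | 1.089 | 1.235 | 1.906 | 0.572 |
| Italian | 1.169 | 1.105 | 1.534 | 0.948 | 1.543 | 1.743 |
| Portuguese | 2.046 | 1.778 | 2.254 | 1.526 | 1.877 | 1.331 |
| Spanish | 2.031 | 1.491 | 1.491 | 1.126 | 1.029 | 1.084 |
| Japanese | 4.189 | 5.121 | 6.404 | 3.823 | 3.519 | 10.646 |
| Korean | 2.852 | 2.631 | 1.741 | 1.755 | 1.747 | 1.865 |
| French | 2.852 | 2.631 | 2.931 | 2.858 | 4.099 | 5.216 |
| Russian | 5.957 | 4.535 | 4.458 | 3.212 | 4.281 | 3.878 |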

Speaker Similarity (Cosine Similarity ↑)

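| Language | Qwen3-TTS-25Hz-0.6B | Qwen3-TTS-25Hz-1.7B | Qwen3-TTS-12Hz-0.6B | Qwen3-TTS-12Hz-1.7B | MiniMax | ElevenLabs |
|---|---|---|---|---|---|---|
| Chinese | 0.797 | 0.796 | 0.811 | 0.799 | 0.780 | 0.677 |
| English | 0.811 | 0.815 | 0.829 | 0.775 | 0.756 | 0.613 |
| German | 0.749 | 0.737 | 0.769 | 0.775 | 0.733 | 0.614 |
| Italian | 0.722 | 0.718 | 0.792 | 0.817 | 0.699 | 0.579 |
| Portuguese | 0.790 | 0.783 | 0.794 | 0.817 | 0.805 | 0.711 |
| Spanish | 0.732 | 0.731 | 0.812 | 0.814 | 0.762 | 0.615 |
| Japanese | 0.810 | 0.807 | 0.798 | 0.788 | 0.776 | 0.738 |
| Korean | 0.824 | 0.814 | 0.812 | 0.799 | 0.779 | 0.700 |
| French | 0.698 | 0.703 | 0.700 | 0.714 | 0.628 | 0.535 |
| Russian | 0.734 | 0.744 | 0.781 | 0.792 | 0.761 | 0.676 |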
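
Speaker similarity is typically computed as the cosine similarity between a speaker embedding of the reference audio and one of the generated audio. The sketch below shows only the similarity step on plain vectors. The embedding model used in the evaluation is not named here, so the vectors are stand-ins and `cosine_similarity` is an illustrative helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Stand-in embeddings; real speaker embeddings have hundreds of dimensions.
reference_embedding = [0.2, 0.7, 0.1]
generated_embedding = [0.25, 0.65, 0.05]
print(round(cosine_similarity(reference_embedding, generated_embedding), 3))
```

Because the score depends only on direction and not on vector length, it does not change when an embedding is rescaled.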

Cross-Lingual Synthesis

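Performance on cross-lingual tasks, where a speaker of one language is synthesized in another (e.g., en-to-zh is an English speaker speaking Chinese). The metric is a mixed error rate (↓ lower is better): WER when the target language is English, CER for all other target languages.

| Task | Qwen3-TTS-25Hz-1.7B | Qwen3-TTS-12Hz-1.7B | CosyVoice3 | CosyVoice2 |
|---|---|---|---|---|
| en-to-zh | 5.66 | 4.77 | 5.09 | 13.5 |
| ja-to-zh | 3.92 | 3.43 | 3.05 | 48.1 |
| ko-to-zh | 1.14 | 1.08 | 1.06 | 7.70 |
| zh-to-en | 2.91 | 2.77 | 2.98 | 6.47 |
| ja-to-en | 3.95 | 3.04 | 4.20 | 17.1 |
| ko-to-en | 3.48 | 3.09 | 4.19 | 11.2 |
| zh-to-ja | 9.29 | 8.40 | 7.08 | 13.1 |
| en-to-ja | 7.74 | 7.21 | 6.80 | 14.9 |
| ko-to-ja | 4.17 | 3.67 | 3.93 | 5.86 |
| zh-to-ko | 8.12 | 4.82 | 14.4 | 24.8 |
| en-to-ko | 6.83 | 5.14 | 5.87 | 21.9 |
| ja-to-ko | 6.86 | 5.59 | 7.92 | 21.5 |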
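
The mixed error rate can be sketched as one function that switches between word-level and character-level units. This is an illustration under stated assumptions, not the benchmark's scoring script: it assumes the metric is chosen by the target language code (the part after `-to-` in each task name), that scores are percentages, and that whitespace is ignored for CER. `mixed_error_rate` is a hypothetical name.

```python
def _edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(reference: str, hypothesis: str, target_language: str) -> float:
    """WER (%) for English targets, CER (%) for every other target language."""
    if target_language == "en":
        ref_units, hyp_units = reference.split(), hypothesis.split()
    else:
        # Character-level units; whitespace is dropped before comparison.
        ref_units = [c for c in reference if not c.isspace()]
        hyp_units = [c for c in hypothesis if not c.isspace()]
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * _edit_distance(ref_units, hyp_units) / len(ref_units)


print(mixed_error_rate("hello world", "hello word", "en"))  # 50.0 (1 of 2 words)
print(mixed_error_rate("你好世界", "你好世间", "zh"))          # 25.0 (1 of 4 characters)
```

Character-level scoring avoids the need for a word segmenter in languages such as Chinese and Japanese, which is the usual reason for reporting CER there.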

Controllable Speech Generation

Performance on the InstructTTSEval benchmark. Scores (↑ higher is better) are reported for its three tasks: APS (Acoustic-Parameter Specification), DSD (Descriptive-Style Directive), and RP (Role-Play).

Target Speaker Control

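| Model | ZH APS | ZH DSD | ZH RP | EN APS | EN DSD | EN RP |
|---|---|---|---|---|---|---|
| Gemini-flash | 88.2 | 90.9 | 77.3 | 92.3 | 93.8 | 80.1 |
| Gemini-pro | 89.0 | 90.1 | 75.5 | 87.6 | 86.0 | 67.2 |
| Qwen3TTS-25Hz-1.7B-CustomVoice | 83.1 | 75.0 | 63.0 | 79.0 | 82.8 | 69.3 |
| Qwen3TTS-12Hz-1.7B-CustomVoice | 83.0 | 77.8 | 61.2 | 77.3 | 77.1 | 63.7 |
| GPT-4o-mini-tts | 54.9 | 52.3 | 46.0 | 76.4 | 74.3 | 54.8 |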

Voice Design

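| Model | ZH APS | ZH DSD | ZH RP | EN APS | EN DSD | EN RP |
|---|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | - | - | - |
| Hume | - | - | - | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | - | - | - | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | - | - | - | 60.0 | 45.9 | 31.2 |
| PromptTTS | - | - | - | 64.3 | 47.2 | 31.4 |
| PromptStyle | - | - | - | 57.4 | 46.4 | 30.9 |

Qwen3-TTS-12Hz-1.7B-VoiceDesign (listed as Qwen3TTS-12Hz-1.7B-VD) has the highest score on five of the six metrics; only Hume's English APS (83.0 vs. 82.9) is higher. This indicates strong instruction following and voice controllability.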

Key Insights

Best-in-class English

Qwen3-TTS-12Hz-1.7B-Base achieves the lowest WER on the Seed-TTS English test set (1.24).

Strong Multilingual

Across 10 languages, a Qwen3-TTS variant has the lowest WER in 6 and the highest speaker similarity in all 10, compared with MiniMax and ElevenLabs.

Cross-lingual Excellence

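Qwen3-TTS-12Hz-1.7B has the lowest error rate on 8 of the 12 cross-lingual pairs, including all three Korean-target pairs (e.g., zh-to-ko: 4.82 vs. 14.4 for CosyVoice3).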

Voice Design Leader

Qwen3-TTS-12Hz-1.7B-VoiceDesign scores highest on 5 of the 6 InstructTTSEval voice design metrics, with voices specified through natural-language instructions.

References

For detailed methodology and citations, see the technical paper.
