Benchmarks

Evaluation Setup

All models were evaluated with the following configuration:

Dtype: torch.bfloat16
Max new tokens: 2048
Sampling parameters: Defaults from checkpoint’s generate_config.json
Language setting:
- language="auto" for the Seed-TTS test set and InstructTTSEval
- Explicit language for other test sets
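The settings above can be captured in a small configuration object. The following is only a sketch: `EvalConfig`, `AUTO_LANGUAGE_SETS`, and `language_for` are hypothetical names chosen for illustration and are not part of any Qwen3-TTS API. The dtype is stored as a string so the snippet runs without PyTorch installed.

```python
from dataclasses import dataclass

# Test sets that the list above says were run with language="auto".
# The lowercase keys are an illustrative convention, not official identifiers.
AUTO_LANGUAGE_SETS = {"seed-tts", "instructttseval"}


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the evaluation settings listed above."""
    dtype: str = "bfloat16"      # stands in for torch.bfloat16
    max_new_tokens: int = 2048
    # Sampling parameters are deliberately absent: per the list above,
    # they are taken from the checkpoint's own generation defaults.


def language_for(test_set: str, explicit_language: str) -> str:
    """Return the language argument to pass for a given test set."""
    if test_set.lower() in AUTO_LANGUAGE_SETS:
        return "auto"
    return explicit_language


config = EvalConfig()
print(config.dtype, config.max_new_tokens)
print(language_for("Seed-TTS", "zh"))       # auto-detected
print(language_for("multilingual", "de"))   # explicit language
```

Keeping the language rule in one function makes it harder to accidentally evaluate a multilingual test set with auto-detection enabled.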

Speech Generation Quality

Seed-TTS Test Set

Zero-shot speech generation measured by Word Error Rate (WER, ↓ lower is better).

Model	test-zh	test-en
Seed-TTS	1.12	2.25
MaskGCT	2.27	2.62
E2 TTS	1.97	2.19
F5-TTS	1.56	1.83
Spark TTS	1.20	1.98
Llasa-8B	1.59	2.97
KALL-E	0.96	1.94
FireRedTTS 2	1.14	1.95
CosyVoice 3	0.71	1.45
MiniMax-Speech	0.83	1.65
Qwen3-TTS-25Hz-0.6B-Base	1.18	1.64
Qwen3-TTS-25Hz-1.7B-Base	1.10	1.49
Qwen3-TTS-12Hz-0.6B-Base	0.92	1.32
Qwen3-TTS-12Hz-1.7B-Base	0.77	1.24

Qwen3-TTS-12Hz-1.7B-Base achieves the lowest WER on test-en (1.24) and the second-lowest on test-zh (0.77), behind only CosyVoice 3 (0.71).
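For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the transcript of the generated speech, divided by the number of reference words. The sketch below assumes the table reports percentages and uses plain whitespace tokenization; the speech recognizer and text normalization used for the actual evaluation are not specified in this section, and `edit_distance` and `wer_percent` are illustrative names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_percent(references, hypotheses):
    """Corpus-level WER in percent: total edits / total reference words."""
    edits = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        edits += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    return 100.0 * edits / words


refs = ["the cat sat on the mat"]
hyps = ["the cat sat on mat"]          # one word dropped out of six
print(round(wer_percent(refs, hyps), 2))
```

Aggregating edits over the whole corpus before dividing weights long utterances more heavily than averaging per-utterance WER would; which of the two the reported numbers use is not stated here.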

Multilingual Performance

Results across 10 languages: WER (↓) measures content consistency, and cosine similarity (↑) measures speaker similarity.

Content Consistency (WER ↓)

Language	Qwen3-TTS-25Hz-0.6B	Qwen3-TTS-25Hz-1.7B	Qwen3-TTS-12Hz-0.6B	Qwen3-TTS-12Hz-1.7B	MiniMax	ElevenLabs
Chinese	1.108	0.777	1.145	0.928	2.252	16.026
English	1.048	1.014	0.836	0.934	2.164	2.339
German	1.501	0.960	1.089	1.235	1.906	0.572
Italian	1.169	1.105	1.534	0.948	1.543	1.743
Portuguese	2.046	1.778	2.254	1.526	1.877	1.331
Spanish	2.031	1.491	1.491	1.126	1.029	1.084
Japanese	4.189	5.121	6.404	3.823	3.519	10.646
Korean	2.852	2.631	1.741	1.755	1.747	1.865
French	2.852	2.631	2.931	2.858	4.099	5.216
Russian	5.957	4.535	4.458	3.212	4.281	3.878

Speaker Similarity (Cosine Similarity ↑)

Language	Qwen3-TTS-25Hz-0.6B	Qwen3-TTS-25Hz-1.7B	Qwen3-TTS-12Hz-0.6B	Qwen3-TTS-12Hz-1.7B	MiniMax	ElevenLabs
Chinese	0.797	0.796	0.811	0.799	0.780	0.677
English	0.811	0.815	0.829	0.775	0.756	0.613
German	0.749	0.737	0.769	0.775	0.733	0.614
Italian	0.722	0.718	0.792	0.817	0.699	0.579
Portuguese	0.790	0.783	0.794	0.817	0.805	0.711
Spanish	0.732	0.731	0.812	0.814	0.762	0.615
Japanese	0.810	0.807	0.798	0.788	0.776	0.738
Korean	0.824	0.814	0.812	0.799	0.779	0.700
French	0.698	0.703	0.700	0.714	0.628	0.535
Russian	0.734	0.744	0.781	0.792	0.761	0.676
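The speaker similarity scores above are cosine similarities, presumably between a speaker embedding of the reference audio and one of the generated audio. The sketch below shows only the similarity computation itself; the embedding model is not named in this section, so the vectors here are made-up placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings; real speaker embeddings have many more dimensions.
reference_embedding = [0.2, 0.7, 0.1]
generated_embedding = [0.25, 0.6, 0.2]
print(round(cosine_similarity(reference_embedding, generated_embedding), 3))
```

Because the score depends only on the direction of the vectors, it is insensitive to overall embedding magnitude, which is why it is a common choice for comparing voices.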

Cross-Lingual Synthesis

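Performance on cross-lingual tasks, in which a speaker of one language is synthesized speaking another (e.g., en-to-zh is an English speaker speaking Chinese). The metric is Mixed Error Rate (↓): WER when the target language is English, CER otherwise.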

Task	Qwen3-TTS-25Hz-1.7B	Qwen3-TTS-12Hz-1.7B	CosyVoice3	CosyVoice2
en-to-zh	5.66	4.77	5.09	13.5
ja-to-zh	3.92	3.43	3.05	48.1
ko-to-zh	1.14	1.08	1.06	7.70
zh-to-en	2.91	2.77	2.98	6.47
ja-to-en	3.95	3.04	4.20	17.1
ko-to-en	3.48	3.09	4.19	11.2
zh-to-ja	9.29	8.40	7.08	13.1
en-to-ja	7.74	7.21	6.80	14.9
ko-to-ja	4.17	3.67	3.93	5.86
zh-to-ko	8.12	4.82	14.4	24.8
en-to-ko	6.83	5.14	5.87	21.9
ja-to-ko	6.86	5.59	7.92	21.5
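The mixed error rate used in this table switches the scoring unit by language: words for English, characters for everything else. A minimal sketch of that switch follows, assuming the unit is chosen by the target language, that scores are percentages, and that English text is lowercased before comparison. `tokenize` and `mixed_error_rate` are hypothetical helper names, and the real evaluation likely applies additional text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def tokenize(text, language):
    """Words for English; individual non-space characters otherwise."""
    if language == "en":
        return text.lower().split()
    return [ch for ch in text if not ch.isspace()]


def mixed_error_rate(reference, hypothesis, target_language):
    """WER (percent) for English targets, CER (percent) for other targets."""
    ref = tokenize(reference, target_language)
    hyp = tokenize(hypothesis, target_language)
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(mixed_error_rate("hello world", "hello word", "en"))   # word-level
print(mixed_error_rate("你好世界", "你好世", "zh"))             # character-level
```

Character-level scoring avoids depending on a word segmenter for Chinese, Japanese, and Korean, where word boundaries are not marked by spaces in the same way as in English.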

Controllable Speech Generation

Performance on the InstructTTSEval benchmark. Metrics: APS (Attribute Perception & Synthesis ↑), DSD (Description-Speech Consistency ↑), RP (Response Precision ↑).

Target Speaker Control

Model	ZH APS	ZH DSD	ZH RP	EN APS	EN DSD	EN RP
Gemini-flash	88.2	90.9	77.3	92.3	93.8	80.1
Gemini-pro	89.0	90.1	75.5	87.6	86.0	67.2
Qwen3TTS-25Hz-1.7B-CustomVoice	83.1	75.0	63.0	79.0	82.8	69.3
Qwen3TTS-12Hz-1.7B-CustomVoice	83.0	77.8	61.2	77.3	77.1	63.7
GPT-4o-mini-tts	54.9	52.3	46.0	76.4	74.3	54.8

Voice Design

Model	ZH APS	ZH DSD	ZH RP	EN APS	EN DSD	EN RP
Qwen3TTS-12Hz-1.7B-VD	85.2	81.1	65.1	82.9	82.4	68.4
Mimo-Audio-7B-Instruct	75.7	74.3	61.5	80.6	77.6	59.5
VoiceSculptor	75.7	64.7	61.5	-	-	-
Hume	-	-	-	83.0	75.3	54.3
VoxInstruct	47.5	52.3	42.6	54.9	57.0	39.3
Parler-tts-mini	-	-	-	63.4	48.7	28.6
Parler-tts-large	-	-	-	60.0	45.9	31.2
PromptTTS	-	-	-	64.3	47.2	31.4
PromptStyle	-	-	-	57.4	46.4	30.9

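Qwen3-TTS-12Hz-1.7B-VoiceDesign (listed above as Qwen3TTS-12Hz-1.7B-VD) has the highest score on every reported voice design metric except English APS, where Hume is marginally ahead (83.0 vs. 82.9), indicating strong instruction following and voice controllability.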

Key Insights

Best-in-class English

Qwen3-TTS-12Hz-1.7B-Base achieves the lowest WER on the Seed-TTS test-en set (1.24).

Strong Multilingual

Across the 10 evaluated languages, a Qwen3-TTS variant has the lowest WER in 6 and the highest speaker similarity in all 10.

Cross-lingual Excellence

Qwen3-TTS-12Hz-1.7B has the lowest error rate on 8 of the 12 cross-lingual tasks, including all three into Korean (e.g., en-to-ko: 5.14 vs. 5.87 for CosyVoice3).

Voice Design Leader

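On InstructTTSEval voice design, which tests speech generation controlled by natural language instructions, Qwen3-TTS-12Hz-1.7B-VoiceDesign has the top score on five of the six reported metrics.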

References

For detailed methodology and citations, see the technical paper.
