Overview

Qwen3-TTS supports speech synthesis in 10 major languages. The models are trained on diverse multilingual data and support cross-lingual voice cloning and generation.

Supported Languages

Chinese

Mandarin (普通话), Beijing dialect, Sichuan dialect

English

American, British, and neutral accents

Japanese

Standard Japanese (標準語)

Korean

Standard Korean (표준어)

German

Standard German (Hochdeutsch)

French

European French (français européen)

Russian

Standard Russian (русский)

Portuguese

Brazilian and European Portuguese

Spanish

European and Latin American Spanish

Italian

Standard Italian (italiano)

Language Quality Comparison

Content Consistency (WER/CER ↓)

Word Error Rate (WER) or Character Error Rate (CER) on a multilingual test set (lower is better):

| Language | 1.7B-Base | 0.6B-Base | Quality Tier |
|---|---|---|---|
| Chinese | 0.928 | 1.145 | ⭐⭐⭐ Excellent |
| English | 0.934 | 0.836 | ⭐⭐⭐ Excellent |
| Korean | 1.755 | 1.741 | ⭐⭐⭐ Excellent |
| German | 1.235 | 1.089 | ⭐⭐ Very Good |
| Italian | 0.948 | 1.534 | ⭐⭐ Very Good |
| Portuguese | 1.526 | 2.254 | ⭐⭐ Very Good |
| Spanish | 1.126 | 1.491 | ⭐⭐ Very Good |
| French | 2.858 | 2.931 | ⭐ Good |
| Russian | 3.212 | 4.458 | ⭐ Good |
| Japanese | 3.823 | 6.404 | ⭐ Good |

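Chinese and English achieve the lowest error rates on the 1.7B model (both below 0.94), making them well suited to production applications that require high precision. Korean is also rated Excellent, although its error rates (about 1.75) are higher than those of German, Italian, and Spanish.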
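
The larger model does not have the lower error rate for every language, so it can help to look the numbers up per language. The following is a minimal sketch that copies the values from the table above into a dictionary; the names `ERROR_RATES`, `lower_error_model`, and `languages_under` are hypothetical helpers for illustration and are not part of the Qwen3-TTS API.

```python
# Error rates copied from the table above, as (1.7B-Base, 0.6B-Base).
ERROR_RATES = {
    "Chinese": (0.928, 1.145),
    "English": (0.934, 0.836),
    "Korean": (1.755, 1.741),
    "German": (1.235, 1.089),
    "Italian": (0.948, 1.534),
    "Portuguese": (1.526, 2.254),
    "Spanish": (1.126, 1.491),
    "French": (2.858, 2.931),
    "Russian": (3.212, 4.458),
    "Japanese": (3.823, 6.404),
}


def lower_error_model(language):
    """Return the model variant with the lower error rate for a language."""
    large, small = ERROR_RATES[language]
    return "1.7B-Base" if large <= small else "0.6B-Base"


def languages_under(threshold, model="1.7B-Base"):
    """Return the languages whose error rate is below the threshold, sorted by name."""
    index = 0 if model == "1.7B-Base" else 1
    return sorted(
        lang for lang, rates in ERROR_RATES.items() if rates[index] < threshold
    )


print(lower_error_model("English"))
print(languages_under(1.0))
```

Error rate is only one criterion; speaker similarity and latency may point to a different model for the same language.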

Speaker Similarity (Cosine Similarity ↑)

Speaker embedding similarity on voice cloning tasks (higher is better):

| Language | 1.7B-Base | 0.6B-Base | Quality Tier |
|---|---|---|---|
| English | 0.775 | 0.829 | ⭐⭐⭐ Excellent |
| Portuguese | 0.817 | 0.794 | ⭐⭐⭐ Excellent |
| Spanish | 0.814 | 0.812 | ⭐⭐⭐ Excellent |
| Italian | 0.817 | 0.792 | ⭐⭐⭐ Excellent |
| Chinese | 0.799 | 0.811 | ⭐⭐⭐ Excellent |
| Korean | 0.799 | 0.812 | ⭐⭐⭐ Excellent |
| Russian | 0.792 | 0.781 | ⭐⭐⭐ Excellent |
| Japanese | 0.788 | 0.798 | ⭐⭐ Very Good |
| German | 0.775 | 0.769 | ⭐⭐ Very Good |
| French | 0.714 | 0.700 | ⭐⭐ Very Good |

All languages reach a speaker similarity of at least 0.70 on both models, indicating strong voice cloning capabilities across the board.
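
For readers unfamiliar with the metric, cosine similarity compares the direction of two embedding vectors and ignores their length. The sketch below shows only the arithmetic on toy vectors; the speaker-embedding model used to produce the scores in the table is not specified here, and `cosine_similarity` is an illustrative helper rather than a Qwen3-TTS function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real speaker embeddings have many more dimensions.
reference = [1.0, 2.0, 3.0]
louder_copy = [2.0, 4.0, 6.0]   # same direction, different magnitude
unrelated = [3.0, 0.0, -1.0]    # orthogonal to the reference

print(cosine_similarity(reference, louder_copy))
print(cosine_similarity(reference, unrelated))
```

Because the measure ignores magnitude, a scaled copy of an embedding scores the same as the original; only a change in direction lowers the score.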

Speaker Native Languages

For CustomVoice models, the following 9 premium speakers are available:

| Speaker | Voice Description | Native Language | Recommended Languages |
|---|---|---|---|
| Vivian | Bright, slightly edgy young female | Chinese | Chinese, English |
| Serena | Warm, gentle young female | Chinese | Chinese, English |
| Uncle_Fu | Seasoned male, low mellow timbre | Chinese | Chinese |
| Dylan | Youthful Beijing male, clear natural | Chinese (Beijing) | Chinese, English |
| Eric | Lively Chengdu male, slightly husky | Chinese (Sichuan) | Chinese |
| Ryan | Dynamic male, strong rhythmic drive | English | English, Chinese |
| Aiden | Sunny American male, clear midrange | English | English |
| Ono_Anna | Playful Japanese female, light nimble | Japanese | Japanese, English |
| Sohee | Warm Korean female, rich emotion | Korean | Korean, English |

For best quality, use each speaker’s native language. However, all speakers can speak any of the 10 supported languages with reasonable quality.
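
The speaker table can be encoded as a small lookup so that a speaker is chosen from the target language. This is a sketch under stated assumptions: the data comes from the table above, while `SPEAKERS`, `native_speakers`, `pick_speaker`, and the `fallback` default are hypothetical names and choices, not part of the library.

```python
# Speaker data from the table above: (native language, recommended languages).
# Dialect speakers (Dylan, Eric) are filed under "Chinese".
SPEAKERS = {
    "Vivian": ("Chinese", ["Chinese", "English"]),
    "Serena": ("Chinese", ["Chinese", "English"]),
    "Uncle_Fu": ("Chinese", ["Chinese"]),
    "Dylan": ("Chinese", ["Chinese", "English"]),
    "Eric": ("Chinese", ["Chinese"]),
    "Ryan": ("English", ["English", "Chinese"]),
    "Aiden": ("English", ["English"]),
    "Ono_Anna": ("Japanese", ["Japanese", "English"]),
    "Sohee": ("Korean", ["Korean", "English"]),
}


def native_speakers(language):
    """Speakers whose native language matches, in table order."""
    return [name for name, (native, _) in SPEAKERS.items() if native == language]


def pick_speaker(language, fallback="Ryan"):
    """Prefer a native speaker, then a recommended one, then the fallback."""
    natives = native_speakers(language)
    if natives:
        return natives[0]
    for name, (_, recommended) in SPEAKERS.items():
        if language in recommended:
            return name
    # No speaker lists this language; the fallback is an arbitrary choice.
    return fallback


print(native_speakers("Chinese"))
print(pick_speaker("Korean"))
print(pick_speaker("German"))
```

For languages without a native or recommended speaker, such as German or French, the sketch simply returns the fallback; which speaker sounds best there is something to test by ear.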

Cross-Lingual Capabilities

Qwen3-TTS supports cross-lingual voice cloning, allowing you to clone a voice in one language and generate speech in another.

Cross-Lingual Performance

Mixed Error Rate (WER for English, CER for others) on a cross-lingual benchmark (lower is better):

| Task | 1.7B-Base | 0.6B-Base | Quality |
|---|---|---|---|
| Korean → English | 3.09 | 3.48 | ⭐⭐⭐ Excellent |
| Japanese → English | 3.04 | 3.95 | ⭐⭐⭐ Excellent |
| English → Chinese | 4.77 | 5.66 | ⭐⭐ Very Good |
| Korean → Japanese | 3.67 | 4.17 | ⭐⭐ Very Good |
| Korean → Chinese | 4.82 | 8.12 | ⭐⭐ Very Good |
| English → Korean | 5.14 | 6.83 | ⭐⭐ Very Good |
| Japanese → Korean | 5.59 | 6.86 | ⭐⭐ Very Good |
| English → Japanese | 7.21 | 7.74 | ⭐ Good |
| Chinese → Japanese | 8.40 | 9.29 | ⭐ Good |

The 1.7B model has a lower error rate than the 0.6B model on every pair listed. The gap is largest for Korean → Chinese (4.82 vs. 8.12) and is also notable when Korean is the target language.
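
To see where the two models differ most, the cross-lingual table can be ranked by the difference between the two columns. The sketch below copies the benchmark values from the table above; `CROSS_LINGUAL`, `gap`, and `pairs_by_gap` are hypothetical names used only for illustration.

```python
# Mixed error rates from the table above, keyed by (source, target),
# stored as (1.7B-Base, 0.6B-Base).
CROSS_LINGUAL = {
    ("Korean", "English"): (3.09, 3.48),
    ("Japanese", "English"): (3.04, 3.95),
    ("English", "Chinese"): (4.77, 5.66),
    ("Korean", "Japanese"): (3.67, 4.17),
    ("Korean", "Chinese"): (4.82, 8.12),
    ("English", "Korean"): (5.14, 6.83),
    ("Japanese", "Korean"): (5.59, 6.86),
    ("English", "Japanese"): (7.21, 7.74),
    ("Chinese", "Japanese"): (8.40, 9.29),
}


def gap(pair):
    """How much higher the 0.6B error rate is than the 1.7B one, in table units."""
    large, small = CROSS_LINGUAL[pair]
    return round(small - large, 2)


def pairs_by_gap():
    """Language pairs ordered from the largest to the smallest model gap."""
    return sorted(CROSS_LINGUAL, key=gap, reverse=True)


for source, target in pairs_by_gap()[:3]:
    print(f"{source} -> {target}: {gap((source, target))}")
```

A large gap suggests the pair benefits most from the 1.7B model; a small gap suggests the 0.6B model may be an acceptable trade-off when resources are limited.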

Cross-Lingual Example

from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base"
)

# Clone English voice and speak Chinese
wavs, sr = model.generate_voice_clone(
    text="你好,很高兴见到你。",  # Chinese text
    language="Chinese",
    ref_audio="english_speaker.wav",  # English reference
    ref_text="Hello, nice to meet you."
)
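
Once audio has been generated, it usually needs to be written to disk for listening tests. The following sketch uses only the Python standard library and rests on assumptions that this page does not confirm: that `wavs[0]` is a one-dimensional sequence of float samples in the range [-1, 1] and that `sr` is an integer sample rate. `save_wav` is a hypothetical helper, and a synthetic tone at an arbitrary 24000 Hz stands in for model output so the example runs without the model.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for model output: 0.1 s of a 440 Hz tone.
# With real output this would be: save_wav(out_path, wavs[0], sr)
demo_sr = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_sr) for n in range(demo_sr // 10)]
out_path = os.path.join(tempfile.gettempdir(), "cross_lingual_demo.wav")
save_wav(out_path, tone, demo_sr)
print(out_path)
```

If the samples turn out to be a NumPy array or a tensor, converting them to a plain list first (or using an audio library) would be the safer route.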

Language-Specific Considerations

Chinese

Strengths:
  • Excellent accuracy (CER ~0.93 for 1.7B)
  • Strong dialect support (Beijing, Sichuan)
  • Native speakers available (Vivian, Serena, Uncle_Fu, Dylan, Eric)
Considerations:
  • Tone accuracy is critical; may occasionally flatten in complex prosody
  • Text input should use simplified or traditional Chinese consistently
  • Pinyin input not officially supported
Best Practices:
wavs, sr = model.generate_custom_voice(
    text="这是一段测试文本,包含了复杂的语调和情感。",
    language="Chinese",  # Always specify
    speaker="Vivian",
    instruct="温柔地说,带有一点兴奋的语气"  # Chinese instructions work best
)

English

Strengths:
  • Excellent accuracy (WER ~0.93 for 1.7B)
  • Multiple native speakers (Ryan, Aiden)
  • Strong cross-lingual source language
Considerations:
  • Accents: models default to neutral/American accent
  • British English: supported but may sound slightly American-influenced
  • Contractions and informal speech handled well
Best Practices:
wavs, sr = model.generate_custom_voice(
    text="I'm really excited to show you what we've built!",
    language="English",
    speaker="Ryan",
    instruct="enthusiastic and energetic, slightly faster pace"
)

Japanese

Strengths:
  • Native speaker available (Ono_Anna)
  • Good speaker similarity (0.788)
Considerations:
  • Higher character error rate (~3.8-6.4%)
  • Pitch accent may not always be perfect
  • Kanji, hiragana, and katakana all supported
Best Practices:
wavs, sr = model.generate_custom_voice(
    text="こんにちは、お元気ですか?今日はいい天気ですね。",
    language="Japanese",
    speaker="Ono_Anna",  # Use native speaker for best quality
    instruct="明るく元気な声で"  # Japanese instructions recommended
)

Korean

Strengths:
  • Excellent accuracy (WER ~1.75)
  • Native speaker available (Sohee)
  • Strong cross-lingual capabilities
Considerations:
  • Hangul input only (no romanization)
  • Handles formal and informal speech
Best Practices:
wavs, sr = model.generate_custom_voice(
    text="안녕하세요, 만나서 반갑습니다.",
    language="Korean",
    speaker="Sohee",
    instruct="따뜻하고 친근한 목소리로"  # Korean instructions
)

German

Strengths:
  • Good accuracy (WER ~1.09-1.24)
  • Good speaker similarity (0.77)
Considerations:
  • Compound words handled well
  • Umlauts (ä, ö, ü) supported
  • May occasionally anglicize pronunciation
Best Practices:
wavs, sr = model.generate_voice_clone(
    text="Guten Tag! Wie geht es Ihnen heute?",
    language="German",
    ref_audio="german_speaker.wav",
    ref_text="Hallo, schön Sie kennenzulernen."
)

French

Strengths:
  • Handles liaison and elision
  • Reasonable accuracy for European languages
Considerations:
  • Moderate error rate (~2.86-2.93)
  • Nasal vowels may be approximated
  • Accents (é, è, ê, etc.) should be included
Best Practices:
wavs, sr = model.generate_voice_clone(
    text="Bonjour ! Comment allez-vous aujourd'hui ?",
    language="French",
    ref_audio="french_speaker.wav",
    ref_text="Enchanté de faire votre connaissance."
)

Russian

Strengths:
  • Strong speaker similarity (0.79)
  • Handles Cyrillic script
Considerations:
  • Moderate error rate (~3.2-4.5)
  • Stress patterns may not always be perfect
  • Cyrillic input required
Best Practices:
wavs, sr = model.generate_voice_clone(
    text="Привет! Как дела?",
    language="Russian",
    ref_audio="russian_speaker.wav",
    ref_text="Здравствуйте, рад познакомиться."
)

Portuguese

Strengths:
  • Excellent speaker similarity (0.817)
  • Supports both Brazilian and European variants
Considerations:
  • Error rate ~1.5-2.3
  • Diacritics (ã, õ, ç) should be included
  • May default to Brazilian pronunciation
Best Practices:
wavs, sr = model.generate_voice_clone(
    text="Olá! Como você está hoje?",
    language="Portuguese",
    ref_audio="portuguese_speaker.wav",
    ref_text="Prazer em conhecê-lo."
)

Spanish

Strengths:
  • Excellent speaker similarity (0.814)
  • Good accuracy (WER ~1.13-1.49)
  • Supports European and Latin American variants
Considerations:
  • Accent marks (á, é, í, ó, ú) should be included
  • ñ character supported
  • May default to Castilian pronunciation
Best Practices:
wavs, sr = model.generate_voice_clone(
    text="¡Hola! ¿Cómo estás hoy?",
    language="Spanish",
    ref_audio="spanish_speaker.wav",
    ref_text="Mucho gusto en conocerte."
)

Italian

Strengths:
  • Excellent speaker similarity (0.817)
  • Good accuracy (WER ~0.95-1.53)
  • Handles double consonants well
Considerations:
  • Accent marks (à, è, é, ì, ò, ù) should be included
  • Regional accents not explicitly supported
Best Practices:
wavs, sr = model.generate_voice_clone(
    text="Ciao! Come stai oggi?",
    language="Italian",
    ref_audio="italian_speaker.wav",
    ref_text="Piacere di conoscerti."
)
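
Two of the considerations above are script requirements: Korean text should be in Hangul rather than romanized, and Russian text should be in Cyrillic. A pre-flight check can catch romanized input before synthesis. This is a minimal sketch using the standard `unicodedata` module; `REQUIRED_SCRIPT` and `uses_required_script` are hypothetical names, and the check is an assumption about what useful validation looks like, not something the library performs.

```python
import unicodedata

# Unicode character-name prefixes for the scripts the notes above require.
REQUIRED_SCRIPT = {"Korean": "HANGUL", "Russian": "CYRILLIC"}


def uses_required_script(text, language):
    """Return True if every letter in the text belongs to the required script.

    Languages without a script requirement always pass. Digits, spaces, and
    punctuation are ignored.
    """
    prefix = REQUIRED_SCRIPT.get(language)
    if prefix is None:
        return True
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False  # nothing to synthesize in the required script
    return all(unicodedata.name(ch, "").startswith(prefix) for ch in letters)


print(uses_required_script("안녕하세요, 만나서 반갑습니다.", "Korean"))
print(uses_required_script("annyeonghaseyo", "Korean"))
print(uses_required_script("Привет! Как дела?", "Russian"))
```

The check is deliberately strict: text that embeds a Latin-script brand name in a Korean or Russian sentence is rejected too, so relax it if such mixed input is expected.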

Automatic Language Detection

Qwen3-TTS supports automatic language detection when language="Auto" is specified:
# Automatic language detection
wavs, sr = model.generate_custom_voice(
    text="Hello, 你好, こんにちは",
    language="Auto",  # Let the model detect the language; mixed text may be inconsistent
    speaker="Vivian"
)
Limitations of Auto mode:
  • May misdetect short phrases or ambiguous text
  • Mixed-language text (code-switching) may be inconsistent
  • For best quality, explicitly specify the language when known
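
When the language is not known in advance, the writing system sometimes identifies it unambiguously, in which case an explicit value can be passed instead of "Auto". The sketch below is a heuristic under stated assumptions, not the detector Qwen3-TTS uses: Hangul maps to Korean, Cyrillic to Russian, and kana (optionally mixed with kanji) to Japanese, while Latin-only, Han-only, and mixed-script text fall back to "Auto". `script_of` and `language_argument` are hypothetical helpers.

```python
import unicodedata

KANA = {"HIRAGANA", "KATAKANA"}


def script_of(ch):
    """Coarse script label derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    for prefix in ("HANGUL", "CYRILLIC", "HIRAGANA", "KATAKANA", "CJK", "LATIN"):
        if name.startswith(prefix):
            return prefix
    return "OTHER"


def language_argument(text):
    """Return an explicit language when the script is unambiguous, else "Auto"."""
    scripts = {script_of(ch) for ch in text if ch.isalpha()}
    if scripts == {"HANGUL"}:
        return "Korean"
    if scripts == {"CYRILLIC"}:
        return "Russian"
    # Kana only occurs in Japanese; kanji may accompany it.
    if scripts & KANA and scripts <= KANA | {"CJK"}:
        return "Japanese"
    # Latin (many languages), Han only (Chinese or Japanese), or mixed scripts.
    return "Auto"


print(language_argument("안녕하세요"))
print(language_argument("こんにちは、お元気ですか?"))
print(language_argument("Hello, 你好, こんにちは"))
```

The result can be passed as the `language` argument of the generation calls shown on this page. Latin-script languages such as German, French, or Spanish cannot be told apart this way, so supplying the language explicitly remains the better option when it is known.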

Multilingual Generation Tips

1. Choose the Right Speaker

Use native speakers for best quality:
  • Chinese text → Vivian, Serena, Uncle_Fu, Dylan, Eric
  • English text → Ryan, Aiden
  • Japanese text → Ono_Anna
  • Korean text → Sohee

2. Specify Language Explicitly

Always specify language when known to avoid detection errors:
language="Chinese"  # Better than language="Auto"

3. Use Native-Language Instructions

Instructions work best in the same language as the content:
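# Chinese text with a Chinese instruction
wavs, sr = model.generate_custom_voice(
    text="你好", language="Chinese", speaker="Vivian", instruct="温柔的语气"
)

# English text with an English instruction
wavs, sr = model.generate_custom_voice(
    text="Hello", language="English", speaker="Ryan", instruct="gentle tone"
)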

4. Test Cross-Lingual Carefully

Cross-lingual quality varies by language pair. Test thoroughly:
  • English → Chinese: Very good
  • Japanese → Chinese: Good
  • Chinese → Japanese: Good (the highest error rate among the benchmarked pairs)

Language Roadmap

Future language support (tentative):
  • Arabic (العربية)
  • Hindi (हिन्दी)
  • Turkish (Türkçe)
  • Vietnamese (Tiếng Việt)
  • Thai (ไทย)
Check the GitHub repository for updates.

Next Steps

Voice Cloning

Learn how to clone voices across languages

Custom Voice

Use premium speakers in different languages

Voice Design

Create voices with language-specific characteristics

Examples

See multilingual code examples
