The Qwen3-TTS repository includes comprehensive example scripts demonstrating various use cases and model types. All examples are available in the examples/ directory on GitHub.
Comprehensive voice cloning examples with the Base model.

Features:

- Voice cloning from reference audio
- Single and batch voice cloning
- Reusable voice clone prompts
- X-vector-only mode
- Multiple clone modes (ICL and x-vector)
Key code snippet:
```python
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you..."

wavs, sr = model.generate_voice_clone(
    text="I am solving the equation: x = [-b ± √(b²-4ac)] / 2a?",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
```
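`generate_voice_clone` returns raw waveforms plus a sample rate. As a minimal sketch of writing the results to disk, assuming `wavs` is a list of mono float arrays in [-1, 1] (a common convention, not confirmed by the docs), the helper below uses only the standard-library `wave` module; the name `save_wav` is mine:

```python
import wave

import numpy as np


def save_wav(path: str, audio: np.ndarray, sr: int) -> None:
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())


# e.g. after generation:
# for i, wav in enumerate(wavs):
#     save_wav(f"clone_{i}.wav", wav, sr)
```

If you already depend on `soundfile` or `torchaudio`, their writers accept the same `(data, sample_rate)` pair directly.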
Encoding audio into discrete codes and decoding it back with the speech tokenizer:

```python
from qwen_tts import Qwen3TTSTokenizer

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)

# Encode from URL
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/tokenizer_demo_1.wav"
enc = tokenizer.encode(audio_url)

# Decode back to audio
wavs, sr = tokenizer.decode(enc)
```
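Assuming the "12Hz" in the tokenizer name denotes its codec frame rate (an inference from the naming, not stated here), a quick back-of-envelope sketch of how many discrete frames a clip produces; the function is illustrative, not part of the qwen_tts API:

```python
import math


def codec_frames(duration_s: float, frame_rate_hz: float = 12.0) -> int:
    """Approximate number of codec frames for a clip at a given frame rate."""
    return math.ceil(duration_s * frame_rate_hz)


print(codec_frames(10.0))  # → 120: a 10-second clip yields ~120 frames at 12 Hz
```

This low frame rate is what keeps token sequences short enough for the language model to generate efficiently.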
Out-of-memory errors

Try reducing the batch size or switching to a smaller model variant (0.6B instead of 1.7B):
```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",  # smaller model
    device_map="cuda:0",
    dtype=torch.float16,  # use fp16 instead of bf16
)
```
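To see why the smaller variant helps, here is a rough weight-only memory estimate (2 bytes per parameter for fp16/bf16; activations and KV cache add more on top, and the parameter counts are read off the model names):

```python
def weight_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    """Rough GiB needed just to hold the model weights (fp16/bf16 = 2 bytes/param)."""
    return params_billions * 1e9 * bytes_per_param / 2**30


print(f"1.7B: {weight_gib(1.7):.1f} GiB")  # → 1.7B: 3.2 GiB
print(f"0.6B: {weight_gib(0.6):.1f} GiB")  # → 0.6B: 1.1 GiB
```

Note that fp16 and bf16 are both 16-bit formats, so the dtype change mostly affects numerical behavior; the memory saving comes from the smaller parameter count.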
FlashAttention installation fails
FlashAttention is optional but improves performance. If installation fails:
```python
# Load without FlashAttention
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",  # omit this line
)
```
Model download is slow
Use ModelScope for faster downloads in Mainland China:
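A sketch of the download, assuming the `modelscope` command-line client (the exact flags may differ by version, and the `--local_dir` path is arbitrary):

```shell
pip install -U modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --local_dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
```

You can then pass the local directory path to `from_pretrained` instead of the Hub model ID.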