Clone any voice from a 3-second reference audio sample
The Base models (Qwen3-TTS-12Hz-1.7B-Base and Qwen3-TTS-12Hz-0.6B-Base) enable rapid voice cloning from just 3 seconds of reference audio. Clone any voice and generate new speech with the same timbre and characteristics.
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

wavs, sr = model.generate_voice_clone(
    text="I am solving the equation: x = [-b ± √(b²-4ac)] / 2a? Nobody can — it's a disaster (◍•͈⌔•͈◍), very sad!",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output.wav", wavs[0], sr)
```
For better performance when generating multiple times with the same voice, create a reusable prompt:
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

# Step 1: Create the voice clone prompt once
prompt_items = model.create_voice_clone_prompt(
    ref_audio=ref_audio,
    ref_text=ref_text,
    x_vector_only_mode=False,
)

# Step 2: Reuse the prompt for multiple generations
sentences = [
    "Sentence A: This is the first test.",
    "Sentence B: Here's another example.",
    "Sentence C: And one more for good measure.",
]
for i, text in enumerate(sentences):
    wavs, sr = model.generate_voice_clone(
        text=text,
        language="English",
        voice_clone_prompt=prompt_items,  # Reuse the same prompt
    )
    sf.write(f"output_{i}.wav", wavs[0], sr)
```
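If you serve many speakers, you can go one step further and memoize the prompts per reference clip. The wrapper class below is a sketch of ours, not part of the qwen_tts API; `make_prompt` stands in for any callable with the same `(ref_audio, ref_text)` keyword signature as `model.create_voice_clone_prompt`:

```python
class VoiceClonePromptCache:
    """Memoize voice-clone prompts keyed by (ref_audio, ref_text).

    Hypothetical helper: `make_prompt` is any callable accepting
    ref_audio= and ref_text= keywords, e.g. a bound
    model.create_voice_clone_prompt method.
    """

    def __init__(self, make_prompt):
        self._make_prompt = make_prompt
        self._cache = {}

    def get(self, ref_audio, ref_text):
        key = (ref_audio, ref_text)
        if key not in self._cache:
            # Build the prompt only on the first request for this reference
            self._cache[key] = self._make_prompt(ref_audio=ref_audio, ref_text=ref_text)
        return self._cache[key]
```

With this in place, `cache.get(ref_audio, ref_text)` can be passed as `voice_clone_prompt=` and each reference clip is processed exactly once, however many generations reuse it.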
In-Context Learning (ICL) mode uses both the reference audio codes and the speaker embedding:
```python
wavs, sr = model.generate_voice_clone(
    text="Your new text",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,         # Text is REQUIRED
    x_vector_only_mode=False,  # ICL mode (default)
)
```
Advantages:

- Higher quality cloning
- Better preservation of voice characteristics
- More natural prosody

Requirements:

- Must provide ref_text (a transcript of the reference audio)
X-vector-only mode uses only the speaker embedding (x-vector), without the reference audio codes:
```python
wavs, sr = model.generate_voice_clone(
    text="Your new text",
    language="English",
    ref_audio=ref_audio,
    ref_text=None,             # Text NOT required
    x_vector_only_mode=True,   # Only use speaker embedding
)
```
Advantages:

- No need for reference text
- Faster processing

Disadvantages:

- Lower cloning quality
- Less accurate voice characteristics
ICL mode (x_vector_only_mode=False) is strongly recommended for best quality. Only use x-vector mode when you cannot provide reference text.
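This decision can be centralized in a small helper that assembles the keyword arguments for `generate_voice_clone()`, preferring ICL mode and falling back to x-vector-only mode when no transcript is available. The function name and structure are ours, not part of the qwen_tts API:

```python
def build_clone_kwargs(text, language, ref_audio, ref_text=None):
    """Build kwargs for model.generate_voice_clone(), preferring ICL mode.

    Hypothetical helper: ICL mode needs a transcript, so when ref_text is
    missing we fall back to x-vector-only mode (at lower cloning quality).
    """
    kwargs = {"text": text, "language": language, "ref_audio": ref_audio}
    if ref_text is not None:
        # Best quality: ICL mode with the reference transcript
        kwargs.update(ref_text=ref_text, x_vector_only_mode=False)
    else:
        # Fallback: speaker embedding only
        kwargs.update(ref_text=None, x_vector_only_mode=True)
    return kwargs
```

Usage would then be `model.generate_voice_clone(**build_clone_kwargs(...))`, so callers never pick an inconsistent `ref_text` / `x_vector_only_mode` combination.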
You can also clone several voices in a single batch by passing lists of reference audios, reference texts, and target texts:

```python
ref_audios = [
    "https://example.com/voice1.wav",
    "https://example.com/voice2.wav",
]
ref_texts = [
    "Reference text for voice one.",
    "Reference text for voice two.",
]

# Create prompts for both voices
prompt_items = model.create_voice_clone_prompt(
    ref_audio=ref_audios,
    ref_text=ref_texts,
)

# Generate with different voices
wavs, sr = model.generate_voice_clone(
    text=["Text in voice one.", "Text in voice two."],
    language=["English", "English"],
    voice_clone_prompt=prompt_items,
)
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)
```
Create a custom voice with VoiceDesign, then clone it for consistent character voices:
```python
# Step 1: Design a voice
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
ref_text = "Hello, I'm your new virtual assistant."
ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct="Female, 30s, professional and friendly tone, clear articulation",
)

# Step 2: Clone the designed voice
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
voice_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)

# Step 3: Use for all future generations
wavs, sr = clone_model.generate_voice_clone(
    text="How can I help you today?",
    language="English",
    voice_clone_prompt=voice_prompt,
)
```
See the Voice Design guide for more details on this workflow.
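The three steps above can be sketched as one reusable function that takes the two models as arguments. The function name and structure are ours, assuming only the `generate_voice_design`, `create_voice_clone_prompt`, and `generate_voice_clone` calls shown in this guide:

```python
def design_then_clone(design_model, clone_model, instruct, ref_text, texts,
                      language="English"):
    """Design a voice once, then clone it for every entry in `texts`.

    Hypothetical orchestration helper around the calls shown above;
    returns a list of (waveform, sample_rate) pairs, one per text.
    """
    # Step 1: synthesize a reference clip in the designed voice
    ref_wavs, sr = design_model.generate_voice_design(
        text=ref_text, language=language, instruct=instruct,
    )
    # Step 2: build a reusable voice-clone prompt from that clip
    prompt = clone_model.create_voice_clone_prompt(
        ref_audio=(ref_wavs[0], sr), ref_text=ref_text,
    )
    # Step 3: generate each line in the designed voice
    outputs = []
    for text in texts:
        wavs, sr = clone_model.generate_voice_clone(
            text=text, language=language, voice_clone_prompt=prompt,
        )
        outputs.append((wavs[0], sr))
    return outputs
```

Because the design step runs once and the prompt is reused, every generated line keeps the same character voice across an arbitrarily long script.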