## Overview
The voice cloning workflow extracts features from reference audio once and reuses them across multiple generations. The `create_voice_clone_prompt()` method and the `VoiceClonePromptItem` dataclass provide this functionality.
Key benefits:
- Extract voice features once, generate many times
- Avoid redundant audio processing
- Enable batch voice cloning with different voices
## VoiceClonePromptItem

A dataclass holding extracted voice features, passed to `Qwen3TTSForConditionalGeneration.generate(..., voice_clone_prompt=...)`.

Source: `qwen_tts/inference/qwen3_tts_model.py:40-52`
### Fields

- `ref_code` — Reference audio codes extracted by the speech tokenizer. Shape: `(T, Q)` or `(T,)` depending on the tokenizer (25 Hz / 12 Hz). `None` when `x_vector_only_mode=True`.
- `ref_spk_embedding` — Speaker embedding vector extracted from the reference audio. Shape: `(D,)`, where D is the embedding dimension.
- `x_vector_only_mode` — Whether to use the speaker embedding only (ignores `ref_code` and `ref_text`). `True`: X-vector only mode, only the speaker embedding is used. `False`: ICL mode, uses both the embedding and the reference codes/text.
- `icl_mode` — Whether ICL (In-Context Learning) mode is enabled. Always the inverse of `x_vector_only_mode`: `True` when `x_vector_only_mode=False`, `False` when `x_vector_only_mode=True`.
- `ref_text` — Transcription of the reference audio. Required when `icl_mode=True` (`x_vector_only_mode=False`); ignored when `x_vector_only_mode=True`.
### Example
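The actual class lives at the source path above; as an illustration of how the fields fit together, here is a minimal stand-in dataclass (hypothetical, not the library's definition) mirroring the field list:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for VoiceClonePromptItem. Field names come from the
# Fields list above; the types and defaults are assumptions for the sketch.
@dataclass
class VoiceClonePromptSketch:
    ref_spk_embedding: list            # speaker embedding, shape (D,)
    ref_code: Optional[list] = None    # tokenizer codes; None in x-vector-only mode
    ref_text: Optional[str] = None     # transcript; required for ICL mode
    x_vector_only_mode: bool = False

    @property
    def icl_mode(self) -> bool:
        # icl_mode is always the inverse of x_vector_only_mode
        return not self.x_vector_only_mode

# ICL-mode item: codes and transcript travel with the embedding
icl_item = VoiceClonePromptSketch(
    ref_spk_embedding=[0.1, 0.2],
    ref_code=[12, 87, 3],
    ref_text="hello there",
)
assert icl_item.icl_mode

# X-vector-only item: embedding alone; ref_code/ref_text stay None
xvec_item = VoiceClonePromptSketch(
    ref_spk_embedding=[0.1, 0.2],
    x_vector_only_mode=True,
)
assert not xvec_item.icl_mode
```

The invariant `icl_mode == not x_vector_only_mode` matches the field description above.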
## create_voice_clone_prompt()

Source: `qwen_tts/inference/qwen3_tts_model.py:355-458`
### Modes

#### X-vector Only Mode (`x_vector_only_mode=True`)

- Only the speaker embedding is used to clone the voice
- `ref_text` and `ref_code` are ignored
- Mutually exclusive with ICL mode
- Faster and simpler, but may be less accurate

#### ICL Mode (`x_vector_only_mode=False`)

- ICL (In-Context Learning) mode is enabled automatically
- Both the speaker embedding and the reference codes/text are used
- `ref_text` is required in this mode
- More accurate voice cloning
### Parameters

**`ref_audio`** — Reference audio(s) used to extract:

- `ref_code` via `model.speech_tokenizer.encode(...)`
- `ref_spk_embedding` via `model.extract_speaker_embedding(...)` (resampled to 24 kHz)

Accepted forms:

- `str`: local wav path, URL, or base64 audio string — e.g. `"reference.wav"`, `"https://example.com/audio.wav"`
- `(np.ndarray, sr)`: tuple of waveform and sampling rate — e.g. `(audio_array, 24000)`
- A list of the above — e.g. `["ref1.wav", "ref2.wav"]`

**`ref_text`** — Reference transcript(s): the transcription of the reference audio. Required when `x_vector_only_mode=False` (ICL mode). Can be:

- A single string (applied to all audio)
- A list of strings matching the length of `ref_audio`
- `None` or an empty string when `x_vector_only_mode=True`

**`x_vector_only_mode`** — Whether to use the speaker embedding only. `True`: X-vector only mode (no `ref_text` required); `False`: ICL mode (requires `ref_text`). Can be:

- A single boolean (applied to all audio)
- A list of booleans matching the length of `ref_audio`
### Batch Behavior

- `ref_audio` can be a single item or a list
- `ref_text` and `x_vector_only_mode` can be scalars or lists
- If any of them are lists with length > 1, all lists must match in length
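The rules above can be sketched as a small normalization helper (hypothetical code, not the library's implementation): scalars are broadcast to the audio list's length, and mismatched list lengths raise `ValueError`.

```python
def broadcast_batch(ref_audio, ref_text=None, x_vector_only_mode=False):
    """Sketch of the documented batch rules (not the library implementation)."""
    audios = ref_audio if isinstance(ref_audio, list) else [ref_audio]
    n = len(audios)

    def expand(value, name):
        # Scalars apply to every audio; lists must match ref_audio's length.
        if isinstance(value, list):
            if len(value) > 1 and len(value) != n:
                raise ValueError(f"{name} length {len(value)} != ref_audio length {n}")
            return value if len(value) == n else value * n
        return [value] * n

    texts = expand(ref_text, "ref_text")
    modes = expand(x_vector_only_mode, "x_vector_only_mode")

    # ICL mode (x_vector_only_mode=False) requires a transcript per audio.
    for text, xvec_only in zip(texts, modes):
        if not xvec_only and not text:
            raise ValueError("ref_text is required when x_vector_only_mode=False")
    return list(zip(audios, texts, modes))
```

For example, two reference files with a single shared transcript broadcast cleanly, while a three-item `ref_text` list against two audios fails the length check, matching the documented `ValueError`.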
### Returns

A list of prompt items that can be passed to `generate_voice_clone(voice_clone_prompt=...)`. Each item contains extracted voice features ready for synthesis.

### Raises

- `ValueError` — if `x_vector_only_mode=False` but `ref_text` is missing
- `ValueError` — if batch lengths mismatch
- `ValueError` — if the model is not a Base model
### Example
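A sketch of typical usage, assuming a loaded Base model `model` that exposes the methods documented on this page; the `text=` argument name for the synthesis input and the file names are illustrative placeholders:

```python
def clone_and_speak(model, texts):
    # Extract voice features once from the reference clip (ICL mode, so a
    # transcript of the reference audio is required).
    prompts = model.create_voice_clone_prompt(
        ref_audio="reference.wav",
        ref_text="transcript of reference.wav",
        x_vector_only_mode=False,
    )
    # Reuse the same extracted prompt items for every generation.
    return [
        model.generate_voice_clone(text=t, voice_clone_prompt=prompts)
        for t in texts
    ]
```

Because the prompt is built once outside the loop, the reference audio is encoded only a single time no matter how many texts are synthesized.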
## Audio Input Formats

Both `create_voice_clone_prompt()` and `generate_voice_clone()` support flexible audio input:
- Local file path
- URL
- Base64-encoded audio string
- NumPy array (with sampling rate)
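The four forms can be constructed as follows (the file name and URL are placeholders, and the synthetic byte string stands in for real WAV data):

```python
import base64

import numpy as np

# Local file path and URL are plain strings (placeholders shown).
path_input = "reference.wav"
url_input = "https://example.com/audio.wav"

# Base64: encode the raw audio file bytes as an ASCII string.
wav_bytes = b"RIFF....WAVEfmt "  # stand-in for open("reference.wav", "rb").read()
b64_input = base64.b64encode(wav_bytes).decode("ascii")

# NumPy array: pass a (waveform, sampling_rate) tuple.
sr = 24000
t = np.linspace(0, 1.0, sr, endpoint=False)
waveform = (0.1 * np.sin(2 * np.pi * 220.0 * t)).astype(np.float32)
array_input = (waveform, sr)

# Any of these can be passed as ref_audio, singly or in a list, e.g.:
# prompts = model.create_voice_clone_prompt(ref_audio=[path_input, array_input], ...)
```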
## Best Practices

### 1. Reuse Prompts

Create prompts once and reuse them for multiple generations.

### 2. Choose the Right Mode
- ICL Mode (`x_vector_only_mode=False`): better quality, requires reference text
- X-vector Only Mode (`x_vector_only_mode=True`): faster, no reference text needed
### 3. Reference Audio Quality
- Use clean, high-quality reference audio (minimal background noise)
- 3-10 seconds of speech is usually sufficient
- Ensure the reference text exactly matches the audio (for ICL mode)
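Before extracting a prompt, it can help to sanity-check the reference clip against these guidelines. A sketch with NumPy follows; the 3-10 s window comes from the list above, while the amplitude thresholds are illustrative assumptions:

```python
import numpy as np

def check_reference(waveform: np.ndarray, sr: int) -> list[str]:
    """Flag common reference-audio problems (rule-of-thumb checks)."""
    warnings = []
    duration = len(waveform) / sr
    if not 3.0 <= duration <= 10.0:
        warnings.append(f"duration {duration:.1f}s outside the 3-10s guideline")
    peak = float(np.max(np.abs(waveform))) if len(waveform) else 0.0
    if peak >= 1.0:
        warnings.append("clipping: peak amplitude at or above 1.0")
    if peak < 0.01:
        warnings.append("very quiet: peak amplitude below 0.01")
    return warnings

# A 5-second, well-leveled clip passes all checks:
sr = 24000
clip = 0.3 * np.sin(2 * np.pi * 220 * np.linspace(0, 5, 5 * sr, endpoint=False))
assert check_reference(clip, sr) == []
```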
## See Also
- generate_voice_clone() - Use prompts for voice cloning
- Qwen3TTSModel - Main model class
- Voice Cloning Guide - Complete tutorial