Overview
The Qwen3TTSModel class provides three generation methods, each designed for a specific model type:
- generate_custom_voice() - For CustomVoice models using predefined speakers
- generate_voice_design() - For VoiceDesign models using natural-language instructions
- generate_voice_clone() - For Base models using reference audio

All three return a tuple (wavs: List[np.ndarray], sample_rate: int).
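A minimal sketch of handling that return contract. The synthetic wavs below stand in for real model output; only the shapes and dtypes follow the documented tuple:

```python
import numpy as np

# Stand-in for a real (wavs, sample_rate) result: a list of float32
# arrays (one per input text) plus an integer sample rate.
wavs = [np.zeros(24000, dtype=np.float32), np.zeros(12000, dtype=np.float32)]
sample_rate = 24000

# Duration of each generated clip in seconds.
durations = [len(w) / sample_rate for w in wavs]  # [1.0, 0.5]

# Batched outputs can be joined into one track, e.g. with a 200 ms pause.
pause = np.zeros(int(0.2 * sample_rate), dtype=np.float32)
track = np.concatenate([wavs[0], pause, wavs[1]])
```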
generate_custom_voice
qwen_tts/inference/qwen3_tts_model.py:731-839
Parameters
- Text(s) to synthesize. Can be a single string or a list of strings for batch generation. Example: "Hello, world!" or ["Hello", "Goodbye"]
- Speaker name(s). Validated against model.get_supported_speakers() (case-insensitive). Can be a single speaker or a list matching the length of text. Example: "aurora" or ["aurora", "nova"]
- Language(s) for each sample. If None, defaults to "Auto" for all samples. Can be a single language (applied to all texts) or a list matching the length of text. Example: "English" or ["English", "Chinese"]
- Optional instruction(s) to control speaking style. If None, treated as empty (no instruction). Note: not supported for 0.6B models (ignored if provided). Example: "Speak with excitement" or ["", "Speak slowly"] (empty string = no instruction)
- Whether to use non-streaming text input. When set to False, simulates streaming text input (does not enable true streaming generation).

Generation Parameters
- do_sample: Whether to use sampling. Recommended to be set to True for most use cases.
- top_k: Top-k sampling parameter. Only the top k most likely tokens are considered.
- top_p: Top-p (nucleus) sampling parameter. Keeps the smallest set of tokens whose cumulative probability exceeds p.
- temperature: Sampling temperature. Higher values (e.g., 1.2) make output more random; lower values (e.g., 0.7) make it more deterministic.
- repetition_penalty: Penalty to reduce repeated tokens/codes. Values > 1.0 discourage repetition.
- subtalker_dosample: Sampling switch for the sub-talker. Only valid for qwen3-tts-tokenizer-v2.
- subtalker_top_k: Top-k for sub-talker sampling. Only valid for qwen3-tts-tokenizer-v2.
- subtalker_top_p: Top-p for sub-talker sampling. Only valid for qwen3-tts-tokenizer-v2.
- subtalker_temperature: Temperature for sub-talker sampling. Only valid for qwen3-tts-tokenizer-v2.
- max_new_tokens: Maximum number of new codec tokens to generate.
- **kwargs: Any other keyword arguments supported by HuggingFace Transformers generate() are forwarded to the underlying Qwen3TTSForConditionalGeneration.generate(...).

Returns
- wavs: List of generated waveforms as float32 numpy arrays, one per input text.
- sample_rate: Sample rate of the generated audio (typically 24000 Hz).
Raises
- ValueError: If any speaker/language is unsupported or batch sizes mismatch
- ValueError: If the model is not a CustomVoice model
Example
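A usage sketch, not a verbatim example: the checkpoint path and model constructor are placeholders, and the speaker/language/instruct keyword names are inferred from the parameter descriptions above. The save_wav helper uses only the standard library:

```python
import wave
import numpy as np

def save_wav(path: str, wav: np.ndarray, sample_rate: int) -> None:
    """Write a mono float32 waveform in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(wav, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

def demo_custom_voice() -> None:
    """Batch call to generate_custom_voice(); requires qwen_tts installed and
    a CustomVoice checkpoint (import path and checkpoint are assumptions)."""
    from qwen_tts import Qwen3TTSModel  # assumed import path

    model = Qwen3TTSModel("path/to/custom-voice-checkpoint")  # placeholder
    wavs, sample_rate = model.generate_custom_voice(
        text=["Hello, world!", "Goodbye!"],
        speaker="aurora",                  # applied to every text in the batch
        language="English",
        instruct="Speak with excitement",  # ignored by 0.6B models
        do_sample=True,
        temperature=0.9,
    )
    for i, wav in enumerate(wavs):
        save_wav(f"custom_voice_{i}.wav", wav, sample_rate)
```

demo_custom_voice() is deliberately not run here; save_wav works standalone for writing the returned waveforms to disk.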
generate_voice_design
qwen_tts/inference/qwen3_tts_model.py:636-728
Parameters
- Text(s) to synthesize. Can be a single string or a list of strings for batch generation.
- Instruction(s) describing the desired voice/style. An empty string is allowed (treated as no instruction). Example: "A professional female voice with a warm tone", "A deep male voice speaking slowly", "" (empty = no instruction)
- Language(s) for each sample. If None, defaults to "Auto" for all samples. Can be a single language (applied to all texts) or a list matching the length of text.
- Whether to use non-streaming text input. When set to False, simulates streaming text input (does not enable true streaming generation).

Generation Parameters
Same generation parameters as generate_custom_voice(): do_sample, top_k, top_p, temperature, repetition_penalty, subtalker_dosample, subtalker_top_k, subtalker_top_p, subtalker_temperature, max_new_tokens, and **kwargs.
Returns
- wavs: List of generated waveforms as float32 numpy arrays.
- sample_rate: Sample rate of the generated audio.
Raises
- ValueError: If batch sizes mismatch or the model is not a VoiceDesign model
Example
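A usage sketch under the same caveats as above: the checkpoint path is a placeholder, and the text/instruct/language keyword names are inferred from the parameter descriptions. The broadcast_arg helper is illustrative (not library code) and mirrors the documented single-or-list convention:

```python
from typing import List, Union

def broadcast_arg(value: Union[str, List[str]], batch_size: int) -> List[str]:
    """Apply the documented convention: a single value is applied to every
    text; a list must match the batch size."""
    if isinstance(value, str):
        return [value] * batch_size
    if len(value) != batch_size:
        raise ValueError(f"expected {batch_size} items, got {len(value)}")
    return list(value)

def demo_voice_design() -> None:
    """Sketch of generate_voice_design(); requires qwen_tts installed and a
    VoiceDesign checkpoint (import path and checkpoint are assumptions)."""
    from qwen_tts import Qwen3TTSModel  # assumed import path

    model = Qwen3TTSModel("path/to/voice-design-checkpoint")  # placeholder
    texts = ["Welcome aboard.", "Please fasten your seatbelt."]
    wavs, sample_rate = model.generate_voice_design(
        text=texts,
        instruct="A professional female voice with a warm tone",
        language=broadcast_arg("English", len(texts)),
    )
```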
generate_voice_clone
Provide either:
- (ref_audio, ref_text, x_vector_only_mode) and let this method build the prompt, OR
- voice_clone_prompt as a list of VoiceClonePromptItem returned by create_voice_clone_prompt(), OR
- voice_clone_prompt as a dict (advanced usage)
qwen_tts/inference/qwen3_tts_model.py:469-633
Parameters
- Text(s) to synthesize. Can be a single string or a list of strings for batch generation.
- Language(s) for each sample. If None, defaults to "Auto" for all samples.
- ref_audio: Reference audio(s) for prompt building. Required if voice_clone_prompt is not provided. Supported formats: str (local wav path, URL, or base64 audio string), (np.ndarray, sr) tuple of waveform + sampling rate, or a list of the above. Example: "reference.wav", "https://example.com/audio.wav", (audio_array, 24000), ["ref1.wav", "ref2.wav"]
- ref_text: Reference text(s), the transcription of the reference audio. Required when x_vector_only_mode=False (ICL mode). Example: "This is the reference text" or ["Reference 1", "Reference 2"]
- x_vector_only_mode: If True, only the speaker embedding is used (ref_text/ref_code are ignored). If False, ICL mode is used automatically (requires ref_text). Can be a single boolean or a list matching the batch size.
- voice_clone_prompt: List of VoiceClonePromptItem from create_voice_clone_prompt(), or a dict for advanced usage. If provided, ref_audio, ref_text, and x_vector_only_mode are ignored.
- Whether to use non-streaming text input. When set to False, simulates streaming text input (does not enable true streaming generation).

Generation Parameters
Same generation parameters as generate_custom_voice(): do_sample, top_k, top_p, temperature, repetition_penalty, subtalker_dosample, subtalker_top_k, subtalker_top_p, subtalker_temperature, max_new_tokens, and **kwargs.
Returns
- wavs: List of generated waveforms as float32 numpy arrays.
- sample_rate: Sample rate of the generated audio.
Raises
- ValueError: If batch sizes mismatch or required prompt inputs are missing
- ValueError: If the model is not a Base model
Example
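A usage sketch: the checkpoint path is a placeholder, and attaching create_voice_clone_prompt() to the model object is an assumption (the docs above do not state where it lives). The validate_clone_args helper is illustrative (not library code) and mirrors the documented preconditions:

```python
def validate_clone_args(ref_audio, ref_text, x_vector_only_mode,
                        voice_clone_prompt=None):
    """Illustrative check of the documented rules: either pass a prebuilt
    voice_clone_prompt, or pass ref_audio, plus ref_text when
    x_vector_only_mode is False (ICL mode)."""
    if voice_clone_prompt is not None:
        return  # ref_audio / ref_text / x_vector_only_mode are ignored
    if ref_audio is None:
        raise ValueError("ref_audio is required when voice_clone_prompt "
                         "is not provided")
    if not x_vector_only_mode and ref_text is None:
        raise ValueError("ref_text is required in ICL mode "
                         "(x_vector_only_mode=False)")

def demo_voice_clone() -> None:
    """Sketch of generate_voice_clone(); requires qwen_tts installed and a
    Base checkpoint (import path and checkpoint are assumptions)."""
    from qwen_tts import Qwen3TTSModel  # assumed import path

    model = Qwen3TTSModel("path/to/base-checkpoint")  # placeholder

    # Option 1: let the method build the prompt from raw references
    # (ICL mode, so ref_text is required).
    wavs, sample_rate = model.generate_voice_clone(
        text="Nice to meet you.",
        ref_audio="reference.wav",
        ref_text="This is the reference text",
        x_vector_only_mode=False,
    )

    # Option 2: build the prompt once and reuse it across calls
    # (assumed here to be a model method; see Voice Clone Prompt).
    prompt = model.create_voice_clone_prompt(
        ref_audio="reference.wav",
        ref_text="This is the reference text",
    )
    wavs2, _ = model.generate_voice_clone(
        text=["Line one.", "Line two."],
        voice_clone_prompt=prompt,
    )
```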
See Also
- Voice Clone Prompt - Build and reuse voice clone prompts
- Qwen3TTSModel - Main model class documentation
- Voice Cloning Guide - Complete voice cloning tutorial