Overview

The Qwen3TTSModel provides three generation methods, each designed for a specific model type:
  • generate_custom_voice() - For CustomVoice models using predefined speakers
  • generate_voice_design() - For VoiceDesign models using natural-language instructions
  • generate_voice_clone() - For Base models using reference audio
All methods return the same format: (wavs: List[np.ndarray], sample_rate: int)

generate_custom_voice

generate_custom_voice(
    text: Union[str, List[str]],
    speaker: Union[str, List[str]],
    language: Union[str, List[str]] = None,
    instruct: Optional[Union[str, List[str]]] = None,
    non_streaming_mode: bool = True,
    **kwargs
) -> Tuple[List[np.ndarray], int]
Generate speech with the CustomVoice model using a predefined speaker ID, optionally controlled by instruction text.
Model Type: CustomVoice only
Source: qwen_tts/inference/qwen3_tts_model.py:731-839

Parameters

text
Union[str, List[str]]
required
Text(s) to synthesize. Can be a single string or a list of strings for batch generation. Example:
  • "Hello, world!"
  • ["Hello", "Goodbye"]
speaker
Union[str, List[str]]
required
Speaker name(s). Will be validated against model.get_supported_speakers() (case-insensitive). Can be a single speaker or a list matching the length of text. Example:
  • "aurora"
  • ["aurora", "nova"]
language
Union[str, List[str]]
default:"None"
Language(s) for each sample. If None, defaults to "Auto" for all samples. Can be a single language (applied to all texts) or a list matching the length of text. Example:
  • "English"
  • ["English", "Chinese"]
instruct
Optional[Union[str, List[str]]]
default:"None"
Optional instruction(s) to control speaking style. If None, treated as empty (no instruction). Note: not supported for 0.6B models (will be ignored if provided). Example:
  • "Speak with excitement"
  • ["", "Speak slowly"] (empty string = no instruction)
non_streaming_mode
bool
default:"True"
Whether to treat the text input as non-streaming. When set to False, the method simulates streaming text input (it does not enable true streaming generation).
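The per-sample parameters above (speaker, language, instruct) share one broadcasting rule: a single value applies to every text, a list must match the batch size, and None falls back to a default. A hypothetical helper sketching that rule (this is an illustration of the documented behavior, not the library's internal implementation):

```python
from typing import List, Optional, Union


def broadcast_param(value: Optional[Union[str, List[str]]], n: int, default: str = "") -> List[str]:
    """Expand a scalar or per-sample parameter to a list of length n.

    None -> default for all samples; a single string applies to all texts;
    a list must match the batch size or a ValueError is raised.
    """
    if value is None:
        return [default] * n
    if isinstance(value, str):
        return [value] * n
    if len(value) != n:
        raise ValueError(f"Expected {n} values, got {len(value)}")
    return list(value)
```

For example, passing language="English" with two texts behaves like ["English", "English"], while language=None behaves like ["Auto", "Auto"].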

Generation Parameters

do_sample
bool
default:"True"
Whether to use sampling. Setting this to True is recommended for most use cases.
top_k
int
default:"50"
Top-k sampling parameter. Only the top k most likely tokens are considered.
top_p
float
default:"1.0"
Top-p (nucleus) sampling parameter. Keeps the smallest set of tokens whose cumulative probability exceeds p.
temperature
float
default:"0.9"
Sampling temperature. Higher values (e.g., 1.2) make output more random; lower values (e.g., 0.7) make it more deterministic.
repetition_penalty
float
default:"1.05"
Penalty to reduce repeated tokens/codes. Values > 1.0 discourage repetition.
subtalker_dosample
bool
default:"True"
Sampling switch for the sub-talker. Only valid for qwen3-tts-tokenizer-v2.
subtalker_top_k
int
default:"50"
Top-k for sub-talker sampling. Only valid for qwen3-tts-tokenizer-v2.
subtalker_top_p
float
default:"1.0"
Top-p for sub-talker sampling. Only valid for qwen3-tts-tokenizer-v2.
subtalker_temperature
float
default:"0.9"
Temperature for sub-talker sampling. Only valid for qwen3-tts-tokenizer-v2.
max_new_tokens
int
default:"2048"
Maximum number of new codec tokens to generate.
**kwargs
Any
Any other keyword arguments supported by HuggingFace Transformers generate() will be forwarded to the underlying Qwen3TTSForConditionalGeneration.generate(...).

Returns

wavs
List[np.ndarray]
List of generated waveforms as float32 numpy arrays. One array per input text.
sample_rate
int
Sample rate of the generated audio (typically 24000 Hz).

Raises

  • ValueError - If any speaker/language is unsupported or batch sizes mismatch
  • ValueError - If the model is not a CustomVoice model

Example

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load CustomVoice model
model = Qwen3TTSModel.from_pretrained(
    "Qwen3-TTS-CustomVoice-2B",
    device_map="cuda:0"
)

# Single generation
wavs, sr = model.generate_custom_voice(
    text="Hello, how are you today?",
    speaker="aurora",
    language="English"
)

# Save audio
sf.write("output.wav", wavs[0], sr)

# Batch generation
wavs, sr = model.generate_custom_voice(
    text=["Hello!", "Goodbye!"],
    speaker="aurora",
    language="English",
    temperature=0.8
)

# With instruction (2B+ models)
wavs, sr = model.generate_custom_voice(
    text="Welcome to our service.",
    speaker="aurora",
    language="English",
    instruct="Speak with a professional tone"
)

generate_voice_design

generate_voice_design(
    text: Union[str, List[str]],
    instruct: Union[str, List[str]],
    language: Union[str, List[str]] = None,
    non_streaming_mode: bool = True,
    **kwargs
) -> Tuple[List[np.ndarray], int]
Generate speech with the VoiceDesign model using natural-language style instructions.
Model Type: VoiceDesign only
Source: qwen_tts/inference/qwen3_tts_model.py:636-728

Parameters

text
Union[str, List[str]]
required
Text(s) to synthesize. Can be a single string or a list of strings for batch generation.
instruct
Union[str, List[str]]
required
Instruction(s) describing the desired voice/style. An empty string is allowed (treated as no instruction). Example:
  • "A professional female voice with a warm tone"
  • "A deep male voice speaking slowly"
  • "" (empty = no instruction)
language
Union[str, List[str]]
default:"None"
Language(s) for each sample. If None, defaults to "Auto" for all samples. Can be a single language (applied to all texts) or a list matching the length of text.
non_streaming_mode
bool
default:"True"
Whether to treat the text input as non-streaming. When set to False, the method simulates streaming text input (it does not enable true streaming generation).

Generation Parameters

Same generation parameters as generate_custom_voice(): do_sample, top_k, top_p, temperature, repetition_penalty, subtalker_dosample, subtalker_top_k, subtalker_top_p, subtalker_temperature, max_new_tokens, and **kwargs.

Returns

wavs
List[np.ndarray]
List of generated waveforms as float32 numpy arrays.
sample_rate
int
Sample rate of the generated audio.

Raises

  • ValueError - If batch sizes mismatch or the model is not a VoiceDesign model

Example

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load VoiceDesign model
model = Qwen3TTSModel.from_pretrained(
    "Qwen3-TTS-VoiceDesign-2B",
    device_map="cuda:0"
)

# Generate with voice description
wavs, sr = model.generate_voice_design(
    text="Welcome to our service.",
    instruct="A professional female voice with a warm, friendly tone",
    language="English"
)

sf.write("welcome.wav", wavs[0], sr)

# Batch generation with different instructions
wavs, sr = model.generate_voice_design(
    text=["Hello!", "Goodbye!"],
    instruct=[
        "An energetic young male voice",
        "A calm elderly female voice"
    ],
    language="English"
)

generate_voice_clone

generate_voice_clone(
    text: Union[str, List[str]],
    language: Union[str, List[str]] = None,
    ref_audio: Optional[Union[AudioLike, List[AudioLike]]] = None,
    ref_text: Optional[Union[str, List[Optional[str]]]] = None,
    x_vector_only_mode: Union[bool, List[bool]] = False,
    voice_clone_prompt: Optional[Union[Dict[str, Any], List[VoiceClonePromptItem]]] = None,
    non_streaming_mode: bool = False,
    **kwargs
) -> Tuple[List[np.ndarray], int]
Voice clone speech using the Base model. You can provide either:
  • (ref_audio, ref_text, x_vector_only_mode) and let this method build the prompt, OR
  • voice_clone_prompt as a list of VoiceClonePromptItem returned by create_voice_clone_prompt(), OR
  • voice_clone_prompt as a dict (advanced usage)
Model Type: Base only
Source: qwen_tts/inference/qwen3_tts_model.py:469-633

Parameters

text
Union[str, List[str]]
required
Text(s) to synthesize. Can be a single string or a list of strings for batch generation.
language
Union[str, List[str]]
default:"None"
Language(s) for each sample. If None, defaults to "Auto" for all samples.
ref_audio
Optional[Union[AudioLike, List[AudioLike]]]
default:"None"
Reference audio(s) for prompt building. Required if voice_clone_prompt is not provided. Supported formats:
  • str: Local wav path, URL, or base64 audio string
  • (np.ndarray, sr): Tuple of waveform + sampling rate
  • List of the above
Example:
  • "reference.wav"
  • "https://example.com/audio.wav"
  • (audio_array, 24000)
  • ["ref1.wav", "ref2.wav"]
ref_text
Optional[Union[str, List[Optional[str]]]]
default:"None"
Reference text(s) - transcription of the reference audio. Required when x_vector_only_mode=False (ICL mode). Example:
  • "This is the reference text"
  • ["Reference 1", "Reference 2"]
x_vector_only_mode
Union[bool, List[bool]]
default:"False"
If True, only the speaker embedding is used (ignores ref_text/ref_code). If False, ICL mode is used automatically (requires ref_text). Can be a single boolean or a list matching the batch size.
voice_clone_prompt
Optional[Union[Dict[str, Any], List[VoiceClonePromptItem]]]
default:"None"
List of VoiceClonePromptItem from create_voice_clone_prompt(), or a dict for advanced usage. If provided, ref_audio, ref_text, and x_vector_only_mode are ignored.
non_streaming_mode
bool
default:"False"
Whether to treat the text input as non-streaming. When set to False, the method simulates streaming text input (it does not enable true streaming generation).

Generation Parameters

Same generation parameters as generate_custom_voice(): do_sample, top_k, top_p, temperature, repetition_penalty, subtalker_dosample, subtalker_top_k, subtalker_top_p, subtalker_temperature, max_new_tokens, and **kwargs.

Returns

wavs
List[np.ndarray]
List of generated waveforms as float32 numpy arrays.
sample_rate
int
Sample rate of the generated audio.

Raises

  • ValueError - If batch sizes mismatch or required prompt inputs are missing
  • ValueError - If the model is not a Base model

Example

from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load Base model
model = Qwen3TTSModel.from_pretrained(
    "Qwen3-TTS-Base-2B",
    device_map="cuda:0"
)

# Simple voice cloning (ICL mode)
wavs, sr = model.generate_voice_clone(
    text="This is a test of voice cloning.",
    ref_audio="reference.wav",
    ref_text="This is the reference audio transcription.",
    language="English"
)

sf.write("cloned.wav", wavs[0], sr)

# X-vector only mode (speaker embedding only)
wavs, sr = model.generate_voice_clone(
    text="Quick test.",
    ref_audio="reference.wav",
    x_vector_only_mode=True,  # ref_text not required
    language="English"
)

# Reuse voice prompt (efficient for multiple generations)
prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="Reference transcription"
)

# Generate multiple times with same voice
for text in ["Hello", "Goodbye", "Welcome"]:
    wavs, sr = model.generate_voice_clone(
        text=text,
        voice_clone_prompt=prompt,
        language="English"
    )
    # Process wavs...
