The CustomVoice models (Qwen3-TTS-12Hz-1.7B-CustomVoice and Qwen3-TTS-12Hz-0.6B-CustomVoice) provide high-quality speech generation using 9 carefully curated premium speakers, with optional natural language instructions to control tone, emotion, and speaking style.

Available Speakers

The CustomVoice models include 9 premium speakers covering various combinations of gender, age, language, and dialect:
| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |
We recommend using each speaker’s native language for the best quality, though each speaker can speak any language supported by the model.
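The speaker table above can be captured as a small lookup so application code can default to each speaker's native language. The keys below mirror the lowercase names that `get_supported_speakers()` returns; the helper itself is an illustrative sketch, not part of the `qwen_tts` API:

```python
# Illustrative lookup: each speaker's native language, keyed by the
# lowercase speaker names that get_supported_speakers() returns.
NATIVE_LANGUAGE = {
    "vivian": "Chinese",
    "serena": "Chinese",
    "uncle_fu": "Chinese",
    "dylan": "Chinese",     # Beijing dialect
    "eric": "Chinese",      # Sichuan dialect
    "ryan": "English",
    "aiden": "English",
    "ono_anna": "Japanese",
    "sohee": "Korean",
}

def recommended_language(speaker: str) -> str:
    """Return the speaker's native language, or "Auto" for unknown speakers."""
    return NATIVE_LANGUAGE.get(speaker.lower(), "Auto")
```

Falling back to `"Auto"` keeps the helper safe if you later swap in a model with a different speaker roster.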

Single Inference

Generate speech for a single text with a specific speaker:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Generate speech
wavs, sr = model.generate_custom_voice(
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm very good at reading other people's emotions."
    language="Chinese",  # Or "Auto" for automatic detection
    speaker="Vivian",
    instruct="用特别愤怒的语气说",  # Optional instruction: "Say it in a very angry tone"
)

sf.write("output.wav", wavs[0], sr)

Batch Inference

Process multiple texts efficiently in a single batch:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Batch generation
wavs, sr = model.generate_custom_voice(
    text=[
        "其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm very good at reading other people's emotions."
        "She said she would be here by noon."
    ],
    language=["Chinese", "English"],
    speaker=["Vivian", "Ryan"],
    instruct=["", "Very happy."],  # Empty string means no instruction
)

for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)
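Note that the batch arguments are parallel lists with one entry per utterance. If your application represents requests as dicts, a small sketch like the following (an illustrative helper, not part of `qwen_tts`) keeps the lists aligned:

```python
# Illustrative helper: turn per-utterance request dicts into the parallel
# lists that a batch generate_custom_voice() call expects.
def to_batch_args(requests):
    texts = [r["text"] for r in requests]
    languages = [r.get("language", "Auto") for r in requests]
    speakers = [r["speaker"] for r in requests]
    instructs = [r.get("instruct", "") for r in requests]  # "" = no instruction
    return texts, languages, speakers, instructs
```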

Language Selection

You can specify the language explicitly or use automatic detection:
# Automatic language detection
wavs, sr = model.generate_custom_voice(
    text="Hello world",
    language="Auto",  # Or omit the parameter
    speaker="Ryan",
)

# Explicit language (recommended for best results)
wavs, sr = model.generate_custom_voice(
    text="Hello world",
    language="English",
    speaker="Ryan",
)
Supported languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
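Since the model expects capitalized language names, it can be convenient to normalize user input against the supported set before calling the model. This validator is an illustrative sketch built from the list above, not part of the `qwen_tts` API:

```python
# Supported languages from the docs, plus "Auto" for automatic detection.
SUPPORTED_LANGUAGES = {
    "Auto", "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
}

def normalize_language(name: str) -> str:
    """Map case-insensitive input to the canonical language name.

    Raises ValueError for languages the model does not support.
    """
    canonical = name.strip().capitalize()
    if canonical not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {name!r}")
    return canonical
```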

Instruction Control

The 1.7B CustomVoice model supports natural language instructions to control voice characteristics:
# Emotional control
wavs, sr = model.generate_custom_voice(
    text="I can't believe you did that!",
    language="English",
    speaker="Ryan",
    instruct="Say it in a very angry and disappointed tone",
)

# Speaking style control
wavs, sr = model.generate_custom_voice(
    text="Welcome to our presentation today.",
    language="English",
    speaker="Aiden",
    instruct="Speak slowly and professionally, like giving a formal speech",
)

# Without instruction (natural style)
wavs, sr = model.generate_custom_voice(
    text="Good morning everyone!",
    language="English",
    speaker="Ryan",
    # No instruct parameter - uses natural speaking style
)
The 0.6B CustomVoice model does not support instruction control. Instructions will be ignored for this model.
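If your code targets both checkpoints, you may want to drop the instruction up front rather than have it silently ignored. A minimal sketch, assuming the checkpoint name is enough to tell the variants apart (the helper is hypothetical, not part of `qwen_tts`):

```python
# Illustrative helper: build generate_custom_voice() kwargs, dropping the
# instruct field when targeting the 0.6B checkpoint, which ignores it.
def build_generation_kwargs(model_id, text, language, speaker, instruct=None):
    kwargs = {"text": text, "language": language, "speaker": speaker}
    if instruct and "0.6B" not in model_id:
        kwargs["instruct"] = instruct
    return kwargs
```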

Checking Supported Speakers and Languages

Query what speakers and languages your model supports:
# Get list of supported speakers
speakers = model.get_supported_speakers()
print("Available speakers:", speakers)
# Output: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']

# Get list of supported languages
languages = model.get_supported_languages()
print("Available languages:", languages)
# Output: ['auto', 'chinese', 'english', 'french', 'german', ...]
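These getters are useful for validating requests before generation. The sketch below hardcodes the speaker list from the example output above for illustration; in real code you would pass in `model.get_supported_speakers()`:

```python
# Validate a requested speaker against the list returned by
# model.get_supported_speakers(); hardcoded here for illustration.
SPEAKERS = ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan',
            'serena', 'sohee', 'uncle_fu', 'vivian']

def validate_speaker(speaker, speakers=SPEAKERS):
    """Return the lowercase speaker name, or raise listing the valid choices."""
    key = speaker.lower()
    if key not in speakers:
        raise ValueError(f"Unknown speaker {speaker!r}; choose from {speakers}")
    return key
```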

Generation Parameters

Customize the generation process with additional parameters:
wavs, sr = model.generate_custom_voice(
    text="Hello world",
    language="English",
    speaker="Ryan",
    # Generation parameters
    max_new_tokens=2048,         # Maximum tokens to generate
    temperature=0.9,             # Sampling temperature (higher = more random)
    top_k=50,                    # Top-k sampling
    top_p=1.0,                   # Nucleus sampling
    do_sample=True,              # Enable sampling
    repetition_penalty=1.05,     # Penalty for repetition
)

Model Comparison

| Model | Size | Instruction Support | Streaming | Best Use Case |
| --- | --- | --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | ✅ Yes | ✅ Yes | High-quality with style control |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | ❌ No | ✅ Yes | Fast, lightweight inference |
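The comparison boils down to one question at load time: do you need instruction control? A minimal illustrative chooser using the repo ids from the table:

```python
# Illustrative chooser: pick a CustomVoice checkpoint from the comparison
# table based on whether natural-language instruction control is required.
def choose_custom_voice_model(need_instructions):
    if need_instructions:
        return "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"  # supports instruct
    return "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"      # lighter, no instruct
```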

Complete Example

Here’s a complete working example from the official examples:
examples/test_model_12hz_custom_voice.py
import time
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

def main():
    device = "cuda:0"
    MODEL_PATH = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"

    tts = Qwen3TTSModel.from_pretrained(
        MODEL_PATH,
        device_map=device,
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    # Single inference with instruction
    torch.cuda.synchronize()
    t0 = time.time()

    wavs, sr = tts.generate_custom_voice(
        text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm very good at reading other people's emotions."
        language="Chinese",
        speaker="Vivian",
        instruct="用特别愤怒的语气说",  # "Say it in a very angry tone"
    )

    torch.cuda.synchronize()
    t1 = time.time()
    print(f"[CustomVoice Single] time: {t1 - t0:.3f}s")

    sf.write("qwen3_tts_test_custom_single.wav", wavs[0], sr)

    # Batch inference
    texts = ["其实我真的有发现,我是一个特别善于观察别人情绪的人。", "She said she would be here by noon."]
    languages = ["Chinese", "English"]
    speakers = ["Vivian", "Ryan"]
    instructs = ["", "Very happy."]

    torch.cuda.synchronize()
    t0 = time.time()

    wavs, sr = tts.generate_custom_voice(
        text=texts,
        language=languages,
        speaker=speakers,
        instruct=instructs,
        max_new_tokens=2048,
    )

    torch.cuda.synchronize()
    t1 = time.time()
    print(f"[CustomVoice Batch] time: {t1 - t0:.3f}s")

    for i, w in enumerate(wavs):
        sf.write(f"qwen3_tts_test_custom_batch_{i}.wav", w, sr)

if __name__ == "__main__":
    main()
