The CustomVoice models (Qwen3-TTS-12Hz-1.7B-CustomVoice and Qwen3-TTS-12Hz-0.6B-CustomVoice) provide high-quality speech generation using 9 carefully curated premium speakers, with optional natural language instructions to control tone, emotion, and speaking style.

Available Speakers

The CustomVoice models include 9 premium speakers covering various combinations of gender, age, language, and dialect:
| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |
We recommend using each speaker’s native language for the best quality, though each speaker can speak any language supported by the model.
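The speaker table above can be captured as a small lookup so application code can default to each speaker's native language. The keys below mirror the lowercase names that `get_supported_speakers()` returns; the helper itself is an illustrative sketch, not part of the `qwen_tts` API:

```python
# Illustrative lookup: each speaker's native language, keyed by the
# lowercase speaker names that get_supported_speakers() returns.
NATIVE_LANGUAGE = {
    "vivian": "Chinese",
    "serena": "Chinese",
    "uncle_fu": "Chinese",
    "dylan": "Chinese",     # Beijing dialect
    "eric": "Chinese",      # Sichuan dialect
    "ryan": "English",
    "aiden": "English",
    "ono_anna": "Japanese",
    "sohee": "Korean",
}

def recommended_language(speaker: str) -> str:
    """Return the speaker's native language, or "Auto" for unknown speakers."""
    return NATIVE_LANGUAGE.get(speaker.lower(), "Auto")
```

Falling back to `"Auto"` keeps the helper safe if you later swap in a model with a different speaker roster.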

Single Inference

Generate speech for a single text with a specific speaker:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Generate speech
wavs, sr = model.generate_custom_voice(
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm very good at reading other people's emotions."
    language="Chinese",  # Or "Auto" for automatic detection
    speaker="Vivian",
    instruct="用特别愤怒的语气说",  # Optional instruction: "Say it in a very angry tone"
)

sf.write("output.wav", wavs[0], sr)

Batch Inference

Process multiple texts efficiently in a single batch:
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Batch generation
wavs, sr = model.generate_custom_voice(
    text=[
        "其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm very good at reading other people's emotions."
        "She said she would be here by noon."
    ],
    language=["Chinese", "English"],
    speaker=["Vivian", "Ryan"],
    instruct=["", "Very happy."],  # Empty string means no instruction
)

for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)
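Note that the batch arguments are parallel lists with one entry per utterance. If your application represents requests as dicts, a small sketch like the following (an illustrative helper, not part of `qwen_tts`) keeps the lists aligned:

```python
# Illustrative helper: turn per-utterance request dicts into the parallel
# lists that a batch generate_custom_voice() call expects.
def to_batch_args(requests):
    texts = [r["text"] for r in requests]
    languages = [r.get("language", "Auto") for r in requests]
    speakers = [r["speaker"] for r in requests]
    instructs = [r.get("instruct", "") for r in requests]  # "" = no instruction
    return texts, languages, speakers, instructs
```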

Language Selection

You can specify the language explicitly or use automatic detection:
# Automatic language detection
wavs, sr = model.generate_custom_voice(
    text="Hello world",
    language="Auto",  # Or omit the parameter
    speaker="Ryan",
)

# Explicit language (recommended for best results)
wavs, sr = model.generate_custom_voice(
    text="Hello world",
    language="English",
    speaker="Ryan",
)
Supported languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
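Since the model expects capitalized language names, it can be convenient to normalize user input against the supported set before calling the model. This validator is an illustrative sketch built from the list above, not part of the `qwen_tts` API:

```python
# Supported languages from the docs, plus "Auto" for automatic detection.
SUPPORTED_LANGUAGES = {
    "Auto", "Chinese", "English", "Japanese", "Korean", "German",
    "French", "Russian", "Portuguese", "Spanish", "Italian",
}

def normalize_language(name: str) -> str:
    """Map case-insensitive input to the canonical language name.

    Raises ValueError for languages the model does not support.
    """
    canonical = name.strip().capitalize()
    if canonical not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {name!r}")
    return canonical
```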

Instruction Control

The 1.7B CustomVoice model supports natural language instructions to control voice characteristics:
# Emotional control
wavs, sr = model.generate_custom_voice(
    text="I can't believe you did that!",
    language="English",
    speaker="Ryan",
    instruct="Say it in a very angry and disappointed tone",
)

# Speaking style control
wavs, sr = model.generate_custom_voice(
    text="Welcome to our presentation today.",
    language="English",
    speaker="Aiden",
    instruct="Speak slowly and professionally, like giving a formal speech",
)

# Without instruction (natural style)
wavs, sr = model.generate_custom_voice(
    text="Good morning everyone!",
    language="English",
    speaker="Ryan",
    # No instruct parameter - uses natural speaking style
)
The 0.6B CustomVoice model does not support instruction control. Instructions will be ignored for this model.
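If your code targets both checkpoints, you may want to drop the instruction up front rather than have it silently ignored. A minimal sketch, assuming the checkpoint name is enough to tell the variants apart (the helper is hypothetical, not part of `qwen_tts`):

```python
# Illustrative helper: build generate_custom_voice() kwargs, dropping the
# instruct field when targeting the 0.6B checkpoint, which ignores it.
def build_generation_kwargs(model_id, text, language, speaker, instruct=None):
    kwargs = {"text": text, "language": language, "speaker": speaker}
    if instruct and "0.6B" not in model_id:
        kwargs["instruct"] = instruct
    return kwargs
```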

Checking Supported Speakers and Languages

Query what speakers and languages your model supports:
# Get list of supported speakers
speakers = model.get_supported_speakers()
print("Available speakers:", speakers)
# Output: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']

# Get list of supported languages
languages = model.get_supported_languages()
print("Available languages:", languages)
# Output: ['auto', 'chinese', 'english', 'french', 'german', ...]
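These getters are useful for validating requests before generation. The sketch below hardcodes the speaker list from the example output above for illustration; in real code you would pass in `model.get_supported_speakers()`:

```python
# Validate a requested speaker against the list returned by
# model.get_supported_speakers(); hardcoded here for illustration.
SPEAKERS = ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan',
            'serena', 'sohee', 'uncle_fu', 'vivian']

def validate_speaker(speaker, speakers=SPEAKERS):
    """Return the lowercase speaker name, or raise listing the valid choices."""
    key = speaker.lower()
    if key not in speakers:
        raise ValueError(f"Unknown speaker {speaker!r}; choose from {speakers}")
    return key
```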

Generation Parameters

Customize the generation process with additional parameters:
wavs, sr = model.generate_custom_voice(
    text="Hello world",
    language="English",
    speaker="Ryan",
    # Generation parameters
    max_new_tokens=2048,         # Maximum tokens to generate
    temperature=0.9,             # Sampling temperature (higher = more random)
    top_k=50,                    # Top-k sampling
    top_p=1.0,                   # Nucleus sampling
    do_sample=True,              # Enable sampling
    repetition_penalty=1.05,     # Penalty for repetition
)

Model Comparison

| Model | Size | Instruction Support | Streaming | Best Use Case |
| --- | --- | --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | ✅ Yes | ✅ Yes | High-quality with style control |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | ❌ No | ✅ Yes | Fast, lightweight inference |
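The comparison boils down to one question at load time: do you need instruction control? A minimal illustrative chooser using the repo ids from the table:

```python
# Illustrative chooser: pick a CustomVoice checkpoint from the comparison
# table based on whether natural-language instruction control is required.
def choose_custom_voice_model(need_instructions):
    if need_instructions:
        return "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"  # supports instruct
    return "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"      # lighter, no instruct
```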

Complete Example

Here’s a complete working example from the official examples:
examples/test_model_12hz_custom_voice.py
import time
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

def main():
    device = "cuda:0"
    MODEL_PATH = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"

    tts = Qwen3TTSModel.from_pretrained(
        MODEL_PATH,
        device_map=device,
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )

    # Single inference with instruction
    torch.cuda.synchronize()
    t0 = time.time()

    wavs, sr = tts.generate_custom_voice(
        text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",  # "I've really noticed that I'm very good at reading other people's emotions."
        language="Chinese",
        speaker="Vivian",
        instruct="用特别愤怒的语气说",  # "Say it in a very angry tone"
    )

    torch.cuda.synchronize()
    t1 = time.time()
    print(f"[CustomVoice Single] time: {t1 - t0:.3f}s")

    sf.write("qwen3_tts_test_custom_single.wav", wavs[0], sr)

    # Batch inference
    texts = ["其实我真的有发现,我是一个特别善于观察别人情绪的人。", "She said she would be here by noon."]
    languages = ["Chinese", "English"]
    speakers = ["Vivian", "Ryan"]
    instructs = ["", "Very happy."]

    torch.cuda.synchronize()
    t0 = time.time()

    wavs, sr = tts.generate_custom_voice(
        text=texts,
        language=languages,
        speaker=speakers,
        instruct=instructs,
        max_new_tokens=2048,
    )

    torch.cuda.synchronize()
    t1 = time.time()
    print(f"[CustomVoice Batch] time: {t1 - t0:.3f}s")

    for i, w in enumerate(wavs):
        sf.write(f"qwen3_tts_test_custom_batch_{i}.wav", w, sr)

if __name__ == "__main__":
    main()
