This guide will walk you through generating your first speech with Qwen3-TTS using all three model types: CustomVoice, VoiceDesign, and Base (voice cloning).

Prerequisites

Before you begin, make sure you have:
  • Python 3.9 or higher (Python 3.12 recommended)
  • A CUDA-compatible GPU (optional but recommended)
  • Installed the qwen-tts package (see Installation)

CustomVoice: Generate with Preset Speakers

The CustomVoice models provide 9 premium preset voices across multiple languages.
Step 1: Import and load the model

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
Step 2: Generate speech with a preset speaker

wavs, sr = model.generate_custom_voice(
    text="Hello, welcome to Qwen3-TTS!",
    language="English",
    speaker="Ryan",
)

sf.write("output_custom_voice.wav", wavs[0], sr)
Available speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric (Chinese); Ryan, Aiden (English); Ono_Anna (Japanese); Sohee (Korean). See the Custom Voice guide for details.
Step 3: Add instruction control (1.7B model only)

wavs, sr = model.generate_custom_voice(
    text="I'm so excited to announce this!",
    language="English",
    speaker="Ryan",
    instruct="Very happy and enthusiastic.",
)

sf.write("output_with_instruction.wav", wavs[0], sr)

VoiceDesign: Create Custom Voices

The VoiceDesign model lets you create voices from natural language descriptions.
Step 1: Load the VoiceDesign model

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
Step 2: Generate with a voice description

wavs, sr = model.generate_voice_design(
    text="Welcome! I'm here to help you get started.",
    language="English",
    instruct="Male, 30s, professional and friendly tone, clear articulation",
)

sf.write("output_voice_design.wav", wavs[0], sr)
Be specific in your instructions. Include age, gender, tone, emotion, accent, and speaking style for best results.

Base: Clone Any Voice

The Base models can clone a voice from just 3 seconds of reference audio.
Step 1: Load the Base model

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
Step 2: Prepare reference audio

You need a reference audio file (3+ seconds) and its transcript.
ref_audio = "path/to/reference.wav"
ref_text = "This is the text spoken in the reference audio."
The reference audio can be a local file path, URL, base64 string, or numpy array with sample rate.
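As a sketch of the in-memory form, a waveform array plus its sample rate can stand in for a file path. Passing the pair as an (array, sample_rate) tuple is an assumption about the API, so check the reference documentation for the exact form your version accepts; the synthetic tone below is only a placeholder for real recorded speech.

```python
import numpy as np

# Placeholder 3-second mono waveform at 16 kHz. In practice this would be
# real recorded speech, e.g. loaded with soundfile.read().
sr = 16000
t = np.linspace(0.0, 3.0, int(sr * 3.0), endpoint=False)
ref_wav = 0.1 * np.sin(2.0 * np.pi * 220.0 * t)

# Assumption: the waveform and sample rate travel together as a tuple.
# A plain file path or URL (as shown above) is the form this guide documents.
ref_audio = (ref_wav, sr)

duration_s = len(ref_wav) / sr  # 3.0 — meets the 3+ second requirement
```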
Step 3: Clone the voice

wavs, sr = model.generate_voice_clone(
    text="Now I can speak any text in this voice!",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

sf.write("output_voice_clone.wav", wavs[0], sr)

Batch Processing

All models support batch processing for efficiency:
# Batch generate with CustomVoice
wavs, sr = model.generate_custom_voice(
    text=[
        "First sentence.",
        "Second sentence.",
        "Third sentence."
    ],
    language=["English", "English", "English"],
    speaker=["Ryan", "Aiden", "Ryan"],
)

# Save all outputs
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)

Common Parameters

All generation methods support these optional parameters:
  • do_sample (bool, default: True): enable sampling for more natural speech
  • temperature (float, default: 1.0): controls randomness (0.0-1.0)
  • top_k (int, default: 50): top-k sampling parameter
  • top_p (float, default: 1.0): nucleus sampling parameter
  • max_new_tokens (int, default: 2048): maximum tokens to generate
  • non_streaming_mode (bool, default: False): force non-streaming generation
Example with custom parameters:
wavs, sr = model.generate_custom_voice(
    text="Hello world!",
    language="English",
    speaker="Ryan",
    temperature=0.7,
    top_k=100,
    top_p=0.95,
)

Model Selection Guide

CustomVoice

Use when:
  • You need consistent, high-quality preset voices
  • You want instruction-based control (1.7B)
  • You need multilingual support
Models:
  • 0.6B: 9 preset speakers
  • 1.7B: 9 speakers + instructions

VoiceDesign

Use when:
  • You need custom voice characteristics
  • You want to describe voices in natural language
  • You need creative voice variations
Model:
  • 1.7B only

Base

Use when:
  • You need to clone specific voices
  • You have reference audio samples
  • You want to fine-tune for your use case
Models:
  • 0.6B and 1.7B
  • Both support 3-second cloning
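The selection criteria above can be condensed into a small lookup. The helper below is illustrative only (not part of qwen-tts); it uses the repository IDs quoted in this guide and the size availability listed above.

```python
# Illustrative helper (not part of qwen-tts): map a task to the repo IDs
# used in this guide.
MODEL_FOR_TASK = {
    "preset_voice": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "voice_design": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "voice_clone": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
}

def pick_model(task: str, small: bool = False) -> str:
    """Return a repo ID; small=True selects the 0.6B checkpoint where one exists."""
    repo = MODEL_FOR_TASK[task]
    if small and task != "voice_design":  # VoiceDesign ships as 1.7B only
        repo = repo.replace("1.7B", "0.6B")
    return repo

print(pick_model("voice_clone", small=True))
# → Qwen/Qwen3-TTS-12Hz-0.6B-Base
```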

Next Steps

Explore Guides

Learn about advanced features and workflows

API Reference

Explore the complete API documentation

Performance Tips

Optimize for speed and quality

Fine-tuning

Customize models for your specific needs

Troubleshooting

Models are downloaded from Hugging Face on first use. For faster downloads in China, use ModelScope:
pip install modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local_dir ./models
Then load from the local directory:
model = Qwen3TTSModel.from_pretrained("./models/Qwen3-TTS-12Hz-1.7B-CustomVoice", ...)
If you run into out-of-memory errors, try these solutions:
  1. Use the 0.6B model instead of 1.7B
  2. Reduce batch size
  3. Use dtype=torch.float16 instead of bfloat16
  4. Generate shorter text segments
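Tip 4 (shorter text segments) can be sketched with a simple sentence-aligned splitter; the resulting chunks can then go through the batch API shown earlier. This helper is illustrative and not part of qwen-tts.

```python
import re

def split_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters
    (assuming no single sentence exceeds max_chars)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = split_text("First sentence. Second sentence. Third one here.", max_chars=35)
print(chunks)
# → ['First sentence. Second sentence.', 'Third one here.']
```

Each chunk can then be passed as one element of the `text` list in a batch call.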
If the generated speech sounds unnatural or unstable, check these factors:
  1. Use the 12Hz models (better quality than 25Hz)
  2. Set appropriate language explicitly
  3. For voice cloning, ensure reference audio is clear (3+ seconds, good quality)
  4. Adjust temperature (try 0.7-0.9 for more stable output)
FlashAttention 2 is optional but recommended:
  • Requires CUDA 11.8 or higher
  • On machines with limited RAM, use: MAX_JOBS=4 pip install flash-attn --no-build-isolation
  • If it fails, the model will work without it (just slower)
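Since the model works without FlashAttention, a small guard can decide at load time whether to pass the flag at all. This is a sketch: it only checks that the flash_attn package is importable, and falls back to the model's default attention by omitting the keyword.

```python
import importlib.util

def attn_kwargs() -> dict:
    """Pass attn_implementation only when flash-attn is importable;
    otherwise fall back to the model's default attention."""
    if importlib.util.find_spec("flash_attn") is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {}

# Usage (sketch):
# model = Qwen3TTSModel.from_pretrained(repo_id, **attn_kwargs(), ...)
```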

Getting Help

Join the Community

Ask questions, report issues, and get help from the community on GitHub
