Prerequisites
Before you begin, make sure you have:
- Python 3.9 or higher (Python 3.12 recommended)
- A CUDA-compatible GPU (optional but recommended)
- The `qwen-tts` package installed (see Installation)
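As a quick sanity check before installing, something like the following can verify the environment. `check_prerequisites` is an illustrative helper, not part of the `qwen-tts` package, and the GPU probe only looks for the `nvidia-smi` binary:

```python
import shutil
import sys

def check_prerequisites() -> dict:
    """Report whether the environment meets the prerequisites above.

    A CUDA-capable GPU is optional, so a missing GPU is only reported,
    not treated as fatal.
    """
    return {
        "python_ok": sys.version_info >= (3, 9),
        "gpu_visible": shutil.which("nvidia-smi") is not None,
    }

print(check_prerequisites())
```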
CustomVoice: Generate with Preset Speakers
The CustomVoice models provide 9 premium preset voices across multiple languages.
Generate speech with a preset speaker
Available speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric (Chinese); Ryan, Aiden (English); Ono_Anna (Japanese); Sohee (Korean). See the Custom Voice guide for details.
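For reference, the speaker-to-language mapping above can be kept in a small lookup table. This `PRESET_SPEAKERS` dict is illustrative; the package may expose the speaker list differently:

```python
# Illustrative lookup table for the preset speakers listed above.
PRESET_SPEAKERS = {
    "Chinese": ["Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric"],
    "English": ["Ryan", "Aiden"],
    "Japanese": ["Ono_Anna"],
    "Korean": ["Sohee"],
}

def speakers_for(language: str) -> list:
    """Return the preset speakers available for a language, or []."""
    return PRESET_SPEAKERS.get(language, [])

print(speakers_for("English"))  # ['Ryan', 'Aiden']
```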
VoiceDesign: Create Custom Voices
The VoiceDesign model lets you create voices from natural language descriptions.
Base: Clone Any Voice
The Base models can clone a voice from just 3 seconds of reference audio.
Prepare reference audio
You need a reference audio file (3+ seconds) and its transcript.
The reference audio can be a local file path, URL, base64 string, or numpy array with sample rate.
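A sketch of how those four input forms could be distinguished. The real package does this dispatch internally; `classify_reference_audio` is hypothetical, and a plain `(samples, sample_rate)` tuple stands in for the numpy-array form:

```python
import base64

def classify_reference_audio(ref):
    """Classify the accepted reference-audio forms described above.

    Hypothetical helper. Accepted forms: local file path, URL,
    base64 string, or a (samples, sample_rate) pair standing in
    for a numpy array with sample rate.
    """
    if isinstance(ref, tuple) and len(ref) == 2:
        return "array_with_sample_rate"
    if isinstance(ref, str):
        if ref.startswith(("http://", "https://")):
            return "url"
        if ref.endswith((".wav", ".mp3", ".flac")):
            return "local_path"
        try:
            # A string that is neither a URL nor an audio path is
            # treated as base64-encoded audio bytes.
            base64.b64decode(ref, validate=True)
            return "base64"
        except Exception:
            raise ValueError(f"Unrecognized reference audio: {ref!r}")
    raise TypeError(f"Unsupported reference type: {type(ref)!r}")
```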
Batch Processing
All models support batch processing for efficiency.
Common Parameters
All generation methods support these optional parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `do_sample` | bool | True | Enable sampling for more natural speech |
| `temperature` | float | 1.0 | Controls randomness (0.0-1.0) |
| `top_k` | int | 50 | Top-k sampling parameter |
| `top_p` | float | 1.0 | Nucleus sampling parameter |
| `max_new_tokens` | int | 2048 | Maximum tokens to generate |
| `non_streaming_mode` | bool | False | Force non-streaming generation |
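One way to keep these defaults in one place is a small kwargs builder. `generation_kwargs` is a hypothetical convenience wrapper around the documented defaults, not a `qwen-tts` API:

```python
def generation_kwargs(**overrides):
    """Assemble the optional generation parameters from the table above.

    Hypothetical wrapper: defaults mirror the documented values,
    unknown names are rejected early, and temperature is range-checked.
    """
    defaults = {
        "do_sample": True,
        "temperature": 1.0,
        "top_k": 50,
        "top_p": 1.0,
        "max_new_tokens": 2048,
        "non_streaming_mode": False,
    }
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    temperature = overrides.get("temperature", defaults["temperature"])
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0.0, 1.0]")
    return {**defaults, **overrides}

print(generation_kwargs(temperature=0.8))
```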
Model Selection Guide
CustomVoice
Use when:
- You need consistent, high-quality preset voices
- You want instruction-based control (1.7B)
- You need multilingual support
- 0.6B: 9 preset speakers
- 1.7B: 9 speakers + instructions
VoiceDesign
Use when:
- You need custom voice characteristics
- You want to describe voices in natural language
- You need creative voice variations
- 1.7B only
Base
Use when:
- You need to clone specific voices
- You have reference audio samples
- You want to fine-tune for your use case
- 0.6B and 1.7B
- Both support 3-second cloning
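The selection guide above can be condensed into a tiny decision helper. `suggest_model` is illustrative and returns the family names used in this guide, not package model identifiers:

```python
def suggest_model(need_cloning=False, need_voice_design=False,
                  need_instructions=False):
    """Map the decision points above to a model family (illustrative)."""
    if need_voice_design:
        # Natural-language voice descriptions: VoiceDesign only (1.7B).
        return "VoiceDesign (1.7B)"
    if need_cloning:
        # Reference-audio cloning: Base, either size supports 3-second cloning.
        return "Base (0.6B or 1.7B)"
    if need_instructions:
        # Instruction-based control requires the larger CustomVoice model.
        return "CustomVoice (1.7B)"
    # Default: consistent preset voices with the smallest model.
    return "CustomVoice (0.6B)"
```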
Next Steps
Explore Guides
Learn about advanced features and workflows
API Reference
Explore the complete API documentation
Performance Tips
Optimize for speed and quality
Fine-tuning
Customize models for your specific needs
Troubleshooting
Model downloads are slow
Models are downloaded from Hugging Face on first use. For faster downloads in China, download the model with ModelScope first, then load it from the local directory.
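Assuming ModelScope's command-line downloader, the workflow might look like the following; replace the model ID placeholder with the actual release you use:

```shell
# Install ModelScope, then fetch the model into a local directory.
pip install modelscope
modelscope download --model <model-id> --local_dir ./qwen-tts-model
```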
Out of memory errors
Try these solutions:
- Use the 0.6B model instead of 1.7B
- Reduce batch size
- Use `dtype=torch.float16` instead of `bfloat16`
- Generate shorter text segments
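For the last tip, long input can be pre-split at sentence boundaries before generation. `split_text` is an illustrative helper, not part of the package:

```python
import re

def split_text(text: str, max_chars: int = 200) -> list:
    """Split long input into shorter segments at sentence boundaries.

    Splitting at sentence-ending punctuation keeps prosody natural;
    a single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?\u3002\uff01\uff1f])\s*", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        segments.append(current)
    return segments
```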
Audio quality is poor
Check these factors:
- Use the 12Hz models (better quality than 25Hz)
- Set appropriate language explicitly
- For voice cloning, ensure reference audio is clear (3+ seconds, good quality)
- Adjust temperature (try 0.7-0.9 for more stable output)
FlashAttention 2 installation fails
FlashAttention 2 is optional but recommended:
- Requires CUDA 11.8 or higher
- On machines with limited RAM, limit build parallelism: `MAX_JOBS=4 pip install flash-attn --no-build-isolation`
- If it fails, the model will work without it (just slower)
Getting Help
Join the Community
Ask questions, report issues, and get help from the community on GitHub