Overview

The Qwen3-TTS repository includes comprehensive example scripts demonstrating various use cases and model types. All examples are available in the examples/ directory on GitHub.

Available Examples

CustomVoice Model

Test the CustomVoice model with 9 premium speakers and instruction control

VoiceDesign Model

Generate voices from natural language descriptions

Base Model (Voice Clone)

3-second voice cloning with reference audio

Tokenizer Usage

Encode and decode audio with Qwen3-TTS-Tokenizer

Example Details

test_model_12hz_custom_voice.py

Demonstrates usage of the CustomVoice model with predefined speakers. Features:
  • Single and batch inference
  • Language selection (Chinese, English, etc.)
  • Speaker selection from 9 premium voices
  • Instruction-based control (tone, emotion, style)
Key code snippet:
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_custom_voice(
    # "I've really noticed that I'm someone who is especially good at
    # reading other people's emotions."
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
    language="Chinese",
    speaker="Vivian",
    # "Say it in a very angry tone"
    instruct="用特别愤怒的语气说",
)

test_model_12hz_voice_design.py

Shows how to design custom voices using natural language descriptions. Features:
  • Voice creation from text descriptions
  • Single and batch generation
  • Multilingual voice design
  • Emotional and stylistic control
Key code snippet:
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    # "Big brother, you're back! I waited so, so long for you. I want a hug!"
    text="哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
    language="Chinese",
    # "A cutesy, childlike girl's voice: high-pitched with pronounced rises
    # and falls, creating a clingy, affected, deliberately cute effect."
    instruct="体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。",
)

test_model_12hz_base.py

Comprehensive voice cloning examples with the Base model. Features:
  • Voice cloning from reference audio
  • Single and batch voice cloning
  • Reusable voice clone prompts
  • X-vector only mode
  • Multiple clone modes (ICL and x-vector)
Key code snippet:
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you..."

wavs, sr = model.generate_voice_clone(
    text="I am solving the equation: x = [-b ± √(b²-4ac)] / 2a?",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

test_tokenizer_12hz.py

Demonstrates audio encoding and decoding with the tokenizer. Features:
  • Single and batch audio encoding
  • Audio decoding from codes
  • Multiple input formats (URLs, paths, numpy arrays)
  • Dict and list payload handling
Key code snippet:
from qwen_tts import Qwen3TTSTokenizer

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)

# Encode from URL
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/tokenizer_demo_1.wav"
enc = tokenizer.encode(audio_url)

# Decode back to audio
wavs, sr = tokenizer.decode(enc)
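A quick sanity check on encoded sequence lengths: if, as the model name suggests, the 12Hz tokenizer produces roughly 12 codec frames per second of audio, a clip's code count should scale linearly with its duration. The sketch below is an illustrative back-of-the-envelope estimate; the 12 frames/sec rate is an assumption drawn from the name, not a documented constant.

```python
def expected_num_codes(duration_sec, frames_per_sec=12):
    """Estimate the codec sequence length for an audio clip.

    Assumes a fixed frame rate (12 frames/sec inferred from the
    "12Hz" model name).
    """
    return round(duration_sec * frames_per_sec)

# A 10-second clip would yield roughly 120 codes at 12 frames/sec.
print(expected_num_codes(10.0))
```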

Common Patterns

Model Initialization

All examples use a consistent pattern for loading models:
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/MODEL_NAME",
    device_map="cuda:0",              # GPU device
    dtype=torch.bfloat16,             # Recommended dtype
    attn_implementation="flash_attention_2",  # Optional but recommended
)

Batch Processing

All generation methods support batch inference:
wavs, sr = model.generate_custom_voice(
    text=["First sentence.", "Second sentence."],
    language=["English", "English"],
    speaker=["Ryan", "Aiden"],
    instruct=["", "Very happy."]
)
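Very large batches can exhaust GPU memory. One common pattern is to split the inputs into fixed-size chunks and call the generation method once per chunk; the chunking helper below is plain standard-library Python for illustration, not part of qwen_tts.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

texts = ["First.", "Second.", "Third.", "Fourth.", "Fifth."]
for batch in chunked(texts, 2):
    # In practice, each batch would be passed to the model, e.g.:
    # wavs, sr = model.generate_custom_voice(text=batch, ...)
    print(batch)
```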

Saving Output

All examples use soundfile for saving audio:
import soundfile as sf

sf.write("output.wav", wavs[0], sr)
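If soundfile is not installed, 16-bit PCM WAV files can also be written with the standard-library wave module. This fallback assumes the output waveform is a 1-D sequence of floats in [-1, 1]; the function name is ours, and `wavs[0]` / `sr` mirror the snippet above.

```python
import struct
import wave

def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes = 16-bit samples
        f.setframerate(sample_rate)
        frames = b"".join(
            # Clip to [-1, 1], then scale to the int16 range.
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(frames)

# e.g. write_wav_pcm16("output.wav", wavs[0], sr)
```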

Running Examples

Prerequisites

pip install qwen-tts
pip install flash-attn --no-build-isolation  # Optional but recommended

Clone Repository

git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS/examples

Run Examples

# CustomVoice example
python test_model_12hz_custom_voice.py

# VoiceDesign example
python test_model_12hz_voice_design.py

# Base model (voice cloning)
python test_model_12hz_base.py

# Tokenizer example
python test_tokenizer_12hz.py
Examples will automatically download model weights on first run. Ensure you have sufficient disk space and a stable internet connection.

Advanced Usage

For more advanced usage patterns, including:
  • Voice design then clone workflow
  • Reusable voice prompts
  • Custom generation parameters
  • Streaming generation
refer to the API Reference and Quickstart Guide.

Troubleshooting

Out of Memory

If you hit CUDA out-of-memory errors, try reducing the batch size or using a smaller model variant (0.6B instead of 1.7B):
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",  # Smaller model
    device_map="cuda:0",
    dtype=torch.float16,  # Use fp16 instead of bf16
)
FlashAttention Installation Fails

FlashAttention is optional but improves performance. If installation fails, load the model without it:
# Load without FlashAttention
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",  # Omit this line
)
Slow Model Downloads

Use ModelScope for faster downloads in Mainland China:
pip install modelscope
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --local_dir ./models/Qwen3-TTS-12Hz-1.7B-CustomVoice
Then load from local path:
model = Qwen3TTSModel.from_pretrained(
    "./models/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ...
)
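Before pointing from_pretrained at a local directory, it can help to verify the download actually completed. A minimal sketch using only the standard library; the check for config.json assumes the usual Hugging Face directory layout, and the function name is ours.

```python
from pathlib import Path

def local_model_ready(model_dir):
    """Heuristic check that a downloaded model directory looks usable.

    Assumes the standard Hugging Face layout, where the directory
    contains a config.json at its top level.
    """
    p = Path(model_dir)
    return p.is_dir() and (p / "config.json").is_file()

print(local_model_ready("./models/Qwen3-TTS-12Hz-1.7B-CustomVoice"))
```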
