
Model Overview

Qwen3-TTS offers a family of models optimized for different use cases, from voice cloning to instruction-based voice design. All models are built on the same architecture but are fine-tuned for specific capabilities.

Model Comparison

| Model | Parameters | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Voice design from text descriptions | 10 languages | ✓ | ✓ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | 9 premium voices with style control | 10 languages | ✓ | ✓ |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | 3-second voice cloning | 10 languages | ✓ | ✗ |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | 9 premium voices | 10 languages | ✓ | ✓ |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6B | 3-second voice cloning | 10 languages | ✓ | ✗ |
Supported Languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Model Variants

VoiceDesign

Create custom voices from natural language descriptions

CustomVoice

Use 9 premium pre-trained voices with style control

Base

Clone any voice from just 3 seconds of audio

Tokenizer

Encode/decode audio for compression and transport

1. Qwen3-TTS-12Hz-1.7B-VoiceDesign

Best for: Creating unique voices based on detailed descriptions
  • Generate voices from natural language descriptions
  • Control timbre, age, gender, emotion, and speaking style
  • Instruction-based prosody and tone adjustment
  • Adaptive emotional expression based on text semantics
  • Streaming generation support

2. Qwen3-TTS-12Hz-1.7B-CustomVoice

Best for: Production applications requiring consistent, high-quality voices
  • 9 premium pre-trained voices
  • Style control via natural language instructions
  • Covers various genders, ages, languages, and dialects
  • Instruction-based emotion and prosody control
  • Streaming generation support

3. Qwen3-TTS-12Hz-1.7B-Base

Best for: Voice cloning and fine-tuning for custom applications
  • Clone any voice from 3 seconds of audio
  • Maintains speaker characteristics and style
  • Support for reference audio in multiple formats (file, URL, bytes)
  • Can be fine-tuned for specific domains or speakers
  • Streaming generation support

4. Qwen3-TTS-12Hz-0.6B Models

Best for: Applications requiring faster inference or running on limited resources.

The 0.6B variants offer:
  • Faster inference: ~1.7x faster than 1.7B models
  • Lower memory: Reduced GPU memory requirements
  • Good quality: Maintains high quality with slight trade-off vs 1.7B
  • Same features: CustomVoice and Base variants available
The 0.6B models are ideal for edge devices, high-throughput services, or applications where speed is prioritized over maximum quality.

5. Qwen3-TTS-Tokenizer-12Hz

Best for: Audio compression, transport, and training data preparation
  • Encode audio to discrete codes (compression)
  • Decode codes back to audio (reconstruction)
  • 12Hz frame rate (12 frames per second)
  • 16 residual quantizers
  • 2048 codebook size per quantizer
  • 24kHz sample rate
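
These numbers imply a very low bitrate: 12 frames/s × 16 quantizers × log2(2048) = 11 bits per code works out to about 2.1 kbps, versus 384 kbps for raw 16-bit PCM at 24 kHz. A quick back-of-envelope check:

```python
import math

FRAME_RATE_HZ = 12       # frames per second
NUM_QUANTIZERS = 16      # residual quantizers per frame
CODEBOOK_SIZE = 2048     # entries per codebook
SAMPLE_RATE_HZ = 24_000  # output sample rate
PCM_BITS = 16            # bits per sample for raw PCM

bits_per_code = int(math.log2(CODEBOOK_SIZE))                    # 11 bits
token_bitrate = FRAME_RATE_HZ * NUM_QUANTIZERS * bits_per_code   # 2112 bits/s
pcm_bitrate = SAMPLE_RATE_HZ * PCM_BITS                          # 384000 bits/s

print(f"token bitrate: {token_bitrate / 1000:.3f} kbps")
print(f"compression vs 16-bit PCM: ~{pcm_bitrate / token_bitrate:.0f}x")
```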

Model Selection Guide

Creating a voice from a description
Recommended: Qwen3-TTS-12Hz-1.7B-VoiceDesign. This model excels at generating voices from descriptions. You can specify exact characteristics like “teenage girl with cheerful tone” or “elderly professor with authoritative voice.”

Using consistent production voices
Recommended: Qwen3-TTS-12Hz-1.7B-CustomVoice (or 0.6B for faster inference). Use the 9 premium pre-trained voices for production applications. They’re highly consistent and support instruction-based style control.

Cloning a specific voice
Recommended: Qwen3-TTS-12Hz-1.7B-Base (or 0.6B for faster inference). Provide 3 seconds of reference audio to clone any voice. Ideal for personal voice assistants or preserving specific speaker characteristics.

Running in real time or at scale
Recommended: Qwen3-TTS-12Hz-0.6B-CustomVoice or 0.6B-Base. The 0.6B models offer significantly faster inference while maintaining good quality. Perfect for real-time applications or serving many concurrent requests.

Fine-tuning for a custom domain
Recommended: Qwen3-TTS-12Hz-1.7B-Base or 0.6B-Base. Base models serve as excellent starting points for fine-tuning. See the Fine-Tuning Guide for details.

Encoding and decoding audio
Recommended: Qwen3-TTS-Tokenizer-12Hz. Use the tokenizer for encode/decode operations, audio compression, or preparing training data.
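
The guide above can be condensed into a small lookup table. This is an illustrative sketch; the use-case keys and the `recommend` helper are not part of any official API.

```python
# Illustrative mapping of use case -> recommended checkpoint,
# condensed from the selection guide above.
MODEL_GUIDE = {
    "voice_design":   "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "premium_voices": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "voice_cloning":  "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "fine_tuning":    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "tokenization":   "Qwen/Qwen3-TTS-Tokenizer-12Hz",
}

def recommend(use_case: str, prefer_speed: bool = False) -> str:
    """Return a checkpoint name; swap 1.7B for 0.6B when speed matters.

    VoiceDesign has no 0.6B variant, so it is never swapped.
    """
    model = MODEL_GUIDE[use_case]
    if prefer_speed and "VoiceDesign" not in model:
        model = model.replace("1.7B", "0.6B")
    return model

print(recommend("voice_cloning", prefer_speed=True))
# -> Qwen/Qwen3-TTS-12Hz-0.6B-Base
```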

Performance Benchmarks

Content Consistency (WER ↓)

Word Error Rate on Seed-TTS test set:
| Model | Chinese | English |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
| Qwen3-TTS-12Hz-0.6B-Base | 0.92 | 1.32 |
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |

Speaker Similarity (Cosine Similarity ↑)

Speaker similarity on multilingual test set:
| Language | 1.7B-Base | 0.6B-Base |
|---|---|---|
| Chinese | 0.799 | 0.811 |
| English | 0.775 | 0.829 |
| Japanese | 0.788 | 0.798 |
| Korean | 0.799 | 0.812 |
Higher scores indicate better preservation of speaker characteristics. Notably, the 0.6B model outperforms the 1.7B model on every language shown here.
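
Speaker similarity here is the cosine similarity between speaker embeddings extracted from the reference and the generated audio. The metric itself is simple; the toy vectors below are purely illustrative (real speaker embeddings are high-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" just to illustrate the metric.
ref = [0.2, 0.9, 0.4]
gen = [0.25, 0.85, 0.45]
print(f"{cosine_similarity(ref, gen):.3f}")
```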

Hardware Requirements

GPU Memory Usage

| Model | FP16/BF16 | FP32 |
|---|---|---|
| 1.7B models | ~4-6 GB | ~8-12 GB |
| 0.6B models | ~2-3 GB | ~4-6 GB |
| Tokenizer | ~1-2 GB | ~2-4 GB |
Memory requirements increase with batch size and sequence length. Add 2-4 GB for FlashAttention 2 and generation overhead.
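
The FP16/BF16 figures are roughly consistent with a simple weights-only estimate (parameter count × bytes per parameter) plus activation and generation overhead:

```python
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Weights-only memory estimate in GiB (excludes activations/KV cache)."""
    return num_params * bytes_per_param / 1024**3

# 1.7B parameters in FP16 (2 bytes) vs FP32 (4 bytes)
print(f"1.7B FP16: ~{weights_gb(1.7e9, 2):.1f} GB")  # ~3.2 GB of weights
print(f"1.7B FP32: ~{weights_gb(1.7e9, 4):.1f} GB")  # ~6.3 GB of weights
print(f"0.6B FP16: ~{weights_gb(0.6e9, 2):.1f} GB")  # ~1.1 GB of weights
```

The remaining 1-3 GB in the table comes from activations, the KV cache, and framework overhead, which is why the requirements grow with batch size and sequence length.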

Inference Speed

Typical inference times (per second of audio):
| Model | GPU | Speed (RTF) |
|---|---|---|
| 1.7B | A100 | ~0.1x |
| 1.7B | V100 | ~0.15x |
| 0.6B | A100 | ~0.06x |
| 0.6B | V100 | ~0.09x |
RTF (Real-Time Factor): 0.1x means generating 1 second of audio takes 0.1 seconds. Lower is faster.
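
RTF translates directly into wall-clock latency: generation time ≈ audio duration × RTF. A small sketch using the approximate figures from the table above (assumptions, not guarantees):

```python
# Approximate RTF values from the table above (assumptions, not guarantees).
RTF = {
    ("1.7B", "A100"): 0.10,
    ("1.7B", "V100"): 0.15,
    ("0.6B", "A100"): 0.06,
    ("0.6B", "V100"): 0.09,
}

def generation_seconds(audio_seconds: float, model: str, gpu: str) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` of audio."""
    return audio_seconds * RTF[(model, gpu)]

# 10 s of audio on an A100:
print(generation_seconds(10, "1.7B", "A100"))  # 1.0 second
print(generation_seconds(10, "0.6B", "A100"))  # about 0.6 seconds
```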

Model Downloads

Models are automatically downloaded when you call from_pretrained(). For manual downloads:
pip install -U "huggingface_hub[cli]"

# Download specific model
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --local-dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --local-dir ./Qwen3-TTS-12Hz-1.7B-Base

huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz \
  --local-dir ./Qwen3-TTS-Tokenizer-12Hz

Next Steps

Quick Start

Install and start using Qwen3-TTS models

Architecture

Understand how the models work internally

Language Support

Learn about multilingual capabilities

Fine-Tuning

Customize models for your specific use case
