Installation & Setup
How do I install Qwen3-TTS?
Do I need FlashAttention?
FlashAttention is optional, but recommended for:
- Reduced GPU memory usage
- Faster inference
- Better overall performance
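If you want to enable FlashAttention-2 only when it is installed, a minimal availability check looks like this (a sketch: the `"flash_attention_2"`/`"sdpa"` option names follow the Hugging Face Transformers convention and may not apply to every loader):

```python
import importlib.util

def flash_attn_available() -> bool:
    """Return True if the flash_attn package is importable."""
    return importlib.util.find_spec("flash_attn") is not None

# Pick an attention implementation accordingly; check your loader's
# documentation for the exact option names it accepts.
attn_impl = "flash_attention_2" if flash_attn_available() else "sdpa"
```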
What are the system requirements?
Minimum:
- Python 3.9 or newer (3.12 recommended)
- CUDA-compatible GPU with 8GB+ VRAM
- 16GB+ system RAM

Recommended:
- Python 3.12
- NVIDIA GPU with 16GB+ VRAM (A100, RTX 4090, etc.)
- 32GB+ system RAM
- FlashAttention-2 support (Ampere or newer)

For the 0.6B model:
- 8GB VRAM is sufficient
- 16GB system RAM
Model downloads are very slow. What can I do?
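Common remedies are pointing the Hugging Face client at a mirror and using resumable snapshot downloads. A sketch using `huggingface_hub` (the mirror URL and the example repo id are assumptions, not taken from this document; use the name on the actual model card):

```python
import os

# Set the mirror before huggingface_hub is imported anywhere;
# hf-mirror.com is one commonly used public mirror -- pick one you trust.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

def download_model(repo_id: str, local_dir: str) -> str:
    """Download a model snapshot; interrupted downloads resume automatically."""
    from huggingface_hub import snapshot_download  # lazy import, see above
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Example (illustrative repo id):
# download_model("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "./models")
```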
Model Selection
Which model should I use?
CustomVoice:
- Use 9 predefined premium voices
- Control tone/emotion with natural-language instructions
- Best for: apps with fixed character voices, production deployments

VoiceDesign:
- Generate voices from text descriptions
- Create unique voices on demand
- Best for: creative applications, voice prototyping, character design

Base (voice cloning):
- Clone any voice from 3 seconds of reference audio
- Highest-quality cloning
- Best for: personalization, voice transfer, a base for fine-tuning

Model sizes:
- 0.6B: faster, less memory, good quality
- 1.7B: best quality, more natural, better instruction following
What's the difference between 12Hz and 25Hz tokenizers?
12Hz tokenizer:
- Higher-quality audio reconstruction
- Better content consistency (lower WER)
- Models: Qwen3-TTS-12Hz-(0.6B/1.7B)-(CustomVoice/VoiceDesign/Base)

25Hz tokenizer:
- Faster inference
- Slightly lower quality
- Models: Qwen3-TTS-25Hz-(0.6B/1.7B)-(CustomVoice/Base)
Can I fine-tune the models?
Yes. Fine-tuning can help with:
- Domain-specific vocabulary
- Custom language variants or dialects
- Specialized voice characteristics
- Improved performance on specific tasks
Usage & Features
What languages are supported?
- Chinese (Mandarin)
- English
- Japanese
- Korean
- German
- French
- Russian
- Portuguese
- Spanish
- Italian
Chinese dialects (via dedicated speakers):
- Beijing dialect (Dylan)
- Sichuan dialect (Eric)
How long does voice cloning take?
- First generation with new voice: 2-4 seconds (includes prompt extraction)
- Subsequent generations with cached prompt: < 1 second
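That speedup comes from caching the extracted voice prompt and reusing it across generations. The pattern can be sketched with a stand-in for the extraction step (the real extraction call depends on the model API, which is not shown here):

```python
import functools
import hashlib

@functools.lru_cache(maxsize=32)
def extract_voice_prompt(audio_path: str) -> str:
    """Stand-in for the slow prompt-extraction step; cached per reference file."""
    return hashlib.sha256(audio_path.encode()).hexdigest()

def synthesize(text: str, audio_path: str) -> str:
    prompt = extract_voice_prompt(audio_path)  # cached after the first call
    return f"{prompt[:8]}:{text}"

synthesize("Hello there.", "ref.wav")  # slow path: prompt extracted
synthesize("Once more.", "ref.wav")    # fast path: cached prompt reused
```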
What audio formats are supported?
Input reference audio:
- WAV, MP3, FLAC, OGG (via librosa)
- URLs (http/https)
- NumPy arrays
- Base64 strings
- Tuples: (numpy_array, sample_rate)

Output audio:
- NumPy arrays (float32, -1.0 to 1.0)
- Sample rate: 16000 Hz
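When passing NumPy arrays directly, they should match the float32 / [-1.0, 1.0] convention above. A small helper for converting int16 PCM (the function name is ours; resampling to 16 kHz, e.g. with librosa, is not shown):

```python
import numpy as np

def to_float32(pcm16: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to float32 in [-1.0, 1.0]."""
    audio = pcm16.astype(np.float32) / 32768.0
    return np.clip(audio, -1.0, 1.0)

# Example: one second of 16 kHz int16 audio (silence).
chunk = to_float32(np.zeros(16_000, dtype=np.int16))
```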
Does Qwen3-TTS support streaming?
Yes:
- First audio packet after as little as a single character of input
- End-to-end latency as low as 97 ms
- Suitable for real-time interactive scenarios
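On the consuming side, streamed audio is handled chunk by chunk rather than waiting for the full utterance. A generator sketch with a stand-in for the model's streamed output (the 100 ms chunk size at 16 kHz is our choice, not a documented value):

```python
from typing import Iterator
import numpy as np

CHUNK = 1_600  # 100 ms at 16 kHz

def stream_chunks(audio: np.ndarray, chunk: int = CHUNK) -> Iterator[np.ndarray]:
    """Yield fixed-size chunks so playback can start before synthesis ends."""
    for start in range(0, len(audio), chunk):
        yield audio[start:start + chunk]

audio = np.zeros(16_000, dtype=np.float32)  # stand-in for model output
n_chunks = sum(1 for _ in stream_chunks(audio))
```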
Can I control speaking rate, pitch, or emotion?
What are the available speakers in CustomVoice?
| Speaker | Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Seasoned male, low and mellow | Chinese |
| Dylan | Youthful Beijing male, clear and natural | Chinese (Beijing) |
| Eric | Lively Chengdu male, husky brightness | Chinese (Sichuan) |
| Ryan | Dynamic male, strong rhythmic drive | English |
| Aiden | Sunny American male, clear midrange | English |
| Ono_Anna | Playful Japanese female, light and nimble | Japanese |
| Sohee | Warm Korean female, rich emotion | Korean |
Performance & Optimization
How can I speed up inference?
- Use FlashAttention-2
- Use bfloat16 dtype
- Use a smaller model (0.6B instead of 1.7B)
- Use batch processing
- Use vLLM for production deployments
  - See the vLLM-Omni documentation
  - Optimized inference engine
  - Better throughput and latency
I'm getting CUDA out of memory errors
- Use a smaller model
- Use float16 instead of bfloat16
- Reduce the batch size
- Reduce max_new_tokens
- Clear the CUDA cache
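Clearing the cache can be wrapped in a small helper; note that cached memory is only reclaimable after the Python references to large tensors are dropped. Torch is imported lazily here so the sketch also runs where it is not installed:

```python
import gc
import importlib.util

def free_accelerator_memory() -> None:
    """Collect garbage, then release cached CUDA blocks if torch is present."""
    gc.collect()
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

# Typical usage after an OOM: delete big objects first, then clear.
# del model, outputs
free_accelerator_memory()
```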
Can I run Qwen3-TTS on CPU?
Yes, though it is much slower:
- Use the 0.6B model for reasonable performance
- Expect inference 10-50x slower than on a GPU
- Consider quantized models (coming soon)
How can I deploy Qwen3-TTS in production?
- DashScope API (easiest)
  - Managed service by Alibaba Cloud
  - No infrastructure needed
  - See the API documentation
- vLLM-Omni (self-hosted, optimized)
  - Best performance for self-hosted deployments
  - Batch inference optimization
  - See the vLLM-Omni quickstart
- Gradio demo (quick prototypes)
- Custom Flask/FastAPI service
  - Wrap the Python API in your own service
  - Full control over API design
  - Use gunicorn/uvicorn for production
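The last option can be sketched with only the standard library (FastAPI/uvicorn would replace `http.server` in production, and `synthesize` is a stub standing in for the actual model call):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def synthesize(text: str) -> bytes:
    # Stub: a real service would run the TTS model and return WAV bytes.
    return text.encode("utf-8")

class TTSHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        audio = synthesize(payload.get("text", ""))
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

    def log_message(self, *args):  # silence per-request logging for the demo
        pass

def serve(port: int = 0) -> HTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), TTSHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```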
Troubleshooting
Audio quality is poor or has artifacts
- Reference audio quality (Base model)
  - Use high-quality reference audio (16 kHz+, no background noise)
  - Provide an accurate transcript
  - Use 3+ seconds of reference audio
- Text quality
  - Check for typos or special characters
  - Ensure proper punctuation
  - Avoid extremely long sentences
- Model selection
  - Try the 1.7B model instead of 0.6B
  - Use the 12Hz tokenizer instead of 25Hz
- Generation parameters: experiment with the sampling settings
Voice cloning doesn't match the reference
- Disable x-vector-only mode
- Improve the reference audio
  - Use clean audio without background noise
  - Ensure clear speech without mumbling
  - Use 3-5 seconds of audio
  - Include diverse phonemes
- Match the target language to the reference
  - If the reference is English, generate English first
  - Cross-lingual cloning is harder
- Use longer reference audio
  - 5-10 seconds often works better than 3 seconds
  - Use multiple sentences with varied intonation
The web demo microphone doesn't work (Base model)
- Use HTTPS (required for remote microphone access)
- Access via localhost (plain HTTP is allowed for localhost), then open http://127.0.0.1:8000
- Check browser permissions
  - Allow microphone access when prompted
  - Check the browser's microphone permission settings
API & Integration
Is there a REST API?
Can I use Qwen3-TTS with LangChain or other frameworks?
Licensing & Commercial Use
What is the license?
The license permits you to:
- Use it commercially
- Modify and distribute it
- Use it privately
- Include it in proprietary software
Can I use Qwen3-TTS for commercial applications?
Yes. Keep in mind:
- You are responsible for the content you generate
- Do not use it for illegal, harmful, or infringing content
- Follow local laws regarding AI-generated audio
- Consider the ethical implications (deepfakes, impersonation, etc.)