Prerequisites
Before you begin, make sure you have:
- Python 3.9 or higher (Python 3.12 recommended)
- A CUDA-compatible GPU (optional but recommended)
- The `qwen-tts` package installed (see Installation)
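As a quick sanity check before installing, something like the following can verify the environment. `check_prerequisites` is an illustrative helper, not part of the `qwen-tts` package, and the GPU probe only looks for the `nvidia-smi` binary:

```python
import shutil
import sys

def check_prerequisites() -> dict:
    """Report whether the environment meets the prerequisites above.

    A CUDA-capable GPU is optional, so a missing GPU is only reported,
    not treated as fatal.
    """
    return {
        "python_ok": sys.version_info >= (3, 9),
        "gpu_visible": shutil.which("nvidia-smi") is not None,
    }

print(check_prerequisites())
```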
CustomVoice: Generate with Preset Speakers
The CustomVoice models provide 9 premium preset voices across multiple languages.
Generate speech with a preset speaker
Available speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric (Chinese); Ryan, Aiden (English); Ono_Anna (Japanese); Sohee (Korean). See the Custom Voice guide for details.
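For reference, the speaker-to-language mapping above can be kept in a small lookup table. This `PRESET_SPEAKERS` dict is illustrative; the package may expose the speaker list differently:

```python
# Illustrative lookup table for the preset speakers listed above.
PRESET_SPEAKERS = {
    "Chinese": ["Vivian", "Serena", "Uncle_Fu", "Dylan", "Eric"],
    "English": ["Ryan", "Aiden"],
    "Japanese": ["Ono_Anna"],
    "Korean": ["Sohee"],
}

def speakers_for(language: str) -> list:
    """Return the preset speakers available for a language, or []."""
    return PRESET_SPEAKERS.get(language, [])

print(speakers_for("English"))  # ['Ryan', 'Aiden']
```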
VoiceDesign: Create Custom Voices
The VoiceDesign model lets you create voices from natural language descriptions.
Base: Clone Any Voice
The Base models can clone a voice from just 3 seconds of reference audio.
Prepare reference audio
You need a reference audio file (3+ seconds) and its transcript.
The reference audio can be a local file path, URL, base64 string, or numpy array with sample rate.
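A sketch of how those four input forms could be distinguished. The real package does this dispatch internally; `classify_reference_audio` is hypothetical, and a plain `(samples, sample_rate)` tuple stands in for the numpy-array form:

```python
import base64

def classify_reference_audio(ref):
    """Classify the accepted reference-audio forms described above.

    Hypothetical helper. Accepted forms: local file path, URL,
    base64 string, or a (samples, sample_rate) pair standing in
    for a numpy array with sample rate.
    """
    if isinstance(ref, tuple) and len(ref) == 2:
        return "array_with_sample_rate"
    if isinstance(ref, str):
        if ref.startswith(("http://", "https://")):
            return "url"
        if ref.endswith((".wav", ".mp3", ".flac")):
            return "local_path"
        try:
            # A string that is neither a URL nor an audio path is
            # treated as base64-encoded audio bytes.
            base64.b64decode(ref, validate=True)
            return "base64"
        except Exception:
            raise ValueError(f"Unrecognized reference audio: {ref!r}")
    raise TypeError(f"Unsupported reference type: {type(ref)!r}")
```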
Batch Processing
All models support batch processing for efficiency.
Common Parameters
All generation methods support these optional parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `do_sample` | bool | True | Enable sampling for more natural speech |
| `temperature` | float | 1.0 | Controls randomness (0.0-1.0) |
| `top_k` | int | 50 | Top-k sampling parameter |
| `top_p` | float | 1.0 | Nucleus sampling parameter |
| `max_new_tokens` | int | 2048 | Maximum tokens to generate |
| `non_streaming_mode` | bool | False | Force non-streaming generation |
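One way to keep these defaults in one place is a small kwargs builder. `generation_kwargs` is a hypothetical convenience wrapper around the documented defaults, not a `qwen-tts` API:

```python
def generation_kwargs(**overrides):
    """Assemble the optional generation parameters from the table above.

    Hypothetical wrapper: defaults mirror the documented values,
    unknown names are rejected early, and temperature is range-checked.
    """
    defaults = {
        "do_sample": True,
        "temperature": 1.0,
        "top_k": 50,
        "top_p": 1.0,
        "max_new_tokens": 2048,
        "non_streaming_mode": False,
    }
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    temperature = overrides.get("temperature", defaults["temperature"])
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0.0, 1.0]")
    return {**defaults, **overrides}

print(generation_kwargs(temperature=0.8))
```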
Model Selection Guide
CustomVoice
Use when:
- You need consistent, high-quality preset voices
- You want instruction-based control (1.7B)
- You need multilingual support
- 0.6B: 9 preset speakers
- 1.7B: 9 speakers + instructions
VoiceDesign
Use when:
- You need custom voice characteristics
- You want to describe voices in natural language
- You need creative voice variations
- 1.7B only
Base
Use when:
- You need to clone specific voices
- You have reference audio samples
- You want to fine-tune for your use case
- 0.6B and 1.7B
- Both support 3-second cloning
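The selection guide above can be condensed into a tiny decision helper. `suggest_model` is illustrative and returns the family names used in this guide, not package model identifiers:

```python
def suggest_model(need_cloning=False, need_voice_design=False,
                  need_instructions=False):
    """Map the decision points above to a model family (illustrative)."""
    if need_voice_design:
        # Natural-language voice descriptions: VoiceDesign only (1.7B).
        return "VoiceDesign (1.7B)"
    if need_cloning:
        # Reference-audio cloning: Base, either size supports 3-second cloning.
        return "Base (0.6B or 1.7B)"
    if need_instructions:
        # Instruction-based control requires the larger CustomVoice model.
        return "CustomVoice (1.7B)"
    # Default: consistent preset voices with the smallest model.
    return "CustomVoice (0.6B)"
```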
Next Steps
Explore Guides
Learn about advanced features and workflows
API Reference
Explore the complete API documentation
Performance Tips
Optimize for speed and quality
Fine-tuning
Customize models for your specific needs
Troubleshooting
Model downloads are slow
Models are downloaded from Hugging Face on first use. For faster downloads in China, download the model with ModelScope first, then load it from the local directory.
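Assuming ModelScope's command-line downloader, the workflow might look like the following; replace the model ID placeholder with the actual release you use:

```shell
# Install ModelScope, then fetch the model into a local directory.
pip install modelscope
modelscope download --model <model-id> --local_dir ./qwen-tts-model
```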
Out of memory errors
Try these solutions:
- Use the 0.6B model instead of 1.7B
- Reduce batch size
- Use `dtype=torch.float16` instead of `bfloat16`
- Generate shorter text segments
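For the last tip, long input can be pre-split at sentence boundaries before generation. `split_text` is an illustrative helper, not part of the package:

```python
import re

def split_text(text: str, max_chars: int = 200) -> list:
    """Split long input into shorter segments at sentence boundaries.

    Splitting at sentence-ending punctuation keeps prosody natural;
    a single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?\u3002\uff01\uff1f])\s*", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        segments.append(current)
    return segments
```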
Audio quality is poor
Check these factors:
- Use the 12Hz models (better quality than 25Hz)
- Set appropriate language explicitly
- For voice cloning, ensure reference audio is clear (3+ seconds, good quality)
- Adjust temperature (try 0.7-0.9 for more stable output)
FlashAttention 2 installation fails
FlashAttention 2 is optional but recommended:
- Requires CUDA 11.8 or higher
- On machines with limited RAM, limit build parallelism: `MAX_JOBS=4 pip install flash-attn --no-build-isolation`
- If it fails, the model will work without it (just slower)
Getting Help
Join the Community
Ask questions, report issues, and get help from the community on GitHub