Features
- Few-shot voice cloning: clone voices with just seconds of reference audio
- Mixed language support: natural handling of mixed Chinese-English text
- High performance: 4x real-time synthesis on Apple Silicon
- Pure Rust: no Python dependencies at inference time
First-time setup
Download and convert all required model weights (~2GB). The setup:

- Installs Python dependencies (torch CPU, safetensors, transformers)
- Downloads pretrained checkpoints from HuggingFace
- Converts them to MLX-compatible safetensors format
- Places the output in ~/.dora/models/primespeech/gpt-sovits-mlx/
After setup, Python is no longer required. All inference runs in pure Rust.
Model files
The setup creates the model files under ~/.dora/models/primespeech/gpt-sovits-mlx/.

Quick start
Basic voice cloning
Mixed language synthesis
GPT-SoVITS automatically detects and handles mixed Chinese-English text.

Voice cloning workflow
Prepare reference audio
Record or select a clean audio sample (WAV format, 5-30 seconds recommended). The reference audio should:
- Be clear, with minimal background noise
- Contain only the target speaker's voice
- Be in WAV format (any sample rate; it is resampled automatically)
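The automatic resampling mentioned above can be pictured with a minimal linear-interpolation sketch. This is illustrative only; a production resampler (and likely the one in the audio module) would use windowed-sinc filtering for quality.

```rust
/// Toy linear-interpolation resampler: converts `input` from `from_hz` to `to_hz`.
/// Illustrative only; real resamplers use windowed-sinc filtering.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = ((input.len() as f64) / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            // Fractional position in the input signal.
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = *input.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // Downsample a 32-sample ramp from 32 kHz to 16 kHz: half as many samples.
    let input: Vec<f32> = (0..32).map(|i| i as f32).collect();
    let out = resample_linear(&input, 32_000, 16_000);
    println!("{} -> {} samples", input.len(), out.len());
}
```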
Set reference audio
Configure the reference audio for voice cloning:
Few-shot mode requires the CNHubert model and produces better quality by using the reference transcript.
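The distinction between few-shot and zero-shot can be modeled conceptually like this. The type and field names below are illustrative assumptions, not the crate's actual API: the presence of a reference transcript is what makes a configuration few-shot.

```rust
/// Toy model of a reference-audio configuration (not the crate's real type).
/// `transcript: Some(..)` corresponds to few-shot mode (requires CNHubert);
/// `None` corresponds to zero-shot mode.
struct ReferenceAudio {
    wav_path: String,
    transcript: Option<String>,
}

impl ReferenceAudio {
    fn is_few_shot(&self) -> bool {
        self.transcript.is_some()
    }
}

fn main() {
    let few_shot = ReferenceAudio {
        wav_path: "speaker.wav".into(),
        transcript: Some("transcript of the reference clip".into()),
    };
    let zero_shot = ReferenceAudio {
        wav_path: "speaker.wav".into(),
        transcript: None,
    };
    println!("few-shot: {}, zero-shot: {}", few_shot.is_few_shot(), zero_shot.is_few_shot());
}
```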
Architecture
GPT-SoVITS combines a GPT-style autoregressive model with a VITS vocoder.

Components
| Module | Description |
|---|---|
| audio | WAV I/O, resampling, mel spectrogram |
| cache | KV cache for autoregressive generation |
| text | G2PW, pinyin, language detection, phoneme processing |
| models/t2s | GPT text-to-semantic transformer |
| models/vits | SoVITS VITS vocoder |
| models/hubert | CNHubert audio encoder |
| models/bert | Chinese BERT embeddings |
| inference | T2S generation with cache |
| voice_clone | High-level voice cloning API |
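The role of the cache module can be illustrated with a toy KV cache (a conceptual sketch, not the crate's types): each autoregressive decoding step appends its new key/value vectors so attention over the prefix is never recomputed.

```rust
/// Toy KV cache: stores per-step key/value vectors so each autoregressive
/// step only computes attention state for the newest token.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Append the key/value produced at the current decoding step.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    /// Number of cached steps (the sequence length visible to attention).
    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        // In a real model these are the projected K/V tensors for the new token.
        cache.append(vec![step as f32; 4], vec![step as f32; 4]);
    }
    println!("cached steps: {}", cache.len());
}
```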
Advanced usage
Custom configuration
Text preprocessing
Control how text is converted to phonemes (G2P, pinyin, and language detection are handled by the text module).

Audio I/O operations
Low-level audio processing covers WAV I/O, resampling, and mel-spectrogram computation.

Performance benchmarks
Measured on Apple M3 Max for 2 seconds of audio output.

Inference breakdown
| Stage | Time | Notes |
|---|---|---|
| Reference processing | ~50ms | CNHubert + quantization |
| BERT embedding | ~20ms | Text encoding |
| T2S generation | ~100ms | GPT decoding (variable) |
| VITS synthesis | ~50ms | Audio generation |
| Total | ~220ms | For 2s audio output |
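As a sanity check on the table above: 2000 ms of audio produced in ~220 ms of compute implies a real-time factor of roughly 9x for this example, comfortably above the 4x figure quoted in the features list.

```rust
fn main() {
    // Stage timings from the inference-breakdown table above (in ms).
    let audio_ms = 2000.0_f64;
    let compute_ms = 50.0 + 20.0 + 100.0 + 50.0; // reference + BERT + T2S + VITS
    let rtf = audio_ms / compute_ms;
    println!("total compute: {compute_ms} ms, real-time factor: {rtf:.1}x");
}
```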
Memory usage
- Model loading: ~2GB GPU memory
- Runtime peak: ~3GB GPU memory
- CPU memory: ~1GB
Quality metrics
- Sample rate: 24kHz output
- Bit depth: 32-bit float (saved as 16-bit PCM)
- Latency: ~220ms for typical utterance
- Voice similarity: High (comparable to reference)
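The "32-bit float (saved as 16-bit PCM)" step is a simple scale-and-clamp conversion. A minimal sketch (the crate's exact rounding and clipping behavior may differ):

```rust
/// Convert a float sample in [-1.0, 1.0] to a 16-bit PCM sample,
/// clamping out-of-range input to avoid integer wraparound distortion.
fn f32_to_i16(sample: f32) -> i16 {
    let clamped = sample.clamp(-1.0, 1.0);
    (clamped * i16::MAX as f32).round() as i16
}

fn main() {
    // Silence, full-scale positive, and an out-of-range negative sample.
    println!("{} {} {}", f32_to_i16(0.0), f32_to_i16(1.0), f32_to_i16(-2.0));
}
```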
CLI reference
The voice_clone example provides a full CLI interface.
Voice configuration
Create ~/.OminiX/models/voices.json to configure voice presets.
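The exact schema is not documented here. As a purely illustrative sketch, a preset file might map a preset name to a reference clip and its transcript; every field name below is an assumption, not the crate's actual format:

```json
{
  "my_voice": {
    "ref_audio": "/path/to/reference.wav",
    "ref_text": "Transcript of the reference audio."
  }
}
```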
Troubleshooting
Model setup fails

Make sure you have Python 3.10+ installed, then re-run the setup script. If the download fails, check your internet connection and HuggingFace access.
Audio quality issues
- Use clean reference audio with minimal background noise
- Try few-shot mode with reference text for better quality
- Use pre-computed codes extracted from the Python pipeline for best results
Performance is slow

- Make sure you're building with the --release flag
- Check GPU utilization with Activity Monitor
- Verify MLX is using the Metal GPU (not CPU fallback)
Mixed language sounds wrong
The G2PW model automatically handles mixed Chinese-English. If pronunciation is incorrect:
- Verify the text is properly formatted
- Check that language detection is working correctly
- Try explicit language specification if needed
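To see what language detection has to do, here is a toy segmenter that splits text into Han (CJK) vs. other runs. This is illustrative only; the real detection in the text module also has to handle punctuation, digits, and context.

```rust
#[derive(Debug, PartialEq)]
enum Lang {
    Zh,
    En,
}

/// Toy segmenter: split text into alternating runs of Han (CJK Unified
/// Ideographs, U+4E00..=U+9FFF) vs. everything else.
fn segment(text: &str) -> Vec<(Lang, String)> {
    let mut runs: Vec<(Lang, String)> = Vec::new();
    for ch in text.chars() {
        let lang = if ('\u{4E00}'..='\u{9FFF}').contains(&ch) {
            Lang::Zh
        } else {
            Lang::En
        };
        match runs.last_mut() {
            // Extend the current run if the language matches...
            Some((l, s)) if *l == lang => s.push(ch),
            // ...otherwise start a new run.
            _ => runs.push((lang, ch.to_string())),
        }
    }
    runs
}

fn main() {
    for (lang, run) in segment("你好 hello 世界") {
        println!("{:?}: {:?}", lang, run);
    }
}
```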
Next steps
- TTS overview: back to the TTS overview
- API reference: explore the complete API