Available TTS engines
GPT-SoVITS
Few-shot voice cloning with 4x real-time synthesis
Key capabilities
Voice cloning
Clone any voice with just a few seconds of reference audio. GPT-SoVITS supports both zero-shot (no reference text) and few-shot (with reference text) voice cloning modes.
Mixed language support
GPT-SoVITS handles mixed Chinese-English text naturally with automatic language detection and proper G2P (grapheme-to-phoneme) conversion.
High performance
All TTS engines leverage Metal GPU acceleration via MLX for maximum performance on Apple Silicon:

| Engine | Real-time Factor | Quality | Voice Cloning |
|---|---|---|---|
| GPT-SoVITS | 4x | High | Yes |
Real-time factor here measures how fast audio is generated relative to its playback duration, so higher is faster. 4x means a 2-second audio clip is generated in 0.5 seconds.
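The arithmetic behind the real-time factor can be sketched in a few lines of Rust. This is a minimal illustration, not part of the TTS API: the function names `realtime_speedup` and `audio_duration` are hypothetical, and the 32 kHz sample rate mentioned below is only an example value.

```rust
use std::time::Duration;

/// Speed-up over real time: seconds of audio produced per second of compute.
/// Returns `None` when the generation time is zero, since the ratio is undefined.
/// (Hypothetical helper for illustration only.)
pub fn realtime_speedup(audio: Duration, generation: Duration) -> Option<f64> {
    if generation.is_zero() {
        return None;
    }
    Some(audio.as_secs_f64() / generation.as_secs_f64())
}

/// Playback duration implied by a sample count and a sample rate.
/// A sample rate of zero yields a zero duration instead of panicking.
pub fn audio_duration(num_samples: usize, sample_rate_hz: u32) -> Duration {
    if sample_rate_hz == 0 {
        return Duration::ZERO;
    }
    Duration::from_secs_f64(num_samples as f64 / sample_rate_hz as f64)
}
```

For example, calling `realtime_speedup` with 2 seconds of audio and 500 ms of generation time gives a factor of 4, matching the definition above. In practice the generation time would come from wrapping the synthesis call with `std::time::Instant`, and the audio duration from `audio_duration` (64,000 samples at 32 kHz is 2 seconds).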
Performance metrics
Benchmarks on Apple M3 Max for GPT-SoVITS:
Total: ~220ms for 2 seconds of audio output (4x real-time)
Shared infrastructure
All TTS engines use common audio processing components from mlx-rs-core:
- Audio I/O: load_wav, save_wav, resample
- Signal processing: compute_mel_spectrogram
- Caching: KVCache, ConcatKeyValueCache for autoregressive generation
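To show why a key/value cache helps autoregressive generation, here is a conceptual sketch in plain Rust. It does not reproduce the mlx-rs-core API: the signatures of KVCache and ConcatKeyValueCache are not documented on this page, so the `SketchKvCache` type, its methods, and the use of `Vec<f32>` in place of GPU arrays are all assumptions made for illustration. The idea is that each decoding step appends one key and one value, and attention for the newest token reuses everything already stored instead of recomputing it.

```rust
/// Illustrative key/value cache for a single attention head.
/// Hypothetical type: not the mlx-rs-core `KVCache`.
#[derive(Debug, Default)]
pub struct SketchKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl SketchKvCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append the key/value pair for the newest token (concatenation along
    /// the time axis) and return the number of cached positions.
    pub fn update(&mut self, key: Vec<f32>, value: Vec<f32>) -> usize {
        self.keys.push(key);
        self.values.push(value);
        self.keys.len()
    }

    pub fn len(&self) -> usize {
        self.keys.len()
    }

    pub fn is_empty(&self) -> bool {
        self.keys.is_empty()
    }

    /// Scaled dot-product attention of one query over all cached positions.
    /// Returns an empty vector if the cache or the query is empty.
    pub fn attend(&self, query: &[f32]) -> Vec<f32> {
        if self.keys.is_empty() || query.is_empty() {
            return Vec::new();
        }
        let scale = 1.0 / (query.len() as f32).sqrt();
        let scores: Vec<f32> = self
            .keys
            .iter()
            .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Softmax with max subtraction for numerical stability.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = exps.iter().sum();
        // Weighted sum of the cached values.
        let mut out = vec![0.0; self.values[0].len()];
        for (w, v) in exps.iter().zip(&self.values) {
            for (o, x) in out.iter_mut().zip(v) {
                *o += (w / total) * x;
            }
        }
        out
    }
}
```

With this layout, generating token N costs one `update` plus one `attend` over N cached entries, rather than recomputing keys and values for the whole prefix at every step.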
System requirements
Hardware requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- macOS 13.0 or later
- 16GB RAM minimum (32GB recommended)
- 5GB disk space for models
Software requirements
- Rust 1.75 or later
- Python 3.10+ (for one-time model setup only)
- No runtime Python dependencies
Model storage
Models are stored in ~/.dora/models/primespeech/ by default:
- GPT-SoVITS: ~2GB
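If you need to locate the model directory from your own Rust code, the default path can be resolved as in the sketch below. The helper names `model_dir_under` and `default_model_dir` are hypothetical, and reading `HOME` from the environment is an assumption about how the tilde is expanded; the engine may resolve or override the location differently.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Build the default model directory under a given home directory.
/// (Hypothetical helper; mirrors the documented ~/.dora/models/primespeech/ path.)
pub fn model_dir_under(home: &Path) -> PathBuf {
    home.join(".dora").join("models").join("primespeech")
}

/// Resolve the default model directory from the HOME environment variable.
/// Returns `None` if HOME is not set.
pub fn default_model_dir() -> Option<PathBuf> {
    env::var_os("HOME").map(|home| model_dir_under(Path::new(&home)))
}
```

Checking `default_model_dir().map(|p| p.exists())` before starting synthesis is a cheap way to confirm that the one-time model setup has been run.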
Next steps
GPT-SoVITS guide
Learn how to use voice cloning with GPT-SoVITS
API reference
Explore the TTS API documentation