OminiX-MLX provides multiple text-to-speech (TTS) engines optimized for Apple Silicon, delivering natural-sounding voice synthesis with GPU acceleration.

Available TTS engines

GPT-SoVITS

Few-shot voice cloning with 4x real-time synthesis

Key capabilities

Voice cloning

Clone any voice with just a few seconds of reference audio. GPT-SoVITS supports both zero-shot (no reference text) and few-shot (with reference text) voice cloning modes.
use gpt_sovits_mlx::VoiceCloner;

// Create voice cloner with default models
let mut cloner = VoiceCloner::with_defaults()?;

// Set reference audio for voice cloning
cloner.set_reference_audio("reference.wav")?;

// Synthesize speech
let audio = cloner.synthesize("Hello, world!")?;

// Save output
cloner.save_wav(&audio, "output.wav")?;

Mixed language support

GPT-SoVITS handles mixed Chinese-English text naturally with automatic language detection and proper G2P (grapheme-to-phoneme) conversion.
// Mixed language synthesis works automatically
let audio = cloner.synthesize("你好 world! 今天天气 is great!")?;
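To make the idea of automatic language detection concrete, here is a minimal sketch that splits mixed text into Chinese and English runs by Unicode range, so that each run could be sent to a language-specific G2P step. This is an illustration only and is not GPT-SoVITS's actual detector; `Script`, `script_of`, and `split_by_script` are hypothetical names, and attaching whitespace, digits, and punctuation to the preceding run is an assumption made for simplicity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Chinese,
    English,
}

/// Classify one character. Returns None for neutral characters
/// (whitespace, digits, punctuation), which carry no script of their own.
fn script_of(c: char) -> Option<Script> {
    match c {
        // CJK Unified Ideographs and Extension A (assumed sufficient here)
        '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}' => Some(Script::Chinese),
        c if c.is_ascii_alphabetic() => Some(Script::English),
        _ => None,
    }
}

/// Split text into consecutive same-script runs.
/// Neutral characters are attached to the run that precedes them;
/// neutral characters before the first run are prepended to it.
fn split_by_script(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    let mut pending = String::new();
    for c in text.chars() {
        match script_of(c) {
            Some(s) => {
                let continues = runs.last().map_or(false, |(last, _)| *last == s);
                if continues {
                    if let Some((_, buf)) = runs.last_mut() {
                        buf.push(c);
                    }
                } else {
                    // Start a new run, carrying over any leading neutral text.
                    let mut buf = std::mem::take(&mut pending);
                    buf.push(c);
                    runs.push((s, buf));
                }
            }
            None => match runs.last_mut() {
                Some((_, buf)) => buf.push(c),
                None => pending.push(c),
            },
        }
    }
    runs
}
```

Calling `split_by_script` on the sentence above yields four runs that alternate between Chinese and English. A production system would also need to handle digits, other scripts, and context-dependent pronunciation.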

High performance

All TTS engines leverage Metal GPU acceleration via MLX for maximum performance on Apple Silicon:
Engine        Real-time factor   Quality   Voice cloning
GPT-SoVITS    4x                 High      Yes
Real-time factor here is the ratio of audio duration to generation time, so higher is faster: 4x means a 2-second audio clip is generated in 0.5 seconds.
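The definition above can be written as a small calculation. This is a sketch for illustration; `real_time_multiple` and `generation_secs_for` are hypothetical helper names, not part of any OminiX-MLX crate.

```rust
/// Speed multiple as used on this page: seconds of audio produced
/// per second of wall-clock generation time (higher is faster).
fn real_time_multiple(audio_secs: f64, generation_secs: f64) -> f64 {
    assert!(generation_secs > 0.0, "generation time must be positive");
    audio_secs / generation_secs
}

/// Inverse view: how long generation takes for a clip at a given multiple.
fn generation_secs_for(audio_secs: f64, multiple: f64) -> f64 {
    assert!(multiple > 0.0, "speed multiple must be positive");
    audio_secs / multiple
}
```

For example, `real_time_multiple(2.0, 0.5)` gives 4.0, matching the 2-second clip described above.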

Performance metrics

Benchmarks on Apple M3 Max for GPT-SoVITS:
1. Reference processing: ~50ms for CNHubert encoding and quantization
2. BERT embedding: ~20ms for text encoding
3. T2S generation: ~100ms for GPT decoding (variable length)
4. VITS synthesis: ~50ms for audio waveform generation

Total: ~220ms for 2 seconds of audio output. By these stage timings that is about 9x real-time (2000ms / 220ms), comfortably above the 4x headline figure.
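The stage budget can be checked with a few lines of arithmetic. The sketch below hard-codes the approximate per-stage latencies listed above, so its output is only as accurate as those estimates; `STAGES_MS`, `total_ms`, and `stage_shares` are hypothetical names used for illustration.

```rust
/// Approximate per-stage latencies quoted above, in milliseconds.
const STAGES_MS: [(&str, f64); 4] = [
    ("Reference processing", 50.0),
    ("BERT embedding", 20.0),
    ("T2S generation", 100.0),
    ("VITS synthesis", 50.0),
];

/// Sum of all stage latencies in milliseconds.
fn total_ms(stages: &[(&str, f64)]) -> f64 {
    stages.iter().map(|(_, ms)| ms).sum()
}

/// Fraction of the total time spent in each stage, in stage order.
fn stage_shares(stages: &[(&str, f64)]) -> Vec<(String, f64)> {
    let total = total_ms(stages);
    stages
        .iter()
        .map(|(name, ms)| (name.to_string(), ms / total))
        .collect()
}
```

`total_ms(&STAGES_MS)` returns 220.0, and dividing the 2000ms clip length by that total gives the speed multiple these stage timings imply. `stage_shares` shows that T2S generation, the variable-length stage, accounts for the largest share of the budget.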

Shared infrastructure

All TTS engines use common audio processing components from mlx-rs-core:
  • Audio I/O: load_wav, save_wav, resample
  • Signal processing: compute_mel_spectrogram
  • Caching: KVCache, ConcatKeyValueCache for autoregressive generation
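To illustrate why a key/value cache helps autoregressive generation, here is a minimal sketch of a concatenating cache in plain Rust. It is not the mlx-rs-core API: `ConcatCacheSketch` and its methods are hypothetical, and plain `Vec<f32>` rows stand in for the GPU arrays a real implementation would hold. The idea is that each decoding step appends only the newest key and value, then attends over everything cached so far instead of recomputing earlier steps.

```rust
/// Hypothetical stand-in for a concatenating key/value cache.
/// Each entry is the key or value vector for one decoding step.
#[derive(Default)]
struct ConcatCacheSketch {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl ConcatCacheSketch {
    /// Append this step's key and value, then return views over
    /// everything cached so far (what attention would read).
    fn update(&mut self, key: Vec<f32>, value: Vec<f32>) -> (&[Vec<f32>], &[Vec<f32>]) {
        self.keys.push(key);
        self.values.push(value);
        (self.keys.as_slice(), self.values.as_slice())
    }

    /// Number of steps cached so far, i.e. the position of the next token.
    fn offset(&self) -> usize {
        self.keys.len()
    }
}
```

After two calls to `update`, the returned slices each contain two entries and `offset()` returns 2, so the cost of each new step grows with the sequence length rather than with its square.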

System requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • macOS 13.0 or later
  • 16GB RAM minimum (32GB recommended)
  • 5GB disk space for models
  • Rust 1.75 or later
  • Python 3.10+ (for one-time model setup only)
  • No runtime Python dependencies
Models are stored in ~/.dora/models/primespeech/ by default:
  • GPT-SoVITS: ~2GB
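
If a tool needs to locate the models, the default location above can be resolved from the user's home directory. This is a sketch under the assumption that the path is exactly the one stated above; `default_model_dir` is a hypothetical helper, not a function exported by OminiX-MLX.

```rust
use std::path::{Path, PathBuf};

/// Build the default model directory under a given home directory.
/// The home directory is passed in so the function is easy to test.
fn default_model_dir(home: &Path) -> PathBuf {
    home.join(".dora").join("models").join("primespeech")
}
```

In practice the home directory could be read with `std::env::var_os("HOME")` and passed to this function.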

Next steps

GPT-SoVITS guide

Learn how to use voice cloning with GPT-SoVITS

API reference

Explore the TTS API documentation
