Text-to-speech overview

OminiX-MLX provides multiple text-to-speech (TTS) engines optimized for Apple Silicon, delivering natural-sounding voice synthesis with GPU acceleration.

Available TTS engines

GPT-SoVITS

Few-shot voice cloning with 4x real-time synthesis

Key capabilities

Voice cloning

Clone any voice with just a few seconds of reference audio. GPT-SoVITS supports both zero-shot (no reference text) and few-shot (with reference text) voice cloning modes.

use gpt_sovits_mlx::VoiceCloner;

// Create voice cloner with default models
let mut cloner = VoiceCloner::with_defaults()?;

// Set reference audio for voice cloning
cloner.set_reference_audio("reference.wav")?;

// Synthesize speech
let audio = cloner.synthesize("Hello, world!")?;

// Save output
cloner.save_wav(&audio, "output.wav")?;

Mixed language support

GPT-SoVITS handles mixed Chinese-English text naturally with automatic language detection and proper G2P (grapheme-to-phoneme) conversion.

// Mixed language synthesis works automatically
let audio = cloner.synthesize("你好 world! 今天天气 is great!")?;
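To make the idea of automatic language detection concrete, the following is a minimal sketch of splitting mixed text into same-script runs before G2P. It is an illustration only, not the detector GPT-SoVITS uses: the names Script, classify, and split_by_script are hypothetical, and the rule (CJK Unified Ideographs count as Chinese, everything else as Other) is a deliberate simplification.

```rust
/// Coarse script class used to split mixed text into runs (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Chinese,
    Other,
}

/// Treat CJK Unified Ideographs (U+4E00..=U+9FFF) as Chinese, anything else as Other.
fn classify(c: char) -> Script {
    if ('\u{4E00}'..='\u{9FFF}').contains(&c) {
        Script::Chinese
    } else {
        Script::Other
    }
}

/// Split text into maximal runs of one script.
/// Whitespace and punctuation stay attached to the run that precedes them.
fn split_by_script(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let script = classify(c);
        let neutral = !c.is_alphanumeric();
        // Extend the current run for neutral characters or a matching script.
        let extend = match runs.last() {
            Some((current, _)) => neutral || *current == script,
            None => false,
        };
        if extend {
            if let Some((_, run)) = runs.last_mut() {
                run.push(c);
            }
        } else {
            runs.push((script, c.to_string()));
        }
    }
    runs
}
```

Each run could then be handed to a language-specific G2P front end. A production detector would also have to handle digits, other scripts, and ambiguous punctuation.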

High performance

All TTS engines leverage Metal GPU acceleration via MLX for maximum performance on Apple Silicon:

Engine	Real-time Factor	Quality	Voice Cloning
GPT-SoVITS	4x	High	Yes

Real-time factor here measures how fast audio is generated relative to its playback duration. 4x means a 2-second audio clip is generated in about 0.5 seconds.
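The definition above reduces to a single division. The helpers below are a small sketch of that arithmetic; the function names are illustrative and are not part of any OminiX-MLX crate.

```rust
/// Speed multiple as defined above: seconds of audio produced per second of compute.
fn speed_factor(audio_secs: f64, generation_secs: f64) -> f64 {
    audio_secs / generation_secs
}

/// Time needed to generate `audio_secs` of audio at a given speed multiple.
fn generation_time(audio_secs: f64, speed: f64) -> f64 {
    audio_secs / speed
}
```

With the numbers from the text, speed_factor(2.0, 0.5) evaluates to 4.0 and generation_time(2.0, 4.0) to 0.5.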

Performance metrics

Benchmarks on Apple M3 Max for GPT-SoVITS:

Reference processing

~50ms for CNHubert encoding and quantization

BERT embedding

~20ms for text encoding

T2S generation

~100ms for GPT decoding (variable length)

VITS synthesis

~50ms for audio waveform generation

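Total: ~220ms for 2 seconds of audio output. Taken at face value this sum is roughly 9x real-time (2000ms / 220ms), faster than the 4x figure quoted above, so treat the per-stage numbers as approximate. T2S decoding time in particular varies with output length.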
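As a quick check of the breakdown, the sketch below sums the four stage timings listed above and reports each stage's share of the total. The constant and function names are made up for illustration, and the values are the approximate figures from this page rather than fresh measurements.

```rust
/// Approximate per-stage latencies in milliseconds, copied from the list above.
const STAGES_MS: [(&str, u32); 4] = [
    ("Reference processing", 50),
    ("BERT embedding", 20),
    ("T2S generation", 100),
    ("VITS synthesis", 50),
];

/// Sum of all stage latencies in milliseconds.
fn total_ms(stages: &[(&str, u32)]) -> u32 {
    stages.iter().map(|(_, ms)| ms).sum()
}

/// Each stage's share of the total latency, as a percentage.
fn shares<'a>(stages: &[(&'a str, u32)]) -> Vec<(&'a str, f64)> {
    let total = total_ms(stages) as f64;
    stages
        .iter()
        .map(|&(name, ms)| (name, 100.0 * ms as f64 / total))
        .collect()
}
```

On these figures, T2S generation accounts for a little under half of the total, which makes it the natural first target for optimization.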

Shared infrastructure

All TTS engines use common audio processing components from mlx-rs-core:

Audio I/O: load_wav, save_wav, resample
Signal processing: compute_mel_spectrogram
Caching: KVCache, ConcatKeyValueCache for autoregressive generation
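To show why a key/value cache helps autoregressive generation, here is a toy concatenating cache in plain Rust. It is a conceptual sketch only: ToyKvCache is a hypothetical type that stores plain f32 rows, and it does not reproduce the KVCache or ConcatKeyValueCache APIs in mlx-rs-core, whose signatures are not shown on this page.

```rust
/// Toy key/value cache for one attention layer (hypothetical; not the mlx-rs-core type).
/// Keys and values are stored flattened, one `dim`-sized row per generated token.
struct ToyKvCache {
    dim: usize,
    keys: Vec<f32>,
    values: Vec<f32>,
}

impl ToyKvCache {
    fn new(dim: usize) -> Self {
        assert!(dim > 0, "row width must be non-zero");
        Self { dim, keys: Vec::new(), values: Vec::new() }
    }

    /// Append the key/value rows for one new token; returns the cached length in tokens.
    fn append(&mut self, key: &[f32], value: &[f32]) -> usize {
        assert_eq!(key.len(), self.dim);
        assert_eq!(value.len(), self.dim);
        self.keys.extend_from_slice(key);
        self.values.extend_from_slice(value);
        self.len()
    }

    /// Number of tokens currently cached.
    fn len(&self) -> usize {
        self.keys.len() / self.dim
    }

    /// Key row for token `i`.
    fn key(&self, i: usize) -> &[f32] {
        &self.keys[i * self.dim..(i + 1) * self.dim]
    }

    /// Value row for token `i`.
    fn value(&self, i: usize) -> &[f32] {
        &self.values[i * self.dim..(i + 1) * self.dim]
    }
}
```

At each decoding step only the newest token's key and value are computed and appended. Attention then reads all earlier rows from the cache instead of recomputing them.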

System requirements

Hardware requirements

Apple Silicon Mac (M1/M2/M3/M4)
macOS 13.0 or later
16GB RAM minimum (32GB recommended)
5GB disk space for models

Software requirements

Rust 1.75 or later
Python 3.10+ (for one-time model setup only)
No runtime Python dependencies

Model storage

Models are stored in ~/.dora/models/primespeech/ by default:

GPT-SoVITS: ~2GB
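For scripts that need to locate or pre-populate the model directory, the default path above can be reconstructed as follows. This is a sketch under assumptions: the function names are hypothetical, HOME is assumed to point at the user's home directory, and the page does not say whether the library supports overriding this location.

```rust
use std::path::{Path, PathBuf};

/// Default model directory under a given home directory, per the path quoted above.
fn default_model_dir(home: &Path) -> PathBuf {
    home.join(".dora").join("models").join("primespeech")
}

/// Resolve the default directory from the HOME environment variable, if it is set.
fn model_dir_from_env() -> Option<PathBuf> {
    std::env::var_os("HOME").map(|home| default_model_dir(Path::new(&home)))
}
```

Checking that the resolved directory exists before synthesis gives a clearer error than a failed model load.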

Next steps

GPT-SoVITS guide

Learn how to use voice cloning with GPT-SoVITS

API reference

Explore the TTS API documentation
