# GPT-SoVITS MLX
Pure Rust implementation of GPT-SoVITS with MLX acceleration for Apple Silicon. Enables few-shot voice cloning with just a few seconds of reference audio.
## Features
- Few-shot voice cloning: Clone any voice with just a few seconds of reference audio
- Mixed Chinese-English: Natural handling of mixed language text
- High performance: 4x realtime synthesis on Apple Silicon
- Pure Rust: No Python dependencies at runtime
## Performance

On Apple Silicon (M-series):
- Model loading: ~50ms
- Synthesis: ~4x realtime (generates 20s audio in 5s)
- Memory: ~2GB for all models
## Installation

## Quick start
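Assuming the crate is published under the name used in the code examples (`gpt_sovits_mlx`, i.e. `gpt-sovits-mlx` on crates.io; this is an assumption, not confirmed by the source), it would be added as a Cargo dependency:

```shell
cargo add gpt-sovits-mlx
```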
```rust
use gpt_sovits_mlx::VoiceCloner;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut cloner = VoiceCloner::with_defaults()?;
    cloner.set_reference_audio("/path/to/reference.wav")?;
    let audio = cloner.synthesize("你好,世界!")?;
    cloner.play(&audio)?;
    Ok(())
}
```
## VoiceCloner
Main API for voice cloning with GPT-SoVITS.
### VoiceCloner::new

Create a new voice cloner with the given configuration.

```rust
pub fn new(config: VoiceClonerConfig) -> Result<Self, Error>
```

**Parameters**

- `config` (`VoiceClonerConfig`, required): configuration with model paths and sampling parameters

**Returns** `Result<VoiceCloner, Error>`: a voice cloner instance with loaded models
### VoiceCloner::with_defaults

Create a voice cloner with the default configuration.

```rust
pub fn with_defaults() -> Result<Self, Error>
```

**Returns** `Result<VoiceCloner, Error>`: a voice cloner with default model paths from `~/.OminiX/models/gpt-sovits-mlx`
### VoiceCloner::set_reference_audio

Set reference audio for voice cloning (zero-shot mode).

```rust
pub fn set_reference_audio(&mut self, path: impl AsRef<Path>) -> Result<(), Error>
```

**Parameters**

- `path`: path to the reference audio file (WAV format)

**Returns** `Result<(), Error>`: success once the reference is loaded and its mel spectrogram computed

Zero-shot mode uses only the reference audio's mel spectrogram for voice style.
### VoiceCloner::set_reference_audio_with_text

Set reference audio with its transcript for few-shot mode.

```rust
pub fn set_reference_audio_with_text(
    &mut self,
    audio_path: impl AsRef<Path>,
    text: &str,
) -> Result<(), Error>
```

**Parameters**

- `audio_path`: path to the reference audio file
- `text`: transcript of the reference audio

**Returns** `Result<(), Error>`: success once the reference is loaded and HuBERT semantic codes are extracted

Few-shot mode extracts semantic tokens from the reference audio using HuBERT, which yields better voice cloning quality than zero-shot mode.
### VoiceCloner::set_reference_with_precomputed_codes

Set the reference using pre-computed prompt semantic codes.

```rust
pub fn set_reference_with_precomputed_codes(
    &mut self,
    audio_path: impl AsRef<Path>,
    text: &str,
    codes_path: impl AsRef<Path>,
) -> Result<(), Error>
```

**Parameters**

- `audio_path`: path to the reference audio file (used for the mel spectrogram)
- `text`: transcript of the reference audio
- `codes_path`: path to a binary file of little-endian `i32` codes, or a `.npy` file

**Returns** `Result<(), Error>`: success once the reference and codes are loaded

Use this when the Rust HuBERT produces poor results: you can extract prompt semantic codes with Python and load them here.
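A minimal sketch of the raw (non-`.npy`) layout this method expects for `codes_path`, assuming a flat stream of little-endian `i32` values with no header, as described above (`read_codes` is a hypothetical helper, not part of the crate's API):

```rust
use std::fs;
use std::io::Write;

// Decode a flat little-endian i32 stream: the raw (non-.npy) codes layout.
fn read_codes(path: &str) -> std::io::Result<Vec<i32>> {
    let bytes = fs::read(path)?;
    Ok(bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

fn main() -> std::io::Result<()> {
    // Write a tiny example file, then round-trip it.
    let codes = [17i32, 423, 9, -1];
    let mut f = fs::File::create("/tmp/prompt_codes.bin")?;
    for c in &codes {
        f.write_all(&c.to_le_bytes())?;
    }
    assert_eq!(read_codes("/tmp/prompt_codes.bin")?, codes);
    println!("decoded {} codes", codes.len());
    Ok(())
}
```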
### VoiceCloner::synthesize

Synthesize speech from text.

```rust
pub fn synthesize(&mut self, text: &str) -> Result<AudioOutput, Error>
```

**Parameters**

- `text`: text to synthesize (up to 10,000 characters)

**Returns** `Result<AudioOutput, Error>`: generated audio with samples, sample rate, duration, and token count

Text is automatically split at punctuation marks, and each segment is processed separately for better quality. Long segments are further chunked to prevent T2S attention degradation.
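The punctuation-based splitting described above can be sketched as follows. This is an illustration only: the crate's actual segmentation rules (its boundary character set and chunking thresholds) are internal and may differ.

```rust
// Split text into segments at sentence-ending punctuation (CJK and Latin).
// Illustrative only: the library's internal splitting rules may differ.
fn split_at_punctuation(text: &str) -> Vec<String> {
    let boundaries = ['。', '!', '?', ';', '.', '!', '?', ';'];
    let mut segments = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if boundaries.contains(&ch) {
            segments.push(current.trim().to_string());
            current.clear();
        }
    }
    if !current.trim().is_empty() {
        segments.push(current.trim().to_string());
    }
    segments
}

fn main() {
    let segs = split_at_punctuation("你好,世界!This is a test. 再见。");
    assert_eq!(segs, vec!["你好,世界!", "This is a test.", "再见。"]);
    println!("{:?}", segs);
}
```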
### VoiceCloner::synthesize_with_options

Synthesize speech with timeout and cancellation support.

```rust
pub fn synthesize_with_options(
    &mut self,
    text: &str,
    options: SynthesisOptions,
) -> Result<AudioOutput, Error>
```

**Parameters**

- `text`: text to synthesize
- `options` (`SynthesisOptions`): synthesis options (timeout, cancellation token, speed override)

**Returns** `Result<AudioOutput, Error>`: generated audio, or an error if cancelled or timed out
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

let cancel = Arc::new(AtomicBool::new(false));
let options = SynthesisOptions {
    timeout: Some(Duration::from_secs(30)),
    cancel_token: Some(cancel.clone()),
    ..Default::default()
};

// From another thread: cancel.store(true, Ordering::Relaxed) to cancel.
let audio = cloner.synthesize_with_options("Hello", options)?;
```
### VoiceCloner::synthesize_from_tokens

Synthesize audio from external semantic tokens.

```rust
pub fn synthesize_from_tokens(
    &mut self,
    text: &str,
    tokens: &[i32],
) -> Result<AudioOutput, Error>
```

**Parameters**

- `text`: input text
- `tokens`: pre-computed semantic tokens

**Returns** `Result<AudioOutput, Error>`: generated audio

This bypasses token generation and vocodes the provided tokens directly. Useful for comparing the Rust VITS output against Python's semantic tokens.
### VoiceCloner::few_shot_available

Check if few-shot mode is available.

```rust
pub fn few_shot_available(&self) -> bool
```

**Returns** `true` if the HuBERT model is loaded
### VoiceCloner::is_few_shot_mode

Check if the cloner is currently in few-shot mode.

```rust
pub fn is_few_shot_mode(&self) -> bool
```

**Returns** `true` if prompt semantic codes and reference text are set
## Types
### VoiceClonerConfig

Configuration for the voice cloner.

```rust
pub struct VoiceClonerConfig {
    pub t2s_weights: String,
    pub bert_weights: String,
    pub bert_tokenizer: String,
    pub vits_weights: String,
    pub vits_pretrained_base: Option<String>,
    pub hubert_weights: String,
    pub sample_rate: u32,        // default: 32000
    pub top_k: i32,              // default: 15
    pub top_p: f32,              // default: 1.0
    pub temperature: f32,        // default: 1.0
    pub repetition_penalty: f32, // default: 1.2
    pub noise_scale: f32,        // default: 0.5
    pub speed: f32,              // default: 1.0
    pub vits_onnx_path: Option<String>,
    pub use_mlx_vits: bool,      // default: false
    pub use_gpu_mel: bool,       // default: true
}
```
**Default model directory:** uses `$GPT_SOVITS_MODEL_DIR` if set, otherwise `~/.OminiX/models/gpt-sovits-mlx`.
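The directory resolution described above can be sketched like this (`resolve_model_dir` is a hypothetical helper for illustration, not part of the crate's public API):

```rust
use std::env;
use std::path::PathBuf;

// Resolve the model directory: $GPT_SOVITS_MODEL_DIR wins if set,
// otherwise fall back to ~/.OminiX/models/gpt-sovits-mlx.
// Hypothetical helper, not part of the crate's public API.
fn resolve_model_dir() -> PathBuf {
    if let Ok(dir) = env::var("GPT_SOVITS_MODEL_DIR") {
        return PathBuf::from(dir);
    }
    let home = env::var("HOME").unwrap_or_else(|_| ".".to_string());
    PathBuf::from(home).join(".OminiX/models/gpt-sovits-mlx")
}

fn main() {
    println!("model dir: {}", resolve_model_dir().display());
}
```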
### AudioOutput

Generated audio output.

```rust
pub struct AudioOutput {
    pub samples: Vec<f32>, // raw audio samples in [-1.0, 1.0]
    pub sample_rate: u32,  // sample rate (32000)
    pub duration: f32,     // duration in seconds
    pub num_tokens: usize, // number of semantic tokens
}
```
### AudioOutput::duration_secs

Get the duration in seconds.

```rust
pub fn duration_secs(&self) -> f32
```

**Returns** the duration calculated from `samples.len() / sample_rate`
### AudioOutput::to_i16_samples

Convert to `i16` samples for WAV output.

```rust
pub fn to_i16_samples(&self) -> Vec<i16>
```

**Returns** samples converted to 16-bit PCM, with input clamped to `[-1.0, 1.0]`
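The conversion is a standard clamp-and-scale. A sketch of the equivalent logic (illustrative, not necessarily the crate's exact rounding behavior):

```rust
// Convert normalized f32 samples to 16-bit PCM,
// clamping to [-1.0, 1.0] before scaling.
fn to_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|&s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}

fn main() {
    let pcm = to_i16(&[0.0, 1.0, -1.0, 2.0]);
    assert_eq!(pcm, vec![0, 32767, -32767, 32767]);
    println!("{:?}", pcm);
}
```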
### AudioOutput::apply_fade_in

Apply a fade-in to reduce initial noise artifacts.

```rust
pub fn apply_fade_in(&mut self, fade_ms: f32)
```

**Parameters**

- `fade_ms`: fade-in duration in milliseconds (default: 50 ms)
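A linear fade-in over the first `fade_ms` milliseconds might look like this (a sketch; the crate does not specify its ramp shape):

```rust
// Linearly ramp the first fade_ms milliseconds of audio from silence
// up to full volume; later samples are untouched.
fn fade_in(samples: &mut [f32], sample_rate: u32, fade_ms: f32) {
    let n = ((fade_ms / 1000.0) * sample_rate as f32) as usize;
    let n = n.min(samples.len());
    for i in 0..n {
        samples[i] *= i as f32 / n as f32;
    }
}

fn main() {
    let mut audio = vec![1.0f32; 8];
    fade_in(&mut audio, 1000, 4.0); // 4 ms at 1 kHz = 4 samples
    assert_eq!(&audio[..4], &[0.0, 0.25, 0.5, 0.75]);
    assert_eq!(&audio[4..], &[1.0, 1.0, 1.0, 1.0]);
    println!("{:?}", audio);
}
```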
### AudioOutput::trim_start

Trim audio from the start to remove initial artifacts.

```rust
pub fn trim_start(&mut self, trim_ms: f32)
```

**Parameters**

- `trim_ms`: duration to trim in milliseconds
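Trimming amounts to dropping the first `trim_ms` milliseconds of samples; an equivalent sketch (illustrative only):

```rust
// Drop the first trim_ms milliseconds of audio from the sample buffer.
fn trim_start(samples: &mut Vec<f32>, sample_rate: u32, trim_ms: f32) {
    let n = ((trim_ms / 1000.0) * sample_rate as f32) as usize;
    samples.drain(..n.min(samples.len()));
}

fn main() {
    let mut audio: Vec<f32> = (0..6).map(|i| i as f32).collect();
    trim_start(&mut audio, 1000, 2.0); // 2 ms at 1 kHz = 2 samples
    assert_eq!(audio, vec![2.0, 3.0, 4.0, 5.0]);
    println!("{:?}", audio);
}
```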
### SynthesisOptions

Options for synthesis with timeout and cancellation support.

```rust
pub struct SynthesisOptions {
    pub timeout: Option<Duration>,
    pub cancel_token: Option<Arc<AtomicBool>>,
    pub max_tokens_per_chunk: Option<usize>,
    pub speed_override: Option<f32>,
}
```
### SynthesisOptions::with_timeout

Create options with a timeout.

```rust
pub fn with_timeout(timeout: Duration) -> Self
```

**Parameters**

- `timeout`: maximum time allowed for synthesis
### SynthesisOptions::with_cancel_token

Create options with a cancellation token.

```rust
pub fn with_cancel_token(token: Arc<AtomicBool>) -> Self
```

**Parameters**

- `token`: cancellation token; set it to `true` to cancel synthesis

**Returns** options with the cancel token set
## Text preprocessing
### preprocess_text

Preprocess text to phonemes.

```rust
pub fn preprocess_text(text: &str) -> (Array, Vec<String>, Vec<i32>, String)
```

**Parameters**

- `text`: input text in Chinese or English

**Returns** `(Array, Vec<String>, Vec<i32>, String)`: a tuple of `(phoneme_ids, phonemes, word2ph, normalized_text)`
### Language

Supported languages.

```rust
pub enum Language {
    Chinese,
    English,
}
```
## Model files

Required files in the model directory:

- T2S weights: `doubao_mixed_gpt_new.safetensors`
- BERT weights: `bert.safetensors`
- BERT tokenizer: `chinese-roberta-tokenizer/tokenizer.json`
- VITS weights: `doubao_mixed_sovits_new.safetensors`
- HuBERT weights: `hubert.safetensors` (for few-shot mode)
- ONNX VITS (optional): `vits.onnx` (recommended for best quality)
## Environment variables

- `GPT_SOVITS_MODEL_DIR`: override the default model directory
## Modes

### Zero-shot mode

Uses only the reference audio mel spectrogram for voice style:

```rust
cloner.set_reference_audio("/path/to/reference.wav")?;
let audio = cloner.synthesize("你好,世界!")?;
```
### Few-shot mode

Uses reference audio plus a transcript for stronger conditioning via HuBERT:

```rust
cloner.set_reference_audio_with_text(
    "/path/to/reference.wav",
    "这是参考音频的文本内容"
)?;
let audio = cloner.synthesize("你好,世界!")?;
```

Few-shot mode provides better quality but requires the reference audio transcript.