
GPT-SoVITS MLX

Pure Rust implementation of GPT-SoVITS with MLX acceleration for Apple Silicon. Enables few-shot voice cloning with just a few seconds of reference audio.

Features

  • Few-shot voice cloning: Clone any voice with just a few seconds of reference audio
  • Mixed Chinese-English: Natural handling of mixed language text
  • High performance: 4x realtime synthesis on Apple Silicon
  • Pure Rust: No Python dependencies at runtime

Performance

On Apple Silicon (M-series):
  • Model loading: ~50ms
  • Synthesis: ~4x realtime (generates 20s audio in 5s)
  • Memory: ~2GB for all models

Installation

cargo add gpt-sovits-mlx

Quick start

use gpt_sovits_mlx::VoiceCloner;

let mut cloner = VoiceCloner::with_defaults()?;
cloner.set_reference_audio("/path/to/reference.wav")?;
let audio = cloner.synthesize("你好,世界!")?;
cloner.play(&audio)?;

VoiceCloner

Main API for voice cloning with GPT-SoVITS.

VoiceCloner::new

Create a new voice cloner with configuration.
pub fn new(config: VoiceClonerConfig) -> Result<Self, Error>

Parameters:
  • config (VoiceClonerConfig, required): configuration with model paths and sampling parameters

Returns Result<VoiceCloner, Error>: a voice cloner instance with loaded models.

VoiceCloner::with_defaults

Create with default configuration.
pub fn with_defaults() -> Result<Self, Error>

Returns Result<VoiceCloner, Error>: a voice cloner using the default model paths under ~/.OminiX/models/gpt-sovits-mlx.

VoiceCloner::set_reference_audio

Set reference audio for voice cloning (zero-shot mode).
pub fn set_reference_audio(&mut self, path: impl AsRef<Path>) -> Result<(), Error>

Parameters:
  • path (impl AsRef<Path>, required): path to the reference audio file (WAV format)

Returns Result<(), Error>: Ok(()) once the reference is loaded and its mel spectrogram computed.

Zero-shot mode uses only the reference audio's mel spectrogram for voice style.

VoiceCloner::set_reference_audio_with_text

Set reference audio with transcript for few-shot mode.
pub fn set_reference_audio_with_text(
    &mut self,
    audio_path: impl AsRef<Path>,
    text: &str,
) -> Result<(), Error>

Parameters:
  • audio_path (impl AsRef<Path>, required): path to the reference audio file
  • text (&str, required): transcript of the reference audio

Returns Result<(), Error>: Ok(()) once the reference is loaded and HuBERT semantic codes are extracted.

Few-shot mode extracts semantic tokens from the reference audio using HuBERT, which yields better voice cloning quality than zero-shot mode.

VoiceCloner::set_reference_with_precomputed_codes

Set reference using pre-computed prompt semantic codes.
pub fn set_reference_with_precomputed_codes(
    &mut self,
    audio_path: impl AsRef<Path>,
    text: &str,
    codes_path: impl AsRef<Path>,
) -> Result<(), Error>

Parameters:
  • audio_path (impl AsRef<Path>, required): path to the reference audio file (used for the mel spectrogram)
  • text (&str, required): transcript of the reference audio
  • codes_path (impl AsRef<Path>, required): path to a binary file of little-endian i32 codes, or a .npy file

Returns Result<(), Error>: Ok(()) once the reference and codes are loaded.

Use this when the Rust HuBERT produces poor results: extract the prompt semantic codes with Python and load them here.
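For reference, the raw binary format described above (consecutive little-endian i32 values) can be read with plain std as sketched below. `read_codes` is a hypothetical helper, not part of the crate's API, and the .npy branch is omitted:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read prompt semantic codes stored as raw little-endian i32 values.
/// Mirrors the binary format `set_reference_with_precomputed_codes`
/// expects (the .npy variant is not handled here).
fn read_codes(path: impl AsRef<Path>) -> io::Result<Vec<i32>> {
    let bytes = fs::read(path)?;
    if bytes.len() % 4 != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "code file length is not a multiple of 4 bytes",
        ));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

fn main() -> io::Result<()> {
    // Round-trip a small code sequence through a temp file.
    let codes = [12_i32, 407, 88, 1023];
    let path = std::env::temp_dir().join("prompt_codes.bin");
    let bytes: Vec<u8> = codes.iter().flat_map(|c| c.to_le_bytes()).collect();
    fs::write(&path, &bytes)?;
    let loaded = read_codes(&path)?;
    assert_eq!(loaded, codes);
    println!("loaded {} codes", loaded.len());
    Ok(())
}
```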

VoiceCloner::synthesize

Synthesize speech from text.
pub fn synthesize(&mut self, text: &str) -> Result<AudioOutput, Error>

Parameters:
  • text (&str, required): text to synthesize (up to 10,000 characters)

Returns Result<AudioOutput, Error>: generated audio with samples, sample rate, duration, and token count.

Text is automatically split at punctuation marks and each segment is processed separately for better quality. Long segments are further chunked to prevent T2S attention degradation.
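As a rough illustration of the splitting step, a minimal punctuation-based splitter might look like the sketch below. This is an assumption about the behavior, not the crate's actual implementation, which likely handles more punctuation and edge cases:

```rust
/// Split text into segments at sentence-final punctuation
/// (both ASCII and fullwidth CJK marks are treated as terminators).
fn split_at_punctuation(text: &str) -> Vec<String> {
    let terminators = ['。', '!', '?', ';', '.', '!', '?', ';'];
    let mut segments = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if terminators.contains(&ch) {
            let seg = current.trim();
            if !seg.is_empty() {
                segments.push(seg.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text without a terminator as its own segment.
    let rest = current.trim();
    if !rest.is_empty() {
        segments.push(rest.to_string());
    }
    segments
}

fn main() {
    let segs = split_at_punctuation("你好!This is a test. 再见。");
    assert_eq!(segs, vec!["你好!", "This is a test.", "再见。"]);
    println!("{} segments", segs.len());
}
```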

VoiceCloner::synthesize_with_options

Synthesize speech with timeout and cancellation support.
pub fn synthesize_with_options(
    &mut self,
    text: &str,
    options: SynthesisOptions,
) -> Result<AudioOutput, Error>

Parameters:
  • text (&str, required): text to synthesize
  • options (SynthesisOptions, required): synthesis options (timeout, cancellation token, speed override)

Returns Result<AudioOutput, Error>: generated audio, or an error if cancelled or timed out.
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

let cancel = Arc::new(AtomicBool::new(false));
let options = SynthesisOptions {
    timeout: Some(Duration::from_secs(30)),
    cancel_token: Some(cancel.clone()),
    ..Default::default()
};

// Call cancel.store(true, Ordering::Relaxed) to cancel
let audio = cloner.synthesize_with_options("Hello", options)?;

VoiceCloner::synthesize_from_tokens

Synthesize audio from external semantic tokens.
pub fn synthesize_from_tokens(
    &mut self,
    text: &str,
    tokens: &[i32],
) -> Result<AudioOutput, Error>

Parameters:
  • text (&str, required): text used to derive phoneme IDs
  • tokens (&[i32], required): pre-computed semantic tokens

Returns Result<AudioOutput, Error>: generated audio.

This bypasses token generation and directly vocodes the provided tokens. Useful for comparing the Rust VITS against Python-generated semantic tokens.

VoiceCloner::few_shot_available

Check if few-shot mode is available.
pub fn few_shot_available(&self) -> bool

Returns bool: true if the HuBERT model is loaded.

VoiceCloner::is_few_shot_mode

Check if currently in few-shot mode.
pub fn is_few_shot_mode(&self) -> bool

Returns bool: true if prompt semantic codes and reference text are set.

Types

VoiceClonerConfig

Configuration for voice cloner.
pub struct VoiceClonerConfig {
    pub t2s_weights: String,
    pub bert_weights: String,
    pub bert_tokenizer: String,
    pub vits_weights: String,
    pub vits_pretrained_base: Option<String>,
    pub hubert_weights: String,
    pub sample_rate: u32,        // 32000
    pub top_k: i32,              // 15
    pub top_p: f32,              // 1.0
    pub temperature: f32,        // 1.0
    pub repetition_penalty: f32, // 1.2
    pub noise_scale: f32,        // 0.5
    pub speed: f32,              // 1.0
    pub vits_onnx_path: Option<String>,
    pub use_mlx_vits: bool,      // false
    pub use_gpu_mel: bool,       // true
}
Default model directory: Uses $GPT_SOVITS_MODEL_DIR if set, otherwise ~/.OminiX/models/gpt-sovits-mlx.
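The default-directory lookup can be sketched as follows. `resolve_model_dir` is a hypothetical helper (the crate's actual home-directory handling may differ); the caller would pass in `env::var("GPT_SOVITS_MODEL_DIR").ok()` and the user's home path:

```rust
use std::path::PathBuf;

/// Resolve the model directory: an explicit $GPT_SOVITS_MODEL_DIR override
/// wins, otherwise fall back to ~/.OminiX/models/gpt-sovits-mlx.
fn resolve_model_dir(override_dir: Option<String>, home: &str) -> PathBuf {
    match override_dir {
        Some(dir) => PathBuf::from(dir),
        None => PathBuf::from(home).join(".OminiX/models/gpt-sovits-mlx"),
    }
}

fn main() {
    // In practice: resolve_model_dir(std::env::var("GPT_SOVITS_MODEL_DIR").ok(), home)
    let fallback = resolve_model_dir(None, "/Users/me");
    assert_eq!(fallback, PathBuf::from("/Users/me/.OminiX/models/gpt-sovits-mlx"));

    let overridden = resolve_model_dir(Some("/tmp/models".into()), "/Users/me");
    assert_eq!(overridden, PathBuf::from("/tmp/models"));
}
```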

AudioOutput

Generated audio output.
pub struct AudioOutput {
    pub samples: Vec<f32>,      // Raw audio samples (-1.0 to 1.0)
    pub sample_rate: u32,       // Sample rate (32000)
    pub duration: f32,          // Duration in seconds
    pub num_tokens: usize,      // Number of semantic tokens
}

AudioOutput::duration_secs

Get duration in seconds.
pub fn duration_secs(&self) -> f32

Returns f32: duration computed as samples.len() / sample_rate.

AudioOutput::to_i16_samples

Convert to i16 samples for WAV output.
pub fn to_i16_samples(&self) -> Vec<i16>

Returns Vec<i16>: samples clamped to [-1.0, 1.0] and converted to 16-bit PCM.
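The conversion is equivalent to clamping each sample and scaling by i16::MAX, roughly as in the sketch below (the crate may round or scale slightly differently):

```rust
/// Standalone version of the f32-to-i16 PCM conversion described above:
/// clamp each sample to [-1.0, 1.0], then scale to the i16 range.
fn to_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|&s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}

fn main() {
    // Out-of-range samples (2.0) are clamped before scaling.
    let pcm = to_i16(&[0.0, 1.0, -1.0, 2.0]);
    assert_eq!(pcm, vec![0, i16::MAX, -i16::MAX, i16::MAX]);
}
```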

AudioOutput::apply_fade_in

Apply fade-in to reduce initial noise artifacts.
pub fn apply_fade_in(&mut self, fade_ms: f32)

Parameters:
  • fade_ms (f32, required): fade-in duration in milliseconds (default: 50ms)

AudioOutput::trim_start

Trim audio from the start to remove initial artifacts.
pub fn trim_start(&mut self, trim_ms: f32)

Parameters:
  • trim_ms (f32, required): duration to trim in milliseconds
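Both post-processing steps can be sketched in plain Rust, assuming a linear fade ramp (the crate's actual curve is not documented here). The `apply_fade_in` and `trim_start` functions below are standalone versions that take the sample rate explicitly:

```rust
/// Linear fade-in over the first `fade_ms` milliseconds.
fn apply_fade_in(samples: &mut [f32], sample_rate: u32, fade_ms: f32) {
    let n = (((fade_ms / 1000.0) * sample_rate as f32) as usize).min(samples.len());
    for i in 0..n {
        // Ramp gain from 0.0 up to (n-1)/n across the fade window.
        samples[i] *= i as f32 / n as f32;
    }
}

/// Drop the first `trim_ms` milliseconds of audio.
fn trim_start(samples: &mut Vec<f32>, sample_rate: u32, trim_ms: f32) {
    let n = (((trim_ms / 1000.0) * sample_rate as f32) as usize).min(samples.len());
    samples.drain(..n);
}

fn main() {
    let mut audio = vec![1.0_f32; 8];
    apply_fade_in(&mut audio, 1000, 4.0); // 4 samples at a 1 kHz rate
    assert_eq!(&audio[..4], &[0.0, 0.25, 0.5, 0.75]);

    trim_start(&mut audio, 1000, 2.0); // drop the first 2 samples
    assert_eq!(audio.len(), 6);
}
```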

SynthesisOptions

Options for synthesis with timeout and cancellation support.
pub struct SynthesisOptions {
    pub timeout: Option<Duration>,
    pub cancel_token: Option<Arc<AtomicBool>>,
    pub max_tokens_per_chunk: Option<usize>,
    pub speed_override: Option<f32>,
}

SynthesisOptions::with_timeout

Create options with a timeout.
pub fn with_timeout(timeout: Duration) -> Self

Parameters:
  • timeout (Duration, required): maximum time allowed for synthesis

Returns SynthesisOptions: options with the timeout set.

SynthesisOptions::with_cancel_token

Create options with a cancellation token.
pub fn with_cancel_token(token: Arc<AtomicBool>) -> Self

Parameters:
  • token (Arc<AtomicBool>, required): cancellation token; set it to true to cancel synthesis

Returns SynthesisOptions: options with the cancel token set.

Text preprocessing

preprocess_text

Preprocess text to phonemes.
pub fn preprocess_text(text: &str) -> (Array, Vec<String>, Vec<i32>, String)

Parameters:
  • text (&str, required): input text in Chinese or English

Returns (Array, Vec<String>, Vec<i32>, String): a tuple of (phoneme_ids, phonemes, word2ph, normalized_text).

Language

Supported languages.
pub enum Language {
    Chinese,
    English,
}

Model files

Required files in model directory:
  1. T2S weights: doubao_mixed_gpt_new.safetensors
  2. BERT weights: bert.safetensors
  3. BERT tokenizer: chinese-roberta-tokenizer/tokenizer.json
  4. VITS weights: doubao_mixed_sovits_new.safetensors
  5. HuBERT weights: hubert.safetensors (for few-shot mode)
  6. ONNX VITS (optional): vits.onnx (recommended for best quality)

Environment variables

GPT_SOVITS_MODEL_DIR - Override default model directory
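For example, to point the library at a custom model directory before running (the path below is illustrative):

```shell
# Override the default ~/.OminiX/models/gpt-sovits-mlx location.
export GPT_SOVITS_MODEL_DIR="$HOME/models/gpt-sovits-mlx"
```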

Modes

Zero-shot mode

Uses only reference audio mel spectrogram for voice style:
cloner.set_reference_audio("/path/to/reference.wav")?;
let audio = cloner.synthesize("你好,世界!")?;

Few-shot mode

Uses reference audio + transcript for stronger conditioning via HuBERT:
cloner.set_reference_audio_with_text(
    "/path/to/reference.wav",
    "这是参考音频的文本内容"
)?;
let audio = cloner.synthesize("你好,世界!")?;
Few-shot mode provides better quality but requires the reference audio transcript.
