# GPT-SoVITS MLX
Pure Rust implementation of GPT-SoVITS with MLX acceleration for Apple Silicon. Enables few-shot voice cloning with just a few seconds of reference audio.
## Features
- Few-shot voice cloning: Clone any voice with just a few seconds of reference audio
- Mixed Chinese-English: Natural handling of mixed language text
- High performance: 4x realtime synthesis on Apple Silicon
- Pure Rust: No Python dependencies at runtime
## Performance

On Apple Silicon (M-series):
- Model loading: ~50ms
- Synthesis: ~4x realtime (generates 20s audio in 5s)
- Memory: ~2GB for all models
## Installation

## Quick start
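Assuming the crate is published under the name used in the code examples (`gpt_sovits_mlx`, i.e. `gpt-sovits-mlx` on crates.io; this is an assumption, not confirmed by the source), it would be added as a Cargo dependency:

```shell
cargo add gpt-sovits-mlx
```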
```rust
use gpt_sovits_mlx::VoiceCloner;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut cloner = VoiceCloner::with_defaults()?;
    cloner.set_reference_audio("/path/to/reference.wav")?;
    let audio = cloner.synthesize("你好,世界!")?;
    cloner.play(&audio)?;
    Ok(())
}
```
## VoiceCloner
Main API for voice cloning with GPT-SoVITS.
### VoiceCloner::new

Create a new voice cloner with the given configuration.

```rust
pub fn new(config: VoiceClonerConfig) -> Result<Self, Error>
```

**Parameters**

- `config` (`VoiceClonerConfig`, required): configuration with model paths and sampling parameters

**Returns** `Result<VoiceCloner, Error>`: a voice cloner instance with loaded models
### VoiceCloner::with_defaults

Create a voice cloner with the default configuration.

```rust
pub fn with_defaults() -> Result<Self, Error>
```

**Returns** `Result<VoiceCloner, Error>`: a voice cloner with default model paths from `~/.OminiX/models/gpt-sovits-mlx`
### VoiceCloner::set_reference_audio

Set reference audio for voice cloning (zero-shot mode).

```rust
pub fn set_reference_audio(&mut self, path: impl AsRef<Path>) -> Result<(), Error>
```

**Parameters**

- `path`: path to the reference audio file (WAV format)

**Returns** `Result<(), Error>`: success once the reference is loaded and its mel spectrogram computed

Zero-shot mode uses only the reference audio's mel spectrogram for voice style.
### VoiceCloner::set_reference_audio_with_text

Set reference audio with its transcript for few-shot mode.

```rust
pub fn set_reference_audio_with_text(
    &mut self,
    audio_path: impl AsRef<Path>,
    text: &str,
) -> Result<(), Error>
```

**Parameters**

- `audio_path`: path to the reference audio file
- `text`: transcript of the reference audio

**Returns** `Result<(), Error>`: success once the reference is loaded and HuBERT semantic codes are extracted

Few-shot mode extracts semantic tokens from the reference audio using HuBERT, which yields better voice cloning quality than zero-shot mode.
### VoiceCloner::set_reference_with_precomputed_codes

Set the reference using pre-computed prompt semantic codes.

```rust
pub fn set_reference_with_precomputed_codes(
    &mut self,
    audio_path: impl AsRef<Path>,
    text: &str,
    codes_path: impl AsRef<Path>,
) -> Result<(), Error>
```

**Parameters**

- `audio_path`: path to the reference audio file (used for the mel spectrogram)
- `text`: transcript of the reference audio
- `codes_path`: path to a binary file of little-endian `i32` codes, or a `.npy` file

**Returns** `Result<(), Error>`: success once the reference and codes are loaded

Use this when the Rust HuBERT produces poor results: you can extract prompt semantic codes with Python and load them here.
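A minimal sketch of the raw (non-`.npy`) layout this method expects for `codes_path`, assuming a flat stream of little-endian `i32` values with no header, as described above (`read_codes` is a hypothetical helper, not part of the crate's API):

```rust
use std::fs;
use std::io::Write;

// Decode a flat little-endian i32 stream: the raw (non-.npy) codes layout.
fn read_codes(path: &str) -> std::io::Result<Vec<i32>> {
    let bytes = fs::read(path)?;
    Ok(bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

fn main() -> std::io::Result<()> {
    // Write a tiny example file, then round-trip it.
    let codes = [17i32, 423, 9, -1];
    let mut f = fs::File::create("/tmp/prompt_codes.bin")?;
    for c in &codes {
        f.write_all(&c.to_le_bytes())?;
    }
    assert_eq!(read_codes("/tmp/prompt_codes.bin")?, codes);
    println!("decoded {} codes", codes.len());
    Ok(())
}
```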
### VoiceCloner::synthesize

Synthesize speech from text.

```rust
pub fn synthesize(&mut self, text: &str) -> Result<AudioOutput, Error>
```

**Parameters**

- `text`: text to synthesize (up to 10,000 characters)

**Returns** `Result<AudioOutput, Error>`: generated audio with samples, sample rate, duration, and token count

Text is automatically split at punctuation marks, and each segment is processed separately for better quality. Long segments are further chunked to prevent T2S attention degradation.
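The punctuation-based splitting described above can be sketched as follows. This is an illustration only: the crate's actual segmentation rules (its boundary character set and chunking thresholds) are internal and may differ.

```rust
// Split text into segments at sentence-ending punctuation (CJK and Latin).
// Illustrative only: the library's internal splitting rules may differ.
fn split_at_punctuation(text: &str) -> Vec<String> {
    let boundaries = ['。', '!', '?', ';', '.', '!', '?', ';'];
    let mut segments = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if boundaries.contains(&ch) {
            segments.push(current.trim().to_string());
            current.clear();
        }
    }
    if !current.trim().is_empty() {
        segments.push(current.trim().to_string());
    }
    segments
}

fn main() {
    let segs = split_at_punctuation("你好,世界!This is a test. 再见。");
    assert_eq!(segs, vec!["你好,世界!", "This is a test.", "再见。"]);
    println!("{:?}", segs);
}
```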
### VoiceCloner::synthesize_with_options

Synthesize speech with timeout and cancellation support.

```rust
pub fn synthesize_with_options(
    &mut self,
    text: &str,
    options: SynthesisOptions,
) -> Result<AudioOutput, Error>
```

**Parameters**

- `text`: text to synthesize
- `options` (`SynthesisOptions`): synthesis options (timeout, cancellation token, speed override)

**Returns** `Result<AudioOutput, Error>`: generated audio, or an error if cancelled or timed out
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

let cancel = Arc::new(AtomicBool::new(false));
let options = SynthesisOptions {
    timeout: Some(Duration::from_secs(30)),
    cancel_token: Some(cancel.clone()),
    ..Default::default()
};

// From another thread: cancel.store(true, Ordering::Relaxed) to cancel.
let audio = cloner.synthesize_with_options("Hello", options)?;
```
### VoiceCloner::synthesize_from_tokens

Synthesize audio from external semantic tokens.

```rust
pub fn synthesize_from_tokens(
    &mut self,
    text: &str,
    tokens: &[i32],
) -> Result<AudioOutput, Error>
```

**Parameters**

- `text`: input text
- `tokens`: pre-computed semantic tokens

**Returns** `Result<AudioOutput, Error>`: generated audio

This bypasses token generation and vocodes the provided tokens directly. Useful for comparing the Rust VITS output against Python's semantic tokens.
### VoiceCloner::few_shot_available

Check if few-shot mode is available.

```rust
pub fn few_shot_available(&self) -> bool
```

**Returns** `true` if the HuBERT model is loaded
### VoiceCloner::is_few_shot_mode

Check if the cloner is currently in few-shot mode.

```rust
pub fn is_few_shot_mode(&self) -> bool
```

**Returns** `true` if prompt semantic codes and reference text are set
## Types
### VoiceClonerConfig

Configuration for the voice cloner.

```rust
pub struct VoiceClonerConfig {
    pub t2s_weights: String,
    pub bert_weights: String,
    pub bert_tokenizer: String,
    pub vits_weights: String,
    pub vits_pretrained_base: Option<String>,
    pub hubert_weights: String,
    pub sample_rate: u32,        // default: 32000
    pub top_k: i32,              // default: 15
    pub top_p: f32,              // default: 1.0
    pub temperature: f32,        // default: 1.0
    pub repetition_penalty: f32, // default: 1.2
    pub noise_scale: f32,        // default: 0.5
    pub speed: f32,              // default: 1.0
    pub vits_onnx_path: Option<String>,
    pub use_mlx_vits: bool,      // default: false
    pub use_gpu_mel: bool,       // default: true
}
```
**Default model directory:** uses `$GPT_SOVITS_MODEL_DIR` if set, otherwise `~/.OminiX/models/gpt-sovits-mlx`.
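The directory resolution described above can be sketched like this (`resolve_model_dir` is a hypothetical helper for illustration, not part of the crate's public API):

```rust
use std::env;
use std::path::PathBuf;

// Resolve the model directory: $GPT_SOVITS_MODEL_DIR wins if set,
// otherwise fall back to ~/.OminiX/models/gpt-sovits-mlx.
// Hypothetical helper, not part of the crate's public API.
fn resolve_model_dir() -> PathBuf {
    if let Ok(dir) = env::var("GPT_SOVITS_MODEL_DIR") {
        return PathBuf::from(dir);
    }
    let home = env::var("HOME").unwrap_or_else(|_| ".".to_string());
    PathBuf::from(home).join(".OminiX/models/gpt-sovits-mlx")
}

fn main() {
    println!("model dir: {}", resolve_model_dir().display());
}
```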
### AudioOutput

Generated audio output.

```rust
pub struct AudioOutput {
    pub samples: Vec<f32>, // raw audio samples in [-1.0, 1.0]
    pub sample_rate: u32,  // sample rate (32000)
    pub duration: f32,     // duration in seconds
    pub num_tokens: usize, // number of semantic tokens
}
```
### AudioOutput::duration_secs

Get the duration in seconds.

```rust
pub fn duration_secs(&self) -> f32
```

**Returns** the duration calculated from `samples.len() / sample_rate`
### AudioOutput::to_i16_samples

Convert to `i16` samples for WAV output.

```rust
pub fn to_i16_samples(&self) -> Vec<i16>
```

**Returns** samples converted to 16-bit PCM, with input clamped to `[-1.0, 1.0]`
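The conversion is a standard clamp-and-scale. A sketch of the equivalent logic (illustrative, not necessarily the crate's exact rounding behavior):

```rust
// Convert normalized f32 samples to 16-bit PCM,
// clamping to [-1.0, 1.0] before scaling.
fn to_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|&s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}

fn main() {
    let pcm = to_i16(&[0.0, 1.0, -1.0, 2.0]);
    assert_eq!(pcm, vec![0, 32767, -32767, 32767]);
    println!("{:?}", pcm);
}
```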
### AudioOutput::apply_fade_in

Apply a fade-in to reduce initial noise artifacts.

```rust
pub fn apply_fade_in(&mut self, fade_ms: f32)
```

**Parameters**

- `fade_ms`: fade-in duration in milliseconds (default: 50 ms)
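A linear fade-in over the first `fade_ms` milliseconds might look like this (a sketch; the crate does not specify its ramp shape):

```rust
// Linearly ramp the first fade_ms milliseconds of audio from silence
// up to full volume; later samples are untouched.
fn fade_in(samples: &mut [f32], sample_rate: u32, fade_ms: f32) {
    let n = ((fade_ms / 1000.0) * sample_rate as f32) as usize;
    let n = n.min(samples.len());
    for i in 0..n {
        samples[i] *= i as f32 / n as f32;
    }
}

fn main() {
    let mut audio = vec![1.0f32; 8];
    fade_in(&mut audio, 1000, 4.0); // 4 ms at 1 kHz = 4 samples
    assert_eq!(&audio[..4], &[0.0, 0.25, 0.5, 0.75]);
    assert_eq!(&audio[4..], &[1.0, 1.0, 1.0, 1.0]);
    println!("{:?}", audio);
}
```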
### AudioOutput::trim_start

Trim audio from the start to remove initial artifacts.

```rust
pub fn trim_start(&mut self, trim_ms: f32)
```

**Parameters**

- `trim_ms`: duration to trim in milliseconds
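Trimming amounts to dropping the first `trim_ms` milliseconds of samples; an equivalent sketch (illustrative only):

```rust
// Drop the first trim_ms milliseconds of audio from the sample buffer.
fn trim_start(samples: &mut Vec<f32>, sample_rate: u32, trim_ms: f32) {
    let n = ((trim_ms / 1000.0) * sample_rate as f32) as usize;
    samples.drain(..n.min(samples.len()));
}

fn main() {
    let mut audio: Vec<f32> = (0..6).map(|i| i as f32).collect();
    trim_start(&mut audio, 1000, 2.0); // 2 ms at 1 kHz = 2 samples
    assert_eq!(audio, vec![2.0, 3.0, 4.0, 5.0]);
    println!("{:?}", audio);
}
```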
### SynthesisOptions

Options for synthesis with timeout and cancellation support.

```rust
pub struct SynthesisOptions {
    pub timeout: Option<Duration>,
    pub cancel_token: Option<Arc<AtomicBool>>,
    pub max_tokens_per_chunk: Option<usize>,
    pub speed_override: Option<f32>,
}
```
### SynthesisOptions::with_timeout

Create options with a timeout.

```rust
pub fn with_timeout(timeout: Duration) -> Self
```

**Parameters**

- `timeout`: maximum time allowed for synthesis
### SynthesisOptions::with_cancel_token

Create options with a cancellation token.

```rust
pub fn with_cancel_token(token: Arc<AtomicBool>) -> Self
```

**Parameters**

- `token`: cancellation token; set it to `true` to cancel synthesis

**Returns** options with the cancel token set
## Text preprocessing
### preprocess_text

Preprocess text to phonemes.

```rust
pub fn preprocess_text(text: &str) -> (Array, Vec<String>, Vec<i32>, String)
```

**Parameters**

- `text`: input text in Chinese or English

**Returns** `(Array, Vec<String>, Vec<i32>, String)`: a tuple of `(phoneme_ids, phonemes, word2ph, normalized_text)`
### Language

Supported languages.

```rust
pub enum Language {
    Chinese,
    English,
}
```
## Model files

Required files in the model directory:

- T2S weights: `doubao_mixed_gpt_new.safetensors`
- BERT weights: `bert.safetensors`
- BERT tokenizer: `chinese-roberta-tokenizer/tokenizer.json`
- VITS weights: `doubao_mixed_sovits_new.safetensors`
- HuBERT weights: `hubert.safetensors` (for few-shot mode)
- ONNX VITS (optional): `vits.onnx` (recommended for best quality)
## Environment variables

- `GPT_SOVITS_MODEL_DIR`: override the default model directory
## Modes

### Zero-shot mode

Uses only the reference audio mel spectrogram for voice style:

```rust
cloner.set_reference_audio("/path/to/reference.wav")?;
let audio = cloner.synthesize("你好,世界!")?;
```
### Few-shot mode

Uses reference audio plus a transcript for stronger conditioning via HuBERT:

```rust
cloner.set_reference_audio_with_text(
    "/path/to/reference.wav",
    "这是参考音频的文本内容"
)?;
let audio = cloner.synthesize("你好,世界!")?;
```

Few-shot mode provides better quality but requires the reference audio transcript.