Overview
ChatterboxVC enables high-quality voice conversion, transforming the voice characteristics of input audio while preserving the linguistic content. This allows you to change the speaker identity of existing recordings.
Class Signature
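A hedged sketch of the constructor, reconstructed from the parameter list below; the parameter names `s3gen`, `device`, and `ref_dict` are inferred from the descriptions, not confirmed against the source:

```python
from typing import Optional


class ChatterboxVC:
    def __init__(
        self,
        s3gen,                          # S3Gen vocoder model instance (assumed name)
        device: str,                    # "cuda", "cpu", or "mps"
        ref_dict: Optional[dict] = None,  # optional pre-computed reference voice embeddings
    ):
        ...
```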
Parameters
The S3Gen vocoder model instance for audio conversion
Device to run inference on (“cuda”, “cpu”, or “mps”)
Optional pre-computed reference voice embeddings dictionary
Class Methods
from_pretrained()
Load the pre-trained ChatterboxVC model from Hugging Face.
Parameters
Device to load the model on (“cuda”, “cpu”, or “mps”). Automatically falls back to “cpu” if MPS is not available
Returns
Initialized ChatterboxVC model with pre-trained weights from ResembleAI/chatterbox
Example
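A minimal usage sketch, assuming the `chatterbox` package is installed and that `input.wav` and `target_voice.wav` are audio files you supply:

```python
import torchaudio as ta

from chatterbox.vc import ChatterboxVC

# Downloads the ResembleAI/chatterbox weights on first use
model = ChatterboxVC.from_pretrained(device="cuda")

# Convert input.wav so it sounds like the voice in target_voice.wav
wav = model.generate("input.wav", target_voice_path="target_voice.wav")
ta.save("converted.wav", wav, model.sr)
```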
from_local()
Load the model from a local checkpoint directory.
Parameters
Path to the directory containing model checkpoint files
Device to load the model on (“cuda”, “cpu”, or “mps”)
Returns
Initialized ChatterboxVC model with weights loaded from local directory
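Example

A hedged sketch; `checkpoints/` is a placeholder for a directory that already contains the model checkpoint files:

```python
from chatterbox.vc import ChatterboxVC

# Load from a local directory instead of downloading from Hugging Face
model = ChatterboxVC.from_local("checkpoints/", device="cpu")
```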
Instance Methods
set_target_voice()
Set the target voice for conversion from an audio file.
Parameters
Path to the audio file containing the target voice to convert to
Example
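A hedged sketch showing the intended pattern: set the target voice once, then reuse it across multiple conversions without passing `target_voice_path` each time. File names are placeholders:

```python
import torchaudio as ta

from chatterbox.vc import ChatterboxVC

model = ChatterboxVC.from_pretrained(device="cuda")
model.set_target_voice("target_voice.wav")

# The stored target voice is reused for every generate() call
for src in ["clip_a.wav", "clip_b.wav"]:
    wav = model.generate(src)
    ta.save(src.replace(".wav", "_converted.wav"), wav, model.sr)
```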
generate()
Convert the voice in the input audio to the target voice.
Parameters
Path to the audio file to convert
Optional path to target voice audio file. If provided, will override the existing target voice
Returns
Converted audio waveform as a PyTorch tensor with shape [1, samples]. The sample rate is 44100 Hz (accessible via vc_model.sr), and the audio includes a perceptual watermark.
Example
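A hedged sketch of a one-shot conversion, passing the target voice per call; file names are placeholders:

```python
import torchaudio as ta

from chatterbox.vc import ChatterboxVC

model = ChatterboxVC.from_pretrained(device="cuda")

# target_voice_path overrides any voice previously set with set_target_voice()
wav = model.generate("source.wav", target_voice_path="target_voice.wav")

# wav has shape [1, samples]; save at the model's sample rate (44100 Hz)
ta.save("output.wav", wav, model.sr)
```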
Attributes
Sample rate of generated audio (44100 Hz)
Device the model is running on
Current target voice embeddings used for conversion
Notes
- Voice conversion preserves the linguistic content and prosody while changing voice characteristics
- The model internally tokenizes the source audio at 16kHz before conversion
- Generated audio is automatically watermarked using the Perth implicit watermarker
- Both source and target audio are automatically resampled to the correct sample rates
- You must either call set_target_voice() first or provide target_voice_path to generate()