Overview

ChatterboxVC enables high-quality voice conversion, transforming the voice characteristics of input audio while preserving the linguistic content. This allows you to change the speaker identity of existing recordings.

Class Signature

class ChatterboxVC:
    def __init__(
        self,
        s3gen: S3Gen,
        device: str,
        ref_dict: dict = None,
    )

Parameters

s3gen (S3Gen, required)
The S3Gen vocoder model instance used for audio conversion.

device (str, required)
Device to run inference on ("cuda", "cpu", or "mps").

ref_dict (dict, optional)
Pre-computed reference voice embedding dictionary.

Class Methods

from_pretrained()

Load the pre-trained ChatterboxVC model from Hugging Face.
@classmethod
def from_pretrained(cls, device: str) -> 'ChatterboxVC'

Parameters

device (str, required)
Device to load the model on ("cuda", "cpu", or "mps"). Automatically falls back to "cpu" if MPS is requested but unavailable.

Returns

model (ChatterboxVC)
Initialized ChatterboxVC model with pre-trained weights from ResembleAI/chatterbox.

Example

from chatterbox import ChatterboxVC
import torch

# Load on GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
vc_model = ChatterboxVC.from_pretrained(device)
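The fallback behaviour described above can also be approximated in user code before calling from_pretrained(). A minimal sketch in plain Python; the availability checks are passed in as flags so the function stays framework-agnostic, and the helper name pick_device is our own, not part of the chatterbox API:

```python
def pick_device(requested: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Resolve a requested device string, falling back to "cpu" when the
    requested backend is unavailable (mirrors the documented MPS fallback)."""
    if requested == "cuda" and cuda_ok:
        return "cuda"
    if requested == "mps" and mps_ok:
        return "mps"
    return "cpu"

# With torch installed, pass torch.cuda.is_available() and
# torch.backends.mps.is_available() for the two flags.
print(pick_device("mps", cuda_ok=False, mps_ok=False))  # prints "cpu"
```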

from_local()

Load the model from a local checkpoint directory.
@classmethod
def from_local(cls, ckpt_dir: str, device: str) -> 'ChatterboxVC'

Parameters

ckpt_dir (str, required)
Path to the directory containing the model checkpoint files.

device (str, required)
Device to load the model on ("cuda", "cpu", or "mps").

Returns

model (ChatterboxVC)
Initialized ChatterboxVC model with weights loaded from the local directory.
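Before calling from_local(), it can be useful to sanity-check that the checkpoint directory exists and contains weight files, since a bad path otherwise surfaces as a load error. A minimal sketch; the .pt and .safetensors extensions are an assumption about the checkpoint layout, not a guarantee of what from_local() requires:

```python
from pathlib import Path

def looks_like_checkpoint_dir(ckpt_dir: str) -> bool:
    """Heuristic pre-check: the directory exists and holds at least one
    weight file. The extensions checked here are an assumption."""
    path = Path(ckpt_dir)
    if not path.is_dir():
        return False
    return any(path.glob("*.pt")) or any(path.glob("*.safetensors"))

# Example guard around the real load:
# if looks_like_checkpoint_dir("checkpoints/chatterbox"):
#     vc_model = ChatterboxVC.from_local("checkpoints/chatterbox", device)
```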

Instance Methods

set_target_voice()

Set the target voice for conversion from an audio file.
def set_target_voice(self, wav_fpath: str)

Parameters

wav_fpath (str, required)
Path to an audio file containing the target voice to convert to.

Example

# Set the target voice
vc_model.set_target_voice("target_speaker.wav")

generate()

Convert the voice in the input audio to the target voice.
def generate(
    self,
    audio: str,
    target_voice_path: str = None,
) -> torch.Tensor

Parameters

audio (str, required)
Path to the source audio file to convert.

target_voice_path (str, optional)
Path to a target voice audio file. If provided, it overrides any previously set target voice.

Returns

audio (torch.Tensor)
Converted audio waveform as a PyTorch tensor with shape [1, samples]. The sample rate is 44100 Hz (accessible via vc_model.sr), and the output includes perceptual watermarking.

Example

import torch
import torchaudio
from chatterbox import ChatterboxVC

device = "cuda" if torch.cuda.is_available() else "cpu"
vc_model = ChatterboxVC.from_pretrained(device)

# Method 1: Set target voice, then convert
vc_model.set_target_voice("target_speaker.wav")
converted_audio = vc_model.generate("source_audio.wav")
torchaudio.save("converted_output.wav", converted_audio, vc_model.sr)

# Method 2: Convert with target voice in one call
converted_audio = vc_model.generate(
    audio="source_audio.wav",
    target_voice_path="target_speaker.wav"
)
torchaudio.save("converted_output.wav", converted_audio, vc_model.sr)
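A common pattern is converting every recording in a directory against one target voice: set the voice once, then loop. A sketch of that flow with the conversion call injected as a callable so the structure is testable without the model; in real code the callable would wrap vc_model.generate() plus torchaudio.save():

```python
from pathlib import Path

def convert_directory(src_dir, out_dir, convert):
    """Call `convert(path)` for every .wav under src_dir and return the
    output paths that would be written under out_dir."""
    out = Path(out_dir)
    written = []
    for wav in sorted(Path(src_dir).glob("*.wav")):
        # Real code: audio = vc_model.generate(str(wav))
        #            torchaudio.save(str(out / wav.name), audio, vc_model.sr)
        convert(str(wav))
        written.append(out / wav.name)
    return written
```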

Attributes

sr (int)
Sample rate of generated audio (44100 Hz).

device (str)
Device the model is running on.

ref_dict (dict)
Current target voice embeddings used for conversion.
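The sr attribute pairs with the returned tensor's [1, samples] shape to give clip duration. A small sketch of the arithmetic; the sample counts below are illustrative:

```python
def clip_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono clip given its sample count and sample rate,
    e.g. num_samples = converted_audio.shape[1], sample_rate = vc_model.sr."""
    return num_samples / sample_rate

# A [1, 88200] tensor at 44100 Hz is two seconds of audio:
print(clip_duration_seconds(88_200, 44_100))  # prints 2.0
```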

Notes

  • Voice conversion preserves the linguistic content and prosody while changing voice characteristics
  • The model internally tokenizes the source audio at 16kHz before conversion
  • Generated audio is automatically watermarked using the Perth implicit watermarker
  • Both source and target audio are automatically resampled to the correct sample rates
  • You must either call set_target_voice() first or provide target_voice_path to generate()
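The automatic resampling mentioned above can be illustrated with a naive linear-interpolation resampler. This is a sketch only; the library itself uses proper DSP resampling rather than this simple interpolation:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive resampler: maps each output index back to a fractional
    position in the source and linearly interpolates its neighbours.
    Illustrative only; real pipelines use band-limited resampling."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Downsampling 8 samples from 32 kHz to 16 kHz yields 4 samples:
print(len(resample_linear([0.0] * 8, 32_000, 16_000)))  # prints 4
```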
