Overview

ChatterboxVC provides voice conversion capabilities, allowing you to transform the voice characteristics of any audio file to match a target speaker while preserving the original speech content and prosody. Unlike text-to-speech, voice conversion works directly with audio input.

Voice Transformation

Convert any speaker’s voice to match your target voice while keeping the original content.

Prosody Preservation

Maintains the original timing, rhythm, and intonation of the source audio.

Zero-Shot

No training required; just provide a target voice reference.

High Quality

24kHz output with natural voice transformation.

Voice Conversion vs TTS

Understand the key differences between voice conversion and text-to-speech:
Aspect      Voice Conversion (VC)     Text-to-Speech (TTS)
Input       Audio file                Text string
Output      Transformed audio         Generated speech
Content     Preserves original        Creates new content
Prosody     Keeps original timing     Generates new prosody
Use Case    Voice transformation      Speech synthesis
Voice conversion is ideal when you want to change who is speaking while keeping the exact timing, emotion, and delivery of the original performance.

Model Specifications

  • Input: Audio file (automatically resampled to 16kHz)
  • Output Sample Rate: 24,000 Hz
  • Architecture: S3Gen decoder with voice conditioning
  • Repository: ResembleAI/chatterbox
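
The two sample rates in the specification above determine how many samples flow through the pipeline. The sketch below is plain arithmetic for illustration, not a Chatterbox API call; the model performs the 16kHz resampling itself.

```python
# Sketch: how the 16 kHz input / 24 kHz output spec affects sample counts.
# INPUT_SR / OUTPUT_SR mirror the specification above; samples_at() is a
# hypothetical helper, not part of the Chatterbox API.

INPUT_SR = 16_000   # source audio is resampled to this before tokenization
OUTPUT_SR = 24_000  # sample rate of the converted audio (model.sr)

def samples_at(duration_s: float, sr: int) -> int:
    """Number of samples in a clip of the given duration at rate sr."""
    return int(duration_s * sr)

# A 3-second source clip:
src_samples = samples_at(3.0, INPUT_SR)    # 48,000 samples fed to the tokenizer
out_samples = samples_at(3.0, OUTPUT_SR)   # 72,000 samples in the output
print(src_samples, out_samples)
```

The duration is preserved end to end; only the sample rate (and hence the sample count) changes between input and output.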

Hardware Requirements

Minimum (CPU)

  • 4GB RAM
  • CPU inference supported
  • Slower conversion times

Recommended (GPU)

  • NVIDIA GPU with 4GB+ VRAM
  • CUDA support
  • Real-time conversion possible
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.

Usage

Basic Voice Conversion

import torch
import torchaudio as ta
from chatterbox.vc import ChatterboxVC

# Load the voice conversion model
model = ChatterboxVC.from_pretrained(device="cuda")

# Convert voice
source_audio = "original_speaker.wav"
target_voice = "desired_voice.wav"

wav = model.generate(
    audio=source_audio,
    target_voice_path=target_voice
)

ta.save("converted.wav", wav, model.sr)

Auto-detect Device

import torch
import torchaudio as ta
from chatterbox.vc import ChatterboxVC

# Automatically detect the best available device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")

model = ChatterboxVC.from_pretrained(device=device)

source_audio = "source.wav"
target_voice = "target.wav"

wav = model.generate(
    audio=source_audio,
    target_voice_path=target_voice
)

ta.save("converted.wav", wav, model.sr)

Batch Processing

import torch
import torchaudio as ta
from chatterbox.vc import ChatterboxVC
from pathlib import Path

model = ChatterboxVC.from_pretrained(device="cuda")

# Set target voice once
target_voice = "celebrity_voice.wav"
model.set_target_voice(target_voice)

# Process multiple files with the same target voice
source_files = [
    "speaker1.wav",
    "speaker2.wav",
    "speaker3.wav"
]

for i, source_file in enumerate(source_files):
    wav = model.generate(audio=source_file)
    ta.save(f"converted_{i}.wav", wav, model.sr)
When processing multiple files with the same target voice, use set_target_voice() once and then call generate() without the target_voice_path parameter for better performance.

Pre-setting Target Voice

import torchaudio as ta
from chatterbox.vc import ChatterboxVC

model = ChatterboxVC.from_pretrained(device="cuda")

# Pre-set the target voice
model.set_target_voice("my_target_voice.wav")

# Generate without specifying target_voice_path each time
wav = model.generate(audio="source1.wav")
ta.save("converted1.wav", wav, model.sr)

wav = model.generate(audio="source2.wav")
ta.save("converted2.wav", wav, model.sr)

How It Works

  1. Audio Tokenization: The source audio is converted to 16kHz and tokenized using the S3 tokenizer, which extracts semantic speech features.
  2. Voice Embedding: The target voice reference (first 10 seconds) is embedded to capture the speaker’s voice characteristics.
  3. Voice Transformation: The S3Gen decoder transforms the source audio tokens using the target voice embedding while preserving the original content and prosody.
  4. Audio Synthesis: The transformed tokens are decoded to 24kHz audio with the target voice characteristics.

Generation Parameters

Parameter           Type         Description
audio               str          Path to source audio file to convert
target_voice_path   str | None   Path to target voice reference (optional if pre-set)
Unlike TTS models, voice conversion has minimal parameters since it preserves the prosody and timing of the original audio.

Best Practices

Reference Audio Quality

Target Voice

  • Use clean, noise-free audio
  • 5-10 seconds of speech
  • Clear, natural speaking
  • Representative of desired voice

Source Audio

  • Any length supported
  • Automatically resampled
  • Speech-only recommended
  • Minimize background noise

Optimal Results

  1. Clean Audio: Both source and target should be free of background noise
  2. Similar Speaking Styles: Better results when source and target have similar speaking rates
  3. Quality References: Use high-quality recordings for the target voice
  4. Speech Content: Works best with speech-only audio (no music or sound effects)
Voice conversion quality depends heavily on the quality of both the source audio and target voice reference. Poor quality inputs will result in degraded outputs.
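
The guidelines above can be turned into a simple pre-flight check before running a conversion. This is a minimal sketch: check_target_reference is a hypothetical helper, not part of the Chatterbox API, and the duration and sample rate would come from whatever audio loader you use (e.g. torchaudio.info or librosa).

```python
# Hypothetical pre-flight check for a target-voice reference clip,
# based on the best-practice guidelines above (5-10 s of clean speech).

def check_target_reference(duration_s: float, sample_rate: int) -> list[str]:
    """Return a list of warnings about a target-voice reference clip."""
    warnings = []
    if duration_s < 5.0:
        warnings.append("reference shorter than 5 s; voice capture may be weak")
    if duration_s > 10.0:
        warnings.append("only the first 10 s are used for voice conditioning")
    if sample_rate < 16_000:
        warnings.append("low sample rate; consider a higher-quality recording")
    return warnings

# A short, low-sample-rate clip triggers two warnings:
print(check_target_reference(3.0, 8_000))

# A 7-second clip at 24 kHz passes cleanly:
print(check_target_reference(7.0, 24_000))  # []
```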

Technical Details

Audio Processing Pipeline

# Internal processing flow (for reference).
# `tokenizer` and `s3gen` are internal components of ChatterboxVC.
# 1. Load source audio and resample to 16kHz
audio_16k, _ = librosa.load(source_audio, sr=16000)

# 2. Tokenize source audio
speech_tokens, _ = tokenizer(audio_16k)

# 3. Load target voice and extract embeddings
target_audio, _ = librosa.load(target_voice, sr=24000)
ref_dict = s3gen.embed_ref(target_audio[:240000])  # First 10 seconds

# 4. Generate with voice transformation
output_wav = s3gen.inference(speech_tokens, ref_dict)

Conditioning Length

  • Source Audio: Full length is processed (no limit)
  • Target Voice: First 10 seconds (240,000 samples at 24kHz) are used for voice conditioning
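
The 240,000-sample cap is just 10 s x 24,000 Hz. The sketch below illustrates the arithmetic with plain Python lists standing in for audio samples; trimming manually like this is optional, since embed_ref slices the reference internally.

```python
# Why the target reference is capped at 240,000 samples:
# 10 seconds of audio at the 24 kHz reference rate.

REF_SR = 24_000
MAX_REF_SECONDS = 10
MAX_REF_SAMPLES = MAX_REF_SECONDS * REF_SR  # 240,000

def trim_reference(samples: list[float]) -> list[float]:
    """Keep at most the first 10 seconds of a 24 kHz reference clip."""
    return samples[:MAX_REF_SAMPLES]

long_clip = [0.0] * (15 * REF_SR)       # a 15-second clip
print(len(trim_reference(long_clip)))   # 240000

short_clip = [0.0] * (5 * REF_SR)       # a 5-second clip is kept whole
print(len(trim_reference(short_clip)))  # 120000
```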

Built-in Watermarking

Every audio file generated by ChatterboxVC includes Resemble AI’s Perth (Perceptual Threshold) watermark: an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations.
import perth
import librosa

# Load the converted audio
converted_audio, sr = librosa.load("converted.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(converted_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • Content Localization: Adapt voice actors for different regions while keeping performances
  • Voice Replacement: Replace placeholder voices in video production
  • Privacy Protection: Anonymize speakers while preserving speech content
  • Character Consistency: Maintain consistent character voices across recordings
  • Audio Restoration: Update old recordings with clearer voices
  • Voice Acting: Transform voice performances to match different characters

Performance Characteristics

Conversion Speed

Fast processing with efficient tokenization. Real-time or near-real-time on modern GPUs.

Audio Quality

High-fidelity 24kHz output that preserves original prosody while transforming voice characteristics.

Limitations

  • Voice conversion works best with clean, speech-only audio
  • Background music or noise may affect quality
  • Extreme voice transformations (e.g., male to female or vice versa) may sound less natural
  • The model cannot add emotions or change prosody; it only transforms voice timbre

Comparison with TTS

When should you use voice conversion vs TTS?

Use Voice Conversion When:

  • You have audio you want to transform
  • You need to preserve exact timing
  • You want to keep original performance
  • You’re replacing voices in existing content

Use TTS When:

  • You’re starting from text
  • You need to generate new speech
  • You want to control prosody
  • You’re creating original content

Next Steps

Installation

Install Chatterbox and get started

TTS Models

Explore text-to-speech models
