Voice Conversion - Chatterbox TTS

Overview

ChatterboxVC provides voice conversion capabilities, allowing you to transform the voice characteristics of any audio file to match a target speaker while preserving the original speech content and prosody. Unlike text-to-speech, voice conversion works directly with audio input.

Voice Transformation

Convert any speaker’s voice to match your target voice while keeping the original content.

Prosody Preservation

Maintains the original timing, rhythm, and intonation of the source audio.

Zero-Shot

No training required - just provide a target voice reference.

High Quality

24kHz output with natural voice transformation.

Voice Conversion vs TTS

Understand the key differences between voice conversion and text-to-speech:

Aspect	Voice Conversion (VC)	Text-to-Speech (TTS)
Input	Audio file	Text string
Output	Transformed audio	Generated speech
Content	Preserves original	Creates new content
Prosody	Keeps original timing	Generates new prosody
Use Case	Voice transformation	Speech synthesis

Voice conversion is ideal when you want to change who is speaking while keeping the exact timing, emotion, and delivery of the original performance.

Model Specifications

Input: Audio file (automatically resampled to 16kHz)
Output Sample Rate: 24,000 Hz
Architecture: S3Gen decoder with voice conditioning
Repository: ResembleAI/chatterbox

Hardware Requirements

Minimum (CPU)

4GB RAM
CPU inference supported
Slower conversion times

Recommended (GPU)

NVIDIA GPU with 4GB+ VRAM
CUDA support
Real-time conversion possible

The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.

Usage

Basic Voice Conversion

import torchaudio as ta
from chatterbox.vc import ChatterboxVC

# Load the voice conversion model
model = ChatterboxVC.from_pretrained(device="cuda")

# Convert voice
source_audio = "original_speaker.wav"
target_voice = "desired_voice.wav"

wav = model.generate(
    audio=source_audio,
    target_voice_path=target_voice
)

ta.save("converted.wav", wav, model.sr)

Auto-detect Device

import torch
import torchaudio as ta
from chatterbox.vc import ChatterboxVC

# Automatically detect the best available device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")

model = ChatterboxVC.from_pretrained(device=device)

source_audio = "source.wav"
target_voice = "target.wav"

wav = model.generate(
    audio=source_audio,
    target_voice_path=target_voice
)

ta.save("converted.wav", wav, model.sr)

Batch Processing

import torchaudio as ta
from chatterbox.vc import ChatterboxVC

model = ChatterboxVC.from_pretrained(device="cuda")

# Set target voice once
target_voice = "celebrity_voice.wav"
model.set_target_voice(target_voice)

# Process multiple files with the same target voice
source_files = [
    "speaker1.wav",
    "speaker2.wav",
    "speaker3.wav"
]

for i, source_file in enumerate(source_files):
    wav = model.generate(audio=source_file)
    ta.save(f"converted_{i}.wav", wav, model.sr)

When processing multiple files with the same target voice, call set_target_voice() once and then call generate() without the target_voice_path parameter, so the reference is embedded only once instead of on every call.
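The same pattern can be applied to a whole folder. The following is a minimal sketch, assuming the source clips are `.wav` files in a single directory. The helper names `plan_outputs` and `convert_folder` and the `converted_` file prefix are illustrative and not part of Chatterbox; only `from_pretrained`, `set_target_voice`, `generate`, and `model.sr` are taken from the examples above.

```python
from pathlib import Path


def plan_outputs(source_dir, output_dir, pattern="*.wav"):
    """Pair each source clip with an output path (hypothetical helper)."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    return [
        (src, output_dir / f"converted_{src.name}")
        for src in sorted(source_dir.glob(pattern))
    ]


def convert_folder(source_dir, output_dir, target_voice, device="cuda"):
    """Convert every clip in source_dir to the same target voice."""
    # Imported here so the path planning above works without the model installed.
    import torchaudio as ta
    from chatterbox.vc import ChatterboxVC

    model = ChatterboxVC.from_pretrained(device=device)
    model.set_target_voice(target_voice)  # embed the reference once

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_outputs(source_dir, output_dir):
        wav = model.generate(audio=str(src))
        ta.save(str(dst), wav, model.sr)
```

Planning the output paths separately from the conversion makes it easy to preview which files will be written before loading the model.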

Pre-setting Target Voice

import torchaudio as ta
from chatterbox.vc import ChatterboxVC

model = ChatterboxVC.from_pretrained(device="cuda")

# Pre-set the target voice
model.set_target_voice("my_target_voice.wav")

# Generate without specifying target_voice_path each time
wav = model.generate(audio="source1.wav")
ta.save("converted1.wav", wav, model.sr)

wav = model.generate(audio="source2.wav")
ta.save("converted2.wav", wav, model.sr)

How It Works

Audio Tokenization

The source audio is resampled to 16kHz and tokenized with the S3 tokenizer, which extracts semantic speech features.

Voice Embedding

The target voice reference (first 10 seconds) is embedded to capture the speaker’s voice characteristics.

Voice Transformation

The S3Gen decoder transforms the source audio tokens using the target voice embedding while preserving the original content and prosody.

Audio Synthesis

The transformed tokens are decoded to 24kHz audio with the target voice characteristics.

Generation Parameters

Parameter	Type	Description
`audio`	str	Path to source audio file to convert
`target_voice_path`	str | None	Path to target voice reference (optional if pre-set)

Unlike TTS models, voice conversion has minimal parameters since it preserves the prosody and timing of the original audio.
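Because there are only two parameters, most call-site mistakes come down to a missing file or a missing target voice. Below is a minimal sketch of a pre-flight check, assuming you track yourself whether `set_target_voice()` has already been called. `check_generate_args` and its `target_is_preset` flag are hypothetical and not part of the Chatterbox API.

```python
from pathlib import Path


def check_generate_args(audio, target_voice_path=None, target_is_preset=False):
    """Validate arguments before calling generate() (hypothetical helper)."""
    if not Path(audio).is_file():
        raise FileNotFoundError(f"Source audio not found: {audio}")
    if target_voice_path is None:
        # Without a path, a target voice must already have been set.
        if not target_is_preset:
            raise ValueError(
                "Pass target_voice_path or call set_target_voice() first"
            )
    elif not Path(target_voice_path).is_file():
        raise FileNotFoundError(f"Target voice not found: {target_voice_path}")
```

Calling this right before `model.generate(...)` surfaces a clear error message without changing how the model itself is used.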

Best Practices

Reference Audio Quality

Target Voice

Use clean, noise-free audio
5-10 seconds of speech
Clear, natural speaking
Representative of desired voice

Source Audio

Any length supported
Automatically resampled
Speech-only recommended
Minimize background noise

Optimal Results

Clean Audio: Both source and target should be free of background noise
Similar Speaking Styles: Better results when source and target have similar speaking rates
Quality References: Use high-quality recordings for the target voice
Speech Content: Works best with speech-only audio (no music or sound effects)

Voice conversion quality depends heavily on the quality of both the source audio and target voice reference. Poor quality inputs will result in degraded outputs.
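The length guidance for the target voice can be checked automatically before conversion. The following is a rough sketch, assuming the reference is an uncompressed PCM WAV file that Python's standard `wave` module can read. The helper names and warning messages are illustrative, and the check covers duration only, not noise or recording quality.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file using only the standard library."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_target_reference(path, min_seconds=5.0, max_used_seconds=10.0):
    """Return a list of warnings for a target voice clip (hypothetical helper)."""
    warnings = []
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        warnings.append(
            f"Reference is {duration:.1f}s; at least {min_seconds:.0f}s of speech is recommended"
        )
    elif duration > max_used_seconds:
        warnings.append(
            f"Reference is {duration:.1f}s; only the first {max_used_seconds:.0f}s are used"
        )
    return warnings
```

An empty list means the clip falls inside the recommended 5-10 second range.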

Technical Details

Audio Processing Pipeline

# Internal processing flow (simplified, for reference only).
# `tokenizer` and `s3gen` stand for the model's internal S3 tokenizer and
# S3Gen decoder; they are not defined in this snippet.
import librosa

# 1. Load source audio and resample to 16kHz
audio_16k, _ = librosa.load(source_audio, sr=16000)

# 2. Tokenize source audio
speech_tokens, _ = tokenizer(audio_16k)

# 3. Load target voice and extract embeddings
target_audio, _ = librosa.load(target_voice, sr=24000)
ref_dict = s3gen.embed_ref(target_audio[:240000])  # First 10 seconds (10 s x 24,000 Hz)

# 4. Generate with voice transformation
output_wav = s3gen.inference(speech_tokens, ref_dict)

Conditioning Length

Source Audio: Full length is processed (no limit)
Target Voice: First 10 seconds (240,000 samples at 24kHz) are used for voice conditioning
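The 240,000-sample figure is simply 10 seconds multiplied by the 24,000 Hz reference sample rate. As a small sketch of that arithmetic, the hypothetical helper below returns the part of a mono reference that would be used for conditioning. It works on any sliceable sequence, including a NumPy array, and is not a function from Chatterbox.

```python
TARGET_SR = 24_000         # sample rate used for the target voice reference
CONDITIONING_SECONDS = 10  # only this much of the reference is used


def conditioning_window(reference, sr=TARGET_SR, seconds=CONDITIONING_SECONDS):
    """Return the leading slice of a mono reference used for conditioning."""
    max_samples = sr * seconds  # 24,000 * 10 = 240,000
    return reference[:max_samples]
```

A reference shorter than 10 seconds is returned unchanged, since slicing past the end of a sequence is not an error in Python.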

Built-in Watermarking

Every audio file generated by ChatterboxVC includes Resemble AI’s Perth (Perceptual Threshold) watermark, an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations.

import perth
import librosa

# Load the converted audio
converted_audio, sr = librosa.load("converted.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(converted_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 (no watermark) or 1.0 (watermarked)

Use Cases

Content Localization: Adapt voice actors for different regions while keeping performances
Voice Replacement: Replace placeholder voices in video production
Privacy Protection: Anonymize speakers while preserving speech content
Character Consistency: Maintain consistent character voices across recordings
Audio Restoration: Update old recordings with clearer voices
Voice Acting: Transform voice performances to match different characters

Performance Characteristics

Conversion Speed

Fast processing with efficient tokenization. Real-time or near-real-time on modern GPUs.
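Whether conversion is real-time on your hardware can be measured with a real-time factor (RTF): processing time divided by the duration of the generated audio. The sketch below is one way to do that. `real_time_factor` and `timed_convert` are hypothetical helpers, and the code assumes the target voice is already set and that `generate()` returns a `(channels, samples)` tensor, as the `ta.save` calls in the usage examples suggest.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def timed_convert(model, source_path):
    """Run one conversion and report its real-time factor (hypothetical helper)."""
    start = time.perf_counter()
    wav = model.generate(audio=source_path)  # assumes the target voice is pre-set
    elapsed = time.perf_counter() - start
    # The last dimension is assumed to hold the sample count.
    return wav, real_time_factor(elapsed, wav.shape[-1], model.sr)
```

The first call after loading a model is often slower than later ones, so it is worth timing a few conversions before drawing conclusions.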

Audio Quality

High-fidelity 24kHz output that preserves original prosody while transforming voice characteristics.

Limitations

Voice conversion works best with clean, speech-only audio
Background music or noise may affect quality
Extreme voice transformations (e.g., male to female or vice versa) may sound less natural
The model cannot add emotions or change prosody; it only transforms voice timbre

Comparison with TTS

When should you use voice conversion vs TTS?

Use Voice Conversion When:

You have audio you want to transform
You need to preserve exact timing
You want to keep original performance
You’re replacing voices in existing content

Use TTS When:

You’re starting from text
You need to generate new speech
You want to control prosody
You’re creating original content

Next Steps

Installation

Install Chatterbox and get started

TTS Models

Explore text-to-speech models
