Overview
ChatterboxVC provides voice conversion capabilities, allowing you to transform the voice characteristics of any audio file to match a target speaker while preserving the original speech content and prosody. Unlike text-to-speech, voice conversion works directly with audio input.
Voice Transformation
Convert any speaker’s voice to match your target voice while keeping the original content.
Prosody Preservation
Maintains the original timing, rhythm, and intonation of the source audio.
Zero-Shot
No training required - just provide a target voice reference.
High Quality
24kHz output with natural voice transformation.
Voice Conversion vs TTS
Understand the key differences between voice conversion and text-to-speech:

| Aspect | Voice Conversion (VC) | Text-to-Speech (TTS) |
|---|---|---|
| Input | Audio file | Text string |
| Output | Transformed audio | Generated speech |
| Content | Preserves original | Creates new content |
| Prosody | Keeps original timing | Generates new prosody |
| Use Case | Voice transformation | Speech synthesis |
Voice conversion is ideal when you want to change who is speaking while keeping the exact timing, emotion, and delivery of the original performance.
Model Specifications
- Input: Audio file (automatically resampled to 16kHz)
- Output Sample Rate: 24,000 Hz
- Architecture: S3Gen decoder with voice conditioning
- Repository: ResembleAI/chatterbox
Hardware Requirements
Minimum (CPU)
- 4GB RAM
- CPU inference supported
- Slower conversion times
Recommended (GPU)
- NVIDIA GPU with 4GB+ VRAM
- CUDA support
- Real-time conversion possible
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
Usage
Basic Voice Conversion
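A minimal sketch of a single conversion. It assumes the `chatterbox` package exposes `ChatterboxVC` in `chatterbox.vc` with a `from_pretrained(device)` constructor and an `sr` attribute holding the output sample rate, and that `torchaudio` is installed for writing the file. The `audio` and `target_voice_path` arguments are the ones listed under Generation Parameters; the helper name `convert_voice` and all file names are placeholders.

```python
def convert_voice(source_path, target_voice_path, output_path, device="cpu"):
    """Convert one audio file to the target voice and save the result as WAV."""
    # Heavy dependencies are imported lazily; both are assumed to be installed.
    import torchaudio
    from chatterbox.vc import ChatterboxVC  # assumed import path

    model = ChatterboxVC.from_pretrained(device)
    # `audio` is the source recording; `target_voice_path` is the voice reference.
    wav = model.generate(audio=source_path, target_voice_path=target_voice_path)
    # `model.sr` is assumed to hold the output sample rate (24,000 Hz).
    torchaudio.save(output_path, wav, model.sr)
    return output_path
```

With real recordings in place, a call such as `convert_voice("source.wav", "target_voice.wav", "converted.wav")` would write the converted audio next to the script.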
Auto-detect Device
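One way to choose a device automatically is to probe PyTorch for CUDA first, then Apple Silicon (MPS), and fall back to CPU. The sketch below is an illustration, not the library's own logic; `pick_device` is a hypothetical helper, and the resulting string is meant to be passed to whatever model constructor you use.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu", preferring the fastest available backend."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists in recent PyTorch builds; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```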
Batch Processing
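A hedged sketch of converting a folder of recordings with one loaded model. The helpers `plan_batch` and `convert_batch` are hypothetical; the model is only assumed to offer `generate(audio, target_voice_path=...)` and an `sr` attribute, and `save` is any callable with the shape `save(path, wav, sample_rate)` (for example `torchaudio.save`).

```python
from pathlib import Path


def plan_batch(input_dir, output_dir, pattern="*.wav"):
    """Pair each source file with the output path it will be written to."""
    out = Path(output_dir)
    return [
        (src, out / f"{src.stem}_converted.wav")
        for src in sorted(Path(input_dir).glob(pattern))
    ]


def convert_batch(model, jobs, target_voice_path, save):
    """Convert every (source, destination) pair and return the written paths."""
    written = []
    for src, dst in jobs:
        # The same target voice reference is used for every file in the batch.
        wav = model.generate(str(src), target_voice_path=target_voice_path)
        dst.parent.mkdir(parents=True, exist_ok=True)
        save(str(dst), wav, model.sr)
        written.append(dst)
    return written
```

Loading the model once and reusing it across files keeps the per-file cost down to the conversion itself.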
Pre-setting Target Voice
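When many files share one target voice, the reference can be set once so that `target_voice_path` does not have to be passed on each call (the parameter table marks it as optional if pre-set). The sketch below assumes the model exposes a `set_target_voice(path)` method for this; `convert_with_preset_voice` is a hypothetical helper.

```python
def convert_with_preset_voice(model, target_voice_path, source_paths):
    """Set the target voice once, then convert each source without re-passing it."""
    # Assumed method name: set_target_voice(path) stores the voice conditioning.
    model.set_target_voice(target_voice_path)
    # With a voice pre-set, target_voice_path is omitted from generate().
    return [model.generate(path) for path in source_paths]
```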
How It Works
Audio Tokenization
The source audio is resampled to 16kHz and tokenized using the S3 tokenizer, which extracts semantic speech features.
Voice Embedding
The target voice reference (first 10 seconds) is embedded to capture the speaker’s voice characteristics.
Voice Transformation
The S3Gen decoder transforms the source audio tokens using the target voice embedding while preserving the original content and prosody.
Generation Parameters
| Parameter | Type | Description |
|---|---|---|
| audio | str | Path to source audio file to convert |
| target_voice_path | str \| None | Path to target voice reference (optional if pre-set) |
Unlike TTS models, voice conversion has minimal parameters since it preserves the prosody and timing of the original audio.
Best Practices
Reference Audio Quality
Target Voice
- Use clean, noise-free audio
- 5-10 seconds of speech
- Clear, natural speaking
- Representative of desired voice
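The 5-10 second guideline can be checked before conversion. The sketch below uses only Python's standard `wave` module, so it assumes the reference is an uncompressed PCM WAV file; the helper names and the returned labels are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path, min_seconds=5.0, max_seconds=10.0):
    """Classify a reference clip against the 5-10 second guideline."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        return "too short"
    if duration > max_seconds:
        return "longer than needed"
    return "ok"
```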
Source Audio
- Any length supported
- Automatically resampled
- Speech-only recommended
- Minimize background noise
Optimal Results
- Clean Audio: Both source and target should be free of background noise
- Similar Speaking Styles: Better results when source and target have similar speaking rates
- Quality References: Use high-quality recordings for the target voice
- Speech Content: Works best with speech-only audio (no music or sound effects)
Technical Details
Audio Processing Pipeline
Conditioning Length
- Source Audio: Full length is processed (no limit)
- Target Voice: First 10 seconds (240,000 samples at 24kHz) are used for voice conditioning
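The conditioning figure follows directly from the sample rate: 10 seconds at 24,000 samples per second is 240,000 samples. The sketch below illustrates that truncation on plain Python lists; `conditioning_window` is a hypothetical helper, not the library's internal code.

```python
TARGET_SAMPLE_RATE = 24_000   # Hz, output/conditioning rate stated above
CONDITIONING_SECONDS = 10     # only the first 10 seconds are used

CONDITIONING_SAMPLES = TARGET_SAMPLE_RATE * CONDITIONING_SECONDS  # 240,000


def conditioning_window(samples):
    """Return the leading part of a reference clip used for conditioning."""
    return samples[:CONDITIONING_SAMPLES]


# A 15-second reference is cut to 10 seconds; a 6-second one is kept whole.
long_clip = [0.0] * (15 * TARGET_SAMPLE_RATE)
short_clip = [0.0] * (6 * TARGET_SAMPLE_RATE)
print(len(conditioning_window(long_clip)))   # 240000
print(len(conditioning_window(short_clip)))  # 144000
```

A reference longer than 10 seconds is therefore not harmful, but anything past the cutoff does not influence the converted voice.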
Built-in Watermarking
Every audio file generated by ChatterboxVC includes Resemble AI’s Perth (Perceptual Threshold) watermark - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations.
Use Cases
- Content Localization: Adapt voice actors for different regions while keeping performances
- Voice Replacement: Replace placeholder voices in video production
- Privacy Protection: Anonymize speakers while preserving speech content
- Character Consistency: Maintain consistent character voices across recordings
- Audio Restoration: Update old recordings with clearer voices
- Voice Acting: Transform voice performances to match different characters
Performance Characteristics
Conversion Speed
Fast processing with efficient tokenization. Real-time or near-real-time on modern GPUs.
Audio Quality
High-fidelity 24kHz output that preserves original prosody while transforming voice characteristics.
Limitations
Comparison with TTS
When should you use voice conversion vs TTS?
Use Voice Conversion When:
- You have audio you want to transform
- You need to preserve exact timing
- You want to keep original performance
- You’re replacing voices in existing content
Use TTS When:
- You’re starting from text
- You need to generate new speech
- You want to control prosody
- You’re creating original content
Next Steps
Installation
Install Chatterbox and get started
TTS Models
Explore text-to-speech models