Overview
Chatterbox is the original model in the Chatterbox family, offering general-purpose zero-shot text-to-speech with advanced creative controls. With 500M parameters, it provides fine-grained control over speech generation through classifier-free guidance (CFG) weighting and exaggeration parameters.
CFG Control
Adjust classifier-free guidance weight to control adherence to the reference voice and speaking style.
Exaggeration Tuning
Control emotional intensity and expressiveness of generated speech.
Zero-Shot Cloning
Clone any voice from a short reference clip without fine-tuning.
Creative Flexibility
Fine-tune generation for dramatic, expressive, or neutral speech styles.
Model Specifications
- Model Size: 500M parameters
- Language: English only
- Sample Rate: 24,000 Hz
- Architecture: T3 transformer + S3Gen decoder
- Repository: ResembleAI/chatterbox
Key Features
Classifier-Free Guidance (CFG)
CFG weight controls how closely the model follows the reference audio’s characteristics:
- Higher values (0.7-1.0): Stronger adherence to reference voice and style
- Medium values (0.3-0.5): Balanced, works well for most use cases
- Lower values (0.0-0.3): More creative interpretation, useful for fast-speaking references
Exaggeration Control
The exaggeration parameter controls emotional intensity and expressiveness:
- Default (0.5): Natural, balanced speech
- Lower (0.0-0.3): More neutral, measured delivery
- Higher (0.7-1.0): More dramatic, expressive speech
Hardware Requirements
Minimum (CPU)
- 6GB RAM
- CPU inference supported
- Slower generation times
Recommended (GPU)
- NVIDIA GPU with 6GB+ VRAM
- CUDA support
- Near real-time generation
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
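The hardware options above map naturally onto PyTorch device strings. The following is a minimal sketch of picking one at runtime, assuming PyTorch is the backend in use; the helper names `pick_device` and `detect_device` are hypothetical and not part of Chatterbox.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple Silicon (MPS), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; fall back to CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may be missing on older PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)
```

The returned string can then be passed wherever a device name is expected when loading the model.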
Usage
Basic Generation
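A minimal sketch of basic generation, assuming the `ChatterboxTTS` class with its `from_pretrained` and `generate` methods and `sr` attribute as published in the ResembleAI/chatterbox repository, plus `torchaudio` for writing the file. The `synthesize` helper and the `output.wav` filename are hypothetical names used for illustration.

```python
def synthesize(text: str, out_path: str = "output.wav", device: str = "cuda") -> str:
    """Generate speech for `text` with default settings and save it as a WAV file."""
    if not text.strip():
        raise ValueError("text must not be empty")

    # Imported lazily: these packages are large and only needed at call time.
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS  # assumed import path

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text)        # waveform tensor
    ta.save(out_path, wav, model.sr)  # model.sr is the model's sample rate
    return out_path


# Example call (downloads the model weights on first use):
# synthesize("Hello from Chatterbox.", device="cpu")
```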
Voice Cloning
Clone any voice by providing a reference audio clip through the audio_prompt_path parameter.
Creative Control with CFG and Exaggeration
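A sketch that combines a reference clip with the two creative controls, under the same assumptions about the `ChatterboxTTS` interface from the ResembleAI/chatterbox repository. The parameter names `audio_prompt_path`, `cfg_weight`, and `exaggeration` come from the parameter table on this page; the `synthesize_with_reference` helper and the `cloned.wav` filename are hypothetical.

```python
from pathlib import Path


def synthesize_with_reference(
    text: str,
    reference_wav: str,
    cfg_weight: float = 0.5,
    exaggeration: float = 0.5,
    out_path: str = "cloned.wav",
    device: str = "cuda",
) -> str:
    """Generate speech in the voice of `reference_wav` and save it as a WAV file."""
    ref = Path(reference_wav)
    if not ref.is_file():
        raise FileNotFoundError(f"reference clip not found: {ref}")

    # Imported lazily: these packages are large and only needed at call time.
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS  # assumed import path

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(
        text,
        audio_prompt_path=str(ref),
        cfg_weight=cfg_weight,
        exaggeration=exaggeration,
    )
    ta.save(out_path, wav, model.sr)
    return out_path


# Example: a more dramatic read of the same voice
# synthesize_with_reference("We made it!", "speaker.wav", cfg_weight=0.3, exaggeration=0.7)
```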
Generation Parameters
Control the generation process with these parameters:
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Controls randomness in token selection |
| top_p | 1.0 | Nucleus sampling threshold |
| min_p | 0.05 | Minimum probability threshold |
| repetition_penalty | 1.2 | Penalizes repeated tokens |
| cfg_weight | 0.5 | Classifier-free guidance strength |
| exaggeration | 0.5 | Emotional intensity level |
| audio_prompt_path | None | Path to reference audio for voice cloning |
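One way to keep these defaults in a single place is a small settings object. The sketch below mirrors the table's names and default values; the `GenerationParams` class and its `as_kwargs` method are hypothetical conveniences, not part of the Chatterbox API.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationParams:
    """Generation parameters with the defaults listed in the table."""

    temperature: float = 0.8
    top_p: float = 1.0
    min_p: float = 0.05
    repetition_penalty: float = 1.2
    cfg_weight: float = 0.5
    exaggeration: float = 0.5
    audio_prompt_path: Optional[str] = None

    def as_kwargs(self) -> dict:
        """Return keyword arguments, leaving out audio_prompt_path when unset."""
        kwargs = asdict(self)
        if kwargs["audio_prompt_path"] is None:
            del kwargs["audio_prompt_path"]
        return kwargs


params = GenerationParams(cfg_weight=0.3, exaggeration=0.7)
print(params.as_kwargs())
```

Assuming the generation call accepts these names as keyword arguments, the result could be unpacked into it, for example `model.generate(text, **params.as_kwargs())`.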
Tips and Tricks
General Use (TTS and Voice Agents)
Reference Clip Matching
Ensure the reference clip matches the specified language. Otherwise, outputs may inherit the accent of the reference clip’s language. To mitigate this, set cfg_weight to 0.
Default Settings
The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts.
Fast Speaking Style
If the reference speaker has a fast speaking style, lower cfg_weight to around 0.3 to improve pacing.
Expressive or Dramatic Speech
- Use lower cfg_weight values (e.g., around 0.3)
- Increase exaggeration to around 0.7 or higher
- Note that higher exaggeration tends to speed up speech
- Lower CFG weight helps compensate with slower, more deliberate pacing
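The tips in this section can be captured as named presets. This is a sketch only: the preset names and the `preset_kwargs` helper are hypothetical, and the values are the approximate starting points suggested above rather than fixed recommendations.

```python
# Starting points taken from the tips above; tune per voice and prompt.
PRESETS = {
    "default": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "fast_reference": {"exaggeration": 0.5, "cfg_weight": 0.3},
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
    # Reference clip in a different language: reduce accent carry-over.
    "language_mismatch": {"exaggeration": 0.5, "cfg_weight": 0.0},
}


def preset_kwargs(name: str, **overrides: float) -> dict:
    """Return a copy of the named preset with any overrides applied."""
    if name not in PRESETS:
        raise KeyError(f"unknown preset {name!r}; choose from {sorted(PRESETS)}")
    return {**PRESETS[name], **overrides}


print(preset_kwargs("expressive"))
print(preset_kwargs("expressive", exaggeration=0.9))
```

Returning a copy keeps the shared `PRESETS` table unchanged when a caller overrides a value.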
Performance Characteristics
Generation Speed
The 10-step decoding process provides high quality at the cost of slightly slower generation than Chatterbox-Turbo.
Audio Quality
Excellent 24kHz output with fine control over voice characteristics and emotional expression.
Built-in Watermarking
Every audio file generated by Chatterbox includes Resemble AI’s Perth (Perceptual Threshold) watermark: an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations. The watermark can be detected with Resemble AI’s Perth library.
Use Cases
- General TTS: Versatile text-to-speech for various applications
- Creative Projects: Fine control for audiobooks, podcasts, and video content
- Character Voices: Expressive speech for games and animations
- Voice Agents: Conversational AI with adjustable personality
- Audio Production: Professional narration with emotional nuance
Comparison with Other Models
| Feature | Chatterbox | Chatterbox-Turbo | Chatterbox-Multilingual |
|---|---|---|---|
| Parameters | 500M | 350M | 500M |
| Languages | English | English | 23+ |
| CFG Control | Yes | No | Yes |
| Exaggeration | Yes | No | Yes |
| Paralinguistic Tags | No | Yes | No |
| Best For | Creative control | Low latency | Multi-language |
Next Steps
Installation
Install Chatterbox and get started
API Reference
Explore all parameters and methods