
Overview

Chatterbox is the original model in the Chatterbox family, offering general-purpose zero-shot text-to-speech with advanced creative controls. With 500M parameters, it provides fine-grained control over speech generation through CFG (Classifier-Free Guidance) weighting and exaggeration parameters.

CFG Control

Adjust the classifier-free guidance weight to control adherence to the reference voice and speaking style.

Exaggeration Tuning

Control emotional intensity and expressiveness of generated speech.

Zero-Shot Cloning

Clone any voice from a short reference clip without fine-tuning.

Creative Flexibility

Fine-tune generation for dramatic, expressive, or neutral speech styles.

Model Specifications

  • Model Size: 500M parameters
  • Language: English only
  • Sample Rate: 24,000 Hz
  • Architecture: T3 transformer + S3Gen decoder
  • Repository: ResembleAI/chatterbox

Key Features

Classifier-Free Guidance (CFG)

CFG weight controls how closely the model follows the reference audio’s characteristics:
  • Higher values (0.7-1.0): Stronger adherence to reference voice and style
  • Medium values (0.3-0.5): Balanced, works well for most use cases
  • Lower values (0.0-0.3): More creative interpretation, useful for fast-speaking references

Exaggeration Control

The exaggeration parameter controls emotional intensity and expressiveness:
  • Default (0.5): Natural, balanced speech
  • Lower (0.0-0.3): More neutral, measured delivery
  • Higher (0.7-1.0): More dramatic, expressive speech
For expressive or dramatic speech, try lower CFG weight (around 0.3) with higher exaggeration (0.7+). Higher exaggeration tends to speed up speech, while lower CFG helps maintain slower, more deliberate pacing.

Hardware Requirements

Minimum (CPU)

  • 6GB RAM
  • CPU inference supported
  • Slower generation times

Recommended (GPU)

  • NVIDIA GPU with 6GB+ VRAM
  • CUDA support
  • Near real-time generation
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
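
Device selection can be handled with a small helper. The sketch below (the pick_device name is ours, not part of the Chatterbox API) prefers CUDA, then Apple Silicon (MPS), then CPU, matching the hardware tiers above:

```python
# Hypothetical helper (not part of the Chatterbox API): pick the best
# available backend, preferring CUDA, then Apple Silicon (MPS), then CPU.
def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed; CPU inference only
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

You could then load the model with ChatterboxTTS.from_pretrained(device=pick_device()).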

Usage

Basic Generation

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Load the model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech with default voice
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)

ta.save("test-english.wav", wav, model.sr)

Voice Cloning

Clone any voice by providing a reference audio clip:
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Your custom text here"
wav = model.generate(text, audio_prompt_path="your_reference_clip.wav")

ta.save("cloned-voice.wav", wav, model.sr)

Creative Control with CFG and Exaggeration

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# For dramatic, expressive speech
text = "This is absolutely incredible!"
wav = model.generate(
    text,
    audio_prompt_path="reference.wav",
    cfg_weight=0.3,      # Lower CFG for slower pacing
    exaggeration=0.7     # Higher exaggeration for drama
)

ta.save("expressive.wav", wav, model.sr)

Generation Parameters

Control the generation process with these parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| temperature | 0.8 | Controls randomness in token selection |
| top_p | 1.0 | Nucleus sampling threshold |
| min_p | 0.05 | Minimum probability threshold |
| repetition_penalty | 1.2 | Penalizes repeated tokens |
| cfg_weight | 0.5 | Classifier-free guidance strength |
| exaggeration | 0.5 | Emotional intensity level |
| audio_prompt_path | None | Path to reference audio for voice cloning |
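
To make the sampling knobs concrete, here is a toy sketch of how temperature, top_p, and min_p typically interact in a token sampler. This is not Chatterbox's internal implementation, only an illustration of the standard semantics (with min_p interpreted relative to the most likely token, as in common samplers):

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=1.0, min_p=0.05, rng=random):
    """Toy sampler illustrating temperature, nucleus (top_p), and min_p."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min_p: drop tokens far less likely than the most probable token.
    p_max = max(probs)
    candidates = [i for i, p in enumerate(probs) if p >= min_p * p_max]

    # top_p (nucleus): keep the smallest high-probability prefix summing to top_p.
    candidates.sort(key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in candidates:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample from the renormalized nucleus.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a strongly peaked distribution, the sampler almost always returns the top token; raising temperature toward 1.0 and top_p toward 1.0 admits more variety.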

Tips and Tricks

General Use (TTS and Voice Agents)

Reference Clip Matching

Ensure the reference clip matches the specified language. Otherwise, outputs may inherit the accent of the reference clip’s language. To mitigate this, set cfg_weight to 0.

Default Settings

The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts.

Fast Speaking Style

If the reference speaker has a fast speaking style, lower cfg_weight to around 0.3 to improve pacing.

Expressive or Dramatic Speech

  1. Use lower cfg_weight values (e.g., around 0.3)
  2. Increase exaggeration to around 0.7 or higher
  3. Note that higher exaggeration tends to speed up speech
  4. Lower CFG weight helps compensate with slower, more deliberate pacing
When using high exaggeration values, the speech may become faster. Balance this with lower CFG weight for better pacing control.
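
The guidance above can be collected into simple presets. The names and values below are just this page's suggestions sketched for convenience, not part of the Chatterbox API:

```python
# Illustrative presets derived from the tips above; not part of the
# Chatterbox API. Pass them as keyword arguments to model.generate().
STYLE_PRESETS = {
    "neutral":        {"cfg_weight": 0.5, "exaggeration": 0.3},
    "default":        {"cfg_weight": 0.5, "exaggeration": 0.5},
    "fast_reference": {"cfg_weight": 0.3, "exaggeration": 0.5},
    "dramatic":       {"cfg_weight": 0.3, "exaggeration": 0.7},
}

def generation_kwargs(style: str) -> dict:
    """Return a copy of the preset for a named speaking style."""
    return dict(STYLE_PRESETS[style])

# e.g. wav = model.generate(text, **generation_kwargs("dramatic"))
```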

Performance Characteristics

Generation Speed

The 10-step decoding process provides high quality at the cost of slightly slower generation than Chatterbox-Turbo.

Audio Quality

Excellent 24kHz output with fine control over voice characteristics and emotional expression.

Built-in Watermarking

Every audio file generated by Chatterbox includes Resemble AI’s Perth (Perceptual Threshold) watermark: imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations. You can detect the watermark using:
import perth
import librosa

# Load the watermarked audio
watermarked_audio, sr = librosa.load("output.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • General TTS: Versatile text-to-speech for various applications
  • Creative Projects: Fine control for audiobooks, podcasts, and video content
  • Character Voices: Expressive speech for games and animations
  • Voice Agents: Conversational AI with adjustable personality
  • Audio Production: Professional narration with emotional nuance

Comparison with Other Models

| Feature | Chatterbox | Chatterbox-Turbo | Chatterbox-Multilingual |
| --- | --- | --- | --- |
| Parameters | 500M | 350M | 500M |
| Languages | English | English | 23+ |
| CFG Control | Yes | No | Yes |
| Exaggeration | Yes | No | Yes |
| Paralinguistic Tags | No | Yes | No |
| Best For | Creative control | Low latency | Multi-language |

Next Steps

Installation

Install Chatterbox and get started

API Reference

Explore all parameters and methods
