
Overview

Chatterbox is the original model in the Chatterbox family, offering general-purpose zero-shot text-to-speech with advanced creative controls. With 500M parameters, it provides fine-grained control over speech generation through CFG (Classifier-Free Guidance) weighting and exaggeration parameters.

CFG Control

Adjust the classifier-free guidance weight to control adherence to the reference voice and speaking style.

Exaggeration Tuning

Control emotional intensity and expressiveness of generated speech.

Zero-Shot Cloning

Clone any voice from a short reference clip without fine-tuning.

Creative Flexibility

Fine-tune generation for dramatic, expressive, or neutral speech styles.

Model Specifications

  • Model Size: 500M parameters
  • Language: English only
  • Sample Rate: 24,000 Hz
  • Architecture: T3 transformer + S3Gen decoder
  • Repository: ResembleAI/chatterbox

Key Features

Classifier-Free Guidance (CFG)

CFG weight controls how closely the model follows the reference audio’s characteristics:
  • Higher values (0.7-1.0): Stronger adherence to reference voice and style
  • Medium values (0.3-0.5): Balanced, works well for most use cases
  • Lower values (0.0-0.3): More creative interpretation, useful for fast-speaking references

Exaggeration Control

The exaggeration parameter controls emotional intensity and expressiveness:
  • Default (0.5): Natural, balanced speech
  • Lower (0.0-0.3): More neutral, measured delivery
  • Higher (0.7-1.0): More dramatic, expressive speech
For expressive or dramatic speech, try lower CFG weight (around 0.3) with higher exaggeration (0.7+). Higher exaggeration tends to speed up speech, while lower CFG helps maintain slower, more deliberate pacing.

Hardware Requirements

Minimum (CPU)

  • 6GB RAM
  • CPU inference supported
  • Slower generation times

Recommended (GPU)

  • NVIDIA GPU with 6GB+ VRAM
  • CUDA support
  • Near real-time generation
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
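
Device selection can be handled with a small helper. The sketch below (the pick_device name is ours, not part of the Chatterbox API) prefers CUDA, then Apple Silicon (MPS), then CPU, matching the hardware tiers above:

```python
# Hypothetical helper (not part of the Chatterbox API): pick the best
# available backend, preferring CUDA, then Apple Silicon (MPS), then CPU.
def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed; CPU inference only
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

You could then load the model with ChatterboxTTS.from_pretrained(device=pick_device()).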

Usage

Basic Generation

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Load the model
model = ChatterboxTTS.from_pretrained(device="cuda")

# Generate speech with default voice
text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)

ta.save("test-english.wav", wav, model.sr)

Voice Cloning

Clone any voice by providing a reference audio clip:
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Your custom text here"
wav = model.generate(text, audio_prompt_path="your_reference_clip.wav")

ta.save("cloned-voice.wav", wav, model.sr)

Creative Control with CFG and Exaggeration

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# For dramatic, expressive speech
text = "This is absolutely incredible!"
wav = model.generate(
    text,
    audio_prompt_path="reference.wav",
    cfg_weight=0.3,      # Lower CFG for slower pacing
    exaggeration=0.7     # Higher exaggeration for drama
)

ta.save("expressive.wav", wav, model.sr)

Generation Parameters

Control the generation process with these parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| temperature | 0.8 | Controls randomness in token selection |
| top_p | 1.0 | Nucleus sampling threshold |
| min_p | 0.05 | Minimum probability threshold |
| repetition_penalty | 1.2 | Penalizes repeated tokens |
| cfg_weight | 0.5 | Classifier-free guidance strength |
| exaggeration | 0.5 | Emotional intensity level |
| audio_prompt_path | None | Path to reference audio for voice cloning |
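
To make the sampling knobs concrete, here is a toy sketch of how temperature, top_p, and min_p typically interact in a token sampler. This is not Chatterbox's internal implementation, only an illustration of the standard semantics (with min_p interpreted relative to the most likely token, as in common samplers):

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=1.0, min_p=0.05, rng=random):
    """Toy sampler illustrating temperature, nucleus (top_p), and min_p."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min_p: drop tokens far less likely than the most probable token.
    p_max = max(probs)
    candidates = [i for i, p in enumerate(probs) if p >= min_p * p_max]

    # top_p (nucleus): keep the smallest high-probability prefix summing to top_p.
    candidates.sort(key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in candidates:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample from the renormalized nucleus.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a strongly peaked distribution, the sampler almost always returns the top token; raising temperature toward 1.0 and top_p toward 1.0 admits more variety.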

Tips and Tricks

General Use (TTS and Voice Agents)

Reference Clip Matching

Ensure the reference clip matches the specified language. Otherwise, outputs may inherit the accent of the reference clip’s language. To mitigate this, set cfg_weight to 0.

Default Settings

The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts.

Fast Speaking Style

If the reference speaker has a fast speaking style, lower cfg_weight to around 0.3 to improve pacing.

Expressive or Dramatic Speech

  1. Use lower cfg_weight values (e.g., around 0.3)
  2. Increase exaggeration to around 0.7 or higher
  3. Note that higher exaggeration tends to speed up speech
  4. Lower CFG weight helps compensate with slower, more deliberate pacing
When using high exaggeration values, the speech may become faster. Balance this with lower CFG weight for better pacing control.
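
The guidance above can be collected into simple presets. The names and values below are just this page's suggestions sketched for convenience, not part of the Chatterbox API:

```python
# Illustrative presets derived from the tips above; not part of the
# Chatterbox API. Pass them as keyword arguments to model.generate().
STYLE_PRESETS = {
    "neutral":        {"cfg_weight": 0.5, "exaggeration": 0.3},
    "default":        {"cfg_weight": 0.5, "exaggeration": 0.5},
    "fast_reference": {"cfg_weight": 0.3, "exaggeration": 0.5},
    "dramatic":       {"cfg_weight": 0.3, "exaggeration": 0.7},
}

def generation_kwargs(style: str) -> dict:
    """Return a copy of the preset for a named speaking style."""
    return dict(STYLE_PRESETS[style])

# e.g. wav = model.generate(text, **generation_kwargs("dramatic"))
```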

Performance Characteristics

Generation Speed

The 10-step decoding process provides high quality at the cost of slightly slower generation than Chatterbox-Turbo.

Audio Quality

Excellent 24kHz output with fine control over voice characteristics and emotional expression.

Built-in Watermarking

Every audio file generated by Chatterbox includes Resemble AI’s Perth (Perceptual Threshold) watermark: imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations. You can detect the watermark using:
import perth
import librosa

# Load the watermarked audio
watermarked_audio, sr = librosa.load("output.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • General TTS: Versatile text-to-speech for various applications
  • Creative Projects: Fine control for audiobooks, podcasts, and video content
  • Character Voices: Expressive speech for games and animations
  • Voice Agents: Conversational AI with adjustable personality
  • Audio Production: Professional narration with emotional nuance

Comparison with Other Models

| Feature | Chatterbox | Chatterbox-Turbo | Chatterbox-Multilingual |
| --- | --- | --- | --- |
| Parameters | 500M | 350M | 500M |
| Languages | English | English | 23+ |
| CFG Control | Yes | No | Yes |
| Exaggeration | Yes | No | Yes |
| Paralinguistic Tags | No | Yes | No |
| Best For | Creative control | Low latency | Multi-language |

Next Steps

Installation

Install Chatterbox and get started

API Reference

Explore all parameters and methods
