Chatterbox provides several configuration parameters to customize your speech generation. These settings control expressiveness, voice characteristics, sampling behavior, and performance.

Device Options

Specify the computing device when loading the model:
from chatterbox.tts_turbo import ChatterboxTurboTTS
from chatterbox.tts import ChatterboxTTS
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# CUDA (NVIDIA GPU)
model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# CPU
model = ChatterboxTTS.from_pretrained(device="cpu")

# MPS (Apple Silicon)
model = ChatterboxMultilingualTTS.from_pretrained(device="mps")

Device Selection Guide

| Device | Best For | Performance |
|--------|----------|-------------|
| cuda | NVIDIA GPUs | Fastest - recommended for production |
| mps | Apple Silicon (M1/M2/M3) | Fast - good for Mac users |
| cpu | Any system | Slower - use when GPU unavailable |
Auto-detection: The models automatically fall back to CPU if the requested device is unavailable. For Apple Silicon Macs without MPS support, the model will use CPU automatically.

Auto-Device Detection Example

import torch

from chatterbox.tts import ChatterboxTTS

# Automatically detect the best available device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")
model = ChatterboxTTS.from_pretrained(device=device)
From: example_tts.py (lines 6-14)

Generation Parameters

All generation parameters are passed to the generate() method:
wav = model.generate(
    text="Your text here",
    audio_prompt_path="reference.wav",
    cfg_weight=0.5,
    exaggeration=0.5,
    temperature=0.8,
    repetition_penalty=1.2,
    min_p=0.05,
    top_p=0.95,
    top_k=1000,
    norm_loudness=True
)

cfg_weight Parameter

Range: 0.0 to 1.0 (typically)
Default: 0.5 (standard models), 0.0 (Turbo, where it is ignored)
Controls how strongly the model follows the reference voice characteristics. Higher values make the output more similar to the reference audio.
# Light conditioning - more variation from reference
wav = model.generate(text, cfg_weight=0.3)

# Default - balanced
wav = model.generate(text, cfg_weight=0.5)

# Strong conditioning - closer to reference
wav = model.generate(text, cfg_weight=0.7)

When to Adjust cfg_weight

If your reference speaker talks very quickly, lower cfg_weight to improve pacing:
wav = model.generate(
    text,
    audio_prompt_path="fast_speaker.wav",
    cfg_weight=0.3  # Slows down pacing
)
From README: “If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.”
When using a voice from one language to speak another, set cfg_weight=0 to reduce accent transfer:
# English voice speaking French with minimal accent
wav = multilingual_model.generate(
    "Bonjour!",
    language_id="fr",
    audio_prompt_path="english_speaker.wav",
    cfg_weight=0.0  # Reduces English accent
)
From README: “To mitigate [accent transfer], set cfg_weight to 0.”
For more expressive output, combine lower cfg_weight with higher exaggeration:
wav = model.generate(
    text,
    cfg_weight=0.3,     # Lower for slower pacing
    exaggeration=0.7    # Higher for more expression
)
From README: “Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.”
Turbo Model: Chatterbox Turbo ignores cfg_weight during generation. The parameter only applies to standard Chatterbox and multilingual models.

exaggeration Parameter

Range: 0.0 to 1.0+
Default: 0.5 (standard models), 0.0 (Turbo)
Controls the expressiveness and emotional intensity of the generated speech.
# Neutral, flat delivery
wav = model.generate(text, exaggeration=0.0)

# Moderate expressiveness (default)
wav = model.generate(text, exaggeration=0.5)

# High expressiveness
wav = model.generate(text, exaggeration=0.7)

# Very dramatic
wav = model.generate(text, exaggeration=1.0)

Effects of Exaggeration

  • Lower values (0.0-0.3): More neutral, professional tone
  • Medium values (0.4-0.6): Natural conversation, moderate emotion
  • Higher values (0.7-1.0): Dramatic, expressive, emotional delivery
Speed Impact: Higher exaggeration tends to speed up speech. Compensate by reducing cfg_weight for more deliberate pacing.

Exaggeration Tips from README

General Use:
“The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts across all languages.”
Expressive Speech:
“Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher. Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.”

Example Configurations

# Professional narration
wav = model.generate(
    text,
    cfg_weight=0.5,
    exaggeration=0.3  # Controlled, professional
)

# Natural conversation
wav = model.generate(
    text,
    cfg_weight=0.5,
    exaggeration=0.5  # Balanced
)

# Dramatic storytelling
wav = model.generate(
    text,
    cfg_weight=0.3,   # Slower pacing
    exaggeration=0.8  # Very expressive
)
Turbo Model: Chatterbox Turbo ignores exaggeration during generate(). It only uses exaggeration when you explicitly call prepare_conditionals().

Sampling Parameters

These parameters control the randomness and diversity of speech generation.

temperature

Range: 0.0 to 2.0+
Default: 0.8
Controls randomness in token selection. Higher values produce more variation.
# More consistent, predictable output
wav = model.generate(text, temperature=0.6)

# Default balance
wav = model.generate(text, temperature=0.8)

# More variation and creativity
wav = model.generate(text, temperature=1.0)

repetition_penalty

Range: 1.0 to 2.5+
Default: 1.2 (Turbo and standard), 2.0 (Multilingual)
Penalizes repeated tokens to reduce repetitive speech patterns.
# No penalty - may repeat more
wav = model.generate(text, repetition_penalty=1.0)

# Turbo default - light penalty
wav = model.generate(text, repetition_penalty=1.2)

# Multilingual default - stronger penalty  
wav = model.generate(text, repetition_penalty=2.0)

top_p (Nucleus Sampling)

Range: 0.0 to 1.0
Default: 0.95 (Turbo), 1.0 (others)
Nucleus sampling keeps the smallest set of most probable tokens whose cumulative probability reaches top_p; everything outside that set is discarded.
# More focused on likely tokens
wav = model.generate(text, top_p=0.9)

# Turbo default
wav = model.generate(text, top_p=0.95)

# No filtering
wav = model.generate(text, top_p=1.0)

top_k

Range: 1 to 10000+
Default: 1000 (Turbo only)
Keeps only the top K most probable tokens. Only used by the Turbo model.
# Turbo model only
wav = model.generate(text, top_k=500)   # More conservative
wav = model.generate(text, top_k=1000)  # Default
wav = model.generate(text, top_k=2000)  # More variety

min_p

Range: 0.0 to 1.0
Default: 0.05 (standard and multilingual), 0.0 (Turbo, where it is ignored)
Sets a minimum probability threshold for token selection.
# Standard/Multilingual models
wav = model.generate(text, min_p=0.05)  # Default
wav = model.generate(text, min_p=0.10)  # More conservative
Turbo Model: Chatterbox Turbo ignores min_p. It only applies to standard and multilingual models.
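To make these knobs concrete, here is a plain-Python sketch of how temperature, top_k, top_p, and min_p typically interact in a sampling pipeline. This is an illustrative reimplementation of the standard filtering techniques, not Chatterbox's internal code, and the function name is hypothetical:

```python
import math

def sample_filters(logits, temperature=0.8, top_k=None, top_p=1.0, min_p=0.0):
    """Return the token indices that survive the sampling filters."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices by probability, highest first
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)

    # top_k: keep only the K most probable tokens
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p (nucleus): keep the smallest prefix whose cumulative prob reaches top_p
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    ranked = kept

    # min_p: drop tokens far below the single best candidate
    threshold = min_p * ranked[0][1]
    return [i for i, p in ranked if p >= threshold]
```

The filters apply in sequence: temperature reshapes the distribution, top_k and top_p prune the low-probability tail, and min_p removes anything far weaker than the best candidate.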

Audio Processing Options

norm_loudness

Type: Boolean
Default: True (Turbo only)
Normalizes the loudness of the reference audio before processing.
# Normalize reference audio (default for Turbo)
wav = model.generate(
    text,
    audio_prompt_path="reference.wav",
    norm_loudness=True
)

# Skip normalization
wav = model.generate(
    text,
    audio_prompt_path="reference.wav",
    norm_loudness=False
)
Loudness normalization uses LUFS (Loudness Units relative to Full Scale) with a target of -27 LUFS, ensuring consistent volume levels across different reference audio files. From tts_turbo.py (lines 204-215)
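For intuition, the effect of loudness normalization can be sketched with a simplified RMS-based version in plain Python. True LUFS measurement (as used in tts_turbo.py) applies perceptual filtering per ITU-R BS.1770; this sketch only matches overall gain to a dB target and is not the library's implementation:

```python
import math

TARGET_DB = -27.0  # same target level Chatterbox Turbo uses for LUFS

def rms_db(samples):
    """RMS level in dBFS - a rough stand-in for a true LUFS measurement."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms)

def normalize_loudness(samples, target_db=TARGET_DB):
    """Scale samples so their RMS level matches target_db."""
    gain_db = target_db - rms_db(samples)
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain for s in samples]
```

Matching every reference clip to the same target this way is what keeps cloned-voice volume consistent across recordings made at different levels.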

Model-Specific Parameter Support

| Parameter | Turbo | Standard | Multilingual |
|-----------|-------|----------|--------------|
| cfg_weight | ❌ Ignored | ✅ Supported | ✅ Supported |
| exaggeration | ⚠️ Only in prepare_conditionals() | ✅ Supported | ✅ Supported |
| min_p | ❌ Ignored | ✅ Supported | ✅ Supported |
| top_k | ✅ Supported | ❌ Not used | ❌ Not used |
| norm_loudness | ✅ Supported | ❌ Not used | ❌ Not used |
| temperature | ✅ Supported | ✅ Supported | ✅ Supported |
| repetition_penalty | ✅ Supported | ✅ Supported | ✅ Supported |
| top_p | ✅ Supported | ✅ Supported | ✅ Supported |
When you pass ignored parameters to a model, you’ll see a warning but generation will continue:
WARNING: CFG, min_p and exaggeration are not supported by Turbo version and will be ignored.
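If you drive several model variants from shared code, you can pre-filter keyword arguments against the support table above to keep your own logs free of these warnings. The helper and mapping below are illustrative, not part of the Chatterbox API:

```python
# Mirrors the support table above; names and structure are hypothetical.
SUPPORTED_PARAMS = {
    "turbo": {"temperature", "repetition_penalty", "top_p", "top_k", "norm_loudness"},
    "standard": {"temperature", "repetition_penalty", "top_p", "min_p",
                 "cfg_weight", "exaggeration"},
    "multilingual": {"temperature", "repetition_penalty", "top_p", "min_p",
                     "cfg_weight", "exaggeration"},
}

def filter_kwargs(model_kind, **kwargs):
    """Drop parameters the given model variant ignores, warning as we go."""
    supported = SUPPORTED_PARAMS[model_kind]
    for name in kwargs:
        if name not in supported:
            print(f"WARNING: {name} is ignored by the {model_kind} model")
    return {k: v for k, v in kwargs.items() if k in supported}
```

For example, `model.generate(text, **filter_kwargs("turbo", cfg_weight=0.5, top_p=0.95))` would pass only `top_p` through to the Turbo model.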

Complete Configuration Examples

Turbo Model - Voice Agent

from chatterbox.tts_turbo import ChatterboxTurboTTS
import torchaudio as ta

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

text = "Hi there! [chuckle] How can I help you today?"
wav = model.generate(
    text,
    audio_prompt_path="agent_voice.wav",
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.95,
    top_k=1000,
    norm_loudness=True
)
ta.save("agent_output.wav", wav, model.sr)

Standard Model - Professional Narration

from chatterbox.tts import ChatterboxTTS
import torchaudio as ta

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Welcome to this comprehensive guide on audio synthesis technology."
wav = model.generate(
    text,
    audio_prompt_path="narrator.wav",
    cfg_weight=0.5,
    exaggeration=0.3,  # Professional, controlled
    temperature=0.7,   # Consistent output
    repetition_penalty=1.2,
    min_p=0.05,
    top_p=1.0
)
ta.save("narration.wav", wav, model.sr)

Multilingual Model - Dramatic Speech

from chatterbox.mtl_tts import ChatterboxMultilingualTTS
import torchaudio as ta

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

text = "¡Esto es absolutamente increíble!"
wav = model.generate(
    text,
    language_id="es",
    audio_prompt_path="expressive_speaker.wav",
    cfg_weight=0.3,    # Slower pacing
    exaggeration=0.8,  # Very expressive
    temperature=0.9,   # More variation
    repetition_penalty=2.0,
    min_p=0.05,
    top_p=1.0
)
ta.save("dramatic_spanish.wav", wav, model.sr)

Default Values Summary

Chatterbox Turbo

model.generate(
    text,
    audio_prompt_path=None,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.95,
    top_k=1000,
    min_p=0.0,          # Ignored
    cfg_weight=0.0,     # Ignored  
    exaggeration=0.0,   # Ignored
    norm_loudness=True
)

Standard Chatterbox

model.generate(
    text,
    audio_prompt_path=None,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=1.0,
    min_p=0.05,
    cfg_weight=0.5,
    exaggeration=0.5
)

Chatterbox Multilingual

model.generate(
    text,
    language_id="en",  # Required
    audio_prompt_path=None,
    temperature=0.8,
    repetition_penalty=2.0,  # Higher than others
    top_p=1.0,
    min_p=0.05,
    cfg_weight=0.5,
    exaggeration=0.5
)

Performance Optimization

For Maximum Speed

  1. Use CUDA device with NVIDIA GPU
  2. Use Chatterbox Turbo (350M params vs 500M)
  3. Keep reference audio at 10 seconds or less
  4. Reuse conditionals for the same voice
# Pre-compute conditionals once
model.prepare_conditionals("voice.wav")

# Generate multiple times without reprocessing
for text in text_list:
    wav = model.generate(text)  # No audio_prompt_path needed

For Best Quality

  1. Use Standard Chatterbox or Multilingual, which expose more tuning parameters
  2. Tune cfg_weight and exaggeration for your use case
  3. Use high-quality reference audio (22050Hz+)
  4. Adjust temperature for consistency vs. variation
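The two audio recommendations above (reference clips of 10 seconds or less, sampled at 22050 Hz or higher) are easy to lint before generation. This hypothetical stdlib helper handles WAV references only:

```python
import wave

def check_reference(path, min_rate=22050, max_seconds=10.0):
    """Return a list of issues found in a WAV reference clip, empty if none."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        seconds = wf.getnframes() / rate
    issues = []
    if rate < min_rate:
        issues.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if seconds > max_seconds:
        issues.append(f"duration {seconds:.1f}s exceeds {max_seconds}s")
    return issues
```

Running it before calling generate() catches low-quality or overlong references early, instead of discovering them through degraded output.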

Troubleshooting Configuration Issues

Output too fast

# Lower cfg_weight
wav = model.generate(text, cfg_weight=0.3)

# Reduce exaggeration
wav = model.generate(text, exaggeration=0.3)

Output too monotone

# Increase exaggeration
wav = model.generate(text, exaggeration=0.7)

# Increase temperature for variation
wav = model.generate(text, temperature=1.0)

Repetitive speech

# Increase repetition penalty
wav = model.generate(text, repetition_penalty=2.5)

Voice doesn’t match reference

# Increase cfg_weight
wav = model.generate(text, cfg_weight=0.7)

# Check reference audio quality and duration
