Overview
ChatterboxTTS is Resemble AI's flagship English text-to-speech model, offering fine-grained control over voice characteristics and generation quality. It supports classifier-free guidance and multiple sampling strategies for high-quality, expressive speech synthesis.
Class Signature
Parameters
The T3 model instance, which converts text into speech tokens
The S3Gen vocoder model instance for token-to-audio conversion
Voice encoder for extracting speaker embeddings from reference audio
English text tokenizer instance
Device to run inference on ("cuda", "cpu", or "mps")
Optional pre-computed conditionals for voice and style. See the Conditionals reference.
Class Methods
from_pretrained()
Load the pre-trained ChatterboxTTS model from Hugging Face.
Parameters
Device to load the model on ("cuda", "cpu", or "mps"). Automatically falls back to "cpu" if MPS is unavailable
Returns
Initialized ChatterboxTTS model with pre-trained weights from ResembleAI/chatterbox
Example
from_local()
Load the model from a local checkpoint directory.
Parameters
Path to the directory containing model checkpoint files
Device to load the model on ("cuda", "cpu", or "mps")
Returns
Initialized ChatterboxTTS model with weights loaded from local directory
Instance Methods
prepare_conditionals()
Prepare voice conditionals from an audio prompt for subsequent generation calls.
Parameters
Path to the audio file to use as voice reference
Voice exaggeration level (0.0 to 1.0). Higher values produce more expressive speech
Example
generate()
Generate speech from text using the prepared voice conditionals.
Parameters
The text to convert to speech
Penalty for repeating tokens (1.0 = no penalty, higher values discourage repetition)
Minimum probability threshold for sampling. Filters out tokens below this probability
Nucleus sampling threshold (0.0 to 1.0). Only tokens with cumulative probability up to top_p are considered
Optional path to an audio file for voice cloning. If provided, overrides any existing conditionals
Voice exaggeration level (0.0 to 1.0). Higher values produce more expressive and animated speech
Classifier-free guidance weight (0.0 to 1.0+). Higher values increase adherence to conditioning
Sampling temperature (higher = more random, lower = more deterministic)
Returns
Generated audio waveform as a PyTorch tensor with shape [1, samples]. Sample rate is 44100 Hz (accessible via model.sr). Audio includes perceptual watermarking.
Example
Attributes
Sample rate of generated audio (44100 Hz)
Device the model is running on
Current voice conditionals used for generation
Notes
- This model supports classifier-free guidance (CFG) for improved quality control
- Generated audio is automatically watermarked using the Perth implicit watermarker
- Text is automatically normalized (capitalization, punctuation) before generation
- The exaggeration parameter can be updated on-the-fly without re-preparing conditionals