Overview
ChatterboxTurboTTS is the fastest text-to-speech model in the Chatterbox family, optimized for low-latency inference. It uses a streamlined architecture with 2 conditional flow matching (CFM) timesteps for rapid audio generation while maintaining high quality.
Class Signature
Parameters
The T3 model instance that converts text into speech tokens
The S3Gen vocoder model instance for token-to-audio conversion
Voice encoder for extracting speaker embeddings from reference audio
English text tokenizer instance
Device to run inference on ("cuda", "cpu", or "mps")
Optional pre-computed conditionals for voice and style. See Conditionals reference
Class Methods
from_pretrained()
Load the pre-trained ChatterboxTurboTTS model from Hugging Face.
Parameters
Device to load the model on ("cuda", "cpu", or "mps"). Automatically falls back to "cpu" if MPS is not available
Returns
Initialized ChatterboxTurboTTS model with pre-trained weights from ResembleAI/chatterbox-turbo
Example
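A minimal loading sketch. The `chatterbox.tts` import path, the helper name `load_turbo`, and the CUDA-detection logic are assumptions for illustration, not taken from this page:

```python
def load_turbo(device=None):
    """Load the pre-trained Turbo model, falling back to CPU when CUDA is absent."""
    import torch
    from chatterbox.tts import ChatterboxTurboTTS  # assumed import path

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    return ChatterboxTurboTTS.from_pretrained(device=device)

# model = load_turbo()  # downloads ResembleAI/chatterbox-turbo on first call
```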
from_local()
Load the model from a local checkpoint directory.
Parameters
Path to the directory containing model checkpoint files
Device to load the model on ("cuda", "cpu", or "mps")
Returns
Initialized ChatterboxTurboTTS model with weights loaded from local directory
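Example
A hedged sketch of loading from a local checkpoint. The exact `from_local` argument order, the `chatterbox.tts` import path, and the helper name are assumptions:

```python
def load_turbo_from_checkpoint(ckpt_dir, device="cpu"):
    """Load ChatterboxTurboTTS weights from a local checkpoint directory."""
    from chatterbox.tts import ChatterboxTurboTTS  # assumed import path
    return ChatterboxTurboTTS.from_local(ckpt_dir, device)

# model = load_turbo_from_checkpoint("./checkpoints/chatterbox-turbo", "cuda")
```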
Instance Methods
prepare_conditionals()
Prepare voice conditionals from an audio prompt for subsequent generation calls.
Parameters
Path to the audio file to use as voice reference. Must be at least 5 seconds long
Voice exaggeration level (0.0 to 1.0). Higher values produce more expressive speech
Whether to normalize the loudness of the reference audio to -27 LUFS
Example
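A usage sketch for caching a reference voice before repeated generation. The keyword name `exaggeration` is assumed from the parameter description above; the helper name is hypothetical:

```python
def clone_voice(model, ref_wav_path):
    """Cache voice conditionals from a reference clip (must be >= 5 s long).

    Subsequent generate() calls reuse these conditionals until they are
    replaced by another prepare_conditionals() call or an audio prompt.
    """
    # exaggeration keyword is an assumption; 0.5 is an illustrative mid value
    model.prepare_conditionals(ref_wav_path, exaggeration=0.5)

# clone_voice(model, "reference_voice.wav")
```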
generate()
Generate speech from text using the prepared voice conditionals.
Parameters
The text to convert to speech
Penalty for repeating tokens (1.0 = no penalty, higher values discourage repetition)
Minimum probability threshold for sampling. Not supported in Turbo version and will be ignored
Nucleus sampling threshold (0.0 to 1.0). Only tokens with cumulative probability up to top_p are considered
Optional path to audio file for voice cloning. If provided, will override existing conditionals
Voice exaggeration level. Not supported in Turbo version and will be ignored
Classifier-free guidance weight. Not supported in Turbo version and will be ignored
Sampling temperature (higher = more random, lower = more deterministic)
Number of top tokens to consider during sampling
Whether to normalize the loudness of the audio prompt if provided
Returns
Generated audio waveform as a PyTorch tensor with shape [1, samples]. Sample rate is 44100 Hz (accessible via model.sr). Audio includes perceptual watermarking
Example
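A generation-and-save sketch. The keyword names `temperature`, `top_p`, and `repetition_penalty` follow the parameter descriptions above but their exact spelling is an assumption, as are the illustrative sampling values and the helper name:

```python
def synthesize(model, text, out_path="output.wav"):
    """Generate watermarked speech and save it at the model's 44.1 kHz rate."""
    import torchaudio

    wav = model.generate(
        text,
        temperature=0.8,          # illustrative values, not documented defaults
        top_p=0.95,
        repetition_penalty=1.2,
    )
    # wav has shape [1, samples]; model.sr is 44100
    torchaudio.save(out_path, wav, model.sr)
    return wav

# synthesize(model, "Hello from Chatterbox Turbo.")
```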
Attributes
Sample rate of generated audio (44100 Hz)
Device the model is running on
Current voice conditionals used for generation
Notes
- The Turbo model does not support the cfg_weight, min_p, or exaggeration parameters; these will be ignored with a warning
- Audio prompts must be at least 5 seconds long
- Generated audio is automatically watermarked using the Perth implicit watermarker
- Text is automatically normalized (capitalization, punctuation) before generation