Overview
ChatterboxMultilingualTTS extends Chatterbox’s capabilities to 23 languages with high-quality voice cloning and synthesis. It supports cross-lingual voice cloning, allowing you to clone a voice in one language and synthesize speech in another.
Class Signature
Parameters
The T3 text-to-speech-tokens model instance, configured for multilingual support
The S3Gen vocoder model instance for token-to-audio conversion
Voice encoder for extracting speaker embeddings from reference audio
Multilingual text tokenizer instance
Device to run inference on (“cuda”, “cpu”, or “mps”)
Optional pre-computed conditionals for voice and style. See Conditionals reference
Class Methods
from_pretrained()
Load the pre-trained ChatterboxMultilingualTTS model from Hugging Face.
Parameters
Device to load the model on (“cuda”, “cpu”, or “mps”)
Returns
Initialized ChatterboxMultilingualTTS model with pre-trained weights from ResembleAI/chatterbox
Example
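A minimal loading sketch. The `chatterbox.mtl_tts` import path follows the public Chatterbox repository, and the `pick_device` helper is an illustrative addition, not part of the library; verify both against your installed version.

```python
def pick_device() -> str:
    """Choose the best available inference device, falling back to CPU.

    Illustrative helper, not part of the chatterbox API.
    """
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"


def load_model():
    # Deferred import so this sketch can be read/tested without chatterbox installed.
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # assumed module path
    return ChatterboxMultilingualTTS.from_pretrained(device=pick_device())
```

On a machine with chatterbox installed, calling `load_model()` downloads the ResembleAI/chatterbox weights on first use; the returned model exposes the sample rate via `model.sr`.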
from_local()
Load the model from a local checkpoint directory.
Parameters
Path to the directory containing model checkpoint files
Device to load the model on (“cuda”, “cpu”, or “mps”)
Returns
Initialized ChatterboxMultilingualTTS model with weights loaded from local directory
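A sketch of loading from a local checkpoint, assuming `from_local(ckpt_dir, device)` takes the two parameters described above; the directory path is illustrative.

```python
def load_local_model(ckpt_dir: str, device: str = "cpu"):
    """Load ChatterboxMultilingualTTS from a local checkpoint directory.

    Assumes from_local(ckpt_dir, device) matches the signature documented above.
    """
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # assumed module path
    return ChatterboxMultilingualTTS.from_local(ckpt_dir, device)


# Usage (path is illustrative):
# model = load_local_model("./checkpoints/chatterbox-multilingual", device="cuda")
```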
get_supported_languages()
Return a dictionary of all supported language codes and their names.
Returns
Dictionary mapping language codes to language names. See Supported Languages for the full list
Example
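A sketch of listing the supported languages, assuming `get_supported_languages()` is callable on the class as documented above:

```python
def list_languages():
    """Print and return the code-to-name mapping of supported languages."""
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # assumed module path
    langs = ChatterboxMultilingualTTS.get_supported_languages()
    # langs maps language codes to human-readable names, e.g. "en" -> "English"
    for code, name in sorted(langs.items()):
        print(f"{code}: {name}")
    return langs
```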
Instance Methods
prepare_conditionals()
Prepare voice conditionals from an audio prompt for subsequent generation calls.
Parameters
Path to the audio file to use as voice reference
Voice exaggeration level (0.0 to 1.0). Higher values produce more expressive speech
Example
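A sketch of preparing conditionals once and reusing them across calls. The reference file path is illustrative, and the `language_id` keyword is an assumption based on the Chatterbox repository's usage examples:

```python
def clone_voice(model, ref_wav: str = "reference.wav"):
    """Extract speaker conditionals once, then generate without re-passing audio."""
    # Compute and cache voice/style conditionals from the reference recording.
    model.prepare_conditionals(ref_wav, exaggeration=0.6)
    # Later generate() calls can omit audio_prompt_path and reuse model.conds.
    return model.generate("Bonjour tout le monde.", language_id="fr")
```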
generate()
Generate speech from text in the specified language using the prepared voice conditionals.
Parameters
The text to convert to speech
Language code for the text (e.g., “en”, “es”, “fr”). See Supported Languages for valid codes
Optional path to audio file for voice cloning. If provided, will override existing conditionals
Voice exaggeration level (0.0 to 1.0). Higher values produce more expressive and animated speech
Classifier-free guidance weight (0.0 to 1.0+). Higher values increase adherence to conditioning
Sampling temperature (higher = more random, lower = more deterministic)
Penalty for repeating tokens (1.0 = no penalty, higher values discourage repetition)
Minimum probability threshold for sampling. Filters out tokens below this probability
Nucleus sampling threshold (0.0 to 1.0). Only tokens with cumulative probability up to top_p are considered
Returns
Generated audio waveform as a PyTorch tensor with shape [1, samples]. Sample rate is 44100 Hz (accessible via model.sr). Audio includes perceptual watermarking.
Example
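A cross-lingual generation sketch covering the parameters above. The `language_id` keyword follows the Chatterbox repository's usage examples, the sampling values are illustrative rather than library defaults, and the file paths are placeholders:

```python
def synthesize(model, out_path: str = "out.wav"):
    """Clone an English reference voice and speak Spanish text with it."""
    import torchaudio

    wav = model.generate(
        "¿Cómo estás hoy?",
        language_id="es",                    # assumed keyword for the language code
        audio_prompt_path="speaker_en.wav",  # cross-lingual cloning from an English reference
        exaggeration=0.5,                    # expressiveness, 0.0-1.0
        cfg_weight=0.5,                      # classifier-free guidance strength
        temperature=0.8,                     # sampling randomness
        repetition_penalty=2.0,              # discourage repeated tokens
        min_p=0.05,                          # minimum-probability sampling floor
        top_p=1.0,                           # nucleus sampling threshold
    )
    # wav has shape [1, samples]; model.sr is the output sample rate.
    torchaudio.save(out_path, wav, model.sr)
    return wav
```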
Attributes
Sample rate of generated audio (44100 Hz)
Device the model is running on
Current voice conditionals used for generation
Notes
- Supports 23 languages with cross-lingual voice cloning capabilities
- Language code validation is performed automatically; invalid codes will raise a ValueError
- Generated audio is automatically watermarked using the Perth implicit watermarker
- Text is automatically normalized for the target language (capitalization, punctuation)
- The exaggeration parameter can be updated on-the-fly without re-preparing conditionals
- Higher repetition_penalty (default 2.0) helps prevent repetition in multilingual synthesis
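The on-the-fly exaggeration note above can be sketched as follows; the text, file path, and `language_id` keyword are illustrative assumptions:

```python
def vary_expressiveness(model):
    """Reuse one set of conditionals while changing exaggeration per call."""
    # Prepare conditionals once from a single reference recording...
    model.prepare_conditionals("reference.wav", exaggeration=0.3)
    calm = model.generate("Guten Morgen.", language_id="de")
    # ...then raise exaggeration on a later call without re-preparing conditionals.
    lively = model.generate("Guten Morgen!", language_id="de", exaggeration=0.9)
    return calm, lively
```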