
Overview

Chatterbox-Multilingual extends the capabilities of the base Chatterbox model to support 23+ languages, making it ideal for global applications and localization projects. With 500M parameters, it maintains high-quality voice cloning while providing multilingual support.

23+ Languages

Support for major world languages including European, Asian, and Middle Eastern languages.

Zero-Shot Cloning

Clone voices across languages without fine-tuning or training.

CFG Control

Same advanced controls as the base model for shaping the generated output.

Cross-Language

Transfer voices across different languages while preserving characteristics.

Model Specifications

  • Model Size: 500M parameters
  • Languages: 23+ supported languages
  • Sample Rate: 24,000 Hz
  • Architecture: T3 transformer (multilingual config) + S3Gen decoder
  • Repository: ResembleAI/chatterbox

Supported Languages

The multilingual model supports the following 23 languages:
  • Danish (da)
  • German (de)
  • Greek (el)
  • English (en)
  • Spanish (es)
  • Finnish (fi)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Norwegian (no)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)
  • Swedish (sv)
  • Chinese (zh)
  • Japanese (ja)
  • Korean (ko)
  • Hindi (hi)
  • Malay (ms)
  • Arabic (ar)
  • Hebrew (he)
  • Turkish (tr)
  • Swahili (sw)

Hardware Requirements

Minimum (CPU)

  • 6GB RAM
  • CPU inference supported
  • Slower generation times

Recommended (GPU)

  • NVIDIA GPU with 6GB+ VRAM
  • CUDA support
  • Near real-time generation
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
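Choosing between CUDA, MPS, and CPU can be sketched as a small helper. This is an illustrative pattern, not part of the Chatterbox API; in practice you would feed it `torch.cuda.is_available()` and `torch.backends.mps.is_available()` from PyTorch:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon (MPS), else fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# Typical call site (assumes torch is installed):
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
#   model = ChatterboxMultilingualTTS.from_pretrained(device=device)
```

The string returned is what you would pass as the `device` argument to `from_pretrained`.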

Usage

Basic Generation

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load the multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Generate French speech
french_text = "Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues."
wav = model.generate(french_text, language_id="fr")

ta.save("test-french.wav", wav, model.sr)

Multiple Languages

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Spanish
spanish_text = "Hola, ¿cómo estás? Este es un ejemplo de síntesis de voz en español."
wav_spanish = model.generate(spanish_text, language_id="es")
ta.save("test-spanish.wav", wav_spanish, model.sr)

# Chinese
chinese_text = "你好,今天天气真不错,希望你有一个愉快的周末。"
wav_chinese = model.generate(chinese_text, language_id="zh")
ta.save("test-chinese.wav", wav_chinese, model.sr)

# German
german_text = "Guten Tag, wie geht es Ihnen? Dies ist ein Test der deutschen Sprachsynthese."
wav_german = model.generate(german_text, language_id="de")
ta.save("test-german.wav", wav_german, model.sr)
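Rather than repeating the load/generate/save pattern per language, the same jobs can be driven from a mapping of language codes to texts. `batch_jobs` below is a hypothetical helper, not part of the Chatterbox API; it only prepares the arguments you would pass to `model.generate` and `ta.save`:

```python
def batch_jobs(texts: dict) -> list:
    """Turn a {language_id: text} mapping into
    (output_filename, text, language_id) tuples."""
    return [(f"test-{code}.wav", text, code) for code, text in texts.items()]

jobs = batch_jobs({
    "es": "Hola, ¿cómo estás?",
    "zh": "你好,今天天气真不错。",
    "de": "Guten Tag, wie geht es Ihnen?",
})

# With a loaded model, the driver loop would be:
#   for filename, text, code in jobs:
#       wav = model.generate(text, language_id=code)
#       ta.save(filename, wav, model.sr)
```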

Voice Cloning Across Languages

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Clone a voice and use it for different languages
reference_audio = "english_speaker.wav"

# Generate in French with English speaker's voice
french_text = "Bonjour, je parle français maintenant."
wav = model.generate(
    french_text,
    language_id="fr",
    audio_prompt_path=reference_audio
)

ta.save("french-with-english-voice.wav", wav, model.sr)
When using cross-language voice cloning, ensure the reference clip matches the target language, or set cfg_weight=0 to avoid accent transfer.

Getting Supported Languages

from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Get dictionary of all supported languages
languages = ChatterboxMultilingualTTS.get_supported_languages()

for code, name in languages.items():
    print(f"{code}: {name}")
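Before calling `generate`, you can guard against typos in `language_id`. The `SUPPORTED` set below is hard-coded from the language list above for illustration only; in real code, derive it from `ChatterboxMultilingualTTS.get_supported_languages()` instead:

```python
# Hard-coded here as an assumption; prefer deriving this from
# ChatterboxMultilingualTTS.get_supported_languages().keys().
SUPPORTED = {
    "da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "no", "pl", "pt",
    "ru", "sv", "zh", "ja", "ko", "hi", "ms", "ar", "he", "tr", "sw",
}

def check_language(code: str) -> str:
    """Normalize a language code and fail fast if it is unsupported."""
    code = code.lower()
    if code not in SUPPORTED:
        raise ValueError(f"Unsupported language_id: {code!r}")
    return code
```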

Generation Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| language_id | Required | Two-letter language code (e.g., “fr”, “es”, “zh”) |
| temperature | 0.8 | Controls randomness in token selection |
| top_p | 1.0 | Nucleus sampling threshold |
| min_p | 0.05 | Minimum probability threshold |
| repetition_penalty | 2.0 | Penalizes repeated tokens |
| cfg_weight | 0.5 | Classifier-free guidance strength |
| exaggeration | 0.5 | Emotional intensity level |
| audio_prompt_path | None | Path to reference audio for voice cloning |
The language_id parameter is required for the multilingual model. Use the two-letter ISO 639-1 language codes.
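The defaults from the table can be captured in a plain dict so that overrides stay explicit. `generation_kwargs` is a hypothetical convenience, not part of the library; its result is what you would splat into `model.generate(text, **kwargs)`:

```python
# Defaults as documented in the parameter table above.
DEFAULTS = {
    "temperature": 0.8,
    "top_p": 1.0,
    "min_p": 0.05,
    "repetition_penalty": 2.0,
    "cfg_weight": 0.5,
    "exaggeration": 0.5,
}

def generation_kwargs(language_id: str, **overrides) -> dict:
    """Merge documented defaults with caller overrides.
    language_id is required; everything else is optional."""
    kwargs = {**DEFAULTS, **overrides}
    kwargs["language_id"] = language_id
    return kwargs

# Example: lower cfg_weight for a cross-language cloning run.
kwargs = generation_kwargs("fr", cfg_weight=0.3)
```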

Best Practices

Language-Specific Tips

  • Chinese: The model handles Chinese characters and punctuation. Use appropriate Chinese punctuation marks (。?!) for best results.
  • Japanese: Hiragana, Katakana, and Kanji are all supported. The model automatically handles mixed scripts.
  • Arabic and Hebrew: Right-to-left text is properly handled. Include appropriate diacritical marks for best pronunciation.
  • European languages: The model handles accented characters (é, ñ, ü, etc.) naturally. Include proper accents for accurate pronunciation.

Cross-Language Voice Cloning

  1. Matching Languages: For best results, use a reference clip in the same language as your target text
  2. Accent Transfer: If accent transfer occurs with cross-language cloning, set cfg_weight=0
  3. Reference Quality: Use clear, noise-free reference audio for consistent results
  4. Default Settings: The default parameters (exaggeration=0.5, cfg_weight=0.5) work well across all languages
For cross-language applications where you need to maintain voice consistency across multiple languages, start with a high-quality reference clip and test with cfg_weight values between 0.3 and 0.5.
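For such a test sweep, a small helper can enumerate evenly spaced cfg_weight values; `cfg_sweep` is illustrative, not part of the library:

```python
def cfg_sweep(lo: float = 0.3, hi: float = 0.5, steps: int = 3) -> list:
    """Evenly spaced cfg_weight values from lo to hi, inclusive."""
    step = (hi - lo) / (steps - 1)
    return [round(lo + i * step, 2) for i in range(steps)]

# With a loaded model and reference clip, the sweep loop would be:
#   for w in cfg_sweep():
#       wav = model.generate(text, language_id="fr",
#                            audio_prompt_path=reference_audio,
#                            cfg_weight=w)
#       ta.save(f"sweep-cfg-{w}.wav", wav, model.sr)
```

Listening to the resulting files side by side makes it easy to pick the value that best balances voice similarity against accent transfer.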

Performance Characteristics

Generation Speed

Similar to base Chatterbox model with 10-step decoding. Speed varies slightly by language complexity.

Audio Quality

High-fidelity 24kHz output across all supported languages with natural prosody and intonation.

Built-in Watermarking

Every audio file generated by Chatterbox-Multilingual includes Resemble AI’s Perth (Perceptual Threshold) watermark: an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations.
import perth
import librosa

# Load the watermarked audio
watermarked_audio, sr = librosa.load("output.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • Global Applications: Build TTS systems for international audiences
  • Localization: Create localized content in multiple languages
  • Language Learning: Generate pronunciation examples in various languages
  • Multilingual Voice Agents: Conversational AI that speaks multiple languages
  • Content Translation: Convert written content to speech across languages
  • Accessibility: Text-to-speech for global accessibility features

Language Support Details

The model includes comprehensive language support with proper handling of:
  • Language-specific punctuation
  • Diacritical marks and accents
  • Script systems (Latin, Cyrillic, Arabic, Chinese, Japanese, Korean)
  • Language-appropriate prosody and intonation
  • Cultural speech patterns

Comparison with Other Models

| Feature | Chatterbox | Chatterbox-Turbo | Chatterbox-Multilingual |
| --- | --- | --- | --- |
| Parameters | 500M | 350M | 500M |
| Languages | English | English | 23+ |
| CFG Control | Yes | No | Yes |
| Exaggeration | Yes | No | Yes |
| Speed | Medium | Fast | Medium |
| Best For | Creative control | Low latency | Multi-language |

Next Steps

Installation

Install Chatterbox and get started

API Reference

Explore all parameters and methods
