
Overview

Chatterbox-Multilingual extends the capabilities of the base Chatterbox model to support 23+ languages, making it ideal for global applications and localization projects. With 500M parameters, it maintains high-quality voice cloning while providing multilingual support.

23+ Languages

Support for major world languages including European, Asian, and Middle Eastern languages.

Zero-Shot Cloning

Clone voices across languages without fine-tuning or training.

CFG Control

Same advanced controls as the base model for shaping the generated output.

Cross-Language

Transfer voices across different languages while preserving characteristics.

Model Specifications

  • Model Size: 500M parameters
  • Languages: 23+ supported languages
  • Sample Rate: 24,000 Hz
  • Architecture: T3 transformer (multilingual config) + S3Gen decoder
  • Repository: ResembleAI/chatterbox

Supported Languages

The multilingual model supports the following 23 languages:
  • Danish (da)
  • German (de)
  • Greek (el)
  • English (en)
  • Spanish (es)
  • Finnish (fi)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Norwegian (no)
  • Polish (pl)
  • Portuguese (pt)
  • Russian (ru)
  • Swedish (sv)
  • Chinese (zh)
  • Japanese (ja)
  • Korean (ko)
  • Hindi (hi)
  • Malay (ms)
  • Arabic (ar)
  • Hebrew (he)
  • Turkish (tr)
  • Swahili (sw)

Hardware Requirements

Minimum (CPU)

  • 6GB RAM
  • CPU inference supported
  • Slower generation times

Recommended (GPU)

  • NVIDIA GPU with 6GB+ VRAM
  • CUDA support
  • Near real-time generation
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
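Choosing between CUDA, MPS, and CPU can be sketched as a small helper. This is an illustrative pattern, not part of the Chatterbox API; in practice you would feed it `torch.cuda.is_available()` and `torch.backends.mps.is_available()` from PyTorch:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon (MPS), else fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# Typical call site (assumes torch is installed):
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
#   model = ChatterboxMultilingualTTS.from_pretrained(device=device)
```

The string returned is what you would pass as the `device` argument to `from_pretrained`.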

Usage

Basic Generation

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load the multilingual model
model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Generate French speech
french_text = "Bonjour, comment ça va? Ceci est le modèle de synthèse vocale multilingue Chatterbox, il prend en charge 23 langues."
wav = model.generate(french_text, language_id="fr")

ta.save("test-french.wav", wav, model.sr)

Multiple Languages

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Spanish
spanish_text = "Hola, ¿cómo estás? Este es un ejemplo de síntesis de voz en español."
wav_spanish = model.generate(spanish_text, language_id="es")
ta.save("test-spanish.wav", wav_spanish, model.sr)

# Chinese
chinese_text = "你好,今天天气真不错,希望你有一个愉快的周末。"
wav_chinese = model.generate(chinese_text, language_id="zh")
ta.save("test-chinese.wav", wav_chinese, model.sr)

# German
german_text = "Guten Tag, wie geht es Ihnen? Dies ist ein Test der deutschen Sprachsynthese."
wav_german = model.generate(german_text, language_id="de")
ta.save("test-german.wav", wav_german, model.sr)
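Rather than repeating the load/generate/save pattern per language, the same jobs can be driven from a mapping of language codes to texts. `batch_jobs` below is a hypothetical helper, not part of the Chatterbox API; it only prepares the arguments you would pass to `model.generate` and `ta.save`:

```python
def batch_jobs(texts: dict) -> list:
    """Turn a {language_id: text} mapping into
    (output_filename, text, language_id) tuples."""
    return [(f"test-{code}.wav", text, code) for code, text in texts.items()]

jobs = batch_jobs({
    "es": "Hola, ¿cómo estás?",
    "zh": "你好,今天天气真不错。",
    "de": "Guten Tag, wie geht es Ihnen?",
})

# With a loaded model, the driver loop would be:
#   for filename, text, code in jobs:
#       wav = model.generate(text, language_id=code)
#       ta.save(filename, wav, model.sr)
```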

Voice Cloning Across Languages

import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

# Clone a voice and use it for different languages
reference_audio = "english_speaker.wav"

# Generate in French with English speaker's voice
french_text = "Bonjour, je parle français maintenant."
wav = model.generate(
    french_text,
    language_id="fr",
    audio_prompt_path=reference_audio
)

ta.save("french-with-english-voice.wav", wav, model.sr)
When using cross-language voice cloning, ensure the reference clip matches the target language, or set cfg_weight=0 to avoid accent transfer.

Getting Supported Languages

from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Get dictionary of all supported languages
languages = ChatterboxMultilingualTTS.get_supported_languages()

for code, name in languages.items():
    print(f"{code}: {name}")
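Before calling `generate`, you can guard against typos in `language_id`. The `SUPPORTED` set below is hard-coded from the language list above for illustration only; in real code, derive it from `ChatterboxMultilingualTTS.get_supported_languages()` instead:

```python
# Hard-coded here as an assumption; prefer deriving this from
# ChatterboxMultilingualTTS.get_supported_languages().keys().
SUPPORTED = {
    "da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "no", "pl", "pt",
    "ru", "sv", "zh", "ja", "ko", "hi", "ms", "ar", "he", "tr", "sw",
}

def check_language(code: str) -> str:
    """Normalize a language code and fail fast if it is unsupported."""
    code = code.lower()
    if code not in SUPPORTED:
        raise ValueError(f"Unsupported language_id: {code!r}")
    return code
```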

Generation Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| language_id | Required | Two-letter language code (e.g., “fr”, “es”, “zh”) |
| temperature | 0.8 | Controls randomness in token selection |
| top_p | 1.0 | Nucleus sampling threshold |
| min_p | 0.05 | Minimum probability threshold |
| repetition_penalty | 2.0 | Penalizes repeated tokens |
| cfg_weight | 0.5 | Classifier-free guidance strength |
| exaggeration | 0.5 | Emotional intensity level |
| audio_prompt_path | None | Path to reference audio for voice cloning |
The language_id parameter is required for the multilingual model. Use the two-letter ISO 639-1 language codes.
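The defaults from the table can be captured in a plain dict so that overrides stay explicit. `generation_kwargs` is a hypothetical convenience, not part of the library; its result is what you would splat into `model.generate(text, **kwargs)`:

```python
# Defaults as documented in the parameter table above.
DEFAULTS = {
    "temperature": 0.8,
    "top_p": 1.0,
    "min_p": 0.05,
    "repetition_penalty": 2.0,
    "cfg_weight": 0.5,
    "exaggeration": 0.5,
}

def generation_kwargs(language_id: str, **overrides) -> dict:
    """Merge documented defaults with caller overrides.
    language_id is required; everything else is optional."""
    kwargs = {**DEFAULTS, **overrides}
    kwargs["language_id"] = language_id
    return kwargs

# Example: lower cfg_weight for a cross-language cloning run.
kwargs = generation_kwargs("fr", cfg_weight=0.3)
```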

Best Practices

Language-Specific Tips

  • Chinese: The model handles Chinese characters and punctuation. Use appropriate Chinese punctuation marks (。?!) for best results.
  • Japanese: Hiragana, Katakana, and Kanji are all supported. The model automatically handles mixed scripts.
  • Arabic and Hebrew: Right-to-left text is properly handled. Include appropriate diacritical marks for best pronunciation.
  • European languages: The model handles accented characters (é, ñ, ü, etc.) naturally. Include proper accents for accurate pronunciation.

Cross-Language Voice Cloning

  1. Matching Languages: For best results, use a reference clip in the same language as your target text
  2. Accent Transfer: If accent transfer occurs with cross-language cloning, set cfg_weight=0
  3. Reference Quality: Use clear, noise-free reference audio for consistent results
  4. Default Settings: The default parameters (exaggeration=0.5, cfg_weight=0.5) work well across all languages
For cross-language applications where you need to maintain voice consistency across multiple languages, start with a high-quality reference clip and test with cfg_weight values between 0.3 and 0.5.
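For such a test sweep, a small helper can enumerate evenly spaced cfg_weight values; `cfg_sweep` is illustrative, not part of the library:

```python
def cfg_sweep(lo: float = 0.3, hi: float = 0.5, steps: int = 3) -> list:
    """Evenly spaced cfg_weight values from lo to hi, inclusive."""
    step = (hi - lo) / (steps - 1)
    return [round(lo + i * step, 2) for i in range(steps)]

# With a loaded model and reference clip, the sweep loop would be:
#   for w in cfg_sweep():
#       wav = model.generate(text, language_id="fr",
#                            audio_prompt_path=reference_audio,
#                            cfg_weight=w)
#       ta.save(f"sweep-cfg-{w}.wav", wav, model.sr)
```

Listening to the resulting files side by side makes it easy to pick the value that best balances voice similarity against accent transfer.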

Performance Characteristics

Generation Speed

Similar to base Chatterbox model with 10-step decoding. Speed varies slightly by language complexity.

Audio Quality

High-fidelity 24kHz output across all supported languages with natural prosody and intonation.

Built-in Watermarking

Every audio file generated by Chatterbox-Multilingual includes Resemble AI’s Perth (Perceptual Threshold) watermark: an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations.
import perth
import librosa

# Load the watermarked audio
watermarked_audio, sr = librosa.load("output.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • Global Applications: Build TTS systems for international audiences
  • Localization: Create localized content in multiple languages
  • Language Learning: Generate pronunciation examples in various languages
  • Multilingual Voice Agents: Conversational AI that speaks multiple languages
  • Content Translation: Convert written content to speech across languages
  • Accessibility: Text-to-speech for global accessibility features

Language Support Details

The model includes comprehensive language support with proper handling of:
  • Language-specific punctuation
  • Diacritical marks and accents
  • Script systems (Latin, Cyrillic, Arabic, Chinese, Japanese, Korean)
  • Language-appropriate prosody and intonation
  • Cultural speech patterns

Comparison with Other Models

| Feature | Chatterbox | Chatterbox-Turbo | Chatterbox-Multilingual |
| --- | --- | --- | --- |
| Parameters | 500M | 350M | 500M |
| Languages | English | English | 23+ |
| CFG Control | Yes | No | Yes |
| Exaggeration | Yes | No | Yes |
| Speed | Medium | Fast | Medium |
| Best For | Creative control | Low latency | Multi-language |

Next Steps

Installation

Install Chatterbox and get started

API Reference

Explore all parameters and methods
