Fine-tune speech generation with parameters for expressiveness, device selection, and quality control
Chatterbox provides several configuration parameters to customize your speech generation. These settings control expressiveness, voice characteristics, sampling behavior, and performance.
Auto-detection: The models automatically fall back to CPU if the requested device is unavailable. For Apple Silicon Macs without MPS support, the model will use CPU automatically.
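The fallback described above can also be approximated in your own code before a model is loaded. The following is a minimal sketch, not Chatterbox's internal logic: `pick_device` is a hypothetical helper, and the availability flags are passed in explicitly so the function can be exercised on any machine.

```python
def pick_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Return the requested device if it is usable, otherwise fall back to "cpu".

    Hypothetical helper for illustration; not part of the Chatterbox API.
    """
    available = {"cpu": True, "cuda": cuda_available, "mps": mps_available}
    # Strip an index such as "cuda:0" before looking up the backend name.
    backend = requested.split(":")[0].lower()
    if available.get(backend, False):
        return requested
    # Unknown or unavailable backends fall back to CPU.
    return "cpu"
```

With PyTorch installed, the flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the result can be passed as `device=` to `from_pretrained`.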
Range: 0.0 to 1.0 (typically)
Default: 0.5 (standard models), 0.0 (Turbo - ignored)

cfg_weight controls how strongly the model follows the reference voice characteristics. Higher values make the output more similar to the reference audio.
    # Light conditioning - more variation from reference
    wav = model.generate(text, cfg_weight=0.3)

    # Default - balanced
    wav = model.generate(text, cfg_weight=0.5)

    # Strong conditioning - closer to reference
    wav = model.generate(text, cfg_weight=0.7)
If your reference speaker talks very quickly, lower cfg_weight to improve pacing:
    wav = model.generate(
        text,
        audio_prompt_path="fast_speaker.wav",
        cfg_weight=0.3,  # Slows down pacing
    )
From README: “If the reference speaker has a fast speaking style, lowering cfg_weight to around 0.3 can improve pacing.”
Cross-language synthesis
When using a voice from one language to speak another, set cfg_weight=0 to reduce accent transfer:
    # English voice speaking French with minimal accent
    wav = multilingual_model.generate(
        "Bonjour!",
        language_id="fr",
        audio_prompt_path="english_speaker.wav",
        cfg_weight=0.0,  # Reduces English accent
    )
From README: “To mitigate [accent transfer], set cfg_weight to 0.”
Expressive or dramatic speech
For more expressive output, combine lower cfg_weight with higher exaggeration:
    wav = model.generate(
        text,
        cfg_weight=0.3,    # Lower for slower pacing
        exaggeration=0.7,  # Higher for more expression
    )
From README: “Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher.”
Turbo Model: Chatterbox Turbo ignores cfg_weight during generation. The parameter only applies to standard Chatterbox and multilingual models.
“The default settings (exaggeration=0.5, cfg_weight=0.5) work well for most prompts across all languages.”
Expressive Speech:
“Try lower cfg_weight values (e.g. ~0.3) and increase exaggeration to around 0.7 or higher. Higher exaggeration tends to speed up speech; reducing cfg_weight helps compensate with slower, more deliberate pacing.”
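The README recommendations quoted above can be collected into named presets so that call sites stay consistent. This is a sketch under stated assumptions: the preset names and the `generation_kwargs` helper are hypothetical and not part of the Chatterbox API. Only the numeric values come from the guidance above. The `turbo` flag drops `cfg_weight`, since Turbo ignores it. How Turbo treats other parameters is not covered here.

```python
# Values follow the README guidance quoted above; the preset names and
# this helper are hypothetical, not part of the Chatterbox API.
PRESETS = {
    "default": {"exaggeration": 0.5, "cfg_weight": 0.5},
    "fast_speaker": {"cfg_weight": 0.3},
    "cross_language": {"cfg_weight": 0.0},
    "expressive": {"exaggeration": 0.7, "cfg_weight": 0.3},
}


def generation_kwargs(preset: str = "default", turbo: bool = False, **overrides) -> dict:
    """Build keyword arguments for model.generate() from a named preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    # Explicit overrides take precedence over the preset values.
    kwargs = {**PRESETS[preset], **overrides}
    if turbo:
        # Turbo ignores cfg_weight, so leave it out rather than imply an effect.
        kwargs.pop("cfg_weight", None)
    return kwargs
```

A call would then look like `wav = model.generate(text, **generation_kwargs("expressive"))`.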
Loudness normalization uses LUFS (Loudness Units relative to Full Scale) with a target of -27 LUFS, ensuring consistent volume levels across different reference audio files.
From tts_turbo.py (lines 204-215)
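To make the arithmetic behind a LUFS target concrete, here is a minimal sketch of the gain calculation. It is not the code from tts_turbo.py: the function names are hypothetical, and the measured loudness is taken as an input, because measuring integrated loudness requires a dedicated meter (for example one implementing ITU-R BS.1770).

```python
TARGET_LUFS = -27.0  # target stated in the text above


def loudness_gain(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Linear amplitude gain that moves a signal from measured_lufs to target_lufs."""
    # LUFS differences behave like decibels, so the offset is a gain in dB.
    gain_db = target_lufs - measured_lufs
    # Convert decibels to a linear amplitude factor.
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples: list[float], gain: float) -> list[float]:
    """Scale every sample by the same linear gain."""
    return [s * gain for s in samples]
```

A reference clip measured at -7 LUFS is 20 LU louder than the target, so it is scaled by a factor of 0.1. A clip quieter than -27 LUFS gets a gain above 1.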
    from chatterbox.tts_turbo import ChatterboxTurboTTS
    import torchaudio as ta

    model = ChatterboxTurboTTS.from_pretrained(device="cuda")

    text = "Hi there! [chuckle] How can I help you today?"

    wav = model.generate(
        text,
        audio_prompt_path="agent_voice.wav",
        temperature=0.8,
        repetition_penalty=1.2,
        top_p=0.95,
        top_k=1000,
        norm_loudness=True,
    )

    ta.save("agent_output.wav", wav, model.sr)
    # Pre-compute conditionals once
    model.prepare_conditionals("voice.wav")

    # Generate multiple times without reprocessing
    for text in text_list:
        wav = model.generate(text)  # No audio_prompt_path needed