Overview
Chatterbox TTS provides four key parameters for controlling speech generation quality, consistency, and expressiveness. Understanding these parameters helps you achieve optimal results for different use cases.

Parameters Reference
Temperature
Controls the randomness and creativity of the generation.

Range: 0.0 to 2.0

Effect:
- 0.0 - 0.5: More consistent and predictable output. Use for formal content like news, announcements, or technical documentation.
- 0.6 - 1.0: Balanced naturalness with some variation. Ideal for most conversational content.
- 1.0 - 2.0: More expressive and varied output. Good for storytelling, character voices, or creative content.
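Mechanically, temperature divides the logits before the softmax, so low values sharpen the distribution toward the most likely token and high values flatten it. A minimal sketch in plain Python (illustrative only, not the Chatterbox implementation):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax.
    Low temperature sharpens the distribution; high temperature flattens it."""
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.5]
cold = apply_temperature(logits, 0.3)  # near-deterministic
hot = apply_temperature(logits, 1.5)   # closer to uniform
```

The top token's probability is much higher in `cold` than in `hot`, which is why low temperatures sound consistent and high temperatures sound varied.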
Top P (Nucleus Sampling)
Controls diversity by considering only the top probability mass.

Range: 0.0 to 1.0

Effect:
- 0.5 - 0.7: Very focused sampling. Consistent but may sound repetitive.
- 0.8 - 0.95: Balanced diversity. Good for most use cases.
- 0.95 - 1.0: Maximum diversity. More natural but less predictable.
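Concretely, the nucleus filter keeps the smallest set of top-ranked tokens whose cumulative probability reaches topP, then renormalizes. A standalone sketch (illustrative Python, not Chatterbox internals):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
# top_p=0.8 keeps "a" and "b" (0.5 + 0.3 reaches 0.8) and renormalizes
nucleus = top_p_filter(probs, 0.8)
```

Note that the pool size adapts: when the model is confident, few tokens survive; when it is uncertain, many do.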
How It Works

Nucleus sampling dynamically adjusts the token pool based on the probability distribution.

Top K
Limits the sampling pool to the K most likely tokens.

Range: 1 to 10,000

Effect:
- 1 - 50: Very restricted vocabulary. Extremely consistent but unnatural.
- 100 - 500: Moderate restriction. Good for technical or formal content.
- 500 - 2000: Standard range. Balances naturalness and control.
- 2000+: Minimal restriction. Maximum vocabulary diversity.
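Unlike Top-P, Top-K fixes the pool size regardless of model confidence. A minimal sketch (illustrative Python, not Chatterbox internals):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize.
    `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
restricted = top_k_filter(probs, 2)  # always keeps exactly 2 tokens
```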
Top-K vs Top-P
When to use Top-K vs Top-P
Use Top-K when:
- You want a fixed vocabulary size regardless of confidence
- You need very predictable output (low K values)
- You’re generating technical or domain-specific content

Use Top-P when:
- You want adaptive sampling based on model confidence
- You prefer natural-sounding variation
- You’re generating conversational or narrative content

Combining both:
- Top-P provides adaptive diversity
- Top-K sets a hard upper limit on vocabulary
- This combination works well for most use cases
Repetition Penalty
Penalizes tokens that have already been generated.

Range: 1.0 to 2.0

Effect:
- 1.0: No penalty. May lead to repetitive phrases or words.
- 1.1 - 1.3: Subtle penalty. Reduces repetition while maintaining naturalness.
- 1.4 - 2.0: Strong penalty. Actively avoids repetition, may sound forced.
Example Impact
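The original example for this section does not appear to have survived conversion. As a stand-in, this numeric sketch shows how dividing the logit of an already-generated token (per the Technical Details section) suppresses its probability. It is illustrative Python, not the Chatterbox implementation, and it assumes positive logits for simplicity; some samplers multiply negative logits instead.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_probs(logits, already_generated, penalty):
    """Divide the logit of each previously generated token by the penalty,
    then renormalize. Assumes positive logits for simplicity."""
    adjusted = [
        logit / penalty if i in already_generated else logit
        for i, logit in enumerate(logits)
    ]
    return softmax(adjusted)

logits = [3.0, 2.0, 1.0]                       # token 0 is the model's favorite
no_penalty = penalized_probs(logits, {0}, 1.0)  # penalty 1.0 = no effect
strong = penalized_probs(logits, {0}, 1.5)      # token 0 was already spoken
# token 0's probability drops once it carries a penalty
```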
Use Case Presets
News & Announcements
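The original preset values for this section were not preserved; the config below is an illustrative starting point consistent with the ranges described above (low temperature and a restricted vocabulary for consistent, formal delivery), not an official default.

```json
{ "temperature": 0.4, "topP": 0.85, "topK": 500, "repetitionPenalty": 1.2 }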
Conversational Content
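The original preset values were not preserved; the defaults listed in the Tuning Workflow below are a natural fit for conversational content, so an illustrative config is simply:

```json
{ "temperature": 0.8, "topP": 0.95, "topK": 1000, "repetitionPenalty": 1.2 }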
Storytelling & Characters
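The original preset values were not preserved; an illustrative starting point consistent with the ranges above (higher temperature and a wider vocabulary for expressive delivery), not an official default:

```json
{ "temperature": 1.2, "topP": 0.95, "topK": 2000, "repetitionPenalty": 1.3 }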
Technical Documentation
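The original preset values were not preserved; an illustrative starting point consistent with the ranges above (moderate temperature with a restricted Top-K, as suggested for technical content), not an official default:

```json
{ "temperature": 0.5, "topP": 0.9, "topK": 300, "repetitionPenalty": 1.2 }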
Parameter Interactions
Temperature + Top-P
- Low Temp + Low Top-P
- High Temp + High Top-P
- Low Temp + High Top-P
- High Temp + Low Top-P
Example (Low Temp + Low Top-P):

Config:
temperature: 0.3, topP: 0.7

Result: Very consistent, conservative output. Good for formal content but may sound robotic.

Tuning Workflow
Start with Defaults
Begin with the default values:
temperature: 0.8
topP: 0.95
topK: 1000
repetitionPenalty: 1.2
Adjust Temperature
If output is too monotone → increase temperature
If output is too random → decrease temperature
Fine-tune Repetition
If you notice repeated phrases → increase repetitionPenalty
If speech feels forced → decrease repetitionPenalty
Common Issues
Robotic or Monotone Output
Symptoms: Speech sounds flat, lacks emotion or variation

Solutions:
- Increase temperature to 1.0-1.3
- Increase topP to 0.95-0.98
- Check voice quality (some voices are more expressive than others)
Repetitive Phrases
Symptoms: Same words or patterns repeat frequently

Solutions:
- Increase repetitionPenalty to 1.3-1.5
- Increase topK to allow more vocabulary diversity
- Break long text into shorter segments
Inconsistent Delivery
Symptoms: Tone varies unpredictably between sentences

Solutions:
- Decrease temperature to 0.5-0.7
- Decrease topP to 0.85-0.90
- Use consistent punctuation and formatting
Unnatural Phrasing
Symptoms: Speech sounds forced or overly formal

Solutions:
- Decrease repetitionPenalty to 1.0-1.1
- Increase temperature slightly (0.1-0.2 increments)
- Ensure input text is naturally written
Best Practices
Make Small Adjustments
Change one parameter at a time by small increments (0.1-0.2) to understand its individual effect.
Test with Representative Content
Use actual content samples from your use case, not generic test phrases.
Consider Voice Characteristics
Different voices respond differently to parameters. What works for a narrator may not work for a character voice.
Document Your Presets
Save successful parameter combinations for different content types in your application.
Balance Consistency and Quality
Perfect consistency isn’t always desirable. Some variation makes speech sound more human.
Technical Details
Sampling Algorithm
Chatterbox uses a combined sampling approach:

1. Temperature scaling: Divides logits by temperature before softmax
2. Top-K filtering: Removes all but the K most probable tokens
3. Top-P filtering: Further filters to the nucleus based on cumulative probability
4. Repetition penalty: Divides the logits of previously generated tokens
5. Sampling: Randomly selects from the filtered distribution
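The five steps above can be sketched end to end. This is illustrative Python following the order listed, not the actual Chatterbox implementation; the penalty step assumes positive logits for simplicity.

```python
import math
import random

def sample_token(logits, generated, temperature=0.8, top_k=1000,
                 top_p=0.95, repetition_penalty=1.2, rng=random):
    """One sampling step. `logits` maps token id -> raw logit;
    `generated` is the set of token ids produced so far."""
    # 1. Temperature scaling: divide logits by temperature.
    scaled = {t: l / temperature for t, l in logits.items()}
    # 2. Top-K filtering: keep the K most probable tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors, needed for the nucleus cutoff.
    m = max(l for _, l in ranked)
    exps = {t: math.exp(l - m) for t, l in ranked}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # 3. Top-P filtering: keep the nucleus of cumulative probability top_p.
    nucleus, cum = [], 0.0
    for t, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus.append(t)
        cum += p
        if cum >= top_p:
            break
    # 4. Repetition penalty: divide logits of previously generated tokens.
    surviving = {t: scaled[t] / repetition_penalty if t in generated else scaled[t]
                 for t in nucleus}
    # 5. Sampling: draw from the softmax over the surviving tokens.
    m2 = max(surviving.values())
    weights = {t: math.exp(l - m2) for t, l in surviving.items()}
    z2 = sum(weights.values())
    r = rng.random()
    cum = 0.0
    for t, w in weights.items():
        cum += w / z2
        if r <= cum:
            return t
    return t  # floating-point fallback
```

With a confident model and a low temperature, the nucleus often collapses to a single token, which is why those settings sound so consistent.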
Performance Impact
Parameter changes have minimal performance impact:

- Temperature: Negligible (simple division)
- Top-P/Top-K: ~1-2ms overhead for filtering
- Repetition Penalty: ~1ms per token for lookup