Overview
Chatterbox-Turbo is the most efficient model in the Chatterbox family, delivering high-quality speech with less compute and VRAM than previous models. Built on a streamlined 350M-parameter architecture, Turbo excels at low-latency voice agents while maintaining excellent performance for narration and creative workflows.
One-Step Decoding
Distilled speech-token-to-mel decoder reduces generation from 10 steps to just one, while retaining high-fidelity audio output.
Paralinguistic Tags
Native support for [cough], [laugh], [chuckle], and more to add distinct realism to generated speech.
Low Latency
Optimized for production voice agents, with the potential for sub-200 ms latency.
Zero-Shot Cloning
Clone any voice from a 5-10 second reference clip without fine-tuning.
Model Specifications
- Model Size: 350M parameters
- Language: English only
- Sample Rate: 24,000 Hz
- Architecture: T3 transformer + S3Gen with mean flow decoding
- Repository: ResembleAI/chatterbox-turbo
Key Features
Paralinguistic Tags
Turbo natively supports paralinguistic tags that add natural non-speech vocalizations to your generated audio:
- [laugh] - Natural laughter
- [chuckle] - Light chuckling
- [cough] - Coughing sound
Optimized Performance
The Turbo model achieves significant performance improvements:
- Reduced VRAM: Lower memory footprint compared to base Chatterbox
- Faster Generation: One-step decoding instead of 10-step process
- Smaller Model: 350M parameters vs 500M in base models
Hardware Requirements
Minimum (CPU)
- 4GB RAM
- CPU inference supported
- Slower generation times
Recommended (GPU)
- NVIDIA GPU with 4GB+ VRAM
- CUDA support
- Real-time generation possible
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
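The hardware tiers above map onto a device string chosen at load time. A minimal sketch, assuming PyTorch is the runtime; the `pick_device` helper is hypothetical and not part of Chatterbox. It prefers CUDA, then Apple Silicon (MPS), then CPU:

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch  # assumed runtime; imported lazily so the helper degrades gracefully
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The resulting string can then be passed wherever the model loader expects a device name.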
Usage
Basic Generation
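A minimal sketch of basic generation. It assumes the Python package exposes `ChatterboxTurboTTS` in `chatterbox.tts_turbo` with `from_pretrained`, `generate`, and an `sr` attribute, as in the upstream README at the time of writing; check these names against the API Reference. The `synthesize` wrapper and the file names are hypothetical.

```python
def synthesize(text, out_path, device="cpu", audio_prompt_path=None):
    """Generate speech for `text` and save it as a WAV file at `out_path`."""
    # Imports are deferred so this module still loads where the packages are missing.
    import torchaudio as ta
    from chatterbox.tts_turbo import ChatterboxTurboTTS  # assumed import path

    model = ChatterboxTurboTTS.from_pretrained(device=device)
    # audio_prompt_path is optional in the parameter table; pass a
    # 5-10 second reference clip to clone a voice.
    wav = model.generate(text, audio_prompt_path=audio_prompt_path)
    ta.save(out_path, wav, model.sr)  # model.sr is assumed to be the 24,000 Hz sample rate
    return out_path
```

Under these assumptions, `synthesize("Hello from Chatterbox-Turbo.", "output.wav", device="cuda")` would write a 24 kHz WAV file, and adding `audio_prompt_path="reference.wav"` would clone the voice in that clip.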
Using Paralinguistic Tags
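Tags are written inline in the text prompt at the point where the sound should occur. The sketch below is a hypothetical pre-flight check (not part of Chatterbox) that flags bracketed tags other than the three named on this page. Because the model supports more tags than are listed here, treat `KNOWN_TAGS` as an assumption to extend.

```python
import re

# Only the tags named on this page; the model supports more.
KNOWN_TAGS = {"laugh", "chuckle", "cough"}
TAG_PATTERN = re.compile(r"\[([a-z_ ]+)\]")


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in KNOWN_TAGS."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]


prompt = "Thanks for calling back [chuckle], give me one second [cough] to pull that up."
print(unknown_tags(prompt))             # []
print(unknown_tags("Oh no [sneeze]!"))  # ['sneeze']
```

A prompt that passes this check can be sent to the model unchanged; the tags stay in the text.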
Voice Cloning
Clone any voice by passing a reference audio clip through the audio_prompt_path parameter.
Generation Parameters
Control the generation process with these parameters:

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Controls randomness. Higher = more varied output |
| top_p | 0.95 | Nucleus sampling threshold |
| top_k | 1000 | Limits vocabulary to top k tokens |
| repetition_penalty | 1.2 | Penalizes repeated tokens |
| audio_prompt_path | None | Path to reference audio for voice cloning |
| exaggeration | 0.0 | Emotion intensity (not used in Turbo) |
| norm_loudness | True | Normalize loudness of reference audio |
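
One way to keep these defaults in one place is to merge per-call overrides onto them before invoking the model. A minimal sketch: the values mirror the table above, `build_generate_kwargs` is a hypothetical helper, and the assumption that `generate` accepts these names as keyword arguments should be checked against the API Reference.

```python
# Defaults copied from the parameter table.
TURBO_DEFAULTS = {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 1000,
    "repetition_penalty": 1.2,
    "audio_prompt_path": None,
    "exaggeration": 0.0,  # listed in the table but not used by Turbo
    "norm_loudness": True,
}


def build_generate_kwargs(**overrides):
    """Merge overrides onto the defaults, rejecting names not in the table."""
    unknown = sorted(set(overrides) - set(TURBO_DEFAULTS))
    if unknown:
        raise ValueError(f"Unsupported parameter(s): {', '.join(unknown)}")
    return {**TURBO_DEFAULTS, **overrides}


kwargs = build_generate_kwargs(temperature=0.7, audio_prompt_path="reference.wav")
print(kwargs["temperature"], kwargs["top_p"])  # 0.7 0.95
```

Passing a name outside the table, such as cfg_weight or min_p, raises a ValueError instead of being silently forwarded.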
Unlike the base Chatterbox model, Turbo does not support the cfg_weight, exaggeration, or min_p parameters during generation.
Best Practices
For Voice Agents
- Use default parameters for most natural results
- Keep text prompts conversational and natural
- Reference audio should match the desired speaking style
- Include paralinguistic tags for more engaging conversations
For Narration
- Adjust temperature between 0.7 and 0.9 for consistency
- Use longer reference clips (8-10 seconds) for better voice capture
- Test different repetition_penalty values for varied cadence
Performance Characteristics
Generation Speed
Significantly faster than base Chatterbox due to one-step decoding. Real-time generation possible on modern GPUs.
Audio Quality
High-fidelity 24kHz output comparable to 10-step decoding models while being much faster.
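Whether generation is real-time on a given machine can be checked with the real-time factor (RTF): generation time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. A small sketch, assuming mono output at the 24,000 Hz rate listed under Model Specifications; the helper names and the timing are illustrative, not measured results.

```python
SAMPLE_RATE = 24_000  # Hz, from Model Specifications


def audio_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of a mono clip with `num_samples` samples."""
    return num_samples / sample_rate


def real_time_factor(generation_seconds: float, num_samples: int) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    return generation_seconds / audio_seconds(num_samples)


# Illustrative numbers only: 72,000 samples (3 s of audio) generated in 1.5 s.
print(real_time_factor(1.5, 72_000))  # 0.5
```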
Built-in Watermarking
Every audio file generated by Chatterbox-Turbo includes Resemble AI’s Perth (Perceptual Threshold) watermark, an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations. You can detect the watermark with Resemble AI’s Perth library.
Use Cases
- Voice Agents: Production-ready TTS for conversational AI
- Interactive Applications: Low-latency speech for games and apps
- Audiobooks: Narration with consistent voice quality
- Content Creation: Quick audio generation for videos and podcasts
- Accessibility: Text-to-speech for screen readers and assistive tools
Next Steps
Installation
Install Chatterbox and get started
API Reference
Explore all parameters and methods