
Overview

Chatterbox-Turbo is the most efficient model in the Chatterbox family, delivering high-quality speech with less compute and VRAM than previous models. Built on a streamlined 350M parameter architecture, Turbo excels at low-latency voice agents while maintaining excellent performance for narration and creative workflows.

One-Step Decoding

Distilled speech-token-to-mel decoder reduces generation from 10 steps to just one, while retaining high-fidelity audio output.

Paralinguistic Tags

Native support for tags such as [cough], [laugh], and [chuckle] that add distinct realism to generated speech.

Low Latency

Optimized for production use in voice agents, with the potential for sub-200 ms latency.

Zero-Shot Cloning

Clone any voice from a 5-10 second reference clip without fine-tuning.

Model Specifications

  • Model Size: 350M parameters
  • Language: English only
  • Sample Rate: 24,000 Hz
  • Architecture: T3 transformer + S3Gen with mean flow decoding
  • Repository: ResembleAI/chatterbox-turbo

Key Features

Paralinguistic Tags

Turbo natively supports paralinguistic tags that add natural non-speech vocalizations to your generated audio:
  • [laugh] - Natural laughter
  • [chuckle] - Light chuckling
  • [cough] - Coughing sound
Simply include these tags in your text, and the model will generate the appropriate vocal effect.
Paralinguistic tags work best when placed naturally in the sentence flow, just as they would occur in real conversation.

Optimized Performance

The Turbo model achieves significant performance improvements:
  • Reduced VRAM: Lower memory footprint compared to base Chatterbox
  • Faster Generation: One-step decoding instead of 10-step process
  • Smaller Model: 350M parameters vs 500M in base models

Hardware Requirements

Minimum (CPU)

  • 4GB RAM
  • CPU inference supported
  • Slower generation times

Recommended (GPU)

  • NVIDIA GPU with 4GB+ VRAM
  • CUDA support
  • Real-time generation possible
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
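Since Turbo runs on CUDA, Apple Silicon (MPS), or CPU, a small fallback helper can pick the best available device before loading the model. This is a sketch of the selection logic only; `pick_device` is a hypothetical helper, not part of the chatterbox API:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, the availability flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
device = pick_device(False, True)
print(device)  # prints "mps"
```

The resulting string can then be passed as the `device` argument to `ChatterboxTurboTTS.from_pretrained`.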

Usage

Basic Generation

import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Load the Turbo model
model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate speech
text = "Hello, welcome to Chatterbox Turbo!"
wav = model.generate(text)

ta.save("output.wav", wav, model.sr)

Using Paralinguistic Tags

import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with paralinguistic tags
text = "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?"

wav = model.generate(text)
ta.save("test-turbo.wav", wav, model.sr)

Voice Cloning

Clone any voice by providing a reference audio clip:
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with voice cloning
text = "Hi there, Sarah here from MochaFone calling you back [chuckle], have you got one minute to chat about the billing issue?"

# Provide a 5-10 second reference clip
wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")

ta.save("cloned-voice.wav", wav, model.sr)
Your reference audio clip must be longer than 5 seconds. The model will use the first 10 seconds for voice conditioning.
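The 5-second minimum and 10-second conditioning window can be enforced before calling the model. A minimal sketch, assuming you already have raw samples and a sample rate; `trim_reference` is a hypothetical helper, not part of the library:

```python
def trim_reference(samples, sample_rate, min_s=5.0, max_s=10.0):
    """Reject clips shorter than min_s; keep only the first max_s seconds."""
    duration = len(samples) / sample_rate
    if duration <= min_s:
        raise ValueError(f"Reference clip is {duration:.1f}s; must exceed {min_s:.0f}s")
    return samples[: int(max_s * sample_rate)]

# A 12-second clip at 24 kHz is trimmed to its first 10 seconds.
clip = [0.0] * (12 * 24_000)
trimmed = trim_reference(clip, 24_000)
print(len(trimmed) / 24_000)  # prints 10.0
```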

Generation Parameters

Control the generation process with these parameters:
  Parameter            Default  Description
  temperature          0.8      Controls randomness; higher values give more varied output
  top_p                0.95     Nucleus sampling threshold
  top_k                1000     Limits sampling to the top k tokens
  repetition_penalty   1.2      Penalizes repeated tokens
  audio_prompt_path    None     Path to reference audio for voice cloning
  exaggeration         0.0      Emotion intensity (not used in Turbo)
  norm_loudness        True     Normalize loudness of reference audio
Unlike the base Chatterbox model, Turbo does not support cfg_weight, exaggeration, or min_p parameters during generation.
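Any of these parameters can be overridden per call as keyword arguments to `model.generate`. A sketch of the defaults as a plain dictionary (the `GENERATION_DEFAULTS` name is illustrative, not part of the library):

```python
# Defaults from the table above; pass any of these as keyword
# arguments to model.generate() to override them per call.
GENERATION_DEFAULTS = {
    "temperature": 0.8,          # higher = more varied output
    "top_p": 0.95,               # nucleus sampling threshold
    "top_k": 1000,               # restrict sampling to the top k tokens
    "repetition_penalty": 1.2,   # penalize repeated tokens
    "audio_prompt_path": None,   # reference audio for voice cloning
    "norm_loudness": True,       # normalize reference-audio loudness
}

# e.g. a slightly more deterministic configuration for narration:
overrides = {**GENERATION_DEFAULTS, "temperature": 0.7}
print(overrides["temperature"])  # prints 0.7
```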

Best Practices

For Voice Agents

  • Use default parameters for most natural results
  • Keep text prompts conversational and natural
  • Reference audio should match the desired speaking style
  • Include paralinguistic tags for more engaging conversations

For Narration

  • Adjust temperature between 0.7 and 0.9 for consistency
  • Use longer reference clips (8-10 seconds) for better voice capture
  • Test different repetition_penalty values for varied cadence
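When tuning temperature and repetition_penalty for narration, it helps to sweep a small grid of values and audition each output. A sketch of building that grid with itertools; the generation calls themselves are elided, since they would need a loaded model:

```python
from itertools import product

temperatures = [0.7, 0.8, 0.9]
repetition_penalties = [1.1, 1.2, 1.3]

# Each settings dict would be passed to model.generate(text, **settings)
# and the resulting clips compared by ear.
sweep = [
    {"temperature": t, "repetition_penalty": rp}
    for t, rp in product(temperatures, repetition_penalties)
]
print(len(sweep))  # prints 9
```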

Performance Characteristics

Generation Speed

Significantly faster than base Chatterbox due to one-step decoding. Real-time generation possible on modern GPUs.

Audio Quality

High-fidelity 24kHz output comparable to 10-step decoding models while being much faster.

Built-in Watermarking

Every audio file generated by Chatterbox-Turbo includes Resemble AI’s Perth (Perceptual Threshold) watermark, an imperceptible neural watermark that survives MP3 compression, audio editing, and other common manipulations. You can detect the watermark using:
import perth
import librosa

# Load the watermarked audio
watermarked_audio, sr = librosa.load("output.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • Voice Agents: Production-ready TTS for conversational AI
  • Interactive Applications: Low-latency speech for games and apps
  • Audiobooks: Narration with consistent voice quality
  • Content Creation: Quick audio generation for videos and podcasts
  • Accessibility: Text-to-speech for screen readers and assistive tools

Next Steps

Installation

Install Chatterbox and get started

API Reference

Explore all parameters and methods
