
Overview

Chatterbox-Turbo is the most efficient model in the Chatterbox family, delivering high-quality speech with less compute and VRAM than previous models. Built on a streamlined 350M parameter architecture, Turbo excels at low-latency voice agents while maintaining excellent performance for narration and creative workflows.

One-Step Decoding

Distilled speech-token-to-mel decoder reduces generation from 10 steps to just one, while retaining high-fidelity audio output.

Paralinguistic Tags

Native support for tags such as [cough], [laugh], and [chuckle] that add distinct realism to generated speech.

Low Latency

Optimized for production use in voice agents, with the potential for sub-200 ms latency.

Zero-Shot Cloning

Clone any voice from a 5-10 second reference clip without fine-tuning.

Model Specifications

  • Model Size: 350M parameters
  • Language: English only
  • Sample Rate: 24,000 Hz
  • Architecture: T3 transformer + S3Gen with mean flow decoding
  • Repository: ResembleAI/chatterbox-turbo

Key Features

Paralinguistic Tags

Turbo natively supports paralinguistic tags that add natural non-speech vocalizations to your generated audio:
  • [laugh] - Natural laughter
  • [chuckle] - Light chuckling
  • [cough] - Coughing sound
Simply include these tags in your text, and the model will generate the appropriate vocal effect.
Paralinguistic tags work best when placed naturally in the sentence flow, just as they would occur in real conversation.

Optimized Performance

The Turbo model achieves significant performance improvements:
  • Reduced VRAM: Lower memory footprint compared to base Chatterbox
  • Faster Generation: One-step decoding instead of 10-step process
  • Smaller Model: 350M parameters vs 500M in base models

Hardware Requirements

Minimum (CPU)

  • 4GB RAM
  • CPU inference supported
  • Slower generation times

Recommended (GPU)

  • NVIDIA GPU with 4GB+ VRAM
  • CUDA support
  • Real-time generation possible
The model also supports Apple Silicon (MPS) for Mac users with M1/M2/M3 chips.
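Since Turbo runs on CUDA, Apple Silicon (MPS), or CPU, a small fallback helper can pick the best available device before loading the model. This is a sketch of the selection logic only; `pick_device` is a hypothetical helper, not part of the chatterbox API:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, the availability flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
device = pick_device(False, True)
print(device)  # prints "mps"
```

The resulting string can then be passed as the `device` argument to `ChatterboxTurboTTS.from_pretrained`.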

Usage

Basic Generation

import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

# Load the Turbo model
model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate speech
text = "Hello, welcome to Chatterbox Turbo!"
wav = model.generate(text)

ta.save("output.wav", wav, model.sr)

Using Paralinguistic Tags

import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with paralinguistic tags
text = "Oh, that's hilarious! [chuckle] Um anyway, we do have a new model in store. It's the SkyNet T-800 series and it's got basically everything. Including AI integration with ChatGPT and all that jazz. Would you like me to get some prices for you?"

wav = model.generate(text)
ta.save("test-turbo.wav", wav, model.sr)

Voice Cloning

Clone any voice by providing a reference audio clip:
import torchaudio as ta
from chatterbox.tts_turbo import ChatterboxTurboTTS

model = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Generate with voice cloning
text = "Hi there, Sarah here from MochaFone calling you back [chuckle], have you got one minute to chat about the billing issue?"

# Provide a 5-10 second reference clip
wav = model.generate(text, audio_prompt_path="your_10s_ref_clip.wav")

ta.save("cloned-voice.wav", wav, model.sr)
Your reference audio clip must be longer than 5 seconds. The model will use the first 10 seconds for voice conditioning.
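The 5-second minimum and 10-second conditioning window can be enforced before calling the model. A minimal sketch, assuming you already have raw samples and a sample rate; `trim_reference` is a hypothetical helper, not part of the library:

```python
def trim_reference(samples, sample_rate, min_s=5.0, max_s=10.0):
    """Reject clips shorter than min_s; keep only the first max_s seconds."""
    duration = len(samples) / sample_rate
    if duration <= min_s:
        raise ValueError(f"Reference clip is {duration:.1f}s; must exceed {min_s:.0f}s")
    return samples[: int(max_s * sample_rate)]

# A 12-second clip at 24 kHz is trimmed to its first 10 seconds.
clip = [0.0] * (12 * 24_000)
trimmed = trim_reference(clip, 24_000)
print(len(trimmed) / 24_000)  # prints 10.0
```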

Generation Parameters

Control the generation process with these parameters:
  Parameter            Default  Description
  temperature          0.8      Controls randomness; higher values give more varied output
  top_p                0.95     Nucleus sampling threshold
  top_k                1000     Limits sampling to the top k tokens
  repetition_penalty   1.2      Penalizes repeated tokens
  audio_prompt_path    None     Path to reference audio for voice cloning
  exaggeration         0.0      Emotion intensity (not used in Turbo)
  norm_loudness        True     Normalize loudness of reference audio
Unlike the base Chatterbox model, Turbo does not support cfg_weight, exaggeration, or min_p parameters during generation.
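Any of these parameters can be overridden per call as keyword arguments to `model.generate`. A sketch of the defaults as a plain dictionary (the `GENERATION_DEFAULTS` name is illustrative, not part of the library):

```python
# Defaults from the table above; pass any of these as keyword
# arguments to model.generate() to override them per call.
GENERATION_DEFAULTS = {
    "temperature": 0.8,          # higher = more varied output
    "top_p": 0.95,               # nucleus sampling threshold
    "top_k": 1000,               # restrict sampling to the top k tokens
    "repetition_penalty": 1.2,   # penalize repeated tokens
    "audio_prompt_path": None,   # reference audio for voice cloning
    "norm_loudness": True,       # normalize reference-audio loudness
}

# e.g. a slightly more deterministic configuration for narration:
overrides = {**GENERATION_DEFAULTS, "temperature": 0.7}
print(overrides["temperature"])  # prints 0.7
```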

Best Practices

For Voice Agents

  • Use default parameters for most natural results
  • Keep text prompts conversational and natural
  • Reference audio should match the desired speaking style
  • Include paralinguistic tags for more engaging conversations

For Narration

  • Adjust temperature between 0.7 and 0.9 for consistency
  • Use longer reference clips (8-10 seconds) for better voice capture
  • Test different repetition_penalty values for varied cadence
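When tuning temperature and repetition_penalty for narration, it helps to sweep a small grid of values and audition each output. A sketch of building that grid with itertools; the generation calls themselves are elided, since they would need a loaded model:

```python
from itertools import product

temperatures = [0.7, 0.8, 0.9]
repetition_penalties = [1.1, 1.2, 1.3]

# Each settings dict would be passed to model.generate(text, **settings)
# and the resulting clips compared by ear.
sweep = [
    {"temperature": t, "repetition_penalty": rp}
    for t, rp in product(temperatures, repetition_penalties)
]
print(len(sweep))  # prints 9
```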

Performance Characteristics

Generation Speed

Significantly faster than base Chatterbox due to one-step decoding. Real-time generation possible on modern GPUs.

Audio Quality

High-fidelity 24kHz output comparable to 10-step decoding models while being much faster.

Built-in Watermarking

Every audio file generated by Chatterbox-Turbo includes Resemble AI’s Perth (Perceptual Threshold) watermark, an imperceptible neural watermark that survives MP3 compression, audio editing, and other common manipulations. You can detect the watermark using:
import perth
import librosa

# Load the watermarked audio
watermarked_audio, sr = librosa.load("output.wav", sr=None)

# Initialize watermarker
watermarker = perth.PerthImplicitWatermarker()

# Extract watermark
watermark = watermarker.get_watermark(watermarked_audio, sample_rate=sr)
print(f"Extracted watermark: {watermark}")  # 0.0 or 1.0

Use Cases

  • Voice Agents: Production-ready TTS for conversational AI
  • Interactive Applications: Low-latency speech for games and apps
  • Audiobooks: Narration with consistent voice quality
  • Content Creation: Quick audio generation for videos and podcasts
  • Accessibility: Text-to-speech for screen readers and assistive tools

Next Steps

Installation

Install Chatterbox and get started

API Reference

Explore all parameters and methods
