
Overview

ChatterboxTurboTTS is the fastest text-to-speech model in the Chatterbox family, optimized for low-latency inference. It uses a streamlined architecture with 2 CFM timesteps for rapid audio generation while maintaining high quality.

Class Signature

class ChatterboxTurboTTS:
    def __init__(
        self,
        t3: T3,
        s3gen: S3Gen,
        ve: VoiceEncoder,
        tokenizer: EnTokenizer,
        device: str,
        conds: Conditionals = None,
    )

Parameters

t3
T3
required
The T3 model instance, which converts text tokens to speech tokens
s3gen
S3Gen
required
The S3Gen vocoder model instance for token-to-audio conversion
ve
VoiceEncoder
required
Voice encoder for extracting speaker embeddings from reference audio
tokenizer
EnTokenizer
required
English text tokenizer instance
device
str
required
Device to run inference on ("cuda", "cpu", or "mps")
conds
Conditionals
Optional pre-computed conditionals for voice and style. See the Conditionals reference

Class Methods

from_pretrained()

Load the pre-trained ChatterboxTurboTTS model from Hugging Face.
@classmethod
def from_pretrained(cls, device: str) -> 'ChatterboxTurboTTS'

Parameters

device
str
required
Device to load the model on ("cuda", "cpu", or "mps"). Automatically falls back to "cpu" if MPS is not available

Returns

model
ChatterboxTurboTTS
Initialized ChatterboxTurboTTS model with pre-trained weights from ResembleAI/chatterbox-turbo

Example

from chatterbox import ChatterboxTurboTTS
import torch

# Load on GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ChatterboxTurboTTS.from_pretrained(device)

from_local()

Load the model from a local checkpoint directory.
@classmethod
def from_local(cls, ckpt_dir: str, device: str) -> 'ChatterboxTurboTTS'

Parameters

ckpt_dir
str
required
Path to the directory containing model checkpoint files
device
str
required
Device to load the model on ("cuda", "cpu", or "mps")

Returns

model
ChatterboxTurboTTS
Initialized ChatterboxTurboTTS model with weights loaded from local directory

Instance Methods

prepare_conditionals()

Prepare voice conditionals from an audio prompt for subsequent generation calls.
def prepare_conditionals(
    self,
    wav_fpath: str,
    exaggeration: float = 0.5,
    norm_loudness: bool = True
)

Parameters

wav_fpath
str
required
Path to the audio file to use as voice reference. Must be at least 5 seconds long
exaggeration
float
default:"0.5"
Voice exaggeration level (0.0 to 1.0). Higher values produce more expressive speech
norm_loudness
bool
default:"True"
Whether to normalize the loudness of the reference audio to -27 LUFS

Example

# Prepare voice from reference audio
model.prepare_conditionals(
    wav_fpath="voice_sample.wav",
    exaggeration=0.5,
    norm_loudness=True
)

generate()

Generate speech from text using the prepared voice conditionals.
def generate(
    self,
    text: str,
    repetition_penalty: float = 1.2,
    min_p: float = 0.00,
    top_p: float = 0.95,
    audio_prompt_path: str = None,
    exaggeration: float = 0.0,
    cfg_weight: float = 0.0,
    temperature: float = 0.8,
    top_k: int = 1000,
    norm_loudness: bool = True,
) -> torch.Tensor

Parameters

text
str
required
The text to convert to speech
repetition_penalty
float
default:"1.2"
Penalty for repeating tokens (1.0 = no penalty, higher values discourage repetition)
min_p
float
default:"0.00"
Minimum probability threshold for sampling. Not supported in the Turbo version and will be ignored
top_p
float
default:"0.95"
Nucleus sampling threshold (0.0 to 1.0). Only tokens with cumulative probability up to top_p are considered
audio_prompt_path
str
Optional path to audio file for voice cloning. If provided, will override existing conditionals
exaggeration
float
default:"0.0"
Voice exaggeration level. Not supported in the Turbo version and will be ignored
cfg_weight
float
default:"0.0"
Classifier-free guidance weight. Not supported in the Turbo version and will be ignored
temperature
float
default:"0.8"
Sampling temperature (higher = more random, lower = more deterministic)
top_k
int
default:"1000"
Number of top tokens to consider during sampling
norm_loudness
bool
default:"True"
Whether to normalize the loudness of the audio prompt if provided

Returns

audio
torch.Tensor
Generated audio waveform as a PyTorch tensor with shape [1, samples]. Sample rate is 44100 Hz (accessible via model.sr). Audio includes perceptual watermarking

Example

import torchaudio

# Generate speech with prepared conditionals
audio = model.generate(
    text="Hello, this is a test of the turbo model.",
    temperature=0.8,
    top_k=1000,
    top_p=0.95,
    repetition_penalty=1.2
)

# Save to file
torchaudio.save("output.wav", audio, model.sr)

# Or generate with a new voice in one call
audio = model.generate(
    text="Hello world!",
    audio_prompt_path="new_voice.wav"
)

Attributes

sr
int
Sample rate of generated audio (44100 Hz)
device
str
Device the model is running on
conds
Conditionals
Current voice conditionals used for generation
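
Since generated audio has shape [1, samples] at model.sr (44100 Hz), clip duration is samples / sr. A minimal sketch of working with these attributes, using a placeholder tensor in place of real model output:

```python
import torch

SR = 44100  # same value as model.sr on ChatterboxTurboTTS

# Placeholder standing in for model.generate(...) output: shape [1, samples]
audio = torch.zeros(1, 2 * SR)  # two seconds of silence

# Duration in seconds follows directly from the sample rate
duration_s = audio.shape[-1] / SR
print(f"{duration_s:.1f} s")  # 2.0 s
```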

Notes

  • The Turbo model does not support the cfg_weight, min_p, or exaggeration parameters; they are ignored with a warning
  • Audio prompts must be at least 5 seconds long
  • Generated audio is automatically watermarked using the Perth implicit watermarker
  • Text is automatically normalized (capitalization, punctuation) before generation
