Overview
The Chatterbox TTS API is a FastAPI application running on Modal with GPU acceleration. It provides real-time text-to-speech generation using the Chatterbox Turbo model.
This is the backend TTS engine that powers Resonance. The tRPC generations.create endpoint calls this API internally.
Architecture
- Platform: Modal serverless GPU infrastructure
- GPU: A10G (24GB VRAM)
- Model: Chatterbox Turbo TTS v0.1.6
- Scaledown: 5-minute idle timeout
- Concurrency: Up to 10 concurrent requests
- Storage: Cloudflare R2 bucket mount (read-only)
Infrastructure
chatterbox_tts.py
Authentication
All requests require an API key passed via the x-api-key header.
API key for authenticating requests. Configure via the Modal secret CHATTERBOX_API_KEY.
Authentication Error
403 Forbidden
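As a client-side illustration of the header requirement, the sketch below builds (but does not send) a request that carries the key in x-api-key, using only Python's standard library. The base URL, the /generate path, and the JSON field name are placeholders assumed for the example; this page does not confirm them.

```python
import json
import urllib.request


def build_tts_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request that carries the API key in the x-api-key header.

    The "/generate" path is an assumed placeholder, not a documented route.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",  # hypothetical path
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": "application/json",
        },
    )


# Placeholder values only; sending would use urllib.request.urlopen(req).
req = build_tts_request(
    "https://example.invalid",
    "my-secret-key",
    {"text": "Hello there."},  # field name is an assumption
)
```

A request built without the header, or with the wrong key, would be expected to come back as 403 Forbidden.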
Generate Speech
Generate speech audio from text using a voice clone.
Request Body
Text to convert to speech. Supports special tokens like [chuckle], [laugh], etc.
Constraints:
- Minimum length: 1 character
- Maximum length: 5,000 characters
R2 object key for the voice audio file. Must be accessible in the mounted R2 bucket.
Examples:
- voices/system/narrator.wav
- voices/custom/org-123/voice-456.wav
Constraints:
- Minimum length: 1 character
- Maximum length: 300 characters
Controls randomness in generation.
Range: 0.0 to 2.0
- Lower (0.0-0.5): More consistent and predictable
- Default (0.8): Balanced naturalness
- Higher (1.0-2.0): More varied and expressive
Nucleus sampling threshold for controlling diversity.
Range: 0.0 to 1.0
- Lower values: More focused, conservative sampling
- Higher values: More diverse token selection
Number of top tokens to consider during sampling.
Range: 1 to 10,000
- Lower values: More focused vocabulary
- Higher values: Broader token selection
Penalty for repeating tokens.
Range: 1.0 to 2.0
- 1.0: No penalty (may repeat)
- 1.2: Balanced (default)
- 2.0: Strong penalty (avoids repetition)
Whether to normalize audio loudness for consistent volume across generations.
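The limits above can be checked on the client before a request is sent, which avoids a round trip that would end in 422. The following is a minimal sketch under stated assumptions: every field name is hypothetical, and only the ranges and the two defaults (0.8 and 1.2) are taken from this page. Parameters whose defaults are not stated are left as None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeechRequest:
    # Field names are hypothetical; only the limits and the two
    # defaults (0.8 and 1.2) come from the documentation.
    text: str
    voice_key: str
    temperature: float = 0.8
    top_p: Optional[float] = None  # None = leave to the server default
    top_k: Optional[int] = None    # None = leave to the server default
    repetition_penalty: float = 1.2

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if valid)."""
        problems = []
        if not 1 <= len(self.text) <= 5000:
            problems.append("text must be 1 to 5,000 characters")
        if not 1 <= len(self.voice_key) <= 300:
            problems.append("voice key must be 1 to 300 characters")
        if not 0.0 <= self.temperature <= 2.0:
            problems.append("temperature must be between 0.0 and 2.0")
        if self.top_p is not None and not 0.0 <= self.top_p <= 1.0:
            problems.append("nucleus threshold must be between 0.0 and 1.0")
        if self.top_k is not None and not 1 <= self.top_k <= 10000:
            problems.append("top-token count must be between 1 and 10,000")
        if not 1.0 <= self.repetition_penalty <= 2.0:
            problems.append("repetition penalty must be between 1.0 and 2.0")
        return problems


ok = SpeechRequest("Hi [laugh]", "voices/system/narrator.wav").validate()
bad = SpeechRequest("", "voices/system/narrator.wav", temperature=3.0).validate()
```

Here `ok` is empty, while `bad` reports both the empty text and the out-of-range temperature.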
Response
Content-Type: audio/wav
Body: Binary WAV audio stream
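Because the body is raw WAV data, a client can inspect it in memory before writing it to disk. The sketch below reads basic properties with Python's standard wave module. It assumes the audio is PCM-encoded, which this page does not state; the silent clip and its 24,000 Hz sample rate are stand-ins for a real response.

```python
import io
import wave


def wav_info(data: bytes) -> dict:
    """Read basic properties from WAV bytes held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def make_silence(seconds: float, rate: int = 24000) -> bytes:
    """Create a mono 16-bit silent WAV, standing in for a real response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


info = wav_info(make_silence(0.5))
```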
Implementation
chatterbox_tts.py
Error Responses
400 Bad Request
Voice file not found in R2 bucket.
403 Forbidden
Invalid or missing API key.
422 Unprocessable Entity
Invalid request parameters.
500 Internal Server Error
Generation failure.
Special Tokens
Chatterbox supports inline emotion and sound tokens:
| Token | Effect |
|---|---|
| [chuckle] | Light laughter |
| [laugh] | Full laughter |
| [sigh] | Sighing sound |
| [gasp] | Gasping sound |
| [cough] | Coughing sound |
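To catch misspelled tokens before sending text, a client could scan for bracketed words and compare them with the table above. This is an illustrative sketch only: the regular expression and the known/unknown split are assumptions, and the model may accept tokens that are not listed here.

```python
import re

# Tokens listed in the table above; the API may accept others.
KNOWN_TOKENS = {"[chuckle]", "[laugh]", "[sigh]", "[gasp]", "[cough]"}

# Matches any lowercase word wrapped in square brackets, e.g. "[laugh]".
TOKEN_PATTERN = re.compile(r"\[[a-z]+\]")


def find_tokens(text):
    """Split bracketed tokens in `text` into (known, unknown) lists."""
    found = TOKEN_PATTERN.findall(text)
    known = [t for t in found if t in KNOWN_TOKENS]
    unknown = [t for t in found if t not in KNOWN_TOKENS]
    return known, unknown


known, unknown = find_tokens("Well [sigh] that was close. [laugh] [giggle]")
```

In this example `[giggle]` lands in `unknown`, which a caller might surface as a warning rather than an error.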
Local Testing
Test the API locally using Modal’s CLI:
Deployment
Deploy to Modal:
Required Secrets
Configure these secrets in Modal:
Performance
- Cold start: ~15-30 seconds (model loading)
- Warm inference: ~1-3 seconds per generation
- Concurrency: Up to 10 parallel requests
- Auto-scaling: Scales to zero after 5 minutes of inactivity
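Given the limit of 10 parallel requests, a batch client may want to cap how many calls it keeps in flight. The sketch below does that with a thread pool. The `synthesize` callable is hypothetical (it stands for whatever function sends one request and returns the WAV bytes), and this page does not say whether the server queues or rejects requests beyond the limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_PARALLEL = 10  # matches the documented concurrency limit


def synthesize_all(texts: List[str], synthesize: Callable[[str], bytes]) -> List[bytes]:
    """Run `synthesize` over `texts` with at most MAX_PARALLEL calls in flight.

    `synthesize` is a caller-supplied function (hypothetical here) that
    sends one request and returns the WAV bytes.
    """
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(synthesize, texts))


def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real API call, used only to show the call shape."""
    return text.encode("utf-8")


results = synthesize_all(["one", "two", "three"], fake_synthesize)
```

When the service has scaled to zero, the first call in a batch also pays the cold start, so client timeouts should allow for roughly 30 seconds on top of normal inference time.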