## Why Modal?

- **Serverless GPUs** - NVIDIA A10G on-demand, pay only when active
- **Auto-scaling** - 0 to 10+ concurrent requests automatically
- **Fast Cold Starts** - Container provisioning in ~30 seconds
- **R2 Integration** - Direct bucket mounting for voice references
- **No DevOps** - No servers, containers, or Kubernetes to manage
## Prerequisites

- Modal account (sign up free)
- Modal CLI installed: `pip install modal`
- Cloudflare R2 bucket configured (see R2 Setup)
- Hugging Face account for model weights
## Quick Setup

### Create Modal secrets

Configure three secrets in Modal Dashboard → Secrets:
#### cloudflare-r2

R2 credentials for bucket mounting. Use the same R2 credentials from your `.env.local` file.
#### chatterbox-api-key

API key to protect the Chatterbox endpoint. Generate a secure key:
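One way to generate a suitable key (any sufficiently long random hex string works):

```python
import secrets

# Generate a 64-character hex API key for the chatterbox-api-key secret
api_key = secrets.token_hex(32)
print(api_key)  # paste this value into the Modal secret and .env.local
```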
#### hf-token

Hugging Face token for downloading model weights. Get your token from Hugging Face Settings → Access Tokens.
### Update chatterbox_tts.py

Edit `chatterbox_tts.py` in your project root with your R2 credentials, defined at `chatterbox_tts.py:23`.

### Deploy to Modal

Deploy with `modal deploy chatterbox_tts.py`, then add the printed endpoint URL to `.env.local`.

## Architecture
### Modal Application

The `chatterbox_tts.py` file defines a complete Modal application; see `chatterbox_tts.py:34`.
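In outline, the application wires together an image, a GPU, secrets, and a web endpoint. A rough sketch (package list and names are illustrative, not the exact contents of `chatterbox_tts.py`):

```python
import modal

app = modal.App("chatterbox-tts")

# GPU image with the TTS dependencies (package list is illustrative)
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    "chatterbox-tts", "fastapi"
)

@app.function(
    image=image,
    gpu="A10G",            # 24 GB VRAM, ~$0.60/hour, billed per second
    scaledown_window=300,  # keep the GPU warm 5 minutes after the last request
    secrets=[
        modal.Secret.from_name("chatterbox-api-key"),
        modal.Secret.from_name("hf-token"),
    ],
)
@modal.asgi_app()
def serve():
    from fastapi import FastAPI

    web = FastAPI()
    # ... /generate route, API key check, and model inference live here ...
    return web
```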
### R2 Bucket Mount

Modal mounts your R2 bucket read-only for direct file access; see `chatterbox_tts.py:26` and `chatterbox_tts.py:83`.
Benefits:
- No file uploads to Modal
- Voice references accessible immediately after upload to R2
- Single source of truth for audio files
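The mount itself can be sketched with Modal's `CloudBucketMount` (the bucket name, endpoint URL, and mount path below are placeholders):

```python
import modal

# Placeholders: substitute your bucket name and R2 account endpoint
r2_mount = modal.CloudBucketMount(
    bucket_name="your-bucket",
    bucket_endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
    secret=modal.Secret.from_name("cloudflare-r2"),  # R2 access keys
    read_only=True,  # Modal only reads voice files, never writes
)

# Attached to the function, the bucket appears as a local directory:
# @app.function(volumes={"/bucket": r2_mount}, ...)
```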
## GPU Configuration

### GPU Type

**NVIDIA A10G** - 24 GB VRAM, optimized for inference. Cost: ~$0.60/hour (pay per second).

Alternative GPUs:

- `a100` - More powerful, $2.50/hour
- `t4` - Cheaper, slower, $0.20/hour
- `any` - Let Modal choose an available GPU
### Scaledown Window

**5 minutes** - How long to keep the GPU warm after the last request.

Trade-offs:
- Shorter window: Lower costs, more cold starts
- Longer window: Higher costs, fewer cold starts
5 minutes balances cost and user experience for typical usage patterns.
### Concurrency

**10 concurrent requests** - Maximum parallel generations. Modal automatically scales to handle concurrent load:
- 1-10 requests: Single GPU instance
- 11+ requests: Additional instances spin up
See `chatterbox_tts.py:93`.

## API Endpoints
### POST /generate

Generate TTS audio from text and a voice reference. For the request schema, see `chatterbox_tts.py:71`.

Response:

- Content-Type: `audio/wav`
- Body: WAV audio file (24kHz, 16-bit PCM)
### GET /docs

Interactive API documentation (Swagger UI). Visit: `https://your-workspace--chatterbox-tts-serve.modal.run/docs`
## Model Loading

The Chatterbox TTS model is loaded once per GPU instance; see `chatterbox_tts.py:95`.
Lifecycle:

- First request triggers container start (~30s cold start)
- `load_model()` downloads weights from Hugging Face (~10s)
- Model is cached for subsequent requests
- After 5 minutes idle, the container shuts down
Model weights (~2 GB) are cached in Modal’s container image layer cache, reducing cold start times on subsequent runs.
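The load-once lifecycle can be sketched with Modal's class-based pattern, where `@modal.enter` runs at container start rather than per request (the real implementation is at `chatterbox_tts.py:95`; class and method names here are illustrative):

```python
import modal

app = modal.App("chatterbox-tts")

@app.cls(gpu="A10G", scaledown_window=300)
class TTS:
    @modal.enter()  # runs once per container start, not per request
    def load_model(self):
        # The cold start pays this cost; later requests reuse self.model
        self.model = ...  # e.g. load Chatterbox weights onto the GPU

    @modal.method()
    def generate(self, text: str, voice_key: str) -> bytes:
        # The model is already warm for every request after the first
        raise NotImplementedError
```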
## Voice Key Format

Voice keys follow the R2 bucket structure.

System voices: `"voices/system/clxyz123abc"`

Custom voices: `"voices/custom/clabc789def"`

Modal resolves these to absolute paths; see `chatterbox_tts.py:117`.
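Resolution is essentially a join against the mount point. A sketch (the `/bucket` mount path and `.wav` extension are assumptions; see `chatterbox_tts.py:117` for the real logic):

```python
from pathlib import Path

MOUNT_ROOT = Path("/bucket")  # where Modal mounts the R2 bucket (assumed)

def resolve_voice_path(voice_key: str) -> Path:
    """Map an R2 voice key to the file path inside the container."""
    if not voice_key.startswith("voices/"):
        raise ValueError(f"unexpected voice key: {voice_key!r}")
    return MOUNT_ROOT / f"{voice_key}.wav"

print(resolve_voice_path("voices/system/clxyz123abc").as_posix())
# /bucket/voices/system/clxyz123abc.wav
```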
## Authentication

The API is protected by API key authentication; see `chatterbox_tts.py:59`.

Usage from Next.js: call the endpoint from a server-side route, sending the key in the `X-Api-Key` header.
## Testing Locally

Modal provides a `local_entrypoint` for testing; see `chatterbox_tts.py:173`. Running it:
- Spins up a Modal container with GPU
- Mounts R2 bucket
- Generates audio
- Saves to local file
- Shuts down container
Use this to verify R2 mounting and voice key resolution before deploying.
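The shape of such an entrypoint (a sketch; the real one is at `chatterbox_tts.py:173`, and `generate` stands in for the app's remote GPU function):

```python
import modal

app = modal.App("chatterbox-tts")

@app.local_entrypoint()
def main(text: str = "Hello from Chatterbox"):
    # `generate` stands in for the remote function defined in the app;
    # .remote() runs it in a Modal GPU container, returning WAV bytes here.
    wav_bytes = generate.remote(text=text, voice_key="voices/system/clxyz123abc")
    with open("output.wav", "wb") as f:
        f.write(wav_bytes)
```

Invoke it with `modal run chatterbox_tts.py`.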
## Monitoring

### Modal Dashboard

View real-time metrics at modal.com/apps:

- Active containers - Currently running GPU instances
- Request volume - Requests per second/minute/hour
- Cold start rate - Percentage of requests triggering cold starts
- Error rate - Failed requests
- GPU utilization - Time GPU was active vs idle
### Logs

View logs in real time with `modal app logs chatterbox-tts`.

### Costs

Track GPU usage and costs:

- Go to Billing in the Modal Dashboard
- View breakdown by app and GPU type
- Export usage data for accounting
Example monthly costs (A10G at $0.60/hour):

- 100 generations/day ≈ 1 hour GPU time/day ≈ $18/month
- 1,000 generations/day ≈ 10 hours GPU time/day ≈ $180/month
With 5-minute scaledown, actual GPU time is much less than total app uptime.
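These estimates follow from simple arithmetic, assuming roughly 36 seconds of A10G time per generation:

```python
RATE_PER_HOUR = 0.60   # A10G price
SECONDS_PER_GEN = 36   # assumed average GPU time per generation

def monthly_cost(generations_per_day: int, days: int = 30) -> float:
    """Estimated GPU spend per month at the assumed per-generation time."""
    gpu_hours = generations_per_day * SECONDS_PER_GEN / 3600 * days
    return gpu_hours * RATE_PER_HOUR

print(monthly_cost(100))   # 18.0
print(monthly_cost(1000))  # 180.0
```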
## Optimizations

### Reduce Cold Starts

#### Increase scaledown window

Keep the GPU warm longer. Trade-off: higher idle costs, fewer cold starts.
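In Modal this is a single parameter on the function (600 seconds here is an example value):

```python
import modal

app = modal.App("chatterbox-tts")

@app.function(gpu="A10G", scaledown_window=600)  # stay warm for 10 minutes
def serve():
    ...
```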
#### Use keep-warm policy

Maintain a minimum number of active containers. Cost: ~$0.60/hour continuously.
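A sketch using the `keep_warm` parameter (the same one referenced under Troubleshooting):

```python
import modal

app = modal.App("chatterbox-tts")

@app.function(gpu="A10G", keep_warm=1)  # always keep one container running
def serve():
    ...
```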
#### Preload model in image

Bake model weights into the container image. Benefit: reduces cold start by ~10 seconds.
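One way to do this is to run the download step at image-build time with Modal's `Image.run_function` (a sketch; `download_model` is a hypothetical helper and the Hugging Face repo id is an assumption):

```python
import modal

def download_model():
    # Runs at image build time: weights land in the image layer cache
    from huggingface_hub import snapshot_download
    snapshot_download("ResembleAI/chatterbox")  # repo id is an assumption

image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("chatterbox-tts", "huggingface_hub")
    .run_function(download_model, secrets=[modal.Secret.from_name("hf-token")])
)
```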
### Improve Throughput

#### Increase concurrency

Handle more parallel requests per GPU. Note: an A10G can typically handle 5-10 concurrent TTS generations before VRAM becomes a bottleneck.
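A sketch using Modal's `@modal.concurrent` decorator with the `max_inputs` limit mentioned under Troubleshooting:

```python
import modal

app = modal.App("chatterbox-tts")

@app.function(gpu="A10G")
@modal.concurrent(max_inputs=10)  # up to 10 requests share one container
def serve():
    ...
```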
#### Use batch generation

Process multiple prompts in a single request instead of issuing one call per prompt.
## Troubleshooting

### Deployment fails

- Check the Python version matches: `python_version="3.10"`
- Verify package versions are valid
- Try deploying with the `--force-build` flag
### Voice not found error

- Verify the voice exists in R2
- Check the Modal secret `cloudflare-r2` has the correct credentials
- Verify `R2_ACCOUNT_ID` and `R2_BUCKET_NAME` in `chatterbox_tts.py`
- Test the bucket mount locally (see Testing Locally)
### API key rejected

- Verify the Modal secret `chatterbox-api-key` is set
- Ensure `CHATTERBOX_API_KEY` matches in `.env.local` and the Modal secret
- Check the header name is `X-Api-Key` (case-sensitive)
- Redeploy after changing the secret
### Slow cold starts

Expected cold start time: 30-40 seconds. If longer:

- Check container build time in the logs
- Consider preloading model weights in the image (see Optimizations)
- Use `keep_warm=1` for production
### GPU out of memory

- Reduce `max_inputs` concurrency
- Upgrade to an A100 GPU
- Process shorter prompts (split long text)
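For the last point, a simple sentence-boundary chunker (a sketch; the character limit is an arbitrary example):

```python
import re

def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Split long text on sentence boundaries so each chunk stays small."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = split_text("First sentence. Second sentence! Third?", max_chars=20)
print(chunks)  # ['First sentence.', 'Second sentence!', 'Third?']
```

Each chunk can then be sent as its own `/generate` request and the WAV segments concatenated client-side.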
## Related Documentation

- **Cloudflare R2** - Configure bucket mounting
- **Environment Variables** - Required Modal environment variables
- **Modal Documentation** - Official Modal documentation
- **Chatterbox TTS** - Chatterbox model repository