## Why Modal?

- **Serverless GPUs** - NVIDIA A10G on-demand, pay only when active
- **Auto-scaling** - 0 to 10+ concurrent requests automatically
- **Fast Cold Starts** - Container provisioning in ~30 seconds
- **R2 Integration** - Direct bucket mounting for voice references
- **No DevOps** - No servers, containers, or Kubernetes to manage
## Prerequisites

- Modal account (sign up free)
- Modal CLI installed: `pip install modal`
- Cloudflare R2 bucket configured (see R2 Setup)
- Hugging Face account for model weights
## Quick Setup

### Create Modal secrets

Configure three secrets in Modal Dashboard → Secrets:
#### cloudflare-r2

R2 credentials for bucket mounting. Use the same R2 credentials from your `.env.local` file.
#### chatterbox-api-key

API key to protect the Chatterbox endpoint. Generate a secure key:
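One way to generate a suitable key (any sufficiently long random hex string works):

```python
import secrets

# Generate a 64-character hex API key for the chatterbox-api-key secret
api_key = secrets.token_hex(32)
print(api_key)  # paste this value into the Modal secret and .env.local
```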
#### hf-token

Hugging Face token for downloading model weights. Get your token from Hugging Face Settings → Access Tokens.
### Update chatterbox_tts.py

Edit `chatterbox_tts.py` in your project root with your R2 credentials, defined at `chatterbox_tts.py:23`.

### Deploy to Modal

Deploy with `modal deploy chatterbox_tts.py`, then add the printed endpoint URL to `.env.local`.

## Architecture
### Modal Application

The `chatterbox_tts.py` file defines a complete Modal application; see `chatterbox_tts.py:34`.
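In outline, the application wires together an image, a GPU, secrets, and a web endpoint. A rough sketch (package list and names are illustrative, not the exact contents of `chatterbox_tts.py`):

```python
import modal

app = modal.App("chatterbox-tts")

# GPU image with the TTS dependencies (package list is illustrative)
image = modal.Image.debian_slim(python_version="3.10").pip_install(
    "chatterbox-tts", "fastapi"
)

@app.function(
    image=image,
    gpu="A10G",            # 24 GB VRAM, ~$0.60/hour, billed per second
    scaledown_window=300,  # keep the GPU warm 5 minutes after the last request
    secrets=[
        modal.Secret.from_name("chatterbox-api-key"),
        modal.Secret.from_name("hf-token"),
    ],
)
@modal.asgi_app()
def serve():
    from fastapi import FastAPI

    web = FastAPI()
    # ... /generate route, API key check, and model inference live here ...
    return web
```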
### R2 Bucket Mount

Modal mounts your R2 bucket read-only for direct file access; see `chatterbox_tts.py:26` and `chatterbox_tts.py:83`.
Benefits:
- No file uploads to Modal
- Voice references accessible immediately after upload to R2
- Single source of truth for audio files
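The mount itself can be sketched with Modal's `CloudBucketMount` (the bucket name, endpoint URL, and mount path below are placeholders):

```python
import modal

# Placeholders: substitute your bucket name and R2 account endpoint
r2_mount = modal.CloudBucketMount(
    bucket_name="your-bucket",
    bucket_endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
    secret=modal.Secret.from_name("cloudflare-r2"),  # R2 access keys
    read_only=True,  # Modal only reads voice files, never writes
)

# Attached to the function, the bucket appears as a local directory:
# @app.function(volumes={"/bucket": r2_mount}, ...)
```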
## GPU Configuration

### GPU Type

**NVIDIA A10G** - 24 GB VRAM, optimized for inference. Cost: ~$0.60/hour (pay per second).

Alternative GPUs:

- `a100` - More powerful, $2.50/hour
- `t4` - Cheaper, slower, $0.20/hour
- `any` - Let Modal choose an available GPU
### Scaledown Window

**5 minutes** - How long to keep the GPU warm after the last request.

Trade-offs:
- Shorter window: Lower costs, more cold starts
- Longer window: Higher costs, fewer cold starts
5 minutes balances cost and user experience for typical usage patterns.
### Concurrency

**10 concurrent requests** - Maximum parallel generations. Modal automatically scales to handle concurrent load:
- 1-10 requests: Single GPU instance
- 11+ requests: Additional instances spin up
See `chatterbox_tts.py:93`.

## API Endpoints
### POST /generate

Generate TTS audio from text and a voice reference. For the request schema, see `chatterbox_tts.py:71`.

Response:

- Content-Type: `audio/wav`
- Body: WAV audio file (24kHz, 16-bit PCM)
### GET /docs

Interactive API documentation (Swagger UI). Visit: `https://your-workspace--chatterbox-tts-serve.modal.run/docs`
## Model Loading

The Chatterbox TTS model is loaded once per GPU instance; see `chatterbox_tts.py:95`.
Lifecycle:

- First request triggers container start (~30s cold start)
- `load_model()` downloads weights from Hugging Face (~10s)
- Model is cached for subsequent requests
- After 5 minutes idle, the container shuts down
Model weights (~2 GB) are cached in Modal’s container image layer cache, reducing cold start times on subsequent runs.
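The load-once lifecycle can be sketched with Modal's class-based pattern, where `@modal.enter` runs at container start rather than per request (the real implementation is at `chatterbox_tts.py:95`; class and method names here are illustrative):

```python
import modal

app = modal.App("chatterbox-tts")

@app.cls(gpu="A10G", scaledown_window=300)
class TTS:
    @modal.enter()  # runs once per container start, not per request
    def load_model(self):
        # The cold start pays this cost; later requests reuse self.model
        self.model = ...  # e.g. load Chatterbox weights onto the GPU

    @modal.method()
    def generate(self, text: str, voice_key: str) -> bytes:
        # The model is already warm for every request after the first
        raise NotImplementedError
```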
## Voice Key Format

Voice keys follow the R2 bucket structure.

System voices: `"voices/system/clxyz123abc"`

Custom voices: `"voices/custom/clabc789def"`

Modal resolves these to absolute paths; see `chatterbox_tts.py:117`.
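Resolution is essentially a join against the mount point. A sketch (the `/bucket` mount path and `.wav` extension are assumptions; see `chatterbox_tts.py:117` for the real logic):

```python
from pathlib import Path

MOUNT_ROOT = Path("/bucket")  # where Modal mounts the R2 bucket (assumed)

def resolve_voice_path(voice_key: str) -> Path:
    """Map an R2 voice key to the file path inside the container."""
    if not voice_key.startswith("voices/"):
        raise ValueError(f"unexpected voice key: {voice_key!r}")
    return MOUNT_ROOT / f"{voice_key}.wav"

print(resolve_voice_path("voices/system/clxyz123abc").as_posix())
# /bucket/voices/system/clxyz123abc.wav
```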
## Authentication

The API is protected by API key authentication; see `chatterbox_tts.py:59`.

Usage from Next.js: call the endpoint from a server-side route, sending the key in the `X-Api-Key` header.
## Testing Locally

Modal provides a `local_entrypoint` for testing; see `chatterbox_tts.py:173`. Running it:
- Spins up a Modal container with GPU
- Mounts R2 bucket
- Generates audio
- Saves to local file
- Shuts down container
Use this to verify R2 mounting and voice key resolution before deploying.
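The shape of such an entrypoint (a sketch; the real one is at `chatterbox_tts.py:173`, and `generate` stands in for the app's remote GPU function):

```python
import modal

app = modal.App("chatterbox-tts")

@app.local_entrypoint()
def main(text: str = "Hello from Chatterbox"):
    # `generate` stands in for the remote function defined in the app;
    # .remote() runs it in a Modal GPU container, returning WAV bytes here.
    wav_bytes = generate.remote(text=text, voice_key="voices/system/clxyz123abc")
    with open("output.wav", "wb") as f:
        f.write(wav_bytes)
```

Invoke it with `modal run chatterbox_tts.py`.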
## Monitoring

### Modal Dashboard

View real-time metrics at modal.com/apps:

- Active containers - Currently running GPU instances
- Request volume - Requests per second/minute/hour
- Cold start rate - Percentage of requests triggering cold starts
- Error rate - Failed requests
- GPU utilization - Time GPU was active vs idle
### Logs

View logs in real time with `modal app logs chatterbox-tts`.

### Costs

Track GPU usage and costs:

- Go to Billing in the Modal Dashboard
- View breakdown by app and GPU type
- Export usage data for accounting
Example monthly costs (A10G at $0.60/hour):

- 100 generations/day ≈ 1 hour GPU time/day ≈ $18/month
- 1,000 generations/day ≈ 10 hours GPU time/day ≈ $180/month
With 5-minute scaledown, actual GPU time is much less than total app uptime.
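These estimates follow from simple arithmetic, assuming roughly 36 seconds of A10G time per generation:

```python
RATE_PER_HOUR = 0.60   # A10G price
SECONDS_PER_GEN = 36   # assumed average GPU time per generation

def monthly_cost(generations_per_day: int, days: int = 30) -> float:
    """Estimated GPU spend per month at the assumed per-generation time."""
    gpu_hours = generations_per_day * SECONDS_PER_GEN / 3600 * days
    return gpu_hours * RATE_PER_HOUR

print(monthly_cost(100))   # 18.0
print(monthly_cost(1000))  # 180.0
```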
## Optimizations

### Reduce Cold Starts

#### Increase scaledown window

Keep the GPU warm longer. Trade-off: higher idle costs, fewer cold starts.
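In Modal this is a single parameter on the function (600 seconds here is an example value):

```python
import modal

app = modal.App("chatterbox-tts")

@app.function(gpu="A10G", scaledown_window=600)  # stay warm for 10 minutes
def serve():
    ...
```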
#### Use keep-warm policy

Maintain a minimum number of active containers. Cost: ~$0.60/hour continuously.
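A sketch using the `keep_warm` parameter (the same one referenced under Troubleshooting):

```python
import modal

app = modal.App("chatterbox-tts")

@app.function(gpu="A10G", keep_warm=1)  # always keep one container running
def serve():
    ...
```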
#### Preload model in image

Bake model weights into the container image. Benefit: reduces cold start by ~10 seconds.
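One way to do this is to run the download step at image-build time with Modal's `Image.run_function` (a sketch; `download_model` is a hypothetical helper and the Hugging Face repo id is an assumption):

```python
import modal

def download_model():
    # Runs at image build time: weights land in the image layer cache
    from huggingface_hub import snapshot_download
    snapshot_download("ResembleAI/chatterbox")  # repo id is an assumption

image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("chatterbox-tts", "huggingface_hub")
    .run_function(download_model, secrets=[modal.Secret.from_name("hf-token")])
)
```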
### Improve Throughput

#### Increase concurrency

Handle more parallel requests per GPU. Note: an A10G can typically handle 5-10 concurrent TTS generations before VRAM becomes a bottleneck.
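A sketch using Modal's `@modal.concurrent` decorator with the `max_inputs` limit mentioned under Troubleshooting:

```python
import modal

app = modal.App("chatterbox-tts")

@app.function(gpu="A10G")
@modal.concurrent(max_inputs=10)  # up to 10 requests share one container
def serve():
    ...
```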
#### Use batch generation

Process multiple prompts in a single request instead of issuing one call per prompt.
## Troubleshooting

### Deployment fails

- Check the Python version matches: `python_version="3.10"`
- Verify package versions are valid
- Try deploying with the `--force-build` flag
### Voice not found error

- Verify the voice exists in R2
- Check the Modal secret `cloudflare-r2` has the correct credentials
- Verify `R2_ACCOUNT_ID` and `R2_BUCKET_NAME` in `chatterbox_tts.py`
- Test the bucket mount locally (see Testing Locally)
### API key rejected

- Verify the Modal secret `chatterbox-api-key` is set
- Ensure `CHATTERBOX_API_KEY` matches in `.env.local` and the Modal secret
- Check the header name is `X-Api-Key` (case-sensitive)
- Redeploy after changing the secret
### Slow cold starts

Expected cold start time: 30-40 seconds. If longer:

- Check container build time in the logs
- Consider preloading model weights in the image (see Optimizations)
- Use `keep_warm=1` for production
### GPU out of memory

- Reduce `max_inputs` concurrency
- Upgrade to an A100 GPU
- Process shorter prompts (split long text)
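For the last point, a simple sentence-boundary chunker (a sketch; the character limit is an arbitrary example):

```python
import re

def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Split long text on sentence boundaries so each chunk stays small."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = split_text("First sentence. Second sentence! Third?", max_chars=20)
print(chunks)  # ['First sentence.', 'Second sentence!', 'Third?']
```

Each chunk can then be sent as its own `/generate` request and the WAV segments concatenated client-side.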
## Related Documentation

- **Cloudflare R2** - Configure bucket mounting
- **Environment Variables** - Required Modal environment variables
- **Modal Documentation** - Official Modal documentation
- **Chatterbox TTS** - Chatterbox model repository