POST /tts
Text-to-Speech
curl --request POST \
  --url https://api.example.com/tts \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>"
}
'

Overview

This standalone text-to-speech endpoint converts text into MP3 audio using ElevenLabs. It is independent of the conversation flow, so it can generate audio for any text. The endpoint returns raw audio data (not JSON), which makes it suitable for direct playback or download.
Required Environment Variable:
  • ELEVENLABS_API_KEY - Your ElevenLabs API key

Request

Accepts either JSON or form data.
text (string, required)
The text to convert to speech. Maximum 1500 characters; longer input is automatically truncated. This field cannot be empty.

Response

Content-Type: audio/mpeg
Headers:
Content-Disposition: inline; filename=speech.mp3
The response body is raw MP3 audio data (binary). You can:
  • Play it directly in an audio player
  • Save it to a file with the .mp3 extension
  • Stream it to users
  • Embed it in HTML audio elements
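As a rough illustration of saving the response to a file, the sketch below calls the endpoint from Python using only the standard library. The base URL is an assumption taken from the local curl examples on this page, and the helper names (build_tts_request, save_speech) and the 60-second client timeout are hypothetical choices, not part of the service.

```python
import json
import urllib.request

# Assumption: the service runs locally, as in the curl examples on this page.
BASE_URL = "http://localhost:5000"


def build_tts_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /tts endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(text: str, path: str = "speech.mp3", timeout: float = 60.0) -> int:
    """Send the request and write the raw MP3 bytes to `path`.

    Returns the number of bytes written. The timeout is an arbitrary
    client-side value chosen for this sketch.
    """
    with urllib.request.urlopen(build_tts_request(text), timeout=timeout) as resp:
        audio = resp.read()  # binary MP3 data, not JSON
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Calling save_speech("Hello, welcome to our pizza restaurant!") against a running instance should leave a playable speech.mp3 in the working directory.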

Error Responses

400 Bad Request

No text provided:
{
  "error": "No text provided"
}

500 Internal Server Error

ElevenLabs API key not configured:
{
  "error": "ELEVENLABS_API_KEY not set in .env"
}
ElevenLabs API error:
{
  "error": "<error details from ElevenLabs or Python exception>"
}
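Because success responses are binary and error responses are JSON, a client has to branch on the status code and content type before touching the body. The following is a minimal sketch of that branching; TTSError and audio_or_raise are hypothetical names, and treating 200 as the only success status is an assumption, since this page does not state the success code explicitly.

```python
import json


class TTSError(Exception):
    """Raised when /tts answers with a JSON error instead of audio."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def audio_or_raise(status: int, content_type: str, body: bytes) -> bytes:
    """Return the MP3 bytes on success, otherwise raise TTSError."""
    media_type = content_type.split(";")[0].strip().lower()
    # Assumption: a successful call returns HTTP 200 with audio/mpeg.
    if status == 200 and media_type == "audio/mpeg":
        return body
    try:
        # Documented error shape: {"error": "<message>"}
        message = json.loads(body.decode("utf-8"))["error"]
    except (ValueError, KeyError, TypeError):
        # Body was not the documented JSON shape; keep a short excerpt.
        message = body[:200].decode("utf-8", errors="replace")
    raise TTSError(status, str(message))
```

With urllib.request, non-2xx responses surface as urllib.error.HTTPError; its code, headers, and read() can be passed to this helper.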

Examples

# JSON request
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, welcome to our pizza restaurant!"}' \
  --output speech.mp3

# Form data request (sent as application/x-www-form-urlencoded)
curl -X POST http://localhost:5000/tts \
  --data-urlencode "text=Hello, welcome to our pizza restaurant!" \
  --output speech.mp3

Implementation Details

ElevenLabs Configuration

  • Voice ID: JBFqnCBsd6RMkjVDRZzb
  • Model: eleven_turbo_v2_5
  • Format: MP3 audio
  • Text Length: Maximum 1500 characters (automatically truncated)

Text Truncation

If your text exceeds 1500 characters, only the first 1500 characters will be converted to speech. This is done to:
  • Manage API costs
  • Ensure reasonable response times
  • Stay within ElevenLabs rate limits
Example:
text = "Very long text..." * 1000  # 17,000 characters
# Only the first 1500 characters will be converted
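Since truncation happens silently, a client may want to detect it before sending. The sketch below assumes the server does a plain cut at 1500 characters, counted as Python string characters; the actual server-side logic is not shown on this page, and truncate_for_tts is a hypothetical helper.

```python
MAX_TTS_CHARS = 1500  # limit stated on this page


def truncate_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> tuple[str, bool]:
    """Return (text cut to `limit` characters, whether anything was dropped).

    Assumption: the server slices the input the same way, with no
    trimming at word or sentence boundaries.
    """
    return text[:limit], len(text) > limit
```

A caller can use the boolean to warn the user, or to split long text into several requests instead of losing the tail.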

Audio Stream Handling

The endpoint uses a custom collect_audio_bytes() function to handle different audio stream formats from the ElevenLabs client:
  • Byte arrays (most common)
  • Iterable streams (chunks)
  • String data (rare)
This ensures compatibility across different versions of the ElevenLabs SDK.
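The source of collect_audio_bytes() is not included here, so the following is only a sketch of how such a function could normalize the three shapes listed above. The UTF-8 encoding of string data in particular is an assumption; the real function may treat strings differently.

```python
from typing import Iterable, Union

AudioLike = Union[bytes, bytearray, str, Iterable[bytes]]


def collect_audio_bytes(audio: AudioLike) -> bytes:
    """Normalize an SDK return value into a single bytes object."""
    if isinstance(audio, (bytes, bytearray)):
        # Byte arrays: already complete audio data.
        return bytes(audio)
    if isinstance(audio, str):
        # Assumption: string payloads are encoded as UTF-8.
        return audio.encode("utf-8")
    # Iterable stream: join the chunks, skipping empty ones.
    return b"".join(chunk for chunk in audio if chunk)
```

The type checks come before the generic iteration branch because bytes and str are themselves iterable and would otherwise be consumed element by element.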

Content-Type Handling

The endpoint accepts both:
  1. JSON: Content-Type: application/json with {"text": "..."}
  2. Form data: Content-Type: application/x-www-form-urlencoded with text=...
This makes it flexible for different client types (browsers, API clients, etc.).
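To illustrate the two accepted body formats, here is a framework-free sketch of pulling the text field out of a raw request. The real service most likely relies on its web framework for this; extract_text is a hypothetical helper, and returning an empty string for missing or unparseable input is an assumption that maps onto the 400 "No text provided" case.

```python
import json
from urllib.parse import parse_qs


def extract_text(content_type: str, body: bytes) -> str:
    """Return the `text` field from a JSON or URL-encoded body, or ""."""
    media_type = content_type.split(";")[0].strip().lower()
    try:
        raw = body.decode("utf-8")
    except UnicodeDecodeError:
        return ""
    if media_type == "application/json":
        try:
            payload = json.loads(raw)
        except ValueError:
            return ""
        value = payload.get("text", "") if isinstance(payload, dict) else ""
    elif media_type == "application/x-www-form-urlencoded":
        # parse_qs maps each key to a list of values; take the first.
        value = parse_qs(raw).get("text", [""])[0]
    else:
        value = ""
    return value if isinstance(value, str) else ""
```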

Response Headers

The response includes:
Content-Type: audio/mpeg
Content-Disposition: inline; filename=speech.mp3
  • Content-Type: audio/mpeg tells the browser it’s MP3 audio
  • Content-Disposition: inline suggests playing in-browser rather than downloading
  • filename=speech.mp3 provides a default filename if the user saves it
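A non-browser client can read the same header to pick a filename when saving the audio. The sketch below uses the standard library's email.message parser for this; suggested_filename is a hypothetical helper, and falling back to speech.mp3 when the header carries no filename is an assumption.

```python
import os
from email.message import Message


def suggested_filename(content_disposition: str, default: str = "speech.mp3") -> str:
    """Extract the filename parameter from a Content-Disposition value."""
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    name = msg.get_filename()
    # Strip any directory part so a header value cannot redirect the save path.
    return os.path.basename(name) if name else default
```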

Use Cases

  1. Preview audio generation - Test TTS before integrating into calls
  2. Generate IVR prompts - Create audio files for your phone system
  3. Accessibility features - Convert text content to audio for users
  4. Testing voice quality - Compare different text inputs and voice settings
  5. Standalone audio API - Use independently of the conversation features

Performance Notes

  • Response time depends on text length and ElevenLabs API performance
  • Typical response time: 1-3 seconds for short texts
  • Consider caching frequently used audio to reduce API calls
  • The endpoint has a 30-second timeout for the ElevenLabs API call
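The caching suggestion could be implemented on the client side roughly as follows. This is a sketch under stated assumptions: cached_speech and its synthesize callback are hypothetical, the cache is a plain directory of MP3 files, and the key is derived from the first 1500 characters because longer inputs are truncated to that length anyway.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_speech(text: str, synthesize: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return audio for `text`, calling `synthesize` only on a cache miss.

    `synthesize` stands in for whatever function actually calls /tts.
    """
    # Texts sharing the first 1500 characters yield the same audio.
    key = hashlib.sha256(text[:1500].encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

If the voice or model configuration changes, the cache directory should be cleared, since the key covers only the text.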
