POST /api/speech

Generate high-quality audio from text using OpenAI’s text-to-speech models. This endpoint returns audio as a base64-encoded data URL ready for playback.

Request Body

text
string
required
The text to convert to speech. The maximum length varies by model; several paragraphs are typically supported.
voice
string
default:"alloy"
The voice to use for speech generation. OpenAI provides six natural-sounding voices:
  • alloy (default) - Neutral and balanced
  • echo - Clear and expressive
  • fable - Warm and engaging
  • onyx - Deep and authoritative
  • nova - Energetic and bright
  • shimmer - Soft and calm
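
The request body described above can be modeled on the client before sending. The following is a minimal sketch, assuming you want to reject unknown voice names locally; `buildSpeechRequest` and the `SpeechRequest` type are hypothetical names, and whether the server itself rejects unknown voices is not documented here.

```typescript
// The six voice names listed above.
const VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'] as const;
type Voice = (typeof VOICES)[number];

// Shape of the JSON request body (hypothetical type name).
interface SpeechRequest {
  text: string;
  voice: Voice;
}

// Hypothetical helper: builds a request body and applies the documented
// default voice ("alloy") when none is given.
function buildSpeechRequest(text: string, voice?: string): SpeechRequest {
  if (text.trim().length === 0) {
    throw new Error('text is required');
  }
  const chosen = voice === undefined ? 'alloy' : voice;
  if ((VOICES as readonly string[]).indexOf(chosen) === -1) {
    throw new Error(`Unknown voice: ${chosen}`);
  }
  return { text, voice: chosen as Voice };
}
```

The result can be passed straight to `JSON.stringify` as the `body` of the `fetch` call.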

Response

audio
string
Base64-encoded audio data URL in the format data:audio/[format];base64,[data]. Ready to use with HTML5 <audio> elements or Web Audio API.
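
If you need the MIME type or the raw base64 payload separately (for example, to decode the bytes yourself), the data URL can be split along the format shown above. This is a sketch only; `parseAudioDataUrl` is a hypothetical helper, and it assumes the URL carries no extra parameters between the MIME type and `;base64`.

```typescript
interface ParsedAudio {
  mimeType: string; // e.g. "audio/mp3"
  base64: string;   // payload after the comma
}

// Hypothetical helper: splits data:audio/[format];base64,[data] into its parts.
function parseAudioDataUrl(dataUrl: string): ParsedAudio {
  const match = /^data:(audio\/[\w.+-]+);base64,([A-Za-z0-9+/]*={0,2})$/.exec(dataUrl);
  if (match === null) {
    throw new Error('Not a base64 audio data URL');
  }
  return { mimeType: match[1], base64: match[2] };
}
```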

Example Request

const response = await fetch('http://localhost:3001/api/speech', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: 'Hello! This is a test of the text to speech API.',
    voice: 'nova'
  })
});

const result = await response.json();
console.log('Audio data URL:', result.audio);

// Play the audio
const audio = new Audio(result.audio);
audio.play();

Example Response

{
  "audio": "data:audio/mp3;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjYwLjE2LjEwMAAAAAAAAAAAAAAA..."
}

Voice Characteristics

Choose the voice that best fits your use case:
Voice     Characteristics                   Best For
alloy     Neutral, balanced, versatile      General purpose, professional content
echo      Clear, expressive, articulate     Presentations, tutorials, instructions
fable     Warm, engaging, friendly          Storytelling, casual content, greetings
onyx      Deep, authoritative, confident    Formal announcements, important messages
nova      Energetic, bright, upbeat         Notifications, positive messages, alerts
shimmer   Soft, calm, soothing              Relaxing content, gentle reminders

Audio Format

The API returns audio in MP3 format by default, encoded as a base64 data URL. This format:
  • Works directly in browser <audio> elements
  • Is compatible with the Web Audio API
  • Keeps file sizes small for quick transmission
  • Provides high quality at a 24 kHz sample rate
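
Base64 encodes every 3 bytes as 4 characters, so the data URL is roughly a third larger than the MP3 it carries. The sketch below estimates the decoded size from the payload length; `decodedByteLength` is a hypothetical helper and assumes standard padded base64 whose length is a multiple of 4.

```typescript
// Hypothetical helper: number of bytes represented by a padded base64 string.
function decodedByteLength(base64: string): number {
  // Each trailing "=" stands for one byte that is not present in the output.
  const padding = base64.endsWith('==') ? 2 : base64.endsWith('=') ? 1 : 0;
  return (base64.length / 4) * 3 - padding;
}
```

This can be useful for logging or for deciding whether a response is too large to keep in memory.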

Use Cases

Voice Responses

Generate voice responses from AI in the Voice Agent

Notifications

Speak important notifications or alerts

Accessibility

Read text content aloud for visually impaired users

Multilingual Support

Generate speech in multiple languages with natural pronunciation

Error Handling

400 Bad Request

{
  "error": "No text provided"
}

500 Server Error

{
  "error": "Speech generation failed"
}
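
The two error shapes above can be turned into user-facing messages with a small mapping function. This is an illustrative sketch; `describeSpeechError` and the message wording are assumptions, and the endpoint may return other status codes not listed here.

```typescript
// Error body shape shown in the examples above.
interface SpeechErrorBody {
  error?: string;
}

// Hypothetical helper: maps a status code and error body to a readable message.
function describeSpeechError(status: number, body: SpeechErrorBody): string {
  const detail = body.error === undefined ? 'Unknown error' : body.error;
  if (status === 400) {
    return `Invalid request: ${detail}`;
  }
  if (status === 500) {
    return `Server failure: ${detail}`;
  }
  // Any status not documented above falls through to a generic message.
  return `Unexpected status ${status}: ${detail}`;
}
```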

Performance Considerations

  • Average response time: 1-3 seconds depending on text length
  • Text is processed in chunks for longer inputs
  • Audio is streamed and encoded efficiently
  • Use shorter text segments for faster response times
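
One way to act on the last point is to split long text into sentence-aligned segments on the client and request audio for each in turn. The sketch below is an assumption about how you might do that, not part of the endpoint; `splitIntoSegments` and the `maxChars` parameter are hypothetical.

```typescript
// Hypothetical helper: groups whole sentences into segments of at most
// maxChars characters. A single sentence longer than maxChars is kept
// intact as its own oversized segment rather than cut mid-sentence.
function splitIntoSegments(text: string, maxChars: number): string[] {
  // Match runs of text up to and including their closing punctuation.
  const matches = text.match(/[^.!?]+[.!?]*/g);
  const sentences = (matches === null ? [] : matches)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const segments: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current === '' ? sentence : `${current} ${sentence}`;
    if (current === '' || candidate.length <= maxChars) {
      current = candidate;
    } else {
      segments.push(current);
      current = sentence;
    }
  }
  if (current !== '') {
    segments.push(current);
  }
  return segments;
}
```

Each segment can then be sent as the `text` field of its own request, so the first audio clip is ready sooner.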

Integration Example

Complete example with error handling that resolves once playback finishes:
async function speakText(text: string, voice: string = 'alloy') {
  try {
    const response = await fetch('http://localhost:3001/api/speech', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice })
    });
    
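    if (!response.ok) {
      // The error body may not be JSON (e.g. a proxy error page), so fall back to an empty object
      const error = await response.json().catch(() => ({}));
      throw new Error(error.error || 'Speech generation failed');
    }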
    
    const { audio } = await response.json();
    
    // Create and play audio
    const audioElement = new Audio(audio);
    
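    return new Promise<void>((resolve, reject) => {
      audioElement.onended = () => resolve();
      audioElement.onerror = () => reject(new Error('Audio playback failed'));
      // play() returns a promise that rejects if playback is blocked (e.g. by an autoplay policy)
      audioElement.play().catch(reject);
    });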
  } catch (error) {
    console.error('Speech error:', error);
    throw error;
  }
}

// Usage
await speakText('Welcome to Tabby AI Keyboard', 'nova');

Source: nextjs-backend/src/app/api/speech/route.ts

Related Endpoints

  • Transcribe - Convert speech to text (STT)
  • Voice Agent - Real-time voice conversations
  • Chat - Generate text responses that can be spoken