Overview

LiteLLM provides unified interfaces for audio transcription (speech-to-text) and text-to-speech synthesis across multiple providers.

Transcription API

transcription()

Transcribe audio to text using OpenAI Whisper or other speech-to-text providers.

Function Signature

def transcription(
    file: FileTypes,
    model: str,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    temperature: Optional[float] = None,
    timestamp_granularities: Optional[List[str]] = None,
    api_base: Optional[str] = None,
    api_key: Optional[str] = None,
    api_version: Optional[str] = None,
    timeout: Optional[float] = 600,
    **kwargs
) -> TranscriptionResponse

Parameters

file
FileTypes
required
The audio file to transcribe. Can be:
  • File path (string)
  • File object
  • File-like object with read() method
Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
Maximum file size: 25 MB
model
string
required
Model to use for transcription. Examples:
  • whisper-1 (OpenAI)
  • azure/whisper (Azure OpenAI)
  • whisper-large-v3 (Groq)
language
string
Language of the input audio in ISO-639-1 format. Examples: "en", "es", "fr", "de". Supplying it improves accuracy and latency.
prompt
string
Optional text to guide the model's style or continue a previous segment. The prompt should match the audio language.
response_format
string
default:"json"
Format of the transcript output. Options:
  • "json": JSON with text field
  • "text": Plain text
  • "srt": SubRip subtitle format
  • "verbose_json": JSON with timestamps
  • "vtt": Web Video Text Tracks format
temperature
float
default:"0"
Sampling temperature between 0 and 1. Higher values make output more random.
timestamp_granularities
List[str]
Timestamp granularities to include. Options: ["word"], ["segment"], or both. Requires response_format="verbose_json".

Response

text
string
The transcribed text.
task
string
The task performed (“transcribe”).
language
string
Detected language of the audio.
duration
float
Duration of the audio in seconds.
segments
List[dict]
Segments with timestamps (if verbose_json format).
words
List[dict]
Word-level timestamps (if timestamp_granularities includes “word”).

Examples

Basic Transcription
import litellm

with open("audio.mp3", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file
    )

print(response.text)
With Language Hint
import litellm

with open("spanish_audio.mp3", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file,
        language="es"
    )

print(response.text)
Verbose Output with Timestamps
import litellm

with open("audio.mp3", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Word-level timestamps
for word in response.words:
    print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
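If you need custom subtitle timing from verbose_json output rather than the built-in srt format, the second values above can be converted to SRT-style timecodes with a small helper (a sketch; srt_timestamp is not part of LiteLLM):

```python
def srt_timestamp(seconds: float) -> str:
    """Convert a float number of seconds to an SRT timecode (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# e.g. a word running from 1.5s to 3.25s:
print(srt_timestamp(1.5), "-->", srt_timestamp(3.25))
# → 00:00:01,500 --> 00:00:03,250
```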
SRT Subtitle Format
import litellm

with open("video.mp4", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )

# Save as SRT file
with open("subtitles.srt", "w") as f:
    f.write(response.text)
Async Transcription
import litellm
import asyncio

async def main():
    with open("audio.mp3", "rb") as audio_file:
        response = await litellm.atranscription(
            model="whisper-1",
            file=audio_file
        )
    print(response.text)

asyncio.run(main())

Speech API (Text-to-Speech)

speech()

Generate spoken audio from text using text-to-speech models.

Function Signature

def speech(
    model: str,
    input: str,
    voice: str,
    response_format: Optional[str] = None,
    speed: Optional[float] = None,
    api_base: Optional[str] = None,
    api_key: Optional[str] = None,
    api_version: Optional[str] = None,
    timeout: Optional[float] = 600,
    **kwargs
) -> HttpxBinaryResponseContent

Parameters

model
string
required
TTS model to use. Examples:
  • tts-1 (OpenAI, faster)
  • tts-1-hd (OpenAI, higher quality)
  • azure/tts-1 (Azure OpenAI)
input
string
required
Text to convert to speech. Maximum length 4096 characters.
voice
string
required
Voice to use for generation. OpenAI voices:
  • alloy
  • echo
  • fable
  • onyx
  • nova
  • shimmer
response_format
string
default:"mp3"
Audio format. Options:
  • "mp3"
  • "opus"
  • "aac"
  • "flac"
  • "wav"
  • "pcm"
speed
float
default:"1.0"
Playback speed of the audio. Range: 0.25 to 4.0.

Response

Returns binary audio data that can be saved to a file or streamed.

Examples

Basic Text-to-Speech
import litellm

response = litellm.speech(
    model="tts-1",
    voice="alloy",
    input="Hello! Welcome to LiteLLM text-to-speech."
)

# Save to file
with open("speech.mp3", "wb") as f:
    f.write(response.content)
High Quality Voice
import litellm

response = litellm.speech(
    model="tts-1-hd",
    voice="nova",
    input="This is high quality audio output.",
    response_format="wav"
)

with open("speech_hd.wav", "wb") as f:
    f.write(response.content)
Adjust Speed
import litellm

# Faster speech
response = litellm.speech(
    model="tts-1",
    voice="echo",
    input="This will be spoken quickly.",
    speed=1.5
)

with open("fast_speech.mp3", "wb") as f:
    f.write(response.content)
Different Voices
import litellm

text = "The quick brown fox jumps over the lazy dog."

# Try different voices
voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

for voice in voices:
    response = litellm.speech(
        model="tts-1",
        voice=voice,
        input=text
    )
    
    with open(f"speech_{voice}.mp3", "wb") as f:
        f.write(response.content)
Async Speech Generation
import litellm
import asyncio

async def main():
    response = await litellm.aspeech(
        model="tts-1",
        voice="alloy",
        input="Async text-to-speech example."
    )
    
    with open("async_speech.mp3", "wb") as f:
        f.write(response.content)

asyncio.run(main())
Azure OpenAI TTS
import litellm

response = litellm.speech(
    model="azure/tts-1",
    voice="alloy",
    input="Using Azure OpenAI for text-to-speech.",
    api_key="your-azure-key",
    api_base="https://your-endpoint.openai.azure.com/",
    api_version="2024-02-01"
)

with open("azure_speech.mp3", "wb") as f:
    f.write(response.content)

Provider Support

Transcription Providers

  • OpenAI: Whisper models
  • Azure OpenAI: Whisper models
  • Groq: Ultra-fast Whisper models

Text-to-Speech Providers

  • OpenAI: tts-1, tts-1-hd
  • Azure OpenAI: tts-1, tts-1-hd
  • Vertex AI: Text-to-speech models

Error Handling

import litellm
from litellm import AuthenticationError, BadRequestError

try:
    # Transcription
    with open("audio.mp3", "rb") as audio_file:
        response = litellm.transcription(
            model="whisper-1",
            file=audio_file
        )
    
    # Text-to-speech
    response = litellm.speech(
        model="tts-1",
        voice="alloy",
        input="Test speech"
    )
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except BadRequestError as e:
    print(f"Bad request: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
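Transient failures (rate limits, timeouts) are also worth handling. A generic retry wrapper with exponential backoff can be layered over either call; this is an illustrative sketch (with_retries is not part of LiteLLM, which has its own retry options):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage (hypothetical):
# response = with_retries(
#     lambda: litellm.speech(model="tts-1", voice="alloy", input="Test speech")
# )
```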

Best Practices

Transcription

  1. Provide language hint when known for better accuracy
  2. Use appropriate audio quality - higher quality = better transcription
  3. Keep files under 25MB - split larger files if needed
  4. Use prompt to maintain context across segments
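The 25 MB limit from point 3 can be checked up front before uploading (a minimal sketch; the limit applies to OpenAI Whisper uploads and may differ for other providers):

```python
import os

MAX_TRANSCRIPTION_BYTES = 25 * 1024 * 1024  # 25 MB Whisper upload limit

def check_audio_size(path: str, limit: int = MAX_TRANSCRIPTION_BYTES) -> None:
    """Raise ValueError if the file at `path` exceeds the upload limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(
            f"{path} is {size / 1_048_576:.1f} MB; split it before transcribing."
        )
```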

Text-to-Speech

  1. Choose appropriate voice for your use case
  2. Use tts-1 for real-time applications (faster)
  3. Use tts-1-hd when quality is priority
  4. Break long text into smaller chunks for better control
  5. Test different speeds to find optimal playback rate
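Point 4 above can be done with a simple sentence-aware splitter that keeps each chunk under the 4096-character input limit (a sketch; chunk_text is not part of LiteLLM, and it assumes no single sentence exceeds max_chars):

```python
import re

def chunk_text(text: str, max_chars: int = 4096) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed as `input` to litellm.speech().
```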
