Overview

LiteLLM provides unified interfaces for audio transcription (speech-to-text) and text-to-speech synthesis across multiple providers.

Transcription API

transcription()

Transcribe audio to text using OpenAI Whisper or other speech-to-text providers.

Function Signature

def transcription(
    file: FileTypes,
    model: str,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    temperature: Optional[float] = None,
    timestamp_granularities: Optional[List[str]] = None,
    api_base: Optional[str] = None,
    api_key: Optional[str] = None,
    api_version: Optional[str] = None,
    timeout: Optional[float] = 600,
    **kwargs
) -> TranscriptionResponse

Parameters

file
FileTypes
required
The audio file to transcribe. Can be:
  • File path (string)
  • File object
  • File-like object with read() method
Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
Maximum file size: 25 MB
model
string
required
Model to use for transcription. Examples:
  • whisper-1 (OpenAI)
  • azure/whisper (Azure OpenAI)
  • whisper-large-v3 (Groq)
language
string
Language of the input audio in ISO-639-1 format. Examples: "en", "es", "fr", "de". Supplying it improves accuracy and latency.
prompt
string
Optional text to guide the model's style or continue a previous segment. The prompt should match the audio language.
response_format
string
default:"json"
Format of the transcript output. Options:
  • "json": JSON with text field
  • "text": Plain text
  • "srt": SubRip subtitle format
  • "verbose_json": JSON with timestamps
  • "vtt": Web Video Text Tracks format
temperature
float
default:"0"
Sampling temperature between 0 and 1. Higher values make output more random.
timestamp_granularities
List[str]
Timestamp granularities to include. Options: ["word"], ["segment"], or both. Requires response_format="verbose_json".

Response

text
string
The transcribed text.
task
string
The task performed (“transcribe”).
language
string
Detected language of the audio.
duration
float
Duration of the audio in seconds.
segments
List[dict]
Segments with timestamps (if verbose_json format).
words
List[dict]
Word-level timestamps (if timestamp_granularities includes “word”).

Examples

Basic Transcription
import litellm

with open("audio.mp3", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file
    )

print(response.text)
With Language Hint
import litellm

with open("spanish_audio.mp3", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file,
        language="es"
    )

print(response.text)
Verbose Output with Timestamps
import litellm

with open("audio.mp3", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Word-level timestamps
for word in response.words:
    print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
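If you need custom subtitle timing from verbose_json output rather than the built-in srt format, the second values above can be converted to SRT-style timecodes with a small helper (a sketch; srt_timestamp is not part of LiteLLM):

```python
def srt_timestamp(seconds: float) -> str:
    """Convert a float number of seconds to an SRT timecode (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# e.g. a word running from 1.5s to 3.25s:
print(srt_timestamp(1.5), "-->", srt_timestamp(3.25))
# → 00:00:01,500 --> 00:00:03,250
```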
SRT Subtitle Format
import litellm

with open("video.mp4", "rb") as audio_file:
    response = litellm.transcription(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )

# Save as SRT file
with open("subtitles.srt", "w") as f:
    f.write(response.text)
Async Transcription
import litellm
import asyncio

async def main():
    with open("audio.mp3", "rb") as audio_file:
        response = await litellm.atranscription(
            model="whisper-1",
            file=audio_file
        )
    print(response.text)

asyncio.run(main())

Speech API (Text-to-Speech)

speech()

Generate spoken audio from text using text-to-speech models.

Function Signature

def speech(
    model: str,
    input: str,
    voice: str,
    response_format: Optional[str] = None,
    speed: Optional[float] = None,
    api_base: Optional[str] = None,
    api_key: Optional[str] = None,
    api_version: Optional[str] = None,
    timeout: Optional[float] = 600,
    **kwargs
) -> HttpxBinaryResponseContent

Parameters

model
string
required
TTS model to use. Examples:
  • tts-1 (OpenAI, faster)
  • tts-1-hd (OpenAI, higher quality)
  • azure/tts-1 (Azure OpenAI)
input
string
required
Text to convert to speech. Maximum length 4096 characters.
voice
string
required
Voice to use for generation. OpenAI voices:
  • alloy
  • echo
  • fable
  • onyx
  • nova
  • shimmer
response_format
string
default:"mp3"
Audio format. Options:
  • "mp3"
  • "opus"
  • "aac"
  • "flac"
  • "wav"
  • "pcm"
speed
float
default:"1.0"
Playback speed of the audio. Range: 0.25 to 4.0.

Response

Returns binary audio data that can be saved to a file or streamed.

Examples

Basic Text-to-Speech
import litellm

response = litellm.speech(
    model="tts-1",
    voice="alloy",
    input="Hello! Welcome to LiteLLM text-to-speech."
)

# Save to file
with open("speech.mp3", "wb") as f:
    f.write(response.content)
High Quality Voice
import litellm

response = litellm.speech(
    model="tts-1-hd",
    voice="nova",
    input="This is high quality audio output.",
    response_format="wav"
)

with open("speech_hd.wav", "wb") as f:
    f.write(response.content)
Adjust Speed
import litellm

# Faster speech
response = litellm.speech(
    model="tts-1",
    voice="echo",
    input="This will be spoken quickly.",
    speed=1.5
)

with open("fast_speech.mp3", "wb") as f:
    f.write(response.content)
Different Voices
import litellm

text = "The quick brown fox jumps over the lazy dog."

# Try different voices
voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

for voice in voices:
    response = litellm.speech(
        model="tts-1",
        voice=voice,
        input=text
    )
    
    with open(f"speech_{voice}.mp3", "wb") as f:
        f.write(response.content)
Async Speech Generation
import litellm
import asyncio

async def main():
    response = await litellm.aspeech(
        model="tts-1",
        voice="alloy",
        input="Async text-to-speech example."
    )
    
    with open("async_speech.mp3", "wb") as f:
        f.write(response.content)

asyncio.run(main())
Azure OpenAI TTS
import litellm

response = litellm.speech(
    model="azure/tts-1",
    voice="alloy",
    input="Using Azure OpenAI for text-to-speech.",
    api_key="your-azure-key",
    api_base="https://your-endpoint.openai.azure.com/",
    api_version="2024-02-01"
)

with open("azure_speech.mp3", "wb") as f:
    f.write(response.content)

Provider Support

Transcription Providers

  • OpenAI: Whisper models
  • Azure OpenAI: Whisper models
  • Groq: Ultra-fast Whisper models

Text-to-Speech Providers

  • OpenAI: tts-1, tts-1-hd
  • Azure OpenAI: tts-1, tts-1-hd
  • Vertex AI: Text-to-speech models

Error Handling

import litellm
from litellm import AuthenticationError, BadRequestError

try:
    # Transcription
    with open("audio.mp3", "rb") as audio_file:
        response = litellm.transcription(
            model="whisper-1",
            file=audio_file
        )
    
    # Text-to-speech
    response = litellm.speech(
        model="tts-1",
        voice="alloy",
        input="Test speech"
    )
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except BadRequestError as e:
    print(f"Bad request: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
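Transient failures (rate limits, timeouts) are also worth handling. A generic retry wrapper with exponential backoff can be layered over either call; this is an illustrative sketch (with_retries is not part of LiteLLM, which has its own retry options):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage (hypothetical):
# response = with_retries(
#     lambda: litellm.speech(model="tts-1", voice="alloy", input="Test speech")
# )
```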

Best Practices

Transcription

  1. Provide language hint when known for better accuracy
  2. Use appropriate audio quality - higher quality = better transcription
  3. Keep files under 25MB - split larger files if needed
  4. Use prompt to maintain context across segments
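The 25 MB limit from point 3 can be checked up front before uploading (a minimal sketch; the limit applies to OpenAI Whisper uploads and may differ for other providers):

```python
import os

MAX_TRANSCRIPTION_BYTES = 25 * 1024 * 1024  # 25 MB Whisper upload limit

def check_audio_size(path: str, limit: int = MAX_TRANSCRIPTION_BYTES) -> None:
    """Raise ValueError if the file at `path` exceeds the upload limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(
            f"{path} is {size / 1_048_576:.1f} MB; split it before transcribing."
        )
```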

Text-to-Speech

  1. Choose appropriate voice for your use case
  2. Use tts-1 for real-time applications (faster)
  3. Use tts-1-hd when quality is priority
  4. Break long text into smaller chunks for better control
  5. Test different speeds to find optimal playback rate
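Point 4 above can be done with a simple sentence-aware splitter that keeps each chunk under the 4096-character input limit (a sketch; chunk_text is not part of LiteLLM, and it assumes no single sentence exceeds max_chars):

```python
import re

def chunk_text(text: str, max_chars: int = 4096) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed as `input` to litellm.speech().
```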
