Overview
LiteLLM provides unified interfaces for audio transcription (speech-to-text) and text-to-speech (TTS) across multiple providers including OpenAI Whisper, Azure, and more.
Audio Transcription
Basic Usage
from litellm import transcription

with open("audio.mp3", "rb") as audio_file:
    response = transcription(
        model="whisper-1",
        file=audio_file
    )

print(response.text)
Function Signature
def transcription(
    model: str,
    file: FileTypes,
    # Optional parameters
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    timestamp_granularities: Optional[List[str]] = None,
    temperature: Optional[int] = None,
    # API configuration
    timeout: float = 600,
    api_key: Optional[str] = None,
    api_base: Optional[str] = None,
    api_version: Optional[str] = None,
    max_retries: Optional[int] = None,
    custom_llm_provider: Optional[str] = None,
    **kwargs
) -> TranscriptionResponse
Transcription Parameters
- model: The transcription model to use. Example: whisper-1
- file: The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
- language: Language of the audio in ISO-639-1 format (e.g., "en", "fr", "es")
- prompt: Optional text to guide the model's style or continue a previous segment
- response_format: Format of the transcript output. Options: "json", "text", "srt", "verbose_json", "vtt"
- timestamp_granularities: Timestamp granularities to include. Options: ["word"], ["segment"], or both
- temperature: Sampling temperature (0 to 1). Higher values make output more random.
Transcription Response
class TranscriptionResponse:
    text: str  # The transcribed text
Transcription Examples
Basic Transcription
from litellm import transcription

with open("interview.mp3", "rb") as audio_file:
    response = transcription(
        model="whisper-1",
        file=audio_file
    )

print(response.text)
Transcription with Language
from litellm import transcription

with open("french_audio.mp3", "rb") as audio_file:
    response = transcription(
        model="whisper-1",
        file=audio_file,
        language="fr"  # French
    )

print(response.text)
Different Output Formats
from litellm import transcription

with open("audio.mp3", "rb") as audio_file:
    # SRT format (with timestamps)
    srt_response = transcription(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )
    print(srt_response.text)

    # Rewind the file handle before reusing it for a second request
    audio_file.seek(0)

    # VTT format (web video text tracks)
    vtt_response = transcription(
        model="whisper-1",
        file=audio_file,
        response_format="vtt"
    )
    print(vtt_response.text)
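The SRT text returned above is plain subtitle markup, so it can be post-processed without extra dependencies. A minimal sketch of parsing it into (index, timing, text) cues, assuming the standard SRT layout of an index line, a timing line, and one or more text lines per blank-line-separated block:

```python
def parse_srt(srt_text: str) -> list:
    """Split SRT subtitle text into (index, timing, text) tuples."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) >= 3:
            # Line 0: cue index, line 1: "start --> end", rest: subtitle text
            cues.append((int(lines[0]), lines[1], " ".join(lines[2:])))
    return cues
```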
Async Transcription
import asyncio
from litellm import atranscription

async def transcribe_audio():
    with open("audio.mp3", "rb") as audio_file:
        response = await atranscription(
            model="whisper-1",
            file=audio_file
        )
    return response.text

text = asyncio.run(transcribe_audio())
print(text)
Multiple Files Concurrently
import asyncio
from litellm import atranscription

async def transcribe_one(file_path: str) -> str:
    # Keep the file open until the request completes; opening files in a
    # loop and gathering later would close each handle before it is read
    with open(file_path, "rb") as f:
        response = await atranscription(model="whisper-1", file=f)
    return response.text

async def transcribe_multiple(files: list) -> list:
    tasks = [transcribe_one(path) for path in files]
    return await asyncio.gather(*tasks)

files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
transcripts = asyncio.run(transcribe_multiple(files))
for i, text in enumerate(transcripts):
    print(f"File {i + 1}: {text}")
Text-to-Speech
Basic Usage
from litellm import speech

response = speech(
    model="tts-1",
    input="Hello, this is a test of text to speech.",
    voice="alloy"
)

# Save audio file
response.stream_to_file("output.mp3")
Function Signature
def speech(
    model: str,
    input: str,
    voice: Optional[Union[str, dict]] = None,
    # Optional parameters
    response_format: Optional[str] = None,
    speed: Optional[int] = None,
    instructions: Optional[str] = None,
    # API configuration
    api_key: Optional[str] = None,
    api_base: Optional[str] = None,
    api_version: Optional[str] = None,
    organization: Optional[str] = None,
    project: Optional[str] = None,
    max_retries: Optional[int] = None,
    metadata: Optional[dict] = None,
    timeout: Optional[Union[float, httpx.Timeout]] = None,
    client: Optional[Any] = None,
    headers: Optional[dict] = None,
    custom_llm_provider: Optional[str] = None,
    **kwargs
) -> HttpxBinaryResponseContent
TTS Parameters
- model: The TTS model to use. Examples: tts-1, tts-1-hd
- input: The text to convert to speech (max 4096 characters)
- voice: The voice to use. Options: alloy, echo, fable, onyx, nova, shimmer
- response_format: Audio format. Options: mp3 (default), opus, aac, flac, wav, pcm
- speed: Speed of the speech (0.25 to 4.0). Default: 1.0
TTS Response
Returns HttpxBinaryResponseContent with audio data:
response.stream_to_file("output.mp3") # Save to file
response.content # Raw bytes
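When stream_to_file is not convenient (for example, the audio should be uploaded elsewhere or stored in a database), the raw bytes in response.content can be written out manually. A small sketch of a helper for that (save_audio is a hypothetical name, not part of LiteLLM):

```python
def save_audio(content: bytes, path: str) -> int:
    """Write raw audio bytes to disk; returns the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(content)
```

Usage would then be `save_audio(response.content, "output.mp3")`.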
TTS Examples
Basic Text-to-Speech
from litellm import speech

response = speech(
    model="tts-1",
    input="Welcome to the future of voice synthesis!",
    voice="nova"
)

response.stream_to_file("welcome.mp3")
Different Voices
from litellm import speech

text = "This is a test of different voices."
voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

for voice in voices:
    response = speech(
        model="tts-1",
        input=text,
        voice=voice
    )
    response.stream_to_file(f"voice_{voice}.mp3")
High Quality Audio
from litellm import speech

response = speech(
    model="tts-1-hd",  # High quality model
    input="This is high quality audio synthesis.",
    voice="onyx",
    response_format="flac"  # Lossless format
)

response.stream_to_file("hq_audio.flac")
Adjust Speech Speed
from litellm import speech
text = "The quick brown fox jumps over the lazy dog."
# Slow speech
slow = speech(model="tts-1", input=text, voice="alloy", speed=0.5)
slow.stream_to_file("slow.mp3")
# Normal speech
normal = speech(model="tts-1", input=text, voice="alloy", speed=1.0)
normal.stream_to_file("normal.mp3")
# Fast speech
fast = speech(model="tts-1", input=text, voice="alloy", speed=2.0)
fast.stream_to_file("fast.mp3")
Different Audio Formats
from litellm import speech

text = "Converting to different audio formats."
formats = ["mp3", "opus", "aac", "flac", "wav"]

for fmt in formats:
    response = speech(
        model="tts-1",
        input=text,
        voice="echo",
        response_format=fmt
    )
    response.stream_to_file(f"output.{fmt}")
Async Text-to-Speech
import asyncio
from litellm import aspeech

async def generate_speech():
    response = await aspeech(
        model="tts-1",
        input="Hello from async!",
        voice="nova"
    )
    response.stream_to_file("async_output.mp3")

asyncio.run(generate_speech())
Generate Multiple Audio Files
import asyncio
from litellm import aspeech

async def generate_multiple():
    texts = [
        "Welcome to section one.",
        "This is section two.",
        "And this is section three."
    ]
    tasks = [
        aspeech(model="tts-1", input=text, voice="alloy")
        for text in texts
    ]
    responses = await asyncio.gather(*tasks)
    for i, response in enumerate(responses):
        response.stream_to_file(f"section_{i + 1}.mp3")

asyncio.run(generate_multiple())
Combined Use Cases
Voice Message Transcription
from litellm import transcription, completion

# Transcribe voice message
with open("voice_message.mp3", "rb") as audio:
    transcript_response = transcription(
        model="whisper-1",
        file=audio
    )

transcript = transcript_response.text
print(f"Transcription: {transcript}")

# Analyze or respond using LLM
llm_response = completion(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Summarize this voice message: {transcript}"
    }]
)
print(f"Summary: {llm_response.choices[0].message.content}")
Audio Book Generation
from litellm import speech
import time

chapters = [
    "Chapter 1: The Beginning. It was a dark and stormy night...",
    "Chapter 2: The Journey. The hero set off on an adventure...",
    "Chapter 3: The End. And they lived happily ever after."
]

for i, chapter in enumerate(chapters):
    print(f"Generating chapter {i + 1}...")
    response = speech(
        model="tts-1-hd",
        input=chapter,
        voice="onyx",
        speed=0.9  # Slightly slower for audiobook
    )
    response.stream_to_file(f"audiobook_chapter_{i + 1}.mp3")
    time.sleep(1)  # Rate limiting

print("Audiobook generated!")
Language Learning Assistant
from litellm import speech, completion

# Get translation
response = completion(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Translate to Spanish: 'Good morning, how are you?'"
    }]
)
translation = response.choices[0].message.content
print(f"Translation: {translation}")

# Generate speech in target language
audio = speech(
    model="tts-1",
    input=translation,
    voice="nova"
)
audio.stream_to_file("spanish_phrase.mp3")
Meeting Transcription and Summary
from litellm import transcription, completion

# Transcribe meeting recording
with open("meeting.mp3", "rb") as audio:
    transcript = transcription(
        model="whisper-1",
        file=audio,
        response_format="text"
    )

print(f"Transcript length: {len(transcript.text)} characters")

# Generate summary
summary_response = completion(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Summarize this meeting transcript and extract action items:\n\n{transcript.text}"
    }]
)
print(f"\nSummary:\n{summary_response.choices[0].message.content}")
Error Handling
from litellm import transcription, speech
from litellm.exceptions import (
    BadRequestError,
    AuthenticationError,
    Timeout
)

# Transcription error handling
try:
    with open("audio.mp3", "rb") as audio:
        response = transcription(model="whisper-1", file=audio)
except FileNotFoundError:
    print("Audio file not found")
except BadRequestError as e:
    print(f"Invalid request: {e}")
except AuthenticationError:
    print("Invalid API key")
except Timeout:
    print("Request timed out")

# TTS error handling
try:
    response = speech(
        model="tts-1",
        input="Hello world",
        voice="alloy"
    )
    response.stream_to_file("output.mp3")
except BadRequestError as e:
    print(f"Invalid request: {e}")
except Exception as e:
    print(f"Error: {e}")
Best Practices
Transcription:
- Specify language: Improves accuracy if you know the language
- Use prompts: Provide context or special terminology
- Choose format: Use SRT/VTT for subtitles, JSON for programmatic use
- File size: Keep files under 25MB (OpenAI limit)
- Audio quality: Better quality audio = better transcription
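The 25MB limit above can be checked before uploading. A minimal sketch, assuming OpenAI's documented cap (other providers may differ; `within_size_limit` is a hypothetical helper, not part of LiteLLM):

```python
import os

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # OpenAI's documented 25MB upload cap

def within_size_limit(path: str) -> bool:
    """Return True if the audio file is small enough to upload directly."""
    return os.path.getsize(path) <= MAX_AUDIO_BYTES
```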
Text-to-Speech:
- Choose right model: Use tts-1-hd for higher quality
- Select appropriate voice: Test different voices for your use case
- Break up long text: Split into chunks for better processing
- Control speed: Adjust for different use cases (audiobooks vs announcements)
- Format selection: Use MP3 for web, WAV/FLAC for high quality
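The "break up long text" advice above can be sketched as a simple sentence-aware chunker, using the 4096-character input cap noted earlier (`chunk_text` is a hypothetical helper, not part of LiteLLM):

```python
import re

def chunk_text(text: str, max_chars: int = 4096) -> list:
    """Split text into chunks under max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that alone exceeds the limit
        if len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to speech() in turn and the resulting audio files concatenated afterwards.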
Troubleshooting
Transcription Issues
- Low accuracy: Specify language, improve audio quality
- File too large: Split audio into smaller chunks
- Timeout: Increase timeout parameter for long files
TTS Issues
- Text too long: Split into chunks of less than 4096 characters
- Pronunciation: Use phonetic spelling in input
- Quality: Use tts-1-hd model for better quality