Overview
LiteLLM provides unified interfaces for audio transcription (speech-to-text) and text-to-speech (TTS) across multiple providers including OpenAI Whisper, Azure, and more.
Audio Transcription
Basic Usage
from litellm import transcription

with open("audio.mp3", "rb") as audio_file:
    response = transcription(
        model="whisper-1",
        file=audio_file
    )

print(response.text)
Function Signature
def transcription(
    model: str,
    file: FileTypes,
    # Optional parameters
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    timestamp_granularities: Optional[List[str]] = None,
    temperature: Optional[int] = None,
    # API configuration
    timeout: float = 600,
    api_key: Optional[str] = None,
    api_base: Optional[str] = None,
    api_version: Optional[str] = None,
    max_retries: Optional[int] = None,
    custom_llm_provider: Optional[str] = None,
    **kwargs
) -> TranscriptionResponse
Transcription Parameters
- model: The transcription model to use. Example: whisper-1
- file: The audio file to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
- language: Language of the audio in ISO-639-1 format (e.g., "en", "fr", "es")
- prompt: Optional text to guide the model's style or continue a previous segment
- response_format: Format of the transcript output. Options: "json", "text", "srt", "verbose_json", "vtt"
- timestamp_granularities: Timestamp granularities to include. Options: ["word"], ["segment"], or both
- temperature: Sampling temperature (0 to 1). Higher values make output more random.
Transcription Response
class TranscriptionResponse:
    text: str  # The transcribed text
Transcription Examples
Basic Transcription
from litellm import transcription

with open("interview.mp3", "rb") as audio_file:
    response = transcription(
        model="whisper-1",
        file=audio_file
    )

print(response.text)
Transcription with Language
from litellm import transcription

with open("french_audio.mp3", "rb") as audio_file:
    response = transcription(
        model="whisper-1",
        file=audio_file,
        language="fr"  # French
    )

print(response.text)
Different Output Formats
from litellm import transcription

with open("audio.mp3", "rb") as audio_file:
    # SRT format (with timestamps)
    srt_response = transcription(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )
    print(srt_response.text)

    # Rewind the file handle before reusing it for a second request
    audio_file.seek(0)

    # VTT format (web video text tracks)
    vtt_response = transcription(
        model="whisper-1",
        file=audio_file,
        response_format="vtt"
    )
    print(vtt_response.text)
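The SRT text returned above is plain subtitle markup, so it can be post-processed without extra dependencies. A minimal sketch of parsing it into (index, timing, text) cues, assuming the standard SRT layout of an index line, a timing line, and one or more text lines per blank-line-separated block:

```python
def parse_srt(srt_text: str) -> list:
    """Split SRT subtitle text into (index, timing, text) tuples."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) >= 3:
            # Line 0: cue index, line 1: "start --> end", rest: subtitle text
            cues.append((int(lines[0]), lines[1], " ".join(lines[2:])))
    return cues
```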
Async Transcription
import asyncio
from litellm import atranscription

async def transcribe_audio():
    with open("audio.mp3", "rb") as audio_file:
        response = await atranscription(
            model="whisper-1",
            file=audio_file
        )
    return response.text

text = asyncio.run(transcribe_audio())
print(text)
Multiple Files Concurrently
import asyncio
from litellm import atranscription

async def transcribe_one(file_path: str) -> str:
    # Keep the file open until the request completes; opening files in a
    # loop and gathering later would close each handle before it is read
    with open(file_path, "rb") as f:
        response = await atranscription(model="whisper-1", file=f)
    return response.text

async def transcribe_multiple(files: list) -> list:
    tasks = [transcribe_one(path) for path in files]
    return await asyncio.gather(*tasks)

files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
transcripts = asyncio.run(transcribe_multiple(files))
for i, text in enumerate(transcripts):
    print(f"File {i + 1}: {text}")
Text-to-Speech
Basic Usage
from litellm import speech

response = speech(
    model="tts-1",
    input="Hello, this is a test of text to speech.",
    voice="alloy"
)

# Save audio file
response.stream_to_file("output.mp3")
Function Signature
def speech(
    model: str,
    input: str,
    voice: Optional[Union[str, dict]] = None,
    # Optional parameters
    response_format: Optional[str] = None,
    speed: Optional[int] = None,
    instructions: Optional[str] = None,
    # API configuration
    api_key: Optional[str] = None,
    api_base: Optional[str] = None,
    api_version: Optional[str] = None,
    organization: Optional[str] = None,
    project: Optional[str] = None,
    max_retries: Optional[int] = None,
    metadata: Optional[dict] = None,
    timeout: Optional[Union[float, httpx.Timeout]] = None,
    client: Optional[Any] = None,
    headers: Optional[dict] = None,
    custom_llm_provider: Optional[str] = None,
    **kwargs
) -> HttpxBinaryResponseContent
TTS Parameters
- model: The TTS model to use. Examples: tts-1, tts-1-hd
- input: The text to convert to speech (max 4096 characters)
- voice: The voice to use. Options: alloy, echo, fable, onyx, nova, shimmer
- response_format: Audio format. Options: mp3 (default), opus, aac, flac, wav, pcm
- speed: Speed of the speech (0.25 to 4.0). Default: 1.0
TTS Response
Returns HttpxBinaryResponseContent with audio data:
response.stream_to_file("output.mp3") # Save to file
response.content # Raw bytes
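When stream_to_file is not convenient (for example, the audio should be uploaded elsewhere or stored in a database), the raw bytes in response.content can be written out manually. A small sketch of a helper for that (save_audio is a hypothetical name, not part of LiteLLM):

```python
def save_audio(content: bytes, path: str) -> int:
    """Write raw audio bytes to disk; returns the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(content)
```

Usage would then be `save_audio(response.content, "output.mp3")`.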
TTS Examples
Basic Text-to-Speech
from litellm import speech

response = speech(
    model="tts-1",
    input="Welcome to the future of voice synthesis!",
    voice="nova"
)

response.stream_to_file("welcome.mp3")
Different Voices
from litellm import speech

text = "This is a test of different voices."
voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

for voice in voices:
    response = speech(
        model="tts-1",
        input=text,
        voice=voice
    )
    response.stream_to_file(f"voice_{voice}.mp3")
High Quality Audio
from litellm import speech

response = speech(
    model="tts-1-hd",  # High quality model
    input="This is high quality audio synthesis.",
    voice="onyx",
    response_format="flac"  # Lossless format
)

response.stream_to_file("hq_audio.flac")
Adjust Speech Speed
from litellm import speech
text = "The quick brown fox jumps over the lazy dog."
# Slow speech
slow = speech(model="tts-1", input=text, voice="alloy", speed=0.5)
slow.stream_to_file("slow.mp3")
# Normal speech
normal = speech(model="tts-1", input=text, voice="alloy", speed=1.0)
normal.stream_to_file("normal.mp3")
# Fast speech
fast = speech(model="tts-1", input=text, voice="alloy", speed=2.0)
fast.stream_to_file("fast.mp3")
Different Audio Formats
from litellm import speech

text = "Converting to different audio formats."
formats = ["mp3", "opus", "aac", "flac", "wav"]

for fmt in formats:
    response = speech(
        model="tts-1",
        input=text,
        voice="echo",
        response_format=fmt
    )
    response.stream_to_file(f"output.{fmt}")
Async Text-to-Speech
import asyncio
from litellm import aspeech

async def generate_speech():
    response = await aspeech(
        model="tts-1",
        input="Hello from async!",
        voice="nova"
    )
    response.stream_to_file("async_output.mp3")

asyncio.run(generate_speech())
Generate Multiple Audio Files
import asyncio
from litellm import aspeech

async def generate_multiple():
    texts = [
        "Welcome to section one.",
        "This is section two.",
        "And this is section three."
    ]
    tasks = [
        aspeech(model="tts-1", input=text, voice="alloy")
        for text in texts
    ]
    responses = await asyncio.gather(*tasks)
    for i, response in enumerate(responses):
        response.stream_to_file(f"section_{i + 1}.mp3")

asyncio.run(generate_multiple())
Combined Use Cases
Voice Message Transcription
from litellm import transcription, completion

# Transcribe voice message
with open("voice_message.mp3", "rb") as audio:
    transcript_response = transcription(
        model="whisper-1",
        file=audio
    )

transcript = transcript_response.text
print(f"Transcription: {transcript}")

# Analyze or respond using LLM
llm_response = completion(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Summarize this voice message: {transcript}"
    }]
)
print(f"Summary: {llm_response.choices[0].message.content}")
Audio Book Generation
from litellm import speech
import time

chapters = [
    "Chapter 1: The Beginning. It was a dark and stormy night...",
    "Chapter 2: The Journey. The hero set off on an adventure...",
    "Chapter 3: The End. And they lived happily ever after."
]

for i, chapter in enumerate(chapters):
    print(f"Generating chapter {i + 1}...")
    response = speech(
        model="tts-1-hd",
        input=chapter,
        voice="onyx",
        speed=0.9  # Slightly slower for audiobook
    )
    response.stream_to_file(f"audiobook_chapter_{i + 1}.mp3")
    time.sleep(1)  # Rate limiting

print("Audiobook generated!")
Language Learning Assistant
from litellm import speech, completion

# Get translation
response = completion(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Translate to Spanish: 'Good morning, how are you?'"
    }]
)
translation = response.choices[0].message.content
print(f"Translation: {translation}")

# Generate speech in target language
audio = speech(
    model="tts-1",
    input=translation,
    voice="nova"
)
audio.stream_to_file("spanish_phrase.mp3")
Meeting Transcription and Summary
from litellm import transcription, completion

# Transcribe meeting recording
with open("meeting.mp3", "rb") as audio:
    transcript = transcription(
        model="whisper-1",
        file=audio,
        response_format="text"
    )

print(f"Transcript length: {len(transcript.text)} characters")

# Generate summary
summary_response = completion(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Summarize this meeting transcript and extract action items:\n\n{transcript.text}"
    }]
)
print(f"\nSummary:\n{summary_response.choices[0].message.content}")
Error Handling
from litellm import transcription, speech
from litellm.exceptions import (
    BadRequestError,
    AuthenticationError,
    Timeout
)

# Transcription error handling
try:
    with open("audio.mp3", "rb") as audio:
        response = transcription(model="whisper-1", file=audio)
except FileNotFoundError:
    print("Audio file not found")
except BadRequestError as e:
    print(f"Invalid request: {e}")
except AuthenticationError:
    print("Invalid API key")
except Timeout:
    print("Request timed out")

# TTS error handling
try:
    response = speech(
        model="tts-1",
        input="Hello world",
        voice="alloy"
    )
    response.stream_to_file("output.mp3")
except BadRequestError as e:
    print(f"Invalid request: {e}")
except Exception as e:
    print(f"Error: {e}")
Best Practices
Transcription:
- Specify language: Improves accuracy if you know the language
- Use prompts: Provide context or special terminology
- Choose format: Use SRT/VTT for subtitles, JSON for programmatic use
- File size: Keep files under 25MB (OpenAI limit)
- Audio quality: Better quality audio = better transcription
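The 25MB limit above can be checked before uploading. A minimal sketch, assuming OpenAI's documented cap (other providers may differ; `within_size_limit` is a hypothetical helper, not part of LiteLLM):

```python
import os

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # OpenAI's documented 25MB upload cap

def within_size_limit(path: str) -> bool:
    """Return True if the audio file is small enough to upload directly."""
    return os.path.getsize(path) <= MAX_AUDIO_BYTES
```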
Text-to-Speech:
- Choose right model: Use tts-1-hd for higher quality
- Select appropriate voice: Test different voices for your use case
- Break up long text: Split into chunks for better processing
- Control speed: Adjust for different use cases (audiobooks vs announcements)
- Format selection: Use MP3 for web, WAV/FLAC for high quality
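The "break up long text" advice above can be sketched as a simple sentence-aware chunker, using the 4096-character input cap noted earlier (`chunk_text` is a hypothetical helper, not part of LiteLLM):

```python
import re

def chunk_text(text: str, max_chars: int = 4096) -> list:
    """Split text into chunks under max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that alone exceeds the limit
        if len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to speech() in turn and the resulting audio files concatenated afterwards.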
Troubleshooting
Transcription Issues
- Low accuracy: Specify language, improve audio quality
- File too large: Split audio into smaller chunks
- Timeout: Increase timeout parameter for long files
TTS Issues
- Text too long: Split into chunks of less than 4096 characters
- Pronunciation: Use phonetic spelling in input
- Quality: Use tts-1-hd model for better quality