Overview
The Speech to Text API (also known as Scribe) transcribes audio and video files with advanced features including speaker diarization, entity detection, multi-channel support, and webhook integration.

Methods
convert()
Transcribe an audio or video file with full control over transcription parameters.

model_id: The ID of the model to use for transcription. Available models:
- scribe_v1 - General purpose transcription model
- scribe_v2 - Latest transcription model with improved accuracy
file: The audio or video file to transcribe. Supports common formats including MP3, WAV, MP4, and more. Either file or cloud_storage_url must be provided.

cloud_storage_url: The HTTPS URL of the file to transcribe. The file must be accessible via HTTPS and less than 2GB. Supports URLs from cloud storage providers (AWS S3, Google Cloud Storage, Cloudflare R2, etc.), CDNs, or any HTTPS source. URLs can include authentication tokens in query parameters.
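A minimal sketch of both input modes, assuming the ElevenLabs Python SDK client; the API key, file path, and URL are illustrative:

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Option 1: upload a local file
with open("interview.mp3", "rb") as audio:
    result = client.speech_to_text.convert(model_id="scribe_v1", file=audio)

# Option 2: reference an HTTPS URL instead of uploading
result = client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://my-bucket.s3.amazonaws.com/interview.mp3",
)
print(result.transcript)  # attribute name follows the return description below
```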
enable_logging: When set to False, zero retention mode will be used, and log and transcript storage features will be unavailable. Zero retention mode may only be used by enterprise customers.

language_code: An ISO-639-1 or ISO-639-3 language code (e.g., "en", "es", "fr", "de"). Can improve transcription performance if the language is known beforehand. If not provided, the language is automatically detected.
tag_audio_events: Whether to tag audio events like (laughter), (footsteps), (applause), etc. in the transcription.

num_speakers: The maximum number of speakers talking in the file. Helps with speaker prediction. Maximum of 32 speakers. If not provided, defaults to the maximum the model supports.
timestamps_granularity: The granularity of timestamps in the transcription:
- word - Provides word-level timestamps
- character - Provides character-level timestamps per word
diarize: Whether to annotate which speaker is talking at each point in the file. Enables speaker identification with labels like "Speaker 1", "Speaker 2", etc.
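For example, a two-person interview might be transcribed like this (a sketch reusing the client constructed in the first example):

```python
with open("interview.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        diarize=True,     # annotate who is speaking at each point
        num_speakers=2,   # we know there are at most two speakers
    )
print(result.speakers)  # detected speakers, labeled "Speaker 1", "Speaker 2", ...
```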
diarization_threshold: Diarization threshold for speaker detection. Higher values mean fewer predicted speakers (less chance of splitting one speaker into two, but a higher chance of merging two speakers into one); lower values mean more predicted speakers. Can only be set when diarize=True and num_speakers=None. Default is model-specific (usually 0.22).
additional_formats: Additional formats to export the transcript to. Options include:
- srt - SubRip subtitle format
- vtt - WebVTT subtitle format
- txt - Plain text
- json - Detailed JSON format
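The exact shape of additional_formats is not spelled out above; assuming it accepts a list of format names, requesting subtitles might look like:

```python
with open("lecture.mp4", "rb") as video:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=video,
        additional_formats=["srt", "vtt"],  # assumed shape: a list of format names
    )
```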
file_format: The format of the input audio:
- pcm_s16le_16 - 16-bit PCM at 16kHz, mono, little-endian (lower latency)
- other - Any other encoded format (default)
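If you already hold raw 16-bit, 16 kHz mono PCM (for example from a microphone capture), declaring it lets the service skip a decode step. A sketch with an illustrative file name:

```python
with open("capture.pcm", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        file_format="pcm_s16le_16",  # raw 16-bit / 16 kHz / mono / little-endian
    )
```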
webhook: Whether to send the transcription result to configured webhooks. If set to True, the request returns early without the transcription, which is delivered later via webhook.

webhook_id: Optional specific webhook ID to send results to. Only valid when webhook=True. If not provided, results are sent to all configured speech-to-text webhooks.

webhook_metadata: Optional metadata to include in webhook responses. Should be a JSON-serializable object with a maximum depth of 2 levels and a maximum size of 16KB. Useful for tracking internal IDs, job references, or contextual information.
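A sketch of a fire-and-forget request; the metadata keys are illustrative:

```python
client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://my-bucket.s3.amazonaws.com/meeting.mp4",
    webhook=True,  # return immediately; the transcript arrives via webhook
    webhook_metadata={"job_id": "batch-42", "source": "weekly-sync"},  # illustrative keys
)
```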
temperature: Controls the randomness of the transcription output. Accepts values between 0.0 and 2.0. Higher values produce more diverse, less deterministic results. Default is model-specific (usually 0).

seed: Random seed for deterministic transcription. Must be an integer between 0 and 2147483647. Repeated requests with the same seed and parameters should return similar results, though determinism is not guaranteed.
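To make repeated runs as reproducible as the service allows, pin both knobs (a sketch):

```python
with open("interview.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        temperature=0.0,  # minimize sampling randomness
        seed=12345,       # same seed + same parameters -> similar results
    )
```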
use_multi_channel: Whether the audio file contains multiple channels where each channel has a single speaker. When enabled, each channel is transcribed independently and the results are combined. Each word includes a channel_index field. A maximum of 5 channels is supported.
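For a stereo call recording with one speaker per channel, grouping words by channel might look like this (the word fields other than channel_index are assumptions based on the return description below):

```python
with open("call_stereo.wav", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        use_multi_channel=True,  # one speaker per channel, transcribed independently
    )
for word in result.words:
    # channel_index identifies which channel the word came from
    print(word.channel_index, word.text)
```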
entity_detection: Detect entities in the transcript. Options:
- all - Detect all entities
- A single entity type or category string
- A list of entity types/categories

Available categories: pii, phi, pci, other, offensive_language. Detected entities are returned in the entities field with text, type, and character positions. Usage incurs additional costs.
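A sketch that scans a recording for personal and payment-card data; the entity fields follow the description above:

```python
with open("support_call.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        entity_detection=["pii", "pci"],  # detect personal and payment-card data
    )
for entity in result.entities:
    # Each entity carries its text, type, and character positions.
    print(entity.type, entity.text)
```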
keyterms: A list of keyterms to bias the transcription towards. Keyterms are words or phrases you want the model to recognize more accurately. Constraints:
- Maximum of 100 keyterms
- Each keyterm must be less than 50 characters
- Each keyterm can contain at most 5 words (after normalization)
["ElevenLabs", "API key", "neural network"]Usage incurs additional costs.Request-specific configuration.
The transcription result containing:
- transcript (str) - The full transcript text
- speakers (List) - List of detected speakers (if diarize=True)
- words (List) - Word-level details with timestamps
- entities (List) - Detected entities (if entity_detection enabled)
- language (str) - Detected language code
- Additional format exports if requested
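Putting it together, a sketch that transcribes a file and walks the result; the attribute names follow the return description above, and the keyterms parameter name is assumed from its description:

```python
with open("meeting.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v2",
        file=audio,
        diarize=True,
        keyterms=["ElevenLabs", "API key"],  # bias recognition toward these phrases
    )

print(result.language)    # detected language code, e.g. "en"
print(result.transcript)  # full transcript text
for word in result.words:
    # Word-level details with timestamps; start/end field names are illustrative.
    print(word.text, word.start, word.end)
```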
Realtime Transcription
Access realtime speech-to-text via a WebSocket connection:
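The realtime interface is not documented in detail here; the sketch below uses the websockets library against a hypothetical endpoint URL and auth header, so consult the API reference for the real values:

```python
import asyncio
import json
import websockets  # pip install websockets (v14+ API)

async def stream(chunks):
    # Hypothetical endpoint and header names; check the API reference for the real ones.
    url = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"
    async with websockets.connect(url, additional_headers={"xi-api-key": "YOUR_API_KEY"}) as ws:
        for chunk in chunks:                    # e.g. raw PCM frames from a microphone
            await ws.send(chunk)
            print(json.loads(await ws.recv()))  # incremental transcript updates

# asyncio.run(stream(pcm_chunks))
```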
Async Methods

All methods have async equivalents:
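Assuming the SDK follows the common AsyncElevenLabs pattern, the async variant mirrors the sync call:

```python
import asyncio
from elevenlabs.client import AsyncElevenLabs  # assumed async client

async def main():
    client = AsyncElevenLabs(api_key="YOUR_API_KEY")
    with open("interview.mp3", "rb") as audio:
        result = await client.speech_to_text.convert(model_id="scribe_v1", file=audio)
    print(result.transcript)

asyncio.run(main())
```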
Use Cases

- Meeting transcription: Transcribe meetings with speaker identification
- Content accessibility: Generate subtitles and captions for videos
- Content analysis: Extract entities and keywords from audio content
- Multi-language support: Transcribe content in multiple languages
- Compliance: Detect and redact PII, PHI, or PCI information