Overview
The Speech to Text API (also known as Scribe) transcribes audio and video files with advanced features including speaker diarization, entity detection, multi-channel support, and webhook integration.

Methods
convert()
Transcribe an audio or video file with full control over transcription parameters.

model_id: The ID of the model to use for transcription. Available models:
- scribe_v1 - General purpose transcription model
- scribe_v2 - Latest transcription model with improved accuracy
file: The audio or video file to transcribe. Supports common formats including MP3, WAV, MP4, and more. Either file or cloud_storage_url must be provided.

cloud_storage_url: The HTTPS URL of the file to transcribe. The file must be accessible via HTTPS and less than 2GB. Supports URLs from cloud storage providers (AWS S3, Google Cloud Storage, Cloudflare R2, etc.), CDNs, or any HTTPS source. URLs can include authentication tokens in query parameters.
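A minimal sketch of both input modes, assuming the ElevenLabs Python SDK client; the API key, file path, and URL are illustrative:

```python
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Option 1: upload a local file
with open("interview.mp3", "rb") as audio:
    result = client.speech_to_text.convert(model_id="scribe_v1", file=audio)

# Option 2: reference an HTTPS URL instead of uploading
result = client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://my-bucket.s3.amazonaws.com/interview.mp3",
)
print(result.transcript)  # attribute name follows the return description below
```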
enable_logging: When set to False, zero retention mode will be used, and log and transcript storage features will be unavailable. Zero retention mode may only be used by enterprise customers.

language_code: An ISO-639-1 or ISO-639-3 language code (e.g., "en", "es", "fr", "de"). Can improve transcription performance if the language is known beforehand. If not provided, the language is automatically detected.
tag_audio_events: Whether to tag audio events like (laughter), (footsteps), (applause), etc. in the transcription.

num_speakers: The maximum number of speakers talking in the file. Helps with speaker prediction. Maximum of 32 speakers. If not provided, defaults to the maximum the model supports.
timestamps_granularity: The granularity of timestamps in the transcription:
- word - Provides word-level timestamps
- character - Provides character-level timestamps per word
diarize: Whether to annotate which speaker is talking at each point in the file. Enables speaker identification with labels like "Speaker 1", "Speaker 2", etc.
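For example, a two-person interview might be transcribed like this (a sketch reusing the client constructed in the first example):

```python
with open("interview.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        diarize=True,     # annotate who is speaking at each point
        num_speakers=2,   # we know there are at most two speakers
    )
print(result.speakers)  # detected speakers, labeled "Speaker 1", "Speaker 2", ...
```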
diarization_threshold: Diarization threshold for speaker detection. Higher values mean fewer predicted speakers (less chance of splitting one speaker into two, but a higher chance of merging two speakers into one); lower values mean more predicted speakers. Can only be set when diarize=True and num_speakers=None. Default is model-specific (usually 0.22).
additional_formats: Additional formats to export the transcript to. Options include:
- srt - SubRip subtitle format
- vtt - WebVTT subtitle format
- txt - Plain text
- json - Detailed JSON format
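The exact shape of additional_formats is not spelled out above; assuming it accepts a list of format names, requesting subtitles might look like:

```python
with open("lecture.mp4", "rb") as video:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=video,
        additional_formats=["srt", "vtt"],  # assumed shape: a list of format names
    )
```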
file_format: The format of the input audio:
- pcm_s16le_16 - 16-bit PCM at 16kHz, mono, little-endian (lower latency)
- other - Any other encoded format (default)
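If you already hold raw 16-bit, 16 kHz mono PCM (for example from a microphone capture), declaring it lets the service skip a decode step. A sketch with an illustrative file name:

```python
with open("capture.pcm", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        file_format="pcm_s16le_16",  # raw 16-bit / 16 kHz / mono / little-endian
    )
```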
webhook: Whether to send the transcription result to configured webhooks. If set to True, the request returns early without the transcription, which is delivered later via webhook.

webhook_id: Optional specific webhook ID to send results to. Only valid when webhook=True. If not provided, results are sent to all configured speech-to-text webhooks.

webhook_metadata: Optional metadata to include in webhook responses. Should be a JSON-serializable object with a maximum depth of 2 levels and a maximum size of 16KB. Useful for tracking internal IDs, job references, or contextual information.
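A sketch of a fire-and-forget request; the metadata keys are illustrative:

```python
client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://my-bucket.s3.amazonaws.com/meeting.mp4",
    webhook=True,  # return immediately; the transcript arrives via webhook
    webhook_metadata={"job_id": "batch-42", "source": "weekly-sync"},  # illustrative keys
)
```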
temperature: Controls the randomness of the transcription output. Accepts values between 0.0 and 2.0. Higher values produce more diverse, less deterministic results. Default is model-specific (usually 0).

seed: Random seed for deterministic transcription. Must be an integer between 0 and 2147483647. Repeated requests with the same seed and parameters should return similar results, though determinism is not guaranteed.
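To make repeated runs as reproducible as the service allows, pin both knobs (a sketch):

```python
with open("interview.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        temperature=0.0,  # minimize sampling randomness
        seed=12345,       # same seed + same parameters -> similar results
    )
```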
use_multi_channel: Whether the audio file contains multiple channels where each channel has a single speaker. When enabled, each channel is transcribed independently and the results are combined. Each word includes a channel_index field. A maximum of 5 channels is supported.
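For a stereo call recording with one speaker per channel, grouping words by channel might look like this (the word fields other than channel_index are assumptions based on the return description below):

```python
with open("call_stereo.wav", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        use_multi_channel=True,  # one speaker per channel, transcribed independently
    )
for word in result.words:
    # channel_index identifies which channel the word came from
    print(word.channel_index, word.text)
```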
entity_detection: Detect entities in the transcript. Options:
- all - Detect all entities
- A single entity type or category string
- A list of entity types/categories

Available categories: pii, phi, pci, other, offensive_language. Detected entities are returned in the entities field with text, type, and character positions. Usage incurs additional costs.
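A sketch that scans a recording for personal and payment-card data; the entity fields follow the description above:

```python
with open("support_call.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio,
        entity_detection=["pii", "pci"],  # detect personal and payment-card data
    )
for entity in result.entities:
    # Each entity carries its text, type, and character positions.
    print(entity.type, entity.text)
```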
keyterms: A list of keyterms to bias the transcription towards. Keyterms are words or phrases you want the model to recognize more accurately. Constraints:
- Maximum of 100 keyterms
- Each keyterm must be less than 50 characters
- Each keyterm can contain at most 5 words (after normalization)
["ElevenLabs", "API key", "neural network"]Usage incurs additional costs.Request-specific configuration.
The transcription result containing:
- transcript (str) - The full transcript text
- speakers (List) - List of detected speakers (if diarize=True)
- words (List) - Word-level details with timestamps
- entities (List) - Detected entities (if entity_detection enabled)
- language (str) - Detected language code
- Additional format exports if requested
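Putting it together, a sketch that transcribes a file and walks the result; the attribute names follow the return description above, and the keyterms parameter name is assumed from its description:

```python
with open("meeting.mp3", "rb") as audio:
    result = client.speech_to_text.convert(
        model_id="scribe_v2",
        file=audio,
        diarize=True,
        keyterms=["ElevenLabs", "API key"],  # bias recognition toward these phrases
    )

print(result.language)    # detected language code, e.g. "en"
print(result.transcript)  # full transcript text
for word in result.words:
    # Word-level details with timestamps; start/end field names are illustrative.
    print(word.text, word.start, word.end)
```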
Realtime Transcription
Access realtime speech-to-text via a WebSocket connection:
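The realtime interface is not documented in detail here; the sketch below uses the websockets library against a hypothetical endpoint URL and auth header, so consult the API reference for the real values:

```python
import asyncio
import json
import websockets  # pip install websockets (v14+ API)

async def stream(chunks):
    # Hypothetical endpoint and header names; check the API reference for the real ones.
    url = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"
    async with websockets.connect(url, additional_headers={"xi-api-key": "YOUR_API_KEY"}) as ws:
        for chunk in chunks:                    # e.g. raw PCM frames from a microphone
            await ws.send(chunk)
            print(json.loads(await ws.recv()))  # incremental transcript updates

# asyncio.run(stream(pcm_chunks))
```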
Async Methods

All methods have async equivalents:
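Assuming the SDK follows the common AsyncElevenLabs pattern, the async variant mirrors the sync call:

```python
import asyncio
from elevenlabs.client import AsyncElevenLabs  # assumed async client

async def main():
    client = AsyncElevenLabs(api_key="YOUR_API_KEY")
    with open("interview.mp3", "rb") as audio:
        result = await client.speech_to_text.convert(model_id="scribe_v1", file=audio)
    print(result.transcript)

asyncio.run(main())
```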
Use Cases

- Meeting transcription: Transcribe meetings with speaker identification
- Content accessibility: Generate subtitles and captions for videos
- Content analysis: Extract entities and keywords from audio content
- Multi-language support: Transcribe content in multiple languages
- Compliance: Detect and redact PII, PHI, or PCI information