Overview
The `RealtimeTextToSpeechClient` extends the standard `TextToSpeechClient` with WebSocket-based, real-time text-to-speech capabilities. It lets you stream text input and receive audio output in real time, making it ideal for interactive applications such as chatbots, voice assistants, and live narration.
Method Signature
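The exact signature is not reproduced here; reconstructed from the Parameters section below, it would look roughly like the following sketch (parameter names, types, and defaults are assumptions inferred from the descriptions, not the authoritative declaration):

```python
from typing import Iterator, Optional


def convert_realtime(
    self,
    voice_id: str,                        # voice to synthesize with
    text: Iterator[str],                  # streaming text chunks
    model_id: str = "eleven_turbo_v2",    # assumed default model ID
    output_format: str = "mp3_22050_32",  # codec_sample_rate_bitrate
    voice_settings: Optional[dict] = None,
    request_options: Optional[dict] = None,
) -> Iterator[bytes]:
    """Stream text in over a WebSocket and yield audio chunks as bytes."""
    ...
```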
Parameters
- `voice_id` (str): Voice ID to be used. You can call GET https://api.elevenlabs.io/v1/voices to list all available voices.
- `text` (Iterator[str]): An iterator of text chunks that will be converted into speech in real time. The text is automatically chunked at natural breakpoints (punctuation, spaces) for optimal speech generation.
- `model_id` (str): Identifier of the model that will be used. You can query available models using GET /v1/models. The model must support text to speech, which you can check via its `can_do_text_to_speech` property.
- `output_format` (str): Output format of the generated audio, formatted as `codec_sample_rate_bitrate`. For example, an MP3 with a 22.05 kHz sample rate at 32 kbps is represented as `mp3_22050_32`.
- `voice_settings` (optional): Voice settings overriding the stored settings for the given voice, applied only to this request. Properties:
  - `stability` (float): Stability setting (0.0 to 1.0)
  - `similarity_boost` (float): Similarity boost setting (0.0 to 1.0)
  - `style` (float): Style setting (0.0 to 1.0)
  - `use_speaker_boost` (bool): Enable speaker boost
- `request_options` (optional): Request-specific configuration, such as custom headers.
Returns
`Iterator[bytes]`: real-time streaming audio data, delivered as base64-decoded bytes.
Example: Basic Usage
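A minimal sketch of the basic pattern. Client construction is omitted; `convert_realtime` and its parameter names follow the Parameters section above, and the model ID is an illustrative assumption:

```python
from typing import Iterator


def text_stream() -> Iterator[str]:
    """Yield text chunks as they become available."""
    yield "Hello! This is a real-time "
    yield "text-to-speech demonstration."


def narrate(client, voice_id: str) -> bytes:
    """Stream text_stream() through convert_realtime and collect the audio.

    `client` is assumed to be a RealtimeTextToSpeechClient instance.
    """
    audio = bytearray()
    for chunk in client.convert_realtime(
        voice_id=voice_id,
        text=text_stream(),
        model_id="eleven_turbo_v2",   # assumed model ID
        output_format="mp3_22050_32",
    ):
        audio.extend(chunk)           # chunks arrive already base64-decoded
    return bytes(audio)
```

In a real application you would typically write each chunk to a file or pipe it straight to an audio player instead of buffering the whole response.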
Example: With Voice Settings
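The same call, overriding the stored voice settings for this request only. A plain dict with the properties listed above is shown here; your SDK version may provide a dedicated `VoiceSettings` type instead, and the values are illustrative:

```python
def narrate_with_settings(client, voice_id: str, text_iter):
    """Yield audio chunks with per-request voice settings applied."""
    voice_settings = {
        "stability": 0.4,           # 0.0 to 1.0
        "similarity_boost": 0.75,   # 0.0 to 1.0
        "style": 0.2,               # 0.0 to 1.0
        "use_speaker_boost": True,
    }
    yield from client.convert_realtime(
        voice_id=voice_id,
        text=text_iter,
        model_id="eleven_turbo_v2",   # assumed model ID
        output_format="mp3_22050_32",
        voice_settings=voice_settings,
    )
```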
Example: Real-time Interactive Application
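A sketch of the interactive case: bridging a streaming source (for example, an LLM producing tokens) to speech, so playback can begin before a reply has finished generating. `play_chunk` is a hypothetical callback standing in for your audio output; the client and model ID follow the assumptions above:

```python
def run_assistant(client, voice_id: str, llm_replies, play_chunk):
    """Convert streaming replies to speech as they are generated.

    `llm_replies` yields one token iterator per assistant reply;
    `play_chunk` is called with each audio chunk as soon as it
    arrives, enabling low-latency playback.
    """
    for tokens in llm_replies:
        for audio_chunk in client.convert_realtime(
            voice_id=voice_id,
            text=tokens,                  # tokens are chunked internally
            model_id="eleven_turbo_v2",   # assumed model ID
            output_format="mp3_22050_32",
        ):
            play_chunk(audio_chunk)
```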
Text Chunking
The `convert_realtime()` method automatically chunks your input text at natural breakpoints using the internal `text_chunker()` function. This function splits text at:
- Sentence endings: `.`, `?`, `!`
- Pauses: `,`, `;`, `:`
- Dashes and parentheses: `—`, `-`, `(`, `)`, `[`, `]`, `}`
- Spaces
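The behavior described above can be sketched as follows. This is a simplified re-implementation for illustration, not the SDK's internal `text_chunker()`: it buffers incoming characters and emits a chunk whenever one of the listed breakpoint characters is reached.

```python
from typing import Iterable, Iterator

# Breakpoint characters from the list above.
SPLITTERS = (".", "?", "!", ",", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")


def text_chunker(chunks: Iterable[str]) -> Iterator[str]:
    """Buffer incoming text and yield it at natural breakpoints."""
    buffer = ""
    for text in chunks:
        for ch in text:
            buffer += ch
            if ch in SPLITTERS:
                yield buffer
                buffer = ""
    if buffer:          # flush whatever remains at end of input
        yield buffer
```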
WebSocket Connection
Under the hood, `convert_realtime()` establishes a WebSocket connection to the service's streaming text-to-speech endpoint, with:
- Model ID and output format in query parameters
- Authentication via headers
- JSON message protocol for text chunks and audio responses
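The JSON message protocol can be illustrated with a pair of small helpers. The field names (`text`, `voice_settings`, `audio`) are assumptions about the wire schema, shown only to make the shape of the exchange concrete; check your SDK version for the exact format:

```python
import base64
import json
from typing import Optional


def encode_text_message(text: str, voice_settings: Optional[dict] = None) -> str:
    """Build a JSON text message of the kind sent over the socket."""
    msg = {"text": text}
    if voice_settings is not None:
        msg["voice_settings"] = voice_settings
    return json.dumps(msg)


def decode_audio_message(raw: str) -> Optional[bytes]:
    """Extract audio bytes from a JSON response carrying base64 audio.

    Returns None for messages without an audio payload (e.g. a
    final/control message).
    """
    data = json.loads(raw)
    if data.get("audio"):
        return base64.b64decode(data["audio"])
    return None
```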
Use Cases
- Chatbots and voice assistants: Stream AI-generated responses as they’re created
- Live narration: Convert real-time text (e.g., from live captions) to speech
- Interactive storytelling: Generate speech for dynamic, user-driven narratives
- Accessibility tools: Provide real-time audio feedback for text input
Notes
- The realtime client requires a WebSocket connection, which is automatically managed
- Audio chunks are returned as they become available, enabling very low latency
- The connection will automatically close when all text has been processed
- Error handling is built in via `ApiError` exceptions