
Overview

The RealtimeTextToSpeechClient extends the standard TextToSpeechClient with WebSocket-based real-time text-to-speech capabilities. This allows you to stream text input and receive audio output in real-time, making it ideal for interactive applications like chatbots, voice assistants, and live narration.

Method Signature

client.text_to_speech.convert_realtime(
    voice_id: str,
    text: Iterator[str],
    model_id: Optional[str] = None,
    output_format: Optional[str] = "mp3_44100_128",
    voice_settings: Optional[VoiceSettings] = None,
    request_options: Optional[RequestOptions] = None,
) -> Iterator[bytes]

Parameters

voice_id
str
required
Voice ID to be used. You can use https://api.elevenlabs.io/v1/voices to list all available voices.
text
Iterator[str]
required
An iterator of text chunks that will get converted into speech in real-time. The text is automatically chunked at natural breakpoints (punctuation, spaces) for optimal speech generation.
model_id
str
default:"None"
Identifier of the model that will be used. You can query available models using GET /v1/models. The model needs to have support for text to speech, which you can check using the can_do_text_to_speech property.
output_format
str
default:"mp3_44100_128"
Output format of the generated audio. Formatted as codec_sample_rate_bitrate. For example, an mp3 with 22.05kHz sample rate at 32kbps is represented as mp3_22050_32.
voice_settings
VoiceSettings
default:"None"
Voice settings overriding the stored settings for the given voice. They are applied only to the given request. Properties:
  • stability (float): Stability setting (0.0 to 1.0)
  • similarity_boost (float): Similarity boost setting (0.0 to 1.0)
  • style (float): Style setting (0.0 to 1.0)
  • use_speaker_boost (bool): Enable speaker boost
request_options
RequestOptions
default:"None"
Request-specific configuration, such as custom headers.
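The output_format string follows the codec_sample_rate_bitrate pattern described above. A small illustrative helper (not part of the SDK) shows how it can be unpacked; note that PCM formats such as pcm_44100 have no bitrate segment:

```python
from typing import Optional, Tuple

def parse_output_format(fmt: str) -> Tuple[str, int, Optional[int]]:
    """Split a codec_sample_rate[_bitrate] string into its parts.

    Illustrative helper, not part of the SDK; PCM formats such as
    pcm_44100 omit the bitrate segment.
    """
    parts = fmt.split("_")
    # Bitrate is optional: mp3_22050_32 has one, pcm_44100 does not
    bitrate = int(parts[2]) if len(parts) > 2 else None
    return parts[0], int(parts[1]), bitrate

parse_output_format("mp3_22050_32")  # → ('mp3', 22050, 32)
```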

Returns

Iterator[bytes] - Streaming audio data, yielded as raw bytes (decoded from the base64-encoded payloads in the WebSocket responses).

Example: Basic Usage

from elevenlabs import ElevenLabs
import typing

client = ElevenLabs(
    api_key="YOUR_API_KEY",
)

def get_text() -> typing.Iterator[str]:
    yield "Hello, how are you?"
    yield "I am fine, thank you."
    yield "This is real-time text to speech."

audio_stream = client.text_to_speech.convert_realtime(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text=get_text(),
    model_id="eleven_multilingual_v2",
)

# Save the audio to a file
with open("realtime_output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

Example: With Voice Settings

from elevenlabs import ElevenLabs, VoiceSettings
import typing

client = ElevenLabs(
    api_key="YOUR_API_KEY",
)

def get_text() -> typing.Iterator[str]:
    yield "This speech has custom voice settings."
    yield "Notice the stability and style parameters."

audio_stream = client.text_to_speech.convert_realtime(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text=get_text(),
    model_id="eleven_multilingual_v2",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.8,
        style=0.6,
        use_speaker_boost=True,
    ),
)

with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

Example: Real-time Interactive Application

from elevenlabs import ElevenLabs
import typing
import pyaudio

client = ElevenLabs(
    api_key="YOUR_API_KEY",
)

def stream_user_input() -> typing.Iterator[str]:
    """Simulate streaming text from user input or an AI model"""
    sentences = [
        "Welcome to the interactive voice assistant.",
        "I can convert text to speech in real-time.",
        "This allows for natural, flowing conversations.",
    ]
    for sentence in sentences:
        yield sentence

# Set up audio playback
p = pyaudio.PyAudio()
audio_stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=44100,
    output=True
)

# Stream real-time TTS
tts_stream = client.text_to_speech.convert_realtime(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text=stream_user_input(),
    model_id="eleven_multilingual_v2",
    output_format="pcm_44100",  # raw 16-bit PCM; MP3 chunks would need decoding before PyAudio playback
)

for audio_chunk in tts_stream:
    audio_stream.write(audio_chunk)

audio_stream.stop_stream()
audio_stream.close()
p.terminate()

Text Chunking

The convert_realtime() method automatically chunks your input text at natural breakpoints using the internal text_chunker() function. This function splits text at:
  • Sentence endings: ., ?, !
  • Pauses: ,, ;, :
  • Dashes and brackets: —, -, (, ), [, ], }
  • Spaces
This ensures smooth, natural-sounding speech generation even when streaming text.
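The buffering behavior can be sketched as follows. This is a simplified illustration, not the internal text_chunker() implementation, and the exact splitter set may differ:

```python
from typing import Iterator

# Approximate splitter set from the list above; the internal
# implementation may use a different set.
SPLITTERS = (".", ",", "?", "!", ";", ":", "-", "(", ")", "[", "]", "}", " ")

def chunk_text(chunks: Iterator[str]) -> Iterator[str]:
    """Buffer streamed text and flush it at natural breakpoints."""
    buffer = ""
    for text in chunks:
        if buffer.endswith(SPLITTERS):
            # The buffer ends at a breakpoint, so flush it
            yield buffer if buffer.endswith(" ") else buffer + " "
            buffer = text
        else:
            buffer += text
    if buffer:
        yield buffer + " "

list(chunk_text(iter(["Hello, ", "how are", " you?"])))
# → ['Hello, ', 'how are you? ']
```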

WebSocket Connection

Under the hood, convert_realtime() establishes a WebSocket connection to:
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
The connection includes:
  • Model ID and output format in query parameters
  • Authentication via headers
  • JSON message protocol for text chunks and audio responses
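The message protocol can be illustrated with the sketch below. The field names (text, voice_settings, try_trigger_generation, audio) are assumptions modeled on typical stream-input protocols and may not match the actual wire format exactly:

```python
import base64
import json

# Field names here are assumptions, not a verified wire format.

def begin_message(stability: float = 0.5, similarity_boost: float = 0.8) -> str:
    """Opening message: a single space plus voice settings."""
    return json.dumps({
        "text": " ",
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    })

def text_message(text: str) -> str:
    """One chunked piece of text to synthesize."""
    return json.dumps({"text": text, "try_trigger_generation": True})

EOS_MESSAGE = json.dumps({"text": ""})  # empty text signals end of stream

def decode_audio(raw: str) -> bytes:
    """Audio arrives base64-encoded in the server's JSON responses."""
    payload = json.loads(raw)
    return base64.b64decode(payload["audio"]) if payload.get("audio") else b""
```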

Use Cases

  • Chatbots and voice assistants: Stream AI-generated responses as they’re created
  • Live narration: Convert real-time text (e.g., from live captions) to speech
  • Interactive storytelling: Generate speech for dynamic, user-driven narratives
  • Accessibility tools: Provide real-time audio feedback for text input
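For the chatbot case, a small adapter (a hypothetical helper, not part of the SDK) can group streamed LLM tokens into sentences before they are passed as the text iterator to convert_realtime():

```python
from typing import Iterator

def tokens_to_sentences(tokens: Iterator[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences so each yielded
    chunk is a natural speech unit. Hypothetical helper for
    illustration only."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            # A sentence boundary was reached; emit it
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text without terminal punctuation
        yield buffer.strip()

list(tokens_to_sentences(iter(["Hel", "lo.", " How", " are you?"])))
# → ['Hello.', 'How are you?']
```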

Notes

  • The realtime client requires a WebSocket connection, which is automatically managed
  • Audio chunks are returned as they become available, enabling very low latency
  • The connection will automatically close when all text has been processed
  • Error handling is built-in via ApiError exceptions
