Speech to Speech

Overview

The Speech to Speech API allows you to transform audio from one voice to another, maintaining full control over emotion, timing, and delivery. This is also known as voice conversion or voice changing.

Methods

convert()

Transform audio from one voice to another with full control over emotion, timing, and delivery.

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

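# Open the input in binary mode; the with block closes it once we are done
with open("input_audio.mp3", "rb") as audio_file:
    audio_iterator = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=audio_file,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128",
    )

    # Save the converted audio chunk by chunk while the input is still open
    with open("output.mp3", "wb") as f:
        for chunk in audio_iterator:
            f.write(chunk)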

voice_id

str

required

ID of the voice to be used. Use the Get voices endpoint to list all available voices.

audio

core.File

required

The audio file to convert. Can be a file path, file object, or bytes.

model_id

str

Identifier of the model that will be used. The model needs to have support for speech to speech (can_do_voice_conversion property). Default models include:

eleven_multilingual_sts_v2 - Multilingual speech-to-speech model
eleven_english_sts_v2 - English speech-to-speech model

output_format

str

Output format of the generated audio. Formatted as codec_sample_rate_bitrate. Examples:

mp3_44100_128 - MP3 at 44.1kHz, 128kbps
mp3_22050_32 - MP3 at 22.05kHz, 32kbps
pcm_16000 - PCM at 16kHz
ulaw_8000 - μ-law at 8kHz (commonly used for Twilio)

Note: Higher quality formats may require subscription to Creator tier or above.
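The codec_sample_rate_bitrate naming described above can be unpacked programmatically, for example to pick the right decoder or sample rate on the receiving side. The following is a minimal sketch: parse_output_format and OutputFormat are hypothetical names, not part of the elevenlabs SDK, and the code assumes the bitrate component is optional, as in pcm_16000.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]


def parse_output_format(value: str) -> OutputFormat:
    """Split a string such as "mp3_44100_128" into its parts.

    Hypothetical helper for illustration only; assumes the layout
    codec_sample_rate[_bitrate] shown in the examples above.
    """
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output format: {value!r}")
    codec = parts[0]
    sample_rate_hz = int(parts[1])
    # Formats such as pcm_16000 carry no bitrate component
    bitrate_kbps = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate_hz, bitrate_kbps)


print(parse_output_format("mp3_44100_128"))
print(parse_output_format("ulaw_8000"))
```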

enable_logging

bool

When set to False, zero retention mode will be used for the request. History features will be unavailable. Zero retention mode may only be used by enterprise customers.

optimize_streaming_latency

int

Latency optimization level. Higher values reduce latency at some cost to quality:

0 - Default mode (no latency optimizations)
1 - Normal latency optimizations (~50% improvement)
2 - Strong latency optimizations (~75% improvement)
3 - Maximum latency optimizations
4 - Maximum latency optimizations with text normalizer disabled (best latency, may mispronounce numbers/dates)

voice_settings

str

Voice settings as a JSON-encoded string. These override stored settings for the given voice and apply only to this request. Example:

{"stability": 0.5, "similarity_boost": 0.75}
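Because voice_settings is passed as a string rather than a dictionary, one way to build it is with the standard json module. This is a small sketch using only the two keys from the example above; it does not assume any other settings exist.

```python
import json

# Values taken from the example above; no other keys are assumed here
settings = {"stability": 0.5, "similarity_boost": 0.75}

# json.dumps produces the JSON-encoded string the parameter expects
voice_settings = json.dumps(settings)
print(voice_settings)  # {"stability": 0.5, "similarity_boost": 0.75}
```

The resulting string can then be passed as voice_settings=voice_settings in a convert() or stream() call.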

seed

int

Random seed for deterministic generation. Must be an integer between 0 and 4294967295. Repeated requests with the same seed and parameters should return similar results, though determinism is not guaranteed.
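The stated range corresponds to an unsigned 32-bit integer, so it can be checked locally before sending a request. The sketch below uses a hypothetical validate_seed helper that is not part of the SDK; how the API itself reports an out-of-range seed is not covered here.

```python
MAX_SEED = 4294967295  # 2**32 - 1, the upper bound stated above


def validate_seed(seed: int) -> int:
    """Return seed unchanged if it lies in [0, MAX_SEED], else raise.

    Hypothetical client-side check for illustration only.
    """
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer")
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    return seed


print(validate_seed(12345))
```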

remove_background_noise

bool

If set, will remove background noise from your audio input using the audio isolation model. Only applies to Voice Changer.

file_format

str

The format of input audio. Options:

pcm_s16le_16 - 16-bit PCM at 16kHz sample rate, mono, little-endian (lower latency)
other - Any other encoded audio format (default)
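To make the pcm_s16le_16 layout concrete, the sketch below packs floating-point samples into 16-bit signed little-endian mono PCM at 16 kHz using only the standard library. float_to_pcm_s16le is a hypothetical helper written for this illustration, and the 440 Hz tone is just test data.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000  # 16 kHz, as described for pcm_s16le_16


def float_to_pcm_s16le(samples):
    """Pack floats in [-1.0, 1.0] as 16-bit signed little-endian mono PCM."""
    # Clip first so out-of-range input cannot overflow the 16-bit range
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    # "<" selects little-endian, "h" a signed 16-bit integer
    return struct.pack(f"<{len(ints)}h", *ints)


# One second of a 440 Hz test tone
tone = [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_RATE_HZ)
]
pcm_bytes = float_to_pcm_s16le(tone)
print(len(pcm_bytes))  # 2 bytes per sample
```

Since the audio parameter is described as accepting bytes, a buffer like pcm_bytes could presumably be passed together with file_format="pcm_s16le_16"; treat that pairing as an assumption to verify against the API.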

request_options

RequestOptions

Request-specific configuration including chunk_size and other customizations.

return

Iterator[bytes]

An iterator yielding audio data chunks. Iterate over this to get the complete audio file.
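When the whole result is needed in memory rather than on disk, the chunks can be joined into a single bytes object. This sketch uses a stand-in generator instead of a real API call so that it runs offline; collect_audio and fake_audio_iterator are hypothetical names.

```python
from typing import Iterable, Iterator


def collect_audio(chunks: Iterable[bytes]) -> bytes:
    """Join all chunks from an iterator into one bytes object."""
    return b"".join(chunks)


def fake_audio_iterator() -> Iterator[bytes]:
    """Stand-in for the iterator returned by convert(); yields dummy data."""
    yield b"abc"
    yield b"def"


audio_bytes = collect_audio(fake_audio_iterator())
print(len(audio_bytes))  # 6
```

For long recordings, writing each chunk to a file as it arrives keeps memory use lower than collecting everything first.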

stream()

Stream audio conversion from one voice to another in real-time.

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

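def process_audio_chunk(chunk: bytes) -> None:
    # Placeholder handler: replace with playback or your own processing
    print(f"Received {len(chunk)} bytes")

# Open the input in binary mode; the with block closes it once we are done
with open("input_audio.mp3", "rb") as audio_file:
    audio_stream = client.speech_to_speech.stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=audio_file,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128",
    )

    # Handle each chunk as soon as it arrives
    for chunk in audio_stream:
        process_audio_chunk(chunk)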

voice_id

str

required

ID of the voice to be used.

audio

core.File

required

The audio file to convert.

model_id

str

Identifier of the speech-to-speech model to use.

output_format

str

Output format of the streamed audio. Same format options as convert().

enable_logging

bool

Enable or disable request logging.

optimize_streaming_latency

int

Latency optimization level (0-4). Recommended for streaming use cases.

voice_settings

str

JSON-encoded voice settings override.

seed

int

Random seed for deterministic generation (0-4294967295).

remove_background_noise

bool

Remove background noise from input audio.

file_format

str

Format of input audio (pcm_s16le_16 or other).

request_options

RequestOptions

Request-specific configuration.

return

Iterator[bytes]

An iterator yielding streaming audio data chunks.
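With streaming, the usual goal is to act on each chunk immediately rather than wait for the full file. The sketch below shows one way to do that while also recording how long the first chunk took to arrive. consume_stream and fake_stream are hypothetical names, and the generator stands in for a real stream() call so the example runs offline.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def consume_stream(
    chunks: Iterable[bytes],
    handle: Callable[[bytes], None],
) -> Tuple[int, Optional[float]]:
    """Pass each chunk to handle() as it arrives.

    Returns the total byte count and the delay in seconds before the
    first chunk (None if the stream was empty).
    """
    start = time.monotonic()
    first_chunk_delay: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_delay is None:
            first_chunk_delay = time.monotonic() - start
        handle(chunk)
        total_bytes += len(chunk)
    return total_bytes, first_chunk_delay


def fake_stream() -> Iterator[bytes]:
    """Stand-in for the iterator returned by stream(); yields dummy data."""
    for piece in (b"ab", b"cd", b"ef"):
        yield piece


received = []
total, delay = consume_stream(fake_stream(), received.append)
print(total, len(received))
```

Measuring the first-chunk delay this way gives a rough basis for comparing optimize_streaming_latency levels in your own environment.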

Async Methods

All methods have async equivalents accessible via AsyncElevenLabs:

import asyncio
from elevenlabs import AsyncElevenLabs

client = AsyncElevenLabs(api_key="YOUR_API_KEY")

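async def convert_audio():
    # Open the input in binary mode; the with block closes it once we are done
    with open("input_audio.mp3", "rb") as audio_file:
        # The async convert() yields chunks as an async iterator, so it is
        # consumed with async for rather than awaited
        audio_iterator = client.speech_to_speech.convert(
            voice_id="JBFqnCBsd6RMkjVDRZzb",
            audio=audio_file,
            model_id="eleven_multilingual_sts_v2",
        )

        async for chunk in audio_iterator:
            # Process audio chunk
            pass

asyncio.run(convert_audio())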
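The async for loop above can be wrapped in a small helper when the full result is needed as bytes. This is a sketch that runs offline: collect_async and fake_async_audio are hypothetical names, and the async generator stands in for the iterator produced by the async client.

```python
import asyncio
from typing import AsyncIterator


async def collect_async(chunks: AsyncIterator[bytes]) -> bytes:
    """Gather every chunk from an async iterator into one bytes object."""
    buffer = bytearray()
    async for chunk in chunks:
        buffer.extend(chunk)
    return bytes(buffer)


async def fake_async_audio() -> AsyncIterator[bytes]:
    """Stand-in for the async client's chunk iterator; yields dummy data."""
    for piece in (b"ab", b"cd"):
        await asyncio.sleep(0)  # give control back to the event loop
        yield piece


result = asyncio.run(collect_async(fake_async_audio()))
print(result)  # b'abcd'
```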

Use Cases

Voice changing: Transform your voice into a different voice while preserving emotion and timing
Podcast editing: Replace speaker voices while maintaining natural delivery
Content localization: Adapt voice characteristics for different audiences
Audio restoration: Improve audio quality while preserving the original timing and emotion
