Speech to Speech

Overview

Speech-to-speech conversion transforms audio from one voice into another while giving you full control over emotion, timing, and delivery. It suits voice changing, dubbing, and similar applications.

Basic Conversion

Convert audio to a different voice:

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Open the input in a context manager so the file handle is closed afterwards
with open("input.mp3", "rb") as source:
    audio = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=source,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128"
    )

    # Save the converted audio while the input file is still open
    with open("output.mp3", "wb") as f:
        for chunk in audio:
            f.write(chunk)

Streaming Conversion

Stream the converted audio in real time:

from elevenlabs import stream
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

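# Keep the input file open until streaming has finished
with open("input.mp3", "rb") as source:
    audio_stream = client.speech_to_speech.stream(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=source,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128"
    )

    # Stream directly to speakers (the stream() helper requires mpv to be installed)
    stream(audio_stream)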

Parameters

voice_id (string, required)
ID of the voice to use. Use the Get voices endpoint to list all available voices.

audio (file, required)
The input audio file to convert.

model_id (string)
Identifier of the model to use. The model must support speech-to-speech; check its can_do_voice_conversion property.

output_format (string)
Output format of the generated audio, formatted as codec_sample_rate_bitrate (e.g., mp3_44100_128). Formats such as pcm_16000 omit the bitrate.

optimize_streaming_latency (integer)
Latency optimization level (0-4):

0: Default mode (no optimizations)
1: Normal optimizations (~50% improvement)
2: Strong optimizations (~75% improvement)
3: Max optimizations
4: Max optimizations with text normalizer off

voice_settings (string)
JSON-encoded string of voice settings that override the voice's stored settings for this request.

seed (integer)
Seed for deterministic generation (0-4294967295).

remove_background_noise (boolean)
Remove background noise from the input audio using the audio isolation model.

file_format (string)
Format of the input audio. Options: pcm_s16le_16 or other. With pcm_s16le_16, the input must be 16-bit PCM at 16kHz, mono, little-endian; this format offers lower latency.

enable_logging (boolean)
When set to false, zero retention mode is used (enterprise feature).
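The numeric ranges listed for these parameters can be checked on the client before a request is sent. The following is a minimal sketch under that assumption: `build_request_options` is a hypothetical helper, not part of the ElevenLabs SDK, and it only assembles keyword arguments that could then be passed to `client.speech_to_speech.convert`.

```python
from typing import Any, Dict, Optional

MAX_SEED = 4294967295  # upper bound documented for the seed parameter


def build_request_options(
    optimize_streaming_latency: Optional[int] = None,
    seed: Optional[int] = None,
    remove_background_noise: bool = False,
) -> Dict[str, Any]:
    """Validate optional parameters and return them as keyword arguments.

    Hypothetical helper for illustration; not part of the ElevenLabs SDK.
    """
    options: Dict[str, Any] = {}
    if optimize_streaming_latency is not None:
        # Documented levels are the integers 0 through 4
        if optimize_streaming_latency not in range(0, 5):
            raise ValueError("optimize_streaming_latency must be between 0 and 4")
        options["optimize_streaming_latency"] = optimize_streaming_latency
    if seed is not None:
        if not 0 <= seed <= MAX_SEED:
            raise ValueError(f"seed must be between 0 and {MAX_SEED}")
        options["seed"] = seed
    if remove_background_noise:
        options["remove_background_noise"] = True
    return options
```

The returned dictionary can be unpacked into a call with `**options`, so unset parameters fall back to the API defaults.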

With Background Noise Removal

Clean up noisy input audio:

audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("noisy_input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    remove_background_noise=True,
    output_format="mp3_44100_128"
)

Low Latency Mode

Optimize for minimal latency:

audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    optimize_streaming_latency=3,  # Max latency optimizations
    output_format="mp3_44100_128"
)

Custom Voice Settings

Override voice settings for the conversion:

import json

voice_settings = json.dumps({
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.3
})

audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    voice_settings=voice_settings,
    output_format="mp3_44100_128"
)
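Because the settings travel as a JSON string, a mistyped value is only reported once the request reaches the API. The sketch below validates the values before encoding them. `encode_voice_settings` is a hypothetical helper, and the 0.0 to 1.0 range it enforces is an assumption about these settings, not something stated on this page.

```python
import json
from typing import Dict


def encode_voice_settings(settings: Dict[str, float]) -> str:
    """Check each setting against an assumed 0.0-1.0 range, then JSON-encode.

    Hypothetical helper for illustration; not part of the ElevenLabs SDK.
    """
    for name, value in settings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return json.dumps(settings)
```

The return value is a plain string, so it can be passed as the `voice_settings` argument in place of the `json.dumps` call.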

Deterministic Generation

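Use a seed to make results reproducible. Determinism is best-effort: repeated requests with the same seed and parameters should return the same audio, but this is not guaranteed.

audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    seed=42,  # Reuse the same seed and parameters to aim for the same output
    output_format="mp3_44100_128"
)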
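One way to check whether two seeded runs actually produced identical audio is to compare a hash of each result. The sketch below assumes the conversion result is an iterable of byte chunks, as in the examples on this page; `audio_digest` is a hypothetical helper name.

```python
import hashlib
from typing import Iterable


def audio_digest(chunks: Iterable[bytes]) -> str:
    """Return the SHA-256 hex digest of an audio stream, one chunk at a time."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Two conversions with the same seed and parameters can then be compared with `audio_digest(first) == audio_digest(second)`. The digest depends only on the concatenated bytes, not on how they were split into chunks.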

PCM Input for Lower Latency

Use PCM format for the lowest latency:

# Input must be 16-bit PCM at 16kHz, mono, little-endian
audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.pcm", "rb"),
    model_id="eleven_multilingual_sts_v2",
    file_format="pcm_s16le_16",
    output_format="mp3_44100_128"
)
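If the source recording is a WAV file, the raw PCM payload can be extracted with Python's standard `wave` module. The sketch below is one possible approach: `wav_to_pcm_s16le_16` is a hypothetical helper that rejects files that are not already 16-bit, 16kHz, mono, and it performs no resampling. WAV stores 16-bit PCM samples in little-endian order, so the frames can be written out unchanged.

```python
import wave


def wav_to_pcm_s16le_16(wav_path: str, pcm_path: str) -> int:
    """Write the raw PCM frames of a WAV file to pcm_path; return bytes written.

    Raises ValueError unless the WAV is 16-bit, 16 kHz, mono.
    Hypothetical helper for illustration; no resampling is performed.
    """
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sample rate")
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        frames = wav.readframes(wav.getnframes())
    with open(pcm_path, "wb") as out:
        out.write(frames)
    return len(frames)
```

The resulting file holds headerless samples and can be opened in binary mode and passed as `audio` together with `file_format="pcm_s16le_16"`.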

Async Conversion

Convert audio asynchronously:

import asyncio
from elevenlabs.client import AsyncElevenLabs

async def convert_voice():
    client = AsyncElevenLabs(api_key="YOUR_API_KEY")

    with open("input.mp3", "rb") as source:
        # convert() returns an async iterator of audio chunks, so it is
        # consumed with `async for` rather than awaited
        audio = client.speech_to_speech.convert(
            voice_id="JBFqnCBsd6RMkjVDRZzb",
            audio=source,
            model_id="eleven_multilingual_sts_v2",
            output_format="mp3_44100_128"
        )

        with open("output.mp3", "wb") as f:
            async for chunk in audio:
                f.write(chunk)

asyncio.run(convert_voice())
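The write loop can be factored into a reusable coroutine. The sketch below assumes the conversion result is an async iterator of byte chunks; `save_async_chunks` is a hypothetical helper name, not an SDK function.

```python
from typing import AsyncIterator


async def save_async_chunks(chunks: AsyncIterator[bytes], path: str) -> int:
    """Write chunks from an async iterator to path; return total bytes written."""
    total = 0
    with open(path, "wb") as f:
        async for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total
```

The file writes here are ordinary blocking calls, which is usually acceptable for small audio chunks. An application that serves many concurrent conversions may prefer to hand the writes to a thread.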

Output Formats

Supported output formats:

mp3_44100_32 - MP3 at 44.1kHz, 32kbps
mp3_44100_64 - MP3 at 44.1kHz, 64kbps
mp3_44100_96 - MP3 at 44.1kHz, 96kbps
mp3_44100_128 - MP3 at 44.1kHz, 128kbps (recommended)
mp3_44100_192 - MP3 at 44.1kHz, 192kbps (Creator tier+)
pcm_16000 - PCM at 16kHz
pcm_22050 - PCM at 22.05kHz
pcm_24000 - PCM at 24kHz
pcm_44100 - PCM at 44.1kHz (Pro tier+)
ulaw_8000 - μ-law at 8kHz (Twilio compatible)
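The identifiers in this list follow a codec, sample rate, optional bitrate pattern, which makes them easy to split when a program needs the sample rate, for example to configure a playback device. A minimal sketch, with `OutputFormat` and `parse_output_format` as hypothetical names:

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the identifier carries no bitrate


def parse_output_format(identifier: str) -> OutputFormat:
    """Split an identifier such as 'mp3_44100_128' or 'pcm_16000' into its parts."""
    parts = identifier.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unrecognized output format: {identifier!r}")
    codec = parts[0]
    sample_rate = int(parts[1])  # raises ValueError if not numeric
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate, bitrate)
```

The parser checks only the shape of the identifier. It does not check that the API or the current subscription tier supports the format.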

Use Cases

Voice Changing

Transform your voice in real time or in recordings

Content Localization

Maintain speaker identity across languages

Voice Preservation

Preserve vocal characteristics while changing content

Accessibility

Convert voices for better accessibility

Best Practices

Use high-quality input audio for best results
Enable background noise removal for noisy recordings
Use PCM input format for lowest latency applications
Test different latency optimization levels for your use case

The quality of voice conversion depends heavily on the input audio quality. Clean, clear audio produces better results.
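To compare latency optimization levels, it helps to measure how long the first audio chunk takes to arrive. The sketch below times any callable that returns an iterable of byte chunks; `time_to_first_chunk` is a hypothetical helper, and the measurement includes network and connection setup time.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_chunk(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total bytes received)."""
    started = time.perf_counter()
    first_chunk_at = None
    total = 0
    for chunk in start_stream():
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        total += len(chunk)
    if first_chunk_at is None:
        raise RuntimeError("stream produced no audio")
    return first_chunk_at - started, total
```

In practice `start_stream` would be a function that opens the input file and calls `client.speech_to_speech.stream` with a given `optimize_streaming_latency` value. Running it several times per level and comparing the medians gives a more stable picture than a single measurement.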

Related Features

Voice Cloning - Create custom voices
Audio Isolation - Remove background noise
Text to Speech - Generate speech from text
