Overview

Speech-to-speech conversion transforms audio from one voice into another while you keep full control over emotion, timing, and delivery. It is well suited to voice changing, dubbing, and other voice conversion applications.

Basic Conversion

Convert audio to a different voice:
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Keep the input file open until every output chunk has been read
with open("input.mp3", "rb") as source:
    audio = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=source,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128",
    )

    # Save the converted audio chunk by chunk
    with open("output.mp3", "wb") as f:
        for chunk in audio:
            f.write(chunk)

Streaming Conversion

Stream the converted audio in real time:
from elevenlabs import stream
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

audio_stream = client.speech_to_speech.stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    output_format="mp3_44100_128"
)

# Play the audio through the speakers as it arrives
# (the stream helper needs mpv installed on the machine)
stream(audio_stream)

Parameters

voice_id (string, required)
ID of the voice to use. Use the Get voices endpoint to list all available voices.

audio (File, required)
The input audio file to convert.

model_id (string)
Identifier of the model to use. The model must support speech-to-speech (check its can_do_voice_conversion property).

output_format (string)
Output format of the generated audio, written as codec_sample_rate_bitrate (e.g., mp3_44100_128).

optimize_streaming_latency (integer)
Latency optimization level (0-4):
  • 0: Default mode (no latency optimizations)
  • 1: Normal optimizations (about 50% of the latency improvement available at level 3)
  • 2: Strong optimizations (about 75% of the latency improvement available at level 3)
  • 3: Max optimizations
  • 4: Max optimizations with the text normalizer turned off

voice_settings (string)
JSON-encoded string of voice settings that override the voice's stored settings for this request.

seed (integer)
Seed for best-effort deterministic generation (0-4294967295). Repeated requests with the same seed and parameters should return the same result, but determinism is not guaranteed.

remove_background_noise (boolean)
Remove background noise from the input audio using the audio isolation model.

file_format (string)
Format of the input audio: pcm_s16le_16 or other. With pcm_s16le_16 the input must be 16-bit PCM at 16kHz, mono, little-endian; this format offers lower latency than an encoded file.

enable_logging (boolean)
When set to false, zero retention mode is used for the request (enterprise feature).
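
The documented ranges can be checked locally before a request is sent. The following is a minimal sketch, assuming a hypothetical helper named check_sts_params that is not part of the SDK; the bounds are taken from the parameter descriptions above.

```python
def check_sts_params(seed=None, optimize_streaming_latency=None, file_format=None):
    """Hypothetical pre-flight check; not part of the ElevenLabs SDK.

    Returns a list of human-readable problems (empty when everything is in range).
    """
    errors = []
    # seed is documented as an integer in 0-4294967295
    if seed is not None and not (0 <= seed <= 4294967295):
        errors.append("seed must be in 0-4294967295")
    # optimize_streaming_latency is documented as a level from 0 to 4
    if optimize_streaming_latency is not None and optimize_streaming_latency not in range(5):
        errors.append("optimize_streaming_latency must be 0-4")
    # file_format is documented as pcm_s16le_16 or other
    if file_format is not None and file_format not in ("pcm_s16le_16", "other"):
        errors.append("file_format must be 'pcm_s16le_16' or 'other'")
    return errors
```

An empty list means the values fall inside the documented ranges; the API may still reject a request for other reasons.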

With Background Noise Removal

Clean up noisy input audio:
audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("noisy_input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    remove_background_noise=True,
    output_format="mp3_44100_128"
)

Low Latency Mode

Optimize for minimal latency:
audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    optimize_streaming_latency=3,  # Max latency optimizations
    output_format="mp3_44100_128"
)

Custom Voice Settings

Override voice settings for the conversion:
import json

voice_settings = json.dumps({
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.3
})

audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    voice_settings=voice_settings,
    output_format="mp3_44100_128"
)

Deterministic Generation

Use a seed for reproducible results:
audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.mp3", "rb"),
    model_id="eleven_multilingual_sts_v2",
    seed=42,  # Best effort: same seed and parameters should give the same output
    output_format="mp3_44100_128"
)
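
One way to check whether two seeded runs actually produced identical audio is to hash the returned chunks. This is a sketch only: digest_chunks is a hypothetical helper, not an SDK function, and it accepts any iterable of bytes such as the value returned by convert.

```python
import hashlib


def digest_chunks(chunks):
    """Return the SHA-256 hex digest of an iterable of byte chunks.

    The digest depends only on the concatenated bytes, not on how
    they happen to be split into chunks.
    """
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()
```

Calling digest_chunks on the result of two separate conversions and comparing the digests shows whether the outputs were byte-for-byte identical.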

PCM Input for Lower Latency

Use PCM format for the lowest latency:
# Input must be 16-bit PCM at 16kHz, mono, little-endian
audio = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=open("input.pcm", "rb"),
    model_id="eleven_multilingual_sts_v2",
    file_format="pcm_s16le_16",
    output_format="mp3_44100_128"
)
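
If the source audio is a WAV file, its raw frames can be extracted with the standard library once the format matches the requirements above. The sketch below assumes a hypothetical helper, wav_to_pcm_s16le_16, that only validates and extracts; it does not resample or downmix, and it relies on PCM WAV data being stored little-endian.

```python
import io
import wave


def wav_to_pcm_s16le_16(wav_file):
    """Return raw PCM frames from a WAV file that already matches pcm_s16le_16.

    wav_file may be a path or a binary file-like object.
    Raises ValueError if the audio is not 16-bit, 16 kHz, mono.
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sample rate")
        return wav.readframes(wav.getnframes())


# Demonstration with a tiny in-memory WAV holding two silent samples
demo = io.BytesIO()
with wave.open(demo, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00\x00\x00")
demo.seek(0)
pcm_bytes = wav_to_pcm_s16le_16(demo)
```

The returned bytes can be wrapped in io.BytesIO or written to a .pcm file and passed as audio together with file_format="pcm_s16le_16".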

Async Conversion

Convert audio asynchronously:
import asyncio
from elevenlabs.client import AsyncElevenLabs

async def convert_voice():
    client = AsyncElevenLabs(api_key="YOUR_API_KEY")

    # Keep the input file open until every output chunk has been read
    with open("input.mp3", "rb") as source:
        # The async client returns an async iterator of audio chunks,
        # so iterate with `async for` instead of awaiting the call
        audio = client.speech_to_speech.convert(
            voice_id="JBFqnCBsd6RMkjVDRZzb",
            audio=source,
            model_id="eleven_multilingual_sts_v2",
            output_format="mp3_44100_128",
        )

        with open("output.mp3", "wb") as f:
            async for chunk in audio:
                f.write(chunk)

asyncio.run(convert_voice())
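
The async client becomes most useful when several files are converted at once. The following sketch shows one way to cap concurrency with a semaphore. Everything in it is an assumption for illustration: convert_all is a hypothetical helper, convert_one stands for any coroutine that wraps a single conversion, fake_convert is a stand-in that makes no API call, and the default limit of 3 is arbitrary because this page does not state any rate limits.

```python
import asyncio


async def convert_all(paths, convert_one, max_concurrent=3):
    """Run convert_one(path) for every path, at most max_concurrent at a time.

    Results are returned in the same order as paths.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path):
        # Wait for a free slot before starting this conversion
        async with semaphore:
            return await convert_one(path)

    return await asyncio.gather(*(guarded(path) for path in paths))


async def fake_convert(path):
    """Stand-in for a real conversion coroutine; makes no API call."""
    await asyncio.sleep(0)
    return path.replace(".mp3", "_converted.mp3")


results = asyncio.run(convert_all(["a.mp3", "b.mp3"], fake_convert))
```

In real use, convert_one would be a coroutine like convert_voice above that takes the input path as an argument.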

Output Formats

Supported output formats:
  • mp3_44100_32 - MP3 at 44.1kHz, 32kbps
  • mp3_44100_64 - MP3 at 44.1kHz, 64kbps
  • mp3_44100_96 - MP3 at 44.1kHz, 96kbps
  • mp3_44100_128 - MP3 at 44.1kHz, 128kbps (recommended)
  • mp3_44100_192 - MP3 at 44.1kHz, 192kbps (Creator tier+)
  • pcm_16000 - PCM at 16kHz
  • pcm_22050 - PCM at 22.05kHz
  • pcm_24000 - PCM at 24kHz
  • pcm_44100 - PCM at 44.1kHz (Pro tier+)
  • ulaw_8000 - μ-law at 8kHz (Twilio compatible)
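
Because every format name follows the codec_sample_rate_bitrate pattern (with the bitrate omitted for PCM and μ-law), the pieces can be recovered with a small parser. This is an illustrative sketch; OutputFormat and parse_output_format are hypothetical names, not SDK types.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the name carries no bitrate


def parse_output_format(value: str) -> OutputFormat:
    """Split a name such as 'mp3_44100_128' into codec, sample rate, and bitrate."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output format: {value!r}")
    codec = parts[0]
    sample_rate_hz = int(parts[1])
    bitrate_kbps = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate_hz, bitrate_kbps)
```

For example, parse_output_format("pcm_16000").sample_rate_hz gives the rate needed when writing the raw samples into a WAV container.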

Use Cases

Voice Changing

Transform your voice in real time or in recordings

Content Localization

Maintain speaker identity across languages

Voice Preservation

Preserve vocal characteristics while changing content

Accessibility

Convert voices for better accessibility

Best Practices

  • Use high-quality input audio for best results
  • Enable background noise removal for noisy recordings
  • Use the PCM input format for latency-sensitive applications
  • Test different latency optimization levels for your use case
The quality of voice conversion depends heavily on the input audio quality. Clean, clear audio produces better results.
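
To compare latency optimization levels, it helps to measure how long the first audio chunk takes to arrive. The sketch below assumes a hypothetical helper, time_to_first_chunk, that works on any iterable of bytes. The timer starts when iteration begins, so pass the iterator in right after creating it.

```python
import time


def time_to_first_chunk(chunks):
    """Consume an iterable of byte chunks.

    Returns (seconds until the first chunk arrived, total bytes received).
    The first value is None if no chunk was produced.
    """
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_chunk_at, total_bytes
```

Running this over the result of the same request at levels 0 through 4, several times each, gives a rough picture of the trade-off for a particular input and network.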
