Overview
The Speech to Speech API allows you to transform audio from one voice to another, maintaining full control over emotion, timing, and delivery. This is also known as voice conversion or voice changing.
Methods
convert()
Transform audio from one voice to another with full control over emotion, timing, and delivery.
ID of the voice to be used. Use the Get voices endpoint to list all available voices.
The audio file to convert. Can be a file path, file object, or bytes.
Identifier of the model that will be used. The model needs to have support for speech to speech (can_do_voice_conversion property). Default models include:
- eleven_multilingual_sts_v2: Multilingual speech-to-speech model
- eleven_english_sts_v2: English speech-to-speech model
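As a rough illustration of the can_do_voice_conversion check, the sketch below filters a list of model records down to those usable for speech to speech. The ModelInfo class and the sample entries are hypothetical stand-ins; only the can_do_voice_conversion property and the two model IDs come from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelInfo:
    """Hypothetical stand-in for one entry of a model listing."""
    model_id: str
    can_do_voice_conversion: bool


def voice_conversion_model_ids(models: List[ModelInfo]) -> List[str]:
    """Return the IDs of the models that support speech to speech."""
    return [m.model_id for m in models if m.can_do_voice_conversion]


# Sample data for illustration only; a real list would come from the API.
models = [
    ModelInfo("eleven_multilingual_sts_v2", True),
    ModelInfo("eleven_english_sts_v2", True),
    ModelInfo("hypothetical_text_only_model", False),  # made-up ID
]
sts_model_ids = voice_conversion_model_ids(models)
print(sts_model_ids)
```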
Output format of the generated audio, formatted as codec_sample_rate_bitrate. Examples:
- mp3_44100_128: MP3 at 44.1kHz, 128kbps
- mp3_22050_32: MP3 at 22.05kHz, 32kbps
- pcm_16000: PCM at 16kHz
- ulaw_8000: μ-law at 8kHz (commonly used for Twilio)
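To make the codec_sample_rate_bitrate layout concrete, here is a minimal sketch that splits such a string into its parts. The parse_output_format helper and the OutputFormat type are hypothetical and not part of the SDK. The sketch assumes the bitrate part is simply absent for formats such as pcm_16000.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None when the string has no bitrate part


def parse_output_format(value: str) -> OutputFormat:
    """Split a 'codec_sample_rate[_bitrate]' string into typed fields."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output format: {value!r}")
    codec = parts[0]
    sample_rate_hz = int(parts[1])
    bitrate_kbps = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate_hz, bitrate_kbps)


fmt = parse_output_format("mp3_44100_128")
print(fmt.codec, fmt.sample_rate_hz, fmt.bitrate_kbps)
```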
When set to False, zero retention mode will be used for the request. History features will be unavailable. Zero retention mode may only be used by enterprise customers.
Latency optimization level. Higher values reduce latency at some cost to quality:
- 0: Default mode (no latency optimizations)
- 1: Normal latency optimizations (~50% improvement)
- 2: Strong latency optimizations (~75% improvement)
- 3: Maximum latency optimizations
- 4: Maximum latency optimizations with text normalizer disabled (best latency, may mispronounce numbers/dates)
Voice settings as a JSON-encoded string. These override the stored settings for the given voice and apply only to this request.
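A JSON-encoded settings string can be produced with the standard json module, as sketched below. The field names stability and similarity_boost are assumptions used for illustration; consult the voice settings reference for the fields actually supported.

```python
import json

# Assumed field names, for illustration only.
settings = {"stability": 0.5, "similarity_boost": 0.75}

# The parameter expects a string, not a dict, so encode before passing it.
voice_settings = json.dumps(settings)
print(voice_settings)
```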
Random seed for deterministic generation. Must be an integer between 0 and 4294967295. Repeated requests with the same seed and parameters should return similar results, though determinism is not guaranteed.
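The upper bound 4294967295 is 2**32 - 1, so any unsigned 32-bit value is a valid seed. The sketch below checks that range and, as one possible convention, derives a stable seed from a label using CRC32. Both helper names are hypothetical.

```python
import zlib

MAX_SEED = 4294967295  # 2**32 - 1, the upper bound stated above


def validate_seed(seed: int) -> int:
    """Return the seed unchanged if it is an int within the allowed range."""
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer")
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    return seed


def seed_from_label(label: str) -> int:
    """Derive a repeatable seed from a string; CRC32 fits in 32 bits."""
    return zlib.crc32(label.encode("utf-8"))


seed = validate_seed(seed_from_label("episode-12-take-3"))
```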
If set, will remove background noise from your audio input using the audio isolation model. Only applies to Voice Changer.
The format of input audio. Options:
- pcm_s16le_16: 16-bit PCM at 16kHz sample rate, mono, little-endian (lower latency)
- other: Any other encoded audio format (default)
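To show what the pcm_s16le_16 byte layout looks like, the following sketch generates a short test tone as raw 16-bit, 16kHz, mono, little-endian samples with no header. The sine_pcm_s16le helper is hypothetical. Real input would normally come from a recording rather than a generated tone.

```python
import math
import struct

SAMPLE_RATE_HZ = 16_000  # pcm_s16le_16 uses a 16kHz sample rate


def sine_pcm_s16le(freq_hz: float, duration_s: float, amplitude: float = 0.5) -> bytes:
    """Return a mono sine tone as raw signed 16-bit little-endian PCM."""
    n_samples = round(SAMPLE_RATE_HZ * duration_s)
    peak = int(amplitude * 32767)  # 32767 is the largest signed 16-bit value
    samples = (
        int(peak * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
        for i in range(n_samples)
    )
    # "<h" packs one little-endian signed 16-bit integer (2 bytes per sample).
    return b"".join(struct.pack("<h", s) for s in samples)


pcm = sine_pcm_s16le(440.0, 0.1)
print(len(pcm))  # byte count: 2 bytes for each sample
```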
Request-specific configuration including chunk_size and other customizations.
An iterator yielding audio data chunks. Iterate over this to get the complete audio file.
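A minimal sketch of consuming the returned iterator follows. It assumes the client exposes the method as client.speech_to_speech.convert and that the keyword names voice_id, audio, model_id, and output_format match the parameters described above; verify both against your installed SDK. A small offline stand-in replaces the real client so the sketch runs without network access or an API key.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def collect_audio(chunks: Iterable[bytes]) -> bytes:
    """Join the yielded chunks into the complete audio payload."""
    return b"".join(chunks)


def convert_to_bytes(client, voice_id: str, audio: bytes) -> bytes:
    # Assumption: attribute path and keyword names as described in the lead-in.
    chunks = client.speech_to_speech.convert(
        voice_id=voice_id,
        audio=audio,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128",
    )
    return collect_audio(chunks)


# Offline stand-in for the real client; yields two fake chunks.
def _fake_convert(**kwargs) -> Iterator[bytes]:
    yield b"chunk-1|"
    yield b"chunk-2"


fake_client = SimpleNamespace(speech_to_speech=SimpleNamespace(convert=_fake_convert))
result = convert_to_bytes(fake_client, voice_id="hypothetical-voice-id", audio=b"\x00\x00")
```

With a real client, the returned bytes could then be written to a file opened in binary mode.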
stream()
Stream audio conversion from one voice to another in real-time.
ID of the voice to be used.
The audio file to convert.
Identifier of the speech-to-speech model to use.
Output format of the streamed audio. Same format options as convert().
Enable or disable request logging.
Latency optimization level (0-4). Recommended for streaming use cases.
JSON-encoded voice settings override.
Random seed for deterministic generation (0-4294967295).
Remove background noise from input audio.
Format of input audio (pcm_s16le_16 or other).
Request-specific configuration.
An iterator yielding streaming audio data chunks.
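For streaming, chunks are typically handled as they arrive rather than collected first. The sketch below writes each chunk to a binary sink immediately. It assumes the method is reachable as client.speech_to_speech.stream and that the latency level is passed as optimize_streaming_latency; both names are assumptions to check against your SDK version. An offline stand-in replaces the real client.

```python
import io
from types import SimpleNamespace
from typing import BinaryIO, Iterator


def stream_to_sink(client, voice_id: str, audio: bytes, sink: BinaryIO) -> int:
    """Write streamed chunks to the sink as they arrive; return bytes written."""
    written = 0
    # Assumption: attribute path and keyword names as described in the lead-in.
    for chunk in client.speech_to_speech.stream(
        voice_id=voice_id,
        audio=audio,
        model_id="eleven_multilingual_sts_v2",
        optimize_streaming_latency=2,
    ):
        if chunk:  # skip empty chunks
            sink.write(chunk)
            written += len(chunk)
    return written


# Offline stand-in for the real client, including one empty chunk.
def _fake_stream(**kwargs) -> Iterator[bytes]:
    yield b"aa"
    yield b""
    yield b"bbb"


fake_client = SimpleNamespace(speech_to_speech=SimpleNamespace(stream=_fake_stream))
sink = io.BytesIO()
written = stream_to_sink(fake_client, "hypothetical-voice-id", b"\x00\x00", sink)
```

The sink could equally be an open file or a pipe to an audio player; io.BytesIO is used here only to keep the example self-contained.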
Async Methods
All methods have async equivalents accessible via AsyncElevenLabs.
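A sketch of the async pattern is shown below. It assumes the async client mirrors the synchronous one, so that speech_to_speech.convert returns an async iterator consumed with async for; the keyword names are likewise assumptions. An offline stand-in is used in place of a real AsyncElevenLabs instance so the sketch runs without an API key.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator


async def convert_async(client, voice_id: str, audio: bytes) -> bytes:
    """Collect audio chunks from an async client into one bytes object."""
    parts = []
    # Assumption: convert() on the async client yields chunks via `async for`.
    async for chunk in client.speech_to_speech.convert(
        voice_id=voice_id,
        audio=audio,
        model_id="eleven_multilingual_sts_v2",
    ):
        parts.append(chunk)
    return b"".join(parts)


# Offline stand-in for an AsyncElevenLabs client.
async def _fake_convert(**kwargs) -> AsyncIterator[bytes]:
    yield b"async-"
    yield b"audio"


fake_async_client = SimpleNamespace(
    speech_to_speech=SimpleNamespace(convert=_fake_convert)
)
result = asyncio.run(
    convert_async(fake_async_client, "hypothetical-voice-id", b"\x00\x00")
)
```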
Use Cases
- Voice changing: Transform your voice into a different voice while preserving emotion and timing
- Podcast editing: Replace speaker voices while maintaining natural delivery
- Content localization: Adapt voice characteristics for different audiences
- Audio restoration: Improve audio quality while preserving the original timing and emotion