Speech (Text-to-Speech)
Generates audio from the input text.

Parameters
- input: The text to generate audio for. Maximum length is 4096 characters.
- model: One of the available TTS models: tts-1 or tts-1-hd. The tts-1-hd model provides higher-quality audio.
- voice: The voice to use when generating the audio. Supported voices are alloy, echo, fable, onyx, nova, and shimmer.
- response_format: The format to generate the audio in. Supported formats: mp3, opus, aac, flac, wav, and pcm.
- speed: The speed of the generated audio. Select a value from 0.25 to 4.0.
- instructions: Control the voice of your generated audio with additional instructions. Does not work with tts-1.
- stream_format: The format to stream the audio in. Supported formats are sse and audio.

Response

Returns a StringIO object containing the audio file content.
Examples
Generate speech
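A minimal sketch of a basic speech request, assuming the REST endpoint POST /v1/audio/speech and JSON request bodies; `build_speech_payload` and `synthesize` are illustrative helper names, not part of any SDK.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"  # assumed endpoint path


def build_speech_payload(text, model="tts-1", voice="alloy"):
    """Assemble the JSON body for a basic speech request."""
    if len(text) > 4096:
        raise ValueError("input text exceeds the 4096-character maximum")
    return {"model": model, "voice": voice, "input": text}


def synthesize(payload, api_key):
    """Send the request and return the raw audio bytes (not called here)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


payload = build_speech_payload("The quick brown fox jumped over the lazy dog.")
```

The defaults above (tts-1, alloy, mp3 output) match the documented defaults; only model, voice, and input are required.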
Different voice and format
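A sketch of the same request with a non-default voice, output format, and speed, validating the ranges documented above; the helper name is illustrative.

```python
def build_speech_payload(text, *, voice, response_format="mp3", speed=1.0):
    """Assemble a speech request body with a non-default voice and format."""
    if not (0.25 <= speed <= 4.0):
        raise ValueError("speed must be between 0.25 and 4.0")
    if response_format not in {"mp3", "opus", "aac", "flac", "wav", "pcm"}:
        raise ValueError(f"unsupported format: {response_format}")
    return {
        "model": "tts-1-hd",       # higher-quality model
        "voice": voice,            # e.g. nova instead of the default alloy
        "input": text,
        "response_format": response_format,
        "speed": speed,
    }


payload = build_speech_payload(
    "Hello!", voice="nova", response_format="flac", speed=1.25
)
```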
Transcriptions (Speech-to-Text)
Transcribes audio into the input language.

Parameters
- file: The audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
- model: ID of the model to use. Options include gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.
- language: The language of the input audio in ISO-639-1 format (e.g., en, es, fr). Supplying the input language improves accuracy and latency.
- prompt: Optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.
- response_format: The format of the output: json, text, srt, verbose_json, vtt, or diarized_json.
- temperature: The sampling temperature, between 0 and 1. Higher values make output more random.
- timestamp_granularities: The timestamp granularities to populate for this transcription. Options: word, segment.
- chunking_strategy: Controls how the audio is cut into chunks. Can be auto or a VAD configuration object.
- An optional list of speaker names that correspond to audio samples, used for speaker identification.
Response

Returns a Transcription, TranscriptionDiarized, or TranscriptionVerbose object depending on response_format.

- text: The transcribed text.
- language: The detected language (verbose_json only).
- duration: The duration of the audio in seconds (verbose_json only).
- words: Word-level timestamps (when timestamp_granularities includes word).
- segments: Segment-level timestamps (when timestamp_granularities includes segment).

Examples
Basic transcription
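A sketch of assembling a basic transcription request, assuming the endpoint POST /v1/audio/transcriptions takes multipart form data (the audio file goes in its own multipart part); `build_transcription_fields` is an illustrative helper, not an SDK call.

```python
SUPPORTED_FORMATS = {
    "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm",
}


def build_transcription_fields(path, model="whisper-1"):
    """Assemble the non-file multipart form fields for a transcription
    request, rejecting extensions outside the documented format list."""
    ext = path.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext}")
    return {"model": model}


fields = build_transcription_fields("meeting.mp3")
```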
With language specification
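Supplying the language hint can be sketched as one extra form field; the two-letter check below reflects the ISO-639-1 requirement stated above, and `with_language` is an illustrative name.

```python
def with_language(fields, language):
    """Attach an ISO-639-1 language hint; the documentation notes this
    improves both accuracy and latency."""
    if len(language) != 2 or not language.isalpha():
        raise ValueError("language must be a two-letter ISO-639-1 code, e.g. 'es'")
    return {**fields, "language": language.lower()}


fields = with_language({"model": "whisper-1"}, "es")
```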
Streaming transcription
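When streaming is requested in sse format, the client reassembles the transcript from server-sent events. The sketch below parses `data:` lines and assumes each JSON event may carry a `delta` field with an incremental piece of text; the exact event schema is an assumption, and the sample lines are fabricated for illustration.

```python
import json


def collect_stream_text(sse_lines):
    """Reassemble transcript text from SSE 'data:' lines, assuming each
    JSON event may carry a 'delta' field with an incremental text piece."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments, event names, and blank keep-alives
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        event = json.loads(body)
        if "delta" in event:
            parts.append(event["delta"])
    return "".join(parts)


# Hypothetical event stream, for illustration only.
sample = [
    'data: {"type": "transcript.text.delta", "delta": "Hello"}',
    'data: {"type": "transcript.text.delta", "delta": " world"}',
    "data: [DONE]",
]
text = collect_stream_text(sample)
```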
With timestamps
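Requesting timestamps means sending response_format verbose_json with timestamp_granularities, then reading the words or segments arrays of the response. The sketch below extracts word spans from a hand-written sample response shaped like the fields documented above; `word_spans` is an illustrative helper.

```python
def word_spans(verbose):
    """Pull (word, start, end) tuples out of a verbose_json response,
    assuming a 'words' array of {word, start, end} objects as documented."""
    return [(w["word"], w["start"], w["end"]) for w in verbose.get("words", [])]


# Hypothetical verbose_json response, trimmed for illustration.
sample = {
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
    ],
}
spans = word_spans(sample)
```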
Translations (Audio to English)
Translates audio into English.

Parameters
- file: The audio file to translate. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
- model: ID of the model to use. Only whisper-1 is currently supported.
- prompt: Optional text to guide the model’s style or continue a previous audio segment.
- response_format: The format of the output: json, text, srt, verbose_json, or vtt.
- temperature: The sampling temperature, between 0 and 1.

Response

Returns a Translation or TranslationVerbose object.
- text: The translated text in English.
Examples
Basic translation
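A sketch of the form fields for a translation request, assuming the endpoint POST /v1/audio/translations takes multipart form data; `build_translation_fields` is an illustrative helper, and the range check mirrors the temperature bounds above.

```python
def build_translation_fields(path, temperature=0.0):
    """Assemble the non-file form fields for a translation request;
    the output is always English regardless of the input language."""
    if not (0.0 <= temperature <= 1.0):
        raise ValueError("temperature must be between 0 and 1")
    return {"model": "whisper-1", "temperature": str(temperature)}


fields = build_translation_fields("interview_es.wav")
```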
With prompt guidance
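Prompt guidance is one more form field; the glossary string below is a made-up example of steering the model toward domain vocabulary, and `with_prompt` is an illustrative name.

```python
def with_prompt(fields, prompt):
    """Attach guiding text, e.g. domain vocabulary or a prior segment."""
    return {**fields, "prompt": prompt}


fields = with_prompt(
    {"model": "whisper-1"},
    "Glossary: Kubernetes, etcd, kubelet.",  # hypothetical guidance text
)
```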
Audio formats
Supported input formats for transcription and translation:
- FLAC
- MP3
- MP4
- MPEG
- MPGA
- M4A
- OGG
- WAV
- WEBM
Supported output formats for speech:
- MP3 (default)
- Opus
- AAC
- FLAC
- WAV
- PCM
Best practices
- Choose the right voice: Preview all available voices to find the best match for your use case
- Specify language: For transcription, providing the language code improves accuracy and speed
- Use prompts: Guide the model with prompts for better transcription of technical terms or specific styles
- Streaming: Use streaming methods for real-time applications
- File size: Keep audio files under 25 MB for best performance
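The 25 MB guideline in the last bullet can be checked before uploading; `fits_upload_limit` is an illustrative helper that takes a byte count (pass `os.path.getsize(path)` for a file on disk).

```python
MAX_BYTES = 25 * 1024 * 1024  # 25 MB guideline noted above


def fits_upload_limit(num_bytes):
    """True when an audio payload is within the 25 MB size guideline."""
    return num_bytes <= MAX_BYTES


ok = fits_upload_limit(3 * 1024 * 1024)        # 3 MB clip
too_big = fits_upload_limit(40 * 1024 * 1024)  # 40 MB clip
```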