The Audio API provides three main capabilities: generating speech from text, transcribing audio to text, and translating audio to English.

Speech (Text-to-Speech)

Generates audio from the input text.
client.audio.speech.create(params)
input
String
required
The text to generate audio for. Maximum length is 4096 characters.
model
String
required
One of the available TTS models: tts-1 or tts-1-hd. The tts-1-hd model provides higher quality audio.
voice
String
required
The voice to use when generating the audio. Supported voices are: alloy, echo, fable, onyx, nova, and shimmer.
response_format
String
default:"mp3"
The format to return the audio in. Supported formats: mp3, opus, aac, flac, wav, and pcm.
speed
Float
default:"1.0"
The speed of the generated audio. Select a value from 0.25 to 4.0.
instructions
String
Control the voice of your generated audio with additional instructions. Does not work with tts-1 or tts-1-hd.
stream_format
String
The format to stream the audio in. Supported formats are sse and audio.

Response

Returns a StringIO object containing the audio file content.

Examples

Generate speech

require "openai"

client = OpenAI::Client.new

audio = client.audio.speech.create(
  model: "tts-1",
  voice: "alloy",
  input: "Hello! Welcome to the OpenAI Ruby SDK."
)

File.binwrite("output.mp3", audio.read)

Different voice and format

audio = client.audio.speech.create(
  model: "tts-1-hd",
  voice: "nova",
  input: "This is a high quality audio sample.",
  response_format: "wav",
  speed: 1.2
)

File.binwrite("output.wav", audio.read)
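
With voice instructions

Because instructions is ignored by tts-1 and tts-1-hd, a small parameter builder can keep requests portable across models. This is a minimal sketch: the model name gpt-4o-mini-tts and the idea of dropping the parameter client-side are assumptions, not SDK behavior — check the current model list for which models honor instructions.

```ruby
# Models assumed to honor the `instructions` parameter (tts-1 / tts-1-hd do not).
# "gpt-4o-mini-tts" is an assumption here; consult the current model list.
INSTRUCTION_TTS_MODELS = ["gpt-4o-mini-tts"].freeze

# Build speech params, silently dropping `instructions` for models that ignore it.
def speech_params(model:, voice:, input:, instructions: nil)
  params = { model: model, voice: voice, input: input }
  if instructions && INSTRUCTION_TTS_MODELS.include?(model)
    params[:instructions] = instructions
  end
  params
end
```

The result can be splatted straight into the call: client.audio.speech.create(**speech_params(model: "gpt-4o-mini-tts", voice: "alloy", input: "Hello!", instructions: "Speak in a warm, friendly tone.")).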

Transcriptions (Speech-to-Text)

Transcribes audio into the input language.
client.audio.transcriptions.create(params)
file
File
required
The audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model
String
required
ID of the model to use. Options include gpt-4o-transcribe, gpt-4o-mini-transcribe, whisper-1.
language
String
The language of the input audio in ISO-639-1 format (e.g., en, es, fr). Supplying the input language improves accuracy and latency.
prompt
String
An optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.
response_format
String
default:"json"
The format of the output: json, text, srt, verbose_json, vtt, or diarized_json.
temperature
Float
default:"0"
The sampling temperature, between 0 and 1. Higher values make output more random.
timestamp_granularities
Array<String>
The timestamp granularities to populate for this transcription. Options: word, segment.
chunking_strategy
String | Object
Controls how the audio is cut into chunks. Can be auto or a VAD configuration object.
known_speaker_names
Array<String>
Optional list of speaker names that correspond to audio samples for speaker identification.

Response

Returns a Transcription, TranscriptionDiarized, or TranscriptionVerbose object depending on response_format.
text
String
The transcribed text.
language
String
The detected language (verbose_json only).
duration
Float
The duration of the audio in seconds (verbose_json only).
words
Array
Word-level timestamps (when timestamp_granularities includes word).
segments
Array
Segment-level timestamps (when timestamp_granularities includes segment).

Examples

Basic transcription

require "openai"

client = OpenAI::Client.new

transcription = client.audio.transcriptions.create(
  file: File.open("audio.mp3", "rb"),
  model: "whisper-1"
)

puts transcription.text

With language specification

transcription = client.audio.transcriptions.create(
  file: File.open("spanish_audio.mp3", "rb"),
  model: "whisper-1",
  language: "es",
  response_format: "verbose_json"
)

puts "Detected language: #{transcription.language}"
puts "Duration: #{transcription.duration} seconds"
puts "Text: #{transcription.text}"

Streaming transcription

stream = client.audio.transcriptions.create_streaming(
  file: File.open("audio.mp3", "rb"),
  model: "whisper-1"
)

stream.each do |event|
  case event
  when OpenAI::Models::Audio::TranscriptionTextDeltaEvent
    print event.delta
  when OpenAI::Models::Audio::TranscriptionTextDoneEvent
    puts "\nTranscription complete!"
  end
end

With timestamps

transcription = client.audio.transcriptions.create(
  file: File.open("audio.mp3", "rb"),
  model: "whisper-1",
  response_format: "verbose_json",
  timestamp_granularities: [:word, :segment]
)

transcription.words.each do |word|
  puts "#{word[:word]} (#{word[:start]}s - #{word[:end]}s)"
end
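
Formatting timestamps

The start and end values in words and segments are plain floats of seconds, so converting them into SRT-style timestamps (for example, when post-processing verbose_json output into subtitles yourself rather than requesting response_format: "srt") is a small exercise. A minimal, SDK-independent sketch:

```ruby
# Convert a float number of seconds (as returned in verbose_json timestamps)
# into an SRT-style timestamp: hours:minutes:seconds,milliseconds.
def srt_timestamp(seconds)
  millis = (seconds * 1000).round
  h, rem = millis.divmod(3_600_000) # whole hours, remainder in ms
  m, rem = rem.divmod(60_000)       # whole minutes
  s, ms  = rem.divmod(1000)         # whole seconds and leftover ms
  format("%02d:%02d:%02d,%03d", h, m, s, ms)
end
```

For example, srt_timestamp(3.5) returns "00:00:03,500" and srt_timestamp(3661.25) returns "01:01:01,250".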

Translations (Audio to English)

Translates audio into English.
client.audio.translations.create(params)
file
File
required
The audio file to translate. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model
String
required
ID of the model to use. Only whisper-1 is currently supported.
prompt
String
An optional text to guide the model’s style or continue a previous audio segment.
response_format
String
default:"json"
The format of the output: json, text, srt, verbose_json, or vtt.
temperature
Float
default:"0"
The sampling temperature, between 0 and 1.

Response

Returns a Translation or TranslationVerbose object.
text
String
The translated text in English.

Examples

Basic translation

require "openai"

client = OpenAI::Client.new

translation = client.audio.translations.create(
  file: File.open("german_audio.mp3", "rb"),
  model: "whisper-1"
)

puts translation.text

With prompt guidance

translation = client.audio.translations.create(
  file: File.open("french_audio.mp3", "rb"),
  model: "whisper-1",
  prompt: "This is a technical discussion about machine learning.",
  response_format: "verbose_json"
)

puts "English translation: #{translation.text}"

Audio formats

Supported input formats for transcription and translation:
  • FLAC
  • MP3
  • MP4
  • MPEG
  • MPGA
  • M4A
  • OGG
  • WAV
  • WEBM
Supported output formats for speech generation:
  • MP3 (default)
  • Opus
  • AAC
  • FLAC
  • WAV
  • PCM

Best practices

  • Choose the right voice: Preview all available voices to find the best match for your use case
  • Specify language: For transcription, providing the language code improves accuracy and speed
  • Use prompts: Guide the model with prompts for better transcription of technical terms or specific styles
  • Streaming: Use streaming methods for real-time applications
  • File size: Keep audio files under 25 MB for best performance
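
The 25 MB ceiling can be enforced client-side before uploading, which fails faster than waiting for an API error. A minimal sketch (the constant simply mirrors the figure above):

```ruby
# Reject oversized uploads locally instead of waiting for an API error.
MAX_AUDIO_BYTES = 25 * 1024 * 1024 # 25 MB upload limit

def uploadable_audio?(path)
  File.size(path) <= MAX_AUDIO_BYTES
end
```

Call it just before opening the file: raise ArgumentError, "audio file exceeds 25 MB" unless uploadable_audio?("audio.mp3").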
