Prerequisites
Start h2oGPT with audio support enabled and the speech models pre-loaded.

Speech to text
POST /v1/audio/transcriptions
Transcribes an audio file to text using Whisper. The request is a multipart/form-data upload.
Request parameters

- file (required): The audio file to transcribe. Accepted formats include WAV, MP3, and other formats supported by the underlying Whisper installation.
- model: Pass "whisper-1" for compatibility. The server uses its loaded Whisper model regardless of this value.
- response_format: Output format. Use "text" to receive a plain string.
- stream: When true, partial transcription results are returned as server-sent events. The OpenAI Python client does not expose streaming for transcriptions natively; use httpx directly to receive a streamed response.
- chunk: Controls how audio is segmented for streaming. Options: "silence" or "interval". Has no effect when stream=false.

Response
Examples
Text to speech
POST /v1/audio/speech
Converts input text to audio using Coqui TTS or Microsoft TTS.
Request parameters

- model: Pass "tts-1" for compatibility. The server uses its loaded TTS model.
- input: The text to synthesize.
- voice: If set, overrides both chatbot_role and speaker. For native OpenAI voices, h2oGPT translates them into defaults. Leave empty to rely on chatbot_role and speaker.
- response_format: Audio format of the response. Options: "wav", "mp3", "opus", "aac", "flac", "pcm".
- stream: When true, audio is returned as a stream, one chunk (sentence) at a time. When false, the entire file is generated before returning.
- stream_strip: When true and stream=true, WAV headers are stripped from all chunks after the first so the stream is a contiguous audio byte sequence. When false, each chunk is a valid standalone WAV file.
- chatbot_role: TTS role for Coqui TTS.
- speaker: Speaker for Microsoft TTS.
Response
Binary audio data in the requested format. The Content-Type header is set to audio/<response_format>.
For streaming WAV, the server artificially inflates the header's reported duration so players can stream through to the end of the audio.