STT and TTS models are preloaded at startup when enabled. If you do not need voice features, pass
`--enable_tts=False --enable_stt=False` to avoid consuming GPU memory.

## Speech-to-text (STT) with Whisper

h2oGPT uses OpenAI Whisper and its distilled variants for audio transcription. STT is used for:

- Real-time voice input in the chat UI.
- Automatic transcription of uploaded audio and video files before indexing.
- ASR (Automatic Speech Recognition) for YouTube and other media ingested during document Q&A.
### Enabling STT

STT is enabled by default (`--enable_stt=True`).
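A minimal launch sketch, assuming the repository's standard `generate.py` entry point; the base model and ASR model choices here are illustrative:

```shell
# Launch h2oGPT with speech-to-text enabled, using a distilled Whisper model
python generate.py \
  --base_model=h2oai/h2ogpt-4096-llama2-13b-chat \
  --enable_stt=True \
  --asr_model=distil-whisper/distil-large-v3
```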
### Whisper model options
| Model | Notes |
|---|---|
| `openai/whisper-large-v3` | Highest accuracy. Recommended for production. |
| `distil-whisper/distil-large-v3` | ~10× faster than `large-v3` with similar accuracy. Good default. |
| `openai/whisper-base` | Smallest, fastest. Suitable for low-resource environments. |
### Faster inference with `faster_whisper`
If the `faster_whisper` package is installed, h2oGPT automatically uses it for `large-v2` and `large-v3` models. This gives roughly a 4× speedup and 2× lower memory usage compared to the standard implementation.
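To enable this path, install the package first (`pip` treats `faster_whisper` and `faster-whisper` as the same project name):

```shell
pip install faster_whisper
```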
### OpenAI-compatible STT API

h2oGPT exposes a `/v1/audio/transcriptions` endpoint compatible with the OpenAI Audio API:
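For example, a transcription request in the OpenAI Audio API's multipart format; the host, port, API key variable, and sample file name are assumptions to adapt to your deployment:

```shell
# Transcribe a local audio file via the OpenAI-compatible endpoint
curl http://localhost:5000/v1/audio/transcriptions \
  -H "Authorization: Bearer $H2OGPT_API_KEY" \
  -F file=@sample.wav \
  -F model=openai/whisper-large-v3
```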
## Text-to-speech (TTS)

h2oGPT supports two TTS backends. Choose based on your licensing requirements and quality needs:

- Microsoft SpeechT5 (MIT)
- Coqui XTTS (MPL2)
Microsoft SpeechT5 is the default TTS engine. It is MIT licensed, runs entirely on-device, and supports multiple voices. Available speaker styles include SLT (female) and others from the CMU ARCTIC speaker set. Use the Speech Style dropdown in the left sidebar to switch voices at runtime.

SpeechT5 also supports streaming audio, so responses begin playing before generation is complete.

### Enabling TTS with a default voice
By default, no voice is selected and TTS is silent until you choose a style in the UI. To start speaking immediately, pass a default speaker at launch.
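For example, to launch with SpeechT5 and the SLT voice preselected (a sketch using the flags from the CLI reference; the base model is illustrative):

```shell
python generate.py \
  --base_model=h2oai/h2ogpt-4096-llama2-13b-chat \
  --enable_tts=True \
  --tts_model=microsoft/speecht5_tts \
  --speaker="SLT (female)"
```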
### Full multi-GPU voice setup

For a 4-GPU system with high-quality STT and TTS alongside document Q&A, pin each audio model to its own GPU with `--asr_gpu_id` and `--tts_gpu_id`.
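A sketch of such a launch, assuming the LLM occupies GPUs 0–1 and GPUs 2–3 are free for audio; GPU indices and the base model are illustrative:

```shell
python generate.py \
  --base_model=h2oai/h2ogpt-4096-llama2-13b-chat \
  --enable_stt=True --asr_model=openai/whisper-large-v3 --asr_gpu_id=2 \
  --enable_tts=True --tts_model=microsoft/speecht5_tts --tts_gpu_id=3 \
  --pre_load_image_audio_models=True
```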
### OpenAI-compatible TTS API

h2oGPT exposes a `/v1/audio/speech` endpoint and a Gradio API (`/speak_text_api`).
curl:
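A request in the OpenAI speech API's JSON format; the host, port, API key variable, and voice name are assumptions to adapt to your deployment:

```shell
# Request spoken audio for a text string and save it to a file
curl http://localhost:5000/v1/audio/speech \
  -H "Authorization: Bearer $H2OGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/speecht5_tts", "input": "Hello from h2oGPT", "voice": "SLT (female)"}' \
  --output speech.wav
```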
## Voice cloning (Coqui XTTS only)
Voice cloning lets you add a custom speaker based on a recorded sample and use it across sessions.

1. **Open the Expert tab.** Go to the Expert tab in the UI and scroll to the Speech Control and Voice Cloning section.
2. **Provide a voice sample.** Either upload a clean audio file (WAV, M4A, etc.) using File for Clone, or record up to 30 seconds of audio using Mic for Clone. Check Use Mic for Cloning if you recorded from the microphone. Files are automatically trimmed to 30 seconds; use a clean sample containing only the target voice for the best results.
3. **Name the speaker.** Enter a name in the Speaker Style field. This name will appear in the Speech Style dropdown in the left sidebar.
## AI Assistant Voice Control mode
Voice Control mode enables fully hands-free interaction with h2oGPT. Say an action word to start recording, ask your question, and the model responds in speech; no keyboard required.

- Action phrase (`--tts_action_phrases`): saying this phrase activates the microphone and begins recording your query. Use a distinctive phrase like `Nimbus Clouds` to reduce false activations.
- Stop phrase (`--tts_stop_phrases`): saying this phrase cancels active recording or TTS playback.
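A launch sketch enabling hands-free mode; the phrases and base model are examples, and the list-in-string flag syntax is assumed from the CLI's list-valued options:

```shell
python generate.py \
  --base_model=h2oai/h2ogpt-4096-llama2-13b-chat \
  --enable_stt=True --enable_tts=True \
  --tts_action_phrases="['Nimbus Clouds']" \
  --tts_stop_phrases="['Star Gaze']"
```

Pick phrases unlikely to occur in normal conversation so the microphone is not triggered accidentally.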
## Stopping speech playback
- Click Stop in the top-right corner to halt both text generation and speech output.
- Click Stop/Clear Speak to stop speech while leaving the generated text visible. This applies when you triggered speech manually with Speak Instruction or Speak Response.
## CLI flags reference
| Flag | Description |
|---|---|
| `--enable_stt` | Enable speech-to-text. Default: `True`. Set `False` to save GPU memory. |
| `--enable_tts` | Enable text-to-speech. Default: `True`. Set `False` to save GPU memory. |
| `--tts_model` | TTS model to load. Default: `microsoft/speecht5_tts`. |
| `--asr_model` | Whisper model for STT/ASR. Default: `openai/whisper-large-v3`. |
| `--stt_model` | Whisper model for real-time speech input. Defaults to `--asr_model`. |
| `--speaker` | Default speaker for Microsoft SpeechT5 (e.g. `SLT (female)`). |
| `--chatbot_role` | Default chatbot role for Coqui TTS (e.g. `Female AI Assistant`). |
| `--asr_gpu_id` | GPU index for the ASR model. |
| `--tts_gpu_id` | GPU index for the TTS model. |
| `--pre_load_image_audio_models` | Preload all audio/image models at startup for faster first use. |
| `--tts_action_phrases` | List of action words to activate Voice Control mode. |
| `--tts_stop_phrases` | List of stop words to deactivate Voice Control mode. |