Text-to-Speech (TTS)
Voice Selection
The OpenAI TTS voice to use for spoken responses.

Available voices: alloy, ash, ballad, cedar (default), coral, echo, fable, marin, nova, onyx, sage, shimmer, verse
Playback Speed
Playback speed multiplier for TTS audio.

Valid range: 0.25 to 4.0
- 0.25 = 4× slower (very slow)
- 1.0 = normal speed (default)
- 2.0 = 2× faster
- 4.0 = 4× faster (maximum)
Values outside the 0.25-4.0 range may produce distorted or unintelligible audio.
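Because out-of-range values can produce distorted audio, it is worth clamping the speed before it reaches the API. A minimal sketch (the `clamp_tts_speed` helper is hypothetical, not part of Klaus):

```python
def clamp_tts_speed(speed: float) -> float:
    """Clamp a requested TTS playback speed to the supported 0.25-4.0 range."""
    return max(0.25, min(4.0, speed))
```

For example, a requested speed of 6.0 would be clamped to 4.0, and 0.1 raised to 0.25.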
TTS Model
Klaus uses OpenAI’s gpt-4o-mini-tts model, which is hardcoded and cannot be changed via configuration.

Cost: Approximately $0.015 per minute of generated audio.

Voice Instructions
Klaus sends the following instructions to the TTS API to optimize voice output:

“Speak at a natural conversational pace, not slow or deliberate. You are a sharp colleague giving a quick answer across a desk. Be direct and matter-of-fact, not performative. No vocal fry, no uptalk.”

These instructions are hardcoded and ensure consistent, professional voice output across all voices.
Speech-to-Text (STT)
Klaus uses Moonshine Voice, a local on-device STT model that runs entirely on your machine with no API calls or costs.

Model Size
Moonshine model size. Larger models are more accurate but slower.

Available models:

| Model | Size | Latency | Accuracy |
|---|---|---|---|
| tiny | Small | ~100ms | Good |
| small | Medium | ~200ms | Better |
| medium | Large | ~300ms | Best (default) |
Language
Language code for Moonshine transcription.

Default: en (English)

See the Moonshine documentation for a full list of supported language codes.
Input Mode
Default input mode on startup.

Options:
- voice_activation: Klaus automatically detects when you’re speaking (default)
- push_to_talk: Hold the push-to-talk key to record (F3 on Windows, § on macOS)

Example Configuration
config.toml
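The config file contents did not survive extraction here; below is a hedged sketch of what a config.toml covering the options on this page might look like. Apart from tts_speed, which this page names explicitly, the key and section names are assumptions, so check Klaus’s own reference for the real ones:

```toml
# Hypothetical key names except tts_speed -- illustrative only.
[tts]
voice = "cedar"            # one of the voices listed above
tts_speed = 1.0            # valid range: 0.25 to 4.0

[stt]
model = "medium"           # tiny | small | medium
language = "en"

[input]
mode = "voice_activation"  # or "push_to_talk"
```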
Advanced: TTS Streaming
Klaus uses sentence-level streaming for low-latency responses:

- Claude’s response is split into sentences as it streams
- Each sentence is sent to OpenAI TTS immediately (max 4000 chars per call)
- Audio playback starts on the first chunk
- Remaining chunks play seamlessly as they’re generated
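The splitting step above can be sketched as a small generator. This is an illustration, not Klaus’s actual implementation, and the boundary rule (split on `.`, `!`, `?`) is a simplification of real sentence detection:

```python
def split_sentences(token_stream, max_chars=4000):
    """Yield complete sentences from a stream of text chunks.

    A sentence is flushed as soon as a terminator (. ! ?) arrives, and a
    buffer that outgrows max_chars is flushed early so each TTS request
    stays under the per-call limit.
    """
    buffer = ""
    for chunk in token_stream:
        buffer += chunk
        while True:
            # Find the earliest sentence terminator currently in the buffer.
            cut = min((i for i in (buffer.find(t) for t in ".!?") if i != -1),
                      default=-1)
            if cut == -1:
                break
            sentence, buffer = buffer[:cut + 1], buffer[cut + 1:]
            if sentence.strip():
                yield sentence.strip()
        if len(buffer) >= max_chars:
            # No terminator yet but the buffer is too large: flush anyway.
            yield buffer[:max_chars]
            buffer = buffer[max_chars:]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would then be handed to the TTS API while later tokens are still arriving, which is what lets playback begin on the first chunk.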
Platform Optimizations
- macOS: Uses high-latency mode to prevent CoreAudio crackling
- All platforms: Reuses a single persistent audio output stream across all chunks to avoid device initialization delays
- VAD suspension: The microphone stream is suspended during TTS playback to free the audio device
Latency Breakdown
Typical end-to-end latency from question to first spoken word:

| Stage | Latency |
|---|---|
| VAD detection + silence timeout | 0.5-1.5s |
| Moonshine STT (medium model) | ~300ms |
| Claude vision + reasoning (first chunk) | 1-2s |
| OpenAI TTS (first sentence) | 0.5-1s |
| Total | 2-4 seconds |
Subsequent sentences stream with minimal additional latency, creating a natural conversational flow.
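As a quick sanity check, naively summing the per-stage ranges from the table gives the extremes; real best and worst cases rarely all coincide, which is why typical totals cluster in the narrower 2-4 second band:

```python
# Per-stage latency ranges in seconds, taken from the table above.
stages = {
    "VAD detection + silence timeout": (0.5, 1.5),
    "Moonshine STT (medium model)": (0.3, 0.3),
    "Claude vision + reasoning (first chunk)": (1.0, 2.0),
    "OpenAI TTS (first sentence)": (0.5, 1.0),
}

lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(f"naive total: {lo:.1f}-{hi:.1f}s")  # theoretical extremes only
```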
Model Downloads
The Moonshine model is downloaded automatically on first use:

- tiny: ~80 MB
- small: ~160 MB
- medium: ~245 MB
Troubleshooting
TTS Voice Not Working
- Check API key: Verify OPENAI_API_KEY is set correctly
- Check voice name: Ensure the voice name matches one of the available options (case-sensitive)
- Check logs: Look for OpenAI API errors in Klaus’s console output
STT Transcription Inaccurate
- Try a larger model: Switch from tiny to medium for better accuracy
- Check microphone: Ensure your mic is selected correctly in Settings
- Reduce background noise: Moonshine works best in quiet environments
- Adjust VAD sensitivity: See Advanced Settings
TTS Playback Too Fast/Slow
- Adjust tts_speed in config.toml
- Valid range is 0.25 to 4.0
- Recommended range: 0.8 to 1.5 for natural-sounding speech