Overview
This page documents the complete data flow lifecycle of a single question in Klaus, from microphone input through speech recognition, vision processing, LLM reasoning, and text-to-speech output.

High-Level Sequence

Audio capture → Speech-to-text → Query routing → Camera capture → Claude vision + tool use → TTS streaming → Persistence
Stage 1: Audio Capture
Push-to-Talk Mode
Implementation: `klaus/audio.py:33` (`PushToTalkRecorder`)
- User presses the hotkey (default F2)
- `start_recording()` opens an `sd.InputStream` with a callback
- Audio chunks accumulate in memory while the key is held
- User releases the key
- `stop_recording()` concatenates the chunks and encodes them as WAV bytes
- The WAV buffer is returned to the main application
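Assuming sounddevice (`sd`) and numpy, the flow above can be sketched roughly as follows; `PushToTalkSketch` and its internals are illustrative stand-ins, not Klaus's actual `PushToTalkRecorder`:

```python
import io
import wave

import numpy as np

try:
    import sounddevice as sd
except ImportError:  # allow running the WAV-encoding path without audio hardware
    sd = None


class PushToTalkSketch:
    """Illustrative sketch of the push-to-talk flow (hypothetical class)."""

    def __init__(self, samplerate=16_000):
        self.samplerate = samplerate
        self._chunks = []
        self._stream = None

    def _callback(self, indata, frames, time_info, status):
        # Accumulate raw int16 chunks while the hotkey is held.
        self._chunks.append(indata.copy())

    def start_recording(self):
        self._chunks = []
        self._stream = sd.InputStream(
            samplerate=self.samplerate, channels=1,
            dtype="int16", callback=self._callback,
        )
        self._stream.start()

    def stop_recording(self) -> bytes:
        if self._stream is not None:
            self._stream.stop()
            self._stream.close()
        audio = np.concatenate(self._chunks) if self._chunks else np.zeros((0, 1), np.int16)
        # Encode the concatenated chunks as a mono 16-bit WAV byte buffer.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(self.samplerate)
            wf.writeframes(audio.tobytes())
        return buf.getvalue()
```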
Voice-Activated Mode
Implementation: `klaus/audio.py:102` (`VoiceActivatedRecorder`)
- Continuous mic stream with 30ms frames (480 samples @ 16 kHz)
- WebRTC VAD classifies each frame as speech or silence
- A pre-buffer holds the last 300ms of audio
- Speech detection:
  - The first voiced frame triggers `on_speech_start()`
  - Frames accumulate while the user is speaking
  - After `silence_timeout` (default 1.5s) of silence, `_finalize()` fires
- Quality gates before accepting an utterance:
  - Minimum duration (0.5s)
  - Minimum voiced ratio (0.28)
  - Minimum voiced frames (8)
  - Minimum RMS loudness (-45 dBFS)
  - Minimum contiguous voiced run (6 frames)
- If all gates pass, `on_speech_end(wav_bytes)` fires
- Otherwise, the utterance is discarded and `on_speech_discard(reason)` fires
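The quality gates reduce to a simple predicate. In this sketch, `Utterance` and `passes_quality_gates` are hypothetical names; the thresholds are the ones listed above:

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    duration_s: float        # total length of the captured audio
    voiced_ratio: float      # fraction of frames the VAD marked as speech
    voiced_frames: int       # absolute count of voiced frames
    rms_dbfs: float          # overall loudness
    max_voiced_run: int      # longest contiguous run of voiced frames


def passes_quality_gates(u: Utterance) -> tuple:
    """Apply the thresholds listed above; returns (accepted, reason)."""
    if u.duration_s < 0.5:
        return False, "too_short"
    if u.voiced_ratio < 0.28:
        return False, "low_voiced_ratio"
    if u.voiced_frames < 8:
        return False, "too_few_voiced_frames"
    if u.rms_dbfs < -45.0:
        return False, "too_quiet"
    if u.max_voiced_run < 6:
        return False, "no_contiguous_speech"
    return True, "ok"
```

On rejection, the reason string is what a callback like `on_speech_discard(reason)` would receive.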
Stream Suspension for TTS
Before TTS playback, the VAD mic stream is suspended to free the CoreAudio device; once playback finishes, `resume_stream()` reopens it. This prevents device contention on macOS.
Stage 2: Speech-to-Text
Implementation: `klaus/stt.py:103`
Moonshine Voice runs entirely on-device:
- WAV bytes arrive from audio recorder
- Moonshine model loaded from `~/.cache/moonshine/` (downloaded on first use)
- Model processes the audio in ~300ms
- Returns the transcript as plain text

Configuration (`config.toml`):
- `stt_moonshine_model`: `tiny`, `small`, or `medium` (default `medium`, 245M params)
- `stt_moonshine_language`: language code (default `en`)
Stage 3: Query Routing
Implementation: `klaus/query_router.py:118` (`QueryRouter.route()`)
Before sending the full context to Claude, Klaus classifies the question to optimize token usage and response style.
Local Heuristics (Fast Path)
Timing: ~1-2ms

- Pattern matching against regex signals:
  - `definition`: "define", "what is", "meaning"
  - `doc_ref`: "page", "paper", "section", "figure"
  - `deictic`: "this", "that", "here"
  - `spatial`: "left", "right", "top", "bottom"
  - `concision`: "concisely", "briefly", "quickly"
  - `general`: "summarize", "what is happening"
- Weighted scoring for three route modes:
  - Standalone definition: high definition score, low page score
  - Page-grounded definition: high page + definition score
  - General contextual: high general + page score
- If confidence ≥ threshold (default 0.75) and margin ≥ threshold (default 0.20), the local decision is used
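A rough sketch of the fast path. The signal patterns come from the list above, but the weights here are invented for illustration; the real scoring lives in `QueryRouter`:

```python
import re

# Signal patterns mirroring the categories above (simplified).
SIGNALS = {
    "definition": r"\b(define|what is|meaning)\b",
    "doc_ref": r"\b(page|paper|section|figure)\b",
    "deictic": r"\b(this|that|here)\b",
    "spatial": r"\b(left|right|top|bottom)\b",
    "concision": r"\b(concisely|briefly|quickly)\b",
    "general": r"\b(summarize|what is happening)\b",
}


def score_routes(question: str) -> dict:
    q = question.lower()
    hits = {name: bool(re.search(pat, q)) for name, pat in SIGNALS.items()}
    page = hits["doc_ref"] or hits["deictic"] or hits["spatial"]
    # Illustrative weights only; Klaus's actual weights differ.
    return {
        "standalone_definition": 0.9 if hits["definition"] and not page else 0.1,
        "page_grounded_definition": 0.9 if hits["definition"] and page else 0.1,
        "general_contextual": 0.9 if hits["general"] else (0.6 if page else 0.1),
    }


def decide(question, conf_threshold=0.75, margin_threshold=0.20):
    scores = score_routes(question)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_s), (_, second_s) = ranked[0], ranked[1]
    if best_s >= conf_threshold and best_s - second_s >= margin_threshold:
        return best          # confident local decision
    return None              # defer to the LLM router
```

Returning `None` models the "local confidence is low" case that triggers the LLM fallback described next.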
LLM Router (Fallback)
Timing: ~150-350ms (strict timeout)

If local confidence is low:
- A short prompt is sent to `claude-haiku-4-5`
- Timeout: `router_timeout_ms` (default 350ms)
- Max tokens: 80
- Returns JSON: `{mode, confidence, reason}`
- If LLM confidence ≥ threshold (default 0.70), the LLM decision is used
- Otherwise, the router falls back to `standalone_definition`
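The fallback logic might look like the following sketch, where `call_model` is an injected stand-in for the `claude-haiku-4-5` request so the timeout and fallback behavior are testable offline:

```python
import concurrent.futures
import json


def llm_route(question, call_model, timeout_ms=350, conf_threshold=0.70):
    """Classify the question with a small model, under a strict deadline.

    `call_model` is a callable (question -> JSON string) standing in for
    the real API request; this is a sketch, not Klaus's implementation.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_model, question)
    try:
        decision = json.loads(future.result(timeout=timeout_ms / 1000))
    except (concurrent.futures.TimeoutError, json.JSONDecodeError):
        decision = None
    finally:
        pool.shutdown(wait=False)  # don't block on a straggling request
    if not decision or decision.get("confidence", 0.0) < conf_threshold:
        return "standalone_definition"  # safe fallback
    return decision["mode"]
```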
Route Policies
| Route | Image | History | Memory | Notes | Max Sentences | History Window |
|---|---|---|---|---|---|---|
| Standalone definition | ❌ | ❌ | ❌ | ❌ | 2 | 0 |
| Page-grounded definition | ✅ | ✅ | ❌ | ❌ | 2 | 2 turns |
| General contextual | ✅ | ✅ | ✅ | ✅ | None | Full |
See `klaus/query_router.py:35` for the policy definitions.
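The policy table could map to a structure like this sketch; the field names follow the `route.*` attributes used elsewhere on this page, but the class itself is illustrative (the real definitions are at `klaus/query_router.py:35`):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePolicy:
    """Illustrative mirror of the policy table above."""
    use_image: bool
    use_history: bool
    use_memory_context: bool
    use_notes_context: bool
    max_sentences: int | None   # None = no cap
    history_turn_window: int    # 0 = full history (when use_history is True)


POLICIES = {
    "standalone_definition": RoutePolicy(False, False, False, False, 2, 0),
    "page_grounded_definition": RoutePolicy(True, True, False, False, 2, 2),
    "general_contextual": RoutePolicy(True, True, True, True, None, 0),
}
```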
Stage 4: Camera Capture
Implementation: `klaus/camera.py:59`
Runs in a daemon thread continuously:
- OpenCV `VideoCapture` polls the camera in a background loop
- Frames are stored in `self._frame` behind a thread-safe lock
- Auto-rotation: portrait frames (h > w) are rotated 90° CW automatically (configurable)
- When a question arrives, `capture_base64_jpeg()` encodes the most recent frame as JPEG (quality 85) and base64
- The base64 string is sent to Claude as an image content block
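The thread-safe latest-frame pattern can be sketched without OpenCV; `LatestFrame` is a hypothetical stand-in (the real loop reads from `cv2.VideoCapture` and uses `cv2.imencode` for the JPEG step):

```python
from __future__ import annotations

import threading

import numpy as np


class LatestFrame:
    """Thread-safe holder for the most recent camera frame (sketch only)."""

    def __init__(self, auto_rotate=True):
        self._lock = threading.Lock()
        self._frame = None
        self.auto_rotate = auto_rotate

    def store(self, frame: np.ndarray) -> None:
        h, w = frame.shape[:2]
        if self.auto_rotate and h > w:
            # Portrait frame: rotate 90° clockwise, as the capture loop does.
            frame = np.rot90(frame, k=-1)
        with self._lock:
            self._frame = frame

    def snapshot(self) -> np.ndarray | None:
        # Copy under the lock so the capture loop can keep writing.
        with self._lock:
            return None if self._frame is None else self._frame.copy()
```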
Stage 5: Claude Vision + Tool Use
Implementation: `klaus/brain.py:80` (`Brain.ask()`)
Context Assembly
Based on the route decision:

- User content:
  - Image block (if `route.use_image`)
  - Text block with the transcript
- System prompt:
  - Base prompt from `config.SYSTEM_PROMPT`
  - User background (if configured)
  - Memory context (if `route.use_memory_context`)
  - Notes context (if `route.use_notes_context`)
  - Turn-specific instruction from the route policy
  - Sentence cap (if `route.max_sentences` is set)
- Message history:
  - Full history (if `route.use_history` and `history_turn_window == 0`)
  - Last N turns (if `history_turn_window > 0`)
  - Images stripped from all but the most recent user message to save tokens
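A sketch of the assembly, with content blocks shaped like the Anthropic Messages API; `build_user_content` and `trim_history` are hypothetical helpers, not Klaus's actual functions:

```python
def build_user_content(transcript, image_b64, policy):
    """Assemble the user message blocks per the route policy."""
    content = []
    if policy.use_image and image_b64:
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64},
        })
    content.append({"type": "text", "text": transcript})
    return content


def trim_history(history, policy):
    """Apply the history window and strip images from older user turns."""
    if not policy.use_history:
        return []
    if policy.history_turn_window > 0:
        # One turn = a user message plus the assistant reply.
        history = history[-2 * policy.history_turn_window:]
    last_user = max(
        (i for i, m in enumerate(history) if m["role"] == "user"), default=None
    )
    trimmed = []
    for i, msg in enumerate(history):
        if msg["role"] == "user" and i != last_user:
            # Keep images only in the most recent user message.
            msg = {**msg, "content": [b for b in msg["content"] if b.get("type") != "image"]}
        trimmed.append(msg)
    return trimmed
```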
Streaming Loop
- Create a streaming message with `client.messages.stream()`
- For each `content_block_delta` event with text:
  - Accumulate into `text_buf`
  - Extract complete sentences via a regex split
  - Fire the `on_sentence(sentence)` callback for each complete sentence
- If `stop_reason == "tool_use"`:
  - Execute the tool (web search, notes, etc.)
  - Append the assistant message + tool results
  - Continue the loop (max 5 rounds)
- Otherwise, emit the final fragment and return
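The sentence-extraction part of the loop can be sketched as a pure function; the split pattern here is illustrative, and the exact regex Klaus uses may differ:

```python
import re

# Split after terminal punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def drain_sentences(text_buf: str) -> tuple:
    """Split completed sentences off the front of the buffer.

    Returns (complete_sentences, remaining_fragment).
    """
    parts = _SENTENCE_END.split(text_buf)
    return parts[:-1], parts[-1]


def stream_sentences(deltas, on_sentence):
    """Feed text deltas through the buffer, firing on_sentence per sentence."""
    buf = ""
    for delta in deltas:
        buf += delta
        done, buf = drain_sentences(buf)
        for s in done:
            on_sentence(s)
    if buf.strip():
        on_sentence(buf.strip())   # emit the final fragment
```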
Available tools (`klaus/brain.py:328`):
- `web_search`: Tavily search via `klaus/search.py:50`
- `set_notes_file`: Change the active Obsidian note
- `save_note`: Append content to the current note
Sentence Cap Enforcement
If the route policy specifies `max_sentences`:
- During streaming: stop emitting after N sentences
- After streaming: hard-truncate assistant text to N sentences via regex
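The hard-truncation step might look like this sketch (illustrative regex, hypothetical function name):

```python
import re


def cap_sentences(text: str, max_sentences: int) -> str:
    """Hard-truncate text to the first N sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])
```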
Stage 6: Text-to-Speech Streaming
Implementation: `klaus/tts.py:92` (`TextToSpeech.speak_streaming()`)
Sentence Queueing
- Main thread creates a `queue.Queue[str | None]`
- `Brain.ask()` fires the `on_sentence(sentence)` callback
- The callback puts each sentence into the queue
- After Claude finishes, a `None` sentinel is added to the queue
Parallel Synthesis + Playback
Two threads:

- Synthesis worker (`klaus/tts.py:163`):
  - Reads sentences from the queue
  - Calls the OpenAI TTS API for each sentence
  - Puts WAV bytes into the audio queue
  - Adds a `None` sentinel when done
- Playback thread (main TTS loop at `klaus/tts.py:134`):
  - Reads WAV bytes from the audio queue
  - Decodes the WAV header (sample rate, channels)
  - Ensures a persistent `sd.OutputStream` is open
  - Writes audio in small blocks (2048 frames) for responsive stop
  - Closes the stream after all chunks have played
See also `klaus/tts.py:52`.
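The producer/consumer handoff above can be sketched with two queues; `synthesize` and `play` are injected stand-ins for the OpenAI TTS call and the sounddevice playback loop, so the sketch runs offline:

```python
import queue
import threading


def run_tts_pipeline(sentences, synthesize, play):
    """Two-stage pipeline: a synthesis worker feeds a playback loop."""
    text_q: queue.Queue = queue.Queue()
    audio_q: queue.Queue = queue.Queue()

    def synth_worker():
        while True:
            sentence = text_q.get()
            if sentence is None:          # sentinel: no more text
                audio_q.put(None)
                return
            audio_q.put(synthesize(sentence))

    t = threading.Thread(target=synth_worker, daemon=True)
    t.start()
    for s in sentences:
        text_q.put(s)
    text_q.put(None)

    while True:                           # playback loop (main TTS thread)
        wav = audio_q.get()
        if wav is None:
            break
        play(wav)
    t.join()
```

Because synthesis and playback overlap, the first sentence can start playing while later sentences are still being synthesized.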
Chunk Splitting
API limit: 4000 chars per TTS call. If a sentence exceeds this, it is batched with subsequent sentences up to the limit.

Stage 7: Persistence
Implementation: `klaus/memory.py:254`
After Claude responds:
After Claude responds:

- An `Exchange` object is created with:
  - `user_text`: the transcript
  - `assistant_text`: Claude's response
  - `image_base64`: the page image (if used)
  - `searches`: list of web searches performed
  - `notes_file_changed`: whether notes were updated
- Saved to SQLite:
  - `sessions` table: session metadata (title, created_at)
  - `exchanges` table: individual Q&A turns
- UI updated:
  - Chat widget appends a message card
  - Session panel updates the current session
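A minimal sketch of the persistence step. The `Exchange` field names come from the list above, but the SQLite schema here is illustrative rather than Klaus's actual schema in `klaus/memory.py`:

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass, field


@dataclass
class Exchange:
    """Sketch of the persisted exchange record."""
    user_text: str
    assistant_text: str
    image_base64: str | None = None
    searches: list = field(default_factory=list)
    notes_file_changed: bool = False


def save_exchange(conn: sqlite3.Connection, session_id: int, ex: Exchange) -> None:
    # Illustrative schema: one row per Q&A turn, keyed to a session.
    conn.execute("""CREATE TABLE IF NOT EXISTS exchanges (
        id INTEGER PRIMARY KEY, session_id INTEGER,
        user_text TEXT, assistant_text TEXT, image_base64 TEXT)""")
    conn.execute(
        "INSERT INTO exchanges (session_id, user_text, assistant_text, image_base64) "
        "VALUES (?, ?, ?, ?)",
        (session_id, ex.user_text, ex.assistant_text, ex.image_base64),
    )
    conn.commit()
```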
Performance Characteristics
| Stage | Latency | Notes |
|---|---|---|
| Audio capture (VAD) | 0-1.5s | Silence timeout configurable |
| Speech-to-text (Moonshine) | ~300ms | Runs on CPU, no API call |
| Query routing (local) | ~1-2ms | Regex pattern matching |
| Query routing (LLM fallback) | ~150-350ms | Strict timeout, used only if local uncertain |
| Camera capture | 50ms | Frame already in memory |
| Claude first token | ~800-1200ms | Vision + reasoning |
| TTS first audio | ~500-800ms | API call per sentence |
| End-to-end (question → first audio) | 2-4s | STT + routing + Claude + TTS |
Threading Model
| Thread | Purpose | Communication |
|---|---|---|
| PyQt6 main thread | UI event loop, signal handling | pyqtSignal |
| Camera thread | OpenCV capture loop | Thread-safe frame lock |
| Question processing thread | STT → routing → Claude → save | Qt signals to UI |
| TTS synthesis worker | OpenAI API calls | queue.Queue |
| TTS playback | Sounddevice output | queue.Queue |
| Audio callback (VAD) | WebRTC VAD frame processing | Callbacks to main app |
All background threads run with `daemon=True`, so they exit cleanly when the main thread terminates.
Next Steps
- Module Responsibilities — Detailed module breakdown
- Query Routing — Deep dive into routing logic