Overview
Klaus uses a hybrid local + LLM query router to classify each question before sending it to Claude. This classification determines what context (image, history, memory, notes) is sent and how the response is styled (sentence caps, turn instructions). Goals:
- Reduce token usage: don't send the page image for standalone definitions
- Optimize latency: use fast local heuristics for 80%+ of questions
- Improve quality: enforce concise definitions when appropriate
Implementation: klaus/query_router.py:118 (QueryRouter.route())
Route Modes
Three routing categories:
1. Standalone Definition
When: User asks for a concept definition without page context
Examples:
- “Define macroeconomics”
- “What is quantum entanglement?”
- “Explain photosynthesis very concisely”
- ❌ No image
- ❌ No history
- ❌ No memory context
- ❌ No notes context
- ✅ Max 2 sentences
- Turn instruction: “Return a direct standalone definition in at most two sentences. Do not reference the page unless explicitly asked.”
2. Page-Grounded Definition
When: User asks for a definition tied to a specific page location
Examples:
- “What does the term on the far right mean?”
- “Explain the definition in the top left”
- “What is complexity in this context?”
- ✅ Image included
- ✅ Short history (last 2 turns)
- ❌ No memory context
- ❌ No notes context
- ✅ Max 2 sentences
- History window: 2 turns
- Turn instruction: “Answer the definition request using the relevant page location. Keep the answer to at most two sentences.”
3. General Contextual
When: Open-ended questions about the page or complex reasoning
Examples:
- “What does this section mean?”
- “Summarize the main argument”
- “How does this relate to the previous chapter?”
- ✅ Image included
- ✅ Full conversation history
- ✅ Memory context from past sessions
- ✅ Notes context if Obsidian configured
- ✅ No sentence cap
- Turn instruction: None (default system prompt)
Decision Flow
Local Heuristics (Fast Path)
Implementation: klaus/query_router.py:191 (_route_local())
Timing: ~1-2ms
Pattern Matching
Six regex signals:
Weighted Scoring
Each route mode gets a score based on matched signals:
Confidence Calculation
Top score vs. second score must clear both thresholds:
- Confidence ≥ ROUTER_LOCAL_CONFIDENCE_THRESHOLD (default 0.75)
- Margin ≥ ROUTER_LOCAL_MARGIN_THRESHOLD (default 0.20)
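The pattern-matching, weighted-scoring, and confidence steps above can be sketched together. The signal names come from the example classifications later in this document, but the patterns and weights here are illustrative assumptions, not the hand-tuned values in _route_local():

```python
import re

# Illustrative subset of the regex signals (patterns are assumptions).
SIGNALS = {
    "definition": re.compile(r"\b(define|definition|what is|what does .+ mean)\b", re.I),
    "concision":  re.compile(r"\b(concise(ly)?|briefly|in a sentence)\b", re.I),
    "spatial":    re.compile(r"\b(top|bottom|left|right|corner)\b", re.I),
    "deictic":    re.compile(r"\b(this|that|these|here)\b", re.I),
}

# Each route mode sums the weights of its matched signals (weights are illustrative).
WEIGHTS = {
    "standalone_definition":    {"definition": 0.6, "concision": 0.4, "deictic": -0.5},
    "page_grounded_definition": {"definition": 0.5, "spatial": 0.6},
    "general_contextual":       {"deictic": 0.5, "spatial": 0.2},
}

def score_routes(question: str) -> dict[str, float]:
    matched = {name for name, rx in SIGNALS.items() if rx.search(question)}
    return {mode: sum(w for sig, w in ws.items() if sig in matched)
            for mode, ws in WEIGHTS.items()}

def confidence_and_margin(scores: dict[str, float]) -> tuple[str, float, float]:
    """Top mode wins; confidence is its clamped score, margin is top minus runner-up."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_mode, top = ranked[0]
    margin = top - ranked[1][1]
    return top_mode, max(0.0, min(1.0, top)), margin
```

A question like “Define macroeconomics very concisely” fires definition and concision, giving the standalone mode a high score and a wide margin, so the fast path can answer without the LLM.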
LLM Router (Fallback)
Implementation: klaus/query_router.py:248 (_route_with_llm())
Timing: ~150-350ms (strict timeout)
When: Local confidence too low or margin too narrow
System Prompt
Example Prompt
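The actual prompts are not reproduced in this document. As an illustration only, a routing system prompt of this kind typically instructs the model to emit strict JSON in the shape shown in Example 4 below; the wording here is hypothetical:

```python
import json

# Hypothetical wording -- NOT the real klaus system prompt. Only the JSON
# response shape ({mode, confidence, reason}) is documented elsewhere in this doc.
SYSTEM_PROMPT = (
    "You are a query router. Classify the user's question as one of: "
    "standalone_definition, page_grounded_definition, general_contextual. "
    'Respond with JSON only: {"mode": "...", "confidence": 0.0-1.0, "reason": "..."}'
)

# Parsing a reply of the documented shape (cf. Example 4 below):
reply = ('{"mode": "general_contextual", "confidence": 0.82, '
         '"reason": "Requires page context to evaluate importance"}')
decision = json.loads(reply)
```

Constraining the model to JSON keeps the 80-token output budget sufficient and makes parse failures detectable, which feeds the timeout/failure fallback below.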
Configuration
| Setting | Default | Purpose |
|---|---|---|
| ROUTER_MODEL | claude-haiku-4-5 | Fast, cheap LLM for routing |
| ROUTER_TIMEOUT_MS | 350 | Strict timeout to prevent latency spikes |
| ROUTER_MAX_TOKENS | 80 | Limit output size |
| ROUTER_LLM_CONFIDENCE_THRESHOLD | 0.70 | Minimum confidence to use the LLM decision |
Timeout Handling
If the LLM call times out or fails, the router falls back to standalone_definition with confidence 0.5.
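The strict-timeout behavior can be sketched with asyncio (function names are illustrative, not the klaus API):

```python
import asyncio

async def route_with_llm_sketch(question, call_llm, timeout_ms=350):
    """Call the routing model under a strict timeout; on timeout or any
    error, fall back to the safe default route (see Fallback Policy)."""
    try:
        # ROUTER_TIMEOUT_MS default: 350 ms
        return await asyncio.wait_for(call_llm(question), timeout=timeout_ms / 1000)
    except Exception:  # includes asyncio.TimeoutError
        return {"mode": "standalone_definition", "confidence": 0.5}
```

The hard cap means a slow or failing router call can never add more than ~350ms to the question's end-to-end latency.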
Fallback Policy
If both local and LLM routing fail or return low confidence:
- Mode: standalone_definition
- Confidence: 0.5
- Reason: "fallback:low_conf(local=X.XX,llm=Y.YY)"
Rationale: Standalone definition is the safest default. It won’t send unnecessary context and enforces brevity, which is better than a wrong classification.
Policy Table
| Route Mode | Image | History | Memory | Notes | Max Sentences | History Window | Turn Instruction |
|---|---|---|---|---|---|---|---|
| Standalone definition | ❌ | ❌ | ❌ | ❌ | 2 | 0 | “Return a direct standalone definition in at most two sentences. Do not reference the page unless explicitly asked.” |
| Page-grounded definition | ✅ | ✅ | ❌ | ❌ | 2 | 2 | “Answer the definition request using the relevant page location. Keep the answer to at most two sentences.” |
| General contextual | ✅ | ✅ | ✅ | ✅ | None | 0 (full) | None |
klaus/query_router.py:35 (_POLICY)
Sentence Cap Enforcement
For routes with max_sentences set:
During Streaming
Implementation: klaus/brain.py:250
After Streaming
Implementation: klaus/brain.py:260
_SENTENCE_CHUNK = re.compile(r"[^.!?]+(?:[.!?]+|$)")
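Using the _SENTENCE_CHUNK pattern above, the post-stream enforcement step can be sketched as follows (the helper name is illustrative, not the actual brain.py function):

```python
import re

# Pattern from klaus/brain.py: one "sentence chunk" = text up to the next
# run of terminal punctuation (or end of string).
_SENTENCE_CHUNK = re.compile(r"[^.!?]+(?:[.!?]+|$)")

def cap_sentences(text: str, max_sentences: int) -> str:
    """Keep only the first max_sentences sentence chunks of the response."""
    chunks = _SENTENCE_CHUNK.findall(text)
    return "".join(chunks[:max_sentences]).strip()
```

Because the pattern treats each punctuation run as one terminator, abbreviations and ellipses can over-count sentences; that trade-off favors a simple, fast check over full sentence segmentation.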
Example Classifications
Example 1: Standalone Definition
Question: “Define macroeconomics very concisely”
Local Signals:
- definition: ✅ (“Define”)
- concision: ✅ (“very concisely”)
- doc_ref: ❌
- spatial: ❌
Scores:
- standalone_score: 0.93
- page_definition_score: 0.12
- contextual_score: 0.24
Route: standalone_definition (confidence 0.92, margin 0.81)
Result:
- No image sent
- 2-sentence limit enforced
- Fast, concise response
Example 2: Page-Grounded Definition
Question: “Explain what complexity means in the definition on the far right”
Local Signals:
- definition: ✅ (“definition”)
- explain: ✅ (“Explain”)
- doc_ref: ✅ (“definition”)
- spatial: ✅ (“far right”)
- on_page: ✅ (“in the definition”)
Scores:
- standalone_score: 0.35
- page_definition_score: 1.08
- contextual_score: 0.52
Route: page_grounded_definition (confidence 0.88, margin 0.56)
Result:
- Image sent
- Last 2 turns of history sent
- 2-sentence limit enforced
- Claude knows to look at “far right” of page
Example 3: General Contextual
Question: “What does this section mean?”
Local Signals:
- general: ✅ (“what does this section mean”)
- deictic: ✅ (“this”)
- doc_ref: ✅ (“section”)
Scores:
- standalone_score: -0.20
- page_definition_score: 0.30
- contextual_score: 0.94
Route: general_contextual (confidence 0.81, margin 0.64)
Result:
- Image sent
- Full conversation history sent
- Memory context from past sessions sent
- Notes context sent (if Obsidian configured)
- No sentence cap
- Detailed, contextualized answer
Example 4: Ambiguous → LLM Fallback
Question: “Is that important?”
Local Signals:
- deictic: ✅ (“that”)
Scores:
- standalone_score: -0.68
- page_definition_score: 0.22
- contextual_score: 0.40
Local confidence is too low, so the LLM router is consulted. It returns:
{mode: "general_contextual", confidence: 0.82, reason: "Requires page context to evaluate importance"}
Route: general_contextual (source: llm)
Result: Full context sent
Performance Impact
Token Savings
Standalone definition vs. general contextual for “Define macroeconomics”:
| Context | Tokens | Notes |
|---|---|---|
| Image (base64 JPEG) | ~1700 | Omitted in standalone |
| Full history (10 turns) | ~2000 | Omitted in standalone |
| Memory context | ~300 | Omitted in standalone |
| Notes context | ~200 | Omitted in standalone |
| Total saved | ~4200 | 70-80% token reduction |
Latency Comparison
| Route Path | Latency | Notes |
|---|---|---|
| Local confident | ~1-2ms | 80%+ of questions |
| LLM fallback | ~150-350ms | 15-20% of questions |
| Timeout fallback | ~350ms | 5% of questions |
Configuration
All router settings live in ~/.klaus/config.toml:
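For example, a config fragment with the documented defaults might look like the following. The setting names match the tables above, but the `[router]` table name and exact file layout are assumptions:

```toml
# Sketch only -- section name and layout are assumed, values are the documented defaults.
[router]
ROUTER_MODEL = "claude-haiku-4-5"
ROUTER_TIMEOUT_MS = 350
ROUTER_MAX_TOKENS = 80
ROUTER_LOCAL_CONFIDENCE_THRESHOLD = 0.75
ROUTER_LOCAL_MARGIN_THRESHOLD = 0.20
ROUTER_LLM_CONFIDENCE_THRESHOLD = 0.70
```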
Future Improvements
Current Gaps (from CLAUDE.md:183):
- Router cost/latency tuning: Ambiguous turns may incur extra routing call. Tune thresholds if latency or fallback behavior needs adjustment.
- Pattern weights: Current weights are hand-tuned. Could train on labeled dataset for better accuracy.
- Caching: LLM router could cache decisions for repeated questions (not implemented).
Next Steps
- Architecture Overview — High-level system design
- Data Flow — End-to-end question lifecycle
- Module Responsibilities — Detailed module breakdown