System Architecture
ChatbotAI-Free is built as a modular, privacy-first desktop application using Python and PyQt6. This page explains how the components work together to deliver a seamless voice AI experience.

Architecture Overview
The application follows a component-based architecture with clear separation of concerns:

Core Components
AIManager (ai_manager.py)
The central orchestrator for all AI operations.
Responsibilities:
- Manages Whisper STT (faster-whisper)
- Interfaces with Ollama for LLM inference
- Coordinates TTS generation via TTSManager
- Maintains conversation history
- Handles language switching (English/Spanish)
- Tracks token usage for context window management
- Automatically uses multilingual Whisper models (removes the .en suffix)
- CUDA acceleration when available (falls back to CPU)
- Streaming LLM responses with get_llm_response_streaming()
- Supports <think>...</think> blocks for reasoning models
- VAD filtering to reduce hallucinations
/ai_manager.py:22-543
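The .en-suffix handling above can be sketched as a tiny helper. This is an illustration of the rule, not AIManager's actual code; the function name is hypothetical.

```python
def normalize_whisper_model(model_name: str) -> str:
    """Strip the English-only ".en" suffix so a multilingual model is loaded.

    English-only variants such as "base.en" cannot transcribe Spanish, so
    "base.en" becomes "base", which handles both supported languages.
    """
    if model_name.endswith(".en"):
        return model_name[: -len(".en")]
    return model_name
```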
TTSManager (tts_manager.py)
Unified TTS engine that routes synthesis to the appropriate backend.
Routing Logic:
- Kokoro voices (no hyphens in name, e.g., af_bella, ef_dora) → Kokoro ONNX
- Sherpa voices (contain hyphens, e.g., vits-piper-es_AR-daniela-high) → Sherpa-ONNX
- Lazy loading of Sherpa engines (cached per folder)
- Speed adjustment support (speed parameter)
- Language-aware synthesis (English: en-us, Spanish: es)
- Returns numpy float32 arrays at native sample rates (24kHz for Kokoro)
/tts_manager.py:26-151
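The hyphen-based routing rule reduces to a one-line check. A minimal sketch (the function name is hypothetical; TTSManager's real routing lives in tts_manager.py):

```python
def route_tts_backend(voice_name: str) -> str:
    """Pick a TTS backend from the voice name alone.

    Kokoro voices use underscore-only names (e.g. "af_bella"), while
    Sherpa/Piper voice packs contain hyphens
    (e.g. "vits-piper-es_AR-daniela-high").
    """
    return "sherpa-onnx" if "-" in voice_name else "kokoro"
```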
AudioRecorder (audio_utils.py)
Handles microphone input with Voice Activity Detection (VAD).
Features:
- Real-time audio capture via sounddevice
- Automatic sample rate detection and resampling
- VAD-based silence detection (RMS threshold: 0.03)
- Configurable silence duration (default: 3 seconds)
- Pause/resume to prevent feedback loops
- Queue-based architecture for thread-safe audio buffering
Key parameters:
- silence_threshold: RMS energy threshold (0.03)
- silence_duration: Silence duration before stopping (3.0s)
- min_audio_duration: Minimum clip length to process (1.0s)
/audio_utils.py:13-180
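The RMS-based silence check at the heart of the VAD can be sketched as follows. This is a simplified illustration using the thresholds listed above, not the AudioRecorder implementation itself:

```python
import numpy as np

SILENCE_THRESHOLD = 0.03   # RMS energy below this counts as silence
SILENCE_DURATION = 3.0     # seconds of silence that end a recording


def is_silent(frame: np.ndarray, threshold: float = SILENCE_THRESHOLD) -> bool:
    """Return True when a float32 audio frame's RMS energy is below threshold."""
    rms = float(np.sqrt(np.mean(np.square(frame))))
    return rms < threshold


def silent_frames_needed(frame_len: int, sample_rate: int = 16_000,
                         duration: float = SILENCE_DURATION) -> int:
    """How many consecutive silent frames amount to `duration` seconds."""
    return int(duration * sample_rate / frame_len)
```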
AudioPlayer (audio_utils.py)
Plays TTS output via PipeWire (with sounddevice fallback).
Why PipeWire?
Using paplay allows the app to mix audio with other apps (YouTube, music players) without ALSA device locking conflicts.
Process:
- Converts float32 audio to int16 WAV format
- Writes temporary .wav file
- Spawns paplay subprocess
- Cleans up temp file after playback
/audio_utils.py:182-298
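The four steps above can be sketched like this. It is a minimal illustration, not AudioPlayer's code; the function names are hypothetical, and the int16 scaling factor is a common convention rather than a confirmed detail of audio_utils.py:

```python
import os
import subprocess
import tempfile
import wave

import numpy as np


def float32_to_pcm16(audio: np.ndarray) -> bytes:
    """Convert float32 samples in [-1, 1] to little-endian int16 PCM bytes."""
    return (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()


def play_via_paplay(audio: np.ndarray, sample_rate: int = 24_000) -> None:
    """Write audio to a temp 16-bit WAV and hand it to paplay.

    Going through PipeWire's paplay lets the system mixer share the output
    device with other applications instead of locking an ALSA device.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    try:
        with wave.open(path, "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)          # int16
            wf.setframerate(sample_rate)
            wf.writeframes(float32_to_pcm16(audio))
        subprocess.run(["paplay", path], check=True)
    finally:
        os.remove(path)                  # clean up the temp file
```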
Chat History (chat_history.py)
Persistence layer for conversation management.
Storage Format:
Conversations are saved as Markdown files in the chats/ directory:
- Automatic title generation using the lightest Ollama model
- Fast listing (reads only first 3 lines for metadata)
- Full message parsing for chat restoration
- Rename and delete operations
/chat_history.py:1-233
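The fast-listing trick (reading only the first 3 lines per file) might look like the sketch below. The metadata layout is an assumption for illustration — here the first line is treated as a Markdown H1 title — and is not necessarily chat_history.py's real format:

```python
from pathlib import Path


def list_chats(chats_dir: str = "chats") -> list[dict]:
    """List saved chats cheaply by reading only each file's first 3 lines.

    Full message parsing is deferred until a chat is actually restored,
    so listing stays fast even with many conversations.
    """
    chats = []
    for path in sorted(Path(chats_dir).glob("*.md")):
        with path.open(encoding="utf-8") as f:
            head = [f.readline().strip() for _ in range(3)]
        # Assumed layout: first line is "# <title>"; fall back to the filename.
        title = head[0].lstrip("# ").strip() if head[0] else path.stem
        chats.append({"file": path.name, "title": title})
    return chats
```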
Threading Model
ChatbotAI-Free uses multiple thread types to maintain UI responsiveness:

ManualRecorderThread
Walkie-talkie style recording thread. Records audio until stop_recording() is called, then emits the complete audio data.
Location: main.py:46-100

WorkerThread
Pipeline thread for Classic Chat mode:
- Transcribe audio → Whisper
- Stream LLM response → Ollama
- Generate TTS per sentence → TTSManager
- Play audio chunks → AudioPlayer
main.py:122-349

LiveWorkerThread
Continuous conversation thread for Live Mode with barge-in detection:
- VAD-based listening
- Real-time user interruption monitoring
- Automatic playback stopping when user speaks
main.py:1020-1296

TitleGeneratorThread
Background thread that generates short chat titles using the lightest Ollama model (avoids blocking the UI).
Location: main.py:102-120

Data Flow: Microphone to Speakers
1. Audio Capture
User speaks → AudioRecorder captures frames via sounddevice → Frames queued in audio_queue

2. VAD Processing
ManualRecorderThread or LiveWorkerThread monitors RMS energy → Detects speech start/end → Concatenates audio chunks

3. Resampling
If the microphone's native rate ≠ 16kHz, audio is resampled using linear interpolation to match Whisper's expected input

4. Transcription
AIManager.transcribe() → faster-whisper processes float32 audio → Returns text (filters hallucinations like "thank you", "subscribe")

5. LLM Inference
AIManager.get_llm_response_streaming() → Ollama generates response with streaming → Text chunks emitted via on_chunk() callback

6. Sentence Detection
Streaming text monitored for sentence delimiters (., !, ?, \n) → Complete sentences sent to on_sentence() callback

7. TTS Generation
Complete sentence → TTSManager.create() → Routed to Kokoro or Sherpa → Returns numpy float32 audio

8. Playback
Audio chunks → AudioPlayer → paplay subprocess outputs through PipeWire

Steps 5-8 run in parallel threads to minimize latency. TTS generation starts as soon as the first sentence is ready, while the LLM continues generating the rest of the response.
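The sentence-detection step above (splitting the streaming text on ., !, ? and \n so finished sentences can go to TTS early) can be sketched as a buffer splitter. The function name is hypothetical:

```python
SENTENCE_DELIMITERS = (".", "!", "?", "\n")


def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Split streamed LLM text into finished sentences plus a remainder.

    Finished sentences can be handed to TTS immediately while the model
    keeps generating; the remainder stays buffered until its delimiter
    arrives in a later chunk.
    """
    sentences, start = [], 0
    for i, ch in enumerate(buffer):
        if ch in SENTENCE_DELIMITERS:
            sentence = buffer[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    return sentences, buffer[start:]
```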
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| UI Framework | PyQt6 | Desktop application interface, event handling |
| LLM Backend | Ollama | Local inference for Llama, Mistral, Gemma models |
| Speech Recognition | faster-whisper (CTranslate2) | Real-time STT with CUDA acceleration |
| Text-to-Speech | Kokoro ONNX v1.0 | High-quality neural TTS (54 voices, 2 languages) |
| Extra TTS Voices | Sherpa-ONNX (optional) | Piper-compatible voice packs (multi-language) |
| Audio I/O | sounddevice + paplay | Microphone capture and PipeWire playback |
| PDF Parsing | PyMuPDF (fitz) | Text extraction from PDF documents |
| Token Counting | tiktoken | Context window usage tracking |
| Markdown Rendering | Custom HTML converter | Rich text display in chat bubbles |
Voice Detection & Interruption
Classic Chat Mode
- Uses silence_threshold = 0.03 (RMS)
- Records until 3 seconds of silence detected
- No interruption support (bot speaks until finished)

Live Mode
- Dual monitoring system:
  - Main VAD loop for user speech start/end
  - Separate _monitor_for_barge_in() thread watching audio queue
- Barge-in detection:
  - Uses higher threshold (silence_threshold * 2.0)
  - Requires 4 consecutive speech frames to trigger
  - Sets user_speaking event → Stops playback immediately
  - Clears audio queue and restarts listening
main.py:1197-1237
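The barge-in rule (doubled threshold, 4 consecutive loud frames) can be sketched as below. This illustrates the logic described above; the real monitor runs in its own thread and signals via the user_speaking event:

```python
import numpy as np


def barge_in_detected(frames, base_threshold: float = 0.03,
                      required_consecutive: int = 4) -> bool:
    """Return True once enough consecutive frames exceed the barge-in threshold.

    The barge-in threshold is double the normal VAD threshold so the bot's
    own playback bleeding into the microphone is less likely to trigger it.
    """
    threshold = base_threshold * 2.0
    consecutive = 0
    for frame in frames:
        rms = float(np.sqrt(np.mean(np.square(frame))))
        if rms > threshold:
            consecutive += 1
            if consecutive >= required_consecutive:
                return True
        else:
            consecutive = 0  # a quiet frame resets the run
    return False
```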
Context Window Management
The app tracks token usage to prevent context overflow:
- Token Counting: Ollama returns prompt_eval_count and eval_count in streaming responses
- Storage: AIManager.last_token_usage dict stores {"prompt": N, "completion": M, "total": N+M}
- Context Size Detection: get_model_context_size() queries model metadata or respects user-defined num_ctx
- UI Indicator: ContextDonut widget displays usage as a colored arc (green < 50%, yellow < 80%, red ≥ 80%)
ai_manager.py:480-514, main.py:692-771
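The ContextDonut color bands reduce to a simple ratio check. A sketch of the thresholds listed above (the function name is hypothetical, not the widget's API):

```python
def context_usage_color(total_tokens: int, context_size: int) -> str:
    """Map token usage to the ContextDonut's color bands.

    Green below 50% of the context window, yellow below 80%,
    red at 80% and above.
    """
    ratio = total_tokens / context_size
    if ratio < 0.5:
        return "green"
    if ratio < 0.8:
        return "yellow"
    return "red"
```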
Configuration & Preferences
Settings are persisted in preferences.json:
preferences.py (load_preferences(), save_preferences())
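A minimal sketch of what preferences.py's load/save pair might look like. The function names come from the source, but the parameters and the exact defaults shown are assumptions for illustration:

```python
import json
from pathlib import Path

# Assumed defaults; num_ctx = 0 means "use the model's built-in default".
DEFAULTS = {"num_ctx": 0}


def load_preferences(path: str = "preferences.json") -> dict:
    """Read preferences, falling back to defaults if the file is missing."""
    p = Path(path)
    if p.exists():
        return {**DEFAULTS, **json.loads(p.read_text(encoding="utf-8"))}
    return dict(DEFAULTS)


def save_preferences(prefs: dict, path: str = "preferences.json") -> None:
    """Persist preferences as pretty-printed JSON."""
    Path(path).write_text(json.dumps(prefs, indent=2), encoding="utf-8")
```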
The num_ctx parameter allows users to override the model's default context window size. Setting it to 0 uses the model's built-in default.

Markdown Rendering Pipeline
Bot messages are rendered as HTML for rich formatting:
- Text Processing: MarkdownRenderer.to_html() converts markdown to styled HTML
- Supported Features:
  - Code blocks with syntax highlighting backgrounds
  - Inline code with monospace styling
  - Headers (H1-H4)
  - Bold/italic formatting
  - Tables with alternating row colors
  - Horizontal rules
- Display: QTextBrowser widget renders HTML with custom CSS
main.py:351-461
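A deliberately tiny converter in the spirit of MarkdownRenderer.to_html(), covering only headers, bold, and inline code. The real renderer supports far more (tables, code blocks, horizontal rules), so this is a sketch of the approach, not the app's code:

```python
import html
import re


def to_html(markdown_text: str) -> str:
    """Convert a small subset of Markdown to HTML, line by line."""
    out = []
    for line in markdown_text.splitlines():
        line = html.escape(line)  # escape first so user text can't inject HTML
        m = re.match(r"(#{1,4}) (.*)", line)
        if m:  # headers H1-H4
            level = len(m.group(1))
            out.append(f"<h{level}>{m.group(2)}</h{level}>")
            continue
        line = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", line)      # bold
        line = re.sub(r"`(.+?)`", r"<code>\1</code>", line)      # inline code
        out.append(f"<p>{line}</p>")
    return "\n".join(out)
```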
Reasoning Panel (Thinking Mode)
For models that support reasoning:
- Detection: Looks for <think>...</think> tags or native Ollama thinking field
- Routing: Thinking content goes to on_thinking() callback, response text to on_chunk()
- UI: ThinkingWidget displays collapsible panel with streaming thinking updates
- Fallback: If model rejects think: True parameter (400 error), retries without it
ai_manager.py:269-435, main.py:463-557
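The tag-based detection can be sketched as a non-streaming splitter. The real code routes thinking text to on_thinking() and response text to on_chunk() as chunks arrive, so this is a simplified illustration with a hypothetical function name:

```python
import re


def split_thinking(raw: str) -> tuple[str, str]:
    """Separate <think>...</think> content from the visible response text."""
    thinking = "".join(re.findall(r"<think>(.*?)</think>", raw, re.DOTALL))
    response = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return thinking, response
```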