Overview
ChatbotAI-Free features a fully offline voice pipeline powered by state-of-the-art neural models:
- Speech-to-Text (STT): faster-whisper with multilingual support
- Text-to-Speech (TTS): Kokoro neural TTS (English + Spanish) + optional Sherpa-ONNX voices
- Voice Activity Detection (VAD): real-time silence/speech detection
- Audio Playback: PipeWire-native output via `paplay`
Speech-to-Text (STT)
faster-whisper Engine
ChatbotAI-Free uses faster-whisper, a highly optimized implementation of OpenAI's Whisper model.
Key Features:
- GPU acceleration (CUDA) for fast transcription
- Multilingual support (English, Spanish, and 90+ languages)
- VAD filtering built-in for accurate end-of-speech detection
- Hallucination filtering to remove common artifacts
Model Selection
Choose your STT quality/speed trade-off from Settings:

| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| base | ~140 MB | Fastest | Good | Testing, low-end hardware |
| small | ~460 MB | Fast | Better | Balanced performance |
| medium | ~1.5 GB | Moderate | Great | High accuracy |
| large-v3 | ~2.9 GB | Slow | Best | Production, critical accuracy |
Model changes take effect after an app restart; the app offers to restart immediately when you change the model.
Language Support
The STT system automatically uses the language selected in Settings:
- English → Whisper transcribes with `language="en"`
- Spanish → Whisper transcribes with `language="es"`
Transcription Pipeline
When you speak:
1. Audio capture: the microphone records at the device's native sample rate (e.g., 48 kHz) using `sounddevice`.
2. Resampling: audio is resampled to 16 kHz (Whisper's required input format) using linear interpolation.
3. Whisper inference: faster-whisper transcribes with:
   - `beam_size=5`, `best_of=5` (high quality)
   - `vad_filter=True` (remove non-speech)
   - `no_speech_threshold=0.6` (filter silence)
   - `condition_on_previous_text=False` (prevent hallucinations)
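The resampling and inference steps above can be sketched as follows. This is a minimal illustration, not the app's actual code: the model size, device settings, and both function names are assumptions.

```python
import numpy as np

TARGET_RATE = 16_000  # Whisper's required input sample rate


def resample_to_16k(audio: np.ndarray, src_rate: int) -> np.ndarray:
    """Linearly interpolate mono float32 audio to 16 kHz."""
    if src_rate == TARGET_RATE:
        return audio
    n_out = int(len(audio) * TARGET_RATE / src_rate)
    # Map the output time grid onto the input sample positions
    x_out = np.linspace(0, len(audio) - 1, n_out)
    x_in = np.arange(len(audio))
    return np.interp(x_out, x_in, audio).astype(np.float32)


def transcribe(audio: np.ndarray, src_rate: int, language: str = "en") -> str:
    """Hypothetical wrapper applying the faster-whisper settings above."""
    from faster_whisper import WhisperModel

    model = WhisperModel("small", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(
        resample_to_16k(audio, src_rate),
        language=language,
        beam_size=5,
        best_of=5,
        vad_filter=True,
        no_speech_threshold=0.6,
        condition_on_previous_text=False,
    )
    return " ".join(seg.text.strip() for seg in segments)
```

One second of 48 kHz audio becomes exactly 16,000 samples, which faster-whisper can consume directly as a NumPy array.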
Text-to-Speech (TTS)
Kokoro TTS Engine
The primary TTS engine is Kokoro ONNX v1.0, a high-quality neural voice model.
Features:
- 54 voices (English + Spanish) in a single ~300 MB model
- GPU acceleration (ONNX Runtime)
- Real-time synthesis with low latency
- Adjustable speed (0.5× to 2.0×)
- Multilingual: English (`en-us`) and Spanish (`es`) language codes
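As an illustration, synthesis with the `kokoro-onnx` Python package typically looks like the sketch below. The model/voices file names and the `clamp_speed` helper are assumptions, not taken from the app.

```python
def clamp_speed(speed: float) -> float:
    """Clamp playback speed to the supported 0.5x-2.0x range."""
    return min(2.0, max(0.5, speed))


def synthesize(text: str, voice: str = "af_bella",
               speed: float = 1.0, lang: str = "en-us"):
    """Sketch of Kokoro synthesis; file names below are assumptions."""
    from kokoro_onnx import Kokoro  # deferred so clamp_speed stays importable

    kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
    samples, sample_rate = kokoro.create(
        text, voice=voice, speed=clamp_speed(speed), lang=lang
    )
    return samples, sample_rate
```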
Voice Selection
Voices are prefixed by language and gender:
- English voices: `af_bella`, `af_sarah`, `am_michael`, `bf_emma`, etc.
- Spanish voices: `ef_dora`, `em_carlos`, etc.
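Given this naming scheme, a small helper can split a voice name into its language and gender codes. The function is illustrative, not part of the app:

```python
def parse_kokoro_voice(name: str) -> tuple[str, str, str]:
    """Split e.g. 'af_bella' into (language_code, gender_code, voice_name)."""
    prefix, _, voice = name.partition("_")
    if len(prefix) != 2 or not voice:
        raise ValueError(f"not a Kokoro-style voice name: {name!r}")
    # First prefix letter encodes language, second encodes gender (f/m)
    return prefix[0], prefix[1], voice
```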
Sherpa-ONNX Voice Packs (Optional)
You can add external voices in other languages using Sherpa-ONNX VITS models.
Download a voice pack
Browse Hugging Face VITS models. Download:
- The `.onnx` model file
- `tokens.txt`
- The `espeak-ng-data/` directory
Sherpa voices are identified by folder names containing hyphens (e.g., `vits-piper-es_AR-daniela-high`). Kokoro voices have short names without hyphens (e.g., `af_bella`).
TTS Synthesis Pipeline
When the AI generates a response:
1. Sentence segmentation: the streaming LLM response is split into sentences (delimited by `.`, `!`, `?`, `\n`).
2. Parallel TTS generation: each sentence is synthesized in a background thread while the LLM continues streaming. Audio is queued for playback.
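A minimal sketch of the segmentation step, assuming sentences are cut at the delimiters listed above (the function name and regex are illustrative, not the app's code):

```python
import re

# Zero-width split point after any sentence delimiter, absorbing whitespace
SENTENCE_END = re.compile(r"(?<=[.!?\n])\s*")


def split_sentences(buffer: str) -> tuple[list[str], str]:
    """Split completed sentences off a streaming text buffer.

    Returns (complete_sentences, remaining_partial_text).
    """
    parts = SENTENCE_END.split(buffer)
    # The last element is an unterminated fragment; keep it for the next chunk
    return [p.strip() for p in parts[:-1] if p.strip()], parts[-1]
```

Each completed sentence can then be handed to a background `threading.Thread` (or worker pool) that runs TTS and pushes the resulting audio into a playback queue.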
Voice Activity Detection (VAD)
How It Works
VAD monitors audio energy in real time to detect speech and silence.
Algorithm:
- RMS calculation: Root Mean Square of each audio chunk
- Threshold comparison: if `rms > silence_threshold`, speech is detected
- Silence counter: consecutive silent frames are counted
- End-of-speech: recording stops after the configured silence duration (e.g., 1.5 seconds in Live Mode)
Default parameters (`audio_utils.py:16`):
- `silence_threshold = 0.03` (RMS level)
- `silence_duration = 3.0` seconds (Classic Chat)
- `silence_duration = 1.5` seconds (Live Mode)
- `min_audio_duration = 1.0` seconds (minimum valid recording)
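The RMS-plus-counter algorithm above can be sketched as a small detector class. The class and method names are illustrative; the numeric defaults mirror the parameters listed above.

```python
import numpy as np


def rms(chunk: np.ndarray) -> float:
    """Root Mean Square energy of an audio chunk."""
    return float(np.sqrt(np.mean(np.square(chunk))))


class EndOfSpeechDetector:
    """Counts consecutive silent chunks; fires after `silence_duration` seconds."""

    def __init__(self, sample_rate: int = 16_000, chunk_size: int = 512,
                 silence_threshold: float = 0.03, silence_duration: float = 1.5):
        self.silence_threshold = silence_threshold
        chunks_per_second = sample_rate / chunk_size
        self.max_silent_chunks = int(silence_duration * chunks_per_second)
        self.silent_chunks = 0
        self.heard_speech = False

    def feed(self, chunk: np.ndarray) -> bool:
        """Process one chunk; return True when end-of-speech is detected."""
        if rms(chunk) > self.silence_threshold:
            self.heard_speech = True
            self.silent_chunks = 0   # speech resets the silence counter
        else:
            self.silent_chunks += 1
        return self.heard_speech and self.silent_chunks >= self.max_silent_chunks
```

With 512-sample chunks at 16 kHz, 1.5 seconds of silence corresponds to 46 consecutive silent chunks before the detector fires.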
Barge-In Detection (Live Mode Only)
Live Mode runs a continuous monitoring thread even while the AI is speaking:- Audio playback stops immediately (
player.stop()) - TTS queue is cleared
- Mode loops back to listening
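The barge-in steps above might be sketched like this; the `player` object and queue layout are assumptions about the app's internals:

```python
import queue


def handle_barge_in(player, tts_queue: queue.Queue) -> str:
    """Stop playback, drop queued audio, and return to listening."""
    player.stop()                 # cut the current audio immediately
    while True:                   # clear pending synthesized sentences
        try:
            tts_queue.get_nowait()
        except queue.Empty:
            break
    return "listening"            # the mode loops back to listening
```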
Audio Playback
PipeWire Integration
ChatbotAI-Free uses PipeWire for audio playback via the `paplay` command (from the PulseAudio compatibility layer).
Why PipeWire?
- Non-blocking: TTS never locks the audio device
- Mixing: Other apps (YouTube, Spotify) play simultaneously without conflict
- Low latency: Instant playback start
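As a sketch of this design, playback could write synthesized audio to a temporary WAV file and hand it to `paplay`, falling back to `sounddevice` when `paplay` is absent. The file handling and the 24 kHz default rate are illustrative assumptions, not the app's actual code.

```python
import io
import shutil
import subprocess
import tempfile
import wave

import numpy as np


def to_wav_bytes(samples: np.ndarray, rate: int = 24_000) -> bytes:
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV bytes."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(pcm.tobytes())
    return buf.getvalue()


def play(samples: np.ndarray, rate: int = 24_000) -> None:
    if shutil.which("paplay"):
        # PipeWire's PulseAudio layer mixes this stream with other apps
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            f.write(to_wav_bytes(samples, rate))
            f.flush()
            subprocess.run(["paplay", f.name], check=True)
    else:
        # Fallback: direct ALSA/JACK output, blocking until done
        import sounddevice as sd
        sd.play(samples, rate)
        sd.wait()
```

Because `paplay` runs as a separate process, synthesis and playback never block each other, matching the non-blocking behavior described above.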
Playback Pipeline
If `paplay` is not installed, the app falls back to `sounddevice.play()` (direct ALSA/JACK output).
Configuration
Settings Panel (⚙️)
Adjust voice system parameters.
STT Settings:
- Whisper Model: base / small / medium / large-v3
- Language: English / Spanish
- Voice: Select from Kokoro + Sherpa voices
- Voice Speed: 0.5× to 2.0× (slider)
- Input Device: Select microphone
- Output Device: Select speaker (info only; paplay uses system default)
- Auto-send: Automatically send after silence detected
- Manual: Tap mic twice (record → send)
Changes to Whisper model require an app restart. Other settings take effect immediately.