
Overview

ChatbotAI-Free features a fully offline voice pipeline powered by state-of-the-art neural models:
  • Speech-to-Text (STT): faster-whisper with multilingual support
  • Text-to-Speech (TTS): Kokoro neural TTS (English + Spanish) + optional Sherpa-ONNX voices
  • Voice Activity Detection (VAD): Real-time silence/speech detection
  • Audio Playback: PipeWire-native output via paplay

Speech-to-Text (STT)

faster-whisper Engine

ChatbotAI-Free uses faster-whisper, a highly optimized implementation of OpenAI’s Whisper model.

Key features:
  • GPU acceleration (CUDA) for fast transcription
  • Multilingual support (English, Spanish, and 90+ languages)
  • VAD filtering built-in for accurate end-of-speech detection
  • Hallucination filtering to remove common artifacts

Model Selection

Choose your STT quality/speed trade-off from Settings:
Model    | Size    | Speed    | Accuracy | Best For
---------|---------|----------|----------|------------------------------
base     | ~140 MB | Fastest  | Good     | Testing, low-end hardware
small    | ~460 MB | Fast     | Better   | Balanced performance
medium   | ~1.5 GB | Moderate | Great    | High accuracy
large-v3 | ~2.9 GB | Slow     | Best     | Production, critical accuracy
Model changes take effect after an app restart; the app offers to restart immediately when you change the model.

Language Support

The STT system automatically uses the language selected in Settings:
  • English → Whisper transcribes with language="en"
  • Spanish → Whisper transcribes with language="es"
ChatbotAI-Free always loads the multilingual Whisper models, not the English-only .en variants.

Transcription Pipeline

When you speak:
1. Audio capture

The microphone records at the device’s native sample rate (e.g., 48 kHz) using sounddevice.
2. Resampling

Audio is resampled to 16 kHz (Whisper’s required input format) using linear interpolation.
# From audio_utils.py:166-170
import numpy as np  # needed by the snippet

new_length = int(len(audio_data) * 16000 / record_rate)
old_idx = np.arange(len(audio_data))
new_idx = np.linspace(0, len(audio_data) - 1, new_length)
audio_data = np.interp(new_idx, old_idx, audio_data)
3. Whisper inference

faster-whisper transcribes with:
  • beam_size=5, best_of=5 (high quality)
  • vad_filter=True (remove non-speech)
  • no_speech_threshold=0.6 (filter silence)
  • condition_on_previous_text=False (prevent hallucinations)
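Taken together, these options correspond to a faster-whisper call along the following lines. This is a sketch: `TRANSCRIBE_KWARGS` and `transcribe_audio` are illustrative names, not the app’s actual code.

```python
# Decoding options implied by the settings above.
TRANSCRIBE_KWARGS = {
    "beam_size": 5,                       # wider beam search: higher quality
    "best_of": 5,                         # sample candidates, keep the best
    "vad_filter": True,                   # strip non-speech before decoding
    "no_speech_threshold": 0.6,           # drop segments likely to be silence
    "condition_on_previous_text": False,  # avoid hallucination loops
}

def transcribe_audio(model, audio_16k, language="en"):
    """model: a faster_whisper.WhisperModel; audio_16k: 16 kHz float32 mono."""
    segments, _info = model.transcribe(audio_16k, language=language,
                                       **TRANSCRIBE_KWARGS)
    return " ".join(seg.text.strip() for seg in segments)
```

With a loaded `WhisperModel`, calling `transcribe_audio(model, audio, "es")` returns the joined transcript for Spanish input.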
4. Post-processing

Common hallucinations are filtered out:
  • English: “thank you”, “subscribe”, “bye”, etc.
  • Spanish: “gracias”, “suscríbete”, “adiós”, etc.
Text shorter than 2 characters is discarded.
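The filter can be sketched as a small lookup. The phrase sets below are only the examples listed above, not the app’s full lists, and `clean_transcript` is an illustrative name.

```python
# Known filler phrases Whisper tends to emit on near-silence.
HALLUCINATIONS = {
    "en": {"thank you", "subscribe", "bye"},
    "es": {"gracias", "suscríbete", "adiós"},
}

def clean_transcript(text, language="en"):
    stripped = text.strip()
    # Drop transcripts that are nothing but a known hallucination.
    if stripped.lower().rstrip(".!") in HALLUCINATIONS.get(language, set()):
        return ""
    # Discard fragments shorter than 2 characters.
    if len(stripped) < 2:
        return ""
    return stripped
```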

Text-to-Speech (TTS)

Kokoro TTS Engine

The primary TTS engine is Kokoro ONNX v1.0, a high-quality neural voice model.

Features:
  • 54 voices (English + Spanish) in a single 300MB model
  • GPU acceleration (ONNX Runtime)
  • Real-time synthesis with low latency
  • Adjustable speed (0.5× to 2.0×)
  • Multilingual: English (en-us) and Spanish (es) language codes

Voice Selection

Voices are prefixed by language and gender:
  • English voices: af_bella, af_sarah, am_michael, bf_emma, etc.
  • Spanish voices: ef_dora, em_carlos, etc.
Select voices from the top dropdown in the main window. The app automatically uses the correct language code based on your Settings.
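The naming scheme can be decoded with a small helper. This is a sketch: the prefix maps are assumptions inferred from the voices listed above, and both American and British English voices are mapped to en-us because that is the language code the app passes for English.

```python
# First prefix letter: language group; second: gender (assumed mapping).
LANG_BY_PREFIX = {"a": "en-us", "b": "en-us", "e": "es"}
GENDER_BY_PREFIX = {"f": "female", "m": "male"}

def describe_voice(voice_id):
    """'af_bella' -> ('en-us', 'female'); 'em_carlos' -> ('es', 'male')."""
    prefix = voice_id.split("_", 1)[0]              # e.g. 'af'
    lang = LANG_BY_PREFIX.get(prefix[0], "unknown")
    gender = GENDER_BY_PREFIX.get(prefix[1], "unknown")
    return lang, gender
```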

Sherpa-ONNX Voice Packs (Optional)

You can add external voices in other languages using Sherpa-ONNX VITS models.
1. Install Sherpa-ONNX

pip install sherpa-onnx
2. Download a voice pack

Browse Hugging Face VITS models. Download:
  • The .onnx model file
  • tokens.txt
  • The espeak-ng-data/ directory
3. Place in voices/ folder

Drop the folder directly inside voices/ (not nested):
voices/
├── kokoro-v1.0/
└── vits-piper-es_AR-daniela-high/  ← new Sherpa voice
    ├── es_AR-daniela-high.onnx
    ├── tokens.txt
    └── espeak-ng-data/
4. Restart the app

The voice scanner detects the new folder and asks which language to assign it to. After confirmation, the voice appears in the dropdown.
Sherpa voices are identified by folder names containing hyphens (e.g., vits-piper-es_AR-daniela-high). Kokoro voices have short names without hyphens (e.g., af_bella).
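The detection rule above, plus an illustrative loader for a Sherpa VITS voice pack, can be sketched as follows. Assumptions: `is_sherpa_voice` and `load_sherpa_voice` are not the app’s actual function names, and the .onnx file name is taken from the example voice pack shown above.

```python
def is_sherpa_voice(name):
    """Sherpa voice folders contain hyphens; Kokoro voice IDs do not."""
    return "-" in name

def load_sherpa_voice(folder):
    """Build a sherpa-onnx offline TTS engine from a downloaded voice pack.
    Requires `pip install sherpa-onnx`; imported lazily so Kokoro-only
    installs never need it."""
    import sherpa_onnx
    config = sherpa_onnx.OfflineTtsConfig(
        model=sherpa_onnx.OfflineTtsModelConfig(
            vits=sherpa_onnx.OfflineTtsVitsModelConfig(
                model=f"{folder}/es_AR-daniela-high.onnx",  # example file name
                tokens=f"{folder}/tokens.txt",
                data_dir=f"{folder}/espeak-ng-data",
            )
        )
    )
    return sherpa_onnx.OfflineTts(config)
```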

TTS Synthesis Pipeline

When the AI generates a response:
1. Sentence segmentation

The streaming LLM response is split into sentences (delimited by ., !, ?, \n).
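A minimal segmentation sketch: buffer the streamed tokens and flush complete sentences to the TTS queue. The `pop_sentences` helper is illustrative, not the app’s actual function.

```python
import re

# A sentence is any text ending in . ! ? or a newline.
SENTENCE_END = re.compile(r'(.*?[.!?\n])\s*')

def pop_sentences(buffer):
    """Return (complete_sentences, remaining_partial_text)."""
    sentences = SENTENCE_END.findall(buffer)
    rest = SENTENCE_END.sub('', buffer, count=len(sentences))
    return [s.strip() for s in sentences if s.strip()], rest
```

As each LLM chunk arrives, append it to the buffer, call `pop_sentences`, and hand any complete sentences to TTS while the partial tail stays buffered.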
2. Markdown cleaning

Markdown symbols (*, **, #, etc.) and emojis are removed before TTS:
# From main.py:307-348 (abridged; uses the re module)
text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)  # **bold**
text = re.sub(r'\*(.+?)\*', r'\1', text)      # *italic*
text = emoji_pattern.sub('', text)            # remove emojis (precompiled emoji regex)
3. Parallel TTS generation

Each sentence is synthesized in a background thread while the LLM continues streaming. Audio is queued for playback.
4. Speed adjustment

The user-defined speed multiplier (from Settings) is applied during synthesis:
# From ai_manager.py:456
samples, sample_rate = self.tts_manager.create(text, speed=speed)

Voice Activity Detection (VAD)

How It Works

VAD monitors audio energy in real time to detect speech and silence.

Algorithm:
  1. RMS calculation: Root Mean Square of each audio chunk
    rms = np.sqrt(np.mean(audio_chunk ** 2))
    
  2. Threshold comparison: If rms > silence_threshold, speech is detected
  3. Silence counter: Consecutive silent frames are counted
  4. End-of-speech: Stop recording after 1.5 seconds of silence
Parameters (from audio_utils.py:16):
  • silence_threshold = 0.03 (RMS level)
  • silence_duration = 3.0 seconds (Classic Chat)
  • silence_duration = 1.5 seconds (Live Mode)
  • min_audio_duration = 1.0 seconds (minimum valid recording)
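The algorithm and parameters above can be sketched as follows. The frame length and the `detect_end_of_speech` helper are illustrative; the thresholds match the documented values.

```python
import numpy as np

SILENCE_THRESHOLD = 0.03   # RMS level below which a frame counts as silence
SILENCE_DURATION = 1.5     # seconds of silence that end a Live Mode utterance

def detect_end_of_speech(frames, frame_seconds=0.03):
    """Return True once `frames` ends with >= SILENCE_DURATION of silence."""
    needed = int(SILENCE_DURATION / frame_seconds)
    silent = 0
    for chunk in frames:
        rms = np.sqrt(np.mean(chunk ** 2))       # step 1: RMS energy
        if rms <= SILENCE_THRESHOLD:             # step 2: threshold compare
            silent += 1                          # step 3: count silent frames
        else:
            silent = 0                           # speech resets the counter
    return silent >= needed                      # step 4: end-of-speech
```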

Barge-In Detection (Live Mode Only)

Live Mode runs a continuous monitoring thread even while the AI is speaking:
# From main.py:1198-1237 (abridged)
speech_threshold = self.recorder.silence_threshold * 2.0
consecutive_speech_frames = 0
frames_needed = 4  # 4 frames = ~120ms

while self.is_running:
    rms = np.sqrt(np.mean(chunk ** 2))  # chunk = latest mic frame
    if rms > speech_threshold:
        consecutive_speech_frames += 1
        if consecutive_speech_frames >= frames_needed:
            self.user_speaking.set()  # Signal interruption
    else:
        consecutive_speech_frames = 0  # reset on silence
When triggered:
  • Audio playback stops immediately (player.stop())
  • TTS queue is cleared
  • Mode loops back to listening

Audio Playback

PipeWire Integration

ChatbotAI-Free uses PipeWire for audio playback via the paplay command (through the PulseAudio compatibility layer).

Why PipeWire?
  • Non-blocking: TTS never locks the audio device
  • Mixing: Other apps (YouTube, Spotify) play simultaneously without conflict
  • Low latency: Instant playback start

Playback Pipeline

1. TTS output

Neural TTS generates float32 audio samples (24 kHz for Kokoro; the rate varies by Sherpa voice).
2. Normalization

Samples are normalized to [-1.0, 1.0] range, then converted to int16 for WAV format.
# From audio_utils.py:218-225
max_val = np.abs(audio_data).max()
if max_val > 1.0:
    audio_data = audio_data / max_val
audio_int16 = (audio_data * 32767.0).astype(np.int16)
3. Temporary WAV file

Audio is written to a temporary .wav file using Python’s wave module.
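This step can be sketched with the stdlib `wave` module; the helper name and the 24 kHz default are illustrative.

```python
import tempfile
import wave

def write_temp_wav(audio_int16, sample_rate=24000):
    """Write int16 mono samples to a temporary WAV file and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    with wave.open(tmp.name, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # int16 = 2 bytes per sample
        wf.setframerate(sample_rate)
        wf.writeframes(audio_int16.tobytes())
    return tmp.name
```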
4. paplay spawning

A subprocess runs paplay temp.wav. PipeWire handles mixing and routing.
# From audio_utils.py:241-245
subprocess.Popen(['paplay', tmp_path],
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.PIPE)
5. Cleanup

The temporary file is deleted after playback completes or is interrupted.
If paplay is not installed, the app falls back to sounddevice.play() (direct ALSA/JACK output).
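The paplay-or-fallback decision can be sketched as follows; the helper names are illustrative, and the real fallback path calls sounddevice.play().

```python
import shutil
import subprocess

def playback_command(wav_path):
    """Return the argv for paplay, or None if the caller must fall back."""
    if shutil.which("paplay"):
        return ["paplay", wav_path]
    return None

def play(wav_path):
    cmd = playback_command(wav_path)
    if cmd is not None:
        # Non-blocking: PipeWire mixes this stream with other apps.
        return subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.PIPE)
    # Fallback: direct output via sounddevice.play() (not shown here).
    return None
```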

Configuration

Settings Panel (⚙️)

Adjust voice system parameters:

STT Settings:
  • Whisper Model: base / small / medium / large-v3
  • Language: English / Spanish
TTS Settings:
  • Voice: Select from Kokoro + Sherpa voices
  • Voice Speed: 0.5× to 2.0× (slider)
Audio Devices:
  • Input Device: Select microphone
  • Output Device: Select speaker (info only; paplay uses system default)
Recording Mode:
  • Auto-send: Automatically send after silence detected
  • Manual: Tap mic twice (record → send)
Changes to Whisper model require an app restart. Other settings take effect immediately.
