Selecting Ollama Models
Choose which LLM model to use for conversation from the top dropdown menu. The selected model is stored in user_preferences.json. If set to null, the app uses the first available model.

Supported Model Types

ChatbotAI-Free works with any Ollama-compatible model:
- Llama (llama3.1, llama3.2, llama3.3)
- Mistral (mistral, mixtral)
- Gemma (gemma, gemma2)
- Qwen (qwen2, qwen2.5)
- Deepseek (deepseek-r1, deepseek-v3)
- Any custom or fine-tuned model
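The stored-preference fallback can be sketched in Python. This is a minimal illustration, not the app's actual code: the `"model"` key name in user_preferences.json is an assumption.

```python
import json
from pathlib import Path


def resolve_model(preferred, available):
    """Return the stored model if set, otherwise the first available one."""
    if preferred is not None:
        return preferred
    return available[0] if available else None


# Reading the stored preference; the "model" key name is an assumption.
prefs_path = Path("user_preferences.json")
preferred = None
if prefs_path.exists():
    preferred = json.loads(prefs_path.read_text()).get("model")

# With the ollama Python client, the available list can be obtained
# from ollama.list(); a fixed list is used here for illustration.
print(resolve_model(preferred, ["llama3.1", "mistral"]))
```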
Whisper Model Sizes
Choose your speech-to-text quality and speed tradeoff from the Settings panel.

Available Models
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| base | ~140 MB | Fastest | Good | Real-time conversation (default) |
| small | ~460 MB | Fast | Better | Balanced performance |
| medium | ~1.5 GB | Moderate | High | Clear transcriptions |
| large-v3 | ~2.9 GB | Slower | Best | Maximum accuracy |
ChatbotAI-Free always uses multilingual Whisper models (not .en variants) to support both English and Spanish. The .en suffix is automatically stripped in ai_manager.py:63.

Model Loading
Whisper models are loaded via faster-whisper, with CUDA acceleration if available.

Quality vs. Speed Tradeoffs

Choosing the right Whisper model depends on your hardware and use case.

base (Recommended for most users)
Best for: Real-time conversation, Live Mode
- Fast transcription (< 1 second on modern CPUs)
- Good accuracy for clear speech
- Low VRAM usage (~500 MB on GPU)
- Default selection
small
Best for: Better accuracy without major slowdown
- Still fast enough for conversation
- Improved accuracy for accents and noisy environments
- Moderate VRAM usage (~1 GB on GPU)
medium
Best for: High-quality transcription
- Noticeable processing time (2-3 seconds)
- Excellent accuracy
- Higher VRAM usage (~2 GB on GPU)
- Not ideal for Live Mode
large-v3
Best for: Maximum accuracy, offline use only
- Slower transcription (5-10 seconds)
- Best possible accuracy
- Significant VRAM usage (~4 GB on GPU)
- Only use for Reading Practice mode or when accuracy is critical
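The loading step from the Model Loading section above can be sketched as follows. This assumes faster-whisper's `WhisperModel` API; the actual code in ai_manager.py may differ.

```python
# The four sizes offered in the Settings panel.
WHISPER_SIZES = ("base", "small", "medium", "large-v3")


def load_whisper(size: str):
    """Load a multilingual Whisper model via faster-whisper."""
    if size not in WHISPER_SIZES:
        raise ValueError(f"unsupported Whisper size: {size}")
    from faster_whisper import WhisperModel
    # device="auto" selects CUDA when available, otherwise CPU.
    return WhisperModel(size, device="auto")
```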
Transcription Parameters
The app uses optimized transcription settings for all Whisper models (see ai_manager.py:157-172).
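As a hedged illustration, such settings map onto keyword arguments of faster-whisper's `transcribe()`. The values below are placeholders, not the app's actual settings from ai_manager.py:157-172.

```python
# Illustrative values only -- the real settings live in ai_manager.py:157-172.
TRANSCRIBE_KWARGS = {
    "beam_size": 5,      # wider beam search: better accuracy, slower
    "vad_filter": True,  # voice-activity detection skips long silences
    "language": None,    # None lets Whisper auto-detect English or Spanish
}

# Applied as: segments, info = model.transcribe(audio_path, **TRANSCRIBE_KWARGS)
```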
Context Window Settings
Control how much conversation history the LLM can see.

Default Behavior
When context_size is set to 0 (the default), the app uses each model's built-in context window:
- Llama 3.1: 128,000 tokens
- Mistral: 32,768 tokens
- Gemma 2: 8,192 tokens
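With the Ollama client, the zero-means-default behavior can be sketched by only sending `num_ctx` when a custom size is set. How the app actually wires this is an assumption.

```python
def ollama_options(context_size: int) -> dict:
    """Build Ollama request options; 0 means 'use the model's default'."""
    return {} if context_size == 0 else {"num_ctx": context_size}


# Usage with the ollama Python client:
# ollama.chat(model="llama3.1", messages=msgs, options=ollama_options(8192))
```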
Custom Context Size
Override the model's default by setting a custom value in Settings.

Context Window Indicator
The bottom bar shows a donut chart indicating current context usage. Click it to see detailed token statistics:
- Prompt tokens: Input text (user + history)
- Completion tokens: AI response
- Total tokens: Sum of both
These statistics are read from the model response (ai_manager.py:344-351).
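Ollama reports token usage as `prompt_eval_count` and `eval_count` in its response metadata; a sketch of deriving the three numbers above:

```python
def token_stats(resp: dict) -> dict:
    """Derive the donut-chart numbers from an Ollama chat response.

    Ollama reports prompt_eval_count (prompt tokens) and eval_count
    (completion tokens) alongside the generated message.
    """
    prompt = resp.get("prompt_eval_count", 0)
    completion = resp.get("eval_count", 0)
    return {"prompt": prompt, "completion": completion, "total": prompt + completion}
```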
Model Restart Requirements
Some model changes require restarting the application:

| Setting | Restart Required? |
|---|---|
| Switch Ollama model | No - takes effect immediately |
| Change context size | No - applies to next conversation |
| Change Whisper model | Yes - app restart required |
| Change voice speed | No - takes effect immediately |
When you change the Whisper model in Settings, the app offers to restart immediately. This is necessary because faster-whisper loads the model into memory at initialization.
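The restart rule in the table above can be sketched as a small helper; the setting key names here are hypothetical.

```python
# Hypothetical setting keys: only the Whisper model forces a restart,
# because faster-whisper loads the model into memory at initialization.
RESTART_REQUIRED = {"whisper_model"}


def needs_restart(changed_setting: str) -> bool:
    """Return True if changing this setting requires an app restart."""
    return changed_setting in RESTART_REQUIRED
```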
GPU Acceleration
Both Ollama and faster-whisper automatically use CUDA if available.

Whisper STT
- GPU: Uses float16 precision for speed
- CPU: Uses int8 quantization for efficiency
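The precision choice above can be sketched as a small helper. Detecting CUDA via ctranslate2 (faster-whisper's backend) is one option; how ai_manager.py does it is an assumption.

```python
def runtime_settings(cuda_available: bool):
    """GPU -> float16 for speed; CPU -> int8 quantization for efficiency."""
    return ("cuda", "float16") if cuda_available else ("cpu", "int8")


# One way to detect CUDA, via faster-whisper's ctranslate2 backend:
# import ctranslate2
# device, compute_type = runtime_settings(ctranslate2.get_cuda_device_count() > 0)
# model = WhisperModel("base", device=device, compute_type=compute_type)
```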