- Streaming STT — Real-time transcription during live mic input (always active)
- Offline STT — High-accuracy batch transcription for audio files (user-switchable)
Quick Comparison
| Model | Category | Backend | Size | Accuracy | Languages | Features |
|---|---|---|---|---|---|---|
| Zipformer ⭐ | Streaming | k2-fsa | 50 MB | Good | English | Real-time streaming for live mic |
| Whisper base.en ⭐ | Offline | OpenAI Whisper | 140 MB | ~5% WER | English | Fast batch transcription |
| Parakeet TDT 0.6B 🏆 | Offline | NVIDIA NeMo | 640 MB | ~1.9% WER | 25 languages | Best accuracy + auto-punctuation |
⭐ = Default (ships with `rcli setup`)
🏆 = Recommended upgrade for best accuracy
Streaming STT: Zipformer
Always active for live mic input. Provides real-time transcription during voice conversations.
Specifications
- Provider: k2-fsa/sherpa-onnx
- Architecture: Zipformer transducer
- Size: ~50 MB (encoder + decoder + joiner)
- Languages: English only
- Accuracy: Good (optimized for streaming latency)
- Latency: ~43.7 ms average on M3 Max
- Real-time Factor: 0.022x (44x faster than real-time)
- License: Apache 2.0
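The latency and real-time factor figures above are related by simple arithmetic: RTF is processing time divided by audio duration, and its reciprocal is the speedup over real time. A quick sketch (the ~2 s chunk length is an illustrative assumption, not a documented RCLI value):

```python
# Real-time factor (RTF) = processing time / audio duration.
# RTF below 1.0 means the model runs faster than real time.

def real_time_factor(processing_ms: float, audio_ms: float) -> float:
    return processing_ms / audio_ms

# Assuming ~2-second audio chunks (illustrative assumption):
rtf = real_time_factor(43.7, 2000.0)
speedup = 1.0 / rtf

print(f"RTF: {rtf:.3f}x")                  # ≈ 0.022x, matching the spec above
print(f"{speedup:.0f}x faster than real time")
```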
Key Features
- Streaming architecture — Processes audio chunks in real-time as you speak
- Low latency — Sub-50ms transcription for immediate LLM processing
- VAD integration — Works with Silero VAD for endpoint detection
- Lock-free ring buffer — Zero-copy audio transfer to LLM thread
- Always active — Cannot be disabled (required for live mode)
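The lock-free ring buffer mentioned above can be sketched as a single-producer/single-consumer queue. This is a simplified Python illustration of the idea, not RCLI's actual implementation (which would use atomic indices in a compiled language to stay lock-free):

```python
class SPSCRingBuffer:
    """Single-producer single-consumer ring buffer for audio samples.

    In a real lock-free implementation the head/tail indices are atomic,
    so the audio thread (producer) and the LLM thread (consumer) never
    block each other.
    """

    def __init__(self, capacity: int):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.head = 0  # written only by the producer
        self.tail = 0  # written only by the consumer

    def push(self, samples) -> bool:
        for s in samples:
            nxt = (self.head + 1) % self.capacity
            if nxt == self.tail:   # buffer full: reject the remainder
                return False
            self.buf[self.head] = s
            self.head = nxt
        return True

    def pop_all(self):
        out = []
        while self.tail != self.head:
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % self.capacity
        return out

rb = SPSCRingBuffer(8)
rb.push([0.1, 0.2, 0.3])
print(rb.pop_all())  # [0.1, 0.2, 0.3]
```

Because each index is written by exactly one thread, no mutex is needed; the consumer drains whatever the producer has committed so far.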
Model Files
Stored in `~/Library/RCLI/models/zipformer/`:
Benchmarks
| Metric | Value |
|---|---|
| Average latency | 43.7 ms |
| Real-time factor | 0.022x |
| Throughput | 44x faster than real-time |
| CPU usage | ~2-4% per core |
| Memory | ~120 MB |
Offline STT Models
Offline models process recorded audio files or complete voice segments with higher accuracy.
Whisper base.en (Default)
The default choice for most users. Fast, accurate, small footprint.
Specifications
- Provider: OpenAI Whisper
- Architecture: Transformer encoder-decoder
- Size: ~140 MB (encoder + decoder + tokens)
- Languages: English only (`.en` model)
- Accuracy: ~5% WER (Word Error Rate)
- Latency: ~43.7 ms on M3 Max
- License: MIT
- Download: `rcli setup` (default)
Model Files
Stored in `~/Library/RCLI/models/whisper-base.en/`:
When to Use
- General voice commands and conversations
- Fast batch transcription
- Limited disk space (only 140 MB)
- English-only use cases
Parakeet TDT 0.6B v3 (Recommended)
Best accuracy available. 1.9% WER with auto-punctuation and 25-language support.
Specifications
- Provider: NVIDIA NeMo
- Architecture: TDT (Token-and-Duration Transducer)
- Size: ~640 MB (encoder + decoder + joiner + tokens)
- Languages: 25 languages (English, Spanish, French, German, Chinese, Japanese, etc.)
- Accuracy: ~1.9% WER on LibriSpeech test-clean
- Auto-punctuation: Yes (commas, periods, question marks)
- License: CC-BY-4.0
- Download: `rcli upgrade-stt`
Model Files
Stored in `~/Library/RCLI/models/parakeet-tdt/`:
When to Use
- Best possible transcription quality
- Multilingual voice input
- Automatic punctuation required
- Professional use cases (dictation, meeting notes)
- Disk space not a constraint (640 MB)
Supported Languages
English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Slovak, Romanian, Hungarian, Bulgarian, Arabic, Hebrew, Turkish, Persian, Urdu, Hindi, Bengali, Chinese, Japanese, Korean
Accuracy Comparison
Word Error Rate (WER)
Lower is better. Measured on LibriSpeech test-clean dataset:
| Model | WER | Relative Improvement |
|---|---|---|
| Parakeet TDT 0.6B | 1.9% | Baseline (best) |
| Whisper base.en | ~5% | 2.6x higher error rate |
| Zipformer (streaming) | ~8-10% | Optimized for latency, not accuracy |
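WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal implementation for checking figures like those in the table:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One split word ("to day") counts as one substitution plus one insertion:
print(wer("hello how are you today", "hello how are you to day"))  # 0.4
```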
Punctuation
| Model | Auto-Punctuation | Example Output |
|---|---|---|
| Parakeet TDT | Yes | "Hello, how are you today?" |
| Whisper base.en | No | "hello how are you today" |
| Zipformer | No | "hello how are you today" |
Switching STT Models
RCLI automatically uses the highest-priority offline STT model installed.
Check Active STT
Upgrade to Parakeet
Manual Model Selection
Use the interactive model browser:
Inference Backends
sherpa-onnx
All STT models run through sherpa-onnx, which provides:
- ONNX Runtime with Metal GPU acceleration
- Streaming and offline decoding
- CTC, RNN-T, and transducer support
- Optimized for Apple Silicon
Model Format
All models use INT8 quantization for optimal performance:
- 8-bit integer weights
- ~75% size reduction vs. FP32
- Minimal accuracy loss (<1% WER increase)
- Fast inference on Metal GPU
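INT8 quantization maps 32-bit float weights onto 8-bit integers through a per-tensor scale factor. A sketch of the symmetric scheme (illustrative only, not sherpa-onnx's exact quantizer):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ≈ q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.01, 0.9]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))

# Each int8 weight is 1 byte vs. 4 bytes for FP32: the ~75% size reduction above.
print(q, f"max reconstruction error: {max_err:.4f}")
```

The reconstruction error is bounded by half the scale step, which is why the accuracy loss stays small in practice.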
Benchmarks
Run comprehensive STT benchmarks:
Storage Requirements
| Setup | Total Size | Models Installed |
|---|---|---|
| Default | ~190 MB | Zipformer + Whisper base.en |
| Upgraded | ~690 MB | Zipformer + Parakeet TDT |
| Both offline | ~780 MB | Zipformer + Whisper + Parakeet |
You can have multiple offline STT models installed. RCLI automatically selects the highest-priority model.
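The highest-priority selection can be sketched as an ordered lookup over the model directories listed earlier. This is an illustration, not RCLI's actual internals; the priority order is an assumption based on the "Recommended" labels above:

```python
import os
from typing import Optional

# Highest priority first: Parakeet TDT is preferred over Whisper base.en
# when both are installed (assumed ordering, not confirmed RCLI behavior).
OFFLINE_STT_PRIORITY = ["parakeet-tdt", "whisper-base.en"]

def active_offline_stt(models_dir: str) -> Optional[str]:
    """Return the highest-priority offline STT model present on disk."""
    for name in OFFLINE_STT_PRIORITY:
        if os.path.isdir(os.path.join(models_dir, name)):
            return name
    return None
```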
VAD Integration
All STT models work with Silero VAD for voice activity detection:
- Size: 0.6 MB (ultra-lightweight)
- Latency: <10 ms per chunk
- Accuracy: 99%+ speech detection
- Function: Filters silence, detects speech endpoints
- Always active during live mode
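The kind of endpoint detection Silero VAD performs can be illustrated with a toy energy-based detector (a sketch only: Silero is a neural model, not an energy threshold, and the frame/threshold values here are arbitrary):

```python
def detect_speech_endpoints(frames, threshold=0.01, min_silence_frames=3):
    """Return (start, end) frame index pairs for detected speech segments.

    A frame counts as speech when its mean energy exceeds the threshold;
    a segment's endpoint fires after min_silence_frames consecutive quiet
    frames, which is when RCLI would hand the segment to the STT model.
    """
    segments, start, silence = [], None, 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:                 # speech ran to the end of input
        segments.append((start, len(frames)))
    return segments

quiet, loud = [0.0] * 10, [0.5] * 10
frames = [quiet, loud, loud, quiet, quiet, quiet, quiet]
print(detect_speech_endpoints(frames))   # [(1, 3)]
```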
Next Steps
Upgrade STT
Upgrade to Parakeet TDT for best accuracy
Model Browser
Interactive STT model management
Benchmarks
Measure STT performance on your Mac
Voice Modes
Learn about push-to-talk and continuous listening