RCLI uses two types of STT models working together:
  • Streaming STT — Real-time transcription during live mic input (always active)
  • Offline STT — High-accuracy batch transcription for audio files (user-switchable)

Quick Comparison

| Model | Category | Backend | Size | Accuracy | Languages | Features |
|---|---|---|---|---|---|---|
| Zipformer | Streaming | k2-fsa | 50 MB | Good | English | Real-time streaming for live mic |
| Whisper base.en ⭐ | Offline | OpenAI Whisper | 140 MB | ~5% WER | English | Fast batch transcription |
| Parakeet TDT 0.6B 🏆 | Offline | NVIDIA NeMo | 640 MB | ~1.9% WER | 25 languages | Best accuracy + auto-punctuation |
⭐ = Default (ships with rcli setup)
🏆 = Recommended upgrade for best accuracy

Streaming STT: Zipformer

Always active for live mic input. Provides real-time transcription during voice conversations.

Specifications

  • Provider: k2-fsa/sherpa-onnx
  • Architecture: Zipformer transducer
  • Size: ~50 MB (encoder + decoder + joiner)
  • Languages: English only
  • Accuracy: Good (optimized for streaming latency)
  • Latency: ~43.7 ms average on M3 Max
  • Real-time Factor: 0.022x (44x faster than real-time)
  • License: Apache 2.0
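
The real-time factor above is simply processing time divided by audio duration; its reciprocal is the speed-up over real-time. A minimal sketch of the arithmetic (the function name is illustrative, not part of RCLI):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (RTF): processing time relative to audio length.

    RTF < 1.0 means faster than real-time; 1 / RTF is the speed-up factor.
    """
    return processing_seconds / audio_seconds

# e.g. transcribing 10 s of audio in 0.25 s:
rtf = real_time_factor(0.25, 10.0)
print(f"RTF: {rtf:.3f}x ({1 / rtf:.0f}x faster than real-time)")
```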

Key Features

  • Streaming architecture — Processes audio chunks in real-time as you speak
  • Low latency — Sub-50ms transcription for immediate LLM processing
  • VAD integration — Works with Silero VAD for endpoint detection
  • Lock-free ring buffer — Zero-copy audio transfer to LLM thread
  • Always active — Cannot be disabled (required for live mode)
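
The single-producer/single-consumer ring buffer pattern mentioned above can be sketched as follows. This is an illustrative toy, not RCLI's actual implementation: a real lock-free version would use atomic indices over preallocated memory, and Python's lists and GIL make "lock-free" only conceptual here.

```python
class RingBuffer:
    """Toy single-producer/single-consumer ring buffer.

    The audio thread only advances `head`; the consumer only advances
    `tail`. Because each index has exactly one writer, no lock is needed
    in a true SPSC design. One slot is sacrificed to distinguish full
    from empty.
    """

    def __init__(self, capacity: int):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.head = 0  # next write position (producer only)
        self.tail = 0  # next read position (consumer only)

    def push(self, sample: float) -> bool:
        nxt = (self.head + 1) % self.capacity
        if nxt == self.tail:  # full: drop rather than block the audio thread
            return False
        self.buf[self.head] = sample
        self.head = nxt
        return True

    def pop(self):
        if self.tail == self.head:  # empty
            return None
        sample = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        return sample
```

Dropping on overflow (rather than blocking) keeps the capture thread real-time safe, which is the usual choice for live audio paths.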

Model Files

Stored in ~/Library/RCLI/models/zipformer/:
zipformer/
├── encoder-epoch-99-avg-1.int8.onnx
├── decoder-epoch-99-avg-1.int8.onnx
├── joiner-epoch-99-avg-1.int8.onnx
└── tokens.txt

Benchmarks

rcli bench --suite stt  # Benchmark active STT
Measured on Apple M3 Max:
| Metric | Value |
|---|---|
| Average latency | 43.7 ms |
| Real-time factor | 0.022x |
| Throughput | 44x faster than real-time |
| CPU usage | ~2-4% per core |
| Memory | ~120 MB |

Offline STT Models

Offline models process recorded audio files or complete voice segments with higher accuracy.

Whisper base.en (Default)

Recommended for most users. Fast, accurate, small footprint.

Specifications

  • Provider: OpenAI Whisper
  • Architecture: Transformer encoder-decoder
  • Size: ~140 MB (encoder + decoder + tokens)
  • Languages: English only (.en model)
  • Accuracy: ~5% WER (Word Error Rate)
  • Latency: ~43.7 ms on M3 Max
  • License: MIT
  • Download: rcli setup (default)

Model Files

Stored in ~/Library/RCLI/models/whisper-base.en/:
whisper-base.en/
├── base.en-encoder.int8.onnx
├── base.en-decoder.int8.onnx
└── base.en-tokens.txt

When to Use

  • General voice commands and conversations
  • Fast batch transcription
  • Limited disk space (only 140 MB)
  • English-only use cases

Parakeet TDT 0.6B (Recommended Upgrade)

Best accuracy available. ~1.9% WER with auto-punctuation and 25-language support.

Specifications

  • Provider: NVIDIA NeMo
  • Architecture: TDT (Token-and-Duration Transducer)
  • Size: ~640 MB (encoder + decoder + joiner + tokens)
  • Languages: 25 languages (English, Spanish, French, German, Chinese, Japanese, etc.)
  • Accuracy: ~1.9% WER on LibriSpeech test-clean
  • Auto-punctuation: Yes (commas, periods, question marks)
  • License: CC-BY-4.0
  • Download: rcli upgrade-stt

Model Files

Stored in ~/Library/RCLI/models/parakeet-tdt/:
parakeet-tdt/
├── encoder.int8.onnx
├── decoder.int8.onnx
├── joiner.int8.onnx
└── tokens.txt

When to Use

  • Best possible transcription quality
  • Multilingual voice input
  • Automatic punctuation required
  • Professional use cases (dictation, meeting notes)
  • Disk space not a constraint (640 MB)

Supported Languages

English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Slovak, Romanian, Hungarian, Bulgarian, Arabic, Hebrew, Turkish, Persian, Urdu, Hindi, Bengali, Chinese, Japanese, Korean

Accuracy Comparison

Word Error Rate (WER)

Lower is better. Measured on LibriSpeech test-clean dataset:
| Model | WER | Notes |
|---|---|---|
| Parakeet TDT 0.6B | ~1.9% | Baseline (best) |
| Whisper base.en | ~5% | ~2.6x the error rate of Parakeet |
| Zipformer (streaming) | ~8-10% | Optimized for latency, not accuracy |
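
WER counts word-level substitutions, deletions, and insertions against a reference transcript, normalized by the reference length. It is the standard word-level Levenshtein distance; a minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed as word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of five reference words:
print(word_error_rate("hello how are you today", "hello how are you"))  # 0.2
```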

Punctuation

| Model | Auto-Punctuation | Example Output |
|---|---|---|
| Parakeet TDT | Yes | "Hello, how are you today?" |
| Whisper base.en | No | "hello how are you today" |
| Zipformer | No | "hello how are you today" |

Switching STT Models

RCLI automatically uses the highest-priority offline STT model installed.

Check Active STT

rcli info  # Shows active offline STT model

Upgrade to Parakeet

rcli upgrade-stt  # Downloads Parakeet TDT (~640 MB)
After download completes, Parakeet becomes active immediately (no restart required).

Manual Model Selection

Use the interactive model browser:
rcli models  # Navigate to STT section
Or edit config manually:
# ~/Library/RCLI/config
stt_model=parakeet-tdt  # or whisper-base
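
The "highest-priority installed model wins" rule can be sketched as a scan over the models directory. The priority list and function names here are illustrative assumptions, not RCLI internals; only the directory layout follows the paths shown above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical priority order: best-accuracy model first.
# Directory names follow the ~/Library/RCLI/models/ layout shown above.
STT_PRIORITY = ["parakeet-tdt", "whisper-base.en"]

def select_offline_stt(models_dir: Path) -> Optional[str]:
    """Return the highest-priority offline STT model that is installed,
    or None if no offline model directory exists."""
    for name in STT_PRIORITY:
        if (models_dir / name).is_dir():
            return name
    return None
```

With this rule, installing Parakeet alongside Whisper makes Parakeet active immediately, matching the behavior described for rcli upgrade-stt.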

Inference Backends

sherpa-onnx

All STT models run through sherpa-onnx, which provides:
  • ONNX Runtime with Metal GPU acceleration
  • Streaming and offline decoding
  • CTC, RNN-T, and transducer support
  • Optimized for Apple Silicon

Model Format

All models use INT8 quantization for optimal performance:
  • 8-bit integer weights
  • ~75% size reduction vs. FP32
  • Minimal accuracy loss (<1% WER increase)
  • Fast inference on Metal GPU
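
The ~75% size reduction follows directly from storing one byte per weight instead of four. A toy sketch of per-tensor symmetric INT8 quantization (one shared scale mapping the largest weight to ±127, in the spirit of ONNX-style quantization; the exact scheme sherpa-onnx models use is not shown here):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization sketch: map floats into
    [-127, 127] using a single scale derived from the largest magnitude."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # one byte each vs. four for FP32
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights at inference time."""
    return [x * scale for x in q]

q, scale = quantize_int8([0.5, -1.27, 0.003, 1.0])
# Small weights lose some precision (0.003 rounds to 0 here), which is
# why quantization costs a little accuracy.
```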

Benchmarks

Run comprehensive STT benchmarks:
rcli bench --suite stt  # Benchmark active STT setup
Example output (M3 Max):
=== STT Benchmark ===
Streaming STT (Zipformer):
  Average latency: 43.7 ms
  Real-time factor: 0.022x
  
Offline STT (Parakeet TDT):
  Average latency: 52.3 ms
  Real-time factor: 0.026x
  Accuracy (WER): 1.9%

Storage Requirements

| Setup | Total Size | Models Installed |
|---|---|---|
| Default | ~190 MB | Zipformer + Whisper base.en |
| Upgraded | ~690 MB | Zipformer + Parakeet TDT |
| Both offline | ~830 MB | Zipformer + Whisper + Parakeet |
You can have multiple offline STT models installed. RCLI automatically selects the highest-priority model.

VAD Integration

All STT models work with Silero VAD for voice activity detection:
  • Size: 0.6 MB (ultra-lightweight)
  • Latency: <10 ms per chunk
  • Accuracy: 99%+ speech detection
  • Function: Filters silence, detects speech endpoints
  • Always active during live mode
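
To illustrate the role VAD plays (silence filtering and endpoint detection), here is a deliberately simple energy-based detector. Silero VAD is a small neural model, not this heuristic; the function and its parameters are purely illustrative.

```python
def detect_speech(frames, threshold=0.01, hang_frames=3):
    """Toy energy-based endpoint detector showing what a VAD does:
    label each frame speech/silence, then mark the endpoint once speech
    has been followed by `hang_frames` consecutive silent frames.

    `frames` is a list of sample lists; returns (speech_flags, endpoint_index).
    """
    flags, silent_run, endpoint = [], 0, None
    in_speech = False
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)  # mean power
        is_speech = energy > threshold
        flags.append(is_speech)
        if is_speech:
            in_speech, silent_run = True, 0
        elif in_speech:
            silent_run += 1
            if silent_run >= hang_frames and endpoint is None:
                endpoint = i
    return flags, endpoint
```

The "hang" of a few silent frames before declaring an endpoint is what keeps short pauses inside a sentence from cutting the utterance in two.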

Next Steps

Upgrade STT

Upgrade to Parakeet TDT for best accuracy

Model Browser

Interactive STT model management

Benchmarks

Measure STT performance on your Mac

Voice Modes

Learn about push-to-talk and continuous listening
