- Streaming STT — Real-time transcription during live mic input (always active)
- Offline STT — High-accuracy batch transcription for audio files (user-switchable)
Quick Comparison
| Model | Category | Backend | Size | Accuracy | Languages | Features |
|---|---|---|---|---|---|---|
| Zipformer ⭐ | Streaming | k2-fsa | 50 MB | Good | English | Real-time streaming for live mic |
| Whisper base.en ⭐ | Offline | OpenAI Whisper | 140 MB | ~5% WER | English | Fast batch transcription |
| Parakeet TDT 0.6B 🏆 | Offline | NVIDIA NeMo | 640 MB | ~1.9% WER | 25 languages | Best accuracy + auto-punctuation |
⭐ = Default (ships with `rcli setup`)
🏆 = Recommended upgrade for best accuracy
Streaming STT: Zipformer
Always active for live mic input. Provides real-time transcription during voice conversations.
Specifications
- Provider: k2-fsa/sherpa-onnx
- Architecture: Zipformer transducer
- Size: ~50 MB (encoder + decoder + joiner)
- Languages: English only
- Accuracy: Good (optimized for streaming latency)
- Latency: ~43.7 ms average on M3 Max
- Real-time Factor: 0.022x (44x faster than real-time)
- License: Apache 2.0
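The latency and real-time factor figures above are related by simple arithmetic: RTF is processing time divided by audio duration, and its reciprocal is the speedup over real time. A quick sketch (the ~2 s chunk length is an illustrative assumption, not a documented RCLI value):

```python
# Real-time factor (RTF) = processing time / audio duration.
# RTF below 1.0 means the model runs faster than real time.

def real_time_factor(processing_ms: float, audio_ms: float) -> float:
    return processing_ms / audio_ms

# Assuming ~2-second audio chunks (illustrative assumption):
rtf = real_time_factor(43.7, 2000.0)
speedup = 1.0 / rtf

print(f"RTF: {rtf:.3f}x")                  # ≈ 0.022x, matching the spec above
print(f"{speedup:.0f}x faster than real time")
```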
Key Features
- Streaming architecture — Processes audio chunks in real-time as you speak
- Low latency — Sub-50ms transcription for immediate LLM processing
- VAD integration — Works with Silero VAD for endpoint detection
- Lock-free ring buffer — Zero-copy audio transfer to LLM thread
- Always active — Cannot be disabled (required for live mode)
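The lock-free ring buffer mentioned above can be sketched as a single-producer/single-consumer queue. This is a simplified Python illustration of the idea, not RCLI's actual implementation (which would use atomic indices in a compiled language to stay lock-free):

```python
class SPSCRingBuffer:
    """Single-producer single-consumer ring buffer for audio samples.

    In a real lock-free implementation the head/tail indices are atomic,
    so the audio thread (producer) and the LLM thread (consumer) never
    block each other.
    """

    def __init__(self, capacity: int):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.head = 0  # written only by the producer
        self.tail = 0  # written only by the consumer

    def push(self, samples) -> bool:
        for s in samples:
            nxt = (self.head + 1) % self.capacity
            if nxt == self.tail:   # buffer full: reject the remainder
                return False
            self.buf[self.head] = s
            self.head = nxt
        return True

    def pop_all(self):
        out = []
        while self.tail != self.head:
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % self.capacity
        return out

rb = SPSCRingBuffer(8)
rb.push([0.1, 0.2, 0.3])
print(rb.pop_all())  # [0.1, 0.2, 0.3]
```

Because each index is written by exactly one thread, no mutex is needed; the consumer drains whatever the producer has committed so far.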
Model Files
Stored in `~/Library/RCLI/models/zipformer/`:
Benchmarks
| Metric | Value |
|---|---|
| Average latency | 43.7 ms |
| Real-time factor | 0.022x |
| Throughput | 44x faster than real-time |
| CPU usage | ~2-4% per core |
| Memory | ~120 MB |
Offline STT Models
Offline models process recorded audio files or complete voice segments with higher accuracy.
Whisper base.en (Default)
The default choice for most users. Fast, accurate, small footprint.
Specifications
- Provider: OpenAI Whisper
- Architecture: Transformer encoder-decoder
- Size: ~140 MB (encoder + decoder + tokens)
- Languages: English only (`.en` model)
- Accuracy: ~5% WER (Word Error Rate)
- Latency: ~43.7 ms on M3 Max
- License: MIT
- Download: `rcli setup` (default)
Model Files
Stored in `~/Library/RCLI/models/whisper-base.en/`:
When to Use
- General voice commands and conversations
- Fast batch transcription
- Limited disk space (only 140 MB)
- English-only use cases
Parakeet TDT 0.6B v3 (Recommended)
Best accuracy available. 1.9% WER with auto-punctuation and 25-language support.
Specifications
- Provider: NVIDIA NeMo
- Architecture: TDT (Token-and-Duration Transducer)
- Size: ~640 MB (encoder + decoder + joiner + tokens)
- Languages: 25 languages (English, Spanish, French, German, Chinese, Japanese, etc.)
- Accuracy: ~1.9% WER on LibriSpeech test-clean
- Auto-punctuation: Yes (commas, periods, question marks)
- License: CC-BY-4.0
- Download: `rcli upgrade-stt`
Model Files
Stored in `~/Library/RCLI/models/parakeet-tdt/`:
When to Use
- Best possible transcription quality
- Multilingual voice input
- Automatic punctuation required
- Professional use cases (dictation, meeting notes)
- Disk space not a constraint (640 MB)
Supported Languages
English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Slovak, Romanian, Hungarian, Bulgarian, Arabic, Hebrew, Turkish, Persian, Urdu, Hindi, Bengali, Chinese, Japanese, Korean
Accuracy Comparison
Word Error Rate (WER)
Lower is better. Measured on LibriSpeech test-clean dataset:
| Model | WER | Relative Improvement |
|---|---|---|
| Parakeet TDT 0.6B | 1.9% | Baseline (best) |
| Whisper base.en | ~5% | 2.6x higher error rate |
| Zipformer (streaming) | ~8-10% | Optimized for latency, not accuracy |
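WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal implementation for checking figures like those in the table:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One split word ("to day") counts as one substitution plus one insertion:
print(wer("hello how are you today", "hello how are you to day"))  # 0.4
```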
Punctuation
| Model | Auto-Punctuation | Example Output |
|---|---|---|
| Parakeet TDT | Yes | "Hello, how are you today?" |
| Whisper base.en | No | "hello how are you today" |
| Zipformer | No | "hello how are you today" |
Switching STT Models
RCLI automatically uses the highest-priority offline STT model installed.
Check Active STT
Upgrade to Parakeet
Manual Model Selection
Use the interactive model browser:
Inference Backends
sherpa-onnx
All STT models run through sherpa-onnx, which provides:
- ONNX Runtime with Metal GPU acceleration
- Streaming and offline decoding
- CTC, RNN-T, and transducer support
- Optimized for Apple Silicon
Model Format
All models use INT8 quantization for optimal performance:
- 8-bit integer weights
- ~75% size reduction vs. FP32
- Minimal accuracy loss (<1% WER increase)
- Fast inference on Metal GPU
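INT8 quantization maps 32-bit float weights onto 8-bit integers through a per-tensor scale factor. A sketch of the symmetric scheme (illustrative only, not sherpa-onnx's exact quantizer):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ≈ q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.01, 0.9]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))

# Each int8 weight is 1 byte vs. 4 bytes for FP32: the ~75% size reduction above.
print(q, f"max reconstruction error: {max_err:.4f}")
```

The reconstruction error is bounded by half the scale step, which is why the accuracy loss stays small in practice.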
Benchmarks
Run comprehensive STT benchmarks:
Storage Requirements
| Setup | Total Size | Models Installed |
|---|---|---|
| Default | ~190 MB | Zipformer + Whisper base.en |
| Upgraded | ~690 MB | Zipformer + Parakeet TDT |
| Both offline | ~780 MB | Zipformer + Whisper + Parakeet |
You can have multiple offline STT models installed. RCLI automatically selects the highest-priority model.
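The highest-priority selection can be sketched as an ordered lookup over the model directories listed earlier. This is an illustration, not RCLI's actual internals; the priority order is an assumption based on the "Recommended" labels above:

```python
import os
from typing import Optional

# Highest priority first: Parakeet TDT is preferred over Whisper base.en
# when both are installed (assumed ordering, not confirmed RCLI behavior).
OFFLINE_STT_PRIORITY = ["parakeet-tdt", "whisper-base.en"]

def active_offline_stt(models_dir: str) -> Optional[str]:
    """Return the highest-priority offline STT model present on disk."""
    for name in OFFLINE_STT_PRIORITY:
        if os.path.isdir(os.path.join(models_dir, name)):
            return name
    return None
```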
VAD Integration
All STT models work with Silero VAD for voice activity detection:
- Size: 0.6 MB (ultra-lightweight)
- Latency: <10 ms per chunk
- Accuracy: 99%+ speech detection
- Function: Filters silence, detects speech endpoints
- Always active during live mode
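The kind of endpoint detection Silero VAD performs can be illustrated with a toy energy-based detector (a sketch only: Silero is a neural model, not an energy threshold, and the frame/threshold values here are arbitrary):

```python
def detect_speech_endpoints(frames, threshold=0.01, min_silence_frames=3):
    """Return (start, end) frame index pairs for detected speech segments.

    A frame counts as speech when its mean energy exceeds the threshold;
    a segment's endpoint fires after min_silence_frames consecutive quiet
    frames, which is when RCLI would hand the segment to the STT model.
    """
    segments, start, silence = [], None, 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:                 # speech ran to the end of input
        segments.append((start, len(frames)))
    return segments

quiet, loud = [0.0] * 10, [0.5] * 10
frames = [quiet, loud, loud, quiet, quiet, quiet, quiet]
print(detect_speech_endpoints(frames))   # [(1, 3)]
```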
Next Steps
Upgrade STT
Upgrade to Parakeet TDT for best accuracy
Model Browser
Interactive STT model management
Benchmarks
Measure STT performance on your Mac
Voice Modes
Learn about push-to-talk and continuous listening