Overview
Moonshine Voice offers a family of speech-to-text models trained from scratch, optimized for different accuracy/performance trade-offs. All models support flexible input windows and are designed for edge deployment.
Model Families
From core/moonshine-c-api.h:97-103:
- Non-Streaming: Process complete audio segments
- Streaming: Cache encoder/decoder state for incremental processing
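The difference can be sketched with a toy in Python. The `StreamingEncoder` and `NonStreamingEncoder` classes below are invented stand-ins (a doubling function plays the encoder), not the types from the C API:

```python
# Toy illustration of the streaming idea: instead of re-encoding all audio
# received so far, a streaming model carries cached state forward and only
# processes the newly arrived chunk. The "encoding" here is a stand-in
# (multiply by 2), not Moonshine's real transformer.
class StreamingEncoder:
    def __init__(self):
        self.cached_features = []   # encoder state carried across calls

    def feed(self, chunk):
        # Only the new chunk is processed; earlier work is reused.
        self.cached_features.extend(x * 2 for x in chunk)
        return self.cached_features

class NonStreamingEncoder:
    def encode(self, full_audio):
        # Re-processes the entire segment on every call.
        return [x * 2 for x in full_audio]

stream = StreamingEncoder()
stream.feed([1, 2])
incremental = stream.feed([3])                   # touches one new sample
full = NonStreamingEncoder().encode([1, 2, 3])   # touches three samples
assert incremental == full == [2, 4, 6]
```

Both paths produce the same features; the streaming path just avoids redoing work, which is where the latency savings in the benchmarks below come from.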
Available Models
From README.md:566-580, current English models:
| Architecture | Parameters | WER | Use Case |
|---|---|---|---|
| Tiny | 26M | 12.66% | Ultra-constrained devices |
| Tiny Streaming | 34M | 12.00% | IoT, wearables, real-time |
| Base | 58M | 10.07% | Offline transcription |
| Base Streaming | ~60M | ~10% | Balanced streaming |
| Small Streaming | 123M | 7.84% | Desktop apps, real-time |
| Medium Streaming | 245M | 6.65% | High accuracy, real-time |
Medium Streaming achieves 6.65% WER (better than Whisper Large v3's 7.44%) with 6x fewer parameters (245M vs 1.5B).
Non-English Models
From README.md:566-580:
| Language | Architecture | Parameters | WER/CER | Notes |
|---|---|---|---|---|
| Arabic | Base | 58M | 5.63% | Character Error Rate |
| Japanese | Base | 58M | 13.62% | Character Error Rate |
| Korean | Tiny | 26M | 6.46% | Character Error Rate |
| Mandarin | Base | 58M | 25.76% | Character Error Rate |
| Spanish | Base | 58M | 4.33% | Word Error Rate |
| Ukrainian | Base | 58M | 14.55% | Word Error Rate |
| Vietnamese | Base | 58M | 8.82% | Word Error Rate |
Model Architecture Details
Components
All Moonshine models consist of three files:
- encoder_model.ort: Audio → Latent representation
- decoder_model_merged.ort: Latent → Token sequence
- tokenizer.bin: Tokens → UTF-8 text
From core/moonshine-c-api.h:227-243:
Encoder Architecture
From README.md:591-593:
Frontend (Preprocessing):
- Learned convolution layers generate features
- Similar to MEL spectrograms but trained end-to-end
- Operates on 16-bit raw audio input
- Preserved at BFloat16 precision for accuracy
Encoder body:
- Transformer architecture
- Variable-length input support
- No zero-padding required (unlike Whisper)
- Outputs latent representation
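To make the frontend idea concrete, here is a minimal strided 1-D convolution over raw samples in Python. The kernel and stride are made up; in Moonshine these weights are learned end-to-end rather than fixed:

```python
# Sketch of the frontend idea: a strided 1-D convolution turns raw 16-bit
# PCM samples into feature frames, playing the role a fixed MEL filterbank
# plays in Whisper. This toy kernel simply averages adjacent samples.
def conv1d_frames(samples, kernel, stride):
    """Apply one convolution filter with the given stride."""
    k = len(kernel)
    frames = []
    for start in range(0, len(samples) - k + 1, stride):
        window = samples[start:start + k]
        frames.append(sum(w * x for w, x in zip(kernel, window)))
    return frames

audio = [0, 100, 200, 300, 400, 500]      # int16-range PCM samples
features = conv1d_frames(audio, kernel=[0.5, 0.5], stride=2)
# Each frame averages two adjacent samples: [50.0, 250.0, 450.0]
```

Because the frames are computed by sliding over whatever audio arrives, no fixed-length zero-padding is needed, which is the property the bullets above describe.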
Decoder Architecture
Autoregressive decoder:
- Takes encoder output + previous tokens
- Generates next token predictions
- Streaming models cache decoder state
- Beam search or greedy decoding
- Temperature-based sampling
- Length normalization
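A toy version of the greedy/temperature loop looks like this; `score_fn` is an invented stand-in for the real decoder, and beam search and length normalization are omitted for brevity:

```python
import math
import random

# Toy autoregressive decoding loop: at each step the "decoder" scores the
# vocabulary given the tokens so far; we either take the argmax (greedy,
# temperature 0) or sample from a temperature-scaled softmax.
def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def decode(score_fn, eos_id, max_len=10, temperature=0.0, rng=None):
    tokens = []
    for _ in range(max_len):
        logits = score_fn(tokens)
        if temperature == 0.0:
            next_id = max(range(len(logits)), key=logits.__getitem__)
        else:
            probs = softmax(logits, temperature)
            next_id = (rng or random).choices(range(len(logits)), probs)[0]
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens

# A fake scorer that prefers token 1 first, then end-of-sequence (id 0).
def score_fn(tokens):
    return [5.0, 1.0, 0.0] if tokens else [0.0, 5.0, 1.0]

assert decode(score_fn, eos_id=0) == [1]   # greedy: emit 1, stop at EOS
```

Streaming models additionally keep the decoder's attention caches between calls so each step only scores the newest position.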
Tokenizer
From core/moonshine-c-api.h:241-243:
The tokenizer.bin contains the token-to-character mapping for the model, in a compact binary format.
Language-specific tokenizers:
- English: ~500 tokens (subword units)
- Non-Latin: Larger vocabulary for character coverage
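What the mapping amounts to can be sketched with an invented five-entry vocabulary; the real tokenizer.bin serializes a far larger table in binary form:

```python
# Sketch of detokenization: each token id maps to a UTF-8 byte string, and
# producing text is concatenation until end-of-sequence. This tiny table is
# invented; note how "wor" + "ld" mimics subword units.
vocab = {0: b"<eos>", 1: b"hello", 2: b" ", 3: b"wor", 4: b"ld"}

def detokenize(token_ids):
    parts = []
    for tid in token_ids:
        if vocab[tid] == b"<eos>":
            break
        parts.append(vocab[tid])
    return b"".join(parts).decode("utf-8")

assert detokenize([1, 2, 3, 4, 0]) == "hello world"
```

Concatenating bytes before decoding matters for non-Latin scripts, where a single character can span multiple tokens' byte fragments.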
Performance Characteristics
Latency Comparison
From README.md:101-108: MacBook Pro M1:
| Model | Latency | Speed vs Whisper |
|---|---|---|
| Tiny Streaming | 34ms | 8x faster |
| Small Streaming | 73ms | 26x faster |
| Medium Streaming | 107ms | 105x faster |
| Whisper Tiny | 277ms | - |
| Whisper Small | 1940ms | - |
| Whisper Large v3 | 11286ms | - |

| Model | Latency | Usable? |
|---|---|---|
| Tiny Streaming | 237ms | ✅ Yes |
| Small Streaming | 527ms | ✅ Yes |
| Medium Streaming | 802ms | ⚠️ Marginal |
| Whisper Tiny | 5863ms | ❌ No |
| Whisper Small | 10397ms | ❌ No |
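The "Speed vs Whisper" column is a plain latency ratio against the size-matched Whisper model, which can be checked from the M1 numbers above:

```python
# Recomputing the speedup column: each Moonshine entry is compared against
# the Whisper model of similar size class from the same table.
def speedup(baseline_ms, model_ms):
    return baseline_ms / model_ms

whisper_tiny_ms = 277
whisper_large_v3_ms = 11286

assert round(speedup(whisper_tiny_ms, 34)) == 8        # Tiny Streaming
assert round(speedup(whisper_large_v3_ms, 107)) == 105 # Medium Streaming
```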
Memory Usage
Model file sizes (quantized):
- Tiny: ~30 MB total
- Base: ~70 MB total
- Small Streaming: ~134 MB total
- Medium Streaming: ~270 MB total
Runtime memory:
- Input audio buffer: ~2-5 MB
- Encoder cache (streaming): ~10-50 MB
- Decoder state (streaming): ~5-20 MB
- Total runtime: 50-300 MB depending on model
Model Selection Guide
By Platform
By Use Case
Real-time voice assistants:
Quantization
From README.md:589-594:
We typically quantize our models to eight-bit weights across the board, and eight-bit calculations for heavy operations like MatMul.
Quantization Strategy
- Weights: 8-bit integer quantization
- Activations: 8-bit for MatMul operations
- Frontend: BFloat16 precision (higher accuracy needed)
- File format: ONNX models converted to .ort flatbuffer format for:
  - Memory-mapped loading
  - Zero-copy inference
  - Reduced startup time
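A minimal sketch of what symmetric 8-bit weight quantization does; the real pipeline is handled by the ONNX toolchain, and this only shows where the ~4x size reduction (float32 → int8) comes from:

```python
# Symmetric int8 quantization: pick a scale so the largest-magnitude weight
# maps to 127, round each weight to an int8, and dequantize by multiplying
# the scale back. Round-trip error is bounded by scale / 2.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
assert all(abs(w - r) <= scale / 2 for w, r in zip(weights, restored))
```

The frontend is the exception noted below: its convolution inputs span the full 16-bit integer range, so it stays in BFloat16 rather than int8.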
Quantization Impact
From README.md:590-594:
The only anomaly is the treatment of the frontend, which uses convolution layers to generate features. The inputs correspond to 16-bit signed integers from raw audio, so we've found it necessary to leave convolution operations in at least BFloat16 precision.
Quality retention:
- WER impact: Less than 0.5% absolute
- Latency improvement: 2-3x faster
- Memory reduction: 4x smaller
Training and Research
Research Papers
From README.md:554-560:
- Moonshine (2024): First generation architecture
  - Flexible input windows
  - Improved on Whisper's fixed 30s requirement
  - arxiv.org/abs/2410.15608
- Flavors of Moonshine (2025): Multilingual approach
  - Monolingual models for higher accuracy
  - Language-specific training benefits
  - arxiv.org/abs/2509.02523
- Moonshine v2 (2026): Streaming architecture
  - Ergodic streaming encoder
  - State caching for latency reduction
  - arxiv.org/abs/2602.12241
Training Data
From README.md:126:
After extensive data-gathering work, we were able to release the first generation of Moonshine models.
- Large proprietary audio dataset
- Multiple languages and accents
- Real-world noise conditions
- Trained from scratch (not fine-tuned from Whisper)
Model Customization
From README.md:585-587:
Domain Customization: Moonshine AI offers full retraining using our internal dataset as a commercial service.
Community project for fine-tuning: github.com/pierre-cheneau/finetune-moonshine-asr
Loading Models
From Files
From python/src/moonshine_voice/transcriber.py:74-126:
From Memory
From core/moonshine-c-api.h:267-277:
Use cases:
- Mobile apps with bundled models
- Encrypted model storage
- Custom model distribution
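The memory-mapping behind these scenarios can be illustrated with Python's stdlib `mmap`; the file header below is fake, and the real load-from-memory entry point lives in the C API referenced above:

```python
import mmap
import os
import tempfile

# Memory-mapping maps the model file into the address space rather than
# copying it, so its bytes can be handed to the runtime without a second
# in-RAM copy. The file here is a throwaway stand-in for an .ort model.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"ORTM" + b"\x00" * 60)        # fake model bytes
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = bytes(mm[:4])                 # pages are faulted in on demand
    # A load-from-memory API would receive the mapped buffer here.
    mm.close()

os.unlink(path)
assert header == b"ORTM"
```

For encrypted model storage the flow is similar: decrypt into a buffer, then pass that buffer to the load-from-memory call instead of a path.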
Download Helper
Model Inference
ONNX Runtime Backend
From README.md:438:
The only major dependency that the C++ core library has is the Onnx Runtime.
Benefits:
- Cross-platform (Linux, macOS, Windows, iOS, Android)
- CPU optimization (SIMD, threading)
- Memory-mapped model loading
- Consistent performance across devices
Threading
From core/silero-vad.h:71-72:
- Inter-op threads: parallelism across independent operators in the graph
- Intra-op threads: parallelism within a single operator
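These correspond to ONNX Runtime's session options. A configuration sketch in Python (the option names are ONNX Runtime's; the thread counts and model filename are illustrative, not the project's defaults):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.inter_op_num_threads = 1   # threads running independent operators in parallel
so.intra_op_num_threads = 4   # threads splitting work inside one operator
session = ort.InferenceSession("encoder_model.ort", sess_options=so)
```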
Version Compatibility
From core/moonshine-c-api.h:89-95:
Benchmarking Models
From README.md:473-493:
- Absolute time taken
- Percentage of audio duration
- Average latency per phrase
- Compute load percentage
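A stripped-down version of the first three metrics, with a sleep standing in for inference (`fake_transcribe` is invented, not the library's API):

```python
import time

# Time a transcription call and report absolute time plus percentage of the
# audio's duration (the real-time factor). Under 100% means faster than
# real time, which is the bar a live transcriber has to clear.
def fake_transcribe(audio_seconds):
    time.sleep(0.01)                       # pretend inference work
    return "hello world"

audio_seconds = 5.0
start = time.perf_counter()
fake_transcribe(audio_seconds)
elapsed = time.perf_counter() - start
percent_of_audio = 100.0 * elapsed / audio_seconds
assert percent_of_audio < 100.0            # faster than real time
```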
Model Accuracy
Evaluation
From README.md:580:
The English evaluations were done using the HuggingFace OpenASR Leaderboard datasets and methodology.
Test your model:
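For a quick spot check outside the leaderboard harness, WER is just word-level edit distance divided by reference length; a from-scratch version (not the project's evaluation code):

```python
# Word error rate via Levenshtein distance over words: substitutions,
# insertions, and deletions, divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

assert wer("the cat sat", "the cat sat") == 0.0
assert round(wer("the cat sat", "the bat sat"), 2) == 0.33  # 1 sub / 3 words
```

Published WER figures depend heavily on text normalization (casing, punctuation, numerals), so raw numbers from this sketch will not match leaderboard methodology exactly.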
WER Interpretation
| WER | Quality | Use Case |
|---|---|---|
| Under 5% | Excellent | Professional transcription |
| 5-10% | Very good | Voice assistants, most applications |
| 10-15% | Good | Casual transcription, commands |
| 15-20% | Acceptable | Constrained devices, noisy environments |
| Over 20% | Poor | Limited use cases |
Future Models
From README.md:741-747 roadmap:
- Binary size reduction for mobile
- More languages
- More streaming models
- Improved speaker identification
- Lightweight domain customization
Next Steps
- Streaming Concepts: Learn how streaming models work
- Transcription: Understand the transcription pipeline