Moonshine models are optimized for real-time voice applications with low latency requirements. These benchmarks compare Moonshine to Whisper models across different platforms.

Quick Comparison: Moonshine vs Whisper

TL;DR - Choose Moonshine when working with live speech.
Model                       WER      Parameters    MacBook Pro   Linux x86    Raspberry Pi 5
Moonshine Medium Streaming  6.65%    245 million   107ms         269ms        802ms
Whisper Large v3            7.44%    1.5 billion   11,286ms      16,919ms     N/A
Moonshine Small Streaming   7.84%    123 million   73ms          165ms        527ms
Whisper Small               8.59%    244 million   1,940ms       3,425ms      10,397ms
Moonshine Tiny Streaming    12.00%   34 million    34ms          69ms         237ms
Whisper Tiny                12.81%   39 million    277ms         1,141ms      5,863ms
Key Takeaways:
  • Moonshine Medium Streaming achieves better accuracy than Whisper Large v3 with 6x fewer parameters
  • Moonshine models are 10-100x faster than equivalent Whisper models for real-time speech
  • Moonshine Tiny Streaming runs in 34ms on a MacBook Pro, enabling sub-200ms total latency for voice interfaces
  • All Moonshine models run efficiently on Raspberry Pi 5, while Whisper Large v3 cannot run on this platform

Understanding the Metrics

Word Error Rate (WER)

Measures transcription accuracy. Lower is better. A WER of 6.65% means that on average, 6.65% of words are incorrectly transcribed.
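Under the hood, WER is the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the model output, divided by the number of reference words. A minimal sketch in Python, for intuition only (this is not part of the benchmark tool):
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("it was the best of times", "it was the best of crimes"))  # ~0.167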

Latency (ms)

The average time from when the library determines the user has stopped talking to when it delivers the final transcript. This is where streaming models excel:
  • Streaming models do most work while the user is talking
  • Non-streaming models must process the entire segment after speech ends
  • For responsive voice interfaces, target latency below 200ms
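The measurement brackets the gap between end-of-speech and the final result. A rough sketch of the idea (the finalize call here is a hypothetical stand-in, not the actual library API):
import time

def finalize_transcript() -> str:
    # Hypothetical stand-in for the model's end-of-utterance flush.
    time.sleep(0.05)
    return "hello world"

speech_end = time.monotonic()   # moment the VAD decides the user stopped talking
text = finalize_transcript()    # remaining work to produce the final transcript
latency_ms = (time.monotonic() - speech_end) * 1000.0
print(f"response latency: {latency_ms:.0f}ms")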

Compute Percentage

The percentage of CPU time required to process audio in real-time. For example:
  • 20% means the model uses 1/5 of CPU time, leaving 80% for your application
  • 100% means the model uses all available CPU just to keep up with real-time audio
  • Values over 100% mean the model cannot process audio in real-time
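The arithmetic behind these numbers, as a quick sketch with example values:
# Processing time relative to the real-time duration of the audio.
processing_seconds = 2.4   # example: total CPU time spent on the clip
audio_seconds = 12.0       # example: duration of the clip
compute_pct = 100.0 * processing_seconds / audio_seconds
print(f"compute: {compute_pct:.0f}% of real time")  # 20%, leaving 80% headroom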

Benchmark Methodology

Test Setup

The core/benchmark tool simulates processing live audio by:
  1. Loading a .wav audio file
  2. Feeding it in chunks to the model (simulating real-time streaming)
  3. Measuring the absolute processing time
  4. Calculating processing time as a percentage of the audio duration
  5. Computing average response latency
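The loop looks roughly like the Python sketch below; StubTranscriber is a placeholder for the real streaming model (the actual tool is the C++ benchmark binary):
import time
import wave

CHUNK_SECONDS = 0.5  # matches the default transcription interval

class StubTranscriber:
    # Placeholder for the real streaming transcriber bindings.
    def add_audio(self, pcm_bytes: bytes) -> None:
        pass

transcriber = StubTranscriber()
total_processing = 0.0

with wave.open("two_cities.wav", "rb") as f:
    rate = f.getframerate()
    audio_seconds = f.getnframes() / rate
    frames_per_chunk = int(rate * CHUNK_SECONDS)
    while True:
        chunk = f.readframes(frames_per_chunk)
        if not chunk:
            break
        start = time.monotonic()
        transcriber.add_audio(chunk)  # feed one real-time chunk to the model
        total_processing += time.monotonic() - start

print(f"compute: {100.0 * total_processing / audio_seconds:.1f}% of audio duration")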

Running Benchmarks

cd core
mkdir build
cd build
cmake ..
cmake --build . --config Release
./benchmark
By default, the benchmark uses the embedded Tiny English model. You can specify a different model:
./benchmark --model-path /path/to/model --model-arch 5

Adjusting Update Frequency

Control how often the transcript is updated (default 0.5 seconds):
./benchmark --transcription-interval 0.3
Longer intervals reduce compute requirements slightly but slow down updates to your application.

Python Benchmark Script

On platforms that support Python, use scripts/run-benchmarks.py, which:
  • Automatically downloads models
  • Evaluates both Moonshine and Whisper models
  • Provides detailed latency and compute cost comparisons

Whisper Comparison Methodology

Our Whisper benchmarks are designed for real-time voice application scenarios, not bulk offline processing.
Requirements:
  • Speech must be responded to quickly once a user completes a phrase
  • Phrases range from 1-10 seconds in duration
  • Latency on individual segments matters more than overall throughput
Setup:
  • Test file: two_cities.wav (mix of short and long phrases)
  • Moonshine models: Tiny, Base, Tiny/Small/Medium Streaming
  • Whisper models: Tiny, Base, Small, Large v3
  • Comparison: Moonshine Medium Streaming vs Whisper Large v3 (both achieve sub-8% WER)
  • VAD: Moonshine VAD segmenter splits audio into phrases
  • Platform: CPU only (using faster-whisper for best cross-platform performance)
Measurements:
  1. Response Latency: Time from phrase completion (VAD detection) to transcribed text
    • Whisper: Full transcription time for each segment
    • Moonshine Streaming: Minimal time (most work done during speech)
  2. Compute Cost: Total audio processing time as percentage of audio duration
    • Inverse of the Real-Time Factor (RTF) metric
    • Reflects actual CPU load for real-time applications
We use CPU-only benchmarks because most applications cannot rely on GPU/NPU acceleration being present across all target platforms. While GPU-accelerated Whisper implementations exist, they lack the portability required for edge deployment.

Why Not Whisper for Live Speech?

Whisper is excellent for bulk offline processing, but has limitations for real-time voice interfaces:

Fixed 30-Second Input Window

  • Voice interface phrases are typically 1-10 seconds
  • Remaining 20+ seconds are zero padding
  • Wasted computation encoding empty input
  • Increased latency even on high-end hardware
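The waste is easy to quantify for a typical voice-interface phrase (simple arithmetic, not a measurement):
# Whisper always encodes a fixed 30s window, regardless of phrase length.
phrase_seconds = 4.0
window_seconds = 30.0
padding_pct = 100.0 * (window_seconds - phrase_seconds) / window_seconds
print(f"{padding_pct:.0f}% of the encoder input is zero padding")  # ~87%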

No Caching

  • Voice interfaces need to display feedback while the user talks
  • This requires repeated transcription calls as speech continues
  • Whisper starts from scratch each time, repeating work on unchanged audio
  • Moonshine caches encoder output and decoder state for dramatic speedup
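The contrast, as a conceptual Python sketch (the encode/decode stubs stand in for real model components; this is not Moonshine's actual API):
def encode(chunk):
    return f"enc[{chunk}]"   # stand-in for the acoustic encoder

def decode(encoded, state):
    return " ".join(encoded), state   # stand-in for the decoder

def transcribe_from_scratch(all_audio):
    # Whisper-style update: re-encodes every chunk on every call.
    text, _ = decode([encode(c) for c in all_audio], None)
    return text

class CachedStreamingTranscriber:
    # Moonshine-style update: only the new chunk is encoded; state is reused.
    def __init__(self):
        self.encoded = []
        self.state = None

    def update(self, new_chunk):
        self.encoded.append(encode(new_chunk))   # work scales with new audio only
        text, self.state = decode(self.encoded, self.state)
        return text

streamer = CachedStreamingTranscriber()
for chunk in ["it was", "the best", "of times"]:
    print(streamer.update(chunk))   # transcript grows without redoing old work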

Poor Multilingual Support

From OpenAI’s Whisper paper (Appendix D-2.4):
  • 82 languages listed
  • Only 33 languages achieve sub-20% WER (usable quality)
  • For Base model (common on edge devices): only 5 languages under 20% WER
  • Languages like Korean and Japanese have poor accuracy despite large markets
Moonshine’s language-specific models achieve much better accuracy for the same model size.

Fragmented Edge Support

  • Mature frameworks and ecosystems are desktop-focused
  • Inconsistent interfaces and capabilities across iOS, Android, Raspberry Pi
  • Difficult to build applications that run on multiple platforms
Moonshine provides a unified API across all platforms.

Platform-Specific Notes

MacBook Pro

Latencies are measured on a recent MacBook Pro with M-series chip. Moonshine models benefit from optimized CPU inference.

Linux x86

Latencies were measured on a standard x86_64 Linux server. Results are 2-3x slower than on the MacBook Pro, but all Moonshine models still achieve sub-second latency.

Raspberry Pi 5

Moonshine models are specifically optimized for Raspberry Pi:
  • All models run efficiently on the device
  • Tiny Streaming achieves 237ms latency (suitable for most voice interfaces)
  • Whisper Tiny is 24x slower (5,863ms vs 237ms)
  • Whisper Large v3 cannot run on this platform

Custom Benchmarking

You can run benchmarks with your own audio files:
python scripts/run-benchmarks.py --wav_path /path/to/your/audio.wav
This helps evaluate performance for your specific use case and audio characteristics.
