## Overview

Each engine has different strengths and trade-offs:

| Engine | Languages | Architecture | Best For |
|---|---|---|---|
| Whisper (default) | 99 languages | Encoder-decoder (whisper.cpp) | General use, multilingual |
| Parakeet | English | FastConformer TDT (ONNX) | Fast English transcription |
| Moonshine | English | Encoder-decoder (ONNX) | Edge devices, low memory |
| SenseVoice | zh, en, ja, ko, yue | CTC encoder (ONNX) | Chinese, Japanese, Korean |
| Paraformer | zh+en, zh+yue+en | Non-autoregressive (ONNX) | Chinese-English bilingual |
| Dolphin | 40 languages + 22 Chinese dialects | CTC E-Branchformer (ONNX) | Eastern languages (no English) |
| Omnilingual | 1600+ languages | wav2vec2 CTC (ONNX) | Low-resource and rare languages |
## Whisper (Default)

OpenAI’s Whisper is the default engine, offering excellent accuracy and broad language support.

### Why Use Whisper
- Multilingual: Supports 99 languages with a single model
- High accuracy: State-of-the-art transcription quality
- Well-tested: Most mature and stable engine in Voxtype
- GPU acceleration: CUDA, Vulkan, Metal, ROCm support
- Multiple backends: Local, remote server, CLI subprocess
### Available Models
| Model | Size | English WER | Speed | Languages |
|---|---|---|---|---|
| tiny.en | 39 MB | ~10% | Fastest | English only |
| base.en | 142 MB | ~8% | Fast | English only |
| small.en | 466 MB | ~6% | Medium | English only |
| medium.en | 1.5 GB | ~5% | Slow | English only |
| large-v3 | 3 GB | ~4% | Slowest | 99 languages |
| large-v3-turbo | 1.6 GB | ~4% | Fast | 99 languages |
Recommended: `base.en` on CPU-only systems, `large-v3-turbo` with GPU acceleration.
### Configuration
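As a minimal sketch of what a Whisper setup might look like, assuming a TOML config file; the section and key names (`transcription`, `engine`, `model`, `language`) are illustrative assumptions, not Voxtype's documented schema:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "whisper"          # the default engine
model = "large-v3-turbo"    # multilingual; use "base.en" on CPU-only machines
language = "auto"           # auto-detect, or a fixed code such as "fr"
```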
### Multilingual Support

- Use Case 1: Transcribe in the spoken language (speak French → output French)
- Use Case 2: Translate speech to English (speak French → output English), via Whisper's built-in translate task

### Performance Examples
With GPU acceleration, even large models are fast:

| Model | CPU (8-core) | Vulkan GPU (RX 6800) |
|---|---|---|
| base.en | ~7x realtime | ~35x realtime |
| large-v3 | ~1x realtime | ~5x realtime |
## ONNX Engines

The ONNX engines use ONNX Runtime for inference, providing alternative architectures optimized for specific use cases.

### Switching to ONNX

ONNX engines require a separate binary compiled with ONNX features:

- `voxtype-*-onnx-avx2`: most CPUs
- `voxtype-*-onnx-avx512`: modern CPUs with AVX-512
- `voxtype-*-onnx-cuda`: NVIDIA GPU acceleration
- `voxtype-*-onnx-rocm`: AMD GPU acceleration
## Parakeet

NVIDIA’s FastConformer-based model for fast, accurate English transcription.

### Why Use Parakeet
- Excellent CPU performance: ~30x realtime on AVX-512 CPUs
- Proper punctuation: Outputs capitalized text with punctuation
- Small model size: 600MB for good accuracy
- No GPU required: Optimized for CPU inference
### Configuration
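A hypothetical sketch of a Parakeet setup, assuming the same TOML layout as the other engines; key names are illustrative, not the documented schema:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "parakeet"
model = "parakeet-tdt-0.6b-v3"   # the 600 MB English TDT model
```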
### Downloading the Model
### Performance

Tested on a Ryzen 9 9900X3D (AVX-512):

| Audio Length | Transcription Time | Real-time Factor |
|---|---|---|
| 1-2s | 0.06-0.09s | ~20x |
| 3-4s | 0.11-0.13s | ~30x |
| 5s | 0.15s | ~33x |
### Limitations
- English only: No multilingual support
- Repetition hallucination: May repeat words more than spoken
- Proper nouns: Technical terms may be substituted with phonetically similar words
## Moonshine

Moonshine AI’s lightweight encoder-decoder transformer optimized for edge devices.

### Why Use Moonshine
- Very fast on CPU: 0.09s for 4-second audio
- Small models: Base is 237MB, tiny is 100MB
- Multilingual options: Japanese, Mandarin, Korean, Arabic
- Low memory: Minimal resource usage
### Available Models
| Model | Languages | Size | License |
|---|---|---|---|
| tiny | English | 100 MB | MIT |
| base | English | 237 MB | MIT |
| tiny-ja | Japanese | 100 MB | Community |
| tiny-zh | Mandarin | 100 MB | Community |
| tiny-ko | Korean | 100 MB | Community |
| tiny-ar | Arabic | 100 MB | Community |
| base-ja | Japanese | 237 MB | Community |
| base-zh | Mandarin | 237 MB | Community |
### Configuration
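A hypothetical Moonshine setup might look like the following; the TOML key names are assumptions, not Voxtype's documented schema:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "moonshine"
model = "base"    # or "tiny", or a community variant such as "tiny-ja"
```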
### Downloading Models
### Performance
| Engine | Model | Time (4s audio) |
|---|---|---|
| Moonshine | base | 0.09s |
| Parakeet | TDT 0.6B | 0.3-0.5s |
| Whisper | large-v3-turbo | 17.7s |
### Limitations

- No punctuation: Outputs lowercase without punctuation (use `spoken_punctuation`)
- Two sizes only: No medium or large variants
- No streaming: Batch mode only (sufficient for push-to-talk)
## SenseVoice

CTC-based encoder optimized for Chinese, Japanese, and Korean.

### Why Use SenseVoice
- Native CJK support: Excellent for Chinese, Japanese, Korean
- Fast inference: CTC architecture is faster than encoder-decoder
- Cantonese support: Includes Yue (Cantonese) in addition to Mandarin
- Emotion recognition: Detects emotional tone in speech
### Configuration
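A hypothetical SenseVoice setup, with the caveat that the TOML key names are illustrative assumptions rather than the documented schema:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "sensevoice"
language = "auto"   # or one of "zh", "en", "ja", "ko", "yue"
```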
### Supported Languages

- Chinese (Mandarin): `zh`
- English: `en`
- Japanese: `ja`
- Korean: `ko`
- Cantonese: `yue`
- Auto-detect: `auto`
## Paraformer

FunASR’s non-autoregressive model for Chinese-English bilingual transcription.

### Why Use Paraformer
- Bilingual: Handles Chinese and English in same recording
- Fast: Non-autoregressive architecture
- Code-switching: Good for mixed-language conversations
### Configuration
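A hypothetical Paraformer setup; as with the other sketches, the TOML key names are assumptions, not the documented schema:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "paraformer"
# The zh+en bilingual variant handles code-switched Chinese-English speech.
```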
## Dolphin

CTC E-Branchformer model optimized for 40 languages and 22 Chinese dialects.

### Why Use Dolphin
- Eastern languages: Covers many Asian and African languages
- Chinese dialects: 22 Chinese regional variants
- No English: Optimized for non-English languages
- Dictation-optimized: Tuned for spoken input
### Configuration
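A hypothetical Dolphin setup; the TOML key names are illustrative assumptions, not Voxtype's documented schema:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "dolphin"
language = "ja"   # one of the 40 supported languages (English is not supported)
```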
### Supported Languages

Includes: Arabic, Bengali, Burmese, Cantonese, Filipino, Gujarati, Hindi, Indonesian, Japanese, Javanese, Kannada, Khmer, Korean, Lao, Malayalam, Marathi, Mongolian, Nepali, Odia, Persian, Portuguese, Punjabi, Russian, Sinhala, Spanish, Swahili, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wu, Xiang, and 22 Chinese dialects.

## Omnilingual

Wav2vec2-based model for 1600+ languages, including low-resource and rare languages.

### Why Use Omnilingual
- Maximum language coverage: 1600+ languages
- Low-resource languages: Supports languages not in Whisper
- Regional variants: Many dialects and regional variants
- Accessibility: Makes voice-to-text accessible to more speakers
### Configuration
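A hypothetical Omnilingual setup; both the TOML key names and the example language code are illustrative assumptions:

```toml
# Hypothetical sketch; key names are assumptions, not the actual Voxtype schema.
[transcription]
engine = "omnilingual"
language = "ace"   # illustrative ISO 639-3 code (Acehnese); check the model card
                   # for the full list of 1600+ supported codes
```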
### Trade-offs
- Lower accuracy: Broader coverage means less specialization
- No punctuation: CTC-based model without punctuation prediction
- Best for niche languages: Use Whisper or Parakeet if your language is well-supported
## Choosing the Right Engine

### For English Dictation

- Best accuracy: Whisper `large-v3-turbo` with GPU
- Best CPU speed: Parakeet `parakeet-tdt-0.6b-v3`
- Lowest memory: Moonshine `tiny`
- Balanced: Whisper `base.en`
### For Multilingual Use

- Western languages: Whisper `large-v3`
- Chinese/Japanese/Korean: SenseVoice
- Mixed Chinese-English: Paraformer
- Eastern languages: Dolphin
- Rare/low-resource: Omnilingual
### For Hardware Constraints

- Limited memory: Moonshine `tiny` (100 MB)
- No GPU: Parakeet or Moonshine
- Fast GPU: Whisper `large-v3-turbo` with CUDA/Vulkan
- Edge devices: Moonshine `base`
## Switching Engines

You can switch engines by changing the engine setting in your config.

## Model Downloads

Use the interactive model selection tool to download models.

## Further Reading
- GPU Acceleration - Speed up transcription with GPU
- Model Selection Guide - Detailed model comparison
- Configuration guide - Full engine configuration options