Voxtype supports seven transcription engines, giving you the flexibility to optimize for speed, accuracy, language support, and hardware constraints.

Overview

Each engine has different strengths and trade-offs:
Engine             Languages                            Architecture                   Best For
Whisper (default)  99 languages                         Encoder-decoder (whisper.cpp)  General use, multilingual
Parakeet           English                              FastConformer TDT (ONNX)       Fast English transcription
Moonshine          English                              Encoder-decoder (ONNX)         Edge devices, low memory
SenseVoice         zh, en, ja, ko, yue                  CTC encoder (ONNX)             Chinese, Japanese, Korean
Paraformer         zh+en, zh+yue+en                     Non-autoregressive (ONNX)      Chinese-English bilingual
Dolphin            40 languages + 22 Chinese dialects   CTC E-Branchformer (ONNX)      Eastern languages (no English)
Omnilingual        1600+ languages                      wav2vec2 CTC (ONNX)            Low-resource and rare languages

Whisper (Default)

OpenAI’s Whisper is the default engine, offering excellent accuracy and broad language support.

Why Use Whisper

  • Multilingual: Supports 99 languages with a single model
  • High accuracy: State-of-the-art transcription quality
  • Well-tested: Most mature and stable engine in Voxtype
  • GPU acceleration: CUDA, Vulkan, Metal, ROCm support
  • Multiple backends: Local, remote server, CLI subprocess

Available Models

Model           Size    English WER  Speed    Languages
tiny.en         39 MB   ~10%         Fastest  English only
base.en         142 MB  ~8%          Fast     English only
small.en        466 MB  ~6%          Medium   English only
medium.en       1.5 GB  ~5%          Slow     English only
large-v3        3 GB    ~4%          Slowest  99 languages
large-v3-turbo  1.6 GB  ~4%          Fast     99 languages
Recommendation: base.en for CPU, large-v3-turbo for GPU

Configuration

engine = "whisper"  # Default, can be omitted

[whisper]
model = "base.en"
language = "en"     # or "auto" for detection
translate = false   # Translate to English?

Multilingual Support

Use Case 1: Transcribe in spoken language (speak French → output French)
[whisper]
model = "large-v3"
language = "auto"   # Auto-detect and transcribe in that language
translate = false
Use Case 2: Translate to English (speak French → output English)
[whisper]
model = "large-v3"
language = "auto"   # Auto-detect the spoken language
translate = true    # Translate output to English
Use Case 3: Force a specific language (always transcribe as Spanish)
[whisper]
model = "large-v3"
language = "es"     # Force Spanish transcription
translate = false

Performance Examples

With GPU acceleration, even large models are fast:
Model     CPU (8-core)   Vulkan GPU (RX 6800)
base.en   ~7x realtime   ~35x realtime
large-v3  ~1x realtime   ~5x realtime
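A real-time factor of Nx means transcription takes roughly 1/N of the audio's duration. A quick sketch of the arithmetic (the 10-second clip is an illustrative value, not a benchmark):

```shell
# "~7x realtime" means a clip transcribes in (duration / 7) seconds.
# Estimate for a 10-second clip on base.en (CPU, ~7x realtime):
audio_seconds=10
rtf=7
awk -v a="$audio_seconds" -v r="$rtf" 'BEGIN { printf "%.1f s\n", a / r }'
# prints: 1.4 s
```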

ONNX Engines

The ONNX engines use ONNX Runtime for inference, providing alternative architectures optimized for specific use cases.

Switching to ONNX

ONNX engines require a separate binary compiled with ONNX features:
# Use the setup tool to switch binaries
voxtype setup onnx --enable   # Switch to ONNX binary
voxtype setup onnx --disable  # Switch back to Whisper
Or download ONNX binaries manually from the releases page:
  • voxtype-*-onnx-avx2 - Most CPUs
  • voxtype-*-onnx-avx512 - Modern CPUs with AVX-512
  • voxtype-*-onnx-cuda - NVIDIA GPU acceleration
  • voxtype-*-onnx-rocm - AMD GPU acceleration
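If you are unsure which CPU build to download, you can check your CPU's feature flags. A minimal sketch, assuming Linux (reads /proc/cpuinfo; the suffixes match the release-asset names above):

```shell
# Pick an ONNX CPU variant by inspecting CPU feature flags (Linux only).
if grep -qw avx512f /proc/cpuinfo; then
  variant="avx512"
elif grep -qw avx2 /proc/cpuinfo; then
  variant="avx2"
else
  variant="none"   # neither flag present: stay on the default Whisper binary
fi
echo "suggested asset: voxtype-*-onnx-$variant"
```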

Parakeet

NVIDIA’s FastConformer-based model for fast, accurate English transcription.

Why Use Parakeet

  • Excellent CPU performance: ~30x realtime on AVX-512 CPUs
  • Proper punctuation: Outputs capitalized text with punctuation
  • Small model size: 600MB for good accuracy
  • No GPU required: Optimized for CPU inference

Configuration

engine = "parakeet"

[parakeet]
model = "parakeet-tdt-0.6b-v3"
on_demand_loading = false

Downloading the Model

voxtype setup model  # Interactive download
Or manually:
mkdir -p ~/.local/share/voxtype/models/parakeet-tdt-0.6b-v3
cd ~/.local/share/voxtype/models/parakeet-tdt-0.6b-v3
# Download encoder-model.onnx, decoder_joint-model.onnx, vocab.txt, config.json
# from https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx
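The file list above can be turned into a small loop. This sketch only prints the URLs; it assumes the standard Hugging Face `resolve/main` direct-download layout for that repository:

```shell
# Print a download URL for each required model file.
# Fetch each one with e.g. `curl -LO <url>` from inside the model directory.
repo="https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx/resolve/main"
for f in encoder-model.onnx decoder_joint-model.onnx vocab.txt config.json; do
  echo "$repo/$f"
done
```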

Performance

Tested on Ryzen 9 9900X3D (AVX-512):
Audio Length  Transcription Time  Real-time Factor
1-2s          0.06-0.09s          ~20x
3-4s          0.11-0.13s          ~30x
5s            0.15s               ~33x

Limitations

  • English only: No multilingual support
  • Repetition hallucination: May output a word more times than it was actually spoken
  • Proper nouns: Technical terms may be substituted with phonetically similar words

Moonshine

Moonshine AI’s lightweight encoder-decoder transformer optimized for edge devices.

Why Use Moonshine

  • Very fast on CPU: 0.09s for 4-second audio
  • Small models: Base is 237MB, tiny is 100MB
  • Multilingual options: Japanese, Mandarin, Korean, Arabic
  • Low memory: Minimal resource usage

Available Models

Model    Languages  Size    License
tiny     English    100 MB  MIT
base     English    237 MB  MIT
tiny-ja  Japanese   100 MB  Community
tiny-zh  Mandarin   100 MB  Community
tiny-ko  Korean     100 MB  Community
tiny-ar  Arabic     100 MB  Community
base-ja  Japanese   237 MB  Community
base-zh  Mandarin   237 MB  Community
Note: Non-English models require the Moonshine Community License (free for non-commercial use).

Configuration

engine = "moonshine"

[moonshine]
model = "base"      # or "tiny", "base-ja", etc.
quantized = true    # Use quantized models when available

Downloading Models

voxtype setup model  # Interactive download with license acceptance

Performance

Engine     Model           Time (4s audio)
Moonshine  base            0.09s
Parakeet   TDT 0.6B        0.3-0.5s
Whisper    large-v3-turbo  17.7s

Limitations

  • No punctuation: Outputs lowercase without punctuation (use spoken_punctuation)
  • Two sizes only: No medium or large variants
  • No streaming: Batch mode only (sufficient for push-to-talk)

SenseVoice

CTC-based encoder optimized for Chinese, Japanese, and Korean.

Why Use SenseVoice

  • Native CJK support: Excellent for Chinese, Japanese, Korean
  • Fast inference: CTC architecture is faster than encoder-decoder
  • Cantonese support: Includes Yue (Cantonese) in addition to Mandarin
  • Emotion recognition: Detects emotional tone in speech

Configuration

engine = "sensevoice"

[sensevoice]
model = "sensevoice-small"  # or path to model directory
language = "auto"           # or "zh", "en", "ja", "ko", "yue"

Supported Languages

  • Chinese (Mandarin): zh
  • English: en
  • Japanese: ja
  • Korean: ko
  • Cantonese: yue
  • Auto-detect: auto

Paraformer

FunASR’s non-autoregressive model for Chinese-English bilingual transcription.

Why Use Paraformer

  • Bilingual: Handles Chinese and English in same recording
  • Fast: Non-autoregressive architecture
  • Code-switching: Good for mixed-language conversations

Configuration

engine = "paraformer"

[paraformer]
model = "paraformer-zh"  # or "paraformer-zh-yue-en"

Dolphin

CTC E-Branchformer optimized for 40 languages and 22 Chinese dialects.

Why Use Dolphin

  • Eastern languages: Covers many Asian and African languages
  • Chinese dialects: 22 Chinese regional variants
  • No English: Optimized for non-English languages
  • Dictation-optimized: Tuned for spoken input

Configuration

engine = "dolphin"

[dolphin]
model = "dolphin-base"

Supported Languages

Includes: Arabic, Bengali, Burmese, Cantonese, Filipino, Gujarati, Hindi, Indonesian, Japanese, Javanese, Kannada, Khmer, Korean, Lao, Malayalam, Marathi, Mongolian, Nepali, Odia, Persian, Portuguese, Punjabi, Russian, Sinhala, Spanish, Swahili, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Wu, Xiang, and 22 Chinese dialects.

Omnilingual

Wav2vec2-based model for 1600+ languages, including low-resource and rare languages.

Why Use Omnilingual

  • Maximum language coverage: 1600+ languages
  • Low-resource languages: Supports languages not in Whisper
  • Regional variants: Many dialects and regional variants
  • Accessibility: Makes voice-to-text accessible to more speakers

Configuration

engine = "omnilingual"

[omnilingual]
model = "omnilingual-base"

Trade-offs

  • Lower accuracy: Broader coverage means less specialization
  • No punctuation: CTC-based model without punctuation prediction
  • Best for niche languages: Use Whisper or Parakeet if your language is well-supported

Choosing the Right Engine

For English Dictation

  1. Best accuracy: Whisper large-v3-turbo with GPU
  2. Best CPU speed: Parakeet parakeet-tdt-0.6b-v3
  3. Lowest memory: Moonshine tiny
  4. Balanced: Whisper base.en

For Multilingual Use

  1. Western languages: Whisper large-v3
  2. Chinese/Japanese/Korean: SenseVoice
  3. Mixed Chinese-English: Paraformer
  4. Eastern languages: Dolphin
  5. Rare/low-resource: Omnilingual

For Hardware Constraints

  1. Limited memory: Moonshine tiny (100MB)
  2. No GPU: Parakeet or Moonshine
  3. Fast GPU: Whisper large-v3-turbo with CUDA/Vulkan
  4. Edge devices: Moonshine base

Switching Engines

You can switch engines in your config:
# Change this line:
engine = "parakeet"  # or "moonshine", "sensevoice", etc.

# Add corresponding engine config:
[parakeet]
model = "parakeet-tdt-0.6b-v3"
Or override per transcription:
voxtype --engine moonshine transcribe audio.wav
Restart the daemon after changing engines:
systemctl --user restart voxtype

Model Downloads

Use the interactive model selection tool:
voxtype setup model
This shows all available engines and models, handles downloads, and updates your config automatically.
