Overview
Performs offline speech recognition using Vosk. Vosk is a modern, offline speech recognition toolkit that provides high accuracy without requiring an internet connection or API keys.
Method Signature
Parameters
- The audio data to recognize. Must be an AudioData instance.
- verbose: If True, returns the full result dictionary from Vosk. If False, returns only the transcription text.
Returns
- When verbose=False: the recognized text.
- When verbose=True: the Vosk result dictionary containing:
  - text: The transcribed text
  - Additional Vosk-specific metadata
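Because the return type depends on verbose, calling code may want to normalize it before use. The following is a minimal sketch, not part of the library: `extract_text` is a hypothetical helper, and it relies only on what is stated here, namely that the verbose result is a dictionary with a `text` key.

```python
from typing import Union


def extract_text(result: Union[str, dict]) -> str:
    """Return the transcription from either form of result.

    Hypothetical helper: accepts the plain string returned with
    verbose=False or the dictionary returned with verbose=True.
    """
    if isinstance(result, dict):
        # The verbose dictionary carries the transcription under "text".
        return str(result.get("text", "")).strip()
    return str(result).strip()
```

With this in place, downstream code can treat both call styles the same way.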
Exceptions
Raised when:
- The Vosk model is not found
- The vosk module is not installed
- Model files are corrupted or incomplete
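The exception class is not named in this section, so the sketch below catches `Exception` broadly as a stand-in; narrow it to the documented class once known. `try_recognize` is a hypothetical wrapper that takes any zero-argument callable, so it can be exercised without a model installed.

```python
from typing import Callable, Optional, Tuple


def try_recognize(recognize: Callable[[], str]) -> Tuple[Optional[str], Optional[str]]:
    """Run a zero-argument recognition callable and capture any failure.

    Returns (text, None) on success or (None, error_message) on failure.
    """
    try:
        return recognize(), None
    except Exception as exc:  # assumption: the exact exception type is not known here
        return None, f"{type(exc).__name__}: {exc}"
```

A call might look like `text, error = try_recognize(lambda: recognizer.recognize_vosk(audio))`, assuming the method is exposed as `recognize_vosk` on a SpeechRecognition `Recognizer`.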
Example Usage
Basic Offline Recognition
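A minimal sketch of the basic flow, assuming the method documented here is exposed as `Recognizer.recognize_vosk` in the SpeechRecognition package and that a Vosk model is already installed. `recognize_from_microphone` is a hypothetical name; microphone capture additionally needs PyAudio.

```python
def recognize_from_microphone() -> str:
    """Capture one phrase from the default microphone and transcribe it."""
    # Imported lazily so this module still loads when the
    # SpeechRecognition package is not installed yet.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample the background noise level before listening.
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    # With the default verbose=False only the text is returned.
    return recognizer.recognize_vosk(audio)
```

Calling `print(recognize_from_microphone())` would then print whatever was spoken.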
With Verbose Output
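A sketch of inspecting the verbose result, assuming the `verbose` parameter described under Parameters and a `recognize_vosk` method on the recognizer. `summarize_verbose` is a hypothetical helper; which metadata keys appear depends on Vosk and is not assumed here.

```python
def summarize_verbose(recognizer, audio) -> str:
    """Transcribe with verbose=True and describe what came back."""
    result = recognizer.recognize_vosk(audio, verbose=True)
    text = result.get("text", "")
    # Any remaining keys are the Vosk-specific metadata.
    extra_keys = sorted(key for key in result if key != "text")
    return f"text={text!r}; other keys={extra_keys}"
```

Here `recognizer` would be an `sr.Recognizer()` and `audio` an AudioData instance obtained from a microphone or file.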
From Audio File
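A sketch of transcribing a recorded file, under the same assumption that the method is `Recognizer.recognize_vosk` and accepts `verbose`. `recognize_file` is a hypothetical name; `sr.AudioFile` reads WAV, AIFF, and FLAC files.

```python
def recognize_file(path: str, verbose: bool = False):
    """Transcribe a whole audio file with Vosk."""
    # Lazy import keeps this module importable without the package installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        # record() with no duration reads the file to the end.
        audio = recognizer.record(source)
    return recognizer.recognize_vosk(audio, verbose=verbose)
```

For example, `recognize_file("speech.wav")` would return the text, where `speech.wav` is a placeholder filename.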
Continuous Recognition
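One way to structure continuous recognition is a generator that yields each phrase as it is recognized. This is a sketch: `listen_continuously` and the `stop_phrase` convention are hypothetical, and the recognizer is assumed to provide `listen` and `recognize_vosk`.

```python
from typing import Iterator


def listen_continuously(recognizer, source, stop_phrase: str = "stop") -> Iterator[str]:
    """Yield one transcription per captured phrase until stop_phrase is heard."""
    while True:
        audio = recognizer.listen(source)
        text = recognizer.recognize_vosk(audio).strip()
        if not text:
            continue  # silence or unintelligible audio; keep listening
        yield text
        if text.lower() == stop_phrase:
            break
```

In practice it would be driven from inside `with sr.Microphone() as source:` with a loop such as `for text in listen_continuously(recognizer, source): print(text)`.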
Voice Assistant
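The command-handling half of a voice assistant can be sketched independently of the recognizer: take the transcribed text and route it to a handler. `dispatch_command` and the keyword-matching scheme are illustrative assumptions, not part of the library.

```python
from typing import Callable, Dict


def dispatch_command(text: str, handlers: Dict[str, Callable[[], str]]) -> str:
    """Run the first handler whose keyword appears in the transcription."""
    spoken = text.lower()
    for keyword, handler in handlers.items():
        if keyword in spoken:
            return handler()
    return "Sorry, I did not understand that."
```

The `text` argument would come from the recognition call, and the returned string could be printed or passed to a text-to-speech engine.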
Batch Processing
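A sketch of batch processing that walks a directory of WAV files and records a result or an error for each. `transcribe_directory` is a hypothetical helper; the `transcribe` argument is any function mapping a file path to text, for instance one built on `sr.AudioFile` and the Vosk recognition call.

```python
from pathlib import Path
from typing import Callable, Dict


def transcribe_directory(directory: str, transcribe: Callable[[str], str]) -> Dict[str, str]:
    """Transcribe every .wav file in a directory, in name order."""
    results: Dict[str, str] = {}
    for wav in sorted(Path(directory).glob("*.wav")):
        try:
            results[wav.name] = transcribe(str(wav))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            results[wav.name] = f"ERROR: {exc}"
    return results
```

Keeping the per-file call injectable means one bad file does not abort the batch, and the loop can be tested without a model.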
Installation and Setup
1. Install Vosk Library
2. Download Vosk Model
Vosk requires a language model to be downloaded. The library expects the model at:
- Go to Vosk Models
- Download a model for your language (e.g., vosk-model-en-us-0.22)
- Extract the model
- Place it in the correct directory:
3. Verify Installation
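One quick check is whether the required modules can be located at all. The sketch below uses only the standard library; `check_installation` is a hypothetical helper, and it confirms that the packages are importable, not that a model is present.

```python
import importlib.util
from typing import Dict, Iterable


def check_installation(modules: Iterable[str] = ("speech_recognition", "vosk")) -> Dict[str, bool]:
    """Report, per top-level module name, whether it can be found."""
    # find_spec locates a module without importing it.
    return {name: importlib.util.find_spec(name) is not None for name in modules}
```

Printing `check_installation()` shows at a glance which package is missing.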
Available Models
English Models
| Model | Size | Description |
|---|---|---|
| vosk-model-small-en-us-0.15 | 40 MB | Lightweight, fast |
| vosk-model-en-us-0.22 | 1.8 GB | High accuracy |
| vosk-model-en-us-0.42-gigaspeech | 2.3 GB | Best accuracy |
Other Languages
Vosk supports 20+ languages:
- European: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Ukrainian, Russian, Greek, Turkish
- Asian: Chinese, Japanese, Korean, Hindi, Arabic, Persian, Vietnamese
- Other: Catalan, Esperanto
Language Support
Vosk supports multiple languages through different models:
Changing Language
To use a different language:
- Download the appropriate language model
- Place it in the speech_recognition/models/vosk/ directory
- The library will automatically use the installed model
Performance Characteristics
Advantages
- Fully Offline: No internet connection required
- High Accuracy: Modern deep learning models
- Fast: Optimized for real-time recognition
- Free: No API costs or limits
- Privacy: Audio never leaves your device
- Multiple Languages: 20+ languages supported
- Modern Architecture: State-of-the-art deep learning
Comparison with PocketSphinx
| Feature | Vosk | PocketSphinx |
|---|---|---|
| Accuracy | Higher | Lower |
| Speed | Fast | Fast |
| Model Size | Larger | Smaller |
| Setup | Requires model download | Built-in models |
| Languages | 20+ | Many more |
| Modern | Yes (DNN-based) | Older (HMM-based) |
Model Selection Guide
Small Models (40-100 MB)
Use for:
- Resource-constrained devices (Raspberry Pi)
- Real-time applications
- Quick prototyping
Large Models (1-2 GB)
Use for:
- High-accuracy applications
- Transcription services
- Production systems
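The size trade-off can be expressed as a small selection rule. This sketch uses the three English models and sizes from the English Models table, with sizes rounded to whole megabytes; `pick_english_model` is a hypothetical helper that treats size as a proxy for accuracy, as the table descriptions suggest.

```python
# (model name, approximate size in MB), from the English Models table.
ENGLISH_MODELS = [
    ("vosk-model-small-en-us-0.15", 40),
    ("vosk-model-en-us-0.22", 1800),
    ("vosk-model-en-us-0.42-gigaspeech", 2300),
]


def pick_english_model(max_size_mb: int) -> str:
    """Return the largest listed model that fits within max_size_mb."""
    candidates = [model for model in ENGLISH_MODELS if model[1] <= max_size_mb]
    if not candidates:
        raise ValueError(f"no listed model fits in {max_size_mb} MB")
    return max(candidates, key=lambda model: model[1])[0]
```

A device with a 100 MB budget would get the small model, while a server with several gigabytes to spare would get the gigaspeech model.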
Best Use Cases
- Offline Applications: No internet available
- Privacy-Critical: Healthcare, legal, financial
- Voice Commands: Home automation, assistants
- Transcription Services: Convert speech to text
- Embedded Systems: Raspberry Pi, IoT devices
- Real-time Recognition: Live captioning, subtitles
Troubleshooting
Model Not Found Error
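A first diagnostic step is to confirm that the model directory exists and is not empty. The sketch below is a hypothetical helper using only the standard library; it does not validate the model files themselves.

```python
from pathlib import Path
from typing import Union


def model_dir_status(model_dir: Union[str, Path]) -> str:
    """Classify a model directory as 'missing', 'empty', or 'present'."""
    path = Path(model_dir)
    if not path.is_dir():
        return "missing"
    if not any(path.iterdir()):
        return "empty"
    return "present"
```

Assuming the models/vosk location inside the installed speech_recognition package, the path to check could be built as `Path(speech_recognition.__file__).parent / "models" / "vosk"`.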
Low Accuracy
Solutions:
- Use a larger, more accurate model
- Ensure good audio quality (16 kHz, clear speech)
- Reduce background noise
- Adjust for ambient noise before recognition
Slow Performance
Solutions:
- Use a smaller model
- Ensure audio is at 16 kHz (Vosk’s native rate)
- Use a faster CPU or GPU
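To see whether a WAV file is already at 16 kHz, its header can be read with the standard-library `wave` module. This is a sketch with hypothetical helper names; the 16-bit mono part of the check reflects the format listed under Technical Details.

```python
import wave
from typing import Tuple


def wav_format(path: str) -> Tuple[int, int, int]:
    """Return (sample_rate_hz, sample_width_bytes, channels) of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth(), wav.getnchannels()


def is_vosk_native(path: str) -> bool:
    """True if the file is already 16 kHz, 16-bit (2-byte), mono."""
    return wav_format(path) == (16000, 2, 1)
```

Files that fail the check still work, since audio is converted automatically, but recording at 16 kHz avoids the resampling step.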
Technical Details
- Audio Format: Automatically converted to 16 kHz, 16-bit mono
- Model Type: Kaldi-based DNN models
- Architecture: Deep Neural Networks with acoustic models
- License: Apache 2.0 (Vosk) + model-specific licenses
Notes
- Completely offline after model download
- No API keys or internet connection required
- Model must be downloaded separately
- Audio is automatically converted to 16 kHz, 16-bit samples
- Accuracy with the larger models can be comparable to cloud services
- Free and open-source
- Good for both short commands and long-form transcription