Method Signature
Parameters
- An AudioData instance containing the audio to transcribe.
- verbose: If True, returns the full response dict from Vosk, including confidence scores and alternatives. If False, returns only the transcription text.
Returns
- Default: str - The transcribed text
- With verbose=True: dict - Full Vosk response with text and metadata
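Because the return type depends on verbose, calling code may want to handle both shapes in one place. The following is a minimal sketch, assuming the verbose dict stores the transcript under a "text" key; the helper name `transcript_text` is hypothetical and not part of the library.

```python
from typing import Union


def transcript_text(result: Union[str, dict]) -> str:
    """Return the plain transcript from either return shape."""
    if isinstance(result, dict):
        # Assumption: the verbose response keeps the transcript under "text".
        return str(result.get("text", "")).strip()
    return result.strip()
```

Passing the value returned by recognize_vosk() through this helper yields a string whether or not verbose was set.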
Installation
Download Language Model
Option 1: Using CLI
This downloads the default English model.
Option 2: Manual Download
- Go to Vosk Models
- Download a model for your language
- Extract it to: speech_recognition/models/vosk/
Vosk requires a downloaded language model before it can transcribe anything. Model size depends on language and quality level: the small models listed below are 31-48 MB, while the large ones range from 1.0 to 1.8 GB.
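Before the first recognition call, it can help to confirm that a model has actually been extracted. This is a minimal sketch using only the standard library; the function name `model_is_installed` is hypothetical, and it only checks that the directory exists and is non-empty, not that the model files are valid.

```python
from pathlib import Path


def model_is_installed(model_dir: Path) -> bool:
    """Return True if model_dir exists and contains at least one entry."""
    return model_dir.is_dir() and any(model_dir.iterdir())
```

For an installed package, the directory could be resolved as `Path(speech_recognition.__file__).parent / "models" / "vosk"`, assuming the path above is relative to the package root.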
Model Setup
Vosk expects models in this location: speech_recognition/models/vosk/
Basic Example
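A basic file transcription might look like the sketch below. It assumes the library is imported as `sr` and that a model is already in place; the function name `transcribe_wav` is hypothetical.

```python
def transcribe_wav(path: str) -> str:
    """Transcribe one audio file offline with Vosk."""
    # Requires the SpeechRecognition package with Vosk and a model installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    # With the default arguments, only the transcription text is returned.
    return recognizer.recognize_vosk(audio)
```

Calling `transcribe_wav("speech.wav")` (a placeholder file name) returns the transcript as a string.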
Microphone Example
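Capturing from a microphone follows the same pattern as file input. The sketch below assumes the library is imported as `sr` and that PyAudio is available for `sr.Microphone`; the function name and the default calibration time are illustrative choices.

```python
def transcribe_from_microphone(ambient_seconds: float = 0.5) -> str:
    """Record one phrase from the default microphone and transcribe it."""
    # Requires SpeechRecognition with Vosk; sr.Microphone also needs PyAudio.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Calibrate the energy threshold against background noise first.
        recognizer.adjust_for_ambient_noise(source, duration=ambient_seconds)
        audio = recognizer.listen(source)  # blocks until a phrase ends
    return recognizer.recognize_vosk(audio)
```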
Verbose Output
Get detailed results, including confidence scores, by passing verbose=True.
Available Models
Vosk provides various models with different sizes and accuracy levels:
English Models
| Model | Size | Accuracy | Use Case |
|---|---|---|---|
| vosk-model-small-en-us | 40 MB | Good | Embedded, mobile |
| vosk-model-en-us | 1.8 GB | Very Good | Server applications |
| vosk-model-en-us-daanzu | 1.0 GB | Excellent | Dictation |
Other Languages
| Language | Model Name | Size |
|---|---|---|
| Chinese | vosk-model-small-cn | 42 MB |
| German | vosk-model-small-de | 45 MB |
| Spanish | vosk-model-small-es | 39 MB |
| French | vosk-model-small-fr | 41 MB |
| Russian | vosk-model-small-ru | 45 MB |
| Hindi | vosk-model-small-hi | 36 MB |
| Portuguese | vosk-model-small-pt | 31 MB |
| Italian | vosk-model-small-it | 48 MB |
| Turkish | vosk-model-small-tr | 35 MB |
| Vietnamese | vosk-model-small-vn | 32 MB |
| Japanese | vosk-model-small-ja | 48 MB |
| Korean | vosk-model-small-ko | 42 MB |
| Arabic | vosk-model-ar | 1.4 GB |
Changing Language/Model
The library automatically uses the model in speech_recognition/models/vosk/. To use a different language:
- Download the model for your language
- Extract it to replace the current model in speech_recognition/models/vosk/
- Use recognize_vosk() normally; it will use the new model
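The replacement step above can be scripted. This is a minimal sketch with the standard library, assuming the model files should sit directly inside the target directory; the function name `replace_model` and both path arguments are hypothetical.

```python
import shutil
from pathlib import Path


def replace_model(extracted_model: Path, model_dir: Path) -> None:
    """Replace the contents of model_dir with a freshly extracted model."""
    if not extracted_model.is_dir():
        raise FileNotFoundError(f"No extracted model at {extracted_model}")
    if model_dir.exists():
        shutil.rmtree(model_dir)  # remove the previous language model
    # copytree creates model_dir (and any missing parents) before copying.
    shutil.copytree(extracted_model, model_dir)
```

Since a loaded model is kept in memory, a process that has already called recognize_vosk() would likely need to be restarted before the new model takes effect.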
Error Handling
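Which exceptions recognize_vosk() raises is not stated here, so the sketch below takes the exception types as a parameter instead of hard-coding them. The function name `transcribe_or_none` and the `known_errors` parameter are hypothetical.

```python
import logging
from typing import Optional, Tuple, Type

logger = logging.getLogger(__name__)


def transcribe_or_none(
    recognizer,
    audio,
    known_errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[str]:
    """Return the transcript, or None if recognition raised a known error."""
    try:
        return recognizer.recognize_vosk(audio)
    except known_errors as exc:
        # Log and fall back instead of crashing the caller.
        logger.warning("Vosk recognition failed: %s", exc)
        return None
```

With the library imported as `sr`, `known_errors` might be narrowed to `(sr.UnknownValueError, sr.RequestError)`; treat that tuple as an assumption until checked against the installed version.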
Audio Requirements
- Sample Rate: 16 kHz (automatically converted)
- Sample Width: 16-bit (automatically converted)
- Channels: Mono (stereo is automatically converted)
- Format: Any format supported by the library
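Conversion is automatic, so checking the input is optional, but knowing the source format can help when debugging recognition quality. A minimal sketch for WAV files using the standard `wave` module; the function name `needs_conversion` is hypothetical.

```python
import wave


def needs_conversion(path: str) -> dict:
    """Report which WAV properties differ from 16 kHz, 16-bit, mono."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate() != 16000,
            "sample_width": wav.getsampwidth() != 2,  # 2 bytes = 16-bit
            "channels": wav.getnchannels() != 1,
        }
```

Each value is True when that property will have to be converted before recognition.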
Real-Time Recognition
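One way to approximate real-time recognition is to capture short phrases in a loop and transcribe each as it arrives. The sketch below assumes `recognizer` is an `sr.Recognizer` and `source` is an opened `sr.Microphone`; the generator name and the `max_phrases` parameter are hypothetical.

```python
from typing import Iterator


def live_transcripts(recognizer, source, max_phrases: int,
                     phrase_time_limit: float = 5.0) -> Iterator[str]:
    """Yield one transcript per captured phrase, skipping empty results."""
    for _ in range(max_phrases):
        # listen() blocks until a phrase ends or the time limit is reached.
        audio = recognizer.listen(source, phrase_time_limit=phrase_time_limit)
        text = recognizer.recognize_vosk(audio)
        if text:
            yield text
```

A shorter `phrase_time_limit` lowers latency at the cost of cutting long sentences into several pieces.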
Performance Comparison
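Speed depends heavily on hardware and model size, so measuring on your own setup is more reliable than any fixed figure. A common metric is the real-time factor: processing time divided by audio duration, where values below 1.0 mean faster than real time. A minimal sketch; the function name `real_time_factor` is hypothetical.

```python
import time
from typing import Callable, Tuple


def real_time_factor(transcribe: Callable[[], str],
                     audio_seconds: float) -> Tuple[str, float]:
    """Run transcribe() once; return its result and time taken / audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    text = transcribe()
    elapsed = time.perf_counter() - start
    return text, elapsed / audio_seconds
```

For example, `real_time_factor(lambda: recognizer.recognize_vosk(audio), 12.0)` would time one call on a 12-second clip. The first call also includes model loading, so timing a second call gives a fairer steady-state number.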
Advantages
- Fully Offline: No internet required
- Privacy: Audio never leaves your device
- Free: No API keys, no usage limits
- High Accuracy: Much better than Sphinx, comparable to some cloud services
- Fast: Real-time transcription on modern hardware
- Many Languages: 20+ languages supported
- Small Models: The small variants are 30-50 MB each
- Active Development: Regular updates and improvements
Limitations
- Model Required: Must download language model first
- Lower Accuracy than Whisper: Not as accurate as Whisper large models
- No Language Detection: Must use specific language model
- Memory Usage: Larger models require more RAM (~1-2 GB)
- One Model at a Time: Can’t easily switch languages mid-session
Use Cases
- Privacy-sensitive applications: Medical, legal, personal
- Offline environments: No internet access
- Voice assistants: Smart home, IoT devices
- Real-time transcription: Meetings, lectures
- Mobile apps: On-device recognition
- Embedded systems: Raspberry Pi (use small models)
- Prototyping: Quick offline testing
Comparison: Vosk vs Other Offline Engines
| Feature | Vosk | Sphinx | Whisper (local) |
|---|---|---|---|
| Accuracy | High | Low-Medium | Very High |
| Speed | Fast | Very Fast | Medium-Slow |
| Memory | Low (50 MB - 2 GB) | Very Low (~50 MB) | High (1-10 GB) |
| Languages | 20+ | Limited | 99 |
| Setup | Easy | Complex | Easy |
| Model Size | 30 MB - 2 GB | Included | 100 MB - 3 GB |
| GPU Support | No | No | Yes |
| Real-time | Yes | Yes | Challenging |
When to Use Vosk
Use Vosk when:
- ✅ You need good accuracy offline
- ✅ Privacy is important
- ✅ You need real-time transcription
- ✅ Your language is supported
- ✅ You want better than Sphinx accuracy
- ✅ You can’t use cloud services
Don’t use Vosk when:
- ❌ You need the absolute highest accuracy (use Whisper)
- ❌ You need 99 languages (use Whisper)
- ❌ You need keyword spotting specifically (use Sphinx)
- ❌ You have GPU and can wait longer (use Whisper)
- ❌ Cloud services are acceptable (use Google/Azure)
Best Practices
Model Storage:
Vosk models are loaded on first use and kept in memory. For long-running applications, this is fine. For short scripts, the model loading time (1-2 seconds) may be noticeable.