Method Signature
Parameters

- `audio_data`: An `AudioData` instance containing the audio to transcribe.
- `language`: Recognition language as an RFC 5646 language tag (e.g., "en-US", "en-GB"). Can also be a 3-tuple of filesystem paths, `(acoustic_model_dir, language_model_file, phoneme_dict_file)`, for custom models. Note: only "en-US" is supported out of the box; other languages require downloading additional data.
- `keyword_entries`: List of keywords to search for, as tuples of `(keyword, sensitivity)`. Sensitivity is a float from 0.0 (insensitive, fewer false positives) to 1.0 (sensitive, more false positives). When specified, Sphinx listens only for these keywords instead of performing general transcription.
- `grammar`: Path to a JSGF or FSG grammar file. Constrains recognition to the phrases defined in the grammar. Useful for command-and-control applications with a limited vocabulary.
- `show_all`: If `True`, returns the raw `pocketsphinx.Decoder` object. If `False`, returns only the transcription text.

Returns

- Default: `str`, the transcribed text
- With `show_all=True`: `pocketsphinx.Decoder`, the raw decoder object with detailed results
Installation
- Using pip
- From source (Linux/Mac)
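The install commands themselves seem to have been lost; typical steps, assuming the PyPI package names `SpeechRecognition` and `pocketsphinx` (the build dependencies shown are Debian/Ubuntu package names and vary by distribution):

```shell
# Using pip
pip install SpeechRecognition pocketsphinx

# From source (Linux/Mac): install build dependencies first,
# then build pocketsphinx from its source distribution
sudo apt-get install swig libpulse-dev libasound2-dev
pip install --no-binary pocketsphinx pocketsphinx
```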
PocketSphinx works entirely offline. Once installed, no internet connection is required.
Basic Example
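A minimal sketch of transcribing a prerecorded file; the filename is illustrative:

```python
import speech_recognition as sr

r = sr.Recognizer()
# Any WAV/FLAC/AIFF file works; "audio.wav" is a placeholder path.
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

# Offline transcription via PocketSphinx
print(r.recognize_sphinx(audio))
```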
Microphone Example
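The same flow works with live input; this sketch assumes PyAudio is installed for microphone access:

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

print("You said:", r.recognize_sphinx(audio))
```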
Keyword Spotting
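The example for this section appears to have been stripped; a sketch using the `keyword_entries` parameter described above (the phrases and sensitivities are illustrative):

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    audio = r.listen(source)

# Listen only for these phrases; sensitivity (0.0-1.0) trades
# missed detections against false positives.
keywords = [("hey computer", 0.8), ("stop", 0.5)]
try:
    print(r.recognize_sphinx(audio, keyword_entries=keywords))
except sr.UnknownValueError:
    print("No keyword detected")
```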
Sphinx excels at keyword spotting - listening only for specific phrases instead of transcribing everything.

Grammar-Based Recognition
Constrain recognition to specific phrases using grammars:

JSGF Grammar Example

Create a file `commands.gram`:
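The grammar file contents were not preserved; a minimal JSGF sketch (the grammar name and phrases are illustrative):

```
#JSGF V1.0;
grammar commands;
public <command> = (turn on | turn off) the (light | fan | heater);
```

Pass the file via the `grammar` parameter, e.g. `r.recognize_sphinx(audio, grammar="commands.gram")`.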
Multiple Languages
Out of the box, only English (US) is supported. For other languages:

Installing Additional Languages
Download Language Pack
Download language models from the CMU Sphinx download page.
Extract Files
Extract the archive and locate:

- Acoustic model directory (usually `en-us`, `fr-fr`, etc.)
- Language model file (`.lm.bin` or `.lm`)
- Phoneme dictionary (`.dic` or `.dict`)
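Once extracted, the three paths can be supplied as the 3-tuple form of the `language` argument described above (all paths below are illustrative):

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("audio_fr.wav") as source:  # placeholder file
    audio = r.record(source)

# 3-tuple: (acoustic_model_dir, language_model_file, phoneme_dict_file)
fr_model = (
    "models/fr-fr",          # acoustic model directory
    "models/fr-fr.lm.bin",   # language model
    "models/fr-fr.dict",     # phoneme dictionary
)
print(r.recognize_sphinx(audio, language=fr_model))
```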
Audio Requirements
- Sample Rate: 16 kHz (automatically converted)
- Sample Width: 16-bit (automatically converted)
- Channels: Mono (stereo is automatically converted)
- Format: Any format supported by the library (WAV, FLAC, AIFF)
Error Handling
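The snippet for this section appears to have been stripped. The library raises `sr.UnknownValueError` when the audio is unintelligible and `sr.RequestError` when PocketSphinx is missing or misconfigured; a sketch:

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:  # placeholder path
    audio = r.record(source)

try:
    print(r.recognize_sphinx(audio))
except sr.UnknownValueError:
    # Sphinx could not match the audio to any words
    print("Could not understand audio")
except sr.RequestError as e:
    # PocketSphinx is not installed, or a model file is missing/invalid
    print(f"Sphinx error: {e}")
```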
Improving Accuracy
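The original tips appear to be missing here. In practice, the main levers with this engine are ambient-noise calibration and constraining the vocabulary; a sketch (phrases and sensitivities are illustrative):

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Calibrate the energy threshold against background noise
    r.adjust_for_ambient_noise(source, duration=1)
    audio = r.listen(source)

# Constrain the vocabulary: keyword spotting (or a grammar file)
# is far more reliable than free-form dictation with Sphinx.
# Lower sensitivity values reduce false positives per keyword.
text = r.recognize_sphinx(
    audio,
    keyword_entries=[("lights on", 0.6), ("lights off", 0.6)],
)
```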
Command-and-Control Example
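A sketch of a small command loop built on keyword spotting; the phrases and the device-action stubs are hypothetical:

```python
import speech_recognition as sr

# Map recognized phrases to actions (hypothetical device stubs)
ACTIONS = {
    "turn on the light": lambda: print("Light on"),
    "turn off the light": lambda: print("Light off"),
    "stop listening": lambda: print("Goodbye"),
}

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)
    while True:
        audio = r.listen(source)
        try:
            # Only the known command phrases are listened for
            heard = r.recognize_sphinx(
                audio,
                keyword_entries=[(phrase, 0.7) for phrase in ACTIONS],
            )
        except sr.UnknownValueError:
            continue  # nothing recognized; keep listening
        for phrase, action in ACTIONS.items():
            if phrase in heard:
                action()
        if "stop listening" in heard:
            break
```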
Advantages
- Fully Offline: No internet required
- Privacy: Audio never leaves your device
- Free: No API keys, no usage limits
- Lightweight: Low resource usage
- Low Latency: Fast local processing
- Keyword Spotting: Excellent for hotword detection
- Grammars: Constrained vocabulary recognition
Limitations
- Lower Accuracy: Not as accurate as cloud services or modern offline models
- English Only (default): Other languages require manual setup
- Sensitive to Noise: Poor performance in noisy environments
- Limited Vocabulary: Best with constrained vocabulary
- Setup Complexity: Installing additional languages is complex
Use Cases
- Hotword detection: “Hey Computer”, “OK Google”, etc.
- Voice commands: Smart home controls with limited vocabulary
- Offline applications: No internet available
- Privacy-sensitive: Medical, legal, military applications
- Embedded systems: Raspberry Pi, IoT devices
- Prototyping: Quick offline testing
Comparison: Sphinx vs Other Offline Engines
| Feature | Sphinx | Vosk | Whisper |
|---|---|---|---|
| Accuracy | Low-Medium | Medium-High | Very High |
| Speed | Very Fast | Fast | Medium-Slow |
| Memory | Very Low (~50 MB) | Low (~100 MB) | Medium-High (1-10 GB) |
| Languages | Limited | 20+ | 99 |
| Setup | Complex | Easy | Easy |
| Keyword Spotting | Excellent | Good | No |
| Best For | Commands, hotwords | General offline | High accuracy offline |
When to Use Sphinx
Use Sphinx when:

- ✅ You need keyword/hotword detection
- ✅ You have a limited, known vocabulary
- ✅ Privacy is critical (fully offline)
- ✅ Resources are very limited (Raspberry Pi, etc.)
- ✅ You need very low latency
Avoid Sphinx when:

- ❌ You need high accuracy for general transcription
- ❌ You need support for many languages
- ❌ Background noise is a concern
- ❌ You can use cloud services