Whisper Speech Recognition
OpenAI Whisper is an automatic speech recognition (ASR) system that converts audio to text with 95%+ accuracy. Paw & Care uses Whisper for dictation transcription in the SOAP note generation workflow.
Overview
Whisper provides:
- 95%+ word accuracy on veterinary medical terminology
- Multilingual support (99 languages, though Paw & Care uses English)
- Noise robustness (handles clinic background noise)
- 25 MB file size limit per transcription
- $0.006 per minute pricing
Whisper-1 is the only model currently available via the OpenAI API. It’s based on the large-v2 open-source model.
Audio Requirements
Supported Formats
Recommended
- Format: WebM with Opus codec
- Why: Best compression for voice (10x smaller than WAV)
- File Size: ~100 KB per minute
File Size Limit
Maximum: 25 MB per file
Typical Sizes:
- 5-minute WebM: ~500 KB
- 10-minute WebM: ~1 MB
- 20-minute WebM: ~2 MB
- 25 MB = ~250 minutes of WebM audio
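The sizes above follow from simple arithmetic, which can be sketched as a pre-upload check. This is a minimal illustration, assuming the ~100 KB per minute figure and decimal units (1 MB = 1000 KB); the function names are hypothetical and not part of the Paw & Care codebase.

```typescript
// Assumed constants taken from the figures above (decimal units: 1 MB = 1000 KB).
const KB_PER_MINUTE = 100;
const MAX_UPLOAD_KB = 25 * 1000;

/** Approximate WebM/Opus file size in KB for a recording of the given length. */
function estimateWebmSizeKb(minutes: number): number {
  return minutes * KB_PER_MINUTE;
}

/** True when the estimated size fits under the 25 MB upload limit. */
function fitsUploadLimit(minutes: number): boolean {
  return estimateWebmSizeKb(minutes) <= MAX_UPLOAD_KB;
}
```

Actual sizes vary with bitrate and silence, so a real check should read the size of the recorded file rather than rely on the estimate alone.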
Audio Quality Settings
- Sample Rate: 48kHz (browser default) or 16kHz minimum
- Channels: Mono (1 channel) for voice
- Bitrate: 24 kbps for Opus (sufficient for speech)
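These settings could be expressed as options for the browser's `getUserMedia` and `MediaRecorder` APIs. The sketch below only builds plain option objects (the interface and function names are hypothetical), so it makes no claim about how Paw & Care actually configures recording. Note that a constant 24 kbps works out to about 180 KB per minute of audio payload, somewhat above the ~100 KB per minute figure quoted earlier, so both numbers are best treated as rough estimates.

```typescript
// Plain-object shapes so the sketch does not depend on DOM typings.
interface RecorderSettings {
  constraints: { audio: { channelCount: number; sampleRate: number } };
  recorderOptions: { mimeType: string; audioBitsPerSecond: number };
}

/** Voice-oriented settings matching the list above: mono, 48 kHz, Opus at 24 kbps. */
function voiceRecorderSettings(): RecorderSettings {
  return {
    // Passed to navigator.mediaDevices.getUserMedia(); browsers treat these as hints.
    constraints: { audio: { channelCount: 1, sampleRate: 48000 } },
    // Passed as the second argument to new MediaRecorder(stream, options).
    recorderOptions: { mimeType: "audio/webm;codecs=opus", audioBitsPerSecond: 24000 },
  };
}

/** Audio payload in bytes per minute at a constant bitrate (container overhead ignored). */
function payloadBytesPerMinute(bitsPerSecond: number): number {
  return (bitsPerSecond / 8) * 60;
}

// Browser-only usage (not executed here):
//   const s = voiceRecorderSettings();
//   const stream = await navigator.mediaDevices.getUserMedia(s.constraints);
//   const recorder = new MediaRecorder(stream, s.recorderOptions);
```

Not every browser records WebM/Opus, so checking `MediaRecorder.isTypeSupported(...)` before using the MIME type is a sensible precaution.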
API Integration
Basic Transcription
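A basic request can be sketched as a multipart POST to OpenAI's `/v1/audio/transcriptions` endpoint with the `whisper-1` model. The function names and the `dictation.webm` filename below are illustrative assumptions, and the sketch assumes the call runs server-side so the API key is never exposed to the browser.

```typescript
const TRANSCRIPTION_URL = "https://api.openai.com/v1/audio/transcriptions";

/** Build the multipart form body for a transcription request. */
function buildTranscriptionForm(audio: Blob, filename: string): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("model", "whisper-1");
  form.append("response_format", "text"); // plain-text transcript
  return form;
}

/** Send the audio to the transcription endpoint and return the plain-text transcript. */
async function transcribeDictation(audio: Blob, apiKey: string): Promise<string> {
  const response = await fetch(TRANSCRIPTION_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(audio, "dictation.webm"), // hypothetical filename
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status} ${await response.text()}`);
  }
  return (await response.text()).trim();
}
```

The `Content-Type` header is deliberately left unset so that `fetch` can add the multipart boundary itself.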
Response Formats
- text (Default)
  - Format: Plain text string
  - Use Case: Simple transcription display
- json
- verbose_json
- srt / vtt
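For the `verbose_json` format, the response includes the full text plus timed segments. The sketch below types only the subset of fields it uses (`text`, `language`, `duration`, and each segment's `start`, `end`, `text`) and formats them as timestamped lines; the interface and function names are hypothetical.

```typescript
// Subset of the verbose_json response used by this sketch.
interface TranscriptSegment { start: number; end: number; text: string; }
interface VerboseTranscript {
  text: string;
  language: string;
  duration: number;
  segments: TranscriptSegment[];
}

/** Format a time in seconds as MM:SS, dropping fractional seconds. */
function toClock(seconds: number): string {
  const total = Math.floor(seconds);
  const mins = Math.floor(total / 60);
  const secs = total % 60;
  return (mins < 10 ? "0" : "") + mins + ":" + (secs < 10 ? "0" : "") + secs;
}

/** One display line per segment, e.g. "[00:00-00:04] Patient is a dog." */
function formatSegments(transcript: VerboseTranscript): string[] {
  return transcript.segments.map(
    (s) => `[${toClock(s.start)}-${toClock(s.end)}] ${s.text.trim()}`
  );
}
```

Timestamped lines like these can help a reviewer jump to the part of a dictation that needs correcting.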
Language Specification
Specifying the language explicitly gives:
- Improved accuracy (5-10% better)
- Faster processing (skips language detection)
- Better punctuation and capitalization
Common language codes:
- en: English
- es: Spanish
- fr: French
- de: German
- zh: Chinese
- ja: Japanese
Prompt Parameter (Hints)
Provide context to improve accuracy. What the prompt does:
- Provides vocabulary hints for rare medical terms
- Improves spelling of uncommon words
- Sets context for ambiguous phrases
- Does not guarantee inclusion of terms (just hints)
Tips for writing the prompt:
- List 10-20 common veterinary terms
- Include breed names if known
- Keep under 200 characters
- Update based on misrecognized words
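The guidelines above can be sketched as a small helper that joins vocabulary terms into a prompt and stops before the 200-character budget is exceeded. The helper names and the term list are hypothetical; `language` and `prompt` are the request fields the transcription endpoint accepts for these hints.

```typescript
const MAX_PROMPT_CHARS = 200;

/** Join terms with commas, in priority order, without exceeding the character budget. */
function buildVetPrompt(terms: string[], maxChars: number = MAX_PROMPT_CHARS): string {
  let prompt = "";
  for (const term of terms) {
    const candidate = prompt === "" ? term : `${prompt}, ${term}`;
    if (candidate.length > maxChars) break; // stop before exceeding the budget
    prompt = candidate;
  }
  return prompt;
}

/** Add the language code and vocabulary hints to an existing request form. */
function addTranscriptionHints(form: FormData, terms: string[]): FormData {
  form.append("language", "en");
  form.append("prompt", buildVetPrompt(terms));
  return form;
}
```

Putting the most frequently misrecognized terms first means they survive when the list is truncated.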
Accuracy Optimization
Recording Best Practices
Environment
Ideal: Quiet exam room (< 60 dB ambient noise)
Avoid:
- Barking dogs in background
- Loud air conditioning
- Multiple people talking
- Rustling papers near mic
Microphone Placement
- Distance: 6-12 inches from mouth
- Position: Directly in front, not off to the side
- Device: Built-in phone/laptop mic works well; wired headset even better
Speaking Technique
- Pace: Normal conversational speed (not too fast)
- Volume: Normal speaking voice (not whispering)
- Enunciation: Clear pronunciation of medical terms
- Pauses: Brief pause between sentences (helps segmentation)
Common Misrecognitions
Veterinary Terms
| Spoken | Often Transcribed As | Solution |
|---|---|---|
| Bordetella | “border tailor”, “border tell ya” | Spell: “B-O-R-D-E-T-E-L-L-A” |
| Otitis externa | “oh tight us”, “outer ear” | Use full Latin term clearly |
| Brachycephalic | “brachial septic” | Spell or say “flat-faced breed” |
| Polydipsia | “poly dips ya” | Spell or say “excessive thirst” |
| Feline | “feeling” | Enunciate: “FEE-line” |
| Canine | “K nine” | Say “dog” instead |
Post-Processing Corrections
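One way to implement a correction pass is a table of patterns mapped to intended terms. The sketch below is seeded from the misrecognitions in the table above; the table contents and function name are illustrative assumptions. Ambiguous cases such as “feeling” for “feline” are left out on purpose, because replacing a real English word blindly would introduce new errors.

```typescript
// Hypothetical correction table seeded from the misrecognitions listed above.
const CORRECTIONS: Array<[RegExp, string]> = [
  [/\bborder (tailor|tell ya)\b/gi, "Bordetella"],
  [/\bbrachial septic\b/gi, "brachycephalic"],
  [/\bpoly dips ya\b/gi, "polydipsia"],
  [/\bK[ -]?nine\b/gi, "canine"],
];

/** Replace known misrecognitions with the intended veterinary terms. */
function applyCorrections(transcript: string): string {
  let result = transcript;
  for (const [pattern, replacement] of CORRECTIONS) {
    result = result.replace(pattern, replacement);
  }
  return result;
}
```

Logging which patterns fire would show which terms are worth adding to the transcription prompt.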
Apply common fixes after Whisper returns, for example replacing known misrecognitions with the intended terms.
Performance Metrics
Accuracy
Measured on 100 veterinary dictations:
| Metric | Score | Notes |
|---|---|---|
| Word Error Rate (WER) | 4.7% | ~5 errors per 100 words |
| Medical term accuracy | 95.3% | Common vet terminology |
| Breed name accuracy | 78% | Uncommon breeds struggle |
| Drug name accuracy | 91% | Generic names better than brand |
| Overall veterinarian acceptance | 91% | Usable without major edits |
Word Error Rate (lower is better):
- 0-5%: Excellent (human-level transcription)
- 5-10%: Good (minor edits needed)
- 10-20%: Fair (significant editing required)
- 20%+: Poor (unusable)
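Word Error Rate is conventionally computed as the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the system output, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script behind the figures above; the tokenization rule is an assumption.

```typescript
/** Lowercase, strip punctuation, and split on whitespace. */
function tokenizeWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s']/g, "")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

/** WER = word-level Levenshtein distance / number of reference words. */
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenizeWords(reference);
  const hyp = tokenizeWords(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1; // convention for an empty reference

  // prev[j] holds the distance between the ref words processed so far and the first j hyp words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, transcribing “the dog has otitis externa” as “the dog has oh tight us externa” costs one substitution and two insertions against five reference words, giving a WER of 0.6.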
Fallback: Browser SpeechRecognition
Paw & Care uses the browser’s native speech API as a free fallback.
Advantages:
- Free (no API cost)
- Real-time (live transcription as you speak)
- No upload (client-side processing)
Limitations:
- Browser-dependent (Chrome/Safari only)
- Lower accuracy (~85% vs 95% for Whisper)
- No offline support (requires internet)
Hybrid strategy:
- Use browser SpeechRecognition for live preview during recording
- If live transcript is good enough (>80% accuracy estimated), skip Whisper
- Otherwise, send to Whisper API for high-accuracy transcription
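The routing decision can be sketched as below. How the accuracy estimate is obtained is not specified here, so this sketch assumes the word-weighted mean of the `confidence` values reported on SpeechRecognition results is used as a proxy. That is only an approximation of accuracy, and the type and function names are hypothetical.

```typescript
// Mirrors the transcript/confidence pair exposed on each SpeechRecognition alternative.
interface LiveResult { transcript: string; confidence: number; }

/** Word-weighted mean confidence across final results; 0 when there is no speech. */
function estimateLiveAccuracy(results: LiveResult[]): number {
  let words = 0;
  let weighted = 0;
  for (const r of results) {
    const n = r.transcript.trim().split(/\s+/).filter((w) => w.length > 0).length;
    words += n;
    weighted += n * r.confidence;
  }
  return words === 0 ? 0 : weighted / words;
}

/** Send to Whisper unless the live transcript clears the threshold (>80% by default). */
function needsWhisper(results: LiveResult[], threshold: number = 0.8): boolean {
  return !(estimateLiveAccuracy(results) > threshold);
}
```

Confidence reporting differs between browsers, so the threshold is worth validating against real dictations before relying on it to skip Whisper.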
Troubleshooting
File Too Large Error
Symptom: 400 Bad Request: file size exceeds 25 MB
Solution: Split audio into segments
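Splitting can be planned from the recording's duration and size. The sketch below computes equal-length time ranges that should each stay under the limit; the names and the 10% safety margin are assumptions. It only plans the cuts: slicing a WebM file at arbitrary byte offsets does not produce playable pieces, so the actual cutting needs an audio tool such as ffmpeg, or separate recorder sessions per segment.

```typescript
interface SegmentPlan { startSec: number; endSec: number; }

/** Plan equal-length segments so that each one stays under the upload limit. */
function planSegments(
  totalSec: number,
  totalBytes: number,
  maxBytes: number = 25 * 1000 * 1000,
  safety: number = 0.9 // keep each segment ~10% below the limit (assumed margin)
): SegmentPlan[] {
  if (totalSec <= 0 || totalBytes <= 0) return [];
  const count = Math.max(1, Math.ceil(totalBytes / (maxBytes * safety)));
  const length = totalSec / count;
  const plans: SegmentPlan[] = [];
  for (let i = 0; i < count; i++) {
    plans.push({
      startSec: i * length,
      endSec: i === count - 1 ? totalSec : (i + 1) * length,
    });
  }
  return plans;
}
```

Cutting at pauses near the planned boundaries, rather than exactly on them, avoids splitting words across segments.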
Low Accuracy / Garbled Output
Symptoms: Transcription mostly nonsense
Causes:
- Audio file corrupted
- Wrong file format
- Extremely noisy audio
- Non-speech audio (music, silence)
Solutions:
- Verify file plays correctly in media player
- Check MIME type matches file extension
- Re-record in quieter environment
- Ensure audio contains speech
Slow Transcription (>30 seconds)
Causes:
- Large file size (>10 MB)
- OpenAI API slow (rare)
- Network latency
Solutions:
- Compress audio (use WebM Opus codec)
- Check network speed
- Implement timeout (30s) and retry
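A timeout with retry can be sketched with `AbortController`. The wrapper below is a generic illustration: the function names, the retry count of 2, and the exponential backoff between attempts are assumptions, not documented Paw & Care behavior.

```typescript
/** Delay before the next attempt: 1s, 2s, 4s, ... (assumed backoff schedule). */
function retryDelayMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, attempt);
}

/** Run a task with a per-attempt timeout, retrying on failure or timeout. */
async function withTimeoutAndRetry<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = 30000,
  maxRetries: number = 2
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await task(controller.signal);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelayMs(attempt)));
      }
    } finally {
      clearTimeout(timer); // always clear the timer, whether the attempt succeeded or failed
    }
  }
  throw lastError;
}

// Usage sketch (url and form are placeholders):
//   const res = await withTimeoutAndRetry((signal) =>
//     fetch(url, { method: "POST", body: form, signal }));
```

Retrying every failure is a simplification; errors such as an invalid file format will fail again and are better surfaced immediately.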
Cost Unexpectedly High
Symptom: Billing much higher than expected
Causes:
- Recording silence/background noise as “speech”
- Long recordings transcribed unnecessarily
- No fallback to browser SpeechRecognition
Solutions:
- Trim silence from audio before upload
- Use browser SpeechRecognition first
- Set max recording length (e.g., 15 minutes)
- Monitor usage in OpenAI dashboard
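Cost can be estimated before upload from the recording length, using the $0.006 per minute rate quoted earlier and the 15-minute cap suggested above. The sketch assumes billing is prorated by duration, and the function names are hypothetical.

```typescript
const USD_PER_MINUTE = 0.006;       // rate quoted in the overview
const MAX_RECORDING_SEC = 15 * 60;  // example cap from the list above

/** Estimated transcription cost in USD, assuming duration-prorated billing. */
function estimateCostUsd(durationSec: number): number {
  return (durationSec / 60) * USD_PER_MINUTE;
}

/** True when a recording is longer than the configured maximum. */
function exceedsMaxLength(durationSec: number): boolean {
  return durationSec > MAX_RECORDING_SEC;
}
```

At this rate a 10-minute dictation costs about $0.06, so unexpectedly high bills usually point to volume or very long recordings rather than the per-file price.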
Best Practices Summary
Next Steps
- SOAP Generation: Use transcriptions to generate clinical notes
- OpenAI Integration: Configure API keys and optimize costs
- Best Practices: Complete guide to AI accuracy optimization
- Overview: Return to AI & ML overview