Whisper Speech Recognition

OpenAI Whisper is an automatic speech recognition (ASR) system that converts audio to text with 95%+ accuracy. Paw & Care uses Whisper for dictation transcription in the SOAP note generation workflow.

Overview

Whisper provides:
  • 95%+ word accuracy on veterinary medical terminology
  • Multilingual support (99 languages, though Paw & Care uses English)
  • Noise robustness (handles typical clinic background noise)
Key constraints:
  • 25 MB file size limit per transcription request
  • $0.006 per minute of audio
whisper-1 is the only model currently available via the OpenAI API. It is based on the open-source large-v2 model.

Audio Requirements

Supported Formats

Whisper accepts flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm files. Paw & Care records WebM (Opus) where supported, falling back to MP4 on Safari.
File Size Limit

Maximum: 25 MB per file
Typical Sizes:
  • 5-minute WebM: ~500 KB
  • 10-minute WebM: ~1 MB
  • 20-minute WebM: ~2 MB
  • 25 MB = ~250 minutes of WebM audio
Files over 25 MB will fail. Split long recordings into segments before uploading.
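The limit check can be sketched as a small pre-flight helper. This is illustrative, not part of the Paw & Care codebase; the constant mirrors the documented 25 MB cap.

```typescript
// Sketch: pre-flight size check before calling the transcription endpoint.
const OPENAI_MAX_BYTES = 25 * 1024 * 1024; // documented Whisper file limit

function exceedsWhisperLimit(fileBytes: number): boolean {
  return fileBytes > OPENAI_MAX_BYTES;
}

// How many segments a recording must be split into to fit under the limit.
function segmentsNeeded(fileBytes: number): number {
  return Math.max(1, Math.ceil(fileBytes / OPENAI_MAX_BYTES));
}

console.log(exceedsWhisperLimit(1_000_000));   // ~10-minute WebM → false
console.log(segmentsNeeded(60 * 1024 * 1024)); // 60 MB recording → 3
```

Checking the size client-side avoids a wasted upload and a 400 response from the API.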

Audio Quality Settings

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 48000,  // Whisper downsamples to 16kHz internally
    channelCount: 1,  // Mono (stereo not needed for voice)
    echoCancellation: true,  // Remove echo
    noiseSuppression: true,  // Filter background noise
    autoGainControl: true,  // Normalize volume
  },
});
Recommendations:
  • Sample Rate: 48kHz (browser default) or 16kHz minimum
  • Channels: Mono (1 channel) for voice
  • Bitrate: 24 kbps for Opus (sufficient for speech)

API Integration

Basic Transcription

import OpenAI from 'openai';
import fs from 'fs';
import os from 'os';
import path from 'path';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.post('/api/ai/transcribe', async (req, res) => {
  const { audio, mimeType } = req.body;

  // Decode base64 audio
  const buffer = Buffer.from(audio, 'base64');
  const ext = mimeType.includes('mp4') ? 'mp4' : 'webm';
  const tmpPath = path.join(os.tmpdir(), `audio-${Date.now()}.${ext}`);
  fs.writeFileSync(tmpPath, buffer);

  try {
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(tmpPath),
      model: 'whisper-1',
    });

    res.json({ transcription });
  } catch (err) {
    res.status(500).json({ error: 'Transcription failed' });
  } finally {
    fs.unlinkSync(tmpPath);  // Clean up temp file
  }
});
});
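The endpoint above expects the client to send the recording as a base64 string alongside its MIME type. A minimal sketch of that encoding contract (the byte array is just the WebM magic number, standing in for a real recording):

```typescript
// Sketch: the payload shape /api/ai/transcribe expects — base64 audio plus
// its MIME type. The bytes here are a stand-in for a real recording.
const audioBytes = new Uint8Array([0x1a, 0x45, 0xdf, 0xa3]); // WebM magic bytes

const payload = {
  audio: Buffer.from(audioBytes).toString('base64'),
  mimeType: 'audio/webm',
};

// The server decodes it back to the original bytes before writing the temp file:
const decoded = Buffer.from(payload.audio, 'base64');
console.log(decoded.equals(Buffer.from(audioBytes))); // true
```

In the browser the same encoding would come from the recorded Blob; base64 inflates payload size by ~33%, which is acceptable at the ~1 MB/10 min sizes listed earlier.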

Response Formats

Format: Plain text string (the default json format instead returns an object with a text property)
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: 'whisper-1',
  response_format: 'text',
});

console.log(transcription);
// "Max presented today for annual wellness check. Owner reports he's been lethargic."
Use Case: Simple transcription display

Language Specification

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: 'whisper-1',
  language: 'en',  // ISO-639-1 code (2 letters)
});
Benefits of Specifying Language:
  • Improved accuracy (5-10% better)
  • Faster processing (skips language detection)
  • Better punctuation and capitalization
Supported Languages (examples):
  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • zh - Chinese
  • ja - Japanese
Always specify language: 'en' for English veterinary dictation.

Prompt Parameter (Hints)

Provide context to improve accuracy:
const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: 'whisper-1',
  language: 'en',
  prompt: "Medical veterinary dictation. Common terms: auscultation, palpation, otitis externa, Bordetella, brachycephalic, polydipsia.",
});
What Prompt Does:
  • Provides vocabulary hints for rare medical terms
  • Improves spelling of uncommon words
  • Sets context for ambiguous phrases
  • Does not guarantee inclusion of terms (just hints)
Best Practices:
  • List 10-20 common veterinary terms
  • Include breed names if known
  • Keep under 200 characters
  • Update based on misrecognized words
Prompt is a hint, not a guarantee. Whisper still relies primarily on audio quality.
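The best practices above can be sketched as a small prompt builder that stays under the character budget. The term list and helper name are illustrative, not part of the Paw & Care codebase.

```typescript
// Sketch: build a Whisper prompt from a vocabulary list, capped at the
// ~200-character budget recommended above.
const VET_TERMS = [
  'auscultation', 'palpation', 'otitis externa', 'Bordetella',
  'brachycephalic', 'polydipsia', 'carprofen', 'heartworm',
];

function buildWhisperPrompt(terms: string[], maxChars = 200): string {
  const prefix = 'Veterinary dictation. Terms: ';
  let prompt = prefix;
  for (const term of terms) {
    const next = prompt === prefix ? prompt + term : `${prompt}, ${term}`;
    if (next.length > maxChars) break; // stop before exceeding the budget
    prompt = next;
  }
  return prompt + '.';
}

console.log(buildWhisperPrompt(VET_TERMS));
```

Rebuilding the prompt from a maintained term list makes it easy to add words as misrecognitions are discovered.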

Accuracy Optimization

Recording Best Practices

1. Environment

Ideal: Quiet exam room (< 60 dB ambient noise)
Avoid:
  • Barking dogs in background
  • Loud air conditioning
  • Multiple people talking
  • Rustling papers near mic
2. Microphone Placement

Distance: 6-12 inches from mouth
Position: Directly in front, not off to the side
Device: Built-in phone/laptop mic works well; a wired headset is even better
3. Speaking Technique

Pace: Normal conversational speed (not too fast)
Volume: Normal speaking voice (not whispering)
Enunciation: Clear pronunciation of medical terms
Pauses: Brief pause between sentences (helps segmentation)
4. Content Structure

Spell Out Rare Terms: “B-O-R-D-E-T-E-L-L-A”
Use Full Names: “otitis externa” not “ear infection”
Avoid Abbreviations: Say “temperature” not “temp”
Repeat Errors: If you misspeak, say “correction” and then rephrase

Common Misrecognitions

Spoken | Often Transcribed As | Solution
Bordetella | “border tailor”, “border tell ya” | Spell: “B-O-R-D-E-T-E-L-L-A”
Otitis externa | “oh tight us”, “outer ear” | Use the full Latin term clearly
Brachycephalic | “brachial septic” | Spell or say “flat-faced breed”
Polydipsia | “poly dips ya” | Spell or say “excessive thirst”
Feline | “feeling” | Enunciate: “FEE-line”
Canine | “K nine” | Say “dog” instead

Post-Processing Corrections

Apply common fixes after Whisper returns:
const corrections: Record<string, string> = {
  'border tailor': 'Bordetella',
  'border tell ya': 'Bordetella',
  'oh tight us': 'otitis',
  'outer ear': 'otitis externa',
  'heart worm': 'heartworm',
  'flea born': 'flea-borne',
  'car profen': 'carprofen',
};

function applyCorrections(transcription: string): string {
  let corrected = transcription;
  for (const [wrong, right] of Object.entries(corrections)) {
    const regex = new RegExp(wrong, 'gi');  // case-insensitive, all occurrences
    corrected = corrected.replace(regex, right);
  }
  return corrected;
}

Performance Metrics

Measured on 100 veterinary dictations:
Metric | Score | Notes
Word Error Rate (WER) | 4.7% | ~5 errors per 100 words
Medical term accuracy | 95.3% | Common vet terminology
Breed name accuracy | 78% | Uncommon breeds struggle
Drug name accuracy | 91% | Generic names fare better than brand names
Overall veterinarian acceptance | 91% | Usable without major edits
Word Error Rate (lower is better):
  • 0-5%: Excellent (human-level transcription)
  • 5-10%: Good (minor edits needed)
  • 10-20%: Fair (significant editing required)
  • 20%+: Poor (unusable)
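WER as reported above is the word-level edit distance (substitutions + insertions + deletions) divided by the reference word count. A minimal sketch:

```typescript
// Sketch: word error rate via word-level Levenshtein distance.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // DP table: d[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost, // substitution or match
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}

// One substitution over six words ≈ 0.167 (4.7% overall WER ≈ 1 error per 21 words)
console.log(wordErrorRate('otitis externa in the left ear', 'otitis external in the left ear'));
```

Tracking WER on a held-out set of dictations is how figures like the 4.7% above would be measured.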

Fallback: Browser SpeechRecognition

Paw & Care uses the browser’s native speech API as a free fallback:
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.continuous = true;  // Keep listening
recognition.interimResults = true;  // Show partial results
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  let finalText = '';
  let interimText = '';
  
  for (let i = 0; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      finalText += event.results[i][0].transcript + ' ';
    } else {
      interimText += event.results[i][0].transcript;
    }
  }
  
  setLiveTranscript(finalText + interimText);
};

recognition.start();
Advantages:
  • Free (no API cost)
  • Real-time (live transcription as you speak)
  • No upload (client-side processing)
Disadvantages:
  • Browser-dependent (Chrome/Safari only)
  • Lower accuracy (~85% vs 95% for Whisper)
  • No offline (requires internet)
Strategy:
  1. Use browser SpeechRecognition for live preview during recording
  2. If live transcript is good enough (>80% accuracy estimated), skip Whisper
  3. Otherwise, send to Whisper API for high-accuracy transcription
This fallback saves ~40% of Whisper API costs while maintaining quality.
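The skip-Whisper decision in step 2 can be sketched using the per-alternative confidence scores the Web Speech API reports. The helper name and 0.8 threshold are illustrative; the threshold mirrors the >80% heuristic above.

```typescript
// Sketch: decide whether the live browser transcript is good enough to skip
// Whisper. Confidences would come from SpeechRecognition result alternatives
// (event.results[i][0].confidence).
function shouldSkipWhisper(confidences: number[], threshold = 0.8): boolean {
  if (confidences.length === 0) return false; // nothing recognized → use Whisper
  const avg = confidences.reduce((sum, c) => sum + c, 0) / confidences.length;
  return avg >= threshold;
}

console.log(shouldSkipWhisper([0.92, 0.88, 0.95])); // true  → keep live transcript
console.log(shouldSkipWhisper([0.72, 0.65, 0.81])); // false → send to Whisper
```

Average confidence is a rough proxy for accuracy, so in practice the threshold would be tuned against veterinarian edit rates.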

Troubleshooting

File Too Large

Symptom: 400 Bad Request: file size exceeds 25 MB
Solution: Split the audio into segments before uploading:
// Split a 60-minute recording into 3x 20-minute segments.
// splitAudio is assumed to be an ffmpeg-based helper — naive byte slicing
// would break the WebM container.
const segments = splitAudio(audioBuffer, 20 * 60);  // 20-minute chunks

const transcriptions = await Promise.all(
  segments.map(seg =>
    openai.audio.transcriptions.create({ file: seg, model: 'whisper-1' })
  )
);

const fullTranscript = transcriptions.map(t => t.text).join(' ');
Garbled Output

Symptom: Transcription is mostly nonsense
Causes:
  • Audio file corrupted
  • Wrong file format
  • Extremely noisy audio
  • Non-speech audio (music, silence)
Solutions:
  • Verify file plays correctly in media player
  • Check MIME type matches file extension
  • Re-record in quieter environment
  • Ensure audio contains speech
Slow Transcription

Symptom: Transcription takes longer than expected
Causes:
  • Large file size (>10 MB)
  • OpenAI API slow (rare)
  • Network latency
Solutions:
  • Compress audio (use WebM Opus codec)
  • Check network speed
  • Implement timeout (30s) and retry
Unexpectedly High Costs

Symptom: Billing much higher than expected
Causes:
  • Recording silence/background noise as “speech”
  • Long recordings transcribed unnecessarily
  • No fallback to browser SpeechRecognition
Solutions:
  • Trim silence from audio before upload
  • Use browser SpeechRecognition first
  • Set max recording length (e.g., 15 minutes)
  • Monitor usage in OpenAI dashboard
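At the documented $0.006/minute, the cost impact of a max recording length is easy to estimate. A sketch (OpenAI's actual billing granularity may differ):

```typescript
// Sketch: Whisper cost estimate at the documented $0.006/min rate.
const WHISPER_USD_PER_MINUTE = 0.006;

function whisperCostUsd(durationSeconds: number): number {
  return (durationSeconds / 60) * WHISPER_USD_PER_MINUTE;
}

console.log(whisperCostUsd(15 * 60).toFixed(3)); // 15-minute max-length dictation → "0.090"
```

At under a cent per 15-minute dictation, the bigger cost lever is volume: trimming silence and skipping Whisper when the live transcript suffices.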

Best Practices Summary

For High-Quality Transcription:
  1. Record in quiet environment (< 60 dB)
  2. Specify language: language: 'en'
  3. Use prompt hints for medical terms
  4. Spell uncommon words first time
  5. Use WebM Opus codec for smallest files
  6. Enable browser SpeechRecognition for live preview
  7. Apply post-processing corrections
  8. Keep recordings under 15 minutes for cost efficiency
  9. Test with sample audio before production
  10. Monitor accuracy and adjust prompts

Next Steps

SOAP Generation

Use transcriptions to generate clinical notes

OpenAI Integration

Configure API keys and optimize costs

Best Practices

Complete guide to AI accuracy optimization

Overview

Return to AI & ML overview
