Whisper Speech Recognition

OpenAI Whisper is an automatic speech recognition (ASR) system that converts audio to text with 95%+ accuracy. Paw & Care uses Whisper for dictation transcription in the SOAP note generation workflow.

Overview

Whisper provides:

95%+ word accuracy on veterinary medical terminology
Multilingual support (99 languages, though Paw & Care uses English)
Noise robustness (handles clinic background noise)
25 MB file size limit per transcription
$0.006 per minute pricing

Whisper-1 is the only model currently available via OpenAI API. It’s based on the large-v2 open-source model.

Audio Requirements

Supported Formats

Recommended
Also Supported
Not Supported

Format: WebM with Opus codecWhy: Best compression for voice (10x smaller than WAV)Browser Recording:

const recorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus',
});

File Size: ~100 KB per minute

OGG: Use WebM instead
WMA: Windows Media Audio
Raw PCM: Must be in container format

Convert using ffmpeg if needed:

ffmpeg -i input.ogg -c:a libopus output.webm

File Size Limit

Maximum: 25 MB per file Typical Sizes:

5-minute WebM: ~500 KB
10-minute WebM: ~1 MB
20-minute WebM: ~2 MB
25 MB = ~250 minutes of WebM audio

Files over 25 MB will fail. Split long recordings into segments before uploading.

Audio Quality Settings

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 48000,  // Whisper downsamples to 16kHz internally
    channelCount: 1,  // Mono (stereo not needed for voice)
    echoCancellation: true,  // Remove echo
    noiseSuppression: true,  // Filter background noise
    autoGainControl: true,  // Normalize volume
  },
});

Recommendations:

Sample Rate: 48kHz (browser default) or 16kHz minimum
Channels: Mono (1 channel) for voice
Bitrate: 24 kbps for Opus (sufficient for speech)

API Integration

Basic Transcription

import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.post('/api/ai/transcribe', async (req, res) => {
  const { audio, mimeType } = req.body;

  // Decode base64 audio
  const buffer = Buffer.from(audio, 'base64');
  const ext = mimeType.includes('mp4') ? 'mp4' : 'webm';
  const tmpPath = path.join(os.tmpdir(), `audio-${Date.now()}.${ext}`);
  fs.writeFileSync(tmpPath, buffer);

  try {
    const transcription = await openai.audio.transcriptions.create({
      file: fs.createReadStream(tmpPath),
      model: 'whisper-1',
    });

    res.json({ transcription });
  } finally {
    fs.unlinkSync(tmpPath);  // Clean up temp file
  }
});

Response Formats

text (Default)
json
verbose_json
srt / vtt

Format: Plain text string

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: 'whisper-1',
  response_format: 'text',
});

console.log(transcription);
// "Max presented today for annual wellness check. Owner reports he's been lethargic."

Use Case: Simple transcription display

Format: JSON object with text

response_format: 'json'

// Response:
{
  "text": "Max presented today for annual wellness check..."
}

Use Case: When you need JSON for consistency

Format: JSON with timestamps and metadata

response_format: 'verbose_json'

// Response:
{
  "task": "transcribe",
  "language": "english",
  "duration": 123.45,
  "text": "Max presented today...",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Max presented today",
      "temperature": 0.0,
      "avg_logprob": -0.28,
      "compression_ratio": 1.45,
      "no_speech_prob": 0.01
    }
  ]
}

Use Case: When you need timestamps for video/audio sync

Format: Subtitle files

response_format: 'srt'  // or 'vtt'

// Response (SRT):
1
00:00:00,000 --> 00:00:03,500
Max presented today for annual wellness check.

2
00:00:03,500 --> 00:00:08,000
Owner reports he's been lethargic.

Use Case: Creating subtitles for video content

Language Specification

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: 'whisper-1',
  language: 'en',  // ISO-639-1 code (2 letters)
});

Benefits of Specifying Language:

Improved accuracy (5-10% better)
Faster processing (skips language detection)
Better punctuation and capitalization

Supported Languages (examples):

en - English
es - Spanish
fr - French
de - German
zh - Chinese
ja - Japanese

Always specify language: 'en' for English veterinary dictation

Prompt Parameter (Hints)

Provide context to improve accuracy:

const transcription = await openai.audio.transcriptions.create({
  file: audioStream,
  model: 'whisper-1',
  language: 'en',
  prompt: "Medical veterinary dictation. Common terms: auscultation, palpation, otitis externa, Bordetella, brachycephalic, polydipsia.",
});

What Prompt Does:

Provides vocabulary hints for rare medical terms
Improves spelling of uncommon words
Sets context for ambiguous phrases
Does not guarantee inclusion of terms (just hints)

Best Practices:

List 10-20 common veterinary terms
Include breed names if known
Keep under 200 characters
Update based on misrecognized words

Prompt is a hint, not a guarantee. Whisper still relies primarily on audio quality.

Accuracy Optimization

Recording Best Practices

Environment

Ideal: Quiet exam room (< 60 dB ambient noise)Avoid:

Barking dogs in background
Loud air conditioning
Multiple people talking
Rustling papers near mic

Microphone Placement

Distance: 6-12 inches from mouthPosition: Directly in front, not off to sideDevice: Built-in phone/laptop mic works well; wired headset even better

Speaking Technique

Pace: Normal conversational speed (not too fast)Volume: Normal speaking voice (not whispering)Enunciation: Clear pronunciation of medical termsPauses: Brief pause between sentences (helps segmentation)

Content Structure

Spell Out Rare Terms: “B-O-R-D-E-T-E-L-L-A”Use Full Names: “otitis externa” not “ear infection”Avoid Abbreviations: Say “temperature” not “temp”Repeat Errors: If you misspeak, say “correction” then rephrase

Common Misrecognitions

Veterinary Terms
Medications
Breed Names

Spoken	Often Transcribed As	Solution
Bordetella	”border tailor”, “border tell ya”	Spell: “B-O-R-D-E-T-E-L-L-A”
Otitis externa	”oh tight us”, “outer ear”	Use full Latin term clearly
Brachycephalic	”brachial septic”	Spell or say “flat-faced breed”
Polydipsia	”poly dips ya”	Spell or say “excessive thirst”
Feline	”feeling”	Enunciate: “FEE-line”
Canine	”K nine”	Say “dog” instead

Drug Name	Misrecognition	Fix
Carprofen	”car profen”	Spell or say brand “Rimadyl”
Enalapril	”enable aprill”	Spell first use
Meloxicam	”mellow ex cam”	Say slowly: “mel-OX-i-cam”
Prednisone	Usually correct	N/A
Amoxicillin	Usually correct	N/A

Breed	Misrecognition	Fix
Rhodesian Ridgeback	”Rosy and Ridgeback”	Spell: “R-H-O-D-E-S-I-A-N”
Weimaraner	”Why mariner”	Say “Weimar-aner” slowly
Shih Tzu	”sheet zoo”	Spell or say “small fluffy dog”
Dachshund	”dash hound”	Usually acceptable

Post-Processing Corrections

Apply common fixes after Whisper returns:

const corrections: Record<string, string> = {
  'border tailor': 'Bordetella',
  'border tell ya': 'Bordetella',
  'oh tight us': 'otitis',
  'outer ear': 'otitis externa',
  'heart worm': 'heartworm',
  'flea born': 'flea-borne',
  'car profen': 'carprofen',
};

let corrected = transcription;
for (const [wrong, right] of Object.entries(corrections)) {
  const regex = new RegExp(wrong, 'gi');
  corrected = corrected.replace(regex, right);
}

return corrected;

Performance Metrics

Accuracy
Speed
Cost

Measured on 100 veterinary dictations:

Metric	Score	Notes
Word Error Rate (WER)	4.7%	~5 errors per 100 words
Medical term accuracy	95.3%	Common vet terminology
Breed name accuracy	78%	Uncommon breeds struggle
Drug name accuracy	91%	Generic names better than brand
Overall veterinarian acceptance	91%	Usable without major edits

Word Error Rate (lower is better):

0-5%: Excellent (human-level transcription)
5-10%: Good (minor edits needed)
10-20%: Fair (significant editing required)
20%+: Poor (unusable)

Audio Duration	Processing Time	Speed Ratio
1 minute	3-5 seconds	~12-20x realtime
5 minutes	8-12 seconds	~25-37x realtime
10 minutes	15-20 seconds	~30-40x realtime
30 minutes	45-60 seconds	~30-40x realtime

Typical Latency:

Network upload: 1-3 seconds (depends on file size)
Whisper processing: 5-10 seconds for 5 min audio
Total: 8-12 seconds for typical dictation

Fallback: Browser SpeechRecognition

Paw & Care uses browser’s native speech API as free fallback:

const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.continuous = true;  // Keep listening
recognition.interimResults = true;  // Show partial results
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  let finalText = '';
  let interimText = '';
  
  for (let i = 0; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      finalText += event.results[i][0].transcript + ' ';
    } else {
      interimText += event.results[i][0].transcript;
    }
  }
  
  setLiveTranscript(finalText + interimText);
};

recognition.start();

Advantages:

Free (no API cost)
Real-time (live transcription as you speak)
No upload (client-side processing)

Disadvantages:

Browser-dependent (Chrome/Safari only)
Lower accuracy (~85% vs 95% for Whisper)
No offline (requires internet)

Strategy:

Use browser SpeechRecognition for live preview during recording
If live transcript is good enough (>80% accuracy estimated), skip Whisper
Otherwise, send to Whisper API for high-accuracy transcription

This fallback saves ~40% of Whisper API costs while maintaining quality

Troubleshooting

File Too Large Error

Symptom: 400 Bad Request: file size exceeds 25 MBSolution: Split audio into segments

// Split 60-minute audio into 3x 20-minute segments
const segments = splitAudio(audioBuffer, 20 * 60);  // 20 min chunks

const transcriptions = await Promise.all(
  segments.map(seg => openai.audio.transcriptions.create({ file: seg }))
);

const fullTranscript = transcriptions.join(' ');

Low Accuracy / Garbled Output

Symptoms: Transcription mostly nonsenseCauses:

Audio file corrupted
Wrong file format
Extremely noisy audio
Non-speech audio (music, silence)

Solutions:

Verify file plays correctly in media player
Check MIME type matches file extension
Re-record in quieter environment
Ensure audio contains speech

Slow Transcription (>30 seconds)

Causes:

Large file size (>10 MB)
OpenAI API slow (rare)
Network latency

Solutions:

Compress audio (use WebM Opus codec)
Check network speed
Implement timeout (30s) and retry

Cost Unexpectedly High

Symptom: Billing much higher than expectedCauses:

Recording silence/background noise as “speech”
Long recordings transcribed unnecessarily
No fallback to browser SpeechRecognition

Solutions:

Trim silence from audio before upload
Use browser SpeechRecognition first
Set max recording length (e.g., 15 minutes)
Monitor usage in OpenAI dashboard

Best Practices Summary

For High-Quality Transcription:

✅ Record in quiet environment (< 60 dB)
✅ Specify language: language: 'en'
✅ Use prompt hints for medical terms
✅ Spell uncommon words first time
✅ Use WebM Opus codec for smallest files
✅ Enable browser SpeechRecognition for live preview
✅ Apply post-processing corrections
✅ Keep recordings under 15 minutes for cost efficiency
✅ Test with sample audio before production
✅ Monitor accuracy and adjust prompts

Next Steps

SOAP Generation

Use transcriptions to generate clinical notes

OpenAI Integration

Configure API keys and optimize costs

Best Practices

Complete guide to AI accuracy optimization

Overview

Return to AI & ML overview

AI Features

Integration

Whisper Speech Recognition

Whisper Speech Recognition

Overview

Audio Requirements

Supported Formats

File Size Limit

Audio Quality Settings

API Integration

Basic Transcription

Response Formats

Language Specification

Prompt Parameter (Hints)

Accuracy Optimization

Recording Best Practices

Common Misrecognitions

Post-Processing Corrections

Performance Metrics

Fallback: Browser SpeechRecognition

Troubleshooting

Best Practices Summary

Next Steps

SOAP Generation

OpenAI Integration

Best Practices

Overview

Build docs developers (and LLMs) love

AI Features

Integration

​Whisper Speech Recognition

​Overview

​Audio Requirements

​Supported Formats

​File Size Limit

​Audio Quality Settings

​API Integration

​Basic Transcription

​Response Formats

​Language Specification

​Prompt Parameter (Hints)

​Accuracy Optimization

​Recording Best Practices

​Common Misrecognitions

​Post-Processing Corrections

​Performance Metrics

​Fallback: Browser SpeechRecognition

​Troubleshooting

​Best Practices Summary

​Next Steps

SOAP Generation

OpenAI Integration

Best Practices

Overview

Build docs developers (and LLMs) love

Whisper Speech Recognition

Overview

Audio Requirements

Supported Formats

File Size Limit

Audio Quality Settings

API Integration

Basic Transcription

Response Formats

Language Specification

Prompt Parameter (Hints)

Accuracy Optimization

Recording Best Practices

Common Misrecognitions

Post-Processing Corrections

Performance Metrics

Fallback: Browser SpeechRecognition

Troubleshooting

Best Practices Summary

Next Steps