IBM Watson Speech to Text is an enterprise-grade speech recognition service offering industry-specific models, speaker diarization, and support for 20+ languages. It’s ideal for businesses requiring customization and advanced features.
Method Signature
recognize_ibm(
    audio_data: AudioData,
    key: str,
    language: str = "en-US",
    show_all: bool = False
) -> tuple[str, float] | dict
Parameters
- audio_data: An AudioData instance containing the audio to transcribe.
- key: IBM Watson Speech to Text API key. Required for authentication. See Getting an API Key for instructions.
- language: Recognition language as an RFC 5646 language tag with dialect (e.g., "en-US", "es-ES", "ja-JP"). The library automatically selects the appropriate broadband model for the language.
- show_all: If True, returns the full API response as a dict. If False, returns a tuple of (transcription, confidence).
Returns
- Default: tuple[str, float] - Transcription text and confidence score (0.0 to 1.0)
- With show_all=True: dict - Full API response with all alternatives, timestamps, and metadata
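To make the relationship between the two return shapes concrete, here is a minimal sketch that derives a (transcription, confidence) pair from a full response dict. The helper name `best_transcript` and the sample response are hypothetical; the `results`, `alternatives`, `transcript`, and `confidence` keys are the ones used in the Full Response example on this page. It approximates the default behavior and is not the library's own implementation.

```python
from typing import Optional


def best_transcript(response: dict) -> tuple[str, Optional[float]]:
    """Join the first alternative of each result into one transcription.

    Hypothetical helper: approximates what show_all=False returns,
    starting from a show_all=True response.
    """
    lines = []
    confidence = None
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        first = alternatives[0]
        lines.append(first.get("transcript", "").strip())
        # Keep the confidence of the last result that reports one.
        confidence = first.get("confidence", confidence)
    if not lines:
        raise ValueError("response contains no transcription")
    return "\n".join(lines), confidence


# Made-up response, for illustration only.
sample = {
    "results": [
        {"alternatives": [{"transcript": "hello world ", "confidence": 0.94}]},
        {"alternatives": [{"transcript": "good morning ", "confidence": 0.88}]},
    ]
}
text, confidence = best_transcript(sample)
print(text)
print(confidence)
```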
Getting an API Key
Create Speech to Text Service
- Log in to IBM Cloud
- Click “Create resource”
- Search for “Speech to Text”
- Select “Speech to Text” service
- Choose a plan:
  - Lite: 500 minutes/month free
  - Standard: Pay-as-you-go
- Choose a region (Dallas, Frankfurt, Sydney, Tokyo, Washington DC, London)
- Click “Create”
Get API Key
- Go to your Speech to Text service instance
- Click “Manage” in the left sidebar
- Click “Show Credentials”
- Copy the API Key
- Note the URL (though the library uses a default endpoint)
IBM Watson API keys are mixed-case alphanumeric strings. The service URL format is typically:
https://api.{region}.speech-to-text.watson.cloud.ibm.com
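As a small illustration of the URL pattern above, the sketch below fills in the `{region}` placeholder. The `REGION_IDS` mapping and the `service_url` helper are assumptions for illustration (the identifiers follow IBM Cloud's usual region naming); the URL shown in your own service credentials is authoritative.

```python
# Region identifiers are an assumption based on IBM Cloud's usual naming;
# confirm the exact URL in your service credentials.
REGION_IDS = {
    "Dallas": "us-south",
    "Washington DC": "us-east",
    "Frankfurt": "eu-de",
    "London": "eu-gb",
    "Sydney": "au-syd",
    "Tokyo": "jp-tok",
}


def service_url(region_name: str) -> str:
    """Fill the {region} placeholder in the service URL pattern."""
    try:
        region_id = REGION_IDS[region_name]
    except KeyError:
        raise ValueError(f"Unknown region: {region_name!r}") from None
    return f"https://api.{region_id}.speech-to-text.watson.cloud.ibm.com"


print(service_url("Frankfurt"))
```

This matters mainly when you call the service through the IBM Watson SDK, since `recognize_ibm` uses its own default endpoint.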
Basic Example
import speech_recognition as sr

IBM_API_KEY = "your_ibm_watson_api_key"

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)  # Read the entire file into memory

try:
    text, confidence = r.recognize_ibm(audio, key=IBM_API_KEY)
    print(f"Transcription: {text}")
    print(f"Confidence: {confidence:.2%}")
except sr.UnknownValueError:
    print("IBM Watson could not understand audio")
except sr.RequestError as e:
    print(f"Could not request results; {e}")
Microphone Example
import speech_recognition as sr

IBM_API_KEY = "your_ibm_watson_api_key"

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Speak now...")
    audio = r.listen(source)  # Record until a pause is detected

print("Transcribing...")
try:
    text, confidence = r.recognize_ibm(audio, key=IBM_API_KEY)
    print(f"You said: {text}")
    print(f"Confidence: {confidence:.2%}")
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError as e:
    print(f"Error: {e}")
Language Support
IBM Watson Speech to Text supports 20+ languages with various models.
Major Languages
import speech_recognition as sr

IBM_KEY = "your_ibm_watson_api_key"

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

# Pick the call that matches the language spoken in the audio.
# English (US)
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="en-US")
# Spanish (Spain)
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="es-ES")
# French (France)
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="fr-FR")
# German
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="de-DE")
# Japanese
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="ja-JP")
# Korean
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="ko-KR")
# Chinese (Mandarin)
text, conf = r.recognize_ibm(audio, key=IBM_KEY, language="zh-CN")
Supported Languages:
- Arabic (ar-MS)
- Chinese (Mandarin) (zh-CN)
- Dutch (nl-NL)
- English (Australian, UK, US) (en-AU, en-GB, en-US)
- French (Canadian, France) (fr-CA, fr-FR)
- German (de-DE)
- Hindi (hi-IN)
- Italian (it-IT)
- Japanese (ja-JP)
- Korean (ko-KR)
- Portuguese (Brazilian) (pt-BR)
- Spanish (Argentine, Castilian, Chilean, Colombian, Mexican, Peruvian) (es-AR, es-ES, es-CL, es-CO, es-MX, es-PE)
- Swedish (sv-SE)
For the complete list with model details, see IBM’s language support documentation.
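Because an unsupported or mistyped tag only fails once the request reaches the API, it can help to validate the tag locally first. The following is a minimal sketch; `SUPPORTED_LANGUAGES` simply copies the list above (which may lag behind IBM's current offering) and `normalize_language` is a hypothetical helper, not part of the library.

```python
# Copied from the list above; treat it as a snapshot, not an authoritative set.
SUPPORTED_LANGUAGES = frozenset({
    "ar-MS", "zh-CN", "nl-NL", "en-AU", "en-GB", "en-US", "fr-CA", "fr-FR",
    "de-DE", "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR",
    "es-AR", "es-ES", "es-CL", "es-CO", "es-MX", "es-PE", "sv-SE",
})


def normalize_language(tag: str) -> str:
    """Return the tag in 'll-RR' form, or raise ValueError if unsupported."""
    parts = tag.strip().replace("_", "-").split("-")
    if len(parts) != 2:
        raise ValueError(f"Expected a tag like 'en-US', got {tag!r}")
    canonical = f"{parts[0].lower()}-{parts[1].upper()}"
    if canonical not in SUPPORTED_LANGUAGES:
        raise ValueError(f"{canonical} is not in the supported list")
    return canonical


print(normalize_language("en_us"))
```

The result can then be passed as the `language` argument of `recognize_ibm`.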
Full Response
import speech_recognition as sr
import json

IBM_API_KEY = "your_api_key"

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

# Get full response
response = r.recognize_ibm(
    audio,
    key=IBM_API_KEY,
    show_all=True
)
print(json.dumps(response, indent=2))

# Access detailed results
for result in response.get("results", []):
    for alternative in result.get("alternatives", []):
        print(f"Transcript: {alternative['transcript']}")
        print(f"Confidence: {alternative['confidence']:.2%}")
Using Environment Variables
import speech_recognition as sr
import os

# Read the key from the environment instead of hard-coding it
IBM_API_KEY = os.environ.get("IBM_WATSON_API_KEY")
if not IBM_API_KEY:
    raise ValueError("IBM_WATSON_API_KEY environment variable not set")

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

text, confidence = r.recognize_ibm(audio, key=IBM_API_KEY)
print(f"{text} ({confidence:.0%} confidence)")
Error Handling
import speech_recognition as sr

IBM_API_KEY = "your_api_key"

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

try:
    text, confidence = r.recognize_ibm(
        audio,
        key=IBM_API_KEY,
        language="en-US"
    )
    print(f"Transcription: {text}")
except sr.UnknownValueError:
    # Speech was unintelligible
    print("Could not understand the audio")
except sr.RequestError as e:
    # API request failed
    error_msg = str(e).lower()
    if "unauthorized" in error_msg or "401" in error_msg:
        print("Invalid API key")
    elif "forbidden" in error_msg or "403" in error_msg:
        print("Access forbidden - check your subscription")
    elif "connection" in error_msg:
        print("Network connection error")
    else:
        print(f"API error: {e}")
Audio Requirements
- Sample Rate: Minimum 16 kHz recommended (automatically converted if lower)
- Sample Width: Minimum 16-bit (automatically converted)
- Format: Converted to FLAC before sending to API
- Channels: Mono (stereo is automatically converted)
- Audio Length: Up to 100 MB or 60 minutes per request
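Although the library converts audio automatically, checking a file up front can explain poor results or rejected requests. Below is a minimal sketch using Python's standard `wave` module; `check_wav` is a hypothetical helper, the thresholds come from the list above, and reading 100 MB as 100 × 1024 × 1024 bytes is an assumption. The size check looks at the WAV file on disk, so it only approximates the FLAC payload the library actually sends.

```python
import os
import wave

MAX_BYTES = 100 * 1024 * 1024  # 100 MB (binary interpretation is an assumption)
MAX_SECONDS = 60 * 60          # 60 minutes


def check_wav(path: str) -> list[str]:
    """Return human-readable warnings for a WAV file; an empty list means OK."""
    warnings = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width_bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    if rate < 16000:
        warnings.append(f"sample rate {rate} Hz is below 16 kHz")
    if width_bits < 16:
        warnings.append(f"sample width {width_bits}-bit is below 16-bit")
    if channels != 1:
        warnings.append(f"{channels} channels found; mono is expected")
    if seconds > MAX_SECONDS:
        warnings.append(f"duration {seconds / 60:.1f} min exceeds 60 min")
    if os.path.getsize(path) > MAX_BYTES:
        warnings.append("file is larger than 100 MB")
    return warnings
```

Calling `check_wav("audio.wav")` before `r.record(source)` gives a list of warnings you can log or act on.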
Timeouts
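import speech_recognition as sr

IBM_API_KEY = "your_api_key"

r = sr.Recognizer()
r.operation_timeout = 15  # Wait up to 15 seconds for the API to respond
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

try:
    text, _ = r.recognize_ibm(audio, key=IBM_API_KEY)
    print(text)
except (sr.RequestError, TimeoutError) as e:
    # sr.WaitTimeoutError is raised only by listen(); a timed-out API call
    # surfaces as RequestError or as a socket-level timeout
    # (TimeoutError on Python 3.10+)
    print(f"Request failed or timed out: {e}")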
Advanced Features (SDK Required)
For advanced features, use the IBM Watson Python SDK directly:
Speaker Diarization
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('your_api_key')
speech_to_text = SpeechToTextV1(authenticator=authenticator)

with open('audio.wav', 'rb') as audio_file:
    response = speech_to_text.recognize(
        audio=audio_file,
        content_type='audio/wav',
        speaker_labels=True  # Enable speaker diarization
    ).get_result()

for result in response['results']:
    for alternative in result['alternatives']:
        print(alternative['transcript'])

for speaker in response['speaker_labels']:
    print(f"Speaker {speaker['speaker']}: {speaker['from']}s - {speaker['to']}s")
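The loop above prints one line per label, which typically means one line per word. A minimal sketch of collapsing consecutive labels from the same speaker into turns follows; `merge_speaker_turns` is a hypothetical helper and the sample labels are made up, using only the `speaker`, `from`, and `to` fields shown above.

```python
def merge_speaker_turns(speaker_labels: list[dict]) -> list[dict]:
    """Collapse consecutive labels from the same speaker into one turn."""
    turns: list[dict] = []
    for label in speaker_labels:
        if turns and turns[-1]["speaker"] == label["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["to"] = label["to"]
        else:
            turns.append({
                "speaker": label["speaker"],
                "from": label["from"],
                "to": label["to"],
            })
    return turns


# Made-up labels, for illustration only.
sample_labels = [
    {"speaker": 0, "from": 0.0, "to": 0.4},
    {"speaker": 0, "from": 0.4, "to": 0.9},
    {"speaker": 1, "from": 1.1, "to": 1.6},
    {"speaker": 0, "from": 1.8, "to": 2.2},
]
for turn in merge_speaker_turns(sample_labels):
    print(f"Speaker {turn['speaker']}: {turn['from']}s - {turn['to']}s")
```

With a real response, pass `response['speaker_labels']` instead of `sample_labels`.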
Custom Language Models
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('your_api_key')
speech_to_text = SpeechToTextV1(authenticator=authenticator)

with open('audio.wav', 'rb') as audio_file:
    response = speech_to_text.recognize(
        audio=audio_file,
        content_type='audio/wav',
        language_customization_id='your-custom-model-id'  # ID of your trained custom model
    ).get_result()
Smart Formatting
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('your_api_key')
speech_to_text = SpeechToTextV1(authenticator=authenticator)

with open('audio.wav', 'rb') as audio_file:
    response = speech_to_text.recognize(
        audio=audio_file,
        content_type='audio/wav',
        smart_formatting=True  # Format dates, times, numbers, etc.
    ).get_result()
Industry-Specific Models
IBM offers specialized models for specific industries:
- Medical: Healthcare terminology
- Telephony: Optimized for phone audio
- Multimedia: Optimized for video/broadcast content
These require the IBM Watson SDK and custom model training.
Pricing
Pricing Tiers:
- Lite: 500 minutes/month free
- Standard: $0.02 per minute (first 1,000 minutes)
- Plus: Volume discounts available
Check IBM Watson pricing for current rates.
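For a rough budget check, the arithmetic can be sketched as below. The figures are the ones listed above and may be out of date; the helper names are hypothetical, and the flat-rate calculation is an assumption that ignores any volume tiers beyond the first 1,000 minutes.

```python
LITE_FREE_MINUTES = 500    # Lite plan allowance, from the list above
STANDARD_RATE_USD = 0.02   # Standard plan per-minute rate, from the list above


def fits_lite_plan(minutes_per_month: float) -> bool:
    """True if the monthly usage stays within the Lite allowance."""
    return minutes_per_month <= LITE_FREE_MINUTES


def estimate_standard_cost(minutes: float, rate: float = STANDARD_RATE_USD) -> float:
    """Flat-rate estimate in USD; volume discounts are not modeled."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * rate, 2)


monthly_minutes = 800
print(fits_lite_plan(monthly_minutes))
print(estimate_standard_cost(monthly_minutes))
```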
Best Practices
For production applications:
- Use environment variables for API keys
- Implement retry logic for transient failures
- Monitor usage in IBM Cloud dashboard
- Use custom models for domain-specific terminology
- Implement proper error handling
- Consider using the IBM Watson SDK for advanced features
Security:
- Never commit API keys to version control
- Rotate keys periodically
- Use IBM Cloud IAM for fine-grained access control
- Implement rate limiting
- Consider data residency requirements for sensitive data
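The retry recommendation above could be implemented along these lines. This is a minimal sketch of exponential backoff, not a prescribed approach: `with_retries` and its parameters are hypothetical, and which exceptions count as transient is left to the caller.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying with exponential backoff on retry_on errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the recognizer from the earlier examples, a call might look like `with_retries(lambda: r.recognize_ibm(audio, key=IBM_API_KEY), retry_on=(sr.RequestError,))`. `sr.UnknownValueError` is deliberately left out of `retry_on` in that usage, since resending unintelligible audio is unlikely to help.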
Advantages
- Industry Models: Specialized models for healthcare, telephony, and multimedia
- Speaker Diarization: Identify who said what
- Custom Models: Train models with your terminology
- Smart Formatting: Automatic formatting of dates, numbers, etc.
- Profanity Filtering: Built-in content filtering
- Free Tier: 500 minutes/month for testing
- Enterprise Support: 24/7 support available
Limitations
- Fewer Languages: ~20 languages vs 100+ for Google/Azure
- Setup Complexity: More complex than some alternatives
- Cost: Can be expensive for high volumes
- Regional Availability: Limited to certain IBM Cloud regions
Use Cases
- Call center transcription
- Medical dictation and transcription
- Legal deposition transcription
- Meeting transcription with speaker identification
- Voice commands for enterprise applications
- Compliance recording and analysis
- Customer service analytics
Comparison: IBM vs Other Services
| Feature | IBM Watson | Azure | Google |
|---|---|---|---|
| Accuracy | High | High | High |
| Languages | 20+ | 100+ | 100+ |
| Speaker Diarization | Yes | Yes (SDK) | Yes (SDK) |
| Custom Models | Yes | Yes | Yes |
| Industry Models | Yes | Limited | No |
| Free Tier | 500 min/month | 5 hours/month | Limited |
| Setup | Medium | Medium | Easy |