Overview
Performs speech recognition using the IBM Watson Speech to Text API. Provides enterprise-grade speech recognition with support for multiple languages and custom models.
Method Signature
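The signature itself did not survive on this page. As a hedged reconstruction, assuming the page documents `Recognizer.recognize_ibm` from the Python SpeechRecognition package, the call shape can be written down as a `typing.Protocol`. The parameter names `audio_data`, `key`, and `language` and the `"en-US"` default are assumptions taken from that package, not from this page; only `show_all` is named in the text.

```python
from typing import Any, Protocol


class SupportsRecognizeIBM(Protocol):
    """Assumed shape of the method described on this page (not authoritative)."""

    def recognize_ibm(
        self,
        audio_data: Any,           # an AudioData instance
        key: str,                  # IBM Watson Speech to Text API key
        language: str = "en-US",   # RFC 5646 tag with dialect (assumed default)
        show_all: bool = False,    # True -> raw API response, False -> tuple
    ) -> Any:
        ...
```

Check the installed package's own docstring before relying on these names; older releases may use a different authentication signature.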
Parameters
- Audio data: The audio data to recognize. Must be an AudioData instance.
- API key: IBM Watson Speech to Text API key. See the setup instructions below for how to obtain an API key.
- Language: Recognition language as an RFC 5646 language tag with dialect (e.g., "en-US", "es-ES", "zh-CN"). The supported language values are listed in the API documentation as model names like en-US_BroadbandModel.
- show_all: If True, returns the raw API response as a JSON dictionary. If False, returns a tuple of (transcript, confidence).
Returns
When show_all=False, returns (transcript, confidence) where:
- transcript: The recognized text (may contain multiple utterances separated by newlines)
- confidence: Confidence score between 0 and 1
When show_all=True, returns the raw API response containing:
- results: List of recognition results
- alternatives: Multiple transcription alternatives with confidence scores
Exceptions
Raised when the speech is unintelligible
Raised when:
- The API request fails
- The API key is invalid
- There is no internet connection
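The two failure modes above can be handled separately. The sketch below assumes the page documents the Python SpeechRecognition package, where they are raised as `UnknownValueError` and `RequestError`; those class names, the `key` keyword, and the helper name `transcribe_or_none` are assumptions, not taken from this page.

```python
def transcribe_or_none(recognizer, audio, api_key):
    """Return (transcript, confidence), or None if the speech was unintelligible.

    Request, authentication, and connectivity failures are re-raised as
    RuntimeError with the original message attached.
    """
    # Imported inside the function so this helper can be loaded (for example
    # in tests) without the package installed.
    import speech_recognition as sr

    try:
        return recognizer.recognize_ibm(audio, key=api_key)
    except sr.UnknownValueError:
        # The service could not make sense of the audio.
        return None
    except sr.RequestError as exc:
        # Failed request, invalid API key, or no internet connection.
        raise RuntimeError(f"IBM Speech to Text request failed: {exc}") from exc
```

Returning `None` for unintelligible speech while raising on request errors keeps "nothing was said clearly" distinct from "the service could not be reached".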
Example Usage
Basic Recognition
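The original example is missing here. A minimal sketch of a basic call, assuming the Python SpeechRecognition package (`sr.Recognizer`, `sr.Microphone`, and a `key` keyword on `recognize_ibm`); the helper names and the placeholder key are hypothetical.

```python
def recognize_once(recognizer, audio, api_key):
    """Send one captured clip to IBM and return (transcript, confidence)."""
    transcript, confidence = recognizer.recognize_ibm(audio, key=api_key)
    return transcript, confidence


def main():
    # Imported here so recognize_once can be reused without a microphone setup.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:  # needs PyAudio and a working input device
        print("Say something...")
        audio = recognizer.listen(source)

    text, score = recognize_once(recognizer, audio, "YOUR_IBM_API_KEY")
    print(f"IBM heard: {text!r} (confidence {score})")
```

Call `main()` to try it with a live microphone.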
With Different Languages
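The original example is missing here. One way to recognize clips in several languages is to pass a different language tag per call. This sketch assumes the `key` and `language` keyword names from the SpeechRecognition package; `transcribe_clips` is a hypothetical helper.

```python
def transcribe_clips(recognizer, clips, api_key):
    """Transcribe each clip with its own language tag.

    `clips` maps an RFC 5646 tag such as "es-ES" to an AudioData instance.
    Returns a dict mapping each tag to the recognized text.
    """
    transcripts = {}
    for language, audio in clips.items():
        text, _confidence = recognizer.recognize_ibm(
            audio, key=api_key, language=language
        )
        transcripts[language] = text
    return transcripts
```

For example, `transcribe_clips(recognizer, {"es-ES": spanish_audio, "fr-FR": french_audio}, api_key)` would return one transcript per tag.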
Getting Full API Response
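The original example is missing here. With show_all=True the call returns the raw response, so the caller has to walk results and alternatives itself. The sketch below assumes the `key` keyword name; the helper names and the sample response values are illustrative only.

```python
def recognize_raw(recognizer, audio, api_key):
    """Return the raw API response as a dictionary."""
    return recognizer.recognize_ibm(audio, key=api_key, show_all=True)


def top_alternatives(response):
    """Return [(transcript, confidence), ...] using the first alternative of each result."""
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            top = alternatives[0]
            pairs.append((top.get("transcript", ""), top.get("confidence")))
    return pairs


# Illustrative response shape; the values are made up.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "hello world ", "confidence": 0.94}]},
        {"alternatives": [{"transcript": "how are you ", "confidence": 0.88}]},
    ]
}
print(top_alternatives(sample_response))
```

Using `.get` with defaults keeps the helper from raising when a result carries no alternatives.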
From Audio File
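The original example is missing here. A sketch of file-based recognition, assuming the SpeechRecognition package, where `sr.AudioFile(path)` is used as a context manager and `Recognizer.record` reads it into an AudioData instance. `transcribe_source` and the file name in the usage note are hypothetical.

```python
def transcribe_source(recognizer, audio_source, api_key):
    """Read an entire audio source and return (transcript, confidence).

    `audio_source` is expected to be a context manager such as
    speech_recognition.AudioFile("recording.wav").
    """
    with audio_source as source:
        audio = recognizer.record(source)  # reads the whole file
    return recognizer.recognize_ibm(audio, key=api_key)
```

Typical use would look like `transcribe_source(sr.Recognizer(), sr.AudioFile("recording.wav"), api_key)` after `import speech_recognition as sr`.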
Using Environment Variables
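The original example is missing here. Reading the key from the environment keeps it out of source control. The variable name `IBM_WATSON_API_KEY` is an assumption chosen for this sketch, not a name the service or library requires.

```python
import os


def load_ibm_key(env_var="IBM_WATSON_API_KEY"):
    """Return the API key stored in `env_var`, failing loudly if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your IBM API key"
        )
    return key
```

The returned value can then be passed as the API key argument of the recognition call.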
Mandarin Chinese Recognition
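The original example is missing here. A sketch for Mandarin, using the "zh-CN" tag listed on this page and splitting the newline-separated transcript into utterances. The `key` and `language` keyword names are assumed from the SpeechRecognition package; `transcribe_mandarin` is a hypothetical helper.

```python
def transcribe_mandarin(recognizer, audio, api_key):
    """Return a list of recognized Mandarin utterances, one per entry."""
    transcript, _confidence = recognizer.recognize_ibm(
        audio, key=api_key, language="zh-CN"
    )
    # Multiple utterances arrive separated by newlines; drop empty pieces.
    return [line.strip() for line in transcript.split("\n") if line.strip()]
```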
Setup Instructions
1. Create IBM Cloud Account
- Go to IBM Cloud
- Sign up for a free account (Lite tier available)
- Log in to the IBM Cloud Console
2. Create Speech to Text Service
- Go to IBM Cloud Catalog
- Search for “Speech to Text”
- Click on the service
- Select a region (e.g., Dallas, Washington DC)
- Choose the Lite plan (free tier) or a paid plan
- Give your service a name
- Click Create
3. Get API Key
- After creation, you’ll be taken to the service dashboard
- Click Manage in the left sidebar
- Under Credentials, you’ll see:
  - API Key: Your authentication key
  - URL: Service endpoint URL
- Copy the API Key
4. Use in Code
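The original snippet for this step is missing. One possible pattern, sketched under the assumption that the recognizer exposes `recognize_ibm` with `key` and `language` keywords, is to bind the copied key once and reuse the resulting callable. `make_ibm_transcriber` is a hypothetical helper.

```python
from functools import partial


def make_ibm_transcriber(recognizer, api_key, language="en-US"):
    """Return a callable that transcribes AudioData with the key pre-bound."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return partial(recognizer.recognize_ibm, key=api_key, language=language)
```

After `transcribe = make_ibm_transcriber(recognizer, api_key)`, each clip needs only `transcribe(audio)`.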
Language Support
IBM Watson Speech to Text supports many languages.
Available Models
- en-US: English (United States)
- en-GB: English (United Kingdom)
- es-ES: Spanish (Spain)
- es-LA: Spanish (Latin America)
- fr-FR: French (France)
- de-DE: German (Germany)
- it-IT: Italian (Italy)
- ja-JP: Japanese
- ko-KR: Korean
- pt-BR: Portuguese (Brazil)
- zh-CN: Chinese (Mandarin, Simplified)
- ar-MS: Arabic (Modern Standard)
- nl-NL: Dutch (Netherlands)
- fr-CA: French (Canada)
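The language tags above can be kept in a small lookup table so a bad tag is caught before any request is sent. The `_BroadbandModel` suffix follows the single example given on this page (en-US_BroadbandModel); applying it to every tag is an assumption, so confirm actual model availability in the IBM documentation. The helper name is hypothetical.

```python
SUPPORTED_LANGUAGES = {
    "en-US": "English (United States)",
    "en-GB": "English (United Kingdom)",
    "es-ES": "Spanish (Spain)",
    "es-LA": "Spanish (Latin America)",
    "fr-FR": "French (France)",
    "de-DE": "German (Germany)",
    "it-IT": "Italian (Italy)",
    "ja-JP": "Japanese",
    "ko-KR": "Korean",
    "pt-BR": "Portuguese (Brazil)",
    "zh-CN": "Chinese (Mandarin, Simplified)",
    "ar-MS": "Arabic (Modern Standard)",
    "nl-NL": "Dutch (Netherlands)",
    "fr-CA": "French (Canada)",
}


def broadband_model_name(language):
    """Return the assumed broadband model name for a listed language tag."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language tag: {language!r}")
    return f"{language}_BroadbandModel"
```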
Pricing
- Lite Plan: 500 minutes per month (free)
- Standard Plan: Pay-per-use after Lite tier
Features
- Multiple Languages: Support for 10+ languages
- Custom Models: Train custom acoustic and language models
- Speaker Labels: Identify different speakers
- Smart Formatting: Automatic formatting of dates, times, numbers
- Profanity Filtering: Optional filtering of profane words
- Word Timestamps: Get timing for each word
- Confidence Scores: Returns confidence for transcriptions
Notes
- Requires internet connection
- Audio must have a sample rate of at least 16 kHz
- Audio is automatically converted to 16-bit samples
- Returns both transcript and confidence score
- Free Lite tier includes 500 minutes per month
- Multiple utterances are separated by newlines in the transcript
- Uses FLAC audio format for transmission
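The 16 kHz note above can be checked before a WAV file is submitted. This sketch uses only the standard-library `wave` module; the helper names are hypothetical and the check covers uncompressed WAV files only.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # minimum sample rate stated in the notes


def wav_sample_rate(path):
    """Return the sample rate (Hz) recorded in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def meets_minimum_rate(path):
    """True if the WAV file is sampled at 16 kHz or higher."""
    return wav_sample_rate(path) >= MIN_SAMPLE_RATE_HZ
```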