Microsoft Azure Speech is an enterprise-grade cloud speech recognition service offering real-time transcription, custom models, and support for over 100 languages. It’s ideal for production applications requiring high reliability and advanced features.
Method Signature
recognize_azure(
    audio_data: AudioData,
    key: str,
    language: str = "en-US",
    profanity: str = "masked",
    location: str = "westus",
    show_all: bool = False
) -> tuple[str, float] | dict
Parameters
- audio_data (AudioData, required): An AudioData instance containing the audio to transcribe.
- key (str, required): Azure Speech API subscription key, used for authentication. See Getting an API Key for instructions.
- language (str, default "en-US"): Recognition language as a BCP-47 language tag (e.g., "en-US", "fr-FR", "ja-JP"). See supported languages.
- profanity (str, default "masked"): Profanity filtering mode:
  - "masked": Replace profanity with asterisks
  - "removed": Remove profanity from results
  - "raw": No filtering
- location (str, default "westus"): Azure region where your Speech resource is deployed. Common regions: "eastus", "westus", "westus2", "northeurope", "westeurope", "southeastasia"
- show_all (bool, default False): If True, returns the full API response as a dict. If False, returns a tuple of (transcription, confidence).
Returns
- Default: tuple[str, float] - Transcription text and confidence score (0.0 to 1.0)
- With show_all=True: dict - Full API response with all recognition details
Getting an API Key
Create Speech Resource
- Go to the Azure Portal
- Click “Create a resource”
- Search for “Speech” or “Cognitive Services”
- Click “Create”
- Fill in the form:
  - Name: Your resource name
  - Subscription: Select your subscription
  - Location: Choose a region near you
  - Pricing tier: F0 (free) or S0 (paid)
- Click “Review + create” then “Create”
Get API Key
- Navigate to your Speech resource
- Click “Keys and Endpoint” in the left menu
- Copy Key 1 or Key 2 (either works)
- Note the Location/Region (you’ll need this too)
Azure Speech API keys are 32-character lowercase hexadecimal strings.
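A quick shape check can catch copy-paste mistakes (stray whitespace, a placeholder left in place) before any network call is made. The sketch below assumes the 32-character lowercase hexadecimal format described above; `looks_like_azure_key` is a hypothetical helper, not part of the library. If the portal shows a key in a different format, trust the portal and skip this check.

```python
import re

# Assumption: keys follow the 32-character lowercase hex format described above.
_KEY_PATTERN = re.compile(r"[0-9a-f]{32}")


def looks_like_azure_key(key: str) -> bool:
    """Return True if `key` has the expected 32-character lowercase hex shape."""
    return isinstance(key, str) and _KEY_PATTERN.fullmatch(key) is not None


if __name__ == "__main__":
    # The placeholder used in the examples on this page fails the check.
    print(looks_like_azure_key("your_azure_speech_api_key"))
```

This only validates the shape of the string. A well-formed key can still be rejected by the service if it is revoked or belongs to another region.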
Basic Example
import speech_recognition as sr
# Your Azure Speech credentials
AZURE_KEY = "your_azure_speech_api_key"
AZURE_LOCATION = "westus" # Or your resource location
r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)
try:
    text, confidence = r.recognize_azure(
        audio,
        key=AZURE_KEY,
        location=AZURE_LOCATION
    )
    print(f"Transcription: {text}")
    print(f"Confidence: {confidence:.2%}")
except sr.UnknownValueError:
    print("Azure could not understand audio")
except sr.RequestError as e:
    print(f"Could not request results; {e}")
Microphone Example
import speech_recognition as sr
AZURE_KEY = "your_azure_speech_api_key"
AZURE_LOCATION = "westus"
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Speak now...")
    audio = r.listen(source)
print("Transcribing...")
text, confidence = r.recognize_azure(
    audio,
    key=AZURE_KEY,
    location=AZURE_LOCATION
)
print(f"You said: {text}")
print(f"Confidence: {confidence:.2%}")
Language Support
Azure Speech supports over 100 languages and dialects.
Common Languages
import speech_recognition as sr

KEY = "your_azure_speech_api_key"

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

# English (US)
text, conf = r.recognize_azure(audio, key=KEY, language="en-US")
# Spanish (Spain)
text, conf = r.recognize_azure(audio, key=KEY, language="es-ES")
# French (France)
text, conf = r.recognize_azure(audio, key=KEY, language="fr-FR")
# German (Germany)
text, conf = r.recognize_azure(audio, key=KEY, language="de-DE")
# Japanese
text, conf = r.recognize_azure(audio, key=KEY, language="ja-JP")
# Chinese (Mandarin, Simplified)
text, conf = r.recognize_azure(audio, key=KEY, language="zh-CN")
Regional Variants
# Reuses r, audio, and KEY from the Common Languages example
# English variants
r.recognize_azure(audio, key=KEY, language="en-US")  # United States
r.recognize_azure(audio, key=KEY, language="en-GB")  # United Kingdom
r.recognize_azure(audio, key=KEY, language="en-AU")  # Australia
r.recognize_azure(audio, key=KEY, language="en-CA")  # Canada
r.recognize_azure(audio, key=KEY, language="en-IN")  # India
# Spanish variants
r.recognize_azure(audio, key=KEY, language="es-ES")  # Spain
r.recognize_azure(audio, key=KEY, language="es-MX")  # Mexico
r.recognize_azure(audio, key=KEY, language="es-AR")  # Argentina
# Portuguese variants
r.recognize_azure(audio, key=KEY, language="pt-BR")  # Brazil
r.recognize_azure(audio, key=KEY, language="pt-PT")  # Portugal
For a complete list, see Azure’s language support documentation.
Profanity Filtering
import speech_recognition as sr
AZURE_KEY = "your_key"
r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)
# Masked (default) - replaces profanity with asterisks
text, _ = r.recognize_azure(audio, key=AZURE_KEY, profanity="masked")
print(text) # "What the ****"
# Removed - removes profanity entirely
text, _ = r.recognize_azure(audio, key=AZURE_KEY, profanity="removed")
print(text) # "What the"
# Raw - no filtering
text, _ = r.recognize_azure(audio, key=AZURE_KEY, profanity="raw")
print(text) # "What the hell"
Azure Regions
Choose a region close to your users for lower latency:
import speech_recognition as sr

KEY = "your_azure_speech_api_key"

r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)

# Pass the one region that matches your resource; the calls below are
# alternatives, not a sequence to run with a single key.
# US regions
text, _ = r.recognize_azure(audio, key=KEY, location="eastus")
text, _ = r.recognize_azure(audio, key=KEY, location="westus")
text, _ = r.recognize_azure(audio, key=KEY, location="westus2")
# Europe regions
text, _ = r.recognize_azure(audio, key=KEY, location="northeurope")
text, _ = r.recognize_azure(audio, key=KEY, location="westeurope")
# Asia regions
text, _ = r.recognize_azure(audio, key=KEY, location="southeastasia")
text, _ = r.recognize_azure(audio, key=KEY, location="eastasia")
The location parameter must match the region where you created your Azure Speech resource.
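One way to see why the region has to match: Azure's regional REST endpoints carry the region in the hostname, so a key issued for one region is sent to a different server if the location is wrong. The sketch below builds such a URL. `build_recognition_url` is a hypothetical helper, and the URL pattern follows Azure's speech-to-text REST endpoint for short audio. Whether the library builds exactly this URL internally is an assumption.

```python
from urllib.parse import urlencode


def build_recognition_url(location: str, language: str = "en-US",
                          profanity: str = "masked") -> str:
    """Build a regional speech-to-text URL (assumed pattern, for illustration)."""
    query = urlencode({
        "language": language,
        "format": "detailed",
        "profanity": profanity,
    })
    # The region is part of the hostname, so each region is a separate endpoint.
    return (
        f"https://{location}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )


if __name__ == "__main__":
    print(build_recognition_url("westeurope", language="fr-FR"))
```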
Full Response
import speech_recognition as sr
import json
AZURE_KEY = "your_key"
r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)
# Get full response
response = r.recognize_azure(
    audio,
    key=AZURE_KEY,
    location="westus",
    show_all=True
)
print(json.dumps(response, indent=2))
# Access specific fields
for result in response.get("NBest", []):
    print(f"Text: {result['Display']}")
    print(f"Confidence: {result['Confidence']:.2%}")
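When working with the full response, it is often useful to pick a single alternative yourself rather than print all of them. The sketch below selects the highest-confidence entry from the NBest list. The `sample_response` dict is an illustrative stand-in, not real service output, and the `RecognitionStatus` check assumes the detailed response format includes that field. `best_alternative` is a hypothetical helper.

```python
from typing import Optional, Tuple

# Illustrative stand-in for a show_all=True response; values are made up.
sample_response = {
    "RecognitionStatus": "Success",
    "NBest": [
        {"Confidence": 0.62, "Display": "Recognize speech."},
        {"Confidence": 0.91, "Display": "Wreck a nice beach."},
    ],
}


def best_alternative(response: dict) -> Optional[Tuple[str, float]]:
    """Return (text, confidence) for the highest-confidence alternative, or None."""
    if response.get("RecognitionStatus") != "Success":
        return None
    candidates = response.get("NBest") or []
    if not candidates:
        return None
    top = max(candidates, key=lambda item: item.get("Confidence", 0.0))
    return top["Display"], top["Confidence"]


if __name__ == "__main__":
    print(best_alternative(sample_response))
```

Returning None for an unsuccessful or empty response keeps the caller in control of how to handle unintelligible audio, since show_all=True hands back the raw response.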
Error Handling
import speech_recognition as sr
AZURE_KEY = "your_key"
AZURE_LOCATION = "westus"
r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)
try:
    text, confidence = r.recognize_azure(
        audio,
        key=AZURE_KEY,
        location=AZURE_LOCATION
    )
    print(f"Transcription: {text}")
except sr.UnknownValueError:
    # Speech was unintelligible
    print("Could not understand the audio")
except sr.RequestError as e:
    # API request failed. Matching on the message text is a heuristic:
    # the exact wording is not a stable interface and may differ.
    message = str(e).lower()
    if "invalid key" in message:
        print("Invalid API key")
    elif "connection" in message:
        print("Network connection error")
    else:
        print(f"API error: {e}")
Audio Requirements
- Sample Rate: 16 kHz (automatically converted)
- Sample Width: 16-bit (automatically converted)
- Channels: Mono (stereo is automatically converted)
- Format: Converted to WAV with PCM encoding
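Because conversion is automatic, you rarely need to resample by hand, but it can help to know whether a given file already matches the target format. The sketch below uses Python's standard `wave` module to read a WAV header and compare it with the values listed above. `describe_wav` and `needs_conversion` are hypothetical helpers, not library functions.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read sample rate, sample width, and channel count from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bits": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
        }


def needs_conversion(info: dict) -> bool:
    """Return True if the audio differs from 16 kHz, 16-bit, mono."""
    return (
        info["sample_rate_hz"] != 16000
        or info["sample_width_bits"] != 16
        or info["channels"] != 1
    )
```

For example, `needs_conversion(describe_wav("audio.wav"))` reports whether the file would be resampled or downmixed before upload.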
Timeouts
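import socket
import speech_recognition as sr

AZURE_KEY = "your_key"

r = sr.Recognizer()
r.operation_timeout = 15  # Give up on the HTTP request after 15 seconds
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)
try:
    text, _ = r.recognize_azure(audio, key=AZURE_KEY)
    print(text)
except (sr.RequestError, socket.timeout) as e:
    # sr.WaitTimeoutError is raised by listen(), not by recognize_azure();
    # a timed-out request is expected to surface as one of these instead
    print(f"Request timed out or failed: {e}")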
Using Environment Variables
import speech_recognition as sr
import os
# Store credentials in environment variables
AZURE_KEY = os.environ.get("AZURE_SPEECH_KEY")
AZURE_LOCATION = os.environ.get("AZURE_SPEECH_LOCATION", "westus")
if not AZURE_KEY:
    raise ValueError("AZURE_SPEECH_KEY environment variable not set")
r = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)
text, confidence = r.recognize_azure(
    audio,
    key=AZURE_KEY,
    location=AZURE_LOCATION
)
print(text)
Pricing
Pricing Tiers:
- Free (F0): 5 audio hours per month
- Standard (S0): $1 per audio hour
Check Azure Speech pricing for current rates.
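For rough budgeting, the two figures above reduce to simple arithmetic. The sketch below uses the 5 free hours and $1 per hour quoted on this page as constants; both are assumptions that should be replaced with current rates, and the function names are hypothetical.

```python
# Figures quoted on this page; replace with current rates before relying on them.
FREE_HOURS_PER_MONTH = 5.0
STANDARD_USD_PER_HOUR = 1.00


def fits_free_tier(audio_hours: float) -> bool:
    """Return True if monthly usage stays within the F0 allowance."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    return audio_hours <= FREE_HOURS_PER_MONTH


def standard_cost_usd(audio_hours: float) -> float:
    """Estimate the monthly S0 cost for the given number of audio hours."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    return round(audio_hours * STANDARD_USD_PER_HOUR, 2)
```

For example, 12.5 audio hours in a month does not fit the free tier and comes to an estimated $12.50 on S0 at the quoted rate.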
Advanced Features
For advanced features not available in recognize_azure(), consider using the Azure Speech SDK directly:
- Streaming recognition: Real-time transcription
- Speaker diarization: Identify who said what
- Custom models: Train models for domain-specific terminology
- Pronunciation assessment: Evaluate pronunciation for language learning
- Intent recognition: Combine speech recognition with LUIS
See the Azure Speech SDK documentation for details.
Best Practices
For production applications:
- Use environment variables for credentials (never hardcode keys)
- Implement retry logic for transient failures
- Monitor your API usage in the Azure Portal
- Use the region closest to your users
- Implement proper error handling
- Cache the OAuth token (done automatically by the library)
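The retry advice above can be sketched as a small wrapper with exponential backoff. `call_with_retries` is a hypothetical helper, not part of the library; the attempt count and delays are illustrative defaults. In practice you would likely retry only on sr.RequestError, since sr.UnknownValueError means the audio itself was not understood and a retry is unlikely to help.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # Loop always returns or raises


# Possible usage with the recognizer from the examples on this page:
# text, confidence = call_with_retries(
#     lambda: r.recognize_azure(audio, key=AZURE_KEY, location=AZURE_LOCATION),
#     retry_on=(sr.RequestError,),
# )
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without actually waiting.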
Security:
- Never commit API keys to version control
- Rotate keys periodically
- Use Azure Key Vault for production deployments
- Implement rate limiting to prevent abuse
Comparison: Azure vs Other Services
| Feature | Azure Speech | Google | Whisper (local) |
|---|---|---|---|
| Accuracy | High | High | Very High |
| Languages | 100+ | 100+ | 99 |
| Real-time | Yes (SDK) | Yes (SDK) | No |
| Custom models | Yes | Yes | No |
| Privacy | Cloud | Cloud | Local |
| Pricing | Pay-per-use | Free tier + paid | Free |
| Setup complexity | Medium | Low | Low |