Azure Speech
Azure Speech provides speech-to-text, text-to-speech, speech translation, and other speech processing capabilities. Build voice-enabled applications with high-accuracy speech recognition, natural-sounding speech synthesis, and real-time translation.
Key Capabilities
Speech-to-Text
Convert spoken audio to text with high accuracy
Text-to-Speech
Generate natural-sounding speech from text
Speech Translation
Translate spoken language in real-time
Voice Live
Build conversational voice interfaces
Speaker Recognition
Identify and verify speakers by voice
Pronunciation Assessment
Evaluate and improve pronunciation
Speech-to-Text
Transcribe audio to text with industry-leading accuracy.
Real-Time Transcription
Convert streaming audio to text in real-time.
Batch Transcription
Process large audio files asynchronously.
Fast Transcription
Quick transcription for pre-recorded audio:
- Ultra-fast processing (faster than real-time)
- Optimized for recorded files
- Lower latency than batch transcription
- Ideal for captions and subtitles
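Batch transcription is driven by a REST request rather than a streaming session. The sketch below only builds the request that would create a job; it does not send it. The v3.1 endpoint path, the `contentUrls`/`locale`/`displayName` fields, and the helper name `build_batch_transcription_request` are assumptions for illustration, so check them against the current REST API reference before use.

```python
import json
import urllib.request


def build_batch_transcription_request(region, key, audio_urls,
                                      locale="en-US", display_name="Batch job"):
    """Build (but do not send) a request that creates a batch transcription job."""
    # Assumed endpoint shape for the v3.1 REST API; newer API versions may differ.
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.1/transcriptions"
    )
    body = {
        "contentUrls": list(audio_urls),  # publicly reachable or SAS-signed audio URLs
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job; the response is expected to describe a transcription resource that you poll until it completes.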
Custom Speech
Improve accuracy for specific scenarios:
- Acoustic models: Adapt to noisy environments
- Language models: Add domain-specific vocabulary
- Pronunciation: Define custom pronunciations
- Train with your audio and transcripts
Text-to-Speech
Generate natural-sounding speech from text.
Neural Voices
High-quality, natural voices powered by neural networks.
SSML (Speech Synthesis Markup Language)
Fine-tune speech output.
Voice Styles and Emotions
Express emotions and speaking styles:
- Cheerful, sad, angry, fearful
- Customer service, newscast, assistant
- Chat, poetry reading, and more
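A voice and a speaking style are both selected inside the SSML document. As a minimal sketch, the helper below assembles such a document as a string. The function name `build_ssml` is hypothetical, the default voice `en-US-JennyNeural` is only an example, and not every voice supports every style, so treat the values as placeholders.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", style=None, lang="en-US"):
    """Return an SSML string that speaks `text`, optionally in a speaking style."""
    body = escape(text)  # escape &, <, > so the text cannot break the markup
    if style:
        # mstts:express-as is the element used for speaking styles.
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


ssml = build_ssml("Thanks for calling. How can I help?", style="cheerful")
```

The resulting string is what you would hand to the synthesizer in place of plain text.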
Custom Neural Voice
Create unique voices for your brand:
- Record voice samples (300-2000 utterances)
- Train custom neural voice model
- Unique brand identity
- Consistent voice across applications
- Requires Limited Access approval
Batch Synthesis
Generate audio for large texts asynchronously.
Speech Translation
Translate speech between languages in real-time.
Speech-to-Speech Translation
Translate and synthesize in the target language.
Voice Live (Preview)
Build conversational voice interfaces:
- Natural, human-like conversations
- Fast response times (low latency)
- Integration with LLMs
- Real-time interaction
- Context-aware responses
Language Identification
Automatically detect the spoken language.
Pronunciation Assessment
Evaluate speech pronunciation for language learning.
Text-to-Speech Avatar
Generate videos of photorealistic talking avatars:
- Lifelike synthetic avatars
- Natural speech and lip-sync
- Multiple avatar styles
- Real-time or batch generation
- Suitable for training, presentations, customer service
Use Cases
Call Centers
- Transcribe customer calls
- Real-time agent assistance
- Sentiment analysis from audio
- Automated quality assurance
- Multi-language support
Accessibility
- Voice dictation for text input
- Screen reader integration
- Caption generation for videos
- Voice-controlled applications
- Text-to-speech for visually impaired users
Content Creation
- Generate audiobooks from text
- Create podcast voiceovers
- Produce e-learning narration
- Synthesize multilingual content
- Avatar-based video production
Customer Service
- Voice-enabled chatbots
- IVR systems
- Virtual assistants
- Automated responses
- Multi-language support
Language Learning
- Pronunciation feedback
- Speaking practice
- Real-time transcription
- Reading assistance
- Fluency assessment
SDK Support
Python
C#
Java
Maven package for Speech SDK
JavaScript
C++
Native SDK for C++ applications
Swift/Objective-C
SDK for iOS and macOS apps
Input Requirements
Speech-to-Text
- Audio formats: WAV, MP3, OGG, FLAC, OPUS
- Sample rate: 8 kHz or 16 kHz (16 kHz recommended)
- Channels: Mono or stereo
- Bit depth: 16-bit PCM
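For WAV input, the requirements above can be checked locally before any audio is uploaded. The following is a rough sketch using only the standard library; the function name `check_wav` is hypothetical, and it covers uncompressed WAV only, not MP3, OGG, FLAC, or OPUS.

```python
import io
import wave

ALLOWED_SAMPLE_RATES = {8000, 16000}  # Hz, from the requirements listed above


def check_wav(data):
    """Return a list of problems found in WAV bytes; an empty list means it passes."""
    problems = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() not in ALLOWED_SAMPLE_RATES:
            problems.append(f"sample rate {wav.getframerate()} Hz is not 8 kHz or 16 kHz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample == 16-bit
            problems.append(f"bit depth is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() not in (1, 2):
            problems.append(f"{wav.getnchannels()} channels, expected mono or stereo")
    return problems
```

Call it with the raw file contents, for example `check_wav(open("call.wav", "rb").read())`, and resample or convert the audio if the list is not empty.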
Text-to-Speech
- Text length: Up to 10,000 characters per request
- SSML: Supported for fine-tuned control
- Output formats: Multiple audio formats available
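Text longer than the per-request limit has to be split before synthesis. One possible approach, sketched below, breaks on sentence boundaries and falls back to a hard cut for any single sentence that is itself too long. The helper name `split_for_tts` is hypothetical, and the code assumes the limit is counted in plain characters as stated above.

```python
import re

MAX_CHARS = 10_000  # per-request limit stated above


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio concatenated in order.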
Containers
Run Speech services on-premises:
- Speech-to-text container
- Text-to-speech container
- Custom speech container
- Neural text-to-speech container
- Maintain data privacy
- Low-latency local processing
Pricing
Speech-to-Text
- Free Tier (F0): 5 hours per month
- Standard Tier (S0): Pay per hour of audio
- Custom models: Additional costs
Text-to-Speech
- Free Tier (F0): 0.5M characters per month
- Standard Tier (S0): Pay per million characters
- Neural voices: Higher cost than standard
- Custom voices: Additional training and hosting
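To see how close an application is to the free-tier allowances, usage can be tracked against the two figures listed above. This is a small sketch only: the function name `remaining_free_quota` is hypothetical, and it assumes your application records its own usage totals for the month.

```python
FREE_STT_HOURS = 5.0      # speech-to-text free tier, hours per month
FREE_TTS_CHARS = 500_000  # text-to-speech free tier, characters per month


def remaining_free_quota(stt_seconds_used, tts_chars_used):
    """Return what is left of the monthly free-tier allowances (never negative)."""
    stt_left = max(0.0, FREE_STT_HOURS - stt_seconds_used / 3600)
    tts_left = max(0, FREE_TTS_CHARS - tts_chars_used)
    return {"stt_hours": stt_left, "tts_chars": tts_left}
```

For example, after two hours of audio and 100,000 synthesized characters, `remaining_free_quota(7200, 100_000)` reports three hours and 400,000 characters remaining.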
Getting Started
Try Speech Studio
Test features with sample audio at speech.microsoft.com
Best Practices
- Use appropriate audio quality (16 kHz, 16-bit)
- Implement noise reduction for better accuracy
- Use custom models for domain-specific vocabulary
- Cache text-to-speech audio for repeated phrases
- Implement retry logic for network failures
- Monitor usage and costs
- Test with diverse accents and speaking styles
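Two of the practices above, caching repeated phrases and retrying on network failures, can be combined in a thin wrapper around whatever synthesis call you use. The sketch below is one way to do it. `with_retries` and `SpeechCache` are hypothetical names, the `synthesize(text, voice)` callable stands in for your own SDK or REST call, and the exception types to retry on are an assumption you should adjust.

```python
import hashlib
import random
import time


def with_retries(call, attempts=4, base_delay=0.5,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run call(), retrying on transient errors with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


class SpeechCache:
    """In-memory cache of synthesized audio, keyed by voice and text."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # hypothetical callable: (text, voice) -> bytes
        self._audio = {}

    def get(self, text, voice):
        key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
        if key not in self._audio:
            self._audio[key] = with_retries(lambda: self._synthesize(text, voice))
        return self._audio[key]
```

A greeting that is played on every call is then synthesized once and served from memory afterwards; for long-running services, a bounded or on-disk cache would be a natural next step.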