Every speech-to-text provider implements the shared ISTTService interface. All providers deliver real-time streaming transcription optimized for conversational applications.
Supported providers
- **Deepgram**: Industry-leading accuracy with ultra-low-latency streaming
- **Azure Speech**: Microsoft’s neural STT with 100+ language support
- **AssemblyAI**: Advanced speech understanding with speaker diarization
- **ElevenLabs**: Multilingual transcription from the TTS leader
Provider configurations
- Deepgram
- Azure Speech
- AssemblyAI
- ElevenLabs
Deepgram Speech to Text
Provider ID: `Deepgram`
Implementation: `DeepgramSTTService.cs`

Industry-leading accuracy with sub-300 ms latency and support for 30+ languages.

Configuration fields
| Field | Type | Required | Description |
|---|---|---|---|
| `apiKey` | password | Yes | Deepgram API key from console.deepgram.com |
| `model` | select | Yes | Model: `nova-2`, `enhanced`, `base` |
| `language` | text | No | Language code (e.g., `en`, `es`, `fr`) |
| `interimResults` | boolean | No | Enable partial transcripts (default: `true`) |
| `punctuate` | boolean | No | Add punctuation (default: `true`) |
| `profanityFilter` | boolean | No | Filter profanity (default: `false`) |
| `diarize` | boolean | No | Enable speaker diarization (default: `false`) |
| `smartFormat` | boolean | No | Format numbers, dates, currency (default: `true`) |
| `numericFormat` | boolean | No | Convert numbers to digits (default: `true`) |
| `encoding` | select | No | Audio format: `linear16`, `mulaw`, `opus` |
| `sampleRate` | number | No | Sample rate in Hz (8000, 16000, 48000) |
| `channels` | number | No | Audio channels (default: 1) |
Recommended settings for voice calls
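The configuration snippet that belongs here is missing from the page; the following is a plausible sketch assembled from the fields and defaults in the table above (values are illustrative, not authoritative, and the API key is a placeholder):

```json
{
  "apiKey": "<your-deepgram-api-key>",
  "model": "nova-2",
  "language": "en",
  "interimResults": true,
  "punctuate": true,
  "smartFormat": true,
  "encoding": "mulaw",
  "sampleRate": 8000,
  "channels": 1
}
```

`mulaw` at 8000 Hz matches the native SIP trunk format described under audio format handling below.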
Model comparison
- `nova-2` - Latest generation, best accuracy, lowest latency (recommended)
- `nova` - Previous generation, balanced performance
- `enhanced` - Optimized for noisy environments
- `base` - Cost-optimized, basic accuracy
Nova-2 supports automatic language detection across 100+ languages. Set `language: "multi"` to enable auto-detection, though single-language mode provides better accuracy.

Smart formatting examples

With `smartFormat: true`:
- “twenty three dollars” → “$23”
- “march fifth two thousand twenty four” → “March 5, 2024”
- “one two three four five six seven eight nine zero” → “1234567890”
Language support
Deepgram supports 30+ languages including:
- English - `en`, `en-US`, `en-GB`, `en-AU`
- Spanish - `es`, `es-419` (Latin America)
- French - `fr`, `fr-CA`
- German - `de`
- Arabic - `ar`
- Chinese - `zh`, `zh-CN`, `zh-TW`
- Hindi - `hi`
- Portuguese - `pt`, `pt-BR`
Implementation details
Interface contract
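The interface body is not reproduced on this page. A minimal sketch of what `ISTTService` plausibly looks like, inferred from the events and lifecycle described here; only the two event names are confirmed by this page, and every other member name and signature is an assumption:

```csharp
// Hypothetical sketch of the ISTTService contract. Only the event names
// TranscriptionReceived and FinalTranscriptionReceived appear in the docs;
// the remaining members are illustrative guesses.
public interface ISTTService : IAsyncDisposable
{
    // Fired with partial (interim) transcripts while the caller is speaking.
    event EventHandler<string> TranscriptionReceived;

    // Fired once per utterance with the final transcript.
    event EventHandler<string> FinalTranscriptionReceived;

    // Open the WebSocket connection to the provider.
    Task ConnectAsync(CancellationToken ct = default);

    // Stream a chunk of raw audio to the provider.
    Task SendAudioAsync(ReadOnlyMemory<byte> audio, CancellationToken ct = default);
}
```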
Streaming architecture
All STT providers use WebSocket streaming for real-time transcription:
1. Connection establishment - WebSocket opened to the provider
2. Audio streaming - Raw audio chunks sent continuously
3. Interim results - Partial transcripts emitted via the `TranscriptionReceived` event
4. Final results - Complete transcripts emitted via the `FinalTranscriptionReceived` event
5. Silence detection - Automatic utterance segmentation
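The flow above can be sketched from the consumer's side. This is a hypothetical caller: the two event names come from this page, while the delegate signatures, method names, and `IAsyncEnumerable<byte[]>` audio source are assumptions:

```csharp
// Hypothetical consumer of an STT provider service (sketch).
async Task TranscribeCallAsync(ISTTService stt, IAsyncEnumerable<byte[]> audioChunks)
{
    stt.TranscriptionReceived += (_, partial) =>
        Console.WriteLine($"[interim] {partial}");   // partial transcript, may be revised

    stt.FinalTranscriptionReceived += (_, final) =>
        Console.WriteLine($"[final] {final}");       // utterance complete

    await stt.ConnectAsync();                        // 1. open the WebSocket
    await foreach (var chunk in audioChunks)
        await stt.SendAudioAsync(chunk);             // 2. stream raw audio continuously
}
```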
Provider manager
The `STTProviderManager` (defined in `IqraInfrastructure/Managers/STT/STTProviderManager.cs`) handles:
- Provider registration - Auto-discovers implementations
- Model catalog - Maintains available models per provider
- Configuration validation - Ensures required fields
- Instance creation - Instantiates provider services
- Event marshaling - Routes transcription events to conversation engine
Audio format handling
The system automatically converts incoming audio:
- Telephony input - μ-law 8 kHz from SIP trunks
- WebRTC input - Opus 16/48 kHz from browsers
- Format conversion - Converts to the provider’s preferred format
- Resampling - Adjusts the sample rate if needed
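As an illustration of the telephony leg of this conversion, here is the standard G.711 μ-law-to-linear16 decode. This is the textbook algorithm, shown for context; it is not taken from the project's actual converter:

```csharp
// Standard G.711 mu-law byte -> 16-bit linear PCM sample.
// Generic reference algorithm, not the project's own implementation.
static short MuLawToLinear16(byte mulaw)
{
    mulaw = (byte)~mulaw;                    // mu-law bytes are stored complemented
    int sign = mulaw & 0x80;                 // top bit is the sign
    int exponent = (mulaw >> 4) & 0x07;      // 3-bit segment number
    int mantissa = mulaw & 0x0F;             // 4-bit step within the segment
    int sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    return (short)(sign != 0 ? -sample : sample);
}
```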
Provider selection guide
Key selection criteria: lowest latency, best accuracy, multi-language support, and compliance.

For the lowest latency, the recommended options are:
- Deepgram Nova-2 - Sub-300ms latency
- Azure Speech - Consistent low latency
- ElevenLabs - Optimized for real-time
Configuration best practices
Telephony optimization
For phone integrations, start from μ-law input at 8 kHz, the native SIP trunk format.

WebRTC optimization

For browser/app integrations, start from Opus or linear16 input at 16 or 48 kHz.

Noise handling
For environments with background noise:
- Deepgram: Use the `enhanced` model
- Azure: Set `enableDictation: false` for conversational mode
- AssemblyAI: Default model handles noise well
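The telephony and WebRTC recommendations above can be sketched as provider configuration fragments, using the Deepgram fields documented earlier on this page (values are illustrative, not authoritative):

```json
{
  "telephony": {
    "model": "nova-2",
    "encoding": "mulaw",
    "sampleRate": 8000,
    "channels": 1
  },
  "webrtc": {
    "model": "nova-2",
    "encoding": "opus",
    "sampleRate": 48000,
    "channels": 1
  }
}
```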
Language detection
For multi-language callers:
- Deepgram: Set `language: "multi"` for auto-detection
- Azure: Use the language identification service (a separate API)
- Manual detection: Detect the language via an LLM analyzing the first utterance
Adding custom providers
To integrate a new STT provider:
1. Add an enum value in `IqraCore/Entities/Interfaces/InterfaceSTTProviderEnum.cs`
2. Implement the interface in `IqraInfrastructure/Managers/STT/Providers/`
3. Handle WebSocket streaming to the provider API
4. Emit transcription events via `TranscriptionReceived` and `FinalTranscriptionReceived`
5. Restart the application for auto-registration

See `DeepgramSTTService.cs:1-80` for a reference implementation.
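The steps above can be sketched as a provider skeleton. Only the two event names and the `Providers/` folder come from this page; the class name, member signatures, endpoint URL, and JSON handling are assumptions (requires `using System.Net.WebSockets;` and `using System.Text;`):

```csharp
// Hypothetical skeleton for a new provider in
// IqraInfrastructure/Managers/STT/Providers/ (illustrative only).
public sealed class MyCustomSTTService : ISTTService
{
    private ClientWebSocket? _socket;

    public event EventHandler<string>? TranscriptionReceived;
    public event EventHandler<string>? FinalTranscriptionReceived;

    public async Task ConnectAsync(CancellationToken ct = default)
    {
        _socket = new ClientWebSocket();
        // Placeholder endpoint; a real provider documents its own URL and auth.
        await _socket.ConnectAsync(new Uri("wss://api.example-stt.com/stream"), ct);
        _ = Task.Run(() => ReceiveLoopAsync(ct), ct);   // read transcripts in background
    }

    public Task SendAudioAsync(ReadOnlyMemory<byte> audio, CancellationToken ct = default) =>
        _socket!.SendAsync(audio, WebSocketMessageType.Binary, true, ct).AsTask();

    private async Task ReceiveLoopAsync(CancellationToken ct)
    {
        var buffer = new byte[8192];
        while (_socket!.State == WebSocketState.Open)
        {
            var result = await _socket.ReceiveAsync(buffer.AsMemory(), ct);
            var text = Encoding.UTF8.GetString(buffer, 0, result.Count);
            // Parse the provider's response and raise the matching event.
            // The "is_final" flag is a stand-in for whatever the provider sends.
            if (text.Contains("\"is_final\":true"))
                FinalTranscriptionReceived?.Invoke(this, text);
            else
                TranscriptionReceived?.Invoke(this, text);
        }
    }

    public async ValueTask DisposeAsync()
    {
        if (_socket is { State: WebSocketState.Open })
            await _socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done",
                                     CancellationToken.None);
        _socket?.Dispose();
    }
}
```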
Next steps
- **Configure LLM**: Add language intelligence for processing transcripts
- **Add voice output**: Configure text-to-speech for responses
- **Multi-language agents**: Configure parallel language contexts
- **Telephony integration**: Deploy via phone providers