Iqra AI supports four speech-to-text (STT) providers through the ISTTService interface. All providers deliver real-time streaming transcription optimized for conversational applications.

Supported providers

Deepgram

Industry-leading accuracy with ultra-low latency streaming

Azure Speech

Microsoft’s neural STT with 100+ language support

AssemblyAI

Advanced speech understanding with speaker diarization

ElevenLabs

Multilingual transcription from the TTS leader

Provider configurations

Deepgram Speech to Text

Provider ID: Deepgram
Implementation: DeepgramSTTService.cs
Industry-leading accuracy with sub-300ms latency and support for 30+ languages.

Configuration fields

| Field | Type | Required | Description |
|---|---|---|---|
| apiKey | password | Yes | Deepgram API key from console.deepgram.com |
| model | select | Yes | Model: nova-2, enhanced, base |
| language | text | No | Language code (e.g., en, es, fr) |
| interimResults | boolean | No | Enable partial transcripts (default: true) |
| punctuate | boolean | No | Add punctuation (default: true) |
| profanityFilter | boolean | No | Filter profanity (default: false) |
| diarize | boolean | No | Enable speaker diarization (default: false) |
| smartFormat | boolean | No | Format numbers, dates, currency (default: true) |
| numericFormat | boolean | No | Convert spoken numbers to digits (default: true) |
| encoding | select | No | Audio format: linear16, mulaw, opus |
| sampleRate | number | No | Sample rate in Hz (8000, 16000, 48000) |
| channels | number | No | Audio channels (default: 1) |
Example configuration:

```json
{
  "model": "nova-2",
  "language": "en",
  "interimResults": true,
  "punctuate": true,
  "smartFormat": true,
  "encoding": "mulaw",
  "sampleRate": 8000,
  "channels": 1
}
```

Model comparison

  • nova-2 - Latest generation, best accuracy, lowest latency (recommended)
  • nova - Previous generation, balanced performance
  • enhanced - Optimized for noisy environments
  • base - Cost-optimized, basic accuracy
Nova-2 supports automatic language detection across 100+ languages. Set language: "multi" to enable auto-detection, though single-language mode provides better accuracy.
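Building on the example configuration above, auto-detection is a one-field change (the other fields keep their defaults):

```json
{
  "model": "nova-2",
  "language": "multi"
}
```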

Smart formatting examples

With smartFormat: true:
  • “twenty three dollars” → “$23”
  • “march fifth two thousand twenty four” → “March 5, 2024”
  • “one two three four five six seven eight nine zero” → “1234567890”

Language support

Deepgram supports 30+ languages including:
  • English - en, en-US, en-GB, en-AU
  • Spanish - es, es-419 (Latin America)
  • French - fr, fr-CA
  • German - de
  • Arabic - ar
  • Chinese - zh, zh-CN, zh-TW
  • Hindi - hi
  • Portuguese - pt, pt-BR
See Deepgram documentation for the full list.

Implementation details

Interface contract

```csharp
public interface ISTTService
{
    Task<FunctionReturnResult> Initialize();
    Task<FunctionReturnResult> StartStreamingAsync(Stream audioStream,
                                                   CancellationToken cancellationToken);
    event EventHandler<TranscriptionResult>? TranscriptionReceived;
    event EventHandler<TranscriptionResult>? FinalTranscriptionReceived;
}
```
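A caller typically subscribes to both events before starting the stream. A minimal sketch, assuming TranscriptionResult exposes a Text property and FunctionReturnResult a Success flag (neither is shown in the interface above):

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class SttConsumerSketch
{
    // Hypothetical consumer of ISTTService; r.Text and init.Success are
    // assumed member names, not confirmed by the interface contract.
    public static async Task TranscribeAsync(ISTTService stt, Stream audio, CancellationToken ct)
    {
        // Partial transcripts arrive while the caller is still speaking.
        stt.TranscriptionReceived += (_, r) => Console.WriteLine($"interim: {r.Text}");

        // Final transcripts mark the end of an utterance.
        stt.FinalTranscriptionReceived += (_, r) => Console.WriteLine($"final: {r.Text}");

        var init = await stt.Initialize();
        if (!init.Success)
            throw new InvalidOperationException("STT provider failed to initialize");

        // Blocks until the audio stream ends or the token is cancelled.
        await stt.StartStreamingAsync(audio, ct);
    }
}
```

Subscribing before StartStreamingAsync matters: interim results can begin arriving as soon as the first audio chunks are sent.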

Streaming architecture

All STT providers use WebSocket streaming for real-time transcription:
  1. Connection establishment - WebSocket opened to provider
  2. Audio streaming - Raw audio chunks sent continuously
  3. Interim results - Partial transcripts emitted via TranscriptionReceived event
  4. Final results - Complete transcripts emitted via FinalTranscriptionReceived event
  5. Silence detection - Automatic utterance segmentation
This architecture enables the agent to begin processing responses before the user finishes speaking, reducing perceived latency.
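Step 2 of this flow, pumping raw audio over the socket, can be sketched with .NET's WebSocket type. The chunk size and close handling here are illustrative; real providers also require auth headers and query parameters on connection:

```csharp
using System;
using System.IO;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

public static class AudioStreamer
{
    // Sends raw audio to an already-connected streaming STT WebSocket.
    public static async Task SendAudioAsync(WebSocket socket, Stream audio, CancellationToken ct)
    {
        var buffer = new byte[3200]; // 100 ms of 16 kHz, 16-bit mono audio
        int read;
        while ((read = await audio.ReadAsync(buffer, 0, buffer.Length, ct)) > 0)
        {
            await socket.SendAsync(
                new ArraySegment<byte>(buffer, 0, read),
                WebSocketMessageType.Binary,
                endOfMessage: true,
                ct);
        }

        // Closing signals end of audio so the provider can flush final transcripts.
        await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", ct);
    }
}
```

A separate receive loop would run concurrently, parsing provider messages and raising the interim/final events from steps 3 and 4.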

Provider manager

The STTProviderManager (defined in IqraInfrastructure/Managers/STT/STTProviderManager.cs) handles:
  • Provider registration - Auto-discovers implementations
  • Model catalog - Maintains available models per provider
  • Configuration validation - Ensures required fields
  • Instance creation - Instantiates provider services
  • Event marshaling - Routes transcription events to conversation engine

Audio format handling

The system automatically converts incoming audio:
  1. Telephony input - μ-law 8kHz from SIP trunks
  2. WebRTC input - Opus 16/48kHz from browsers
  3. Format conversion - Converts to provider’s preferred format
  4. Resampling - Adjusts sample rate if needed
No manual configuration required.
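The telephony side of step 3 is standard G.711 μ-law expansion to 16-bit PCM (linear16). As a sketch, this is the textbook algorithm, not necessarily the project's actual converter:

```csharp
using System;

public static class MuLawDecoder
{
    // ITU-T G.711 mu-law byte -> 16-bit linear PCM sample.
    public static short MuLawToLinear(byte mulaw)
    {
        mulaw = (byte)~mulaw;                  // mu-law bytes are stored complemented
        int sign = mulaw & 0x80;
        int exponent = (mulaw >> 4) & 0x07;
        int mantissa = mulaw & 0x0F;
        int sample = (((mantissa << 3) + 0x84) << exponent) - 0x84; // 0x84 = decode bias
        return (short)(sign != 0 ? -sample : sample);
    }

    // Decode a whole 8 kHz mu-law frame to linear16.
    public static short[] Decode(byte[] mulawFrame)
    {
        var pcm = new short[mulawFrame.Length];
        for (int i = 0; i < mulawFrame.Length; i++)
            pcm[i] = MuLawToLinear(mulawFrame[i]);
        return pcm;
    }
}
```

After decoding, the 8 kHz PCM can be resampled (step 4) if the provider prefers 16 kHz input.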

Provider selection guide

For voice conversations, where response time directly shapes the experience, latency is the deciding factor. Recommended providers:
  1. Deepgram Nova-2 - Sub-300ms latency
  2. Azure Speech - Consistent low latency
  3. ElevenLabs - Optimized for real-time

Configuration best practices

Telephony optimization

For phone integrations:
```json
{
  "encoding": "mulaw",
  "sampleRate": 8000,
  "channels": 1,
  "interimResults": true
}
```

WebRTC optimization

For browser/app integrations:
```json
{
  "encoding": "opus",
  "sampleRate": 48000,
  "channels": 1,
  "interimResults": true
}
```

Noise handling

For environments with background noise:
  • Deepgram: Use enhanced model
  • Azure: Set enableDictation: false to stay in conversational mode
  • AssemblyAI: Default model handles noise well
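For Deepgram, the noisy-environment setup is just a model swap in the configuration shown earlier:

```json
{
  "model": "enhanced",
  "language": "en",
  "interimResults": true
}
```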

Language detection

For multi-language callers:
  1. Deepgram: Set language: "multi" for auto-detection
  2. Azure: Use language identification service (separate API)
  3. Manual detection: Detect via LLM analyzing first utterance

Adding custom providers

To integrate a new STT provider:
  1. Add enum value in IqraCore/Entities/Interfaces/InterfaceSTTProviderEnum.cs
  2. Implement interface in IqraInfrastructure/Managers/STT/Providers/
  3. Handle WebSocket streaming to provider API
  4. Emit transcription events via TranscriptionReceived and FinalTranscriptionReceived
  5. Restart application for auto-registration
See DeepgramSTTService.cs:1-80 for a reference implementation.
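Steps 2-4 reduce to a class shaped roughly like the skeleton below. The class name is illustrative, and the actual protocol work (connection URL, auth, message parsing) is elided with comments:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative skeleton of a custom provider implementing the
// ISTTService contract shown in "Implementation details" above.
public class MySTTService : ISTTService
{
    public event EventHandler<TranscriptionResult>? TranscriptionReceived;
    public event EventHandler<TranscriptionResult>? FinalTranscriptionReceived;

    public Task<FunctionReturnResult> Initialize()
    {
        // Validate configuration (API key, model, language) here.
        return Task.FromResult(new FunctionReturnResult { Success = true });
    }

    public async Task<FunctionReturnResult> StartStreamingAsync(Stream audioStream, CancellationToken ct)
    {
        // 1. Open a WebSocket to the provider's streaming endpoint.
        // 2. Pump audioStream chunks to the socket until it ends.
        // 3. Parse provider messages; raise TranscriptionReceived for
        //    partials and FinalTranscriptionReceived at end of utterance.
        await Task.CompletedTask;
        return new FunctionReturnResult { Success = true };
    }
}
```

The FunctionReturnResult initializer assumes a Success member; check the real entity in IqraCore before copying this shape.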

Next steps

Configure LLM

Add language intelligence for processing transcripts

Add voice output

Configure text-to-speech for responses

Multi-language agents

Configure parallel language contexts

Telephony integration

Deploy via phone providers
