TTS provider integrations

Iqra AI supports 18 text-to-speech (TTS) providers through the ITTSService interface. All providers deliver audio in real-time streaming formats optimized for telephony and WebRTC channels.

Supported providers

The platform includes native integrations for:

ElevenLabs - Industry-leading voice cloning and multilingual synthesis
Azure Speech - Microsoft’s neural TTS with 400+ voices
Deepgram - Ultra-low latency streaming TTS
Cartesia - Expressive conversational voices
Google TTS - WaveNet and Neural2 voices
FishAudio - High-quality voice synthesis
Minimax - Advanced Chinese language support
HumeAI - Emotionally intelligent speech
Inworld - Character voices for gaming
Speechify - Natural reading voices
MurfAI - Studio-quality voiceovers
Neuphonic - Neural voice generation
ResembleAI - Real-time voice cloning
Rime - Expressive speech synthesis
Sarvam - Indic language specialist
UpliftAI - Enterprise TTS platform
HamsaAI - Arabic language optimization
Zyphra Zonos - Fast multilingual synthesis

Popular provider configurations

ElevenLabs Text to Speech

Provider ID: ElevenLabsTextToSpeech
Implementation: ElevenLabsTTSService.cs

Industry-leading voice cloning with support for 30+ languages and ultra-realistic prosody.

Configuration fields

Field	Type	Required	Description
`apiKey`	password	Yes	ElevenLabs API key from elevenlabs.io
`voiceId`	text	Yes	Voice identifier (e.g., `21m00Tcm4TlvDq8ikWAM`)
`modelId`	select	No	Model: `eleven_multilingual_v2`, `eleven_turbo_v2_5`
`stability`	number	No	Voice consistency (0.0-1.0, default: 0.5)
`similarityBoost`	number	No	Voice clarity (0.0-1.0, default: 0.75)
`style`	number	No	Exaggeration level (0.0-1.0)
`useSpeakerBoost`	boolean	No	Enhance clarity (recommended: true)
`speed`	number	No	Playback speed (0.5-2.0)
`pronunciationDictionaryIds`	array	No	Custom pronunciation dictionaries
`applyTextNormalization`	select	No	`auto`, `on`, or `off`

Recommended settings for voice calls

{
  "voiceId": "21m00Tcm4TlvDq8ikWAM",
  "modelId": "eleven_turbo_v2_5",
  "stability": 0.5,
  "similarityBoost": 0.8,
  "useSpeakerBoost": true,
  "speed": 1.0,
  "applyTextNormalization": "auto"
}

Use eleven_turbo_v2_5 for real-time conversations (lowest latency) and eleven_multilingual_v2 for maximum voice quality in non-English languages.
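The numeric ranges in the table above can be checked before a configuration is saved. The following is a minimal sketch, not the platform's own validation: the `ElevenLabsSettingsSketch` record and `ElevenLabsSettingsValidator` class are hypothetical names introduced for illustration, and the `speed` default of 1.0 is an assumption taken from the recommended settings rather than a documented default.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical settings holder; field names mirror the configuration table.
// The Speed default of 1.0 is an assumption (the table lists no default).
public sealed record ElevenLabsSettingsSketch(
    double Stability = 0.5,
    double SimilarityBoost = 0.75,
    double? Style = null,
    double Speed = 1.0);

public static class ElevenLabsSettingsValidator
{
    // Returns one message per out-of-range field; an empty list means valid.
    public static IReadOnlyList<string> Validate(ElevenLabsSettingsSketch s)
    {
        var errors = new List<string>();
        Check(errors, "stability", s.Stability, 0.0, 1.0);
        Check(errors, "similarityBoost", s.SimilarityBoost, 0.0, 1.0);
        if (s.Style.HasValue)
        {
            Check(errors, "style", s.Style.Value, 0.0, 1.0);
        }
        Check(errors, "speed", s.Speed, 0.5, 2.0);
        return errors;
    }

    private static void Check(List<string> errors, string name, double value, double min, double max)
    {
        if (double.IsNaN(value) || value < min || value > max)
        {
            errors.Add(FormattableString.Invariant(
                $"{name} must be between {min} and {max}, got {value}"));
        }
    }
}
```

Calling `ElevenLabsSettingsValidator.Validate(new ElevenLabsSettingsSketch(Stability: 0.5, SimilarityBoost: 0.8))` returns an empty list, while a `Speed` of 3.0 produces a single error naming the `speed` field.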

Finding voice IDs

Go to https://elevenlabs.io/voice-library
Select a voice or clone your own
Copy the voice ID from the URL or API settings

Pronunciation dictionaries

Create custom dictionaries in the ElevenLabs dashboard to handle:

Brand names and acronyms
Technical terminology
Non-standard pronunciations
Regional variations

Add dictionary IDs to the pronunciationDictionaryIds array.

Azure Speech Services

Provider ID: AzureSpeechServices
Implementation: AzureSpeechTTSService.cs

Microsoft’s neural TTS with 400+ voices across 140+ languages and dialects.

Configuration fields

Field	Type	Required	Description
`subscriptionKey`	password	Yes	Azure Speech resource key
`region`	text	Yes	Azure region (e.g., `eastus`, `westeurope`)
`voiceName`	text	Yes	Neural voice name (e.g., `en-US-AriaNeural`)
`style`	select	No	Speaking style (voice-dependent)
`styleDegree`	number	No	Style intensity (0.01-2.0)
`rate`	number	No	Speech rate (-50% to +200%)
`pitch`	text	No	Pitch adjustment (e.g., `+5Hz`, `-10%`)
`volume`	number	No	Volume level (0-100)

Recommended settings

{
  "region": "eastus",
  "voiceName": "en-US-AriaNeural",
  "style": "friendly",
  "styleDegree": 1.0,
  "rate": 0,
  "volume": 100
}

Available speaking styles

Styles vary by voice. Common options include:

General: friendly, cheerful, empathetic, calm
Professional: customerservice, newscast, assistant
Expressive: excited, angry, sad, terrified

See Azure documentation for voice-specific styles.

Azure has region-specific voice availability. Ensure your selected voice is available in your deployment region.
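Azure applies speaking styles through SSML, using the `mstts:express-as` element together with the standard `prosody` element. The sketch below shows one way the `voiceName`, `style`, `styleDegree`, and `rate` fields could be turned into an SSML document. It is an illustration only: `AzureSsmlSketch` is a hypothetical helper, and how AzureSpeechTTSService.cs actually builds its requests is not shown on this page.

```csharp
using System.Globalization;
using System.Xml.Linq;

// Hypothetical helper: maps configuration fields onto Azure SSML.
public static class AzureSsmlSketch
{
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";
    private static readonly XNamespace Mstts = "https://www.w3.org/2001/mstts";

    public static string Build(string text, string voiceName, string languageCode,
                               string style, double styleDegree, int ratePercent)
    {
        var inv = CultureInfo.InvariantCulture;

        // Relative rate as a signed percentage, e.g. "+0%" or "-20%".
        var prosody = new XElement(Ssml + "prosody",
            new XAttribute("rate", ratePercent.ToString("+0;-0", inv) + "%"),
            text); // XElement escapes characters such as '&' and '<' in the text

        var expressAs = new XElement(Mstts + "express-as",
            new XAttribute("style", style),
            new XAttribute("styledegree", styleDegree.ToString("0.##", inv)),
            prosody);

        var speak = new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", languageCode),
            new XAttribute(XNamespace.Xmlns + "mstts", Mstts.NamespaceName),
            new XElement(Ssml + "voice",
                new XAttribute("name", voiceName),
                expressAs));

        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

Building the document with `XElement` rather than string concatenation means user-supplied text cannot break the markup. A style that the chosen voice does not support is normally ignored by the service, so checking the voice's style list first is still worthwhile.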

Multi-language support

Azure provides neural voices for many non-English languages, for example:

Arabic: ar-SA-ZariyahNeural, ar-EG-SalmaNeural
Chinese: zh-CN-XiaoxiaoNeural, zh-TW-HsiaoChenNeural
Hindi: hi-IN-SwaraNeural
Spanish: es-MX-DaliaNeural, es-ES-ElviraNeural

Deepgram Text to Speech

Provider ID: DeepgramTextToSpeech
Implementation: DeepgramTextToSpeech.cs

Ultra-low latency streaming TTS optimized for real-time conversations.

Configuration fields

Field	Type	Required	Description
`apiKey`	password	Yes	Deepgram API key from console.deepgram.com
`model`	select	Yes	Voice model (e.g., `aura-asteria-en`)
`encoding`	select	No	Audio format: `linear16`, `mulaw`, `alaw`
`sampleRate`	number	No	Sample rate in Hz (8000, 16000, 24000)
`bitrate`	number	No	Bitrate for compressed formats

Recommended settings for telephony

{
  "model": "aura-asteria-en",
  "encoding": "mulaw",
  "sampleRate": 8000
}
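With these settings the audio is one byte per sample at 8000 samples per second, so frame sizes are easy to compute. The sketch below assumes 20 ms frames, a common packetization interval in telephony but not something this page specifies, and uses a hypothetical `MuLawFramerSketch` helper to split a buffer into frames.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for slicing 8-bit mu-law audio into fixed-duration frames.
public static class MuLawFramerSketch
{
    // Mu-law uses one byte per sample, so bytes per frame equals samples per frame.
    public static int BytesPerFrame(int sampleRateHz, int frameMs) =>
        sampleRateHz * frameMs / 1000;

    // Yields consecutive slices of the buffer; the final frame may be shorter.
    public static IEnumerable<ReadOnlyMemory<byte>> Split(ReadOnlyMemory<byte> audio, int frameBytes)
    {
        if (frameBytes <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(frameBytes));
        }

        for (int offset = 0; offset < audio.Length; offset += frameBytes)
        {
            int length = Math.Min(frameBytes, audio.Length - offset);
            yield return audio.Slice(offset, length);
        }
    }
}
```

`MuLawFramerSketch.BytesPerFrame(8000, 20)` evaluates to 160, so a 400-byte buffer splits into frames of 160, 160, and 80 bytes.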

Available Aura voices

aura-asteria-en - Friendly female (US)
aura-luna-en - Calm female (US)
aura-stella-en - Professional female (US)
aura-athena-en - Confident female (UK)
aura-hera-en - Warm female (US)
aura-orion-en - Professional male (US)
aura-arcas-en - Friendly male (US)
aura-perseus-en - Confident male (US)
aura-angus-en - Authoritative male (Ireland)
aura-orpheus-en - Expressive male (US)

Deepgram typically delivers the first audio chunk in under 250 ms, making it well suited to voice applications where response time is critical.

Cartesia Text to Speech

Provider ID: CartesiaTextToSpeech
Implementation: CartesiaTTSService.cs

Expressive conversational voices designed for dialogue-heavy applications.

Configuration fields

Field	Type	Required	Description
`apiKey`	password	Yes	Cartesia API key from cartesia.ai
`voiceId`	text	Yes	Voice identifier
`model`	select	No	Model version
`language`	text	No	Language code (e.g., `en`, `es`)
`sampleRate`	number	No	Audio sample rate

Example configuration

{
  "voiceId": "default-voice-id",
  "model": "sonic",
  "language": "en",
  "sampleRate": 24000
}

Google Cloud Text to Speech

Provider ID: GoogleCloudTextToSpeech
Implementation: GoogleTTSService.cs

Google’s WaveNet and Neural2 voices with support for 220+ voices.

Configuration fields

Field	Type	Required	Description
`credentialsJson`	password	Yes	Service account JSON key
`voiceName`	text	Yes	Voice name (e.g., `en-US-Neural2-A`)
`languageCode`	text	Yes	Language (e.g., `en-US`)
`speakingRate`	number	No	Speed (0.25-4.0)
`pitch`	number	No	Pitch (-20.0 to 20.0)
`volumeGainDb`	number	No	Volume in decibels

Voice types

Standard - Basic synthesis
WaveNet - High-quality neural synthesis
Neural2 - Latest generation (recommended)
Studio - Premium studio quality
News - Optimized for news reading
Polyglot - Multilingual voices

Example configuration

{
  "voiceName": "en-US-Neural2-F",
  "languageCode": "en-US",
  "speakingRate": 1.0,
  "pitch": 0.0
}

Implementation details

Interface contract

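using System.IO;
using System.Threading;
using System.Threading.Tasks;

public interface ITTSService
{
    // Prepares the provider before first use.
    Task<FunctionReturnResult> Initialize();

    // Synthesizes the text and returns the audio as a byte array.
    Task<FunctionReturnResult<byte[]>> TextToSpeechAsync(string text, CancellationToken cancellationToken);

    // Synthesizes the text and returns the audio as a readable stream.
    Task<FunctionReturnResult<Stream>> TextToSpeechStreamAsync(string text, CancellationToken cancellationToken);
}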
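To make the contract concrete, here is a minimal sketch of a provider that satisfies it by returning silence instead of calling a real API. The `FunctionReturnResult` classes below are hypothetical stand-ins with assumed `Success`, `Message`, and `Data` members, because the real types are not shown on this page. `SilenceTTSService` is likewise invented for illustration, and the interface is repeated only so the block compiles on its own.

```csharp
#nullable enable
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins: the real result types may have different members.
public class FunctionReturnResult
{
    public bool Success { get; set; }
    public string? Message { get; set; }
}

public class FunctionReturnResult<T> : FunctionReturnResult
{
    public T? Data { get; set; }
}

public interface ITTSService
{
    Task<FunctionReturnResult> Initialize();
    Task<FunctionReturnResult<byte[]>> TextToSpeechAsync(string text, CancellationToken cancellationToken);
    Task<FunctionReturnResult<Stream>> TextToSpeechStreamAsync(string text, CancellationToken cancellationToken);
}

// Illustrative provider: emits 16-bit mono silence, 50 ms per input character.
public sealed class SilenceTTSService : ITTSService
{
    private const int SampleRateHz = 8000;
    private bool _initialized;

    public Task<FunctionReturnResult> Initialize()
    {
        _initialized = true; // a real provider would validate credentials here
        return Task.FromResult(new FunctionReturnResult { Success = true });
    }

    public Task<FunctionReturnResult<byte[]>> TextToSpeechAsync(string text, CancellationToken cancellationToken)
    {
        cancellationToken.ThrowIfCancellationRequested();
        if (!_initialized)
        {
            return Task.FromResult(new FunctionReturnResult<byte[]>
            {
                Success = false,
                Message = "Call Initialize first."
            });
        }

        int samples = text.Length * SampleRateHz / 20; // 400 samples = 50 ms
        return Task.FromResult(new FunctionReturnResult<byte[]>
        {
            Success = true,
            Data = new byte[samples * 2] // two bytes per 16-bit sample
        });
    }

    public async Task<FunctionReturnResult<Stream>> TextToSpeechStreamAsync(string text, CancellationToken cancellationToken)
    {
        var audio = await TextToSpeechAsync(text, cancellationToken);
        if (!audio.Success || audio.Data is null)
        {
            return new FunctionReturnResult<Stream> { Success = false, Message = audio.Message };
        }
        return new FunctionReturnResult<Stream>
        {
            Success = true,
            Data = new MemoryStream(audio.Data, writable: false)
        };
    }
}
```

A stub like this can be handy in tests because it exercises the calling code without network access or API keys. A real provider would replace the silence buffer with the vendor's streaming response.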

Audio format handling

Iqra AI automatically handles format conversion:

Provider native format - Each TTS service outputs in its preferred format
Format detection - System identifies optimal format (PCM, μ-law, Opus, etc.)
Automatic conversion - Converts to telephony format (8kHz μ-law) or WebRTC (16kHz Opus)
Streaming delivery - Chunks audio for minimal latency

See TTSProviderManager.cs:1-50 for implementation.
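As an illustration of the conversion step, the sketch below encodes 16-bit linear PCM samples as 8-bit μ-law using the standard G.711 companding scheme. It assumes the input has already been resampled to 8 kHz, and `MuLawEncoderSketch` is a hypothetical name; this is not the code in TTSProviderManager.cs.

```csharp
using System;

// Hypothetical helper: G.711 mu-law encoding of 16-bit linear PCM.
public static class MuLawEncoderSketch
{
    private const int Bias = 0x84;   // 132, added before finding the segment
    private const int Clip = 32635;  // largest magnitude that survives the bias

    public static byte Encode(short sample)
    {
        int sign = (sample >> 8) & 0x80;           // 0x80 for negative samples
        int magnitude = sign != 0 ? -sample : sample;
        if (magnitude > Clip)
        {
            magnitude = Clip;
        }
        magnitude += Bias;

        // Segment number = position of the highest set bit above bit 7.
        int exponent = 7;
        for (int mask = 0x4000; (magnitude & mask) == 0 && exponent > 0; mask >>= 1)
        {
            exponent--;
        }

        int mantissa = (magnitude >> (exponent + 3)) & 0x0F;

        // Mu-law bytes are stored bit-inverted.
        return (byte)(~(sign | (exponent << 4) | mantissa) & 0xFF);
    }

    public static byte[] Encode(ReadOnlySpan<short> pcm)
    {
        var encoded = new byte[pcm.Length];
        for (int i = 0; i < pcm.Length; i++)
        {
            encoded[i] = Encode(pcm[i]);
        }
        return encoded;
    }
}
```

With this encoder, silence (`0`) maps to `0xFF`, the largest positive sample (`32767`) to `0x80`, and the most negative sample (`-32768`) to `0x00`. The output is half the size of the 16-bit input, which is why 8 kHz μ-law streams run at 8000 bytes per second.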

Caching system

The TTSAudioCacheManager optimizes repeated phrases:

Cache key generation - Hash of text + voice + config
S3-compatible storage - Persistent cache in RustFS
TTL management - Configurable expiration
Cache invalidation - Automatic on config changes

Cached phrases skip the provider round trip entirely, which reduces both latency and provider costs for common responses.
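One way to derive a cache key from text, voice, and configuration is sketched below. It is an assumption-laden illustration rather than the TTSAudioCacheManager implementation: the `TtsCacheKeySketch` class, the choice of SHA-256, and the dictionary representation of the configuration are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: deterministic cache key from text, voice, and config.
public static class TtsCacheKeySketch
{
    public static string Compute(string text, string voiceId,
                                 IReadOnlyDictionary<string, string> config)
    {
        var builder = new StringBuilder();
        Append(builder, text);
        Append(builder, voiceId);

        // Sort by key so the same settings always hash to the same value,
        // regardless of the order in which they were supplied.
        foreach (var pair in config.OrderBy(p => p.Key, StringComparer.Ordinal))
        {
            Append(builder, pair.Key);
            Append(builder, pair.Value);
        }

        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(builder.ToString()));
        return Convert.ToHexString(hash).ToLowerInvariant();
    }

    // Length-prefix each field so ("ab", "c") and ("a", "bc") produce different keys.
    private static void Append(StringBuilder builder, string value) =>
        builder.Append(value.Length).Append(':').Append(value);
}
```

Because every configuration value feeds into the hash, changing any setting yields a different key, so stale audio is never returned for a new configuration. This is one simple way to get the invalidation-on-config-change behavior listed above.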

Provider selection guide

Lowest latency

Recommended providers:

Deepgram - Sub-250ms first chunk
ElevenLabs Turbo - ~300ms latency
Cartesia - Optimized for streaming

Use μ-law encoding at 8kHz for telephony.

Adding custom providers

To integrate a new TTS provider:

Add enum value in IqraCore/Entities/Interfaces/InterfaceTTSProviderEnum.cs
Implement interface in IqraInfrastructure/Managers/TTS/Providers/
Handle audio formats using TTSProviderAvailableAudioFormat
Return streaming data via Stream or byte[]
Restart application for auto-registration

See ElevenLabsTTSService.cs:19-71 for reference implementation.

Next steps

Configure STT - Add speech-to-text for input processing
Multi-language agents - Configure parallel language contexts
Voice settings - Fine-tune voice parameters per agent
Telephony integration - Deploy via phone providers
