Overview
Voice synthesis in Airi uses the unspeech library to provide a unified interface across different TTS providers. Configuration happens at two levels:
- Provider Level: Choose and configure a TTS service
- Character Level: Select a voice and audio parameters per character
Available TTS Providers
ElevenLabs
Best for: Highest quality, natural-sounding voices with emotion
Configuration
Supported Models
- eleven_multilingual_v2: 29 languages, high quality
- eleven_turbo_v2_5: Fastest, lowest latency (~300ms)
- eleven_turbo_v2: Fast with good quality
- eleven_monolingual_v1: English only, highest quality
- eleven_multilingual_v1: Legacy multilingual
Voice Settings Explained
Stability (0.0 - 1.0):
- Low (0.0-0.3): More expressive, emotional, varies between generations
- Medium (0.4-0.6): Balanced expression and consistency
- High (0.7-1.0): Very consistent, less expressive, can sound robotic
Similarity Boost (0.0 - 1.0):
- Low: More creative interpretation
- High: Closer to original voice sample
- Recommended: 0.75 for most use cases
Style (0.0 - 1.0):
- Controls speaking style exaggeration
- 0.0 = neutral, 1.0 = very expressive
Speaker Boost (true/false):
- Enhances voice similarity
- Recommended: true for custom voices
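These settings correspond to the fields of the `voice_settings` object in ElevenLabs' text-to-speech API (`stability`, `similarity_boost`, `style`, `use_speaker_boost`). The following is a minimal sketch of keeping values inside their documented ranges before sending a request. The `normalizeVoiceSettings` helper and its default values are assumptions based on the recommendations in this section, not part of Airi or unspeech.

```typescript
// Field names follow ElevenLabs' voice_settings object.
// The defaults below are illustrative assumptions, not Airi's defaults.
interface VoiceSettings {
  stability: number;         // 0.0 - 1.0
  similarity_boost: number;  // 0.0 - 1.0
  style: number;             // 0.0 - 1.0
  use_speaker_boost: boolean;
}

// Clamp a number into the closed interval [0, 1].
const clamp01 = (value: number): number => Math.min(1, Math.max(0, value));

// Hypothetical helper: fill in missing fields and clamp numeric ones.
function normalizeVoiceSettings(input: Partial<VoiceSettings>): VoiceSettings {
  return {
    stability: clamp01(input.stability ?? 0.5),
    similarity_boost: clamp01(input.similarity_boost ?? 0.75),
    style: clamp01(input.style ?? 0.0),
    use_speaker_boost: input.use_speaker_boost ?? true,
  };
}
```

For example, `normalizeVoiceSettings({ stability: 1.4 })` yields a stability of 1 and leaves the other fields at the assumed defaults.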
Available Voices
ElevenLabs provides 100+ pre-made voices. Popular choices:
- Rachel: American female, warm and friendly
- Clyde: American male, conversational
- Domi: American female, energetic
- Bella: American female, soft and calming
- Antoni: American male, professional
Cost & Latency
Pricing (as of 2026):
- Free tier: 10,000 characters/month
- Starter: $5/month (30,000 chars)
- Creator: $22/month (100,000 chars)
- Pro: $99/month (500,000 chars)
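To make the tier choice concrete, here is a small sketch that picks the cheapest tier whose monthly allowance covers an expected character volume. The figures are copied from the list above and may be out of date; `cheapestTierFor` is a hypothetical helper name.

```typescript
interface Tier {
  name: string;
  monthlyUsd: number;
  includedChars: number;
}

// Figures taken from the pricing list in this section.
const ELEVENLABS_TIERS: Tier[] = [
  { name: "Free", monthlyUsd: 0, includedChars: 10_000 },
  { name: "Starter", monthlyUsd: 5, includedChars: 30_000 },
  { name: "Creator", monthlyUsd: 22, includedChars: 100_000 },
  { name: "Pro", monthlyUsd: 99, includedChars: 500_000 },
];

// Returns the cheapest tier that covers the volume, or undefined if none does.
function cheapestTierFor(charsPerMonth: number): Tier | undefined {
  return ELEVENLABS_TIERS
    .filter((tier) => tier.includedChars >= charsPerMonth)
    .sort((a, b) => a.monthlyUsd - b.monthlyUsd)[0];
}
```

A character that speaks roughly 45,000 characters per month would land on the Creator tier under these numbers.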
Latency:
- eleven_turbo_v2_5: 300-500ms
- eleven_turbo_v2: 400-600ms
- eleven_multilingual_v2: 600-900ms
- eleven_monolingual_v1: 700-1000ms
OpenAI TTS
Best for: Quick setup, good quality, OpenAI ecosystem integration
Configuration
Models & Voices
Models:
- tts-1: Standard quality, optimized for speed
- tts-1-hd: High definition audio
- gpt-4o-mini-tts: Latest model with 13 voices
- gpt-4o-mini-tts-2025-12-15: Dated snapshot
Voices:
- alloy: Neutral, versatile (all models)
- echo: Male, clear (all models)
- fable: British, expressive (all models)
- onyx: Deep male (all models)
- nova: Female, energetic (all models)
- shimmer: Soft female (all models)
- ballad: Male, warm (gpt-4o-mini-tts only)
- verse: Female, confident (gpt-4o-mini-tts only)
- marin: Female, professional (gpt-4o-mini-tts only)
- cedar: Male, authoritative (gpt-4o-mini-tts only)
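The compatibility notes in the voice list can be turned into a quick check before a request is sent. This is a sketch only: the voice arrays mirror the list in this section (they are not fetched from OpenAI and may be incomplete), and `isVoiceSupported` is a hypothetical helper.

```typescript
// Voices the list above marks as available on all models.
const ALL_MODEL_VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
// Voices the list above marks as gpt-4o-mini-tts only.
const GPT_4O_MINI_ONLY_VOICES = ["ballad", "verse", "marin", "cedar"];

function isVoiceSupported(model: string, voice: string): boolean {
  if (ALL_MODEL_VOICES.includes(voice)) return true;
  // startsWith also matches dated snapshots such as gpt-4o-mini-tts-2025-12-15.
  return model.startsWith("gpt-4o-mini-tts") && GPT_4O_MINI_ONLY_VOICES.includes(voice);
}
```

With this check, `isVoiceSupported("tts-1", "cedar")` is false, which is the kind of mismatch that otherwise surfaces as a failed synthesis request.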
Cost & Performance
Pricing:
- tts-1: $15.00 per 1M characters
- tts-1-hd: $30.00 per 1M characters
Recommendations:
- Use tts-1 for real-time conversations
- Use tts-1-hd for pre-recorded content
Microsoft Azure Speech
Best for: Maximum voice variety, enterprise features
Configuration
Features
- 400+ neural voices
- 140+ languages and dialects
- Custom Neural Voice training
- SSML support for fine control
- Visemes for lip-sync
- Batch synthesis API
Popular Voices
English (US):
- en-US-JennyNeural: Female, friendly
- en-US-GuyNeural: Male, casual
- en-US-AriaNeural: Female, professional
Japanese:
- ja-JP-NanamiNeural: Female, standard
- ja-JP-KeitaNeural: Male, standard
Chinese (Mandarin):
- zh-CN-XiaoxiaoNeural: Female, standard
- zh-CN-YunxiNeural: Male, standard
Pricing
- Standard: $4 per 1M characters
- Neural: $16 per 1M characters
- Free tier: 5M characters/month (first 12 months)
Deepgram Aura
Best for: Real-time conversational AI, lowest latency
Configuration
Models
- aura-2: Latest generation, improved naturalness
- aura-1: First generation
- aura: Legacy model (deprecated)
Voices
Available voices include:
- asteria-en: Female, US English
- luna-en: Female, US English
- stella-en: Female, US English
- athena-en: Female, UK English
- hera-en: Female, US English
- orion-en: Male, US English
- arcas-en: Male, US English
- perseus-en: Male, US English
- angus-en: Male, Irish English
- orpheus-en: Male, US English
- helios-en: Male, UK English
Performance
- Latency: 200-400ms (industry-leading)
- Pricing: $0.015 per 1,000 characters
- Best Use: Real-time streaming conversations, voice assistants
Alibaba Cloud CosyVoice
Best for: Chinese language, cost-effective
Configuration
Models
- cosyvoice-v1: Original version
- cosyvoice-v2: Improved naturalness and emotion
Features
- Optimized for Chinese languages
- Natural prosody and emotion
- Voice cloning support
- Multi-speaker synthesis
Volcengine TTS
Best for: Chinese language, ByteDance integration
Configuration:
Local & Open-Source Options
Browser Local TTS
Provider ID: browser-local-audio-speech
Run TTS models entirely in your browser using WebGPU.
Requirements:
- Modern browser with WebGPU support
- 8GB+ RAM recommended
- GPU acceleration
Features:
- ✅ No API costs
- ✅ Complete privacy (no data sent)
- ✅ Works offline
- ❌ Slower than cloud services
- ❌ Limited voice options
Desktop Local TTS
Provider ID: app-local-audio-speech
Native TTS using Hugging Face Candle (CUDA/Metal accelerated).
Requirements:
- Airi Desktop (Tamagotchi) app
- NVIDIA GPU (CUDA) or Apple Silicon (Metal)
Features:
- Hardware accelerated
- No internet required
- Lower latency than browser
- Larger model support
Index-TTS (Bilibili)
Provider ID: index-tts-vllm
Open-source Chinese/English TTS by Bilibili.
Setup:
Configuration:
Features:
- Optimized for Chinese
- Multiple voices per language
- Free and open-source
- Self-hosted
Player2 Speech
Provider ID: player2-speech
Game-focused TTS through the Player2.game integration.
Configuration:
Languages: English, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese
Audio Settings
Quality vs Latency Tradeoffs
| Provider | Model | Latency | Quality | Cost |
|---|---|---|---|---|
| Deepgram | aura-2 | 200-400ms | Good | Low |
| ElevenLabs | eleven_turbo_v2_5 | 300-500ms | Excellent | Medium |
| OpenAI | tts-1 | 800-1200ms | Good | Medium |
| OpenAI | tts-1-hd | 800-1200ms | Excellent | High |
| ElevenLabs | eleven_multilingual_v2 | 600-900ms | Excellent | High |
| Microsoft | Neural | 600-1000ms | Very Good | Medium |
| Local (Browser) | Various | 2-5s | Varies | Free |
| Local (Desktop) | Various | 1-3s | Varies | Free |
- Real-time conversations: Deepgram Aura, ElevenLabs Turbo
- Pre-recorded content: OpenAI TTS-HD, ElevenLabs Multilingual
- Cost-conscious: Local models, Microsoft Azure
- Privacy-focused: Local browser/desktop TTS
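The table above can also be queried programmatically when a character has a latency budget. The sketch below hard-codes the upper bound of each latency range from the table and filters by a budget in milliseconds. `optionsWithinLatency` and the `TtsOption` shape are hypothetical names used for illustration, not Airi APIs.

```typescript
interface TtsOption {
  provider: string;
  model: string;
  maxLatencyMs: number; // upper bound of the range listed in the table
  cost: "Free" | "Low" | "Medium" | "High";
}

// Rows copied from the Quality vs Latency table.
const TTS_OPTIONS: TtsOption[] = [
  { provider: "Deepgram", model: "aura-2", maxLatencyMs: 400, cost: "Low" },
  { provider: "ElevenLabs", model: "eleven_turbo_v2_5", maxLatencyMs: 500, cost: "Medium" },
  { provider: "OpenAI", model: "tts-1", maxLatencyMs: 1200, cost: "Medium" },
  { provider: "OpenAI", model: "tts-1-hd", maxLatencyMs: 1200, cost: "High" },
  { provider: "ElevenLabs", model: "eleven_multilingual_v2", maxLatencyMs: 900, cost: "High" },
  { provider: "Microsoft", model: "Neural", maxLatencyMs: 1000, cost: "Medium" },
  { provider: "Local (Browser)", model: "Various", maxLatencyMs: 5000, cost: "Free" },
  { provider: "Local (Desktop)", model: "Various", maxLatencyMs: 3000, cost: "Free" },
];

// Options whose worst-case latency fits the budget, fastest first.
function optionsWithinLatency(budgetMs: number): TtsOption[] {
  return TTS_OPTIONS
    .filter((option) => option.maxLatencyMs <= budgetMs)
    .sort((a, b) => a.maxLatencyMs - b.maxLatencyMs);
}
```

With a 500 ms budget, only Deepgram aura-2 and ElevenLabs eleven_turbo_v2_5 remain, which matches the real-time recommendation.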
Audio Format Settings
TTS providers output audio in different formats:
Voice Configuration in Character Cards
Voices are configured per character through the AIRI Card extension:
Advanced Configuration
Custom Voice Cloning (ElevenLabs)
1. Prepare voice samples:
   - 1-5 minutes of clear audio
   - Single speaker
   - Minimal background noise
   - Variety of emotions/tones
2. Upload to ElevenLabs:
   - Go to Voice Lab
   - Create new voice
   - Upload samples
   - Generate voice
3. Get voice ID:
4. Configure in Airi:
SSML Support (Microsoft Azure)
SSML (Speech Synthesis Markup Language) provides fine-grained control:
Streaming vs Batch Synthesis
Airi supports both streaming and batch synthesis:
Streaming (real-time):
- Lower perceived latency
- Better for conversations
- Audio plays as generated
- Used by default
Batch:
- All audio generated before playback
- Better for complex SSML
- More consistent quality
- Configure in provider settings
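The difference between the two modes can be sketched in a few lines. This is a simplified, synchronous illustration: a real provider would deliver chunks asynchronously, and `playStreaming` and `collectBatch` are hypothetical names, not functions from Airi's audio pipeline.

```typescript
type Chunk = Uint8Array;

// Streaming: hand each chunk to the player as soon as it is available.
function playStreaming(chunks: Iterable<Chunk>, play: (chunk: Chunk) => void): void {
  for (const chunk of chunks) {
    play(chunk);
  }
}

// Batch: collect every chunk first, then return one contiguous buffer.
function collectBatch(chunks: Iterable<Chunk>): Chunk {
  const parts = [...chunks];
  const total = parts.reduce((sum, part) => sum + part.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}
```

In streaming mode the first chunk reaches the player before the last one has been generated, which is where the lower perceived latency comes from. In batch mode playback cannot start until the whole buffer exists.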
Performance Optimization
Reducing Latency
- Choose low-latency providers:
  - Deepgram Aura: ~200ms
  - ElevenLabs Turbo: ~300ms
- Use streaming mode:
- Pre-warm connections:
- Optimize network:
  - Use nearby regions
  - Reduce network hops
  - Consider CDN for audio
Caching Generated Audio
Airi can cache generated TTS audio. Cached entries are keyed by:
- Provider + model + voice
- Text content
- Voice settings (pitch, rate, etc.)
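One way to combine those components into a lookup key is sketched below. This is an assumption about how such a key could be built, not a description of Airi's cache: `cacheKey` and the `TtsRequest` shape are hypothetical.

```typescript
interface TtsRequest {
  provider: string;
  model: string;
  voice: string;
  text: string;
  settings: Record<string, number | boolean | string>;
}

// Build a deterministic key from every input that affects the audio.
function cacheKey(req: TtsRequest): string {
  // Sort setting names so {pitch, rate} and {rate, pitch} give the same key.
  const sortedSettings = Object.keys(req.settings)
    .sort()
    .map((name) => [name, req.settings[name]]);
  return JSON.stringify([req.provider, req.model, req.voice, req.text, sortedSettings]);
}

// A Map from key to audio bytes is enough for an in-memory cache.
const audioCache = new Map<string, Uint8Array>();
```

Because the text is part of the key, any change to the spoken line produces a cache miss, so caching pays off mainly for repeated phrases such as greetings.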
Cost Management
Provider Cost Comparison
Per 1M characters (approximate):
- Local (Browser/Desktop): $0 (free)
- Deepgram Aura: $15
- OpenAI TTS-1: $15
- Microsoft Neural: $16
- ElevenLabs Turbo: ~$30 (based on tier)
- OpenAI TTS-HD: $30
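As a worked example of the arithmetic, the sketch below estimates spend from the approximate per-million rates listed above. The provider keys are labels invented for this example, and `estimateCostUsd` is a hypothetical helper.

```typescript
// Approximate USD per 1M characters, copied from the comparison above.
// The keys are labels made up for this example.
const USD_PER_MILLION_CHARS: Record<string, number> = {
  "local": 0,
  "deepgram-aura": 15,
  "openai-tts-1": 15,
  "microsoft-neural": 16,
  "elevenlabs-turbo": 30,
  "openai-tts-1-hd": 30,
};

function estimateCostUsd(provider: string, characters: number): number {
  const rate = USD_PER_MILLION_CHARS[provider];
  if (rate === undefined) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  return (characters * rate) / 1_000_000;
}
```

At these rates, 200,000 characters cost about $3 with Deepgram Aura and about $6 with OpenAI TTS-HD.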
Reducing Costs
- Use local TTS when possible
- Cache generated audio
- Choose cost-effective providers:
  - Deepgram for real-time
  - Microsoft for variety
  - Local for development
- Monitor usage:
  - Set provider quotas
  - Track character count
  - Alert on thresholds
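Usage monitoring can be as simple as counting characters against a quota. The following sketch shows the idea under stated assumptions: `createUsageTracker` is a hypothetical helper, and it counts JavaScript string length, which may differ slightly from how a provider bills characters.

```typescript
interface UsageTracker {
  record(text: string): boolean;
  used(): number;
}

// quotaChars: hard limit; alertRatio: fraction of the quota that triggers onAlert once.
function createUsageTracker(
  quotaChars: number,
  alertRatio: number,
  onAlert: (used: number, quota: number) => void,
): UsageTracker {
  let usedChars = 0;
  let alerted = false;
  return {
    // Returns false (and records nothing) if the text would exceed the quota.
    record(text: string): boolean {
      const next = usedChars + text.length;
      if (next > quotaChars) return false;
      usedChars = next;
      if (!alerted && usedChars >= quotaChars * alertRatio) {
        alerted = true;
        onAlert(usedChars, quotaChars);
      }
      return true;
    },
    used: (): number => usedChars,
  };
}
```

A caller could treat a `false` return as the signal to fall back to local TTS instead of sending the request.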
Troubleshooting
Voice Not Available
Problem: The selected voice doesn't work with the chosen model.
Solution: Check voice compatibility:
Audio Cutting Out
Causes:
- Network instability
- Provider rate limits
- Audio buffer underrun
Solutions:
- Increase buffer size in audio settings
- Check network connection
- Switch to lower-latency provider
- Use local TTS
Poor Voice Quality
Causes:
- Wrong model selection
- Suboptimal voice settings
- Network packet loss
Solutions:
- Use HD models (OpenAI TTS-HD, ElevenLabs Multilingual)
- Adjust voice settings:
  - Increase stability for consistency
  - Adjust similarity boost
- Check network quality
- Try different voices
High Latency
Solutions:
- Switch to faster provider (Deepgram, ElevenLabs Turbo)
- Use local TTS
- Enable audio caching
- Check network latency to provider
- Use regional endpoints
Rate Limit Exceeded
Error: 429 Too Many Requests
Solutions:
- Implement request throttling
- Upgrade provider tier
- Distribute load across providers
- Use local TTS fallback
- Cache more aggressively
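Request throttling is commonly done with a token bucket. The sketch below is a generic client-side version, not Airi's implementation: `createTokenBucket` is a hypothetical helper, and the caller passes the current time in milliseconds so the logic stays deterministic.

```typescript
interface TokenBucket {
  tryAcquire(nowMs: number): boolean;
}

// capacity: maximum burst size; refillPerSecond: sustained request rate.
function createTokenBucket(capacity: number, refillPerSecond: number, startMs: number): TokenBucket {
  let tokens = capacity;
  let lastMs = startMs;
  return {
    // Returns true if a request may be sent now, false if it should wait.
    tryAcquire(nowMs: number): boolean {
      const elapsedSeconds = Math.max(0, nowMs - lastMs) / 1000;
      tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
      lastMs = Math.max(lastMs, nowMs);
      if (tokens < 1) return false;
      tokens -= 1;
      return true;
    },
  };
}
```

In practice the caller would pass `Date.now()` and, when `tryAcquire` returns false, either queue the request or route it to a local TTS fallback.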
Code Reference
Voice synthesis implementation:
- Speech store: packages/stage-ui/src/stores/modules/speech.ts
- Speech runtime: packages/stage-ui/src/stores/speech-runtime.ts
- Provider configs: packages/stage-ui/src/stores/providers.ts
- Audio pipeline: Uses @proj-airi/pipelines-audio
Related Resources
- Providers: Configure TTS providers
- Character Settings: Link voices to characters
