Airi supports multiple TTS (text-to-speech) providers with different voice options, quality levels, and performance characteristics.

Overview

Voice synthesis in Airi uses the unspeech library to provide a unified interface across different TTS providers. Configuration happens at two levels:
  1. Provider Level: Choose and configure a TTS service
  2. Character Level: Select a voice and audio parameters per character

Available TTS Providers

ElevenLabs

Best for: Highest quality, natural-sounding voices with emotion
{
  "apiKey": "your-elevenlabs-api-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/",
  "voiceSettings": {
    "stability": 0.5,
    "similarityBoost": 0.75,
    "style": 0.0,
    "useSpeakerBoost": true
  }
}
Models:
  • eleven_multilingual_v2: 29 languages, high quality
  • eleven_turbo_v2_5: Fastest, lowest latency (~300ms)
  • eleven_turbo_v2: Fast with good quality
  • eleven_monolingual_v1: English only, highest quality
  • eleven_multilingual_v1: Legacy multilingual
Stability (0.0 - 1.0):
  • Low (0.0-0.3): More expressive, emotional, varies between generations
  • Medium (0.4-0.6): Balanced expression and consistency
  • High (0.7-1.0): Very consistent, less expressive, robotic
Similarity Boost (0.0 - 1.0):
  • Low: More creative interpretation
  • High: Closer to original voice sample
  • Recommended: 0.75 for most use cases
Style (0.0 - 1.0) (Turbo models only):
  • Controls speaking style exaggeration
  • 0.0 = neutral, 1.0 = very expressive
Speaker Boost (boolean):
  • Enhances voice similarity
  • Recommended: true for custom voices
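The ranges above can be enforced before a request is sent. The following is a minimal sketch, not Airi's own code: `normalizeVoiceSettings` and `clamp01` are hypothetical helpers, and the field names and defaults are simply copied from the configuration example in this section.

```typescript
// Field names mirror the "voiceSettings" object shown in the configuration example.
interface VoiceSettings {
  stability: number
  similarityBoost: number
  style: number
  useSpeakerBoost: boolean
}

// Defaults copied from the configuration example (assumed to be sensible starting values).
const DEFAULT_VOICE_SETTINGS: VoiceSettings = {
  stability: 0.5,
  similarityBoost: 0.75,
  style: 0.0,
  useSpeakerBoost: true,
}

// Clamp a numeric setting into the documented 0.0 - 1.0 range.
function clamp01(value: number): number {
  if (Number.isNaN(value)) return 0
  return Math.min(1, Math.max(0, value))
}

// Hypothetical helper: fill in missing fields and clamp out-of-range values.
function normalizeVoiceSettings(partial: Partial<VoiceSettings> = {}): VoiceSettings {
  const merged = { ...DEFAULT_VOICE_SETTINGS, ...partial }
  return {
    stability: clamp01(merged.stability),
    similarityBoost: clamp01(merged.similarityBoost),
    style: clamp01(merged.style),
    useSpeakerBoost: merged.useSpeakerBoost,
  }
}
```

Calling `normalizeVoiceSettings({ stability: 1.4 })` would yield a stability of 1 and keep the other defaults.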
ElevenLabs provides 100+ pre-made voices. Popular choices:
  • Rachel: American female, warm and friendly
  • Clyde: American male, conversational
  • Domi: American female, energetic
  • Bella: American female, soft and calming
  • Antoni: American male, professional
You can also clone custom voices with 1-5 minute samples.
Pricing (as of 2026):
  • Free tier: 10,000 characters/month
  • Starter: $5/month (30,000 chars)
  • Creator: $22/month (100,000 chars)
  • Pro: $99/month (500,000 chars)
Latency:
  • eleven_turbo_v2_5: 300-500ms
  • eleven_turbo_v2: 400-600ms
  • eleven_multilingual_v2: 600-900ms
  • eleven_monolingual_v1: 700-1000ms

OpenAI TTS

Best for: Quick setup, good quality, OpenAI ecosystem integration
{
  "apiKey": "sk-...",
  "baseUrl": "https://api.openai.com/v1/"
}
Models:
  • tts-1: Standard quality, optimized for speed
  • tts-1-hd: High definition audio
  • gpt-4o-mini-tts: Latest model with 13 voices
  • gpt-4o-mini-tts-2025-12-15: Dated snapshot
Voices:
  • alloy: Neutral, versatile (all models)
  • echo: Male, clear (all models)
  • fable: British, expressive (all models)
  • onyx: Deep male (all models)
  • nova: Female, energetic (all models)
  • shimmer: Soft female (all models)
  • ballad: Male, warm (gpt-4o-mini-tts only)
  • verse: Female, confident (gpt-4o-mini-tts only)
  • marin: Female, professional (gpt-4o-mini-tts only)
  • cedar: Male, authoritative (gpt-4o-mini-tts only)
Pricing:
  • tts-1: $15.00 per 1M characters
  • tts-1-hd: $30.00 per 1M characters
Latency: ~800-1200ms
Quality vs Speed:
  • Use tts-1 for real-time conversations
  • Use tts-1-hd for pre-recorded content

Microsoft Azure Speech

Best for: Maximum voice variety, enterprise features
{
  "apiKey": "your-azure-key",
  "region": "eastus",
  "baseUrl": "https://unspeech.hyp3r.link/v1/"
}
Features:
  • 400+ neural voices
  • 140+ languages and dialects
  • Custom Neural Voice training
  • SSML support for fine control
  • Visemes for lip-sync
  • Batch synthesis API
Pricing:
  • Standard: $4 per 1M characters
  • Neural: $16 per 1M characters
  • Free tier: 5M characters/month (first 12 months)

Deepgram Aura

Best for: Real-time conversational AI, lowest latency
{
  "apiKey": "your-deepgram-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/"
}
Models:
  • aura-2: Latest generation, improved naturalness
  • aura-1: First generation
  • aura: Legacy model (deprecated)
Available voices include:
  • asteria-en: Female, US English
  • luna-en: Female, US English
  • stella-en: Female, US English
  • athena-en: Female, UK English
  • hera-en: Female, US English
  • orion-en: Male, US English
  • arcas-en: Male, US English
  • perseus-en: Male, US English
  • angus-en: Male, Irish English
  • orpheus-en: Male, US English
  • helios-en: Male, UK English
Latency: 200-400ms (industry-leading)
Pricing: $0.015 per 1,000 characters
Best Use: Real-time streaming conversations, voice assistants

Alibaba Cloud CosyVoice

Best for: Chinese language, cost-effective
{
  "apiKey": "your-alibaba-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/"
}
Models:
  • cosyvoice-v1: Original version
  • cosyvoice-v2: Improved naturalness and emotion
Features:
  • Optimized for Chinese languages
  • Natural prosody and emotion
  • Voice cloning support
  • Multi-speaker synthesis

Volcengine TTS

Best for: Chinese language, ByteDance integration
Configuration:
{
  "apiKey": "your-volcengine-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/",
  "app": {
    "appId": "your-app-id"
  }
}

Local & Open-Source Options

Provider ID: browser-local-audio-speech
Run TTS models entirely in your browser using WebGPU.
Requirements:
  • Modern browser with WebGPU support
  • 8GB+ RAM recommended
  • GPU acceleration
Configuration:
{
  "baseUrl": "auto-configured"
}
Features:
  • ✅ No API costs
  • ✅ Complete privacy (no data sent)
  • ✅ Works offline
  • ❌ Slower than cloud services
  • ❌ Limited voice options
Supported Models: Varies based on browser capabilities
Provider ID: app-local-audio-speech
Native TTS using Hugging Face Candle (CUDA/Metal accelerated).
Requirements:
  • Airi Desktop (Tamagotchi) app
  • NVIDIA GPU (CUDA) or Apple Silicon (Metal)
Features:
  • Hardware accelerated
  • No internet required
  • Lower latency than browser
  • Larger model support
Provider ID: index-tts-vllm
Open-source Chinese/English TTS by Bilibili.
Setup:
# Install from https://index-tts.github.io
git clone https://github.com/bilibili/index-tts
cd index-tts
pip install -r requirements.txt
python -m index_tts.server --port 11996
Configuration:
{
  "baseUrl": "http://localhost:11996/tts/"
}
Features:
  • Optimized for Chinese
  • Multiple voices per language
  • Free and open-source
  • Self-hosted
Provider ID: player2-speech
Game-focused TTS via the Player2.game integration.
Configuration:
{
  "baseUrl": "http://localhost:4315/v1/"
}
Languages: English, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese

Audio Settings

Quality vs Latency Tradeoffs

| Provider | Model | Latency | Quality | Cost |
|---|---|---|---|---|
| Deepgram | aura-2 | 200-400ms | Good | Low |
| ElevenLabs | eleven_turbo_v2_5 | 300-500ms | Excellent | Medium |
| OpenAI | tts-1 | 800-1200ms | Good | Medium |
| OpenAI | tts-1-hd | 800-1200ms | Excellent | High |
| ElevenLabs | eleven_multilingual_v2 | 600-900ms | Excellent | High |
| Microsoft | Neural | 600-1000ms | Very Good | Medium |
| Local (Browser) | Various | 2-5s | Varies | Free |
| Local (Desktop) | Various | 1-3s | Varies | Free |
Recommendations:
  • Real-time conversations: Deepgram Aura, ElevenLabs Turbo
  • Pre-recorded content: OpenAI TTS-HD, ElevenLabs Multilingual
  • Cost-conscious: Local models, Microsoft Azure
  • Privacy-focused: Local browser/desktop TTS
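To make the tradeoff concrete, the table can be turned into data and filtered by a latency budget. This is an illustrative sketch only: the `OPTIONS` array restates the cloud rows of the table above, and `optionsWithinBudget` is a hypothetical helper, not something Airi provides.

```typescript
interface ProviderOption {
  provider: string
  model: string
  latencyMs: [number, number] // [best case, worst case], taken from the table above
  quality: 'Good' | 'Very Good' | 'Excellent'
}

// Cloud rows of the comparison table, restated as data.
const OPTIONS: ProviderOption[] = [
  { provider: 'Deepgram', model: 'aura-2', latencyMs: [200, 400], quality: 'Good' },
  { provider: 'ElevenLabs', model: 'eleven_turbo_v2_5', latencyMs: [300, 500], quality: 'Excellent' },
  { provider: 'OpenAI', model: 'tts-1', latencyMs: [800, 1200], quality: 'Good' },
  { provider: 'OpenAI', model: 'tts-1-hd', latencyMs: [800, 1200], quality: 'Excellent' },
  { provider: 'ElevenLabs', model: 'eleven_multilingual_v2', latencyMs: [600, 900], quality: 'Excellent' },
  { provider: 'Microsoft', model: 'Neural', latencyMs: [600, 1000], quality: 'Very Good' },
]

// Keep options whose worst-case latency fits the budget, fastest first.
function optionsWithinBudget(maxLatencyMs: number): ProviderOption[] {
  return OPTIONS
    .filter(option => option.latencyMs[1] <= maxLatencyMs)
    .sort((a, b) => a.latencyMs[1] - b.latencyMs[1])
}
```

With a 500 ms budget this returns Deepgram aura-2 followed by ElevenLabs eleven_turbo_v2_5, which matches the real-time recommendation above.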

Audio Format Settings

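Optional audio parameters are set alongside the provider, model, and voice in the character's speech module configuration. Not every provider supports every parameter: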
// Character speech module configuration
{
  provider: 'elevenlabs',
  model: 'eleven_turbo_v2_5',
  voice_id: 'rachel',
  
  // Optional audio settings
  pitch: 1.0,      // 0.5 - 2.0 (not all providers)
  rate: 1.0,       // 0.25 - 4.0 (not all providers)
  ssml: false,     // Enable SSML markup
  language: 'en'   // Language hint
}

Voice Configuration in Character Cards

Voices are configured per character through the AIRI Card extension:
{
  "name": "Your Character",
  "extensions": {
    "airi": {
      "modules": {
        "speech": {
          "provider": "elevenlabs",
          "model": "eleven_multilingual_v2",
          "voice_id": "rachel",
          "pitch": 1.0,
          "rate": 1.0,
          "language": "en"
        }
      }
    }
  }
}
This allows different characters to use different voices.
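A typed reader for this card layout might look like the sketch below. The interfaces only restate the fields shown in the JSON above; `getSpeechConfig` and the idea of a fallback voice are assumptions for illustration, not part of the card format.

```typescript
// Fields as they appear in the card example and the speech module configuration.
interface SpeechModuleConfig {
  provider: string
  model: string
  voice_id: string
  pitch?: number
  rate?: number
  language?: string
  ssml?: boolean
}

interface CharacterCard {
  name: string
  extensions?: {
    airi?: {
      modules?: {
        speech?: SpeechModuleConfig
      }
    }
  }
}

// Hypothetical helper: read the card's speech settings, falling back to a default voice
// when the card does not define one.
function getSpeechConfig(card: CharacterCard, fallback: SpeechModuleConfig): SpeechModuleConfig {
  const speech = card.extensions?.airi?.modules?.speech
  return { ...fallback, ...speech }
}
```

Fields set on the card take precedence over the fallback, so each character can override only what it needs.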

Advanced Configuration

Custom Voice Cloning (ElevenLabs)

  1. Prepare voice samples:
    • 1-5 minutes of clear audio
    • Single speaker
    • Minimal background noise
    • Variety of emotions/tones
  2. Upload to ElevenLabs:
    • Go to Voice Lab
    • Create new voice
    • Upload samples
    • Generate voice
  3. Get voice ID:
    curl https://api.elevenlabs.io/v1/voices \
      -H "xi-api-key: YOUR_API_KEY"
    
  4. Configure in Airi:
    {
      "voice_id": "your-cloned-voice-id"
    }
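Step 3 can also be scripted. The sketch below assumes the `/v1/voices` response contains a `voices` array whose entries carry `voice_id` and `name`; `findVoiceId`, `fetchVoiceId`, and the injected `getJson` function are hypothetical names used only for illustration.

```typescript
// Assumed shape of the relevant part of the /v1/voices response.
interface VoiceSummary {
  voice_id: string
  name: string
}

interface VoicesResponse {
  voices: VoiceSummary[]
}

// Look up a voice by its display name (case-insensitive, ignoring surrounding spaces).
function findVoiceId(response: VoicesResponse, name: string): string | undefined {
  const wanted = name.trim().toLowerCase()
  return response.voices.find(voice => voice.name.toLowerCase() === wanted)?.voice_id
}

// The HTTP call is injected so the lookup works with any client (fetch, axios, a test stub).
type GetJson = (url: string, headers: Record<string, string>) => Promise<unknown>

async function fetchVoiceId(getJson: GetJson, apiKey: string, name: string): Promise<string | undefined> {
  const data = await getJson('https://api.elevenlabs.io/v1/voices', { 'xi-api-key': apiKey })
  return findVoiceId(data as VoicesResponse, name)
}
```

In a browser or a recent Node.js, `getJson` could be a thin wrapper around `fetch` that passes the headers through and returns `response.json()`.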
    

SSML Support (Microsoft Azure)

SSML (Speech Synthesis Markup Language) provides fine-grained control:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="fast" pitch="+2st">
      I'm excited to meet you!
    </prosody>
    <break time="500ms"/>
    <prosody rate="slow" pitch="-2st">
      But I can also be calm.
    </prosody>
  </voice>
</speak>
Enable SSML in character config:
{
  "ssml": true
}
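When SSML is assembled from dynamic text, the text has to be XML-escaped or the markup breaks. The sketch below generates a document with the same shape as the example above; `buildSsml`, `escapeXml`, and `ProsodySegment` are hypothetical names, and how Airi passes the resulting string to the provider is not covered here.

```typescript
interface ProsodySegment {
  text: string
  rate?: string // e.g. 'fast' or 'slow'
  pitch?: string // e.g. '+2st'
  pauseAfterMs?: number // emits <break time="...ms"/> after the segment
}

// Escape the five characters that are special in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// Build a <speak><voice><prosody>...</prosody></voice></speak> document.
function buildSsml(voice: string, lang: string, segments: ProsodySegment[]): string {
  const body = segments.map((segment) => {
    const rate = segment.rate ? ` rate="${escapeXml(segment.rate)}"` : ''
    const pitch = segment.pitch ? ` pitch="${escapeXml(segment.pitch)}"` : ''
    const pause = segment.pauseAfterMs ? `<break time="${segment.pauseAfterMs}ms"/>` : ''
    return `<prosody${rate}${pitch}>${escapeXml(segment.text)}</prosody>${pause}`
  }).join('')
  const open = `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${escapeXml(lang)}">`
  return `${open}<voice name="${escapeXml(voice)}">${body}</voice></speak>`
}
```

Passing the two segments from the example above, with `pauseAfterMs: 500` on the first, reproduces that document without the indentation.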

Streaming vs Batch Synthesis

Airi supports both streaming and batch synthesis:
Streaming (real-time):
  • Lower perceived latency
  • Better for conversations
  • Audio plays as generated
  • Used by default
Batch (pre-generate):
  • All audio generated before playback
  • Better for complex SSML
  • More consistent quality
  • Configure in provider settings

Performance Optimization

Reducing Latency

  1. Choose low-latency providers:
    • Deepgram Aura: ~200ms
    • ElevenLabs Turbo: ~300ms
  2. Use streaming mode, which is enabled by default.
  3. Pre-warm connections: the provider connection is kept alive.
  4. Optimize network:
    • Use nearby regions
    • Reduce network hops
    • Consider CDN for audio

Caching Generated Audio

Airi can cache generated TTS audio:
// In speech runtime configuration
{
  enableCache: true,
  cacheMaxSize: 100, // MB
  cacheTTL: 86400     // seconds (1 day)
}
Cache keys are based on:
  • Provider + model + voice
  • Text content
  • Voice settings (pitch, rate, etc.)
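A cache keyed this way could be sketched as follows. This is not Airi's implementation: `cacheKey` and `TtlCache` are hypothetical, the default of 1 for `pitch` and `rate` is an assumption, and the size limit (`cacheMaxSize`) is left out for brevity.

```typescript
interface SpeechRequest {
  provider: string
  model: string
  voiceId: string
  text: string
  pitch?: number
  rate?: number
}

// Serializing a fixed-order array means equal requests always produce equal keys.
// Assumption: an unset pitch or rate is equivalent to 1.
function cacheKey(request: SpeechRequest): string {
  return JSON.stringify([
    request.provider,
    request.model,
    request.voiceId,
    request.pitch ?? 1,
    request.rate ?? 1,
    request.text,
  ])
}

// Minimal time-to-live cache; the clock is injectable so expiry can be tested.
class TtlCache<V> {
  private entries = new Map<string, { value: V, expiresAt: number }>()
  private ttlMs: number
  private now: () => number

  constructor(ttlSeconds: number, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000
    this.now = now
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key) // expired: drop it and report a miss
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}
```

A request with the same text but a different voice or rate produces a different key, so changing a character's voice never replays stale audio.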

Cost Management

Provider Cost Comparison

Per 1M characters (approximate):
  • Local (Browser/Desktop): $0 (free)
  • Deepgram Aura: $15
  • OpenAI TTS-1: $15
  • Microsoft Neural: $16
  • ElevenLabs Turbo: ~$30 (based on tier)
  • OpenAI TTS-HD: $30
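For a rough budget, these approximate rates can be multiplied by an expected character volume. The sketch below simply restates the list above as a lookup table; the keys and `estimateCostUsd` are hypothetical names, and actual bills depend on each provider's current pricing and tier.

```typescript
// Approximate USD per 1M characters, copied from the list above.
const USD_PER_MILLION_CHARS: Record<string, number> = {
  'local': 0,
  'deepgram-aura': 15,
  'openai-tts-1': 15,
  'microsoft-neural': 16,
  'elevenlabs-turbo': 30,
  'openai-tts-1-hd': 30,
}

// Estimate the cost of synthesizing `characters` characters with a given option.
function estimateCostUsd(option: string, characters: number): number {
  const rate = USD_PER_MILLION_CHARS[option]
  if (rate === undefined) throw new Error(`Unknown option: ${option}`)
  return (characters / 1000000) * rate
}
```

For example, 500,000 characters per month on Microsoft Neural comes to about $8 at the listed rate.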

Reducing Costs

  1. Use local TTS when possible
  2. Cache generated audio
  3. Choose cost-effective providers:
    • Deepgram for real-time
    • Microsoft for variety
    • Local for development
  4. Monitor usage:
    • Set provider quotas
    • Track character count
    • Alert on thresholds
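The monitoring step can be approximated on the client side with a small counter. This is a sketch under stated assumptions: `UsageTracker` is hypothetical, quotas are counted in characters, and `text.length` (UTF-16 code units) is used as a stand-in for however a provider actually counts billed characters.

```typescript
class UsageTracker {
  private used = 0
  private alerted = false
  private readonly quota: number
  private readonly alertRatio: number
  private readonly onAlert: (used: number, quota: number) => void

  constructor(quota: number, alertRatio: number, onAlert: (used: number, quota: number) => void) {
    this.quota = quota
    this.alertRatio = alertRatio
    this.onAlert = onAlert
  }

  // Returns false (and records nothing) when the text would push usage past the quota.
  record(text: string): boolean {
    if (this.used + text.length > this.quota) return false
    this.used += text.length
    // Fire the alert once, the first time usage crosses the threshold.
    if (!this.alerted && this.used >= this.quota * this.alertRatio) {
      this.alerted = true
      this.onAlert(this.used, this.quota)
    }
    return true
  }

  charactersUsed(): number {
    return this.used
  }
}
```

A caller would check the return value of `record` before sending text to a paid provider and switch to local TTS when it returns false.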

Troubleshooting

Voice Not Available

Problem: Selected voice doesn’t work with model
Solution: Check voice compatibility:
// OpenAI example
const voice = 'ballad' // Only works with gpt-4o-mini-tts
const model = 'gpt-4o-mini-tts' // ✅ Compatible
// const model = 'tts-1' // ❌ Incompatible
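The same check can be written as a lookup based on the OpenAI voice list earlier on this page. This is only a sketch: `isVoiceCompatible` is a hypothetical helper, and it assumes dated snapshots such as gpt-4o-mini-tts-2025-12-15 accept the same voices as gpt-4o-mini-tts.

```typescript
// Voices listed as available on all models.
const UNIVERSAL_VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
// Voices listed as gpt-4o-mini-tts only.
const GPT_4O_MINI_TTS_ONLY = ['ballad', 'verse', 'marin', 'cedar']

function isVoiceCompatible(voice: string, model: string): boolean {
  if (UNIVERSAL_VOICES.indexOf(voice) !== -1) return true
  if (GPT_4O_MINI_TTS_ONLY.indexOf(voice) !== -1) {
    // Assumption: dated snapshots share the base model's voice set.
    return model.indexOf('gpt-4o-mini-tts') === 0
  }
  return false // unknown voice
}
```

Running this before a request makes it possible to fall back to a universal voice such as alloy instead of failing at synthesis time.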

Audio Cutting Out

Causes:
  • Network instability
  • Provider rate limits
  • Audio buffer underrun
Solutions:
  1. Increase buffer size in audio settings
  2. Check network connection
  3. Switch to lower-latency provider
  4. Use local TTS

Poor Voice Quality

Causes:
  • Wrong model selection
  • Suboptimal voice settings
  • Network packet loss
Solutions:
  1. Use HD models (OpenAI TTS-HD, ElevenLabs Multilingual)
  2. Adjust voice settings:
    • Increase stability for consistency
    • Adjust similarity boost
  3. Check network quality
  4. Try different voices

High Latency

Solutions:
  1. Switch to faster provider (Deepgram, ElevenLabs Turbo)
  2. Use local TTS
  3. Enable audio caching
  4. Check network latency to provider
  5. Use regional endpoints

Rate Limit Exceeded

Error: 429 Too Many Requests
Solutions:
  1. Implement request throttling
  2. Upgrade provider tier
  3. Distribute load across providers
  4. Use local TTS fallback
  5. Cache more aggressively
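Request throttling is often paired with retrying after a 429 using exponential backoff. The following is a generic sketch, not Airi's behavior: `withRetry`, `backoffDelaysMs`, and `isRateLimited` are hypothetical, and the code assumes a failed request throws an error object carrying the HTTP status in a `status` field.

```typescript
// Delay schedule: base, 2x base, 4x base, ... capped at capMs.
function backoffDelaysMs(maxRetries: number, baseMs = 500, capMs = 8000): number[] {
  const delays: number[] = []
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(Math.min(capMs, baseMs * Math.pow(2, attempt)))
  }
  return delays
}

// Assumption: rate-limit failures are thrown as objects with `status: 429`.
function isRateLimited(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as { status?: number }).status === 429
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms))

// Retry `request` on 429 errors, waiting longer before each new attempt.
// Any other error, or a 429 after the last retry, is rethrown to the caller.
async function withRetry<T>(
  request: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  const delays = backoffDelaysMs(maxRetries)
  let attempt = 0
  while (true) {
    try {
      return await request()
    } catch (err) {
      if (!isRateLimited(err) || attempt >= delays.length) throw err
      await sleep(delays[attempt])
      attempt++
    }
  }
}
```

With the defaults, a request is retried after 500 ms, 1 s, and 2 s before the error is surfaced, at which point a local TTS fallback can take over.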

Code Reference

Voice synthesis implementation:
  • Speech store: packages/stage-ui/src/stores/modules/speech.ts
  • Speech runtime: packages/stage-ui/src/stores/speech-runtime.ts
  • Provider configs: packages/stage-ui/src/stores/providers.ts
  • Audio pipeline: Uses @proj-airi/pipelines-audio

