Voice Synthesis Configuration

Airi supports multiple TTS (text-to-speech) providers with different voice options, quality levels, and performance characteristics.

Overview

Voice synthesis in Airi uses the unspeech library to provide a unified interface across different TTS providers. Configuration happens at two levels:

Provider Level: Choose and configure a TTS service
Character Level: Select a voice and audio parameters per character

Available TTS Providers

ElevenLabs

Best for: Highest quality, natural-sounding voices with emotion

Configuration

{
  "apiKey": "your-elevenlabs-api-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/",
  "voiceSettings": {
    "stability": 0.5,
    "similarityBoost": 0.75,
    "style": 0.0,
    "useSpeakerBoost": true
  }
}

Supported Models

eleven_multilingual_v2: 29 languages, high quality
eleven_turbo_v2_5: Fastest, lowest latency (~300ms)
eleven_turbo_v2: Fast with good quality
eleven_monolingual_v1: English only, highest quality
eleven_multilingual_v1: Legacy multilingual

Voice Settings Explained

Stability (0.0 - 1.0):

Low (0.0-0.3): More expressive, emotional, varies between generations
Medium (0.4-0.6): Balanced expression and consistency
High (0.7-1.0): Very consistent, less expressive, can sound robotic

Similarity Boost (0.0 - 1.0):

Low: More creative interpretation
High: Closer to original voice sample
Recommended: 0.75 for most use cases

Style (0.0 - 1.0) (Turbo models only):

Controls speaking style exaggeration
0.0 = neutral, 1.0 = very expressive

Speaker Boost (boolean):

Enhances voice similarity
Recommended: true for custom voices
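
The ranges above can be enforced before a request is sent. The following is a minimal sketch, assuming the field names from the "voiceSettings" block shown earlier; `normalizeVoiceSettings` and `clamp01` are hypothetical helpers, not part of Airi or unspeech.

```typescript
// Hypothetical helper: field names mirror the "voiceSettings" block above.
interface VoiceSettings {
  stability: number;
  similarityBoost: number;
  style: number;
  useSpeakerBoost: boolean;
}

// Defaults taken from the example configuration on this page.
const VOICE_DEFAULTS: VoiceSettings = {
  stability: 0.5,
  similarityBoost: 0.75,
  style: 0.0,
  useSpeakerBoost: true,
};

// Clamp a value into the documented 0.0 - 1.0 range.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Fill in missing fields from the defaults and clamp the numeric ones.
function normalizeVoiceSettings(input: Partial<VoiceSettings>): VoiceSettings {
  const merged = { ...VOICE_DEFAULTS, ...input };
  return {
    stability: clamp01(merged.stability),
    similarityBoost: clamp01(merged.similarityBoost),
    style: clamp01(merged.style),
    useSpeakerBoost: merged.useSpeakerBoost,
  };
}
```

For example, `normalizeVoiceSettings({ stability: 1.4 })` keeps the other defaults and lowers stability to 1.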

Available Voices

ElevenLabs provides 100+ pre-made voices. Popular choices:

Rachel: American female, warm and friendly
Clyde: American male, conversational
Domi: American female, energetic
Bella: American female, soft and calming
Antoni: American male, professional

You can also clone custom voices with 1-5 minute samples.

Cost & Latency

Pricing (as of 2026):

Free tier: 10,000 characters/month
Starter: $5/month (30,000 chars)
Creator: $22/month (100,000 chars)
Pro: $99/month (500,000 chars)
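
To see which tier a given workload needs, the list above can be turned into a small lookup. This is a sketch using only the figures on this page; the function names and the 30-day usage estimate are illustrative assumptions.

```typescript
// Tiers copied from the pricing list above; allowances are characters per month.
interface Tier {
  label: string;
  monthlyUsd: number;
  includedChars: number;
}

const ELEVENLABS_TIERS: Tier[] = [
  { label: "Free", monthlyUsd: 0, includedChars: 10000 },
  { label: "Starter", monthlyUsd: 5, includedChars: 30000 },
  { label: "Creator", monthlyUsd: 22, includedChars: 100000 },
  { label: "Pro", monthlyUsd: 99, includedChars: 500000 },
];

// Rough usage estimate (assumption): replies per day * average reply length * 30 days.
function estimateMonthlyChars(repliesPerDay: number, avgCharsPerReply: number): number {
  return repliesPerDay * avgCharsPerReply * 30;
}

// Return the cheapest tier whose allowance covers the expected usage,
// or undefined when usage exceeds every listed tier.
function cheapestTierFor(charsPerMonth: number): Tier | undefined {
  return ELEVENLABS_TIERS
    .filter((tier) => tier.includedChars >= charsPerMonth)
    .sort((a, b) => a.monthlyUsd - b.monthlyUsd)[0];
}
```

With 20 replies a day at about 120 characters each, the estimate is 72,000 characters per month, which lands in the Creator tier.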

Latency:

eleven_turbo_v2_5: 300-500ms
eleven_turbo_v2: 400-600ms
eleven_multilingual_v2: 600-900ms
eleven_monolingual_v1: 700-1000ms

OpenAI TTS

Best for: Quick setup, good quality, OpenAI ecosystem integration

Configuration

{
  "apiKey": "sk-...",
  "baseUrl": "https://api.openai.com/v1/"
}

Models & Voices

Models:

tts-1: Standard quality, optimized for speed
tts-1-hd: High definition audio
gpt-4o-mini-tts: Latest model with 13 voices
gpt-4o-mini-tts-2025-12-15: Dated snapshot

Voices (selection):

alloy: Neutral, versatile (all models)
echo: Male, clear (all models)
fable: British, expressive (all models)
onyx: Deep male (all models)
nova: Female, energetic (all models)
shimmer: Soft female (all models)
ballad: Male, warm (gpt-4o-mini-tts only)
verse: Female, confident (gpt-4o-mini-tts only)
marin: Female, professional (gpt-4o-mini-tts only)
cedar: Male, authoritative (gpt-4o-mini-tts only)

Cost & Performance

Pricing:

tts-1: $15.00 per 1M characters
tts-1-hd: $30.00 per 1M characters

Latency: ~800-1200ms

Quality vs Speed:

Use tts-1 for real-time conversations
Use tts-1-hd for pre-recorded content

Microsoft Azure Speech

Best for: Maximum voice variety, enterprise features

Configuration

{
  "apiKey": "your-azure-key",
  "region": "eastus",
  "baseUrl": "https://unspeech.hyp3r.link/v1/"
}

Features

400+ neural voices
140+ languages and dialects
Custom Neural Voice training
SSML support for fine control
Visemes for lip-sync
Batch synthesis API

Popular Voices

English (US):

en-US-JennyNeural: Female, friendly
en-US-GuyNeural: Male, casual
en-US-AriaNeural: Female, professional

Japanese:

ja-JP-NanamiNeural: Female, standard
ja-JP-KeitaNeural: Male, standard

Chinese:

zh-CN-XiaoxiaoNeural: Female, standard
zh-CN-YunxiNeural: Male, standard

Pricing

Standard: $4 per 1M characters
Neural: $16 per 1M characters
Free tier: 5M characters/month (first 12 months)

Deepgram Aura

Best for: Real-time conversational AI, lowest latency

Configuration

{
  "apiKey": "your-deepgram-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/"
}

Models

aura-2: Latest generation, improved naturalness
aura-1: First generation
aura: Legacy model (deprecated)

Voices

Available voices include:

asteria-en: Female, US English
luna-en: Female, US English
stella-en: Female, US English
athena-en: Female, UK English
hera-en: Female, US English
orion-en: Male, US English
arcas-en: Male, US English
perseus-en: Male, US English
angus-en: Male, Irish English
orpheus-en: Male, US English
helios-en: Male, UK English

Performance

Latency: 200-400ms (industry-leading)
Pricing: $0.015 per 1,000 characters
Best Use: Real-time streaming conversations, voice assistants

Alibaba Cloud CosyVoice

Best for: Chinese language, cost-effective

Configuration

{
  "apiKey": "your-alibaba-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/"
}

Models

cosyvoice-v1: Original version
cosyvoice-v2: Improved naturalness and emotion

Features

Optimized for Chinese languages
Natural prosody and emotion
Voice cloning support
Multi-speaker synthesis

Volcengine TTS

Best for: Chinese language, ByteDance integration

Configuration:

{
  "apiKey": "your-volcengine-key",
  "baseUrl": "https://unspeech.hyp3r.link/v1/",
  "app": {
    "appId": "your-app-id"
  }
}

Local & Open-Source Options

Browser Local TTS

Provider ID: browser-local-audio-speech

Run TTS models entirely in your browser using WebGPU.

Requirements:

Modern browser with WebGPU support
8GB+ RAM recommended
GPU acceleration

Configuration:

{
  "baseUrl": "auto-configured"
}

Features:

✅ No API costs
✅ Complete privacy (no data sent)
✅ Works offline
❌ Slower than cloud services
❌ Limited voice options

Supported Models: Varies based on browser capabilities

Desktop Local TTS

Provider ID: app-local-audio-speech

Native TTS using Hugging Face Candle (CUDA/Metal accelerated).

Requirements:

Airi Desktop (Tamagotchi) app
NVIDIA GPU (CUDA) or Apple Silicon (Metal)

Features:

Hardware accelerated
No internet required
Lower latency than browser
Larger model support

Index-TTS (Bilibili)

Provider ID: index-tts-vllm

Open-source Chinese/English TTS by Bilibili.

Setup:

# Install from https://index-tts.github.io
git clone https://github.com/bilibili/index-tts
cd index-tts
pip install -r requirements.txt
python -m index_tts.server --port 11996

Configuration:

{
  "baseUrl": "http://localhost:11996/tts/"
}

Features:

Optimized for Chinese
Multiple voices per language
Free and open-source
Self-hosted

Player2 Speech

Provider ID: player2-speech

Game-focused TTS from Player2.game integration.

Configuration:

{
  "baseUrl": "http://localhost:4315/v1/"
}

Languages: English, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese

Audio Settings

Quality vs Latency Tradeoffs

Provider	Model	Latency	Quality	Cost
Deepgram	aura-2	200-400ms	Good	Low
ElevenLabs	eleven_turbo_v2_5	300-500ms	Excellent	Medium
OpenAI	tts-1	800-1200ms	Good	Medium
OpenAI	tts-1-hd	800-1200ms	Excellent	High
ElevenLabs	eleven_multilingual_v2	600-900ms	Excellent	High
Microsoft	Neural	600-1000ms	Very Good	Medium
Local (Browser)	Various	2-5s	Varies	Free
Local (Desktop)	Various	1-3s	Varies	Free

Recommendations:

Real-time conversations: Deepgram Aura, ElevenLabs Turbo
Pre-recorded content: OpenAI TTS-HD, ElevenLabs Multilingual
Cost-conscious: Local models, Microsoft Azure
Privacy-focused: Local browser/desktop TTS
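
The recommendations above amount to filtering by a latency budget and then preferring quality. The sketch below encodes the cloud rows of the table (the local rows are left out because their quality is listed as "Varies"); `pickProvider` and the ranking are illustrative assumptions, not an Airi API.

```typescript
// Rows copied from the tradeoff table above; latency is the upper bound in ms.
type Quality = "Good" | "Very Good" | "Excellent";

interface ProviderOption {
  provider: string;
  model: string;
  maxLatencyMs: number;
  quality: Quality;
}

const PROVIDER_OPTIONS: ProviderOption[] = [
  { provider: "Deepgram", model: "aura-2", maxLatencyMs: 400, quality: "Good" },
  { provider: "ElevenLabs", model: "eleven_turbo_v2_5", maxLatencyMs: 500, quality: "Excellent" },
  { provider: "OpenAI", model: "tts-1", maxLatencyMs: 1200, quality: "Good" },
  { provider: "OpenAI", model: "tts-1-hd", maxLatencyMs: 1200, quality: "Excellent" },
  { provider: "ElevenLabs", model: "eleven_multilingual_v2", maxLatencyMs: 900, quality: "Excellent" },
  { provider: "Microsoft", model: "Neural", maxLatencyMs: 1000, quality: "Very Good" },
];

const QUALITY_RANK: Record<Quality, number> = { "Good": 1, "Very Good": 2, "Excellent": 3 };

// Keep options within the latency budget, then prefer higher quality and,
// on ties, lower latency. Returns undefined when nothing fits the budget.
function pickProvider(latencyBudgetMs: number): ProviderOption | undefined {
  return PROVIDER_OPTIONS
    .filter((o) => o.maxLatencyMs <= latencyBudgetMs)
    .sort((a, b) =>
      QUALITY_RANK[b.quality] - QUALITY_RANK[a.quality] || a.maxLatencyMs - b.maxLatencyMs,
    )[0];
}
```

A 450 ms budget leaves only Deepgram aura-2, while a 500 ms budget selects eleven_turbo_v2_5 because it ranks higher on quality.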

Audio Format Settings

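Optional audio parameters such as pitch, rate, SSML, and a language hint are set in the character speech module configuration. Not every provider supports every parameter: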

// Character speech module configuration
{
  provider: 'elevenlabs',
  model: 'eleven_turbo_v2_5',
  voice_id: 'rachel',
  
  // Optional audio settings
  pitch: 1.0,      // 0.5 - 2.0 (not all providers)
  rate: 1.0,       // 0.25 - 4.0 (not all providers)
  ssml: false,     // Enable SSML markup
  language: 'en'   // Language hint
}
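
Since pitch and rate have documented ranges, it can help to check them before saving a configuration. This is a minimal sketch; `validateAudioParams` is a hypothetical helper and the ranges are the ones given in the comments above.

```typescript
// Field names follow the speech module configuration above.
interface AudioParams {
  pitch?: number; // documented range 0.5 - 2.0
  rate?: number;  // documented range 0.25 - 4.0
}

// Return a list of human-readable problems; an empty list means the values are in range.
function validateAudioParams(params: AudioParams): string[] {
  const problems: string[] = [];
  const { pitch, rate } = params;
  if (pitch !== undefined && (pitch < 0.5 || pitch > 2.0)) {
    problems.push(`pitch ${pitch} is outside 0.5 - 2.0`);
  }
  if (rate !== undefined && (rate < 0.25 || rate > 4.0)) {
    problems.push(`rate ${rate} is outside 0.25 - 4.0`);
  }
  return problems;
}
```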

Voice Configuration in Character Cards

Voices are configured per character through the AIRI Card extension:

{
  "name": "Your Character",
  "extensions": {
    "airi": {
      "modules": {
        "speech": {
          "provider": "elevenlabs",
          "model": "eleven_multilingual_v2",
          "voice_id": "rachel",
          "pitch": 1.0,
          "rate": 1.0,
          "language": "en"
        }
      }
    }
  }
}

This allows different characters to use different voices.
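
Reading the speech module out of a card might look like the sketch below. The types are inferred from the example card above and model only the fields shown there; `speechConfigFor` and the fallback values are assumptions for illustration, not Airi's actual loader.

```typescript
// Shape inferred from the card example above.
interface SpeechModule {
  provider: string;
  model: string;
  voice_id: string;
  pitch?: number;
  rate?: number;
  language?: string;
}

interface CharacterCard {
  name: string;
  extensions?: { airi?: { modules?: { speech?: SpeechModule } } };
}

// Hypothetical fallback used when a card carries no speech module.
const FALLBACK_SPEECH: SpeechModule = {
  provider: "elevenlabs",
  model: "eleven_multilingual_v2",
  voice_id: "rachel",
};

// Resolve a card's speech settings, filling optional fields with neutral defaults.
function speechConfigFor(card: CharacterCard): Required<SpeechModule> {
  const speech = card.extensions?.airi?.modules?.speech ?? FALLBACK_SPEECH;
  return {
    ...speech,
    pitch: speech.pitch ?? 1.0,
    rate: speech.rate ?? 1.0,
    language: speech.language ?? "en",
  };
}
```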

Advanced Configuration

Custom Voice Cloning (ElevenLabs)

Prepare voice samples:
- 1-5 minutes of clear audio
- Single speaker
- Minimal background noise
- Variety of emotions/tones

Upload to ElevenLabs:
- Go to Voice Lab
- Create new voice
- Upload samples
- Generate voice

Get voice ID:

curl https://api.elevenlabs.io/v1/voices \
  -H "xi-api-key: YOUR_API_KEY"

Configure in Airi:

{
  "voice_id": "your-cloned-voice-id"
}
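
Once the voice list has been fetched, the ID can be looked up by name. The sketch below assumes the response is an object with a `voices` array whose entries carry `voice_id` and `name`; check that shape against the actual output of the request above before relying on it. `findVoiceId` is a hypothetical helper.

```typescript
// Assumed response shape for GET /v1/voices: { voices: [{ voice_id, name, ... }] }.
interface VoiceEntry {
  voice_id: string;
  name: string;
}

interface VoicesResponse {
  voices: VoiceEntry[];
}

// Look up a voice ID by display name (case-insensitive).
function findVoiceId(response: VoicesResponse, voiceName: string): string | undefined {
  const wanted = voiceName.trim().toLowerCase();
  return response.voices.find((v) => v.name.toLowerCase() === wanted)?.voice_id;
}
```

The returned string is what goes into the `voice_id` field of the character configuration.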

SSML Support (Microsoft Azure)

SSML (Speech Synthesis Markup Language) provides fine-grained control:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <prosody rate="fast" pitch="+2st">
      I'm excited to meet you!
    </prosody>
    <break time="500ms"/>
    <prosody rate="slow" pitch="-2st">
      But I can also be calm.
    </prosody>
  </voice>
</speak>

Enable SSML in character config:

{
  "ssml": true
}
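
When SSML is generated from dynamic text, the text has to be XML-escaped or the markup breaks. The following sketch builds a document shaped like the example above; `buildSsml` and `SsmlSegment` are hypothetical names, and the output is a single line without the indentation shown earlier.

```typescript
// Escape characters that are special in XML so user text cannot break the markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

interface SsmlSegment {
  text: string;
  rate?: string;         // e.g. "fast", "slow"
  pitch?: string;        // e.g. "+2st"
  pauseAfterMs?: number; // optional pause after the segment
}

function buildSsml(voiceName: string, lang: string, segments: SsmlSegment[]): string {
  const body = segments
    .map((s) => {
      const attrs =
        (s.rate ? ` rate="${escapeXml(s.rate)}"` : "") +
        (s.pitch ? ` pitch="${escapeXml(s.pitch)}"` : "");
      const spoken = escapeXml(s.text);
      // Only wrap in <prosody> when at least one attribute is set.
      const wrapped = attrs ? `<prosody${attrs}>${spoken}</prosody>` : spoken;
      const pause = s.pauseAfterMs ? `<break time="${s.pauseAfterMs}ms"/>` : "";
      return wrapped + pause;
    })
    .join("");
  return (
    `<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="${escapeXml(lang)}">` +
    `<voice name="${escapeXml(voiceName)}">${body}</voice></speak>`
  );
}
```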

Streaming vs Batch Synthesis

Airi supports both streaming and batch synthesis:

Streaming (real-time):

Lower perceived latency
Better for conversations
Audio plays as generated
Used by default

Batch (pre-generate):

All audio generated before playback
Better for complex SSML
More consistent quality
Configure in provider settings
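
The difference in perceived latency is simple arithmetic: streaming can start playback after the first chunk, while batch waits for every chunk. The sketch below uses made-up chunk times purely for illustration; it is not a measurement of any provider.

```typescript
type SynthesisMode = "streaming" | "batch";

// Per-chunk synthesis times in milliseconds go in; time until audio can start comes out.
function timeToFirstAudioMs(chunkTimesMs: number[], mode: SynthesisMode): number {
  const [first = 0] = chunkTimesMs;
  const total = chunkTimesMs.reduce((sum, ms) => sum + ms, 0);
  // Streaming plays as soon as the first chunk exists; batch needs all of them.
  return mode === "streaming" ? first : total;
}
```

With chunk times of 300, 250, and 250 ms, streaming starts after 300 ms and batch after 800 ms.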

Performance Optimization

Reducing Latency

Choose low-latency providers:
- Deepgram Aura: ~200ms
- ElevenLabs Turbo: ~300ms

Use streaming mode (enabled by default).

Pre-warm connections (the provider connection is kept alive).

Optimize network:
- Use nearby regions
- Reduce network hops
- Consider CDN for audio

Caching Generated Audio

Airi can cache generated TTS audio:

// In speech runtime configuration
{
  enableCache: true,
  cacheMaxSize: 100, // MB
  cacheTTL: 86400     // seconds (1 day)
}

Cache keys are based on:

Provider + model + voice
Text content
Voice settings (pitch, rate, etc.)
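
A cache key built from those three parts, plus a time-to-live check, could be sketched as follows. This is an illustration under stated assumptions, not Airi's cache: `cacheKey` and `TtlCache` are hypothetical, the cache lives in memory, and the size limit (`cacheMaxSize`) is not modeled.

```typescript
interface SpeechRequest {
  provider: string;
  model: string;
  voice: string;
  text: string;
  settings: Record<string, number | boolean | string>;
}

// Build a deterministic key: settings are sorted so property order does not matter.
function cacheKey(req: SpeechRequest): string {
  const sortedSettings = Object.keys(req.settings)
    .sort()
    .map((k) => [k, req.settings[k]]);
  return JSON.stringify([req.provider, req.model, req.voice, req.text, sortedSettings]);
}

// Minimal in-memory cache with a time-to-live and an injectable clock for testing.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlSeconds: number;
  private now: () => number;

  constructor(ttlSeconds: number, now: () => number = Date.now) {
    this.ttlSeconds = ttlSeconds;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlSeconds * 1000 });
  }

  // Returns the cached value, or undefined when missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

Sorting the settings means `{ pitch: 1, rate: 1 }` and `{ rate: 1, pitch: 1 }` map to the same entry, so identical requests are not synthesized twice.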

Cost Management

Provider Cost Comparison

Per 1M characters (approximate):

Local (Browser/Desktop): $0 (free)
Deepgram Aura: $15
OpenAI TTS-1: $15
Microsoft Neural: $16
ElevenLabs Turbo: ~$30 (based on tier)
OpenAI TTS-HD: $30
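
These per-million rates make it easy to estimate what a given volume of text will cost. The sketch below copies the approximate figures from the list above; the provider keys are labels invented for this example.

```typescript
// USD per 1M characters, copied from the comparison above (approximate).
const USD_PER_MILLION_CHARS: Record<string, number> = {
  "local": 0,
  "deepgram-aura": 15,
  "openai-tts-1": 15,
  "microsoft-neural": 16,
  "elevenlabs-turbo": 30,
  "openai-tts-1-hd": 30,
};

// Estimate the cost of synthesizing a number of characters with one provider.
function estimateCostUsd(providerKey: string, characters: number): number {
  const rate = USD_PER_MILLION_CHARS[providerKey];
  if (rate === undefined) throw new Error(`Unknown provider key: ${providerKey}`);
  return (characters * rate) / 1000000;
}
```

For instance, 200,000 characters on tts-1 comes to about $3, and the same text on tts-1-hd to about $6.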

Reducing Costs

Use local TTS when possible
Cache generated audio
Choose cost-effective providers:
- Deepgram for real-time
- Microsoft for variety
- Local for development
Monitor usage:
- Set provider quotas
- Track character count
- Alert on thresholds
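
Usage monitoring can be as simple as counting characters against a quota. The sketch below is a hypothetical tracker, not an Airi feature; it counts `text.length`, which may differ slightly from how a provider bills characters, and the 80% warning threshold is an arbitrary default.

```typescript
type UsageStatus = "ok" | "warning" | "exceeded";

// Track characters sent to a provider against a monthly quota.
class UsageTracker {
  private used = 0;
  private readonly quotaChars: number;
  private readonly warnRatio: number;

  constructor(quotaChars: number, warnRatio = 0.8) {
    this.quotaChars = quotaChars;
    this.warnRatio = warnRatio;
  }

  // Record the text about to be synthesized and report the resulting status.
  record(text: string): UsageStatus {
    this.used += text.length;
    if (this.used > this.quotaChars) return "exceeded";
    if (this.used >= this.quotaChars * this.warnRatio) return "warning";
    return "ok";
  }

  get usedChars(): number {
    return this.used;
  }
}
```

A caller could switch to local TTS when `record` returns "exceeded" and surface a notice on "warning".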

Troubleshooting

Voice Not Available

Problem: Selected voice doesn’t work with model

Solution: Check voice compatibility:

// OpenAI example
const voice = 'ballad' // Only works with gpt-4o-mini-tts
const model = 'gpt-4o-mini-tts' // ✅ Compatible
// const model = 'tts-1' // ❌ Incompatible
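
The same check can be done programmatically before a request is sent. This sketch uses only the voices listed in the OpenAI section of this page and treats anything else as incompatible; `isVoiceCompatible` is a hypothetical helper.

```typescript
// Voice lists copied from the OpenAI section of this page; extend as needed.
const UNIVERSAL_VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
const GPT_4O_MINI_ONLY_VOICES = ["ballad", "verse", "marin", "cedar"];

// True when the voice can be used with the given OpenAI TTS model.
function isVoiceCompatible(voiceName: string, modelName: string): boolean {
  if (UNIVERSAL_VOICES.indexOf(voiceName) !== -1) return true;
  if (GPT_4O_MINI_ONLY_VOICES.indexOf(voiceName) !== -1) {
    // Prefix match so dated snapshots of gpt-4o-mini-tts also pass.
    return modelName.indexOf("gpt-4o-mini-tts") === 0;
  }
  return false; // unknown voice: treat as incompatible
}
```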

Audio Cutting Out

Causes:

Network instability
Provider rate limits
Audio buffer underrun

Solutions:

Increase buffer size in audio settings
Check network connection
Switch to lower-latency provider
Use local TTS

Poor Voice Quality

Causes:

Wrong model selection
Suboptimal voice settings
Network packet loss

Solutions:

Use HD models (OpenAI TTS-HD, ElevenLabs Multilingual)
Adjust voice settings:
- Increase stability for consistency
- Adjust similarity boost
Check network quality
Try different voices

High Latency

Solutions:

Switch to faster provider (Deepgram, ElevenLabs Turbo)
Use local TTS
Enable audio caching
Check network latency to provider
Use regional endpoints

Rate Limit Exceeded

Error: 429 Too Many Requests

Solutions:

Implement request throttling
Upgrade provider tier
Distribute load across providers
Use local TTS fallback
Cache more aggressively
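
Request throttling is often done with a token bucket, which allows short bursts while capping the sustained rate. The sketch below is one way to do it and is not taken from Airi; the clock is passed in explicitly so the behavior is easy to test, and the capacity and refill rate should be set from the provider's published limits.

```typescript
// Token bucket: allows short bursts while capping the sustained request rate.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true when a request may be sent at time nowMs, consuming one token.
  tryAcquire(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

In real use, pass `Date.now()` as the clock value; when `tryAcquire` returns false, the caller can queue the request, fall back to local TTS, or serve a cached clip.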

Code Reference

Voice synthesis implementation:

Speech store: packages/stage-ui/src/stores/modules/speech.ts
Speech runtime: packages/stage-ui/src/stores/speech-runtime.ts
Provider configs: packages/stage-ui/src/stores/providers.ts
Audio pipeline: Uses @proj-airi/pipelines-audio

Related Resources

Providers: Configure TTS providers
Character Settings: Link voices to characters
