Overview
Airi’s voice chat system provides low-latency, real-time voice interaction through a sophisticated audio pipeline. The system handles audio input/output, voice activity detection (VAD), speech recognition, and text-to-speech synthesis in a unified pipeline.
Architecture
The voice chat system consists of several integrated components:
- Audio Context Management: High-quality audio processing with configurable sample rates
- Voice Activity Detection: Client-side speech detection using Silero VAD
- Audio Pipeline: Streaming audio processing with resampling and encoding
- Speech Pipeline: Orchestrates TTS generation, playback scheduling, and intent management
Audio Context
The audio context provides the foundation for all audio operations in Airi.
Initialization
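The exact initialization code lives in the Airi codebase; the sketch below shows the general pattern, assuming a single shared, lazily created `AudioContext` (the helper names are illustrative, not Airi's actual API):

```typescript
// Lazily create a single shared AudioContext.
// Helper names are illustrative, not Airi's actual API.
let sharedContext: AudioContext | null = null

function getAudioContext(sampleRate = 48000): AudioContext {
  if (sharedContext === null) {
    // 'interactive' asks the browser for the lowest practical output latency.
    sharedContext = new AudioContext({ sampleRate, latencyHint: 'interactive' })
  }
  return sharedContext
}

// Autoplay policies start contexts in the 'suspended' state, so resume
// after a user gesture (click, tap, key press).
async function ensureRunning(ctx: AudioContext): Promise<void> {
  if (ctx.state === 'suspended')
    await ctx.resume()
}
```

Creating the context lazily and resuming it from a gesture handler covers the iOS/Safari restriction mentioned under Best Practices below.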
Creating Audio Nodes
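As a sketch of the pattern (node names and wiring are illustrative, not Airi's exact graph), a microphone stream is typically routed through a gain node and an analyser before further processing:

```typescript
// Wire a microphone stream through a gain node and an analyser.
// Names and wiring are illustrative of the pattern, not Airi's exact graph.
function createInputChain(ctx: AudioContext, stream: MediaStream) {
  const source = ctx.createMediaStreamSource(stream)
  const gain = ctx.createGain()
  gain.gain.value = 1.0 // unity gain; adjust for input level
  const analyser = ctx.createAnalyser()
  analyser.fftSize = 2048 // frequency resolution vs. latency trade-off
  source.connect(gain)
  gain.connect(analyser)
  return { source, gain, analyser }
}
```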
Audio Context State Management
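A context moves between `'suspended'`, `'running'`, and `'closed'`; subscribing to the standard `statechange` event is the usual way to observe this. A minimal sketch:

```typescript
// Subscribe to state changes and report them through a callback.
// Returns an unsubscribe function for cleanup.
function watchContextState(ctx: AudioContext, onChange: (state: string) => void): () => void {
  const handler = () => onChange(ctx.state)
  ctx.addEventListener('statechange', handler)
  onChange(ctx.state) // report the initial state immediately
  return () => ctx.removeEventListener('statechange', handler)
}

// Pure helper: can the context produce sound right now?
function isAudible(state: string): boolean {
  return state === 'running'
}
```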
Voice Activity Detection (VAD)
VAD automatically detects when the user is speaking, enabling push-to-talk-free interaction.
VAD Configuration
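Silero VAD emits a per-frame speech probability; the typical knobs are speech/silence thresholds and minimum durations. The field names and defaults below are illustrative assumptions, not Airi's actual option names:

```typescript
// Illustrative VAD configuration. Field names and defaults are assumptions,
// not Airi's actual option names.
const vadConfig = {
  positiveSpeechThreshold: 0.5, // probability above which a frame counts as speech
  negativeSpeechThreshold: 0.35, // probability below which a frame counts as silence
  minSpeechFrames: 3, // frames of speech required before speech-start fires
  redemptionFrames: 8, // trailing silence frames before speech-end fires
  frameSamples: 1536, // samples per frame fed to the Silero model
}
```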
Using VAD
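Usage generally amounts to registering speech start/end callbacks and starting the detector. In the sketch below the VAD factory is injected rather than imported, since the project's actual constructor name is not shown here:

```typescript
// Shape of a VAD instance; a stand-in for the project's actual type.
interface VAD {
  start: () => Promise<void>
  pause: () => void
}

type VADFactory = (options: {
  onSpeechStart: () => void
  onSpeechEnd: (audio: Float32Array) => void
}) => VAD

// `createVAD` is injected: a stand-in for the project's actual constructor.
async function startListening(
  createVAD: VADFactory,
  onUtterance: (audio: Float32Array) => void,
): Promise<VAD> {
  const vad = createVAD({
    onSpeechStart: () => console.log('speech started'),
    onSpeechEnd: audio => onUtterance(audio), // hand the utterance to speech recognition
  })
  await vad.start()
  return vad
}
```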
Vue Composable for VAD
Speech Pipeline
The speech pipeline manages TTS generation, playback scheduling, and intent prioritization.
Creating a Speech Pipeline
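At its core the pipeline is a queue: text goes in, synthesized audio comes out in order. The sketch below injects the synthesis and playback steps so the scheduling logic is visible on its own (class and method names are illustrative, not Airi's actual API):

```typescript
// Illustrative speech pipeline: synthesizes and plays queued text in order.
// `synthesize` and `play` are injected so the scheduling logic stands alone.
class SpeechPipeline {
  private queue: string[] = []
  private running = false

  constructor(
    private readonly synthesize: (text: string) => Promise<Float32Array>,
    private readonly play: (audio: Float32Array) => Promise<void>,
  ) {}

  enqueue(text: string): void {
    this.queue.push(text)
    void this.drain()
  }

  get pending(): number {
    return this.queue.length
  }

  private async drain(): Promise<void> {
    if (this.running) return // one drain loop at a time
    this.running = true
    while (this.queue.length > 0) {
      const text = this.queue.shift()!
      const audio = await this.synthesize(text)
      await this.play(audio)
    }
    this.running = false
  }
}
```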
Using Speech Intents
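An intent pairs the text to speak with metadata the scheduler can use. The shape below is an illustrative sketch, not Airi's actual type:

```typescript
// Illustrative intent shape: what to say plus scheduling metadata.
interface SpeechIntent {
  text: string
  priority: number // higher wins
  interrupt?: boolean // whether this intent may cut off current playback
}

function makeIntent(text: string, priority = 0, interrupt = false): SpeechIntent {
  return { text, priority, interrupt }
}

// Example: an urgent system notice that should pre-empt casual chatter.
const notice = makeIntent('Connection lost, retrying.', 10, true)
```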
Pipeline Events
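The pipeline surfaces lifecycle events such as the start and end of an utterance. A minimal emitter in the style such a pipeline might use (the event names are assumptions):

```typescript
// Minimal event emitter; event names used with it are illustrative.
type Listener = (payload?: unknown) => void

class Emitter {
  private listeners = new Map<string, Listener[]>()

  on(event: string, fn: Listener): void {
    const list = this.listeners.get(event) ?? []
    list.push(fn)
    this.listeners.set(event, list)
  }

  emit(event: string, payload?: unknown): void {
    for (const fn of this.listeners.get(event) ?? [])
      fn(payload)
  }
}

// Typical subscriptions (event names are assumptions, not Airi's):
const pipelineEvents = new Emitter()
pipelineEvents.on('speechStart', text => console.log('speaking:', text))
pipelineEvents.on('speechEnd', () => console.log('done'))
```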
Configuration
Audio Quality Settings
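The quality trade-offs boil down to a handful of knobs. An illustrative settings object (names and values are assumptions, not Airi's actual configuration):

```typescript
// Illustrative audio quality settings; names are assumptions, not Airi's.
const audioSettings = {
  sampleRate: 48000, // 48 kHz for full quality; lower rates reduce CPU load
  bufferSize: 1024, // samples per processing block: smaller = lower latency
  channels: 1, // mono is sufficient for voice
  echoCancellation: true,
  noiseSuppression: true,
}
```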
VAD Sensitivity
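Sensitivity tuning usually means a probability threshold plus a "hangover" so brief pauses don't split an utterance in two. A self-contained sketch of that decision logic:

```typescript
// Threshold-with-hangover decision: stay "speaking" through short pauses.
function createSpeechGate(threshold = 0.5, hangoverFrames = 8) {
  let speaking = false
  let silentCount = 0
  // Feed one per-frame speech probability; returns the current speaking state.
  return (probability: number): boolean => {
    if (probability >= threshold) {
      speaking = true
      silentCount = 0
    }
    else if (speaking && ++silentCount > hangoverFrames) {
      speaking = false
    }
    return speaking
  }
}
```

Raising the threshold makes detection stricter for noisy microphones; raising the hangover tolerates longer mid-sentence pauses.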
Speech Pipeline Priority
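Priority handling can be as simple as keeping the queue sorted so higher-priority intents play first. A sketch (the intent shape is illustrative):

```typescript
interface QueuedIntent {
  text: string
  priority: number // higher plays sooner
}

// Insert while keeping the queue sorted by descending priority.
// Equal priorities keep arrival order (FIFO within a level).
function enqueueByPriority(queue: QueuedIntent[], intent: QueuedIntent): void {
  const index = queue.findIndex(item => item.priority < intent.priority)
  if (index === -1)
    queue.push(intent)
  else
    queue.splice(index, 0, intent)
}
```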
Performance Considerations
- Sample Rate: Higher sample rates (48kHz) provide better quality but use more processing power
- Buffer Size: Smaller buffers reduce latency but may cause audio glitches on slower devices
- VAD Thresholds: Adjust based on microphone quality and ambient noise levels
- Worklet Processing: Audio worklets run on a separate thread for optimal performance
Best Practices
- Initialize Early: Set up the audio context before user interaction to avoid delays
- Cleanup Resources: Always disconnect and remove audio nodes when done
- Handle Errors: The audio context can fail to start on iOS unless it is created or resumed in response to a user gesture
- Monitor State: Subscribe to context state changes for debugging
- Test Across Devices: Audio behavior varies significantly across browsers and devices
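The cleanup point above is worth making concrete; a typical teardown looks like this (names are illustrative):

```typescript
// Disconnect nodes and close the context when the chat session ends.
function teardownAudio(ctx: AudioContext, nodes: AudioNode[]): Promise<void> {
  for (const node of nodes)
    node.disconnect() // detach from the graph so it can be garbage-collected
  return ctx.close() // releases the underlying audio hardware
}
```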
