Overview
Airi provides advanced speech recognition capabilities with support for multiple transcription providers, client-side Voice Activity Detection (VAD), and both streaming and batch transcription modes. The system works entirely in the browser with optional server-side provider support.
Architecture
The speech recognition system consists of:
- Voice Activity Detection (VAD): Client-side speech detection using Silero VAD (Transformers.js)
- Audio Pipeline: Real-time audio capture, processing, and streaming
- Transcription Providers: Multiple provider support (Web Speech API, Aliyun, OpenAI-compatible)
- Streaming Support: Real-time transcription as the user speaks
- Session Management: Handle continuous recognition with idle timeouts
Voice Activity Detection
VAD automatically detects when speech starts and ends, enabling hands-free interaction.
VAD Implementation
Airi uses the Silero VAD model via Hugging Face Transformers.js.
VAD Events
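The event names Airi exposes are not listed here. As a minimal sketch, assuming the model yields one speech probability per audio frame, start and end events can be derived with two thresholds (hysteresis) and a silence counter. The class name, callback names, and threshold values are illustrative assumptions, not Airi's API.

```typescript
// Hypothetical sketch: turns per-frame speech probabilities into
// speech-start / speech-end callbacks. Names and defaults are assumptions.
interface VadEventHandlers {
  onSpeechStart?: () => void
  onSpeechEnd?: () => void
}

interface VadThresholds {
  speechThreshold: number  // probability at or above this starts speech
  exitThreshold: number    // probability below this counts as silence
  minSilenceFrames: number // consecutive silent frames needed to end speech
}

const DEFAULT_THRESHOLDS: VadThresholds = {
  speechThreshold: 0.5,
  exitThreshold: 0.35,
  minSilenceFrames: 3,
}

class VadEventDetector {
  private speaking = false
  private silentFrames = 0
  private readonly handlers: VadEventHandlers
  private readonly thresholds: VadThresholds

  constructor(handlers: VadEventHandlers, thresholds: VadThresholds = DEFAULT_THRESHOLDS) {
    this.handlers = handlers
    this.thresholds = thresholds
  }

  // Feed the probability for one frame; fires callbacks on state changes.
  push(probability: number): void {
    const { speechThreshold, exitThreshold, minSilenceFrames } = this.thresholds
    if (!this.speaking) {
      if (probability >= speechThreshold) {
        this.speaking = true
        this.silentFrames = 0
        this.handlers.onSpeechStart?.()
      }
      return
    }
    if (probability < exitThreshold) {
      this.silentFrames += 1
      if (this.silentFrames >= minSilenceFrames) {
        this.speaking = false
        this.silentFrames = 0
        this.handlers.onSpeechEnd?.()
      }
    } else {
      // Speech resumed before the silence window elapsed.
      this.silentFrames = 0
    }
  }

  get isSpeaking(): boolean {
    return this.speaking
  }
}
```

Using a lower exit threshold than the entry threshold keeps brief dips in probability from splitting one utterance into several.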
Processing Audio with VAD
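VAD models consume fixed-size frames, while audio arrives from the capture pipeline in chunks of arbitrary length. The sketch below buffers incoming samples and hands complete frames to a classifier. The 512-sample frame size and the `classify` callback are assumptions standing in for the actual model call.

```typescript
// Hypothetical sketch: buffers incoming samples and emits fixed-size frames.
// A frame of 512 samples (32 ms at 16 kHz) is an assumption; use whatever
// window size the VAD model in use expects.
const FRAME_SIZE = 512

class FrameBuffer {
  private pending = new Float32Array(0)
  private readonly frameSize: number

  constructor(frameSize: number = FRAME_SIZE) {
    this.frameSize = frameSize
  }

  // Append a chunk and return every complete frame now available.
  push(chunk: Float32Array): Float32Array[] {
    const merged = new Float32Array(this.pending.length + chunk.length)
    merged.set(this.pending, 0)
    merged.set(chunk, this.pending.length)

    const frames: Float32Array[] = []
    let offset = 0
    while (offset + this.frameSize <= merged.length) {
      frames.push(merged.slice(offset, offset + this.frameSize))
      offset += this.frameSize
    }
    this.pending = merged.slice(offset) // keep the remainder for the next call
    return frames
  }
}

// classify is a stand-in for the model call; it returns a speech probability.
async function processChunk(
  buffer: FrameBuffer,
  chunk: Float32Array,
  classify: (frame: Float32Array) => Promise<number>,
): Promise<number[]> {
  const probabilities: number[] = []
  for (const frame of buffer.push(chunk)) {
    probabilities.push(await classify(frame))
  }
  return probabilities
}
```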
VAD Configuration
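The configuration options are not listed here. A plausible shape, shown only as a sketch, groups the sample rate, frame size, detection threshold, and timing values, and converts millisecond durations into frame counts. Every field name and default value below is an assumption.

```typescript
// Hypothetical configuration shape; field names and defaults are assumptions.
interface VadConfig {
  sampleRate: number           // Hz, audio fed to the model
  frameSize: number            // samples per model call
  speechThreshold: number      // probability that counts as speech
  minSilenceDurationMs: number // silence needed before speech is considered ended
  speechPadMs: number          // audio kept before the detected start
}

const DEFAULT_VAD_CONFIG: VadConfig = {
  sampleRate: 16000,
  frameSize: 512,
  speechThreshold: 0.5,
  minSilenceDurationMs: 400,
  speechPadMs: 100,
}

// Merge caller overrides onto the defaults and reject invalid values.
function resolveVadConfig(overrides: Partial<VadConfig> = {}): VadConfig {
  const config: VadConfig = { ...DEFAULT_VAD_CONFIG, ...overrides }
  if (config.speechThreshold <= 0 || config.speechThreshold >= 1) {
    throw new RangeError("speechThreshold must be between 0 and 1 (exclusive)")
  }
  if (config.sampleRate <= 0 || config.frameSize <= 0) {
    throw new RangeError("sampleRate and frameSize must be positive")
  }
  return config
}

// Convert a duration in milliseconds to a whole number of frames (rounded up).
function msToFrames(ms: number, config: VadConfig): number {
  return Math.ceil((ms * config.sampleRate) / (1000 * config.frameSize))
}
```

With these defaults one frame covers 32 ms, so a 400 ms silence window corresponds to 13 frames.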
Transcription Providers
Airi supports multiple speech recognition providers:
Web Speech API (Browser-Native)
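Browser support for the Web Speech API varies, and several browsers expose the constructor only under the `webkit` prefix. The sketch below detects either name and applies basic settings. The helper names and the structural types are assumptions introduced for illustration; the property names `lang`, `continuous`, and `interimResults` come from the Web Speech API itself.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface RecognitionLike {
  lang: string
  continuous: boolean
  interimResults: boolean
  start(): void
  stop(): void
}
type RecognitionCtor = new () => RecognitionLike

// Returns the constructor if the environment exposes one, otherwise null.
// Some browsers only provide the webkit-prefixed name.
function getSpeechRecognitionCtor(scope: unknown = globalThis): RecognitionCtor | null {
  const g = scope as {
    SpeechRecognition?: RecognitionCtor
    webkitSpeechRecognition?: RecognitionCtor
  }
  return g.SpeechRecognition ?? g.webkitSpeechRecognition ?? null
}

// Creates a configured recognizer, or null when the API is unavailable.
function createRecognition(lang: string, scope: unknown = globalThis): RecognitionLike | null {
  const Ctor = getSpeechRecognitionCtor(scope)
  if (!Ctor) return null
  const recognition = new Ctor()
  recognition.lang = lang            // BCP 47 tag such as "en-US"
  recognition.continuous = true      // keep listening after each result
  recognition.interimResults = true  // deliver partial results while speaking
  return recognition
}
```

A `null` return is the signal to fall back to another provider.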
OpenAI Whisper
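How Airi wires up this provider is not shown here. As a sketch of what a request to an OpenAI-compatible transcription endpoint generally looks like, the function below posts a multipart form to `/audio/transcriptions`. The option names, the default model `whisper-1`, and the injected `fetchImpl` parameter are assumptions for illustration.

```typescript
// Sketch of a request to an OpenAI-compatible transcription endpoint.
// baseURL, apiKey, and the default model name are caller-supplied assumptions.
interface TranscribeOptions {
  baseURL: string   // for example "https://api.openai.com/v1"
  apiKey: string
  model?: string
  language?: string // ISO 639-1 code; omit to let the model detect it
}

// Join the base URL and endpoint path without doubling slashes.
function transcriptionURL(baseURL: string): string {
  return `${baseURL.replace(/\/+$/, "")}/audio/transcriptions`
}

async function transcribe(
  audio: Blob,
  options: TranscribeOptions,
  fetchImpl: typeof fetch = fetch,
): Promise<string> {
  const form = new FormData()
  form.append("file", audio, "speech.wav")
  form.append("model", options.model ?? "whisper-1")
  if (options.language) form.append("language", options.language)

  // The multipart boundary header is set automatically for FormData bodies.
  const response = await fetchImpl(transcriptionURL(options.baseURL), {
    method: "POST",
    headers: { Authorization: `Bearer ${options.apiKey}` },
    body: form,
  })
  if (!response.ok) {
    throw new Error(`Transcription failed with HTTP ${response.status}`)
  }
  const data = (await response.json()) as { text: string }
  return data.text
}
```

Calling the endpoint directly from the browser exposes the API key to the page, so a server-side proxy is usually the safer arrangement.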
Aliyun NLS (Streaming)
Streaming Transcription
Real-Time Speech Recognition
Web Speech API Streaming
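With `interimResults` enabled, each `result` event carries a list of results starting at `resultIndex`, and each result is either final or still changing. The sketch below separates the two and accumulates finalized text. The helper names and the `*Like` types are assumptions; the event fields (`resultIndex`, `results`, `isFinal`, `transcript`) follow the Web Speech API.

```typescript
// Structural stand-ins for SpeechRecognitionEvent so the sketch type-checks
// without DOM typings; the field names match the Web Speech API.
interface AlternativeLike { transcript: string }
interface ResultLike { isFinal: boolean, 0: AlternativeLike }
interface ResultEventLike {
  resultIndex: number
  results: { length: number, [index: number]: ResultLike }
}
interface StreamingRecognition {
  continuous: boolean
  interimResults: boolean
  onresult: ((event: ResultEventLike) => void) | null
}

// Split the new results of one event into finalized text and interim text.
function readResults(event: ResultEventLike): { finalText: string, interimText: string } {
  let finalText = ""
  let interimText = ""
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i]
    const transcript = result[0].transcript // top-ranked alternative
    if (result.isFinal) finalText += transcript
    else interimText += transcript
  }
  return { finalText, interimText }
}

// Attach a handler that reports committed text plus the current interim guess.
function attachStreaming(
  recognition: StreamingRecognition,
  onUpdate: (committed: string, interim: string) => void,
): void {
  let committed = ""
  recognition.continuous = true
  recognition.interimResults = true
  recognition.onresult = (event) => {
    const { finalText, interimText } = readResults(event)
    committed += finalText
    onUpdate(committed, interimText)
  }
}
```

Interim text should be displayed but not stored, since later events replace it.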
Batch Transcription
Transcribe Audio File
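Batch providers generally expect a complete audio file rather than raw samples. As a sketch, assuming the pipeline yields mono `Float32Array` samples in the range -1 to 1, the function below wraps them in a 16-bit PCM WAV container that can then be uploaded. The function name is hypothetical.

```typescript
// Encode mono Float32 samples (range -1..1) as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize) // 44-byte header + audio data
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i))
  }

  writeAscii(0, "RIFF")
  view.setUint32(4, 36 + dataSize, true)                // file size minus 8 bytes
  writeAscii(8, "WAVE")
  writeAscii(12, "fmt ")
  view.setUint32(16, 16, true)                          // fmt chunk size
  view.setUint16(20, 1, true)                           // audio format: PCM
  view.setUint16(22, 1, true)                           // channels: mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true)              // block align
  view.setUint16(34, 16, true)                          // bits per sample
  writeAscii(36, "data")
  view.setUint32(40, dataSize, true)

  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot overflow the 16-bit range.
    const clamped = Math.max(-1, Math.min(1, samples[i]))
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff
    view.setInt16(44 + i * bytesPerSample, Math.round(value), true)
  }
  return new Uint8Array(buffer)
}
```

The resulting bytes can be wrapped in a `Blob` with type `audio/wav` before being sent to a transcription endpoint.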
Using Generate API
Audio Pipeline
Audio Stream Creation
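In the browser, microphone audio is obtained through `navigator.mediaDevices.getUserMedia`. The sketch below builds a set of audio constraints and requests a stream. The helper names and the chosen constraint values are assumptions; the constraint property names are standard `MediaTrackConstraints` fields. The `devices` parameter stands in for `navigator.mediaDevices`.

```typescript
// Structural stand-in for navigator.mediaDevices so the sketch can be
// exercised outside a browser. The constraint values are assumptions.
interface AudioConstraints {
  channelCount: number
  sampleRate: number
  echoCancellation: boolean
  noiseSuppression: boolean
  autoGainControl: boolean
}
interface MediaDevicesLike<S> {
  getUserMedia(constraints: { audio: AudioConstraints }): Promise<S>
}

function buildAudioConstraints(sampleRate: number = 16000): AudioConstraints {
  return {
    channelCount: 1, // mono is enough for speech
    sampleRate,      // a preference; the browser may pick another rate
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  }
}

// Resolves with the stream, or rejects if capture is unavailable or denied.
async function openMicrophone<S>(
  devices: MediaDevicesLike<S> | undefined,
  sampleRate: number = 16000,
): Promise<S> {
  if (!devices) {
    throw new Error("Microphone capture is unavailable (it requires a secure context)")
  }
  return devices.getUserMedia({ audio: buildAudioConstraints(sampleRate) })
}
```

Because the requested sample rate is not guaranteed, the actual rate should be read from the audio context and the audio resampled if the model needs a specific rate.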
Session Management
Continuous Recognition
Idle Timeout
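An idle timeout ends a session when no speech or transcript activity has been seen for a set period. The helper below is a hypothetical sketch of that idea: each call to `touch()` restarts the countdown, and `onIdle` fires if the countdown completes. The class and the injectable `TimerApi` are assumptions, not Airi's API.

```typescript
// Hypothetical helper: calls onIdle when no activity is reported for
// timeoutMs. Timer functions are injectable so the logic can be tested.
interface TimerApi {
  set(callback: () => void, ms: number): unknown
  clear(handle: unknown): void
}

const defaultTimers: TimerApi = {
  set: (callback, ms) => setTimeout(callback, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
}

class IdleTimeout {
  private handle: unknown = null
  private readonly timeoutMs: number
  private readonly onIdle: () => void
  private readonly timers: TimerApi

  constructor(timeoutMs: number, onIdle: () => void, timers: TimerApi = defaultTimers) {
    this.timeoutMs = timeoutMs
    this.onIdle = onIdle
    this.timers = timers
  }

  // Call whenever speech or a transcript arrives; restarts the countdown.
  touch(): void {
    this.cancel()
    this.handle = this.timers.set(() => {
      this.handle = null
      this.onIdle()
    }, this.timeoutMs)
  }

  // Stop the countdown without firing onIdle.
  cancel(): void {
    if (this.handle !== null) {
      this.timers.clear(this.handle)
      this.handle = null
    }
  }
}
```

A session would typically call `touch()` on every speech-start event and every transcript, and stop recognition inside `onIdle`.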
Manual Stop
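Stopping a session manually means releasing every resource it holds: the recognizer, the microphone tracks, and the audio context. The sketch below does this through structural stand-ins for `MediaStream`, `AudioContext`, and a recognizer. The `SessionResources` shape and the `stopSession` name are assumptions.

```typescript
// Structural stand-ins for MediaStream, AudioContext, and a recognizer.
interface StoppableTrack { stop(): void }
interface SessionResources {
  stream?: { getTracks(): StoppableTrack[] }
  audioContext?: { state: string, close(): Promise<void> }
  recognition?: { stop(): void }
}

// Releases everything the session holds. Safe to call more than once.
async function stopSession(resources: SessionResources): Promise<void> {
  // Stop the recognizer first so no further results are produced.
  resources.recognition?.stop()

  // Stopping every track releases the microphone.
  resources.stream?.getTracks().forEach((track) => track.stop())

  // Closing an already closed context would reject, so check the state.
  if (resources.audioContext && resources.audioContext.state !== "closed") {
    await resources.audioContext.close()
  }

  // Drop references so a repeated call is a no-op.
  resources.recognition = undefined
  resources.stream = undefined
  resources.audioContext = undefined
}
```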
Language Support
Supported Languages
Auto Language Detection
Performance Optimization
Sample Rate
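Speech models commonly operate on 16 kHz mono audio, while browsers often capture at 44.1 or 48 kHz, so the audio usually has to be resampled before it reaches the VAD or the transcription provider. The function below is a simple sketch using linear interpolation. It is not presented as Airi's resampler, and it applies no low-pass filter, so a production pipeline may want a higher-quality method.

```typescript
// Resample mono audio by linear interpolation. This is a simple sketch:
// it applies no low-pass filter, so some aliasing is possible.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError("Sample rates must be positive")
  }
  if (fromRate === toRate || input.length === 0) return input.slice()

  const outputLength = Math.floor((input.length * toRate) / fromRate)
  const output = new Float32Array(outputLength)
  const step = fromRate / toRate // input samples advanced per output sample

  for (let i = 0; i < outputLength; i++) {
    const position = i * step
    const left = Math.floor(position)
    const right = Math.min(left + 1, input.length - 1)
    const fraction = position - left
    // Blend the two neighbouring input samples.
    output[i] = input[left] * (1 - fraction) + input[right] * fraction
  }
  return output
}
```

Going from 48 kHz to 16 kHz keeps one output sample for every three input samples, which also cuts the data sent to the model or the network by two thirds.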
VAD Tuning
Model Selection
Error Handling
Common Errors
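For the Web Speech API provider, failures arrive as an `error` event whose `error` field holds a short code. The sketch below maps those codes to user-facing messages and a retry hint. The codes are the ones defined by the Web Speech API; the messages, the `retryable` flag, and the function name are assumptions.

```typescript
// The error codes are those defined for SpeechRecognitionErrorEvent.error in
// the Web Speech API; the messages and the retryable flag are assumptions.
type RecognitionErrorCode =
  | "no-speech" | "aborted" | "audio-capture" | "network"
  | "not-allowed" | "service-not-allowed" | "bad-grammar" | "language-not-supported"

interface ErrorInfo { message: string, retryable: boolean }

const ERROR_INFO: Record<RecognitionErrorCode, ErrorInfo> = {
  "no-speech": { message: "No speech was detected.", retryable: true },
  "aborted": { message: "Recognition was aborted.", retryable: false },
  "audio-capture": { message: "No microphone could be accessed.", retryable: false },
  "network": { message: "A network error interrupted recognition.", retryable: true },
  "not-allowed": { message: "Microphone permission was denied.", retryable: false },
  "service-not-allowed": { message: "The speech service is not allowed here.", retryable: false },
  "bad-grammar": { message: "The supplied grammar was invalid.", retryable: false },
  "language-not-supported": { message: "The requested language is not supported.", retryable: false },
}

// Look up a code; unknown codes get a generic, non-retryable entry.
function describeRecognitionError(code: string): ErrorInfo {
  return Object.prototype.hasOwnProperty.call(ERROR_INFO, code)
    ? ERROR_INFO[code as RecognitionErrorCode]
    : { message: `Unknown recognition error: ${code}`, retryable: false }
}
```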
Retry Logic
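Transient failures such as network errors are usually worth retrying, with a growing delay between attempts. The helper below is a generic exponential backoff sketch. It is not Airi's retry implementation, and the option names and the injectable `sleep` function are assumptions.

```typescript
interface RetryOptions {
  retries: number     // number of retries after the first attempt
  baseDelayMs: number // delay before the first retry
  maxDelayMs: number  // upper bound for any single delay
  sleep?: (ms: number) => Promise<void>
}

const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => { setTimeout(resolve, ms) })

// Runs operation until it succeeds, retries are exhausted, or the error
// is judged not retryable. The delay doubles after each failed attempt.
async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
  isRetryable: (error: unknown) => boolean = () => true,
): Promise<T> {
  const sleep = options.sleep ?? realSleep
  let attempt = 0
  while (true) {
    try {
      return await operation(attempt)
    } catch (error) {
      if (attempt >= options.retries || !isRetryable(error)) throw error
      const delay = Math.min(options.baseDelayMs * 2 ** attempt, options.maxDelayMs)
      await sleep(delay)
      attempt += 1
    }
  }
}
```

Permission errors should not be retried automatically, which is what the `isRetryable` predicate is for.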
Best Practices
- Check Support: Verify Web Speech API availability before using it
- Request Permissions Early: Get microphone access before the user needs it
- Use VAD: Implement VAD to avoid sending silence to APIs
- Handle Errors: Always catch and handle transcription errors
- Set Timeouts: Implement idle timeouts to save resources
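- Choose Appropriate Provider: Use the Web Speech API for a free, browser-native option (some browsers send audio to a server, so it may not work offline), and Whisper when accuracy matters most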
- Monitor Performance: Track latency and adjust buffer sizes
- Clean Up Sessions: Always stop sessions and clean up audio resources
