Overview
While Unmute uses WebSockets for message transport, it leverages the Web Audio API and Opus encoding for efficient real-time audio processing. The system handles:
- Microphone input capture and encoding
- Real-time audio streaming with low latency
- Audio decoding and playback
- Voice activity detection (VAD)
- Audio buffering and synchronization
Unmute does not use WebRTC peer-to-peer connections. Instead, it uses WebSocket for transport with Opus audio encoding, which provides similar efficiency for the client-server architecture.
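Because audio rides over the same WebSocket as JSON messages, each Opus frame must be base64-encoded before it is sent. A minimal sketch of that framing follows; the event name and field names here are illustrative assumptions, not the actual protocol (see the WebSocket Protocol page for the real message format).

```typescript
// Sketch: wrapping a binary Opus frame for WebSocket transport.
// The "input_audio_buffer.append" type and "audio" field are assumed
// names for illustration only.

function encodeAudioMessage(opusFrame: Uint8Array): string {
  // Binary data cannot go into a JSON text frame directly,
  // so the Opus bytes are base64-encoded first.
  const audio = Buffer.from(opusFrame).toString("base64");
  return JSON.stringify({ type: "input_audio_buffer.append", audio });
}

function decodeAudioMessage(message: string): Uint8Array {
  const parsed = JSON.parse(message);
  return new Uint8Array(Buffer.from(parsed.audio, "base64"));
}

// Round trip: a (fake) 3-byte frame survives encode/decode intact.
const frame = new Uint8Array([0x4f, 0x70, 0x75]);
const roundTripped = decodeAudioMessage(encodeAudioMessage(frame));
```

Base64 inflates the payload by about a third, but keeps the protocol single-channel and easy to debug compared with mixing binary and text WebSocket frames.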
Audio Pipeline Architecture
Frontend Audio Processing
Audio Processor Setup
The useAudioProcessor hook (frontend/src/app/useAudioProcessor.ts) manages the complete audio pipeline:
Microphone Input Processing
Configuration (useAudioProcessor.ts:83-104):
- echoCancellation: enabled to prevent feedback from speakers
- noiseSuppression: disabled; handled by backend processing
- autoGainControl: enabled for consistent audio levels
- Sample rate: 24kHz, balancing quality and bandwidth
- Frame size: 20ms frames for low latency
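The settings above correspond to a getUserMedia constraints object roughly like the following. This is an illustration of the configuration, not a copy of the real code; the exact values live in useAudioProcessor.ts:83-104.

```typescript
// Sketch of microphone constraints implied by the settings above
// (assumed values, mirroring this page's description).
const micConstraints = {
  audio: {
    echoCancellation: true,   // prevent feedback from the speakers
    noiseSuppression: false,  // noise handling is done by the backend
    autoGainControl: true,    // keep input levels consistent
    sampleRate: 24000,        // 24 kHz: voice quality at modest bandwidth
    channelCount: 1,          // mono
  },
};

// In the browser this object would be passed to:
//   const stream = await navigator.mediaDevices.getUserMedia(micConstraints);
```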
Opus Encoding
The frontend uses the opus-recorder library to encode microphone input:
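An opus-recorder configuration matching the settings this page describes (24 kHz mono, 20 ms frames, streaming pages, low complexity) might look like the sketch below. Option names follow the opus-recorder README; the actual values are in useAudioProcessor.ts and may differ.

```typescript
// Sketch of opus-recorder options consistent with this page's settings.
// Treat the exact combination as an assumption, not the real config.
const recorderOptions = {
  encoderSampleRate: 24000, // match the pipeline's 24 kHz rate
  numberOfChannels: 1,      // mono
  encoderFrameSize: 20,     // 20 ms frames for low latency
  streamPages: true,        // emit Ogg pages immediately, don't batch
  encoderComplexity: 0,     // cheapest encode: lower CPU use and latency
};

// In the browser:
//   const recorder = new Recorder(recorderOptions);
//   recorder.ondataavailable = (page: ArrayBuffer) => { /* send over WebSocket */ };
```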
Audio Output Processing
Decoder Setup (useAudioProcessor.ts:56-77):
The audio-output-processor worklet handles the actual audio playback, buffering incoming frames and outputting them at the correct rate.
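The worklet's job reduces to a queue that absorbs decoded frames of varying size and emits fixed-size render quanta (128 samples per channel in the Web Audio API). A simplified sketch of that logic, not the actual worklet code:

```typescript
// Minimal frame-buffer sketch: variable-sized decoded frames in,
// fixed-size render quanta out. On underrun it pads with silence so
// playback never stalls (it glitches quietly instead).
class FrameBuffer {
  private queue: number[] = [];

  push(frame: Float32Array): void {
    for (let i = 0; i < frame.length; i++) this.queue.push(frame[i]);
  }

  // Pull exactly `n` samples, zero-padding if too few are buffered.
  pull(n: number): Float32Array {
    const out = new Float32Array(n);
    for (let i = 0; i < n; i++) {
      out[i] = this.queue.length > 0 ? this.queue.shift()! : 0;
    }
    return out;
  }

  get buffered(): number {
    return this.queue.length;
  }
}

const buf = new FrameBuffer();
buf.push(new Float32Array([0.1, 0.2, 0.3]));
const quantum = buf.pull(4); // three real samples, one of padded silence
```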
Audio Analysis for Visualization
Both input and output audio streams are analyzed for visualization.
Backend Audio Processing
Opus Decoding
The backend uses the sphn library for Opus stream processing (unmute/main_websocket.py:415-477):
Opus Encoding
For outgoing audio, the backend encodes PCM audio to Opus (unmute/main_websocket.py:520-558):
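Before encoding, continuous PCM has to be cut into fixed 20 ms frames; at 24 kHz that is 480 samples per frame. The sketch below shows this generic framing step only; it is not the sphn-based code in main_websocket.py.

```typescript
// Framing sketch: slice continuous PCM into complete 20 ms frames.
// A trailing partial frame is held back until more samples arrive.
const SAMPLE_RATE = 24000;
const FRAME_SAMPLES = SAMPLE_RATE * 0.02; // 480 samples = 20 ms

function frames(pcm: Float32Array): Float32Array[] {
  const out: Float32Array[] = [];
  for (let i = 0; i + FRAME_SAMPLES <= pcm.length; i += FRAME_SAMPLES) {
    out.push(pcm.subarray(i, i + FRAME_SAMPLES));
  }
  return out;
}

const chunk = new Float32Array(1000); // ~41.7 ms of audio
const ready = frames(chunk); // two full frames; 40 samples held back
```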
Audio Buffering and Synchronization
Backend Buffering
The TTS system manages audio buffering to prevent stuttering (unmute/tts/text_to_speech.py:88-94):
AUDIO_BUFFER_SEC defines the buffer size in seconds, chosen to prevent stuttering while maintaining low latency.
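The buffer size works out to a small, concrete number. Assuming FRAME_TIME_SEC is 0.02 (the 20 ms frame size described on this page; the constant names mirror the backend but the value is an assumption):

```typescript
// Quick calculation of the backend audio buffer, under the assumption
// that FRAME_TIME_SEC = 0.02 (the 20 ms frames described on this page).
const FRAME_TIME_SEC = 0.02;
const AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4; // 0.08 s = 80 ms of audio
const bufferedSamplesAt24kHz = AUDIO_BUFFER_SEC * 24000; // ~1920 samples
```

Four frames (80 ms) is enough slack to absorb scheduling and network jitter while adding less than a tenth of a second of latency.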
Frontend Buffering
The audio output processor worklet handles buffering on the client side, ensuring smooth playback even with network jitter.
Audio Format Specifications
Input Audio (Microphone)
- Codec: Opus
- Sample rate: 24kHz
- Channels: Mono (1 channel)
- Frame size: 20ms
- Bitrate: Adaptive (Opus encoder automatic)
- Transport: Base64-encoded over WebSocket
Output Audio (TTS)
- Codec: Opus
- Sample rate: 24kHz
- Channels: Mono (1 channel)
- Frame size: Variable (based on TTS output)
- Transport: Base64-encoded over WebSocket
Cloudflare TURN Configuration
Although Unmute doesn’t use WebRTC peer connections, it includes utilities for obtaining TURN server credentials from Cloudflare (unmute/webrtc_utils.py):
This utility is available for future use if Unmute moves to a peer-to-peer WebRTC architecture.
Performance Considerations
Latency Optimization
- Small Frame Sizes: 20ms frames minimize encoding latency
- Streaming Mode: streamPages: true sends data immediately without waiting for complete pages
- Low Complexity: encoderComplexity: 0 trades some quality for lower CPU usage and latency
- Minimal Buffering: AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4 keeps the buffer small
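Together, these choices set a floor on audio latency before network and decode delays are added. The arithmetic below is illustrative; only the frame time (20 ms) and the four-frame buffer come from this page, and real end-to-end latency also includes network, decode, and playout delays.

```typescript
// Rough latency floor implied by the settings above (assumed budget,
// not a measurement).
const frameMs = 20;                   // one Opus frame
const backendBufferMs = frameMs * 4;  // AUDIO_BUFFER_SEC in milliseconds
const minPipelineMs = frameMs + backendBufferMs; // 100 ms floor
```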
Bandwidth Optimization
- 24kHz Sample Rate: Lower than 48kHz but sufficient for voice
- Mono Audio: Single channel reduces bandwidth by 50%
- Opus Codec: Highly efficient compression for speech
- Adaptive Bitrate: Opus automatically adjusts based on audio characteristics
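The raw-PCM arithmetic shows why the sample rate and channel count matter even before Opus compression is applied (16-bit samples assumed for the comparison):

```typescript
// Raw PCM bandwidth before compression, assuming 16-bit samples.
const bitsPerSample = 16;
const raw48kStereoKbps = (48000 * bitsPerSample * 2) / 1000; // 1536 kbps
const raw24kMonoKbps = (24000 * bitsPerSample * 1) / 1000;   // 384 kbps
// Dropping to 24 kHz mono is a 4x reduction before Opus even runs;
// Opus then compresses the 384 kbps stream down to a few tens of kbps
// for speech, adapting the rate to the signal.
const reductionBeforeOpus = raw48kStereoKbps / raw24kMonoKbps; // 4x
```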
CPU Optimization
- Web Workers: Encoding/decoding runs in separate threads
- Audio Worklets: Audio processing runs on high-priority audio thread
- Async Processing: The backend uses asyncio.to_thread for CPU-intensive operations
Debugging Audio Issues
Enable Developer Mode
Press D in the frontend to enable developer mode, which shows:
- Debug dictionary with internal state
- Additional logging in the console
Check Audio Levels
The circular visualizers show audio activity:
- User circle (right): Should pulse when speaking
- Assistant circle (left): Should pulse during TTS output
Common Issues
No audio input detected
- Check microphone permissions
- Verify the microphone is not muted in system settings
- Check the browser DevTools console for errors
- Ensure echoCancellation is properly configured
Choppy or stuttering audio output
- Network issues may be causing packet loss
- Backend TTS may be running slower than real-time
- Increase the AUDIO_BUFFER_SEC value
- Check CPU usage on the backend
Audio/text desynchronization
- This can occur when TTS is slower than real-time
- Buffering in the audio pipeline delays playback
- Adjust AUDIO_BUFFER_SEC to balance latency vs. stability
Echo or feedback
- Ensure echoCancellation: true is set
- Use headphones to prevent speaker feedback
- Lower the speaker volume
Related Documentation
- WebSocket Protocol - Message format and communication
- System Architecture - Overall system design