Overview
The Speech-to-Text (STT) component uses Kyutai’s streaming ASR (Automatic Speech Recognition) models to transcribe user speech in real time, with low latency and integrated Voice Activity Detection (VAD).

Key Features:

- Real-time streaming transcription
- ~2.5 second algorithmic delay (configurable)
- Integrated pause prediction (VAD)
- WebSocket-based binary protocol (MessagePack)
- Word-level timestamps
Architecture
STT Service
Technology: Rust (moshi-server)
Location: services/moshi-server/
Model: Kyutai Streaming ASR
Deployment: Docker container with GPU access
Service Configuration
Docker Compose (docker-compose.yml:78):
- VRAM: ~2.5 GB
- Concurrent streams: Limited by capacity management
Python Client
File: unmute/stt/speech_to_text.py
SpeechToText Class
Connection Flow
Startup Sequence
File: speech_to_text.py:130
Sending Audio
File: speech_to_text.py:105
Serialization uses use_single_float=True for efficiency, so samples are encoded as 32-bit rather than 64-bit floats.
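As a sketch, an outgoing audio message might be built like this. The field names ("type", "pcm") are illustrative assumptions about the MessagePack protocol, not confirmed from the source:

```python
# Sketch of a client -> server audio message. In the real client the dict
# would be serialized with msgpack.packb(msg, use_single_float=True) so
# each sample travels as a 32-bit float.
def build_audio_message(pcm_chunk):
    """Build the dict to be MessagePack-encoded and sent over the socket."""
    return {"type": "Audio", "pcm": list(pcm_chunk)}

msg = build_audio_message([0.0, 0.1, -0.1])
```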
Receiving Messages
File: speech_to_text.py:175
The STT client is an async iterator:
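The consumption pattern looks roughly like this. FakeSTT stands in for the real client, and the message shapes follow the message types described below, but the exact field names are assumptions:

```python
import asyncio

# Minimal sketch of consuming the STT client as an async iterator.
class FakeSTT:
    def __init__(self, messages):
        self._messages = list(messages)

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._messages:
            raise StopAsyncIteration
        return self._messages.pop(0)

async def collect_words(stt):
    words = []
    async for msg in stt:  # same loop shape as the real client
        if msg["type"] == "Word":
            words.append(msg["text"])
    return words

words = asyncio.run(collect_words(FakeSTT([
    {"type": "Word", "text": "hello"},
    {"type": "Step", "prs": [0.9, 0.05, 0.05]},
    {"type": "Word", "text": "world"},
])))
```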
Message Types
Client → Server
Audio Message
Marker Message
Server → Client
Ready Message
Word Message
Step Message
- prs[0]: Probability of speech continuing
- prs[1]: Probability of speech ending soon
- prs[2]: Pause prediction score (0-1; higher = more likely pause)
Error Message
Voice Activity Detection
Pause Prediction
The backend uses prs[2] from Step messages to detect pauses.
File: unmute/unmute_handler.py:372
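A sketch of the idea: smooth prs[2] across step messages and declare a pause once the smoothed score crosses a threshold. The threshold, smoothing factor, and function name are illustrative assumptions:

```python
PAUSE_THRESHOLD = 0.6  # assumed value, not from the source

def detect_pause(prs_history, threshold=PAUSE_THRESHOLD, alpha=0.3):
    """Return True if the smoothed prs[2] score exceeds the threshold.

    prs_history is a sequence of prs triples from Step messages.
    """
    ema = 0.0
    for prs in prs_history:
        # Exponential moving average of the pause prediction score.
        ema = alpha * prs[2] + (1 - alpha) * ema
    return ema > threshold
```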
Exponential Moving Average
File: unmute/stt/exponential_moving_average.py
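An EMA smooths the noisy per-step VAD scores. A minimal sketch of the idea (parameter names are assumptions, not the module's actual API):

```python
class ExponentialMovingAverage:
    """Smooths a stream of values: higher alpha tracks new samples
    faster, lower alpha smooths more aggressively."""

    def __init__(self, alpha: float, initial: float = 0.0):
        self.alpha = alpha  # smoothing factor in (0, 1]
        self.value = initial

    def update(self, x: float) -> float:
        self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value

ema = ExponentialMovingAverage(alpha=0.5)
first = ema.update(1.0)   # 0.5
second = ema.update(1.0)  # 0.75
```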
Interruption Detection
File: unmute/unmute_handler.py:352

Two methods for detecting user interruption:

1. STT word: any word received from the STT while the bot is speaking
2. VAD-based: the pause prediction score drops below a threshold

- Prevents echo cancellation issues
- STT word-based interruption always works
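The two signals above can be combined as in this sketch; the function name, parameters, and threshold value are illustrative assumptions:

```python
def should_interrupt(bot_speaking: bool,
                     stt_word_received: bool,
                     vad_pause_score: float,
                     vad_threshold: float = 0.4) -> bool:
    """Decide whether the user is interrupting the bot."""
    if not bot_speaking:
        return False          # nothing to interrupt
    if stt_word_received:
        return True           # STT word-based: always works
    # VAD-based: a low pause score means the user is likely speaking.
    return vad_pause_score < vad_threshold
```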
Flushing
File: unmute/unmute_handler.py:340
When a pause is detected, the STT needs to be “flushed” to process remaining audio:
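One common way to flush a streaming ASR with look-ahead is to push enough silence to cover the algorithmic delay, so all buffered speech gets transcribed. A sketch under that assumption (the sample rate and helper name are illustrative):

```python
SAMPLE_RATE = 24000    # assumed sample rate for illustration
STT_DELAY_SEC = 2.5    # the algorithmic delay from the docs

def make_flush_padding(delay_sec: float = STT_DELAY_SEC,
                       sample_rate: int = SAMPLE_RATE):
    """Silence samples to send so the model's look-ahead window is
    fully consumed and remaining words are emitted."""
    return [0.0] * int(delay_sec * sample_rate)

padding = make_flush_padding()  # 2.5 s of silence at 24 kHz
```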
Timing & Latency
Time to First Token (TTFT)
File: speech_to_text.py:203
Algorithmic Delay
Constant: STT_DELAY_SEC = 2.5 (configurable)
- Purpose: Look-ahead for better accuracy
- Trade-off: a longer delay improves accuracy but increases end-to-end latency
- Current Time: Tracked via step messages
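Tracking the current audio time from step messages can be sketched as follows; the 80 ms frame duration is an assumption for illustration, not a documented value:

```python
FRAME_DURATION_SEC = 0.08  # assumed per-step frame size

def transcript_time(step_count: int, delay_sec: float = 2.5) -> float:
    """Audio time (seconds) the transcript currently corresponds to.

    Each step message advances time by one frame; the algorithmic
    look-ahead delay is subtracted because words are emitted that far
    behind the audio front.
    """
    return max(0.0, step_count * FRAME_DURATION_SEC - delay_sec)
```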
Metrics
File: unmute/metrics.py
STT-Specific Metrics
Real-Time Factor
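The real-time factor compares processing time to audio duration; the convention below (RTF < 1 means faster than real time) is an assumption, as is the function name:

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster
    than real time."""
    return processing_seconds / audio_seconds

rtf = real_time_factor(10.0, 2.0)  # 10 s of audio processed in 2 s
```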
The RTF is calculated during the flush step.

Integration with UnmuteHandler
Startup
File: unmute/unmute_handler.py:422
Message Loop
File: unmute/unmute_handler.py:436
Error Handling
Connection Failures
WebSocket Disconnect
Graceful Shutdown
File: speech_to_text.py:157
Testing & Debugging
Dummy STT
File: unmute/stt/dummy_speech_to_text.py
For testing without GPU:
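A sketch in the spirit of a dummy STT: it ignores audio and yields scripted Word messages instead of contacting the server. The class name and message shape are assumptions, not the module's actual API:

```python
import asyncio

class DummySpeechToText:
    """GPU-free stand-in: replays a scripted transcript."""

    def __init__(self, script):
        self._script = list(script)

    async def send_audio(self, pcm):
        pass  # audio is ignored; the transcript is pre-scripted

    def __aiter__(self):
        return self

    async def __anext__(self):
        if not self._script:
            raise StopAsyncIteration
        await asyncio.sleep(0)  # yield control, like a real socket read
        return {"type": "Word", "text": self._script.pop(0)}

async def run():
    stt = DummySpeechToText(["testing", "one", "two"])
    return [msg["text"] async for msg in stt]

texts = asyncio.run(run())
```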
Example Script
File: unmute/scripts/stt_from_file_example.py
Transcribe audio file:
Performance Tuning
Delay Configuration
Adjust STT_DELAY_SEC in the environment:
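For example, as a hypothetical Docker Compose override; only the STT_DELAY_SEC variable name comes from these docs, the service name and value are illustrative:

```yaml
services:
  stt:
    environment:
      - STT_DELAY_SEC=2.0  # default 2.5; lower = less latency, less look-ahead
```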
VAD Threshold
Adjust the pause detection sensitivity threshold.

EMA Parameters
Adjust the smoothing factor for VAD scores.

Next Steps
- Text-to-Speech - TTS component
- LLM Integration - LLM processing
- Data Flow - End-to-end timing