Overview
The speech processing system analyzes audio streams in real time to extract acoustic features, transcribe speech, and compute research-grade metrics for interview assessment.
Source File: backend/interview_analyzer.py
Core Components
1. RunningStatistics Class
Central class for tracking speech metrics using Welford’s online algorithm for numerical stability.
interview_analyzer.py:29-72
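As a rough illustration of the Welford update such a class relies on, here is a minimal sketch. The class name `RunningStats` and its attributes are hypothetical and are not taken from interview_analyzer.py; the sketch only shows how mean and variance can be maintained one sample at a time without keeping the samples.

```python
import math


class RunningStats:
    """Running mean and variance via Welford's online algorithm (illustrative only)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = value - self.mean          # deviation from the old mean
        self.mean += delta / self.count    # move the mean toward the new sample
        self._m2 += delta * (value - self.mean)  # uses old and new mean

    @property
    def variance(self) -> float:
        # Sample variance (n - 1 denominator); reported as 0.0 below two samples.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    @property
    def std(self) -> float:
        return math.sqrt(self.variance)


stats = RunningStats()
for pitch_hz in (110.0, 120.0, 130.0):
    stats.update(pitch_hz)
print(stats.mean, stats.variance, stats.std)
```

Only three numbers are stored regardless of how many samples arrive, which is where the O(1) memory cost comes from.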
Key Design Principles
Correct Speech Timing Logic:
- speaking_time = actual voice activity duration
- silence_time = pauses BETWEEN speech segments (thinking time)
- forced_silence_time = the 15 s the system waits AFTER an answer is complete
- effective_duration = total session time - forced silence
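The four quantities above can be sketched as plain arithmetic over a list of speech segments. This is a sketch under stated assumptions, not the code in interview_analyzer.py: the function name `summarize_timing`, the `(start, end)` segment format, and the choice to cover a single answer followed by one forced wait are all hypothetical.

```python
FORCED_SILENCE_S = 15.0  # system wait after a completed answer, per the list above


def summarize_timing(segments, session_duration_s):
    """Compute timing metrics for one answer.

    segments: (start_s, end_s) voice-activity intervals, sorted by start time.
    session_duration_s: total time for this answer, including the forced wait.
    """
    # Time with actual voice activity.
    speaking_time = sum(end - start for start, end in segments)
    # Gaps between consecutive speech segments (thinking pauses).
    silence_time = sum(
        max(0.0, next_start - prev_end)
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:])
    )
    # Assumption: exactly one forced wait follows the answer.
    forced_silence_time = FORCED_SILENCE_S
    effective_duration = session_duration_s - forced_silence_time
    return {
        "speaking_time": speaking_time,
        "silence_time": silence_time,
        "forced_silence_time": forced_silence_time,
        "effective_duration": effective_duration,
    }


timing = summarize_timing([(0.0, 4.0), (6.0, 10.0)], session_duration_s=25.0)
print(timing)
```

Keeping the forced wait out of `silence_time` prevents the system's own delay from being counted as candidate hesitation.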
2. Welford’s Online Algorithm
Implements numerically stable running mean and variance calculation without storing all samples.
Speech Event Tracking
interview_analyzer.py:110-148
interview_analyzer.py:150-171
3. Pitch Detection (YIN Algorithm)
Uses librosa’s YIN implementation for robust fundamental frequency (F0) estimation.
- fmin=75: Lower bound for male voices
- fmax=400: Upper bound for female voices
- frame_length=2048: Analysis window size
- Valid range filter: 0 < F0 < 500 Hz
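A minimal sketch of how these parameters could be passed to `librosa.yin`, followed by the valid-range filter. The helper names `estimate_f0` and `filter_valid_f0` are hypothetical, and the way the real module wires the call may differ. librosa is imported lazily so the filter can be used without it installed.

```python
def filter_valid_f0(f0_values, low_hz=0.0, high_hz=500.0):
    """Keep only F0 estimates strictly inside (low_hz, high_hz); NaN is dropped."""
    return [float(f) for f in f0_values if low_hz < f < high_hz]


def estimate_f0(y, sr):
    """Estimate per-frame F0 (Hz) for a mono float waveform `y` sampled at `sr`."""
    import librosa  # third-party; imported here so the filter above stays standalone

    f0 = librosa.yin(y, fmin=75, fmax=400, sr=sr, frame_length=2048)
    return filter_valid_f0(f0)


# The filter alone, on hand-written values:
print(filter_valid_f0([0.0, 120.5, 600.0, 220.0]))
```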
4. Voice Quality Metrics
Jitter (Frequency Perturbation)
Measures cycle-to-cycle variation in pitch period.
- Normal jitter: < 1%
- Elevated jitter: May indicate vocal stress or nervousness
- Used in combination with shimmer for voice quality assessment
Shimmer (Amplitude Perturbation)
Measures cycle-to-cycle variation in amplitude.
- Normal shimmer: < 3.81%
- Elevated shimmer: May indicate vocal fatigue or stress
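Jitter and shimmer share the same shape: the mean absolute difference between consecutive values, divided by the mean value. The sketch below assumes this common "local" definition, since the source does not say which variant it uses. All function names are hypothetical. Per-frame F0 values are used here as a stand-in for true glottal cycles, which is an approximation.

```python
def local_perturbation_percent(values):
    """Mean absolute consecutive difference, as a percentage of the mean value."""
    if len(values) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_abs_diff / mean_value if mean_value else 0.0


def jitter_percent(f0_hz):
    """Jitter from F0 estimates: convert to periods (seconds), then measure variation."""
    periods = [1.0 / f for f in f0_hz if f > 0]
    return local_perturbation_percent(periods)


def shimmer_percent(peak_amplitudes):
    """Shimmer from per-cycle (or per-frame) peak amplitudes."""
    return local_perturbation_percent(peak_amplitudes)


print(jitter_percent([100.0, 100.0, 100.0]))   # perfectly steady pitch
print(shimmer_percent([2.0, 4.0, 2.0, 4.0]))   # strongly alternating amplitude
```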
5. Faster-Whisper Integration
High-performance speech-to-text transcription using CTranslate2.
- Up to 4x faster than OpenAI Whisper
- Lower memory footprint via quantization
- Real-time capable on CPU
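A minimal sketch of CPU transcription with int8 quantization through the faster-whisper package. The helper names, the `"base"` model size, and the beam size are assumptions chosen for illustration; the source does not state which settings the analyzer uses.

```python
def segments_to_text(segments):
    """Join segment texts into one transcript string."""
    return " ".join(seg.text.strip() for seg in segments)


def transcribe_file(path, model_size="base"):
    """Transcribe an audio file on CPU; returns (text, detected_language)."""
    from faster_whisper import WhisperModel  # third-party; imported lazily

    # int8 quantization reduces memory use on CPU.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # `segments` is a generator; transcription runs as it is consumed.
    segments, info = model.transcribe(path, beam_size=5)
    return segments_to_text(segments), info.language
```

Each segment also carries `start` and `end` times in seconds, which can feed the timing metrics.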
6. AssemblyAI Streaming
Real-time streaming transcription with speaker diarization support.
- Real-time streaming transcription
- Speaker diarization
- Automatic punctuation and capitalization
- Word-level timestamps
Metric Calculation
Response Latency
interview_analyzer.py:102-148
Speaking Rate (WPM)
- Slow: < 120 WPM
- Normal: 120-160 WPM
- Fast: > 160 WPM
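The bands above, combined with the formula from the Key Metrics Summary, `(words / speaking_time) * 60`, can be sketched as follows. The function names are hypothetical. The sketch assumes `speaking_time` is in seconds and that both 120 and 160 WPM count as normal.

```python
def words_per_minute(word_count, speaking_time_s):
    """Speaking rate based on voice-active time only (pauses excluded)."""
    if speaking_time_s <= 0:
        return 0.0  # no speech yet; avoid division by zero
    return word_count / speaking_time_s * 60.0


def classify_rate(wpm):
    """Map a WPM value onto the slow / normal / fast bands."""
    if wpm < 120:
        return "slow"
    if wpm <= 160:
        return "normal"
    return "fast"


wpm = words_per_minute(word_count=75, speaking_time_s=30.0)
print(wpm, classify_rate(wpm))
```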
Pause Analysis
Real-Time Processing Pipeline
Thread Safety
All metric updates are protected by reentrant locks.
Performance Optimizations
- Welford’s Algorithm: O(1) memory for variance calculation
- Monotonic Timestamps: Use time.monotonic() for wall-clock independence
- Batch Processing: Accumulate metrics, export periodically
- Normalized Audio: Convert to float32 once, reuse
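The reentrant-lock and monotonic-timestamp points can be sketched together. The class `SpeechClock` and its methods are hypothetical and are not the structure of interview_analyzer.py. The sketch shows why a reentrant lock helps: a method that already holds the lock can call other locked methods without deadlocking.

```python
import threading
import time


class SpeechClock:
    """Accumulates speaking time from start/end events (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.RLock()  # reentrant: same thread may re-acquire
        self._speech_started_at = None
        self.speaking_time = 0.0

    def speech_started(self) -> None:
        with self._lock:
            if self._speech_started_at is None:
                # monotonic() is unaffected by system clock adjustments
                self._speech_started_at = time.monotonic()

    def speech_ended(self) -> None:
        with self._lock:
            if self._speech_started_at is not None:
                self.speaking_time += time.monotonic() - self._speech_started_at
                self._speech_started_at = None

    def snapshot(self) -> dict:
        with self._lock:
            return {
                "speaking_time": self.speaking_time,
                "speaking": self._speech_started_at is not None,
            }

    def end_and_snapshot(self) -> dict:
        # Holds the lock while calling two methods that also take it;
        # a plain threading.Lock would deadlock here.
        with self._lock:
            self.speech_ended()
            return self.snapshot()


clock = SpeechClock()
clock.speech_started()
result = clock.end_and_snapshot()
print(result)
```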
Key Metrics Summary
| Metric | Formula / Method | Clinical Range |
|---|---|---|
| Pitch (F0) | YIN algorithm | 75-400 Hz |
| Jitter | Cycle-to-cycle F0 variation | < 1% normal |
| Shimmer | Cycle-to-cycle amplitude variation | < 3.81% normal |
| Speaking Rate | (words / speaking_time) * 60 | 120-160 WPM |
| Response Latency | first_voice - question_end | < 2s optimal |
| Long Pauses | Pauses > 5s | 0-2 acceptable |