LangShazam’s real-time detection system captures audio from your microphone, processes it through a WebSocket connection, and returns the detected language within seconds. The system is optimized for accuracy while maintaining low latency.
1
Microphone Capture
The browser requests access to your microphone using the MediaDevices API. Audio is captured at 16,000 bits per second (16 kbps) in MP4 format for broad browser compatibility.
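The capture step can be sketched with the MediaDevices and MediaRecorder APIs. This is a minimal illustration, not LangShazam's actual frontend code; the chunk interval and option names passed to `MediaRecorder` are assumptions, while the mime type and bitrate follow the text above.

```javascript
// Hedged sketch: request the microphone and emit compressed audio chunks.
async function startCapture(onChunk) {
  // Prompts the user for microphone permission.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  const recorder = new MediaRecorder(stream, {
    mimeType: "audio/mp4",        // MP4 container, per the step above
    audioBitsPerSecond: 16000,    // 16 kbps, per the step above
  });

  // Each dataavailable event carries one encoded chunk; forward non-empty
  // chunks to the caller (e.g. a WebSocket send function).
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) onChunk(event.data);
  };

  recorder.start(250); // emit a chunk every 250 ms (interval is an assumption)
  return recorder;
}
```

Calling `recorder.stop()` releases the encoder; stopping the underlying tracks (`stream.getTracks().forEach(t => t.stop())`) turns off the microphone indicator.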
2
Audio Buffering
Audio chunks are collected for at least 4 seconds before processing. This ensures enough context for accurate language detection.
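The buffering step can be sketched as a small time-gated accumulator. This is a minimal illustration under the 4-second rule stated above, not LangShazam's code; the actual backend excerpt later in this page gates on byte count rather than wall-clock time.

```python
import time

MIN_BUFFER_SECONDS = 4.0  # minimum context window, per the step above


class AudioBuffer:
    """Accumulates audio chunks until enough time has elapsed."""

    def __init__(self):
        self.chunks = []     # raw encoded audio chunks, in arrival order
        self.started = None  # monotonic timestamp of the first chunk

    def add(self, chunk):
        """Append one chunk, starting the clock on the first one."""
        if self.started is None:
            self.started = time.monotonic()
        self.chunks.append(chunk)

    def ready(self):
        """True once at least MIN_BUFFER_SECONDS of wall time has passed."""
        return (
            self.started is not None
            and time.monotonic() - self.started >= MIN_BUFFER_SECONDS
        )

    def drain(self):
        """Return the concatenated buffer and reset for the next window."""
        data = b"".join(self.chunks)
        self.chunks.clear()
        self.started = None
        return data
```

`drain()` hands the joined bytes to the detection step and resets state, so each detection runs on a fresh window.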
3
WebSocket Transmission
Audio data is streamed to the backend via WebSocket connection, allowing for continuous real-time communication.
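A client for this step can be sketched in Python, which is also handy for testing the backend without a browser. The endpoint URL, chunk size, and the use of the third-party `websockets` library are assumptions, not details from LangShazam.

```python
import asyncio

CHUNK_SIZE = 4096  # bytes per binary frame; an assumed value


def iter_chunks(data, size=CHUNK_SIZE):
    """Yield fixed-size binary frames from a buffer of encoded audio."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


async def stream_audio(audio, url="ws://localhost:8000/ws"):
    """Send audio as binary frames and wait for the detection result."""
    import websockets  # third-party client library (pip install websockets)

    async with websockets.connect(url) as ws:
        for frame in iter_chunks(audio):
            await ws.send(frame)   # arrives as receive_bytes() on the server
        return await ws.recv()     # detection result sent back by the backend
```

Sending `bytes` produces binary WebSocket frames, which is what the backend's `websocket.receive_bytes()` loop expects.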
4
Language Detection
The backend processes the audio using OpenAI’s Whisper API and returns the detected language along with processing metrics.
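The detection call can be sketched with the official `openai` Python SDK: Whisper's `verbose_json` response format includes the detected language alongside the transcript. The function name, the in-memory file handling, and the metrics shape are assumptions; only the use of Whisper comes from the text above.

```python
import io
import time


def detect_language(audio_bytes):
    """Hedged sketch: send buffered audio to Whisper, return language + timing."""
    from openai import OpenAI  # official SDK; assumes OPENAI_API_KEY is set

    client = OpenAI()
    start = time.perf_counter()

    buf = io.BytesIO(audio_bytes)
    buf.name = "chunk.mp4"  # the SDK infers the container format from the name

    resp = client.audio.transcriptions.create(
        model="whisper-1",
        file=buf,
        response_format="verbose_json",  # verbose output includes `language`
    )
    return {
        "language": resp.language,
        "processing_ms": round((time.perf_counter() - start) * 1000),
    }
```

The returned dictionary mirrors the "detected language along with processing metrics" described above, though the real backend's result shape may differ.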
When audio data arrives at the backend, it follows this flow:
websocket_manager.py (lines 29-53)
```python
buffer = []
total_size = 0
MIN_AUDIO_SIZE = 20000  # Minimum size in bytes (about 1 second of audio)
try:
    while True:
        data = await websocket.receive_bytes()
        if not data:
            logger.debug(f"[{connection_id}] Received empty data chunk")
            continue
        buffer.append(data)
        total_size += len(data)
        logger.debug(f"[{connection_id}] Received audio chunk, total size: {total_size} bytes")
        # Only process when we have enough data
        if total_size >= MIN_AUDIO_SIZE:
            audio_data = b''.join(buffer)
            logger.info(f"[{connection_id}] Processing audio data of size: {len(audio_data)} bytes")
            result = await self.audio_processor.process_audio(
                audio_data, self.metrics, connection_id
            )
```