WebSocket /ws/audio/
Bidirectional WebSocket for streaming audio from glasses microphone to speech recognition and command processing.
Connection
Establish WebSocket connection:
const ws = new WebSocket('wss://api.jarvis.local/ws/audio/room123');
Path Parameters
The final path segment (e.g. `room123` above) is a unique room identifier for this audio session (client-generated).
Authentication
No authentication required (for hackathon demo).
Client → Server Messages
Audio Chunk (Binary)
Send raw audio bytes (PCM or WebM format):
// Send audio buffer
ws.send(audioBuffer); // ArrayBuffer or Blob
Format requirements:
- Sample rate: 16kHz recommended
- Channels: Mono (1 channel)
- Encoding: PCM16 or WebM/Opus
- Chunk size: 1-5 seconds of audio
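If you capture audio via the Web Audio API, samples arrive as Float32 in the range [-1, 1] and must be converted before sending as PCM16. A minimal sketch (function names are illustrative, not part of this API):

```javascript
// Convert Float32 audio samples (Web Audio range [-1, 1]) to the
// PCM16 format listed above.
function floatToPCM16(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Chunk sizing: at 16 kHz mono PCM16 (2 bytes/sample), a 2-second
// chunk is 16000 * 2 * 2 = 64000 bytes.
function chunkBytes(seconds, sampleRate = 16000) {
  return seconds * sampleRate * 2;
}
```

You would then send `floatToPCM16(samples).buffer` over the socket as a binary message.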
Server → Client Messages
Transcript Event
Sent when speech is recognized:
{
"type": "transcript",
"text": "identify John Smith"
}
Fields:
- `type` — always "transcript" for transcription events.
- `text` — transcribed text from the audio chunk.
Command Event
Sent when a command is matched:
{
"type": "command",
"command": "IDENTIFY",
"argument": "John Smith"
}
Fields:
- `type` — always "command" for command events.
- `command` — matched command type:
  - IDENTIFY - Identify a person by name
  - RESEARCH - Research a topic or person
  - CAPTURE - Capture current frame
  - NONE - No command matched
- `argument` — extracted argument (e.g., person name, research query).
Supported Commands
The audio processor matches these voice commands:
| Command Pattern | Command | Example |
|---|---|---|
| "identify [name]" | IDENTIFY | "identify John Smith" |
| "who is [name]" | IDENTIFY | "who is Jane Doe" |
| "research [query]" | RESEARCH | "research Tesla stock" |
| "look up [query]" | RESEARCH | "look up machine learning" |
| "capture" | CAPTURE | "capture this" |
| "take a picture" | CAPTURE | "take a picture" |
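Command matching happens server-side in the audio processor, but the patterns in the table can be mirrored client-side for testing or offline fallback. A sketch based only on the table above (the server's actual matching logic may differ):

```javascript
// Illustrative client-side mirror of the command patterns table.
// Each entry captures the trailing argument where the pattern has one.
const COMMAND_PATTERNS = [
  { regex: /^(?:identify|who is)\s+(.+)$/i, command: 'IDENTIFY' },
  { regex: /^(?:research|look up)\s+(.+)$/i, command: 'RESEARCH' },
  { regex: /^(?:capture|take a picture)\b/i, command: 'CAPTURE' },
];

function matchCommand(transcript) {
  const text = transcript.trim();
  for (const { regex, command } of COMMAND_PATTERNS) {
    const m = text.match(regex);
    if (m) return { command, argument: m[1] ?? null };
  }
  return { command: 'NONE', argument: null };
}
```

This mirrors the server's output shape (`command` plus `argument`), with `argument` set to `null` for CAPTURE and NONE.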
Connection Lifecycle
Example Implementation (JavaScript)
class AudioStreamer {
constructor(roomCode) {
this.ws = new WebSocket(`wss://api.jarvis.local/ws/audio/${roomCode}`);
this.setupHandlers();
}
setupHandlers() {
this.ws.onopen = () => {
console.log('Audio WebSocket connected');
this.startMicrophone();
};
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'transcript') {
console.log('Transcript:', data.text);
this.onTranscript(data.text);
} else if (data.type === 'command') {
console.log('Command:', data.command, data.argument);
this.onCommand(data.command, data.argument);
}
};
this.ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
this.ws.onclose = () => {
console.log('WebSocket closed');
this.stopMicrophone();
};
}
async startMicrophone() {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream, {
mimeType: 'audio/webm',
audioBitsPerSecond: 16000
});
mediaRecorder.ondataavailable = (event) => {
if (event.data.size > 0 && this.ws.readyState === WebSocket.OPEN) {
this.ws.send(event.data);
}
};
// Send chunks every 2 seconds
mediaRecorder.start(2000);
this.mediaRecorder = mediaRecorder;
}
stopMicrophone() {
if (this.mediaRecorder) {
this.mediaRecorder.stop();
}
}
onTranscript(text) {
// Update UI with transcript
document.getElementById('transcript').textContent = text;
}
onCommand(command, argument) {
// Handle commands
if (command === 'IDENTIFY') {
this.identifyPerson(argument);
} else if (command === 'RESEARCH') {
this.research(argument);
} else if (command === 'CAPTURE') {
this.captureFrame();
}
}
identifyPerson(name) {
console.log('Identifying:', name);
// Trigger identification flow
}
research(query) {
console.log('Researching:', query);
// Trigger research flow
}
captureFrame() {
console.log('Capturing frame');
// Capture current video frame
}
close() {
this.ws.close();
}
}
// Usage
const streamer = new AudioStreamer('room_' + Date.now());
Close Codes
| Code | Reason | Description |
|---|---|---|
| 1000 | Normal closure | Client closed connection cleanly |
| 1008 | Policy violation | OpenAI API key not configured |
| 1011 | Server error | Unexpected server error |
Error Handling
ws.onclose = (event) => {
if (event.code === 1008) {
console.error('Audio API not configured');
alert('Voice commands are not available');
} else if (event.code !== 1000) {
console.error('Connection closed unexpectedly:', event.code, event.reason);
// Attempt reconnection with exponential backoff
setTimeout(() => reconnect(), Math.min(1000 * Math.pow(2, retryCount), 30000));
}
};
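The inline backoff expression above (which assumes a `retryCount` variable tracked by the caller) can be factored into a small helper; a sketch with illustrative names:

```javascript
// Exponential backoff: 1s, 2s, 4s, ... capped at 30s, matching the
// expression in the close handler above. retryCount starts at 0.
function backoffDelay(retryCount, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * Math.pow(2, retryCount), capMs);
}

// Schedule a reconnect attempt; reconnect() should recreate the
// WebSocket and reset the retry counter once the socket reopens.
function scheduleReconnect(retryCount, reconnect) {
  setTimeout(reconnect, backoffDelay(retryCount));
}
```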
Performance
- Transcription latency: 500-2000ms per chunk
- Command matching: less than 10ms
- Max connection duration: no limit
- Recommended chunk size: 2-3 seconds
Best Practices
- Use WebM/Opus encoding for better compression
- Send audio chunks every 2-3 seconds for responsive transcription
- Handle reconnection with exponential backoff
- Mute audio input when not needed to save API costs

Cost notes:
- Transcription uses the OpenAI Whisper API (costs ~$0.006 per minute)
- Long-running connections can accumulate significant costs
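At the quoted rate, cost scales linearly with minutes streamed; a quick estimator (the default rate is the approximate figure above, not a guaranteed price):

```javascript
// Rough session cost at ~$0.006 per minute of transcribed audio.
function estimateCostUSD(minutesStreamed, ratePerMinute = 0.006) {
  return minutesStreamed * ratePerMinute;
}
```

For example, a 10-minute demo session costs roughly $0.06, but an always-on connection streaming 8 hours a day runs to a few dollars per day.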
Debugging
Enable verbose logging:
ws.onmessage = (event) => {
console.log('[WS] Received:', event.data);
const data = JSON.parse(event.data);
// ... handle message
};
ws.send = (data) => {
  // MediaRecorder emits Blobs (.size); raw buffers expose .byteLength
  const bytes = data.byteLength ?? data.size;
  console.log('[WS] Sending:', bytes, 'bytes');
  WebSocket.prototype.send.call(ws, data);
};
Monitor connection state:
setInterval(() => {
console.log('WebSocket state:', [
'CONNECTING',
'OPEN',
'CLOSING',
'CLOSED'
][ws.readyState]);
}, 5000);