WebSocket /ws/audio/

Bidirectional WebSocket for streaming audio from glasses microphone to speech recognition and command processing.

Connection

Establish WebSocket connection:
const ws = new WebSocket('wss://api.jarvis.local/ws/audio/room123');

Path Parameters

room_code
string
required
Unique room identifier for this audio session (client-generated).

Authentication

No authentication required (for hackathon demo).

Client → Server Messages

Audio Chunk (Binary)

Send raw audio bytes (PCM or WebM format):
// Send audio buffer
ws.send(audioBuffer); // ArrayBuffer or Blob
Format requirements:
  • Sample rate: 16kHz recommended
  • Channels: Mono (1 channel)
  • Encoding: PCM16 or WebM/Opus
  • Chunk size: 1-5 seconds of audio
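If you capture raw Float32 samples via the Web Audio API (for example from an AudioWorklet), converting them to the PCM16 format listed above can be sketched as follows. This is illustrative; how you obtain the Float32Array is up to your capture pipeline.

```javascript
// Convert Web Audio Float32 samples (range -1..1) to 16-bit
// little-endian PCM, suitable for sending as a binary chunk.
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the int16 range
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer; // ws.send(buffer) once the socket is OPEN
}
```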

Server → Client Messages

Transcript Event

Sent when speech is recognized:
{
  "type": "transcript",
  "text": "identify John Smith"
}
type
string
Always "transcript" for transcription events.
text
string
Transcribed text from the audio chunk.

Command Event

Sent when a command is matched:
{
  "type": "command",
  "command": "IDENTIFY",
  "argument": "John Smith"
}
type
string
Always "command" for command events.
command
string
Matched command type:
  • IDENTIFY - Identify a person by name
  • RESEARCH - Research a topic or person
  • CAPTURE - Capture current frame
  • NONE - No command matched
argument
string
Extracted argument (e.g., person name, research query).
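The two message types above can be dispatched with a small helper. This is a sketch; the handler names (`onTranscript`, `onCommand`) are illustrative, not part of the API.

```javascript
// Parse a server message and route it to the matching handler.
// Returns the message type so callers can log or ignore unknowns.
function handleServerMessage(raw, handlers) {
  const data = JSON.parse(raw);
  if (data.type === 'transcript') {
    handlers.onTranscript(data.text);
  } else if (data.type === 'command') {
    handlers.onCommand(data.command, data.argument);
  }
  return data.type;
}
```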

Supported Commands

The audio processor matches these voice commands:
| Command Pattern | Command | Example |
|---|---|---|
| "identify [name]" | IDENTIFY | "identify John Smith" |
| "who is [name]" | IDENTIFY | "who is Jane Doe" |
| "research [query]" | RESEARCH | "research Tesla stock" |
| "look up [query]" | RESEARCH | "look up machine learning" |
| "capture" | CAPTURE | "capture this" |
| "take a picture" | CAPTURE | "take a picture" |

Connection Lifecycle

Example Implementation (JavaScript)

class AudioStreamer {
  constructor(roomCode) {
    this.ws = new WebSocket(`wss://api.jarvis.local/ws/audio/${roomCode}`);
    this.setupHandlers();
  }
  
  setupHandlers() {
    this.ws.onopen = () => {
      console.log('Audio WebSocket connected');
      this.startMicrophone();
    };
    
    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      
      if (data.type === 'transcript') {
        console.log('Transcript:', data.text);
        this.onTranscript(data.text);
      } else if (data.type === 'command') {
        console.log('Command:', data.command, data.argument);
        this.onCommand(data.command, data.argument);
      }
    };
    
    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
    
    this.ws.onclose = () => {
      console.log('WebSocket closed');
      this.stopMicrophone();
    };
  }
  
  async startMicrophone() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm',
      audioBitsPerSecond: 16000 // 16 kbps target bitrate (not the sample rate)
    });
    
    mediaRecorder.ondataavailable = (event) => {
      if (event.data.size > 0 && this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(event.data);
      }
    };
    
    // Send chunks every 2 seconds
    mediaRecorder.start(2000);
    this.mediaRecorder = mediaRecorder;
  }
  
  stopMicrophone() {
    if (this.mediaRecorder) {
      this.mediaRecorder.stop();
    }
  }
  
  onTranscript(text) {
    // Update UI with transcript
    document.getElementById('transcript').textContent = text;
  }
  
  onCommand(command, argument) {
    // Handle commands
    if (command === 'IDENTIFY') {
      this.identifyPerson(argument);
    } else if (command === 'RESEARCH') {
      this.research(argument);
    } else if (command === 'CAPTURE') {
      this.captureFrame();
    }
  }
  
  identifyPerson(name) {
    console.log('Identifying:', name);
    // Trigger identification flow
  }
  
  research(query) {
    console.log('Researching:', query);
    // Trigger research flow
  }
  
  captureFrame() {
    console.log('Capturing frame');
    // Capture current video frame
  }
  
  close() {
    this.ws.close();
  }
}

// Usage
const streamer = new AudioStreamer('room_' + Date.now());

Close Codes

| Code | Reason | Description |
|---|---|---|
| 1000 | Normal closure | Client closed connection cleanly |
| 1008 | Policy violation | OpenAI API key not configured |
| 1011 | Server error | Unexpected server error |

Error Handling

let retryCount = 0;

ws.onclose = (event) => {
  if (event.code === 1008) {
    console.error('Audio API not configured');
    alert('Voice commands are not available');
  } else if (event.code !== 1000) {
    console.error('Connection closed unexpectedly:', event.code, event.reason);
    // Attempt reconnection with exponential backoff, capped at 30 seconds
    const delay = Math.min(1000 * Math.pow(2, retryCount), 30000);
    retryCount++;
    setTimeout(() => reconnect(), delay); // reconnect() re-creates the WebSocket
  }
};
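The backoff delay used above can be factored into a small pure function, which keeps it easy to unit-test. This is a sketch; jitter, often added in practice to avoid reconnection stampedes, is omitted for clarity.

```javascript
// Exponential backoff: baseMs * 2^retryCount, capped at capMs.
function backoffDelay(retryCount, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * Math.pow(2, retryCount), capMs);
}
```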

Performance

  • Transcription latency: 500-2000ms per chunk
  • Command matching: Less than 10ms
  • Max connection duration: No limit
  • Recommended chunk size: 2-3 seconds

Best Practices

  • Use WebM/Opus encoding for better compression
  • Send audio chunks every 2-3 seconds for responsive transcription
  • Handle reconnection with exponential backoff
  • Mute audio input when not needed to save API costs

Cost note: transcription uses the OpenAI Whisper API (~$0.006 per minute of audio), so long-running connections can accumulate significant costs.
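At the quoted Whisper rate (~$0.006 per minute of transcribed audio), a quick cost estimate:

```javascript
// Rough streaming cost at the quoted Whisper rate.
function whisperCostUSD(minutes, ratePerMinute = 0.006) {
  return minutes * ratePerMinute;
}
```

One hour of continuous audio works out to roughly $0.36, which is why muting idle input is worthwhile.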

Debugging

Enable verbose logging:
ws.onmessage = (event) => {
  console.log('[WS] Received:', event.data);
  const data = JSON.parse(event.data);
  // ... handle message
};

const originalSend = ws.send.bind(ws);
ws.send = (data) => {
  // ArrayBuffers expose byteLength; Blobs (from MediaRecorder) expose size
  console.log('[WS] Sending:', data.byteLength ?? data.size, 'bytes');
  originalSend(data);
};
Monitor connection state:
setInterval(() => {
  console.log('WebSocket state:', [
    'CONNECTING',
    'OPEN',
    'CLOSING',
    'CLOSED'
  ][ws.readyState]);
}, 5000);