
Overview

Unmute uses WebSockets for message transport, combined with the Web Audio API and Opus encoding for efficient real-time audio processing. The system handles:
  • Microphone input capture and encoding
  • Real-time audio streaming with low latency
  • Audio decoding and playback
  • Voice activity detection (VAD)
  • Audio buffering and synchronization
Unmute does not use WebRTC peer-to-peer connections. Instead, it uses WebSockets for transport with Opus audio encoding, which provides similar efficiency in a client-server architecture.
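As a sketch of what travels over the WebSocket: audio is wrapped in JSON events carrying base64-encoded Opus bytes. The event names below mirror the `InputAudioBufferAppend` and `ResponseAudioDelta` types that appear in the backend code (OpenAI-Realtime-style naming); treat the exact JSON field layout as an assumption, not the verified wire format:

```python
import base64
import json

# Stand-in for an encoded Opus page produced by the encoder worker.
opus_packet = b"OggS" + bytes([0x00, 0x02]) + b"\x00" * 20

# Client -> server: append microphone audio to the input buffer.
append_event = json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(opus_packet).decode("utf-8"),
})

# Server -> client: TTS audio arrives as base64 Opus in a delta event.
delta_event = json.loads('{"type": "response.audio.delta", "delta": "..."}')
```
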

Audio Pipeline Architecture

Microphone → MediaStream → AudioContext → OpusEncoder → WebSocket → Backend

Backend pipeline: STT (Speech-to-Text) → LLM (Language Model) → TTS (Text-to-Speech)

Speakers ← AudioContext ← OpusDecoder ← WebSocket ← Backend

Frontend Audio Processing

Audio Processor Setup

The useAudioProcessor hook (frontend/src/app/useAudioProcessor.ts) manages the complete audio pipeline:
export interface AudioProcessor {
  audioContext: AudioContext;
  opusRecorder: OpusRecorder;
  decoder: DecoderWorker;
  outputWorklet: AudioWorkletNode;
  inputAnalyser: AnalyserNode;
  outputAnalyser: AnalyserNode;
  mediaStreamDestination: MediaStreamAudioDestinationNode;
}

Microphone Input Processing

Configuration (useAudioProcessor.ts:83-104):
const recorderOptions = {
  mediaTrackConstraints: {
    audio: {
      echoCancellation: true,
      noiseSuppression: false,
      autoGainControl: true,
      channelCount: 1,
    },
    video: false,
  },
  encoderPath: "/encoderWorker.min.js",
  bufferLength: Math.round((960 * audioContext.sampleRate) / 24000),
  encoderFrameSize: 20,
  encoderSampleRate: 24000,
  maxFramesPerPage: 2,
  numberOfChannels: 1,
  recordingGain: 1,
  resampleQuality: 3,
  encoderComplexity: 0,
  encoderApplication: 2049,
  streamPages: true,
};
Key Settings:
  • echoCancellation (boolean): Enabled to prevent feedback from speakers
  • noiseSuppression (boolean): Disabled - handled by backend processing
  • autoGainControl (boolean): Enabled for consistent audio levels
  • encoderSampleRate (number): 24 kHz - balances quality and bandwidth
  • encoderFrameSize (number): 20 ms frames for low latency
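The numbers in the configuration fit together: 960 samples at 24 kHz is 40 ms, i.e. two 20 ms encoder frames, which matches encoderFrameSize: 20 and maxFramesPerPage: 2. The bufferLength expression rescales those 960 samples to whatever sample rate the device's AudioContext actually runs at. A quick check of the arithmetic:

```python
def buffer_length(device_sample_rate: int, samples_at_24k: int = 960) -> int:
    """Mirror of Math.round((960 * audioContext.sampleRate) / 24000)."""
    return round(samples_at_24k * device_sample_rate / 24000)

# 960 samples at 24 kHz is 40 ms: two 20 ms encoder frames per Ogg page.
frame_ms = 960 / 24000 * 1000   # -> 40.0

# On a 48 kHz device context the buffer holds 1920 samples; on 44.1 kHz, 1764.
print(buffer_length(48000))     # -> 1920
print(buffer_length(44100))     # -> 1764
```
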

Opus Encoding

The frontend uses the opus-recorder library to encode microphone input:
const opusRecorder = new OpusRecorder(recorderOptions);
opusRecorder.ondataavailable = (data: Uint8Array) => {
  // opus actually always works at 48khz internally
  micDuration = opusRecorder.encodedSamplePosition / 48000;
  onOpusRecorded(data);
};
The encoded Opus data is then base64-encoded and sent via WebSocket:
export const base64EncodeOpus = (opusData: Uint8Array) => {
  let binary = "";
  for (let i = 0; i < opusData.byteLength; i++) {
    binary += String.fromCharCode(opusData[i]);
  }
  return window.btoa(binary);
};
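For reference, the backend reverses this encoding with base64.b64decode before feeding the bytes to the Opus reader. A minimal round trip shows the two sides agree byte for byte (the JavaScript loop builds a binary string one character per byte, which btoa then encodes - exactly what b64decode undoes):

```python
import base64

opus_data = bytes(range(16))  # stand-in for an encoded Opus page

# What the browser's base64EncodeOpus produces ...
wire = base64.b64encode(opus_data).decode("ascii")

# ... and what the backend recovers before sphn.OpusStreamReader sees it.
recovered = base64.b64decode(wire)
```
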

Audio Output Processing

Decoder Setup (useAudioProcessor.ts:56-77):
const decoder = new Worker("/decoderWorker.min.js");
let micDuration = 0;

decoder.onmessage = (event: MessageEvent<any>) => {
  if (!event.data) {
    return;
  }
  const frame = event.data[0];
  outputWorklet.port.postMessage({
    frame: frame,
    type: "audio",
    micDuration: micDuration,
  });
};

decoder.postMessage({
  command: "init",
  bufferLength: (960 * audioContext.sampleRate) / 24000,
  decoderSampleRate: 24000,
  outputBufferSampleRate: audioContext.sampleRate,
  resampleQuality: 0,
});
Audio Worklet Processing: The audio-output-processor worklet handles the actual audio playback, buffering incoming frames and outputting them at the correct rate.
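The worklet code itself isn't shown here, but the gist of its buffering can be modeled as a queue that accumulates decoded frames and hands out fixed-size render quanta (the Web Audio API processes audio in 128-sample blocks), padding with silence on underrun. This is an illustrative model, not the actual audio-output-processor source:

```python
from collections import deque

RENDER_QUANTUM = 128  # Web Audio worklets process 128-sample blocks

class JitterBuffer:
    """Toy model of the output worklet: buffer frames, emit fixed blocks."""

    def __init__(self):
        self.samples = deque()

    def push_frame(self, frame: list) -> None:
        # A decoded Opus frame arrives from the decoder worker.
        self.samples.extend(frame)

    def render(self) -> list:
        # Emit exactly one render quantum, zero-padding if we run dry.
        return [self.samples.popleft() if self.samples else 0.0
                for _ in range(RENDER_QUANTUM)]

buf = JitterBuffer()
buf.push_frame([0.5] * 200)   # one decoded frame arrives
first = buf.render()          # 128 buffered samples
second = buf.render()         # 72 buffered samples, then silence
```
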

Audio Analysis for Visualization

Both input and output audio streams are analyzed for visualization:
const inputAnalyser = audioContext.createAnalyser();
inputAnalyser.fftSize = 2048;
source.connect(inputAnalyser);

const outputAnalyser = audioContext.createAnalyser();
outputAnalyser.fftSize = 2048;
outputWorklet.connect(outputAnalyser);
These analyzers provide frequency domain data used by the circular audio visualizers in the UI.

Backend Audio Processing

Opus Decoding

The backend uses the sphn library for Opus stream processing (unmute/main_websocket.py:415-477):
opus_reader = sphn.OpusStreamReader(SAMPLE_RATE)
wait_for_first_opus = True

while True:
    message_raw = await websocket.receive_text()
    message: ora.ClientEvent = ClientEventAdapter.validate_json(message_raw)
    
    if isinstance(message, ora.InputAudioBufferAppend):
        opus_bytes = base64.b64decode(message.audio)
        if wait_for_first_opus:
            # Check for first packet bit (opus_bytes[5] & 2)
            if opus_bytes[5] & 2:
                wait_for_first_opus = False
            else:
                continue
        pcm = await asyncio.to_thread(opus_reader.append_bytes, opus_bytes)
        
        if pcm.size:
            await handler.receive((SAMPLE_RATE, pcm[np.newaxis, :]))
First Packet Detection: The backend waits for the first Ogg page of the Opus stream by checking the beginning-of-stream (BOS) flag, value 0x02 in the header_type byte at offset 5 of the Ogg page header. This ensures proper stream synchronization and prevents processing stale data from previous connections.
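The byte layout being tested comes from the Ogg container format (RFC 3533): bytes 0-3 are the capture pattern "OggS", byte 4 is the version, and byte 5 is header_type, where 0x01 means continuation, 0x02 beginning-of-stream, and 0x04 end-of-stream. A sketch of the same check:

```python
def is_first_ogg_page(page: bytes) -> bool:
    """True if this Ogg page carries the beginning-of-stream (BOS) flag.

    Byte 5 is header_type (0x01 continuation, 0x02 BOS, 0x04 EOS) --
    the same `opus_bytes[5] & 2` test the backend uses.
    """
    return len(page) > 5 and page[:4] == b"OggS" and bool(page[5] & 0x02)

bos_page = b"OggS" + bytes([0x00, 0x02]) + b"\x00" * 20  # first page
mid_page = b"OggS" + bytes([0x00, 0x00]) + b"\x00" * 20  # mid-stream page
```
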

Opus Encoding

For outgoing audio, the backend encodes PCM audio to Opus (unmute/main_websocket.py:520-558):
opus_writer = sphn.OpusStreamWriter(SAMPLE_RATE)

while True:
    emitted_by_handler = await handler.emit()
    
    if isinstance(emitted_by_handler, tuple):
        _sr, audio = emitted_by_handler
        audio = audio_to_float32(audio)
        opus_bytes = await asyncio.to_thread(opus_writer.append_pcm, audio)
        
        # Due to buffering/chunking, Opus doesn't necessarily output
        # something on every PCM added
        if opus_bytes:
            to_emit = ora.ResponseAudioDelta(
                delta=base64.b64encode(opus_bytes).decode("utf-8"),
            )

Audio Buffering and Synchronization

Backend Buffering

The TTS system manages audio buffering to prevent stuttering (unmute/tts/text_to_speech.py:88-94):
# Only release the audio such that it's AUDIO_BUFFER_SEC ahead of real time.
# If the value is too low, it might cause stuttering.
# If it's too high, it's difficult to control the synchronization of the text
# and the audio, because that's controlled by emit() and WebRTC. Note that some
# desynchronization can still occur if the TTS is less than real-time, because
# WebRTC will decide to do some buffering of the audio on the fly.
AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4
AUDIO_BUFFER_SEC (number): Buffer size in seconds to prevent stuttering while maintaining low latency
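The pacing logic amounts to: release the next chunk of audio only if doing so keeps us at most AUDIO_BUFFER_SEC ahead of the wall clock. The constant names match the source, but the value of FRAME_TIME_SEC and the release function below are illustrative assumptions, not the actual implementation:

```python
FRAME_TIME_SEC = 0.02  # stand-in value; the real constant lives elsewhere
AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4

def may_release(audio_sent_sec: float, elapsed_real_sec: float) -> bool:
    """Release more audio only if we'd stay at most AUDIO_BUFFER_SEC ahead
    of real time (an illustrative model of the TTS pacing check)."""
    return audio_sent_sec - elapsed_real_sec < AUDIO_BUFFER_SEC

# At t=1.0s of wall time with 1.05s of audio sent, we're 50 ms ahead:
# under the 80 ms budget, so the next frame may go out.
print(may_release(1.05, 1.0))   # -> True
print(may_release(1.10, 1.0))   # -> False
```

If TTS generation falls behind real time, no amount of buffering helps, which is why the comment in the source notes that desynchronization can still occur.
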

Frontend Buffering

The audio output processor worklet handles buffering on the client side, ensuring smooth playback even with network jitter.

Audio Format Specifications

Input Audio (Microphone)

  • Codec: Opus
  • Sample Rate: 24 kHz
  • Channels: Mono (1 channel)
  • Frame Size: 20 ms
  • Bitrate: Adaptive (Opus encoder automatic)
  • Transport: Base64-encoded over WebSocket

Output Audio (TTS)

  • Codec: Opus
  • Sample Rate: 24 kHz
  • Channels: Mono (1 channel)
  • Frame Size: Variable (based on TTS output)
  • Transport: Base64-encoded over WebSocket

Cloudflare TURN Configuration

Although Unmute doesn’t use WebRTC peer connections, it includes utilities for obtaining TURN server credentials from Cloudflare (unmute/webrtc_utils.py):
import os
import requests

def get_cloudflare_rtc_configuration():
    # see: https://fastrtc.org/deployment/#cloudflare-calls-api
    turn_key_id = os.environ.get("TURN_KEY_ID")
    turn_key_api_token = os.environ.get("TURN_KEY_API_TOKEN")
    ttl = 86400  # 24 hours

    response = requests.post(
        f"https://rtc.live.cloudflare.com/v1/turn/keys/{turn_key_id}/credentials/generate-ice-servers",
        headers={
            "Authorization": f"Bearer {turn_key_api_token}",
            "Content-Type": "application/json",
        },
        json={"ttl": ttl},
    )
    if response.ok:
        return response.json()
This utility is available for future use if Unmute moves to a peer-to-peer WebRTC architecture.

Performance Considerations

Latency Optimization

  1. Small Frame Sizes: 20ms frames minimize encoding latency
  2. Streaming Mode: streamPages: true sends data immediately without waiting for complete pages
  3. Low Complexity: encoderComplexity: 0 trades some quality for lower CPU usage and latency
  4. Minimal Buffering: AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4 keeps buffer small

Bandwidth Optimization

  1. 24kHz Sample Rate: Lower than 48kHz but sufficient for voice
  2. Mono Audio: Single channel reduces bandwidth by 50%
  3. Opus Codec: Highly efficient compression for speech
  4. Adaptive Bitrate: Opus automatically adjusts based on audio characteristics
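The savings are easy to quantify: raw 16-bit mono PCM at 24 kHz costs 384 kbps before compression, while Opus typically needs on the order of 24-32 kbps for speech. The Opus figure below is a typical speech bitrate, not a value measured from Unmute:

```python
SAMPLE_RATE = 24_000
BITS_PER_SAMPLE = 16
CHANNELS = 1

# Uncompressed PCM bandwidth at the pipeline's format.
raw_kbps = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS / 1000
print(raw_kbps)              # -> 384.0

# Typical Opus speech bitrate (an assumption, not measured): roughly
# a 12x reduction versus raw mono PCM, and more versus 48 kHz stereo.
opus_kbps = 32
print(raw_kbps / opus_kbps)  # -> 12.0
```
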

CPU Optimization

  1. Web Workers: Encoding/decoding runs in separate threads
  2. Audio Worklets: Audio processing runs on high-priority audio thread
  3. Async Processing: Backend uses asyncio.to_thread for CPU-intensive operations

Debugging Audio Issues

Enable Developer Mode

Press D in the frontend to enable developer mode, which shows:
  • Debug dictionary with internal state
  • Additional logging in the console

Check Audio Levels

The circular visualizers show audio activity:
  • User circle (right): Should pulse when speaking
  • Assistant circle (left): Should pulse during TTS output

Common Issues

  • No microphone input:
    • Check microphone permissions
    • Verify the microphone is not muted in system settings
    • Check the browser DevTools console for errors
    • Ensure echoCancellation is properly configured
  • Choppy or stuttering audio:
    • Network issues may be causing packet loss
    • Backend TTS may be slower than real-time
    • Increase the AUDIO_BUFFER_SEC value
    • Check CPU usage on the backend
  • Delayed playback:
    • This can occur when TTS is slower than real-time
    • Buffering in the audio pipeline causes delayed playback
    • Adjust AUDIO_BUFFER_SEC to balance latency vs. stability
  • Echo or feedback:
    • Ensure echoCancellation: true is set
    • Use headphones to prevent speaker feedback
    • Lower the speaker volume
