
Overview

Unmute uses WebSockets for message transport, combined with the Web Audio API and Opus encoding for efficient real-time audio processing. The system handles:
  • Microphone input capture and encoding
  • Real-time audio streaming with low latency
  • Audio decoding and playback
  • Voice activity detection (VAD)
  • Audio buffering and synchronization
Unmute does not use WebRTC peer-to-peer connections. Instead, it uses WebSockets for transport with Opus audio encoding, which provides similar efficiency in a client-server architecture.
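As a sketch of what travels over the WebSocket: audio is wrapped in JSON events carrying base64-encoded Opus bytes. The event names below mirror the `InputAudioBufferAppend` and `ResponseAudioDelta` types that appear in the backend code (OpenAI-Realtime-style naming); treat the exact JSON field layout as an assumption, not the verified wire format:

```python
import base64
import json

# Stand-in for an encoded Opus page produced by the encoder worker.
opus_packet = b"OggS" + bytes([0x00, 0x02]) + b"\x00" * 20

# Client -> server: append microphone audio to the input buffer.
append_event = json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(opus_packet).decode("utf-8"),
})

# Server -> client: TTS audio arrives as base64 Opus in a delta event.
delta_event = json.loads('{"type": "response.audio.delta", "delta": "..."}')
```
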

Audio Pipeline Architecture

Microphone → MediaStream → AudioContext → OpusEncoder → WebSocket → Backend

Backend pipeline: STT (Speech-to-Text) → LLM (Language Model) → TTS (Text-to-Speech)

Speakers ← AudioContext ← OpusDecoder ← WebSocket ← Backend

Frontend Audio Processing

Audio Processor Setup

The useAudioProcessor hook (frontend/src/app/useAudioProcessor.ts) manages the complete audio pipeline:
export interface AudioProcessor {
  audioContext: AudioContext;
  opusRecorder: OpusRecorder;
  decoder: DecoderWorker;
  outputWorklet: AudioWorkletNode;
  inputAnalyser: AnalyserNode;
  outputAnalyser: AnalyserNode;
  mediaStreamDestination: MediaStreamAudioDestinationNode;
}

Microphone Input Processing

Configuration (useAudioProcessor.ts:83-104):
const recorderOptions = {
  mediaTrackConstraints: {
    audio: {
      echoCancellation: true,
      noiseSuppression: false,
      autoGainControl: true,
      channelCount: 1,
    },
    video: false,
  },
  encoderPath: "/encoderWorker.min.js",
  bufferLength: Math.round((960 * audioContext.sampleRate) / 24000),
  encoderFrameSize: 20,
  encoderSampleRate: 24000,
  maxFramesPerPage: 2,
  numberOfChannels: 1,
  recordingGain: 1,
  resampleQuality: 3,
  encoderComplexity: 0,
  encoderApplication: 2049,
  streamPages: true,
};
Key Settings:
  • echoCancellation (boolean): Enabled to prevent feedback from speakers
  • noiseSuppression (boolean): Disabled - handled by backend processing
  • autoGainControl (boolean): Enabled for consistent audio levels
  • encoderSampleRate (number): 24 kHz - balances quality and bandwidth
  • encoderFrameSize (number): 20 ms frames for low latency
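The numbers in the configuration fit together: 960 samples at 24 kHz is 40 ms, i.e. two 20 ms encoder frames, which matches encoderFrameSize: 20 and maxFramesPerPage: 2. The bufferLength expression rescales those 960 samples to whatever sample rate the device's AudioContext actually runs at. A quick check of the arithmetic:

```python
def buffer_length(device_sample_rate: int, samples_at_24k: int = 960) -> int:
    """Mirror of Math.round((960 * audioContext.sampleRate) / 24000)."""
    return round(samples_at_24k * device_sample_rate / 24000)

# 960 samples at 24 kHz is 40 ms: two 20 ms encoder frames per Ogg page.
frame_ms = 960 / 24000 * 1000   # -> 40.0

# On a 48 kHz device context the buffer holds 1920 samples; on 44.1 kHz, 1764.
print(buffer_length(48000))     # -> 1920
print(buffer_length(44100))     # -> 1764
```
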

Opus Encoding

The frontend uses the opus-recorder library to encode microphone input:
const opusRecorder = new OpusRecorder(recorderOptions);
opusRecorder.ondataavailable = (data: Uint8Array) => {
  // opus actually always works at 48khz internally
  micDuration = opusRecorder.encodedSamplePosition / 48000;
  onOpusRecorded(data);
};
The encoded Opus data is then base64-encoded and sent via WebSocket:
export const base64EncodeOpus = (opusData: Uint8Array) => {
  let binary = "";
  for (let i = 0; i < opusData.byteLength; i++) {
    binary += String.fromCharCode(opusData[i]);
  }
  return window.btoa(binary);
};
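For reference, the backend reverses this encoding with base64.b64decode before feeding the bytes to the Opus reader. A minimal round trip shows the two sides agree byte for byte (the JavaScript loop builds a binary string one character per byte, which btoa then encodes - exactly what b64decode undoes):

```python
import base64

opus_data = bytes(range(16))  # stand-in for an encoded Opus page

# What the browser's base64EncodeOpus produces ...
wire = base64.b64encode(opus_data).decode("ascii")

# ... and what the backend recovers before sphn.OpusStreamReader sees it.
recovered = base64.b64decode(wire)
```
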

Audio Output Processing

Decoder Setup (useAudioProcessor.ts:56-77):
const decoder = new Worker("/decoderWorker.min.js");
let micDuration = 0;

decoder.onmessage = (event: MessageEvent<any>) => {
  if (!event.data) {
    return;
  }
  const frame = event.data[0];
  outputWorklet.port.postMessage({
    frame: frame,
    type: "audio",
    micDuration: micDuration,
  });
};

decoder.postMessage({
  command: "init",
  bufferLength: (960 * audioContext.sampleRate) / 24000,
  decoderSampleRate: 24000,
  outputBufferSampleRate: audioContext.sampleRate,
  resampleQuality: 0,
});
Audio Worklet Processing: The audio-output-processor worklet handles the actual audio playback, buffering incoming frames and outputting them at the correct rate.
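The worklet code itself isn't shown here, but the gist of its buffering can be modeled as a queue that accumulates decoded frames and hands out fixed-size render quanta (the Web Audio API processes audio in 128-sample blocks), padding with silence on underrun. This is an illustrative model, not the actual audio-output-processor source:

```python
from collections import deque

RENDER_QUANTUM = 128  # Web Audio worklets process 128-sample blocks

class JitterBuffer:
    """Toy model of the output worklet: buffer frames, emit fixed blocks."""

    def __init__(self):
        self.samples = deque()

    def push_frame(self, frame: list) -> None:
        # A decoded Opus frame arrives from the decoder worker.
        self.samples.extend(frame)

    def render(self) -> list:
        # Emit exactly one render quantum, zero-padding if we run dry.
        return [self.samples.popleft() if self.samples else 0.0
                for _ in range(RENDER_QUANTUM)]

buf = JitterBuffer()
buf.push_frame([0.5] * 200)   # one decoded frame arrives
first = buf.render()          # 128 buffered samples
second = buf.render()         # 72 buffered samples, then silence
```
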

Audio Analysis for Visualization

Both input and output audio streams are analyzed for visualization:
const inputAnalyser = audioContext.createAnalyser();
inputAnalyser.fftSize = 2048;
source.connect(inputAnalyser);

const outputAnalyser = audioContext.createAnalyser();
outputAnalyser.fftSize = 2048;
outputWorklet.connect(outputAnalyser);
These analyzers provide frequency domain data used by the circular audio visualizers in the UI.

Backend Audio Processing

Opus Decoding

The backend uses the sphn library for Opus stream processing (unmute/main_websocket.py:415-477):
opus_reader = sphn.OpusStreamReader(SAMPLE_RATE)
wait_for_first_opus = True

while True:
    message_raw = await websocket.receive_text()
    message: ora.ClientEvent = ClientEventAdapter.validate_json(message_raw)
    
    if isinstance(message, ora.InputAudioBufferAppend):
        opus_bytes = base64.b64decode(message.audio)
        if wait_for_first_opus:
            # Check for first packet bit (opus_bytes[5] & 2)
            if opus_bytes[5] & 2:
                wait_for_first_opus = False
            else:
                continue
        pcm = await asyncio.to_thread(opus_reader.append_bytes, opus_bytes)
        
        if pcm.size:
            await handler.receive((SAMPLE_RATE, pcm[np.newaxis, :]))
First Packet Detection: The backend waits for the first Ogg page of the Opus stream by checking the beginning-of-stream (BOS) flag, value 0x02 in the header_type byte at offset 5 of the Ogg page header. This ensures proper stream synchronization and prevents processing stale data from previous connections.
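The byte layout being tested comes from the Ogg container format (RFC 3533): bytes 0-3 are the capture pattern "OggS", byte 4 is the version, and byte 5 is header_type, where 0x01 means continuation, 0x02 beginning-of-stream, and 0x04 end-of-stream. A sketch of the same check:

```python
def is_first_ogg_page(page: bytes) -> bool:
    """True if this Ogg page carries the beginning-of-stream (BOS) flag.

    Byte 5 is header_type (0x01 continuation, 0x02 BOS, 0x04 EOS) --
    the same `opus_bytes[5] & 2` test the backend uses.
    """
    return len(page) > 5 and page[:4] == b"OggS" and bool(page[5] & 0x02)

bos_page = b"OggS" + bytes([0x00, 0x02]) + b"\x00" * 20  # first page
mid_page = b"OggS" + bytes([0x00, 0x00]) + b"\x00" * 20  # mid-stream page
```
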

Opus Encoding

For outgoing audio, the backend encodes PCM audio to Opus (unmute/main_websocket.py:520-558):
opus_writer = sphn.OpusStreamWriter(SAMPLE_RATE)

while True:
    emitted_by_handler = await handler.emit()
    
    if isinstance(emitted_by_handler, tuple):
        _sr, audio = emitted_by_handler
        audio = audio_to_float32(audio)
        opus_bytes = await asyncio.to_thread(opus_writer.append_pcm, audio)
        
        # Due to buffering/chunking, Opus doesn't necessarily output
        # something on every PCM added
        if opus_bytes:
            to_emit = ora.ResponseAudioDelta(
                delta=base64.b64encode(opus_bytes).decode("utf-8"),
            )

Audio Buffering and Synchronization

Backend Buffering

The TTS system manages audio buffering to prevent stuttering (unmute/tts/text_to_speech.py:88-94):
# Only release the audio such that it's AUDIO_BUFFER_SEC ahead of real time.
# If the value is too low, it might cause stuttering.
# If it's too high, it's difficult to control the synchronization of the text
# and the audio, because that's controlled by emit() and WebRTC. Note that some
# desynchronization can still occur if the TTS is less than real-time, because
# WebRTC will decide to do some buffering of the audio on the fly.
AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4
AUDIO_BUFFER_SEC (number): Buffer size in seconds to prevent stuttering while maintaining low latency
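The pacing logic amounts to: release the next chunk of audio only if doing so keeps us at most AUDIO_BUFFER_SEC ahead of the wall clock. The constant names match the source, but the value of FRAME_TIME_SEC and the release function below are illustrative assumptions, not the actual implementation:

```python
FRAME_TIME_SEC = 0.02  # stand-in value; the real constant lives elsewhere
AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4

def may_release(audio_sent_sec: float, elapsed_real_sec: float) -> bool:
    """Release more audio only if we'd stay at most AUDIO_BUFFER_SEC ahead
    of real time (an illustrative model of the TTS pacing check)."""
    return audio_sent_sec - elapsed_real_sec < AUDIO_BUFFER_SEC

# At t=1.0s of wall time with 1.05s of audio sent, we're 50 ms ahead:
# under the 80 ms budget, so the next frame may go out.
print(may_release(1.05, 1.0))   # -> True
print(may_release(1.10, 1.0))   # -> False
```

If TTS generation falls behind real time, no amount of buffering helps, which is why the comment in the source notes that desynchronization can still occur.
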

Frontend Buffering

The audio output processor worklet handles buffering on the client side, ensuring smooth playback even with network jitter.

Audio Format Specifications

Input Audio (Microphone)

  • Codec: Opus
  • Sample Rate: 24 kHz
  • Channels: Mono (1 channel)
  • Frame Size: 20 ms
  • Bitrate: Adaptive (Opus encoder automatic)
  • Transport: Base64-encoded over WebSocket

Output Audio (TTS)

  • Codec: Opus
  • Sample Rate: 24 kHz
  • Channels: Mono (1 channel)
  • Frame Size: Variable (based on TTS output)
  • Transport: Base64-encoded over WebSocket

Cloudflare TURN Configuration

Although Unmute doesn’t use WebRTC peer connections, it includes utilities for obtaining TURN server credentials from Cloudflare (unmute/webrtc_utils.py):
import os
import requests

def get_cloudflare_rtc_configuration():
    # see: https://fastrtc.org/deployment/#cloudflare-calls-api
    turn_key_id = os.environ.get("TURN_KEY_ID")
    turn_key_api_token = os.environ.get("TURN_KEY_API_TOKEN")
    ttl = 86400  # 24 hours

    response = requests.post(
        f"https://rtc.live.cloudflare.com/v1/turn/keys/{turn_key_id}/credentials/generate-ice-servers",
        headers={
            "Authorization": f"Bearer {turn_key_api_token}",
            "Content-Type": "application/json",
        },
        json={"ttl": ttl},
    )
    if response.ok:
        return response.json()
This utility is available for future use if Unmute moves to a peer-to-peer WebRTC architecture.

Performance Considerations

Latency Optimization

  1. Small Frame Sizes: 20ms frames minimize encoding latency
  2. Streaming Mode: streamPages: true sends data immediately without waiting for complete pages
  3. Low Complexity: encoderComplexity: 0 trades some quality for lower CPU usage and latency
  4. Minimal Buffering: AUDIO_BUFFER_SEC = FRAME_TIME_SEC * 4 keeps buffer small

Bandwidth Optimization

  1. 24kHz Sample Rate: Lower than 48kHz but sufficient for voice
  2. Mono Audio: Single channel reduces bandwidth by 50%
  3. Opus Codec: Highly efficient compression for speech
  4. Adaptive Bitrate: Opus automatically adjusts based on audio characteristics
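The savings are easy to quantify: raw 16-bit mono PCM at 24 kHz costs 384 kbps before compression, while Opus typically needs on the order of 24-32 kbps for speech. The Opus figure below is a typical speech bitrate, not a value measured from Unmute:

```python
SAMPLE_RATE = 24_000
BITS_PER_SAMPLE = 16
CHANNELS = 1

# Uncompressed PCM bandwidth at the pipeline's format.
raw_kbps = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS / 1000
print(raw_kbps)              # -> 384.0

# Typical Opus speech bitrate (an assumption, not measured): roughly
# a 12x reduction versus raw mono PCM, and more versus 48 kHz stereo.
opus_kbps = 32
print(raw_kbps / opus_kbps)  # -> 12.0
```
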

CPU Optimization

  1. Web Workers: Encoding/decoding runs in separate threads
  2. Audio Worklets: Audio processing runs on high-priority audio thread
  3. Async Processing: Backend uses asyncio.to_thread for CPU-intensive operations

Debugging Audio Issues

Enable Developer Mode

Press D in the frontend to enable developer mode, which shows:
  • Debug dictionary with internal state
  • Additional logging in the console

Check Audio Levels

The circular visualizers show audio activity:
  • User circle (right): Should pulse when speaking
  • Assistant circle (left): Should pulse during TTS output

Common Issues

  • No microphone input:
    • Check microphone permissions
    • Verify the microphone is not muted in system settings
    • Check the browser DevTools console for errors
    • Ensure echoCancellation is properly configured
  • Choppy or stuttering audio:
    • Network issues may be causing packet loss
    • Backend TTS may be slower than real-time
    • Increase the AUDIO_BUFFER_SEC value
    • Check CPU usage on the backend
  • Delayed playback:
    • This can occur when TTS is slower than real-time
    • Buffering in the audio pipeline causes delayed playback
    • Adjust AUDIO_BUFFER_SEC to balance latency vs. stability
  • Echo or feedback:
    • Ensure echoCancellation: true is set
    • Use headphones to prevent speaker feedback
    • Lower the speaker volume
