Overview

Airi’s voice chat system provides low-latency, real-time voice interaction through a unified audio pipeline that handles audio input/output, voice activity detection (VAD), speech recognition, and text-to-speech synthesis.

Architecture

The voice chat system consists of several integrated components, wired together end to end in the sketch after this list:
  • Audio Context Management: High-quality audio processing with configurable sample rates
  • Voice Activity Detection: Client-side speech detection using Silero VAD
  • Audio Pipeline: Streaming audio processing with resampling and encoding
  • Speech Pipeline: Orchestrates TTS generation, playback scheduling, and intent management
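
This sketch combines the APIs documented in the sections that follow; transcribe and generateReply are hypothetical stand-ins for your speech-recognition and language-model calls, and the worklet message wiring is an assumption.

import { createAudioSource, createResamplingWorkletNode, initializeAudioContext } from '@proj-airi/audio/audio-context'
import { createVAD } from '@proj-airi/stage-ui/workers/vad'

declare function transcribe(buffer: Float32Array): Promise<string> // hypothetical STT call
declare function generateReply(text: string): Promise<string>      // hypothetical LLM call

// 1. Audio context and microphone input
await initializeAudioContext(48000)
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true })
const source = createAudioSource(mediaStream)

// 2. Resample 48kHz microphone audio to the 16kHz the VAD expects
const worklet = createResamplingWorkletNode(source, {
  inputSampleRate: 48000,
  outputSampleRate: 16000,
  channels: 1
})

// 3. Detect speech and hand finished segments to recognition
const vad = await createVAD({ sampleRate: 16000 })
await vad.initialize()
worklet.port.onmessage = event => vad.processAudio(event.data) // assumed message shape

vad.on('speech-ready', async ({ buffer }) => {
  const text = await transcribe(buffer)
  const reply = await generateReply(text)
  // 4. reply is then spoken through the speech pipeline (see below)
})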

Audio Context

The audio context provides the foundation for all audio operations in Airi.

Initialization

import { initializeAudioContext, getAudioContextState } from '@proj-airi/audio/audio-context'

// Initialize with high-quality sample rate
const audioContext = await initializeAudioContext(48000)

// Check state
const state = getAudioContextState()
console.log(state.isReady, state.sampleRate)

Creating Audio Nodes

import {
  createAudioSource,
  createAudioAnalyser,
  createAudioGainNode,
  createResamplingWorkletNode
} from '@proj-airi/audio/audio-context'

// Create source from MediaStream
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true })
const source = createAudioSource(mediaStream)

// Create analyser for visualization
const analyser = createAudioAnalyser({
  fftSize: 2048,
  smoothingTimeConstant: 0.8
})

// Create gain node for volume control
const gainNode = createAudioGainNode(0.8)

// Create resampling worklet for format conversion
const worklet = createResamplingWorkletNode(source, {
  inputSampleRate: 48000,
  outputSampleRate: 16000,
  channels: 1,
  converterType: 2 // SRC_SINC_MEDIUM_QUALITY
})

// Connect nodes
source.connect(gainNode)
gainNode.connect(analyser)
analyser.connect(worklet)

Audio Context State Management

import {
  subscribeToAudioContext,
  suspendAudioContext,
  resumeAudioContext
} from '@proj-airi/audio/audio-context'

// Subscribe to state changes
const unsubscribe = subscribeToAudioContext((state) => {
  console.log('Audio context state:', state.state)
  console.log('Current time:', state.currentTime)
  console.log('Worklets loaded:', state.workletLoaded)
})

// Suspend/resume context
await suspendAudioContext()
await resumeAudioContext()

// Cleanup subscription
unsubscribe()

Voice Activity Detection (VAD)

VAD automatically detects when the user is speaking, enabling hands-free interaction without a push-to-talk button.

VAD Configuration

import { createVAD } from '@proj-airi/stage-ui/workers/vad'

const vad = await createVAD({
  sampleRate: 16000,
  speechThreshold: 0.3,      // Probability threshold for speech start
  exitThreshold: 0.1,        // Probability threshold for speech end
  minSilenceDurationMs: 400, // Min silence before ending speech
  speechPadMs: 80,           // Padding around speech segments
  minSpeechDurationMs: 250,  // Min duration to consider as speech
  maxBufferDuration: 30,     // Max recording duration in seconds
  newBufferSize: 512         // Audio chunk size
})
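
speechThreshold and exitThreshold form a hysteresis band: speech starts only once the model's per-chunk probability rises above speechThreshold, and ends only after it falls below exitThreshold and stays there for minSilenceDurationMs. The sketch below is a simplified illustration of that logic, not the actual worker implementation.

// Illustrative hysteresis over per-chunk speech probabilities
let speaking = false
let silenceMs = 0

function onProbability(probability: number, chunkMs: number) {
  if (!speaking && probability > 0.3) { // speechThreshold: enter speech
    speaking = true
    silenceMs = 0
  }
  else if (speaking && probability < 0.1) { // exitThreshold: count silence
    silenceMs += chunkMs
    if (silenceMs >= 400) // minSilenceDurationMs: end speech
      speaking = false
  }
  else if (speaking) {
    silenceMs = 0 // probability back inside the band: still speaking
  }
}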

Using VAD

// Listen to VAD events
vad.on('speech-start', () => {
  console.log('User started speaking')
})

vad.on('speech-end', () => {
  console.log('User stopped speaking')
})

vad.on('speech-ready', ({ buffer, duration }) => {
  console.log(`Speech segment ready: ${duration}ms`)
  // Process the audio buffer
})

vad.on('debug', ({ data }) => {
  console.log('Speech probability:', data.probability)
})

// Process audio
await vad.initialize()
const audioBuffer = new Float32Array(512)
await vad.processAudio(audioBuffer)
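
In practice the buffers come from the resampling worklet rather than being constructed by hand. A minimal sketch of that wiring, assuming the worklet posts its 16kHz Float32Array chunks through its message port (the exact message shape may differ):

// Forward resampled chunks from the worklet into the VAD
worklet.port.onmessage = (event: MessageEvent<Float32Array>) => {
  vad.processAudio(event.data)
}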

Vue Composable for VAD

import { useVAD } from '@proj-airi/stage-ui/stores/ai/models/vad'
import vadWorkletUrl from '@proj-airi/stage-ui/workers/vad/process.worklet?worker&url'

const vad = useVAD(vadWorkletUrl, {
  threshold: 0.6,
  onSpeechStart: () => {
    console.log('Speech started')
  },
  onSpeechEnd: () => {
    console.log('Speech ended')
  }
})

// Initialize and start
await vad.init()
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
await vad.start(stream)

// Access state
console.log(vad.isSpeech.value)      // Boolean
console.log(vad.isSpeechProb.value)  // 0-1 probability
console.log(vad.loaded.value)        // Boolean

// Cleanup
vad.dispose()
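
The returned refs are ordinary Vue reactive state, so they can drive UI directly, for example by watching isSpeech to toggle a recording indicator (the indicator ref is illustrative):

import { ref, watch } from 'vue'

const recordingIndicator = ref(false)

// Mirror speech detection into the UI
watch(vad.isSpeech, (speaking) => {
  recordingIndicator.value = speaking
})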

Speech Pipeline

The speech pipeline manages TTS generation, playback scheduling, and intent prioritization.

Creating a Speech Pipeline

import {
  createPriorityResolver,
  createSpeechPipeline,
  createTtsSegmentStream // assumed to ship from the same package
} from '@proj-airi/pipelines-audio'

const pipeline = createSpeechPipeline({
  // TTS generation function
  tts: async (request, signal) => {
    // Generate audio from text
    const audio = await generateSpeech(request.text, signal)
    return audio
  },
  
  // Playback manager
  playback: {
    schedule: (item) => {
      // Schedule audio for playback
      playAudio(item.audio)
    },
    stopAll: (reason) => {
      // Stop all playback
      stopAllAudio()
    },
    stopByIntent: (intentId, reason) => {
      // Stop specific intent
      stopAudioByIntent(intentId)
    },
    stopByOwner: (ownerId, reason) => {
      // Stop by owner ID
      stopAudioByOwner(ownerId)
    },
    onStart: (listener) => { /* ... */ },
    onEnd: (listener) => { /* ... */ },
    onInterrupt: (listener) => { /* ... */ },
    onReject: (listener) => { /* ... */ }
  },
  
  logger: console,
  priority: createPriorityResolver(),
  segmenter: createTtsSegmentStream
})
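
A minimal playback manager can be backed directly by Web Audio and passed as the playback option above. The sketch below assumes each scheduled item carries a decoded AudioBuffer along with id, intentId, and ownerId fields (those names are assumptions, and the interrupt/reject listeners are stubbed out):

// audioContext as returned by initializeAudioContext (see Audio Context above)
interface ActiveEntry {
  source: AudioBufferSourceNode
  intentId: string
  ownerId: string
}

const active = new Map<string, ActiveEntry>()

const playback = {
  schedule: (item) => {
    const source = audioContext.createBufferSource()
    source.buffer = item.audio // assumed: a decoded AudioBuffer
    source.connect(audioContext.destination)
    source.onended = () => active.delete(item.id)
    active.set(item.id, { source, intentId: item.intentId, ownerId: item.ownerId })
    source.start()
  },
  stopAll: (reason) => {
    active.forEach(({ source }) => source.stop())
    active.clear()
  },
  stopByIntent: (intentId, reason) => {
    active.forEach((entry, id) => {
      if (entry.intentId === intentId) {
        entry.source.stop()
        active.delete(id)
      }
    })
  },
  stopByOwner: (ownerId, reason) => {
    active.forEach((entry, id) => {
      if (entry.ownerId === ownerId) {
        entry.source.stop()
        active.delete(id)
      }
    })
  },
  onStart: (listener) => { /* register as needed */ },
  onEnd: (listener) => { /* register as needed */ },
  onInterrupt: (listener) => { /* register as needed */ },
  onReject: (listener) => { /* register as needed */ }
}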

Using Speech Intents

// Open an intent for speech output
const intent = pipeline.openIntent({
  priority: 'high',
  behavior: 'interrupt', // or 'queue' or 'replace'
  ownerId: 'user-message-123'
})

// Write text tokens
intent.writeLiteral('Hello, ')
intent.writeLiteral('how can I help you today?')

// Write special tokens (emotions, delays, etc.)
intent.writeSpecial('emotion:happy')

// Flush immediately
intent.writeFlush()

// End the intent
intent.end()

// Or cancel it
intent.cancel('User interrupted')
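
A common integration streams language-model output straight into an intent as tokens arrive, cancelling on failure. A sketch assuming an async-iterable token stream (llmTokens is hypothetical):

declare const llmTokens: AsyncIterable<string> // hypothetical token stream

const streamIntent = pipeline.openIntent({ priority: 'normal', behavior: 'queue' })

try {
  for await (const token of llmTokens)
    streamIntent.writeLiteral(token)
  streamIntent.end()
}
catch {
  streamIntent.cancel('Generation failed')
}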

Pipeline Events

// Listen to pipeline events
pipeline.on('onIntentStart', (intentId) => {
  console.log('Intent started:', intentId)
})

pipeline.on('onIntentEnd', (intentId) => {
  console.log('Intent completed:', intentId)
})

pipeline.on('onSegment', (segment) => {
  console.log('Text segment:', segment.text)
})

pipeline.on('onTtsRequest', (request) => {
  console.log('TTS requested:', request.text)
})

pipeline.on('onTtsResult', (result) => {
  console.log('TTS completed:', result.segmentId)
})

pipeline.on('onPlaybackStart', ({ item, startedAt }) => {
  console.log('Playback started:', item.text)
})

pipeline.on('onPlaybackEnd', ({ item, endedAt }) => {
  console.log('Playback ended')
})

Configuration

Audio Quality Settings

// High-quality audio (default)
const audioContext = await initializeAudioContext(48000)

// Lower latency (trade-off with quality)
const audioContext = await initializeAudioContext(24000)

VAD Sensitivity

// More sensitive (picks up quieter speech)
const vad = await createVAD({
  speechThreshold: 0.2,
  exitThreshold: 0.05
})

// Less sensitive (reduces false positives)
const vad = await createVAD({
  speechThreshold: 0.5,
  exitThreshold: 0.2
})

Speech Pipeline Priority

import { createPriorityResolver } from '@proj-airi/pipelines-audio'

const priority = createPriorityResolver()

// Use priority levels
const intent = pipeline.openIntent({
  priority: 'high',    // or 'normal', 'low', or a number
  behavior: 'interrupt'
})

Performance Considerations

  • Sample Rate: Higher sample rates (48kHz) provide better quality but use more processing power
  • Buffer Size: Smaller buffers reduce latency but may cause audio glitches on slower devices (the latency check after this list shows how to measure the result)
  • VAD Thresholds: Adjust based on microphone quality and ambient noise levels
  • Worklet Processing: Audio worklets run on a separate thread for optimal performance
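
To see what the current settings actually cost, inspect the context's own latency figures (baseLatency is widely supported; outputLatency is not available in every browser):

// audioContext as returned by initializeAudioContext
console.log('Base latency (s):', audioContext.baseLatency)
if ('outputLatency' in audioContext)
  console.log('Output latency (s):', audioContext.outputLatency)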

Best Practices

  1. Initialize Early: Set up the audio context before user interaction to avoid delays
  2. Cleanup Resources: Always disconnect and remove audio nodes when done (see the teardown sketch after this list)
  3. Handle Errors: Audio context can fail on iOS without user gesture
  4. Monitor State: Subscribe to context state changes for debugging
  5. Test Across Devices: Audio behavior varies significantly across browsers and devices
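
A teardown routine following practice 2, reusing the names from the earlier examples:

// Disconnect the processing graph
source.disconnect()
gainNode.disconnect()
analyser.disconnect()
worklet.disconnect()

// Release the microphone
mediaStream.getTracks().forEach(track => track.stop())

// Tear down the VAD and stop listening to context state changes
vad.dispose()
unsubscribe()

// Suspend the shared context if nothing else needs audio
await suspendAudioContext()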

Troubleshooting

Audio Context Suspended

if (audioContext.state === 'suspended') {
  // Resume on user interaction
  document.addEventListener('click', async () => {
    await resumeAudioContext()
  }, { once: true })
}

VAD Not Detecting Speech

import { createAudioAnalyser, createAudioSource } from '@proj-airi/audio/audio-context'

// Check microphone permissions
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })

// Verify audio is flowing
const source = createAudioSource(stream)
const analyser = createAudioAnalyser()
source.connect(analyser)
const dataArray = new Uint8Array(analyser.frequencyBinCount)
analyser.getByteTimeDomainData(dataArray)
console.log('Audio level:', Math.max(...dataArray)) // silence hovers around 128

High Latency

// Reduce buffer sizes
const worklet = createResamplingWorkletNode(source, {
  bufferSize: 512 // Smaller buffer = lower latency
})

// Use lower sample rate
const audioContext = await initializeAudioContext(24000)
