
Overview

The /api/voice-agent endpoint initializes real-time voice AI sessions using OpenAI’s Realtime API. It provides a conversational voice assistant with access to memory storage, web search, and Windows MCP tools for desktop automation. The endpoint returns session credentials for establishing a WebSocket connection.
This endpoint creates a session token for OpenAI’s Realtime API. You’ll need to connect to the WebSocket using the returned credentials.

Endpoint

POST /api/voice-agent

Authentication

Requires authentication via cookies or Authorization header.
Returns 401 Unauthorized if authentication fails.
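
Since the endpoint accepts either cookies or an Authorization header, a small helper for building the request init can keep the header logic in one place. This is a hypothetical convenience function, not part of the API; the header name and bearer scheme are assumptions based on common practice.

```javascript
// Hypothetical helper: build the fetch init for /api/voice-agent.
// With cookie auth, pass no token; with token auth, a Bearer header is added.
function buildVoiceAgentInit(token, body = {}) {
  const headers = { 'Content-Type': 'application/json' };
  if (token) {
    headers['Authorization'] = `Bearer ${token}`;
  }
  return { method: 'POST', headers, body: JSON.stringify(body) };
}
```

Usage: `fetch('/api/voice-agent', buildVoiceAgentInit(myToken, { voice: 'ash' }))`.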

Requirements

OPENAI_API_KEY
environment variable
required
OpenAI API key must be configured in environment variables. Returns 500 Internal Server Error if not set.

Request Body

model
string
OpenAI Realtime model to use. Defaults to "gpt-realtime-mini" if not specified.
voice
string
Voice personality for the assistant. Defaults to "ash" if not specified.
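
Both fields are optional. A minimal sketch of how the documented defaults could be applied client-side (the helper name is hypothetical; the server applies the same defaults if fields are omitted):

```javascript
// Hypothetical normalizer applying the documented defaults
// before sending the request body.
function normalizeVoiceAgentBody(body = {}) {
  return {
    model: body.model ?? 'gpt-realtime-mini',
    voice: body.voice ?? 'ash',
  };
}
```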

Response

Returns a JSON object containing the OpenAI Realtime session credentials:
id
string
Unique session ID for the Realtime API connection.
model
string
The model being used for this session.
expires_at
number
Unix timestamp when the session expires.
client_secret
object
WebSocket connection credentials.
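
Because sessions expire, it is worth checking expires_at before reusing credentials. A minimal sketch, assuming expires_at is a Unix timestamp in seconds as shown in the example response (the function name and safety margin are illustrative):

```javascript
// Sketch: decide whether session credentials are still usable,
// leaving a safety margin before the documented expires_at timestamp.
function isSessionUsable(session, marginSeconds = 30, nowSeconds = Date.now() / 1000) {
  return session.expires_at - nowSeconds > marginSeconds;
}
```

If this returns false, request a fresh session from /api/voice-agent rather than connecting with soon-to-expire credentials.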

Available Tools

The voice agent has access to all tools in the system:

Memory Tools

Tools for searching and storing user memories, used to recall context and personalize responses.

Windows MCP Tools

Full desktop automation capabilities:
  • State-Tool: Capture desktop state and interactive elements
  • Click-Tool: Click at coordinates
  • Type-Tool: Type text at coordinates
  • Move-Tool: Move mouse cursor
  • Drag-Tool: Drag and drop
  • Scroll-Tool: Scroll windows
  • Shortcut-Tool: Execute keyboard shortcuts
  • App-Tool: Launch/resize/switch applications
  • Powershell-Tool: Execute PowerShell commands (preferred for opening apps/files/URLs)
  • Wait-Tool: Pause for UI loading
  • Scrape-Tool: Fetch content from URLs or browser tabs
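
During a session, the Realtime API surfaces tool invocations as function-call events that your client can route to local handlers. A minimal dispatch sketch, assuming an event carrying the tool name and JSON-encoded arguments; the handlers shown are hypothetical stand-ins for the tools listed above:

```javascript
// Sketch: route a completed tool call to a local handler by name.
// Handler bodies are placeholders, not real desktop-automation code.
const toolHandlers = {
  'Click-Tool': ({ x, y }) => `clicked ${x},${y}`,
  'Type-Tool': ({ text }) => `typed ${text}`,
};

function handleToolCall(event) {
  // event: { name: 'Click-Tool', arguments: '{"x":1,"y":2}' }
  const handler = toolHandlers[event.name];
  if (!handler) return null;
  return handler(JSON.parse(event.arguments));
}
```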

Voice-Specific Tools

Additional tools defined in DEFAULT_VOICE_TOOLS for voice interaction features.

Example Request

const response = await fetch('/api/voice-agent', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-realtime-mini',
    voice: 'ash'
  })
});

const session = await response.json();
console.log(session);

Example Response

{
  "id": "sess_abc123xyz",
  "model": "gpt-realtime-mini",
  "expires_at": 1234567890,
  "client_secret": {
    "value": "wss://api.openai.com/v1/realtime?model=gpt-realtime-mini",
    "expires_at": 1234567890
  },
  "modalities": ["text", "audio"],
  "instructions": "You are a voice AI assistant...",
  "voice": "ash",
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "whisper-1"
  },
  "turn_detection": {
    "type": "server_vad"
  },
  "tools": [...]
}

Connecting to the WebSocket

Once you have the session credentials, connect to the Realtime API:
const ws = new WebSocket(session.client_secret.value);

ws.onopen = () => {
  console.log('Connected to voice agent');
  
  // Send audio or text messages
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [{
        type: 'input_text',
        text: 'Hello, who am I?'
      }]
    }
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Received:', data);
  
  // Handle different event types:
  // - conversation.item.created
  // - response.audio.delta (audio chunks)
  // - response.done
  // - etc.
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('Disconnected from voice agent');
};
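
Audio events such as response.audio.delta carry base64-encoded PCM16 chunks. A minimal decoder sketch (the function name is illustrative; it assumes the 16-bit little-endian PCM format shown in the example response):

```javascript
// Sketch: decode a base64 PCM16 chunk into an Int16Array of samples,
// suitable for queuing into an audio playback pipeline.
function decodeAudioDelta(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Int16Array(bytes.buffer);
}
```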

Sending Audio Input

// Send PCM16 audio data
const audioData = new Int16Array(buffer); // Your audio buffer

// Convert to base64 in chunks: spreading a large buffer into
// String.fromCharCode in one call can exceed the argument limit
const bytes = new Uint8Array(audioData.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += 0x8000) {
  binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
}
const base64Audio = btoa(binary);

ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64Audio
}));

// Commit the audio buffer when user stops speaking
ws.send(JSON.stringify({
  type: 'input_audio_buffer.commit'
}));

Error Responses

401 Unauthorized
error
Returned when authentication fails.
"Unauthorized"
500 Internal Server Error
error
Returned when OpenAI API key is not configured or OpenAI API request fails.
{
  "error": "OPENAI_API_KEY is not set"
}
{
  "error": "OpenAI API error: 500"
}
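
A client can map these documented failure modes to readable messages before surfacing them to users. A minimal sketch (the function name is hypothetical; the status codes and error bodies are the ones listed above):

```javascript
// Sketch: translate a failed /api/voice-agent response into a message.
// Covers only the documented 401 and 500 cases.
function describeError(status, body) {
  if (status === 401) {
    return 'Unauthorized: check cookies or the Authorization header';
  }
  if (status === 500) {
    return `Server error: ${body && body.error ? body.error : 'unknown'}`;
  }
  return `Unexpected status ${status}`;
}
```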

Features

  • Real-time Voice Interaction: Natural conversational voice AI with low latency
  • Automatic Transcription: Uses Whisper-1 for input audio transcription
  • Server VAD: Server-side Voice Activity Detection for natural turn-taking
  • Memory-Powered: Remembers user context and personalizes responses
  • Desktop Automation: Full control over Windows desktop via voice commands
  • Web Search Integration: Access to current information via Tavily
  • Tool Calling: Can invoke multiple tools during conversation
  • Conversational Style: Speaks naturally, avoiding markdown and code blocks in speech

Voice Assistant Behavior

The voice agent is optimized for natural conversation:
  • Searches user memories immediately at conversation start
  • Addresses users by name if known
  • Stores new personal information aggressively
  • Uses conversational language (avoids markdown/bullets in speech)
  • Proactively helps with workflow automation
  • Remembers daily routines and can execute them on command

Use Cases

  • Hands-free Desktop Control: Control Windows applications by voice
  • Voice-Activated Workflows: “Prep my workspace” to open all work apps
  • Conversational Memory: “What did I work on yesterday?”
  • Real-time Information: “What’s in the news about AI today?”
  • Personalized Assistant: Learns your preferences and habits over time
  • Accessibility: Voice control for users who prefer or require hands-free interaction

Best Practices

  • Handle WebSocket reconnection logic for robust voice experiences
  • Process audio in chunks for smooth streaming
  • Monitor the expires_at timestamp and refresh sessions as needed
  • Use Server VAD for natural conversation flow
  • The assistant learns from every interaction; more conversations mean better personalization
  • Voice commands like “prep my day” can trigger complex automation workflows if you’ve told the assistant about your routine
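
For the reconnection logic mentioned above, capped exponential backoff is a common starting point. A minimal sketch (the function name and default timings are illustrative, not values from this API):

```javascript
// Sketch: exponential backoff delay for reconnection attempts,
// doubling each retry but never waiting longer than maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 15000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

On each WebSocket close, wait backoffDelayMs(attempt) before reconnecting, and reset the attempt counter once a connection succeeds.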
