
Overview

The /api/voice-agent endpoint initializes real-time voice AI sessions using OpenAI’s Realtime API. It provides a conversational voice assistant with access to memory storage, web search, and Windows MCP tools for desktop automation. The endpoint returns session credentials for establishing a WebSocket connection.
This endpoint creates a session token for OpenAI’s Realtime API. You’ll need to connect to the WebSocket using the returned credentials.

Endpoint

POST /api/voice-agent

Authentication

Requires authentication via cookies or Authorization header.
Returns 401 Unauthorized if authentication fails.
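
Since the endpoint accepts either cookies or an Authorization header, a small helper for building the request init can keep the header logic in one place. This is a hypothetical convenience function, not part of the API; the header name and bearer scheme are assumptions based on common practice.

```javascript
// Hypothetical helper: build the fetch init for /api/voice-agent.
// With cookie auth, pass no token; with token auth, a Bearer header is added.
function buildVoiceAgentInit(token, body = {}) {
  const headers = { 'Content-Type': 'application/json' };
  if (token) {
    headers['Authorization'] = `Bearer ${token}`;
  }
  return { method: 'POST', headers, body: JSON.stringify(body) };
}
```

Usage: `fetch('/api/voice-agent', buildVoiceAgentInit(myToken, { voice: 'ash' }))`.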

Requirements

OPENAI_API_KEY
environment variable
required
OpenAI API key must be configured in environment variables. Returns 500 Internal Server Error if not set.

Request Body

model
string
OpenAI Realtime model to use. Defaults to "gpt-realtime-mini" if not specified.
voice
string
Voice personality for the assistant. Defaults to "ash" if not specified.
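
Both fields are optional. A minimal sketch of how the documented defaults could be applied client-side (the helper name is hypothetical; the server applies the same defaults if fields are omitted):

```javascript
// Hypothetical normalizer applying the documented defaults
// before sending the request body.
function normalizeVoiceAgentBody(body = {}) {
  return {
    model: body.model ?? 'gpt-realtime-mini',
    voice: body.voice ?? 'ash',
  };
}
```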

Response

Returns a JSON object containing the OpenAI Realtime session credentials:
id
string
Unique session ID for the Realtime API connection.
model
string
The model being used for this session.
expires_at
number
Unix timestamp when the session expires.
client_secret
object
WebSocket connection credentials.
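
Because sessions expire, it is worth checking expires_at before reusing credentials. A minimal sketch, assuming expires_at is a Unix timestamp in seconds as shown in the example response (the function name and safety margin are illustrative):

```javascript
// Sketch: decide whether session credentials are still usable,
// leaving a safety margin before the documented expires_at timestamp.
function isSessionUsable(session, marginSeconds = 30, nowSeconds = Date.now() / 1000) {
  return session.expires_at - nowSeconds > marginSeconds;
}
```

If this returns false, request a fresh session from /api/voice-agent rather than connecting with soon-to-expire credentials.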

Available Tools

The voice agent has access to all tools in the system:

Memory Tools

Tools for searching and storing user memories, used to recall context and personalize responses.

Windows MCP Tools

Full desktop automation capabilities:
  • State-Tool: Capture desktop state and interactive elements
  • Click-Tool: Click at coordinates
  • Type-Tool: Type text at coordinates
  • Move-Tool: Move mouse cursor
  • Drag-Tool: Drag and drop
  • Scroll-Tool: Scroll windows
  • Shortcut-Tool: Execute keyboard shortcuts
  • App-Tool: Launch/resize/switch applications
  • Powershell-Tool: Execute PowerShell commands (preferred for opening apps/files/URLs)
  • Wait-Tool: Pause for UI loading
  • Scrape-Tool: Fetch content from URLs or browser tabs
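
During a session, the Realtime API surfaces tool invocations as function-call events that your client can route to local handlers. A minimal dispatch sketch, assuming an event carrying the tool name and JSON-encoded arguments; the handlers shown are hypothetical stand-ins for the tools listed above:

```javascript
// Sketch: route a completed tool call to a local handler by name.
// Handler bodies are placeholders, not real desktop-automation code.
const toolHandlers = {
  'Click-Tool': ({ x, y }) => `clicked ${x},${y}`,
  'Type-Tool': ({ text }) => `typed ${text}`,
};

function handleToolCall(event) {
  // event: { name: 'Click-Tool', arguments: '{"x":1,"y":2}' }
  const handler = toolHandlers[event.name];
  if (!handler) return null;
  return handler(JSON.parse(event.arguments));
}
```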

Voice-Specific Tools

Additional tools defined in DEFAULT_VOICE_TOOLS for voice interaction features.

Example Request

const response = await fetch('/api/voice-agent', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-realtime-mini',
    voice: 'ash'
  })
});

const session = await response.json();
console.log(session);

Example Response

{
  "id": "sess_abc123xyz",
  "model": "gpt-realtime-mini",
  "expires_at": 1234567890,
  "client_secret": {
    "value": "wss://api.openai.com/v1/realtime?model=gpt-realtime-mini",
    "expires_at": 1234567890
  },
  "modalities": ["text", "audio"],
  "instructions": "You are a voice AI assistant...",
  "voice": "ash",
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "whisper-1"
  },
  "turn_detection": {
    "type": "server_vad"
  },
  "tools": [...]
}

Connecting to the WebSocket

Once you have the session credentials, connect to the Realtime API:
const ws = new WebSocket(session.client_secret.value);

ws.onopen = () => {
  console.log('Connected to voice agent');
  
  // Send audio or text messages
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [{
        type: 'input_text',
        text: 'Hello, who am I?'
      }]
    }
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Received:', data);
  
  // Handle different event types:
  // - conversation.item.created
  // - response.audio.delta (audio chunks)
  // - response.done
  // - etc.
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('Disconnected from voice agent');
};
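
Audio events such as response.audio.delta carry base64-encoded PCM16 chunks. A minimal decoder sketch (the function name is illustrative; it assumes the 16-bit little-endian PCM format shown in the example response):

```javascript
// Sketch: decode a base64 PCM16 chunk into an Int16Array of samples,
// suitable for queuing into an audio playback pipeline.
function decodeAudioDelta(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Int16Array(bytes.buffer);
}
```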

Sending Audio Input

// Send PCM16 audio data
const audioData = new Int16Array(buffer); // Your audio buffer

// Convert to base64 in chunks: spreading a large buffer into
// String.fromCharCode in one call can exceed the argument limit
const bytes = new Uint8Array(audioData.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += 0x8000) {
  binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
}
const base64Audio = btoa(binary);

ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64Audio
}));

// Commit the audio buffer when user stops speaking
ws.send(JSON.stringify({
  type: 'input_audio_buffer.commit'
}));

Error Responses

401 Unauthorized
error
Returned when authentication fails.
"Unauthorized"
500 Internal Server Error
error
Returned when OpenAI API key is not configured or OpenAI API request fails.
{
  "error": "OPENAI_API_KEY is not set"
}
{
  "error": "OpenAI API error: 500"
}
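
A client can map these documented failure modes to readable messages before surfacing them to users. A minimal sketch (the function name is hypothetical; the status codes and error bodies are the ones listed above):

```javascript
// Sketch: translate a failed /api/voice-agent response into a message.
// Covers only the documented 401 and 500 cases.
function describeError(status, body) {
  if (status === 401) {
    return 'Unauthorized: check cookies or the Authorization header';
  }
  if (status === 500) {
    return `Server error: ${body && body.error ? body.error : 'unknown'}`;
  }
  return `Unexpected status ${status}`;
}
```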

Features

  • Real-time Voice Interaction: Natural conversational voice AI with low latency
  • Automatic Transcription: Uses Whisper-1 for input audio transcription
  • Server VAD: Server-side Voice Activity Detection for natural turn-taking
  • Memory-Powered: Remembers user context and personalizes responses
  • Desktop Automation: Full control over Windows desktop via voice commands
  • Web Search Integration: Access to current information via Tavily
  • Tool Calling: Can invoke multiple tools during conversation
  • Conversational Style: Speaks naturally, avoiding markdown and code blocks in speech

Voice Assistant Behavior

The voice agent is optimized for natural conversation:
  • Searches user memories immediately at conversation start
  • Addresses users by name if known
  • Stores new personal information aggressively
  • Uses conversational language (avoids markdown/bullets in speech)
  • Proactively helps with workflow automation
  • Remembers daily routines and can execute them on command

Use Cases

  • Hands-free Desktop Control: Control Windows applications by voice
  • Voice-Activated Workflows: “Prep my workspace” to open all work apps
  • Conversational Memory: “What did I work on yesterday?”
  • Real-time Information: “What’s in the news about AI today?”
  • Personalized Assistant: Learns your preferences and habits over time
  • Accessibility: Voice control for users who prefer or require hands-free interaction

Best Practices

  • Handle WebSocket reconnection logic for robust voice experiences
  • Process audio in chunks for smooth streaming
  • Monitor the expires_at timestamp and refresh sessions as needed
  • Use Server VAD for natural conversation flow
  • The assistant learns from every interaction; more conversations mean better personalization
  • Voice commands like “prep my day” can trigger complex automation workflows if you’ve told the assistant about your routine
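
For the reconnection logic mentioned above, capped exponential backoff is a common starting point. A minimal sketch (the function name and default timings are illustrative, not values from this API):

```javascript
// Sketch: exponential backoff delay for reconnection attempts,
// doubling each retry but never waiting longer than maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 15000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

On each WebSocket close, wait backoffDelayMs(attempt) before reconnecting, and reset the attempt counter once a connection succeeds.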
