Overview
The /api/voice-agent endpoint initializes real-time voice AI sessions using OpenAI’s Realtime API. It provides a conversational voice assistant with access to memory storage, web search, and Windows MCP tools for desktop automation. The endpoint returns session credentials for establishing a WebSocket connection.
This endpoint creates a session token for OpenAI’s Realtime API. You’ll need to connect to the WebSocket using the returned credentials.
Endpoint
POST /api/voice-agent
Authentication
Requires authentication via cookies or Authorization header.
Returns 401 Unauthorized if authentication fails.
Requirements
OPENAI_API_KEY (environment variable, required)
OpenAI API key must be configured in environment variables. Returns 500 Internal Server Error if not set.
Request Body
model (string, optional)
OpenAI Realtime model to use. Defaults to "gpt-realtime-mini" if not specified.
gpt-realtime-mini: Faster, more cost-effective real-time model
gpt-4-realtime: More capable model with advanced reasoning
voice (string, optional)
Voice personality for the assistant. Defaults to "ash" if not specified. OpenAI provides several voice options:
ash: Balanced, neutral voice
sage: Calm, thoughtful voice
coral: Warm, friendly voice
alloy: Professional voice
echo: Clear, articulate voice
shimmer: Bright, energetic voice
Response
Returns a JSON object containing the OpenAI Realtime session credentials:
id (string)
Unique session ID for the Realtime API connection.
model (string)
The model being used for this session.
expires_at (number)
Unix timestamp when the session expires.
client_secret (object)
WebSocket connection credentials:
{
  "value": "wss://api.openai.com/v1/realtime?model=gpt-realtime-mini",
  "expires_at": 1234567890
}
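Since expires_at is a Unix timestamp in seconds, a client can compute the remaining session lifetime before opening the WebSocket. A minimal sketch (the helper name is ours, not part of the API):

```javascript
// Sketch: remaining session lifetime from the Unix expires_at timestamp
// (seconds). A non-positive result means the session has already expired.
function secondsUntilExpiry(session) {
  return session.expires_at - Math.floor(Date.now() / 1000);
}

// Example: request a fresh session when fewer than 60 seconds remain
// if (secondsUntilExpiry(session) < 60) { /* POST /api/voice-agent again */ }
```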
Tools
The voice agent has access to all tools in the system:
Memory Search
Search user memories to personalize responses. The assistant is instructed to search for user identity and context at the start of every conversation. Parameters:
query (string): Search query (e.g., “who is this user”, “user preferences”)
userId (string): Automatically set to authenticated user
limit (number, optional): Maximum results
Memory Storage
Store new information about the user. The assistant aggressively stores:
User’s name (highest priority)
Job title, company, role
Preferences and communication style
Projects and tasks
Any personal details shared
Parameters:
messages (array): Conversation messages to extract memories from
userId (string): Automatically set to authenticated user
Memory Retrieval
Retrieve all stored memories for context. Parameters:
userId (string): Automatically set to authenticated user
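For illustration, the arguments for a memory search call might be shaped like this, following the parameter list above (the literal values and the user ID are placeholders; actual tool names are defined server-side):

```javascript
// Illustrative arguments for the memory search tool, per the parameters
// documented above. All values here are placeholders.
const searchArgs = {
  query: 'who is this user', // search query
  userId: 'user_123',        // set automatically to the authenticated user
  limit: 5                   // optional: maximum results
};
```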
Web Search
Search the web for current information, news, and research. Parameters:
query (string): Search query
Windows MCP Tools
Full desktop automation capabilities:
State-Tool: Capture desktop state and interactive elements
Click-Tool: Click at coordinates
Type-Tool: Type text at coordinates
Move-Tool: Move mouse cursor
Drag-Tool: Drag and drop
Scroll-Tool: Scroll windows
Shortcut-Tool: Execute keyboard shortcuts
App-Tool: Launch/resize/switch applications
Powershell-Tool: Execute PowerShell commands (preferred for opening apps/files/URLs)
Wait-Tool: Pause for UI loading
Scrape-Tool: Fetch content from URLs or browser tabs
Additional tools defined in DEFAULT_VOICE_TOOLS for voice interaction features.
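When the model invokes one of these tools over the Realtime connection, the client (or server) executes it and returns the result as a function_call_output item. The sketch below follows our understanding of the Realtime API event shapes; executeTool is an assumed local dispatcher for the tools listed above, not part of this endpoint:

```javascript
// Sketch: return a tool result to the Realtime API. Once the model finishes
// emitting a function call's arguments, execute the tool, send the output,
// and request a new response so the model can speak the result.
function handleToolCall(ws, event, executeTool) {
  if (event.type !== 'response.function_call_arguments.done') return false;
  const args = JSON.parse(event.arguments);
  const result = executeTool(event.name, args); // assumed local dispatcher
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: event.call_id,
      output: JSON.stringify(result)
    }
  }));
  ws.send(JSON.stringify({ type: 'response.create' }));
  return true;
}
```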
Example Request
const response = await fetch('/api/voice-agent', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-realtime-mini',
    voice: 'ash'
  })
});

const session = await response.json();
console.log(session);
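Before using the credentials, it is worth guarding against the 401 and 500 cases documented under Error Responses. A small sketch (the helper is ours; the error body shape { "error": "..." } matches this endpoint's error responses):

```javascript
// Sketch: surface authentication/configuration errors before attempting
// a WebSocket connection with the returned credentials.
async function ensureSession(response) {
  if (!response.ok) {
    const body = await response.json();
    throw new Error(`voice-agent session failed (${response.status}): ${body.error}`);
  }
  return response.json();
}
```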
Example Response
{
  "id": "sess_abc123xyz",
  "model": "gpt-realtime-mini",
  "expires_at": 1234567890,
  "client_secret": {
    "value": "wss://api.openai.com/v1/realtime?model=gpt-realtime-mini",
    "expires_at": 1234567890
  },
  "modalities": ["text", "audio"],
  "instructions": "You are a voice AI assistant...",
  "voice": "ash",
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "whisper-1"
  },
  "turn_detection": {
    "type": "server_vad"
  },
  "tools": [ ... ]
}
Connecting to the WebSocket
Once you have the session credentials, connect to the Realtime API:
const ws = new WebSocket(session.client_secret.value);

ws.onopen = () => {
  console.log('Connected to voice agent');

  // Send audio or text messages
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: {
      type: 'message',
      role: 'user',
      content: [{
        type: 'input_text',
        text: 'Hello, who am I?'
      }]
    }
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Received:', data);
  // Handle different event types:
  // - conversation.item.created
  // - response.audio.delta (audio chunks)
  // - response.done
  // - etc.
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('Disconnected from voice agent');
};
// Send PCM16 audio data
const audioData = new Int16Array(buffer); // Your audio buffer
const base64Audio = btoa(String.fromCharCode(...new Uint8Array(audioData.buffer)));

ws.send(JSON.stringify({
  type: 'input_audio_buffer.append',
  audio: base64Audio
}));

// Commit the audio buffer when user stops speaking
ws.send(JSON.stringify({
  type: 'input_audio_buffer.commit'
}));
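Note that spreading a large Uint8Array into String.fromCharCode, as above, can overflow the call stack for long recordings. A chunked encoder avoids this; the sketch below is one common workaround, not part of the endpoint:

```javascript
// Sketch: base64-encode PCM16 audio in chunks to avoid the argument-count
// limit of String.fromCharCode(...) on large buffers.
function pcm16ToBase64(int16) {
  const bytes = new Uint8Array(int16.buffer);
  let binary = '';
  const CHUNK = 0x8000; // 32 KiB of bytes per fromCharCode call
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}
```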
Error Responses
401 Unauthorized
Returned when authentication fails.
500 Internal Server Error
Returned when the OpenAI API key is not configured or the OpenAI API request fails.
{
  "error": "OPENAI_API_KEY is not set"
}
{
  "error": "OpenAI API error: 500"
}
Features
Real-time Voice Interaction: Natural conversational voice AI with low latency
Automatic Transcription: Uses Whisper-1 for input audio transcription
Server VAD: Server-side Voice Activity Detection for natural turn-taking
Memory-Powered: Remembers user context and personalizes responses
Desktop Automation: Full control over the Windows desktop via voice commands
Web Search Integration: Access to current information via Tavily
Tool Calling: Can invoke multiple tools during conversation
Conversational Style: Speaks naturally, avoiding markdown and code blocks in speech
Voice Assistant Behavior
The voice agent is optimized for natural conversation:
Searches user memories immediately at conversation start
Addresses users by name if known
Stores new personal information aggressively
Uses conversational language (avoids markdown/bullets in speech)
Proactively helps with workflow automation
Remembers daily routines and can execute them on command
Use Cases
Hands-free Desktop Control : Control Windows applications by voice
Voice-Activated Workflows : “Prep my workspace” to open all work apps
Conversational Memory : “What did I work on yesterday?”
Real-time Information : “What’s in the news about AI today?”
Personalized Assistant : Learns your preferences and habits over time
Accessibility : Voice control for users who prefer or require hands-free interaction
Best Practices
Handle WebSocket reconnection logic for robust voice experiences
Process audio in chunks for smooth streaming
Monitor the expires_at timestamp and refresh sessions as needed
Use Server VAD for natural conversation flow
The assistant learns from every interaction, so more conversations yield better personalization
Voice commands like “prep my day” can trigger complex automation workflows if you’ve told the assistant about your routine
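The reconnection advice above can be sketched with exponential backoff. This assumes a connect() helper that opens the Realtime WebSocket as shown in the connection example; the retry policy (base delay, cap, attempt count) is illustrative:

```javascript
// Sketch: exponential backoff delay for WebSocket reconnection.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Sketch: reopen the socket after unexpected closes, resetting the attempt
// counter once a connection succeeds. connect() is an assumed helper.
function reconnectingSocket(connect, maxRetries = 5) {
  let attempt = 0;
  const open = () => {
    const ws = connect();
    ws.onopen = () => { attempt = 0; };
    ws.onclose = () => {
      if (attempt < maxRetries) {
        setTimeout(open, backoffDelayMs(attempt));
        attempt += 1;
      }
    };
    return ws;
  };
  return open();
}
```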