Overview
Voice Agent provides:
- Real-time voice conversations using OpenAI’s Realtime API
- Desktop automation via Windows MCP tools
- Persistent memory to remember your preferences and context
- Web search integration for current information
- Tool execution for system-level control
Activation
Ctrl+Alt+J - Launch the Voice Agent
Once activated, the agent listens for your voice input and responds naturally in real time.
Configuration
Voice Selection
The agent supports multiple voice options from OpenAI.
Model Selection
By default, the agent uses gpt-realtime-mini for optimal performance and cost-efficiency.
Core Capabilities
Desktop Automation
The Voice Agent has access to powerful Windows MCP tools for full desktop control:
State-Tool
Captures current desktop state including:
- Focused application and window dimensions
- All opened applications
- Interactive elements with coordinates
- Scrollable areas
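To make the shape of that state concrete, here is a minimal sketch of how a captured snapshot might be modeled. The class and field names (`DesktopState`, `InteractiveElement`, `find_element`) are hypothetical and are not taken from the tool's actual output format; they only mirror the four items listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InteractiveElement:
    """A control the agent can act on, with its screen coordinates (hypothetical shape)."""
    name: str
    x: int
    y: int


@dataclass
class DesktopState:
    """Illustrative snapshot mirroring the items State-Tool reports."""
    focused_app: str
    window_size: Tuple[int, int]  # (width, height) of the focused window
    open_apps: List[str] = field(default_factory=list)
    elements: List[InteractiveElement] = field(default_factory=list)
    scrollable_areas: List[str] = field(default_factory=list)

    def find_element(self, name: str) -> Optional[InteractiveElement]:
        """Return the first element whose name contains `name`, ignoring case."""
        needle = name.lower()
        for element in self.elements:
            if needle in element.name.lower():
                return element
        return None


# Sample values are made up for the example.
state = DesktopState(
    focused_app="Chrome",
    window_size=(1280, 720),
    open_apps=["Chrome", "Notepad"],
    elements=[InteractiveElement("Address bar", 400, 60)],
    scrollable_areas=["Page content"],
)
target = state.find_element("address")
```

A lookup like `find_element` is one way the coordinates from a snapshot could feed a later click or type action.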
Click-Tool
Click at specific coordinates with options for:
- Left, right, or middle button clicks
- Single, double, or triple clicks
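Because these are MCP tools, a click would typically travel as a JSON-RPC `tools/call` request. The sketch below builds such a request and validates the two options listed above. The argument keys (`loc`, `button`, `clicks`) and the helper name `build_click_request` are assumptions for illustration; check the tool's published schema for the real parameter names.

```python
import json

VALID_BUTTONS = {"left", "right", "middle"}
VALID_CLICK_COUNTS = {1, 2, 3}  # single, double, triple


def build_click_request(x, y, button="left", clicks=1, request_id=1):
    """Build a JSON-RPC `tools/call` request for Click-Tool.

    The keys inside "arguments" are hypothetical, not the tool's confirmed schema.
    """
    if button not in VALID_BUTTONS:
        raise ValueError(f"unsupported button: {button!r}")
    if clicks not in VALID_CLICK_COUNTS:
        raise ValueError(f"unsupported click count: {clicks!r}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "Click-Tool",
            "arguments": {"loc": [x, y], "button": button, "clicks": clicks},
        },
    }


# Double left-click at made-up coordinates.
request = build_click_request(400, 60, button="left", clicks=2)
print(json.dumps(request, indent=2))
```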
Type-Tool
Type text at specific coordinates with options to:
- Clear existing text first
- Press Enter after typing
Powershell-Tool
Execute PowerShell commands for:
- Opening applications and files
- Opening URLs in default browser
- File system operations
- Process management
- System information queries
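As a rough illustration of the "opening URLs" case, the sketch below builds a `Start-Process` command and the argument vector for running it non-interactively. The helper names are hypothetical, and this is not the tool's internal implementation. `run_powershell` is defined but not called, since it only works on a Windows machine with PowerShell on the PATH.

```python
import subprocess


def powershell_argv(command):
    """Return the argument vector for running one PowerShell command non-interactively."""
    return ["powershell", "-NoProfile", "-NonInteractive", "-Command", command]


def open_url_command(url):
    """Build a Start-Process command that opens `url` with its default handler."""
    # Inside a single-quoted PowerShell string, a single quote is escaped by doubling it.
    escaped = url.replace("'", "''")
    return f"Start-Process '{escaped}'"


def run_powershell(command, timeout=30.0):
    """Run the command and capture its output (Windows only; not called in this sketch)."""
    return subprocess.run(
        powershell_argv(command), capture_output=True, text=True, timeout=timeout
    )


argv = powershell_argv(open_url_command("https://example.com"))
```

Quoting the URL guards against characters such as `&` being interpreted by PowerShell rather than passed through to the browser.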
Shortcut-Tool
Execute keyboard shortcuts like:
- ctrl+c, ctrl+v (copy/paste)
- alt+tab (switch windows)
- win+r (run command)
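Shortcut strings in this `modifier+key` form are easy to validate before they are sent. The following is a minimal sketch of such a parser; the function name and the accepted modifier set are assumptions, and the real tool may accept other spellings.

```python
MODIFIERS = ("ctrl", "alt", "shift", "win")  # assumed set of modifier names


def parse_shortcut(shortcut):
    """Split a string such as 'ctrl+c' into (modifiers, key)."""
    parts = [part.strip().lower() for part in shortcut.split("+")]
    if any(part == "" for part in parts):
        raise ValueError(f"malformed shortcut: {shortcut!r}")
    # Everything before the last "+" is treated as a modifier.
    *modifiers, key = parts
    unknown = [m for m in modifiers if m not in MODIFIERS]
    if unknown:
        raise ValueError(f"unknown modifier(s): {unknown}")
    return modifiers, key


copy = parse_shortcut("ctrl+c")
switch = parse_shortcut("Alt+Tab")
launch = parse_shortcut("ctrl+alt+j")
```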
App-Tool
Manage applications:
- Launch new applications
- Resize and position windows
- Switch between running apps
Scroll-Tool
Scroll in any direction at specified coordinates
Move-Tool & Drag-Tool
Move mouse cursor or drag elements between locations
Wait-Tool
Pause execution for UI loading or animations to complete
Scrape-Tool
Fetch content from URLs or active browser tabs
Memory System
The Voice Agent maintains persistent memory across conversations:
Search Memories
Automatically searches for relevant context before responding to understand who you are and what you’re working on.
Store Memories
Saves important information including:
- Your name and identity
- Role, company, and profession
- Technical preferences and skills
- Projects and tasks
- Communication preferences
Retrieve All Memories
Access all stored memories for a comprehensive view of your profile.
Procedural Memory
Remembers workflows and automates routine tasks like “prep my day” or “start my workspace”.
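The store, search, and retrieve-all operations described above can be pictured with a small toy model. This sketch is an assumption-laden stand-in: it keeps memories in a Python list instead of persistent storage, and it ranks results by simple word overlap, which is almost certainly cruder than whatever retrieval the product uses. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStore:
    """Toy in-memory stand-in for the agent's persistent memory."""
    memories: List[str] = field(default_factory=list)

    def store(self, fact):
        """Save a fact, skipping blanks and exact duplicates."""
        fact = fact.strip()
        if fact and fact not in self.memories:
            self.memories.append(fact)

    def search(self, query, limit=3):
        """Return up to `limit` facts, ranked by how many words they share with the query."""
        query_words = set(query.lower().split())
        scored = []
        for fact in self.memories:
            overlap = len(query_words & set(fact.lower().split()))
            if overlap:
                scored.append((overlap, fact))
        # sort() is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:limit]]

    def retrieve_all(self):
        """Return a copy of every stored fact."""
        return list(self.memories)


store = MemoryStore()
store.store("User prefers Chrome browser")
store.store("User is interested in keyboard shortcuts")
store.store("User prefers Chrome browser")  # duplicate, ignored
hits = store.search("open my browser")
```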
Web Search
Integrated with Tavily for real-time web search capabilities to answer questions requiring current information.
API Endpoints
Create Voice Agent Session
Execute Tool
Built-in Voice Commands
Change Theme
“Change the browser theme to dark mode”
Switches between light and dark themes.
End Conversation
“End conversation” or “Hang up”
Gracefully ends the voice session.
Example Usage
Start the Agent
Press Ctrl+Alt+J to launch the Voice Agent. The agent will greet you and start listening.
Natural Conversation
Speak naturally: “Open Chrome and search for AI keyboard shortcuts”
The agent will:
- Search memories to understand your context
- Use Powershell-Tool to open Chrome
- Use Type-Tool and Click-Tool to perform the search
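The steps above can be sketched as an ordered plan of tool calls. Everything specific here is hypothetical: the `search_memories` tool name, the argument keys, the `Start-Process chrome` command, and the address-bar coordinates (which in practice would come from a desktop state snapshot). The sketch only shows the ordering, not how the agent actually plans.

```python
def plan_open_and_search(query, address_bar=(400, 60)):
    """Return an ordered list of illustrative tool calls for 'open Chrome and search'."""
    x, y = address_bar
    return [
        # 1. Look up user context first.
        {"tool": "search_memories", "arguments": {"query": query}},
        # 2. Launch the browser through PowerShell.
        {"tool": "Powershell-Tool", "arguments": {"command": "Start-Process chrome"}},
        # 3. Focus the address bar, then type the query and submit it.
        {"tool": "Click-Tool", "arguments": {"loc": [x, y], "button": "left", "clicks": 1}},
        {"tool": "Type-Tool", "arguments": {"loc": [x, y], "text": query, "clear": True, "press_enter": True}},
    ]


plan = plan_open_and_search("AI keyboard shortcuts")
tool_order = [step["tool"] for step in plan]
```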
Memory Building
The agent automatically stores facts like:
- “User prefers Chrome browser”
- “User is interested in keyboard shortcuts”
System Prompt Highlights
The Voice Agent is instructed to:
- Speak naturally and conversationally (avoid markdown, bullet points, or code blocks)
- Always search memories first before responding
- Aggressively store new information shared during conversations
- Prioritize user identity (name, role, preferences)
- Use PowerShell for fast execution when opening apps/files/URLs
- Be cost-conscious (avoid expensive vision tools unless necessary)
Advanced Features
Workflow Automation
Train the agent to automate your daily routines.
Real-time Communication
The Voice Agent uses WebRTC for low-latency, bidirectional audio streaming:
- Audio input captured via microphone
- Transcription via OpenAI Whisper
- Response generation via GPT Realtime API
- Audio output played directly in browser
Requirements
Related Features
Voice Transcription
Convert speech to text for typing in any application
Voice Commands
Execute quick actions via voice shortcuts