Voice Agent

The Voice Agent is an intelligent voice assistant integrated directly into Tabby AI Keyboard, enabling hands-free interaction with your system through natural conversation.

Overview

Voice Agent provides:

Real-time voice conversations using OpenAI’s Realtime API
Desktop automation via Windows MCP tools
Persistent memory to remember your preferences and context
Web search integration for current information
Tool execution for system-level control

Activation

Ctrl+Alt+J - Launch the Voice Agent

Once activated, the agent listens for your voice input and responds naturally in real time.

Configuration

Voice Selection

The agent supports multiple voice options from OpenAI:

{
  Alloy: 'alloy',
  Ash: 'ash',        // Default
  Ballad: 'ballad',
  Coral: 'coral',
  Echo: 'echo',
  Sage: 'sage',
  Shimmer: 'shimmer',
  Verse: 'verse'
}
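A caller may want to validate a requested voice before creating a session. The following is a minimal sketch of that check, not Tabby's implementation: `VOICES`, `DEFAULT_VOICE`, and `resolveVoice` are hypothetical names, and only the voice identifiers and the `ash` default are taken from the list above.

```javascript
// Voice identifiers copied from the list above; the constant names are hypothetical.
const VOICES = ['alloy', 'ash', 'ballad', 'coral', 'echo', 'sage', 'shimmer', 'verse'];
const DEFAULT_VOICE = 'ash';

// Return a supported voice id, falling back to the default for unknown input.
function resolveVoice(requested) {
  const normalized = String(requested ?? '').trim().toLowerCase();
  return VOICES.includes(normalized) ? normalized : DEFAULT_VOICE;
}
```

Falling back to the default rather than throwing keeps session creation from failing on a typo; a stricter caller could reject unknown values instead.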

Model Selection

By default, the agent uses gpt-realtime-mini for optimal performance and cost-efficiency.

Core Capabilities

Desktop Automation

The Voice Agent has access to powerful Windows MCP tools for full desktop control:

State-Tool

Captures current desktop state including:

Focused application and window dimensions
All opened applications
Interactive elements with coordinates
Scrollable areas

Critical: State-Tool is always called first, before any action is performed, so that coordinates are accurate.
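The state-first rule can be expressed as a check over a planned sequence of tool calls. This sketch is illustrative only: `validatePlan` and `COORDINATE_TOOLS` are hypothetical names, the `{ toolName, args }` step shape is borrowed from the Execute Tool request body, and treating Move-Tool and Drag-Tool as coordinate-based is an assumption drawn from their descriptions.

```javascript
// Tools that this page describes as acting on coordinates or locations.
// Including Move-Tool and Drag-Tool here is an assumption.
const COORDINATE_TOOLS = new Set([
  'Click-Tool',
  'Type-Tool',
  'Scroll-Tool',
  'Move-Tool',
  'Drag-Tool',
]);

// Check that a plan (array of { toolName, args }) captures desktop state
// before its first coordinate-based action. Returns { ok, reason }.
function validatePlan(plan) {
  let stateCaptured = false;
  for (const [index, step] of plan.entries()) {
    if (step.toolName === 'State-Tool') {
      stateCaptured = true;
    } else if (COORDINATE_TOOLS.has(step.toolName) && !stateCaptured) {
      return {
        ok: false,
        reason: `Step ${index} (${step.toolName}) runs before State-Tool`,
      };
    }
  }
  return { ok: true, reason: null };
}
```

A plan that only runs Powershell-Tool passes this check, since that tool does not depend on screen coordinates.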

Click-Tool

Click at specific coordinates with options for:

Left, right, or middle button clicks
Single, double, or triple clicks

Type-Tool

Type text at specific coordinates with options to:

Clear existing text first
Press Enter after typing

Powershell-Tool

Execute PowerShell commands for:

Opening applications and files
Opening URLs in default browser
File system operations
Process management
System information queries
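As an illustration of the "Opening URLs in default browser" case, the sketch below builds a PowerShell command string around `Start-Process`, a standard PowerShell cmdlet. The helper name `buildOpenUrlCommand` is hypothetical, and how the resulting string is passed to Powershell-Tool is not specified on this page, so that step is left out.

```javascript
// Build a PowerShell command that opens a URL in the default browser.
// Single quotes inside a single-quoted PowerShell string are escaped by doubling them.
function buildOpenUrlCommand(url) {
  const parsed = new URL(url); // throws a TypeError on malformed input
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  const escaped = parsed.href.replace(/'/g, "''");
  return `Start-Process '${escaped}'`;
}
```

Restricting the protocol to `http:` and `https:` is a design choice for this sketch: it keeps a spoken request from being turned into a command that launches an arbitrary local file.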

Shortcut-Tool

Execute keyboard shortcuts like:

ctrl+c, ctrl+v (copy/paste)
alt+tab (switch windows)
win+r (run command)
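The shortcut strings above follow a `modifier+key` pattern. The exact syntax Shortcut-Tool accepts is not documented here, so the following is only a sketch of client-side validation under that assumption; `parseShortcut` and the `MODIFIERS` set are hypothetical.

```javascript
// Modifier names inferred from the examples above, plus shift as an assumption.
const MODIFIERS = new Set(['ctrl', 'alt', 'shift', 'win']);

// Split a shortcut such as "ctrl+c" into its modifiers and final key.
function parseShortcut(shortcut) {
  const parts = shortcut
    .toLowerCase()
    .split('+')
    .map((part) => part.trim())
    .filter(Boolean);
  if (parts.length === 0) {
    throw new Error('Empty shortcut');
  }
  const key = parts[parts.length - 1];
  const modifiers = parts.slice(0, -1);
  const unknown = modifiers.filter((m) => !MODIFIERS.has(m));
  if (unknown.length > 0) {
    throw new Error(`Unknown modifier(s): ${unknown.join(', ')}`);
  }
  return { modifiers, key };
}
```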

App-Tool

Manage applications:

Launch new applications
Resize and position windows
Switch between running apps

Scroll-Tool

Scroll in any direction at specified coordinates

Move-Tool & Drag-Tool

Move mouse cursor or drag elements between locations

Wait-Tool

Pause execution for UI loading or animations to complete

Scrape-Tool

Fetch content from URLs or active browser tabs

Memory System

The Voice Agent maintains persistent memory across conversations:

Search Memories

Automatically searches for relevant context before responding to understand who you are and what you’re working on.

Store Memories

Saves important information including:

Your name and identity
Role, company, and profession
Technical preferences and skills
Projects and tasks
Communication preferences

Retrieve All Memories

Access all stored memories for a comprehensive view of your profile.

Procedural Memory

Remembers workflows and automates routine tasks like “prep my day” or “start my workspace”.

Web Search

Integrated with Tavily for real-time web search capabilities to answer questions requiring current information.

API Endpoints

Create Voice Agent Session

const response = await fetch('/api/voice-agent', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-realtime-mini',
    voice: 'ash'
  })
});

const session = await response.json();
// Returns: OpenAI Realtime session with client_secret

Execute Tool

const response = await fetch('/api/voice-agent/execute-tool', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    toolName: 'tavilySearchTool',
    args: {
      query: 'latest AI news'
    }
  })
});

const result = await response.json();
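When several tools are called from the same client, the request above can be wrapped in a small helper that also surfaces HTTP errors. This is a sketch under assumptions: `buildExecuteToolRequest` and `executeTool` are hypothetical names, the error format is invented for illustration, and only the endpoint path and the `{ toolName, args }` body come from the example above.

```javascript
// Build the request for the execute-tool endpoint shown above.
function buildExecuteToolRequest(toolName, args = {}) {
  if (typeof toolName !== 'string' || toolName.length === 0) {
    throw new TypeError('toolName must be a non-empty string');
  }
  return {
    url: '/api/voice-agent/execute-tool',
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ toolName, args }),
    },
  };
}

// Send the request and turn non-2xx responses into errors.
// fetchImpl is injectable so the helper can be exercised without a server.
async function executeTool(toolName, args, fetchImpl = fetch) {
  const { url, options } = buildExecuteToolRequest(toolName, args);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Tool ${toolName} failed with status ${response.status}`);
  }
  return response.json();
}
```

With this in place, the search call becomes `await executeTool('tavilySearchTool', { query: 'latest AI news' })`.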

Built-in Voice Commands

Change Theme

“Change the browser theme to dark mode”

Switches between light and dark themes.

End Conversation

“End conversation” or “Hang up”

Gracefully ends the voice session.

Example Usage

Start the Agent

Press Ctrl+Alt+J to launch the Voice Agent. The agent will greet you and start listening.

Natural Conversation

Speak naturally: “Open Chrome and search for AI keyboard shortcuts”

The agent will:

Search memories to understand your context
Use Powershell-Tool to open Chrome
Use Type-Tool and Click-Tool to perform the search

Memory Building

The agent automatically stores facts like:

“User prefers Chrome browser”
“User is interested in keyboard shortcuts”

End Session

Say “End conversation” or close the Voice Agent window.

System Prompt Highlights

The Voice Agent is instructed to:

Speak naturally and conversationally (avoid markdown, bullet points, or code blocks)
Always search memories first before responding
Aggressively store new information shared during conversations
Prioritize user identity (name, role, preferences)
Use PowerShell for fast execution when opening apps/files/URLs
Be cost-conscious (avoid expensive vision tools unless necessary)

Advanced Features

Workflow Automation

Train the agent to automate your daily routines:

You: "Every morning, I open Slack, Chrome with Gmail, and VS Code in my projects folder"

[Agent stores this as procedural memory]

Next day:
You: "Prep my workspace"

[Agent executes stored workflow automatically]
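The store-then-recall pattern in this exchange can be modeled with a simple lookup keyed by trigger phrase. The sketch below is an in-memory stand-in, not the agent's memory backend (whose API is not shown on this page); `WorkflowStore` is a hypothetical class, and it assumes the trigger phrase is known when the workflow is saved, whereas the real agent may infer that mapping.

```javascript
// In-memory stand-in for procedural memory. Hypothetical; the real agent
// persists workflows in its memory backend.
class WorkflowStore {
  constructor() {
    this.workflows = new Map();
  }

  // Normalize a spoken phrase so "Prep my workspace." matches "prep my workspace".
  static normalize(phrase) {
    return phrase
      .toLowerCase()
      .replace(/[^a-z0-9 ]/g, '')
      .replace(/\s+/g, ' ')
      .trim();
  }

  // Save an ordered list of steps under a trigger phrase.
  save(trigger, steps) {
    this.workflows.set(WorkflowStore.normalize(trigger), [...steps]);
  }

  // Return a copy of the stored steps, or null when nothing matches.
  recall(phrase) {
    const steps = this.workflows.get(WorkflowStore.normalize(phrase));
    return steps ? [...steps] : null;
  }
}

const store = new WorkflowStore();
store.save('prep my workspace', [
  'Open Slack',
  'Open Chrome with Gmail',
  'Open VS Code in the projects folder',
]);
console.log(store.recall('Prep my workspace.'));
```

Exact-match lookup after normalization is the simplest possible design; a production system would more plausibly match on meaning rather than on literal wording.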

Real-time Communication

The Voice Agent uses WebRTC for low-latency, bidirectional audio streaming:

Audio input captured via microphone
Transcription via OpenAI Whisper
Response generation via GPT Realtime API
Audio output played directly in browser

Requirements

Voice Agent requires:

OpenAI API key with Realtime API access
Microphone permissions
Modern browser with WebRTC support
(Optional) Windows MCP server for desktop automation
(Optional) Memory backend running on localhost:8000
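The browser-side requirements can be probed before launching a session. This is a minimal sketch: `checkVoiceAgentSupport` is a hypothetical helper, while `RTCPeerConnection` and `navigator.mediaDevices.getUserMedia` are standard browser APIs. It only detects that the APIs exist; it does not confirm that the user has granted microphone permission.

```javascript
// Report which browser capabilities from the list above are available.
// env defaults to globalThis so the check can also be run against a stub object.
function checkVoiceAgentSupport(env = globalThis) {
  const hasWebRTC = typeof env.RTCPeerConnection === 'function';
  const hasMicrophoneApi =
    typeof env.navigator?.mediaDevices?.getUserMedia === 'function';
  return {
    hasWebRTC,
    hasMicrophoneApi,
    supported: hasWebRTC && hasMicrophoneApi,
  };
}
```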

Related Features

Voice Transcription

Convert speech to text for typing in any application

Voice Commands

Execute quick actions via voice shortcuts
