
Overview

The Voice Agent enables real-time voice conversations with AI characters using OpenAI’s GPT-4o Realtime API. It supports bidirectional audio streaming over WebSockets with character personalities and tool integrations.

Architecture

The voice agent uses:
  • OpenAI Realtime API (gpt-4o-realtime-preview) for voice processing
  • WebSockets for bidirectional audio streaming
  • Starlette as the web framework
  • LangChain OpenAI Voice wrapper for agent orchestration

Server Implementation

The voice server is built with Starlette and provides WebSocket endpoints:
server/src/server/app.py
import json

import uvicorn
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route, WebSocketRoute
from starlette.staticfiles import StaticFiles
from starlette.websockets import WebSocket
from dotenv import load_dotenv

from langchain_openai_voice import OpenAIVoiceReactAgent
from server.utils import websocket_stream
from server.tools import TOOLS
from chatbot import loadCharacters, process_character_config
from server.prompt import BASE_INSTRUCTIONS

load_dotenv(override=True)

# Track active connections
active_connections = 0
MAX_CONNECTIONS_PER_INSTANCE = 10

async def websocket_endpoint(websocket: WebSocket, character_file: str = "characters/default.json"):
    global active_connections
    
    # Connection limiting
    if active_connections >= MAX_CONNECTIONS_PER_INSTANCE:
        await websocket.accept()
        await websocket.send_text(
            json.dumps({
                "type": "error",
                "message": "Server at capacity. Please try again later."
            })
        )
        await websocket.close()
        return
    
    active_connections += 1
    try:
        await websocket.accept()
        browser_receive_stream = websocket_stream(websocket)
        
        # Load character configuration
        character = loadCharacters(character_file)[0]
        personality = process_character_config(character)
        
        # Combine base instructions with character config
        full_instructions = BASE_INSTRUCTIONS.format(
            character_instructions=personality,
            character_name=character["name"],
            adjectives=", ".join(character.get("adjectives", [])),
            topics=", ".join(character.get("topics", [])),
        )
        
        # Create voice agent
        agent = OpenAIVoiceReactAgent(
            model="gpt-4o-realtime-preview",
            tools=TOOLS,
            instructions=full_instructions,
            voice="verse",  # Options: alloy, ash, ballad, coral, echo, sage, shimmer, verse
        )
        
        # Connect agent to WebSocket streams
        await agent.aconnect(browser_receive_stream, websocket.send_text)
    finally:
        active_connections -= 1
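The `websocket_stream` helper imported above is not shown in this doc. A minimal sketch of what it might look like: an async generator that yields incoming frames until the client disconnects. The `FakeSocket` class and `demo` function below are illustrative stand-ins for Starlette's `WebSocket`, used only to make the sketch runnable:

```python
import asyncio

class WebSocketDisconnect(Exception):
    """Stand-in for starlette.websockets.WebSocketDisconnect."""

async def websocket_stream(websocket):
    """Async generator: yield incoming text frames until the client disconnects."""
    try:
        while True:
            yield await websocket.receive_text()
    except WebSocketDisconnect:
        return

class FakeSocket:
    """Illustrative fake that serves queued frames, then 'disconnects'."""
    def __init__(self, frames):
        self.frames = list(frames)

    async def receive_text(self):
        if not self.frames:
            raise WebSocketDisconnect()
        return self.frames.pop(0)

async def demo():
    return [msg async for msg in websocket_stream(FakeSocket(["hello", "world"]))]

print(asyncio.run(demo()))  # prints ['hello', 'world']
```

The agent then consumes this stream as its audio/event input while `websocket.send_text` carries responses back to the browser.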

Character-Specific Endpoints

Create dedicated endpoints for different characters:
server/src/server/app.py
async def rolypoly_websocket_endpoint(websocket: WebSocket):
    await websocket_endpoint(websocket, "characters/rolypoly.json")

async def chainyoda_websocket_endpoint(websocket: WebSocket):
    await websocket_endpoint(websocket, "characters/chainyoda.json")

# `homepage` is assumed to be defined elsewhere in app.py;
# `health_check` is shown under Health Monitoring below.
routes = [
    Route("/", homepage),
    Route("/health", health_check),
    WebSocketRoute("/ws", websocket_endpoint),
    WebSocketRoute("/ws/rolypoly", rolypoly_websocket_endpoint),
    WebSocketRoute("/ws/chainyoda", chainyoda_websocket_endpoint),
]

app = Starlette(debug=True, routes=routes)
app.mount("/", StaticFiles(directory="server/src/server/static"), name="static")

Voice Instructions

Character personalities are injected into the voice agent prompt:
server/src/server/prompt.py
BASE_INSTRUCTIONS = """
You are {character_name}

# Character Identity Configuration
{character_instructions}

# Operational Guidelines
You are a helpful assistant with access to tools. Maintain these core principles:

1. Personality Enforcement:
- Speak in English
- Be concise but insightful
- Prioritize character bio/lore/knowledge over general knowledge

2. Tool Usage Priorities:
- When asked about GPUs: Only state model names from get_available_gpus
- For podcast questions ("The Podcast", "The Rollup"): 
  * Use podcast_query_tool first
  * Cite speakers and exact quotes
  * Cross-reference with character knowledge

3. Interaction Style:
- Maintain this tone: {adjectives}
- Adhere strictly to style guidelines
- Focus on these topics: {topics}
"""
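To see how the template is filled in, here is a small standalone example using a hypothetical character dict (the field names mirror the `character.get(...)` calls in app.py above; the template is abbreviated for illustration):

```python
# Abbreviated template for illustration; the full BASE_INSTRUCTIONS is shown above.
TEMPLATE = (
    "You are {character_name}\n"
    "{character_instructions}\n"
    "Tone: {adjectives}\n"
    "Topics: {topics}"
)

# Hypothetical character config, following the shape used in app.py
character = {
    "name": "chainyoda",
    "adjectives": ["wise", "cryptic"],
    "topics": ["DeFi", "staking"],
}

full_instructions = TEMPLATE.format(
    character_instructions="Speak in inverted sentences, you do.",
    character_name=character["name"],
    adjectives=", ".join(character.get("adjectives", [])),
    topics=", ".join(character.get("topics", [])),
)

print(full_instructions.splitlines()[0])  # prints "You are chainyoda"
```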

Health Monitoring

The server includes health checks for load balancing:
server/src/server/app.py
import psutil
from starlette.responses import JSONResponse

async def health_check(request):
    # Get system metrics
    memory = psutil.virtual_memory()
    
    # Check if we're in a healthy state
    is_healthy = (
        memory.percent < 90 and 
        active_connections < MAX_CONNECTIONS_PER_INSTANCE
    )
    
    # Return detailed health information
    return JSONResponse(
        {
            "status": "healthy" if is_healthy else "unhealthy",
            "memory": {
                "used_percent": memory.percent,
                "available_mb": memory.available / (1024 * 1024),
            },
            "connections": {
                "active": active_connections,
                "max": MAX_CONNECTIONS_PER_INSTANCE,
            },
        },
        status_code=200 if is_healthy else 503,
    )

Voice Options

OpenAI Realtime API supports multiple voices:

  • alloy: Neutral, balanced tone
  • ash: Deep, calm voice
  • ballad: Warm, friendly tone
  • coral: Bright, energetic voice
  • echo: Clear, professional tone
  • sage: Wise, measured voice
  • shimmer: Soft, gentle tone
  • verse: Dynamic, expressive voice

Configuration

Set up your voice agent server:
.env
# OpenAI (Required for voice agent)
OPENAI_API_KEY=your_openai_api_key

# Character Configuration
CHARACTER_FILE=characters/default.json

# Server Configuration
PORT=3000
HOST=0.0.0.0

# Tool Configuration (same as chat agent)
USE_COINBASE_TOOLS=true
USE_HYPERBOLIC_TOOLS=true
USE_PODCAST_KNOWLEDGE_BASE=true

Running the Server

Start the voice agent server:
Terminal
# Install dependencies
pip install uvicorn starlette langchain-openai-voice psutil python-dotenv

# Start the server
python server/src/server/app.py
The server will start on http://0.0.0.0:3000 with WebSocket endpoints ready.
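The entry point itself is not shown in the snippets above. A plausible sketch that reads HOST and PORT from the environment, matching the .env example (the `server.app:app` import string is an assumption about the module layout):

```python
import os

def server_config():
    """Read host/port from the environment, with the defaults from the .env example."""
    return {
        "host": os.getenv("HOST", "0.0.0.0"),
        "port": int(os.getenv("PORT", "3000")),
    }

# In app.py's __main__ block one would then call (not executed here):
#   uvicorn.run("server.app:app", **server_config())
```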

Client Integration

WebSocket Connection

Connect from a web browser:
client.js
const ws = new WebSocket('ws://localhost:3000/ws/chainyoda');

ws.onopen = () => {
  console.log('Connected to voice agent');
  
  // Start audio capture from microphone
  navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
      const mediaRecorder = new MediaRecorder(stream);
      
      mediaRecorder.ondataavailable = (event) => {
        // Send audio data to server
        ws.send(event.data);
      };
      
      mediaRecorder.start(100); // Send chunks every 100ms
    });
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  
  if (data.type === 'audio') {
    // Play received audio from agent
    playAudioChunk(data.audio);
  } else if (data.type === 'transcript') {
    console.log('Agent said:', data.text);
  }
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

Audio Format

The agent expects audio in PCM16 format at 24kHz sample rate:
// Note: createScriptProcessor is deprecated in favor of AudioWorklet,
// but the simpler API is used here for illustration.
const audioContext = new AudioContext({ sampleRate: 24000 });
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const inputData = e.inputBuffer.getChannelData(0);
  
  // Convert Float32Array to Int16Array (PCM16)
  const pcm16 = new Int16Array(inputData.length);
  for (let i = 0; i < inputData.length; i++) {
    const s = Math.max(-1, Math.min(1, inputData[i]));
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  
  ws.send(pcm16.buffer);
};
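The same float-to-PCM16 mapping in Python, useful for server-side tests or offline conversion. This helper is not part of the project; it simply mirrors the JavaScript loop above:

```python
def float_to_pcm16(samples):
    """Clamp floats to [-1, 1] and scale to signed 16-bit, as in the JS loop above."""
    out = []
    for x in samples:
        s = max(-1.0, min(1.0, x))
        # Negative values scale by 0x8000 (32768), positive by 0x7FFF (32767)
        out.append(int(s * 0x8000) if s < 0 else int(s * 0x7FFF))
    return out

print(float_to_pcm16([-1.0, 0.0, 1.0]))  # prints [-32768, 0, 32767]
```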

Tool Integration

Voice agents can use the same tools as chat agents:
server/tools.py
from langchain.tools import Tool
from langchain_community.tools import DuckDuckGoSearchRun

TOOLS = [
    DuckDuckGoSearchRun(name="web_search"),
    Tool(
        name="query_podcast_knowledge_base",
        # `podcast_kb` is assumed to be initialized elsewhere in tools.py
        func=lambda q: podcast_kb.query_knowledge_base(q),
        description="Query the podcast knowledge base for insights"
    ),
    # Add more tools as needed
]

Load Balancing

The server limits concurrent connections to maintain performance:
MAX_CONNECTIONS_PER_INSTANCE = 10

if active_connections >= MAX_CONNECTIONS_PER_INSTANCE:
    await websocket.send_text(json.dumps({
        "type": "error",
        "message": "Server at capacity. Please try again later."
    }))
    await websocket.close()
    return
For production, deploy multiple instances behind a load balancer.

Deployment

Docker

Create a Dockerfile for the voice server:
Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY server/ ./server/
COPY characters/ ./characters/
COPY chatbot.py .

EXPOSE 3000

CMD ["python", "server/src/server/app.py"]
Build and run:
Terminal
docker build -t voice-agent .
docker run -p 3000:3000 --env-file .env voice-agent

Cloud Deployment

For production deployment with auto-scaling:
  1. Containerize the application: build and push your Docker image to a container registry.
  2. Deploy to a cloud platform: use AWS ECS, Google Cloud Run, or Azure Container Instances.
  3. Configure the load balancer: set up the health check endpoint at /health.
  4. Enable auto-scaling: scale based on active connections and memory usage.

Best Practices

Latency Optimization

  • Use WebSocket compression
  • Deploy close to users geographically
  • Stream audio in small chunks (100ms)
  • Minimize tool execution time

Connection Management

  • Implement reconnection logic
  • Handle network interruptions gracefully
  • Monitor active connections
  • Set appropriate timeouts
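"Implement reconnection logic" can start with an exponential backoff schedule that a client walks through between attempts. A hypothetical helper, not part of the project:

```python
def backoff_delays(retries, base=0.5, cap=30.0):
    """Delays in seconds for successive reconnection attempts: base * 2^i, capped."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]

print(backoff_delays(5))  # prints [0.5, 1.0, 2.0, 4.0, 8.0]
```

Capping the delay keeps long outages from pushing reconnect attempts arbitrarily far apart.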

Audio Quality

  • Use 24kHz sample rate minimum
  • Implement noise reduction
  • Handle audio buffering properly
  • Test with various microphones

Security

  • Use WSS (WebSocket Secure) in production
  • Implement authentication
  • Rate limit connections
  • Validate audio data size
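"Rate limit connections" could be implemented per client IP with a sliding window. A minimal sketch (the class and method names are hypothetical, not part of the project):

```python
import time
from collections import defaultdict, deque

class ConnectionRateLimiter:
    """Allow at most `limit` new connections per IP within `window` seconds."""

    def __init__(self, limit=5, window=60.0):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)  # ip -> timestamps of recent attempts

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.history[ip]
        while q and now - q[0] > self.window:
            q.popleft()  # drop attempts that fell outside the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = ConnectionRateLimiter(limit=2, window=60.0)
print([limiter.allow("1.2.3.4", now=t) for t in (0.0, 1.0, 2.0)])  # prints [True, True, False]
```

In the WebSocket endpoint this check would run before `websocket.accept()`, rejecting over-limit clients the same way the capacity check does.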

Troubleshooting

Connection issues:
  • Verify OPENAI_API_KEY is set correctly
  • Check the WebSocket URL (ws:// for local, wss:// for production)
  • Ensure the firewall allows WebSocket connections
  • Review the browser console for CORS errors

No audio:
  • Check browser microphone permissions
  • Verify the audio format is PCM16 at 24kHz
  • Test audio playback independently
  • Review the WebSocket message format

High latency:
  • Reduce the audio chunk size
  • Check network bandwidth
  • Monitor server CPU/memory usage
  • Consider deploying closer to users

Server at capacity:
  • Scale out to more instances
  • Increase MAX_CONNECTIONS_PER_INSTANCE
  • Implement connection queuing
  • Add a load balancer

Example Use Cases

Crypto Advisor

Voice agent that explains DeFi concepts and executes trades

Podcast Assistant

Query podcast transcripts via voice and get audio summaries

Customer Support

Real-time voice support with blockchain transaction help

Next Steps

Chat Agent

Build text-based conversational agents

Twitter Automation

Create autonomous social media bots

Custom Tools

Add custom capabilities to your voice agent
