While Unmute doesn’t currently have built-in tool calling support, you can integrate function calling capabilities by wrapping the LLM server with a tool-calling layer.

Overview

Tool calling allows the LLM to invoke external functions, such as:
  • Weather API lookups
  • Database queries
  • Web searches
  • Smart home controls
  • Calendar operations
  • Custom business logic
The key insight is that tool calling can be implemented transparently at the LLM layer, making it invisible to the Unmute backend.

Architecture

The tool-calling proxy:
  1. Receives text generation requests from Unmute backend
  2. Forwards to the LLM with tool definitions
  3. Detects when the LLM wants to call a tool
  4. Executes the tool and injects results
  5. Continues generation with tool results
  6. Returns final response to Unmute

Implementation Approach

The recommended approach is to create a FastAPI server that:
  1. Exposes an OpenAI-compatible API endpoint
  2. Wraps your LLM server (VLLM, Ollama, etc.)
  3. Intercepts tool calls and executes them
  4. Streams results back to Unmute

Why This Works

Unmute expects streaming text responses from an OpenAI-compatible endpoint. As long as your proxy server provides this interface, Unmute doesn’t need to know about the tool calling happening behind the scenes.
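Concretely, Unmute consumes the standard OpenAI streaming format: server-sent events where each `data:` line carries a JSON chunk whose `delta` holds the next piece of text. A minimal sketch of producing one such event (field names follow the OpenAI Chat Completions streaming format; the model name is a placeholder):

```python
import json

def sse_chunk(text: str, model: str = "placeholder-model") -> str:
    """Format one piece of text as an OpenAI-style streaming SSE event."""
    payload = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": text}, "finish_reason": None}],
    }
    return f"data: {json.dumps(payload)}\n\n"

# Streams terminate with a literal sentinel event:
DONE_EVENT = "data: [DONE]\n\n"
```

As long as your proxy emits events in this shape and ends with the `[DONE]` sentinel, Unmute treats it like any other OpenAI-compatible server.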

Step-by-Step Implementation

Step 1: Create a Tool-Calling Proxy Server

Create a new FastAPI application that wraps your LLM:
# tool_proxy.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx
import json

app = FastAPI()

# Your tool definitions
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]

async def execute_tool(tool_name: str, arguments: dict):
    """Execute a tool and return results"""
    if tool_name == "get_weather":
        # Call weather API
        location = arguments["location"]
        # ... fetch weather data ...
        return {"temperature": 72, "condition": "sunny"}
    return None

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    """OpenAI-compatible endpoint with tool calling"""
    
    # Add tools to the request
    request["tools"] = tools
    
    # Forward to actual LLM (e.g., VLLM)
    llm_url = "http://localhost:8000/v1/chat/completions"
    
    async with httpx.AsyncClient() as client:
        response = await client.post(llm_url, json=request)
        
        # Check if the LLM wants to call a tool
        result = response.json()
        message = result.get("choices", [{}])[0].get("message", {})
        
        if message.get("tool_calls"):
            # The assistant message that requested the tools must be
            # appended to the history before the tool results
            request["messages"].append(message)
            
            # Execute each tool call
            for tool_call in message["tool_calls"]:
                tool_name = tool_call["function"]["name"]
                args = json.loads(tool_call["function"]["arguments"])
                tool_result = await execute_tool(tool_name, args)
                
                # Add tool result to conversation
                request["messages"].append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": json.dumps(tool_result)
                })
            
            # Make another LLM call with tool results
            response = await client.post(llm_url, json=request)
        
        return response.json()
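Before pointing Unmute at the proxy, it is worth sanity-checking the dispatcher in isolation. The snippet redefines the stubbed `get_weather` dispatcher from above so it runs standalone:

```python
import asyncio

# Stand-in for the execute_tool dispatcher defined in tool_proxy.py above
async def execute_tool(tool_name: str, arguments: dict):
    if tool_name == "get_weather":
        return {"temperature": 72, "condition": "sunny"}
    return None

result = asyncio.run(execute_tool("get_weather", {"location": "Paris"}))
# result == {"temperature": 72, "condition": "sunny"}
```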

Step 2: Add Streaming Support

For real-time voice, streaming is essential:
@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    # Enable streaming
    request["stream"] = True
    
    async def stream_with_tools():
        # Track if we're in a tool call
        in_tool_call = False
        tool_buffer = ""
        
        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST",
                "http://localhost:8000/v1/chat/completions",
                json=request
            ) as response:
                async for line in response.aiter_lines():
                    # Parse SSE format (aiter_lines avoids raw chunks
                    # that split an event mid-line)
                    if line.startswith("data: "):
                        data = line[6:].strip()
                        if data == "[DONE]":
                            yield line + "\n\n"
                            continue
                        
                        parsed = json.loads(data)
                        delta = parsed["choices"][0]["delta"]
                        
                        # Check for tool call
                        if "tool_calls" in delta:
                            in_tool_call = True
                            # Buffer tool call data
                            # ...
                        elif in_tool_call:
                            # Execute tool and restart generation
                            # ...
                        else:
                            # Normal text, pass through
                            yield line + "\n\n"
    
    return StreamingResponse(
        stream_with_tools(),
        media_type="text/event-stream"
    )
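The buffering logic elided above has to reassemble tool calls from streamed deltas: OpenAI-style servers split a single call's `arguments` JSON across many chunks, keyed by the call's `index`. A sketch of such an accumulator (the delta structure follows the OpenAI streaming format; nothing Unmute-specific is assumed):

```python
import json

def accumulate_tool_calls(deltas: list[dict]) -> list[dict]:
    """Merge streamed tool_call deltas into complete calls.

    Each delta looks like {"index": 0, "id": ..., "function":
    {"name": ..., "arguments": "<json fragment>"}}; the
    "arguments" string arrives in pieces across chunks.
    """
    calls: dict[int, dict] = {}
    for delta in deltas:
        idx = delta["index"]
        call = calls.setdefault(
            idx, {"id": None, "function": {"name": "", "arguments": ""}}
        )
        if delta.get("id"):
            call["id"] = delta["id"]
        fn = delta.get("function", {})
        if fn.get("name"):
            call["function"]["name"] = fn["name"]
        call["function"]["arguments"] += fn.get("arguments", "")
    return [calls[i] for i in sorted(calls)]

# Example: one call whose arguments arrive in three fragments
deltas = [
    {"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"location"'}},
    {"index": 0, "function": {"arguments": ': "Paris"}'}},
]
calls = accumulate_tool_calls(deltas)
args = json.loads(calls[0]["function"]["arguments"])  # {"location": "Paris"}
```

Only once the `arguments` string parses as complete JSON is the call ready to execute.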

Step 3: Configure Unmute to Use Your Proxy

Update docker-compose.yml to point to your proxy server:
backend:
  environment:
    - KYUTAI_LLM_URL=http://tool-proxy:8080
    - KYUTAI_LLM_MODEL=your-model-name

# Add your proxy service
tool-proxy:
  build:
    context: ./tool-proxy
  ports:
    - "8080:8080"
  environment:
    - VLLM_URL=http://llm:8000

Step 4: Test Tool Calling

Start a conversation and ask the assistant to use a tool:
User: "What's the weather in Paris?"
Assistant: "Let me check that for you... It's currently 72°F and sunny in Paris!"
The tool call happens transparently behind the scenes.
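You can also exercise the proxy outside of a voice session with a plain OpenAI-style HTTP request. A sketch using only the standard library (the port and model name mirror the docker-compose example in step 3 and are assumptions):

```python
import json
import urllib.request

payload = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "stream": False,  # easier to inspect than SSE when testing by hand
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # the proxy from step 1
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:  # uncomment with the proxy running
#     print(json.load(resp))
```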

Example Tools

Weather Lookup

import os

import httpx

async def get_weather(location: str) -> dict:
    """Fetch weather data from an API"""
    api_key = os.environ["WEATHER_API_KEY"]
    url = f"https://api.weather.com/v1/current?location={location}&key={api_key}"
    
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        data = response.json()
        
    return {
        "location": location,
        "temperature": data["temp"],
        "condition": data["condition"],
        "humidity": data["humidity"]
    }

Web Search

async def web_search(query: str) -> dict:
    """Search the web and return results"""
    search_api_key = os.environ["SEARCH_API_KEY"]
    url = f"https://api.search.com/v1/search?q={query}&key={search_api_key}"
    
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        results = response.json()
    
    return {
        "query": query,
        "results": results["items"][:3]  # Top 3 results
    }

Database Query

import asyncpg

async def query_database(sql: str) -> dict:
    """Execute a safe, read-only database query"""
    # Add SQL injection protection!
    if not sql.strip().upper().startswith("SELECT"):
        return {"error": "Only SELECT queries allowed"}
    
    # asyncpg connections are not async context managers;
    # connect and close explicitly
    conn = await asyncpg.connect(DATABASE_URL)
    try:
        rows = await conn.fetch(sql)
    finally:
        await conn.close()
    
    return {
        "rows": [dict(row) for row in rows],
        "count": len(rows)
    }
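The SELECT-prefix check above is easy to bypass, for example by smuggling a second statement in after a semicolon. A slightly stronger, still purely illustrative, guard rejects multi-statement input and mutating keywords; a read-only database role remains the real defense:

```python
# Keywords that should never appear in a read-only query
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "GRANT", "TRUNCATE"}

def is_safe_select(sql: str) -> bool:
    """Heuristic guard for read-only queries; not a substitute for
    parameterized queries and a read-only database role."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement input
        return False
    upper = stripped.upper()
    if not upper.startswith("SELECT"):
        return False
    return not any(word in upper.split() for word in FORBIDDEN)
```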

LLM Server Compatibility

Tool calling support varies by LLM server:

vLLM

vLLM supports the OpenAI-compatible function-calling format. Enable it with the --enable-auto-tool-choice and --tool-call-parser flags:
llm:
  command:
    - "--model=meta-llama/Llama-3.2-1B-Instruct"
    - "--enable-auto-tool-choice"
    - "--tool-call-parser=llama3_json"

Ollama

Ollama supports the tools parameter in recent versions:
import ollama

response = ollama.chat(
    model='llama3',
    messages=messages,
    tools=tools
)

OpenAI API

The OpenAI API has native support for function calling. No proxy is needed, since OpenAI handles tool calls directly.

Advanced Patterns

Multi-Step Tool Chains

Allow the LLM to call multiple tools in sequence:
async def handle_tool_calls(messages, max_iterations=5):
    """Handle multiple rounds of tool calling"""
    # call_llm, has_tool_calls, and execute_all_tools are
    # application-specific helpers, not library functions
    for i in range(max_iterations):
        response = await call_llm(messages)
        
        if not has_tool_calls(response):
            return response  # Final answer
        
        # Execute tools and add results to messages
        tool_results = await execute_all_tools(response.tool_calls)
        messages.extend(tool_results)
    
    return {"error": "Max iterations reached"}

Conditional Tool Availability

Show different tools based on context:
def get_available_tools(user_id: str, conversation_context: str) -> list:
    """Return tools based on user permissions and context"""
    tools = [WEATHER_TOOL]  # Always available
    
    if is_admin(user_id):
        tools.append(DATABASE_TOOL)
    
    if "calendar" in conversation_context.lower():
        tools.append(CALENDAR_TOOL)
    
    return tools

Error Handling

import logging

logger = logging.getLogger(__name__)

async def execute_tool_safely(tool_name: str, arguments: dict):
    """Execute tool with error handling"""
    try:
        result = await execute_tool(tool_name, arguments)
        return {"success": True, "data": result}
    except TimeoutError:
        return {"success": False, "error": "Tool timed out"}
    except Exception as e:
        logger.error(f"Tool {tool_name} failed: {e}")
        return {"success": False, "error": str(e)}

Considerations

Latency Impact: Tool calls add latency to responses. For voice conversations, this can be noticeable. Consider:
  • Using fast APIs
  • Caching tool results
  • Setting reasonable timeouts
  • Limiting tool call depth
Security: Validate all tool inputs and outputs. Never execute arbitrary code or SQL without sanitization.
Streaming: Continue streaming text to Unmute while executing tools in the background to maintain responsiveness.

Community Contributions

Tool calling support would make a great contribution to the Unmute project! If you build a robust tool-calling proxy server, consider:
  1. Opening a pull request to add it to the main repository
  2. Documenting your approach for others
  3. Sharing example tool implementations
See the GitHub discussion for more details and community input.


Next Steps

External LLM

Configure Unmute to use different LLM providers

Custom Frontend

Build your own client using the WebSocket protocol
