# Function Calling

llama.cpp supports OpenAI-style function calling for virtually any model, through both native and generic handlers.
## Overview

Function calling allows models to:

- Call external tools and APIs
- Execute code and retrieve results
- Access real-time data (web search, calculators, databases)
- Perform structured actions based on user requests

Function calling is implemented in `common/chat.h` and used by llama-server when started with the `--jinja` flag.
## Supported Models

These models have optimized native function calling handlers:

- Llama 3.1 / 3.2 / 3.3 — Including built-in tools (`wolfram_alpha`, `brave_search`, `code_interpreter`)
- Functionary v3.1 / v3.2 — Dedicated function calling models
- Hermes 2/3 — Strong tool use capabilities
- Qwen 2.5 / Qwen 2.5 Coder — Native tool calling support
- Mistral Nemo — Function calling enabled
- Firefunction v2 — Specialized for function calls
- Command R7B — With reasoning extraction
- DeepSeek R1 — Experimental support

When a model's chat template isn't recognized, llama.cpp falls back to generic function calling support. You'll see `Chat format: Generic` in the logs.

Generic support works with any model, but:

- May consume more tokens than a native format
- May be less efficient
- Can be overridden with `--chat-template-file`
## Basic Usage

### Server Setup

Start llama-server with function calling enabled:

```shell
llama-server -m model.gguf --jinja
```

Or with a custom chat template:

```shell
llama-server -m model.gguf \
    --chat-template-file templates/custom.jinja
```
### Define Functions

Define available functions in your API request:

```json
{
  "model": "model.gguf",
  "messages": [
    {"role": "user", "content": "What's the weather in San Francisco?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "Temperature unit"
            }
          },
          "required": ["location"]
        }
      }
    }
  ]
}
```
The model will respond with a tool call:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"fahrenheit\"}"
            }
          }
        ]
      }
    }
  ]
}
```
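Note that `arguments` arrives as a JSON-encoded string, not an object, so the client must parse it before dispatching. A minimal sketch of extracting the call on the client side (the `response` dict mirrors the payload above):

```python
import json

# The response shape above, reduced to the fields a client needs
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"fahrenheit\"}"
                }
            }]
        }
    }]
}

message = response["choices"][0]["message"]
for call in message.get("tool_calls") or []:
    name = call["function"]["name"]
    # "arguments" is a JSON *string*, not an object -- parse it first
    args = json.loads(call["function"]["arguments"])
    print(name, args["location"], args["unit"])
```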
### Return Results

Execute the function and return results:

```json
{
  "model": "model.gguf",
  "messages": [
    {"role": "user", "content": "What's the weather in San Francisco?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"fahrenheit\"}"}}]
    },
    {
      "role": "tool",
      "tool_call_id": "call_1",
      "content": "{\"temperature\": 72, \"condition\": \"sunny\", \"humidity\": 65}"
    }
  ]
}
```
The model will generate a natural language response using the tool results.
Some models support calling multiple functions simultaneously:

```json
{
  "model": "model.gguf",
  "messages": [{"role": "user", "content": "What's the weather in SF and NYC?"}],
  "tools": [...],
  "parallel_tool_calls": true
}
```

Parallel tool calling is disabled by default. Enable it with `"parallel_tool_calls": true` in your request.
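When several calls come back in one response, each result goes back as its own `tool` message, matched to its call by `tool_call_id`. A minimal sketch of that fan-out (the `run_tool` dispatcher and its hard-coded result are hypothetical):

```python
import json

def run_tool(name, args):
    # Hypothetical local dispatcher for the tools you expose
    if name == "get_weather":
        return {"temperature": 72, "condition": "sunny"}
    raise ValueError(f"unknown tool: {name}")

# With parallel_tool_calls enabled, the model may emit several calls at once
tool_calls = [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather",
                  "arguments": "{\"location\": \"San Francisco, CA\"}"}},
    {"id": "call_2", "type": "function",
     "function": {"name": "get_weather",
                  "arguments": "{\"location\": \"New York, NY\"}"}},
]

# Execute each call and build one "tool" message per call, keyed by id
tool_messages = [
    {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(run_tool(call["function"]["name"],
                                       json.loads(call["function"]["arguments"]))),
    }
    for call in tool_calls
]
print(len(tool_messages))
```

All of these `tool` messages are appended to the conversation (after the assistant message that requested them) before the follow-up request.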
## Complete Examples

### Python

```python
import requests
import json

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform basic arithmetic operations",
            "parameters": {
                "type": "object",
                "properties": {
                    "operation": {
                        "type": "string",
                        "enum": ["add", "subtract", "multiply", "divide"]
                    },
                    "a": {"type": "number"},
                    "b": {"type": "number"}
                },
                "required": ["operation", "a", "b"]
            }
        }
    }
]

# Initial request
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "model.gguf",
        "messages": [{"role": "user", "content": "What is 15 * 7?"}],
        "tools": tools
    }
)
result = response.json()
tool_calls = result["choices"][0]["message"]["tool_calls"]

# Execute tool
if tool_calls:
    tool_call = tool_calls[0]
    args = json.loads(tool_call["function"]["arguments"])

    # Calculate result
    operations = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        "divide": lambda a, b: a / b,
    }
    tool_result = operations[args["operation"]](args["a"], args["b"])

    # Return result to model
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "model.gguf",
            "messages": [
                {"role": "user", "content": "What is 15 * 7?"},
                result["choices"][0]["message"],
                {
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": str(tool_result)
                }
            ]
        }
    )
    print(response.json()["choices"][0]["message"]["content"])
```
### JavaScript

```javascript
async function chat() {
  const tools = [
    {
      type: "function",
      function: {
        name: "get_time",
        description: "Get current time in a timezone",
        parameters: {
          type: "object",
          properties: {
            timezone: {
              type: "string",
              description: "IANA timezone name"
            }
          },
          required: ["timezone"]
        }
      }
    }
  ];

  // Initial request
  let response = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({
      model: "model.gguf",
      messages: [{role: "user", content: "What time is it in Tokyo?"}],
      tools: tools
    })
  });

  let data = await response.json();
  const toolCalls = data.choices[0].message.tool_calls;

  if (toolCalls) {
    const toolCall = toolCalls[0];
    const args = JSON.parse(toolCall.function.arguments);

    // Get current time
    const time = new Date().toLocaleString("en-US", {
      timeZone: args.timezone
    });

    // Return result
    response = await fetch("http://localhost:8080/v1/chat/completions", {
      method: "POST",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify({
        model: "model.gguf",
        messages: [
          {role: "user", content: "What time is it in Tokyo?"},
          data.choices[0].message,
          {
            role: "tool",
            tool_call_id: toolCall.id,
            content: time
          }
        ]
      })
    });

    data = await response.json();
    console.log(data.choices[0].message.content);
  }
}

chat();
```
### curl

```shell
# Initial request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model.gguf",
    "messages": [
      {"role": "user", "content": "Search for llama.cpp on GitHub"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "search_github",
          "description": "Search GitHub repositories",
          "parameters": {
            "type": "object",
            "properties": {
              "query": {"type": "string"}
            },
            "required": ["query"]
          }
        }
      }
    ]
  }'

# Model responds with a tool_call requesting search_github

# Return results
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model.gguf",
    "messages": [
      {"role": "user", "content": "Search for llama.cpp on GitHub"},
      {
        "role": "assistant",
        "tool_calls": [{
          "id": "call_1",
          "type": "function",
          "function": {
            "name": "search_github",
            "arguments": "{\"query\": \"llama.cpp\"}"
          }
        }]
      },
      {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": "Found: ggml-org/llama.cpp - LLM inference in C/C++"
      }
    ]
  }'
```
## Built-in Tools

Llama 3.1+ models support built-in tool names:

- `wolfram_alpha` — Mathematical and factual queries
- `brave_search` / `web_search` — Web searching
- `code_interpreter` — Code execution

These don't require parameter definitions but still need tool result handling.
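As a sketch, a request using a built-in tool can pass just the tool name, with no `parameters` block (the exact shape accepted depends on the model's chat template, and the user message here is illustrative):

```json
{
  "model": "model.gguf",
  "messages": [
    {"role": "user", "content": "What is the population of Japan?"}
  ],
  "tools": [
    {"type": "function", "function": {"name": "brave_search"}}
  ]
}
```

As with regular tools, you execute the call yourself and return the result in a `tool` message.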
## Custom Chat Templates

Override the default chat template for better function calling:

```shell
llama-server -m model.gguf \
    --chat-template-file templates/functionary-v3.1.jinja
```

Or specify in the API request:

```json
{
  "model": "model.gguf",
  "messages": [...],
  "tools": [...],
  "chat_template": "path/to/template.jinja"
}
```
## Best Practices

- Write clear, concise descriptions
- Include parameter constraints and units
- Specify required vs. optional parameters
- Use JSON Schema for parameter validation
- Return structured error messages in tool results
- Include error types and codes
- Handle rate limits and timeouts
- Provide fallback behavior
- Use models with native function calling support when possible
- Test generic support with your specific model
- Consider token efficiency for high-volume applications
- Benchmark accuracy with your tool definitions
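For instance, a definition applying several of these practices — constrained ranges, enums, units in descriptions, explicit required fields (the tool name and its schema here are illustrative, not part of any API):

```json
{
  "type": "function",
  "function": {
    "name": "set_thermostat",
    "description": "Set the target temperature of a thermostat. Returns the previous setting.",
    "parameters": {
      "type": "object",
      "properties": {
        "temperature_celsius": {
          "type": "number",
          "minimum": 5,
          "maximum": 30,
          "description": "Target temperature in degrees Celsius (5-30)"
        },
        "mode": {
          "type": "string",
          "enum": ["heat", "cool", "auto"],
          "description": "Operating mode; defaults to \"auto\" if omitted"
        }
      },
      "required": ["temperature_celsius"]
    }
  }
}
```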
## Troubleshooting

### Model not calling functions

- Ensure the `--jinja` flag is set on the server
- Check that tool descriptions are clear and specific
- Verify the model supports function calling
- Try a `--chat-template-file` override

### Invalid arguments generated

- Add parameter constraints to the JSON Schema
- Include examples in descriptions
- Use enums for limited choices
- Validate and sanitize arguments before execution
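Since model-generated arguments can be malformed even with a good schema, it helps to validate before executing. A minimal hand-rolled check mirroring the `calculate` schema from the Python example (the `validate_calculate_args` helper is illustrative; a full JSON Schema validator is the more robust choice):

```python
import json

ALLOWED_OPS = {"add", "subtract", "multiply", "divide"}

def validate_calculate_args(raw):
    """Return (args, None) on success or (None, error_message) on failure."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"arguments are not valid JSON: {e}"
    if args.get("operation") not in ALLOWED_OPS:
        return None, f"unknown operation: {args.get('operation')!r}"
    for key in ("a", "b"):
        value = args.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(value, (int, float)) or isinstance(value, bool):
            return None, f"parameter {key!r} must be a number"
    return args, None

args, err = validate_calculate_args('{"operation": "multiply", "a": 15, "b": 7}')
print(args, err)
```

Returning the error string to the model in a `tool` message (rather than crashing) often lets it retry with corrected arguments.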
### Generic format using too many tokens

- Use a model with a native function calling handler when possible
- Supply a dedicated template via `--chat-template-file`
## Next Steps

- REST API — learn about the chat completions endpoint
- Server Configuration — configure llama-server for function calling