The /v1/chat/completions endpoint provides conversational AI capabilities using a chat message format. It’s fully compatible with OpenAI’s Chat Completions API.

Endpoint

POST /v1/chat/completions

Request Format

Required Parameters

model (string, required)
  Model identifier. This can be the model path, an alias (set via --alias), or any string when the server is running a single model.

messages (array, required)
  Array of message objects representing the conversation history. Each message has:
    • role (string): One of system, user, or assistant
    • content (string): The message content
  For multimodal models, content can be an array with text and image parts.
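
As a sketch of the multimodal case, a messages array with mixed content parts might look like the following. The image-part shape here ("image_url" with a nested url) follows OpenAI's convention, which this endpoint mirrors; verify against your build.

```python
# Hypothetical multimodal `messages` array, assuming OpenAI-style content
# parts ("text" and "image_url"); check your server build's support.
import json

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]

# The array serializes directly into the JSON body the endpoint expects.
payload = json.dumps({"model": "gpt-3.5-turbo", "messages": messages})
```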

Optional Parameters

temperature (number, default: 0.8)
  Sampling temperature between 0 and 2. Higher values make output more random; lower values make it more deterministic.

top_p (number, default: 0.95)
  Nucleus sampling parameter. Only tokens with cumulative probability up to top_p are considered.

top_k (number, default: 40)
  Limits token selection to the K most probable tokens. Set to 0 to disable.

min_p (number, default: 0.05)
  Minimum probability threshold relative to the most likely token.

max_tokens (number, default: -1)
  Maximum number of tokens to generate. -1 means no limit: generation continues until a stop condition is met or the context window is exhausted.

stream (boolean, default: false)
  Whether to stream partial message deltas using Server-Sent Events.

stop (array)
  Array of strings. Generation stops when any of these sequences is encountered.

presence_penalty (number, default: 0.0)
  Penalizes tokens based on whether they appear in the text so far. Range: -2.0 to 2.0.

frequency_penalty (number, default: 0.0)
  Penalizes tokens based on their frequency in the text so far. Range: -2.0 to 2.0.

repeat_penalty (number, default: 1.1)
  Penalizes repetition of token sequences.

seed (number, default: -1)
  Random seed for reproducible outputs. Use -1 for a random seed.

response_format (object)
  Controls the output format:
    • {"type": "json_object"} - force valid JSON output
    • {"type": "json_schema", "schema": {...}} - constrain output to a JSON schema

tools (array)
  Array of tool/function definitions for function calling. Requires the --jinja flag.

tool_choice (string | object)
  Controls tool selection: "auto", "none", or {"type": "function", "function": {"name": "tool_name"}}.
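
As a sketch of using response_format, the request below forces JSON output and the client validates the reply before using it. The sample reply string is illustrative; a real one comes from choices[0].message.content.

```python
# Hypothetical sketch: force JSON output with response_format, then
# validate the reply with json.loads before trusting it.
import json

request_body = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "List three primary colors as JSON."}],
    "response_format": {"type": "json_object"},
    "temperature": 0.2,
}

# Stand-in for choices[0].message.content from a real response.
sample_reply = '{"colors": ["red", "yellow", "blue"]}'
data = json.loads(sample_reply)  # raises ValueError on malformed JSON
```

Even with json_object set, validating on the client side is a cheap safety net.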

llama.cpp-Specific Parameters

mirostat (number, default: 0)
  Enables Mirostat sampling. 0 = disabled, 1 = Mirostat 1.0, 2 = Mirostat 2.0.

mirostat_tau (number, default: 5.0)
  Mirostat target entropy (the τ parameter).

mirostat_eta (number, default: 0.1)
  Mirostat learning rate (the η parameter).

reasoning_format (string, default: "auto")
  Controls how reasoning/thinking tags are handled:
    • none - no parsing; raw output stays in content
    • deepseek - extract thoughts into the reasoning_content field
    • deepseek-legacy - keep the tags in content while also populating reasoning_content

thinking_forced_open (boolean, default: false)
  Forces reasoning models to always output their thinking process.

cache_prompt (boolean, default: true)
  Reuses the KV cache from previous requests when possible for faster prompt processing.

Request Examples

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
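
The same request can be built from Python with only the standard library. The sketch below prepares the request; uncomment the urlopen call to actually send it (assumes a llama-server listening on localhost:8080).

```python
# Sketch of the curl request above, using only the Python standard library.
import json
import urllib.request

body = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,
    "max_tokens": 100,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer no-key"},
    method="POST",
)

# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```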

Response Format

Standard Response

id (string)
  Unique identifier for the completion.

object (string)
  Always "chat.completion" for non-streaming responses.

created (number)
  Unix timestamp of when the completion was created.

model (string)
  The model used for the completion.

choices (array)
  Array of completion choices. Each choice contains:
    • index (number) - choice index
    • message (object) - the generated message with role and content
    • finish_reason (string) - why generation stopped: stop, length, or tool_calls
    • logprobs (object | null) - token probabilities, if requested

usage (object)
  Token usage statistics:
    • prompt_tokens (number) - tokens in the prompt
    • completion_tokens (number) - tokens generated
    • total_tokens (number) - sum of prompt and completion tokens

timings (object)
  Performance metrics (llama.cpp-specific):
    • prompt_n (number) - prompt tokens processed
    • prompt_ms (number) - time spent processing the prompt
    • predicted_n (number) - tokens generated
    • predicted_ms (number) - time spent generating
    • cache_n (number) - tokens reused from the KV cache

Example Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris. It has been the capital since 987 AD and is known for landmarks like the Eiffel Tower and the Louvre Museum."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 34,
    "total_tokens": 62
  },
  "timings": {
    "prompt_n": 28,
    "prompt_ms": 145.2,
    "prompt_per_token_ms": 5.186,
    "prompt_per_second": 192.8,
    "predicted_n": 34,
    "predicted_ms": 682.5,
    "predicted_per_token_ms": 20.074,
    "predicted_per_second": 49.8,
    "cache_n": 0
  }
}

Streaming Responses

When stream: true, the server sends Server-Sent Events (SSE):
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"gpt-3.5-turbo","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
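
A client consumes this stream by stripping the data: prefix, stopping at the [DONE] sentinel, and concatenating each delta's content. The sketch below mirrors the events shown above; a real client would read the lines from the HTTP response body instead.

```python
# Sketch: accumulate streamed deltas into the final message.
import json

sse_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]

content = ""
finish_reason = None
for line in sse_lines:
    payload = line[len("data: "):]
    if payload == "[DONE]":  # sentinel marking the end of the stream
        break
    choice = json.loads(payload)["choices"][0]
    content += choice["delta"].get("content", "")
    if choice["finish_reason"]:
        finish_reason = choice["finish_reason"]
```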

Function Calling

To enable function calling, start the server with --jinja:
llama-server -m model.gguf --jinja
Then define tools in your request:
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "user", "content": "What's the weather in Boston?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
The model will respond with tool calls:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Boston, MA\", \"unit\": \"fahrenheit\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
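
The client then executes the requested function locally and sends the result back. A sketch of that round trip follows; get_weather is a stand-in for your own code, and the "tool" role message shape follows OpenAI's convention.

```python
# Sketch: parse the model's tool_calls, run the matching local function,
# and append a "tool" message carrying the result for the follow-up request.
import json

response_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Boston, MA", "unit": "fahrenheit"}',
            },
        }
    ],
}

def get_weather(location, unit="celsius"):
    return {"location": location, "temperature": 68, "unit": unit}  # dummy data

available = {"get_weather": get_weather}

messages = [{"role": "user", "content": "What's the weather in Boston?"}]
messages.append(response_message)
for call in response_message["tool_calls"]:
    fn = available[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
    messages.append({
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(fn(**args)),
    })
# Send `messages` back to /v1/chat/completions for the final answer.
```

Note that function.arguments is a JSON-encoded string, not an object, so it must be parsed before dispatch.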

Reasoning Models

For models with reasoning capabilities (e.g., DeepSeek-R1), thoughts are extracted to reasoning_content:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The answer is 42.",
        "reasoning_content": "Let me think about this step by step..."
      },
      "finish_reason": "stop"
    }
  ]
}
Set reasoning_format: "none" to get raw output without reasoning extraction.
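
With "none", the thoughts stay inline in content and the client must split them out itself. The sketch below assumes DeepSeek-R1-style <think>...</think> markers; other models may use different tags.

```python
# Sketch: split raw output into reasoning and answer when the server does
# no parsing (reasoning_format "none"). <think> tags are an assumption
# based on DeepSeek-R1's output format.
import re

raw = "<think>Let me think about this step by step...</think>The answer is 42."

match = re.match(r"<think>(.*?)</think>(.*)", raw, re.DOTALL)
reasoning = match.group(1) if match else ""
answer = (match.group(2) if match else raw).strip()
```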

Multi-turn Conversations

Include the full conversation history in the messages array:
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 equals 4."},
    {"role": "user", "content": "What about 2 + 3?"}
  ]
}
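
Since the server is stateless, the client owns the history: append each assistant reply and the next user message before re-sending the full array. A minimal sketch:

```python
# Sketch of maintaining conversation history across turns.
history = [{"role": "system", "content": "You are a helpful math tutor."}]

def add_turn(history, user_text, assistant_text):
    # In practice assistant_text comes from choices[0].message.content.
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history

add_turn(history, "What is 2 + 2?", "2 + 2 equals 4.")
history.append({"role": "user", "content": "What about 2 + 3?"})
# `history` is now the messages array shown above.
```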

Performance Tips

  1. Enable prompt caching: Set cache_prompt: true (default) to reuse KV cache across requests
  2. Use streaming: Enable stream: true for better perceived latency
  3. Adjust context size: Use -c flag to set appropriate context window for your use case
  4. GPU acceleration: Use --n-gpu-layers to offload layers to GPU
  5. Parallel requests: Use --parallel to handle multiple concurrent requests

Error Responses

{
  "error": {
    "message": "Invalid request: messages is required",
    "type": "invalid_request_error",
    "code": 400
  }
}
Common error codes:
  • 400 - Invalid request (missing/invalid parameters)
  • 401 - Authentication failed
  • 503 - Server unavailable (model still loading)
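
A client can branch on these codes to decide whether to fix the request or retry. A sketch, using the error body shape shown above:

```python
# Sketch: map the documented error codes to client behavior. The error
# body shape follows the example above; a real client reads it from the
# HTTP response.
import json

def describe_error(status, body):
    err = json.loads(body).get("error", {})
    message = err.get("message", "")
    if status == 503:
        return "model still loading - retry later: " + message
    if status == 401:
        return "authentication failed: " + message
    return f"invalid request ({status}): " + message

msg = describe_error(
    400,
    '{"error": {"message": "Invalid request: messages is required",'
    ' "type": "invalid_request_error", "code": 400}}',
)
```

503 is the one case worth retrying automatically, since it usually clears once the model finishes loading.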