The `/v1/chat/completions` endpoint provides conversational AI capabilities using a chat message format. It is fully compatible with OpenAI's Chat Completions API.
Endpoint

POST /v1/chat/completions
Request Format
Required Parameters
`model` (string): Model identifier. Can be the model path, an alias (set via `--alias`), or any string when the server is running a single model.

`messages` (array): Array of message objects representing the conversation history. Each message has:
- `role` (string): one of `system`, `user`, or `assistant`
- `content` (string): the message content; for multimodal models, `content` can be an array of text and image parts

Optional Parameters
`temperature` (number): Sampling temperature between 0 and 2. Higher values make output more random; lower values make it more deterministic.

`top_p` (number): Nucleus sampling parameter. Only tokens with cumulative probability up to `top_p` are considered.

`top_k` (number): Limits token selection to the K most probable tokens. Set to 0 to disable.

`min_p` (number): Minimum probability threshold relative to the most likely token.

`max_tokens` (number): Maximum number of tokens to generate. -1 means unlimited.

`stream` (boolean): Whether to stream partial message deltas using Server-Sent Events.

`stop` (array): Array of strings. Generation stops when any of these sequences is encountered.

`presence_penalty` (number): Penalizes tokens based on whether they appear in the text so far. Range: -2.0 to 2.0.

`frequency_penalty` (number): Penalizes tokens based on their frequency in the text so far. Range: -2.0 to 2.0.

`repeat_penalty` (number): Penalizes repetition of token sequences.

`seed` (number): Random seed for reproducible outputs. Use -1 for a random seed.

`response_format` (object): Controls the output format:
- `{"type": "json_object"}`: force valid JSON output
- `{"type": "json_schema", "schema": {...}}`: constrain output to a JSON schema

`tools` (array): Array of tool/function definitions for function calling. Requires the `--jinja` flag.

`tool_choice` (string | object): Controls tool selection: `auto`, `none`, or `{"type": "function", "function": {"name": "tool_name"}}`.

llama.cpp-Specific Parameters
`mirostat` (number): Enables Mirostat sampling. 0 = disabled, 1 = Mirostat 1.0, 2 = Mirostat 2.0.

`mirostat_tau` (number): Mirostat target entropy (the τ parameter).

`mirostat_eta` (number): Mirostat learning rate (the η parameter).
`reasoning_format` (string): Controls parsing of reasoning/thinking tags:
- `none`: no parsing; raw output stays in `content`
- `deepseek`: extract thoughts to the `reasoning_content` field
- `deepseek-legacy`: keep tags in `content` while also populating `reasoning_content`
Forces reasoning models to always output their thinking process.
`cache_prompt` (boolean): Reuses the KV cache from previous requests when possible for faster prompt processing.
Request Examples
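A minimal request sketch in Python; the server URL and model name are placeholders for your deployment:

```python
import json

# Hypothetical local server address; adjust host/port for your deployment.
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "llama",  # any string works when a single model is loaded
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

body = json.dumps(payload).encode("utf-8")
# POST with, e.g.:
# urllib.request.urlopen(urllib.request.Request(
#     url, data=body, headers={"Content-Type": "application/json"}))
print(json.loads(body)["messages"][0]["role"])  # → system
```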
Response Format
Standard Response
`id` (string): Unique identifier for the completion.

`object` (string): Always `"chat.completion"` for non-streaming responses.

`created` (number): Unix timestamp of when the completion was created.

`model` (string): The model used for the completion.

`choices` (array): Array of completion choices. Each choice contains:
- `index` (number): choice index
- `message` (object): the generated message with `role` and `content`
- `finish_reason` (string): why generation stopped: `stop`, `length`, or `tool_calls`
- `logprobs` (object | null): token probabilities, if requested

`usage` (object): Token usage statistics:
- `prompt_tokens` (number): tokens in the prompt
- `completion_tokens` (number): tokens generated
- `total_tokens` (number): sum of prompt and completion tokens

`timings` (object): Performance metrics (llama.cpp-specific):
- `prompt_n` (number): prompt tokens processed
- `prompt_ms` (number): time spent processing the prompt
- `predicted_n` (number): tokens generated
- `predicted_ms` (number): time spent generating
- `cache_n` (number): tokens reused from the KV cache
Example Response
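A representative non-streaming response (illustrative values, not literal output) has this shape:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1717000000,
  "model": "llama",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20 }
}
```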
Streaming Responses
When `stream: true` is set, the server sends Server-Sent Events (SSE):
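Each SSE event is a `data:` line carrying a JSON chunk whose `choices[0].delta` holds the incremental content, and the stream ends with a `data: [DONE]` sentinel (the OpenAI streaming convention). A minimal Python sketch of collecting the deltas (`collect_deltas` is an illustrative helper, not part of any library):

```python
import json

def collect_deltas(sse_lines):
    """Concatenate content deltas from a sequence of SSE lines."""
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:  # first chunk may carry only the role
            text.append(delta["content"])
    return "".join(text)

# Illustrative stream of three chunks followed by the sentinel.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_deltas(sample))  # → Hello
```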
Function Calling
To enable function calling, start the server with the `--jinja` flag:
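A request with a tool definition might look like the following Python sketch (the `get_weather` tool is hypothetical):

```python
import json

# Illustrative tool definition in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "llama",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(payload["tool_choice"])  # → auto
```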
Reasoning Models
For models with reasoning capabilities (e.g., DeepSeek-R1), thoughts are extracted to `reasoning_content`:
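With `reasoning_format: "deepseek"`, the assistant message carries a shape like this (illustrative values):

```json
{
  "role": "assistant",
  "content": "The answer is 42.",
  "reasoning_content": "Let me think step by step..."
}
```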
Set `reasoning_format: "none"` to get raw output without reasoning extraction.

Multi-turn Conversations
Include the full conversation history in the `messages` array:
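An illustrative payload carrying two prior turns plus the new user message:

```json
{
  "model": "llama",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" },
    { "role": "assistant", "content": "The capital of France is Paris." },
    { "role": "user", "content": "What is its population?" }
  ]
}
```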
Performance Tips
- Enable prompt caching: set `cache_prompt: true` (the default) to reuse the KV cache across requests
- Use streaming: enable `stream: true` for better perceived latency
- Adjust context size: use the `-c` flag to set an appropriate context window for your use case
- GPU acceleration: use `--n-gpu-layers` to offload layers to the GPU
- Parallel requests: use `--parallel` to handle multiple concurrent requests
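Put together, a server launch combining the flags above might look like this (the model path and values are illustrative):

```shell
# Illustrative llama-server launch: 8K context, all layers on GPU,
# 4 parallel slots, Jinja templates enabled for function calling.
llama-server -m ./models/model.gguf --alias llama \
  -c 8192 --n-gpu-layers 99 --parallel 4 --jinja
```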
Error Responses
- `400`: Invalid request (missing or invalid parameters)
- `401`: Authentication failed
- `503`: Server unavailable (model still loading)
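Error bodies follow an OpenAI-style `error` object; a typical 503 body has this shape (illustrative values):

```json
{
  "error": {
    "code": 503,
    "message": "Loading model",
    "type": "unavailable_error"
  }
}
```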

