Chat Completions

The chat completions endpoint generates responses in a conversational format. This endpoint is compatible with OpenAI’s /v1/chat/completions API.

Request

cURL

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_completion_tokens": 128,
    "temperature": 0.7
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_completion_tokens=128,
    temperature=0.7
)

print(response.choices[0].message.content)

Parameters

Required

messages
array
required
Array of message objects in the conversation.
role
string
required
Role of the message sender: system, user, assistant, tool, function, or developer.
content
string | array
required
Message content. Can be:
  • A string for text-only messages
  • An array of content parts for multimodal messages:
    • {"type": "text", "text": "..."}
    • {"type": "image_url", "image_url": {"url": "..."}}
    • {"type": "video_url", "video_url": {"url": "..."}}
    • {"type": "audio_url", "audio_url": {"url": "..."}}
name
string
Name of the message sender.
tool_calls
array
Tool calls made by the assistant (for assistant messages).
tool_call_id
string
ID of the tool call this message is responding to (for tool messages).
model
string
required
Model name. Supports LoRA adapters via base-model:adapter-name syntax.
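As a sketch of the adapter syntax above ("my-adapter" is a hypothetical adapter name; it must match an adapter the server has loaded):

```python
# Sketch of the base-model:adapter-name syntax; "my-adapter" is hypothetical
# and must correspond to a LoRA adapter loaded by the server.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct:my-adapter",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Everything before the colon is the base model; everything after selects the adapter.
base_model, adapter = payload["model"].split(":", 1)
```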

Sampling Parameters

max_completion_tokens
integer
Maximum number of tokens to generate. Replaces deprecated max_tokens.
max_tokens
integer
Deprecated: Use max_completion_tokens instead.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random.
top_p
number
default:"1.0"
Nucleus sampling threshold: only tokens whose cumulative probability mass is within top_p are considered.
top_k
integer
Only sample from the top K tokens.
min_p
number
Minimum probability threshold for sampling; tokens whose probability falls below min_p (scaled by the most likely token's probability) are filtered out.
n
integer
default:"1"
Number of chat completion choices to generate.
seed
integer
Random seed for deterministic generation.
stop
string | array
Stop sequences.
stop_token_ids
array
Stop token IDs.

Penalization

frequency_penalty
number
default:"0.0"
Penalizes new tokens based on their frequency in the text so far. Range: [-2.0, 2.0].
presence_penalty
number
default:"0.0"
Penalizes new tokens based on whether they already appear in the text so far. Range: [-2.0, 2.0].
repetition_penalty
number
default:"1.0"
Penalizes repeated tokens. Values above 1.0 discourage repetition; values below 1.0 encourage it.

Structured Output

response_format
object
Format of the response:
  • {"type": "text"} - Plain text (default)
  • {"type": "json_object"} - Valid JSON object
  • {"type": "json_schema", "json_schema": {...}} - JSON matching schema
regex
string
Regular expression for constrained generation.
ebnf
string
EBNF grammar for constrained generation.
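A sketch of a json_schema request body (the schema name and fields are illustrative; when using the OpenAI Python client, server-specific fields such as regex and ebnf are passed via extra_body rather than as typed parameters):

```python
# response_format with a JSON schema: the output must validate against `schema`.
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Give me a city and its country."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

# regex and ebnf are alternatives to response_format; with the OpenAI client,
# pass them through extra_body, e.g. extra_body={"regex": r"(Paris|London|Berlin)"}.
```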

Tools & Function Calling

tools
array
List of tools available to the model.
type
string
Always "function".
function
object
Function definition.
name
string
Function name.
description
string
Function description.
parameters
object
JSON schema for function parameters.
strict
boolean
default:"false"
Whether to enforce strict schema validation.
tool_choice
string | object
default:"auto"
Controls tool usage:
  • auto: Model decides whether to call tools
  • none: Model will not call tools
  • required: Model must call at least one tool
  • {"type": "function", "function": {"name": "..."}}: Force specific tool

Logging & Debugging

logprobs
boolean
default:"false"
Whether to return log probabilities.
top_logprobs
integer
Number of top log probabilities to return (requires logprobs=true).
logit_bias
object
Bias certain tokens. Maps token IDs to bias values between -100 and 100.
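For instance, to request per-token probabilities and nudge decoding away from a particular token (the token ID below is hypothetical):

```python
# logprobs, top_logprobs, and logit_bias as they would appear in a request body.
params = {
    "logprobs": True,
    "top_logprobs": 5,             # only honored when logprobs is true
    "logit_bias": {"1234": -100},  # hypothetical token ID; -100 effectively bans it
}
```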

Streaming

stream
boolean
default:"false"
Whether to stream the response.
stream_options
object
Streaming options:
  • include_usage: Include usage statistics in final chunk
  • continuous_usage_stats: Include usage stats in each chunk
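With include_usage set, the OpenAI convention is that the usage-carrying final chunk has an empty choices list, so streaming loops should guard for it. A minimal sketch over plain chunk dicts:

```python
def consume(chunks):
    """Print content deltas; return usage from the final chunk, if present."""
    usage = None
    for chunk in chunks:
        if chunk.get("choices"):  # usage-only chunks have an empty choices list
            delta = chunk["choices"][0]["delta"]
            if delta.get("content"):
                print(delta["content"], end="")
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Minimal fake stream mirroring the SSE wire format shown later on this page.
stream = [
    {"choices": [{"delta": {"content": "Hi"}}]},
    {"choices": [], "usage": {"total_tokens": 5}},
]
usage = consume(stream)
```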

Multimodal

max_dynamic_patch
integer
Maximum number of dynamic patches for vision models.
min_dynamic_patch
integer
Minimum number of dynamic patches for vision models.

Reasoning Models

reasoning_effort
string
default:"medium"
Constrains reasoning effort for reasoning models:
  • low: Least effort, faster responses
  • medium: Balanced effort
  • high: Most effort, more thorough reasoning
Currently supported only for GPT-OSS models served via the harmony path.
separate_reasoning
boolean
default:"true"
Separate reasoning content from final response.
stream_reasoning
boolean
default:"true"
Stream reasoning tokens during generation.

SGLang Extensions

ignore_eos
boolean
default:"false"
Continue generation even after EOS token.
skip_special_tokens
boolean
default:"true"
Whether to skip special tokens in output.
no_stop_trim
boolean
default:"false"
Do not trim stop sequences from output.
stop_regex
string | array
Regular expression(s) to use as stop conditions.
min_tokens
integer
default:"0"
Minimum number of tokens to generate.
continue_final_message
boolean
default:"false"
Continue from the last assistant message.
lora_path
string
Path to LoRA adapter weights.
chat_template_kwargs
object
Additional kwargs to pass to the chat template.
custom_logit_processor
string
Custom logit processor for advanced sampling control.
return_hidden_states
boolean
default:"false"
Return hidden states from the model.
return_routed_experts
boolean
default:"false"
Return expert routing information for MoE models.
return_cached_tokens_details
boolean
default:"false"
Return detailed cache hit information.
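These extension fields are not typed parameters of the OpenAI Python client, so they are passed through extra_body (a sketch; the values and the chat_template_kwargs key are illustrative):

```python
# SGLang-specific fields go in extra_body when using the OpenAI Python client:
# client.chat.completions.create(..., extra_body=extra_body)
extra_body = {
    "min_tokens": 16,        # force at least 16 generated tokens
    "ignore_eos": False,
    "skip_special_tokens": True,
    "chat_template_kwargs": {"enable_thinking": False},  # hypothetical template kwarg
}
```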

Response

id
string
Unique identifier for the chat completion.
object
string
Always "chat.completion".
created
integer
Unix timestamp of creation time.
model
string
Model used for generation.
choices
array
Array of chat completion choices.
index
integer
Choice index.
message
object
The generated message.
role
string
Role of the message (usually "assistant").
content
string | null
Message content.
reasoning_content
string | null
Reasoning content for reasoning models.
tool_calls
array | null
Tool calls made by the model.
id
string
Tool call ID.
type
string
Always "function".
function
object
name
string
Function name.
arguments
string
Function arguments as JSON string.
logprobs
object | null
Log probability information.
content
array
Log probabilities for each token.
token
string
The token.
logprob
number
Log probability of the token.
bytes
array
UTF-8 bytes of the token.
top_logprobs
array
Top alternative tokens and their log probabilities.
finish_reason
string
Reason for completion end:
  • stop: Natural stop or stop sequence
  • length: Max tokens reached
  • tool_calls: Model called a tool
  • content_filter: Content filtering
  • abort: Request aborted
matched_stop
integer | string | null
The stop string or stop token ID that was matched, if any.
usage
object
Token usage statistics.
prompt_tokens
integer
Tokens in the prompt.
completion_tokens
integer
Tokens in the completion.
total_tokens
integer
Total tokens used.
prompt_tokens_details
object
cached_tokens
integer
Number of cached tokens.
reasoning_tokens
integer
Tokens used for reasoning (reasoning models).
sglext
object
SGLang-specific extensions.
routed_experts
string
Expert routing information for MoE models.
cached_tokens_details
object
Detailed cache information.
device
integer
Tokens from GPU cache.
host
integer
Tokens from CPU cache.
storage
integer
Tokens from storage backend.
storage_backend
string
Storage backend type.

Streaming Response

When stream=true, responses are sent as Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":"The"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" capital"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":8,"total_tokens":23}}

data: [DONE]
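Each line prefixed with `data: ` carries a JSON chunk, and the literal `[DONE]` marks the end of the stream. A minimal parser sketch over raw SSE lines:

```python
import json

def parse_sse(lines):
    """Yield decoded chunk dicts from raw SSE lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            return
        yield json.loads(data)

raw = [
    'data: {"choices":[{"delta":{"content":"Hi"}}]}',
    'data: [DONE]',
]
chunks = list(parse_sse(raw))
```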

Examples

Basic Chat

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_completion_tokens=256
)

print(response.choices[0].message.content)

Streaming Chat

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    max_completion_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Function Calling

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

JSON Output

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a product review as JSON"}],
    response_format={"type": "json_object"},
    max_completion_tokens=128
)

import json
review = json.loads(response.choices[0].message.content)
print(review)

Multimodal (Vision)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"}
                }
            ]
        }
    ],
    max_completion_tokens=256
)

print(response.choices[0].message.content)

See Also