Chat Completions

The chat completions endpoint generates responses in a conversational format. This endpoint is compatible with OpenAI’s /v1/chat/completions API.

Request

cURL

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_completion_tokens": 128,
    "temperature": 0.7
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_completion_tokens=128,
    temperature=0.7
)

print(response.choices[0].message.content)

Parameters

Required

messages
array
required
Array of message objects in the conversation.
role
string
required
Role of the message sender: system, user, assistant, tool, function, or developer.
content
string | array
required
Message content. Can be:
  • A string for text-only messages
  • An array of content parts for multimodal messages:
    • {"type": "text", "text": "..."}
    • {"type": "image_url", "image_url": {"url": "..."}}
    • {"type": "video_url", "video_url": {"url": "..."}}
    • {"type": "audio_url", "audio_url": {"url": "..."}}
name
string
Name of the message sender.
tool_calls
array
Tool calls made by the assistant (for assistant messages).
tool_call_id
string
ID of the tool call this message is responding to (for tool messages).
model
string
required
Model name. Supports LoRA adapters via base-model:adapter-name syntax.
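As a sketch of the adapter syntax above ("my-adapter" is a hypothetical adapter name; it must match an adapter the server has loaded):

```python
# Sketch of the base-model:adapter-name syntax; "my-adapter" is hypothetical
# and must correspond to a LoRA adapter loaded by the server.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct:my-adapter",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Everything before the colon is the base model; everything after selects the adapter.
base_model, adapter = payload["model"].split(":", 1)
```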

Sampling Parameters

max_completion_tokens
integer
Maximum number of tokens to generate. Replaces deprecated max_tokens.
max_tokens
integer
Deprecated: Use max_completion_tokens instead.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random.
top_p
number
default:"1.0"
Nucleus sampling threshold: only tokens whose cumulative probability mass is within top_p are considered.
top_k
integer
Only sample from the top K tokens.
min_p
number
Minimum probability threshold for sampling; tokens whose probability falls below min_p (scaled by the most likely token's probability) are filtered out.
n
integer
default:"1"
Number of chat completion choices to generate.
seed
integer
Random seed for deterministic generation.
stop
string | array
Stop sequences.
stop_token_ids
array
Stop token IDs.

Penalization

frequency_penalty
number
default:"0.0"
Penalizes new tokens based on their frequency in the text so far. Range: [-2.0, 2.0].
presence_penalty
number
default:"0.0"
Penalizes new tokens based on whether they already appear in the text so far. Range: [-2.0, 2.0].
repetition_penalty
number
default:"1.0"
Penalizes repeated tokens. Values above 1.0 discourage repetition; values below 1.0 encourage it.

Structured Output

response_format
object
Format of the response:
  • {"type": "text"} - Plain text (default)
  • {"type": "json_object"} - Valid JSON object
  • {"type": "json_schema", "json_schema": {...}} - JSON matching schema
regex
string
Regular expression for constrained generation.
ebnf
string
EBNF grammar for constrained generation.
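A sketch of a json_schema request body (the schema name and fields are illustrative; when using the OpenAI Python client, server-specific fields such as regex and ebnf are passed via extra_body rather than as typed parameters):

```python
# response_format with a JSON schema: the output must validate against `schema`.
request = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Give me a city and its country."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

# regex and ebnf are alternatives to response_format; with the OpenAI client,
# pass them through extra_body, e.g. extra_body={"regex": r"(Paris|London|Berlin)"}.
```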

Tools & Function Calling

tools
array
List of tools available to the model.
type
string
Always "function".
function
object
Function definition.
name
string
Function name.
description
string
Function description.
parameters
object
JSON schema for function parameters.
strict
boolean
default:"false"
Whether to enforce strict schema validation.
tool_choice
string | object
default:"auto"
Controls tool usage:
  • auto: Model decides whether to call tools
  • none: Model will not call tools
  • required: Model must call at least one tool
  • {"type": "function", "function": {"name": "..."}}: Force specific tool

Logging & Debugging

logprobs
boolean
default:"false"
Whether to return log probabilities.
top_logprobs
integer
Number of top log probabilities to return (requires logprobs=true).
logit_bias
object
Bias certain tokens. Maps token IDs to bias values between -100 and 100.
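For instance, to request per-token probabilities and nudge decoding away from a particular token (the token ID below is hypothetical):

```python
# logprobs, top_logprobs, and logit_bias as they would appear in a request body.
params = {
    "logprobs": True,
    "top_logprobs": 5,             # only honored when logprobs is true
    "logit_bias": {"1234": -100},  # hypothetical token ID; -100 effectively bans it
}
```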

Streaming

stream
boolean
default:"false"
Whether to stream the response.
stream_options
object
Streaming options:
  • include_usage: Include usage statistics in final chunk
  • continuous_usage_stats: Include usage stats in each chunk
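With include_usage set, the OpenAI convention is that the usage-carrying final chunk has an empty choices list, so streaming loops should guard for it. A minimal sketch over plain chunk dicts:

```python
def consume(chunks):
    """Print content deltas; return usage from the final chunk, if present."""
    usage = None
    for chunk in chunks:
        if chunk.get("choices"):  # usage-only chunks have an empty choices list
            delta = chunk["choices"][0]["delta"]
            if delta.get("content"):
                print(delta["content"], end="")
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage

# Minimal fake stream mirroring the SSE wire format shown later on this page.
stream = [
    {"choices": [{"delta": {"content": "Hi"}}]},
    {"choices": [], "usage": {"total_tokens": 5}},
]
usage = consume(stream)
```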

Multimodal

max_dynamic_patch
integer
Maximum number of dynamic patches for vision models.
min_dynamic_patch
integer
Minimum number of dynamic patches for vision models.

Reasoning Models

reasoning_effort
string
default:"medium"
Constrains reasoning effort for reasoning models:
  • low: Least effort, faster responses
  • medium: Balanced effort
  • high: Most effort, more thorough reasoning
Currently supported only for GPT-OSS models served via the harmony path.
separate_reasoning
boolean
default:"true"
Separate reasoning content from final response.
stream_reasoning
boolean
default:"true"
Stream reasoning tokens during generation.

SGLang Extensions

ignore_eos
boolean
default:"false"
Continue generation even after EOS token.
skip_special_tokens
boolean
default:"true"
Whether to skip special tokens in output.
no_stop_trim
boolean
default:"false"
Do not trim stop sequences from output.
stop_regex
string | array
Regular expression(s) to use as stop conditions.
min_tokens
integer
default:"0"
Minimum number of tokens to generate.
continue_final_message
boolean
default:"false"
Continue from the last assistant message.
lora_path
string
Path to LoRA adapter weights.
chat_template_kwargs
object
Additional kwargs to pass to the chat template.
custom_logit_processor
string
Custom logit processor for advanced sampling control.
return_hidden_states
boolean
default:"false"
Return hidden states from the model.
return_routed_experts
boolean
default:"false"
Return expert routing information for MoE models.
return_cached_tokens_details
boolean
default:"false"
Return detailed cache hit information.
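These extension fields are not typed parameters of the OpenAI Python client, so they are passed through extra_body (a sketch; the values and the chat_template_kwargs key are illustrative):

```python
# SGLang-specific fields go in extra_body when using the OpenAI Python client:
# client.chat.completions.create(..., extra_body=extra_body)
extra_body = {
    "min_tokens": 16,        # force at least 16 generated tokens
    "ignore_eos": False,
    "skip_special_tokens": True,
    "chat_template_kwargs": {"enable_thinking": False},  # hypothetical template kwarg
}
```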

Response

id
string
Unique identifier for the chat completion.
object
string
Always "chat.completion".
created
integer
Unix timestamp of creation time.
model
string
Model used for generation.
choices
array
Array of chat completion choices.
index
integer
Choice index.
message
object
The generated message.
role
string
Role of the message (usually "assistant").
content
string | null
Message content.
reasoning_content
string | null
Reasoning content for reasoning models.
tool_calls
array | null
Tool calls made by the model.
id
string
Tool call ID.
type
string
Always "function".
function
object
name
string
Function name.
arguments
string
Function arguments as JSON string.
logprobs
object | null
Log probability information.
content
array
Log probabilities for each token.
token
string
The token.
logprob
number
Log probability of the token.
bytes
array
UTF-8 bytes of the token.
top_logprobs
array
Top alternative tokens and their log probabilities.
finish_reason
string
Reason for completion end:
  • stop: Natural stop or stop sequence
  • length: Max tokens reached
  • tool_calls: Model called a tool
  • content_filter: Content filtering
  • abort: Request aborted
matched_stop
integer | string | null
The stop string or stop token ID that was matched, if any.
usage
object
Token usage statistics.
prompt_tokens
integer
Tokens in the prompt.
completion_tokens
integer
Tokens in the completion.
total_tokens
integer
Total tokens used.
prompt_tokens_details
object
cached_tokens
integer
Number of cached tokens.
reasoning_tokens
integer
Tokens used for reasoning (reasoning models).
sglext
object
SGLang-specific extensions.
routed_experts
string
Expert routing information for MoE models.
cached_tokens_details
object
Detailed cache information.
device
integer
Tokens from GPU cache.
host
integer
Tokens from CPU cache.
storage
integer
Tokens from storage backend.
storage_backend
string
Storage backend type.

Streaming Response

When stream=true, responses are sent as Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":"The"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{"content":" capital"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":8,"total_tokens":23}}

data: [DONE]
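Each line prefixed with `data: ` carries a JSON chunk, and the literal `[DONE]` marks the end of the stream. A minimal parser sketch over raw SSE lines:

```python
import json

def parse_sse(lines):
    """Yield decoded chunk dicts from raw SSE lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            return
        yield json.loads(data)

raw = [
    'data: {"choices":[{"delta":{"content":"Hi"}}]}',
    'data: [DONE]',
]
chunks = list(parse_sse(raw))
```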

Examples

Basic Chat

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    max_completion_tokens=256
)

print(response.choices[0].message.content)

Streaming Chat

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    max_completion_tokens=512,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Function Calling

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

JSON Output

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Generate a product review as JSON"}],
    response_format={"type": "json_object"},
    max_completion_tokens=128
)

import json
review = json.loads(response.choices[0].message.content)
print(review)

Multimodal (Vision)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"}
                }
            ]
        }
    ],
    max_completion_tokens=256
)

print(response.choices[0].message.content)

See Also