Chat Completions
The chat completions endpoint generates responses in a conversational format. This endpoint is compatible with OpenAI's `/v1/chat/completions` API.
Request
Parameters
Required
Array of message objects in the conversation.
Role of the message sender:
`system`, `user`, `assistant`, `tool`, `function`, or `developer`.
Message content. Can be:
- A string for text-only messages
- An array of content parts for multimodal messages:
  - `{"type": "text", "text": "..."}`
  - `{"type": "image_url", "image_url": {"url": "..."}}`
  - `{"type": "video_url", "video_url": {"url": "..."}}`
  - `{"type": "audio_url", "audio_url": {"url": "..."}}`
Name of the message sender.
Tool calls made by the assistant (for assistant messages).
ID of the tool call this message is responding to (for tool messages).
Model name. Supports LoRA adapters via
`base-model:adapter-name` syntax.
Sampling Parameters
Maximum number of tokens to generate. Replaces deprecated
`max_tokens`.
Deprecated: Use `max_completion_tokens` instead.
Sampling temperature between 0 and 2. Higher values make output more random.
Nucleus sampling threshold.
Only sample from the top K tokens.
Minimum probability threshold for sampling.
Number of chat completion choices to generate.
Random seed for deterministic generation.
Stop sequences.
Stop token IDs.
Penalization
Penalizes tokens based on frequency. Range: [-2.0, 2.0].
Penalizes tokens based on presence. Range: [-2.0, 2.0].
Penalizes repeated tokens.
Structured Output
Format of the response:
- `{"type": "text"}` - Plain text (default)
- `{"type": "json_object"}` - Valid JSON object
- `{"type": "json_schema", "json_schema": {...}}` - JSON matching the supplied schema
Regular expression for constrained generation.
EBNF grammar for constrained generation.
Tools & Function Calling
Controls tool usage:
- `auto`: Model decides whether to call tools
- `none`: Model will not call tools
- `required`: Model must call at least one tool
- `{"type": "function", "function": {"name": "..."}}`: Force a specific tool
Logging & Debugging
Whether to return log probabilities.
Number of top log probabilities to return (requires
`logprobs=true`).
Bias certain tokens. Maps token IDs to bias values between -100 and 100.
Streaming
Whether to stream the response.
Streaming options:
- `include_usage`: Include usage statistics in the final chunk
- `continuous_usage_stats`: Include usage stats in each chunk
Multimodal
Maximum number of dynamic patches for vision models.
Minimum number of dynamic patches for vision models.
Reasoning Models
Constrains reasoning effort for reasoning models:
- `low`: Least effort, faster responses
- `medium`: Balanced effort
- `high`: Most effort, more thorough reasoning
Separate reasoning content from final response.
Stream reasoning tokens during generation.
SGLang Extensions
Continue generation even after EOS token.
Whether to skip special tokens in output.
Do not trim stop sequences from output.
Regular expression(s) to use as stop conditions.
Minimum number of tokens to generate.
Continue from the last assistant message.
Path to LoRA adapter weights.
Additional kwargs to pass to the chat template.
Custom logit processor for advanced sampling control.
Return hidden states from the model.
Return expert routing information for MoE models.
Return detailed cache hit information.
Response
Unique identifier for the chat completion.
Always
`"chat.completion"`.
Unix timestamp of creation time.
Model used for generation.
Array of chat completion choices.
Choice index.
The generated message.
Role of the message (usually
`"assistant"`).
Message content.
Reasoning content for reasoning models.
Reason the completion ended:
- `stop`: Natural stop or stop sequence
- `length`: Max tokens reached
- `tool_calls`: Model called a tool
- `content_filter`: Content filtering
- `abort`: Request aborted
The stop sequence that was matched.
SGLang-specific extensions.
Expert routing information for MoE models.
Streaming Response
When `stream=true`, responses are sent as Server-Sent Events:
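A sketch of the wire format, following OpenAI's `chat.completion.chunk` shape (all field values below are illustrative, not actual server output):

```text
data: {"id": "...", "object": "chat.completion.chunk", "created": 1700000000, "model": "...", "choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": null}]}

data: [DONE]
```

Each chunk carries an incremental `delta`; the stream ends with the `[DONE]` sentinel.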
Examples
Basic Chat
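A minimal sketch using only the Python standard library. The server URL and model name are assumptions (adjust to your deployment); the request body uses only parameters documented above.

```python
import json
import urllib.request

# Assumption: an SGLang server exposing the OpenAI-compatible endpoint here.
URL = "http://localhost:30000/v1/chat/completions"

def build_basic_chat_request(model: str, user_text: str) -> dict:
    """Minimal request body for a one-turn chat completion."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.7,
        "max_completion_tokens": 128,
    }

def post_json(url: str, body: dict) -> dict:
    """POST the request body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server:
# resp = post_json(URL, build_basic_chat_request("my-model", "Hello!"))
# print(resp["choices"][0]["message"]["content"])
```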
Streaming Chat
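A sketch of a streaming request plus a parser for the Server-Sent Events lines the endpoint returns; the chunk shape assumed here is OpenAI's `chat.completion.chunk` format.

```python
import json

def build_streaming_request(model: str, prompt: str) -> dict:
    """Request body asking for a streamed response with final usage stats."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "stream_options": {"include_usage": True},
    }

def parse_sse_line(line: str):
    """Parse one SSE line into a chunk dict.

    Returns None for non-data lines and for the terminal [DONE] sentinel.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    return json.loads(payload)

# Each parsed chunk carries an incremental piece of the reply:
# chunk["choices"][0]["delta"].get("content", "")
```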
Function Calling
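A sketch of a tool-enabled request and the tool-role follow-up message. The `get_weather` tool and its schema are hypothetical, purely for illustration; the `tools`/`tool_choice` fields are the ones documented above.

```python
# Hypothetical tool definition; name and schema are illustrative only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_tool_request(model: str, user_text: str) -> dict:
    """Request that lets the model decide whether to call the tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",  # or "none", "required", or a forced tool
    }

def tool_result_message(tool_call: dict, result: str) -> dict:
    """Build the tool-role message that answers an assistant tool call."""
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": result,
    }
```

After the model responds with `finish_reason: "tool_calls"`, run the tool yourself, append the result via `tool_result_message`, and send a second request to get the final answer.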
JSON Output
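A sketch of schema-constrained output via `response_format`; the `person` schema and its fields are illustrative, and the nested `{"name": ..., "schema": ...}` layout assumes OpenAI's `json_schema` convention.

```python
def build_json_request(model: str, user_text: str) -> dict:
    """Request whose output must match the supplied JSON schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "person",  # illustrative schema name
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "age": {"type": "integer"},
                    },
                    "required": ["name", "age"],
                },
            },
        },
    }

# For any valid JSON object without a fixed schema, use
# {"type": "json_object"} instead.
```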
Multimodal (Vision)
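A sketch of a vision request: one user message whose content is an array of parts, mixing the `text` and `image_url` part types documented above. The URL is a placeholder.

```python
def build_vision_request(model: str, question: str, image_url: str) -> dict:
    """User message combining a text part and an image_url part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

# Example (placeholder URL):
# build_vision_request("my-vlm", "What is in this image?",
#                      "https://example.com/cat.png")
```

Video and audio inputs follow the same pattern with `video_url` and `audio_url` parts.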
See Also
- Completions - Text completion format
- Embeddings - Generate embeddings
- Models - List available models
