The chat completions endpoint generates responses for conversational interactions. It follows the OpenAI Chat Completions API format.

Endpoint

POST /v1/chat/completions

Request body

model (string, required)
The model to use for chat completion.

messages (array, required)
Array of message objects in the conversation. Each message has:

  role (string, required)
  The role of the message author: "system", "user", or "assistant".

  content (string | array, required)
  The content of the message. Can be a string or an array of content parts for multi-modal inputs.

  name (string, optional)
  Optional name for the message author.

max_tokens (integer, default: null)
Maximum number of tokens to generate. Deprecated in favor of max_completion_tokens.

max_completion_tokens (integer, default: null)
Maximum number of tokens to generate in the completion.

temperature (number, default: 1.0)
Sampling temperature between 0 and 2. Higher values make output more random; lower values make it more deterministic.

top_p (number, default: 1.0)
Nucleus sampling threshold: only tokens within the top_p cumulative probability mass are considered.

n (integer, default: 1)
Number of chat completion choices to generate.

stream (boolean, default: false)
Whether to stream partial message deltas as Server-Sent Events.

stop (string | array, default: null)
Up to 4 sequences where generation will stop.

presence_penalty (number, default: 0.0)
Penalty for tokens that have already appeared. Range: [-2.0, 2.0].

frequency_penalty (number, default: 0.0)
Penalty for tokens based on their frequency so far. Range: [-2.0, 2.0].

logit_bias (object, default: null)
Modify the likelihood of specified tokens appearing in the completion.

seed (integer, default: null)
Random seed for best-effort deterministic sampling.

logprobs (boolean, default: false)
Whether to return log probabilities of the output tokens.

top_logprobs (integer, default: 0)
Number of most likely tokens to return at each position (0-20). Requires logprobs to be true.
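
As a sketch, the request body can be assembled programmatically from the parameters above. The model name and sampling values below are illustrative; only `model` and `messages` are required.

```python
import json

# Assemble a chat completion request body from the parameters above.
# Only "model" and "messages" are required; everything else overrides a default.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,            # default 1.0
    "top_p": 0.9,                  # default 1.0
    "max_completion_tokens": 100,  # preferred over the deprecated max_tokens
    "seed": 42,                    # best-effort deterministic sampling
}

# Send as the POST body with Content-Type: application/json.
body = json.dumps(payload)
```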

Function calling

tools (array, default: null)
List of tools (functions) the model can call.

tool_choice (string | object, default: "none")
Controls which tool is called: "none", "auto", "required", or a specific tool given as an object.

vLLM-specific parameters

top_k (integer, default: -1)
Number of highest-probability tokens to keep; -1 disables top-k filtering.

min_p (number, default: 0.0)
Minimum probability threshold, relative to the probability of the most likely token.

repetition_penalty (number, default: 1.0)
Penalty for token repetition; values above 1.0 discourage repetition.

stop_token_ids (array, default: [])
Token IDs that will stop generation.

ignore_eos (boolean, default: false)
Whether to ignore the EOS token and continue generating.
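
These extra parameters ride along in the same JSON body as the standard OpenAI fields. A minimal sketch (model name illustrative; servers that don't recognize vLLM extensions may reject or ignore them):

```python
import json

# vLLM-specific sampling parameters are passed alongside the standard
# OpenAI-compatible fields in one request body.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello"}],
    "top_k": 50,                # consider only the 50 most likely tokens (-1 disables)
    "min_p": 0.05,              # drop tokens below 5% of the top token's probability
    "repetition_penalty": 1.1,  # >1.0 discourages repeating tokens
}

body = json.dumps(payload)
```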

Response format

Non-streaming response

id (string)
Unique identifier for the chat completion.

object (string)
Always "chat.completion".

created (integer)
Unix timestamp of creation.

model (string)
The model used.

choices (array)
Array of completion choices. Each choice has:

  index (integer)
  Choice index.

  message (object)
  The generated message, with:

    role (string)
    Always "assistant".

    content (string)
    The generated message content.

    tool_calls (array)
    Tool calls made by the model, if any.

  finish_reason (string)
  Why generation stopped: "stop", "length", "tool_calls", or "content_filter".

usage (object)
Token usage statistics (prompt_tokens, completion_tokens, total_tokens).
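
A client typically pulls the reply out of the first choice and checks why generation stopped. A sketch against a sample response shaped like the schema above:

```python
# Walk a sample non-streaming response using the fields documented above.
response = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The capital of France is Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 10, "total_tokens": 30},
}

choice = response["choices"][0]
answer = choice["message"]["content"]
# "length" here would mean the reply was cut off by the token limit.
finished_normally = choice["finish_reason"] == "stop"
total_tokens = response["usage"]["total_tokens"]
```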

Example: Basic chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

The server responds with:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}

Example: Streaming chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true
  }'
Streaming returns Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}

data: [DONE]
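
A streaming client strips the `data: ` prefix from each event, stops at the `[DONE]` sentinel, and concatenates the `delta.content` fragments. A sketch over sample lines shaped like the stream above (a real client would read the HTTP response line by line):

```python
import json

# Accumulate content deltas from Server-Sent Events lines.
sse_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}',
    "data: [DONE]",
]

text = ""
for line in sse_lines:
    data = line[len("data: "):]
    if data == "[DONE]":  # sentinel marking the end of the stream
        break
    chunk = json.loads(data)
    # The first chunk carries the role; later chunks may omit "content".
    text += chunk["choices"][0]["delta"].get("content", "")
```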

Example: Multi-turn conversation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "What is 2+2?"},
      {"role": "assistant", "content": "2+2 equals 4."},
      {"role": "user", "content": "What about 3+3?"}
    ]
  }'
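
The endpoint is stateless: each request carries the full history, so a client keeps a running message list and appends both sides of the exchange, as sketched below.

```python
# Maintain conversation state client-side and resend it each turn.
history = [{"role": "user", "content": "What is 2+2?"}]

# Append the assistant's reply (as returned in choices[0].message)...
history.append({"role": "assistant", "content": "2+2 equals 4."})
# ...then the follow-up question, and POST the whole list again.
history.append({"role": "user", "content": "What about 3+3?"})
```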

Example: With vision (multi-modal)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
      }
    ]
  }'
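
Multi-modal messages use an array of content parts instead of a plain string. A small helper sketch (the helper name is hypothetical; image support depends on the model being served):

```python
# Build a user message mixing text and an image reference.
# "image_url" may point to an HTTP(S) URL or, model permitting, a data URL.
def image_message(text, url):
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": url}},
        ],
    }

msg = image_message("What is in this image?", "https://example.com/image.jpg")
```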

Example: Function calling

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
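
When the model chooses a tool, the response carries `tool_calls` with a function name and JSON-encoded arguments; the client runs the matching local function. A dispatch sketch (the `get_weather` implementation below is a stand-in):

```python
import json

# Placeholder implementation of the tool declared in the request above.
def get_weather(location):
    return f"Sunny in {location}"

# Shaped like choices[0].message.tool_calls[0] in the response.
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Boston"}'},
}

# Look up the named function and call it with the decoded arguments.
available = {"get_weather": get_weather}
fn = available[tool_call["function"]["name"]]
result = fn(**json.loads(tool_call["function"]["arguments"]))
```

The result would then be sent back as a `role: "tool"` message so the model can compose its final answer.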
