The chat completions endpoint generates responses for conversational interactions. It follows the OpenAI Chat Completions API format.

Endpoint

POST /v1/chat/completions

Request body

model (string, required)
The model to use for chat completion.

messages (array, required)
Array of message objects in the conversation. Each message has:

  role (string, required)
  The role of the message author: "system", "user", or "assistant".

  content (string | array, required)
  The content of the message. Can be a string or an array of content parts for multi-modal inputs.

  name (string, optional)
  Optional name for the message author.

max_tokens (integer, default: null)
Maximum number of tokens to generate. Deprecated in favor of max_completion_tokens.

max_completion_tokens (integer, default: null)
Maximum number of tokens to generate in the completion.

temperature (number, default: 1.0)
Sampling temperature between 0 and 2. Higher values make output more random; lower values make it more deterministic.

top_p (number, default: 1.0)
Nucleus sampling threshold: only tokens within the top_p cumulative probability mass are considered.

n (integer, default: 1)
Number of chat completion choices to generate.

stream (boolean, default: false)
Whether to stream partial message deltas as Server-Sent Events.

stop (string | array, default: null)
Up to 4 sequences where generation will stop.

presence_penalty (number, default: 0.0)
Penalty for tokens that have already appeared. Range: [-2.0, 2.0].

frequency_penalty (number, default: 0.0)
Penalty for tokens based on their frequency so far. Range: [-2.0, 2.0].

logit_bias (object, default: null)
Modify the likelihood of specified tokens appearing in the completion.

seed (integer, default: null)
Random seed for best-effort deterministic sampling.

logprobs (boolean, default: false)
Whether to return log probabilities of the output tokens.

top_logprobs (integer, default: 0)
Number of most likely tokens to return at each position (0-20). Requires logprobs to be true.
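
As a sketch, the request body can be assembled programmatically from the parameters above. The model name and sampling values below are illustrative; only `model` and `messages` are required.

```python
import json

# Assemble a chat completion request body from the parameters above.
# Only "model" and "messages" are required; everything else overrides a default.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,            # default 1.0
    "top_p": 0.9,                  # default 1.0
    "max_completion_tokens": 100,  # preferred over the deprecated max_tokens
    "seed": 42,                    # best-effort deterministic sampling
}

# Send as the POST body with Content-Type: application/json.
body = json.dumps(payload)
```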

Function calling

tools (array, default: null)
List of tools (functions) the model can call.

tool_choice (string | object, default: "none")
Controls which tool is called: "none", "auto", "required", or a specific tool given as an object.

vLLM-specific parameters

top_k (integer, default: -1)
Number of highest-probability tokens to keep; -1 disables top-k filtering.

min_p (number, default: 0.0)
Minimum probability threshold, relative to the probability of the most likely token.

repetition_penalty (number, default: 1.0)
Penalty for token repetition; values above 1.0 discourage repetition.

stop_token_ids (array, default: [])
Token IDs that will stop generation.

ignore_eos (boolean, default: false)
Whether to ignore the EOS token and continue generating.
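
These extra parameters ride along in the same JSON body as the standard OpenAI fields. A minimal sketch (model name illustrative; servers that don't recognize vLLM extensions may reject or ignore them):

```python
import json

# vLLM-specific sampling parameters are passed alongside the standard
# OpenAI-compatible fields in one request body.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello"}],
    "top_k": 50,                # consider only the 50 most likely tokens (-1 disables)
    "min_p": 0.05,              # drop tokens below 5% of the top token's probability
    "repetition_penalty": 1.1,  # >1.0 discourages repeating tokens
}

body = json.dumps(payload)
```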

Response format

Non-streaming response

id (string)
Unique identifier for the chat completion.

object (string)
Always "chat.completion".

created (integer)
Unix timestamp of creation.

model (string)
The model used.

choices (array)
Array of completion choices. Each choice has:

  index (integer)
  Choice index.

  message (object)
  The generated message, with:

    role (string)
    Always "assistant".

    content (string)
    The generated message content.

    tool_calls (array)
    Tool calls made by the model, if any.

  finish_reason (string)
  Why generation stopped: "stop", "length", "tool_calls", or "content_filter".

usage (object)
Token usage statistics (prompt_tokens, completion_tokens, total_tokens).
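
A client typically pulls the reply out of the first choice and checks why generation stopped. A sketch against a sample response shaped like the schema above:

```python
# Walk a sample non-streaming response using the fields documented above.
response = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The capital of France is Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 10, "total_tokens": 30},
}

choice = response["choices"][0]
answer = choice["message"]["content"]
# "length" here would mean the reply was cut off by the token limit.
finished_normally = choice["finish_reason"] == "stop"
total_tokens = response["usage"]["total_tokens"]
```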

Example: Basic chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

The server responds with:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}

Example: Streaming chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true
  }'
Streaming returns Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}

data: [DONE]
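
A streaming client strips the `data: ` prefix from each event, stops at the `[DONE]` sentinel, and concatenates the `delta.content` fragments. A sketch over sample lines shaped like the stream above (a real client would read the HTTP response line by line):

```python
import json

# Accumulate content deltas from Server-Sent Events lines.
sse_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}',
    "data: [DONE]",
]

text = ""
for line in sse_lines:
    data = line[len("data: "):]
    if data == "[DONE]":  # sentinel marking the end of the stream
        break
    chunk = json.loads(data)
    # The first chunk carries the role; later chunks may omit "content".
    text += chunk["choices"][0]["delta"].get("content", "")
```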

Example: Multi-turn conversation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "What is 2+2?"},
      {"role": "assistant", "content": "2+2 equals 4."},
      {"role": "user", "content": "What about 3+3?"}
    ]
  }'
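
The endpoint is stateless: each request carries the full history, so a client keeps a running message list and appends both sides of the exchange, as sketched below.

```python
# Maintain conversation state client-side and resend it each turn.
history = [{"role": "user", "content": "What is 2+2?"}]

# Append the assistant's reply (as returned in choices[0].message)...
history.append({"role": "assistant", "content": "2+2 equals 4."})
# ...then the follow-up question, and POST the whole list again.
history.append({"role": "user", "content": "What about 3+3?"})
```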

Example: With vision (multi-modal)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
      }
    ]
  }'
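
Multi-modal messages use an array of content parts instead of a plain string. A small helper sketch (the helper name is hypothetical; image support depends on the model being served):

```python
# Build a user message mixing text and an image reference.
# "image_url" may point to an HTTP(S) URL or, model permitting, a data URL.
def image_message(text, url):
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": url}},
        ],
    }

msg = image_message("What is in this image?", "https://example.com/image.jpg")
```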

Example: Function calling

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
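
When the model chooses a tool, the response carries `tool_calls` with a function name and JSON-encoded arguments; the client runs the matching local function. A dispatch sketch (the `get_weather` implementation below is a stand-in):

```python
import json

# Placeholder implementation of the tool declared in the request above.
def get_weather(location):
    return f"Sunny in {location}"

# Shaped like choices[0].message.tool_calls[0] in the response.
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Boston"}'},
}

# Look up the named function and call it with the decoded arguments.
available = {"get_weather": get_weather}
fn = available[tool_call["function"]["name"]]
result = fn(**json.loads(tool_call["function"]["arguments"]))
```

The result would then be sent back as a `role: "tool"` message so the model can compose its final answer.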
