The chat completions endpoint generates responses for conversational interactions. It follows the OpenAI Chat Completions API format.
Endpoint
POST /v1/chat/completions
Request body
model
string
required
The model to use for chat completion.

messages
array
required
Array of message objects in the conversation. Each message has:

role
string
required
The role of the message author: "system", "user", or "assistant".

content
string | array
The content of the message. Can be a string or an array of content parts for multi-modal inputs.

name
string
Optional name for the message author.

max_tokens
integer
Maximum number of tokens to generate. Deprecated in favor of max_completion_tokens.

max_completion_tokens
integer
Maximum number of tokens to generate in the completion.

temperature
number
default:"1"
Sampling temperature between 0 and 2.

top_p
number
default:"1"
Nucleus sampling threshold.

n
integer
default:"1"
Number of chat completion choices to generate.

stream
boolean
default:"false"
Whether to stream partial message deltas.
stop
string | array
default:"null"
Up to 4 sequences where generation will stop.
presence_penalty
number
default:"0"
Penalty for tokens that have already appeared. Range: [-2.0, 2.0].

frequency_penalty
number
default:"0"
Penalty for tokens based on their frequency. Range: [-2.0, 2.0].

logit_bias
object
Modify the likelihood of specified tokens.

seed
integer
Random seed for deterministic sampling.

logprobs
boolean
default:"false"
Whether to return log probabilities.

top_logprobs
integer
Number of most likely tokens to return at each position (0-20).
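The sampling parameters above map directly onto keys in the JSON request body. As a sketch, the helper below (illustrative, not part of vLLM) assembles a body and validates the documented ranges before sending:

```python
def build_chat_request(model, messages, temperature=1.0, top_p=1.0,
                       presence_penalty=0.0, frequency_penalty=0.0,
                       max_completion_tokens=None, seed=None):
    """Assemble a chat completions request body, checking documented ranges."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0, 2]")
    for name, value in (("presence_penalty", presence_penalty),
                        ("frequency_penalty", frequency_penalty)):
        if not -2.0 <= value <= 2.0:
            raise ValueError(f"{name} must be in [-2.0, 2.0]")
    body = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "presence_penalty": presence_penalty,
        "frequency_penalty": frequency_penalty,
    }
    # Optional fields are omitted entirely rather than sent as null.
    if max_completion_tokens is not None:
        body["max_completion_tokens"] = max_completion_tokens
    if seed is not None:
        body["seed"] = seed
    return body

req = build_chat_request(
    "meta-llama/Llama-2-7b-chat-hf",
    [{"role": "user", "content": "Hi"}],
    temperature=0.7,
    seed=42,
)
```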
Function calling
tools
array
List of tools (functions) the model can call.
tool_choice
string | object
default:"none"
Controls which tool is called: "none", "auto", "required", or a specific tool.
vLLM-specific parameters
top_k
integer
Number of highest-probability tokens to keep.

min_p
number
default:"0"
Minimum probability threshold, relative to the most likely token.

repetition_penalty
number
default:"1"
Penalty for token repetition. Values above 1 discourage repetition.

stop_token_ids
array
Token IDs that will stop generation.

ignore_eos
boolean
default:"false"
Whether to ignore the EOS token.
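vLLM's OpenAI-compatible server accepts these extra parameters at the top level of the request body, alongside the standard fields. A sketch of such a payload (the specific values are illustrative):

```python
import json

# Standard OpenAI fields plus vLLM-specific sampling controls,
# all at the top level of the same request body.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    # vLLM-specific parameters:
    "top_k": 40,                # keep only the 40 highest-probability tokens
    "min_p": 0.05,              # drop tokens below 5% of the top token's probability
    "repetition_penalty": 1.1,  # >1 discourages repeating earlier tokens
    "stop_token_ids": [2],      # e.g. the model's EOS token id
    "ignore_eos": False,
}
body = json.dumps(payload)
```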
Non-streaming response
id
string
Unique identifier for the chat completion.

object
string
Always "chat.completion".

created
integer
Unix timestamp of creation.

model
string
The model used for the completion.

choices
array
Array of completion choices. Each choice contains:

message
object
The generated message.

content
string
The generated message content.

tool_calls
array
Tool calls made by the model, if any.

finish_reason
string
Why generation stopped: "stop", "length", "tool_calls", or "content_filter".

usage
object
Token usage statistics: prompt_tokens, completion_tokens, and total_tokens.
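Pulling the reply out of a non-streaming response is plain dictionary access. A minimal sketch using the fields above:

```python
# A response shaped like the non-streaming schema above.
response = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant",
                        "content": "The capital of France is Paris."},
            "finish_reason": "stop",
        }
    ],
}

choice = response["choices"][0]
reply = choice["message"]["content"]
if choice["finish_reason"] == "length":
    # The completion was cut off; consider raising max_completion_tokens.
    pass
```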
Example: Basic chat
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"max_tokens": 100
}'
The response:

{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1677652288,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 10,
"total_tokens": 30
}
}
Example: Streaming chat
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [{"role": "user", "content": "Tell me a joke"}],
"stream": true
}'
Streaming returns Server-Sent Events:
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1677652288,"model":"meta-llama/Llama-2-7b-chat-hf","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}
data: [DONE]
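Each `data:` line carries one JSON chunk whose `delta` may contain a piece of the content; a client reassembles the message by concatenating the deltas until `[DONE]`. A minimal parser sketch:

```python
import json

def accumulate_sse(lines):
    """Concatenate content deltas from 'data:' lines of a chunk stream."""
    text = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content") or "")
    return "".join(text)

stream = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}',
    'data: [DONE]',
]
print(accumulate_sse(stream))  # prints "Why did"
```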
Example: Multi-turn conversation
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "2+2 equals 4."},
{"role": "user", "content": "What about 3+3?"}
]
}'
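The server is stateless: every request resends the full history, and the client appends each assistant reply before adding the next user turn. A sketch of that bookkeeping:

```python
def next_messages(history, assistant_reply, user_followup):
    """Extend the conversation history for the next stateless request."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_followup},
    ]

history = [{"role": "user", "content": "What is 2+2?"}]
# After receiving the first reply, build the messages for turn two:
history = next_messages(history, "2+2 equals 4.", "What about 3+3?")
```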
Example: With vision (multi-modal)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}
]
}'
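Besides public URLs, image content can typically be passed inline as a base64 data URL, which avoids the server having to fetch the image. A sketch of building such a content part from raw bytes:

```python
import base64

def image_part_from_bytes(raw: bytes, mime: str = "image/jpeg") -> dict:
    """Build an image_url content part embedding the image as a data URL."""
    b64 = base64.b64encode(raw).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}

part = image_part_from_bytes(b"\xff\xd8\xff")  # placeholder JPEG bytes
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        part,
    ],
}
```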
Example: Function calling
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "What is the weather in Boston?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}'
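When the model decides to call a tool, the assistant message carries a tool_calls array with the function name and JSON-encoded arguments; the client runs the function and sends the result back as a "tool" role message. A dispatch sketch (get_weather here is a stand-in for a real implementation):

```python
import json

def get_weather(location):
    # Stand-in for a real weather lookup.
    return {"location": location, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

def handle_tool_calls(assistant_message):
    """Execute each requested tool and build the follow-up 'tool' messages."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# Shaped like an assistant message returned with finish_reason "tool_calls".
msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\"location\": \"Boston\"}"},
    }],
}
follow_up = handle_tool_calls(msg)
```

The follow-up messages are appended to the history (after the assistant message) and the whole conversation is sent back so the model can compose its final answer.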