## Endpoint

`POST /v1/completions`
## Request body

- `model`: The model to use for completion. Use a model name from `/v1/models`.
- `prompt`: The prompt(s) to generate completions for. Can be:
  - A string: `"Once upon a time"`
  - An array of strings: `["Hello", "World"]`
  - An array of token IDs: `[123, 456, 789]`
  - An array of token ID arrays: `[[123, 456], [789, 12]]`
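The four accepted prompt shapes can be sketched as request payloads (the model name is a placeholder):

```python
import json

# The four accepted shapes of "prompt" ("my-model" is a placeholder name).
payloads = [
    {"model": "my-model", "prompt": "Once upon a time"},       # single string
    {"model": "my-model", "prompt": ["Hello", "World"]},       # array of strings
    {"model": "my-model", "prompt": [123, 456, 789]},          # array of token IDs
    {"model": "my-model", "prompt": [[123, 456], [789, 12]]},  # array of token ID arrays
]

# Each serializes to a valid JSON request body.
bodies = [json.dumps(p) for p in payloads]
print(len(bodies))  # → 4
```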
- `max_tokens`: Maximum number of tokens to generate.
- `temperature`: Sampling temperature between 0 and 2. Higher values make output more random.
- `top_p`: Nucleus sampling threshold. Only tokens whose cumulative probability is ≤ `top_p` are considered.
- `n`: Number of completions to generate for each prompt.
- `stream`: Whether to stream partial results as they are generated.
- `logprobs`: Include the log probabilities of the top N tokens. Set to 0 to disable, or a positive integer.
- `echo`: Whether to echo the prompt in the completion.
- `stop`: Up to 4 sequences where generation will stop.
- `presence_penalty`: Penalty for tokens that have already appeared. Range: [-2.0, 2.0].
- `frequency_penalty`: Penalty for tokens based on their frequency. Range: [-2.0, 2.0].
- `logit_bias`: Modify likelihood of specified tokens. Maps token IDs to bias values in [-100, 100].
- `seed`: Random seed for deterministic sampling.
## vLLM-specific parameters

- `top_k`: Number of highest-probability tokens to keep. -1 means disabled.
- `min_p`: Minimum probability threshold relative to the most likely token.
- `repetition_penalty`: Penalty for token repetition. Values > 1 discourage repetition.
- `stop_token_ids`: List of token IDs that will stop generation.
- `ignore_eos`: Whether to ignore the end-of-sequence token.
- `use_beam_search`: Whether to use beam search instead of sampling.
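A sketch of a request payload combining standard and vLLM-specific parameters (the model name and all values are illustrative placeholders):

```python
# Standard and vLLM-specific sampling parameters in one request body.
payload = {
    "model": "my-model",         # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.8,
    # vLLM-specific extras:
    "top_k": 40,                 # keep only the 40 most likely tokens
    "min_p": 0.05,               # drop tokens below 5% of the top token's probability
    "repetition_penalty": 1.1,   # values > 1 discourage repetition
    "ignore_eos": False,         # stop at the end-of-sequence token as usual
}
print(sorted(payload))
```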
## Response format

### Non-streaming response

- `id`: Unique identifier for the completion.
- `object`: Always `"text_completion"`.
- `created`: Unix timestamp of when the completion was created.
- `model`: The model used for the completion.
- `choices`: Array of generated completions; each entry contains the generated text.
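A minimal illustrative response, parsed to show the fields above (the identifier, timestamp, and text are placeholders):

```python
import json

# Illustrative non-streaming response body; all values are placeholders.
raw = """
{
  "id": "cmpl-123",
  "object": "text_completion",
  "created": 1700000000,
  "model": "my-model",
  "choices": [
    {"index": 0, "text": " there was a dragon.", "finish_reason": "length"}
  ]
}
"""
response = json.loads(raw)
print(response["object"])  # → text_completion
```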
## Example: Basic completion
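A minimal sketch using only the standard library, assuming a server listening at `http://localhost:8000` and a placeholder model name:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address; adjust as needed

payload = {
    "model": "my-model",            # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
}

def complete(payload):
    """POST the payload to /v1/completions and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server:
# result = complete(payload)
# print(result["choices"][0]["text"])
```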
## Example: Streaming completion
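With `stream: true` the server sends server-sent-events lines; a sketch of parsing them, again assuming a server at `http://localhost:8000` and a placeholder model name:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address; adjust as needed

payload = {
    "model": "my-model",            # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": True,
}

def parse_sse_line(line):
    """Extract the chunk text from one server-sent-events line, if any."""
    if not line.startswith("data: "):
        return None                  # blank lines and other non-data lines
    data = line[len("data: "):]
    if data == "[DONE]":
        return None                  # sentinel marking the end of the stream
    return json.loads(data)["choices"][0]["text"]

def stream_completion(payload):
    """Yield completion text chunks as the server produces them."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            text = parse_sse_line(raw.decode().strip())
            if text is not None:
                yield text

# Requires a running server:
# for piece in stream_completion(payload):
#     print(piece, end="", flush=True)
```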
## Example: Multiple completions

Set `n` to request several completions per prompt; each one appears as a separate entry in the `choices` array.
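A sketch of requesting three completions and collecting them by their `index` field (the response below is an illustrative placeholder, not real server output):

```python
# Request sketch: n=3 asks for three completions of the same prompt.
payload = {
    "model": "my-model",     # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "n": 3,
    "max_tokens": 32,
}

def texts_by_index(response):
    """Collect choice texts ordered by their `index` field."""
    return [c["text"] for c in sorted(response["choices"], key=lambda c: c["index"])]

# Illustrative response with one choice per requested completion.
example = {
    "choices": [
        {"index": 2, "text": " C"},
        {"index": 0, "text": " A"},
        {"index": 1, "text": " B"},
    ]
}
print(texts_by_index(example))  # → [' A', ' B', ' C']
```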
## Example: With logprobs
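A sketch of requesting the top-2 log probabilities per position; the `logprobs` object shown is an illustrative placeholder for the shape returned inside each choice:

```python
import math

# Request sketch: logprobs=2 returns the top-2 token log probabilities
# at each position ("my-model" is a placeholder name).
payload = {
    "model": "my-model",
    "prompt": "Once upon a time",
    "max_tokens": 8,
    "logprobs": 2,
}

# Illustrative shape of choices[0]["logprobs"]; tokens and values are made up.
logprobs = {
    "tokens": [" there", " was"],
    "token_logprobs": [-0.12, -0.53],
    "top_logprobs": [
        {" there": -0.12, " once": -2.4},
        {" was": -0.53, " lived": -1.7},
    ],
}

# Convert the sampled tokens' log probabilities back to probabilities.
probs = [round(math.exp(lp), 3) for lp in logprobs["token_logprobs"]]
print(probs)  # → [0.887, 0.589]
```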
## Related

- Chat completions - For conversational models
- Models endpoint - List available models
- SamplingParams - Python API equivalent