The completions endpoint generates text based on a prompt. It follows the OpenAI Completions API format.

Endpoint

POST /v1/completions

Request body

model
string
required
The model to use for completion. Use the model name from /v1/models.
prompt
string | array
required
The prompt(s) to generate completions for. Can be:
  • A string: "Once upon a time"
  • An array of strings: ["Hello", "World"]
  • Token IDs: [123, 456, 789]
  • Array of token ID arrays: [[123, 456], [789, 12]]
max_tokens
integer
default:"16"
Maximum number of tokens to generate.
temperature
number
default:"1.0"
Sampling temperature between 0 and 2. Higher values make output more random.
top_p
number
default:"1.0"
Nucleus sampling threshold. Only the smallest set of most likely tokens whose cumulative probability reaches top_p is considered.
n
integer
default:"1"
Number of completions to generate for each prompt.
stream
boolean
default:"false"
Whether to stream partial results as they’re generated.
logprobs
integer | null
default:"null"
Include the log probabilities of the N most likely tokens at each position. Set to null (the default) to disable, or to a non-negative integer; 0 returns the log probability of the sampled token only.
echo
boolean
default:"false"
Whether to echo the prompt in the completion.
stop
string | array
default:"null"
Up to 4 sequences where generation will stop.
presence_penalty
number
default:"0.0"
Penalty for tokens that have already appeared. Range: [-2.0, 2.0].
frequency_penalty
number
default:"0.0"
Penalty for tokens based on their frequency. Range: [-2.0, 2.0].
logit_bias
object
default:"null"
Modify likelihood of specified tokens. Maps token IDs to bias values [-100, 100].
seed
integer
default:"null"
Random seed for deterministic sampling.
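
The standard parameters above can be assembled into a request body like this (a minimal sketch; the model name, prompt, and parameter values are placeholders chosen for illustration):

```python
import json

# Build a completions request body from the standard parameters above.
# Model name, prompt, and values are illustrative placeholders.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time",
    "max_tokens": 16,                 # default shown above
    "temperature": 1.0,
    "top_p": 1.0,
    "n": 1,
    "stream": False,
    "echo": False,
    "stop": ["\n\n", "THE END"],      # up to 4 stop sequences
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "seed": 42,                       # fixed seed for reproducible sampling
}

body = json.dumps(payload)            # this string is what you POST
```

The `body` string is what you would pass as the `-d` argument to curl, or as the request body of any HTTP client.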

vLLM-specific parameters

top_k
integer
default:"-1"
Number of highest probability tokens to keep. -1 means disabled.
min_p
number
default:"0.0"
Minimum probability threshold relative to most likely token.
repetition_penalty
number
default:"1.0"
Penalty for token repetition. Values > 1 discourage repetition.
stop_token_ids
array
default:"[]"
List of token IDs that will stop generation.
ignore_eos
boolean
default:"false"
Whether to ignore the end-of-sequence token.
use_beam_search
boolean
default:"false"
Whether to use beam search instead of sampling.
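
The vLLM-specific parameters are sent in the same request body alongside the standard ones. A sketch (the model name, prompt, and values are placeholders, not recommendations):

```python
import json

# Request body mixing standard and vLLM-specific sampling parameters.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 32,
    "top_k": 50,                # keep only the 50 highest-probability tokens
    "min_p": 0.05,              # drop tokens below 5% of the top token's probability
    "repetition_penalty": 1.2,  # > 1 discourages repeating tokens
    "ignore_eos": False,        # stop normally at the end-of-sequence token
}

body = json.dumps(payload)
```

Servers that follow the OpenAI schema strictly would reject these extra fields, so they only work against a vLLM server.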

Response format

Non-streaming response

id
string
Unique identifier for the completion.
object
string
Always “text_completion”.
created
integer
Unix timestamp of when the completion was created.
model
string
The model used for completion.
choices
array
Array of completion choices.
index
integer
Index of the choice.
text
string
The generated text.
logprobs
object | null
Log probability information, if requested.
finish_reason
string
Why generation stopped: “stop”, “length”, or “abort”.
usage
object
Token usage statistics.
prompt_tokens
integer
Number of tokens in the prompt.
completion_tokens
integer
Number of tokens in the completion.
total_tokens
integer
Total tokens used.

Example: Basic completion

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
Response:
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1677652288,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " city in Northern California, known for its...",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 50,
    "total_tokens": 55
  }
}
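
Pulling the generated text and token counts out of the non-streaming response shown above is a matter of indexing into `choices` and `usage`:

```python
import json

# The non-streaming response from the basic completion example above.
raw = """
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1677652288,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " city in Northern California, known for its...",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {"prompt_tokens": 5, "completion_tokens": 50, "total_tokens": 55}
}
"""

response = json.loads(raw)
text = response["choices"][0]["text"]
finish_reason = response["choices"][0]["finish_reason"]  # "length": hit max_tokens
total_tokens = response["usage"]["total_tokens"]
```

A `finish_reason` of "length" here indicates the model stopped because it reached `max_tokens`, not because it produced a stop sequence.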

Example: Streaming completion

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "stream": true
  }'
Streaming returns Server-Sent Events:
data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"index":0,"text":" there","logprobs":null,"finish_reason":null}],"model":"facebook/opt-125m"}

data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"index":0,"text":" was","logprobs":null,"finish_reason":null}],"model":"facebook/opt-125m"}

data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"index":0,"text":"","logprobs":null,"finish_reason":"stop"}],"model":"facebook/opt-125m"}

data: [DONE]
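
Each event in the stream is a line prefixed with `data: ` carrying one JSON chunk, and the stream is terminated by the literal `data: [DONE]`. The events above can be reassembled into the full completion like this:

```python
import json

# The SSE lines from the streaming example above, as read off the wire.
stream_lines = [
    'data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"index":0,"text":" there","logprobs":null,"finish_reason":null}],"model":"facebook/opt-125m"}',
    'data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"index":0,"text":" was","logprobs":null,"finish_reason":null}],"model":"facebook/opt-125m"}',
    'data: {"id":"cmpl-123","object":"text_completion","created":1677652288,"choices":[{"index":0,"text":"","logprobs":null,"finish_reason":"stop"}],"model":"facebook/opt-125m"}',
    "data: [DONE]",
]

pieces = []
finish_reason = None
for line in stream_lines:
    data = line[len("data: "):]
    if data == "[DONE]":           # sentinel marking the end of the stream
        break
    chunk = json.loads(data)
    choice = chunk["choices"][0]
    pieces.append(choice["text"])  # each chunk carries a text fragment
    if choice["finish_reason"] is not None:
        finish_reason = choice["finish_reason"]

full_text = "".join(pieces)
```

In a real client you would read lines from the HTTP response body as they arrive instead of iterating over a fixed list.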

Example: Multiple completions

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "The best programming language is",
    "max_tokens": 20,
    "n": 3,
    "temperature": 0.9
  }'
Returns 3 different completions in the choices array.
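
With `n` greater than 1, each element of `choices` carries its own `index`, and the elements are not guaranteed to be useful in arrival order (particularly when streaming). Sorting by `index` restores a stable order. A sketch, using made-up choice texts rather than real model output:

```python
# Hypothetical "choices" array for a request with n=3.
# The texts are illustrative, not real model output.
choices = [
    {"index": 0, "text": " Python, because...", "finish_reason": "length"},
    {"index": 2, "text": " whichever fits the task.", "finish_reason": "stop"},
    {"index": 1, "text": " Rust, because...", "finish_reason": "length"},
]

# Sort by index so completion i is always at position i.
texts = [c["text"] for c in sorted(choices, key=lambda c: c["index"])]
```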

Example: With logprobs

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Hello",
    "max_tokens": 5,
    "logprobs": 3
  }'
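
With `logprobs: 3`, each choice's `logprobs` field is populated. It follows the legacy OpenAI Completions shape: parallel arrays `tokens`, `token_logprobs`, `top_logprobs` (one dict of the top-N candidates per position), and `text_offset`. The values below are made up for illustration; log probabilities are natural logs, so exponentiating recovers probabilities:

```python
import math

# Illustrative logprobs object for one choice (values are made up,
# not real model output), following the legacy Completions shape.
logprobs = {
    "tokens": [",", " world", "!"],
    "token_logprobs": [-0.105, -0.223, -0.051],
    "top_logprobs": [
        {",": -0.105, "!": -2.3, " there": -3.1},
        {" world": -0.223, " everyone": -2.0, " friend": -3.5},
        {"!": -0.051, ".": -3.0, "?": -4.2},
    ],
    "text_offset": [5, 6, 12],
}

# Convert the sampled tokens' log probabilities to plain probabilities.
probs = [math.exp(lp) for lp in logprobs["token_logprobs"]]
```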
