## Endpoint

`POST /v1/completions`
## Request body

- `model`: The model to use for completion. Use a model name from `/v1/models`.
- `prompt`: The prompt(s) to generate completions for. Can be:
  - A string: `"Once upon a time"`
  - An array of strings: `["Hello", "World"]`
  - An array of token IDs: `[123, 456, 789]`
  - An array of token ID arrays: `[[123, 456], [789, 12]]`
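The four accepted prompt shapes can be sketched as request payloads (the model name is a placeholder):

```python
import json

# The four accepted shapes of "prompt" ("my-model" is a placeholder name).
payloads = [
    {"model": "my-model", "prompt": "Once upon a time"},       # single string
    {"model": "my-model", "prompt": ["Hello", "World"]},       # array of strings
    {"model": "my-model", "prompt": [123, 456, 789]},          # array of token IDs
    {"model": "my-model", "prompt": [[123, 456], [789, 12]]},  # array of token ID arrays
]

# Each serializes to a valid JSON request body.
bodies = [json.dumps(p) for p in payloads]
print(len(bodies))  # → 4
```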
- `max_tokens`: Maximum number of tokens to generate.
- `temperature`: Sampling temperature between 0 and 2. Higher values make output more random.
- `top_p`: Nucleus sampling threshold. Only tokens whose cumulative probability is ≤ `top_p` are considered.
- `n`: Number of completions to generate for each prompt.
- `stream`: Whether to stream partial results as they are generated.
- `logprobs`: Include the log probabilities of the top N tokens. Set to 0 to disable, or a positive integer.
- `echo`: Whether to echo the prompt in the completion.
- `stop`: Up to 4 sequences where generation will stop.
- `presence_penalty`: Penalty for tokens that have already appeared. Range: [-2.0, 2.0].
- `frequency_penalty`: Penalty for tokens based on their frequency. Range: [-2.0, 2.0].
- `logit_bias`: Modify likelihood of specified tokens. Maps token IDs to bias values in [-100, 100].
- `seed`: Random seed for deterministic sampling.
## vLLM-specific parameters

- `top_k`: Number of highest-probability tokens to keep. -1 means disabled.
- `min_p`: Minimum probability threshold relative to the most likely token.
- `repetition_penalty`: Penalty for token repetition. Values > 1 discourage repetition.
- `stop_token_ids`: List of token IDs that will stop generation.
- `ignore_eos`: Whether to ignore the end-of-sequence token.
- `use_beam_search`: Whether to use beam search instead of sampling.
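A sketch of a request payload combining standard and vLLM-specific parameters (the model name and all values are illustrative placeholders):

```python
# Standard and vLLM-specific sampling parameters in one request body.
payload = {
    "model": "my-model",         # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.8,
    # vLLM-specific extras:
    "top_k": 40,                 # keep only the 40 most likely tokens
    "min_p": 0.05,               # drop tokens below 5% of the top token's probability
    "repetition_penalty": 1.1,   # values > 1 discourage repetition
    "ignore_eos": False,         # stop at the end-of-sequence token as usual
}
print(sorted(payload))
```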
## Response format

### Non-streaming response

- `id`: Unique identifier for the completion.
- `object`: Always `"text_completion"`.
- `created`: Unix timestamp of when the completion was created.
- `model`: The model used for the completion.
- `choices`: Array of generated completions; each entry contains the generated text.
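A minimal illustrative response, parsed to show the fields above (the identifier, timestamp, and text are placeholders):

```python
import json

# Illustrative non-streaming response body; all values are placeholders.
raw = """
{
  "id": "cmpl-123",
  "object": "text_completion",
  "created": 1700000000,
  "model": "my-model",
  "choices": [
    {"index": 0, "text": " there was a dragon.", "finish_reason": "length"}
  ]
}
"""
response = json.loads(raw)
print(response["object"])  # → text_completion
```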
## Example: Basic completion
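A minimal sketch using only the standard library, assuming a server listening at `http://localhost:8000` and a placeholder model name:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address; adjust as needed

payload = {
    "model": "my-model",            # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "temperature": 0.7,
}

def complete(payload):
    """POST the payload to /v1/completions and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server:
# result = complete(payload)
# print(result["choices"][0]["text"])
```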
## Example: Streaming completion
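With `stream: true` the server sends server-sent-events lines; a sketch of parsing them, again assuming a server at `http://localhost:8000` and a placeholder model name:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address; adjust as needed

payload = {
    "model": "my-model",            # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": True,
}

def parse_sse_line(line):
    """Extract the chunk text from one server-sent-events line, if any."""
    if not line.startswith("data: "):
        return None                  # blank lines and other non-data lines
    data = line[len("data: "):]
    if data == "[DONE]":
        return None                  # sentinel marking the end of the stream
    return json.loads(data)["choices"][0]["text"]

def stream_completion(payload):
    """Yield completion text chunks as the server produces them."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            text = parse_sse_line(raw.decode().strip())
            if text is not None:
                yield text

# Requires a running server:
# for piece in stream_completion(payload):
#     print(piece, end="", flush=True)
```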
## Example: Multiple completions

Set `n` to request several completions per prompt; each one appears as a separate entry in the `choices` array.
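A sketch of requesting three completions and collecting them by their `index` field (the response below is an illustrative placeholder, not real server output):

```python
# Request sketch: n=3 asks for three completions of the same prompt.
payload = {
    "model": "my-model",     # placeholder: use a name from /v1/models
    "prompt": "Once upon a time",
    "n": 3,
    "max_tokens": 32,
}

def texts_by_index(response):
    """Collect choice texts ordered by their `index` field."""
    return [c["text"] for c in sorted(response["choices"], key=lambda c: c["index"])]

# Illustrative response with one choice per requested completion.
example = {
    "choices": [
        {"index": 2, "text": " C"},
        {"index": 0, "text": " A"},
        {"index": 1, "text": " B"},
    ]
}
print(texts_by_index(example))  # → [' A', ' B', ' C']
```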
## Example: With logprobs
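A sketch of requesting the top-2 log probabilities per position; the `logprobs` object shown is an illustrative placeholder for the shape returned inside each choice:

```python
import math

# Request sketch: logprobs=2 returns the top-2 token log probabilities
# at each position ("my-model" is a placeholder name).
payload = {
    "model": "my-model",
    "prompt": "Once upon a time",
    "max_tokens": 8,
    "logprobs": 2,
}

# Illustrative shape of choices[0]["logprobs"]; tokens and values are made up.
logprobs = {
    "tokens": [" there", " was"],
    "token_logprobs": [-0.12, -0.53],
    "top_logprobs": [
        {" there": -0.12, " once": -2.4},
        {" was": -0.53, " lived": -1.7},
    ],
}

# Convert the sampled tokens' log probabilities back to probabilities.
probs = [round(math.exp(lp), 3) for lp in logprobs["token_logprobs"]]
print(probs)  # → [0.887, 0.589]
```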
## Related

- Chat completions - For conversational models
- Models endpoint - List available models
- SamplingParams - Python API equivalent