Overview
The `SamplingParams` class follows the OpenAI text completion API specification and supports additional features such as beam search. All parameters are optional and have sensible defaults.
Temperature and randomness
Temperature
Controls the randomness of sampling. Lower values make output more deterministic; higher values make it more random.

- Set to `0` for greedy sampling (always picks the most likely token)
- Values between `0.0` and `1.0` reduce randomness
- Values above `1.0` increase randomness
- Minimum value: `0.01` (values below this are clamped)
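A minimal sketch of the underlying math (not vLLM's tensor implementation): temperature divides each logit before the softmax, so low values sharpen the distribution and high values flatten it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: more peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: closer to uniform
assert sharp[0] > flat[0]
```

As the temperature approaches `0`, the distribution collapses onto the most likely token, which is why very low values are equivalent to greedy sampling.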
When `temperature < 0.01`, vLLM automatically uses greedy sampling and ignores `top_p`, `top_k`, and `min_p`.

Seed
Random seed for reproducible generation with non-zero temperature. Set to any integer for reproducible sampling. Use `None` for non-deterministic sampling.

Top-k, top-p, and min-p sampling
These parameters filter which tokens are considered during sampling.

Top-p (nucleus sampling)
Considers only the smallest set of tokens whose cumulative probability exceeds `top_p`.

- Range: `(0.0, 1.0]`
- Set to `1.0` to consider all tokens
- Lower values (e.g., `0.9`) filter out low-probability tokens
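A pure-Python sketch of the nucleus rule (vLLM applies it to logit tensors; this only illustrates which tokens survive the filter):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p. probs maps token -> probability (summing to 1)."""
    kept = set()
    cumulative = 0.0
    # Walk tokens from most to least likely, accumulating probability mass.
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.add(token)
        cumulative += p
        if cumulative >= top_p:      # stop once the nucleus covers top_p mass
            break
    return kept

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
assert top_p_filter(probs, 0.9) == {"a", "b", "c"}   # "d" is filtered out
```

With `top_p=1.0`, the loop only stops after every token is included, which is why `1.0` disables the filter.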
Top-k sampling
Considers only the top `k` most likely tokens.

- Set to `0` or `-1` to disable (consider all tokens)
- Higher values allow more diversity
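The same kind of sketch for top-k, with the non-positive values treated as "disabled" as described above:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens; k <= 0 disables the filter."""
    if k <= 0:
        return set(probs)            # 0 or -1: consider all tokens
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    return {token for token, _ in ranked[:k]}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
assert top_k_filter(probs, 2) == {"a", "b"}
assert top_k_filter(probs, -1) == {"a", "b", "c", "d"}
```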
Min-p sampling
Minimum probability threshold relative to the most likely token.
- Range: `[0.0, 1.0]`
- Set to `0.0` to disable
- For example, `0.05` means tokens with probability < 5% of the max are excluded
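As a sketch, min-p scales the threshold by the probability of the most likely token, so the cutoff adapts to how confident the model is:

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the max probability."""
    threshold = min_p * max(probs.values())
    return {token for token, p in probs.items() if p >= threshold}

probs = {"a": 0.5, "b": 0.3, "c": 0.02}
# threshold = 0.05 * 0.5 = 0.025, so "c" (p=0.02) is excluded
assert min_p_filter(probs, 0.05) == {"a", "b"}
assert min_p_filter(probs, 0.0) == {"a", "b", "c"}   # 0.0 disables the filter
```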
Output length control
Maximum tokens
Maximum number of tokens to generate per output sequence.
- Must be at least `1`
- Set to `None` to generate until the EOS token
Minimum tokens
Minimum number of tokens to generate before an EOS or stop token can end the sequence.

- Must be less than or equal to `max_tokens`
- Prevents premature stopping
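An illustrative sketch of how these two limits interact (the `generate` helper and `eos` value are hypothetical, not vLLM internals; real samplers mask the EOS logit rather than skipping tokens):

```python
def generate(tokens, min_tokens, max_tokens, eos=2):
    """Consume a stream of proposed tokens, honoring min/max token limits."""
    out = []
    for token in tokens:
        if len(out) >= max_tokens:
            break                    # hard cap on output length
        if token == eos:
            if len(out) >= min_tokens:
                break                # EOS only honored once min_tokens is reached
            continue                 # before min_tokens, EOS is suppressed
        out.append(token)
    return out

# The EOS (2) at position 2 is suppressed because min_tokens=3 is not yet met.
assert generate([5, 6, 2, 7, 2, 8], min_tokens=3, max_tokens=10) == [5, 6, 7]
```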
Repetition penalties
vLLM supports three types of repetition penalties to discourage repeated tokens.

Presence penalty
Penalizes tokens that have already appeared in the generated text.
- Range: `[-2.0, 2.0]`
- Positive values encourage new tokens
- Negative values encourage repetition
Frequency penalty
Penalizes tokens based on how often they appear in the generated text.
- Range: `[-2.0, 2.0]`
- Positive values discourage frequent tokens
- Negative values encourage frequent tokens
Repetition penalty
Penalizes tokens based on whether they appear in both the prompt and generated text.
- Must be greater than `0.0`
- Values > `1.0` encourage new tokens
- Values < `1.0` encourage repetition
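Unlike the additive penalties above, this one is multiplicative: the standard rule (as used by Hugging Face's repetition-penalty logits processor, which vLLM's parameter mirrors) divides positive logits and multiplies negative ones, so both move toward less likely. A sketch:

```python
def apply_repetition_penalty(logits, seen_tokens, penalty):
    """Multiplicative penalty: divide positive logits, multiply negative ones."""
    out = dict(logits)
    for token in seen_tokens:
        if token in out:
            if out[token] > 0:
                out[token] /= penalty   # shrink positive logits toward 0
            else:
                out[token] *= penalty   # push negative logits further down
    return out

logits = {"a": 2.0, "b": -2.0, "c": 1.0}
penalized = apply_repetition_penalty(logits, {"a", "b"}, 2.0)
assert penalized == {"a": 1.0, "b": -4.0, "c": 1.0}
```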
Stop sequences
Stop strings
String(s) that stop generation when encountered. The stop string is not included in the output (unless `include_stop_str_in_output=True`).

Stop token IDs
Token IDs that stop generation. The stop tokens are included in output unless they are special tokens.
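As a sketch of the stop-string behavior described above (the `truncate_at_stop` helper is an illustration, not vLLM's implementation):

```python
def truncate_at_stop(text, stop_strings, include_stop_str_in_output=False):
    """Cut text at the earliest stop string; the stop string itself is
    dropped unless include_stop_str_in_output is True."""
    earliest = None
    for s in stop_strings:
        idx = text.find(s)
        if idx != -1 and (earliest is None or idx < earliest[0]):
            earliest = (idx, s)
    if earliest is None:
        return text                  # no stop string found
    idx, s = earliest
    return text[: idx + len(s)] if include_stop_str_in_output else text[:idx]

assert truncate_at_stop("Hello\nWorld", ["\n"]) == "Hello"
assert truncate_at_stop("Hello\nWorld", ["\n"], include_stop_str_in_output=True) == "Hello\n"
```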
Ignore EOS
Whether to ignore the end-of-sequence token and continue generating.
Multiple outputs
Number of output sequences to generate for each prompt.
- Must be at least `1`
- Cannot be greater than `1` when using greedy sampling (`temperature=0`)
When using `AsyncLLM`, all `n` outputs are streamed cumulatively. To see all outputs only upon completion, use `output_kind=RequestOutputKind.FINAL_ONLY`.

Log probabilities
Sample logprobs
Number of log probabilities to return per output token.
- Set to `None` to disable
- Set to `-1` to return all vocabulary logprobs
- Returns logprobs for up to `logprobs + 1` tokens (includes the sampled token)
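Conceptually, a token's logprob is its log-softmax value; the returned entries are the `n` highest. A self-contained sketch (not vLLM's implementation):

```python
import math

def top_logprobs(logits, n):
    """Log-softmax the logits and return the n highest logprobs by token."""
    peak = max(logits.values())      # subtract max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    logprobs = {token: v - log_z for token, v in logits.items()}
    ranked = sorted(logprobs.items(), key=lambda kv: -kv[1])
    return dict(ranked[:n])

logits = {"a": 2.0, "b": 1.0, "c": 0.0}
top2 = top_logprobs(logits, 2)
assert set(top2) == {"a", "b"} and top2["a"] > top2["b"]
```

All logprobs are non-positive, since probabilities never exceed 1.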
Prompt logprobs
Number of log probabilities to return per prompt token.
- Set to `None` to disable
- Set to `-1` to return all vocabulary logprobs
Advanced features
Logit bias
Apply bias to specific token logits before sampling.
- Keys: token IDs
- Values: bias values clamped to `[-100.0, 100.0]`
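A sketch of the clamp-then-add behavior described above, over a token-ID-to-logit dict:

```python
def apply_logit_bias(logits, logit_bias):
    """Add a per-token bias, clamped to [-100.0, 100.0], before sampling."""
    out = dict(logits)
    for token_id, bias in logit_bias.items():
        clamped = max(-100.0, min(100.0, bias))
        out[token_id] = out.get(token_id, 0.0) + clamped
    return out

logits = {1: 0.5, 2: 0.5}
# 250.0 is clamped to 100.0 before being added; -1.0 passes through unchanged
biased = apply_logit_bias(logits, {1: 250.0, 2: -1.0})
assert biased == {1: 100.5, 2: -0.5}
```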
Allowed token IDs
Restrict generation to only these token IDs.
- Cannot be empty
- All IDs must be in vocabulary range
Bad words
Words that are not allowed in the generated output.
Detokenization options
- `detokenize`: whether to detokenize the output. When `False`, only token IDs are returned.
- `skip_special_tokens`: whether to skip special tokens in the output text.
- `spaces_between_special_tokens`: whether to add spaces between special tokens in the output.
- `include_stop_str_in_output`: whether to include stop strings in the output text.
Complete example
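A representative sketch using the public `LLM`/`SamplingParams` API. The model name is only an illustration, and running this downloads the model weights:

```python
from vllm import LLM, SamplingParams

# The model name is illustrative; any Hugging Face causal LM works here.
llm = LLM(model="facebook/opt-125m")

params = SamplingParams(
    temperature=0.8,        # moderate randomness
    top_p=0.95,             # nucleus sampling over the top 95% of mass
    max_tokens=64,          # cap the output length
    stop=["\n\n"],          # stop at the first blank line
    seed=42,                # reproducible sampling
)

for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
```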
Related resources
- Structured outputs - Generate JSON, regex, or grammar-constrained output
- OpenAI API reference - vLLM follows this specification
- Source: `vllm/sampling_params.py:119`