The `SamplingParams` class controls how vLLM generates text, covering sampling temperature, nucleus (top-p) sampling, repetition penalties, and stopping conditions.
## Constructor

### Core parameters

- `n`: Number of output sequences to generate per prompt.
- `temperature`: Controls randomness in sampling. Lower values (e.g., 0.2) make output more deterministic; higher values (e.g., 1.5) make it more random. Use 0.0 for greedy decoding.
- `top_p`: Nucleus sampling threshold. Only the smallest set of tokens whose cumulative probability mass is ≤ `top_p` is considered. Must be in (0, 1].
- `top_k`: Number of highest-probability tokens to keep. Set to 0 or -1 to disable.
- `min_p`: Minimum probability threshold, relative to the probability of the most likely token. Must be in [0, 1].
- `max_tokens`: Maximum number of tokens to generate per output sequence.
- `min_tokens`: Minimum number of tokens to generate before EOS or stop tokens can be produced.
### Penalties

- `presence_penalty`: Penalizes tokens that have already appeared in the generated text. Values > 0 encourage new tokens; values < 0 encourage repetition. Range: [-2.0, 2.0].
- `frequency_penalty`: Penalizes tokens in proportion to their frequency in the generated text. Values > 0 encourage new tokens; values < 0 encourage repetition. Range: [-2.0, 2.0].
- `repetition_penalty`: Penalizes tokens based on their appearance in the prompt and the generated text. Values > 1 encourage new tokens; values < 1 encourage repetition. Must be > 0.
### Stopping conditions

- `stop`: String or list of strings that stop generation when produced. The stop string is not included in the output.
- `stop_token_ids`: Token IDs that stop generation when produced.
- `ignore_eos`: Whether to ignore the EOS token and continue generating after it is produced.
- `include_stop_str_in_output`: Whether to include the stop strings in the output text.
### Log probabilities

- `logprobs`: Number of log probabilities to return per output token. Set to -1 to return log probabilities for the full vocabulary.
- `prompt_logprobs`: Number of log probabilities to return per prompt token. Set to -1 to return log probabilities for the full vocabulary.
### Output formatting

- `detokenize`: Whether to detokenize the output tokens into text.
- `skip_special_tokens`: Whether to skip special tokens in the output.
- `spaces_between_special_tokens`: Whether to add spaces between special tokens in the output.
### Advanced parameters

- `seed`: Random seed for reproducible sampling.
- `logit_bias`: Bias values added to the logits of specific tokens. Keys are token IDs; values are bias amounts (clamped to [-100, 100]).
- `allowed_token_ids`: If provided, only these token IDs can be generated.
- `bad_words`: Words that must not be generated. Prevents the last token of any matching sequence from being generated.
- `guided_decoding`: Parameters for structured output generation (JSON, regex, grammar).
- `output_kind`: Controls the output format:
  - `CUMULATIVE`: Return the entire output so far in every update.
  - `DELTA`: Return only the new tokens in each update.
  - `FINAL_ONLY`: Return only the final complete output.
## Example: Creative generation
## Example: Deterministic generation
## Example: Constrained generation
## Example: Log probabilities
## Example: Multiple outputs
## Related
- `LLM` - Use `SamplingParams` with the `LLM` class
- `PoolingParams` - Parameters for pooling models
- `RequestOutput` - Output format containing generated text