SamplingParams
The `SamplingParams` class controls how text is generated from language models. It configures sampling strategies, stopping conditions, penalties, and output options.
Constructor
Parameters
Token Generation
Maximum number of tokens to generate per output sequence.
Minimum number of tokens to generate per output sequence; prevents the sequence from stopping too early. Values < 1 have no effect.
Sampling Strategy
Temperature for sampling (≥ 0). Controls randomness:

- 0.0: Greedy decoding (deterministic)
- < 1.0: More focused, deterministic outputs
- = 1.0: Standard sampling
- > 1.0: More random, creative outputs

When None and neither top_p nor top_k are specified, defaults to greedy decoding.

Nucleus sampling threshold (0 to 1). Only the smallest set of most-probable tokens whose cumulative probability reaches top_p is considered:

- 1.0: Consider all tokens (standard sampling)
- < 1.0: Consider only the most probable tokens

When None and neither temperature nor top_k are specified, defaults to greedy decoding.

Sample from the top K most likely tokens:

- 0: Consider all tokens
- 1: Greedy decoding
- > 1: Limit sampling to the top K tokens

When None and neither temperature nor top_p are specified, defaults to greedy decoding.

Lower bound for the top-P decay algorithm. Defaults to 1e-6.

Decay factor for the top-P algorithm. Defaults to 1.0.

Token ID at which the top-P decay resets. Defaults to 1.

Minimum token probability threshold, scaled by the probability of the most likely token: tokens whose probability falls below that scaled threshold are filtered out. Defaults to 0.0 (disabled).

Random seed for reproducible sampling. Defaults to 0.
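Conceptually, these parameters compose as a pipeline: temperature rescales the logits, top_k truncates to the K most likely tokens, and top_p keeps the smallest set reaching the cumulative-probability threshold. The sketch below is an illustrative plain-Python re-implementation of that pipeline; the function name and exact ordering are assumptions for illustration, not the library's actual code.

```python
import math

def sample_filter(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the (token_id, probability) pairs that survive temperature,
    top-k, and top-p filtering. Illustrative sketch only."""
    if temperature == 0.0:
        # Greedy decoding: keep only the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [(best, 1.0)]
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda t: t[1], reverse=True)
    # top_k: keep only the K most likely tokens (0 means no limit).
    if top_k > 0:
        probs = probs[:top_k]
    # top_p: keep the smallest set whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving probabilities.
    z = sum(p for _, p in kept)
    return [(i, p / z) for i, p in kept]
```

With `temperature=0.0` the function collapses to greedy decoding, matching the behavior described above when the other two parameters are unset.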
Beam Search
Enable beam search instead of sampling. When True, best_of becomes the beam width.

Number of output sequences to return per prompt.

Number of sequences to generate for selection:

- Sampling mode: Generate best_of sequences and return the top n by cumulative log probability
- Beam search mode: Use a beam width of best_of and return the top n beams

Must satisfy best_of >= n. Defaults to n.

Diversity penalty for beam search. Values > 1.0 encourage diverse beams. Defaults to 1.0.
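In sampling mode, the best_of/n relationship reduces to a rank-and-truncate step over cumulative log probabilities. The sketch below illustrates that step; `select_top_n` is a hypothetical helper, with candidate generation replaced by precomputed scores.

```python
def select_top_n(candidates, n):
    """candidates: list of (text, cumulative_logprob) pairs,
    e.g. the best_of sequences sampled for one prompt.
    Returns the n highest-scoring sequences, best first."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:n]

# best_of = 4 sampled candidates, n = 2 returned:
cands = [("a", -3.2), ("b", -1.1), ("c", -7.5), ("d", -2.0)]
print(select_top_n(cands, 2))  # → [('b', -1.1), ('d', -2.0)]
```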
Array of beam widths for variable-beam-width search.
Exponential penalty for sequence length in beam search. Defaults to 0.0.
Stop beam search when beam_width complete sequences are generated. Defaults to 1.

Stopping Conditions
End-of-sequence token ID. Generation stops when this token is generated. Defaults to tokenizer’s EOS token.
Padding token ID. Defaults to end_id.

Stop string(s). Generation stops when any of these strings is generated.
Stop token IDs. Generation stops when any of these tokens are generated.
Include stop strings in the output text. When False, stop strings are removed from the output.

Continue generation after the EOS token is generated.
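Taken together, these stopping conditions amount to a per-token check along the lines of the sketch below; `should_stop` is a hypothetical helper, and the library's real check additionally handles stop-string trimming and streaming.

```python
def should_stop(token_ids, text, end_id, stop_token_ids=(), stop_strings=(),
                ignore_eos=False):
    """Illustrative check applied after each generated token."""
    last = token_ids[-1]
    # EOS check, skipped when ignore_eos is set.
    if not ignore_eos and last == end_id:
        return True
    # Explicit stop-token check.
    if last in stop_token_ids:
        return True
    # Stop-string check against the decoded text so far.
    return any(s in text for s in stop_strings)
```

When include_stop_str_in_output is False, the matched stop string would additionally be stripped from the returned text.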
Bad Words / Tokens
String(s) that must not appear in the output. When one is about to be generated, sampling is redirected to an alternative token.
Token IDs to avoid during generation.
Repetition Control
Penalty for repeating tokens:

- < 1.0: Encourage repetition
- = 1.0: No penalty (default)
- > 1.0: Discourage repetition

Penalty for tokens that have already appeared, independent of frequency:

- < 0.0: Encourage repetition
- = 0.0: No penalty (default)
- > 0.0: Discourage repetition

Penalty based on token frequency in the generated text:

- < 0.0: Encourage repetition
- = 0.0: No penalty (default)
- > 0.0: Discourage repetition (stronger for more frequent tokens)
Number of prompt tokens to ignore for presence/frequency penalties. Defaults to 0.
Prevent repetition of n-grams of this size. Defaults to a very large value (no limit).
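Exact penalty formulas are implementation-specific; the sketch below follows the widely used convention of a multiplicative repetition penalty and additive presence/frequency penalties. `apply_penalties` is a hypothetical helper, not the library's code.

```python
from collections import Counter

def apply_penalties(logits, generated_ids,
                    repetition_penalty=1.0,
                    presence_penalty=0.0,
                    frequency_penalty=0.0):
    """Return adjusted logits for tokens already generated (sketch)."""
    counts = Counter(generated_ids)
    out = list(logits)
    for tok, count in counts.items():
        # Repetition penalty: divide positive logits, multiply negative ones,
        # so values > 1.0 always push the logit down.
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty
        # Presence penalty: flat subtraction once the token has appeared.
        out[tok] -= presence_penalty
        # Frequency penalty: subtraction scaled by occurrence count.
        out[tok] -= frequency_penalty * count
    return out
```

The asymmetric divide/multiply treatment of the repetition penalty is what keeps it direction-consistent for both positive and negative logits.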
Output Control
Number of log probabilities to return per output token:

- None: No log probabilities
- 0: Only the sampled token's log probability
- K > 0: Top-K log probabilities, plus the sampled token's if it is not in the top K

Number of log probabilities to return per prompt token. Same format as logprobs.

Log probability calculation mode:

- LogprobMode.RAW: Raw log probabilities from the model output
- LogprobMode.PROCESSED: Log probabilities after applying sampling parameters (temperature, top-k, top-p)
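What the "top-K plus sampled token" rule returns can be illustrated with a small sketch; `top_logprobs` is a hypothetical helper operating on an already-normalized probability distribution.

```python
import math

def top_logprobs(probs, sampled_id, k):
    """Return {token_id: logprob} for the top-k tokens, plus the
    sampled token if it falls outside the top k (illustrative)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # k = 0 yields only the sampled token's entry.
    chosen = set(ranked[:k]) | {sampled_id}
    return {i: math.log(probs[i]) for i in chosen}
```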
Return full logits tensor for prompt tokens.
Return full logits tensor for generated tokens.
Exclude input tokens from output token IDs.
Return encoder hidden states for encoder-decoder models.
Include performance metrics in output (TTFT, latency, throughput, etc.).
Additional model outputs to gather (model-specific).
Tokenization
Convert output token IDs to text.
Add special tokens (BOS, EOS) when encoding the prompt.
Skip special tokens when decoding output text.
Add spaces between special tokens in decoded output.
Truncate prompt to last K tokens (left truncation). Must be ≥ 1.
Advanced
Embedding bias tensor of shape [vocab_size] with dtype float32.

Custom logits processor callback(s) to modify logits before sampling. Can be a single processor or a list of processors.

Apply a batched logits processor. The processor must be provided when initializing the LLM.

Configuration for lookahead decoding optimization.
Guided decoding parameters for structured output (JSON, regex, grammar).
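A logits processor, as described above, is simply a callable that rewrites the next-token logits before sampling. The real callback signature and tensor types are library-specific; the sketch below uses a simplified stand-in signature and plain lists.

```python
def ban_tokens_processor(banned_ids):
    """Build a processor that masks out the given token IDs (sketch;
    real processors typically operate on tensors and receive more context)."""
    def process(token_ids, logits):
        # token_ids: tokens generated so far; logits: next-token scores.
        out = list(logits)
        for t in banned_ids:
            out[t] = float("-inf")  # banned tokens can never be sampled
        return out
    return process

proc = ban_tokens_processor({1})
print(proc([0], [0.2, 5.0, 0.1]))  # → [0.2, -inf, 0.1]
```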