SamplingParams

The SamplingParams class controls how text is generated from language models. It configures sampling strategies, stopping conditions, penalties, and output options.

Constructor

from tensorrt_llm.sampling_params import SamplingParams

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_tokens=256
)

Parameters

Token Generation

max_tokens
int
default:"32"
Maximum number of tokens to generate per output sequence.
min_tokens
int
default:"None"
Minimum number of tokens to generate; prevents stopping conditions from ending the sequence before this length is reached. Values < 1 have no effect.
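
For example, to force at least a short continuation while capping total output length:

from tensorrt_llm.sampling_params import SamplingParams

params = SamplingParams(
    min_tokens=50,    # do not stop before 50 tokens are generated
    max_tokens=512    # hard cap per output sequence
)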

Sampling Strategy

temperature
float
default:"None"
Temperature for sampling (≥ 0). Controls randomness:
  • 0.0: Greedy decoding (deterministic)
  • < 1.0: More focused, deterministic outputs
  • = 1.0: Standard sampling
  • > 1.0: More random, creative outputs
If None and neither top_p nor top_k is specified, defaults to greedy decoding.
top_p
float
default:"None"
Nucleus sampling threshold (0 to 1). Sampling is restricted to the smallest set of most probable tokens whose cumulative probability exceeds top_p.
  • 1.0: Consider all tokens (standard sampling)
  • < 1.0: Consider only most probable tokens
If None and neither temperature nor top_k is specified, defaults to greedy decoding.
top_k
int
default:"None"
Sample from the top K most likely tokens.
  • 0: Consider all tokens
  • 1: Greedy decoding
  • > 1: Limit sampling to top K tokens
If None and neither temperature nor top_p is specified, defaults to greedy decoding.
top_p_min
float
default:"None"
Lower bound for top-P decay algorithm. Defaults to 1e-6.
top_p_decay
float
default:"None"
Decay factor for top-P algorithm. Defaults to 1.0.
top_p_reset_ids
int
default:"None"
Token ID where top-P decay resets. Defaults to 1.
min_p
float
default:"None"
Minimum token probability threshold, scaled by the probability of the most likely token: tokens with probability below min_p × P(top token) are filtered out. Defaults to 0.0 (no filtering).
seed
int
default:"None"
Random seed for reproducible sampling. Defaults to 0.
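
For example, a reproducible run that also filters out very unlikely tokens:

params = SamplingParams(
    temperature=0.8,
    min_p=0.05,   # drop tokens below 5% of the top token's probability
    seed=42       # fixed seed for reproducible sampling
)
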
use_beam_search
bool
default:"False"
Enable beam search instead of sampling. When True, best_of becomes the beam width.
n
int
default:"1"
Number of output sequences to return per prompt.
best_of
int
default:"None"
Number of sequences to generate for selection:
  • Sampling mode: Generate best_of sequences, return top n by cumulative log probability
  • Beam search mode: Use beam width of best_of, return top n
Must satisfy best_of >= n. Defaults to n.
beam_search_diversity_rate
float
default:"None"
Diversity rate for beam search; higher values encourage more diverse beams. Defaults to 1.0.
beam_width_array
List[int]
default:"None"
Array of beam widths for variable-beam-width search.
length_penalty
float
default:"None"
Exponential penalty for sequence length in beam search. Defaults to 0.0.
early_stopping
int
default:"None"
Stop beam search when beam_width complete sequences are generated. Defaults to 1.
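
A sketch of diverse beam search combining the parameters above; the penalty values are illustrative:

params = SamplingParams(
    use_beam_search=True,
    best_of=4,                        # beam width
    n=2,                              # return the two best beams
    beam_search_diversity_rate=1.5,   # encourage beams to differ
    length_penalty=1.0,               # exponential penalty on sequence length
    max_tokens=200
)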

Stopping Conditions

end_id
int
default:"None"
End-of-sequence token ID. Generation stops when this token is generated. Defaults to tokenizer’s EOS token.
pad_id
int
default:"None"
Padding token ID. Defaults to end_id.
stop
str | List[str]
default:"None"
Stop string(s). Generation stops when any of these strings are generated.
SamplingParams(stop=["\n\n", "END", "###"])
stop_token_ids
List[int]
default:"None"
Stop token IDs. Generation stops when any of these tokens are generated.
include_stop_str_in_output
bool
default:"False"
Include stop string in the output text. When False, stop strings are removed.
ignore_eos
bool
default:"False"
Continue generation after the EOS token is generated.
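
For example, stopping on explicit token IDs instead of strings (the ID below is a placeholder; look up the correct IDs in your tokenizer):

params = SamplingParams(
    stop_token_ids=[128009],   # placeholder: e.g. an end-of-turn token ID
    ignore_eos=False,          # still stop on the model's EOS token
    max_tokens=512
)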

Bad Words / Tokens

bad
str | List[str]
default:"None"
String(s) that must not be generated. When one of these strings is about to be completed, the offending token is suppressed and an alternative token is sampled instead.
bad_token_ids
List[int]
default:"None"
Token IDs to avoid during generation.
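
For example, suppressing specific phrases and token IDs (the ID below is a placeholder):

params = SamplingParams(
    bad=["lorem ipsum"],   # phrases the model should not produce
    bad_token_ids=[42],    # placeholder: token IDs to block
    max_tokens=256
)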

Repetition Control

repetition_penalty
float
default:"None"
Penalty for repeating tokens:
  • < 1.0: Encourage repetition
  • = 1.0: No penalty (default)
  • > 1.0: Discourage repetition
presence_penalty
float
default:"None"
Penalty for tokens that have already appeared (independent of frequency):
  • < 0.0: Encourage repetition
  • = 0.0: No penalty (default)
  • > 0.0: Discourage repetition
frequency_penalty
float
default:"None"
Penalty based on token frequency in generated text:
  • < 0.0: Encourage repetition
  • = 0.0: No penalty (default)
  • > 0.0: Discourage repetition (stronger for more frequent tokens)
prompt_ignore_length
int
default:"None"
Number of prompt tokens to ignore for presence/frequency penalties. Defaults to 0.
no_repeat_ngram_size
int
default:"None"
Prevent repetition of n-grams of this size. Defaults to a very large value (effectively no restriction).

Output Control

logprobs
int
default:"None"
Number of log probabilities to return per output token:
  • None: No log probabilities
  • 0: Only the sampled token’s log probability
  • K > 0: Top-K log probabilities plus sampled token (if not in top-K)
prompt_logprobs
int
default:"None"
Number of log probabilities to return per prompt token. Same format as logprobs.
logprobs_mode
LogprobMode
default:"LogprobMode.RAW"
Log probability calculation mode:
  • LogprobMode.RAW: Raw log probabilities from model output
  • LogprobMode.PROCESSED: After applying sampling parameters (temperature, top-k, top-p)
return_context_logits
bool
default:"False"
Return full logits tensor for prompt tokens.
return_generation_logits
bool
default:"False"
Return full logits tensor for generated tokens.
exclude_input_from_output
bool
default:"True"
Exclude input tokens from output token IDs.
return_encoder_output
bool
default:"False"
Return encoder hidden states for encoder-decoder models.
return_perf_metrics
bool
default:"False"
Include performance metrics in output (TTFT, latency, throughput, etc.).
additional_model_outputs
List[str]
default:"None"
Additional model outputs to gather (model-specific).
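
For example, requesting full logits for generated tokens; these tensors can be large, so enable them only when needed:

params = SamplingParams(
    return_generation_logits=True,   # logits for each generated token
    return_context_logits=False,     # skip prompt logits to save memory
    max_tokens=64
)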

Tokenization

detokenize
bool
default:"True"
Convert output token IDs to text.
add_special_tokens
bool
default:"True"
Add special tokens (BOS, EOS) when encoding the prompt.
skip_special_tokens
bool
default:"True"
Skip special tokens when decoding output text.
spaces_between_special_tokens
bool
default:"True"
Add spaces between special tokens in decoded output.
truncate_prompt_tokens
int
default:"None"
Truncate prompt to last K tokens (left truncation). Must be ≥ 1.
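
For example, returning raw token IDs and left-truncating long prompts:

params = SamplingParams(
    detokenize=False,             # return token IDs without decoding to text
    truncate_prompt_tokens=2048,  # keep only the last 2048 prompt tokens
    max_tokens=128
)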

Advanced

embedding_bias
torch.Tensor
default:"None"
Embedding bias tensor of shape [vocab_size] with dtype float32.
logits_processor
LogitsProcessor | List[LogitsProcessor]
default:"None"
Custom logits processor callback(s) to modify logits before sampling. Can be a single processor or list.
apply_batched_logits_processor
bool
default:"False"
Apply batched logits processor. Processor must be provided when initializing LLM.
lookahead_config
LookaheadDecodingConfig
default:"None"
Configuration for lookahead decoding optimization.
guided_decoding
GuidedDecodingParams
default:"None"
Guided decoding parameters for structured output (JSON, regex, grammar).
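
A minimal sketch of a custom logits processor. The callback signature shown follows recent TensorRT-LLM examples but varies across versions, so treat it as illustrative and check the LogitsProcessor base class in your installation:

from typing import List, Optional

import torch

from tensorrt_llm.sampling_params import LogitsProcessor, SamplingParams

class ForceTokenProcessor(LogitsProcessor):
    """Illustrative: masks every logit except one allowed token ID."""

    def __init__(self, allowed_token_id: int):
        self.allowed_token_id = allowed_token_id

    def __call__(self, req_id: int, logits: torch.Tensor,
                 token_ids: List[List[int]], stream_ptr: Optional[int],
                 client_id: Optional[int]) -> None:
        # Modify logits in place on the CUDA stream that owns the tensor
        with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)):
            mask = torch.full_like(logits, float("-inf"))
            mask[..., self.allowed_token_id] = 0.0
            logits += mask

params = SamplingParams(
    logits_processor=ForceTokenProcessor(allowed_token_id=42)
)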

Usage Examples

Basic Sampling

from tensorrt_llm.sampling_params import SamplingParams

# Balanced creativity
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

Greedy Decoding

# Deterministic output
params = SamplingParams(
    temperature=0.0,  # or top_k=1
    max_tokens=100
)

Beam Search

params = SamplingParams(
    use_beam_search=True,
    best_of=4,  # beam width
    n=1,        # return best sequence
    max_tokens=200
)

Stop Strings

params = SamplingParams(
    stop=["\n\n", "###", "END"],
    include_stop_str_in_output=False,
    max_tokens=500
)

Multiple Outputs

# Generate 5 candidates, return best 3
params = SamplingParams(
    n=3,
    best_of=5,
    temperature=0.8,
    max_tokens=150
)

Log Probabilities

params = SamplingParams(
    logprobs=5,              # Top-5 token log probs
    prompt_logprobs=3,       # Top-3 for prompt tokens
    max_tokens=100
)

output = llm.generate("Hello", sampling_params=params)
for token_logprobs in output.outputs[0].logprobs:
    print(token_logprobs)  # Dict[token_id -> Logprob]

Repetition Control

params = SamplingParams(
    repetition_penalty=1.2,      # Discourage repetition
    frequency_penalty=0.5,       # Penalize frequent tokens
    presence_penalty=0.3,        # Penalize any repeated tokens
    no_repeat_ngram_size=3,      # No 3-gram repetition
    max_tokens=300
)

Structured Output with Guided Decoding

from tensorrt_llm.sampling_params import SamplingParams, GuidedDecodingParams

# JSON output
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        json={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"}
            },
            "required": ["name", "age"]
        }
    ),
    max_tokens=200
)

# Regex pattern
params = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        regex=r"\d{3}-\d{3}-\d{4}"  # Phone number format
    ),
    max_tokens=50
)

Performance Monitoring

params = SamplingParams(
    return_perf_metrics=True,
    max_tokens=256
)

output = llm.generate("Test prompt", sampling_params=params)
metrics = output.outputs[0].request_perf_metrics
print(f"Time to first token: {metrics.ttft}")
print(f"Throughput: {metrics.throughput}")

GuidedDecodingParams

Parameters for structured output generation:
from tensorrt_llm.sampling_params import GuidedDecodingParams

guided = GuidedDecodingParams(
    json_object=True  # Any valid JSON object
)
# OR
guided = GuidedDecodingParams(
    json={...}  # Specific JSON schema
)
# OR
guided = GuidedDecodingParams(
    regex="pattern"  # Regex pattern
)
# OR
guided = GuidedDecodingParams(
    grammar="EBNF grammar"  # EBNF grammar
)

LogprobMode

Enum for log probability modes:
from tensorrt_llm.sampling_params import LogprobMode

LogprobMode.RAW        # Raw model output logits
LogprobMode.PROCESSED  # After temperature/top-k/top-p
