Sampling parameters control how vLLM generates text from language models. These parameters let you fine-tune the randomness, diversity, and behavior of model outputs.

Overview

The SamplingParams class follows the OpenAI text completion API specification and supports additional features like beam search. All parameters are optional and have sensible defaults.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256
)

outputs = llm.generate(["Explain quantum computing:"], sampling_params)

Temperature and randomness

Temperature

Controls the randomness of sampling. Lower values make output more deterministic, higher values make it more random.
temperature
float
default:"1.0"
  • Set to 0 for greedy sampling (always picks the most likely token)
  • Values between 0.0 and 1.0 reduce randomness
  • Values above 1.0 increase randomness
  • Minimum value: 0.01 (values below this are clamped)
# Deterministic output
SamplingParams(temperature=0.0)

# Balanced creativity
SamplingParams(temperature=0.8)

# High creativity
SamplingParams(temperature=1.2)
When temperature is 0, vLLM uses greedy sampling and ignores top_p, top_k, and min_p. Nonzero temperatures below 0.01 are clamped up to 0.01.
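To build intuition for the settings above, here is a minimal pure-Python sketch (not vLLM's implementation) of how temperature reshapes a toy probability distribution:

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.

    A toy sketch of how temperature reshapes the sampling
    distribution; vLLM applies the same scaling internally.
    """
    if temperature == 0:
        # Greedy: all probability mass on the argmax token
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(apply_temperature(logits, 0.5))  # sharper: top token dominates
print(apply_temperature(logits, 1.5))  # flatter: more randomness
```

Lower temperature concentrates probability on the most likely token; higher temperature spreads it out, which is why low values read as deterministic and high values as creative.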

Seed

Random seed for reproducible generation with non-zero temperature.
seed
int | None
default:"None"
Set to any integer for reproducible sampling. Use None for non-deterministic sampling.
# Reproducible random sampling
SamplingParams(temperature=0.8, seed=42)

Top-k, top-p, and min-p sampling

These parameters filter which tokens are considered during sampling.

Top-p (nucleus sampling)

top_p
float
default:"1.0"
Considers only the smallest set of tokens whose cumulative probability exceeds top_p.
  • Range: (0.0, 1.0]
  • Set to 1.0 to consider all tokens
  • Lower values (e.g., 0.9) filter out low-probability tokens

Top-k sampling

top_k
int
default:"0"
Considers only the top k most likely tokens.
  • Set to 0 or -1 to disable (consider all tokens)
  • Higher values allow more diversity

Min-p sampling

min_p
float
default:"0.0"
Minimum probability threshold relative to the most likely token.
  • Range: [0.0, 1.0]
  • Set to 0.0 to disable
  • For example, 0.05 excludes tokens whose probability is less than 5% of the most likely token's
# Nucleus sampling: only top 90% probability mass
SamplingParams(temperature=0.8, top_p=0.9)

# Top-k: consider only top 50 tokens
SamplingParams(temperature=0.8, top_k=50)

# Combine multiple sampling strategies
SamplingParams(temperature=0.8, top_p=0.95, top_k=50, min_p=0.05)
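The filtering semantics above can be sketched on a toy, already-normalized distribution. This is an illustration of the definitions, not vLLM's actual kernel code:

```python
def filter_candidates(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Return the token indices that survive top-k, top-p, and
    min-p filtering, applied to a normalized distribution.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order
    # Top-k: keep only the k most likely tokens (0 or -1 disables)
    if top_k > 0:
        kept = kept[:top_k]
    # Top-p: smallest prefix whose cumulative probability reaches top_p
    if top_p < 1.0:
        cum, nucleus = 0.0, []
        for i in kept:
            nucleus.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        kept = nucleus
    # Min-p: drop tokens below min_p * (max probability)
    if min_p > 0.0:
        threshold = min_p * probs[order[0]]
        kept = [i for i in kept if probs[i] >= threshold]
    return kept

probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print(filter_candidates(probs, top_k=3))    # [0, 1, 2]
print(filter_candidates(probs, top_p=0.9))  # [0, 1, 2]
print(filter_candidates(probs, min_p=0.1))  # [0, 1, 2]
```

When several filters are combined, each further narrows the candidate set before the final sample is drawn.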

Output length control

Maximum tokens

max_tokens
int | None
default:"16"
Maximum number of tokens to generate per output sequence.
  • Must be at least 1
  • Set to None to generate until EOS token

Minimum tokens

min_tokens
int
default:"0"
Minimum number of tokens to generate before EOS or stop tokens can be generated.
  • Must be less than or equal to max_tokens
  • Prevents premature stopping
# Generate between 50 and 256 tokens
SamplingParams(min_tokens=50, max_tokens=256)

Repetition penalties

vLLM supports three types of repetition penalties to discourage repeated tokens.

Presence penalty

presence_penalty
float
default:"0.0"
Penalizes tokens that have already appeared in the generated text.
  • Range: [-2.0, 2.0]
  • Positive values encourage new tokens
  • Negative values encourage repetition

Frequency penalty

frequency_penalty
float
default:"0.0"
Penalizes tokens based on how often they appear in the generated text.
  • Range: [-2.0, 2.0]
  • Positive values discourage frequent tokens
  • Negative values encourage frequent tokens

Repetition penalty

repetition_penalty
float
default:"1.0"
Penalizes tokens based on whether they have appeared in the prompt or the generated text so far.
  • Must be greater than 0.0
  • Values > 1.0 encourage new tokens
  • Values < 1.0 encourage repetition
# Discourage repetition
SamplingParams(
    presence_penalty=0.5,
    frequency_penalty=0.5,
    repetition_penalty=1.2
)
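As a rough sketch of the arithmetic behind these three parameters, here is how a single token's logit could be adjusted given how many times that token has already appeared (additive OpenAI-style penalties, multiplicative HF-style repetition penalty; vLLM applies the equivalent math in batch):

```python
def penalize_logit(logit, count, presence_penalty=0.0,
                   frequency_penalty=0.0, repetition_penalty=1.0):
    """Apply the three penalties to one token's logit, given the
    number of times that token has already appeared.
    """
    if count > 0:
        # Presence penalty fires once; frequency penalty scales with count
        logit -= presence_penalty
        logit -= frequency_penalty * count
        # Repetition penalty: shrink positive logits, grow negative ones
        logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
    return logit

# A token seen 3 times is pushed down; an unseen token is untouched
print(penalize_logit(2.0, count=3, presence_penalty=0.5,
                     frequency_penalty=0.5, repetition_penalty=1.2))
print(penalize_logit(2.0, count=0, presence_penalty=0.5,
                     frequency_penalty=0.5, repetition_penalty=1.2))  # 2.0
```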

Stop sequences

Stop strings

stop
str | list[str] | None
default:"None"
String(s) that stop generation when encountered. The stop string is not included in the output (unless include_stop_str_in_output=True).

Stop token IDs

stop_token_ids
list[int] | None
default:"None"
Token IDs that stop generation. The stop tokens are included in output unless they are special tokens.

Ignore EOS

ignore_eos
bool
default:"False"
Whether to ignore the end-of-sequence token and continue generating.
# Stop at custom strings
SamplingParams(
    stop=["\n\n", "END"],
    max_tokens=256
)

# Stop at specific token IDs
SamplingParams(stop_token_ids=[128001, 128009])
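The stop-string semantics above can be sketched as a small truncation helper (an illustration of the behavior, not vLLM's implementation): generation stops at the earliest stop string, which is dropped from the output unless include_stop_str_in_output is set.

```python
def truncate_at_stop(text, stop, include_stop_str_in_output=False):
    """Cut generated text at the earliest occurrence of any stop string."""
    earliest = None
    for s in stop:
        idx = text.find(s)
        if idx != -1 and (earliest is None or idx < earliest[0]):
            earliest = (idx, s)
    if earliest is None:
        return text
    idx, s = earliest
    # Keep or drop the stop string itself depending on the flag
    return text[: idx + len(s)] if include_stop_str_in_output else text[:idx]

print(truncate_at_stop("step one\n\nstep two", ["\n\n", "END"]))  # 'step one'
print(truncate_at_stop("done END extra", ["END"],
                       include_stop_str_in_output=True))          # 'done END'
```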

Multiple outputs

n
int
default:"1"
Number of output sequences to generate for each prompt.
  • Must be at least 1
  • Cannot be greater than 1 when using greedy sampling (temperature=0)
# Generate 3 different completions
sampling_params = SamplingParams(
    temperature=0.8,
    n=3,
    max_tokens=100
)

outputs = llm.generate(["Once upon a time"], sampling_params)
for i, output in enumerate(outputs[0].outputs):
    print(f"Output {i+1}: {output.text}")
When using AsyncLLM, all n outputs are streamed cumulatively. To see all outputs only upon completion, use output_kind=RequestOutputKind.FINAL_ONLY.

Log probabilities

Sample logprobs

logprobs
int | None
default:"None"
Number of log probabilities to return per output token.
  • Set to None to disable
  • Set to -1 to return all vocabulary logprobs
  • Returns logprobs for up to logprobs + 1 tokens (includes sampled token)

Prompt logprobs

prompt_logprobs
int | None
default:"None"
Number of log probabilities to return per prompt token.
  • Set to None to disable
  • Set to -1 to return all vocabulary logprobs
# Return top 5 logprobs for each generated token
SamplingParams(logprobs=5, max_tokens=50)

# Return logprobs for both prompt and generation
SamplingParams(logprobs=5, prompt_logprobs=5)
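Returned logprobs are natural logarithms. Given a hypothetical {token_id: logprob} mapping like the one vLLM reports per output position, exponentiating recovers probabilities:

```python
import math

# Hypothetical top-5 logprobs for one generated token position
token_logprobs = {42: -0.1, 7: -2.3, 99: -3.0, 15: -4.1, 8: -5.2}

# exp(logprob) recovers the probability of each candidate token
token_probs = {tid: math.exp(lp) for tid, lp in token_logprobs.items()}
print(max(token_probs, key=token_probs.get))  # 42, the most likely token
```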

Advanced features

Logit bias

logit_bias
dict[int, float] | None
default:"None"
Apply bias to specific token logits before sampling.
  • Keys: token IDs
  • Values: bias values clamped to [-100.0, 100.0]
# Increase likelihood of specific tokens
SamplingParams(
    logit_bias={128: 2.0, 256: -5.0},  # Boost token 128, suppress 256
    temperature=0.8
)
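Conceptually, logit bias is an addition applied to the chosen token logits before sampling, with bias values clamped to [-100.0, 100.0]. A minimal sketch (not vLLM's implementation):

```python
def apply_logit_bias(logits, logit_bias):
    """Add clamped bias values to selected token logits.

    Keys are token IDs; values are clamped to [-100.0, 100.0]
    before being added.
    """
    out = list(logits)
    for token_id, bias in logit_bias.items():
        out[token_id] += max(-100.0, min(100.0, bias))
    return out

logits = [0.0, 1.0, 2.0]
print(apply_logit_bias(logits, {0: 5.0, 2: -200.0}))  # [5.0, 1.0, -98.0]
```

Large negative biases effectively ban a token; large positive biases make it near-certain.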

Allowed token IDs

allowed_token_ids
list[int] | None
default:"None"
Restrict generation to only these token IDs.
  • Cannot be empty
  • All IDs must be in vocabulary range

Bad words

bad_words
list[str] | None
default:"None"
Words that are not allowed in the generated output.
# Prevent specific words from being generated
SamplingParams(
    bad_words=["forbidden", "banned"],
    max_tokens=100
)

Detokenization options

detokenize
bool
default:"True"
Whether to detokenize the output. When False, only token IDs are returned.
skip_special_tokens
bool
default:"True"
Whether to skip special tokens in the output text.
spaces_between_special_tokens
bool
default:"True"
Whether to add spaces between special tokens in the output.
include_stop_str_in_output
bool
default:"False"
Whether to include stop strings in the output text.

Complete example

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")

# Configure comprehensive sampling parameters
sampling_params = SamplingParams(
    # Randomness
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    seed=42,
    
    # Length control
    max_tokens=256,
    min_tokens=10,
    
    # Repetition control
    presence_penalty=0.5,
    frequency_penalty=0.5,
    repetition_penalty=1.1,
    
    # Stop conditions
    stop=["\n\n", "END"],
    
    # Outputs
    n=1,
    logprobs=5
)

prompts = [
    "Explain quantum computing in simple terms:",
    "What are the benefits of exercise?"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
    print("-" * 50)
