SamplingParams

The SamplingParams class controls the sampling behavior during text generation, including temperature, top-k/top-p sampling, penalties, and structured output constraints.

Usage

from sglang import Engine
from sglang.srt.sampling.sampling_params import SamplingParams

engine = Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Option 1: Pass dict to engine.generate()
response = engine.generate(
    prompt="Once upon a time",
    sampling_params={"temperature": 0.8, "max_new_tokens": 128}
)

# Option 2: Create SamplingParams object
sampling_params = SamplingParams(
    temperature=0.8,
    max_new_tokens=128,
    top_p=0.9
)
response = engine.generate(
    prompt="Once upon a time",
    sampling_params=sampling_params.__dict__  # engine.generate() expects a dict
)

Parameters

Generation Length

max_new_tokens
int
default:"128"
Maximum number of tokens to generate.
max_new_tokens=256  # Generate up to 256 tokens
min_new_tokens
int
default:"0"
Minimum number of tokens to generate before stopping. Useful for preventing early termination.

Temperature and Sampling

temperature
float
default:"1.0"
Sampling temperature. Controls randomness in generation.
  • 0.0: Greedy decoding (deterministic)
  • < 1.0: Less random (more focused)
  • = 1.0: Neutral
  • > 1.0: More random (more creative)
temperature=0.0   # Deterministic output
temperature=0.7   # Balanced creativity
temperature=1.5   # Very creative/random
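Conceptually, temperature divides the logits before the softmax, so lower values sharpen the distribution and higher values flatten it. A standalone sketch of the effect (plain Python, not SGLang internals):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaling by temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.5)   # sharper: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flatter: more spread out
print(low[0] > high[0])  # True: the top token gains probability as temperature drops
```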
top_p
float
default:"1.0"
Nucleus (top-p) sampling threshold. Sampling is restricted to the smallest set of highest-probability tokens whose cumulative probability reaches top_p.
top_p=0.9   # Consider tokens covering 90% probability mass
top_p=0.95  # More diverse output
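The nucleus-filtering step can be sketched as follows: sort tokens by probability, then keep them until the cumulative mass reaches top_p (a simplified illustration, not SGLang's kernel):

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p.

    probs: dict mapping token -> probability. Returns the kept tokens in order.
    """
    kept, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_p_filter(probs, 0.9))  # ['the', 'a', 'cat'] -- 'zebra' is cut
```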
top_k
int
default:"-1"
Top-k sampling: only consider the k most likely tokens.
  • -1: Disabled (consider all tokens)
  • > 0: Only consider top k tokens
top_k=50   # Only sample from top 50 tokens
top_k=-1   # Disabled
min_p
float
default:"0.0"
Minimum probability threshold for token selection. Tokens with probability < min_p are filtered out.
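In many samplers min_p is interpreted relative to the probability of the most likely token; that relative reading is an assumption here, so check your SGLang version for the exact semantics. A minimal sketch under that assumption:

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability falls below min_p times the top probability.

    The relative interpretation is an assumption; absolute thresholds also exist.
    """
    threshold = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= threshold}

probs = {"the": 0.6, "a": 0.3, "xylophone": 0.01}
print(min_p_filter(probs, 0.1))  # 'xylophone' (0.01 < 0.06) is filtered out
```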

Penalties

frequency_penalty
float
default:"0.0"
Penalty for tokens based on their frequency in the generated text. Range: [-2.0, 2.0]
  • Positive values: Reduce repetition
  • Negative values: Encourage repetition
frequency_penalty=0.5   # Reduce repetition
frequency_penalty=1.0   # Strongly reduce repetition
presence_penalty
float
default:"0.0"
Penalty for tokens that have already appeared in the generated text. Range: [-2.0, 2.0]
  • Positive values: Encourage diversity
  • Negative values: Encourage using the same tokens
presence_penalty=0.6   # Encourage diverse vocabulary
repetition_penalty
float
default:"1.0"
Penalty for repeating tokens from the prompt or previous output. Range: [0.0, 2.0]
  • 1.0: No penalty
  • > 1.0: Discourage repetition
  • < 1.0: Encourage repetition
repetition_penalty=1.1   # Slightly discourage repetition
repetition_penalty=1.5   # Strongly discourage repetition
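The frequency and presence penalties can be sketched as a direct adjustment to each token's logit based on how often it has already appeared: the frequency penalty scales with the repetition count, while the presence penalty is a flat cost applied once a token has appeared at all. A simplified model (not SGLang's actual kernel):

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty, presence_penalty):
    """Subtract count-based (frequency) and appearance-based (presence) penalties."""
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= count * frequency_penalty  # scales with repetitions
            adjusted[token] -= presence_penalty           # flat, once it appeared
    return adjusted

logits = {"cat": 2.0, "dog": 2.0}
out = apply_penalties(logits, ["cat", "cat", "cat"], 0.5, 0.6)
print(out)  # {'cat': -0.1, 'dog': 2.0} -- 'cat' pays 3*0.5 + 0.6
```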

Stop Conditions

stop
Optional[Union[str, List[str]]]
default:"None"
String(s) that will stop generation when encountered.
stop="\n\n"                    # Stop at double newline
stop=["\n\n", "END", "STOP"]  # Stop at any of these strings
stop_token_ids
Optional[List[int]]
default:"None"
Token IDs that will stop generation when encountered.
stop_token_ids=[2, 32000]  # Stop at these token IDs
stop_regex
Optional[Union[str, List[str]]]
default:"None"
Regular expression(s) that will stop generation when matched.
stop_regex=r"\d{4}-\d{2}-\d{2}"  # Stop at date pattern
ignore_eos
bool
default:"False"
Ignore the end-of-sequence token and continue generating. Useful when you want to generate exactly max_new_tokens tokens.

Structured Output

json_schema
Optional[str]
default:"None"
JSON schema for structured output generation.
json_schema='''{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"}
  },
  "required": ["name", "age"]
}'''
regex
Optional[str]
default:"None"
Regular expression constraint for output generation.
regex=r"\d{3}-\d{2}-\d{4}"  # Generate SSN format
regex=r"[A-Z][a-z]+"        # Generate capitalized word
ebnf
Optional[str]
default:"None"
EBNF grammar constraint for output generation.
ebnf='''
root ::= sentence+
sentence ::= word+ "." "\n"
word ::= [a-zA-Z]+
'''
Only one of json_schema, regex, or ebnf can be set at a time.

Output Control

skip_special_tokens
bool
default:"True"
Skip special tokens in the output text.
spaces_between_special_tokens
bool
default:"True"
Add spaces between special tokens in the output.
no_stop_trim
bool
default:"False"
Don't trim the stop string from the output. By default, stop strings are removed from the output; set to True to keep them.

Advanced Options

n
int
default:"1"
Number of completions to generate for each prompt.
n=3  # Generate 3 different completions
logit_bias
Optional[Dict[str, float]]
default:"None"
Bias to add to the logits of specific tokens. Keys are token IDs (as strings); values are bias values.
logit_bias={
  "1024": 5.0,   # Strongly encourage token 1024
  "2048": -10.0  # Strongly discourage token 2048
}
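The bias is applied additively to the logits before sampling. A minimal sketch of the effect (the token IDs are made up for illustration):

```python
def apply_logit_bias(logits, logit_bias):
    """Add per-token biases. Keys are token IDs as strings, matching the API shape."""
    adjusted = dict(logits)
    for token_id_str, bias in logit_bias.items():
        token_id = int(token_id_str)
        if token_id in adjusted:
            adjusted[token_id] += bias
    return adjusted

logits = {1024: 0.0, 2048: 0.0, 7: 1.0}
biased = apply_logit_bias(logits, {"1024": 5.0, "2048": -10.0})
print(biased)  # {1024: 5.0, 2048: -10.0, 7: 1.0}
```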
sampling_seed
Optional[int]
default:"None"
Random seed for sampling. Enables reproducible generation.
sampling_seed=42  # Reproducible output
stream_interval
Optional[int]
default:"None"
Token interval for streaming. Return output every N tokens.
stream_interval=5  # Stream every 5 tokens
custom_params
Optional[Dict[str, Any]]
default:"None"
Custom parameters for specialized use cases.

Common Patterns

Greedy Decoding (Deterministic)

sampling_params = {
    "temperature": 0.0,
    "max_new_tokens": 100
}

Balanced Generation

sampling_params = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 256,
    "frequency_penalty": 0.5
}

Creative Writing

sampling_params = {
    "temperature": 1.2,
    "top_p": 0.95,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

Structured JSON Output

sampling_params = {
    "max_new_tokens": 256,
    "json_schema": '''{
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"}
        },
        "required": ["name", "age"]
    }'''
}

Format Constraint (Phone Number)

sampling_params = {
    "max_new_tokens": 20,
    "regex": r"\(\d{3}\) \d{3}-\d{4}"
}

Code Generation

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.95,
    "max_new_tokens": 512,
    "stop": ["\n\n", "```"],
    "repetition_penalty": 1.05
}

Reproducible Output

sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 256,
    "sampling_seed": 42  # Same seed = same output
}

Usage Examples

Basic Usage

from sglang import Engine

engine = Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

response = engine.generate(
    prompt="Explain quantum computing in simple terms",
    sampling_params={
        "temperature": 0.7,
        "max_new_tokens": 200,
        "top_p": 0.9
    }
)

print(response["text"])

Multiple Completions

response = engine.generate(
    prompt="Write a tagline for a coffee shop",
    sampling_params={
        "temperature": 1.0,
        "max_new_tokens": 30,
        "n": 5  # Generate 5 different taglines
    }
)

for i, completion in enumerate(response):
    print(f"Tagline {i+1}: {completion['text']}")

Controlled Repetition

response = engine.generate(
    prompt="Generate a list of programming languages:",
    sampling_params={
        "temperature": 0.8,
        "max_new_tokens": 100,
        "frequency_penalty": 1.0,
        "presence_penalty": 0.5,
        "stop": "\n\n"
    }
)

JSON Output

response = engine.generate(
    prompt="Extract information from: John Doe is 30 years old and lives in NYC",
    sampling_params={
        "max_new_tokens": 150,
        "json_schema": '''{
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "city": {"type": "string"}
            }
        }'''
    }
)

import json
result = json.loads(response["text"])
print(result)  # {"name": "John Doe", "age": 30, "city": "NYC"}

Regex Constraint

# Generate a date in YYYY-MM-DD format
response = engine.generate(
    prompt="Today's date is:",
    sampling_params={
        "max_new_tokens": 20,
        "regex": r"\d{4}-\d{2}-\d{2}"
    }
)

print(response["text"])  # e.g., "2024-03-15"

Validation

The SamplingParams class includes validation to ensure parameters are within valid ranges:
  • temperature >= 0.0
  • 0.0 < top_p <= 1.0
  • 0.0 <= min_p <= 1.0
  • top_k >= 1, or -1 (disabled)
  • -2.0 <= frequency_penalty <= 2.0
  • -2.0 <= presence_penalty <= 2.0
  • 0.0 <= repetition_penalty <= 2.0
  • 0 <= min_new_tokens <= max_new_tokens
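These checks amount to simple range assertions. A standalone sketch of the equivalent logic (not SGLang's actual verify() code; the min_new_tokens check is omitted for brevity):

```python
def validate_sampling_params(p):
    """Raise ValueError for out-of-range values, mirroring the documented ranges."""
    checks = [
        (p.get("temperature", 1.0) >= 0.0, "temperature must be >= 0.0"),
        (0.0 < p.get("top_p", 1.0) <= 1.0, "top_p must be in (0.0, 1.0]"),
        (0.0 <= p.get("min_p", 0.0) <= 1.0, "min_p must be in [0.0, 1.0]"),
        (p.get("top_k", -1) == -1 or p.get("top_k", -1) >= 1, "top_k must be -1 or >= 1"),
        (-2.0 <= p.get("frequency_penalty", 0.0) <= 2.0, "frequency_penalty out of range"),
        (-2.0 <= p.get("presence_penalty", 0.0) <= 2.0, "presence_penalty out of range"),
        (0.0 <= p.get("repetition_penalty", 1.0) <= 2.0, "repetition_penalty out of range"),
    ]
    for ok, message in checks:
        if not ok:
            raise ValueError(message)

validate_sampling_params({"temperature": 0.7, "top_p": 0.9})  # passes silently
```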

Best Practices

  • For deterministic output: use temperature=0.0 or set sampling_seed to a fixed value.
  • For creative tasks: use a higher temperature (0.8-1.2) with top_p between 0.9 and 0.95.
  • For structured output: use json_schema or regex constraints to ensure a valid format.
  • To reduce repetition: combine frequency_penalty, presence_penalty, and repetition_penalty.

Setting temperature to 0 is converted internally to temperature=1.0 with top_k=1, which is equivalent to greedy sampling.

See Also