SamplingParams
The SamplingParams class controls the sampling behavior during text generation, including temperature, top-k/top-p sampling, penalties, and structured output constraints.
Usage
from sglang import Engine
from sglang.srt.sampling.sampling_params import SamplingParams

engine = Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Option 1: Pass a dict to engine.generate()
response = engine.generate(
    prompt="Once upon a time",
    sampling_params={"temperature": 0.8, "max_new_tokens": 128}
)

# Option 2: Create a SamplingParams object
sampling_params = SamplingParams(
    temperature=0.8,
    max_new_tokens=128,
    top_p=0.9
)
response = engine.generate(
    prompt="Once upon a time",
    sampling_params=sampling_params.__dict__
)
Parameters
Generation Length
max_new_tokens
int
default:"128"
Maximum number of tokens to generate.
max_new_tokens=256 # Generate up to 256 tokens
min_new_tokens
int
default:"0"
Minimum number of tokens to generate before stopping. Useful to prevent early termination.
min_new_tokens=16 # Force at least 16 tokens
Temperature and Sampling
temperature
float
default:"1.0"
Sampling temperature. Controls randomness in generation.
0.0: Greedy decoding (deterministic)
< 1.0: Less random (more focused)
= 1.0: Neutral
> 1.0: More random (more creative)
temperature=0.0 # Deterministic output
temperature=0.7 # Balanced creativity
temperature=1.5 # Very creative/random
top_p
float
default:"1.0"
Nucleus sampling probability threshold. Sampling keeps the smallest set of most likely tokens whose cumulative probability reaches top_p.
top_p=0.9 # Consider tokens covering 90% probability mass
top_p=0.95 # More diverse output
top_k
int
default:"-1"
Top-k sampling: only consider the k most likely tokens.
-1: Disabled (consider all tokens)
> 0: Only consider top k tokens
top_k=50 # Only sample from top 50 tokens
top_k=-1 # Disabled
min_p
float
default:"0.0"
Minimum probability threshold for token selection. Tokens whose probability falls below min_p times the probability of the most likely token are filtered out.
min_p=0.05 # Filter out low-probability tokens
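Taken together, temperature, top_p, top_k, and min_p shape the token distribution before a token is drawn. The following pure-Python sketch illustrates the standard filtering pipeline under the semantics described above; it is an educational illustration, not SGLang's internal sampling kernel.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=-1, top_p=1.0, min_p=0.0):
    """Apply temperature, top-k, top-p, and min-p filtering to raw logits.
    Returns a renormalized probability distribution."""
    if temperature == 0.0:  # greedy: all probability mass on the argmax
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]

    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices ranked by probability, most likely first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])

    # Top-k: keep only the k most likely tokens (-1 disables the filter).
    keep = set(order) if top_k == -1 else set(order[:top_k])

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cum, nucleus = 0.0, set()
    for i in order:
        nucleus.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    keep &= nucleus

    # Min-p: drop tokens much less likely than the single best token.
    max_prob = probs[order[0]]
    keep = {i for i in keep if probs[i] >= min_p * max_prob}

    # Zero out filtered tokens and renormalize.
    masked = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    z = sum(masked)
    return [p / z for p in masked]
```

For example, `filter_distribution([2.0, 1.0, 0.0], top_k=2)` zeroes out the least likely token and renormalizes the remaining two.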
Penalties
frequency_penalty
float
default:"0.0"
Penalty applied to tokens based on their frequency in the generated text. Range: [-2.0, 2.0]
- Positive values: Reduce repetition
- Negative values: Encourage repetition
frequency_penalty=0.5 # Reduce repetition
frequency_penalty=1.0 # Strongly reduce repetition
presence_penalty
float
default:"0.0"
Penalty applied to tokens that have already appeared in the generated text. Range: [-2.0, 2.0]
- Positive values: Encourage diversity
- Negative values: Encourage using the same tokens
presence_penalty=0.6 # Encourage diverse vocabulary
repetition_penalty
float
default:"1.0"
Multiplicative penalty for repeating tokens from the prompt or previous output. Range: [0.0, 2.0]
1.0: No penalty
> 1.0: Discourage repetition
< 1.0: Encourage repetition
repetition_penalty=1.1 # Slightly discourage repetition
repetition_penalty=1.5 # Strongly discourage repetition
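A sketch of how these three penalties could be applied to logits, assuming the OpenAI-style additive formula for frequency_penalty and presence_penalty and the Hugging Face-style multiplicative rule for repetition_penalty (illustrative only, not SGLang's exact kernel):

```python
def apply_penalties(logits, output_counts,
                    frequency_penalty=0.0, presence_penalty=0.0,
                    repetition_penalty=1.0):
    """Penalize logits for tokens already generated.
    output_counts maps token id -> number of times it appeared in the output."""
    penalized = list(logits)
    for tok, count in output_counts.items():
        # Additive penalties: scale with count, or apply once for mere presence.
        penalized[tok] -= frequency_penalty * count
        penalized[tok] -= presence_penalty  # count > 0 by construction
        # Multiplicative repetition penalty: shrink positive logits,
        # push negative logits further down.
        if penalized[tok] > 0:
            penalized[tok] /= repetition_penalty
        else:
            penalized[tok] *= repetition_penalty
    return penalized
```

With `repetition_penalty=2.0`, a repeated token's logit of 2.0 becomes 1.0, while a logit of -1.0 becomes -2.0, making both strictly less likely.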
Stop Conditions
stop
Optional[Union[str, List[str]]]
default:"None"
String(s) that will stop generation when encountered.
stop="\n\n" # Stop at double newline
stop=["\n\n", "END", "STOP"] # Stop at any of these strings
stop_token_ids
Optional[List[int]]
default:"None"
Token IDs that will stop generation when encountered.
stop_token_ids=[2, 32000] # Stop at these token IDs
stop_regex
Optional[Union[str, List[str]]]
default:"None"
Regular expression(s) that will stop generation when matched.
stop_regex=r"\d{4}-\d{2}-\d{2}" # Stop at date pattern
ignore_eos
bool
default:"False"
Ignore the end-of-sequence token and continue generating. Useful when you want to generate exactly max_new_tokens tokens.
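The three stop conditions can be pictured as checks against the generation in progress. This sketch operates on the full decoded string for clarity; the engine performs the equivalent checks incrementally as tokens arrive.

```python
import re

def check_stop(text, token_ids, stop=None, stop_token_ids=None, stop_regex=None):
    """Return True if any configured stop condition fires."""
    # stop: one string or a list of strings, matched as substrings.
    stops = [stop] if isinstance(stop, str) else (stop or [])
    if any(s in text for s in stops):
        return True
    # stop_token_ids: stop when the most recent token is in the set.
    if stop_token_ids and token_ids and token_ids[-1] in stop_token_ids:
        return True
    # stop_regex: one pattern or a list of patterns, matched anywhere.
    patterns = [stop_regex] if isinstance(stop_regex, str) else (stop_regex or [])
    if any(re.search(p, text) for p in patterns):
        return True
    return False
```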
Structured Output
json_schema
Optional[str]
default:"None"
JSON schema for structured output generation.
json_schema='''{
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    },
    "required": ["name", "age"]
}'''
regex
Optional[str]
default:"None"
Regular expression constraint for output generation.
regex=r"\d{3}-\d{2}-\d{4}" # Generate SSN format
regex=r"[A-Z][a-z]+" # Generate capitalized word
ebnf
Optional[str]
default:"None"
EBNF grammar constraint for output generation.
ebnf='''
root ::= sentence+
sentence ::= word+ "." "\n"
word ::= [a-zA-Z]+
'''
Only one of json_schema, regex, or ebnf can be set at a time.
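Whichever constraint is used, it is still good practice to verify the decoded output on the client side. A minimal stdlib-only check for a json_schema-constrained response, using manual key checks rather than a full JSON Schema validator:

```python
import json

def validate_json_output(text, required_keys):
    """Parse model output and confirm the required top-level keys exist."""
    data = json.loads(text)  # raises ValueError on malformed JSON
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data

# Hypothetical model output matching the schema above:
data = validate_json_output('{"name": "Ada", "age": 36}', ["name", "age"])
```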
Output Control
skip_special_tokens
bool
default:"True"
Skip special tokens in the output text.
spaces_between_special_tokens
bool
default:"True"
Add spaces between special tokens in the output.
no_stop_trim
bool
default:"False"
Don't trim the stop string from the output. By default, stop strings are removed from the output; set to True to keep them.
Advanced Options
n
int
default:"1"
Number of completions to generate for each prompt.
n=3 # Generate 3 different completions
logit_bias
Optional[Dict[str, float]]
default:"None"
Bias to add to logits of specific tokens. Keys are token IDs (as strings), values are bias values.
logit_bias={
    "1024": 5.0, # Strongly encourage token 1024
    "2048": -10.0 # Strongly discourage token 2048
}
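Conceptually, logit_bias is a simple additive adjustment made to the logits before sampling. A sketch of the assumed semantics (mirroring the OpenAI-style parameter, with string token-id keys as shown above):

```python
def apply_logit_bias(logits, logit_bias):
    """Add per-token biases. Keys are token ids as strings, per the API shape."""
    biased = list(logits)
    for tok_str, bias in logit_bias.items():
        biased[int(tok_str)] += bias
    return biased
```

A large positive bias makes a token dominate the distribution; a large negative bias effectively bans it.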
sampling_seed
Optional[int]
default:"None"
Random seed for sampling. Enables reproducible generation.
sampling_seed=42 # Reproducible output
stream_interval
Optional[int]
default:"None"
Token interval for streaming. Return output every N tokens.
stream_interval=5 # Stream every 5 tokens
custom_params
Optional[Dict[str, Any]]
default:"None"
Custom parameters for specialized use cases.
Common Patterns
Greedy Decoding (Deterministic)
sampling_params = {
    "temperature": 0.0,
    "max_new_tokens": 100
}
Balanced Generation
sampling_params = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 256,
    "frequency_penalty": 0.5
}
Creative Writing
sampling_params = {
    "temperature": 1.2,
    "top_p": 0.95,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}
Structured JSON Output
sampling_params = {
    "max_new_tokens": 256,
    "json_schema": '''{
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"}
        },
        "required": ["name", "age"]
    }'''
}
Regex-Constrained Output
sampling_params = {
    "max_new_tokens": 20,
    "regex": r"\(\d{3}\) \d{3}-\d{4}"
}
Code Generation
sampling_params = {
    "temperature": 0.2,
    "top_p": 0.95,
    "max_new_tokens": 512,
    "stop": ["\n\n", "```"],
    "repetition_penalty": 1.05
}
Reproducible Output
sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 256,
    "sampling_seed": 42 # Same seed = same output
}
Usage Examples
Basic Usage
from sglang import Engine
engine = Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
response = engine.generate(
    prompt="Explain quantum computing in simple terms",
    sampling_params={
        "temperature": 0.7,
        "max_new_tokens": 200,
        "top_p": 0.9
    }
)
print(response["text"])
Multiple Completions
response = engine.generate(
    prompt="Write a tagline for a coffee shop",
    sampling_params={
        "temperature": 1.0,
        "max_new_tokens": 30,
        "n": 5 # Generate 5 different taglines
    }
)
for i, completion in enumerate(response):
    print(f"Tagline {i+1}: {completion['text']}")
Controlled Repetition
response = engine.generate(
    prompt="Generate a list of programming languages:",
    sampling_params={
        "temperature": 0.8,
        "max_new_tokens": 100,
        "frequency_penalty": 1.0,
        "presence_penalty": 0.5,
        "stop": "\n\n"
    }
)
JSON Output
response = engine.generate(
    prompt="Extract information from: John Doe is 30 years old and lives in NYC",
    sampling_params={
        "max_new_tokens": 150,
        "json_schema": '''{
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "city": {"type": "string"}
            }
        }'''
    }
)

import json
result = json.loads(response["text"])
print(result) # {"name": "John Doe", "age": 30, "city": "NYC"}
Regex Constraint
# Generate a date in YYYY-MM-DD format
response = engine.generate(
prompt="Today's date is:",
sampling_params={
"max_new_tokens": 20,
"regex": r"\d{4}-\d{2}-\d{2}"
}
)
print(response["text"]) # e.g., "2024-03-15"
Validation
The SamplingParams class includes validation to ensure parameters are within valid ranges:
temperature >= 0.0
0.0 < top_p <= 1.0
0.0 <= min_p <= 1.0
top_k >= 1 or -1 (disabled)
-2.0 <= frequency_penalty <= 2.0
-2.0 <= presence_penalty <= 2.0
0.0 <= repetition_penalty <= 2.0
0 <= min_new_tokens <= max_new_tokens
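The ranges listed above can also be checked client-side before a request is sent. This is a sketch of the documented constraints, not the library's own validator; the default values it assumes are illustrative.

```python
def validate_sampling_params(p):
    """Raise ValueError if any documented range constraint is violated."""
    checks = [
        (p.get("temperature", 1.0) >= 0.0, "temperature must be >= 0.0"),
        (0.0 < p.get("top_p", 1.0) <= 1.0, "top_p must be in (0.0, 1.0]"),
        (0.0 <= p.get("min_p", 0.0) <= 1.0, "min_p must be in [0.0, 1.0]"),
        (p.get("top_k", -1) == -1 or p.get("top_k", -1) >= 1,
         "top_k must be -1 (disabled) or >= 1"),
        (-2.0 <= p.get("frequency_penalty", 0.0) <= 2.0,
         "frequency_penalty must be in [-2.0, 2.0]"),
        (-2.0 <= p.get("presence_penalty", 0.0) <= 2.0,
         "presence_penalty must be in [-2.0, 2.0]"),
        (0.0 <= p.get("repetition_penalty", 1.0) <= 2.0,
         "repetition_penalty must be in [0.0, 2.0]"),
        (0 <= p.get("min_new_tokens", 0) <= p.get("max_new_tokens", 128),
         "min_new_tokens must be in [0, max_new_tokens]"),
    ]
    for ok, msg in checks:
        if not ok:
            raise ValueError(msg)
```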
Best Practices
For deterministic output: Use temperature=0.0 or set sampling_seed to a fixed value.
For creative tasks: Use higher temperature (0.8-1.2) with top_p=0.9-0.95.
For structured output: Use json_schema or regex constraints to ensure valid format.
To reduce repetition: Combine frequency_penalty, presence_penalty, and repetition_penalty.
Setting temperature to 0 is converted internally to temperature=1.0 with top_k=1 for greedy sampling.
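That conversion can be pictured as a small normalization step (an illustrative sketch of the described behavior, not SGLang's source):

```python
def normalize_greedy(params):
    """Rewrite temperature=0.0 as top_k=1 greedy sampling, per the note above."""
    params = dict(params)  # avoid mutating the caller's dict
    if params.get("temperature", 1.0) == 0.0:
        params["temperature"] = 1.0
        params["top_k"] = 1
    return params
```

With top_k=1, only the single most likely token survives filtering, so the temperature value no longer matters and the result is deterministic.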
See Also