Overview

Sampling parameters control how the model generates text. They affect randomness, diversity, length, and structure of the output.

Quick Reference

sampling_params = {
    "max_new_tokens": 256,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "stop": ["\n\n", "END"]
}

Token Generation

max_new_tokens

int | default: 128
Maximum number of tokens to generate.
# Short response
engine.generate(
    prompt="What is AI?",
    sampling_params={"max_new_tokens": 50}
)

# Long response
engine.generate(
    prompt="Write a detailed essay",
    sampling_params={"max_new_tokens": 2048}
)

min_new_tokens

int | default: 0
Minimum number of tokens to generate before allowing stop sequences or EOS.
# Ensure at least 100 tokens are generated
engine.generate(
    prompt="Explain quantum physics",
    sampling_params={
        "min_new_tokens": 100,
        "max_new_tokens": 500
    }
)

ignore_eos

bool | default: false
Continue generation even after EOS token is generated.
# Ignore end-of-sequence token
engine.generate(
    prompt="Count to 100",
    sampling_params={
        "max_new_tokens": 1000,
        "ignore_eos": True
    }
)

Randomness Control

temperature

float | default: 1.0
Controls randomness. Lower values (0.0-0.5) make output more focused and deterministic; higher values (0.8-2.0) make output more creative and diverse. Setting temperature to 0.0 enables greedy decoding (always pick the most likely token).
# Factual, deterministic output
engine.generate(
    prompt="What is the capital of France?",
    sampling_params={"temperature": 0.0}
)

# Creative, varied output
engine.generate(
    prompt="Write a creative story",
    sampling_params={"temperature": 0.9}
)

# Very random (can be incoherent)
engine.generate(
    prompt="Generate random text",
    sampling_params={"temperature": 1.5}
)
Best Practices:
  • 0.0: Math, factual QA, code generation
  • 0.3-0.5: General assistant, summaries
  • 0.7-0.9: Creative writing, brainstorming
  • 1.0+: Experimental, high diversity needed
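Conceptually, temperature divides the logits before the softmax: a small temperature exaggerates logit differences and concentrates probability on the top token, while a large one flattens the distribution. A minimal pure-Python sketch of the idea (not SGLang's actual implementation):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaling by temperature first.

    temperature -> 0 approaches greedy decoding (all mass on the argmax);
    temperature > 1 flattens the distribution.
    """
    if temperature == 0.0:
        # Greedy decoding: probability 1 on the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.0))  # [1.0, 0.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper than T=1
print(softmax_with_temperature(logits, 1.5))  # flatter than T=1
```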

top_p (Nucleus Sampling)

float | default: 1.0
Cumulative probability threshold for nucleus sampling: only the smallest set of tokens whose cumulative probability reaches top_p is considered. Range: (0.0, 1.0]. Lower values (0.1-0.5) produce more focused output; higher values (0.9-1.0) allow more diversity.
# Very focused
engine.generate(
    prompt="Summarize this article",
    sampling_params={"top_p": 0.1, "temperature": 0.7}
)

# Balanced
engine.generate(
    prompt="Write a paragraph",
    sampling_params={"top_p": 0.9, "temperature": 0.8}
)

# Maximum diversity
engine.generate(
    prompt="Brainstorm ideas",
    sampling_params={"top_p": 1.0, "temperature": 1.0}
)
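Nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches top_p, then renormalizes over that set. A rough sketch of the filtering step (not SGLang's implementation):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; return the renormalized distribution over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out

probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(probs, 0.8))  # keeps only the first two tokens
```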

top_k

int | default: -1
Only sample from the top K most likely tokens. Set to -1 to disable (consider all tokens).
# Very constrained
engine.generate(
    prompt="Complete: The capital of France is",
    sampling_params={"top_k": 5, "temperature": 0.7}
)

# More options
engine.generate(
    prompt="Write creatively",
    sampling_params={"top_k": 50, "temperature": 0.8}
)

# Unconstrained (default)
engine.generate(
    prompt="Generate text",
    sampling_params={"top_k": -1, "temperature": 0.8}
)
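Top-k is the same idea with a fixed-size candidate set: everything outside the k most likely tokens is zeroed out and the remainder renormalized. A sketch:

```python
def top_k_filter(probs, k):
    """Zero out everything but the k most likely tokens and renormalize.
    k == -1 disables the filter (all tokens considered)."""
    if k == -1:
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    out = [0.0] * len(probs)
    for i in order:
        out[i] = probs[i] / total
    return out

print(top_k_filter([0.5, 0.3, 0.15, 0.05], 2))  # only the top two survive
```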

min_p

float | default: 0.0
Minimum probability threshold. Tokens with probability below min_p are filtered out. Range: [0.0, 1.0]
# Filter out low-probability tokens
engine.generate(
    prompt="Write text",
    sampling_params={
        "min_p": 0.05,  # Ignore tokens with p < 5%
        "temperature": 0.8
    }
)
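Following the description above, min_p simply drops tokens below the probability threshold and renormalizes. (Note that some implementations instead scale the threshold by the top token's probability; this sketch uses the absolute-threshold reading.)

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p, then renormalize."""
    kept = [p if p >= min_p else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

print(min_p_filter([0.5, 0.3, 0.15, 0.05], 0.1))  # last token filtered out
```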

Repetition Control

frequency_penalty

float | default: 0.0
Penalize tokens based on how often they appear in the generated text; higher values reduce repetition. Range: [-2.0, 2.0]. Positive values discourage repetition; negative values encourage it.
# Reduce repetition
engine.generate(
    prompt="Write a diverse essay",
    sampling_params={
        "frequency_penalty": 0.5,
        "max_new_tokens": 500
    }
)

# Strong anti-repetition
engine.generate(
    prompt="List unique ideas",
    sampling_params={
        "frequency_penalty": 1.0,
        "max_new_tokens": 200
    }
)

presence_penalty

float | default: 0.0
Penalize tokens that have already appeared, regardless of how often. Range: [-2.0, 2.0]. Positive values encourage new topics; negative values keep the output on topic.
# Encourage topic diversity
engine.generate(
    prompt="Brainstorm topics",
    sampling_params={
        "presence_penalty": 0.6,
        "max_new_tokens": 300
    }
)

repetition_penalty

float | default: 1.0
Apply a multiplicative penalty to tokens that have already been generated. Range: [0.0, 2.0]. Values > 1.0 discourage repetition; 1.0 applies no penalty (default); values < 1.0 encourage repetition.
# Reduce repetition (alternative to frequency_penalty)
engine.generate(
    prompt="Write text",
    sampling_params={
        "repetition_penalty": 1.2,
        "max_new_tokens": 200
    }
)
Penalty Comparison:
  • frequency_penalty: Linear scaling based on token frequency
  • presence_penalty: Binary (appeared or not)
  • repetition_penalty: Multiplicative penalty on logits
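To make the comparison concrete, here is one way the three penalties could be applied to next-token logits. The exact order of application and sign conventions vary between implementations, so treat this as an illustrative sketch rather than SGLang's actual code:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Adjust next-token logits based on tokens generated so far.

    frequency_penalty scales with the token's count; presence_penalty is
    a flat subtraction for any token seen at least once; repetition_penalty
    divides positive logits (and multiplies negative ones) for seen tokens,
    following the CTRL-style convention.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for tok, count in counts.items():
        out[tok] -= frequency_penalty * count   # linear in frequency
        out[tok] -= presence_penalty            # binary: appeared or not
        if out[tok] > 0:
            out[tok] /= repetition_penalty      # multiplicative on logits
        else:
            out[tok] *= repetition_penalty
    return out

# Token 0 was generated twice, token 1 once; both get pushed down.
print(apply_penalties([2.0, 1.0, 0.0], [0, 0, 1],
                      frequency_penalty=0.5, presence_penalty=0.5))
```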

Stop Conditions

stop

string | array | default: null
Stop generation when any of these strings are generated.
# Single stop string
engine.generate(
    prompt="List items:\n1.",
    sampling_params={
        "stop": "\n\n",
        "max_new_tokens": 200
    }
)

# Multiple stop strings
engine.generate(
    prompt="Write code",
    sampling_params={
        "stop": ["```", "\n\nEND", "<|endoftext|>"],
        "max_new_tokens": 500
    }
)

stop_token_ids

array[int] | default: null
Stop generation when any of these token IDs are generated.
# Stop on specific token IDs
engine.generate(
    prompt="Generate text",
    sampling_params={
        "stop_token_ids": [128001, 128009],  # Model-specific IDs
        "max_new_tokens": 200
    }
)

stop_regex

string | array | default: null
Stop generation when output matches any of these regex patterns.
# Stop when a number pattern appears
engine.generate(
    prompt="Count:",
    sampling_params={
        "stop_regex": r"\d{3}",  # Stop at 3-digit number
        "max_new_tokens": 100
    }
)

# Multiple regex patterns
engine.generate(
    prompt="Write text",
    sampling_params={
        "stop_regex": [r"\bEND\b", r"\d{4}-\d{2}-\d{2}"],
        "max_new_tokens": 300
    }
)

no_stop_trim

bool | default: false
If true, don’t remove the stop string from the output.
# Include stop string in output
engine.generate(
    prompt="Count to 3:",
    sampling_params={
        "stop": "\n\n",
        "no_stop_trim": True,
        "max_new_tokens": 50
    }
)
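In essence, stop-string handling scans the decoded output for the earliest stop match and trims it off unless no_stop_trim is set. A simplified sketch (a real implementation works incrementally on streamed tokens):

```python
def check_stop(text, stop, no_stop_trim=False):
    """Return (stopped, output_text). Find the earliest occurrence of any
    stop string; cut the output there, keeping the stop string itself
    only when no_stop_trim is True."""
    if isinstance(stop, str):
        stop = [stop]
    earliest = None
    for s in stop:
        idx = text.find(s)
        if idx != -1 and (earliest is None or idx < earliest[0]):
            earliest = (idx, s)
    if earliest is None:
        return False, text
    idx, s = earliest
    end = idx + len(s) if no_stop_trim else idx
    return True, text[:end]

print(check_stop("one\n\ntwo", ["\n\n"]))        # (True, 'one')
print(check_stop("one\n\ntwo", "\n\n", True))    # (True, 'one\n\n')
```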

Structured Output

json_schema

string | default: null
JSON schema to constrain output. Ensures generated text is valid JSON matching the schema.
import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "email": {"type": "string", "format": "email"},
        "hobbies": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age"]
}

response = engine.generate(
    prompt="Generate a person profile",
    sampling_params={
        "json_schema": json.dumps(schema),
        "max_new_tokens": 200
    }
)

data = json.loads(response["text"])
print(f"Name: {data['name']}, Age: {data['age']}")

regex

string | default: null
Regular expression pattern to constrain output format.
# Phone number format
engine.generate(
    prompt="Generate a phone number:",
    sampling_params={
        "regex": r"\d{3}-\d{3}-\d{4}",
        "max_new_tokens": 20
    }
)
# Output: "555-123-4567"

# Email format
engine.generate(
    prompt="Generate an email:",
    sampling_params={
        "regex": r"[a-z]+@[a-z]+\.[a-z]+",
        "max_new_tokens": 30
    }
)
# Output: "[email protected]"

# Date format
engine.generate(
    prompt="Today's date:",
    sampling_params={
        "regex": r"\d{4}-\d{2}-\d{2}",
        "max_new_tokens": 15
    }
)
# Output: "2024-03-15"

ebnf

string | default: null
EBNF (Extended Backus-Naur Form) grammar to constrain output.
# Mathematical expression grammar
grammar = """
root ::= expression
expression ::= term (("+" | "-") term)*
term ::= factor (("*" | "/") factor)*
factor ::= number | "(" expression ")"
number ::= [0-9]+
"""

engine.generate(
    prompt="Generate a math expression:",
    sampling_params={
        "ebnf": grammar,
        "max_new_tokens": 50
    }
)
# Output: "(5 + 3) * 2 - 7"

# SQL query grammar
sql_grammar = """
root ::= select_stmt
select_stmt ::= "SELECT" column_list "FROM" table_name where_clause?
column_list ::= "*" | column_name ("," column_name)*
where_clause ::= "WHERE" condition
condition ::= column_name "=" value
column_name ::= [a-zA-Z_]+
table_name ::= [a-zA-Z_]+
value ::= [0-9]+ | "'" [a-zA-Z ]+ "'"
"""

engine.generate(
    prompt="Generate a SQL query:",
    sampling_params={
        "ebnf": sql_grammar,
        "max_new_tokens": 100
    }
)
# Output: "SELECT * FROM users WHERE id = 5"

Advanced Parameters

n (Number of Completions)

int | default: 1
Generate N independent completions for each prompt.
# Generate 3 different responses
response = engine.generate(
    prompt="Write a creative opening sentence",
    sampling_params={
        "n": 3,
        "temperature": 0.9,
        "max_new_tokens": 50
    }
)

for i, text in enumerate(response["text"]):
    print(f"Option {i+1}: {text}")

logit_bias

dict | default: null
Modify the likelihood of specific tokens. Keys are token IDs, values are bias adjustments. Range: Typically [-100, 100]
# Discourage specific tokens
tokenizer = engine.tokenizer_manager.tokenizer
token_id = tokenizer.encode("bad", add_special_tokens=False)[0]  # skip BOS

engine.generate(
    prompt="Write a review",
    sampling_params={
        "logit_bias": {str(token_id): -10.0},
        "max_new_tokens": 100
    }
)
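The effect of logit_bias is a simple additive adjustment to the raw logits before sampling. A sketch:

```python
def apply_logit_bias(logits, logit_bias):
    """Add per-token bias values (keyed by token id, as strings) to the
    raw logits before sampling. Large negative bias effectively bans a
    token; large positive bias effectively forces it."""
    out = list(logits)
    for token_id, bias in logit_bias.items():
        out[int(token_id)] += bias
    return out

print(apply_logit_bias([1.0, 2.0, 3.0], {"1": -10.0}))  # [1.0, -8.0, 3.0]
```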

sampling_seed

int | default: null
Random seed for reproducible sampling. Set this for deterministic outputs.
# Reproducible generation
for i in range(3):
    response = engine.generate(
        prompt="Generate a random number",
        sampling_params={
            "sampling_seed": 42,
            "temperature": 1.0,
            "max_new_tokens": 10
        }
    )
    print(response["text"])  # Same output each time

skip_special_tokens

bool | default: true
Remove special tokens (BOS, EOS, PAD) from decoded output.
# Include special tokens in output
engine.generate(
    prompt="Hello",
    sampling_params={
        "skip_special_tokens": False,
        "max_new_tokens": 20
    }
)

spaces_between_special_tokens

bool | default: true
Add spaces between special tokens when decoding.

Parameter Combinations

Creative Writing

sampling_params = {
    "max_new_tokens": 500,
    "temperature": 0.9,
    "top_p": 0.95,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.3
}

Code Generation

sampling_params = {
    "max_new_tokens": 512,
    "temperature": 0.2,
    "top_p": 0.95,
    "stop": ["\n\n", "```"],
    "repetition_penalty": 1.1
}

Factual Q&A

sampling_params = {
    "max_new_tokens": 150,
    "temperature": 0.0,  # Greedy
    "top_k": 1
}

JSON Generation

sampling_params = {
    "max_new_tokens": 300,
    "temperature": 0.3,
    "json_schema": json.dumps(schema)
}

Diverse Brainstorming

sampling_params = {
    "max_new_tokens": 200,
    "temperature": 1.2,
    "top_p": 0.98,
    "presence_penalty": 0.8,
    "n": 5  # Generate 5 ideas
}

Parameter Validation

SGLang validates parameters and raises errors for invalid values:
try:
    engine.generate(
        prompt="test",
        sampling_params={"temperature": -1.0}  # Invalid
    )
except ValueError as e:
    print(f"Invalid parameter: {e}")
Validation Rules:
  • temperature >= 0.0
  • 0.0 < top_p <= 1.0
  • 0.0 <= min_p <= 1.0
  • top_k >= 1 or top_k == -1
  • -2.0 <= frequency_penalty <= 2.0
  • -2.0 <= presence_penalty <= 2.0
  • 0.0 <= repetition_penalty <= 2.0
  • 0 <= min_new_tokens <= max_new_tokens
  • Only one of json_schema, regex, ebnf can be set
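The rules above can be expressed as a standalone checker. This is an illustrative sketch of the listed constraints, not SGLang's actual validation code:

```python
def validate_sampling_params(p):
    """Raise ValueError on the first violated rule from the list above."""
    if p.get("temperature", 1.0) < 0.0:
        raise ValueError("temperature must be >= 0.0")
    if not (0.0 < p.get("top_p", 1.0) <= 1.0):
        raise ValueError("top_p must be in (0.0, 1.0]")
    if not (0.0 <= p.get("min_p", 0.0) <= 1.0):
        raise ValueError("min_p must be in [0.0, 1.0]")
    top_k = p.get("top_k", -1)
    if top_k != -1 and top_k < 1:
        raise ValueError("top_k must be >= 1, or -1 to disable")
    for name in ("frequency_penalty", "presence_penalty"):
        if not (-2.0 <= p.get(name, 0.0) <= 2.0):
            raise ValueError(f"{name} must be in [-2.0, 2.0]")
    if not (0.0 <= p.get("repetition_penalty", 1.0) <= 2.0):
        raise ValueError("repetition_penalty must be in [0.0, 2.0]")
    if not (0 <= p.get("min_new_tokens", 0) <= p.get("max_new_tokens", 128)):
        raise ValueError("need 0 <= min_new_tokens <= max_new_tokens")
    if sum(k in p for k in ("json_schema", "regex", "ebnf")) > 1:
        raise ValueError("only one of json_schema, regex, ebnf may be set")

validate_sampling_params({"temperature": 0.7, "top_p": 0.9})  # passes
```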

See Also