Overview

GenerationConfig controls how the model generates text, including sampling strategies, length constraints, and decoding methods. These parameters can be set globally on the model or passed per generation request.

Loading Configuration

from transformers.generation import GenerationConfig

# Load from checkpoint
config = GenerationConfig.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Assign to model
model.generation_config = config

Core Parameters

Temperature

temperature
float
default:"1.0"
Controls randomness in generation:
  • 0.0 - 0.01: Nearly deterministic (use top_k=1 instead)
  • 0.7 - 0.9: Balanced creativity and coherence
  • 1.0+: More random and creative
Note: Qwen recommends tuning top_p instead of temperature.
# More creative responses
response, _ = model.chat(
    tokenizer,
    "Write a story",
    temperature=0.9
)
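Under the hood, temperature simply divides the logits before the softmax: values below 1.0 sharpen the distribution toward the most likely token, while values above 1.0 flatten it. A minimal plain-Python sketch of the scaling (illustrative only, not the transformers implementation):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max before exp for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Low temperature concentrates probability on the top token;
# high temperature spreads it out
probs_sharp = softmax_with_temperature([2.0, 1.0, 0.5], temperature=0.5)
probs_flat = softmax_with_temperature([2.0, 1.0, 0.5], temperature=2.0)
```

As temperature approaches 0 the distribution collapses onto the argmax, which is why very low values behave like greedy decoding.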

Top-p (Nucleus Sampling)

top_p
float
default:"0.95"
Nucleus sampling probability threshold. The model samples only from the smallest set of tokens whose cumulative probability reaches top_p.
  • 0.1: Very conservative, deterministic
  • 0.7 - 0.9: Balanced (recommended)
  • 0.95 - 1.0: More diverse outputs
response, _ = model.chat(
    tokenizer,
    "Explain quantum physics",
    top_p=0.85
)
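The selection rule can be sketched in plain Python, assuming an already-normalized probability list (the real implementation in transformers operates on logits, but the idea is the same):

```python
def nucleus_filter(probs, top_p):
    """Return the smallest set of token indices (by descending probability)
    whose cumulative probability reaches top_p; sampling then happens
    only within this set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

# With top_p=0.7 only the two most likely tokens survive
print(nucleus_filter([0.5, 0.3, 0.15, 0.05], 0.7))  # [0, 1]
```

Because the cutoff adapts to the shape of the distribution, top_p keeps many candidates when the model is uncertain and few when it is confident, which is why Qwen recommends tuning it over temperature.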

Top-k Sampling

top_k
int
default:"None"
Limits sampling to the k most likely tokens:
  • 1: Greedy decoding (deterministic)
  • 10-50: Conservative sampling
  • 50-100: More diverse sampling
Setting top_k=1 is equivalent to greedy decoding.
# Greedy decoding for factual responses
response, _ = model.chat(
    tokenizer,
    "What is 2+2?",
    top_k=1
)

Length Control

Max Length

max_length
int
default:"8192"
Maximum total sequence length (input + output tokens). Generation stops when this limit is reached.

Max New Tokens

max_new_tokens
int
default:"None"
Maximum number of tokens to generate (excluding input). Takes precedence over max_length.
# Limit response to 100 tokens
response, _ = model.chat(
    tokenizer,
    "Summarize quantum mechanics",
    max_new_tokens=100
)

Min Length

min_length
int
default:"0"
Minimum total sequence length (input + output tokens). The model will not generate an EOS token before reaching this length.

Min New Tokens

min_new_tokens
int
default:"None"
Minimum number of new tokens to generate (excluding input tokens).
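Both minimum-length settings work the same way internally: the EOS token's logit is forced to negative infinity until the threshold is reached, so it can never be selected. A rough sketch (hypothetical helper, not the transformers API):

```python
def suppress_eos(logits, eos_token_id, generated_len, min_new_tokens):
    """Mask the EOS logit until at least min_new_tokens have been generated."""
    if generated_len < min_new_tokens:
        logits = list(logits)  # copy so the caller's logits are untouched
        logits[eos_token_id] = float("-inf")
    return logits

logits = [1.0, 2.0, 3.0]  # pretend token id 2 is EOS
early = suppress_eos(logits, eos_token_id=2, generated_len=5, min_new_tokens=10)
done = suppress_eos(logits, eos_token_id=2, generated_len=10, min_new_tokens=10)
```

Before the threshold the EOS probability is exactly zero after softmax; once the threshold is reached the logit passes through unchanged.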

Stopping Criteria

Stop Strings

Stop generation when specific sequences are encountered:
stop_words = ["Observation:", "Final Answer:"]
stop_words_ids = [tokenizer.encode(s) for s in stop_words]

response, _ = model.chat(
    tokenizer,
    query="Solve this problem",
    stop_words_ids=stop_words_ids
)
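Conceptually, stop words work by scanning the generated token ids for any stop sequence and truncating at the first match. A simplified sketch of the idea (an illustrative helper, not Qwen's actual implementation, which stops during decoding rather than after it):

```python
def truncate_at_stop(generated_ids, stop_words_ids):
    """Cut generated_ids at the first occurrence of any stop sequence."""
    for end in range(len(generated_ids) + 1):
        for stop in stop_words_ids:
            n = len(stop)
            if end >= n and generated_ids[end - n:end] == stop:
                return generated_ids[:end - n]  # drop the stop sequence itself
    return generated_ids

# Token ids [7, 8] act as the stop sequence here
print(truncate_at_stop([5, 6, 7, 8, 9], [[7, 8]]))  # [5, 6]
```

Because stop words are matched on token ids rather than text, the same string can tokenize differently in different contexts; encoding each stop word with the same tokenizer used for generation, as shown above, keeps the ids consistent.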

EOS Token

eos_token_id
int | list[int]
default:"None"
Token ID(s) that trigger end of generation. For Qwen:
  • 151643: Default EOS token ID
  • <|im_end|>: ChatML format end token

Pad Token

pad_token_id
int
default:"None"
Token ID used for padding sequences in batched generation. For Qwen, typically set to tokenizer.eod_id.

Repetition Control

Repetition Penalty

repetition_penalty
float
default:"1.0"
Penalty for repeating tokens:
  • 1.0: No penalty
  • 1.1 - 1.3: Mild discouragement of repetition
  • > 1.5: Strong penalty (may harm coherence)
response, _ = model.chat(
    tokenizer,
    "List items",
    repetition_penalty=1.2
)
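The penalty follows the CTRL-style formulation used by transformers: the logit of every token already present in the sequence is divided by the penalty when positive and multiplied by it when negative, making repeated tokens less likely either way. A plain-Python sketch:

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Discourage tokens that already appeared in the sequence."""
    logits = list(logits)
    for t in set(seen_token_ids):
        if logits[t] > 0:
            logits[t] /= penalty   # positive logits shrink toward 0
        else:
            logits[t] *= penalty   # negative logits become more negative
    return logits

# Token ids 0 and 1 already appeared; token 2 is untouched
out = apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1], penalty=1.2)
```

Note the asymmetry: because the penalty is multiplicative on logits rather than probabilities, large values distort the distribution quickly, which is why settings above roughly 1.5 tend to harm coherence.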

No Repeat N-gram Size

no_repeat_ngram_size
int
default:"0"
Prevent any n-gram of this size from occurring more than once in the generated sequence. Set to 0 to disable.
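The constraint is enforced by banning, at each decoding step, every token that would complete an n-gram already present in the sequence. A minimal sketch for a single sequence:

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Tokens that would repeat an existing n-gram if generated next."""
    if ngram_size == 0 or len(generated_ids) < ngram_size:
        return set()
    # The last (n-1) tokens form the prefix a repeated n-gram would extend
    prefix = tuple(generated_ids[-(ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for i in range(len(generated_ids) - ngram_size + 1):
        gram = tuple(generated_ids[i:i + ngram_size])
        if gram[:-1] == prefix:
            banned.add(gram[-1])
    return banned

# The bigram [4, 5] already occurred and the sequence currently ends in 4,
# so generating 5 next would repeat it
print(banned_next_tokens([4, 5, 6, 4], ngram_size=2))  # {5}
```

This is a hard constraint rather than a soft penalty, so small values (2 or 3) can block legitimate repetition such as proper names; repetition_penalty is usually the gentler first choice.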

Beam Search

Num Beams

num_beams
int
default:"1"
Number of beams for beam search:
  • 1: No beam search (faster)
  • 4-10: Beam search (slower but potentially higher quality)
response, _ = model.chat(
    tokenizer,
    "Translate to French: Hello world",
    num_beams=5
)

Advanced Parameters

Do Sample

do_sample
bool
default:"True"
Whether to use sampling (True) or greedy/beam search (False). Sampling parameters such as temperature, top_p, and top_k only take effect when do_sample=True.

Early Stopping

early_stopping
bool
default:"False"
Stop beam search as soon as num_beams complete candidates (ending in EOS) are found, instead of continuing to search for better ones.

Use Cache

use_cache
bool
default:"True"
Use KV cache for faster generation. Should be True for inference.

Complete Configuration Example

from transformers.generation import GenerationConfig

config = GenerationConfig(
    # Sampling
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    
    # Length
    max_new_tokens=512,
    min_new_tokens=10,
    
    # Repetition
    repetition_penalty=1.1,
    no_repeat_ngram_size=3,
    
    # Special tokens
    eos_token_id=151643,
    pad_token_id=151643,
    
    # Performance
    use_cache=True
)

model.generation_config = config

Modifying at Runtime

# View current config
print(model.generation_config)

# Modify specific parameters
model.generation_config.temperature = 0.7
model.generation_config.max_new_tokens = 256

# Or pass directly to chat()
response, _ = model.chat(
    tokenizer,
    "Your query",
    temperature=0.7,
    max_new_tokens=256
)

Preset Configurations

Factual/Deterministic

config = GenerationConfig(
    top_k=1,  # Greedy decoding
    max_new_tokens=256
)

Balanced

config = GenerationConfig(
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=512,
    repetition_penalty=1.1
)

Creative

config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=1024,
    repetition_penalty=1.05
)
