
Overview

Text generation in ONNX Runtime GenAI is controlled by the Search and Generator classes, which implement various decoding strategies for selecting tokens. The library supports both deterministic and stochastic generation methods.

The Generation Loop

The generation process follows this pattern (from src/generators.h:99):
  1. Initialize Generator: Create a Generator with a Model and GeneratorParams.
  2. Append Input Tokens: Feed the prompt tokens using AppendTokens() or AppendTokenSequences().
  3. Generate Tokens: Call GenerateNextToken() in a loop until IsDone() returns true. Each iteration:
     1. Runs model inference via State::Run()
     2. Retrieves logits
     3. Applies logit processors (penalties, constraints)
     4. Selects next token(s) via search strategy
     5. Updates KV cache
     6. Checks termination conditions
  4. Retrieve Results: Extract generated sequences using GetSequence().

Example

import onnxruntime_genai as og

model = og.Model('model_path')
tokenizer = og.Tokenizer(model)

# Configure generation
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    top_k=50,
    top_p=0.9,
    temperature=0.7
)

# Encode and generate
input_tokens = tokenizer.encode("Hello, world!")
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()

output = tokenizer.decode(generator.get_sequence(0))
print(output)

Search Strategies

ONNX Runtime GenAI implements two primary search strategies (from src/search.h).

Greedy Search

Greedy search selects the token with the highest probability at each step. It is fast and deterministic but may not produce the most diverse or creative outputs.

When to use:
  • Factual question answering
  • Code generation
  • Translation tasks where determinism is preferred
Configuration:
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    num_beams=1  # Greedy search (default)
)
Implementation (from src/search.h:68):
struct GreedySearch_Cpu : Search_Cpu {
  void SelectTop() override;
  void SampleTopK(int k, float temperature) override;
  void SampleTopP(float p, float temperature) override;
  void SampleTopKTopP(int k, float p, float temperature) override;
};
Beam Search

Beam search maintains multiple hypotheses (beams) and explores different token sequences in parallel. It finds higher-quality sequences but is more computationally expensive.

When to use:
  • Translation tasks
  • Summarization
  • Tasks requiring high-quality, coherent outputs
Configuration:
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    num_beams=4,            # Number of beams
    num_return_sequences=1, # How many sequences to return
    length_penalty=1.0,     # Length normalization
    early_stopping=True     # Stop when num_beams sequences are done
)
Implementation (from src/search.h:99):
struct BeamSearch_Cpu : Search_Cpu {
  void SelectTop() override;
  std::unique_ptr<BeamSearchScorer> beam_scorer_;
};
The BeamSearchScorer (from src/beam_search_scorer.h) manages:
  • Beam hypothesis tracking
  • Score normalization with length penalty
  • Early stopping logic
  • Final sequence selection

Search Parameters

All search parameters are defined in src/config.h:293:

Core Parameters

max_length (int, required)
Maximum length of the generated sequence, including input tokens. Defaults to model.context_length if not set.

min_length (int, default: 0)
Minimum length before the EOS token is allowed. Useful for preventing premature termination.

batch_size (int, default: 1)
Number of independent sequences to generate in parallel.

num_beams (int, default: 1)
Number of beams for beam search. Set to 1 for greedy search.

num_return_sequences (int, default: 1)
Number of sequences to return from beam search. Must be ≤ num_beams.

Sampling Parameters

do_sample (bool, default: false)
Enable randomized sampling. When false, greedy/beam search is deterministic.

top_k (int, default: 50)
Number of highest-probability tokens to keep for top-k filtering. Set to 0 to disable.

top_p (float, default: 0.0)
Cumulative probability threshold for nucleus sampling. Only the highest-probability tokens whose cumulative probability is ≤ top_p are kept. Range: (0, 1]. Set to 0 to disable.

temperature (float, default: 1.0)
Controls randomness in sampling. Lower values make output more deterministic; higher values increase diversity.
  • 0.1-0.5: More focused and deterministic
  • 0.7-0.9: Balanced creativity and coherence
  • 1.0+: More random and creative

random_seed (int, default: -1)
Random seed for sampling. Set to -1 for non-deterministic seeding.

Penalty Parameters

repetition_penalty (float, default: 1.0)
Penalty for token repetition. Values > 1.0 discourage repetition; values < 1.0 encourage it. Typical range: 1.0-1.5.

no_repeat_ngram_size (int, default: 0)
Prevent repeating n-grams of this size. Currently unused in the implementation.

Beam Search Parameters

length_penalty (float, default: 1.0)
Exponential penalty applied to sequence length in beam search.
  • > 1.0: Favors longer sequences
  • < 1.0: Favors shorter sequences
  • = 1.0: No length penalty

early_stopping (bool, default: true)
Stop beam search when num_beams complete sequences have been found.

diversity_penalty (float, default: 0.0)
Penalty to encourage diverse beams. Currently unused in the implementation.
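
To see what length_penalty does to beam scores, here is a minimal sketch of the common length-normalization formula (accumulated log-probability divided by length raised to the penalty). This is illustrative only; the library's BeamSearchScorer may differ in detail:

```python
def normalized_beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: total log-probability divided by
    length ** length_penalty. A sketch of the standard formulation."""
    return sum_logprobs / (length ** length_penalty)

# A short hypothesis vs. a longer one with a lower total log-probability.
shorter = normalized_beam_score(-4.0, length=4, length_penalty=1.0)   # -1.0 per token
longer = normalized_beam_score(-7.0, length=10, length_penalty=1.0)   # -0.7 per token
# The longer hypothesis wins after normalization. Raising length_penalty
# above 1.0 grows the denominator faster, favoring longer sequences further.
```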

Sampling Methods

When do_sample=true, tokens are selected probabilistically rather than deterministically.

Top-K Sampling

Selects from the K most likely tokens (from src/search.cpp:173):
void GreedySearch_Cpu::SampleTopK(int k, float temperature) {
  // 1. Find top K token scores
  std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                    [scores](int i, int j) { return scores[i] > scores[j]; });
  
  // 2. Apply temperature and softmax
  Softmax(top_k_scores, temperature);
  
  // 3. Sample from the distribution
  std::discrete_distribution<> dis(top_k_scores.begin(), top_k_scores.end());
  int32_t token = indices[dis(gen_)];
}
Example:
params.set_search_options(
    do_sample=True,
    top_k=50,
    temperature=0.8
)
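
The C++ snippet above can be mirrored in a small self-contained Python sketch (not part of the library API) to make the mechanics concrete: keep the k highest scores, renormalize with temperature, then draw from the resulting distribution:

```python
import math
import random

def sample_top_k(scores, k, temperature, rng=None):
    """Keep the k highest-scoring tokens, apply temperature softmax, sample one."""
    rng = rng or random.Random(0)
    # Equivalent of std::partial_sort: indices of the k largest scores
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    scaled = [scores[i] / temperature for i in indices]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    # random.choices plays the role of std::discrete_distribution
    return rng.choices(indices, weights=weights, k=1)[0]

scores = [0.1, 3.0, 0.2, 2.5, -1.0]
token = sample_top_k(scores, k=2, temperature=0.8)
# With k=2, only tokens 1 and 3 (the two highest scores) can ever be drawn.
```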

Top-P (Nucleus) Sampling

Selects from the smallest set of top-ranked tokens whose cumulative probability reaches the threshold P (from src/search.cpp:195):
void GreedySearch_Cpu::SampleTopP(float p, float temperature) {
  // 1. Sort scores in descending order
  std::sort(indices.begin(), indices.end(),
            [scores](int i, int j) { return scores[i] > scores[j]; });
  
  // 2. Apply temperature and compute cumulative probability
  Softmax(sorted_scores, temperature);
  float cumulative_prob = 0.0f;
  
  // 3. Find cutoff where cumulative probability exceeds p
  for (size_t i = 0; i < sorted_scores.size(); i++) {
    cumulative_prob += sorted_scores[i];
    if (cumulative_prob >= p) {
      cutoff = i + 1;
      break;
    }
  }
  
  // 4. Sample from the nucleus
  std::discrete_distribution<> dis(top_p_scores.begin(), top_p_scores.end());
}
Example:
params.set_search_options(
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
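
As with top-k, the nucleus cutoff can be sketched in plain Python (illustrative, not the library API): sort descending, accumulate probability mass until it reaches p, and sample only from that prefix:

```python
import math
import random

def sample_top_p(scores, p, temperature, rng=None):
    """Keep the smallest top-ranked prefix whose probability mass reaches p, sample one."""
    rng = rng or random.Random(0)
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    scaled = [scores[i] / temperature for i in indices]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Find the cutoff where cumulative probability reaches p
    cutoff, cumulative = len(indices), 0.0
    for i, prob in enumerate(probs):
        cumulative += prob
        if cumulative >= p:
            cutoff = i + 1
            break
    nucleus = indices[:cutoff]
    return rng.choices(nucleus, weights=probs[:cutoff], k=1)[0]

scores = [5.0, 4.0, -2.0, -3.0]
token = sample_top_p(scores, p=0.9, temperature=1.0)
# Tokens 2 and 3 carry negligible probability mass and fall outside the nucleus.
```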

Top-K + Top-P Sampling

Combines both strategies: first applies top-k, then top-p within those k tokens (from src/search.cpp:227):
params.set_search_options(
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.8
)

Temperature Scaling

Temperature is applied before softmax to control randomness:
// From src/softmax.h
void Softmax(std::span<float> values, float temperature) {
  // Scale by temperature
  for (auto& v : values)
    v /= temperature;
  
  // Apply softmax
  float max_value = *std::max_element(values.begin(), values.end());
  float sum = 0.0f;
  for (auto& v : values) {
    v = std::exp(v - max_value);
    sum += v;
  }
  for (auto& v : values)
    v /= sum;
}
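
The same computation, sketched in Python to make the effect of temperature visible (illustrative only): lowering the temperature concentrates probability mass on the top token, raising it flattens the distribution:

```python
import math

def softmax_with_temperature(values, temperature):
    """Scale logits by temperature, then apply a numerically stable softmax."""
    scaled = [v / temperature for v in values]
    max_value = max(scaled)  # subtract max to avoid overflow in exp
    exps = [math.exp(v - max_value) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
# The highest-logit token gets more mass in `sharp` than in `flat`.
```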

Logits Processing

Before token selection, logits are processed to enforce constraints and apply penalties.

Repetition Penalty

Penalizes tokens that already appear in the sequence (from src/search.cpp):
void Search_Cpu::ApplyRepetitionPenalty(float penalty) {
  auto next_token_scores = next_token_scores_.CpuSpan();
  
  for (size_t batch_id = 0; batch_id < batch_size; batch_id++) {
    auto sequence = sequences_.GetSequence(batch_id);
    auto scores = next_token_scores.subspan(batch_id * vocab_size, vocab_size);
    
    // Apply penalty to tokens in sequence
    for (int32_t token : sequence) {
      if (scores[token] < 0)
        scores[token] *= penalty;  // Increase negative scores
      else
        scores[token] /= penalty;  // Decrease positive scores
    }
  }
}
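
The penalty rule above (divide positive scores, multiply negative ones) can be sketched in Python to show its effect on individual tokens; this is an illustration, not the library's code:

```python
def apply_repetition_penalty(scores, sequence, penalty):
    """Make every token already present in `sequence` less likely:
    positive scores shrink, negative scores become more negative."""
    out = list(scores)
    for token in set(sequence):
        if out[token] < 0:
            out[token] *= penalty  # more negative
        else:
            out[token] /= penalty  # less positive
    return out

scores = [2.0, -1.0, 0.5]
penalized = apply_repetition_penalty(scores, sequence=[0, 1], penalty=1.3)
# Token 0: 2.0 / 1.3; token 1: -1.0 * 1.3; token 2 is untouched.
```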

Minimum Length

Suppresses EOS token until minimum length is reached:
void Search_Cpu::ApplyMinLength(int min_length) {
  if (sequences_.GetSequenceLength() < min_length) {
    auto next_token_scores = next_token_scores_.CpuSpan();
    
    // Set EOS token scores to -infinity
    for (size_t batch_id = 0; batch_id < batch_size; batch_id++) {
      auto scores = next_token_scores.subspan(batch_id * vocab_size, vocab_size);
      for (int32_t eos_token : eos_token_ids) {
        scores[eos_token] = std::numeric_limits<float>::lowest();
      }
    }
  }
}
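
A Python sketch of the same masking (illustrative only): while the sequence is shorter than min_length, the EOS scores are forced to negative infinity so the sampler can never pick them:

```python
def apply_min_length(scores, sequence_length, min_length, eos_token_ids):
    """Suppress EOS tokens until the sequence reaches min_length."""
    out = list(scores)
    if sequence_length < min_length:
        for eos in eos_token_ids:
            out[eos] = float('-inf')
    return out

scores = [1.0, 2.0, 3.0]
masked = apply_min_length(scores, sequence_length=3, min_length=10, eos_token_ids=[2])
# EOS (token 2) is suppressed because the sequence is still too short.
```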

Constrained Decoding

The ConstrainedLogitsProcessor (from src/constrained_logits_processor.h) enables grammar-based generation for structured outputs like JSON:
params.set_guidance(
    type="json_schema",
    data=json_schema_string,
    enable_ff_tokens=False  # Fast-forward tokens
)
See the Constrained Decoding guide for details.

Termination Conditions

Generation stops when (from src/generators.h:102):
  1. EOS Token: An end-of-sequence token is generated (for greedy search)
  2. Max Length: The sequence reaches max_length
  3. Beam Search Done: All beams have completed (with early_stopping=true)
  4. Manual Termination: User interrupts generation
while not generator.is_done():
    generator.generate_next_token()
    
    # Can check individual conditions
    if generator.token_count() >= max_tokens:
        break

Streaming Generation

For real-time output, use TokenizerStream to decode tokens incrementally:
model = og.Model('model_path')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    
    # Decode and print token immediately
    chunk = stream.decode(new_token)
    print(chunk, end='', flush=True)
The stream handles multi-byte UTF-8 characters correctly, buffering partial characters until complete.
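
The stream's buffering behavior can be demonstrated with Python's own incremental UTF-8 decoder (the library handles this internally; this sketch just shows why buffering is needed when a token boundary splits a multi-byte character):

```python
import codecs

# '€' is three bytes in UTF-8; a token boundary can split it mid-character.
data = '€'.encode('utf-8')
decoder = codecs.getincrementaldecoder('utf-8')()

chunk1 = decoder.decode(data[:2])  # partial character: nothing emitted yet
chunk2 = decoder.decode(data[2:])  # final byte completes the character
# chunk1 is '' and chunk2 is '€'
```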

Batched Generation

Generate multiple independent sequences in parallel:
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    batch_size=4  # Generate 4 sequences
)

# Encode multiple prompts
prompts = ["Hello", "Goodbye", "Question:", "Answer:"]
input_tokens = [tokenizer.encode(p) for p in prompts]

# Pad all prompts to the same length (pad_token_id comes from the model's configuration)
max_len = max(len(t) for t in input_tokens)
for tokens in input_tokens:
    tokens.extend([pad_token_id] * (max_len - len(tokens)))

generator = og.Generator(model, params)
generator.append_tokens([t for seq in input_tokens for t in seq])  # flatten into one list

while not generator.is_done():
    generator.generate_next_token()

# Get all sequences
for i in range(4):
    output = tokenizer.decode(generator.get_sequence(i))
    print(f"Sequence {i}: {output}")

Advanced Features

Continuous Decoding

Rewind generation to a previous state and continue from there:
# Generate some tokens
for _ in range(20):
    generator.generate_next_token()

initial_length = generator.token_count()

# Generate branch A
for _ in range(10):
    generator.generate_next_token()
branch_a = generator.get_sequence(0)

# Rewind and generate branch B
generator.rewind_to(initial_length)
for _ in range(10):
    generator.generate_next_token()
branch_b = generator.get_sequence(0)
This is useful for:
  • Speculative decoding
  • Tree-based search
  • Alternative hypothesis exploration

Custom Logits

Manipulate logits directly:
# Run one generation step
generator.generate_next_token()

# Get logits
logits = generator.get_logits()

# Modify logits (e.g., ban certain tokens)
logits[banned_token_id] = float('-inf')

# Set modified logits
generator.set_logits(logits)

Performance Considerations

Search Strategy Performance

  • Greedy: Fastest, ~1x baseline
  • Beam Search (4 beams): ~3-4x slower than greedy
  • Sampling: Similar to greedy, small overhead for RNG

Optimization Tips

  1. Use greedy search for latency-critical applications
  2. Enable past_present_share_buffer for CUDA with greedy search
  3. Limit max_length to avoid unnecessary computation
  4. Use batching to amortize overhead across multiple sequences
  5. Adjust temperature instead of using extreme top_k/top_p values

Next Steps

KV Cache

Learn how KV cache improves generation performance

Constrained Decoding

Generate structured outputs with grammar constraints

API Reference

Explore the complete API

Examples

See generation in action
