
Overview

Text generation in ONNX Runtime GenAI is controlled by the Search and Generator classes, which implement various decoding strategies for selecting tokens. The library supports both deterministic and stochastic generation methods.

The Generation Loop

The generation process follows this pattern (from src/generators.h:99):
  1. Initialize Generator: Create a Generator with a Model and GeneratorParams.
  2. Append Input Tokens: Feed the prompt tokens using AppendTokens() or AppendTokenSequences().
  3. Generate Tokens: Call GenerateNextToken() in a loop until IsDone() returns true. Each iteration:
     1. Runs model inference via State::Run()
     2. Retrieves logits
     3. Applies logit processors (penalties, constraints)
     4. Selects next token(s) via search strategy
     5. Updates KV cache
     6. Checks termination conditions
  4. Retrieve Results: Extract generated sequences using GetSequence().

Example

import onnxruntime_genai as og

model = og.Model('model_path')
tokenizer = og.Tokenizer(model)

# Configure generation
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    top_k=50,
    top_p=0.9,
    temperature=0.7
)

# Encode and generate
input_tokens = tokenizer.encode("Hello, world!")
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()

output = tokenizer.decode(generator.get_sequence(0))
print(output)

Search Strategies

ONNX Runtime GenAI implements two primary search strategies (from src/search.h).

Greedy Search

Greedy search selects the token with the highest probability at each step. It is fast and deterministic but may not produce the most diverse or creative outputs.

When to use:
  • Factual question answering
  • Code generation
  • Translation tasks where determinism is preferred
Configuration:
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    num_beams=1  # Greedy search (default)
)
Implementation (from src/search.h:68):
struct GreedySearch_Cpu : Search_Cpu {
  void SelectTop() override;
  void SampleTopK(int k, float temperature) override;
  void SampleTopP(float p, float temperature) override;
  void SampleTopKTopP(int k, float p, float temperature) override;
};
Beam Search

Beam search maintains multiple hypotheses (beams) and explores different token sequences in parallel. It finds higher-quality sequences but is more computationally expensive.

When to use:
  • Translation tasks
  • Summarization
  • Tasks requiring high-quality, coherent outputs
Configuration:
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    num_beams=4,            # Number of beams
    num_return_sequences=1, # How many sequences to return
    length_penalty=1.0,     # Length normalization
    early_stopping=True     # Stop when num_beams sequences are done
)
Implementation (from src/search.h:99):
struct BeamSearch_Cpu : Search_Cpu {
  void SelectTop() override;
  std::unique_ptr<BeamSearchScorer> beam_scorer_;
};
The BeamSearchScorer (from src/beam_search_scorer.h) manages:
  • Beam hypothesis tracking
  • Score normalization with length penalty
  • Early stopping logic
  • Final sequence selection

Search Parameters

All search parameters are defined in src/config.h:293:

Core Parameters

max_length (int, required)
Maximum length of the generated sequence, including input tokens. Defaults to model.context_length if not set.

min_length (int, default: 0)
Minimum length before the EOS token is allowed. Useful for preventing premature termination.

batch_size (int, default: 1)
Number of independent sequences to generate in parallel.

num_beams (int, default: 1)
Number of beams for beam search. Set to 1 for greedy search.

num_return_sequences (int, default: 1)
Number of sequences to return from beam search. Must be ≤ num_beams.

Sampling Parameters

do_sample (bool, default: false)
Enable randomized sampling. When false, greedy/beam search is deterministic.

top_k (int, default: 50)
Number of highest-probability tokens to keep for top-k filtering. Set to 0 to disable.

top_p (float, default: 0.0)
Cumulative probability threshold for nucleus sampling. Only the highest-probability tokens whose cumulative probability is ≤ top_p are kept. Range: (0, 1]. Set to 0 to disable.

temperature (float, default: 1.0)
Controls randomness in sampling. Lower values make output more deterministic; higher values increase diversity.
  • 0.1-0.5: More focused and deterministic
  • 0.7-0.9: Balanced creativity and coherence
  • 1.0+: More random and creative

random_seed (int, default: -1)
Random seed for sampling. Set to -1 for non-deterministic seeding.

Penalty Parameters

repetition_penalty (float, default: 1.0)
Penalty for token repetition. Values > 1.0 discourage repetition; values < 1.0 encourage it. Typical range: 1.0-1.5.

no_repeat_ngram_size (int, default: 0)
Prevent repeating n-grams of this size. Currently unused in the implementation.

Beam Search Parameters

length_penalty (float, default: 1.0)
Exponential penalty applied to sequence length in beam search.
  • > 1.0: Favors longer sequences
  • < 1.0: Favors shorter sequences
  • = 1.0: No length penalty

early_stopping (bool, default: true)
Stop beam search when num_beams complete sequences have been found.

diversity_penalty (float, default: 0.0)
Penalty to encourage diverse beams. Currently unused in the implementation.
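
To see what length_penalty does to beam scores, here is a minimal sketch of the common length-normalization formula (accumulated log-probability divided by length raised to the penalty). This is illustrative only; the library's BeamSearchScorer may differ in detail:

```python
def normalized_beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: total log-probability divided by
    length ** length_penalty. A sketch of the standard formulation."""
    return sum_logprobs / (length ** length_penalty)

# A short hypothesis vs. a longer one with a lower total log-probability.
shorter = normalized_beam_score(-4.0, length=4, length_penalty=1.0)   # -1.0 per token
longer = normalized_beam_score(-7.0, length=10, length_penalty=1.0)   # -0.7 per token
# The longer hypothesis wins after normalization. Raising length_penalty
# above 1.0 grows the denominator faster, favoring longer sequences further.
```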

Sampling Methods

When do_sample=true, tokens are selected probabilistically rather than deterministically.

Top-K Sampling

Selects from the K most likely tokens (from src/search.cpp:173):
void GreedySearch_Cpu::SampleTopK(int k, float temperature) {
  // 1. Find top K token scores
  std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                    [scores](int i, int j) { return scores[i] > scores[j]; });
  
  // 2. Apply temperature and softmax
  Softmax(top_k_scores, temperature);
  
  // 3. Sample from the distribution
  std::discrete_distribution<> dis(top_k_scores.begin(), top_k_scores.end());
  int32_t token = indices[dis(gen_)];
}
Example:
params.set_search_options(
    do_sample=True,
    top_k=50,
    temperature=0.8
)
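
The C++ snippet above can be mirrored in a small self-contained Python sketch (not part of the library API) to make the mechanics concrete: keep the k highest scores, renormalize with temperature, then draw from the resulting distribution:

```python
import math
import random

def sample_top_k(scores, k, temperature, rng=None):
    """Keep the k highest-scoring tokens, apply temperature softmax, sample one."""
    rng = rng or random.Random(0)
    # Equivalent of std::partial_sort: indices of the k largest scores
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    scaled = [scores[i] / temperature for i in indices]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    # random.choices plays the role of std::discrete_distribution
    return rng.choices(indices, weights=weights, k=1)[0]

scores = [0.1, 3.0, 0.2, 2.5, -1.0]
token = sample_top_k(scores, k=2, temperature=0.8)
# With k=2, only tokens 1 and 3 (the two highest scores) can ever be drawn.
```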

Top-P (Nucleus) Sampling

Selects from the smallest set of top-ranked tokens whose cumulative probability reaches the threshold P (from src/search.cpp:195):
void GreedySearch_Cpu::SampleTopP(float p, float temperature) {
  // 1. Sort scores in descending order
  std::sort(indices.begin(), indices.end(),
            [scores](int i, int j) { return scores[i] > scores[j]; });
  
  // 2. Apply temperature and compute cumulative probability
  Softmax(sorted_scores, temperature);
  float cumulative_prob = 0.0f;
  
  // 3. Find cutoff where cumulative probability exceeds p
  for (size_t i = 0; i < sorted_scores.size(); i++) {
    cumulative_prob += sorted_scores[i];
    if (cumulative_prob >= p) {
      cutoff = i + 1;
      break;
    }
  }
  
  // 4. Sample from the nucleus
  std::discrete_distribution<> dis(top_p_scores.begin(), top_p_scores.end());
}
Example:
params.set_search_options(
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
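
As with top-k, the nucleus cutoff can be sketched in plain Python (illustrative, not the library API): sort descending, accumulate probability mass until it reaches p, and sample only from that prefix:

```python
import math
import random

def sample_top_p(scores, p, temperature, rng=None):
    """Keep the smallest top-ranked prefix whose probability mass reaches p, sample one."""
    rng = rng or random.Random(0)
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    scaled = [scores[i] / temperature for i in indices]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Find the cutoff where cumulative probability reaches p
    cutoff, cumulative = len(indices), 0.0
    for i, prob in enumerate(probs):
        cumulative += prob
        if cumulative >= p:
            cutoff = i + 1
            break
    nucleus = indices[:cutoff]
    return rng.choices(nucleus, weights=probs[:cutoff], k=1)[0]

scores = [5.0, 4.0, -2.0, -3.0]
token = sample_top_p(scores, p=0.9, temperature=1.0)
# Tokens 2 and 3 carry negligible probability mass and fall outside the nucleus.
```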

Top-K + Top-P Sampling

Combines both strategies: first applies top-k, then top-p within those k tokens (from src/search.cpp:227):
params.set_search_options(
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.8
)

Temperature Scaling

Temperature is applied before softmax to control randomness:
// From src/softmax.h
void Softmax(std::span<float> values, float temperature) {
  // Scale by temperature
  for (auto& v : values)
    v /= temperature;
  
  // Apply softmax
  float max_value = *std::max_element(values.begin(), values.end());
  float sum = 0.0f;
  for (auto& v : values) {
    v = std::exp(v - max_value);
    sum += v;
  }
  for (auto& v : values)
    v /= sum;
}
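
The same computation, sketched in Python to make the effect of temperature visible (illustrative only): lowering the temperature concentrates probability mass on the top token, raising it flattens the distribution:

```python
import math

def softmax_with_temperature(values, temperature):
    """Scale logits by temperature, then apply a numerically stable softmax."""
    scaled = [v / temperature for v in values]
    max_value = max(scaled)  # subtract max to avoid overflow in exp
    exps = [math.exp(v - max_value) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
# The highest-logit token gets more mass in `sharp` than in `flat`.
```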

Logits Processing

Before token selection, logits are processed to enforce constraints and apply penalties.

Repetition Penalty

Penalizes tokens that already appear in the sequence (from src/search.cpp):
void Search_Cpu::ApplyRepetitionPenalty(float penalty) {
  auto next_token_scores = next_token_scores_.CpuSpan();
  
  for (size_t batch_id = 0; batch_id < batch_size; batch_id++) {
    auto sequence = sequences_.GetSequence(batch_id);
    auto scores = next_token_scores.subspan(batch_id * vocab_size, vocab_size);
    
    // Apply penalty to tokens in sequence
    for (int32_t token : sequence) {
      if (scores[token] < 0)
        scores[token] *= penalty;  // Increase negative scores
      else
        scores[token] /= penalty;  // Decrease positive scores
    }
  }
}
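
The penalty rule above (divide positive scores, multiply negative ones) can be sketched in Python to show its effect on individual tokens; this is an illustration, not the library's code:

```python
def apply_repetition_penalty(scores, sequence, penalty):
    """Make every token already present in `sequence` less likely:
    positive scores shrink, negative scores become more negative."""
    out = list(scores)
    for token in set(sequence):
        if out[token] < 0:
            out[token] *= penalty  # more negative
        else:
            out[token] /= penalty  # less positive
    return out

scores = [2.0, -1.0, 0.5]
penalized = apply_repetition_penalty(scores, sequence=[0, 1], penalty=1.3)
# Token 0: 2.0 / 1.3; token 1: -1.0 * 1.3; token 2 is untouched.
```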

Minimum Length

Suppresses EOS token until minimum length is reached:
void Search_Cpu::ApplyMinLength(int min_length) {
  if (sequences_.GetSequenceLength() < min_length) {
    auto next_token_scores = next_token_scores_.CpuSpan();
    
    // Set EOS token scores to -infinity
    for (size_t batch_id = 0; batch_id < batch_size; batch_id++) {
      auto scores = next_token_scores.subspan(batch_id * vocab_size, vocab_size);
      for (int32_t eos_token : eos_token_ids) {
        scores[eos_token] = std::numeric_limits<float>::lowest();
      }
    }
  }
}
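
A Python sketch of the same masking (illustrative only): while the sequence is shorter than min_length, the EOS scores are forced to negative infinity so the sampler can never pick them:

```python
def apply_min_length(scores, sequence_length, min_length, eos_token_ids):
    """Suppress EOS tokens until the sequence reaches min_length."""
    out = list(scores)
    if sequence_length < min_length:
        for eos in eos_token_ids:
            out[eos] = float('-inf')
    return out

scores = [1.0, 2.0, 3.0]
masked = apply_min_length(scores, sequence_length=3, min_length=10, eos_token_ids=[2])
# EOS (token 2) is suppressed because the sequence is still too short.
```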

Constrained Decoding

The ConstrainedLogitsProcessor (from src/constrained_logits_processor.h) enables grammar-based generation for structured outputs like JSON:
params.set_guidance(
    type="json_schema",
    data=json_schema_string,
    enable_ff_tokens=False  # Fast-forward tokens
)
See the Constrained Decoding guide for details.

Termination Conditions

Generation stops when (from src/generators.h:102):
  1. EOS Token: An end-of-sequence token is generated (for greedy search)
  2. Max Length: The sequence reaches max_length
  3. Beam Search Done: All beams have completed (with early_stopping=true)
  4. Manual Termination: User interrupts generation
while not generator.is_done():
    generator.generate_next_token()
    
    # Can check individual conditions
    if generator.token_count() >= max_tokens:
        break

Streaming Generation

For real-time output, use TokenizerStream to decode tokens incrementally:
model = og.Model('model_path')
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    
    # Decode and print token immediately
    chunk = stream.decode(new_token)
    print(chunk, end='', flush=True)
The stream handles multi-byte UTF-8 characters correctly, buffering partial characters until complete.
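
The stream's buffering behavior can be demonstrated with Python's own incremental UTF-8 decoder (the library handles this internally; this sketch just shows why buffering is needed when a token boundary splits a multi-byte character):

```python
import codecs

# '€' is three bytes in UTF-8; a token boundary can split it mid-character.
data = '€'.encode('utf-8')
decoder = codecs.getincrementaldecoder('utf-8')()

chunk1 = decoder.decode(data[:2])  # partial character: nothing emitted yet
chunk2 = decoder.decode(data[2:])  # final byte completes the character
# chunk1 is '' and chunk2 is '€'
```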

Batched Generation

Generate multiple independent sequences in parallel:
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=100,
    batch_size=4  # Generate 4 sequences
)

# Encode multiple prompts
prompts = ["Hello", "Goodbye", "Question:", "Answer:"]
input_tokens = [tokenizer.encode(p) for p in prompts]

# Pad all prompts to the same length (pad_token_id comes from the model's configuration)
max_len = max(len(t) for t in input_tokens)
for tokens in input_tokens:
    tokens.extend([pad_token_id] * (max_len - len(tokens)))

generator = og.Generator(model, params)
generator.append_tokens([t for seq in input_tokens for t in seq])  # flatten into one list

while not generator.is_done():
    generator.generate_next_token()

# Get all sequences
for i in range(4):
    output = tokenizer.decode(generator.get_sequence(i))
    print(f"Sequence {i}: {output}")

Advanced Features

Continuous Decoding

Rewind generation to a previous state and continue from there:
# Generate some tokens
for _ in range(20):
    generator.generate_next_token()

initial_length = generator.token_count()

# Generate branch A
for _ in range(10):
    generator.generate_next_token()
branch_a = generator.get_sequence(0)

# Rewind and generate branch B
generator.rewind_to(initial_length)
for _ in range(10):
    generator.generate_next_token()
branch_b = generator.get_sequence(0)
This is useful for:
  • Speculative decoding
  • Tree-based search
  • Alternative hypothesis exploration

Custom Logits

Manipulate logits directly:
# Run one generation step
generator.generate_next_token()

# Get logits
logits = generator.get_logits()

# Modify logits (e.g., ban certain tokens)
logits[banned_token_id] = float('-inf')

# Set modified logits
generator.set_logits(logits)

Performance Considerations

Search Strategy Performance

  • Greedy: Fastest, ~1x baseline
  • Beam Search (4 beams): ~3-4x slower than greedy
  • Sampling: Similar to greedy, small overhead for RNG

Optimization Tips

  1. Use greedy search for latency-critical applications
  2. Enable past_present_share_buffer for CUDA with greedy search
  3. Limit max_length to avoid unnecessary computation
  4. Use batching to amortize overhead across multiple sequences
  5. Adjust temperature instead of using extreme top_k/top_p values

Next Steps

KV Cache

Learn how KV cache improves generation performance

Constrained Decoding

Generate structured outputs with grammar constraints

API Reference

Explore the complete API

Examples

See generation in action
