Overview
Text generation in ONNX Runtime GenAI is controlled by the Search and Generator classes, which implement various decoding strategies for selecting tokens. The library supports both deterministic and stochastic generation methods.
The Generation Loop
The generation process follows this pattern (from src/generators.h:99):
Generate Tokens

Call GenerateNextToken() in a loop until IsDone() returns true. Each iteration:

- Runs model inference via State::Run()
- Retrieves logits
- Applies logit processors (penalties, constraints)
- Selects next token(s) via the search strategy
- Updates the KV cache
- Checks termination conditions
Example
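The loop above can be sketched in pure Python. This is an illustrative stand-in, not the ONNX Runtime GenAI API: toy_logits, EOS, and MAX_LENGTH are hypothetical substitutes for the model's State::Run() and the generation config.

```python
# Toy sketch of the generation loop: greedy decoding over a stub model.
EOS = 3
MAX_LENGTH = 8

def toy_logits(tokens):
    # Deterministic stub standing in for State::Run():
    # "predicts" the next token as (last + 1) mod 4.
    nxt = (tokens[-1] + 1) % 4
    return [1.0 if t == nxt else 0.0 for t in range(4)]

def is_done(tokens):
    # Termination check: EOS generated or max length reached.
    return tokens[-1] == EOS or len(tokens) >= MAX_LENGTH

def generate(prompt):
    tokens = list(prompt)
    while not is_done(tokens):
        logits = toy_logits(tokens)  # 1. run inference
        # 2. logit processors would adjust `logits` here (penalties, constraints)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # 3. greedy selection
        tokens.append(next_tok)      # 4. in the real library, the KV cache is updated here
    return tokens

print(generate([0]))  # [0, 1, 2, 3] (stops at EOS)
```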
Search Strategies
ONNX Runtime GenAI implements two primary search strategies (from src/search.h):
Greedy Search
Greedy search selects the token with the highest probability at each step. It’s fast and deterministic but may not produce the most diverse or creative outputs.

When to use:

- Factual question answering
- Code generation
- Translation tasks where determinism is preferred
The greedy search implementation is in src/search.h:68.
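The selection rule itself is just an argmax over the (processed) logits. A minimal sketch, assuming nothing beyond the description above:

```python
def greedy_select(logits):
    # Pick the index of the highest logit; no randomness involved,
    # so the same logits always yield the same token.
    return max(range(len(logits)), key=logits.__getitem__)

print(greedy_select([0.1, 2.5, -1.0, 2.4]))  # 1
```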
Beam Search
Beam search maintains multiple hypotheses (beams) and explores different token sequences in parallel. It finds higher-quality sequences but is more computationally expensive.

When to use:

- Translation tasks
- Summarization
- Tasks requiring high-quality, coherent outputs
The beam search implementation is in src/search.h:99.
BeamSearchScorer (from src/beam_search_scorer.h) manages:
- Beam hypothesis tracking
- Score normalization with length penalty
- Early stopping logic
- Final sequence selection
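These responsibilities can be illustrated with a self-contained toy beam search. Everything here is an assumption for illustration: beam_search, toy_logprobs, and the length normalization formula (sum of log-probs divided by length ** length_penalty, a common formulation) are not the library's C++ implementation.

```python
import math

def beam_search(logprob_fn, bos, eos, num_beams=2, max_len=5, length_penalty=1.0):
    # Each beam is (tokens, cumulative log-prob). Finished hypotheses are
    # normalized as sum_logprob / len(tokens) ** length_penalty.
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in logprob_fn(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == eos:  # hypothesis tracking: move to finished
                finished.append((tokens, score / len(tokens) ** length_penalty))
            else:
                beams.append((tokens, score))
            if len(beams) == num_beams:
                break
        if len(finished) >= num_beams:  # early stopping logic
            break
    finished += [(t, s / len(t) ** length_penalty) for t, s in beams]
    return max(finished, key=lambda c: c[1])[0]  # final sequence selection

def toy_logprobs(tokens):
    # Hypothetical model: prefers token 1, forces EOS (token 2) after 3 steps.
    if len(tokens) >= 3:
        return {2: 0.0}
    return {0: math.log(0.4), 1: math.log(0.6)}

print(beam_search(toy_logprobs, bos=9, eos=2, num_beams=2, max_len=6))
# [9, 1, 1, 2]
```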
Search Parameters
All search parameters are defined in src/config.h:293:
Core Parameters
max_length: Maximum length of the generated sequence (including input tokens). Defaults to model.context_length if not set.

min_length: Minimum length before the EOS token is allowed. Useful for preventing premature termination.
batch_size: Number of independent sequences to generate in parallel.
num_beams: Number of beams for beam search. Set to 1 for greedy search.
num_return_sequences: Number of sequences to return from beam search. Must be ≤ num_beams.
Sampling Parameters
do_sample: Enable randomized sampling. When false, greedy/beam search is deterministic.
top_k: Number of highest-probability tokens to keep for top-k filtering. Set to 0 to disable.
top_p: Cumulative probability threshold for nucleus sampling. Only tokens within cumulative probability ≤ top_p are kept. Range: (0, 1]. Set to 0 to disable.
temperature: Controls randomness in sampling. Lower values make output more deterministic; higher values increase diversity.

- 0.1-0.5: More focused and deterministic
- 0.7-0.9: Balanced creativity and coherence
- 1.0+: More random and creative
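The effect of temperature is easy to see in a small softmax sketch (plain Python, illustrative only):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before softmax: T < 1 sharpens the
    # distribution toward the top token, T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs_cold = softmax_with_temperature([2.0, 1.0, 0.0], 0.5)
probs_hot = softmax_with_temperature([2.0, 1.0, 0.0], 2.0)
# The top token's probability grows as temperature drops:
assert probs_cold[0] > probs_hot[0]
```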
random_seed: Random seed for sampling. Set to -1 for non-deterministic random seeding.
Penalty Parameters
repetition_penalty: Penalty for token repetition. Values > 1.0 discourage repetition, < 1.0 encourage it. Typical range: 1.0-1.5.
no_repeat_ngram_size: Prevent repeating n-grams of this size. Currently unused in the implementation.
Beam Search Parameters
length_penalty: Exponential penalty applied to sequence length in beam search.

- > 1.0: Favors longer sequences
- < 1.0: Favors shorter sequences
- = 1.0: No length penalty
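Concretely, a common normalization (used here as an illustrative assumption, not necessarily the library's exact formula) divides a hypothesis's cumulative log-probability by its length raised to length_penalty:

```python
def normalized_score(sum_logprob, length, length_penalty):
    # Beam score normalization: cumulative log-prob divided by
    # length ** length_penalty. Illustrative formulation.
    return sum_logprob / (length ** length_penalty)

# With no penalty the shorter hypothesis wins; raising length_penalty
# above 1.0 flips the ranking in favor of the longer one.
assert normalized_score(-3.0, 4, 1.0) > normalized_score(-6.5, 8, 1.0)
assert normalized_score(-6.5, 8, 1.5) > normalized_score(-3.0, 4, 1.5)
```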
early_stopping: Stop beam search when num_beams complete sentences are found.

diversity_penalty: Penalty to encourage diverse beams. Currently unused in the implementation.
Sampling Methods
When do_sample=true, tokens are selected probabilistically rather than deterministically.
Top-K Sampling
Selects the next token from the K most likely tokens (from src/search.cpp:173).
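A pure-Python sketch of the idea (not the library's CUDA/C++ code; top_k_sample is an illustrative name):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    # Keep the k highest logits, softmax over just those, and sample.
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [0.1, 3.0, 2.5, -1.0, 0.5]
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(50)}
print(samples)  # only indices 1 and 2 can ever be drawn
```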
Top-P (Nucleus) Sampling
Selects from the smallest set of tokens whose cumulative probability exceeds the threshold P (from src/search.cpp:195).
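A sketch of the nucleus cutoff, again illustrative rather than the library's implementation:

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    # Sort tokens by probability and keep the smallest prefix whose
    # cumulative probability reaches p, then sample within that prefix.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

rng = random.Random(0)
# With this very skewed distribution, index 0 alone exceeds p=0.9,
# so it is the only token that can ever be selected.
samples = {top_p_sample([5.0, 1.0, 0.9, 0.1], p=0.9, rng=rng) for _ in range(50)}
print(samples)  # {0}
```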
Top-K + Top-P Sampling
Combines both strategies: first applies top-k, then top-p within those k tokens (from src/search.cpp:227).
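Chaining the two filters can be sketched like this (illustrative names and code, assuming the order described above: top-k first, then the nucleus cutoff within the surviving k tokens):

```python
import math
import random

def top_k_top_p_sample(logits, k, p, rng=random):
    # Step 1: restrict to the k most likely tokens.
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in order)
    exps = [math.exp(logits[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: apply the nucleus cutoff p within that subset.
    kept, weights, cum = [], [], 0.0
    for i, pr in zip(order, probs):
        kept.append(i)
        weights.append(pr)
        cum += pr
        if cum >= p:
            break
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [4.0, 3.9, 0.2, 0.1, -1.0]
# k=2 keeps indices 0 and 1; p=0.5 then trims the set to just index 0.
print(top_k_top_p_sample(logits, k=2, p=0.5, rng=random.Random(0)))  # 0
```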
Temperature Scaling
Temperature is applied before the softmax to control randomness.

Logits Processing
Before token selection, logits are processed to enforce constraints and apply penalties.

Repetition Penalty
Penalizes tokens that already appear in the sequence (from src/search.cpp).
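A common formulation (from the CTRL paper, widely used by generation libraries) divides positive logits of already-seen tokens by the penalty and multiplies negative ones; both push the token's probability down when penalty > 1. Whether src/search.cpp uses exactly this rule is an assumption; the sketch is illustrative:

```python
def apply_repetition_penalty(logits, generated, penalty):
    # Penalize every token id that already appears in `generated`:
    # positive logits shrink toward 0, negative logits grow more negative.
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, -1.0, 0.5]
# Token 0 (positive logit) shrinks, token 1 (negative) drops further,
# token 2 is untouched because it has not been generated yet.
print(apply_repetition_penalty(logits, generated=[0, 1], penalty=1.3))
```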
Minimum Length
Suppresses the EOS token until the minimum length is reached.

Constrained Decoding
The ConstrainedLogitsProcessor (from src/constrained_logits_processor.h) enables grammar-based generation for structured outputs like JSON:
Termination Conditions
Generation stops when (from src/generators.h:102):
- EOS Token: An end-of-sequence token is generated (for greedy search)
- Max Length: The sequence reaches max_length
- Beam Search Done: All beams have completed (with early_stopping=true)
- Manual Termination: User interrupts generation
Streaming Generation
For real-time output, use TokenizerStream to decode tokens incrementally.
Batched Generation
Generate multiple independent sequences in parallel.

Advanced Features
Continuous Decoding
Rewind generation to a previous state and continue from there. This enables:

- Speculative decoding
- Tree-based search
- Alternative hypothesis exploration
Custom Logits
Manipulate logits directly.

Performance Considerations
Search Strategy Performance
- Greedy: Fastest, ~1x baseline
- Beam Search (4 beams): ~3-4x slower than greedy
- Sampling: Similar to greedy, small overhead for RNG
Optimization Tips
- Use greedy search for latency-critical applications
- Enable past_present_share_buffer for CUDA with greedy search
- Limit max_length to avoid unnecessary computation
- Use batching to amortize overhead across multiple sequences
- Adjust temperature instead of using extreme top_k/top_p values
Next Steps
KV Cache
Learn how KV cache improves generation performance
Constrained Decoding
Generate structured outputs with grammar constraints
API Reference
Explore the complete API
Examples
See generation in action