Overview
Runtime options let you adjust generation behavior on the fly through the SetRuntimeOption API:
- Terminate generation on demand
- Enable or disable profiling during inference
- Configure session behavior without reloading the model
Available Options
Terminate Session
Control whether to terminate the current generation session or recover from a terminated state.

Accepted values: "0" or "1"
- "1": Terminate the current session
- "0": Recover from a terminated state and continue or restart
How It Works
When you enable session termination:
- The current generation call throws an exception
- Your code must catch and handle this exception
- You can recover by setting the option back to "0"
Python Example
C++ Example
C# Example
See examples/c/src/phi3_terminate.cpp in the repository for a complete working example.
Enable Profiling
Dynamically enable or disable ONNX Runtime profiling during generation. When enabled, each token generation produces a separate profiling JSON file.

Accepted values: "0", "1", or a custom prefix string
- "0": Disable profiling
- "1": Enable profiling with the default prefix "onnxruntime_run_profile"
- "<custom_prefix>": Enable profiling with a custom file prefix
How It Works
When profiling is enabled:
- Each generate_next_token() call creates a separate profiling file
- Files are named {prefix}_{timestamp}.json
- You can start and stop profiling at any point during generation
- This is useful for profiling specific portions of the generation process
Python Example
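A sketch of toggling profiling mid-generation. `_StubGenerator` is a stand-in written for this sketch that mimics the behavior described above: while the `"enable_profiling"` option is set, each `generate_next_token()` call emits one `{prefix}_{...}.json` file (a step counter stands in for the real timestamp). The option key string and value semantics follow the accepted values listed above; with the real library you would call `set_runtime_option` on an actual generator.

```python
import json
import os
import tempfile


class _StubGenerator:
    """Stand-in generator that writes one profile file per token while enabled."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self._profiling_prefix = None
        self.step = 0

    def set_runtime_option(self, key, value):
        if key == "enable_profiling":
            if value == "0":
                self._profiling_prefix = None
            elif value == "1":
                self._profiling_prefix = "onnxruntime_run_profile"  # default prefix
            else:
                self._profiling_prefix = value  # custom prefix

    def generate_next_token(self):
        self.step += 1
        if self._profiling_prefix:
            # Real files use a timestamp; a step counter keeps the demo deterministic.
            name = f"{self._profiling_prefix}_{self.step}.json"
            with open(os.path.join(self.out_dir, name), "w") as f:
                json.dump([], f)


out = tempfile.mkdtemp()
gen = _StubGenerator(out)
gen.generate_next_token()                                   # profiling off: no file
gen.set_runtime_option("enable_profiling", "decode_phase")  # custom prefix
gen.generate_next_token()                                   # one file per token
gen.generate_next_token()
gen.set_runtime_option("enable_profiling", "0")
gen.generate_next_token()                                   # profiling off again
print(sorted(os.listdir(out)))  # ['decode_phase_2.json', 'decode_phase_3.json']
```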
Custom Prefix Example
C++ Example
C# Example
Profiling vs SessionOptions
There are two ways to enable profiling in ONNX Runtime GenAI:
- SessionOptions (enable_profiling in genai_config.json):
  - Session-level configuration
  - Collects all profiling data from session creation to end
  - Aggregates data into a single JSON file
  - Cannot be started or stopped dynamically
- Runtime Option (this API):
  - Can be enabled or disabled at any point during generation
  - Each token generation produces its own profiling file
  - Useful for profiling specific portions of generation
  - More flexible for targeted performance analysis
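For comparison, a sketch of the session-level alternative. The enable_profiling field name comes from the comparison above; its placement under model.decoder.session_options and the string value (a profile file prefix) are assumptions — check the layout of your model's genai_config.json before relying on this.

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_profiling": "session_profile"
      }
    }
  }
}
```

With this configuration, profiling covers the whole session lifetime and produces a single aggregated file, in contrast to the per-token files of the runtime option.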
Analyzing Profiling Data
The profiling JSON files can be analyzed using:
- Chrome Tracing: open chrome://tracing in Chrome/Edge and load the JSON file
- Perfetto: use the Perfetto UI for advanced analysis
- Custom scripts: parse the JSON for automated performance analysis
- ONNX Runtime tools: use ONNX Runtime's profiling analysis tools
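As a sketch of the custom-script route: ONNX Runtime profiles use the Chrome trace event format, a JSON array of events with fields such as "name" and "dur" (duration in microseconds), so aggregating time per operator is a few lines. The sample events below are made up for illustration, not real profiler output.

```python
import json
from collections import Counter


def time_per_event(events):
    """Sum the recorded duration (microseconds) for each event name."""
    totals = Counter()
    for ev in events:
        if "dur" in ev:  # skip metadata events without a duration
            totals[ev["name"]] += ev["dur"]
    return totals


# Fabricated sample in Chrome trace event format:
sample = json.loads("""[
  {"name": "MatMul", "cat": "Node", "ph": "X", "ts": 10,  "dur": 120},
  {"name": "Add",    "cat": "Node", "ph": "X", "ts": 140, "dur": 15},
  {"name": "MatMul", "cat": "Node", "ph": "X", "ts": 160, "dur": 110}
]""")

print(time_per_event(sample).most_common())  # [('MatMul', 230), ('Add', 15)]
```

For a real file, replace `sample` with `json.load(open(path))` on one of the generated `{prefix}_{timestamp}.json` files.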
Common Patterns
Profile Specific Generation Stages
Conditional Termination
Debug Performance Issues
Best Practices
Use Profiling Sparingly
Profiling adds overhead to generation. Enable it only when needed for performance analysis, not in production.
Handle Termination Gracefully
Always wrap termination in try-catch blocks and handle partial results appropriately.
Use Descriptive Prefixes
When profiling, use descriptive prefixes that make it easy to identify which portion of code generated each profile.
Clean Up Profile Files
Profile files can accumulate quickly. Implement cleanup logic to remove old profiles.
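One possible cleanup helper: delete all but the newest N profiling files for a given prefix. The `{prefix}_{timestamp}.json` naming comes from the section above; the directory layout and keep-count are illustrative.

```python
import glob
import os


def prune_profiles(directory, prefix, keep=10):
    """Remove old {prefix}_*.json profiles, keeping the `keep` newest."""
    files = sorted(
        glob.glob(os.path.join(directory, f"{prefix}_*.json")),
        key=os.path.getmtime,
        reverse=True,  # newest first
    )
    for stale in files[keep:]:
        os.remove(stale)
    return files[:keep]


# Demo: create five dummy profile files, then keep only the two newest.
import tempfile

tmp = tempfile.mkdtemp()
for i in range(5):
    open(os.path.join(tmp, f"decode_phase_{i}.json"), "w").close()

prune_profiles(tmp, "decode_phase", keep=2)
print(len(os.listdir(tmp)))  # 2
```

Running this after each profiling session (or from a periodic job) keeps the output directory bounded.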
Next Steps
- Constrained Decoding: control output format with grammar constraints
- Multi-LoRA: switch between LoRA adapters dynamically
- Python API: explore the Generator API reference
- Build from Source: build ONNX Runtime GenAI from source