The SetRuntimeOption API allows you to configure model behavior dynamically during inference without restarting the session. This guide covers all available runtime options and their usage.

Overview

Runtime options enable you to:
  • Terminate generation on-demand
  • Enable/disable profiling during inference
  • Configure session behavior without reloading the model
All runtime options are set using key-value pairs through the SetRuntimeOption API.

Available Options

Terminate Session

Control whether to terminate the current generation session or recover from a terminated state.
terminate_session
string
Accepted values: "0" or "1"
  • "1": Terminate the current session
  • "0": Recover from a terminated state and continue/restart

How It Works

When you enable session termination:
  1. The current generation will throw an exception
  2. Your code must handle this exception
  3. You can recover by setting the option back to "0"

Python Example

import onnxruntime_genai as og

model = og.Model('model_path')
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

try:
    # Start generation
    while not generator.is_done():
        generator.generate_next_token()
        
        # Check some condition to terminate early
        if should_terminate:
            # Terminate the session
            generator.set_runtime_option("terminate_session", "1")
            
except Exception as e:
    print(f"Generation terminated: {e}")
    
    # Recover and restart if needed
    generator.set_runtime_option("terminate_session", "0")
    # Can now start a new generation

C++ Example

#include "ort_genai.h"
#include <iostream>

// Assumes model and params have already been created
auto generator = OgaGenerator::Create(*model, *params);

try {
    while (!generator->IsDone()) {
        generator->GenerateNextToken();

        if (should_terminate) {
            generator->SetRuntimeOption("terminate_session", "1");
        }
    }
} catch (const std::exception& e) {
    std::cout << "Generation terminated: " << e.what() << std::endl;

    // Recover so the generator can be used again
    generator->SetRuntimeOption("terminate_session", "0");
}

Note that the generator is created before the try block so it remains in scope for the recovery call in the catch handler.

C# Example

using Microsoft.ML.OnnxRuntimeGenAI;

using var generator = new Generator(model, generatorParams);

try
{
    while (!generator.IsDone())
    {
        generator.GenerateNextToken();

        if (shouldTerminate)
        {
            generator.SetRuntimeOption("terminate_session", "1");
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Generation terminated: {ex.Message}");

    // Recover
    generator.SetRuntimeOption("terminate_session", "0");
}

As in the C++ example, the generator is declared before the try block so it is still in scope when the catch handler recovers the session.
See examples/c/src/phi3_terminate.cpp in the repository for a complete working example.

Enable Profiling

Dynamically enable or disable ONNX Runtime profiling during generation. When enabled, each token generation produces a separate profiling JSON file.
enable_profiling
string
Accepted values: "0", "1", or a custom prefix string
  • "0": Disable profiling
  • "1": Enable profiling with default prefix "onnxruntime_run_profile"
  • "<custom_prefix>": Enable profiling with custom file prefix

How It Works

When profiling is enabled:
  • Each generate_next_token() call creates a separate profiling file
  • Files are named: {prefix}_{timestamp}.json
  • You can start/stop profiling at any point during generation
  • Useful for profiling specific portions of the generation process

Python Example

import onnxruntime_genai as og

model = og.Model('model_path')
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Start generation without profiling
for i in range(10):
    generator.generate_next_token()

# Enable profiling with default prefix
generator.set_runtime_option("enable_profiling", "1")

# Profile the next 5 tokens
for i in range(5):
    generator.generate_next_token()
    # This creates: onnxruntime_run_profile_{timestamp}.json

# Disable profiling
generator.set_runtime_option("enable_profiling", "0")

# Continue generation without profiling
while not generator.is_done():
    generator.generate_next_token()

Custom Prefix Example

# Enable profiling with custom prefix
generator.set_runtime_option("enable_profiling", "my_model_profile")

for i in range(5):
    generator.generate_next_token()
    # This creates: my_model_profile_{timestamp}.json

# Disable profiling
generator.set_runtime_option("enable_profiling", "0")

C++ Example

auto generator = OgaGenerator::Create(*model, *params);

// Start profiling
generator->SetRuntimeOption("enable_profiling", "1");

for (int i = 0; i < 5; ++i) {
    generator->GenerateNextToken();
}

// Stop profiling
generator->SetRuntimeOption("enable_profiling", "0");

C# Example

using var generator = new Generator(model, generatorParams);

// Enable profiling with custom prefix
generator.SetRuntimeOption("enable_profiling", "inference_profile");

for (int i = 0; i < 5; i++)
{
    generator.GenerateNextToken();
}

// Disable profiling
generator.SetRuntimeOption("enable_profiling", "0");

Profiling vs SessionOptions

There are two ways to enable profiling in ONNX Runtime GenAI:
  1. SessionOptions (enable_profiling in genai_config.json):
    • Session-level configuration
    • Collects all profiling data from session creation to end
    • Aggregates data into a single JSON file
    • Cannot be started or stopped dynamically
  2. Runtime Option (this API):
    • Can be enabled/disabled at any point during generation
    • Each token generation produces its own profiling file
    • Useful for profiling specific portions of generation
    • More flexible for targeted performance analysis
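
The session-level alternative is configured in genai_config.json rather than through this API. A minimal sketch is shown below; the exact nesting of session_options may vary by model, so treat the field placement as illustrative and check your model's generated config:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_profiling": "session_profile"
      }
    }
  }
}
```

With this setting, profiling runs for the entire session lifetime and is written to a single file with the given prefix; use the runtime option instead when you need to target a specific window of generation.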

Analyzing Profiling Data

The profiling JSON files can be analyzed using:

Chrome Tracing

Open chrome://tracing in Chrome/Edge and load the JSON file

Perfetto

Use Perfetto UI for advanced analysis

Custom Scripts

Parse the JSON for automated performance analysis

ONNX Runtime Tools

Use ONNX Runtime’s profiling analysis tools
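
For the custom-scripts route, a minimal sketch of parsing a profile file is shown below. It assumes the standard Chrome trace event fields (`ph`, `name`, `dur` in microseconds), which is the format ONNX Runtime profiling emits; the synthetic `sample` trace is for illustration only.

```python
import json
from collections import defaultdict

def summarize_profile(trace_json: str, top_n: int = 5):
    """Sum 'dur' (microseconds) per event name in a Chrome-trace JSON string."""
    events = json.loads(trace_json)
    # Some traces wrap events in a {"traceEvents": [...]} object.
    if isinstance(events, dict):
        events = events.get("traceEvents", [])
    totals = defaultdict(int)
    for ev in events:
        if ev.get("ph") == "X":  # "X" marks complete (duration) events
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    # Return the top_n event names by total duration, slowest first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Synthetic trace for illustration:
sample = json.dumps([
    {"ph": "X", "name": "MatMul", "dur": 120},
    {"ph": "X", "name": "Softmax", "dur": 30},
    {"ph": "X", "name": "MatMul", "dur": 80},
])
print(summarize_profile(sample))  # → [('MatMul', 200), ('Softmax', 30)]
```

In practice you would pass the contents of a `{prefix}_{timestamp}.json` file instead of the synthetic sample.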

Common Patterns

Profile Specific Generation Stages

import onnxruntime_genai as og

model = og.Model('model_path')
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Phase 1: Prompt processing (no profiling)
prompt = "your prompt here"
prompt_tokens = tokenizer.encode(prompt)
generator.append_tokens(prompt_tokens)

# Phase 2: First few tokens (with profiling)
generator.set_runtime_option("enable_profiling", "first_tokens")
for i in range(10):
    generator.generate_next_token()

generator.set_runtime_option("enable_profiling", "0")

# Phase 3: Remaining tokens (no profiling)
while not generator.is_done():
    generator.generate_next_token()

Conditional Termination

import time

import onnxruntime_genai as og

model = og.Model('model_path')
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

max_time_seconds = 10.0
start_time = time.time()

try:
    while not generator.is_done():
        generator.generate_next_token()
        
        # Terminate if taking too long
        if time.time() - start_time > max_time_seconds:
            print("Generation timeout - terminating")
            generator.set_runtime_option("terminate_session", "1")
            
except Exception as e:
    # Handle graceful termination
    partial_output = tokenizer.decode(generator.get_sequence(0))
    print(f"Partial output: {partial_output}")

Debug Performance Issues

import time

import onnxruntime_genai as og

model = og.Model('model_path')
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Profile only the slow tokens
token_times = []

for i in range(100):
    start = time.time()
    generator.generate_next_token()
    elapsed = time.time() - start
    token_times.append(elapsed)
    
    # If a token is slow, enable profiling for the next few
    if elapsed > 0.1:  # 100ms threshold
        print(f"Slow token {i} detected: {elapsed:.3f}s")
        generator.set_runtime_option("enable_profiling", f"slow_token_{i}")
        
        # Profile next 5 tokens
        for j in range(5):
            generator.generate_next_token()
        
        generator.set_runtime_option("enable_profiling", "0")

Best Practices

  • Profiling adds overhead to generation. Enable it only when needed for performance analysis, not in production.
  • Always wrap termination in try-catch blocks and handle partial results appropriately.
  • When profiling, use descriptive prefixes that make it easy to identify which portion of code generated each profile.
  • Profile files can accumulate quickly. Implement cleanup logic to remove old profiles.
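
As one way to implement that cleanup, here is a small sketch. The prefix and age threshold are assumptions; adjust them to your own naming scheme and retention policy.

```python
import glob
import os
import time

def prune_profiles(directory: str,
                   prefix: str = "onnxruntime_run_profile",
                   max_age_seconds: float = 24 * 3600) -> int:
    """Delete profiling JSON files older than max_age_seconds; return count removed."""
    removed = 0
    now = time.time()
    for path in glob.glob(os.path.join(directory, f"{prefix}*.json")):
        # Compare against the file's last-modified time
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed += 1
    return removed
```

Running this periodically (for example, at application startup) keeps the profile directory from growing without bound.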

Next Steps

Constrained Decoding

Control output format with grammar constraints

Multi-LoRA

Switch between LoRA adapters dynamically

Python API

Explore the Generator API reference

Build from Source

Build ONNX Runtime GenAI from source
