The SetRuntimeOption API allows you to configure model behavior dynamically during inference without restarting the session. This guide covers all available runtime options and their usage.

Overview

Runtime options enable you to:
  • Terminate generation on-demand
  • Enable/disable profiling during inference
  • Configure session behavior without reloading the model
All runtime options are set using key-value pairs through the SetRuntimeOption API.

Available Options

Terminate Session

Control whether to terminate the current generation session or recover from a terminated state.
terminate_session
string
Accepted values: "0" or "1"
  • "1": Terminate the current session
  • "0": Recover from a terminated state and continue/restart

How It Works

When you enable session termination:
  1. The current generation will throw an exception
  2. Your code must handle this exception
  3. You can recover by setting the option back to "0"

Python Example

import onnxruntime_genai as og

model = og.Model('model_path')
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

try:
    # Start generation
    while not generator.is_done():
        generator.generate_next_token()
        
        # Check some condition to terminate early
        if should_terminate:
            # Terminate the session
            generator.set_runtime_option("terminate_session", "1")
            
except Exception as e:
    print(f"Generation terminated: {e}")
    
    # Recover and restart if needed
    generator.set_runtime_option("terminate_session", "0")
    # Can now start a new generation

C++ Example

#include "ort_genai.h"
#include <iostream>

// Assumes model and params have already been created
auto generator = OgaGenerator::Create(*model, *params);

try {
    while (!generator->IsDone()) {
        generator->GenerateNextToken();

        if (should_terminate) {
            generator->SetRuntimeOption("terminate_session", "1");
        }
    }
} catch (const std::exception& e) {
    std::cout << "Generation terminated: " << e.what() << std::endl;

    // Recover so the generator can be used again
    generator->SetRuntimeOption("terminate_session", "0");
}

Note that the generator is created before the try block so it remains in scope for the recovery call in the catch handler.

C# Example

using Microsoft.ML.OnnxRuntimeGenAI;

using var generator = new Generator(model, generatorParams);

try
{
    while (!generator.IsDone())
    {
        generator.GenerateNextToken();

        if (shouldTerminate)
        {
            generator.SetRuntimeOption("terminate_session", "1");
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Generation terminated: {ex.Message}");

    // Recover
    generator.SetRuntimeOption("terminate_session", "0");
}

As in the C++ example, the generator is declared before the try block so it is still in scope when the catch handler recovers the session.
See examples/c/src/phi3_terminate.cpp in the repository for a complete working example.

Enable Profiling

Dynamically enable or disable ONNX Runtime profiling during generation. When enabled, each token generation produces a separate profiling JSON file.
enable_profiling
string
Accepted values: "0", "1", or a custom prefix string
  • "0": Disable profiling
  • "1": Enable profiling with default prefix "onnxruntime_run_profile"
  • "<custom_prefix>": Enable profiling with custom file prefix

How It Works

When profiling is enabled:
  • Each generate_next_token() call creates a separate profiling file
  • Files are named: {prefix}_{timestamp}.json
  • You can start/stop profiling at any point during generation
  • Useful for profiling specific portions of the generation process

Python Example

import onnxruntime_genai as og

model = og.Model('model_path')
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Start generation without profiling
for i in range(10):
    generator.generate_next_token()

# Enable profiling with default prefix
generator.set_runtime_option("enable_profiling", "1")

# Profile the next 5 tokens
for i in range(5):
    generator.generate_next_token()
    # This creates: onnxruntime_run_profile_{timestamp}.json

# Disable profiling
generator.set_runtime_option("enable_profiling", "0")

# Continue generation without profiling
while not generator.is_done():
    generator.generate_next_token()

Custom Prefix Example

# Enable profiling with custom prefix
generator.set_runtime_option("enable_profiling", "my_model_profile")

for i in range(5):
    generator.generate_next_token()
    # This creates: my_model_profile_{timestamp}.json

# Disable profiling
generator.set_runtime_option("enable_profiling", "0")

C++ Example

auto generator = OgaGenerator::Create(*model, *params);

// Start profiling
generator->SetRuntimeOption("enable_profiling", "1");

for (int i = 0; i < 5; ++i) {
    generator->GenerateNextToken();
}

// Stop profiling
generator->SetRuntimeOption("enable_profiling", "0");

C# Example

using var generator = new Generator(model, generatorParams);

// Enable profiling with custom prefix
generator.SetRuntimeOption("enable_profiling", "inference_profile");

for (int i = 0; i < 5; i++)
{
    generator.GenerateNextToken();
}

// Disable profiling
generator.SetRuntimeOption("enable_profiling", "0");

Profiling vs SessionOptions

There are two ways to enable profiling in ONNX Runtime GenAI:
  1. SessionOptions (enable_profiling in genai_config.json):
    • Session-level configuration
    • Collects all profiling data from session creation to end
    • Aggregates data into a single JSON file
    • Cannot be started or stopped dynamically
  2. Runtime Option (this API):
    • Can be enabled/disabled at any point during generation
    • Each token generation produces its own profiling file
    • Useful for profiling specific portions of generation
    • More flexible for targeted performance analysis
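
The session-level alternative is configured in genai_config.json rather than through this API. A minimal sketch is shown below; the exact nesting of session_options may vary by model, so treat the field placement as illustrative and check your model's generated config:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_profiling": "session_profile"
      }
    }
  }
}
```

With this setting, profiling runs for the entire session lifetime and is written to a single file with the given prefix; use the runtime option instead when you need to target a specific window of generation.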

Analyzing Profiling Data

The profiling JSON files can be analyzed using:

Chrome Tracing

Open chrome://tracing in Chrome/Edge and load the JSON file

Perfetto

Use Perfetto UI for advanced analysis

Custom Scripts

Parse the JSON for automated performance analysis

ONNX Runtime Tools

Use ONNX Runtime’s profiling analysis tools
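
For the custom-scripts route, a minimal sketch of parsing a profile file is shown below. It assumes the standard Chrome trace event fields (`ph`, `name`, `dur` in microseconds), which is the format ONNX Runtime profiling emits; the synthetic `sample` trace is for illustration only.

```python
import json
from collections import defaultdict

def summarize_profile(trace_json: str, top_n: int = 5):
    """Sum 'dur' (microseconds) per event name in a Chrome-trace JSON string."""
    events = json.loads(trace_json)
    # Some traces wrap events in a {"traceEvents": [...]} object.
    if isinstance(events, dict):
        events = events.get("traceEvents", [])
    totals = defaultdict(int)
    for ev in events:
        if ev.get("ph") == "X":  # "X" marks complete (duration) events
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    # Return the top_n event names by total duration, slowest first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Synthetic trace for illustration:
sample = json.dumps([
    {"ph": "X", "name": "MatMul", "dur": 120},
    {"ph": "X", "name": "Softmax", "dur": 30},
    {"ph": "X", "name": "MatMul", "dur": 80},
])
print(summarize_profile(sample))  # → [('MatMul', 200), ('Softmax', 30)]
```

In practice you would pass the contents of a `{prefix}_{timestamp}.json` file instead of the synthetic sample.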

Common Patterns

Profile Specific Generation Stages

import onnxruntime_genai as og

model = og.Model('model_path')
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Phase 1: Prompt processing (no profiling)
prompt = "your prompt here"
prompt_tokens = tokenizer.encode(prompt)
generator.append_tokens(prompt_tokens)

# Phase 2: First few tokens (with profiling)
generator.set_runtime_option("enable_profiling", "first_tokens")
for i in range(10):
    generator.generate_next_token()

generator.set_runtime_option("enable_profiling", "0")

# Phase 3: Remaining tokens (no profiling)
while not generator.is_done():
    generator.generate_next_token()

Conditional Termination

import time

import onnxruntime_genai as og

model = og.Model('model_path')
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

max_time_seconds = 10.0
start_time = time.time()

try:
    while not generator.is_done():
        generator.generate_next_token()
        
        # Terminate if taking too long
        if time.time() - start_time > max_time_seconds:
            print("Generation timeout - terminating")
            generator.set_runtime_option("terminate_session", "1")
            
except Exception as e:
    # Handle graceful termination
    partial_output = tokenizer.decode(generator.get_sequence(0))
    print(f"Partial output: {partial_output}")

Debug Performance Issues

import time

import onnxruntime_genai as og

model = og.Model('model_path')
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

# Profile only the slow tokens
token_times = []

for i in range(100):
    start = time.time()
    generator.generate_next_token()
    elapsed = time.time() - start
    token_times.append(elapsed)
    
    # If a token is slow, enable profiling for the next few
    if elapsed > 0.1:  # 100ms threshold
        print(f"Slow token {i} detected: {elapsed:.3f}s")
        generator.set_runtime_option("enable_profiling", f"slow_token_{i}")
        
        # Profile next 5 tokens
        for j in range(5):
            generator.generate_next_token()
        
        generator.set_runtime_option("enable_profiling", "0")

Best Practices

  • Profiling adds overhead to generation. Enable it only when needed for performance analysis, not in production.
  • Always wrap termination in try-catch blocks and handle partial results appropriately.
  • When profiling, use descriptive prefixes that make it easy to identify which portion of code generated each profile.
  • Profile files can accumulate quickly. Implement cleanup logic to remove old profiles.
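
As one way to implement that cleanup, here is a small sketch. The prefix and age threshold are assumptions; adjust them to your own naming scheme and retention policy.

```python
import glob
import os
import time

def prune_profiles(directory: str,
                   prefix: str = "onnxruntime_run_profile",
                   max_age_seconds: float = 24 * 3600) -> int:
    """Delete profiling JSON files older than max_age_seconds; return count removed."""
    removed = 0
    now = time.time()
    for path in glob.glob(os.path.join(directory, f"{prefix}*.json")):
        # Compare against the file's last-modified time
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed += 1
    return removed
```

Running this periodically (for example, at application startup) keeps the profile directory from growing without bound.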

Next Steps

Constrained Decoding

Control output format with grammar constraints

Multi-LoRA

Switch between LoRA adapters dynamically

Python API

Explore the Generator API reference

Build from Source

Build ONNX Runtime GenAI from source
