What is ONNX Runtime?
ONNX Runtime is a cross-platform, high-performance scoring engine for Open Neural Network Exchange (ONNX) models. It enables:
- Fast inference across different hardware platforms
- Flexible deployment from cloud to edge devices
- Hardware acceleration through execution providers
- Model optimization via graph transformations
Core Architecture
ONNX Runtime’s architecture consists of several key components working together.
Architecture Diagram
Key Components
1. InferenceSession
The InferenceSession class is the main entry point for running models. It:
- Loads and parses ONNX models
- Manages execution providers
- Handles model initialization and optimization
- Executes inference requests
2. Execution Providers
Execution Providers (EPs) are the hardware acceleration interfaces that enable ONNX Runtime to run on different hardware platforms:
- CPU: Default provider for general-purpose execution
- CUDA: NVIDIA GPU acceleration
- TensorRT: Optimized inference on NVIDIA GPUs
- DirectML: Hardware acceleration on Windows
3. Graph Optimizations
ONNX Runtime applies multiple levels of graph optimizations:
- Level 1 (Basic): Constant folding, redundant node elimination
- Level 2 (Extended): Node fusions, operator transformations
- Level 3 (Layout): Data layout optimizations for specific hardware
4. ONNX Format
The ONNX format is an open standard for representing machine learning models:
- Protobuf-based: Efficient serialization format
- Framework-agnostic: Works with PyTorch, TensorFlow, etc.
- Operator standard: Well-defined operator specifications
- Extensible: Support for custom operators
Execution Flow
Here’s how ONNX Runtime processes an inference request:
Graph Partitioning
The graph is partitioned across available execution providers based on their capabilities.
Performance Considerations
Threading Model
ONNX Runtime uses two types of thread pools:
- Intra-op threads: Parallelize computation within a single operator
- Inter-op threads: Execute independent operators in parallel
Memory Management
- Automatic memory planning and reuse
- Arena-based allocators for efficient allocation
- Support for pre-allocated memory binding
Data Types
Supported tensor data types include:
- Floating point: float32, float16, bfloat16
- Integer: int8, int16, int32, int64
- Quantized: uint8, int8 for quantization
- Special: float8 variants for modern hardware
Runtime Configuration
Session Options
The SessionOptions class provides extensive configuration.
The default optimization level is ORT_ENABLE_ALL, which enables all graph optimizations; for debugging, use ORT_DISABLE_ALL.
Run Options
Control individual inference runs with the RunOptions class.
Model Validation
ONNX Runtime validates models during loading:
- Graph structure: Ensures valid connections between nodes
- Type checking: Verifies operator input/output types
- Shape inference: Propagates tensor shapes through the graph
- Operator availability: Checks that required operators are supported
Next Steps
- ONNX Format: Learn about the ONNX model format specification
- Execution Providers: Explore available execution providers and their capabilities
- Sessions: Deep dive into InferenceSession and session management
- Graph Optimizations: Understand optimization techniques and transformations