Graph optimization is a key feature of ONNX Runtime that improves inference performance by transforming the computational graph without changing its semantics. These optimizations reduce computation and memory usage while improving hardware utilization.
What are Graph Optimizations?
Graph optimizations are transformations applied to the ONNX computational graph:
Constant folding: Pre-compute constant expressions
Operator fusion: Combine multiple operators into a single kernel
Redundancy elimination: Remove unnecessary computations
Layout transformations: Optimize data layouts for hardware
Optimizations are semantics-preserving - they produce the same results while improving performance.
Optimization Levels
ONNX Runtime organizes optimizations into hierarchical levels:
Level 0: Disabled
Level 1: Basic
Level 2: Extended
Level 3: All (Default)
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)
No optimizations applied. Useful for:
Debugging
Validating optimization correctness
Ensuring bit-exact reproducibility
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
session = ort.InferenceSession("model.onnx", sess_options)
Optimizations:
Constant folding
Redundant node elimination
Simple semantics-preserving node fusions
Impact: Low overhead, consistent improvements
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
session = ort.InferenceSession("model.onnx", sess_options)
Optimizations:
All Level 1 optimizations
Complex node fusions
Advanced transformations
Execution provider-specific optimizations
Impact: Significant performance gains
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)
Optimizations:
All Level 2 optimizations
Layout optimizations (NCHW ↔ NHWC)
Hardware-specific transformations
Aggressive fusions
Impact: Maximum performance, higher initialization time
Higher optimization levels increase session creation time but improve inference performance. Use ORT_ENABLE_ALL for production.
ONNX Runtime uses a transformer-based architecture for optimizations:
// Simplified transformer interface
class GraphTransformer {
 public:
  // Apply transformation to graph
  Status Apply(Graph& graph, bool& modified) const;

  // Check if transformer should only run once
  virtual bool ShouldOnlyApplyOnce() const;

 protected:
  // Implementation-specific transformation logic
  virtual Status ApplyImpl(Graph& graph, bool& modified) const = 0;
};
Transformers fall into several categories:
Rule-Based: Pattern matching and replacement
  EliminateIdentity
  ConstantFolding
  CommonSubexpressionElimination
Fusion: Combine multiple operators
  ConvBatchNormFusion
  MatMulAddFusion
  GELUFusion
Layout: Data layout transformations
  NCHWToNHWC
  TransposeOptimizer
EP-Specific: Hardware-specific optimizations
  CUDA kernel fusions
  TensorRT subgraph compilation
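The Apply/ApplyImpl pattern above can be illustrated with a small Python sketch. The `Graph` and `Node` classes here are toy stand-ins invented for illustration, not ONNX Runtime types:

```python
# Illustrative sketch of the GraphTransformer pattern.
# Graph/Node are toy stand-ins, not ONNX Runtime APIs.

class Node:
    def __init__(self, op, inputs, outputs):
        self.op, self.inputs, self.outputs = op, inputs, outputs

class Graph:
    def __init__(self, nodes):
        self.nodes = nodes

class GraphTransformer:
    def apply(self, graph):
        """Template method: returns True if the graph was modified."""
        return self.apply_impl(graph)

    def apply_impl(self, graph):
        raise NotImplementedError

class EliminateIdentity(GraphTransformer):
    """Rule-based transformer: drop Identity nodes and rewire consumers."""
    def apply_impl(self, graph):
        modified = False
        for node in list(graph.nodes):
            if node.op == "Identity":
                src, dst = node.inputs[0], node.outputs[0]
                # Every consumer of the Identity's output now reads its input
                for consumer in graph.nodes:
                    consumer.inputs = [src if i == dst else i
                                       for i in consumer.inputs]
                graph.nodes.remove(node)
                modified = True
        return modified
```

For example, applying `EliminateIdentity` to `Identity -> Relu` leaves a single `Relu` node reading the original input directly.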
Common Optimizations
Constant Folding
Pre-compute operations with constant inputs:
# Original graph: shape operations computed at runtime
input -> Shape -> Gather -> Unsqueeze -> Concat -> Reshape -> output
                     ^           ^          ^
                 constants   constants  constants

# Optimized graph: shape computed once during optimization
input -> Reshape -> output
             ^
     pre-computed shape
Constant folding is especially effective for models with dynamic shapes that use shape manipulation operations.
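As a sketch of the idea (not ONNX Runtime's implementation), constant folding over a toy expression graph can be written in a few lines; the node tuples and constant dictionary are invented for illustration:

```python
# Toy constant folding: nodes whose inputs are all known constants are
# evaluated once at "optimization time" and removed from the runtime graph.
import operator

OPS = {"Add": operator.add, "Mul": operator.mul}

def constant_fold(nodes, constants):
    """nodes: list of (op, input_names, output_name) in topological order.
    constants: dict name -> value, updated in place with folded results.
    Returns the nodes that still must run at inference time."""
    remaining = []
    for op, inputs, output in nodes:
        if all(i in constants for i in inputs):
            # All inputs known: pre-compute the result now
            constants[output] = OPS[op](*(constants[i] for i in inputs))
        else:
            remaining.append((op, inputs, output))
    return remaining
```

For `Add(c1, c2) -> s; Mul(x, s) -> y` with `c1=2, c2=3`, the Add is folded away (`s` becomes the constant 5) and only the Mul remains at runtime.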
Operator Fusion
Combine multiple operators into a single fused kernel:
Benefits:
Reduces memory bandwidth
Fewer kernel launches
Can fold BN parameters into Conv weights
Implementation:
# BatchNorm can be folded into Conv during inference
# new_weight = weight * (gamma / sqrt(var + eps))
# new_bias = beta + (bias - mean) * (gamma / sqrt(var + eps))
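The folding formulas can be checked numerically. This sketch uses a per-channel affine layer as a stand-in for the convolution (the folding algebra is identical per channel):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-5
C = 4  # channels

# Per-channel "conv" parameters (a 1x1 convolution, for simplicity)
weight = rng.normal(size=C)
bias = rng.normal(size=C)

# BatchNorm parameters
gamma, beta = rng.normal(size=C), rng.normal(size=C)
mean, var = rng.normal(size=C), rng.uniform(0.5, 2.0, size=C)

x = rng.normal(size=(8, C))  # batch of 8

# Reference: Conv followed by BatchNorm
y_ref = gamma * ((weight * x + bias) - mean) / np.sqrt(var + eps) + beta

# Folded parameters, per the formulas above
scale = gamma / np.sqrt(var + eps)
new_weight = weight * scale
new_bias = beta + (bias - mean) * scale
y_fused = new_weight * x + new_bias

np.testing.assert_allclose(y_ref, y_fused, rtol=1e-5, atol=1e-8)
```

The two computations agree to floating-point precision, which is why the BatchNorm can disappear entirely at inference time.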
Benefits:
Single kernel call
Better cache utilization
BLAS optimization (GEMM)
Common patterns:
Conv + Relu → ConvRelu
MatMul + Relu → GemmRelu
Add + Relu → AddRelu
LayerNorm + GELU → LayerNormGELU
Example:
# Before: Two separate kernels
x = matmul(A, B)
y = relu(x)
# After: Single fused kernel
y = matmul_relu(A, B)
Fuse the multi-head attention pattern into a single optimized kernel.
Benefits:
Massive reduction in memory transfers
Optimized attention kernels (FlashAttention)
Better GPU utilization
Redundancy Elimination
Remove unnecessary operations:
Identity Elimination
Dropout in Inference
Transpose Cancellation
# Before
Input -> Identity -> Output
# After
Input -> Output
Shape Inference
Propagate shape information through the graph:
import onnx
from onnx import shape_inference
# Infer shapes
model = onnx.load("model.onnx")
inferred_model = shape_inference.infer_shapes(model)
# Now all intermediate tensors have known shapes
# Enables more optimizations
Shape inference is automatic in ONNX Runtime but can be pre-computed for faster session initialization.
Layout Optimizations
Transform data layouts for optimal hardware execution:
NCHW vs NHWC
NCHW (Channels First)
Layout: [Batch, Channels, Height, Width], e.g. [1, 3, 224, 224]
Best for:
CUDA GPU operations
Standard ONNX format
Most deep learning frameworks

NHWC (Channels Last)
Layout: [Batch, Height, Width, Channels], e.g. [1, 224, 224, 3]
Best for:
CPU with SIMD (AVX, NEON)
TensorRT on certain GPUs
Memory-bound operations
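Converting between the two layouts is an axis transpose, which is exactly what the Transpose nodes inserted by the optimizer perform; a numpy sketch:

```python
import numpy as np

# An NCHW tensor: [Batch, Channels, Height, Width]
x_nchw = np.arange(1 * 3 * 224 * 224, dtype=np.float32).reshape(1, 3, 224, 224)

# NCHW -> NHWC: move the channel axis to the end
x_nhwc = x_nchw.transpose(0, 2, 3, 1)   # shape [1, 224, 224, 3]

# NHWC -> NCHW: move it back
x_back = x_nhwc.transpose(0, 3, 1, 2)   # shape [1, 3, 224, 224]

assert x_nhwc.shape == (1, 224, 224, 3)
assert np.array_equal(x_back, x_nchw)   # round trip is lossless
```

A pair of back-to-back transposes like this cancels out, which is what the TransposeOptimizer exploits to remove redundant layout conversions.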
Automatic Layout Optimization
import onnxruntime as ort
sess_options = ort.SessionOptions()
# Level 3 includes layout optimizations
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# ONNX Runtime automatically:
# 1. Analyzes the graph
# 2. Determines optimal layout per operator
# 3. Inserts Transpose operations where needed
# 4. Attempts to eliminate redundant Transposes
session = ort.InferenceSession("model.onnx", sess_options)
Memory Optimizations
Memory Reuse Planning
ONNX Runtime plans memory reuse to minimize peak memory:
# Without memory reuse: buffers are never recycled
tensor1 = allocate(1MB)   # Peak: 1MB
tensor2 = allocate(1MB)   # Peak: 2MB
tensor3 = allocate(1MB)   # Peak: 3MB

# With memory reuse: tensor1 is dead before tensor3 is created
tensor1 = allocate(1MB)   # Peak: 1MB
tensor2 = allocate(1MB)   # Peak: 2MB
free(tensor1)
tensor3 = reuse(tensor1)  # Peak: 2MB (reuses tensor1's memory)

Memory reuse reduces peak usage from 3MB to 2MB.
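A toy lifetime-based planner makes the reuse decision concrete. This is an illustrative sketch with invented `(name, size, last_use)` tuples, not ONNX Runtime's actual allocator:

```python
def plan_peak_memory(allocs, reuse=True):
    """allocs: list of (name, size, last_use_step) in execution order.
    Greedily reuses freed buffers of sufficient size when reuse=True.
    Returns peak memory. Illustrative sketch only."""
    free_buffers = []   # sizes of buffers whose tensors are dead
    live = {}           # name -> (size, last_use_step)
    total = peak = 0
    for step, (name, size, last_use) in enumerate(allocs):
        # Release buffers whose tensors died before this step
        for n, (sz, lu) in list(live.items()):
            if lu < step:
                free_buffers.append(sz)
                del live[n]
        candidates = [sz for sz in free_buffers if sz >= size] if reuse else []
        if candidates:
            # Reuse the smallest adequate freed buffer: no new allocation
            free_buffers.remove(min(candidates))
        else:
            total += size
            peak = max(peak, total)
        live[name] = (size, last_use)
    return peak

# The three 1 MB tensors from the example above (sizes in MB):
# tensor1 is last used at step 1, so tensor3 (step 2) can reuse it.
allocs = [("tensor1", 1, 1), ("tensor2", 1, 2), ("tensor3", 1, 2)]
print(plan_peak_memory(allocs, reuse=False))  # 3
print(plan_peak_memory(allocs, reuse=True))   # 2
```

The planner reproduces the 3 MB vs 2 MB peak from the example: once tensor1's last use has passed, its buffer is handed to tensor3 instead of allocating a third megabyte.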
In-Place Operations
Some operations can modify tensors in-place:
# Out-of-place: Requires new buffer
y = relu(x) # x unchanged, y is new tensor
# In-place: Modifies x directly
relu_inplace(x) # x modified, no new allocation
In-place operations require careful analysis to ensure correctness. ONNX Runtime automatically detects safe in-place opportunities.
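The difference can be demonstrated with numpy, using the `out=` parameter as a stand-in for an in-place kernel:

```python
import numpy as np

x = np.array([-1.0, 2.0, -3.0])

# Out-of-place: a new output buffer y is allocated; x is untouched
y = np.maximum(x, 0.0)
assert x[0] == -1.0                       # x unchanged

# In-place: the result is written back into x's own buffer
np.maximum(x, 0.0, out=x)                 # no new allocation
assert np.array_equal(x, [0.0, 2.0, 0.0])
```

In-place execution is only safe here because no other consumer still needs the original values of x; that is exactly the aliasing analysis the runtime performs.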
Execution Provider Optimizations
EPs can provide hardware-specific optimizations:
CUDA EP Optimizations
import onnxruntime as ort
cuda_options = {
    'arena_extend_strategy': 'kSameAsRequested',
    'cudnn_conv_algo_search': 'EXHAUSTIVE',  # Search for the best convolution algorithm
    'do_copy_in_default_stream': True,
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', cuda_options)]
)
CUDA-specific optimizations:
Kernel fusion (multiple ops in one CUDA kernel)
Memory coalescing
Shared memory utilization
cuDNN algorithm tuning
TensorRT EP Optimizations
import onnxruntime as ort
trt_options = {
    'trt_fp16_enable': True,
    'trt_int8_enable': False,
    'trt_max_workspace_size': 2 * 1024 * 1024 * 1024,  # 2 GB
    'trt_engine_cache_enable': True,
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[('TensorrtExecutionProvider', trt_options)]
)
TensorRT optimizations:
Layer fusion (vertical and horizontal)
Precision calibration (FP16, INT8)
Kernel auto-tuning
Dynamic tensor memory management
Custom Transformers
You can implement custom optimizations:
import onnxruntime as ort
from onnxruntime import InferenceSession, SessionOptions
# Register custom transformer (C++ implementation required)
sess_options = ort.SessionOptions()
# Custom transformers run at specified optimization level
# Requires building ONNX Runtime from source
Custom transformers require C++ implementation and building ONNX Runtime from source. See the Custom Operators guide for implementing custom functionality.
Inspecting Optimizations
Save Optimized Model
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"
# This saves the optimized graph
session = ort.InferenceSession("model.onnx", sess_options)
# Now inspect model_optimized.onnx to see applied optimizations
Verbose Logging
import onnxruntime as ort
sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # Verbose
session = ort.InferenceSession("model.onnx", sess_options)
# Look for log messages like:
# "Applied GraphTransformer: ConstantFolding"
# "Applied GraphTransformer: CommonSubexpressionElimination"
# "Fused Conv+BatchNorm+Relu into ConvBatchNormRelu"
Performance Impact
Typical performance improvements from optimizations:
Computer Vision (ResNet-50):
Basic: 5-10% faster
Extended: 20-40% faster
All: 30-50% faster
Key optimizations:
Conv+BN fusion
Activation fusions
Layout optimization
NLP Models (BERT):
Basic: 10-15% faster
Extended: 40-60% faster
All: 50-70% faster
Key optimizations:
Attention fusion
LayerNorm fusion
Embedding optimization
Actual speedup depends on model architecture, hardware, and input shapes. Always benchmark your specific use case.
Best Practices
Use Maximum Optimization in Production
# Production configuration
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
The initialization overhead is amortized over many inferences.
# Save optimized model for faster deployment
sess_options.optimized_model_filepath = "model_opt.onnx"
session = ort.InferenceSession("model.onnx", sess_options)
# Deploy model_opt.onnx in production
Pre-optimized models load faster.
Test Optimization Correctness
import numpy as np
import onnxruntime as ort

# Run with and without optimizations and compare outputs
def test_optimization(inputs):
    # Without optimization
    opts_off = ort.SessionOptions()
    opts_off.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
    sess_off = ort.InferenceSession("model.onnx", opts_off)
    out1 = sess_off.run(None, inputs)

    # With optimization
    opts_on = ort.SessionOptions()
    opts_on.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_on = ort.InferenceSession("model.onnx", opts_on)
    out2 = sess_on.run(None, inputs)

    # Compare outputs
    for a, b in zip(out1, out2):
        np.testing.assert_allclose(a, b, rtol=1e-5)
Benchmark with Profiling
sess_options = ort.SessionOptions()
sess_options.enable_profiling = True
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)
# Run benchmark
for _ in range(100):
    session.run(None, inputs)
prof_file = session.end_profiling()
# Analyze profiling data
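The saved trace is chrome-trace-style JSON. Assuming events carry a `dur` field (microseconds) and an `op_name` argument, as ONNX Runtime's profiler emits, per-operator totals can be aggregated like this (the sample events below are synthetic, standing in for `json.load(open(prof_file))`):

```python
from collections import defaultdict

def summarize_profile(events):
    """Aggregate kernel time per op type from chrome-trace-style events.
    Assumes each event has 'dur' (microseconds) and args['op_name']."""
    totals = defaultdict(int)
    for e in events:
        op = e.get("args", {}).get("op_name")
        if op:
            totals[op] += e.get("dur", 0)
    # Sort descending by total time so the hottest ops come first
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# Synthetic sample trace for illustration
sample = [
    {"name": "Conv_kernel_time", "dur": 120, "args": {"op_name": "Conv"}},
    {"name": "Relu_kernel_time", "dur": 10,  "args": {"op_name": "Relu"}},
    {"name": "Conv_kernel_time", "dur": 110, "args": {"op_name": "Conv"}},
]
print(summarize_profile(sample))  # Conv dominates the total time
```

Comparing these per-op totals before and after enabling optimizations shows exactly which fusions paid off.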
Troubleshooting
Optimization Increases Latency
# Try disabling specific optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
# Or disable optimization entirely
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
Numerical Differences
Optimizations are semantics-preserving but may have small numerical differences:
# If strict numerical reproducibility is required
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
Small numerical differences (on the order of 1e-6) are normal due to different operation orders. Larger differences indicate a bug.
Session Creation Too Slow
# Pre-optimize and save model
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_opt.onnx"
session = ort.InferenceSession("model.onnx", sess_options)
# In production, load pre-optimized model
session = ort.InferenceSession("model_opt.onnx")
Next Steps
Quantization: Further optimize models with quantization
Model Optimization: End-to-end model optimization workflow
Performance Tuning: Complete performance tuning guide
Performance Tuning: Profile and analyze model performance
Additional Resources