Graph optimization is a key feature of ONNX Runtime that improves inference performance by transforming the computational graph without changing its semantics. These optimizations reduce computation and memory usage and improve hardware utilization.

What are Graph Optimizations?

Graph optimizations are transformations applied to the ONNX computational graph:
  • Constant folding: Pre-compute constant expressions
  • Operator fusion: Combine multiple operators into a single kernel
  • Redundancy elimination: Remove unnecessary computations
  • Layout transformations: Optimize data layouts for hardware
Optimizations are semantics-preserving - they produce the same results while improving performance.

Optimization Levels

ONNX Runtime organizes optimizations into hierarchical levels: ORT_DISABLE_ALL, ORT_ENABLE_BASIC, ORT_ENABLE_EXTENDED, and ORT_ENABLE_ALL. Each level includes everything from the levels below it. The lowest level turns optimization off entirely:
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options)
No optimizations applied. Useful for:
  • Debugging
  • Validating optimization correctness
  • Ensuring bit-exact reproducibility
Higher optimization levels increase session creation time but improve inference performance. Use ORT_ENABLE_ALL for production.

Graph Transformer Architecture

ONNX Runtime uses a transformer-based architecture for optimizations:
// Simplified transformer interface
class GraphTransformer {
 public:
  // Apply transformation to graph
  Status Apply(Graph& graph, bool& modified) const;
  
  // Check if transformer should only run once
  virtual bool ShouldOnlyApplyOnce() const;
  
 protected:
  // Implementation-specific transformation logic
  virtual Status ApplyImpl(Graph& graph, bool& modified) const = 0;
};

Transformer Categories

Rule-Based

Pattern matching and replacement
  • EliminateIdentity
  • ConstantFolding
  • CommonSubexpressionElimination
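To make the rule-based category concrete, here is a toy sketch of common-subexpression elimination over a hypothetical node list (this is an illustration of the idea, not ONNX Runtime's internal IR or API): nodes with the same op type and the same inputs compute the same value, so duplicates can be dropped and their consumers rewired.

```python
# Toy common-subexpression elimination. Each node is
# (output_name, op_type, input_names) -- a hypothetical representation.

def eliminate_common_subexpressions(nodes):
    seen = {}      # (op_type, inputs) -> canonical output name
    renames = {}   # duplicate output -> canonical output
    result = []
    for out, op, inputs in nodes:
        # Rewrite inputs that referred to an eliminated duplicate
        inputs = tuple(renames.get(i, i) for i in inputs)
        key = (op, inputs)
        if key in seen:
            renames[out] = seen[key]  # drop the duplicate node
        else:
            seen[key] = out
            result.append((out, op, inputs))
    return result

nodes = [
    ("a", "Add", ("x", "y")),
    ("b", "Add", ("x", "y")),   # duplicate of "a"
    ("c", "Mul", ("a", "z")),
    ("d", "Mul", ("b", "z")),   # becomes a duplicate once b -> a
]
print(eliminate_common_subexpressions(nodes))
```

Note that eliminating one duplicate can expose another: once `b` is renamed to `a`, the second `Mul` matches the first, which is why transformers are applied repeatedly until the graph stops changing.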

Fusion

Combine multiple operators
  • ConvBatchNormFusion
  • MatMulAddFusion
  • GELUFusion

Layout

Data layout transformations
  • NCHWToNHWC
  • TransposeOptimizer

EP-Specific

Hardware-specific optimizations
  • CUDA kernel fusions
  • TensorRT subgraph compilation

Common Optimizations

Constant Folding

Pre-compute operations with constant inputs:
# Before: shape manipulation computed at runtime
input -> Shape -> Gather -> Unsqueeze -> Concat -> Reshape -> output
           ^        ^          ^
        constants  constants  constants

# After: with static input shapes, the Shape/Gather/Unsqueeze/Concat
# chain folds into a single precomputed constant feeding Reshape
input -> Reshape -> output
Constant folding is especially effective for models with dynamic shapes that use shape manipulation operations.
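The mechanism can be sketched in a few lines of Python (a hypothetical graph representation for illustration, not ONNX Runtime's implementation): any node whose inputs are all known constants is evaluated ahead of time and its output becomes a constant itself.

```python
import numpy as np

# Toy constant folding over (output_name, op_type, input_names) nodes.
OPS = {"Add": np.add, "Mul": np.multiply}

def constant_fold(nodes, constants):
    remaining = []
    for out, op, inputs in nodes:
        if all(i in constants for i in inputs):
            # All inputs known: evaluate now and store as a constant
            constants[out] = OPS[op](*(constants[i] for i in inputs))
        else:
            remaining.append((out, op, inputs))
    return remaining, constants

constants = {"two": np.array(2.0), "three": np.array(3.0)}
nodes = [
    ("six", "Mul", ("two", "three")),   # foldable: 2 * 3
    ("y",   "Add", ("x", "six")),       # depends on runtime input "x"
]
remaining, constants = constant_fold(nodes, constants)
print(remaining)          # only the Add survives
print(constants["six"])   # 6.0
```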

Operator Fusion

Combine multiple operators into a single fused kernel:
Conv + BatchNorm fusion benefits:
  • Reduces memory bandwidth
  • Fewer kernel launches
  • BatchNorm parameters fold into the Conv weights
Implementation:
# BatchNorm can be folded into Conv during inference
# new_weight = weight * (gamma / sqrt(var + eps))
# new_bias = beta + (bias - mean) * (gamma / sqrt(var + eps))
MatMul + Add fusion (Gemm) benefits:
  • Single kernel call
  • Better cache utilization
  • BLAS optimization (GEMM)
Common patterns:
  • Conv + Relu → ConvRelu
  • MatMul + Relu → GemmRelu
  • Add + Relu → AddRelu
  • LayerNorm + GELU → LayerNormGELU
Example:
# Before: Two separate kernels
x = matmul(A, B)
y = relu(x)

# After: Single fused kernel
y = matmul_relu(A, B)
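The BatchNorm-into-Conv folding formulas above can be checked numerically. This sketch uses a per-channel scale in place of a real convolution so plain numpy suffices; the algebra is identical.

```python
import numpy as np

# Verify: BN(Conv(x)) == Conv'(x) with folded weight and bias.
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # one channel, 4 spatial positions
weight, bias = 1.5, 0.2
gamma, beta, mean, var, eps = 0.8, -0.1, 0.3, 2.0, 1e-5

# Reference: Conv followed by BatchNorm
conv = weight * x + bias
bn = gamma * (conv - mean) / np.sqrt(var + eps) + beta

# Folded: a single Conv with adjusted parameters
scale = gamma / np.sqrt(var + eps)
new_weight = weight * scale
new_bias = beta + (bias - mean) * scale
folded = new_weight * x + new_bias

print("fold matches:", np.allclose(bn, folded))  # True
```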
Attention fusion combines the multi-head attention pattern into a single kernel. Benefits:
  • Massive reduction in memory transfers
  • Optimized attention kernels (FlashAttention)
  • Better GPU utilization

Redundancy Elimination

Remove unnecessary operations:
# Before
Input -> Identity -> Output

# After
Input -> Output
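The Identity elimination above can be sketched as a rewrite over a hypothetical node list (not ONNX Runtime's actual implementation): each Identity node is dropped, and consumers of its output are rewired to its input.

```python
# Toy redundancy elimination: drop Identity nodes by rewiring their
# consumers. Each node is (output_name, op_type, input_names).

def eliminate_identity(nodes):
    alias = {}   # identity output -> its input
    kept = []
    for out, op, inputs in nodes:
        inputs = tuple(alias.get(i, i) for i in inputs)
        if op == "Identity":
            alias[out] = inputs[0]   # out is just another name for the input
        else:
            kept.append((out, op, inputs))
    return kept

nodes = [
    ("t", "Identity", ("input",)),
    ("out", "Relu", ("t",)),
]
print(eliminate_identity(nodes))  # [("out", "Relu", ("input",))]
```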

Shape Inference

Propagate shape information through the graph:
import onnx
from onnx import shape_inference

# Infer shapes
model = onnx.load("model.onnx")
inferred_model = shape_inference.infer_shapes(model)

# Now all intermediate tensors have known shapes
# Enables more optimizations
Shape inference is automatic in ONNX Runtime but can be pre-computed for faster session initialization.

Layout Optimizations

Transform data layouts for optimal hardware execution:

NCHW vs NHWC

NCHW: [Batch, Channels, Height, Width], e.g. [1, 3, 224, 224]
Best for:
  • CUDA GPU operations
  • Standard ONNX format
  • Most deep learning frameworks
NHWC: [Batch, Height, Width, Channels], e.g. [1, 224, 224, 3]
Best for:
  • Many CPU and mobile kernels (e.g., XNNPACK)
  • NVIDIA Tensor Core convolutions
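At the tensor level, converting between the two layouts is just an axis transpose, as this numpy sketch shows:

```python
import numpy as np

# Layout conversion: NCHW <-> NHWC is a transpose of axes.
nchw = np.zeros((1, 3, 224, 224))      # [Batch, Channels, Height, Width]
nhwc = nchw.transpose(0, 2, 3, 1)      # [Batch, Height, Width, Channels]
print(nhwc.shape)  # (1, 224, 224, 3)

# Round-tripping restores the original layout, which is why back-to-back
# Transpose pairs inserted by layout optimization can be eliminated.
back = nhwc.transpose(0, 3, 1, 2)
print(back.shape)  # (1, 3, 224, 224)
```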

Automatic Layout Optimization

import onnxruntime as ort

sess_options = ort.SessionOptions()

# Level 3 includes layout optimizations
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# ONNX Runtime automatically:
# 1. Analyzes the graph
# 2. Determines optimal layout per operator
# 3. Inserts Transpose operations where needed
# 4. Attempts to eliminate redundant Transposes

session = ort.InferenceSession("model.onnx", sess_options)

Memory Optimizations

Memory Reuse Planning

ONNX Runtime plans memory reuse to minimize peak memory:
# Without memory reuse: every tensor gets a fresh buffer
tensor1 = allocate(1MB)  # Peak: 1MB
tensor2 = allocate(1MB)  # Peak: 2MB
tensor3 = allocate(1MB)  # Peak: 3MB

# With memory reuse: tensor1 is dead before tensor3 is created
tensor1 = allocate(1MB)  # Peak: 1MB
tensor2 = allocate(1MB)  # Peak: 2MB
free(tensor1)
tensor3 = reuse(tensor1) # Peak: 2MB (reuses tensor1's buffer)
Memory reuse reduces peak usage from 3MB to 2MB.
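A greedy planner over tensor lifetimes captures the idea (an illustrative sketch, not ONNX Runtime's actual planner): each tensor lives over an interval of execution steps, and a buffer whose previous tensor is already dead can be handed to the next tensor.

```python
# Greedy memory-reuse planning sketch: tensors of equal size with
# lifetimes [first_use, last_use] are packed into as few buffers as
# their overlaps allow.

def plan_buffers(lifetimes):
    buffers = []       # last-use step of the tensor currently in each buffer
    assignment = {}    # tensor name -> buffer index
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for i, free_at in enumerate(buffers):
            if start > free_at:            # the buffer's old tensor is dead
                buffers[i] = end
                assignment[name] = i
                break
        else:
            buffers.append(end)            # overlap: need a fresh buffer
            assignment[name] = len(buffers) - 1
    return assignment, len(buffers)

# Mirrors the 3-tensor example above: t3 can reuse t1's buffer
lifetimes = {"t1": (0, 1), "t2": (1, 3), "t3": (2, 4)}
assignment, peak = plan_buffers(lifetimes)
print(assignment, "peak buffers:", peak)
```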

In-Place Operations

Some operations can modify tensors in-place:
# Out-of-place: Requires new buffer
y = relu(x)  # x unchanged, y is new tensor

# In-place: Modifies x directly
relu_inplace(x)  # x modified, no new allocation
In-place operations require careful analysis to ensure correctness. ONNX Runtime automatically detects safe in-place opportunities.
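numpy makes the distinction easy to see: `out=` writes the result into an existing buffer instead of allocating a new one.

```python
import numpy as np

# Out-of-place vs in-place ReLU with numpy.
x = np.array([-1.0, 2.0, -3.0])

y = np.maximum(x, 0.0)        # out-of-place: new array, x unchanged
np.maximum(x, 0.0, out=x)     # in-place: x's own buffer is overwritten

print(y)   # [0. 2. 0.]
print(x)   # [0. 2. 0.]
```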

Execution Provider Optimizations

EPs can provide hardware-specific optimizations:

CUDA EP Optimizations

import onnxruntime as ort

cuda_options = {
    'arena_extend_strategy': 'kSameAsRequested',
    'cudnn_conv_algo_search': 'EXHAUSTIVE',  # Best convolution algorithm
    'do_copy_in_default_stream': True,
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', cuda_options)]
)
CUDA-specific optimizations:
  • Kernel fusion (multiple ops in one CUDA kernel)
  • Memory coalescing
  • Shared memory utilization
  • cuDNN algorithm tuning

TensorRT EP Optimizations

import onnxruntime as ort

trt_options = {
    'trt_fp16_enable': True,
    'trt_int8_enable': False,
    'trt_max_workspace_size': 2 * 1024 * 1024 * 1024,
    'trt_engine_cache_enable': True,
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('TensorrtExecutionProvider', trt_options)]
)
TensorRT optimizations:
  • Layer fusion (vertical and horizontal)
  • Precision calibration (FP16, INT8)
  • Kernel auto-tuning
  • Dynamic tensor memory management

Custom Graph Transformers

You can implement custom optimizations:
import onnxruntime as ort
from onnxruntime import InferenceSession, SessionOptions

# Register custom transformer (C++ implementation required)
sess_options = ort.SessionOptions()

# Custom transformers run at specified optimization level
# Requires building ONNX Runtime from source
Custom transformers require C++ implementation and building ONNX Runtime from source. See the Custom Operators guide for implementing custom functionality.

Inspecting Optimizations

Save Optimized Model

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"

# This saves the optimized graph
session = ort.InferenceSession("model.onnx", sess_options)

# Now inspect model_optimized.onnx to see applied optimizations

Verbose Logging

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # Verbose

session = ort.InferenceSession("model.onnx", sess_options)

# Look for log messages like:
# "Applied GraphTransformer: ConstantFolding"
# "Applied GraphTransformer: CommonSubexpressionElimination"
# "Fused Conv+BatchNorm+Relu into ConvBatchNormRelu"

Performance Impact

Typical performance improvements from optimizations:

Computer Vision

ResNet-50:
  • Basic: 5-10% faster
  • Extended: 20-40% faster
  • All: 30-50% faster
Key optimizations:
  • Conv+BN fusion
  • Activation fusions
  • Layout optimization

NLP Models

BERT:
  • Basic: 10-15% faster
  • Extended: 40-60% faster
  • All: 50-70% faster
Key optimizations:
  • Attention fusion
  • LayerNorm fusion
  • Embedding optimization
Actual speedup depends on model architecture, hardware, and input shapes. Always benchmark your specific use case.

Best Practices

# Production configuration
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
The initialization overhead is amortized over many inferences.
# Save optimized model for faster deployment
sess_options.optimized_model_filepath = "model_opt.onnx"
session = ort.InferenceSession("model.onnx", sess_options)

# Deploy model_opt.onnx in production
Pre-optimized models load faster.
Validate that optimizations preserve outputs:
import numpy as np

# Run with and without optimizations and compare
def test_optimization(inputs):
    opts_off = ort.SessionOptions()
    opts_off.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
    sess_off = ort.InferenceSession("model.onnx", opts_off)
    out1 = sess_off.run(None, inputs)

    opts_on = ort.SessionOptions()
    opts_on.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_on = ort.InferenceSession("model.onnx", opts_on)
    out2 = sess_on.run(None, inputs)

    # Compare outputs pairwise
    for a, b in zip(out1, out2):
        np.testing.assert_allclose(a, b, rtol=1e-5)
Profile to measure the impact of optimizations:
sess_options = ort.SessionOptions()
sess_options.enable_profiling = True
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options)

# Run benchmark
for _ in range(100):
    session.run(None, inputs)

prof_file = session.end_profiling()
# Analyze profiling data
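One way to analyze the profile is to aggregate time per node. The field names below (`cat`, `name`, `dur` in microseconds) follow the chrome-trace format the profiler emits, but verify them against your own output file; the events here are synthetic stand-ins for `json.load(open(prof_file))`.

```python
from collections import defaultdict

# Sum kernel time per node name from chrome-trace-style events.
def summarize(events):
    totals = defaultdict(int)
    for e in events:
        if e.get("cat") == "Node":       # per-node execution events
            totals[e["name"]] += e.get("dur", 0)
    return dict(totals)

# Synthetic events standing in for a real profiling file
events = [
    {"cat": "Node", "name": "Conv_0_kernel_time", "dur": 120},
    {"cat": "Node", "name": "Relu_1_kernel_time", "dur": 30},
    {"cat": "Node", "name": "Conv_0_kernel_time", "dur": 110},
]
print(summarize(events))
```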

Troubleshooting

Optimization Increases Latency

# Try disabling specific optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
# Or disable optimization entirely
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

Numerical Differences

Optimizations are semantics-preserving but may have small numerical differences:
# If strict numerical reproducibility is required
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
Small numerical differences (1e-6) are normal due to different operation orders. Larger differences indicate a bug.

Session Creation Too Slow

# Pre-optimize and save model
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_opt.onnx"
session = ort.InferenceSession("model.onnx", sess_options)

# In production, load pre-optimized model
session = ort.InferenceSession("model_opt.onnx")

Next Steps

Quantization

Further optimize models with quantization

Model Optimization

End-to-end model optimization workflow

Performance Tuning

Complete performance tuning guide

Profiling

Profile and analyze model performance
