Execution Providers (EPs) are the interfaces that enable ONNX Runtime to execute models on different hardware platforms. They provide hardware-specific optimizations and acceleration capabilities.

What are Execution Providers?

Execution Providers abstract the hardware-specific implementation details, allowing ONNX Runtime to:
  • Accelerate inference using specialized hardware (GPUs, NPUs, etc.)
  • Optimize operators for specific hardware architectures
  • Manage memory efficiently on target devices
  • Handle data transfer between different memory spaces
Think of Execution Providers as “backends” or “device drivers” for ONNX Runtime, similar to how TensorFlow has device placements or PyTorch has device types.

Architecture Overview

How EPs Work

1. Registration: Execution providers are registered with the session during initialization.
2. Capability Query: Each EP reports which nodes/subgraphs it can execute via GetCapability().
3. Graph Partitioning: ONNX Runtime partitions the graph across available EPs based on their capabilities.
4. Kernel Execution: Each node is executed by its assigned EP using hardware-specific kernels.

Available Execution Providers

CPUExecutionProvider

The default execution provider, always available:

Platforms

  • Windows, Linux, macOS
  • x86_64, ARM64, ARM32
  • WebAssembly

Features

  • Comprehensive operator coverage
  • SIMD optimizations (SSE, AVX, NEON)
  • Multi-threading support
  • Reference implementation
import onnxruntime as ort

# CPU is used by default
session = ort.InferenceSession("model.onnx")

# Explicit configuration
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4
session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['CPUExecutionProvider']
)

CUDAExecutionProvider

NVIDIA GPU acceleration using CUDA:
import onnxruntime as ort

# Use CUDA with default settings
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Configure CUDA options
cuda_options = {
    'device_id': 0,
    'arena_extend_strategy': 'kNextPowerOfTwo',
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
    'cudnn_conv_algo_search': 'EXHAUSTIVE',
    'do_copy_in_default_stream': True,
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', cuda_options),
               'CPUExecutionProvider']
)
| Option | Description | Default |
| --- | --- | --- |
| device_id | GPU device ID | 0 |
| gpu_mem_limit | Maximum GPU memory usage (bytes) | SIZE_MAX |
| arena_extend_strategy | Memory arena growth strategy | kNextPowerOfTwo |
| cudnn_conv_algo_search | cuDNN convolution algorithm search | EXHAUSTIVE |
| do_copy_in_default_stream | Use default CUDA stream for copies | True |
| cudnn_conv_use_max_workspace | Use maximum workspace for cuDNN | True |
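Byte-count options like gpu_mem_limit are easy to get wrong by a factor of 1024; a small helper (hypothetical, not part of the ONNX Runtime API) keeps the arithmetic readable:

```python
def gib(n: float) -> int:
    """Convert gibibytes to a byte count for options such as gpu_mem_limit."""
    return int(n * 1024 ** 3)

# Hypothetical usage; option names match the table above.
cuda_options = {
    'device_id': 0,
    'gpu_mem_limit': gib(2),  # 2 GiB
    'arena_extend_strategy': 'kNextPowerOfTwo',
}
```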

TensorRTExecutionProvider

Optimized inference using NVIDIA TensorRT:
import onnxruntime as ort

trt_options = {
    'device_id': 0,
    'trt_max_workspace_size': 2 * 1024 * 1024 * 1024,  # 2GB
    'trt_fp16_enable': True,  # Enable FP16 precision
    'trt_int8_enable': False,
    'trt_engine_cache_enable': True,
    'trt_engine_cache_path': './trt_cache',
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('TensorrtExecutionProvider', trt_options),
               'CUDAExecutionProvider',
               'CPUExecutionProvider']
)
TensorRT builds optimized engines at runtime. The first inference run will be slower as engines are built and cached.
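Engine caching only helps if the cache directory exists and is writable before the session is created. A defensive sketch (the helper and directory layout are illustrative, not part of the API):

```python
import os

def prepare_trt_cache(path: str) -> dict:
    """Create the engine cache directory if needed and return TensorRT EP options."""
    os.makedirs(path, exist_ok=True)
    return {
        'trt_engine_cache_enable': True,
        'trt_engine_cache_path': path,
    }
```

The returned dict can be passed as the options half of the ('TensorrtExecutionProvider', options) tuple shown above.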

DirectMLExecutionProvider

Hardware acceleration on Windows using DirectML:
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider', 'CPUExecutionProvider']
)

Advantages

  • Works with any DirectX 12 GPU
  • AMD, Intel, NVIDIA support
  • Built into Windows

Use Cases

  • Windows client applications
  • Cross-vendor GPU support
  • Integrated graphics

CoreMLExecutionProvider

Apple Silicon and iOS acceleration:
import onnxruntime as ort

coreml_options = {
    'MLComputeUnits': 'ALL',  # CPU_AND_GPU, CPU_ONLY, or ALL
}

session = ort.InferenceSession(
    "model.onnx",
    providers=[('CoreMLExecutionProvider', coreml_options),
               'CPUExecutionProvider']
)

Additional Execution Providers

OpenVINOExecutionProvider (Intel CPU, GPU, VPU, and FPGA acceleration):
openvino_options = {
    'device_type': 'CPU_FP32',  # CPU_FP32, GPU_FP32, GPU_FP16, etc.
    'num_of_threads': 8,
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[('OpenVINOExecutionProvider', openvino_options)]
)
NnapiExecutionProvider (Android Neural Networks API):
session = ort.InferenceSession(
    "model.onnx",
    providers=['NnapiExecutionProvider', 'CPUExecutionProvider']
)
AclExecutionProvider (ARM Compute Library for ARM CPUs):
session = ort.InferenceSession(
    "model.onnx",
    providers=['AclExecutionProvider', 'CPUExecutionProvider']
)
ROCMExecutionProvider (AMD GPU acceleration):
rocm_options = {
    'device_id': 0,
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[('ROCMExecutionProvider', rocm_options)]
)

EP Selection and Fallback

Provider Priority

Execution providers are tried in the order specified:
# TensorRT tried first, then CUDA, then CPU
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        'TensorrtExecutionProvider',
        'CUDAExecutionProvider',
        'CPUExecutionProvider'
    ]
)
If a provider cannot execute a node, that node falls back to the next provider in the list; CPUExecutionProvider is typically the last fallback.
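A small helper (hypothetical, not part of the ONNX Runtime API) can build a provider list that preserves your preferred order, drops providers not available in the current build, and always ends with CPU:

```python
def build_provider_list(preferred, available):
    """Keep preferred providers that are actually available, in order,
    and ensure CPUExecutionProvider is the final fallback."""
    providers = [p for p in preferred if p in available]
    if 'CPUExecutionProvider' not in providers:
        providers.append('CPUExecutionProvider')
    return providers
```

Typical usage would be build_provider_list(['TensorrtExecutionProvider', 'CUDAExecutionProvider'], ort.get_available_providers()).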

Checking Active Providers

import onnxruntime as ort

# Check available providers
print("Available providers:", ort.get_available_providers())

# Check session providers
session = ort.InferenceSession("model.onnx")
print("Session providers:", session.get_providers())

Graph Partitioning

ONNX Runtime partitions the graph across execution providers:

Capability Query

Each EP implements GetCapability() to report which nodes it can execute:
// Simplified EP capability interface
class IExecutionProvider {
  virtual std::vector<std::unique_ptr<ComputeCapability>>
  GetCapability(
    const GraphViewer& graph_viewer,
    const IKernelLookup& kernel_lookup
  ) const;
};
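The partitioning loop can be illustrated with a toy Python sketch (EP names and op sets here are invented for illustration, and the real algorithm also considers subgraph boundaries and data-transfer cost): each node goes to the highest-priority EP that claims its op type, with the CPU-like EP taking the rest.

```python
def partition(nodes, ep_supported_ops, ep_priority):
    """Toy model of EP-based graph partitioning.

    nodes: list of (node_name, op_type) pairs
    ep_supported_ops: dict mapping EP name -> set of op types it claims
    ep_priority: EP names in descending priority; assumes the last entry
        is a CPU-like EP that can execute anything.
    """
    assignment = {}
    for name, op in nodes:
        for ep in ep_priority:
            # Assign to the first (highest-priority) EP that claims this op,
            # or to the final fallback EP.
            if op in ep_supported_ops.get(ep, set()) or ep == ep_priority[-1]:
                assignment[name] = ep
                break
    return assignment
```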
Use verbose logging to see how nodes are partitioned:
sess_options = ort.SessionOptions()
sess_options.log_severity_level = 0  # Verbose
session = ort.InferenceSession("model.onnx", sess_options)

Data Transfer

Execution providers manage data transfer between memory spaces:

Memory Locations

  • CPU memory: Host memory accessible by CPU
  • GPU memory: Device memory on GPU
  • Shared memory: Accessible by both CPU and GPU

IOBinding for Efficient Transfer

Use IOBinding to avoid unnecessary data copies:
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=['CUDAExecutionProvider'])

# Create IOBinding
io_binding = session.io_binding()

# Bind input from CPU memory; ORT copies it to the device as needed
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
io_binding.bind_cpu_input('input', input_data)

# Allocate the output on the GPU
io_binding.bind_output('output', 'cuda')

# Run; intermediate results stay on the GPU
session.run_with_iobinding(io_binding)

# Copy the output back to host memory when needed
output = io_binding.copy_outputs_to_cpu()[0]

Custom Execution Providers

You can implement custom execution providers for specialized hardware:
// Simplified custom EP structure
class CustomExecutionProvider : public IExecutionProvider {
public:
  CustomExecutionProvider(const std::string& type, OrtDevice device)
      : IExecutionProvider(type, device) {}
  
  // Report which nodes this EP can execute
  std::vector<std::unique_ptr<ComputeCapability>>
  GetCapability(const GraphViewer& graph,
                const IKernelLookup& kernel_lookup) const override {
    // Inspect graph and return capability
  }
  
  // Get kernel registry
  std::shared_ptr<KernelRegistry> GetKernelRegistry() const override {
    return kernel_registry_;
  }
  
  // Data transfer implementation
  std::unique_ptr<IDataTransfer> GetDataTransfer() const override {
    return std::make_unique<CustomDataTransfer>();
  }
};
Building custom execution providers requires compiling ONNX Runtime from source. See the Custom Operators Guide for details.

Performance Considerations

Choose the right provider for your hardware:
  • CPU: Good for small models, low latency, or no GPU available
  • CUDA: Best for NVIDIA GPUs, good operator coverage
  • TensorRT: Maximum performance on NVIDIA GPUs, longer warmup
  • DirectML: Cross-vendor on Windows, good for client applications
Minimize data transfer between providers:
  • Prefer EPs that can execute entire subgraphs
  • CPU-GPU transfers are expensive
  • Use IOBinding to reduce copies
Configure memory limits appropriately:
cuda_options = {
    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4GB
    'arena_extend_strategy': 'kSameAsRequested',  # More predictable
}
First inference may be slower due to:
  • Kernel compilation
  • Memory allocation
  • Engine building (TensorRT)
Run warmup inferences before measuring performance:
# Warmup
for _ in range(10):
    session.run(None, {"input": dummy_input})

# Now measure performance
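The warmup-then-measure pattern can be wrapped in a small benchmarking sketch; run here stands for any zero-argument callable that performs one inference, e.g. lambda: session.run(None, {"input": dummy_input}):

```python
import time
import statistics

def benchmark(run, warmup=10, iters=50):
    """Warm up, then time `run` and report mean and median latency in ms."""
    for _ in range(warmup):
        run()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        times.append((time.perf_counter() - start) * 1000.0)
    return {'mean_ms': statistics.fmean(times), 'p50_ms': statistics.median(times)}
```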

Troubleshooting

Provider Not Available

import onnxruntime as ort

if 'CUDAExecutionProvider' not in ort.get_available_providers():
    print("CUDA provider not available")
    print("Available:", ort.get_available_providers())
    # Fallback to CPU

Mixed Precision Issues

Some providers support different precisions:
# TensorRT with FP16
trt_options = {
    'trt_fp16_enable': True,
    'trt_strict_type_constraints': False,  # Allow mixed precision
}
FP16 may produce different results than FP32. Always validate accuracy when using reduced precision.
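Accuracy validation can be as simple as comparing FP16 outputs against an FP32 reference element-wise; a NumPy sketch (the tolerances are illustrative and should be tuned per model):

```python
import numpy as np

def check_fp16_accuracy(ref_fp32, out_fp16, rtol=1e-2, atol=1e-3):
    """Compare reduced-precision outputs against an FP32 reference.

    Returns (within_tolerance, max_absolute_difference)."""
    ref = np.asarray(ref_fp32, dtype=np.float32)
    out = np.asarray(out_fp16, dtype=np.float32)
    max_abs = float(np.max(np.abs(ref - out)))
    ok = bool(np.allclose(ref, out, rtol=rtol, atol=atol))
    return ok, max_abs
```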

Memory Errors

Reduce memory usage:
cuda_options = {
    'gpu_mem_limit': 1 * 1024 * 1024 * 1024,  # Reduce limit
    'arena_extend_strategy': 'kSameAsRequested',
}

Best Practices

Always Include CPU

Always include CPUExecutionProvider as fallback:
providers=['CUDAExecutionProvider', 'CPUExecutionProvider']

Test on Target Hardware

Performance varies significantly across hardware. Always profile on deployment targets.

Use IOBinding

Use IOBinding for better performance when doing multiple inferences.

Cache Engines

Enable engine caching for TensorRT:
{'trt_engine_cache_enable': True}

Next Steps

Sessions

Learn about InferenceSession configuration and management

Graph Optimizations

Understand how graph optimizations improve performance

Performance Tuning

Optimize inference performance for your use case

Quantization

Reduce model size and improve speed with quantization