CUDA Execution Provider

The CUDA Execution Provider enables GPU acceleration for ONNX Runtime on NVIDIA GPUs using CUDA and cuDNN libraries.

When to Use CUDA EP

Use the CUDA Execution Provider when:
  • You have NVIDIA GPUs (compute capability 6.0+)
  • You need general-purpose GPU acceleration
  • You want quick setup without TensorRT complexity
  • You’re developing and testing before optimizing with TensorRT
  • Your model has operators not supported by TensorRT

Prerequisites

Hardware Requirements

  • NVIDIA GPU with compute capability 6.0 or higher
  • Recommended: 4GB+ GPU memory

Software Requirements

  • CUDA Toolkit: 11.8 or 12.x
  • cuDNN: 8.x (matching your CUDA version)
  • ONNX Runtime GPU package

Installation

Python

# Install ONNX Runtime with GPU support
pip install onnxruntime-gpu

# Verify CUDA is available
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include 'CUDAExecutionProvider'

C++

Download the GPU build from the ONNX Runtime releases page:
# Linux
wget https://github.com/microsoft/onnxruntime/releases/download/v{version}/onnxruntime-linux-x64-gpu-{version}.tgz
tar -xzf onnxruntime-linux-x64-gpu-{version}.tgz

C#

# Install NuGet packages
dotnet add package Microsoft.ML.OnnxRuntime.Gpu

Basic Usage

Python

import onnxruntime as ort
import numpy as np

# Create session with CUDA provider
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
results = session.run(None, {input_name: x})

C++

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "CUDAExample");
Ort::SessionOptions session_options;

// Configure CUDA provider
OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;
cuda_options.arena_extend_strategy = 0;  // 0 = kNextPowerOfTwo, 1 = kSameAsRequested
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;
cuda_options.do_copy_in_default_stream = 1;

session_options.AppendExecutionProvider_CUDA(cuda_options);

// Create session
Ort::Session session(env, "model.onnx", session_options);

// Run inference (input_names, input_tensor, and output_names are assumed
// to have been prepared beforehand)
auto output_tensors = session.Run(Ort::RunOptions{nullptr},
                                   input_names.data(),
                                   &input_tensor, 1,
                                   output_names.data(), 1);

C#

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_CUDA(0);

using var session = new InferenceSession("model.onnx", sessionOptions);

var inputMeta = session.InputMetadata;
var name = inputMeta.Keys.First();
var shape = inputMeta[name].Dimensions;

var tensor = new DenseTensor<float>(shape);
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor(name, tensor) };

using var results = session.Run(inputs);

Configuration Options

Python Provider Options

import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ('CUDAExecutionProvider', {
            # GPU device ID (0, 1, 2, etc.)
            'device_id': 0,
            
            # Memory arena configuration
            'arena_extend_strategy': 'kNextPowerOfTwo',  # or 'kSameAsRequested'
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB limit
            
            # cuDNN convolution algorithm search
            'cudnn_conv_algo_search': 'EXHAUSTIVE',  # or 'HEURISTIC', 'DEFAULT'
            
            # Use default stream for memory copies
            'do_copy_in_default_stream': True,
            
            # Enable CUDA graph capture for better performance
            'enable_cuda_graph': False,
            
            # Use TF32 for matrix operations (Ampere GPUs)
            'use_tf32': True,
            
            # Prefer NHWC layout for better performance
            'prefer_nhwc': False,
            
            # Enable tunable operators
            'tunable_op_enable': False,
            'tunable_op_tuning_enable': False,
        }),
        'CPUExecutionProvider'
    ]
)

Key Configuration Parameters

device_id

Specifies which GPU to use (0, 1, 2, etc.). Use when you have multiple GPUs.
# Use second GPU
providers=[('CUDAExecutionProvider', {'device_id': 1})]

gpu_mem_limit

Caps the GPU memory the arena allocator may reserve. Useful for preventing out-of-memory errors or for sharing a GPU between multiple processes.
# Limit to 4GB
'gpu_mem_limit': 4 * 1024 * 1024 * 1024
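Since gpu_mem_limit takes a raw byte count, a tiny helper keeps the arithmetic out of the options dict. (`gib` is a hypothetical convenience function, not part of the ONNX Runtime API.)

```python
# Hypothetical helper (not an ONNX Runtime API): convert GiB to the raw
# byte count that gpu_mem_limit expects.
def gib(n):
    return n * 1024 ** 3

print(gib(4))  # 4294967296 bytes, i.e. a 4GB limit
```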
cudnn_conv_algo_search

Controls how cuDNN selects convolution algorithms:
  • EXHAUSTIVE: Benchmarks all algorithms; slowest first run, best steady-state performance
  • HEURISTIC: Fast selection, good for development
  • DEFAULT: Uses the cuDNN default algorithm
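A common pattern is to switch the search mode per environment, paying the EXHAUSTIVE warm-up cost only in production. This sketch builds the providers list without creating a session; `cuda_providers` is a hypothetical helper, not part of ONNX Runtime.

```python
# Illustrative sketch: choose the cuDNN algorithm search mode per
# environment, leaving the rest of the provider setup unchanged.
def cuda_providers(production):
    algo = 'EXHAUSTIVE' if production else 'HEURISTIC'
    return [('CUDAExecutionProvider', {'cudnn_conv_algo_search': algo}),
            'CPUExecutionProvider']

print(cuda_providers(production=False))
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.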

enable_cuda_graph

Captures CUDA operations into a graph for better performance. Requires static input shapes.
'enable_cuda_graph': True

use_tf32

Uses TensorFloat-32 on NVIDIA Ampere GPUs (RTX 30/40 series, A100) for faster matrix operations with minimal accuracy impact.
'use_tf32': True  # Default on Ampere+ GPUs

Performance Optimization

Memory Management

Arena Allocation Strategy
# Allocate memory in power-of-two chunks (default)
'arena_extend_strategy': 'kNextPowerOfTwo'

# Allocate exact amount needed (may reduce waste)
'arena_extend_strategy': 'kSameAsRequested'
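To see why the two strategies trade waste for fewer extensions, here is a pure-Python illustration (not ORT code) of how much memory each would reserve for the same requests, under the simplifying assumption of one arena extension per request:

```python
# Illustrative arithmetic only: reserved arena bytes per extend strategy,
# assuming each allocation request triggers one arena extension.
def next_power_of_two(n):
    p = 1
    while p < n:
        p *= 2
    return p

def arena_reserved(requests, strategy):
    if strategy == 'kNextPowerOfTwo':
        return sum(next_power_of_two(r) for r in requests)
    return sum(requests)  # kSameAsRequested: exact amounts

requests = [300, 600, 1500]  # bytes
print(arena_reserved(requests, 'kNextPowerOfTwo'))   # 3584
print(arena_reserved(requests, 'kSameAsRequested'))  # 2400
```

Power-of-two chunks over-reserve but leave headroom for later requests; exact-size chunks minimize waste at the cost of more frequent extensions.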
Set Memory Limit
# Prevent OOM, allow multi-process usage
'gpu_mem_limit': 2 * 1024 * 1024 * 1024  # 2GB

I/O Binding (Zero-Copy)

Avoid CPU-GPU data transfers by binding GPU memory directly:
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=['CUDAExecutionProvider'])

# Create I/O binding
io_binding = session.io_binding()

# Bind input to GPU
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x, 'cuda', 0)
io_binding.bind_input(
    name=input_name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=x.shape,
    buffer_ptr=x_ortvalue.data_ptr()
)

# Bind output to GPU
output_name = session.get_outputs()[0].name
io_binding.bind_output(output_name, 'cuda')

# Run inference
session.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()

CUDA Streams

Use custom CUDA streams for advanced control:
import onnxruntime as ort
import torch  # For stream creation

cuda_stream = torch.cuda.Stream().cuda_stream

session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'CUDAExecutionProvider', {
            # the stream handle must be passed as a string
            'user_compute_stream': str(cuda_stream)
        }
    )]
)

Multi-GPU

Run different sessions on different GPUs:
import onnxruntime as ort
from multiprocessing import Process

def run_on_gpu(gpu_id, model_path):
    session = ort.InferenceSession(
        model_path,
        providers=[('CUDAExecutionProvider', {'device_id': gpu_id})]
    )
    # Run inference...

# Launch on multiple GPUs
processes = []
for gpu_id in [0, 1, 2, 3]:
    p = Process(target=run_on_gpu, args=(gpu_id, "model.onnx"))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
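When there are more worker processes than GPUs, a round-robin assignment spreads them evenly. (`assign_gpus` is a hypothetical helper for illustration, not an ONNX Runtime API.)

```python
# Hypothetical helper: map worker index -> device_id, round-robin
# across the available GPUs.
def assign_gpus(num_workers, num_gpus):
    return [w % num_gpus for w in range(num_workers)]

print(assign_gpus(6, 4))  # [0, 1, 2, 3, 0, 1]
```

Each worker then passes its assigned ID as `device_id` in its provider options, as in the `run_on_gpu` example above.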

Platform Support

Platform       | Support    | Notes
Linux x64      | ✅ Full    | Best performance
Windows x64    | ✅ Full    | Full feature support
Linux ARM64    | ✅ Full    | NVIDIA Jetson
Windows ARM64  | ⚠️ Limited | Experimental
macOS          | ❌ No      | Use CPU EP

Supported GPUs

Desktop GPUs

  • RTX 40 Series (Ada Lovelace)
  • RTX 30 Series (Ampere)
  • RTX 20 Series (Turing)
  • GTX 16 Series (Turing)
  • GTX 10 Series (Pascal)

Data Center GPUs

  • H100 (Hopper); A100, A40, A30, A10 (Ampere)
  • V100, T4 (Volta/Turing)
  • P100, P40 (Pascal)

Embedded/Edge

  • Jetson AGX Orin
  • Jetson Orin Nano/NX
  • Jetson Xavier NX/AGX
  • Jetson Nano (limited)

Troubleshooting

Provider Not Available

import onnxruntime as ort
print(ort.get_available_providers())
# If 'CUDAExecutionProvider' is missing:
# 1. Check CUDA/cuDNN installation
# 2. Verify onnxruntime-gpu is installed
# 3. Check CUDA version compatibility

Out of Memory Errors

# Set memory limit
session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', {
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024
    })]
)

# Or use smaller batch sizes

Performance Issues

  1. Enable EXHAUSTIVE conv search:
    'cudnn_conv_algo_search': 'EXHAUSTIVE'
    
  2. Use I/O binding for repeated inference
  3. Enable CUDA graph if input shapes are static:
    'enable_cuda_graph': True
    
  4. Check GPU utilization: Use nvidia-smi to monitor GPU usage
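The tuning steps above can be combined into a single provider-options dictionary. The values shown are starting points under the assumptions stated in the comments, not universal defaults:

```python
# Starting-point tuning profile for repeated inference. Assumptions:
# static input shapes (required by CUDA graph) and a card with >= 4GB
# of memory to spare.
tuned_options = {
    'device_id': 0,
    'cudnn_conv_algo_search': 'EXHAUSTIVE',   # best algorithms after warm-up
    'enable_cuda_graph': True,                # static input shapes only
    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4GB cap
}
providers = [('CUDAExecutionProvider', tuned_options), 'CPUExecutionProvider']
print(providers[0][0])
```

Pair this with I/O binding (see above) so repeated runs avoid CPU-GPU copies, and confirm the effect with nvidia-smi.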

Next Steps