CUDA Execution Provider

The CUDA Execution Provider enables GPU acceleration for ONNX Runtime on NVIDIA GPUs using CUDA and cuDNN libraries.

When to Use CUDA EP

Use the CUDA Execution Provider when:
  • You have NVIDIA GPUs (compute capability 6.0+)
  • You need general-purpose GPU acceleration
  • You want quick setup without TensorRT complexity
  • You’re developing and testing before optimizing with TensorRT
  • Your model has operators not supported by TensorRT

Prerequisites

Hardware Requirements

  • NVIDIA GPU with compute capability 6.0 or higher
  • Recommended: 4GB+ GPU memory

Software Requirements

  • CUDA Toolkit: 11.8 or 12.x
  • cuDNN: 8.x (matching your CUDA version)
  • ONNX Runtime GPU package

Installation

Python

# Install ONNX Runtime with GPU support
pip install onnxruntime-gpu

# Verify CUDA is available
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include 'CUDAExecutionProvider'

C++

Download the GPU build from the ONNX Runtime releases page:
# Linux
wget https://github.com/microsoft/onnxruntime/releases/download/v{version}/onnxruntime-linux-x64-gpu-{version}.tgz
tar -xzf onnxruntime-linux-x64-gpu-{version}.tgz

C#

# Install NuGet packages
dotnet add package Microsoft.ML.OnnxRuntime.Gpu

Basic Usage

Python

import onnxruntime as ort
import numpy as np

# Create session with CUDA provider
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
results = session.run(None, {input_name: x})

C++

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "CUDAExample");
Ort::SessionOptions session_options;

// Configure CUDA provider
OrtCUDAProviderOptions cuda_options;
cuda_options.device_id = 0;
cuda_options.arena_extend_strategy = 0;  // 0 = kNextPowerOfTwo, 1 = kSameAsRequested
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;
cuda_options.do_copy_in_default_stream = 1;

session_options.AppendExecutionProvider_CUDA(cuda_options);

// Create session
Ort::Session session(env, "model.onnx", session_options);

// Run inference (input_names, input_tensor, and output_names are assumed
// to have been prepared beforehand)
auto output_tensors = session.Run(Ort::RunOptions{nullptr},
                                   input_names.data(),
                                   &input_tensor, 1,
                                   output_names.data(), 1);

C#

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_CUDA(0);

using var session = new InferenceSession("model.onnx", sessionOptions);

var inputMeta = session.InputMetadata;
var name = inputMeta.Keys.First();
var shape = inputMeta[name].Dimensions;

var tensor = new DenseTensor<float>(shape);
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor(name, tensor) };

using var results = session.Run(inputs);

Configuration Options

Python Provider Options

import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ('CUDAExecutionProvider', {
            # GPU device ID (0, 1, 2, etc.)
            'device_id': 0,
            
            # Memory arena configuration
            'arena_extend_strategy': 'kNextPowerOfTwo',  # or 'kSameAsRequested'
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB limit
            
            # cuDNN convolution algorithm search
            'cudnn_conv_algo_search': 'EXHAUSTIVE',  # or 'HEURISTIC', 'DEFAULT'
            
            # Use default stream for memory copies
            'do_copy_in_default_stream': True,
            
            # Enable CUDA graph capture for better performance
            'enable_cuda_graph': False,
            
            # Use TF32 for matrix operations (Ampere GPUs)
            'use_tf32': True,
            
            # Prefer NHWC layout for better performance
            'prefer_nhwc': False,
            
            # Enable tunable operators
            'tunable_op_enable': False,
            'tunable_op_tuning_enable': False,
        }),
        'CPUExecutionProvider'
    ]
)

Key Configuration Parameters

device_id

Specifies which GPU to use (0, 1, 2, etc.). Use when you have multiple GPUs.
# Use second GPU
providers=[('CUDAExecutionProvider', {'device_id': 1})]

gpu_mem_limit

Caps the GPU memory the arena allocator may reserve. Useful for preventing out-of-memory errors or for sharing a GPU between multiple processes.
# Limit to 4GB
'gpu_mem_limit': 4 * 1024 * 1024 * 1024
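Since gpu_mem_limit takes a raw byte count, a tiny helper keeps the arithmetic out of the options dict. (`gib` is a hypothetical convenience function, not part of the ONNX Runtime API.)

```python
# Hypothetical helper (not an ONNX Runtime API): convert GiB to the raw
# byte count that gpu_mem_limit expects.
def gib(n):
    return n * 1024 ** 3

print(gib(4))  # 4294967296 bytes, i.e. a 4GB limit
```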
cudnn_conv_algo_search

Controls how cuDNN selects convolution algorithms:
  • EXHAUSTIVE: Benchmarks all algorithms; slowest first run, best steady-state performance
  • HEURISTIC: Fast selection, good for development
  • DEFAULT: Uses the cuDNN default algorithm
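A common pattern is to switch the search mode per environment, paying the EXHAUSTIVE warm-up cost only in production. This sketch builds the providers list without creating a session; `cuda_providers` is a hypothetical helper, not part of ONNX Runtime.

```python
# Illustrative sketch: choose the cuDNN algorithm search mode per
# environment, leaving the rest of the provider setup unchanged.
def cuda_providers(production):
    algo = 'EXHAUSTIVE' if production else 'HEURISTIC'
    return [('CUDAExecutionProvider', {'cudnn_conv_algo_search': algo}),
            'CPUExecutionProvider']

print(cuda_providers(production=False))
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.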

enable_cuda_graph

Captures CUDA operations into a graph for better performance. Requires static input shapes.
'enable_cuda_graph': True

use_tf32

Uses TensorFloat-32 on NVIDIA Ampere GPUs (RTX 30/40 series, A100) for faster matrix operations with minimal accuracy impact.
'use_tf32': True  # Default on Ampere+ GPUs

Performance Optimization

Memory Management

Arena Allocation Strategy
# Allocate memory in power-of-two chunks (default)
'arena_extend_strategy': 'kNextPowerOfTwo'

# Allocate exact amount needed (may reduce waste)
'arena_extend_strategy': 'kSameAsRequested'
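To see why the two strategies trade waste for fewer extensions, here is a pure-Python illustration (not ORT code) of how much memory each would reserve for the same requests, under the simplifying assumption of one arena extension per request:

```python
# Illustrative arithmetic only: reserved arena bytes per extend strategy,
# assuming each allocation request triggers one arena extension.
def next_power_of_two(n):
    p = 1
    while p < n:
        p *= 2
    return p

def arena_reserved(requests, strategy):
    if strategy == 'kNextPowerOfTwo':
        return sum(next_power_of_two(r) for r in requests)
    return sum(requests)  # kSameAsRequested: exact amounts

requests = [300, 600, 1500]  # bytes
print(arena_reserved(requests, 'kNextPowerOfTwo'))   # 3584
print(arena_reserved(requests, 'kSameAsRequested'))  # 2400
```

Power-of-two chunks over-reserve but leave headroom for later requests; exact-size chunks minimize waste at the cost of more frequent extensions.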
Set Memory Limit
# Prevent OOM, allow multi-process usage
'gpu_mem_limit': 2 * 1024 * 1024 * 1024  # 2GB

I/O Binding (Zero-Copy)

Avoid CPU-GPU data transfers by binding GPU memory directly:
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx", providers=['CUDAExecutionProvider'])

# Create I/O binding
io_binding = session.io_binding()

# Bind input to GPU
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_ortvalue = ort.OrtValue.ortvalue_from_numpy(x, 'cuda', 0)
io_binding.bind_input(
    name=input_name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=x.shape,
    buffer_ptr=x_ortvalue.data_ptr()
)

# Bind output to GPU
output_name = session.get_outputs()[0].name
io_binding.bind_output(output_name, 'cuda')

# Run inference
session.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()

CUDA Streams

Use custom CUDA streams for advanced control:
import onnxruntime as ort
import torch  # For stream creation

cuda_stream = torch.cuda.Stream().cuda_stream

session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'CUDAExecutionProvider', {
            # the stream handle must be passed as a string
            'user_compute_stream': str(cuda_stream)
        }
    )]
)

Multi-GPU

Run different sessions on different GPUs:
import onnxruntime as ort
from multiprocessing import Process

def run_on_gpu(gpu_id, model_path):
    session = ort.InferenceSession(
        model_path,
        providers=[('CUDAExecutionProvider', {'device_id': gpu_id})]
    )
    # Run inference...

# Launch on multiple GPUs
processes = []
for gpu_id in [0, 1, 2, 3]:
    p = Process(target=run_on_gpu, args=(gpu_id, "model.onnx"))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
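When there are more worker processes than GPUs, a round-robin assignment spreads them evenly. (`assign_gpus` is a hypothetical helper for illustration, not an ONNX Runtime API.)

```python
# Hypothetical helper: map worker index -> device_id, round-robin
# across the available GPUs.
def assign_gpus(num_workers, num_gpus):
    return [w % num_gpus for w in range(num_workers)]

print(assign_gpus(6, 4))  # [0, 1, 2, 3, 0, 1]
```

Each worker then passes its assigned ID as `device_id` in its provider options, as in the `run_on_gpu` example above.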

Platform Support

Platform       | Support    | Notes
Linux x64      | ✅ Full    | Best performance
Windows x64    | ✅ Full    | Full feature support
Linux ARM64    | ✅ Full    | NVIDIA Jetson
Windows ARM64  | ⚠️ Limited | Experimental
macOS          | ❌ No      | Use CPU EP

Supported GPUs

Desktop GPUs

  • RTX 40 Series (Ada Lovelace)
  • RTX 30 Series (Ampere)
  • RTX 20 Series (Turing)
  • GTX 16 Series (Turing)
  • GTX 10 Series (Pascal)

Data Center GPUs

  • H100 (Hopper); A100, A40, A30, A10 (Ampere)
  • V100, T4 (Volta/Turing)
  • P100, P40 (Pascal)

Embedded/Edge

  • Jetson AGX Orin
  • Jetson Orin Nano/NX
  • Jetson Xavier NX/AGX
  • Jetson Nano (limited)

Troubleshooting

Provider Not Available

import onnxruntime as ort
print(ort.get_available_providers())
# If 'CUDAExecutionProvider' is missing:
# 1. Check CUDA/cuDNN installation
# 2. Verify onnxruntime-gpu is installed
# 3. Check CUDA version compatibility

Out of Memory Errors

# Set memory limit
session = ort.InferenceSession(
    "model.onnx",
    providers=[('CUDAExecutionProvider', {
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024
    })]
)

# Or use smaller batch sizes

Performance Issues

  1. Enable EXHAUSTIVE conv search:
    'cudnn_conv_algo_search': 'EXHAUSTIVE'
    
  2. Use I/O binding for repeated inference
  3. Enable CUDA graph if input shapes are static:
    'enable_cuda_graph': True
    
  4. Check GPU utilization: Use nvidia-smi to monitor GPU usage
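The tuning steps above can be combined into a single provider-options dictionary. The values shown are starting points under the assumptions stated in the comments, not universal defaults:

```python
# Starting-point tuning profile for repeated inference. Assumptions:
# static input shapes (required by CUDA graph) and a card with >= 4GB
# of memory to spare.
tuned_options = {
    'device_id': 0,
    'cudnn_conv_algo_search': 'EXHAUSTIVE',   # best algorithms after warm-up
    'enable_cuda_graph': True,                # static input shapes only
    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4GB cap
}
providers = [('CUDAExecutionProvider', tuned_options), 'CPUExecutionProvider']
print(providers[0][0])
```

Pair this with I/O binding (see above) so repeated runs avoid CPU-GPU copies, and confirm the effect with nvidia-smi.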

Next Steps