CUDA Execution Provider
The CUDA execution provider enables high-performance inference on NVIDIA GPUs with optimized kernels for generative AI workloads.

Requirements

Hardware

  • NVIDIA GPU with Compute Capability 6.0 or higher
  • Recommended: RTX 20 series or newer for optimal performance
  • Minimum: 4GB GPU memory (varies by model size)

Software

  • CUDA Toolkit 11.8 or 12.x
  • cuDNN 8.x or later
  • NVIDIA driver 520.61.05 or newer (Linux) / 528.33 or newer (Windows)
The CUDA provider is included in the onnxruntime-genai-cuda package.

Installation

pip install onnxruntime-genai-cuda --pre

Basic Configuration

Python API

import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set CUDA provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

# Generate a completion
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Tell me about AI"))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))

genai_config.json

Configure CUDA in your model configuration:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "cuda": {}
          }
        ]
      }
    }
  }
}

GPU Memory Management

Memory Allocation

The CUDA provider uses ONNX Runtime’s allocator for GPU memory:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Enable CUDA graph and cap GPU memory usage
config.set_provider_option("cuda", "enable_cuda_graph", "1")
config.set_provider_option("cuda", "gpu_mem_limit", "4294967296")  # 4 GiB

model = og.Model(config)
GPU Memory Limit: Set gpu_mem_limit to restrict CUDA memory usage (in bytes). This is useful for multi-tenant scenarios.
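Since gpu_mem_limit takes a raw byte count as a string, it is easy to get the value wrong by a factor of 1024. A minimal helper for the conversion (gpu_mem_limit_bytes is illustrative, not part of the og API):

```python
def gpu_mem_limit_bytes(gib: float) -> str:
    """Convert a GiB budget into the byte-count string gpu_mem_limit expects."""
    return str(int(gib * 1024 ** 3))

# A 4 GiB limit, matching the example above:
limit = gpu_mem_limit_bytes(4)  # "4294967296"
```

This could then be passed as config.set_provider_option("cuda", "gpu_mem_limit", gpu_mem_limit_bytes(4)).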

KV Cache Management

ONNX Runtime GenAI optimizes key-value cache management:
params = og.GeneratorParams(model)

# Configure cache sharing for better memory efficiency
params.set_search_options(
    max_length=2048,
    past_present_share_buffer=True  # Reduces memory usage
)
When using beam search (num_beams > 1), past_present_share_buffer must be set to False. CUDA graph is also incompatible with beam search.
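Because buffer sharing (and CUDA graph) are only valid for greedy search, one way to keep the constraint in a single place is a small helper that derives compatible options from the beam count. This is a sketch, not part of the og API:

```python
def compatible_search_options(num_beams: int, max_length: int) -> dict:
    """Build search options that respect the beam-search constraint above."""
    greedy = num_beams == 1
    return {
        "max_length": max_length,
        "num_beams": num_beams,
        # Buffer sharing only works with greedy search (num_beams == 1).
        "past_present_share_buffer": greedy,
    }

opts = compatible_search_options(num_beams=1, max_length=2048)
# params.set_search_options(**opts)
```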

Performance Tuning

CUDA Graph Optimization

Enable CUDA graphs to reduce kernel launch overhead:
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Enable CUDA graph (for greedy search only)
config.set_provider_option("cuda", "enable_cuda_graph", "1")

model = og.Model(config)
CUDA graphs are only compatible with greedy search (num_beams=1). Disable for beam search scenarios.

Multi-Profile Support

Optimize for multiple sequence lengths:
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_cuda_graph": "1",
        "cuda_graph_enable_multi_profile": "1"
      }
    }
  }
}

Stream Configuration

The CUDA provider uses a dedicated stream for async operations. Memory transfers are optimized using pinned host memory:
// Illustrative C++ sketch: g_stream holds the provider's dedicated cudaStream_t
void* GetStream() {
  return g_stream.get();
}

// Async memory copy
cudaMemcpyAsync(dst, src, size, cudaMemcpyDeviceToHost, GetStream());
cudaStreamSynchronize(GetStream());

Precision Options

FP16 Inference

The CUDA provider supports FP16 for reduced memory and faster inference:
# Use FP16 model variant
model_path = "model-fp16-cuda"

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
For comparison, the default FP32 variant:

  • Memory: higher usage (FP16 roughly halves weight memory)
  • Speed: baseline performance (FP16 enables Tensor Core acceleration)
  • Precision: full precision
  • Hardware: all CUDA GPUs

Advanced Configuration

Device Selection

Select a specific GPU in multi-GPU systems:
import os

# Set CUDA device before loading model
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
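CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA context is created, i.e., before og.Model runs. A small wrapper makes that ordering explicit (select_gpu is a hypothetical helper, not an og API):

```python
import os

def select_gpu(index: int) -> None:
    """Pin this process to one GPU; must run before any CUDA initialization."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

select_gpu(1)  # use the second GPU
```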

Session Options

Fine-tune ONNX Runtime session settings:
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_mem_pattern": true,
        "enable_cpu_mem_arena": false,
        "intra_op_num_threads": 1,
        "provider_options": [
          {
            "cuda": {
              "gpu_mem_limit": "8589934592",
              "arena_extend_strategy": "kSameAsRequested"
            }
          }
        ]
      }
    }
  }
}

Troubleshooting

Out of Memory Errors

# Reduce memory usage
config.set_provider_option("cuda", "gpu_mem_limit", "4294967296")

# Use smaller batch sizes
params.set_search_options(batch_size=1)

# Reduce max length
params.set_search_options(max_length=512)
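When a fixed limit is still not enough, a common pattern is to retry with a halved max_length until generation fits. The sketch below assumes a caller-supplied fits callback (for example, one that attempts a generation and returns False on a CUDA out-of-memory error); nothing here is og API:

```python
def shrink_to_fit(max_length: int, fits, floor: int = 64) -> int:
    """Halve max_length until fits(max_length) succeeds or the floor is passed."""
    while max_length >= floor:
        if fits(max_length):
            return max_length
        max_length //= 2
    raise MemoryError(f"generation does not fit even at max_length={floor}")
```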

Performance Issues

Enable CUDA graph to reduce kernel launch overhead:
config.set_provider_option("cuda", "enable_cuda_graph", "1")

Switch to FP16 model variants for Tensor Core acceleration.

Share the KV cache buffer to reduce memory traffic:
params.set_search_options(past_present_share_buffer=True)

Driver Compatibility

Verify CUDA installation:
# Check CUDA version
nvcc --version

# Verify driver
nvidia-smi

# Confirm the CUDA package loads (fails if CUDA/cuDNN libraries are missing)
python -c "import onnxruntime_genai as og; print('CUDA package loaded')"

Benchmarking

import time
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
tokenizer = og.Tokenizer(model)

prompt = "Tell me about AI"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=100)

start = time.time()
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

num_new_tokens = 0
while not generator.is_done():
    generator.generate_next_token()
    num_new_tokens += 1

elapsed = time.time() - start
print(f"Generation time: {elapsed:.2f}s")
print(f"Tokens/sec: {num_new_tokens / elapsed:.2f}")
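The first run pays one-time costs (kernel selection and, with enable_cuda_graph, graph capture), so steadier numbers come from discarding warm-up iterations. A generic timing helper, sketched here as an assumption rather than an og utility; pass it a closure that runs the generation loop above:

```python
import time

def benchmark(run, warmup: int = 1, iters: int = 3) -> float:
    """Return mean wall-clock seconds per run after discarding warm-up runs."""
    for _ in range(warmup):
        run()
    start = time.time()
    for _ in range(iters):
        run()
    return (time.time() - start) / iters
```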

Next Steps

Model Optimization

Optimize models for CUDA deployment

Memory Management

Learn advanced memory techniques
