CUDA Execution Provider
The CUDA execution provider enables high-performance inference on NVIDIA GPUs with optimized kernels for generative AI workloads.

Requirements

Hardware

  • NVIDIA GPU with Compute Capability 6.0 or higher
  • Recommended: RTX 20 series or newer for optimal performance
  • Minimum: 4GB GPU memory (varies by model size)

Software

  • CUDA Toolkit 11.8 or 12.x
  • cuDNN 8.x or later
  • NVIDIA driver 520.61.05 or newer (Linux) / 528.33 or newer (Windows)
The CUDA provider is included in the onnxruntime-genai-cuda package.

Installation

pip install onnxruntime-genai-cuda --pre

Basic Configuration

Python API

import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set CUDA provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

# Generate a completion
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Tell me about AI"))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))

genai_config.json

Configure CUDA in your model configuration:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "cuda": {}
          }
        ]
      }
    }
  }
}

GPU Memory Management

Memory Allocation

The CUDA provider uses ONNX Runtime’s allocator for GPU memory:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Enable CUDA graph and cap GPU memory usage
config.set_provider_option("cuda", "enable_cuda_graph", "1")
config.set_provider_option("cuda", "gpu_mem_limit", "4294967296")  # 4 GiB

model = og.Model(config)
GPU Memory Limit: Set gpu_mem_limit to restrict CUDA memory usage (in bytes). This is useful for multi-tenant scenarios.
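Since gpu_mem_limit takes a raw byte count as a string, it is easy to get the value wrong by a factor of 1024. A minimal helper for the conversion (gpu_mem_limit_bytes is illustrative, not part of the og API):

```python
def gpu_mem_limit_bytes(gib: float) -> str:
    """Convert a GiB budget into the byte-count string gpu_mem_limit expects."""
    return str(int(gib * 1024 ** 3))

# A 4 GiB limit, matching the example above:
limit = gpu_mem_limit_bytes(4)  # "4294967296"
```

This could then be passed as config.set_provider_option("cuda", "gpu_mem_limit", gpu_mem_limit_bytes(4)).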

KV Cache Management

ONNX Runtime GenAI optimizes key-value cache management:
params = og.GeneratorParams(model)

# Configure cache sharing for better memory efficiency
params.set_search_options(
    max_length=2048,
    past_present_share_buffer=True  # Reduces memory usage
)
When using beam search (num_beams > 1), past_present_share_buffer must be set to False. CUDA graph is also incompatible with beam search.
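Because buffer sharing (and CUDA graph) are only valid for greedy search, one way to keep the constraint in a single place is a small helper that derives compatible options from the beam count. This is a sketch, not part of the og API:

```python
def compatible_search_options(num_beams: int, max_length: int) -> dict:
    """Build search options that respect the beam-search constraint above."""
    greedy = num_beams == 1
    return {
        "max_length": max_length,
        "num_beams": num_beams,
        # Buffer sharing only works with greedy search (num_beams == 1).
        "past_present_share_buffer": greedy,
    }

opts = compatible_search_options(num_beams=1, max_length=2048)
# params.set_search_options(**opts)
```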

Performance Tuning

CUDA Graph Optimization

Enable CUDA graphs to reduce kernel launch overhead:
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

# Enable CUDA graph (for greedy search only)
config.set_provider_option("cuda", "enable_cuda_graph", "1")

model = og.Model(config)
CUDA graphs are only compatible with greedy search (num_beams=1). Disable for beam search scenarios.

Multi-Profile Support

Optimize for multiple sequence lengths:
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_cuda_graph": "1",
        "cuda_graph_enable_multi_profile": "1"
      }
    }
  }
}

Stream Configuration

The CUDA provider uses a dedicated stream for async operations. Memory transfers are optimized using pinned host memory:
// Illustrative C++ sketch: g_stream holds the provider's dedicated cudaStream_t
void* GetStream() {
  return g_stream.get();
}

// Async memory copy
cudaMemcpyAsync(dst, src, size, cudaMemcpyDeviceToHost, GetStream());
cudaStreamSynchronize(GetStream());

Precision Options

FP16 Inference

The CUDA provider supports FP16 for reduced memory and faster inference:
# Use FP16 model variant
model_path = "model-fp16-cuda"

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
For comparison, the default FP32 variant:

  • Memory: higher usage (FP16 roughly halves weight memory)
  • Speed: baseline performance (FP16 enables Tensor Core acceleration)
  • Precision: full precision
  • Hardware: all CUDA GPUs

Advanced Configuration

Device Selection

Select a specific GPU in multi-GPU systems:
import os

# Set CUDA device before loading model
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
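CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA context is created, i.e., before og.Model runs. A small wrapper makes that ordering explicit (select_gpu is a hypothetical helper, not an og API):

```python
import os

def select_gpu(index: int) -> None:
    """Pin this process to one GPU; must run before any CUDA initialization."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

select_gpu(1)  # use the second GPU
```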

Session Options

Fine-tune ONNX Runtime session settings:
{
  "model": {
    "decoder": {
      "session_options": {
        "enable_mem_pattern": true,
        "enable_cpu_mem_arena": false,
        "intra_op_num_threads": 1,
        "provider_options": [
          {
            "cuda": {
              "gpu_mem_limit": "8589934592",
              "arena_extend_strategy": "kSameAsRequested"
            }
          }
        ]
      }
    }
  }
}

Troubleshooting

Out of Memory Errors

# Reduce memory usage
config.set_provider_option("cuda", "gpu_mem_limit", "4294967296")

# Use smaller batch sizes
params.set_search_options(batch_size=1)

# Reduce max length
params.set_search_options(max_length=512)
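When a fixed limit is still not enough, a common pattern is to retry with a halved max_length until generation fits. The sketch below assumes a caller-supplied fits callback (for example, one that attempts a generation and returns False on a CUDA out-of-memory error); nothing here is og API:

```python
def shrink_to_fit(max_length: int, fits, floor: int = 64) -> int:
    """Halve max_length until fits(max_length) succeeds or the floor is passed."""
    while max_length >= floor:
        if fits(max_length):
            return max_length
        max_length //= 2
    raise MemoryError(f"generation does not fit even at max_length={floor}")
```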

Performance Issues

Enable CUDA graph to reduce kernel launch overhead:
config.set_provider_option("cuda", "enable_cuda_graph", "1")

Switch to FP16 model variants for Tensor Core acceleration.

Share the KV cache buffer to reduce memory traffic:
params.set_search_options(past_present_share_buffer=True)

Driver Compatibility

Verify CUDA installation:
# Check CUDA version
nvcc --version

# Verify driver
nvidia-smi

# Confirm the CUDA package loads (fails if CUDA/cuDNN libraries are missing)
python -c "import onnxruntime_genai as og; print('CUDA package loaded')"

Benchmarking

import time
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")

model = og.Model(config)
tokenizer = og.Tokenizer(model)

prompt = "Tell me about AI"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=100)

start = time.time()
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

num_new_tokens = 0
while not generator.is_done():
    generator.generate_next_token()
    num_new_tokens += 1

elapsed = time.time() - start
print(f"Generation time: {elapsed:.2f}s")
print(f"Tokens/sec: {num_new_tokens / elapsed:.2f}")
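The first run pays one-time costs (kernel selection and, with enable_cuda_graph, graph capture), so steadier numbers come from discarding warm-up iterations. A generic timing helper, sketched here as an assumption rather than an og utility; pass it a closure that runs the generation loop above:

```python
import time

def benchmark(run, warmup: int = 1, iters: int = 3) -> float:
    """Return mean wall-clock seconds per run after discarding warm-up runs."""
    for _ in range(warmup):
        run()
    start = time.time()
    for _ in range(iters):
        run()
    return (time.time() - start) / iters
```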

Next Steps

Model Optimization

Optimize models for CUDA deployment

Memory Management

Learn advanced memory techniques
