ONNX Runtime GenAI supports multiple hardware acceleration providers to optimize model inference across different platforms. Each execution provider targets specific hardware and offers unique performance characteristics.

Available Execution Providers

ONNX Runtime GenAI supports the following execution providers:

CUDA

NVIDIA GPU acceleration with comprehensive memory management

DirectML

Cross-vendor GPU acceleration on Windows platforms

OpenVINO

Intel hardware optimization for CPU, GPU, and NPU

QNN

Qualcomm NPU acceleration for edge and mobile devices

WebGPU

Browser-based GPU acceleration using WebGPU API

Platform Compatibility Matrix

| Provider | Windows | Linux | macOS | Android | Browser |
|----------|---------|-------|-------|---------|---------|
| CUDA     | ✓ | ✓ |   |   |   |
| DirectML | ✓ |   |   |   |   |
| OpenVINO | ✓ | ✓ |   |   |   |
| QNN      | ✓ |   |   | ✓ |   |
| WebGPU   |   |   |   |   | ✓ |
| CPU      | ✓ | ✓ | ✓ | ✓ |   |

Hardware Type Support

| Provider | CPU | GPU | NPU | Target Hardware |
|----------|-----|-----|-----|-----------------|
| CUDA     |   | ✓ |   | NVIDIA GPUs |
| DirectML |   | ✓ |   | All DirectX 12 GPUs |
| OpenVINO | ✓ | ✓ | ✓ | Intel CPUs, iGPUs, NPUs |
| QNN      |   |   | ✓ | Qualcomm Hexagon NPUs |
| WebGPU   |   | ✓ |   | Browser-supported GPUs |

Performance Considerations

Memory Management

Each provider handles memory differently:
  • CUDA: Device memory with host-pinned allocations for efficient transfers
  • DirectML: D3D12 resource management with upload/readback heaps
  • OpenVINO: CPU-accessible memory with optional device acceleration
  • QNN: CPU-accessible NPU memory
  • WebGPU: GPU buffers with async CPU-GPU synchronization

Precision Support

All providers support full-precision (FP32) inference. Support for reduced precision varies by provider and hardware: GPU providers such as CUDA and DirectML also run FP16 models, and quantized (for example, INT4) model variants are commonly used to reduce memory footprint.

Provider Selection Guide

Choose the right provider based on your deployment scenario:

Server Deployment

import onnxruntime_genai as og

# NVIDIA GPU server
config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")
model = og.Model(config)

Windows Desktop

import onnxruntime_genai as og

# Cross-vendor GPU support
config = og.Config(model_path)
config.clear_providers()
config.append_provider("dml")
model = og.Model(config)

Edge Devices

import onnxruntime_genai as og

# Intel edge hardware
config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")
model = og.Model(config)

Mobile Deployment

import onnxruntime_genai as og

# Qualcomm Snapdragon devices
config = og.Config(model_path)
config.clear_providers()
config.append_provider("qnn")
model = og.Model(config)
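The per-scenario snippets above each pin a single provider. When one script must run across several of these targets, a small helper can pick the provider name from the current platform. This is a hedged sketch: the mapping below is an illustrative assumption, not an official selection heuristic.

```python
import platform

def default_provider() -> str:
    # Hypothetical helper: map the current platform to a provider name
    # accepted by config.append_provider(). The defaults are assumptions
    # drawn from the scenarios above, not an official heuristic.
    system = platform.system()
    machine = platform.machine().lower()
    if system == "Windows":
        # QNN targets Snapdragon (ARM64) Windows devices; DirectML covers
        # cross-vendor DirectX 12 GPUs on x64.
        return "qnn" if ("arm" in machine or "aarch64" in machine) else "dml"
    if system == "Linux":
        # Assumes an NVIDIA GPU server; use "openvino" or "cpu" otherwise.
        return "cuda"
    return "cpu"
```

The returned name can then be passed straight to `config.append_provider(default_provider())` in the snippets above.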

Configuration in genai_config.json

Providers can be configured directly in your model’s genai_config.json:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "cuda": {}
          }
        ]
      }
    }
  }
}
The provider_options array specifies execution providers in priority order. ONNX Runtime will use the first available provider.
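Because the `provider_options` array is ordered by priority, the same structure can be generated programmatically when you ship one model to multiple targets. A minimal sketch (the helper name and the CUDA-then-OpenVINO ordering are illustrative assumptions, not part of the library):

```python
import json

def make_provider_options(*providers_with_options):
    # Build the session_options fragment of genai_config.json.
    # Each argument is a (provider_name, options_dict) pair;
    # list order sets provider priority.
    return {
        "model": {
            "decoder": {
                "session_options": {
                    "provider_options": [
                        {name: opts} for name, opts in providers_with_options
                    ]
                }
            }
        }
    }

# CUDA first, OpenVINO on CPU as a lower-priority alternative.
config = make_provider_options(("cuda", {}),
                               ("openvino", {"device_type": "CPU"}))
print(json.dumps(config, indent=2))
```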

Device Filtering

For multi-device systems, you can filter by hardware type:
{
  "provider_options": [
    {
      "openvino": {
        "device_type": "GPU"
      },
      "device_filtering_options": {
        "hardware_device_type": "gpu",
        "hardware_device_id": 0
      }
    }
  ]
}
Provider availability depends on your installation. Install provider-specific packages:
  • CUDA: onnxruntime-genai-cuda
  • DirectML: onnxruntime-genai-directml
  • Other providers may require building from source.
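Since each provider ships in its own wheel, a runtime check for which package is installed can catch misconfiguration before a provider request fails. A sketch using the standard library's `importlib.metadata` (the package names are the ones listed above; `onnxruntime-genai` is assumed to be the default CPU wheel):

```python
from importlib import metadata

def installed_genai_package():
    # Return the first installed onnxruntime-genai distribution, or None.
    # Distribution names come from the list above.
    for dist in ("onnxruntime-genai-cuda",
                 "onnxruntime-genai-directml",
                 "onnxruntime-genai"):
        try:
            metadata.version(dist)
            return dist
        except metadata.PackageNotFoundError:
            continue
    return None
```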

Next Steps

CUDA Setup

Configure NVIDIA GPU acceleration

DirectML Setup

Enable DirectML on Windows

OpenVINO Setup

Optimize for Intel hardware

QNN Setup

Deploy to Qualcomm devices
