DirectML Execution Provider

The DirectML Execution Provider enables GPU acceleration on Windows using DirectML, Microsoft’s hardware-accelerated DirectX 12 API for machine learning. DirectML supports any DirectX 12-capable GPU from NVIDIA, AMD, Intel, and Qualcomm.

When to Use DirectML EP

Use the DirectML Execution Provider when:
  • You’re running on Windows 10 (1903+) or Windows 11
  • You need cross-vendor GPU support (NVIDIA, AMD, Intel, Qualcomm)
  • You’re developing Windows desktop applications
  • You want to support a wide range of GPUs without driver-specific code
  • You’re targeting Windows-on-ARM devices (Surface Pro X, etc.)
  • You need NPU acceleration on compatible devices

Key Features

  • Cross-Vendor: Works with NVIDIA, AMD, Intel, and Qualcomm GPUs
  • Wide Hardware Support: Any DirectX 12-capable GPU
  • NPU Support: Leverage Neural Processing Units on compatible hardware
  • Windows Integration: Optimized for Windows platform
  • Single API: No need for vendor-specific SDKs

Prerequisites

Hardware Requirements

  • DirectX 12-capable GPU
  • Windows 10 (version 1903 or later) or Windows 11
  • At least 2 GB of GPU memory recommended

Supported GPUs

  • NVIDIA: GTX 900 series and newer
  • AMD: Radeon RX 400 series and newer
  • Intel: HD Graphics 6xx and newer (Skylake+)
  • Qualcomm: Adreno GPUs in Snapdragon processors

Software Requirements

  • Windows 10 (1903+) or Windows 11
  • ONNX Runtime DirectML package
  • Up-to-date GPU drivers

Installation

Python

# Install ONNX Runtime with DirectML support
pip install onnxruntime-directml

# Verify DirectML is available
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include 'DmlExecutionProvider'

C++

Download the DirectML-enabled build from ONNX Runtime releases:
# Download Windows DirectML package
Invoke-WebRequest -Uri "https://github.com/microsoft/onnxruntime/releases/download/v{version}/onnxruntime-win-x64-{version}.zip" -OutFile "onnxruntime.zip"
Expand-Archive onnxruntime.zip

C#/.NET

# Install NuGet package
dotnet add package Microsoft.ML.OnnxRuntime.DirectML

UWP (Universal Windows Platform)

# For UWP applications
dotnet add package Microsoft.AI.MachineLearning

Basic Usage

Python

import onnxruntime as ort
import numpy as np

# Create session with DirectML provider
session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
results = session.run(None, {input_name: x})
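If DirectML is unavailable on a given machine, ONNX Runtime falls through the providers list in order, so it is common to list a CPU fallback. A minimal sketch of building a priority-ordered provider list (the helper name is illustrative, not part of the ONNX Runtime API):

```python
def preferred_providers(available, priority=None):
    """Return the providers from `priority` that are actually available, in order.

    `available` is the list returned by ort.get_available_providers().
    """
    if priority is None:
        priority = ["DmlExecutionProvider", "CPUExecutionProvider"]
    return [p for p in priority if p in available]

# Example: on a machine with onnxruntime-directml installed
available = ["DmlExecutionProvider", "CPUExecutionProvider"]
print(preferred_providers(available))
# Pass the result to ort.InferenceSession("model.onnx", providers=...)
```

Passing the filtered list keeps the same script working on machines without DirectML, at the cost of silently running on CPU there.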

C++

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "DirectMLExample");
Ort::SessionOptions session_options;

// Add DirectML provider with device ID 0 (default GPU)
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(session_options, 0));

// Create session
const wchar_t* model_path = L"model.onnx";
Ort::Session session(env, model_path, session_options);

// Run inference (input_names, input_tensor, and output_names must be prepared beforehand)
auto output_tensors = session.Run(Ort::RunOptions{nullptr}, 
                                   input_names.data(), 
                                   &input_tensor, 1,
                                   output_names.data(), 1);

C#

using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_DML(0);  // Use default GPU

using var session = new InferenceSession("model.onnx", sessionOptions);

var inputMeta = session.InputMetadata;
var name = inputMeta.Keys.First();
var shape = inputMeta[name].Dimensions;  // assumes a static input shape (no -1 dims)

var tensor = new DenseTensor<float>(shape);
var inputs = new List<NamedOnnxValue> { 
    NamedOnnxValue.CreateFromTensor(name, tensor) 
};

using var results = session.Run(inputs);

WinRT/UWP (C#)

using System;
using Windows.Storage;
using Microsoft.AI.MachineLearning;

// Load model
var modelFile = await StorageFile.GetFileFromApplicationUriAsync(
    new Uri("ms-appx:///Assets/model.onnx")
);
var model = await LearningModel.LoadFromStorageFileAsync(modelFile);

// Create session with the default device
var session = new LearningModelSession(model);

// Or request GPU execution explicitly via a DirectX device
var device = new LearningModelDevice(LearningModelDeviceKind.DirectX);
session = new LearningModelSession(model, device);

// Run inference
var binding = new LearningModelBinding(session);
binding.Bind("input", inputTensor);
var results = await session.EvaluateAsync(binding, "");

Configuration Options

Device Selection

import onnxruntime as ort

# Use default GPU (adapter 0)
session = ort.InferenceSession(
    "model.onnx",
    providers=[('DmlExecutionProvider', {'device_id': 0})]
)

# Use specific GPU (for multi-GPU systems)
session = ort.InferenceSession(
    "model.onnx",
    providers=[('DmlExecutionProvider', {'device_id': 1})]
)

Performance Preferences

# High performance mode
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'device_id': 0,
            'performance_preference': 'high_performance'  # or 'default', 'minimum_power'
        }
    )]
)

Device Filtering

# Target specific device types
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'device_filter': 'gpu'  # 'gpu', 'npu', or 'any'
        }
    )]
)

Advanced Configuration

C++ Advanced Options

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "DirectMLExample");
Ort::SessionOptions session_options;

// Get DirectML API
const OrtDmlApi* dml_api = nullptr;
Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi(
    "DML", ORT_API_VERSION, reinterpret_cast<const void**>(&dml_api)
));

// Configure device options
OrtDmlDeviceOptions device_options;
device_options.Preference = OrtDmlPerformancePreference::HighPerformance;
device_options.Filter = OrtDmlDeviceFilter::Gpu;

// Append execution provider
Ort::ThrowOnError(dml_api->SessionOptionsAppendExecutionProvider_DML2(
    session_options, &device_options
));

Ort::Session session(env, L"model.onnx", session_options);

Custom D3D12 Device

#include <d3d12.h>
#include <DirectML.h>
#include <wrl/client.h>
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

// Create custom D3D12 device and command queue
Microsoft::WRL::ComPtr<ID3D12Device> d3d12_device;
D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&d3d12_device));

D3D12_COMMAND_QUEUE_DESC queue_desc = {};
queue_desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
Microsoft::WRL::ComPtr<ID3D12CommandQueue> command_queue;
d3d12_device->CreateCommandQueue(&queue_desc, IID_PPV_ARGS(&command_queue));

// Create DML device
Microsoft::WRL::ComPtr<IDMLDevice> dml_device;
DMLCreateDevice(d3d12_device.Get(), DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dml_device));

// Use with ONNX Runtime
const OrtDmlApi* dml_api = nullptr;
Ort::GetApi().GetExecutionProviderApi("DML", ORT_API_VERSION, 
    reinterpret_cast<const void**>(&dml_api));

Ort::SessionOptions session_options;
dml_api->SessionOptionsAppendExecutionProvider_DML1(
    session_options, dml_device.Get(), command_queue.Get()
);

Multi-GPU Support

import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu(gpu_id, model_path, input_data):
    session = ort.InferenceSession(
        model_path,
        providers=[('DmlExecutionProvider', {'device_id': gpu_id})]
    )
    return session.run(None, input_data)

# Run on multiple GPUs concurrently
with ThreadPoolExecutor(max_workers=2) as executor:
    future1 = executor.submit(run_on_gpu, 0, "model.onnx", input_data_1)
    future2 = executor.submit(run_on_gpu, 1, "model.onnx", input_data_2)
    
    result1 = future1.result()
    result2 = future2.result()

NPU Acceleration

On devices with Neural Processing Units:
import onnxruntime as ort

# Target NPU if available
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'device_filter': 'npu',
            'performance_preference': 'default'
        }
    )]
)
NPU-Compatible Devices:
  • Intel Core Ultra (Meteor Lake) with Intel AI Boost
  • AMD Ryzen AI processors
  • Qualcomm Snapdragon X Elite/Plus
  • Recent Surface devices with built-in NPUs

Performance Optimization

Memory Management

# For low-memory devices, use smaller batch sizes
session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider']
)

# Process in smaller batches
batch_size = 1  # Or 4, 8 depending on GPU memory
for i in range(0, len(inputs), batch_size):
    batch = inputs[i:i+batch_size]
    results = session.run(None, {input_name: batch})

Session Options

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Enable memory pattern optimization
sess_options.enable_mem_pattern = True

# Enable CPU memory arena
sess_options.enable_cpu_mem_arena = True

session = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=['DmlExecutionProvider']
)

Platform Support

Platform             | Architecture | Support
---------------------|--------------|--------
Windows 11           | x64          | ✅ Full
Windows 11           | ARM64        | ✅ Full
Windows 10 (1903+)   | x64          | ✅ Full
Windows 10 (1903+)   | ARM64        | ✅ Full
Windows Server 2019+ | x64          | ✅ Full
Linux                | Any          | ❌ No
macOS                | Any          | ❌ No
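The support matrix above can be enforced at startup. A pure-Python gate (the helper names are illustrative; Windows build 18362 corresponds to Windows 10 version 1903):

```python
import platform

def directml_supported(system, build):
    """True if the OS meets the DirectML EP minimum (Windows 10 1903+)."""
    return system == "Windows" and build >= 18362

def current_build():
    # On Windows, platform.version() looks like "10.0.19045"
    try:
        return int(platform.version().split(".")[2])
    except (IndexError, ValueError):
        return 0

print(directml_supported(platform.system(), current_build()))
```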

Vendor-Specific Performance

NVIDIA GPUs

  • Good performance for most models
  • Consider CUDA/TensorRT for maximum performance
  • DirectML useful for cross-vendor compatibility

AMD GPUs

  • Excellent choice for AMD GPUs on Windows
  • Often best or only option for AMD acceleration
  • Good performance on RDNA architecture

Intel GPUs

  • Great for Intel integrated and discrete GPUs
  • Alternative to OpenVINO on Windows
  • Good performance on Arc and Xe GPUs

Qualcomm (Windows on ARM)

  • Primary option for GPU acceleration on ARM
  • Optimized for Snapdragon processors
  • Consider QNN EP for maximum Snapdragon performance
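The vendor guidance above can be encoded as a simple preference table: prefer a vendor-specific EP when it is installed, with DirectML as the broad fallback. An illustrative sketch (the mapping reflects the recommendations above, not an official API):

```python
VENDOR_PREFERENCES = {
    "nvidia":   ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                 "DmlExecutionProvider"],
    "amd":      ["DmlExecutionProvider"],
    "intel":    ["OpenVINOExecutionProvider", "DmlExecutionProvider"],
    "qualcomm": ["QNNExecutionProvider", "DmlExecutionProvider"],
}

def providers_for(vendor, available):
    """Pick preferred providers for a GPU vendor, filtered by availability."""
    wanted = VENDOR_PREFERENCES.get(vendor.lower(), ["DmlExecutionProvider"])
    chosen = [p for p in wanted if p in available]
    return chosen + ["CPUExecutionProvider"]  # always keep the CPU fallback

print(providers_for("amd", ["DmlExecutionProvider", "CPUExecutionProvider"]))
```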

Troubleshooting

Provider Not Available

import onnxruntime as ort

print(ort.get_available_providers())
# If 'DmlExecutionProvider' is missing:
# 1. Check Windows version (need 1903+)
# 2. Verify onnxruntime-directml is installed
# 3. Update GPU drivers
# 4. Ensure DirectX 12 support

Performance Issues

# Check which GPU is being used
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider']
)

print(f"Using providers: {session.get_providers()}")

# Try different performance preferences
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'performance_preference': 'high_performance'
        }
    )]
)

Out of Memory

# Reduce batch size or model size
# Check GPU memory usage in Task Manager > Performance > GPU

# Use smaller input batches
batch_size = 1
results = session.run(None, {input_name: data[:batch_size]})
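If the model still exhausts GPU memory, a common workaround is to retry with progressively smaller batches. A sketch of that backoff loop (the runner callable and halving strategy are illustrative):

```python
def run_with_backoff(run_fn, data, batch_size=8):
    """Run data through run_fn in batches, halving the batch size on failure."""
    while batch_size >= 1:
        try:
            results = []
            for i in range(0, len(data), batch_size):
                results.append(run_fn(data[i:i + batch_size]))
            return results
        except Exception:
            batch_size //= 2  # retry everything with a smaller batch
    raise RuntimeError("Even batch_size=1 ran out of memory")

# Usage: run_with_backoff(lambda b: session.run(None, {input_name: b}), data)
```

Note the loop re-runs earlier batches after a failure; for expensive workloads you may want to checkpoint completed batches instead.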

Comparison with Other Providers

Feature             | DirectML  | CUDA        | TensorRT
--------------------|-----------|-------------|------------
Vendor Support      | All       | NVIDIA only | NVIDIA only
Setup Complexity    | Easy      | Moderate    | Complex
Performance         | Good      | Better      | Best
Windows Integration | Excellent | Good        | Good
ARM Support         | Yes       | No          | No

Next Steps