DirectML Execution Provider

The DirectML Execution Provider enables GPU acceleration on Windows using DirectML, Microsoft’s hardware-accelerated DirectX 12 API for machine learning. DirectML supports any DirectX 12-capable GPU from NVIDIA, AMD, Intel, and Qualcomm.

When to Use DirectML EP

Use the DirectML Execution Provider when:
  • You’re running on Windows 10 (1903+) or Windows 11
  • You need cross-vendor GPU support (NVIDIA, AMD, Intel, Qualcomm)
  • You’re developing Windows desktop applications
  • You want to support a wide range of GPUs without driver-specific code
  • You’re targeting Windows-on-ARM devices (Surface Pro X, etc.)
  • You need NPU acceleration on compatible devices

Key Features

  • Cross-Vendor: Works with NVIDIA, AMD, Intel, and Qualcomm GPUs
  • Wide Hardware Support: Any DirectX 12-capable GPU
  • NPU Support: Leverage Neural Processing Units on compatible hardware
  • Windows Integration: Optimized for Windows platform
  • Single API: No need for vendor-specific SDKs

Prerequisites

Hardware Requirements

  • DirectX 12-capable GPU
  • Windows 10 (version 1903 or later) or Windows 11
  • At least 2 GB of GPU memory recommended

Supported GPUs

  • NVIDIA: GTX 900 series and newer
  • AMD: Radeon RX 400 series and newer
  • Intel: HD Graphics 6xx and newer (Skylake+)
  • Qualcomm: Adreno GPUs in Snapdragon processors

Software Requirements

  • Windows 10 (1903+) or Windows 11
  • ONNX Runtime DirectML package
  • Up-to-date GPU drivers

Installation

Python

# Install ONNX Runtime with DirectML support
pip install onnxruntime-directml

# Verify DirectML is available
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include 'DmlExecutionProvider'

C++

Download the DirectML-enabled build from ONNX Runtime releases:
# Download Windows DirectML package
Invoke-WebRequest -Uri "https://github.com/microsoft/onnxruntime/releases/download/v{version}/onnxruntime-win-x64-{version}.zip" -OutFile "onnxruntime.zip"
Expand-Archive onnxruntime.zip

C#/.NET

# Install NuGet package
dotnet add package Microsoft.ML.OnnxRuntime.DirectML

UWP (Universal Windows Platform)

# For UWP applications
dotnet add package Microsoft.AI.MachineLearning

Basic Usage

Python

import onnxruntime as ort
import numpy as np

# Create session with DirectML provider
session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
results = session.run(None, {input_name: x})
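If DirectML is unavailable on a given machine, ONNX Runtime falls through the providers list in order, so it is common to list a CPU fallback. A minimal sketch of building a priority-ordered provider list (the helper name is illustrative, not part of the ONNX Runtime API):

```python
def preferred_providers(available, priority=None):
    """Return the providers from `priority` that are actually available, in order.

    `available` is the list returned by ort.get_available_providers().
    """
    if priority is None:
        priority = ["DmlExecutionProvider", "CPUExecutionProvider"]
    return [p for p in priority if p in available]

# Example: on a machine with onnxruntime-directml installed
available = ["DmlExecutionProvider", "CPUExecutionProvider"]
print(preferred_providers(available))
# Pass the result to ort.InferenceSession("model.onnx", providers=...)
```

Passing the filtered list keeps the same script working on machines without DirectML, at the cost of silently running on CPU there.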

C++

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "DirectMLExample");
Ort::SessionOptions session_options;

// Add DirectML provider with device ID 0 (default GPU)
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(session_options, 0));

// Create session
const wchar_t* model_path = L"model.onnx";
Ort::Session session(env, model_path, session_options);

// Run inference (input_names, input_tensor, and output_names must be prepared beforehand)
auto output_tensors = session.Run(Ort::RunOptions{nullptr}, 
                                   input_names.data(), 
                                   &input_tensor, 1,
                                   output_names.data(), 1);

C#

using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var sessionOptions = new SessionOptions();
sessionOptions.AppendExecutionProvider_DML(0);  // Use default GPU

using var session = new InferenceSession("model.onnx", sessionOptions);

var inputMeta = session.InputMetadata;
var name = inputMeta.Keys.First();
var shape = inputMeta[name].Dimensions;  // assumes a static input shape (no -1 dims)

var tensor = new DenseTensor<float>(shape);
var inputs = new List<NamedOnnxValue> { 
    NamedOnnxValue.CreateFromTensor(name, tensor) 
};

using var results = session.Run(inputs);

WinRT/UWP (C#)

using System;
using Windows.Storage;
using Microsoft.AI.MachineLearning;

// Load model
var modelFile = await StorageFile.GetFileFromApplicationUriAsync(
    new Uri("ms-appx:///Assets/model.onnx")
);
var model = await LearningModel.LoadFromStorageFileAsync(modelFile);

// Create session with the default device
var session = new LearningModelSession(model);

// Or request GPU execution explicitly via a DirectX device
var device = new LearningModelDevice(LearningModelDeviceKind.DirectX);
session = new LearningModelSession(model, device);

// Run inference
var binding = new LearningModelBinding(session);
binding.Bind("input", inputTensor);
var results = await session.EvaluateAsync(binding, "");

Configuration Options

Device Selection

import onnxruntime as ort

# Use default GPU (adapter 0)
session = ort.InferenceSession(
    "model.onnx",
    providers=[('DmlExecutionProvider', {'device_id': 0})]
)

# Use specific GPU (for multi-GPU systems)
session = ort.InferenceSession(
    "model.onnx",
    providers=[('DmlExecutionProvider', {'device_id': 1})]
)

Performance Preferences

# High performance mode
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'device_id': 0,
            'performance_preference': 'high_performance'  # or 'default', 'minimum_power'
        }
    )]
)

Device Filtering

# Target specific device types
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'device_filter': 'gpu'  # 'gpu', 'npu', or 'any'
        }
    )]
)

Advanced Configuration

C++ Advanced Options

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "DirectMLExample");
Ort::SessionOptions session_options;

// Get DirectML API
const OrtDmlApi* dml_api = nullptr;
Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi(
    "DML", ORT_API_VERSION, reinterpret_cast<const void**>(&dml_api)
));

// Configure device options
OrtDmlDeviceOptions device_options;
device_options.Preference = OrtDmlPerformancePreference::HighPerformance;
device_options.Filter = OrtDmlDeviceFilter::Gpu;

// Append execution provider
Ort::ThrowOnError(dml_api->SessionOptionsAppendExecutionProvider_DML2(
    session_options, &device_options
));

Ort::Session session(env, L"model.onnx", session_options);

Custom D3D12 Device

#include <d3d12.h>
#include <DirectML.h>
#include <wrl/client.h>
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

// Create custom D3D12 device and command queue
Microsoft::WRL::ComPtr<ID3D12Device> d3d12_device;
D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&d3d12_device));

D3D12_COMMAND_QUEUE_DESC queue_desc = {};
queue_desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
Microsoft::WRL::ComPtr<ID3D12CommandQueue> command_queue;
d3d12_device->CreateCommandQueue(&queue_desc, IID_PPV_ARGS(&command_queue));

// Create DML device
Microsoft::WRL::ComPtr<IDMLDevice> dml_device;
DMLCreateDevice(d3d12_device.Get(), DML_CREATE_DEVICE_FLAG_NONE, IID_PPV_ARGS(&dml_device));

// Use with ONNX Runtime
const OrtDmlApi* dml_api = nullptr;
Ort::GetApi().GetExecutionProviderApi("DML", ORT_API_VERSION, 
    reinterpret_cast<const void**>(&dml_api));

Ort::SessionOptions session_options;
dml_api->SessionOptionsAppendExecutionProvider_DML1(
    session_options, dml_device.Get(), command_queue.Get()
);

Multi-GPU Support

import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu(gpu_id, model_path, input_data):
    session = ort.InferenceSession(
        model_path,
        providers=[('DmlExecutionProvider', {'device_id': gpu_id})]
    )
    return session.run(None, input_data)

# Run on multiple GPUs concurrently
with ThreadPoolExecutor(max_workers=2) as executor:
    future1 = executor.submit(run_on_gpu, 0, "model.onnx", input_data_1)
    future2 = executor.submit(run_on_gpu, 1, "model.onnx", input_data_2)
    
    result1 = future1.result()
    result2 = future2.result()

NPU Acceleration

On devices with Neural Processing Units:
import onnxruntime as ort

# Target NPU if available
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'device_filter': 'npu',
            'performance_preference': 'default'
        }
    )]
)
NPU-Compatible Devices:
  • Intel Core Ultra (Meteor Lake) with Intel AI Boost
  • AMD Ryzen AI processors
  • Qualcomm Snapdragon X Elite/Plus
  • Recent Surface devices with built-in NPUs

Performance Optimization

Memory Management

# For low-memory devices, use smaller batch sizes
session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider']
)

# Process in smaller batches
batch_size = 1  # Or 4, 8 depending on GPU memory
for i in range(0, len(inputs), batch_size):
    batch = inputs[i:i+batch_size]
    results = session.run(None, {input_name: batch})

Session Options

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Enable memory pattern optimization
sess_options.enable_mem_pattern = True

# Enable CPU memory arena
sess_options.enable_cpu_mem_arena = True

session = ort.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=['DmlExecutionProvider']
)

Platform Support

Platform             | Architecture | Support
---------------------|--------------|--------
Windows 11           | x64          | ✅ Full
Windows 11           | ARM64        | ✅ Full
Windows 10 (1903+)   | x64          | ✅ Full
Windows 10 (1903+)   | ARM64        | ✅ Full
Windows Server 2019+ | x64          | ✅ Full
Linux                | Any          | ❌ No
macOS                | Any          | ❌ No
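The support matrix above can be enforced at startup. A pure-Python gate (the helper names are illustrative; Windows build 18362 corresponds to Windows 10 version 1903):

```python
import platform

def directml_supported(system, build):
    """True if the OS meets the DirectML EP minimum (Windows 10 1903+)."""
    return system == "Windows" and build >= 18362

def current_build():
    # On Windows, platform.version() looks like "10.0.19045"
    try:
        return int(platform.version().split(".")[2])
    except (IndexError, ValueError):
        return 0

print(directml_supported(platform.system(), current_build()))
```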

Vendor-Specific Performance

NVIDIA GPUs

  • Good performance for most models
  • Consider CUDA/TensorRT for maximum performance
  • DirectML useful for cross-vendor compatibility

AMD GPUs

  • Excellent choice for AMD GPUs on Windows
  • Often best or only option for AMD acceleration
  • Good performance on RDNA architecture

Intel GPUs

  • Great for Intel integrated and discrete GPUs
  • Alternative to OpenVINO on Windows
  • Good performance on Arc and Xe GPUs

Qualcomm (Windows on ARM)

  • Primary option for GPU acceleration on ARM
  • Optimized for Snapdragon processors
  • Consider QNN EP for maximum Snapdragon performance
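The vendor guidance above can be encoded as a simple preference table: prefer a vendor-specific EP when it is installed, with DirectML as the broad fallback. An illustrative sketch (the mapping reflects the recommendations above, not an official API):

```python
VENDOR_PREFERENCES = {
    "nvidia":   ["TensorrtExecutionProvider", "CUDAExecutionProvider",
                 "DmlExecutionProvider"],
    "amd":      ["DmlExecutionProvider"],
    "intel":    ["OpenVINOExecutionProvider", "DmlExecutionProvider"],
    "qualcomm": ["QNNExecutionProvider", "DmlExecutionProvider"],
}

def providers_for(vendor, available):
    """Pick preferred providers for a GPU vendor, filtered by availability."""
    wanted = VENDOR_PREFERENCES.get(vendor.lower(), ["DmlExecutionProvider"])
    chosen = [p for p in wanted if p in available]
    return chosen + ["CPUExecutionProvider"]  # always keep the CPU fallback

print(providers_for("amd", ["DmlExecutionProvider", "CPUExecutionProvider"]))
```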

Troubleshooting

Provider Not Available

import onnxruntime as ort

print(ort.get_available_providers())
# If 'DmlExecutionProvider' is missing:
# 1. Check Windows version (need 1903+)
# 2. Verify onnxruntime-directml is installed
# 3. Update GPU drivers
# 4. Ensure DirectX 12 support

Performance Issues

# Check which GPU is being used
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider']
)

print(f"Using providers: {session.get_providers()}")

# Try different performance preferences
session = ort.InferenceSession(
    "model.onnx",
    providers=[(
        'DmlExecutionProvider', {
            'performance_preference': 'high_performance'
        }
    )]
)

Out of Memory

# Reduce batch size or model size
# Check GPU memory usage in Task Manager > Performance > GPU

# Use smaller input batches
batch_size = 1
results = session.run(None, {input_name: data[:batch_size]})
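If the model still exhausts GPU memory, a common workaround is to retry with progressively smaller batches. A sketch of that backoff loop (the runner callable and halving strategy are illustrative):

```python
def run_with_backoff(run_fn, data, batch_size=8):
    """Run data through run_fn in batches, halving the batch size on failure."""
    while batch_size >= 1:
        try:
            results = []
            for i in range(0, len(data), batch_size):
                results.append(run_fn(data[i:i + batch_size]))
            return results
        except Exception:
            batch_size //= 2  # retry everything with a smaller batch
    raise RuntimeError("Even batch_size=1 ran out of memory")

# Usage: run_with_backoff(lambda b: session.run(None, {input_name: b}), data)
```

Note the loop re-runs earlier batches after a failure; for expensive workloads you may want to checkpoint completed batches instead.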

Comparison with Other Providers

Feature             | DirectML  | CUDA        | TensorRT
--------------------|-----------|-------------|------------
Vendor Support      | All       | NVIDIA only | NVIDIA only
Setup Complexity    | Easy      | Moderate    | Complex
Performance         | Good      | Better      | Best
Windows Integration | Excellent | Good        | Good
ARM Support         | Yes       | No          | No

Next Steps