The OpenVINO execution provider optimizes inference on Intel hardware including CPUs, integrated GPUs (iGPUs), and Neural Processing Units (NPUs).
## Requirements

### Hardware

- **CPU**: Intel Core, Xeon, or Atom processors
- **GPU**: Intel Iris Xe, Arc, or Data Center GPU Flex/Max
- **NPU**: Intel Core Ultra processors (Meteor Lake and newer)

### Software

- OpenVINO Runtime 2024.0 or later
- Intel Graphics Driver (for GPU acceleration)
- Operating systems:
  - Windows 10/11
  - Linux (Ubuntu 20.04+, RHEL 8+)
  - macOS 10.15+
OpenVINO provides excellent CPU performance and is the recommended provider for Intel hardware.
## Installation

Install the prebuilt packages:

```shell
# Install OpenVINO
pip install openvino

# Install ONNX Runtime GenAI
pip install onnxruntime-genai --pre
```

Alternatively, build ONNX Runtime GenAI from source with OpenVINO support:

```shell
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai
python build.py --use_openvino
```
## Basic Configuration

### Python API

```python
import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set OpenVINO provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Tell me about AI"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
```
### genai_config.json

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU"
            }
          }
        ]
      }
    }
  }
}
```
## Device Selection

### CPU Acceleration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")
model = og.Model(config)
```

### GPU Acceleration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "GPU")
model = og.Model(config)
```

### NPU Acceleration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "NPU")
model = og.Model(config)
```
### Device Comparison

| Device | Best for | Performance |
|--------|----------|-------------|
| CPU | General-purpose inference; development and testing; systems without dedicated GPU/NPU | Excellent on Intel CPUs; multi-threading support; INT8 quantization available |
| GPU | Parallel processing; large batch sizes; FP16 inference | 2-4x faster than CPU (model dependent); lower latency for vision models |
| NPU | Energy-efficient inference; mobile and edge devices; low-power scenarios | Minimal power consumption; offloads CPU/GPU; optimized for INT8 |
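When the same script must run on machines with different Intel hardware, the per-device snippets above can be wrapped in a small fallback helper. This is a sketch: `load_model_with_fallback` and its `load_on_device` callback are illustrative helpers, not part of the onnxruntime-genai API.

```python
def load_model_with_fallback(load_on_device, devices=("NPU", "GPU", "CPU")):
    """Try each device in order; return (device, model) for the first that loads.

    `load_on_device(device)` should build the og.Config for that device and
    return the og.Model, raising if the device is unavailable.
    """
    last_err = None
    for device in devices:
        try:
            return device, load_on_device(device)
        except Exception as err:  # device absent or driver missing
            last_err = err
    raise RuntimeError(f"no OpenVINO device could load the model: {last_err}")


# With onnxruntime-genai, the callback would look something like:
#
# def load_on_device(device):
#     config = og.Config(model_path)
#     config.clear_providers()
#     config.append_provider("openvino")
#     config.set_provider_option("openvino", "device_type", device)
#     return og.Model(config)
```

Keeping the device loop separate from the loading code also makes the fallback order easy to adjust per deployment (for example, preferring GPU over NPU for batch workloads).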
## CPU Optimization

### Thread Configuration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Run CPU inference with 4 parallel execution streams
config.set_provider_option("openvino", "device_type", "CPU")
config.set_provider_option("openvino", "num_streams", "4")

model = og.Model(config)
```
The same options in genai_config.json:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "performance_hint": "THROUGHPUT",
              "num_streams": "AUTO"
            }
          }
        ]
      }
    }
  }
}
```
Performance hints:

- `LATENCY`: Optimize for single-request latency
- `THROUGHPUT`: Optimize for maximum throughput
- `CUMULATIVE_THROUGHPUT`: Maximize aggregate throughput across all devices (used with multi-device configurations such as MULTI or AUTO)
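For an interactive chat workload, the same provider options can request latency-optimized execution instead. This is a sketch mirroring the `THROUGHPUT` example above; `LATENCY` typically pairs with a single stream:

```json
{
  "openvino": {
    "device_type": "CPU",
    "performance_hint": "LATENCY",
    "num_streams": "1"
  }
}
```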
## Advanced Configuration

### Model Caching

Enable model caching to speed up subsequent loads:

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Enable caching
config.set_provider_option("openvino", "cache_dir", "./ov_cache")

model = og.Model(config)
```
### Load Config

Provide advanced OpenVINO configuration:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "cache_dir": "./ov_cache",
              "load_config": {
                "CPU": {
                  "INFERENCE_PRECISION_HINT": "f32",
                  "PERFORMANCE_HINT": "LATENCY"
                }
              }
            }
          }
        ]
      }
    }
  }
}
```
### Device Filtering

Select specific devices in multi-device systems:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "GPU"
            },
            "device_filtering_options": {
              "hardware_device_type": "gpu",
              "hardware_device_id": 0,
              "hardware_vendor_id": 32902
            }
          }
        ]
      }
    }
  }
}
```
### Stateful Models

OpenVINO supports stateful models with internal KV cache management:

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Enable stateful model (CausalLM)
config.set_provider_option("openvino", "enable_causallm", "True")

model = og.Model(config)
```

When `enable_causallm` is set to `"True"`, OpenVINO manages the KV cache internally, reducing memory overhead.
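The same option can also be set in genai_config.json. This is a sketch following the `provider_options` layout shown earlier:

```json
{
  "openvino": {
    "device_type": "CPU",
    "enable_causallm": "True"
  }
}
```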
## Quantization Support

### INT8 Quantization

OpenVINO provides excellent INT8 performance:

```python
import onnxruntime_genai as og

config = og.Config(model_path)  # Use an INT8-quantized model
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")
model = og.Model(config)
```
### Precision Hints

```json
{
  "openvino": {
    "device_type": "CPU",
    "load_config": {
      "CPU": {
        "INFERENCE_PRECISION_HINT": "i8"
      }
    }
  }
}
```
| Precision | Memory | Performance |
|-----------|--------|-------------|
| FP32 (full precision) | Baseline | Highest accuracy, baseline performance |
| FP16 (half precision) | 2x reduction | Faster on GPU |
| INT8 (quantized) | 4x reduction | 2-4x speedup on CPU |
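The memory reductions listed above follow directly from bytes per weight: 4 for FP32, 2 for FP16, 1 for INT8. A quick back-of-envelope helper (illustrative only; real deployments add KV cache and activation memory on top of the weights):

```python
BYTES_PER_PARAM = {"f32": 4, "f16": 2, "i8": 1}

def weight_memory_gb(num_params, precision):
    """Approximate weight storage in GiB for a given precision hint."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3

# For a 7B-parameter model:
for p in ("f32", "f16", "i8"):
    print(p, round(weight_memory_gb(7_000_000_000, p), 1))
# → f32 26.1, f16 13.0, i8 6.5
```

The precision keys mirror the `INFERENCE_PRECISION_HINT` values (`f32`, `i8`) used in the config above.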
## Multi-Device Execution

OpenVINO can distribute inference across multiple devices:

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Use multiple devices
config.set_provider_option("openvino", "device_type", "MULTI:CPU,GPU")

model = og.Model(config)
```
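Besides MULTI, OpenVINO's AUTO plugin selects a single device from a priority list at load time, falling back down the list when a device is missing. This is a sketch: AUTO uses the same device-string syntax as MULTI, but confirm that your OpenVINO execution provider build supports it:

```json
{
  "openvino": {
    "device_type": "AUTO:GPU,CPU"
  }
}
```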
## Troubleshooting

### OpenVINO Not Found

```shell
# Install OpenVINO
pip install openvino

# Verify installation
python -c "import openvino; print(openvino.__version__)"

# List the devices OpenVINO can see
python -c "from openvino import Core; print(Core().available_devices)"
```
### Device Not Available

```python
import onnxruntime_genai as og

try:
    config = og.Config(model_path)
    config.clear_providers()
    config.append_provider("openvino")
    config.set_provider_option("openvino", "device_type", "GPU")
    model = og.Model(config)
except Exception as e:
    print(f"GPU not available: {e}")
    print("Falling back to CPU")
    config.set_provider_option("openvino", "device_type", "CPU")
    model = og.Model(config)
```
### Slow First Load

Enable model caching so compiled models are reused:

```python
config.set_provider_option("openvino", "cache_dir", "./ov_cache")
```

The first load will be slower, but subsequent loads will be much faster.

### Low Throughput

```python
config.set_provider_option("openvino", "num_streams", "AUTO")
config.set_provider_option("openvino", "performance_hint", "THROUGHPUT")
```

INT8 quantized models also provide a 2-4x speedup on Intel CPUs.
## Benchmarking

```python
import time
import onnxruntime_genai as og

# CPU benchmark
config_cpu = og.Config(model_path)
config_cpu.clear_providers()
config_cpu.append_provider("openvino")
config_cpu.set_provider_option("openvino", "device_type", "CPU")
model_cpu = og.Model(config_cpu)

# GPU benchmark (if available)
config_gpu = og.Config(model_path)
config_gpu.clear_providers()
config_gpu.append_provider("openvino")
config_gpu.set_provider_option("openvino", "device_type", "GPU")
try:
    model_gpu = og.Model(config_gpu)
    print("GPU available for benchmarking")
except Exception:
    model_gpu = None
    print("GPU not available")

# Run inference on each available device
tokenizer = og.Tokenizer(model_cpu)
prompt = "Tell me about AI"
input_tokens = tokenizer.encode(prompt)

for device, model in [("CPU", model_cpu), ("GPU", model_gpu)]:
    if model is None:
        continue
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=100)
    start = time.time()
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)
    num_generated = 0
    while not generator.is_done():
        generator.generate_next_token()
        num_generated += 1
    elapsed = time.time() - start
    print(f"{device} - Time: {elapsed:.2f}s, Tokens/sec: {num_generated / elapsed:.2f}")
```
## Best Practices

- **Use model caching**: Enable `cache_dir` to reduce model loading time on subsequent runs.
- **Choose the right device**: Use CPU for latency, GPU for throughput, NPU for power efficiency.
- **Optimize precision**: Use INT8 models for best CPU performance on Intel hardware.
- **Set performance hints**: Choose the hint that matches your use case.
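Putting these practices together, a reasonable starting genai_config.json for CPU inference might look like the following (a sketch combining options documented above; tune per workload):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "cache_dir": "./ov_cache",
              "performance_hint": "LATENCY",
              "enable_causallm": "True"
            }
          }
        ]
      }
    }
  }
}
```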
## Next Steps

- **Model Optimization**: Optimize models for OpenVINO
- **Quantization Guide**: Learn about INT8 quantization