The OpenVINO execution provider optimizes inference on Intel hardware including CPUs, integrated GPUs (iGPUs), and Neural Processing Units (NPUs).

Requirements

Hardware

  • CPU: Intel Core, Xeon, or Atom processors
  • GPU: Intel Iris Xe, Arc, or Data Center GPU Flex/Max
  • NPU: Intel Core Ultra processors (Meteor Lake and newer)

Software

  • OpenVINO Runtime 2024.0 or later
  • Intel Graphics Driver (for GPU acceleration)
  • Operating Systems:
    • Windows 10/11
    • Linux (Ubuntu 20.04+, RHEL 8+)
    • macOS 10.15+

OpenVINO provides excellent CPU performance and is the recommended execution provider for Intel hardware.

Installation

# Install OpenVINO
pip install openvino

# Install ONNX Runtime GenAI
pip install onnxruntime-genai --pre

Basic Configuration

Python API

import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set OpenVINO provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)

generator = og.Generator(model, params)

genai_config.json

{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU"
            }
          }
        ]
      }
    }
  }
}

Device Selection

CPU Acceleration

import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")

model = og.Model(config)

GPU Acceleration

import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "GPU")

model = og.Model(config)

NPU Acceleration

import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "NPU")

model = og.Model(config)
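
The same device selection can be expressed in genai_config.json, following the pattern of the CPU example above; here is the NPU variant:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "NPU"
            }
          }
        ]
      }
    }
  }
}
```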

The CPU device is best for:
  • General-purpose inference
  • Development and testing
  • Systems without a dedicated GPU or NPU
CPU performance:
  • Excellent on Intel CPUs
  • Multi-threading support
  • INT8 quantization available

CPU Optimization

Thread Configuration

import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Set CPU threads
config.set_provider_option("openvino", "device_type", "CPU")
config.set_provider_option("openvino", "num_streams", "4")

model = og.Model(config)

Performance Hints

{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "performance_hint": "THROUGHPUT",
              "num_streams": "AUTO"
            }
          }
        ]
      }
    }
  }
}

Available performance hints:
  • LATENCY: Optimize for single-request latency
  • THROUGHPUT: Optimize for maximum total throughput
  • CUMULATIVE_THROUGHPUT: Maximize aggregate throughput across multiple devices (used with the AUTO and MULTI device plugins)

Advanced Configuration

Model Caching

Enable model caching to speed up subsequent loads:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Enable caching
config.set_provider_option("openvino", "cache_dir", "./ov_cache")

model = og.Model(config)

Load Config

Provide advanced OpenVINO configuration:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "cache_dir": "./ov_cache",
              "load_config": {
                "CPU": {
                  "INFERENCE_PRECISION_HINT": "f32",
                  "PERFORMANCE_HINT": "LATENCY"
                }
              }
            }
          }
        ]
      }
    }
  }
}

Device Filtering

Select specific devices in multi-device systems:
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "GPU"
            },
            "device_filtering_options": {
              "hardware_device_type": "gpu",
              "hardware_device_id": 0,
              "hardware_vendor_id": 32902
            }
          }
        ]
      }
    }
  }
}
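
The hardware_vendor_id above is a PCI vendor ID written in decimal. Intel's well-known vendor ID is 0x8086, which converts as follows:

```python
# PCI vendor IDs are conventionally written in hexadecimal.
# Intel's vendor ID is 0x8086, which is 32902 in decimal --
# the value used for "hardware_vendor_id" above.
intel_vendor_id = 0x8086
print(intel_vendor_id)  # 32902
```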

Stateful Models

OpenVINO supports stateful models with internal KV cache management:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Enable stateful model (CausalLM)
config.set_provider_option("openvino", "enable_causallm", "True")

model = og.Model(config)
When enable_causallm is set to "True", OpenVINO manages the KV cache internally, reducing memory overhead.
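
The option can also be set in genai_config.json; a sketch assuming the JSON key matches the Python provider option above (verify the spelling against your version):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "enable_causallm": "True"
            }
          }
        ]
      }
    }
  }
}
```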

Quantization Support

INT8 Quantization

OpenVINO provides excellent INT8 performance:
import onnxruntime_genai as og

config = og.Config(model_path)  # Use INT8 quantized model
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")

model = og.Model(config)

Precision Hints

{
  "openvino": {
    "device_type": "CPU",
    "load_config": {
      "CPU": {
        "INFERENCE_PRECISION_HINT": "i8"
      }
    }
  }
}
Setting INFERENCE_PRECISION_HINT to "f32" instead gives:
  • Full precision
  • Highest accuracy
  • Baseline performance

Multi-Device Execution

OpenVINO can distribute inference across multiple devices:
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Use multiple devices
config.set_provider_option("openvino", "device_type", "MULTI:CPU,GPU")

model = og.Model(config)
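
In genai_config.json form, multi-device selection might look like the following sketch, pairing the MULTI device type with the CUMULATIVE_THROUGHPUT hint described earlier (verify against your OpenVINO version):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "MULTI:CPU,GPU",
              "performance_hint": "CUMULATIVE_THROUGHPUT"
            }
          }
        ]
      }
    }
  }
}
```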

Troubleshooting

OpenVINO Not Found

# Install OpenVINO
pip install openvino

# Verify installation
python -c "import openvino; print(openvino.__version__)"

Device Not Available

import onnxruntime_genai as og

try:
    config = og.Config(model_path)
    config.clear_providers()
    config.append_provider("openvino")
    config.set_provider_option("openvino", "device_type", "GPU")
    model = og.Model(config)
except Exception as e:
    print(f"GPU not available: {e}")
    print("Falling back to CPU")
    config.set_provider_option("openvino", "device_type", "CPU")
    model = og.Model(config)
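
The try/except pattern generalizes to a preference-ordered fallback across several devices. A minimal stdlib sketch, where the build_model callable is a hypothetical stand-in for the og.Config/og.Model calls above:

```python
def load_with_fallback(build_model, devices=("NPU", "GPU", "CPU")):
    """Try each device in order; return (device, model) for the first that loads.

    build_model(device) is expected to raise if the device is unavailable.
    """
    last_err = None
    for device in devices:
        try:
            return device, build_model(device)
        except Exception as err:  # device missing, driver not installed, etc.
            last_err = err
    raise RuntimeError(f"No OpenVINO device could load the model: {last_err}")
```

With onnxruntime_genai, build_model would create a fresh og.Config, set the openvino device_type provider option, and return og.Model(config).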

Performance Issues

Slow model loading: enable caching so compiled models are reused.

config.set_provider_option("openvino", "cache_dir", "./ov_cache")

The first load will be slower while the cache is populated, but subsequent loads will be much faster.

Low throughput: tune streams and performance hints.

config.set_provider_option("openvino", "num_streams", "AUTO")
config.set_provider_option("openvino", "performance_hint", "THROUGHPUT")

INT8 quantized models can also provide a 2-4x speedup on Intel CPUs.

Benchmarking

import time
import onnxruntime_genai as og

# CPU benchmark
config_cpu = og.Config(model_path)
config_cpu.clear_providers()
config_cpu.append_provider("openvino")
config_cpu.set_provider_option("openvino", "device_type", "CPU")
model_cpu = og.Model(config_cpu)

# GPU benchmark (if available)
config_gpu = og.Config(model_path)
config_gpu.clear_providers()
config_gpu.append_provider("openvino")
config_gpu.set_provider_option("openvino", "device_type", "GPU")

try:
    model_gpu = og.Model(config_gpu)
    print("GPU available for benchmarking")
except Exception as e:
    model_gpu = None
    print(f"GPU not available: {e}")

# Run inference
tokenizer = og.Tokenizer(model_cpu)
prompt = "Tell me about AI"
input_tokens = tokenizer.encode(prompt)

for device, model in [("CPU", model_cpu), ("GPU", model_gpu)]:
    if model is None:
        continue
    
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=100)
    
    start = time.time()
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)
    
    # Count tokens actually generated; generation may stop before max_length
    n_tokens = 0
    while not generator.is_done():
        generator.generate_next_token()
        n_tokens += 1
    
    end = time.time()
    print(f"{device} - Time: {end - start:.2f}s, Tokens/sec: {n_tokens / (end - start):.2f}")
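
The timing arithmetic can be factored into a small stdlib helper (illustrative only, not part of onnxruntime_genai):

```python
import time

class Stopwatch:
    """Minimal timer for measuring generation throughput."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        return False

    def tokens_per_second(self, n_tokens):
        return n_tokens / self.elapsed

# Usage (time.sleep stands in for the generation loop):
with Stopwatch() as sw:
    time.sleep(0.1)
print(f"{sw.tokens_per_second(100):.1f} tokens/sec")
```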

Best Practices

Use Model Caching

Enable cache_dir to reduce model loading time on subsequent runs.

Choose Right Device

Use CPU for latency, GPU for throughput, NPU for power efficiency.

Optimize Precision

Use INT8 models for best CPU performance on Intel hardware.

Performance Hints

Set appropriate performance hints based on your use case.

Next Steps

Model Optimization

Optimize models for OpenVINO

Quantization Guide

Learn about INT8 quantization
