The OpenVINO execution provider optimizes inference on Intel hardware including CPUs, integrated GPUs (iGPUs), and Neural Processing Units (NPUs).
## Requirements

### Hardware

- **CPU**: Intel Core, Xeon, or Atom processors
- **GPU**: Intel Iris Xe, Arc, or Data Center GPU Flex/Max
- **NPU**: Intel Core Ultra processors (Meteor Lake and newer)

### Software

- OpenVINO Runtime 2024.0 or later
- Intel Graphics Driver (for GPU acceleration)
- Operating systems:
  - Windows 10/11
  - Linux (Ubuntu 20.04+, RHEL 8+)
  - macOS 10.15+
OpenVINO provides excellent CPU performance and is the recommended provider for Intel hardware.
## Installation

Install the prebuilt packages:

```shell
# Install OpenVINO
pip install openvino

# Install ONNX Runtime GenAI
pip install onnxruntime-genai --pre
```

Alternatively, build ONNX Runtime GenAI from source with OpenVINO support:

```shell
git clone https://github.com/microsoft/onnxruntime-genai.git
cd onnxruntime-genai
python build.py --use_openvino
```
## Basic Configuration

### Python API

```python
import onnxruntime_genai as og

model_path = "path/to/model"

# Create config and set OpenVINO provider
config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Load model
model = og.Model(config)
tokenizer = og.Tokenizer(model)

# Generate
params = og.GeneratorParams(model)
params.set_search_options(max_length=1024)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Tell me about AI"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
```
### genai_config.json

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU"
            }
          }
        ]
      }
    }
  }
}
```
## Device Selection

### CPU Acceleration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")
model = og.Model(config)
```

### GPU Acceleration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "GPU")
model = og.Model(config)
```

### NPU Acceleration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "NPU")
model = og.Model(config)
```
### Device Comparison

| Device | Best for | Performance |
|--------|----------|-------------|
| CPU | General-purpose inference; development and testing; systems without dedicated GPU/NPU | Excellent on Intel CPUs; multi-threading support; INT8 quantization available |
| GPU | Parallel processing; large batch sizes; FP16 inference | 2-4x faster than CPU (model dependent); lower latency for vision models |
| NPU | Energy-efficient inference; mobile and edge devices; low-power scenarios | Minimal power consumption; offloads CPU/GPU; optimized for INT8 |
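When the same script must run on machines with different Intel hardware, the per-device snippets above can be wrapped in a small fallback helper. This is a sketch: `load_model_with_fallback` and its `load_on_device` callback are illustrative helpers, not part of the onnxruntime-genai API.

```python
def load_model_with_fallback(load_on_device, devices=("NPU", "GPU", "CPU")):
    """Try each device in order; return (device, model) for the first that loads.

    `load_on_device(device)` should build the og.Config for that device and
    return the og.Model, raising if the device is unavailable.
    """
    last_err = None
    for device in devices:
        try:
            return device, load_on_device(device)
        except Exception as err:  # device absent or driver missing
            last_err = err
    raise RuntimeError(f"no OpenVINO device could load the model: {last_err}")


# With onnxruntime-genai, the callback would look something like:
#
# def load_on_device(device):
#     config = og.Config(model_path)
#     config.clear_providers()
#     config.append_provider("openvino")
#     config.set_provider_option("openvino", "device_type", device)
#     return og.Model(config)
```

Keeping the device loop separate from the loading code also makes the fallback order easy to adjust per deployment (for example, preferring GPU over NPU for batch workloads).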
## CPU Optimization

### Thread Configuration

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Run CPU inference with 4 parallel execution streams
config.set_provider_option("openvino", "device_type", "CPU")
config.set_provider_option("openvino", "num_streams", "4")

model = og.Model(config)
```
The same options in genai_config.json:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "performance_hint": "THROUGHPUT",
              "num_streams": "AUTO"
            }
          }
        ]
      }
    }
  }
}
```
Performance hints:

- `LATENCY`: Optimize for single-request latency
- `THROUGHPUT`: Optimize for maximum throughput
- `CUMULATIVE_THROUGHPUT`: Maximize aggregate throughput across all devices (used with multi-device configurations such as MULTI or AUTO)
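For an interactive chat workload, the same provider options can request latency-optimized execution instead. This is a sketch mirroring the `THROUGHPUT` example above; `LATENCY` typically pairs with a single stream:

```json
{
  "openvino": {
    "device_type": "CPU",
    "performance_hint": "LATENCY",
    "num_streams": "1"
  }
}
```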
## Advanced Configuration

### Model Caching

Enable model caching to speed up subsequent loads:

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Enable caching
config.set_provider_option("openvino", "cache_dir", "./ov_cache")

model = og.Model(config)
```
### Load Config

Provide advanced OpenVINO configuration:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "cache_dir": "./ov_cache",
              "load_config": {
                "CPU": {
                  "INFERENCE_PRECISION_HINT": "f32",
                  "PERFORMANCE_HINT": "LATENCY"
                }
              }
            }
          }
        ]
      }
    }
  }
}
```
### Device Filtering

Select specific devices in multi-device systems:

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "GPU"
            },
            "device_filtering_options": {
              "hardware_device_type": "gpu",
              "hardware_device_id": 0,
              "hardware_vendor_id": 32902
            }
          }
        ]
      }
    }
  }
}
```
### Stateful Models

OpenVINO supports stateful models with internal KV cache management:

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Enable stateful model (CausalLM)
config.set_provider_option("openvino", "enable_causallm", "True")

model = og.Model(config)
```

When `enable_causallm` is set to `"True"`, OpenVINO manages the KV cache internally, reducing memory overhead.
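The same option can also be set in genai_config.json. This is a sketch following the `provider_options` layout shown earlier:

```json
{
  "openvino": {
    "device_type": "CPU",
    "enable_causallm": "True"
  }
}
```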
## Quantization Support

### INT8 Quantization

OpenVINO provides excellent INT8 performance:

```python
import onnxruntime_genai as og

config = og.Config(model_path)  # Use an INT8-quantized model
config.clear_providers()
config.append_provider("openvino")
config.set_provider_option("openvino", "device_type", "CPU")
model = og.Model(config)
```
### Precision Hints

```json
{
  "openvino": {
    "device_type": "CPU",
    "load_config": {
      "CPU": {
        "INFERENCE_PRECISION_HINT": "i8"
      }
    }
  }
}
```
| Precision | Memory | Performance |
|-----------|--------|-------------|
| FP32 (full precision) | Baseline | Highest accuracy, baseline performance |
| FP16 (half precision) | 2x reduction | Faster on GPU |
| INT8 (quantized) | 4x reduction | 2-4x speedup on CPU |
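The memory reductions listed above follow directly from bytes per weight: 4 for FP32, 2 for FP16, 1 for INT8. A quick back-of-envelope helper (illustrative only; real deployments add KV cache and activation memory on top of the weights):

```python
BYTES_PER_PARAM = {"f32": 4, "f16": 2, "i8": 1}

def weight_memory_gb(num_params, precision):
    """Approximate weight storage in GiB for a given precision hint."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3

# For a 7B-parameter model:
for p in ("f32", "f16", "i8"):
    print(p, round(weight_memory_gb(7_000_000_000, p), 1))
# → f32 26.1, f16 13.0, i8 6.5
```

The precision keys mirror the `INFERENCE_PRECISION_HINT` values (`f32`, `i8`) used in the config above.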
## Multi-Device Execution

OpenVINO can distribute inference across multiple devices:

```python
import onnxruntime_genai as og

config = og.Config(model_path)
config.clear_providers()
config.append_provider("openvino")

# Use multiple devices
config.set_provider_option("openvino", "device_type", "MULTI:CPU,GPU")

model = og.Model(config)
```
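Besides MULTI, OpenVINO's AUTO plugin selects a single device from a priority list at load time, falling back down the list when a device is missing. This is a sketch: AUTO uses the same device-string syntax as MULTI, but confirm that your OpenVINO execution provider build supports it:

```json
{
  "openvino": {
    "device_type": "AUTO:GPU,CPU"
  }
}
```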
## Troubleshooting

### OpenVINO Not Found

```shell
# Install OpenVINO
pip install openvino

# Verify installation
python -c "import openvino; print(openvino.__version__)"

# List the devices OpenVINO can see
python -c "from openvino import Core; print(Core().available_devices)"
```
### Device Not Available

```python
import onnxruntime_genai as og

try:
    config = og.Config(model_path)
    config.clear_providers()
    config.append_provider("openvino")
    config.set_provider_option("openvino", "device_type", "GPU")
    model = og.Model(config)
except Exception as e:
    print(f"GPU not available: {e}")
    print("Falling back to CPU")
    config.set_provider_option("openvino", "device_type", "CPU")
    model = og.Model(config)
```
### Slow First Load

Enable model caching so compiled models are reused:

```python
config.set_provider_option("openvino", "cache_dir", "./ov_cache")
```

The first load will be slower, but subsequent loads will be much faster.

### Low Throughput

```python
config.set_provider_option("openvino", "num_streams", "AUTO")
config.set_provider_option("openvino", "performance_hint", "THROUGHPUT")
```

INT8 quantized models also provide a 2-4x speedup on Intel CPUs.
## Benchmarking

```python
import time
import onnxruntime_genai as og

# CPU benchmark
config_cpu = og.Config(model_path)
config_cpu.clear_providers()
config_cpu.append_provider("openvino")
config_cpu.set_provider_option("openvino", "device_type", "CPU")
model_cpu = og.Model(config_cpu)

# GPU benchmark (if available)
config_gpu = og.Config(model_path)
config_gpu.clear_providers()
config_gpu.append_provider("openvino")
config_gpu.set_provider_option("openvino", "device_type", "GPU")
try:
    model_gpu = og.Model(config_gpu)
    print("GPU available for benchmarking")
except Exception:
    model_gpu = None
    print("GPU not available")

# Run inference on each available device
tokenizer = og.Tokenizer(model_cpu)
prompt = "Tell me about AI"
input_tokens = tokenizer.encode(prompt)

for device, model in [("CPU", model_cpu), ("GPU", model_gpu)]:
    if model is None:
        continue
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=100)
    start = time.time()
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)
    num_generated = 0
    while not generator.is_done():
        generator.generate_next_token()
        num_generated += 1
    elapsed = time.time() - start
    print(f"{device} - Time: {elapsed:.2f}s, Tokens/sec: {num_generated / elapsed:.2f}")
```
## Best Practices

- **Use model caching**: Enable `cache_dir` to reduce model loading time on subsequent runs.
- **Choose the right device**: Use CPU for latency, GPU for throughput, NPU for power efficiency.
- **Optimize precision**: Use INT8 models for best CPU performance on Intel hardware.
- **Set performance hints**: Choose the hint that matches your use case.
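Putting these practices together, a reasonable starting genai_config.json for CPU inference might look like the following (a sketch combining options documented above; tune per workload):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          {
            "openvino": {
              "device_type": "CPU",
              "cache_dir": "./ov_cache",
              "performance_hint": "LATENCY",
              "enable_causallm": "True"
            }
          }
        ]
      }
    }
  }
}
```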
## Next Steps

- **Model Optimization**: Optimize models for OpenVINO
- **Quantization Guide**: Learn about INT8 quantization