Python Inference API
Learn how to run ONNX model inference in Python using the ONNX Runtime API. This guide includes real API signatures and working code examples.

Installation
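ONNX Runtime is installed from PyPI; the `onnxruntime-gpu` package replaces the CPU package when CUDA support is needed:

```shell
# CPU-only build
pip install onnxruntime

# or, for NVIDIA GPU support (don't install both packages side by side)
pip install onnxruntime-gpu
```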
Quick Start
The fastest way to get running is to create an InferenceSession and call run(); each piece is covered in detail below.

InferenceSession Class
Creating a Session
The InferenceSession constructor accepts a model file path or serialized model bytes, plus optional SessionOptions and a providers list.

Session Methods
run()
Execute the model with input data and return the outputs as a list of numpy arrays.

get_inputs()
Get model input metadata (name, shape, type) as a list of NodeArg objects.

get_outputs()
Get model output metadata as a list of NodeArg objects.

get_modelmeta()
Get model metadata (producer name, version, custom metadata).

SessionOptions
Configure session behavior before creating the session.

Graph Optimization Levels
The available levels are ORT_DISABLE_ALL, ORT_ENABLE_BASIC, ORT_ENABLE_EXTENDED, and ORT_ENABLE_ALL.
RunOptions
Configure individual inference runs.

Execution Providers
Checking Available Providers
Setting Providers
Providers are listed in priority order; ONNX Runtime assigns each graph node to the first provider that supports it.

Common Providers
Working with IOBinding
Use IOBinding for zero-copy inference with GPU tensors.

Complete Example: Image Classification
Performance Tips
Use the Right Execution Provider
Always specify execution providers in priority order. GPU providers like CUDA or TensorRT can provide 10-100x speedups for compute-intensive models.
Enable Graph Optimization
Set graph_optimization_level to ORT_ENABLE_ALL for maximum performance. The runtime will fuse operators and optimize the graph.

Reuse Sessions
Creating a session is expensive. Create once and reuse for multiple inferences.
Use IOBinding for GPU
When using GPU providers, IOBinding eliminates CPU-GPU memory copies for better performance.
Batch Inputs
Process multiple inputs in a single batch when possible to maximize hardware utilization.
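The batching tip in numpy terms (shapes are illustrative; the model must have been exported with a dynamic batch dimension):

```python
import numpy as np

# Eight single images stacked into one (8, 3, 224, 224) batch
images = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(8)]
batch = np.stack(images)
print(batch.shape)  # (8, 3, 224, 224)

# One call amortizes per-run overhead across the whole batch:
# outputs = sess.run(None, {"input": batch})
```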
Error Handling
Next Steps
Model Optimization
Learn how to optimize models for production
Execution Providers
Configure hardware acceleration