Inference Overview
ONNX Runtime provides high-performance inference for ONNX models across multiple platforms and programming languages. This section covers everything you need to know about running inference with ONNX Runtime.
What is ONNX Runtime Inference?
ONNX Runtime inference is the process of using a trained ONNX model to make predictions on new data. ONNX Runtime optimizes models for production deployment and provides:
- High Performance: Optimized kernels and execution providers for CPU, GPU, and specialized hardware
- Cross-Platform: Run models on Windows, Linux, macOS, iOS, Android, and web browsers
- Multiple APIs: Native APIs for Python, C/C++, C#, Java, and JavaScript
- Hardware Acceleration: Support for CUDA, TensorRT, DirectML, CoreML, and more
Key Concepts
InferenceSession
The InferenceSession (or OrtSession in Java) is the main class for running inference. It:
- Loads and validates your ONNX model
- Manages model execution and optimization
- Provides access to model metadata (inputs, outputs, custom metadata)
- Handles multiple execution providers
Execution Providers
Execution providers enable hardware acceleration:
- CPU: Default provider, optimized for x86/ARM processors
- CUDA: NVIDIA GPU acceleration
- TensorRT: NVIDIA TensorRT optimization
- DirectML: Windows GPU acceleration
- CoreML: Apple device acceleration
- OpenVINO: Intel hardware optimization
- WebGPU/WebNN: Browser-based acceleration
Session Options
Configure session behavior:
- Graph Optimization Level: Control model optimization (disabled, basic, extended, all)
- Thread Count: Set intra-op and inter-op thread parallelism
- Memory Patterns: Enable/disable memory optimization strategies
- Execution Mode: Sequential or parallel execution
- Provider Options: Configure execution provider-specific settings
Basic Inference Workflow
- Create Environment/Session Options (optional)
- Load Model: Create an InferenceSession with your ONNX model
- Inspect Model: Query input/output names and shapes
- Prepare Inputs: Create tensors with your input data
- Run Inference: Execute the model with your inputs
- Process Outputs: Extract and use the prediction results
Language-Specific Guides
Choose your programming language:
Python API
Use ONNX Runtime in Python applications with NumPy integration
C/C++ API
High-performance inference with native C++ code
C# API
.NET integration for Windows, Linux, and cross-platform apps
Java API
Java applications and Android development
JavaScript API
Web browsers and Node.js applications
Performance Optimization
For production deployments, see the Model Optimization guide to learn about:
- Graph optimizations
- Quantization
- Model profiling
- Memory optimization
- Multi-threading strategies
Supported Platforms
| Platform | Python | C/C++ | C# | Java | JavaScript |
|---|---|---|---|---|---|
| Windows | ✓ | ✓ | ✓ | ✓ | ✓ (Node.js) |
| Linux | ✓ | ✓ | ✓ | ✓ | ✓ (Node.js) |
| macOS | ✓ | ✓ | ✓ | ✓ | ✓ (Node.js) |
| iOS | - | ✓ | ✓ | - | - |
| Android | - | ✓ | - | ✓ | - |
| Web | - | - | - | - | ✓ |