Inference Overview

ONNX Runtime provides high-performance inference for ONNX models across multiple platforms and programming languages. This section covers everything you need to know about running inference with ONNX Runtime.

What is ONNX Runtime Inference?

ONNX Runtime inference is the process of using a trained ONNX model to make predictions on new data. ONNX Runtime optimizes models for production deployment and provides:
  • High Performance: Optimized kernels and execution providers for CPU, GPU, and specialized hardware
  • Cross-Platform: Run models on Windows, Linux, macOS, iOS, Android, and web browsers
  • Multiple APIs: Native APIs for Python, C/C++, C#, Java, and JavaScript
  • Hardware Acceleration: Support for CUDA, TensorRT, DirectML, CoreML, and more

Key Concepts

InferenceSession

The InferenceSession (or OrtSession in Java) is the main class for running inference. It:
  • Loads and validates your ONNX model
  • Manages model execution and optimization
  • Provides access to model metadata (inputs, outputs, custom metadata)
  • Handles multiple execution providers

Execution Providers

Execution providers enable hardware acceleration:
  • CPU: Default provider, optimized for x86/ARM processors
  • CUDA: NVIDIA GPU acceleration
  • TensorRT: NVIDIA TensorRT optimization
  • DirectML: Windows GPU acceleration
  • CoreML: Apple device acceleration
  • OpenVINO: Intel hardware optimization
  • WebGPU/WebNN: Browser-based acceleration

Session Options

Configure session behavior:
  • Graph Optimization Level: Control model optimization (disabled, basic, extended, all)
  • Thread Count: Set intra-op and inter-op thread parallelism
  • Memory Patterns: Enable/disable memory optimization strategies
  • Execution Mode: Sequential or parallel execution
  • Provider Options: Configure execution provider-specific settings

Basic Inference Workflow

  1. Create Environment/Session Options (optional)
  2. Load Model: Create an InferenceSession with your ONNX model
  3. Inspect Model: Query input/output names and shapes
  4. Prepare Inputs: Create tensors with your input data
  5. Run Inference: Execute the model with your inputs
  6. Process Outputs: Extract and use the prediction results

Language-Specific Guides

Choose your programming language:

Python API

Use ONNX Runtime in Python applications with NumPy integration

C/C++ API

High-performance inference with native C++ code

C# API

.NET integration for Windows, Linux, and cross-platform apps

Java API

Java applications and Android development

JavaScript API

Web browsers and Node.js applications

Performance Optimization

For production deployments, see the Model Optimization guide to learn about:
  • Graph optimizations
  • Quantization
  • Model profiling
  • Memory optimization
  • Multi-threading strategies

Supported Platforms

| Platform | Python | C/C++ | C#  | Java | JavaScript  |
|----------|--------|-------|-----|------|-------------|
| Windows  | ✓      | ✓     | ✓   | ✓    | ✓ (Node.js) |
| Linux    | ✓      | ✓     | ✓   | ✓    | ✓ (Node.js) |
| macOS    | ✓      | ✓     | ✓   | ✓    | ✓ (Node.js) |
| iOS      | -      | ✓     | ✓   | -    | -           |
| Android  | -      | ✓     | -   | ✓    | -           |
| Web      | -      | -     | -   | -    | ✓           |

Next Steps

  1. Choose Your Language: Select the API guide for your programming language from the cards above
  2. Run Your First Model: Follow the quickstart examples to load and run an ONNX model
  3. Optimize for Production: Learn about performance optimization and deployment best practices