
InferenceSession

The InferenceSession class is the main entry point for loading and running ONNX models. It owns the loaded model, manages its execution providers, and exposes the methods used to run inference.

Constructor

InferenceSession(
    path_or_bytes: str | bytes | os.PathLike,
    sess_options: SessionOptions | None = None,
    providers: Sequence[str | tuple[str, dict]] | None = None,
    provider_options: Sequence[dict] | None = None
)
path_or_bytes
str | bytes | os.PathLike
required
Path to the ONNX model file or serialized model as bytes. File extension .ort indicates ORT format, otherwise ONNX format is assumed.
sess_options
SessionOptions
Session configuration options. See SessionOptions for details.
providers
Sequence[str | tuple[str, dict]]
Execution providers in order of decreasing precedence. Can be provider names or tuples of (provider name, options dict). If not provided, all available providers are used.
provider_options
Sequence[dict]
Options dicts corresponding to providers. Should not be used if providers contains tuples with options.

Methods

run()

Compute predictions for the given inputs.
run(
    output_names: list[str] | None,
    input_feed: dict[str, np.ndarray],
    run_options: RunOptions | None = None
) -> list[np.ndarray]
output_names
list[str]
Names of the outputs to compute. If None, all outputs are computed.
input_feed
dict[str, np.ndarray]
required
Dictionary mapping input names to input values as numpy arrays.
run_options
RunOptions
Run-specific options. See RunOptions.
outputs
list[np.ndarray]
List of output tensors as numpy arrays.

run_async()

Compute predictions asynchronously in a separate thread.
run_async(
    output_names: list[str] | None,
    input_feed: dict[str, np.ndarray],
    callback: Callable,
    user_data: Any,
    run_options: RunOptions | None = None
)
callback
Callable
required
Python function called from an ORT worker thread when inference completes. It receives the list of output arrays, the user_data object, and an error string (empty on success).
user_data
Any
User data passed to callback function.

run_with_iobinding()

Run inference using IOBinding for GPU memory optimization.
run_with_iobinding(
    iobinding: IOBinding,
    run_options: RunOptions | None = None
)
iobinding
IOBinding
required
IOBinding object with inputs/outputs bound to device memory. See IOBinding.

get_inputs()

Get metadata about model inputs.
get_inputs() -> list[NodeArg]
inputs
list[NodeArg]
List of NodeArg objects describing input names, shapes, and types.

get_outputs()

Get metadata about model outputs.
get_outputs() -> list[NodeArg]
outputs
list[NodeArg]
List of NodeArg objects describing output names, shapes, and types.

get_providers()

Get registered execution providers for this session.
get_providers() -> list[str]
providers
list[str]
List of provider names in order of precedence.

set_providers()

Register new execution providers, recreating the underlying session.
set_providers(
    providers: Sequence[str | tuple[str, dict]] | None = None,
    provider_options: Sequence[dict] | None = None
)

get_modelmeta()

Get model metadata.
get_modelmeta() -> ModelMetadata
metadata
ModelMetadata
ModelMetadata object with producer name, version, description, etc.

end_profiling()

End profiling session and return results file path.
end_profiling() -> str
profile_file
str
Path to profiling results file.

Example Usage

import onnxruntime as ort
import numpy as np

# Create session with CUDA provider
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Inspect inputs
for inp in sess.get_inputs():  # avoid shadowing the builtin input()
    print(f"Input: {inp.name}, shape: {inp.shape}, type: {inp.type}")

# Run inference
input_data = {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}
outputs = sess.run(None, input_data)

print(f"Output shape: {outputs[0].shape}")

GPU Memory Optimization

# Use IOBinding for zero-copy GPU inference
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

io_binding = sess.io_binding()

# Bind inputs/outputs to GPU device memory
input_array = np.random.randn(1, 3, 224, 224).astype(np.float32)
ortvalue = ort.OrtValue.ortvalue_from_numpy(input_array, "cuda", 0)
io_binding.bind_ortvalue_input("input", ortvalue)
io_binding.bind_output("output", "cuda")

# Run on GPU; outputs stay on device until copied back to host
sess.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()