
InferenceSession

The InferenceSession class is the main entry point for loading and running ONNX models. It owns the loaded model, manages its execution providers, and exposes the methods used to run inference.

Constructor

InferenceSession(
    path_or_bytes: str | bytes | os.PathLike,
    sess_options: SessionOptions | None = None,
    providers: Sequence[str | tuple[str, dict]] | None = None,
    provider_options: Sequence[dict] | None = None
)
path_or_bytes
str | bytes | os.PathLike
required
Path to the ONNX model file or serialized model as bytes. File extension .ort indicates ORT format, otherwise ONNX format is assumed.
sess_options
SessionOptions
Session configuration options. See SessionOptions for details.
providers
Sequence[str | tuple[str, dict]]
Execution providers in order of decreasing precedence. Can be provider names or tuples of (provider name, options dict). If not provided, all available providers are used.
provider_options
Sequence[dict]
Options dicts corresponding to providers. Should not be used if providers contains tuples with options.

Methods

run()

Compute predictions for the given inputs.
run(
    output_names: list[str] | None,
    input_feed: dict[str, np.ndarray],
    run_options: RunOptions | None = None
) -> list[np.ndarray]
output_names
list[str]
Names of the outputs to compute. If None, all outputs are computed.
input_feed
dict[str, np.ndarray]
required
Dictionary mapping input names to input values as numpy arrays.
run_options
RunOptions
Run-specific options. See RunOptions.
outputs
list[np.ndarray]
List of output tensors as numpy arrays.

run_async()

Compute predictions asynchronously in a separate thread.
run_async(
    output_names: list[str] | None,
    input_feed: dict[str, np.ndarray],
    callback: Callable,
    user_data: Any,
    run_options: RunOptions | None = None
)
callback
Callable
required
Python function called from an ORT worker thread when inference completes. It receives the list of output arrays, the user_data object, and an error string (empty on success).
user_data
Any
User data passed to callback function.

run_with_iobinding()

Run inference using IOBinding for GPU memory optimization.
run_with_iobinding(
    iobinding: IOBinding,
    run_options: RunOptions | None = None
)
iobinding
IOBinding
required
IOBinding object with inputs/outputs bound to device memory. See IOBinding.

get_inputs()

Get metadata about model inputs.
get_inputs() -> list[NodeArg]
inputs
list[NodeArg]
List of NodeArg objects describing input names, shapes, and types.

get_outputs()

Get metadata about model outputs.
get_outputs() -> list[NodeArg]
outputs
list[NodeArg]
List of NodeArg objects describing output names, shapes, and types.

get_providers()

Get registered execution providers for this session.
get_providers() -> list[str]
providers
list[str]
List of provider names in order of precedence.

set_providers()

Register new execution providers, recreating the underlying session.
set_providers(
    providers: Sequence[str | tuple[str, dict]] | None = None,
    provider_options: Sequence[dict] | None = None
)

get_modelmeta()

Get model metadata.
get_modelmeta() -> ModelMetadata
metadata
ModelMetadata
ModelMetadata object with producer name, version, description, etc.

end_profiling()

End profiling session and return results file path.
end_profiling() -> str
profile_file
str
Path to profiling results file.

Example Usage

import onnxruntime as ort
import numpy as np

# Create session with CUDA provider
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Inspect inputs
for inp in sess.get_inputs():  # avoid shadowing the builtin input()
    print(f"Input: {inp.name}, shape: {inp.shape}, type: {inp.type}")

# Run inference
input_data = {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}
outputs = sess.run(None, input_data)

print(f"Output shape: {outputs[0].shape}")

GPU Memory Optimization

# Use IOBinding for zero-copy GPU inference
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

io_binding = sess.io_binding()

# Bind inputs/outputs to GPU device memory
input_array = np.random.randn(1, 3, 224, 224).astype(np.float32)
ortvalue = ort.OrtValue.ortvalue_from_numpy(input_array, "cuda", 0)
io_binding.bind_ortvalue_input("input", ortvalue)
io_binding.bind_output("output", "cuda")

# Run on GPU; outputs stay on device until copied back to host
sess.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()