Deep dive into InferenceSession configuration and lifecycle management
The InferenceSession is the primary interface for running ONNX models in ONNX Runtime. It manages model loading, optimization, initialization, and execution.
```python
import onnxruntime as ort

sess_options = ort.SessionOptions()

# Intra-op threads: parallelize within operators
# Good for: matrix multiplication, convolutions
sess_options.intra_op_num_threads = 4

# Inter-op threads: execute independent operators in parallel
# Good for: models with many parallel branches
sess_options.inter_op_num_threads = 2
```
CPU-bound Models

```python
# Use more intra-op threads
sess_options.intra_op_num_threads = 8
sess_options.inter_op_num_threads = 1
```

Complex Graphs

```python
# Balance both thread pools
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 4
```

Single-threaded

```python
# Force single-threaded execution
sess_options.intra_op_num_threads = 1
sess_options.inter_op_num_threads = 1
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
```
Setting too many threads can hurt performance due to context switching and cache contention. Start with the number of physical cores.
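As a starting point, the physical-core count can be estimated from the logical count. This is a rough sketch: `os.cpu_count()` reports logical cores, and halving it is only a heuristic for SMT/hyper-threaded machines (a library such as psutil can report the physical count exactly).

```python
import os

# os.cpu_count() reports logical cores; on SMT/hyper-threaded CPUs a
# rough estimate of physical cores is half of that.
logical = os.cpu_count() or 1
physical_estimate = max(1, logical // 2)

# Apply it to the session options:
#   sess_options.intra_op_num_threads = physical_estimate
```

From there, benchmark with values above and below the estimate rather than trusting the heuristic blindly.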
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider']
)
io_binding = session.io_binding()

# Input starts on CPU; ONNX Runtime copies it to the GPU once
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
io_binding.bind_cpu_input('input', input_data)

# Keep output on GPU
io_binding.bind_output('output', 'cuda')

# Run on GPU
session.run_with_iobinding(io_binding)

# Output stays on GPU - efficient for chaining multiple operations
ort_value = io_binding.get_outputs()[0]

# Copy to CPU only when needed
output = ort_value.numpy()
```
```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model.onnx")
io_binding = session.io_binding()

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)  # example input

# Pre-allocate output buffer
output_shape = (1, 1000)  # Known output shape
output_buffer = np.empty(output_shape, dtype=np.float32)

# Bind the input, and bind the output to the pre-allocated buffer
io_binding.bind_cpu_input('input', input_data)
io_binding.bind_ortvalue_output(
    'output',
    ort.OrtValue.ortvalue_from_numpy(output_buffer)
)

session.run_with_iobinding(io_binding)
# Result is now in output_buffer, no extra copy needed
```
```python
import onnxruntime as ort
import numpy as np

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True

session = ort.InferenceSession("model.onnx", sess_options)

# Run inference
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)  # example input
for _ in range(100):
    outputs = session.run(["output"], {"input": input_data})

# Get profiling results
prof_file = session.end_profiling()
print(f"Profiling data saved to: {prof_file}")
```
View the profiling JSON file in Chrome’s tracing viewer (chrome://tracing) for detailed performance analysis.
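The profiling file can also be analyzed programmatically. The sketch below assumes the Chrome-trace layout ONNX Runtime emits, where per-node kernel events carry `"cat": "Node"` and a `"dur"` field in microseconds; the `top_ops` helper is hypothetical, not part of the onnxruntime API.

```python
import json
from collections import defaultdict

def top_ops(events, n=10):
    """Sum per-node durations from a Chrome-trace event list and
    return the n slowest operators, slowest first."""
    op_time = defaultdict(int)
    for ev in events:
        # Assumption: per-node kernel events are tagged cat == "Node",
        # with "dur" giving the duration in microseconds.
        if ev.get("cat") == "Node":
            op_time[ev["name"]] += ev.get("dur", 0)
    return sorted(op_time.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Usage with the file returned by session.end_profiling():
#   with open(prof_file) as f:
#       events = json.load(f)
#   for name, dur in top_ops(events):
#       print(f"{name}: {dur} us")
```

This quickly surfaces the handful of operators that dominate inference time, which is usually where optimization effort pays off.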
Creating a session is expensive. Reuse the same session for multiple inferences:
```python
# Good: reuse the session
session = ort.InferenceSession("model.onnx")
for data in dataset:
    outputs = session.run(["output"], {"input": data})

# Bad: create a session in the loop
for data in dataset:
    session = ort.InferenceSession("model.onnx")  # DON'T DO THIS
    outputs = session.run(["output"], {"input": data})
```
Thread Safety
Sessions are thread-safe for inference:
```python
import concurrent.futures
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

def run_inference(data):
    return session.run(["output"], {"input": data})

# Safe to call run() on the same session from multiple threads
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_inference, dataset))
```
Use IOBinding for Performance
Use IOBinding when running many inferences in a row; it avoids re-allocating and re-copying buffers on every call:
```python
session = ort.InferenceSession("model.onnx")
io_binding = session.io_binding()

for data in dataset:
    io_binding.bind_cpu_input('input', data)
    io_binding.bind_output('output')
    session.run_with_iobinding(io_binding)

    output = io_binding.copy_outputs_to_cpu()[0]
    # Process output ...

    # Reset bindings before the next iteration
    io_binding.clear_binding_inputs()
    io_binding.clear_binding_outputs()
```