Model Optimization for Inference
Learn how to optimize ONNX models for production inference with graph optimization, quantization, profiling, and performance tuning techniques.
Overview
Model optimization is crucial for production deployment. ONNX Runtime provides multiple optimization strategies:
- Graph Optimization: Fuse operators, eliminate redundant nodes, optimize memory layout
- Quantization: Reduce model size and improve speed with reduced precision
- Profiling: Identify performance bottlenecks
- Memory Optimization: Reduce memory footprint and allocations
- Threading: Optimize parallelism for multi-core processors
Graph Optimization
Graph optimization transforms the model's computation graph for better performance.
Optimization Levels
ONNX Runtime provides four optimization levels:
1. Disabled (ORT_DISABLE_ALL)
- No optimizations applied
- Use for debugging or when optimizations cause issues
2. Basic (ORT_ENABLE_BASIC)
- Constant folding
- Redundant node elimination
- Semantics-preserving node fusions
3. Extended (ORT_ENABLE_EXTENDED)
- All basic optimizations
- Complex node fusions (e.g., Conv + BatchNorm + Relu)
- Node reordering
- Algebraic simplifications
4. All (ORT_ENABLE_ALL)
- All extended optimizations
- Layout transformations (e.g., NCHWc format)
- Advanced memory planning
Applying Graph Optimization
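A minimal sketch of setting the optimization level on a session; "model.onnx" and the output path are placeholder names:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Apply all available graph optimizations.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally persist the optimized graph for inspection or faster startup.
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", sess_options)
```

Saving the optimized model lets you pay the optimization cost once offline rather than at every session creation.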
Common Graph Optimizations
Operator Fusion:
- Conv + BatchNorm + Relu → FusedConv
- MatMul + Add → Gemm
- Multiple Transpose operations → Single Transpose
Constant Folding:
- Pre-compute constant operations at graph load time
- Reduces inference computation
Redundant Node Elimination:
- Remove unused nodes and outputs
- Reduces memory and computation
Layout Transformation:
- Convert NCHW → NCHWc (blocked channel layout)
- Better cache locality and vectorization
Quantization
Quantization reduces model size and improves inference speed by using lower precision (INT8) instead of FP32.
Dynamic Quantization
Weights are quantized offline; activations are quantized dynamically during inference.
- 4x model size reduction
- 2-4x inference speedup on CPU
- Minimal accuracy loss (< 1%)
- No calibration data required
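A minimal sketch using onnxruntime's quantization tool; "model.onnx" and "model_int8.onnx" are placeholder paths:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize weights to INT8 offline; activations are quantized at run time.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```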
Static Quantization (QDQ)
Both weights and activations are quantized using calibration data.
- Better accuracy than dynamic quantization
- Faster inference than dynamic quantization
- Requires calibration dataset
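A sketch of static (QDQ) quantization; the random calibration reader below is a placeholder, and in practice you would feed batches drawn from your real validation data. The model path and input name are assumptions:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Placeholder reader; substitute real validation batches."""
    def __init__(self, input_name="input", num_batches=8):
        self._batches = iter(
            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_batches)
        )

    def get_next(self):
        # Return None when calibration data is exhausted.
        return next(self._batches, None)

quantize_static(
    model_input="model.onnx",          # placeholder path
    model_output="model_int8.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,      # insert Quantize/Dequantize node pairs
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```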
Quantization Guidelines
Choose the Right Quantization Method
- Use dynamic quantization for quick deployment with minimal setup
- Use static quantization for maximum performance when calibration data is available
Evaluate Accuracy
Always evaluate quantized model accuracy on your validation set.
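A sketch of comparing FP32 and INT8 accuracy on the same samples; `top1_accuracy`, the sample format, and the model paths are all illustrative assumptions:

```python
import numpy as np
import onnxruntime as ort

def top1_accuracy(model_path, samples):
    """samples: iterable of (input_array, label) pairs."""
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    correct = total = 0
    for x, label in samples:
        logits = session.run(None, {input_name: x})[0]
        correct += int(np.argmax(logits) == label)
        total += 1
    return correct / total

# Compare both models on identical validation samples:
# acc_fp32 = top1_accuracy("model.onnx", val_samples)
# acc_int8 = top1_accuracy("model_int8.onnx", val_samples)
```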
Not All Operators Support Quantization
Some operators may not be quantized. The quantization tool will skip unsupported operators automatically.
Platform-Specific Acceleration
- x86 CPUs: Use AVX-512 VNNI instructions for INT8 acceleration
- ARM CPUs: Use NEON instructions
- GPUs: Limited INT8 support, check execution provider documentation
Profiling
Profile model execution to identify bottlenecks.
Enable Profiling
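A minimal sketch of enabling profiling on a session; "model.onnx" is a placeholder path:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True

session = ort.InferenceSession("model.onnx", sess_options)
# ... run inference as usual ...

# Finalize and retrieve the path of the Chrome-tracing JSON file.
profile_path = session.end_profiling()
print(profile_path)
```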
Analyze Profiling Results
The profiling output is a JSON file in Chrome tracing format. View it in Chrome:
- Open the Chrome browser
- Navigate to chrome://tracing
- Click "Load" and select the profiling JSON file
Profiling Metrics
- Kernel Time: Time spent executing each operator
- Memory Allocation: Memory allocation events
- Data Transfer: CPU-GPU data transfer time (if using GPU)
- Session Overhead: Session initialization and cleanup
Memory Optimization
Memory Arena
Enable the memory arena for efficient memory allocation:
- Reduces memory fragmentation
- Faster allocation/deallocation
- Lower peak memory usage
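A minimal configuration sketch; note the CPU memory arena is already enabled by default, so this is shown for explicitness:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Pre-allocate a growable arena instead of calling the system
# allocator per tensor (the default on CPU).
sess_options.enable_cpu_mem_arena = True
```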
Memory Pattern Optimization
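Memory pattern optimization records allocation patterns on the first run and pre-plans allocations for subsequent runs. A minimal sketch (this setting is only effective with sequential execution):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Reuse the allocation plan recorded on the first inference run.
sess_options.enable_mem_pattern = True
```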
Sequential Execution Mode
For memory-constrained environments, use sequential execution mode so only one operator runs at a time, minimizing concurrent allocations.
Threading Optimization
Intra-Op Threading
Parallelism within a single operator (e.g., matrix multiplication):
- Set to the number of physical cores for CPU-bound operations
- More threads ≠ always faster (overhead increases)
- Start with physical core count and tune based on profiling
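A sketch of a reasonable starting configuration; the halving of `os.cpu_count()` is a rough heuristic that assumes hyper-threading is enabled:

```python
import os
import onnxruntime as ort

sess_options = ort.SessionOptions()
# os.cpu_count() reports logical cores; start near the physical core count.
sess_options.intra_op_num_threads = max(1, (os.cpu_count() or 2) // 2)
```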
Inter-Op Threading
Parallelism between independent operators:
- Useful for models with parallel branches
- Usually set to 1 or 2
- Higher values can cause overhead
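A minimal sketch; inter-op threads only take effect in parallel execution mode (the default mode is sequential):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Allow independent branches of the graph to run concurrently.
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
sess_options.inter_op_num_threads = 2
```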
Threading Best Practices
- Start from the defaults and change thread counts only after profiling
- When running several sessions or processes per machine, lower per-session thread counts to avoid oversubscription
Execution Provider Optimization
CPU Optimization
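A minimal CPU configuration sketch combining the settings above; "model.onnx" is a placeholder path:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CPUExecutionProvider"],  # explicit, though CPU is the fallback
)
```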
GPU Optimization (CUDA)
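A sketch of passing options to the CUDA execution provider; the specific values (4 GB arena cap, exhaustive algorithm search) are illustrative choices, not recommendations:

```python
import onnxruntime as ort

cuda_options = {
    "device_id": 0,
    "arena_extend_strategy": "kSameAsRequested",   # grow arena only as needed
    "gpu_mem_limit": 4 * 1024 * 1024 * 1024,       # cap arena at ~4 GB
    "cudnn_conv_algo_search": "EXHAUSTIVE",        # slower startup, faster conv
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```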
TensorRT Optimization
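A sketch of configuring the TensorRT execution provider; engine caching avoids rebuilding TensorRT engines on every startup. The cache path is a placeholder:

```python
import onnxruntime as ort

trt_options = {
    "trt_fp16_enable": True,            # allow FP16 kernels where supported
    "trt_engine_cache_enable": True,    # reuse built engines across runs
    "trt_engine_cache_path": "./trt_cache",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",        # fallback for unsupported nodes
        "CPUExecutionProvider",
    ],
)
```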
Model Size Optimization
External Data Format
For large models, store the weights in an external data file alongside the .onnx graph.
Model Pruning
Remove unnecessary outputs and the subgraphs that feed only them.
Batching Strategies
Static Batching
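With static batching the model expects a fixed batch size, so partial batches must be padded. A sketch with a hypothetical `pad_to_batch` helper:

```python
import numpy as np

def pad_to_batch(samples, batch_size):
    """Pad a list of equally-shaped arrays up to a fixed batch size."""
    batch = np.stack(samples)
    pad = batch_size - len(samples)
    if pad > 0:
        # Zero-fill the missing rows; the caller discards padded outputs.
        batch = np.concatenate(
            [batch, np.zeros((pad,) + batch.shape[1:], batch.dtype)]
        )
    return batch, len(samples)  # real count, for trimming results

batch, n = pad_to_batch([np.ones((3,), np.float32)] * 2, batch_size=4)
```

Returning the real sample count lets the caller slice off the dummy rows after inference.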
Dynamic Batching
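Dynamic batching groups pending requests into variable-sized batches at run time. A deliberately simplified sketch of the grouping step (a real server would also apply a timeout so small batches are not held indefinitely); `make_batches` is a hypothetical helper:

```python
def make_batches(requests, max_batch_size):
    """Group pending requests into batches of at most max_batch_size."""
    return [
        requests[i:i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]

# make_batches(list(range(7)), 3) -> [[0, 1, 2], [3, 4, 5], [6]]
```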
Benchmarking
Performance Measurement
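A sketch of a latency benchmark with warmup; `benchmark` is a hypothetical helper, and in practice `run` would wrap a `session.run(...)` call:

```python
import time
import numpy as np

def benchmark(run, warmup=10, iters=100):
    """Time a zero-argument callable; returns (mean, p95) latency in ms."""
    for _ in range(warmup):
        run()  # discard warmup iterations (JIT, caches, lazy init)
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        latencies.append((time.perf_counter() - start) * 1e3)
    return float(np.mean(latencies)), float(np.percentile(latencies, 95))

# Placeholder workload standing in for session.run(...):
mean_ms, p95_ms = benchmark(lambda: sum(range(1000)), warmup=2, iters=20)
```

Reporting a tail percentile alongside the mean matters because production latency targets are usually expressed as p95/p99, not averages.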
Optimization Checklist
Choose Execution Provider
Use CUDA/TensorRT for NVIDIA GPUs, DirectML for Windows, CoreML for Apple devices
Next Steps
Execution Providers
Learn about hardware-specific optimizations
Python API
Return to Python inference guide