Overview
ONNX Runtime provides multiple strategies for optimizing memory usage during model inference and training. This guide covers memory management techniques, the Memory Optimizer for training, and best practices for reducing memory footprint.
Memory Management Basics
Memory Arenas
ONNX Runtime uses memory arenas to reduce allocation overhead:
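A minimal sketch enabling the CPU memory arena through SessionOptions (the model path is a placeholder):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Arena-based allocation for CPU tensors (enabled by default).
sess_options.enable_cpu_mem_arena = True

session = ort.InferenceSession("model.onnx", sess_options)
```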
Memory Pattern Optimization
Memory pattern optimization pre-allocates memory based on the model's execution pattern:
- Analyzes memory usage during the first inference
- Pre-allocates required memory for subsequent runs
- Reduces allocation overhead and fragmentation
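A sketch enabling memory patterns via the standard SessionOptions flag:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Record the allocation pattern of the first run and pre-allocate for
# subsequent runs. Only effective when input shapes are static.
sess_options.enable_mem_pattern = True

session = ort.InferenceSession("model.onnx", sess_options)
```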
GPU Memory Management
Limiting GPU Memory
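The CUDA execution provider accepts a gpu_mem_limit option (in bytes) that caps its arena. A sketch capping the arena at 2 GiB:

```python
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {
        # Cap the CUDA memory arena at 2 GiB (value is in bytes).
        "gpu_mem_limit": 2 * 1024 * 1024 * 1024,
    }),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
```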
Arena Extension Strategies
- kNextPowerOfTwo: Extends memory in power-of-two increments (default)
- kSameAsRequested: Extends memory by exactly the requested amount
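The strategy is set per execution provider; a sketch selecting kSameAsRequested on the CUDA provider:

```python
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {
        # Grow the arena by exactly what each request needs rather
        # than rounding up to the next power of two.
        "arena_extend_strategy": "kSameAsRequested",
    }),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
```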
Memory Optimizer for Training
The Memory Optimizer trades computation for memory by recomputing activations instead of storing them.
When to Use Memory Optimizer
Memory Optimizer is beneficial when:
- Training fails with OOM (Out of Memory) at the minimum batch size
- You can run batch size N but want to run 2N without OOM
- GPU compute and memory bandwidth are not fully saturated
Mode 1: Transformer Layerwise Recompute
A simple one-line configuration for transformer models:
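A sketch of the one-line setup; the environment variable must be set before ORTModule wraps the model:

```python
import os
# Enable transformer layerwise recompute before constructing ORTModule.
os.environ["ORTMODULE_MEMORY_OPT_LEVEL"] = "1"

from onnxruntime.training.ortmodule import ORTModule
model = ORTModule(model)  # `model` is your existing torch.nn.Module
```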
Memory Optimization Levels
The optimizer is driven by the ORTMODULE_MEMORY_OPT_LEVEL environment variable:
- 0: apply only user-specified plans (the default; see Mode 2 below)
- 1: apply transformer layerwise recompute automatically
Example Output
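With recompute enabled, ORTModule logs a summary listing each detected subgraph cluster, the strategy applied to it, and the estimated memory saved; the exact layout varies by release.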
Mode 2: Manual Subgraph Selection
An advanced mode for fine-grained control.
Step 1: Discover Available Plans
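Run a few training steps with memory optimization at level 0 and ORTModule logging raised; the log then lists the recomputable subgraph clusters found in your model. A sketch, assuming a standard PyTorch training loop:

```python
import os
os.environ["ORTMODULE_MEMORY_OPT_LEVEL"] = "0"  # no plans applied yet

from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

model = ORTModule(model, DebugOptions(log_level=LogLevel.INFO))
# Train a few steps, then inspect the log for the detected clusters
# (e.g. "BiasGelu+") and their potential memory savings.
```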
Step 2: Create Configuration File
"<ClusterID>:<Strategy>:<RequestCount>"
- ClusterID: Subgraph pattern (e.g., “BiasGelu+”)
- Strategy: 0=disabled, 1=recompute, 2=compromised recompute
- RequestCount: Number of occurrences to apply (-1 = all)
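A sketch of writing the plans to a configuration file, one per line (the file name and plans are placeholders; older releases instead take the plans as a single comma-separated string in the environment variable):

```python
# Write the plans discovered in Step 1 to a config file.
plans = [
    "BiasGelu+:1:-1",  # recompute all BiasGelu clusters
    "Dropout+:1:1",    # recompute only the first Dropout cluster
]
with open("mem_opt_config.txt", "w") as f:
    f.write("\n".join(plans))
```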
Step 3: Apply Configuration
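A sketch of wiring the plans in before wrapping the model; whether ORTMODULE_MEMORY_OPT_CONFIG takes the plan string itself or a path to the file above depends on the ONNX Runtime release:

```python
import os

os.environ["ORTMODULE_MEMORY_OPT_LEVEL"] = "0"  # use user-specified plans
os.environ["ORTMODULE_MEMORY_OPT_CONFIG"] = "mem_opt_config.txt"  # or "BiasGelu+:1:-1"

from onnxruntime.training.ortmodule import ORTModule
model = ORTModule(model)  # `model` is your torch.nn.Module
```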
Configuration Examples
Example 1: Recompute All BiasGelu Operations
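Strategy 1 (recompute) with RequestCount -1 covers every occurrence:

```
BiasGelu+:1:-1
```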
Example 2: Recompute First Dropout Only
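RequestCount 1 limits the plan to the first occurrence:

```
Dropout+:1:1
```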
Example 3: Multiple Subgraphs
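Combine plans with one per line in the config file (or comma-separated when passed as a single string):

```
BiasGelu+:1:-1
Dropout+:1:-1
```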
Example 4: Compromised Recompute
Saves partial memory (e.g., 50% of activations):
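Using strategy 2 from the format above:

```
BiasGelu+:2:-1
```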
Debug Information
Enable detailed logging (see the sketch after this list) to report:
- Node-level activation patterns
- Memory saving opportunities
- Reuse frequency of activations
- Byte savings per optimization
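A sketch raising ORTModule's log level so these details appear in the training log (LogLevel.INFO here; some builds expose a more verbose DEVINFO level for this output):

```python
from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel

# Memory-optimizer details are emitted to the ORTModule log.
model = ORTModule(model, DebugOptions(log_level=LogLevel.INFO))
```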
I/O Binding for Memory Efficiency
Zero-Copy Inference
Eliminate memory copies between host and device:
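A minimal sketch using the IOBinding API; the input/output names ("input", "output"), shapes, and model path are placeholders:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
binding = session.io_binding()

x = np.zeros((1, 3, 224, 224), dtype=np.float32)
binding.bind_cpu_input("input", x)  # bind without copying the array
binding.bind_output("output")       # let ORT allocate the output

session.run_with_iobinding(binding)
result = binding.copy_outputs_to_cpu()[0]
```

GPU Zero-Copy
With OrtValue, both input and output can stay in GPU memory across runs (again a sketch; names and shapes are placeholders):

```python
# Assumes the session was created with CUDAExecutionProvider.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type((1, 1000), np.float32, "cuda", 0)

binding.bind_ortvalue_input("input", x_gpu)    # stays on device
binding.bind_ortvalue_output("output", y_gpu)  # pre-allocated on device
session.run_with_iobinding(binding)            # no host/device copies
```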
Memory Profiling
Track Memory Usage
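ONNX Runtime's Python API does not expose a memory counter directly, so a simple approach is to sample process RSS around a run with the third-party psutil package (an assumption, not part of ONNX Runtime; names and shapes are placeholders):

```python
import os

import numpy as np
import onnxruntime as ort
import psutil  # third-party package

proc = psutil.Process(os.getpid())

def rss_mb():
    return proc.memory_info().rss / (1024 * 1024)

session = ort.InferenceSession("model.onnx")
x = np.zeros((1, 3, 224, 224), dtype=np.float32)

before = rss_mb()
session.run(None, {"input": x})
print(f"RSS delta: {rss_mb() - before:.1f} MiB")
```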
GPU Memory Profiling
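Device-wide GPU usage can be sampled with the third-party nvidia-ml-py (pynvml) package, also an assumption rather than part of ONNX Runtime:

```python
import pynvml  # third-party nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory used: {info.used / 1024**2:.0f} MiB "
      f"of {info.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()
```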
Model Optimization for Memory
Quantization
Reduce memory footprint with quantization:
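Dynamic quantization to INT8 weights roughly quarters weight storage; a sketch using the onnxruntime.quantization tooling (paths are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",       # input model
    "model.int8.onnx",  # quantized output
    weight_type=QuantType.QInt8,
)
```

Graph Optimization
Graph-level optimizations (constant folding, node fusion) also reduce the number of intermediate tensors that must be materialized:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)
```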
Best Practices
1. Enable Memory Patterns
Keep enable_mem_pattern on for models with static input shapes; pattern pre-allocation is skipped automatically when shapes vary between runs.
2. Use Appropriate Batch Sizes
Larger batches amortize per-run overhead but raise peak activation memory; use the largest batch that still leaves headroom.
3. Limit GPU Memory Growth
Set gpu_mem_limit when sessions share a device so that no single arena can consume all GPU memory.
4. Reuse Sessions
Each InferenceSession owns its arenas and is expensive to create; build it once and share it across requests (see the sketch after this list).
5. Use I/O Binding
Bind inputs and outputs once and reuse the binding across runs to avoid repeated host/device copies.
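For item 4, a minimal sketch of the create-once, run-many pattern (the model path and function are placeholders):

```python
import onnxruntime as ort

# Create once at startup; run() is safe to call from multiple threads.
SESSION = ort.InferenceSession("model.onnx")

def predict(inputs):
    # Reuse the same session (and its arenas) for every request.
    return SESSION.run(None, inputs)
```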
Memory Optimization Checklist
- Enable memory pattern optimization
- Enable CPU/GPU memory arenas
- Use appropriate arena extension strategy
- Limit GPU memory if needed
- Use I/O binding for zero-copy
- Enable Memory Optimizer for training (if applicable)
- Consider model quantization
- Profile memory usage
- Use optimal batch sizes
- Reuse sessions and bindings
Troubleshooting
Out of Memory (OOM) Errors
- Reduce batch size
- Enable Memory Optimizer (training)
- Limit GPU memory
- Use a quantized model
Memory Leaks
- Explicitly release outputs
- Clear I/O bindings
- Destroy sessions when done