Enable OOM recording
Configure the tracker with OOM recording using the following options:
- oom_dump_dir: Directory for diagnostic bundles
- oom_buffer_size: Number of events to keep in memory (defaults to max_events)
- oom_max_dumps: Maximum number of dump bundles to retain
- oom_max_total_mb: Maximum total storage for dumps (in MB)
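To show how these options relate to each other, here is a minimal sketch of a configuration object. Only the four oom_* option names and max_events come from the list above. The OomConfig class, its default values, and its validation are assumptions made for illustration, not the tracker's actual constructor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OomConfig:
    """Hypothetical container for the OOM options (not the library's API)."""

    max_events: int = 10_000               # assumed default
    oom_dump_dir: str = "oom_dumps"        # directory for diagnostic bundles
    oom_buffer_size: Optional[int] = None  # None means "use max_events"
    oom_max_dumps: int = 5                 # assumed default
    oom_max_total_mb: int = 256            # assumed default

    def __post_init__(self) -> None:
        # The buffer size falls back to max_events when it is not set.
        if self.oom_buffer_size is None:
            self.oom_buffer_size = self.max_events
        if self.oom_max_dumps < 1 or self.oom_max_total_mb < 1:
            raise ValueError("retention limits must be positive")


config = OomConfig(oom_dump_dir="/tmp/oom_dumps", max_events=2_000)
print(config.oom_buffer_size)  # 2000, inherited from max_events
```

Keeping the fallback in one place means the in-memory buffer and the event limit cannot drift apart unless oom_buffer_size is set explicitly.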
Capture OOM context
Use the capture_oom() context manager to wrap code that might run out of memory.
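The pattern behind such a context manager can be sketched in a few lines. Everything below is hypothetical: SketchRecorder, capture_oom_sketch, the JSON field names, and the decision to re-raise are assumptions chosen for illustration, and the real capture_oom() may differ in signature and behavior. The sketch keeps a bounded event buffer and, when an error that looks like an OOM escapes the block, writes a bundle before re-raising.

```python
import json
import tempfile
import time
from collections import deque
from contextlib import contextmanager
from pathlib import Path


class SketchRecorder:
    """Hypothetical recorder: a bounded event buffer plus a dump writer."""

    def __init__(self, dump_dir, buffer_size=1000):
        self.dump_dir = Path(dump_dir)
        self.events = deque(maxlen=buffer_size)  # oldest events are dropped first

    def record(self, name, nbytes):
        self.events.append({"ts": time.time(), "event": name, "bytes": nbytes})

    def dump(self, exc, context):
        # One directory per bundle; the file contents here are assumed.
        bundle = self.dump_dir / f"oom_{time.time_ns()}"
        bundle.mkdir(parents=True)
        files = {
            "metadata.json": {"context": context, "error": repr(exc)},
            "events.json": list(self.events),
        }
        files["manifest.json"] = {"files": sorted(files)}
        for filename, payload in files.items():
            (bundle / filename).write_text(json.dumps(payload, indent=2))
        return bundle


@contextmanager
def capture_oom_sketch(recorder, context="train_step"):
    try:
        yield recorder
    except Exception as exc:
        # Simplified check: treat any "out of memory" message as an OOM.
        if "out of memory" in str(exc).lower():
            recorder.dump(exc, context)
        raise  # assumption: the caller still sees the original error


recorder = SketchRecorder(tempfile.mkdtemp(), buffer_size=3)
try:
    with capture_oom_sketch(recorder) as rec:
        for step in range(5):
            rec.record("alloc", 1024 * (step + 1))
        # Simulated failure, so the sketch runs without exhausting memory.
        raise RuntimeError("CUDA out of memory (simulated)")
except RuntimeError:
    pass

print([p.name for p in recorder.dump_dir.iterdir()])
```

Because the buffer holds three events, the bundle contains only the last three allocations recorded before the failure.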
Classify exceptions
The recorder automatically detects OOM errors:
- torch.cuda.OutOfMemoryError
- tensorflow.ResourceExhaustedError
- Generic errors with “out of memory” messages
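One way to classify exceptions like these without importing either framework is to match on the exception class name and fall back to the message text. The function below is a heuristic sketch under that assumption; is_oom_error and OOM_TYPE_NAMES are hypothetical names, and the recorder's real detection logic may work differently.

```python
# Class names taken from the list above; matching by name alone is an assumption.
OOM_TYPE_NAMES = {"OutOfMemoryError", "ResourceExhaustedError"}


def is_oom_error(exc: BaseException) -> bool:
    """Return True if exc looks like an out-of-memory error."""
    # Walk the class hierarchy so subclasses of a known OOM type also match.
    if any(base.__name__ in OOM_TYPE_NAMES for base in type(exc).__mro__):
        return True
    # Fallback for generic errors that only mention OOM in the message.
    return "out of memory" in str(exc).lower()


class OutOfMemoryError(RuntimeError):
    """Local stand-in so the example runs without PyTorch installed."""


print(is_oom_error(OutOfMemoryError("allocation failed")))              # True
print(is_oom_error(RuntimeError("CUDA out of memory. Tried 2.00 GiB")))  # True
print(is_oom_error(ValueError("shape mismatch")))                        # False
```

Name-based matching can misfire on an unrelated class that happens to share a name, so a stricter version would compare against the imported framework types when they are available.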
Simulated OOM testing
Test OOM recording without actually running out of memory.
Stress testing
Trigger real OOM conditions for testing.
Dump bundle structure
Each OOM dump contains:
- manifest.json
- metadata.json
- events.json: contains the sequence of memory events
Analyze OOM dumps
Load and analyze captured dumps.
Retention policy
The recorder enforces storage limits:
- Oldest dumps are deleted first
- Size is calculated based on actual file sizes
- Ensures bounded disk usage
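The retention rules above can be sketched as a small pruning function. This is an illustrative implementation under stated assumptions, not the recorder's own code: enforce_retention and bundle_size_bytes are hypothetical names, each bundle is assumed to be a directory, and age is assumed to be the directory's modification time.

```python
import os
import shutil
import tempfile
from pathlib import Path


def bundle_size_bytes(bundle: Path) -> int:
    # Size comes from the actual files on disk, not from an estimate.
    return sum(p.stat().st_size for p in bundle.rglob("*") if p.is_file())


def enforce_retention(dump_dir, max_dumps, max_total_mb):
    """Delete the oldest bundles until both limits hold; return removed names."""
    bundles = sorted(
        (p for p in Path(dump_dir).iterdir() if p.is_dir()),
        key=lambda p: (p.stat().st_mtime, p.name),  # oldest first
    )
    sizes = {b: bundle_size_bytes(b) for b in bundles}
    budget = max_total_mb * 1024 * 1024
    removed = []
    while bundles and (
        len(bundles) > max_dumps or sum(sizes[b] for b in bundles) > budget
    ):
        oldest = bundles.pop(0)
        shutil.rmtree(oldest)
        removed.append(oldest.name)
    return removed


# Demo with three tiny bundles whose ages are set explicitly.
root = Path(tempfile.mkdtemp())
for i, name in enumerate(["oom_a", "oom_b", "oom_c"]):
    bundle = root / name
    bundle.mkdir()
    (bundle / "events.json").write_text("[]")
    os.utime(bundle, (1_000 + i, 1_000 + i))

removed = enforce_retention(root, max_dumps=2, max_total_mb=1)
print(removed)  # ['oom_a']
```

With this design, a single bundle larger than the size budget would also be deleted, leaving no dumps at all. Whether the real recorder keeps the newest dump in that case is not stated in the text.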
Backend support
OOM recording works with multiple backends.
Next steps
- Export events with telemetry export
- Set up continuous monitoring with leak detection
- Learn about context managers for profiling