The OOM Flight Recorder captures memory state leading up to out-of-memory errors, creating diagnostic bundles that help you understand and debug memory issues. This feature works with both PyTorch (CUDA/ROCm/MPS) and TensorFlow.

Enable OOM recording

Configure the tracker with OOM recording:
from gpumemprof.tracker import MemoryTracker

tracker = MemoryTracker(
    device=0,
    sampling_interval=0.1,
    enable_oom_flight_recorder=True,
    oom_dump_dir="oom_dumps",
    oom_buffer_size=10_000,
    oom_max_dumps=3,
    oom_max_total_mb=128,
)
See oom_flight_recorder_scenario.py:131-138

Configuration options:
  • oom_dump_dir: Directory for diagnostic bundles
  • oom_buffer_size: Number of events to keep in memory (defaults to max_events)
  • oom_max_dumps: Maximum number of dump bundles to retain
  • oom_max_total_mb: Maximum total storage for dumps
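One way to picture oom_buffer_size: the recorder keeps only the most recent events, ring-buffer style, so a dump shows the window of activity just before the failure. The sketch below with collections.deque is illustrative only, not the library's internals:

```python
from collections import deque

# Illustrative only: oom_buffer_size behaves like a bounded ring buffer
# that retains just the most recent memory events.
buffer_size = 5  # stands in for oom_buffer_size
events = deque(maxlen=buffer_size)

for i in range(12):
    events.append({"seq": i, "memory_allocated": i * 64})

# Only the last `buffer_size` events would make it into a dump.
recent = list(events)
print([e["seq"] for e in recent])  # → [7, 8, 9, 10, 11]
```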

Capture OOM context

Use the capture_oom() context manager to wrap code that might run out of memory:
tracker.start_tracking()

try:
    with tracker.capture_oom(
        context="training.forward_pass",
        metadata={"batch_size": 128, "model": "resnet50"}
    ):
        # Code that might OOM
        outputs = model(large_batch)
        loss = criterion(outputs, targets)
        loss.backward()
except RuntimeError as e:
    print(f"OOM occurred: {e}")
    print(f"Dump saved to: {tracker.last_oom_dump_path}")
finally:
    tracker.stop_tracking()
See oom_flight_recorder_scenario.py:29-38

Classify exceptions

The recorder automatically detects OOM errors:
from gpumemprof.oom_flight_recorder import classify_oom_exception

try:
    # Code that might fail
    tensor = torch.randn(1000000, 1000000, device="cuda")
except Exception as exc:
    classification = classify_oom_exception(exc)
    
    if classification.is_oom:
        print(f"OOM detected: {classification.reason}")
        # Dump was automatically captured
    else:
        print("Non-OOM error")
        raise
See oom_flight_recorder_scenario.py:71-74 and oom_flight_recorder.py:51-79

The classifier detects:
  • torch.cuda.OutOfMemoryError
  • tensorflow.ResourceExhaustedError
  • Generic errors with “out of memory” messages
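As a rough mental model of these rules (the exact logic lives in oom_flight_recorder.py), a classifier along the following lines would cover all three bullets; looks_like_oom is a hypothetical name, not the library's API:

```python
def looks_like_oom(exc: BaseException) -> bool:
    # Hypothetical sketch of the classification heuristic, not the real
    # classify_oom_exception. Match well-known OOM exception types by name...
    if type(exc).__name__ in ("OutOfMemoryError", "ResourceExhaustedError"):
        return True
    # ...then fall back to scanning the message for a generic OOM phrase.
    return "out of memory" in str(exc).lower()
```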

Simulated OOM testing

Test OOM recording without actually running out of memory:
tracker.start_tracking()

try:
    with tracker.capture_oom(
        context="test.simulated_oom",
        metadata={"scenario_mode": "simulated"}
    ):
        # Simulate an OOM error
        raise RuntimeError("simulated out of memory for demo")
except RuntimeError as exc:
    print(f"Captured simulated OOM: {exc}")
finally:
    tracker.stop_tracking()

print(f"Dump path: {tracker.last_oom_dump_path}")
See oom_flight_recorder_scenario.py:29-38

Stress testing

Trigger real OOM conditions for testing:
import torch

tracker.start_tracking()
tensors = []
device = torch.device("cuda")

try:
    with tracker.capture_oom(
        context="stress_test",
        metadata={
            "max_total_mb": 8192,
            "step_mb": 64,
        }
    ):
        # Allocate until OOM
        while True:
            elements = 64 * 1024 * 1024 // 4  # 64 MB of float32 (4 bytes each)
            block = torch.randn(elements, device=device)
            tensors.append(block)
except RuntimeError as exc:
    print(f"OOM after {len(tensors)} allocations")
    print(f"Dump: {tracker.last_oom_dump_path}")
finally:
    tensors.clear()
    torch.cuda.empty_cache()
    tracker.stop_tracking()
See oom_flight_recorder_scenario.py:41-86

Dump bundle structure

Each OOM dump contains:
oom_dumps/
└── oom_cuda_20260303_152030_001/
    ├── manifest.json          # Bundle metadata
    ├── events.json            # Memory events leading to OOM
    ├── metadata.json          # Exception and context details
    └── environment.json       # System and GPU information

manifest.json

{
  "schema_version": 1,
  "bundle_name": "oom_cuda_20260303_152030_001",
  "created_at_utc": "2026-03-03T15:20:30Z",
  "reason": "oom_exception",
  "backend": "cuda",
  "event_count": 1247,
  "files": ["events.json", "metadata.json", "environment.json"]
}

metadata.json

{
  "reason": "oom_exception",
  "exception_type": "OutOfMemoryError",
  "exception_module": "torch.cuda",
  "exception_message": "CUDA out of memory...",
  "context": "training.forward_pass",
  "backend": "cuda",
  "captured_event_count": 1247,
  "custom_metadata": {
    "batch_size": 128,
    "model": "resnet50"
  }
}
See oom_flight_recorder.py:103-151
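Since manifest.json lists the bundle's files, a quick integrity check can be written against it. This helper is illustrative, not part of the library:

```python
import json
from pathlib import Path

def verify_bundle(bundle_dir):
    # Illustrative helper: report any file the manifest promises
    # but the bundle directory does not actually contain.
    bundle = Path(bundle_dir)
    manifest = json.loads((bundle / "manifest.json").read_text())
    return [f for f in manifest["files"] if not (bundle / f).exists()]
```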

events.json

Contains the sequence of memory events:
[
  {
    "timestamp": 1709480430.123,
    "event_type": "allocation",
    "memory_allocated": 2147483648,
    "memory_reserved": 2415919104,
    "memory_change": 134217728,
    "device_id": 0,
    "context": "training.forward_pass",
    "backend": "cuda"
  }
]

Analyze OOM dumps

Load and analyze captured dumps:
import json
from pathlib import Path

def analyze_oom_dump(dump_path):
    dump_dir = Path(dump_path)
    
    # Load manifest
    with open(dump_dir / "manifest.json") as f:
        manifest = json.load(f)
    
    # Load events
    with open(dump_dir / "events.json") as f:
        events = json.load(f)
    
    # Load metadata
    with open(dump_dir / "metadata.json") as f:
        metadata = json.load(f)
    
    print(f"OOM occurred in: {metadata['context']}")
    print(f"Exception: {metadata['exception_type']}")
    print(f"Events captured: {len(events)}")
    
    # Analyze memory growth
    if events:
        first_allocated = events[0]["memory_allocated"]
        last_allocated = events[-1]["memory_allocated"]
        growth_mb = (last_allocated - first_allocated) / (1024**2)
        print(f"Memory growth: {growth_mb:.2f} MB")
    
    return manifest, events, metadata

# Analyze the last dump
if tracker.last_oom_dump_path:
    analyze_oom_dump(tracker.last_oom_dump_path)

Retention policy

The recorder enforces storage limits:
# Configure retention
tracker = MemoryTracker(
    enable_oom_flight_recorder=True,
    oom_max_dumps=5,         # Keep at most 5 dumps
    oom_max_total_mb=256,    # Use at most 256MB total
)
When limits are exceeded:
  • Oldest dumps are deleted first
  • Size is calculated from actual file sizes on disk
  • Disk usage therefore stays bounded
See oom_flight_recorder.py:32-40
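The retention rules above can be sketched as a pruning pass over the dump directory. This is an illustration of the policy, not the code in oom_flight_recorder.py; it assumes bundle names embed a timestamp, so lexical order matches age:

```python
from pathlib import Path

def prune_dumps(dump_dir, max_dumps, max_total_mb):
    # Illustrative sketch of the retention policy: delete oldest bundles
    # until both the count limit and the size limit are satisfied.
    bundles = sorted(p for p in Path(dump_dir).iterdir() if p.is_dir())

    def bundle_size(path):
        return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

    while bundles and (
        len(bundles) > max_dumps
        or sum(map(bundle_size, bundles)) > max_total_mb * 1024 ** 2
    ):
        oldest = bundles.pop(0)  # timestamped names sort oldest-first
        for entry in sorted(oldest.rglob("*"), reverse=True):  # deepest first
            if entry.is_file():
                entry.unlink()
            else:
                entry.rmdir()
        oldest.rmdir()
    return [p.name for p in bundles]
```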

Backend support

OOM recording works with multiple backends:
# CUDA/ROCm
tracker = MemoryTracker(device=0, enable_oom_flight_recorder=True)

# MPS (Apple Silicon)
tracker = MemoryTracker(device="mps", enable_oom_flight_recorder=True)

# CPU (for TensorFlow or CPU-only workloads)
from gpumemprof import CPUMemoryTracker
tracker = CPUMemoryTracker(sampling_interval=0.1)
See oom_flight_recorder_scenario.py:23-26 and oom_flight_recorder_scenario.py:104-115
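A small dispatch helper can tie the three variants together; pick_device here is hypothetical glue code, not a gpumemprof API:

```python
def pick_device(cuda_available: bool, mps_available: bool):
    # Hypothetical helper mapping runtime availability to the `device`
    # argument used in the examples above; None signals falling back
    # to CPUMemoryTracker.
    if cuda_available:
        return 0        # first CUDA/ROCm device
    if mps_available:
        return "mps"    # Apple Silicon
    return None
```

In practice you would feed it torch.cuda.is_available() and torch.backends.mps.is_available().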
