The `oom_flight_recorder` module provides automatic OOM diagnostic capture and retention management.
## Classes

### OOMFlightRecorderConfig

Runtime configuration for OOM flight recorder dumps.

```python
from gpumemprof.oom_flight_recorder import OOMFlightRecorderConfig

config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="oom_dumps",
    buffer_size=10000,
    max_dumps=5,
    max_total_mb=256
)
```
**Attributes**

- `enabled`: Whether OOM recording is enabled
- `dump_dir`: Directory for storing OOM dump bundles
- `buffer_size`: Size of the event ring buffer
- `max_dumps`: Maximum number of OOM dumps to retain
- `max_total_mb`: Maximum total storage for OOM dumps, in megabytes
### OOMExceptionClassification

Normalized classification result for an exception.

```python
from gpumemprof.oom_flight_recorder import OOMExceptionClassification
```
**Attributes**

- `is_oom`: Whether the exception is an OOM error
- `reason`: Classification reason (e.g., `"torch.cuda.OutOfMemoryError"`)
### OOMFlightRecorder

Bounded recorder that writes diagnostic dump bundles on OOM.

```python
from gpumemprof.oom_flight_recorder import (
    OOMFlightRecorder,
    OOMFlightRecorderConfig
)

config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="./oom_diagnostics",
    buffer_size=5000,
    max_dumps=10
)
recorder = OOMFlightRecorder(config)
```
**Constructor**

- `config`: Configuration for the recorder
**Methods**

#### record_event()

Append one event payload to the in-memory ring buffer.

```python
import time

event = {
    "timestamp": time.time(),
    "event_type": "allocation",
    "memory_allocated": 1024 * 1024 * 500,
    "device_id": 0
}
recorder.record_event(event)
```
#### snapshot_events()

Return buffered events in chronological order.

```python
events = recorder.snapshot_events()
print(f"Buffered events: {len(events)}")
```
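The ring-buffer semantics implied by `buffer_size` — once the buffer is full, each new event evicts the oldest, and `snapshot_events()` returns the survivors oldest-first — can be sketched with `collections.deque`. This is an illustration of the behavior, not the library's internals:

```python
from collections import deque

class EventRingBuffer:
    """Bounded FIFO buffer: appending beyond capacity evicts the oldest event."""

    def __init__(self, buffer_size: int):
        self._events = deque(maxlen=buffer_size)

    def record_event(self, event: dict) -> None:
        self._events.append(event)

    def snapshot_events(self) -> list:
        # Chronological (oldest-first) copy of the buffer.
        return list(self._events)

buf = EventRingBuffer(buffer_size=3)
for i in range(5):
    buf.record_event({"seq": i})
print(buf.snapshot_events())  # → [{'seq': 2}, {'seq': 3}, {'seq': 4}]
```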
#### dump()

Write an OOM diagnostic bundle and enforce retention constraints.

```python
try:
    model = LargeModel().cuda()
except torch.cuda.OutOfMemoryError as exc:
    dump_path = recorder.dump(
        reason="torch.cuda.OutOfMemoryError",
        exception=exc,
        context="model_initialization",
        backend="cuda",
        metadata={
            "model_size": "7B parameters",
            "batch_size": 32
        }
    )
    print(f"OOM dump saved to: {dump_path}")
    raise
```
**Parameters**

- `reason`: OOM classification reason
- `exception`: The exception that triggered the dump
- `context`: Context description (e.g., `"training"`, `"inference"`)
- `backend`: Backend type (`"cuda"`, `"rocm"`, `"mps"`)
- `metadata` (`Optional[Dict[str, Any]]`, default `None`): Additional metadata to include in the dump

**Returns**

Path to the dump directory if one was created, or `None` if the recorder is disabled.
The dump bundle includes:

- `manifest.json`: Bundle metadata and file listing
- `events.json`: Buffered memory events
- `metadata.json`: Exception details and custom metadata
- `environment.json`: System and process information
## Functions

### classify_oom_exception()

Classify whether an exception corresponds to an OOM condition.

```python
from gpumemprof.oom_flight_recorder import classify_oom_exception
import torch

try:
    tensor = torch.randn(10000, 10000, device="cuda")
except Exception as exc:
    classification = classify_oom_exception(exc)
    if classification.is_oom:
        print(f"OOM detected: {classification.reason}")
    else:
        print("Not an OOM error")
```
**Returns**

`OOMExceptionClassification` — classification result with an `is_oom` flag and a `reason` string.

Recognized OOM patterns:

- PyTorch: `torch.cuda.OutOfMemoryError`
- TensorFlow: `ResourceExhaustedError`
- Message patterns: "out of memory", "cuda out of memory", "hip out of memory", "resource exhausted", "failed to allocate", "allocation failed"
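The message-pattern half of this classification amounts to a case-insensitive substring search over the exception text. A minimal sketch of that idea (the real function also checks exception types, which is omitted here; `message_looks_like_oom` is an illustrative name, not the module's API):

```python
# The message patterns documented above.
OOM_MESSAGE_PATTERNS = (
    "out of memory",
    "cuda out of memory",
    "hip out of memory",
    "resource exhausted",
    "failed to allocate",
    "allocation failed",
)

def message_looks_like_oom(exc: BaseException) -> bool:
    """Case-insensitive substring match against the documented OOM patterns."""
    message = str(exc).lower()
    return any(pattern in message for pattern in OOM_MESSAGE_PATTERNS)

print(message_looks_like_oom(RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")))  # → True
print(message_looks_like_oom(ValueError("shape mismatch")))  # → False
```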
## Dump Bundle Structure

Each OOM dump creates a directory with the following structure:

```
oom_dump_20260303T120000Z_12345_cuda_1/
├── manifest.json      # Bundle metadata
├── events.json        # Memory events from ring buffer
├── metadata.json      # Exception and context info
└── environment.json   # System information
```
### manifest.json

```json
{
  "schema_version": 1,
  "bundle_name": "oom_dump_20260303T120000Z_12345_cuda_1",
  "created_at_utc": "2026-03-03T12:00:00Z",
  "reason": "torch.cuda.OutOfMemoryError",
  "backend": "cuda",
  "event_count": 5000,
  "files": ["manifest.json", "events.json", "metadata.json", "environment.json"]
}
```
### metadata.json

```json
{
  "reason": "torch.cuda.OutOfMemoryError",
  "exception_type": "OutOfMemoryError",
  "exception_module": "torch.cuda",
  "exception_message": "CUDA out of memory...",
  "context": "model_loading",
  "backend": "cuda",
  "captured_event_count": 5000,
  "custom_metadata": {
    "batch_size": 32,
    "model_name": "resnet50"
  }
}
```
### environment.json

```json
{
  "pid": 12345,
  "cwd": "/home/user/project",
  "system": {
    "platform": "Linux",
    "python_version": "3.10.0",
    "torch_version": "2.0.0",
    "cuda_available": true,
    "cuda_version": "11.8"
  }
}
```
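The fields above come from ordinary process and platform introspection. A sketch gathering the stdlib-derivable subset (the torch fields are omitted here since they require a torch install; `collect_environment` is an illustrative helper, not the module's API):

```python
import os
import platform

def collect_environment() -> dict:
    """Gather the stdlib-derivable subset of the environment.json fields."""
    return {
        "pid": os.getpid(),
        "cwd": os.getcwd(),
        "system": {
            "platform": platform.system(),
            "python_version": platform.python_version(),
            # torch_version / cuda_available / cuda_version would come from
            # torch.__version__, torch.cuda.is_available(), torch.version.cuda
        },
    }
```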
## Example Usage

```python
import time

import torch
from gpumemprof.oom_flight_recorder import (
    OOMFlightRecorder,
    OOMFlightRecorderConfig,
    classify_oom_exception
)

# Configure the recorder
config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="./oom_diagnostics",
    buffer_size=10000,
    max_dumps=10,
    max_total_mb=512
)
recorder = OOMFlightRecorder(config)

# Record memory events continuously
def record_memory_state():
    if torch.cuda.is_available():
        event = {
            "timestamp": time.time(),
            "allocated": torch.cuda.memory_allocated(),
            "reserved": torch.cuda.memory_reserved(),
            "device_id": 0
        }
        recorder.record_event(event)

# Training loop with OOM protection
try:
    model = LargeModel().cuda()
    for epoch in range(100):
        for inputs, targets in dataloader:
            record_memory_state()
            optimizer.zero_grad()
            output = model(inputs)
            loss = criterion(output, targets)
            loss.backward()
            optimizer.step()
except Exception as exc:
    # Classify the exception
    classification = classify_oom_exception(exc)
    if classification.is_oom:
        print(f"OOM error detected: {classification.reason}")

        # Create a diagnostic dump
        dump_path = recorder.dump(
            reason=classification.reason,
            exception=exc,
            context="training_loop",
            backend="cuda",
            metadata={
                "epoch": epoch,
                "batch_size": inputs.size(0),
                "model_params": sum(p.numel() for p in model.parameters())
            }
        )
        if dump_path:
            print(f"Diagnostic dump saved to: {dump_path}")
            print("Bundle contains:")
            print("  - Memory event timeline")
            print("  - Exception details")
            print("  - System information")
    raise

# Check buffered events (reached only if no exception occurred)
events = recorder.snapshot_events()
print(f"Currently tracking {len(events)} events")
```
## Integration with MemoryTracker

The OOM recorder is automatically integrated into `MemoryTracker`:

```python
from gpumemprof import MemoryTracker

tracker = MemoryTracker(
    device="cuda:0",
    enable_oom_flight_recorder=True,
    oom_dump_dir="./oom_dumps",
    oom_buffer_size=5000,
    oom_max_dumps=10
)
tracker.start_tracking()

try:
    # Your code
    model.train()
except Exception as e:
    dump_path = tracker.handle_exception(e, context="training")
    if dump_path:
        print(f"OOM dump: {dump_path}")
    raise
```