The oom_flight_recorder module provides automatic OOM diagnostic capture and retention management.

Classes

OOMFlightRecorderConfig

Runtime configuration for OOM flight recorder dumps.
from gpumemprof.oom_flight_recorder import OOMFlightRecorderConfig

config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="oom_dumps",
    buffer_size=10000,
    max_dumps=5,
    max_total_mb=256
)

Attributes

enabled (bool, default: False)
    Whether OOM recording is enabled.
dump_dir (str, default: "oom_dumps")
    Directory for storing OOM dump bundles.
buffer_size (int, default: 10000)
    Size of the event ring buffer.
max_dumps (int, default: 5)
    Maximum number of OOM dumps to retain.
max_total_mb (int, default: 256)
    Maximum total storage for OOM dumps, in megabytes.
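The buffer_size semantics above (a bounded ring that evicts the oldest event when full) can be sketched with a collections.deque; this is an illustrative stand-in, not the recorder's actual implementation:

```python
from collections import deque

# Hypothetical sketch: buffer_size bounds the in-memory event ring buffer.
# A deque with maxlen reproduces the same drop-oldest behavior.
buffer = deque(maxlen=3)  # stand-in for buffer_size=3

for i in range(5):
    buffer.append({"seq": i})

# Only the 3 most recent events survive; the oldest two were evicted.
print([e["seq"] for e in buffer])  # → [2, 3, 4]
```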

OOMExceptionClassification

Normalized classification result for an exception.
from gpumemprof.oom_flight_recorder import OOMExceptionClassification

Attributes

is_oom (bool)
    Whether the exception is an OOM error.
reason (Optional[str])
    Classification reason (e.g., "torch.cuda.OutOfMemoryError").

OOMFlightRecorder

Bounded recorder that writes diagnostic dump bundles on OOM.
from gpumemprof.oom_flight_recorder import (
    OOMFlightRecorder,
    OOMFlightRecorderConfig
)

config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="./oom_diagnostics",
    buffer_size=5000,
    max_dumps=10
)

recorder = OOMFlightRecorder(config)

Constructor

config (OOMFlightRecorderConfig)
    Configuration for the recorder.

Methods

record_event()
Append one event payload to the in-memory ring buffer.
import time

event = {
    "timestamp": time.time(),
    "event_type": "allocation",
    "memory_allocated": 1024 * 1024 * 500,  # 500 MiB
    "device_id": 0
}

recorder.record_event(event)

event (Dict[str, Any])
    Event payload to record.
snapshot_events()
Return buffered events in chronological order.
events = recorder.snapshot_events()
print(f"Buffered events: {len(events)}")

Returns (List[Dict[str, Any]])
    List of buffered events.
dump()
Write an OOM diagnostic bundle and enforce retention constraints.
try:
    model = LargeModel().cuda()
except torch.cuda.OutOfMemoryError as exc:
    dump_path = recorder.dump(
        reason="torch.cuda.OutOfMemoryError",
        exception=exc,
        context="model_initialization",
        backend="cuda",
        metadata={
            "model_size": "7B parameters",
            "batch_size": 32
        }
    )
    print(f"OOM dump saved to: {dump_path}")
    raise
reason (str)
    OOM classification reason.
exception (BaseException)
    The exception that triggered the dump.
context (Optional[str])
    Context description (e.g., "training", "inference").
backend (str)
    Backend type ("cuda", "rocm", "mps").
metadata (Optional[Dict[str, Any]], default: None)
    Additional metadata to include in the dump.

Returns (Optional[str])
    Path to the dump directory if created; None if the recorder is disabled.

The dump bundle includes:
  • manifest.json: Bundle metadata and file listing
  • events.json: Buffered memory events
  • metadata.json: Exception details and custom metadata
  • environment.json: System and process information

Functions

classify_oom_exception()

Classify whether an exception corresponds to an OOM condition.
from gpumemprof.oom_flight_recorder import classify_oom_exception
import torch

try:
    tensor = torch.randn(10000, 10000, device="cuda")
except Exception as exc:
    classification = classify_oom_exception(exc)
    
    if classification.is_oom:
        print(f"OOM detected: {classification.reason}")
    else:
        print("Not an OOM error")
exc (BaseException)
    Exception to classify.

Returns (OOMExceptionClassification)
    Classification result with is_oom flag and reason.
Recognized OOM patterns:
  • torch.cuda.OutOfMemoryError
  • TensorFlow ResourceExhaustedError
  • Message patterns: “out of memory”, “cuda out of memory”, “hip out of memory”, “resource exhausted”, “failed to allocate”, “allocation failed”
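The message-pattern half of that classification can be sketched in a few lines; looks_like_oom is an illustrative helper, and the library's actual matching logic may differ:

```python
# Hypothetical sketch of message-based OOM classification using the
# substring patterns listed above.
OOM_MESSAGE_PATTERNS = (
    "out of memory",
    "cuda out of memory",
    "hip out of memory",
    "resource exhausted",
    "failed to allocate",
    "allocation failed",
)

def looks_like_oom(exc: BaseException) -> bool:
    """Return True if the exception message matches a known OOM pattern."""
    message = str(exc).lower()
    return any(pattern in message for pattern in OOM_MESSAGE_PATTERNS)

print(looks_like_oom(RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")))  # → True
print(looks_like_oom(ValueError("bad shape")))  # → False
```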

Dump Bundle Structure

Each OOM dump creates a directory with the following structure:
oom_dump_20260303T120000Z_12345_cuda_1/
├── manifest.json       # Bundle metadata
├── events.json         # Memory events from ring buffer
├── metadata.json       # Exception and context info
└── environment.json    # System information
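Assuming the naming scheme shown above (timestamp, PID, backend, sequence number), a bundle name can be split into its parts; parse_bundle_name is a hypothetical helper, not part of the library:

```python
from typing import Dict

def parse_bundle_name(name: str) -> Dict[str, str]:
    """Split a dump directory name into its apparent components.

    Assumes the oom_dump_<timestamp>_<pid>_<backend>_<seq> scheme shown above.
    """
    _, _, timestamp, pid, backend, seq = name.split("_")
    return {"timestamp": timestamp, "pid": pid, "backend": backend, "seq": seq}

print(parse_bundle_name("oom_dump_20260303T120000Z_12345_cuda_1"))
```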

manifest.json

{
  "schema_version": 1,
  "bundle_name": "oom_dump_20260303T120000Z_12345_cuda_1",
  "created_at_utc": "2026-03-03T12:00:00Z",
  "reason": "torch.cuda.OutOfMemoryError",
  "backend": "cuda",
  "event_count": 5000,
  "files": ["manifest.json", "events.json", "metadata.json", "environment.json"]
}

metadata.json

{
  "reason": "torch.cuda.OutOfMemoryError",
  "exception_type": "OutOfMemoryError",
  "exception_module": "torch.cuda",
  "exception_message": "CUDA out of memory...",
  "context": "model_loading",
  "backend": "cuda",
  "captured_event_count": 5000,
  "custom_metadata": {
    "batch_size": 32,
    "model_name": "resnet50"
  }
}

environment.json

{
  "pid": 12345,
  "cwd": "/home/user/project",
  "system": {
    "platform": "Linux",
    "python_version": "3.10.0",
    "torch_version": "2.0.0",
    "cuda_available": true,
    "cuda_version": "11.8"
  }
}
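A stdlib-only sketch of the kind of capture behind environment.json (capture_environment is a hypothetical helper; framework fields such as torch_version and cuda_version are omitted here because they depend on optional packages):

```python
import os
import platform

def capture_environment() -> dict:
    """Collect fields similar to environment.json using only the stdlib.

    Illustrative sketch; the recorder's own capture also records framework
    details (torch_version, cuda_available, ...) when available.
    """
    return {
        "pid": os.getpid(),
        "cwd": os.getcwd(),
        "system": {
            "platform": platform.system(),
            "python_version": platform.python_version(),
        },
    }

env = capture_environment()
print(env["system"]["platform"], env["system"]["python_version"])
```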

Example Usage

import torch
from gpumemprof.oom_flight_recorder import (
    OOMFlightRecorder,
    OOMFlightRecorderConfig,
    classify_oom_exception
)

# Configure recorder
config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="./oom_diagnostics",
    buffer_size=10000,
    max_dumps=10,
    max_total_mb=512
)

recorder = OOMFlightRecorder(config)

# Record events continuously
import time

def record_memory_state():
    if torch.cuda.is_available():
        event = {
            "timestamp": time.time(),
            "allocated": torch.cuda.memory_allocated(),
            "reserved": torch.cuda.memory_reserved(),
            "device_id": 0
        }
        recorder.record_event(event)

# Training loop with OOM protection
epoch, batch = 0, None  # defined up front so the except block can reference them
try:
    model = LargeModel().cuda()

    for epoch in range(100):
        for batch in dataloader:
            record_memory_state()

            optimizer.zero_grad()
            output = model(batch)
            loss = criterion(output)
            loss.backward()
            optimizer.step()
            
except Exception as exc:
    # Classify the exception
    classification = classify_oom_exception(exc)
    
    if classification.is_oom:
        print(f"OOM Error detected: {classification.reason}")
        
        # Create diagnostic dump
        dump_path = recorder.dump(
            reason=classification.reason,
            exception=exc,
            context="training_loop",
            backend="cuda",
            metadata={
                "epoch": epoch,
                "batch_size": batch.size(0) if batch is not None else None,
                "model_params": sum(p.numel() for p in model.parameters())
            }
        )
        
        if dump_path:
            print(f"Diagnostic dump saved to: {dump_path}")
            print("Bundle contains:")
            print("  - Memory event timeline")
            print("  - Exception details")
            print("  - System information")
    
    raise

# Check buffered events
events = recorder.snapshot_events()
print(f"Currently tracking {len(events)} events")

Integration with MemoryTracker

The OOM recorder is automatically integrated into MemoryTracker:
from gpumemprof import MemoryTracker

tracker = MemoryTracker(
    device="cuda:0",
    enable_oom_flight_recorder=True,
    oom_dump_dir="./oom_dumps",
    oom_buffer_size=5000,
    oom_max_dumps=10
)

tracker.start_tracking()

try:
    # Your code
    model.train()
except Exception as e:
    dump_path = tracker.handle_exception(e, context="training")
    if dump_path:
        print(f"OOM dump: {dump_path}")
    raise
