The `oom_flight_recorder` module provides automatic OOM diagnostic capture and retention management.
## Classes

### OOMFlightRecorderConfig

Runtime configuration for OOM flight recorder dumps.

```python
from gpumemprof.oom_flight_recorder import OOMFlightRecorderConfig

config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="oom_dumps",
    buffer_size=10000,
    max_dumps=5,
    max_total_mb=256
)
```
**Attributes**

- `enabled`: Whether OOM recording is enabled
- `dump_dir`: Directory for storing OOM dump bundles
- `buffer_size`: Size of the event ring buffer
- `max_dumps`: Maximum number of OOM dumps to retain
- `max_total_mb`: Maximum total storage for OOM dumps, in megabytes
### OOMExceptionClassification

Normalized classification result for an exception.

```python
from gpumemprof.oom_flight_recorder import OOMExceptionClassification
```
**Attributes**

- `is_oom`: Whether the exception is an OOM error
- `reason`: Classification reason (e.g., `"torch.cuda.OutOfMemoryError"`)
### OOMFlightRecorder

Bounded recorder that writes diagnostic dump bundles on OOM.

```python
from gpumemprof.oom_flight_recorder import (
    OOMFlightRecorder,
    OOMFlightRecorderConfig
)

config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="./oom_diagnostics",
    buffer_size=5000,
    max_dumps=10
)
recorder = OOMFlightRecorder(config)
```
**Constructor**

- `config`: Configuration for the recorder
**Methods**

#### record_event()

Append one event payload to the in-memory ring buffer.

```python
import time

event = {
    "timestamp": time.time(),
    "event_type": "allocation",
    "memory_allocated": 1024 * 1024 * 500,
    "device_id": 0
}
recorder.record_event(event)
```
#### snapshot_events()

Return buffered events in chronological order.

```python
events = recorder.snapshot_events()
print(f"Buffered events: {len(events)}")
```
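The ring-buffer semantics implied by `buffer_size` — once the buffer is full, each new event evicts the oldest, and `snapshot_events()` returns the survivors oldest-first — can be sketched with `collections.deque`. This is an illustration of the behavior, not the library's internals:

```python
from collections import deque

class EventRingBuffer:
    """Bounded FIFO buffer: appending beyond capacity evicts the oldest event."""

    def __init__(self, buffer_size: int):
        self._events = deque(maxlen=buffer_size)

    def record_event(self, event: dict) -> None:
        self._events.append(event)

    def snapshot_events(self) -> list:
        # Chronological (oldest-first) copy of the buffer.
        return list(self._events)

buf = EventRingBuffer(buffer_size=3)
for i in range(5):
    buf.record_event({"seq": i})
print(buf.snapshot_events())  # → [{'seq': 2}, {'seq': 3}, {'seq': 4}]
```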
#### dump()

Write an OOM diagnostic bundle and enforce retention constraints.

```python
try:
    model = LargeModel().cuda()
except torch.cuda.OutOfMemoryError as exc:
    dump_path = recorder.dump(
        reason="torch.cuda.OutOfMemoryError",
        exception=exc,
        context="model_initialization",
        backend="cuda",
        metadata={
            "model_size": "7B parameters",
            "batch_size": 32
        }
    )
    print(f"OOM dump saved to: {dump_path}")
    raise
```
**Parameters**

- `reason`: OOM classification reason
- `exception`: The exception that triggered the dump
- `context`: Context description (e.g., `"training"`, `"inference"`)
- `backend`: Backend type (`"cuda"`, `"rocm"`, `"mps"`)
- `metadata` (`Optional[Dict[str, Any]]`, default `None`): Additional metadata to include in the dump

**Returns**

Path to the dump directory if one was created, or `None` if the recorder is disabled.
The dump bundle includes:

- `manifest.json`: Bundle metadata and file listing
- `events.json`: Buffered memory events
- `metadata.json`: Exception details and custom metadata
- `environment.json`: System and process information
## Functions

### classify_oom_exception()

Classify whether an exception corresponds to an OOM condition.

```python
from gpumemprof.oom_flight_recorder import classify_oom_exception
import torch

try:
    tensor = torch.randn(10000, 10000, device="cuda")
except Exception as exc:
    classification = classify_oom_exception(exc)
    if classification.is_oom:
        print(f"OOM detected: {classification.reason}")
    else:
        print("Not an OOM error")
```
**Returns**

`OOMExceptionClassification` — classification result with an `is_oom` flag and a `reason` string.

Recognized OOM patterns:

- PyTorch: `torch.cuda.OutOfMemoryError`
- TensorFlow: `ResourceExhaustedError`
- Message patterns: "out of memory", "cuda out of memory", "hip out of memory", "resource exhausted", "failed to allocate", "allocation failed"
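The message-pattern half of this classification amounts to a case-insensitive substring search over the exception text. A minimal sketch of that idea (the real function also checks exception types, which is omitted here; `message_looks_like_oom` is an illustrative name, not the module's API):

```python
# The message patterns documented above.
OOM_MESSAGE_PATTERNS = (
    "out of memory",
    "cuda out of memory",
    "hip out of memory",
    "resource exhausted",
    "failed to allocate",
    "allocation failed",
)

def message_looks_like_oom(exc: BaseException) -> bool:
    """Case-insensitive substring match against the documented OOM patterns."""
    message = str(exc).lower()
    return any(pattern in message for pattern in OOM_MESSAGE_PATTERNS)

print(message_looks_like_oom(RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")))  # → True
print(message_looks_like_oom(ValueError("shape mismatch")))  # → False
```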
## Dump Bundle Structure

Each OOM dump creates a directory with the following structure:

```
oom_dump_20260303T120000Z_12345_cuda_1/
├── manifest.json      # Bundle metadata
├── events.json        # Memory events from ring buffer
├── metadata.json      # Exception and context info
└── environment.json   # System information
```
### manifest.json

```json
{
  "schema_version": 1,
  "bundle_name": "oom_dump_20260303T120000Z_12345_cuda_1",
  "created_at_utc": "2026-03-03T12:00:00Z",
  "reason": "torch.cuda.OutOfMemoryError",
  "backend": "cuda",
  "event_count": 5000,
  "files": ["manifest.json", "events.json", "metadata.json", "environment.json"]
}
```
### metadata.json

```json
{
  "reason": "torch.cuda.OutOfMemoryError",
  "exception_type": "OutOfMemoryError",
  "exception_module": "torch.cuda",
  "exception_message": "CUDA out of memory...",
  "context": "model_loading",
  "backend": "cuda",
  "captured_event_count": 5000,
  "custom_metadata": {
    "batch_size": 32,
    "model_name": "resnet50"
  }
}
```
### environment.json

```json
{
  "pid": 12345,
  "cwd": "/home/user/project",
  "system": {
    "platform": "Linux",
    "python_version": "3.10.0",
    "torch_version": "2.0.0",
    "cuda_available": true,
    "cuda_version": "11.8"
  }
}
```
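The fields above come from ordinary process and platform introspection. A sketch gathering the stdlib-derivable subset (the torch fields are omitted here since they require a torch install; `collect_environment` is an illustrative helper, not the module's API):

```python
import os
import platform

def collect_environment() -> dict:
    """Gather the stdlib-derivable subset of the environment.json fields."""
    return {
        "pid": os.getpid(),
        "cwd": os.getcwd(),
        "system": {
            "platform": platform.system(),
            "python_version": platform.python_version(),
            # torch_version / cuda_available / cuda_version would come from
            # torch.__version__, torch.cuda.is_available(), torch.version.cuda
        },
    }
```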
## Example Usage

```python
import time

import torch
from gpumemprof.oom_flight_recorder import (
    OOMFlightRecorder,
    OOMFlightRecorderConfig,
    classify_oom_exception
)

# Configure the recorder
config = OOMFlightRecorderConfig(
    enabled=True,
    dump_dir="./oom_diagnostics",
    buffer_size=10000,
    max_dumps=10,
    max_total_mb=512
)
recorder = OOMFlightRecorder(config)

# Record memory events continuously
def record_memory_state():
    if torch.cuda.is_available():
        event = {
            "timestamp": time.time(),
            "allocated": torch.cuda.memory_allocated(),
            "reserved": torch.cuda.memory_reserved(),
            "device_id": 0
        }
        recorder.record_event(event)

# Training loop with OOM protection
try:
    model = LargeModel().cuda()
    for epoch in range(100):
        for inputs, targets in dataloader:
            record_memory_state()
            optimizer.zero_grad()
            output = model(inputs)
            loss = criterion(output, targets)
            loss.backward()
            optimizer.step()
except Exception as exc:
    # Classify the exception
    classification = classify_oom_exception(exc)
    if classification.is_oom:
        print(f"OOM error detected: {classification.reason}")

        # Create a diagnostic dump
        dump_path = recorder.dump(
            reason=classification.reason,
            exception=exc,
            context="training_loop",
            backend="cuda",
            metadata={
                "epoch": epoch,
                "batch_size": inputs.size(0),
                "model_params": sum(p.numel() for p in model.parameters())
            }
        )
        if dump_path:
            print(f"Diagnostic dump saved to: {dump_path}")
            print("Bundle contains:")
            print("  - Memory event timeline")
            print("  - Exception details")
            print("  - System information")
    raise

# Check buffered events (reached only if no exception occurred)
events = recorder.snapshot_events()
print(f"Currently tracking {len(events)} events")
```
## Integration with MemoryTracker

The OOM recorder is automatically integrated into `MemoryTracker`:

```python
from gpumemprof import MemoryTracker

tracker = MemoryTracker(
    device="cuda:0",
    enable_oom_flight_recorder=True,
    oom_dump_dir="./oom_dumps",
    oom_buffer_size=5000,
    oom_max_dumps=10
)
tracker.start_tracking()

try:
    # Your code
    model.train()
except Exception as e:
    dump_path = tracker.handle_exception(e, context="training")
    if dump_path:
        print(f"OOM dump: {dump_path}")
    raise
```