The gpu-profiler command launches an interactive Terminal User Interface (TUI) for GPU memory profiling and monitoring.
Installation
Install with TUI support. The TUI depends on:

- torch - PyTorch for GPU profiling
- textual - Terminal UI framework
- pyfiglet - ASCII art generation (optional)

Additional dependencies for optional features:

- tensorflow - TensorFlow profiling support
- matplotlib - PNG plot generation
- plotly - HTML interactive plot generation
Usage
Launch the interactive TUI by running the gpu-profiler command.

Interface overview
The TUI provides a tabbed interface with the following sections:

Overview tab
Displays:

- Welcome banner with ASCII art
- System information
- GPU information for PyTorch and TensorFlow
- Backend diagnostics (CUDA, ROCm, MPS, Metal)
- Memory statistics
PyTorch tab
Features:

- Real-time GPU statistics table
- Profile management (refresh, clear)
- Historical profiling results
- Memory metrics per profiling session
TensorFlow tab
Features:

- TensorFlow GPU statistics
- Profile management (refresh, clear)
- TensorFlow-specific memory tracking
- Device information
Monitoring tab
Live memory tracking interface with:

Controls:

- Start Live Tracking - Begin memory monitoring
- Stop Tracking - End monitoring session
- Auto Cleanup toggle - Enable/disable automatic memory cleanup
- Apply Thresholds - Update warning/critical thresholds
- Force Cleanup - Manually trigger memory cleanup (standard)
- Aggressive Cleanup - Manually trigger aggressive memory cleanup
- Export CSV - Save tracking events to CSV
- Export JSON - Save tracking events to JSON
- Clear Monitor Log - Clear the event log
Threshold inputs:

- Warning % - Memory usage warning threshold (default: 80)
- Critical % - Memory usage critical threshold (default: 95)

Statistics panel:

- Status (Active/Idle)
- Device label
- Current Allocated memory
- Current Reserved memory
- Peak Memory
- Utilization percentage
- Allocations per second
- Alert Count
- Total Events
- Tracking Duration
- Cleanup count
Alert entries show:

- Timestamp of each alert
- Alert type (warning, critical, error)
- Alert message
Event log entries show:

- Timestamp
- Event type (color-coded)
- Memory allocated/reserved/delta
- Context messages
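The allocated/reserved/peak figures above correspond to PyTorch's memory query APIs. A minimal sketch of sampling them, assuming PyTorch; the function and field names here are illustrative, not the tool's own, and the sketch falls back to zeros when PyTorch or a GPU is unavailable:

```python
def sample_memory_stats() -> dict:
    """Sample allocated/reserved/peak GPU memory (MB) via PyTorch."""
    try:
        import torch
        if torch.cuda.is_available():
            return {
                "allocated_mb": torch.cuda.memory_allocated() / 2**20,
                "reserved_mb": torch.cuda.memory_reserved() / 2**20,
                "peak_mb": torch.cuda.max_memory_allocated() / 2**20,
            }
    except ImportError:
        pass  # PyTorch not installed: report zeros
    return {"allocated_mb": 0.0, "reserved_mb": 0.0, "peak_mb": 0.0}

print(sample_memory_stats())
```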
Visualizations tab
Memory timeline visualization with:

Controls:

- Refresh Timeline - Update timeline from current tracking session
- Generate PNG Plot - Export timeline as PNG image
- Generate HTML Plot - Export interactive HTML plot
Statistics:

- Sample count
- Duration
- Allocated Max
- Reserved Max
- Allocated Latest
- Reserved Latest
ASCII timeline elements:

- Allocated memory line
- Reserved memory line
- Time axis
- Memory axis
Output locations:

- PNG plots: ./visualizations/timeline_YYYYMMDD_HHMMSS.png
- HTML plots: ./visualizations/timeline_YYYYMMDD_HHMMSS.html
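A toy illustration of how an ASCII timeline like the one above might be rendered; the rendering scheme and helper name are invented for illustration, not the tool's actual renderer:

```python
def ascii_timeline(samples, width=40, height=8, char="*"):
    """Render allocated-memory samples as a crude ASCII chart."""
    peak = max(samples) or 1
    # Stretch/downsample the samples to the chart width
    cols = [samples[int(i * len(samples) / width)] for i in range(width)]
    rows = []
    for level in range(height, 0, -1):
        cutoff = peak * level / height  # memory axis, top row = peak
        rows.append("".join(char if v >= cutoff else " " for v in cols))
    rows.append("-" * width)            # time axis
    return "\n".join(rows)

print(ascii_timeline([1, 2, 4, 8, 6, 3, 2, 1]))
```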
CLI & Actions tab
Command execution interface with:

Quick action buttons:

- gpumemprof info - Display system information
- gpumemprof monitor - Run 30-second monitoring session
- tfmemprof monitor - Run TensorFlow monitoring
- gpumemprof diagnose - Generate diagnostic bundle
- PyTorch Sample - Run PyTorch sample workload
- TensorFlow Sample - Run TensorFlow sample workload
- OOM Scenario - Test OOM flight recorder
- Capability Matrix - Run comprehensive test suite
Custom command controls:

- Text input for custom commands
- Run Command - Execute the entered command
- Cancel Command - Cancel running command
- Loading indicator during execution
- Real-time stdout/stderr output in command log
Keyboard shortcuts
- q - Quit application
- r - Refresh overview tab
- f - Focus command log
- g - Show gpumemprof info help
- t - Show tfmemprof info help
- Tab - Navigate between tabs
- Ctrl+C - Cancel operation
Live tracking features
Memory watchdog
Automatic memory management system:

Standard cleanup:

- Clears PyTorch cache
- Runs garbage collection
- Minimal performance impact
Aggressive cleanup:

- Clears PyTorch cache
- Clears CUDA IPC memory
- Multiple garbage collection passes
- Higher performance impact
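The two cleanup passes above can be sketched as follows. The torch calls (empty_cache, ipc_collect) are real PyTorch APIs; the function names and pass count are illustrative, not the tool's own implementation:

```python
import gc

def standard_cleanup() -> None:
    """Standard pass: clear PyTorch cache, then one GC pass."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()   # release cached GPU blocks
    except ImportError:
        pass                           # PyTorch not installed: GC only
    gc.collect()                       # single garbage-collection pass

def aggressive_cleanup(passes: int = 3) -> None:
    """Aggressive pass: also reclaim CUDA IPC memory, multiple GC passes."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()   # reclaim CUDA IPC memory as well
    except ImportError:
        pass
    for _ in range(passes):            # multiple GC passes: higher impact
        gc.collect()

standard_cleanup()
aggressive_cleanup()
```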
Alert thresholds
Configurable memory usage alerts:

Warning threshold (default 80%):

- Yellow warning event logged
- Notification in event stream
- Watchdog may trigger cleanup
Critical threshold (default 95%):

- Red critical event logged
- Urgent notification
- Automatic cleanup if watchdog enabled
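The threshold logic above amounts to a simple classifier. A minimal sketch; classify_usage is a hypothetical helper, with defaults mirroring the 80%/95% thresholds described above:

```python
def classify_usage(used_bytes: int, total_bytes: int,
                   warning_pct: float = 80.0,
                   critical_pct: float = 95.0) -> str:
    """Return 'ok', 'warning', or 'critical' for current memory usage."""
    if total_bytes <= 0:
        return "ok"
    utilization = 100.0 * used_bytes / total_bytes
    if utilization >= critical_pct:
        return "critical"   # red event, urgent notification
    if utilization >= warning_pct:
        return "warning"    # yellow event in the stream
    return "ok"

print(classify_usage(7_000, 10_000))   # 70% -> ok
print(classify_usage(8_500, 10_000))   # 85% -> warning
print(classify_usage(9_600, 10_000))   # 96% -> critical
```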
Event export formats
Tracking events can be exported to timestamped files:

- CSV: ./exports/tracker_events_YYYYMMDD_HHMMSS.csv
- JSON: ./exports/tracker_events_YYYYMMDD_HHMMSS.json
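A sketch of what writing these exports could look like, assuming each tracking event is a flat dict; the event field names are illustrative, and a temp directory stands in for ./exports/:

```python
import csv
import json
import os
import tempfile
from datetime import datetime

# Two fake tracking events; real events would come from the live tracker.
events = [
    {"timestamp": "2024-01-01T12:00:00", "event": "alloc",
     "allocated_mb": 512.0, "reserved_mb": 768.0},
    {"timestamp": "2024-01-01T12:00:01", "event": "warning",
     "allocated_mb": 820.0, "reserved_mb": 1024.0},
]

export_dir = tempfile.mkdtemp()  # demo stand-in for ./exports/
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS

csv_path = os.path.join(export_dir, f"tracker_events_{stamp}.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(events[0]))
    writer.writeheader()
    writer.writerows(events)

json_path = os.path.join(export_dir, f"tracker_events_{stamp}.json")
with open(json_path, "w") as f:
    json.dump(events, f, indent=2)

print(csv_path)
print(json_path)
```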
Sample workload outputs
PyTorch sample
Runs a small neural network training loop and reports:

- Peak memory allocated
- Peak memory reserved
- Total snapshots collected
- Profiling duration
TensorFlow sample
Executes TensorFlow operations and displays:

- Peak GPU memory usage
- Average memory usage
- Total allocations
- Profiling duration
CPU sample (fallback)
When no GPU is available, runs CPU profiling and reports:

- Peak RSS (Resident Set Size)
- Memory change
- Snapshots collected
- Profiling duration
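Peak RSS, as reported by the CPU fallback above, can be read via Python's standard resource module (Unix only). A minimal sketch; the helper name is illustrative:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return the process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)   # macOS reports ru_maxrss in bytes
    return peak / 1024                # Linux reports it in kilobytes

print(f"Peak RSS: {peak_rss_mb():.1f} MB")
```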
Profile management
PyTorch profiles
Stored in: ~/.gpumemprof/profiles/
Profile data includes:
- Timestamp
- Peak memory
- Reserved memory
- Number of snapshots
- Device information
TensorFlow profiles
Stored in: ~/.tfmemprof/profiles/
Profile data includes:
- Timestamp
- Peak memory
- Average memory
- Sample count
- Device information
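A profile entry with the fields listed above could plausibly be persisted as a small JSON file. A hedged sketch; the field names and file name are assumptions, and a temp directory stands in for the real profiles directories:

```python
import json
import os
import tempfile
from datetime import datetime

# Illustrative profile entry mirroring the fields listed above.
profile = {
    "timestamp": datetime.now().isoformat(),
    "peak_memory_mb": 1024.0,
    "reserved_memory_mb": 1536.0,
    "num_snapshots": 42,
    "device": "cuda:0",
}

# Demo stand-in for ~/.gpumemprof/profiles/
profiles_dir = os.path.join(tempfile.mkdtemp(), "profiles")
os.makedirs(profiles_dir, exist_ok=True)
profile_path = os.path.join(profiles_dir, "profile_demo.json")

with open(profile_path, "w") as f:
    json.dump(profile, f, indent=2)

with open(profile_path) as f:
    loaded = json.load(f)
print(loaded["device"])
```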
Backend detection
The TUI automatically detects and adapts to:

PyTorch backends:

- CUDA (NVIDIA GPUs)
- ROCm (AMD GPUs)
- MPS (Apple Silicon)
- CPU (fallback)
TensorFlow backends:

- CUDA (NVIDIA GPUs)
- ROCm (AMD GPUs)
- Metal (Apple Silicon with tensorflow-metal)
- CPU (fallback)
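For the PyTorch side, detection along these lines can be done with real torch attributes (cuda.is_available, version.hip, backends.mps); the function name is illustrative, and the sketch degrades to "cpu" when PyTorch is absent:

```python
def detect_pytorch_backend() -> str:
    """Detect the active PyTorch backend: cuda, rocm, mps, or cpu."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds, None on CUDA builds
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"       # fallback

print(detect_pytorch_backend())
```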
Diagnostic bundle generation
When running gpumemprof diagnose or tfmemprof diagnose from the CLI & Actions tab, diagnostic bundles are saved to the artifacts directory. Each bundle contains:
- manifest.json - Metadata and risk flags
- diagnostic_summary.json - Analysis summary
- system_info.json - System and GPU information
- memory_timeline.json - Memory usage over time
- stack_traces.txt - Memory allocation stack traces
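Assembling a bundle with those files might look like the following sketch; the directory name and JSON payloads are invented for illustration, with a temp directory standing in for the real artifacts location:

```python
import json
import os
import tempfile

# Demo stand-in for the tool's artifacts directory
bundle_dir = os.path.join(tempfile.mkdtemp(), "diagnostic_bundle")
os.makedirs(bundle_dir, exist_ok=True)

# Illustrative payloads for the bundle's JSON artifacts
artifacts = {
    "manifest.json": {"version": 1, "risk_flags": []},
    "diagnostic_summary.json": {"status": "ok"},
    "system_info.json": {"platform": os.name},
    "memory_timeline.json": {"samples": []},
}
for name, payload in artifacts.items():
    with open(os.path.join(bundle_dir, name), "w") as f:
        json.dump(payload, f, indent=2)

# Stack traces are stored as plain text
with open(os.path.join(bundle_dir, "stack_traces.txt"), "w") as f:
    f.write("(no traces captured)\n")

print(sorted(os.listdir(bundle_dir)))
```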
Common workflows
Monitor training session
- Switch to Monitoring tab
- Set Warning % to 80 and Critical % to 95
- Click “Apply Thresholds”
- Enable “Auto Cleanup” if desired
- Click “Start Live Tracking”
- Run your training script
- Monitor alerts and statistics in real-time
- Click “Stop Tracking” when complete
- Click “Export JSON” to save results
Profile and visualize
- Start live tracking in Monitoring tab
- Run your workload
- Switch to Visualizations tab
- Click “Refresh Timeline”
- Review ASCII timeline
- Click “Generate PNG Plot” or “Generate HTML Plot”
- Find plots in the ./visualizations/ directory
Debug memory issues
- Switch to CLI & Actions tab
- Click “gpumemprof diagnose”
- Review diagnostic output in command log
- Check artifacts directory for detailed bundle
- Review diagnostic_summary.json for risk flags
Run sample workloads
- Switch to CLI & Actions tab
- Click “PyTorch Sample” or “TensorFlow Sample”
- Wait for workload completion
- Review results in command log
- Switch to PyTorch/TensorFlow tab
- Click “Refresh Profiles” to see new profile entry