The gpu-profiler command launches an interactive Terminal User Interface (TUI) for GPU memory profiling and monitoring.
Installation
Install with TUI support. The TUI depends on:

- torch - PyTorch for GPU profiling
- textual - Terminal UI framework
- pyfiglet - ASCII art generation (optional)

Additional dependencies for optional features:

- tensorflow - TensorFlow profiling support
- matplotlib - PNG plot generation
- plotly - HTML interactive plot generation
Usage
Launch the interactive TUI by running the gpu-profiler command.

Interface overview
The TUI provides a tabbed interface with the following sections:

Overview tab
Displays:

- Welcome banner with ASCII art
- System information
- GPU information for PyTorch and TensorFlow
- Backend diagnostics (CUDA, ROCm, MPS, Metal)
- Memory statistics
PyTorch tab
Features:

- Real-time GPU statistics table
- Profile management (refresh, clear)
- Historical profiling results
- Memory metrics per profiling session
TensorFlow tab
Features:

- TensorFlow GPU statistics
- Profile management (refresh, clear)
- TensorFlow-specific memory tracking
- Device information
Monitoring tab
Live memory tracking interface with:

Controls:

- Start Live Tracking - Begin memory monitoring
- Stop Tracking - End monitoring session
- Auto Cleanup toggle - Enable/disable automatic memory cleanup
- Apply Thresholds - Update warning/critical thresholds
- Force Cleanup - Manually trigger memory cleanup (standard)
- Aggressive Cleanup - Manually trigger aggressive memory cleanup
- Export CSV - Save tracking events to CSV
- Export JSON - Save tracking events to JSON
- Clear Monitor Log - Clear the event log
Threshold inputs:

- Warning % - Memory usage warning threshold (default: 80)
- Critical % - Memory usage critical threshold (default: 95)

Statistics panel:

- Status (Active/Idle)
- Device label
- Current Allocated memory
- Current Reserved memory
- Peak Memory
- Utilization percentage
- Allocations per second
- Alert Count
- Total Events
- Tracking Duration
- Cleanup count
Alert entries show:

- Timestamp of each alert
- Alert type (warning, critical, error)
- Alert message
Event log entries show:

- Timestamp
- Event type (color-coded)
- Memory allocated/reserved/delta
- Context messages
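The allocated/reserved/peak figures above correspond to PyTorch's memory query APIs. A minimal sketch of sampling them, assuming PyTorch; the function and field names here are illustrative, not the tool's own, and the sketch falls back to zeros when PyTorch or a GPU is unavailable:

```python
def sample_memory_stats() -> dict:
    """Sample allocated/reserved/peak GPU memory (MB) via PyTorch."""
    try:
        import torch
        if torch.cuda.is_available():
            return {
                "allocated_mb": torch.cuda.memory_allocated() / 2**20,
                "reserved_mb": torch.cuda.memory_reserved() / 2**20,
                "peak_mb": torch.cuda.max_memory_allocated() / 2**20,
            }
    except ImportError:
        pass  # PyTorch not installed: report zeros
    return {"allocated_mb": 0.0, "reserved_mb": 0.0, "peak_mb": 0.0}

print(sample_memory_stats())
```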
Visualizations tab
Memory timeline visualization with:

Controls:

- Refresh Timeline - Update timeline from current tracking session
- Generate PNG Plot - Export timeline as PNG image
- Generate HTML Plot - Export interactive HTML plot
Statistics:

- Sample count
- Duration
- Allocated Max
- Reserved Max
- Allocated Latest
- Reserved Latest
ASCII timeline elements:

- Allocated memory line
- Reserved memory line
- Time axis
- Memory axis
Output locations:

- PNG plots: ./visualizations/timeline_YYYYMMDD_HHMMSS.png
- HTML plots: ./visualizations/timeline_YYYYMMDD_HHMMSS.html
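A toy illustration of how an ASCII timeline like the one above might be rendered; the rendering scheme and helper name are invented for illustration, not the tool's actual renderer:

```python
def ascii_timeline(samples, width=40, height=8, char="*"):
    """Render allocated-memory samples as a crude ASCII chart."""
    peak = max(samples) or 1
    # Stretch/downsample the samples to the chart width
    cols = [samples[int(i * len(samples) / width)] for i in range(width)]
    rows = []
    for level in range(height, 0, -1):
        cutoff = peak * level / height  # memory axis, top row = peak
        rows.append("".join(char if v >= cutoff else " " for v in cols))
    rows.append("-" * width)            # time axis
    return "\n".join(rows)

print(ascii_timeline([1, 2, 4, 8, 6, 3, 2, 1]))
```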
CLI & Actions tab
Command execution interface with:

Quick action buttons:

- gpumemprof info - Display system information
- gpumemprof monitor - Run 30-second monitoring session
- tfmemprof monitor - Run TensorFlow monitoring
- gpumemprof diagnose - Generate diagnostic bundle
- PyTorch Sample - Run PyTorch sample workload
- TensorFlow Sample - Run TensorFlow sample workload
- OOM Scenario - Test OOM flight recorder
- Capability Matrix - Run comprehensive test suite
Custom command controls:

- Text input for custom commands
- Run Command - Execute the entered command
- Cancel Command - Cancel running command
- Loading indicator during execution
- Real-time stdout/stderr output in command log
Keyboard shortcuts
- q - Quit application
- r - Refresh overview tab
- f - Focus command log
- g - Show gpumemprof info help
- t - Show tfmemprof info help
- Tab - Navigate between tabs
- Ctrl+C - Cancel operation
Live tracking features
Memory watchdog
Automatic memory management system:

Standard cleanup:

- Clears PyTorch cache
- Runs garbage collection
- Minimal performance impact
Aggressive cleanup:

- Clears PyTorch cache
- Clears CUDA IPC memory
- Multiple garbage collection passes
- Higher performance impact
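The two cleanup passes above can be sketched as follows. The torch calls (empty_cache, ipc_collect) are real PyTorch APIs; the function names and pass count are illustrative, not the tool's own implementation:

```python
import gc

def standard_cleanup() -> None:
    """Standard pass: clear PyTorch cache, then one GC pass."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()   # release cached GPU blocks
    except ImportError:
        pass                           # PyTorch not installed: GC only
    gc.collect()                       # single garbage-collection pass

def aggressive_cleanup(passes: int = 3) -> None:
    """Aggressive pass: also reclaim CUDA IPC memory, multiple GC passes."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()   # reclaim CUDA IPC memory as well
    except ImportError:
        pass
    for _ in range(passes):            # multiple GC passes: higher impact
        gc.collect()

standard_cleanup()
aggressive_cleanup()
```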
Alert thresholds
Configurable memory usage alerts:

Warning threshold (default 80%):

- Yellow warning event logged
- Notification in event stream
- Watchdog may trigger cleanup
Critical threshold (default 95%):

- Red critical event logged
- Urgent notification
- Automatic cleanup if watchdog enabled
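The threshold logic above amounts to a simple classifier. A minimal sketch; classify_usage is a hypothetical helper, with defaults mirroring the 80%/95% thresholds described above:

```python
def classify_usage(used_bytes: int, total_bytes: int,
                   warning_pct: float = 80.0,
                   critical_pct: float = 95.0) -> str:
    """Return 'ok', 'warning', or 'critical' for current memory usage."""
    if total_bytes <= 0:
        return "ok"
    utilization = 100.0 * used_bytes / total_bytes
    if utilization >= critical_pct:
        return "critical"   # red event, urgent notification
    if utilization >= warning_pct:
        return "warning"    # yellow event in the stream
    return "ok"

print(classify_usage(7_000, 10_000))   # 70% -> ok
print(classify_usage(8_500, 10_000))   # 85% -> warning
print(classify_usage(9_600, 10_000))   # 96% -> critical
```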
Event export formats
Tracking events can be exported to timestamped files:

- CSV: ./exports/tracker_events_YYYYMMDD_HHMMSS.csv
- JSON: ./exports/tracker_events_YYYYMMDD_HHMMSS.json
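A sketch of what writing these exports could look like, assuming each tracking event is a flat dict; the event field names are illustrative, and a temp directory stands in for ./exports/:

```python
import csv
import json
import os
import tempfile
from datetime import datetime

# Two fake tracking events; real events would come from the live tracker.
events = [
    {"timestamp": "2024-01-01T12:00:00", "event": "alloc",
     "allocated_mb": 512.0, "reserved_mb": 768.0},
    {"timestamp": "2024-01-01T12:00:01", "event": "warning",
     "allocated_mb": 820.0, "reserved_mb": 1024.0},
]

export_dir = tempfile.mkdtemp()  # demo stand-in for ./exports/
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS

csv_path = os.path.join(export_dir, f"tracker_events_{stamp}.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(events[0]))
    writer.writeheader()
    writer.writerows(events)

json_path = os.path.join(export_dir, f"tracker_events_{stamp}.json")
with open(json_path, "w") as f:
    json.dump(events, f, indent=2)

print(csv_path)
print(json_path)
```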
Sample workload outputs
PyTorch sample
Runs a small neural network training loop and reports:

- Peak memory allocated
- Peak memory reserved
- Total snapshots collected
- Profiling duration
TensorFlow sample
Executes TensorFlow operations and displays:

- Peak GPU memory usage
- Average memory usage
- Total allocations
- Profiling duration
CPU sample (fallback)
When no GPU is available, runs CPU profiling and reports:

- Peak RSS (Resident Set Size)
- Memory change
- Snapshots collected
- Profiling duration
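Peak RSS, as reported by the CPU fallback above, can be read via Python's standard resource module (Unix only). A minimal sketch; the helper name is illustrative:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return the process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)   # macOS reports ru_maxrss in bytes
    return peak / 1024                # Linux reports it in kilobytes

print(f"Peak RSS: {peak_rss_mb():.1f} MB")
```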
Profile management
PyTorch profiles
Stored in: ~/.gpumemprof/profiles/
Profile data includes:
- Timestamp
- Peak memory
- Reserved memory
- Number of snapshots
- Device information
TensorFlow profiles
Stored in: ~/.tfmemprof/profiles/
Profile data includes:
- Timestamp
- Peak memory
- Average memory
- Sample count
- Device information
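A profile entry with the fields listed above could plausibly be persisted as a small JSON file. A hedged sketch; the field names and file name are assumptions, and a temp directory stands in for the real profiles directories:

```python
import json
import os
import tempfile
from datetime import datetime

# Illustrative profile entry mirroring the fields listed above.
profile = {
    "timestamp": datetime.now().isoformat(),
    "peak_memory_mb": 1024.0,
    "reserved_memory_mb": 1536.0,
    "num_snapshots": 42,
    "device": "cuda:0",
}

# Demo stand-in for ~/.gpumemprof/profiles/
profiles_dir = os.path.join(tempfile.mkdtemp(), "profiles")
os.makedirs(profiles_dir, exist_ok=True)
profile_path = os.path.join(profiles_dir, "profile_demo.json")

with open(profile_path, "w") as f:
    json.dump(profile, f, indent=2)

with open(profile_path) as f:
    loaded = json.load(f)
print(loaded["device"])
```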
Backend detection
The TUI automatically detects and adapts to:

PyTorch backends:

- CUDA (NVIDIA GPUs)
- ROCm (AMD GPUs)
- MPS (Apple Silicon)
- CPU (fallback)
TensorFlow backends:

- CUDA (NVIDIA GPUs)
- ROCm (AMD GPUs)
- Metal (Apple Silicon with tensorflow-metal)
- CPU (fallback)
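For the PyTorch side, detection along these lines can be done with real torch attributes (cuda.is_available, version.hip, backends.mps); the function name is illustrative, and the sketch degrades to "cpu" when PyTorch is absent:

```python
def detect_pytorch_backend() -> str:
    """Detect the active PyTorch backend: cuda, rocm, mps, or cpu."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds, None on CUDA builds
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon
    return "cpu"       # fallback

print(detect_pytorch_backend())
```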
Diagnostic bundle generation
When running gpumemprof diagnose or tfmemprof diagnose from the CLI & Actions tab, diagnostic bundles are saved to the artifacts directory. Each bundle contains:
- manifest.json - Metadata and risk flags
- diagnostic_summary.json - Analysis summary
- system_info.json - System and GPU information
- memory_timeline.json - Memory usage over time
- stack_traces.txt - Memory allocation stack traces
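Assembling a bundle with those files might look like the following sketch; the directory name and JSON payloads are invented for illustration, with a temp directory standing in for the real artifacts location:

```python
import json
import os
import tempfile

# Demo stand-in for the tool's artifacts directory
bundle_dir = os.path.join(tempfile.mkdtemp(), "diagnostic_bundle")
os.makedirs(bundle_dir, exist_ok=True)

# Illustrative payloads for the bundle's JSON artifacts
artifacts = {
    "manifest.json": {"version": 1, "risk_flags": []},
    "diagnostic_summary.json": {"status": "ok"},
    "system_info.json": {"platform": os.name},
    "memory_timeline.json": {"samples": []},
}
for name, payload in artifacts.items():
    with open(os.path.join(bundle_dir, name), "w") as f:
        json.dump(payload, f, indent=2)

# Stack traces are stored as plain text
with open(os.path.join(bundle_dir, "stack_traces.txt"), "w") as f:
    f.write("(no traces captured)\n")

print(sorted(os.listdir(bundle_dir)))
```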
Common workflows
Monitor training session
- Switch to Monitoring tab
- Set Warning % to 80 and Critical % to 95
- Click “Apply Thresholds”
- Enable “Auto Cleanup” if desired
- Click “Start Live Tracking”
- Run your training script
- Monitor alerts and statistics in real-time
- Click “Stop Tracking” when complete
- Click “Export JSON” to save results
Profile and visualize
- Start live tracking in Monitoring tab
- Run your workload
- Switch to Visualizations tab
- Click “Refresh Timeline”
- Review ASCII timeline
- Click “Generate PNG Plot” or “Generate HTML Plot”
- Find plots in the ./visualizations/ directory
Debug memory issues
- Switch to CLI & Actions tab
- Click “gpumemprof diagnose”
- Review diagnostic output in command log
- Check artifacts directory for detailed bundle
- Review diagnostic_summary.json for risk flags
Run sample workloads
- Switch to CLI & Actions tab
- Click “PyTorch Sample” or “TensorFlow Sample”
- Wait for workload completion
- Review results in command log
- Switch to PyTorch/TensorFlow tab
- Click “Refresh Profiles” to see new profile entry