## Overview

The profiler captures detailed performance metrics, including:

- Step time and throughput (TFLOP/s)
- Compilation time
- Memory usage patterns
- Device utilization
- Communication overhead
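For reference, the TFLOP/s figure is just per-step FLOPs normalized by step time and device count. A minimal sketch with hypothetical numbers (the function name and figures are illustrative, not part of the profiler):

```python
def tflops_per_sec_per_device(flops_per_step: float,
                              step_time_s: float,
                              num_devices: int) -> float:
    """Throughput as reported per device: total FLOPs executed in one
    step, divided by step time and device count, scaled to TFLOP/s."""
    return flops_per_step / step_time_s / num_devices / 1e12

# Hypothetical run: 2.4e15 FLOPs per step, 1.2 s steps, 8 devices
# -> 250 TFLOP/s per device.
print(tflops_per_sec_per_device(2.4e15, 1.2, 8))
```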
## Configuration parameters

Profiling is controlled by three parameters in your config file.

### `enable_profiler`

- Type: boolean
- Default: `False`
- Description: Master switch to enable or disable profiling. Set it to `True` to activate profiling.
### `skip_first_n_steps_for_profiler`

- Type: integer
- Default: `5`
- Description: Number of initial training steps to skip before profiling begins

Skipping these steps matters because:

- Early steps include compilation overhead
- Step times are unstable during warmup
- Skipping them provides more accurate performance measurements
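The effect is easy to see numerically. The step times in this sketch are made up, but the shape (expensive first steps, stable tail) is typical of a JIT-compiled workload:

```python
# Hypothetical per-step times (seconds); the first steps include
# compilation and warmup, which is why the defaults skip them.
step_times = [42.0, 9.0, 1.3, 1.2, 1.2, 1.1, 1.2, 1.2, 1.1, 1.2]

skip = 5  # skip_first_n_steps_for_profiler
all_steps_avg = sum(step_times) / len(step_times)
steady_avg = sum(step_times[skip:]) / len(step_times[skip:])

print(all_steps_avg)  # inflated by compilation and warmup
print(steady_avg)     # representative steady-state step time
```

Averaging over all ten steps suggests a step time several times larger than the steady-state value, which is exactly the skew the skip window avoids.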
### `profiler_steps`

- Type: integer
- Default: `10`
- Description: Number of steps to profile after the skip period
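Taken together, the three parameters might look like this in a YAML-style config file (a sketch; the key names and defaults are the ones documented above, but your config's exact layout may differ):

```yaml
# Profiling settings (defaults shown)
enable_profiler: False                # master switch
skip_first_n_steps_for_profiler: 5    # ignore compile/warmup steps
profiler_steps: 10                    # steps captured after the skip window
```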
## Example usage

### Training with profiling

Here’s a complete example from Wan 2.1 training that shows profiling in action.

### Inference profiling
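As a minimal sketch, JAX's programmatic trace API can bracket inference steps. The output directory, step count, and `run_inference` stand-in below are placeholders, not this project's API:

```python
import jax
import jax.numpy as jnp

def run_inference(x):
    # Stand-in for the real inference step (e.g. one denoising call).
    return jnp.tanh(x).sum()

log_dir = "/tmp/inference-profile"  # assumption: any writable directory

# Warm up once so compilation is not attributed to the trace.
x = jnp.ones((8, 8))
run_inference(x).block_until_ready()

# jax.profiler.trace writes a TensorBoard-viewable trace under log_dir.
with jax.profiler.trace(log_dir):
    for _ in range(10):  # profile a fixed number of inference steps
        run_inference(x).block_until_ready()
```

The warmup call before the trace plays the same role as `skip_first_n_steps_for_profiler` does in training.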
## Profiling workflow

### 1. Enable profiling

Modify your config or pass the parameters on the command line.

### 2. Run your workload
Execute training or inference as normal. The profiler will:

- Skip the first N steps (default: 5)
- Collect profiling data for the next M steps (default: 10)
- Save profiling traces to your output directory
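The skip-then-capture behavior can be sketched as simple step-window arithmetic (function and variable names are illustrative, not the library's API):

```python
def profiler_window(total_steps,
                    skip_first_n_steps_for_profiler=5,
                    profiler_steps=10):
    """Step indices the profiler captures: skip the warmup window,
    then record the next `profiler_steps` steps (clamped to the run)."""
    start = skip_first_n_steps_for_profiler
    stop = min(start + profiler_steps, total_steps)
    return list(range(start, stop))

# With the defaults, a 100-step run profiles steps 5 through 14.
print(profiler_window(100))
```

Note that a run shorter than the skip window produces an empty capture, so very short jobs need smaller values.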
### 3. Analyze results

View the collected metrics in TensorBoard by pointing it at your run's output directory.

## Best practices
### Skip warmup steps

Always skip initial steps to avoid skewed measurements.

### Profile sufficient steps

Capture enough steps to identify recurring patterns.

### Production vs development

Disable profiling in production runs to avoid its overhead.

## Common profiling scenarios
### Optimize step time

### Quick performance check

### Multi-slice profiling
## Understanding profiler output

The profiler generates:

- Console metrics: Step time, TFLOP/s, loss values
- TensorBoard logs: Detailed metrics over time
- JAX traces: Low-level performance data for expert analysis
### Key metrics to monitor
- TFLOP/s/device: Higher is better; indicates compute efficiency
- Step time: Lower is better; total time per training step
- Step time stability: Should stabilize after warmup
- Loss values: Verify training is progressing correctly
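One way to quantify step-time stability is the coefficient of variation (stdev / mean) of post-warmup step times. The helper and threshold below are illustrative, not part of the profiler:

```python
from statistics import mean, stdev

def is_stable(step_times, skip=5, cv_threshold=0.05):
    """Heuristic: after the warmup window, the relative spread
    (stdev / mean) of step times should be small."""
    steady = step_times[skip:]
    cv = stdev(steady) / mean(steady)
    return cv < cv_threshold

# Hypothetical traces: one stable run, one with jitter after warmup.
print(is_stable([40.0, 8.0, 1.3, 1.2, 1.2, 1.20, 1.21, 1.19, 1.20, 1.21]))
print(is_stable([40.0, 8.0, 1.3, 1.2, 1.2, 1.2, 2.5, 0.8, 1.9, 1.1]))
```

Persistent jitter after warmup usually points at input-pipeline stalls or communication overhead rather than compute.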
## Related resources
- Quantization - Optimize memory usage
- Checkpointing - Save training progress