The bench.py script provides a simplified benchmarking tool that measures training performance without the overhead of data loading, logging, and checkpointing.
Overview
bench.py is a stripped-down version of train.py that focuses on the core training loop for accurate performance measurement.
From README.md:207: “For simple model benchmarking and profiling, bench.py might be useful. It’s identical to what happens in the meat of the training loop of train.py, but omits much of the other complexities.”
Basic benchmarking
Run a simple benchmark
Execute bench.py with default settings to:
- Run 10 warmup iterations (burnin phase)
- Run 20 benchmark iterations
- Report average time per iteration and MFU
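The two-stage structure can be sketched in plain Python; `training_step` below is only a stand-in for the real forward/backward pass in bench.py:

```python
import time

def training_step():
    # Stand-in for the real forward/backward/optimizer step in bench.py.
    sum(i * i for i in range(10_000))

# Stage 0 is burnin (results discarded), stage 1 is the timed benchmark.
for stage, num_steps in enumerate([10, 20]):
    t0 = time.time()
    for _ in range(num_steps):
        training_step()
    t1 = time.time()
    if stage == 1:  # only report the timed stage
        avg_iter_time = (t1 - t0) / num_steps
        print(f"time per iteration: {avg_iter_time * 1000:.2f}ms")
```

Discarding the burnin stage means one-time costs (compilation, cache warmup) never enter the reported average.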
Benchmark output
Expected output:
Configuration options
Customize benchmark parameters by modifying the script or using the configurator.
Key parameters
From bench.py:12-20:
Parameter descriptions
batch_size: Number of sequences per batch
- Default: 12
- Increase if you have more GPU memory
- Directly impacts throughput
block_size: Sequence length (context window)
- Default: 1024
- Memory usage scales quadratically with attention
- Larger values = more memory required
bias: Whether to use bias terms in LayerNorm and Linear layers
- Default: False
- False is faster and uses less memory
real_data: Whether to load real OpenWebText batches instead of fixed random data
- Default: True
- Set to False to isolate model performance from I/O
compile: Whether to compile the model with torch.compile
- Default: True
- Significant speedup (~2x) when enabled
profile: Whether to use the PyTorch profiler
- Default: False
- Enable for detailed profiling analysis
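Assuming the defaults listed above, the configuration block at the top of bench.py looks roughly like this:

```python
# Benchmark configuration (defaults as described above)
batch_size = 12    # sequences per batch
block_size = 1024  # sequence length / context window
bias = False       # no bias terms in LayerNorm/Linear layers
real_data = True   # load real OpenWebText batches
compile = True     # compile the model with torch.compile
profile = False    # use the PyTorch profiler
```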
Override via command line
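nanoGPT's configurator overrides module-level globals from `--key=value` arguments. A minimal sketch of that idea, operating on a plain dict rather than `globals()`:

```python
from ast import literal_eval

def apply_overrides(config, argv):
    """Apply --key=value command-line overrides to a config dict."""
    for arg in argv:
        if not arg.startswith('--') or '=' not in arg:
            continue
        key, val = arg[2:].split('=', 1)
        if key not in config:
            raise ValueError(f"unknown config key: {key}")
        try:
            config[key] = literal_eval(val)  # e.g. 1024, False, 1e-4
        except (SyntaxError, ValueError):
            config[key] = val  # fall back to the raw string
    return config

config = {'batch_size': 12, 'compile': True}
apply_overrides(config, ['--batch_size=4', '--compile=False'])
print(config)  # {'batch_size': 4, 'compile': False}
```

Rejecting unknown keys catches typos like `--batch_sizes=4` instead of silently ignoring them.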
Use the configurator to override parameters:
Benchmarking modes
Simple benchmarking (default)
The default mode runs a warmup phase followed by timed iterations (bench.py:98-117):
The two-stage approach ensures that compilation overhead and cache warmup don’t skew results.
PyTorch Profiler mode
Enable detailed profiling for analysis in TensorBoard:
View profiler results
After profiling, view results in TensorBoard:
Data loading options
Real data (default)
Use actual OpenWebText data (bench.py:33-43):
Fixed random data
Isolate model performance from data loading (bench.py:44-48):
Model configuration
The benchmark uses a GPT-2 124M sized model by default (bench.py:51-56):
Benchmark different model sizes
Modify the configuration to test different architectures:
GPT-2 model sizes
- GPT-2 Small (124M)
- GPT-2 Medium (350M)
- GPT-2 Large (774M)
- GPT-2 XL (1.5B)
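The standard GPT-2 architecture hyperparameters for each size are well known (they also appear in nanoGPT's train.py); substituting them into the model configuration might look like:

```python
# Standard GPT-2 architecture hyperparameters by model size
gpt2_sizes = {
    'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
    'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
    'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
    'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1.5B params
}

# Sanity check: embedding width must divide evenly across attention heads
for cfg in gpt2_sizes.values():
    assert cfg['n_embd'] % cfg['n_head'] == 0
```

Larger sizes may require reducing batch_size to fit in GPU memory.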
Interpreting results
Time per iteration
Measures the wall-clock time for one training step:
- Less than 100ms: Excellent performance
- 100-200ms: Good performance
- Greater than 200ms: May indicate configuration issues
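Time per iteration translates directly into token throughput. For example, with the default batch_size=12 and block_size=1024 and a hypothetical 100ms iteration:

```python
batch_size, block_size = 12, 1024
iter_time = 0.100  # seconds per iteration (hypothetical)

tokens_per_iter = batch_size * block_size  # 12,288 tokens per step
tokens_per_sec = tokens_per_iter / iter_time
print(f"{tokens_per_sec:,.0f} tokens/sec")  # 122,880 tokens/sec
```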
Model FLOPs Utilization (MFU)
Percentage of theoretical peak FLOPS achieved:
- Greater than 50%: Excellent utilization
- 40-50%: Good utilization
- Less than 40%: Room for optimization
MFU is calculated relative to A100 GPU peak performance (312 TFLOPS for bfloat16).
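As a sketch of the calculation, using the common 6N FLOPs-per-token approximation for training (N = parameter count; nanoGPT's actual estimate_mfu also adds an attention term, so this slightly undercounts):

```python
n_params = 124e6          # GPT-2 124M parameter count
batch_size, block_size = 12, 1024
iter_time = 0.100         # seconds per iteration (hypothetical)
a100_peak_flops = 312e12  # A100 bfloat16 peak, per the text above

flops_per_token = 6 * n_params  # forward + backward approximation
flops_per_iter = flops_per_token * batch_size * block_size
mfu = flops_per_iter / (iter_time * a100_peak_flops)
print(f"MFU: {mfu:.1%}")  # MFU: 29.3%
```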
Common benchmarking scenarios
Compare compile on/off
Test different precisions
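Both comparisons follow the same A/B pattern: time the same step function under two labeled configurations. A torch-free sketch of such a harness (the two lambdas are stand-ins for compiled vs. eager, or bfloat16 vs. float32, runs):

```python
import time

def benchmark(step_fn, burnin=10, iters=20):
    """Return average seconds per iteration after a warmup phase."""
    for _ in range(burnin):
        step_fn()
    t0 = time.time()
    for _ in range(iters):
        step_fn()
    return (time.time() - t0) / iters

# Stand-in workloads for the two configurations being compared
variants = {
    'baseline': lambda: sum(i * i for i in range(20_000)),
    'variant':  lambda: sum(i * i for i in range(10_000)),
}
for name, fn in variants.items():
    print(f"{name}: {benchmark(fn) * 1000:.2f} ms/iter")
```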
Memory-limited benchmarking
Reduce batch size and block size for smaller GPUs:
Multi-GPU benchmarking
Test on different GPUs:
Benchmark best practices
- Run multiple times: Performance can vary between runs
- Warm GPU: First run may be slower due to GPU initialization
- Close other processes: Ensure GPU is not being used by other tasks
- Monitor temperature: GPU throttling can affect results
- Consistent settings: Use same batch_size/block_size for fair comparisons