PyTorch 2.0 compile
The torch.compile() feature in PyTorch 2.0 provides significant speedups with a single line of code.
Enable compile mode
By default, nanoGPT uses PyTorch 2.0’s compile feature. In train.py:74, the compile flag is set to True.
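The relevant lines look roughly like this (paraphrased from train.py; the stand-in model and exact line numbers are illustrative, and positions may drift between versions):

```python
import torch

# train.py config (sketch)
compile = True  # use PyTorch 2.0 to compile the model for speed

model = torch.nn.Linear(8, 8)  # stand-in for the GPT model

if compile and hasattr(torch, 'compile'):
    model = torch.compile(model)  # requires PyTorch >= 2.0

# to disable from the command line (e.g. on Windows):
#   python train.py --compile=False
```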
PyTorch compile can reduce iteration time from ~250ms to ~135ms, nearly a 2x speedup.
Disable compile mode
On some platforms (such as Windows) or with older PyTorch versions, compile may not be available; set compile=False to disable it.
Flash Attention
Flash Attention uses optimized CUDA kernels for dramatically faster attention computation.
Automatic detection
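A minimal version of such a check (a sketch of the pattern, not a verbatim copy of model.py):

```python
import torch

# Flash Attention is exposed through scaled_dot_product_attention,
# which only exists in PyTorch >= 2.0
flash_available = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
if not flash_available:
    print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
```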
The model automatically detects Flash Attention support in model.py:44-50.
Flash vs. standard attention
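Dispatching between the fused kernel and a manually materialized attention matrix can be sketched as follows. This is a hypothetical standalone function, not nanoGPT's exact code, which additionally applies dropout and caches the causal mask as a buffer:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    if hasattr(F, 'scaled_dot_product_attention'):
        # fused Flash Attention kernel path (PyTorch >= 2.0)
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # standard path: materialize the full (seq_len, seq_len) attention matrix
    T = q.size(-2)
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~mask, float('-inf'))
    att = F.softmax(att, dim=-1)
    return att @ v
```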
When Flash Attention is available (model.py:62-64), the model uses the fused kernel; otherwise it falls back to the standard attention implementation.
Mixed precision training
nanoGPT supports multiple precision modes to balance speed and memory usage.
Precision options
Set the dtype parameter in train.py:73 to one of 'float32', 'bfloat16', or 'float16'.
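A sketch of the setting and how it maps to a torch dtype (paraphrased; line numbers may differ between versions):

```python
import torch

# pick bfloat16 when the GPU supports it, else fall back to float16
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'

# map the string setting to an actual torch dtype
ptdtype = {'float32': torch.float32,
           'bfloat16': torch.bfloat16,
           'float16': torch.float16}[dtype]
```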
Precision mode comparison
bfloat16 (recommended for A100/H100)
- Best balance of speed and stability
- Wider dynamic range than float16
- No gradient scaling required
- Requires GPU support (A100, H100, etc.)
float16
- Fast on most modern GPUs
- Requires gradient scaling (automatic in nanoGPT)
- May require careful tuning for stability
float32
- Slowest but most stable
- Use for debugging or CPU training
- No special GPU features required
Automatic mixed precision
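The pattern looks roughly like this (a sketch: float32 and CPU runs use a no-op context, and float16 additionally needs a GradScaler for loss scaling):

```python
import torch
from contextlib import nullcontext

device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
ptdtype = torch.bfloat16  # or torch.float16 / torch.float32, per the dtype setting

# no-op context on CPU; autocast on GPU
ctx = nullcontext() if device_type == 'cpu' else \
    torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# loss scaling is only needed for float16; this is a no-op when enabled=False
scaler = torch.cuda.amp.GradScaler(enabled=(ptdtype == torch.float16))

# inside the training loop:
#   with ctx:
#       logits, loss = model(X, Y)
#   scaler.scale(loss).backward()
```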
The training script uses PyTorch’s autocast context (train.py:112).
TF32 precision
TensorFloat-32 (TF32) provides a speedup on Ampere GPUs and newer.
Enable TF32
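Enabling it takes two backend flags (a sketch of the pattern; the flags are harmless no-ops on hardware without TF32 support):

```python
import torch

# allow TF32 tensor-core math on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True  # matmul operations
torch.backends.cudnn.allow_tf32 = True        # cuDNN (convolutions etc.)
```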
By default, nanoGPT enables TF32 for matmul and cuDNN operations (train.py:107-108). Note that TF32 is only available on NVIDIA Ampere GPUs (A100, RTX 3090, etc.) and newer.
Fused AdamW optimizer
nanoGPT automatically uses the fused AdamW optimizer when available for faster updates.
Automatic detection
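The detection can be sketched by inspecting AdamW's signature for a fused argument (paraphrased; the stand-in parameter list and learning rate here are illustrative):

```python
import inspect
import torch

device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
params = [torch.nn.Parameter(torch.zeros(10))]  # stand-in for the model's parameter groups

# the fused kernel only exists in newer PyTorch builds, and only helps on CUDA
fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
use_fused = fused_available and device_type == 'cuda'
extra_args = dict(fused=True) if use_fused else dict()

optimizer = torch.optim.AdamW(params, lr=6e-4, betas=(0.9, 0.95), **extra_args)
```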
The optimizer setup (model.py:281-285) detects fused AdamW support.
Model FLOPs utilization (MFU)
Track how efficiently your model uses GPU compute with MFU metrics.
MFU calculation
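The idea is achieved FLOPs per second divided by the hardware's peak. A simplified sketch of such an estimate (the function name and argument list are illustrative, not nanoGPT's exact API; 312 TFLOPS is the A100 bfloat16 peak):

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    """Rough MFU estimate: achieved FLOPs/s over peak FLOPs/s.

    Uses the common approximation of ~6N FLOPs per token for the dense
    matmuls, plus a quadratic attention term.
    """
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * seq_len * fwdbwd_per_iter
    return (flops_per_iter / dt) / peak_flops
```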
The model estimates MFU based on A100 peak FLOPS (model.py:289-303).
Gradient accumulation
Simulate larger batch sizes without increasing memory usage.
Configure accumulation steps
Set gradient_accumulation_steps to simulate larger batches (train.py:48).
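The accumulation loop sums scaled gradients over several micro-batches before a single optimizer step. A toy, self-contained sketch of the pattern (a tiny linear model stands in for the GPT; nanoGPT's real loop also handles autocast and DDP gradient sync):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
X, Y = torch.randn(2, 4), torch.randn(2, 1)

gradient_accumulation_steps = 8  # effective batch = micro-batch * this

for micro_step in range(gradient_accumulation_steps):
    loss = torch.nn.functional.mse_loss(model(X), Y)
    # divide so the accumulated gradient is the average over micro-steps
    (loss / gradient_accumulation_steps).backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```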
Distributed training adjustment
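Since each rank runs its own micro-steps, the configured accumulation count is divided by the world size to keep the global effective batch constant. A sketch (the world size here is a hypothetical value):

```python
# keep the global effective batch constant across world sizes
ddp_world_size = 4                # hypothetical: number of DDP processes
gradient_accumulation_steps = 40  # configured for a single process

assert gradient_accumulation_steps % ddp_world_size == 0
gradient_accumulation_steps //= ddp_world_size  # micro-steps per rank
```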
With DDP, gradient accumulation is automatically scaled down by the number of processes (train.py:94-95).
Memory optimizations
Disable bias parameters
Set bias=False for faster and more memory-efficient training (train.py:56).
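Because PyTorch's built-in LayerNorm has no bias switch, nanoGPT defines a small variant; a paraphrased sketch:

```python
import torch
import torch.nn.functional as F

class LayerNorm(torch.nn.Module):
    """LayerNorm with an optional bias (torch.nn.LayerNorm lacks bias=False)."""

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(ndim))
        self.bias = torch.nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)

ln = LayerNorm(64, bias=False)  # bias-free, matching bias=False in the config
```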
Disabling bias in LayerNorm and Linear layers provides a small speedup and reduces memory usage with minimal impact on model quality.
Efficient data loading
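A batch loader along these lines (a simplified sketch of nanoGPT's get_batch; the real function also moves tensors to the device, and the file path, block size, and batch size here are illustrative):

```python
import numpy as np
import torch

def get_batch(path, block_size=8, batch_size=4):
    # np.memmap keeps the token file on disk; only the slices touched
    # below are actually read into RAM
    data = np.memmap(path, dtype=np.uint16, mode='r')
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
    return x, y  # inputs and next-token targets
```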
Use memory-mapped files to avoid loading the entire dataset into RAM (train.py:117-122).
Pinned memory for GPU transfers
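The transfer pattern looks roughly like this (a sketch; pinning only applies on CUDA systems):

```python
import torch

x = torch.randint(0, 50257, (4, 8))  # stand-in token batch

if torch.cuda.is_available():
    # pinned (page-locked) host memory lets the copy engine transfer
    # the tensor to the GPU asynchronously
    x = x.pin_memory().to('cuda', non_blocking=True)
```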
Pinned memory enables faster CPU-to-GPU transfers (train.py:128).
Platform-specific optimizations
Apple Silicon (MPS)
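Assumed config overrides for an MPS run (a sketch in train.py's config style; verify against your PyTorch version, since MPS support for torch.compile is limited):

```python
# train.py overrides for Apple Silicon (assumed settings)
device = 'mps'    # Metal Performance Shaders backend
compile = False   # torch.compile support on MPS is limited
# equivalent CLI:
#   python train.py --device=mps --compile=False
```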
For M1/M2/M3 Macs, use the Metal Performance Shaders backend.
CPU training
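A sketch of suitable overrides for a CPU run (the specific batch and block sizes are illustrative choices, not prescribed values):

```python
# train.py overrides for CPU-only training (assumed settings)
device = 'cpu'
compile = False    # compile gains target GPUs; may be unsupported here
dtype = 'float32'  # safest on CPU
batch_size = 4     # smaller batch to keep iterations fast
block_size = 64    # shorter context for quick experiments
# equivalent CLI:
#   python train.py --device=cpu --compile=False
```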
For CPU-only environments, disable compile and adjust settings.
Performance checklist
Before training, verify these optimizations are enabled:
- PyTorch 2.0+ installed for compile and Flash Attention
- --compile=True enabled (default)
- dtype='bfloat16' on supported GPUs (A100, H100)
- TF32 enabled on Ampere+ GPUs (automatic)
- Fused AdamW detected and enabled (check logs)
- Appropriate gradient accumulation for your GPU memory
- bias=False for slightly better efficiency