PyTorch 2.0 compile issues
The most common issues relate to PyTorch 2.0's torch.compile() feature.
Compile not available
Problem: Training fails with compile-related errors.
Solution: Disable compile mode with --compile=False.
Platform compatibility
Affected platforms:
- Windows (limited support)
- Older Linux distributions
- Some cloud environments
Select the appropriate PyTorch version at https://pytorch.org/get-started/locally/
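On affected platforms, compile can be turned off from the command line; a sketch, assuming nanoGPT-style --key=value overrides of train.py settings:

```shell
# Disable torch.compile() on platforms where it is unsupported
python train.py --compile=False
```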
Flash Attention warnings
Problem: You see this warning during training:
WARNING: using slow attention
Solution: Flash Attention requires PyTorch 2.0 or newer; upgrade PyTorch to use the fast attention path.
Out of memory errors
CUDA out of memory
Problem: Training crashes with a CUDA out of memory error.
Solutions (try in order):
1. Reduce batch size
Decrease the batch size to use less GPU memory:
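For example (values are illustrative; batch_size is the standard train.py override):

```shell
# Try a smaller batch first...
python train.py --batch_size=32
# ...and smaller still if you continue to hit OOM
python train.py --batch_size=16
```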
2. Reduce block size (context length)
Memory usage scales quadratically with sequence length, so reducing the context length helps substantially:
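A sketch with illustrative values, assuming the block_size override:

```shell
# Shrink the context window to cut attention memory
python train.py --block_size=512
# Or smaller
python train.py --block_size=256
```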
3. Reduce model size
Train a smaller model. From README.md:166, the available GPT-2 model sizes are:
- gpt2 (124M) - default
- gpt2-medium (350M)
- gpt2-large (774M)
- gpt2-xl (1558M)
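To train a smaller model, either shrink the architecture or start from a smaller pretrained checkpoint; a sketch with illustrative dimensions, assuming nanoGPT-style overrides:

```shell
# Train a small model from scratch
python train.py --n_layer=6 --n_head=6 --n_embd=384
# Or finetune starting from the smallest pretrained GPT-2
python train.py --init_from=gpt2
```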
4. Use gradient accumulation
Simulate larger batch sizes without using more memory: accumulating gradients over 3 micro-batches of size 4 gives an effective batch size of 4 × 3 = 12.
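A sketch matching the arithmetic above:

```shell
# 4 samples per micro-batch x 3 accumulation steps = effective batch of 12
python train.py --batch_size=4 --gradient_accumulation_steps=3
```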
5. Use float16 instead of bfloat16
On GPUs without bfloat16 support, this is automatic. To force float16:
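A sketch, assuming the dtype override:

```shell
# Force float16 mixed precision instead of bfloat16
python train.py --dtype=float16
```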
CPU/MPS training issues
Training on CPU
Problem: Need to train on a CPU-only system.
Solution: Adjust settings for CPU training per README.md:82-88: set --device=cpu, disable compile, and use a small model, batch size, and context length.
Apple Silicon (M1/M2/M3) Macs
Problem: Training on Apple Silicon.
Solution: Use the MPS (Metal Performance Shaders) backend with --device=mps (README.md:105).
Multi-node training issues
Slow multi-node training
Problem: Multi-node training is extremely slow.
Cause: Missing or slow network interconnect.
Solution: Benchmark your interconnect (README.md:132), e.g. with iperf3.
Data loading issues
Data files not found
Problem: Error about missing train.bin or val.bin files.
Solution: Prepare the dataset first:
For Shakespeare:
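A sketch of the preparation step (paths assume the repository's data/ layout):

```shell
# Writes train.bin and val.bin into the dataset directory
python data/shakespeare_char/prepare.py   # character-level Shakespeare
python data/shakespeare/prepare.py        # GPT-2 BPE-tokenized Shakespeare
```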
Memory leak with data loading
Problem: Memory usage grows over time.
Solution: The code already handles this (train.py:117-122) by recreating the np.memmap for every batch, which avoids a known memmap memory-leak pattern.
Checkpointing issues
Checkpoint loading errors
Problem: Error when resuming from checkpoint with a _orig_mod. prefix in parameter names.
Solution: This is handled automatically in train.py:174-177:
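The fix amounts to stripping the prefix that torch.compile() adds to parameter names before loading the state dict; a minimal sketch of that logic (function name is illustrative):

```python
def strip_orig_mod(state_dict):
    # torch.compile() wraps the model and prefixes parameter names with
    # '_orig_mod.'; remove it so an uncompiled model can load the weights.
    unwanted_prefix = '_orig_mod.'
    for k in list(state_dict.keys()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    return state_dict
```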
Incompatible checkpoint
Problem: Can't resume training from checkpoint.
Cause: Model architecture mismatch.
Check: The checkpoint enforces matching architecture (train.py:166-167).
Sampling/Inference issues
Sampling fails or produces bad output
Problem: sample.py produces errors or nonsensical text.
Solutions:
Wrong output directory
Ensure you’re pointing to the correct checkpoint:
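For example (directory name is illustrative):

```shell
# out_dir must contain the ckpt.pt written during training
python sample.py --out_dir=out-shakespeare-char
```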
Insufficient training
The model may not be trained enough. Check training loss:
- Shakespeare char-level: aim for loss < 1.5
- Shakespeare with GPT-2: aim for loss < 1.0
Temperature too high/low
Adjust sampling temperature (default 1.0):
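A sketch, assuming sample.py's temperature flag:

```shell
# Lower temperature = safer, more repetitive; higher = more diverse, riskier
python sample.py --temperature=0.8
```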
Performance issues
Training is slower than expected
Checklist:
- Compile enabled?
- Using appropriate dtype?
- Flash Attention available? Check for the warning about slow attention and upgrade PyTorch if needed.
GPU utilization
Monitor with:
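For example, with NVIDIA's monitoring tool:

```shell
# Print GPU utilization and memory usage every second
nvidia-smi -l 1
```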
GPU utilization should be >80%.
Low MFU (Model FLOPs Utilization)
Problem: MFU is less than 40%.
Potential causes and solutions:
Small batch size
Increase batch size to improve GPU utilization:
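For example (value is illustrative):

```shell
# Bigger batches amortize kernel launch and data-movement overhead
python train.py --batch_size=64
```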
Data loading bottleneck
Ensure data is on fast storage (NVMe SSD, not a network drive). The code already uses asynchronous data transfer to the GPU (train.py:302-303):
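A minimal sketch of that pattern (not the exact train.py code): pinned host memory lets the host-to-device copy run asynchronously and overlap with compute.

```python
import torch

def move_batch(x, device):
    # Pinned (page-locked) host memory enables async H2D copies on CUDA;
    # on CPU/MPS there is nothing to overlap, so just move the tensor.
    if device.startswith('cuda'):
        return x.pin_memory().to(device, non_blocking=True)
    return x.to(device)
```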
Compile not enabled
Ensure --compile=True (default).
Loss not decreasing
Loss is NaN
Problem: Training loss becomes NaN.
Common causes:
- Learning rate too high
- Mixed precision instability
- Gradient clipping disabled
Ensure gradient clipping is enabled (default grad_clip=1.0).
Loss not improving
Problem: Loss plateaus too high.
Solutions:
- Insufficient training
  - Check you're running enough iterations
  - For GPT-2 124M: aim for 600k iterations
- Learning rate decay
  - Ensure --decay_lr=True (default)
  - Check lr_decay_iters matches max_iters
- Data issues
  - Verify dataset prepared correctly
  - Check data files aren't corrupted
Logging and monitoring
Weights & Biases (wandb) issues
Problem: wandb logging not working.
Solution: Enable wandb logging with --wandb_log=True (and set --wandb_project / --wandb_run_name as desired).
Getting help
If you're still experiencing issues:
- Check the GitHub issues: many common problems are already documented
- Watch the educational video: see README.md:226-227
- Join the Discord: the #nanoGPT channel, https://discord.gg/3zy8kqD9Cp (README.md:228-230)