Compilation issues
Compilation takes too long or hangs
Symptoms: Training or inference hangs during compilation, or compilation takes over 30 minutes.
Solutions:
- Use the JAX compilation cache to avoid recompiling.
- Reduce the model or batch size during initial testing.
- Check LIBTPU_INIT_ARGS; some flag combinations can slow compilation.
- Enable the profiler to see where compilation is stuck.
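As a sketch, the persistent compilation cache can be enabled through an environment variable before launching the job (the cache directory, script path, and config are illustrative):

```shell
# Enable JAX's persistent compilation cache so repeated runs reuse
# compiled programs instead of recompiling from scratch.
export JAX_COMPILATION_CACHE_DIR=/tmp/jax_cache   # directory is an example

# Launch training as usual; script path and config are illustrative.
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  run_name=compile-test
```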
XLA compilation errors or mismatched shapes
Symptoms: Errors like “Shape mismatch” or “XLA compilation failed”.
Solutions:
- Verify that parallelism settings match your hardware.
- Check batch size divisibility; the global batch size must divide evenly across data-parallel devices.
- For Wan models, verify that head parallelism divides 40 (the attention head count).
- Disable jit_initializers for debugging.
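As an illustration, parallelism and batch settings can be overridden on the command line. All key names except jit_initializers are assumptions; check them against your config file:

```shell
# The product of the parallelism axes must equal the device count, and
# head parallelism must divide the attention head count (40 for Wan).
# Key names other than jit_initializers are assumptions.
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  ici_data_parallelism=4 ici_fsdp_parallelism=1 \
  per_device_batch_size=1 \
  jit_initializers=False   # simpler traces while debugging shape errors
```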
Incompatible dtype errors
Symptoms: Errors about bfloat16/float32 incompatibility.
Solutions:
- Match the weights and activations dtypes.
- Use float32 for higher precision (slower and more memory-hungry).
- For GPU, ensure Transformer Engine is installed when using cudnn_flash_te.
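A sketch of matching dtype overrides (key names are assumptions) and a quick check that Transformer Engine is importable:

```shell
# Keep weights and activations in the same dtype (key names assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  weights_dtype=bfloat16 activations_dtype=bfloat16
# For full precision, set both to float32 instead.

# On GPU, confirm Transformer Engine imports before selecting cudnn_flash_te:
python -c "import transformer_engine; print('ok')"
```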
Out of memory (OOM) errors
TPU/GPU runs out of memory during training
Symptoms: “Out of memory” or “HBM allocation failed” errors.
Solutions:
- Reduce the batch size.
- Enable gradient checkpointing (rematerialization).
- Use smaller flash attention block sizes.
- Reduce the resolution or number of frames.
- Increase FSDP parallelism to shard the model across more devices.
- For Wan, adjust scoped_vmem_limit.
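A sketch of memory-saving overrides. The key names are assumptions, and `-1` follows the MaxText convention of "use all remaining devices" for a parallelism axis:

```shell
# Illustrative memory-saving overrides (key names assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  per_device_batch_size=1 \
  remat_policy=full \
  ici_fsdp_parallelism=-1   # shard the model across all remaining devices
```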
Out of memory during checkpoint loading
Symptoms: OOM when loading pretrained weights.
Solutions:
- Enable single-replica checkpoint restoring.
- For Wan models, use an external disk for the HuggingFace cache.
- Load weights in bfloat16.
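As an illustration, the HuggingFace cache can live on an attached disk, and weights can be loaded in bfloat16 (mount point and key name are assumptions):

```shell
# Keep the HuggingFace cache off the boot disk (path is an example):
export HF_HOME=/mnt/disks/external/huggingface
# Load weights in half precision (key name assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  weights_dtype=bfloat16
```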
Out of memory during data preprocessing
Symptoms: OOM when creating TFRecord datasets.
Solutions:
- Process the data in smaller batches.
- Increase the number of shards.
- Use a streaming dataset instead of an in-memory one.
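A hypothetical sketch of sharded preprocessing, so each run holds only a fraction of the data in memory. The script name and flags are placeholders for your own pipeline:

```shell
# Split TFRecord creation into shards; make_tfrecords.py and its flags
# are placeholders, not a real MaxDiffusion script.
NUM_SHARDS=16
for i in $(seq 0 $((NUM_SHARDS - 1))); do
  python make_tfrecords.py --shard_index="$i" --num_shards="$NUM_SHARDS" \
    --batch_size=64
done
```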
Disk space issues
Insufficient disk space for checkpoints or datasets
Symptoms: “No space left on device” errors.
Solutions:
- Attach an external disk to the VM.
- Save checkpoints to GCS instead of the local disk.
- Disable checkpoint saving while debugging.
- Clean up the HuggingFace cache.
- Use a smaller dataset or streaming.
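Illustrative commands for freeing and adding space; the disk name, zone, bucket, and checkpoint key name are placeholders:

```shell
# Create an extra persistent disk (name/size/zone are examples):
gcloud compute disks create extra-disk --size=500GB --zone=us-central2-b
# ...attach and mount it, then keep large artifacts off the boot disk:
export HF_HOME=/mnt/disks/extra/huggingface
rm -rf ~/.cache/huggingface          # reclaim space from the default cache
# Write checkpoints to GCS (key name assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  output_dir=gs://my-bucket/checkpoints
```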
Dataset download fills up disk
Symptoms: Disk full when downloading datasets from HuggingFace.
Solutions:
- Use a streaming dataset.
- Download to an external disk.
- Download directly to GCS.
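As a sketch, a dataset can be downloaded to an attached disk and then copied to GCS; the repo name, paths, and bucket are placeholders:

```shell
# Download to an external disk instead of the default cache:
huggingface-cli download my-org/my-dataset --repo-type dataset \
  --local-dir /mnt/disks/extra/my-dataset
# Copy it onward to GCS (bucket is an example):
gsutil -m cp -r /mnt/disks/extra/my-dataset gs://my-bucket/datasets/
```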
Permission and access errors
HuggingFace authentication errors for gated models
Symptoms: “401 Client Error: Unauthorized” or “Access denied”.
Solutions:
- Obtain access to the model on HuggingFace (e.g., Flux, Wan).
- Create a HuggingFace token:
  - Go to https://huggingface.co/settings/tokens
  - Create a token with read permissions
- Set the token in your config file or in the environment.
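For example, the token can be supplied through the environment (the token value is a placeholder):

```shell
# huggingface_hub reads HF_TOKEN from the environment:
export HF_TOKEN=hf_your_token_here
# Or log in explicitly with the CLI:
huggingface-cli login --token "$HF_TOKEN"
```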
GCS permission errors
Symptoms: “403 Forbidden” or “Permission denied” when accessing GCS buckets.
Solutions:
- Authenticate gcloud.
- Set the project.
- Grant the VM service account permissions.
- Check that the bucket exists and is accessible.
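The steps above can be sketched as follows; project, bucket, and service-account values are placeholders:

```shell
gcloud auth login
gcloud config set project my-project
# Grant the VM's service account access to the bucket:
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
gsutil ls gs://my-bucket        # verify the bucket exists and is readable
```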
Permission denied when writing to disk
Symptoms: “Permission denied” when saving checkpoints locally.
Solutions:
- Check the directory permissions.
- Use the home directory or /tmp.
- Run as the appropriate user.
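Illustrative checks (the paths are examples):

```shell
ls -ld /mnt/disks/extra/checkpoints                # inspect ownership and mode
sudo chown -R "$USER" /mnt/disks/extra/checkpoints # take ownership if needed
mkdir -p ~/checkpoints                             # or fall back to a writable path
```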
Training and inference issues
Loss is NaN or training diverges
Symptoms: Loss shows as NaN or increases dramatically.
Solutions:
- Reduce the learning rate.
- Enable gradient clipping.
- Use float32 instead of bfloat16.
- Check data preprocessing; ensure images/videos are normalized correctly.
- Reduce the batch size; very large batches can cause instability.
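A sketch of stability-oriented overrides; the key names and values are assumptions, not tuned recommendations:

```shell
# Illustrative stability overrides (key names and values assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  learning_rate=1e-5 \
  max_grad_norm=1.0 \
  weights_dtype=float32 activations_dtype=float32
```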
Generated images/videos have poor quality
Symptoms: Outputs are blurry, distorted, or don’t match prompts.
Solutions:
- Increase the number of inference steps.
- Adjust the guidance scale.
- For Wan models, set flow_shift.
- Use higher precision.
- Check that the model loaded correctly; verify the checkpoint path and weights.
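As an illustration, generation settings can be overridden at launch; the script path and key names are assumptions, and the values shown are examples rather than tuned recommendations:

```shell
# Illustrative generation overrides (script path, key names, and values assumed):
python src/maxdiffusion/generate.py src/maxdiffusion/configs/base_2_base.yml \
  num_inference_steps=50 \
  guidance_scale=7.5
# For Wan models, additionally set flow_shift per the model card.
```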
Slow training or inference performance
Symptoms: Step time is much slower than expected.
Solutions:
- Enable flash attention.
- Optimize LIBTPU_INIT_ARGS; see the optimization guide.
- Use flash block sizes appropriate for your TPU generation.
- Cache latents and text encodings.
- Enable the profiler to identify bottlenecks.
- For GPU, use fused attention.
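A sketch of performance-related overrides; the key names are assumptions, apart from cudnn_flash_te, which this guide mentions as the GPU fused-attention option:

```shell
# Illustrative performance overrides (key names assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  attention=flash \
  enable_profiler=True
# GPU variant, using Transformer Engine's fused attention:
#   attention=cudnn_flash_te
```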
Multihost issues
Multihost training hangs or crashes
Symptoms: Training hangs when running on multiple hosts.
Solutions:
- Enable distributed system initialization.
- Ensure all hosts run the same code version.
- Check the DCN parallelism settings.
- Verify network connectivity between hosts.
- Use GCS for checkpoints, not local disk.
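As an illustration, code versions can be compared across hosts before launching, and checkpoints kept on shared storage; the TPU name, zone, repo path, bucket, and checkpoint key name are placeholders:

```shell
# Confirm every host runs the same commit:
gcloud compute tpus tpu-vm ssh my-tpu --zone=us-central2-b --worker=all \
  --command='git -C ~/maxdiffusion rev-parse HEAD'
# Keep checkpoints on GCS so all hosts share the same storage (key name assumed):
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  output_dir=gs://my-bucket/checkpoints
```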
Multihost data loading is slow
Symptoms: Slow step times with multiple hosts.
Solutions:
- Ensure there are enough data files; you need more files than hosts.
- Use GCS for data storage, not local disk.
- Enable data shuffling.
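A quick way to check the first point; the bucket and path are placeholders. Each host needs at least one file, and more files than hosts allows even sharding:

```shell
# Count data shards, then compare against the number of hosts:
gsutil ls gs://my-bucket/dataset/*.tfrecord | wc -l
```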
Getting help
If you’re still experiencing issues:
- Check the logs for detailed error messages
- Enable profiler to identify performance bottlenecks
- Search GitHub issues: https://github.com/AI-Hypercomputer/maxdiffusion/issues
- File a bug report with:
  - Complete error message and stack trace
  - Hardware type (TPU v5p, v6e, GPU model)
  - MaxDiffusion version and commit hash
  - Full command or config used
  - Steps to reproduce