This guide covers common issues you may encounter when using MaxDiffusion and how to resolve them.

Compilation issues

Symptoms: Training or inference hangs during compilation, or compilation takes over 30 minutes.

Solutions:
  1. Use JAX compilation cache to avoid recompiling:
    python src/maxdiffusion/train_wan.py \
      src/maxdiffusion/configs/base_wan_14b.yml \
      jax_cache_dir=gs://your-bucket/jax_cache/
    
  2. Reduce model or batch size during initial testing:
    per_device_batch_size=0.125  # Smaller batch for faster compilation
    
  3. Check LIBTPU_INIT_ARGS - some flag combinations can slow compilation:
    # Try disabling all flags first
    export LIBTPU_INIT_ARGS=""
    
  4. Enable profiler to see where it’s stuck:
    enable_profiler: True
    skip_first_n_steps_for_profiler: 1
    
Symptoms: Errors like “Shape mismatch” or “XLA compilation failed”.

Solutions:
  1. Verify parallelism settings match your hardware:
    # Check that product of ICI axes equals devices per slice
    ici_data_parallelism=2
    ici_fsdp_parallelism=4  # 2 * 4 = 8 devices
    ici_tensor_parallelism=1
    
  2. Check batch size divisibility:
    # Global batch must be evenly divisible by (data * fsdp) parallelism
    per_device_batch_size * num_devices % (ici_data_parallelism * ici_fsdp_parallelism) == 0
    
  3. For Wan models, verify head parallelism divides 40:
    # Valid values: 1, 2, 4, 5, 8, 10, 20, 40
    ici_tensor_parallelism=5  # OK
    ici_tensor_parallelism=3  # ERROR: 40 % 3 != 0
    
  4. Disable jit_initializers for debugging:
    jit_initializers: False  # Only for single-host debugging
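The constraints in items 1-3 above can be checked before launching a job. The sketch below is illustrative, not part of MaxDiffusion; the parameter names mirror the config keys, and the 40-head count applies to the Wan models as noted above.

```python
def check_sharding(num_devices, ici_data, ici_fsdp, ici_tensor,
                   per_device_batch_size, num_heads=40):
    """Sanity-check the parallelism constraints described above."""
    # 1. The product of the ICI axes must equal the devices per slice.
    assert ici_data * ici_fsdp * ici_tensor == num_devices, \
        "ICI axes product must equal devices per slice"
    # 2. The global batch must divide evenly across (data * fsdp) shards.
    global_batch = per_device_batch_size * num_devices
    assert global_batch % (ici_data * ici_fsdp) == 0, \
        "global batch not divisible by data * fsdp parallelism"
    # 3. For Wan (40 attention heads), tensor parallelism must divide 40.
    assert num_heads % ici_tensor == 0, \
        "tensor parallelism must divide the number of heads"
    return int(global_batch)

# 8-device slice, 2-way data x 4-way FSDP, one sample per device:
print(check_sharding(8, 2, 4, 1, per_device_batch_size=1))  # -> 8
```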
    
Symptoms: Errors about bfloat16/float32 incompatibility.

Solutions:
  1. Match weights and activations dtypes:
    weights_dtype: bfloat16
    activations_dtype: bfloat16
    
  2. Use float32 for higher precision (slower):
    weights_dtype: float32
    activations_dtype: float32
    precision: "HIGHEST"
    
  3. For GPU, ensure Transformer Engine is installed when using cudnn_flash_te:
    pip install "transformer_engine[jax]"
    NVTE_FUSED_ATTN=1 python src/maxdiffusion/train_sdxl.py ...
    

Out of memory (OOM) errors

Symptoms: “Out of memory” or “HBM allocation failed” errors.

Solutions:
  1. Reduce batch size:
    per_device_batch_size=0.125  # Or even smaller like 0.0625
    
  2. Enable gradient checkpointing (rematerialization):
    remat_policy: "HIDDEN_STATE_WITH_OFFLOAD"  # For Wan
    remat_policy: "FULL"  # For maximum memory savings
    
  3. Use smaller flash block sizes:
    flash_block_sizes: {
      "block_q" : 512,
      "block_kv_compute" : 512,
      "block_kv" : 512,
      "block_q_dkv" : 512,
      "block_kv_dkv" : 512,
      "block_kv_dkv_compute" : 512,
      "block_q_dq" : 512,
      "block_kv_dq" : 512
    }
    
  4. Reduce resolution or number of frames:
    # For Wan models
    height=720  # Instead of 1280
    width=480   # Instead of 720
    num_frames=49  # Instead of 81
    
  5. Increase FSDP parallelism to shard model across more devices:
    ici_fsdp_parallelism=8  # More sharding = less memory per device
    
  6. For Wan, adjust scoped_vmem_limit:
    export LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=32768"  # Reduce from 65536
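To see why items 2, 3, and 5 help, a back-of-the-envelope estimate of the per-device weight footprint is useful. This is a rough sketch, not a MaxDiffusion API; the 14B parameter count is an assumption based on the Wan 14B model name, and it ignores optimizer state and activations.

```python
def per_device_param_bytes(num_params, bytes_per_element, fsdp_shards):
    # With FSDP, parameters are sharded evenly across devices, so the
    # per-device footprint shrinks linearly with the shard count.
    return num_params * bytes_per_element / fsdp_shards

# ~14B parameters in bfloat16 (2 bytes each) across 8-way FSDP:
gib = per_device_param_bytes(14e9, 2, 8) / 2**30
print(f"~{gib:.2f} GiB per device for weights alone")
```

Doubling `ici_fsdp_parallelism` or halving the dtype width each roughly halves this number, which is why bfloat16 plus more FSDP sharding is usually the first OOM remedy to try.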
    
Symptoms: OOM when loading pretrained weights.

Solutions:
  1. Enable single replica checkpoint restoring:
    enable_single_replica_ckpt_restoring: True
    
  2. For Wan models, use external disk for HuggingFace cache:
    HF_HUB_CACHE=/mnt/disks/external_disk/maxdiffusion_hf_cache/ python ...
    
  3. Load weights in bfloat16:
    weights_dtype: bfloat16
    from_pt: True
    
Symptoms: OOM when creating TFRecord datasets.

Solutions:
  1. Process in smaller batches:
    # In wan_txt2vid_data_preprocessing.py, reduce batch_size
    batch_size = 5  # Default is 10
    
  2. Increase number of shards:
    no_records_per_shard=5  # Smaller shards = less memory
    
  3. Use streaming dataset instead of in-memory:
    dataset_type: hf  # Instead of tf
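The relationship between records per shard and the number of shard files (item 2) is simple ceiling division. This helper is purely illustrative, not part of the preprocessing script:

```python
import math

def num_shards(total_records, records_per_shard):
    # Fewer records per shard means more, smaller files: each shard is
    # built in memory before being written, so smaller shards use less.
    return math.ceil(total_records / records_per_shard)

print(num_shards(100, 10))  # -> 10 shards
print(num_shards(100, 5))   # -> 20 shards
```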
    

Disk space issues

Symptoms: “No space left on device” errors.

Solutions:
  1. Attach external disk to VM:
    # Follow: https://cloud.google.com/tpu/docs/attach-durable-block-storage
    # Then mount and use for cache:
    HF_HUB_CACHE=/mnt/disks/external_disk/maxdiffusion_hf_cache/
    
  2. Save checkpoints to GCS instead of local disk:
    output_dir: gs://my-bucket/checkpoints/
    jax_cache_dir: gs://my-bucket/jax_cache/
    
  3. Disable checkpoint saving during debugging:
    checkpoint_every: -1
    save_final_checkpoint: False
    
  4. Clean up HuggingFace cache:
    rm -rf ~/.cache/huggingface/hub/*
    # Or set cache to GCS bucket
    
  5. Use smaller dataset or streaming:
    dataset_type: hf  # Streams data without downloading
    max_train_samples: 1000  # Limit dataset size
    
Symptoms: Disk full when downloading datasets from HuggingFace.

Solutions:
  1. Use streaming dataset:
    dataset_type: hf  # No download needed
    dataset_name: BleachNick/UltraEdit_500k
    
  2. Download to external disk:
    export HF_DATASET_DIR=/mnt/disks/external_disk/datasets/
    huggingface-cli download RaphaelLiu/PusaV1_training --local-dir $HF_DATASET_DIR
    
  3. Download directly to GCS:
    # Download locally first, then upload and delete
    huggingface-cli download ... --local-dir /tmp/dataset
    gsutil -m cp -r /tmp/dataset gs://my-bucket/
    rm -rf /tmp/dataset
    

Permission and access errors

Symptoms: “401 Client Error: Unauthorized” or “Access denied”.

Solutions:
  1. Obtain access to the model on HuggingFace (e.g., Flux, Wan).
  2. Create a HuggingFace token at https://huggingface.co/settings/tokens.
  3. Set token in config or environment:
    hf_access_token: 'hf_xxxxxxxxxxxxxxxxxxxx'
    
    Or:
    export HF_TOKEN='hf_xxxxxxxxxxxxxxxxxxxx'
    huggingface-cli login --token $HF_TOKEN
    
Symptoms: “403 Forbidden” or “Permission denied” when accessing GCS buckets.

Solutions:
  1. Authenticate gcloud:
    gcloud auth login
    gcloud auth application-default login
    
  2. Set project:
    gcloud config set project YOUR_PROJECT_ID
    
  3. Grant VM service account permissions:
    # Give Storage Admin role to TPU service account
    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
      --member serviceAccount:SERVICE_ACCOUNT_EMAIL \
      --role roles/storage.admin
    
  4. Check bucket exists and is accessible:
    gsutil ls gs://my-bucket/
    
Symptoms: “Permission denied” when saving checkpoints locally.

Solutions:
  1. Check directory permissions:
    ls -la /tmp/
    chmod 777 /tmp/output  # Or appropriate permissions
    
  2. Use home directory or /tmp:
    output_dir: /tmp/checkpoints/
    dataset_save_location: /tmp/dataset/
    
  3. Run with appropriate user:
    sudo chown -R $USER:$USER /path/to/output
    

Training and inference issues

Symptoms: Loss shows as NaN or increases dramatically.

Solutions:
  1. Reduce learning rate:
    learning_rate: 1.e-6  # Instead of 1.e-5
    
  2. Enable gradient clipping:
    max_grad_norm: 1.0  # Default, try 0.5 for more aggressive clipping
    
  3. Use float32 instead of bfloat16:
    weights_dtype: float32
    activations_dtype: float32
    
  4. Check data preprocessing - ensure images/videos are normalized correctly.
  5. Reduce batch size - very large batches can cause instability.
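Item 2's `max_grad_norm` implements global-norm gradient clipping. The sketch below shows the math on plain Python lists; it is a minimal illustration of the technique, not MaxDiffusion's actual implementation (which operates on the full gradient pytree).

```python
import math

def clip_by_global_norm(grads, max_grad_norm=1.0):
    """Rescale gradients so their combined L2 norm never exceeds
    max_grad_norm; gradients below the threshold pass through unchanged."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_grad_norm / max(global_norm, 1e-12))
    return [g * scale for g in grads], global_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0)
print(norm, clipped)  # -> 5.0 [0.6, 0.8]
```

Lowering `max_grad_norm` from 1.0 to 0.5 simply halves the cap, shrinking the occasional exploding gradient more aggressively.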
Symptoms: Outputs are blurry, distorted, or don’t match prompts.

Solutions:
  1. Increase inference steps:
    num_inference_steps=50  # Instead of 20
    
  2. Adjust guidance scale:
guidance_scale=7.5  # Try values between 5 and 15
    
  3. For Wan models, set flow_shift:
    flow_shift=5.0  # Wan2.1 recommended value
    
  4. Use higher precision:
    weights_dtype: float32
    activations_dtype: float32
    
  5. Check if model loaded correctly - verify checkpoint path and weights.
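Item 2's `guidance_scale` controls classifier-free guidance. The sketch below shows the standard formula on scalar lists, as an illustration only (the real models apply it elementwise to noise predictions):

```python
def apply_guidance(uncond, cond, guidance_scale):
    # Classifier-free guidance: push the prediction away from the
    # unconditional output, toward the prompt-conditioned one.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# scale 1.0 returns the conditional prediction unchanged; larger values
# amplify prompt adherence, at the cost of artifacts when set too high.
print(apply_guidance([0.0], [1.0], 7.5))  # -> [7.5]
print(apply_guidance([0.0], [1.0], 1.0))  # -> [1.0]
```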
Symptoms: Step time is much slower than expected.

Solutions:
  1. Enable flash attention:
    attention='flash'
    flash_min_seq_length=0
    
  2. Optimize LIBTPU_INIT_ARGS - see optimization guide.
  3. Use appropriate flash block sizes for your TPU generation.
  4. Cache latents and text encodings:
    cache_latents_text_encoder_outputs: True
    
  5. Enable profiler to identify bottlenecks:
    enable_profiler: True
    skip_first_n_steps_for_profiler: 5
    profiler_steps: 10
    
  6. For GPU, use fused attention:
    NVTE_FUSED_ATTN=1 python ... attention="cudnn_flash_te"
    

Multihost issues

Symptoms: Training hangs when running on multiple hosts.

Solutions:
  1. Enable distributed system initialization:
    skip_jax_distributed_system: False
    
  2. Ensure all hosts have same code version:
    # On all workers:
    cd maxdiffusion && git pull && pip install -e .
    
  3. Check DCN parallelism settings:
    dcn_data_parallelism=-1  # Auto-shard across slices
    dcn_fsdp_parallelism=1
    dcn_tensor_parallelism=1
    
  4. Verify network connectivity between hosts.
  5. Use GCS for checkpoints, not local disk:
    output_dir: gs://my-bucket/output/
    
Symptoms: Slow step times with multiple hosts.

Solutions:
  1. Ensure there are at least as many data files as hosts:
    # If 8 hosts, need at least 8+ TFRecord files
    no_records_per_shard=10  # Reduce to create more files
    
  2. Use GCS for data storage, not local disk:
    train_data_dir: gs://my-bucket/dataset/
    
  3. Enable data shuffling:
    enable_data_shuffling: True
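Item 1 follows from each host reading a disjoint subset of the TFRecord files: with fewer files than hosts, some hosts have nothing to read. A quick illustrative check (not part of MaxDiffusion):

```python
import math

def files_per_host(total_records, records_per_shard, num_hosts):
    shards = math.ceil(total_records / records_per_shard)
    # Each host needs at least one file; 0 here means some hosts idle.
    return shards // num_hosts

# 100 records, 8 hosts:
print(files_per_host(100, 20, 8))  # -> 0, too few files; reduce shard size
print(files_per_host(100, 10, 8))  # -> 1, at least one file per host
```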
    

Getting help

If you’re still experiencing issues:
  1. Check the logs for detailed error messages
  2. Enable profiler to identify performance bottlenecks
  3. Search GitHub issues: https://github.com/AI-Hypercomputer/maxdiffusion/issues
  4. File a bug report with:
    • Complete error message and stack trace
    • Hardware type (TPU v5p, v6e, GPU model)
    • MaxDiffusion version and commit hash
    • Full command or config used
    • Steps to reproduce
