Overview
Multihost deployment enables distributed training by coordinating multiple TPU VMs to work together on the same training job. This approach is ideal when you need more compute power than a single host can provide.
Prerequisites
- A TPU pod slice (multiple TPU VMs)
- gcloud CLI configured with your project
- MaxDiffusion repository cloned locally
- GCS bucket for storing outputs
Setup and training
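On a TPU pod slice, every worker must run the same training command. One common way to do this is gcloud's `--worker=all` SSH fan-out; the TPU name, zone, bucket, and script paths below are placeholders for your own setup, and the config file name is an assumption about your MaxDiffusion checkout:

```shell
# Run the identical training command on every worker in the pod slice.
# "my-tpu-pod", the zone, the bucket, and the config path are placeholders.
gcloud compute tpus tpu-vm ssh my-tpu-pod \
  --zone=us-central2-b \
  --worker=all \
  --command="cd maxdiffusion && python src/maxdiffusion/train.py \
    src/maxdiffusion/configs/base_2_base.yml \
    run_name=my_run output_dir=gs://my-bucket/outputs"
```

Because each worker discovers its peers through the TPU runtime, no extra rendezvous configuration is needed beyond launching the same command everywhere.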
Configuration options
Parallelism strategies
MaxDiffusion supports several parallelism strategies for multihost training:
- Data parallelism (ici_data_parallelism) - Distribute different batches across devices
- FSDP parallelism (ici_fsdp_parallelism) - Shard model parameters across devices
- Tensor parallelism (ici_tensor_parallelism) - Split individual tensors across devices
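As a sketch, these strategies are typically selected with config overrides appended to the training command. The script path, config file, and the value 32 (e.g. the chip count of the slice) are illustrative assumptions:

```shell
# Illustrative: shard model parameters with FSDP across 32 chips.
# The product of the three parallelism values should match the device count.
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  ici_data_parallelism=1 \
  ici_fsdp_parallelism=32 \
  ici_tensor_parallelism=1
```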
Environment variables
For optimal performance, set LIBTPU_INIT_ARGS with appropriate XLA flags:
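A minimal sketch of exporting the variable before launching training. The specific flags below are commonly cited for TPU collective performance, but which flags actually help is workload-dependent, so treat them as an assumption to benchmark rather than a recommendation:

```shell
# Set XLA/libtpu tuning flags before starting the training process.
# Flag choice is illustrative; validate against your own workload.
export LIBTPU_INIT_ARGS="--xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true"
```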
Monitoring
Training metrics are automatically logged to TensorBoard.
Best practices
Start with single host
Always validate your training configuration on a single host before scaling to multihost to catch configuration issues early.
Use GCS for storage
Store all outputs (checkpoints, logs, datasets) in Google Cloud Storage for accessibility across all workers.
Set appropriate batch sizes
Ensure that per_device_batch_size multiplied by the number of devices results in a reasonable global batch size.
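The check above is simple arithmetic, sketched here with placeholder values (32 devices, as on a hypothetical v4-64 slice, is an assumption, not a property of your deployment):

```shell
# Sanity-check the effective global batch size before launching a run.
# Both values below are placeholders for your actual configuration.
per_device_batch_size=4
num_devices=32
global_batch_size=$((per_device_batch_size * num_devices))
echo "global batch size: ${global_batch_size}"
```

If the resulting global batch size is far outside the range you validated on a single host, adjust per_device_batch_size (and likely the learning rate) before scaling out.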