# Dataset types
MaxDiffusion supports four dataset types, controlled by the `dataset_type` flag:
| `dataset_type` | Location | Formats | Features |
|---|---|---|---|
| `hf` | HuggingFace Hub or Cloud Storage | parquet, arrow, json, csv, txt | Streaming, good for large datasets |
| `tf` | HuggingFace Hub (downloads to disk) | parquet, arrow, json, csv, txt | In-memory, works for small datasets |
| `tfrecord` | Local or Cloud Storage | TFRecord | Streaming, good for large datasets |
| `grain` | Local or Cloud Storage | ArrayRecord | Streaming, global shuffle, deterministic |
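Selecting a pipeline is a single flag override on the training command. The script and config paths below are placeholders for your actual entry point:

```bash
python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml \
  dataset_type=hf
```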
## HuggingFace streaming (`dataset_type=hf`)
Stream data directly from HuggingFace Hub or cloud storage without downloading.

### From HuggingFace Hub
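A minimal override set for streaming from the Hub might look like this (`dataset_type` is the documented flag; the `dataset_name` key and the dataset shown are illustrative, so check your config for the exact key names):

```bash
dataset_type=hf
dataset_name=diffusers/pokemon-gpt4-captions
```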
### From cloud storage
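The same pipeline can point at data files in a bucket. The `hf_train_files` key is an assumption, not a verified flag name, and the bucket path is a placeholder:

```bash
dataset_type=hf
hf_train_files=gs://your-bucket/data/*.parquet
```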
## tf.data in-memory (`dataset_type=tf`)
Downloads the entire dataset into memory. Best for small datasets. With `cache_latents_text_encoder_outputs=True`, the VAE and text encoder process images and captions during dataset creation, saving preprocessed latents and embeddings.
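The caching behavior can be sketched with stand-in encoders. The real VAE and text encoder are heavyweight models; the one-liners below are hypothetical placeholders that only illustrate the encode-once idea:

```python
def vae_encode(image):
    # Hypothetical stand-in for the VAE image encoder.
    return [pixel * 0.5 for pixel in image]

def text_encode(caption):
    # Hypothetical stand-in for the text encoder.
    return [float(len(word)) for word in caption.split()]

def cache_dataset(examples):
    # With cache_latents_text_encoder_outputs=True, encoding happens once
    # at dataset-creation time; training steps then read these precomputed
    # tensors instead of re-encoding images and captions every epoch.
    return [
        {"latents": vae_encode(ex["image"]),
         "embeds": text_encode(ex["caption"])}
        for ex in examples
    ]

cached = cache_dataset([{"image": [0.2, 0.8], "caption": "a red fox"}])
```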
## TFRecord format (`dataset_type=tfrecord`)
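For context, a TFRecord file is a flat sequence of length-prefixed, CRC32C-checksummed records, which is what makes it cheap to stream. The pure-Python sketch below illustrates that framing only; real pipelines should write files with TensorFlow's `tf.io.TFRecordWriter`, and this reader skips CRC verification:

```python
import io
import struct

# Build the CRC32C (Castagnoli) lookup table used by TFRecord checksums.
def _make_crc32c_table():
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0x82F63B78 if c & 1 else c >> 1
        table.append(c)
    return table

_TABLE = _make_crc32c_table()

def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc(data):
    # TFRecord stores a rotated-and-offset CRC32C of each field.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF

def write_record(stream, payload):
    # Each record: u64 length, u32 crc(length), payload, u32 crc(payload).
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))

def read_records(stream):
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)                  # length CRC (not verified here)
        yield stream.read(length)
        stream.read(4)                  # payload CRC (not verified here)

# Round trip through an in-memory "file".
buf = io.BytesIO()
for payload in (b"example-0", b"example-1"):
    write_record(buf, payload)
buf.seek(0)
records = list(read_records(buf))
```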
Use TFRecord files for efficient streaming of large datasets.

## Grain format (`dataset_type=grain`)
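Grain reads ArrayRecord files through a deterministic global index. Conceptually, the determinism comes from computing a seeded permutation over all record indices rather than shuffling a streaming buffer, so every host derives the identical order. A pure-Python sketch of that idea (not the actual Grain API):

```python
import random

def global_shuffle_indices(num_records, seed, epoch):
    # One RNG per (seed, epoch): every host computes the identical global
    # order, and each epoch gets a fresh permutation.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_records))
    rng.shuffle(order)
    return order

epoch0 = global_shuffle_indices(10, seed=42, epoch=0)
repeat = global_shuffle_indices(10, seed=42, epoch=0)
```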
Grain provides global shuffle and deterministic data iteration.

## Wan dataset preprocessing
Wan models require special preprocessing to create TFRecord datasets with video latents and text embeddings.

### Wan PusaV1 dataset example
This example uses the PusaV1 dataset.

#### Download the dataset
#### Create training dataset

#### Create evaluation dataset

#### Remove duplicates from training set
Delete the first 420 samples from the training data, since they are duplicated in the evaluation set.

#### Clean up empty files
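The cleanup can be done with `find`. The directory path is a placeholder, and the setup lines exist only to make this snippet self-contained:

```bash
# EVAL_DIR is a placeholder; point it at your eval TFRecord directory.
EVAL_DIR=tfrecords/eval

# Demo setup so the snippet runs standalone: one empty, one valid file.
mkdir -p "$EVAL_DIR"
touch "$EVAL_DIR/empty-000.tfrecord"
printf 'data' > "$EVAL_DIR/eval-000.tfrecord"

# Delete zero-byte files left over from preprocessing.
find "$EVAL_DIR" -type f -size 0 -delete
```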
Remove any empty eval files.

#### Directory structure
Your dataset should now have separate training and evaluation splits.

### General text-to-video preprocessing
For other video datasets, use the general preprocessing script, which:

- Loads videos from HuggingFace datasets
- Encodes videos using the VAE
- Encodes captions using the T5 text encoder
- Saves latents and embeddings to TFRecord format
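The steps above can be sketched as a small pipeline: stand-in encoders produce latents and embeddings, and the resulting records are grouped into shards of `no_records_per_shard` examples, one output file each. The encoders here are hypothetical placeholders, not the real VAE and T5:

```python
def vae_encode(video):
    # Stand-in for the VAE: pretend each frame encodes to one float.
    return [sum(frame) / len(frame) for frame in video]

def t5_encode(caption):
    # Stand-in for the T5 text encoder.
    return [len(word) for word in caption.split()]

def make_shards(examples, no_records_per_shard):
    records = [
        {"latents": vae_encode(ex["video"]),
         "embeds": t5_encode(ex["caption"])}
        for ex in examples
    ]
    # Group records into fixed-size shards; each shard would become
    # one TFRecord file on disk.
    return [
        records[i : i + no_records_per_shard]
        for i in range(0, len(records), no_records_per_shard)
    ]

examples = [{"video": [[0.0, 1.0]], "caption": "a cat"} for _ in range(10)]
shards = make_shards(examples, no_records_per_shard=4)
```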
#### Configuration options
| Parameter | Description |
|---|---|
| `train_data_dir` | Path to the downloaded dataset |
| `tfrecords_dir` | Output directory for TFRecord files |
| `no_records_per_shard` | Number of examples per TFRecord file |
| `enable_eval_timesteps` | Add timestep annotations for evaluation |
| `timesteps_list` | Timesteps for evaluation buckets |
| `num_eval_samples` | Number of evaluation samples (default: 420) |
### Upload to cloud storage
Copy preprocessed data to GCS for distributed training.

### Using preprocessed data
#### For training
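A hypothetical training invocation over the preprocessed shards — the script and config paths are placeholders, while `dataset_type` and `train_data_dir` follow the flags documented above:

```bash
python src/maxdiffusion/train.py src/maxdiffusion/configs/your_config.yml \
  dataset_type=tfrecord \
  train_data_dir=gs://your-bucket/wan-tfrecords/train
```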
#### For evaluation
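For evaluation, an override set might look like the following. `num_eval_samples` comes from the table above; the `eval_data_dir` key and bucket path are assumptions, so check your config for the actual names:

```bash
eval_data_dir=gs://your-bucket/wan-tfrecords/eval
num_eval_samples=420
```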
## Multihost dataloading
In multihost environments, optimal performance requires each data file to be accessed by only one host.

### Best practices
- **Number of files > number of hosts**: each host reads a subset of files
- **File assignment**: files are distributed evenly across hosts
- **Epoch handling**: hosts may finish epochs at different times
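The file-to-host assignment above can be sketched as a simple round-robin partition (a hypothetical helper, not MaxDiffusion's actual loader):

```python
def assign_files(files, num_hosts):
    # Round-robin: host h reads files[h], files[h + num_hosts], ...
    # so every file is read by exactly one host, and counts differ
    # by at most one file across hosts.
    return {h: files[h::num_hosts] for h in range(num_hosts)}

files = [f"shard-{i:05d}.tfrecord" for i in range(10)]
per_host = assign_files(files, num_hosts=4)
```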