Overview
This guide walks you through training your first diffusion model on MNIST. You'll see how to:
- Load and preprocess the MNIST dataset
- Initialize the diffusion process and U-Net model
- Train the model with early stopping
- Generate new digit samples using DDPM and DDIM
Training on MNIST takes approximately 5-10 minutes on a GPU (or 30-60 minutes on CPU). The code automatically detects and uses CUDA if available.
Train MNIST DDPM
Run the training script
The simplest way to get started is to run the MNIST training script directly (a sketched invocation follows the list below). This will:
- Automatically download MNIST to the `data/` directory
- Train a U-Net based DDPM with a cosine beta schedule
- Save intermediate samples to `samples/`
- Save the best model weights to `best_model.pt`
- Generate a training loss curve at `samples/training_curve.png`
- Create final samples at `DDPM.png`
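A hypothetical invocation; the script name below is an assumption, so substitute the actual training script in this repository:
```bash
python train_mnist.py
```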
Monitor training progress
During training, you'll see per-epoch loss values printed to the console. The training loop includes:
- Early stopping with `patience=4` to prevent overfitting (see the sketch after this list)
- Sample generation every 10 epochs to visualize progress
- Loss tracking to monitor convergence
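A minimal sketch of the patience logic; this is a hypothetical helper, and the script's actual bookkeeping may differ:
```python
def should_stop(loss_history, patience=4):
    """Stop once the best (lowest) epoch loss is at least `patience` epochs old."""
    best_epoch = min(range(len(loss_history)), key=loss_history.__getitem__)
    return len(loss_history) - 1 - best_epoch >= patience

# Example: the minimum (0.85) is 4 epochs old, so training stops.
print(should_stop([1.00, 0.90, 0.85, 0.86, 0.87, 0.90, 0.91]))  # True
```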
View generated samples
After training completes, check the `DDPM.png` file in the repository root. You should see a 4x4 grid of generated MNIST digits.
The `samples/` directory contains:
- `noising_epoch*.png` - Forward diffusion visualization
- `samples_epoch*.png` - Generated samples at different epochs
- `training_curve.png` - Loss over time
- `beta_schedule.png` - Cosine beta schedule visualization
Understanding the code
Data loading
The training script uses standard PyTorch data loading with normalization to [-1, 1].
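A minimal sketch, assuming torchvision transforms and a batch size of 128 (the script's exact values may differ):
```python
import torch
from torchvision import datasets, transforms

# ToTensor maps pixels to [0, 1]; Normalize((0.5,), (0.5,)) shifts them to [-1, 1].
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

train_set = datasets.MNIST(root="data/", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```
Model initialization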
The `DiffusionProcess` class handles both the diffusion schedule and the U-Net model; a sketched initialization follows the parameter list below:
- `image_size=28` - MNIST images are 28x28 pixels
- `channels=1` - Grayscale images (CIFAR-10 uses `channels=3`)
- `hidden_dims=[128, 256, 512]` - Channel dimensions for encoder/decoder levels
- `device` - Automatically uses CUDA if available
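A sketch built from the parameters above; the exact constructor signature is an assumption based on this guide:
```python
import torch
from src.models.diffusion import DiffusionProcess

device = "cuda" if torch.cuda.is_available() else "cpu"

diffusion = DiffusionProcess(
    image_size=28,                 # MNIST images are 28x28 pixels
    channels=1,                    # grayscale (CIFAR-10 would use 3)
    hidden_dims=[128, 256, 512],   # encoder/decoder channel widths
    device=device,
)
```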
Training loop
The training loop is simple and explicit.
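A minimal sketch of the loop, continuing the earlier sketches and assuming `train_step` takes a batch of images and returns the scalar loss (signature inferred from this guide):
```python
num_epochs = 50  # assumption; the script's value may differ

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for images, _ in train_loader:   # labels are unused for unconditional generation
        epoch_loss += diffusion.train_step(images.to(device))
    epoch_loss /= len(train_loader)
    print(f"epoch {epoch}: mean loss {epoch_loss:.4f}")
```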
The `train_step` method handles the entire training step: sampling timesteps, adding noise, predicting noise, computing loss, and backpropagation.
What happens in train_step?
Here's the core logic of `train_step` in src/models/diffusion.py.
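The sketch below reconstructs the standard DDPM training step; attribute names such as `self.alpha_bar` and `self.optimizer` are assumptions, so check the source for the exact implementation:
```python
import torch
import torch.nn.functional as F

def train_step(self, x0):
    # 1. Sample a random timestep for each image in the batch.
    t = torch.randint(0, self.noise_steps, (x0.shape[0],), device=x0.device)
    # 2. Add noise via the closed-form forward process:
    #    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    noise = torch.randn_like(x0)
    alpha_bar = self.alpha_bar[t].view(-1, 1, 1, 1)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    # 3. Predict the noise with the U-Net and regress it with MSE.
    loss = F.mse_loss(self.model(x_t, t), noise)
    # 4. Backpropagate and update.
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss.item()
```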
Generate samples
Once you have a trained model (`best_model.pt`), you can generate new samples.
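A hypothetical sketch; `sample` and its signature are assumptions based on this guide's description:
```python
import torch
from torchvision.utils import save_image

diffusion.model.load_state_dict(torch.load("best_model.pt", map_location=device))
samples = diffusion.sample(num_samples=16)         # assumed to return a (16, 1, 28, 28) tensor in [-1, 1]
save_image((samples + 1) / 2, "DDPM.png", nrow=4)  # rescale to [0, 1] and tile a 4x4 grid
```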
Analyze the diffusion process
After training, run the interpolation and timestep analysis script. It will:
- Estimate noise-prediction MSE vs. timestep to see which parts of the diffusion chain are hardest to learn
- Generate latent interpolations between random noise vectors using DDPM and DDIM (a slerp sketch follows this list)
- Save visualizations to `interp.png` and `interp_ddim.png`
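A sketch of spherical interpolation (slerp) between two noise tensors, a common choice for latent interpolation; the repository's script may interpolate differently:
```python
import torch

def slerp(z1, z2, alpha):
    """Spherically interpolate between two noise tensors at fraction alpha."""
    v1, v2 = z1.flatten(), z2.flatten()
    theta = torch.arccos(torch.clamp(torch.dot(v1, v2) / (v1.norm() * v2.norm()), -1.0, 1.0))
    out = (torch.sin((1 - alpha) * theta) * v1 + torch.sin(alpha * theta) * v2) / torch.sin(theta)
    return out.view_as(z1)

z1, z2 = torch.randn(1, 1, 28, 28), torch.randn(1, 1, 28, 28)
latents = [slerp(z1, z2, a) for a in torch.linspace(0, 1, 8)]
# Each latent would then be denoised (e.g., with DDIM) to produce one frame of the interpolation.
```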
Compare DDPM vs DDIM
Benchmark sampling speed and quality with the comparison script. It produces:
- Sample grids at different step counts (10, 50, 100, 1000)
- Timing analysis plots showing speed/quality trade-offs
- A detailed analysis report at `ddim_comparison_mnist/analysis_report.txt`
| Method | Steps | Sampling time | Quality |
| --- | --- | --- | --- |
| DDPM | 1000 | ~5-10 seconds per batch | Highest quality |
| DDIM | 50 | ~0.2-0.5 seconds per batch | Near-identical quality |
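A sketch of how such timings could be measured; `sample`'s `method` and `steps` arguments are assumptions, so adapt this to the comparison script's real API:
```python
import time

for method, steps in [("ddpm", 1000), ("ddim", 50)]:
    start = time.perf_counter()
    diffusion.sample(num_samples=16, method=method, steps=steps)  # hypothetical signature
    print(f"{method}: {steps} steps in {time.perf_counter() - start:.2f}s")
```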
Visualizing the forward process
The `visualize_noising` utility shows how images are progressively corrupted: at t=0 you see the original image; by t=999 it's pure Gaussian noise.
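A minimal sketch of the closed-form forward process behind that visualization, assuming a precomputed `alpha_bar` (cumulative product of 1 - beta under the cosine schedule):
```python
import torch

def q_sample(x0, t, alpha_bar):
    """Corrupt x0 directly to timestep t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    noise = torch.randn_like(x0)
    ab = alpha_bar[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

# Near t=0, alpha_bar is close to 1, so the image is almost untouched;
# near t=999, alpha_bar is close to 0, so x_t is essentially pure Gaussian noise.
```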
Next steps
Train on CIFAR-10
Try the more advanced CIFAR-10 model with EMA weights and multi-resolution attention
Experiment with hyperparameters
Modify `hidden_dims`, `noise_steps`, the learning rate, or the beta schedule in the source files to see how they affect training