Denoising Diffusion Probabilistic Models explained with implementation details and training strategies
DDPM (Denoising Diffusion Probabilistic Models) is the foundational algorithm for training and sampling from diffusion models. Introduced by Ho et al. in 2020, DDPM provides a principled framework for learning generative models through iterative denoising.
Training: Learn a neural network ε_θ that predicts the noise added at each diffusion step
Sampling: Start from pure noise and iteratively denoise for T steps to generate samples
The term “probabilistic” refers to the stochastic nature of the reverse process—at each denoising step, we sample from a Gaussian distribution rather than using a deterministic update.
The core building block is a residual block that incorporates time information:
src/models/diffusion.py
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, time_dim):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.GroupNorm(8, in_ch),
            nn.SiLU(),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
        )
        self.block2 = nn.Sequential(
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # Project the time embedding to this block's channel count
        self.time_emb = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_dim, out_ch),
        )
        # 1x1 conv to match channels on the skip connection
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):
        h = self.block1(x)
        # Inject time embedding as a per-channel bias
        h = h + self.time_emb(t_emb)[:, :, None, None]
        h = self.block2(h)
        return self.shortcut(x) + h
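A quick shape check of the block (the values here are illustrative; note that GroupNorm(8, ...) requires channel counts divisible by 8):

import torch

block = ResBlock(in_ch=64, out_ch=128, time_dim=256)
x = torch.randn(4, 64, 32, 32)   # batch of feature maps
t_emb = torch.randn(4, 256)      # per-sample time embeddings
out = block(x, t_emb)
print(out.shape)  # torch.Size([4, 128, 32, 32])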
Time conditioning is crucial because the denoising strategy must adapt to the noise level: early steps of the reverse process (high noise) call for coarse, broad-stroke denoising, while late steps (low noise) refine fine details.
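The t_emb fed into each ResBlock encodes the scalar timestep t as a vector. A minimal sketch of the sinusoidal embedding commonly used for this (the function name is illustrative and not from the source; in practice a small MLP usually follows it before the ResBlocks):

import math
import torch

def timestep_embedding(t, dim):
    # Sin/cos pairs at geometrically spaced frequencies,
    # as in Transformer positional encodings
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32, device=t.device) / half
    )
    args = t.float()[:, None] * freqs[None, :]                    # (B, dim/2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)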
Timesteps are sampled uniformly during training, which means the model sees all noise levels equally often. This ensures it learns to denoise across the entire diffusion trajectory.
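Putting the pieces together, one training step follows Algorithm 1 of the paper. This sketch assumes alphas_cumprod holds the precomputed cumulative products ᾱ_t of the noise schedule, so the forward process has the closed form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alphas_cumprod, T=1000):
    # Sample timesteps uniformly: every noise level is seen equally often
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    abar = alphas_cumprod[t][:, None, None, None]
    # Jump straight to x_t via the closed-form forward process
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * noise
    # Train the network to predict the injected noise
    loss = F.mse_loss(model(x_t, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()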
At sampling time, the model's noise prediction is converted into a step from x_t to x_{t−1}:

x_{t−1} = (1/√α_t) · (x_t − ((1 − α_t)/√(1 − ᾱ_t)) · ε_θ(x_t, t)) + σ_t z

where z ~ N(0, I) is fresh noise added at each step (except the final step).
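In code, the sampling loop (Algorithm 2 in the paper) looks roughly like the sketch below, choosing σ_t² = β_t for the step noise (one of the two variance choices discussed in the paper); betas, alphas, and alphas_cumprod are assumed to be precomputed 1-D tensors of the noise schedule:

import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_cumprod, T=1000):
    x = torch.randn(shape)  # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)  # predicted noise
        # Mean of p(x_{t-1} | x_t) from the update rule above
        mean = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # add fresh noise z
        else:
            x = mean  # final step: no noise
    return x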
DDPM sampling requires T forward passes through the neural network (typically T=1000), one per denoising step. This makes generation slow compared to single-pass generators like GANs or VAEs. See DDIM for a faster alternative.
Beyond the core algorithm, two training strategies matter in practice. The first is keeping an exponential moving average (EMA) of the model weights:

# Update EMA after each training step
with torch.no_grad():
    for ema_param, param in zip(self.ema_model.parameters(), self.model.parameters()):
        ema_param.data.mul_(self.ema_decay).add_(param.data, alpha=1 - self.ema_decay)
An EMA decay of 0.999 moves the EMA weights only 0.1% of the way toward the current weights at each update, an effective averaging window of roughly 1/(1 − 0.999) = 1000 steps. The EMA model therefore slowly tracks the training model, smoothing out high-frequency updates and improving generalization.
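The EMA copy itself can be initialized as a frozen deep copy of the model (a sketch, assuming the attribute names used in the snippet above):

import copy

self.ema_model = copy.deepcopy(self.model)
self.ema_model.requires_grad_(False)  # never trained directly, only updated via EMA
self.ema_model.eval()

At evaluation time, samples are drawn from self.ema_model rather than the raw training weights.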
The second strategy is mixed-precision training:

self.optimizer.zero_grad()
with self.autocast_ctx():
    noise_pred = self.model(x_t, t)
    loss = F.mse_loss(noise_pred, noise)
# GradScaler is a no-op when scaling is disabled, so no branch is needed
self.grad_scaler.scale(loss).backward()
self.grad_scaler.step(self.optimizer)
self.grad_scaler.update()
Mixed precision uses float16 for most operations while keeping float32 for numerical stability where needed. This can provide 2-3x speedup on modern GPUs.
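For reference, autocast_ctx and grad_scaler from the snippet above might be constructed like this (a sketch using PyTorch's AMP utilities; the use_amp flag is an assumption):

import functools
import torch

use_amp = torch.cuda.is_available()
self.autocast_ctx = functools.partial(
    torch.autocast, device_type="cuda", dtype=torch.float16, enabled=use_amp
)
self.grad_scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

The scaler multiplies the loss before backward so small float16 gradients don't underflow, then unscales them before the optimizer step.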