Understanding the forward and reverse diffusion process that forms the foundation of generative diffusion models
The diffusion process is the core mechanism behind denoising diffusion probabilistic models (DDPMs). It consists of two phases: a forward process that gradually adds noise to data, and a reverse process that learns to denoise and generate new samples.
Here’s how the forward process is implemented in the codebase:
src/models/diffusion.py
```python
def add_noise(self, x, t):
    """
    Add noise to the input images according to the diffusion process.

    Args:
        x: Clean images tensor of shape [batch_size, channels, height, width]
        t: Timesteps tensor of shape [batch_size]

    Returns:
        Tuple of (noisy_images, noise)
    """
    x = x.to(self.device)
    t = t.to(self.device)

    # Look up the per-timestep coefficients and reshape for broadcasting
    # over the [batch_size, channels, height, width] image tensor
    sqrt_alpha_cumprod_t = self.sqrt_alpha_cumprod[t].view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_cumprod_t = self.sqrt_one_minus_alpha_cumprod[t].view(-1, 1, 1, 1)

    # x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε, with ε ~ N(0, I)
    noise = torch.randn_like(x)
    x_t = sqrt_alpha_cumprod_t * x + sqrt_one_minus_alpha_cumprod_t * noise
    return x_t, noise
```
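Note that add_noise returns both the noised image x_t and the sampled noise ε: that noise tensor is exactly the regression target used by the training step shown later in this section.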
The key insight is that we can sample x_t directly from x_0 at any timestep without computing the intermediate steps. Because a composition of Gaussian noising steps is itself Gaussian, the whole chain collapses into the closed-form reparameterization:
x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
Where ε ~ N(0, I) is standard Gaussian noise.
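The excerpt above does not show how the sqrt_alpha_cumprod buffers are built. A minimal sketch, assuming the standard linear β schedule from the original DDPM paper (the repo's actual schedule may differ):

```python
import torch

# Assumed linear beta schedule (the DDPM default); the repository's
# actual schedule is not shown in the excerpt above.
noise_steps = 1000
beta = torch.linspace(1e-4, 0.02, noise_steps)

alpha = 1.0 - beta
alpha_cumprod = torch.cumprod(alpha, dim=0)  # ᾱ_t, the running product of α_1..α_t

# The lookup tables indexed by t inside add_noise
sqrt_alpha_cumprod = alpha_cumprod.sqrt()                    # √ᾱ_t
sqrt_one_minus_alpha_cumprod = (1.0 - alpha_cumprod).sqrt()  # √(1-ᾱ_t)
```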
By the final timestep (T = 1000 in the original DDPM setup), √ᾱ_T is vanishingly small, so the original image becomes nearly indistinguishable from pure Gaussian noise. This property ensures that the reverse process can start from the simple prior N(0, I).
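Continuing the schedule sketch above, the claim is easy to check numerically:

```python
# At t = T the signal coefficient is tiny and the noise coefficient is
# essentially 1, so x_T is (almost) a pure Gaussian sample.
print(sqrt_alpha_cumprod[-1].item())            # ~0.0064
print(sqrt_one_minus_alpha_cumprod[-1].item())  # ~1.0000
```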
The reverse process learns to invert the forward diffusion, gradually denoising samples from pure noise back to realistic data. At each step, we predict:
p_θ(x_{t-1} | x_t) = N(μ_θ(x_t, t), σ_t² I)
Where the model predicts the mean μ_θ(x_t, t) and the variance σ_t² is fixed by the schedule rather than learned (DDPM uses σ_t² = β_t, or the posterior variance β̃_t).
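The sampler itself is not shown in the excerpt. Here is a minimal sketch of one reverse step under the standard DDPM parameterization with fixed variance σ_t² = β_t, reusing the schedule tensors defined above (the helper name denoise_step is hypothetical, not from the repo):

```python
@torch.no_grad()
def denoise_step(model, x_t, t, beta, alpha, alpha_cumprod):
    # Predict the noise component ε_θ(x_t, t)
    eps = model(x_t, t)

    # Per-timestep schedule values, reshaped for broadcasting
    b = beta[t].view(-1, 1, 1, 1)
    a = alpha[t].view(-1, 1, 1, 1)
    ac = alpha_cumprod[t].view(-1, 1, 1, 1)

    # Posterior mean: μ_θ = (x_t - β_t/√(1-ᾱ_t) · ε_θ) / √α_t
    mean = (x_t - b / (1.0 - ac).sqrt() * eps) / a.sqrt()

    # Add fresh noise scaled by σ_t = √β_t, except at the final step t = 0
    mask = (t > 0).float().view(-1, 1, 1, 1)
    return mean + mask * b.sqrt() * torch.randn_like(x_t)
```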
The model is trained to predict the noise ε that was added during the forward process:
src/models/diffusion.py
```python
def train_step(self, x):
    x = x.to(self.device)

    # Sample random timesteps, one per image in the batch
    t = torch.randint(0, self.noise_steps, (x.shape[0],), device=self.device)

    # Add noise to images via the forward process
    x_t, noise = self.add_noise(x, t)

    # Predict the noise using the model (optionally under mixed precision)
    self.optimizer.zero_grad(set_to_none=True)
    with self.autocast_ctx():
        noise_pred = self.model(x_t, t)
        # Calculate MSE loss between predicted and actual noise
        loss = F.mse_loss(noise_pred, noise)

    # Backpropagation, with gradient scaling when AMP is enabled
    if self.grad_scaler.is_enabled():
        self.grad_scaler.scale(loss).backward()
        self.grad_scaler.step(self.optimizer)
        self.grad_scaler.update()
    else:
        loss.backward()
        self.optimizer.step()

    return loss.item()
```
The training objective is simply:
L = E[||ε - ε_θ(x_t, t)||²]
This noise-prediction parameterization is mathematically equivalent to predicting the denoised image x_0, since the two are related by x_0 = (x_t - √(1-ᾱ_t) · ε) / √ᾱ_t, but empirically it produces better sample quality and more stable training.
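Putting the pieces together, a sketch of a full sampling loop that drives the hypothetical denoise_step helper above from pure noise back to images:

```python
@torch.no_grad()
def sample(model, shape, noise_steps, beta, alpha, alpha_cumprod, device):
    # Start from the prior: x_T ~ N(0, I)
    x = torch.randn(shape, device=device)

    # Walk the chain backwards: t = T-1, ..., 0
    for step in reversed(range(noise_steps)):
        t = torch.full((shape[0],), step, device=device, dtype=torch.long)
        x = denoise_step(model, x, t, beta, alpha, alpha_cumprod)
    return x
```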