DDIM (Denoising Diffusion Implicit Models) enables faster sampling by skipping timesteps while maintaining high sample quality. Unlike DDPM, DDIM can produce deterministic samples when $\eta = 0$.
## Why DDIM?
Standard DDPM sampling requires iterating through all T timesteps (e.g., 1000 steps), making generation slow. DDIM addresses this by:

- **Fewer steps**: use only 50-100 steps instead of 1000
- **Deterministic**: the same initial noise produces the same output when $\eta = 0$
- **Quality preservation**: maintains sample quality with proper step selection
DDIM uses a non-Markovian forward process that allows skipping timesteps. The reverse update is:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \cdot \text{pred}_{x_0} + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) + \sigma_t \epsilon$$

where:

- $\text{pred}_{x_0} = \dfrac{x_t - \sqrt{1-\bar{\alpha}_t} \cdot \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$ is the predicted clean image
- $\sigma_t = \eta \cdot \sqrt{\dfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \left(1 - \dfrac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\right)}$ controls stochasticity
- $\epsilon \sim \mathcal{N}(0, I)$ is random noise (only drawn if $\eta > 0$)

When $\eta = 0$, DDIM is fully deterministic. When $\eta = 1$, DDIM recovers the DDPM sampling process.
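The role of $\eta$ can be sanity-checked numerically. The sketch below builds a toy linear-beta schedule (illustrative values only, not necessarily the schedule this repo uses) and evaluates $\sigma_t$ for one timestep jump:

```python
import math

# Toy linear-beta schedule (illustrative, not necessarily the repo's schedule)
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_cumprod = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_cumprod.append(prod)

def sigma(t, t_prev, eta):
    """sigma_t from the formula above, for a jump t -> t_prev."""
    a_t, a_prev = alpha_cumprod[t], alpha_cumprod[t_prev]
    return eta * math.sqrt((1 - a_prev) / (1 - a_t) * (1 - a_t / a_prev))

print(sigma(500, 480, eta=0.0))  # 0.0 -> the noise term vanishes entirely
print(sigma(500, 480, eta=1.0))  # positive: DDPM-like noise injection
```

With $\eta = 0$ the noise term vanishes for every step, which is exactly what makes the sampler deterministic.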
## Implementation
Here’s the complete DDIM implementation from `src/models/diffusion.py:122`:
```python
def sample_ddim(self, num_samples=16, ddim_steps=50, eta=0.0):
    """
    Generate samples using DDIM (Denoising Diffusion Implicit Models).

    DDIM allows faster sampling by skipping timesteps while maintaining quality.
    Based on "Denoising Diffusion Implicit Models" (Song et al., 2020).

    Args:
        num_samples: Number of samples to generate
        ddim_steps: Number of denoising steps (fewer = faster, original uses noise_steps)
        eta: Stochasticity parameter. eta=0 is deterministic DDIM, eta=1 recovers DDPM

    Returns:
        Generated images tensor

    Pre: ddim_steps > 0 and ddim_steps <= noise_steps
    Post: returns tensor of shape (num_samples, channels, image_size, image_size)
    """
    if ddim_steps <= 0 or ddim_steps > self.noise_steps:
        raise ValueError(f"ddim_steps must be in (0, {self.noise_steps}], got {ddim_steps}")

    self.model.eval()
    with torch.no_grad():
        # Create uniform timestep schedule
        step_size = self.noise_steps // ddim_steps
        timesteps = list(range(0, self.noise_steps, step_size))
        if timesteps[-1] != self.noise_steps - 1:
            timesteps.append(self.noise_steps - 1)
        timesteps = sorted(timesteps, reverse=True)

        # Start with random noise
        x_t = torch.randn(num_samples, self.model.channels,
                          self.model.image_size, self.model.image_size,
                          device=self.device)

        for i, t in enumerate(timesteps):
            t_batch = torch.full((num_samples,), t, device=self.device, dtype=torch.long)

            # Predict noise
            predicted_noise = self.model(x_t, t_batch)

            # Get schedule values
            alpha_cumprod_t = self.alpha_cumprod[t]
            sqrt_alpha_cumprod_t = self.sqrt_alpha_cumprod[t]
            sqrt_one_minus_alpha_cumprod_t = self.sqrt_one_minus_alpha_cumprod[t]

            # Predict x_0
            pred_x0 = (x_t - sqrt_one_minus_alpha_cumprod_t * predicted_noise) / sqrt_alpha_cumprod_t
            pred_x0 = torch.clamp(pred_x0, -1.0, 1.0)

            if i < len(timesteps) - 1:
                t_prev = timesteps[i + 1]
                alpha_cumprod_t_prev = self.alpha_cumprod[t_prev]
                sqrt_alpha_cumprod_t_prev = self.sqrt_alpha_cumprod[t_prev]
                sqrt_one_minus_alpha_cumprod_t_prev = self.sqrt_one_minus_alpha_cumprod[t_prev]

                # Compute variance
                sigma_t = eta * torch.sqrt(
                    (1 - alpha_cumprod_t_prev) / (1 - alpha_cumprod_t) *
                    (1 - alpha_cumprod_t / alpha_cumprod_t_prev)
                )

                # Direction pointing to x_t
                dir_xt = torch.sqrt(1 - alpha_cumprod_t_prev - sigma_t ** 2) * predicted_noise

                # Compute x_{t-1}
                x_t = sqrt_alpha_cumprod_t_prev * pred_x0 + dir_xt

                # Add stochastic noise if eta > 0
                if eta > 0:
                    noise = torch.randn_like(x_t)
                    x_t = x_t + sigma_t * noise
            else:
                x_t = pred_x0

    # Clamp only at the end
    result = torch.clamp(x_t, -1.0, 1.0)
    self.model.train()
    return result
```
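The schedule construction at the top of `sample_ddim` can be exercised on its own. For the defaults `noise_steps=1000` and `ddim_steps=50`, the integer-division grid stops at t = 980, so the final timestep t = 999 is appended, giving 51 steps in total:

```python
# Standalone replay of the schedule logic above (no model required)
noise_steps, ddim_steps = 1000, 50

step_size = noise_steps // ddim_steps               # 20
timesteps = list(range(0, noise_steps, step_size))  # 0, 20, ..., 980
if timesteps[-1] != noise_steps - 1:
    timesteps.append(noise_steps - 1)               # ensure t = 999 is included
timesteps = sorted(timesteps, reverse=True)

print(len(timesteps))   # 51 -- one more than requested, due to the appended endpoint
print(timesteps[:4])    # [999, 980, 960, 940]
print(timesteps[-3:])   # [40, 20, 0]
```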
## Usage example
### Set up diffusion model

Load a trained model (same as DDPM):

```python
import torch
from src.models.diffusion import DiffusionProcess

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

diffusion = DiffusionProcess(
    image_size=28,
    channels=1,
    hidden_dims=[128, 256, 512],
    noise_steps=1000,
    device=device,
)
diffusion.model.load_state_dict(torch.load('best_model.pt'))
```
### Generate samples with DDIM

Use fewer steps for faster generation:

```python
# Deterministic sampling with 50 steps (20x faster than DDPM)
samples = diffusion.sample_ddim(
    num_samples=16,
    ddim_steps=50,
    eta=0.0,  # Fully deterministic
)
```
### Experiment with stochasticity

Adjust the `eta` parameter to control randomness:

```python
# More stochastic (closer to DDPM)
samples = diffusion.sample_ddim(
    num_samples=16,
    ddim_steps=50,
    eta=0.5,  # Partially stochastic
)
```
## CIFAR-10 DDIM implementation

The CIFAR-10 variant uses uniform timestep spacing for better coverage. From `src/models/diffusion_cifar.py:375`:
```python
def sample_ddim(self, num_samples=16, ddim_steps=50, eta=0.0):
    """
    Generate samples using DDIM with EMA parameters.

    DDIM chooses a sparse subsequence of timesteps t_0 > … > t_{S-1}
    and follows a deterministic trajectory when eta = 0.
    """
    if ddim_steps <= 0 or ddim_steps > self.noise_steps:
        raise ValueError(f"ddim_steps must be in (0, {self.noise_steps}], got {ddim_steps}")

    model = self.ema_model  # Use EMA weights
    was_training = model.training
    model.eval()

    with torch.no_grad():
        # Uniform grid of timesteps in [0, T-1], highest to lowest
        step_indices = torch.linspace(
            0,
            self.noise_steps - 1,
            steps=ddim_steps,
            dtype=torch.long,
            device=self.device,
        )
        timesteps = list(reversed(step_indices.tolist()))

        x_t = torch.randn(
            num_samples,
            self.model.channels,
            self.model.image_size,
            self.model.image_size,
            device=self.device,
        )

        for i, t in enumerate(timesteps):
            t_batch = torch.full((num_samples,), t, device=self.device, dtype=torch.long)
            eps_pred = model(x_t, t_batch)

            alpha_cumprod_t = self.alpha_cumprod[t]
            sqrt_alpha_cumprod_t = self.sqrt_alpha_cumprod[t]
            sqrt_one_minus_alpha_cumprod_t = self.sqrt_one_minus_alpha_cumprod[t]

            pred_x0 = (x_t - sqrt_one_minus_alpha_cumprod_t * eps_pred) / sqrt_alpha_cumprod_t

            if i < len(timesteps) - 1:
                t_prev = timesteps[i + 1]
                alpha_cumprod_t_prev = self.alpha_cumprod[t_prev]
                sqrt_alpha_cumprod_t_prev = self.sqrt_alpha_cumprod[t_prev]
                sqrt_one_minus_alpha_cumprod_t_prev = self.sqrt_one_minus_alpha_cumprod[t_prev]

                sigma_t = eta * torch.sqrt(
                    (1 - alpha_cumprod_t_prev) / (1 - alpha_cumprod_t)
                    * (1 - alpha_cumprod_t / alpha_cumprod_t_prev)
                )

                # Direction term along the predicted noise
                dir_xt = torch.sqrt(
                    1 - alpha_cumprod_t_prev - sigma_t ** 2
                ) * eps_pred

                x_t = sqrt_alpha_cumprod_t_prev * pred_x0 + dir_xt
                if eta > 0:
                    noise = torch.randn_like(x_t)
                    x_t = x_t + sigma_t * noise
            else:
                x_t = pred_x0

    # Final clamp to the valid image range
    x_t = torch.clamp(x_t, -1.0, 1.0)

    if was_training:
        model.train()
    return x_t
```
The CIFAR-10 implementation uses `torch.linspace` for uniform timestep spacing, while the base implementation uses integer division with `step_size`.
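The difference between the two schedules is easy to see by building both grids side by side; this standalone sketch uses the same defaults as the examples above:

```python
import torch

noise_steps, ddim_steps = 1000, 50

# Base version: integer division plus an appended endpoint
step_size = noise_steps // ddim_steps
base = list(range(0, noise_steps, step_size))
if base[-1] != noise_steps - 1:
    base.append(noise_steps - 1)
base = sorted(base, reverse=True)

# CIFAR-10 version: torch.linspace over [0, T-1]
grid = torch.linspace(0, noise_steps - 1, steps=ddim_steps, dtype=torch.long)
cifar = list(reversed(grid.tolist()))

print(len(base), len(cifar))  # 51 50 -- linspace yields exactly ddim_steps timesteps
print(base[:3])               # [999, 980, 960]
print(cifar[:3])              # starts at 999 with slightly wider spacing
```

Both grids run from t = 999 down to t = 0; `torch.linspace` spreads the steps evenly without the off-by-one endpoint append.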
| Method | Steps | Time | Quality |
| --- | --- | --- | --- |
| DDPM | 1000 | ~10s | Excellent |
| DDIM (50 steps) | 50 | ~0.5s | Very good |
| DDIM (100 steps) | 100 | ~1s | Excellent |
Start with `ddim_steps=50` and `eta=0.0` for a good balance of speed and quality. Increase steps if you need higher quality, or use `eta=0.3` for slightly more diverse samples.
## Key advantages
- **Speed**: 10-20x faster than DDPM with 50-100 steps
- **Deterministic**: reproducible results when $\eta = 0$
- **Flexible**: trade off speed vs. quality by adjusting steps
- **Interpolation**: deterministic trajectories enable meaningful latent space interpolation
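Because $\eta = 0$ makes sampling a deterministic map from initial noise to image, interpolating between two noise tensors traces a smooth path in image space. A common choice is spherical interpolation (slerp), which keeps interpolants near the Gaussian shell. The helper below is a hypothetical sketch, not part of the repo; feeding its output through the sampler would require a `sample_ddim` variant that accepts initial noise:

```python
import torch

def slerp(z1, z2, alpha):
    """Spherical interpolation between two noise tensors (hypothetical helper)."""
    z1f, z2f = z1.flatten(), z2.flatten()
    cos = torch.clamp(torch.dot(z1f, z2f) / (z1f.norm() * z2f.norm()), -1.0, 1.0)
    theta = torch.acos(cos)
    return (torch.sin((1 - alpha) * theta) * z1
            + torch.sin(alpha * theta) * z2) / torch.sin(theta)

z1 = torch.randn(1, 1, 28, 28)
z2 = torch.randn(1, 1, 28, 28)
frames = [slerp(z1, z2, a) for a in torch.linspace(0, 1, 8).tolist()]
# Each frame would then be denoised with eta=0; identical noise always
# yields the identical image, so the image sequence varies smoothly.
```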