
What is Stable Diffusion from Scratch?

This project implements denoising diffusion probabilistic models (DDPM) and DDIM samplers from scratch in PyTorch, training them on the MNIST and CIFAR-10 datasets. If you want to understand how diffusion models work step by step, without relying on Hugging Face’s diffusers library, this project is an ideal starting point. By building these models yourself, you’ll gain deep insight into:
  • How the forward diffusion process gradually adds noise to images
  • How the reverse process learns to denoise and generate new samples
  • The mathematics behind beta schedules and timestep sampling
  • The architecture of U-Nets with time embeddings and self-attention
  • Trade-offs between DDPM and DDIM sampling methods
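The forward process has a convenient closed form: a noised image at any timestep can be sampled directly from the clean image, without simulating every intermediate step. A minimal sketch (function names here are illustrative, not the project's actual API):

```python
import torch

def make_beta_schedule(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule, the default in the original DDPM paper."""
    return torch.linspace(beta_start, beta_end, timesteps)

def q_sample(x0, t, alpha_bar, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * noise

betas = make_beta_schedule()
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention
x0 = torch.randn(8, 1, 28, 28)                 # a batch of MNIST-sized images
t = torch.randint(0, 1000, (8,))               # a random timestep per image
xt = q_sample(x0, t, alpha_bar)
```

Because `alpha_bar` shrinks toward zero as `t` grows, large timesteps yield images that are almost pure noise.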

Key features

MNIST DDPM

U-Net architecture with cosine noise schedule, sinusoidal time embeddings, and self-attention in the bottleneck
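As a rough sketch of what the cosine noise schedule and sinusoidal time embeddings look like (the schedule follows Nichol & Dhariwal's formulation; the project's exact implementation may differ in detail):

```python
import math
import torch

def cosine_beta_schedule(timesteps=1000, s=0.008):
    """Cosine schedule: alpha_bar follows a squared-cosine curve, destroying
    information more gradually than a linear schedule, especially early on."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    f = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()  # clip to avoid singularities at t = T

def sinusoidal_time_embedding(t, dim):
    """Transformer-style sinusoidal embedding of integer timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / (half - 1))
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

betas = cosine_beta_schedule()
```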

CIFAR-10 DDPM

Deeper U-Net with multi-resolution self-attention, dropout regularization, and exponential moving average (EMA) weights
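EMA weights can be maintained with a small helper like the following (an illustrative sketch, not the project's actual class; the averaged copy is what you typically sample from):

```python
import copy
import torch
import torch.nn as nn

class EMA:
    """Keeps an exponential moving average of a model's weights; sampling
    from the averaged copy usually gives smoother, higher-quality results."""
    def __init__(self, model, decay=0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1 - self.decay)

# hypothetical usage with a toy model
net = nn.Linear(4, 4)
ema = EMA(net)
```

After each optimizer step, call `ema.update(net)`; use `ema.shadow` for generation and evaluation.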

DDIM samplers

Fast deterministic sampling with configurable step counts (10-1000), enabling speed/quality trade-off analysis

Training utilities

Reproducible training scripts with early stopping, loss curves, sample visualization, and timestep analysis
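Early stopping on a monitored loss typically looks like this generic sketch (the project's scripts may differ in detail):

```python
class EarlyStopping:
    """Signals a stop when the monitored loss has not improved by at least
    `min_delta` for `patience` consecutive checks."""
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss       # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement this epoch
        return self.bad_epochs >= self.patience  # True => stop training
```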

Architecture overview

The core models use a U-Net architecture with residual blocks, group normalization, and time conditioning:
import torch
import torch.nn as nn

# Sketch of the U-Net wrapper; TimeEmbedding, DownBlock, UpBlock, and
# BottleneckBlock are defined elsewhere in the project's model files.
class DiffusionModel(nn.Module):
    def __init__(self, image_size, channels, hidden_dims=(32, 64, 128), time_dim=128):
        super().__init__()
        self.time_mlp = TimeEmbedding(time_dim)
        self.init_conv = nn.Conv2d(channels, hidden_dims[0], 3, padding=1)

        # Encoder path with downsampling
        self.down_blocks = nn.ModuleList([
            DownBlock(hidden_dims[0], hidden_dims[1], time_dim),
            DownBlock(hidden_dims[1], hidden_dims[2], time_dim),
        ])

        # Bottleneck with self-attention
        self.bottleneck = BottleneckBlock(hidden_dims[2], time_dim)

        # Decoder path with skip connections
        self.up_blocks = nn.ModuleList([
            UpBlock(hidden_dims[2], hidden_dims[2], hidden_dims[1], time_dim),
            UpBlock(hidden_dims[1], hidden_dims[1], hidden_dims[0], time_dim),
        ])
The model learns to predict the noise that was added to an image at a given timestep, not the clean image directly. This is a key insight from the DDPM paper.
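Concretely, the simplified DDPM objective comes down to a few lines; `eps_model` below stands in for the U-Net above (an illustrative sketch, not the project's exact training code):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_bar):
    """Simplified DDPM objective: pick a random timestep per image, noise the
    image in closed form, and regress the model's output onto that noise."""
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    xt = a * x0 + b * noise              # closed-form forward diffusion
    return F.mse_loss(eps_model(xt, t), noise)  # predict the noise, not x0
```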

Design philosophy

This implementation is intentionally kept minimal and explicit:
  • No hidden training frameworks or complex configuration systems
  • Every modeling decision is visible in the Python files
  • Clear separation between models, training, and utilities
  • Easy to modify and adapt to your own datasets

Core models

diffusion.py and diffusion_cifar.py contain the U-Net architectures and diffusion processes

Training scripts

train_diffusion.py and train_diffusion_cifar.py handle the training loops with early stopping

Utilities

DDIM comparisons, interpolation experiments, and timestep analysis scripts

What you’ll learn

By working through this project, you’ll understand:
  1. Forward diffusion process: how to gradually add Gaussian noise to images using a predefined beta schedule
  2. Reverse denoising: how to train a neural network to predict and remove noise step by step
  3. U-Net architecture: how to build an encoder-decoder with skip connections, time embeddings, and self-attention
  4. Sampling methods: the difference between DDPM (1000 steps) and DDIM (as few as 10 steps) for generation
  5. Training best practices: early stopping, loss monitoring, checkpoint saving, and sample visualization

Theoretical foundation

This implementation is based on two foundational papers.

Denoising Diffusion Probabilistic Models (Ho et al., 2020)

Introduced the DDPM framework with a fixed forward process and a learned reverse process. The model is trained to predict the noise added at each timestep using a simple MSE loss. Key contributions:
  • Simplified training objective (predict noise, not images)
  • Fixed beta schedule with a closed-form forward process (the cosine schedule used in this project was introduced later, in Improved DDPM by Nichol & Dhariwal, 2021)
  • Connection to score-based generative models
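The learned reverse process is applied one step at a time, from pure noise back to an image. A sketch of a single ancestral (DDPM) sampling step, assuming a noise-predicting `eps_model` (illustrative names, not the project's API):

```python
import torch

@torch.no_grad()
def ddpm_step(eps_model, xt, t, betas, alpha_bar):
    """One reverse DDPM step: estimate the noise, move to the posterior
    mean, and add fresh Gaussian noise everywhere except at t = 0."""
    beta_t = betas[t]
    alpha_t = 1 - beta_t
    eps = eps_model(xt, torch.full((xt.shape[0],), t, device=xt.device))
    mean = (xt - beta_t / (1 - alpha_bar[t]).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean
    return mean + beta_t.sqrt() * torch.randn_like(xt)
```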
Denoising Diffusion Implicit Models (Song et al., 2020)

Extended DDPM with a deterministic sampling process that allows skipping timesteps without retraining, enabling 10-100x faster generation with minimal quality loss. Key contributions:
  • Non-Markovian forward process
  • Deterministic sampling (η=0) for reproducibility
  • Flexible step counts for speed/quality trade-offs
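A sketch of deterministic DDIM sampling (η = 0) over an evenly spaced timestep subsequence (illustrative, not the project's exact sampler; `num_steps` controls the speed/quality trade-off):

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bar, num_steps=50, device="cpu"):
    """Deterministic DDIM sampling: at each selected timestep, predict the
    clean image from the estimated noise, then re-noise it to the previous
    timestep's noise level. No fresh randomness is injected (eta = 0)."""
    T = len(alpha_bar)
    ts = torch.linspace(T - 1, 0, num_steps).long()  # evenly spaced subsequence
    x = torch.randn(shape, device=device)
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = eps_model(x, torch.full((shape[0],), int(t), device=device))
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x
```

Because the update is deterministic, the same initial noise always maps to the same image, which is what makes DDIM interpolation experiments possible.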

Next steps

Quick start

Train your first MNIST diffusion model in under 10 minutes

Installation

Set up your environment and install dependencies
