Real-ESRGAN employs a carefully designed two-stage training strategy that balances pixel-accurate reconstruction with perceptual quality and photorealism.

Two-Stage Training Overview

The training process is divided into two distinct stages, each with different objectives and loss functions:
Stage 1: Real-ESRNet - Train with L1 loss for pixel-accurate reconstruction
Stage 2: Real-ESRGAN - Fine-tune with L1 + perceptual + GAN losses for enhanced perceptual quality
This staged approach prevents the common GAN training pitfall where the discriminator becomes too strong early in training, making it difficult for the generator to learn the basic reconstruction task.

Stage 1: Training Real-ESRNet

The first stage focuses on learning accurate pixel-level reconstruction without adversarial training.

Objective

Train the generator (RRDBNet or SRVGGNetCompact) to accurately reconstruct high-resolution images from synthetically degraded low-resolution inputs.

Loss Function

L1 Loss Only
L_pixel = ||G(I_LR) - I_HR||_1
Where:
  • G is the generator network
  • I_LR is the low-resolution input image
  • I_HR is the high-resolution ground truth image
  • ||·||_1 is the L1 (mean absolute error) norm
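The Stage 1 objective above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual training loop; `generator` stands in for RRDBNet or SRVGGNetCompact.

```python
import torch
import torch.nn as nn

# Stage 1 objective: plain L1 (mean absolute error) between the
# generator output and the ground-truth HR image.
l1_loss = nn.L1Loss()

def stage1_step(generator, lr_batch, hr_batch):
    sr_batch = generator(lr_batch)        # G(I_LR)
    return l1_loss(sr_batch, hr_batch)    # ||G(I_LR) - I_HR||_1
```

In the real trainer this scalar is backpropagated through the generator each iteration; no discriminator is involved in Stage 1.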

Training Configuration

Based on docs/Training.md:
Initialization
  • Pre-trained ESRGAN model: ESRGAN_SRx4_DF2KOST_official-ff704c30.pth
  • Starting from a model trained on paired bicubic degradation provides a good initialization
Dataset
  • DF2K (DIV2K + Flickr2K) + OST datasets
  • Only high-resolution images are required
  • Low-quality images are generated on-the-fly using the degradation pipeline
Training Details
  • Typically trained for ~1,000,000 iterations
  • Batch size: 12 per GPU (with 4 GPUs = 48 total batch size)
  • Learning rate and scheduling defined in training config
The output of Stage 1 is the Real-ESRNet model, which produces sharp and accurate reconstructions but may lack the fine textures and details that make images look truly realistic.

Why L1 Loss First?

Training with L1 loss first provides several benefits:
  1. Stable Foundation: Establishes basic super-resolution capability without adversarial instability
  2. Accurate Structure: Ensures the network learns correct image structure and content
  3. Better Initialization: Provides a strong starting point for GAN training
  4. Faster Convergence: L1 training converges more reliably than joint GAN training from scratch

Stage 2: Training Real-ESRGAN

The second stage adds perceptual and adversarial losses to enhance realism and fine details.

Objective

Fine-tune the generator to produce perceptually realistic images with natural textures and fine details while maintaining structural accuracy.

Loss Function

Combined Loss
L_total = λ_pixel * L_pixel + λ_percep * L_percep + λ_gan * L_gan
Where the loss combines three components:

1. L1 Loss (Pixel Loss)

L_pixel = ||G(I_LR) - I_HR||_1
  • Maintains pixel-level accuracy
  • Prevents the generator from deviating too far from ground truth
  • Ensures structural consistency

2. Perceptual Loss

L_percep = ||φ(G(I_LR)) - φ(I_HR)||_2
  • Compares high-level features extracted by a pre-trained VGG network
  • φ represents features from specific VGG layers
  • Encourages perceptual similarity rather than exact pixel matching
  • Helps generate natural textures and patterns

3. GAN Loss (Adversarial Loss)

L_gan = -log(D(G(I_LR)))
  • Trains generator to fool the discriminator
  • D is the UNetDiscriminatorSN discriminator
  • Pushes outputs toward the manifold of natural images
  • Creates photorealistic textures and fine details
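Putting the three terms together, the Stage 2 generator objective can be sketched as follows. The weight values are illustrative defaults, `percep_features` stands in for the VGG extractor φ, and the GAN term uses the standard non-saturating form.

```python
import torch
import torch.nn.functional as F

# L_total = lam_pixel * L_pixel + lam_percep * L_percep + lam_gan * L_gan
def generator_loss(sr, hr, disc_logits_fake, percep_features,
                   lam_pixel=1.0, lam_percep=1.0, lam_gan=0.1):
    l_pixel = F.l1_loss(sr, hr)                               # L_pixel
    l_percep = F.mse_loss(percep_features(sr),
                          percep_features(hr))                # L_percep
    # Non-saturating GAN loss: push D's logits on fakes toward "real",
    # i.e. -log(D(G(I_LR))) computed via BCE with an all-ones target.
    l_gan = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return lam_pixel * l_pixel + lam_percep * l_percep + lam_gan * l_gan
```

Only the generator's parameters receive this gradient; the discriminator is updated separately with its own objective.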

Training Configuration

Initialization
  • Uses the trained Real-ESRNet model from Stage 1 as starting point
  • Example: experiments/train_RealESRNetx4plus_1000k_B12G4_fromESRGAN/model/net_g_1000000.pth
  • Discriminator is initialized from scratch (or from a pre-trained discriminator)
Training Details
  • Additional training iterations on top of Stage 1
  • Same dataset and degradation pipeline as Stage 1
  • Loss weights λ are carefully tuned to balance the three loss components
The relative weights of the three losses significantly impact the final result:
Higher λ_pixel
  • More faithful to ground truth pixels
  • Sharper, more accurate structure
  • May appear slightly smoother or less detailed
Higher λ_percep
  • More natural textures
  • Better perceptual quality
  • May introduce slight inaccuracies in structure
Higher λ_gan
  • Most photorealistic appearance
  • Finest details and textures
  • Risk of hallucinating incorrect details
  • Can introduce artifacts if too high
Real-ESRGAN’s default weights are carefully tuned to balance these trade-offs for general images.
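For orientation, a loss-weight fragment in the style of the basicsr training configs looks roughly like this; the exact layer weights and values live in options/train_realesrgan_x4plus.yml and may differ by version:

```yaml
train:
  pixel_opt:
    type: L1Loss
    loss_weight: 1.0        # λ_pixel
    reduction: mean
  perceptual_opt:
    type: PerceptualLoss
    layer_weights:
      'conv5_4': 1          # illustrative; the real config weights several VGG layers
    perceptual_weight: 1.0  # λ_percep
    style_weight: 0
  gan_opt:
    type: GANLoss
    gan_type: vanilla
    loss_weight: !!float 1e-1  # λ_gan
```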

Discriminator Training

While the generator trains with the combined loss, the discriminator (UNetDiscriminatorSN) trains with its own objective:
L_D = -log(D(I_HR)) - log(1 - D(G(I_LR)))
The discriminator learns to:
  • Output high values for real high-resolution images
  • Output low values for generator-produced images
  • Provide gradient feedback to improve generator quality
Spectral Normalization: Applied to all discriminator convolutional layers to:
  • Stabilize adversarial training
  • Prevent discriminator from becoming too strong
  • Enable balanced generator/discriminator learning
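The discriminator objective and the spectral-norm wrapping can be sketched together. The real UNetDiscriminatorSN is a U-Net; the tiny convolutional stack below only shows how `spectral_norm` wraps each conv and how L_D is computed on logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

# Minimal stand-in for a spectrally normalized discriminator.
class TinyDiscriminatorSN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Conv2d(3, 32, 3, padding=1)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(32, 1, 3, padding=1)),  # per-pixel logits
        )

    def forward(self, x):
        return self.net(x)

# L_D = -log(D(I_HR)) - log(1 - D(G(I_LR))), via BCE on logits.
def discriminator_loss(disc, real, fake):
    real_logits = disc(real)
    fake_logits = disc(fake.detach())  # do not backprop into the generator
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake
```

Note the `detach()`: when the discriminator updates, gradients must not flow back into the generator, and vice versa.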

Degradation Pipeline During Training

Both training stages use the same sophisticated degradation pipeline to generate training pairs on-the-fly:

High-Order Degradation Process

  1. First Degradation Round
    • Blur (isotropic/anisotropic Gaussian, generalized Gaussian)
    • Downsample (bilinear, bicubic, area)
    • Noise (Gaussian, Poisson)
    • JPEG compression
  2. Second Degradation Round
    • Repeat with different random parameters
    • Models multiple processing rounds
  3. Sinc Filter
    • Apply sinc filter to simulate ringing artifacts
    • Models common image processing artifacts
All degradation parameters are randomly sampled during training, ensuring the model learns to handle a wide variety of real-world degradations without overfitting to specific patterns.

Training Commands

Based on the source documentation in docs/Training.md:

Stage 1: Train Real-ESRNet

Multi-GPU Training (4 GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
realesrgan/train.py -opt options/train_realesrnet_x4plus.yml \
--launcher pytorch --auto_resume
Single GPU Training
python realesrgan/train.py -opt options/train_realesrnet_x4plus.yml --auto_resume

Stage 2: Train Real-ESRGAN

Multi-GPU Training (4 GPUs)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
realesrgan/train.py -opt options/train_realesrgan_x4plus.yml \
--launcher pytorch --auto_resume
Single GPU Training
python realesrgan/train.py -opt options/train_realesrgan_x4plus.yml --auto_resume

Fine-tuning on Custom Datasets

Real-ESRGAN supports two approaches for fine-tuning on custom data:

Option 1: Generate Degradations On-the-Fly

Use your own high-resolution images with the same degradation pipeline:
  1. Prepare your HR image dataset
  2. Download pre-trained models:
    • RealESRGAN_x4plus.pth (generator)
    • RealESRGAN_x4plus_netD.pth (discriminator)
  3. Modify options/finetune_realesrgan_x4plus.yml to point to your dataset
  4. Run training:
python realesrgan/train.py -opt options/finetune_realesrgan_x4plus.yml --auto_resume
This approach works well when your domain has similar degradation patterns to natural images, or when you want the model to learn from your specific degradation synthesis.

Option 2: Use Paired Training Data

If you have real low-resolution and high-resolution pairs:
  1. Prepare folders with paired LR and HR images
  2. Generate paired meta-info file using scripts/generate_meta_info_pairdata.py
  3. Modify options/finetune_realesrgan_x4plus_pairdata.yml
  4. Run training:
python realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --auto_resume
Paired data training is similar to fine-tuning ESRGAN and works well when you have real degraded images you want to restore, or when degradations are domain-specific and hard to synthesize.

Dataset Preparation

For training or fine-tuning, the following datasets are used:

Standard Training Datasets

  • DIV2K: 800 high-quality 2K resolution images
  • Flickr2K: 2650 high-quality 2K resolution images
  • OST: Outdoor Scene Training dataset

Data Preprocessing Steps

  1. [Optional] Multi-scale Generation
    python scripts/generate_multiscale_DF2K.py \
      --input datasets/DF2K/DF2K_HR \
      --output datasets/DF2K/DF2K_multiscale
    
  2. [Optional] Crop to Sub-images
    python scripts/extract_subimages.py \
      --input datasets/DF2K/DF2K_multiscale \
      --output datasets/DF2K/DF2K_multiscale_sub \
      --crop_size 400 --step 200
    
  3. Generate Meta-info File
    python scripts/generate_meta_info.py \
      --input datasets/DF2K/DF2K_HR datasets/DF2K/DF2K_multiscale \
      --root datasets/DF2K datasets/DF2K \
      --meta_info datasets/DF2K/meta_info/meta_info_DF2Kmultiscale.txt
    
Multi-scale Images
  • Provides training samples at different resolutions
  • Helps model learn scale-invariant features
  • Improves generalization to different input sizes
Sub-image Cropping
  • Faster I/O during training (smaller files)
  • More efficient data loading
  • Increased effective dataset size through more crops
  • Optional if disk space or I/O is limited
Meta-info File
  • Lists all training image paths
  • Enables efficient dataset indexing
  • Supports combining multiple datasets
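To make the meta-info format concrete, the sketch below writes one relative image path per line, which the dataloader resolves against the dataset root. The official scripts/generate_meta_info.py handles multiple input folders and roots; this minimal, assumed version covers a single folder.

```python
import os

# Hedged sketch of meta-info generation: list image files under image_dir
# and record their paths relative to the dataset root, one per line.
def write_meta_info(image_dir, root, meta_path):
    names = sorted(
        f for f in os.listdir(image_dir)
        if f.lower().endswith((".png", ".jpg", ".jpeg"))
    )
    with open(meta_path, "w") as fh:
        for name in names:
            rel = os.path.relpath(os.path.join(image_dir, name), root)
            fh.write(rel + "\n")
```

Concatenating the lines produced from several folders is what lets one meta-info file combine multiple datasets.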

Training Tips

Debug Mode

Before full training, test your configuration in debug mode:
python realesrgan/train.py -opt options/train_realesrnet_x4plus.yml --debug
This performs a quick check that:
  • Data loading works correctly
  • Model architecture is valid
  • Loss computations succeed
  • GPU memory is sufficient

Validation During Training

Uncomment validation sections in the config file to monitor progress:
val:
  val_freq: !!float 5e3
  save_img: True
  metrics:
    psnr:
      type: calculate_psnr
      crop_border: 4
      test_y_channel: false

Auto-Resume

Use --auto_resume flag to automatically continue from the latest checkpoint if training is interrupted.

Why This Strategy Works

The two-stage training approach with combined losses achieves superior results because:
  1. Stable Learning: L1 pre-training provides a stable foundation before adversarial training
  2. Balanced Quality: Combined losses balance pixel accuracy, perceptual quality, and photorealism
  3. Robust Discriminator: U-Net discriminator with spectral normalization enables stable GAN training
  4. Rich Training Data: On-the-fly degradation synthesis provides unlimited diverse training samples
  5. Transfer Learning: Starting from ESRGAN pre-trained on clean data accelerates convergence
The result is a model that produces sharp, detailed, and photorealistic super-resolution results on real-world degraded images.
