Two-Stage Training Overview
The training process is divided into two distinct stages, each with different objectives and loss functions:

- Stage 1: Real-ESRNet - Train with L1 loss for pixel-accurate reconstruction
- Stage 2: Real-ESRGAN - Fine-tune with L1 + perceptual + GAN losses for enhanced perceptual quality
Stage 1: Training Real-ESRNet
The first stage focuses on learning accurate pixel-level reconstruction without adversarial training.

Objective
Train the generator (RRDBNet or SRVGGNetCompact) to accurately reconstruct high-resolution images from synthetically degraded low-resolution inputs.

Loss Function
L1 Loss Only

L_1 = || G(I_LR) - I_HR ||_1

where:
- G is the generator network
- I_LR is the low-resolution input image
- I_HR is the high-resolution ground truth image
- ||·||_1 is the L1 (mean absolute error) norm
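As a minimal PyTorch sketch of this objective (the generator here is a toy stand-in for RRDBNet, not the repo's training code):

```python
import torch
import torch.nn.functional as F

# Stand-in generator: any module mapping LR -> HR (the real repo uses RRDBNet).
gen = torch.nn.Sequential(
    torch.nn.Upsample(scale_factor=4),
    torch.nn.Conv2d(3, 3, kernel_size=3, padding=1),
)

i_lr = torch.randn(1, 3, 16, 16)   # low-resolution input I_LR
i_hr = torch.randn(1, 3, 64, 64)   # high-resolution ground truth I_HR

# Stage 1 objective: L_1 = || G(I_LR) - I_HR ||_1 (mean absolute error)
l1_loss = F.l1_loss(gen(i_lr), i_hr)
```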
Training Configuration
Based on docs/Training.md:
Initialization
- Pre-trained ESRGAN model: ESRGAN_SRx4_DF2KOST_official-ff704c30.pth
- Starting from a model trained on paired bicubic degradation provides a good initialization
- DF2K (DIV2K + Flickr2K) + OST datasets
- Only high-resolution images are required
- Low-quality images are generated on-the-fly using the degradation pipeline
- Typically trained for ~1,000,000 iterations
- Batch size: 12 per GPU (with 4 GPUs = 48 total batch size)
- Learning rate and scheduling defined in training config
The output of Stage 1 is the Real-ESRNet model, which produces sharp and accurate reconstructions but may lack the fine textures and details that make images look truly realistic.
Why L1 Loss First?
Training with L1 loss first provides several benefits:
- Stable Foundation: Establishes basic super-resolution capability without adversarial instability
- Accurate Structure: Ensures the network learns correct image structure and content
- Better Initialization: Provides a strong starting point for GAN training
- Faster Convergence: L1 training converges more reliably than joint GAN training from scratch
Stage 2: Training Real-ESRGAN
The second stage adds perceptual and adversarial losses to enhance realism and fine details.

Objective
Fine-tune the generator to produce perceptually realistic images with natural textures and fine details while maintaining structural accuracy.

Loss Function
Combined Loss

L_total = λ_pixel · L_1 + λ_percep · L_percep + λ_GAN · L_GAN

1. L1 Loss (Pixel Loss)
- Maintains pixel-level accuracy
- Prevents the generator from deviating too far from ground truth
- Ensures structural consistency
2. Perceptual Loss
- Compares high-level features extracted by a pre-trained VGG network
L_percep = || φ(G(I_LR)) - φ(I_HR) ||_1, where φ represents features from specific VGG layers
- Encourages perceptual similarity rather than exact pixel matching
- Helps generate natural textures and patterns
3. GAN Loss (Adversarial Loss)
- Trains generator to fool the discriminator
L_GAN = -log D(G(I_LR)), where D is the UNetDiscriminatorSN discriminator
- Pushes outputs toward the manifold of natural images
- Creates photorealistic textures and fine details
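The three components above combine into a single weighted generator objective. A sketch follows; the weights and the stand-in feature extractor and discriminator are illustrative placeholders, not the repo's configured values:

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, vgg_feats, disc,
                   w_pix=1.0, w_percep=1.0, w_gan=0.1):
    """Sketch of the combined Stage 2 generator loss (weights illustrative)."""
    l_pix = F.l1_loss(sr, hr)                           # pixel fidelity
    l_percep = F.l1_loss(vgg_feats(sr), vgg_feats(hr))  # VGG feature distance
    # Non-saturating adversarial term: push D(sr) toward the "real" label.
    out = disc(sr)
    l_gan = F.binary_cross_entropy_with_logits(out, torch.ones_like(out))
    return w_pix * l_pix + w_percep * l_percep + w_gan * l_gan

# Toy stand-ins to exercise the function:
sr = torch.rand(1, 3, 8, 8)
hr = torch.rand(1, 3, 8, 8)
feats = lambda x: x.mean(dim=1, keepdim=True)  # placeholder "VGG features"
disc = lambda x: x.mean().reshape(1)           # placeholder discriminator
total = generator_loss(sr, hr, feats, disc)
```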
Training Configuration
Initialization
- Uses the trained Real-ESRNet model from Stage 1 as starting point
- Example: experiments/train_RealESRNetx4plus_1000k_B12G4_fromESRGAN/model/net_g_1000000.pth
- Discriminator is initialized from scratch (or from a pre-trained discriminator)
- Additional training iterations on top of Stage 1
- Same dataset and degradation pipeline as Stage 1
- Loss weights λ are carefully tuned to balance the three loss components
Loss Weight Balancing
The relative weights of the three losses significantly impact the final result:

Higher λ_pixel
- More faithful to ground truth pixels
- Sharper, more accurate structure
- May appear slightly smoother or less detailed

Higher λ_percep
- More natural textures
- Better perceptual quality
- May introduce slight inaccuracies in structure

Higher λ_GAN
- Most photorealistic appearance
- Finest details and textures
- Risk of hallucinating incorrect details
- Can introduce artifacts if too high
Discriminator Training
While the generator trains with the combined loss, the discriminator (UNetDiscriminatorSN) trains with its own objective:
- Output high values for real high-resolution images
- Output low values for generator-produced images
- Provide gradient feedback to improve generator quality

Its spectral normalization helps:
- Stabilize adversarial training
- Prevent discriminator from becoming too strong
- Enable balanced generator/discriminator learning
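A minimal sketch of this discriminator objective, using a toy convolutional discriminator as a stand-in for UNetDiscriminatorSN:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_hr, fake_sr):
    """Real images should score high, generated images low."""
    out_real = disc(real_hr)
    # detach() so this step updates only the discriminator, not the generator
    out_fake = disc(fake_sr.detach())
    loss_real = F.binary_cross_entropy_with_logits(
        out_real, torch.ones_like(out_real))
    loss_fake = F.binary_cross_entropy_with_logits(
        out_fake, torch.zeros_like(out_fake))
    return loss_real + loss_fake

disc = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)  # toy patch discriminator
real = torch.rand(1, 3, 8, 8)
fake = torch.rand(1, 3, 8, 8, requires_grad=True)
d_loss = discriminator_loss(disc, real, fake)
```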
Degradation Pipeline During Training
Both training stages use the same sophisticated degradation pipeline to generate training pairs on-the-fly:

High-Order Degradation Process
1. First Degradation Round
   - Blur (isotropic/anisotropic Gaussian, generalized Gaussian)
   - Downsample (bilinear, bicubic, area)
   - Noise (Gaussian, Poisson)
   - JPEG compression
2. Second Degradation Round
   - Repeat with different random parameters
   - Models multiple processing rounds
3. Sinc Filter
   - Apply sinc filter to simulate ringing artifacts
   - Models common image processing artifacts
All degradation parameters are randomly sampled during training, ensuring the model learns to handle a wide variety of real-world degradations without overfitting to specific patterns.
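The structure above, randomly parameterized operators applied in a fixed order over two rounds, can be sketched with toy stand-in operators. The real pipeline uses proper blur kernels, resampling filters, Poisson noise, JPEG compression, and sinc filtering; this is only an illustration of the sampling structure:

```python
import numpy as np

def degrade_round(img, rng):
    """One simplified round: blur -> downsample -> noise (JPEG omitted here)."""
    # Blur: neighbor averaging stands in for the sampled Gaussian kernels
    img = (img + np.roll(img, 1, axis=0) + np.roll(img, 1, axis=1)) / 3.0
    # Downsample by a randomly chosen factor (stands in for bilinear/bicubic/area)
    f = int(rng.choice([1, 2]))
    img = img[::f, ::f]
    # Additive Gaussian noise with a randomly sampled strength
    img = img + rng.normal(0.0, rng.uniform(0.0, 0.05), size=img.shape)
    return np.clip(img, 0.0, 1.0)

def synthesize_lr(hr, seed=0):
    """High-order degradation: two rounds with independently sampled parameters."""
    rng = np.random.default_rng(seed)
    return degrade_round(degrade_round(hr, rng), rng)

hr = np.random.default_rng(1).random((32, 32))
lr = synthesize_lr(hr)
```

Because every parameter is re-sampled per image, each training pair sees a different degradation, which is what prevents overfitting to any single degradation pattern.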
Training Commands
Based on the source documentation in docs/Training.md:
Stage 1: Train Real-ESRNet
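A sketch of the distributed launch for this stage, per docs/Training.md; the option file name train_realesrnet_x4plus.yml is the repo's default and should be verified against your checkout:

```shell
# 4-GPU distributed launch for Real-ESRNet (Stage 1)
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
    realesrgan/train.py -opt options/train_realesrnet_x4plus.yml \
    --launcher pytorch --auto_resume
```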
Multi-GPU Training (4 GPUs)

Stage 2: Train Real-ESRGAN
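Stage 2 is launched the same way, pointing at the GAN training config; the option file name train_realesrgan_x4plus.yml is assumed from the repo's defaults:

```shell
# 4-GPU distributed launch for Real-ESRGAN (Stage 2)
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
    realesrgan/train.py -opt options/train_realesrgan_x4plus.yml \
    --launcher pytorch --auto_resume
```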
Multi-GPU Training (4 GPUs)

Fine-tuning on Custom Datasets
Real-ESRGAN supports two approaches for fine-tuning on custom data:

Option 1: Generate Degradations On-the-Fly
Use your own high-resolution images with the same degradation pipeline:
- Prepare your HR image dataset
- Download pre-trained models: RealESRGAN_x4plus.pth (generator) and RealESRGAN_x4plus_netD.pth (discriminator)
- Modify options/finetune_realesrgan_x4plus.yml to point to your dataset
- Run training:
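The launch mirrors the main training run but points at the fine-tuning config; flags are sketched here and should be checked against docs/Training.md:

```shell
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
    realesrgan/train.py -opt options/finetune_realesrgan_x4plus.yml \
    --launcher pytorch --auto_resume
```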
This approach works well when your domain has similar degradation patterns to natural images, or when you want the model to learn from your specific degradation synthesis.
Option 2: Use Paired Training Data
If you have real low-resolution and high-resolution pairs:
- Prepare folders with paired LR and HR images
- Generate a paired meta-info file using scripts/generate_meta_info_pairdata.py
- Modify options/finetune_realesrgan_x4plus_pairdata.yml
- Run training:
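As with the on-the-fly option, training is launched against the paired-data config; a hedged sketch:

```shell
python -m torch.distributed.launch --nproc_per_node=4 --master_port=4321 \
    realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml \
    --launcher pytorch --auto_resume
```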
Paired data training is similar to fine-tuning ESRGAN and works well when you have real degraded images you want to restore, or when degradations are domain-specific and hard to synthesize.
Dataset Preparation
For training or fine-tuning, the following datasets are used:

Standard Training Datasets
- DIV2K: 800 high-quality 2K resolution images
- Flickr2K: 2650 high-quality 2K resolution images
- OST: Outdoor Scene Training dataset
Data Preprocessing Steps
1. [Optional] Multi-scale Generation
2. [Optional] Crop to Sub-images
3. Generate Meta-info File
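These steps correspond to helper scripts in the repository; the script names and flags below are a hedged sketch and should be checked against the scripts/ directory of your checkout:

```shell
# [Optional] generate multi-scale copies of the HR images
python scripts/generate_multiscale_DF2K.py \
    --input datasets/DF2K/DF2K_HR --output datasets/DF2K/DF2K_multiscale

# [Optional] crop into sub-images for faster I/O
python scripts/extract_subimages.py \
    --input datasets/DF2K/DF2K_multiscale \
    --output datasets/DF2K/DF2K_multiscale_sub --crop_size 400 --step 200

# generate the meta-info file listing all training image paths
python scripts/generate_meta_info.py \
    --input datasets/DF2K/DF2K_HR datasets/DF2K/DF2K_multiscale \
    --root datasets/DF2K datasets/DF2K \
    --meta_info datasets/DF2K/meta_info/meta_info_DF2Kmultiscale.txt
```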
Why These Preprocessing Steps?
Multi-scale Images
- Provides training samples at different resolutions
- Helps model learn scale-invariant features
- Improves generalization to different input sizes

Sub-image Crops
- Faster I/O during training (smaller files)
- More efficient data loading
- Increased effective dataset size through more crops
- Optional if disk space or I/O is limited

Meta-info File
- Lists all training image paths
- Enables efficient dataset indexing
- Supports combining multiple datasets
Training Tips
Debug Mode
Before full training, test your configuration in debug mode to verify that:
- Data loading works correctly
- Model architecture is valid
- Loss computations succeed
- GPU memory is sufficient
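A typical smoke test is a single-process launch with a debug flag; the --debug flag follows the BasicSR training-script convention and may differ in your version:

```shell
# Runs only a few iterations to validate config, data, and memory
python realesrgan/train.py -opt options/train_realesrnet_x4plus.yml --debug
```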
Validation During Training
Uncomment validation sections in the config file to monitor progress.

Auto-Resume
Use the --auto_resume flag to automatically continue from the latest checkpoint if training is interrupted.
Why This Strategy Works
The two-stage training approach with combined losses achieves superior results because:
- Stable Learning: L1 pre-training provides a stable foundation before adversarial training
- Balanced Quality: Combined losses balance pixel accuracy, perceptual quality, and photorealism
- Robust Discriminator: U-Net discriminator with spectral normalization enables stable GAN training
- Rich Training Data: On-the-fly degradation synthesis provides unlimited diverse training samples
- Transfer Learning: Starting from ESRGAN pre-trained on clean data accelerates convergence