Real-ESRGAN extends the powerful ESRGAN architecture to develop practical algorithms for general image and video restoration. Unlike traditional super-resolution methods that assume clean, well-defined degradations, Real-ESRGAN is trained on purely synthetic data so that it can handle real-world blind super-resolution.

Blind Super-Resolution Challenge

Real-world image degradation is complex and unpredictable. Images can suffer from:
  • Unknown blur kernels
  • Complex noise patterns
  • JPEG compression artifacts
  • Multiple combined degradations
  • Various downsampling operations
Traditional super-resolution models trained on simple bicubic downsampling fail catastrophically when faced with these real-world degradations. Real-ESRGAN addresses this by learning from a sophisticated degradation synthesis process.

Synthetic Data Degradation Pipeline

Real-ESRGAN’s key innovation is training exclusively on synthetic data that closely mimics real-world degradations. The training pipeline generates low-quality images through:

First-order Degradation

  1. Blur: Apply various blur kernels (isotropic/anisotropic Gaussian, generalized Gaussian)
  2. Downsampling: Use different algorithms (bilinear, bicubic, area)
  3. Noise: Add Gaussian noise with varying levels
  4. JPEG Compression: Apply compression with random quality factors
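The four first-order steps above can be sketched with NumPy on a toy grayscale image. This is a minimal illustration rather than the actual training code: the real pipeline samples many more kernel types and parameter ranges, and the JPEG step is omitted here because it requires an image codec (OpenCV or Pillow).

```python
import numpy as np

def gaussian_kernel(size=7, sigma=1.5):
    """Isotropic Gaussian blur kernel (one of several kernel types used)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def blur(img, kernel):
    """Naive 2D convolution with edge padding (grayscale image)."""
    pad = kernel.shape[0] // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + kernel.shape[0],
                                j:j + kernel.shape[1]] * kernel).sum()
    return out

def area_downsample(img, scale=4):
    """Area downsampling: average over scale x scale blocks."""
    h, w = img.shape
    return img[:h - h % scale, :w - w % scale].reshape(
        h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def add_gaussian_noise(img, sigma, rng):
    return np.clip(img + rng.normal(0, sigma, img.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
hr = rng.random((64, 64))  # stand-in high-resolution image in [0, 1]
lr = blur(hr, gaussian_kernel(7, sigma=rng.uniform(0.2, 3.0)))
lr = area_downsample(lr, scale=4)
lr = add_gaussian_noise(lr, sigma=rng.uniform(0.0, 0.1), rng=rng)
# JPEG compression with a random quality factor would follow here
# (omitted: it needs an image codec such as OpenCV or Pillow).
print(hr.shape, lr.shape)
```

In training, every parameter above (kernel size, sigma, resize mode, noise level) is freshly sampled per image, which is what produces the degradation variety.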

Second-order Degradation

The process repeats with different parameters to simulate multiple rounds of degradation, better representing real-world image processing chains.

Sinc Filters

Sinc filters are applied to synthesize the ringing and overshoot artifacts commonly introduced by image processing operations such as sharpening and JPEG compression.
By generating degraded images on-the-fly during training, Real-ESRGAN learns to handle a wide spectrum of degradation types without requiring paired real-world training data.
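To see where the ringing comes from, a sinc low-pass kernel can be constructed directly. The paper uses a circular (Bessel-function-based) sinc kernel; the separable NumPy version below is a simplification that avoids the Bessel dependency but still exhibits the negative side lobes responsible for ringing around edges.

```python
import numpy as np

def separable_sinc_kernel(size=21, cutoff=0.3):
    """Simplified separable 2D sinc low-pass kernel.

    Real-ESRGAN uses a circular (ideal) low-pass sinc filter; the outer
    product of unwindowed 1D sincs used here is an approximation with the
    same characteristic oscillating side lobes.
    """
    ax = np.arange(size) - size // 2
    one_d = np.sinc(cutoff * ax) * cutoff
    kernel = np.outer(one_d, one_d)
    return kernel / kernel.sum()

k = separable_sinc_kernel(size=21, cutoff=0.3)
# The negative side lobes are what create ringing/overshoot around edges.
print(k.shape, k.min() < 0)
```

Convolving an image with such a kernel (using the `blur` helper from the earlier sketch, for instance) reproduces the halo-like ringing seen around sharp edges in over-processed photos.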

High-Order Degradation Modeling

The training uses a high-order degradation model that chains multiple degradation operations:
HR Image → [Blur → Resize → Noise → JPEG] → [Blur → Resize → Noise → JPEG] → Sinc Filter → LR Image
Each operation uses randomly sampled parameters, creating enormous variety in training data. This approach enables the model to generalize to diverse real-world scenarios.
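One way to picture the random sampling is as a degradation configuration drawn per training pair. The parameter names and ranges below are illustrative only, loosely modeled on the settings reported in the Real-ESRGAN paper rather than copied from any released config:

```python
import random

def sample_round(rng):
    """Draw one round of degradation parameters (illustrative ranges)."""
    return {
        "blur": {
            "kernel": rng.choice(["iso_gaussian", "aniso_gaussian",
                                  "generalized_gaussian", "plateau"]),
            "sigma": round(rng.uniform(0.2, 3.0), 2),
        },
        "resize": {
            "mode": rng.choice(["bilinear", "bicubic", "area"]),
            "scale": round(rng.uniform(0.5, 1.5), 2),
        },
        "noise": {
            "type": rng.choice(["gaussian", "poisson"]),
            "level": round(rng.uniform(1, 30), 1),
        },
        "jpeg_quality": rng.randint(30, 95),
    }

rng = random.Random(0)
pipeline = {
    "round_1": sample_round(rng),
    "round_2": sample_round(rng),  # second-order: fresh parameters
    "final_sinc_cutoff": round(rng.uniform(0.3, 3.14), 2),
}
for name, cfg in pipeline.items():
    print(name, cfg)
```

Because every field is resampled for every training pair, no two synthetic LR images are degraded the same way, which is the source of the "enormous variety" described above.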

Pure Synthetic Training

Real-ESRGAN achieves practical blind super-resolution without any real-world paired training data. All low-quality images are synthetically generated from high-quality images during training.
This approach offers several advantages:
  • Scalability: Easy to generate unlimited training data
  • Flexibility: Can adjust degradation parameters for specific domains
  • No alignment issues: No need for perfectly aligned HR/LR pairs
  • Domain adaptation: Can retrain for specific image types (anime, faces, etc.)

Network Architecture Strategy

Real-ESRGAN uses two primary generator architectures:

RRDBNet (Large Models)

  • Based on ESRGAN’s Residual-in-Residual Dense Block architecture
  • Used for general purpose models (RealESRGAN_x4plus, RealESRNet_x4plus)
  • Offers high quality at the cost of model size and inference speed
  • Default configuration: 23 RRDB blocks with 64 base features

SRVGGNetCompact (Lightweight Models)

  • Compact VGG-style architecture for fast inference
  • Used for anime videos and general-purpose lightweight models
  • Significantly smaller and faster than RRDBNet
  • Performs upsampling only in the final layer
Choose RRDBNet when:
  • Quality is the top priority
  • Computational resources are available
  • Processing photos or complex natural images
Choose SRVGGNetCompact when:
  • Speed is critical (real-time or video processing)
  • Running on limited hardware
  • Processing anime or cartoon content
  • Model size needs to be minimal
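The "upsampling only in the final layer" point is what makes SRVGGNetCompact fast: every convolution runs at low resolution, and a single pixel-shuffle (depth-to-space) step at the end rearranges channels into spatial resolution. A NumPy sketch of that rearrangement, matching the semantics of PyTorch's `nn.PixelShuffle`:

```python
import numpy as np

def pixel_shuffle(x, scale):
    """Depth-to-space: (C*scale^2, H, W) -> (C, H*scale, W*scale).

    This is the single upsampling step SRVGGNetCompact performs at the
    very end of the network, so all preceding convolutions stay cheap.
    """
    c2, h, w = x.shape
    c = c2 // (scale * scale)
    x = x.reshape(c, scale, scale, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (C, H, scale, W, scale)
    return x.reshape(c, h * scale, w * scale)

# 48 low-resolution feature channels become 3 RGB channels at 4x size.
feat = np.arange(3 * 16 * 8 * 8, dtype=np.float32).reshape(3 * 16, 8, 8)
out = pixel_shuffle(feat, scale=4)
print(feat.shape, out.shape)
```

By contrast, RRDBNet interleaves upsampling with further convolutions at higher resolutions, which is part of why it is slower but higher quality.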

Discriminator Architecture

Real-ESRGAN employs a U-Net discriminator with spectral normalization that:
  • Provides multi-scale discrimination through its U-Net structure
  • Uses skip connections to preserve fine details
  • Applies spectral normalization for training stability
  • Outputs a feature map rather than a single real/fake prediction
This discriminator design enables the model to:
  • Distinguish between real and generated images at multiple scales
  • Provide more informative gradients for generator training
  • Maintain stable adversarial training

Inference Features

The trained Real-ESRGAN models support practical features for deployment:
  • Tile processing: Handle arbitrarily large images by processing in tiles
  • Alpha channel support: Preserve transparency in RGBA images
  • Grayscale images: Process both color and grayscale inputs
  • 16-bit images: Support high bit-depth images
  • Arbitrary output scales: Use --outscale to generate any desired output size
  • Face enhancement: Optional integration with GFPGAN for face restoration
The inference implementation automatically handles images with different characteristics, making it practical for diverse real-world applications.
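Tile processing can be sketched as follows. The `upscale_fn` here is a nearest-neighbour stand-in for a model forward pass, and the pad-then-crop scheme is a simplified version of the kind of overlap handling the real inference code performs to hide seams at tile borders:

```python
import numpy as np

def upscale_tiled(img, upscale_fn, scale=4, tile=64, pad=8):
    """Upscale a large grayscale image tile by tile.

    Each tile is extracted with `pad` pixels of surrounding context,
    upscaled, and only its centre region is written to the output,
    so tile borders do not produce visible seams.
    """
    h, w = img.shape
    out = np.zeros((h * scale, w * scale), dtype=img.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            y0, y1 = max(y - pad, 0), min(y + tile + pad, h)
            x0, x1 = max(x - pad, 0), min(x + tile + pad, w)
            up = upscale_fn(img[y0:y1, x0:x1], scale)
            # Crop away the padded context, in upscaled coordinates.
            cy, cx = (y - y0) * scale, (x - x0) * scale
            th = min(tile, h - y) * scale
            tw = min(tile, w - x) * scale
            out[y * scale:y * scale + th,
                x * scale:x * scale + tw] = up[cy:cy + th, cx:cx + tw]
    return out

def nearest_upscale(tile, scale):
    # Stand-in "model": nearest-neighbour upscaling via np.kron.
    return np.kron(tile, np.ones((scale, scale), dtype=tile.dtype))

img = np.random.default_rng(0).random((100, 130))
big = upscale_tiled(img, nearest_upscale, scale=4, tile=64, pad=8)
print(big.shape)
```

Because memory use depends only on the tile size, not the input size, this is what lets the released tools process arbitrarily large images on limited GPU memory.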

Why It Works

Real-ESRGAN’s effectiveness stems from several key design choices:
  1. Comprehensive degradation modeling: The high-order degradation process covers the vast majority of real-world scenarios
  2. Strong generator architecture: Both RRDBNet and SRVGGNetCompact provide sufficient capacity to learn complex mappings
  3. Advanced discriminator: The U-Net discriminator provides rich multi-scale feedback
  4. Two-stage training: An initial L1-only stage (yielding Real-ESRNet) followed by GAN fine-tuning balances fidelity and perceptual quality
  5. Domain-specific variants: Specialized models for anime, faces, and general content maximize performance
The result is a practical blind super-resolution system that works on real-world images without requiring any real degraded training data.
