Prerequisites
Prepare Dataset
Complete the dataset preparation steps and have your meta info file ready.
Download Pre-trained ESRGAN Model
Real-ESRNet training starts from a pre-trained ESRGAN model, so download that model first.
Configure Training Options
Modify the training configuration file options/train_realesrnet_x4plus.yml.
Dataset Configuration
Update the dataset paths to match your prepared data:
- dataroot_gt: Root directory containing your ground-truth images
- meta_info: Path to the meta info text file you generated in dataset preparation
- Dataset type: use RealESRGANDataset for on-the-fly degradation
Validation Configuration (Optional)
If you want to run validation during training, uncomment and modify the validation sections of the configuration file. Validation is optional but helps monitor training progress. Set val_freq to control how often validation runs (e.g., 5e3 means every 5000 iterations).
Debug Mode
Before starting the full training, test your configuration in debug mode to catch any issues.
What debug mode does
Debug mode:
- Runs a few training iterations to verify everything works
- Checks data loading and model initialization
- Validates file paths and configurations
- Exits early without full training
Start Training
Once debug mode runs successfully, start the full training.
Training Parameters
- Number of GPUs to use for distributed training (e.g., 4 for 4 GPUs)
- Port for distributed training communication (e.g., 4321)
- Distributed training backend: use pytorch for PyTorch distributed
- Automatically resume training from the last checkpoint if interrupted
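As a rough illustration of how the parameters above fit together, the sketch below assembles a launch command as an argument list. The helper name build_train_command is hypothetical, and the flag names (--nproc_per_node, --master_port, -opt, --launcher, --auto_resume, --debug) and the realesrgan/train.py script path are assumptions based on the upstream Real-ESRGAN repository and the PyTorch distributed launcher, so check them against your own checkout before relying on them.

```python
import shlex


def build_train_command(num_gpus=4, master_port=4321,
                        config="options/train_realesrnet_x4plus.yml",
                        debug=False, auto_resume=True):
    """Assemble a distributed training command as a list of arguments.

    Hypothetical helper: the flag names and the script path are assumptions
    taken from the upstream repository layout, not guaranteed by this guide.
    """
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={num_gpus}",   # number of GPUs / processes
        f"--master_port={master_port}",   # port for process communication
        "realesrgan/train.py",
        "-opt", config,                   # training configuration file
        "--launcher", "pytorch",          # distributed backend
    ]
    if debug:
        # Debug run: a short sanity check, so resuming is not requested.
        cmd.append("--debug")
    elif auto_resume:
        # Full run: pick up from the last checkpoint if one exists.
        cmd.append("--auto_resume")
    return cmd


def as_shell_line(cmd):
    """Render the argument list as a single shell-quoted string."""
    return shlex.join(cmd)
```

Calling as_shell_line(build_train_command(debug=True)) gives a debug-mode command line, and as_shell_line(build_train_command()) gives the full-training variant. Building the command as a list keeps quoting explicit if the configuration path contains spaces.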
Training Output
Training artifacts are saved to the experiments directory.
Key Files
- models/net_g_*.pth: Generator model checkpoints saved at intervals
- models/net_g_1000000.pth: The final Real-ESRNet model after 1M iterations
- training_states/*.state: Training state for resuming (optimizer, scheduler, etc.)
- visualization/: Sample outputs during training (if enabled)
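To locate the most recent generator checkpoint in a models directory, a small helper like the one below can be used. This is a minimal sketch that relies only on the net_g_*.pth naming listed above; the function name latest_generator_checkpoint is hypothetical, and files whose suffix is not a number are simply skipped.

```python
import re
from pathlib import Path


def latest_generator_checkpoint(models_dir):
    """Return (iteration, path) for the highest-numbered net_g_<iter>.pth.

    Returns None when the directory holds no numbered generator checkpoint.
    """
    pattern = re.compile(r"net_g_(\d+)\.pth$")
    found = []
    for path in Path(models_dir).glob("net_g_*.pth"):
        match = pattern.search(path.name)
        if match:  # ignore names without a numeric iteration suffix
            found.append((int(match.group(1)), path))
    if not found:
        return None
    # Compare on the iteration number only.
    return max(found, key=lambda item: item[0])
```

For example, pointing it at the models folder of your experiment returns the iteration count and path of the newest checkpoint, which is convenient when checking how far an interrupted run got.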
The final model net_g_1000000.pth will be used as the initialization for Real-ESRGAN training in stage 2.
Monitoring Training
Training progress is logged to the console and TensorBoard (if configured):
- L1 loss: Should decrease steadily
- Learning rate: Check the schedule is working
- Validation metrics: PSNR/SSIM if validation is enabled
Training Duration
Typical training time for Real-ESRNet:
- 1M iterations on 4x V100 GPUs: ~3-4 days
- Single GPU training will take proportionally longer
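The figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes training time scales linearly with the iteration count and inversely with the number of GPUs, which is a simplification of the "proportionally longer" statement; estimate_days is a hypothetical helper, and the reference values are the 3-4 days for 1M iterations on 4 GPUs quoted above.

```python
def estimate_days(iterations, num_gpus,
                  ref_days=(3.0, 4.0),
                  ref_iterations=1_000_000,
                  ref_gpus=4):
    """Scale the reference (low, high) duration to another setup.

    Assumes linear scaling in iterations and inverse scaling in GPU count,
    with comparable hardware to the reference setup.
    """
    if iterations <= 0 or num_gpus <= 0:
        raise ValueError("iterations and num_gpus must be positive")
    scale = (iterations / ref_iterations) * (ref_gpus / num_gpus)
    return tuple(days * scale for days in ref_days)


# Full 1M-iteration run on a single GPU under these assumptions:
print(estimate_days(1_000_000, 1))  # (12.0, 16.0)
```

Real throughput also depends on the GPU model, data loading speed, and batch size, so treat the result as an order-of-magnitude guide only.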
Adjusting training iterations
The default configuration trains for 1,000,000 iterations. You can adjust this in the config file. For quick testing, reduce to 100,000 iterations, though results will be suboptimal.
Troubleshooting
Out of memory errors
Reduce the batch size in the configuration file.
Dataset not found
Verify your paths:
- Check that dataroot_gt points to the correct directory
- Ensure the meta_info file exists and contains valid paths
- Paths in meta_info should be relative to dataroot_gt
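The checks above can be automated with a short script. This is a minimal sketch, assuming the meta info file lists one image per line and that the first space-separated token on each line is a path relative to dataroot_gt; find_missing_images is a hypothetical helper name.

```python
from pathlib import Path


def find_missing_images(dataroot_gt, meta_info):
    """Return meta-info entries that do not resolve to a file under dataroot_gt."""
    root = Path(dataroot_gt)
    if not root.is_dir():
        raise FileNotFoundError(f"dataroot_gt is not a directory: {root}")
    missing = []
    with open(meta_info, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: the path is the first space-separated token.
            entry = line.split(" ")[0]
            if not (root / entry).is_file():
                missing.append(entry)
    return missing
```

An empty list means every entry resolved. Any returned entries point to a wrong root directory, a stale meta info file, or paths that were written as absolute rather than relative.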
CUDA device errors
Adjust CUDA_VISIBLE_DEVICES to match your available GPUs.
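One common mismatch is requesting more training processes than CUDA_VISIBLE_DEVICES exposes. The sketch below checks only that count, using hypothetical helpers count_visible_gpus and check_gpu_request; it does not verify that the listed device ids actually exist on the machine.

```python
import os


def count_visible_gpus(env=None):
    """Count the ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, in which case all GPUs on the
    machine are visible and no limit can be inferred from the environment.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([item for item in value.split(",") if item.strip()])


def check_gpu_request(num_gpus, env=None):
    """Return an error message if num_gpus exceeds the exposed GPUs, else None."""
    visible = count_visible_gpus(env)
    if visible is not None and num_gpus > visible:
        return (f"requested {num_gpus} GPUs but CUDA_VISIBLE_DEVICES "
                f"exposes only {visible}")
    return None
```

Running check_gpu_request with the GPU count you intend to train on, before launching, catches this mismatch early instead of during process startup.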
Training is very slow
- Use cropped sub-images (Step 2 of dataset preparation)
- Increase num_worker_per_gpu for faster data loading
- Ensure data is on fast storage (SSD)
- Check GPU utilization with nvidia-smi
Next Step
Train Real-ESRGAN
Continue to stage 2: Train Real-ESRGAN with perceptual and GAN losses