Image Preprocessing Pipeline
All images undergo standardized preprocessing before being fed into the CNN:
1. Resizing
All images are resized to 224×224 pixels regardless of original dimensions.
- Standard size for ImageNet-trained models
- Large enough to preserve diagnostic details
- Small enough for reasonable memory usage
- Allows batch processing of 32 images simultaneously
Resizing uses bilinear interpolation to maintain image quality while changing dimensions.
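The interpolation step can be sketched in plain Python. This is an illustrative version only, not the project's actual resizing code: the function name `resize_bilinear` is hypothetical, it works on a single-channel image stored as a list of rows, and it uses the "align corners" sampling convention, which image libraries do not all share.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2D image (list of rows) using bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    # Map output corner pixels onto input corner pixels (align-corners convention).
    y_scale = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    x_scale = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    resized = []
    for i in range(out_h):
        y = i * y_scale
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0  # vertical weight of the lower neighbour
        row = []
        for j in range(out_w):
            x = j * x_scale
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0  # horizontal weight of the right neighbour
            # Blend the four surrounding input pixels.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        resized.append(row)
    return resized


tiny = [[0, 10], [20, 30]]
upscaled = resize_bilinear(tiny, 3, 3)      # centre pixel is the mean of all four
model_input = resize_bilinear(tiny, 224, 224)
```

In practice a library routine would normally do this work; the sketch only shows what "bilinear" means for each output pixel.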
2. Normalization
Pixel values are normalized from [0, 255] to [0, 1] range.
Why normalize?
- Faster convergence: Gradient descent works better with small values
- Numerical stability: Prevents overflow/underflow in calculations
- Consistent scale: All pixel values share the same range, so no input dominates because of its magnitude
- Standard practice: Expected input range for most CNN architectures
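As a small before/after illustration, the rescaling amounts to dividing every 8-bit value by 255. The helper name `normalize_pixels` and the sample values are made up for this sketch.

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return [[value / 255.0 for value in row] for row in image]


raw = [[0, 128, 255], [64, 32, 16]]   # before: integers in [0, 255]
scaled = normalize_pixels(raw)        # after: floats in [0, 1]
```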
3. RGB Channel Handling
Images are loaded as RGB (3 channels) even though X-rays are grayscale.
- Dataset stores X-rays in JPEG RGB format
- All three channels contain identical information
- CNN architecture expects 3-channel input
- Simplifies preprocessing pipeline
While technically redundant, keeping RGB format ensures compatibility and doesn’t significantly impact performance.
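To make the redundancy concrete, the sketch below repeats a grayscale image across three channels and checks that the channels are identical. Both helper names are hypothetical, and a real pipeline would normally get the 3-channel array directly from the image loader.

```python
def gray_to_rgb(gray):
    """Turn an H x W grayscale image into H x W x 3 by repeating each value."""
    return [[[value, value, value] for value in row] for row in gray]


def channels_identical(rgb):
    """Return True when the R, G and B values match at every pixel."""
    return all(r == g == b for row in rgb for (r, g, b) in row)


gray = [[0.0, 0.5], [1.0, 0.25]]
rgb = gray_to_rgb(gray)
```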
Data Augmentation Techniques
Data augmentation is applied only during training to artificially increase dataset diversity and prevent overfitting.
Augmentation Parameters
Random rotation between -15° and +15°
- Simulates different patient positioning during X-ray capture
- Chest X-rays can have slight angle variations
- Helps the model become robust to small rotations
- ±15° is medically reasonable (larger angles would be unrealistic)
Random horizontal translation up to ±10% of image width
Random vertical translation up to ±10% of image height
- X-rays may not be perfectly centered
- Simulates different framing by radiologists
- Prevents model from relying on absolute position of features
Random zoom between 0.9× and 1.1× (±10%)
- Patient size varies (pediatric patients in this dataset)
- Distance from X-ray machine can vary
- Lung features should be detectable at multiple scales
Randomly flip images horizontally (left-right mirror)
- Left and right lungs have similar anatomy
- Pneumonia can occur in either lung
- Effectively doubles the dataset size
- Label-preserving: mirroring does not change the NORMAL/PNEUMONIA label, although it also mirrors the position of the heart
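The parameter ranges above can be read as a recipe for drawing one random transformation per image. The following sketch samples such a recipe using only the standard library. The dictionary keys, the function name, and the 50% flip probability are assumptions made for illustration; applying the transformation to pixels is left to an image library.

```python
import random

# Ranges taken from the parameter list above.
AUGMENTATION_RANGES = {
    "rotation_deg": (-15.0, 15.0),   # random rotation
    "shift_x_frac": (-0.10, 0.10),   # horizontal shift, fraction of width
    "shift_y_frac": (-0.10, 0.10),   # vertical shift, fraction of height
    "zoom": (0.9, 1.1),              # zoom factor
}
FLIP_PROBABILITY = 0.5  # assumption: each image is mirrored half of the time


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters."""
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in AUGMENTATION_RANGES.items()
    }
    params["flip_horizontal"] = rng.random() < FLIP_PROBABILITY
    return params


rng = random.Random(0)
for epoch in range(3):
    # The same image receives a different transformation in every epoch.
    print(epoch, sample_augmentation(rng))
```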
Augmentation Examples
Original (the base chest X-ray image from which the rotated, translated, zoomed, flipped, and combined examples are derived):
- Size: 224×224×3
- Normalized: [0, 1]
- No transformations
Dataset Split Strategy
Our custom split is used in the code in place of the original Kaggle split: 20% of the training data is held out for validation, giving a training set, a validation set, and a test set.
Training Set
Size: 4,173 images
Purpose: Model learns from this data
Augmentation: YES (rotation, zoom, flip, etc.)
Batches: 32 images per batch
Class distribution:
- NORMAL: ~1,073 images
- PNEUMONIA: ~3,100 images
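The training-set figures can be reproduced from the per-class totals of the original training folder (1,341 NORMAL and 3,875 PNEUMONIA, as listed under Dataset Statistics). The sketch assumes the 20% hold-out is taken separately within each class and rounded down; the function name is hypothetical.

```python
TRAIN_FOLDER_COUNTS = {"NORMAL": 1341, "PNEUMONIA": 3875}
VALIDATION_PERCENT = 20


def split_counts(counts, validation_percent):
    """Split each class into training and validation counts."""
    train, validation = {}, {}
    for label, total in counts.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        n_val = total * validation_percent // 100
        validation[label] = n_val
        train[label] = total - n_val
    return train, validation


train, validation = split_counts(TRAIN_FOLDER_COUNTS, VALIDATION_PERCENT)
print(train, sum(train.values()))            # 1073 + 3100 = 4173
print(validation, sum(validation.values()))  # 268 + 775 = 1043
```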
Why Augmentation Prevents Overfitting
The Problem: Limited Data
With only ~4,200 training images, the model could memorize specific images instead of learning generalizable patterns. Typical symptoms of such overfitting:
- High training accuracy (>95%)
- Lower validation accuracy (<80%)
- Model performs poorly on new X-rays
The Solution: Data Augmentation
Augmentation helps through a larger effective dataset size, a regularization effect, and medical realism.
Effective dataset size:
Without augmentation: 4,173 unique images
With augmentation: a practically unlimited number of variations
- Each epoch sees different versions of same images
- Rotation alone creates ~30 variations per image when counted in 1° steps across the ±15° range
- Combined augmentations create thousands of variations
Combined Regularization Strategy
Data augmentation works alongside other techniques:
- Dropout (50%): Randomly disables neurons during training
- Data Augmentation: Creates diverse training samples
- Early Stopping: Stops training when validation loss stops improving
Together, these techniques keep validation accuracy within 2-3 percentage points of training accuracy, indicating minimal overfitting.
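The other two techniques can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not the training code: the function names, the `patience` value, and the loss values are hypothetical, and the dropout shown is the common "inverted" variant that rescales the surviving activations.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # validation loss has stalled for `patience` epochs
    return len(val_losses) - 1  # never triggered: ran to the last epoch


rng = random.Random(0)
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)   # values are 0.0 or 2.0
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5])
```

With these made-up losses, the best value is reached at epoch 2 and training stops at epoch 5, before the later improvement is ever seen; that trade-off is why the patience setting matters.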
Preprocessing Code Example
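The following is a minimal sketch of how the settings described on this page could be expressed with Keras, not a copy of the project's code. It assumes a TensorFlow 2.x environment in which the legacy `ImageDataGenerator` class is still available, a binary label setup (`class_mode="binary"`), and a caller-supplied training directory with one sub-folder per class. The constant and function names are hypothetical.

```python
# Training-time settings: rescaling plus the documented augmentation ranges.
TRAIN_AUGMENTATION = dict(
    rescale=1.0 / 255,        # [0, 255] -> [0, 1]
    rotation_range=15,        # degrees, i.e. -15 to +15
    width_shift_range=0.1,    # up to 10% of image width
    height_shift_range=0.1,   # up to 10% of image height
    zoom_range=0.1,           # zoom factor between 0.9 and 1.1
    horizontal_flip=True,
    validation_split=0.2,     # hold out 20% of the training folder
)

# Validation images are rescaled only; no augmentation is applied.
VALIDATION_PREPROCESSING = dict(rescale=1.0 / 255, validation_split=0.2)

LOADER_OPTIONS = dict(
    target_size=(224, 224),
    color_mode="rgb",
    batch_size=32,
    class_mode="binary",      # assumption: NORMAL vs PNEUMONIA as 0/1
    interpolation="bilinear",
)


def build_generators(train_dir):
    """Create training and validation iterators from one directory."""
    # Imported lazily so the settings above can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_gen = ImageDataGenerator(**TRAIN_AUGMENTATION).flow_from_directory(
        train_dir, subset="training", shuffle=True, **LOADER_OPTIONS
    )
    val_gen = ImageDataGenerator(**VALIDATION_PREPROCESSING).flow_from_directory(
        train_dir, subset="validation", shuffle=False, **LOADER_OPTIONS
    )
    return train_gen, val_gen
```

Using two generator objects with the same `validation_split` keeps the held-out images free of augmentation while both iterators still read from the same folder.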
Dataset Statistics
Image Properties
All images stored in JPEG format
Pediatric patients aged 1-5 years
Complete dataset size on disk
Class Distribution
Training set is imbalanced:
- NORMAL: 1,341 images (26%)
- PNEUMONIA: 3,875 images (74%)
Test set:
- NORMAL: 234 images (37.5%)
- PNEUMONIA: 390 images (62.5%)
This imbalance reflects real-world clinical scenarios where pneumonia cases are more common in symptomatic patients being screened.
Performance Impact
Metrics with augmentation:
- Training Accuracy: ~92%
- Validation Accuracy: ~89%
- Recall: ~96%
- Generalization: Excellent
Technical References
- Dataset Source: Kermany et al. (2018), “Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification”
- Kaggle Link: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
- License: CC BY 4.0