This page details the complete data preprocessing pipeline, from raw chest X-ray images to model-ready tensors.

Image Preprocessing Pipeline

All images undergo standardized preprocessing before being fed into the CNN:

1. Resizing

target_size (tuple, default: (224, 224))
All images are resized to 224×224 pixels regardless of original dimensions.
Why 224×224?
  • Standard size for ImageNet-trained models
  • Large enough to preserve diagnostic details
  • Small enough for reasonable memory usage
  • Allows batch processing of 32 images simultaneously
Original Image Dimensions: Variable (dataset contains images of different sizes)
Resizing uses bilinear interpolation to maintain image quality while changing dimensions.
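The resize step can be sketched with Pillow (`preprocess_size` is a hypothetical helper name; Keras performs the equivalent resize internally when `target_size=(224, 224)` is passed):

```python
import numpy as np
from PIL import Image

def preprocess_size(img, target=(224, 224)):
    # Bilinear interpolation blends the four nearest source pixels,
    # avoiding the blocky artifacts of nearest-neighbor resizing.
    return img.resize(target, resample=Image.BILINEAR)

# Dataset images vary in size; every input collapses to 224x224.
raw = Image.fromarray(np.zeros((1024, 1536), dtype=np.uint8))
resized = preprocess_size(raw)
print(resized.size)  # (224, 224)
```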

2. Normalization

rescale (float, default: 1./255)
Pixel values are normalized from [0, 255] to [0, 1] range.
Formula:
normalized_pixel = original_pixel / 255.0
Why normalize?
  • Faster convergence: Gradient descent converges better with small input values
  • Numerical stability: Prevents overflow/underflow in intermediate calculations
  • Consistent scale: All inputs share the same [0, 1] range
  • Standard practice: The expected input range for most CNN architectures
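The rescale formula is a single NumPy operation (a minimal sketch of what `rescale=1./255` does to each batch):

```python
import numpy as np

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)  # raw 8-bit values

# normalized_pixel = original_pixel / 255.0
normalized = pixels.astype(np.float32) / 255.0
print(normalized)  # values now lie in [0, 1]
```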

3. RGB Channel Handling

color_mode (string, default: "rgb")
Images are loaded as RGB (3 channels) even though X-rays are grayscale.
Why RGB for grayscale images?
  • Dataset stores X-rays in JPEG RGB format
  • All three channels contain identical information
  • CNN architecture expects 3-channel input
  • Simplifies preprocessing pipeline
While technically redundant, keeping RGB format ensures compatibility and doesn’t significantly impact performance.
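When the source file is single-channel, loading with `color_mode="rgb"` replicates the grayscale plane into three channels; a NumPy sketch of the equivalent operation:

```python
import numpy as np

gray = np.random.default_rng(0).random((224, 224), dtype=np.float32)

# Stack the single grayscale plane three times -> (224, 224, 3).
rgb = np.repeat(gray[..., np.newaxis], 3, axis=-1)

# All three channels carry identical information.
print(rgb.shape, np.array_equal(rgb[..., 0], rgb[..., 2]))
```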

Data Augmentation Techniques

Data augmentation is applied only during training to artificially increase dataset diversity and prevent overfitting.
Augmentation is NOT applied to validation or test sets to ensure accurate performance evaluation.

Augmentation Parameters

rotation_range (int, default: 15)
Random rotation between -15° and +15°
Why rotation?
  • Simulates different patient positioning during X-ray capture
  • Chest X-rays can have slight angle variations
  • Helps model become rotation-invariant
  • ±15° is medically reasonable (larger angles would be unrealistic)
width_shift_range (float, default: 0.1)
Random horizontal translation up to ±10% of image width
height_shift_range (float, default: 0.1)
Random vertical translation up to ±10% of image height
Why translation?
  • X-rays may not be perfectly centered
  • Simulates different framing by radiologists
  • Prevents model from relying on absolute position of features
zoom_range (float, default: 0.1)
Random zoom between 0.9× and 1.1× (±10%)
Why zoom?
  • Patient size varies (pediatric patients in this dataset)
  • Distance from X-ray machine can vary
  • Lung features should be detectable at multiple scales
horizontal_flip (boolean, default: true)
Randomly flip images horizontally (left-right mirror)
Why horizontal flip?
  • Left and right lungs have similar anatomy
  • Pneumonia can occur in either lung
  • Effectively doubles the dataset size
  • Medically valid: mirror images don’t change diagnosis
Vertical flipping is NOT used because upside-down chest X-rays would be anatomically incorrect and could confuse the model.
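The ranges above amount to one fresh random draw per image; a toy sketch of the sampling (`sample_augmentation_params` is a hypothetical helper, and Keras applies the actual warps internally):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_augmentation_params():
    # One independent draw per image, matching the ranges documented above.
    return {
        "rotation_deg": rng.uniform(-15, 15),    # rotation_range=15
        "width_shift": rng.uniform(-0.1, 0.1),   # width_shift_range=0.1
        "height_shift": rng.uniform(-0.1, 0.1),  # height_shift_range=0.1
        "zoom": rng.uniform(0.9, 1.1),           # zoom_range=0.1
        "flip": bool(rng.random() < 0.5),        # horizontal_flip=True
    }

params = sample_augmentation_params()
print(params)
```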

Augmentation Examples

Base chest X-ray image:
  • Size: 224×224×3
  • Normalized: [0, 1]
  • No transformations

Dataset Split Strategy

Original Kaggle Split (Not Used)

Train:  5,216 images (89%)
Test:     624 images (11%)
Val:       16 images (<1%)  ❌ Too small!
The original validation set contains only 16 images, which is statistically insufficient for model evaluation.

Our Custom Split (Used in Code)

validation_split (float, default: 0.2)
20% of training data is held out for validation
Original Train: 5,216 images

New Train:      4,173 images (80%)
Validation:     1,043 images (20%)

Test:             624 images (unchanged)
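The split arithmetic checks out:

```python
# validation_split=0.2 partitions the 5,216 original training images.
original_train = 5216
val_count = int(original_train * 0.2)      # 1,043 held out for validation
train_count = original_train - val_count   # 4,173 remain for training
print(train_count, val_count)  # 4173 1043
```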
Training Set
Size: 4,173 images
Purpose: Model learns from this data
Augmentation: YES (rotation, zoom, flip, etc.)
Batches: 32 images per batch
Class distribution:
  • NORMAL: ~1,073 images
  • PNEUMONIA: ~3,100 images
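A quick back-of-the-envelope check on why 32-image batches stay memory-friendly at 224×224×3:

```python
# One float32 batch: 32 images x 224 x 224 pixels x 3 channels x 4 bytes
batch_bytes = 32 * 224 * 224 * 3 * 4
print(batch_bytes / 2**20)  # 18.375 MiB per batch
```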

Why Augmentation Prevents Overfitting

The Problem: Limited Data

With only ~4,200 training images, the model could memorize specific images instead of learning generalizable patterns.
Signs of overfitting:
  • High training accuracy (>95%)
  • Lower validation accuracy (<80%)
  • Model performs poorly on new X-rays

The Solution: Data Augmentation

Without augmentation: 4,173 unique images
With augmentation: effectively unlimited variations
  • Each epoch sees a different random variant of the same images
  • Rotation alone is sampled from a continuous ±15° range
  • Combining rotation, shifts, zoom, and flips makes exact repeats vanishingly unlikely
Result: Model sees “new” images every epoch
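A toy stand-in illustrates the effect (random flip plus a one-pixel roll; the real generator draws continuous rotations, shifts, and zooms, making repeats even less likely):

```python
import numpy as np

rng = np.random.default_rng(42)
image = np.arange(16, dtype=np.float32).reshape(4, 4)

def augment(img):
    # Random horizontal flip, then a crude -1/0/+1 column shift.
    out = img[:, ::-1] if rng.random() < 0.5 else img
    return np.roll(out, int(rng.integers(-1, 2)), axis=1)

# Two "epochs" over the same source image usually yield different tensors.
epoch1, epoch2 = augment(image), augment(image)
print(epoch1.shape, epoch2.shape)
```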

Combined Regularization Strategy

Data augmentation works alongside other techniques:
  1. Dropout (50%): Randomly disables neurons during training
  2. Data Augmentation: Creates diverse training samples
  3. Early Stopping: Stops training when validation loss stops improving
Together, these techniques achieve validation accuracy within 2-3% of training accuracy, indicating minimal overfitting.
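The early-stopping rule can be sketched in plain Python (the loss values below are illustrative; in Keras this is `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)` passed to `model.fit`):

```python
# Illustrative per-epoch validation losses (not real training output).
val_losses = [0.60, 0.48, 0.41, 0.40, 0.42, 0.43, 0.44]

patience, best, wait, stopped_at = 3, float("inf"), 0, None
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0       # improvement: reset the counter
    else:
        wait += 1                  # no improvement this epoch
        if wait >= patience:
            stopped_at = epoch     # stop after `patience` stale epochs
            break

print(stopped_at, best)  # 6 0.4
```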

Preprocessing Code Example

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training data generator (with augmentation)
train_datagen = ImageDataGenerator(
    rescale=1./255,              # Normalize to [0,1]
    rotation_range=15,           # ±15° rotation
    width_shift_range=0.1,       # ±10% horizontal shift
    height_shift_range=0.1,      # ±10% vertical shift
    zoom_range=0.1,              # ±10% zoom
    horizontal_flip=True,        # Random flip
    validation_split=0.2         # Hold out 20% for validation
)

# Test data generator (no augmentation)
test_datagen = ImageDataGenerator(
    rescale=1./255               # Only normalization
)

# Load training data
train_generator = train_datagen.flow_from_directory(
    'data/chest_xray/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='training'
)

# Load validation data
val_generator = train_datagen.flow_from_directory(
    'data/chest_xray/train',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='validation'
)

Dataset Statistics

Image Properties

format (string, default: "JPEG")
All images stored in JPEG format
source (string, default: "Guangzhou Women and Children's Medical Center")
Pediatric patients aged 1-5 years
total_size (string, default: "~1.15 GB")
Complete dataset size on disk

Class Distribution

Training set is imbalanced:
  • NORMAL: 1,341 images (26%)
  • PNEUMONIA: 3,875 images (74%)
Class imbalance means the model sees nearly 3× more pneumonia cases than normal cases. This naturally biases it toward predicting pneumonia, which is actually desirable in medical screening (a false alarm is safer than a missed case).
Test set is also imbalanced:
  • NORMAL: 234 images (37.5%)
  • PNEUMONIA: 390 images (62.5%)
This imbalance reflects real-world clinical scenarios where pneumonia cases are more common in symptomatic patients being screened.
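The ratios behind both distributions:

```python
train_normal, train_pneumonia = 1341, 3875
test_normal, test_pneumonia = 234, 390

print(round(train_pneumonia / train_normal, 2))         # 2.89 -> nearly 3x
print(test_pneumonia / (test_normal + test_pneumonia))  # 0.625
```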

Performance Impact

Metrics:
  • Training Accuracy: ~92%
  • Validation Accuracy: ~89%
  • Recall: ~96%
  • Generalization: validation accuracy tracks training accuracy within 2-3%
Training time: ~20-30 minutes (CPU)
