Dataset Guide

Dataset Overview

Chest X-Ray Images (Pneumonia)

Source: Kaggle
Author: Paul Mooney
URL: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
Size: ~1.15 GB (5,863 images, as listed on the Kaggle page)
License: CC BY 4.0

Description

The dataset contains chest X-ray radiographs from pediatric patients (ages 1-5 years) from the Guangzhou Women and Children’s Medical Center, China.

Images are organized into 3 folders (train, test, val) with subfolders for each category (Pneumonia/Normal).

Dataset Composition

Distribution Statistics

Total images: 5,856 across the three folders (the Kaggle page lists 5,863)
├── Train:     5,216 images (89%)
│   ├── NORMAL:    1,341 images
│   └── PNEUMONIA: 3,875 images
├── Test:        624 images (11%)
│   ├── NORMAL:    234 images
│   └── PNEUMONIA: 390 images
└── Val:          16 images (very small, not used)

The original validation set contains only 16 images, which is too small for reliable validation. Our code automatically creates a proper 80/20 split from the training set.

Class Distribution

The dataset has class imbalance:

Normal cases: ~26% of training data
Pneumonia cases: ~74% of training data

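This imbalance reflects real-world clinical scenarios, but it means a model could reach roughly 74% training accuracy by always predicting pneumonia. Training therefore needs countermeasures such as class weighting or resampling, and evaluation should not rely on accuracy alone.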
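One common way to handle the imbalance is to weight each class inversely to its frequency. The sketch below applies the usual "balanced" heuristic to the training counts listed above. The function name `balanced_class_weights` is hypothetical, and this is not necessarily what the project code does.

```python
# Training-set counts taken from the distribution statistics above.
TRAIN_COUNTS = {"NORMAL": 1341, "PNEUMONIA": 3875}


def balanced_class_weights(counts):
    """Return weights inversely proportional to class frequency.

    Uses the common "balanced" heuristic: total / (n_classes * class_count).
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(TRAIN_COUNTS)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts the NORMAL class gets a weight of about 1.94 and PNEUMONIA about 0.67, so each misclassified normal image costs roughly three times as much as a misclassified pneumonia image.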

Download Instructions

Option 1: Manual Download (Recommended)

Visit Kaggle

Go to https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
If you don’t have a Kaggle account, create one (it’s free)

Download the Dataset

Click the Download button (top right)
A file named archive.zip (~1.15 GB) will download

Extract Files

Unzip archive.zip
You’ll find a folder named chest_xray

Move to Project

Place the extracted folder in your project:

proyectoia/
└── data/
    └── chest_xray/
        ├── train/
        │   ├── NORMAL/
        │   │   └── [1341 .jpeg images]
        │   └── PNEUMONIA/
        │       └── [3875 .jpeg images]
        └── test/
            ├── NORMAL/
            │   └── [234 .jpeg images]
            └── PNEUMONIA/
                └── [390 .jpeg images]

Option 2: Download with Kaggle API

For command-line enthusiasts:

Install Kaggle API

pipenv install kaggle

Configure Credentials

Go to your Kaggle profile → Account → API → Create New Token
Download kaggle.json
On Windows, move it to: C:\Users\<username>\.kaggle\kaggle.json
On Linux/Mac, move it to ~/.kaggle/kaggle.json, then run chmod 600 ~/.kaggle/kaggle.json

Download Dataset

pipenv run kaggle datasets download -d paultimothymooney/chest-xray-pneumonia -p data/
cd data
unzip chest-xray-pneumonia.zip  # the archive already contains a top-level chest_xray/ folder

Dataset Verification

After downloading, verify the structure:

pipenv run python verificar_dataset.py

Expected Structure

data/chest_xray/
├── train/
│   ├── NORMAL/         [1,341 images]
│   └── PNEUMONIA/      [3,875 images]
├── test/
│   ├── NORMAL/         [234 images]
│   └── PNEUMONIA/      [390 images]
└── val/ (optional, not used)

If the verification script reports any missing images or incorrect counts, re-download the dataset to ensure data integrity.
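The contents of verificar_dataset.py are not shown in this guide. As a rough illustration of what such a check involves, the sketch below counts image files per folder and compares them with the expected structure. The names `count_images` and `find_mismatches` are hypothetical, and the sketch assumes images use a .jpeg or .jpg extension.

```python
from pathlib import Path

# Expected counts, copied from the structure listed above.
EXPECTED = {
    ("train", "NORMAL"): 1341,
    ("train", "PNEUMONIA"): 3875,
    ("test", "NORMAL"): 234,
    ("test", "PNEUMONIA"): 390,
}
IMAGE_SUFFIXES = {".jpeg", ".jpg"}


def count_images(root):
    """Count image files in each split/class folder under root."""
    root = Path(root)
    counts = {}
    for split, label in EXPECTED:
        folder = root / split / label
        if folder.is_dir():
            counts[(split, label)] = sum(
                1 for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
            )
        else:
            counts[(split, label)] = None  # folder is missing
    return counts


def find_mismatches(root):
    """Return {(split, label): (found, expected)} for folders that differ."""
    counts = count_images(root)
    return {
        key: (counts[key], expected)
        for key, expected in EXPECTED.items()
        if counts[key] != expected
    }


if __name__ == "__main__":
    problems = find_mismatches("data/chest_xray")
    for (split, label), (found, expected) in problems.items():
        print(f"{split}/{label}: found {found}, expected {expected}")
    print("OK" if not problems else "Dataset structure differs from the guide")
```

A count of None means the folder itself is missing, which usually points to an extraction path problem rather than a corrupted download.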

Image Characteristics

Technical Specifications

Format: JPEG
Dimensions: Variable (resized to 224×224 in code)
Channels: 3 (loaded as RGB; the X-rays themselves are grayscale)
Quality: Real clinical-use images
Bit Depth: 8-bit per channel

Clinical Categories

NORMAL

Clear lung parenchyma
No abnormal opacities
Normal cardiothoracic ratio
Clear costophrenic angles

PNEUMONIA

Presence of infiltrates
Opacities or consolidations
Increased density in lung fields
May show air bronchograms

The dataset includes both viral and bacterial pneumonia cases. The folder structure does not separate them, although file names in the PNEUMONIA folders contain "bacteria" or "virus".

Ethical and Legal Considerations

License Information

The dataset is released under CC BY 4.0 license, which permits:

✅ Commercial and non-commercial use
✅ Modification and distribution
✅ Research and education

Requirement: Provide appropriate attribution to the creator.

Privacy Protection

All images have been anonymized
No patient-identifiable information is present
Approved for research use
Compliant with medical data sharing regulations

Responsible Use

CRITICAL: This model is for educational and research purposes ONLY.

DO NOT use for:

Real clinical diagnosis without medical supervision
Replacing professional radiologist judgment
Medical decision-making without additional validation
Patient care without proper clinical oversight

AI systems are clinical decision support tools, not replacements for healthcare professionals.

Preprocessing Pipeline

Our code automatically applies the following preprocessing:

1. Resizing

All images resized to 224×224 pixels
Source images vary in aspect ratio, so resizing to a square may stretch them slightly unless padding is used
Standard input size for the CNN

2. Normalization

Pixel values scaled to [0, 1] range
Divides by 255.0
Improves training stability

3. Data Augmentation (Training Only)

Rotation

Random rotations: ±15 degrees
Simulates different patient positioning
Improves rotation invariance

Translation

Horizontal/vertical shifts: ±10%
Accounts for different X-ray centering
Enhances spatial robustness

Zoom

Random zoom: ±10%
Simulates different patient distances
Improves scale invariance

Horizontal Flip

Random horizontal flipping
Approximately valid anatomically (lungs are roughly symmetric, though the heart shadow is not)
Exposes the model to mirrored variants of each training image

Data augmentation is applied only during training, not during validation or testing, to ensure fair evaluation.
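The ranges above can be expressed as a small parameter sampler. This is a minimal sketch using only the standard library. `AugmentationParams` and `sample_augmentation` are hypothetical names, the 50% flip probability is an assumption, and the project code may rely on a framework's built-in augmentation layers instead.

```python
import random
from dataclasses import dataclass


@dataclass
class AugmentationParams:
    rotation_deg: float    # rotation angle in degrees
    shift_x: float         # horizontal shift as a fraction of image width
    shift_y: float         # vertical shift as a fraction of image height
    zoom: float            # scale factor, 1.0 means no zoom
    flip_horizontal: bool  # whether to mirror the image left to right


def sample_augmentation(rng, training=True):
    """Draw one set of random augmentation parameters.

    Returns the identity transform when training is False, so validation
    and test images pass through unchanged.
    """
    if not training:
        return AugmentationParams(0.0, 0.0, 0.0, 1.0, False)
    return AugmentationParams(
        rotation_deg=rng.uniform(-15.0, 15.0),
        shift_x=rng.uniform(-0.10, 0.10),
        shift_y=rng.uniform(-0.10, 0.10),
        zoom=rng.uniform(0.90, 1.10),
        flip_horizontal=rng.random() < 0.5,  # assumed 50% flip probability
    )


rng = random.Random(0)  # fixed seed so the draws are reproducible
print(sample_augmentation(rng))
print(sample_augmentation(rng, training=False))
```

Drawing fresh parameters for every image in every epoch is what makes augmentation effective: the model rarely sees exactly the same pixels twice.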

Dataset Split Strategy

Our Approach

Since the original validation set is too small, we:

Leave the original 16-image val folder unused
Split the original training set: 80% train / 20% validation
Keep original test set separate and untouched

Final Distribution

Training:   ~4,173 images (80% of the original training set)
Validation: ~1,043 images (20% of the original training set)
Test:         624 images (original test set, never seen during training)

The test set remains completely independent to provide unbiased performance evaluation.
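An 80/20 split can be done per class so that both subsets keep the same Normal/Pneumonia ratio. The sketch below is one way to do that with the standard library. `stratified_split` is a hypothetical helper, the file names are placeholders sized like the original training folders, and whether the project code stratifies its split is not stated in this guide.

```python
import random


def stratified_split(files_by_class, val_fraction=0.2, seed=42):
    """Split each class separately so both subsets keep the class ratio."""
    rng = random.Random(seed)
    train, val = {}, {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the result is reproducible
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val[label] = shuffled[:n_val]
        train[label] = shuffled[n_val:]
    return train, val


# Placeholder file names sized like the original training folders.
files = {
    "NORMAL": [f"normal_{i}.jpeg" for i in range(1341)],
    "PNEUMONIA": [f"pneumonia_{i}.jpeg" for i in range(3875)],
}
train, val = stratified_split(files)
print(sum(len(v) for v in train.values()), sum(len(v) for v in val.values()))
# prints: 4173 1043
```

Fixing the seed keeps the validation set identical across runs, which makes results from different training runs comparable.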

Citation and Attribution

Dataset Citation

Kermany, Daniel; Zhang, Kang; Goldbaum, Michael (2018),
"Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification",
Mendeley Data, v2.
http://dx.doi.org/10.17632/rscbjbr9sj.2

Scientific Publication

Kermany DS, Goldbaum M, et al. (2018).
"Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning".
Cell. 172(5):1122-1131.
https://doi.org/10.1016/j.cell.2018.02.010

When using this dataset in your work, please provide proper attribution to respect the creators’ contribution to open medical AI research.

Common Issues and Solutions

Download fails or times out

Solution: Try the manual download option instead of the API. Large files can be interrupted on unstable connections.

Wrong number of images after extraction

Solution: Re-download the dataset. The zip file may have been corrupted during download.

Permission denied when accessing kaggle.json

Solution: On Linux/Mac, run chmod 600 ~/.kaggle/kaggle.json to set proper permissions.

Out of memory during training

Solution: Reduce the batch size in the training configuration. Memory use grows with batch size, because each batch of 224×224 RGB images and the activations computed from it must fit in memory at once.
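For a rough sense of scale, the arithmetic below estimates the memory taken by the input batch alone, assuming pixels are stored as 32-bit floats after normalization. `batch_input_bytes` is a hypothetical helper. Intermediate activations inside the network typically need far more memory than the inputs, so treat these figures as a lower bound.

```python
def batch_input_bytes(batch_size, height=224, width=224, channels=3,
                      bytes_per_value=4):
    """Bytes needed to hold one input batch (4 bytes per value = float32)."""
    return batch_size * height * width * channels * bytes_per_value


for batch_size in (8, 32, 128):
    mib = batch_input_bytes(batch_size) / 2**20
    print(f"batch {batch_size:>3}: {mib:.1f} MiB of input tensors")
```

A single image takes 602,112 bytes under these assumptions, so a batch of 32 needs about 18.4 MiB for inputs. Halving the batch size halves both this figure and the activation memory.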

Next Steps

With the dataset downloaded and understood, you’re ready to:

Verify the data with the verification script
Explore the images to understand the visual patterns
Train the model using the prepared data pipeline
Evaluate results on the test set

The preprocessing pipeline in the code will handle all the technical details of loading and augmenting the data.
