Dataset Overview
Chest X-Ray Images (Pneumonia)
Source: Kaggle
Author: Paul Mooney
URL: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
Size: ~1.15 GB (5,863 images)
License: CC BY 4.0
Description
The dataset contains chest X-ray radiographs of pediatric patients (ages 1-5 years) from the Guangzhou Women and Children’s Medical Center, China.
Images are organized into 3 folders (train, test, val), each with one subfolder per category (Pneumonia/Normal).
Dataset Composition
Distribution Statistics
Class Distribution
The dataset has class imbalance:
- Normal cases: ~26% of training data
- Pneumonia cases: ~74% of training data
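One common way to compensate for an imbalance like this is to weight the loss by inverse class frequency. The sketch below is illustrative only and is not necessarily what this project's training code does; the function name balanced_class_weights and the example counts (chosen solely to reproduce the ~26% / ~74% ratio) are assumptions.

```python
def balanced_class_weights(counts):
    """Return weight = total / (n_classes * count) for each class.

    This is the common "balanced" heuristic: a class that holds half its
    fair share of the samples gets a weight of 2.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical counts that only reproduce the ~26% / ~74% ratio above;
# replace them with the real per-class counts from your training folder.
example_counts = {"NORMAL": 26, "PNEUMONIA": 74}
weights = balanced_class_weights(example_counts)
print(weights)
```

With these weights the minority class (Normal) contributes more per sample to the loss than the majority class (Pneumonia).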
Download Instructions
Option 1: Manual Download (Recommended)
Visit Kaggle
- Go to https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia
- If you don’t have a Kaggle account, create one (it’s free)
Download the Dataset
- Click the Download button (top right)
- A file named archive.zip (~1.15 GB) will download
Option 2: Download with Kaggle API
For command-line enthusiasts:
Configure Credentials
- Go to your Kaggle profile → Account → API → Create New Token
- Download kaggle.json
- On Windows, move it to: C:\Users\<username>\.kaggle\kaggle.json
- On Linux/Mac, move it to ~/.kaggle/kaggle.json and run chmod 600 ~/.kaggle/kaggle.json
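As a rough sketch of the API route, the snippet below first checks the credentials file and then downloads the dataset through the official kaggle Python package (installed separately, for example with pip install kaggle). The helper names credential_problems and download_dataset and the target folder "data" are hypothetical; only the dataset slug comes from the URL above.

```python
import os
import stat
from pathlib import Path

DATASET = "paultimothymooney/chest-xray-pneumonia"  # slug taken from the dataset URL


def credential_problems(path=None):
    """Return a list of human-readable problems with kaggle.json (empty if none)."""
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # The group/other permission check only makes sense on POSIX systems.
    if os.name == "posix" and stat.S_IMODE(path.stat().st_mode) & 0o077:
        problems.append(f"{path} is readable by other users; run chmod 600 on it")
    return problems


def download_dataset(target_dir="data"):
    """Download and unzip the dataset with the `kaggle` package."""
    # Imported lazily because importing the package can fail without credentials.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(DATASET, path=target_dir, unzip=True)
```

A typical use would be to call credential_problems() first and only call download_dataset() when it returns an empty list.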
Dataset Verification
After downloading, verify the structure:
Expected Structure
If the verification script reports any missing images or incorrect counts, re-download the dataset to ensure data integrity.
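The project's own verification script is not reproduced here. As a rough stand-in, the following sketch counts the JPEG files in each split and class folder and flags any folder that is empty or missing. The folder names (train/val/test and NORMAL/PNEUMONIA), the function names, and the accepted file suffixes are assumptions; adjust them to match your extracted archive.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")      # assumed split folder names
CLASSES = ("NORMAL", "PNEUMONIA")      # assumed class subfolder names
IMAGE_SUFFIXES = {".jpeg", ".jpg"}


def count_images(root):
    """Return {(split, class): image count}; a missing folder counts as 0."""
    counts = {}
    for split in SPLITS:
        for cls in CLASSES:
            folder = Path(root) / split / cls
            files = folder.iterdir() if folder.is_dir() else []
            counts[(split, cls)] = sum(
                1 for f in files if f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


def report(root):
    """Print a per-folder summary and return True if no folder is empty."""
    counts = count_images(root)
    for (split, cls), n in counts.items():
        status = "ok" if n else "EMPTY OR MISSING"
        print(f"{split}/{cls}: {n} images [{status}]")
    return all(counts.values())
```

Calling report("path/to/extracted/dataset") prints one line per folder; compare the counts against the figures published on the Kaggle page.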
Image Characteristics
Technical Specifications
- Format: JPEG
- Dimensions: Variable (resized to 224×224 in code)
- Channels: 3 (RGB, though X-rays are grayscale)
- Quality: Real clinical-use images
- Bit Depth: 8-bit per channel
Clinical Categories
NORMAL
- Clear lung parenchyma
- No abnormal opacities
- Normal cardiothoracic ratio
- Clear costophrenic angles
PNEUMONIA
- Presence of infiltrates
- Opacities or consolidations
- Increased density in lung fields
- May show air bronchograms
The dataset includes both viral and bacterial pneumonia cases, though they are not separately labeled.
Ethical and Legal Considerations
License Information
The dataset is released under the CC BY 4.0 license, which permits:
- ✅ Commercial and non-commercial use
- ✅ Modification and distribution
- ✅ Research and education
Privacy Protection
- All images have been anonymized
- No patient-identifiable information is present
- Approved for research use
- Compliant with medical data sharing regulations
Responsible Use
Preprocessing Pipeline
Our code automatically applies the following preprocessing:
1. Resizing
- All images resized to 224×224 pixels
- Maintains aspect ratio considerations
- Standard input size for the CNN
2. Normalization
- Pixel values scaled to [0, 1] range
- Divides by 255.0
- Improves training stability
3. Data Augmentation (Training Only)
Rotation
- Random rotations: ±15 degrees
- Simulates different patient positioning
- Improves rotation invariance
Translation
- Horizontal/vertical shifts: ±10%
- Accounts for different X-ray centering
- Enhances spatial robustness
Zoom
- Random zoom: ±10%
- Simulates different patient distances
- Improves scale invariance
Horizontal Flip
- Random horizontal flipping
- Anatomically valid (lungs are roughly symmetric)
- Doubles effective training data
Data augmentation is applied only during training, not during validation or testing, to ensure fair evaluation.
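To make the normalization step and the augmentation ranges concrete, here is a minimal sketch using NumPy. It is not the project's actual pipeline: the function names are hypothetical, and resizing plus the rotation, shift, and zoom warps are left to whichever image library the pipeline uses. The sketch only shows the [0, 1] scaling, how the random parameters could be drawn within the stated ranges, and how augmentation is switched off outside training.

```python
import random

import numpy as np


def normalize(pixels):
    """Scale 8-bit pixel values to floats in [0, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 255.0


def sample_augmentation(rng, training):
    """Draw one set of random augmentation parameters.

    Outside of training the identity transform is returned, so validation
    and test images are only resized and normalized.
    """
    if not training:
        return {"angle_deg": 0.0, "shift_x": 0.0, "shift_y": 0.0,
                "zoom": 1.0, "flip": False}
    return {
        "angle_deg": rng.uniform(-15.0, 15.0),  # rotation: ±15 degrees
        "shift_x": rng.uniform(-0.10, 0.10),    # fraction of image width
        "shift_y": rng.uniform(-0.10, 0.10),    # fraction of image height
        "zoom": rng.uniform(0.90, 1.10),        # scale factor: ±10%
        "flip": rng.random() < 0.5,             # horizontal flip half the time
    }


def apply_flip(image, params):
    """Apply only the horizontal flip; the other warps need an image library."""
    return image[:, ::-1] if params["flip"] else image
```

Passing a seeded random.Random instance as rng keeps the augmentation reproducible between runs.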
Dataset Split Strategy
Our Approach
Since the original validation set is too small, we:
- Combine train and val sets
- Create new split: 80% train / 20% validation
- Keep original test set separate and untouched
Final Distribution
The test set remains completely independent to provide unbiased performance evaluation.
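The 80/20 re-split described above could look roughly like the following. Splitting each class separately (stratification) and the fixed seed are assumptions added here for illustration; the text only specifies the 80% / 20% proportions. The function name stratified_split is hypothetical.

```python
import random


def stratified_split(paths_by_class, val_fraction=0.2, seed=42):
    """Split each class separately so both subsets keep the class ratio.

    paths_by_class maps a label to the combined list of train + val files.
    Returns (train, val) as lists of (path, label) pairs.
    """
    rng = random.Random(seed)
    train, val = [], []
    for label, paths in paths_by_class.items():
        shuffled = sorted(paths)  # sort first so the seed gives a stable order
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val += [(p, label) for p in shuffled[:n_val]]
        train += [(p, label) for p in shuffled[n_val:]]
    return train, val
```

The test folder is never passed to this function, which keeps it independent of both subsets.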
Citation and Attribution
Dataset Citation
Scientific Publication
When using this dataset in your work, please provide proper attribution to respect the creators’ contribution to open medical AI research.
Common Issues and Solutions
Download fails or times out
Solution: Try the manual download option instead of the API. Large files can be interrupted on unstable connections.
Wrong number of images after extraction
Solution: Re-download the dataset. The zip file may have been corrupted during download.
Permission denied when accessing kaggle.json
Solution: On Linux/Mac, run chmod 600 ~/.kaggle/kaggle.json to set proper permissions.
Out of memory during training
Solution: Reduce batch size in training configuration. The dataset is large, and 224×224 RGB images require significant memory.
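To see how batch size affects memory, the following back-of-the-envelope calculation estimates the size of one batch of input images. It assumes 32-bit floats (4 bytes per value), which is an assumption about the training setup, and it covers the input tensor only; intermediate activations and gradients usually need considerably more.

```python
def batch_input_bytes(batch_size, height=224, width=224, channels=3,
                      bytes_per_value=4):
    """Memory needed for one batch of input images (float32 by default)."""
    return batch_size * height * width * channels * bytes_per_value


for batch_size in (64, 32, 16):
    mib = batch_input_bytes(batch_size) / 2**20
    print(f"batch_size={batch_size}: {mib:.1f} MiB of input tensors")
```

Halving the batch size halves this figure, and the activation memory scales down in the same proportion.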
Next Steps
With the dataset downloaded and understood, you’re ready to:
- Verify the data with the verification script
- Explore the images to understand the visual patterns
- Train the model using the prepared data pipeline
- Evaluate results on the test set