This page provides a detailed technical breakdown of the CNN architecture used for pneumonia detection in chest X-ray images.

Architecture Overview

The model is a convolutional neural network (CNN) with three convolutional blocks and approximately 11.2 million trainable parameters.
INPUT (224×224×3) → CONV → POOL → CONV → POOL → CONV → POOL → FC → OUTPUT (2)
This architecture is specifically designed to balance model capacity with the limited dataset size (~6K images) while avoiding overfitting.
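The tensor shapes in this pipeline follow from simple arithmetic: each unpadded 3×3 convolution shrinks a side by 2, and each 2×2 max pool halves it (rounding down). A minimal pure-Python sanity check, assuming "valid" (unpadded) convolutions with stride 1 and non-overlapping pooling:

```python
def conv_out(n, kernel=3):
    # "valid" convolution, stride 1: each side shrinks by kernel - 1
    return n - kernel + 1

def pool_out(n, pool=2):
    # non-overlapping max pooling: each side is halved (floor)
    return n // pool

side, channels = 224, 3
for filters in (32, 64, 128):          # the three conv + pool blocks
    side, channels = pool_out(conv_out(side)), filters
    print(f"block -> ({side}, {side}, {channels})")

print("flattened:", side * side * channels)
```

Running this reproduces the per-layer output shapes listed below: (111, 111, 32), (54, 54, 64), (26, 26, 128), and an 86,528-element flattened vector.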

Layer-by-Layer Breakdown

Input Layer

input_shape
tuple
default:"(224, 224, 3)"
Input images are 224×224 pixels with 3 channels. Although chest X-rays are inherently grayscale, the images are handled in 3-channel RGB format to match the model's expected input shape.
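Since the underlying scans are single-channel, one common way to produce the 3-channel input (an assumed preprocessing step, not specified on this page) is to replicate the grayscale intensity across R, G, and B:

```python
# Toy 2×2 grayscale image; a real pipeline would use the full 224×224 scan.
gray = [[0.10, 0.25],
        [0.40, 0.90]]

# Replicate the single intensity value across the three channels.
rgb = [[[pixel] * 3 for pixel in row] for row in gray]

print(rgb[0][0])                               # [0.1, 0.1, 0.1]
print(len(rgb), len(rgb[0]), len(rgb[0][0]))   # 2 2 3
```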

Convolutional Block 1

Conv2D
layer
Filters: 32
Kernel size: 3×3
Activation: ReLU
Output shape: (222, 222, 32)
Why 32 filters? This balances model capacity against computational cost: fewer filters would limit the model's ability to detect diverse features, while more would increase training time unnecessarily.
Why 3×3 kernels? Small kernels capture local patterns such as:
  • Edges and lines
  • Basic textures
  • Intensity gradients
MaxPooling2D
layer
Pool size: 2×2
Output shape: (111, 111, 32)
Purpose of MaxPooling:
  • Reduces spatial dimensions by 50%
  • Provides translation invariance (small position shifts don’t affect output)
  • Reduces computational load for subsequent layers
  • Helps prevent overfitting by abstracting features
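A toy illustration of how 2×2 max pooling downsamples a feature map (plain Python, one channel, a hypothetical 4×4 map for clarity):

```python
def max_pool_2x2(grid):
    """2×2 max pooling with stride 2 over a 2-D list of lists."""
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 5, 6, 2],
        [1, 2, 3, 4]]

print(max_pool_2x2(fmap))   # [[4, 2], [5, 6]]
```

Note the translation invariance: moving the strongest activation anywhere within its 2×2 window leaves the pooled output unchanged.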

Convolutional Block 2

Conv2D
layer
Filters: 64
Kernel size: 3×3
Activation: ReLU
Output shape: (109, 109, 64)
Deeper filters (64 vs 32) capture more complex patterns:
  • Texture combinations
  • Opacity patterns
  • Regional abnormalities
MaxPooling2D
layer
Pool size: 2×2
Output shape: (54, 54, 64)

Convolutional Block 3

Conv2D
layer
Filters: 128
Kernel size: 3×3
Activation: ReLU
Output shape: (52, 52, 128)
The highest-level features detect:
  • Consolidations (pneumonia indicators)
  • Infiltrates
  • Specific disease patterns
MaxPooling2D
layer
Pool size: 2×2
Output shape: (26, 26, 128)

Fully Connected Layers

Flatten
layer
Converts the 3D feature maps (26×26×128) into a 1D vector of 86,528 elements.
Dense
layer
Units: 128
Activation: ReLU
Parameters: ~11.1M
This layer combines extracted features to make classification decisions.
Dropout
regularization
Rate: 0.5 (50%)
Purpose: Prevents overfitting by randomly deactivating neurons during training
Dense (Output)
layer
Units: 2
Activation: Softmax
Classes: [NORMAL, PNEUMONIA]

Activation Functions

ReLU (Hidden Layers)

Formula:
f(x) = max(0, x)
  • Mitigates vanishing gradients: The gradient is exactly 1 for positive inputs, so it does not shrink during backpropagation through deep stacks
  • Computational efficiency: Simple max operation
  • Sparsity: Outputs exactly 0 for negative inputs, creating sparse representations
  • Industry standard: Proven performance in modern CNNs
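The formula is a one-liner; applying it to a few sample pre-activations shows the sparsity property directly (every negative input becomes exactly 0):

```python
def relu(x):
    # f(x) = max(0, x)
    return max(0.0, x)

activations = [relu(v) for v in [-2.0, -0.5, 0.0, 0.7, 1.5]]
print(activations)   # [0.0, 0.0, 0.0, 0.7, 1.5]
```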

Softmax (Output Layer)

Formula:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
  • Probabilistic interpretation: Outputs sum to 1.0
  • Class probabilities: Each output represents P(class | image)
  • Differentiable: Enables gradient-based optimization
  • Standard for classification: Ideal for multi-class problems
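A minimal implementation of the formula above, using the standard max-subtraction trick for numerical stability (softmax is shift-invariant, so the result is unchanged). The logits here are hypothetical, purely for illustration:

```python
import math

def softmax(logits):
    # Subtracting the max avoids overflow in exp() without
    # changing the result (softmax is shift-invariant).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

p_normal, p_pneumonia = softmax([0.5, 2.0])   # hypothetical logits
print(p_normal, p_pneumonia)
print(p_normal + p_pneumonia)                  # 1.0 (within float error)
```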

Parameter Count

Total trainable parameters: ~11.2 million
Conv1:    32 × (3×3×3 + 1) = 896 parameters
Conv2:    64 × (3×3×32 + 1) = 18,496 parameters
Conv3:    128 × (3×3×64 + 1) = 73,856 parameters
Dense1:   128 × (26×26×128) + 128 = 11,075,712 parameters
Output:   2 × (128 + 1) = 258 parameters

Total: 11,169,218 parameters
The majority of parameters (about 99%) are in the first fully connected layer, which combines spatial features for classification.
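The per-layer counts above can be reproduced with two one-line formulas (weights plus one bias per filter or unit):

```python
def conv_params(filters, kernel, in_channels):
    # Each filter: kernel * kernel * in_channels weights + 1 bias.
    return filters * (kernel * kernel * in_channels + 1)

def dense_params(units, in_features):
    # units * in_features weights + one bias per unit.
    return units * in_features + units

counts = {
    "Conv1":  conv_params(32, 3, 3),
    "Conv2":  conv_params(64, 3, 32),
    "Conv3":  conv_params(128, 3, 64),
    "Dense1": dense_params(128, 26 * 26 * 128),
    "Output": dense_params(2, 128),
}
print(counts)
print("total:", sum(counts.values()))   # total: 11169218
```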

Why 3 Convolutional Layers?

Too Shallow (1-2 layers):
  • Insufficient capacity to capture complex pneumonia patterns
  • Lower accuracy on validation data
  • Cannot learn hierarchical features
Optimal (3 layers):
  • Captures low, mid, and high-level features
  • Sufficient for dataset size (~6K images)
  • Balances accuracy and training time
  • Prevents overfitting with limited data
Too Deep (5+ layers):
  • Risk of overfitting with small dataset
  • Longer training time
  • Diminishing returns on accuracy

Comparison to Alternatives

Why Not Transfer Learning?

Advantages of transfer learning:
  • Pre-trained on ImageNet (1.4M images)
  • Better feature extraction
  • Higher potential accuracy
Why not used:
  • Much larger (e.g., VGG16 has ~138M parameters)
  • Harder to explain and interpret
  • Overkill for educational project
  • Longer inference time

Why Not Simpler Models?

Multi-Layer Perceptron (MLP):
  • Would require flattening 224×224×3 = 150,528 input features
  • Loses spatial structure of images
  • Prohibitively large number of parameters
  • Cannot detect translation-invariant features
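To make the parameter blow-up concrete: a single 128-unit hidden layer fed the flattened image already needs more weights than this entire CNN, and over 20,000× more than its first convolution:

```python
flat_inputs = 224 * 224 * 3             # 150,528 features after flattening
hidden_units = 128

mlp_first_layer = hidden_units * flat_inputs + hidden_units
conv_first_layer = 32 * (3 * 3 * 3 + 1)  # Conv1 of the CNN, for comparison

print(mlp_first_layer)    # 19267712
print(conv_first_layer)   # 896
```

This gap exists because the convolution shares its 3×3 weights across every spatial position, while the MLP learns a separate weight for each pixel.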
Recurrent Neural Networks (RNN/LSTM):
  • Designed for sequential data (text, time series)
  • Don’t leverage 2D spatial structure
  • Much slower to train
  • Unnecessarily complex for image classification
CNNs are the industry standard for medical image analysis because they preserve spatial relationships and learn hierarchical features automatically.

Architecture Diagram

┌─────────────────────┐
│  Input: 224×224×3   │  RGB chest X-ray image
└──────────┬──────────┘

┌──────────▼──────────┐
│  Conv2D (32, 3×3)   │  Detect basic features
│  ReLU               │
└──────────┬──────────┘

┌──────────▼──────────┐
│  MaxPool2D (2×2)    │  Downsample to 111×111
└──────────┬──────────┘

┌──────────▼──────────┐
│  Conv2D (64, 3×3)   │  Detect patterns
│  ReLU               │
└──────────┬──────────┘

┌──────────▼──────────┐
│  MaxPool2D (2×2)    │  Downsample to 54×54
└──────────┬──────────┘

┌──────────▼──────────┐
│  Conv2D (128, 3×3)  │  Detect disease markers
│  ReLU               │
└──────────┬──────────┘

┌──────────▼──────────┐
│  MaxPool2D (2×2)    │  Downsample to 26×26
└──────────┬──────────┘

┌──────────▼──────────┐
│  Flatten            │  Convert to 1D (86,528)
└──────────┬──────────┘

┌──────────▼──────────┐
│  Dense (128)        │  Combine features
│  ReLU               │
└──────────┬──────────┘

┌──────────▼──────────┐
│  Dropout (0.5)      │  Prevent overfitting
└──────────┬──────────┘

┌──────────▼──────────┐
│  Dense (2)          │  Classification
│  Softmax            │
└──────────┬──────────┘


    [Normal, Pneumonia]

Technical References

  • He et al. (2016): Deep Residual Learning for Image Recognition
  • Simonyan & Zisserman (2014): Very Deep Convolutional Networks (VGG)
  • Rajpurkar et al. (2017): CheXNet - Radiologist-Level Pneumonia Detection on Chest X-Rays
