This page explains how to interpret the various metrics used to evaluate the pneumonia classification model, with special emphasis on their medical significance.

Core Metrics Overview

The model is evaluated using five primary metrics:
  • Accuracy (percentage): Overall percentage of correct predictions (both normal and pneumonia cases)
  • Precision (percentage): Of all cases predicted as pneumonia, what percentage actually have pneumonia?
  • Recall (percentage): Of all actual pneumonia cases, what percentage did the model detect?
  • F1-Score (percentage): Harmonic mean of precision and recall; balances both metrics
  • AUC-ROC (float, 0-1): Area Under the ROC Curve; the model's ability to distinguish between classes

Accuracy

Definition

Formula:
Accuracy = (True Positives + True Negatives) / Total Predictions
What it measures: Percentage of all predictions that are correct.

Interpretation

Out of 624 test images:
  • 550 correctly classified
  • 74 incorrectly classified
Accuracy = 550 / 624 = 88.1%
Accuracy is a good overall metric but can be misleading with imbalanced datasets (our dataset has 74% pneumonia cases).
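The accuracy calculation above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not from the model code):

```python
# Accuracy = (TP + TN) / total predictions.
def accuracy(num_correct: int, num_total: int) -> float:
    return num_correct / num_total

# Worked example from the text: 550 of 624 test images correct.
acc = accuracy(550, 624)
print(f"Accuracy = {acc:.1%}")  # Accuracy = 88.1%
```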

Precision

Definition

Formula:
Precision = True Positives / (True Positives + False Positives)
What it measures: When the model predicts pneumonia, how often is it correct?

Medical Significance

Model predicts pneumonia in 420 cases:
  • 380 actually have pneumonia (True Positives)
  • 40 are actually normal (False Positives)
Precision = 380 / 420 = 90.5%
Target for this model: >80% precision
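The precision calculation above, as a minimal sketch (names are ours, for illustration only):

```python
# Precision = TP / (TP + FP): of all pneumonia predictions, how many were right?
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

# Worked example: 420 pneumonia predictions, 380 of them correct.
p = precision(tp=380, fp=40)
print(f"Precision = {p:.1%}")  # Precision = 90.5%
```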

Recall (Sensitivity)

Definition

Formula:
Recall = True Positives / (True Positives + False Negatives)
What it measures: Of all actual pneumonia cases, what percentage did we catch?

Why 96% Recall is Critical

In medical screening, false negatives are more dangerous than false positives.
Out of 390 actual pneumonia cases:
  • 375 correctly detected (True Positives)
  • 15 missed (False Negatives)
Recall = 375 / 390 = 96.2%
High recall is prioritized in medical screening systems. It’s acceptable to have more false positives (lower precision) if it means catching more true cases of disease.
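The recall calculation above, sketched the same way (illustrative names, not from the model code):

```python
# Recall = TP / (TP + FN): of all actual pneumonia cases, how many were caught?
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Worked example: 390 actual pneumonia cases, 15 missed.
r = recall(tp=375, fn=15)
print(f"Recall = {r:.1%}")  # Recall = 96.2%
```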

F1-Score

Definition

Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
What it measures: Harmonic mean of precision and recall.

Why Harmonic Mean?

Arithmetic mean: (Precision + Recall) / 2

Example:
  • Precision: 100%
  • Recall: 10%
  • Arithmetic mean: 55% (misleading!)
  • Harmonic mean (F1): 18.2%
An arithmetic mean of 55% suggests decent performance when recall is actually terrible; the harmonic mean stays low unless both metrics are high.
Target for this model: >85% F1-Score
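The imbalanced example above can be checked directly (a small sketch; the function name is ours):

```python
# F1 = 2 * (P * R) / (P + R): harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Perfect precision, terrible recall.
p, r = 1.00, 0.10
arithmetic = (p + r) / 2
harmonic = f1_score(p, r)
print(f"Arithmetic mean: {arithmetic:.1%}")  # Arithmetic mean: 55.0%
print(f"F1 (harmonic):   {harmonic:.1%}")   # F1 (harmonic):   18.2%
```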

AUC-ROC (Area Under Curve)

Definition

  • ROC Curve (plot): Plots True Positive Rate (Recall) vs False Positive Rate at various classification thresholds
  • AUC (float, 0-1): Area under the ROC curve; a single number summarizing performance across all thresholds

Understanding ROC Curves

The model outputs probabilities:
Image A: 0.92 (92% confident = pneumonia)
Image B: 0.45 (45% confident = ???)
Image C: 0.12 (12% confident = normal)
We need a threshold to decide:
  • Probability ≥ threshold → Predict pneumonia
  • Probability < threshold → Predict normal
Different thresholds give different results:

Threshold = 0.5 (standard):
  • Image A: Pneumonia ✓
  • Image B: Normal
  • Image C: Normal ✓
Threshold = 0.3 (more sensitive):
  • Image A: Pneumonia ✓
  • Image B: Pneumonia (maybe wrong)
  • Image C: Normal ✓
Target for this model: AUC >0.90
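The thresholding logic above can be sketched as follows (the image names and probabilities come from the example; the function is our illustration):

```python
# Convert probabilities to labels at a given decision threshold.
probs = {"Image A": 0.92, "Image B": 0.45, "Image C": 0.12}

def classify(probabilities: dict, threshold: float) -> dict:
    return {name: ("PNEUMONIA" if p >= threshold else "NORMAL")
            for name, p in probabilities.items()}

standard = classify(probs, threshold=0.5)   # Image B -> NORMAL
sensitive = classify(probs, threshold=0.3)  # Image B -> PNEUMONIA
```

Only the borderline Image B changes its label between the two thresholds; confident predictions are stable.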

Why AUC Matters

AUC is threshold-independent - it measures the model’s ability to rank pneumonia cases higher than normal cases, regardless of where you set the decision boundary.
Advantage: Useful when you might adjust the threshold based on clinical context:
  • Emergency room: Lower threshold (0.3) → Catch more cases, tolerate false alarms
  • Routine screening: Standard threshold (0.5) → Balanced approach
  • Resource-limited: Higher threshold (0.7) → Only flag highly suspicious cases
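One way to see why AUC is threshold-independent: it equals the probability that a randomly chosen pneumonia case receives a higher score than a randomly chosen normal case (the Mann-Whitney U formulation). A sketch with hypothetical scores (the numbers are made up for illustration):

```python
from itertools import product

def auc_from_scores(pos_scores, neg_scores):
    """AUC = probability a random positive outranks a random negative
    (ties count as half a win)."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for five images.
pneumonia_scores = [0.9, 0.7, 0.4]
normal_scores = [0.6, 0.2]
print(auc_from_scores(pneumonia_scores, normal_scores))  # 5/6, i.e. ~0.83
```

No threshold appears anywhere in the computation, only the relative ranking of scores.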

Reading the Confusion Matrix

The confusion matrix shows all four possible outcomes:
                    Predicted
                 Normal  Pneumonia
Actual  Normal     TN       FP
        Pneumonia  FN       TP

Example Confusion Matrix

                    Predicted
                 Normal  Pneumonia
Actual  Normal     215      19        = 234 total
        Pneumonia   15     375        = 390 total
              
                   230     394        = 624 total
Example cell (True Negatives):
  • Value: 215
  • Meaning: Correctly identified as normal
  • Percentage: 215/234 = 91.9% of normal cases
  • Clinical impact: Healthy patients correctly cleared
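A confusion matrix is just four counts tallied from paired labels. A minimal sketch, assuming labels are encoded as 1 = pneumonia and 0 = normal (the toy data below is ours, not from the test set):

```python
# Tally the four confusion-matrix cells from true/predicted label lists.
def confusion_matrix(y_true, y_pred):
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp

# Toy example with five images.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))  # (1, 1, 1, 2)
```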

Deriving Metrics from Confusion Matrix

Accuracy  = (TP + TN) / Total
          = (375 + 215) / 624 = 94.6%

Precision = TP / (TP + FP)
          = 375 / (375 + 19) = 95.2%

Recall    = TP / (TP + FN)
          = 375 / (375 + 15) = 96.2%

F1-Score  = 2 × (P × R) / (P + R)
          = 2 × (0.952 × 0.962) / (0.952 + 0.962)
          = 95.7%
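The derivation above can be reproduced end to end from the four cells (a sketch; variable names are ours):

```python
# The four confusion-matrix cells from the example above.
tn, fp, fn, tp = 215, 19, 15, 375
total = tn + fp + fn + tp                           # 624

accuracy = (tp + tn) / total                        # 94.6%
precision = tp / (tp + fp)                          # 95.2%
recall = tp / (tp + fn)                             # 96.2%
f1 = 2 * precision * recall / (precision + recall)  # 95.7%
```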

What “Good” Metrics Mean for This Application

Screening Tool Perspective

Minimum acceptable metrics:
  • Accuracy: >85%
  • Precision: >80%
  • Recall: >90% (critical)
  • F1-Score: >85%
  • AUC: >0.88
Interpretation: Model is useful as a screening tool, but must be reviewed by radiologist.

Comparison to Baselines

Baseline: Flip a coin for each image

Expected metrics:
  • Accuracy: ~50%
  • Precision: ~50%
  • Recall: ~50%
  • AUC: 0.5
Our model: roughly 40 percentage points above the random baseline on every metric

Interpreting Model Confidence

The model outputs probabilities, not just binary predictions:
Prediction: PNEUMONIA (98% confident)
Actual: PNEUMONIA ✓
Interpretation: Clear pneumonia markers detected (consolidations, infiltrates)
In clinical deployment, low-confidence predictions (probability between 40% and 60%) should be flagged for mandatory human review.
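That review rule can be sketched as a simple band check, assuming probabilities in [0, 1] (the function name and example values are ours):

```python
# Flag predictions whose pneumonia probability falls in the uncertain
# 0.40-0.60 band for mandatory radiologist review.
def needs_human_review(prob: float, low: float = 0.40, high: float = 0.60) -> bool:
    return low <= prob <= high

predictions = [0.98, 0.55, 0.12]
flags = [needs_human_review(p) for p in predictions]
print(flags)  # [False, True, False]
```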

Summary: Metrics Priority for Medical AI

Order of importance for pneumonia screening:
  1. Recall (Sensitivity) - Must catch pneumonia cases
  2. AUC-ROC - Overall discriminative ability
  3. F1-Score - Balance of precision and recall
  4. Precision - Avoid too many false alarms
  5. Accuracy - Overall correctness
Never evaluate a medical AI model on accuracy alone. Always examine the full confusion matrix and prioritize metrics relevant to clinical consequences.
