This page explains how to interpret the various metrics used to evaluate the pneumonia classification model, with special emphasis on their medical significance.
Core Metrics Overview
The model is evaluated using five primary metrics:
- Accuracy: Overall percentage of correct predictions (both normal and pneumonia cases)
- Precision: Of all cases predicted as pneumonia, what percentage actually have pneumonia?
- Recall: Of all actual pneumonia cases, what percentage did the model detect?
- F1-Score: Harmonic mean of precision and recall - balances both metrics
- AUC-ROC: Area Under the ROC Curve - the model’s ability to distinguish between classes
Accuracy
Definition
Formula:
Accuracy = (True Positives + True Negatives) / Total Predictions
What it measures: Percentage of all predictions that are correct.
Interpretation
Example
Target
Limitations
Out of 624 test images:
- 550 correctly classified
- 74 incorrectly classified
Accuracy = 550 / 624 = 88.1%

Goal for this model: >85%
Typical results: 87-92%
Clinical benchmark: Human radiologists achieve ~85-95% on similar tasks

Why accuracy alone isn’t enough: imagine a dataset with:
- 90% pneumonia cases
- 10% normal cases
A naive model that always predicts pneumonia would achieve:
- Accuracy: 90% ✓
- But it would miss 0% of pneumonia (good)
- And incorrectly flag 100% of normal patients (terrible)
This is why we need precision and recall.
Accuracy is a good overall metric but can be misleading with imbalanced datasets (our dataset has 74% pneumonia cases).
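The pitfall above can be reproduced in a few lines of Python (a minimal sketch; the 90/10 split is the hypothetical from the text, not our actual dataset):

```python
# Hypothetical imbalanced dataset: 90% pneumonia (1), 10% normal (0).
labels = [1] * 90 + [0] * 10
preds = [1] * 100  # naive model: always predict pneumonia

# Accuracy = fraction of predictions matching the labels.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks strong, yet every normal patient is misflagged
```

Despite 90% accuracy, this "model" has a 100% false positive rate on normal cases, which is exactly why precision and recall are examined separately below.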
Precision
Definition
Formula:
Precision = True Positives / (True Positives + False Positives)
What it measures: When the model predicts pneumonia, how often is it correct?
Medical Significance
Example
Clinical Impact
Trade-offs
Model predicts pneumonia in 420 cases:
- 380 actually have pneumonia (True Positives)
- 40 are actually normal (False Positives)
Precision = 380 / 420 = 90.5%

High precision means:
- Fewer false alarms
- Less unnecessary follow-up testing
- Lower patient anxiety
- Reduced healthcare costs
Low precision means:
- Many false positives
- Patients undergo unnecessary treatment
- Loss of trust in the AI system
Precision vs Recall: You can achieve 100% precision by being very conservative:
- Only predict pneumonia when absolutely certain
- But you’ll miss many real cases (low recall)
The model must balance both metrics.
Target for this model: >80% precision
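Plugging the example counts into the formula confirms the figure above (a quick sketch; `tp` and `fp` are the 380 and 40 counts from the worked example):

```python
# Precision from the worked example: 380 true positives, 40 false positives.
tp, fp = 380, 40
precision = tp / (tp + fp)
print(round(precision, 3))  # 0.905
```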
Recall (Sensitivity)
Definition
Formula:
Recall = True Positives / (True Positives + False Negatives)
What it measures: Of all actual pneumonia cases, what percentage did we catch?
Why 96% Recall is Critical
In medical screening, false negatives are more dangerous than false positives.
Example
False Negative Risk
False Positive Risk
Why 96%?
Out of 390 actual pneumonia cases:
- 375 correctly detected (True Positives)
- 15 missed (False Negatives)
Recall = 375 / 390 = 96.2%

What happens when we miss pneumonia:
- Patient doesn’t receive treatment
- Pneumonia can worsen rapidly (especially in children)
- Can lead to:
- Hospitalization
- Sepsis
- Respiratory failure
- Death (15% mortality in children <5 years)
Cost: Potentially life-threatening

What happens with a false alarm:
- Doctor orders additional tests (chest CT, blood work)
- May prescribe antibiotics preventatively
- Patient monitored more closely
Cost: Money, time, minor inconvenience
But: Patient’s life is not at risk

Our target: >90% recall, ideally >95%

Rationale:
- Missing <5% of cases is acceptable for screening tool
- Radiologists still review all X-rays
- AI serves as “second opinion” or triage system
- 96.2% recall means we catch 375 out of 390 cases
In practice: The 4% we miss would likely be:
- Very early-stage pneumonia
- Atypical presentations
- Poor image quality
High recall is prioritized in medical screening systems. It’s acceptable to have more false positives (lower precision) if it means catching more true cases of disease.
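The recall calculation from the example can be checked the same way (`tp` and `fn` are the 375 and 15 counts from the text):

```python
# Recall from the worked example: 375 detected, 15 missed.
tp, fn = 375, 15
recall = tp / (tp + fn)
print(round(recall, 3))  # 0.962
```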
F1-Score
Definition
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
What it measures: Harmonic mean of precision and recall.
Why Harmonic Mean?
Arithmetic Mean Problem
Harmonic Mean Advantage
Example Calculation
Arithmetic mean: (Precision + Recall) / 2

Example:
- Precision: 100%
- Recall: 10%
- Arithmetic mean: 55% (misleading!)
This suggests decent performance when recall is terrible.

Harmonic mean: Punishes extreme imbalances

Same example:
- Precision: 100%
- Recall: 10%
- F1-Score: 18% (accurately reflects poor performance)
Property: F1 is only high when BOTH precision and recall are high.

Model performance:
- Precision: 87%
- Recall: 96%
F1 = 2 × (0.87 × 0.96) / (0.87 + 0.96)
F1 = 2 × 0.8352 / 1.83
F1 = 0.913 = 91.3%

F1 is slightly below recall because precision is lower, showing the model favors sensitivity.
Target for this model: >85% F1-Score
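The contrast between the two means can be verified directly (a small sketch using the percentages from the examples above):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance: the arithmetic mean hides the failure, F1 exposes it.
p, r = 1.0, 0.10
print((p + r) / 2)          # 0.55 -- misleadingly decent
print(round(f1(p, r), 2))   # 0.18 -- reflects the terrible recall

# Balanced case from the text:
print(round(f1(0.87, 0.96), 3))  # 0.913
```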
AUC-ROC (Area Under Curve)
Definition
ROC curve: Plots True Positive Rate (Recall) vs False Positive Rate at various classification thresholds
AUC: Area under the ROC curve - a single number summarizing performance across all thresholds
Understanding ROC Curves
How It Works
ROC Curve Plot
AUC Values
The model outputs probabilities:

Image A: 0.92 (92% confident = pneumonia)
Image B: 0.45 (45% confident = ???)
Image C: 0.12 (12% confident = normal)
We need a threshold to decide:
- Probability ≥ threshold → Predict pneumonia
- Probability < threshold → Predict normal
Different thresholds give different results:

Threshold = 0.5 (standard):
- Image A: Pneumonia ✓
- Image B: Normal
- Image C: Normal ✓
Threshold = 0.3 (more sensitive):
- Image A: Pneumonia ✓
- Image B: Pneumonia (maybe wrong)
- Image C: Normal ✓
Axes:
- X-axis: False Positive Rate (FPR) = FP / (FP + TN)
- Y-axis: True Positive Rate (TPR) = Recall = TP / (TP + FN)
Each point represents model performance at a different threshold.

Curve interpretation:
- Top-left corner (0, 1) = Perfect classifier
- Diagonal line = Random guessing
- Area under curve = Overall performance
AUC = 1.0: Perfect classifier
- 100% recall, 0% false positives at optimal threshold
AUC = 0.9-1.0: Excellent
- Our model typically achieves ~0.93-0.96
AUC = 0.8-0.9: Good
AUC = 0.7-0.8: Fair
AUC = 0.5: Random guessing
- Model has no predictive power
AUC < 0.5: Worse than random
- Model is systematically wrong (predictions inverted)
Target for this model: AUC >0.90
Why AUC Matters
AUC is threshold-independent - it measures the model’s ability to rank pneumonia cases higher than normal cases, regardless of where you set the decision boundary.
Advantage: Useful when you might adjust the threshold based on clinical context:
- Emergency room: Lower threshold (0.3) → Catch more cases, tolerate false alarms
- Routine screening: Standard threshold (0.5) → Balanced approach
- Resource-limited: Higher threshold (0.7) → Only flag highly suspicious cases
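The ranking interpretation of AUC can be sketched without any library: it equals the probability that a randomly chosen pneumonia case receives a higher score than a randomly chosen normal case. The scores below are made-up illustrations, not real model outputs:

```python
# Hypothetical model scores (probabilities) for each group.
pos = [0.92, 0.81, 0.45]  # actual pneumonia cases
neg = [0.12, 0.35, 0.60]  # actual normal cases

# Count positive/negative pairs where the positive is ranked higher;
# ties count as half a win. AUC = wins / total pairs.
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(round(auc, 3))  # 0.889 -- 8 of the 9 pairs are ranked correctly
```

Note that no threshold appears anywhere in this calculation, which is exactly what makes AUC useful when the deployment threshold may later change.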
Reading the Confusion Matrix
The confusion matrix shows all four possible outcomes:
                     Predicted
                  Normal   Pneumonia
Actual  Normal      TN        FP
        Pneumonia   FN        TP
Example Confusion Matrix
                     Predicted
                  Normal   Pneumonia
Actual  Normal      215       19     = 234 total
        Pneumonia    15      375     = 390 total
                    230      394     = 624 total
True Negatives (TN)
False Positives (FP)
False Negatives (FN)
True Positives (TP)
Value: 215
Meaning: Correctly identified as normal
Percentage: 215/234 = 91.9% of normal cases
Clinical impact: Healthy patients correctly cleared
Value: 19
Meaning: Normal, but predicted pneumonia
Percentage: 19/234 = 8.1% of normal cases
Clinical impact:
- Unnecessary follow-up tests
- Patient anxiety
- But not dangerous
Value: 15
Meaning: Has pneumonia, but predicted normal
Percentage: 15/390 = 3.8% of pneumonia cases
Clinical impact:
- Most dangerous error type
- Patient might not get treatment
- We minimize this by targeting high recall
Value: 375
Meaning: Correctly identified as pneumonia
Percentage: 375/390 = 96.2% of pneumonia cases
Clinical impact: Patients get appropriate treatment
Deriving Metrics from Confusion Matrix
Accuracy = (TP + TN) / Total
= (375 + 215) / 624 = 94.6%
Precision = TP / (TP + FP)
= 375 / (375 + 19) = 95.2%
Recall = TP / (TP + FN)
= 375 / (375 + 15) = 96.2%
F1-Score = 2 × (P × R) / (P + R)
= 2 × (0.952 × 0.962) / (0.952 + 0.962)
= 95.7%
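The derivations above can be checked in a few lines (the four cell counts are the ones from the example matrix):

```python
# Cell counts from the example confusion matrix.
tn, fp, fn, tp = 215, 19, 15, 375

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (375 + 215) / 624
precision = tp / (tp + fp)                   # 375 / 394
recall = tp / (tp + fn)                      # 375 / 390
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.946 precision=0.952 recall=0.962 f1=0.957
```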
What “Good” Metrics Mean for This Application
Comparison to Baselines
Random Guessing
Naive Classifier
Human Radiologists
Baseline: Flip a coin for each image

Expected metrics:
- Accuracy: ~50%
- Precision: ~50%
- Recall: ~50%
- AUC: 0.5
Our model: >40% better than random

Baseline: Always predict “pneumonia” (since 62.5% of test set has pneumonia)

Metrics:
- Accuracy: 62.5%
- Precision: 62.5%
- Recall: 100% (catches all cases)
- But: 100% false positive rate on normal cases
Our model: Much more balanced

Published benchmarks (varies by study):
- Accuracy: 85-95%
- Recall: 88-97%
- Inter-rater agreement: ~80%
Our model: Within range of human performance on this specific dataset

Important: Radiologists have contextual information (patient history, symptoms) that our model doesn’t use.
Interpreting Model Confidence
The model outputs probabilities, not just binary predictions:
Prediction: PNEUMONIA (98% confident)
Actual: PNEUMONIA ✓
Interpretation: Clear pneumonia markers detected (consolidations, infiltrates)

Prediction: PNEUMONIA (52% confident)
Actual: PNEUMONIA ✓
Interpretation:
- Early-stage pneumonia
- Ambiguous features
- Should be reviewed carefully by radiologist
Prediction: PNEUMONIA (94% confident)
Actual: NORMAL ✗
Interpretation:
- May have pneumonia-like artifacts
- Could be other lung condition
- Model’s most problematic errors
Prediction: NORMAL (51% confident)
Actual: PNEUMONIA ✗
Interpretation:
- Very subtle pneumonia
- Model is uncertain (close to 50% threshold)
- Borderline case even for humans
In clinical deployment, low-confidence predictions (probability between 40% and 60%) should be flagged for mandatory human review.
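A flagging rule along these lines could be sketched as follows (the `triage` helper and its defaults are illustrative, using the 40-60% uncertain band suggested above):

```python
def triage(prob, low=0.40, high=0.60):
    """Route a pneumonia probability to a decision or to human review.

    Probabilities inside the open interval (low, high) are considered
    too uncertain for an automatic call.
    """
    if low < prob < high:
        return "human review"
    return "pneumonia" if prob >= high else "normal"

# The three example images from earlier in the page:
print([triage(p) for p in [0.92, 0.52, 0.12]])
# ['pneumonia', 'human review', 'normal']
```

In practice the band boundaries would be tuned on a validation set against the recall target, since widening the band trades radiologist workload for safety.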
Summary: Metrics Priority for Medical AI
Order of importance for pneumonia screening:
1. Recall (Sensitivity) - Must catch pneumonia cases
2. AUC-ROC - Overall discriminative ability
3. F1-Score - Balance of precision and recall
4. Precision - Avoid too many false alarms
5. Accuracy - Overall correctness
Never evaluate a medical AI model on accuracy alone. Always examine the full confusion matrix and prioritize metrics relevant to clinical consequences.