This page explains how to interpret the various metrics used to evaluate the pneumonia classification model, with special emphasis on their medical significance.
Core Metrics Overview
The model is evaluated using five primary metrics:
- Accuracy: Overall percentage of correct predictions (both normal and pneumonia cases)
- Precision: Of all cases predicted as pneumonia, what percentage actually have pneumonia?
- Recall: Of all actual pneumonia cases, what percentage did the model detect?
- F1-Score: Harmonic mean of precision and recall - balances both metrics
- AUC-ROC: Area Under the ROC Curve - the model’s ability to distinguish between classes
Accuracy
Definition
Formula:
Accuracy = (True Positives + True Negatives) / Total Predictions
What it measures: Percentage of all predictions that are correct.
Interpretation
Example
Target
Limitations
Out of 624 test images:
- 550 correctly classified
- 74 incorrectly classified
Accuracy = 550 / 624 = 88.1%

Goal for this model: >85%
Typical results: 87-92%
Clinical benchmark: Human radiologists achieve ~85-95% on similar tasks

Why accuracy alone isn’t enough: imagine a dataset with:
- 90% pneumonia cases
- 10% normal cases
A naive model that always predicts pneumonia would achieve:
- Accuracy: 90% ✓
- But it would miss 0% of pneumonia (good)
- And incorrectly flag 100% of normal patients (terrible)
This is why we need precision and recall.
Accuracy is a good overall metric but can be misleading with imbalanced datasets (our dataset has 74% pneumonia cases).
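The pitfall above can be reproduced in a few lines of Python (a minimal sketch; the 90/10 split is the hypothetical from the text, not our actual dataset):

```python
# Hypothetical imbalanced dataset: 90% pneumonia (1), 10% normal (0).
labels = [1] * 90 + [0] * 10
preds = [1] * 100  # naive model: always predict pneumonia

# Accuracy = fraction of predictions matching the labels.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks strong, yet every normal patient is misflagged
```

Despite 90% accuracy, this "model" has a 100% false positive rate on normal cases, which is exactly why precision and recall are examined separately below.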
Precision
Definition
Formula:
Precision = True Positives / (True Positives + False Positives)
What it measures: When the model predicts pneumonia, how often is it correct?
Medical Significance
Example
Clinical Impact
Trade-offs
Model predicts pneumonia in 420 cases:
- 380 actually have pneumonia (True Positives)
- 40 are actually normal (False Positives)
Precision = 380 / 420 = 90.5%

High precision means:
- Fewer false alarms
- Less unnecessary follow-up testing
- Lower patient anxiety
- Reduced healthcare costs
Low precision means:
- Many false positives
- Patients undergo unnecessary treatment
- Loss of trust in the AI system
Precision vs Recall: You can achieve 100% precision by being very conservative:
- Only predict pneumonia when absolutely certain
- But you’ll miss many real cases (low recall)
The model must balance both metrics.
Target for this model: >80% precision
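Plugging the example counts into the formula confirms the figure above (a quick sketch; `tp` and `fp` are the 380 and 40 counts from the worked example):

```python
# Precision from the worked example: 380 true positives, 40 false positives.
tp, fp = 380, 40
precision = tp / (tp + fp)
print(round(precision, 3))  # 0.905
```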
Recall (Sensitivity)
Definition
Formula:
Recall = True Positives / (True Positives + False Negatives)
What it measures: Of all actual pneumonia cases, what percentage did we catch?
Why 96% Recall is Critical
In medical screening, false negatives are more dangerous than false positives.
Example
False Negative Risk
False Positive Risk
Why 96%?
Out of 390 actual pneumonia cases:
- 375 correctly detected (True Positives)
- 15 missed (False Negatives)
Recall = 375 / 390 = 96.2%

What happens when we miss pneumonia:
- Patient doesn’t receive treatment
- Pneumonia can worsen rapidly (especially in children)
- Can lead to:
- Hospitalization
- Sepsis
- Respiratory failure
- Death (15% mortality in children <5 years)
Cost: Potentially life-threatening

What happens with a false alarm:
- Doctor orders additional tests (chest CT, blood work)
- May prescribe antibiotics preventatively
- Patient monitored more closely
Cost: Money, time, minor inconvenience
But: Patient’s life is not at risk

Our target: >90% recall, ideally >95%

Rationale:
- Missing <5% of cases is acceptable for screening tool
- Radiologists still review all X-rays
- AI serves as “second opinion” or triage system
- 96.2% recall means we catch 375 out of 390 cases
In practice: The 4% we miss would likely be:
- Very early-stage pneumonia
- Atypical presentations
- Poor image quality
High recall is prioritized in medical screening systems. It’s acceptable to have more false positives (lower precision) if it means catching more true cases of disease.
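The recall calculation from the example can be checked the same way (`tp` and `fn` are the 375 and 15 counts from the text):

```python
# Recall from the worked example: 375 detected, 15 missed.
tp, fn = 375, 15
recall = tp / (tp + fn)
print(round(recall, 3))  # 0.962
```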
F1-Score
Definition
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
What it measures: Harmonic mean of precision and recall.
Why Harmonic Mean?
Arithmetic Mean Problem
Harmonic Mean Advantage
Example Calculation
Arithmetic mean: (Precision + Recall) / 2

Example:
- Precision: 100%
- Recall: 10%
- Arithmetic mean: 55% (misleading!)
This suggests decent performance when recall is terrible.

Harmonic mean: Punishes extreme imbalances

Same example:
- Precision: 100%
- Recall: 10%
- F1-Score: 18% (accurately reflects poor performance)
Property: F1 is only high when BOTH precision and recall are high.

Model performance:
- Precision: 87%
- Recall: 96%
F1 = 2 × (0.87 × 0.96) / (0.87 + 0.96)
F1 = 2 × 0.8352 / 1.83
F1 = 0.913 = 91.3%

F1 is slightly below recall because precision is lower, showing the model favors sensitivity.
Target for this model: >85% F1-Score
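The contrast between the two means can be verified directly (a small sketch using the percentages from the examples above):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Extreme imbalance: the arithmetic mean hides the failure, F1 exposes it.
p, r = 1.0, 0.10
print((p + r) / 2)          # 0.55 -- misleadingly decent
print(round(f1(p, r), 2))   # 0.18 -- reflects the terrible recall

# Balanced case from the text:
print(round(f1(0.87, 0.96), 3))  # 0.913
```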
AUC-ROC (Area Under Curve)
Definition
ROC curve: Plots True Positive Rate (Recall) vs False Positive Rate at various classification thresholds
AUC: Area under the ROC curve - a single number summarizing performance across all thresholds
Understanding ROC Curves
How It Works
ROC Curve Plot
AUC Values
The model outputs probabilities:

Image A: 0.92 (92% confident = pneumonia)
Image B: 0.45 (45% confident = ???)
Image C: 0.12 (12% confident = normal)
We need a threshold to decide:
- Probability ≥ threshold → Predict pneumonia
- Probability < threshold → Predict normal
Different thresholds give different results:

Threshold = 0.5 (standard):
- Image A: Pneumonia ✓
- Image B: Normal
- Image C: Normal ✓
Threshold = 0.3 (more sensitive):
- Image A: Pneumonia ✓
- Image B: Pneumonia (maybe wrong)
- Image C: Normal ✓
Axes:
- X-axis: False Positive Rate (FPR) = FP / (FP + TN)
- Y-axis: True Positive Rate (TPR) = Recall = TP / (TP + FN)
Each point represents model performance at a different threshold.

Curve interpretation:
- Top-left corner (0, 1) = Perfect classifier
- Diagonal line = Random guessing
- Area under curve = Overall performance
AUC = 1.0: Perfect classifier
- 100% recall, 0% false positives at optimal threshold
AUC = 0.9-1.0: Excellent
- Our model typically achieves ~0.93-0.96
AUC = 0.8-0.9: Good
AUC = 0.7-0.8: Fair
AUC = 0.5: Random guessing
- Model has no predictive power
AUC < 0.5: Worse than random
- Model is systematically wrong (predictions inverted)
Target for this model: AUC >0.90
Why AUC Matters
AUC is threshold-independent - it measures the model’s ability to rank pneumonia cases higher than normal cases, regardless of where you set the decision boundary.
Advantage: Useful when you might adjust the threshold based on clinical context:
- Emergency room: Lower threshold (0.3) → Catch more cases, tolerate false alarms
- Routine screening: Standard threshold (0.5) → Balanced approach
- Resource-limited: Higher threshold (0.7) → Only flag highly suspicious cases
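The ranking interpretation of AUC can be sketched without any library: it equals the probability that a randomly chosen pneumonia case receives a higher score than a randomly chosen normal case. The scores below are made-up illustrations, not real model outputs:

```python
# Hypothetical model scores (probabilities) for each group.
pos = [0.92, 0.81, 0.45]  # actual pneumonia cases
neg = [0.12, 0.35, 0.60]  # actual normal cases

# Count positive/negative pairs where the positive is ranked higher;
# ties count as half a win. AUC = wins / total pairs.
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(round(auc, 3))  # 0.889 -- 8 of the 9 pairs are ranked correctly
```

Note that no threshold appears anywhere in this calculation, which is exactly what makes AUC useful when the deployment threshold may later change.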
Reading the Confusion Matrix
The confusion matrix shows all four possible outcomes:
                     Predicted
                  Normal   Pneumonia
Actual  Normal      TN        FP
        Pneumonia   FN        TP
Example Confusion Matrix
                     Predicted
                  Normal   Pneumonia
Actual  Normal      215       19     = 234 total
        Pneumonia    15      375     = 390 total
                    230      394     = 624 total
True Negatives (TN)
False Positives (FP)
False Negatives (FN)
True Positives (TP)
Value: 215
Meaning: Correctly identified as normal
Percentage: 215/234 = 91.9% of normal cases
Clinical impact: Healthy patients correctly cleared
Value: 19
Meaning: Normal, but predicted pneumonia
Percentage: 19/234 = 8.1% of normal cases
Clinical impact:
- Unnecessary follow-up tests
- Patient anxiety
- But not dangerous
Value: 15
Meaning: Has pneumonia, but predicted normal
Percentage: 15/390 = 3.8% of pneumonia cases
Clinical impact:
- Most dangerous error type
- Patient might not get treatment
- We minimize this by targeting high recall
Value: 375
Meaning: Correctly identified as pneumonia
Percentage: 375/390 = 96.2% of pneumonia cases
Clinical impact: Patients get appropriate treatment
Deriving Metrics from Confusion Matrix
Accuracy = (TP + TN) / Total
= (375 + 215) / 624 = 94.6%
Precision = TP / (TP + FP)
= 375 / (375 + 19) = 95.2%
Recall = TP / (TP + FN)
= 375 / (375 + 15) = 96.2%
F1-Score = 2 × (P × R) / (P + R)
= 2 × (0.952 × 0.962) / (0.952 + 0.962)
= 95.7%
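The derivations above can be checked in a few lines (the four cell counts are the ones from the example matrix):

```python
# Cell counts from the example confusion matrix.
tn, fp, fn, tp = 215, 19, 15, 375

accuracy = (tp + tn) / (tp + tn + fp + fn)   # (375 + 215) / 624
precision = tp / (tp + fp)                   # 375 / 394
recall = tp / (tp + fn)                      # 375 / 390
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.946 precision=0.952 recall=0.962 f1=0.957
```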
What “Good” Metrics Mean for This Application
Comparison to Baselines
Random Guessing
Naive Classifier
Human Radiologists
Baseline: Flip a coin for each image

Expected metrics:
- Accuracy: ~50%
- Precision: ~50%
- Recall: ~50%
- AUC: 0.5
Our model: >40% better than random

Baseline: Always predict “pneumonia” (since 62.5% of test set has pneumonia)

Metrics:
- Accuracy: 62.5%
- Precision: 62.5%
- Recall: 100% (catches all cases)
- But: 100% false positive rate on normal cases
Our model: Much more balanced

Published benchmarks (varies by study):
- Accuracy: 85-95%
- Recall: 88-97%
- Inter-rater agreement: ~80%
Our model: Within range of human performance on this specific dataset

Important: Radiologists have contextual information (patient history, symptoms) that our model doesn’t use.
Interpreting Model Confidence
The model outputs probabilities, not just binary predictions:
Prediction: PNEUMONIA (98% confident)
Actual: PNEUMONIA ✓
Interpretation: Clear pneumonia markers detected (consolidations, infiltrates)

Prediction: PNEUMONIA (52% confident)
Actual: PNEUMONIA ✓
Interpretation:
- Early-stage pneumonia
- Ambiguous features
- Should be reviewed carefully by radiologist
Prediction: PNEUMONIA (94% confident)
Actual: NORMAL ✗
Interpretation:
- May have pneumonia-like artifacts
- Could be other lung condition
- Model’s most problematic errors
Prediction: NORMAL (51% confident)
Actual: PNEUMONIA ✗
Interpretation:
- Very subtle pneumonia
- Model is uncertain (close to 50% threshold)
- Borderline case even for humans
In clinical deployment, low-confidence predictions (probability between 40% and 60%) should be flagged for mandatory human review.
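A flagging rule along these lines could be sketched as follows (the `triage` helper and its defaults are illustrative, using the 40-60% uncertain band suggested above):

```python
def triage(prob, low=0.40, high=0.60):
    """Route a pneumonia probability to a decision or to human review.

    Probabilities inside the open interval (low, high) are considered
    too uncertain for an automatic call.
    """
    if low < prob < high:
        return "human review"
    return "pneumonia" if prob >= high else "normal"

# The three example images from earlier in the page:
print([triage(p) for p in [0.92, 0.52, 0.12]])
# ['pneumonia', 'human review', 'normal']
```

In practice the band boundaries would be tuned on a validation set against the recall target, since widening the band trades radiologist workload for safety.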
Summary: Metrics Priority for Medical AI
Order of importance for pneumonia screening:
1. Recall (Sensitivity) - Must catch pneumonia cases
2. AUC-ROC - Overall discriminative ability
3. F1-Score - Balance of precision and recall
4. Precision - Avoid too many false alarms
5. Accuracy - Overall correctness
Never evaluate a medical AI model on accuracy alone. Always examine the full confusion matrix and prioritize metrics relevant to clinical consequences.