Overview
PatchCore evaluation computes both image-level and pixel-level anomaly detection performance on test datasets. The evaluation script loads pretrained models and generates comprehensive metrics including AUROC and PRO scores.
Evaluation Script
Use bin/load_and_evaluate_patchcore.py to evaluate saved models:
python bin/load_and_evaluate_patchcore.py \
--gpu 0 \
--seed 0 \
evaluated_results/IM224_WR50 \
patch_core_loader \
-p results/models/mvtec_bottle \
-p results/models/mvtec_cable \
--faiss_on_gpu \
dataset \
--resize 256 \
--imagesize 224 \
-d bottle \
-d cable \
mvtec /path/to/mvtec/data
Specify output directory
First positional argument: evaluated_results/IM224_WR50 - where results will be saved
Configure model loader
Use patch_core_loader with -p flags to specify model paths for each subdataset
Configure dataset
Use dataset command with -d flags to specify which subdatasets to evaluate
Complete Evaluation Example
Here’s a complete evaluation script for all MVTec AD categories (from sample_evaluation.sh:1):
#!/bin/bash
datapath=/path/to/data/from/mvtec
loadpath=/path/to/pretrained/patchcore/model
modelfolder=IM320_Ensemble_L2-3_P001_D1024-384_PS-3_AN-1
savefolder=evaluated_results'/'$modelfolder
datasets=('bottle' 'cable' 'capsule' 'carpet' 'grid' 'hazelnut' \
'leather' 'metal_nut' 'pill' 'screw' 'tile' 'toothbrush' \
'transistor' 'wood' 'zipper')
model_flags=($(for dataset in "${datasets[@]}"; do \
echo '-p '$loadpath'/'$modelfolder'/models/mvtec_'$dataset; done))
dataset_flags=($(for dataset in "${datasets[@]}"; do \
echo '-d '$dataset; done))
python bin/load_and_evaluate_patchcore.py --gpu 0 --seed 0 $savefolder \
patch_core_loader "${model_flags[@]}" --faiss_on_gpu \
dataset --resize 366 --imagesize 320 "${dataset_flags[@]}" mvtec $datapath
Evaluation Metrics
The evaluation computes three key metrics for each dataset:
Image-Level AUROC
# Compute image-level anomaly detection (source:load_and_evaluate_patchcore.py:148)
auroc = patchcore.metrics.compute_imagewise_retrieval_metrics(
    scores, anomaly_labels
)["auroc"]
Measures the ability to classify entire images as normal or anomalous. Uses the maximum anomaly score across all patches in an image.
Area Under the Receiver Operating Characteristic Curve
Ranges from 0 to 1 (1.0 = perfect)
Measures discrimination ability across all thresholds
Robust to class imbalance
Standard metric for anomaly detection
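In outline, the image-level metric reduces each image's per-patch scores to a single score (the maximum) and then computes AUROC against the image labels. Here is a minimal sketch of that idea using scikit-learn; the helper name image_level_auroc is illustrative and not part of the PatchCore library:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def image_level_auroc(patch_scores, labels):
    """Reduce per-patch scores to one score per image, then compute AUROC.

    patch_scores: (n_images, n_patches) array of anomaly scores.
    labels: (n_images,) array, 1 = anomalous, 0 = normal.
    """
    image_scores = patch_scores.max(axis=1)  # max over all patches in each image
    return roc_auc_score(labels, image_scores)

# Toy example: two normal images with low scores, two anomalous images
# that each contain one high-scoring patch.
patch_scores = np.array([[0.1, 0.2], [0.2, 0.1], [0.1, 0.9], [0.2, 0.8]])
labels = np.array([0, 0, 1, 1])
print(image_level_auroc(patch_scores, labels))  # perfect separation -> 1.0
```

Because only the maximum matters, a single strongly anomalous patch is enough to flag the whole image.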
Pixel-Level AUROC (Full)
Computed on all test images including normal images:
# Full pixel-level AUROC (source:load_and_evaluate_patchcore.py:154)
pixel_scores = patchcore.metrics.compute_pixelwise_retrieval_metrics(
    segmentations, masks_gt
)
full_pixel_auroc = pixel_scores["auroc"]
Measures pixel-level localization accuracy across the entire test set.
Pixel-Level AUROC (Anomaly Only)
Computed only on images containing anomalies:
# Select only images with anomalies (source:load_and_evaluate_patchcore.py:160)
sel_idxs = []
for i in range(len(masks_gt)):
    if np.sum(masks_gt[i]) > 0:
        sel_idxs.append(i)
pixel_scores = patchcore.metrics.compute_pixelwise_retrieval_metrics(
    [segmentations[i] for i in sel_idxs],
    [masks_gt[i] for i in sel_idxs],
)
anomaly_pixel_auroc = pixel_scores["auroc"]
Provides a more focused measure of localization performance on defective regions.
Example Output
During evaluation, you’ll see progress and results like this:
Evaluating dataset [mvtec_bottle] (1/15)...
Embedding test data with models (1/1)
Inferring...: 100%|████████████| 83/83 [00:24<00:00, 3.38it/s]
Computing evaluation metrics.
instance_auroc: 1.000
full_pixel_auroc: 0.988
anomaly_pixel_auroc: 0.984
-----
Evaluating dataset [mvtec_cable] (2/15)...
...
The Inferring... progress bar shows prediction progress. Evaluation metrics are computed after all predictions are complete.
Understanding the Evaluation Process
The evaluation pipeline (source:load_and_evaluate_patchcore.py:58):
Load models and datasets
Iterate through dataset/model pairs, loading one PatchCore model per subdataset
Generate predictions
scores, segmentations, labels_gt, masks_gt = PatchCore.predict(
    dataloaders["testing"]
)
Run inference on all test images to get anomaly scores and segmentation maps
Normalize scores
min_scores = scores.min(axis=-1).reshape(-1, 1)
max_scores = scores.max(axis=-1).reshape(-1, 1)
scores = (scores - min_scores) / (max_scores - min_scores)
Normalize to [0, 1] range for consistent thresholding
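A small worked example of this min-max step (plain NumPy, outside the evaluation script): two rows of scores on very different scales both end up spanning [0, 1]:

```python
import numpy as np

# Two rows of raw anomaly scores with very different ranges.
scores = np.array([[2.0, 4.0, 6.0],
                   [10.0, 20.0, 30.0]])

# Per-row min-max normalization, exactly as in the snippet above.
min_scores = scores.min(axis=-1).reshape(-1, 1)
max_scores = scores.max(axis=-1).reshape(-1, 1)
scores = (scores - min_scores) / (max_scores - min_scores)
print(scores)  # both rows become [0.0, 0.5, 1.0]
```

Note the normalization is relative: it preserves the ordering of scores within a row but discards the absolute scale.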
Compute metrics
Calculate AUROC for image-level detection and pixel-level localization
Save results
Store metrics in CSV format with per-dataset breakdown
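The actual writer lives in the evaluation script; purely as a sketch of the output format (the dataset names and numbers mirror the example output above, and the code itself is illustrative), a CSV with a trailing mean row can be produced like this:

```python
import csv
import io

# Hypothetical per-dataset results in evaluation order:
# (dataset_name, instance_auroc, full_pixel_auroc, anomaly_pixel_auroc)
rows = [
    ("mvtec_bottle", 1.000, 0.988, 0.984),
    ("mvtec_cable", 0.993, 0.982, 0.978),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(
    ["dataset_name", "instance_auroc", "full_pixel_auroc", "anomaly_pixel_auroc"]
)
for row in rows:
    writer.writerow(row)

# Append a mean row: average each metric column across all datasets.
means = [sum(r[i] for r in rows) / len(rows) for i in range(1, 4)]
writer.writerow(["mean"] + means)
print(buf.getvalue())
```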
Ensemble Evaluation
For ensemble models, predictions from multiple models are aggregated:
# Aggregate predictions from ensemble members (source:load_and_evaluate_patchcore.py:75)
aggregator = {"scores": [], "segmentations": []}
for i, PatchCore in enumerate(PatchCore_list):
    scores, segmentations, labels_gt, masks_gt = PatchCore.predict(
        dataloaders["testing"]
    )
    aggregator["scores"].append(scores)
    aggregator["segmentations"].append(segmentations)
# Average the collected predictions (each member is min-max normalized first)
scores = np.mean(aggregator["scores"], axis=0)
segmentations = np.mean(aggregator["segmentations"], axis=0)
Each ensemble member is normalized independently before averaging. This prevents models with different score ranges from dominating the ensemble.
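A small NumPy sketch (illustrative, not the library's code) of why per-member normalization matters: without it, the member with the larger raw score range dominates the average:

```python
import numpy as np

# Two ensemble members scoring the same three test images,
# but on very different raw scales.
member_scores = np.array([[1.0, 2.0, 100.0],    # member A: large range
                          [0.10, 0.20, 0.30]])  # member B: small range

# Naive averaging: member A's scale swamps member B's signal.
naive = member_scores.mean(axis=0)

# Normalize each member to [0, 1] independently, then average.
mins = member_scores.min(axis=-1, keepdims=True)
maxs = member_scores.max(axis=-1, keepdims=True)
normalized = (member_scores - mins) / (maxs - mins)
balanced = normalized.mean(axis=0)

print(naive)     # dominated by member A's raw magnitudes
print(balanced)  # both members contribute equally
```

After normalization both members vote on the same [0, 1] scale, so the ensemble reflects agreement rather than raw score magnitude.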
Results Storage
Evaluation results are saved in the specified output directory:
evaluated_results/IM224_WR50/
├── results.csv # Aggregate metrics for all datasets
└── segmentation_images/ # (optional) Visual results
├── mvtec_bottle/
├── mvtec_cable/
└── ...
The results.csv file contains:
dataset_name, instance_auroc, full_pixel_auroc, anomaly_pixel_auroc
mvtec_bottle, 1.000, 0.988, 0.984
mvtec_cable, 0.993, 0.982, 0.978
...
mean, 0.996, 0.984, 0.980
Saving Segmentation Images
To visualize predictions, use the --save_segmentation_images flag:
python bin/load_and_evaluate_patchcore.py \
--gpu 0 \
--save_segmentation_images \
...
This generates overlay images showing:
Original input image
Ground truth mask (if available)
Predicted anomaly heatmap
Anomaly score
Expected Performance
State-of-the-art pretrained models achieve:
Model           Image AUROC   Pixel AUROC   PRO Score
WR50-baseline   99.2%         98.1%         94.4%
Ensemble        99.6%         98.2%         94.9%
Results may vary slightly due to hardware differences and software versions, but should be within 0.1-0.3% of reported values.
Next Steps
Load Models Learn how to load and initialize saved models
Batch Prediction Run inference on custom image datasets