Overview
This page provides comprehensive benchmark results for PatchCore models on the MVTec AD industrial anomaly detection dataset.
Mean performance across all 15 MVTec AD categories:
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 99.2% | 98.1% | 94.4% |
| Ensemble | 99.6% | 98.2% | 94.9% |
The ensemble model combines Wide ResNet-101, ResNeXt-101, and DenseNet-201 backbones for superior performance.
Model Configurations
WideResNet50 Baseline
Configuration:
- Backbone: Wide ResNet-50
- Layers: layer2, layer3
- Image size: 224×224
- Coreset: 10%
- Embeddings: 1024 → 1024
- Patch size: 3
- Neighbors: 1
Model ID: IM224_WR50_L2-3_P01_D1024-1024_PS-3_AN-1
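The patch size of 3 means each spatial location's feature is aggregated over its 3×3 neighborhood before being stored in the memory bank. A minimal numpy sketch of this locally aware aggregation (simple mean pooling with zero padding; `patchify` is an illustrative name, and PatchCore's actual implementation unfolds and adaptively pools the backbone feature maps):

```python
import numpy as np

def patchify(fmap: np.ndarray, patchsize: int = 3) -> np.ndarray:
    """Aggregate each location's patchsize x patchsize neighborhood by
    mean pooling (stride 1, zero padding).

    fmap: (C, H, W) feature map -> (H*W, C) patch features.
    """
    c, h, w = fmap.shape
    pad = patchsize // 2
    padded = np.pad(fmap, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty((h * w, c))
    for i in range(h):
        for j in range(w):
            # Mean over the local window around (i, j), per channel.
            window = padded[:, i:i + patchsize, j:j + patchsize]
            out[i * w + j] = window.mean(axis=(1, 2))
    return out
```

Features from layer2 and layer3 are each patchified this way, upsampled to a common resolution, and concatenated before the embedding projection.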
Ensemble Model
Configuration:
- Backbones: Wide ResNet-101, ResNeXt-101, DenseNet-201
- Layers: layer2+layer3 (ResNets), denseblock2+denseblock3 (DenseNet)
- Image size: 224×224
- Coreset: 1%
- Embeddings: 1024 → 384
- Patch size: 3
- Neighbors: 1
Model ID: IM224_Ensemble_L2-3_P001_D1024-384_PS-3_AN-1
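The coreset percentage controls how many patch features are retained in the memory bank; selection is greedy so the kept points cover the feature space. A simplified sketch of greedy (farthest-point) coreset selection, assuming the exact O(n·k) variant rather than the repo's `approx_greedy_coreset`:

```python
import numpy as np

def greedy_coreset(features: np.ndarray, fraction: float, seed: int = 0) -> np.ndarray:
    """Select a coreset by greedy farthest-point sampling.

    Returns indices of the selected rows of `features`.
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    k = max(1, int(n * fraction))
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(min_dist))  # farthest remaining point
        selected.append(idx)
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.array(selected)
```

The seed-dependent starting point is why coreset sampling contributes to the run-to-run variability noted below.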
Object Categories
Bottle
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 100.0% | 98.5% | 73.7% |
| Ensemble (Run 1) | 100.0% | 98.5% | 73.7% |
| Ensemble (Run 2) | 100.0% | 98.7% | 73.2% |
Cable
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 99.9% | 98.5% | 57.6% |
| Ensemble (Run 1) | 99.7% | 98.4% | 57.5% |
| Ensemble (Run 2) | 99.8% | 98.1% | 57.2% |
Capsule
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 98.3% | 99.1% | 80.4% |
| Ensemble (Run 1) | 97.9% | 98.9% | 80.2% |
| Ensemble (Run 2) | 98.7% | 99.2% | 79.9% |
Hazelnut
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 100.0% | 98.7% | 58.6% |
| Ensemble (Run 1) | 100.0% | 98.7% | 59.1% |
| Ensemble (Run 2) | 100.0% | 98.9% | 57.7% |
Metal Nut
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 100.0% | 98.6% | 75.9% |
| Ensemble (Run 1) | 99.9% | 98.3% | 75.1% |
| Ensemble (Run 2) | 100.0% | 98.8% | 77.4% |
Pill
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 97.7% | 97.8% | 79.3% |
| Ensemble (Run 1) | 96.7% | 97.8% | 79.7% |
| Ensemble (Run 2) | 98.3% | 97.7% | 80.7% |
Screw
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 98.9% | 99.4% | 72.8% |
| Ensemble (Run 1) | 98.8% | 99.5% | 73.3% |
| Ensemble (Run 2) | 99.2% | 99.6% | 73.5% |
Toothbrush
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 100.0% | 98.7% | 67.5% |
| Ensemble (Run 1) | 100.0% | 98.6% | 67.7% |
| Ensemble (Run 2) | 100.0% | 98.9% | 68.5% |
Transistor
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 100.0% | 96.7% | 34.1% |
| Ensemble (Run 1) | 99.9% | 96.1% | 33.3% |
| Ensemble (Run 2) | 99.9% | 94.1% | 32.8% |
Transistor is the most challenging category for localization, with PRO scores well below every other class; its complex, small-scale structural defects are hard to delineate precisely.
Zipper
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 99.8% | 98.9% | 76.9% |
| Ensemble (Run 1) | 99.5% | 98.9% | 77.1% |
| Ensemble (Run 2) | 99.7% | 99.2% | 77.6% |
Texture Categories
Carpet
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 98.9% | 98.9% | 74.0% |
| Ensemble (Run 1) | 98.6% | 99.1% | 73.7% |
| Ensemble (Run 2) | 99.6% | 99.1% | 74.7% |
Grid
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 98.2% | 98.5% | 69.4% |
| Ensemble (Run 1) | 97.9% | 98.8% | 70.0% |
| Ensemble (Run 2) | 99.5% | 99.1% | 70.2% |
Leather
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 100.0% | 99.2% | 73.5% |
| Ensemble (Run 1) | 100.0% | 99.3% | 73.6% |
| Ensemble (Run 2) | 100.0% | 99.4% | 73.7% |
Tile
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 98.9% | 95.7% | 64.2% |
| Ensemble (Run 1) | 99.5% | 95.7% | 64.5% |
| Ensemble (Run 2) | 98.8% | 96.8% | 65.7% |
Wood
| Model | Image AUROC | Pixel AUROC | PRO Score |
|---|---|---|---|
| WR50 Baseline | 99.5% | 94.8% | 67.7% |
| Ensemble (Run 1) | 99.1% | 95.1% | 68.5% |
| Ensemble (Run 2) | 99.7% | 96.0% | 70.6% |
Metric Definitions
Image-Level AUROC
Area under the receiver operating characteristic (ROC) curve for image-level anomaly classification.
- Task: Binary classification (normal vs. anomalous)
- Range: 0% to 100% (higher is better)
- Interpretation: Probability that a randomly chosen anomalous image scores higher than a randomly chosen normal image
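This rank interpretation maps directly to a computation; a minimal numpy sketch (`image_auroc` is an illustrative helper, not the repo's implementation):

```python
import numpy as np

def image_auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUROC as a rank statistic: probability that a random anomalous
    image scores higher than a random normal image (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # All pairwise comparisons: wins plus half-credit for ties.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity.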
Pixel-Level AUROC
Area under the ROC curve for pixel-wise anomaly localization.
- Task: Pixel-level anomaly segmentation
- Range: 0% to 100% (higher is better)
- Evaluation: Per-pixel classification accuracy
PRO Score (Per-Region Overlap)
Measures the overlap between predicted and ground truth anomalous regions at various thresholds.
- Task: Anomaly region localization quality
- Range: 0% to 100% (higher is better)
- Calculation: Per-region overlap integrated over detection thresholds, up to a false-positive-rate limit (commonly 30%)
- Focus: Connected component-level accuracy
The PRO score is more sensitive to localization quality than pixel-level AUROC, especially for small defects, because each ground-truth region contributes equally regardless of its size.
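The core of the metric at a single threshold can be sketched as follows (`per_region_overlap` is an illustrative name; the full PRO score integrates this quantity over thresholds up to the false-positive-rate limit and normalizes):

```python
import numpy as np

def per_region_overlap(pred_mask: np.ndarray, gt_regions: np.ndarray) -> float:
    """Mean per-region overlap at one threshold.

    pred_mask  -- boolean map of pixels predicted anomalous
    gt_regions -- integer map: 0 = background, 1..R = connected
                  ground-truth defect regions (e.g. from scipy.ndimage.label)
    """
    overlaps = []
    for r in range(1, int(gt_regions.max()) + 1):
        region = gt_regions == r
        # Fraction of this defect region recovered by the prediction.
        overlaps.append(pred_mask[region].mean())
    return float(np.mean(overlaps))
```

Averaging over regions rather than pixels is what prevents a single large defect from dominating the score.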
Evaluation Metrics
Each model reports five key metrics:
- instance_auroc - Image-level anomaly detection AUROC
- full_pixel_auroc - Pixel-level AUROC across all test images
- full_pro - PRO score across all test images
- anomaly_pixel_auroc - Pixel-level AUROC on anomalous images only
- anomaly_pro - PRO score on anomalous images only
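These metrics are all derived from patch-level anomaly scores: with `--anomaly_scorer_num_nn 1`, each test patch is scored by its distance to the nearest memory-bank entry, and the image-level score is the maximum patch score. A simplified numpy sketch (omitting PatchCore's neighborhood-based score re-weighting; `anomaly_scores` is an illustrative name):

```python
import numpy as np

def anomaly_scores(patches: np.ndarray, memory_bank: np.ndarray):
    """Score each test patch by distance to its nearest coreset entry
    (1-NN); the image score is the maximum patch score."""
    # (n_patches, n_bank) pairwise Euclidean distances.
    d = np.linalg.norm(patches[:, None, :] - memory_bank[None, :, :], axis=2)
    patch_scores = d.min(axis=1)
    return patch_scores, float(patch_scores.max())
```

Patch scores reshaped to the feature-map grid and upsampled give the pixel-level anomaly map used for the segmentation metrics.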
Results Variability
Performance may vary slightly due to:
- Random seed - Affects coreset sampling
- Hardware differences - GPU/CPU implementations
- FAISS version - Nearest neighbor search variations
- Software versions - PyTorch, timm, etc.
Typical variation: ±0.1-0.3 percentage points of AUROC across runs
Reproducing Results
WideResNet50 Baseline
```shell
datapath=/path/to/mvtec
datasets=('bottle' 'cable' 'capsule' 'carpet' 'grid' 'hazelnut' \
          'leather' 'metal_nut' 'pill' 'screw' 'tile' 'toothbrush' \
          'transistor' 'wood' 'zipper')
dataset_flags=($(for dataset in "${datasets[@]}"; do echo '-d '$dataset; done))

python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model \
  --log_group IM224_WR50_L2-3_P01_D1024-1024_PS-3_AN-1_S0 \
  --log_project MVTecAD_Results results \
  patch_core -b wideresnet50 -le layer2 -le layer3 --faiss_on_gpu \
  --pretrain_embed_dimension 1024 --target_embed_dimension 1024 \
  --anomaly_scorer_num_nn 1 --patchsize 3 \
  sampler -p 0.1 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 "${dataset_flags[@]}" mvtec $datapath
```
Ensemble Model
```shell
# Reuses datapath and dataset_flags from the baseline script above.
python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model \
  --log_group IM224_Ensemble_L2-3_P001_D1024-384_PS-3_AN-1 \
  --log_project MVTecAD_Results results \
  patch_core -b wideresnet101 -b resnext101 -b densenet201 \
  -le 0.layer2 -le 0.layer3 -le 1.layer2 -le 1.layer3 \
  -le 2.features.denseblock2 -le 2.features.denseblock3 --faiss_on_gpu \
  --pretrain_embed_dimension 1024 --target_embed_dimension 384 \
  --anomaly_scorer_num_nn 1 --patchsize 3 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 "${dataset_flags[@]}" mvtec $datapath
```
Higher Resolution Results
Models trained on 320×320 images:
IM320 WideResNet50
Mean Performance:
- Image AUROC: 99.3%
- Pixel AUROC: 97.8%
- PRO Score: 94.3%
Configuration: IM320_WR50_L2-3_P001_D1024-1024_PS-3_AN-1
IM320 Ensemble
Mean Performance:
- Image AUROC: 99.6%
- Pixel AUROC: 98.2%
- PRO Score: 94.9%
Configuration: IM320_Ensemble_L2-3_P001_D1024-384_PS-3_AN-1
Higher-resolution (320×320) models can improve pixel-level localization of fine defects, but they require more memory and compute.
State-of-the-Art Comparison
PatchCore achieves competitive or superior results compared to other methods on MVTec AD:
| Method | Image AUROC | Pixel AUROC | Year |
|---|---|---|---|
| PatchCore (Ensemble) | 99.6% | 98.2% | 2021 |
| PatchCore (WR50) | 99.2% | 98.1% | 2021 |
| PaDiM | 95.3% | 96.7% | 2020 |
| SPADE | 85.5% | 95.5% | 2020 |
| CFlow-AD | 98.7% | 98.6% | 2021 |
| FastFlow | 99.4% | 98.5% | 2021 |
Training Time
Approximate training times on RTX 3090 GPU:
| Model | Per Category | All 15 Categories |
|---|---|---|
| WR50 Baseline | ~5-10 min | ~90 min |
| Ensemble | ~15-20 min | ~5 hours |
Note: “Training” refers to coreset extraction and memory bank construction (no gradient updates).
Inference Time
Approximate inference times on RTX 3090 GPU (per image):
| Model | 224×224 | 320×320 |
|---|---|---|
| WR50 Baseline | ~20ms | ~35ms |
| Ensemble | ~50ms | ~80ms |
Memory Requirements
Training
| Model | GPU Memory | Disk Space (per category) |
|---|---|---|
| WR50 Baseline | ~8GB | ~10-50MB |
| Ensemble | ~11GB | ~30-150MB |
Inference
| Model | GPU Memory | RAM |
|---|---|---|
| WR50 Baseline | ~6GB | ~4GB |
| Ensemble | ~9GB | ~8GB |
Best Practices
- For production: Use WR50 baseline (best speed/accuracy trade-off)
- For highest accuracy: Use ensemble model
- For larger images: Train at 320×320 resolution
- For limited memory: Reduce coreset percentage or use smaller backbone
- For fastest inference: Use ResNet-50 instead of Wide ResNet-50
Citation
These results are based on:
```bibtex
@article{roth2021total,
  title={Towards Total Recall in Industrial Anomaly Detection},
  author={Roth, Karsten and Pemula, Latha and Zepeda, Joaquin and Sch{\"o}lkopf, Bernhard and Brox, Thomas and Gehler, Peter},
  journal={arXiv preprint arXiv:2106.08265},
  year={2021}
}
```