What are Ensemble Models?
PatchCore ensembles combine predictions from multiple backbone networks to achieve state-of-the-art performance. The original paper reports 99.6% instance AUROC on MVTec AD using an ensemble of WideResNet101, ResNext101, and DenseNet201.
Performance Gains:
- Single WideResNet50: 99.2% AUROC
- Ensemble (3 backbones): 99.6% AUROC
- Trade-off: 3x training time, 3x model size
How Ensembles Work
Train Multiple Models
Each backbone is trained independently:
- WideResNet101 → Feature bank 1
- ResNext101 → Feature bank 2
- DenseNet201 → Feature bank 3
Extract Features
During inference, each model extracts features from the test image using its own layers.
Compute Anomaly Scores
Each model computes its own anomaly score and segmentation map using its nearest-neighbor search.
Training an Ensemble
The key difference from single-model training is specifying multiple backbones and their corresponding layers.
Basic Ensemble Syntax
Ensemble Command Structure
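The reference implementation's CLI (`bin/run_patchcore.py`) accepts one `-b` flag per backbone. Below is a minimal sketch of the command structure, assuming the flag conventions from the repository's `sample_training.sh`; in particular, the numeric index prefix on `-le` (which ties a layer to a backbone) is an assumption you should verify against your copy of the script:

```shell
# Sketch only: verify every flag against your sample_training.sh.
# One -b per backbone; each -le is assumed to take an "index.layer" prefix
# binding the layer to the corresponding backbone (0-based).
python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model results \
  patch_core \
    -b wideresnet101 -b resnext101 -b densenet201 \
    -le 0.layer2 -le 0.layer3 \
    -le 1.layer2 -le 1.layer3 \
    -le 2.features.denseblock2 -le 2.features.denseblock3 \
    --faiss_on_gpu --target_embed_dimension 384 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 -d bottle mvtec $datapath
```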
Recommended Ensemble (224x224)
This configuration from the paper achieves 99.3% instance AUROC:
Ensemble - 224x224 Images
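A hedged reconstruction of the 224x224 ensemble run follows; the flag spellings, the layer-index prefix, and the log-group name are assumptions to compare against `sample_training.sh` before use:

```shell
# 224x224 ensemble sketch (~99.3% instance AUROC per the text above).
# All flags assumed from the reference CLI; verify against sample_training.sh.
datapath=/path/to/mvtec   # adjust to your dataset location
python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model \
  --log_group IM224_Ensemble --log_project MVTecAD_Results results \
  patch_core \
    -b wideresnet101 -b resnext101 -b densenet201 \
    -le 0.layer2 -le 0.layer3 -le 1.layer2 -le 1.layer3 \
    -le 2.features.denseblock2 -le 2.features.denseblock3 \
    --faiss_on_gpu --pretrain_embed_dimension 1024 --target_embed_dimension 384 \
    --anomaly_scorer_num_nn 1 --patchsize 3 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 -d bottle mvtec $datapath
```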
Expected Performance
Metrics (from sample_training.sh:12-13)
Best Ensemble (320x320)
For maximum performance, use higher-resolution images (99.6% instance AUROC):
Ensemble - 320x320 Images (Best Performance)
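Relative to the 224x224 command, only the image-size flags on the `dataset` subcommand change. The resize value below is an assumption (check `sample_training.sh` for the exact number used in the paper's 320x320 runs):

```shell
# 320x320 ensemble sketch: identical to the 224x224 command except these
# dataset flags (the resize value is an assumption; verify before use).
  dataset --resize 366 --imagesize 320 -d bottle mvtec $datapath
```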
Expected Performance
Metrics (from sample_training.sh:24-25)
Understanding Layer Selection
Different architectures have different layer naming conventions:
- WideResNet / ResNet
- DenseNet
- ResNext
Available Layers:
- layer1 - Early features (low-level)
- layer2 - Mid-level features (recommended)
- layer3 - High-level features (recommended)
- layer4 - Very high-level features
Recommended: layer2 and layer3
Example
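For instance, pairing each backbone with its own mid- and high-level layers might look like the fragment below. The numeric index prefix is an assumed convention for binding a layer to a backbone; check the reference CLI:

```shell
# Hypothetical -le assignments for a three-backbone ensemble.
-le 0.layer2 -le 0.layer3                              # WideResNet101
-le 1.layer2 -le 1.layer3                              # ResNext101
-le 2.features.denseblock2 -le 2.features.denseblock3  # DenseNet201
```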
Ensemble Configuration Details
Why Different Embedding Dimensions?
Notice the ensemble uses --target_embed_dimension 384 instead of 1024:
Dimension Trade-offs
Single Model (target_embed_dimension: 1024):
- Higher capacity for one backbone
- Larger memory footprint per model
- Better individual model performance
Ensemble (target_embed_dimension: 384):
- Lower dimension per backbone (3 × 384 = 1152 total)
- Reduces memory usage
- Diversity across backbones compensates for lower individual capacity
- Total ensemble capacity is still higher than single model
Approximate memory footprint:
- Single model @ 1024: ~500 MB per category
- Ensemble @ 384: ~300 MB × 3 = ~900 MB per category (vs. ~1500 MB if all three backbones used 1024)
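The arithmetic behind these estimates can be checked directly; the per-bank megabyte figures are the approximations quoted above, not measurements:

```shell
# Rough memory arithmetic for the figures above (MB, approximate).
single_1024=500                     # one backbone at dim 1024
per_bank_384=300                    # one backbone at dim 384
ensemble_384=$((per_bank_384 * 3))  # three banks at dim 384
ensemble_1024=$((single_1024 * 3))  # three banks if they all used dim 1024
echo "$ensemble_384 $ensemble_1024" # prints "900 1500"
```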
Backbone Combinations
You can mix and match different backbones. Here are proven combinations:
Output Structure
Ensemble models save multiple model files per category:
Ensemble Output
Each category has 3 pairs of files (one per backbone). The Ensemble-{i}-{total}_ prefix indicates:
- i: Model index (1, 2, or 3)
- total: Total number of models in the ensemble (3)
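The prefix can be parsed mechanically, e.g. with standard parameter expansion and `cut` (the file name here is hypothetical, used only to show the pattern):

```shell
# Parse the model index and ensemble size from a hypothetical file name.
name="Ensemble-1-3_wideresnet101_bottle"
prefix="${name%%_*}"                     # "Ensemble-1-3"
i="$(echo "$prefix" | cut -d- -f2)"      # model index
total="$(echo "$prefix" | cut -d- -f3)"  # ensemble size
echo "$i $total"                         # prints "1 3"
```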
Training Progress
Ensemble training processes models sequentially:
Training Output
Performance Comparison
Here’s how ensembles compare to single models:
| Configuration | AUROC | Training Time | Model Size | GPU Memory |
|---|---|---|---|---|
| WR50 @ 224 (10%) | 99.2% | 1-2 hours | 1-2 GB | 8-10 GB |
| WR50 @ 224 (1%) | 99.2% | 1-2 hours | 0.5-1 GB | 8-10 GB |
| Ensemble @ 224 (1%) | 99.3% | 3-5 hours | 5-7 GB | 10-12 GB |
| WR50 @ 320 (1%) | 99.3% | 2-3 hours | 1-2 GB | 11-13 GB |
| Ensemble @ 320 (1%) | 99.6% | 6-10 hours | 8-12 GB | 14-16 GB |
Diminishing Returns: The ensemble @ 320 provides a 0.4% AUROC improvement over the single WR50 @ 224 model (0.3% over WR50 @ 320), but requires roughly 3x the resources. Weigh the accuracy gain against your performance and efficiency requirements.
When to Use Ensembles
Use Ensemble When
- Maximum accuracy is critical
- You have 16GB+ GPU available
- Inference speed is not a constraint
- Storage space is not limited
- Targeting 99.5%+ AUROC
Use Single Model When
- Fast inference is required
- Limited GPU memory (under 12 GB)
- Storage is constrained
- 99%+ AUROC is sufficient
- Deploying to edge devices
Advanced: Custom Ensembles
You can create custom ensembles with different architectures:
Troubleshooting
CUDA out of memory with ensemble
Problem: An ensemble requires more GPU memory than a single model.
Solutions:
- Lower the target embedding dimension (e.g. --target_embed_dimension 256)
- Use smaller backbones (e.g. wideresnet50 instead of wideresnet101)
- Reduce the batch size
- Use 224x224 images instead of 320x320
- Train models separately (see below)
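Combining several of these mitigations, a lower-memory run might look like the sketch below; the backbone names, the layer-index prefix, and the `--batch_size` flag are all assumptions to verify against your `run_patchcore.py`:

```shell
# Reduced-memory ensemble sketch: smaller backbones, smaller embedding,
# 224x224 inputs. Flags assumed; verify against the reference CLI.
python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model results \
  patch_core \
    -b wideresnet50 -b resnet101 \
    -le 0.layer2 -le 0.layer3 -le 1.layer2 -le 1.layer3 \
    --faiss_on_gpu --target_embed_dimension 256 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 --batch_size 1 -d bottle mvtec $datapath
```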
Training models separately then combining
You can train each backbone individually and combine them later:
Train Model 1
Train Model 2
Train Model 3
Then load all three during inference for ensemble predictions.
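The three separate runs might be scripted as follows; the log-group names are made up, and the flags are assumed from the reference CLI:

```shell
# Train each backbone on its own; each run writes its own feature bank.
# Flags assumed from the reference CLI; adjust paths and log groups as needed.
python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model \
  --log_group Single_WRN101 results \
  patch_core -b wideresnet101 -le layer2 -le layer3 --target_embed_dimension 384 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 -d bottle mvtec $datapath

python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model \
  --log_group Single_RNext101 results \
  patch_core -b resnext101 -le layer2 -le layer3 --target_embed_dimension 384 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 -d bottle mvtec $datapath

python bin/run_patchcore.py --gpu 0 --seed 0 --save_patchcore_model \
  --log_group Single_DenseN201 results \
  patch_core -b densenet201 -le features.denseblock2 -le features.denseblock3 \
  --target_embed_dimension 384 \
  sampler -p 0.01 approx_greedy_coreset \
  dataset --resize 256 --imagesize 224 -d bottle mvtec $datapath
```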
Layer names not found for DenseNet
Problem: DenseNet uses a different layer naming scheme.
Solution: Use the features.denseblock{N} format:
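For example, following torchvision's module naming for DenseNet:

```shell
# DenseNet201 exposes features.denseblock1..4 rather than layer1..4.
-le features.denseblock2 -le features.denseblock3
```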
Ensemble slower than expected
Problem: Each model processes images sequentially.
This is normal: ensemble training takes approximately N× the single-model time.
Speed tips:
- Use --faiss_on_gpu for all models
- Use a smaller coreset sampling percentage: -p 0.01 (not -p 0.1)
- Use faster backbones (ResNet50 instead of ResNet101)
Next Steps
Configuration Reference
Detailed explanation of all parameters
Model Evaluation
Load and evaluate your trained ensemble
