Evaluation Overview
After training completes, the YOLOv11 model is automatically evaluated on the validation set. This section covers how to interpret results, perform additional evaluations, and fine-tune your model for better performance.
Evaluation metrics are automatically generated during training and saved to runs/yolov11/train/results.csv.
Evaluation Metrics
YOLOv11 uses several standard metrics to assess model performance:
Primary Metrics
| Metric | Description | Target Value |
|--------|-------------|--------------|
| mAP50 | Mean Average Precision at 50% IoU | > 0.80 (good) |
| mAP50-95 | Mean Average Precision averaged over IoU 50-95% | > 0.60 (good) |
| Precision | Ratio of true positives to all positive predictions | > 0.85 (good) |
| Recall | Ratio of true positives to all actual positives | > 0.80 (good) |
| F1 Score | Harmonic mean of precision and recall | > 0.82 (good) |
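As a quick sanity check on how these metrics relate, the F1 score is the harmonic mean of precision and recall; computing it at the target values above:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# At the target values (P = 0.85, R = 0.80):
print(round(f1(0.85, 0.80), 3))  # -> 0.824
```

Note that the harmonic mean is always pulled toward the lower of the two values, which is why a model cannot hide poor recall behind high precision.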
Segmentation-Specific Metrics
| Metric | Description |
|--------|-------------|
| mask mAP50 | Segmentation mask mAP at 50% IoU |
| mask mAP50-95 | Segmentation mask mAP averaged over IoU thresholds 50-95% |
| IoU | Intersection over Union for segmentation masks |
Segmentation metrics are typically lower than detection metrics because they require pixel-perfect predictions, not just bounding boxes.
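The pixel-level comparison behind these mask metrics can be sketched directly with NumPy. `mask_iou` below is an illustrative helper (not an Ultralytics API) and assumes binary masks:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-wise IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: two 4x4 masks overlapping on one row
a = np.zeros((4, 4)); a[:2, :] = 1   # rows 0-1
b = np.zeros((4, 4)); b[1:3, :] = 1  # rows 1-2
print(mask_iou(a, b))  # 4 shared pixels / 12 total -> 0.333...
```

Because a bounding box can score well even when the mask misses thin edges or boundaries, mask IoU (and hence mask mAP) usually comes out lower than the box equivalent.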
Understanding mAP (Mean Average Precision)
What is mAP?
mAP is the most important metric for object detection and segmentation:
- **Average Precision (AP)**: area under the precision-recall curve for a single class
- **Mean Average Precision (mAP)**: average of AP across all classes
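To make the AP definition concrete, here is a minimal all-point-interpolation AP over a toy precision-recall curve. This is a simplified sketch of what COCO-style evaluators do internally, not the exact Ultralytics implementation:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Replace each precision with the max precision at any higher recall (envelope)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy PR points: precision drops as recall rises
recall = np.array([0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.7, 0.5])
print(round(average_precision(recall, precision), 3))  # -> 0.62
```

mAP is then simply the mean of this value over all classes (and, for mAP50-95, over the IoU thresholds as well).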
mAP Thresholds
# mAP50: IoU threshold = 50%
# A prediction counts as correct if its overlap with the ground truth is >= 50%
# This is the more lenient of the two mAP metrics
mAP50 = 0.85  # example value
Per-Class mAP
The model reports mAP for each trash category:
| Class | Images | Instances | P | R | mAP50 | mAP50-95 |
|-------|--------|-----------|------|------|-------|----------|
| cardboard paper | 156 | 423 | 0.87 | 0.82 | 0.86 | 0.71 |
| metal | 156 | 198 | 0.91 | 0.88 | 0.91 | 0.76 |
| plastic | 156 | 512 | 0.83 | 0.79 | 0.84 | 0.68 |
Classes with more training instances typically achieve higher mAP scores. Consider balancing your dataset if one class significantly underperforms.
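A quick way to apply this advice is to compare each class's AP against the overall mean. The numbers below are the mAP50 column from the example table above; the 0.05 margin is an arbitrary illustrative threshold:

```python
# mAP50 per class, from the example results above
per_class_map50 = {'cardboard paper': 0.86, 'metal': 0.91, 'plastic': 0.84}

overall = sum(per_class_map50.values()) / len(per_class_map50)
print(f"Overall mAP50: {overall:.3f}")

# Flag any class more than 0.05 below the overall mean
for name, ap in per_class_map50.items():
    if ap < overall - 0.05:
        print(f"Underperforming class, consider more data: {name} ({ap:.2f})")
```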
Running Manual Evaluation
To evaluate a trained model on the test set:
from ultralytics import YOLO

# Load the best checkpoint from training
model = YOLO('training/runs/yolov11/train/weights/best.pt')

# Evaluate on the validation set
metrics = model.val(data='training/data/data.yaml', split='val')

print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
print(f"Precision: {metrics.box.mp:.3f}")  # mean precision across classes
print(f"Recall: {metrics.box.mr:.3f}")     # mean recall across classes

# Segmentation models also expose mask metrics under metrics.seg
print(f"Mask mAP50: {metrics.seg.map50:.3f}")

# Final evaluation on the held-out test set
test_metrics = model.val(data='training/data/data.yaml', split='test')
1. **Load the best model**: always use `best.pt` for evaluation, not `last.pt`.
2. **Run validation**: use the `val()` method with your dataset configuration.
3. **Analyze metrics**: review per-class and overall performance metrics.
4. **Test set evaluation**: run the final evaluation on the held-out test set.
Analyzing Training Results
Results CSV
The results.csv file contains epoch-by-epoch metrics:
epoch, train/box_loss, train/seg_loss, val/box_loss, val/seg_loss, metrics/precision, metrics/recall, metrics/mAP50, metrics/mAP50-95
1, 1.234, 0.876, 1.456, 0.923, 0.654, 0.612, 0.623, 0.445
2, 1.156, 0.812, 1.389, 0.887, 0.701, 0.658, 0.678, 0.489
3, 1.089, 0.763, 1.334, 0.856, 0.745, 0.698, 0.723, 0.531
...
Visualizing Training Progress
import pandas as pd
import matplotlib.pyplot as plt

# Load the per-epoch results
df = pd.read_csv('training/runs/yolov11/train/results.csv')

# Plot mAP over epochs
plt.figure(figsize=(10, 6))
plt.plot(df['epoch'], df['metrics/mAP50'], label='mAP50')
plt.plot(df['epoch'], df['metrics/mAP50-95'], label='mAP50-95')
plt.xlabel('Epoch')
plt.ylabel('mAP')
plt.legend()
plt.title('Model Performance Over Training')
plt.savefig('map_progression.png')
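The same file can be used to plot training against validation loss, which is the quickest visual check for overfitting. This sketch wraps the logic in a function; note that some Ultralytics versions pad the CSV headers with spaces, hence the `strip`:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt

def plot_losses(csv_path: str, out_png: str = 'loss_curves.png') -> None:
    """Plot train vs. validation box loss from an Ultralytics results.csv."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # headers may be padded with spaces
    plt.figure(figsize=(10, 6))
    plt.plot(df['epoch'], df['train/box_loss'], label='train box loss')
    plt.plot(df['epoch'], df['val/box_loss'], label='val box loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Train vs Validation Box Loss')
    plt.savefig(out_png)
    plt.close()

# plot_losses('training/runs/yolov11/train/results.csv')
```

If the two curves diverge late in training, see the overfitting warning signs below.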
Validation Procedures
Cross-Validation
For robust evaluation with limited data:
from sklearn.model_selection import KFold
from ultralytics import YOLO
import numpy as np

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
map_scores = []

# `dataset` is a list (or array) of your image paths to split into folds
for fold, (train_idx, val_idx) in enumerate(kf.split(dataset)):
    # Create a fold-specific data.yaml pointing at this fold's train/val split

    # Train a fresh model on the fold
    model = YOLO('yolo11n-seg.pt')
    model.train(data=f'data_fold{fold}.yaml', epochs=300)

    # Evaluate on the fold's validation split
    metrics = model.val()
    map_scores.append(metrics.box.map50)

print(f"Average mAP50: {np.mean(map_scores):.3f} ± {np.std(map_scores):.3f}")
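The "create a fold-specific data.yaml" step can be automated. A minimal sketch using PyYAML; `write_fold_yaml` and the directory layout are illustrative assumptions, not part of the Ultralytics API:

```python
import yaml

def write_fold_yaml(fold, train_dir, val_dir, names, root='.'):
    """Write a per-fold Ultralytics-style data.yaml and return its filename."""
    cfg = {
        'path': root,        # dataset root directory
        'train': train_dir,  # images for this fold's training split
        'val': val_dir,      # images for this fold's validation split
        'names': dict(enumerate(names)),  # class index -> class name
    }
    out = f'data_fold{fold}.yaml'
    with open(out, 'w') as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    return out

# write_fold_yaml(0, 'images/fold0/train', 'images/fold0/val',
#                 ['cardboard paper', 'metal', 'plastic'])
```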
Confusion Matrix Analysis
The confusion matrix shows classification performance:
                 Predicted
             CB    MT    PL    BG
Actual  CB [423    12    18     8]
        MT [ 15   198     9     6]
        PL [ 23    11   512    14]
        BG [  8     5    12     -]

- **CB**: cardboard paper
- **MT**: metal
- **PL**: plastic
- **BG**: background (false positives)
An ideal confusion matrix has high values on the diagonal and low values everywhere else; off-diagonal entries indicate misclassifications.
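Per-class precision and recall can be read straight off the matrix: recall normalizes each row (actual instances), precision each column (predictions). Using the example counts above, with the undefined background-background cell set to 0:

```python
import numpy as np

# Rows = actual, columns = predicted: CB, MT, PL, BG (from the matrix above)
cm = np.array([
    [423,  12,  18,  8],  # actual: cardboard paper
    [ 15, 198,   9,  6],  # actual: metal
    [ 23,  11, 512, 14],  # actual: plastic
    [  8,   5,  12,  0],  # actual: background (BG/BG cell is undefined; use 0)
])

for i, name in enumerate(['cardboard paper', 'metal', 'plastic']):
    recall = cm[i, i] / cm[i].sum()        # found / all actual instances
    precision = cm[i, i] / cm[:, i].sum()  # correct / all predictions of class
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

These count-based values will not exactly match the P/R reported during training, which depend on the confidence and IoU thresholds used when the matrix was built.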
Good Model Indicators
✅ **High Performance**:
- mAP50 > 0.85
- mAP50-95 > 0.65
- Training and validation losses converge
- Consistent performance across all classes
- Low false positive rate
Warning Signs
⚠️ **Overfitting**:
- Training loss much lower than validation loss
- Training mAP significantly higher than validation mAP
- Performance degrades after a certain epoch

**Solution**: Reduce epochs, increase dropout, add more augmentation
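The "performance degrades after a certain epoch" symptom is easy to confirm from results.csv: find where validation mAP peaked (column names as in the CSV sample earlier in this page):

```python
import pandas as pd

def best_epoch(csv_path: str) -> int:
    """Return the epoch with peak validation mAP50."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # headers may be padded with spaces
    return int(df.loc[df['metrics/mAP50'].idxmax(), 'epoch'])

# best_epoch('training/runs/yolov11/train/results.csv')
```

If the peak lands well before the final epoch, the run trained past its sweet spot; the `patience` argument to Ultralytics' `train()` stops such runs early automatically.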
⚠️ **Underfitting**:
- Both training and validation losses are high
- mAP50 < 0.70
- Losses still decreasing at the end of training

**Solution**: Train longer, increase model size, reduce augmentation
⚠️ **Class Imbalance**:
- One class has significantly lower mAP
- High confusion between specific classes

**Solution**: Collect more data for underperforming classes, adjust class weights
Model Testing Checklist
Before deploying your model:
Fine-Tuning Recommendations
When to Fine-Tune
1. **Baseline model**: train an initial model with default hyperparameters.
2. **Identify issues**: analyze the metrics to pinpoint specific problems (low recall, poor segmentation, etc.).
3. **Targeted adjustments**: make specific hyperparameter changes to address those issues.
4. **Iterative improvement**: retrain with the adjusted parameters and compare results.
Common Fine-Tuning Strategies
Improve Recall (Detect More Objects)
These are inference-time settings, so they belong on `predict()` (or `val()`) rather than `train()`:
results = model.predict(
    'test_image.jpg',
    conf=0.001,    # very low confidence threshold keeps nearly every detection
    iou=0.5,       # lower IoU threshold for NMS
    augment=True,  # enable test-time augmentation
)
Improve Precision (Reduce False Positives)
Confidence and NMS thresholds are inference-time settings; the classification loss weight is a training setting:
# Inference-time thresholds
results = model.predict(
    'test_image.jpg',
    conf=0.35,  # higher confidence threshold
    iou=0.7,    # higher IoU threshold for NMS
)
# Training-time: weight the classification loss more heavily
model.train(cls=1.0)  # classification loss weight (Ultralytics default is 0.5)
Better Segmentation Masks
model.train(
    epochs=400,    # train longer
    imgsz=1280,    # higher input resolution
    mask_ratio=2,  # lower mask downsample ratio for finer masks (default is 4)
)
Faster Convergence
model.train(
    lr0=0.001,         # initial learning rate (use a lower value with Adam than SGD's 0.01 default)
    warmup_epochs=5,   # gradual warmup
    optimizer='Adam',  # Adam instead of SGD
)
Only change one or two hyperparameters at a time. Changing too many makes it difficult to understand what improved (or hurt) performance.
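To make one-change-at-a-time tuning auditable, compare the final metrics of each run directly from their results.csv files (the run directory names below are examples; `compare_runs` is an illustrative helper):

```python
import pandas as pd
from pathlib import Path

def compare_runs(run_dirs):
    """Collect each run's final-epoch mAP50 / mAP50-95 into one table."""
    rows = []
    for d in run_dirs:
        df = pd.read_csv(Path(d) / 'results.csv')
        df.columns = df.columns.str.strip()  # headers may be padded with spaces
        last = df.iloc[-1]
        rows.append({'run': str(d),
                     'mAP50': last['metrics/mAP50'],
                     'mAP50-95': last['metrics/mAP50-95']})
    return pd.DataFrame(rows)

# print(compare_runs(['runs/yolov11/train', 'runs/yolov11/train2']))
```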
Inference Speed Benchmarking
Evaluate model speed on your deployment hardware:
from ultralytics import YOLO
import time

model = YOLO('training/runs/yolov11/train/weights/best.pt')

# Warmup (the first runs include one-time setup costs)
for _ in range(10):
    model.predict('test_image.jpg', verbose=False)

# Benchmark
start = time.time()
for _ in range(100):
    results = model.predict('test_image.jpg', verbose=False)
end = time.time()

avg_time = (end - start) / 100
fps = 1 / avg_time
print(f"Average inference time: {avg_time * 1000:.2f} ms")
print(f"FPS: {fps:.2f}")
| Device | Target FPS | Acceptable Range |
|--------|-----------|------------------|
| NVIDIA RTX 3080 | 60+ | 40-100 |
| Apple M1/M2 | 30+ | 20-50 |
| Raspberry Pi 4 | 5+ | 3-10 |
| CPU (modern) | 10+ | 5-15 |
YOLOv11n (nano) is optimized for speed. If you need higher accuracy and can sacrifice speed, consider YOLOv11s (small) or YOLOv11m (medium) variants.
Next Steps
Once your model achieves satisfactory performance:
- **Export Model**: export your model for deployment to various platforms
- **API Integration**: integrate the trained model into the classification API
Validation Report Template
# Model Validation Report
## Model Details
- Model: YOLOv11n-seg
- Training Date: YYYY-MM-DD
- Dataset Size: XXX images
- Training Duration: XX hours
## Performance Metrics
### Overall Performance
- mAP50: 0.XXX
- mAP50-95: 0.XXX
- Precision: 0.XXX
- Recall: 0.XXX
### Per-Class Performance
| Class | AP50 | AP50-95 | Precision | Recall |
|-------|------|---------|-----------|--------|
| Cardboard | 0.XX | 0.XX | 0.XX | 0.XX |
| Metal | 0.XX | 0.XX | 0.XX | 0.XX |
| Plastic | 0.XX | 0.XX | 0.XX | 0.XX |
### Inference Speed
- Device: XXX
- Average Time: XX ms
- FPS: XX
## Validation Results
- Test Set Size: XXX images
- Test mAP50: 0.XXX
- False Positive Rate: X.X%
- False Negative Rate: X.X%
## Recommendations
- [ ] Model approved for deployment
- [ ] Requires additional training
- [ ] Needs more data for class XXX
## Edge Cases Tested
- [ ] Occluded objects
- [ ] Poor lighting conditions
- [ ] Multiple overlapping objects
- [ ] Unusual angles/perspectives