
Evaluation Overview

After training completes, the YOLOv11 model is automatically evaluated on the validation set. This section covers how to interpret results, perform additional evaluations, and fine-tune your model for better performance.
Evaluation metrics are automatically generated during training and saved to runs/yolov11/train/results.csv.

Evaluation Metrics

YOLOv11 uses several standard metrics to assess model performance:

Primary Metrics

| Metric | Description | Target Value |
|--------|-------------|--------------|
| mAP50 | Mean Average Precision at 50% IoU | > 0.80 (good) |
| mAP50-95 | Mean Average Precision averaged over IoU 50-95% | > 0.60 (good) |
| Precision | Ratio of true positives to all positive predictions | > 0.85 (good) |
| Recall | Ratio of true positives to all actual positives | > 0.80 (good) |
| F1 Score | Harmonic mean of precision and recall | > 0.82 (good) |
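F1 is derived directly from precision and recall; a quick check using the target values above:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.85, 0.80), 3))  # 0.824
```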

Segmentation-Specific Metrics

| Metric | Description |
|--------|-------------|
| Mask mAP50 | Segmentation mask mAP at 50% IoU |
| Mask mAP50-95 | Segmentation mask mAP averaged over IoU thresholds 50-95% |
| IoU | Intersection over Union for segmentation masks |
Segmentation metrics are typically lower than detection metrics because they require pixel-perfect predictions, not just bounding boxes.
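Mask IoU itself is simple to compute from boolean masks; a minimal sketch with two toy 4×4 masks:

```python
import numpy as np

def mask_iou(pred, gt):
    # IoU = |intersection| / |union| over boolean pixel masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

a = np.zeros((4, 4), bool); a[:2, :2] = True  # 4-pixel mask
b = np.zeros((4, 4), bool); b[:2, :4] = True  # 8-pixel mask
print(mask_iou(a, b))  # 0.5
```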

Understanding mAP (Mean Average Precision)

What is mAP?

mAP is the most important metric for object detection and segmentation:
  • Average Precision (AP): Area under the precision-recall curve for one class
  • Mean Average Precision (mAP): Average of AP across all classes
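As a concrete sketch, AP for one class can be computed as the area under the precision-recall curve after taking its monotone envelope (the standard all-points interpolation):

```python
import numpy as np

def average_precision(recall, precision):
    # Pad the curve so it spans recall 0 -> 1
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Precision envelope: make precision non-increasing from right to left
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Perfect detector: precision 1.0 at every recall level
print(average_precision(np.array([0.5, 1.0]), np.array([1.0, 1.0])))  # 1.0
```

mAP is then the mean of this value across all classes.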

mAP Thresholds

# mAP50: IoU threshold = 50%
# A prediction counts as correct if its overlap (IoU) with the
# ground truth is >= 50%, making this the more lenient metric
mAP50 = 0.85  # example reported value
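mAP50-95 is stricter: it averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. A sketch with illustrative (made-up) per-threshold AP values:

```python
import numpy as np

# Ten IoU thresholds: 0.50, 0.55, ..., 0.95
thresholds = np.round(np.arange(0.50, 0.96, 0.05), 2)

# Illustrative APs: stricter IoU thresholds yield lower AP
ap_at_iou = [0.85 - 0.4 * (t - 0.50) for t in thresholds]
map50_95 = float(np.mean(ap_at_iou))
print(len(thresholds), round(map50_95, 3))
```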

Per-Class mAP

The model reports mAP for each trash category:
Class           Images  Instances    P      R    mAP50  mAP50-95
cardboard paper    156        423  0.87   0.82   0.86      0.71
metal              156        198  0.91   0.88   0.91      0.76
plastic            156        512  0.83   0.79   0.84      0.68
Classes with more training instances typically achieve higher mAP scores. Consider balancing your dataset if one class significantly underperforms.
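The overall mAP50 is simply the mean of the per-class values in the table above:

```python
# Per-class AP50 values taken from the table above
ap50 = {"cardboard paper": 0.86, "metal": 0.91, "plastic": 0.84}

map50 = sum(ap50.values()) / len(ap50)
print(round(map50, 2))  # 0.87
```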

Running Manual Evaluation

To evaluate a trained model on the test set:
evaluate_model.py
from ultralytics import YOLO

# Load trained model
model = YOLO('training/runs/yolov11/train/weights/best.pt')

# Evaluate on validation set
metrics = model.val(data='training/data/data.yaml', split='val')

print(f"mAP50: {metrics.box.map50}")
print(f"mAP50-95: {metrics.box.map}")
print(f"Precision: {metrics.box.mp}")  # mean precision across classes
print(f"Recall: {metrics.box.mr}")     # mean recall across classes

# Evaluate on test set
test_metrics = model.val(data='training/data/data.yaml', split='test')
1. Load the best model: always use best.pt for evaluation, not last.pt.
2. Run validation: use the val() method with your dataset configuration.
3. Analyze metrics: review per-class and overall performance metrics.
4. Test set evaluation: run the final evaluation on the held-out test set.

Analyzing Training Results

Results CSV

The results.csv file contains epoch-by-epoch metrics:
epoch,train/box_loss,train/seg_loss,val/box_loss,val/seg_loss,metrics/precision,metrics/recall,metrics/mAP50,metrics/mAP50-95
1,1.234,0.876,1.456,0.923,0.654,0.612,0.623,0.445
2,1.156,0.812,1.389,0.887,0.701,0.658,0.678,0.489
3,1.089,0.763,1.334,0.856,0.745,0.698,0.723,0.531
...
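For a quick look without pandas, the best epoch can be pulled straight from the CSV with the standard library; a sketch using the rows above inline:

```python
import csv
import io

# Inline sample of results.csv (trimmed to the mAP columns)
results_csv = """epoch,metrics/mAP50,metrics/mAP50-95
1,0.623,0.445
2,0.678,0.489
3,0.723,0.531"""

rows = list(csv.DictReader(io.StringIO(results_csv)))
best = max(rows, key=lambda r: float(r["metrics/mAP50"]))
print(best["epoch"])  # 3
```

In practice you would open `training/runs/yolov11/train/results.csv` instead of the inline string.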

Visualizing Training Progress

plot_results.py
import pandas as pd
import matplotlib.pyplot as plt

# Load results
df = pd.read_csv('training/runs/yolov11/train/results.csv')

# Plot mAP over epochs
plt.figure(figsize=(10, 6))
plt.plot(df['epoch'], df['metrics/mAP50'], label='mAP50')
plt.plot(df['epoch'], df['metrics/mAP50-95'], label='mAP50-95')
plt.xlabel('Epoch')
plt.ylabel('mAP')
plt.legend()
plt.title('Model Performance Over Training')
plt.savefig('map_progression.png')

Validation Procedures

Cross-Validation

For robust evaluation with limited data:
from ultralytics import YOLO
from sklearn.model_selection import KFold
import numpy as np

# 5-fold cross-validation sketch ('dataset' is your list of image paths)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
map_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(dataset)):
    # Create a fold-specific data.yaml from train_idx/val_idx,
    # then train a fresh model on this fold
    model = YOLO('yolo11n-seg.pt')
    model.train(data=f'data_fold{fold}.yaml', epochs=300)

    # Evaluate the fold
    metrics = model.val()
    map_scores.append(metrics.box.map50)

print(f"Average mAP50: {np.mean(map_scores):.3f} ± {np.std(map_scores):.3f}")
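The loop above needs one data.yaml per fold. A minimal helper for writing them (the directory layout and class names here are assumptions for this tutorial's dataset):

```python
from pathlib import Path

def write_fold_yaml(fold, train_dir, val_dir, out_dir="."):
    # Minimal Ultralytics-style data.yaml for one fold
    content = (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        "names:\n"
        "  0: cardboard paper\n"
        "  1: metal\n"
        "  2: plastic\n"
    )
    path = Path(out_dir) / f"data_fold{fold}.yaml"
    path.write_text(content)
    return path
```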

Confusion Matrix Analysis

The confusion matrix shows classification performance:
                Predicted
              CB   MT   PL   BG
Actual   CB  [423   12   18   8]
         MT  [ 15  198   9    6]
         PL  [ 23   11  512  14]
         BG  [  8    5   12   -]
  • CB: Cardboard paper
  • MT: Metal
  • PL: Plastic
  • BG: Background (false positives)
Ideal confusion matrix has high values on the diagonal and low values elsewhere. Off-diagonal values indicate misclassifications.
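Per-class precision and recall can be recovered directly from the matrix; a sketch using the counts above (the undefined BG-BG cell is set to 0):

```python
import numpy as np

# Rows = actual class, columns = predicted class (CB, MT, PL, BG)
cm = np.array([
    [423,  12,  18,   8],
    [ 15, 198,   9,   6],
    [ 23,  11, 512,  14],
    [  8,   5,  12,   0],
])

recall    = np.diag(cm) / cm.sum(axis=1)  # correct / all actual
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted

for name, r, p in zip(["CB", "MT", "PL"], recall, precision):
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
```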

Performance Assessment

Good Model Indicators

High Performance:
  • mAP50 > 0.85
  • mAP50-95 > 0.65
  • Training and validation losses converge
  • Consistent performance across all classes
  • Low false positive rate

Warning Signs

⚠️ Overfitting:
  • Training loss much lower than validation loss
  • Training mAP significantly higher than validation mAP
  • Performance degrades after certain epoch
Solution: reduce epochs, increase dropout, add more augmentation.

⚠️ Underfitting:
  • Both training and validation losses are high
  • mAP50 < 0.70
  • Losses still decreasing at end of training
Solution: train longer, increase model size, reduce augmentation.

⚠️ Class Imbalance:
  • One class has significantly lower mAP
  • High confusion between specific classes
Solution: Collect more data for underperforming classes, adjust class weights

Model Testing Checklist

Before deploying your model:
  • Evaluate on held-out test set
  • Verify mAP50 > 0.80 on validation set
  • Check per-class performance is balanced
  • Test on edge cases (occlusions, poor lighting)
  • Measure inference speed on target hardware
  • Validate segmentation mask quality
  • Test with real-world images not in training set
  • Verify no data leakage between train/val/test
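The leakage check in the list above is easy to automate: any filename that appears in more than one split is a problem. A minimal sketch (find_leaks is a hypothetical helper, not part of Ultralytics):

```python
def find_leaks(train_files, val_files, test_files):
    # Any filename appearing in more than one split indicates leakage
    train, val, test = set(train_files), set(val_files), set(test_files)
    return (train & val) | (train & test) | (val & test)

print(find_leaks(["a.jpg", "b.jpg"], ["c.jpg"], ["b.jpg"]))  # {'b.jpg'}
```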

Fine-Tuning Recommendations

When to Fine-Tune

1. Baseline model: train the initial model with default hyperparameters.
2. Identify issues: analyze metrics to pinpoint specific problems (low recall, poor segmentation, etc.).
3. Targeted adjustments: make specific hyperparameter changes to address those issues.
4. Iterative improvement: retrain with the adjusted parameters and compare results.

Common Fine-Tuning Strategies

Improve Recall (Detect More Objects)

# These thresholds apply at inference time (predict/val), not train()
results = model.predict(
    'image.jpg',
    conf=0.001,      # Lower confidence threshold
    iou=0.5,         # Lower IoU threshold for NMS
    augment=True,    # Enable test-time augmentation
)

Improve Precision (Reduce False Positives)

# Inference-time thresholds
results = model.predict(
    'image.jpg',
    conf=0.35,       # Higher confidence threshold
    iou=0.7,         # Higher IoU threshold for NMS
)

# Training-time: raise the classification loss weight (default 0.5)
model.train(cls=1.0)

Better Segmentation Masks

model.train(
    epochs=400,      # Train longer
    imgsz=1280,      # Higher input resolution
    mask_ratio=2,    # Lower mask downsample ratio = finer masks (default 4)
)

Faster Convergence

model.train(
    optimizer='Adam',  # Adam instead of SGD
    lr0=0.001,         # Adam typically needs a lower initial LR than SGD's 0.01
    warmup_epochs=5,   # Longer gradual warmup (default 3)
)
Only change one or two hyperparameters at a time. Changing too many makes it difficult to understand what improved (or hurt) performance.

Inference Speed Benchmarking

Evaluate model speed on your deployment hardware:
benchmark.py
from ultralytics import YOLO
import time

model = YOLO('training/runs/yolov11/train/weights/best.pt')

# Warmup
for _ in range(10):
    model.predict('test_image.jpg', verbose=False)

# Benchmark
start = time.time()
for _ in range(100):
    results = model.predict('test_image.jpg', verbose=False)
end = time.time()

avg_time = (end - start) / 100
fps = 1 / avg_time

print(f"Average inference time: {avg_time*1000:.2f} ms")
print(f"FPS: {fps:.2f}")
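Mean FPS can hide latency spikes; for real-time use it is worth checking tail latency as well. A sketch with made-up per-frame timings (not real measurements):

```python
import numpy as np

# Illustrative per-frame inference times in milliseconds
times_ms = np.array([12.1, 11.8, 13.0, 25.4, 12.2, 11.9, 12.5, 30.1])

print(f"mean: {times_ms.mean():.1f} ms")
print(f"p95:  {np.percentile(times_ms, 95):.1f} ms")  # tail latency
```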

Target Performance

| Device | Target FPS | Acceptable Range |
|--------|------------|------------------|
| NVIDIA RTX 3080 | 60+ | 40-100 |
| Apple M1/M2 | 30+ | 20-50 |
| Raspberry Pi 4 | 5+ | 3-10 |
| CPU (modern) | 10+ | 5-15 |
YOLOv11n (nano) is optimized for speed. If you need higher accuracy and can sacrifice speed, consider YOLOv11s (small) or YOLOv11m (medium) variants.

Next Steps

Once your model achieves satisfactory performance:

Export Model

Export your model for deployment to various platforms

API Integration

Integrate the trained model into the classification API

Validation Report Template

# Model Validation Report

## Model Details
- Model: YOLOv11n-seg
- Training Date: YYYY-MM-DD
- Dataset Size: XXX images
- Training Duration: XX hours

## Performance Metrics

### Overall Performance
- mAP50: 0.XXX
- mAP50-95: 0.XXX
- Precision: 0.XXX
- Recall: 0.XXX

### Per-Class Performance
| Class | AP50 | AP50-95 | Precision | Recall |
|-------|------|---------|-----------|--------|
| Cardboard | 0.XX | 0.XX | 0.XX | 0.XX |
| Metal | 0.XX | 0.XX | 0.XX | 0.XX |
| Plastic | 0.XX | 0.XX | 0.XX | 0.XX |

### Inference Speed
- Device: XXX
- Average Time: XX ms
- FPS: XX

## Validation Results
- Test Set Size: XXX images
- Test mAP50: 0.XXX
- False Positive Rate: X.X%
- False Negative Rate: X.X%

## Recommendations
- [ ] Model approved for deployment
- [ ] Requires additional training
- [ ] Needs more data for class XXX

## Edge Cases Tested
- [ ] Occluded objects
- [ ] Poor lighting conditions
- [ ] Multiple overlapping objects
- [ ] Unusual angles/perspectives
