Evaluation Overview
After training completes, the YOLOv11 model is automatically evaluated on the validation set. This section covers how to interpret results, perform additional evaluations, and fine-tune your model for better performance.
Evaluation metrics are automatically generated during training and saved to runs/yolov11/train/results.csv.
Evaluation Metrics
YOLOv11 uses several standard metrics to assess model performance:
Primary Metrics
| Metric | Description | Target Value |
|--------|-------------|--------------|
| mAP50 | Mean Average Precision at 50% IoU | > 0.80 (good) |
| mAP50-95 | Mean Average Precision averaged over IoU 50-95% | > 0.60 (good) |
| Precision | Ratio of true positives to all positive predictions | > 0.85 (good) |
| Recall | Ratio of true positives to all actual positives | > 0.80 (good) |
| F1 Score | Harmonic mean of precision and recall | > 0.82 (good) |
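As a quick sanity check on how these metrics relate, the F1 score is the harmonic mean of precision and recall; computing it at the target values above:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# At the target values (P = 0.85, R = 0.80):
print(round(f1(0.85, 0.80), 3))  # -> 0.824
```

Note that the harmonic mean is always pulled toward the lower of the two values, which is why a model cannot hide poor recall behind high precision.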
Segmentation-Specific Metrics
| Metric | Description |
|--------|-------------|
| mask mAP50 | Segmentation mask mAP at 50% IoU |
| mask mAP50-95 | Segmentation mask mAP averaged over IoU thresholds 50-95% |
| IoU | Intersection over Union for segmentation masks |
Segmentation metrics are typically lower than detection metrics because they require pixel-perfect predictions, not just bounding boxes.
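The pixel-level comparison behind these mask metrics can be sketched directly with NumPy. `mask_iou` below is an illustrative helper (not an Ultralytics API) and assumes binary masks:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-wise IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example: two 4x4 masks overlapping on one row
a = np.zeros((4, 4)); a[:2, :] = 1   # rows 0-1
b = np.zeros((4, 4)); b[1:3, :] = 1  # rows 1-2
print(mask_iou(a, b))  # 4 shared pixels / 12 total -> 0.333...
```

Because a bounding box can score well even when the mask misses thin edges or boundaries, mask IoU (and hence mask mAP) usually comes out lower than the box equivalent.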
Understanding mAP (Mean Average Precision)
What is mAP?
mAP is the most important metric for object detection and segmentation:
- **Average Precision (AP)**: area under the precision-recall curve for a single class
- **Mean Average Precision (mAP)**: average of AP across all classes
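To make the AP definition concrete, here is a minimal all-point-interpolation AP over a toy precision-recall curve. This is a simplified sketch of what COCO-style evaluators do internally, not the exact Ultralytics implementation:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Replace each precision with the max precision at any higher recall (envelope)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy PR points: precision drops as recall rises
recall = np.array([0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.7, 0.5])
print(round(average_precision(recall, precision), 3))  # -> 0.62
```

mAP is then simply the mean of this value over all classes (and, for mAP50-95, over the IoU thresholds as well).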
mAP Thresholds
# mAP50: IoU threshold = 50%
# A prediction counts as correct if its overlap with the ground truth is >= 50%
# This is the more lenient of the two mAP metrics
mAP50 = 0.85  # example value
Per-Class mAP
The model reports mAP for each trash category:
| Class | Images | Instances | P | R | mAP50 | mAP50-95 |
|-------|--------|-----------|------|------|-------|----------|
| cardboard paper | 156 | 423 | 0.87 | 0.82 | 0.86 | 0.71 |
| metal | 156 | 198 | 0.91 | 0.88 | 0.91 | 0.76 |
| plastic | 156 | 512 | 0.83 | 0.79 | 0.84 | 0.68 |
Classes with more training instances typically achieve higher mAP scores. Consider balancing your dataset if one class significantly underperforms.
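A quick way to apply this advice is to compare each class's AP against the overall mean. The numbers below are the mAP50 column from the example table above; the 0.05 margin is an arbitrary illustrative threshold:

```python
# mAP50 per class, from the example results above
per_class_map50 = {'cardboard paper': 0.86, 'metal': 0.91, 'plastic': 0.84}

overall = sum(per_class_map50.values()) / len(per_class_map50)
print(f"Overall mAP50: {overall:.3f}")

# Flag any class more than 0.05 below the overall mean
for name, ap in per_class_map50.items():
    if ap < overall - 0.05:
        print(f"Underperforming class, consider more data: {name} ({ap:.2f})")
```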
Running Manual Evaluation
To evaluate a trained model on the test set:
from ultralytics import YOLO

# Load the best checkpoint from training
model = YOLO('training/runs/yolov11/train/weights/best.pt')

# Evaluate on the validation set
metrics = model.val(data='training/data/data.yaml', split='val')

print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
print(f"Precision: {metrics.box.mp:.3f}")  # mean precision across classes
print(f"Recall: {metrics.box.mr:.3f}")     # mean recall across classes

# Segmentation models also expose mask metrics under metrics.seg
print(f"Mask mAP50: {metrics.seg.map50:.3f}")

# Final evaluation on the held-out test set
test_metrics = model.val(data='training/data/data.yaml', split='test')
1. **Load the best model**: always use `best.pt` for evaluation, not `last.pt`.
2. **Run validation**: use the `val()` method with your dataset configuration.
3. **Analyze metrics**: review per-class and overall performance metrics.
4. **Test set evaluation**: run the final evaluation on the held-out test set.
Analyzing Training Results
Results CSV
The results.csv file contains epoch-by-epoch metrics:
epoch, train/box_loss, train/seg_loss, val/box_loss, val/seg_loss, metrics/precision, metrics/recall, metrics/mAP50, metrics/mAP50-95
1, 1.234, 0.876, 1.456, 0.923, 0.654, 0.612, 0.623, 0.445
2, 1.156, 0.812, 1.389, 0.887, 0.701, 0.658, 0.678, 0.489
3, 1.089, 0.763, 1.334, 0.856, 0.745, 0.698, 0.723, 0.531
...
Visualizing Training Progress
import pandas as pd
import matplotlib.pyplot as plt

# Load the per-epoch results
df = pd.read_csv('training/runs/yolov11/train/results.csv')

# Plot mAP over epochs
plt.figure(figsize=(10, 6))
plt.plot(df['epoch'], df['metrics/mAP50'], label='mAP50')
plt.plot(df['epoch'], df['metrics/mAP50-95'], label='mAP50-95')
plt.xlabel('Epoch')
plt.ylabel('mAP')
plt.legend()
plt.title('Model Performance Over Training')
plt.savefig('map_progression.png')
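The same file can be used to plot training against validation loss, which is the quickest visual check for overfitting. This sketch wraps the logic in a function; note that some Ultralytics versions pad the CSV headers with spaces, hence the `strip`:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt

def plot_losses(csv_path: str, out_png: str = 'loss_curves.png') -> None:
    """Plot train vs. validation box loss from an Ultralytics results.csv."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # headers may be padded with spaces
    plt.figure(figsize=(10, 6))
    plt.plot(df['epoch'], df['train/box_loss'], label='train box loss')
    plt.plot(df['epoch'], df['val/box_loss'], label='val box loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Train vs Validation Box Loss')
    plt.savefig(out_png)
    plt.close()

# plot_losses('training/runs/yolov11/train/results.csv')
```

If the two curves diverge late in training, see the overfitting warning signs below.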
Validation Procedures
Cross-Validation
For robust evaluation with limited data:
from sklearn.model_selection import KFold
from ultralytics import YOLO
import numpy as np

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
map_scores = []

# `dataset` is a list (or array) of your image paths to split into folds
for fold, (train_idx, val_idx) in enumerate(kf.split(dataset)):
    # Create a fold-specific data.yaml pointing at this fold's train/val split

    # Train a fresh model on the fold
    model = YOLO('yolo11n-seg.pt')
    model.train(data=f'data_fold{fold}.yaml', epochs=300)

    # Evaluate on the fold's validation split
    metrics = model.val()
    map_scores.append(metrics.box.map50)

print(f"Average mAP50: {np.mean(map_scores):.3f} ± {np.std(map_scores):.3f}")
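The "create a fold-specific data.yaml" step can be automated. A minimal sketch using PyYAML; `write_fold_yaml` and the directory layout are illustrative assumptions, not part of the Ultralytics API:

```python
import yaml

def write_fold_yaml(fold, train_dir, val_dir, names, root='.'):
    """Write a per-fold Ultralytics-style data.yaml and return its filename."""
    cfg = {
        'path': root,        # dataset root directory
        'train': train_dir,  # images for this fold's training split
        'val': val_dir,      # images for this fold's validation split
        'names': dict(enumerate(names)),  # class index -> class name
    }
    out = f'data_fold{fold}.yaml'
    with open(out, 'w') as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    return out

# write_fold_yaml(0, 'images/fold0/train', 'images/fold0/val',
#                 ['cardboard paper', 'metal', 'plastic'])
```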
Confusion Matrix Analysis
The confusion matrix shows classification performance:
                 Predicted
             CB    MT    PL    BG
Actual  CB [423    12    18     8]
        MT [ 15   198     9     6]
        PL [ 23    11   512    14]
        BG [  8     5    12     -]

- **CB**: cardboard paper
- **MT**: metal
- **PL**: plastic
- **BG**: background (false positives)
An ideal confusion matrix has high values on the diagonal and low values everywhere else; off-diagonal entries indicate misclassifications.
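Per-class precision and recall can be read straight off the matrix: recall normalizes each row (actual instances), precision each column (predictions). Using the example counts above, with the undefined background-background cell set to 0:

```python
import numpy as np

# Rows = actual, columns = predicted: CB, MT, PL, BG (from the matrix above)
cm = np.array([
    [423,  12,  18,  8],  # actual: cardboard paper
    [ 15, 198,   9,  6],  # actual: metal
    [ 23,  11, 512, 14],  # actual: plastic
    [  8,   5,  12,  0],  # actual: background (BG/BG cell is undefined; use 0)
])

for i, name in enumerate(['cardboard paper', 'metal', 'plastic']):
    recall = cm[i, i] / cm[i].sum()        # found / all actual instances
    precision = cm[i, i] / cm[:, i].sum()  # correct / all predictions of class
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

These count-based values will not exactly match the P/R reported during training, which depend on the confidence and IoU thresholds used when the matrix was built.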
Good Model Indicators
✅ **High Performance**:
- mAP50 > 0.85
- mAP50-95 > 0.65
- Training and validation losses converge
- Consistent performance across all classes
- Low false positive rate
Warning Signs
⚠️ **Overfitting**:
- Training loss much lower than validation loss
- Training mAP significantly higher than validation mAP
- Performance degrades after a certain epoch

**Solution**: Reduce epochs, increase dropout, add more augmentation
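The "performance degrades after a certain epoch" symptom is easy to confirm from results.csv: find where validation mAP peaked (column names as in the CSV sample earlier in this page):

```python
import pandas as pd

def best_epoch(csv_path: str) -> int:
    """Return the epoch with peak validation mAP50."""
    df = pd.read_csv(csv_path)
    df.columns = df.columns.str.strip()  # headers may be padded with spaces
    return int(df.loc[df['metrics/mAP50'].idxmax(), 'epoch'])

# best_epoch('training/runs/yolov11/train/results.csv')
```

If the peak lands well before the final epoch, the run trained past its sweet spot; the `patience` argument to Ultralytics' `train()` stops such runs early automatically.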
⚠️ **Underfitting**:
- Both training and validation losses are high
- mAP50 < 0.70
- Losses still decreasing at the end of training

**Solution**: Train longer, increase model size, reduce augmentation
⚠️ **Class Imbalance**:
- One class has significantly lower mAP
- High confusion between specific classes

**Solution**: Collect more data for underperforming classes, adjust class weights
Model Testing Checklist
Before deploying your model:
Fine-Tuning Recommendations
When to Fine-Tune
1. **Baseline model**: train an initial model with default hyperparameters.
2. **Identify issues**: analyze the metrics to pinpoint specific problems (low recall, poor segmentation, etc.).
3. **Targeted adjustments**: make specific hyperparameter changes to address those issues.
4. **Iterative improvement**: retrain with the adjusted parameters and compare results.
Common Fine-Tuning Strategies
Improve Recall (Detect More Objects)
These are inference-time settings, so they belong on `predict()` (or `val()`) rather than `train()`:
results = model.predict(
    'test_image.jpg',
    conf=0.001,    # very low confidence threshold keeps nearly every detection
    iou=0.5,       # lower IoU threshold for NMS
    augment=True,  # enable test-time augmentation
)
Improve Precision (Reduce False Positives)
Confidence and NMS thresholds are inference-time settings; the classification loss weight is a training setting:
# Inference-time thresholds
results = model.predict(
    'test_image.jpg',
    conf=0.35,  # higher confidence threshold
    iou=0.7,    # higher IoU threshold for NMS
)
# Training-time: weight the classification loss more heavily
model.train(cls=1.0)  # classification loss weight (Ultralytics default is 0.5)
Better Segmentation Masks
model.train(
    epochs=400,    # train longer
    imgsz=1280,    # higher input resolution
    mask_ratio=2,  # lower mask downsample ratio for finer masks (default is 4)
)
Faster Convergence
model.train(
    lr0=0.001,         # initial learning rate (use a lower value with Adam than SGD's 0.01 default)
    warmup_epochs=5,   # gradual warmup
    optimizer='Adam',  # Adam instead of SGD
)
Only change one or two hyperparameters at a time. Changing too many makes it difficult to understand what improved (or hurt) performance.
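To make one-change-at-a-time tuning auditable, compare the final metrics of each run directly from their results.csv files (the run directory names below are examples; `compare_runs` is an illustrative helper):

```python
import pandas as pd
from pathlib import Path

def compare_runs(run_dirs):
    """Collect each run's final-epoch mAP50 / mAP50-95 into one table."""
    rows = []
    for d in run_dirs:
        df = pd.read_csv(Path(d) / 'results.csv')
        df.columns = df.columns.str.strip()  # headers may be padded with spaces
        last = df.iloc[-1]
        rows.append({'run': str(d),
                     'mAP50': last['metrics/mAP50'],
                     'mAP50-95': last['metrics/mAP50-95']})
    return pd.DataFrame(rows)

# print(compare_runs(['runs/yolov11/train', 'runs/yolov11/train2']))
```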
Inference Speed Benchmarking
Evaluate model speed on your deployment hardware:
from ultralytics import YOLO
import time

model = YOLO('training/runs/yolov11/train/weights/best.pt')

# Warmup (the first runs include one-time setup costs)
for _ in range(10):
    model.predict('test_image.jpg', verbose=False)

# Benchmark
start = time.time()
for _ in range(100):
    results = model.predict('test_image.jpg', verbose=False)
end = time.time()

avg_time = (end - start) / 100
fps = 1 / avg_time
print(f"Average inference time: {avg_time * 1000:.2f} ms")
print(f"FPS: {fps:.2f}")
| Device | Target FPS | Acceptable Range |
|--------|-----------|------------------|
| NVIDIA RTX 3080 | 60+ | 40-100 |
| Apple M1/M2 | 30+ | 20-50 |
| Raspberry Pi 4 | 5+ | 3-10 |
| CPU (modern) | 10+ | 5-15 |
YOLOv11n (nano) is optimized for speed. If you need higher accuracy and can sacrifice speed, consider YOLOv11s (small) or YOLOv11m (medium) variants.
Next Steps
Once your model achieves satisfactory performance:
- **Export Model**: export your model for deployment to various platforms
- **API Integration**: integrate the trained model into the classification API
Validation Report Template
# Model Validation Report
## Model Details
- Model: YOLOv11n-seg
- Training Date: YYYY-MM-DD
- Dataset Size: XXX images
- Training Duration: XX hours
## Performance Metrics
### Overall Performance
- mAP50: 0.XXX
- mAP50-95: 0.XXX
- Precision: 0.XXX
- Recall: 0.XXX
### Per-Class Performance
| Class | AP50 | AP50-95 | Precision | Recall |
|-------|------|---------|-----------|--------|
| Cardboard | 0.XX | 0.XX | 0.XX | 0.XX |
| Metal | 0.XX | 0.XX | 0.XX | 0.XX |
| Plastic | 0.XX | 0.XX | 0.XX | 0.XX |
### Inference Speed
- Device: XXX
- Average Time: XX ms
- FPS: XX
## Validation Results
- Test Set Size: XXX images
- Test mAP50: 0.XXX
- False Positive Rate: X.X%
- False Negative Rate: X.X%
## Recommendations
- [ ] Model approved for deployment
- [ ] Requires additional training
- [ ] Needs more data for class XXX
## Edge Cases Tested
- [ ] Occluded objects
- [ ] Poor lighting conditions
- [ ] Multiple overlapping objects
- [ ] Unusual angles/perspectives