
YOLO Model Training

Computer vision is the eyes of your robotic system. In this lesson, you’ll learn how to train YOLO (You Only Look Once) models to detect specific objects for pick-and-place tasks.

Learning Objectives

By the end of this lesson, you will be able to:
  • Understand YOLO architecture and how it works
  • Prepare datasets for object detection training
  • Train custom YOLO models using Ultralytics
  • Configure training parameters for optimal results
  • Evaluate model performance
  • Select appropriate model sizes for your hardware
This course uses YOLO11 (also called YOLOv11), the latest version from Ultralytics with improved speed and accuracy.

Why YOLO for Robotics?

Traditional vs. YOLO Detection

Traditional Two-Stage Detectors (R-CNN, Faster R-CNN):
  1. Propose regions of interest
  2. Classify each region
  3. Slow: 5-10 FPS
YOLO Single-Stage Detector:
  1. Process entire image in one pass
  2. Predict bounding boxes and classes simultaneously
  3. Fast: 30-100+ FPS
For real-time robot control, you need low latency. YOLO’s speed makes it ideal for robotics where decisions must be made in milliseconds.

YOLO Architecture Overview

Components:
  1. Backbone: Extracts features (edges, shapes, textures)
    • Convolutional layers with residual connections
    • Progressively reduces spatial dimensions
    • Increases feature channels
  2. Neck: Fuses features from different scales
    • Path Aggregation Network (PANet)
    • Detects both small and large objects
  3. Head: Predicts detections
    • Bounding box coordinates (x, y, width, height)
    • Class probabilities
    • Confidence scores

Model Sizes

YOLO11 comes in multiple sizes trading off speed vs. accuracy:
Model     Parameters   Speed (ms)   mAP50-95   Use Case
YOLO11n   2.6M         5            39.5       Ultra-fast, low accuracy
YOLO11s   9.4M         10           47.0       Best for robotics
YOLO11m   20.1M        18           51.5       Balanced
YOLO11l   25.3M        25           53.4       High accuracy
YOLO11x   56.9M        40           54.7       Maximum accuracy
For Raspberry Pi: Use YOLO11s (small) or YOLO11n (nano). These provide good accuracy while running at acceptable framerates on edge devices.

Setting Up Training Environment

Installation

# Install Ultralytics YOLO
pip install ultralytics

# Verify installation
yolo version

# Install training dependencies
pip install torch torchvision opencv-python pillow pyyaml
Training on GPU is 10-50x faster than CPU:
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Should print: True
If you don’t have a GPU, you can:
  • Use Google Colab (free GPU)
  • Use cloud platforms (AWS, Azure, Google Cloud)
  • Train on CPU (slower but works for small datasets)

Dataset Preparation

Dataset Structure

YOLO expects a specific directory structure:
dataset/
├── data.yaml          # Dataset configuration
├── train/
│   ├── images/        # Training images
│   │   ├── img001.jpg
│   │   ├── img002.jpg
│   │   └── ...
│   └── labels/        # Annotations
│       ├── img001.txt
│       ├── img002.txt
│       └── ...
└── val/
    ├── images/        # Validation images
    └── labels/        # Validation annotations

Annotation Format

Each image has a corresponding .txt file with one line per object:
<class_id> <x_center> <y_center> <width> <height>
All values are normalized (0-1):
  • class_id: Integer class index (0, 1, 2…)
  • x_center: Center X / image width
  • y_center: Center Y / image height
  • width: Bounding box width / image width
  • height: Bounding box height / image height
Example (img001.txt):
0 0.5 0.4 0.15 0.2
1 0.3 0.6 0.12 0.18
This represents:
  • Object of class 0 at center (50%, 40%) with size 15% × 20%
  • Object of class 1 at center (30%, 60%) with size 12% × 18%
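If your annotation tool emits pixel coordinates instead, the conversion to this normalized format is only a few lines. A sketch (the function name is illustrative):

```python
def to_yolo(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w   # box center, normalized by image width
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w          # box size, normalized
    height = (y_max - y_min) / img_h
    return f'{class_id} {x_center:.6g} {y_center:.6g} {width:.6g} {height:.6g}'

# A 96x128 pixel box centered at (320, 256) in a 640x640 image:
print(to_yolo(0, 272, 192, 368, 320, 640, 640))  # → "0 0.5 0.4 0.15 0.2"
```

This reproduces the first line of the img001.txt example above.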

Data Configuration File

data.yaml:
# Dataset paths
path: /path/to/dataset  # Root directory
train: train/images     # Training images (relative to path)
val: val/images         # Validation images (relative to path)

# Classes
names:
  0: apple
  1: orange
  2: bottle
For this course project, we detect three objects:
  • apple: Red fruits for pick-and-place
  • orange: Orange fruits
  • bottle: Cylindrical objects
See vision_class/process/image_processing.py:50-51
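A common silent failure is a label file whose class id isn't declared in data.yaml, or whose coordinates aren't normalized. A small pre-training check you might run over a labels directory (a sketch; the function name is illustrative):

```python
from pathlib import Path

def check_labels(labels_dir, num_classes):
    """Return a list of (file, line_no, reason) problems in YOLO label files."""
    problems = []
    for label_file in Path(labels_dir).glob('*.txt'):
        for i, line in enumerate(label_file.read_text().splitlines(), start=1):
            parts = line.split()
            if len(parts) != 5:
                problems.append((label_file.name, i, 'expected 5 fields'))
                continue
            cls, *coords = parts
            if not cls.isdigit() or int(cls) >= num_classes:
                problems.append((label_file.name, i, f'bad class id {cls}'))
            elif not all(0.0 <= float(v) <= 1.0 for v in coords):
                problems.append((label_file.name, i, 'coords not normalized'))
    return problems
```

Run it with `num_classes=3` for the apple/orange/bottle dataset; an empty list means every line parses cleanly.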

Annotation Tools

Recommended Tools:
  1. LabelImg (free, easy)
    pip install labelImg
    labelImg
    
    • Draw bounding boxes
    • Exports YOLO format directly
  2. Roboflow (online, free tier)
    • Web-based annotation
    • Auto-generates train/val split
    • Augmentation options
    • Export to YOLO format
  3. CVAT (advanced, free)
    • Team collaboration
    • Video annotation
    • Multiple export formats

Dataset Size Guidelines

Task Complexity   Images per Class   Total Images
Simple objects    50-100             150-300
Medium            200-500            600-1500
Complex/varied    1000+              3000+
Quality > Quantity
Better to have 100 well-annotated, diverse images than 1000 similar ones. Include:
  • Different lighting conditions
  • Various angles and distances
  • Different backgrounds
  • Occlusions and multiple objects

Training a YOLO Model

Basic Training Script

From export_model.py:1-5 (training foundation):
from ultralytics import YOLO

# Load pretrained model (transfer learning)
model = YOLO('yolo11s.pt')  # Small model

# Train on custom dataset
results = model.train(
    data='dataset/data.yaml',  # Dataset config
    epochs=100,                # Training iterations
    imgsz=640,                 # Input image size
    batch=16,                  # Batch size
    name='apple_orange_bottle' # Experiment name
)

Training Parameters

Essential Parameters:
  • epochs: Number of complete passes through dataset
    • Small dataset: 50-100
    • Large dataset: 100-300
    • Stop when validation loss plateaus
  • imgsz: Input image size (square)
    • Standard: 640 (good balance)
    • Faster: 320 (lower accuracy)
    • Better: 1280 (slower, better for small objects)
  • batch: Images per training step
    • Depends on GPU memory
    • YOLO11s: 16-32 (typical)
    • If OOM error: reduce batch size
  • lr0: Initial learning rate
    • Default: 0.01 (usually good)
    • Fine-tuning: 0.001 (lower)
Transfer Learning
Loading yolo11s.pt starts with weights pretrained on the COCO dataset (80 classes, 118k images). This dramatically improves results compared to training from scratch!

Advanced Training Configuration

from ultralytics import YOLO

model = YOLO('yolo11s.pt')

results = model.train(
    # Dataset
    data='dataset/data.yaml',
    
    # Training duration
    epochs=100,
    patience=50,  # Early stopping: stop if no improvement for 50 epochs
    
    # Image settings
    imgsz=640,
    batch=16,
    
    # Optimization
    optimizer='AdamW',  # or 'SGD', 'Adam'
    lr0=0.01,           # Initial learning rate
    lrf=0.01,           # Final learning rate (lr0 * lrf)
    momentum=0.937,     # SGD momentum
    weight_decay=0.0005,# Regularization
    
    # Augmentation
    hsv_h=0.015,        # Hue augmentation
    hsv_s=0.7,          # Saturation
    hsv_v=0.4,          # Value/brightness
    degrees=0.0,        # Rotation (+/- degrees)
    translate=0.1,      # Translation (fraction of image)
    scale=0.5,          # Scale +/- 
    shear=0.0,          # Shear angle
    perspective=0.0,    # Perspective distortion
    flipud=0.0,         # Vertical flip probability
    fliplr=0.5,         # Horizontal flip probability
    mosaic=1.0,         # Mosaic augmentation
    mixup=0.0,          # Mixup augmentation
    
    # Hardware
    device=0,           # GPU device (0, 1, etc.) or 'cpu'
    workers=8,          # Data loading threads
    
    # Output
    project='runs/train',
    name='experiment1',
    exist_ok=False,     # Overwrite existing
    pretrained=True,    # Use pretrained weights
    verbose=True,       # Print progress
    
    # Validation
    val=True,           # Validate during training
    save=True,          # Save checkpoints
    save_period=10,     # Save every N epochs
)
Data Augmentation
Augmentation creates variations of training images (flips, rotations, color changes) to:
  • Increase effective dataset size
  • Improve model generalization
  • Reduce overfitting
YOLO applies augmentations automatically during training!
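To see what an augmentation does to the labels, consider the fliplr (horizontal flip) option above: the box's x-center mirrors around the image midline while y and the box size are unchanged. A sketch of that transform on a single normalized label:

```python
def fliplr_label(class_id, x_center, y_center, width, height):
    """YOLO label after a horizontal image flip: x mirrors, everything else stays."""
    return class_id, 1.0 - x_center, y_center, width, height

# An object at 25% from the left ends up 25% from the right:
print(fliplr_label(0, 0.25, 0.6, 0.12, 0.18))  # → (0, 0.75, 0.6, 0.12, 0.18)
```

Ultralytics performs this label bookkeeping internally for every augmentation it applies.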

Understanding Training Output

Training Logs

During training, you’ll see output like:
Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  1/100      4.12G      1.234      2.456      1.123         89        640
  2/100      4.12G      1.156      2.234      1.087         89        640
  3/100      4.12G      1.089      2.012      1.034         89        640
...
Metrics Explained:
  • box_loss: Bounding box localization error
    • How far predicted boxes are from ground truth
    • Lower is better
  • cls_loss: Classification error
    • How well model predicts correct class
    • Lower is better
  • dfl_loss: Distribution Focal Loss (advanced)
    • Improves box prediction quality
    • Lower is better
Healthy Training:
  • Losses steadily decrease
  • Validation metrics improve
  • No massive jumps or instability
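The "losses steadily decrease" check can be made concrete: Ultralytics writes per-epoch metrics to a results.csv in the run directory, and you can compare early vs. late epochs. A sketch of the trend check itself (reading and column naming of results.csv varies by version, so only the comparison is shown):

```python
def is_decreasing(losses, window=3):
    """True if the average loss over the last `window` epochs is below the first."""
    if len(losses) < 2 * window:
        return losses[-1] < losses[0]
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

# box_loss values like the log excerpt above, continued over more epochs:
box_loss = [1.234, 1.156, 1.089, 0.98, 0.91, 0.87, 0.85, 0.84]
print(is_decreasing(box_loss))  # → True
```

A False result on any of the three losses is a cue to revisit the learning rate or the annotations.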

Validation Metrics

Class     Images  Instances      P      R   mAP50  mAP50-95
apple         50         75   0.89   0.92    0.91      0.67
orange        50         68   0.87   0.89    0.88      0.64
bottle        50         82   0.92   0.94    0.93      0.71
Metrics:
  • P (Precision): Of all predicted apples, how many were correct?
    • Precision = True Positives / (True Positives + False Positives)
    • High precision = few false alarms
  • R (Recall): Of all actual apples, how many were detected?
    • Recall = True Positives / (True Positives + False Negatives)
    • High recall = few missed objects
  • mAP50: Mean Average Precision at IoU=0.5
    • Overall detection quality
    • Standard metric for comparison
    • 0.5-0.6: Decent, 0.7-0.8: Good, 0.8+: Excellent
  • mAP50-95: Average mAP from IoU 0.5 to 0.95
    • More strict metric
    • Tests box localization accuracy
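The P and R columns come straight from the confusion counts. A tiny worked sketch (the counts below are illustrative, not from the table above):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 69 correct apple detections, 8 false alarms, 6 missed apples
p, r = precision_recall(tp=69, fp=8, fn=6)
print(f'P={p:.2f}  R={r:.2f}')  # → P=0.90  R=0.92
```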
What’s IoU?
Intersection over Union measures box overlap:
  • IoU = Area of Overlap / Area of Union
  • IoU > 0.5: Detection is considered correct
  • IoU > 0.95: Nearly perfect box alignment
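The IoU computation is short enough to write out. A sketch for axis-aligned boxes in (x_min, y_min, x_max, y_max) form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero if the boxes don't overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes → 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-shifted → 50/150 ≈ 0.333
```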

Model Evaluation

Validation During Training

YOLO automatically validates every epoch:
# Enable validation (default)
model.train(
    data='data.yaml',
    epochs=100,
    val=True  # Run validation each epoch
)

Post-Training Evaluation

from ultralytics import YOLO

# Load trained model
model = YOLO('runs/train/experiment1/weights/best.pt')

# Evaluate on validation set
metrics = model.val()

print(f'mAP50: {metrics.box.map50}')
print(f'mAP50-95: {metrics.box.map}')
print(f'Precision: {metrics.box.mp}')
print(f'Recall: {metrics.box.mr}')

Testing on Individual Images

# Run inference
results = model.predict(
    source='test_images/',
    conf=0.5,      # Confidence threshold
    save=True,     # Save annotated images
    show=True      # Display results
)

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        xyxy = box.xyxy[0].cpu().numpy()
        print(f'Class: {model.names[cls]}, Conf: {conf:.2f}, Box: {xyxy}')

Training Tips and Troubleshooting

Common Issues

Problem: Loss not decreasing
Solutions:
  • Reduce learning rate (lr0=0.001)
  • Check annotations (incorrect labels?)
  • Increase batch size
  • More epochs needed
Problem: Overfitting (train good, val poor)
Solutions:
  • Add more training data
  • Increase augmentation
  • Reduce model size (yolo11s → yolo11n)
  • Add regularization (weight_decay=0.001)
Problem: Out of memory (OOM)
Solutions:
  • Reduce batch size (batch=8 or batch=4)
  • Reduce image size (imgsz=320)
  • Use smaller model (yolo11n)
  • Close other programs
Problem: Training too slow
Solutions:
  • Use GPU (check device=0)
  • Reduce image size
  • Reduce workers if I/O bottleneck
  • Use mixed precision: amp=True
GPU Memory Requirements
YOLO11s with batch=16, imgsz=640:
  • Approximately 4-6 GB VRAM
  • RTX 3060 (12GB): batch=32
  • RTX 3050 (8GB): batch=16
  • GTX 1650 (4GB): batch=8, imgsz=416

Best Practices

  1. Start Small: Train for 10-20 epochs first, verify setup works
  2. Monitor Validation: Watch mAP, not just loss
  3. Use Callbacks: Save best model based on mAP, not last epoch
  4. Experiment Tracking: Keep notes on what works
  5. Test Incrementally: Validate on real robot images, not just dataset

Course Project Dataset

For the robotic arm project, we detect the three target classes. From image_processing.py:50-51:
if detected_class in ['apple', 'orange', 'bottle']:
    clss_object = detected_class
Dataset Recommendations:
  • Apple: 50-100 images of red apples from various angles
  • Orange: 50-100 images of oranges
  • Bottle: 50-100 images of common bottles (water, soda)
Collection Tips:
  • Use actual objects robot will interact with
  • Match lighting conditions of robot workspace
  • Include images with multiple objects
  • Vary distances (close, medium, far)
You can start with a pretrained COCO model! YOLO11 already knows ‘apple’, ‘orange’, and ‘bottle’ (classes 47, 49, 39). Fine-tuning on your specific environment improves accuracy.
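Since a COCO-pretrained model emits all 80 classes, you have to narrow the output to the three you care about. Ultralytics' predict accepts a classes= argument for exactly this; the same post-filter is easy to express directly (the detection tuples below are illustrative, not the real Results API):

```python
# COCO class indices for the course's three target objects
TARGET_CLASSES = {39: 'bottle', 47: 'apple', 49: 'orange'}

def filter_detections(detections, min_conf=0.5):
    """Keep (class_id, confidence, box) detections that are targets above min_conf."""
    return [(TARGET_CLASSES[cls], conf, box)
            for cls, conf, box in detections
            if cls in TARGET_CLASSES and conf >= min_conf]

# Example: raw detections as (class_id, confidence, (x1, y1, x2, y2)) tuples
raw = [(47, 0.91, (120, 80, 200, 160)),   # apple, confident -> kept
       (0, 0.88, (10, 10, 50, 90)),       # person -> not a target, dropped
       (49, 0.42, (300, 200, 360, 260))]  # orange, below threshold -> dropped
print(filter_detections(raw))  # → [('apple', 0.91, (120, 80, 200, 160))]
```

With the real API, the equivalent is `model.predict(source, classes=[39, 47, 49], conf=0.5)`.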

Practice Exercise

Train a Custom Detector

Task: Train YOLO11s to detect apples and oranges
Steps:
  1. Collect 100 images (50 apples, 50 oranges)
  2. Annotate using LabelImg
  3. Split: 80 train, 20 validation
  4. Create data.yaml
  5. Train for 50 epochs
  6. Evaluate: Achieve mAP50 > 0.7
Success Criteria:
  • Training completes without errors
  • Validation mAP50 > 0.7
  • Model detects objects in new test images
  • Inference runs at >10 FPS

Extension: Experiment with Parameters

Try different configurations:
  • Model sizes: yolo11n vs yolo11s vs yolo11m
  • Image sizes: 320 vs 640 vs 1280
  • Augmentation: heavy vs light
Compare results and inference speed.

Summary

You’ve learned:
  • ✓ YOLO architecture and why it’s ideal for robotics
  • ✓ Dataset preparation and annotation formats
  • ✓ Training configuration and parameters
  • ✓ Transfer learning from pretrained models
  • ✓ Evaluation metrics (Precision, Recall, mAP)
  • ✓ Troubleshooting training issues
  • ✓ Best practices for custom object detection

Next Steps

With a trained model, you need to deploy it to Raspberry Pi! The next lesson covers converting YOLO models to optimized formats for edge devices.

Model Conversion

Export models to ONNX, MNN, and NCNN formats for edge deployment
Reference Code: course/vision_class/
  • process/detection/main.py:36: Inference parameters (conf, imgsz, half precision)
  • process/image_processing.py:50-51: Target classes for course project
