
YOLO Model Training

Computer vision is the eyes of your robotic system. In this lesson, you’ll learn how to train YOLO (You Only Look Once) models to detect specific objects for pick-and-place tasks.

Learning Objectives

By the end of this lesson, you will be able to:
  • Understand YOLO architecture and how it works
  • Prepare datasets for object detection training
  • Train custom YOLO models using Ultralytics
  • Configure training parameters for optimal results
  • Evaluate model performance
  • Select appropriate model sizes for your hardware
This course uses YOLO11 (also called YOLOv11), the latest version from Ultralytics with improved speed and accuracy.

Why YOLO for Robotics?

Traditional vs. YOLO Detection

Traditional Two-Stage Detectors (R-CNN, Faster R-CNN):
  1. Propose regions of interest
  2. Classify each region
  3. Slow: 5-10 FPS
YOLO Single-Stage Detector:
  1. Process entire image in one pass
  2. Predict bounding boxes and classes simultaneously
  3. Fast: 30-100+ FPS
For real-time robot control, you need low latency. YOLO’s speed makes it ideal for robotics where decisions must be made in milliseconds.

YOLO Architecture Overview

Components:
  1. Backbone: Extracts features (edges, shapes, textures)
    • Convolutional layers with residual connections
    • Progressively reduces spatial dimensions
    • Increases feature channels
  2. Neck: Fuses features from different scales
    • Path Aggregation Network (PANet)
    • Detects both small and large objects
  3. Head: Predicts detections
    • Bounding box coordinates (x, y, width, height)
    • Class probabilities
    • Confidence scores

Model Sizes

YOLO11 comes in multiple sizes trading off speed vs. accuracy:
Model     Parameters   Speed (ms)   mAP50-95   Use Case
YOLO11n   2.6M         5            39.5       Ultra-fast, low accuracy
YOLO11s   9.4M         10           47.0       Best for robotics
YOLO11m   20.1M        18           51.5       Balanced
YOLO11l   25.3M        25           53.4       High accuracy
YOLO11x   56.9M        40           54.7       Maximum accuracy
For Raspberry Pi: Use YOLO11s (small) or YOLO11n (nano). These provide good accuracy while running at acceptable framerates on edge devices.

Setting Up Training Environment

Installation

# Install Ultralytics YOLO
pip install ultralytics

# Verify installation
yolo version

# Install training dependencies
pip install torch torchvision opencv-python pillow pyyaml
Training on GPU is 10-50x faster than CPU:
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Should print: True
If you don’t have a GPU, you can:
  • Use Google Colab (free GPU)
  • Use cloud platforms (AWS, Azure, Google Cloud)
  • Train on CPU (slower but works for small datasets)

Dataset Preparation

Dataset Structure

YOLO expects a specific directory structure:
dataset/
├── data.yaml          # Dataset configuration
├── train/
│   ├── images/        # Training images
│   │   ├── img001.jpg
│   │   ├── img002.jpg
│   │   └── ...
│   └── labels/        # Annotations
│       ├── img001.txt
│       ├── img002.txt
│       └── ...
└── val/
    ├── images/        # Validation images
    └── labels/        # Validation annotations

Annotation Format

Each image has a corresponding .txt file with one line per object:
<class_id> <x_center> <y_center> <width> <height>
All values are normalized (0-1):
  • class_id: Integer class index (0, 1, 2…)
  • x_center: Center X / image width
  • y_center: Center Y / image height
  • width: Bounding box width / image width
  • height: Bounding box height / image height
Example (img001.txt):
0 0.5 0.4 0.15 0.2
1 0.3 0.6 0.12 0.18
This represents:
  • Object of class 0 at center (50%, 40%) with size 15% × 20%
  • Object of class 1 at center (30%, 60%) with size 12% × 18%
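If your annotation tool emits pixel coordinates instead, the conversion to this normalized format is only a few lines. A sketch (the function name is illustrative):

```python
def to_yolo(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w   # box center, normalized by image width
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w          # box size, normalized
    height = (y_max - y_min) / img_h
    return f'{class_id} {x_center:.6g} {y_center:.6g} {width:.6g} {height:.6g}'

# A 96x128 pixel box centered at (320, 256) in a 640x640 image:
print(to_yolo(0, 272, 192, 368, 320, 640, 640))  # → "0 0.5 0.4 0.15 0.2"
```

This reproduces the first line of the img001.txt example above.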

Data Configuration File

data.yaml:
# Dataset paths
path: /path/to/dataset  # Root directory
train: train/images     # Training images (relative to path)
val: val/images         # Validation images (relative to path)

# Classes
names:
  0: apple
  1: orange
  2: bottle
For this course project, we detect three objects:
  • apple: Red fruits for pick-and-place
  • orange: Orange fruits
  • bottle: Cylindrical objects
See vision_class/process/image_processing.py:50-51
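A common silent failure is a label file whose class id isn't declared in data.yaml, or whose coordinates aren't normalized. A small pre-training check you might run over a labels directory (a sketch; the function name is illustrative):

```python
from pathlib import Path

def check_labels(labels_dir, num_classes):
    """Return a list of (file, line_no, reason) problems in YOLO label files."""
    problems = []
    for label_file in Path(labels_dir).glob('*.txt'):
        for i, line in enumerate(label_file.read_text().splitlines(), start=1):
            parts = line.split()
            if len(parts) != 5:
                problems.append((label_file.name, i, 'expected 5 fields'))
                continue
            cls, *coords = parts
            if not cls.isdigit() or int(cls) >= num_classes:
                problems.append((label_file.name, i, f'bad class id {cls}'))
            elif not all(0.0 <= float(v) <= 1.0 for v in coords):
                problems.append((label_file.name, i, 'coords not normalized'))
    return problems
```

Run it with `num_classes=3` for the apple/orange/bottle dataset; an empty list means every line parses cleanly.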

Annotation Tools

Recommended Tools:
  1. LabelImg (free, easy)
    pip install labelImg
    labelImg
    
    • Draw bounding boxes
    • Exports YOLO format directly
  2. Roboflow (online, free tier)
    • Web-based annotation
    • Auto-generates train/val split
    • Augmentation options
    • Export to YOLO format
  3. CVAT (advanced, free)
    • Team collaboration
    • Video annotation
    • Multiple export formats

Dataset Size Guidelines

Task Complexity   Images per Class   Total Images
Simple objects    50-100             150-300
Medium            200-500            600-1500
Complex/varied    1000+              3000+
Quality > Quantity
Better to have 100 well-annotated, diverse images than 1000 similar ones. Include:
  • Different lighting conditions
  • Various angles and distances
  • Different backgrounds
  • Occlusions and multiple objects

Training a YOLO Model

Basic Training Script

From export_model.py:1-5 (training foundation):
from ultralytics import YOLO

# Load pretrained model (transfer learning)
model = YOLO('yolo11s.pt')  # Small model

# Train on custom dataset
results = model.train(
    data='dataset/data.yaml',  # Dataset config
    epochs=100,                # Training iterations
    imgsz=640,                 # Input image size
    batch=16,                  # Batch size
    name='apple_orange_bottle' # Experiment name
)

Training Parameters

Essential Parameters:
  • epochs: Number of complete passes through dataset
    • Small dataset: 50-100
    • Large dataset: 100-300
    • Stop when validation loss plateaus
  • imgsz: Input image size (square)
    • Standard: 640 (good balance)
    • Faster: 320 (lower accuracy)
    • Better: 1280 (slower, better for small objects)
  • batch: Images per training step
    • Depends on GPU memory
    • YOLO11s: 16-32 (typical)
    • If OOM error: reduce batch size
  • lr0: Initial learning rate
    • Default: 0.01 (usually good)
    • Fine-tuning: 0.001 (lower)
Transfer Learning
Loading yolo11s.pt starts with weights pretrained on the COCO dataset (80 classes, 118k images). This dramatically improves results compared to training from scratch!

Advanced Training Configuration

from ultralytics import YOLO

model = YOLO('yolo11s.pt')

results = model.train(
    # Dataset
    data='dataset/data.yaml',
    
    # Training duration
    epochs=100,
    patience=50,  # Early stopping: stop if no improvement for 50 epochs
    
    # Image settings
    imgsz=640,
    batch=16,
    
    # Optimization
    optimizer='AdamW',  # or 'SGD', 'Adam'
    lr0=0.01,           # Initial learning rate
    lrf=0.01,           # Final learning rate (lr0 * lrf)
    momentum=0.937,     # SGD momentum
    weight_decay=0.0005,# Regularization
    
    # Augmentation
    hsv_h=0.015,        # Hue augmentation
    hsv_s=0.7,          # Saturation
    hsv_v=0.4,          # Value/brightness
    degrees=0.0,        # Rotation (+/- degrees)
    translate=0.1,      # Translation (fraction of image)
    scale=0.5,          # Scale +/- 
    shear=0.0,          # Shear angle
    perspective=0.0,    # Perspective distortion
    flipud=0.0,         # Vertical flip probability
    fliplr=0.5,         # Horizontal flip probability
    mosaic=1.0,         # Mosaic augmentation
    mixup=0.0,          # Mixup augmentation
    
    # Hardware
    device=0,           # GPU device (0, 1, etc.) or 'cpu'
    workers=8,          # Data loading threads
    
    # Output
    project='runs/train',
    name='experiment1',
    exist_ok=False,     # Overwrite existing
    pretrained=True,    # Use pretrained weights
    verbose=True,       # Print progress
    
    # Validation
    val=True,           # Validate during training
    save=True,          # Save checkpoints
    save_period=10,     # Save every N epochs
)
Data Augmentation
Augmentation creates variations of training images (flips, rotations, color changes) to:
  • Increase effective dataset size
  • Improve model generalization
  • Reduce overfitting
YOLO applies augmentations automatically during training!
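To see what an augmentation does to the labels, consider the fliplr (horizontal flip) option above: the box's x-center mirrors around the image midline while y and the box size are unchanged. A sketch of that transform on a single normalized label:

```python
def fliplr_label(class_id, x_center, y_center, width, height):
    """YOLO label after a horizontal image flip: x mirrors, everything else stays."""
    return class_id, 1.0 - x_center, y_center, width, height

# An object at 25% from the left ends up 25% from the right:
print(fliplr_label(0, 0.25, 0.6, 0.12, 0.18))  # → (0, 0.75, 0.6, 0.12, 0.18)
```

Ultralytics performs this label bookkeeping internally for every augmentation it applies.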

Understanding Training Output

Training Logs

During training, you’ll see output like:
Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  1/100      4.12G      1.234      2.456      1.123         89        640
  2/100      4.12G      1.156      2.234      1.087         89        640
  3/100      4.12G      1.089      2.012      1.034         89        640
...
Metrics Explained:
  • box_loss: Bounding box localization error
    • How far predicted boxes are from ground truth
    • Lower is better
  • cls_loss: Classification error
    • How well model predicts correct class
    • Lower is better
  • dfl_loss: Distribution Focal Loss (advanced)
    • Improves box prediction quality
    • Lower is better
Healthy Training:
  • Losses steadily decrease
  • Validation metrics improve
  • No massive jumps or instability
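The "losses steadily decrease" check can be made concrete: Ultralytics writes per-epoch metrics to a results.csv in the run directory, and you can compare early vs. late epochs. A sketch of the trend check itself (reading and column naming of results.csv varies by version, so only the comparison is shown):

```python
def is_decreasing(losses, window=3):
    """True if the average loss over the last `window` epochs is below the first."""
    if len(losses) < 2 * window:
        return losses[-1] < losses[0]
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

# box_loss values like the log excerpt above, continued over more epochs:
box_loss = [1.234, 1.156, 1.089, 0.98, 0.91, 0.87, 0.85, 0.84]
print(is_decreasing(box_loss))  # → True
```

A False result on any of the three losses is a cue to revisit the learning rate or the annotations.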

Validation Metrics

Class     Images  Instances      P      R   mAP50  mAP50-95
apple         50         75   0.89   0.92    0.91      0.67
orange        50         68   0.87   0.89    0.88      0.64
bottle        50         82   0.92   0.94    0.93      0.71
Metrics:
  • P (Precision): Of all predicted apples, how many were correct?
    • Precision = True Positives / (True Positives + False Positives)
    • High precision = few false alarms
  • R (Recall): Of all actual apples, how many were detected?
    • Recall = True Positives / (True Positives + False Negatives)
    • High recall = few missed objects
  • mAP50: Mean Average Precision at IoU=0.5
    • Overall detection quality
    • Standard metric for comparison
    • 0.5-0.6: Decent, 0.7-0.8: Good, 0.8+: Excellent
  • mAP50-95: Average mAP from IoU 0.5 to 0.95
    • More strict metric
    • Tests box localization accuracy
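The P and R columns come straight from the confusion counts. A tiny worked sketch (the counts below are illustrative, not from the table above):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 69 correct apple detections, 8 false alarms, 6 missed apples
p, r = precision_recall(tp=69, fp=8, fn=6)
print(f'P={p:.2f}  R={r:.2f}')  # → P=0.90  R=0.92
```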
What’s IoU?
Intersection over Union measures box overlap:
  • IoU = Area of Overlap / Area of Union
  • IoU > 0.5: Detection is considered correct
  • IoU > 0.95: Nearly perfect box alignment
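The IoU computation is short enough to write out. A sketch for axis-aligned boxes in (x_min, y_min, x_max, y_max) form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero if the boxes don't overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes → 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-shifted → 50/150 ≈ 0.333
```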

Model Evaluation

Validation During Training

YOLO automatically validates every epoch:
# Enable validation (default)
model.train(
    data='data.yaml',
    epochs=100,
    val=True  # Run validation each epoch
)

Post-Training Evaluation

from ultralytics import YOLO

# Load trained model
model = YOLO('runs/train/experiment1/weights/best.pt')

# Evaluate on validation set
metrics = model.val()

print(f'mAP50: {metrics.box.map50}')
print(f'mAP50-95: {metrics.box.map}')
print(f'Precision: {metrics.box.mp}')
print(f'Recall: {metrics.box.mr}')

Testing on Individual Images

# Run inference
results = model.predict(
    source='test_images/',
    conf=0.5,      # Confidence threshold
    save=True,     # Save annotated images
    show=True      # Display results
)

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = int(box.cls[0])
        conf = float(box.conf[0])
        xyxy = box.xyxy[0].cpu().numpy()
        print(f'Class: {model.names[cls]}, Conf: {conf:.2f}, Box: {xyxy}')

Training Tips and Troubleshooting

Common Issues

Problem: Loss not decreasing
Solutions:
  • Reduce learning rate (lr0=0.001)
  • Check annotations (incorrect labels?)
  • Increase batch size
  • More epochs needed
Problem: Overfitting (train good, val poor)
Solutions:
  • Add more training data
  • Increase augmentation
  • Reduce model size (yolo11s → yolo11n)
  • Add regularization (weight_decay=0.001)
Problem: Out of memory (OOM)
Solutions:
  • Reduce batch size (batch=8 or batch=4)
  • Reduce image size (imgsz=320)
  • Use smaller model (yolo11n)
  • Close other programs
Problem: Training too slow
Solutions:
  • Use GPU (check device=0)
  • Reduce image size
  • Reduce workers if I/O bottleneck
  • Use mixed precision: amp=True
GPU Memory Requirements
YOLO11s with batch=16, imgsz=640:
  • Approximately 4-6 GB VRAM
  • RTX 3060 (12GB): batch=32
  • RTX 3050 (8GB): batch=16
  • GTX 1650 (4GB): batch=8, imgsz=416

Best Practices

  1. Start Small: Train for 10-20 epochs first, verify setup works
  2. Monitor Validation: Watch mAP, not just loss
  3. Use Callbacks: Save best model based on mAP, not last epoch
  4. Experiment Tracking: Keep notes on what works
  5. Test Incrementally: Validate on real robot images, not just dataset

Course Project Dataset

For the robotic arm project, we detect the three target classes. From image_processing.py:50-51:
if detected_class in ['apple', 'orange', 'bottle']:
    clss_object = detected_class
Dataset Recommendations:
  • Apple: 50-100 images of red apples from various angles
  • Orange: 50-100 images of oranges
  • Bottle: 50-100 images of common bottles (water, soda)
Collection Tips:
  • Use actual objects robot will interact with
  • Match lighting conditions of robot workspace
  • Include images with multiple objects
  • Vary distances (close, medium, far)
You can start with a pretrained COCO model! YOLO11 already knows ‘apple’, ‘orange’, and ‘bottle’ (classes 47, 49, 39). Fine-tuning on your specific environment improves accuracy.
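Since a COCO-pretrained model emits all 80 classes, you have to narrow the output to the three you care about. Ultralytics' predict accepts a classes= argument for exactly this; the same post-filter is easy to express directly (the detection tuples below are illustrative, not the real Results API):

```python
# COCO class indices for the course's three target objects
TARGET_CLASSES = {39: 'bottle', 47: 'apple', 49: 'orange'}

def filter_detections(detections, min_conf=0.5):
    """Keep (class_id, confidence, box) detections that are targets above min_conf."""
    return [(TARGET_CLASSES[cls], conf, box)
            for cls, conf, box in detections
            if cls in TARGET_CLASSES and conf >= min_conf]

# Example: raw detections as (class_id, confidence, (x1, y1, x2, y2)) tuples
raw = [(47, 0.91, (120, 80, 200, 160)),   # apple, confident -> kept
       (0, 0.88, (10, 10, 50, 90)),       # person -> not a target, dropped
       (49, 0.42, (300, 200, 360, 260))]  # orange, below threshold -> dropped
print(filter_detections(raw))  # → [('apple', 0.91, (120, 80, 200, 160))]
```

With the real API, the equivalent is `model.predict(source, classes=[39, 47, 49], conf=0.5)`.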

Practice Exercise

Train a Custom Detector

Task: Train YOLO11s to detect apples and oranges
Steps:
  1. Collect 100 images (50 apples, 50 oranges)
  2. Annotate using LabelImg
  3. Split: 80 train, 20 validation
  4. Create data.yaml
  5. Train for 50 epochs
  6. Evaluate: Achieve mAP50 > 0.7
Success Criteria:
  • Training completes without errors
  • Validation mAP50 > 0.7
  • Model detects objects in new test images
  • Inference runs at >10 FPS

Extension: Experiment with Parameters

Try different configurations:
  • Model sizes: yolo11n vs yolo11s vs yolo11m
  • Image sizes: 320 vs 640 vs 1280
  • Augmentation: heavy vs light
Compare results and inference speed.

Summary

You’ve learned:
  • ✓ YOLO architecture and why it’s ideal for robotics
  • ✓ Dataset preparation and annotation formats
  • ✓ Training configuration and parameters
  • ✓ Transfer learning from pretrained models
  • ✓ Evaluation metrics (Precision, Recall, mAP)
  • ✓ Troubleshooting training issues
  • ✓ Best practices for custom object detection

Next Steps

With a trained model, you need to deploy it to Raspberry Pi! The next lesson covers converting YOLO models to optimized formats for edge devices.

Model Conversion

Export models to ONNX, MNN, and NCNN formats for edge deployment
Reference Code: course/vision_class/
  • process/detection/main.py:36: Inference parameters (conf, imgsz, half precision)
  • process/image_processing.py:50-51: Target classes for course project
