YOLO Model Training
Computer vision is the eyes of your robotic system. In this lesson, you’ll learn how to train YOLO (You Only Look Once) models to detect specific objects for pick-and-place tasks.

Learning Objectives
By the end of this lesson, you will be able to:
- Understand YOLO architecture and how it works
- Prepare datasets for object detection training
- Train custom YOLO models using Ultralytics
- Configure training parameters for optimal results
- Evaluate model performance
- Select appropriate model sizes for your hardware
This course uses YOLO11 (also called YOLOv11), the latest version from Ultralytics with improved speed and accuracy.
Why YOLO for Robotics?
Traditional vs. YOLO Detection
Traditional Two-Stage Detectors (R-CNN, Faster R-CNN):
- Propose regions of interest
- Classify each region
- Slow: 5-10 FPS

Single-Stage Detectors (YOLO):
- Process the entire image in one pass
- Predict bounding boxes and classes simultaneously
- Fast: 30-100+ FPS
For real-time robot control, you need low latency. YOLO’s speed makes it ideal for robotics where decisions must be made in milliseconds.
YOLO Architecture Overview
Components:
- Backbone: Extracts features (edges, shapes, textures)
  - Convolutional layers with residual connections
  - Progressively reduces spatial dimensions
  - Increases feature channels
- Neck: Fuses features from different scales
  - Path Aggregation Network (PANet)
  - Detects both small and large objects
- Head: Predicts detections
  - Bounding box coordinates (x, y, width, height)
  - Class probabilities
  - Confidence scores
Model Sizes
YOLO11 comes in multiple sizes trading off speed vs. accuracy:

| Model | Parameters | Speed (ms) | mAP | Use Case |
|---|---|---|---|---|
| YOLO11n | 2.6M | 5 | 39.5 | Ultra-fast, low accuracy |
| YOLO11s | 9.4M | 10 | 47.0 | Best for robotics |
| YOLO11m | 20.1M | 18 | 51.5 | Balanced |
| YOLO11l | 25.3M | 25 | 53.4 | High accuracy |
| YOLO11x | 56.9M | 40 | 54.7 | Maximum accuracy |
Setting Up Training Environment
Installation
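Ultralytics ships as a single pip package that pulls in PyTorch and its other dependencies. A minimal install, assuming Python 3.8+ and pip are available:

```shell
# Install Ultralytics YOLO (pulls in PyTorch and other dependencies)
pip install ultralytics

# Sanity check: print the installed version
python -c "import ultralytics; print(ultralytics.__version__)"
```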
GPU Setup (Recommended)
Training on a GPU is 10-50x faster than on a CPU.

If you don’t have a GPU, you can:
- Use Google Colab (free GPU)
- Use cloud platforms (AWS, Azure, Google Cloud)
- Train on CPU (slower but works for small datasets)
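You can confirm that PyTorch (installed alongside Ultralytics) can see your GPU with a short check; the function below is a sketch that degrades gracefully if PyTorch is not installed:

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch  # installed as a dependency of ultralytics
    except ImportError:
        return False
    return torch.cuda.is_available()

print("CUDA GPU available:", gpu_available())
```

If this prints `False` on a machine with an NVIDIA GPU, the usual culprit is a CPU-only PyTorch build or a driver/CUDA mismatch.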
Dataset Preparation
Dataset Structure
YOLO expects a specific directory structure:

Annotation Format
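Before annotating, arrange your files in the layout Ultralytics expects. A common convention (the dataset root path itself is configured in data.yaml):

```
dataset/
├── images/
│   ├── train/   # training images (.jpg, .png)
│   └── val/     # validation images
├── labels/
│   ├── train/   # one .txt annotation file per training image
│   └── val/     # one .txt annotation file per validation image
└── data.yaml    # dataset configuration
```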
Each image has a corresponding .txt file with one line per object:
- class_id: Integer class index (0, 1, 2…)
- x_center: Center X / image width
- y_center: Center Y / image height
- width: Bounding box width / image width
- height: Bounding box height / image height
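To illustrate these fields, here is a small pure-Python helper (a hypothetical `to_yolo` function, not part of the course code) that converts a pixel-space corner box into a YOLO annotation line:

```python
def to_yolo(x1: float, y1: float, x2: float, y2: float,
            img_w: int, img_h: int, class_id: int) -> str:
    """Convert a corner-format pixel box to a normalized YOLO annotation line."""
    x_center = (x1 + x2) / 2 / img_w   # box center X, normalized by image width
    y_center = (y1 + y2) / 2 / img_h   # box center Y, normalized by image height
    width = (x2 - x1) / img_w          # box width, normalized
    height = (y2 - y1) / img_h         # box height, normalized
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# 100x80 pixel box centered at (320, 240) in a 640x480 image
print(to_yolo(270, 200, 370, 280, 640, 480, 0))
# -> 0 0.500000 0.500000 0.156250 0.166667
```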
Example annotation file (img001.txt) describing:
- Object of class 0 at center (50%, 40%) with size 15% × 20%
- Object of class 1 at center (30%, 60%) with size 12% × 18%
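The two objects described above correspond to these two lines in img001.txt:

```
0 0.50 0.40 0.15 0.20
1 0.30 0.60 0.12 0.18
```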
Data Configuration File
The dataset is described by a data.yaml configuration file. For this course project, we detect three objects:
- apple: Red fruits for pick-and-place
- orange: Orange fruits
- bottle: Cylindrical objects
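A data.yaml for these three classes might look like the following (the `path` value is a placeholder for your dataset location):

```yaml
path: dataset          # dataset root directory
train: images/train    # training images, relative to path
val: images/val        # validation images, relative to path

names:
  0: apple
  1: orange
  2: bottle
```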
These class names come from vision_class/process/image_processing.py:50-51.

Annotation Tools
Recommended Tools:
- LabelImg (free, easy)
  - Draw bounding boxes
  - Exports YOLO format directly
- Roboflow (online, free tier)
  - Web-based annotation
  - Auto-generates train/val split
  - Augmentation options
  - Export to YOLO format
- CVAT (advanced, free)
  - Team collaboration
  - Video annotation
  - Multiple export formats
Dataset Size Guidelines
| Task Complexity | Images per Class | Total Images |
|---|---|---|
| Simple objects | 50-100 | 150-300 |
| Medium | 200-500 | 600-1500 |
| Complex/varied | 1000+ | 3000+ |
Training a YOLO Model
Basic Training Script
From export_model.py:1-5 (training foundation):
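A minimal training script along those lines, using the standard Ultralytics API (a sketch, not the exact course file; the import is deferred into a function so the module can be loaded without starting a run):

```python
def train_model(data_yaml: str = "data.yaml") -> None:
    """Train YOLO11s on a custom dataset, starting from COCO-pretrained weights."""
    from ultralytics import YOLO  # requires `pip install ultralytics`

    model = YOLO("yolo11s.pt")  # pretrained weights -> transfer learning
    model.train(
        data=data_yaml,  # dataset configuration file
        epochs=100,      # full passes through the dataset
        imgsz=640,       # input image size
        batch=16,        # images per training step
    )
```

Call `train_model()` from a script or notebook once your dataset and data.yaml are in place.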
Training Parameters
Essential Parameters:
- epochs: Number of complete passes through the dataset
  - Small dataset: 50-100
  - Large dataset: 100-300
  - Stop when validation loss plateaus
- imgsz: Input image size (square)
  - Standard: 640 (good balance)
  - Faster: 320 (lower accuracy)
  - Better: 1280 (slower, better for small objects)
- batch: Images per training step
  - Depends on GPU memory
  - YOLO11s: 16-32 (typical)
  - If OOM error: reduce batch size
- lr0: Initial learning rate
  - Default: 0.01 (usually good)
  - Fine-tuning: 0.001 (lower)
Transfer Learning
Loading yolo11s.pt starts from weights pretrained on the COCO dataset (80 classes, 118k images). This dramatically improves results compared to training from scratch!

Advanced Training Configuration
Data Augmentation
Augmentation creates variations of training images (flips, rotations, color changes) to:
- Increase effective dataset size
- Improve model generalization
- Reduce overfitting
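With the Ultralytics API, augmentation is controlled through keyword arguments to `train()`. A sketch with a few commonly tuned settings (the values shown are illustrative, not course-prescribed):

```python
def train_with_augmentation(data_yaml: str = "data.yaml") -> None:
    """Training run with explicit augmentation settings (sketch)."""
    from ultralytics import YOLO  # requires `pip install ultralytics`

    model = YOLO("yolo11s.pt")
    model.train(
        data=data_yaml,
        epochs=100,
        imgsz=640,
        hsv_h=0.015,   # random hue shift (color change)
        hsv_s=0.7,     # random saturation shift
        hsv_v=0.4,     # random brightness shift
        degrees=10.0,  # random rotation, in +/- degrees
        fliplr=0.5,    # horizontal flip probability
        mosaic=1.0,    # mosaic augmentation probability
    )
```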
Understanding Training Output
Training Logs
During training, three loss values are reported each epoch:
- box_loss: Bounding box localization error
  - How far predicted boxes are from ground truth
  - Lower is better
- cls_loss: Classification error
  - How well the model predicts the correct class
  - Lower is better
- dfl_loss: Distribution Focal Loss (advanced)
  - Improves box prediction quality
  - Lower is better

Signs of healthy training:
- Losses steadily decrease
- Validation metrics improve
- No massive jumps or instability
Validation Metrics
- P (Precision): Of all predicted apples, how many were correct?
  - Precision = True Positives / (True Positives + False Positives)
  - High precision = few false alarms
- R (Recall): Of all actual apples, how many were detected?
  - Recall = True Positives / (True Positives + False Negatives)
  - High recall = few missed objects
- mAP50: Mean Average Precision at IoU=0.5
  - Overall detection quality
  - Standard metric for comparison
  - 0.5-0.6: decent, 0.7-0.8: good, 0.8+: excellent
- mAP50-95: Average mAP over IoU thresholds from 0.5 to 0.95
  - Stricter metric
  - Tests box localization accuracy
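The precision and recall formulas above are easy to sanity-check in plain Python (the counts below are made up for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from detection counts."""
    precision = tp / (tp + fp)  # fraction of predictions that were correct
    recall = tp / (tp + fn)     # fraction of real objects that were found
    return precision, recall

# Hypothetical epoch: 80 correct detections, 10 false alarms, 20 misses
p, r = precision_recall(tp=80, fp=10, fn=20)
print(f"Precision={p:.3f} Recall={r:.3f}")  # Precision=0.889 Recall=0.800
```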
Model Evaluation
Validation During Training
YOLO automatically validates on the validation split at the end of every epoch.

Post-Training Evaluation
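After training, the best checkpoint can be re-validated explicitly. A sketch using the Ultralytics API (the weights path shown is the library’s default output location):

```python
def evaluate(weights: str = "runs/detect/train/weights/best.pt") -> None:
    """Validate a trained model and print the headline metrics (sketch)."""
    from ultralytics import YOLO  # requires `pip install ultralytics`

    model = YOLO(weights)
    metrics = model.val(data="data.yaml")
    print("mAP50:   ", metrics.box.map50)  # mean AP at IoU 0.5
    print("mAP50-95:", metrics.box.map)    # mean AP over IoU 0.5-0.95
```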
Testing on Individual Images
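To spot-check the model on a single image, run inference and walk through the detected boxes (a sketch; the file paths are placeholders):

```python
def detect_image(weights: str = "runs/detect/train/weights/best.pt",
                 image: str = "test.jpg") -> None:
    """Run inference on one image and print each detection (sketch)."""
    from ultralytics import YOLO  # requires `pip install ultralytics`

    model = YOLO(weights)
    results = model.predict(image, conf=0.5)  # confidence threshold
    for box in results[0].boxes:
        cls_name = model.names[int(box.cls[0])]       # class label
        confidence = float(box.conf[0])               # detection confidence
        x1, y1, x2, y2 = box.xyxy[0].tolist()         # pixel corner coordinates
        print(f"{cls_name}: {confidence:.2f} at "
              f"({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")
```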
Training Tips and Troubleshooting
Common Issues
Problem: Loss not decreasing
Solutions:
- Reduce learning rate (lr0=0.001)
- Check annotations (incorrect labels?)
- Increase batch size
- More epochs needed

Problem: Overfitting (validation metrics stop improving)
Solutions:
- Add more training data
- Increase augmentation
- Reduce model size (yolo11s → yolo11n)
- Add regularization (weight_decay=0.001)

Problem: Out-of-memory (OOM) errors
Solutions:
- Reduce batch size (batch=8 or batch=4)
- Reduce image size (imgsz=320)
- Use smaller model (yolo11n)
- Close other programs

Problem: Training too slow
Solutions:
- Use GPU (check device=0)
- Reduce image size
- Reduce workers if I/O bottleneck
- Use mixed precision: amp=True
GPU Memory Requirements

YOLO11s with batch=16, imgsz=640:
- Approximately 4-6 GB VRAM
- RTX 3060 (12GB): batch=32
- RTX 3050 (8GB): batch=16
- GTX 1650 (4GB): batch=8, imgsz=416
Best Practices
- Start Small: Train for 10-20 epochs first, verify setup works
- Monitor Validation: Watch mAP, not just loss
- Use Callbacks: Save best model based on mAP, not last epoch
- Experiment Tracking: Keep notes on what works
- Test Incrementally: Validate on real robot images, not just dataset
Course Project Dataset
For the robotic arm project, we detect the classes defined in image_processing.py:50-51:
- Apple: 50-100 images of red apples from various angles
- Orange: 50-100 images of oranges
- Bottle: 50-100 images of common bottles (water, soda)
Dataset collection tips:
- Use actual objects the robot will interact with
- Match lighting conditions of robot workspace
- Include images with multiple objects
- Vary distances (close, medium, far)
Practice Exercise
Train a Custom Detector
Task: Train YOLO11s to detect apples and oranges

Steps:
- Collect 100 images (50 apples, 50 oranges)
- Annotate using LabelImg
- Split: 80 train, 20 validation
- Create data.yaml
- Train for 50 epochs
- Evaluate: Achieve mAP50 > 0.7
Success criteria:
- Training completes without errors
- Validation mAP50 > 0.7
- Model detects objects in new test images
- Inference runs at >10 FPS
Extension: Experiment with Parameters
Try different configurations:
- Model sizes: yolo11n vs yolo11s vs yolo11m
- Image sizes: 320 vs 640 vs 1280
- Augmentation: heavy vs light
Summary
You’ve learned:
- ✓ YOLO architecture and why it’s ideal for robotics
- ✓ Dataset preparation and annotation formats
- ✓ Training configuration and parameters
- ✓ Transfer learning from pretrained models
- ✓ Evaluation metrics (Precision, Recall, mAP)
- ✓ Troubleshooting training issues
- ✓ Best practices for custom object detection
Next Steps
With a trained model, you need to deploy it to the Raspberry Pi! The next lesson covers converting YOLO models to optimized formats for edge devices.

Model Conversion
Export models to ONNX, MNN, and NCNN formats for edge deployment
Reference Code:
- course/vision_class/process/detection/main.py:36: Inference parameters (conf, imgsz, half precision)
- process/image_processing.py:50-51: Target classes for the course project