
Training Script

The complete training script is located at training/model_train.py. Here’s the full implementation:
training/model_train.py
from ultralytics import YOLO
import torch
import os

current_path = os.path.dirname(os.path.abspath(__file__))
model_path = os.path.join(current_path, "models/pretrained_models/yolo11n-seg.pt")
data_path = os.path.join(current_path, "data/data.yaml")
save_project = os.path.join(current_path, "runs/yolov11/")
device = None

# device
if torch.backends.mps.is_available():
    device = torch.device("mps")

elif torch.cuda.is_available():
    device = torch.device("cuda")

else:
    device = torch.device("cpu")

model = YOLO(model_path).to(device)


def main():
    model.train(data=data_path, epochs=300, batch=-1, imgsz=640, patience=10, task='segment', device=0,
                project=save_project, verbose=True, plots=True)


if __name__ == '__main__':
    main()

Running the Training

1. Navigate to the training directory:

   cd training

2. Ensure dependencies are installed:

   pip install ultralytics torch

3. Run the training script:

   python model_train.py

The script automatically detects the best available device (MPS, CUDA, or CPU) and starts training with the configured hyperparameters.

Hyperparameters Explained

The training script uses several key hyperparameters that control the learning process:

Core Training Parameters

Parameter Details

| Parameter | Value | Description |
| --- | --- | --- |
| data | data.yaml | Path to the dataset configuration file |
| epochs | 300 | Maximum number of passes through the dataset |
| batch | -1 | Automatic batch size based on available GPU memory |
| imgsz | 640 | Input image dimensions (640x640 pixels) |
| patience | 10 | Early stopping: halt if no improvement for 10 epochs |
| task | 'segment' | Instance segmentation task (not just detection) |
| device | 0 | GPU device index (0 = first GPU) |
| project | runs/yolov11/ | Directory for saving training outputs |
| verbose | True | Print detailed training logs |
| plots | True | Generate training visualization plots |

Epochs (300)

What it does: Controls how many times the model sees the entire training dataset.
  • Higher values: More training, but risk of overfitting
  • Lower values: Faster training, but may underfit
  • Recommended: 200-400 epochs with early stopping
With patience=10, training may stop before reaching 300 epochs if validation metrics don’t improve. This prevents overfitting.

Batch Size (-1)

What it does: Number of images processed simultaneously. -1 enables automatic batch sizing.
  • Auto mode (-1): Automatically determines optimal batch size for your GPU memory
  • Manual values: Can be set to specific numbers (8, 16, 32, etc.)
  • Larger batches: More stable gradients, but require more memory
  • Smaller batches: Less memory usage, but noisier training
# Alternative manual configurations
batch=16  # For 6-8GB GPU
batch=32  # For 10-12GB GPU
batch=64  # For 16GB+ GPU

Image Size (640)

What it does: Input resolution for training and inference.
  • 640x640: Balanced speed and accuracy (recommended)
  • Higher (1280): Better accuracy for small objects, slower inference
  • Lower (416): Faster inference, reduced accuracy
All images are automatically resized to this dimension during training. Aspect ratios are preserved using letterboxing.
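The letterbox resize described above can be sketched in a few lines. This is a simplified illustration of the geometry (scale to fit, then pad), not the exact Ultralytics implementation; the function name is hypothetical:

```python
def letterbox_dims(w, h, target=640):
    """Scale (w, h) to fit inside a target square while preserving aspect
    ratio, and return the padding needed on each axis (letterboxing)."""
    scale = min(target / w, target / h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x, pad_y = target - new_w, target - new_h
    return new_w, new_h, pad_x, pad_y

# A 1280x720 image is scaled to 640x360, with 280 px of vertical padding.
print(letterbox_dims(1280, 720))  # (640, 360, 0, 280)
```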

Patience (10)

What it does: Early stopping mechanism to prevent overfitting.
  • Monitors the validation fitness score (a weighted combination of mAP metrics)
  • Stops training if the score does not improve for 10 consecutive epochs
  • Saves the best model weights automatically
# Adjust patience based on dataset size
patience=5   # Small datasets (< 500 images)
patience=10  # Medium datasets (500-2000 images)
patience=20  # Large datasets (> 2000 images)
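The early-stopping behavior can be illustrated with a minimal patience counter. This is a sketch of the general mechanism (and the function name is ours), not Ultralytics' internal code:

```python
def early_stop_epoch(metric_per_epoch, patience=10):
    """Return the 1-based epoch at which training stops: when the
    monitored metric has not improved for `patience` epochs in a row."""
    best, best_epoch = float("-inf"), 0
    for epoch, metric in enumerate(metric_per_epoch, start=1):
        if metric > best:
            best, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stopped early
    return len(metric_per_epoch)  # ran to completion

# mAP improves for 3 epochs, then plateaus: stops at epoch 3 + patience.
metrics = [0.50, 0.55, 0.60] + [0.60] * 20
print(early_stop_epoch(metrics, patience=10))  # 13
```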

Device Selection

The script implements intelligent device detection:
if torch.backends.mps.is_available():
    device = torch.device("mps")

Device Priority

  1. MPS (Metal Performance Shaders): Apple M1/M2/M3 GPUs
  2. CUDA: NVIDIA GPUs
  3. CPU: Fallback if no GPU is available
While the device detection logic selects an appropriate device, the train() call passes device=0, which explicitly targets the first CUDA GPU and overrides the detected device. If you are training on Apple Silicon (MPS) or CPU, replace device=0 with the detected device variable, and ensure your GPU is properly configured.

Multi-GPU Training

For systems with multiple GPUs:
# Single GPU (default)
device=0

# Multi-GPU training
device=[0, 1]  # Use first two GPUs

# All available GPUs
device=[0, 1, 2, 3]

Training Monitoring

Console Output

With verbose=True, the training script displays real-time metrics:
Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
  1/300     3.45G      1.234      0.876      0.543      1.012         52        640
  2/300     3.45G      1.156      0.812      0.498      0.965         48        640
  3/300     3.45G      1.089      0.763      0.456      0.923         51        640
...

Key Metrics

| Metric | Description |
| --- | --- |
| GPU_mem | Current GPU memory usage |
| box_loss | Bounding box regression loss |
| seg_loss | Segmentation mask loss |
| cls_loss | Classification loss |
| dfl_loss | Distribution focal loss |
| Instances | Number of objects in the batch |
Lower loss values indicate better model performance. Losses should decrease over time.
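Because per-epoch metrics are also written to results.csv (see Training Logs below), you can verify that losses are trending down with a short script. This is a sketch using inline sample rows; exact column names in results.csv can vary between Ultralytics versions, so check your file's header first:

```python
import csv
import io

# Sample rows in the shape Ultralytics writes to results.csv.
# In practice: open("training/runs/yolov11/train/results.csv")
sample = """epoch,train/box_loss,train/seg_loss,train/cls_loss
1,1.234,0.876,0.543
2,1.156,0.812,0.498
3,1.089,0.763,0.456
"""

rows = list(csv.DictReader(io.StringIO(sample)))
box = [float(r["train/box_loss"]) for r in rows]
print("box_loss decreasing:", all(a > b for a, b in zip(box, box[1:])))
```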

Training Logs

All training logs are saved to:
training/runs/yolov11/train/
├── weights/
│   ├── best.pt          # Best model weights
│   └── last.pt          # Final epoch weights
├── results.csv          # Training metrics per epoch
├── confusion_matrix.png # Class confusion matrix
├── results.png          # Loss and metric plots
├── PR_curve.png         # Precision-Recall curve
├── F1_curve.png         # F1 score curve
└── args.yaml            # Training arguments

Training Visualization

With plots=True, the script generates several visualization plots:

Loss Curves

Tracks training and validation losses over epochs:
  • Box loss (bounding box accuracy)
  • Segmentation loss (mask quality)
  • Classification loss (class prediction)

Performance Metrics

  • Precision: fraction of predicted detections that are correct
  • Recall: fraction of ground-truth objects that are detected
  • mAP50: mean Average Precision at an IoU threshold of 0.50
  • mAP50-95: mean Average Precision averaged over IoU thresholds from 0.50 to 0.95
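The precision and recall definitions above reduce to simple ratios over true positives (TP), false positives (FP), and false negatives (FN):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 45 correct detections, 5 spurious ones, 10 objects missed:
p, r = precision_recall(tp=45, fp=5, fn=10)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.82
```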

Confusion Matrix

Shows classification performance across the three classes:
  • Cardboard paper
  • Metal
  • Plastic
All plots are automatically saved to the training output directory and updated after each epoch.

Model Output Location

Trained models are saved to the project directory:
save_project = os.path.join(current_path, "runs/yolov11/")

Output Structure

training/runs/yolov11/
└── train/              # First training run
    ├── weights/
    │   ├── best.pt     # Best performing model (use for deployment)
    │   └── last.pt     # Most recent epoch
    └── ...

Subsequent Training Runs

Each new training run creates a new directory:
runs/yolov11/
├── train/      # First run
├── train2/     # Second run
├── train3/     # Third run
└── ...
Always use best.pt for deployment, not last.pt. The best model is selected based on validation performance, not training duration.
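The incrementing run names follow a simple pattern, sketched below. This is a simplified illustration of the observed behavior (with a hypothetical helper name), not Ultralytics' actual path logic:

```python
def next_run_name(existing, base="train"):
    """Return the next run directory name: train, train2, train3, ..."""
    if base not in existing:
        return base
    n = 2
    while f"{base}{n}" in existing:
        n += 1
    return f"{base}{n}"

print(next_run_name(set()))                # train
print(next_run_name({"train", "train2"}))  # train3
```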

Resuming Training

To resume interrupted training:
model = YOLO('runs/yolov11/train/weights/last.pt')
model.train(resume=True)

Advanced Configuration

Custom Hyperparameters

model.train(
    data=data_path,
    epochs=300,
    batch=16,
    imgsz=640,
    patience=10,
    
    # Learning rate
    lr0=0.01,          # Initial learning rate
    lrf=0.01,          # Final learning rate factor
    
    # Augmentation
    hsv_h=0.015,       # HSV hue augmentation
    hsv_s=0.7,         # HSV saturation augmentation
    hsv_v=0.4,         # HSV value augmentation
    degrees=0.0,       # Rotation degrees
    translate=0.1,     # Translation
    scale=0.5,         # Scale augmentation
    
    # Optimization
    optimizer='SGD',   # SGD, Adam, AdamW
    momentum=0.937,    # SGD momentum
    weight_decay=0.0005,
    
    # Advanced
    mosaic=1.0,        # Mosaic augmentation probability
    mixup=0.0,         # Mixup augmentation probability
    copy_paste=0.0,    # Copy-paste augmentation
)

Troubleshooting

Out of Memory Error

# Reduce batch size
batch=8  # or batch=4 for very limited memory

# Reduce image size
imgsz=416  # instead of 640
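As a rough rule of thumb, activation memory scales linearly with batch size and quadratically with image size, so either change helps. The sketch below is a back-of-the-envelope heuristic for comparing configurations, not a measured figure:

```python
def relative_memory(batch, imgsz, ref_batch=16, ref_imgsz=640):
    """Approximate activation-memory cost relative to a reference config,
    assuming memory ~ batch * imgsz^2 (a rough heuristic)."""
    return (batch / ref_batch) * (imgsz / ref_imgsz) ** 2

# Halving the batch and dropping to 416 px cuts memory to roughly 21%:
print(round(relative_memory(batch=8, imgsz=416), 2))  # 0.21
```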

Training Too Slow

  • Verify GPU is being used (check GPU_mem in logs)
  • Increase batch size if memory allows
  • Use mixed precision training (automatically enabled)

Poor Convergence

  • Increase dataset size
  • Adjust learning rate
  • Increase patience for early stopping
  • Check annotation quality

Evaluate Model

Learn how to evaluate your trained model’s performance
