
Training Script

The complete training script is located at training/model_train.py. Here’s the full implementation:
training/model_train.py
from ultralytics import YOLO
import torch
import os

current_path = os.path.dirname(os.path.abspath(__file__))
model_path = os.path.join(current_path, "models/pretrained_models/yolo11n-seg.pt")
data_path = os.path.join(current_path, "data/data.yaml")
save_project = os.path.join(current_path, "runs/yolov11/")
device = None

# device
if torch.backends.mps.is_available():
    device = torch.device("mps")

elif torch.cuda.is_available():
    device = torch.device("cuda")

else:
    device = torch.device("cpu")

model = YOLO(model_path).to(device)


def main():
    model.train(data=data_path, epochs=300, batch=-1, imgsz=640, patience=10, task='segment', device=0,
                project=save_project, verbose=True, plots=True)


if __name__ == '__main__':
    main()

Running the Training

1. Navigate to the training directory:

   cd training

2. Ensure dependencies are installed:

   pip install ultralytics torch

3. Run the training script:

   python model_train.py

The script automatically detects the best available device (MPS, CUDA, or CPU) and starts training with the configured hyperparameters.

Hyperparameters Explained

The training script uses several key hyperparameters that control the learning process:

Core Training Parameters

Parameter Details

| Parameter | Value | Description |
| --- | --- | --- |
| data | data.yaml | Path to the dataset configuration file |
| epochs | 300 | Maximum number of passes through the dataset |
| batch | -1 | Automatic batch size based on available GPU memory |
| imgsz | 640 | Input image dimensions (640x640 pixels) |
| patience | 10 | Early stopping: halt if no improvement for 10 epochs |
| task | 'segment' | Instance segmentation task (not just detection) |
| device | 0 | GPU device index (0 = first GPU) |
| project | runs/yolov11/ | Directory for saving training outputs |
| verbose | True | Print detailed training logs |
| plots | True | Generate training visualization plots |

Epochs (300)

What it does: Controls how many times the model sees the entire training dataset.
  • Higher values: More training, but risk of overfitting
  • Lower values: Faster training, but may underfit
  • Recommended: 200-400 epochs with early stopping
With patience=10, training may stop before reaching 300 epochs if validation metrics don’t improve. This prevents overfitting.

Batch Size (-1)

What it does: Number of images processed simultaneously. -1 enables automatic batch sizing.
  • Auto mode (-1): Automatically determines optimal batch size for your GPU memory
  • Manual values: Can be set to specific numbers (8, 16, 32, etc.)
  • Larger batches: More stable gradients, but require more memory
  • Smaller batches: Less memory usage, but noisier training
# Alternative manual configurations
batch=16  # For 6-8GB GPU
batch=32  # For 10-12GB GPU
batch=64  # For 16GB+ GPU

Image Size (640)

What it does: Input resolution for training and inference.
  • 640x640: Balanced speed and accuracy (recommended)
  • Higher (1280): Better accuracy for small objects, slower inference
  • Lower (416): Faster inference, reduced accuracy
All images are automatically resized to this dimension during training. Aspect ratios are preserved using letterboxing.
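The letterbox resize described above can be sketched in a few lines. This is a simplified illustration of the geometry (scale to fit, then pad), not the exact Ultralytics implementation; the function name is hypothetical:

```python
def letterbox_dims(w, h, target=640):
    """Scale (w, h) to fit inside a target square while preserving aspect
    ratio, and return the padding needed on each axis (letterboxing)."""
    scale = min(target / w, target / h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x, pad_y = target - new_w, target - new_h
    return new_w, new_h, pad_x, pad_y

# A 1280x720 image is scaled to 640x360, with 280 px of vertical padding.
print(letterbox_dims(1280, 720))  # (640, 360, 0, 280)
```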

Patience (10)

What it does: Early stopping mechanism to prevent overfitting.
  • Monitors the validation fitness score (a weighted combination of mAP metrics)
  • Stops training if the score does not improve for 10 consecutive epochs
  • Saves the best model weights automatically
# Adjust patience based on dataset size
patience=5   # Small datasets (< 500 images)
patience=10  # Medium datasets (500-2000 images)
patience=20  # Large datasets (> 2000 images)
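The early-stopping behavior can be illustrated with a minimal patience counter. This is a sketch of the general mechanism (and the function name is ours), not Ultralytics' internal code:

```python
def early_stop_epoch(metric_per_epoch, patience=10):
    """Return the 1-based epoch at which training stops: when the
    monitored metric has not improved for `patience` epochs in a row."""
    best, best_epoch = float("-inf"), 0
    for epoch, metric in enumerate(metric_per_epoch, start=1):
        if metric > best:
            best, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stopped early
    return len(metric_per_epoch)  # ran to completion

# mAP improves for 3 epochs, then plateaus: stops at epoch 3 + patience.
metrics = [0.50, 0.55, 0.60] + [0.60] * 20
print(early_stop_epoch(metrics, patience=10))  # 13
```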

Device Selection

The script implements intelligent device detection:
if torch.backends.mps.is_available():
    device = torch.device("mps")

Device Priority

  1. MPS (Metal Performance Shaders): Apple M1/M2/M3 GPUs
  2. CUDA: NVIDIA GPUs
  3. CPU: Fallback if no GPU is available
While the device detection logic selects an appropriate device, the train() call passes device=0, which explicitly targets the first CUDA GPU and overrides the detected device. If you are training on Apple Silicon (MPS) or CPU, replace device=0 with the detected device variable, and ensure your GPU is properly configured.

Multi-GPU Training

For systems with multiple GPUs:
# Single GPU (default)
device=0

# Multi-GPU training
device=[0, 1]  # Use first two GPUs

# All available GPUs
device=[0, 1, 2, 3]

Training Monitoring

Console Output

With verbose=True, the training script displays real-time metrics:
Epoch    GPU_mem   box_loss   seg_loss   cls_loss   dfl_loss  Instances       Size
  1/300     3.45G      1.234      0.876      0.543      1.012         52        640
  2/300     3.45G      1.156      0.812      0.498      0.965         48        640
  3/300     3.45G      1.089      0.763      0.456      0.923         51        640
...

Key Metrics

| Metric | Description |
| --- | --- |
| GPU_mem | Current GPU memory usage |
| box_loss | Bounding box regression loss |
| seg_loss | Segmentation mask loss |
| cls_loss | Classification loss |
| dfl_loss | Distribution focal loss |
| Instances | Number of objects in the batch |
Lower loss values indicate better model performance. Losses should decrease over time.
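Because per-epoch metrics are also written to results.csv (see Training Logs below), you can verify that losses are trending down with a short script. This is a sketch using inline sample rows; exact column names in results.csv can vary between Ultralytics versions, so check your file's header first:

```python
import csv
import io

# Sample rows in the shape Ultralytics writes to results.csv.
# In practice: open("training/runs/yolov11/train/results.csv")
sample = """epoch,train/box_loss,train/seg_loss,train/cls_loss
1,1.234,0.876,0.543
2,1.156,0.812,0.498
3,1.089,0.763,0.456
"""

rows = list(csv.DictReader(io.StringIO(sample)))
box = [float(r["train/box_loss"]) for r in rows]
print("box_loss decreasing:", all(a > b for a, b in zip(box, box[1:])))
```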

Training Logs

All training logs are saved to:
training/runs/yolov11/train/
├── weights/
│   ├── best.pt          # Best model weights
│   └── last.pt          # Final epoch weights
├── results.csv          # Training metrics per epoch
├── confusion_matrix.png # Class confusion matrix
├── results.png          # Loss and metric plots
├── PR_curve.png         # Precision-Recall curve
├── F1_curve.png         # F1 score curve
└── args.yaml            # Training arguments

Training Visualization

With plots=True, the script generates several visualization plots:

Loss Curves

Tracks training and validation losses over epochs:
  • Box loss (bounding box accuracy)
  • Segmentation loss (mask quality)
  • Classification loss (class prediction)

Performance Metrics

  • Precision: fraction of predicted detections that are correct
  • Recall: fraction of ground-truth objects that are detected
  • mAP50: mean Average Precision at an IoU threshold of 0.50
  • mAP50-95: mean Average Precision averaged over IoU thresholds from 0.50 to 0.95
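The precision and recall definitions above reduce to simple ratios over true positives (TP), false positives (FP), and false negatives (FN):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 45 correct detections, 5 spurious ones, 10 objects missed:
p, r = precision_recall(tp=45, fp=5, fn=10)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.82
```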

Confusion Matrix

Shows classification performance across the three classes:
  • Cardboard paper
  • Metal
  • Plastic
All plots are automatically saved to the training output directory and updated after each epoch.

Model Output Location

Trained models are saved to the project directory:
save_project = os.path.join(current_path, "runs/yolov11/")

Output Structure

training/runs/yolov11/
└── train/              # First training run
    ├── weights/
    │   ├── best.pt     # Best performing model (use for deployment)
    │   └── last.pt     # Most recent epoch
    └── ...

Subsequent Training Runs

Each new training run creates a new directory:
runs/yolov11/
├── train/      # First run
├── train2/     # Second run
├── train3/     # Third run
└── ...
Always use best.pt for deployment, not last.pt. The best model is selected based on validation performance, not training duration.
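The incrementing run names follow a simple pattern, sketched below. This is a simplified illustration of the observed behavior (with a hypothetical helper name), not Ultralytics' actual path logic:

```python
def next_run_name(existing, base="train"):
    """Return the next run directory name: train, train2, train3, ..."""
    if base not in existing:
        return base
    n = 2
    while f"{base}{n}" in existing:
        n += 1
    return f"{base}{n}"

print(next_run_name(set()))                # train
print(next_run_name({"train", "train2"}))  # train3
```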

Resuming Training

To resume interrupted training:
model = YOLO('runs/yolov11/train/weights/last.pt')
model.train(resume=True)

Advanced Configuration

Custom Hyperparameters

model.train(
    data=data_path,
    epochs=300,
    batch=16,
    imgsz=640,
    patience=10,
    
    # Learning rate
    lr0=0.01,          # Initial learning rate
    lrf=0.01,          # Final learning rate factor
    
    # Augmentation
    hsv_h=0.015,       # HSV hue augmentation
    hsv_s=0.7,         # HSV saturation augmentation
    hsv_v=0.4,         # HSV value augmentation
    degrees=0.0,       # Rotation degrees
    translate=0.1,     # Translation
    scale=0.5,         # Scale augmentation
    
    # Optimization
    optimizer='SGD',   # SGD, Adam, AdamW
    momentum=0.937,    # SGD momentum
    weight_decay=0.0005,
    
    # Advanced
    mosaic=1.0,        # Mosaic augmentation probability
    mixup=0.0,         # Mixup augmentation probability
    copy_paste=0.0,    # Copy-paste augmentation
)

Troubleshooting

Out of Memory Error

# Reduce batch size
batch=8  # or batch=4 for very limited memory

# Reduce image size
imgsz=416  # instead of 640
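As a rough rule of thumb, activation memory scales linearly with batch size and quadratically with image size, so either change helps. The sketch below is a back-of-the-envelope heuristic for comparing configurations, not a measured figure:

```python
def relative_memory(batch, imgsz, ref_batch=16, ref_imgsz=640):
    """Approximate activation-memory cost relative to a reference config,
    assuming memory ~ batch * imgsz^2 (a rough heuristic)."""
    return (batch / ref_batch) * (imgsz / ref_imgsz) ** 2

# Halving the batch and dropping to 416 px cuts memory to roughly 21%:
print(round(relative_memory(batch=8, imgsz=416), 2))  # 0.21
```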

Training Too Slow

  • Verify GPU is being used (check GPU_mem in logs)
  • Increase batch size if memory allows
  • Use mixed precision training (automatically enabled)

Poor Convergence

  • Increase dataset size
  • Adjust learning rate
  • Increase patience for early stopping
  • Check annotation quality

Evaluate Model

Learn how to evaluate your trained model’s performance
