Training Script
The complete training script is located at `training/model_train.py`. Here’s the full implementation:
training/model_train.py
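The original listing did not survive extraction; the following is a minimal sketch of what such a script can look like, assuming the Ultralytics Python API, the device-priority logic described below (MPS, then CUDA, then CPU), and the hyperparameters from the table in this section. The checkpoint name `yolo11n-seg.pt` is an assumption, not taken from the original.

```python
import torch
from ultralytics import YOLO


def select_device():
    """Pick the best available device: MPS > CUDA > CPU."""
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return 0  # index of the first CUDA GPU
    return "cpu"


if __name__ == "__main__":
    # Checkpoint name is an assumption; substitute your segmentation model.
    model = YOLO("yolo11n-seg.pt")
    model.train(
        data="data.yaml",
        task="segment",
        epochs=300,
        batch=-1,          # auto batch size from available GPU memory
        imgsz=640,
        patience=10,       # early stopping after 10 epochs without improvement
        device=select_device(),
        project="runs/yolov11/",
        verbose=True,
        plots=True,
    )
```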
Running the Training
The script automatically detects the best available device (MPS, CUDA, or CPU) and starts training with the configured hyperparameters.
Hyperparameters Explained
The training script uses several key hyperparameters that control the learning process.

Core Training Parameters
Parameter Details
| Parameter | Value | Description |
|---|---|---|
| `data` | `data.yaml` | Path to the dataset configuration file |
| `epochs` | 300 | Maximum number of passes through the training dataset |
| `batch` | -1 | Automatic batch size based on available GPU memory |
| `imgsz` | 640 | Input image dimensions (640x640 pixels) |
| `patience` | 10 | Early stopping: halt if no improvement for 10 epochs |
| `task` | `'segment'` | Instance segmentation task (not just detection) |
| `device` | 0 | GPU device index (0 for the first GPU) |
| `project` | `runs/yolov11/` | Directory for saving training outputs |
| `verbose` | `True` | Print detailed training logs |
| `plots` | `True` | Generate training visualization plots |
Epochs (300)
What it does: Controls how many times the model sees the entire training dataset.

- Higher values: More training, but risk of overfitting
- Lower values: Faster training, but may underfit
- Recommended: 200-400 epochs with early stopping
Batch Size (-1)
What it does: Number of images processed simultaneously. A value of `-1` enables automatic batch sizing.
- Auto mode (-1): Automatically determines optimal batch size for your GPU memory
- Manual values: Can be set to specific numbers (8, 16, 32, etc.)
- Larger batches: More stable gradients, but require more memory
- Smaller batches: Less memory usage, but noisier training
Image Size (640)
What it does: Input resolution for training and inference.

- 640x640: Balanced speed and accuracy (recommended)
- Higher (1280): Better accuracy for small objects, slower inference
- Lower (416): Faster inference, reduced accuracy
All images are automatically resized to this dimension during training. Aspect ratios are preserved using letterboxing.
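The letterboxing arithmetic is simple to sketch in plain Python: scale the longer side to the target, then pad the shorter side symmetrically. This is an illustrative helper, not the actual Ultralytics implementation.

```python
def letterbox_params(w: int, h: int, target: int = 640):
    """Compute resized dimensions and padding needed to letterbox an
    image into a target x target square without distorting it."""
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (target - new_w) // 2   # left/right padding
    pad_y = (target - new_h) // 2   # top/bottom padding
    return new_w, new_h, pad_x, pad_y


# A 1280x720 image is scaled to 640x360, then padded 140 px top and bottom.
print(letterbox_params(1280, 720))  # (640, 360, 0, 140)
```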
Patience (10)
What it does: Early stopping mechanism to prevent overfitting.

- Monitors validation mAP (mean Average Precision)
- Stops training if no improvement for 10 consecutive epochs
- Saves best model weights automatically
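The patience mechanism can be illustrated in a few lines of plain Python. This is a simplified sketch of the idea, not the Ultralytics implementation.

```python
def train_with_patience(map_per_epoch, patience=10):
    """Return the epoch at which early stopping triggers.

    Tracks the best validation mAP seen so far and stops once it has
    not improved for `patience` consecutive epochs.
    """
    best_map, epochs_without_improvement = float("-inf"), 0
    for epoch, val_map in enumerate(map_per_epoch):
        if val_map > best_map:
            best_map = val_map                 # new best: checkpoint saved here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch                   # halt training early
    return len(map_per_epoch) - 1


# mAP improves through epoch 2, then plateaus for 10 epochs: stops at epoch 12.
history = [0.40, 0.45, 0.50] + [0.50] * 20
print(train_with_patience(history, patience=10))  # 12
```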
Device Selection
The script implements intelligent device detection.

Device Priority
- MPS (Metal Performance Shaders): Apple M1/M2/M3 GPUs
- CUDA: NVIDIA GPUs
- CPU: Fallback if no GPU is available
Multi-GPU Training
For systems with multiple GPUs, pass a list of device indices, e.g. `device=[0, 1]`.

Training Monitoring
Console Output
With `verbose=True`, the training script displays real-time metrics:
Key Metrics
| Metric | Description |
|---|---|
| `GPU_mem` | Current GPU memory usage |
| `box_loss` | Bounding box regression loss |
| `seg_loss` | Segmentation mask loss |
| `cls_loss` | Classification loss |
| `dfl_loss` | Distribution focal loss |
| `Instances` | Number of objects in the batch |
Lower loss values indicate better model performance. Losses should decrease over time.
Training Logs
All training logs are saved under the project directory (`runs/yolov11/`).

Training Visualization
With `plots=True`, the script generates several visualization plots:
Loss Curves
Tracks training and validation losses over epochs:

- Box loss (bounding box accuracy)
- Segmentation loss (mask quality)
- Classification loss (class prediction)
Performance Metrics
- Precision: Percentage of correct positive predictions
- Recall: Percentage of actual positives detected
- mAP50: Mean Average Precision at 50% IoU threshold
- mAP50-95: Mean Average Precision averaged over IoU thresholds
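mAP50 counts a prediction as a true positive when its box overlaps the ground truth with IoU of at least 0.5. The IoU computation for axis-aligned boxes is short enough to sketch directly (an illustration, not the library code):

```python
def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to zero for disjoint boxes.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union else 0.0


# Two 10x10 boxes offset by 5 px overlap 25/175 ≈ 0.14: a miss at the 0.5 threshold.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

mAP50-95 repeats this matching at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averages the resulting APs, so it rewards tighter localization.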
Confusion Matrix
Shows classification performance across the three classes:

- Cardboard paper
- Metal
- Plastic
All plots are automatically saved to the training output directory and updated after each epoch.
Model Output Location
Trained models are saved to the project directory.

Output Structure
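The directory listing for this section did not survive extraction, but by Ultralytics convention each run keeps its checkpoints under a `weights/` subfolder: `best.pt` (highest validation mAP) and `last.pt` (most recent epoch). Assuming the default run name `train`, the best checkpoint can be loaded like this:

```python
from ultralytics import YOLO

# best.pt = highest validation mAP; last.pt = most recent epoch.
# The run name "train" is the Ultralytics default and may differ on your system.
model = YOLO("runs/yolov11/train/weights/best.pt")
results = model("path/to/image.jpg")  # run segmentation inference
```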
Subsequent Training Runs
Each new training run creates a new, incrementally numbered directory (e.g. `train`, `train2`, `train3`).

Resuming Training
To resume interrupted training, load the last checkpoint (`last.pt`) and call `model.train(resume=True)`.

Advanced Configuration
Custom Hyperparameters
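The original example for this section is missing. As a hedged illustration, Ultralytics' `train()` accepts optimizer-related overrides such as the initial learning rate and momentum; the values below are placeholders for illustration, not tuned recommendations, and the checkpoint name `yolo11n-seg.pt` is an assumption.

```python
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")  # checkpoint name is an assumption
model.train(
    data="data.yaml",
    epochs=300,
    imgsz=640,
    lr0=0.01,            # initial learning rate
    lrf=0.01,            # final learning rate as a fraction of lr0
    momentum=0.937,      # SGD momentum / Adam beta1
    weight_decay=0.0005, # L2 regularization strength
    warmup_epochs=3,     # gradual LR ramp-up at the start
)
```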
Troubleshooting
Out of Memory Error
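The fix for this error did not survive extraction; the usual remedies are a fixed, smaller batch size and a lower input resolution. As a sketch (values are illustrative, and the checkpoint name is an assumption):

```python
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")  # checkpoint name is an assumption
model.train(
    data="data.yaml",
    batch=8,    # fixed small batch instead of auto (-1)
    imgsz=512,  # lower resolution reduces activation memory
)
```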
Training Too Slow
- Verify the GPU is being used (check `GPU_mem` in the logs)
- Increase batch size if memory allows
- Use mixed precision training (automatically enabled)
Poor Convergence
- Increase dataset size
- Adjust learning rate
- Increase patience for early stopping
- Check annotation quality
Evaluate Model
Learn how to evaluate your trained model’s performance