
Overview

The Training Monitor page (/monitor) combines experiment composition and live training monitoring. Create experiments by pairing models with training configurations, then start training and watch real-time progress.
Experiments are the fundamental training unit. Each experiment combines one model + one training config + dataset config.

Page Structure

The page consists of:
  • Header with “+ New Experiment” button
  • Experiment List showing all experiments (newest first)
  • Live Auto-Refresh when any experiment is training

Creating Experiments

Prerequisites

Before creating experiments, you need:
  • ✅ Dataset Configuration: Completed in the Dataset page
  • ✅ Saved Models: At least one model in the Model Library
  • ✅ Saved Training Configs: At least one config in the Training Library
If models or training configs are missing, warning messages appear and experiment creation is disabled.

New Experiment Button

+ New Experiment (top-right, primary button)

On Click:
  • Creates new experiment with auto-generated name (“Experiment 1”, “Experiment 2”, …)
  • Defaults to first model in library
  • Defaults to first training config in library
  • Status: “not_started”
  • Adds to experiment list
Create multiple experiments upfront to queue different model-training combinations, then run them sequentially.
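The defaults listed above can be sketched as a small factory function (field names here are illustrative assumptions, not the app's actual schema):

```python
def new_experiment(existing, models, configs):
    """Sketch of the defaults applied when '+ New Experiment' is clicked."""
    return {
        "name": f"Experiment {len(existing) + 1}",          # auto-generated name
        "model": models[0] if models else None,             # first model in library
        "training_config": configs[0] if configs else None, # first training config
        "status": "not_started",
    }
```

If either library is empty, the corresponding field stays unset, which matches the disabled-creation behavior described under Prerequisites.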

Experiment Card Interface

Each experiment displays as an expandable card.

Card Header (Collapsed View)

Shows:
  • Experiment Name: Editable via text input
  • Status Badge: Color-coded status indicator
    • 🔵 Not Started: Ready to configure
    • 🟢 Training: Currently running
    • 🟡 Paused: Training paused
    • Completed: Training finished successfully
    • 🔴 Stopped: Manually stopped
  • Progress Indicator: Current epoch / Total epochs (e.g., “Epoch 23/100”)
  • Duration: Elapsed training time (e.g., “0:15:32”)

Card Content (Expanded View)

Expanded cards show full experiment configuration and controls.

Experiment Configuration

Name Input

Editable text field at top of card:
  • Default: “Experiment 1”, “Experiment 2”, etc.
  • Change to descriptive names: “ResNet50_Baseline”, “CNN_Heavy_Aug”
  • Updates automatically on change

Model Selection

Dropdown: Select from Model Library
  • Shows model name and type
  • Example: “ResNet50_v1 (Transfer Learning)”
  • Cannot change once training starts
Model selection pulls from the Model Library. If you don’t see your model, create it in the Model Builder page first.

Training Config Selection

Dropdown: Select from Training Library
  • Shows training config name
  • Example: “Adam_Default”, “SGD_LR0.01”
  • Cannot change once training starts
Configuration selections are locked once training begins to prevent mid-training changes that would invalidate results.

Training Controls

Action Buttons

Buttons appear based on experiment status:
▶️ Start Training (primary button)

On Click:
  • Validates model and training config are selected
  • Launches background training thread
  • Changes status to “Training”
  • Shows toast: “Training started! Check terminal for progress.”
Requirements:
  • Model selected
  • Training config selected
  • Dataset configured

Live Training Progress

Real-Time Metrics

While training, the card displays live-updating metrics:
Progress Bar:
  • Visual progress through epochs
  • Updates every 3 seconds
Current Epoch:
  • “Epoch 45/100”
  • Shows current vs total epochs
Time Elapsed:
  • “0:23:15” (hours:minutes:seconds)
  • Updates in real-time
Latest Metrics Row:
  • Train Loss: Current training loss
  • Train Acc: Current training accuracy
  • Val Loss: Validation loss (updated at epoch end)
  • Val Acc: Validation accuracy (updated at epoch end)
  • Learning Rate: Current LR (changes with schedule)
Metrics update every 3 seconds via Streamlit’s @st.fragment decorator with run_every parameter. This provides smooth updates without jarring full-page refreshes.

Training History Chart

Mini training curves appear below metrics:
  • Loss Curve: Train vs Val loss over epochs
  • Accuracy Curve: Train vs Val accuracy over epochs
Charts update live during training.

Auto-Refresh Mechanism

Fragment-Based Live Updates

The dashboard uses Streamlit fragments for efficient real-time updates:
from datetime import timedelta

import streamlit as st

@st.fragment(run_every=timedelta(seconds=3))
def _render_experiments_live(models, trainings):
    # Only this fragment reruns every 3 seconds, not the full page
    experiments = _get_experiments_with_live_updates()
    # ...
Benefits:
  • Smooth updates: No full-page reload
  • Better UX: No flashing or jumping
  • Efficient: Only experiment section re-renders
How it works:
  1. Training Active: Fragment runs every 3 seconds
  2. Reads from File: Background thread writes metrics to JSON
  3. Updates Display: Fragment updates cards with latest data
  4. Training Complete: Full page rerun to exit fragment mode
The background training thread writes progress to session JSON files, not st.session_state. The fragment reads fresh data from disk every 3 seconds.
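The fragment-side read described above can be sketched like this (the file path and helper name are illustrative, not the app's actual API):

```python
import json
from pathlib import Path

def read_live_metrics(path):
    """Return the latest metrics dict written by the worker thread,
    or None if the session file doesn't exist yet."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())
```

Reading from disk rather than `st.session_state` is what lets the background thread and the fragment communicate safely: the thread only writes the file, and the fragment only reads it.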

Experiment Status Flow
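The status badges described earlier imply roughly the following flow; one way to express it is as a transition table (a sketch only; the app's internal state names may differ):

```python
# Allowed status transitions, derived from the status badges above.
TRANSITIONS = {
    "not_started": {"training"},
    "training": {"paused", "stopped", "completed"},
    "paused": {"training", "stopped"},
    "stopped": set(),    # terminal
    "completed": set(),  # terminal
}

def can_transition(current, new):
    """Check whether an experiment may move from one status to another."""
    return new in TRANSITIONS.get(current, set())
```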


Background Training Process

Training Worker Thread

When you start training, a background thread launches:
1. Model Instantiation

Creates PyTorch model from saved config
  • Builds architecture (CNN/Transformer/Transfer Learning)
  • Initializes weights (random or pretrained)
  • Moves to GPU if available
2. Dataset Loading

Loads and preprocesses dataset
  • Applies train/val/test split from dataset config
  • Creates PyTorch DataLoaders with specified batch size
  • Applies augmentation transforms to training set
3. Training Loop

Runs training epochs
  • Forward pass → Compute loss → Backward pass → Update weights
  • Logs metrics every batch
  • Validation at end of each epoch
  • Applies learning rate schedule
  • Checks early stopping condition
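The loop above can be sketched in PyTorch as follows (a minimal sketch with assumed names, not the app's actual worker code):

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, scheduler=None, device="cpu"):
    """One training epoch: forward pass, loss, backward pass, weight update."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)             # forward pass
        loss = criterion(outputs, targets)  # compute loss
        loss.backward()                     # backward pass
        optimizer.step()                    # update weights
        total_loss += loss.item() * targets.size(0)
        correct += (outputs.argmax(dim=1) == targets).sum().item()
        seen += targets.size(0)
    if scheduler is not None:
        scheduler.step()                    # apply LR schedule once per epoch
    return total_loss / seen, correct / seen
```

Validation and the early-stopping check would run after this function returns, once per epoch.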
4. Progress Logging

Writes metrics to session file
  • Updates experiment history
  • Saves checkpoints when metric improves
  • Dashboard fragment reads updates for display
5. Completion

Training finishes
  • Saves final model weights
  • Updates experiment status to “Completed”
  • Logs final metrics
  • Dashboard exits fragment mode
Training runs in a separate thread to prevent blocking the Streamlit UI. Check the terminal for detailed training logs.
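Launching the worker in a daemon thread, as described above, might look like this (illustrative names; the app's actual API may differ):

```python
import threading

def start_training(experiment, worker):
    """Run the training worker in a background daemon thread so the
    Streamlit UI stays responsive."""
    experiment["status"] = "training"
    thread = threading.Thread(target=worker, args=(experiment,), daemon=True)
    thread.start()
    return thread
```

Marking the thread as a daemon means it will not keep the process alive on its own; this is also why a browser refresh, which restarts the script run, can terminate training.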

Terminal Output

While training, detailed logs appear in the terminal:
[Experiment abc123] Starting training...
[Epoch 1/100] Train Loss: 2.3456, Train Acc: 0.2345
[Epoch 1/100] Val Loss: 2.1234, Val Acc: 0.3456
[Epoch 2/100] Train Loss: 1.9876, Train Acc: 0.3890
...
[Early Stopping] No improvement for 10 epochs. Stopping.
[Experiment abc123] Training completed!
Keep the terminal visible during training to monitor detailed progress and catch any errors.

Managing Multiple Experiments

Sequential Training

Best Practice: Train experiments one at a time
  • Prevents GPU memory conflicts
  • Clearer progress monitoring
  • Easier debugging
Workflow:
  1. Create multiple experiments upfront
  2. Configure each with different models/configs
  3. Start first experiment
  4. Wait for completion (or stop)
  5. Start next experiment

Experiment Comparison

Use naming conventions for easy comparison:
  • “ResNet50_Baseline”
  • “ResNet50_Heavy_Aug”
  • “CNN_Deep_NoReg”
  • “ViT_Base_CosineLR”
After completion, compare results in the Results page.

Experiment Management Actions

Deleting Experiments

🗑️ Delete button appears for completed/stopped experiments. On Click:
  • Removes experiment from list
  • Deletes associated checkpoint files
  • Frees disk space
  • Cannot be undone
Delete removes all experiment data including checkpoints. Download model weights before deleting if you want to preserve them.

Editing Configurations

Configuration is locked once training starts. To change:
  1. Stop current experiment (preserves partial results)
  2. Create new experiment
  3. Select different model/config
  4. Start training with new settings

GPU Monitoring

The dashboard header shows GPU status (if available):
  • GPU Name: “NVIDIA RTX 3090”
  • Memory Usage: “4.2 GB / 24.0 GB”
  • Utilization: “85%”
Monitors:
  • GPU availability
  • Memory consumption during training
  • Utilization percentage
If GPU memory is low, reduce batch size or use a smaller model architecture.
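A sketch of how the name and memory figures above could be queried via PyTorch's CUDA API (utilization reporting additionally requires NVML, so it is omitted here):

```python
import torch

def gpu_status(device=0):
    """Return GPU name and memory figures, or None if no GPU is available."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device)
    return {
        "name": props.name,
        "memory_used_gb": torch.cuda.memory_allocated(device) / 1e9,
        "memory_total_gb": props.total_memory / 1e9,
    }
```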

Troubleshooting

Training Won’t Start

Problem: No model or training config selected. Solution: Select both from the dropdowns before clicking Start.
Problem: Dataset not configured. Solution: Complete the Dataset Configuration page and save.
Error: “CUDA out of memory”
Solutions:
  • Reduce batch size (32 → 16)
  • Use smaller model architecture
  • Close other GPU applications
Error: Dataset images not found
Solution: Verify the dataset path in the Dataset config matches the actual location.

Training Stalls

Loss not decreasing — possible causes:
  • Learning rate too low (increase LR)
  • Model too small (use a deeper architecture)
  • Data quality issues
Solution: Stop training, adjust the config, and restart.
Loss exploding or going NaN — possible causes:
  • Learning rate too high
  • Gradient explosion
  • Numerical instability
Solution: Reduce the learning rate (e.g., 0.001 → 0.0001) and enable gradient clipping.
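Gradient clipping, mentioned above, can be applied just before the optimizer step using PyTorch's standard utility (the wrapper function name here is illustrative):

```python
import torch

def step_with_clipping(model, loss, optimizer, max_norm=1.0):
    """Backward pass with the total gradient norm clipped to max_norm
    before the weight update, to curb exploding gradients."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```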

Tips & Best Practices

Name Experiments Descriptively: Use names that indicate model and key hyperparameters for easy comparison.
Monitor Terminal: Keep terminal visible to catch detailed errors and progress logs.
Start Simple: Begin with lightweight models and short training runs to validate pipeline before scaling up.
Don’t refresh the browser during active training. This terminates the background thread. If refreshed, restart training.
Use Early Stopping: Let early stopping handle when to stop. Don’t manually stop unless necessary.

Next Steps

After training completes:

Results & Evaluation

Analyze training curves, confusion matrices, and per-class metrics
