
Overview

The Training Monitor page (/monitor) combines experiment composition and live training monitoring. Create experiments by pairing models with training configurations, then start training and watch real-time progress.
Experiments are the fundamental training unit. Each experiment combines one model + one training config + dataset config.

Page Structure

The page consists of:
  • Header with “+ New Experiment” button
  • Experiment List showing all experiments (newest first)
  • Live Auto-Refresh when any experiment is training

Creating Experiments

Prerequisites

Before creating experiments, you need:
  • ✅ Dataset Configuration: Completed in the Dataset page
  • ✅ Saved Models: At least one model in the Model Library
  • ✅ Saved Training Configs: At least one config in the Training Library
If models or training configs are missing, warning messages appear and experiment creation is disabled.

New Experiment Button

+ New Experiment (top-right, primary button)

On Click:
  • Creates new experiment with auto-generated name (“Experiment 1”, “Experiment 2”, …)
  • Defaults to first model in library
  • Defaults to first training config in library
  • Status: “not_started”
  • Adds to experiment list
Create multiple experiments upfront to queue different model-training combinations, then run them sequentially.
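The defaults listed above can be sketched as a small factory function (field names here are illustrative assumptions, not the app's actual schema):

```python
def new_experiment(existing, models, configs):
    """Sketch of the defaults applied when '+ New Experiment' is clicked."""
    return {
        "name": f"Experiment {len(existing) + 1}",          # auto-generated name
        "model": models[0] if models else None,             # first model in library
        "training_config": configs[0] if configs else None, # first training config
        "status": "not_started",
    }
```

If either library is empty, the corresponding field stays unset, which matches the disabled-creation behavior described under Prerequisites.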

Experiment Card Interface

Each experiment displays as an expandable card.

Card Header (Collapsed View)

Shows:
  • Experiment Name: Editable via text input
  • Status Badge: Color-coded status indicator
    • 🔵 Not Started: Ready to configure
    • 🟢 Training: Currently running
    • 🟡 Paused: Training paused
    • Completed: Training finished successfully
    • 🔴 Stopped: Manually stopped
  • Progress Indicator: Current epoch / Total epochs (e.g., “Epoch 23/100”)
  • Duration: Elapsed training time (e.g., “0:15:32”)

Card Content (Expanded View)

Expanded cards show full experiment configuration and controls.

Experiment Configuration

Name Input

Editable text field at top of card:
  • Default: “Experiment 1”, “Experiment 2”, etc.
  • Change to descriptive names: “ResNet50_Baseline”, “CNN_Heavy_Aug”
  • Updates automatically on change

Model Selection

Dropdown: Select from Model Library
  • Shows model name and type
  • Example: “ResNet50_v1 (Transfer Learning)”
  • Cannot change once training starts
Model selection pulls from the Model Library. If you don’t see your model, create it in the Model Builder page first.

Training Config Selection

Dropdown: Select from Training Library
  • Shows training config name
  • Example: “Adam_Default”, “SGD_LR0.01”
  • Cannot change once training starts
Configuration selections are locked once training begins to prevent mid-training changes that would invalidate results.

Training Controls

Action Buttons

Buttons appear based on experiment status:
▶️ Start Training (primary button)

On Click:
  • Validates model and training config are selected
  • Launches background training thread
  • Changes status to “Training”
  • Shows toast: “Training started! Check terminal for progress.”
Requirements:
  • Model selected
  • Training config selected
  • Dataset configured

Live Training Progress

Real-Time Metrics

While training, the card displays live-updating metrics:
Progress Bar:
  • Visual progress through epochs
  • Updates every 3 seconds
Current Epoch:
  • “Epoch 45/100”
  • Shows current vs total epochs
Time Elapsed:
  • “0:23:15” (hours:minutes:seconds)
  • Updates in real-time
Latest Metrics Row:
  • Train Loss: Current training loss
  • Train Acc: Current training accuracy
  • Val Loss: Validation loss (updated at epoch end)
  • Val Acc: Validation accuracy (updated at epoch end)
  • Learning Rate: Current LR (changes with schedule)
Metrics update every 3 seconds via Streamlit’s @st.fragment decorator with run_every parameter. This provides smooth updates without jarring full-page refreshes.

Training History Chart

Mini training curves appear below metrics:
  • Loss Curve: Train vs Val loss over epochs
  • Accuracy Curve: Train vs Val accuracy over epochs
Charts update live during training.

Auto-Refresh Mechanism

Fragment-Based Live Updates

The dashboard uses Streamlit fragments for efficient real-time updates:
from datetime import timedelta

import streamlit as st

@st.fragment(run_every=timedelta(seconds=3))
def _render_experiments_live(models, trainings):
    # Only this fragment reruns every 3 seconds, not the full page
    experiments = _get_experiments_with_live_updates()
    # ...
Benefits:
  • Smooth updates: No full-page reload
  • Better UX: No flashing or jumping
  • Efficient: Only experiment section re-renders
How it works:
  1. Training Active: Fragment runs every 3 seconds
  2. Reads from File: Background thread writes metrics to JSON
  3. Updates Display: Fragment updates cards with latest data
  4. Training Complete: Full page rerun to exit fragment mode
The background training thread writes progress to session JSON files, not st.session_state. The fragment reads fresh data from disk every 3 seconds.
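The fragment-side read described above can be sketched like this (the file path and helper name are illustrative, not the app's actual API):

```python
import json
from pathlib import Path

def read_live_metrics(path):
    """Return the latest metrics dict written by the worker thread,
    or None if the session file doesn't exist yet."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())
```

Reading from disk rather than `st.session_state` is what lets the background thread and the fragment communicate safely: the thread only writes the file, and the fragment only reads it.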

Experiment Status Flow
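The status badges described earlier imply roughly the following flow; one way to express it is as a transition table (a sketch only; the app's internal state names may differ):

```python
# Allowed status transitions, derived from the status badges above.
TRANSITIONS = {
    "not_started": {"training"},
    "training": {"paused", "stopped", "completed"},
    "paused": {"training", "stopped"},
    "stopped": set(),    # terminal
    "completed": set(),  # terminal
}

def can_transition(current, new):
    """Check whether an experiment may move from one status to another."""
    return new in TRANSITIONS.get(current, set())
```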


Background Training Process

Training Worker Thread

When you start training, a background thread launches:
1. Model Instantiation

Creates PyTorch model from saved config
  • Builds architecture (CNN/Transformer/Transfer Learning)
  • Initializes weights (random or pretrained)
  • Moves to GPU if available
2. Dataset Loading

Loads and preprocesses dataset
  • Applies train/val/test split from dataset config
  • Creates PyTorch DataLoaders with specified batch size
  • Applies augmentation transforms to training set
3. Training Loop

Runs training epochs
  • Forward pass → Compute loss → Backward pass → Update weights
  • Logs metrics every batch
  • Validation at end of each epoch
  • Applies learning rate schedule
  • Checks early stopping condition
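The loop above can be sketched in PyTorch as follows (a minimal sketch with assumed names, not the app's actual worker code):

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, scheduler=None, device="cpu"):
    """One training epoch: forward pass, loss, backward pass, weight update."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)             # forward pass
        loss = criterion(outputs, targets)  # compute loss
        loss.backward()                     # backward pass
        optimizer.step()                    # update weights
        total_loss += loss.item() * targets.size(0)
        correct += (outputs.argmax(dim=1) == targets).sum().item()
        seen += targets.size(0)
    if scheduler is not None:
        scheduler.step()                    # apply LR schedule once per epoch
    return total_loss / seen, correct / seen
```

Validation and the early-stopping check would run after this function returns, once per epoch.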
4. Progress Logging

Writes metrics to session file
  • Updates experiment history
  • Saves checkpoints when metric improves
  • Dashboard fragment reads updates for display
5. Completion

Training finishes
  • Saves final model weights
  • Updates experiment status to “Completed”
  • Logs final metrics
  • Dashboard exits fragment mode
Training runs in a separate thread to prevent blocking the Streamlit UI. Check the terminal for detailed training logs.
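Launching the worker in a daemon thread, as described above, might look like this (illustrative names; the app's actual API may differ):

```python
import threading

def start_training(experiment, worker):
    """Run the training worker in a background daemon thread so the
    Streamlit UI stays responsive."""
    experiment["status"] = "training"
    thread = threading.Thread(target=worker, args=(experiment,), daemon=True)
    thread.start()
    return thread
```

Marking the thread as a daemon means it will not keep the process alive on its own; this is also why a browser refresh, which restarts the script run, can terminate training.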

Terminal Output

While training, detailed logs appear in the terminal:
[Experiment abc123] Starting training...
[Epoch 1/100] Train Loss: 2.3456, Train Acc: 0.2345
[Epoch 1/100] Val Loss: 2.1234, Val Acc: 0.3456
[Epoch 2/100] Train Loss: 1.9876, Train Acc: 0.3890
...
[Early Stopping] No improvement for 10 epochs. Stopping.
[Experiment abc123] Training completed!
Keep the terminal visible during training to monitor detailed progress and catch any errors.

Managing Multiple Experiments

Sequential Training

Best Practice: Train experiments one at a time
  • Prevents GPU memory conflicts
  • Clearer progress monitoring
  • Easier debugging
Workflow:
  1. Create multiple experiments upfront
  2. Configure each with different models/configs
  3. Start first experiment
  4. Wait for completion (or stop)
  5. Start next experiment

Experiment Comparison

Use naming conventions for easy comparison:
  • “ResNet50_Baseline”
  • “ResNet50_Heavy_Aug”
  • “CNN_Deep_NoReg”
  • “ViT_Base_CosineLR”
After completion, compare results in the Results page.

Experiment Management Actions

Deleting Experiments

🗑️ Delete button appears for completed/stopped experiments. On Click:
  • Removes experiment from list
  • Deletes associated checkpoint files
  • Frees disk space
  • Cannot be undone
Delete removes all experiment data including checkpoints. Download model weights before deleting if you want to preserve them.

Editing Configurations

Configuration is locked once training starts. To change:
  1. Stop current experiment (preserves partial results)
  2. Create new experiment
  3. Select different model/config
  4. Start training with new settings

GPU Monitoring

The dashboard header shows GPU status (if available):
  • GPU Name: “NVIDIA RTX 3090”
  • Memory Usage: “4.2 GB / 24.0 GB”
  • Utilization: “85%”
Monitors:
  • GPU availability
  • Memory consumption during training
  • Utilization percentage
If GPU memory is low, reduce batch size or use a smaller model architecture.
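A sketch of how the name and memory figures above could be queried via PyTorch's CUDA API (utilization reporting additionally requires NVML, so it is omitted here):

```python
import torch

def gpu_status(device=0):
    """Return GPU name and memory figures, or None if no GPU is available."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device)
    return {
        "name": props.name,
        "memory_used_gb": torch.cuda.memory_allocated(device) / 1e9,
        "memory_total_gb": props.total_memory / 1e9,
    }
```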

Troubleshooting

Training Won’t Start

Problem: No model or training config selected. Solution: Select both from the dropdowns before clicking Start.
Problem: Dataset not configured. Solution: Complete the Dataset Configuration page and save.
Error: “CUDA out of memory”
Solutions:
  • Reduce batch size (32 → 16)
  • Use smaller model architecture
  • Close other GPU applications
Error: Dataset images not found
Solution: Verify the dataset path in the Dataset config matches the actual location.

Training Stalls

Loss not decreasing — possible causes:
  • Learning rate too low (increase LR)
  • Model too small (use a deeper architecture)
  • Data quality issues
Solution: Stop training, adjust the config, and restart.
Loss exploding or going NaN — possible causes:
  • Learning rate too high
  • Gradient explosion
  • Numerical instability
Solution: Reduce the learning rate (e.g., 0.001 → 0.0001) and enable gradient clipping.
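Gradient clipping, mentioned above, can be applied just before the optimizer step using PyTorch's standard utility (the wrapper function name here is illustrative):

```python
import torch

def step_with_clipping(model, loss, optimizer, max_norm=1.0):
    """Backward pass with the total gradient norm clipped to max_norm
    before the weight update, to curb exploding gradients."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```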

Tips & Best Practices

Name Experiments Descriptively: Use names that indicate model and key hyperparameters for easy comparison.
Monitor Terminal: Keep terminal visible to catch detailed errors and progress logs.
Start Simple: Begin with lightweight models and short training runs to validate pipeline before scaling up.
Don’t refresh the browser during active training. This terminates the background thread. If refreshed, restart training.
Use Early Stopping: Let early stopping handle when to stop. Don’t manually stop unless necessary.

Next Steps

After training completes:

Results & Evaluation

Analyze training curves, confusion matrices, and per-class metrics
