Overview
The Training Monitor page (/monitor) combines experiment composition and live training monitoring. Create experiments by pairing models with training configurations, then start training and watch real-time progress.
Experiments are the fundamental training unit. Each experiment combines one model + one training config + dataset config.
Page Structure
The page consists of:
- Header with “+ New Experiment” button
- Experiment List showing all experiments (newest first)
- Live Auto-Refresh when any experiment is training
Creating Experiments
Prerequisites
Before creating experiments, you need:
- ✅ Dataset Configuration: completed in the Dataset page
- ✅ Saved Models: at least one model in the Model Library
- ✅ Saved Training Configs: at least one config in the Training Library
New Experiment Button
+ New Experiment (top-right, primary button)
On Click:
- Creates new experiment with auto-generated name (“Experiment 1”, “Experiment 2”, …)
- Defaults to first model in library
- Defaults to first training config in library
- Status: “not_started”
- Adds to experiment list
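The creation logic above can be sketched roughly as follows. This is a minimal illustration; the function name, dict fields, and list-based storage are assumptions, not the app's actual code:

```python
# Illustrative sketch of the "+ New Experiment" behavior; names and
# fields are assumptions, not the app's actual implementation.
def make_experiment(experiments, models, training_configs):
    """Create a new experiment with defaults and append it to the list."""
    name = f"Experiment {len(experiments) + 1}"  # auto-generated name
    experiment = {
        "name": name,
        "model": models[0] if models else None,  # defaults to first model
        "training_config": training_configs[0] if training_configs else None,
        "status": "not_started",
        "history": [],
    }
    experiments.append(experiment)
    return experiment

experiments = []
make_experiment(experiments, ["ResNet50_v1"], ["Adam_Default"])
make_experiment(experiments, ["ResNet50_v1"], ["Adam_Default"])
```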
Experiment Card Interface
Each experiment displays as an expandable card.
Card Header (Collapsed View)
Shows:
- Experiment Name: Editable via text input
- Status Badge: Color-coded status indicator
- 🔵 Not Started: Ready to configure
- 🟢 Training: Currently running
- 🟡 Paused: Training paused
- ✅ Completed: Training finished successfully
- 🔴 Stopped: Manually stopped
- Progress Indicator: Current epoch / Total epochs (e.g., “Epoch 23/100”)
- Duration: Elapsed training time (e.g., “0:15:32”)
Card Content (Expanded View)
Expanded cards show full experiment configuration and controls.
Experiment Configuration
Name Input
Editable text field at top of card:
- Default: “Experiment 1”, “Experiment 2”, etc.
- Change to descriptive names: “ResNet50_Baseline”, “CNN_Heavy_Aug”
- Updates automatically on change
Model Selection
Dropdown: Select from Model Library
- Shows model name and type
- Example: “ResNet50_v1 (Transfer Learning)”
- Cannot change once training starts
Model selection pulls from the Model Library. If you don’t see your model, create it in the Model Builder page first.
Training Config Selection
Dropdown: Select from Training Library
- Shows training config name
- Example: “Adam_Default”, “SGD_LR0.01”
- Cannot change once training starts
Configuration selections are locked once training begins to prevent mid-training changes that would invalidate results.
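The locking rule reduces to a simple status check. A hypothetical helper (not the app's actual code) that a UI could use to enable or disable the dropdowns:

```python
# Hypothetical helper: selections are editable only before training starts.
LOCKED_STATUSES = {"training", "paused", "completed", "stopped"}

def selection_editable(status: str) -> bool:
    """Return True if the model/config dropdowns should be enabled."""
    return status not in LOCKED_STATUSES
```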
Training Controls
Action Buttons
Buttons appear based on experiment status:
- Not Started
- Training
- Paused
- Completed
- Stopped
▶️ Start Training (primary button)
On Click:
- Validates model and training config are selected
- Launches background training thread
- Changes status to “Training”
- Shows toast: “Training started! Check terminal for progress.”
Requirements:
- Model selected
- Training config selected
- Dataset configured
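The validate-then-launch flow can be sketched as below. The function names and the `dataset_configured` flag are illustrative assumptions; the real app passes its own training worker instead of `fake_train`:

```python
import threading

def start_training(experiment, dataset_configured, train_fn):
    """Validate prerequisites, then launch training in a background thread."""
    if experiment["model"] is None or experiment["training_config"] is None:
        raise ValueError("Select a model and a training config first.")
    if not dataset_configured:
        raise ValueError("Configure the dataset before training.")
    experiment["status"] = "training"
    # daemon=True so the worker does not block interpreter shutdown
    thread = threading.Thread(target=train_fn, args=(experiment,), daemon=True)
    thread.start()
    return thread

def fake_train(experiment):  # stand-in for the real training worker
    experiment["status"] = "completed"

exp = {"model": "ResNet50_v1", "training_config": "Adam_Default",
       "status": "not_started"}
t = start_training(exp, dataset_configured=True, train_fn=fake_train)
t.join()
```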
Live Training Progress
Real-Time Metrics
While training, the card displays live-updating metrics:
Progress Bar:
- Visual progress through epochs
- Updates every 3 seconds
Epoch Counter:
- “Epoch 45/100”
- Shows current vs total epochs
Duration:
- “0:23:15” (hours:minutes:seconds)
- Updates in real time
- Train Loss: Current training loss
- Train Acc: Current training accuracy
- Val Loss: Validation loss (updated at epoch end)
- Val Acc: Validation accuracy (updated at epoch end)
- Learning Rate: Current LR (changes with schedule)
Metrics update every 3 seconds via Streamlit’s @st.fragment decorator with the run_every parameter. This provides smooth updates without jarring full-page refreshes.
Training History Chart
Mini training curves appear below metrics:
- Loss Curve: Train vs Val loss over epochs
- Accuracy Curve: Train vs Val accuracy over epochs
Auto-Refresh Mechanism
Fragment-Based Live Updates
The dashboard uses Streamlit fragments for efficient real-time updates:
- Smooth updates: no full-page reload
- Better UX: no flashing or jumping
- Efficient: only the experiment section re-renders
Update cycle:
- Training Active: fragment runs every 3 seconds
- Reads from File: background thread writes metrics to JSON
- Updates Display: fragment updates cards with latest data
- Training Complete: full page rerun to exit fragment mode
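The read-from-disk half of this cycle can be sketched as follows. The file name and the `read_progress` helper are assumptions for illustration; inside the app this would run in a function decorated with `@st.fragment(run_every=3)`:

```python
import json
import tempfile
from pathlib import Path

def read_progress(path: Path) -> dict:
    """Read the latest metrics the background thread wrote to disk.

    In the app this runs inside a Streamlit fragment with run_every=3,
    so it re-executes every 3 seconds while training is active.
    """
    if not path.exists():
        return {}  # training may not have written anything yet
    return json.loads(path.read_text())

# Simulate the background thread writing progress, then the fragment reading it.
progress_file = Path(tempfile.mkdtemp()) / "session_progress.json"
progress_file.write_text(json.dumps(
    {"epoch": 45, "total_epochs": 100, "train_loss": 0.31}))
latest = read_progress(progress_file)
```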
The background training thread writes progress to session JSON files, not st.session_state. The fragment reads fresh data from disk every 3 seconds.
Experiment Status Flow
Background Training Process
Training Worker Thread
When you start training, a background thread launches:
Model Instantiation
Creates PyTorch model from saved config
- Builds architecture (CNN/Transformer/Transfer Learning)
- Initializes weights (random or pretrained)
- Moves to GPU if available
Dataset Loading
Loads and preprocesses dataset
- Applies train/val/test split from dataset config
- Creates PyTorch DataLoaders with specified batch size
- Applies augmentation transforms to training set
Training Loop
Runs training epochs
- Forward pass → Compute loss → Backward pass → Update weights
- Logs metrics every batch
- Validation at end of each epoch
- Applies learning rate schedule
- Checks early stopping condition
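The loop structure above can be shown with a deliberately tiny plain-Python example (not the app's actual PyTorch code): fitting a single weight by gradient descent, with per-epoch validation and a simple patience-based early-stopping check:

```python
# Structural sketch of the epoch loop: forward pass -> compute loss ->
# backward pass (gradient) -> update weights, with validation at the end
# of each epoch and early stopping. Plain Python toy model: y = w * x.
def train(xs, ys, epochs=200, lr=0.05, patience=10):
    w = 0.0
    best_val, stale = float("inf"), 0
    for epoch in range(epochs):
        for x, y in zip(xs, ys):          # one "batch" per sample
            pred = w * x                  # forward pass
            grad = 2 * (pred - y) * x     # gradient of squared error
            w -= lr * grad                # weight update
        # validation at end of epoch
        val_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if val_loss < best_val - 1e-9:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:         # early stopping condition
                break
    return w

w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # true relationship: y = 2x
```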
Progress Logging
Writes metrics to session file
- Updates experiment history
- Saves checkpoints when metric improves
- Dashboard fragment reads updates for display
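The "save checkpoints when the metric improves" rule is a best-so-far comparison. A hypothetical sketch (the helper name and state dict are illustrative):

```python
# Hypothetical sketch: save a checkpoint only when validation accuracy
# beats the best value seen so far.
def maybe_checkpoint(state, val_acc, save_fn):
    """Call save_fn only on improvement; returns True if a save happened."""
    if val_acc > state.get("best_val_acc", float("-inf")):
        state["best_val_acc"] = val_acc
        save_fn()  # in the real app: write model weights to disk
        return True
    return False

saved = []
state = {}
for acc in [0.60, 0.72, 0.70, 0.75]:
    maybe_checkpoint(state, acc, lambda: saved.append(state["best_val_acc"]))
```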
Training runs in a separate thread to prevent blocking the Streamlit UI. Check the terminal for detailed training logs.
Terminal Output
While training, detailed logs appear in the terminal.
Managing Multiple Experiments
Sequential Training
Best Practice: Train experiments one at a time.
- Prevents GPU memory conflicts
- Clearer progress monitoring
- Easier debugging
Workflow:
- Create multiple experiments upfront
- Configure each with different models/configs
- Start first experiment
- Wait for completion (or stop)
- Start next experiment
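This one-at-a-time workflow amounts to a simple sequential runner. A minimal sketch under illustrative names (the real app starts each run from the UI rather than a loop):

```python
# Sketch of sequential training: run experiments one at a time so only
# one job holds the GPU. Names and fields are illustrative.
def run_sequentially(experiments, train_fn):
    order = []
    for exp in experiments:
        if exp["status"] != "not_started":
            continue                 # skip experiments that already ran
        exp["status"] = "training"
        train_fn(exp)                # blocks until this experiment finishes
        exp["status"] = "completed"
        order.append(exp["name"])
    return order

exps = [{"name": "ResNet50_Baseline", "status": "not_started"},
        {"name": "CNN_Heavy_Aug", "status": "not_started"}]
order = run_sequentially(exps, train_fn=lambda e: None)
```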
Experiment Comparison
Use naming conventions for easy comparison:
- “ResNet50_Baseline”
- “ResNet50_Heavy_Aug”
- “CNN_Deep_NoReg”
- “ViT_Base_CosineLR”
Experiment Management Actions
Deleting Experiments
🗑️ Delete button appears for completed/stopped experiments. On Click:
- Removes experiment from list
- Deletes associated checkpoint files
- Frees disk space
- Cannot be undone
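The deletion step can be sketched as below. The checkpoint filename pattern (`<name>*.pt`) and the helper name are assumptions about how files might be laid out, not the app's actual scheme:

```python
import tempfile
from pathlib import Path

def delete_experiment(experiments, name, checkpoint_dir: Path):
    """Remove the experiment and its checkpoint files. Irreversible."""
    experiments[:] = [e for e in experiments if e["name"] != name]
    for ckpt in checkpoint_dir.glob(f"{name}*.pt"):
        ckpt.unlink()  # frees disk space; cannot be undone

ckpt_dir = Path(tempfile.mkdtemp())
(ckpt_dir / "Experiment 1_best.pt").write_bytes(b"weights")
exps = [{"name": "Experiment 1"}, {"name": "Experiment 2"}]
delete_experiment(exps, "Experiment 1", ckpt_dir)
```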
Editing Configurations
Configuration is locked once training starts. To change:
- Stop current experiment (preserves partial results)
- Create new experiment
- Select different model/config
- Start training with new settings
GPU Monitoring
The dashboard header shows GPU status (if available):
- GPU Name: “NVIDIA RTX 3090”
- Memory Usage: “4.2 GB / 24.0 GB”
- Utilization: “85%”
Use this to monitor:
- GPU availability
- Memory consumption during training
- Utilization percentage
If GPU memory is low, reduce batch size or use a smaller model architecture.
Troubleshooting
Training Won’t Start
Model or Training Config Not Selected
Solution: Select both from dropdowns before clicking Start.
Dataset Not Configured
Solution: Complete Dataset Configuration page and save.
GPU Out of Memory
Error: “CUDA out of memory”
Solutions:
- Reduce batch size (32 → 16)
- Use smaller model architecture
- Close other GPU applications
FileNotFoundError
Error: Dataset images not found
Solution: Verify dataset path in Dataset config matches actual location.
Training Stalls
Loss Not Decreasing
Possible causes:
- Learning rate too low (increase LR)
- Model too small (use deeper architecture)
- Data quality issues
Loss Diverging (NaN)
Possible causes:
- Learning rate too high
- Gradient explosion
- Numerical instability
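One common mitigation for gradient explosion is clipping by global norm, shown here in plain Python as an illustration of the idea (PyTorch users would reach for `torch.nn.utils.clip_grad_norm_` instead; this helper is not part of the app):

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return grads                      # already within bounds
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5.0
```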
Tips & Best Practices
Next Steps
After training completes:
Results & Evaluation
Analyze training curves, confusion matrices, and per-class metrics