What is MLflow?
MLflow is an open-source platform for managing the machine learning lifecycle, including:

- Experiment Tracking: Log parameters, metrics, and artifacts
- Model Registry: Version and manage trained models
- Visualization: Compare runs and analyze performance
- Reproducibility: Record everything needed to recreate results
Starting MLflow UI
Launch the MLflow tracking server to access the web interface at http://127.0.0.1:5000:
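A typical invocation, assuming you run it from the project root (the directory containing mlruns/):

```shell
# Start the MLflow tracking UI on the default host and port
mlflow ui --host 127.0.0.1 --port 5000
```

Leave this running in its own terminal; the UI stays available while training runs log to it.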
MLflow Storage
All experiment data is stored locally in the mlruns/ directory, which is created automatically when you run your first training experiment.

MLflow Integration in Training
The training script automatically logs everything to MLflow.

Experiment Setup
training/training.py:65-70
What gets logged?

Parameters:

- hidden_layers: Network architecture
- activation_functions: Activation types per layer
- dropout_rate: Regularization rate
- learning_rate: Optimizer learning rate
- epochs: Number of training epochs
- batch_size: Batch size for training
- config_file: Configuration filename

Training Metrics:

- train_loss: Training loss
- train_accuracy: Training accuracy

Test Metrics:

- test_accuracy: Test set accuracy
- test_roc_auc: Area under ROC curve
- test_precision: Precision score
- test_recall: Recall score
- test_f1_score: F1 score

Artifacts:

- model_weights_*.pth: Trained model weights
- confusion_matrix.png: Confusion matrix visualization
- roc_curve.png: ROC curve plot
- precision_recall_curve.png: Precision-recall curve
- classification_report.txt: Detailed classification metrics
Logging Metrics During Training
Metrics are logged at each epoch:

training/training.py:151-158
Logging Test Metrics
Final evaluation metrics are logged after training:

training/training.py:179-183
Logging Artifacts
Visualizations and model files are logged as artifacts:

training/training.py:188-236
Using the MLflow UI
Experiments View
When you open http://127.0.0.1:5000, you'll see:
Select Experiment
Click on “Credit Score Training” from the experiments list to view all related training runs.
View Runs Table
The main table shows all training runs with columns for:
- Run Name: Configuration name (e.g., model_config_001)
- Created: Timestamp of training start
- Duration: How long training took
- Metrics: Quick view of key metrics
- Parameters: Hyperparameter values
Run Details View
Click on any run to see detailed information across several tabs:

- Overview
- Parameters
- Metrics
- Artifacts
- System Info

The Overview tab includes:

- Run ID: Unique identifier
- Status: Finished, Running, or Failed
- Duration: Total training time
- User: Who ran the experiment
- Source: Script path and Git commit (if available)
Comparing Experiments
Side-by-Side Comparison
Select Multiple Runs
In the experiments table, check the boxes next to 2 or more runs you want to compare.
Analyze Differences
The comparison view shows:

Parameter Differences:

- Highlights parameters that differ between runs
- Shows which values led to better performance

Metrics Comparison:

- Side-by-side metrics table
- Visual charts overlaying multiple runs
- Statistical differences highlighted

Artifact Comparison:

- Compare confusion matrices
- Overlay ROC curves
Example Comparison Workflow
Comparing two configurations:

- Deeper network (Run 2) achieved +2.2% accuracy
- Lower learning rate was more stable
- Training time increased by 30%
Advanced MLflow Features
Searching Experiments
Use the search bar with MLflow query syntax, for example metrics.test_accuracy > 0.9 and params.epochs = '50'.

Downloading Artifacts
- Via UI
- Via CLI
- Via Python
- Navigate to a run’s Artifacts tab
- Click on any artifact (e.g.,
model_weights_001.pth) - Click the Download button
Tagging Runs
Add custom tags to organize experiments, e.g. mlflow.set_tag("dataset_version", "v2") inside a run.

Visualizing Training Progress

Training Loss Curve

The loss curve shows model improvement over epochs.

Interpreting Loss Curves
Healthy Training:

- Smooth, steady decrease
- Converges to a low value
- No erratic jumps

Overfitting:

- Training loss keeps decreasing
- Validation loss starts increasing
- Gap between train and test widens

Underfitting:

- Loss plateaus at a high value
- Little improvement over epochs
- Both train and test loss are high

Unstable Training:

- Loss oscillates wildly
- Large spikes or drops
- May indicate learning rate is too high
ROC Curve Analysis
The ROC curve (roc_curve.png) shows classifier performance:
- AUC = 1.0: Perfect classifier
- AUC = 0.9-0.95: Excellent performance
- AUC = 0.8-0.9: Good performance
- AUC = 0.7-0.8: Fair performance
- AUC = 0.5: Random guessing
training/training.py:199-210
Confusion Matrix
The confusion matrix shows the prediction distribution. A good model has:

- High values on the diagonal (TN, TP)
- Low values off the diagonal (FP, FN)
training/training.py:188-197
Model Registry (Advanced)
Registering Models
Promote trained models to the registry with mlflow.register_model().

Model Versioning

The registry tracks versions:

- Version 1: Initial baseline model
- Version 2: Improved architecture
- Version 3: Fine-tuned hyperparameters
Transitioning Model Stages

Registered models move through lifecycle stages:
- None: Experimental, not deployed
- Staging: Testing in staging environment
- Production: Deployed to production
- Archived: Deprecated, no longer used
Best Practices
Use Descriptive Run Names

Name runs based on configuration or experiment purpose; this makes it easier to identify runs later.
Log Early, Log Often

Don't wait until the end of training to log metrics; call mlflow.log_metric() inside the epoch loop. Real-time logging helps you:
- Stop training early if not converging
- Monitor progress remotely
- Debug issues as they occur
Version Everything

Log dataset versions, code versions, and environment info:
Clean Up Failed Runs

Delete incomplete or failed runs to keep things organized:
Backup MLflow Data

Regularly back up the mlruns/ directory, e.g. with tar -czf mlruns_backup.tar.gz mlruns/.

Troubleshooting
MLflow UI shows no experiments

Solution:

- Ensure you're in the correct directory (where mlruns/ exists)
- Check that the training script ran successfully
- Verify mlflow.set_experiment() is called in your code
- Try: mlflow ui --backend-store-uri file:///path/to/mlruns
Metrics not updating in real-time

Solution:

- Refresh the browser page
- Check that mlflow.log_metric() is called with the step parameter
- Ensure the training script hasn't crashed
- Restart the MLflow UI
Cannot access MLflow on port 5000

Solution:

- Check if the port is already in use: lsof -i :5000
- Use a different port: mlflow ui --port 5001
- Check firewall settings
Artifacts not appearing

Solution:

- Verify mlflow.log_artifact() is called
- Check file paths are correct
- Ensure files exist before logging
- Look for errors in training logs
Remote MLflow Tracking
For team collaboration, set up a remote tracking server:

Server Setup
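An example server invocation; the host, port, and storage locations are placeholders to adapt to your environment:

```shell
# Serve MLflow to the whole team; SQLite backend and local artifact
# root shown here are placeholders for your real storage choices
mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns
```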
Client Configuration
Next Steps
Training Models
Run experiments and track them in MLflow
Model Configuration
Tune hyperparameters and compare results
Running Inference
Deploy your best model from MLflow
Deployment
Deploy with Docker for production
