This guide walks you through the complete process of training a credit score prediction model using PyTorch and MLflow for experiment tracking.

Prerequisites

Before training models, ensure you have:
  • Python 3.10+ installed
  • UV package manager (recommended) or pip
  • MLflow running for experiment tracking
  • Access to the training dataset

Setup Environment

1. Install Dependencies

Install all required packages using UV or pip:
uv sync
UV provides faster, deterministic installations compared to traditional pip.
2. Start MLflow UI

Launch the MLflow tracking server to monitor training in real-time:
uv run mlflow ui
Access the dashboard at http://127.0.0.1:5000 to view:
  • Training metrics (loss, accuracy)
  • Model parameters and configurations
  • Saved artifacts and visualizations
Keep the MLflow UI running in a separate terminal window during training.

Training Process

Basic Training Command

The training script uses YAML configuration files to define model architecture and hyperparameters:
uv run training/training.py --config config/models-configs/model_config_001.yaml
See the Model Configuration guide for details on creating custom configurations.
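A configuration file might contain fields like the following. The key names match those the training script reads (`hidden_layers`, `activation_functions`, `dropout_rate`, `learning_rate`, `epochs`, `batch_size`); the values are illustrative only, so see the Model Configuration guide for the exact schema:

```yaml
# Illustrative values only; key names mirror those read in training.py
hidden_layers: [64, 32]
activation_functions: ["relu", "relu"]
dropout_rate: 0.3
learning_rate: 0.001
epochs: 150
batch_size: 64
```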

Training Script Workflow

The training process (training/training.py:51-237) follows these steps:
1. Load Configuration

The script reads hyperparameters from the specified YAML file:
config = load_config(config_path)
config_name = os.path.splitext(os.path.basename(config_path))[0]
Reference: training/training.py:40-48
2. Initialize MLflow Experiment

All training runs are tracked under the “Credit Score Training” experiment:
mlflow.set_experiment("Credit Score Training")

with mlflow.start_run(run_name=config_name):
    mlflow.log_params(config)
    mlflow.log_param("config_file", config_name)
Reference: training/training.py:65-70
3. Load and Preprocess Data

The training data is loaded and preprocessed automatically:
df = load_data(dataset_path)
X_train, X_test, y_train, y_test = preprocess_data(
    df, save_path=preprocessor_path
)
The preprocessor is saved to processing/preprocessor.joblib for use during inference.
Reference: training/training.py:85-93
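The exact transformations are defined inside preprocess_data; the general pattern it follows (fit on the training split only, persist the fitted preprocessor with joblib, reuse it at inference time) can be sketched as below. This is an assumed simplification using a plain StandardScaler, not the project's actual implementation:

```python
# Hypothetical sketch of the fit/save/reuse pattern behind preprocess_data;
# the real script's transformations may differ.
import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_sketch(X, y, save_path="preprocessor.joblib"):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    scaler = StandardScaler().fit(X_train)  # fit on training data only
    joblib.dump(scaler, save_path)          # persisted for inference-time reuse
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test

# At inference time, the same fitted scaler is loaded back:
# scaler = joblib.load("preprocessor.joblib")
```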
4. Create PyTorch DataLoaders

Data is converted to PyTorch tensors and loaded in batches:
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
Reference: training/training.py:96-104
5. Initialize Model

The neural network is built based on configuration parameters:
model_config = ModelConfig(
    input_size=input_size,
    output_size=1,
    hidden_layers=config["hidden_layers"],
    activation_functions=config["activation_functions"],
    dropout_rate=config["dropout_rate"],
    learning_rate=config["learning_rate"],
    epochs=config["epochs"],
    batch_size=batch_size,
)

model = CreditScoreModel(model_config)
Reference: training/training.py:107-119
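CreditScoreModel itself is defined in the project. One plausible reading of how these parameters map to a network is the MLP sketch below; it is an assumption for illustration (with activation_functions simplified to ReLU throughout), not the repo's actual class:

```python
# Assumed MLP construction from the config fields; the real CreditScoreModel
# may differ, and per-layer activation_functions are simplified to ReLU here.
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_layers, dropout_rate, output_size=1):
    layers, in_features = [], input_size
    for width in hidden_layers:
        layers += [nn.Linear(in_features, width), nn.ReLU(), nn.Dropout(dropout_rate)]
        in_features = width
    layers.append(nn.Linear(in_features, output_size))  # raw logits: no sigmoid,
    return nn.Sequential(*layers)                       # BCEWithLogitsLoss applies it

model = build_mlp(input_size=20, hidden_layers=[64, 32], dropout_rate=0.3)
```

Note the final layer emits raw logits; this matches the use of BCEWithLogitsLoss in the training loop and the explicit torch.sigmoid call during evaluation.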
6. Training Loop

The model is trained using Binary Cross-Entropy Loss and AdamW optimizer:
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=config["learning_rate"])

for epoch in range(epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
Metrics are logged to MLflow at each epoch:
mlflow.log_metric("train_loss", epoch_loss, step=epoch)
mlflow.log_metric("train_accuracy", epoch_acc, step=epoch)
Reference: training/training.py:122-158
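The epoch_loss and epoch_acc values logged above are accumulated across batches inside the loop. A self-contained sketch of that accumulation pattern, using synthetic data and a simplified stand-in model rather than the script's own:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for the real features and labels
X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)  # simplified stand-in for the real MLP
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(3):
    model.train()
    total_loss, correct = 0.0, 0
    for inputs, labels in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * inputs.size(0)  # sum batch losses...
        preds = (torch.sigmoid(outputs) > 0.5).float()
        correct += (preds == labels).sum().item()
    epoch_loss = total_loss / len(loader.dataset)   # ...then average per epoch
    epoch_acc = correct / len(loader.dataset)
```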
7. Evaluation

The model is evaluated on the test set:
model.eval()
with torch.no_grad():
    outputs = model(X_test_tensor)
    probs = torch.sigmoid(outputs).numpy()
    preds = (probs > 0.5).astype(int)
Multiple metrics are computed and logged:
  • Accuracy: Overall classification accuracy
  • ROC AUC: Area under the ROC curve
  • Precision: Positive predictive value
  • Recall: Sensitivity
  • F1 Score: Harmonic mean of precision and recall
Reference: training/training.py:161-183
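Given probs and preds as computed above, these metrics are the standard scikit-learn ones; the sketch below uses stand-in arrays and assumes scikit-learn is how the script computes them:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Stand-in arrays; in the script these come from the test set
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
probs = np.array([0.2, 0.8, 0.6, 0.55, 0.9, 0.4, 0.7, 0.45])
preds = (probs > 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, preds),
    "roc_auc": roc_auc_score(y_true, probs),  # AUC uses probabilities, not labels
    "precision": precision_score(y_true, preds),
    "recall": recall_score(y_true, preds),
    "f1": f1_score(y_true, preds),
}
```

Each value can then be logged with mlflow.log_metric so it appears alongside the per-epoch training curves.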
8. Save Artifacts

The trained model weights are saved:
model_save_path = os.path.join(save_dir, f"{weights_name}.pth")
torch.save(model.state_dict(), model_save_path)
mlflow.log_artifact(model_save_path)
Visualizations are also generated and logged:
  • Confusion Matrix
  • ROC Curve
  • Precision-Recall Curve
  • Classification Report
Reference: training/training.py:228-236
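The plotting code for those visualizations lives in the script; the tabular pieces behind them can be sketched with scikit-learn alone. The file name below is illustrative, not necessarily what the script writes:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in predictions; the script uses the real test-set outputs
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
preds = np.array([0, 1, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, preds)  # rows: true class, cols: predicted class
report = classification_report(y_true, preds)

with open("classification_report.txt", "w") as f:
    f.write(report)
# Files like this are attached to the run via mlflow.log_artifact(...)
```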

Training Multiple Configurations

You can train multiple models with different hyperparameters in parallel or sequentially:
# Train with configuration 001
uv run training/training.py --config config/models-configs/model_config_001.yaml

# Train with configuration 002
uv run training/training.py --config config/models-configs/model_config_002.yaml
Parallel training requires sufficient RAM and GPU memory. Monitor system resources carefully.

Monitoring Training Progress

Once training starts, you can monitor it through MLflow:
  1. Navigate to http://127.0.0.1:5000
  2. Select the “Credit Score Training” experiment
  3. View real-time metrics:
    • Training loss curve
    • Training accuracy progression
    • Test metrics upon completion
  4. Compare different runs side-by-side
  5. Download artifacts (model weights, visualizations)

Understanding Training Output

During training, you’ll see console output like:
Epoch [1/150], Loss: 0.6234, Accuracy: 0.7123
Epoch [2/150], Loss: 0.5891, Accuracy: 0.7345
Epoch [3/150], Loss: 0.5567, Accuracy: 0.7501
...
Test Accuracy: 0.8234
Test ROC AUC: 0.8756
Model weights saved to model/model_weights_001.pth

Troubleshooting

If training loss is not decreasing:
  • Check learning rate: Try reducing it (e.g., from 0.001 to 0.0001)
  • Verify data preprocessing: Ensure features are properly normalized
  • Increase model capacity: Add more hidden layers or neurons
  • Check for NaN values: Look at MLflow metrics for anomalies

If the model is overfitting (training accuracy far above test accuracy):
  • Increase dropout rate: Try 0.4 or 0.5 instead of 0.3
  • Add regularization: Consider L2 regularization in the optimizer
  • Reduce model complexity: Use fewer layers or neurons
  • Get more training data: If possible, expand the dataset

If the MLflow UI is not accessible:
  • Verify MLflow is running: Check http://127.0.0.1:5000
  • Check port availability: Ensure port 5000 is not in use
  • Restart MLflow: Stop and restart the MLflow UI
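For the L2 regularization suggestion, AdamW exposes this directly through its weight_decay argument (strictly speaking, decoupled weight decay, which plays the L2-penalty role here). The value below is an illustrative starting point, not a tuned recommendation:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # stand-in for the real model
# weight_decay enables AdamW's decoupled weight decay (L2-style regularization);
# 1e-4 is illustrative, not a tuned value.
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```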

Next Steps

  • Model Configuration: Learn how to customize model architecture and hyperparameters
  • Running Inference: Use your trained model to make predictions
  • MLflow Tracking: Deep dive into experiment tracking and visualization
  • Deployment: Deploy your model to production with Docker
