
Overview

Training configurations are defined in YAML files that specify model architecture, hyperparameters, and training settings. These configurations enable reproducible experiments and easy hyperparameter tuning.

Configuration Structure

Complete Example

hidden_layers:
  - 128
  - 64
  - 32
activation_functions:
  - relu
  - relu
  - relu
dropout_rate: 0.3
learning_rate: 0.0005
epochs: 150
batch_size: 64

Configuration Parameters

Network Architecture

hidden_layers
list[int]
required
List of integers defining the number of neurons in each hidden layer.
Constraints:
  • Length must match activation_functions length
  • Each value must be positive
  • Typically a decreasing sequence (e.g., [128, 64, 32])
Example:
hidden_layers:
  - 128  # First hidden layer
  - 64   # Second hidden layer
  - 32   # Third hidden layer
activation_functions
list[str]
required
List of activation function names for each hidden layer.
Allowed Values:
  • relu - Rectified Linear Unit (recommended for most cases)
  • leaky_relu - Leaky ReLU with negative_slope=0.1
  • gelu - Gaussian Error Linear Unit
  • sigmoid - Sigmoid function
  • tanh - Hyperbolic tangent
  • softmax - Softmax (use for multi-class intermediate layers)
Constraints:
  • Length must match hidden_layers length
  • One function per hidden layer
Example:
activation_functions:
  - relu
  - relu
  - relu
The number of hidden layers and activation functions must be equal. A validation error will be raised if they don’t match.
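For intuition, the allowed names correspond to standard activation functions. The scalar sketch below is illustrative only (the pipeline presumably uses its framework's built-in activations); the GELU shown is the common tanh approximation, and softmax is omitted because it operates on a whole vector rather than a single value:

```python
import math

# Illustrative scalar versions of the supported activation names.
ACTIVATIONS = {
    "relu":       lambda x: max(0.0, x),
    "leaky_relu": lambda x: x if x > 0 else 0.1 * x,  # negative_slope=0.1 per the docs
    "gelu":       lambda x: 0.5 * x * (1.0 + math.tanh(
                      math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3))),
    "sigmoid":    lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh":       math.tanh,
}

print(ACTIVATIONS["relu"](-2.0))        # 0.0
print(ACTIVATIONS["leaky_relu"](-2.0))  # -0.2
```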

Regularization

dropout_rate
float
required
Dropout probability applied after each hidden layer to prevent overfitting.
Range: 0.0 to 1.0
Recommendations:
  • Small models: 0.2 - 0.3
  • Medium models: 0.3 - 0.5
  • Large models: 0.5 - 0.7
Example:
dropout_rate: 0.3  # 30% of neurons randomly dropped
Dropout is only active during training. It’s automatically disabled during evaluation and inference.
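The train/eval distinction can be illustrated with a minimal inverted-dropout sketch (pure Python, not the pipeline's actual implementation):

```python
import random

def dropout(values, rate, training):
    """Inverted dropout: during training, zero each value with probability
    `rate` and rescale survivors by 1/(1-rate); identity at eval time."""
    if not training:
        return list(values)  # disabled during evaluation and inference
    keep = 1.0 - rate
    return [v / keep if random.random() < keep else 0.0 for v in values]

activations = [0.5, 1.2, -0.3, 0.8]
print(dropout(activations, rate=0.3, training=True))   # some values zeroed, rest scaled up
print(dropout(activations, rate=0.3, training=False))  # unchanged
```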

Optimization

learning_rate
float
required
Learning rate for the AdamW optimizer.
Typical Range: 0.0001 to 0.01
Recommendations:
  • Start with: 0.001 (default)
  • Large datasets: 0.0005 - 0.001
  • Small datasets: 0.001 - 0.005
Example:
learning_rate: 0.0005  # Conservative learning rate
The training pipeline uses the AdamW optimizer, which includes weight decay regularization for better generalization.
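To make the decoupled weight decay concrete, here is a single-parameter sketch of one AdamW update step (illustrative only; the pipeline itself would use its framework's built-in AdamW optimizer):

```python
import math

def adamw_step(w, grad, state, lr=0.001, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar weight. Unlike classic Adam with L2
    regularization, the decay term is applied directly to the weight."""
    state["t"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * grad          # first-moment estimate
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad   # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = adamw_step(1.0, grad=0.5, state=state, lr=0.0005)
print(w)  # slightly below 1.0: gradient step plus decoupled weight decay
```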

Training Duration

epochs
int
required
Number of complete passes through the training dataset.
Typical Range: 50 to 500
Recommendations:
  • Quick experiments: 50-100
  • Standard training: 100-200
  • Full training: 150-300
Example:
epochs: 150  # Train for 150 complete passes through the dataset

Batch Processing

batch_size
int
required
Number of samples processed before updating model weights.
Typical Range: 16 to 512
Recommendations:
  • Limited memory: 16-32
  • Standard: 32-64
  • Large memory: 64-128
  • Very large datasets: 128-512
Trade-offs:
  • Smaller batches: More updates, noisier gradients, better generalization
  • Larger batches: Fewer updates, smoother gradients, faster training
Example:
batch_size: 64  # Process 64 samples per batch
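Batch size also fixes the number of weight updates per epoch, ⌈n_samples / batch_size⌉. A simple chunking sketch, assuming the last partial batch is kept rather than dropped:

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive batches; the final batch may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

n_samples = 1000   # hypothetical dataset size
batch_size = 64
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)  # 16 weight updates per epoch

batches = list(iter_batches(list(range(n_samples)), batch_size))
print(len(batches), len(batches[-1]))  # 16 batches; the last holds 1000 - 15*64 = 40 samples
```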

ModelConfig Dataclass

The YAML configuration is parsed and mapped to the ModelConfig dataclass:
from dataclasses import dataclass
from typing import Literal

@dataclass
class ModelConfig:
    """Configuration for credit score model architecture"""
    
    input_size: int
    hidden_layers: list[int]
    activation_functions: list[
        Literal["relu", "leaky_relu", "gelu", "sigmoid", "softmax", "tanh"]
    ]
    output_size: int = 1
    dropout_rate: float = 0.2
    learning_rate: float = 0.001
    epochs: int = 100
    batch_size: int = 32
    checkpoint_path: str = "./model/checkpoint"
Location: model/model.py:20
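From these fields, the full sequence of layer widths is [input_size] + hidden_layers + [output_size]. A sketch of deriving the linear-layer shapes (illustrative; the actual model construction lives in model/model.py, and the 20 input features below are a hypothetical value):

```python
def linear_layer_shapes(input_size, hidden_layers, output_size=1):
    """Return (in_features, out_features) pairs for each linear layer,
    including the final output layer."""
    sizes = [input_size] + list(hidden_layers) + [output_size]
    return list(zip(sizes[:-1], sizes[1:]))

shapes = linear_layer_shapes(input_size=20, hidden_layers=[128, 64, 32])
print(shapes)  # [(20, 128), (128, 64), (64, 32), (32, 1)]
```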

Auto-Computed Fields

input_size
int
Number of input features - automatically determined from preprocessed data.
Computed as: X_train.shape[1]
output_size
int
default: 1
Number of output neurons - fixed at 1 for binary classification.
checkpoint_path
str
default: "./model/checkpoint"
Directory for saving model checkpoints (not currently used in the training loop).

Configuration Examples

Small Model (Fast Training)

hidden_layers:
  - 64
  - 32
activation_functions:
  - relu
  - relu
dropout_rate: 0.2
learning_rate: 0.001
epochs: 100
batch_size: 32
Use Case: Quick experiments, limited compute resources

Medium Model (Balanced)

hidden_layers:
  - 128
  - 64
  - 32
activation_functions:
  - relu
  - relu
  - relu
dropout_rate: 0.3
learning_rate: 0.0005
epochs: 150
batch_size: 64
Use Case: Standard production training, good accuracy-speed trade-off

Large Model (Maximum Capacity)

hidden_layers:
  - 256
  - 128
  - 64
  - 32
activation_functions:
  - relu
  - relu
  - relu
  - relu
dropout_rate: 0.4
learning_rate: 0.0003
epochs: 200
batch_size: 128
Use Case: Maximum model capacity, large datasets, extended training time

Alternative Activation Functions

hidden_layers:
  - 128
  - 64
  - 32
activation_functions:
  - gelu
  - gelu
  - gelu
dropout_rate: 0.3
learning_rate: 0.0005
epochs: 150
batch_size: 64
Use Case: Experimenting with GELU activation (often used in transformer architectures)

Usage in Training

Configurations are loaded and applied during training initialization:
# Load configuration
config = load_config('config/models-configs/model_config_001.yaml')

# Create ModelConfig instance
model_config = ModelConfig(
    input_size=X_train.shape[1],
    output_size=1,
    hidden_layers=config['hidden_layers'],
    activation_functions=config['activation_functions'],
    dropout_rate=config['dropout_rate'],
    learning_rate=config['learning_rate'],
    epochs=config['epochs'],
    batch_size=config['batch_size']
)

# Initialize model
model = CreditScoreModel(model_config)
Location: training/training.py:108
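load_config itself is not shown above; a minimal sketch, assuming the file is plain YAML and PyYAML is available:

```python
import yaml

def load_config(path):
    """Parse a YAML training configuration into a plain dict."""
    with open(path) as f:
        return yaml.safe_load(f)
```

safe_load is preferred over load here because configuration files should never need to construct arbitrary Python objects.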

Best Practices

Naming Convention

model_config_001.yaml  # Baseline
model_config_002.yaml  # Increased depth
model_config_003.yaml  # Different activation
model_config_004.yaml  # Tuned learning rate
Benefits:
  • Easy version tracking
  • Sequential experiment numbering
  • Automatic weight file naming (e.g., model_weights_001.pth)
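A small sketch of how the numeric suffix can carry over from config file to weight file (illustrative; the pipeline's actual naming logic may differ):

```python
import re

def weights_filename(config_filename):
    """Map model_config_NNN.yaml to model_weights_NNN.pth."""
    match = re.match(r"model_config_(\d+)\.yaml$", config_filename)
    if match is None:
        raise ValueError(f"unexpected config filename: {config_filename}")
    return f"model_weights_{match.group(1)}.pth"

print(weights_filename("model_config_001.yaml"))  # model_weights_001.pth
```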

Hyperparameter Tuning Strategy

  1. Start with baseline configuration
    hidden_layers: [128, 64, 32]
    activation_functions: [relu, relu, relu]
    dropout_rate: 0.3
    learning_rate: 0.001
    epochs: 100
    batch_size: 32
    
  2. Tune learning rate first
    • Try: [0.0001, 0.0005, 0.001, 0.005]
    • Monitor training loss convergence
  3. Adjust network depth
    • Add/remove layers
    • Ensure gradual size reduction
  4. Optimize regularization
    • Increase dropout if overfitting
    • Decrease dropout if underfitting
  5. Fine-tune batch size
    • Balance speed vs. stability
    • Consider GPU memory constraints
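Step 2 of the strategy above can be automated by generating one config variant per candidate learning rate; a minimal sketch, with plain dicts standing in for the YAML files:

```python
import copy

baseline = {
    "hidden_layers": [128, 64, 32],
    "activation_functions": ["relu", "relu", "relu"],
    "dropout_rate": 0.3,
    "learning_rate": 0.001,
    "epochs": 100,
    "batch_size": 32,
}

# One variant per candidate learning rate, everything else held fixed.
sweep = []
for lr in [0.0001, 0.0005, 0.001, 0.005]:
    cfg = copy.deepcopy(baseline)
    cfg["learning_rate"] = lr
    sweep.append(cfg)

print([c["learning_rate"] for c in sweep])  # [0.0001, 0.0005, 0.001, 0.005]
```

Each variant could then be written out as model_config_00N.yaml and trained in sequence, comparing training-loss convergence across runs.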

Validation

The training pipeline validates configuration parameters:
if len(config.hidden_layers) != len(config.activation_functions):
    raise ValueError(
        "The length of hidden_layers must equal the length of activation_functions"
    )
Location: model/model.py:66
Invalid configurations will raise errors during model initialization, before training begins.

MLflow Parameter Logging

All configuration parameters are automatically logged to MLflow:
with mlflow.start_run(run_name=config_name):
    mlflow.log_params(config)
    mlflow.log_param("config_file", config_name)
This enables:
  • Experiment comparison
  • Hyperparameter analysis
  • Reproducible training runs
  • Configuration versioning
