mlpp::losses namespace.
Regression losses
Regression losses measure the discrepancy between continuous predictions and ground-truth values.

Mean squared error (MSE)
The most common regression loss, MSE is the arithmetic mean of squared prediction errors:

- Differentiable everywhere
- Heavily penalizes large errors (quadratic penalty)
- Sensitive to outliers
- Optimal for Gaussian noise
Minimizing MSE yields the maximum likelihood estimate when errors follow a normal distribution.
Mean absolute error (MAE)
MAE uses absolute differences, providing a more robust alternative to MSE:

- More robust to outliers than MSE
- Non-differentiable at zero (subgradient exists)
- Linear penalty for errors
- Optimal for Laplacian noise

Prefer MAE when:

- Data contains outliers or heavy-tailed distributions
- You want to minimize median error rather than mean error
- Interpretability is important (same units as target)
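Again as an illustrative sketch (not the library's real API), MAE simply averages absolute errors:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute error: (1/n) * sum_i |y_pred[i] - y_true[i]|
double mae(const std::vector<double>& y_true, const std::vector<double>& y_pred) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        sum += std::abs(y_pred[i] - y_true[i]);  // linear penalty, same units as the target
    }
    return sum / static_cast<double>(y_true.size());
}
```

With the same example as above (errors of 1 and 2), MAE = (1 + 2) / 2 = 1.5, versus MSE = 2.5; the quadratic penalty inflates the larger error.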
Huber loss
The Huber loss combines the best properties of MSE and MAE: it is quadratic for small errors and linear for large errors.

delta (δ): Threshold where the loss transitions from quadratic to linear

- Small δ: More robust, approaches MAE
- Large δ: Less robust, approaches MSE
- Default: δ = 1.0
- Differentiable everywhere
- Robust to outliers (like MAE)
- Smooth gradients near zero (like MSE)
- Parameterized robustness
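A minimal sketch of the piecewise definition follows. Note the 1/2 factor in the quadratic branch is one common convention (it makes the two branches join smoothly at |e| = δ); whether MLPP uses exactly this scaling is an assumption here:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Huber loss: quadratic for |e| <= delta, linear beyond it.
// The 0.5 factors make loss and gradient continuous at the transition.
double huber(const std::vector<double>& y_true, const std::vector<double>& y_pred,
             double delta = 1.0) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        const double e = std::abs(y_pred[i] - y_true[i]);
        sum += (e <= delta) ? 0.5 * e * e                 // MSE-like region
                            : delta * (e - 0.5 * delta);  // MAE-like region
    }
    return sum / static_cast<double>(y_true.size());
}
```

At the boundary e = δ both branches evaluate to 0.5·δ², so the loss surface has no kink in value, only in curvature.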
Classification losses
Classification losses measure the quality of discrete predictions.

Binary cross entropy
Used for binary classification with probabilistic outputs in [0, 1]:

- Outputs should be probabilities (sigmoid or softmax)
- Convex in log-odds space
- Heavily penalizes confident misclassifications
- Maximum likelihood for Bernoulli distributions
Predictions are automatically clamped to [ε, 1-ε] where ε = 1e-12 to prevent numerical instability from log(0).
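The clamping described above can be sketched as follows (a standalone illustration, not MLPP's actual signature):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Binary cross entropy with predictions clamped to [eps, 1 - eps]
// so that log() never receives 0.
double binary_cross_entropy(const std::vector<double>& y_true,
                            const std::vector<double>& y_pred) {
    const double eps = 1e-12;
    double sum = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        const double p = std::clamp(y_pred[i], eps, 1.0 - eps);  // numerical guard
        sum += -(y_true[i] * std::log(p) + (1.0 - y_true[i]) * std::log(1.0 - p));
    }
    return sum / static_cast<double>(y_true.size());
}
```

A maximally uncertain prediction of 0.5 for a positive example costs -log(0.5) = ln 2 ≈ 0.693; a confident wrong prediction near 0 costs far more, which is the "heavy penalty for confident misclassifications" noted above.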
Multiclass cross entropy
Generalization of binary cross entropy for K > 2 classes.

y_true: One-hot encoded labels, shape (n_samples, n_classes)
y_pred: Predicted probabilities (typically from softmax), shape (n_samples, n_classes)

- Each prediction vector should sum to 1.0
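A sketch over nested vectors (the library's real container types may differ; the `eps` guard mirrors the binary case):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Multiclass cross entropy: -(1/n) * sum_i sum_k y_true[i][k] * log(y_pred[i][k]).
// With one-hot labels only the true class of each sample contributes.
double cross_entropy(const std::vector<std::vector<double>>& y_true,
                     const std::vector<std::vector<double>>& y_pred) {
    const double eps = 1e-12;
    double sum = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        for (std::size_t k = 0; k < y_true[i].size(); ++k) {
            sum += -y_true[i][k] * std::log(std::max(y_pred[i][k], eps));  // avoid log(0)
        }
    }
    return sum / static_cast<double>(y_true.size());
}
```

With K = 2 and one-hot labels this reduces to binary cross entropy, as the section states.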
Hinge loss
The standard SVM loss for binary classification with labels in {-1, +1}:

- Designed for large-margin classification
- Zero loss for correctly classified points beyond the margin
- Linear penalty for violations
- Non-differentiable at margin boundary
- Loss = 0: Correct classification with margin ≥ 1
- Loss > 0: Either misclassified or margin < 1
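The two cases above fall out directly of the formula max(0, 1 - y·f); a standalone sketch (hypothetical function name, raw decision scores rather than probabilities):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hinge loss: mean of max(0, 1 - y * f), with labels y in {-1, +1}
// and raw (unsquashed) decision scores f.
double hinge(const std::vector<double>& y_true, const std::vector<double>& scores) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        // Margin y * f >= 1 gives zero loss; anything less is penalized linearly.
        sum += std::max(0.0, 1.0 - y_true[i] * scores[i]);
    }
    return sum / static_cast<double>(y_true.size());
}
```

For example, a positive sample with score 2 contributes 0 (margin satisfied), score 0 contributes 1 (on the boundary), and a negative sample with score 0.5 contributes 1.5 (misclassified).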
Squared hinge loss
A differentiable variant of hinge loss with a quadratic penalty:

- Differentiable everywhere (enables gradient-based optimization)
- Stronger penalty for large margin violations
- Smoother loss surface
- More sensitive to outliers than standard hinge
- May require smaller learning rates
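The change from hinge loss is a single squaring of the margin violation, sketched here under the same assumptions as the hinge example:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Squared hinge: mean of max(0, 1 - y * f)^2.
// Squaring removes the kink at the margin boundary, so the loss is
// differentiable everywhere, at the cost of amplifying large violations.
double squared_hinge(const std::vector<double>& y_true,
                     const std::vector<double>& scores) {
    double sum = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        const double m = std::max(0.0, 1.0 - y_true[i] * scores[i]);
        sum += m * m;  // quadratic penalty on margin violations
    }
    return sum / static_cast<double>(y_true.size());
}
```

The negative sample with score 0.5 from the hinge example now costs 1.5² = 2.25 instead of 1.5, which illustrates both the stronger penalty and the increased outlier sensitivity.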
Regularization terms
MLPP provides regularization penalties to prevent overfitting.

L1 penalty (Lasso)
Encourages sparse solutions:

- Drives small weights to exactly zero
- Performs automatic feature selection
- Non-differentiable at zero (use proximal methods)
L2 penalty (Ridge)
Encourages small but non-zero weights:

- Shrinks all weights toward zero
- Differentiable everywhere
- Improves numerical stability
Elastic net penalty
Combines L1 and L2 regularization.

alpha (α): Overall regularization strength
l1_ratio (ρ): Mixing parameter ∈ [0, 1]

- ρ = 0: Pure L2 (ridge)
- ρ = 1: Pure L1 (lasso)
- 0 < ρ < 1: Combination of both

Prefer elastic net when:

- Features are correlated (L2 helps where L1 struggles)
- Need feature selection but want to keep correlated groups
- More stable than pure L1 when features >> samples
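Since elastic net reduces to L1 at ρ = 1 and to L2 at ρ = 0, all three penalties can be sketched with one function. The scaling here (the 0.5 on the L2 term, scikit-learn style) is an assumption; MLPP's exact convention may differ:

```cpp
#include <cmath>
#include <vector>

// Elastic net penalty: alpha * (rho * ||w||_1 + 0.5 * (1 - rho) * ||w||_2^2).
// rho = 1 gives pure L1 (lasso); rho = 0 gives pure L2 (ridge).
double elastic_net_penalty(const std::vector<double>& w,
                           double alpha, double l1_ratio) {
    double l1 = 0.0, l2 = 0.0;
    for (double wi : w) {
        l1 += std::abs(wi);  // L1 part: promotes exact zeros (feature selection)
        l2 += wi * wi;       // L2 part: shrinks all weights, never to exactly zero
    }
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2);
}
```

For w = {3, -4}: at ρ = 1 the penalty is α·(|3| + |-4|) = 7α, and at ρ = 0 it is α·0.5·(9 + 16) = 12.5α.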