MLPP employs problem-specific optimization strategies tailored to each algorithm’s mathematical structure. This guide covers the automatic solver selection in linear regression and the Sequential Minimal Optimization (SMO) algorithm for SVM training.

Linear regression solvers

The LinearRegression class automatically selects an optimal solver based on problem geometry and numerical conditioning. The regularized least squares problem is:
min_w  (1/2n) ‖Xw - y‖² + (λ/2) ‖w‖²

Solver selection strategy

MLPP provides four solver methods:
enum class SolveMethod {
    Auto,       // Automatic selection (recommended)
    Cholesky,   // Normal equations via LDLT
    SVD,        // Thin BDCSVD decomposition
    JacobiSVD   // Full-pivoting JacobiSVD
};
When method = SolveMethod::Auto, the solver is chosen based on:
  1. Problem dimensions
    • n >> d → Cholesky (normal equations)
    • d >> n → SVD (avoid forming X^T X)
  2. Numerical conditioning
    • Well-conditioned → Cholesky or SVD
    • Ill-conditioned → JacobiSVD fallback
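The decision rules above can be sketched as a small selection function. The specific thresholds (aspect ratio 2, condition number 10^6) are illustrative assumptions, not MLPP's actual cutoffs:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch of the Auto selection rules; thresholds are assumed.
std::string choose_solver(std::size_t n, std::size_t d, double cond_estimate) {
    if (cond_estimate > 1e6) return "JacobiSVD";  // ill-conditioned fallback
    if (n >= 2 * d) return "Cholesky";            // n >> d: normal equations
    return "SVD";                                 // d >> n: avoid forming X^T X
}
```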
#include <MLPP/Supervised Learning/Regression/linear_regression.hpp>

using namespace mlpp::regression;

LinearRegression<double> model(
    true,                      // fit_intercept
    0.1,                       // regularization lambda
    SolveMethod::Auto          // automatic solver selection
);

model.fit(X_train, y_train);

Cholesky solver

Solves the normal equations using LDLT decomposition:
(X^T X + λI) w = X^T y
Complexity:
  • Matrix formation: O(nd²)
  • Cholesky decomposition: O(d³)
  • Total: O(nd² + d³)
When to use:
  • n >> d (many samples, few features)
  • Problem is well-conditioned (condition number < 10^6)
  • Need maximum speed for small d
Advantages:
  • Fastest method when applicable
  • Direct closed-form solution
Disadvantages:
  • Requires forming X^T X (can lose precision)
  • Fails for ill-conditioned problems
  • Memory: O(d²)
LinearRegression<double> model(
    true,                      // fit_intercept
    1e-3,                      // regularization
    SolveMethod::Cholesky      // explicit Cholesky
);
Regularization (λ > 0) improves conditioning by adding λI to the diagonal of X^T X.
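For d = 2 the normal equations can be solved in closed form, which makes the method easy to see end to end. MLPP factors the system with LDLT; the Cramer's-rule solve below is purely illustrative:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Form A = X^T X + λI and b = X^T y, then solve the 2x2 system A w = b.
std::array<double, 2> solve_normal_eq_2d(const std::vector<std::array<double, 2>>& X,
                                         const std::vector<double>& y,
                                         double lambda) {
    double a00 = lambda, a01 = 0.0, a11 = lambda;  // A starts as λI
    double b0 = 0.0, b1 = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        a00 += X[i][0] * X[i][0];
        a01 += X[i][0] * X[i][1];
        a11 += X[i][1] * X[i][1];
        b0  += X[i][0] * y[i];
        b1  += X[i][1] * y[i];
    }
    double det = a00 * a11 - a01 * a01;  // Cramer's rule (fine for d = 2)
    return {(b0 * a11 - b1 * a01) / det, (a00 * b1 - a01 * b0) / det};
}
```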

SVD solver

Uses Bidiagonal Divide-and-Conquer SVD (BDCSVD) for the thin decomposition:
X = U Σ V^T
w = V diag(σᵢ/(σᵢ² + λ)) U^T y
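The per-component factor σᵢ/(σᵢ² + λ) acts as a spectral filter, and a one-line sketch (illustrative) makes its behavior at the extremes easy to check: with λ = 0 it reduces to the pseudo-inverse factor 1/σᵢ, while λ > 0 damps small singular values toward zero instead of amplifying them.

```cpp
// Spectral filter applied to each singular value in the ridge solution.
double ridge_filter(double sigma, double lambda) {
    return sigma / (sigma * sigma + lambda);
}
```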
Complexity:
  • O(nd²) when n > d
  • O(n²d) when d > n
When to use:
  • d >> n (more features than samples)
  • Moderate ill-conditioning
  • Need to inspect singular values
Advantages:
  • Never forms X^T X (numerically stable)
  • Handles rank-deficient problems
  • Provides condition number diagnostics
Disadvantages:
  • Slower than Cholesky for n >> d
  • Higher memory for intermediate results
LinearRegression<double> model(
    true,
    0.01,
    SolveMethod::SVD
);

model.fit(X, y);

// Check numerical conditioning
double cond = model.condition_number();
if (cond > 1e8) {
    std::cerr << "Warning: Problem is ill-conditioned\n";
}

JacobiSVD solver

Full-pivoting Jacobi SVD provides maximum numerical stability.
Complexity:
  • O(n²d) or O(nd²), depending on shape
  • Significantly slower than BDCSVD
When to use:
  • Severely ill-conditioned problems (condition number > 10^10)
  • Require maximum precision
  • Small to moderate problem sizes
Advantages:
  • Most numerically stable method
  • Guaranteed convergence
Disadvantages:
  • Slowest solver
  • Only necessary for pathological cases
LinearRegression<double> model(
    true,
    0.0,                       // No regularization
    SolveMethod::JacobiSVD     // Maximum stability
);
JacobiSVD should only be used when other methods fail due to numerical issues. For most problems, Auto selection is sufficient.

Data preprocessing

MLPP automatically standardizes features before solving:

Feature standardization

X_scaled = (X - μ) / σ
Where:
  • μ: Feature-wise mean
  • σ: Feature-wise standard deviation
Effects:
  • Improves numerical conditioning
  • Makes regularization fair across features
  • Accelerates convergence
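A minimal sketch of the per-column transform (MLPP applies this internally; the standalone helper below is illustrative, using the population standard deviation):

```cpp
#include <cmath>
#include <vector>

// Standardize one feature column: subtract the mean, divide by the std dev.
std::vector<double> standardize(std::vector<double> col) {
    double mu = 0.0;
    for (double v : col) mu += v;
    mu /= static_cast<double>(col.size());
    double var = 0.0;
    for (double v : col) var += (v - mu) * (v - mu);
    const double sigma = std::sqrt(var / static_cast<double>(col.size()));
    for (double& v : col) v = (v - mu) / sigma;
    return col;
}
```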

Centering for intercept

When fit_intercept = true:
  1. Center X and y by subtracting means
  2. Solve for weights without bias term
  3. Compute intercept from means
Advantages:
  • Intercept is never regularized
  • Reduces dimensionality of optimization
  • Improves numerical stability

Coefficient transformation

Solution is transformed back to original feature space:
w_original = w_scaled / σ
b_original = y_mean - w_original^T μ
This ensures predictions work on raw, unscaled data:
model.fit(X_train, y_train);  // Internal scaling
auto y_pred = model.predict(X_test);  // No scaling needed

SVM optimization

MLPP uses Sequential Minimal Optimization (SMO) for training kernel SVMs. SMO decomposes the dual quadratic program into a series of smallest possible subproblems.

Dual formulation

The SVM dual problem is:
maximize   W(α) = Σᵢ αᵢ - (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)

subject to:
    0 ≤ αᵢ ≤ C        ∀i
    Σᵢ αᵢ yᵢ = 0
Where:
  • α: Dual variables (Lagrange multipliers)
  • C: Soft margin penalty parameter
  • K: Kernel function
  • y: Binary labels in {-1, +1}

SMO algorithm

SMO iteratively optimizes pairs of dual variables:
  1. Select working set: Choose two indices (i, j) that violate KKT conditions
  2. Solve 2D subproblem: Analytically optimize αᵢ and αⱼ
  3. Update bias: Recompute threshold b
  4. Cache errors: Update error cache for next iteration
  5. Repeat: Until KKT conditions satisfied within tolerance
using namespace mlpp::classifiers::kernel;

SVM svm(
    X_train,
    y_train,
    KernelFunction(std::make_unique<RBFKernel>(0.5)),
    1.0  // C parameter
);

svm.fit();  // Runs SMO algorithm

KKT conditions

Optimality is characterized by Karush-Kuhn-Tucker conditions:
For each sample i:
    αᵢ = 0      ⟹  yᵢ f(xᵢ) ≥ 1
    0 < αᵢ < C  ⟹  yᵢ f(xᵢ) = 1    (support vector)
    αᵢ = C      ⟹  yᵢ f(xᵢ) ≤ 1
SMO selects pairs that maximally violate these conditions.

Working set selection

MLPP uses a heuristic for choosing (i, j):
  1. First variable (i): Loop over all training examples violating KKT within tolerance
  2. Second variable (j): Choose j that maximizes |Eᵢ - Eⱼ| where E is the error cache
const bool violates_kkt =
    (yᵢ * Eᵢ < -tol && αᵢ < C) ||  // Below margin, can increase αᵢ
    (yᵢ * Eᵢ >  tol && αᵢ > 0);    // Above margin, can decrease αᵢ
This heuristic maximizes progress per iteration.

Analytical optimization

For the selected pair (i, j), SMO solves:
η = 2K(xᵢ, xⱼ) - K(xᵢ, xᵢ) - K(xⱼ, xⱼ)

αⱼ_new = αⱼ_old - (yⱼ(Eᵢ - Eⱼ)) / η

// Clip to feasible region [L, H]
αⱼ_new = clip(αⱼ_new, L, H)

// Update αᵢ to maintain constraint Σ αᵢ yᵢ = 0
αᵢ_new = αᵢ_old + yᵢ yⱼ (αⱼ_old - αⱼ_new)
The feasible region [L, H] depends on label agreement:
if (yᵢ ≠ yⱼ) {
    L = max(0, αⱼ - αᵢ)
    H = min(C, C + αⱼ - αᵢ)
} else {
    L = max(0, αᵢ + αⱼ - C)
    H = min(C, αᵢ + αⱼ)
}
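Put together, the clipped pair update can be sketched as a single function. Names mirror the derivation above; this is an illustrative sketch, not MLPP's internal API:

```cpp
#include <algorithm>

struct PairUpdate { double ai; double aj; };

// Analytic SMO pair update: compute the feasible box [L, H], take the
// unconstrained step along αⱼ, clip, then restore Σ αᵢ yᵢ = 0 via αᵢ.
PairUpdate smo_pair_update(double ai, double aj, double yi, double yj,
                           double Ei, double Ej,
                           double Kii, double Kjj, double Kij, double C) {
    double L, H;
    if (yi != yj) { L = std::max(0.0, aj - ai);     H = std::min(C, C + aj - ai); }
    else          { L = std::max(0.0, ai + aj - C); H = std::min(C, ai + aj); }
    double eta = 2.0 * Kij - Kii - Kjj;            // ≤ 0 for a valid kernel
    double aj_new = aj - yj * (Ei - Ej) / eta;
    aj_new = std::min(H, std::max(L, aj_new));     // clip to [L, H]
    double ai_new = ai + yi * yj * (aj - aj_new);  // maintain the equality constraint
    return {ai_new, aj_new};
}
```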

Bias computation

After updating α, the bias term is recomputed using support vectors:
b = (1/|S|) Σᵢ∈S (yᵢ - Σⱼ αⱼ yⱼ K(xⱼ, xᵢ))
Where S = {i : 0 < αᵢ < C} is the set of support vectors on the margin.
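The average over S can be sketched with a precomputed Gram matrix K (illustrative helper, not MLPP's internal routine):

```cpp
#include <cstddef>
#include <vector>

// Average yᵢ - f₀(xᵢ) over on-margin support vectors, where f₀ is the
// decision value without bias: f₀(xᵢ) = Σⱼ αⱼ yⱼ K(xⱼ, xᵢ).
double compute_bias(const std::vector<double>& alpha,
                    const std::vector<double>& y,
                    const std::vector<std::vector<double>>& K,
                    double C) {
    double sum = 0.0;
    int count = 0;
    for (std::size_t i = 0; i < alpha.size(); ++i) {
        if (alpha[i] <= 0.0 || alpha[i] >= C) continue;  // keep 0 < αᵢ < C only
        double f = 0.0;
        for (std::size_t j = 0; j < alpha.size(); ++j)
            f += alpha[j] * y[j] * K[j][i];
        sum += y[i] - f;
        ++count;
    }
    return count > 0 ? sum / count : 0.0;
}
```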

Error cache

MLPP maintains a cached error vector to avoid recomputing decision values:
Eᵢ = f(xᵢ) - yᵢ
   = Σⱼ αⱼ yⱼ K(xⱼ, xᵢ) + b - yᵢ
After updating αᵢ and αⱼ, errors are efficiently updated:
for k in 1..n:
    Eₖ += (αᵢ_new - αᵢ_old) yᵢ K(xᵢ, xₖ) +
          (αⱼ_new - αⱼ_old) yⱼ K(xⱼ, xₖ) +
          (b_new - b_old)
This incremental update is O(n) per SMO iteration.
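The incremental refresh above translates directly to a loop over the cache (sketch; the delta arguments are the changes in αᵢ, αⱼ, and b from the pair update):

```cpp
#include <cstddef>
#include <vector>

// O(n) error-cache refresh after updating the pair (i, j) and the bias.
void update_errors(std::vector<double>& E,
                   const std::vector<std::vector<double>>& K,
                   const std::vector<double>& y,
                   std::size_t i, std::size_t j,
                   double d_ai, double d_aj, double d_b) {
    for (std::size_t k = 0; k < E.size(); ++k)
        E[k] += d_ai * y[i] * K[i][k] + d_aj * y[j] * K[j][k] + d_b;
}
```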

Convergence criteria

SMO terminates when no changes occur for multiple consecutive passes:
constexpr double tol = 1e-3;           // KKT tolerance
constexpr std::size_t max_passes = 10; // Stopping criterion
Termination:
  • If no αᵢ changes during a full pass, increment pass counter
  • If pass counter reaches max_passes, terminate
  • If any αᵢ changes, reset pass counter to zero
Increasing max_passes improves solution quality but increases training time. The default (10 passes) provides a good balance.
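The pass-counter logic can be isolated and checked on its own. In the sketch below, changes_per_pass stands in for the number of α updates a real SMO pass would make (an assumed stand-in, purely for illustration):

```cpp
#include <cstddef>
#include <vector>

// Run "passes" until max_passes consecutive passes make no α change.
// Returns how many passes were executed before termination.
std::size_t run_until_converged(const std::vector<int>& changes_per_pass,
                                std::size_t max_passes) {
    std::size_t unchanged = 0, total = 0;
    for (int changed : changes_per_pass) {
        ++total;
        if (changed == 0) ++unchanged;  // another quiet pass
        else unchanged = 0;             // any change resets the counter
        if (unchanged >= max_passes) break;
    }
    return total;
}
```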

Kernel caching

SMO heavily reuses kernel evaluations. The KernelCache precomputes the Gram matrix:
svm.fit() {
    kernel_cache_.precompute();  // Compute all K(xᵢ, xⱼ)
    // ... SMO iterations
}
This reduces kernel evaluations from O(n²·iterations) to O(n²), trading memory for speed. Memory usage:
  • Gram matrix: O(n²) doubles
  • Typically feasible for n < 10,000
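A Gram-matrix precompute for the RBF kernel can be sketched as follows; the gram_matrix helper and hard-coded RBF form are illustrative, not the KernelCache API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Precompute K[i][j] = exp(-γ ‖xᵢ - xⱼ‖²), exploiting symmetry so each
// kernel value is evaluated once and stored in both halves.
std::vector<std::vector<double>> gram_matrix(
        const std::vector<std::vector<double>>& X, double gamma) {
    std::size_t n = X.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < n; ++j) {
            double d2 = 0.0;
            for (std::size_t f = 0; f < X[i].size(); ++f)
                d2 += (X[i][f] - X[j][f]) * (X[i][f] - X[j][f]);
            K[i][j] = K[j][i] = std::exp(-gamma * d2);
        }
    return K;
}
```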

Hyperparameter tuning

Linear regression

Regularization strength (λ):
for (double lambda : {0.001, 0.01, 0.1, 1.0, 10.0}) {
    LinearRegression<double> model(true, lambda);
    model.fit(X_train, y_train);
    double score = model.score(X_val, y_val);
    // Select lambda with best validation score
}
Rule of thumb:
  • Start with λ = 0 (pure OLS)
  • Increase if overfitting (train >> test performance)
  • Decrease if underfitting (both train and test are poor)

SVM

C parameter: Controls the trade-off between margin width and classification errors:
  • Small C: Wide margin, more training errors (high bias)
  • Large C: Narrow margin, fewer training errors (high variance)
Kernel parameters:
  • RBF gamma: Grid-search over a log scale
  • Polynomial degree: Try small values first
std::vector<double> C_values = {0.1, 1.0, 10.0, 100.0};
std::vector<double> gamma_values = {0.001, 0.01, 0.1, 1.0};

double best_score = 0.0;
for (double C : C_values) {
    for (double gamma : gamma_values) {
        SVM svm(
            X_train, y_train,
            KernelFunction(std::make_unique<RBFKernel>(gamma)),
            C
        );
        svm.fit();
        double score = evaluate(svm, X_val, y_val);
        if (score > best_score) {
            best_score = score;
            // Save best parameters
        }
    }
}
Always use held-out validation data or cross-validation for hyperparameter selection. Evaluating on training data leads to overfitting.
