MLPP employs problem-specific optimization strategies tailored to each algorithm’s mathematical structure. This guide covers the automatic solver selection in linear regression and the Sequential Minimal Optimization (SMO) algorithm for SVM training.

Linear regression solvers

The LinearRegression class automatically selects an optimal solver based on problem geometry and numerical conditioning. The regularized least squares problem is:
min_w  (1/2n) ‖Xw - y‖² + (λ/2) ‖w‖²

Solver selection strategy

MLPP provides four solver methods:
enum class SolveMethod {
    Auto,       // Automatic selection (recommended)
    Cholesky,   // Normal equations via LDLT
    SVD,        // Thin BDCSVD decomposition
    JacobiSVD   // Full-pivoting JacobiSVD
};
When method = SolveMethod::Auto, the solver is chosen based on:
  1. Problem dimensions
    • n >> d → Cholesky (normal equations)
    • d >> n → SVD (avoid forming X^T X)
  2. Numerical conditioning
    • Well-conditioned → Cholesky or SVD
    • Ill-conditioned → JacobiSVD fallback
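The decision rules above can be sketched as a small selection function. The specific thresholds (aspect ratio 2, condition number 10^6) are illustrative assumptions, not MLPP's actual cutoffs:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch of the Auto selection rules; thresholds are assumed.
std::string choose_solver(std::size_t n, std::size_t d, double cond_estimate) {
    if (cond_estimate > 1e6) return "JacobiSVD";  // ill-conditioned fallback
    if (n >= 2 * d) return "Cholesky";            // n >> d: normal equations
    return "SVD";                                 // d >> n: avoid forming X^T X
}
```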
#include <MLPP/Supervised Learning/Regression/linear_regression.hpp>

using namespace mlpp::regression;

LinearRegression<double> model(
    true,                      // fit_intercept
    0.1,                       // regularization lambda
    SolveMethod::Auto          // automatic solver selection
);

model.fit(X_train, y_train);

Cholesky solver

Solves the normal equations using LDLT decomposition:
(X^T X + λI) w = X^T y
Complexity:
  • Matrix formation: O(nd²)
  • Cholesky decomposition: O(d³)
  • Total: O(nd² + d³)
When to use:
  • n >> d (many samples, few features)
  • Problem is well-conditioned (condition number < 10^6)
  • Need maximum speed for small d
Advantages:
  • Fastest method when applicable
  • Direct closed-form solution
Disadvantages:
  • Requires forming X^T X (can lose precision)
  • Fails for ill-conditioned problems
  • Memory: O(d²)
LinearRegression<double> model(
    true,                      // fit_intercept
    1e-3,                      // regularization
    SolveMethod::Cholesky      // explicit Cholesky
);
Regularization (λ > 0) improves conditioning by adding λI to the diagonal of X^T X.
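For d = 2 the normal equations can be solved in closed form, which makes the method easy to see end to end. MLPP factors the system with LDLT; the Cramer's-rule solve below is purely illustrative:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Form A = X^T X + λI and b = X^T y, then solve the 2x2 system A w = b.
std::array<double, 2> solve_normal_eq_2d(const std::vector<std::array<double, 2>>& X,
                                         const std::vector<double>& y,
                                         double lambda) {
    double a00 = lambda, a01 = 0.0, a11 = lambda;  // A starts as λI
    double b0 = 0.0, b1 = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        a00 += X[i][0] * X[i][0];
        a01 += X[i][0] * X[i][1];
        a11 += X[i][1] * X[i][1];
        b0  += X[i][0] * y[i];
        b1  += X[i][1] * y[i];
    }
    double det = a00 * a11 - a01 * a01;  // Cramer's rule (fine for d = 2)
    return {(b0 * a11 - b1 * a01) / det, (a00 * b1 - a01 * b0) / det};
}
```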

SVD solver

Uses Bidiagonal Divide-and-Conquer SVD (BDCSVD) for the thin decomposition:
X = U Σ V^T
w = V diag(σᵢ/(σᵢ² + λ)) U^T y
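The per-component factor σᵢ/(σᵢ² + λ) acts as a spectral filter, and a one-line sketch (illustrative) makes its behavior at the extremes easy to check: with λ = 0 it reduces to the pseudo-inverse factor 1/σᵢ, while λ > 0 damps small singular values toward zero instead of amplifying them.

```cpp
// Spectral filter applied to each singular value in the ridge solution.
double ridge_filter(double sigma, double lambda) {
    return sigma / (sigma * sigma + lambda);
}
```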
Complexity:
  • O(nd²) when n > d
  • O(n²d) when d > n
When to use:
  • d >> n (more features than samples)
  • Moderate ill-conditioning
  • Need to inspect singular values
Advantages:
  • Never forms X^T X (numerically stable)
  • Handles rank-deficient problems
  • Provides condition number diagnostics
Disadvantages:
  • Slower than Cholesky for n >> d
  • Higher memory for intermediate results
LinearRegression<double> model(
    true,
    0.01,
    SolveMethod::SVD
);

model.fit(X, y);

// Check numerical conditioning
double cond = model.condition_number();
if (cond > 1e8) {
    std::cerr << "Warning: Problem is ill-conditioned\n";
}

JacobiSVD solver

Full-pivoting Jacobi SVD provides maximum numerical stability.
Complexity:
  • O(n²d) or O(nd²), depending on shape
  • Significantly slower than BDCSVD
When to use:
  • Severely ill-conditioned problems (condition number > 10^10)
  • Require maximum precision
  • Small to moderate problem sizes
Advantages:
  • Most numerically stable method
  • Guaranteed convergence
Disadvantages:
  • Slowest solver
  • Only necessary for pathological cases
LinearRegression<double> model(
    true,
    0.0,                       // No regularization
    SolveMethod::JacobiSVD     // Maximum stability
);
JacobiSVD should only be used when other methods fail due to numerical issues. For most problems, Auto selection is sufficient.

Data preprocessing

MLPP automatically standardizes features before solving:

Feature standardization

X_scaled = (X - μ) / σ
Where:
  • μ: Feature-wise mean
  • σ: Feature-wise standard deviation
Effects:
  • Improves numerical conditioning
  • Makes regularization fair across features
  • Accelerates convergence
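A minimal sketch of the per-column transform (MLPP applies this internally; the standalone helper below is illustrative, using the population standard deviation):

```cpp
#include <cmath>
#include <vector>

// Standardize one feature column: subtract the mean, divide by the std dev.
std::vector<double> standardize(std::vector<double> col) {
    double mu = 0.0;
    for (double v : col) mu += v;
    mu /= static_cast<double>(col.size());
    double var = 0.0;
    for (double v : col) var += (v - mu) * (v - mu);
    const double sigma = std::sqrt(var / static_cast<double>(col.size()));
    for (double& v : col) v = (v - mu) / sigma;
    return col;
}
```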

Centering for intercept

When fit_intercept = true:
  1. Center X and y by subtracting means
  2. Solve for weights without bias term
  3. Compute intercept from means
Advantages:
  • Intercept is never regularized
  • Reduces dimensionality of optimization
  • Improves numerical stability

Coefficient transformation

Solution is transformed back to original feature space:
w_original = w_scaled / σ
b_original = y_mean - w_original^T μ
This ensures predictions work on raw, unscaled data:
model.fit(X_train, y_train);  // Internal scaling
auto y_pred = model.predict(X_test);  // No scaling needed

SVM optimization

MLPP uses Sequential Minimal Optimization (SMO) for training kernel SVMs. SMO decomposes the dual quadratic program into a series of smallest possible subproblems.

Dual formulation

The SVM dual problem is:
maximize   W(α) = Σᵢ αᵢ - (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)

subject to:
    0 ≤ αᵢ ≤ C        ∀i
    Σᵢ αᵢ yᵢ = 0
Where:
  • α: Dual variables (Lagrange multipliers)
  • C: Soft margin penalty parameter
  • K: Kernel function
  • y: Binary labels in {-1, +1}

SMO algorithm

SMO iteratively optimizes pairs of dual variables:
  1. Select working set: Choose two indices (i, j) that violate KKT conditions
  2. Solve 2D subproblem: Analytically optimize αᵢ and αⱼ
  3. Update bias: Recompute threshold b
  4. Cache errors: Update error cache for next iteration
  5. Repeat: Until KKT conditions satisfied within tolerance
using namespace mlpp::classifiers::kernel;

SVM svm(
    X_train,
    y_train,
    KernelFunction(std::make_unique<RBFKernel>(0.5)),
    1.0  // C parameter
);

svm.fit();  // Runs SMO algorithm

KKT conditions

Optimality is characterized by Karush-Kuhn-Tucker conditions:
For each sample i:
    αᵢ = 0      ⟹  yᵢ f(xᵢ) ≥ 1
    0 < αᵢ < C  ⟹  yᵢ f(xᵢ) = 1    (support vector)
    αᵢ = C      ⟹  yᵢ f(xᵢ) ≤ 1
SMO selects pairs that maximally violate these conditions.

Working set selection

MLPP uses a heuristic for choosing (i, j):
  1. First variable (i): Loop over all training examples violating KKT within tolerance
  2. Second variable (j): Choose j that maximizes |Eᵢ - Eⱼ| where E is the error cache
const bool violates_kkt =
    (yᵢ * Eᵢ < -tol && αᵢ < C) ||  // Below margin, can increase αᵢ
    (yᵢ * Eᵢ >  tol && αᵢ > 0);    // Above margin, can decrease αᵢ
This heuristic maximizes progress per iteration.

Analytical optimization

For the selected pair (i, j), SMO solves:
η = 2K(xᵢ, xⱼ) - K(xᵢ, xᵢ) - K(xⱼ, xⱼ)

αⱼ_new = αⱼ_old - (yⱼ(Eᵢ - Eⱼ)) / η

// Clip to feasible region [L, H]
αⱼ_new = clip(αⱼ_new, L, H)

// Update αᵢ to maintain constraint Σ αᵢ yᵢ = 0
αᵢ_new = αᵢ_old + yᵢ yⱼ (αⱼ_old - αⱼ_new)
The feasible region [L, H] depends on label agreement:
if (yᵢ ≠ yⱼ) {
    L = max(0, αⱼ - αᵢ)
    H = min(C, C + αⱼ - αᵢ)
} else {
    L = max(0, αᵢ + αⱼ - C)
    H = min(C, αᵢ + αⱼ)
}
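Put together, the clipped pair update can be sketched as a single function. Names mirror the derivation above; this is an illustrative sketch, not MLPP's internal API:

```cpp
#include <algorithm>

struct PairUpdate { double ai; double aj; };

// Analytic SMO pair update: compute the feasible box [L, H], take the
// unconstrained step along αⱼ, clip, then restore Σ αᵢ yᵢ = 0 via αᵢ.
PairUpdate smo_pair_update(double ai, double aj, double yi, double yj,
                           double Ei, double Ej,
                           double Kii, double Kjj, double Kij, double C) {
    double L, H;
    if (yi != yj) { L = std::max(0.0, aj - ai);     H = std::min(C, C + aj - ai); }
    else          { L = std::max(0.0, ai + aj - C); H = std::min(C, ai + aj); }
    double eta = 2.0 * Kij - Kii - Kjj;            // ≤ 0 for a valid kernel
    double aj_new = aj - yj * (Ei - Ej) / eta;
    aj_new = std::min(H, std::max(L, aj_new));     // clip to [L, H]
    double ai_new = ai + yi * yj * (aj - aj_new);  // maintain the equality constraint
    return {ai_new, aj_new};
}
```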

Bias computation

After updating α, the bias term is recomputed using support vectors:
b = (1/|S|) Σᵢ∈S (yᵢ - Σⱼ αⱼ yⱼ K(xⱼ, xᵢ))
Where S = {i : 0 < αᵢ < C} is the set of support vectors on the margin.
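The average over S can be sketched with a precomputed Gram matrix K (illustrative helper, not MLPP's internal routine):

```cpp
#include <cstddef>
#include <vector>

// Average yᵢ - f₀(xᵢ) over on-margin support vectors, where f₀ is the
// decision value without bias: f₀(xᵢ) = Σⱼ αⱼ yⱼ K(xⱼ, xᵢ).
double compute_bias(const std::vector<double>& alpha,
                    const std::vector<double>& y,
                    const std::vector<std::vector<double>>& K,
                    double C) {
    double sum = 0.0;
    int count = 0;
    for (std::size_t i = 0; i < alpha.size(); ++i) {
        if (alpha[i] <= 0.0 || alpha[i] >= C) continue;  // keep 0 < αᵢ < C only
        double f = 0.0;
        for (std::size_t j = 0; j < alpha.size(); ++j)
            f += alpha[j] * y[j] * K[j][i];
        sum += y[i] - f;
        ++count;
    }
    return count > 0 ? sum / count : 0.0;
}
```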

Error cache

MLPP maintains a cached error vector to avoid recomputing decision values:
Eᵢ = f(xᵢ) - yᵢ
   = Σⱼ αⱼ yⱼ K(xⱼ, xᵢ) + b - yᵢ
After updating αᵢ and αⱼ, errors are efficiently updated:
for k in 1..n:
    Eₖ += (αᵢ_new - αᵢ_old) yᵢ K(xᵢ, xₖ) +
          (αⱼ_new - αⱼ_old) yⱼ K(xⱼ, xₖ) +
          (b_new - b_old)
This incremental update is O(n) per SMO iteration.
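The incremental refresh above translates directly to a loop over the cache (sketch; the delta arguments are the changes in αᵢ, αⱼ, and b from the pair update):

```cpp
#include <cstddef>
#include <vector>

// O(n) error-cache refresh after updating the pair (i, j) and the bias.
void update_errors(std::vector<double>& E,
                   const std::vector<std::vector<double>>& K,
                   const std::vector<double>& y,
                   std::size_t i, std::size_t j,
                   double d_ai, double d_aj, double d_b) {
    for (std::size_t k = 0; k < E.size(); ++k)
        E[k] += d_ai * y[i] * K[i][k] + d_aj * y[j] * K[j][k] + d_b;
}
```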

Convergence criteria

SMO terminates when no changes occur for multiple consecutive passes:
constexpr double tol = 1e-3;           // KKT tolerance
constexpr std::size_t max_passes = 10; // Stopping criterion
Termination:
  • If no αᵢ changes during a full pass, increment pass counter
  • If pass counter reaches max_passes, terminate
  • If any αᵢ changes, reset pass counter to zero
Increasing max_passes improves solution quality but increases training time. The default (10 passes) provides a good balance.
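The pass-counter logic can be isolated and checked on its own. In the sketch below, changes_per_pass stands in for the number of α updates a real SMO pass would make (an assumed stand-in, purely for illustration):

```cpp
#include <cstddef>
#include <vector>

// Run "passes" until max_passes consecutive passes make no α change.
// Returns how many passes were executed before termination.
std::size_t run_until_converged(const std::vector<int>& changes_per_pass,
                                std::size_t max_passes) {
    std::size_t unchanged = 0, total = 0;
    for (int changed : changes_per_pass) {
        ++total;
        if (changed == 0) ++unchanged;  // another quiet pass
        else unchanged = 0;             // any change resets the counter
        if (unchanged >= max_passes) break;
    }
    return total;
}
```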

Kernel caching

SMO heavily reuses kernel evaluations. The KernelCache precomputes the Gram matrix:
svm.fit() {
    kernel_cache_.precompute();  // Compute all K(xᵢ, xⱼ)
    // ... SMO iterations
}
This reduces kernel evaluations from O(n²·iterations) to O(n²), trading memory for speed. Memory usage:
  • Gram matrix: O(n²) doubles
  • Typically feasible for n < 10,000
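A Gram-matrix precompute for the RBF kernel can be sketched as follows; the gram_matrix helper and hard-coded RBF form are illustrative, not the KernelCache API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Precompute K[i][j] = exp(-γ ‖xᵢ - xⱼ‖²), exploiting symmetry so each
// kernel value is evaluated once and stored in both halves.
std::vector<std::vector<double>> gram_matrix(
        const std::vector<std::vector<double>>& X, double gamma) {
    std::size_t n = X.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < n; ++j) {
            double d2 = 0.0;
            for (std::size_t f = 0; f < X[i].size(); ++f)
                d2 += (X[i][f] - X[j][f]) * (X[i][f] - X[j][f]);
            K[i][j] = K[j][i] = std::exp(-gamma * d2);
        }
    return K;
}
```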

Hyperparameter tuning

Linear regression

Regularization strength (λ):
for (double lambda : {0.001, 0.01, 0.1, 1.0, 10.0}) {
    LinearRegression<double> model(true, lambda);
    model.fit(X_train, y_train);
    double score = model.score(X_val, y_val);
    // Select lambda with best validation score
}
Rule of thumb:
  • Start with λ = 0 (pure OLS)
  • Increase if overfitting (train >> test performance)
  • Decrease if underfitting (both train and test are poor)

SVM

C parameter: Controls the trade-off between margin width and classification errors:
  • Small C: Wide margin, more training errors (high bias)
  • Large C: Narrow margin, fewer training errors (high variance)
Kernel parameters:
  • RBF gamma: Grid-search over a log scale
  • Polynomial degree: Try small values first
std::vector<double> C_values = {0.1, 1.0, 10.0, 100.0};
std::vector<double> gamma_values = {0.001, 0.01, 0.1, 1.0};

double best_score = 0.0;
for (double C : C_values) {
    for (double gamma : gamma_values) {
        SVM svm(
            X_train, y_train,
            KernelFunction(std::make_unique<RBFKernel>(gamma)),
            C
        );
        svm.fit();
        double score = evaluate(svm, X_val, y_val);
        if (score > best_score) {
            best_score = score;
            // Save best parameters
        }
    }
}
Always use held-out validation data or cross-validation for hyperparameter selection. Evaluating on training data leads to overfitting.
