Linear regression solvers
The LinearRegression class automatically selects an optimal solver based on problem geometry and numerical conditioning. The regularized least squares problem is:

minimize over w:  ‖Xw − y‖² + λ‖w‖²
Solver selection strategy
MLPP provides four solver methods:

Automatic selection (recommended)

When method = SolveMethod::Auto, the solver is chosen based on:
- Problem dimensions:
  - n >> d → Cholesky (normal equations)
  - d >> n → SVD (avoid forming X^T X)
- Numerical conditioning:
  - Well-conditioned → Cholesky or SVD
  - Ill-conditioned → JacobiSVD fallback
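The selection rule above can be sketched as follows; the threshold values and the `pickSolver` helper are illustrative, not MLPP's exact internals:

```cpp
#include <cstddef>

// Hypothetical sketch of the Auto selection rule described above.
enum class SolveMethod { Cholesky, SVD, JacobiSVD };

SolveMethod pickSolver(std::size_t n, std::size_t d, double conditionEstimate) {
    if (conditionEstimate > 1e10) return SolveMethod::JacobiSVD; // pathological case
    if (n >= d) return SolveMethod::Cholesky;                    // tall problem
    return SolveMethod::SVD;                                     // wide problem
}
```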
Cholesky solver
Solves the normal equations (X^T X + λI) w = X^T y using an LDLT decomposition.

Complexity:
- Matrix formation: O(nd²)
- Cholesky decomposition: O(d³)
- Total: O(nd² + d³)

Use when:
- n >> d (many samples, few features)
- Problem is well-conditioned (condition number < 10^6)
- Maximum speed is needed for small d

Advantages:
- Fastest method when applicable
- Direct closed-form solution

Limitations:
- Requires forming X^T X (can lose precision)
- Fails for ill-conditioned problems
- Memory: O(d²)
Regularization (λ > 0) improves conditioning by adding λI to the diagonal of X^T X.
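A minimal hand-rolled sketch of this path, for illustration only: MLPP relies on Eigen's LDLT, and `solveNormalEquations` is a hypothetical name. It forms A = X^T X + λI and b = X^T y, then solves A w = b via a plain Cholesky factorization:

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

std::vector<double> solveNormalEquations(const std::vector<std::vector<double>>& X,
                                         const std::vector<double>& y,
                                         double lambda) {
    std::size_t n = X.size(), d = X[0].size();
    std::vector<std::vector<double>> A(d, std::vector<double>(d, 0.0));
    std::vector<double> b(d, 0.0), w(d, 0.0);
    // Form A = X^T X and b = X^T y.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < d; ++j) {
            b[j] += X[i][j] * y[i];
            for (std::size_t k = 0; k < d; ++k) A[j][k] += X[i][j] * X[i][k];
        }
    for (std::size_t j = 0; j < d; ++j) A[j][j] += lambda;  // ridge term λI
    // Cholesky factorization A = L L^T.
    std::vector<std::vector<double>> L(d, std::vector<double>(d, 0.0));
    for (std::size_t j = 0; j < d; ++j) {
        double s = A[j][j];
        for (std::size_t k = 0; k < j; ++k) s -= L[j][k] * L[j][k];
        L[j][j] = std::sqrt(s);
        for (std::size_t i = j + 1; i < d; ++i) {
            double t = A[i][j];
            for (std::size_t k = 0; k < j; ++k) t -= L[i][k] * L[j][k];
            L[i][j] = t / L[j][j];
        }
    }
    // Forward substitution: L z = b.
    std::vector<double> z(d, 0.0);
    for (std::size_t i = 0; i < d; ++i) {
        double t = b[i];
        for (std::size_t k = 0; k < i; ++k) t -= L[i][k] * z[k];
        z[i] = t / L[i][i];
    }
    // Back substitution: L^T w = z.
    for (std::size_t i = d; i-- > 0;) {
        double t = z[i];
        for (std::size_t k = i + 1; k < d; ++k) t -= L[k][i] * w[k];
        w[i] = t / L[i][i];
    }
    return w;
}
```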
SVD solver
Uses Bidiagonal Divide-and-Conquer SVD (BDCSVD) for the thin decomposition X = UΣV^T.

Complexity:
- O(nd²) when n > d
- O(n²d) when d > n

Use when:
- d >> n (more features than samples)
- Moderate ill-conditioning
- Need to inspect singular values

Advantages:
- Never forms X^T X (numerically stable)
- Handles rank-deficient problems
- Provides condition number diagnostics

Limitations:
- Slower than Cholesky for n >> d
- Higher memory for intermediate results
JacobiSVD solver
Full-pivoting Jacobi SVD provides maximum numerical stability.

Complexity:
- O(n²d) or O(nd²), depending on shape
- Significantly slower than BDCSVD

Use when:
- Severely ill-conditioned problems (condition number > 10^10)
- Maximum precision is required
- Small to moderate problem sizes

Advantages:
- Most numerically stable method
- Guaranteed convergence

Limitations:
- Slowest solver
- Only necessary for pathological cases
Data preprocessing
MLPP automatically standardizes features before solving.

Feature standardization

Each feature is rescaled as x' = (x − μ) / σ, where:
- μ: Feature-wise mean
- σ: Feature-wise standard deviation

Benefits:
- Improves numerical conditioning
- Makes regularization fair across features
- Accelerates convergence
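The transform can be sketched as follows; this assumes a population standard deviation, and MLPP's exact convention may differ:

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cstddef>

// Column-wise standardization x' = (x - mu) / sigma, in place.
void standardize(std::vector<std::vector<double>>& X,
                 std::vector<double>& mu, std::vector<double>& sigma) {
    std::size_t n = X.size(), d = X[0].size();
    mu.assign(d, 0.0);
    sigma.assign(d, 0.0);
    for (const auto& row : X)
        for (std::size_t j = 0; j < d; ++j) mu[j] += row[j] / n;
    for (const auto& row : X)
        for (std::size_t j = 0; j < d; ++j)
            sigma[j] += (row[j] - mu[j]) * (row[j] - mu[j]) / n;
    for (std::size_t j = 0; j < d; ++j)
        sigma[j] = std::max(std::sqrt(sigma[j]), 1e-12);  // guard constant features
    for (auto& row : X)
        for (std::size_t j = 0; j < d; ++j) row[j] = (row[j] - mu[j]) / sigma[j];
}
```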
Centering for intercept
When fit_intercept = true:
- Center X and y by subtracting means
- Solve for weights without bias term
- Compute intercept from means
- Intercept is never regularized
- Reduces dimensionality of optimization
- Improves numerical stability
Coefficient transformation
The solution is transformed back to the original feature space: each weight is rescaled as wⱼ = w'ⱼ / σⱼ, and the intercept is b = ȳ − Σⱼ wⱼ μⱼ.

SVM optimization
MLPP uses Sequential Minimal Optimization (SMO) for training kernel SVMs. SMO decomposes the dual quadratic program into a series of smallest-possible subproblems.

Dual formulation
The SVM dual problem is:

maximize over α:  Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
subject to:  0 ≤ αᵢ ≤ C  and  Σᵢ αᵢ yᵢ = 0

- α: Dual variables (Lagrange multipliers)
- C: Soft margin penalty parameter
- K: Kernel function
- y: Binary labels in {−1, +1}
SMO algorithm
SMO iteratively optimizes pairs of dual variables:
- Select working set: Choose two indices (i, j) that violate KKT conditions
- Solve 2D subproblem: Analytically optimize αᵢ and αⱼ
- Update bias: Recompute threshold b
- Cache errors: Update error cache for next iteration
- Repeat: Until KKT conditions satisfied within tolerance
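The 2D subproblem step above reduces to a single analytic update of αⱼ (with αᵢ then adjusted to preserve the equality constraint). A simplified sketch, with `updateAlphaJ` as a hypothetical helper and the feasible bounds L, H assumed precomputed by the caller:

```cpp
#include <algorithm>

// Analytic SMO update for alpha_j, clipped to the feasible segment [L, H].
double updateAlphaJ(double alphaJ, double yJ, double Ei, double Ej,
                    double Kii, double Kjj, double Kij, double L, double H) {
    double eta = Kii + Kjj - 2.0 * Kij;   // curvature of the 2D subproblem
    if (eta <= 0.0) return alphaJ;        // degenerate pair: skip (simplified)
    double aj = alphaJ + yJ * (Ei - Ej) / eta;
    return std::min(H, std::max(L, aj));  // clip to box constraints
}
```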
KKT conditions
Optimality is characterized by the Karush-Kuhn-Tucker conditions:
- αᵢ = 0 ⇒ yᵢ f(xᵢ) ≥ 1
- 0 < αᵢ < C ⇒ yᵢ f(xᵢ) = 1
- αᵢ = C ⇒ yᵢ f(xᵢ) ≤ 1

Working set selection
MLPP uses a heuristic for choosing (i, j):
- First variable (i): Loop over all training examples violating KKT within tolerance
- Second variable (j): Choose j that maximizes |Eᵢ - Eⱼ| where E is the error cache
Analytical optimization
For the selected pair (i, j), SMO solves the subproblem in closed form: with η = K(xᵢ, xᵢ) + K(xⱼ, xⱼ) − 2K(xᵢ, xⱼ), the update is αⱼ ← αⱼ + yⱼ(Eᵢ − Eⱼ)/η, clipped to the box constraints, and αᵢ is adjusted to keep Σᵢ αᵢ yᵢ = 0.

Bias computation
After updating α, the bias term is recomputed using support vectors: b is taken as the average of yᵢ − Σⱼ αⱼ yⱼ K(xⱼ, xᵢ) over points with 0 < αᵢ < C.

Error cache
MLPP maintains a cached error vector Eᵢ = f(xᵢ) − yᵢ to avoid recomputing decision values.

Convergence criteria
SMO terminates when no changes occur for multiple consecutive passes:
- If no αᵢ changes during a full pass, increment the pass counter
- If the pass counter reaches max_passes, terminate
- If any αᵢ changes, reset the pass counter to zero
Increasing max_passes improves solution quality but increases training time. The default (10 passes) provides a good balance.
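The pass-counting rule can be sketched as a loop skeleton; `trainLoopSketch` is illustrative and the actual per-pass optimization body is omitted:

```cpp
// Skeleton of the termination rule: count consecutive passes with no alpha
// updates, and stop once the counter reaches max_passes.
int trainLoopSketch(int max_passes) {
    int passes = 0, iterations = 0;
    while (passes < max_passes) {
        int changed = 0;
        // ... one full pass over training examples, incrementing `changed`
        // whenever an alpha pair is updated (omitted in this sketch) ...
        passes = (changed == 0) ? passes + 1 : 0;
        ++iterations;
    }
    return iterations;
}
```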
Kernel caching
SMO heavily reuses kernel evaluations. The KernelCache class precomputes the Gram matrix:
- Gram matrix: O(n²) doubles
- Typically feasible for n < 10,000
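Precomputing the Gram matrix can be sketched as below; this assumes an RBF kernel, and `gramRBF` is an illustrative stand-in for KernelCache's real interface:

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Precompute the n x n Gram matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2).
std::vector<std::vector<double>> gramRBF(const std::vector<std::vector<double>>& X,
                                         double gamma) {
    std::size_t n = X.size();
    std::vector<std::vector<double>> K(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < n; ++j) {   // symmetric: fill both halves
            double d2 = 0.0;
            for (std::size_t f = 0; f < X[i].size(); ++f) {
                double diff = X[i][f] - X[j][f];
                d2 += diff * diff;
            }
            K[i][j] = K[j][i] = std::exp(-gamma * d2);
        }
    return K;
}
```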
Hyperparameter tuning
Linear regression
Regularization strength (λ):
- Start with λ = 0 (pure OLS)
- Increase if overfitting (train >> test performance)
- Decrease if underfitting (both train and test are poor)
SVM
C parameter: Controls the trade-off between margin width and classification errors:
- Small C: Wide margin, more training errors (high bias)
- Large C: Narrow margin, fewer training errors (high variance)
- RBF gamma: Use grid search over a log scale
- Polynomial degree: Try small values first
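The log-scale grid suggestion can be sketched as follows; the range 10⁻⁴ to 10² and the degree list {2, 3, 4} are illustrative choices, not MLPP defaults:

```cpp
#include <vector>
#include <cmath>

// Generate `steps` values spaced evenly on a log10 scale between lo and hi.
std::vector<double> logGrid(double lo, double hi, int steps) {
    std::vector<double> g;
    double logLo = std::log10(lo), logHi = std::log10(hi);
    for (int i = 0; i < steps; ++i)
        g.push_back(std::pow(10.0, logLo + (logHi - logLo) * i / (steps - 1)));
    return g;
}
// Usage: auto gammas = logGrid(1e-4, 1e2, 7);  int degrees[] = {2, 3, 4};
```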