MLPP uses Eigen as its primary linear algebra backend, providing efficient matrix operations and numerical stability. The library defines consistent type aliases and data structures across all algorithms.

Eigen integration

When Eigen 3.4+ is detected, MLPP automatically integrates it for linear algebra operations:
if(MLPP_USE_EIGEN)
    find_package(Eigen3 3.4 QUIET NO_MODULE)
    if(Eigen3_FOUND)
        target_link_libraries(mlpp INTERFACE Eigen3::Eigen)
        target_compile_definitions(mlpp INTERFACE MLPP_HAS_EIGEN)
    endif()
endif()
The MLPP_HAS_EIGEN definition enables Eigen-based implementations throughout the codebase.

Matrix and vector types

MLPP algorithms use Eigen’s templated matrix and vector types with consistent aliases:
Learning/Regression/linear_regression.hpp
template <typename Scalar = double>
class LinearRegression {
public:
    using Matrix = Eigen::Matrix<Scalar, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
    using Vector = Eigen::Matrix<Scalar, Eigen::Dynamic, 1>;
    using Index  = Eigen::Index;
    
    // ...
};

Type characteristics

Dynamic sizing

Matrices and vectors use Eigen::Dynamic for runtime-sized containers, enabling flexible data dimensions.

Row-major storage

Matrices use row-major layout (overriding Eigen's column-major default), optimizing for the row-wise iteration patterns common in ML algorithms.

Template parameters

The scalar type is a template parameter (defaulting to double), letting users trade numerical precision against speed and memory.

Eigen interop

Types are direct Eigen instantiations, ensuring zero-cost interoperability with Eigen-based code.

Common type patterns

MLPP establishes consistent naming conventions for data structures:

Feature matrices

Feature matrices are typically named X with shape (n_samples, n_features) in row-major format:
void fit(const Matrix& X, const Vector& y);
Each row represents a single sample, and each column represents a feature dimension.

Target vectors

Target values are column vectors named y with length n_samples:
Vector predict(const Matrix& X) const;

Coefficient vectors

Learned parameters are stored as column vectors whose length equals the number of features:
Vector coef_;        // Coefficient vector in original feature space
Scalar intercept_;   // Bias term

Dataset abstractions

For structured learning tasks, MLPP provides high-level dataset abstractions:
Learning/Clustering/clustering_dataset.hpp
namespace mlpp::unsupervised::clustering {

template<Numeric T>
class Record {
public:
    explicit Record(std::shared_ptr<Schema<T>> schema_ptr);
    
    AttrValue<T>& labelValue();
    const AttrValue<T>& labelValue() const;
    
    std::size_t get_id() const;
    std::size_t get_label() const;
    
private:
    std::shared_ptr<Schema<T>> schema_;
    AttrValue<T> label_;
    AttrValue<T> id_;
    std::vector<AttrValue<T>> features_;
};

template<Numeric T>
class Dataset {
public:
    explicit Dataset(std::shared_ptr<Schema<T>> schema_ptr);
    
    std::size_t num_attr() const;
    const std::shared_ptr<Schema<T>>& schema() const;
    
    AttrValue<T>& operator()(std::size_t i, std::size_t j);
    const AttrValue<T>& operator()(std::size_t i, std::size_t j) const;
    
    bool is_numeric() const;
    bool is_categorical() const;
    
private:
    std::shared_ptr<Schema<T>> schema_;
    std::vector<std::shared_ptr<Record<T>>> records_;
};

}

Schema-based organization

The Dataset class provides:
  • Type safety: Schema defines attribute types and constraints
  • Labeled data: Built-in support for supervised learning with labels
  • Record abstraction: Individual samples with typed feature access
  • Flexibility: Supports both numeric and categorical attributes
The schema-based approach is particularly useful for clustering and classification tasks where feature types and metadata matter.

Numeric concepts

MLPP uses C++20 concepts to constrain template parameters:
template<Numeric T>
class Record { /* ... */ };
The Numeric concept ensures type parameters support arithmetic operations required for machine learning computations.

Memory management

MLPP follows modern C++ memory management practices:
  • Value semantics: Algorithms store data members by value when possible
  • Smart pointers: Shared ownership uses std::shared_ptr
  • Const correctness: Read-only operations marked const throughout
  • Move semantics: Large objects support efficient moves

Example: Linear regression storage

Learning/Regression/linear_regression.hpp
private:
    // Hyper-parameters
    bool        fit_intercept_;
    Scalar      lambda_;
    SolveMethod method_;
    
    // Learned parameters (original feature space)
    Vector  coef_;           // Coefficient vector
    Scalar  intercept_{};    // Bias term
    
    Vector  feature_mean_;   // For standardization
    Vector  feature_std_;
    Scalar  target_mean_{};
    
    bool   fitted_ = false;
    Scalar cond_number_ = Scalar(-1);

Performance considerations

Row-major matrices optimize cache locality for row-wise iteration, but column operations may be slower. Choose storage order based on your dominant access pattern.
Eigen provides:
  • Vectorization: SIMD optimizations for supported architectures
  • Lazy evaluation: Expression templates minimize temporary allocations
  • Block operations: Efficient sub-matrix views without copying

Integration with external data

MLPP’s Eigen-based types integrate seamlessly with:
  • NumPy arrays (via Python bindings)
  • OpenCV matrices (when MLPP_HAS_OPENCV is defined)
  • Raw C++ arrays (via Eigen::Map)
  • Standard library containers
Example mapping from raw pointer:
double* raw_data = /* ... */;
Eigen::Map<Matrix> X(raw_data, n_samples, n_features);
