Clustering algorithms partition data so that similar items are grouped into the same cluster. MLPP provides template-based cluster classes and efficient spatial indexing with KD-Trees.

Cluster classes

MLPP provides a hierarchy of cluster classes in the mlpp::unsupervised::clustering namespace for representing grouped data points.

Cluster

Base cluster class for grouping data records.
#include "Unsupervised Learning/Clustering/cluster.hpp"

template<typename DatasetT>
class Cluster {
public:
    using dataset_type = DatasetT;
    using record_type = typename DatasetT::record_type;

    void set_id(std::size_t id) noexcept;
    std::size_t id() const noexcept;

protected:
    std::size_t id_ = 0;
    std::vector<std::shared_ptr<record_type>> elements_;
};
Template parameters:
  • DatasetT - Dataset type containing records to cluster
Methods:
  • set_id(id) - Assign cluster identifier
  • id() - Get cluster identifier

CenterCluster

Cluster with a centroid representing the cluster center.
template<typename DatasetT>
class CenterCluster : public Cluster<DatasetT> {
public:
    using record_type = typename DatasetT::record_type;

    explicit CenterCluster(const std::shared_ptr<record_type>& center);

    const std::shared_ptr<record_type>& center() const noexcept;
    std::shared_ptr<record_type>& center() noexcept;

protected:
    std::shared_ptr<record_type> center_;
};
Constructor:
  • CenterCluster(center) - Initialize with cluster centroid
Methods:
  • center() - Access cluster centroid
Use cases:
  • K-means clustering
  • K-medoids clustering
  • Any centroid-based partitioning

SubspaceCluster

Cluster with weighted subspace projection for high-dimensional data.
template<typename DatasetT>
class SubspaceCluster : public CenterCluster<DatasetT> {
public:
    using record_type = typename DatasetT::record_type;
    using value_type = typename DatasetT::value_type;

    explicit SubspaceCluster(const std::shared_ptr<record_type>& center);

    std::vector<value_type>& w() noexcept;
    const std::vector<value_type>& w() const noexcept;

    value_type& w(std::size_t i) noexcept;
    const value_type& w(std::size_t i) const noexcept;

protected:
    std::vector<value_type> w_;
};
Constructor:
  • SubspaceCluster(center) - Initialize with cluster centroid
Methods:
  • w() - Access weight vector for subspace projection
  • w(i) - Access individual weight at dimension i
Use cases:
  • Subspace clustering in high dimensions
  • Feature-weighted clustering
  • Locally weighted projections

KD-Tree

KD-Tree provides efficient spatial indexing for nearest neighbor and radius searches in multi-dimensional space.
#include "Unsupervised Learning/Clustering/KD-Tree/kd_tree.h"

template<typename PointT, std::size_t Dim>
class kd_tree {
public:
    kd_tree(const std::vector<PointT>* pts);
    std::vector<std::size_t> radius_search(const PointT& q, double eps) const;
};
Template parameters:
  • PointT - Point type supporting operator[] for coordinate access
  • Dim - Number of dimensions (compile-time constant)

Constructor

kd_tree(const std::vector<PointT>* pts)
Builds the KD-Tree from a point collection. Parameters:
  • pts - Pointer to vector of points (must remain valid for the lifetime of the tree)

radius_search

std::vector<std::size_t> radius_search(const PointT& q, double eps) const
Finds all points within radius eps of query point q. Parameters:
  • q - Query point
  • eps - Search radius (Euclidean distance)
Returns: Vector of indices into the original point collection

Example usage

#include "Unsupervised Learning/Clustering/KD-Tree/kd_tree.h"
#include <vector>
#include <array>

// Define 3D points
using Point3D = std::array<double, 3>;
std::vector<Point3D> points = {
    {1.0, 2.0, 3.0},
    {4.0, 5.0, 6.0},
    {1.1, 2.1, 3.1},
    {10.0, 10.0, 10.0}
};

// Build KD-Tree
kd_tree<Point3D, 3> tree(&points);

// Find neighbors within radius 0.5
Point3D query = {1.0, 2.0, 3.0};
auto neighbors = tree.radius_search(query, 0.5);

// neighbors contains indices: {0, 2}
KD-Trees provide O(log n) average-case search performance, making them ideal for DBSCAN clustering, local density estimation, and nearest neighbor queries.

Performance considerations

  • KD-Tree construction: O(n log n) time complexity
  • Radius search: O(log n + k) average case (k = number of results), O(n) worst case
  • Template instantiation: Dimension must be compile-time constant
  • Memory overhead: Additional nodes stored internally
  • Use dimensionality reduction before clustering high-dimensional data
  • Combine KD-Tree with DBSCAN for density-based clustering
  • Apply PCA before K-means to reduce noise and improve convergence
