Clustering algorithms partition data so that similar items end up in the same group. MLPP provides template-based cluster classes and efficient spatial indexing via KD-Trees.
Cluster classes
MLPP provides a hierarchy of cluster classes in the mlpp::unsupervised::clustering namespace for representing grouped data points.
Cluster
Base cluster class for grouping data records.
#include "Unsupervised Learning/Clustering/cluster.hpp"
template<typename DatasetT>
class Cluster {
public:
using dataset_type = DatasetT;
using record_type = typename DatasetT::record_type;
void set_id(std::size_t id) noexcept;
std::size_t id() const noexcept;
protected:
std::size_t id_ = 0;
std::vector<std::shared_ptr<record_type>> elements_;
};
Template parameters:
DatasetT - Dataset type containing records to cluster
Methods:
set_id(id) - Assign cluster identifier
id() - Get cluster identifier
CenterCluster
Cluster augmented with a centroid record representing its center.
template<typename DatasetT>
class CenterCluster : public Cluster<DatasetT> {
public:
using record_type = typename DatasetT::record_type;
explicit CenterCluster(const std::shared_ptr<record_type>& center);
const std::shared_ptr<record_type>& center() const noexcept;
std::shared_ptr<record_type>& center() noexcept;
protected:
std::shared_ptr<record_type> center_;
};
Constructor:
CenterCluster(center) - Initialize with cluster centroid
Methods:
center() - Access cluster centroid
Use cases:
- K-means clustering
- K-medoids clustering
- Any centroid-based partitioning
SubspaceCluster
Cluster with weighted subspace projection for high-dimensional data.
template<typename DatasetT>
class SubspaceCluster : public CenterCluster<DatasetT> {
public:
using value_type = typename DatasetT::value_type;
using record_type = typename DatasetT::record_type;
explicit SubspaceCluster(const std::shared_ptr<record_type>& center);
std::vector<value_type>& w() noexcept;
const std::vector<value_type>& w() const noexcept;
value_type& w(std::size_t i) noexcept;
const value_type& w(std::size_t i) const noexcept;
protected:
std::vector<value_type> w_;
};
Constructor:
SubspaceCluster(center) - Initialize with cluster centroid
Methods:
w() - Access weight vector for subspace projection
w(i) - Access individual weight at dimension i
Use cases:
- Subspace clustering in high dimensions
- Feature-weighted clustering
- Locally weighted projections
KD-Tree
KD-Tree provides efficient spatial indexing for nearest neighbor and radius searches in multi-dimensional space.
#include "Unsupervised Learning/Clustering/KD-Tree/kd_tree.h"
template<typename PointT, std::size_t Dim>
class kd_tree {
public:
kd_tree(const std::vector<PointT>* pts);
std::vector<std::size_t> radius_search(const PointT& q, double eps) const;
};
Template parameters:
PointT - Point type supporting operator[] for coordinate access
Dim - Number of dimensions (compile-time constant)
Constructor
kd_tree(const std::vector<PointT>* pts)
Builds KD-Tree from point collection.
Parameters:
pts - Pointer to vector of points (must remain valid during tree lifetime)
Radius search
std::vector<std::size_t> radius_search(const PointT& q, double eps) const
Finds all points within radius eps of query point q.
Parameters:
q - Query point
eps - Search radius (Euclidean distance)
Returns:
Vector of indices into original point collection
Example usage
#include "Unsupervised Learning/Clustering/KD-Tree/kd_tree.h"
#include <vector>
#include <array>
// Define 3D points
using Point3D = std::array<double, 3>;
std::vector<Point3D> points = {
{1.0, 2.0, 3.0},
{4.0, 5.0, 6.0},
{1.1, 2.1, 3.1},
{10.0, 10.0, 10.0}
};
// Build KD-Tree
kd_tree<Point3D, 3> tree(&points);
// Find neighbors within radius 0.5
Point3D query = {1.0, 2.0, 3.0};
auto neighbors = tree.radius_search(query, 0.5);
// neighbors contains indices 0 and 2 (order may vary)
KD-Trees provide O(log n) average-case search performance, making them ideal for DBSCAN clustering, local density estimation, and nearest neighbor queries.
Performance notes:
- KD-Tree construction: O(n log n) time complexity
- Radius search: O(log n) average case, O(n) worst case
- Template instantiation: Dim must be a compile-time constant
- Memory overhead: internal tree nodes add O(n) storage on top of the point data
Best practices:
- Use dimensionality reduction before clustering high-dimensional data; KD-Tree search performance degrades as dimensionality grows
- Combine KD-Tree with DBSCAN for density-based clustering
- Apply PCA before K-means to reduce noise and improve convergence