Dataset

Template class for managing collections of records with a common schema in clustering algorithms.

Template parameters

T
Numeric
Numeric type constrained by the Numeric concept (requires std::is_arithmetic_v<T>)

Constructors

Dataset(schema_ptr)
constructor
Constructs a dataset with the specified schema.
Parameters:
  • schema_ptr (std::shared_ptr<Schema<T>>): Shared pointer to the schema
Dataset(other)
constructor
Copy constructor.
Parameters:
  • other (const Dataset<T>&): Dataset to copy from

Methods

num_attr
std::size_t
Returns the number of attributes in the dataset.
Returns: Number of attributes defined in the schema
schema
const std::shared_ptr<Schema<T>>&
Returns the dataset schema.
Returns: Const reference to shared pointer to the schema
operator()(i, j)
AttrValue<T>&
Accesses the attribute value at record i, attribute j.
Parameters:
  • i (std::size_t): Record index
  • j (std::size_t): Attribute index
Returns: Mutable reference to the attribute value
operator()(i, j) const
const AttrValue<T>&
Accesses the attribute value at record i, attribute j (const version).
Parameters:
  • i (std::size_t): Record index
  • j (std::size_t): Attribute index
Returns: Const reference to the attribute value
is_numeric
bool
Checks if all attributes are continuous (numeric).
Returns: True if dataset contains only continuous attributes
is_categorical
bool
Checks if all attributes are discrete (categorical).
Returns: True if dataset contains only discrete attributes
save
void
Saves the dataset to a file.
Parameters:
  • filename (const std::string&): Path to output file
get_CM
mlpp::model_validation::ConfusionMatrix<>
Generates a confusion matrix from labelled and clustered data.
Returns: Confusion matrix comparing true labels to cluster assignments
Requires the dataset to have labels. Used for evaluating clustering quality.
operator=
Dataset<T>&
Assignment operator.
Parameters:
  • other (const Dataset<T>&): Dataset to assign from
Returns: Reference to this dataset
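
The two operator()(i, j) overloads listed above follow the usual C++ pattern of pairing a mutable accessor with a const one. As a minimal sketch of that pattern only, the block below uses a hypothetical Table class holding plain double values. Table, read_cell, and the row-major storage are assumptions made for illustration; they are not mlpp's Dataset or AttrValue<T>.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dataset-like container; not mlpp's Dataset.
class Table {
public:
    Table(std::size_t rows, std::size_t cols)
        : cols_(cols), cells_(rows * cols, 0.0) {}

    // Non-const overload: returns a mutable reference, so callers can assign.
    double& operator()(std::size_t i, std::size_t j) {
        return cells_[i * cols_ + j];
    }

    // Const overload: chosen for const objects and gives read-only access.
    const double& operator()(std::size_t i, std::size_t j) const {
        return cells_[i * cols_ + j];
    }

private:
    std::size_t cols_;            // number of columns (attributes)
    std::vector<double> cells_;   // row-major storage, assumed for this sketch
};

// Reading through a const reference selects the const overload.
double read_cell(const Table& t, std::size_t i, std::size_t j) {
    return t(i, j);
}
```

With this shape, `t(1, 2) = 4.5;` compiles only for a non-const Table, while code holding a `const Table&` can read cells but not modify them.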

Example

#include <cstddef>
#include <memory>

#include "Clustering/clustering_dataset.hpp"

using namespace mlpp::unsupervised::clustering;

// Create schema
auto schema = std::make_shared<Schema<double>>();

// Create dataset
Dataset<double> dataset(schema);

// Check properties
bool numeric = dataset.is_numeric();
std::size_t n_attrs = dataset.num_attr();

// Save to file
dataset.save("clustered_data.csv");

// Get confusion matrix (requires a labelled dataset)
if (dataset.schema()->is_labelled()) {
    auto cm = dataset.get_CM();
}
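
To illustrate what a confusion matrix comparing true labels to cluster assignments contains, here is a self-contained sketch that does not use mlpp at all. CountMatrix and count_label_cluster_pairs are hypothetical names, and the layout (rows for true labels, columns for cluster IDs) is an assumption; the actual mlpp::model_validation::ConfusionMatrix<> interface and orientation may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in, not mlpp's ConfusionMatrix: rows are true labels,
// columns are cluster IDs, and each cell counts the records with that pair.
using CountMatrix = std::vector<std::vector<std::size_t>>;

CountMatrix count_label_cluster_pairs(const std::vector<std::size_t>& labels,
                                      const std::vector<std::size_t>& cluster_ids) {
    // Mismatched or empty inputs yield an empty matrix in this sketch.
    if (labels.size() != cluster_ids.size() || labels.empty()) {
        return {};
    }

    // Labels and cluster IDs are assumed to be zero-based integers.
    const std::size_t n_labels =
        *std::max_element(labels.begin(), labels.end()) + 1;
    const std::size_t n_clusters =
        *std::max_element(cluster_ids.begin(), cluster_ids.end()) + 1;

    CountMatrix counts(n_labels, std::vector<std::size_t>(n_clusters, 0));
    for (std::size_t r = 0; r < labels.size(); ++r) {
        ++counts[labels[r]][cluster_ids[r]];
    }
    return counts;
}
```

A clustering that matches the labels well concentrates each row's counts in a single column; counts spread across several columns indicate that a true class was split between clusters.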

Schema

Template class defining the structure and metadata for dataset attributes.

Template parameters

T
Numeric
Numeric type constrained by the Numeric concept

Methods

clone
Schema<T>*
Creates a deep copy of the schema.
Returns: Pointer to cloned schema
labelInfo()
std::shared_ptr<DAttrInfo<T>>&
Returns a mutable reference to the label attribute metadata.
Returns: Shared pointer to discrete attribute info for labels
labelInfo() const
const std::shared_ptr<DAttrInfo<T>>&
Returns a const reference to the label attribute metadata.
Returns: Const shared pointer to discrete attribute info for labels
idInfo()
std::shared_ptr<DAttrInfo<T>>&
Returns a mutable reference to the ID attribute metadata.
Returns: Shared pointer to discrete attribute info for record IDs
idInfo() const
const std::shared_ptr<DAttrInfo<T>>&
Returns a const reference to the ID attribute metadata.
Returns: Const shared pointer to discrete attribute info for record IDs
set_label
void
Sets the label value for a record.
Parameters:
  • r (std::shared_ptr<Record<T>>&): Record to modify
  • val (const std::string&): String value to set as label
set_id
void
Sets the ID value for a record.
Parameters:
  • r (std::shared_ptr<Record<T>>&): Record to modify
  • val (const std::string&): String value to set as ID
is_labelled
bool
Checks if the schema includes label information.
Returns: True if schema has label attribute defined
equal
bool
Checks if two schemas are equal (including labels).
Parameters:
  • o (const Schema<T>&): Other schema to compare
Returns: True if schemas are identical
equal_no_label
bool
Checks if two schemas are equal excluding labels.
Parameters:
  • o (const Schema<T>&): Other schema to compare
Returns: True if schemas are identical (ignoring labels)
is_member
bool
Checks if an attribute is part of this schema.
Parameters:
  • info (const AttrInfo<T>&): Attribute to check
Returns: True if attribute exists in schema

Example

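#include <memory>

#include "Clustering/clustering_dataset.hpp"

using namespace mlpp::unsupervised::clustering;

auto schema = std::make_shared<Schema<double>>();

// Check if labelled
if (schema->is_labelled()) {
    const auto& label_info = schema->labelInfo();
}

// Clone schema. clone() returns a raw pointer; this assumes the caller owns
// the copy, so it is wrapped in a unique_ptr to avoid a leak.
std::unique_ptr<Schema<double>> schema_copy(schema->clone());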

Record

Template class representing a single data record in a dataset.

Template parameters

T
Numeric
Numeric type constrained by the Numeric concept

Constructor

Record(schema_ptr)
constructor
Constructs a record with the specified schema.
Parameters:
  • schema_ptr (std::shared_ptr<Schema<T>>): Shared pointer to the schema

Methods

schema
const std::shared_ptr<Schema<T>>&
Returns the record’s schema.
Returns: Const reference to shared pointer to the schema
labelValue()
AttrValue<T>&
Returns a mutable reference to the label value.
Returns: Mutable reference to label attribute value
labelValue() const
const AttrValue<T>&
Returns a const reference to the label value.
Returns: Const reference to label attribute value
idValue()
AttrValue<T>&
Returns a mutable reference to the ID value.
Returns: Mutable reference to ID attribute value
idValue() const
const AttrValue<T>&
Returns a const reference to the ID value.
Returns: Const reference to ID attribute value
get_id
std::size_t
Gets the record’s ID as an integer.
Returns: Record ID
get_label
std::size_t
Gets the record’s label as an integer.
Returns: Record label (class/category)

Private members

schema_
std::shared_ptr<Schema<T>>
Shared pointer to the record’s schema
label_
AttrValue<T>
Label attribute value (for supervised evaluation)
id_
AttrValue<T>
ID attribute value (cluster assignment)
features_
std::vector<AttrValue<T>>
Vector of feature attribute values

Example

#include <cstddef>
#include <memory>

#include "Clustering/clustering_dataset.hpp"

using namespace mlpp::unsupervised::clustering;

auto schema = std::make_shared<Schema<double>>();
auto record = std::make_shared<Record<double>>(schema);

// Get cluster assignment
std::size_t cluster_id = record->get_id();

// Get true label (only meaningful if the schema is labelled)
if (schema->is_labelled()) {
    std::size_t true_label = record->get_label();
}

Namespace

All dataset-related classes are defined in the mlpp::unsupervised::clustering namespace.

Type constraints

The Numeric concept requires:
template<typename T>
concept Numeric = std::is_arithmetic_v<T>;
This ensures that template parameter T is an arithmetic type (int, float, double, etc.).