
What is Clustering?

Clustering is an unsupervised machine learning technique that groups similar data points together based on their characteristics. The goal is to partition a dataset into distinct groups (clusters) where points within the same cluster are more similar to each other than to points in other clusters.
Unlike supervised learning, clustering doesn’t require labeled training data. The algorithm discovers patterns and structures in the data automatically.

Why Use Clustering?

Clustering algorithms are valuable for:

Data Exploration

Discover natural groupings and patterns in unlabeled datasets

Customer Segmentation

Group customers based on behavior, preferences, or demographics

Image Segmentation

Partition images into meaningful regions for computer vision tasks

Anomaly Detection

Identify outliers that don’t fit well into any cluster

The C-Means Algorithm Family

The C-Means family of algorithms partitions data points into C clusters by iterating through the following steps:

1. Initialize Centroids

Start with C initial centroid positions, chosen either randomly or using domain knowledge.

2. Assign Memberships

Calculate how strongly each point belongs to each centroid's cluster (a crisp or a fuzzy assignment).

3. Update Centroids

Recompute the centroid positions from the membership assignments.

4. Check Convergence

Repeat steps 2-3 until the centroids stabilize or the change in the cost function falls below a threshold.
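The four steps above can be sketched as a single loop. This is a minimal sketch of the crisp variant only, not the implementation behind this page. It assumes a `Point` shape of `{ x, y }`, and the names `runCrispCMeans`, `maxIterations`, and `tolerance` are hypothetical.

```typescript
// Assumed 2D point shape; the page's actual Point type is not shown.
type Point = { x: number; y: number };

const distance = (a: Point, b: Point): number =>
    Math.hypot(a.x - b.x, a.y - b.y);

// Hypothetical helper: runs steps 2-4 from caller-supplied initial centroids (step 1).
const runCrispCMeans = (
    points: Point[],
    initialCentroids: Point[],
    maxIterations = 100,
    tolerance = 1e-6
): { centroids: Point[]; labels: number[] } => {
    let centroids = initialCentroids.map((c) => ({ ...c }));
    let labels: number[] = [];

    for (let iter = 0; iter < maxIterations; iter++) {
        // Step 2: assign each point to the index of its nearest centroid.
        labels = points.map((p) => {
            let best = 0;
            for (let i = 1; i < centroids.length; i++) {
                if (distance(p, centroids[i]) < distance(p, centroids[best])) best = i;
            }
            return best;
        });

        // Step 3: move each centroid to the mean of its assigned points.
        const updated = centroids.map((c, i) => {
            const members = points.filter((_, j) => labels[j] === i);
            if (members.length === 0) return c; // an empty cluster keeps its centroid
            return {
                x: members.reduce((s, p) => s + p.x, 0) / members.length,
                y: members.reduce((s, p) => s + p.y, 0) / members.length,
            };
        });

        // Step 4: stop once no centroid moved more than the tolerance.
        const maxShift = Math.max(...updated.map((c, i) => distance(c, centroids[i])));
        centroids = updated;
        if (maxShift < tolerance) break;
    }
    return { centroids, labels };
};

const result = runCrispCMeans(
    [{ x: 0, y: 0 }, { x: 0, y: 2 }, { x: 10, y: 0 }, { x: 10, y: 2 }],
    [{ x: 1, y: 1 }, { x: 9, y: 1 }]
);
// Centroids end at (0, 1) and (10, 1); labels are [0, 0, 1, 1].
console.log(result.centroids, result.labels);
```

The sketch stops on centroid movement rather than on the cost function. Either criterion matches step 4, and a cost-based check could be swapped in.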

Distance Calculation

Both the crisp and fuzzy variants (described below) use Euclidean distance to measure how close each point is to each centroid:
/**
 * Calculates the Euclidean distance between two points.
 */
export const euclidianDistance = (pointA: Point, pointB: Point) => {
    const distance = Math.sqrt(
        Math.pow((pointA.x - pointB.x), 2) + 
        Math.pow((pointA.y - pointB.y), 2)
    );
    return distance;
}
The distance matrix stores the distance between every centroid and every point, with one row per centroid and one column per point:
/**
 * Calculates the distance matrix between each point and each centroid.
 * 
 * @returns A matrix where element [i][j] is the distance 
 *          between the i-th centroid and the j-th point.
 */
export const getDistanceMatrix = (points: Point[], centroids: Point[]) => {
    const distanceMatrix = [];
    for (let i = 0; i < centroids.length; i++) {
        const distancesRow = [];
        for (let j = 0; j < points.length; j++) {
            const newDistance = euclidianDistance(centroids[i], points[j]);
            distancesRow.push(newDistance);
        }
        distanceMatrix.push(distancesRow);
    }
    return distanceMatrix;
}
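To make the row and column orientation concrete, here is a small usage sketch. It restates the matrix computation in compact form so the example runs on its own. The `Point` shape and the name `distanceMatrixOf` are assumptions for illustration.

```typescript
// Assumed 2D point shape; the page's actual Point type is not shown.
type Point = { x: number; y: number };

// Compact restatement of the distance matrix, laid out as [centroid][point].
const distanceMatrixOf = (points: Point[], centroids: Point[]): number[][] =>
    centroids.map((c) => points.map((p) => Math.hypot(c.x - p.x, c.y - p.y)));

const samplePoints: Point[] = [{ x: 0, y: 0 }, { x: 3, y: 4 }, { x: 6, y: 8 }];
const sampleCentroids: Point[] = [{ x: 0, y: 0 }, { x: 6, y: 8 }];

const matrix = distanceMatrixOf(samplePoints, sampleCentroids);
// Rows follow centroids, columns follow points:
// matrix[0] is [0, 5, 10] and matrix[1] is [10, 5, 0]
console.log(matrix);
```

Reading `matrix[i][j]` therefore gives the distance from centroid `i` to point `j`, which is the orientation the membership steps rely on.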

Crisp vs. Fuzzy Clustering

The C-Means family includes two main variants that differ in how they assign points to clusters:

Crisp Clustering (Hard Assignment)

In crisp C-Means, each point belongs to exactly one cluster. The membership is binary:
  • Membership value = 1 if the point belongs to the cluster
  • Membership value = 0 if the point doesn’t belong to the cluster
Crisp clustering creates clear, non-overlapping boundaries between clusters. Each point is assigned to its nearest centroid.
Example membership matrix (3 clusters, 5 points):
          Point1  Point2  Point3  Point4  Point5
Cluster1    1       0       0       1       0
Cluster2    0       1       0       0       0
Cluster3    0       0       1       0       1
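A matrix like the one above can be derived directly from the distance matrix by marking the nearest centroid for each point. The following is a minimal sketch. The helper name `getCrispMemberships` is hypothetical, and it assumes the `[centroid][point]` layout produced by `getDistanceMatrix`.

```typescript
// Hypothetical helper: builds a crisp membership matrix from a distance matrix
// laid out as [centroid][point]. Ties go to the lowest-indexed cluster.
const getCrispMemberships = (distanceMatrix: number[][]): number[][] => {
    const clusterCount = distanceMatrix.length;
    const pointCount = clusterCount > 0 ? distanceMatrix[0].length : 0;
    const memberships = distanceMatrix.map(() => new Array<number>(pointCount).fill(0));

    for (let j = 0; j < pointCount; j++) {
        // Find the centroid with the smallest distance to point j.
        let nearest = 0;
        for (let i = 1; i < clusterCount; i++) {
            if (distanceMatrix[i][j] < distanceMatrix[nearest][j]) nearest = i;
        }
        memberships[nearest][j] = 1;
    }
    return memberships;
};

const crisp = getCrispMemberships([
    [1, 4, 6],
    [5, 1, 2],
]);
console.log(crisp); // [[1, 0, 0], [0, 1, 1]]
```

Each column of the result contains exactly one 1, which is the defining property of a crisp assignment.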

Fuzzy Clustering (Soft Assignment)

In fuzzy C-Means, each point has a degree of membership to all clusters. Membership values are between 0 and 1, and sum to 1 for each point:
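  • Membership values represent a degree of belonging; although they sum to 1, they are not probabilities in the statistical sense
  • Points can partially belong to multiple clusters
  • A fuzzification parameter (m, with m > 1) controls how fuzzy the boundaries are: values close to 1 approach crisp assignments, and larger values produce softer, more overlapping memberships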
Fuzzy clustering better represents real-world scenarios where boundaries between groups are gradual rather than sharp.
Example membership matrix (3 clusters, 5 points):
          Point1  Point2  Point3  Point4  Point5
Cluster1   0.70    0.15    0.10    0.65    0.20
Cluster2   0.20    0.75    0.15    0.25    0.15
Cluster3   0.10    0.10    0.75    0.10    0.65
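Memberships like these are commonly computed with the standard fuzzy C-Means update, u[i][j] = 1 / Σ_k (d[i][j] / d[k][j])^(2 / (m - 1)), where d[i][j] is the distance from centroid i to point j. The sketch below follows that textbook formula. The page does not show its own implementation, so the helper name `getFuzzyMemberships` and the default `m = 2` are assumptions.

```typescript
// Hypothetical helper: standard fuzzy C-Means membership update.
// distanceMatrix is laid out as [centroid][point]; m must be greater than 1.
const getFuzzyMemberships = (distanceMatrix: number[][], m = 2): number[][] => {
    const clusterCount = distanceMatrix.length;
    const pointCount = clusterCount > 0 ? distanceMatrix[0].length : 0;
    const exponent = 2 / (m - 1);
    const memberships = distanceMatrix.map(() => new Array<number>(pointCount).fill(0));

    for (let j = 0; j < pointCount; j++) {
        // A point lying exactly on one or more centroids belongs fully to them,
        // which also avoids dividing by a zero distance below.
        const zeroIdx: number[] = [];
        for (let i = 0; i < clusterCount; i++) {
            if (distanceMatrix[i][j] === 0) zeroIdx.push(i);
        }
        if (zeroIdx.length > 0) {
            for (const i of zeroIdx) memberships[i][j] = 1 / zeroIdx.length;
            continue;
        }
        for (let i = 0; i < clusterCount; i++) {
            let sum = 0;
            for (let k = 0; k < clusterCount; k++) {
                sum += Math.pow(distanceMatrix[i][j] / distanceMatrix[k][j], exponent);
            }
            memberships[i][j] = 1 / sum;
        }
    }
    return memberships;
};

const fuzzy = getFuzzyMemberships([
    [1, 3],
    [3, 1],
]);
// Approximately [[0.9, 0.1], [0.1, 0.9]] with m = 2.
console.log(fuzzy);
```

Each column sums to 1, and the closer centroid receives the larger share. Raising `m` pulls the values toward equal memberships.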

Cost Function

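Both algorithms minimize an objective (cost) function that measures clustering quality, typically based on the distances between points and their cluster centroids, so a lower cost indicates tighter clusters. The total cost is the sum of the per-cluster costs: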
/**
 * Calculates the total cost as the sum of all individual cluster costs.
 */
export const getCostFunction = (costValues: number[]): number =>
    costValues.reduce((sum, costValue) => sum + costValue, 0);
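The cost never increases from one iteration to the next as centroids move toward locally optimal positions; the final result depends on the initial centroids and is not guaranteed to be the global optimum. The algorithm is considered converged when the change in cost between successive iterations falls below a threshold.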
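As an illustration of where the per-cluster costs could come from, the sketch below uses the usual C-Means objective, in which each cluster's cost is the sum of u^m · d² over all points. The helper names `getClusterCosts` and `hasConverged` and the default threshold are hypothetical. The page only shows the final summation in `getCostFunction`.

```typescript
// Hypothetical helper: cost of each cluster as the membership-weighted sum of
// squared distances. Both matrices are laid out as [cluster][point].
// For crisp memberships (0 or 1) the exponent m has no effect.
const getClusterCosts = (
    memberships: number[][],
    distanceMatrix: number[][],
    m = 2
): number[] =>
    memberships.map((row, i) =>
        row.reduce(
            (sum, u, j) => sum + Math.pow(u, m) * Math.pow(distanceMatrix[i][j], 2),
            0
        )
    );

// Hypothetical convergence test on the total cost of two successive iterations.
const hasConverged = (previousCost: number, currentCost: number, threshold = 1e-4): boolean =>
    Math.abs(previousCost - currentCost) < threshold;

const clusterCosts = getClusterCosts(
    [[1, 0, 0], [0, 1, 1]],
    [[1, 4, 6], [5, 1, 2]]
);
const totalCost = clusterCosts.reduce((sum, c) => sum + c, 0);
console.log(clusterCosts, totalCost); // [1, 5] 6
```

In an iteration loop, the total from the previous pass would be kept and compared with the new total through `hasConverged` before deciding whether to continue.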

Choosing Between Crisp and Fuzzy

Use Crisp C-Means when:
  • You need clear, distinct cluster assignments
  • Your data has well-separated natural groups
  • You want faster computation and simpler interpretation
Use Fuzzy C-Means when:
  • Your data has overlapping or gradual boundaries
  • You need to model uncertainty in cluster membership
  • You want more nuanced cluster analysis
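The two variants are also connected in practice: a fuzzy result can be reduced to crisp labels by taking each point's highest membership. The sketch below shows one way to do that. The helper name `toCrispLabels` is hypothetical, and the input reuses the fuzzy example matrix from this page.

```typescript
// Hypothetical helper: picks the highest-membership cluster index for each point.
// memberships is laid out as [cluster][point]; ties go to the lowest index.
const toCrispLabels = (memberships: number[][]): number[] => {
    const pointCount = memberships.length > 0 ? memberships[0].length : 0;
    const labels: number[] = [];
    for (let j = 0; j < pointCount; j++) {
        let best = 0;
        for (let i = 1; i < memberships.length; i++) {
            if (memberships[i][j] > memberships[best][j]) best = i;
        }
        labels.push(best);
    }
    return labels;
};

// Values taken from the fuzzy example membership matrix.
const labels = toCrispLabels([
    [0.70, 0.15, 0.10, 0.65, 0.20],
    [0.20, 0.75, 0.15, 0.25, 0.15],
    [0.10, 0.10, 0.75, 0.10, 0.65],
]);
console.log(labels); // [0, 1, 2, 0, 2]
```

The resulting zero-based labels match the crisp example matrix, while the original fuzzy values remain available as a measure of how confident each assignment is.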

Next Steps

Crisp C-Means

Learn about hard cluster assignments and binary membership

Fuzzy C-Means

Explore soft memberships and the fuzzification parameter
