What is Clustering
Clustering is an unsupervised learning algorithm: it takes a dataset without labels and automatically groups the data points into clusters of related or similar examples. Unlike supervised learning, where we have labeled training data, clustering works with unlabeled data to discover natural groupings.
Clustering vs Classification
Let’s contrast clustering with supervised learning:

- Supervised Learning (Classification): Given a dataset with features x₁ and x₂, we have both the inputs x and the labels y. We can fit logistic regression or a neural network to learn a decision boundary.
- Unsupervised Learning (Clustering): We’re given a dataset with just x, but not the target labels y. We ask the algorithm to find interesting structure - to discover if the data can be grouped into clusters.
K-Means Algorithm
The K-means algorithm is the most widely used clustering technique. It identifies k distinct groups in your data by iteratively refining cluster assignments.

How K-Means Works
Initialize Cluster Centroids
Randomly pick k training examples from your dataset. Set μ₁, μ₂, …, μₖ (the cluster centroids) equal to these k examples.
Assign Points to Nearest Centroid
For each training example xⁱ, calculate its distance to each centroid and assign the example to the cluster of the nearest centroid:

Set cⁱ = index of the cluster centroid closest to xⁱ
The distance metric used is the L2 norm (Euclidean distance): ||xⁱ - μₖ||
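The assignment step can be sketched in a few lines, assuming NumPy; the example point and centroid values below are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0])             # one training example x^(i)
centroids = np.array([[0.0, 0.0],    # mu_1
                      [5.0, 5.0]])   # mu_2
# L2 (Euclidean) distance from x to each centroid
dists = np.linalg.norm(x - centroids, axis=1)
c = int(dists.argmin())              # c^(i): index of the nearest centroid
```

Broadcasting computes the distance to every centroid at once, and `argmin` picks the closest one.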
Move Centroids
For each cluster k, compute the mean of all points assigned to that cluster. Move the centroid μₖ to this mean location.
Complete K-Means Implementation
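A minimal from-scratch sketch of the full algorithm, assuming NumPy; the function and parameter names (`kmeans`, `n_iters`) are illustrative, not from the text:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k random training examples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[c == j].mean(axis=0) if np.any(c == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, c
```

On two well-separated blobs, the two returned centroids land near the blob centers and every blob's points share one cluster index.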
The algorithm repeats the assignment and centroid-update steps until the centroids stop moving (or the assignments stop changing).

Optimization Objective
K-means optimizes a specific cost function called the distortion function:

J(c¹, …, cᵐ, μ₁, …, μₖ) = (1/m) Σᵢ₌₁ᵐ ||xⁱ - μ_cⁱ||²

This is the average squared distance between each training example and the centroid of the cluster to which it’s assigned.
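The distortion can be computed directly from the definition; a sketch assuming NumPy, where `centroids[c]` picks out μ_cⁱ for every example:

```python
import numpy as np

def distortion(X, c, centroids):
    # average squared distance between each example and its assigned centroid
    return float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1)))
```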
Why the Algorithm Works
Step 1: Assigning Points Minimizes Distance
When assigning points to clusters, we choose the closest centroid for each point. This minimizes the squared distance ||xⁱ - μ_cⁱ||² for each example, thus reducing the overall cost function J.
Step 2: Moving Centroids Reduces Average Distance
Moving the centroid to the mean position of all assigned points minimizes the average squared distance to those points. This is a mathematical property of the mean: it’s the point that minimizes the sum of squared distances.

For example, if a cluster has points at positions 1 and 11, placing the centroid at their average (6) gives squared distances of 25 + 25 = 50. Any other position would give a larger sum.
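A quick numeric check of the 1-D example above, in pure Python:

```python
def cost(mu, points=(1, 11)):
    # sum of squared distances from each point to a candidate centroid mu
    return sum((p - mu) ** 2 for p in points)

assert cost(6) == 50                               # 25 + 25: the mean wins
assert all(cost(mu) > 50 for mu in (0, 5, 7, 12))  # any other spot is worse
```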
Convergence Guarantee
K-means is guaranteed to converge because each iteration either reduces the cost function J or keeps it the same. The algorithm stops when no further improvement is possible.
Random Initialization
The choice of initial centroids significantly impacts the final result: K-means can get stuck in local optima depending on the initialization.

Multiple Random Initializations
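This scheme can be sketched as follows, assuming NumPy; `kmeans_once` and `kmeans_multi_init` are illustrative names, not from the text:

```python
import numpy as np

def kmeans_once(X, k, rng, n_iters=50):
    # one K-means run from a random initialization
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[c == j].mean(axis=0) if np.any(c == j)
                              else centroids[j] for j in range(k)])
    c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    J = np.mean(np.sum((X - centroids[c]) ** 2, axis=1))  # distortion
    return centroids, c, J

def kmeans_multi_init(X, k, n_init=50, seed=0):
    # run K-means n_init times and keep the lowest-cost solution
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])
```

Because each run may land in a different local optimum, keeping the lowest-J run across many initializations usually recovers a much better clustering than any single run.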
To find better clusters, run K-means multiple times with different random initializations, compute the cost J for each solution, and keep the clustering with the lowest cost.

Choosing the Number of Clusters
How do you decide the value of k, the number of clusters?

The Ambiguity Challenge
For many clustering problems, the right value of k is genuinely ambiguous. Different people might see different numbers of natural groupings in the same data.
The Elbow Method
One approach is to plot the cost function J versus the number of clusters and look for an “elbow”, the point where the curve’s decrease slows sharply.

Limitation: In practice, many datasets don’t have a clear elbow, making this method less useful than it might seem.
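The elbow method can be sketched as below, assuming NumPy; `kmeans_cost` is an illustrative helper that runs K-means once and returns the final distortion J, and the three-blob dataset is synthetic:

```python
import numpy as np

def kmeans_cost(X, k, seed=0, n_iters=50):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[c == j].mean(axis=0) if np.any(c == j)
                              else centroids[j] for j in range(k)])
    c = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return float(np.mean(np.sum((X - centroids[c]) ** 2, axis=1)))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([0, 0], [6, 6], [0, 6])])
costs = [kmeans_cost(X, k) for k in range(1, 9)]
# Plot k against costs (e.g. with matplotlib) and look for the elbow;
# note that J keeps falling as k grows, so the minimum is never at the elbow.
```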
Evaluation Based on Later Purpose
A better approach is to choose k based on what you need the clusters for.

Example: T-shirt Sizing
If manufacturing t-shirts, you might evaluate:
- k=3 (S, M, L): Simpler inventory, but some customers get poor fits
- k=5 (XS, S, M, L, XL): Better fit for more customers, but more complex inventory
Example: Image Compression
For compressing images by clustering pixel colors:
- Lower k: Smaller file size, lower quality
- Higher k: Better image quality, larger file size
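The trade-off above can be sketched with K-means color quantization, assuming NumPy; a tiny random array stands in for a real (H, W, 3) image here:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
pixels = image.reshape(-1, 3)            # one row per pixel (R, G, B)

k = 16                                   # compress the palette to 16 colors
centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
for _ in range(20):
    # assign each pixel to its nearest color, then recompute the palette
    c = np.linalg.norm(pixels[:, None] - centroids[None], axis=2).argmin(axis=1)
    centroids = np.array([pixels[c == j].mean(axis=0) if np.any(c == j)
                          else centroids[j] for j in range(k)])

# Replace each pixel with its centroid color; storing k colors plus one
# small index per pixel is what shrinks the file.
compressed = centroids[c].reshape(image.shape)
```

Raising k enlarges the stored palette and the per-pixel index, but each pixel's color gets closer to its original value.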
Practical Applications
Customer Segmentation
Group customers by behavior patterns to personalize marketing and services
Document Organization
Automatically categorize documents or news articles by topic
Image Compression
Reduce colors in images by clustering similar pixel values
Genomic Analysis
Identify subtypes in biological data based on gene expression
Key Takeaways
- Clustering is unsupervised: it finds structure in data without labels.
- K-means alternates between assigning points to the nearest centroid and moving each centroid to the mean of its assigned points.
- Every iteration reduces (or preserves) the distortion J, so K-means converges, though possibly to a local optimum.
- Run multiple random initializations and keep the lowest-cost solution.
- Choose k based on the downstream purpose; the elbow method is often inconclusive.
Next Steps
Anomaly Detection
Learn how to detect unusual events and outliers in your data
