5.2 The Power of Clustering: Discovering Hidden Groups

Introduction

The K-Means algorithm is a powerful unsupervised learning tool whose mission is to explore a dataset and discover hidden "groups" or "clusters" without requiring prior labels. It works by grouping data points that are most similar to each other, revealing the inherent structure of the information.

🏢

Activity

K-Means Cluster Explorer

Level Beginner

K-Means is the central algorithm of this chapter for discovering hidden segments without prior labels. Minermont's service center needs to identify distinct user profiles from behavioural signals, without knowing in advance how many groups exist.

What to watch for: K-Means uncovers groups with similar characteristics, helping the team design tailored processes for each segment and detect unusual clusters early.

Interactive Demonstration

K-Means Algorithm Settings

Number of Clusters (K): 3

Use K-Means++

Elbow Method

Use the Elbow Method to estimate the best number of clusters for your data.

Core Concepts

The Elbow Method: Finding the Optimal K

What is the Elbow Method?

The Elbow Method is a fundamental heuristic technique to determine the optimal number of clusters (K) in a dataset. It runs K-Means for different values of K and computes the sum of squared errors (SSE) or "inertia" for each value.

Process:

Run K-Means for a range of K values (e.g., 1 to 10).
Compute Inertia (SSE) for each K: SSE = Σ(distance² between each point and its centroid).
Plot the Curve (K on the X axis, inertia on the Y axis).
Find the "Elbow" where the rate of decrease slows down.

Interpretation and Application

Few clusters (small K): High inertia, points far from centroids.
Many clusters (large K): Low inertia, but risk of overfitting.
Optimal K: Balance between compactness and model simplicity.

The optimal point is where the curve forms an "elbow", indicating that adding more clusters does not significantly improve the grouping.

K-Means Algorithm Step by Step

Iterative Process

K-Means uses an iterative refinement process:

Initialization: Place K centroids randomly.
Assignment: Assign each point to the nearest centroid.
Update: Recompute centroid positions as the mean of their assigned points.
Repeat until convergence (centroids stop moving significantly).