K-Means Clustering Pseudocode
K-Means is one of the most popular unsupervised learning algorithms for clustering. It partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids based on assigned points.
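Before diving into the pseudocode, here is a minimal runnable sketch of that loop in NumPy, assuming plain random initialization and squared-Euclidean distance (the function name and signature are illustrative, not a reference implementation):

```python
import numpy as np

def kmeans(X, K, max_iterations=300, tolerance=1e-4, seed=0):
    """Plain Lloyd's algorithm with random initialization (a sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct random data points
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of assigned points; keep the old centroid if a cluster is empty
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tolerance:
            break
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, inertia
```

For well-separated data this converges in a handful of iterations; the pseudocode below spells out the same steps.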
The Algorithm
K-Means Clustering (Lloyd's Algorithm)
ALGORITHM: K-Means Clustering
═════════════════════════════════════════════════════════════════
INPUT:
    X: Data matrix of shape (n_samples, n_features)
    K: Number of clusters
    max_iterations: Maximum number of iterations (e.g., 300)
    tolerance: Convergence threshold (e.g., 1e-4)
    init: Initialization method ('random', 'k-means++')
OUTPUT:
    centroids: Final cluster centers of shape (K, n_features)
    labels: Cluster assignment for each sample, shape (n_samples,)
    inertia: Sum of squared distances to the nearest centroid
PROCEDURE:
    1. // Initialize centroids
       IF init = 'random':
           centroids ← random_sample(X, K)   // Select K distinct random points
       ELSE IF init = 'k-means++':
           centroids ← kmeans_plus_plus_init(X, K)
    2. // Main loop
       FOR iteration = 1 TO max_iterations:
           // Assignment step: assign each point to its nearest centroid
           labels ← empty array of size n_samples
           FOR i = 1 TO n_samples:
               distances ← [||X[i] - centroids[k]||² for k = 1 to K]
               labels[i] ← argmin(distances)
           // Update step: recompute each centroid as the mean of its points
           new_centroids ← empty array of shape (K, n_features)
           FOR k = 1 TO K:
               points_in_cluster ← X[labels == k]
               IF |points_in_cluster| > 0:
                   new_centroids[k] ← mean(points_in_cluster, axis=0)
               ELSE:
                   // Handle empty cluster: keep old centroid (or reinitialize)
                   new_centroids[k] ← centroids[k]
           // Check for convergence (commit the update before breaking,
           // so the final centroids reflect the last assignment)
           centroid_shift ← max(||new_centroids - centroids||)
           centroids ← new_centroids
           IF centroid_shift < tolerance:
               PRINT "Converged at iteration", iteration
               BREAK
    3. // Compute final inertia (within-cluster sum of squares)
       inertia ← 0
       FOR i = 1 TO n_samples:
           inertia ← inertia + ||X[i] - centroids[labels[i]]||²
    4. RETURN centroids, labels, inertia

K-Means++ Initialization
This improved initialization leads to faster convergence and better results:
ALGORITHM: K-Means++ Initialization
═════════════════════════════════════════════════════════════════
INPUT:
    X: Data matrix of shape (n_samples, n_features)
    K: Number of clusters
OUTPUT:
    centroids: Initial centroid positions of shape (K, n_features)
PROCEDURE:
    1. // Choose the first centroid uniformly at random
       centroids ← [X[random_int(0, n_samples - 1)]]
    2. // Choose each remaining centroid with probability proportional to D²
       FOR k = 2 TO K:
           // Distance from each point to its nearest existing centroid
           D² ← zeros(n_samples)
           FOR i = 1 TO n_samples:
               min_dist ← ∞
               FOR c IN centroids:
                   dist ← ||X[i] - c||²
                   min_dist ← min(min_dist, dist)
               D²[i] ← min_dist
           // Sample the next centroid with probability ∝ D²
           probabilities ← D² / sum(D²)
           next_idx ← random_choice(range(n_samples), p=probabilities)
           APPEND X[next_idx] TO centroids
    3. RETURN array(centroids)

The Elbow Method for Choosing K
ALGORITHM: Elbow Method
═════════════════════════════════════════════════════════════════
INPUT:
    X: Data matrix
    K_range: Range of K values to try (e.g., 1 to 10)
OUTPUT:
    inertias: List of inertia values for each K
    suggested_K: Suggested number of clusters (elbow point)
PROCEDURE:
    inertias ← empty list
    FOR K IN K_range:
        _, _, inertia ← kmeans(X, K)
        APPEND inertia TO inertias
    // Find the elbow (point of maximum curvature)
    // Simple heuristic: largest drop in the rate of decrease
    decreases ← [inertias[i-1] - inertias[i] for i = 1 to len(inertias)-1]
    second_derivatives ← [decreases[i-1] - decreases[i] for i = 1 to len(decreases)-1]
    suggested_K ← argmax(second_derivatives) + 2   // +2 converts the 0-based index back to a K value
    RETURN inertias, suggested_K
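A tiny numeric check of the second-difference heuristic above, using made-up inertia values (chosen purely for illustration, so that the curve flattens after K = 3):

```python
# Illustrative inertia values for K = 1..6 (fabricated for demonstration):
# the drop is steep up to K = 3 and flattens afterwards.
inertias = [1000.0, 600.0, 250.0, 230.0, 215.0, 205.0]

# First differences: how much inertia drops at each step up in K
decreases = [inertias[i - 1] - inertias[i] for i in range(1, len(inertias))]
# Second differences: how sharply the rate of improvement falls off
second_diff = [decreases[i - 1] - decreases[i] for i in range(1, len(decreases))]
# +2: index 0 of second_diff corresponds to K = 2
suggested_K = second_diff.index(max(second_diff)) + 2
print(suggested_K)  # 3
```

Here `decreases` is `[400, 350, 20, 15, 10]`, so the largest second difference (330) sits at the K = 3 transition, which matches the visual elbow.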
    // Plot inertias vs K to visualize the "elbow"

Silhouette Score
ALGORITHM: Silhouette Coefficient
═════════════════════════════════════════════════════════════════
// Measures how similar a point is to its own cluster vs. other clusters
INPUT:
    X: Data matrix
    labels: Cluster assignments
OUTPUT:
    silhouette_scores: Score for each sample
    mean_silhouette: Average silhouette score
PROCEDURE:
    FOR each sample i:
        cluster_i ← labels[i]
        // a(i): mean distance to the other points in the same cluster
        same_cluster ← X[labels == cluster_i AND index ≠ i]
        a[i] ← mean(||X[i] - p|| for p in same_cluster)
        // b(i): mean distance to the points in the nearest other cluster
        b[i] ← ∞
        FOR each cluster c ≠ cluster_i:
            other_cluster ← X[labels == c]
            mean_dist ← mean(||X[i] - p|| for p in other_cluster)
            b[i] ← min(b[i], mean_dist)
        // Silhouette coefficient for point i
        s[i] ← (b[i] - a[i]) / max(a[i], b[i])
    mean_silhouette ← mean(s)
    RETURN s, mean_silhouette
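The procedure above, sketched in NumPy. It builds the full pairwise distance matrix, so it is meant for small data and illustration only, and it assumes at least two clusters:

```python
import numpy as np

def silhouette(X, labels):
    """Per-sample silhouette coefficients and their mean (a sketch)."""
    n = len(X)
    # Full pairwise Euclidean distance matrix (fine for small n)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(n)
    for i in range(n):
        # a(i): mean distance to the other members of i's own cluster
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0
        # b(i): smallest mean distance to any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return s, s.mean()
```

On two tight, well-separated pairs of points the mean silhouette is close to +1, as the interpretation notes below suggest.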
// Interpretation:
//   s ≈ +1: point is well clustered
//   s ≈  0: point lies on a cluster boundary
//   s ≈ -1: point may be assigned to the wrong cluster

Mathematical Foundation
Objective Function
K-Means minimizes the Within-Cluster Sum of Squares (WCSS):
$$J = \sum_{k=1}^{K} \sum_{i \in C_k} ||x_i - \mu_k||^2$$

Where:
- $C_k$ is the set of points in cluster $k$
- $\mu_k$ is the centroid of cluster $k$
Convergence Proof
Each iteration decreases (or maintains) $J$:
Assignment step: For fixed centroids, assigning each point to its nearest centroid minimizes $J$.
Update step: For fixed assignments, setting $\mu_k = \frac{1}{|C_k|}\sum_{i \in C_k} x_i$ minimizes $J$.
Since $J$ is bounded below by 0 and decreases monotonically, the algorithm converges.
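This monotonicity is easy to check numerically. The sketch below runs plain Lloyd iterations on random data, records $J$ after every assignment step, and asserts that it never increases (data and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = 3
centroids = X[rng.choice(len(X), size=K, replace=False)].copy()

J_history = []
for _ in range(20):
    # Assignment step: squared distance to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # J evaluated right after the assignment, before moving centroids
    J_history.append(float(d2[np.arange(len(X)), labels].sum()))
    # Update step: move each non-empty centroid to its cluster mean
    for k in range(K):
        if np.any(labels == k):
            centroids[k] = X[labels == k].mean(axis=0)

# Each full iteration can only decrease (or keep) the objective
assert all(J_history[t + 1] <= J_history[t] + 1e-9
           for t in range(len(J_history) - 1))
```

The small `1e-9` slack only guards against floating-point round-off; mathematically the sequence is non-increasing.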
Connection to EM Algorithm
K-Means can be viewed as a special case of the Expectation-Maximization algorithm for Gaussian Mixture Models with:
- Equal, fixed, spherical covariances
- Hard (0/1) cluster assignments instead of soft probabilities
Complexity Analysis
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| One iteration | $O(n \cdot K \cdot m)$ | $O(K \cdot m)$ |
| Full algorithm | $O(t \cdot n \cdot K \cdot m)$ | $O(K \cdot m + n)$ |
| K-Means++ init | $O(n \cdot K \cdot m)$ | $O(K \cdot m)$ |
Where $n$ = samples, $m$ = features, $K$ = clusters, $t$ = iterations.
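The k-means++ row in the table corresponds to the D² sampling described earlier; each of the K rounds scans all n points against the centroids chosen so far. A NumPy sketch (function name illustrative):

```python
import numpy as np

def kmeans_plus_plus_init(X, K, rng=None):
    """D^2 sampling: spread the initial centroids out (a sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(n)]]
    for _ in range(1, K):
        C = np.array(centroids)
        # Squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to D^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```

Each inner round is O(n · k · m) against the k centroids chosen so far, which sums to the O(n · K · m) listed above.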
Variants and Extensions
MINI-BATCH K-MEANS:
// Uses random subsets for faster updates on large datasets
FOR each iteration:
    batch ← random_sample(X, batch_size)
    // Assign batch points to clusters
    FOR x IN batch:
        label ← nearest_centroid(x)
    // Incremental centroid update
    FOR each cluster k:
        count[k] ← count[k] + batch_count[k]
        centroids[k] ← centroids[k] +
                       (batch_mean[k] - centroids[k]) × (batch_count[k] / count[k])
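A NumPy sketch of one such incremental step, following the update rule above (the function name and the running `counts` array are illustrative, not from a particular library):

```python
import numpy as np

def minibatch_step(X, centroids, counts, batch_size, rng):
    """One mini-batch update; `counts` tracks points seen per cluster."""
    batch = X[rng.choice(len(X), size=batch_size, replace=False)]
    # Assign each batch point to its nearest centroid
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for k in range(len(centroids)):
        members = batch[labels == k]
        if len(members) == 0:
            continue
        counts[k] += len(members)
        # Move the centroid toward the batch mean, weighted by the share
        # of the cluster's total mass contributed by this batch
        centroids[k] += (members.mean(axis=0) - centroids[k]) * (len(members) / counts[k])
    return centroids, counts
```

Because the step size shrinks as `counts[k]` grows, early batches move a centroid a lot and later batches refine it, which is what makes the method practical at web scale.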
K-MEDOIDS (PAM):
// Uses actual data points as cluster centers
// More robust to outliers (distances are not squared)
Objective: minimize Σ Σ ||x - medoid_k||
Constraint: medoid_k ∈ X (must be a data point)

Key Insights
- **Local optima:** K-Means finds only a local minimum of the objective; run it several times with different initializations and keep the best result.
- **K-Means++ matters:** Careful initialization dramatically improves both convergence speed and final quality.
- **Feature scaling:** Essential. Standardize features before clustering so no single feature dominates the distance.
- **Spherical clusters:** The algorithm assumes clusters are roughly spherical and of similar size.
- **Choosing K:** Use the elbow method, silhouette scores, or domain knowledge.
- **Outliers:** Squared distances make K-Means sensitive to outliers; consider K-Medoids or other robust alternatives.
- **Dimensionality:** Performance degrades in high dimensions as distances concentrate (curse of dimensionality).
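On the feature-scaling point, a minimal z-score sketch (scikit-learn's `StandardScaler` performs the same per-column standardization, with extra bookkeeping for transforming new data):

```python
import numpy as np

def standardize(X):
    """Z-score each feature: zero mean, unit variance per column."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant features unscaled
    return (X - mu) / sigma
```

Without this step, a feature measured in thousands (say, income) swamps a feature measured in single digits (say, age) inside the Euclidean distance.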
References
- Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory.
- Arthur, D., & Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. SODA 2007. http://ilpubs.stanford.edu:8090/778/
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Chapter 14. Springer. https://hastie.su.domains/ElemStatLearn/
- scikit-learn Documentation: KMeans. https://scikit-learn.org/stable/modules/clustering.html#k-means
- Sculley, D. (2010). Web-Scale K-Means Clustering. WWW 2010. (Mini-batch K-Means)