Algorithm
K-Means Clustering
Description
An iterative unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping subgroups (clusters), assigning each data point to the cluster with the nearest mean. The algorithm minimizes the within-cluster sum of squares (WCSS). The implementation uses the classic Lloyd's algorithm, with tight C++ loops for distance computation and centroid updates.
$$ \arg\min_{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2 $$
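The WCSS objective above can be evaluated directly from points, labels, and centroids. A minimal sketch, assuming simple `std::vector` containers (the function name and signature are illustrative, not part of `KMeans.cpp`):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Within-cluster sum of squares: sum of squared Euclidean distances
// from each point to the centroid of its assigned cluster.
double wcss(const std::vector<std::vector<double>>& X,
            const std::vector<int>& labels,
            const std::vector<std::vector<double>>& centroids) {
    double total = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const auto& mu = centroids[labels[i]];
        for (std::size_t d = 0; d < mu.size(); ++d) {
            double diff = X[i][d] - mu[d];
            total += diff * diff;   // squared distance, no sqrt needed
        }
    }
    return total;
}
```

For example, two points at (0,0) and (2,0) both assigned to a centroid at (1,0) give a WCSS of 1 + 1 = 2.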
Algorithm Workflow
START
Specify the number of clusters, $K$.
INIT
Randomly select $K$ data points as initial centroids
$\mu^{(0)}$.
ASSIGN
For every point $x_i$, find the nearest centroid $c_j$ minimizing
$||x_i - \mu_j||^2$.
UPDATE
Recalculate each centroid $\mu_j$ as the mean of all points assigned
to it.
CHECK
If no centroid has moved (or the total movement is below a tolerance $\epsilon$), STOP.
LOOP
Otherwise, repeat the Assign and Update steps.
END
Return cluster labels and final centroid locations.
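The full loop above can be sketched end to end. This is a minimal, self-contained version, not the actual `KMeans.cpp` code: for determinism it seeds centroids from the first $K$ points rather than a random selection, and all names are illustrative.

```cpp
#include <cmath>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Lloyd's algorithm sketch: returns a cluster label per row of X.
// Assumes X.size() >= k and all rows share the same dimension.
std::vector<int> kmeans(const Matrix& X, int k, int max_iter = 100,
                        double eps = 1e-6) {
    const std::size_t n = X.size(), p = X[0].size();
    Matrix centroids(X.begin(), X.begin() + k);   // INIT (deterministic here)
    std::vector<int> labels(n, 0);
    for (int iter = 0; iter < max_iter; ++iter) {
        // ASSIGN: nearest centroid by squared Euclidean distance
        for (std::size_t i = 0; i < n; ++i) {
            double best = std::numeric_limits<double>::max();
            for (int j = 0; j < k; ++j) {
                double d = 0.0;
                for (std::size_t t = 0; t < p; ++t) {
                    double diff = X[i][t] - centroids[j][t];
                    d += diff * diff;
                }
                if (d < best) { best = d; labels[i] = j; }
            }
        }
        // UPDATE: each centroid becomes the mean of its assigned points
        Matrix sums(k, std::vector<double>(p, 0.0));
        std::vector<int> counts(k, 0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t t = 0; t < p; ++t) sums[labels[i]][t] += X[i][t];
            ++counts[labels[i]];
        }
        // CHECK: stop once no centroid moves more than eps in total
        double shift = 0.0;
        for (int j = 0; j < k; ++j) {
            if (counts[j] == 0) continue;         // keep empty-cluster centroid
            for (std::size_t t = 0; t < p; ++t) {
                double m = sums[j][t] / counts[j];
                shift += std::fabs(m - centroids[j][t]);
                centroids[j][t] = m;
            }
        }
        if (shift < eps) break;                   // converged
    }
    return labels;                                // END
}
```

With two well-separated groups of points, the sketch assigns each group its own label regardless of which label index it receives.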
Implementation Details
Implemented in `KMeans.cpp`.
while (changed && iter < max_iter) {
    changed = false;
    for (const auto& row : data) {
        // Assignment: index of the nearest centroid
        int c = nearest_centroid(row, centroids);
        // Update: accumulate the running sum and count for cluster c
        new_centroids[c] += row;   // element-wise add (vector type)
        counts[c]++;
    }
    // Finalize: divide each sum by its count, set `changed` if any
    // centroid moved, then swap new_centroids into centroids.
    ++iter;
}
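The loop relies on a `nearest_centroid` helper. A plausible sketch of it follows; the signature is assumed and may differ from the one in `KMeans.cpp`:

```cpp
#include <limits>
#include <vector>

// Returns the index of the centroid with the smallest squared
// Euclidean distance to `row`. Squared distance preserves the
// ordering, so the sqrt is skipped.
int nearest_centroid(const std::vector<double>& row,
                     const std::vector<std::vector<double>>& centroids) {
    int best = 0;
    double best_d = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < centroids.size(); ++j) {
        double d = 0.0;
        for (std::size_t t = 0; t < row.size(); ++t) {
            double diff = row[t] - centroids[j][t];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = static_cast<int>(j); }
    }
    return best;
}
```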
Complexity & Optimization
Time Complexity
O(I * N * K * P), where I is the number of iterations, N the number of points, K the number of clusters, and P the number of features (dimensions).
Space Complexity
O(K * P) for the centroids, plus O(N) for the cluster labels.
Optimizations
Centroid sums and counts are accumulated in place during the assignment pass, so each iteration needs only a single sweep over the data.
Limitations
Converges only to a local optimum of the WCSS, so results depend on the random initialization; multiple restarts or smarter seeding (e.g. k-means++) are common remedies. K must also be chosen in advance.
Use Cases
Exploratory clustering tasks such as customer segmentation, image color quantization, and document grouping.