ML0060 K Selection in K-Means

How to select K in K-Means?

Answer

To select the optimal number of clusters K in K-Means, use a visual diagnostic such as the elbow method, a quantitative metric such as the silhouette score, or a statistical method such as the gap statistic. Each helps find a K that fits the data well without adding clusters that only model noise.

Elbow Method:
(1) Plot the within-cluster sum of squares (WCSS) vs. K.
(2) Choose the “elbow” point where the rate of improvement slows.
WCSS can be calculated using the following equation:
 \text{WCSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2
Where:
 C_k is cluster k,
 \mu_k is its centroid.

[Figure: example plot of WCSS vs. K showing the location of the elbow point.]
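
The elbow procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helpers `sq_dist`, `kmeans`, `wcss`, and `best_wcss`, the synthetic three-blob data, and the choice of five random restarts are all assumptions made for the example. In practice a library such as scikit-learn would normally do the clustering.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper); returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wcss(points, centroids, labels):
    """Within-cluster sum of squares, as in the WCSS(K) formula."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def best_wcss(points, k, n_restarts=5):
    """Lowest WCSS over a few random restarts, to reduce the effect of bad initializations."""
    return min(wcss(points, *kmeans(points, k, seed=s)) for s in range(n_restarts))

# Synthetic data (assumed for illustration): three well-separated 2-D blobs.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers for _ in range(30)]

wcss_by_k = {k: best_wcss(points, k) for k in range(1, 7)}
for k, w in wcss_by_k.items():
    print(f"K={k}: WCSS={w:.1f}")
```

Plotting `wcss_by_k` against K and looking for the point where the curve flattens gives the elbow. For data like this, the sharp drop would be expected to end around K = 3, although the elbow is a visual judgment and is not always this clear on real data.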

Silhouette Score:
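The silhouette score measures how well each point fits within its assigned cluster compared with the nearest other cluster. It ranges from -1 (the point is likely assigned to the wrong cluster) to 1 (the point is well matched to its cluster); values near 0 indicate overlapping clusters.
(1) Calculate the average silhouette score for different K values.
(2) Choose the K that yields the highest average silhouette score.
The silhouette coefficient for point i can be calculated by the following equation:
 s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
Where:
 a(i) = mean distance from point i to the other points in its own cluster (intra-cluster distance),
 b(i) = mean distance from point i to the points in the nearest other cluster (nearest-cluster distance).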
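
As a rough illustration of the formula for s(i), the sketch below computes per-point silhouette coefficients with the standard library only. The function name `silhouette_scores`, the toy points, and the two labelings are assumptions made for the example. Giving singleton clusters a score of 0 is a common convention, not something stated above.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)).

    Hypothetical helper; assumes at least two clusters and distinct points.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data (assumed): two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_scores = silhouette_scores(points, [0, 0, 0, 1, 1, 1])  # matches the groups
bad_scores = silhouette_scores(points, [0, 1, 0, 1, 0, 1])   # mixes the groups

mean_good = sum(good_scores) / len(good_scores)
mean_bad = sum(bad_scores) / len(bad_scores)
print(f"good labeling: {mean_good:.2f}, mixed labeling: {mean_bad:.2f}")
```

To select K, the same average would be computed for the labels produced by K-Means at each candidate K, and the K with the highest average kept. The score is undefined for K = 1, so the search starts at K = 2.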

Gap Statistic:
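(1) Compare the observed WCSS with the WCSS expected under a random reference distribution that has no cluster structure (for example, points drawn uniformly over the range of the data).
(2) Choose the K that maximizes the gap between the expected and observed WCSS, usually measured on a log scale.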
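
For reference, the gap can be written out as follows. This is a sketch in the notation of the WCSS formula above. The symbols B (the number of reference data sets) and \text{WCSS}^{*}_{b}(K) (the WCSS obtained by clustering the b-th reference data set into K clusters) are introduced here for illustration and do not appear in the original text.

```latex
\text{Gap}(K)
  = \mathbb{E}^{*}\!\left[\log \text{WCSS}^{*}(K)\right] - \log \text{WCSS}(K)
  \approx \frac{1}{B} \sum_{b=1}^{B} \log \text{WCSS}^{*}_{b}(K) - \log \text{WCSS}(K)
```

A large gap means the data cluster much more tightly at that K than structureless data would. The original formulation of the method refines "maximize the gap" slightly: it picks the smallest K such that \text{Gap}(K) \ge \text{Gap}(K+1) - s_{K+1}, where s_{K+1} is a standard-error term estimated from the B reference runs.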

