Tag: Kmeans

  • ML0061 KNN and K-means

    What are the key differences between KNN and K-means?

    Answer

    KNN (K-Nearest Neighbors) is a supervised algorithm that classifies a data point based on the labels of its nearest neighbors, emphasizing prediction from labeled historical data. In contrast, K-Means is an unsupervised clustering technique that groups data points based solely on similarity, without using any labels.

    Here are the key differences between KNN and K-Means:
    (1) Learning Type
    KNN: Supervised learning algorithm (used for classification/regression).
    K-Means: Unsupervised learning algorithm (used for clustering).
    (2) Objective
    KNN: Predict the label of a new sample based on the majority vote (or average) of its K nearest neighbors.
    K-Means: Partition the dataset into K clusters by minimizing intra-cluster distance.
    (3) Training
    KNN: No explicit training; it simply stores the entire training dataset.
    K-Means: Involves an iterative training process to learn cluster centroids.
    (4) Prediction
    KNN: Computationally expensive at prediction time: it computes the distance from the test point to every training point, sorts the distances, and selects the top  K nearest neighbors; the majority vote gives the classification (or the average of neighbor values for regression).
    K-Means: Fast and simple at inference: compute the distance from a new data point to each of the  K centroids and assign it to the nearest centroid (the predicted cluster).
    (5) Distance Metric Use
    KNN: Used to find neighbors.
    K-Means: Used to assign points to the nearest cluster center.
    (6) Output
    KNN: Outputs a label (classification) or value (regression).
    K-Means: Outputs cluster assignments and centroids.
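    The contrast above can be sketched in a few lines of pure Python (toy data; `knn_predict` and `kmeans_assign` are illustrative helper names, not a library API): KNN votes over the labels of stored points, while K-Means only measures distance to unlabeled centroids.

```python
from collections import Counter

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_predict(train, labels, x, k=3):
    # Supervised: vote among the labels of the k nearest stored points.
    nearest = sorted(range(len(train)), key=lambda i: euclidean(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def kmeans_assign(centroids, x):
    # Unsupervised: just the index of the nearest centroid; no labels involved.
    return min(range(len(centroids)), key=lambda j: euclidean(centroids[j], x))

train = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels = ["a", "a", "b", "b"]
print(knn_predict(train, labels, (0.5, 0.5)))       # 3 nearest neighbors vote "a"
print(kmeans_assign([(0, 0.5), (5.5, 5)], (6, 6)))  # nearest centroid: index 1
```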

    The table below summarizes the comparison between KNN and K-Means.


  • ML0060 K Selection in K-Means

    How to select K in K-Means?

    Answer

    To select the optimal number of clusters  K in K-Means, use a visual diagnostic like the elbow method, a quantitative metric like the silhouette score, or a statistical method like the gap statistic. These help balance model fit and generalization without overfitting.

    Elbow Method:
    (1) Plot the within-cluster sum of squares (WCSS) vs.  K .
    (2) Choose the “elbow” point where the rate of improvement slows.
    WCSS can be calculated using the following equation:
     \text{WCSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2
    Where:
     C_k is cluster  k ,
     \mu_k is its centroid.

    The plot below demonstrates the location of the elbow point.
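    As a rough sketch of the elbow scan (toy data with three well-separated groups; the `kmeans` and `wcss` helpers below are a compact illustration of Lloyd's algorithm, not a library implementation):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    # Compact Lloyd's algorithm: assign to nearest centroid, then re-average.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        centroids = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

def wcss(centroids, clusters):
    # Sum of squared distances from each point to its cluster's centroid.
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
               for j, c in enumerate(clusters) for p in c)

# Three well-separated groups: WCSS drops sharply until K = 3, then flattens.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
curve = {k: wcss(*kmeans(points, k)) for k in (1, 2, 3, 4)}
print(curve)
```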

    Silhouette Score:
    The silhouette score measures how well each point lies within its cluster. It ranges from -1 (wrong clustering) to 1 (well-clustered).
    (1) Calculate the average silhouette score for different  K values.
    (2) Choose the  K that yields the highest average silhouette score.
    Silhouette coefficient for point  i can be calculated by the following equation.
     s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
    Where:
     a(i) = the average distance from point  i to the other points in its own cluster (intra-cluster distance),
     b(i) = the lowest average distance from point  i to the points of any other cluster (nearest-cluster distance).
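    A minimal sketch of the silhouette coefficient for a single point (pure Python; assumes `own_cluster` holds the point's cluster neighbors, excluding the point itself):

```python
def distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(point, own_cluster, other_clusters):
    # a(i): average distance to the other points in the point's own cluster.
    a = sum(distance(point, p) for p in own_cluster) / len(own_cluster)
    # b(i): lowest average distance to the points of any other cluster.
    b = min(sum(distance(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

own = [(0, 1), (1, 0)]            # tight cluster around the point
others = [[(10, 10), (11, 10)]]   # one distant cluster
s = silhouette((0, 0), own, others)
print(round(s, 3))                # close to 1: the point is well-clustered
```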

    Gap Statistic:
    (1) Compare the observed WCSS against the WCSS expected under a random reference distribution (e.g., uniform over the data's range).
    (2) Choose the  K that maximizes the gap between the expected and observed WCSS.



  • ML0059 K-means II

    K-Means is widely used for clustering. Can you discuss its main benefits as well as its disadvantages?

    Answer

    K-Means clustering is a widely used unsupervised learning algorithm that partitions data points into K clusters, where each point belongs to the cluster with the nearest mean. While it is computationally efficient and easy to implement, it relies on prior specification of the number of clusters, assumes spherical clusters, and is sensitive to initialization and outliers.

    Objective Function of K-means:
    J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2
    Where:
     K is the number of clusters.
     C_i is the set of points in cluster  i .
     \mu_i is the centroid of cluster  i .
     x_j is a data point assigned to cluster  i .

    Main Benefits of K-means:
    (1) Simple & Efficient: Fast to compute, easy to implement.
    (2) Scalable: Handles large datasets well.
    (3) Unsupervised Learning: Requires no labeled data.
    (4) Interpretable: Cluster centroids are intuitive and interpretable.
    (5) Works Well on Spherical Clusters: Performs best when clusters are compact and well-separated.

    Main Disadvantages of K-means:
    (1) Must Specify  K : The number of clusters must be known in advance.
    (2) Sensitive to Initialization: Poor starting points may lead to suboptimal clustering.
    (3) Assumes Spherical Clusters: Fails on clusters with irregular shapes or varying densities.
    (4) Affected by Outliers: Outliers can skew centroids and degrade performance.
    (5) Euclidean Distance Only: Standard K-Means relies on Euclidean distance, so it is not suitable for non-numeric or categorical features.
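    Disadvantage (4) is easy to see numerically: because the centroid update is a plain mean, a single outlier drags the centroid far from the bulk of the cluster. A minimal sketch with toy numbers:

```python
def centroid(points):
    # The K-means update step: coordinate-wise mean of the assigned points.
    return tuple(sum(coord) / len(points) for coord in zip(*points))

cluster = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(centroid(cluster))                 # (0.5, 0.5), the true center
print(centroid(cluster + [(100, 100)]))  # (20.4, 20.4), dragged by one outlier
```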

    The examples below demonstrate K-Means performance on spherical clusters and irregularly shaped clusters.


  • ML0058 K-means++

    Please explain how K-means++ works.

    Answer

    K-means++ is an improved way to initialize centroids in K-means. K-means++ selects initial centroids one by one using a weighted probability based on squared distances from already chosen centroids. This spreads out the centroids more effectively, reducing the chances of poor clustering and helping the algorithm converge faster and more reliably.

    K-means++ Steps:
    (1) Choose the first centroid  \mu_1 uniformly at random from the dataset.
    (2) For each point  x_i , compute its squared distance to the nearest already chosen centroid:
     D(x_i)^2 = \min_{1 \le j \le m} \|x_i - \mu_j\|^2
    Where:
     \mu_j is one of the  m centroids chosen so far.
    (3) Choose the next centroid  \mu_{m+1} with probability:
     P(x_i) = \frac{D(x_i)^2}{\sum_j D(x_j)^2}
    Where:
     D(x_i)^2 is the squared distance from point  x_i to its nearest already chosen centroid.
     \sum_j D(x_j)^2 is the sum of minimum squared distances from all data points to their nearest chosen centroid.
    (4) Repeat until  K centroids are chosen.
    (5) Then proceed with standard K-means clustering.
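    The steps above can be sketched in plain Python (a toy illustration with a fixed seed; `kmeans_pp_init` is a hypothetical helper name, not a library function):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    rng = random.Random(seed)
    # Step (1): first centroid chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step (2): D(x)^2 = squared distance to the nearest chosen centroid.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points]
        # Step (3): sample the next centroid with probability D(x)^2 / sum D^2.
        r, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    # Step (4) done once K centroids are chosen; step (5) would hand
    # these centroids to standard K-means.
    return centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_pp_init(points, 2))  # far-apart points are much likelier picks
```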

    The example below illustrates K-means++ clustering.


  • ML0057 K-means

    Please explain how K-means works.

    Answer

    K-means is an iterative unsupervised algorithm that groups data into  K clusters by minimizing intra-cluster distances. It alternates between assigning points to the nearest centroid and updating centroids until convergence. It is fast and easy to implement, but sensitive to initialization and non-convex cluster shapes.

    Goal of K-means: Partition data into  K clusters by minimizing within-cluster variance.

    K-means Steps:
    (1) Initialization: Randomly choose  K centroids.
    (2) Assignment step: Assign each point to the closest centroid using Euclidean distance, given by:
     d(x, c_k) = \sqrt{\sum_{i=1}^{n} (x_i - c_{k,i})^2}
    Where:
     x is the data point.
     c_k is the cluster center.
    (3) Update Step: Compute new cluster centers as the mean of all points assigned to that cluster by:
     c_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
    Where:
     C_k is the set of points assigned to cluster  k , and  |C_k| is the number of points in it.
    (4) Convergence: Repeat assignment and update steps until cluster centers stabilize or a stopping criterion is met.
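    The four steps map directly onto a short pure-Python sketch (toy data; an illustration of Lloyd's algorithm, not an optimized implementation — squared distance gives the same nearest centroid as Euclidean distance):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # (1) Initialization: pick K random data points as starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # (2) Assignment: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # (3) Update: each centroid moves to the mean of its assigned points.
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        # (4) Convergence: stop once the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # the two group means, near (0.33, 0.33) and (10.33, 10.33)
```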

    The example below illustrates K-means clustering.

