Category: Easy

  • ML0065 Random Forest III

    How do you choose the number of features in a random forest?

    Answer

    Select the number of features (m) using rules of thumb (default heuristics), then tune via cross-validation or out-of-bag (OOB) error to find the best value for your specific dataset.

    Default Heuristics:
    Classification: m = \sqrt{p}
    Regression: m = \frac{p}{3}
    Where:
    p = total number of features,
    m = number of features considered at each split.

    Bias-Variance Trade-off:
    (1) Smaller max_features will increase randomness, leading to less correlated trees (reducing variance) but potentially higher bias.
    (2) Larger max_features will decrease randomness, leading to more correlated trees (increasing variance) but potentially lower bias.

    Grid Search/Randomized Search:
    This is the most robust method. Define a range of possible max_features values and use cross-validation to evaluate the model’s performance for each value.

    Out-of-Bag (OOB) Error:
    Random Forests can estimate the generalization error internally using OOB samples. You can monitor the OOB error as you vary max_features to find the optimal value.
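    The OOB-based tuning described above can be sketched with scikit-learn's RandomForestClassifier. The synthetic dataset and the candidate max_features values below are illustrative assumptions, not prescriptions:

```python
# Assumption: scikit-learn is available; the dataset shape and the candidate
# max_features values are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

best_m, best_oob = None, -1.0
for m in [2, 4, 8, 12, 16]:  # sqrt(p) = sqrt(20) ~ 4 is the default heuristic
    rf = RandomForestClassifier(n_estimators=100, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    if rf.oob_score_ > best_oob:  # OOB accuracy: internal generalization estimate
        best_m, best_oob = m, rf.oob_score_

print(f"best max_features = {best_m}, OOB accuracy = {best_oob:.3f}")
```

    The same loop works with cross_val_score in place of the OOB score when you prefer cross-validation.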

    The figure below shows the cross-validation accuracy curve when using different numbers of features.


  • ML0063 Random Forest

    How does the random forest algorithm operate? Please outline its key steps.

    Answer

    Random Forest builds an ensemble of decision trees using bootstrapped samples and random feature subsets at each split. This combination reduces variance, combats overfitting, and improves predictive accuracy. The final output aggregates the predictions of all trees (majority vote for classification, averaging for regression).
    (1) Bootstrap Sampling: Create multiple subsets of the original training data by sampling with replacement (bootstrap samples).
    (2) Grow Decision Trees: For each bootstrap sample, train an unpruned decision tree.
    (3) Random Feature Selection: At each split in a tree, randomly select a subset of features. The split is chosen only among this random subset (increases diversity).
    (4) Aggregate Results with Voting or Averaging:
    Classification: Each tree votes for a class label. The majority vote is used.
    \hat{y} = \mathrm{mode}\,\{\, T_b(x) \,\},\quad b = 1,\ldots,B
    Where:
     T_b(x) = prediction of the b-th tree.
     B = total number of trees.

    Regression: Each tree predicts a numeric value. The average is used.
    \hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x)
    Where:
     T_b(x) = prediction of the b-th tree.
     B = total number of trees.
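    The aggregation step (4) can be sketched in plain Python. The helper names rf_classify and rf_regress are hypothetical, written only to illustrate voting and averaging:

```python
from collections import Counter

def rf_classify(tree_preds):
    # Classification: majority vote across the B trees' predicted labels
    return Counter(tree_preds).most_common(1)[0][0]

def rf_regress(tree_preds):
    # Regression: average of the B trees' numeric predictions
    return sum(tree_preds) / len(tree_preds)

print(rf_classify(["cat", "dog", "cat"]))  # cat
print(rf_regress([2.0, 3.0, 4.0]))         # 3.0
```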

    The example below shows the decision boundary differences between three decision trees and their random forest ensemble.


  • ML0062 Decision Tree

    Please explain how a decision tree works.

    Answer

    A decision tree partitions the input space into regions by recursively splitting on features that best separate the target variable. Each split aims to improve the “purity” of the resulting subsets, as measured by criteria such as Gini impurity or Entropy. Predictions are made by following the sequence of splits down to a leaf node and returning the most common class (classification) or average target (regression).

    Structure: A tree of nodes where each internal node tests a feature, branches represent feature outcomes, and leaves give predictions.

    Splitting Criterion: Chooses the best feature (and threshold) by maximizing purity—e.g., Information Gain, Gini Impurity, or Variance Reduction.

    Recursive Growth: Starting at the root, data is split, then the process recurses on each subset until stopping criteria (max depth, min samples, or pure leaves) are met.

    Prediction: A new sample “travels” from root to leaf by following feature-test branches; the leaf’s label or value is returned.
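    The root-to-leaf traversal can be sketched with a hand-built tree. The dict-based node layout here is an illustrative assumption, not how any particular library stores its trees:

```python
# Hypothetical node layout: internal nodes test feature <= threshold,
# leaves carry the predicted label.
tree = {
    "feature": 0, "threshold": 2.5,
    "left":  {"leaf": "A"},
    "right": {"feature": 1, "threshold": 1.0,
              "left": {"leaf": "A"}, "right": {"leaf": "B"}},
}

def predict(node, x):
    # Follow feature-test branches until a leaf is reached
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

print(predict(tree, [1.0, 5.0]))  # A: first split sends it left
print(predict(tree, [3.0, 2.0]))  # B: right at the root, then right again
```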

    The example below demonstrates using a Decision Tree on a 2-feature dataset for classification.



  • ML0061 KNN and K-means

    What are the key differences between KNN and K-means?

    Answer

    KNN (K-Nearest Neighbors) is a supervised algorithm that classifies data by considering the labels of its nearest neighbors, emphasizing prediction based on historical data. In contrast, K-Means is an unsupervised clustering technique that groups data points based solely on their similarity, without using any labels.

    Here are the key differences between KNN and K-Means:
    (1) Learning Type
    KNN: Supervised learning algorithm (used for classification/regression).
    K-Means: Unsupervised learning algorithm (used for clustering).
    (2) Objective
    KNN: Predict the label of a new sample based on the majority vote (or average) of its K nearest neighbors.
    K-Means: Partition the dataset into K clusters by minimizing intra-cluster distance.
    (3) Training
    KNN: No explicit training; it simply stores the entire training dataset.
    K-Means: Involves an iterative training process to learn cluster centroids.
    (4) Prediction
    KNN: Computationally expensive at prediction time; it computes the distance from the test point to every training point, sorts the distances, and selects the K nearest neighbors. Classification uses the majority vote; regression uses the average of the neighbors' values.
    K-Means: Fast and simple at inference; it computes the distance from a new data point to each of the K centroids and assigns the point to the nearest one (its predicted cluster).
    (5) Distance Metric Use
    KNN: Used to find neighbors.
    K-Means: Used to assign points to the nearest cluster center.
    (6) Output
    KNN: Outputs a label (classification) or value (regression).
    K-Means: Outputs cluster assignments and centroids.
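    The contrast shows up directly in scikit-learn: KNeighborsClassifier needs labels at fit time, while KMeans ignores them entirely. The toy data below is an illustrative assumption:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 1, 1])  # labels: required by KNN, never seen by K-means

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)       # supervised: stores (X, y)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: learns centroids

print(knn.predict([[4.0, 5.0]]))  # class label, from the neighbors' vote
print(km.predict([[4.0, 5.0]]))   # cluster index of the nearest centroid
```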

    The table below summarizes the comparison between KNN and K-Means.


  • ML0059 K-means II

    K-Means is widely used for clustering. Can you discuss its main benefits as well as its disadvantages?

    Answer

    K-Means clustering is a widely used unsupervised learning algorithm that partitions data points into K clusters, where each point belongs to the cluster with the nearest mean. While it is computationally efficient and easy to implement, it relies on prior specification of the number of clusters, assumes spherical clusters, and is sensitive to initialization and outliers.

    Objective Function of K-means:
    J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2
    Where:
     K is the number of clusters.
     C_i is the set of points in cluster  i .
     \mu_i is the centroid of cluster  i .
     x_j is a data point assigned to cluster  i .
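    The objective J can be evaluated directly from assignments and centroids. The helper kmeans_objective is hypothetical, written only to mirror the formula above:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    # J: sum over clusters of squared Euclidean distances to the centroid
    return sum(np.sum((X[labels == i] - mu) ** 2)
               for i, mu in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(kmeans_objective(X, labels, centroids))  # 4.0: each point is 1 unit from its centroid
```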

    Main Benefits of K-means:
    (1) Simple & Efficient: Fast to compute, easy to implement.
    (2) Scalable: Handles large datasets well.
    (3) Unsupervised Learning: Requires no labeled data.
    (4) Interpretable: Cluster centroids are intuitive and interpretable.
    (5) Works Well on Spherical Clusters: Performs best when clusters are compact and well-separated.

    Main Disadvantages of K-means:
    (1) Must Specify  K : The number of clusters must be known in advance.
    (2) Sensitive to Initialization: Poor starting points may lead to suboptimal clustering.
    (3) Assumes Spherical Clusters: Fails on clusters with irregular shapes or varying densities.
    (4) Affected by Outliers: Outliers can skew centroids and degrade performance.
    (5) Uses Euclidean Distance: Standard K-means relies on Euclidean distance, so it is not suitable for non-numeric or categorical features.

    Examples below demonstrate K-Means performance on spherical clusters and on irregularly shaped clusters.


  • ML0057 K-means

    Please explain how K-means works.

    Answer

    K-means is an iterative unsupervised algorithm that groups data into  K clusters by minimizing intra-cluster distances. It alternates between assigning points to the nearest centroid and updating centroids until convergence. It is fast and easy to implement, but sensitive to initialization and non-convex cluster shapes.

    Goal of K-means: Partition data into  K clusters by minimizing within-cluster variance.

    K-means Steps:
    (1) Initialization: Randomly choose  K centroids.
    (2) Assignment step: Assign each point to the closest centroid using Euclidean distance, given by:
     d(x, c_k) = \sqrt{\sum_{i=1}^{n} (x_i - c_{k,i})^2}
    Where:
     x is the data point.
     c_k is the cluster center.
    (3) Update Step: Compute new cluster centers as the mean of all points assigned to that cluster by:
     c_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
    Where:
     C_k is the set of points assigned to cluster  k , and  |C_k| is the number of points in it.
    (4) Convergence: Repeat assignment and update steps until cluster centers stabilize or a stopping criterion is met.
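    The four steps above can be sketched as a minimal NumPy implementation. It assumes k distinct starting points and that no cluster ever empties out, edge cases a production implementation must also handle:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (2) Assignment: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Update: each centroid moves to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # (4) Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the two well-separated blobs receive different labels
```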

    The example below shows K-means clustering.



  • ML0055 KNN Regression

    Please explain how KNN Regression works.

    Answer

    K-Nearest Neighbors (KNN) Regression is a simple, instance-based learning algorithm that predicts a continuous output by finding the K nearest data points in the training set and averaging their output values. Its simplicity makes it easy to understand and implement, but its performance can be sensitive to the choice of K and the distance metric. Additionally, it can be computationally expensive during the prediction phase, especially for large datasets, since it requires comparing the query point to all training samples. KNN is often referred to as a “lazy learner” because it doesn’t build an explicit model during training—it simply memorizes the training data.
    (1) Instance-based method: KNN doesn’t learn an explicit model; it stores the training data and makes predictions based on similarity.
    (2) Distance-based: It finds the K nearest neighbors to a query point using a distance metric (commonly Euclidean distance).
    (3) Averaging Neighbors: The predicted value is the average of the target values of these K nearest neighbors.
    (4) Sensitive to K and distance metric: The performance depends on the choice of  K and how distance is measured (Euclidean, Manhattan, etc.).
    (5) No training phase: All computation happens during prediction (also called lazy learning).

    Below is the equation for the Prediction Calculation in KNN Regression:
    \hat{y} = \frac{1}{K}\sum_{i=1}^{K} y_i
    Where:
    \hat{y} is the predicted value for the query point.
     y_i represents the target value of the i‑th nearest neighbor.
     K denotes the number of neighbors considered.
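    The prediction formula above can be sketched in a few lines of NumPy. The helper knn_regress is hypothetical, not a library function:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    # Lazy learning: all work happens at prediction time.
    # Distance from the query point to every training point:
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the K closest points
    return y_train[nearest].mean()    # average their target values

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(X_train, y_train, np.array([2.1]), k=3))  # 2.0 = mean(1, 2, 3)
```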

    The example below shows KNN for regression.


  • ML0054 KNN Classification

    Please explain how KNN classification works.

    Answer

    K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm that predicts a label by majority vote among the  K nearest neighbors of a test point, using a chosen distance metric. It is intuitive and effective for small datasets, though less efficient on large-scale data.
    (1) Instance-based method: KNN doesn’t learn an explicit model; it stores the training data and makes predictions based on similarity.
    (2) Distance-based classification: For a test point  \mathbf{x} , it computes the distance to every training point (e.g., Euclidean distance).
    (3) Majority vote: It selects the  K closest neighbors and assigns the label that appears most frequently among them.
    (4) Sensitive to K and distance metric: The performance depends on the choice of  K and how distance is measured (Euclidean, Manhattan, etc.).
    (5) No training phase: All computation happens during prediction (also called lazy learning).

    Below is the equation for Euclidean Distance calculation:
    \mbox{distance}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
    Where:
     x_i and  y_i are the i-th features of the new point and the training point, respectively.
     n is the number of features.

    Below is the equation for the Voting Rule in KNN classification:
    \hat{y} = \arg\max_{c \in \mathcal{C}} \sum_{i=1}^{K} \mathbb{1}(y_i = c)
    Where:
     \hat{y} is the predicted class label for the query point.
     \mathcal{C} represents the set of all possible classes.
     K is the number of nearest neighbors considered.
     y_i is the class label of the i-th neighbor.
     \mathbb{1}(y_i = c) is an indicator function, returning 1 if the neighbor’s class is  c , and 0 otherwise.
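    The distance computation and voting rule above can be sketched together in NumPy. The helper knn_classify is hypothetical, written only to illustrate the two equations:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # K closest training points
    votes = Counter(y_train[i] for i in nearest)       # tally the neighbors' labels
    return votes.most_common(1)[0][0]                  # majority vote = argmax count

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array(["red", "red", "red", "blue", "blue"])
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), k=3))  # red
```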

    The example below shows KNN for classification.


  • ML0053 Hinge Loss for SVM

    Explain the Hinge Loss function used in SVM.

    Answer

    The Hinge Loss function is a key element in Support Vector Machines that penalizes both misclassified points and correctly classified points that lie within the decision margin. It assigns zero loss to points that are correctly classified and lie outside or exactly on the margin, and applies a linearly increasing loss as points move closer to or across the decision boundary. This loss structure encourages the SVM to maximize the margin between classes, promoting robust and generalizable decision boundaries.

    The Hinge Loss is defined as follows.
     \text{Hinge Loss} = \max(0,\ 1 - y \cdot f(\mathbf{x}))
    Where:
     y \in \{-1, +1\} is the true label,
     f(\mathbf{x}) is the raw model output.

    Hinge Loss is plotted in the figure below.

    Zero Loss: When  y \cdot f(\mathbf{x}) \ge 1 , meaning the point is correctly classified with margin.
    Positive Loss: When  y \cdot f(\mathbf{x}) < 1 , the point is either inside the margin or misclassified.
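    The two regimes above can be checked numerically; hinge_loss is a hypothetical helper that evaluates the formula elementwise:

```python
import numpy as np

def hinge_loss(y, f_x):
    # max(0, 1 - y * f(x)), applied elementwise
    return np.maximum(0.0, 1.0 - y * f_x)

y   = np.array([+1, +1, +1, -1])
f_x = np.array([2.0, 1.0, 0.5, 0.5])
# y*f(x) = 2.0 (outside margin), 1.0 (on margin), 0.5 (inside), -0.5 (misclassified)
print(hinge_loss(y, f_x))  # [0.  0.  0.5 1.5]
```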


  • ML0051 Linear SVM

    Can you explain the key concepts behind a Linear Support Vector Machine?

    Answer

    A Linear Support Vector Machine (Linear SVM) is a classifier that finds the optimal straight-line (or hyperplane) separating two classes by maximizing the margin between them. It relies on a few critical points (support vectors) and offers strong generalization, especially for linearly separable data.

    Key Concepts of a Linear Support Vector Machine:
    (1) Hyperplane: A decision boundary that separates data points of different classes.
    (2) Margin: The distance between the hyperplane and the nearest data points from each class.
    (3) Support Vectors: Data points that lie closest to the hyperplane and define the margin.
    (4) Objective: Maximize the margin while minimizing classification errors.

    Here is the Linear SVM Decision Function:
     f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b
    Where:
     \mathbf{x} is the input feature vector.
     \mathbf{w} is the weight vector.
     b is the bias term.

    Here is the Linear SVM Classification Rule:
     \hat{y} = \mbox{sign}(\mathbf{w}^\top \mathbf{x} + b) = \mbox{sign}(f(\mathbf{x}))
    Where:
     \hat{y} is the predicted class label.
     \mbox{sign}(\cdot) returns +1 if the argument is ≥ 0, and −1 otherwise.

    For Hard Margin SVM, here is the Optimization Objective:
     \min_{\mathbf{w}, b} \quad \frac{1}{2} \|\mathbf{w}\|^2
    Subject to:
     y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all } i
    Where:
     y_i \in \{-1, +1\} is the class label for the i-th data point.
     \mathbf{x}_i is the i-th feature vector.
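    These pieces fit together in a few lines with scikit-learn's SVC. The toy data is an illustrative assumption, and a very large C is used to approximate the hard-margin objective:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]       # the learned w and bias term

pred = np.sign(X @ w + b)  # classification rule: sign(w^T x + b)
print(pred)                # matches y on this linearly separable data
print(clf.support_)        # indices of the support vectors defining the margin
```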

    The example below shows Hard Margin SVM for solving a classification task.

