Category: Easy

  • ML0039 Distributed Training

    What are the two main distributed training approaches for machine learning?

    Answer

    The two main distributed training approaches for machine learning are Data Parallelism and Model Parallelism.

    Data Parallelism: In this approach, the training dataset is divided and distributed among multiple computing devices, with each device holding a complete copy of the machine learning model. Each device then trains its model on its assigned subset of the data in parallel. After each training step, the updates (gradients or new parameters) from all devices are aggregated and synchronized to maintain a consistent model across the system. This method is highly effective for scaling training when the dataset is large and the model can fit within a single device’s memory.

    Model Parallelism: This approach is used when the machine learning model itself is too large to fit into the memory of a single computing device. In model parallelism, different parts of the model (e.g., specific layers of a neural network) are distributed across multiple devices. Data typically flows sequentially through these distributed model parts. This necessitates more complex communication between devices as intermediate computations and activations must be passed along. Model parallelism is crucial for training extremely large models that would otherwise be computationally intractable on a single machine.
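The data-parallel idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not any framework's API: names like `shard_gradient` and `data_parallel_step` are our own, and two simulated "devices" each compute a gradient on their shard of a 1-D linear model, the gradients are averaged (the all-reduce a real system performs), and the shared weight is updated once:

```python
# Toy data parallelism for y = w * x with MSE loss (illustrative names only).
def shard_gradient(w, xs, ys):
    """Mean-squared-error gradient dL/dw on one shard of the data."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, shards, lr=0.02):
    """Each 'device' computes a gradient on its shard; the gradients are
    averaged (an all-reduce in a real system) and the weight updated once,
    so every device keeps an identical copy of the model."""
    grads = [shard_gradient(w, xs, ys) for xs, ys in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]          # true relationship: w = 2
# Two "devices", each holding half the data. Equal shard sizes keep the
# averaged gradient identical to the full-batch gradient.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

With equal shard sizes, averaging the per-shard gradients reproduces the full-batch gradient exactly, which is why the devices stay consistent.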


  • ML0038 Validation and Test


    What are the key purposes of using both a validation and a test set when building machine learning models?

    Answer

    Using a validation set separates model development from tuning, enabling informed hyperparameter decisions and overfitting control, while reserving a test set ensures a completely unbiased, final assessment of how the model will perform in real‑world, unseen scenarios.

    Validation Set:
    (1) Tune Hyperparameters: Optimize model settings without test set bias.
    (2) Select Best Model: Compare different models objectively during development.
    (3) Prevent Overfitting (During Training): Monitor performance on unseen data to stop training early if needed.

    Test Set:
    (1) Final, Unbiased Evaluation: Assess the truly generalized performance of the final model.
    (2) Simulate Real-World Performance: Estimate how the model will perform on completely new data.
    (3) Avoid Data Leakage: Ensure no information from the test set influences model building.
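The three-way split can be sketched with only the standard library (the helper `split_dataset` and its fractions are illustrative choices, not a library API; real projects typically reach for `sklearn.model_selection.train_test_split`):

```python
import random

def split_dataset(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve off held-out validation and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]                     # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]               # final, touch-once evaluation
    val = shuffled[n_test:n_test + n_val]  # tuning / early stopping
    train = shuffled[n_test + n_val:]      # model fitting
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

The test slice is taken first and never consulted during development, which is exactly what keeps the final evaluation unbiased.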


  • ML0037 Bias in NN

    Why is bias used in neural networks?

    Answer

    Bias in Neural Networks is used to introduce flexibility and adaptability in learning.
    (1) Shifts Activation Threshold: Allows a neuron’s activation function to move left or right, so it can fire even when inputs sum to zero.
    (2) Avoids Origin Constraint: Lets decision boundaries and fitted functions not be forced through the origin (0,0).
    (3) Increases Flexibility: Provides an extra learnable parameter for better approximation of complex functions.
    (4) Compensates for Imbalance: Helps adjust for biases in data or features.
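Point (1) can be seen with a single sigmoid neuron (a minimal sketch; the `neuron` helper is our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(x, w, b):
    """Single sigmoid neuron: activation of the weighted input plus bias."""
    return sigmoid(w * x + b)

# Without a bias, the output at x = 0 is pinned to sigmoid(0) = 0.5 and the
# decision threshold is forced through the origin, whatever w is.
no_bias = neuron(0.0, w=3.0, b=0.0)

# A negative bias shifts the activation curve right: the neuron now needs a
# larger input before it "fires" above 0.5.
with_bias = neuron(0.0, w=3.0, b=-2.0)   # sigmoid(-2) is roughly 0.12
```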



  • ML0036 Confusion Matrix

    In which scenarios is a Confusion Matrix most useful for evaluating machine learning models, and why?

    Answer

    A Confusion Matrix is a table that visualizes the performance of a classification model by comparing the predicted and actual class labels. It displays the counts of True Positives (correctly predicted positives), True Negatives (correctly predicted negatives), False Positives (incorrectly predicted positives), and False Negatives (incorrectly predicted negatives). While its form is simple, it becomes indispensable whenever you need more insight than overall accuracy. Below are the key scenarios where a confusion matrix shines.

    (1) Imbalanced Datasets: Reveals if the minority class is being predicted well, unlike overall accuracy.
    (2) Understanding Error Types: Shows True Positives, True Negatives, False Positives, and False Negatives, which is crucial when different errors have different costs (e.g., medical tests, fraud detection).
    (3) Multi-Class Classification: Identifies which specific classes are being confused.
    (4) Comparing Models: A detailed comparison of model strengths and weaknesses beyond overall accuracy.

    For binary classification, the confusion matrix is a 2x2 table of these four counts.
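A minimal sketch of the bookkeeping, assuming a hand-rolled `binary_confusion` helper (sklearn's `confusion_matrix` does the same counting):

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count TP/FP/TN/FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive: tp += 1
            else:             fp += 1
        else:
            if t == positive: fn += 1
            else:             tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Imbalanced example: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
cm = binary_confusion(y_true, y_pred)
# Accuracy looks fine (8/10 correct), but the matrix shows one of the two
# positives was missed (FN = 1) -- exactly what accuracy alone hides.
```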



  • ML0035 Model Comparison

    How to compare different machine learning models?

    Answer

    Compare machine learning models by defining clear objectives and metrics, using consistent data splits, training and tuning each model, and evaluating them through robust metrics and statistical tests. Finally, consider trade-offs like model complexity and interpretability to make an informed choice.
    (1) Choose Relevant Metrics: Select evaluation metrics that align with your task (e.g., accuracy or F1 for classification, ROC/AUC for comparing classifiers across decision thresholds).

    (2) Use Consistent Data Splits: Evaluate all models on the same train/validation/test splits—or identical cross-validation folds—to ensure fairness.
    (3) Apply Cross-Validation: Employ k-fold or nested cross-validation to reduce variance in performance estimates, especially with limited data.
    (4) Control Randomness: Run each model multiple times with different random seeds (data shuffles, weight initializations) and average the results to gauge stability.
    (5) Perform Statistical Tests: Use paired tests to determine if observed differences are statistically significant.
    (6) Measure Efficiency: Record training time, inference latency, and resource usage (CPU/GPU and memory) to assess practical deployability.
    (7) Evaluate Robustness & Interpretability: Test models under data perturbations or adversarial noise, and compare explainability.
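Points (2) and (3) can be sketched together: score two toy regression "models" (predict-the-mean vs. a least-squares line, both hypothetical helpers of our own) on identical k-fold splits so the comparison is fair:

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffled indices split into k near-equal folds (same folds for every model)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares slope/intercept for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cv_mse(fit, xs, ys, k=5):
    """Average held-out MSE over k folds."""
    folds = kfold_indices(len(xs), k)
    total = 0.0
    for fold in folds:
        train = [i for i in range(len(xs)) if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in fold) / len(fold)
    return total / k

xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]          # noiseless linear relationship
scores = {"mean": cv_mse(fit_mean, xs, ys), "line": cv_mse(fit_line, xs, ys)}
```

Because both models see exactly the same folds, any difference in `scores` reflects the models, not the split.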


  • ML0034 Backpropagation

    What is backpropagation?

    Answer

    Backpropagation, short for "backward propagation of errors", is the central algorithm by which multilayer neural networks learn. At its core, it efficiently computes how much each weight and bias in the network contributes to the overall prediction error (loss), then updates those parameters in the direction that reduces the error the most.
    By combining the chain rule from calculus with gradient‑based optimization (e.g., gradient descent), backpropagation makes training deep architectures tractable and underpins virtually all modern advances in deep learning.

    Steps to conduct Backpropagation:
    (1) Forward Pass: Inputs are propagated through the network to compute outputs. Intermediate activations are stored for later use.
    (2) Compute Loss: Use a loss function to compare the network’s output to the actual target values.
    (3) Backward Pass (Error Propagation): The error is computed at the output layer. The chain rule is applied to recursively calculate the gradients of the loss for each weight, starting from the output layer back to the input layer.
    (4) Gradient Calculation: For every neuron, determine how much its weights contributed to the error by computing partial derivatives.
    (5) Update Weights: Adjust the weights using an optimization algorithm (e.g., gradient descent), by subtracting a fraction (learning rate) of the computed gradients. This step is repeated iteratively to gradually minimize the loss.

    More details for step (3): Backward Pass (Error Propagation)
    At the Output Layer:
    Imagine a neuron with an output value a (its activation) and a weighted sum z computed as:
    z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
    Suppose we use the mean squared error (MSE) as our loss function:
    L = \frac{1}{2} (T - a)^2
    Where T is the target value.
    The derivative of the loss with respect to the activation is:
    \frac{dL}{da} = a - T
    To update the weights, we need to know how the loss changes with respect to z. Using the chain rule, we have:
    \frac{dL}{dz} = \frac{dL}{da} \cdot \frac{da}{dz}
    For example, if the activation function is sigmoid, then:
    \frac{da}{dz} = a (1 - a)

    For Hidden Layers:
    Consider a hidden neuron j that feeds into the output neurons. Its contribution to the loss is influenced by all neurons it connects to in the subsequent layer. The backpropagated error for neuron j is given by:
    \frac{dL}{dz_j} = \left( \sum_{k} \frac{dL}{dz_k} \cdot w_{jk} \right) \cdot f'(z_j)
    Here, f'(z_j) is the derivative of the activation function at neuron j.

    More details for step (4): Gradient Calculation
    For Each Weight:
    Once you have the error signal \frac{dL}{dz} for a neuron, the gradient with respect to a weight w_i connected to input x_i is:
    \frac{dL}{dw_i} = \frac{dL}{dz} \cdot x_i
    This shows that the gradient is directly proportional to the input: an input with larger magnitude produces a larger weight update for the same error signal.

    For the Bias:
    Since the bias b contributes to z with a derivative of 1, the gradient for the bias is simply:
    \frac{dL}{db} = \frac{dL}{dz}
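The formulas above can be sketched for a single sigmoid neuron (the `neuron_gradients` helper is our own; x, w, and b are chosen so that z = 0, making the numbers easy to check by hand):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_gradients(x, w, b, target):
    """Gradients for one sigmoid neuron with L = 0.5 * (T - a)^2,
    following the chain-rule steps exactly."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum
    a = sigmoid(z)                                  # activation
    dL_da = a - target                              # dL/da = a - T
    da_dz = a * (1.0 - a)                           # sigmoid derivative
    dL_dz = dL_da * da_dz                           # chain rule
    dL_dw = [dL_dz * xi for xi in x]                # dL/dw_i = dL/dz * x_i
    dL_db = dL_dz                                   # bias derivative is 1
    return dL_dw, dL_db

x, w, b, T = [1.0, 2.0], [0.5, -0.3], 0.1, 1.0
# Here z = 0.5 - 0.6 + 0.1 = 0, so a = 0.5, dL/da = -0.5, da/dz = 0.25,
# giving dL/dw = [-0.125, -0.25] and dL/db = -0.125.
grads_w, grad_b = neuron_gradients(x, w, b, T)
```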



  • ML0032 Non-Linear Activation

    Why use non-linear activation functions in neural networks in machine learning, and what limitations would a network face if only linear activation functions were used?

    Answer

    The benefits of using non-linear activation functions in neural networks are as follows:
    (1) Introduce Non-Linearity: Enable learning complex patterns in data.
    (2) Model Complexity: Allow approximation of any continuous function.
    (3) Make Depth Useful: Allow stacked layers to build increasingly complex, abstract representations rather than simple linear mappings. Stacking multiple layers with only linear activations collapses into an equivalent single linear transformation, so depth would confer no additional modeling capacity.

    The limitations of only linear activations are as follows:
    (1) No Depth Advantage: Any multilayer network collapses to a single-layer linear model, acting as plain linear regression regardless of depth; adding layers does not increase modeling power.
    (2) Inability to Learn Non‑Linear Boundaries: The network can only separate linearly separable data; tasks requiring non‑linear decision boundaries become impossible.

    The following example shows the limitation of using linear activations only in neural networks.
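As a sketch of this collapse, two stacked linear "layers" (no activation) produce exactly the same output as one layer whose weight matrix is their product, W2 W1 (plain-Python 2x2 matrices keep the example minimal):

```python
def matvec(M, v):
    """Matrix-vector product."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    """Matrix-matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [0.0, 1.0]]   # first "layer" (no activation)
W2 = [[0.5, -1.0], [2.0, 3.0]]  # second "layer" (no activation)
x = [1.0, -2.0]

two_layer = matvec(W2, matvec(W1, x))   # depth-2 linear network
one_layer = matvec(matmul(W2, W1), x)   # equivalent single layer
# The outputs are identical: the extra layer added no expressive power.
```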


  • ML0031 Linear Regression

    What are the advantages and disadvantages of linear regression?

    Answer

    Linear regression aims to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

    \displaystyle h_\theta(x) = \theta_0 + \sum_{j=1}^{p} \theta_j x_j
    Where
    h_\theta(x) represents the hypothesis (predicted value) for input feature vector x.
    \theta_0 is the bias (intercept) parameter, shifting the prediction up or down independent of features.
    \theta_j are the weight parameters multiplying each feature.
    x_j denotes the j‑th feature of the input vector x.
    p is the total number of features (excluding the bias) used in the model.
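For a single feature, the parameters have a closed-form least-squares solution, sketched below (the `fit_linear` helper is our own; libraries such as scikit-learn generalize this to p features):

```python
# Fit h(x) = theta0 + theta1 * x by ordinary least squares on one feature.
def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    theta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
             sum((x - mx) ** 2 for x in xs)
    theta0 = my - theta1 * mx            # intercept passes through the means
    return theta0, theta1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                # exactly y = 1 + 2x
theta0, theta1 = fit_linear(xs, ys)
```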

    Advantages:
    (1) Simplicity & Interpretability: Linear regression is easy to understand and implement. The coefficients of the model directly indicate the strength and direction of the relationship between the features and the target variable, making it highly interpretable.
    (2) Computational Efficiency: Its low computational cost makes linear regression fast to train, even on large datasets.
    (3) Effective for Linearly Separable Data: It performs well when the relationship between the independent and dependent variables is approximately linear.

    Disadvantages:
    (1) Assumes Linearity: The primary limitation is the assumption that the relationship between the variables is linear. It will perform poorly if the underlying relationship is nonlinear.
    (2) Sensitivity to Outliers: Extreme values can disproportionately affect the model, distorting the results.
    (3) Multicollinearity Issues: When predictors are highly correlated, it becomes difficult to isolate individual effects, leading to unreliable coefficient estimates.
    (4) Potential for Underfitting: The simplicity of the model may fail to capture the nuances and complexities of more intricate datasets.



  • ML0030 Sigmoid

    What are the advantages and disadvantages of using a sigmoid activation function?

    Answer

    The sigmoid activation function transforms input values into a range between 0 and 1, making it useful in various applications like binary classification.
    \mbox{Sigmoid}(x) = \frac{1}{1 + e^{-x}}

    Advantages:
    (1) Smooth, Bounded Gradient: The sigmoid’s S‑shape yields a continuous derivative, preventing abrupt changes during backpropagation and aiding stable training in shallow networks.
    (2) Probability interpretation: Since the output is between 0 and 1, it can be useful for problems where predictions need to represent probabilities.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or small inputs, the gradient becomes almost zero, slowing down training in deep networks.
    (2) Not zero-centered: The outputs are always positive, which can lead to inefficient weight updates and slower convergence.
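The vanishing-gradient behavior is easy to see numerically (a small sketch using the identity sigmoid'(x) = sigmoid(x)(1 - sigmoid(x))):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # peaks at 0.25, at x = 0

mid = sigmoid_grad(0.0)          # 0.25 -- the largest the gradient ever gets
sat = sigmoid_grad(10.0)         # tiny: the neuron is saturated, learning stalls
```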


  • ML0028 Softmax

    What is the Softmax activation function, and what is its purpose?

    Answer

    Softmax is an activation function typically used in the output layer of a neural network for multi-class classification problems. Its purpose is to convert a vector of raw scores (logits) into a probability distribution over the possible output classes. The output of Softmax is a vector where each element represents the probability of the input belonging to a specific class, and the sum of these probabilities is always 1.
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
     z_i represents the raw score (also known as a “logit”) for the i-th class.
     K represents the total number of classes in the classification problem.

    The combination of the softmax function with the cross-entropy loss function is standard for multi-class classification problems. The softmax function provides a probability distribution over classes, and the cross-entropy loss measures how well this predicted distribution aligns with the true distribution (typically a one-hot encoded vector).
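A minimal sketch of this pairing (the max-subtraction in `softmax` is the standard numerical-stability trick; the helper names are our own):

```python
import math

def softmax(z):
    """Convert logits to a probability distribution; subtracting max(z)
    avoids overflow without changing the result."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target_index):
    """-log of the probability assigned to the true class (one-hot target)."""
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)          # sums to 1; the largest logit gets the largest probability
loss = cross_entropy(probs, 0)   # small when class 0 receives high probability
```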

