Category: Easy

  • ML0026 ReLU

    What are the benefits and limitations of the ReLU activation function?

    Answer

    ReLU offers substantial benefits in terms of computational efficiency, gradient propagation, and sparsity, which have made it a popular choice for activation functions in deep learning.
    {\large \text{ReLU}(x) = \max(0, x)}

    Advantages of ReLU:
    1. Mitigation of the Vanishing Gradient Problem:
    In the positive region (x > 0), ReLU has a constant gradient of 1. This helps alleviate the vanishing gradient problem that plagues the sigmoid and tanh functions, especially in deep networks. A constant gradient allows the error signal to backpropagate more effectively to earlier layers.
    2. Sparse Activation:
    By outputting zero for all negative inputs, ReLU naturally induces sparsity in the network: at any given time, only a subset of neurons are active. Sparse activations can lead to more efficient representations and help the network learn more robust features.
    3. Computational Efficiency:
    ReLU requires only a threshold operation, which accelerates both training and inference compared to functions like sigmoid or tanh that involve exponentials.

    Drawbacks of ReLU:
    1. Dying ReLU Problem:
    Neurons can become inactive if they consistently receive negative inputs, leading them to output zero and potentially never recover, thus reducing the model’s capacity.
    2. Unbounded Output:
    The unbounded nature of ReLU’s positive outputs can lead to large activation values, potentially causing issues like exploding gradients if not properly managed.
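
    The definition and its gradient can be sketched in a few lines of plain Python (a minimal illustration, not tied to any particular framework):

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, otherwise 0.
    (At exactly x = 0 the derivative is undefined; 0 is a common convention.)"""
    return 1.0 if x > 0 else 0.0

# Negative inputs are zeroed out, which is the source of the sparsity
# benefit and of the dying-ReLU risk described above.
activations = [relu(x) for x in [-2.0, -0.5, 0.0, 1.5, 3.0]]
```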


  • ML0023 Gradient Descent

    What is Gradient Descent in machine learning?

    Answer

    Gradient descent is an iterative optimization algorithm used to minimize a function, most commonly a cost or loss function in machine learning, by moving step-by-step in the direction of the steepest descent (i.e., opposite to the gradient).
    In each iteration, the algorithm computes the gradient of the function with respect to its parameters, then updates the parameters by subtracting a fraction (the learning rate) of this gradient.
    This process is repeated until the function converges to a minimum (which, for convex functions, is the global minimum) or until the updates become negligibly small.

    The update rule for gradient descent:
    {\large \theta = \theta - \alpha \nabla J(\theta)}
    where:
    {\large \theta} represents the parameters being optimized (for example, the weights in a model).
    {\large \alpha} represents the learning rate.
    {\large \nabla J(\theta)} is the gradient of the cost function {\large J(\theta)} with respect to {\large \theta}.
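
    A minimal sketch of this update loop, applied to the toy objective J(θ) = θ² (whose gradient is 2θ and whose minimum is at θ = 0):

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Repeatedly apply theta = theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize J(theta) = theta^2; its gradient is 2 * theta.
theta_star = gradient_descent(lambda t: 2 * t, theta=5.0, lr=0.1, steps=100)
```

    Each step shrinks θ by a factor of (1 - 2α), so the iterates converge toward the minimum at 0.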



  • ML0022 Cross Entropy Loss

    Explain how Cross Entropy Loss is used for a classification task.

    Answer

    Cross-entropy loss, also known as log loss or logistic loss, is a commonly used loss function in machine learning, particularly for classification tasks. It quantifies the difference between two probability distributions: the predicted probabilities generated by a model and the true probability distribution of the target variable. The goal of training a classification model is to minimize this loss.

    For binary classification:
    {\large \text{Binary Cross-Entropy Loss} = - \displaystyle\frac{1}{n}\sum_{i=1}^{n} [y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i)]}
    where:
    {\large n} is the number of total samples.
    {\large y_i} is the true label (0 or 1) for the i-th data point.
    {\large p_i} is the predicted probability of the positive class (class 1) for the i-th data point.

    For multi-class classification:
    {\large \text{Categorical Cross-Entropy Loss} = - \displaystyle\frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{C} y_{ij} \cdot \log(p_{ij})}
    where:
    {\large n} is the number of total samples.
    {\large C} is the number of classes.
    {\large y_{ij}} is a binary indicator (0 or 1) that is 1 if the true class for the i-th data point is j, and 0 otherwise (one-hot encoding).
    {\large p_{ij}} is the predicted probability that the i-th data point belongs to class j.

    The logarithm function in the formula penalizes incorrect predictions more severely when the model is more confident about that incorrect prediction.
    For a true label of 1, the loss is higher when the predicted probability p is closer to 0, and lower when p is closer to 1.
    For a true label of 0, the loss is higher when the predicted probability p is closer to 1, and lower when p is closer to 0.
    The cross-entropy loss approaches 0 when the predicted probability distribution is close to the true distribution.

    Key Properties:
    Differentiable: The cross-entropy loss function is differentiable, which is essential for gradient-based optimization algorithms.
    Sensitive to Confidence: It strongly penalizes confident but incorrect predictions.
    Probabilistic Interpretation: It directly works with the predicted probabilities of the classes.
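
    The binary formula above can be sketched directly in plain Python (the small eps clip is a common practical safeguard against log(0), not part of the formula itself):

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over n samples."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

loss_confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # good predictions
loss_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])      # confident and wrong
```

    The confidently wrong predictions incur a much larger loss, illustrating the penalty behavior described above.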



  • ML0021 L1 Loss L2 Loss

    What are the key differences between L1 loss and L2 loss?

    Answer

    L1 Loss (Mean Absolute Error – MAE)
    L1 loss measures the average absolute difference between the actual and predicted values. It is expressed as:
    {\large \text{L1 Loss} = \displaystyle\frac{1}{n}\sum_{i=1}^{n} | \hat{y}_i - y_i |}
    Where,
    {\large y_i} represents the actual value for the i-th data point.
    {\large\hat{y}_i} represents the predicted value for the i-th data point.
    {\large n} is the total number of data points.

    Sensitivity to Outliers: L1 loss is less sensitive to outliers because it penalizes errors linearly: a large error has only a proportional impact on the total loss.
    Gradient Behavior: The gradient of the L1 loss is constant (+1 or -1) for non-zero errors. At zero error, the gradient is undefined. This can lead to instability during optimization near the optimal solution.
    Sparsity: L1 loss has a tendency to produce sparse models, meaning it can drive the weights of less important features to exactly zero. This is a desirable property for feature selection.

    L2 Loss (Mean Squared Error – MSE)
    L2 loss measures the average squared difference between the actual and predicted values. It is expressed as:
    {\large \text{L2 Loss} = \displaystyle\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
    Where,
    {\large y_i} represents the actual value for the i-th data point.
    {\large\hat{y}_i} represents the predicted value for the i-th data point.
    {\large n} is the total number of data points.

    Sensitivity to Outliers: L2 loss is more sensitive to outliers because it squares the errors. A large error has a disproportionately larger impact on the total loss, making the model more influenced by extreme values.
    Gradient Behavior: The gradient of the L2 loss is proportional to the error ({\large 2(\hat{y}_i - y_i)} per sample). Larger errors therefore produce larger gradients, which can help the optimization process converge faster when the errors are large. As the error approaches zero, the gradient also approaches zero, leading to more stable convergence near the optimal solution.
    Sparsity: L2 loss does not inherently lead to sparse models. It tends to shrink the weights of all features, but rarely drives them to exactly zero.
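
    A small sketch comparing the two losses on data with a single outlier illustrates the sensitivity difference (the data values are illustrative):

```python
def l1_loss(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def l2_loss(y_true, y_pred):
    """Mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 100.0]  # last target is an outlier
y_pred = [1.0, 2.0, 3.0, 3.0]    # model misses the outlier by 97
# The outlier contributes 97 to the L1 sum but 97^2 = 9409 to the L2 sum,
# so the single extreme point dominates L2 far more than L1.
```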


  • ML0020 Data Split

    How to split the dataset?

    Answer

    A good data split in machine learning ensures that the model is trained, validated, and tested effectively to generalize well on unseen data.
    The typical approach involves dividing the dataset into three sets: Training Set, Validation Set, and Test Set.

    Training Set: Used to train the machine learning model. The model learns patterns and relationships in the data from this set.
    Validation Set: Used to tune hyperparameters of the model and evaluate its performance during training. This helps prevent overfitting to the training data and allows you to select the best model configuration.  
    Test Set: Used for a final, unbiased evaluation of the trained model’s performance on completely unseen data. This provides an estimate of how well the model will generalize to new, real-world data.

    Stratification for Imbalanced Data: For imbalanced datasets, consider using stratified splits to maintain the same proportion of classes across the training and test sets.
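
    A minimal shuffle-and-cut sketch of the three-way split (the 70/15/15 ratio and the helper name are illustrative choices, not fixed conventions; it does not implement stratification):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices, then cut into train/validation/test partitions."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
```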


  • ML0018 Data Normalization

    Why is data normalization used in Machine Learning?

    Answer

    Data normalization is the process of scaling data to fit within a specific range or distribution, often between 0 and 1 or with a mean of 0 and standard deviation of 1. It’s used in machine learning and statistical modeling to ensure that features contribute equally to the model’s learning process.
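
    Both common scalings can be sketched in plain Python (function names are illustrative):

```python
def min_max_scale(xs):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center to mean 0 and scale to standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

scaled = min_max_scale([10.0, 20.0, 30.0])  # -> [0.0, 0.5, 1.0]
```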


  • ML0017 Data Augmentation

    What are the common data augmentation techniques?

    Answer

    Data augmentation refers to techniques used to increase the diversity and size of a training dataset by creating modified versions of the existing data. It’s especially popular in applications like computer vision and natural language processing, where collecting large datasets can be expensive or time-consuming.

    Common Techniques:
    Computer Vision:

    Geometric Transformations: Rotate, flip, crop, or scale images
    Color Adjustments: Change brightness, contrast, saturation, or apply color jittering.
    Noise Injection: Add random noise or blur to images.

    Natural Language Processing:
    Synonym Replacement: Replace words with their synonyms.
    Back Translation: Translate text to another language and back.
    Random Insertion/Deletion: Add/remove words randomly.

    Tabular Data:
    SMOTE (Synthetic Minority Oversampling Technique): Generate synthetic data points for minority classes.
    Noise Injection: Add small random noise to numeric features.
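
    Two of the simpler techniques, a horizontal flip and noise injection, can be sketched on toy data (an image is treated as a plain 2-D list of pixel values purely for illustration):

```python
import random

def horizontal_flip(image):
    """Geometric transformation: reverse each row of a 2-D pixel grid."""
    return [row[::-1] for row in image]

def add_noise(values, scale=0.01, seed=0):
    """Noise injection: add small Gaussian noise to numeric features."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]

img = [[1, 2, 3],
       [4, 5, 6]]
flipped = horizontal_flip(img)  # [[3, 2, 1], [6, 5, 4]]
noisy = add_noise([1.0, 2.0, 3.0])
```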


  • ML0016 AUC

    What is AUC?

    Answer

    AUC (Area Under the Curve) is a measure of a model’s ability to distinguish between positive and negative classes, based on the ROC (Receiver Operating Characteristic) curve. It quantifies the area under the ROC curve, where the curve represents the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) at various thresholds.

    AUC Range:
    1.0: Perfect classifier
    0.5: Random guessing
    Below 0.5: Worse than random guessing; this usually signals a problem, such as inverted labels or systematically inverted predictions.
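
    AUC also has a direct probabilistic reading: it equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counting half). A minimal sketch based on that interpretation:

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs where the positive is scored
    higher than the negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])  # perfectly separated -> 1.0
```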


  • ML0014 Confusion Matrix

    What is the confusion matrix?

    Answer

    A confusion matrix is a table that summarizes the performance of a classification model by comparing its predicted labels against the actual labels. For binary classification, it is typically organized into a 2×2 table containing:

    True Positives (TP): Cases where the model correctly predicts the positive class
    False Positives (FP): Cases where the model incorrectly predicts the positive class.
    False Negatives (FN): Cases where the model incorrectly predicts the negative class.
    True Negatives (TN): Cases where the model correctly predicts the negative class.

    It provides a detailed breakdown of the model’s predictions compared to the actual outcomes, which helps in understanding not only how many predictions were correct, but also the types of errors being made.
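
    Counting the four cells for binary labels can be sketched as follows (the function name and sample data are illustrative):

```python
def confusion_matrix(y_true, y_pred):
    """Count TP/FP/FN/TN for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

counts = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```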



  • ML0013 Accuracy

    What is accuracy?

    Answer

    Accuracy in machine learning is a metric used to evaluate the performance of a model, particularly in classification tasks. It is the ratio of correct predictions to the total number of predictions made.
    Mathematically, it’s defined as:

    {\large \text{Accuracy} = \displaystyle\frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}}

    If a model correctly predicts the class for 99 out of 100 samples, its accuracy is 99%.

    True Positives (TP): The model correctly predicts the positive class.
    False Positives (FP): The model incorrectly predicts the positive class (it predicted positive, but it was negative).
    True Negatives (TN): The model correctly predicts the negative class.
    False Negatives (FN): The model incorrectly predicts the negative class (it predicted negative, but it was positive).
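
    The ratio above reduces to the fraction of predictions that match the true labels, which can be sketched as:

```python
def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN): fraction of correct predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct -> 0.75
```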

