Author: admin

  • ML0025 Exploding Gradient

    What are the typical reasons for exploding gradient?

    Answer

    Exploding gradients occur when the gradients during backpropagation become excessively large. This leads to huge updates in the model’s weights, making the training process unstable and potentially causing the model to diverge instead of converging.

    Typical Reasons for Exploding Gradients:
    1. Deep Architectures:
    In very deep networks, the chain rule repeatedly multiplies per-layer derivatives; when these factors are greater than 1, the gradient can grow exponentially with depth.

    2. Large Learning Rates:
    When the learning rate is set too high, even moderately large gradients can result in weight updates that overshoot the optimum by a significant margin, compounding the instability.

    3. Improper Weight Initialization:
    If weights are initialized to values that are too high, activations and their corresponding derivatives can be disproportionately large. This imbalance not only disrupts the symmetry in learning but can also contribute to the accumulation of large gradient values.

    4. Activation Functions with Derivatives Greater Than 1:
    Some activation functions or their operating regimes can have derivatives greater than 1. Repeated multiplication of these large derivatives during backpropagation can lead to exponential growth of the gradients.
    A concrete example is the Scaled Exponential Linear Unit (SELU). For positive inputs, SELU is defined as:
    {\large \text{SELU}(x) = \lambda x,\quad \text{for } x>0}
    In the typical configuration for self-normalizing neural networks, the parameter {\large\lambda} is set to approximately 1.0507 (greater than 1). This means in the positive regime, each layer effectively amplifies the gradient by a factor of {\large\lambda}, which, when compounded over many layers, can contribute to exploding gradients if not properly managed.
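    The compounding effect described above is easy to see numerically. Below is a minimal plain-Python sketch (function name is illustrative, not from any framework), using the SELU scale λ ≈ 1.0507 as the per-layer amplification factor:

```python
# Sketch: how a per-layer gradient factor > 1 compounds with depth.
# The SELU scale lambda ~= 1.0507 is used as the per-layer amplification.

def compounded_gradient(factor: float, n_layers: int) -> float:
    """Multiply a unit gradient by `factor` once per layer."""
    grad = 1.0
    for _ in range(n_layers):
        grad *= factor
    return grad

lam = 1.0507
print(compounded_gradient(lam, 10))    # ~1.64 -- still modest
print(compounded_gradient(lam, 100))   # ~140 -- two orders of magnitude
print(compounded_gradient(lam, 1000))  # ~3e21 -- exploded
```

    Even a factor barely above 1 becomes enormous once compounded over enough layers, which is why deep networks pair such activations with careful initialization or gradient clipping.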


  • ML0024 Vanishing Gradient

    What are the typical reasons for vanishing gradient?

    Answer

    The vanishing gradient problem occurs during the training of deep neural networks when gradients become exceedingly small as they are backpropagated through the network’s layers. This diminishes the effectiveness of weight updates, particularly in the earlier layers, hindering the network’s ability to learn and converge efficiently.

    Typical Reasons for Vanishing Gradients:
    1. Saturating Activation Functions:
    Activation functions, such as the sigmoid or tanh, compress input values into a narrow range. For example, the sigmoid function is defined as:
    {\large \sigma(z) = \displaystyle\frac{1}{1 + e^{-z}}}
    Its derivative is:
    {\large \sigma'(z) = \sigma(z)(1 - \sigma(z))}
    Notice that when {\large z} has very high or very low values, {\large\sigma(z)} saturates close to 1 or 0, making {\large\sigma'(z)} extremely small. When such small derivatives are multiplied across many layers (as dictated by the chain rule), they shrink toward zero, leading to vanishing gradients.

    2. Deep Network Architectures:
    In deep models, the gradient for a given layer involves a product of many small derivatives from subsequent layers. Mathematically, if you consider a simple scenario, the gradient with respect to an early layer might be expressed as:
    {\large \frac{\partial L}{\partial x} = \prod_{i=1}^{n} \frac{\partial x_{i+1}}{\partial x_{i}}}
    If each term in the product is less than one in absolute value, the overall product becomes extremely small as {\large n} (the number of layers) increases.

    3. Improper Weight Initialization:
    The way weights are initialized can have a significant impact on the magnitude of the gradients. If the initial weights are set too small (or too large), they can push the activations into the non-linear saturation regions of functions like the sigmoid or tanh, causing their derivatives to be very small. This, in turn, contributes to vanishing gradients.

    4. Recurrent Neural Networks (RNNs):
    RNNs are particularly susceptible because the gradients must pass through many time steps when backpropagating through time. As in deep feedforward networks, if the per-step gradient factor is less than one, the multiplicative effect causes the overall gradient to vanish over long sequences.
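    Both effects above can be checked numerically. In this plain-Python sketch (function names are illustrative), the sigmoid derivative peaks at 0.25 and collapses for large |z|, and chaining even that peak value across many layers drives the product toward zero:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)  # sigma'(z) = sigma(z) * (1 - sigma(z))

# Saturation: the derivative is largest at z = 0 and vanishes for large |z|.
print(sigmoid_deriv(0.0))   # 0.25
print(sigmoid_deriv(10.0))  # ~4.5e-5

# Chain rule across layers: even the best case (0.25 per layer) shrinks fast.
def chained_gradient(per_layer: float, n_layers: int) -> float:
    grad = 1.0
    for _ in range(n_layers):
        grad *= per_layer
    return grad

print(chained_gradient(0.25, 10))  # ~9.5e-7
print(chained_gradient(0.25, 50))  # ~7.9e-31 -- effectively zero
```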


  • ML0023 Gradient Descent

    What is Gradient Descent in machine learning?

    Answer

    Gradient descent is an iterative optimization algorithm used to minimize a function, most commonly a cost or loss function in machine learning, by moving step-by-step in the direction of the steepest descent (i.e., opposite to the gradient).
    In each iteration, the algorithm computes the gradient of the function with respect to its parameters, then updates the parameters by subtracting a fraction (the learning rate) of this gradient.
    This process is repeated until the function converges to a minimum (which, for convex functions, is the global minimum) or until the updates become negligibly small.

    The update rule for Gradient descent:
    {\large \theta = \theta - \alpha \nabla J(\theta)}
    where:
    {\large\theta} represents the parameters being optimized (for example, the weights in a model).
    {\large\alpha} is the learning rate.
    {\large\nabla J(\theta)} is the gradient of the cost function {\large J(\theta)} with respect to {\large\theta}.
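    The update rule can be applied directly in a few lines. A minimal sketch (plain Python, hypothetical function name) on J(θ) = (θ − 3)², whose gradient is 2(θ − 3) and whose minimum sits at θ = 3:

```python
# Minimal gradient descent on J(theta) = (theta - 3)^2.
def gradient_descent(theta: float, lr: float, n_steps: int) -> float:
    for _ in range(n_steps):
        grad = 2.0 * (theta - 3.0)   # gradient of J at the current theta
        theta = theta - lr * grad    # update rule: theta <- theta - alpha * grad
    return theta

print(gradient_descent(theta=0.0, lr=0.1, n_steps=100))  # ~3.0, the minimum
```

    A learning rate that is too large would overshoot the minimum on every step; too small, and convergence takes many more iterations.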



  • ML0022 Cross Entropy Loss

    Explain how Cross Entropy Loss is used for a classification task.

    Answer

    Cross-entropy loss, also known as log loss or logistic loss, is a commonly used loss function in machine learning, particularly for classification tasks. It quantifies the difference between two probability distributions: the predicted probabilities generated by a model and the true probability distribution of the target variable. The goal of training a classification model is to minimize this loss.

    For binary classification:
    {\large \text{Binary Cross-Entropy Loss} = - \displaystyle\frac{1}{n}\sum_{i=1}^{n} [y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i)]}
    where:
    {\large n} is the number of total samples.
    {\large y_i} is the true label (0 or 1) for the i-th data point.
    {\large p_i} is the predicted probability of the positive class (class 1) for the i-th data point.

    For multi-class classification:
    {\large \text{Categorical Cross-Entropy Loss} = - \displaystyle\frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{C} y_{ij} \cdot \log(p_{ij})}
    where:
    {\large n} is the number of total samples.
    {\large C} is the number of classes.
    {\large y_{ij}} is a binary indicator (0 or 1) that is 1 if the true class for the i-th data point is j, and 0 otherwise (one-hot encoding).
    {\large p_{ij}} is the predicted probability that the i-th data point belongs to class j.

    The logarithm function in the formula penalizes incorrect predictions more severely when the model is more confident about that incorrect prediction.
    For a true label of 1, the loss is higher when the predicted probability p is closer to 0, and lower when p is closer to 1.
    For a true label of 0, the loss is higher when the predicted probability p is closer to 1, and lower when p is closer to 0.
    The cross-entropy loss approaches 0 when the predicted probability distribution is close to the true distribution.
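    Both formulas translate directly into code. A plain-Python sketch (function names are illustrative; real frameworks also clip p away from 0 and 1 to avoid log(0)):

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy over n samples."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / n

def categorical_cross_entropy(y_true, p_pred):
    """Mean categorical cross-entropy; rows of y_true are one-hot."""
    n = len(y_true)
    return -sum(y_ij * math.log(p_ij)
                for y_row, p_row in zip(y_true, p_pred)
                for y_ij, p_ij in zip(y_row, p_row)
                if y_ij > 0) / n

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303
```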

    Key Properties:
    Differentiable: The cross-entropy loss function is differentiable, which is essential for gradient-based optimization algorithms.
    Sensitive to Confidence: It strongly penalizes confident but incorrect predictions.
    Probabilistic Interpretation: It directly works with the predicted probabilities of the classes.



  • ML0021 L1 Loss L2 Loss

    What are the key differences between L1 loss and L2 loss?

    Answer

    L1 Loss (Mean Absolute Error – MAE)
    L1 loss measures the average absolute difference between the actual and predicted values. It is expressed as:
    {\large \text{L1 Loss} = \displaystyle\frac{1}{n}\sum_{i=1}^{n} | \hat{y}_i - y_i |}
    Where,
    {\large y_i} represents the actual value for the i-th data point.
    {\large\hat{y}_i} represents the predicted value for the i-th data point.
    {\large n} is the total number of data points.

    Sensitivity to Outliers: L1 loss is less sensitive to outliers because it penalizes errors linearly: each error contributes in proportion to its absolute size, so a single large error cannot dominate the total loss.
    Gradient Behavior: The gradient of the L1 loss is constant (+1 or -1) for non-zero errors. At zero error, the gradient is undefined. This can lead to instability during optimization near the optimal solution.
    Sparsity: L1 loss has a tendency to produce sparse models, meaning it can drive the weights of less important features to exactly zero. This is a desirable property for feature selection.

    L2 Loss (Mean Squared Error – MSE)
    L2 loss measures the average squared difference between the actual and predicted values. It is expressed as:
    {\large \text{L2 Loss} = \displaystyle\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
    Where,
    {\large y_i} represents the actual value for the i-th data point.
    {\large\hat{y}_i} represents the predicted value for the i-th data point.
    {\large n} is the total number of data points.

    Sensitivity to Outliers: L2 loss is more sensitive to outliers because it squares the errors. A large error has a disproportionately larger impact on the total loss, making the model more influenced by extreme values.
    Gradient Behavior: The gradient of the L2 loss is proportional to the error ({\large 2(\hat{y}_i - y_i)} per sample). This means that larger errors have larger gradients, which can help the optimization process converge faster when the errors are large. As the error approaches zero, the gradient also approaches zero, leading to more stable convergence near the optimal solution.
    Sparsity: L2 loss does not inherently lead to sparse models. It tends to shrink the weights of all features, but rarely drives them to exactly zero.
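    The contrast in outlier sensitivity shows up directly in a small example (plain-Python sketch, illustrative names):

```python
def l1_loss(y_true, y_pred):
    n = len(y_true)
    return sum(abs(yp - yt) for yt, yp in zip(y_true, y_pred)) / n

def l2_loss(y_true, y_pred):
    n = len(y_true)
    return sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n

y_true  = [1.0, 2.0, 3.0, 4.0]
clean   = [1.1, 2.1, 3.1, 4.1]   # small uniform errors
outlier = [1.1, 2.1, 3.1, 14.0]  # one large error

print(l1_loss(y_true, clean),   l2_loss(y_true, clean))    # 0.1, 0.01
print(l1_loss(y_true, outlier), l2_loss(y_true, outlier))  # 2.575, 25.0075
```

    The single outlier multiplies L1 by roughly 25x but L2 by roughly 2500x, which is exactly the squaring effect described above.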


  • ML0020 Data Split

    How to split the dataset?

    Answer

    A good data split in machine learning ensures that the model is trained, validated, and tested effectively to generalize well on unseen data.
    The typical approach involves dividing the dataset into three sets: Training Set, Validation Set, and Test Set.

    Training Set: Used to train the machine learning model. The model learns patterns and relationships in the data from this set.
    Validation Set: Used to tune hyperparameters of the model and evaluate its performance during training. This helps prevent overfitting to the training data and allows you to select the best model configuration.  
    Test Set: Used for a final, unbiased evaluation of the trained model’s performance on completely unseen data. This provides an estimate of how well the model will generalize to new, real-world data.

    Stratification for Imbalanced Data: For imbalanced datasets, consider using stratified splits to maintain the same proportion of classes across the training, validation, and test sets.
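    A stratified split can be sketched in plain Python (illustrative names; in practice scikit-learn's train_test_split with the stratify argument does this for you):

```python
import random

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_indices, test_indices) so that each class keeps
    the same proportion in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx

labels = [0] * 90 + [1] * 10  # imbalanced: 90% class 0
train, test = stratified_split(labels, test_frac=0.2)
print(len(test), sum(labels[i] for i in test))  # 20 samples, 2 positives -- ratio kept
```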


  • ML0019 Imbalanced Data

    How to handle imbalanced data in Machine Learning?

    Answer

    Handling imbalanced data in machine learning involves addressing scenarios where one class significantly outnumbers the other, which can skew model performance. Here are common techniques:

    Dataset Resampling:
    Oversampling: Increase the minority class samples (e.g., using SMOTE or ADASYN to generate synthetic data points).
    Undersampling: Reduce the majority class samples to balance the dataset.

    Data Augmentation:
    Create synthetic data for the minority class with data augmentation techniques.

    Class Weights Adjustment:
    Assign higher weights to the minority class during training to penalize misclassifications more heavily.

    Metrics Selection:
    Use evaluation metrics like Precision, Recall, F1 Score, or AUC-ROC rather than accuracy.
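    The class-weight adjustment above is often done with the inverse-frequency ("balanced") heuristic, weight_c = n_samples / (n_classes × count_c), which is the formula scikit-learn uses for class_weight="balanced". A plain-Python sketch (illustrative name):

```python
def class_weights(labels):
    """Inverse-frequency class weights: rare classes get large weights."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = [0] * 90 + [1] * 10
print(class_weights(labels))  # {0: ~0.56, 1: 5.0} -- minority class weighted up
```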


  • ML0018 Data Normalization

    Why is data normalization used in Machine Learning?

    Answer

    Data normalization is the process of scaling data to fit within a specific range or distribution, often between 0 and 1 or with a mean of 0 and standard deviation of 1. It’s used in machine learning and statistical modeling to ensure that features contribute equally to the model’s learning process.
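    The two scalings mentioned can be sketched in a few lines (plain Python, illustrative names): min-max scaling maps values to [0, 1], while standardization gives mean 0 and standard deviation 1.

```python
def min_max_scale(xs):
    """Scale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Standardize to mean 0 and (population) standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(z_score([10, 20, 30]))        # [~-1.22, 0.0, ~1.22]
```

    In practice the scaling parameters (min/max or mean/std) are fit on the training set only and then reused on validation and test data to avoid leakage.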


  • ML0017 Data Augmentation

    What are the common data augmentation techniques?

    Answer

    Data augmentation refers to techniques used to increase the diversity and size of a training dataset by creating modified versions of the existing data. It’s especially popular in applications like computer vision and natural language processing, where collecting large datasets can be expensive or time-consuming.

    Common Techniques:
    Computer Vision:

    Geometric Transformations: Rotate, flip, crop, or scale images.
    Color Adjustments: Change brightness, contrast, saturation, or apply color jittering.
    Noise Injection: Add random noise or blur to images.

    Natural Language Processing:
    Synonym Replacement: Replace words with their synonyms.
    Back Translation: Translate text to another language and back.
    Random Insertion/Deletion: Add/remove words randomly.

    Tabular Data:
    SMOTE (Synthetic Minority Oversampling Technique): Generate synthetic data points for minority classes.
    Noise Injection: Add small random noise to numeric features.
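    As a minimal sketch of noise injection for tabular data (plain Python, illustrative names; image and text augmentations are typically done with dedicated libraries):

```python
import random

def jitter(rows, scale=0.01, seed=0):
    """Augment numeric rows by adding small Gaussian noise to each feature."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, scale) for x in row] for row in rows]

rows = [[1.0, 2.0], [3.0, 4.0]]
augmented = rows + jitter(rows, scale=0.05)  # originals plus perturbed copies
print(len(augmented))  # 4
```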


  • ML0016 AUC

    What is AUC?

    Answer

    AUC (Area Under the Curve) is a measure of a model’s ability to distinguish between positive and negative classes, based on the ROC (Receiver Operating Characteristic) curve. It quantifies the area under the ROC curve, where the curve represents the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) at various thresholds.

    AUC Range:
    1.0: Perfect classifier
    0.5: Random guessing
    Below 0.5: Worse than random guessing; this is rare in practice and usually indicates inverted labels or a systematic error.
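    AUC has an equivalent probabilistic reading: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count half). A plain-Python sketch of this pairwise definition (illustrative name; real implementations sort by score rather than comparing all pairs):

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs the model ranks correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auc([0, 1], [0.9, 0.1]))                   # 0.0 -- worse than random
```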

