What are the typical reasons for vanishing gradient?
Answer
The vanishing gradient problem occurs during the training of deep neural networks when gradients become exceedingly small as they are backpropagated through the network’s layers. This diminishes the effectiveness of weight updates, particularly in the earlier layers, hindering the network’s ability to learn and converge efficiently.
Typical Reasons for Vanishing Gradients:
1. Saturating Activation Functions:
Activation functions, such as the sigmoid or tanh, compress input values into a narrow range. For example, the sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its derivative is:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

Notice that when $x$ has very high or very low values, $\sigma(x)$ saturates close to 1 or 0, making $\sigma'(x)$ extremely small. When such small derivatives are multiplied across many layers (as dictated by the chain rule), the product shrinks toward zero, leading to vanishing gradients.
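As a quick numerical illustration (a minimal sketch, not part of the original answer), the sigmoid's slope peaks at 0.25 at $x = 0$ and collapses in the saturation regions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))    # 0.25 (the maximum possible slope)
print(sigmoid_derivative(5.0))    # ~0.0066, already deep in saturation
print(sigmoid_derivative(-10.0))  # ~4.5e-05, effectively zero
```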
2. Deep Network Architectures:
In deep models, the gradient for a given layer involves a product of many small derivatives from subsequent layers. Mathematically, for a chain of $n$ layers with hidden states $h_1, \dots, h_n$ and loss $L$, the gradient with respect to an early layer might be expressed as:

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_n} \prod_{k=2}^{n} \frac{\partial h_k}{\partial h_{k-1}}$$

If each term in the product is less than one in absolute value, the overall product becomes extremely small as $n$ (the number of layers) increases.
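To see how fast this happens (a toy calculation, using the sigmoid's maximum slope of 0.25 as the per-layer factor):

```python
# Multiplying n per-layer derivatives, each below 1 in absolute value,
# drives the overall gradient toward zero exponentially in depth.
derivative_per_layer = 0.25  # best case for a sigmoid layer

for n in (5, 20, 50):
    print(f"{n} layers -> gradient factor {derivative_per_layer ** n:.2e}")
```

Even in this best case, 20 sigmoid layers already scale the gradient by roughly $10^{-12}$.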
3. Improper Weight Initialization:
The way weights are initialized can have a significant impact on the magnitude of the gradients. If the initial weights are set too small (or too large), they can push the activations into the non-linear saturation regions of functions like the sigmoid or tanh, causing their derivatives to be very small. This, in turn, contributes to vanishing gradients.
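A back-of-the-envelope sketch of why the weight scale matters (the numbers and the Xavier-style scale here are illustrative assumptions, not from the original answer): for unit-variance inputs, a unit's pre-activation has standard deviation roughly $\sigma_w \sqrt{\text{fan\_in}}$, so an oversized $\sigma_w$ pushes typical pre-activations into the sigmoid's flat regions.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

fan_in = 256  # illustrative layer width

for name, sigma_w in (("std=1 init", 1.0),
                      ("Xavier-style init", 1.0 / math.sqrt(fan_in))):
    # One-standard-deviation pre-activation magnitude for unit-variance inputs.
    typical_pre = sigma_w * math.sqrt(fan_in)
    print(f"{name}: typical pre-activation ~ {typical_pre:.1f}, "
          f"sigmoid slope there ~ {sigmoid_grad(typical_pre):.2e}")
```

With $\sigma_w = 1$ the typical pre-activation is about 16, where the sigmoid's slope is around $10^{-7}$; the Xavier-style scale keeps it near 1, where the slope is still a healthy ~0.2.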
4. Recurrent Neural Networks (RNNs):
RNNs are particularly susceptible because gradients must pass through many time steps when backpropagating through time. As in deep feedforward networks, if the per-step gradient factor is less than one, the multiplicative effect causes the overall gradient to vanish over long sequences.
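The same multiplicative effect can be sketched with a toy scalar RNN (an illustrative assumption, not a real architecture): $h_t = \tanh(w \, h_{t-1})$, whose per-step backprop factor is $w \,(1 - h_t^2)$.

```python
import math

w = 0.9        # recurrent weight with |w| < 1
h = 0.5        # initial hidden state
grad = 1.0     # gradient flowing back through time

for t in range(100):
    h = math.tanh(w * h)
    # d h_t / d h_{t-1} = w * (1 - tanh^2(w * h_{t-1})) = w * (1 - h_t^2)
    grad *= w * (1.0 - h ** 2)

print(f"gradient after 100 steps: {grad:.2e}")
```

Each step multiplies by a factor no larger than 0.9, so after 100 steps the gradient is bounded by $0.9^{100} \approx 2.7 \times 10^{-5}$; this is the long-sequence regime that gated architectures such as LSTMs were designed to mitigate.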