Tag: Basics

  • DL0002 All Ones Init

    What are the potential consequences of initializing all weights to one in a deep learning model?

    Answer

Below are the key consequences of initializing all weights of a deep learning model to one (or any other constant non-zero value), which illustrate why random, scaled initializations (e.g., Xavier/He) are essential.
    (1) Symmetry Problem: Neurons receive identical gradients, causing them to learn the same features rather than developing distinct representations.
    (2) Limited Representational Capacity: The network cannot capture complex, varied patterns because all neurons behave identically.
(3) Slow or No Convergence: The lack of representational capacity also makes it difficult for the model to update toward the optimal weights. (The image below shows an example training-loss comparison for all-ones versus random initialization.)

    (4) Activation Saturation: Can push neurons into saturated regions of activation functions (e.g., sigmoid, tanh), leading to vanishing gradients.
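The symmetry problem can be checked directly. The sketch below (a hypothetical one-hidden-layer network with two sigmoid neurons, backprop written out by hand) shows that with all-ones weights both hidden neurons receive exactly the same gradient, so they can never differentiate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # one input sample, 3 features
y = 1.0                          # target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.ones((2, 3))             # all-ones init: both rows identical
w2 = np.ones(2)

h = sigmoid(W1 @ x)              # both entries of h are equal
y_hat = w2 @ h
err = y_hat - y                  # d(0.5*(y_hat - y)^2)/dy_hat

# Backprop by hand through the two layers
grad_w2 = err * h
grad_W1 = np.outer(err * w2 * h * (1 - h), x)

# The two hidden neurons get exactly the same gradient rows:
print(np.allclose(grad_W1[0], grad_W1[1]))   # True
```

No matter how long training runs, the two rows of W1 stay identical, so the network behaves like a single-neuron model.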


  • DL0001 Residual Connection

    Why are residual connections important in deep neural networks?

    Answer

    Residual connections, also known as skip connections, are vital in deep neural networks primarily because they tackle the infamous vanishing gradient problem and help with the related issue of network degradation as the network depth increases.

A residual connection is often expressed by the following equation:
 y = F(x) + x
Where:
 F(x) represents the residual mapping that the network learns (i.e., what needs to be added to the input x to achieve the desired output).
 x is the input to the residual block.

    (1) Tackle vanishing gradient problem:
    Residual connections create a direct shortcut for gradient flow by incorporating an identity mapping into the learned transformation. This ensures that even if the gradient through the learned component is small, a strong, direct gradient component persists, preventing vanishing gradients in deep networks. This improves gradient flow during backpropagation, reducing vanishing gradients and enabling the training of very deep networks.

    (2) Address network degradation:
    Residual connections mitigate the degradation problem often seen in deep networks. Without these connections, simply stacking more layers can result in higher training errors, as the network struggles to update its weights effectively. With residual connections, any layer that doesn’t contribute useful information can effectively learn to output zeros in the residual branch, letting the network default to an identity mapping.


  • ML0031 Linear Regression

    What are the advantages and disadvantages of linear regression?

    Answer

    Linear regression aims to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

    \displaystyle h_\theta(x) = \theta_0 + \sum_{j=1}^{p} \theta_j x_j
    Where
    h_\theta(x) represents the hypothesis (predicted value) for input feature vector x.
    \theta_0 is the bias (intercept) parameter, shifting the prediction up or down independent of features.
    \theta_j are the weight parameters multiplying each feature.
    x_j denotes the j‑th feature of the input vector x.
    p is the total number of features (excluding the bias) used in the model.

    Advantages:
    (1) Simplicity & Interpretability: Linear regression is easy to understand and implement. The coefficients of the model directly indicate the strength and direction of the relationship between the features and the target variable, making it highly interpretable.
    (2) Computational Efficiency: Its low computational cost makes linear regression fast to train, even on large datasets.
    (3) Effective for Linearly Separable Data: It performs well when the relationship between the independent and dependent variables is approximately linear.

    Disadvantages:
    (1) Assumes Linearity: The primary limitation is the assumption that the relationship between the variables is linear. It will perform poorly if the underlying relationship is nonlinear.
    (2) Sensitivity to Outliers: Extreme values can disproportionately affect the model, distorting the results.
(3) Multicollinearity Issues: When predictors are highly correlated, it becomes difficult to isolate individual effects, leading to unreliable coefficient estimates.
    (4) Potential for Underfitting: The simplicity of the model may fail to capture the nuances and complexities of more intricate datasets.
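The model above can be fit in closed form via the normal equation, theta = (X^T X)^{-1} X^T y. A minimal sketch on synthetic data (the coefficients and noise level here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 2
X = rng.normal(size=(n, p))
true_theta = np.array([3.0, 1.5, -2.0])            # [theta_0 (bias), theta_1, theta_2]
Xb = np.hstack([np.ones((n, 1)), X])               # prepend a column of ones for the bias
y = Xb @ true_theta + 0.01 * rng.normal(size=n)    # linear signal plus small noise

# Solve (Xb^T Xb) theta = Xb^T y instead of inverting explicitly
theta_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(np.round(theta_hat, 2))                      # ≈ [3.0, 1.5, -2.0]
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the numerically preferred way to evaluate the normal equation.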



  • ML0030 Sigmoid

    What are the advantages and disadvantages of using a sigmoid activation function?

    Answer

    The sigmoid activation function transforms input values into a range between 0 and 1, making it useful in various applications like binary classification.
    \mbox{Sigmoid}(x) = \frac{1}{1 + e^{-x}}

    Advantages:
    (1) Smooth, Bounded Gradient: The sigmoid’s S‑shape yields a continuous derivative, preventing abrupt changes in backpropagation and aiding stable training on shallow networks.
    (2) Probability interpretation: Since the output is between 0 and 1, it can be useful for problems where predictions need to represent probabilities.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or small inputs, the gradient becomes almost zero, slowing down training in deep networks.
    (2) Not zero-centered: The outputs are always positive, which can lead to inefficient weight updates and slower convergence.
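Both disadvantages are visible numerically. The derivative sigma'(x) = sigma(x)(1 - sigma(x)) peaks at only 0.25 and collapses for large |x|, and the output is never negative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # maximum value 0.25, reached at x = 0

print(sigmoid(0.0))               # 0.5 (not zero-centered: all outputs lie in (0, 1))
print(sigmoid_grad(0.0))          # 0.25, the largest the gradient ever gets
print(sigmoid_grad(10.0))         # ~4.5e-5: saturated region, gradient vanishes
```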


  • ML0029 Tanh

    What are the advantages and disadvantages of using the tanh activation function?

    Answer

    In machine learning, the hyperbolic tangent (tanh) activation function is defined as
    \mbox{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
    This function transforms input values into a range between -1 and 1, helping with faster convergence in neural networks.

    Advantages:
    (1) Zero-centered outputs: Unlike sigmoid, which outputs values between 0 and 1, tanh produces values between -1 and 1, making optimization easier and reducing bias in gradient updates.
    (2) Smooth and Differentiable: The function is infinitely differentiable, supporting stable gradient‑based methods.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or very small input values, the derivative of tanh approaches zero, leading to slow weight updates and potentially hindering deep network training.
    (2) Computationally expensive: Compared to ReLU's simple threshold operation, tanh requires exponential-function evaluations, which may slow down model inference.
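The contrast with sigmoid is easy to verify: tanh(0) = 0 (zero-centered) and its derivative 1 - tanh(x)^2 peaks at 1.0 rather than 0.25, yet it still saturates for large |x|:

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # derivative of tanh

print(np.tanh(0.0))                # 0.0: outputs are zero-centered
print(tanh_grad(0.0))              # 1.0, four times sigmoid's peak of 0.25
print(tanh_grad(5.0))              # ~1.8e-4: saturation still causes vanishing gradients
```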



  • ML0028 Softmax

    What is the Softmax activation function, and what is its purpose?

    Answer

    Softmax is an activation function typically used in the output layer of a neural network for multi-class classification problems. Its purpose is to convert a vector of raw scores (logits) into a probability distribution over the possible output classes. The output of Softmax is a vector where each element represents the probability of the input belonging to a specific class, and the sum of these probabilities is always 1.
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
     z_i represents the raw score (also known as a “logit”) for the i-th class.
     K represents the total number of classes in the classification problem.

    The combination of the softmax function with the cross-entropy loss function is standard for multi-class classification problems. The softmax function provides a probability distribution over classes, and the cross-entropy loss measures how well this predicted distribution aligns with the true distribution (typically a one-hot encoded vector).
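A minimal sketch of this combination follows. Note the standard numerical-stability trick of subtracting max(z) before exponentiating: it leaves the result unchanged (the common factor cancels) but prevents overflow. The logits here are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # shift by max(z) for numerical stability
    return e / e.sum()

def cross_entropy(probs, true_class):
    # loss for a one-hot target: -log of the probability of the true class
    return -np.log(probs[true_class])

z = np.array([2.0, 1.0, 0.1])      # hypothetical logits for K = 3 classes
p = softmax(z)
print(p.sum())                     # 1.0: a valid probability distribution
print(cross_entropy(p, 0))        # modest loss, since class 0 has the highest logit
```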


  • ML0027 Leaky ReLU

    What are the benefits of the Leaky ReLU activation function?

    Answer

    Leaky ReLU modifies the standard ReLU by allowing a small, non-zero gradient for negative inputs. Its formula is typically written as:
    {\large \text{Leaky ReLU}(x) = x \text{ if } x \ge 0,\quad \alpha x \text{ if } x < 0}

    Advantages of Leaky ReLU:
    1. Addresses the dying ReLU problem: By having a small non-zero slope for negative inputs, Leaky ReLU allows a small gradient to flow even when the neuron is not active in the positive region. This prevents neurons from getting stuck in a permanently inactive state and potentially helps them recover during training.
    2. Retains the benefits of ReLU for positive inputs: Maintains the linearity and non-saturation for positive values, contributing to efficient computation and gradient propagation.  
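Both points can be seen in a direct implementation, assuming the common (hypothetical) default slope alpha = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x >= 0, 1.0, alpha)   # gradient is never exactly zero

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(x))        # negative inputs scaled by alpha, positives pass through
print(leaky_relu_grad(x))   # alpha for negatives, 1 for positives: no dead neurons
```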



  • ML0026 ReLU

    What are the benefits and limitations of the ReLU activation function?

    Answer

    ReLU offers substantial benefits in terms of computational efficiency, gradient propagation, and sparsity, which have made it a popular choice for activation functions in deep learning.
    {\large \text{ReLU}(x) = \max(0, x)}

    Advantages of ReLU:
    1. Mitigation of the Vanishing Gradient Problem: In the positive region (x>0), ReLU has a constant gradient of 1. This helps to alleviate the vanishing gradient problem that plagues sigmoid and tanh functions, especially in deep networks. A constant gradient allows for more effective backpropagation of the error signal to earlier layers.
    2. Sparse Activation:
    By outputting zero for all negative input values, ReLU naturally induces sparsity in the network. This means that, at any given time, only a subset of neurons are active. Sparse activations can lead to more efficient representations and can help the network learn more robust features.
    3. Computational Efficiency:
    ReLU is computationally simple, requiring only a threshold operation, which accelerates both training and inference processes compared to functions like sigmoid or tanh that involve more complex calculations.

    Drawbacks of ReLU:
    1. Dying ReLU Problem:
    Neurons can become inactive if they consistently receive negative inputs, leading them to output zero and potentially never recover, thus reducing the model’s capacity.
    2. Unbounded Output:
    The unbounded nature of ReLU’s positive outputs can lead to large activation values, potentially causing issues like exploding gradients if not properly managed.
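The sparsity and unbounded-output properties are simple to demonstrate on zero-mean pre-activations (the synthetic data below is an illustrative assumption):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
pre_act = rng.normal(size=1000)        # zero-mean Gaussian pre-activations
act = relu(pre_act)

sparsity = np.mean(act == 0.0)
print(round(sparsity, 2))              # ~0.5: about half the neurons are inactive
print(act.max() == pre_act.max())      # True: positive values pass through unbounded
```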


  • ML0025 Exploding Gradient

    What are the typical reasons for exploding gradient?

    Answer

    Exploding gradients occur when the gradients during backpropagation become excessively large. This leads to huge updates in the model’s weights, making the training process unstable and potentially causing the model to diverge instead of converging.

    Typical Reasons for Exploding Gradients:
    1. Deep Architectures:
    In very deep networks, repeatedly multiplying gradients (especially when derivatives are >1) can cause them to grow exponentially.

    2. Large Learning Rates:
    When the learning rate is set too high, even moderately large gradients can result in weight updates that overshoot the optimum by a significant margin, compounding the instability.

    3. Improper Weight Initialization:
    If weights are initialized to values that are too high, activations and their corresponding derivatives can be disproportionately large. This imbalance not only disrupts the symmetry in learning but can also contribute to the accumulation of large gradient values.

    4. Activation Functions with Derivatives Greater Than 1:
    Some activation functions or their operating regimes can have derivatives greater than 1. Repeated multiplication of these large derivatives during backpropagation can lead to exponential growth of the gradients.
    For example, the Scaled Exponential Linear Unit (SELU) is defined for positive inputs as:
    {\large \text{SELU}(x) = \lambda x,\quad \text{for } x>0}
    In the typical configuration for self-normalizing neural networks, the parameter {\large\lambda} is set to approximately 1.0507 (greater than 1). In the positive regime, each layer therefore amplifies the gradient by a factor of {\large\lambda}, which, compounded over many layers, can contribute to exploding gradients if not properly managed.
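The compounding effect is easy to quantify: a per-layer amplification factor only slightly above 1 grows exponentially with depth.

```python
# Sketch: repeated multiplication by a per-layer gradient factor > 1,
# using SELU's lambda ≈ 1.0507 in the positive regime as the factor.
lam = 1.0507
for n in (10, 100, 500):
    print(n, lam ** n)
# at depth 500 the gradient is amplified by a factor of roughly 5e10
```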


  • ML0024 Vanishing Gradient

    What are the typical reasons for vanishing gradient?

    Answer

The vanishing gradient problem occurs during the training of deep neural networks when gradients become exceedingly small as they are backpropagated through the network’s layers. This diminishes the effectiveness of weight updates, particularly in the earlier layers, hindering the network’s ability to learn and converge efficiently.

    Typical Reasons for Vanishing Gradients:
    1. Saturating Activation Functions:
    Activation functions, such as the sigmoid or tanh, compress input values into a narrow range. For example, the sigmoid function is defined as:
    {\large \sigma(z) = \displaystyle\frac{1}{1 + e^{-z}}}
    Its derivative is:
    {\large \sigma'(z) = \sigma(z)(1 - \sigma(z))}
    Notice that when {\large z} has very high or very low values, {\large\sigma(z)} saturates close to 1 or 0, making {\large\sigma'(z)} extremely small. When such small derivatives are multiplied across many layers (as dictated by the chain rule), they shrink toward zero, leading to vanishing gradients.

    2. Deep Network Architectures:
    In deep models, the gradient for a given layer involves a product of many small derivatives from subsequent layers. Mathematically, if you consider a simple scenario, the gradient with respect to an early layer might be expressed as:
    {\large \frac{\partial L}{\partial x} = \prod_{i=1}^{n} \frac{\partial x_{i+1}}{\partial x_{i}}}
    If each term in the product is less than one in absolute value, the overall product becomes extremely small as n (the number of layers) increases.

    3. Improper Weight Initialization:
    The way weights are initialized can have a significant impact on the magnitude of the gradients. If the initial weights are set too small (or too large), they can push the activations into the non-linear saturation regions of functions like the sigmoid or tanh, causing their derivatives to be very small. This, in turn, contributes to vanishing gradients.

    4. Recurrent Neural Networks (RNNs):
    RNNs are particularly susceptible because the gradients must pass through many time steps (or iterations) when backpropagating through time. Similar to deep feedforward networks, if the gradient at each time step is less than one, the multiplicative effect causes the overall gradient to vanish over long sequences.
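The chain-rule product above shrinks fast even for factors only slightly below 1. A minimal sketch, assuming a hypothetical per-step derivative magnitude of 0.9:

```python
# Repeated multiplication by a per-step derivative |dx_{i+1}/dx_i| < 1,
# as happens when backpropagating through time over long sequences.
factor = 0.9
for n in (10, 50, 200):
    print(n, factor ** n)
# 0.9**200 ≈ 7e-10: the gradient reaching the earliest steps is effectively zero
```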

