Author: admin

  • DL0004 Small Kernels

    What are the key advantages of using small convolutional kernels, such as 3×3, over utilizing a few larger kernels in deep learning architectures?

    Answer

    Using small convolutional kernels instead of a few larger kernels offers several significant advantages in deep learning architectures:

(1) Deeper Networks & More Non-Linearity: Stacking multiple 3×3 layers (e.g., three 3×3 layers, which together cover the receptive field of a single 7×7 kernel) yields a deeper network with more non-linear activation functions than a single large kernel.
(2) Reduced Parameters: Multiple small kernels can achieve the same receptive field as a larger one, but with fewer parameters.
Example: Two stacked 3×3 layers cover the same 5×5 receptive field as a single 5×5 layer, but (assuming constant channel width) cost 18 \cdot C_{in} \cdot C_{out} parameters versus 25 \cdot C_{in} \cdot C_{out}.

    (3) Computational Efficiency: Fewer parameters in smaller kernels generally lead to lower computation costs during training and inference.
    (4) Gradual Receptive Field Expansion: Successive 3×3 convolutions progressively build a larger receptive field while maintaining fine detail. (3×3 filters focus on local detail capture with pixel neighborhoods, ideal for textures or edges.)
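The parameter comparison in point (2) can be checked directly. A minimal sketch, assuming equal input/output channel width for every layer and ignoring biases:

```python
# Parameter-count comparison: stacked 3x3 kernels vs. one larger kernel.
# Assumes C_in == C_out == C for every layer and ignores bias terms.

def conv_params(k, c_in, c_out):
    """Weight count of a single k x k convolution layer."""
    return k * k * c_in * c_out

C = 64
two_3x3 = 2 * conv_params(3, C, C)   # receptive field 5x5
one_5x5 = conv_params(5, C, C)       # same receptive field

print(two_3x3, one_5x5)  # 18*C*C vs 25*C*C -- the stack is cheaper
```

With C = 64 the stacked pair uses 73,728 weights versus 102,400 for the single 5×5 layer, matching the 18 vs 25 ratio above.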


  • DL0003 1×1 Convolution

    What are the benefits of using 1×1 convolutional layers in deep learning architectures?

    Answer

    A 1×1 convolution, also known as a pointwise convolution, is a convolutional operation with a kernel size of 1×1; it plays several crucial roles in deep learning architectures.
    (1) Dimensionality control: 1×1 convolution can reduce or expand the number of feature maps, trading off representational capacity and computational cost.

    For example, Bottleneck designs: In architectures like ResNet’s bottleneck block, a 1×1 conv first reduces channels (e.g., 256→64), then a 3×3 conv processes those, and finally another 1×1 conv expands back (64→256) to restore capacity while keeping compute manageable.

    (2) Increased Network Depth with Controlled Cost: Allows for the design of deeper networks by reducing channel dimensionality before computationally expensive spatial convolutions.
    (3) Cross-Channel Feature Fusion: Enables interaction and combination of information across different feature channels at the same spatial location.
    (4) Non-linear mixing: When followed by activations (ReLU, etc.), 1×1 convolutions introduce non-linear channel mixing that enhances model expressiveness.
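Point (3) is easy to see in code: at a single spatial location, a 1×1 convolution is just a matrix-vector product over the channel dimension. A toy sketch with plain lists (a real layer would of course use a tensor library):

```python
# A 1x1 convolution at one pixel is a matrix-vector product over channels:
# out[c_out] = sum over c_in of W[c_out][c_in] * x[c_in].

def pointwise_conv(x, W):
    """x: C_in channel values at one pixel; W: C_out x C_in weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [1.0, 2.0, 3.0]            # C_in = 3 channels at one spatial location
W = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5]]          # C_out = 2: channel reduction 3 -> 2
print(pointwise_conv(x, W))    # [1.0, 3.0]
```

No spatial neighborhood is touched, which is exactly why a 1×1 layer fuses information across channels while leaving spatial resolution unchanged.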


  • DL0002 All Ones Init

    What are the potential consequences of initializing all weights to one in a deep learning model?

    Answer

    Below are the key consequences of initializing all weights in a deep-learning model to one (a constant non-zero value), illustrating why random, scaled initializations (e.g., Xavier/He) are essential.
    (1) Symmetry Problem: Neurons receive identical gradients, causing them to learn the same features rather than developing distinct representations.
    (2) Limited Representational Capacity: The network cannot capture complex, varied patterns because all neurons behave identically.
    (3) Slow/No Convergence: The lack of representational capacity further makes it difficult for the model to update toward the optimal weights. (Figure: training-loss comparison of all-ones vs. random initialization.)

    (4) Activation Saturation: Can push neurons into saturated regions of activation functions (e.g., sigmoid, tanh), leading to vanishing gradients.
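The symmetry problem in point (1) can be demonstrated in a few lines. A minimal sketch with a tiny 2-2-1 linear network and one manual backpropagation step (squared loss; the specific numbers are illustrative):

```python
# Symmetry demo: with all-ones init, two hidden units compute identical
# outputs and receive identical gradients, so they can never differentiate.

x = [0.5, -1.0]                    # a single input sample
W1 = [[1.0, 1.0], [1.0, 1.0]]      # both hidden units start identical
W2 = [1.0, 1.0]                    # output weights, also all ones
y_true = 1.0

h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]  # hidden outputs
y = sum(w * hi for w, hi in zip(W2, h))                   # prediction
d_y = 2 * (y - y_true)                                    # dLoss/dy

# Gradient for each hidden unit's weights: dLoss/dW1[i][j] = d_y * W2[i] * x[j]
grad_W1 = [[d_y * W2[i] * xj for xj in x] for i in range(2)]
print(grad_W1[0] == grad_W1[1])  # True: identical gradients, identical updates
```

Because the two rows of the gradient are equal, both hidden units stay equal after every update, so the network behaves as if it had a single hidden unit.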


  • DL0001 Residual Connection

    Why are residual connections important in deep neural networks?

    Answer

    Residual connections, also known as skip connections, are vital in deep neural networks primarily because they tackle the infamous vanishing gradient problem and help with the related issue of network degradation as the network depth increases.

    A residual connection is often expressed by the following equation:
     y = F(x) + x
    Where:
     F(x) represents the residual mapping that the network learns (i.e., what needs to be added to the input  x to achieve the desired output).
     x is the input to the residual block.

    (1) Tackle vanishing gradient problem:
    Residual connections create a direct shortcut for gradient flow by adding an identity mapping to the learned transformation. Even if the gradient through the learned branch F(x) is small, the identity path contributes a direct gradient component during backpropagation, so useful gradients still reach early layers and very deep networks remain trainable.

    (2) Address network degradation:
    Residual connections mitigate the degradation problem often seen in deep networks. Without these connections, simply stacking more layers can result in higher training errors, as the network struggles to update its weights effectively. With residual connections, any layer that doesn’t contribute useful information can effectively learn to output zeros in the residual branch, letting the network default to an identity mapping.
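The identity-fallback behavior in point (2) can be sketched directly from the equation y = F(x) + x. A minimal illustration (plain lists stand in for tensors):

```python
# Residual block sketch: y = F(x) + x, where F is the learned residual branch.
# If F's output collapses to zero, the block reduces to the identity map.

def residual_block(x, F):
    """Element-wise skip connection around an arbitrary transform F."""
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]

x = [1.0, 2.0, 3.0]
zero_branch = lambda v: [0.0] * len(v)   # a layer that learned "nothing useful"
print(residual_block(x, zero_branch))    # [1.0, 2.0, 3.0] -- pure identity
```

This is why adding such a block can never make it harder to represent the identity mapping, which is exactly the failure mode of plain stacked layers.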


  • ML0031 Linear Regression

    What are the advantages and disadvantages of linear regression?

    Answer

    Linear regression aims to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

    \displaystyle h_\theta(x) = \theta_0 + \sum_{j=1}^{p} \theta_j x_j
    Where:
    h_\theta(x) represents the hypothesis (predicted value) for input feature vector x.
    \theta_0 is the bias (intercept) parameter, shifting the prediction up or down independent of features.
    \theta_j are the weight parameters multiplying each feature.
    x_j denotes the j‑th feature of the input vector x.
    p is the total number of features (excluding the bias) used in the model.
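The hypothesis above can be evaluated directly. A minimal sketch with illustrative parameter values:

```python
# Linear regression hypothesis: h(x) = theta_0 + sum_j theta_j * x_j.
# theta[0] is the bias; theta[1:] are the feature weights (toy values).

def h(theta, x):
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

theta = [1.0, 2.0, -0.5]       # bias, then two feature weights
print(h(theta, [3.0, 4.0]))    # 1 + 2*3 - 0.5*4 = 5.0
```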

    Advantages:
    (1) Simplicity & Interpretability: Linear regression is easy to understand and implement. The coefficients of the model directly indicate the strength and direction of the relationship between the features and the target variable, making it highly interpretable.
    (2) Computational Efficiency: Its low computational cost makes linear regression fast to train, even on large datasets.
    (3) Effective for Linearly Separable Data: It performs well when the relationship between the independent and dependent variables is approximately linear.

    Disadvantages:
    (1) Assumes Linearity: The primary limitation is the assumption that the relationship between the variables is linear. It will perform poorly if the underlying relationship is nonlinear.
    (2) Sensitivity to Outliers: Extreme values can disproportionately affect the model, distorting the results.
    (3) Multicollinearity Issues: When predictors are highly correlated, it becomes difficult to isolate individual effects, leading to unreliable coefficient estimates.
    (4) Potential for Underfitting: The simplicity of the model may fail to capture the nuances and complexities of more intricate datasets.



  • ML0030 Sigmoid

    What are the advantages and disadvantages of using a sigmoid activation function?

    Answer

    The sigmoid activation function transforms input values into a range between 0 and 1, making it useful in various applications like binary classification.
    \mbox{Sigmoid}(x) = \frac{1}{1 + e^{-x}}

    Advantages:
    (1) Smooth, Bounded Gradient: The sigmoid’s S-shape yields a continuous, bounded derivative, avoiding abrupt changes during backpropagation and aiding stable training in shallow networks.
    (2) Probability interpretation: Since the output is between 0 and 1, it can be useful for problems where predictions need to represent probabilities.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or small inputs, the gradient becomes almost zero, slowing down training in deep networks.
    (2) Not zero-centered: The outputs are always positive, which can lead to inefficient weight updates and slower convergence.
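Both the probability interpretation and the vanishing-gradient issue are visible numerically. A minimal sketch using the identity σ′(x) = σ(x)(1 − σ(x)):

```python
import math

# Sigmoid and its derivative. The gradient peaks at 0.25 (at x = 0)
# and is nearly zero for large |x| -- the saturation regions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_grad(0.0))   # 0.25 -- the maximum possible gradient
print(sigmoid_grad(10.0))  # ~4.5e-05 -- effectively vanished
```

Since every layer's gradient is multiplied through the chain rule, repeatedly scaling by at most 0.25 is what makes sigmoid problematic in deep stacks.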


  • ML0029 Tanh

    What are the advantages and disadvantages of using the tanh activation function?

    Answer

    In machine learning, the hyperbolic tangent (tanh) activation function is defined as
    \mbox{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
    This function transforms input values into a range between -1 and 1, helping with faster convergence in neural networks.

    Advantages:
    (1) Zero-centered outputs: Unlike sigmoid, which outputs values between 0 and 1, tanh produces values between -1 and 1, making optimization easier and reducing bias in gradient updates.
    (2) Smooth and Differentiable: The function is infinitely differentiable, supporting stable gradient‑based methods.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or very small input values, the derivative of tanh approaches zero, leading to slow weight updates and potentially hindering deep network training.
    (2) Computationally expensive: Compared to ReLU, tanh requires evaluating exponential functions, which can slow down training and inference.
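Both properties can be checked numerically via the derivative identity tanh′(x) = 1 − tanh(x)². A minimal sketch:

```python
import math

# tanh output is zero-centered, and its gradient 1 - tanh(x)^2 peaks at 1.0
# (vs. sigmoid's 0.25) but still vanishes for large |x|.

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t

print(math.tanh(0.0))   # 0.0 -- zero-centered output
print(tanh_grad(0.0))   # 1.0 -- maximum gradient
print(tanh_grad(5.0))   # ~1.8e-04 -- vanishing in the saturated region
```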



  • ML0028 Softmax

    What is the Softmax activation function, and what is its purpose?

    Answer

    Softmax is an activation function typically used in the output layer of a neural network for multi-class classification problems. Its purpose is to convert a vector of raw scores (logits) into a probability distribution over the possible output classes. The output of Softmax is a vector where each element represents the probability of the input belonging to a specific class, and the sum of these probabilities is always 1.
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
     z_i represents the raw score (also known as a “logit”) for the i-th class.
     K represents the total number of classes in the classification problem.

    The combination of the softmax function with the cross-entropy loss function is standard for multi-class classification problems. The softmax function provides a probability distribution over classes, and the cross-entropy loss measures how well this predicted distribution aligns with the true distribution (typically a one-hot encoded vector).
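The formula above translates directly to code. One practical detail worth showing: implementations conventionally subtract max(z) before exponentiating, which leaves the result unchanged (the factor cancels in the ratio) but prevents overflow for large logits.

```python
import math

# Numerically stable softmax: subtracting max(z) changes nothing
# mathematically but keeps exp() from overflowing.

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 -- a valid probability distribution
```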


  • ML0027 Leaky ReLU

    What are the benefits of the Leaky ReLU activation function?

    Answer

    Leaky ReLU modifies the standard ReLU by allowing a small, non-zero gradient for negative inputs. Its formula is typically written as:
    {\large \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ \alpha x & \text{if } x < 0 \end{cases}}
    where \alpha is a small positive slope (commonly 0.01).

    Advantages of Leaky ReLU:
    1. Addresses the dying ReLU problem: By having a small non-zero slope for negative inputs, Leaky ReLU allows a small gradient to flow even when the neuron is not active in the positive region. This prevents neurons from getting stuck in a permanently inactive state and potentially helps them recover during training.
    2. Retains the benefits of ReLU for positive inputs: Maintains the linearity and non-saturation for positive values, contributing to efficient computation and gradient propagation.  
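The piecewise definition is a one-liner in code. A minimal sketch using α = 0.01, a common default rather than part of the definition:

```python
# Leaky ReLU: identity for x >= 0, small slope alpha for x < 0,
# so negative inputs still produce a non-zero output and gradient.

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

print(leaky_relu(3.0))    # 3.0   -- same as ReLU for positive inputs
print(leaky_relu(-2.0))   # -0.02 -- small but non-zero, so the unit can recover
```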



  • ML0026 ReLU

    What are the benefits and limitations of the ReLU activation function?

    Answer

    ReLU offers substantial benefits in terms of computational efficiency, gradient propagation, and sparsity, which have made it a popular choice for activation functions in deep learning.
    {\large \text{ReLU}(x) = \max(0, x)}

    Advantages of ReLU:
    1. Mitigation of the Vanishing Gradient Problem: In the positive region (x>0), ReLU has a constant gradient of 1. This helps to alleviate the vanishing gradient problem that plagues sigmoid and tanh functions, especially in deep networks. A constant gradient allows for more effective backpropagation of the error signal to earlier layers.
    2. Sparse Activation:
    By outputting zero for all negative input values, ReLU naturally induces sparsity in the network. This means that, at any given time, only a subset of neurons are active. Sparse activations can lead to more efficient representations and can help the network learn more robust features.
    3. Computational Efficiency:
    ReLU is computationally simple, requiring only a threshold operation, which accelerates both training and inference processes compared to functions like sigmoid or tanh that involve more complex calculations.

    Drawbacks of ReLU:
    1. Dying ReLU Problem:
    Neurons can become inactive if they consistently receive negative inputs, leading them to output zero and potentially never recover, thus reducing the model’s capacity.
    2. Unbounded Output:
    The unbounded nature of ReLU’s positive outputs can lead to large activation values, potentially causing issues like exploding gradients if not properly managed.
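The advantages and drawbacks above all follow from ReLU's simple piecewise form. A minimal sketch of the function and its gradient:

```python
# ReLU and its gradient: constant gradient 1 for x > 0 (no vanishing),
# zero output AND zero gradient for x < 0 (sparsity, but also dying-ReLU risk).

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu_grad(2.5))    # 2.5 1.0 -- active unit, full gradient
print(relu(-1.0), relu_grad(-1.0))  # 0.0 0.0 -- inactive unit, no gradient
```

The zero-gradient branch is both the source of sparse activation and the mechanism behind the dying-ReLU problem: a unit whose inputs stay negative receives no gradient signal to escape.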

