Author: admin

  • ML0034 Backpropagation

    What is backpropagation?

    Answer

    Backpropagation (backward propagation of errors) is the central algorithm by which multilayer neural networks learn. At its core, it efficiently computes how much each weight and bias in the network contributes to the overall prediction error (loss). It then updates those parameters in the direction that reduces the error the most.
    By combining the chain rule from calculus with gradient‑based optimization (e.g., gradient descent), backpropagation makes training deep architectures tractable and underpins virtually all modern advances in deep learning.

    Steps to conduct Backpropagation:
    (1) Forward Pass: Inputs are propagated through the network to compute outputs. Intermediate activations are stored for later use.
    (2) Compute Loss: Use a loss function to compare the network’s output to the actual target values.
    (3) Backward Pass (Error Propagation): The error is computed at the output layer. The chain rule is then applied to recursively calculate the gradients of the loss with respect to each weight, working backward from the output layer to the input layer.
    (4) Gradient Calculation: For every neuron, determine how much its weights contributed to the error by computing partial derivatives.
    (5) Update Weights: Adjust the weights using an optimization algorithm (e.g., gradient descent), by subtracting a fraction (learning rate) of the computed gradients. This step is repeated iteratively to gradually minimize the loss.

    More details for step (3): Backward Pass (Error Propagation)
    At the Output Layer:
    Imagine a neuron with an output value a (its activation) and a weighted sum z computed as:
    z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
    Suppose we use the mean squared error (MSE) as our loss function:
    L = \frac{1}{2} (T - a)^2
    Where T is the target value.
    The derivative of the loss with respect to the activation is:
    \frac{dL}{da} = a - T
    To update weights, we need to know how the loss changes with respect to z. Using the chain rule, we have:
    \frac{dL}{dz} = \frac{dL}{da} \cdot \frac{da}{dz}
    For example, if the activation function is sigmoid, then:
    \frac{da}{dz} = a (1 - a)

    For Hidden Layers:
    Consider a hidden neuron j that feeds into the output neurons. Its contribution to the loss is influenced by all neurons it connects to in the subsequent layer. The backpropagated error for neuron j is given by:
    \frac{dL}{dz_j} = \left( \sum_{k} \frac{dL}{dz_k} \cdot w_{jk} \right) \cdot f'(z_j)
    Here, f'(z_j) is the derivative of the activation function at neuron j, and the sum runs over all neurons k in the next layer that neuron j feeds into.
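As a numeric sketch of this formula, assuming a sigmoid activation at neuron j; the downstream error signals, weights, and activation below are illustrative values:

```python
import numpy as np

dL_dz_next = np.array([0.2, -0.1])  # dL/dz_k for two downstream neurons
w_jk = np.array([0.5, 0.3])         # weights from neuron j to each neuron k
a_j = 0.7                           # sigmoid activation of neuron j

f_prime = a_j * (1 - a_j)           # sigmoid derivative f'(z_j)
dL_dz_j = (dL_dz_next @ w_jk) * f_prime
print(dL_dz_j)                      # (0.2*0.5 - 0.1*0.3) * 0.21 = 0.0147
```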

    More details for step (4): Gradient Calculation
    For Each Weight:
    Once you have the error signal \frac{dL}{dz} for a neuron, the gradient with respect to a weight w_i connected to input x_i is:
    \frac{dL}{dw_i} = \frac{dL}{dz} \cdot x_i
    This shows that the gradient is directly proportional to the input: the larger the input x_i, the more its weight w_i contributed to the final error.

    For the Bias:
    Since the bias b contributes to z with a derivative of 1, the gradient for the bias is simply:
    \frac{dL}{db} = \frac{dL}{dz}
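The derivations in steps (3) to (5) can be sketched end-to-end for a single sigmoid neuron; all input values, weights, and the learning rate below are illustrative:

```python
import numpy as np

x = np.array([0.5, -1.0])   # inputs x_1, x_2
w = np.array([0.8, 0.2])    # weights w_1, w_2
b = 0.1                     # bias
T = 1.0                     # target

# (1) forward pass
z = w @ x + b                        # weighted sum
a = 1 / (1 + np.exp(-z))             # sigmoid activation
# (2) compute loss
loss_before = 0.5 * (T - a) ** 2     # MSE as defined above
# (3) backward pass: dL/dz = dL/da * da/dz = (a - T) * a * (1 - a)
dL_dz = (a - T) * a * (1 - a)
# (4) gradient calculation
dL_dw = dL_dz * x                    # dL/dw_i = dL/dz * x_i
dL_db = dL_dz                        # bias gradient
# (5) weight update with learning rate 0.5
w = w - 0.5 * dL_dw
b = b - 0.5 * dL_db

# one more forward pass confirms the loss decreased
a_new = 1 / (1 + np.exp(-(w @ x + b)))
loss_after = 0.5 * (T - a_new) ** 2
```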



  • ML0033 All Zeros Init

    How does initializing all weights and biases to zero affect a neural network’s training?

    Answer

    Initializing all weights and biases to zero forces neurons to behave identically, leading to uniform gradient updates that prevent the network from learning diverse representations.

    (1) Symmetry Problem: Neurons receive identical gradients, causing them to learn the same features rather than developing distinct representations.
    (2) Limited Representational Capacity: The network cannot capture complex, varied patterns because all neurons behave identically.
    (3) Slow/No Convergence: The lack of representational capacity makes it difficult for the model to converge toward optimal weights.
    (4) Zero Output (Potentially): For some activation functions (like ReLU), with zero weights and biases, the initial output of every neuron will be zero. This can lead to zero gradients in the subsequent layers, halting the learning process entirely.

    Here is an example comparing initializing all weights and biases to zero vs random initialization for a binary classification problem.
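A minimal sketch of the symmetry problem in a tiny network with two hidden sigmoid neurons and MSE loss; the input, target, and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])   # a single training input
t = 1.0                    # binary target

def hidden_weight_grads(W1, W2):
    # forward pass: 2 hidden sigmoid neurons, 1 sigmoid output
    a1 = 1 / (1 + np.exp(-(W1 @ x)))
    a2 = 1 / (1 + np.exp(-(W2 @ a1)))
    # backward pass with MSE loss L = 0.5 * (t - a2)^2
    dz2 = (a2 - t) * a2 * (1 - a2)
    dz1 = dz2 * W2 * a1 * (1 - a1)
    return np.outer(dz1, x)          # one gradient row per hidden neuron

# all-zeros init: both hidden neurons receive identical gradient rows,
# so they remain identical after every update (here the rows are even zero)
dW1_zero = hidden_weight_grads(np.zeros((2, 2)), np.zeros(2))

# random init breaks the symmetry: the gradient rows differ
dW1_rand = hidden_weight_grads(rng.normal(size=(2, 2)), rng.normal(size=2))
```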


  • ML0032 Non-Linear Activation

    Why use non-linear activation functions in neural networks in machine learning, and what limitations would a network face if only linear activation functions were used?

    Answer

    The benefits of using non-linear activation functions in neural networks are as follows:
    (1) Introduce Non-Linearity: Enable learning complex patterns in data.
    (2) Model Complexity: Allow approximation of any continuous function.
    (3) Enable Multiple Layers to Add Power: Allow stacked layers to build complex, abstract representations rather than simple linear mappings. Stacking multiple layers with only linear activations collapses into an equivalent single linear transformation, so depth would confer no additional modeling capacity.

    The limitations of only linear activations are as follows:
    (1) No Depth Advantage: Any multilayer network collapses to a single-layer linear model, so adding layers does not increase modeling power; the whole network acts as a single linear regression, regardless of depth.
    (2) Inability to Learn Non-Linear Boundaries: The network can only fit linearly separable data, so tasks requiring non-linear decision boundaries (e.g., XOR) become impossible.

    The following example shows the limitation of using linear activations only in neural networks.
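The collapse argument can also be sketched directly: composing linear layers is exactly one matrix product, so a "deep" linear network equals a single linear layer (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# three "layers" with purely linear activations (no non-linearity)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
x = rng.normal(size=4)

deep_out = W3 @ (W2 @ (W1 @ x))   # three stacked linear layers
W_single = W3 @ W2 @ W1           # one equivalent linear layer
print(np.allclose(deep_out, W_single @ x))   # True: depth adds nothing
```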


  • DL0011 Fully Connected Layer

    Can you explain what a fully connected layer is?

    Answer

    A Fully Connected (FC) Layer, or Dense Layer, is one where every neuron connects to all neurons in the previous layer. It computes a weighted sum of inputs, adds a bias, and applies an activation function to introduce non-linearity. This allows the network to learn complex feature combinations.

    FC layers learn complex combinations of features but can be parameter-heavy and lose spatial context when the feature maps are flattened.

    Global Average Pooling (GAP) summarizes each feature map into a single value, reducing dimensionality and improving spatial robustness with no added parameters.

    GAP followed by a small FC layer is often used in place of a Flatten operation plus a large FC layer at the end of Convolutional Neural Networks (CNNs) for classification tasks.

    The image below shows examples of parameter comparisons between using Flatten + FC and GAP + FC. There are 6 classes and 8 channels in total.
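The comparison can also be sketched numerically. The 6 classes and 8 channels come from the text above; the 7x7 feature-map size is an assumed value for illustration:

```python
# Parameter counts for the two classification heads.
channels, classes = 8, 6
h = w = 7   # assumed spatial size of the final feature maps

# Flatten + FC: one weight per (flattened input, class) pair, plus biases
flatten_fc = (channels * h * w) * classes + classes

# GAP + FC: GAP itself has no parameters; the FC sees only `channels` inputs
gap_fc = channels * classes + classes

print(flatten_fc)  # 2358
print(gap_fc)      # 54
```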



  • DL0010 Receptive Field

    What is the receptive field in convolutional neural networks, and how do you calculate it?

    Answer

    In convolutional neural networks (CNNs), the receptive field of a neuron is the region of the input image that can affect that neuron’s activation. The receptive field increases in deeper layers, allowing the network to learn hierarchical features.

    Use the following iterative formula to calculate the receptive field:

    RF_l = RF_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i
    Where:
    RF_l represents the receptive field size in layer l. RF_0 = 1 for the input layer.
    k_l represents the kernel size of layer l.
    s_i represents the stride of layer i.
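The formula can be sketched in a few lines; the layer configuration in the usage example (three 3x3 convolutions with strides 1, 2, 2) is illustrative:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, one per conv layer."""
    rf, jump = 1, 1          # RF_0 = 1; jump = product of strides so far
    for k, s in layers:
        rf += (k - 1) * jump # RF_l = RF_{l-1} + (k_l - 1) * prod(s_i)
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 2), (3, 2)]))  # 9
```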

    The following image shows an example of receptive field size growth in a CNN.
    K means kernel size, S means stride, and D means dilation rate.


  • DL0009 Pooling

    Please compare max pooling and average pooling in deep learning, and explain in which scenarios you would prefer one over the other.

    Answer

    Max Pooling: Selects the maximum value within each non-overlapping window (kernel) of the feature map, downsampling while preserving the strongest activation in each region.
    Average Pooling: Computes the average of all values within each window of the feature map, downsampling by smoothing and retaining a holistic summary of the region.
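Both operations can be sketched on a small feature map with 2x2 non-overlapping windows (values illustrative):

```python
import numpy as np

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [9., 2., 1., 0.],
              [3., 4., 6., 5.]])

# split the 4x4 map into 2x2 non-overlapping windows
windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)

max_pool = windows.max(axis=-1)    # strongest activation per window
avg_pool = windows.mean(axis=-1)   # average summary per window

print(max_pool)   # [[6. 8.] [9. 6.]]
print(avg_pool)   # [[3.75 5.25] [4.5 3. ]]
```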

    The image below shows one example of Max Pooling and Average Pooling.

    Max Pooling vs Average Pooling in summary


  • DL0008 Hyperparameter Tuning

    What are the common strategies for Hyperparameter Tuning in deep learning?

    Answer

    Hyperparameter tuning in deep learning is the process of optimizing the configuration settings that control the learning process.
    (1) Manual/Heuristic Search: Start with values from prior work or common practice and iteratively adjust based on validation performance.
    (2) Grid Search: Exhaustively evaluates all combinations over a predefined, discrete grid of hyperparameter values; simple but scales poorly with dimensionality.
    (3) Random Search: Randomly samples hyperparameter values from predefined ranges; often more efficient than grid search when only a few hyperparameters matter.
    (4) Bayesian Optimization: Uses probabilistic models to intelligently suggest the next set of hyperparameters to try, balancing exploration and exploitation.
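Strategy (3) can be sketched in plain Python; the search ranges are illustrative, and `validation_score` is a hypothetical stand-in for training a model and returning its validation accuracy:

```python
import random

random.seed(0)

def validation_score(lr, batch_size):
    # stand-in for "train a model, return validation accuracy";
    # this toy function just peaks near lr=1e-3, batch_size=64
    return -abs(lr - 1e-3) - 0.001 * abs(batch_size - 64)

best = None
for _ in range(20):
    lr = 10 ** random.uniform(-5, -1)          # sample lr on a log scale
    batch_size = random.choice([16, 32, 64, 128])
    score = validation_score(lr, batch_size)
    if best is None or score > best[0]:
        best = (score, lr, batch_size)

print(best[1:])   # best (lr, batch_size) found by the random search
```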

    The plot below illustrates how validation accuracy varies with different learning rates (on a log scale) for two batch-size settings (32 and 64).


  • DL0007 Batch Norm

    Why use batch normalization in deep learning training?

    Answer

    Batch normalization is a crucial technique during deep learning training that enhances network stability and accelerates learning. It achieves this by normalizing the inputs to the activation function for each mini-batch, specifically by subtracting the batch mean and dividing by the batch standard deviation.

    After normalization, the layer applies a learnable scale (gamma) and shift (beta) that are updated during training to allow the network to recover the identity transformation if needed and to re-center/re-scale activations appropriately.

    Here’s the formula for Batch Normalization:
    BN(x_i) = \gamma \left( \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \beta
    Where:
    x_i represents an individual feature value in the batch.
    \mu_B represents the mean of that feature across the current batch.
    \sigma_B^2 represents the variance of that feature across the current batch.
    \epsilon is a small constant (e.g. 10^{-5}) added to the denominator for numerical stability.
    \gamma is a learnable scaling parameter.
    \beta is a learnable shifting parameter.

    Batch Normalization is typically applied after the linear transformation of a layer (e.g., after the convolution operation in a convolutional layer) and before the non-linear activation function (e.g., ReLU).
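A minimal NumPy sketch of the formula above, normalizing each feature over the batch (the input values are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature across the batch
    mu = x.mean(axis=0)            # per-feature batch mean
    var = x.var(axis=0)            # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta    # learnable scale and shift

x = np.array([[1., 2.], [3., 4.], [5., 6.]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0))   # ~0 per feature
print(out.std(axis=0))    # ~1 per feature
```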

    The benefits of using Batch Normalization include:
    (1) Stabilizes learning: Reduces internal covariate shift, making training more stable and less sensitive to network initialization and hyperparameter choices.
    (2) Enables higher learning rates and accelerates training: Allows for larger learning rates without causing instability, leading to faster convergence.
    (3) Improves generalization: Normalizes each mini-batch independently, introducing noise into activations. This noise prevents over-reliance on specific mini-batch activations, forcing the network to learn more robust and generalizable features.


  • DL0006 Layer Freeze in TL

    What are the common strategies for layer freezing in transfer learning?

    Answer

    Here are the common strategies for layer freezing in transfer learning:
    (1) Freeze all but the output layer(s): Train only the final classification/regression layers. Good starting point for similar tasks and small datasets.
    (2) Freeze early layers that capture general features: Train later, task-specific layers. Effective for moderately similar tasks. Balances leveraging pre-learned features with adapting higher-level representations.
    (3) Fine-tune all layers with a low learning rate: Adapt all weights slowly. Use with caution on small datasets.
    (4) Gradual Unfreezing: Start with frozen layers and progressively unfreeze layers during training to refine the model incrementally. Helps avoid large initial weight updates that can “destroy” learned features.
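A minimal, framework-agnostic sketch of strategy (1); the layer names, shapes, and gradients below are illustrative stand-ins, not a real training loop:

```python
import numpy as np

# pretrained "layers"; only the output layer is marked trainable
layers = {
    "conv1":  {"W": np.ones((3, 3)), "trainable": False},  # frozen
    "conv2":  {"W": np.ones((3, 3)), "trainable": False},  # frozen
    "output": {"W": np.ones((3, 3)), "trainable": True},   # fine-tuned
}

def sgd_step(layers, grads, lr=0.1):
    # frozen layers are simply skipped during the update
    for name, layer in layers.items():
        if layer["trainable"]:
            layer["W"] -= lr * grads[name]

grads = {name: np.ones((3, 3)) for name in layers}
sgd_step(layers, grads)
print(layers["conv1"]["W"][0, 0])   # 1.0, unchanged (frozen)
print(layers["output"]["W"][0, 0])  # 0.9, updated
```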


  • DL0005 Transfer Learning

    Why use transfer learning in deep learning instead of training from scratch?

    Answer

    Transfer learning can leverage knowledge from a pre-trained model to improve performance, reduce training time and data requirements, and lower computational costs when tackling a new but related task.
    (1) Leverages Existing Knowledge and Reduced Data Requirements: Transfer learning leverages learned useful representations from large datasets, which can achieve good performance with significantly less task-specific data.
    (2) Faster Convergence and Training Time: Starting with pre-trained weights provides a much better initialization point for training than random weights, leading to faster convergence and potentially better local optima. Pre-trained weights have already learned generalizable features, so fine-tuning on a new task typically requires much less training time.
    (3) Improved Performance on Limited Data Tasks: When data is limited, transfer learning often yields higher accuracy and better generalization compared to training a model from scratch.

