Category: Easy

  • DL0011 Fully Connected Layer

    Can you explain what a fully connected layer is?

    Answer

    A Fully Connected (FC) Layer, or Dense Layer, is one where every neuron connects to all neurons in the previous layer. It computes a weighted sum of inputs, adds a bias, and applies an activation function to introduce non-linearity. This allows the network to learn complex feature combinations.

    FC layers learn complex combinations of features but can be parameter-heavy and lose spatial context when the feature maps are flattened.

    Global Average Pooling (GAP) summarizes each feature map into a single value, reducing dimensionality and improving spatial robustness with no added parameters.

    GAP followed by a small FC layer is often used in place of flattening into a large FC layer at the end of Convolutional Neural Networks (CNNs) for classification tasks.

    The image below shows a parameter comparison between Flatten + FC and GAP + FC. There are a total of 6 classes and 8 channels.
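    The comparison can be sketched numerically. This is a minimal illustration, assuming a 7×7 spatial size for the final feature map (the text only fixes 8 channels and 6 classes):

    ```python
    # Parameter counts for Flatten + FC vs. GAP + FC on a C-channel feature map.
    # The 7x7 spatial size is an assumption for illustration; the text above
    # only fixes 8 channels and 6 classes.

    def fc_params(n_in, n_out):
        """Weights plus one bias per output unit."""
        return n_in * n_out + n_out

    h, w, c, classes = 7, 7, 8, 6

    flatten_fc = fc_params(h * w * c, classes)  # every spatial position feeds the FC layer
    gap_fc = fc_params(c, classes)              # GAP reduces each channel to one value first

    print(flatten_fc)  # 7*7*8*6 + 6 = 2358
    print(gap_fc)      # 8*6 + 6 = 54
    ```

    The gap widens quickly with spatial size: GAP + FC depends only on the channel and class counts, not on the feature-map resolution.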



  • DL0009 Pooling

    Please compare max pooling and average pooling in deep learning, and explain in which scenarios you would prefer one over the other.

    Answer

    Max Pooling: Selects the maximum value within each non-overlapping window (kernel) of the feature map, downsampling while preserving the strongest activation in each region.
    Average Pooling: Computes the average of all values within each window of the feature map, downsampling by smoothing and retaining a holistic summary of the region.

    Max pooling is generally preferred when the presence of a strong feature matters most (e.g., edges and textures in classification), while average pooling is preferred when a smoother, holistic summary of a region is desired (e.g., Global Average Pooling before a classifier head).

    The image below shows one example of Max Pooling and Average Pooling.

    Max Pooling vs. Average Pooling in summary
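    Both operations can be shown on a small array. A minimal NumPy sketch with non-overlapping 2×2 windows (stride equal to the window size, dimensions assumed divisible by it):

    ```python
    import numpy as np

    # Minimal 2x2 max and average pooling via reshape: split each spatial axis
    # into (blocks, window) and reduce over the window axes.
    def pool2d(x, k=2, mode="max"):
        h, w = x.shape
        windows = x.reshape(h // k, k, w // k, k)
        return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

    x = np.array([[1, 3, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]], dtype=float)

    print(pool2d(x, mode="max"))  # [[6. 8.] [3. 4.]]
    print(pool2d(x, mode="avg"))  # [[3.75 5.25] [2. 2.]]
    ```

    Each 2×2 window collapses to either its strongest activation or its mean, halving each spatial dimension.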


  • DL0008 Hyperparameter Tuning

    What are the common strategies for Hyperparameter Tuning in deep learning?

    Answer

    Hyperparameter tuning in deep learning is the process of optimizing the configuration settings that control the learning process.
    (1) Manual/Heuristic Search: Starts from values used in prior work or common practice and iteratively adjusts them based on validation performance.
    (2) Grid Search: Exhaustively evaluates all combinations over a predefined, discrete grid of hyperparameter values; simple but scales poorly with dimensionality.
    (3) Random Search: Randomly samples hyperparameter values from predefined ranges; often more efficient than grid search when only a few hyperparameters matter.
    (4) Bayesian Optimization: Uses probabilistic models to intelligently suggest the next set of hyperparameters to try, balancing exploration and exploitation.

    The plot below illustrates how validation accuracy varies with the learning rate (on a log scale) for two batch-size settings (32 and 64).
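    Random search over the same two hyperparameters can be sketched in a few lines. This is a toy example: `val_accuracy` is a synthetic stand-in for a real validation run, invented here so the loop is runnable; the log-uniform sampling of the learning rate is the standard practice for scale-type hyperparameters.

    ```python
    import math
    import random

    # Hypothetical objective standing in for validation accuracy: peaks near
    # lr = 1e-2 and mildly prefers batch size 32. A real search would train
    # and evaluate a model here instead.
    def val_accuracy(lr, batch_size):
        return 1.0 - 0.1 * abs(math.log10(lr) + 2) - (0.02 if batch_size == 64 else 0.0)

    random.seed(0)
    best = max(
        ((10 ** random.uniform(-5, -1), random.choice([32, 64]))  # log-uniform lr
         for _ in range(50)),
        key=lambda cfg: val_accuracy(*cfg),
    )
    print(best)  # (lr, batch_size) pair with the highest synthetic accuracy
    ```

    Sampling the learning rate log-uniformly ensures 1e-5 and 1e-2 are equally likely to be explored, which a uniform draw over [1e-5, 1e-1] would not do.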


  • DL0007 Batch Norm

    Why use batch normalization in deep learning training?

    Answer

    Batch normalization is a crucial technique during deep learning training that enhances network stability and accelerates learning. It achieves this by normalizing the inputs to the activation function for each mini-batch, specifically by subtracting the batch mean and dividing by the batch standard deviation.

    After normalization, the layer applies a learnable scale (gamma) and shift (beta) that are updated during training to allow the network to recover the identity transformation if needed and to re-center/re-scale activations appropriately.

    Here’s the formula for Batch Normalization:
    BN(x_i) = \gamma \left( \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \beta
    Where:
    x_i represents an individual feature value in the batch.
    \mu_B represents the mean of that feature across the current batch.
    \sigma_B^2 represents the variance of that feature across the current batch.
    \epsilon is a small constant (e.g. 10^{-5}) added to the denominator for numerical stability.
    \gamma is a learnable scaling parameter.
    \beta is a learnable shifting parameter.
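    The formula translates directly into code. A minimal NumPy sketch for a 2D input of shape (batch, features), with per-feature statistics; in a real framework, gamma and beta would be learned and running statistics kept for inference:

    ```python
    import numpy as np

    # Batch normalization following the formula above: per-feature batch mean
    # and variance, then a learnable scale (gamma) and shift (beta).
    def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        mu = x.mean(axis=0)       # mu_B: mean of each feature over the batch
        var = x.var(axis=0)       # sigma_B^2: variance of each feature over the batch
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # off-center, high-variance input
    y = batch_norm(x)
    print(y.mean(axis=0))  # ~0 for every feature
    print(y.std(axis=0))   # ~1 for every feature
    ```

    With the default gamma = 1 and beta = 0 the output is simply standardized; nonzero learned values let the network re-scale and re-center the activations, recovering the identity transformation if that is optimal.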

    Batch Normalization is typically applied after the linear transformation of a layer (e.g., after the convolution operation in a convolutional layer) and before the non-linear activation function (e.g., ReLU).

    The benefits of using Batch Normalization include:
    (1) Stabilizes learning: Reduces internal covariate shift, making training more stable and less sensitive to network initialization and hyperparameter choices.
    (2) Enables higher learning rates and accelerates training: Allows for larger learning rates without causing instability, leading to faster convergence.
    (3) Improves generalization: Normalizes each mini-batch independently, introducing noise into activations. This noise prevents over-reliance on specific mini-batch activations, forcing the network to learn more robust and generalizable features.


  • DL0006 Layer Freeze in TL

    What are the common strategies for layer freezing in transfer learning?

    Answer

    Here are the common strategies for layer freezing in transfer learning:
    (1) Freeze all but the output layer(s): Train only the final classification/regression layers. Good starting point for similar tasks and small datasets.
    (2) Freeze early layers that capture general features: Train later, task-specific layers. Effective for moderately similar tasks. Balances leveraging pre-learned features with adapting higher-level representations.
    (3) Fine-tune all layers with a low learning rate: Adapt all weights slowly. Use with caution on small datasets.
    (4) Gradual Unfreezing: Start with frozen layers and progressively unfreeze layers during training to refine the model incrementally. Helps avoid large initial weight updates that can “destroy” learned features.
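    Strategies (1) and (4) can be sketched in a framework-agnostic way. The layer names and the `trainable` flag below are illustrative inventions; in PyTorch the flag corresponds to setting `param.requires_grad = False` on each frozen parameter:

    ```python
    # Toy "model": each layer carries a trainable flag, mirroring the role of
    # requires_grad in a real framework. Layer names are hypothetical.
    layers = {name: {"trainable": True}
              for name in ["conv1", "conv2", "conv3", "fc_head"]}

    def freeze_all_but(layer_dict, keep):
        """Strategy (1): freeze everything except the layers in `keep`."""
        for name, layer in layer_dict.items():
            layer["trainable"] = name in keep

    freeze_all_but(layers, keep={"fc_head"})
    print([n for n, l in layers.items() if l["trainable"]])  # ['fc_head']

    # Strategy (4), gradual unfreezing: re-enable layers from the top down,
    # one per stage; a training phase would run between unfreezing steps.
    for name in ["conv3", "conv2", "conv1"]:
        layers[name]["trainable"] = True
    print([n for n, l in layers.items() if l["trainable"]])  # all four layers
    ```

    Unfreezing from the top down keeps the general early-layer features intact longest, which is what protects them from large initial weight updates.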


  • DL0005 Transfer Learning

    Why use transfer learning in deep learning instead of training from scratch?

    Answer

    Transfer learning can leverage knowledge from a pre-trained model to improve performance, reduce training time and data requirements, and lower computational costs when tackling a new but related task.
    (1) Leverages Existing Knowledge and Reduces Data Requirements: Transfer learning reuses representations learned from large datasets, so good performance can be achieved with significantly less task-specific data.
    (2) Faster Convergence and Training Time: Starting with pre-trained weights provides a much better initialization point for training than random weights, leading to faster convergence and potentially better local optima. Pre-trained weights have already learned generalizable features, so fine-tuning on a new task typically requires much less training time.
    (3) Improved Performance on Limited Data Tasks: When data is limited, transfer learning often yields higher accuracy and better generalization compared to training a model from scratch.


  • DL0004 Small Kernels

    What are the key advantages of using small convolutional kernels, such as 3×3, over utilizing a few larger kernels in deep learning architectures?

    Answer

    Using small convolutional kernels instead of a few larger kernels offers several significant advantages in deep learning architectures:

    (1) Deeper Networks & More Non-Linearity: Stacking multiple 3×3 layers (e.g., three 3×3 layers) allows for a deeper network with more non-linear activation functions compared to a single large kernel.
    (2) Reduced Parameters: Multiple small kernels can cover the same receptive field as one larger kernel with fewer parameters.
    Example: Two stacked 3×3 layers ( 2 \cdot 9 \cdot C^2 = 18 C^2 parameters, assuming C channels throughout and ignoring biases) have the same 5×5 receptive field as a single 5×5 layer ( 25 C^2 parameters).

    (3) Computational Efficiency: Fewer parameters in smaller kernels generally lead to lower computation costs during training and inference.
    (4) Gradual Receptive Field Expansion: Successive 3×3 convolutions progressively build a larger receptive field while maintaining fine detail. (3×3 filters focus on local detail capture with pixel neighborhoods, ideal for textures or edges.)
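    The parameter comparison in (2) can be checked directly. A small sketch, assuming C = 64 channels throughout and ignoring biases, as in the example above:

    ```python
    # Convolution weight count for a k x k kernel mapping c_in -> c_out
    # channels (biases ignored).
    def conv_params(k, c_in, c_out):
        return k * k * c_in * c_out

    c = 64
    two_3x3 = 2 * conv_params(3, c, c)  # same 5x5 receptive field as one 5x5 layer
    one_5x5 = conv_params(5, c, c)

    print(two_3x3)  # 18 * 64^2 = 73728
    print(one_5x5)  # 25 * 64^2 = 102400
    ```

    The stacked version is about 28% cheaper in parameters while also inserting an extra non-linearity between the two convolutions.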

