Tag: Norm

  • DL0034 Layer Norm

    What is layer normalization, and why is it used in Transformers?

    Answer

    Layer Normalization is a technique that standardizes the inputs across the features for a single training example. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization computes the mean and variance for every single example independently to normalize its features.
    (1) Normalization within a Sample: Layer Normalization (LN) calculates the mean and variance across all the features of a single data point (e.g., a single token’s embedding vector in a sequence). It then uses these statistics to normalize the features for that data point only.
    (2) Batch Size Independence: Because it operates on individual examples, its calculations are independent of the batch size. This is a major advantage in models like Transformers that often process sequences of varying lengths, which can make batch statistics unstable.
    (3) Stabilizes Training: By keeping the activations in each layer within a consistent range (mean of 0, standard deviation of 1), LN helps prevent the exploding or vanishing gradients problem. This leads to a smoother training process and faster convergence, especially in deep networks.

    Layer Normalization Equation:
\hat{x}_i = \gamma \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
Where:
 x_i = input feature,
 \mu = mean of all features for the current sample,
 \sigma^2 = variance of all features for the current sample,
 \epsilon = small constant for numerical stability,
 \gamma, \beta = learnable scale and shift parameters.
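The equation above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the function name `layer_norm` and the toy inputs are chosen for this example:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample (row) across its features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean
    var = x.var(axis=-1, keepdims=True)    # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize within each sample
    return gamma * x_hat + beta            # learnable scale and shift

# Two "token embeddings" of dimension 4, at very different scales
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma, beta = np.ones(4), np.zeros(4)
out = layer_norm(x, gamma, beta)
# Each row now has mean ~0 and std ~1, regardless of its original scale.
```

Note that the statistics are computed along the last (feature) axis only, so the result for each token is the same whether the batch contains one sequence or a thousand.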

    The figure below demonstrates the difference between batch normalization and layer normalization.



  • DL0013 Instance Normalization

    Can you explain what Instance Normalization is in the context of deep learning?

    Answer

Instance Normalization (IN) normalizes each individual data sample (typically per channel) by subtracting its own mean and dividing by its own standard deviation, then applying a learnable scale and shift. This makes it ideal for applications where per-instance adjustment is needed, such as artistic style transfer, since the normalization is unaffected by the composition of the mini-batch.

Here are the equations for computing the Instance Normalization output y_{nchw} from input x_{nchw}:

    \mu_{nc} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}
    \sigma_{nc}^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_{nc})^2
    \hat{x}_{nchw}=\frac{{x}_{nchw} - \mu_{nc}}{\sqrt{\sigma_{nc}^2 + \epsilon}}
    y_{nchw} = \gamma_c \hat{x}_{nchw} + \beta_c
    Where:
 x_{nchw} is the input feature at sample n, channel c, height h, and width w.
 H is the height of the feature map (number of rows per channel).
 W is the width of the feature map (number of columns per channel).
 \mu_{nc} is the mean of all spatial values in channel c of instance n.
 \sigma_{nc}^2 is the variance of spatial values in channel c of instance n.
 \hat{x}_{nchw} is the normalized value after subtracting the mean and dividing by the standard deviation.
 \epsilon is a small constant added to the denominator to prevent division by zero and improve numerical stability.
 y_{nchw} is the final output after normalization, scaling, and shifting.
 \gamma_c is a learnable scale parameter for channel c.
 \beta_c is a learnable shift parameter for channel c.
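The four equations above map directly onto a few NumPy operations over an NCHW tensor. A minimal sketch (the function name `instance_norm` and the random test tensor are illustrative, not from any particular library):

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance Normalization over an NCHW tensor: per-sample, per-channel statistics."""
    mu = x.mean(axis=(2, 3), keepdims=True)   # mu_{nc}: mean over H and W
    var = x.var(axis=(2, 3), keepdims=True)   # sigma^2_{nc}: variance over H and W
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized x_hat_{nchw}
    # gamma_c and beta_c broadcast across the channel axis
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(0).normal(size=(2, 3, 4, 4))  # N=2, C=3, H=W=4
gamma, beta = np.ones(3), np.zeros(3)
out = instance_norm(x, gamma, beta)
# Every (sample, channel) slice now has mean ~0 and std ~1.
```

Because the reduction runs only over the spatial axes (H, W), each of the N×C slices is normalized with its own statistics, which is exactly what decouples the result from the mini-batch composition.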


  • DL0007 Batch Norm

    Why use batch normalization in deep learning training?

    Answer

    Batch normalization is a crucial technique during deep learning training that enhances network stability and accelerates learning. It achieves this by normalizing the inputs to the activation function for each mini-batch, specifically by subtracting the batch mean and dividing by the batch standard deviation.

    After normalization, the layer applies a learnable scale (gamma) and shift (beta) that are updated during training to allow the network to recover the identity transformation if needed and to re-center/re-scale activations appropriately.

    Here’s the formula for Batch Normalization:
    BN(x_i) = \gamma \left( \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \beta
    Where:
    x_i represents an individual feature value in the batch.
    \mu_B represents the mean of that feature across the current batch.
    \sigma_B^2 represents the variance of that feature across the current batch.
    \epsilon is a small constant (e.g. 10^{-5}) added to the denominator for numerical stability.
    \gamma is a learnable scaling parameter.
    \beta is a learnable shifting parameter.
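The formula can be sketched in a few lines of NumPy for the training-mode case (inference would use running averages of the batch statistics instead; the function name `batch_norm` and the toy batch are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization (training mode): normalize each feature across the batch."""
    mu = x.mean(axis=0)                    # mu_B: per-feature batch mean
    var = x.var(axis=0)                    # sigma_B^2: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature column
    return gamma * x_hat + beta            # learnable scale and shift

x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])  # batch of 3 samples, 2 features at different scales
out = batch_norm(x, np.ones(2), np.zeros(2))
# Each column (feature) now has mean ~0 and std ~1 across the batch.
```

Contrast the reduction axis with Layer Normalization: here the statistics run over axis 0 (the batch), so the result for any one sample depends on the other samples in its mini-batch.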

    Batch Normalization is typically applied after the linear transformation of a layer (e.g., after the convolution operation in a convolutional layer) and before the non-linear activation function (e.g., ReLU).

    The benefits of using Batch Normalization include:
    (1) Stabilizes learning: Reduces internal covariate shift, making training more stable and less sensitive to network initialization and hyperparameter choices.
    (2) Enables higher learning rates and accelerates training: Allows for larger learning rates without causing instability, leading to faster convergence.
    (3) Improves generalization: Normalizes each mini-batch independently, introducing noise into activations. This noise prevents over-reliance on specific mini-batch activations, forcing the network to learn more robust and generalizable features.


  • ML0018 Data Normalization

    Why is data normalization used in Machine Learning?

    Answer

    Data normalization is the process of scaling data to fit within a specific range or distribution, often between 0 and 1 or with a mean of 0 and standard deviation of 1. It’s used in machine learning and statistical modeling to ensure that features contribute equally to the model’s learning process.
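The two scalings mentioned above can be sketched directly in NumPy (the variable names `min_max` and `z_score` are just for this example):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0])

# Min-max scaling: maps values linearly into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: shifts to mean 0 and rescales to std 1
z_score = (x - x.mean()) / x.std()
```

Without such scaling, a feature measured in thousands can dominate one measured in fractions during gradient-based optimization, simply because of its units.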

