Tag: Norm

  • DL0034 Layer Norm

    What is layer normalization, and why is it used in Transformers?

    Answer

    Layer Normalization is a technique that standardizes the inputs across the features for a single training example. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization computes the mean and variance for every single example independently to normalize its features.
    (1) Normalization within a Sample: Layer Normalization (LN) calculates the mean and variance across all the features of a single data point (e.g., a single token’s embedding vector in a sequence). It then uses these statistics to normalize the features for that data point only.
    (2) Batch Size Independence: Because it operates on individual examples, its calculations are independent of the batch size. This is a major advantage in models like Transformers that often process sequences of varying lengths, which can make batch statistics unstable.
    (3) Stabilizes Training: By keeping the activations in each layer within a consistent range (mean of 0, standard deviation of 1), LN helps prevent the exploding or vanishing gradients problem. This leads to a smoother training process and faster convergence, especially in deep networks.

    Layer Normalization Equation:
\hat{x}_i = \gamma \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
Where:
 x_i = input feature,
 \mu = mean of all features for the current sample,
 \sigma^2 = variance of all features for the current sample,
 \epsilon = small constant for numerical stability,
 \gamma, \beta = learnable scale and shift parameters.
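The equation above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the function name `layer_norm` and the toy inputs are chosen for this example:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample (row) across its features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean
    var = x.var(axis=-1, keepdims=True)    # per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize within each sample
    return gamma * x_hat + beta            # learnable scale and shift

# Two "token embeddings" of dimension 4, at very different scales
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma, beta = np.ones(4), np.zeros(4)
out = layer_norm(x, gamma, beta)
# Each row now has mean ~0 and std ~1, regardless of its original scale.
```

Note that the statistics are computed along the last (feature) axis only, so the result for each token is the same whether the batch contains one sequence or a thousand.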

    The figure below demonstrates the difference between batch normalization and layer normalization.



  • DL0013 Instance Normalization

    Can you explain what Instance Normalization is in the context of deep learning?

    Answer

Instance Normalization (IN) normalizes each individual data sample (typically per channel) by subtracting its own mean and dividing by its own standard deviation, then applying a learnable scale and shift. This makes it ideal for applications where per-instance adjustment is needed, such as artistic style transfer, since the normalization is unaffected by the composition of the mini-batch.

Here are the equations for computing the Instance Normalization output y_{nchw} from input x_{nchw}:

    \mu_{nc} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}
    \sigma_{nc}^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_{nc})^2
    \hat{x}_{nchw}=\frac{{x}_{nchw} - \mu_{nc}}{\sqrt{\sigma_{nc}^2 + \epsilon}}
    y_{nchw} = \gamma_c \hat{x}_{nchw} + \beta_c
    Where:
 x_{nchw} is the input feature at sample n, channel c, height h, and width w.
 H is the height of the feature map (number of rows per channel).
 W is the width of the feature map (number of columns per channel).
 \mu_{nc} is the mean of all spatial values in channel c of instance n.
 \sigma_{nc}^2 is the variance of spatial values in channel c of instance n.
 \hat{x}_{nchw} is the normalized value after subtracting the mean and dividing by the standard deviation.
 \epsilon is a small constant added to the denominator to prevent division by zero and improve numerical stability.
 y_{nchw} is the final output after normalization, scaling, and shifting.
 \gamma_c is a learnable scale parameter for channel c.
 \beta_c is a learnable shift parameter for channel c.
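The four equations above map directly onto a few NumPy operations over an NCHW tensor. A minimal sketch (the function name `instance_norm` and the random test tensor are illustrative, not from any particular library):

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance Normalization over an NCHW tensor: per-sample, per-channel statistics."""
    mu = x.mean(axis=(2, 3), keepdims=True)   # mu_{nc}: mean over H and W
    var = x.var(axis=(2, 3), keepdims=True)   # sigma^2_{nc}: variance over H and W
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized x_hat_{nchw}
    # gamma_c and beta_c broadcast across the channel axis
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(0).normal(size=(2, 3, 4, 4))  # N=2, C=3, H=W=4
gamma, beta = np.ones(3), np.zeros(3)
out = instance_norm(x, gamma, beta)
# Every (sample, channel) slice now has mean ~0 and std ~1.
```

Because the reduction runs only over the spatial axes (H, W), each of the N×C slices is normalized with its own statistics, which is exactly what decouples the result from the mini-batch composition.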


  • DL0007 Batch Norm

    Why use batch normalization in deep learning training?

    Answer

    Batch normalization is a crucial technique during deep learning training that enhances network stability and accelerates learning. It achieves this by normalizing the inputs to the activation function for each mini-batch, specifically by subtracting the batch mean and dividing by the batch standard deviation.

    After normalization, the layer applies a learnable scale (gamma) and shift (beta) that are updated during training to allow the network to recover the identity transformation if needed and to re-center/re-scale activations appropriately.

    Here’s the formula for Batch Normalization:
    BN(x_i) = \gamma \left( \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \beta
    Where:
    x_i represents an individual feature value in the batch.
    \mu_B represents the mean of that feature across the current batch.
    \sigma_B^2 represents the variance of that feature across the current batch.
    \epsilon is a small constant (e.g. 10^{-5}) added to the denominator for numerical stability.
    \gamma is a learnable scaling parameter.
    \beta is a learnable shifting parameter.
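The formula can be sketched in a few lines of NumPy for the training-mode case (inference would use running averages of the batch statistics instead; the function name `batch_norm` and the toy batch are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization (training mode): normalize each feature across the batch."""
    mu = x.mean(axis=0)                    # mu_B: per-feature batch mean
    var = x.var(axis=0)                    # sigma_B^2: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature column
    return gamma * x_hat + beta            # learnable scale and shift

x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])  # batch of 3 samples, 2 features at different scales
out = batch_norm(x, np.ones(2), np.zeros(2))
# Each column (feature) now has mean ~0 and std ~1 across the batch.
```

Contrast the reduction axis with Layer Normalization: here the statistics run over axis 0 (the batch), so the result for any one sample depends on the other samples in its mini-batch.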

    Batch Normalization is typically applied after the linear transformation of a layer (e.g., after the convolution operation in a convolutional layer) and before the non-linear activation function (e.g., ReLU).

    The benefits of using Batch Normalization include:
    (1) Stabilizes learning: Reduces internal covariate shift, making training more stable and less sensitive to network initialization and hyperparameter choices.
    (2) Enables higher learning rates and accelerates training: Allows for larger learning rates without causing instability, leading to faster convergence.
    (3) Improves generalization: Normalizes each mini-batch independently, introducing noise into activations. This noise prevents over-reliance on specific mini-batch activations, forcing the network to learn more robust and generalizable features.


  • ML0018 Data Normalization

    Why is data normalization used in Machine Learning?

    Answer

    Data normalization is the process of scaling data to fit within a specific range or distribution, often between 0 and 1 or with a mean of 0 and standard deviation of 1. It’s used in machine learning and statistical modeling to ensure that features contribute equally to the model’s learning process.
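The two scalings mentioned above can be sketched directly in NumPy (the variable names `min_max` and `z_score` are just for this example):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0])

# Min-max scaling: maps values linearly into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: shifts to mean 0 and rescales to std 1
z_score = (x - x.mean()) / x.std()
```

Without such scaling, a feature measured in thousands can dominate one measured in fractions during gradient-based optimization, simply because of its units.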

