
  • DL0039 Transformer Weight Tying

    Explain weight sharing in Transformers.

    Answer

    Weight sharing in Transformers mainly refers to tying the input embedding matrix to the output projection matrix used for softmax prediction, which saves parameters and improves consistency. In some models (like ALBERT), it also extends to sharing weights across Transformer layers for further parameter efficiency.

    (1) Input–Output Embedding Tying:
    The same embedding matrix is used for both input token embeddings and the output softmax projection.
    Reduces parameters and enforces consistency between input and output spaces.
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
    z_i = (E h)_i is the logit for token i, computed using the embedding matrix E \in \mathbb{R}^{K \times d}.
    h \in \mathbb{R}^{d} is the hidden representation from the Transformer.
    K is the vocabulary size.
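    The tying can be sketched in a few lines of NumPy. This is a minimal toy illustration, not from the source: the matrix E serves both as the lookup table for input embeddings and as the output projection that produces the logits.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 10, 4                  # vocabulary size and hidden dimension (toy values)

E = rng.normal(size=(K, d))   # the single shared embedding matrix
h = rng.normal(size=d)        # hidden representation from the Transformer

# 1) Input side: embedding lookup is simply a row of E.
token_id = 3
input_embedding = E[token_id]

# 2) Output side: the same E is reused as the projection to logits.
logits = E @ h                                   # z_i = (E h)_i
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over the vocabulary

assert probs.shape == (K,)
assert np.isclose(probs.sum(), 1.0)
```

    With tying, the model stores one K x d matrix instead of two, which matters when K is a large vocabulary.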

    Weight tying is shown in the figure below.

    (2) Layer Weight Sharing (e.g., ALBERT [1]):
    Instead of unique weights per layer, parameters are reused across all Transformer blocks.
    Cuts model size dramatically while keeping depth.

    References:
    [1] Lan, Zhenzhong, et al. “Albert: A lite bert for self-supervised learning of language representations.” arXiv preprint arXiv:1909.11942 (2019).



  • DL0038 Transformer Activation

    Which activation functions do transformer models use?

    Answer

    Transformers mainly use GELU/ReLU in the feed-forward layers to introduce non-linearity and Softmax in attention to produce normalized attention weights. GELU is preferred for smoother gradient flow and better performance.
    (1) Feed-Forward Network (FFN):
    Uses ReLU or GELU as the non-linear activation.
    GELU is more common in modern Transformers (like BERT, GPT).
    Equation for GELU:
    \mbox{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \mbox{erf}\left(\frac{x}{\sqrt{2}}\right)\right]
    Where:
    x is the input,
    \Phi(x) is the Cumulative Distribution Function (CDF) of the standard Gaussian.
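    The exact GELU is easy to compute with the standard error function. A small sketch (toy values, not from the source) showing how GELU differs from ReLU around zero:

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    return max(0.0, x)

# GELU is smooth at 0 and slightly negative for small negative inputs,
# whereas ReLU is exactly zero for all negative inputs.
assert gelu(0.0) == 0.0
assert relu(-0.5) == 0.0
assert -0.2 < gelu(-0.5) < 0.0
assert abs(gelu(3.0) - 3.0) < 0.01   # for large positive x, GELU(x) ~ x
```

    The smooth, non-zero gradient for negative inputs is what the answer above means by "smoother gradient flow."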

    The figure below demonstrates the difference between ReLU and GELU.

    (2) Attention Output:
    Uses Softmax to convert attention scores into probabilities.
    Equation for Softmax:
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
     z_i represents the raw attention score for the i-th token,
     K is the total number of tokens considered in attention.


  • DL0037 Transformer Architecture III

    Why do Transformers use a dot product, rather than addition, to compute attention scores?

    Answer

    Dot product attention is a fast and naturally aligned similarity measure; with scaling, it remains numerically stable and highly parallelizable, which is why Transformers prefer it over addition.
    (1) Dot product captures similarity: The dot product between query  q and key  k grows larger when they point in similar directions, making it a natural similarity measure.
    The scores are normalized with Softmax, giving them a probabilistic interpretation:
    \alpha_i = \frac{e^{q \cdot k_i}}{\sum_{j=1}^K e^{q \cdot k_j}}
    Where:
     q \cdot k_i is the dot product similarity between query and key.

    The figure below illustrates the dot product for measuring similarity.

    (2) Efficient computation: Dot products can be computed in parallel as a matrix multiplication  QK^\top , which is hardware-friendly.
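    Both points can be seen in a minimal NumPy sketch (toy shapes, not from the source): all n x n similarity scores come out of one matrix multiplication, and a row-wise softmax turns them into attention weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # sequence length and head dimension (toy values)
Q = rng.normal(size=(n, d))
K_mat = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# All n*n dot-product similarities in a single matmul: hardware-friendly.
scores = Q @ K_mat.T / np.sqrt(d)  # scaling keeps the logits numerically stable

# Row-wise softmax converts each row of scores into attention weights.
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)
output = weights @ V               # weighted combination of values

assert weights.shape == (n, n)
assert np.allclose(weights.sum(axis=1), 1.0)   # each row is a distribution
```

    An additive (MLP-based) score would need an extra learned layer per query-key pair; the dot product gets similarity essentially for free from one matmul.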



  • DL0036 Transformer Architecture II

    What are the main differences between the encoder and decoder in a Transformer?

    Answer

    The encoder focuses on encoding input into rich representations via bidirectional self-attention, while the decoder leverages these for output generation through masked self-attention and cross-attention, ensuring autoregressive and context-aware predictions.
    (1) Self‑Attention:
    Encoder: Unmasked, attends to all positions in the input sequence.
    Decoder: Masked, attends only to past positions to maintain causal order.
    (2) Cross‑Attention:
    Encoder: None.
    Decoder: Present — attends to encoder outputs for context.
    (3) Masking:
    Encoder: No masking needed.
    Decoder: Causal mask prevents looking ahead.
    (4) Positional Encoding:
    Encoder: Added to source embeddings.
    Decoder: Added to target embeddings (shifted right during training).
    (5) Function:
    Encoder: Encodes the full source sequence into contextual representations.
    Decoder: Generates the target sequence one token at a time using its own history and encoder context.
    The figure below shows the encoder and the decoder in the Transformer.
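    The masking difference in points (1) and (3) can be sketched directly (toy values, not from the source): the decoder adds -inf above the diagonal before the softmax, so each position only attends to itself and earlier positions.

```python
import numpy as np

n = 4
scores = np.zeros((n, n))    # placeholder attention scores (all equal)

# Encoder: no mask -- every position may attend to every other position.
# Decoder: causal mask -- position i attends only to positions j <= i.
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
masked = np.where(causal_mask, -np.inf, scores)

weights = np.exp(masked)
weights /= weights.sum(axis=1, keepdims=True)

# Position 0 can only see itself; the last position sees the whole prefix.
assert weights[0, 1:].sum() == 0.0
assert np.allclose(weights[-1], 1.0 / n)
```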



  • ML0066 Model Capacity

    Without activation functions, how does the model capacity of a 2‑layer neural network compare to a 20‑layer network?

    Answer

    In the absence of activation functions, a neural network, regardless of depth, is equivalent to a single linear transformation. The model capacity is limited by the expressiveness of linear mappings, with the maximum rank determined by layer widths and thus the parameter count. The network with more parameters, whether 2-layer or 20-layer, can represent a higher-rank transformation, but depth alone provides no additional ability to capture non-linear relationships.

    Without activation functions, all layers collapse to a single linear transformation:
    y = W_{\text{eff}} x + b_{\text{eff}}
    Where:
     W_{\text{eff}} is the effective weight matrix.
     b_{\text{eff}} is the effective bias.
    Representational capacity is the same for both 2-layer and 20-layer networks.
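    The collapse is easy to verify numerically. A minimal sketch (toy shapes, not from the source) showing that two activation-free layers equal one linear map with W_eff = W2 W1 and b_eff = W2 b1 + b2:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# A 2-layer "network" with no activation: y = W2 (W1 x + b1) + b2
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
y_deep = W2 @ (W1 @ x + b1) + b2

# ...collapses to a single linear transformation y = W_eff x + b_eff.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
y_flat = W_eff @ x + b_eff

assert np.allclose(y_deep, y_flat)
```

    The same algebra applies to 20 layers: each extra activation-free layer just multiplies into W_eff.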

    Parameter count also depends on the width of layers, not just depth.
    Formula for fully connected layers:
    \mbox{Params per layer} = d_\text{in} \times d_\text{out} + d_\text{out}
    Where:
     d_\text{in} is the input dimension.
     d_\text{out} is the output dimension of that layer.
    A wider 2-layer network can have more parameters than a narrow 20-layer network. Conversely, a sufficiently wide 20-layer network can have more parameters than a narrow 2-layer network.
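    Applying the parameter formula to hypothetical sizes (the widths below are illustrative, not from the source) shows how a wide shallow network can out-count a narrow deep one:

```python
def dense_params(d_in: int, d_out: int) -> int:
    # weights + biases of one fully connected layer: d_in * d_out + d_out
    return d_in * d_out + d_out

# Wide 2-layer network: 100 -> 1024 -> 10
wide_2layer = dense_params(100, 1024) + dense_params(1024, 10)

# Narrow 20-layer network: 100 -> 16 -> (18 hidden 16->16 layers) -> 10
narrow_20layer = (dense_params(100, 16)
                  + 18 * dense_params(16, 16)
                  + dense_params(16, 10))

assert wide_2layer > narrow_20layer   # width dominates the parameter count here
```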

    Neither network can model nonlinear data, as shown below.


  • DL0035 Transformer Architecture

    Describe the original Transformer encoder–decoder architecture.

    Answer

    The original Transformer model has an encoder-decoder architecture. The encoder processes the input sequence (e.g., a sentence) to create a contextual representation for each word. The decoder then uses this representation to generate the output sequence (e.g., the translated sentence), one word at a time. This entire process relies on attention mechanisms instead of recurrence.
    Overall: Sequence-to-sequence encoder–decoder model with 6 encoder layers and 6 decoder layers [1].
    Encoder layer:
    (1) Multi-Head Self-Attention (all tokens attend to each other).
    (2) Position-wise Feed-Forward Network (two linear layers + ReLU).
    (3) Residual connection + LayerNorm after each sublayer.
    Decoder layer:
    (1) Masked Multi-Head Self-Attention (prevents seeing future tokens).
    (2) Cross-Attention (queries from decoder, keys/values from encoder output).
    (3) Position-wise Feed-Forward Network.
    (4) Residual connection + LayerNorm after each sublayer.
    Input representation: Inputs are represented as token embeddings summed with positional encodings to preserve sequence order, as the attention mechanism is permutation-invariant.
    Output: The final decoder output passes through a linear projection layer, followed by softmax to produce probabilities over the target vocabulary for next-token prediction.

    The figure below shows the architecture of the Transformer.

    References:

    [1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).


  • DL0034 Layer Norm

    What is layer normalization, and why is it used in Transformers?

    Answer

    Layer Normalization is a technique that standardizes the inputs across the features for a single training example. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization computes the mean and variance for every single example independently to normalize its features.
    (1) Normalization within a Sample: Layer Normalization (LN) calculates the mean and variance across all the features of a single data point (e.g., a single token’s embedding vector in a sequence). It then uses these statistics to normalize the features for that data point only.
    (2) Batch Size Independence: Because it operates on individual examples, its calculations are independent of the batch size. This is a major advantage in models like Transformers that often process sequences of varying lengths, which can make batch statistics unstable.
    (3) Stabilizes Training: By keeping the activations in each layer within a consistent range (mean of 0, standard deviation of 1), LN helps prevent the exploding or vanishing gradients problem. This leads to a smoother training process and faster convergence, especially in deep networks.

    Layer Normalization Equation:
    \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta
    Where:
     x_i = input feature,
     \mu = mean of all features for the current sample,
     \sigma = standard deviation of all features,
     \epsilon = small constant for numerical stability,
     \gamma, \beta = learnable scale and shift parameters.
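    The equation maps directly to code. A minimal sketch (toy values, not from the source), normalizing the features of one sample independently of any batch:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed over the features of this single sample only.
    mu = x.mean()
    var = x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])   # e.g., one token's embedding vector
y = layer_norm(x)

assert abs(y.mean()) < 1e-6          # normalized features: mean ~ 0
assert abs(y.std() - 1.0) < 1e-3     # std ~ 1 (up to eps)
```

    Note that no other sample enters the computation, which is why the result does not depend on batch size or sequence length.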

    The figure below demonstrates the difference between batch normalization and layer normalization.



  • DL0033 Transformer Computation

    In a Transformer architecture, which components are the primary contributors to computational cost, and why?

    Answer

    For short sequences, the feed-forward network (FFN) is often the dominant cost. For long sequences, the multi-head attention mechanism becomes the overwhelming bottleneck.
    (1) Multi‑Head Attention (MHA):
    Short sequences (small  n ): Cost is relatively small; attention score matrix overhead is minimal. Q, K, and V projections together dominate the compute.
    Long sequences (large  n ): Cost explodes quadratically with  n because every token attends to every other token. This becomes the main bottleneck. Cost:  \mathcal{O}(n^2 \cdot d)
    (2) Feed-Forward Network (FFN):
    Two dense layers with an expansion factor of 4.
    Cost:  \mathcal{O}(n \cdot d^2)
    Short sequences: FFN dominates cost since  n is small, but  d^2 is large.
    Long sequences: Cost grows linearly with  n , but the MHA cost overtakes it when  n is large.
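    Comparing only the asymptotic terms (constant factors and the FFN's 4x expansion omitted; the sequence lengths are illustrative, not from the source), the crossover falls roughly where n reaches d:

```python
d = 512  # model dimension

def attn_cost(n: int, d: int) -> int:
    return n * n * d   # ~ O(n^2 * d): every token attends to every other token

def ffn_cost(n: int, d: int) -> int:
    return n * d * d   # ~ O(n * d^2): per-token dense layers, linear in n

assert ffn_cost(128, d) > attn_cost(128, d)    # short sequence: FFN dominates
assert attn_cost(4096, d) > ffn_cost(4096, d)  # long sequence: attention dominates
```

    With these terms the two costs are equal at n = d, which is why attention only becomes the bottleneck once sequences grow well past the model dimension.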

    The table below shows the FLOP breakdown comparing Multi‑Head Attention (MHA) and Feed‑Forward Network (FFN) at different sequence lengths for a representative Transformer configuration with d = 512.



  • DL0032 Transformer VS RNN

    What makes Transformers more parallel-friendly than RNNs?

    Answer

    The fundamental difference lies in their architecture: RNNs sequentially process data, with each step depending on the output of the previous one. Transformers, on the other hand, utilize attention to examine all parts of the sequence simultaneously, enabling parallel processing. This parallelizability is a key reason for the Transformer’s superior performance on many tasks and its dominance in modern natural language processing.
    (1) No Temporal Dependency: Transformers process all input tokens simultaneously, unlike RNNs, which depend on previous hidden states.
    (2) Self-Attention is Fully Parallelizable: Attention scores are computed for all positions in a single pass.
    (3) Optimized for GPUs: Matrix multiplications in Transformers leverage GPU cores better than the sequential loops in RNNs.

    The figure below demonstrates the architectures of RNNs and Transformers.



  • DL0031 FFN in Transformer

    What is the purpose of the feed-forward network inside each Transformer block?

    Answer

    The feed-forward network (FFN) inside each Transformer block processes each token’s features independently after attention, expands and transforms them non-linearly, and projects them back to the model’s dimension. This ensures that after attention has mixed information across tokens, each token’s representation is individually refined for richer feature learning.

    Purpose of FFN:
    (1) Non-linear transformation: Adds non-linearity after attention, allowing the model to capture complex patterns.
    (2) Token-wise processing: Applies the same transformation to each token independently (no mixing across positions).
    (3) Dimensional expansion: Often increases dimensionality in the hidden layer to give the network more capacity.
    (4) Feature recombination: Refines and reweights token representations produced by the attention mechanism.
    (5) Complement to attention: Attention mixes information across tokens; the FFN processes each token’s features deeply.

    Typical FFN equation in a Transformer:
    \mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
    Where:
     x — input vector for a token after the attention layer
     W_1, W_2 — trainable weight matrices
     b_1, b_2 — trainable bias vectors
     \max(0, \cdot) — ReLU activation (sometimes replaced by GELU)
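    The equation can be sketched directly in NumPy (toy sizes, not from the source); the final assertion checks the token-wise property: each row of the output depends only on the corresponding row of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 5   # toy sizes; real models often use d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # expand -> ReLU -> project back; applied to each token independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(n, d_model))   # n token vectors after the attention layer
Y = ffn(X)

assert Y.shape == (n, d_model)
assert np.allclose(ffn(X[2:3]), Y[2:3])   # no mixing across positions
```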

