Tag: Transformer

  • DL0036 Transformer Architecture II

    What are the main differences between the encoder and decoder in a Transformer?

    Answer

    The encoder focuses on encoding input into rich representations via bidirectional self-attention, while the decoder leverages these for output generation through masked self-attention and cross-attention, ensuring autoregressive and context-aware predictions.
    (1) Self‑Attention:
    Encoder: Unmasked, attends to all positions in the input sequence.
    Decoder: Masked, attends only to past positions to maintain causal order.
    (2) Cross‑Attention:
    Encoder: None.
    Decoder: Present — attends to encoder outputs for context.
    (3) Masking:
    Encoder: No masking needed.
    Decoder: Causal mask prevents looking ahead.
    (4) Positional Encoding:
    Encoder: Added to source embeddings.
    Decoder: Added to target embeddings (shifted right during training).
    (5) Function:
    Encoder: Encodes the full source sequence into contextual representations.
    Decoder: Generates the target sequence one token at a time using its own history and encoder context.
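The masking difference above can be sketched in NumPy; the toy sequence length and uniform raw scores below are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular mask: position i may attend only to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf before softmax, so their weight becomes 0.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))                 # uniform raw scores, for illustration
weights = masked_softmax(scores, causal_mask(n))
# Row 0 attends only to token 0; row 3 attends to tokens 0..3 equally.
```

With the mask removed (the encoder case), every row would spread its weight over all four positions.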
    The figure below shows the encoder and the decoder in the Transformer.



  • DL0035 Transformer Architecture

    Describe the original Transformer encoder–decoder architecture.

    Answer

    The original Transformer model has an encoder-decoder architecture. The encoder processes the input sequence (e.g., a sentence) to create a contextual representation for each word. The decoder then uses this representation to generate the output sequence (e.g., the translated sentence), one word at a time. This entire process relies on attention mechanisms instead of recurrence.
    Overall: Sequence-to-sequence encoder–decoder model with 6 encoder layers and 6 decoder layers [1].
    Encoder layer:
    (1) Multi-Head Self-Attention (all tokens attend to each other).
    (2) Position-wise Feed-Forward Network (two linear layers + ReLU).
    (3) Residual connection + LayerNorm after each sublayer.
    Decoder layer:
    (1) Masked Multi-Head Self-Attention (prevents seeing future tokens).
    (2) Cross-Attention (queries from decoder, keys/values from encoder output).
    (3) Position-wise Feed-Forward Network.
    (4) Residual connection + LayerNorm after each sublayer.
    Input representation: Inputs are represented as token embeddings summed with positional encodings to preserve sequence order, as the attention mechanism is permutation-invariant.
    Output: The final decoder output passes through a linear projection layer, followed by softmax to produce probabilities over the target vocabulary for next-token prediction.
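The encoder layer described above can be sketched in NumPy. This is a simplified single-head version with toy dimensions and random weights (the real model uses multi-head attention with d=512, d_ff=2048):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    # (1) self-attention (single head here; multi-head in the real model)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn @ Wo)          # (3) residual + LayerNorm
    # (2) position-wise FFN: two linear layers with ReLU in between
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ffn)             # (3) residual + LayerNorm

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32                      # toy sizes
x = rng.normal(size=(n, d))
params = [rng.normal(size=s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]]
y = encoder_layer(x, *params)
# Output keeps the input shape: one refined contextual vector per token.
```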

    The figure below shows the architecture of the Transformer.

    References:

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).


  • DL0034 Layer Norm

    What is layer normalization, and why is it used in Transformers?

    Answer

    Layer Normalization is a technique that standardizes the inputs across the features for a single training example. Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization computes the mean and variance for every single example independently to normalize its features.
    (1) Normalization within a Sample: Layer Normalization (LN) calculates the mean and variance across all the features of a single data point (e.g., a single token’s embedding vector in a sequence). It then uses these statistics to normalize the features for that data point only.
    (2) Batch Size Independence: Because it operates on individual examples, its calculations are independent of the batch size. This is a major advantage in models like Transformers that often process sequences of varying lengths, which can make batch statistics unstable.
    (3) Stabilizes Training: By keeping the activations in each layer within a consistent range (mean of 0, standard deviation of 1), LN helps prevent the exploding or vanishing gradients problem. This leads to a smoother training process and faster convergence, especially in deep networks.

    Layer Normalization Equation:
    \hat{x}_i = \frac{x_i - \mu}{\sigma + \epsilon} \cdot \gamma + \beta
    Where:
     x_i = input feature,
     \mu = mean of all features for the current sample,
     \sigma = standard deviation of all features,
     \epsilon = small constant for numerical stability,
     \gamma, \beta = learnable scale and shift parameters.
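The equation above translates directly into NumPy; the two hand-picked samples below are an illustrative assumption chosen to make the per-sample behavior visible:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per sample, across its features only.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])   # two samples, four features each
gamma, beta = np.ones(4), np.zeros(4)
y = layer_norm(x, gamma, beta)
# Each row is normalized independently; both rows map to (nearly) the same
# values because they differ only by a scale of the features.
```

Note how no statistic crosses the batch dimension, which is exactly why the result is independent of batch size.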

    The figure below demonstrates the difference between batch normalization and layer normalization.



  • DL0033 Transformer Computation

    In a Transformer architecture, which components are the primary contributors to computational cost, and why?

    Answer

    For short sequences, the feed-forward network (FFN) is often the dominant cost. For long sequences, the multi-head attention mechanism becomes the overwhelming bottleneck.
    (1) Multi‑Head Attention (MHA):
    Short sequences (small  n ): Cost is relatively small; attention score matrix overhead is minimal. Q, K, and V projections together dominate the compute.
    Long sequences (large  n ): Cost explodes quadratically with  n because every token attends to every other token. This becomes the main bottleneck. Cost:  \mathcal{O}(n^2 \cdot d)
    (2) Feed-Forward Network (FFN):
    Two dense layers with an expansion factor of 4.
    Cost:  \mathcal{O}(n \cdot d^2)
    Short sequences: FFN dominates cost since  n is small, but  d^2 is large.
Long sequences: Cost grows linearly with  n , but the MHA cost overtakes it when  n is large.
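The crossover can be checked with a rough FLOP estimate. This sketch counts only the dominant matrix multiplies and assumes d=512 with the usual 4x FFN expansion; the sequence lengths are illustrative:

```python
# Rough per-layer FLOP counts, keeping only the dominant matrix multiplies.
d = 512

def mha_flops(n, d):
    proj = 4 * n * d * d           # Q, K, V, and output projections: O(n * d^2)
    scores = 2 * n * n * d         # QK^T plus the attention-weighted sum of V: O(n^2 * d)
    return proj, scores

def ffn_flops(n, d, expansion=4):
    return 2 * n * d * (expansion * d)   # two dense layers, d -> 4d -> d: O(n * d^2)

for n in (128, 4096):
    proj, scores = mha_flops(n, d)
    print(n, proj + scores, ffn_flops(n, d))
# At n=128 the FFN (~268M FLOPs) exceeds MHA (~151M); at n=4096 the n^2*d
# score term makes MHA (~21.5G) dwarf the FFN (~8.6G).
```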

The table below shows the FLOP breakdown comparing Multi‑Head Attention (MHA) and Feed‑Forward Network (FFN) at different sequence lengths for a representative Transformer configuration with d=512.



  • DL0032 Transformer VS RNN

    What makes Transformers more parallel-friendly than RNNs?

    Answer

    The fundamental difference lies in their architecture: RNNs sequentially process data, with each step depending on the output of the previous one. Transformers, on the other hand, utilize attention to examine all parts of the sequence simultaneously, enabling parallel processing. This parallelizability is a key reason for the Transformer’s superior performance on many tasks and its dominance in modern natural language processing.
    (1) No Temporal Dependency: Transformers process all input tokens simultaneously, unlike RNNs, which depend on previous hidden states.
    (2) Self-Attention is Fully Parallelizable: Attention scores are computed for all positions in a single pass.
    (3) Optimized for GPUs: Matrix multiplications in Transformers leverage GPU cores better than the sequential loops in RNNs.
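The contrast in points (1) and (2) can be made concrete with a toy NumPy sketch; the sizes, random inputs, and the simple tanh recurrence are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) * 0.1

# RNN: a sequential loop -- step t cannot start before step t-1 finishes.
h = np.zeros(d)
hs = []
for t in range(n):
    h = np.tanh(x[t] + h @ W)     # hidden state depends on the previous one
    hs.append(h)

# Self-attention scores: one matrix multiply covers all n^2 token pairs at once,
# with no dependency between rows -- ideal for GPU parallelism.
scores = x @ x.T / np.sqrt(d)
```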

    The figure below demonstrates the architectures of RNNs and Transformers.



  • DL0031 FFN in Transformer

    What is the purpose of the feed-forward network inside each Transformer block?

    Answer

    The feed-forward network (FFN) inside each Transformer block processes each token’s features independently after attention, expands and transforms them non-linearly, and projects them back to the model’s dimension. This ensures that after attention has mixed information across tokens, each token’s representation is individually refined for richer feature learning.

    Purpose of FFN:
    (1) Non-linear transformation: Adds non-linearity after attention, allowing the model to capture complex patterns.
    (2) Token-wise processing: Applies the same transformation to each token independently (no mixing across positions).
    (3) Dimensional expansion: Often increases dimensionality in the hidden layer to give the network more capacity.
    (4) Feature recombination: Refines and reweights token representations produced by the attention mechanism.
    (5) Complement to attention: Attention mixes information across tokens; the FFN processes each token’s features deeply.

    Typical FFN equation in a Transformer:
    \mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
    Where:
     x — input vector for a token after the attention layer
     W_1, W_2 — trainable weight matrices
     b_1, b_2 — trainable bias vectors
     \max(0, \cdot) — ReLU activation (sometimes replaced by GELU)
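The equation above, and the token-wise property from point (2), can be verified in a few lines of NumPy; the toy dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # The same weights are applied to every token position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear maps

rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 8, 32          # toy sizes; the hidden layer is expanded 4x
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = ffn(x, W1, b1, W2, b2)
# Token-wise: feeding each row separately gives the same result as the batch,
# confirming that no information mixes across positions.
y_rowwise = np.stack([ffn(x[i:i+1], W1, b1, W2, b2)[0] for i in range(n)])
```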


  • DL0030 Positional Encoding

    Explain “Positional Encoding” in Transformers. Why is it necessary?

    Answer

    Positional encoding is crucial in Transformers to equip the model with an understanding of token order while maintaining full parallel computation. Fixed sinusoidal functions offer parameter-free generalization to unseen lengths, learned embeddings provide task-specific flexibility, and relative schemes directly capture inter-token distances.

    Self-attention is permutation-invariant and, on its own, cannot distinguish token order. Positional encodings inject sequence information by adding position-dependent vectors to token embeddings.

    Encoding types:
(1) Fixed (sinusoidal): Predefined sine and cosine functions of position at different frequencies. They add no parameters, let the model infer both absolute and relative positions, and generalize to sequence lengths unseen in training.
(2) Learned: Trainable position vectors optimized during training. They offer task-specific flexibility but may not generalize beyond the maximum training length.

    Sinusoidal Encoding Formula:
    PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    Where:
     pos : token position in the sequence
     i : dimension index
     d_{\text{model}} : embedding dimension
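The formula above maps directly to NumPy; the sequence length and embedding dimension below are illustrative:

```python
import numpy as np

def sinusoidal_pe(n, d_model):
    pos = np.arange(n)[:, None]               # positions 0..n-1, column vector
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = pos / 10000 ** (i / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 16)
# Each row is a unique vector bounded in [-1, 1], added to that token's embedding.
```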

    The figure below shows how the encoding values change across different positions and dimensions.


  • DL0029 Dilated Attention

    Could you explain the concept of dilated attention in transformer architectures?

    Answer

    Dilated attention introduces gaps between attention positions to sparsify computation, enabling efficient long-range dependency modeling. It is particularly helpful in tasks requiring scalable attention over long sequences. It trades off some granularity for global context by spreading attention more widely and sparsely.

    Dilated attention is similar to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
Instead of attending to all tokens (as in standard self-attention), each query token attends only to every d-th token; the dilation rate d controls the stride of the attention pattern.

Reduction in Complexity: Reduces attention computation and memory from  \mathcal{O}(n^2) to roughly  \mathcal{O}(n^2 / d) for dilation rate  d , since each query scores only a sparse subset of the keys.

    In dilated attention, the dot-product  QK^\top is computed only at dilated positions:
    \mbox{Attention}_{\text{dilated}}(Q, K, V) = \mbox{Softmax}\left(\frac{QK_d^\top}{\sqrt{d_k}}\right) V_d
    Where:
     K_d, V_d are the dilated subsets of keys and values.
     d_k is the key dimension.
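A minimal NumPy sketch of the equation above; the strided slice standing in for the dilated subsets, and the toy sizes, are simplifying assumptions (practical variants often combine several dilation rates or apply dilation within local segments):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dilated_attention(Q, K, V, rate):
    # Keep only every `rate`-th key/value: the dilated subsets K_d, V_d.
    K_d, V_d = K[::rate], V[::rate]
    d_k = K.shape[-1]
    return softmax(Q @ K_d.T / np.sqrt(d_k)) @ V_d

rng = np.random.default_rng(0)
n, d_k = 12, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

out = dilated_attention(Q, K, V, rate=3)
# Each query now scores only n/3 = 4 keys instead of all 12.
```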

    Below is the visualization of dilated attention with a dilation rate of 3.


  • DL0028 Sliding Window Attention

    Explain the sliding window attention mechanism in transformer architectures.

    Answer

    Sliding window attention is an optimization that addresses the scalability issues of the standard self-attention mechanism. It improves efficiency by limiting the attention scope of each token to a local, fixed-size window. This enables transformer models to handle longer sequences more effectively without a quadratic increase in computational resources. The trade-off is a potential loss of global context.

    Purpose: Efficiently scale attention for long sequences by restricting each token’s attention to a fixed-size local window instead of the full sequence.
    Window Size: Each token attends only to tokens within a fixed window of size  w (e.g., the token itself and  \pm \frac{w}{2} neighbors).
    Sparse Attention: Results in a sparse attention matrix — reduces memory and computation from  \mathcal{O}(n^2) to  \mathcal{O}(n \cdot w) .
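The banded structure described above can be built in a few lines of NumPy; the sequence length and window size are illustrative:

```python
import numpy as np

def sliding_window_mask(n, w):
    # True where |i - j| <= w // 2: each token sees itself plus w//2 neighbors
    # on each side, giving the sparse band around the diagonal.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = sliding_window_mask(8, 4)
# Interior rows have at most w + 1 = 5 allowed positions, versus n = 8 in
# full attention; rows near the edges have fewer.
```

Applying this mask before the softmax (setting disallowed scores to -inf) yields the O(n * w) sparse attention pattern.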

Here is a side-by-side comparison of Global Attention vs Sliding Window Attention: in global attention, each token attends to all others (a dense matrix); in sliding window attention, each token attends only to a small window of nearby tokens (a sparse band around the diagonal).


  • DL0027 Multi-Head Attention

    How does multi-head attention work in transformer architectures?

    Answer

    Multi-head attention projects the input into multiple distinct subspaces, with each head performing scaled dot-product attention independently on the full input sequence. By attending to different aspects or relationships within the data, these separate heads capture diverse information patterns. Their outputs are then combined to form a richer, more expressive representation, enabling the model to understand complex dependencies better and improve overall performance.

    Outputs from all heads are concatenated and linearly projected to form the final output.
    All heads are computed in parallel, enabling efficient computation.

    \mbox{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O
    Where:
    \text{head}_i = \mbox{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
    W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}
     W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}} : Final output projection matrix that maps the concatenated attention outputs back to the original model dimension.
     h : Number of attention heads.
     d_{\text{model}} : Dimensionality of the input embeddings and final output.
     d_k = d_{\text{model}} / h : Dimension of each head’s projected subspace.
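The equations above can be sketched in NumPy. For compactness this sketch slices one full d_model x d_model projection into h head subspaces, which is equivalent to applying separate W_i^Q, W_i^K, W_i^V per head; the loop over heads runs sequentially here but is parallel in practice, and all sizes are toy assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    n, d_model = x.shape
    d_k = d_model // h                       # each head's subspace dimension
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for i in range(h):                       # independent heads (parallel in practice)
        s = slice(i * d_k, (i + 1) * d_k)    # head i's projected subspace
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k)) @ V[:, s]
        heads.append(attn)
    # Concat(head_1, ..., head_h) W^O: back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
# y has shape (n, d_model): one combined multi-head output per token.
```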

The figure below shows a single-head attention heatmap and 4 independent multi-head attention heatmaps.

