Category: Hard

  • DL0039 Transformer Weight Tying

    Explain weight sharing in Transformers.

    Answer

Weight sharing in Transformers mainly refers to tying the input embedding matrix to the output projection matrix used for softmax prediction, which saves parameters and enforces consistency between the input and output token spaces. In some models (e.g., ALBERT), it also extends to sharing weights across Transformer layers for further parameter efficiency.

    (1) Input–Output Embedding Tying:
    The same embedding matrix is used for both input token embeddings and the output softmax projection.
    Reduces parameters and enforces consistency between input and output spaces.
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
    z_i = (E h)_i is the logit for token i, computed using the embedding matrix E \in \mathbb{R}^{K \times d}.
    h \in \mathbb{R}^{d} is the hidden representation from the Transformer.
    K is the vocabulary size.

Weight tying is illustrated in the figure below.
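In PyTorch, tying is typically done by pointing the output projection at the embedding's parameter so both share one matrix. A minimal sketch (the class name and the sizes K = 100, d = 16 are illustrative):

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy LM head with input-output embedding tying (illustrative sizes)."""
    def __init__(self, K=100, d=16):
        super().__init__()
        self.emb = nn.Embedding(K, d)           # E in R^{K x d}
        self.out = nn.Linear(d, K, bias=False)  # weight also has shape (K, d)
        self.out.weight = self.emb.weight       # tie: one shared Parameter

    def forward(self, h):
        # z = E h, so z_i = (E h)_i as in the softmax formula above
        return self.out(h)

model = TiedLM()
# A single matrix serves both roles; gradients from both paths accumulate on it.
assert model.out.weight is model.emb.weight
```

Because `nn.Linear(d, K)` stores its weight as a (K, d) matrix, it matches the embedding's shape exactly, making the assignment a pure aliasing operation with no copy.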

    (2) Layer Weight Sharing (e.g., ALBERT [1]):
    Instead of unique weights per layer, parameters are reused across all Transformer blocks.
    Cuts model size dramatically while keeping depth.
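Cross-layer sharing can be sketched by looping one block over the desired depth. The block below is deliberately simplified to a single Linear + ReLU (real ALBERT shares full attention/FFN blocks); names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style sharing sketch: one block's parameters reused at every depth."""
    def __init__(self, d=16, n_layers=12):
        super().__init__()
        self.block = nn.Linear(d, d)  # the only learnable weights
        self.n_layers = n_layers

    def forward(self, x):
        for _ in range(self.n_layers):  # depth without new parameters
            x = torch.relu(self.block(x))
        return x

enc = SharedLayerEncoder()
# Parameter count is independent of depth: d*d weights + d biases.
assert sum(p.numel() for p in enc.parameters()) == 16 * 16 + 16
```

The key property is visible in the assertion: increasing `n_layers` adds compute but not parameters, which is exactly the trade-off layer sharing makes.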

    References:
[1] Lan, Zhenzhong, et al. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." arXiv preprint arXiv:1909.11942 (2019).



  • DL0033 Transformer Computation

    In a Transformer architecture, which components are the primary contributors to computational cost, and why?

    Answer

    For short sequences, the feed-forward network (FFN) is often the dominant cost. For long sequences, the multi-head attention mechanism becomes the overwhelming bottleneck.
    (1) Multi‑Head Attention (MHA):
Short sequences (small n): the n × n attention score matrix is cheap to form; the Q, K, V, and output projections dominate attention's compute.
Long sequences (large n): cost grows quadratically with n because every token attends to every other token, making attention the main bottleneck. Cost: \mathcal{O}(n^2 \cdot d)
    (2) Feed-Forward Network (FFN):
    Two dense layers with an expansion factor of 4.
Cost: \mathcal{O}(n \cdot d^2)
Short sequences: the FFN dominates because n is small while d^2 is large.
Long sequences: FFN cost grows only linearly with n, so MHA overtakes it once n exceeds a small multiple of d.

The table below shows the FLOP breakdown comparing Multi-Head Attention (MHA) and the Feed-Forward Network (FFN) at different sequence lengths for a representative Transformer configuration with d = 512.
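The crossover can also be checked with back-of-envelope arithmetic. The sketch below uses the common 2-FLOPs-per-multiply-add convention and deliberately omits projections, biases, softmax, and normalization, so the constants are approximate:

```python
def mha_score_flops(n, d):
    # QK^T scores plus the attention-weighted sum A V:
    # two (n x n x d) matmuls, ~2 FLOPs per multiply-add -> 4 * n^2 * d
    return 4 * n * n * d

def ffn_flops(n, d, expansion=4):
    # two dense layers, d -> 4d and 4d -> d, per token:
    # 2 * (2 * d * 4d) FLOPs each token -> 16 * n * d^2
    return 4 * n * d * expansion * d

d = 512
for n in (128, 512, 2048, 8192):
    print(f"n={n:5d}  MHA(scores)={mha_score_flops(n, d):.2e}  "
          f"FFN={ffn_flops(n, d):.2e}")
```

Under these constants the two terms balance at n = 4d (n = 2048 for d = 512): below that the FFN dominates, above it the quadratic attention term takes over, matching the qualitative argument above.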

