DL0039 Transformer Weight Tying

Explain weight sharing in Transformers.

Answer

Weight sharing in Transformers mainly refers to tying the input embedding matrix to the output projection matrix used for softmax prediction, which saves parameters and keeps the input and output token representations consistent. In some models (like ALBERT), it also extends to sharing weights across Transformer layers for further parameter efficiency.

(1) Input–Output Embedding Tying:
The same embedding matrix is used for both input token embeddings and the output softmax projection.
Reduces parameters and enforces consistency between input and output spaces.
$\mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
Where:
$z_i = (E h)_i$ is the logit for token $i$, computed using the embedding matrix $E \in \mathbb{R}^{K \times d}$.
$h \in \mathbb{R}^{d}$ is the hidden representation from the Transformer.
$K$ is the vocabulary size.
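
The tying described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full model: the sizes `K = 10` and `d = 4`, the helper names `embed` and `output_probs`, and the use of a mean over embeddings as a stand-in for the Transformer's hidden state $h$ are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 10, 4  # toy vocabulary size and hidden size (illustrative values)

# One shared matrix E in R^{K x d}: row i is the embedding of token i.
E = rng.normal(size=(K, d))

def embed(token_ids):
    """Input side: look up rows of E (hypothetical helper)."""
    return E[token_ids]

def output_probs(h):
    """Output side: logits z = E h, then softmax over the vocabulary."""
    z = E @ h
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

token_ids = np.array([1, 3, 5])
x = embed(token_ids)       # shape (3, d)
h = x.mean(axis=0)         # placeholder for the Transformer output, not a real model
probs = output_probs(h)    # shape (K,), sums to 1

tied_params = E.size           # K * d, one matrix used on both sides
untied_params = 2 * K * d      # separate input and output matrices
```

Because `embed` and `output_probs` read the same array `E`, any update to `E` changes both the input embeddings and the output logits, and the embedding-related parameter count is halved relative to the untied case.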

Weight tying is shown in the figure below.

(2) Layer Weight Sharing (e.g., ALBERT [1]):
Instead of unique weights per layer, parameters are reused across all Transformer blocks.
Cuts the parameter count substantially while keeping depth; the compute per forward pass is unchanged, because every layer is still executed.
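
As a rough sketch of cross-layer sharing, the example below applies one parameter set at every depth and compares its parameter count with an unshared stack. The `block` function is a toy residual layer standing in for a real Transformer block (no attention or feed-forward sublayers), and `d = 8`, `num_layers = 12`, `make_block_params`, and `count_params` are hypothetical names and values chosen for illustration; this is not ALBERT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_layers = 8, 12  # illustrative hidden size and depth

def make_block_params():
    # Toy stand-in for one block's weights: a d x d matrix and a bias.
    return {"W": rng.normal(scale=0.1, size=(d, d)), "b": np.zeros(d)}

def block(x, p):
    # Residual update with a tanh nonlinearity (not real attention / FFN).
    return x + np.tanh(x @ p["W"] + p["b"])

def count_params(param_sets):
    # Count each distinct array once, even if it is reused across layers.
    unique = {id(a): a for p in param_sets for a in p.values()}
    return sum(a.size for a in unique.values())

shared = make_block_params()
shared_layers = [shared] * num_layers                       # same weights at every depth
unshared_layers = [make_block_params() for _ in range(num_layers)]

x = rng.normal(size=(3, d))
y = x
for p in shared_layers:   # still num_layers applications, so depth is kept
    y = block(y, p)

shared_count = count_params(shared_layers)       # d*d + d
unshared_count = count_params(unshared_layers)   # num_layers * (d*d + d)
```

The loop still runs `num_layers` times, so the forward-pass cost matches the unshared stack; only the number of stored parameters shrinks, here by a factor of `num_layers`.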

References:
[1] Lan, Zhenzhong, et al. “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.” arXiv preprint arXiv:1909.11942 (2019).
