DL0038 Transformer Activation

Which activation functions do transformer models use?

Answer

Transformers mainly use GELU or ReLU in the feed-forward layers to introduce non-linearity, and Softmax in the attention mechanism to produce normalized attention weights. GELU is generally preferred for its smoother gradient flow and better empirical performance.
(1) Feed-Forward Network (FFN):
Uses ReLU or GELU as the non-linear activation.
GELU is more common in modern Transformers (e.g., BERT, GPT).
Equation for GELU:
\mbox{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \mbox{erf}\left(\frac{x}{\sqrt{2}}\right)\right]
Where:
x is the input,
\Phi(x) is the Cumulative Distribution Function (CDF) of the standard Gaussian.

The figure below demonstrates the difference between ReLU and GELU.
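As a minimal sketch in plain Python (no deep-learning framework assumed), the two activations can be written directly from their definitions. `gelu_tanh` is the tanh-based approximation commonly used in BERT/GPT implementations; the constant 0.044715 comes from that approximation, not from this article.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU (as used in many BERT/GPT codebases)."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

Note how GELU differs from ReLU near zero: ReLU is exactly zero for all negative inputs, while GELU lets small negative values pass through with a small weight, which is what gives it smoother gradients.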

(2) Attention Output:
Uses Softmax to convert attention scores into probabilities.
Equation for Softmax:
\mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
Where:
 z_i represents the raw attention score for the i-th token,
 K is the total number of tokens considered in attention.
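The equation above can be sketched in plain Python. The max-subtraction step is a standard numerical-stability trick (it does not change the result, since it cancels in the ratio) and is an implementation detail, not part of the formula itself.

```python
import math

def softmax(scores):
    """Convert raw attention scores z_1..z_K into a probability
    distribution. Subtracting the max before exponentiating avoids
    overflow without changing the output."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Applied to a row of attention scores, the output is a set of non-negative weights summing to 1, with larger scores receiving exponentially larger weight.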

