DL0038 Transformer Activation

Which activation functions do transformer models use?

Answer

Transformers use ReLU or GELU in the feed-forward layers to introduce non-linearity, and Softmax in the attention layers to turn attention scores into normalized attention weights. GELU is often preferred because it is smooth, keeps a non-zero gradient for small negative inputs, and tends to perform better in practice.
(1) Feed-Forward Network (FFN):
Uses ReLU or GELU as the non-linear activation between the two linear layers.
GELU is more common in modern Transformers such as BERT and GPT; the original Transformer used ReLU.
Equation for GELU:
$\mbox{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \mbox{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$
Where:
$x$ is the input,
$\Phi(x)$ is the Cumulative Distribution Function (CDF) of the standard Gaussian.
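To make the formula concrete, here is a minimal sketch that evaluates ReLU and the exact GELU side by side using only the Python standard library. The function names `relu` and `gelu` and the sample inputs are chosen for illustration; deep learning frameworks ship their own vectorized versions.

```python
import math


def relu(x: float) -> float:
    """ReLU: passes positive inputs through, clamps negatives to zero."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare the two activations on a few illustrative inputs.
for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}")
```

Unlike ReLU, GELU returns small negative values for negative inputs and has no kink at zero; for large positive inputs the two are nearly identical.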

The figure below demonstrates the difference between ReLU and GELU.

(2) Attention Output:
Uses Softmax to convert the attention scores for each query into a probability distribution (the attention weights), which is then used to weight the values.
Equation for Softmax:
$\mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
Where:
$z_i$ represents the raw attention score for the $i$-th token,
$K$ is the total number of tokens considered in attention.
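The Softmax equation can be sketched in a few lines of plain Python. The function name `softmax` and the example scores are assumptions for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability step that is not part of the equation above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores z_1..z_K into weights that sum to 1."""
    # Shift by the max so exp() never overflows; the ratio is unaffected.
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores of one query against K = 3 tokens.
weights = softmax([2.0, 1.0, 0.1])
print(weights)
print(sum(weights))
```

The largest score receives the largest weight, and the weights always sum to 1 (up to floating-point rounding), so they can be read as probabilities over the $K$ tokens.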
