DL0045 Dimension in FFN

In Transformers, why does the feed-forward network expand the hidden dimension (e.g., $d_{\text{model}} \to 4\,d_{\text{model}}$) before reducing it back?

Answer

The feed-forward network in Transformers expands the hidden dimension (e.g., $d_{\text{model}} \to 4\,d_{\text{model}}$) to enhance the model's ability to learn complex, non-linear feature interactions, then reduces it back to stay compatible with the rest of the network. This expand-then-contract design (sometimes called an inverted bottleneck) balances expressiveness against parameter and compute cost, and has empirically been shown to boost performance in large-scale models.
(1) Extra capacity: Expanding from $d_{\text{model}}$ to $4\,d_{\text{model}}$ allows the FFN to capture richer nonlinear transformations.
(2) Non-linear mixing: The intermediate expansion lets the activation function (ReLU, GELU, SwiGLU, etc.) operate in a higher-dimensional space, capturing more complex patterns.
(3) Projection back ensures compatibility: Reducing the dimension back to $d_{\text{model}}$ keeps the output compatible with subsequent layers, preserving residual-connection compatibility and uniformity across layers.
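A side effect of the expansion worth knowing for interviews: the FFN holds most of a Transformer block's parameters. A quick back-of-the-envelope check (using an illustrative GPT-2-small-like $d_{\text{model}} = 768$; the sizes are assumptions, not from the question):

```python
# Parameter count for a standard FFN with expansion factor 4.
d_model = 768
d_ff = 4 * d_model  # 3072

# W1 (d_ff x d_model) + b1, then W2 (d_model x d_ff) + b2
ffn_params = (d_ff * d_model + d_ff) + (d_model * d_ff + d_model)

# For comparison, one attention block's Q/K/V/output projections
# contribute roughly 4 * d_model^2 parameters.
attn_params = 4 * d_model * d_model

print(ffn_params)                # 4722432
print(ffn_params / attn_params)  # ~2.0: FFN is about twice the attention params
```

So the 4x expansion means the FFN carries roughly two thirds of the per-layer weights.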

The equation below shows the architecture of the FFN:
$$\text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$$
Where:
$x \in \mathbb{R}^{d_{\text{model}}}$ is the input vector.
$W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}}$ and $b_1 \in \mathbb{R}^{4d_{\text{model}}}$ expand the dimension.
$\sigma$ is a non-linear activation (e.g., ReLU/GELU).
$W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}}$ and $b_2 \in \mathbb{R}^{d_{\text{model}}}$ project back down.
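A minimal NumPy sketch of this equation (the names `W1`, `b1`, `W2`, `b2` mirror the formula; the tiny `d_model = 8`, the random initialization, and the tanh-approximate GELU are illustrative choices, not part of any particular model):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4*d_model, apply nonlinearity, project back."""
    h = x @ W1.T + b1           # (d_model,) -> (4*d_model,)
    return gelu(h) @ W2.T + b2  # (4*d_model,) -> (d_model,)

d_model = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4 * d_model, d_model)) * 0.02
b1 = np.zeros(4 * d_model)
W2 = rng.standard_normal((d_model, 4 * d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,) — same dimension as the input, ready for the residual add
```

Note that the FFN is applied independently at each sequence position; the same weights are shared across all tokens.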
