DL0045 Dimension in FFN

In Transformers, why does the feed-forward network expand the hidden dimension (e.g., $d_{\text{model}} \to 4\,d_{\text{model}}$) before reducing it back?

Answer

The feed-forward network in Transformers expands the hidden dimension (e.g., $d_{\text{model}} \to 4\,d_{\text{model}}$) to enhance the model's ability to learn complex, non-linear feature interactions, then reduces it back to stay compatible with the rest of the network. This expand-then-contract design (sometimes called an inverted bottleneck) balances expressiveness against parameter and compute cost, and has empirically been shown to boost performance in large-scale models.
(1) Extra capacity: Expanding from $d_{\text{model}}$ to $4\,d_{\text{model}}$ allows the FFN to capture richer nonlinear transformations.
(2) Non-linear mixing: The intermediate expansion lets the activation function (ReLU, GELU, SwiGLU, etc.) operate in a higher-dimensional space, capturing more complex patterns.
(3) Projection back ensures compatibility: Reducing the dimension back to $d_{\text{model}}$ keeps the output compatible with subsequent layers, preserving residual-connection compatibility and uniformity across layers.
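A side effect of the expansion worth knowing for interviews: the FFN holds most of a Transformer block's parameters. A quick back-of-the-envelope check (using an illustrative GPT-2-small-like $d_{\text{model}} = 768$; the sizes are assumptions, not from the question):

```python
# Parameter count for a standard FFN with expansion factor 4.
d_model = 768
d_ff = 4 * d_model  # 3072

# W1 (d_ff x d_model) + b1, then W2 (d_model x d_ff) + b2
ffn_params = (d_ff * d_model + d_ff) + (d_model * d_ff + d_model)

# For comparison, one attention block's Q/K/V/output projections
# contribute roughly 4 * d_model^2 parameters.
attn_params = 4 * d_model * d_model

print(ffn_params)                # 4722432
print(ffn_params / attn_params)  # ~2.0: FFN is about twice the attention params
```

So the 4x expansion means the FFN carries roughly two thirds of the per-layer weights.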

The equation below shows the architecture of the FFN:
$$\text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$$
Where:
$x \in \mathbb{R}^{d_{\text{model}}}$ is the input vector.
$W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}}$ and $b_1 \in \mathbb{R}^{4d_{\text{model}}}$ expand the dimension.
$\sigma$ is a non-linear activation (e.g., ReLU/GELU).
$W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}}$ and $b_2 \in \mathbb{R}^{d_{\text{model}}}$ project back down.
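A minimal NumPy sketch of this equation (the names `W1`, `b1`, `W2`, `b2` mirror the formula; the tiny `d_model = 8`, the random initialization, and the tanh-approximate GELU are illustrative choices, not part of any particular model):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4*d_model, apply nonlinearity, project back."""
    h = x @ W1.T + b1           # (d_model,) -> (4*d_model,)
    return gelu(h) @ W2.T + b2  # (4*d_model,) -> (d_model,)

d_model = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4 * d_model, d_model)) * 0.02
b1 = np.zeros(4 * d_model)
W2 = rng.standard_normal((d_model, 4 * d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,) — same dimension as the input, ready for the residual add
```

Note that the FFN is applied independently at each sequence position; the same weights are shared across all tokens.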
