In Transformers, why does the feed-forward network expand the hidden dimension (e.g., d_model → 4·d_model) before reducing it back?
Answer
The feed-forward network in Transformers expands the hidden dimension (typically by a factor of 4, e.g., 512 → 2048 in the original Transformer) to enhance the model's ability to learn complex, non-linear feature interactions, then reduces it back to maintain compatibility with other layers. This expand-then-contract design balances expressiveness and efficiency, and has been empirically shown to boost performance in large-scale models.
(1) Extra capacity: Expanding from d_model to a wider intermediate dimension d_ff gives the FFN more parameters, allowing it to capture richer nonlinear transformations.
(2) Non‑linear mixing: The intermediate expansion allows the activation function (ReLU, GeLU, SwiGLU, etc.) to operate in a richer space, capturing more complex patterns.
(3) Projection back ensures compatibility: Reducing the dimension back to d_model keeps the output shape consistent with the residual connection and the subsequent layers, preserving uniformity across the network.
The FFN can be written as:

FFN(x) = W₂ σ(W₁x + b₁) + b₂
Where:
- x is the input vector of dimension d_model.
- W₁ ∈ ℝ^(d_ff × d_model) expands the dimension.
- σ is a non-linear activation (e.g., ReLU/GELU).
- W₂ ∈ ℝ^(d_model × d_ff) projects back down.
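The equation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real implementation: the dimensions (512 → 2048, the common 4× expansion) and the random weights are assumptions chosen purely to show the shapes.

```python
import numpy as np

# Assumed dimensions: d_model = 512, d_ff = 4 * d_model = 2048.
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

# Random weights for illustration only (a trained model learns these).
W1 = rng.standard_normal((d_model, d_ff)) * 0.02  # expands d_model -> d_ff
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02  # projects d_ff -> d_model
b2 = np.zeros(d_model)

def ffn(x):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied independently per token.

    Row-vector convention: x has shape (n_tokens, d_model), so the
    expansion is x @ W1 rather than W1 @ x as in the column-vector form.
    """
    h = np.maximum(0.0, x @ W1 + b1)  # non-linear mixing in the wider space
    return h @ W2 + b2                # project back down to d_model

x = rng.standard_normal((10, d_model))  # a batch of 10 token vectors
out = ffn(x)
print(out.shape)  # output matches the input shape, so the residual x + out works
```

Because the output has the same shape as the input, the residual addition `x + ffn(x)` used around every Transformer sublayer remains valid.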
