DL0031 FFN in Transformer

What is the purpose of the feed-forward network inside each Transformer block?

Answer

The feed-forward network (FFN) inside each Transformer block processes each token’s features independently after attention, expands and transforms them non-linearly, and projects them back to the model’s dimension. This ensures that after attention has mixed information across tokens, each token’s representation is individually refined for richer feature learning.

Purpose of FFN:
(1) Non-linear transformation: Adds non-linearity after attention, allowing the model to capture complex patterns.
(2) Token-wise processing: Applies the same transformation to each token independently (no mixing across positions).
(3) Dimensional expansion: Increases dimensionality in the hidden layer (typically 4× the model dimension, e.g. 512 → 2048 in the original Transformer) to give the network more capacity.
(4) Feature recombination: Refines and reweights token representations produced by the attention mechanism.
(5) Complement to attention: Attention mixes information across tokens; the FFN processes each token’s features deeply.

Typical FFN equation in a Transformer:
\mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
Where:
 x — input vector for a token after the attention layer
 W_1, W_2 — trainable weight matrices
 b_1, b_2 — trainable bias vectors
 \max(0, \cdot) — ReLU activation (sometimes replaced by GELU)
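The equation above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the weight shapes follow the common convention d_ff = 4 × d_model, and the random initialization and toy sizes are chosen only for demonstration. The permutation check at the end verifies the token-wise property from point (2): because the FFN acts on each row independently, reordering the tokens simply reorders the outputs.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: FFN(x) = max(0, x W1 + b1) W2 + b2.

    x:  (seq_len, d_model) -- each row (token) is transformed independently.
    W1: (d_model, d_ff), W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU expansion to d_ff
    return hidden @ W2 + b2                # projection back to d_model

# Toy sizes for illustration; real models use e.g. d_model=512, d_ff=2048.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 4, 16, 3
x  = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)

out = ffn(x, W1, b1, W2, b2)
assert out.shape == (seq_len, d_model)  # output keeps the input shape

# Token-wise processing: permuting the tokens permutes the outputs identically.
perm = [2, 0, 1]
assert np.allclose(ffn(x[perm], W1, b1, W2, b2), out[perm])
```

Swapping `np.maximum(0.0, ...)` for a GELU gives the variant used in many modern models; the structure is otherwise unchanged.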

