DL0031 FFN in Transformer

What is the purpose of the feed-forward network inside each Transformer block?

Answer

The feed-forward network (FFN) inside each Transformer block processes each token’s features independently after attention, expands and transforms them non-linearly, and projects them back to the model’s dimension. This ensures that after attention has mixed information across tokens, each token’s representation is individually refined for richer feature learning.

Purpose of FFN:
(1) Non-linear transformation: Adds non-linearity after attention, allowing the model to capture complex patterns.
(2) Token-wise processing: Applies the same transformation to each token independently (no mixing across positions).
(3) Dimensional expansion: Increases dimensionality in the hidden layer (typically 4× the model dimension, e.g. 512 → 2048 in the original Transformer) to give the network more capacity.
(4) Feature recombination: Refines and reweights token representations produced by the attention mechanism.
(5) Complement to attention: Attention mixes information across tokens; the FFN processes each token’s features deeply.

Typical FFN equation in a Transformer:
\mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
Where:
 x — input vector for a token after the attention layer
 W_1, W_2 — trainable weight matrices
 b_1, b_2 — trainable bias vectors
 \max(0, \cdot) — ReLU activation (sometimes replaced by GELU)
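The equation above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the weight shapes follow the common convention d_ff = 4 × d_model, and the random initialization and toy sizes are chosen only for demonstration. The permutation check at the end verifies the token-wise property from point (2): because the FFN acts on each row independently, reordering the tokens simply reorders the outputs.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: FFN(x) = max(0, x W1 + b1) W2 + b2.

    x:  (seq_len, d_model) -- each row (token) is transformed independently.
    W1: (d_model, d_ff), W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU expansion to d_ff
    return hidden @ W2 + b2                # projection back to d_model

# Toy sizes for illustration; real models use e.g. d_model=512, d_ff=2048.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 4, 16, 3
x  = rng.standard_normal((seq_len, d_model))
W1 = rng.standard_normal((d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model))
b2 = np.zeros(d_model)

out = ffn(x, W1, b1, W2, b2)
assert out.shape == (seq_len, d_model)  # output keeps the input shape

# Token-wise processing: permuting the tokens permutes the outputs identically.
perm = [2, 0, 1]
assert np.allclose(ffn(x[perm], W1, b1, W2, b2), out[perm])
```

Swapping `np.maximum(0.0, ...)` for a GELU gives the variant used in many modern models; the structure is otherwise unchanged.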

