DL0033 Transformer Computation

In a Transformer architecture, which components are the primary contributors to computational cost, and why?

Answer

For short sequences, the feed-forward network (FFN) is often the dominant cost. For long sequences, the multi-head attention mechanism becomes the overwhelming bottleneck.
(1) Multi‑Head Attention (MHA):
Short sequences (small n): the n × n attention score matrices are cheap; the Q, K, and V projections, at \mathcal{O}(n \cdot d^2), dominate MHA's compute.
Long sequences (large n): the score and weighted-sum matmuls grow quadratically in n, because every token attends to every other token, and MHA becomes the main bottleneck. Cost: \mathcal{O}(n^2 \cdot d)
(2) Feed-Forward Network (FFN):
Two dense layers with an expansion factor of 4 (d → 4d → d).
Cost: \mathcal{O}(n \cdot d^2)
Short sequences: the FFN dominates, since n is small but d^2 is large.
Long sequences: FFN cost grows only linearly in n, so MHA overtakes it once n is large enough.
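As a rough back-of-envelope sketch (under assumptions not stated in the original: a multiply-accumulate counted as 2 FLOPs, per-layer MHA cost of 8nd^2 for the Q/K/V/output projections plus 4n^2 d for the score and weighted-sum matmuls, and FFN cost of 16nd^2 with the 4× expansion), the crossover where MHA overtakes the FFN falls at:

8nd^2 + 4n^2 d = 16nd^2 \implies n = 2d

So for d = 512, attention cost would pull ahead around sequence length n ≈ 1024. The exact crossover shifts with how projections and the factor-of-2 for multiply-adds are counted, but the n ∝ d scaling holds.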

The table below shows the FLOP breakdown comparing Multi‑Head Attention (MHA) and Feed‑Forward Network (FFN) at different sequence lengths for a Transformer configuration with d = 512.
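Such a breakdown can be reproduced with a short script. This is an illustrative sketch, not the original table: the sequence lengths chosen, the factor of 2 per multiply-accumulate, the inclusion of the output projection, and the 4× FFN expansion are all assumptions.

```python
# Illustrative per-layer FLOP accounting for one Transformer layer (forward pass).
# Assumptions (not from the original table): a multiply-accumulate counts as
# 2 FLOPs, MHA includes the Q, K, V, and output projections, and the FFN uses
# an expansion factor of 4.

def mha_flops(n: int, d: int) -> int:
    proj = 2 * 4 * n * d * d      # Q, K, V, and output projections: 8 n d^2
    scores = 2 * n * n * d        # Q @ K^T: 2 n^2 d
    weighted = 2 * n * n * d      # attn @ V: 2 n^2 d
    return proj + scores + weighted

def ffn_flops(n: int, d: int, expansion: int = 4) -> int:
    # Two dense layers, d -> 4d -> d: 16 n d^2 total
    return 2 * 2 * n * d * (expansion * d)

if __name__ == "__main__":
    d = 512
    for n in (128, 512, 2048, 8192):
        m, f = mha_flops(n, d), ffn_flops(n, d)
        print(f"n={n:5d}  MHA={m:.3e}  FFN={f:.3e}  MHA/FFN={m / f:.2f}")
```

Running it shows the ratio MHA/FFN climbing with n: below 1 at short lengths (FFN-dominated) and well above 1 at long lengths (attention-dominated), matching the qualitative argument above.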



