DL0033 Transformer Computation


In a Transformer architecture, which components are the primary contributors to computational cost, and why?

Answer

For short sequences, the feed-forward network (FFN) is often the dominant cost. For long sequences, the multi-head attention mechanism becomes the overwhelming bottleneck.
(1) Multi‑Head Attention (MHA):
Short sequences (small $n$): Cost is relatively small. The $n \times n$ attention score matrix adds little overhead, so the Q, K, and V projections, which cost $\mathcal{O}(n \cdot d^2)$, dominate the attention compute.
Long sequences (large $n$): Cost grows quadratically with $n$ because every token attends to every other token, so the attention scores become the main bottleneck. Cost: $\mathcal{O}(n^2 \cdot d)$
(2) Feed-Forward Network (FFN):
Two dense layers with an expansion factor of 4 (hidden width $4d$).
Cost: $\mathcal{O}(n \cdot d^2)$
Short sequences: FFN dominates the cost because $n$ is small while $d^2$ is large.
Long sequences: Cost grows only linearly with $n$, so the MHA cost overtakes it once $n$ is large enough.
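
To make the crossover concrete, here is a rough worked estimate in multiply-accumulate operations per layer. The constants are assumptions, not part of the answer above: MHA is taken to have four $d \times d$ projections (Q, K, V, and output), the FFN uses the expansion factor of 4, and softmax, biases, and layer normalization are ignored.

$$
\begin{aligned}
C_{\text{MHA}}(n, d) &\approx \underbrace{4 n d^2}_{\text{Q, K, V, output projections}} + \underbrace{2 n^2 d}_{QK^\top \text{ and weighting of } V} \\
C_{\text{FFN}}(n, d) &\approx 2 \cdot n \cdot d \cdot 4d = 8 n d^2 \\
C_{\text{MHA}} > C_{\text{FFN}} &\iff 2 n^2 d > 4 n d^2 \iff n > 2d
\end{aligned}
$$

Under these assumptions, MHA as a whole overtakes the FFN at roughly $n > 2d$, and the quadratic term alone exceeds the FFN cost when $2 n^2 d > 8 n d^2$, that is, $n > 4d$.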

The table below shows the FLOP breakdown between Multi‑Head Attention (MHA) and the Feed‑Forward Network (FFN) at different sequence lengths for one Transformer configuration with $d = 512$.
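
A breakdown of this kind can be approximated with a few lines of Python. This is a minimal sketch, not the calculation behind the original table: the function names `mha_macs` and `ffn_macs` are hypothetical, the sequence lengths are chosen only for illustration, and the counts assume four $d \times d$ attention projections, an FFN expansion factor of 4, and no softmax, bias, or normalization cost.

```python
def mha_macs(n: int, d: int) -> int:
    """Approximate multiply-accumulates for one multi-head attention layer.

    Assumes Q, K, V, and output projections of size d x d (an assumption,
    not stated in the text) and sums the per-head work over all heads.
    """
    projections = 4 * n * d * d   # Q, K, V, and output projections
    scores = n * n * d            # Q @ K^T, summed over heads
    weighted_sum = n * n * d      # attention weights @ V, summed over heads
    return projections + scores + weighted_sum


def ffn_macs(n: int, d: int, expansion: int = 4) -> int:
    """Approximate multiply-accumulates for a two-layer feed-forward network."""
    hidden = expansion * d
    return n * d * hidden + n * hidden * d   # up-projection + down-projection


if __name__ == "__main__":
    d = 512
    print(f"{'n':>6} {'MHA MACs':>16} {'FFN MACs':>16} {'MHA share':>10}")
    for n in (128, 512, 1024, 2048, 8192):   # illustrative sequence lengths
        mha, ffn = mha_macs(n, d), ffn_macs(n, d)
        print(f"{n:>6} {mha:>16,} {ffn:>16,} {mha / (mha + ffn):>10.1%}")
```

With these assumed constants the two costs are equal at $n = 2d$, so for $d = 512$ the FFN dominates below $n = 1024$ and MHA dominates above it. Multiplying the counts by 2 converts multiply-accumulates to FLOPs under the usual convention.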
