DL0027 Multi-Head Attention

How does multi-head attention work in transformer architectures?

Answer

Multi-head attention projects the queries, keys, and values into $h$ lower-dimensional subspaces using separate learned linear maps, and each head performs scaled dot-product attention independently over the full input sequence. Because each head has its own projections, different heads can attend to different aspects of the data or to different relationships between positions. Combining their outputs gives a richer representation than a single attention function over the same total width, which helps the model capture complex dependencies.
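
For reference, the scaled dot-product attention that each head applies is commonly written as follows, where $d_k$ is the dimension of the keys seen by that head:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$
The division by $\sqrt{d_k}$ keeps the dot products from growing with the head dimension, which would otherwise push the softmax into regions with very small gradients.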

Outputs from all heads are concatenated and linearly projected to form the final output.
The heads do not depend on one another, so they can all be computed in parallel.
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$
Where:
$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
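$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$ : Learned projection matrices that map the queries, keys, and values into the subspace of head $i$.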
$W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}}$ : Final output projection matrix that maps the concatenated attention outputs back to the original model dimension.
$h$ : Number of attention heads.
$d_{\text{model}}$ : Dimensionality of the input embeddings and final output.
$d_k = d_{\text{model}} / h$ : Dimension of each head’s projected subspace.
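
The formulas above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes a single unbatched sequence, no masking, no dropout, and no bias terms, and it stacks the per-head matrices into arrays of shape (h, d_model, d_k). The names `multi_head_attention`, `W_q`, `W_k`, `W_v`, and `W_o` are chosen for this example only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Illustrative multi-head attention (assumed shapes, no batching or masking).

    Q, K, V:        (seq_len, d_model)
    W_q, W_k, W_v:  (h, d_model, d_k), one projection matrix per head
    W_o:            (h * d_k, d_model), the output projection W^O
    """
    h, d_model, d_k = W_q.shape
    # Project into all h subspaces at once: each result is (h, seq_len, d_k).
    q = np.einsum("sd,hdk->hsk", Q, W_q)
    k = np.einsum("sd,hdk->hsk", K, W_k)
    v = np.einsum("sd,hdk->hsk", V, W_v)
    # Scaled dot-product attention, computed for every head in parallel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq_q, seq_k)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                               # (h, seq_q, d_k)
    # Concat(head_1, ..., head_h): place heads side by side on the feature axis.
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], h * d_k)
    return concat @ W_o, weights


rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
d_k = d_model // h  # 4 dimensions per head
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
W_o = rng.standard_normal((h * d_k, d_model))

# Self-attention: queries, keys, and values all come from the same input.
out, weights = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

The output has the same shape as the input, (seq_len, d_model), because $h \cdot d_k = d_{\text{model}}$ and $W^O$ maps the concatenated heads back to the model dimension. Each head keeps its own (seq_len, seq_len) attention matrix, which is where the heads can differ in what they attend to.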