Why do Transformers use a dot product, rather than addition, to compute attention scores?
Answer
The dot product is a fast, natural similarity measure; with scaling, it stays numerically stable and maps directly onto hardware-friendly matrix multiplication, which is why Transformers prefer it to additive attention.
(1) Dot product captures similarity: The dot product between a query and a key
grows larger when they point in similar directions, making it a natural similarity measure.
The scores are normalized with a Softmax, giving them a probabilistic interpretation:

$$\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}$$

where $q_i \cdot k_j$ is the dot product similarity between query $i$ and key $j$, and $\sqrt{d_k}$ is the scaling factor.
(Figure: the dot product as a similarity measure — vectors pointing in similar directions yield larger dot products.)
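A minimal sketch of point (1), using NumPy with toy vectors (the specific values and dimension are illustrative assumptions): a key aligned with the query receives a larger scaled dot-product score, and Softmax turns the scores into a probability distribution.

```python
import numpy as np

d_k = 4  # key/query dimension (chosen for illustration)
q = np.array([1.0, 0.5, -0.3, 0.8])      # query vector
K = np.array([
    [1.0, 0.4, -0.2, 0.9],               # key pointing in a similar direction to q
    [-1.0, -0.5, 0.3, -0.8],             # key pointing the opposite way
])

# Scaled dot-product scores: larger when query and key are aligned.
scores = K @ q / np.sqrt(d_k)

# Softmax normalizes the scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

print(scores)   # the aligned key gets the higher score
print(weights)  # a valid probability distribution over the keys
```

The aligned key dominates the attention weights, which is exactly the "similarity" behavior the dot product provides.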
(2) Efficient computation: All query–key dot products can be computed at once as a single matrix multiplication $QK^\top$, which parallelizes well and is hardware-friendly.