DL0037 Transformer Architecture III

Why do Transformers use a dot product, rather than addition, to compute attention scores?

Answer

The dot product is a fast, natural similarity measure; with scaling, it remains numerically stable and highly parallelizable, which is why Transformers prefer it over additive attention.
(1) Dot product captures similarity: The dot product between a query q and a key k grows larger when they point in similar directions, making it a natural similarity measure.
The scores are normalized with softmax, giving them a probabilistic interpretation:
\alpha_i = \frac{e^{q \cdot k_i}}{\sum_{j=1}^K e^{q \cdot k_j}}
Where:
 q \cdot k_i is the dot-product similarity between the query and key i.
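The formula above can be sketched numerically. This is a minimal illustration (NumPy, with hypothetical toy vectors): the key aligned with the query receives the highest weight, and the weights sum to 1 like probabilities.

```python
import numpy as np

def attention_weights(q, K):
    """Softmax over dot-product scores between one query and a stack of keys."""
    scores = K @ q                 # shape (num_keys,): q . k_i for each key
    scores = scores - scores.max() # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()         # alpha_i, sums to 1

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0],    # aligned with q  -> largest score
              [0.0, 1.0],    # orthogonal to q
              [-1.0, 0.0]])  # opposite to q   -> smallest score
alpha = attention_weights(q, K)
```

Note the max-subtraction trick: it changes nothing mathematically (the constant cancels in the softmax ratio) but avoids overflow in `np.exp`.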

The figure below illustrates the dot product for measuring similarity.

(2) Efficient computation: All query-key dot products can be computed in parallel as a single matrix multiplication QK^\top, which is hardware-friendly.
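To make the batched view concrete, here is a sketch of scaled dot-product attention in NumPy (the \sqrt{d_k} scaling and the variable names are the standard formulation, the random inputs are illustrative): one matmul QK^\top produces every query-key score at once.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for all queries in one shot."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (num_q, num_k) in one matmul
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```

Because the score computation is a dense matrix product, it maps directly onto GPU/TPU matmul units, which is much harder to achieve with additive (MLP-based) scoring.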

