DL0037 Transformer Architecture III

Why do Transformers use a dot product, rather than addition, to compute attention scores?

Answer

The dot product is a fast, natural similarity measure; with scaling, it remains numerically stable and highly parallelizable, which is why Transformers prefer it over additive attention.
(1) Dot product captures similarity: The dot product between a query q and a key k grows larger when they point in similar directions, making it a natural similarity measure.
The scores are normalized with softmax, giving them a probabilistic interpretation:
\alpha_i = \frac{e^{q \cdot k_i}}{\sum_{j=1}^K e^{q \cdot k_j}}
Where:
 q \cdot k_i is the dot-product similarity between the query and key i.
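The formula above can be sketched numerically. This is a minimal illustration (NumPy, with hypothetical toy vectors): the key aligned with the query receives the highest weight, and the weights sum to 1 like probabilities.

```python
import numpy as np

def attention_weights(q, K):
    """Softmax over dot-product scores between one query and a stack of keys."""
    scores = K @ q                 # shape (num_keys,): q . k_i for each key
    scores = scores - scores.max() # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()         # alpha_i, sums to 1

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0],    # aligned with q  -> largest score
              [0.0, 1.0],    # orthogonal to q
              [-1.0, 0.0]])  # opposite to q   -> smallest score
alpha = attention_weights(q, K)
```

Note the max-subtraction trick: it changes nothing mathematically (the constant cancels in the softmax ratio) but avoids overflow in `np.exp`.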

The figure below illustrates the dot product for measuring similarity.

(2) Efficient computation: All query-key dot products can be computed in parallel as a single matrix multiplication QK^\top, which is hardware-friendly.
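To make the batched view concrete, here is a sketch of scaled dot-product attention in NumPy (the \sqrt{d_k} scaling and the variable names are the standard formulation, the random inputs are illustrative): one matmul QK^\top produces every query-key score at once.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for all queries in one shot."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (num_q, num_k) in one matmul
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```

Because the score computation is a dense matrix product, it maps directly onto GPU/TPU matmul units, which is much harder to achieve with additive (MLP-based) scoring.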

