DL0026 Self-Attention vs Cross-Attention

What distinguishes self-attention from cross-attention in transformer models?

Answer

Self-attention allows a sequence to attend to itself, making it powerful for capturing intra-sequence relationships. Cross-attention bridges different sequences, crucial for combining encoder and decoder representations in tasks like machine translation.
Input Scope:
Self-Attention: Query, key, and value all come from the same input sequence
Cross-Attention: Query comes from one sequence, key and value come from a different source

Usage in Transformer Architecture:
Self-Attention: Used in both the encoder and decoder for modeling internal dependencies
Cross-Attention: Used in the decoder to integrate the encoder output
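The two input scopes above can be sketched with PyTorch's `nn.MultiheadAttention`, where the only difference between self- and cross-attention is which tensors are passed as query, key, and value (the tensor sizes here are illustrative):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

encoder_out = torch.randn(1, 10, embed_dim)  # source sequence, length 10
decoder_in = torch.randn(1, 7, embed_dim)    # target sequence, length 7

# Self-attention: query, key, and value all come from the same sequence.
self_out, _ = attn(decoder_in, decoder_in, decoder_in)

# Cross-attention: query from the decoder, key and value from the encoder output.
cross_out, _ = attn(decoder_in, encoder_out, encoder_out)

print(self_out.shape)   # torch.Size([1, 7, 16])
print(cross_out.shape)  # torch.Size([1, 7, 16])
```

Note that the output length always follows the query: in cross-attention the decoder's 7 positions each produce one output vector, even though keys and values come from the 10-token encoder sequence.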

Both mechanisms use the scaled dot-product attention formula:
\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V
Where:
Q, K, and V are the query, key, and value matrices, respectively
d_k is the dimensionality of the key vectors
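The formula above can be implemented directly. Below is a minimal NumPy sketch with a numerically stable softmax; the matrix sizes are illustrative (7 queries attending over 10 key/value pairs, as in cross-attention):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_q, n_k) similarity scores
    # Row-wise softmax; subtracting the row max avoids overflow.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of value rows

rng = np.random.default_rng(0)
Q = rng.standard_normal((7, 8))   # 7 queries, d_k = 8
K = rng.standard_normal((10, 8))  # 10 keys
V = rng.standard_normal((10, 8))  # 10 values

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (7, 8)
```

The same function computes both self- and cross-attention; what changes is only whether Q, K, and V are projections of one sequence or of two different ones.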

The plot below on the left demonstrates self-attention, showing a token's attention to all other tokens within the same sequence. The plot on the right illustrates cross-attention, where tokens from one sequence (the decoder) attend to tokens from another sequence (the encoder).
