DL0040 Attention Mask

What is the role of masking in attention?

Answer

Masking is a critical technique in transformer attention mechanisms that controls which parts of the input sequence the model is allowed to focus on.
(1) Leakage prevention: Blocks access to future tokens in autoregressive decoding to preserve causality.
(2) Padding handling: Excludes pad positions so they don’t absorb probability mass or distort context.
(3) Structured constraint: Enforces task rules (e.g., graph neighborhoods, spans, or blocked regions).
Core equation with mask:
\text{Attn}(Q,K,V,M)=\text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V
Where:
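 Q: query matrix.
 K: key matrix.
 V: value matrix.
 d_k: key dimensionality (used to scale the scores).
 M: additive mask matrix with 0 for allowed positions and −∞ (in practice, often a large negative number) for disallowed positions. Because M is added to the scores before the softmax, disallowed positions receive zero attention weight.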
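
The equation above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single attention head and no batch dimension; the helper name `masked_attention` and the toy shapes are assumptions made for this example, not part of the original answer.

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Softmax(Q K^T / sqrt(d_k) + M) V for one head (hypothetical helper)."""
    d_k = K.shape[-1]
    # Additive mask: 0 leaves a score unchanged, -inf removes it.
    scores = Q @ K.T / np.sqrt(d_k) + M
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 4, 8  # toy sequence length and key dimension (assumed values)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

# Causal mask: -inf strictly above the diagonal, 0 on and below it.
M = np.triu(np.full((n, n), -np.inf), k=1)
out, weights = masked_attention(Q, K, V, M)
```

Each row of `weights` sums to 1 and is exactly zero at the masked positions. Note that a row masked entirely with −∞ would produce NaN in this sketch, which is one reason implementations often use a large finite negative value instead.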

The figure below shows three side-by-side heatmaps: a padding mask that disallows attending to padding tokens, a causal mask that enforces autoregressive decoding, and a structured mask that enforces local neighborhood constraint.
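
The three mask types described for the figure could be built as follows. This is a sketch under stated assumptions: a sequence of 6 tokens whose last two are padding, and a local window of radius 1. Both values and the helper `to_additive` are chosen only for illustration.

```python
import numpy as np

def to_additive(allowed):
    """Convert a boolean 'allowed' matrix to an additive mask (0 or -inf)."""
    return np.where(allowed, 0.0, -np.inf)

n = 6  # assumed sequence length
idx = np.arange(n)

# Padding mask: assume the last two tokens are padding.
# Every query row may attend only to the real (non-pad) key columns.
is_real = np.array([True, True, True, True, False, False])
padding_allowed = np.broadcast_to(is_real, (n, n))

# Causal mask: query i may attend to keys j <= i.
causal_allowed = idx[None, :] <= idx[:, None]

# Structured (local) mask: query i may attend to keys with |i - j| <= window.
window = 1  # assumed neighborhood radius
local_allowed = np.abs(idx[:, None] - idx[None, :]) <= window

padding_mask = to_additive(padding_allowed)
causal_mask = to_additive(causal_allowed)
local_mask = to_additive(local_allowed)
```

Masks expressed this way can be combined by adding them, since a position is allowed only if every mask has 0 there. For example, `padding_mask + causal_mask` gives causal decoding that also ignores padding.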

