Could you explain the concept of hierarchical attention in transformer architectures?
Answer
Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.
Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), the quadratic cost in sequence length becomes prohibitive, and the model ignores the input's natural hierarchical structure.
Hierarchical Attention Idea:
(1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
(2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
This mirrors natural data hierarchies and reduces quadratic cost.
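The two-level idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes sequence length divides evenly into segments, uses mean pooling as the segment aggregator, and omits the learned query/key/value projections and multiple heads that a real transformer would use.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (projections omitted for brevity).
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def hierarchical_attention(tokens, seg_len):
    # tokens: (n_tokens, d); assumes n_tokens is divisible by seg_len.
    n, d = tokens.shape
    segs = tokens.reshape(n // seg_len, seg_len, d)

    # (1) Local level: attention within each segment independently.
    local = attention(segs, segs, segs)            # (n_segs, seg_len, d)

    # Aggregate each segment into one vector (mean pooling is an
    # assumption here; learned pooling or a [CLS] token is also common).
    seg_repr = local.mean(axis=1)                  # (n_segs, d)

    # (2) Global level: attention across segment representations.
    global_out = attention(seg_repr, seg_repr, seg_repr)  # (n_segs, d)
    return local.reshape(n, d), global_out

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 8))                  # 12 tokens, dim 8
local_out, global_out = hierarchical_attention(x, seg_len=4)
print(local_out.shape, global_out.shape)      # (12, 8) (3, 8)
```

Note the cost structure: local attention scales as O(n·s) and global attention as O((n/s)²) for n tokens and segment length s, versus O(n²) for flat attention over the full sequence.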
The figure below shows hierarchical attention applied to the document classification use case.