Could you explain the concept of hierarchical attention in transformer architectures?
Answer
Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.
Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), the quadratic cost in sequence length becomes prohibitive, and the model ignores the input's natural hierarchical structure.
Hierarchical Attention Idea:
(1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
(2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
This mirrors natural data hierarchies and reduces quadratic cost.
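The two-level idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes sequence length divides evenly into segments, uses mean pooling as the segment aggregator, and omits the learned query/key/value projections and multiple heads that a real transformer would use.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (projections omitted for brevity).
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def hierarchical_attention(tokens, seg_len):
    # tokens: (n_tokens, d); assumes n_tokens is divisible by seg_len.
    n, d = tokens.shape
    segs = tokens.reshape(n // seg_len, seg_len, d)

    # (1) Local level: attention within each segment independently.
    local = attention(segs, segs, segs)            # (n_segs, seg_len, d)

    # Aggregate each segment into one vector (mean pooling is an
    # assumption here; learned pooling or a [CLS] token is also common).
    seg_repr = local.mean(axis=1)                  # (n_segs, d)

    # (2) Global level: attention across segment representations.
    global_out = attention(seg_repr, seg_repr, seg_repr)  # (n_segs, d)
    return local.reshape(n, d), global_out

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 8))                  # 12 tokens, dim 8
local_out, global_out = hierarchical_attention(x, seg_len=4)
print(local_out.shape, global_out.shape)      # (12, 8) (3, 8)
```

Note the cost structure: local attention scales as O(n·s) and global attention as O((n/s)²) for n tokens and segment length s, versus O(n²) for flat attention over the full sequence.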
The figure below shows hierarchical attention applied to the document classification use case.