MSD0004 Long Document Attention Scalability

The standard Transformer’s self-attention mechanism has computational and memory complexity of O(N^2), where N is the sequence length. For long document classification (e.g., documents of thousands of tokens), this quadratic scaling becomes prohibitive.

Describe one or more attention modifications you would design or choose to enable efficient and effective long document classification.

Answer

To handle long documents, the quadratic complexity of full self-attention (O(N^2)) must be reduced. The two primary approaches are Sparse Attention (e.g., Longformer, BigBird) and Hierarchical Attention (e.g., HATN).
Sparse attention constrains each token to attend only to a limited, relevant subset of tokens (a local window plus a few global tokens), while hierarchical attention segments the document and applies attention at both the segment level and the document level.
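To make "prohibitive" concrete, here is a back-of-the-envelope sketch of the memory needed for a single full attention score matrix (one head, fp32); the function name and byte counts are illustrative assumptions, not from any specific library:

```python
# Rough memory footprint of one N x N attention score matrix in fp32 (4 bytes/float).
# Hypothetical helper for illustration only.
def attn_matrix_mb(n_tokens: int) -> float:
    return n_tokens * n_tokens * 4 / 1024**2

for n in (512, 4096, 32768):
    print(f"N={n}: {attn_matrix_mb(n):.0f} MB")
```

Going from 512 to 32,768 tokens (64x longer) multiplies the score matrix by 4096x, which is exactly the quadratic blow-up sparse and hierarchical attention are designed to avoid.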
(1) Sparse Attention (Mechanism):
Replaces the full attention matrix with a sparse design.
Local Window Attention: Each token attends to its immediate neighbors, which is crucial for local context.
Global Attention: A few special tokens (e.g., [CLS]) act as global connectors that exchange information with all tokens, preserving long-range dependencies. (Models: Longformer, BigBird).
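The combined pattern can be sketched as a boolean mask over the N x N attention matrix. This is a minimal illustration of the idea (the function and its signature are hypothetical); Longformer and BigBird build comparable masks internally with additional tricks such as blocking and random attention:

```python
import numpy as np

def sparse_attention_mask(n: int, window: int, global_idx: list[int]) -> np.ndarray:
    """Boolean (n, n) mask: True where attention is allowed.
    Combines a sliding local window with a few global tokens."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True            # local window: token i sees its neighbors
    for g in global_idx:
        mask[g, :] = True                # global token attends to every token
        mask[:, g] = True                # and every token attends to it
    return mask

# Toy example: 8 tokens, window of 1 on each side, token 0 (e.g., [CLS]) global.
mask = sparse_attention_mask(8, window=1, global_idx=[0])
print(mask.sum(), "allowed pairs out of", 8 * 8)
```

Because the window size and the number of global tokens are constants, the number of allowed pairs grows linearly in N instead of quadratically.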

Here is a side-by-side comparison of Sliding Window Attention and Global Attention:

|  | Sliding Window Attention | Global Attention |
| --- | --- | --- |
| Scope | Each token attends to a fixed window of w neighbors | A few designated tokens (e.g., [CLS]) attend to, and are attended by, all tokens |
| Cost | O(N·w) | O(N·g) for g global tokens |
| Role | Captures local context | Preserves long-range dependencies; aggregates information for classification |

(2) Hierarchical Attention (Structure):
Splits the long document into smaller, manageable segments (sentences or paragraphs).
Segment-level: Applies a standard Transformer token-level attention within each segment.
Document-level: Applies a separate document attention over the segment-level representations (e.g., the [CLS] token of each segment) to capture global dependencies. (Models: HATN, LNLF-BERT).
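The two levels can be sketched as follows. This is a deliberately simplified stand-in, not HATN's actual architecture: `encode_segment` uses mean pooling where a real model would run a token-level Transformer and return its [CLS] embedding, and `document_attention` is a bare softmax-weighted pooling over segment vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_segment(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a token-level Transformer encoder: one vector per segment
    (mean pooling here; a real model would return its [CLS] embedding)."""
    return tokens.mean(axis=0)

def document_attention(seg_vecs: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Document-level attention over segment representations (softmax sketch)."""
    scores = seg_vecs @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ seg_vecs

# Hypothetical shapes: 5 segments of 128 tokens each, hidden size 64.
segments = [rng.normal(size=(128, 64)) for _ in range(5)]
seg_vecs = np.stack([encode_segment(s) for s in segments])   # (5, 64)
doc_vec = document_attention(seg_vecs, query=rng.normal(size=64))
print(doc_vec.shape)   # single document vector fed to the classifier head
```

Each segment-level attention is quadratic only in the segment length, and the document-level attention is quadratic only in the number of segments, so neither level ever materializes an N x N matrix over the full document.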

The figure below shows Hierarchical Attention applied to the document classification use case.

