Tag: Model

  • MSD0004 Long Document Attention Scalability

    The standard Transformer’s self-attention mechanism has a computational and memory complexity of O(N^2), where N is the sequence length. For long document classification (e.g., thousands of tokens), this quadratic scaling becomes prohibitive.

    Describe one or more attention modifications you would design or choose to enable efficient and effective long document classification.

    Answer

    To handle long documents, the quadratic complexity of full self-attention (O(N^2)) must be reduced. The primary approaches involve Sparse Attention (like Longformer or BigBird) and Hierarchical Attention (like HATN).
    Sparse attention constrains each token to only attend to a limited, relevant subset of tokens (local window and global tokens), while hierarchical attention segments the document and applies attention at both the sentence/segment level and the document level.
    (1) Sparse Attention (Mechanism):
    Replaces the full attention matrix with a sparse design.
    Local Window Attention: Each token attends to its immediate neighbors, which is crucial for local context.
    Global Attention: A few special tokens (e.g., [CLS]) act as global connectors that exchange information with all tokens, preserving long-range dependencies. (Models: Longformer, BigBird).

    Here is a side-by-side comparison of Global Attention and Sliding Window Attention:
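    To make the sparse pattern concrete, here is a minimal NumPy sketch that builds a combined local-window + global-token attention mask. The helper name, window size, and global indices are illustrative assumptions, not taken from any specific model implementation:

    ```python
    import numpy as np

    def sparse_attention_mask(seq_len, window, global_idx):
        """Boolean mask: True where attention is allowed.

        Combines a local sliding window with a few global tokens
        (hypothetical helper; window size and indices are illustrative).
        """
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        # Local window: each token attends to neighbors within `window`.
        for i in range(seq_len):
            lo, hi = max(0, i - window), min(seq_len, i + window + 1)
            mask[i, lo:hi] = True
        # Global tokens attend to everything and are attended by everyone.
        for g in global_idx:
            mask[g, :] = True
            mask[:, g] = True
        return mask

    mask = sparse_attention_mask(seq_len=512, window=4, global_idx=[0])
    # The number of attended pairs grows linearly in N, far below N^2.
    print(mask.sum(), 512 * 512)
    ```

    Applying this mask (e.g., by setting disallowed attention logits to -inf before the softmax) reduces the per-layer cost from O(N^2) to roughly O(N * window).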

    (2) Hierarchical Attention (Structure):
    Splits the long document into smaller, manageable segments (sentences or paragraphs).
    Segment-level: Applies a standard Transformer token-level attention within each segment.
    Document-level: Applies a separate document attention over the segment-level representations (e.g., the [CLS] token of each segment) to capture global dependencies. (Models: HATN, LNLF-BERT).

    The figure below shows Hierarchical Attention used in the document classification use case.
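    The two-level idea can be sketched in a few lines of NumPy: pool tokens into segment vectors, then pool the segment vectors into a single document vector. Attention pooling here stands in for the full segment- and document-level Transformers; all weights and dimensions are illustrative assumptions:

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attn_pool(x, w):
        # Attention pooling: score each row, softmax, weighted sum.
        alpha = softmax(x @ w)           # (n,) weights over rows
        return alpha @ x                 # (d,) pooled vector

    def hierarchical_encode(tokens, seg_len, w_seg, w_doc):
        """Two-level encoding: pool tokens within each segment
        (segment level), then pool the segment vectors into one
        document representation (document level)."""
        n, d = tokens.shape
        segs = [tokens[i:i + seg_len] for i in range(0, n, seg_len)]
        seg_vecs = np.stack([attn_pool(s, w_seg) for s in segs])
        return attn_pool(seg_vecs, w_doc)

    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((4096, 64))   # long doc, 64-dim embeddings
    doc_vec = hierarchical_encode(tokens, 512,
                                  rng.standard_normal(64),
                                  rng.standard_normal(64))
    print(doc_vec.shape)  # (64,)
    ```

    Because attention is only computed within each 512-token segment and then over the 8 segment vectors, no single attention operation sees the full 4096-token sequence.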


  • MSD0002 Image to Video Classification

    You are given a pretrained image classification network, such as ResNet. How would you adapt it to perform video classification, ensuring that both spatial and temporal information are captured?

    Please discuss possible architectural modifications and trade-offs between different approaches.

    Answer

    Adding temporal modeling is essential when adapting an image-classification CNN for video classification. Options include:
    (1) 3D CNNs (C3D/I3D):
    Extend 2D convolutions to 3D so the network learns motion directly from stacks of frames. (Example: I3D inflates pretrained 2D ResNet filters into 3D filters, reusing ImageNet weights.)
    Pros: Superior capability to capture fine-grained motion and spatio-temporal features.
    Cons: High computational cost and high demand for video training data (the data demand can be mitigated by I3D-style inflation).
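    The inflation trick itself is simple. Below is a minimal NumPy sketch (function name and shapes are illustrative): a pretrained 2D kernel is repeated along a new temporal axis and divided by the temporal extent, so a "video" of identical frames initially produces the same activations as the original image model:

    ```python
    import numpy as np

    def inflate_2d_kernel(w2d, t):
        """I3D-style inflation: repeat a 2D conv kernel of shape
        (out, in, kh, kw) t times along a new time axis, giving
        (out, in, t, kh, kw), and divide by t to preserve the
        response on temporally constant input."""
        return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

    w2d = np.arange(4 * 3 * 3 * 3, dtype=float).reshape(4, 3, 3, 3)
    w3d = inflate_2d_kernel(w2d, t=5)
    print(w3d.shape)  # (4, 3, 5, 3, 3)
    ```

    Summing the inflated kernel over its time axis recovers the original 2D kernel exactly, which is what makes the pretrained weights a sensible initialization.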

    (2) Combining frame-level CNN features with RNN/LSTM/TCN/Transformer:
    Use a CNN for spatial feature extraction and a sequence model for temporal modeling over the extracted per-frame features.
    Pre-training the CNN for image classification on the target domain may further improve performance.
    Pros: Leverages powerful 2D CNN pre-training easily, lower computational cost. Flexible, handles variable-length sequences and can be good at modeling long-term sequence dependencies.
    Cons: Less effective at modeling local and subtle motion cues without a further specialized temporal-modeling design.

    The figure below illustrates the CNN combined with RNN/LSTM/TCN/Transformer modeling process.
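    The pipeline above can be sketched with a tiny vanilla RNN in NumPy. In practice the per-frame features would come from a pretrained 2D CNN such as ResNet and the recurrent cell would be an LSTM or Transformer; the shapes and weights here are illustrative assumptions:

    ```python
    import numpy as np

    def rnn_over_frames(feats, w_xh, w_hh, b):
        """Minimal vanilla RNN over per-frame CNN features.
        feats: (T, d_in) sequence of frame-level feature vectors."""
        h = np.zeros(w_hh.shape[0])
        for x in feats:                        # iterate over time steps
            h = np.tanh(x @ w_xh + h @ w_hh + b)
        return h                               # final state -> classifier head

    rng = np.random.default_rng(1)
    T, d_in, d_h = 16, 128, 32                 # 16 frames, 128-dim features
    feats = rng.standard_normal((T, d_in))     # stand-in for CNN outputs
    h = rnn_over_frames(feats,
                        rng.standard_normal((d_in, d_h)) * 0.1,
                        rng.standard_normal((d_h, d_h)) * 0.1,
                        np.zeros(d_h))
    print(h.shape)  # (32,)
    ```

    Only the (cheap) recurrent part sees the time dimension, which is why this design reuses 2D CNN pretraining so easily and handles variable-length clips.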

    (3) Temporal pooling/attention:
    Simple frame aggregation with average/max pooling or attention.
    Pros: Lightweight, efficient. Useful when frame order is less critical or resources are limited.
    Cons: May lose fine-grained motion cues.
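    Attention-based temporal pooling can be sketched in a few lines of NumPy (the scoring vector is an illustrative stand-in for a learned parameter): each frame feature gets a scalar score, the scores are softmaxed into weights, and the frames are averaged with those weights.

    ```python
    import numpy as np

    def temporal_attention_pool(feats, w):
        """Score each frame, softmax the scores, weighted-average the
        frames. Lightweight and order-agnostic pooling over time."""
        scores = feats @ w                     # one scalar per frame
        e = np.exp(scores - scores.max())
        alpha = e / e.sum()                    # attention weights, sum to 1
        return alpha @ feats                   # pooled clip-level feature

    rng = np.random.default_rng(2)
    feats = rng.standard_normal((16, 128))     # 16 frames, 128-dim features
    pooled = temporal_attention_pool(feats, rng.standard_normal(128))
    print(pooled.shape)  # (128,)
    ```

    Replacing the softmax weights with uniform weights recovers plain average pooling, which makes explicit why this family of methods discards frame ordering.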

