Interview for Machine Learning

Category: Medium

DL0054 Deformable Attention
What is Deformable Attention and how does it reduce computational complexity for object detection tasks?
Answer
Deformable Attention is a sparse attention mechanism that learns dynamic sampling locations instead of attending to all spatial positions uniformly. It uses learned 2D offsets from reference points to sample only the most relevant features, reducing complexity from $O(N^2)$ to $O(NK)$, where $N$ is the number of spatial positions and $K$ is a small constant (typically 4). This makes it well suited to high-resolution feature maps in object detection, where full attention is computationally prohibitive: on a 1024×1024 feature map ($N \approx 10^6$), each query in standard attention scores ~1M keys per head, while a deformable-attention query samples only $K = 4$ locations.

Figure 1: Deformable attention learns K=4 sampling offsets per query point instead of dense N×N attention
(1) Sparse Sampling: Instead of scoring all $N \times N$ query-key pairs, deformable attention samples only K points per query, typically K=4 or 8, reducing each query's key-value set from N to K.
(2) Learned Offsets: The model predicts the offsets $\Delta p_{mk}$ around each query's reference position using a lightweight linear layer on the query features, requiring only $O(NC)$ additional computation for fixed $M$ and $K$, where C is the channel dimension.
(3) Bilinear Interpolation: When offsets point to non-integer locations, bilinear interpolation computes feature values from the 4 nearest pixels, enabling sub-pixel precision sampling without modifying the feature map.

Figure 2: Complexity comparison shows O(NK) grows linearly while O(N²) becomes prohibitive for large feature maps
Mathematical Formulation:
$y(p) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mk} \cdot W_m' x(p + p_k + \Delta p_{mk}) \right]$
Where:
$p$ is the reference position (query location on the feature map)
$M$ is the number of attention heads
$K$ is the number of sampled keys per head (typically 4)
$p_k$ are fixed reference offsets (uniformly initialized)
$\Delta p_{mk}$ are learned deformable offsets (2D, predicted per head per key)
$A_{mk}$ is the attention weight (normalized, not from softmax over all positions)
$W_m, W_m'$ are projection matrices for each head
The offsets $\Delta p_{mk}$ are predicted by a linear projection from query features: $\Delta p_{mk} = W_\text{offset} \cdot q_m(p)$, where $W_\text{offset} \in \mathbb{R}^{2K \times C}$ maps the $C$-dimensional query to $K$ two-dimensional offsets. The attention weights $A_{mk}$ are computed via a separate softmax over only K elements, not the full N positions. In Deformable DETR, multi-scale deformable attention extends this to sample across multiple feature map resolutions simultaneously, enabling the model to capture both small and large objects efficiently.
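The sampling step can be sketched in NumPy for a single head and a single query. This is only an illustration under stated assumptions: the weight matrices `W_offset`, `W_attn`, and `W_value` are hypothetical random stand-ins for learned parameters, the fixed offsets $p_k$ are treated as zero, and out-of-range sampling locations are simply clamped to the feature map border.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat (H, W, C) at a fractional (y, x) from its 4 nearest pixels."""
    H, W, _ = feat.shape
    y = float(np.clip(y, 0, H - 1))   # clamp to the map (an assumption of this sketch)
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deformable_attention_single_head(feat, query, ref_point, W_offset, W_attn, W_value):
    """One head, one query: sample K points around ref_point and mix them.

    feat: (H, W, C); query: (C,); ref_point: array (y, x)
    W_offset: (2K, C); W_attn: (K, C); W_value: (C, C)  (hypothetical weights)
    """
    K = W_attn.shape[0]
    offsets = (W_offset @ query).reshape(K, 2)   # learned 2D offsets, Delta p_k
    weights = softmax(W_attn @ query)            # softmax over K elements only
    out = np.zeros(W_value.shape[0])
    for k in range(K):
        y, x = ref_point + offsets[k]
        out += weights[k] * (W_value @ bilinear_sample(feat, y, x))
    return out, weights

rng = np.random.default_rng(0)
H, W, C, K = 16, 16, 8, 4
feat = rng.normal(size=(H, W, C))
out, weights = deformable_attention_single_head(
    feat, rng.normal(size=C), np.array([7.5, 7.5]),
    rng.normal(size=(2 * K, C)), rng.normal(size=(K, C)), rng.normal(size=(C, C)))
```

Each query touches only K sampled locations, so the work per query is independent of the feature map size.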
July 13, 2026
DL0053 Gated Attention
What is Gated Attention and how does it improve transformer architectures over standard scaled dot-product attention?
Answer
Gated Attention (arXiv:2505.06708) applies a head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA) to dynamically modulate attention output. Unlike standard attention where all heads contribute equally, gated attention introduces query-dependent sparse gating that suppresses irrelevant heads and activates only salient ones. This mitigates the attention sink problem where standard transformers concentrate disproportionate attention on the first few tokens, and enhances long-context extrapolation by maintaining diverse attention patterns across sequence lengths.

Figure 1: Gated attention applies a head-specific sigmoid gate after SDPA to modulate attention output before the residual connection
(1) Post-SDPA Gating: The gate is applied after the SDPA computation, not before. Each attention head output $h_i$ is multiplied by a sigmoid gate $g_i = \sigma(W_g^{(i)} \cdot q_i + b_g^{(i)})$, where $W_g^{(i)}$ is a head-specific projection.
(2) Sparsity Induction: The sigmoid gate produces values in $[0, 1]$, and empirical measurements show a mean gate activation of ~0.116, meaning most head outputs are heavily suppressed. This introduces beneficial sparsity without hard pruning.
(3) Attention Sink Mitigation: Standard attention allocates ~46.7% of attention mass to the first token; gated attention reduces this to ~4.8%, distributing attention more uniformly across tokens.

Figure 2: Gate activation distribution across 8 attention heads shows heavy suppression (mean ~0.116) with sparse high-activation regions
Mathematical Formulation:
$\text{GatedAttn}(Q, K, V) = \text{Concat}(g_1 \odot h_1, \ldots, g_H \odot h_H) W_O$
$g_i = \sigma(W_g^{(i)} \cdot q_i + b_g^{(i)})$
Where:
$h_i = \text{SDPA}(q_i, k_i, v_i)$ is the output of the $i$-th attention head
$g_i \in [0, 1]$ is the head-specific gate score
$W_g^{(i)} \in \mathbb{R}^{d_k \times 1}$ is a learned projection from query to scalar gate
$\sigma$ is the sigmoid function
$\odot$ denotes element-wise multiplication
$W_O$ is the standard output projection
The gate projection $W_g^{(i)}$ adds only $O(d_k)$ parameters per head, a negligible overhead of ~0.1% of total model parameters, yet significantly improves long-context performance. On the RULER benchmark at 128K context length, gated attention improves needle-in-haystack retrieval accuracy from ~72% to ~94% compared to standard attention.
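The formulation above can be sketched in NumPy as follows. This is a minimal illustration, assuming the gate is a scalar per head and per query token computed from the query as in the equation for $g_i$; the array shapes and the names `gated_attention` and `sdpa` are choices made for this sketch, not taken from a reference implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(q, k, v):
    """Scaled dot-product attention for one head; q, k, v: (n, d_k)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_attention(Q, K, V, W_g, b_g, W_O):
    """Q, K, V: (H, n, d_k); W_g: (H, d_k); b_g: (H,); W_O: (H * d_k, d_model)."""
    H = Q.shape[0]
    heads = []
    for i in range(H):
        h_i = sdpa(Q[i], K[i], V[i])                            # (n, d_k)
        g_i = 1.0 / (1.0 + np.exp(-(Q[i] @ W_g[i] + b_g[i])))   # (n,), sigmoid gate
        heads.append(g_i[:, None] * h_i)                        # gate scales the head output
    return np.concatenate(heads, axis=-1) @ W_O                 # Concat(...) W_O
```

With a strongly positive bias the gates saturate at 1 and the function reduces to standard multi-head attention; with a strongly negative bias every head is suppressed.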
July 13, 2026
DL0052 Rotary Positional Embedding
What is Rotary Positional Embedding (RoPE)?
Answer
Rotary Positional Embedding (RoPE) is a positional encoding method that rotates query and key vectors in multi‑head attention by position‑dependent angles. This rotation naturally encodes relative positional information, improves generalization to longer contexts, and avoids the limitations of fixed or learned absolute positional embeddings. It is used in GPT-NeoX, LLaMA, PaLM, Qwen, etc.
It has the following characteristics:
(1) Relative position encoding method for Transformers
(2) Applies rotation to query (Q) and key (K) vectors using position-dependent angles
(3) Encodes position via geometry, not by adding vectors
(4) Preserves relative distance naturally in dot-product attention
(5) Extrapolates well to longer sequences than the training length
RoPE rotates each 2D pair of hidden dimensions:
$f(x, m)=\begin{pmatrix}\cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta)\end{pmatrix}\begin{pmatrix}x_1 \\x_2\end{pmatrix}$
Where:
$m$ represents the absolute position of the token in the sequence.
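$\theta$ represents the rotation frequency of the 2D pair. RoPE assigns the $i$-th pair its own frequency, commonly $\theta_i = 10000^{-2i/d}$ for hidden dimension $d$, so the rotation angle at position $m$ is $m\theta_i$.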
$x_1, x_2$ represent the components of the embedding vector.
The plot below visualizes how RoPE makes attention decay smoothly with relative distance, while standard sinusoidal PE reflects absolute position similarity.
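The rotation can be checked numerically with a short NumPy sketch. It assumes the commonly used frequency schedule $\theta_i = 10000^{-2i/d}$ and pairs up consecutive dimensions; the function name `rope` is a choice made for this example.

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate each consecutive 2D pair of x (length d, d even) by the angle m * theta_i."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair
    angle = m * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_near_start = rope(q, 5) @ rope(k, 3)     # positions 5 and 3
score_shifted = rope(q, 12) @ rope(k, 10)      # positions 12 and 10, same offset
```

The two scores agree up to floating-point error because the dot product of two rotated vectors depends only on the difference of their positions, which is how RoPE encodes relative position.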
January 2, 2026
DL0051 Sparsity in NN
Explain the concept of “Sparsity” in neural networks.
Answer
Sparsity in neural networks refers to the property that many parameters (weights) or activations are exactly zero (or very close to zero).
This leads to lighter, faster, and more interpretable models. Techniques such as L1 regularization, pruning, and ReLU activations help enforce sparsity, making networks more efficient, often with little loss in performance.
Common techniques and their equations:
(1) L1 Regularization (encourages sparse weights)
$L = L_{\text{task}} + \lambda \sum_i |w_i|$
Where:
$w_i$ represents the i-th model weight
$\lambda$ controls the strength of sparsity
(2) ReLU Activation (induces sparse activations)
$\mathrm{ReLU}(x) = \max(0, x)$
Where:
$x$ is the neuron input.
The plot below shows the weight distributions of a model trained without L1 regularization and one trained with L1-induced sparsity.
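Both equations are easy to evaluate directly. The sketch below is only illustrative: the helper names `l1_loss`, `relu`, and `sparsity` and the toy numbers are assumptions made for this example.

```python
import numpy as np

def l1_loss(task_loss, weights, lam):
    """L = L_task + lambda * sum_i |w_i|"""
    return task_loss + lam * np.sum(np.abs(weights))

def relu(x):
    return np.maximum(0.0, x)

def sparsity(a, tol=0.0):
    """Fraction of entries whose magnitude is at most tol."""
    return float(np.mean(np.abs(a) <= tol))

w = np.array([0.5, -0.25, 0.0, 1.0])
total = l1_loss(1.0, w, lam=0.5)               # 1.0 + 0.5 * 1.75 = 1.875
acts = relu(np.array([-1.0, 2.0, -3.0, 4.0]))  # negative inputs become exactly 0
act_sparsity = sparsity(acts)                  # 0.5: half of the activations are zero
```

The L1 term grows with every non-zero weight, so minimizing the total loss pushes small weights toward zero, while ReLU produces exact zeros in the activations for all negative inputs.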
December 30, 2025
DL0050 Knowledge Distillation
Describe the process and benefits of knowledge distillation.
Answer
Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.
Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).
Soft Targets: The student is trained not only on hard labels (one-hot) but also on the soft output probabilities of the teacher.
Temperature Scaling: Teacher logits are softened using a temperature $T$ to reveal more information about class similarities:
$\mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}$
Where:
$z_i$: Raw score (logit) for the i-th class.
$K$: Total number of classes in the classification problem.
$T$: Temperature parameter (>0) used to soften the probabilities. Higher $T$ produces a smoother distribution, revealing relationships between classes (“dark knowledge”).
The plot below shows the softmax probabilities for a fixed set of teacher logits under three different temperatures. Increasing the temperature smooths the distribution.
Loss Function: Typically combines distillation loss (difference between teacher and student soft outputs) and standard cross-entropy loss with true labels.
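One common way to write this combined loss is sketched below. The mixing weight `alpha`, the $T^2$ scaling of the distillation term, and the use of KL divergence are conventional choices assumed here for illustration; they are not specified in the answer above.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(labels, student)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1).mean()
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels]).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

teacher = np.array([[2.0, 1.0, 0.1]])
student = np.array([[0.1, 1.0, 2.0]])
labels = np.array([0])
loss = kd_loss(student, teacher, labels)
```

The distillation term vanishes when the student reproduces the teacher's softened distribution, and the cross-entropy term keeps the student anchored to the true labels.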
Key Benefits of KD:
(1) Model compression: the student is smaller and faster while retaining much of the teacher’s performance, enabling deployment on resource-constrained devices.
(2) Inference Speed: Significantly decreases latency, making the model suitable for deployment on edge devices or real-time applications.
(3) Improved Generalization: The Teacher’s smooth soft targets act as a form of powerful regularization, often leading the Student to generalize better than if it were trained only on hard labels.
The plot below demonstrates the Knowledge Distillation (KD) process.
October 28, 2025
DL0047 Focal Loss II
Please compare focal loss and weighted cross-entropy.
Answer
Weighted Cross-Entropy (WCE) rescales the loss by class to correct prior imbalance and is simple and robust to noisy labels. Focal Loss (FL) multiplies cross-entropy by a difficulty-dependent factor $(1 - p_t)^\gamma$ to suppress easy-example gradients and focus learning on hard examples. This makes it preferable when many easy negatives overwhelm training, but it requires careful tuning to avoid amplifying label noise.
$\text{WeightedCE}(p_t) = -\alpha_t \log(p_t)$
Where:
$p_t$ is the model probability for the ground-truth class;
$\alpha_t$ is the per-class weight for class t.
$\text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$
Where:
$p_t$ is the model probability for the ground-truth class;
$\alpha_t$ is the optional per-class weight for class t;
$\gamma \ge 0$ is the focusing parameter that down-weights easy examples.
Here is a table to compare focal loss and weighted cross-entropy.
The figure below compares Cross-Entropy, Weighted Cross-Entropy, and Focal Loss.
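The two formulas can be compared numerically on one easy and one hard example. This is a small sketch; the function names and the probabilities 0.9 and 0.1 are chosen only for illustration.

```python
import numpy as np

def weighted_ce(p_t, alpha_t=1.0):
    """WeightedCE(p_t) = -alpha_t * log(p_t)"""
    return -alpha_t * np.log(p_t)

def focal_loss(p_t, alpha_t=1.0, gamma=2.0):
    """FocalLoss(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)"""
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

easy, hard = 0.9, 0.1   # model probability of the ground-truth class
easy_ratio = focal_loss(easy) / weighted_ce(easy)   # (1 - 0.9)^2, about 0.01
hard_ratio = focal_loss(hard) / weighted_ce(hard)   # (1 - 0.1)^2, about 0.81
```

With $\gamma = 2$, the easy example keeps about 1% of its cross-entropy loss while the hard example keeps about 81%, which is the down-weighting effect described above. Setting $\gamma = 0$ recovers weighted cross-entropy.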
October 27, 2025
DL0045 Dimension in FFN
In Transformers, why does the feed-forward network expand the hidden dimension (e.g., $d_{\text{model}}$ → $4 d_{\text{model}}$ ) before reducing it back?
Answer
The feed-forward network in Transformers expands the hidden dimension (e.g., $d_{\text{model}} \to 4 \cdot d_{\text{model}}$) to enhance the model’s ability to learn complex, non-linear feature interactions, then reduces it back to maintain compatibility with other layers. This expand-then-project design (an inverted bottleneck, since the hidden layer is wider than the input and output) balances expressiveness and efficiency, and has been empirically shown to boost performance in large-scale models.
(1) Extra Capacity: Expanding from $d_{\text{model}}$ to $4d_{\text{model}}$ allows the FFN to capture richer nonlinear transformations.
(2) Non‑linear mixing: The intermediate expansion allows the activation function (ReLU, GeLU, SwiGLU, etc.) to operate in a richer space, capturing more complex patterns.
(3) Projection back ensures compatibility: Reducing the dimension back to $d_{\text{model}}$ lets the FFN output be added to the residual stream and keeps layer widths uniform across the stack.
The equation and the figure below show the architecture of FFN:
$\text{FFN}(x) = W_2 \sigma(W_1 x + b_1) + b_2$
Where:
$x \in \mathbb{R}^{d_{\text{model}}}$ is the input vector.
$W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}}$ expands the dimension.
$\sigma$ is a non-linear activation (e.g., ReLU/GELU).
$W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}}$ projects back down.
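The equation maps directly to a few lines of NumPy. The sketch below assumes GELU (tanh approximation) as the activation and random weights purely for shape checking; `ffn` and `gelu` are names chosen for this example.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 * sigma(W1 x + b1) + b2"""
    hidden = gelu(W1 @ x + b1)   # expanded to 4 * d_model
    return W2 @ hidden + b2      # projected back to d_model

rng = np.random.default_rng(0)
d_model = 16
W1, b1 = rng.normal(size=(4 * d_model, d_model)), np.zeros(4 * d_model)
W2, b2 = rng.normal(size=(d_model, 4 * d_model)), np.zeros(d_model)
x = rng.normal(size=d_model)
y = x + ffn(x, W1, b1, W2, b2)   # the residual add works because shapes match
```

The two weight matrices together hold $8 d_{\text{model}}^2$ parameters, which is why the FFN typically accounts for a large share of a Transformer layer's parameters.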
September 24, 2025
DL0044 Multi-Query Attention
What is Multi-Query Attention in transformer models?
Answer
Multi-Query Attention (MQA) optimizes the standard Multi-Head Attention (MHA) in transformers by using multiple query heads while sharing a single key-value projection across them. This design keeps expressiveness close to MHA while significantly reducing memory usage during inference, especially for KV caching in autoregressive decoding, which makes it attractive for scaling large models. It trades a small potential loss in quality for efficiency.
Comparison to Multi-Head Attention (MHA): In standard MHA, each attention head has independent projections for queries (Q), keys (K), and values (V). In MQA, only Q is projected into multiple heads, while K and V use a single projection shared across all query heads, as shown in the figure below.
Efficiency Benefits: Reduces memory footprint during inference, particularly with KV caching, as the cache stores only one set of K and V vectors instead of one per head, lowering memory complexity from O(n * h * d) to O(n * d), where n is sequence length, h is number of heads, and d is head dimension.
The core attention operation for a single head in MQA can be represented by the following equation:
$\mbox{Attention}(Q_i, K_{shared}, V_{shared}) = \mbox{Softmax}(\frac{Q_i K_{shared}^T}{\sqrt{d_k}}) V_{shared}$
Where:
$Q_i$ represents the query matrix for the $i$-th attention head.
$K_{shared}$ and $V_{shared}$ represent the single, shared key and value matrices used by all heads.
$d_k$ is the dimension of the key vectors.
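A minimal NumPy sketch of the operation, together with a KV-cache size count, is given below. The weight shapes, the omission of masking and of the output projection, and the helper `kv_cache_entries` are assumptions made for this illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(X, W_q, W_k, W_v, n_heads):
    """X: (n, d_model); W_q: (d_model, n_heads * d_k); W_k, W_v: (d_model, d_k)."""
    n = X.shape[0]
    d_k = W_k.shape[1]
    Q = (X @ W_q).reshape(n, n_heads, d_k)   # one query projection per head
    K_shared = X @ W_k                       # single key projection for all heads
    V_shared = X @ W_v                       # single value projection for all heads
    outs = []
    for i in range(n_heads):
        scores = Q[:, i, :] @ K_shared.T / np.sqrt(d_k)
        outs.append(softmax(scores) @ V_shared)
    return np.concatenate(outs, axis=-1)     # (n, n_heads * d_k)

def kv_cache_entries(n, n_kv_heads, d_k):
    """Number of cached scalars for keys plus values."""
    return 2 * n * n_kv_heads * d_k

mha_cache = kv_cache_entries(n=1024, n_kv_heads=8, d_k=64)
mqa_cache = kv_cache_entries(n=1024, n_kv_heads=1, d_k=64)
```

In this configuration the MQA cache is 8 times smaller than the MHA cache, matching the reduction from O(n * h * d) to O(n * d).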
September 14, 2025
DL0042 Attention Computation
Please break down the computational cost of attention.
Answer
Here is the breakdown of the computational cost of attention:
$\mathcal{O}(n^2 \cdot d + n \cdot d^2)$
Input dimensions: Sequence length = $n$, hidden dimension = $d$, number of heads = $h$.
(1) Linear projections (Q, K, V):
Each input $X \in \mathbb{R}^{n \times d}$ is projected into queries, keys, and values.
Cost: $\mathcal{O}(n \cdot d^2)$ (for all 3 matrices).
(2) Attention score computation (QKᵀ):
Queries: $Q \in \mathbb{R}^{n \times d_k}$
Keys: $K \in \mathbb{R}^{n \times d_k}$
Score matrix:
$S = QK^\top \in \mathbb{R}^{n \times n}$
Cost: $\mathcal{O}(n^2 \cdot d_k)$
(3) Softmax normalization:
For each row of the score matrix:
$\mbox{Softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}$
Where:
$s_i$ = raw score for position $i$
$n$ = total sequence length
Cost: $\mathcal{O}(n^2)$
(4) Weighted sum with values (AV):
Attention weights $A \in \mathbb{R}^{n \times n}$ applied to values $V \in \mathbb{R}^{n \times d_v}$ :
$O = AV$
Cost: $\mathcal{O}(n^2 \cdot d_v)$
(5) Output projection:
Final linear layer to mix heads back to $d$ .
Cost: $\mathcal{O}(n \cdot d^2)$
Total Complexity:
Steps (2) to (4) are per-head costs. Summing them over $h$ heads and adding the projections gives:
$\mathcal{O}(n \cdot d^2) + h \cdot \mathcal{O}(n^2 \cdot d_k) + h \cdot \mathcal{O}(n^2 \cdot d_v)$
Since $d_k, d_v \approx d/h$, each attention term becomes $h \cdot n^2 \cdot d/h = n^2 \cdot d$, so the total is:
$\mathcal{O}(n^2 \cdot d + n \cdot d^2)$
Overall, the cost of Multi-Head Attention (MHA) is of the same order as single-head attention, because per-head dimensions scale as $d/h$.
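The breakdown can be turned into a rough multiply-accumulate counter. The sketch below ignores constant factors inside softmax and assumes $d_k = d_v = d/h$ with $h$ dividing $d$; `attention_cost` is a hypothetical helper written for this example.

```python
def attention_cost(n, d, h):
    """Approximate multiply-accumulate counts for one multi-head attention layer."""
    d_k = d_v = d // h                 # per-head dimensions (assumes h divides d)
    cost = {
        "qkv_proj": 3 * n * d * d,     # step (1): Q, K, V projections
        "scores": h * n * n * d_k,     # step (2): QK^T for every head
        "softmax": h * n * n,          # step (3): one pass over each score matrix
        "weighted_sum": h * n * n * d_v,  # step (4): AV for every head
        "out_proj": n * d * d,         # step (5): output projection
    }
    cost["total"] = sum(cost.values())
    return cost

single = attention_cost(n=1024, d=512, h=1)
multi = attention_cost(n=1024, d=512, h=8)
```

The `scores` and `weighted_sum` entries are identical for `single` and `multi`, since the factor $h$ cancels against $d/h$; only the cheap softmax term grows with the number of heads.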
September 6, 2025
DL0041 Hierarchical Attention
Could you explain the concept of hierarchical attention in transformer architectures?
Answer
Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.
Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), this is inefficient and loses hierarchical structure.
Hierarchical Attention Idea:
(1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
(2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
This mirrors natural data hierarchies and reduces quadratic cost.
The figure below shows Hierarchical Attention used in the document classification use case.
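The cost reduction can be made concrete by counting query-key pairs. The sketch below assumes $n$ tokens split into equal segments of length $s$, with each segment summarized by a single vector at the global level; both helper functions are hypothetical and exist only for this illustration.

```python
def flat_attention_pairs(n):
    """Query-key pairs scored by flat self-attention over n tokens."""
    return n * n

def hierarchical_attention_pairs(n, s):
    """Pairs scored with local attention in segments of length s plus global
    attention over one summary vector per segment (assumes s divides n)."""
    num_segments = n // s
    local_pairs = num_segments * s * s            # attention inside each segment
    global_pairs = num_segments * num_segments    # attention across segment summaries
    return local_pairs + global_pairs

flat = flat_attention_pairs(4096)                     # 16,777,216 pairs
hier = hierarchical_attention_pairs(4096, s=64)       # 262,144 + 4,096 = 266,240 pairs
```

For a 4096-token input with 64-token segments, the hierarchical scheme scores roughly 63 times fewer pairs than flat attention under these assumptions.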
September 1, 2025