Category: Medium

  • DL0052 Rotary Positional Embedding

    What is Rotary Positional Embedding (RoPE)?

    Answer

    Rotary Positional Embedding (RoPE) is a positional encoding method that rotates query and key vectors in multi‑head attention by position‑dependent angles. This rotation naturally encodes relative positional information, improves generalization to longer contexts, and avoids the limitations of fixed or learned absolute positional embeddings. It is used in GPT-NeoX, LLaMA, PaLM, Qwen, etc.
    It has the following characteristics:
    (1) Relative position encoding method for Transformers
    (2) Applies rotation to query (Q) and key (K) vectors using position-dependent angles
    (3) Encodes position via geometry, not by adding vectors
    (4) Preserves relative distance naturally in dot-product attention
    (5) Extrapolates well to longer sequences than the training length

    RoPE rotates each 2D pair of hidden dimensions:
    f(x, m)=\begin{pmatrix}\cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta)\end{pmatrix}\begin{pmatrix}x_1 \\x_2\end{pmatrix}
    Where:
     m represents the absolute position of the token in the sequence.
     \theta represents the rotation frequency for the dimension pair (commonly \theta_i = 10000^{-2i/d} for the i-th pair).
     x_1, x_2 represent the components of the embedding vector.
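
    As a concrete illustration (not part of the original answer), below is a minimal NumPy sketch of the pairwise rotation, assuming the common frequency schedule \theta_i = 10000^{-2i/d}; it also checks that the dot product of rotated query and key vectors depends only on the relative offset between positions.

    import numpy as np

    def rope_rotate(x, m, base=10000.0):
        # Rotate consecutive pairs (x_1, x_2) of a d-dimensional vector x at
        # position m by angles m * theta_i, with theta_i = base^(-2i/d).
        d = x.shape[0]
        i = np.arange(d // 2)
        angles = m * base ** (-2.0 * i / d)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin   # 2x2 rotation applied pairwise
        out[1::2] = x1 * sin + x2 * cos
        return out

    # The rotated dot product depends only on the relative distance m - n.
    q, k = np.random.randn(64), np.random.randn(64)
    s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)      # positions 5 and 2
    s2 = rope_rotate(q, 105) @ rope_rotate(k, 102)  # both shifted by 100
    print(np.isclose(s1, s2))                       # True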

    The plot below visualizes how RoPE makes attention decay smoothly with relative distance, while standard sinusoidal PE reflects absolute-position similarity.


  • DL0051 Sparsity in NN

    Explain the concept of “Sparsity” in neural networks.

    Answer

    Sparsity in neural networks refers to the property that many parameters (weights) or activations are exactly zero (or very close to zero).
    This leads to lighter, faster, and more interpretable models. Techniques such as L1 regularization, pruning, and ReLU activations help enforce sparsity, making networks more efficient without compromising performance.

    Common techniques and their equations:
    (1) L1 Regularization (encourages sparse weights)
     L = L_{\text{task}} + \lambda \sum_i |w_i|
    Where:
     w_i represents the i-th model weight
     \lambda controls the strength of sparsity

    (2) ReLU Activation (induces sparse activations)
     \mathrm{ReLU}(x) = \max(0, x)
    Where:
     x is the neuron input.
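
    The short PyTorch sketch below (illustrative only: the model architecture, batch, and \lambda value are arbitrary choices for the example) shows how the L1 penalty is added to the task loss and how ReLU yields exactly-zero activations.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    lam = 1e-4                                   # lambda: sparsity strength

    x = torch.randn(32, 100)                     # dummy batch
    y = torch.randint(0, 10, (32,))

    task_loss = criterion(model(x), y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + lam * l1_penalty          # L = L_task + lambda * sum_i |w_i|
    loss.backward()
    optimizer.step()

    # ReLU-induced activation sparsity: fraction of exactly-zero activations.
    acts = model[1](model[0](x))
    print((acts == 0).float().mean())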

    The plot below compares the weight distributions of models trained without L1 and with L1-induced sparsity.


  • DL0050 Knowledge Distillation

    Describe the process and benefits of knowledge distillation.

    Answer

    Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.

    Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).

    Soft Targets: The student is trained not only on hard labels (one-hot) but also on the soft output probabilities of the teacher.

    Temperature Scaling: Teacher logits are softened using a temperature T to reveal more information about class similarities:
    \mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
    Where:
     z_i : Raw score (logit) for the i-th class.
     K : Total number of classes in the classification problem.
     T : Temperature parameter (>0) used to soften the probabilities. Higher  T produces a smoother distribution, revealing relationships between classes (“dark knowledge”).

    The plot below shows the Softmax probabilities for a fixed set of teacher logits under three different temperatures. Increasing the temperature smooths the distribution.

    Loss Function: Typically combines distillation loss (difference between teacher and student soft outputs) and standard cross-entropy loss with true labels.
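
    The PyTorch sketch below shows one common way to combine the two terms; the temperature T and mixing weight alpha are illustrative defaults, not prescribed values.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft-target term: KL between temperature-softened teacher and student.
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_soft_student = F.log_softmax(student_logits / T, dim=-1)
        # T^2 rescales gradients so the soft term stays comparable across T.
        kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
        # Hard-label term: standard cross-entropy with the true labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    # Usage with dummy logits and labels.
    student = torch.randn(8, 10, requires_grad=True)
    teacher = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    print(distillation_loss(student, teacher, labels))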

    Key Benefits of KD:
    (1) Model compression: The student is smaller and faster while retaining much of the teacher’s performance, enabling deployment on resource-constrained devices.
    (2) Inference speed: Significantly decreases latency, making the model suitable for deployment on edge devices or real-time applications.
    (3) Improved generalization: The teacher’s smooth soft targets act as a powerful form of regularization, often leading the student to generalize better than if it were trained only on hard labels.

    The plot below demonstrates the Knowledge Distillation (KD) process.



  • DL0047 Focal Loss II

    Please compare focal loss and weighted cross-entropy.

    Answer

    Weighted Cross-Entropy (WCE) rescales the loss by class to correct prior imbalance and is simple and robust for noisy labels; Focal Loss (FL) multiplies cross-entropy by the difficulty-dependent modulating factor (1 - p_t)^\gamma, controlled by the focusing parameter \gamma, to suppress easy-example gradients and focus learning on hard examples, making it preferable when many easy negatives overwhelm training but requiring careful tuning to avoid amplifying label noise.

    \text{WeightedCE}(p_t) = -\alpha_t \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the per-class weight for class t.

    \text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the optional per-class weight for class t;
    \gamma \ge 0 is the focusing parameter that down-weights easy examples.
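
    The small PyTorch sketch below evaluates both formulas on a toy easy example (p_t = 0.9) and hard example (p_t = 0.1); \alpha_t = 1 and \gamma = 2 are chosen purely for illustration.

    import torch

    def weighted_ce(p_t, alpha_t):
        # WeightedCE(p_t) = -alpha_t * log(p_t)
        return -alpha_t * torch.log(p_t)

    def focal_loss(p_t, alpha_t, gamma=2.0):
        # FocalLoss(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
        return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t)

    # Easy example (p_t = 0.9) vs hard example (p_t = 0.1).
    p = torch.tensor([0.9, 0.1])
    print(weighted_ce(p, 1.0))   # ~[0.105, 2.303]
    print(focal_loss(p, 1.0))    # ~[0.001, 1.865]: the easy example is heavily down-weighted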

    The table below compares focal loss and weighted cross-entropy.

    The figure below compares Cross-Entropy, Weighted Cross-Entropy, and Focal Loss.



  • DL0045 Dimension in FFN

    In Transformers, why does the feed-forward network expand the hidden dimension (e.g.,  d_{\text{model}} \to 4 d_{\text{model}} ) before reducing it back?

    Answer

    The feed-forward network in Transformers expands the hidden dimension (e.g.,  d_{\text{model}} \to 4 \cdot d_{\text{model}} ) to enhance the model’s ability to learn complex, non-linear feature interactions, then reduces it back to maintain compatibility with other layers. This design acts as a bottleneck, balancing expressiveness and efficiency, and has been empirically shown to boost performance in large-scale models.
    (1) Extra Capacity: Expanding from  d_{\text{model}} to  4d_{\text{model}} allows the FFN to capture richer nonlinear transformations.
    (2) Non‑linear mixing: The intermediate expansion allows the activation function (ReLU, GeLU, SwiGLU, etc.) to operate in a richer space, capturing more complex patterns.
    (3) Projection back: Reducing the dimension back to  d_{\text{model}} keeps the output compatible with the residual connection and subsequent layers, maintaining a uniform width across the network.

    The equation and the figure below show the architecture of FFN:
     \text{FFN}(x) = W_2  \sigma(W_1 x + b_1) + b_2
    Where:
     x \in \mathbb{R}^{d_{\text{model}}} is the input vector.
     W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}} expands the dimension.
     \sigma is a non-linear activation (e.g., ReLU/GELU).
     W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}} projects back down.
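
    A minimal PyTorch sketch of this block (GELU and d_model = 512 are illustrative choices):

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        # Position-wise FFN: expand d_model -> 4*d_model, apply the
        # nonlinearity, then project back to d_model.
        def __init__(self, d_model, expansion=4):
            super().__init__()
            self.w1 = nn.Linear(d_model, expansion * d_model)  # W_1, b_1
            self.w2 = nn.Linear(expansion * d_model, d_model)  # W_2, b_2
            self.act = nn.GELU()

        def forward(self, x):
            return self.w2(self.act(self.w1(x)))  # W_2 sigma(W_1 x + b_1) + b_2

    ffn = FeedForward(d_model=512)
    x = torch.randn(2, 16, 512)    # (batch, sequence, d_model)
    print(ffn(x).shape)            # torch.Size([2, 16, 512]): residual-compatible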


  • DL0044 Multi-Query Attention

    What is Multi-Query Attention in transformer models?

    Answer

    Multi-Query Attention (MQA) optimizes the standard Multi-Head Attention (MHA) in transformers by using multiple query heads while sharing a single key-value projection across them. This design maintains similar computational expressiveness to MHA but significantly reduces memory usage during inference, especially in KV caching for autoregressive tasks, making it ideal for scaling large models. It trades a minor potential loss in quality for efficiency.

    Comparison to Multi-Head Attention (MHA): In standard MHA, each attention head has independent projections for queries (Q), keys (K), and values (V). In MQA, only Q is projected into multiple heads, while K and V use a single projection shared across all query heads, as shown in the figure below.

    Efficiency Benefits: Reduces memory footprint during inference, particularly with KV caching, as the cache stores only one set of K and V vectors instead of one per head, lowering memory complexity from  \mathcal{O}(n \cdot h \cdot d) to  \mathcal{O}(n \cdot d) , where  n is the sequence length,  h is the number of heads, and  d is the head dimension.

    The core attention operation for a single head in MQA can be represented by the following equation:
    \mbox{Attention}(Q_i, K_{shared}, V_{shared}) = \mbox{Softmax}(\frac{Q_i K_{shared}^T}{\sqrt{d_k}}) V_{shared}
    Where:
     Q_i represents the query vector for the ith attention head.
     K_{shared} and  V_{shared} represent the single, shared key and value vectors used by all heads.
     d_k is the dimension of the key vectors.
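
    A simplified PyTorch sketch (no masking or KV cache; dimensions are illustrative) that projects h query heads but a single shared key/value head and broadcasts it across the query heads:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiQueryAttention(nn.Module):
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.h, self.d_k = n_heads, d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)    # h query heads
            self.k_proj = nn.Linear(d_model, self.d_k)   # one shared key head
            self.v_proj = nn.Linear(d_model, self.d_k)   # one shared value head
            self.out = nn.Linear(d_model, d_model)

        def forward(self, x):
            b, n, _ = x.shape
            q = self.q_proj(x).view(b, n, self.h, self.d_k).transpose(1, 2)  # (b, h, n, d_k)
            k = self.k_proj(x).unsqueeze(1)   # (b, 1, n, d_k), broadcast over heads
            v = self.v_proj(x).unsqueeze(1)
            scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5               # (b, h, n, n)
            ctx = F.softmax(scores, dim=-1) @ v                              # (b, h, n, d_k)
            return self.out(ctx.transpose(1, 2).reshape(b, n, -1))

    mqa = MultiQueryAttention(d_model=512, n_heads=8)
    print(mqa(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])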


  • DL0042 Attention Computation

    Please break down the computational cost of attention.

    Answer

    Here is the breakdown of the computational cost of attention:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
    Input dimensions: Sequence length =  n , hidden dimension =  d , number of heads =  h .
    (1) Linear projections (Q, K, V):
    Each input  X \in \mathbb{R}^{n \times d} is projected into queries, keys, and values.
    Cost: \mathcal{O}(n \cdot d^2) (for all 3 matrices).

    (2) Attention score computation (QKᵀ):
    Queries:  Q \in \mathbb{R}^{n \times d_k}
    Keys:  K \in \mathbb{R}^{n \times d_k}
    Score matrix:
    S = QK^\top \in \mathbb{R}^{n \times n}
    Cost: \mathcal{O}(n^2 \cdot d_k)

    (3) Softmax normalization:
    For each row of the score matrix:
    \mbox{Softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}
    Where:
     s_i = raw score for position  i
     n = total sequence length
    Cost: \mathcal{O}(n^2)

    (4) Weighted sum with values (AV):
    Attention weights  A \in \mathbb{R}^{n \times n} applied to values  V \in \mathbb{R}^{n \times d_v} :
    O = AV
    Cost: \mathcal{O}(n^2 \cdot d_v)

    (5) Output projection:
    Final linear layer to mix heads back to  d .
    Cost: \mathcal{O}(n \cdot d^2)

    Total Complexity:
    Putting it all together:
    \mathcal{O}(n \cdot d^2) + \mathcal{O}(n^2 \cdot d_k) + \mathcal{O}(n^2 \cdot d_v)
    Since  d_k, d_v \approx d/h and these per-head costs are incurred for each of the  h heads, the total simplifies to the dominant terms:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
    Overall, the cost of Multi-Head Attention (MHA) is of the same order as single-head attention, because the per-head dimensions scale as  d/h .
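
    A rough multiply-accumulate count that mirrors the terms above (ignoring softmax, biases, and constant factors; the values of n, d, and h are illustrative) shows the n^2 terms dominating as the sequence grows:

    def attention_flops(n, d, h):
        d_head = d // h
        proj_qkv = 3 * n * d * d        # Q, K, V projections: O(n * d^2)
        scores = h * n * n * d_head     # QK^T across all heads: O(n^2 * d)
        weighted = h * n * n * d_head   # AV across all heads: O(n^2 * d)
        out_proj = n * d * d            # output projection: O(n * d^2)
        return proj_qkv + scores + weighted + out_proj

    for n in (512, 2048, 8192):
        print(n, f"{attention_flops(n, d=1024, h=16):.3e}")
    # Quadrupling n multiplies the n^2 terms by ~16, so they dominate at long n.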


  • DL0041 Hierarchical Attention

    Could you explain the concept of hierarchical attention in transformer architectures?

    Answer

    Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.

    Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), this is inefficient and loses hierarchical structure.
    Hierarchical Attention Idea:
    (1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
    (2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
    This mirrors natural data hierarchies and reduces quadratic cost.
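
    A minimal two-level PyTorch sketch (mean pooling and all shapes are illustrative simplifications): local attention runs inside each segment, segment vectors are pooled, and global attention runs across segments.

    import torch
    import torch.nn as nn

    d_model, n_segments, seg_len = 256, 8, 32
    local_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
    global_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    tokens = torch.randn(n_segments, seg_len, d_model)  # each segment as a batch item

    # Level 1: fine-grained attention inside each segment (e.g., words in a sentence).
    local_out, _ = local_attn(tokens, tokens, tokens)    # (n_segments, seg_len, d_model)

    # Level 2: pool each segment, then coarse-grained attention across segments.
    segment_repr = local_out.mean(dim=1).unsqueeze(0)    # (1, n_segments, d_model)
    global_out, _ = global_attn(segment_repr, segment_repr, segment_repr)
    print(global_out.shape)                              # torch.Size([1, 8, 256])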

    The figure below shows Hierarchical Attention used in the document classification use case.



  • DL0037 Transformer Architecture III

    Why do Transformers use a dot product, rather than addition, to compute attention scores?

    Answer

    Dot product attention is a fast and naturally aligned similarity measure; with scaling, it remains numerically stable and highly parallelizable, which is why Transformers prefer it over addition.
    (1) Dot product captures similarity: The dot product between query  q and key  k grows larger when they point in similar directions, making it a natural similarity measure.
    The scores are normalized with Softmax and have a probabilistic interpretation:
    \alpha_i = \frac{e^{q \cdot k_i}}{\sum_{j=1}^K e^{q \cdot k_j}}
    Where:
     q \cdot k_i is the dot product similarity between query and key.

    The figure below illustrates the dot product for measuring similarity.

    (2) Efficient computation: Dot products can be computed in parallel as a matrix multiplication  QK^\top , which is hardware-friendly.
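
    A toy NumPy check (the vectors are chosen purely for illustration): the dot product is largest for the key most aligned with the query, and Softmax turns the scores into attention weights.

    import numpy as np

    q = np.array([1.0, 0.0])
    keys = np.array([[0.9, 0.1],    # nearly aligned with q
                     [0.0, 1.0],    # orthogonal to q
                     [-1.0, 0.0]])  # opposite direction
    scores = keys @ q                              # q . k_i for each key
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over keys
    print(scores)  # [ 0.9  0.  -1. ]
    print(alpha)   # highest weight on the aligned key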



  • DL0029 Dilated Attention

    Could you explain the concept of dilated attention in transformer architectures?

    Answer

    Dilated attention introduces gaps between attention positions to sparsify computation, enabling efficient long-range dependency modeling. It is particularly helpful in tasks requiring scalable attention over long sequences. It trades off some granularity for global context by spreading attention more widely and sparsely.

    Dilated attention is similar to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
    Instead of attending to all tokens (as in standard self-attention), each query token attends only to every d-th token; the dilation rate d controls the stride of the attention pattern.

    Reduction in Complexity: Reduces attention computation and memory from  \mathcal{O}(n^2) to a lower cost that depends on the sparsity pattern (roughly  \mathcal{O}(n^2 / d) for a dilation rate of  d ).

    In dilated attention, the dot-product  QK^\top is computed only at dilated positions:
    \mbox{Attention}_{\text{dilated}}(Q, K, V) = \mbox{Softmax}\left(\frac{QK_d^\top}{\sqrt{d_k}}\right) V_d
    Where:
     K_d, V_d are the dilated subsets of keys and values.
     d_k is the key dimension.
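
    A minimal NumPy sketch (plain strided subsampling of keys and values; no masking or block structure) of this computation with a dilation rate of 3:

    import numpy as np

    def dilated_attention(Q, K, V, dilation=3):
        # Each query attends only to every `dilation`-th key/value (K_d, V_d).
        K_d, V_d = K[::dilation], V[::dilation]
        d_k = K.shape[-1]
        scores = Q @ K_d.T / np.sqrt(d_k)                # (n, n/dilation) instead of (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V_d

    n, d = 12, 8
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    print(dilated_attention(Q, K, V).shape)  # (12, 8)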

    Below is the visualization of dilated attention with a dilation rate of 3.

