Author: admin

  • DL0049 Weight Init

    Why is “weight initialization” important in deep neural networks?

    Answer

    Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
    (1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during forward and backward passes
    (2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
    (3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.

Xavier Initialization is used for symmetric activations like Sigmoid or Tanh:
It aims to keep the variance of activations consistent across layers.
     W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}} + n_{\text{out}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
    Where:
     n_{\text{in}} = number of input units
     n_{\text{out}} = number of output units
     \mathcal{N}(\mu, \sigma^2) means weights are sampled from a normal (Gaussian) distribution with mean  \mu and variance  \sigma^2 .
     \mathcal{U}(a, b) means weights are sampled from a uniform distribution in the range  [a, b] .

    He Initialization is used for activations like ReLU or Leaky ReLU:
    He initialization scales up the variance for ReLU, since half of its outputs are zero, preventing vanishing activations.
     W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}},  \sqrt{\frac{6}{n_{\text{in}}}}\right)
    Where:
     n_{\text{in}} = number of input units

    He initialization is recommended for ReLU networks as shown in the plot below.
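As a quick check, here is a minimal NumPy sketch of both normal-variant schemes (the 512×512 layer size is illustrative):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    """Sample W ~ N(0, 1/(n_in + n_out)), suited to tanh/sigmoid layers."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(1.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng=None):
    """Sample W ~ N(0, 2/n_in), suited to ReLU layers."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

# Empirical check: the sample variance should match the target variance.
W_x = xavier_normal(512, 512)
W_h = he_normal(512, 512)
print(round(W_x.var() * (512 + 512), 2))  # ~1.0
print(round(W_h.var() * 512 / 2, 2))      # ~1.0
```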


  • DL0048 Adam Optimizer

    Can you explain how the Adam optimizer works?

    Answer

    The Adam (Adaptive Moment Estimation) optimizer is a powerful algorithm for training deep learning models, combining the principles of Momentum and RMSprop to compute adaptive learning rates for each parameter.
Adam performs its update in the steps below:
(1) First Moment Calculation (Mean/Momentum)
    It computes an exponentially decaying average of past gradients, which is the estimate of the first moment (mean) of the gradient. This introduces a momentum effect to smooth out the updates.
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    Where:
    m_t is the 1st moment (mean of gradients).
    g_t is the gradient at step t.
    \beta_1 controls momentum (default: 0.9).

    (2) Second Moment Calculation (Variance)
    Adam also computes an exponentially decaying average of past squared gradients, which is the estimate of the second moment (uncentered variance) of the gradient. This provides a measure of the scale (magnitude) of the gradients.
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    Where:
    v_t is the 2nd moment (variance of gradients).
    \beta_2 controls smoothing of squared gradients (default: 0.999).

    (3) Bias Correction
Since m_t and v_t are initialized as zero vectors, they are biased towards zero, especially during the initial steps. Adam applies bias correction to these estimates.
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
    \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
    Where
    \hat{m}_t is the bias-corrected 1st moment.
    \hat{v}_t is the bias-corrected 2nd moment.
    \beta_1^t, \beta_2^t are the exponential decay raised to step t, correcting bias from initialization.

    (4) Parameter Update
The final parameter update scales the bias-corrected first moment (\hat{m}_t) by the learning rate (\alpha) and divides by the square root of the bias-corrected second moment (\hat{v}_t).
    \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
    Where:
    \theta_t are model parameters.
    \alpha is the learning rate.
    \epsilon prevents division by zero.
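The four steps above can be condensed into a minimal NumPy sketch (the quadratic objective and the learning rate of 0.01 are illustrative choices):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad                       # (1) first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2                  # (2) second moment (scale)
    m_hat = m / (1 - beta1 ** t)                             # (3) bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # (4) parameter update
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta), starting from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)  # a small value near the minimum at 0
```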

    The plot below shows how the Adam optimizer efficiently moves from the starting point to the minimum of the quadratic bowl, taking adaptive steps that quickly converge to the origin.



  • DL0047 Focal Loss II

    Please compare focal loss and weighted cross-entropy.

    Answer

Weighted Cross-Entropy (WCE) rescales the loss per class to correct prior imbalance; it is simple and relatively robust to noisy labels. Focal Loss (FL) multiplies cross-entropy by a difficulty-dependent factor (1 - p_t)^\gamma that suppresses easy-example gradients and focuses learning on hard examples; it is preferable when many easy negatives overwhelm training, but requires careful tuning to avoid amplifying label noise.

    \text{WeightedCE}(p_t) = -\alpha_t \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the per-class weight for class t.

    \text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the optional per-class weight for class t;
    \gamma \ge 0 is the focusing parameter that down-weights easy examples.
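Both losses are one-liners; the sketch below (the \alpha_t = 0.25 and \gamma = 2 defaults are illustrative) shows how FL collapses the loss of a confident prediction while WCE does not:

```python
import numpy as np

def weighted_ce(p_t, alpha_t=0.25):
    """Weighted cross-entropy on the ground-truth probability p_t."""
    return -alpha_t * np.log(p_t)

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss: cross-entropy scaled by the (1 - p_t)^gamma modulator."""
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy example (p_t = 0.95) vs a hard one (p_t = 0.1):
for p in (0.95, 0.1):
    print(p, weighted_ce(p), focal_loss(p))
```

For p_t = 0.95 the focal loss is smaller than the weighted cross-entropy by a factor of (1 - 0.95)^2 = 0.0025, while for p_t = 0.1 the two losses stay within the same order of magnitude.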

    Here is a table to compare focal loss and weighted cross-entropy.

    The figure below compares Cross-Entropy, Weighted Cross-Entropy, and Focal Loss.



  • DL0046 Focal Loss

    What is focal loss, and why does it help with class imbalance?

    Answer

    Focal loss augments cross-entropy with a modulating term (1 - p_t)^\gamma and an optional balancing weight \alpha_t to suppress gradients from easy, majority-class examples and amplify learning from hard or minority-class examples, improving performance in severe class-imbalance settings when hyperparameters are properly tuned.
    (1) Focal loss formula:
    \text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \gamma \ge 0 is the focusing parameter that down-weights easy examples;
    \alpha_t \in (0,1) is an optional class-balancing weight for class t.
    (2) Modulation: The factor (1 - p_t)^\gamma reduces loss from well-classified (high-confidence) examples, concentrating gradients on hard / low-confidence examples.

    (3) Class imbalance effect:
    In cross-entropy, abundant, easy negatives still produce a large total gradient, dominating learning.
    Focal loss down-weights those contributions, ensuring rare/difficult samples have a stronger influence.
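To make the imbalance effect concrete, here is a small arithmetic sketch with assumed counts and probabilities (10,000 easy negatives at p_t = 0.95 vs 100 hard positives at p_t = 0.3):

```python
import numpy as np

gamma = 2.0
p_easy, n_easy = 0.95, 10_000   # many well-classified negatives
p_hard, n_hard = 0.3, 100       # few hard positives

# Total loss contributed by each group under plain CE vs focal loss.
ce_easy = n_easy * -np.log(p_easy)
ce_hard = n_hard * -np.log(p_hard)
fl_easy = n_easy * (1 - p_easy) ** gamma * -np.log(p_easy)
fl_hard = n_hard * (1 - p_hard) ** gamma * -np.log(p_hard)

print(ce_easy / ce_hard)   # > 1: easy negatives dominate plain CE
print(fl_easy / fl_hard)   # < 1: focal loss flips the balance
```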

    The plot below shows cross-entropy and focal-loss curves for several \gamma values and an example \alpha.



  • DL0045 Dimension in FFN

In Transformers, why does the feed-forward network expand the hidden dimension (e.g.,  d_{\text{model}} \to 4 d_{\text{model}} ) before reducing it back?

    Answer

    The feed-forward network in Transformers expands the hidden dimension (e.g.,  d_{\text{model}} \to 4 \cdot d_{\text{model}} ) to enhance the model’s ability to learn complex, non-linear feature interactions, then reduces it back to maintain compatibility with other layers. This design acts as a bottleneck, balancing expressiveness and efficiency, and has been empirically shown to boost performance in large-scale models.
    (1) Extra Capacity: Expanding from  d_{\text{model}} to  4d_{\text{model}} allows the FFN to capture richer nonlinear transformations.
    (2) Non‑linear mixing: The intermediate expansion allows the activation function (ReLU, GeLU, SwiGLU, etc.) to operate in a richer space, capturing more complex patterns.
    (3) Projection back ensures compatibility: Reducing the dimension back to  d_{\text{model}} ensures compatibility with the subsequent layers. It ensures residual connection compatibility and uniformity across layers.

    The equation and the figure below show the architecture of FFN:
     \text{FFN}(x) = W_2  \sigma(W_1 x + b_1) + b_2
    Where:
     x \in \mathbb{R}^{d_{\text{model}}} is the input vector.
     W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}} expands the dimension.
     \sigma is a non-linear activation (e.g., ReLU/GELU).
     W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}} projects back down.
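The equation maps directly to a few lines of NumPy (the dimension d_model = 8, ReLU activation, and 0.1 weight scale are illustrative choices):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4*d_model, apply ReLU, project back."""
    h = np.maximum(0.0, x @ W1.T + b1)   # hidden vector of size 4*d_model
    return h @ W2.T + b2                 # back to d_model

d_model = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4 * d_model, d_model)) * 0.1   # expansion
W2 = rng.standard_normal((d_model, 4 * d_model)) * 0.1   # projection back
b1 = np.zeros(4 * d_model)
b2 = np.zeros(d_model)

x = rng.standard_normal(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,): same as the input, so the residual x + y is valid
```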


  • DL0044 Multi-Query Attention

    What is Multi-Query Attention in transformer models?

    Answer

Multi-Query Attention (MQA) optimizes the standard Multi-Head Attention (MHA) in transformers by using multiple query heads while sharing a single key-value projection across them. This design maintains similar expressiveness to MHA but significantly reduces memory usage during inference, especially for KV caching in autoregressive tasks, making it well suited to scaling large models. It trades a small potential quality loss for efficiency.

    Comparison to Multi-Head Attention (MHA): In standard MHA, each attention head has independent projections for queries (Q), keys (K), and values (V). In MQA, only Q is projected into multiple heads, while K and V use a single projection shared across all query heads, as shown in the figure below.

    Efficiency Benefits: Reduces the memory footprint during inference, particularly with KV caching, as the cache stores only one set of K and V vectors instead of one per head, lowering memory complexity from  \mathcal{O}(n \cdot h \cdot d) to  \mathcal{O}(n \cdot d) , where  n is the sequence length,  h is the number of heads, and  d is the head dimension.

    The core attention operation for a single head in MQA can be represented by the following equation:
    \mbox{Attention}(Q_i, K_{shared}, V_{shared}) = \mbox{Softmax}(\frac{Q_i K_{shared}^T}{\sqrt{d_k}}) V_{shared}
    Where:
     Q_i represents the query vector for the ith attention head.
     K_{shared} and  V_{shared} represent the single, shared key and value vectors used by all heads.
     d_k is the dimension of the key vectors.
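A minimal NumPy sketch of the equation above (head count and dimensions are illustrative; the Q/K/V projection layers are omitted for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K_shared, V_shared):
    """Q: (h, n, d_k) per-head queries; K_shared, V_shared: (n, d_k),
    a single set shared by every query head."""
    d_k = K_shared.shape[-1]
    scores = Q @ K_shared.T / np.sqrt(d_k)       # (h, n, n)
    return softmax(scores, axis=-1) @ V_shared   # (h, n, d_k)

h, n, d_k = 4, 6, 8
rng = np.random.default_rng(0)
out = multi_query_attention(rng.standard_normal((h, n, d_k)),
                            rng.standard_normal((n, d_k)),
                            rng.standard_normal((n, d_k)))
print(out.shape)  # (4, 6, 8): h heads, all attending over one shared K/V
```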


  • DL0043 KV Cache

    What is KV Cache in transformers, and why is it useful during inference?

    Answer

    The KV Cache in transformers optimizes inference by storing and reusing key and value vectors from the attention mechanism, avoiding redundant computations for previous tokens in a sequence. This is particularly useful in autoregressive models, where each new token requires attention over all prior tokens. By caching K and V vectors, the model only computes the query for the new token and retrieves cached K and V for earlier tokens, improving speed at the cost of higher memory usage.

    The attention mechanism, optimized by KV Cache, is:
    \mathrm{Attention}(Q_t, K_{1:t}, V_{1:t}) = \mathrm{Softmax}\left(\frac{Q_t K_{1:t}^\top}{\sqrt{d_k}}\right) V_{1:t}
    Where:
     Q_t = query of the current token t.
     K_{1:t}, V_{1:t} = cached keys and values for all tokens up to t.
     d_k = key dimension.
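A toy NumPy sketch of a decoding loop with a KV cache (single head, random projection weights, no output projection; all sizes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_t, Wq, Wk, Wv, cache):
    """Attend the new token to all cached keys/values, then extend the cache."""
    q_t = x_t @ Wq                               # query for the new token only
    cache["K"].append(x_t @ Wk)                  # k_t, v_t computed once, reused later
    cache["V"].append(x_t @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    d_k = K.shape[-1]
    attn = softmax(q_t @ K.T / np.sqrt(d_k))     # (t,) weights over tokens 1..t
    return attn @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for t in range(5):                               # autoregressive loop
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv, cache)
print(len(cache["K"]), out.shape)  # 5 cached tokens, output shape (8,)
```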

    The figure below explains the KV cache in autoregressive transformers.



  • DL0042 Attention Computation

    Please break down the computational cost of attention.

    Answer

    Here is the breakdown of the computational cost of attention:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
    Input dimensions: Sequence length =  n , hidden dimension =  d , number of heads =  h .
    (1) Linear projections (Q, K, V):
    Each input  X \in \mathbb{R}^{n \times d} is projected into queries, keys, and values.
    Cost: \mathcal{O}(n \cdot d^2) (for all 3 matrices).

    (2) Attention score computation (QKᵀ):
    Queries:  Q \in \mathbb{R}^{n \times d_k}
    Keys:  K \in \mathbb{R}^{n \times d_k}
    Score matrix:
    S = QK^\top \in \mathbb{R}^{n \times n}
    Cost: \mathcal{O}(n^2 \cdot d_k)

    (3) Softmax normalization:
    For each row of the score matrix:
    \mbox{Softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}
    Where:
     s_i = raw score for position  i
     n = total sequence length
    Cost: \mathcal{O}(n^2)

    (4) Weighted sum with values (AV):
    Attention weights  A \in \mathbb{R}^{n \times n} applied to values  V \in \mathbb{R}^{n \times d_v} :
    O = AV
    Cost: \mathcal{O}(n^2 \cdot d_v)

    (5) Output projection:
    Final linear layer to mix heads back to  d .
    Cost: \mathcal{O}(n \cdot d^2)

    Total Complexity:
    Putting it all together:
    \mathcal{O}(n \cdot d^2) + \mathcal{O}(n^2 \cdot d_k) + \mathcal{O}(n^2 \cdot d_v)
    Since  d_k, d_v \approx d/h , the dominant term is:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
Overall, the cost of Multi-Head Attention (MHA) is of the same order as single-head attention, because the per-head dimensions scale as  d/h .
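The five stages above can be tallied with a small helper (single head, assuming  d_k = d_v = d , counting multiplies only; the example sizes are illustrative):

```python
def attention_flops(n, d):
    """Rough multiply count for the five attention stages."""
    proj_qkv = 3 * n * d * d    # (1) Q, K, V projections
    scores   = n * n * d        # (2) QK^T
    softmax  = n * n            # (3) row-wise softmax (exp + normalize)
    weighted = n * n * d        # (4) AV
    out_proj = n * d * d        # (5) output projection
    return proj_qkv + scores + softmax + weighted + out_proj

# Once n >> d, doubling the sequence length roughly quadruples the cost,
# reflecting the dominant n^2 * d term.
print(attention_flops(1024, 64), attention_flops(2048, 64))
```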


  • DL0041 Hierarchical Attention

    Could you explain the concept of hierarchical attention in transformer architectures?

    Answer

    Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.

    Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), this is inefficient and loses hierarchical structure.
    Hierarchical Attention Idea:
    (1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
    (2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
    This mirrors natural data hierarchies and reduces quadratic cost.
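A toy NumPy sketch of the two levels (identity Q/K/V projections and mean pooling are simplifying assumptions; segment sizes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Plain scaled dot-product self-attention, identity projections for brevity."""
    d = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d)) @ X

def hierarchical_attention(segments):
    """(1) local attention within each segment; (2) global attention across
    mean-pooled segment summaries."""
    local = [self_attention(seg) for seg in segments]          # fine-grained
    summaries = np.stack([seg.mean(axis=0) for seg in local])  # one vector per segment
    return self_attention(summaries)                           # coarse-grained

rng = np.random.default_rng(0)
segments = [rng.standard_normal((5, 8)) for _ in range(3)]  # 3 "sentences" of 5 "words"
doc = hierarchical_attention(segments)
print(doc.shape)  # (3, 8): one context vector per segment
```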

    The figure below shows Hierarchical Attention used in the document classification use case.



  • DL0040 Attention Mask

    What is the role of masking in attention?

    Answer

    Masking is a critical technique in transformer attention mechanisms that controls which parts of the input sequence the model is allowed to focus on.
    (1) Leakage prevention: Blocks access to future tokens in autoregressive decoding to preserve causality.
    (2) Padding handling: Excludes pad positions so they don’t absorb probability mass or distort context.
    (3) Structured constraint: Enforces task rules (e.g., graph neighborhoods, spans, or blocked regions).
    Core equation with mask:
    \text{Attn}(Q,K,V,M)=\text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V
    Where:
     Q query matrix.
     K key matrix.
     V value matrix.
     d_k key dimensionality (for scaling).
     M mask matrix with 0 for allowed positions and large negative values (e.g., −∞) for disallowed positions.
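A minimal NumPy sketch of the masked-attention equation with a causal mask (using -1e9 as a finite stand-in for −∞; sizes are illustrative):

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Scaled dot-product attention with an additive mask M
    (0 = allowed, -1e9 = blocked)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)   # softmax zeroes out blocked positions
    return A @ V, A

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Causal mask: position i may attend only to positions j <= i.
causal = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -1e9)
out, A = masked_attention(Q, K, V, causal)
print(np.triu(A, k=1).max())  # attention weights above the diagonal are ~0
```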

    The figure below shows three side-by-side heatmaps: a padding mask that disallows attending to padding tokens, a causal mask that enforces autoregressive decoding, and a structured mask that enforces local neighborhood constraint.

