Author: admin

  • DL0049 Weight Init

    Why is “weight initialization” important in deep neural networks?

    Answer

    Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
    (1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during forward and backward passes
    (2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
    (3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.

Xavier Initialization is used for symmetric activations like Sigmoid or Tanh:
It aims to keep the variance of activations consistent across layers.
     W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}} + n_{\text{out}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
    Where:
     n_{\text{in}} = number of input units
     n_{\text{out}} = number of output units
     \mathcal{N}(\mu, \sigma^2) means weights are sampled from a normal (Gaussian) distribution with mean  \mu and variance  \sigma^2 .
     \mathcal{U}(a, b) means weights are sampled from a uniform distribution in the range  [a, b] .

    He Initialization is used for activations like ReLU or Leaky ReLU:
    He initialization scales up the variance for ReLU, since half of its outputs are zero, preventing vanishing activations.
     W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}},  \sqrt{\frac{6}{n_{\text{in}}}}\right)
    Where:
     n_{\text{in}} = number of input units

    He initialization is recommended for ReLU networks as shown in the plot below.
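As a quick check, here is a minimal NumPy sketch of both normal-variant schemes (the 512×512 layer size is illustrative):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    """Sample W ~ N(0, 1/(n_in + n_out)), suited to tanh/sigmoid layers."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(1.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng=None):
    """Sample W ~ N(0, 2/n_in), suited to ReLU layers."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

# Empirical check: the sample variance should match the target variance.
W_x = xavier_normal(512, 512)
W_h = he_normal(512, 512)
print(round(W_x.var() * (512 + 512), 2))  # ~1.0
print(round(W_h.var() * 512 / 2, 2))      # ~1.0
```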


  • DL0048 Adam Optimizer

    Can you explain how the Adam optimizer works?

    Answer

    The Adam (Adaptive Moment Estimation) optimizer is a powerful algorithm for training deep learning models, combining the principles of Momentum and RMSprop to compute adaptive learning rates for each parameter.
Adam performs its update in the steps below:
(1) First Moment Calculation (Mean/Momentum)
    It computes an exponentially decaying average of past gradients, which is the estimate of the first moment (mean) of the gradient. This introduces a momentum effect to smooth out the updates.
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    Where:
    m_t is the 1st moment (mean of gradients).
    g_t is the gradient at step t.
    \beta_1 controls momentum (default: 0.9).

    (2) Second Moment Calculation (Variance)
    Adam also computes an exponentially decaying average of past squared gradients, which is the estimate of the second moment (uncentered variance) of the gradient. This provides a measure of the scale (magnitude) of the gradients.
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    Where:
    v_t is the 2nd moment (variance of gradients).
    \beta_2 controls smoothing of squared gradients (default: 0.999).

    (3) Bias Correction
Since m_t and v_t are initialized as zero vectors, they are biased towards zero, especially during the initial steps. Adam applies bias correction to these estimates.
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
    \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
    Where
    \hat{m}_t is the bias-corrected 1st moment.
    \hat{v}_t is the bias-corrected 2nd moment.
    \beta_1^t, \beta_2^t are the exponential decay raised to step t, correcting bias from initialization.

    (4) Parameter Update
The final parameter update scales the bias-corrected first moment (\hat{m}_t) by the learning rate (\alpha) and divides by the square root of the bias-corrected second moment (\hat{v}_t).
    \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
    Where:
    \theta_t are model parameters.
    \alpha is the learning rate.
    \epsilon prevents division by zero.
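The four steps above can be condensed into a minimal NumPy sketch (the quadratic objective and the learning rate of 0.01 are illustrative choices):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad                       # (1) first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2                  # (2) second moment (scale)
    m_hat = m / (1 - beta1 ** t)                             # (3) bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # (4) parameter update
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta), starting from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)  # a small value near the minimum at 0
```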

    The plot below shows how the Adam optimizer efficiently moves from the starting point to the minimum of the quadratic bowl, taking adaptive steps that quickly converge to the origin.



  • DL0047 Focal Loss II

    Please compare focal loss and weighted cross-entropy.

    Answer

Weighted Cross-Entropy (WCE) rescales the loss per class to correct prior imbalance; it is simple and relatively robust to noisy labels. Focal Loss (FL) multiplies cross-entropy by a difficulty-dependent factor (1 - p_t)^\gamma that suppresses easy-example gradients and focuses learning on hard examples; it is preferable when many easy negatives overwhelm training, but requires careful tuning to avoid amplifying label noise.

    \text{WeightedCE}(p_t) = -\alpha_t \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the per-class weight for class t.

    \text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the optional per-class weight for class t;
    \gamma \ge 0 is the focusing parameter that down-weights easy examples.
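Both losses are one-liners; the sketch below (the \alpha_t = 0.25 and \gamma = 2 defaults are illustrative) shows how FL collapses the loss of a confident prediction while WCE does not:

```python
import numpy as np

def weighted_ce(p_t, alpha_t=0.25):
    """Weighted cross-entropy on the ground-truth probability p_t."""
    return -alpha_t * np.log(p_t)

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss: cross-entropy scaled by the (1 - p_t)^gamma modulator."""
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy example (p_t = 0.95) vs a hard one (p_t = 0.1):
for p in (0.95, 0.1):
    print(p, weighted_ce(p), focal_loss(p))
```

For p_t = 0.95 the focal loss is smaller than the weighted cross-entropy by a factor of (1 - 0.95)^2 = 0.0025, while for p_t = 0.1 the two losses stay within the same order of magnitude.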

    Here is a table to compare focal loss and weighted cross-entropy.

    The figure below compares Cross-Entropy, Weighted Cross-Entropy, and Focal Loss.



  • DL0046 Focal Loss

    What is focal loss, and why does it help with class imbalance?

    Answer

    Focal loss augments cross-entropy with a modulating term (1 - p_t)^\gamma and an optional balancing weight \alpha_t to suppress gradients from easy, majority-class examples and amplify learning from hard or minority-class examples, improving performance in severe class-imbalance settings when hyperparameters are properly tuned.
    (1) Focal loss formula:
    \text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \gamma \ge 0 is the focusing parameter that down-weights easy examples;
    \alpha_t \in (0,1) is an optional class-balancing weight for class t.
    (2) Modulation: The factor (1 - p_t)^\gamma reduces loss from well-classified (high-confidence) examples, concentrating gradients on hard / low-confidence examples.

    (3) Class imbalance effect:
    In cross-entropy, abundant, easy negatives still produce a large total gradient, dominating learning.
    Focal loss down-weights those contributions, ensuring rare/difficult samples have a stronger influence.
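To make the imbalance effect concrete, here is a small arithmetic sketch with assumed counts and probabilities (10,000 easy negatives at p_t = 0.95 vs 100 hard positives at p_t = 0.3):

```python
import numpy as np

gamma = 2.0
p_easy, n_easy = 0.95, 10_000   # many well-classified negatives
p_hard, n_hard = 0.3, 100       # few hard positives

# Total loss contributed by each group under plain CE vs focal loss.
ce_easy = n_easy * -np.log(p_easy)
ce_hard = n_hard * -np.log(p_hard)
fl_easy = n_easy * (1 - p_easy) ** gamma * -np.log(p_easy)
fl_hard = n_hard * (1 - p_hard) ** gamma * -np.log(p_hard)

print(ce_easy / ce_hard)   # > 1: easy negatives dominate plain CE
print(fl_easy / fl_hard)   # < 1: focal loss flips the balance
```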

    The plot below shows cross-entropy and focal-loss curves for several \gamma values and an example \alpha.



  • DL0045 Dimension in FFN

In Transformers, why does the feed-forward network expand the hidden dimension (e.g.,  d_{\text{model}} \to 4 d_{\text{model}} ) before reducing it back?

    Answer

    The feed-forward network in Transformers expands the hidden dimension (e.g.,  d_{\text{model}} \to 4 \cdot d_{\text{model}} ) to enhance the model’s ability to learn complex, non-linear feature interactions, then reduces it back to maintain compatibility with other layers. This design acts as a bottleneck, balancing expressiveness and efficiency, and has been empirically shown to boost performance in large-scale models.
    (1) Extra Capacity: Expanding from  d_{\text{model}} to  4d_{\text{model}} allows the FFN to capture richer nonlinear transformations.
    (2) Non‑linear mixing: The intermediate expansion allows the activation function (ReLU, GeLU, SwiGLU, etc.) to operate in a richer space, capturing more complex patterns.
    (3) Projection back ensures compatibility: Reducing the dimension back to  d_{\text{model}} ensures compatibility with the subsequent layers. It ensures residual connection compatibility and uniformity across layers.

    The equation and the figure below show the architecture of FFN:
     \text{FFN}(x) = W_2  \sigma(W_1 x + b_1) + b_2
    Where:
     x \in \mathbb{R}^{d_{\text{model}}} is the input vector.
     W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}} expands the dimension.
     \sigma is a non-linear activation (e.g., ReLU/GELU).
     W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}} projects back down.
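The equation maps directly to a few lines of NumPy (the dimension d_model = 8, ReLU activation, and 0.1 weight scale are illustrative choices):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4*d_model, apply ReLU, project back."""
    h = np.maximum(0.0, x @ W1.T + b1)   # hidden vector of size 4*d_model
    return h @ W2.T + b2                 # back to d_model

d_model = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4 * d_model, d_model)) * 0.1   # expansion
W2 = rng.standard_normal((d_model, 4 * d_model)) * 0.1   # projection back
b1 = np.zeros(4 * d_model)
b2 = np.zeros(d_model)

x = rng.standard_normal(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,): same as the input, so the residual x + y is valid
```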


  • DL0044 Multi-Query Attention

    What is Multi-Query Attention in transformer models?

    Answer

Multi-Query Attention (MQA) optimizes the standard Multi-Head Attention (MHA) in transformers by using multiple query heads while sharing a single key-value projection across them. This design maintains similar expressiveness to MHA but significantly reduces memory usage during inference, especially for KV caching in autoregressive tasks, making it well suited to scaling large models. It trades a small potential quality loss for efficiency.

    Comparison to Multi-Head Attention (MHA): In standard MHA, each attention head has independent projections for queries (Q), keys (K), and values (V). In MQA, only Q is projected into multiple heads, while K and V use a single projection shared across all query heads, as shown in the figure below.

    Efficiency Benefits: Reduces the memory footprint during inference, particularly with KV caching, as the cache stores only one set of K and V vectors instead of one per head, lowering memory complexity from  \mathcal{O}(n \cdot h \cdot d) to  \mathcal{O}(n \cdot d) , where  n is the sequence length,  h is the number of heads, and  d is the head dimension.

    The core attention operation for a single head in MQA can be represented by the following equation:
    \mbox{Attention}(Q_i, K_{shared}, V_{shared}) = \mbox{Softmax}(\frac{Q_i K_{shared}^T}{\sqrt{d_k}}) V_{shared}
    Where:
     Q_i represents the query vector for the ith attention head.
     K_{shared} and  V_{shared} represent the single, shared key and value vectors used by all heads.
     d_k is the dimension of the key vectors.
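A minimal NumPy sketch of the equation above (head count and dimensions are illustrative; the Q/K/V projection layers are omitted for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K_shared, V_shared):
    """Q: (h, n, d_k) per-head queries; K_shared, V_shared: (n, d_k),
    a single set shared by every query head."""
    d_k = K_shared.shape[-1]
    scores = Q @ K_shared.T / np.sqrt(d_k)       # (h, n, n)
    return softmax(scores, axis=-1) @ V_shared   # (h, n, d_k)

h, n, d_k = 4, 6, 8
rng = np.random.default_rng(0)
out = multi_query_attention(rng.standard_normal((h, n, d_k)),
                            rng.standard_normal((n, d_k)),
                            rng.standard_normal((n, d_k)))
print(out.shape)  # (4, 6, 8): h heads, all attending over one shared K/V
```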


  • DL0043 KV Cache

    What is KV Cache in transformers, and why is it useful during inference?

    Answer

    The KV Cache in transformers optimizes inference by storing and reusing key and value vectors from the attention mechanism, avoiding redundant computations for previous tokens in a sequence. This is particularly useful in autoregressive models, where each new token requires attention over all prior tokens. By caching K and V vectors, the model only computes the query for the new token and retrieves cached K and V for earlier tokens, improving speed at the cost of higher memory usage.

    The attention mechanism, optimized by KV Cache, is:
    \mathrm{Attention}(Q_t, K_{1:t}, V_{1:t}) = \mathrm{Softmax}\left(\frac{Q_t K_{1:t}^\top}{\sqrt{d_k}}\right) V_{1:t}
    Where:
     Q_t = query of the current token t.
     K_{1:t}, V_{1:t} = cached keys and values for all tokens up to t.
     d_k = key dimension.
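A toy NumPy sketch of a decoding loop with a KV cache (single head, random projection weights, no output projection; all sizes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_t, Wq, Wk, Wv, cache):
    """Attend the new token to all cached keys/values, then extend the cache."""
    q_t = x_t @ Wq                               # query for the new token only
    cache["K"].append(x_t @ Wk)                  # k_t, v_t computed once, reused later
    cache["V"].append(x_t @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    d_k = K.shape[-1]
    attn = softmax(q_t @ K.T / np.sqrt(d_k))     # (t,) weights over tokens 1..t
    return attn @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for t in range(5):                               # autoregressive loop
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv, cache)
print(len(cache["K"]), out.shape)  # 5 cached tokens, output shape (8,)
```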

    The figure below explains the KV cache in autoregressive transformers.



  • DL0042 Attention Computation

    Please break down the computational cost of attention.

    Answer

    Here is the breakdown of the computational cost of attention:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
    Input dimensions: Sequence length =  n , hidden dimension =  d , number of heads =  h .
    (1) Linear projections (Q, K, V):
    Each input  X \in \mathbb{R}^{n \times d} is projected into queries, keys, and values.
    Cost: \mathcal{O}(n \cdot d^2) (for all 3 matrices).

    (2) Attention score computation (QKᵀ):
    Queries:  Q \in \mathbb{R}^{n \times d_k}
    Keys:  K \in \mathbb{R}^{n \times d_k}
    Score matrix:
    S = QK^\top \in \mathbb{R}^{n \times n}
    Cost: \mathcal{O}(n^2 \cdot d_k)

    (3) Softmax normalization:
    For each row of the score matrix:
    \mbox{Softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}
    Where:
     s_i = raw score for position  i
     n = total sequence length
    Cost: \mathcal{O}(n^2)

    (4) Weighted sum with values (AV):
    Attention weights  A \in \mathbb{R}^{n \times n} applied to values  V \in \mathbb{R}^{n \times d_v} :
    O = AV
    Cost: \mathcal{O}(n^2 \cdot d_v)

    (5) Output projection:
    Final linear layer to mix heads back to  d .
    Cost: \mathcal{O}(n \cdot d^2)

    Total Complexity:
    Putting it all together:
    \mathcal{O}(n \cdot d^2) + \mathcal{O}(n^2 \cdot d_k) + \mathcal{O}(n^2 \cdot d_v)
    Since  d_k, d_v \approx d/h , the dominant term is:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
Overall, the cost of Multi-Head Attention (MHA) is of the same order as single-head attention, because the per-head dimensions scale as  d/h .
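The five stages above can be tallied with a small helper (single head, assuming  d_k = d_v = d , counting multiplies only; the example sizes are illustrative):

```python
def attention_flops(n, d):
    """Rough multiply count for the five attention stages."""
    proj_qkv = 3 * n * d * d    # (1) Q, K, V projections
    scores   = n * n * d        # (2) QK^T
    softmax  = n * n            # (3) row-wise softmax (exp + normalize)
    weighted = n * n * d        # (4) AV
    out_proj = n * d * d        # (5) output projection
    return proj_qkv + scores + softmax + weighted + out_proj

# Once n >> d, doubling the sequence length roughly quadruples the cost,
# reflecting the dominant n^2 * d term.
print(attention_flops(1024, 64), attention_flops(2048, 64))
```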


  • DL0041 Hierarchical Attention

    Could you explain the concept of hierarchical attention in transformer architectures?

    Answer

    Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.

    Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), this is inefficient and loses hierarchical structure.
    Hierarchical Attention Idea:
    (1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
    (2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
    This mirrors natural data hierarchies and reduces quadratic cost.
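A toy NumPy sketch of the two levels (identity Q/K/V projections and mean pooling are simplifying assumptions; segment sizes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Plain scaled dot-product self-attention, identity projections for brevity."""
    d = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d)) @ X

def hierarchical_attention(segments):
    """(1) local attention within each segment; (2) global attention across
    mean-pooled segment summaries."""
    local = [self_attention(seg) for seg in segments]          # fine-grained
    summaries = np.stack([seg.mean(axis=0) for seg in local])  # one vector per segment
    return self_attention(summaries)                           # coarse-grained

rng = np.random.default_rng(0)
segments = [rng.standard_normal((5, 8)) for _ in range(3)]  # 3 "sentences" of 5 "words"
doc = hierarchical_attention(segments)
print(doc.shape)  # (3, 8): one context vector per segment
```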

    The figure below shows Hierarchical Attention used in the document classification use case.



  • DL0040 Attention Mask

    What is the role of masking in attention?

    Answer

    Masking is a critical technique in transformer attention mechanisms that controls which parts of the input sequence the model is allowed to focus on.
    (1) Leakage prevention: Blocks access to future tokens in autoregressive decoding to preserve causality.
    (2) Padding handling: Excludes pad positions so they don’t absorb probability mass or distort context.
    (3) Structured constraint: Enforces task rules (e.g., graph neighborhoods, spans, or blocked regions).
    Core equation with mask:
    \text{Attn}(Q,K,V,M)=\text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V
    Where:
     Q query matrix.
     K key matrix.
     V value matrix.
     d_k key dimensionality (for scaling).
     M mask matrix with 0 for allowed positions and large negative values (e.g., −∞) for disallowed positions.
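A minimal NumPy sketch of the masked-attention equation with a causal mask (using -1e9 as a finite stand-in for −∞; sizes are illustrative):

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Scaled dot-product attention with an additive mask M
    (0 = allowed, -1e9 = blocked)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)   # softmax zeroes out blocked positions
    return A @ V, A

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Causal mask: position i may attend only to positions j <= i.
causal = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -1e9)
out, A = masked_attention(Q, K, V, causal)
print(np.triu(A, k=1).max())  # attention weights above the diagonal are ~0
```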

    The figure below shows three side-by-side heatmaps: a padding mask that disallows attending to padding tokens, a causal mask that enforces autoregressive decoding, and a structured mask that enforces local neighborhood constraint.

