Category: Medium

  • DL0052 Rotary Positional Embedding

    What is Rotary Positional Embedding (RoPE)?

    Answer

    Rotary Positional Embedding (RoPE) is a positional encoding method that rotates query and key vectors in multi‑head attention by position‑dependent angles. This rotation naturally encodes relative positional information, improves generalization to longer contexts, and avoids the limitations of fixed or learned absolute positional embeddings. It is used in GPT-NeoX, LLaMA, PaLM, Qwen, etc.
    It has the following characteristics:
    (1) Relative position encoding method for Transformers
    (2) Applies rotation to query (Q) and key (K) vectors using position-dependent angles
    (3) Encodes position via geometry, not by adding vectors
    (4) Preserves relative distance naturally in dot-product attention
    (5) Extrapolates well to longer sequences than the training length

    RoPE rotates each 2D pair of hidden dimensions:
    f(x, m)=\begin{pmatrix}\cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta)\end{pmatrix}\begin{pmatrix}x_1 \\x_2\end{pmatrix}
    Where:
     m represents the absolute position of the token in the sequence.
     \theta represents the rotation frequency for the dimension pair (commonly \theta_i = 10000^{-2i/d} for the i-th pair).
     x_1, x_2 represent the components of the embedding vector.
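
    As a concrete illustration (not part of the original answer), below is a minimal NumPy sketch of the pairwise rotation, assuming the common frequency schedule \theta_i = 10000^{-2i/d}; it also checks that the dot product of rotated query and key vectors depends only on the relative offset between positions.

    import numpy as np

    def rope_rotate(x, m, base=10000.0):
        # Rotate consecutive pairs (x_1, x_2) of a d-dimensional vector x at
        # position m by angles m * theta_i, with theta_i = base^(-2i/d).
        d = x.shape[0]
        i = np.arange(d // 2)
        angles = m * base ** (-2.0 * i / d)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin   # 2x2 rotation applied pairwise
        out[1::2] = x1 * sin + x2 * cos
        return out

    # The rotated dot product depends only on the relative distance m - n.
    q, k = np.random.randn(64), np.random.randn(64)
    s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)      # positions 5 and 2
    s2 = rope_rotate(q, 105) @ rope_rotate(k, 102)  # both shifted by 100
    print(np.isclose(s1, s2))                       # True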

    The plot below visualizes how RoPE makes attention decay smoothly with relative distance, while standard sinusoidal PE reflects absolute-position similarity.


  • DL0051 Sparsity in NN

    Explain the concept of “Sparsity” in neural networks.

    Answer

    Sparsity in neural networks refers to the property that many parameters (weights) or activations are exactly zero (or very close to zero).
    This leads to lighter, faster, and more interpretable models. Techniques such as L1 regularization, pruning, and ReLU activations help enforce sparsity, making networks more efficient without compromising performance.

    Common techniques and their equations:
    (1) L1 Regularization (encourages sparse weights)
     L = L_{\text{task}} + \lambda \sum_i |w_i|
    Where:
     w_i represents the i-th model weight
     \lambda controls the strength of sparsity

    (2) ReLU Activation (induces sparse activations)
     \mathrm{ReLU}(x) = \max(0, x)
    Where:
     x is the neuron input.
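
    The short PyTorch sketch below (illustrative only: the model architecture, batch, and \lambda value are arbitrary choices for the example) shows how the L1 penalty is added to the task loss and how ReLU yields exactly-zero activations.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    lam = 1e-4                                   # lambda: sparsity strength

    x = torch.randn(32, 100)                     # dummy batch
    y = torch.randint(0, 10, (32,))

    task_loss = criterion(model(x), y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + lam * l1_penalty          # L = L_task + lambda * sum_i |w_i|
    loss.backward()
    optimizer.step()

    # ReLU-induced activation sparsity: fraction of exactly-zero activations.
    acts = model[1](model[0](x))
    print((acts == 0).float().mean())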

    The plot below compares the weight distributions of models trained without L1 and with L1-induced sparsity.


  • DL0050 Knowledge Distillation

    Describe the process and benefits of knowledge distillation.

    Answer

    Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.

    Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).

    Soft Targets: The student is trained not only on hard labels (one-hot) but also on the soft output probabilities of the teacher.

    Temperature Scaling: Teacher logits are softened using a temperature T to reveal more information about class similarities:
    \mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
    Where:
     z_i : Raw score (logit) for the i-th class.
     K : Total number of classes in the classification problem.
     T : Temperature parameter (>0) used to soften the probabilities. Higher  T produces a smoother distribution, revealing relationships between classes (“dark knowledge”).

    The plot below shows the Softmax probabilities for a fixed set of teacher logits under three different temperatures. Increasing the temperature smooths the distribution.

    Loss Function: Typically combines distillation loss (difference between teacher and student soft outputs) and standard cross-entropy loss with true labels.
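
    The PyTorch sketch below shows one common way to combine the two terms; the temperature T and mixing weight alpha are illustrative defaults, not prescribed values.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft-target term: KL between temperature-softened teacher and student.
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_soft_student = F.log_softmax(student_logits / T, dim=-1)
        # T^2 rescales gradients so the soft term stays comparable across T.
        kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
        # Hard-label term: standard cross-entropy with the true labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    # Usage with dummy logits and labels.
    student = torch.randn(8, 10, requires_grad=True)
    teacher = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    print(distillation_loss(student, teacher, labels))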

    Key Benefits of KD:
    (1) Model compression: The student is smaller and faster while retaining much of the teacher’s performance, enabling deployment on resource-constrained devices.
    (2) Inference speed: Significantly decreases latency, making the model suitable for deployment on edge devices or real-time applications.
    (3) Improved generalization: The teacher’s smooth soft targets act as a powerful form of regularization, often leading the student to generalize better than if it were trained only on hard labels.

    The plot below demonstrates the Knowledge Distillation (KD) process.



  • DL0047 Focal Loss II

    Please compare focal loss and weighted cross-entropy.

    Answer

    Weighted Cross-Entropy (WCE) rescales the loss by class to correct prior imbalance and is simple and robust for noisy labels; Focal Loss (FL) multiplies cross-entropy by the difficulty-dependent modulating factor (1 - p_t)^\gamma, controlled by the focusing parameter \gamma, to suppress easy-example gradients and focus learning on hard examples, making it preferable when many easy negatives overwhelm training but requiring careful tuning to avoid amplifying label noise.

    \text{WeightedCE}(p_t) = -\alpha_t \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the per-class weight for class t.

    \text{FocalLoss}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
    Where:
    p_t is the model probability for the ground-truth class;
    \alpha_t is the optional per-class weight for class t;
    \gamma \ge 0 is the focusing parameter that down-weights easy examples.
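
    The small PyTorch sketch below evaluates both formulas on a toy easy example (p_t = 0.9) and hard example (p_t = 0.1); \alpha_t = 1 and \gamma = 2 are chosen purely for illustration.

    import torch

    def weighted_ce(p_t, alpha_t):
        # WeightedCE(p_t) = -alpha_t * log(p_t)
        return -alpha_t * torch.log(p_t)

    def focal_loss(p_t, alpha_t, gamma=2.0):
        # FocalLoss(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
        return -alpha_t * (1 - p_t) ** gamma * torch.log(p_t)

    # Easy example (p_t = 0.9) vs hard example (p_t = 0.1).
    p = torch.tensor([0.9, 0.1])
    print(weighted_ce(p, 1.0))   # ~[0.105, 2.303]
    print(focal_loss(p, 1.0))    # ~[0.001, 1.865]: the easy example is heavily down-weighted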

    The table below compares focal loss and weighted cross-entropy.

    The figure below compares Cross-Entropy, Weighted Cross-Entropy, and Focal Loss.



  • DL0045 Dimension in FFN

    In Transformers, why does the feed-forward network expand the hidden dimension (e.g.,  d_{\text{model}} \to 4 d_{\text{model}} ) before reducing it back?

    Answer

    The feed-forward network in Transformers expands the hidden dimension (e.g.,  d_{\text{model}} \to 4 \cdot d_{\text{model}} ) to enhance the model’s ability to learn complex, non-linear feature interactions, then reduces it back to maintain compatibility with other layers. This design acts as a bottleneck, balancing expressiveness and efficiency, and has been empirically shown to boost performance in large-scale models.
    (1) Extra Capacity: Expanding from  d_{\text{model}} to  4d_{\text{model}} allows the FFN to capture richer nonlinear transformations.
    (2) Non‑linear mixing: The intermediate expansion allows the activation function (ReLU, GeLU, SwiGLU, etc.) to operate in a richer space, capturing more complex patterns.
    (3) Projection back: Reducing the dimension back to  d_{\text{model}} keeps the output compatible with the residual connection and subsequent layers, maintaining a uniform width across the network.

    The equation and the figure below show the architecture of FFN:
     \text{FFN}(x) = W_2  \sigma(W_1 x + b_1) + b_2
    Where:
     x \in \mathbb{R}^{d_{\text{model}}} is the input vector.
     W_1 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}} expands the dimension.
     \sigma is a non-linear activation (e.g., ReLU/GELU).
     W_2 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}} projects back down.
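
    A minimal PyTorch sketch of this block (GELU and d_model = 512 are illustrative choices):

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        # Position-wise FFN: expand d_model -> 4*d_model, apply the
        # nonlinearity, then project back to d_model.
        def __init__(self, d_model, expansion=4):
            super().__init__()
            self.w1 = nn.Linear(d_model, expansion * d_model)  # W_1, b_1
            self.w2 = nn.Linear(expansion * d_model, d_model)  # W_2, b_2
            self.act = nn.GELU()

        def forward(self, x):
            return self.w2(self.act(self.w1(x)))  # W_2 sigma(W_1 x + b_1) + b_2

    ffn = FeedForward(d_model=512)
    x = torch.randn(2, 16, 512)    # (batch, sequence, d_model)
    print(ffn(x).shape)            # torch.Size([2, 16, 512]): residual-compatible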


  • DL0044 Multi-Query Attention

    What is Multi-Query Attention in transformer models?

    Answer

    Multi-Query Attention (MQA) optimizes the standard Multi-Head Attention (MHA) in transformers by using multiple query heads while sharing a single key-value projection across them. This design maintains similar computational expressiveness to MHA but significantly reduces memory usage during inference, especially in KV caching for autoregressive tasks, making it ideal for scaling large models. It trades a minor potential loss in quality for efficiency.

    Comparison to Multi-Head Attention (MHA): In standard MHA, each attention head has independent projections for queries (Q), keys (K), and values (V). In MQA, only Q is projected into multiple heads, while K and V use a single projection shared across all query heads, as shown in the figure below.

    Efficiency Benefits: Reduces memory footprint during inference, particularly with KV caching, as the cache stores only one set of K and V vectors instead of one per head, lowering memory complexity from  \mathcal{O}(n \cdot h \cdot d) to  \mathcal{O}(n \cdot d) , where  n is the sequence length,  h is the number of heads, and  d is the head dimension.

    The core attention operation for a single head in MQA can be represented by the following equation:
    \mbox{Attention}(Q_i, K_{shared}, V_{shared}) = \mbox{Softmax}(\frac{Q_i K_{shared}^T}{\sqrt{d_k}}) V_{shared}
    Where:
     Q_i represents the query vector for the ith attention head.
     K_{shared} and  V_{shared} represent the single, shared key and value vectors used by all heads.
     d_k is the dimension of the key vectors.
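
    A simplified PyTorch sketch (no masking or KV cache; dimensions are illustrative) that projects h query heads but a single shared key/value head and broadcasts it across the query heads:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiQueryAttention(nn.Module):
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.h, self.d_k = n_heads, d_model // n_heads
            self.q_proj = nn.Linear(d_model, d_model)    # h query heads
            self.k_proj = nn.Linear(d_model, self.d_k)   # one shared key head
            self.v_proj = nn.Linear(d_model, self.d_k)   # one shared value head
            self.out = nn.Linear(d_model, d_model)

        def forward(self, x):
            b, n, _ = x.shape
            q = self.q_proj(x).view(b, n, self.h, self.d_k).transpose(1, 2)  # (b, h, n, d_k)
            k = self.k_proj(x).unsqueeze(1)   # (b, 1, n, d_k), broadcast over heads
            v = self.v_proj(x).unsqueeze(1)
            scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5               # (b, h, n, n)
            ctx = F.softmax(scores, dim=-1) @ v                              # (b, h, n, d_k)
            return self.out(ctx.transpose(1, 2).reshape(b, n, -1))

    mqa = MultiQueryAttention(d_model=512, n_heads=8)
    print(mqa(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])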


  • DL0042 Attention Computation

    Please break down the computational cost of attention.

    Answer

    Here is the breakdown of the computational cost of attention:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
    Input dimensions: Sequence length =  n , hidden dimension =  d , number of heads =  h .
    (1) Linear projections (Q, K, V):
    Each input  X \in \mathbb{R}^{n \times d} is projected into queries, keys, and values.
    Cost: \mathcal{O}(n \cdot d^2) (for all 3 matrices).

    (2) Attention score computation (QKᵀ):
    Queries:  Q \in \mathbb{R}^{n \times d_k}
    Keys:  K \in \mathbb{R}^{n \times d_k}
    Score matrix:
    S = QK^\top \in \mathbb{R}^{n \times n}
    Cost: \mathcal{O}(n^2 \cdot d_k)

    (3) Softmax normalization:
    For each row of the score matrix:
    \mbox{Softmax}(s_i) = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}
    Where:
     s_i = raw score for position  i
     n = total sequence length
    Cost: \mathcal{O}(n^2)

    (4) Weighted sum with values (AV):
    Attention weights  A \in \mathbb{R}^{n \times n} applied to values  V \in \mathbb{R}^{n \times d_v} :
    O = AV
    Cost: \mathcal{O}(n^2 \cdot d_v)

    (5) Output projection:
    Final linear layer to mix heads back to  d .
    Cost: \mathcal{O}(n \cdot d^2)

    Total Complexity:
    Putting it all together:
    \mathcal{O}(n \cdot d^2) + \mathcal{O}(n^2 \cdot d_k) + \mathcal{O}(n^2 \cdot d_v)
    Since  d_k, d_v \approx d/h and these per-head costs are incurred for each of the  h heads, the total simplifies to the dominant terms:
    \mathcal{O}(n^2 \cdot d + n \cdot d^2)
    Overall, the cost of Multi-Head Attention (MHA) is of the same order as single-head attention, because the per-head dimensions scale as  d/h .
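
    A rough multiply-accumulate count that mirrors the terms above (ignoring softmax, biases, and constant factors; the values of n, d, and h are illustrative) shows the n^2 terms dominating as the sequence grows:

    def attention_flops(n, d, h):
        d_head = d // h
        proj_qkv = 3 * n * d * d        # Q, K, V projections: O(n * d^2)
        scores = h * n * n * d_head     # QK^T across all heads: O(n^2 * d)
        weighted = h * n * n * d_head   # AV across all heads: O(n^2 * d)
        out_proj = n * d * d            # output projection: O(n * d^2)
        return proj_qkv + scores + weighted + out_proj

    for n in (512, 2048, 8192):
        print(n, f"{attention_flops(n, d=1024, h=16):.3e}")
    # Quadrupling n multiplies the n^2 terms by ~16, so they dominate at long n.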


  • DL0041 Hierarchical Attention

    Could you explain the concept of hierarchical attention in transformer architectures?

    Answer

    Hierarchical Attention in Transformers applies self-attention at multiple levels of granularity (e.g., words to sentences to documents). Instead of one flat attention over all tokens, it computes local attention within segments and then global attention across segments, leading to efficiency gains, better structure modeling, and interpretable focus at each level.

    Motivation: Transformers normally apply flat self-attention over all tokens. For long structured inputs (documents, videos, graphs), this is inefficient and loses hierarchical structure.
    Hierarchical Attention Idea:
    (1) Local level (fine-grained): Compute attention within smaller segments (e.g., words within a sentence, frames within a shot).
    (2) Global level (coarse-grained): Aggregate segment representations, then apply attention across segments (e.g., sentences within a document, shots within a video).
    This mirrors natural data hierarchies and reduces quadratic cost.
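
    A minimal two-level PyTorch sketch (mean pooling and all shapes are illustrative simplifications): local attention runs inside each segment, segment vectors are pooled, and global attention runs across segments.

    import torch
    import torch.nn as nn

    d_model, n_segments, seg_len = 256, 8, 32
    local_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
    global_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    tokens = torch.randn(n_segments, seg_len, d_model)  # each segment as a batch item

    # Level 1: fine-grained attention inside each segment (e.g., words in a sentence).
    local_out, _ = local_attn(tokens, tokens, tokens)    # (n_segments, seg_len, d_model)

    # Level 2: pool each segment, then coarse-grained attention across segments.
    segment_repr = local_out.mean(dim=1).unsqueeze(0)    # (1, n_segments, d_model)
    global_out, _ = global_attn(segment_repr, segment_repr, segment_repr)
    print(global_out.shape)                              # torch.Size([1, 8, 256])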

    The figure below shows Hierarchical Attention used in the document classification use case.



  • DL0037 Transformer Architecture III

    Why do Transformers use a dot product, rather than addition, to compute attention scores?

    Answer

    Dot product attention is a fast and naturally aligned similarity measure; with scaling, it remains numerically stable and highly parallelizable, which is why Transformers prefer it over addition.
    (1) Dot product captures similarity: The dot product between query  q and key  k grows larger when they point in similar directions, making it a natural similarity measure.
    The scores are normalized with Softmax and have a probabilistic interpretation:
    \alpha_i = \frac{e^{q \cdot k_i}}{\sum_{j=1}^K e^{q \cdot k_j}}
    Where:
     q \cdot k_i is the dot product similarity between query and key.

    The figure below illustrates the dot product for measuring similarity.

    (2) Efficient computation: Dot products can be computed in parallel as a matrix multiplication  QK^\top , which is hardware-friendly.
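
    A toy NumPy check (the vectors are chosen purely for illustration): the dot product is largest for the key most aligned with the query, and Softmax turns the scores into attention weights.

    import numpy as np

    q = np.array([1.0, 0.0])
    keys = np.array([[0.9, 0.1],    # nearly aligned with q
                     [0.0, 1.0],    # orthogonal to q
                     [-1.0, 0.0]])  # opposite direction
    scores = keys @ q                              # q . k_i for each key
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over keys
    print(scores)  # [ 0.9  0.  -1. ]
    print(alpha)   # highest weight on the aligned key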



  • DL0029 Dilated Attention

    Could you explain the concept of dilated attention in transformer architectures?

    Answer

    Dilated attention introduces gaps between attention positions to sparsify computation, enabling efficient long-range dependency modeling. It is particularly helpful in tasks requiring scalable attention over long sequences. It trades off some granularity for global context by spreading attention more widely and sparsely.

    Dilated attention is similar to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
    Instead of attending to all tokens (as in standard self-attention), each query token attends only to every d-th token; the dilation rate d controls the stride of the attention pattern.

    Reduction in Complexity: Reduces attention computation and memory from  \mathcal{O}(n^2) to a lower cost that depends on the sparsity pattern (roughly  \mathcal{O}(n^2 / d) for a dilation rate of  d ).

    In dilated attention, the dot-product  QK^\top is computed only at dilated positions:
    \mbox{Attention}_{\text{dilated}}(Q, K, V) = \mbox{Softmax}\left(\frac{QK_d^\top}{\sqrt{d_k}}\right) V_d
    Where:
     K_d, V_d are the dilated subsets of keys and values.
     d_k is the key dimension.
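
    A minimal NumPy sketch (plain strided subsampling of keys and values; no masking or block structure) of this computation with a dilation rate of 3:

    import numpy as np

    def dilated_attention(Q, K, V, dilation=3):
        # Each query attends only to every `dilation`-th key/value (K_d, V_d).
        K_d, V_d = K[::dilation], V[::dilation]
        d_k = K.shape[-1]
        scores = Q @ K_d.T / np.sqrt(d_k)                # (n, n/dilation) instead of (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V_d

    n, d = 12, 8
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    print(dilated_attention(Q, K, V).shape)  # (12, 8)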

    Below is the visualization of dilated attention with a dilation rate of 3.

