Author: admin

  • DL0030 Positional Encoding

    Explain “Positional Encoding” in Transformers. Why is it necessary?

    Answer

    Positional encoding is crucial in Transformers to equip the model with an understanding of token order while maintaining full parallel computation. Fixed sinusoidal functions offer parameter-free generalization to unseen lengths, learned embeddings provide task-specific flexibility, and relative schemes directly capture inter-token distances.

    Self-attention is permutation-invariant and, on its own, cannot distinguish token order. Positional encodings inject sequence information by adding position-dependent vectors to token embeddings.

    Encoding types:
    (1) Fixed (sinusoidal): Predefined, parameter-free functions of position. Sine and cosine functions at different frequencies encode absolute position, and because PE(pos + k) is a linear function of PE(pos), the model can also attend to relative positions.
    (2) Learned: Trainable embedding vectors optimized during training. They offer task-specific flexibility but may not generalize beyond the maximum sequence length seen in training.

    Sinusoidal Encoding Formula:
    PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    Where:
     pos : token position in the sequence
     i : dimension index
     d_{\text{model}} : embedding dimension
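    The formula above can be sketched in NumPy as follows (function and variable names are illustrative, and d_model is assumed even):

    ```python
    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Build the (max_len, d_model) sinusoidal encoding matrix."""
        positions = np.arange(max_len)[:, np.newaxis]      # pos: (max_len, 1)
        dims = np.arange(0, d_model, 2)[np.newaxis, :]     # 2i: (1, d_model/2)
        angles = positions / np.power(10000.0, dims / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
        pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
        return pe

    pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
    print(pe.shape)  # (50, 16)
    ```

    In a Transformer, this matrix is simply added to the token embeddings before the first attention layer.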

    The figure below shows how the encoding values change across different positions and dimensions.


  • DL0029 Dilated Attention

    Could you explain the concept of dilated attention in transformer architectures?

    Answer

    Dilated attention introduces gaps between attention positions to sparsify computation, enabling efficient long-range dependency modeling. It is particularly helpful in tasks requiring scalable attention over long sequences. It trades off some granularity for global context by spreading attention more widely and sparsely.

    Dilated attention is similar to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
    Instead of attending to all tokens (as in standard self-attention), each query token attends to every d-th token. This dilation rate controls the stride in attention.

    Reduction in Complexity: With dilation rate d, each query attends to roughly n/d tokens, cutting attention computation and memory from \mathcal{O}(n^2) to approximately \mathcal{O}(n^2/d); the exact cost depends on the sparsity pattern.

    In dilated attention, the dot-product  QK^\top is computed only at dilated positions:
    \mbox{Attention}_{\text{dilated}}(Q, K, V) = \mbox{Softmax}\left(\frac{QK_d^\top}{\sqrt{d_k}}\right) V_d
    Where:
     K_d, V_d are the dilated subsets of keys and values.
     d_k is the key dimension.
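    A simplified single-head NumPy sketch of the formula above, where a strided subset of keys and values stands in for the dilation pattern (real implementations often vary the pattern per query position):

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def dilated_attention(Q, K, V, dilation=3):
        """Each query attends only to every `dilation`-th key/value."""
        d_k = K.shape[-1]
        K_d = K[::dilation]            # dilated subset of keys
        V_d = V[::dilation]            # dilated subset of values
        scores = Q @ K_d.T / np.sqrt(d_k)
        return softmax(scores) @ V_d

    rng = np.random.default_rng(0)
    n, d_k = 12, 8
    Q, K, V = rng.normal(size=(3, n, d_k))
    out = dilated_attention(Q, K, V, dilation=3)
    print(out.shape)  # (12, 8): same shape as full attention, 1/3 the keys
    ```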

    Below is the visualization of dilated attention with a dilation rate of 3.


  • DL0028 Sliding Window Attention

    Explain the sliding window attention mechanism in transformer architectures.

    Answer

    Sliding window attention is an optimization that addresses the scalability issues of the standard self-attention mechanism. It improves efficiency by limiting the attention scope of each token to a local, fixed-size window. This enables transformer models to handle longer sequences more effectively without a quadratic increase in computational resources. The trade-off is a potential loss of global context.

    Purpose: Efficiently scale attention for long sequences by restricting each token’s attention to a fixed-size local window instead of the full sequence.
    Window Size: Each token attends only to tokens within a fixed window of size  w (e.g., the token itself and  \pm \frac{w}{2} neighbors).
    Sparse Attention: Results in a sparse attention matrix — reduces memory and computation from  \mathcal{O}(n^2) to  \mathcal{O}(n \cdot w) .
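    The points above can be sketched in NumPy by masking the score matrix to a band around the diagonal (a dense mask is used here for clarity; an efficient implementation would never materialize the full n × n matrix):

    ```python
    import numpy as np

    def sliding_window_attention(Q, K, V, w=4):
        """Each token attends only to neighbors within +/- w//2 positions."""
        n, d_k = Q.shape
        scores = Q @ K.T / np.sqrt(d_k)
        # Mask out positions outside each token's local window.
        idx = np.arange(n)
        mask = np.abs(idx[:, None] - idx[None, :]) > w // 2
        scores[mask] = -np.inf
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V, weights

    rng = np.random.default_rng(0)
    n, d_k = 10, 8
    Q, K, V = rng.normal(size=(3, n, d_k))
    out, weights = sliding_window_attention(Q, K, V, w=4)
    print(out.shape)                    # (10, 8)
    print((weights > 0).sum(axis=-1))   # at most w + 1 nonzero weights per row
    ```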

    Here is a side-by-side comparison of Global Attention vs Sliding Window Attention:
    Global Attention: each token attends to all others (dense attention matrix).
    Sliding Window Attention: each token attends only to a small window of nearby tokens (a sparse band around the diagonal).


  • DL0027 Multi-Head Attention

    How does multi-head attention work in transformer architectures?

    Answer

    Multi-head attention projects the input into multiple distinct subspaces, with each head performing scaled dot-product attention independently on the full input sequence. By attending to different aspects or relationships within the data, these separate heads capture diverse information patterns. Their outputs are then combined to form a richer, more expressive representation, enabling the model to understand complex dependencies better and improve overall performance.

    Outputs from all heads are concatenated and linearly projected to form the final output.
    All heads are computed in parallel, enabling efficient computation.

    \mbox{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O
    Where:
    \text{head}_i = \mbox{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
    W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}
     W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}} : Final output projection matrix that maps the concatenated attention outputs back to the original model dimension.
     h : Number of attention heads.
     d_{\text{model}} : Dimensionality of the input embeddings and final output.
     d_k = d_{\text{model}} / h : Dimension of each head’s projected subspace.
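    The formulas above map directly to a NumPy sketch (projection matrices are random here rather than learned, and names are illustrative):

    ```python
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_attention(X, params):
        """X: (n, d_model); params holds per-head W_q/W_k/W_v and shared W_o."""
        heads = []
        for Wq, Wk, Wv in params["heads"]:
            Q, K, V = X @ Wq, X @ Wk, X @ Wv      # project into head subspace
            d_k = Q.shape[-1]
            A = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention
            heads.append(A @ V)
        # Concatenate all heads, then apply the output projection W_o.
        return np.concatenate(heads, axis=-1) @ params["W_o"]

    rng = np.random.default_rng(0)
    n, d_model, h = 6, 16, 4
    d_k = d_model // h
    X = rng.normal(size=(n, d_model))
    params = {
        "heads": [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
                  for _ in range(h)],
        "W_o": rng.normal(size=(h * d_k, d_model)),
    }
    out = multi_head_attention(X, params)
    print(out.shape)  # (6, 16): back to d_model after the output projection
    ```

    The loop over heads is written sequentially for readability; in practice all heads are computed as one batched matrix multiplication.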

    The below figure shows a single-head attention heatmap and 4 independent multi-head attention heatmaps.



  • DL0026 Self-Attention vs Cross-Attention

    What distinguishes self-attention from cross-attention in transformer models?

    Answer

    Self-attention allows a sequence to attend to itself, making it powerful for capturing intra-sequence relationships. Cross-attention bridges different sequences, crucial for combining encoder and decoder representations in tasks like machine translation.
    Input Scope:
    Self-Attention: Query, key, and value all come from the same input sequence
    Cross-Attention: Query comes from one sequence, key and value come from a different source

    Usage in Transformer Architecture:
    Self-Attention: Used in both the encoder and decoder for modeling internal dependencies
    Cross-Attention: Used in the decoder to integrate the encoder output

    Both mechanisms use the scaled dot-product attention formula:
    \mbox{Attention}(Q, K, V) = \mbox{Softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
    Where:
     Q ,  K , and  V represent query, key, and value matrices, respectively
     d_k is the dimensionality of the key vectors
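    The distinction is just where Q, K, and V come from, as this minimal sketch shows (learned projections are omitted for clarity):

    ```python
    import numpy as np

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    rng = np.random.default_rng(0)
    d = 8
    encoder_out = rng.normal(size=(10, d))   # source sequence, length 10
    decoder_x = rng.normal(size=(4, d))      # target sequence, length 4

    # Self-attention: Q, K, V all come from the same sequence.
    self_out = attention(decoder_x, decoder_x, decoder_x)
    # Cross-attention: Q from the decoder, K and V from the encoder output.
    cross_out = attention(decoder_x, encoder_out, encoder_out)
    print(self_out.shape, cross_out.shape)  # (4, 8) (4, 8)
    ```

    Note that the output length always follows the query sequence, which is why cross-attention lets a 4-token decoder state gather information from a 10-token source.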

    The plot below on the left demonstrates self-attention by showing a token’s attention to all other tokens within the same sequence. The plot below on the right illustrates cross-attention, where tokens from one sequence (the decoder) attend to tokens from another, separate sequence (the encoder).



  • DL0025 Attention Mechanism

    Please explain the concept of “Attention Mechanism.”

    Answer

    The attention mechanism is a technique in neural networks that allows the model to focus on specific parts of the input sequence when making predictions. It addresses the limitation of traditional sequence-to-sequence models that compress an entire input sequence into a single fixed-size context vector, which can lose information, especially for long sequences.

    Attention lets the model dynamically decide which parts of the input are most important for each output step. For each output token, attention computes a weighted sum over all input tokens. These weights represent how much “attention” the model should pay to each input.

    Key Components:
    Query (Q): Represents what we are looking for or the current element being processed.
    Key (K): Represents what information is available from the input.
    Value (V): The actual information content to be extracted if a key matches the query.
    Each output uses a query to compare with keys and then uses the scores to weight values.

    Calculation (Scaled Dot-Product Attention):
    Similarity Score: Calculated by taking the dot product of the Query with each Key.
    Scaling: The scores are scaled down by the square root of the dimension of the keys ( d_k ) to reduce variance and prevent large values from pushing the Softmax function into regions with tiny gradients.
    Normalization: Normalized into a probability distribution using the Softmax function. Ensures the weights sum to 1.
    Weighted Sum: Multiplied by the Values to get the final attention output.

    \mbox{Attention}(Q, K, V) = \mbox{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    Where:
     Q, K, V : Matrices of queries, keys, and values.
     d_k : Dimension of key vectors.
     \mbox{Softmax} : Converts similarity scores to probabilities.
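    The four calculation steps above can be written out directly in NumPy (names are illustrative):

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T                # 1. similarity: dot product of Q with each K
        scores = scores / np.sqrt(d_k)  # 2. scaling by sqrt(d_k)
        # 3. softmax normalization (shifted by the row max for numerical stability)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V, weights     # 4. weighted sum of the values

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 5, 4))  # toy 5-token sequence, d_k = 4
    out, weights = scaled_dot_product_attention(Q, K, V)
    print(out.shape)             # (5, 4)
    print(weights.sum(axis=-1))  # each row of attention weights sums to 1
    ```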

    The plot below shows how much “attention” each input token receives in a simplified attention mechanism. It uses softmax-normalized weights over a 5-token sentence.



  • ML0065 Random Forest III

    How to choose the number of features in a random forest?

    Answer

    Select the number of features (m) using rules of thumb (default heuristics), then tune via cross-validation or out-of-bag (OOB) error to find the best value for your specific dataset.

    Default Heuristics:
    Classification: m = \sqrt{p}
    Regression: m = \frac{p}{3}
    Where:
    p = total number of features,
    m = number of features considered at each split.

    Bias-Variance Trade-off:
    (1) Smaller max_features will increase randomness, leading to less correlated trees (reducing variance) but potentially higher bias.
    (2) Larger max_features will decrease randomness, leading to more correlated trees (increasing variance) but potentially lower bias.

    Grid Search/Randomized Search:
    This is the most robust method. Define a range of possible max_features values and use cross-validation to evaluate the model’s performance for each value.

    Out-of-Bag (OOB) Error:
    Random Forests can estimate the generalization error internally using OOB samples. You can monitor the OOB error as you vary max_features to find the optimal value.
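    A short scikit-learn sketch of this workflow on a synthetic dataset, comparing a few max_features candidates (including the sqrt(p) heuristic) by both cross-validation and the built-in OOB estimate:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    for m in [2, int(np.sqrt(20)), 10, 20]:
        rf = RandomForestClassifier(
            n_estimators=100, max_features=m, oob_score=True, random_state=0
        )
        cv_acc = cross_val_score(rf, X, y, cv=5).mean()  # 5-fold CV accuracy
        rf.fit(X, y)                                     # fit to get the OOB score
        print(f"max_features={m:2d}  CV acc={cv_acc:.3f}  OOB acc={rf.oob_score_:.3f}")
    ```

    For a more exhaustive search, the same candidates can be passed to GridSearchCV as a param_grid over max_features.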

    The figure below shows the cross-validation accuracy curve when using different numbers of features.


  • ML0064 Random Forest II

    Please explain the benefits and drawbacks of random forest.

    Answer

    Random Forest is a powerful ensemble method that reduces overfitting and improves predictive accuracy by combining many decision trees. However, it trades interpretability and computational efficiency for these benefits and may require careful tuning when dealing with large, imbalanced, or sparse datasets.

    Benefits of random forest:
    (1) Reduces Overfitting: Aggregating many trees lowers variance.
    (2) Robust to Noise and Outliers: Less sensitive to anomalous data.
    (3) Handles High Dimensionality: Works well with many input features.
    (4) Estimates Feature Importance: Helps identify influential variables.
    (5) Built-in Bagging: Bootstrap sampling improves generalization.

    Drawbacks of random forest:
    (1) Less Interpretability: Hard to visualize or explain compared to a single decision tree.
    (2) Computational Cost: Training and prediction can be slower with many trees.
    (3) Memory Usage: Large forests can consume significant resources.
    (4) Biased with Imbalanced Data: Class imbalance can lead to biased predictions.
    (5) Not Always Optimal for Sparse Data: May underperform compared to other algorithms on very sparse datasets.

    The example below demonstrates that a random forest can underperform on an imbalanced dataset.



  • ML0063 Random Forest

    How does the random forest algorithm operate? Please outline its key steps.

    Answer

    Random Forest builds an ensemble of decision trees using bootstrapped samples and random feature subsets at each split. This combination reduces variance, combats overfitting, and improves predictive accuracy. The final output aggregates the predictions of all trees (majority vote for classification, averaging for regression).
    (1) Bootstrap Sampling: Create multiple subsets of the original training data by sampling with replacement (bootstrap samples).
    (2) Grow Decision Trees: For each bootstrap sample, train an unpruned decision tree.
    (3) Random Feature Selection: At each split in a tree, randomly select a subset of features. The split is chosen only among this random subset (increases diversity).
    (4) Aggregate Results with Voting or Averaging:
    Classification: Each tree votes for a class label. The majority vote is used.
    \hat{y} = \mathrm{mode}\{\, T_b(x) : b = 1, \ldots, B \,\}
    Where:
     T_b(x) = prediction of the b-th tree.
     B = total number of trees.

    Regression: Each tree predicts a numeric value. The average is used.
    \hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x)
    Where:
     T_b(x) = prediction of the b-th tree.
     B = total number of trees.
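    The four steps above can be sketched by hand with scikit-learn decision trees as the base learners (B and the dataset are illustrative; in practice you would use RandomForestClassifier directly):

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    rng = np.random.default_rng(0)
    B = 25
    trees = []
    for _ in range(B):
        # (1) Bootstrap sampling: draw n rows with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        # (2) + (3) Grow an unpruned tree that considers a random sqrt(p)
        # feature subset at each split (max_features="sqrt").
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
        trees.append(tree.fit(X[idx], y[idx]))

    # (4) Aggregate: majority vote across the B trees (binary labels here,
    # so the mode is just "more than half the trees voted 1").
    votes = np.stack([t.predict(X) for t in trees])   # (B, n)
    y_hat = (votes.mean(axis=0) > 0.5).astype(int)
    print("training accuracy:", (y_hat == y).mean())
    ```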

    The example below shows the decision boundary differences between three decision trees and their random forest ensemble.


  • ML0062 Decision Tree

    Please explain how a decision tree works.

    Answer

    A decision tree partitions the input space into regions by recursively splitting on features that best separate the target variable. Each split aims to improve the “purity” of the resulting subsets, as measured by criteria such as Gini impurity or Entropy. Predictions are made by following the sequence of splits down to a leaf node and returning the most common class (classification) or average target (regression).

    Structure: A tree of nodes where each internal node tests a feature, branches represent feature outcomes, and leaves give predictions.

    Splitting Criterion: Chooses the best feature (and threshold) by maximizing purity—e.g., Information Gain, Gini Impurity, or Variance Reduction.

    Recursive Growth: Starting at the root, data is split, then the process recurses on each subset until stopping criteria (max depth, min samples, or pure leaves) are met.

    Prediction: A new sample “travels” from root to leaf by following feature-test branches; the leaf’s label or value is returned.
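    A minimal scikit-learn sketch of the above on a toy 2-feature dataset (parameters are illustrative):

    ```python
    from sklearn.datasets import make_moons
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy 2-feature classification problem.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Splits are chosen by Gini impurity; recursive growth stops at max_depth.
    clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    clf.fit(X, y)

    # A new sample travels root -> leaf along the learned feature tests.
    print(clf.predict([[0.0, 0.5]]))
    # export_text prints the tree's feature tests and leaf predictions.
    print(export_text(clf, feature_names=["x1", "x2"]))
    ```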

    The example below demonstrates using a Decision Tree on a 2-feature dataset for classification.

