Category: Easy

  • DL0031 FFN in Transformer

    What is the purpose of the feed-forward network inside each Transformer block?

    Answer

    The feed-forward network (FFN) inside each Transformer block processes each token’s features independently after attention, expands and transforms them non-linearly, and projects them back to the model’s dimension. This ensures that after attention has mixed information across tokens, each token’s representation is individually refined for richer feature learning.

    Purpose of FFN:
    (1) Non-linear transformation: Adds non-linearity after attention, allowing the model to capture complex patterns.
    (2) Token-wise processing: Applies the same transformation to each token independently (no mixing across positions).
    (3) Dimensional expansion: Often increases dimensionality in the hidden layer to give the network more capacity.
    (4) Feature recombination: Refines and reweights token representations produced by the attention mechanism.
    (5) Complement to attention: Attention mixes information across tokens; the FFN processes each token’s features deeply.

    Typical FFN equation in a Transformer:
    \mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
    Where:
     x — input vector for a token after the attention layer
     W_1, W_2 — trainable weight matrices
     b_1, b_2 — trainable bias vectors
     \max(0, \cdot) — ReLU activation (sometimes replaced by GELU)
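
    As a minimal NumPy sketch of this equation (the dimensions and random weights below are illustrative assumptions, not values from any particular model):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, apply ReLU, project back to d_model."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # max(0, x W1 + b1)
    return hidden @ W2 + b2                 # hidden W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                       # hidden layer wider than d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))           # 3 tokens after attention
out = ffn(x, W1, b1, W2, b2)
print(out.shape)                            # (3, 4): back to d_model per token
```

    Note that the tokens are transformed independently: running the FFN on a single token gives the same result as running it on the whole batch, illustrating the token-wise property in point (2).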


  • DL0030 Positional Encoding

    Explain “Positional Encoding” in Transformers. Why is it necessary?

    Answer

    Positional encoding is crucial in Transformers to equip the model with an understanding of token order while maintaining full parallel computation. Fixed sinusoidal functions offer parameter-free generalization to unseen lengths, learned embeddings provide task-specific flexibility, and relative schemes directly capture inter-token distances.

    Self-attention is permutation-invariant and, on its own, cannot distinguish token order. Positional encodings inject sequence information by adding position-dependent vectors to token embeddings.

    Encoding types:
    (1) Fixed (sinusoidal): Predefined sine and cosine functions of position at different frequencies; parameter-free, lets the model attend to both relative and absolute positions, and generalizes to sequence lengths not seen in training.
    (2) Learned: Trainable embedding vectors optimized during training; task-flexible, but may not generalize beyond the maximum sequence length seen during training.

    Sinusoidal Encoding Formula:
    PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
    Where:
     pos : token position in the sequence
     i : dimension index
     d_{\text{model}} : embedding dimension
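
    A small NumPy sketch of the formula (the sequence length and embedding dimension are illustrative assumptions):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 2i
    angle = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions use cosine
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
print(pe[0, :2])   # position 0: sin(0) = 0, cos(0) = 1
```

    Because the function is parameter-free, the same code produces encodings for any `max_len`, which is why sinusoidal encodings can extend to sequence lengths not seen during training.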

    The figure below shows how the encoding values change across different positions and dimensions.


  • DL0027 Multi-Head Attention

    How does multi-head attention work in transformer architectures?

    Answer

    Multi-head attention projects the input into multiple distinct subspaces, with each head performing scaled dot-product attention independently on the full input sequence. By attending to different aspects or relationships within the data, these separate heads capture diverse information patterns. Their outputs are then combined to form a richer, more expressive representation, enabling the model to understand complex dependencies better and improve overall performance.

    Outputs from all heads are concatenated and linearly projected to form the final output.
    All heads are computed in parallel, enabling efficient computation.

    \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O
    Where:
    \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
    W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}
     W^O \in \mathbb{R}^{h d_k \times d_{\text{model}}} : Final output projection matrix that maps the concatenated attention outputs back to the original model dimension.
     h : Number of attention heads.
     d_{\text{model}} : Dimensionality of the input embeddings and final output.
     d_k = d_{\text{model}} / h : Dimension of each head’s projected subspace.

    The figure below shows a single-head attention heatmap and 4 independent multi-head attention heatmaps.



  • DL0026 Self-Attention vs Cross-Attention

    What distinguishes self-attention from cross-attention in transformer models?

    Answer

    Self-attention allows a sequence to attend to itself, making it powerful for capturing intra-sequence relationships. Cross-attention bridges different sequences, crucial for combining encoder and decoder representations in tasks like machine translation.

    Input Scope:
    Self-Attention: Query, key, and value all come from the same input sequence
    Cross-Attention: Query comes from one sequence, key and value come from a different source

    Usage in Transformer Architecture:
    Self-Attention: Used in both the encoder and decoder for modeling internal dependencies
    Cross-Attention: Used in the decoder to integrate the encoder output

    Both mechanisms use the scaled dot-product attention formula:
    \text{Attention}(Q, K, V) = \text{Softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
    Where:
     Q ,  K , and  V represent query, key, and value matrices, respectively
     d_k is the dimensionality of the key vectors
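
    The contrast can be sketched in NumPy: the attention function is identical, and only the sources of Q, K, and V change (sequence lengths and dimensions below are illustrative, and the learned projection matrices of a real Transformer are omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, shared by both mechanisms."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(6, d))   # encoder output (source sequence, 6 tokens)
dec = rng.normal(size=(4, d))   # decoder states (target sequence, 4 tokens)

self_out = attention(dec, dec, dec)    # self-attention: Q, K, V from one sequence
cross_out = attention(dec, enc, enc)   # cross-attention: Q from decoder, K/V from encoder

print(self_out.shape, cross_out.shape)   # (4, 8) (4, 8)
```

    In both cases the output has one row per query, which is why the decoder's output length is set by its own sequence even when it attends to the encoder.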

    The plot below on the left demonstrates self-attention by showing a token’s attention to all other tokens within the same sequence. The plot below on the right illustrates cross-attention, where tokens from one sequence (the decoder) attend to tokens from another, separate sequence (the encoder).



  • DL0025 Attention Mechanism

    Please explain the concept of “Attention Mechanism.”

    Answer

    The attention mechanism is a technique in neural networks that allows the model to focus on specific parts of the input sequence when making predictions. It addresses the limitation of traditional sequence-to-sequence models that compress an entire input sequence into a single fixed-size context vector, which can lose information, especially for long sequences.

    Attention lets the model dynamically decide which parts of the input are most important for each output step. For each output token, attention computes a weighted sum over all input tokens. These weights represent how much “attention” the model should pay to each input.

    Key Components:
    Query (Q): Represents what we are looking for or the current element being processed.
    Key (K): Represents what information is available from the input.
    Value (V): The actual information content to be extracted if a key matches the query.
    Each output uses a query to compare with keys and then uses the scores to weight values.

    Calculation (Scaled Dot-Product Attention):
    Similarity Score: Calculated by taking the dot product of the Query with each Key.
    Scaling: The scores are scaled down by the square root of the dimension of the keys ( d_k ) to reduce variance and prevent large values from pushing the Softmax function into regions with tiny gradients.
    Normalization: Normalized into a probability distribution using the Softmax function. Ensures the weights sum to 1.
    Weighted Sum: Multiplied by the Values to get the final attention output.

    \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    Where:
     Q, K, V : Matrices of queries, keys, and values.
     d_k : Dimension of key vectors.
     \text{Softmax} : Converts similarity scores to probabilities.
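
    The four calculation steps above can be traced explicitly in NumPy (all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4
Q = rng.normal(size=(3, d_k))   # 3 queries
K = rng.normal(size=(5, d_k))   # 5 keys
V = rng.normal(size=(5, d_k))   # 5 values

scores = Q @ K.T                             # (1) similarity scores
scaled = scores / np.sqrt(d_k)               # (2) scale by sqrt(d_k)
e = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # (3) softmax normalization
output = weights @ V                         # (4) weighted sum of values

print(weights.sum(axis=-1))   # each row of weights sums to 1
```

    Subtracting the row maximum before exponentiation is the standard numerically stable way to compute softmax; it does not change the resulting weights.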

    The plot below shows how much “attention” each input token receives in a simplified attention mechanism. It uses softmax-normalized weights over a 5-token sentence.



  • DL0022 CNN Architecture

    Describe the typical architecture of a CNN.

    Answer

    A Convolutional Neural Network (CNN) is structured to efficiently recognize complex patterns in data. It begins with an input layer that feeds in raw data. Convolutional layers then extract key features using filters, which are enhanced through non-linear activation functions like ReLU. Pooling layers are used to reduce the size or dimensions of these features, thereby improving computational efficiency and promoting invariance to small shifts. The extracted features are flattened and passed through fully connected layers that culminate in an output layer for final predictions, typically employing a softmax function for classification tasks. Optional techniques, such as dropout and batch normalization, further refine learning and help prevent overfitting.

    (1) Input Layer: Accepts raw data as multi-dimensional arrays.
    (2) Convolutional Layers: Use learnable filters (kernels) to scan the input and extract local features.
    (3) Activation Functions: Apply non-linearity (commonly ReLU) after each convolution operation.
    (4) Pooling Layers: Downsample feature maps using techniques like max or average pooling to reduce spatial dimensions and computations.
    (5) Stacked Convolutional and Pooling Blocks: Multiple iterations to progressively extract intricate hierarchical features.
    (6) Flattening: Converts feature maps into one-dimensional vectors.
    (7) Fully Connected Layers: Learn complex patterns and perform decision-making.
    (8) Output Layer: Produces final predictions using appropriate activation functions (e.g., softmax for classification).
    (9) Additional Components (Optional): Dropout for regularization, batch normalization for training stability, and skip connections in more advanced models.
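
    A toy NumPy forward pass through stages (1)-(8) above (single channel, a single random filter, and all sizes are illustrative simplifications of a real CNN):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2x2(x):
    """Downsample by taking the max over non-overlapping 2x2 windows."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))                               # (1) input layer
feat = np.maximum(0, conv2d(img, rng.normal(size=(3, 3))))  # (2)+(3) conv + ReLU
pooled = maxpool2x2(feat)                                   # (4) pooling: 6x6 -> 3x3
flat = pooled.ravel()                                       # (6) flattening
logits = flat @ rng.normal(size=(9, 2))                     # (7) fully connected
probs = softmax(logits)                                     # (8) output layer
print(probs)                                                # 2 class probabilities
```

    A real network would stack several conv/pool blocks (stage 5) and train the filters and weights by backpropagation; here they are random purely to show the data flow and shapes.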

    Below is a visual representation of a typical CNN architecture. Padding is used in convolution to maintain dimensions.



  • DL0021 Feature Map

    What is the feature map in Convolutional Neural Networks?

    Answer

    A feature map is the output of a convolution operation in a Convolutional Neural Network (CNN) that highlights where specific features appear in the input, enabling the network to understand patterns and structures in input data.

    Feature Map in CNNs:
    (1) Output of a Filter: It’s the 2D (or 3D) output generated when a single convolutional filter slides across the input data.
    (2) Highlighting a Specific Feature: Each feature map represents the spatial locations and strengths where a particular pattern or characteristic (e.g., a vertical edge, a specific texture, a corner) is detected in the input.
    (3) Multiple Feature Maps per Layer: A convolutional layer typically uses multiple filters, with each filter producing its unique feature map.
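
    A small NumPy example (the image and the vertical-edge filter are illustrative assumptions) showing how a feature map highlights where a feature occurs:

```python
import numpy as np

# Image with a vertical edge: left half dark (0), right half bright (1)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel-like vertical-edge filter
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Slide the filter over the image (valid convolution) to build the feature map
kh, kw = kernel.shape
fmap = np.zeros((6 - kh + 1, 6 - kw + 1))
for i in range(fmap.shape[0]):
    for j in range(fmap.shape[1]):
        fmap[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)

print(fmap)
```

    The responses are largest exactly where the filter's window straddles the dark-to-bright boundary, so the feature map encodes both what was detected (a vertical edge) and where.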

    The following example shows feature map examples calculated with different filters on the original image.



  • DL0020 CNN Parameter Sharing

    How do Convolutional Neural Networks achieve parameter sharing? Why is it beneficial?

    Answer

    Convolutional Neural Networks (CNNs) share parameters by using the same convolutional filter across different spatial locations, enabling them to learn location-independent features efficiently with fewer parameters and better generalization.

    How CNNs Achieve Parameter Sharing:
    (1) Convolutional Filters/Kernels: A small matrix of learnable weights (the filter) is defined.
    (2) Sliding Window Operation: This filter slides across the entire input image (or feature map).
    (3) Weight Reuse: The same weights within that filter are used to compute outputs at every spatial location where the filter is applied.

    Why Parameter Sharing is Beneficial:
    (1) Reduced Parameters: Significantly fewer learnable parameters compared to fully connected networks.
    (2) Translation equivariance: Detects features regardless of their position in the image.
    The following example demonstrates translation equivariance using a CNN-like convolution with a shared filter.

    (3) Improved Generalization: Less prone to overfitting due to fewer parameters.
    (4) Computational Efficiency: Faster training and inference.
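
    Translation equivariance arising from a shared filter can be sketched in NumPy (the 2-parameter filter and impulse input are illustrative):

```python
import numpy as np

def conv1d(signal, kernel):
    """1D valid convolution with a single shared filter."""
    n, k = len(signal), len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel) for i in range(n - k + 1)])

kernel = np.array([1.0, -1.0])     # one shared filter: only 2 parameters total
x = np.zeros(10)
x[3] = 1.0                         # feature (impulse) at position 3
x_shift = np.roll(x, 2)            # same feature shifted to position 5

y = conv1d(x, kernel)
y_shift = conv1d(x_shift, kernel)

# Shifting the input shifts the output by the same amount (equivariance)
print(np.allclose(np.roll(y, 2), y_shift))   # True
```

    A fully connected layer mapping the same 10 inputs to 9 outputs would need 90 weights and would have to relearn the feature at every position; the shared filter needs 2 weights and detects it everywhere.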


  • DL0017 Reproducibility

    How to ensure the reproducibility of the deep learning experiments?

    Answer

    Reproducibility in deep learning is achieved by controlling randomness via fixed seeds and deterministic operations, maintaining strict code and dependency versioning, managing datasets carefully, and keeping comprehensive logs of all experiment settings. These practices ensure that experiments can be reliably repeated and validated, regardless of external factors.

    (1) Seed Control and Deterministic Operations:
    Set random seeds for all libraries (Python, NumPy, TensorFlow/PyTorch).
    Enable deterministic settings in your deep learning framework to reduce nondeterminism.
    (2) Code Versioning and Configuration Management:
    Use version control systems like Git.
    Maintain detailed configuration files (using YAML or JSON) that log hyperparameters and settings for each experiment.
    (3) Environment and Dependency Control:
    Use virtual environments (e.g., Conda) or containerize your projects with Docker.
    Freeze library versions to ensure consistency in the software environment.
    (4) Dataset Management:
    Fix train-test splits and document data preprocessing steps.
    Use versioned or static datasets to prevent unintentional changes over time.
    (5) Logging and Documentation:
    Log hardware details, random seeds, and experiment configurations.
    Utilize experiment tracking tools (like MLflow or Weights & Biases) to archive training runs and parameters.
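
    A minimal Python sketch of point (1); if a framework such as PyTorch or TensorFlow is used, its seeding and determinism flags would be set in the same place (the function names `set_seed` and `run_experiment` here are illustrative):

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    """Fix seeds for every source of randomness the experiment uses."""
    random.seed(seed)
    np.random.seed(seed)
    # With a framework, also e.g. torch.manual_seed(seed) and its
    # deterministic-algorithm settings would go here.

def run_experiment(seed: int) -> float:
    set_seed(seed)
    weights = np.random.randn(4)   # stand-in for random initialization
    noise = random.random()        # a second, independent source of randomness
    return float(weights.sum() + noise)

a = run_experiment(42)
b = run_experiment(42)   # same seed -> identical result
c = run_experiment(7)    # different seed -> different result
print(a == b)            # True
```

    Without the `set_seed` call, `a` and `b` would differ from run to run, which is precisely the kind of irreproducibility the practices above prevent.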

    Below is an example illustrating how experiments can fail to be reproducible.


  • DL0016 Learning Rate Warmup

    What is Learning Rate Warmup? What is the purpose of using Learning Rate Warmup?

    Answer

    Learning Rate Warmup is a training technique where the learning rate starts from a small value and gradually increases to a target (base) learning rate over the first few steps or epochs of training.

    Purpose of Using Learning Rate Warmup:
    (1) Stabilizes Early Training: At the beginning of training, weights are randomly initialized, making the model sensitive to large updates. A warmup gradually increases the learning rate, preventing unstable behavior.
    (2) Allows Optimizers to Adapt: Optimizers like Adam and AdamW rely on gradient statistics that can be unstable at the start. Warmup lets these optimizers accumulate more accurate estimates before using a high learning rate.
    (3) Enables Large Batch Training: Mitigates issues that can arise when combining a large batch size with a high initial learning rate.
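
    A small Python sketch of linear warmup followed by cosine decay (the step counts and base learning rate are illustrative assumptions):

```python
import math

def lr_schedule(step, warmup_steps, total_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps          # warmup phase
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

lrs = [lr_schedule(s, warmup_steps=10, total_steps=100, base_lr=0.1)
       for s in range(100)]
print(max(lrs))   # the peak equals base_lr, reached at the end of warmup
```

    In practice this function would be queried once per optimizer step (or wrapped in a framework scheduler); the key property is that early steps see a small, growing learning rate.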

    Below is an example using Learning Rate Warmup followed by Cosine Decay.

