Tag: Transformer

  • DL0026 Self-Attention vs Cross-Attention

    What distinguishes self-attention from cross-attention in transformer models?

    Answer

    Self-attention allows a sequence to attend to itself, making it powerful for capturing intra-sequence relationships. Cross-attention bridges different sequences, crucial for combining encoder and decoder representations in tasks like machine translation.
    Input Scope:
    Self-Attention: Query, key, and value all come from the same input sequence
    Cross-Attention: Query comes from one sequence, while key and value come from a different sequence (e.g., the encoder output)

    Usage in Transformer Architecture:
    Self-Attention: Used in both the encoder and decoder for modeling internal dependencies
    Cross-Attention: Used in the decoder to integrate the encoder output

    Both mechanisms use the scaled dot-product attention formula:
    \mbox{Attention}(Q, K, V) = \mbox{Softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
    Where:
    Q, K, and V represent the query, key, and value matrices, respectively
    d_k is the dimensionality of the key vectors

    The plot below on the left demonstrates self-attention by showing a token’s attention to all other tokens within the same sequence. The plot below on the right illustrates cross-attention, where tokens from one sequence (the decoder) attend to tokens from another, separate sequence (the encoder).
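    The difference is easiest to see in where Q, K, and V come from. Below is a minimal NumPy sketch of scaled dot-product attention applied both ways; the sequence lengths and dimension are arbitrary, and the learned linear projections that real transformers apply to produce Q, K, and V are omitted for brevity.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        # Subtract the row max for numerical stability before exponentiating
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    d = 8
    decoder_seq = rng.normal(size=(4, d))  # 4 decoder tokens
    encoder_seq = rng.normal(size=(6, d))  # 6 encoder tokens

    # Self-attention: Q, K, and V all come from the same sequence
    self_out = scaled_dot_product_attention(decoder_seq, decoder_seq, decoder_seq)

    # Cross-attention: Q from the decoder, K and V from the encoder
    cross_out = scaled_dot_product_attention(decoder_seq, encoder_seq, encoder_seq)

    print(self_out.shape)   # (4, 8)
    print(cross_out.shape)  # (4, 8)
    ```

    Note that both outputs have one row per *query* token: cross-attention produces a representation for each of the 4 decoder tokens, built as a weighted mix of the 6 encoder values.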


  • DL0025 Attention Mechanism

    Please explain the concept of “Attention Mechanism.”

    Answer

    The attention mechanism is a technique in neural networks that allows the model to focus on specific parts of the input sequence when making predictions. It addresses the limitation of traditional sequence-to-sequence models that compress an entire input sequence into a single fixed-size context vector, which can lose information, especially for long sequences.

    Attention lets the model dynamically decide which parts of the input are most important for each output step. For each output token, attention computes a weighted sum over all input tokens. These weights represent how much “attention” the model should pay to each input.

    Key Components:
    Query (Q): Represents what we are looking for or the current element being processed.
    Key (K): Represents what information is available from the input.
    Value (V): The actual information content to be extracted if a key matches the query.
    Each output uses a query to compare with keys and then uses the scores to weight values.

    Calculation (Scaled Dot-Product Attention):
    Similarity Score: Calculated by taking the dot product of the Query with each Key.
    Scaling: The scores are scaled down by the square root of the dimension of the keys ( d_k ) to reduce variance and prevent large values from pushing the Softmax function into regions with tiny gradients.
    Normalization: The scaled scores are normalized into a probability distribution using the Softmax function, which ensures the weights sum to 1.
    Weighted Sum: The weights are multiplied by the Values to produce the final attention output.

    \mbox{Attention}(Q, K, V) = \mbox{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    Where:
    Q, K, V: Matrices of queries, keys, and values.
    d_k: Dimension of the key vectors.
    \mbox{Softmax}: Converts similarity scores to probabilities.
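    The four steps above can be traced one at a time with NumPy. This is a sketch for a single query against five key/value pairs; the shapes and random inputs are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d_k = 4
    Q = rng.normal(size=(1, d_k))   # one query
    K = rng.normal(size=(5, d_k))   # five keys
    V = rng.normal(size=(5, d_k))   # five values

    # 1. Similarity scores: dot product of the query with each key
    scores = Q @ K.T                   # shape (1, 5)

    # 2. Scaling by sqrt(d_k) to reduce the variance of the scores
    scaled = scores / np.sqrt(d_k)

    # 3. Softmax normalization: weights form a probability distribution
    weights = np.exp(scaled - scaled.max())
    weights = weights / weights.sum()  # weights sum to 1

    # 4. Weighted sum of the values
    output = weights @ V               # shape (1, 4)

    print(weights.sum())  # ≈ 1.0
    ```

    Each entry of `weights` is the "attention" the query pays to the corresponding input token, and the output is the corresponding mixture of value vectors.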

    The plot below shows how much “attention” each input token receives in a simplified attention mechanism. It uses softmax-normalized weights over a 5-token sentence.
