Tag: Transformer

  • DL0026 Self-Attention vs Cross-Attention

    What distinguishes self-attention from cross-attention in transformer models?

    Answer

    Self-attention allows a sequence to attend to itself, making it powerful for capturing intra-sequence relationships. Cross-attention bridges different sequences, crucial for combining encoder and decoder representations in tasks like machine translation.
    Input Scope:
    Self-Attention: Query, key, and value all come from the same input sequence
    Cross-Attention: Query comes from one sequence, while key and value come from a different sequence (e.g., the encoder output)

    Usage in Transformer Architecture:
    Self-Attention: Used in both the encoder and decoder for modeling internal dependencies
    Cross-Attention: Used in the decoder to integrate the encoder output

    Both mechanisms use the scaled dot-product attention formula:
    \mbox{Attention}(Q, K, V) = \mbox{Softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
    Where:
    Q, K, and V represent the query, key, and value matrices, respectively
    d_k is the dimensionality of the key vectors

    The plot below on the left demonstrates self-attention by showing a token’s attention to all other tokens within the same sequence. The plot below on the right illustrates cross-attention, where tokens from one sequence (the decoder) attend to tokens from another, separate sequence (the encoder).
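    The difference is easiest to see in where Q, K, and V come from. Below is a minimal NumPy sketch of scaled dot-product attention applied both ways; the sequence lengths and dimension are arbitrary, and the learned linear projections that real transformers apply to produce Q, K, and V are omitted for brevity.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        # Subtract the row max for numerical stability before exponentiating
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    d = 8
    decoder_seq = rng.normal(size=(4, d))  # 4 decoder tokens
    encoder_seq = rng.normal(size=(6, d))  # 6 encoder tokens

    # Self-attention: Q, K, and V all come from the same sequence
    self_out = scaled_dot_product_attention(decoder_seq, decoder_seq, decoder_seq)

    # Cross-attention: Q from the decoder, K and V from the encoder
    cross_out = scaled_dot_product_attention(decoder_seq, encoder_seq, encoder_seq)

    print(self_out.shape)   # (4, 8)
    print(cross_out.shape)  # (4, 8)
    ```

    Note that both outputs have one row per *query* token: cross-attention produces a representation for each of the 4 decoder tokens, built as a weighted mix of the 6 encoder values.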


  • DL0025 Attention Mechanism

    Please explain the concept of “Attention Mechanism.”

    Answer

    The attention mechanism is a technique in neural networks that allows the model to focus on specific parts of the input sequence when making predictions. It addresses the limitation of traditional sequence-to-sequence models that compress an entire input sequence into a single fixed-size context vector, which can lose information, especially for long sequences.

    Attention lets the model dynamically decide which parts of the input are most important for each output step. For each output token, attention computes a weighted sum over all input tokens. These weights represent how much “attention” the model should pay to each input.

    Key Components:
    Query (Q): Represents what we are looking for or the current element being processed.
    Key (K): Represents what information is available from the input.
    Value (V): The actual information content to be extracted if a key matches the query.
    Each output uses a query to compare with keys and then uses the scores to weight values.

    Calculation (Scaled Dot-Product Attention):
    Similarity Score: Calculated by taking the dot product of the Query with each Key.
    Scaling: The scores are scaled down by the square root of the dimension of the keys ( d_k ) to reduce variance and prevent large values from pushing the Softmax function into regions with tiny gradients.
    Normalization: The scaled scores are normalized into a probability distribution using the Softmax function, which ensures the weights sum to 1.
    Weighted Sum: The weights are multiplied by the Values to produce the final attention output.

    \mbox{Attention}(Q, K, V) = \mbox{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    Where:
    Q, K, V: Matrices of queries, keys, and values.
    d_k: Dimension of the key vectors.
    \mbox{Softmax}: Converts similarity scores to probabilities.
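    The four steps above can be traced one at a time with NumPy. This is a sketch for a single query against five key/value pairs; the shapes and random inputs are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d_k = 4
    Q = rng.normal(size=(1, d_k))   # one query
    K = rng.normal(size=(5, d_k))   # five keys
    V = rng.normal(size=(5, d_k))   # five values

    # 1. Similarity scores: dot product of the query with each key
    scores = Q @ K.T                   # shape (1, 5)

    # 2. Scaling by sqrt(d_k) to reduce the variance of the scores
    scaled = scores / np.sqrt(d_k)

    # 3. Softmax normalization: weights form a probability distribution
    weights = np.exp(scaled - scaled.max())
    weights = weights / weights.sum()  # weights sum to 1

    # 4. Weighted sum of the values
    output = weights @ V               # shape (1, 4)

    print(weights.sum())  # ≈ 1.0
    ```

    Each entry of `weights` is the "attention" the query pays to the corresponding input token, and the output is the corresponding mixture of value vectors.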

    The plot below shows how much “attention” each input token receives in a simplified attention mechanism. It uses softmax-normalized weights over a 5-token sentence.
