Explain “Positional Encoding” in Transformers. Why is it necessary?
Answer
Positional encoding is crucial in Transformers to equip the model with an understanding of token order while maintaining full parallel computation. Fixed sinusoidal functions offer parameter-free generalization to unseen lengths, learned embeddings provide task-specific flexibility, and relative schemes directly capture inter-token distances.
Self-attention is permutation-invariant and, on its own, cannot distinguish token order. Positional encodings inject sequence information by adding position-dependent vectors to token embeddings.
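The permutation property above can be checked numerically. The sketch below uses a single attention head with identity query/key/value projections (a simplifying assumption, not a full Transformer layer) to show that permuting the input tokens simply permutes the output rows, i.e. attention alone carries no order information:

```python
import numpy as np

def self_attention(X):
    # Simplified single-head self-attention with identity Q/K/V projections.
    scores = X @ X.T / np.sqrt(X.shape[1])          # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, embedding dim 8
perm = [2, 0, 3, 1]           # an arbitrary reordering of the tokens

out = self_attention(X)
out_perm = self_attention(X[perm])

# Shuffling the inputs just shuffles the outputs the same way:
print(np.allclose(out_perm, out[perm]))  # True
```

Because the output for a permuted sequence is exactly the permuted output of the original sequence, the model cannot tell "dog bites man" from "man bites dog" without positional information.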
Encoding types:
(1) Fixed (sinusoidal): Predefined, parameter-free functions of position. Sine and cosine waves at different frequencies let the model represent both absolute and relative positions, and the encoding extends naturally to sequence lengths not seen during training.
(2) Learned: Position vectors trained as model parameters alongside the token embeddings. They offer task-specific flexibility but cannot generalize beyond the maximum sequence length seen during training.
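A minimal sketch of the learned variant, using a randomly initialized numpy lookup table as a stand-in for a trained `nn.Embedding`-style parameter matrix (the sizes and names here are illustrative assumptions): the position table has a fixed number of rows, which is exactly why it cannot handle sequences longer than the training maximum.

```python
import numpy as np

max_len, d_model, vocab_size = 16, 8, 100
rng = np.random.default_rng(1)

tok_emb = rng.normal(size=(vocab_size, d_model))  # token embedding table
pos_emb = rng.normal(size=(max_len, d_model))     # learned position table (trained in practice)

def embed(token_ids):
    # Each token embedding gets the vector for its position added to it.
    if len(token_ids) > max_len:
        raise ValueError("sequence exceeds the maximum trained length")
    return tok_emb[token_ids] + pos_emb[:len(token_ids)]

x = embed([5, 17, 3])
print(x.shape)  # (3, 8)
```

The hard `max_len` cap is the key design trade-off: the table's rows are tuned to the task, but position 17 simply has no vector if training never went past 16 tokens.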
Sinusoidal Encoding Formula:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Where:
pos: token position in the sequence
i: dimension index (each pair of dimensions 2i and 2i+1 shares one frequency)
d_model: embedding dimension
The figure below shows how the encoding values change across different positions and dimensions.
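The table the figure depicts can be generated directly from the formula. A minimal numpy sketch (assuming an even `d_model`): even dimensions get sines, odd dimensions get cosines, and each pair of dimensions oscillates at its own frequency.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10000 ** (i / d_model))  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(50, 16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sine dims are 0, cosine dims are 1
```

Low dimension indices vary rapidly with position while high ones vary slowly, so each position receives a unique pattern, and no parameters are needed for any sequence length.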