DL0029 Dilated Attention

Could you explain the concept of dilated attention in transformer architectures?

Answer

Dilated attention introduces gaps between attended positions to sparsify computation, enabling efficient modeling of long-range dependencies. It is particularly helpful in tasks that require scalable attention over long sequences, trading some fine-grained local detail for broader global context by spreading a fixed attention budget more widely and sparsely.

Dilated attention is analogous to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
Instead of attending to all tokens (as in standard self-attention), each query token attends only to every d-th token; the dilation rate d controls the stride of the attention pattern.

Reduction in Complexity: Reduces attention computation and memory from  \mathcal{O}(n^2) to roughly  \mathcal{O}(n^2 / d) for dilation rate d (more generally, a bound determined by the sparsity pattern).
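To make the sparsity concrete, here is a minimal sketch (NumPy assumed; the values of n and d are illustrative) showing how a dilation rate d shrinks the number of attended positions per query from n to roughly n/d, cutting total cost from quadratic to about quadratic-over-d:

```python
import numpy as np

n, d = 12, 3  # sequence length and dilation rate (illustrative values)

# Dense attention: every query attends to all n key positions.
dense_positions = np.arange(n)

# Dilated attention: every query attends only to every d-th key position.
dilated_positions = np.arange(0, n, d)

print(dense_positions.size)    # 12 keys per query -> O(n^2) pairs total
print(dilated_positions.size)  # 4 keys per query  -> O(n^2 / d) pairs total
```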

In dilated attention, the dot-product  QK^\top is computed only at dilated positions:
\mbox{Attention}_{\text{dilated}}(Q, K, V) = \mbox{Softmax}\left(\frac{QK_d^\top}{\sqrt{d_k}}\right) V_d
Where:
 K_d, V_d are the dilated subsets of keys and values.
 d_k is the key dimension.
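The formula above can be sketched in NumPy as follows. The softmax helper, the offset 0 for the dilated subset, and the single-head shapes are assumptions made for illustration, not a reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(Q, K, V, d=3):
    """Each query attends only to every d-th key/value (offset 0 assumed)."""
    K_d = K[::d]                       # dilated subset of keys
    V_d = V[::d]                       # dilated subset of values
    d_k = Q.shape[-1]                  # key dimension
    scores = Q @ K_d.T / np.sqrt(d_k)  # shape (n, n/d) instead of (n, n)
    return softmax(scores, axis=-1) @ V_d

rng = np.random.default_rng(0)
n, d_k = 12, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = dilated_attention(Q, K, V, d=3)
print(out.shape)  # (12, 8): one output vector per query, as in dense attention
```

Note that the output keeps one row per query; only the set of keys and values each query can see is thinned out.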

(Figure: visualization of dilated attention with a dilation rate of 3.)

