DL0029 Dilated Attention

Could you explain the concept of dilated attention in transformer architectures?

Answer

Dilated attention introduces gaps between attended positions to sparsify computation, enabling efficient modeling of long-range dependencies. It is particularly helpful in tasks that require scalable attention over long sequences, trading some fine-grained local detail for broader global context by spreading a fixed attention budget more widely and sparsely.

Dilated attention is analogous to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
Instead of attending to all tokens (as in standard self-attention), each query token attends only to every d-th token; the dilation rate d controls the stride of the attention pattern.

Reduction in Complexity: Reduces attention computation and memory from  \mathcal{O}(n^2) to roughly  \mathcal{O}(n^2 / d) for dilation rate d (more generally, a bound determined by the sparsity pattern).
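To make the sparsity concrete, here is a minimal sketch (NumPy assumed; the values of n and d are illustrative) showing how a dilation rate d shrinks the number of attended positions per query from n to roughly n/d, cutting total cost from quadratic to about quadratic-over-d:

```python
import numpy as np

n, d = 12, 3  # sequence length and dilation rate (illustrative values)

# Dense attention: every query attends to all n key positions.
dense_positions = np.arange(n)

# Dilated attention: every query attends only to every d-th key position.
dilated_positions = np.arange(0, n, d)

print(dense_positions.size)    # 12 keys per query -> O(n^2) pairs total
print(dilated_positions.size)  # 4 keys per query  -> O(n^2 / d) pairs total
```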

In dilated attention, the dot-product  QK^\top is computed only at dilated positions:
\mbox{Attention}_{\text{dilated}}(Q, K, V) = \mbox{Softmax}\left(\frac{QK_d^\top}{\sqrt{d_k}}\right) V_d
Where:
 K_d, V_d are the dilated subsets of keys and values.
 d_k is the key dimension.
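The formula above can be sketched in NumPy as follows. The softmax helper, the offset 0 for the dilated subset, and the single-head shapes are assumptions made for illustration, not a reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(Q, K, V, d=3):
    """Each query attends only to every d-th key/value (offset 0 assumed)."""
    K_d = K[::d]                       # dilated subset of keys
    V_d = V[::d]                       # dilated subset of values
    d_k = Q.shape[-1]                  # key dimension
    scores = Q @ K_d.T / np.sqrt(d_k)  # shape (n, n/d) instead of (n, n)
    return softmax(scores, axis=-1) @ V_d

rng = np.random.default_rng(0)
n, d_k = 12, 8
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = dilated_attention(Q, K, V, d=3)
print(out.shape)  # (12, 8): one output vector per query, as in dense attention
```

Note that the output keeps one row per query; only the set of keys and values each query can see is thinned out.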

(Figure: visualization of dilated attention with a dilation rate of 3.)

