Could you explain the concept of dilated attention in transformer architectures?
Answer
Dilated attention introduces gaps between attended positions to sparsify computation, enabling efficient modeling of long-range dependencies. It is particularly useful for tasks that require scalable attention over long sequences, trading some local granularity for wider, sparser global context.
Dilated attention is similar to dilated convolutions in CNNs, where gaps (dilations) are introduced between the sampled positions.
Instead of attending to all tokens (as in standard self-attention), each query token attends to every d-th token; the dilation rate d controls the stride of the attention pattern.
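As a minimal sketch of this pattern (the helper name below is hypothetical, assuming a simple 1-D token sequence), the key positions a query attends to are just every d-th index:

```python
def dilated_positions(seq_len, d, offset=0):
    """Indices of the keys a query attends to under dilation rate d:
    every d-th token, starting from `offset`."""
    return list(range(offset, seq_len, d))

# For a 12-token sequence with dilation rate 3:
print(dilated_positions(12, 3))  # -> [0, 3, 6, 9]
```

Multiple heads can use different offsets (or different rates) so that, collectively, the heads cover all positions.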
Reduction in Complexity: Reduces attention computation and memory from O(n^2) to roughly O(n^2 / d) for sequence length n and dilation rate d (more generally, to a bound depending on the sparsity pattern).
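A quick back-of-the-envelope count makes the saving concrete (illustrative helper names, counting query-key dot products only):

```python
def dense_pairs(n):
    """Standard self-attention: every query attends to every key."""
    return n * n                # O(n^2)

def dilated_pairs(n, d):
    """Dilated attention: each query attends to ~n/d keys."""
    return n * (n // d)         # O(n^2 / d)

n, d = 1024, 4
print(dense_pairs(n), dilated_pairs(n, d))  # -> 1048576 262144
```

With n = 1024 and d = 4, the dilated pattern computes a quarter of the dot products of dense attention.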
In dilated attention, the dot-product is computed only at dilated positions:

$$\text{Attention}(Q, K_d, V_d) = \text{softmax}\!\left(\frac{Q K_d^{\top}}{\sqrt{d_k}}\right) V_d$$

Where: $K_d$ and $V_d$ are the dilated subsets of keys and values, and $d_k$ is the key dimension.
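The formula above can be sketched in NumPy (a minimal single-head version; the function name and the (seq_len, d_k) layout are assumptions for illustration):

```python
import numpy as np

def dilated_attention(Q, K, V, d=3):
    """Single-head dilated attention sketch.
    Q, K, V: arrays of shape (seq_len, d_k).
    Keys/values are subsampled at every d-th position before the
    standard scaled dot-product attention."""
    K_d, V_d = K[::d], V[::d]                    # dilated subsets
    d_k = Q.shape[-1]
    scores = Q @ K_d.T / np.sqrt(d_k)            # (seq_len, ceil(seq_len/d))
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over dilated keys
    return weights @ V_d                         # (seq_len, d_k)

rng = np.random.default_rng(0)
Q = rng.standard_normal((12, 8))
K = rng.standard_normal((12, 8))
V = rng.standard_normal((12, 8))
print(dilated_attention(Q, K, V, d=3).shape)  # -> (12, 8)
```

Note that each query row still produces a full d_k-dimensional output; only the set of keys it mixes over is thinned out.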
[Figure: visualization of dilated attention with a dilation rate of 3 — each query attends to every 3rd token.]