What is Multi-Query Attention in transformer models?
Answer
Multi-Query Attention (MQA) modifies standard Multi-Head Attention (MHA) in transformers by keeping multiple query heads while sharing a single key-value projection across all of them. This design retains most of MHA's expressiveness but substantially reduces memory usage during inference, especially for the KV cache in autoregressive decoding, which makes it attractive for serving large models. It trades a small potential loss in quality for efficiency.
Comparison to Multi-Head Attention (MHA): In standard MHA, each attention head has independent projections for queries (Q), keys (K), and values (V). In MQA, only Q is projected into multiple heads, while K and V use a single projection that is shared across all query heads.
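The difference shows up directly in the projection weight shapes. A minimal sketch with numpy, using hypothetical sizes (`d_model=512`, 8 heads of dimension 64; all names are illustrative, not from any specific library):

```python
import numpy as np

# Hypothetical model sizes for illustration.
d_model, n_heads, d_head = 512, 8, 64

# MHA: independent Q, K, V projections per head (packed into one matrix each).
W_q_mha = np.zeros((d_model, n_heads * d_head))
W_k_mha = np.zeros((d_model, n_heads * d_head))
W_v_mha = np.zeros((d_model, n_heads * d_head))

# MQA: Q still has n_heads heads, but K and V project to a single shared head.
W_q_mqa = np.zeros((d_model, n_heads * d_head))
W_k_mqa = np.zeros((d_model, d_head))
W_v_mqa = np.zeros((d_model, d_head))

mha_params = W_q_mha.size + W_k_mha.size + W_v_mha.size
mqa_params = W_q_mqa.size + W_k_mqa.size + W_v_mqa.size
print(mha_params, mqa_params)  # 786432 327680
```

With these sizes, MQA's attention projections use less than half the parameters of MHA's, and the K/V portion shrinks by a factor of `n_heads`.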
Efficiency Benefits: Reduces memory footprint during inference, particularly with KV caching, as the cache stores only one set of K and V vectors instead of one per head, lowering memory complexity from O(n * h * d) to O(n * d), where n is sequence length, h is number of heads, and d is head dimension.
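To make the O(n · h · d) vs. O(n · d) cache difference concrete, here is a back-of-the-envelope calculation with assumed (illustrative) sizes for a 32-layer model decoding a 4096-token sequence in fp16:

```python
# KV-cache size: cache holds K and V (the factor of 2) per layer, fp16 = 2 bytes.
n, h, d, layers, bytes_per = 4096, 32, 128, 32, 2

mha_cache = 2 * n * h * d * layers * bytes_per  # one K and V set per head
mqa_cache = 2 * n * d * layers * bytes_per      # one shared K and V set

print(mha_cache / 2**30, "GiB vs", mqa_cache / 2**20, "MiB")  # 2.0 GiB vs 64.0 MiB
```

The cache shrinks by exactly the head count h (here 32x), which is what enables larger batch sizes or longer contexts at inference time.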
The core attention operation for a single head in MQA can be represented by the following equation:

$$\text{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{d_k}}\right) V$$

Where:
- $Q_i$ represents the query matrix for the $i$-th attention head.
- $K$ and $V$ represent the single, shared key and value matrices used by all heads.
- $d_k$ is the dimension of the key vectors.
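The equation above can be sketched directly in numpy. This is a minimal illustration (not an optimized implementation): queries have a head axis, while a single K and V are broadcast across all heads.

```python
import numpy as np

def mqa_attention(Q, K, V):
    """Q: (h, n, d_k) per-head queries; K, V: (n, d_k) shared by all heads."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (h, n, n)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V                                  # (h, n, d_k)

rng = np.random.default_rng(0)
h, n, d_k = 4, 5, 8
out = mqa_attention(rng.normal(size=(h, n, d_k)),
                    rng.normal(size=(n, d_k)),
                    rng.normal(size=(n, d_k)))
print(out.shape)  # (4, 5, 8)
```

Note that only `Q` carries a head dimension; `K` and `V` are 2-D and are reused by every head via broadcasting, which is exactly why the KV cache stores a single set of keys and values.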