Before attention, sequence models had to compress an entire input sequence into a single fixed-size context vector — a bottleneck that forced information loss. Attention broke this by letting every output position look back at all input positions simultaneously, weighting them by relevance.
The core idea is a soft, differentiable dictionary lookup. A query is compared against a set of keys. The similarities become weights. Those weights aggregate over values to produce an output. Everything is continuous and therefore differentiable end to end — the network learns which queries, keys, and values to produce.
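That lookup fits in a few lines of numpy. A minimal sketch with toy data (the name `soft_lookup` and the example keys/values are illustrative, not from any library):

```python
import numpy as np

def soft_lookup(q, K, V):
    """Differentiable dictionary lookup: similarity -> weights -> weighted sum."""
    scores = K @ q                            # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: a distribution over entries
    return weights @ V                        # read out a blend of the values

K = np.eye(3)                                 # three orthogonal keys
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([10.0, 0.0, 0.0])                # query strongly matching key 0
out = soft_lookup(q, K, V)                    # ≈ V[0]: soft, but sharply selective
```

Unlike a hard dictionary, every entry contributes a little, so gradients flow to all of them.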
Self-attention is the special case where queries, keys, and values all come from the same sequence. Each position asks: "which other positions are relevant to understanding me?" and the answer updates the representation. Stack enough of these with feedforward layers and you have a transformer.
What am I looking for? The query is a learned linear projection of the current token's representation. It encodes the "information need" of this position.
W_Q ∈ R^{d_model × d_k} — learned
In cross-attention (encoder-decoder), queries come from the decoder. In self-attention, all three come from the same source sequence X.
What do I contain? Keys are another linear projection of each token. They advertise content. A high dot product between a query and a key means that token is relevant to the current position's need.
W_K ∈ R^{d_model × d_k} — learned
Each query is compared against every key — this is the O(n²) step. Every pair (i, j) requires a dot product. For n=4096 tokens, that's ~16.7M dot products per head per layer.
What do I actually send? Values are the third projection. Once attention weights are computed via Q·K, the output is a weighted sum of values. The keys determine relevance; values carry the actual information transferred.
W_V ∈ R^{d_model × d_v} — learned
Keys and values are often called "memories." The query retrieves from memory by matching keys, then reads out the associated values. Identical in structure to a Modern Hopfield network update.
Matrix multiply of queries by transposed keys. Produces an n×n similarity matrix — every pair of positions gets a raw score. Expensive but parallelizable.
Scale factor prevents dot products from growing too large in high dimensions (their variance scales with d_k). Without this, softmax saturates and gradients vanish — the scale plays the same role as the inverse-temperature β in modern Hopfield networks.
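The variance claim is easy to check empirically. A sketch assuming unit-variance query and key components (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 256, 20_000
q = rng.standard_normal((n_samples, d_k))     # unit-variance query components
k = rng.standard_normal((n_samples, d_k))     # unit-variance key components
dots = (q * k).sum(axis=1)                    # raw dot products: variance ≈ d_k
scaled = dots / np.sqrt(d_k)                  # what attention actually uses: ≈ 1
print(dots.var(), scaled.var())
```

With variance held near 1, softmax inputs stay in the regime where gradients are healthy regardless of d_k.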
Applied row-wise. Converts raw scores to a probability distribution over positions. Each row sums to 1 — these are the attention weights. Differentiable everywhere.
Final weighted sum. Each output position gets a convex combination of all value vectors, weighted by how much attention it paid to each position. This is the output.
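The four steps — scores, scale, softmax, weighted sum — compose into a few lines of numpy. A sketch with assumed shapes, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q,K: (n, d_k); V: (n, d_v) -> (n, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # n×n similarity matrix, scaled
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # convex combination of values

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 4
Q, K = rng.standard_normal((2, n, d_k))
V = rng.standard_normal((n, d_v))
out = attention(Q, K, V)                      # shape (6, 4)
```

Every operation here is a matrix multiply or an elementwise map, which is why the whole thing parallelizes so well on accelerators.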
Classic example from Vaswani et al. The pronoun "it" must resolve to either "animal" or "street." Select any token to see how it distributes attention weight across the sentence. The model learns these patterns from data — no explicit grammar rules.
Running attention once gives one perspective on which positions are relevant. Multi-head attention runs h independent attention operations in parallel — each with its own Q, K, V projections — then concatenates and projects the results. Each head can specialize in a different relationship type.
head_i = Attention(Q·W_Qi, K·W_Ki, V·W_Vi)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W_O
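A sketch of the full multi-head computation (hypothetical shapes; the per-head loop is for clarity — real implementations batch the heads into one einsum):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h*d_k, d_model)."""
    heads = []
    for wq, wk, wv in zip(W_Q, W_K, W_V):          # one pass per head
        Q, K, V = X @ wq, X @ wk, X @ wv
        A = softmax(Q @ K.T / np.sqrt(wq.shape[-1]))
        heads.append(A @ V)                        # this head's view: (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O    # concat, then project back

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h                                 # heads split the model dim
W_Q, W_K, W_V = rng.standard_normal((3, h, d_model, d_k)) * 0.1
W_O = rng.standard_normal((h * d_k, d_model)) * 0.1
X = rng.standard_normal((n, d_model))
out = multi_head(X, W_Q, W_K, W_V, W_O)            # shape (5, 16)
```

Because d_k = d_model / h, the total cost is roughly that of a single full-width attention, but each head gets its own independent similarity structure.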
The visualization below shows 8 heads for the sentence "The cat sat on the mat." Circle size encodes attention weight from each head to each token. Note how different heads develop distinct, interpretable specializations — these emerge purely from gradient descent.
Grouped-query attention (GQA) is now standard in production LLMs (Llama, Mistral, Gemma): several query heads share one key/value head, cutting KV-cache memory 4-8x with minimal quality loss. Flash Attention is the standard implementation — it tiles the computation to fit in SRAM, eliminating the bottleneck of reading and writing the full n×n matrix to HBM.
Memory: O(n²)
n = sequence length, d = model dim
For n=4096, the attention matrix alone requires 4096² ≈ 16.7M entries. At fp16 that's 32 MiB per head per layer. A 32-layer, 32-head model: 32 MiB × 32 × 32 = 32 GiB just for attention matrices. This is why long-context models are expensive.
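The same back-of-envelope as code:

```python
n, layers, heads = 4096, 32, 32
bytes_per_entry = 2                                # fp16
per_matrix = n * n * bytes_per_entry               # one head's n×n score matrix
total = per_matrix * heads * layers
print(per_matrix // 2**20, "MiB per head per layer")
print(total // 2**30, "GiB across all heads and layers")
```

Note the quadratic term: doubling the context to 8192 quadruples both numbers.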
Flash Attention sidesteps materializing the full n×n matrix by fusing the softmax + matmul into tiled operations that stay in SRAM. Memory drops to O(n). This is how 128K+ context windows became practical.
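The core trick is an online softmax over tiles. A numerical sketch for a single query row (illustrating the idea only — the real kernel also tiles queries and runs fused in GPU SRAM):

```python
import numpy as np

def tiled_attention(q, K, V, tile=64):
    """Stream K, V in tiles, keeping running max/denominator/output.
    Never materializes more than one tile of scores at a time."""
    d_k = K.shape[-1]
    m = -np.inf                          # running max of scores seen so far
    l = 0.0                              # running softmax denominator
    o = np.zeros(V.shape[-1])            # running unnormalized output
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q / np.sqrt(d_k)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[i:i + tile]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 16))
V = rng.standard_normal((256, 8))
q = rng.standard_normal(16)

# Reference: naive softmax over the full score row, computed all at once.
s = K @ q / np.sqrt(16)
w = np.exp(s - s.max())
w /= w.sum()
naive = w @ V                            # tiled_attention(q, K, V) matches this
```

The rescaling by exp(m − m_new) is what lets the running sums stay correct as a larger maximum arrives in a later tile.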
Attention is permutation-equivariant — shuffle the input tokens and the same values flow through, just re-routed: the outputs shuffle the same way. There's no inherent notion of position, so position must be injected explicitly.
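A quick demonstration that shuffling inputs just shuffles outputs (toy numpy self-attention; the weight matrices are random, not learned):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = Q @ K.T / np.sqrt(W_K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))                    # 6 tokens, dim 8
W_Q, W_K, W_V = rng.standard_normal((3, 8, 4))
perm = rng.permutation(6)

out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)
# out_shuffled equals out[perm]: nothing in the math knows where a token sits.
```

This is exactly why positional information has to be added to the inputs rather than hoped for from the architecture.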
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Original sinusoidal encoding — Vaswani 2017
Added to the input embeddings before the first layer. Learned absolute position embeddings (BERT) are simpler but don't generalize to lengths unseen during training. RoPE (Rotary Position Embedding) — used in GPT-NeoX, Llama — encodes relative position directly in the QK dot product via rotation matrices. Better extrapolation, now standard.
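The sinusoidal encoding in numpy (a sketch following the formula above; `sinusoidal_pe` is an illustrative name, and d_model is assumed even):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Sinusoidal position encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(n_pos)[:, None]                # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2) frequency index
    angles = pos / 10000 ** (2 * i / d_model)      # wavelengths from 2π to 10000·2π
    pe = np.empty((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)    # added to token embeddings before the first layer
```

Each dimension pair oscillates at a different frequency, so any position gets a unique fingerprint, and the pattern extends to positions longer than those seen in training.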
Attention heads develop interpretable specializations through training. Elhage et al. (2021) and Anthropic's broader mechanistic interpretability work identified recurring head types:
Induction heads in particular are thought to underlie in-context learning — the model's ability to follow patterns demonstrated in the prompt. They form during a phase transition in training and are a candidate mechanism for many few-shot capabilities.