Attention is a mechanism for every token in a sequence to dynamically weight how much it should care about every other token. Rather than compressing context into a fixed-size vector (the old RNN problem), attention lets the model maintain direct connections to all positions simultaneously.
The "Attention Is All You Need" paper (Vaswani et al. 2017) replaced recurrence entirely. The key insight: you don't need sequential processing if you can compute all pairwise relationships at once.
Think of it as a soft, differentiable lookup table. You have a query (what you're looking for), a set of keys (what's available), and values (the actual content). Attention computes similarity between your query and all keys, turns those similarities into weights via softmax, then returns a weighted sum of values.
Each token's embedding x gets projected into three different spaces by learned weight matrices W_Q, W_K, W_V. These are not the same - they serve different roles:

Q = W_Q · x (the query: what this token is looking for)
K = W_K · x (the key: what this token advertises as findable)
V = W_V · x (the value: the content this token actually contributes)
The dot product Q · Kᵀ, scaled by 1/√d_k so scores don't blow up as dimension grows, produces a score matrix of shape (seq_len × seq_len). Every query attends to every key. Score = geometric similarity in the projected space. High dot product means the query and key are "compatible".
Softmax converts raw scores to a probability distribution over positions. The output is a weighted sum of values - each output token is a blend of all input values, weighted by attention.
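The whole lookup can be sketched in a few lines of NumPy (shapes and the toy data here are illustrative, not from any particular model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) score matrix
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # weighted sum of values

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1: every output token is a convex combination of the input values.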
The separation of K and V is important: the key determines relevance while the value determines what actually gets communicated. A token can be "findable" (high key activation) without dominating the output (low value magnitude).
Attention weights are often visualized as a heatmap: each row is a query token, each column a key token, and color intensity the attention weight. Such a map shows how the model resolves a pronoun like "it" by attending strongly to "animal".
A single attention head can only learn one type of relationship at a time. Multi-head attention runs H parallel attention operations with different learned projections, then concatenates the results.
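A minimal NumPy sketch of the split-project-concatenate pattern (dimensions are illustrative; real implementations batch this and fuse the projections):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """H parallel attention ops with separate projections, then concat."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project once, then split the feature dimension into heads: (H, seq, d_head)
    Q = (x @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (x @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (x @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, seq, seq)
    out = softmax(scores) @ V                            # (H, seq, d_head)
    # Concatenate heads and mix them with the output projection.
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
d_model, n_heads, seq_len = 8, 2, 5
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
```

Each head gets its own d_model/H-dimensional subspace, so different heads are free to specialize in different relationships.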
Standard attention is O(n²) in both compute and memory with respect to sequence length. A 1M token context computes 1 trillion attention scores. This is why long contexts are expensive and why alternatives exist.
FlashAttention: Reorders computation to be IO-aware. Tiles Q, K, V matrices to fit in SRAM. Same output, ~3x faster, 10x less memory. Still O(n²) FLOPs but dramatically fewer HBM reads/writes. The engineering solution.
Sparse attention: Only compute attention for a subset of token pairs - local windows plus strided global tokens (Longformer, BigBird). O(n) complexity, but can miss some long-range interactions.
Linear attention: Kernel trick to approximate softmax(QKᵀ) without materializing the full matrix - compute φ(Q)(φ(K)ᵀV) instead. True O(n), but with quality tradeoffs. Ongoing research.
Grouped Query Attention - multiple query heads share key/value heads. Llama 3 uses GQA. Reduces KV cache size significantly. Critical for inference efficiency at scale.
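A minimal sketch of the GQA grouping, assuming toy shapes (this is the mechanism, not Llama 3's actual configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(Q, K, V):
    """Grouped Query Attention: n_q query heads share n_kv K/V heads.
    Q: (n_q, seq, d), K/V: (n_kv, seq, d), with n_q % n_kv == 0."""
    n_q, n_kv = Q.shape[0], K.shape[0]
    group = n_q // n_kv
    # Broadcast each K/V head across its group of query heads.
    K = np.repeat(K, group, axis=0)
    V = np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(2)
Q = rng.standard_normal((8, 6, 4))   # 8 query heads
K = rng.standard_normal((2, 6, 4))   # only 2 K/V heads...
V = rng.standard_normal((2, 6, 4))   # ...so the KV cache is 4x smaller
out = gqa(Q, K, V)
```

Only K and V need caching during generation, so cutting K/V heads from 8 to 2 shrinks the cache 4x while keeping all 8 query heads.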
In autoregressive generation, future tokens must not influence past tokens. A causal mask zeros out the upper triangle of the attention matrix - token i can only attend to positions ≤ i.
-∞ before softmax → 0 after softmax. Causality enforced without any explicit rule.
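The mask is one `np.triu` away (a minimal sketch; frameworks typically fuse this into the attention kernel):

```python
import numpy as np

def causal_softmax(scores):
    """Set future positions to -inf before softmax, so row i only
    attends to positions j <= i."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strict upper triangle
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = causal_softmax(np.random.default_rng(3).standard_normal((4, 4)))
```

The resulting weight matrix is lower-triangular: exp(-∞) = 0, and each row renormalizes over only the allowed positions.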
Attention has no inherent sense of position - "cat sat mat" and "mat sat cat" produce identical attention scores unless positional information is injected.
Sinusoidal PE (original paper): Add fixed sine/cosine patterns of different frequencies to token embeddings. Position encoded as a unique superposition of waves.
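The sinusoidal scheme from the original paper, in NumPy:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Fixed positional encodings (Vaswani et al. 2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dims pick the frequency
    angles = pos / (10000 ** (i / d_model))    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(16, 8)  # added to the token embeddings, not concatenated
```

Each position gets a unique mix of wavelengths, from rapid oscillation in the first dimensions to very slow drift in the last.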
RoPE (Rotary Position Embedding, used in Llama, Gemma, Mistral): Rotate Q and K vectors by an angle proportional to position before computing dot products. The dot product then naturally encodes relative distance. Generalizes better to sequences longer than training length.
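A minimal RoPE sketch that demonstrates the relative-distance property (feature-pairing conventions vary between implementations; this one rotates consecutive pairs):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.
    x: (seq, d) with d even; positions: (seq,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) one frequency per pair
    theta = positions[:, None] * freqs[None, :]    # (seq, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# The Q·K dot product after RoPE depends only on relative distance:
rng = np.random.default_rng(4)
q, k = rng.standard_normal((1, 8)), rng.standard_normal((1, 8))
a = rope(q, np.array([3])) @ rope(k, np.array([5])).T     # positions 3 and 5
b = rope(q, np.array([10])) @ rope(k, np.array([12])).T   # positions 10 and 12
```

`a` and `b` are equal: rotating q by angle mθ and k by nθ leaves a dot product that depends only on (n − m)θ, which is why RoPE encodes relative position.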
During autoregressive generation, each new token needs to attend to all previous tokens. Recomputing K and V for the entire context at each step would be catastrophically slow.
The KV cache stores the computed key and value tensors for all previous tokens. Each new token only needs to compute its own Q, K, V and attend to cached K/V from prior positions.
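A toy decode loop with an append-only cache (single head, no batching - real caches are preallocated tensors per layer and head):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of K/V rows; each step attends over all cached positions."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, q, k, v):
        # Cache this token's K/V once, then attend q over the full history.
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        scores = q @ self.K.T / np.sqrt(q.shape[-1])   # (1, cached_len)
        return softmax(scores) @ self.V                 # (1, d)

cache = KVCache(d=4)
rng = np.random.default_rng(5)
for _ in range(3):  # three decode steps; K/V computed once per token, never recomputed
    out = cache.step(*rng.standard_normal((3, 1, 4)))
```

Each step is O(current context length) instead of O(context²), at the price of holding every past K and V row in memory.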
The cost: memory. A single 70B model serving long contexts needs gigabytes of KV cache per active session. At scale (many parallel sessions) this dominates memory usage, not the model weights themselves.