Before attention, sequence models had to compress an entire input sequence into a single fixed-size context vector — a bottleneck that forced information loss. Attention broke this by letting every output position look back at all input positions simultaneously, weighting them by relevance.
The core idea is a soft, differentiable dictionary lookup. A query is compared against a set of keys. The similarities become weights. Those weights aggregate over values to produce an output. Everything is continuous and therefore differentiable end to end — the network learns which queries, keys, and values to produce.
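That lookup fits in a few lines of numpy. A minimal sketch with toy data (the name `soft_lookup` and the example keys/values are illustrative, not from any library):

```python
import numpy as np

def soft_lookup(q, K, V):
    """Differentiable dictionary lookup: similarity -> weights -> weighted sum."""
    scores = K @ q                            # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: a distribution over entries
    return weights @ V                        # read out a blend of the values

K = np.eye(3)                                 # three orthogonal keys
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([10.0, 0.0, 0.0])                # query strongly matching key 0
out = soft_lookup(q, K, V)                    # ≈ V[0]: soft, but sharply selective
```

Unlike a hard dictionary, every entry contributes a little, so gradients flow to all of them.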
Self-attention is the special case where queries, keys, and values all come from the same sequence. Each position asks: "which other positions are relevant to understanding me?" and the answer updates the representation. Stack enough of these with feedforward layers and you have a transformer.
What am I looking for? The query is a learned linear projection of the current token's representation. It encodes the "information need" of this position.
W_Q ∈ R^{d_model × d_k} — learned
In cross-attention (encoder-decoder), queries come from the decoder. In self-attention, all three come from the same source sequence X.
What do I contain? Keys are another linear projection of each token. They advertise content. A high dot product between a query and a key means that token is relevant to the current position's need.
W_K ∈ R^{d_model × d_k} — learned
Each query is compared against every key — this is the O(n²) step. Every pair (i, j) requires a dot product. For n=4096 tokens, that's ~16.7M dot products per head per layer.
What do I actually send? Values are the third projection. Once attention weights are computed via Q·K, the output is a weighted sum of values. The keys determine relevance; values carry the actual information transferred.
W_V ∈ R^{d_model × d_v} — learned
Keys and values are often called "memories." The query retrieves from memory by matching keys, then reads out the associated values. Identical in structure to a Modern Hopfield network update.
Matrix multiply of queries by transposed keys. Produces an n×n similarity matrix — every pair of positions gets a raw score. Expensive but parallelizable.
Scale factor prevents dot products from growing too large in high dimensions (their variance scales with d_k). Without this, softmax saturates and gradients vanish — the scale plays the same role as the inverse-temperature β in modern Hopfield networks.
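The variance claim is easy to check empirically. A sketch assuming unit-variance query and key components (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 256, 20_000
q = rng.standard_normal((n_samples, d_k))     # unit-variance query components
k = rng.standard_normal((n_samples, d_k))     # unit-variance key components
dots = (q * k).sum(axis=1)                    # raw dot products: variance ≈ d_k
scaled = dots / np.sqrt(d_k)                  # what attention actually uses: ≈ 1
print(dots.var(), scaled.var())
```

With variance held near 1, softmax inputs stay in the regime where gradients are healthy regardless of d_k.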
Applied row-wise. Converts raw scores to a probability distribution over positions. Each row sums to 1 — these are the attention weights. Differentiable everywhere.
Final weighted sum. Each output position gets a convex combination of all value vectors, weighted by how much attention it paid to each position. This is the output.
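The four steps — scores, scale, softmax, weighted sum — compose into a few lines of numpy. A sketch with assumed shapes, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q,K: (n, d_k); V: (n, d_v) -> (n, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # n×n similarity matrix, scaled
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # convex combination of values

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 4
Q, K = rng.standard_normal((2, n, d_k))
V = rng.standard_normal((n, d_v))
out = attention(Q, K, V)                      # shape (6, 4)
```

Every operation here is a matrix multiply or an elementwise map, which is why the whole thing parallelizes so well on accelerators.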
Classic example from Vaswani et al. The pronoun "it" must resolve to either "animal" or "street." Select any token to see how it distributes attention weight across the sentence. The model learns these patterns from data — no explicit grammar rules.
Running attention once gives one perspective on which positions are relevant. Multi-head attention runs h independent attention operations in parallel — each with its own Q, K, V projections — then concatenates and projects the results. Each head can specialize in a different relationship type.
head_i = Attention(Q·W_Qi, K·W_Ki, V·W_Vi)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W_O
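A sketch of the full multi-head computation (hypothetical shapes; the per-head loop is for clarity — real implementations batch the heads into one einsum):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h*d_k, d_model)."""
    heads = []
    for wq, wk, wv in zip(W_Q, W_K, W_V):          # one pass per head
        Q, K, V = X @ wq, X @ wk, X @ wv
        A = softmax(Q @ K.T / np.sqrt(wq.shape[-1]))
        heads.append(A @ V)                        # this head's view: (n, d_k)
    return np.concatenate(heads, axis=-1) @ W_O    # concat, then project back

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h                                 # heads split the model dim
W_Q, W_K, W_V = rng.standard_normal((3, h, d_model, d_k)) * 0.1
W_O = rng.standard_normal((h * d_k, d_model)) * 0.1
X = rng.standard_normal((n, d_model))
out = multi_head(X, W_Q, W_K, W_V, W_O)            # shape (5, 16)
```

Because d_k = d_model / h, the total cost is roughly that of a single full-width attention, but each head gets its own independent similarity structure.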
The visualization below shows 8 heads for the sentence "The cat sat on the mat." Circle size encodes attention weight from each head to each token. Note how different heads develop distinct, interpretable specializations — these emerge purely from gradient descent.
Grouped-query attention (GQA) is now standard in production LLMs (Llama, Mistral, Gemma): several query heads share one key/value head, cutting KV-cache memory 4-8x with minimal quality loss. Flash Attention is the standard implementation — it tiles the computation to fit in SRAM, eliminating the bottleneck of reading and writing the full n×n matrix to HBM.
Memory: O(n²)
n = sequence length, d = model dim
For n=4096, the attention matrix alone requires 4096² ≈ 16.7M entries. At fp16 that's 32 MiB per head per layer. A 32-layer, 32-head model: 32 MiB × 32 × 32 = 32 GiB just for attention matrices. This is why long-context models are expensive.
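The same back-of-envelope as code:

```python
n, layers, heads = 4096, 32, 32
bytes_per_entry = 2                                # fp16
per_matrix = n * n * bytes_per_entry               # one head's n×n score matrix
total = per_matrix * heads * layers
print(per_matrix // 2**20, "MiB per head per layer")
print(total // 2**30, "GiB across all heads and layers")
```

Note the quadratic term: doubling the context to 8192 quadruples both numbers.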
Flash Attention sidesteps materializing the full n×n matrix by fusing the softmax + matmul into tiled operations that stay in SRAM. Memory drops to O(n). This is how 128K+ context windows became practical.
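The core trick is an online softmax over tiles. A numerical sketch for a single query row (illustrating the idea only — the real kernel also tiles queries and runs fused in GPU SRAM):

```python
import numpy as np

def tiled_attention(q, K, V, tile=64):
    """Stream K, V in tiles, keeping running max/denominator/output.
    Never materializes more than one tile of scores at a time."""
    d_k = K.shape[-1]
    m = -np.inf                          # running max of scores seen so far
    l = 0.0                              # running softmax denominator
    o = np.zeros(V.shape[-1])            # running unnormalized output
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q / np.sqrt(d_k)   # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale old accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[i:i + tile]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 16))
V = rng.standard_normal((256, 8))
q = rng.standard_normal(16)

# Reference: naive softmax over the full score row, computed all at once.
s = K @ q / np.sqrt(16)
w = np.exp(s - s.max())
w /= w.sum()
naive = w @ V                            # tiled_attention(q, K, V) matches this
```

The rescaling by exp(m − m_new) is what lets the running sums stay correct as a larger maximum arrives in a later tile.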
Attention is permutation-equivariant — shuffle the input tokens and the same values flow through, just re-routed: the outputs shuffle the same way. There's no inherent notion of position, so position must be injected explicitly.
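A quick demonstration that shuffling inputs just shuffles outputs (toy numpy self-attention; the weight matrices are random, not learned):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    s = Q @ K.T / np.sqrt(W_K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))                    # 6 tokens, dim 8
W_Q, W_K, W_V = rng.standard_normal((3, 8, 4))
perm = rng.permutation(6)

out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)
# out_shuffled equals out[perm]: nothing in the math knows where a token sits.
```

This is exactly why positional information has to be added to the inputs rather than hoped for from the architecture.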
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Original sinusoidal encoding — Vaswani 2017
Added to the input embeddings before the first layer. Learned absolute position embeddings (BERT) are simpler but don't generalize to lengths unseen during training. RoPE (Rotary Position Embedding) — used in GPT-NeoX, Llama — encodes relative position directly in the QK dot product via rotation matrices. Better extrapolation, now standard.
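The sinusoidal encoding in numpy (a sketch following the formula above; `sinusoidal_pe` is an illustrative name, and d_model is assumed even):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Sinusoidal position encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(n_pos)[:, None]                # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2) frequency index
    angles = pos / 10000 ** (2 * i / d_model)      # wavelengths from 2π to 10000·2π
    pe = np.empty((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)    # added to token embeddings before the first layer
```

Each dimension pair oscillates at a different frequency, so any position gets a unique fingerprint, and the pattern extends to positions longer than those seen in training.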
Attention heads develop interpretable specializations through training. Elhage et al. (2021) and Anthropic's broader mechanistic interpretability work identified recurring head types:
Induction heads in particular are thought to underlie in-context learning — the model's ability to follow patterns demonstrated in the prompt. They form during a phase transition in training and are a candidate mechanism for many few-shot capabilities.