ATTENTION

THE MECHANISM UNDERNEATH EVERYTHING
QUERY // KEY // VALUE // SCALED DOT-PRODUCT
TRANSFORMER ARCHITECTURE // VASWANI ET AL. 2017
01 // Core Concept
WHAT ATTENTION DOES

Attention is a mechanism for every token in a sequence to dynamically weight how much it should care about every other token. Rather than compressing context into a fixed-size vector (the old RNN problem), attention lets the model maintain direct connections to all positions simultaneously.

The "Attention Is All You Need" paper (Vaswani et al. 2017) replaced recurrence entirely. The key insight: you don't need sequential processing if you can compute all pairwise relationships at once.

Think of it as a soft, differentiable lookup table. You have a query (what you're looking for), a set of keys (what's available), and values (the actual content). Attention computes similarity between your query and all keys, turns those similarities into weights via softmax, then returns a weighted sum of values.

Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V
// Q=queries, K=keys, V=values. dₖ=key dimension. √dₖ scaling prevents dot products from growing too large, which would push softmax into regions with tiny gradients.
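The formula translates almost directly into NumPy. A minimal sketch (names and shapes are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # weighted sum of values

# toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the rows of V, because each softmax row sums to 1.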
02 // Q, K, V Projections
QUERY // KEY // VALUE

Each token's embedding gets projected into three different spaces by learned weight matrices W_Q, W_K, W_V. These are not the same - they serve different roles:

Q // "what am I looking for?" // Q = W_Q · x
K // "what do I contain?" // K = W_K · x
V // "what do I communicate?" // V = W_V · x

The dot product Q · Kᵀ produces a score matrix of shape (seq_len × seq_len). Every query attends to every key. Score = geometric similarity in the projected space. High dot product means the query and key are "compatible".

Softmax converts raw scores to a probability distribution over positions. The output is a weighted sum of values - each output token is a blend of all input values, weighted by attention.

The separation of K and V is important: the key determines relevance while the value determines what actually gets communicated. A token can be "findable" (high key activation) without dominating the output (low value magnitude).
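The three projections and the resulting score matrix, sketched in NumPy (the dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))   # token embeddings
W_Q = rng.normal(size=(d_model, d_k))     # "what am I looking for?"
W_K = rng.normal(size=(d_model, d_k))     # "what do I contain?"
W_V = rng.normal(size=(d_model, d_k))     # "what do I communicate?"

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # three views of the same tokens
scores = Q @ K.T / np.sqrt(d_k)           # every query vs. every key
print(scores.shape)  # (5, 5)
```

Note that the score matrix is seq_len × seq_len regardless of d_k — this is where the quadratic cost comes from.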

03 // Attention Heatmap // "The animal didn't cross the street because it was too tired"
WHAT ATTENDS TO WHAT

Each row = query token. Each column = key token. Color intensity = attention weight. This is how the model resolves "it" → "animal".

[Heatmap: query row for "it" over "The animal didn't cross the street because it was tired". Weight concentrates on "animal" (0.61) and "tired" (0.17), with near-zero weight (0.02–0.04) elsewhere — coreference resolved. Legend: low // medium // high attention.]
04 // Multi-Head Attention
MULTIPLE HEADS

A single attention head can only learn one type of relationship at a time. Multi-head attention runs H parallel attention operations with different learned projections, then concatenates the results.

HEAD 1 // syntactic deps
HEAD 2 // coreference
HEAD 3 // positional
HEAD 4 // semantic sim
HEAD 5 // rare tokens
HEAD 6 // verb-object
HEAD 7 // long range
HEAD 8 // local window
MultiHead(Q,K,V) = Concat(head₁, …, head_H) · W_O
where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
// GPT-3: 96 heads, d_model = 12288. Each head operates on a d_model/H = 128-dimensional subspace. W_O projects the concatenated output back to the model dimension.
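A minimal sketch of the split-attend-concat pattern in NumPy (head count and dimensions are toy values, not any real model's config):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, H = 6, 32, 4
d_head = d_model // H                         # each head works in an 8-dim subspace

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(seq_len, d_model))
# per-head projection matrices W_i^Q, W_i^K, W_i^V
proj = [[rng.normal(size=(d_model, d_head)) for _ in range(3)] for _ in range(H)]
W_O = rng.normal(size=(d_model, d_model))     # output projection

heads = []
for W_Q, W_K, W_V in proj:
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(d_head))    # (seq_len, seq_len) per head
    heads.append(A @ V)                       # (seq_len, d_head)

out = np.concatenate(heads, axis=-1) @ W_O    # back to (seq_len, d_model)
print(out.shape)  # (6, 32)
```

Production implementations fuse the per-head projections into single large matrices and use a reshape instead of a loop, but the math is the same.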
05 // Complexity & Variants
THE QUADRATIC PROBLEM

Standard attention is O(n²) in both compute and memory with respect to sequence length. A 1M token context computes 1 trillion attention scores. This is why long contexts are expensive and why alternatives exist.

FLASH ATTENTION

Reorders computation to be IO-aware. Tiles Q,K,V matrices to fit in SRAM. Same output, ~3x faster, 10x less memory. Still O(n²) FLOPs but dramatically fewer HBM reads/writes. The engineering solution.

SPARSE ATTENTION

Only compute attention for a subset of token pairs - local windows plus strided global tokens (Longformer, BigBird). O(n) complexity, at the cost of missing some long-range interactions.

LINEAR ATTENTION

Kernel trick to approximate softmax(QKᵀ) without materializing the full matrix. φ(Q)(φ(K)ᵀV) instead. True O(n) but quality tradeoffs. Ongoing research.

GQA / MQA

Grouped-Query Attention - multiple query heads share each key/value head (Multi-Query Attention is the extreme case: one shared KV head). Llama 3 uses GQA. Reduces KV cache size significantly - critical for inference efficiency at scale.
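The sharing scheme is simple to sketch: each group of query heads indexes into the same K/V head. A NumPy illustration (head counts and dimensions are made up for the example):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, seq, d_head); K, V: (n_kv_heads, seq, d_head)."""
    n_q, n_kv = Q.shape[0], K.shape[0]
    group = n_q // n_kv                       # query heads per shared KV head
    outs = []
    for h in range(n_q):
        k, v = K[h // group], V[h // group]   # reuse K/V across the group
        A = softmax(Q[h] @ k.T / np.sqrt(Q.shape[-1]))
        outs.append(A @ v)
    return np.stack(outs)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 5, 16))               # 8 query heads
K = rng.normal(size=(2, 5, 16))               # only 2 KV heads -> 4x smaller KV cache
V = rng.normal(size=(2, 5, 16))
print(grouped_query_attention(Q, K, V).shape)  # (8, 5, 16)
```

The output shape matches full multi-head attention; only the number of cached K/V tensors shrinks.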

06 // Causal Masking
DECODER MASK

In autoregressive generation, future tokens must not influence past tokens. A causal mask sets the upper triangle of the score matrix to −∞ before softmax, so token i can only attend to positions ≤ i.

      tok1  tok2  tok3  tok4
t1     ·    -∞    -∞    -∞
t2     ·     ·    -∞    -∞
t3     ·     ·     ·    -∞
t4     ·     ·     ·     ·

-∞ before softmax → 0 after softmax. Causality enforced without any explicit rule.
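The whole trick in a few lines of NumPy (shapes are illustrative):

```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# causal mask: -inf strictly above the diagonal, softmax sends those weights to 0
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)                      # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)

print(np.allclose(np.triu(weights, k=1), 0))  # True: no attention to the future
```

Row 0 ends up attending only to itself; row 3 attends over all four positions.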

07 // Positional Encoding
ROPE & SINUSOIDAL

Attention has no inherent sense of position - "cat sat mat" and "mat sat cat" produce identical attention scores without positional information injected.

Sinusoidal PE (original paper): Add fixed sine/cosine patterns of different frequencies to token embeddings. Position encoded as a unique superposition of waves.

RoPE (Rotary Position Embedding, used in Llama, Gemma, Mistral): Rotate Q and K vectors by an angle proportional to position before computing dot products. The dot product then naturally encodes relative distance. Generalizes better to sequences longer than training length.

ROPE // ROTATION IN EMBEDDING SPACE
[Diagram: the same vector rotated to a different angle at pos=1, pos=2, pos=3 - one rotation per position.]
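RoPE's relative-position property can be checked numerically: rotating q and k by their absolute positions makes their dot product depend only on the offset between them. A simplified sketch (pairwise rotation, base frequency as in the original formulation; not a drop-in for any library's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dim pairs of x by position-dependent angles (RoPE sketch)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2 / d)  # one frequency per dim pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# dot product after rotation depends only on the relative offset
a = rope(q, pos=3) @ rope(k, pos=1)    # offset 2
b = rope(q, pos=10) @ rope(k, pos=8)   # offset 2
print(np.allclose(a, b))  # True
```

This is why RoPE encodes relative distance "for free": R(θ_m)ᵀ·R(θ_n) = R(θ_n − θ_m) for 2D rotations.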
08 // KV Cache
INFERENCE TRICK

During autoregressive generation, each new token needs to attend to all previous tokens. Recomputing K and V for the entire context at each step would be catastrophically slow.

The KV cache stores the computed key and value tensors for all previous tokens. Each new token only needs to compute its own Q, K, V and attend to cached K/V from prior positions.

The cost: memory. A single 70B model serving long contexts needs gigabytes of KV cache per active session. At scale (many parallel sessions) this dominates memory usage, not the model weights themselves.

KV cache size = 2 · n_layers · n_kv_heads · d_head · seq_len · bytes_per_param (n_kv_heads = n_heads without GQA)
Llama 3 70B (80 layers, 8 KV heads, d_head=128), 128k ctx, fp16: ~40GB KV cache alone
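The formula as a one-liner, plugged through with Llama 3 70B's public config (80 layers, 8 KV heads via GQA, d_head = 128) at fp16 and a 128k = 131072-token context:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_param=2):
    # factor 2: one K tensor and one V tensor per layer
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_param

size = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=131072)
print(f"{size / 1e9:.1f} GB")  # 42.9 GB
```

Without GQA (64 KV heads instead of 8) the same context would need 8× more - which is why KV-head sharing matters so much for serving.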