DNN
MATH

Deep Network Mathematics Reference
ACTIVATION // LOSS // BACKPROP // TRANSFORMER
00 // Reference Card
The Math That Runs Everything

Every deep learning system -- from a two-layer perceptron to a 405B transformer -- is built on the same small set of operations. Linear transformations, nonlinear gates, loss minimization, gradient propagation. The architectures change. The math doesn't.

Twelve sections covering linear algebra, activation functions with live interactive plots, loss functions, backpropagation, a gradient descent landscape simulator, transformer internals, an attention pattern visualizer, quantization/LoRA, information theory, sampling, regularization, and numerical stability.

Notes reference Qwen, LLaMA, and the ersatz lab where relevant.

12 Sections | 38 Equations | 3 Interactive viz | 0 MathJax deps
backprop | transformer | gradient landscape | quantization | LoRA | vanishing gradients
01 // Linear Algebra Foundations
The Substrate

Everything in a neural network reduces to matrix multiplication and element-wise operations. Knowing shapes at each step is more useful than memorizing architecture diagrams.

Core Forward Pass
z = Wx + b            // pre-activation (affine transform)
a = f(z)              // post-activation (nonlinearity)
y_hat = softmax(a_L)  // final output (classification)
shapes: W is (d_out x d_in), x is (d_in,), b is (d_out,)
The whole network is a composition of these. Every "layer" is one pass. d_in and d_out must match at boundaries.
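The shape bookkeeping above can be checked with a minimal numpy sketch; the layer sizes (4 -> 8 -> 3) and the `dense` helper are illustrative, not from any framework:

```python
import numpy as np

def dense(x, W, b, f):
    """One layer: affine transform, then a nonlinearity."""
    z = W @ x + b        # pre-activation, shape (d_out,)
    return f(z)          # post-activation, same shape

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # d_in = 4
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # layer 1: 4 -> 8
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # layer 2: 8 -> 3 classes
h = dense(x, W1, b1, relu)
y_hat = softmax(W2 @ h + b2)                     # probabilities over 3 classes
```

Note how d_out of layer 1 must equal d_in of layer 2 -- the boundary-matching rule in action.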
Dot Product
a . b = SUM(a_i * b_i) = ||a|| * ||b|| * cos(theta)
Similarity measure. High dot product = aligned vectors. This is what attention computes.
Matrix Multiply
(A @ B)_ij = SUM_k(A_ik * B_kj)
(m x n) @ (n x p) = (m x p)
Inner dimensions must match. Every linear layer is a matmul. Every attention head is four matmuls.
Cosine Similarity
cos_sim(a, b) = (a . b) / (||a|| * ||b||)
Range [-1, 1]. Used in embedding retrieval, RAG, contrastive loss.
Eigendecomposition
Av = lambda * v
A = Q * Lambda * Q^T  // symmetric case
Eigenvalues of the Hessian determine loss surface curvature. Large eigenvalue ratios = ill-conditioned optimization, a big part of why adaptive optimizers like Adam tend to beat plain SGD on these problems.
02 // Activation Functions -- Interactive
Nonlinearity or Bust

Without nonlinearity, any stack of linear layers collapses to one linear transform. Select an activation to see its shape and derivative. The red dashed trace is the derivative -- when it flatlines at zero, gradients die.

ReLU
f(x) = max(0, x)
f'(x) = 1 if x > 0, else 0
Cheap, fast, works. Dying ReLU: units that go negative never recover.
GELU (Transformers)
f(x) = x * Phi(x)
approx: 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3)))
Phi = standard normal CDF. Smooth, differentiable everywhere. Default in BERT, GPT-2.
SiLU / Swish (LLaMA, Qwen)
f(x) = x * sigma(x)
f'(x) = sigma(x) + x * sigma(x) * (1 - sigma(x))
Self-gated. Used in SwiGLU FFN blocks. Small negative lobe avoids dead neurons.
Softmax
softmax(z_i) = e^(z_i) / SUM_j(e^(z_j))
temperature: softmax(z_i / T)
Turns logits into probabilities. T = temperature slider in Ollama and LM Studio.
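All four activations above fit in a few lines of numpy; this is a sketch using the tanh approximation given for GELU, not a library implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1 + np.exp(-x))  # x * sigmoid(x), self-gated

def softmax(z, T=1.0):
    e = np.exp((z - z.max()) / T)  # shift-invariant, so subtract max first
    return e / e.sum()
```

High temperature flattens the distribution: `softmax(z, T=100)` is close to uniform for moderate logits, matching the T -> inf limit.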
03 // Loss Functions
What the Network Is Minimizing

The loss function defines "correct." The optimizer only knows the scalar it's trying to shrink.

Cross-Entropy
L = -SUM(y_i * log(y_hat_i))
binary: L = -[y*log(p) + (1-y)*log(1-p)]
Standard for classification. Also next-token prediction loss in LLMs.
KL Divergence
D_KL(P || Q) = SUM(P(x) * log(P(x) / Q(x)))
Not symmetric. Shows up in VAEs, knowledge distillation, RLHF.
MSE (Regression)
L = (1/n) * SUM((y_i - y_hat_i)^2)
Penalizes large errors quadratically. Sensitive to outliers.
Perplexity
PPL = e^L // L = avg cross-entropy per token
PPL of k = model is as uncertain as choosing among k tokens uniformly. Lower is better.
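The PPL = e^L relationship is easy to verify numerically: a model that always predicts uniform over 4 tokens should score exactly PPL 4. A sketch (the function names are illustrative):

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """Average cross-entropy in nats over a batch of distributions."""
    return -np.mean(np.sum(p_true * np.log(q_pred + eps), axis=-1))

def perplexity(avg_ce):
    return np.exp(avg_ce)

p = np.eye(4)              # 4 one-hot targets
q = np.full((4, 4), 0.25)  # model always predicts uniform over 4 tokens
ce = cross_entropy(p, q)   # = log(4) nats per token
```

`perplexity(ce)` comes out to 4.0: the model is exactly as uncertain as a uniform choice among 4 tokens.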
04 // Backpropagation and Optimization
Gradient Flow

Backprop is the chain rule applied recursively from loss to first layer. The optimizer uses these gradients to update weights.

Chain Rule (The Whole Game)
dL/dw = (dL/da) * (da/dz) * (dz/dw)
dL/dz_l = (W_(l+1)^T * dL/dz_(l+1)) .* f'(z_l)  // layer recursion, .* = element-wise
That's it. Everything else is bookkeeping. Autograd builds and traverses the computational graph.
Adam / AdamW (The Default)
m <- beta1*m + (1-beta1)*grad(L)      // first moment
v <- beta2*v + (1-beta2)*(grad(L))^2  // second moment
m_hat = m / (1 - beta1^t)             // bias correction
v_hat = v / (1 - beta2^t)
w <- w - eta * m_hat / (sqrt(v_hat) + eps)
AdamW: w <- w - eta*(m_hat/(sqrt(v_hat)+eps) + lambda*w)
typical: beta1=0.9  beta2=0.999  eps=1e-8
Per-parameter adaptive learning rates. AdamW decouples weight decay from gradient. The standard for transformer training.
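The update transcribed directly into numpy; `adamw_step` and its defaults are a teaching sketch, not an optimizer to ship:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update; t is the 1-based step count. Returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad     # first moment (EMA of gradient)
    v = beta2 * v + (1 - beta2) * grad**2  # second moment (EMA of squared grad)
    m_hat = m / (1 - beta1**t)             # bias correction: moments start at 0
    v_hat = v / (1 - beta2**t)
    # decoupled weight decay: wd*w is added outside the adaptive scaling
    w = w - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adamw_step(w, np.array([0.5]), m, v, t=1)
```

After one step with a positive gradient, w has moved down; bias correction means the very first step already has the full eta magnitude despite m and v starting at zero.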
SGD + Momentum
v <- beta*v + grad(L)
w <- w - eta*v
Accumulates velocity. Dampens oscillation in high-curvature directions.
Cosine Annealing + Warmup
warmup: eta(t) = eta_max * (t / warmup_steps)
anneal: eta(t) = eta_min + 0.5*(eta_max - eta_min) * (1 + cos(pi * t / T))
Linear warmup over the first 1-5% of steps prevents early instability. Cosine decay over the remaining steps.
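The full schedule is one function; eta_max and eta_min here are placeholder values, not recommendations:

```python
import math

def lr(t, warmup_steps, total_steps, eta_max=3e-4, eta_min=3e-5):
    """Linear warmup to eta_max, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps          # linear ramp from 0
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

At t = warmup_steps the two pieces meet at eta_max; at t = total_steps the cosine term bottoms out at eta_min.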
05 // Gradient Descent Landscape -- Interactive
The Terrain of Optimization

The cyan dot is the current parameter. Watch how learning rate and momentum affect convergence. The red arrow shows negative gradient direction.

06 // Transformer Architecture
Attention Is All You Need (And Some FFN)

The block: norm -> attention -> residual -> norm -> FFN -> residual. Modern variants use pre-norm, RMSNorm, SwiGLU. The attention mechanism hasn't changed since 2017.

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Q = x * W_Q  // what am I looking for?
K = x * W_K  // what do I contain?
V = x * W_V  // what do I return?
sqrt(d_k) prevents dot products from pushing softmax into saturation. Without it, near-one-hot distributions kill gradients.
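A minimal single-head version in numpy, taking Q = K = V = x (no projection matrices) just to show the shapes and the causal mask; real implementations project first:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled similarity matrix
    if causal:
        # mask out future positions: large negative -> ~0 after softmax
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = softmax(scores)            # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))              # 5 tokens, d_k = 8
out, w = attention(x, x, x, causal=True)
```

Rows of `w` sum to 1 and the upper triangle is zero -- exactly what the attention visualizer in section 07 displays.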
Multi-Head Attention
MultiHead = Concat(head_1,...,head_h) * W_O
head_i = Attention(Q*W_i^Q, K*W_i^K, V*W_i^V)
Each head attends to different relationships. GQA in LLaMA 3 shares K/V heads to reduce KV cache.
RoPE
q_m = R(theta_m) * q
k_n = R(theta_n) * k
score = q_m . k_n = f(m - n)  // relative pos
Rotates Q and K in 2D subspaces. Score depends on relative distance. Better long-sequence extension than sinusoidal.
SwiGLU FFN (Qwen, LLaMA)
FFN(x) = (Swish(x*W1) * x*W3) * W2
// three matrices, hidden ~ 8/3 * d_model
Gated linear unit. More expressive than ReLU-FFN at same parameter count.
RMSNorm
RMSNorm(x) = x * gamma / sqrt(mean(x^2) + eps)
Drops mean-centering from LayerNorm. Slightly faster. The norm of choice in modern open-weight models.
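RMSNorm is three lines of numpy; gamma = 1 here for illustration (it is a learned scale in practice):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale by root-mean-square; no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([3.0, -4.0])   # rms = sqrt((9 + 16) / 2)
y = rms_norm(x, gamma=np.ones(2))
```

After normalization the mean square of `y` is ~1, regardless of the input scale -- the only statistic RMSNorm controls.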
Residual Connection
output = x + sublayer(x)
d(output)/d(x) = 1 + d(sublayer)/d(x)
The +1 prevents vanishing gradients. Not optional -- gradients collapse exponentially without them.
07 // Attention Pattern Visualizer -- Interactive
What Tokens Attend To

Each cell shows how much a query (row) attends to each key (column). Rows sum to 1.0 (softmax). Hover for weights. Causal mask means tokens only see backward.

08 // Quantization and LoRA
Fitting Large Models Into Small Boxes

Quantization reduces precision. LoRA reduces trainable parameters. Both necessary for running anything on a 1650 Ti.

Uniform Quantization
q = round((x - z) / s)  // quantize
x_hat = q * s + z       // dequantize
s = (x_max - x_min) / (2^b - 1)
b = bits (4 or 8). Error bounded by s/2. Symmetric sets z=0, simpler but wastes range.
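A round-trip at 4 bits, checking the s/2 error bound; this is the asymmetric variant from the block above, with z = x_min:

```python
import numpy as np

def quantize(x, bits=4):
    """Asymmetric uniform quantization; z is the range offset."""
    z = x.min()
    s = (x.max() - x.min()) / (2**bits - 1)   # step size over 2^b - 1 levels
    q = np.round((x - z) / s)                  # integer codes in [0, 2^b - 1]
    return q, s, z

def dequantize(q, s, z):
    return q * s + z

x = np.linspace(-1.0, 1.0, 17)
q, s, z = quantize(x, bits=4)
x_hat = dequantize(q, s, z)
```

The worst-case reconstruction error is half a step, s/2 -- rounding can never miss by more than half the gap between levels.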
NF4 (NormalFloat 4-bit)
quantile bins: equal probability mass under N(0,1)
non-uniform edges clustered near zero
Trained weights are ~normal. NF4 puts more bins where density is highest. Used in bitsandbytes QLoRA.
LoRA (Low-Rank Adaptation)
frozen: W (d x d)   trainable: B (d x r), A (r x d)   r << d
h = (W + B * A) * x
merged: W' = W + (alpha/r) * B * A
full finetune: d^2 params   LoRA: 2*d*r   // r=16, d=4096 -> 0.78%
Only A and B get gradients. r typically 8-64. alpha often = r. At merge time, folds back into full weight.
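The parameter arithmetic, sketched with the standard init (A small random, B zero) so the adapter starts as a no-op; sizes match the r=16, d=4096 example:

```python
import numpy as np

d, r, alpha = 4096, 16, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))         # frozen base weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, init to zero

def lora_forward(x):
    # low-rank update path added to the frozen weight
    return W @ x + (alpha / r) * (B @ (A @ x))

frac = (A.size + B.size) / W.size   # 2*d*r / d^2 trainable fraction
x = rng.normal(size=d)
same_at_init = np.allclose(lora_forward(x), W @ x)  # B=0 -> no change yet
```

`frac` comes out to 0.0078125, the 0.78% from the block above; with B initialized to zero the adapted model is exactly the base model until training moves B.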
VRAM Rule of Thumb
inference: params * bytes_per_param
3B @ int4  ~1.5 GB
7B @ int4  ~3.5 GB
7B @ fp16  ~14 GB
training: ~3-4x inference
Add 10-20% for KV cache. Qwen 3B Q4 fits on 4GB 1650 Ti with room. 7B Q4 fits if you trim context.
GGUF Quant Levels
Q2_K    ~2.5 bpw   barely coherent
Q3_K_M  ~3.5 bpw   usable for small models
Q4_K_M  ~4.5 bpw   sweet spot (1650 Ti)
Q5_K_M  ~5.5 bpw   near-lossless
Q6_K    ~6.5 bpw   diminishing returns
Q8_0    ~8.5 bpw   basically FP16
bpw = bits per weight. _K = k-quant with per-group scaling. IQ variants use importance-weighted quantization.
09 // Information Theory
Uncertainty as a Quantity

Shannon's framework treats information as measurable. These equations appear throughout DL because next-token prediction is an information-theoretic problem.

Shannon Entropy
H(X) = -SUM(p(x) * log2(p(x)))
max: log2(N) uniform   min: 0 deterministic
Average uncertainty. Uniform maximizes, delta minimizes. Units: bits (log2), nats (ln).
Mutual Information
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
How much Y reduces uncertainty about X. Symmetric, unlike KL divergence.
Cross-Entropy (Information View)
H(P, Q) = -SUM(P(x) * log(Q(x))) = H(P) + D_KL(P || Q)
Minimizing CE over the data minimizes D_KL(P || Q), the divergence of the model Q from the data distribution P. Exactly the LLM training objective.
10 // LLM Sampling
Controlling the Output
Temperature
P(token) = softmax(logit / T)
T->0: greedy   T=1: standard   T->inf: uniform
Creative writing: 0.7-0.9. Coding: 0.2-0.4.
Top-p (Nucleus)
sort by P descending, accumulate until >= p
sample from this "nucleus", renormalize
p=0.9 uses smallest set covering 90% mass. Adapts to distribution shape.
Top-k
keep k highest tokens, zero rest, renormalize
Fixed slice regardless of shape. Often combined with temp and top-p.
Min-P (Dynamic Floor)
threshold = min_p * max(P)
keep tokens where P >= threshold
Scales cutoff relative to most likely token. More adaptive than top-k. Gaining adoption in llama.cpp.
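The three filters compose naturally. A combined sampler sketch in numpy -- the ordering (temperature, then top-p, then min-p) and the defaults are illustrative; real implementations differ in filter order and edge handling:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_filtered(logits, T=0.8, top_p=0.9, min_p=0.05, rng=None):
    """Temperature, then nucleus (top-p) and min-p filtering, then sample."""
    p = softmax(logits / T)
    order = np.argsort(p)[::-1]          # token indices, descending probability
    csum = np.cumsum(p[order])
    keep = np.zeros_like(p, dtype=bool)
    # smallest set whose cumulative mass reaches top_p; the top token
    # always qualifies, so the kept set is never empty
    keep[order[csum - p[order] < top_p]] = True
    keep &= p >= min_p * p.max()         # dynamic floor relative to the max
    p = np.where(keep, p, 0.0)
    p /= p.sum()                         # renormalize over survivors
    rng = rng or np.random.default_rng()
    return rng.choice(len(p), p=p)
```

With one dominant logit, the nucleus collapses to a single token and sampling becomes deterministic -- the greedy limit.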
11 // Regularization
Keeping the Network Stable
Dropout
mask ~ Bernoulli(1-p)
output = (x * mask) / (1-p)
Zero activations with probability p. Scale by 1/(1-p) to preserve expectation. Disabled at inference.
Gradient Clipping
if ||g|| > threshold: g <- threshold * g / ||g||
Prevents exploding gradients. Preserves direction. Standard max_norm = 1.0.
Weight Decay (L2)
L_total = L_task + (lambda/2) * ||w||^2
step: w <- w*(1-eta*lambda) - eta*grad(L)
Pulls weights toward zero. In Adam, L2 != weight decay -- AdamW implements true decay.
Vanishing Gradient Cascade
grad at layer l = PRODUCT(k=l..L) f'(z_k)*W_k^T
sigmoid: max(f') = 0.25    50 layers: 0.25^50 ~ 10^(-30)  DEAD
relu:    f' = 1 (active)   50 layers: 1^50 = 1            ALIVE
The reason ReLU replaced sigmoid. Sigmoid derivatives compound multiplicatively. Much of "why deep learning works now" comes down to managing this cascade -- ReLU, residual connections, normalization.
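The cascade is a one-liner to verify:

```python
# Gradient magnitude after 50 layers, ignoring the weight matrices:
sigmoid_max_deriv = 0.25  # sigmoid'(0), its maximum value
relu_deriv = 1.0          # ReLU' in the active region

print(sigmoid_max_deriv ** 50)  # ~7.9e-31, effectively zero
print(relu_deriv ** 50)         # 1.0, gradient survives intact
```

Even at sigmoid's *best-case* derivative, fifty layers erase the signal; any real network with off-center activations is worse.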
12 // Numerical Intuitions
Numbers You Should Know Cold
Floating Point Ranges
FP32: ~1.4e-45 to 3.4e38   23-bit mantissa
FP16: ~6e-8 to 65504       10-bit, OVERFLOWS
BF16: ~9e-41 to 3.4e38     7-bit mantissa, FP32 exponent
BF16 preferred for training: same range as FP32. FP16 needs loss scaling. Mix of BF16 activations + int4/int8 weights is common.
Softmax Stability
// UNSTABLE: e^z_i / SUM(e^z_j) -- overflows
// STABLE:
c = max(z)
softmax(z_i) = e^(z_i - c) / SUM(e^(z_j - c))
Softmax is shift-invariant. Subtracting max prevents overflow. Every production implementation does this.
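The stable form in numpy; logits near 1000 would overflow float64 in the naive version (e^1000 is inf), while the shifted version is exact:

```python
import numpy as np

def softmax_stable(z):
    c = z.max()              # shift-invariance: softmax(z) == softmax(z - c)
    e = np.exp(z - c)        # largest exponent is now exactly 0
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive e^1000 overflows
p = softmax_stable(z)
```

The result is finite, sums to 1, and preserves the ordering of the logits.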
Transformer Param Count
~12*L*d^2 + 2*V*d
LLaMA 7B: L=32, d=4096, V=32000 -> ~6.7B
12*d^2 per layer = 4 attention matrices (d^2 each) + 2 FFN matrices (4*d^2 each, hidden = 4d). Embedding often tied input/output.
KV Cache Memory
KV = 2 * L * h * d_head * seq_len * bytes  // 2 = K and V
LLaMA 7B, FP16, 4k ctx: ~2.1 GB
Grows linearly with sequence length. GQA reduces the number of K/V heads. This is why long context is expensive.
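The formula as plain arithmetic; plugging in LLaMA-7B-like numbers (32 layers, 32 KV heads, head dim 128, FP16) reproduces the ~2.1 GB figure:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per=2):
    """KV cache size: K and V, per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per

# LLaMA-7B-like config at FP16, 4k context
gb = kv_cache_bytes(32, 32, 128, 4096) / 1e9   # ~2.15 GB
```

Cutting KV heads from 32 to 8 (GQA-style) shrinks the cache by exactly 4x -- same formula, smaller h.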
FlashAttention
standard: O(N^2) memory (materializes full NxN)
flash:    O(N) memory (tiled, never materializes)
speed:    ~2-4x faster via reduced HBM reads
IO-aware exact algorithm, not an approximation. Tiles computation to fit in GPU SRAM. Same math, reordered. This is why 32k+ context works on consumer GPUs.