DNN
MATH

Deep Network Mathematics Reference
ACTIVATION // LOSS // BACKPROP // TRANSFORMER
00 // Reference Card
The Math That Runs Everything

Every deep learning system -- from a two-layer perceptron to a 405B transformer -- is built on the same small set of operations. Linear transformations, nonlinear gates, loss minimization, gradient propagation. The architectures change. The math doesn't.

Twelve sections covering linear algebra, activation functions with live interactive plots, loss functions, backpropagation, a gradient descent landscape simulator, transformer internals, an attention pattern visualizer, quantization/LoRA, information theory, sampling, regularization, and numerical stability.

Notes reference Qwen, LLaMA, and the ersatz lab where relevant.

12 Sections | 38 Equations | 3 Interactive viz | 0 MathJax deps
backprop | transformer | gradient landscape | quantization | LoRA | vanishing gradients
01 // Linear Algebra Foundations
The Substrate

Everything in a neural network reduces to matrix multiplication and element-wise operations. Knowing shapes at each step is more useful than memorizing architecture diagrams.

Core Forward Pass
z = Wx + b            // pre-activation (affine transform)
a = f(z)              // post-activation (nonlinearity)
y_hat = softmax(a_L)  // final output (classification)
shapes: W is (d_out x d_in), x is (d_in,), b is (d_out,)
The whole network is a composition of these. Every "layer" is one pass. d_in and d_out must match at boundaries.
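The shape bookkeeping above can be checked with a minimal numpy sketch; the layer sizes (4 -> 8 -> 3) and the `dense` helper are illustrative, not from any framework:

```python
import numpy as np

def dense(x, W, b, f):
    """One layer: affine transform, then a nonlinearity."""
    z = W @ x + b        # pre-activation, shape (d_out,)
    return f(z)          # post-activation, same shape

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # d_in = 4
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # layer 1: 4 -> 8
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # layer 2: 8 -> 3 classes
h = dense(x, W1, b1, relu)
y_hat = softmax(W2 @ h + b2)                     # probabilities over 3 classes
```

Note how d_out of layer 1 must equal d_in of layer 2 -- the boundary-matching rule in action.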
Dot Product
a . b = SUM(a_i * b_i) = ||a|| * ||b|| * cos(theta)
Similarity measure. High dot product = aligned vectors. This is what attention computes.
Matrix Multiply
(A @ B)_ij = SUM_k(A_ik * B_kj)
(m x n) @ (n x p) = (m x p)
Inner dimensions must match. Every linear layer is a matmul. Every attention head is four matmuls.
Cosine Similarity
cos_sim(a, b) = (a . b) / (||a|| * ||b||)
Range [-1, 1]. Used in embedding retrieval, RAG, contrastive loss.
Eigendecomposition
Av = lambda * v
A = Q * Lambda * Q^T  // symmetric case
Eigenvalues of the Hessian determine loss surface curvature. Large eigenvalue ratios = ill-conditioned optimization, a big part of why adaptive optimizers like Adam tend to beat plain SGD on these problems.
02 // Activation Functions -- Interactive
Nonlinearity or Bust

Without nonlinearity, any stack of linear layers collapses to one linear transform. Select an activation to see its shape and derivative. The red dashed trace is the derivative -- when it flatlines at zero, gradients die.

ReLU
f(x) = max(0, x)
f'(x) = 1 if x > 0, else 0
Cheap, fast, works. Dying ReLU: units that go negative never recover.
GELU (Transformers)
f(x) = x * Phi(x)
approx: 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3)))
Phi = standard normal CDF. Smooth, differentiable everywhere. Default in BERT, GPT-2.
SiLU / Swish (LLaMA, Qwen)
f(x) = x * sigma(x)
f'(x) = sigma(x) + x * sigma(x) * (1 - sigma(x))
Self-gated. Used in SwiGLU FFN blocks. Small negative lobe avoids dead neurons.
Softmax
softmax(z_i) = e^(z_i) / SUM_j(e^(z_j))
temperature: softmax(z_i / T)
Turns logits into probabilities. T = temperature slider in Ollama and LM Studio.
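All four activations above fit in a few lines of numpy; this is a sketch using the tanh approximation given for GELU, not a library implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1 + np.exp(-x))  # x * sigmoid(x), self-gated

def softmax(z, T=1.0):
    e = np.exp((z - z.max()) / T)  # shift-invariant, so subtract max first
    return e / e.sum()
```

High temperature flattens the distribution: `softmax(z, T=100)` is close to uniform for moderate logits, matching the T -> inf limit.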
03 // Loss Functions
What the Network Is Minimizing

The loss function defines "correct." The optimizer only knows the scalar it's trying to shrink.

Cross-Entropy
L = -SUM(y_i * log(y_hat_i))
binary: L = -[y*log(p) + (1-y)*log(1-p)]
Standard for classification. Also next-token prediction loss in LLMs.
KL Divergence
D_KL(P || Q) = SUM(P(x) * log(P(x) / Q(x)))
Not symmetric. Shows up in VAEs, knowledge distillation, RLHF.
MSE (Regression)
L = (1/n) * SUM((y_i - y_hat_i)^2)
Penalizes large errors quadratically. Sensitive to outliers.
Perplexity
PPL = e^L // L = avg cross-entropy per token
PPL of k = model is as uncertain as choosing among k tokens uniformly. Lower is better.
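The PPL = e^L relationship is easy to verify numerically: a model that always predicts uniform over 4 tokens should score exactly PPL 4. A sketch (the function names are illustrative):

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """Average cross-entropy in nats over a batch of distributions."""
    return -np.mean(np.sum(p_true * np.log(q_pred + eps), axis=-1))

def perplexity(avg_ce):
    return np.exp(avg_ce)

p = np.eye(4)              # 4 one-hot targets
q = np.full((4, 4), 0.25)  # model always predicts uniform over 4 tokens
ce = cross_entropy(p, q)   # = log(4) nats per token
```

`perplexity(ce)` comes out to 4.0: the model is exactly as uncertain as a uniform choice among 4 tokens.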
04 // Backpropagation and Optimization
Gradient Flow

Backprop is the chain rule applied recursively from loss to first layer. The optimizer uses these gradients to update weights.

Chain Rule (The Whole Game)
dL/dw = (dL/da) * (da/dz) * (dz/dw)
dL/dz_l = (W_(l+1)^T * dL/dz_(l+1)) .* f'(z_l)  // layer recursion, .* = element-wise
That's it. Everything else is bookkeeping. Autograd builds and traverses the computational graph.
Adam / AdamW (The Default)
m <- beta1*m + (1-beta1)*grad(L)      // first moment
v <- beta2*v + (1-beta2)*(grad(L))^2  // second moment
m_hat = m / (1 - beta1^t)             // bias correction
v_hat = v / (1 - beta2^t)
w <- w - eta * m_hat / (sqrt(v_hat) + eps)
AdamW: w <- w - eta*(m_hat/(sqrt(v_hat)+eps) + lambda*w)
typical: beta1=0.9  beta2=0.999  eps=1e-8
Per-parameter adaptive learning rates. AdamW decouples weight decay from gradient. The standard for transformer training.
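The update transcribed directly into numpy; `adamw_step` and its defaults are a teaching sketch, not an optimizer to ship:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update; t is the 1-based step count. Returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad     # first moment (EMA of gradient)
    v = beta2 * v + (1 - beta2) * grad**2  # second moment (EMA of squared grad)
    m_hat = m / (1 - beta1**t)             # bias correction: moments start at 0
    v_hat = v / (1 - beta2**t)
    # decoupled weight decay: wd*w is added outside the adaptive scaling
    w = w - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adamw_step(w, np.array([0.5]), m, v, t=1)
```

After one step with a positive gradient, w has moved down; bias correction means the very first step already has the full eta magnitude despite m and v starting at zero.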
SGD + Momentum
v <- beta*v + grad(L)
w <- w - eta*v
Accumulates velocity. Dampens oscillation in high-curvature directions.
Cosine Annealing + Warmup
warmup: eta(t) = eta_max * (t / warmup_steps)
anneal: eta(t) = eta_min + 0.5*(eta_max - eta_min) * (1 + cos(pi * t / T))
Linear warmup over the first 1-5% of steps prevents early instability. Cosine decay over the remaining steps.
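The full schedule is one function; eta_max and eta_min here are placeholder values, not recommendations:

```python
import math

def lr(t, warmup_steps, total_steps, eta_max=3e-4, eta_min=3e-5):
    """Linear warmup to eta_max, then cosine decay to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps          # linear ramp from 0
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

At t = warmup_steps the two pieces meet at eta_max; at t = total_steps the cosine term bottoms out at eta_min.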
05 // Gradient Descent Landscape -- Interactive
The Terrain of Optimization

The cyan dot is the current parameter. Watch how learning rate and momentum affect convergence. The red arrow shows negative gradient direction.

06 // Transformer Architecture
Attention Is All You Need (And Some FFN)

The block: norm -> attention -> residual -> norm -> FFN -> residual. Modern variants use pre-norm, RMSNorm, SwiGLU. The attention mechanism hasn't changed since 2017.

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Q = x * W_Q  // what am I looking for?
K = x * W_K  // what do I contain?
V = x * W_V  // what do I return?
sqrt(d_k) prevents dot products from pushing softmax into saturation. Without it, near-one-hot distributions kill gradients.
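A minimal single-head version in numpy, taking Q = K = V = x (no projection matrices) just to show the shapes and the causal mask; real implementations project first:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled similarity matrix
    if causal:
        # mask out future positions: large negative -> ~0 after softmax
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = softmax(scores)            # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))              # 5 tokens, d_k = 8
out, w = attention(x, x, x, causal=True)
```

Rows of `w` sum to 1 and the upper triangle is zero -- exactly what the attention visualizer in section 07 displays.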
Multi-Head Attention
MultiHead = Concat(head_1,...,head_h) * W_O
head_i = Attention(Q*W_i^Q, K*W_i^K, V*W_i^V)
Each head attends to different relationships. GQA in LLaMA 3 shares K/V heads to reduce KV cache.
RoPE
q_m = R(theta_m) * q
k_n = R(theta_n) * k
score = q_m . k_n = f(m - n)  // relative pos
Rotates Q and K in 2D subspaces. Score depends on relative distance. Better long-sequence extension than sinusoidal.
SwiGLU FFN (Qwen, LLaMA)
FFN(x) = (Swish(x*W1) * x*W3) * W2
// three matrices, hidden ~ 8/3 * d_model
Gated linear unit. More expressive than ReLU-FFN at same parameter count.
RMSNorm
RMSNorm(x) = x * gamma / sqrt(mean(x^2) + eps)
Drops mean-centering from LayerNorm. Slightly faster. The norm of choice in modern open-weight models.
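RMSNorm is three lines of numpy; gamma = 1 here for illustration (it is a learned scale in practice):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale by root-mean-square; no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([3.0, -4.0])   # rms = sqrt((9 + 16) / 2)
y = rms_norm(x, gamma=np.ones(2))
```

After normalization the mean square of `y` is ~1, regardless of the input scale -- the only statistic RMSNorm controls.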
Residual Connection
output = x + sublayer(x)
d(output)/d(x) = 1 + d(sublayer)/d(x)
The +1 prevents vanishing gradients. Not optional -- gradients collapse exponentially without them.
07 // Attention Pattern Visualizer -- Interactive
What Tokens Attend To

Each cell shows how much a query (row) attends to each key (column). Rows sum to 1.0 (softmax). Hover for weights. Causal mask means tokens only see backward.

08 // Quantization and LoRA
Fitting Large Models Into Small Boxes

Quantization reduces precision. LoRA reduces trainable parameters. Both necessary for running anything on a 1650 Ti.

Uniform Quantization
q = round((x - z) / s)  // quantize
x_hat = q * s + z       // dequantize
s = (x_max - x_min) / (2^b - 1)
b = bits (4 or 8). Error bounded by s/2. Symmetric sets z=0, simpler but wastes range.
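A round-trip at 4 bits, checking the s/2 error bound; this is the asymmetric variant from the block above, with z = x_min:

```python
import numpy as np

def quantize(x, bits=4):
    """Asymmetric uniform quantization; z is the range offset."""
    z = x.min()
    s = (x.max() - x.min()) / (2**bits - 1)   # step size over 2^b - 1 levels
    q = np.round((x - z) / s)                  # integer codes in [0, 2^b - 1]
    return q, s, z

def dequantize(q, s, z):
    return q * s + z

x = np.linspace(-1.0, 1.0, 17)
q, s, z = quantize(x, bits=4)
x_hat = dequantize(q, s, z)
```

The worst-case reconstruction error is half a step, s/2 -- rounding can never miss by more than half the gap between levels.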
NF4 (NormalFloat 4-bit)
quantile bins: equal probability mass under N(0,1)
non-uniform edges clustered near zero
Trained weights are ~normal. NF4 puts more bins where density is highest. Used in bitsandbytes QLoRA.
LoRA (Low-Rank Adaptation)
frozen: W (d x d)   trainable: B (d x r), A (r x d)   r << d
h = (W + B * A) * x
merged: W' = W + (alpha/r) * B * A
full finetune: d^2 params   LoRA: 2*d*r   // r=16, d=4096 -> 0.78%
Only A and B get gradients. r typically 8-64. alpha often = r. At merge time, folds back into full weight.
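The parameter arithmetic, sketched with the standard init (A small random, B zero) so the adapter starts as a no-op; sizes match the r=16, d=4096 example:

```python
import numpy as np

d, r, alpha = 4096, 16, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))         # frozen base weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, init to zero

def lora_forward(x):
    # low-rank update path added to the frozen weight
    return W @ x + (alpha / r) * (B @ (A @ x))

frac = (A.size + B.size) / W.size   # 2*d*r / d^2 trainable fraction
x = rng.normal(size=d)
same_at_init = np.allclose(lora_forward(x), W @ x)  # B=0 -> no change yet
```

`frac` comes out to 0.0078125, the 0.78% from the block above; with B initialized to zero the adapted model is exactly the base model until training moves B.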
VRAM Rule of Thumb
inference: params * bytes_per_param
3B @ int4  ~1.5 GB
7B @ int4  ~3.5 GB
7B @ fp16  ~14 GB
training: ~3-4x inference
Add 10-20% for KV cache. Qwen 3B Q4 fits on 4GB 1650 Ti with room. 7B Q4 fits if you trim context.
GGUF Quant Levels
Q2_K    ~2.5 bpw   barely coherent
Q3_K_M  ~3.5 bpw   usable for small models
Q4_K_M  ~4.5 bpw   sweet spot (1650 Ti)
Q5_K_M  ~5.5 bpw   near-lossless
Q6_K    ~6.5 bpw   diminishing returns
Q8_0    ~8.5 bpw   basically FP16
bpw = bits per weight. _K = k-quant with per-group scaling. IQ variants use importance-weighted quantization.
09 // Information Theory
Uncertainty as a Quantity

Shannon's framework treats information as measurable. These equations appear throughout DL because next-token prediction is an information-theoretic problem.

Shannon Entropy
H(X) = -SUM(p(x) * log2(p(x)))
max: log2(N) uniform   min: 0 deterministic
Average uncertainty. Uniform maximizes, delta minimizes. Units: bits (log2), nats (ln).
Mutual Information
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
How much Y reduces uncertainty about X. Symmetric, unlike KL divergence.
Cross-Entropy (Information View)
H(P, Q) = -SUM(P(x) * log(Q(x))) = H(P) + D_KL(P || Q)
Minimizing CE over the data minimizes D_KL(P || Q), the divergence of the model Q from the data distribution P. Exactly the LLM training objective.
10 // LLM Sampling
Controlling the Output
Temperature
P(token) = softmax(logit / T)
T->0: greedy   T=1: standard   T->inf: uniform
Creative writing: 0.7-0.9. Coding: 0.2-0.4.
Top-p (Nucleus)
sort by P descending, accumulate until >= p
sample from this "nucleus", renormalize
p=0.9 uses smallest set covering 90% mass. Adapts to distribution shape.
Top-k
keep k highest tokens, zero rest, renormalize
Fixed slice regardless of shape. Often combined with temp and top-p.
Min-P (Dynamic Floor)
threshold = min_p * max(P)
keep tokens where P >= threshold
Scales cutoff relative to most likely token. More adaptive than top-k. Gaining adoption in llama.cpp.
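The three filters compose naturally. A combined sampler sketch in numpy -- the ordering (temperature, then top-p, then min-p) and the defaults are illustrative; real implementations differ in filter order and edge handling:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_filtered(logits, T=0.8, top_p=0.9, min_p=0.05, rng=None):
    """Temperature, then nucleus (top-p) and min-p filtering, then sample."""
    p = softmax(logits / T)
    order = np.argsort(p)[::-1]          # token indices, descending probability
    csum = np.cumsum(p[order])
    keep = np.zeros_like(p, dtype=bool)
    # smallest set whose cumulative mass reaches top_p; the top token
    # always qualifies, so the kept set is never empty
    keep[order[csum - p[order] < top_p]] = True
    keep &= p >= min_p * p.max()         # dynamic floor relative to the max
    p = np.where(keep, p, 0.0)
    p /= p.sum()                         # renormalize over survivors
    rng = rng or np.random.default_rng()
    return rng.choice(len(p), p=p)
```

With one dominant logit, the nucleus collapses to a single token and sampling becomes deterministic -- the greedy limit.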
11 // Regularization
Keeping the Network Stable
Dropout
mask ~ Bernoulli(1-p)
output = (x * mask) / (1-p)
Zero activations with probability p. Scale by 1/(1-p) to preserve expectation. Disabled at inference.
Gradient Clipping
if ||g|| > threshold: g <- threshold * g / ||g||
Prevents exploding gradients. Preserves direction. Standard max_norm = 1.0.
Weight Decay (L2)
L_total = L_task + (lambda/2) * ||w||^2
step: w <- w*(1-eta*lambda) - eta*grad(L)
Pulls weights toward zero. In Adam, L2 != weight decay -- AdamW implements true decay.
Vanishing Gradient Cascade
grad at layer l = PRODUCT(k=l..L) f'(z_k)*W_k^T
sigmoid: max(f') = 0.25    50 layers: 0.25^50 ~ 10^(-30)  DEAD
relu:    f' = 1 (active)   50 layers: 1^50 = 1            ALIVE
The reason ReLU replaced sigmoid. Sigmoid derivatives compound multiplicatively. Much of "why deep learning works now" comes down to managing this cascade -- ReLU, residual connections, normalization.
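The cascade is a one-liner to verify:

```python
# Gradient magnitude after 50 layers, ignoring the weight matrices:
sigmoid_max_deriv = 0.25  # sigmoid'(0), its maximum value
relu_deriv = 1.0          # ReLU' in the active region

print(sigmoid_max_deriv ** 50)  # ~7.9e-31, effectively zero
print(relu_deriv ** 50)         # 1.0, gradient survives intact
```

Even at sigmoid's *best-case* derivative, fifty layers erase the signal; any real network with off-center activations is worse.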
12 // Numerical Intuitions
Numbers You Should Know Cold
Floating Point Ranges
FP32: ~1.4e-45 to 3.4e38   23-bit mantissa
FP16: ~6e-8 to 65504       10-bit, OVERFLOWS
BF16: ~9e-41 to 3.4e38     7-bit mantissa, FP32 exponent
BF16 preferred for training: same range as FP32. FP16 needs loss scaling. Mix of BF16 activations + int4/int8 weights is common.
Softmax Stability
// UNSTABLE: e^z_i / SUM(e^z_j) -- overflows
// STABLE:
c = max(z)
softmax(z_i) = e^(z_i - c) / SUM(e^(z_j - c))
Softmax is shift-invariant. Subtracting max prevents overflow. Every production implementation does this.
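The stable form in numpy; logits near 1000 would overflow float64 in the naive version (e^1000 is inf), while the shifted version is exact:

```python
import numpy as np

def softmax_stable(z):
    c = z.max()              # shift-invariance: softmax(z) == softmax(z - c)
    e = np.exp(z - c)        # largest exponent is now exactly 0
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive e^1000 overflows
p = softmax_stable(z)
```

The result is finite, sums to 1, and preserves the ordering of the logits.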
Transformer Param Count
~12*L*d^2 + 2*V*d
LLaMA 7B: L=32, d=4096, V=32000 -> ~6.7B
12*d^2 per layer = 4 attention matrices (d^2 each) + 2 FFN matrices (4*d^2 each, hidden = 4d). Embedding often tied input/output.
KV Cache Memory
KV = 2 * L * h * d_head * seq_len * bytes  // 2 = K and V
LLaMA 7B, FP16, 4k ctx: ~2.1 GB
Grows linearly with sequence length. GQA reduces the number of K/V heads. This is why long context is expensive.
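The formula as plain arithmetic; plugging in LLaMA-7B-like numbers (32 layers, 32 KV heads, head dim 128, FP16) reproduces the ~2.1 GB figure:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per=2):
    """KV cache size: K and V, per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per

# LLaMA-7B-like config at FP16, 4k context
gb = kv_cache_bytes(32, 32, 128, 4096) / 1e9   # ~2.15 GB
```

Cutting KV heads from 32 to 8 (GQA-style) shrinks the cache by exactly 4x -- same formula, smaller h.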
FlashAttention
standard: O(N^2) memory (materializes full NxN)
flash:    O(N) memory (tiled, never materializes)
speed:    ~2-4x faster via reduced HBM reads
IO-aware exact algorithm, not an approximation. Tiles computation to fit in GPU SRAM. Same math, reordered. This is why 32k+ context works on consumer GPUs.