Every deep learning system -- from a two-layer perceptron to a 405B transformer -- is built on the same small set of operations. Linear transformations, nonlinear gates, loss minimization, gradient propagation. The architectures change. The math doesn't.
Twelve sections covering linear algebra, activation functions with live interactive plots, loss functions, backpropagation, a gradient descent landscape simulator, transformer internals, an attention pattern visualizer, quantization/LoRA, information theory, sampling, regularization, and numerical stability.
Notes reference Qwen, LLaMA, and the ersatz lab where relevant.
Everything in a neural network reduces to matrix multiplication and element-wise operations. Knowing shapes at each step is more useful than memorizing architecture diagrams.
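A minimal sketch of shape-tracking through a two-layer forward pass, using made-up toy sizes and naive list-of-lists matrices (no framework assumed):

```python
# Toy MLP forward pass: the point is the shapes, not the values.
# x: (batch, in), W1: (in, hidden), W2: (hidden, out)

def matmul(A, B):
    """Naive matrix multiply: (n, k) @ (k, m) -> (n, m)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def shape(M):
    return (len(M), len(M[0]))

x  = [[1.0, 2.0, 3.0]]              # (1, 3): one input vector
W1 = [[0.1] * 4 for _ in range(3)]  # (3, 4)
W2 = [[0.1] * 2 for _ in range(4)]  # (4, 2)

h = matmul(x, W1)                   # (1, 3) @ (3, 4) -> (1, 4)
y = matmul(h, W2)                   # (1, 4) @ (4, 2) -> (1, 2)
```

The inner dimensions must agree at every step; checking that is usually faster than reading an architecture diagram.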
Without nonlinearity, any stack of linear layers collapses to one linear transform. Select an activation to see its shape and derivative. The red dashed trace is the derivative -- when it flatlines at zero, gradients die.
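The collapse claim can be checked directly with toy 2x2 matrices (arbitrary values, no framework assumed): applying two linear layers in sequence gives the same output as one premultiplied matrix.

```python
# Two linear layers with no activation between them collapse to one linear map.

def matvec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def matmat(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]
x  = [1.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
collapsed  = matvec(matmat(W2, W1), x)   # (W2 W1) x -- a single matrix
# The two results are identical: depth without nonlinearity adds nothing.
```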
The loss function defines "correct." The optimizer only knows the scalar it's trying to shrink.
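A minimal sketch of that scalar, using cross-entropy over made-up logits for three classes:

```python
import math

# Softmax + cross-entropy: however rich the output, the optimizer sees one number.
logits = [2.0, 0.5, -1.0]   # toy scores for 3 classes
target = 0                  # true class index

m = max(logits)
exps = [math.exp(z - m) for z in logits]   # subtract max for stability
Z = sum(exps)
probs = [e / Z for e in exps]              # sums to 1
loss = -math.log(probs[target])            # the scalar: small when probs[target] is high
```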
Backprop is the chain rule applied recursively from loss to first layer. The optimizer uses these gradients to update weights.
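A hand-worked sketch on a toy scalar network (arbitrary values), with the analytic gradient checked against a finite difference:

```python
# Toy network: y = w2 * relu(w1 * x), loss L = (y - t)^2.
# Backprop = chain rule applied from L back to each weight.

def forward(w1, w2, x):
    h = w1 * x
    a = h if h > 0 else 0.0        # ReLU
    return w2 * a

w1, w2, x, t = 0.5, -1.5, 2.0, 1.0
y = forward(w1, w2, x)

dL_dy  = 2 * (y - t)                                       # outermost derivative
a      = max(w1 * x, 0.0)
dL_dw2 = dL_dy * a                                         # dy/dw2 = a
dL_dw1 = dL_dy * w2 * (1.0 if w1 * x > 0 else 0.0) * x     # chain through ReLU

# Finite-difference check on w1: the chain rule and the numeric slope agree.
eps = 1e-6
L = lambda w: (forward(w, w2, x) - t) ** 2
numeric = (L(w1 + eps) - L(w1 - eps)) / (2 * eps)
```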
The cyan dot is the current parameter. Watch how learning rate and momentum affect convergence. The red arrow shows negative gradient direction.
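The simulator's update rule can be sketched in a few lines on a 1-D quadratic (learning rate and momentum values are arbitrary toy settings):

```python
# Gradient descent with momentum on f(x) = x^2, so grad f = 2x.
lr, beta = 0.1, 0.9
x, v = 5.0, 0.0                # start far from the minimum at 0

for _ in range(300):
    g = 2 * x                  # gradient at the current parameter
    v = beta * v - lr * g      # momentum: a running blend of past steps
    x = x + v                  # move along the (dampened) negative gradient
# x has spiraled into the minimum; with beta = 0 it would crawl straight down.
```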
The block: norm -> attention -> residual -> norm -> FFN -> residual. Modern variants use pre-norm, RMSNorm, SwiGLU. The core scaled dot-product attention hasn't changed since 2017.
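The data flow above can be sketched structurally; `attention` and `ffn` are identity stubs here (placeholders, not real implementations), so only the norm -> sublayer -> residual wiring is being shown:

```python
import math

def rms_norm(x):
    # RMSNorm: rescale by the root-mean-square of the vector.
    scale = math.sqrt(sum(v * v for v in x) / len(x)) + 1e-6
    return [v / scale for v in x]

def attention(x):   # stub: identity stands in for self-attention
    return x

def ffn(x):         # stub: identity stands in for the SwiGLU MLP
    return x

def block(x):
    # Pre-norm ordering: normalize first, then the sublayer, then add back.
    x = [a + b for a, b in zip(x, attention(rms_norm(x)))]  # norm -> attn -> residual
    x = [a + b for a, b in zip(x, ffn(rms_norm(x)))]        # norm -> ffn  -> residual
    return x

y = block([1.0, 2.0, 3.0])   # same shape out as in
```

The residual stream is never normalized in place; each sublayer reads a normalized copy and writes its delta back, which is what makes deep stacks trainable.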
Each cell shows how much a query (row) attends to each key (column). Rows sum to 1.0 (softmax). Hover for weights. Causal mask means tokens only see backward.
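A minimal sketch of how those cells are produced, using made-up scores for three tokens; the causal mask simply restricts each row's softmax to keys at or before the query position:

```python
import math

# Toy pre-softmax attention scores: row = query position, column = key position.
scores = [[0.0, 0.0, 0.0],
          [1.0, 2.0, 0.0],
          [0.5, 0.5, 3.0]]

def causal_softmax(scores):
    T = len(scores)
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # causal mask: only keys j <= i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]  # max-subtracted for stability
        Z = sum(exps)
        weights.append([e / Z for e in exps] + [0.0] * (T - i - 1))
    return weights

W = causal_softmax(scores)
# Every row sums to 1.0; every cell above the diagonal is exactly zero.
```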
Quantization reduces precision. LoRA reduces trainable parameters. Both are necessary for running anything sizable on a GTX 1650 Ti.
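The back-of-envelope arithmetic, with assumed toy numbers (a 7B-parameter model and a single 4096x4096 weight matrix at LoRA rank 8):

```python
# Quantization: memory scales with bytes per parameter.
params  = 7_000_000_000
fp16_gb = params * 2   / 1024**3    # 2 bytes/param  -> ~13 GB
int4_gb = params * 0.5 / 1024**3    # ~0.5 bytes/param -> ~3.3 GB

# LoRA: train two thin factors A (d x r) and B (r x d) instead of the full d x d matrix.
d, r = 4096, 8
full_matrix = d * d                 # frozen base weights
lora_params = d * r + r * d         # the only trainable parameters
ratio = lora_params / full_matrix   # well under 1% of the matrix
```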
Shannon's framework treats information as measurable. These equations appear throughout DL because next-token prediction is an information-theoretic problem.
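A minimal sketch of the three quantities that recur, over a made-up three-token distribution:

```python
import math

p = [0.5, 0.25, 0.25]   # "true" next-token distribution
q = [0.7, 0.2, 0.1]     # model's predicted distribution

# Entropy H(p): irreducible bits needed per symbol under the true distribution.
entropy = -sum(pi * math.log2(pi) for pi in p)                 # 1.5 bits here

# Cross-entropy H(p, q): bits paid when coding p with the model's q.
cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

# KL divergence D_KL(p || q): the excess bits; zero only when q matches p.
kl = cross - entropy
```

Training on next-token prediction minimizes exactly this cross-entropy, so the loss has a floor at the entropy of language itself.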