A neural network makes a prediction. It's wrong. The question is: which weights caused this wrongness, and by how much? Backpropagation is the systematic answer — an efficient application of the chain rule of calculus that assigns blame to every single weight in the network proportionally to its contribution to the error.
The forward pass computes a prediction. The loss function measures how bad it is. Then backprop runs the signal in reverse — from output back to input — computing the gradient of the loss with respect to every parameter. Gradient descent then nudges each weight in the direction that reduces loss. Repeat millions of times. That is training.
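The forward-pass / backward-pass / update cycle can be sketched on the smallest possible case: fitting a single weight w so that w·x matches a target, with squared-error loss. The data values and learning rate here are illustrative assumptions, not from the demo.

```python
# Minimal sketch of the training cycle: forward pass, loss,
# backward pass (chain rule), gradient descent update.

def train(x, y, lr=0.1, steps=100):
    w = 0.0
    for _ in range(steps):
        # forward pass: prediction and loss
        pred = w * x
        loss = (pred - y) ** 2
        # backward pass: dL/dw = dL/dpred * dpred/dw
        grad = 2 * (pred - y) * x
        # gradient descent: nudge w against the gradient
        w -= lr * grad
    return w, loss

# toy problem where the true weight is 3.0
w, loss = train(x=2.0, y=6.0)
```

Running this drives w toward 3.0 and the loss toward zero; scaling up to a full network changes only how the gradient is computed, not the shape of the loop.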
"Backprop is just the chain rule applied recursively. The insight was realizing you could cache intermediate computations and reuse them — making it O(n) instead of O(n²)."
A tiny network: 2 inputs, one hidden layer of 3 neurons (ReLU), 1 output (sigmoid). Step through the forward pass to see activations propagate, then watch gradients flow backward. Edge thickness encodes magnitude.
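The forward pass of that tiny network fits in a few lines of NumPy. The weight values below are arbitrary illustrative choices, not the ones used in the interactive demo.

```python
import numpy as np

# Forward pass for the 2-3-1 network: 2 inputs, a hidden layer of
# 3 ReLU units, one sigmoid output.

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[0.5, -0.3],
               [0.8,  0.2],
               [-0.6, 0.9]])          # hidden weights, shape (3, 2)
b1 = np.zeros(3)
W2 = np.array([[1.0, -0.5, 0.7]])     # output weights, shape (1, 3)
b2 = np.zeros(1)

x = np.array([1.0, 2.0])              # the 2 inputs
h = relu(W1 @ x + b1)                 # 3 hidden activations
y = sigmoid(W2 @ h + b2)              # 1 output, squashed to (0, 1)
```

Every intermediate (here `h`, and the pre-activations inside the calls) is cached during the forward pass; the backward pass reuses them rather than recomputing, which is the O(n) trick from the quote above.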
To find how much a weight in layer 1 affects the loss, trace every path from that weight to the output and multiply local derivatives along the way. Every node owns just its local gradient — backprop does the rest by accumulation.
The update rule is w ← w − η · ∂L/∂w, where η is the learning rate — the step size in parameter space per iteration.
Each backprop + weight update cycle should decrease the loss — we're descending the gradient. The loss curve reveals the health of training. Click Run Training to simulate 200 gradient descent steps.
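The 200-step simulation can be sketched end to end: forward pass, squared-error loss, backward pass through the sigmoid and ReLU layers, and a gradient descent update per step. The toy data, initial weights, and learning rate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # 20 toy 2-d examples
t = (X[:, 0] + X[:, 1] > 0).astype(float)    # hypothetical 0/1 labels

W1 = rng.normal(scale=0.5, size=(3, 2)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(1, 3)); b2 = np.zeros(1)
eta = 0.5                                    # learning rate

losses = []
for step in range(200):
    # forward pass (cache Z1, H for reuse in the backward pass)
    Z1 = X @ W1.T + b1                             # (20, 3) pre-activations
    H = np.maximum(0.0, Z1)                        # ReLU hidden layer
    y = 1.0 / (1.0 + np.exp(-(H @ W2.T + b2)))     # (20, 1) sigmoid output
    losses.append(np.mean((y[:, 0] - t) ** 2))

    # backward pass: chain rule, layer by layer
    dy = 2 * (y[:, 0] - t) / len(t)                # dL/dy
    dz2 = (dy * y[:, 0] * (1 - y[:, 0]))[:, None]  # through the sigmoid
    dW2 = dz2.T @ H                                # (1, 3)
    db2 = dz2.sum(axis=0)
    dH = dz2 @ W2                                  # (20, 3)
    dZ1 = dH * (Z1 > 0)                            # ReLU gate
    dW1 = dZ1.T @ X                                # (3, 2)
    db1 = dZ1.sum(axis=0)

    # gradient descent update
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
```

Plotting `losses` should show the downward curve described above; a curve that flattens early or oscillates is the visual symptom of a too-small or too-large η.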