BACKPROPAGATION

GRADIENT-BASED CREDIT ASSIGNMENT CHAIN RULE // COMPUTATIONAL GRAPHS
FORWARD PASS // BACKWARD PASS
01 // Concept
THE BIG PICTURE

A neural network makes a prediction. It's wrong. The question is: which weights caused this wrongness, and by how much? Backpropagation is the systematic answer — an efficient application of the chain rule of calculus that assigns blame to every single weight in the network proportionally to its contribution to the error.

The forward pass computes a prediction. The loss function measures how bad it is. Then backprop runs the signal in reverse — from output back to input — computing the gradient of the loss with respect to every parameter. Gradient descent then nudges each weight in the direction that reduces loss. Repeat millions of times. That is training.

"Backprop is just the chain rule applied recursively. The insight was realizing you could cache intermediate computations and reuse them — making it O(n) instead of O(n²)."
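That caching idea, sketched for a single sigmoid neuron with a squared-error loss (the function names and the loss choice here are illustrative, not prescribed by the text):

```python
import math

def forward(x, w, b):
    """Forward pass: cache every intermediate (z, a) for reuse in backward."""
    z = w * x + b                    # pre-activation
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
    return z, a

def backward(x, a, y):
    """Backward pass: reuses the cached activation 'a'; nothing is recomputed."""
    dL_da = a - y                    # from L = 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)            # sigmoid derivative, in terms of cached a
    dL_dz = dL_da * da_dz            # shared chain-rule prefix, computed once
    return dL_dz * x, dL_dz          # (dL/dw, dL/db): dz/dw = x, dz/db = 1
```

The shared prefix dL_dz is computed once and reused for both weight and bias gradients; in a deep network that reuse, node by node, is exactly what keeps the cost linear in the graph size.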

02 // Visualization
FORWARD + BACKWARD PASS

A tiny network: 2 inputs, a hidden layer of 3 ReLU neurons, and 1 sigmoid output. Step through the forward pass to see activations propagate, then watch gradients flow backward. Edge thickness encodes magnitude.
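Before stepping through the demo, a plain-Python sketch of the same forward pass may help (the weights below are made up; the demo's actual initialization isn't shown):

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: the demo's actual initialization isn't shown.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]   # 3 hidden neurons x 2 inputs
b1 = [0.1, 0.0, -0.2]
W_out = [0.7, -0.4, 0.5]                       # 1 output neuron x 3 hidden
b_out = 0.05

def forward(x):
    # Hidden layer: linear combination per neuron, then ReLU
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [relu(z) for z in z1]
    # Output layer: linear combination, then sigmoid
    z_out = sum(w * hi for w, hi in zip(W_out, h)) + b_out
    return z1, h, z_out, sigmoid(z_out)        # cache everything for backprop

z1, h, z_out, y_hat = forward([0.80, 0.20])    # the demo's input values
```

Note that every intermediate (z1, h, z_out) is returned rather than discarded: the backward pass will need all of them.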

[Interactive demo: a node-value panel (inputs x1 = 0.80, x2 = 0.20; hidden pre-activations and post-ReLU values h1–h3; sigmoid output; target 1.00; BCE loss) and a gradient panel (dL/d(output), dL/dW_out[0..2], dL/dh1). Click FORWARD to begin.]
03 // Mathematics
THE CHAIN RULE, UNROLLED

To find how much a weight in layer 1 affects the loss, trace every path from that weight to the output and multiply local derivatives along the way. Every node owns just its local gradient — backprop does the rest by accumulation.

Loss (Binary Cross-Entropy):  L = -[ y·log(ŷ) + (1-y)·log(1-ŷ) ]
dL/dŷ       = -(y/ŷ) + (1-y)/(1-ŷ)
dŷ/dz_out   = σ(z_out)·(1 - σ(z_out))              (sigmoid derivative)
dz_out/dw_i = h_i                                  (output-layer weight grad)
dL/dw_i     = dL/dŷ × dŷ/dz_out × dz_out/dw_i      (CHAIN RULE)
dL/dh_i     = dL/dz_out × w_i                      (propagate back to hidden)
dh_i/dz_i   = 1 if z_i > 0, else 0                 (ReLU gate: zero if pre-activation is negative)
dL/dw1_ij   = dL/dh_i × dh_i/dz_i × x_j            (hidden-layer weight grad)
Weight update: w := w - η · dL/dw
where η is the learning rate — step size in parameter space per iteration.
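Translated into code, the derivation above becomes a handful of lines. A sketch for the 2-3-1 network (the weight values are illustrative, not the demo's):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative parameters for the 2-3-1 network (not the demo's actual values).
x = [0.80, 0.20]; y = 1.0
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]; b1 = [0.1, 0.0, -0.2]
W_out = [0.7, -0.4, 0.5]; b_out = 0.05

# Forward pass (cache intermediates)
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
h = [max(0.0, z) for z in z1]                     # ReLU
z_out = sum(w * hi for w, hi in zip(W_out, h)) + b_out
y_hat = sigmoid(z_out)
loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))  # BCE

# Backward pass, mirroring the chain-rule table line by line
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_dz = y_hat * (1.0 - y_hat)                  # sigmoid derivative
dL_dz_out = dL_dyhat * dyhat_dz                   # simplifies to y_hat - y
dL_dWout = [dL_dz_out * hi for hi in h]           # dz_out/dw_i = h_i
dL_dh = [dL_dz_out * w for w in W_out]            # propagate back to hidden
dL_dz1 = [g if z > 0 else 0.0 for g, z in zip(dL_dh, z1)]   # ReLU gate
dL_dW1 = [[g * xj for xj in x] for g in dL_dz1]   # hidden-layer weight grads
```

A useful sanity check in practice: for BCE through a sigmoid, dL/dz_out collapses algebraically to ŷ - y, and every analytic gradient can be verified against a finite-difference estimate.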
04 // Training Dynamics
LOSS OVER TIME

Each backprop + weight update cycle should decrease the loss: each step moves downhill along the negative gradient. The loss curve reveals the health of training. Click Run Training to simulate 200 gradient descent steps.

[Interactive chart: loss curve from epoch 0 to epoch 200, with the final prediction shown against target 1.0000.]
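The 200-step loop the demo simulates can be sketched in plain Python (starting weights and the learning rate are illustrative; the demo's actual values aren't shown):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative starting weights for the 2-3-1 network.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]; b1 = [0.1, 0.0, -0.2]
W_out = [0.7, -0.4, 0.5]; b_out = 0.05
x, y, lr = [0.80, 0.20], 1.0, 0.5

losses = []
for step in range(200):
    # Forward pass (cache intermediates)
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, z) for z in z1]
    z_out = sum(w * hi for w, hi in zip(W_out, h)) + b_out
    y_hat = sigmoid(z_out)
    losses.append(-(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)))
    # Backward pass
    dz_out = y_hat - y                  # BCE-through-sigmoid shortcut
    dz1 = [dz_out * w if z > 0 else 0.0 for w, z in zip(W_out, z1)]
    # Gradient descent update: w := w - lr * dL/dw
    W_out = [w - lr * dz_out * hi for w, hi in zip(W_out, h)]
    b_out -= lr * dz_out
    W1 = [[w - lr * g * xj for w, xj in zip(row, x)] for row, g in zip(W1, dz1)]
    b1 = [b - lr * g for b, g in zip(b1, dz1)]
```

With a single training example the loss falls steadily toward zero; real training averages gradients over batches, but the per-step mechanics are identical.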
05 // Glossary
CORE CONCEPTS
Computational Graph
A directed acyclic graph where nodes are operations and edges carry values. Backprop traverses this graph in reverse topological order, accumulating gradients at each node via the chain rule.
Local Gradient
The derivative of a node's output with respect to its own input, computed independently of the rest of the graph. Each node multiplies the incoming upstream gradient by its local gradient to produce the signal it passes further back.
∂out/∂in (at a single node)
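A minimal sketch of a node that owns only its local gradient, here a multiply node (the class name is hypothetical):

```python
class MulNode:
    """A node knows only its own operation and its local gradient."""
    def forward(self, a, b):
        self.a, self.b = a, b        # cache inputs for the backward pass
        return a * b
    def backward(self, upstream):
        # outgoing gradient = upstream gradient x local gradient
        return upstream * self.b, upstream * self.a   # d(ab)/da = b, d(ab)/db = a
```

The node never sees the loss or the rest of the graph; backprop wires these local pieces together by passing upstream gradients through them in reverse order.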
Vanishing Gradient
When gradients shrink as they propagate backward through many layers, eventually becoming so small that early layers receive near-zero signal and stop learning. A common cause is saturating activations like sigmoid, whose derivative never exceeds 0.25, so each layer can shrink the backward signal by a factor of 4 or more.
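A quick numeric illustration: backprop multiplies one local sigmoid derivative per layer, and that derivative peaks at 0.25, so even in the best case the signal shrinks geometrically with depth (a sketch assuming every layer contributes exactly the peak value):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_scale(depth, z=0.0):
    """Surviving gradient magnitude after 'depth' sigmoid layers,
    assuming each layer sits at pre-activation z (z = 0 is the best case)."""
    s = sigmoid(z)
    local = s * (1.0 - s)        # sigmoid'(z), at most 0.25
    return local ** depth

# Ten layers at the optimum leave 0.25**10 of the signal, roughly 1e-6.
```

This is why deep networks moved to ReLU (local gradient exactly 1 on the active side) and to architectural fixes like residual connections.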
Learning Rate η
Scalar controlling step size in weight space. Too large: oscillates or diverges. Too small: trains glacially. Adaptive methods like Adam adjust it per-parameter dynamically based on gradient history.
w <- w - η · ∂L/∂w
Loss Function
Maps network output + ground truth to a scalar measuring prediction error. MSE for regression, cross-entropy for classification. Its geometry determines how hard training is.
Gradient Descent
The optimizer that uses backprop's output. Each step moves all weights opposite to the gradient. SGD: one sample. Mini-batch: small groups. Full-batch: all data at once.
Automatic Diff
PyTorch and JAX implement backprop via automatic differentiation. PyTorch builds the computation graph dynamically as your code runs and populates gradients on backward(); JAX instead traces your function and transforms it with grad(). Either way, you define the forward pass and get the backward pass for free.
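A toy sketch of the idea in pure Python, in the spirit of reverse-mode autodiff (this `Value` class is illustrative, not PyTorch's or JAX's actual API):

```python
class Value:
    """Toy reverse-mode autodiff: the graph is built during the forward pass,
    then backward() fills .grad -- a sketch of what real libraries automate."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data, self.grad = data, 0.0
        self._parents, self._local = parents, local_grads

    def __add__(self, other):
        # local gradients of a + b are (1, 1)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # local gradients of a * b are (b, a)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, upstream=1.0):
        # accumulate, then push upstream x local gradient to each parent
        self.grad += upstream
        for p, local in zip(self._parents, self._local):
            p.backward(upstream * local)
```

The naive recursion here enumerates every path through the graph; real libraries instead traverse the graph once in reverse topological order, which is where the O(n) efficiency comes from.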
Saddle Points
Points where the gradient is zero but which are not minima: some directions curve up, others down. In high-dimensional loss landscapes, saddle points vastly outnumber local minima. Momentum-based optimizers escape them by accumulating directional velocity.