BACKPROPAGATION

GRADIENT-BASED CREDIT ASSIGNMENT CHAIN RULE // COMPUTATIONAL GRAPHS
FORWARD PASS // BACKWARD PASS
01 // Concept
THE BIG PICTURE

A neural network makes a prediction. It's wrong. The question is: which weights caused this wrongness, and by how much? Backpropagation is the systematic answer — an efficient application of the chain rule of calculus that assigns blame to every single weight in the network proportionally to its contribution to the error.

The forward pass computes a prediction. The loss function measures how bad it is. Then backprop runs the signal in reverse — from output back to input — computing the gradient of the loss with respect to every parameter. Gradient descent then nudges each weight in the direction that reduces loss. Repeat millions of times. That is training.

"Backprop is just the chain rule applied recursively. The insight was realizing you could cache intermediate computations and reuse them — making it O(n) instead of O(n²)."
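That caching idea, sketched for a single sigmoid neuron with a squared-error loss (the function names and the loss choice here are illustrative, not prescribed by the text):

```python
import math

def forward(x, w, b):
    """Forward pass: cache every intermediate (z, a) for reuse in backward."""
    z = w * x + b                    # pre-activation
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
    return z, a

def backward(x, a, y):
    """Backward pass: reuses the cached activation 'a'; nothing is recomputed."""
    dL_da = a - y                    # from L = 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)            # sigmoid derivative, in terms of cached a
    dL_dz = dL_da * da_dz            # shared chain-rule prefix, computed once
    return dL_dz * x, dL_dz          # (dL/dw, dL/db): dz/dw = x, dz/db = 1
```

The shared prefix dL_dz is computed once and reused for both weight and bias gradients; in a deep network that reuse, node by node, is exactly what keeps the cost linear in the graph size.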

02 // Visualization
FORWARD + BACKWARD PASS

A tiny network: 2 inputs, a hidden layer of 3 ReLU neurons, and 1 sigmoid output. Step through the forward pass to see activations propagate, then watch gradients flow backward. Edge thickness encodes magnitude.
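Before stepping through the demo, a plain-Python sketch of the same forward pass may help (the weights below are made up; the demo's actual initialization isn't shown):

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: the demo's actual initialization isn't shown.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]   # 3 hidden neurons x 2 inputs
b1 = [0.1, 0.0, -0.2]
W_out = [0.7, -0.4, 0.5]                       # 1 output neuron x 3 hidden
b_out = 0.05

def forward(x):
    # Hidden layer: linear combination per neuron, then ReLU
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [relu(z) for z in z1]
    # Output layer: linear combination, then sigmoid
    z_out = sum(w * hi for w, hi in zip(W_out, h)) + b_out
    return z1, h, z_out, sigmoid(z_out)        # cache everything for backprop

z1, h, z_out, y_hat = forward([0.80, 0.20])    # the demo's input values
```

Note that every intermediate (z1, h, z_out) is returned rather than discarded: the backward pass will need all of them.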

[Interactive demo: a node-value panel (inputs x1 = 0.80, x2 = 0.20; hidden pre-activations and post-ReLU values h1–h3; sigmoid output; target 1.00; BCE loss) and a gradient panel (dL/d(output), dL/dW_out[0..2], dL/dh1). Click FORWARD to begin.]
03 // Mathematics
THE CHAIN RULE, UNROLLED

To find how much a weight in layer 1 affects the loss, trace every path from that weight to the output and multiply local derivatives along the way. Every node owns just its local gradient — backprop does the rest by accumulation.

Loss (Binary Cross-Entropy):  L = -[ y·log(ŷ) + (1-y)·log(1-ŷ) ]
dL/dŷ       = -(y/ŷ) + (1-y)/(1-ŷ)
dŷ/dz_out   = σ(z_out)·(1 - σ(z_out))              (sigmoid derivative)
dz_out/dw_i = h_i                                  (output-layer weight grad)
dL/dw_i     = dL/dŷ × dŷ/dz_out × dz_out/dw_i      (CHAIN RULE)
dL/dh_i     = dL/dz_out × w_i                      (propagate back to hidden)
dh_i/dz_i   = 1 if z_i > 0, else 0                 (ReLU gate: zero if pre-activation is negative)
dL/dw1_ij   = dL/dh_i × dh_i/dz_i × x_j            (hidden-layer weight grad)
Weight update: w := w - η · dL/dw
where η is the learning rate — step size in parameter space per iteration.
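Translated into code, the derivation above becomes a handful of lines. A sketch for the 2-3-1 network (the weight values are illustrative, not the demo's):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative parameters for the 2-3-1 network (not the demo's actual values).
x = [0.80, 0.20]; y = 1.0
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]; b1 = [0.1, 0.0, -0.2]
W_out = [0.7, -0.4, 0.5]; b_out = 0.05

# Forward pass (cache intermediates)
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
h = [max(0.0, z) for z in z1]                     # ReLU
z_out = sum(w * hi for w, hi in zip(W_out, h)) + b_out
y_hat = sigmoid(z_out)
loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))  # BCE

# Backward pass, mirroring the chain-rule table line by line
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_dz = y_hat * (1.0 - y_hat)                  # sigmoid derivative
dL_dz_out = dL_dyhat * dyhat_dz                   # simplifies to y_hat - y
dL_dWout = [dL_dz_out * hi for hi in h]           # dz_out/dw_i = h_i
dL_dh = [dL_dz_out * w for w in W_out]            # propagate back to hidden
dL_dz1 = [g if z > 0 else 0.0 for g, z in zip(dL_dh, z1)]   # ReLU gate
dL_dW1 = [[g * xj for xj in x] for g in dL_dz1]   # hidden-layer weight grads
```

A useful sanity check in practice: for BCE through a sigmoid, dL/dz_out collapses algebraically to ŷ - y, and every analytic gradient can be verified against a finite-difference estimate.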
04 // Training Dynamics
LOSS OVER TIME

Each backprop + weight update cycle should decrease the loss: each step moves downhill along the negative gradient. The loss curve reveals the health of training. Click Run Training to simulate 200 gradient descent steps.

[Interactive chart: loss curve from epoch 0 to epoch 200, with the final prediction shown against target 1.0000.]
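The 200-step loop the demo simulates can be sketched in plain Python (starting weights and the learning rate are illustrative; the demo's actual values aren't shown):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative starting weights for the 2-3-1 network.
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]]; b1 = [0.1, 0.0, -0.2]
W_out = [0.7, -0.4, 0.5]; b_out = 0.05
x, y, lr = [0.80, 0.20], 1.0, 0.5

losses = []
for step in range(200):
    # Forward pass (cache intermediates)
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, z) for z in z1]
    z_out = sum(w * hi for w, hi in zip(W_out, h)) + b_out
    y_hat = sigmoid(z_out)
    losses.append(-(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)))
    # Backward pass
    dz_out = y_hat - y                  # BCE-through-sigmoid shortcut
    dz1 = [dz_out * w if z > 0 else 0.0 for w, z in zip(W_out, z1)]
    # Gradient descent update: w := w - lr * dL/dw
    W_out = [w - lr * dz_out * hi for w, hi in zip(W_out, h)]
    b_out -= lr * dz_out
    W1 = [[w - lr * g * xj for w, xj in zip(row, x)] for row, g in zip(W1, dz1)]
    b1 = [b - lr * g for b, g in zip(b1, dz1)]
```

With a single training example the loss falls steadily toward zero; real training averages gradients over batches, but the per-step mechanics are identical.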
05 // Glossary
CORE CONCEPTS
Computational Graph
A directed acyclic graph where nodes are operations and edges carry values. Backprop traverses this graph in reverse topological order, accumulating gradients at each node via the chain rule.
Local Gradient
The derivative of a node's output with respect to its own input, computed independently of the rest of the graph. Each node multiplies the incoming upstream gradient by its local gradient to produce the signal it passes further back.
∂out/∂in (at a single node)
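A minimal sketch of a node that owns only its local gradient, here a multiply node (the class name is hypothetical):

```python
class MulNode:
    """A node knows only its own operation and its local gradient."""
    def forward(self, a, b):
        self.a, self.b = a, b        # cache inputs for the backward pass
        return a * b
    def backward(self, upstream):
        # outgoing gradient = upstream gradient x local gradient
        return upstream * self.b, upstream * self.a   # d(ab)/da = b, d(ab)/db = a
```

The node never sees the loss or the rest of the graph; backprop wires these local pieces together by passing upstream gradients through them in reverse order.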
Vanishing Gradient
When gradients shrink as they propagate backward through many layers, eventually becoming so small that early layers receive near-zero signal and stop learning. A common cause is saturating activations like sigmoid, whose derivative never exceeds 0.25, so each layer can shrink the backward signal by a factor of 4 or more.
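A quick numeric illustration: backprop multiplies one local sigmoid derivative per layer, and that derivative peaks at 0.25, so even in the best case the signal shrinks geometrically with depth (a sketch assuming every layer contributes exactly the peak value):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_scale(depth, z=0.0):
    """Surviving gradient magnitude after 'depth' sigmoid layers,
    assuming each layer sits at pre-activation z (z = 0 is the best case)."""
    s = sigmoid(z)
    local = s * (1.0 - s)        # sigmoid'(z), at most 0.25
    return local ** depth

# Ten layers at the optimum leave 0.25**10 of the signal, roughly 1e-6.
```

This is why deep networks moved to ReLU (local gradient exactly 1 on the active side) and to architectural fixes like residual connections.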
Learning Rate η
Scalar controlling step size in weight space. Too large: oscillates or diverges. Too small: trains glacially. Adaptive methods like Adam adjust it per-parameter dynamically based on gradient history.
w <- w - η · ∂L/∂w
Loss Function
Maps network output + ground truth to a scalar measuring prediction error. MSE for regression, cross-entropy for classification. Its geometry determines how hard training is.
Gradient Descent
The optimizer that uses backprop's output. Each step moves all weights opposite to the gradient. SGD: one sample. Mini-batch: small groups. Full-batch: all data at once.
Automatic Diff
PyTorch and JAX implement backprop via automatic differentiation. PyTorch builds the computation graph dynamically as your code runs and populates gradients on backward(); JAX instead traces your function and transforms it with grad(). Either way, you define the forward pass and get the backward pass for free.
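A toy sketch of the idea in pure Python, in the spirit of reverse-mode autodiff (this `Value` class is illustrative, not PyTorch's or JAX's actual API):

```python
class Value:
    """Toy reverse-mode autodiff: the graph is built during the forward pass,
    then backward() fills .grad -- a sketch of what real libraries automate."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data, self.grad = data, 0.0
        self._parents, self._local = parents, local_grads

    def __add__(self, other):
        # local gradients of a + b are (1, 1)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # local gradients of a * b are (b, a)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, upstream=1.0):
        # accumulate, then push upstream x local gradient to each parent
        self.grad += upstream
        for p, local in zip(self._parents, self._local):
            p.backward(upstream * local)
```

The naive recursion here enumerates every path through the graph; real libraries instead traverse the graph once in reverse topological order, which is where the O(n) efficiency comes from.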
Saddle Points
Points where the gradient is zero but which are not minima: some directions curve up, others down. In high-dimensional loss landscapes, saddle points vastly outnumber local minima. Momentum-based optimizers escape them by accumulating directional velocity.