Without an activation function, stacking linear layers is mathematically equivalent to a single linear transformation. No matter how deep the network, it can only learn linear mappings — useless for anything interesting. Activation functions introduce nonlinearity at each neuron, allowing the network to approximate arbitrarily complex functions.
The activation function is applied element-wise to the pre-activation value z = Wx + b. Its output becomes the neuron's activation — what flows to the next layer. The choice of function determines gradient behavior across layers, which in turn determines whether training works at all in deep architectures.
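As a minimal sketch of that pipeline (NumPy, with hypothetical weights and input chosen for illustration), one layer is a matrix product plus bias, followed by an element-wise nonlinearity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-neuron layer with a 3-dimensional input.
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])
b = np.array([0.1, -0.1])
x = np.array([1.0, 2.0, 3.0])

z = W @ x + b       # pre-activation: z = Wx + b
a = sigmoid(z)      # activation function applied element-wise
print(z)            # [0.5 0.3]
```

Swapping `sigmoid` for any other function below changes only the element-wise step; the rest of the layer is unchanged.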
The history of deep learning is partly the history of finding better activations. Sigmoid dominated until ~2010. ReLU unlocked deep networks. GELU now dominates transformers. The ideal function is nonlinear, computationally cheap, and keeps gradients alive through many layers.
derivative: f'(x) = f(x)(1 - f(x)) max: 0.25
Classic S-curve. Saturates hard at both extremes — the gradient approaches zero outside roughly [-4, 4]. In a 10-layer network, even the best-case sigmoid derivative (0.25 per layer) shrinks the gradient by 0.25^10 ≈ 9.5×10^-7 before it reaches the first layer. Training becomes impossible. This is the vanishing gradient problem.
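Those figures are quick to verify from the derivative formula above (a sketch; the probe points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 — the maximum, at x = 0
print(0.25 ** 10)          # ≈ 9.5e-07 — best case over 10 layers
print(sigmoid_grad(4.0))   # ≈ 0.018 — already nearly flat at x = 4
```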
derivative: f'(x) = 0 if x < 0, else 1
Rectified Linear Unit. AlexNet (2012) used it to win ImageNet by a large margin — the paper that reset the field. Gradient is 1 on the positive side so it passes through layers unchanged. Creates sparse activations: ~50% of neurons output zero at any time, which turns out to be efficient and regularizing.
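The ~50% sparsity claim follows directly from zero-mean pre-activations: about half are negative and get clipped to exactly zero. A sketch (random inputs stand in for real pre-activations):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)   # zero-mean stand-in for pre-activations
a = relu(z)

# Roughly half the activations are exactly zero — sparse by construction.
sparsity = np.mean(a == 0.0)
print(sparsity)   # ≈ 0.5
```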
derivative: f'(x) = 1 - tanh²(x) max: 1.0
Rescaled sigmoid centered at zero. Being zero-centered helps gradient updates point in consistent directions — sigmoid's positive-only output causes zig-zagging optimization. Still saturates, still vanishes in deep nets, but better than sigmoid for hidden layers. Common in RNNs.
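"Rescaled sigmoid" is exact, not a loose analogy: tanh(x) = 2·sigmoid(2x) − 1. A quick numeric check of the identity and of the zero-centering:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# tanh is a shifted, rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# Zero-centered output, unlike sigmoid's (0, 1) range:
print(np.tanh(0.0), sigmoid(0.0))   # 0.0 0.5
```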
f(x) = x·Φ(x), where Φ(x) is the Gaussian CDF — smooth stochastic gating
Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016). Instead of hard-gating negative values to zero like ReLU, GELU weights each input by its probability under a Gaussian — a soft stochastic gate. The small negative dip (bottoming out around −0.17 near x ≈ −0.75) provides richer gradient signal than ReLU's flat zero. GPT-2 onward uses it exclusively.
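The exact form is a one-liner with the standard normal CDF (a sketch using `math.erf`; transformer implementations often use a tanh approximation instead):

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Soft gating: negative inputs are damped, not zeroed.
print(gelu(-1.0))    # ≈ -0.159
print(gelu(1.0))     # ≈ 0.841
# The negative dip: minimum ≈ -0.17, near x ≈ -0.75.
print(gelu(-0.75))   # ≈ -0.170
```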
During backpropagation, each layer multiplies the incoming gradient by the local derivative of its activation function. With sigmoid, the maximum derivative is 0.25 — so at least 75% of the gradient magnitude is destroyed at every layer. Stack 10 layers and the gradient reaching layer 1 is a ghost.
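That chain of multiplications can be simulated directly. A sketch comparing sigmoid and ReLU, with a hypothetical pre-activation value of 1.0 at each of 10 layers:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Hypothetical pre-activations seen at each of 10 layers.
zs = np.full(10, 1.0)

# Backprop multiplies the gradient by each layer's local derivative.
sigmoid_chain = np.prod(sigmoid_grad(zs))
relu_chain = np.prod(np.where(zs > 0, 1.0, 0.0))

print(sigmoid_chain)  # ≈ 8.6e-08 — the gradient is a ghost
print(relu_chain)     # 1.0 — ReLU passes it through unchanged
```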
Drag the slider to probe each function at a given input. Watch how sigmoid and tanh saturate at the extremes, flattening their gradients to near-zero. ReLU and GELU stay alive.
Leaky ReLU fixes dying ReLU by allowing a small negative slope (typically 0.01). Neurons can never fully die — they still receive small gradients. Parametric ReLU (PReLU) learns the slope instead of fixing it. Neither reliably beats plain ReLU in practice unless dead neurons are a measured problem.
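The fix is a one-line change to ReLU (a sketch with the conventional 0.01 slope):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps a nonzero gradient for x < 0,
    # so neurons can always recover.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.5])))   # ≈ [-0.02, 0.5]
```

With `alpha` made a learnable parameter per channel, this same function becomes PReLU.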
Self-gated activation, f(x) = x·sigmoid(x), discovered by neural architecture search (Ramachandran et al., 2017). Smooth, non-monotonic, unbounded above and bounded below. Often matches or outperforms ReLU on deep networks. Used in EfficientNet. SiLU is the same function — the name used in PyTorch.
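"Self-gated" means the input gates itself through a sigmoid — a sketch showing the non-monotonic negative region:

```python
import numpy as np

def swish(x):
    # x * sigmoid(x): the input is its own gate.
    return x / (1.0 + np.exp(-x))

# Non-monotonic: dips below zero before rising, unlike ReLU.
print(swish(-1.0))   # ≈ -0.269
print(swish(2.0))    # ≈ 1.762
```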
Exponential Linear Unit. Negative values saturate smoothly toward −α rather than clipping to a hard zero, which pushes mean activations toward zero. SELU (Scaled ELU) adds a fixed scale factor chosen to induce self-normalizing behavior: activations remain near zero mean and unit variance without batch normalization.
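A sketch of both, using the fixed SELU constants from Klambauer et al. (2017); the self-normalizing fixed point can be checked empirically on standard-normal inputs:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Smooth negative saturation toward -alpha instead of a hard zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Fixed constants derived in the SELU paper.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)

# Self-normalization: N(0, 1) inputs come out near zero mean, unit variance.
rng = np.random.default_rng(0)
a = selu(rng.standard_normal(1_000_000))
print(a.mean(), a.var())   # both close to the (0, 1) fixed point
```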