ACTIVATION FUNCTIONS

THE NONLINEARITY THAT MAKES NETWORKS WORK
SIGMOID // TANH // RELU // GELU // VANISHING GRADIENTS // DYING NEURONS
01 // What They Are
WHY NONLINEARITY EXISTS

Without an activation function, stacking linear layers is mathematically equivalent to a single linear transformation. No matter how deep the network, it can only learn linear mappings — useless for anything interesting. Activation functions introduce nonlinearity at each neuron, allowing the network to approximate arbitrarily complex functions.

The activation function is applied element-wise to the pre-activation value z = Wx + b. Its output becomes the neuron's activation — what flows to the next layer. The choice of function determines gradient behavior across layers, which in turn determines whether training works at all in deep architectures.

The history of deep learning is partly the history of finding better activations. Sigmoid dominated until ~2010. ReLU unlocked deep networks. GELU now dominates transformers. The ideal function is nonlinear, computationally cheap, and keeps gradients alive through many layers.

1943 · McCulloch-Pitts: step function
1986 · Sigmoid + backprop paper
2010 · ReLU proven in deep nets
2016 · GELU introduced for transformers
02 // Sigmoid
SIGMOID
f(x) = 1 / (1 + e^(-x))
derivative: f'(x) = f(x)(1 - f(x))   max: 0.25
Output range: (0, 1)
Max gradient: 0.25 at x = 0
Differentiable: everywhere
Zero-centered: no
Era: 1980s to ~2012

Classic S-curve. Saturates hard at both extremes — gradient approaches zero outside [-4, 4]. In a 10-layer network, sigmoid gradient attenuates to 0.25^10 ≈ 9.5×10^-7 at the first layer. Training becomes impossible. This is the vanishing gradient problem.

smooth · vanishing grad · output layer only
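The attenuation arithmetic above is easy to check directly; a minimal sketch of sigmoid and its derivative in plain Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x)(1 - f(x)), maximized at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 -- the best case
print(sigmoid_grad(6.0))   # ~0.0025 -- saturation outside [-4, 4]
print(0.25 ** 10)          # ~9.54e-07 -- best-case gradient after 10 layers
```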
03 // ReLU
RELU
f(x) = max(0, x)
derivative: f'(x) = 0 if x < 0, else 1
Output range: [0, +inf)
Max gradient: 1 (no decay)
Differentiable: not at x = 0
Zero-centered: no
Era: 2010s to present

Rectified Linear Unit. AlexNet (2012) used it to win ImageNet by a large margin — the paper that reset the field. Gradient is 1 on the positive side so it passes through layers unchanged. Creates sparse activations: ~50% of neurons output zero at any time, which turns out to be efficient and regularizing.

dying neurons · fast · no vanishing
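The ~50% sparsity figure can be verified with a quick sketch, assuming roughly zero-centered Gaussian pre-activations (a common situation at initialization):

```python
import random

def relu(x):
    return max(0.0, x)

random.seed(0)
# Zero-centered pre-activations: about half fall below zero,
# so about half the ReLU outputs are exactly zero.
z = [random.gauss(0.0, 1.0) for _ in range(10_000)]
a = [relu(v) for v in z]
sparsity = sum(1 for v in a if v == 0.0) / len(a)
print(f"fraction of zero outputs: {sparsity:.2f}")  # ~0.50
```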
04 // Tanh
TANH
f(x) = (e^x - e^-x) / (e^x + e^-x)
derivative: f'(x) = 1 - tanh²(x)   max: 1.0
Output range: (-1, 1)
Max gradient: 1.0 at x = 0
Zero-centered: yes

Rescaled sigmoid centered at zero. Being zero-centered helps gradient updates point in consistent directions — sigmoid's positive-only output causes zig-zagging optimization. Still saturates, still vanishes in deep nets, but better than sigmoid for hidden layers. Common in RNNs.

zero-centered · saturates · RNN staple
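The zero-centered distinction shows up immediately in output statistics; a quick comparison on zero-mean Gaussian inputs:

```python
import math
import random

random.seed(1)
z = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Tanh outputs average near zero; sigmoid outputs are always positive,
# which biases downstream gradient updates toward one direction.
mean_tanh = sum(math.tanh(v) for v in z) / len(z)
mean_sig = sum(1.0 / (1.0 + math.exp(-v)) for v in z) / len(z)
print(f"tanh mean:    {mean_tanh:+.3f}")  # near 0
print(f"sigmoid mean: {mean_sig:+.3f}")   # near +0.5
```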
05 // GELU
GELU
f(x) = x · Φ(x)
where Φ(x) is the Gaussian CDF — smooth stochastic gating
Output range: (-0.17, +inf)
Zero crossing: smooth, not hard
Used in: BERT, GPT, all modern transformers

Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016). Instead of hard-gating negative values to zero like ReLU, GELU weights inputs by their probability under a Gaussian — a soft stochastic gate. The small negative dip (minimum ≈ -0.17 near x ≈ -0.75) provides richer gradient signal than ReLU's flat zero. GPT-2 onward uses it exclusively.

transformers · BERT/GPT · smooth gate
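The exact form needs only the Gaussian CDF, which the standard library provides via erf; a minimal sketch:

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(1.0))    # ~0.841  -- positive inputs pass mostly through
print(gelu(-0.75))  # ~-0.170 -- the negative dip (the minimum)
print(gelu(-4.0))   # ~-0.0001 -- deep negatives gated toward zero
```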
06 // Interactive // Vanishing Gradient Demo
GRADIENT DECAY THROUGH LAYERS

During backpropagation, each layer multiplies the incoming gradient by the local derivative of its activation function. With sigmoid, the max derivative is 0.25 — meaning at best 75% of gradient signal is destroyed at every layer. Stack 10 layers and the gradient reaching layer 1 is a ghost.

[Interactive: select an activation and a layer count (default 10) to see the gradient magnitude reaching layer 1.]
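The decay the demo visualizes can be reproduced by chaining local derivatives. A sketch evaluating each derivative at a fixed pre-activation x = 1 at every layer (an illustrative choice, not a simulation of a real network):

```python
import math

def chain_gradient(local_grad, layers):
    # Backprop multiplies the local derivative at every layer
    g = 1.0
    for _ in range(layers):
        g *= local_grad
    return g

x, layers = 1.0, 10
s = 1.0 / (1.0 + math.exp(-x))
for name, d in [("sigmoid", s * (1.0 - s)),          # ~0.197
                ("tanh", 1.0 - math.tanh(x) ** 2),   # ~0.420
                ("relu", 1.0)]:                      # exactly 1
    print(f"{name:8s} grad at layer 1: {chain_gradient(d, layers):.2e}")
# sigmoid collapses to ~8.6e-08, tanh to ~1.7e-04, relu stays at 1
```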
07 // Interactive // Live Output
INPUT X RESPONSE

Drag the slider to probe each function at a given input. Watch how sigmoid and tanh saturate in the extremes, flattening their gradients to near-zero. ReLU and GELU stay alive.

x = 0.00
Sigmoid: 0.5000 (grad 0.2500)
ReLU:    0.0000 (grad 0.0000)
Tanh:    0.0000 (grad 1.0000)
GELU:    0.0000 (grad 0.5000)
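The readout values follow directly from the formulas in the sections above; a sketch that reproduces the x = 0 row (and works for any probe point):

```python
import math

def probe(x):
    # (f(x), f'(x)) for each activation at input x
    s = 1.0 / (1.0 + math.exp(-x))
    t = math.tanh(x)
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))      # Gaussian CDF
    pdf = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return {
        "sigmoid": (s, s * (1.0 - s)),
        "relu":    (max(0.0, x), 1.0 if x > 0 else 0.0),
        "tanh":    (t, 1.0 - t * t),
        "gelu":    (x * phi, phi + x * pdf),  # d/dx x*Phi(x) = Phi(x) + x*pdf(x)
    }

for name, (f, g) in probe(0.0).items():
    print(f"{name:8s} f = {f:.4f}   grad = {g:.4f}")
```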
08 // Variants
LEAKY RELU
f(x) = x if x > 0, else 0.01x

Fixes dying ReLU by allowing a small negative slope (0.01). Neurons can never fully die — they still receive small gradients. Parametric ReLU (PReLU) learns this slope. Neither dominates ReLU in practice unless dead neuron rate is a measured problem.

no dead neurons
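A minimal sketch of the fix: the negative side keeps a small slope, so the gradient is small but never exactly zero.

```python
def leaky_relu(x, slope=0.01):
    # Negative inputs are scaled rather than hard-zeroed
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

print(f"{leaky_relu(-5.0):.2f}")       # -0.05
print(f"{leaky_relu_grad(-5.0):.2f}")  # 0.01 -- never exactly zero
```

PReLU would make `slope` a learnable parameter per channel instead of a fixed constant.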
09 // Variants
SWISH / SILU
f(x) = x · sigmoid(x)

Self-gated activation discovered by neural architecture search (Ramachandran et al., 2017). Smooth, non-monotonic, unbounded above and bounded below. Outperforms ReLU on deep networks. Used in EfficientNet. SiLU is the same function — the name PyTorch uses.

NAS-discovered
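A minimal sketch of the self-gating: the input is scaled by its own sigmoid.

```python
import math

def swish(x):
    # x gates itself through its own sigmoid
    return x * (1.0 / (1.0 + math.exp(-x)))

print(swish(1.0))    # ~0.731
print(swish(-1.0))   # ~-0.269 -- the non-monotonic dip below zero
print(swish(-10.0))  # ~-0.0005 -- bounded below, approaches 0
```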
10 // Variants
ELU / SELU
f(x) = x if x>0, else α(e^x - 1)

Exponential Linear Unit. Negative saturation is smooth rather than hard zero — mean activations can be pushed toward zero. SELU (Scaled ELU) adds a scale factor that induces self-normalizing behavior: activations remain near unit variance without batch normalization.

self-normalizing
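The self-normalizing claim can be checked numerically. A sketch using the scale and alpha constants published in the SELU paper (Klambauer et al., 2017), on unit-variance Gaussian inputs:

```python
import math
import random

# Constants from the SELU paper, chosen so that mean 0 / variance 1
# is a fixed point of the activation's moment mapping
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    return SCALE * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))

random.seed(3)
z = [random.gauss(0.0, 1.0) for _ in range(100_000)]
a = [selu(v) for v in z]
mean = sum(a) / len(a)
var = sum((v - mean) ** 2 for v in a) / len(a)
# Outputs stay near mean 0, variance 1 -- no batch norm needed
print(f"output mean ~ {mean:.2f}, variance ~ {var:.2f}")
```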