Without an activation function, stacking linear layers is mathematically equivalent to a single linear transformation. No matter how deep the network, it can only learn linear mappings — useless for anything interesting. Activation functions introduce nonlinearity at each neuron, allowing the network to approximate arbitrarily complex functions.
The activation function is applied element-wise to the pre-activation value z = Wx + b. Its output becomes the neuron's activation — what flows to the next layer. The choice of function determines gradient behavior across layers, which in turn determines whether training works at all in deep architectures.
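As a minimal sketch of that pipeline (NumPy, with hypothetical weights and input chosen for illustration), one layer is a matrix product plus bias, followed by an element-wise nonlinearity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-neuron layer with a 3-dimensional input.
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])
b = np.array([0.1, -0.1])
x = np.array([1.0, 2.0, 3.0])

z = W @ x + b       # pre-activation: z = Wx + b
a = sigmoid(z)      # activation function applied element-wise
print(z)            # [0.5 0.3]
```

Swapping `sigmoid` for any other function below changes only the element-wise step; the rest of the layer is unchanged.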
The history of deep learning is partly the history of finding better activations. Sigmoid dominated until ~2010. ReLU unlocked deep networks. GELU now dominates transformers. The ideal function is nonlinear, computationally cheap, and keeps gradients alive through many layers.
derivative: f'(x) = f(x)(1 - f(x)) max: 0.25
Classic S-curve. Saturates hard at both extremes — the gradient approaches zero outside roughly [-4, 4]. In a 10-layer network, even the best-case sigmoid derivative (0.25 per layer) shrinks the gradient by 0.25^10 ≈ 9.5×10^-7 before it reaches the first layer. Training becomes impossible. This is the vanishing gradient problem.
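Those figures are quick to verify from the derivative formula above (a sketch; the probe points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 — the maximum, at x = 0
print(0.25 ** 10)          # ≈ 9.5e-07 — best case over 10 layers
print(sigmoid_grad(4.0))   # ≈ 0.018 — already nearly flat at x = 4
```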
derivative: f'(x) = 0 if x < 0, else 1
Rectified Linear Unit. AlexNet (2012) used it to win ImageNet by a large margin — the paper that reset the field. Gradient is 1 on the positive side so it passes through layers unchanged. Creates sparse activations: ~50% of neurons output zero at any time, which turns out to be efficient and regularizing.
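The ~50% sparsity claim follows directly from zero-mean pre-activations: about half are negative and get clipped to exactly zero. A sketch (random inputs stand in for real pre-activations):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)   # zero-mean stand-in for pre-activations
a = relu(z)

# Roughly half the activations are exactly zero — sparse by construction.
sparsity = np.mean(a == 0.0)
print(sparsity)   # ≈ 0.5
```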
derivative: f'(x) = 1 - tanh²(x) max: 1.0
Rescaled sigmoid centered at zero. Being zero-centered helps gradient updates point in consistent directions — sigmoid's positive-only output causes zig-zagging optimization. Still saturates, still vanishes in deep nets, but better than sigmoid for hidden layers. Common in RNNs.
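"Rescaled sigmoid" is exact, not a loose analogy: tanh(x) = 2·sigmoid(2x) − 1. A quick numeric check of the identity and of the zero-centering:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# tanh is a shifted, rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# Zero-centered output, unlike sigmoid's (0, 1) range:
print(np.tanh(0.0), sigmoid(0.0))   # 0.0 0.5
```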
f(x) = x·Φ(x), where Φ(x) is the Gaussian CDF — smooth stochastic gating
Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016). Instead of hard-gating negative values to zero like ReLU, GELU weights each input by its probability under a Gaussian — a soft stochastic gate. The small negative dip (bottoming out around −0.17 near x ≈ −0.75) provides richer gradient signal than ReLU's flat zero. GPT-2 onward uses it exclusively.
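The exact form is a one-liner with the standard normal CDF (a sketch using `math.erf`; transformer implementations often use a tanh approximation instead):

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Soft gating: negative inputs are damped, not zeroed.
print(gelu(-1.0))    # ≈ -0.159
print(gelu(1.0))     # ≈ 0.841
# The negative dip: minimum ≈ -0.17, near x ≈ -0.75.
print(gelu(-0.75))   # ≈ -0.170
```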
During backpropagation, each layer multiplies the incoming gradient by the local derivative of its activation function. With sigmoid, the maximum derivative is 0.25 — so at least 75% of the gradient magnitude is destroyed at every layer. Stack 10 layers and the gradient reaching layer 1 is a ghost.
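That chain of multiplications can be simulated directly. A sketch comparing sigmoid and ReLU, with a hypothetical pre-activation value of 1.0 at each of 10 layers:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Hypothetical pre-activations seen at each of 10 layers.
zs = np.full(10, 1.0)

# Backprop multiplies the gradient by each layer's local derivative.
sigmoid_chain = np.prod(sigmoid_grad(zs))
relu_chain = np.prod(np.where(zs > 0, 1.0, 0.0))

print(sigmoid_chain)  # ≈ 8.6e-08 — the gradient is a ghost
print(relu_chain)     # 1.0 — ReLU passes it through unchanged
```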
Drag the slider to probe each function at a given input. Watch how sigmoid and tanh saturate at the extremes, flattening their gradients to near-zero. ReLU and GELU stay alive.
Leaky ReLU fixes dying ReLU by allowing a small negative slope (typically 0.01). Neurons can never fully die — they still receive small gradients. Parametric ReLU (PReLU) learns the slope instead of fixing it. Neither reliably beats plain ReLU in practice unless dead neurons are a measured problem.
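The fix is a one-line change to ReLU (a sketch with the conventional 0.01 slope):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps a nonzero gradient for x < 0,
    # so neurons can always recover.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.5])))   # ≈ [-0.02, 0.5]
```

With `alpha` made a learnable parameter per channel, this same function becomes PReLU.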
Self-gated activation, f(x) = x·sigmoid(x), discovered by neural architecture search (Ramachandran et al., 2017). Smooth, non-monotonic, unbounded above and bounded below. Often matches or outperforms ReLU on deep networks. Used in EfficientNet. SiLU is the same function — the name used in PyTorch.
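"Self-gated" means the input gates itself through a sigmoid — a sketch showing the non-monotonic negative region:

```python
import numpy as np

def swish(x):
    # x * sigmoid(x): the input is its own gate.
    return x / (1.0 + np.exp(-x))

# Non-monotonic: dips below zero before rising, unlike ReLU.
print(swish(-1.0))   # ≈ -0.269
print(swish(2.0))    # ≈ 1.762
```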
Exponential Linear Unit. Negative values saturate smoothly toward −α rather than clipping to a hard zero, which pushes mean activations toward zero. SELU (Scaled ELU) adds a fixed scale factor chosen to induce self-normalizing behavior: activations remain near zero mean and unit variance without batch normalization.
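A sketch of both, using the fixed SELU constants from Klambauer et al. (2017); the self-normalizing fixed point can be checked empirically on standard-normal inputs:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Smooth negative saturation toward -alpha instead of a hard zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Fixed constants derived in the SELU paper.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)

# Self-normalization: N(0, 1) inputs come out near zero mean, unit variance.
rng = np.random.default_rng(0)
a = selu(rng.standard_normal(1_000_000))
print(a.mean(), a.var())   # both close to the (0, 1) fixed point
```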