A fully connected layer treats every input pixel as independent. A 512×512 image fed into a fully connected layer of 1000 neurons requires 512×512×1000 ≈ 262 million weights — per layer. Worse, it ignores a fundamental fact about images: nearby pixels are related. A cat's ear is a local pattern. So is an edge, a corner, a texture.
Convolution exploits this. A small kernel — typically 3x3 or 5x5 — slides across the image computing a dot product at every position. The same kernel weights are used everywhere: weight sharing. This reduces parameters by orders of magnitude and forces the network to learn position-invariant detectors. An edge detector learned in the top-left corner works just as well in the bottom-right.
The term "convolutional" is technically a misnomer — it's actually cross-correlation, not convolution (which flips the kernel). The distinction doesn't matter in practice since kernels are learned anyway, but it bothers signal processing people.
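A quick numpy check makes the kernel-flip distinction concrete — one output value computed both ways (an illustrative sketch, not tied to any framework):

```python
import numpy as np

patch = np.arange(9.0).reshape(3, 3)        # a 3x3 image patch: 0..8
k = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])             # asymmetric kernel, so the two ops differ

cross_corr = np.sum(patch * k)              # what conv layers compute: 0*1 + 8*2 = 16
true_conv = np.sum(patch * k[::-1, ::-1])   # true convolution flips the kernel: 0*2 + 8*1 = 8

assert cross_corr == 16.0 and true_conv == 8.0
```

Since the kernel weights are learned, the network simply learns the flipped version if it needs it — hence the distinction being moot in practice.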
Select a kernel type and watch it slide across the input image. Each position computes a dot product between the kernel weights and the local patch — that scalar becomes one pixel in the output feature map. The highlighted region shows the current receptive field.
For a 2D input I and kernel K of size k×k, the output feature map O at position (i,j) is:

O(i,j) = Σ_m Σ_n I(i+m, j+n) · K(m,n)

summed over all kernel positions (m,n)
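A minimal sketch of that formula as a direct (and deliberately slow) quadruple loop in numpy, assuming stride 1 and no padding:

```python
import numpy as np

def cross_correlate(I, K):
    """O(i,j) = sum over (m,n) of I(i+m, j+n) * K(m,n)."""
    k = K.shape[0]
    H, W = I.shape
    O = np.zeros((H - k + 1, W - k + 1))
    for i in range(O.shape[0]):          # every output position...
        for j in range(O.shape[1]):
            for m in range(k):           # ...sums the k*k local products
                for n in range(k):
                    O[i, j] += I[i + m, j + n] * K[m, n]
    return O

I = np.ones((5, 5))
K = np.ones((3, 3))
O = cross_correlate(I, K)
assert O.shape == (3, 3)          # 5 - 3 + 1 = 3 per axis
assert np.all(O == 9.0)           # each output sums nine ones
```

Real frameworks implement this with im2col or FFT tricks; the loop is only for matching the math.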
For a 3x3 kernel on a single-channel image: 9 multiplications and 8 additions per output pixel. With input size W and stride s, the output dimensions are:

W_out = ⌊(W − k + 2p) / s⌋ + 1

k = kernel size, p = padding, s = stride
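That formula as a one-line helper (`conv_output_size` is an illustrative name, not a library function):

```python
def conv_output_size(w, k, p=0, s=1):
    """Spatial output size: floor((w - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1

# 3x3 kernel with "same" padding (p=1), stride 1: size preserved
assert conv_output_size(224, k=3, p=1, s=1) == 224
# same kernel, stride 2: size halved
assert conv_output_size(224, k=3, p=1, s=2) == 112
# "valid" padding (p=0) loses a border of (k-1)/2 pixels per side
assert conv_output_size(7, k=3, p=0, s=1) == 5
```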
With N filters (kernels), the output has N channels — the feature maps. Each filter learns to detect a different pattern. The total parameter count for one conv layer:

params = k² · C_in · C_out + C_out

k = kernel size, C_in = input channels, C_out = output channels, plus one bias per filter
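The parameter count as a tiny helper, checked against a typical VGG-style layer (illustrative names, assuming one bias per filter):

```python
def conv_params(k, c_in, c_out, bias=True):
    """k*k*c_in weights per filter, c_out filters, plus one bias per filter."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# a typical 3x3 layer mapping 64 -> 128 channels
assert conv_params(3, 64, 128) == 73_856
# contrast with the fully connected layer from the opening example:
assert 512 * 512 * 1000 == 262_144_000
```

Note the conv layer's cost is independent of image size — only kernel size and channel counts matter.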
A neuron in the first conv layer with a 3x3 kernel has a receptive field of 3x3 — it sees 9 pixels. After a second 3x3 conv layer, each output neuron now sees a 5x5 patch of the original input. After five layers: 11x11.
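The growth rule can be sketched as a running sum: each layer adds (k−1) times the product of all earlier strides (a hypothetical helper, assuming no dilation):

```python
def receptive_field(kernels, strides=None):
    """RF after stacking conv layers: each layer adds (k-1) * product of earlier strides."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s                         # stride compounds for later layers
    return rf

# five stacked 3x3 convs, stride 1: 3 -> 5 -> 7 -> 9 -> 11
assert [receptive_field([3] * n) for n in range(1, 6)] == [3, 5, 7, 9, 11]
# a stride-2 first layer doubles the contribution of the second layer
assert receptive_field([3, 3], strides=[2, 1]) == 7
```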
This is why deep networks can detect complex, large-scale features. Early layers detect edges and textures, middle layers detect parts, deeper layers detect objects. The hierarchy is real and visualizable — dissecting a trained CNN shows this clearly.
Receptive field grows linearly with depth; stride multiplies the growth rate.
Dilated convolutions skip pixels with a dilation factor d, expanding the receptive field without adding parameters: RF grows as if the kernel were d×(k-1)+1. Used in WaveNet, DeepLab.
Zeiler and Fergus (2013) first systematically visualized what each layer of a deep CNN learns, projecting strong activations back to pixel space with a deconvolutional network (later work found similar structure by gradient ascent on the input). The results revealed a clean hierarchy:
Layer 1: edges, colors, orientations
Layer 2: corners, junctions, frequencies
Layer 3: repeating patterns, part-like shapes, text patterns
Deeper layers: object parts; faces, animals; scene geometry
After each conv layer, pooling reduces spatial dimensions. Max pooling takes the maximum activation within each local region — this preserves the strongest feature response while discarding exact position. It builds translation invariance into the architecture.
224x224 → 112x112 → 56x56 → 28x28 → 14x14
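A minimal max-pooling sketch via numpy reshaping, assuming non-overlapping 2x2 windows that divide the input evenly:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: stride equals window size."""
    H, W = x.shape
    # split each axis into (blocks, within-block), then max over the within-block axes
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
y = max_pool2d(x)
assert y.shape == (2, 2)                                  # 4x4 -> 2x2
# each 2x2 window keeps only its strongest activation:
assert np.array_equal(y, np.array([[5.0, 7.0], [13.0, 15.0]]))
```

Shifting the input by a pixel often leaves the pooled output unchanged — that is the translation invariance the prose describes.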
Average pooling takes the mean instead of the max. Global Average Pooling (GAP) collapses an entire feature map to a single number — used in modern architectures (ResNet, EfficientNet) to replace fully-connected layers, drastically reducing parameters and overfitting. Spatial Pyramid Pooling handles variable-size inputs.
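GAP is just a mean over the spatial axes. A sketch of the parameter savings versus flatten-then-FC, assuming a ResNet-like 512×7×7 final feature map and a 1000-class head:

```python
import numpy as np

feature_maps = np.random.rand(512, 7, 7)   # (channels, H, W) final conv output
gap = feature_maps.mean(axis=(1, 2))       # collapse each map to one number
assert gap.shape == (512,)

# classifier weights: flatten + FC versus GAP + FC
fc_after_flatten = 7 * 7 * 512 * 1000      # 25 million
fc_after_gap = 512 * 1000                  # half a million
assert fc_after_flatten == 49 * fc_after_gap   # 49x fewer classifier weights
```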
stride=2: halves spatial dims
Stride >1 replaces pooling in some architectures (strided convolution). "Same" padding (zeros around border) preserves spatial size. "Valid" padding loses border pixels. Choice affects receptive field growth and information retention at edges.
normalize within mini-batch, learnable γ,β
Introduced by Ioffe and Szegedy (2015). Normalizes activations within each mini-batch, drastically stabilizing training and allowing much higher learning rates. Typically inserted after each conv layer in modern architectures. Also acts as a regularizer, sometimes replacing dropout. Layer Norm is the transformer equivalent.
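A training-mode forward pass in numpy (inference would use running statistics instead; γ and β are learned in practice, fixed here for illustration):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel over the batch, then scale/shift with learnable gamma, beta."""
    mean = x.mean(axis=0)                   # per-channel batch mean
    var = x.var(axis=0)                     # per-channel batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 64) * 5.0 + 3.0     # batch of 32, 64 channels, shifted and scaled
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))

assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)  # zero mean per channel
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)   # unit variance per channel
```

With γ=1, β=0 the output is exactly standardized; learning γ and β lets the network undo the normalization where that helps.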
separable: k²·C_in + C_in·C_out
MobileNet's key insight. A standard conv mixes spatial filtering and channel mixing in one operation. Separable convolution does them separately: depthwise (one filter per input channel, spatial only) then pointwise (1x1 conv, channel mixing only). 8-9x fewer operations. Foundation of MobileNet, Xception, EfficientNet.
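The per-position operation counts, comparing standard and depthwise-separable convolution (illustrative helpers, channel counts chosen to match a typical MobileNet layer):

```python
def standard_conv_ops(k, c_in, c_out):
    """Multiplications per output position: spatial and channel mixing fused."""
    return k * k * c_in * c_out

def separable_conv_ops(k, c_in, c_out):
    """Depthwise (k*k per input channel) then pointwise 1x1 (channel mixing)."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 128
ratio = standard_conv_ops(k, c_in, c_out) / separable_conv_ops(k, c_in, c_out)
assert 8 < ratio < 9   # the 8-9x savings cited for 3x3 kernels
```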