A fully connected layer treats every input pixel as independent. A 512×512 image fed into a fully connected layer of 1000 neurons requires 512×512×1000 ≈ 262 million weights — per layer. Worse, it ignores a fundamental fact about images: nearby pixels are related. A cat's ear is a local pattern. So is an edge, a corner, a texture.
Convolution exploits this. A small kernel — typically 3x3 or 5x5 — slides across the image computing a dot product at every position. The same kernel weights are used everywhere: weight sharing. This reduces parameters by orders of magnitude and forces the network to learn position-invariant detectors. An edge detector learned in the top-left corner works just as well in the bottom-right.
The term "convolutional" is technically a misnomer — it's actually cross-correlation, not convolution (which flips the kernel). The distinction doesn't matter in practice since kernels are learned anyway, but it bothers signal processing people.
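A quick numpy check makes the kernel-flip distinction concrete — one output value computed both ways (an illustrative sketch, not tied to any framework):

```python
import numpy as np

patch = np.arange(9.0).reshape(3, 3)        # a 3x3 image patch: 0..8
k = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])             # asymmetric kernel, so the two ops differ

cross_corr = np.sum(patch * k)              # what conv layers compute: 0*1 + 8*2 = 16
true_conv = np.sum(patch * k[::-1, ::-1])   # true convolution flips the kernel: 0*2 + 8*1 = 8

assert cross_corr == 16.0 and true_conv == 8.0
```

Since the kernel weights are learned, the network simply learns the flipped version if it needs it — hence the distinction being moot in practice.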
Select a kernel type and watch it slide across the input image. Each position computes a dot product between the kernel weights and the local patch — that scalar becomes one pixel in the output feature map. The highlighted region shows the current receptive field.
For a 2D input I and kernel K of size k×k, the output feature map O at position (i,j) is:

O(i,j) = Σ_m Σ_n I(i+m, j+n) · K(m,n)

summed over all kernel positions (m,n)
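A minimal sketch of that formula as a direct (and deliberately slow) quadruple loop in numpy, assuming stride 1 and no padding:

```python
import numpy as np

def cross_correlate(I, K):
    """O(i,j) = sum over (m,n) of I(i+m, j+n) * K(m,n)."""
    k = K.shape[0]
    H, W = I.shape
    O = np.zeros((H - k + 1, W - k + 1))
    for i in range(O.shape[0]):          # every output position...
        for j in range(O.shape[1]):
            for m in range(k):           # ...sums the k*k local products
                for n in range(k):
                    O[i, j] += I[i + m, j + n] * K[m, n]
    return O

I = np.ones((5, 5))
K = np.ones((3, 3))
O = cross_correlate(I, K)
assert O.shape == (3, 3)          # 5 - 3 + 1 = 3 per axis
assert np.all(O == 9.0)           # each output sums nine ones
```

Real frameworks implement this with im2col or FFT tricks; the loop is only for matching the math.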
For a 3x3 kernel on a single-channel image: 9 multiplications and 8 additions per output pixel. With input size W and stride s, the output dimensions are:

W_out = ⌊(W − k + 2p) / s⌋ + 1

k = kernel size, p = padding, s = stride
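That formula as a one-line helper (`conv_output_size` is an illustrative name, not a library function):

```python
def conv_output_size(w, k, p=0, s=1):
    """Spatial output size: floor((w - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1

# 3x3 kernel with "same" padding (p=1), stride 1: size preserved
assert conv_output_size(224, k=3, p=1, s=1) == 224
# same kernel, stride 2: size halved
assert conv_output_size(224, k=3, p=1, s=2) == 112
# "valid" padding (p=0) loses a border of (k-1)/2 pixels per side
assert conv_output_size(7, k=3, p=0, s=1) == 5
```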
With N filters (kernels), the output has N channels — the feature maps. Each filter learns to detect a different pattern. The total parameter count for one conv layer:

params = k² · C_in · C_out + C_out

k = kernel size, C_in = input channels, C_out = output channels, plus one bias per filter
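The parameter count as a tiny helper, checked against a typical VGG-style layer (illustrative names, assuming one bias per filter):

```python
def conv_params(k, c_in, c_out, bias=True):
    """k*k*c_in weights per filter, c_out filters, plus one bias per filter."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# a typical 3x3 layer mapping 64 -> 128 channels
assert conv_params(3, 64, 128) == 73_856
# contrast with the fully connected layer from the opening example:
assert 512 * 512 * 1000 == 262_144_000
```

Note the conv layer's cost is independent of image size — only kernel size and channel counts matter.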
A neuron in the first conv layer with a 3x3 kernel has a receptive field of 3x3 — it sees 9 pixels. After a second 3x3 conv layer, each output neuron now sees a 5x5 patch of the original input. After five layers: 11x11.
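The growth rule can be sketched as a running sum: each layer adds (k−1) times the product of all earlier strides (a hypothetical helper, assuming no dilation):

```python
def receptive_field(kernels, strides=None):
    """RF after stacking conv layers: each layer adds (k-1) * product of earlier strides."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s                         # stride compounds for later layers
    return rf

# five stacked 3x3 convs, stride 1: 3 -> 5 -> 7 -> 9 -> 11
assert [receptive_field([3] * n) for n in range(1, 6)] == [3, 5, 7, 9, 11]
# a stride-2 first layer doubles the contribution of the second layer
assert receptive_field([3, 3], strides=[2, 1]) == 7
```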
This is why deep networks can detect complex, large-scale features. Early layers detect edges and textures, middle layers detect parts, deeper layers detect objects. The hierarchy is real and visualizable — dissecting a trained CNN shows this clearly.
Receptive field grows linearly with depth; stride multiplies the growth rate.
Dilated convolutions skip pixels with a dilation factor d, expanding the receptive field without adding parameters: RF grows as if the kernel were d×(k-1)+1. Used in WaveNet, DeepLab.
Zeiler and Fergus (2013) first systematically visualized what each layer of a deep CNN learns, projecting strong activations back to pixel space with a deconvolutional network (later work found similar structure by gradient ascent on the input). The results revealed a clean hierarchy:
Layer 1: edges, colors, orientations
Layer 2: corners, junctions, frequencies
Layer 3: repeating patterns, part-like shapes, text patterns
Deeper layers: object parts; faces, animals; scene geometry
After each conv layer, pooling reduces spatial dimensions. Max pooling takes the maximum activation within each local region — this preserves the strongest feature response while discarding exact position. It builds translation invariance into the architecture.
224x224 → 112x112 → 56x56 → 28x28 → 14x14
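A minimal max-pooling sketch via numpy reshaping, assuming non-overlapping 2x2 windows that divide the input evenly:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: stride equals window size."""
    H, W = x.shape
    # split each axis into (blocks, within-block), then max over the within-block axes
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
y = max_pool2d(x)
assert y.shape == (2, 2)                                  # 4x4 -> 2x2
# each 2x2 window keeps only its strongest activation:
assert np.array_equal(y, np.array([[5.0, 7.0], [13.0, 15.0]]))
```

Shifting the input by a pixel often leaves the pooled output unchanged — that is the translation invariance the prose describes.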
Average pooling takes the mean instead of the max. Global Average Pooling (GAP) collapses an entire feature map to a single number — used in modern architectures (ResNet, EfficientNet) to replace fully-connected layers, drastically reducing parameters and overfitting. Spatial Pyramid Pooling handles variable-size inputs.
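GAP is just a mean over the spatial axes. A sketch of the parameter savings versus flatten-then-FC, assuming a ResNet-like 512×7×7 final feature map and a 1000-class head:

```python
import numpy as np

feature_maps = np.random.rand(512, 7, 7)   # (channels, H, W) final conv output
gap = feature_maps.mean(axis=(1, 2))       # collapse each map to one number
assert gap.shape == (512,)

# classifier weights: flatten + FC versus GAP + FC
fc_after_flatten = 7 * 7 * 512 * 1000      # 25 million
fc_after_gap = 512 * 1000                  # half a million
assert fc_after_flatten == 49 * fc_after_gap   # 49x fewer classifier weights
```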
stride=2: halves spatial dims
Stride >1 replaces pooling in some architectures (strided convolution). "Same" padding (zeros around border) preserves spatial size. "Valid" padding loses border pixels. Choice affects receptive field growth and information retention at edges.
normalize within mini-batch, learnable γ,β
Introduced by Ioffe and Szegedy (2015). Normalizes activations within each mini-batch, drastically stabilizing training and allowing much higher learning rates. Typically inserted after each conv layer in modern architectures. Also acts as a regularizer, sometimes replacing dropout. Layer Norm is the transformer equivalent.
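A training-mode forward pass in numpy (inference would use running statistics instead; γ and β are learned in practice, fixed here for illustration):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel over the batch, then scale/shift with learnable gamma, beta."""
    mean = x.mean(axis=0)                   # per-channel batch mean
    var = x.var(axis=0)                     # per-channel batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 64) * 5.0 + 3.0     # batch of 32, 64 channels, shifted and scaled
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))

assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)  # zero mean per channel
assert np.allclose(y.std(axis=0), 1.0, atol=1e-2)   # unit variance per channel
```

With γ=1, β=0 the output is exactly standardized; learning γ and β lets the network undo the normalization where that helps.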
separable: k²·C_in + C_in·C_out
MobileNet's key insight. A standard conv mixes spatial filtering and channel mixing in one operation. Separable convolution does them separately: depthwise (one filter per input channel, spatial only) then pointwise (1x1 conv, channel mixing only). 8-9x fewer operations. Foundation of MobileNet, Xception, EfficientNet.
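The per-position operation counts, comparing standard and depthwise-separable convolution (illustrative helpers, channel counts chosen to match a typical MobileNet layer):

```python
def standard_conv_ops(k, c_in, c_out):
    """Multiplications per output position: spatial and channel mixing fused."""
    return k * k * c_in * c_out

def separable_conv_ops(k, c_in, c_out):
    """Depthwise (k*k per input channel) then pointwise 1x1 (channel mixing)."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 128
ratio = standard_conv_ops(k, c_in, c_out) / separable_conv_ops(k, c_in, c_out)
assert 8 < ratio < 9   # the 8-9x savings cited for 3x3 kernels
```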