CONVOLUTIONAL
NETWORKS

SPATIAL FEATURE EXTRACTION VIA LEARNED FILTERS
KERNELS // FEATURE MAPS // RECEPTIVE FIELDS
POOLING // STRIDE // TRANSLATION INVARIANCE
01 // Concept
WHY CONVOLUTION

A fully connected layer treats every input pixel as independent. A 512×512 grayscale image fed into a fully connected layer of 1000 neurons requires 512×512×1000 ≈ 262 million weights — per layer. Worse, it ignores the fundamental fact about images: nearby pixels are related. A cat's ear is a local pattern. So is an edge, a corner, a texture.

Convolution exploits this. A small kernel — typically 3x3 or 5x5 — slides across the image computing a dot product at every position. The same kernel weights are used everywhere: weight sharing. This reduces parameters by orders of magnitude and forces the network to learn position-invariant detectors. An edge detector learned in the top-left corner works just as well in the bottom-right.

The term "convolutional" is technically a misnomer — it's actually cross-correlation, not convolution (which flips the kernel). The distinction doesn't matter in practice since kernels are learned anyway, but it bothers signal processing people.

3x3 // typical kernel size
~1% // params vs fully connected
1989 // LeCun // LeNet
2012 // AlexNet // ImageNet
02 // Interactive // Kernel Convolution
KERNEL SLIDING DEMO

Select a kernel type and watch it slide across the input image. Each position computes a dot product between the kernel weights and the local patch — that scalar becomes one pixel in the output feature map. The highlighted region shows the current receptive field.

[Interactive demo: a 3x3 kernel slides over an 8x8 input, producing a 6x6 feature map; each step shows the dot product, the result after ReLU, and the current kernel position.]
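The same computation in a few lines of NumPy, as a minimal sketch of the demo above (the vertical-edge kernel is a hand-picked example; the interactive version lets you choose):

import numpy as np

def cross_correlate(image, kernel):
    # Slide the kernel over the image; each dot product is one output pixel.
    k = kernel.shape[0]
    out_size = image.shape[0] - k + 1              # valid padding, stride 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i+k, j:j+k]            # current receptive field
            out[i, j] = np.sum(patch * kernel)     # dot product
    return out

image = np.zeros((8, 8))                           # 8x8 input ...
image[:, 4:] = 1.0                                 # ... with a vertical edge

kernel = np.array([[-1, 0, 1],                     # Sobel-style vertical
                   [-2, 0, 2],                     # edge detector
                   [-1, 0, 1]], dtype=float)

feature_map = cross_correlate(image, kernel)       # shape (6, 6)
print(np.maximum(feature_map, 0))                  # after ReLU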
03 // The Math
CROSS-CORRELATION

For a 2D input I and kernel K of size k×k, the output feature map O at position (i,j) is:

O(i,j) = Σₘ Σₙ I(i+m, j+n) · K(m,n)
summed over all kernel positions (m,n)

For a 3x3 kernel on a single-channel image: 9 multiplications and 8 additions per output pixel. With stride s, the output dimensions are:

out_size = floor((in_size - k + 2p) / s) + 1
k = kernel size, p = padding, s = stride

With N filters (kernels), the output has N channels — the feature maps. Each filter learns to detect a different pattern. The total parameter count for one conv layer:

params = k² × C_in × C_out + C_out
k=kernel size, C_in=input channels, C_out=output channels, +bias
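Both formulas are easy to sanity-check in code. A small sketch, with layer shapes chosen purely for illustration:

from math import floor

def out_size(in_size, k, p=0, s=1):
    # floor((in - k + 2p) / s) + 1
    return floor((in_size - k + 2 * p) / s) + 1

def conv_params(k, c_in, c_out):
    # k*k weights per (input, output) channel pair, plus one bias per filter
    return k * k * c_in * c_out + c_out

print(out_size(224, k=3, p=1, s=1))        # 224 ("same" padding)
print(out_size(224, k=3, p=0, s=2))        # 111
print(conv_params(3, c_in=64, c_out=128))  # 73856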
04 // Receptive Field
HOW MUCH CAN ONE NEURON SEE

A neuron in the first conv layer with a 3x3 kernel has a receptive field of 3x3 — it sees 9 pixels. After a second 3x3 conv layer (stride 1 throughout), each output neuron sees a 5x5 patch of the original input. After five layers: 11x11.

This is why deep networks can detect complex, large-scale features. Early layers detect edges and textures, middle layers detect parts, deeper layers detect objects. The hierarchy is real and visualizable — dissecting a trained CNN shows this clearly.

RF_L = RF_{L-1} + (k-1) × Π_{i=1}^{L-1} s_i
RF grows with depth. Stride multiplies growth rate.

Dilated convolutions skip pixels with a dilation factor d, expanding the receptive field without adding parameters: the effective kernel size becomes d×(k-1)+1. Used in WaveNet and DeepLab.

3x3 × 5 layers → RF = 11x11 // 3x3 dilated, d=2 → effective 5x5
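The recursion is short enough to compute directly. A sketch, assuming each layer is given as a (kernel, stride, dilation) tuple, with dilation entering via the effective kernel size:

def receptive_field(layers):
    # RF_L = RF_{L-1} + (k_eff - 1) * product of all earlier strides
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1        # effective kernel size under dilation
        rf += (k_eff - 1) * jump
        jump *= s                      # stride multiplies future growth
    return rf

print(receptive_field([(3, 1, 1)] * 5))  # 11 -- five 3x3 stride-1 layers
print(receptive_field([(3, 1, 2)]))      # 5  -- one 3x3 dilated conv, d=2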
05 // Feature Hierarchy // What Layers Actually Learn
LAYER-BY-LAYER EMERGENCE

Zeiler and Fergus (2013) first systematically visualized what each layer of a deep CNN learns, using a deconvolutional network to project feature-map activations back to pixel space (later work synthesized maximally activating inputs via gradient ascent). The results revealed a clean hierarchy:

LAYER 1: Gabor-like filters (edges, colors, orientations)
LAYER 2: textures, grids, corners, junctions, frequencies
LAYER 3: complex textures, repeating patterns, part-like shapes
LAYER 4: object parts (dog faces, wheels, text patterns)
LAYER 5: whole objects (faces, animals, scene geometry)
06 // Pooling
SPATIAL DOWNSAMPLING

After each conv layer, pooling reduces spatial dimensions. Max pooling takes the maximum activation within each local region — this preserves the strongest feature response while discarding exact position. It builds translation invariance into the architecture.

MaxPool(2x2, stride=2):
224x224 → 112x112 → 56x56 → 28x28 → 14x14

Average pooling takes the mean instead of the max. Global Average Pooling (GAP) collapses an entire feature map to a single number — used in modern architectures (ResNet, EfficientNet) to replace fully-connected layers, drastically reducing parameters and overfitting. Spatial Pyramid Pooling handles variable-size inputs.

max pool // avg pool // GAP // stride 2
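A minimal NumPy sketch of MaxPool(2x2, stride=2), assuming even height and width; swapping max for mean gives average pooling, and reducing over the whole map gives GAP:

import numpy as np

def max_pool_2x2(x):
    # Split into non-overlapping 2x2 blocks, keep the max of each.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))   # [[ 5.  7.]
                         #  [13. 15.]]
print(x.mean())          # global average pooling: one number per map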
07 // Architectures
LANDMARK DESIGNS
1989
LeNet-5 — LeCun et al. First practical CNN: 5 layers, ~60K params, handwritten digit recognition (MNIST). Proved the concept, but computing power wasn't ready for scale.
2012
AlexNet — Krizhevsky et al. Won ImageNet by a 10.8% margin. Used ReLU (not tanh), dropout, data augmentation, GPU training. Reset the entire field in one paper.
2014
VGG-16 — Simonyan & Zisserman. All 3x3 kernels, 16 layers. Showed depth matters more than kernel size. 138M params, extremely influential as a baseline and transfer-learning source.
2015
ResNet-152 — He et al. Skip connections solve vanishing gradients at extreme depth. 152 layers. Beat human-level ImageNet performance. Residual learning is now universal.
2019
EfficientNet — Tan & Le. Neural architecture search to optimally scale width, depth, and resolution together. Best accuracy-per-parameter ratio for years. Still used in production vision systems.
08 // Stride
STRIDE + PADDING
stride=1: kernel steps 1px
stride=2: halves spatial dims

Stride >1 replaces pooling in some architectures (strided convolution). "Same" padding (zeros around border) preserves spatial size. "Valid" padding loses border pixels. Choice affects receptive field growth and information retention at edges.
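A quick worked check with the output-size formula from section 03, for a 224-pixel input and a 3x3 kernel at stride 1: "same" uses p = (k - 1)/2 = 1, giving (224 - 3 + 2)/1 + 1 = 224; "valid" uses p = 0, giving (224 - 3)/1 + 1 = 222, losing one border pixel on each side.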

same // valid // zero-pad
09 // Batch Norm
BATCH NORMALIZATION
y = γ · (x - μ_B)/σ_B + β
normalize within mini-batch, learnable γ,β

Introduced by Ioffe and Szegedy (2015). Normalizes activations within each mini-batch, drastically stabilizing training and allowing much higher learning rates. Added after every conv layer in modern architectures. Also acts as a regularizer, sometimes replacing dropout. Layer Norm is the transformer equivalent.

2015 // regularizer // faster train
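A minimal sketch of the training-time forward pass in NumPy, assuming inputs shaped (batch, features); the small eps keeps the division stable, a conv net would normalize per channel over batch and spatial dims, and inference swaps in running averages of the batch statistics:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # y = gamma * (x - mu_B) / sqrt(var_B + eps) + beta, per feature
    mu = x.mean(axis=0)                 # mini-batch mean
    var = x.var(axis=0)                 # mini-batch variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(32, 64) * 3 + 5     # batch of 32, badly scaled features
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # ~0 and ~1 per feature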
10 // Depthwise
DEPTHWISE SEPARABLE
standard: k²·C_in·C_out
separable: k²·C_in + C_in·C_out

MobileNet's key insight. A standard conv mixes spatial filtering and channel mixing in one operation. Separable convolution does them separately: depthwise (one filter per input channel, spatial only) then pointwise (1x1 conv, channel mixing only). 8-9x fewer operations. Foundation of MobileNet, Xception, EfficientNet.

MobileNet // 8x faster // edge devices
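A sketch comparing per-position multiply counts using the two formulas above, with channel widths chosen purely for illustration:

def standard_cost(k, c_in, c_out):
    return k * k * c_in * c_out          # spatial filtering + channel mixing

def separable_cost(k, c_in, c_out):
    # depthwise (spatial only) + pointwise 1x1 (channel mixing only)
    return k * k * c_in + c_in * c_out

std = standard_cost(3, 256, 256)         # 589824
sep = separable_cost(3, 256, 256)        # 67840
print(round(std / sep, 1))               # 8.7x fewer multiplications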