Diffusion models learn to generate data by learning the reverse of a destruction process. The forward process is fixed and simple: gradually add Gaussian noise to a real data sample over T steps until it becomes pure noise — indistinguishable from a standard normal distribution. This is not learned; it's a predefined Markov chain.
The reverse process is what gets learned: a neural network (usually a U-Net) trained to predict the noise added at each timestep. Given a noisy image x_t and the timestep t, it predicts the noise ε that was added to get there. Removing a scaled version of that predicted noise (plus, in DDPM, a small amount of fresh noise) gives a slightly cleaner x_{t-1}. Repeat T times starting from pure noise and you generate a new sample.
The connection to physics is literal — this is Brownian motion and its time-reversal. The connection to thermodynamics: the forward process is entropy increase (order → disorder), and the reverse is entropy decrease guided by learned information. The model learns the score function ∇_x log p(x) — the gradient of the log data density — which points toward more probable data configurations.
The forward process adds noise according to a fixed variance schedule β_t. Watch a 2D point distribution evolve from structured data to pure Gaussian noise — then watch the reverse process reconstruct it. The model learns every step of the reverse arrow.
The forward process is a fixed Markov chain that adds Gaussian noise at each step. The variance schedule β_1, ..., β_T controls how quickly the signal is destroyed.
Using the reparameterization trick, we can sample x_t at any timestep directly from x_0 without running all T steps:

x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε

where ᾱ_t = Π_{s=1}^{t}(1−β_s) and ε ~ N(0, I)
This closed-form expression is what makes DDPM tractable: training doesn't require simulating the full chain. Sample a random t, corrupt x_0 directly to get x_t, and train the network to predict ε.
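A minimal NumPy sketch of this closed-form corruption, assuming the linear β schedule from the DDPM paper (the function names are my own, not from any library):

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product ᾱ_t = Π_{s<=t} (1 - β_s)."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Corrupt x0 directly to timestep t (0-indexed) in closed form:
    x_t = sqrt(ᾱ_t) * x0 + sqrt(1 - ᾱ_t) * ε,  ε ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Linear schedule as in the DDPM paper: T = 1000, β from 1e-4 to 0.02.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = make_alpha_bar(betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)            # toy "data" sample
x_t, eps = q_sample(x0, 999, alpha_bar, rng)
# At t = T-1, ᾱ_t is nearly 0, so x_t is essentially pure Gaussian noise.
```

Note that no loop over timesteps appears anywhere: a single multiply-add corrupts x_0 to any t.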
The reverse process is what the model learns: a neural network parameterized by θ approximates the posterior q(x_{t-1}|x_t, x_0), which is tractable in closed form only when x_0 is known.
Ho et al. (2020) showed that predicting the noise ε_θ(x_t, t) is equivalent to, and more stable than, predicting x_0 directly. The predicted mean is then:

μ_θ(x_t, t) = (1/√α_t) · (x_t − (β_t/√(1−ᾱ_t)) · ε_θ(x_t, t)), where α_t = 1−β_t
The training objective simplifies to: minimize E[||ε - ε_θ(√ᾱ_t x_0 + √(1-ᾱ_t)ε, t)||²]. Just predict the noise. That's it.
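The whole training step fits in a few lines. In this sketch the "model" is a placeholder lambda standing in for the U-Net, so the loss value is only illustrative:

```python
import numpy as np

def ddpm_loss(model, x0, t, alpha_bar, rng):
    """Simplified DDPM objective: corrupt x0 to x_t in closed form,
    ask the model for ε̂, and return the mean squared error ||ε - ε̂||²."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = model(x_t, t)
    return np.mean((eps - eps_hat) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

# Placeholder "network": always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
x0 = rng.standard_normal(64)
t = rng.integers(0, 1000)               # sample a random timestep, as in training
loss = ddpm_loss(zero_model, x0, t, alpha_bar, rng)
# Since ε ~ N(0, I), predicting zeros gives a loss near E[ε²] = 1.
```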
Song and Ermon (2019) showed that predicting noise is equivalent to learning the score function, the gradient of the log probability density:

∇_x log p(x_t) = −ε_θ(x_t, t) / √(1−ᾱ_t)

score = direction toward higher probability
The score function points uphill on the probability landscape. Starting from noise and following the score with Langevin dynamics:

x_{i+1} = x_i + (η/2) · ∇_x log p(x_i) + √η · z_i, where z_i ~ N(0, I)

Langevin MCMC converges to samples from p(x).
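A toy illustration of Langevin sampling, using a target whose score is known analytically: a standard normal, where ∇_x log p(x) = −x (no learned model involved):

```python
import numpy as np

def langevin_sample(score, x_init, step=0.01, n_steps=5000, rng=None):
    """Unadjusted Langevin dynamics:
    x_{i+1} = x_i + (step/2) * score(x_i) + sqrt(step) * z,  z ~ N(0, I)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = x_init.copy()
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score(x) + np.sqrt(step) * z
    return x

# Toy target: standard normal, whose score is exactly -x.
score = lambda x: -x
# 10,000 independent chains, all initialized far from the mode at x = 5.
samples = langevin_sample(score, np.full(10_000, 5.0))
# The chains forget the initialization and settle near N(0, 1).
```

Swapping the analytic score for a trained ε_θ (rescaled by −1/√(1−ᾱ_t)) is exactly what score-based generative sampling does.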
This unified view — DDPM noise prediction = score matching — connects diffusion models to a rich body of statistical physics and stochastic differential equation theory.
Song et al. (2021) generalized this to continuous-time SDEs: the forward process is a stochastic differential equation, and the reverse is also an SDE with a drift term involving the score function.
Forward SDE: dx = f(x, t)dt + g(t)·dW
Reverse SDE: dx = [f(x, t) − g(t)²·∇_x log p_t(x)]dt + g(t)·dW̄
Anderson 1982 // Song et al. 2021
Original DDPM requires T=1000 denoising steps at inference — slow. DDIM (Song et al., 2020) showed you can skip most timesteps by reformulating the reverse process as a non-Markovian deterministic process.
DDIM: 50 steps (deterministic)
same quality, ~20x faster
Later samplers (DPM-Solver, DEIS, UniPC) achieve comparable quality in 10-20 steps by treating diffusion as an ODE and using higher-order numerical solvers; modern image-generation pipelines rely on them almost exclusively. Consistency Models (Song et al., 2023) go further and generate in a single step.
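A sketch of one deterministic DDIM update (the η = 0 case), assuming the usual ᾱ notation from above. The oracle noise predictor in the usage below is only there to check the algebra, not a real model:

```python
import numpy as np

def ddim_step(x_t, eps_hat, a_t, a_prev):
    """One deterministic DDIM update from ᾱ_t to ᾱ_{t_prev}:
    first recover the predicted clean sample x̂_0, then jump directly
    to the earlier timestep instead of stepping back one t at a time."""
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar[500]) * x0 + np.sqrt(1 - alpha_bar[500]) * eps
# With an oracle that predicts ε exactly, a single jump to ᾱ = 1 recovers x0.
x0_rec = ddim_step(x_t, eps, alpha_bar[500], 1.0)
```

Because each update can jump across many timesteps, a 50-step schedule simply picks 50 values of ᾱ out of the original 1000.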
Conditioning a diffusion model on text (or class labels) requires teaching it to sample from p(x|c) instead of p(x). Classifier guidance uses a separately trained classifier. Classifier-Free Guidance (Ho & Salimans, 2021) is cleaner: train a single model on both conditioned and unconditional objectives.
ε̃_θ(x_t, c) = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))

w = guidance scale. higher w = more faithful to prompt, less diversity
During training, randomly drop the condition (set to null) with probability p_uncond ~0.1. The model learns both. At inference, extrapolate between conditioned and unconditioned predictions. This is the core technique behind Stable Diffusion, DALL-E 2, Imagen — all of them.
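The inference-time extrapolation is a one-liner. The arrays below are made-up predictions standing in for real model outputs:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction.
    In this parameterization, w = 0 gives the unconditional model and
    w = 1 the plain conditional one; w > 1 pushes harder toward the prompt."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy predictions (two dimensions) from the same network, with and without c.
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
guided = cfg(eps_c, eps_u, 7.5)   # 7.5 is a commonly used guidance scale
```

Note that the guided prediction is no longer a valid noise estimate for any single distribution, which is why very large w produces oversaturated, low-diversity outputs.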
Running diffusion directly in pixel space is expensive: a 512x512 RGB image has 786,432 dimensions, and every denoising step runs a full U-Net pass through that space. Latent Diffusion Models (Rombach et al., 2022) compress the image first.
[Diagram: RGB image → VAE encoder → 4-channel latent → diffusion (U-Net + attention) in latent space → VAE decoder → RGB image]
The VAE is trained separately to compress images into a lower-dimensional latent space that still preserves perceptual detail. Diffusion happens entirely in this compressed space. The VAE decoder reconstructs pixels at the end. This is why Stable Diffusion can run on consumer GPUs: the denoising U-Net operates on 64x64x4 tensors, not 512x512x3.
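The arithmetic behind that claim, using Stable Diffusion's 8x spatial downsampling and 4 latent channels:

```python
# Dimensionality of one denoising step's working space.
pixel_dims = 512 * 512 * 3    # diffusion directly in pixel space
latent_dims = 64 * 64 * 4     # Stable Diffusion's latent: 8x downsampled, 4 channels
ratio = pixel_dims / latent_dims
# Every U-Net pass touches ~48x fewer dimensions in latent space.
```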
learn a vector field, not noise
Lipman et al. (2022). Instead of predicting noise, predict the velocity field that transports noise to data. Straighter trajectories than DDPM → fewer sampling steps needed. Rectified Flow, FLUX, SD3 use this. Theoretically cleaner than DDPM; practically at least as good.
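A sketch of how one training pair is built under the rectified-flow linear path; the function name is illustrative, not from any library:

```python
import numpy as np

def flow_matching_pair(x_data, rng):
    """Rectified-flow style training pair: a point on the straight line
    between a noise sample and a data sample, plus the constant velocity
    that moves along that line. The network v_θ(x_t, t) regresses onto it."""
    noise = rng.standard_normal(x_data.shape)
    t = rng.uniform()                        # t in [0, 1]
    x_t = (1.0 - t) * noise + t * x_data     # linear interpolation
    v_target = x_data - noise                # straight-line velocity
    return x_t, t, v_target

rng = np.random.default_rng(0)
x_data = rng.standard_normal(4)
x_t, t, v = flow_matching_pair(x_data, rng)
# Integrating x_t forward along v for the remaining time 1 - t lands
# exactly on the data point: x_t + (1 - t) * v == x_data.
```

The straightness is the point: because the target trajectory is a line, a well-trained model needs far fewer ODE steps (sometimes one) to traverse it.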
same trajectory point, any t
Song et al. (2023). Train a model that maps any noisy x_t on a trajectory directly to the clean x_0. The consistency condition enforces that all points on the same reverse diffusion path map to the same output. Achieves single-step generation with competitive quality. LCM (Latent Consistency Models) applies this to Stable Diffusion.
Diffusion is a general framework that works on any continuous data. WaveGrad / DiffWave apply it to raw audio waveforms. Point-E and Shap-E generate 3D objects. RFDiffusion (Baker Lab) generates protein backbone structures — used to design novel proteins with no natural analogue. Sora's video model is a diffusion transformer operating on spacetime patches.
GANs dominated 2014-2020: fast inference, sharp images, but training instability and mode collapse. VAEs: stable, interpretable latent space, but blurry. Diffusion: slow inference, but no adversarial training, no mode collapse, and quality that surpassed both. Now dominant for image, video, audio generation.