DIFFUSION
MODELS

GENERATIVE MODELS VIA LEARNED DENOISING DDPM // SCORE MATCHING // LANGEVIN DYNAMICS
FORWARD PROCESS // REVERSE PROCESS // CFG
01 // Core Idea
DESTROY THEN REBUILD

Diffusion models learn to generate data by learning the reverse of a destruction process. The forward process is fixed and simple: gradually add Gaussian noise to a real data sample over T steps until it becomes pure noise — indistinguishable from a standard normal distribution. This is not learned; it's a predefined Markov chain.

The reverse process is what gets learned: a neural network (usually a U-Net) trained to predict the noise added at each timestep. Given a noisy image x_t and the timestep t, it predicts the noise ε that was added to get there. Removing a scaled version of that predicted noise (plus, during sampling, a small amount of fresh noise) gives a slightly cleaner x_{t-1}. Repeat T times starting from pure noise and you generate a new sample.

The connection to physics is literal — this is Brownian motion and its time-reversal. The connection to thermodynamics: the forward process is entropy increase (order → disorder), and the reverse is entropy decrease guided by learned information. The model learns the score function ∇_x log p(x) — the gradient of the log data density — which points toward more probable data configurations.

1000 // typical T timesteps
~1B // U-Net parameters
2020 // DDPM paper
50x // DDIM speedup
02 // Interactive // Forward + Reverse Process
NOISE TRAJECTORY

The forward process adds noise according to a fixed variance schedule β_t. Watch a 2D point distribution evolve from structured data to pure Gaussian noise — then watch the reverse process reconstruct it. The model learns every step of the reverse arrow.

[Interactive demo: slider for timestep t from 0 to T=100; readout of noise level ᾱ_t (fraction of signal retained, 1.000 at t=0); mode toggle between clean data at t=0 and pure noise at t=100]
03 // Forward Process
q(x_t | x_{t-1})

The forward process is a fixed Markov chain that adds Gaussian noise at each step. The variance schedule β_1, ..., β_T controls how quickly the signal is destroyed.

q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I)

Using the reparametrization trick, we can sample x_t at any timestep directly from x_0 without running all T steps:

x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
where ᾱ_t = Π_{s=1}^{t}(1-β_s) and ε ~ N(0,I)

This closed-form expression is what makes DDPM tractable: training doesn't require simulating the full chain. Sample a random t, corrupt x_0 directly to get x_t, and train the network to predict ε.
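As a concrete sketch of the closed-form corruption (numpy; the linear β schedule and T=1000 are illustrative choices matching common DDPM defaults, not anything fixed by the math):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear variance schedule β_1..β_T
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # ᾱ_t = Π_{s≤t} (1 - β_s)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot, without simulating the chain."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))  # toy "image" batch
xt, eps = q_sample(x0, t=T - 1, rng=rng)
# at t = T-1, ᾱ_t ≈ 0, so x_t is essentially pure Gaussian noise
```

Note that `q_sample` returns ε as well: training pairs each corrupted x_t with the exact noise the network must recover.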

Markov chain // Gaussian noise // closed form
04 // Reverse Process
p_θ(x_{t-1} | x_t)

The reverse process is what the model learns — predicting the true posterior q(x_{t-1}|x_t, x_0) using a neural network parameterized by θ.

p_θ(x_{t-1}|x_t) = N(x_{t-1}; μ_θ(x_t,t), Σ_θ(x_t,t))

Ho et al. (2020) showed that predicting the noise ε_θ(x_t, t) is a reparametrization of predicting x_0 directly, and that it trains more stably in practice. The predicted mean is then:

μ_θ = (1/√α_t)(x_t - β_t/√(1-ᾱ_t) · ε_θ(x_t,t))

The training objective simplifies to: minimize E[||ε - ε_θ(√ᾱ_t x_0 + √(1-ᾱ_t)ε, t)||²]. Just predict the noise. That's it.
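One training step under this simplified objective fits in a few lines (numpy sketch; `eps_theta` is a zero placeholder standing in for a real U-Net, so everything network-related here is an assumption for illustration):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(xt, t):
    # placeholder "network"; a real model is a U-Net conditioned on t
    return np.zeros_like(xt)

def ddpm_loss(x0, rng):
    """One Monte Carlo estimate of E[||ε - ε_θ(√ᾱ_t x0 + √(1-ᾱ_t) ε, t)||²]."""
    t = rng.integers(0, T)                 # sample a random timestep
    eps = rng.standard_normal(x0.shape)    # the noise the network must recover
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(xt, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))           # toy 2-D data batch
loss = ddpm_loss(x0, rng)
```

Training is just this loop repeated: sample t, corrupt, regress on the noise.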

noise prediction // U-Net // DDPM 2020
05 // Score Function // The Deep Connection
SCORE MATCHING + LANGEVIN

Song and Ermon (2019) showed that denoising a sample amounts, up to a known scale factor, to learning the score function — the gradient of the log probability density:

s_θ(x,t) = -ε_θ(x,t)/√(1-ᾱ_t) ≈ ∇_x log p_t(x)
score = direction toward higher probability

The score function points uphill on the probability landscape. Starting from noise and following the score with Langevin dynamics:

x_{i+1} = x_i + δ·s_θ(x_i) + √(2δ)·ε
Langevin MCMC — converges to samples from p(x)

This unified view — DDPM noise prediction = score matching — connects diffusion models to a rich body of statistical physics and stochastic differential equation theory.
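Langevin sampling becomes concrete when the score is known exactly: for a standard normal target, ∇_x log p(x) = -x. The chain below uses that analytic score; the step size δ and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def score(x):
    # exact score of N(0, 1): ∇_x log p(x) = -x
    return -x

rng = np.random.default_rng(0)
delta = 0.01                            # step size δ
x = rng.uniform(-10, 10, size=20000)    # start far from the target density

for _ in range(2000):
    noise = rng.standard_normal(x.shape)
    x = x + delta * score(x) + np.sqrt(2 * delta) * noise

# the chain forgets its initialization and samples approximate N(0, 1)
```

Swapping the analytic `score` for a learned s_θ(x, t) gives exactly the sampler described above.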

Song et al. (2021) generalized this to continuous-time SDEs: the forward process is a stochastic differential equation, and the reverse is also an SDE with a drift term involving the score function.

Forward SDE: dx = f(x,t)·dt + g(t)·dW
Reverse SDE: dx = [f(x,t) - g(t)²·∇_x log p_t(x)]·dt + g(t)·dW̄
Anderson 1982 // Song et al. 2021
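A sketch of integrating the reverse SDE with Euler-Maruyama, under toy assumptions (f = 0, constant g, 1-D Gaussian data) chosen so the score of p_t is available in closed form rather than learned:

```python
import numpy as np

# Toy forward SDE: dx = g·dW. If the data are N(0, s0²), then
# p_t = N(0, s0² + g²t) and the score has a closed form.
s0, g, T = 0.5, 1.0, 3.0

def score(x, t):
    var = s0**2 + g**2 * t     # variance of p_t
    return -x / var            # ∇_x log N(0, var)

rng = np.random.default_rng(0)
n_steps = 1000
dt = T / n_steps
# start from the terminal distribution p_T
x = rng.standard_normal(20000) * np.sqrt(s0**2 + g**2 * T)

# integrate dx = [f - g²·score]·dt + g·dW̄ backward from t = T to t = 0
for k in range(n_steps, 0, -1):
    t = k * dt
    z = rng.standard_normal(x.shape)
    x = x + g**2 * score(x, t) * dt + g * np.sqrt(dt) * z

# x is now distributed approximately as the data, N(0, s0²)
```

With a learned score network in place of the analytic `score`, this loop is the generic score-based sampler of Song et al. (2021).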
score matching // Langevin // SDE // Song 2021
06 // Sampling Speed
DDIM + FAST SAMPLERS

Original DDPM requires T=1000 denoising steps at inference — slow. DDIM (Song et al., 2020) showed you can skip most timesteps by reformulating the generative process as non-Markovian, with reverse updates that can be made fully deterministic.

DDPM: 1000 steps (stochastic)
DDIM: 50 steps (deterministic)
same quality, ~20x faster

Later samplers (DPM-Solver, DEIS, UniPC) achieve comparable quality in 10-20 steps by treating diffusion as an ODE and using higher-order numerical solvers. Modern image generation pipelines rely on these almost exclusively. Consistency Models (Song et al., 2023) can generate in a single step.
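The DDIM update itself is short. The sketch below (numpy, linear schedule assumed; indexing conventions vary between implementations) uses an oracle ε in place of a trained network to show the key property: the deterministic update can take large strides through the schedule without error accumulating:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(xt, eps_hat, t, t_prev):
    """One deterministic DDIM (η = 0) update from timestep t to t_prev."""
    # first recover the model's current estimate of the clean sample
    x0_hat = (xt - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # then re-noise it to the (possibly much earlier) target timestep
    return (np.sqrt(alpha_bar[t_prev]) * x0_hat
            + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
t = T - 1
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# with an oracle ε̂ = ε, jumping 999 → 500 → 0 recovers x0 (up to the
# tiny residual noise at t = 0); a trained network takes similar strides
x_mid = ddim_step(xt, eps, 999, 500)
x_out = ddim_step(x_mid, eps, 500, 0)
```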

DDIM // DPM-Solver // 1-step possible
07 // Conditioning
CLASSIFIER-FREE GUIDANCE

Conditioning a diffusion model on text (or class labels) means sampling from p(x|c) instead of p(x). Classifier guidance steers sampling with the gradient of a separately trained classifier. Classifier-Free Guidance (Ho & Salimans, 2021) is cleaner: train a single model on both conditional and unconditional objectives.

ε̂ = ε_uncond + w · (ε_cond - ε_uncond)
w = guidance scale. higher w = more faithful to prompt, less diversity

During training, randomly drop the condition (set it to null) with probability p_uncond ≈ 0.1, so the model learns both predictions. At inference, extrapolate along the direction from the unconditional prediction to the conditional one. This is the core technique behind Stable Diffusion, DALL-E 2, Imagen — all of them.
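The guidance combination is a single line; the toy arrays below just illustrate the extrapolation:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: ε̂ = ε_uncond + w·(ε_cond - ε_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])    # toy unconditional noise prediction
eps_c = np.array([1.0, -1.0])   # toy conditional noise prediction

# w = 1 recovers the plain conditional prediction; w > 1 amplifies the
# direction the condition pushes in (more prompt-faithful, less diverse)
print(cfg_combine(eps_u, eps_c, 1.0))   # [ 1. -1.]
print(cfg_combine(eps_u, eps_c, 7.5))   # [ 7.5 -7.5]
```

In a real sampler both predictions come from one network, called twice per step: once with the condition and once with it nulled out.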

CFG // guidance scale // SD / DALL-E 2
08 // Latent Diffusion
STABLE DIFFUSION ARCHITECTURE

Running diffusion directly in pixel space is expensive: a 512x512 RGB image has 786,432 dimensions, and every denoising step runs a full U-Net pass through that space. Latent Diffusion Models (Rombach et al., 2022) compress the image first.

Input: 512x512 RGB (786K dims) → VAE Encoder: 64x64 x 4ch (16K dims // 48x smaller) → Diffusion in latent space: U-Net + Attention (text conditioning)

The VAE is trained separately to compress images into a lower-dimensional latent space that still preserves perceptual detail. Diffusion happens entirely in this compressed space. The VAE decoder reconstructs pixels at the end. This is why Stable Diffusion can run on consumer GPUs: the denoising U-Net operates on 64x64x4 tensors, not 512x512x3.
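The dimension counts in this section are plain arithmetic:

```python
pixel_dims = 512 * 512 * 3        # 512x512 RGB pixel space
latent_dims = 64 * 64 * 4         # Stable Diffusion's VAE latent: 64x64, 4 channels

print(pixel_dims)                 # 786432
print(latent_dims)                # 16384
print(pixel_dims // latent_dims)  # 48, the compression factor quoted above
```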

LDM // VAE // U-Net // Rombach 2022 // 48x compression
09 // Variants
FLOW MATCHING
dx/dt = v_θ(x,t)
learn a vector field, not noise

Lipman et al. (2022). Instead of predicting noise, predict the velocity field that transports noise to data. Straighter trajectories than DDPM → fewer sampling steps needed. Rectified Flow, FLUX, SD3 use this. Theoretically cleaner than DDPM; practically at least as good.
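A sketch of the conditional flow-matching loss on a linear path (this follows the rectified-flow construction; `v_theta` is a placeholder for a real network, and the t=0 noise / t=1 data convention is one of several in the literature):

```python
import numpy as np

def fm_loss(x_data, v_theta, rng):
    """Conditional flow-matching loss on a linear (rectified-flow) path.

    Path: x_t = (1 - t)·z + t·x, with noise z at t = 0 and data x at t = 1.
    Target velocity along the path: dx_t/dt = x - z.
    """
    t = rng.uniform(size=(x_data.shape[0], 1))   # random time per sample
    z = rng.standard_normal(x_data.shape)        # the noise endpoint
    xt = (1 - t) * z + t * x_data                # point on the straight path
    target_v = x_data - z                        # velocity the net must match
    return np.mean((v_theta(xt, t) - target_v) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2))                  # toy 2-D data batch

# placeholder model: the loss is finite and minimized only when v_theta
# approximates the (marginal) velocity field transporting noise to data
loss = fm_loss(x, lambda xt, t: np.zeros_like(xt), rng)
```

Because the paths are straight by construction, an ODE solver can follow the learned field in far fewer steps than a DDPM chain.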

SD3 // FLUX // ODE-based
10 // Variants
CONSISTENCY MODELS
f_θ(x_t, t) = f_θ(x_t', t')
same trajectory point, any t

Song et al. (2023). Train a model that maps any noisy x_t on a trajectory directly to the clean x_0. The consistency condition enforces that all points on the same reverse diffusion path map to the same output. Achieves single-step generation with competitive quality. LCM (Latent Consistency Models) applies this to Stable Diffusion.

1-step // LCM // Song 2023
11 // Beyond Images
AUDIO + 3D + PROTEIN

Diffusion is a general framework that works on any continuous data. WaveGrad / DiffWave apply it to raw audio waveforms. Point-E and Shap-E generate 3D objects. RFDiffusion (Baker Lab) generates protein backbone structures — used to design novel proteins with no natural analogue. Sora's video model is a diffusion transformer operating on spacetime patches.

WaveGrad // RFDiffusion // Sora
12 // Timeline
HISTORY
2015
Sohl-Dickstein et al.: Deep Unsupervised Learning using Nonequilibrium Thermodynamics. The original diffusion model paper — borrowed directly from statistical mechanics. Ignored for 5 years.
2019
Song + Ermon — Score Matching: Generative Modeling by Estimating Gradients of the Data Distribution. Langevin dynamics sampling. Connected diffusion to score functions. Set the theoretical stage.
2020
Ho et al. — DDPM: Denoising Diffusion Probabilistic Models. Simplified training objective (predict noise); image quality competitive with the best GANs. The breakthrough that made everyone pay attention.
2022
Rombach et al. — LDM: High-Resolution Image Synthesis with Latent Diffusion Models. Stable Diffusion. Made it fast enough to run on consumer hardware. The model that changed the internet.
13 // vs. GANs + VAEs
GENERATIVE MODEL LANDSCAPE
Training stability // Diffusion: excellent
Sample quality // Diffusion: best
Sample speed // Diffusion: slow
Diversity // Diffusion: high
Mode coverage // GAN: mode collapse risk
Latent space // VAE: cleanest
Conditioning // Diffusion: CFG

GANs dominated 2014-2020: fast inference, sharp images, but training instability and mode collapse. VAEs: stable, interpretable latent space, but blurry. Diffusion: slow inference, but no adversarial training, no mode collapse, and quality that surpassed both. Now dominant for image, video, audio generation.

replaced GANs // 2022 onward