Diffusion models learn to generate data by learning the reverse of a destruction process. The forward process is fixed and simple: gradually add Gaussian noise to a real data sample over T steps until it becomes pure noise — indistinguishable from a standard normal distribution. This is not learned; it's a predefined Markov chain.
The reverse process is what gets learned: a neural network (usually a U-Net) trained to predict the noise added at each timestep. Given a noisy image x_t and the timestep t, it predicts the noise ε that was added to get there. Removing a scaled version of that predicted noise (plus, in DDPM, a small amount of fresh noise) gives a slightly cleaner x_{t-1}. Repeat T times starting from pure noise and you generate a new sample.
The connection to physics is literal — this is Brownian motion and its time-reversal. The connection to thermodynamics: the forward process is entropy increase (order → disorder), and the reverse is entropy decrease guided by learned information. The model learns the score function ∇_x log p(x) — the gradient of the log data density — which points toward more probable data configurations.
The forward process adds noise according to a fixed variance schedule β_t. Watch a 2D point distribution evolve from structured data to pure Gaussian noise — then watch the reverse process reconstruct it. The model learns every step of the reverse arrow.
The forward process is a fixed Markov chain that adds Gaussian noise at each step. The variance schedule β_1, ..., β_T controls how quickly the signal is destroyed.
Using the reparameterization trick, we can sample x_t at any timestep directly from x_0 without running all T steps:

x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε

where ᾱ_t = Π_{s=1}^{t}(1−β_s) and ε ~ N(0, I)
This closed-form expression is what makes DDPM tractable: training doesn't require simulating the full chain. Sample a random t, corrupt x_0 directly to get x_t, and train the network to predict ε.
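A minimal NumPy sketch of this closed-form corruption, assuming the linear β schedule from the DDPM paper (the function names are my own, not from any library):

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product ᾱ_t = Π_{s<=t} (1 - β_s)."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Corrupt x0 directly to timestep t (0-indexed) in closed form:
    x_t = sqrt(ᾱ_t) * x0 + sqrt(1 - ᾱ_t) * ε,  ε ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Linear schedule as in the DDPM paper: T = 1000, β from 1e-4 to 0.02.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = make_alpha_bar(betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)            # toy "data" sample
x_t, eps = q_sample(x0, 999, alpha_bar, rng)
# At t = T-1, ᾱ_t is nearly 0, so x_t is essentially pure Gaussian noise.
```

Note that no loop over timesteps appears anywhere: a single multiply-add corrupts x_0 to any t.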
The reverse process is what the model learns: a neural network parameterized by θ approximates the posterior q(x_{t-1}|x_t, x_0), which is tractable in closed form only when x_0 is known.
Ho et al. (2020) showed that predicting the noise ε_θ(x_t, t) is equivalent to, and more stable than, predicting x_0 directly. The predicted mean is then:

μ_θ(x_t, t) = (1/√α_t) · (x_t − (β_t/√(1−ᾱ_t)) · ε_θ(x_t, t)), where α_t = 1−β_t
The training objective simplifies to: minimize E[||ε - ε_θ(√ᾱ_t x_0 + √(1-ᾱ_t)ε, t)||²]. Just predict the noise. That's it.
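The whole training step fits in a few lines. In this sketch the "model" is a placeholder lambda standing in for the U-Net, so the loss value is only illustrative:

```python
import numpy as np

def ddpm_loss(model, x0, t, alpha_bar, rng):
    """Simplified DDPM objective: corrupt x0 to x_t in closed form,
    ask the model for ε̂, and return the mean squared error ||ε - ε̂||²."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = model(x_t, t)
    return np.mean((eps - eps_hat) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

# Placeholder "network": always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
x0 = rng.standard_normal(64)
t = rng.integers(0, 1000)               # sample a random timestep, as in training
loss = ddpm_loss(zero_model, x0, t, alpha_bar, rng)
# Since ε ~ N(0, I), predicting zeros gives a loss near E[ε²] = 1.
```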
Song and Ermon (2019) showed that predicting noise is equivalent to learning the score function, the gradient of the log probability density:

∇_x log p(x_t) = −ε_θ(x_t, t) / √(1−ᾱ_t)

score = direction toward higher probability
The score function points uphill on the probability landscape. Starting from noise and following the score with Langevin dynamics:

x_{i+1} = x_i + (η/2) · ∇_x log p(x_i) + √η · z_i, where z_i ~ N(0, I)

Langevin MCMC converges to samples from p(x).
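A toy illustration of Langevin sampling, using a target whose score is known analytically: a standard normal, where ∇_x log p(x) = −x (no learned model involved):

```python
import numpy as np

def langevin_sample(score, x_init, step=0.01, n_steps=5000, rng=None):
    """Unadjusted Langevin dynamics:
    x_{i+1} = x_i + (step/2) * score(x_i) + sqrt(step) * z,  z ~ N(0, I)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = x_init.copy()
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score(x) + np.sqrt(step) * z
    return x

# Toy target: standard normal, whose score is exactly -x.
score = lambda x: -x
# 10,000 independent chains, all initialized far from the mode at x = 5.
samples = langevin_sample(score, np.full(10_000, 5.0))
# The chains forget the initialization and settle near N(0, 1).
```

Swapping the analytic score for a trained ε_θ (rescaled by −1/√(1−ᾱ_t)) is exactly what score-based generative sampling does.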
This unified view — DDPM noise prediction = score matching — connects diffusion models to a rich body of statistical physics and stochastic differential equation theory.
Song et al. (2021) generalized this to continuous-time SDEs: the forward process is a stochastic differential equation, and the reverse is also an SDE with a drift term involving the score function.
Forward SDE: dx = f(x, t)dt + g(t)·dW
Reverse SDE: dx = [f(x, t) − g(t)²·∇_x log p_t(x)]dt + g(t)·dW̄
Anderson 1982 // Song et al. 2021
Original DDPM requires T=1000 denoising steps at inference — slow. DDIM (Song et al., 2020) showed you can skip most timesteps by reformulating the reverse process as a non-Markovian deterministic process.
DDIM: 50 steps (deterministic)
same quality, ~20x faster
Later samplers (DPM-Solver, DEIS, UniPC) achieve comparable quality in 10-20 steps by treating diffusion as an ODE and using higher-order numerical solvers; modern image-generation pipelines rely on them almost exclusively. Consistency Models (Song et al., 2023) go further and generate in a single step.
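A sketch of one deterministic DDIM update (the η = 0 case), assuming the usual ᾱ notation from above. The oracle noise predictor in the usage below is only there to check the algebra, not a real model:

```python
import numpy as np

def ddim_step(x_t, eps_hat, a_t, a_prev):
    """One deterministic DDIM update from ᾱ_t to ᾱ_{t_prev}:
    first recover the predicted clean sample x̂_0, then jump directly
    to the earlier timestep instead of stepping back one t at a time."""
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar[500]) * x0 + np.sqrt(1 - alpha_bar[500]) * eps
# With an oracle that predicts ε exactly, a single jump to ᾱ = 1 recovers x0.
x0_rec = ddim_step(x_t, eps, alpha_bar[500], 1.0)
```

Because each update can jump across many timesteps, a 50-step schedule simply picks 50 values of ᾱ out of the original 1000.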
Conditioning a diffusion model on text (or class labels) requires teaching it to sample from p(x|c) instead of p(x). Classifier guidance uses a separately trained classifier. Classifier-Free Guidance (Ho & Salimans, 2021) is cleaner: train a single model on both conditioned and unconditional objectives.
ε̃_θ(x_t, c) = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) − ε_θ(x_t, ∅))

w = guidance scale. higher w = more faithful to prompt, less diversity
During training, randomly drop the condition (set to null) with probability p_uncond ~0.1. The model learns both. At inference, extrapolate between conditioned and unconditioned predictions. This is the core technique behind Stable Diffusion, DALL-E 2, Imagen — all of them.
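The inference-time extrapolation is a one-liner. The arrays below are made-up predictions standing in for real model outputs:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction.
    In this parameterization, w = 0 gives the unconditional model and
    w = 1 the plain conditional one; w > 1 pushes harder toward the prompt."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy predictions (two dimensions) from the same network, with and without c.
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
guided = cfg(eps_c, eps_u, 7.5)   # 7.5 is a commonly used guidance scale
```

Note that the guided prediction is no longer a valid noise estimate for any single distribution, which is why very large w produces oversaturated, low-diversity outputs.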
Running diffusion directly in pixel space is expensive: a 512x512 RGB image has 786,432 dimensions, and every denoising step runs a full U-Net pass through that space. Latent Diffusion Models (Rombach et al., 2022) compress the image first.
[Diagram: RGB image → VAE encoder → 4-channel latent → diffusion (U-Net + attention) in latent space → VAE decoder → RGB image]
The VAE is trained separately to compress images into a lower-dimensional latent space that still preserves perceptual detail. Diffusion happens entirely in this compressed space. The VAE decoder reconstructs pixels at the end. This is why Stable Diffusion can run on consumer GPUs: the denoising U-Net operates on 64x64x4 tensors, not 512x512x3.
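The arithmetic behind that claim, using Stable Diffusion's 8x spatial downsampling and 4 latent channels:

```python
# Dimensionality of one denoising step's working space.
pixel_dims = 512 * 512 * 3    # diffusion directly in pixel space
latent_dims = 64 * 64 * 4     # Stable Diffusion's latent: 8x downsampled, 4 channels
ratio = pixel_dims / latent_dims
# Every U-Net pass touches ~48x fewer dimensions in latent space.
```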
learn a vector field, not noise
Lipman et al. (2022). Instead of predicting noise, predict the velocity field that transports noise to data. Straighter trajectories than DDPM → fewer sampling steps needed. Rectified Flow, FLUX, SD3 use this. Theoretically cleaner than DDPM; practically at least as good.
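A sketch of how one training pair is built under the rectified-flow linear path; the function name is illustrative, not from any library:

```python
import numpy as np

def flow_matching_pair(x_data, rng):
    """Rectified-flow style training pair: a point on the straight line
    between a noise sample and a data sample, plus the constant velocity
    that moves along that line. The network v_θ(x_t, t) regresses onto it."""
    noise = rng.standard_normal(x_data.shape)
    t = rng.uniform()                        # t in [0, 1]
    x_t = (1.0 - t) * noise + t * x_data     # linear interpolation
    v_target = x_data - noise                # straight-line velocity
    return x_t, t, v_target

rng = np.random.default_rng(0)
x_data = rng.standard_normal(4)
x_t, t, v = flow_matching_pair(x_data, rng)
# Integrating x_t forward along v for the remaining time 1 - t lands
# exactly on the data point: x_t + (1 - t) * v == x_data.
```

The straightness is the point: because the target trajectory is a line, a well-trained model needs far fewer ODE steps (sometimes one) to traverse it.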
same trajectory point, any t
Song et al. (2023). Train a model that maps any noisy x_t on a trajectory directly to the clean x_0. The consistency condition enforces that all points on the same reverse diffusion path map to the same output. Achieves single-step generation with competitive quality. LCM (Latent Consistency Models) applies this to Stable Diffusion.
Diffusion is a general framework that works on any continuous data. WaveGrad / DiffWave apply it to raw audio waveforms. Point-E and Shap-E generate 3D objects. RFDiffusion (Baker Lab) generates protein backbone structures — used to design novel proteins with no natural analogue. Sora's video model is a diffusion transformer operating on spacetime patches.
GANs dominated 2014-2020: fast inference, sharp images, but training instability and mode collapse. VAEs: stable, interpretable latent space, but blurry. Diffusion: slow inference, but no adversarial training, no mode collapse, and quality that surpassed both. Now dominant for image, video, audio generation.