Diffusion Models Notes

Diffusion Models (Part 1)

“The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.”
— Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Figure: Overview of the forward and reverse diffusion process. Source: NVIDIA Developer Blog

Motivation

We are currently in the context of image-generation models. Previously, we have seen VAEs and GANs. These models are stepping stones to diffusion models, which are now the state-of-the-art for high-quality, diverse image generation.

First came VAEs (Variational Autoencoders), which learn a probabilistic mapping between input data and a latent representation using an encoder-decoder structure. They are effective because they allow for smooth latent space interpolation and tractable inference using variational methods. However, their main drawback is that the generated images tend to be blurry, as the decoder often optimizes for pixel-wise reconstruction loss (typically L2), which penalizes sharp details. In short, VAEs struggle to produce high-fidelity images.

Then came GANs (Generative Adversarial Networks), which pit a generator network against a discriminator in a minimax game. A generator tries to create images to fool a discriminator and gets better at this objective throughout training. GANs are capable of generating sharp, realistic images and have had a huge impact on image synthesis. They are effective because the adversarial loss encourages outputs that are indistinguishable from real data. However, GANs are notoriously difficult to train due to instability, mode collapse (where the generator produces limited diversity), and sensitivity to hyperparameters. This means GANs can fail to represent the diversity of a dataset.

Then Diffusion Models emerged, gaining traction with the introduction of Denoising Diffusion Probabilistic Models (DDPMs), first proposed by Ho et al. in their 2020 paper “Denoising Diffusion Probabilistic Models.” These models work by learning to reverse a gradual noising process applied to data, effectively training a model to denoise inputs over many time steps.

Diffusion models are powerful because they combine the stability of likelihood-based models like VAEs with the high sample quality of GANs. They are trained using a simple denoising loss and are robust to mode collapse. Their main drawback is slow sampling, since generating an image requires hundreds to thousands of denoising steps. Newer samplers such as DDIM aim to speed this up by using far fewer steps.

High-Level Overview

A diffusion model consists of two main components: the forward process and the reverse process.

In the forward process, we gradually add noise to an image over several steps, eventually transforming it into pure noise.
In the reverse process, a neural network is trained to systematically denoise this noisy image, step-by-step, until it reconstructs a high-quality, realistic image.

This iterative approach enables the model to learn a powerful generative mapping from noise to data: by learning denoising patterns, it can turn Gaussian noise into a realistic image.

Forward Process

In the forward (diffusion) process we corrupt a clean sample $x_0$ by adding Gaussian noise over $T$ timesteps, producing a sequence $\{x_t\}_{t=1}^{T}$.

One‑step transition

For each timestep $t = 1,\dots,T$ we draw fresh noise $\varepsilon_t \sim N(0, I)$; the resulting transition distribution is

$$ q(x_t \mid x_{t-1}) \;=\; N\!\Bigl(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\; \beta_t I\Bigr), $$

where $\beta_t \in (0,1)$ is the variance schedule.
Define

$$ \alpha_t \;=\; 1-\beta_t \quad\text{and}\quad \bar\alpha_t \;=\; \prod_{s=1}^{t}\alpha_s . $$

Using the re‑parameterisation

$$ x_t \;=\; \sqrt{\alpha_t}\,x_{t-1} \;+\; \sqrt{1-\alpha_t}\,\varepsilon_t, $$

we see that each step injects additional noise while shrinking the signal.

Closed‑form sample from $x_0 \rightarrow x_t$

Recursively expanding the expression above gives

$$ \begin{aligned} x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon_t \\[4pt] &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t}\,\varepsilon_t + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\varepsilon_{t-1} \\[-2pt] &\;\vdots \\[2pt] &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon, \end{aligned} $$

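where $\varepsilon \sim N(0,I)$ absorbs the linear combination of all previous $\varepsilon_s$: a sum of independent zero-mean Gaussians is again Gaussian with the variances added, e.g. $(1-\alpha_t) + \alpha_t(1-\alpha_{t-1}) = 1 - \alpha_t\alpha_{t-1}$ after two steps.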

Hence

$$ q(x_t \mid x_0) \;=\; N\!\bigl( x_t;\, \sqrt{\bar\alpha_t}\,x_0,\, (1-\bar\alpha_t) I \bigr). $$

This closed‑form expression lets us sample any noisy timestep in one shot, avoiding an explicit $1 \to 2 \to \cdots \to t$ simulation.
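As a rough numerical check of the closed form, the sketch below simulates the chain step by step and compares it with the one-shot sample. The schedule values, the scalar "image" of value 2.0, and the function names `forward_step` and `q_sample` are assumptions made for this illustration only.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One-step transition q(x_t | x_{t-1})."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

def q_sample(x0, t, alpha_bars, rng):
    """Closed-form sample from q(x_t | x_0); t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # assumed schedule for the demo
alpha_bars = np.cumprod(1.0 - betas)

x0 = np.full(50_000, 2.0)                  # many copies of the same scalar "image"
t = 300

# Route 1: simulate the chain 1 -> 2 -> ... -> t.
x_iter = x0.copy()
for s in range(t):
    x_iter = forward_step(x_iter, betas[s], rng)

# Route 2: jump straight to timestep t.
x_direct = q_sample(x0, t, alpha_bars, rng)

# Both routes should agree with the analytic mean and variance.
print("mean:", x_iter.mean(), x_direct.mean(), np.sqrt(alpha_bars[t - 1]) * 2.0)
print("var: ", x_iter.var(), x_direct.var(), 1.0 - alpha_bars[t - 1])
```

The two routes differ only by sampling error, which is why training can draw a random timestep per example instead of simulating the whole chain.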

Practical note
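The schedule $\{\beta_t\}$ keeps each $\beta_t \ll 1$ so that every forward step is a small perturbation and the reverse transitions are well approximated by Gaussians. Because the signal is scaled by $\sqrt{1-\beta_t}$ at each step, the variance of $x_t$ stays bounded rather than growing. Common choices are a linear or cosine schedule (Ho et al., 2020; Nichol & Dhariwal, 2021).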

Derivation adapted from
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” 2020.

Choosing the noise schedule

Linear schedule (original DDPM):
In the original DDPM, the noise schedule is chosen so that $\beta_t$ increases linearly over time. Since
$$ \alpha_t = 1 - \beta_t, $$
$\alpha_t$ decreases as $t$ increases. Intuitively, each forward step keeps slightly less of the original image signal and adds slightly more Gaussian noise. Early timesteps only mildly corrupt the image, while later timesteps push $x_t$ closer to pure noise.
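A minimal sketch of the linear schedule, assuming the endpoints reported by Ho et al. (2020): $\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$. The function name `linear_beta_schedule` is chosen here for illustration.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """beta_t increasing linearly from beta_start to beta_end over T steps."""
    return np.linspace(beta_start, beta_end, T)

betas = linear_beta_schedule(1000)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # bar(alpha)_t = prod_{s<=t} alpha_s

# sqrt(alpha_bar) is the fraction of the original signal left at step t.
for t in (1, 250, 500, 750, 1000):
    print(t, betas[t - 1], np.sqrt(alpha_bars[t - 1]))
```

Printing $\sqrt{\bar\alpha_t}$ shows how the signal fades: close to 1 at early steps and close to 0 at $t = T$, where $x_T$ is nearly pure noise.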

Diffusion Models: Reverse Process

What is the Reverse Process?

The reverse process is the generative part of diffusion models.

It learns to undo the noise added during the forward (noising) process.
Goal: Generate a sample that resembles the original data by reversing the Markov chain.

Idea

We destroy data with Gaussian noise in the forward process.
Now, we want to learn how to denoise step-by-step to recover the original image.
Since the forward process defines a Markov chain $q(x_{1:T} \mid x_0)$, we model the reverse process as a parameterized Markov chain that runs in the opposite direction:

$$ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) $$

where $p(x_T) = N(x_T; 0, I)$ and each transition is modeled as a Gaussian:

$$ p_\theta(x_{t-1} \mid x_t) = N\left(x_{t-1}; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t) \right) $$

Mathematical Formulation

(Recall) forward process:

$$ q(x_t \mid x_{t-1}) = N(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) $$

Ideal reverse process:

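$$ q(x_{t-1} \mid x_t, x_0) = N\bigl(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\bigr) $$

with the closed-form mean and variance (Ho et al., 2020)

$$ \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t, \qquad \tilde{\beta}_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t . $$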
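The posterior above can be evaluated directly when $x_0$ is known. The sketch below uses the closed-form coefficients from Ho et al. (2020); the function name `q_posterior`, the convention $\bar\alpha_0 = 1$, and the linear schedule are assumptions for this example.

```python
import numpy as np

def q_posterior(x0, x_t, t, betas, alpha_bars):
    """Mean and (scalar) variance of q(x_{t-1} | x_t, x_0); t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0   # convention: alpha_bar_0 = 1
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)      # assumed schedule
alpha_bars = np.cumprod(1.0 - betas)

x0 = np.array([0.5, -1.0, 2.0])
x_t = np.array([0.1, 0.3, -0.2])

mean_1, var_1 = q_posterior(x0, x_t, 1, betas, alpha_bars)
mean_500, var_500 = q_posterior(x0, x_t, 500, betas, alpha_bars)
print(mean_1, var_1)        # at t = 1 the posterior collapses onto x_0
print(mean_500, var_500)    # posterior variance is smaller than beta_500
```

At $t = 1$ the posterior is a point mass at $x_0$, and for larger $t$ its variance is always below $\beta_t$, since knowing $x_0$ removes uncertainty.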

At inference time, we don’t have access to $x_0$, so we:
- Train a model $\epsilon_\theta(x_t, t)$ to predict the noise.

Learning the Reverse

We use the reparameterization:

$$ x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon $$

Then optimize this loss (a simplified form of the ELBO):

$$ L = \mathbb{E}_{x_0, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] $$
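A minimal sketch of a Monte Carlo estimate of this loss for a batch of image-shaped arrays. Here `eps_model` is a hypothetical callable standing in for the network $\epsilon_\theta$, and the loss is averaged over pixels (a constant factor away from the squared norm in the formula).

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One-batch estimate of E[ ||eps - eps_model(x_t, t)||^2 ], averaged over pixels."""
    n = x0.shape[0]
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=n)                  # one timestep per sample
    eps = rng.standard_normal(x0.shape)                 # target noise
    # Reshape alpha_bar_t to (n, 1, 1, ...) so it broadcasts over each image.
    a_bar = alpha_bars[t - 1].reshape(n, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(64, 3, 8, 8))                # stand-in image batch

# Hypothetical baseline "network" that always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = simple_loss(zero_model, x0, alpha_bars, rng)
print(loss)
```

A predictor that always outputs zero scores a loss near 1, the variance of the noise, so a useful network has to do better than that.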

Training Objective via Variational Inference

We want to maximize the data log-likelihood:

$$ \log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T} $$

This is intractable, so we use variational inference to derive a lower bound using the forward process $q(x_{1:T} \mid x_0)$ as the variational distribution. Applying Jensen’s inequality gives the Evidence Lower Bound (ELBO):

$$ \log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)} \right] =: \mathcal{L}_{\text{ELBO}} $$

Using the Markov factorization:

$$ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q} \left[ \log p(x_T) + \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} \right] $$
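A sketch of the step that turns these log-ratios into KL divergences, following Ho et al. (2020). For $t \ge 2$, the Markov property and Bayes' rule give

$$ q(x_t \mid x_{t-1}) \;=\; q(x_t \mid x_{t-1}, x_0) \;=\; \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} . $$

Substituting into the sum, the ratios $q(x_t \mid x_0)/q(x_{t-1} \mid x_0)$ telescope:

$$ \begin{aligned} \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} &= \log \frac{p_\theta(x_0 \mid x_1)}{q(x_1 \mid x_0)} + \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)} + \log \frac{q(x_1 \mid x_0)}{q(x_T \mid x_0)} \\[4pt] &= \log p_\theta(x_0 \mid x_1) + \sum_{t=2}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_{t-1} \mid x_t, x_0)} - \log q(x_T \mid x_0) . \end{aligned} $$

Adding $\log p(x_T)$ and taking the expectation under $q$, each remaining log-ratio of a model density to a $q$ density becomes a negative KL divergence.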

Rewriting each forward transition with Bayes' rule, conditioned on $x_0$ (Ho et al., 2020), leads to:

$$ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q} \left[ \underbrace{\log p_\theta(x_0 \mid x_1)}_{\text{reconstruction term}} - \underbrace{D_{\text{KL}}(q(x_T \mid x_0) \parallel p(x_T))}_{\text{prior term}} - \sum_{t=2}^{T} \underbrace{D_{\text{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t))}_{\text{reverse KL terms}} \right] $$

The KL terms compare the true reverse posterior with the learned reverse model.

Architecture

Commonly used model: U-Net (more on this in next set of notes)
Captures both local and global information
Time conditioning via sinusoidal positional embeddings
Often includes residual connections and self-attention layers

Figure: U-Net architecture. Source: CS 4782 Class Notes 2025
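Time conditioning can be sketched as follows. This uses the transformer-style sinusoidal formula; the function name `timestep_embedding`, the `max_period` value of 10000, and the sin-then-cos layout are assumptions, and real implementations differ in these details.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (n,) to sinusoidal features of shape (n, dim).

    dim is assumed to be even: the first half holds sines, the second half cosines.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to about 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([0, 10, 500]), dim=8)
print(emb.shape)
print(emb[0])   # t = 0: all sines are 0, all cosines are 1
```

In a U-Net this vector would typically pass through a small MLP and be added to the feature maps of each residual block, so that one set of weights can denoise at every noise level.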

Inference

Start from $x_T \sim N(0,I)$
At each step, denoise using the trained model:

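$$ x_{t-1} = \mu_{\theta}(x_t, t) + \sigma_t z, \quad z \sim N(0, I), $$

where the mean is computed from the predicted noise (Ho et al., 2020):

$$ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t) \right), $$

$\sigma_t^2$ is a fixed variance such as $\beta_t$, and $z = 0$ at the final step $t = 1$.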

Summary

The reverse process generates new samples by reversing noise.
It requires training a neural network to approximate the denoising steps.
The loss function is mean squared error between predicted and actual noise.
This process is repeated iteratively from pure noise to produce coherent outputs.

Training Objective:

Our ultimate goal is to maximize the likelihood of the data, i.e.:

$$ \max \log p(x) $$

However, this is intractable: evaluating it requires integrating over all latent variables. So, as in VAEs, we optimize the Evidence Lower Bound (ELBO) as a surrogate objective. Recall the VAE bound:

$$ \log p(x) \geq \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x)||p(z)) $$

Specifically, for diffusion models, we write the ELBO in terms of the forward and reverse diffusion processes.

$$ \log p_\theta(x_0) \geq \mathbb{E}_{q(x_1|x_0)}[\log p_{\theta}(x_0|x_1)] - D_{KL}(q(x_T|x_0)||p(x_T)) - \sum_{t=2}^T \mathbb{E}_{q(x_t|x_0)}[D_{KL}(q(x_{t-1}|x_t,x_0) || p_{\theta}(x_{t-1}|x_t))] $$

Here the first term is the reconstruction term, the second is the prior matching term, and the last is the denoising matching term. Unlike the VAE encoder $q_\phi$, the forward process $q$ is fixed and has no learned parameters.

Explanation of Terms

Reconstruction Term:

$$ \mathbb{E}_{q(x_1|x_0)}[\log p_{\theta}(x_0|x_1)] $$

This encourages the reverse process to reconstruct the original input $x_0$ from a noisy version $x_1$. It’s similar to the reconstruction loss in a VAE.

Prior Matching Term:

$$ D_{KL}(q(x_T|x_0)\|p(x_T)) $$

This penalizes the difference between the noisy endpoint distribution and the standard normal prior. Ideally, $x_T$ should resemble pure Gaussian noise.
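Because both distributions are Gaussian, this term has a closed form and contains no learnable parameters. A minimal sketch, assuming the linear schedule of Ho et al. (2020) and a hypothetical 4-pixel input with every pixel equal to 1:

```python
import numpy as np

def kl_to_standard_normal(mean, var):
    """KL( N(mean, var * I) || N(0, I) ), summed over dimensions; var is a scalar."""
    mean = np.asarray(mean, dtype=np.float64)
    per_dim = 0.5 * (var + mean ** 2 - 1.0 - np.log(var))
    return per_dim.sum()

betas = np.linspace(1e-4, 0.02, 1000)       # assumed schedule
alpha_bar_T = np.prod(1.0 - betas)

x0 = np.ones(4)                             # hypothetical tiny "image"
# q(x_T | x_0) = N(sqrt(alpha_bar_T) * x_0, (1 - alpha_bar_T) * I)
kl = kl_to_standard_normal(np.sqrt(alpha_bar_T) * x0, 1.0 - alpha_bar_T)
print(alpha_bar_T, kl)
```

With this schedule the KL is tiny, which is why the prior matching term is usually ignored during training.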

Denoising Matching Term:

$$ \sum_{t=2}^T \mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_t,x_0) \| p_{\theta}(x_{t-1}|x_t))\right] $$

This ensures that at each timestep, the model learns to denoise $x_t$ into a distribution that matches the true reverse of the forward process.

Simplified Loss for Training

Ho et al. propose predicting the noise $\varepsilon$ added at each step, and minimizing the expected squared error:

$$ L_{simple} = \mathbb{E}_{t, x_0, \varepsilon \sim N(0,I)} \left[ \left\| \varepsilon - \varepsilon_\theta(x_t, t) \right\|^2 \right] $$

where

$$ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \varepsilon $$

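This loss is derived from the KL terms in the ELBO under the assumption that the variance is fixed and only the mean is learned. Ho et al. additionally drop the timestep-dependent weight in front of each term, which they found to improve sample quality.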
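A sketch of that reduction, following Ho et al. (2020). Write $\tilde{\mu}_t$ for the mean of $q(x_{t-1} \mid x_t, x_0)$ and fix the model variance to $\sigma_t^2 I$. The KL divergence between two Gaussians with the same covariance is a scaled squared distance between their means:

$$ D_{KL}\bigl(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\bigr) \;=\; \frac{1}{2\sigma_t^2}\, \bigl\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \bigr\|^2 + C, $$

where $C$ does not depend on $\theta$. Substituting $x_0 = \bigl(x_t - \sqrt{1-\bar\alpha_t}\,\varepsilon\bigr)/\sqrt{\bar\alpha_t}$ into $\tilde{\mu}_t$, and parameterizing $\mu_\theta$ in the same form with $\varepsilon_\theta$ in place of $\varepsilon$, gives

$$ \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\Bigl( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon \Bigr) \quad\Longrightarrow\quad D_{KL} \;=\; \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\, \bigl\| \varepsilon - \varepsilon_\theta(x_t, t) \bigr\|^2 + C . $$

Setting the leading weight to 1 for every $t$ yields $L_{simple}$.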

Why Use the ELBO?

The ELBO provides a tractable surrogate for the intractable log-likelihood $\log p_\theta(x_0)$.
By maximizing this lower bound, we aim for the following:

We are learning a model whose reverse process approximates the true posterior trajectory from $x_T$ to $x_0$.
The training is stable and grounded in variational inference.
The model generates samples that match the data distribution when denoised from pure noise.

This theoretical framework connects diffusion models to well-established probabilistic principles, enabling principled training and evaluation.

Training Algorithm

Outlier, Diffusion Models | Paper Explanation | Math Explained. (YouTube)
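The training procedure (Algorithm 1 in Ho et al., 2020) repeats four steps: sample a data point, sample a timestep, sample noise, and take a gradient step on the noise-prediction error. The sketch below runs that loop on a toy problem. Everything specific here is an assumption for illustration: scalar data $x_0 \sim N(1, 1)$, a coarse 10-step schedule, and a per-timestep linear model `w[t] * x_t + b[t]` in place of a U-Net, trained with plain SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10
betas = np.linspace(0.05, 0.5, T)           # toy schedule, far coarser than a real one
alpha_bars = np.cumprod(1.0 - betas)
DATA_MEAN = 1.0                             # toy data: x_0 ~ N(1, 1)

# Toy stand-in for the network: eps_theta(x_t, t) = w[t] * x_t + b[t].
w = np.zeros(T)
b = np.zeros(T)

def eps_theta(x_t, t):
    return w[t - 1] * x_t + b[t - 1]

def loss_on_fresh_batch(n=50_000):
    x0 = DATA_MEAN + rng.standard_normal(n)
    t = rng.integers(1, T + 1, size=n)
    eps = rng.standard_normal(n)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

loss_before = loss_on_fresh_batch()

batch, lr = 1024, 0.2
for step in range(3000):
    x0 = DATA_MEAN + rng.standard_normal(batch)      # 1. x_0 ~ q(x_0)
    t = rng.integers(1, T + 1, size=batch)           # 2. t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(batch)                 # 3. eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    err = eps_theta(x_t, t) - eps                    # 4. gradient step on the squared error
    # Accumulate per-timestep gradients of the batch-mean loss.
    grad_w = np.bincount(t - 1, weights=2.0 * err * x_t, minlength=T) / batch
    grad_b = np.bincount(t - 1, weights=2.0 * err, minlength=T) / batch
    w -= lr * grad_w
    b -= lr * grad_b

loss_after = loss_on_fresh_batch()
print(loss_before, loss_after)
print(w)                                # learned slopes
print(np.sqrt(1.0 - alpha_bars))        # analytic optimum for this toy data
```

For unit-variance Gaussian data the best possible predictor is linear in $x_t$ with slope $\sqrt{1-\bar\alpha_t}$, so the learned `w` can be compared against the exact answer. A real model replaces the linear map with a neural network and the manual gradients with automatic differentiation.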

Sampling Algorithm
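Sampling (Algorithm 2 in Ho et al., 2020) starts from Gaussian noise and applies the learned reverse step for $t = T, \dots, 1$, adding no noise at the last step. The sketch below is a minimal NumPy version with $\sigma_t^2 = \beta_t$. To stay self-contained it uses no trained network: the data are assumed to be scalars with $x_0 \sim N(3, 1)$, for which the optimal noise predictor is known exactly, and `exact_eps` plays the role of $\varepsilon_\theta$.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling with sigma_t^2 = beta_t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                    # t = T, T-1, ..., 1
        beta_t, alpha_t, a_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        # No noise is added at the final step t = 1.
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        eps_hat = eps_model(x, t)
        mean = (x - beta_t / np.sqrt(1.0 - a_bar_t) * eps_hat) / np.sqrt(alpha_t)
        x = mean + np.sqrt(beta_t) * z
    return x

betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
DATA_MEAN = 3.0                            # toy data: x_0 ~ N(3, 1)

def exact_eps(x_t, t):
    """E[eps | x_t] for x_0 ~ N(DATA_MEAN, 1); stands in for a trained network."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(1.0 - a_bar) * (x_t - np.sqrt(a_bar) * DATA_MEAN)

rng = np.random.default_rng(0)
samples = ddpm_sample(exact_eps, (20_000,), betas, rng)
print(samples.mean(), samples.std())       # should be close to the data mean and std
```

Starting from zero-mean noise, the chain ends with samples whose mean and spread match the toy data distribution. With image data the only changes are the array shape and the use of a trained network for `eps_model`.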