Diffusion

Questions

Why don't you sample at training time? What if I sample everything at mean at inference time? Will it reproduce the mode?

Papers Have not Read

Improved Denoising Diffusion Probabilistic Models (Nichol et al., 2021): finds that learning the variance of the conditional distribution (besides the mean) helps in improving performance

Cascaded Diffusion Models for High Fidelity Image Generation (Ho et al., 2021): introduces cascaded diffusion, which comprises a pipeline of multiple diffusion models that generate images of increasing resolution for high-fidelity image synthesis

Diffusion Models Beat GANs on Image Synthesis (Dhariwal et al., 2021): show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models by improving the U-Net architecture, as well as introducing classifier guidance

Classifier-Free Diffusion Guidance (Ho et al., 2021): shows that you don't need a classifier for guiding a diffusion model by jointly training a conditional and an unconditional diffusion model with a single neural network

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) (Ramesh et al., 2022): uses a prior to turn a text caption into a CLIP image embedding, after which a diffusion model decodes it into an image

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (ImageGen) (Saharia et al., 2022): shows that combining a large pre-trained language model (e.g. T5) with cascaded diffusion works well for text-to-image synthesis

Theory

Implementation Details

Actual Unet Code in Diffusion: Here which uses residual blocks, multi-head attention, conditioning, and group norm.

Notations

$\begin{align*} p(\mathbf{x}) \tag{probability distribution of x equal to everything}\\ p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \cap \mathbf{y}) \tag{probability distribution of x and y equal to everything}\\ p(\mathbf{x} | \mathbf{y}) \tag{probability distribution of observing x equal to everything given y equal to something}\\ x \sim p(x) \tag{random variable x is sampled from distribution p}\\ \end{align*}$

Note that $p(z_{1:n}|x_{1:n})$ denotes a probability distribution, not specific probability density since $z_{1:n}$ does not take on specific value. Knowing $p(z_{1:n}|x_{1:n})$ means knowing the probability density of $z_{1:n}$ taking all possible value conditioned on $x_{1:n}$ . $x$ is a random variable and $p$ is a distribution that "sits on" $x$ .

complete conditional: the probability distribution of a thing after knowing every other things. (special case of posterior distribution)

Preliminary:

Kullback-Leibler (KL) Divergence: watch this
Kolm-Smironv (KS) Distance: the difference between CDF of distributions
Wasserstein Distance: the work (force times distance) need to be done to move one distribution to fit another

Classical Variational Inference

Below notes are from video here

Goal: given knowledge (we make some assumption on data distribution) and data (sample from observation), we would love to predict new data.

Probabilistic Model: $p(\mathbf{z}, \mathbf{x})$ where $\mathbf{z}$ is latent variable, $\mathbf{x}$ is observed variable.

Posterior: $p(\mathbf{z} | \mathbf{x}) = \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}$ is what we use for inference where $p(\mathbf{x})$ is called "evidence" and is usually intractable.

Approximate Posterior: $q(\mathbf{z}; \mathbf{v})$ is a distribution (indexed by $\mathbf{v}$ ) in a family of possible distributions. We start from a distribution indexed by $v^{\text{init}}$ and gradually go to a distribution indexed by $v^*$ that is optimal ( $KL(q(\mathbf{z}; v^*)) \| p(\mathbf{z} | \mathbf{x})$ is small). You can think of $\mathbf{v}$ as a parameter for $q$ .

Stochastic optimization turns variational inference problem into an optimization problem.

Mean Field Variational Inference

Labels in graph: The $i^{\text{th}}$ data point only depend on $z_{i}$ and $\beta$

Observation: $\mathbf{x} = x_{1:n}$
Local Latent Variables: $\mathbf{z} = z_{1:n}$
Global Latent Variables: $\beta$

The goal is to compute $p(\beta, \mathbf{z} | \mathbf{x})$ by minimizing KL between $q(\beta, \mathbf{z}; \mathbf{v})$ and $p(\beta, \mathbf{z} | \mathbf{x})$ .

KL isn't the best divergence, but it is the most common one. Maximizing ELBO is a good objective to minimize KL.

The above graph captures many models, including:

Bayesian Mixture Models
Time Series Models
Factorial Models
Matrix Factorization (PCA, CCA)
Dirichlet Process Mixtures, HDPs
Multilevel Regression (Linear, Probit, Poisson)
Stochastic Block Models
Mixed-Membership Models (LDA)

Assume:

Exponential Family Assumption: each complete conditional in above variables is in the exponential family.
Mean-Field Assumption: each variable ( $z_i, \beta$ ) depend only on another free random variable.

Amortization: variational parameters are not "free", but they are a function of the input. (idea behind variational auto-encoder)

Traditional VI use coordinate descent (only tweaking one parameter at a time). Stochastic optimization use high variance, unbiased gradient.

Blackbox Variational Inference

We wish to remove constraints on conditional conjugate assumption (ie. mean field assumption) to do variational inference on more models like Logistic Regression. (No mathematical work beyond specifying the model)

Nonconjugate models:

Nonlinear Time Series Models
Deep Latent Gaussian Models
Models with Attention
Generalized Linear Models
Stochastic Volatility Models
Discrete Choice Models
Bayesian Neural Networks
Deep Exponential Families
Correlated Topic Models
Sigmoid Belief Networks

Using Classical VI, we cannot calculate ELBO objective analytically for models above because we cannot calculate the expectation analytically.

The way to do it

we sample from $q(\cdot)$
form noisy gradients without model-specific computation (operators are cached)
stochastic optimization

Score Gradient (also called likelihood ratio, reinforce gradient): model gradient as an expectation, and use Monte Carlo to approximate the expectation.

$\nabla_v \mathfrak{L} = \mathbb{E}_{q(\mathbf{z}; \mathbf{v})} \left[\nabla_v \log q(\mathbf{z}; \mathbf{v}) (\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}; \mathbf{v}))\right]$

This is trackable given the library:

$\nabla_v \log q(\mathbf{z}; \mathbf{v})$ is the score function
$\log p(\mathbf{x}, \mathbf{z})$ is writing down the model
$\log q(\mathbf{z}; \mathbf{v})$ is density of variation approximation

Blackbox variational inference enables probabilistic programs: traditionally we write program to generate art using random variable, now we can write a model and given data, generate art.

Generative Models

Generative Model: there exists a data distribution $p(x)$ , we want our model to guess this distribution by sampling and observing our dataset $p(x)$ .

Data Distribution vs Model Distribution: each image correspond to a point in the distribution

We assume all data samples (training data) come from some underlying distribution. Since we have limited number of training data, we can't have analytical solution to this data distribution. Although one can come up with a distribution that explain our data analytically, this would overfit. So the most powerful assumption we can make is "the actual information density in $p(\mathbf{x})$ is low, and is the result of heavy computation using small data".

We use our model $q(\mathbf{x})$ to better approximate the data distribution $p(\mathbf{x})$ . Our untrained model $q(\mathbf{x})$ has the potential to be trained to produce a specific distribution from a family of distributions. Our job is to minimizing the distance between the data distribution $p(\mathbf{x})$ and the model distribution $q(\mathbf{x})$ to approximate data distribution. We then generate new data $x$ by sampling model distribution $q(\mathbf{x})$ .

Variational Auto-Encoder Basics

Videos and Articles:

Basic Intuition
Technical Description (only look at 4:00~9:00)
Detailed Analysis
Detailed Article

Autoencoder: compress information.

Purpose of Variational Autoencoder (VAE): In autoencoder, if latent space too small, image will become blurry (underfitting). If too large, most of the latent space sampled from random will not be meaningful (overfitting). This is because the latent space isn't well organized into structures. Variational autoencoder regularize the latent space to avoid overfitting and has good property for generation purpose.

We lack two properties that will be introduced by VAE - continuity: similar things are together in latent space - completeness: no meaningful data in latent space

Variational Autoencoder: instead of mapping $x$ to a fixed size vector, it maps $x$ to a normal distribution (represented by a mean $\mu$ and a standard deviation $\sigma$ ) of vectors in the latent space. Then sample from the distribution (reconstructed by $\mu$ and $\sigma$ ). This way we can fill the latent space with more meaningful information. Variational Autoencoder add variances to autoencoder.

Regularization Loss: If we only sample from a predicted distribution, then the network will learn to predict a very small variance which defeats the purpose. So in addition to reconstruction loss, we need Regularization loss (KL divergence) to "require the covariance matrices to be close to the identity, preventing punctual distributions, and the mean to be close to 0, preventing encoded distributions to be too far apart from each others."

Note that the sampling step has "re-parameterization trick" to create gradient for $\mu$ and $\sigma$ .

Variational Auto-Encoder Math

We want to maximize the probability of generating something that is in the real data distribution, regardless of latent variable $z$ , so we want to maximize $\int p(x | z) dz = p(x) = \frac{p(x, z)}{p(z|x)}$ . (ie. get a model such that is very likely to generate the dataset we have, not considering overfitting) However, integration and obtaining $p(z|x)$ are both computational infeasible, but the following "Variational Lower Bound" (ELBO) holds: where $q_\phi(z|x)$ is a neural network with weight $\phi$ that tries to approximate $p(z|x)$ .

$\log p(x) \geq \mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right]$

In fact, since KL divergence is always non-negative, we have

$\log p(x) = \mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right] + D_{KL}(q_\phi(z|x) \| p(z|x))$

when we maximize ELBO $\mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right]$ , since $\log p(x)$ is fixed, the KL divergence must be minimized. So we are effectively minimizing the distance between $q_\phi(z|x)$ and $p(z|x)$ .

ELBO, when written out $\mathbb{E}_{q_\phi (z|x)}\left[\log p(x, z)\right] - \mathbb{E}_{q_\phi (z|x)}\left[q_\phi (z|x)\right]$ trade off two terms: - first term prefer $q_\phi(z|x)$ to be MAP estimate - second term prefer $q_\phi(z|x)$ to be diffuse (spread mass accross latent space)

So to maximize $\log p(x)$ , we just need to find proper $\phi$ .

The ELBO can be rewritten as a reconstruction term (ability of decoder, where $p_\theta(x|z)$ is a learned deterministic function called decoder with weight $\theta$ ) and a prior matching term (learned variational distribution should be similar to prior belief on latent variables). The prior matching term prevent collapse distribution into Dirac delta function.

$\mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right] = \mathbb{E}_{q_\phi (z|x)} \left[\log p_\theta(x|z)\right] - D_{KL}(q_\phi(z|x) \| p(z))$

We use the following assumptions:

$\begin{align*} q_\phi(z|x) &= \mathcal{N}(z; \mu_\phi(x), \sigma^2_\phi(x)I)\\ p(z) &= \mathcal{N}(z; 0, I)\\ \end{align*}$

We would sample $z$ from $q_\phi(z|x)$ and then feed it to $p_\theta(x|z)$ to get a reconstructed image. The gradient for $\phi$ is calculated using re-parameterization trick (model sampling as first sampling standard gaussian and then shift it).

For the theory behind reparameterization trick, watch this video.

Note that we can chain multiple VAEs together (Hierarchical Variational Autoencoders, HVAE) to get a deeper model. But VAE assumes the latent space is Gaussian, which is not true for many cases.

Denoising Diffusion Probabilistic Models (DDPM)

Most math involve how to add noise to the image, watch this explanation first implementing DDPM Paper.

"Creativity emerge from sampling a well-formed distribution"

Diffusion process: given pixel color $x_{t-1}$ , we replace it with a random pixel color sampled from $\mathcal{N}(\sqrt{1 - \beta_t} x_{t - 1}, \beta_t \mathbb{I})$ .

Reconstruction process: there are a few options

predict only the mean and then remove a sampled noise (since variance is fixed). However, this assumes Gaussian with isotropic covariance matrix (not true)
predict the mean and covariance matrix of noise then remove a sampled noise (but predicting exact covariance number is hard, so instead we calculate an upper and lower bound and predict an interpolation coefficient between them)
predict the noise directly and remove it (chosen by DDPM paper)
mix of both "simple objective" (predict noise) and "variational objective" (predict mean and covariance matrix of noise) and do importance sampling biased toward high noise to get smoother loss for variance reduction

Classifier Guidance: add classifier loss in each reconstruction step to produce better guidance with only 25 steps instead of hundreds of steps. (from "Improved Denoising Diffusion Probabilistic Models")

Variational Diffusion Models

DDPM is a type of variational diffusion model

Variational Diffusion Models: a Markovian Hierarchical Variational Autoencoder with three key restrictions:

latent dimension is exactly equal to the data dimension
structure of the latent encoder at each timestep is pre-defined as a linear Gaussian model (not learned)
final timestep T is a standard Gaussian

Note that encoder $q(x_t | x_{t - 1})$ is no longer parameterized by $\phi$ and not trainable.

With this, after very long derivation, we can lower bound $\log p(x)$ by three terms:

reconstruction term: $-\mathbb{E}_{q(x_1|x_0)}\left[\log {p_\theta(x_0|x_1)}\right]$ to make sure the model can reconstruct the original image from noised image
prior matching term: $-\sum_{t = 1}^{T - 1} \mathbb{E}_{q(x_{T - 1}|x_0)}\left[D_{KL}(q(x_T | x_{T - 1}) \| p(x_T))\right]$ . This is minimized when the final latent distribution matches the Gaussian prior. It has no parameter so not optimizable.
consistency term: $- \sum_{t = 1}^{T - 1} \mathbb{E}_{q(x_{t - 1}, x_{t + 1} | x_0)} \left[D_{KL}(q(x_t | x_{t - 1}) \| p_\theta (x_t | x_{t+1}))\right]$ . This term dominates the loss as it sums over all $T$ . Also, since we calculate expectation using Monte Carlo, sampled result over $T$ terms will have high variance which is bad.

We can rewrite consistency term using $q(x_t|x_{t - 1}) = q(x_t | x_{t - 1}, x_0)$ due to the Markov property (does not depend on $x_0$ ). So now we can lower bound $\log p(x)$ by three terms:

reconstruction term: $-\mathbb{E}_{q(x_1|x_0)}\left[\log {p_\theta(x_0|x_1)}\right]$ same as above
prior matching term: $-D_{KL}(q(x_T|x_0) \| p(x_T))$ how close the distribution of the final noisified input is to the standard Gaussian prior. (not trainable)
denoising matching term: $-\sum_{t = 2}^T \mathbb{E}_{q(x_t | x_0)}\left[D_{KL}q(x_{t - 1} | x_t, 0) \| p_\theta (x_{t - 1} | x_t)\right]$ denoising step we learned is close to the true denoising step.

So now, since we know the encoder is some Gaussian, we can add gaussian assumption to decoder. And after some annoying PnC (15-259 Probability and Computing), we can get the following:

$\begin{align*} q(x_{t - 1} | x_t, x_0) &= \frac{q(x_t | x_{t -1}, x_0)q(x_{t - 1}|x_0)}{q(x_t|x_0)}\\ &\propto \mathcal{N}(x_{t - 1}; \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t - 1})x_t + \sqrt{\bar{\alpha}_{t - 1}}(1 - \alpha_t)x_0}{1 - \bar{\alpha_t}}, \frac{(1 - \alpha_t)(1 - \bar{\alpha}_{t - 1})}{1 - \bar{\alpha_t}}I)\\ \end{align*}$

Note that the mean is a function of $x_t, x_0$ and the variance a function of $\alpha$ and is not learned. When variance is not learned, we can prove that minimizing KL divergence of two Gaussian is equivalent to minimizing $\|\mu_\theta(x_t, t) - \mu_q(x_t, x_0)\|^2_2$ . Expanding the terms out, we can also prove that the neural network must have the ability to predict the denoised image from noised image in any timestamp.

Score Diffusion - A New Perspective on Diffusion

Watch this video of original author talking about various score-based generative models.

Gaussian Models

If we model our data distribution by a simple Gaussian distribution (assuming $\mu$ is the only variable), then the model can be a 2-layer neural network.

In above image, $p_\mu$ represents the model $p$ parameterized by $\mu$ . $x$ is the input data point sampled from the data distribution. $p_\mu(x)$ gives the likelihood of observing $x$ from model distribution $p_\mu$ .

The word "model" and "model distribution" are used interchangeably since a model is really representing a distribution.

Non-Gaussian Models

Model distribution using general functions

We can get rid of Gaussian assumption by modeling PDF (probability density function) using a general function $f_\theta$ modeled by deep neural network. The model is parameterized by model weights $\theta$ .

The problem is the output might not be positive (since PDF must be positive). We take exponential $e^{f_\theta(x)}$ to get positive result and normalize it by $Z_\theta$ $e^{f_\theta(x)} / Z_\theta$ to make it sum to $1$ . However, normalizing constant $Z_\theta = \int e^{f_\theta(x)} dx$ is a $\sharp \text{P-complete}$ problem (as hard as $\text{NP-complete}$ ). So we use a trick to make it easier to calculate.

For Gaussian models, $Z_\mu = \frac{1}{(2\pi)^{d/2}}$

So there are couple ways to solve $Z_\theta$ problems:

approximating normalizing constant (inaccurate)
- "Energy-based Models"
using restricted neural network models (restricted)
- "Autoregressive Models"
- "Normalizing Flow Models"
- "Variational Auto-Encoder"
model generation process only (cannot evaluate probability)
- "GANs"

// TODO: finish watching https://www.youtube.com/watch?v=nv-WTeKRLl0

Variational Inference

Variational Inference: we can sample from dataset $p(x)$ , we want to optimize similarity (KL-divergence) between $p$ and $q$ where $q$ is the model we currently have.

// TODO: finish watching https://www.youtube.com/watch?v=gV1NWMiiAEI // TODO: finish watching https://www.youtube.com/watch?v=UTMpM4orS30 (3 parts) // TODO: finish watching https://www.youtube.com/watch?v=oHnnz8gS-YM

Particle-based Variational Inference (ParVIs): represent variational distribution by particles. Update particles to minimize KL. It is more particle-efficient than Markov Chain Monte Carlo (MCMC), another main way to represent complex distribution. Another ParVIs is Stochastic Gradient Langevin Dynamics (SGLD).

Stein Variational Gradient Descent (SVGD)

If you wish it to lean it in Chinese, watch this. But this is bad for beginners. You might want to read article first.

Compared to previous methods, SVGD is a general approximation method for distributions without having heavy assumptions on the distribution family of $q$ .

Stein Variational Gradient Descent (SVGD): we use a finite set of $N$ particles $\mathbf{x}$ to approximate distribution $q$ and use this approximated $q$ to approximate $p$ . We model this approximation as a process of applying transformation $T$ on the distribution (modeled by particles $\mathbf{x}$ ) in 2-Wasserstein space $\mathbb{W}_2(\Theta)$ , following steepest-direction of gradient descent.

2-Wasserstein space ( $\mathbb{W}_2(\Theta)$ ): if $\mathbb{P}(\Theta)$ is a set containing all the distributions on a support Euclidean space $\Theta$ (Euclidean distance), then $\mathbb{W}_2(\Theta) = \{\mu \in \mathbb{P}(\Theta) | (\exists \theta_0 \in \Theta)(\mathbb{E}_{\mu(\theta)} \left[\|\theta - \theta_0\|^2\right] < \infty)\}$ (2-Wasserstein space is a subset of distribution that is not long-tile with Wasserstein 2-distance)

Since applying $T$ to $x \sim q(x)$ mimics a step of gradient descent, we can constrain $T: \mathbb{R}^N \to \mathbb{R}^N$ (where $N$ is the number of particles) to have the form $T(x) = x + \epsilon \phi(x)$ . If $\phi$ is smooth and $\epsilon$ (step size) is small, then $T$ is invertible, which means we can easily compute the density of $z$ . To pick appropriate $\phi$ , we want to minimize the KL divergence between $q'$ (or $q_{[T]}$ in some literature, the distribution after passing $q$ through $T$ ) and $p$ . We wish to know the steepest direction of gradient descent:

$q' = q_{[T]} = q(T^{-1}(z)) \cdot |\det(\nabla_z T^{-1}(z))|\\ \nabla_\epsilon KL(q' \| p) |_{\epsilon = 0} = - \mathbb{E}_{x \sim q} \left[\text{trace}A_p \phi(x)\right]$

where $A_p$ is the Stein operator $A_p\phi(x) = \nabla_x \log p(x) \phi^T (x) + \nabla_x \phi(x)$ .

The intuition of above formula:

Stein identity: let $p(x)$ be a smooth density distribution on space $X$ , and $\phi(x) = [\phi_1(x), ..., \phi_d(x)]^T$ is a smooth vector-valued function, if $\phi$ is sufficiently regular, we have Stain identity, saying that the following hold:

$\mathbb{E}_p \left[A_p \phi(x)\right] = 0\\ A_p \phi(x) = \nabla_x \log p(x) \phi^T (x) + \nabla_x \phi(x)$

$A_p$ is called Stein Operator.

Observe that $\mathbb{E}_p \left[A_q \phi(x)\right]$ with $p \neq q$ , the most likely value is not zero, so we can use this to measure the distance between $p$ and $q$ . But since we have $\nabla_x$ , $\mathbb{E}_p \left[A_q \phi(x)\right]$ will be a matrix. Because we wish it to act as a scalar-valued loss function, we need to take the trace.

We can define Stein Discrepancy measure between $p$ and $q$ as:

$\mathbb{S}(p, q) = \max_{\phi \in H^d, \|\phi\| \leq 1} \{\mathbb{E}_q \left[A_p \phi(x))\right]^2\}$

And we can find the optimal $\phi^*_{p, q}(\cdot) = \mathbb{E}_q \left[A_p k(x, \cdot)\right]$ that maximizes such discrepancy. (Of course, remember to normalize $\phi(x) = \frac{\phi^*_{p, q}(x)}{\|\phi^*_{p, q}(x)\|}$ ) We will later use this maximization to find the steepest direction of gradient descent.

$\nabla_\epsilon KL(q_{[T]} \| p)|_{\epsilon = 0} = - \mathbb{E}_q \left[\text{trace}(A_p \phi(x))\right]$

This gives us

$\nabla_f KL(q_{[T]} \| p)|_{f = 0} = - \phi^*_{p, q}(x)$

(for line by line proof, see here)

For steepest direction, we need to find $\phi$ that minimize the above equation, but this minimization is not well defined since $\phi$ can have an arbitrary scalar making the expectation unbounded. So we add a constraint that $\phi$ should live in Reproducing Kernel Hilbert Spaces (RKHS).

According to this, usually $\phi$ is Radial basis function kernel (RBF) Kernel. For $h \in \mathbb{R}$ , the RBF kernel is defined as:

$k(x, x') = \exp\left(-\frac{\|x - x'\|^2_2}{h}\right)$

Reproducing Kernel Hilbert Spaces (RKHS, $H$ ): a Hilbert space of continuous functions on a set $X$ that has some inner product $<\cdot, \cdot>_H$ defined. A vector space of functions where kernel functions are the basis. (for more information, see here)

We force $\phi$ to live in $H_D$ (RKHS with $D$ -dimensional vector-valued functions with inner product $<\cdot, \cdot>_H$ defined as:

$<a, b>_{H_D} = \sqrt{\sum_{d = 1}^D <f_d, g_d>^2_{H}}$

and force $\|\phi\|_{H_D} = <\phi, \phi>_{H_D} \leq 1$ .

Now, we can prove the optimal kernel $\phi^*$ that minimize the above KL-divergence is: where $k$ is a positive-definite kernel function that we optimize.

Kernel function: $k(x, x') = <\phi(x), \phi(x')>$ where $\phi : X \to F$ is usually a function that takes input to a higher dimension feature space where dot product ( $<\cdot, \cdot>$ similarity metrics) is defined, and $k: X \times X \to \mathbb{R}$ has scalar value. By taking input to higher dimension, linear classifier can be used to classify non-linearly separable data. Note that according to here, kernel functions are abstract (does not imply specific calculation rules). Because $\phi$ can be computational costly, it is sometimes good to find a clever kernel that satisfy this kernel property, but computed differently.

$\phi^* = \frac{\beta}{\|\beta\|_{H_d}}\\ \beta(\cdot) = \mathbb{E}_{x \sim q}\left[k(x, \cdot)\nabla_x \log p(x) + \nabla_x k(x, \cdot)\right]$

We use the above objective to derive an update rule for Stein Variational Gradient Descent (SVGD): we use $x_n^{(i)}$ to denote $n^\text{th} \in [1, N]$ particles at $i^\text{th}$ iteration of applying $T$ and $\{x_n^{(x)}\}_{n = 1}^N$ is a set of particles at initial position

$x_n^{(i + 1)} = x_{n}^i + \epsilon^{(i)} \frac{1}{N} \sum_{m = 1}^N \left[k(x_n^{(i)}, x_m^{(i)})\nabla_{x} \log p(x)|_{x = x_n^{(i)}} + \nabla_{x} k(x_m^{(i)}, x)|_{x = x_n^{(i)}}\right]$

Note that if initialized in a bad position, SVGD can get stuck in a local minimum. Annealed SVGD proposes an annealing term (similar to annealing learning rate).

SVGD simulates Wasserstein Gradient Flow, a continuous-time PDE. For more, see this presentation. The algorithm we derived here can be derived using perspective of simulating Wasserstein Gradient Flow to match two distributions. The only difference is that Wasserstein Gradient Flow is formulated in $L^2$ space but SVGD is formulated in $H^D$ (RKHS) space, and because of that SVGD uses "smooth function and density" (see On Wasserstein Gradient Flows and Particle-Based Variational Inference, ICML 2019 by Ruiyi Zhang, it also covers acceleration of SVGD). Since $H^D$ is a subset of $L^2$ , result of SVGD is the projection of Wasserstein Gradient Flow in $H^D$ space.

SVGD is also equivalent to black-box variational inference (paper by Casey Chu at Stanford, and two other people in Preferred Networks, Inc., the company that developed PaintsChainer).

For code implementation, see here.

DDPM, DDIM, DPM, DPM++

DDPM: Denoising Diffusion Probabilistic Models DDIM: Denoising Diffusion Implicit Models DPM: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

DDIM: remove noise, become deterministic, 100~250 steps, turning SDE into ODE

Classifier Guidance

Like DeepDream, use classifier-gradient to manipulate latent. You do need to train a separate classifier that support image at any noise level.

Classifier-Free Guidance

Train both conditioned and unconditioned model (together as one model by sometimes leaving out conditioning), and interpolate between them during inference.

Details

To see the actual diffusion code with annotations (and things that is implemented differently than in the paper), look at this article.

Histroy

NovelAI Leak: The first big Stable Diffusion Community Drama and the Banning of AUTOMATIC1111

Platform

AI GodLike: https://www.aigodlike.com/

Civitai: a platform for sharing fine-tuned models. Here

Twitter: where to learn news (#aiart OR #stablediffusion OR #midjourney OR #imagen) min_faves:1500 until:2022-10-20 since:2022-10-10

Reddit: midjourney, stablediffusion

Discord: stable diffusion, community research channel

Papers

LoRA paper: https://arxiv.org/abs/2106.09685 Dreambooth Paper: https://arxiv.org/abs/2208.12242 Textual Inversion Paper: https://arxiv.org/abs/2208.01618

Installation

WebUI: https://github.com/AUTOMATIC1111/stable-diffusion-webui

Variational Auto Encoder (VAE): turn latent to image

See: https://www.reddit.com/r/StableDiffusion/comments/10kr1m1/current_state_of_anythingv3/

Comparison between models: https://stable-diffusion-art.com/models/#Stable_diffusion_v14

Official WebUI Documentation: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features

Models

ControlNet: https://huggingface.co/lllyasviel/ControlNet/tree/main/models

AnythingV3(w/VAE): a model fine-tuned from stable diffusion v2.

It seems like only the original 2.x model need config.yml in the same folder

768-v-ema.ckpt, 512-base-ema.ckpt config, 512-depth-ema.ckpt c

LoRA vs Dreambooth vs Textual Inversion vs Hypernetworks

Note that NovelAI uses CLIP Skip to train their model, which means it will work bettwer with CLIP Skip using leaked NovelAI model. But otherwise, don't use CLIP Skip.

.safetensors: a format that is safer (some pickle can be malicious) and faster than pickle (now supported by webui)

// QUESTION: why pickle malicious, what does safetensors do?

Extensions

Tag Complete: https://github.com/DominikDoom/a1111-sd-webui-tagcomplete

OpenPose Editor: https://github.com/fkunn1326/openpose-editor

when you use pose data, you don't need pose pre-processor

Deforum: https://www.youtube.com/watch?v=R52hxnpNews&ab_channel=Fictitiousness

Training Methods

Traditional Methods

text2img
img2img
video2video with this input image for this output image
video2video with this output image and next input frame as control for next output frame

Prompt Methods

Settings

Sampling Method: With extension, you might need Euler. Otherwise DDIM is good.

Resolution: 768 x 768 (since it was trained on)

See Miro Board: https://miro.com/app/board/uXjVPChrzFo=/

Seed: https://youtu.be/R52hxnpNews?t=294

With img2img, do exactly the amount of steps the slider specifies (normally you'd do less with less denoising). = true

Do not append detectmap to output = true

Allow detectmap auto saving = false

Deforum

Run: the same

Keyframe:

Strength Schedule: 0~1, basically how much strength to denoise

- CFG: 0~inf, when denoise, stick to prompt or be wild

Init:

Use init: img2img or not
Strength: 1 means stick 100% to the first image (override control net if stick to previous image)

ControlNet: if you enable, you can't disable

What the difference between Weight and Guidance strength?: https://github.com/Mikubill/sd-webui-controlnet/discussions/175

De-noising Parameter: https://github.com/deforum-art/deforum-for-automatic1111-webui/issues/3

Img2Img

Strength: more strength means closer to input image

Zoom: higher zoom (0.995) means zoom out slower

Prompts

Miku: 1girl, anime screencap, grey background, pixiv request, pixiv sample, hatsune miku, full body,

Rin: 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background,

Negative: gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head,

https://huggingface.co/JosephusCheung/ACertainThing

/home/hanke/Documents/stable-diffusion-webui/outputs/img2img-images/Deforum/20230222163920_video-settings.txt https://kokecacao.me/download?password=ec07a631-99c4-4be0-93be-85050c0e018d|9223372036854775807|-2592000.00000-22.png

No Control Net: /home/hanke/Documents/stable-diffusion-webui/outputs/img2img-images/Stable/20230223224439_video-settings.txt

// TODO: lora // TODO: vae: https://stable-diffusion-art.com/how-to-use-vae/

Seed: control net works bad with fixed or alternate seed, random and iter seed are fine, Euler a is fine Init: work well with noisy image init, even with 0 and 0 no init selected Guidance: small guidance strength = no pose, when both weight, guidance = max -> no control net Strength: 0.6 = keep image the same, 0.1 = don't need to keep the same, do pose (0.3 is good)

{
        "0": "  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background,  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background,  tiny cute swamp bunny, highly detailed, intricate, ultra hd, sharp photo, crepuscular rays, in focus, by tomasz alen kopera --neg gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head,   gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise,     ",
        "30": "  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background,  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background,  anthropomorphic clean cat, surrounded by fractals, epic angle and pose, symmetrical, 3d, depth of field, ruan jia and fenghua zhong --neg gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head,   gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise,     ",
        "60": "  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background,  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background,  a beautiful coconut --neg photo, realistic  gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head,   gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise,     ",
        "90": "  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background,  1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background,  a beautiful durian, trending on Artstation --neg gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head,   gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise,     "
}

// Good Generations: https://civitai.com/images/1236210?period=Week&periodMode=published&sort=Most+Reactions&view=feed&tags=5232 // Good: https://civitai.com/images/1257980?period=Week&periodMode=published&sort=Most+Reactions&view=feed&tags=5232

Fantastic Work

// TODO: copy everything you liked on twitter, everything you 3lian on bilibili, and wechat, things you sent to discord,

// TODO: explore this https://civitai.com/login?returnUrl=/models/6424/chiloutmix

// TODO: create website / image avatar like this: https://huggingface.co/WarriorMama777/OrangeMixs#abyssorangemix3-aom3

// TODO: lora extraction? https://www.youtube.com/watch?v=CnoXyIcXQV4 and lora in general https://www.youtube.com/watch?v=gw2XQ8HKTAI

// TODO: https://www.youtube.com/watch?v=QzM5iV-9jUU

// TODO: multi-control net: https://www.youtube.com/watch?v=cNIHZInV3mg

// TODO: awesome control net repo: https://github.com/cobanov/awesome-controlnet

Table of Content