Questions
Why don't you sample at training time? What if I sample everything at mean at inference time? Will it reproduce the mode?
Papers Not Yet Read
Improved Denoising Diffusion Probabilistic Models (Nichol et al., 2021): finds that learning the variance of the conditional distribution (besides the mean) helps in improving performance
Cascaded Diffusion Models for High Fidelity Image Generation (Ho et al., 2021): introduces cascaded diffusion, which comprises a pipeline of multiple diffusion models that generate images of increasing resolution for high-fidelity image synthesis
Diffusion Models Beat GANs on Image Synthesis (Dhariwal et al., 2021): show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models by improving the U-Net architecture, as well as introducing classifier guidance
Classifier-Free Diffusion Guidance (Ho et al., 2021): shows that you don't need a classifier for guiding a diffusion model by jointly training a conditional and an unconditional diffusion model with a single neural network
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) (Ramesh et al., 2022): uses a prior to turn a text caption into a CLIP image embedding, after which a diffusion model decodes it into an image
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) (Saharia et al., 2022): shows that combining a large pre-trained language model (e.g. T5) with cascaded diffusion works well for text-to-image synthesis
Actual U-Net code used in diffusion: here, which uses residual blocks, multi-head attention, conditioning, and group norm.
Note that p(z_{1:n}|x_{1:n}) denotes a probability distribution, not a specific probability density value, since z_{1:n} does not take on a specific value. Knowing p(z_{1:n}|x_{1:n}) means knowing the probability density of z_{1:n} at every possible value, conditioned on x_{1:n}. x is a random variable and p is a distribution that "sits on" x.
complete conditional: the probability distribution of one variable given all the other variables (a special case of a posterior distribution).
Preliminary:
Kullback-Leibler (KL) Divergence: watch this
Kolmogorov-Smirnov (KS) Distance: the largest absolute difference between the CDFs of two distributions
Wasserstein Distance: the minimum work (mass times distance) that needs to be done to move one distribution so that it fits another
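To make the three distances concrete, here is a minimal sketch in plain Python. It assumes discrete distributions for KL and equal-size 1-D samples for the Wasserstein distance (in 1-D, sorting both samples and pairing them up gives the optimal transport plan). The function names are my own, not from any library.

```python
import math
from bisect import bisect_right

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ks_distance(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)

    def cdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(cdf(sa, x) - cdf(sb, x)) for x in sa + sb)

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-size samples."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

a = [0.0, 1.0, 2.0, 3.0]
b = [x + 0.5 for x in a]  # the same sample shifted right by 0.5
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
print(ks_distance(a, b))
print(wasserstein_1d(a, b))
```

Shifting a sample by 0.5 costs exactly 0.5 of "work" per unit mass, which is what the Wasserstein distance reports, while the KS distance only sees the largest CDF gap.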
The notes below are from the video here
Goal: given knowledge (we make some assumptions about the data distribution) and data (samples from observation), we would like to predict new data.
Probabilistic Model: p(\mathbf{z}, \mathbf{x}) where \mathbf{z} is latent variable, \mathbf{x} is observed variable.
Posterior: p(\mathbf{z} | \mathbf{x}) = \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})} is what we use for inference, where p(\mathbf{x}) is called the "evidence" and is usually intractable.
Approximate Posterior: q(\mathbf{z}; \mathbf{v}) is a distribution (indexed by \mathbf{v}) in a family of possible distributions. We start from a distribution indexed by v^{\text{init}} and gradually move to a distribution indexed by v^* that is optimal (KL(q(\mathbf{z}; v^*) \| p(\mathbf{z} | \mathbf{x})) is small). You can think of \mathbf{v} as a parameter for q.
Variational inference turns the inference problem into an optimization problem, and stochastic optimization is what makes that optimization scale.
Labels in graph: the i^{\text{th}} data point only depends on z_{i} and \beta
Observation: \mathbf{x} = x_{1:n}
Local Latent Variables: \mathbf{z} = z_{1:n}
Global Latent Variables: \beta
The goal is to compute p(\beta, \mathbf{z} | \mathbf{x}) by minimizing KL between q(\beta, \mathbf{z}; \mathbf{v}) and p(\beta, \mathbf{z} | \mathbf{x}).
KL isn't the best divergence, but it is the most common one. Maximizing the ELBO is equivalent to minimizing this KL, so the ELBO is a good objective.
The above graph captures many models, including:
Bayesian Mixture Models
Time Series Models
Factorial Models
Matrix Factorization (PCA, CCA)
Dirichlet Process Mixtures, HDPs
Multilevel Regression (Linear, Probit, Poisson)
Stochastic Block Models
Mixed-Membership Models (LDA)
Assume:
Exponential Family Assumption: each complete conditional in above variables is in the exponential family.
Mean-Field Assumption: in q, each latent variable (z_i, \beta) is independent of the others and is governed only by its own free variational parameter.
Amortization: variational parameters are not "free", but they are a function of the input. (idea behind variational auto-encoder)
Traditional VI uses coordinate ascent (only tweaking one parameter at a time). Stochastic optimization uses noisy (high-variance) but unbiased gradient estimates.
We wish to remove the conditional conjugacy constraint (i.e. the exponential family assumption above) to do variational inference on more models like Logistic Regression. (No mathematical work beyond specifying the model)
Nonconjugate models:
Nonlinear Time Series Models
Deep Latent Gaussian Models
Models with Attention
Generalized Linear Models
Stochastic Volatility Models
Discrete Choice Models
Bayesian Neural Networks
Deep Exponential Families
Correlated Topic Models
Sigmoid Belief Networks
Using classical VI, we cannot calculate the ELBO objective analytically for the models above because we cannot calculate the expectation analytically.
The way to do it:
Score Gradient (also called likelihood ratio or REINFORCE gradient): write the gradient of the ELBO as an expectation, and use Monte Carlo to approximate the expectation.
This is tractable given the following pieces, which a library can provide:
\nabla_v \log q(\mathbf{z}; \mathbf{v}) is the score function
\log p(\mathbf{x}, \mathbf{z}) is writing down the model
\log q(\mathbf{z}; \mathbf{v}) is density of variation approximation
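The three pieces above combine into the usual score-gradient estimator \nabla_v \text{ELBO} = \mathbb{E}_q[\nabla_v \log q(\mathbf{z}; \mathbf{v}) (\log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}; \mathbf{v}))]. Below is a minimal sketch on a toy model that I picked for illustration (not from the video): z ~ N(0, 1), x | z ~ N(z, 1), and q(z; v) = N(v, 1). For this toy model the exact ELBO gradient is x - 2v, so the Monte Carlo estimate can be checked.

```python
import math
import random

def log_joint(x, z):
    """log p(x, z) for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -math.log(2 * math.pi) - 0.5 * z * z - 0.5 * (x - z) ** 2

def log_q(z, v):
    """log q(z; v) for the variational family q = N(v, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (z - v) ** 2

def score(z, v):
    """Score function: derivative of log q(z; v) with respect to v."""
    return z - v

def score_gradient(x, v, num_samples, rng):
    """Monte Carlo estimate of the ELBO gradient with respect to v."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(v, 1.0)  # sample from q
        total += score(z, v) * (log_joint(x, z) - log_q(z, v))
    return total / num_samples

rng = random.Random(0)
x, v = 2.0, 0.5
estimate = score_gradient(x, v, 200_000, rng)
exact = x - 2 * v  # closed-form gradient, valid only for this toy model
print(estimate, exact)
```

Note how many samples are needed for a decent estimate: the estimator is unbiased but has high variance, which is why variance reduction matters in practice.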
Black-box variational inference enables probabilistic programming: traditionally we write a program that generates art using random variables; now we can write a model and, given data, generate art.
Generative Model: there exists a data distribution p(x), and we want our model to guess this distribution by observing our dataset, which is sampled from p(x).
We assume all data samples (training data) come from some underlying distribution. Since we have a limited amount of training data, we can't have an analytical solution for this data distribution. Although one can come up with a distribution that explains our data analytically, this would overfit. So the most powerful assumption we can make is "the actual information density in p(\mathbf{x}) is low, and is the result of heavy computation using small data".
We use our model q(\mathbf{x}) to approximate the data distribution p(\mathbf{x}). Our untrained model q(\mathbf{x}) has the potential to be trained to produce a specific distribution from a family of distributions. Our job is to minimize the distance between the data distribution p(\mathbf{x}) and the model distribution q(\mathbf{x}). We then generate new data x by sampling from the model distribution q(\mathbf{x}).
Videos and Articles:
Technical Description (only look at 4:00~9:00)
Autoencoder: compress information.
Purpose of Variational Autoencoder (VAE): In an autoencoder, if the latent space is too small, the image will become blurry (underfitting). If it is too large, most points sampled at random from the latent space will not be meaningful (overfitting). This is because the latent space isn't well organized into structures. A variational autoencoder regularizes the latent space to avoid overfitting and gives it good properties for generation.
A plain autoencoder lacks two properties that are introduced by VAE:
- continuity: similar things are close together in latent space
- completeness: every point sampled from the latent space decodes to meaningful data
Variational Autoencoder: instead of mapping x to a fixed-size vector, it maps x to a normal distribution (represented by a mean \mu and a standard deviation \sigma) over vectors in the latent space. Then we sample from the distribution (reconstructed from \mu and \sigma). This way we can fill the latent space with more meaningful information. A variational autoencoder adds variance to an autoencoder.
Regularization Loss: If we only sample from a predicted distribution, then the network will learn to predict a very small variance which defeats the purpose. So in addition to reconstruction loss, we need Regularization loss (KL divergence) to "require the covariance matrices to be close to the identity, preventing punctual distributions, and the mean to be close to 0, preventing encoded distributions to be too far apart from each others."
Note that the sampling step uses the "re-parameterization trick" so that gradients can flow to \mu and \sigma.
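A minimal sketch of both ideas for a single latent dimension, assuming a standard normal prior N(0, 1). The function names are hypothetical. The sample is written as z = \mu + \sigma \epsilon with \epsilon \sim N(0, 1), so z is a differentiable function of \mu and \sigma, and the regularization loss is the closed-form KL to the prior.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, 1).

    All randomness lives in eps, so dz/dmu = 1 and dz/dsigma = eps.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps, eps

def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) for one latent dimension."""
    return 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))

rng = random.Random(0)
mu, sigma = 1.5, 0.3
z, eps = reparameterize(mu, sigma, rng)
print(z, kl_to_standard_normal(mu, sigma))
print(kl_to_standard_normal(0.0, 1.0))  # zero when q equals the prior
```

The KL term grows both when \sigma shrinks toward 0 (a "punctual" distribution) and when \mu drifts away from 0, which matches the quoted description of the regularization loss.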
We want to maximize the probability of generating something that is in the real data distribution, regardless of the latent variable z, so we want to maximize p(x) = \int p(x, z) dz = \int p(x | z) p(z) dz = \frac{p(x, z)}{p(z|x)} (i.e. get a model that is very likely to generate the dataset we have, not considering overfitting). However, the integration and obtaining p(z|x) are both computationally infeasible. Instead, the following "Variational Lower Bound" (ELBO) holds:
\log p(x) \geq \mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right]
where q_\phi(z|x) is a neural network with weights \phi that tries to approximate p(z|x).
In fact, since KL divergence is always non-negative, the bound follows from the identity
\log p(x) = \mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right] + D_{KL}(q_\phi(z|x) \| p(z|x))
When we maximize the ELBO \mathbb{E}_{q_\phi (z|x)} \left[\log \frac{p(x, z)}{q_\phi (z|x)}\right] with respect to \phi, since \log p(x) does not depend on \phi, the KL divergence must be minimized. So we are effectively minimizing the distance between q_\phi(z|x) and p(z|x).
The ELBO, when written out as \mathbb{E}_{q_\phi (z|x)}\left[\log p(x, z)\right] - \mathbb{E}_{q_\phi (z|x)}\left[\log q_\phi (z|x)\right], trades off two terms:
- the first term prefers q_\phi(z|x) to concentrate on the MAP estimate
- the second term (the entropy of q_\phi) prefers q_\phi(z|x) to be diffuse (spread mass across the latent space)
So to make the ELBO as close as possible to \log p(x), we just need to find a proper \phi.
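The decomposition of \log p(x) into ELBO plus KL can be checked numerically on a model where everything has a closed form. The sketch below assumes a toy model of my choosing: z ~ N(0, 1) and x | z ~ N(z, 1), so p(x) = N(0, 2) and p(z|x) = N(x/2, 1/2), with q(z|x) = N(m, s^2).

```python
import math

def log_evidence(x):
    """log p(x) where p(x) = N(0, 2) for the toy model."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s):
    """E_q[log p(x, z)] + entropy of q, with q = N(m, s^2)."""
    expected_log_joint = (-math.log(2 * math.pi)
                          - 0.5 * (m * m + s * s)
                          - 0.5 * ((x - m) ** 2 + s * s))
    entropy = 0.5 * math.log(2 * math.pi * math.e * s * s)
    return expected_log_joint + entropy

def kl_to_posterior(x, m, s):
    """KL(N(m, s^2) || p(z|x)) with p(z|x) = N(x / 2, 1 / 2)."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / (s * s))
            + (s * s + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)

x = 2.0
for m, s in [(0.5, 1.0), (1.0, 0.5), (1.0, math.sqrt(0.5))]:
    gap = kl_to_posterior(x, m, s)
    # The last column minus the first column equals the KL gap.
    print(elbo(x, m, s), gap, log_evidence(x))
```

The last setting uses the exact posterior as q, so the KL gap vanishes and the ELBO equals \log p(x).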
The ELBO can be rewritten as a reconstruction term (the ability of the decoder, where p_\theta(x|z) is a learned deterministic function called the decoder with weights \theta) and a prior matching term (the learned variational distribution should be similar to the prior belief on latent variables):
\mathbb{E}_{q_\phi (z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}(q_\phi(z|x) \| p(z))
The prior matching term prevents the distribution from collapsing into a Dirac delta function.
We use the following assumptions:
We sample z from q_\phi(z|x) and then feed it to p_\theta(x|z) to get a reconstructed image. The gradient for \phi is calculated using the re-parameterization trick (model the sampling as first sampling from a standard Gaussian, then scaling and shifting it).
For the theory behind reparameterization trick, watch this video.
Note that we can chain multiple VAEs together (Hierarchical Variational Autoencoders, HVAE) to get a deeper model. But VAE assumes the latent space is Gaussian, which is not true for many cases.
Most of the math is about how to add noise to the image. Watch this explanation first; it implements the DDPM paper.
"Creativity emerge from sampling a well-formed distribution"
Diffusion process: given pixel color x_{t-1}, we replace it with a random pixel color sampled from \mathcal{N}(\sqrt{1 - \beta_t} x_{t - 1}, \beta_t \mathbb{I}).
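A minimal sketch of this noising step for a single scalar pixel. The linear schedule from 1e-4 to 0.02 over 1000 steps is an assumption borrowed from the DDPM paper's defaults, and the function names are hypothetical. Because every step is Gaussian, the whole chain also has a closed form: x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon with \bar\alpha_t = \prod_{s \le t} (1 - \beta_s).

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise levels (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def forward_step(x_prev, beta_t, rng):
    """One noising step: x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t)."""
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def jump_to(x0, alpha_bar_t, rng):
    """Sample x_t directly from x_0 using the closed form of the chain."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

rng = random.Random(0)
betas = linear_betas(1000)
abar = alpha_bars(betas)
print(abar[0], abar[-1])          # starts near 1, ends near 0
print(jump_to(1.0, abar[-1], rng))  # almost pure standard Gaussian noise
```

Since \bar\alpha_T is close to 0, the final x_T keeps almost nothing of x_0, which is why the last timestep can be treated as a standard Gaussian.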
Reconstruction process: there are a few options
predict only the mean and then remove a sampled noise (since variance is fixed). However, this assumes Gaussian with isotropic covariance matrix (not true)
predict the mean and covariance matrix of noise then remove a sampled noise (but predicting exact covariance number is hard, so instead we calculate an upper and lower bound and predict an interpolation coefficient between them)
predict the noise directly and remove it (chosen by DDPM paper)
mix of both "simple objective" (predict noise) and "variational objective" (predict mean and covariance matrix of noise) and do importance sampling biased toward high noise to get smoother loss for variance reduction
Classifier Guidance: add the gradient of a classifier at each reconstruction step to steer sampling, which gives good results with as few as 25 steps instead of hundreds of steps. (from "Diffusion Models Beat GANs on Image Synthesis")
DDPM is a type of variational diffusion model
Variational Diffusion Models: a Markovian Hierarchical Variational Autoencoder with three key restrictions:
latent dimension is exactly equal to the data dimension
structure of the latent encoder at each timestep is pre-defined as a linear Gaussian model (not learned)
final timestep T is a standard Gaussian
Note that encoder q(x_t | x_{t - 1}) is no longer parameterized by \phi and not trainable.
With this, after a very long derivation, we can lower bound \log p(x) by the sum of three terms:
reconstruction term: \mathbb{E}_{q(x_1|x_0)}\left[\log {p_\theta(x_0|x_1)}\right], which makes sure the model can reconstruct the original image from the noised image
prior matching term: -\mathbb{E}_{q(x_{T - 1}|x_0)}\left[D_{KL}(q(x_T | x_{T - 1}) \| p(x_T))\right]. The KL is minimized when the final latent distribution matches the Gaussian prior. It has no trainable parameters, so it is not optimizable.
consistency term: - \sum_{t = 1}^{T - 1} \mathbb{E}_{q(x_{t - 1}, x_{t + 1} | x_0)} \left[D_{KL}(q(x_t | x_{t - 1}) \| p_\theta (x_t | x_{t+1}))\right]. This term dominates the loss as it sums over all timesteps. Also, since each expectation is over two random variables (x_{t - 1} and x_{t + 1}) and is estimated with Monte Carlo, the estimate summed over T - 1 terms has high variance, which is bad.
We can rewrite the consistency term using q(x_t|x_{t - 1}) = q(x_t | x_{t - 1}, x_0), which holds by the Markov property (given x_{t - 1}, x_t does not depend on x_0), and then applying Bayes' rule. So now we can lower bound \log p(x) by three terms:
reconstruction term: \mathbb{E}_{q(x_1|x_0)}\left[\log {p_\theta(x_0|x_1)}\right], same as above
prior matching term: -D_{KL}(q(x_T|x_0) \| p(x_T)), how close the distribution of the final noisified input is to the standard Gaussian prior (not trainable)
denoising matching term: -\sum_{t = 2}^T \mathbb{E}_{q(x_t | x_0)}\left[D_{KL}(q(x_{t - 1} | x_t, x_0) \| p_\theta (x_{t - 1} | x_t))\right], which asks that the denoising step we learned is close to the true denoising step
So now, since we know the encoder is some Gaussian, we can add gaussian assumption to decoder. And after some annoying PnC (15-259 Probability and Computing), we can get the following:
Note that the mean is a function of x_t and x_0, and the variance is a function of \alpha and is not learned. When the variance is not learned, we can prove that minimizing the KL divergence of two Gaussians is equivalent to minimizing \|\mu_\theta(x_t, t) - \mu_q(x_t, x_0)\|^2_2. Expanding the terms out, we can also prove that the neural network must have the ability to predict the denoised image from the noised image at any timestep.
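A sketch of the posterior mean and variance for a scalar pixel, assuming the standard DDPM closed forms with \alpha_t = 1 - \beta_t and \bar\alpha_t = \alpha_t \bar\alpha_{t-1} (these are my assumption of what the derivation arrives at). It also checks that writing the mean in terms of x_0 and writing it in terms of the noise \epsilon agree, which is why predicting the denoised image and predicting the noise are interchangeable.

```python
import math

def posterior_mean_from_x0(x_t, x0, alpha_t, abar_t, abar_prev):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of x_0."""
    return (math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t
            + math.sqrt(abar_prev) * (1.0 - alpha_t) * x0) / (1.0 - abar_t)

def posterior_mean_from_noise(x_t, eps, alpha_t, abar_t):
    """The same mean, written in terms of the noise eps that produced x_t."""
    return (x_t - (1.0 - alpha_t) / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

def posterior_variance(alpha_t, abar_t, abar_prev):
    """Variance of q(x_{t-1} | x_t, x_0); depends only on the schedule."""
    return (1.0 - alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)

alpha_t, abar_prev = 0.98, 0.5
abar_t = alpha_t * abar_prev
x0, eps = 0.7, -1.2
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps  # noised pixel

mean_a = posterior_mean_from_x0(x_t, x0, alpha_t, abar_t, abar_prev)
mean_b = posterior_mean_from_noise(x_t, eps, alpha_t, abar_t)
print(mean_a, mean_b, posterior_variance(alpha_t, abar_t, abar_prev))
```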
Watch this video of original author talking about various score-based generative models.
If we model our data distribution by a simple Gaussian distribution (assuming \mu is the only variable), then the model can be a 2-layer neural network.
In the above image, p_\mu represents the model p parameterized by \mu. x is the input data point sampled from the data distribution. p_\mu(x) gives the likelihood of observing x under the model distribution p_\mu.
The words "model" and "model distribution" are used interchangeably since a model really represents a distribution.
We can get rid of Gaussian assumption by modeling PDF (probability density function) using a general function f_\theta modeled by deep neural network. The model is parameterized by model weights \theta.
The problem is that the output might not be positive (a PDF must be non-negative). We take the exponential e^{f_\theta(x)} to get a positive result and normalize it by Z_\theta, giving p_\theta(x) = e^{f_\theta(x)} / Z_\theta, so that it integrates to 1. However, computing the normalizing constant Z_\theta = \int e^{f_\theta(x)} dx is a \sharp \text{P-complete} problem in general (at least as hard as \text{NP-complete}). So we use a trick to make it easier to calculate.
For Gaussian models with identity covariance, f_\mu(x) = -\|x - \mu\|^2 / 2 and Z_\mu = (2\pi)^{d/2}
So there are a couple of ways to deal with the Z_\theta problem:
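In one dimension the normalizing constant can still be brute-forced, which makes the definition Z = \int e^{f(x)} dx concrete. The sketch below assumes f(x) = -(x - \mu)^2 / 2 and a plain trapezoid rule on a hypothetical grid. The same approach needs a number of grid points that grows exponentially with the dimension, which is the real difficulty.

```python
import math

def unnormalized_density(x, mu):
    """e^{f(x)} with f(x) = -(x - mu)^2 / 2."""
    return math.exp(-0.5 * (x - mu) ** 2)

def normalizing_constant(mu, lo=-12.0, hi=12.0, n=20000):
    """Trapezoid-rule approximation of the integral of e^{f(x)} over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (unnormalized_density(lo, mu) + unnormalized_density(hi, mu))
    for i in range(1, n):
        total += unnormalized_density(lo + i * h, mu)
    return total * h

Z = normalizing_constant(mu=1.0)
print(Z, math.sqrt(2 * math.pi))  # the two values should agree closely
```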
approximating normalizing constant (inaccurate)
using restricted neural network models (restricted)
model generation process only (cannot evaluate probability)
// TODO: finish watching https://www.youtube.com/watch?v=nv-WTeKRLl0
Variational Inference: we can sample from the dataset p(x), and we want to optimize the similarity (KL divergence) between p and q, where q is the model we currently have.
// TODO: finish watching https://www.youtube.com/watch?v=gV1NWMiiAEI
// TODO: finish watching https://www.youtube.com/watch?v=UTMpM4orS30 (3 parts)
// TODO: finish watching https://www.youtube.com/watch?v=oHnnz8gS-YM
Particle-based Variational Inference (ParVIs): represent the variational distribution by particles and update the particles to minimize KL. It is more particle-efficient than Markov Chain Monte Carlo (MCMC), the other main way to represent complex distributions. Another ParVI method is Stochastic Gradient Langevin Dynamics (SGLD).
If you wish to learn it in Chinese, watch this. But this is bad for beginners. You might want to read the article first.
Compared to previous methods, SVGD is a general approximation method for distributions without having heavy assumptions on the distribution family of q.
Stein Variational Gradient Descent (SVGD): we use a finite set of N particles \mathbf{x} to approximate the distribution q and use this approximated q to approximate p. We model this approximation as a process of applying a transformation T on the distribution (modeled by particles \mathbf{x}) in 2-Wasserstein space \mathbb{W}_2(\Theta), following the steepest direction of gradient descent.
2-Wasserstein space (\mathbb{W}_2(\Theta)): if \mathbb{P}(\Theta) is the set containing all the distributions on a support Euclidean space \Theta (Euclidean distance), then \mathbb{W}_2(\Theta) = \{\mu \in \mathbb{P}(\Theta) | (\exists \theta_0 \in \Theta)(\mathbb{E}_{\mu(\theta)} \left[\|\theta - \theta_0\|^2\right] < \infty)\} (2-Wasserstein space is the subset of distributions that are not heavy-tailed, i.e. have a finite second moment, equipped with the Wasserstein 2-distance)
Since applying T to x \sim q(x) mimics a step of gradient descent, we can constrain T: \mathbb{R}^d \to \mathbb{R}^d (where d is the dimension of each particle) to have the form T(x) = x + \epsilon \phi(x). If \phi is smooth and \epsilon (step size) is small, then T is invertible, which means we can easily compute the density of z = T(x) with the change-of-variables formula. To pick an appropriate \phi, we want to minimize the KL divergence between q' (or q_{[T]} in some literature, the distribution after passing q through T) and p. We wish to know the steepest direction of gradient descent:
where A_p is the Stein operator A_p\phi(x) = \nabla_x \log p(x) \phi^T (x) + \nabla_x \phi(x).
The intuition of above formula:
Stein identity: let p(x) be a smooth density on space X, and let \phi(x) = [\phi_1(x), ..., \phi_d(x)]^T be a smooth vector-valued function. If \phi is sufficiently regular, we have the Stein identity, saying that the following holds:
\mathbb{E}_p \left[A_p \phi(x)\right] = 0
A_p \phi(x) = \nabla_x \log p(x) \phi^T (x) + \nabla_x \phi(x)
A_p is called the Stein operator.
Observe that \mathbb{E}_q \left[A_p \phi(x)\right] with q \neq p is generally not zero, so we can use it to measure the distance between p and q. But since we have \nabla_x, \mathbb{E}_q \left[A_p \phi(x)\right] will be a matrix. Because we wish it to act as a scalar-valued loss function, we need to take the trace.
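A quick Monte Carlo check of this in one dimension (where the trace is just the scalar itself). The choices here are mine, for illustration: the Stein operator uses the score of N(0, 1), the test function is \sin(x), and samples come either from N(0, 1) or from a shifted N(1, 1). When the samples match the score the average is near 0; for the shifted samples the exact value works out to -\sin(1) e^{-1/2}, about -0.51.

```python
import math
import random

def stein_operator(x, score_p, phi, dphi):
    """1-D Stein operator: A_p phi(x) = score_p(x) * phi(x) + phi'(x)."""
    return score_p(x) * phi(x) + dphi(x)

def mean_stein(sample_mean, num_samples, rng):
    """Average of A_p sin(x) over x ~ N(sample_mean, 1), where p = N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(sample_mean, 1.0)
        # The score of N(0, 1) is d/dx log p(x) = -x.
        total += stein_operator(x, lambda t: -t, math.sin, math.cos)
    return total / num_samples

rng = random.Random(0)
matched = mean_stein(0.0, 100_000, rng)     # samples drawn from p itself
mismatched = mean_stein(1.0, 100_000, rng)  # samples drawn from a shifted Gaussian
print(matched, mismatched)
```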
We can define the Stein Discrepancy measure between p and q as:
\mathbb{S}(p, q) = \max_{\phi \in H^d, \|\phi\|_{H^d} \leq 1} \left\{\left(\mathbb{E}_q \left[\text{trace}(A_p \phi(x))\right]\right)^2\right\}
And we can find the optimal \phi^*_{p, q}(\cdot) = \mathbb{E}_q \left[A_p k(x, \cdot)\right] that maximizes such discrepancy. (Of course, remember to normalize: \phi = \frac{\phi^*_{p, q}}{\|\phi^*_{p, q}\|_{H^d}}) We will later use this maximization to find the steepest direction of gradient descent.
\nabla_\epsilon KL(q_{[T]} \| p)|_{\epsilon = 0} = - \mathbb{E}_q \left[\text{trace}(A_p \phi(x))\right]
This gives us
\nabla_f KL(q_{[T]} \| p)|_{f = 0} = - \phi^*_{p, q}(x)
(for line by line proof, see here)
For the steepest direction, we need to find the \phi that minimizes the above equation, but this minimization is not well defined since \phi can be scaled by an arbitrary constant, making the expectation unbounded. So we add a constraint that \phi should live in the unit ball of a Reproducing Kernel Hilbert Space (RKHS).
According to this, the kernel k is usually the Radial Basis Function (RBF) kernel. For a bandwidth h > 0, the RBF kernel is defined as:
k(x, x') = \exp\left(-\frac{\|x - x'\|^2_2}{h}\right)
Reproducing Kernel Hilbert Spaces (RKHS, H): a Hilbert space of continuous functions on a set X that has some inner product <\cdot, \cdot>_H defined. A vector space of functions where kernel functions are the basis. (for more information, see here)
We force \phi to live in H_D (the RKHS of D-dimensional vector-valued functions), with inner product <\cdot, \cdot>_{H_D} defined as:
and force \|\phi\|^2_{H_D} = <\phi, \phi>_{H_D} \leq 1.
Now, we can prove that the optimal direction \phi^* that minimizes the above KL divergence is:
\phi^*_{p, q}(\cdot) = \mathbb{E}_{x \sim q} \left[k(x, \cdot) \nabla_x \log p(x) + \nabla_x k(x, \cdot)\right]
where k is the positive-definite kernel function of the RKHS, which we get to choose.
Kernel function: k(x, x') = <\phi(x), \phi(x')> where \phi : X \to F is usually a function that takes the input to a higher-dimensional feature space where a dot product (<\cdot, \cdot>, a similarity metric) is defined, and k: X \times X \to \mathbb{R} is scalar-valued. By taking the input to a higher dimension, a linear classifier can be used to classify data that is not linearly separable. Note that according to here, kernel functions are abstract (they do not imply specific calculation rules). Because \phi can be computationally costly, it is sometimes good to find a clever kernel that satisfies this kernel property but is computed differently.
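A small sketch of "same kernel, computed differently", using the degree-2 polynomial kernel on 2-D inputs as an example of my choosing (here \phi is a feature map, not the SVGD direction). The explicit feature map goes to 3-D and takes a dot product there; the kernel trick gets the same number from a single 2-D dot product.

```python
import math

def feature_map(x):
    """Explicit map from 2-D input to the 3-D feature space of the kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def kernel_explicit(x, y):
    """k(x, y) = <phi(x), phi(y)>, computed in feature space."""
    return sum(a * b for a, b in zip(feature_map(x), feature_map(y)))

def kernel_trick(x, y):
    """The same kernel, computed as (x . y)^2 without building phi."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
print(kernel_explicit(x, y), kernel_trick(x, y))
```

For the RBF kernel the feature space is infinite-dimensional, so the shortcut is the only practical way to evaluate it.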
We use the above objective to derive an update rule for Stein Variational Gradient Descent (SVGD): we use x_n^{(i)} to denote the n^\text{th} particle (n \in [1, N]) at the i^\text{th} iteration of applying T, and \{x_n^{(0)}\}_{n = 1}^N is the set of particles at their initial positions
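A minimal 1-D sketch of the resulting particle update, assuming the commonly stated form x_i \leftarrow x_i + \epsilon \frac{1}{N} \sum_j [k(x_j, x_i) \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x_i)] with the RBF kernel from above. The target N(2, 1), the fixed bandwidth h = 1, the step size, and the initial particle positions are all arbitrary choices for illustration.

```python
import math

def rbf(x, y, h):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / h)."""
    return math.exp(-((x - y) ** 2) / h)

def svgd_step(particles, grad_log_p, step_size, h):
    """Move every particle along the empirical optimal direction phi*."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            grad_k = -2.0 * (xj - xi) / h * k  # derivative of k(xj, xi) w.r.t. xj
            # First term pulls toward high density, second term repels neighbors.
            phi += k * grad_log_p(xj) + grad_k
        updated.append(xi + step_size * phi / n)
    return updated

def grad_log_target(x):
    """Score of the target p = N(2, 1)."""
    return -(x - 2.0)

particles = [-6.0 + 2.0 * i / 29 for i in range(30)]  # start far from the target
for _ in range(500):
    particles = svgd_step(particles, grad_log_target, step_size=0.1, h=1.0)

mean = sum(particles) / len(particles)
var = sum((x - mean) ** 2 for x in particles) / len(particles)
print(mean, var)
```

Without the repulsive `grad_k` term every particle would collapse onto the mode at 2; with it, the particles spread out to cover the target.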
Note that if initialized in a bad position, SVGD can get stuck in a local minimum. Annealed SVGD proposes an annealing term (similar to annealing learning rate).
SVGD simulates Wasserstein Gradient Flow, a continuous-time PDE. For more, see this presentation. The algorithm we derived here can be derived using perspective of simulating Wasserstein Gradient Flow to match two distributions. The only difference is that Wasserstein Gradient Flow is formulated in L^2 space but SVGD is formulated in H^D (RKHS) space, and because of that SVGD uses "smooth function and density" (see On Wasserstein Gradient Flows and Particle-Based Variational Inference, ICML 2019 by Ruiyi Zhang, it also covers acceleration of SVGD). Since H^D is a subset of L^2, result of SVGD is the projection of Wasserstein Gradient Flow in H^D space.
SVGD is also equivalent to black-box variational inference (paper by Casey Chu at Stanford, and two other people in Preferred Networks, Inc., the company that developed PaintsChainer).
For code implementation, see here.
DDPM: Denoising Diffusion Probabilistic Models
DDIM: Denoising Diffusion Implicit Models
DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
DDIM: removes the random noise added at each sampling step, so sampling becomes deterministic; needs about 100~250 steps; turns the SDE into an ODE
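A sketch of one deterministic DDIM step for a scalar pixel, assuming the usual form with the noise scale set to zero: first estimate x_0 from the predicted noise, then re-noise that estimate to the earlier noise level using the same predicted noise. `eps_pred` stands in for the network output and is a hypothetical name.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    # Estimate of the clean image implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Re-noise the estimate to the earlier level; no fresh randomness is added.
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred

x0, eps = 0.7, -1.2
abar_t, abar_prev = 0.1, 0.6
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

# With a perfect noise prediction, one step lands exactly on the
# less-noisy version of the same image, even when skipping many timesteps.
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
print(x_prev, math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * eps)
```

Because the step only depends on the two noise levels, it can jump between non-adjacent timesteps, which is where the reduced step count comes from.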
Classifier Guidance: like DeepDream, use the classifier gradient to manipulate the latent. You do need to train a separate classifier that supports images at any noise level.
Classifier-Free Guidance: train both a conditioned and an unconditioned model (together as one model, by sometimes leaving out the conditioning), and interpolate (or extrapolate) between them during inference.
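A sketch of the inference-time mixing, assuming the convention used by most implementations: start from the unconditional noise prediction and move toward the conditional one by a guidance scale. The function name and the toy vectors are hypothetical; real predictions are full image-shaped tensors.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Mix the unconditional and conditional noise predictions.

    scale 0 gives the unconditional model, scale 1 the conditional model,
    and scale > 1 extrapolates past the conditional prediction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
for scale in (0.0, 1.0, 2.0):
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```

Both predictions come from the same network: one forward pass with the conditioning and one with it left out, matching how the model was trained.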
To see the actual diffusion code with annotations (and the things that are implemented differently than in the paper), look at this article.
NovelAI Leak: The first big Stable Diffusion Community Drama and the Banning of AUTOMATIC1111
AI GodLike: https://www.aigodlike.com/
Civitai: a platform for sharing fine-tuned models. Here
Twitter: where to learn news (#aiart OR #stablediffusion OR #midjourney OR #imagen) min_faves:1500 until:2022-10-20 since:2022-10-10
Reddit: midjourney, stablediffusion
Discord: stable diffusion, community research channel
LoRA paper: https://arxiv.org/abs/2106.09685 Dreambooth Paper: https://arxiv.org/abs/2208.12242 Textual Inversion Paper: https://arxiv.org/abs/2208.01618
WebUI: https://github.com/AUTOMATIC1111/stable-diffusion-webui
Variational Auto Encoder (VAE): turn latent to image
Comparison between models: https://stable-diffusion-art.com/models/#Stable_diffusion_v14
Official WebUI Documentation: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features
ControlNet: https://huggingface.co/lllyasviel/ControlNet/tree/main/models
AnythingV3(w/VAE): a model fine-tuned from stable diffusion v2.
It seems like only the original 2.x models need config.yml in the same folder: 768-v-ema.ckpt, 512-base-ema.ckpt config, 512-depth-ema.ckpt c
LoRA vs Dreambooth vs Textual Inversion vs Hypernetworks
Note that NovelAI uses CLIP Skip to train their model, which means the leaked NovelAI model will work better with CLIP Skip. But otherwise, don't use CLIP Skip.
.safetensors: a format that is safer (some pickle files can be malicious) and faster than pickle (now supported by webui)
// QUESTION: why pickle malicious, what does safetensors do?
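A partial answer to the question above, as a harmless sketch: a pickle file is a small program, and an object can tell pickle (through `__reduce__`) to call any function with any arguments while loading. Here the call is `eval("6 * 7")`; a malicious checkpoint could call something destructive instead. As far as I understand, safetensors avoids this by storing only a JSON header plus raw tensor bytes, so loading never executes code.

```python
import pickle

class Payload:
    """Harmless stand-in for a malicious object hidden in a checkpoint."""

    def __reduce__(self):
        # Tell pickle: "to rebuild me, call eval('6 * 7')".
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the eval call runs here, during loading
print(type(result), result)
```

Note that the loaded value is not even a `Payload`; it is whatever the called function returned.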
Tag Complete: https://github.com/DominikDoom/a1111-sd-webui-tagcomplete
OpenPose Editor: https://github.com/fkunn1326/openpose-editor
Deforum: https://www.youtube.com/watch?v=R52hxnpNews&ab_channel=Fictitiousness
Traditional Methods
text2img
img2img
video2video with this input image for this output image
video2video with this output image and next input frame as control for next output frame
Sampling Method: With extension, you might need Euler. Otherwise DDIM is good.
Resolution: 768 x 768 (since this is the resolution the model was trained on)
See Miro Board: https://miro.com/app/board/uXjVPChrzFo=/
Seed: https://youtu.be/R52hxnpNews?t=294
With img2img, do exactly the amount of steps the slider specifies (normally you'd do less with less denoising). = true
Do not append detectmap to output = true
Allow detectmap auto saving = false
Run: the same
Keyframe:
Init:
Use init: img2img or not
Strength: 1 means stick 100% to the first image (override control net if stick to previous image)
ControlNet: if you enable, you can't disable
What's the difference between Weight and Guidance strength?: https://github.com/Mikubill/sd-webui-controlnet/discussions/175
De-noising Parameter: https://github.com/deforum-art/deforum-for-automatic1111-webui/issues/3
Strength: more strength means closer to input image
Zoom: higher zoom (0.995) means zoom out slower
Miku: 1girl, anime screencap, grey background, pixiv request, pixiv sample, hatsune miku, full body,
Rin: 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background,
Negative: gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head,
https://huggingface.co/JosephusCheung/ACertainThing
/home/hanke/Documents/stable-diffusion-webui/outputs/img2img-images/Deforum/20230222163920_video-settings.txt https://kokecacao.me/download?password=ec07a631-99c4-4be0-93be-85050c0e018d|9223372036854775807|-2592000.00000-22.png
No Control Net: /home/hanke/Documents/stable-diffusion-webui/outputs/img2img-images/Stable/20230223224439_video-settings.txt
// TODO: lora // TODO: vae: https://stable-diffusion-art.com/how-to-use-vae/
Seed: ControlNet works badly with a fixed or alternating seed; random and iter seeds are fine, Euler a is fine
Init: works well with a noisy init image, even with 0 and 0 and no init selected
Guidance: small guidance strength = no pose; when both weight and guidance = max -> no ControlNet
Strength: 0.6 = keep image the same, 0.1 = don't need to keep the same, do pose (0.3 is good)
{
"0": " 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background, 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, tiny cute swamp bunny, highly detailed, intricate, ultra hd, sharp photo, crepuscular rays, in focus, by tomasz alen kopera --neg gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise, ",
"30": " 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background, 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, anthropomorphic clean cat, surrounded by fractals, epic angle and pose, symmetrical, 3d, depth of field, ruan jia and fenghua zhong --neg gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise, ",
"60": " 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background, 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, a beautiful coconut --neg photo, realistic gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise, ",
"90": " 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, blurry background, 1girl, anime screencap, grey background, pixiv request, pixiv sample, full body, kagamine rin, standing, simple background, a beautiful durian, trending on Artstation --neg gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, gradient background, (ugly:1.3), (fused fingers), (too many fingers), (bad anatomy:1.5), (watermark:1.5), (words), letters, untracked eyes, asymmetric eyes, floating head, (logo:1.5), (bad hands:1.3), (mangled hands:1.2), (missing hands), (missing arms), backward hands, floating jewelry, unattached jewelry, floating head, doubled head, unattached head, doubled head, head in body, (misshapen body:1.1), (badly fitted headwear:1.2), floating arms, (too many arms:1.5), limbs fused with body, (facial blemish:1.5), badly fitted clothes, imperfect eyes, untracked eyes, crossed eyes, hair growing from clothes, partial faces, hair not attached to head, crossed legs, crossed arms, film grain, noise, "
}
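A small sketch for reading a schedule like the one above outside the WebUI. It assumes only what is visible in the JSON: keys are frame numbers and `--neg` separates the positive prompt from the negative prompt. `split_prompt` is a hypothetical helper, not Deforum's own parser, and the short schedule string is a shortened stand-in.

```python
import json

def split_prompt(prompt):
    """Split 'positive --neg negative' into (positive, negative)."""
    positive, _, negative = prompt.partition("--neg")
    return positive.strip(), negative.strip()

schedule_json = '{"0": "a beautiful coconut --neg photo, realistic", "30": "a beautiful durian"}'
schedule = {
    int(frame): split_prompt(text)
    for frame, text in json.loads(schedule_json).items()
}
for frame in sorted(schedule):
    print(frame, schedule[frame])
```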
// Good Generations: https://civitai.com/images/1236210?period=Week&periodMode=published&sort=Most+Reactions&view=feed&tags=5232
// Good: https://civitai.com/images/1257980?period=Week&periodMode=published&sort=Most+Reactions&view=feed&tags=5232
// TODO: copy everything you liked on twitter, everything you gave a triple (like, coin, favorite) on bilibili and wechat, and things you sent to discord
// TODO: explore this https://civitai.com/login?returnUrl=/models/6424/chiloutmix
// TODO: create website / image avatar like this: https://huggingface.co/WarriorMama777/OrangeMixs#abyssorangemix3-aom3
// TODO: lora extraction? https://www.youtube.com/watch?v=CnoXyIcXQV4 and lora in general https://www.youtube.com/watch?v=gw2XQ8HKTAI
// TODO: https://www.youtube.com/watch?v=QzM5iV-9jUU
// TODO: multi-control net: https://www.youtube.com/watch?v=cNIHZInV3mg
// TODO: awesome control net repo: https://github.com/cobanov/awesome-controlnet