Lecture 003

Speaker Presentation

Speaker: imisra

https://cvpr2022-tutorial-diffusion-models.github.io

Stable Diffusion: the final noised latent might leak information about the input image. The beta schedule matters here, $\beta_T$ in particular: with $\beta_T = 1$ the final latent should carry no information about the input (a bug in Stable Diffusion?). The noise level should also depend on resolution: a bigger image has more signal, so more noise should be added?
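A minimal sketch of the leak, assuming the standard DDPM forward process $x_T = \sqrt{\bar\alpha_T}\, x_0 + \sqrt{1 - \bar\alpha_T}\, \epsilon$ with $\bar\alpha_T = \prod_t (1 - \beta_t)$. The schedule values below (scaled-linear, `beta_start=0.00085`, `beta_end=0.012`, 1000 steps) are the ones commonly quoted for Stable Diffusion v1. They are an assumption here, not something stated in the lecture, and the function names are illustrative.

```python
import math

def scaled_linear_betas(beta_start, beta_end, num_steps):
    """Betas spaced linearly in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def terminal_signal_scale(betas):
    """Return sqrt(alpha_bar_T), the coefficient of x_0 inside x_T."""
    alpha_bar_T = math.prod(1.0 - b for b in betas)
    return math.sqrt(alpha_bar_T)

# Assumed schedule values (commonly quoted for Stable Diffusion v1).
betas = scaled_linear_betas(0.00085, 0.012, 1000)
sd_scale = terminal_signal_scale(betas)

# Forcing the last beta to 1 zeroes the product, so x_T is pure noise.
fixed_scale = terminal_signal_scale(betas[:-1] + [1.0])

print(f"sqrt(alpha_bar_T), assumed SD schedule: {sd_scale:.4f}")
print(f"sqrt(alpha_bar_T), with beta_T = 1:     {fixed_scale:.4f}")
```

Under these assumed values the first number is small but nonzero, so a faint copy of the input survives in $x_T$; setting $\beta_T = 1$ removes it.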

Having more than one step ($T \neq 1$) in the diffusion design acts somewhat like a frequency decomposition of the image: sampling first reconstructs low-frequency detail, then high-frequency detail.
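One way to see the frequency picture is a toy per-frequency signal-to-noise calculation. This is a sketch under a loud assumption: image power falls off as $1/f^2$ (a common rule of thumb for natural images, not something from the lecture), while the added Gaussian noise is white. The function name and the $\bar\alpha$ values are illustrative.

```python
def first_noisy_frequency(alpha_bar, max_freq=64):
    """Lowest frequency f where noise power exceeds signal power.

    Assumes signal power alpha_bar / f**2 (1/f^2 image spectrum)
    and flat noise power 1 - alpha_bar (white Gaussian noise).
    """
    for f in range(1, max_freq + 1):
        signal_power = alpha_bar / f**2
        noise_power = 1.0 - alpha_bar
        if signal_power < noise_power:
            return f
    return None  # every frequency up to max_freq is still above the noise

# Larger t means smaller alpha_bar (more noise).
alpha_bars = [0.99, 0.9, 0.5, 0.1]
cutoffs = [first_noisy_frequency(a) for a in alpha_bars]
for a, c in zip(alpha_bars, cutoffs):
    print(f"alpha_bar={a}: frequencies >= {c} are buried in noise")
```

In this toy model the cutoff moves toward lower frequencies as noise increases, so the reverse process recovers coarse structure first and fine detail last.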

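FID: compares statistics of Inception features (network trained on ImageNet) between real and generated images as a measure of fidelity. This is problematic because the evaluated images are OOD for the ImageNet-trained network. Text conditioning uses CLIP embeddings.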
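For reference, the usual definition of FID fits a Gaussian to the Inception features of each image set and takes the Fréchet distance between the two. The symbols are introduced here for illustration: $(\mu_r, \Sigma_r)$ are the feature mean and covariance of the real images, $(\mu_g, \Sigma_g)$ those of the generated images.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Only first- and second-order feature statistics enter the score, which is one reason it depends so heavily on how well the feature extractor suits the data.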

Other papers:

Make-A-Video: first public
Imagen Video: non-latent (LAION-400M, internal) (http://cs231n.stanford.edu/slides/2023/lecture_15.pdf)
Align Your Latents, from NVIDIA
Preserve Your Own Correlation, from NVIDIA: mixture of experts based on different timesteps in image diffusion

Super-resolution: not as good in practice because the inputs are OOD. Most training is done at low resolution.

Text2Video-Zero

Make-A-Video

Localizing Object-level Shape Variations with Text-to-Image Diffusion Models
