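Stable Diffusion: the terminal latent x_T might leak information about the input image. This depends on the beta schedule, and \beta_T in particular: with \beta_T = 1 we get \bar\alpha_T = \prod_t (1 - \beta_t) = 0, so x_T is pure noise and should carry no information. Stable Diffusion's schedule appears to stop short of this (a bug in Stable Diffusion?). The amount of noise also depends on resolution: a bigger image has more (redundant) signal, so more noise should be added?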
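
A minimal sketch to check the leak numerically. It assumes the commonly cited Stable Diffusion training schedule ("scaled linear", \beta from 0.00085 to 0.012 over 1000 steps); these values and the helper names are assumptions, not taken from these notes. It computes \sqrt{\bar\alpha_T}, the factor by which the clean latent still appears in x_T.

```python
import math

def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Betas that are linear in sqrt(beta), then squared (assumed SD-style schedule)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def terminal_signal_scale(betas):
    """Return sqrt(alpha_bar_T), where alpha_bar_T = prod_t (1 - beta_t)."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar)

# Schedule as assumed above: the signal scale at t = T is small but not zero.
sd_scale = terminal_signal_scale(scaled_linear_betas())

# Same schedule with the last step forced to beta_T = 1: the signal is wiped out.
fixed_betas = scaled_linear_betas()
fixed_betas[-1] = 1.0
fixed_scale = terminal_signal_scale(fixed_betas)

print(f"sqrt(alpha_bar_T), assumed SD schedule: {sd_scale:.4f}")
print(f"sqrt(alpha_bar_T), with beta_T = 1:     {fixed_scale:.4f}")
```

Under these assumptions the first value is nonzero (a few percent of the signal survives), while \beta_T = 1 gives exactly zero.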

Using T > 1 steps in the diffusion design acts as a kind of frequency decomposition of the image: sampling first reconstructs the low-frequency structure, then the high-frequency detail.
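
A rough sketch of why this happens, under a loud assumption: image power falls off as f^{-2} (a common approximation for natural images, not something stated in these notes), while Gaussian noise has flat power. At noise level \bar\alpha the signal power at frequency f is \bar\alpha f^{-2} and the noise power is 1 - \bar\alpha; frequencies above the crossover are buried in noise. The function name and the exponent parameter are hypothetical.

```python
def cutoff_frequency(alpha_bar, spectrum_exponent=2.0):
    """Frequency where signal power equals noise power.

    Assumes signal power alpha_bar * f**(-spectrum_exponent) and flat noise
    power (1 - alpha_bar). Units are arbitrary: signal power is 1 at f = 1.
    """
    return (alpha_bar / (1.0 - alpha_bar)) ** (1.0 / spectrum_exponent)

# Going forward in time (alpha_bar shrinking), the cutoff moves to lower
# frequencies; sampling runs the other way, so low frequencies come back first.
for alpha_bar in (0.999, 0.9, 0.5, 0.1, 0.01):
    print(f"alpha_bar = {alpha_bar:<5}  cutoff frequency = {cutoff_frequency(alpha_bar):.3f}")
```

The cutoff shrinks monotonically as \bar\alpha decreases, so each denoising step effectively recovers the next band of higher frequencies.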

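FID: compares feature statistics (mean and covariance) of generated vs. real images as a measure of fidelity, with features taken from an Inception network pretrained on ImageNet. This is problematic when the images are out-of-distribution (OOD) for ImageNet. Text conditioning uses CLIP embeddings.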
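
A minimal sketch of the distance FID computes once features are extracted: the Fréchet distance between two Gaussians, ||\mu_1 - \mu_2||^2 + Tr(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}). The feature arrays below are random stand-ins, not Inception features, and the function names are hypothetical. The trace of the matrix square root is taken from the eigenvalues of \Sigma_1 \Sigma_2, which avoids a dependency beyond NumPy.

```python
import numpy as np

def feature_statistics(features):
    """Mean vector and covariance matrix of an (n_samples, n_dims) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of sigma1 @ sigma2; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

# Stand-in "features": random Gaussians, NOT real Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
fake = rng.normal(loc=0.5, size=(2000, 8))

score = frechet_distance(*feature_statistics(real), *feature_statistics(fake))
print(f"Frechet distance between stand-in feature sets: {score:.3f}")
```

In an actual FID computation the two arrays would be Inception activations for real and generated images; the OOD concern above is about those features, not about this formula.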