Lecture 004

Questions:

  1. What are some root causes of oversaturated colors in 3D generation in general? (One is high CFG in SDS, but there should be more.)
  2. Why does SDS actually work in practice? (See ProlificDreamer for why SDS shouldn't work.)
  3. Do you see a possibility of resolving the tradeoff between Jannis (trained using 3D GT) and variety / OOD images?

SDS "produce one single plausible output" and looks like "mean" though...

model ray rotation instead of image rotation
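One way to read this note (hypothetical sketch, not from the paper): apply the pose change to the camera rays that feed the renderer, rather than warping the already-rendered 2D image:

```python
import torch

def rotate_rays(origins: torch.Tensor, dirs: torch.Tensor, R: torch.Tensor):
    """Rotate camera rays by a 3x3 rotation R in world space,
    instead of rotating the rendered 2D image after the fact.

    origins, dirs: (N, 3) ray origins and directions; R: (3, 3).
    """
    return origins @ R.T, dirs @ R.T
```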

Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

TLDR: The paper builds a diffusion model with a "differentiable rendering function" (mainly pixelNeRF, https://alexyu.net/pixelnerf/) baked into the diffusion process.

Dataset requirement: a large collection of multi-view images (with known camera parameters) of many objects.

Theory: The authors prove that their model converges to maximizing the likelihood of both observed and unobserved views.
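My rough reading of why (notation mine, simplified; the actual parameterization and proof are in the paper): the denoising loss is applied after the rendering function, so the scene latent has to explain the target view rather than just the pixels the model was shown:

```latex
% Sketch: denoising-style reconstruction loss *through* the forward model.
% "denoise" builds a scene latent from the context and the noisy target;
% "forward" (here f) renders that latent at the target pose.
\mathcal{L} \approx \mathbb{E}_{t,\epsilon}\Bigl[
  \bigl\| O^{target}
    - f\bigl(\phi^{target},\,
        \mathrm{denoise}_\theta(O^{target}_t,\, O^{ctx},\, \phi^{ctx},\, t)
      \bigr) \bigr\|^2
\Bigr]
```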

Method:

  1. Sample two different views of the same object from the dataset.
  2. Call them (O^{ctx}, \phi^{ctx}) and (O^{target}, \phi^{target}). The context observation (image + camera parameters) is what the diffusion model conditions on; the target observation is what it denoises.
  3. The architecture is, to a first-order approximation, the same as an image diffusion model conditioned on another image, except it adds (1) a pixelNeRF encoder as a scene-generator function, which the paper calls "denoise", and (2) the pixelNeRF rendering function, which the paper calls "forward".
  4. The idea: (1) add noise to the target O^{target} exactly as in standard image diffusion; (2) instead of a UNet with cross-attention, condition on (O^{ctx}, \phi^{ctx}) in a different way.
  5. To use (O^{ctx}, \phi^{ctx}), first encode O^{ctx} with pixelNeRF; since pixelNeRF has a pretrained ResNet34 backbone, the hope is that its latent carries some idea of what the true 3D scene looks like. Rendering that latent at \phi^{target} yields a 3D-aware latent aligned with O^{target}_t at any denoising timestep. So what we truly condition on is O^{ctx}, its pixelNeRF latent, and the rendering of that latent at \phi^{target} (see the training-step sketch right after this list).
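
A minimal sketch of one training step as I understand it (all names hypothetical, not the paper's API; schedule and weighting details omitted):

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, num_timesteps, noise_schedule):
    """Hypothetical training step for diffusion with a forward model.

    model.encode: pixelNeRF-style "denoise" step, context + noisy target -> 3D scene latent.
    model.render: differentiable "forward" step, scene latent + pose -> image.
    noise_schedule: hypothetical helper, t -> (alpha_t, sigma_t) of shape (B,).
    """
    o_ctx, pose_ctx = batch["ctx_image"], batch["ctx_pose"]      # (O^{ctx}, \phi^{ctx})
    o_trgt, pose_trgt = batch["trgt_image"], batch["trgt_pose"]  # (O^{target}, \phi^{target})

    # (1) Noise the target view exactly as in standard image diffusion.
    t = torch.randint(0, num_timesteps, (o_trgt.shape[0],), device=o_trgt.device)
    eps = torch.randn_like(o_trgt)
    alpha_t, sigma_t = noise_schedule(t)
    o_trgt_noisy = alpha_t.view(-1, 1, 1, 1) * o_trgt + sigma_t.view(-1, 1, 1, 1) * eps

    # (2) "denoise": encode context + noisy target into a 3D scene latent
    #     (this replaces UNet cross-attention conditioning).
    scene_latent = model.encode(o_ctx, pose_ctx, o_trgt_noisy, pose_trgt, t)

    # (3) "forward": render the latent at the target pose to get the
    #     denoised estimate of the target view.
    o_trgt_pred = model.render(scene_latent, pose_trgt)

    # (4) Loss through the renderer, so gradients shape the 3D latent.
    return F.mse_loss(o_trgt_pred, o_trgt)
```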

Training:

Result: the results are good, ... as expected.

Discussion:
