What are some root causes of over-saturated colors in 3D generation in general? (One is the high CFG scale in SDS, but there should be more.)
Why does SDS actually work in practice? (See ProlificDreamer for an argument that SDS shouldn't work.)
Do you see a possibility of resolving the trade-off between Jannis (trained using 3D GT) and variety / OOD images?
SDS "produce one single plausible output" and looks like "mean" though...
Model ray rotation instead of image rotation.
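For context on the SDS questions above, the SDS gradient as defined in DreamFusion (standard notation, not taken from these notes) is:

\nabla_\theta \mathcal{L}_{SDS}(\theta) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right], \qquad x = g(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon

With classifier-free guidance, \hat\epsilon_\phi is the guided noise prediction; DreamFusion runs it with a very large guidance scale (~100), which is the usual suspect for over-saturated colors, and averaging this gradient over many noise draws is what produces the mode-seeking, "mean-like" outputs.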
Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision
TLDR: The paper builds a diffusion model with a "differentiable rendering function" (mainly PixelNeRF, https://alexyu.net/pixelnerf/) baked into the diffusion process.
Dataset requirement: a large collection of multi-view images (with known camera parameters) of many objects.
Theory: the authors prove that their model converges to maximizing the likelihood of producing both observed and unobserved views.
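In inverse-problem terms (notation mine, not necessarily the paper's): the underlying scene S is never observed directly, only its renderings through a known forward operator f, so the model is trained purely in observation space:

O = f(S, \phi), \qquad \text{learn } p_\theta\big(O^{target} \mid O^{ctx}, \phi^{ctx}, \phi^{target}\big) \text{ from pairs } (O, \phi) \text{ alone, with no ground-truth } S.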
Method:
Sample two different views of the same object from the dataset
Denote them (O^{ctx}, \phi^{ctx}) and (O^{target}, \phi^{target}). The context observation (image + camera parameters) is what the diffusion model conditions on; the target observation is what the diffusion model denoises.
The model architecture is, to a "first-order approximation", the same as an image diffusion model conditioned on another image, except that it adds (1) a PixelNeRF encoder as a scene-generator function (called "denoise" in the paper) and (2) the PixelNeRF rendering function (called "forward" in the paper).
The idea is: (1) add noise to the target O^{target} exactly as in image diffusion models; (2) instead of a UNet with cross-attention on the context, use (O^{ctx}, \phi^{ctx}) in a different way.
To use (O^{ctx}, \phi^{ctx}), we first encode O^{ctx} with PixelNeRF, which hopefully yields a latent representation that captures some idea of what the true 3D scene looks like (PixelNeRF's encoder is a pretrained ResNet34). Rendering this PixelNeRF latent at \phi^{target} gives a feature image aligned with O^{target}_t at any denoising timestep. So what we actually condition on is O^{ctx}, via its PixelNeRF latent and the rendering of that latent at \phi^{target} (see the sketch after this list).
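A minimal sketch of one training step under this reading; the module names (pixelnerf_enc, pixelnerf_render, denoiser) and the epsilon-prediction loss are my assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def training_step(o_ctx, cam_ctx, o_tgt, cam_tgt,
                  pixelnerf_enc, pixelnerf_render, denoiser, alphas_cumprod):
    """One simplified training step: denoise the target view conditioned on
    PixelNeRF features rendered at the target camera. A sketch, not the paper's code."""
    B = o_tgt.shape[0]
    T = alphas_cumprod.shape[0]

    # 1) Standard DDPM forward noising of the target view.
    t = torch.randint(0, T, (B,), device=o_tgt.device)
    eps = torch.randn_like(o_tgt)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    o_tgt_t = a_bar.sqrt() * o_tgt + (1.0 - a_bar).sqrt() * eps

    # 2) Encode the context image into a pixel-aligned 3D latent (PixelNeRF).
    feat_ctx = pixelnerf_enc(o_ctx)

    # 3) Volume-render that latent at the *target* camera, conditioned on the
    #    diffusion timestep t, giving a feature image aligned with o_tgt_t.
    cond = pixelnerf_render(feat_ctx, cam_ctx, cam_tgt, t)

    # 4) Predict the noise in o_tgt_t given the rendered conditioning features.
    eps_pred = denoiser(o_tgt_t, t, cond)
    return F.mse_loss(eps_pred, eps)
```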
Training:
It uses 24x24 patch training (sketched after this list)
Adds diffusion timestep conditioning to PixelNeRF
Training takes 3 to 7 days on 8xA100 GPUs on the RealEstate10K and CO3D datasets, at image resolutions from 64x64 to 128x128
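Rendering a full image through the NeRF branch at every training step would be expensive, which is presumably the reason for patch training. A minimal sketch of cropping a 24x24 patch together with its camera rays (shapes and names are my assumptions):

```python
import torch

def sample_patch(o_tgt, rays_o, rays_d, patch=24):
    """Crop a random patch from the target image together with its rays,
    so only those rays need to be rendered/denoised. Sketch only."""
    B, C, H, W = o_tgt.shape                        # image: (B, C, H, W)
    y0 = int(torch.randint(0, H - patch + 1, (1,)))
    x0 = int(torch.randint(0, W - patch + 1, (1,)))
    img_patch = o_tgt[:, :, y0:y0 + patch, x0:x0 + patch]
    rays_o_p = rays_o[:, y0:y0 + patch, x0:x0 + patch, :]  # origins:    (B, p, p, 3)
    rays_d_p = rays_d[:, y0:y0 + patch, x0:x0 + patch, :]  # directions: (B, p, p, 3)
    return img_patch, rays_o_p, rays_d_p
```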
Result: good, ... as expected
Discussion:
The work is valuable in that it adds a strong 3D prior to the image diffusion process and proves its correctness. The method generalizes to many stochastic inverse problems.
An advantage of using PixelNeRF over other ways of adding 3D consistency to diffusion (e.g., SyncDreamer, https://liuyuan-pal.github.io/SyncDreamer/) is that it avoids a 3D CNN.
Fair comparison: the authors compare their results with PixelNeRF, but their method already builds on the PixelNeRF prior and uses additional datasets.
To me, this method is not the future of 3D mesh generation, since it generates one image at a time; the generated images are not correlated with each other, so it would still suffer from high-variance image generation and thus long SDS inference times.
The paper is not well written and confuses the reader by using the term "forward", which can refer both to the noising process in diffusion and to the rendering function.