Questions:
SDS is supposed to "produce one single plausible output", but the results look more like a "mean" over outputs though...
Why model ray rotation instead of image rotation?
TLDR: The paper builds a diffusion model with a "differentiable rendering function" (mainly PixelNeRF, https://alexyu.net/pixelnerf/) baked into the denoising process.
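To make the TLDR concrete, here is a toy sketch of the key idea: the denoiser internally infers a 3D representation from context views and the current noisy image, then renders it, so the rendering function sits inside every denoising step. All functions here (`infer_3d`, `render`, the noise schedule) are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def infer_3d(context_views, noisy_view, t):
    """Hypothetical stand-in for the PixelNeRF-style encoder: maps
    context views plus the current noisy target view to a latent 3D
    representation (here just an average, so the example runs)."""
    return (context_views.mean(axis=0) + noisy_view) / 2.0

def render(latent3d, pose):
    """Hypothetical differentiable render function g(latent, pose);
    here a trivial pose-dependent scaling."""
    return latent3d * pose

def sample(context_views, pose, steps=50):
    """Ancestral-sampling-style loop where the 'denoiser' is
    render(infer_3d(...)) -- rendering baked into each step."""
    x = rng.standard_normal(context_views.shape[1:])
    for t in np.linspace(1.0, 0.0, steps):
        x0_hat = render(infer_3d(context_views, x, t), pose)
        # Crude stand-in for a real noise schedule:
        x = x0_hat + t * rng.standard_normal(x.shape)
    return x

views = rng.standard_normal((3, 8, 8))   # 3 context views, 8x8 pixels
img = sample(views, pose=1.0)
```

Because the sample is produced by rendering an inferred 3D representation, novel views of the same scene come from the same latent, which is what gives the method its 3D consistency.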
Dataset requirement: a large collection of multi-view images (with known camera parameters) of many objects.
Theory: The authors prove that their model converges to maximizing the likelihood of both observed and unobserved views.
Method:
Training:
Trains on 24×24 image patches
Adds diffusion timestep conditioning to PixelNeRF
3 to 7 days on 8×A100 GPUs on the RealEstate10K and CO3D datasets, at image resolutions from 64×64 to 128×128
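The two training details above can be sketched together: crop a random 24×24 patch per step, and feed the PixelNeRF MLP a sinusoidal embedding of the diffusion timestep alongside its per-point features. The embedding below is the standard sinusoidal form; the exact injection point (`conditioned_mlp_input` concatenating it to point features) is a hypothetical choice for illustration, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch(img, size=24):
    """Random square crop, as in patch-based training."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def timestep_embedding(t, dim=8):
    """Standard sinusoidal embedding of the diffusion timestep t."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def conditioned_mlp_input(point_feat, t, dim=8):
    """Hypothetical conditioning scheme: concatenate the timestep
    embedding to each per-point PixelNeRF feature before the MLP."""
    return np.concatenate([point_feat, timestep_embedding(t, dim)])

img = rng.standard_normal((64, 64, 3))
patch = random_patch(img)                      # 24x24x3 crop
feat = np.zeros(16)                            # dummy per-point feature
mlp_in = conditioned_mlp_input(feat, t=100)    # 16 + 8 = 24 dims
```

Patch training keeps memory manageable at NeRF-style per-ray cost, which is presumably why full-resolution images are avoided.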
Result: results are good... as expected
Discussion:
The work is valuable in that it adds a strong 3D prior to the image diffusion process and proves correctness. The method generalizes to many stochastic inverse problems.
An advantage of using PixelNeRF over other ways of adding 3D consistency to diffusion (e.g., https://liuyuan-pal.github.io/SyncDreamer/) is that it avoids 3D CNNs.
Fair comparison: The authors compare their results against PixelNeRF, but their method already builds on a PixelNeRF prior and uses additional datasets, so the comparison is not entirely fair.
To me, this method is not the future of 3D mesh generation: it generates one image at a time, so the generated images are not correlated with each other. Generation would therefore still suffer from high-variance images and thus long SDS inference times.
The paper is not well written: it confuses the reader by using the term "forward" for both the noising process in diffusion and the rendering function.