NeRF

A good paper feeds millions of researchers who can plagiarize some simple idea to publish papers.

CT Scan Reconstruction

CT Scan Reconstruction
MRI Reconstruction Course
Deep Learning for MRI
MRI Basics Part 1 - Image Formation

It uses an analytical method that relies on signal processing.

NeRF: Neural Radiance Fields for View Synthesis

NeRF Paper Reading: YouTube

Big Idea of NeRF: You get a bunch of images with labeled shooting directions. The algorithm gives you back a function that can be sampled to retrieve geometry information.

NeRF's objective is to reconstruct 3D geometry from an array of images. However, the product of a NeRF is not an actual geometry, but a neural network that represents both the rendering function and the geometry.

Radiance Fields: a space in \mathbb{R}^3 such that each point contains a spherical function instead of a single value. This essentially represents a 3D volume with view-dependent appearance, i.e., geometry data viewed from different angles. At each point it is a function from view direction to color.

Approach

Input: ((\theta, \phi), (x, y, z)). Output: ((r, g, b), \alpha). Loss: the difference between the \alpha-weighted accumulation of the network's color along the ray, \int_D f(\cdot)_\alpha \cdot f(\cdot)_{rgb}, and the ground-truth (r, g, b), where D is the segment along the ray.

Procedure:

  1. Prepare a bunch of images with known shooting directions. Break each image down into pixels, so you have a bunch of pixels, each associated with a 3D vector representing the shooting direction and the pixel's RGB value.
  2. Randomly initialize the weights of a neural network. The network represents the rendering function (viewing direction and position as input, color as output) with the geometry baked in. It can be viewed as a generic radiance-field function, except for the extra (x, y, z) input and the extra \alpha output needed by the architecture (for computing the loss). Initializing the weights amounts to picking a point in the function space.
  3. For each pixel, we feed the network the viewing direction (\theta, \phi) as a constant and integrate the network's output along (x, y, z) \in D (accumulating the results), weighting each sample by the estimated transparency \alpha from the output, to get a final color (r, g, b). We compare it with the ground-truth color and take the negative gradient (see the sketch below).

// QUESTION: I'm kinda not sure how exactly to integrate \alpha.
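
A minimal NumPy sketch of the discrete quadrature that is typically used for this integral (the arrays sigma, rgb, and deltas are hypothetical per-ray outputs of the network and sample spacings; this is only to make the \alpha accumulation concrete):

import numpy as np

def composite_ray(sigma, rgb, deltas):
    """Discrete volume rendering along one ray.

    sigma:  (N,)   non-negative densities at the N samples
    rgb:    (N, 3) colors at the N samples
    deltas: (N,)   distances between consecutive samples
    Returns the composited pixel color (3,) and the per-sample weights (N,).
    """
    # alpha_i = 1 - exp(-sigma_i * delta_i): chance the ray stops in segment i
    alpha = 1.0 - np.exp(-sigma * deltas)
    # T_i = prod_{j<i} (1 - alpha_j): chance the ray survives up to segment i
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = T * alpha                    # h_i, the discrete termination distribution
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights

# toy usage: 4 samples along a ray
sigma = np.array([0.0, 0.1, 5.0, 5.0])
rgb = np.array([[0.0, 0.0, 0.0], [0.2, 0.2, 0.2], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
deltas = np.full(4, 0.25)
print(composite_ray(sigma, rgb, deltas))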

We don't pick sample locations along the ray uniformly. We do two passes: the first (coarse) pass uses uniform sample locations, and the second (fine) pass concentrates samples near the surface of the object.
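
A sketch of how the second pass can concentrate samples near the surface, assuming the coarse pass already produced termination weights at depth bins (e.g., the weights from the sketch above); this is just inverse-transform sampling of the piecewise-constant PDF, and n_fine is a hypothetical sample count:

import numpy as np

def sample_pdf(bins, weights, n_fine, rng=None):
    """Draw n_fine depths concentrated where the coarse weights are large.

    bins:    (N+1,) edges of the coarse depth intervals
    weights: (N,)   coarse-pass weights (need not be normalized)
    """
    rng = rng or np.random.default_rng()
    pdf = weights / (weights.sum() + 1e-8)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])       # (N+1,)
    u = rng.uniform(size=n_fine)                        # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1     # which interval each u falls into
    idx = np.clip(idx, 0, len(weights) - 1)
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    frac = (u - cdf[idx]) / denom                       # linear placement inside the interval
    return bins[idx] + frac * (bins[idx + 1] - bins[idx])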

However, if you just do that, the result will be poor, because plain MLPs have a hard time fitting high-frequency detail (they are biased toward smooth functions). Below is an example of a network trained with input (x, y) and output (r, g, b); the result is not great.

Sinusoid Encoding explained in "Fourier Features Let Networks Learn High Frequency"

So the idea is to split the input signal into different frequency bands, which, in a sense, amplifies the loss on higher frequencies. The same positional-encoding strategy can be found in Transformers.
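
A minimal sketch of the sinusoidal (positional) encoding NeRF applies to its inputs; num_freqs is a hypothetical hyperparameter (NeRF uses 10 frequency bands for positions and 4 for directions):

import numpy as np

def positional_encoding(x, num_freqs):
    """Map each coordinate to [sin(2^k pi x), cos(2^k pi x)] for k = 0..num_freqs-1.

    x: (..., D) raw coordinates, ideally normalized to roughly [-1, 1]
    Returns: (..., 2 * num_freqs * D) encoded features.
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi      # increasing frequency bands
    angles = x[..., None] * freqs                    # (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

print(positional_encoding(np.array([[0.5, -0.25, 0.1]]), num_freqs=4).shape)  # (1, 24)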

Further Readings: Fourier Features Let Networks Learn High Frequency

NTK (Neural Tangent Kernel): an infinitely wide fully connected network, initialized with reasonable weights and trained with infinitesimally small steps. It is a good mathematical tool for gaining insight into fully connected layers.

How do we differentiate rendering with respect to density? We know:

\hat{C} = \int_0^\infty T(t) \alpha(t) c(t) dt

which says that the final color we render onto the screen is the integral of the transmittance T (how much of the ray survives occlusion) multiplied by the density \alpha and the color c, where the transmittance (occlusion) is itself an integral:

T(t) = \exp(-\int_0^t \alpha(s) ds)

You should think of the density as governing the probability that the ray terminates at a specific point along the ray. Then h(t) := T(t)\alpha(t) is the probability that the ray terminates at time t, and we need the ray to terminate somewhere: \int T(t)\alpha(t) dt = 1. Proving that h(t) is indeed a distribution is a small math exercise (see below).
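
Here is that exercise worked out, assuming \int_0^\infty \alpha(s) ds = \infty (i.e. the ray is guaranteed to terminate somewhere). Since \frac{d}{dt} T(t) = -\alpha(t) T(t),

\int_0^\infty T(t)\alpha(t)\, dt = \int_0^\infty -\frac{d}{dt}T(t)\, dt = T(0) - \lim_{t \to \infty} T(t) = 1 - 0 = 1

and T(t)\alpha(t) \geq 0, so h(t) is a valid probability distribution.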

Due to practical constraints, NeRFs assume that the scene lies between near and far bounds (t_n, t_f).

We often write

\hat{C} = \int_0^\infty h(t)c(t) dt = E_{h(t)}[c(t)]

Camera Convention

Camera Conventions

Why does the focal length in the camera intrinsics matrix have two dimensions?

In the pinhole camera model there is only one focal length, which is the distance between the principal point and the camera center.

However, after calculating the camera's intrinsic parameters, the matrix contains

(fx,  0,  offsetx,  0,
 0,  fy,  offsety,  0,
 0,   0,  1,        0)

where fx and fy indicate the focal length along the x and y directions. This is the OpenCV convention and also appears in Plenoxels' code.

struct PackedCameraSpec {
    PackedCameraSpec(CameraSpec& cam) :
        c2w(cam.c2w.packed_accessor32<float, 2, torch::RestrictPtrTraits>()),
        fx(cam.fx), fy(cam.fy),
        cx(cam.cx), cy(cam.cy),
        width(cam.width), height(cam.height),
        ndc_coeffx(cam.ndc_coeffx), ndc_coeffy(cam.ndc_coeffy) {}
    const torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits>
        c2w;
    float fx;
    float fy;
    float cx;
    float cy;
    int width;
    int height;

    float ndc_coeffx;
    float ndc_coeffy;
};

sx is the scaling factor in the x-direction (horizontal); it scales the horizontal dimension of the image plane.

sy is the scaling factor in the y-direction (vertical); it scales the vertical dimension of the image plane.

So the calculation looks like

fx = f*sx
fy = f*sy

sx = 10 pixel/mm (how many pixels per millimeter, i.e. the conversion factor)
f = 35mm (focal length)
then fx = sx*f = 35mm * 10 pixel/mm = 350 pixels

So fx is the focal length expressed in pixels, where the pixel unit is defined by the pixel's physical size in the x direction.
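
As an illustration of how fx, fy, cx, cy are typically used, here is a sketch of back-projecting a pixel into a world-space ray, assuming the OpenCV/Plenoxels-style intrinsics above (the helper name pixel_to_ray is made up):

import numpy as np

def pixel_to_ray(u, v, fx, fy, cx, cy, c2w):
    """Back-project pixel (u, v) into a world-space ray under a pinhole model.

    c2w: (3, 4) camera-to-world matrix [R | t].
    Returns (origin, direction) of the ray in world coordinates.
    """
    # direction in camera coordinates (camera looks down +z here; some
    # conventions, e.g. OpenGL/Blender-style NeRF cameras, use -z instead)
    dir_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    dir_world = c2w[:, :3] @ dir_cam
    dir_world /= np.linalg.norm(dir_world)
    origin = c2w[:, 3]
    return origin, dir_world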

Norm

The actual loss uses the L2 norm, which is a topic of its own, but I will briefly mention it here.

The L2 norm is the Euclidean distance we are familiar with: given a vector X, its size (norm) can be written as

(\sum_{i = 1}^k |X_i|^2)^\frac{1}{2}

and similarly, the L1 norm is the Manhattan distance

(\sum_{i = 1}^k |X_i|^1)^\frac{1}{1} = \sum_{i = 1}^k |X_i|

So now we can define the Ln norm as follows:

(\sum_{i = 1}^k |X_i|^n)^\frac{1}{n}
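
As a quick sanity check of these formulas on a concrete weight vector (NumPy's np.linalg.norm computes the same quantities):

import numpy as np

w = np.array([3.0, -4.0, 0.0, 1.0])

l1 = np.abs(w).sum()                   # Manhattan distance: 8.0
l2 = np.sqrt((np.abs(w) ** 2).sum())   # Euclidean distance: ~5.10
l4 = (np.abs(w) ** 4).sum() ** 0.25    # general Ln with n = 4
linf = np.abs(w).max()                 # limit as n -> infinity: 4.0

print(l1, l2, l4, linf)
print(np.linalg.norm(w, 1), np.linalg.norm(w, 2), np.linalg.norm(w, 4), np.linalg.norm(w, np.inf))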

But why is it useful? Why does it matter so much in machine learning? We need to penalize weight vectors with large norms because they lead to overfitting. To see why, imagine your weights w_0, w_1, w_2, ... are used in the following way:

f(x) = w_0 + w_1x^1 + w_2x^2 + ...

Then we want the polynomial to have a lower degree, so we want, for example, to set as many w_i to 0 as possible. But hard-setting weights to zero is not differentiable, so we instead require the vector [w_0, w_1, w_2, ...]^T to be "small". What formula measures "small"? That's the norm.

L_1, L_2, L_p, L_\infty: each point on the graph represents a vector [w_0, w_1]^T with the same norm (i.e., it receives the same penalty in the loss)

When you add a regularization term in your loss function, you get the following landscape.

Loss landscape for regularization

And because of this, our neural network might favor one weight over the other when using different regularization.

Different regularizations favor different [w_0, w_1]^T. Notice L_1 is a very strong regularization that enforces fewer parameters; L_2 is a bit more relaxed.

The L1-regularized solution is sparse; the L2-regularized solution is non-sparse (although in practical ML we never look for an exact solution). L2 regularization doesn't perform feature selection, since weights are only reduced to values near 0 instead of exactly 0. L1 regularization has built-in feature selection.

Mean Squared Error (MSE) is another common place where the L1 and L2 norms appear: the function below is MSE with a \lambda-weighted L2 regularization term.

MSE = \frac{1}{n}\sum_{i = 1}^n (\hat{y}_i(x_i, W) - y_i)^2 + \lambda \sum_{i = 1}^k |w_i|^2

Dirac delta function

The delta function was introduced by physicist Paul Dirac as a tool for the normalization of state vectors. It also has uses in probability theory and signal processing.

The Dirac delta as the limit as a \to 0

In terms of NeRF, ideally we want the termination distribution along the ray to look like this (so instead of volumetric rendering, we only take the color at the exact ray hit):

\delta(x) \simeq \begin{cases} \infty & x = 0\\ 0 & x \neq 0\\ \end{cases}

For most objects, the variance of this distribution decreases with more training views from different directions (in terms of the \delta function, this means a \to 0).

Various components of the ray rendering equation plotted along the ray

Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields (ICCV 2021)

Mip-NeRF: it integrates over the entire cone cross-section using a weighted 3D Gaussian-like distribution

Big Idea: deal with small, real-life images that suffer from aliasing and low resolution. It also improves NeRF by not requiring the camera to be centered on the object's central location.

Mip-NeRF sampling result: it keeps low frequencies while averaging high frequencies. This way the expected high-frequency content can be computed more accurately, especially with non-uniform sampling.

// TODO: what is blurpool

It uses a characteristic function to determine whether a point is inside the cone, and then calculates the (complex) 3D Gaussian expectation with Fourier-transformed coordinates. // TODO: think about the math if you have time

NeRF++: Analyzing and Improving Neural Radiance Fields

Idea: the reason the MLP works is that the layers serve as a prior assuming smoothness of color (and therefore shape) on the geometry surface.

Data flow design of original NeRF

In the original design of NeRF, the color c is smoother with respect to the direction d than with respect to the position x. Therefore, feeding the input d into later layers helps the accuracy of the model.

Background and Foreground Tradeoff in Original NeRF

For 360 degree captures of unbounded scenes, NeRF’s parameterization of space either models only a portion of the scene, leading to significant artifacts in background elements (a), or models the full scene and suffers from an overall loss of detail due to finite sampling resolution (b).

So we separate NeRF into two MLPs: one for the foreground and one for the background.

Re-Parameterization for Background Scene

We re-parameterize the location encoding for the background scene so that space outside the sphere is encoded as a 4D coordinate whose 4th component, 1/r, decreases with distance. The idea is that the originally sparse encoding of the background becomes denser, so more images can contribute to the color of the far background and resolve the background ambiguity.
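
A minimal sketch of this re-parameterization as I understand it from the NeRF++ paper (a background point is mapped to a point on the unit sphere plus an inverse radius, so all of the background lands in a bounded region):

import numpy as np

def invert_outside_unit_sphere(p):
    """Map a background point p (with ||p|| > 1) to the 4D coordinate (x', y', z', 1/r),
    where (x', y', z') lies on the unit sphere. Distant points, no matter how far,
    land in a bounded region, so the positional encoding covers them densely."""
    r = np.linalg.norm(p)
    return np.concatenate([p / r, [1.0 / r]])

print(invert_outside_unit_sphere(np.array([0.0, 0.0, 10.0])))  # [0, 0, 1, 0.1]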

Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields (CVPR 2022)

Re-Parameterization: shrink background but keep monotonicity in sphere of radius 2

Parameterization: roughly Mip-NeRF plus another way of doing what NeRF++ does.
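
A sketch of the contraction the figure above refers to (the scene-contraction function from the Mip-NeRF 360 paper, up to my reading of it): points inside the unit ball are untouched, everything else is squashed monotonically into a ball of radius 2.

import numpy as np

def contract(x):
    """Mip-NeRF 360 style contraction: identity inside the unit ball,
    (2 - 1/||x||) * x/||x|| outside, so all of space fits in radius 2."""
    n = np.linalg.norm(x)
    if n <= 1.0:
        return x
    return (2.0 - 1.0 / n) * (x / n)

print(contract(np.array([0.5, 0.0, 0.0])))    # unchanged
print(contract(np.array([100.0, 0.0, 0.0])))  # close to radius 2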

The first network only provides density

Convergence: separate into two networks (but one gradient pass): the first provides only density, and the second provides density and color, using the first network's density to guide sampling and reduce training cost.

// QUESTION: don't quite understand how this would work

Distillation: // TODO: didn't read

PlenOctrees for Real-time Rendering of Neural Radiance Fields

Background and Survey

Representing Geometry as Multi-plane Images

Representing Geometry as Voxels

The above methods are topology-free and can be rendered in real time, but they are memory intensive (they can't capture detailed resolution).

Coordinate-Based Neural Networks

NeRFs can be sampled at arbitrary resolution since the function is continuous. However, they are slow to train and test. Methods to accelerate NeRF include

  1. train on datasets of similar scenes
  2. skip empty region during testing (Neural Sparse Voxel Fields)
  3. decompose scene into smaller networks (Decomposed Radiance Fields)
  4. Quality Tradeoff (AutoInt)
  5. ...

PlenOctrees

Octree Geometry Compression

Spherical harmonics: used to speed up the process of converting a NeRF to a PlenOctree, because the view-dependent calculation is moved to evaluation time instead of PlenOctree-conversion time (the model also looks cleaner).
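
To illustrate why this helps: each PlenOctree leaf stores spherical-harmonic coefficients per color channel, and view-dependent color is just a dot product with the SH basis evaluated at the query direction. A sketch using only the first two SH bands (the real-basis constants below are the standard ones; PlenOctrees itself uses higher orders):

import numpy as np

def sh_basis_l1(d):
    """Real spherical-harmonic basis up to degree 1 for a unit direction d."""
    x, y, z = d
    return np.array([
        0.282095,          # Y_0^0
        0.488603 * y,      # Y_1^-1
        0.488603 * z,      # Y_1^0
        0.488603 * x,      # Y_1^1
    ])

def sh_color(coeffs, d):
    """coeffs: (3, 4) per-channel SH coefficients stored in a leaf."""
    return coeffs @ sh_basis_l1(d)   # view-dependent RGB, no MLP evaluation needed

coeffs = np.array([[1.0, 0.2, 0.0, 0.0],
                   [0.5, 0.0, 0.1, 0.0],
                   [0.2, 0.0, 0.0, 0.3]])
print(sh_color(coeffs, np.array([0.0, 0.0, 1.0])))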

Comments:

Plenoxels: Radiance Fields without Neural Networks

Normalized device coordinates: ?

multi-sphere images: ?

Trilinear interpolation is crucial: it converts the discrete representation into a continuous one so the reconstruction loss can be minimized.
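
A minimal sketch of trilinear interpolation on a dense grid (the grid layout is hypothetical; Plenoxels does something like this for every sample point, for both density and SH coefficients):

import numpy as np

def trilinear(grid, p):
    """Interpolate grid values at a continuous point p.

    grid: (X, Y, Z, C) values stored at integer vertices
    p:    (3,) continuous coordinate in voxel units
    """
    i0 = np.floor(p).astype(int)
    i0 = np.clip(i0, 0, np.array(grid.shape[:3]) - 2)
    t = p - i0                                   # fractional offsets in [0, 1)
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out

grid = np.random.rand(4, 4, 4, 1)
print(trilinear(grid, np.array([1.5, 2.25, 0.75])))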

// QUESTION: I don't understand how optimizing the voxel coefficients and the regularization formula work; haven't read into it.

Comments:

Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

Learned Hash Table

Procedure:

  1. We integrate along the ray D the same way as in NeRF; however, for each point (x, y, z) we compute its value by interpolating the values of nearby grid vertices (with a coarse-to-fine multi-level grid) that are pseudo-randomly distributed in a hash table.
  2. Since the entries are pseudo-randomly distributed, we need a neural network to assign the available hash cells to vertices. This way, the neural network decides where each hash cell should be used in the geometry and also computes the RGBD target from the interpolated latent space (see the sketch below).
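
A toy sketch of the per-level lookup, assuming learned per-level feature tables and the spatial hash with the primes from the instant-ngp paper; the real implementation uses dense grids at coarse levels and a fused CUDA kernel, so this only shows the idea:

import numpy as np

# primes from the instant-ngp spatial hash (pi_1 = 1)
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_vertex(v, table_size):
    """Hash an integer 3D grid vertex into [0, table_size)."""
    v = v.astype(np.uint64)
    return int((v[0] * PRIMES[0]) ^ (v[1] * PRIMES[1]) ^ (v[2] * PRIMES[2])) % table_size

def encode_level(p, resolution, table):
    """Interpolated feature for point p in [0, 1]^3 at one hash-grid level.

    table: (table_size, F) learned feature vectors (trained by backprop).
    """
    x = p * resolution
    i0 = np.floor(x).astype(int)
    t = x - i0
    feat = np.zeros(table.shape[1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                idx = hash_vertex(i0 + np.array([dx, dy, dz]), len(table))
                feat += w * table[idx]
    return feat

# coarse-to-fine: concatenate features from several resolutions, then feed an MLP
tables = [np.random.randn(2**14, 2) for _ in range(4)]
resolutions = [16, 32, 64, 128]
p = np.array([0.3, 0.7, 0.5])
features = np.concatenate([encode_level(p, r, T) for r, T in zip(resolutions, tables)])
print(features.shape)  # (8,)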

Morton Code

Morton code is used to map an n-dimensional space to a linear index. It defines a Z-shaped space-filling curve, which preserves n-dimensional locality.

Octree Using Morton Code

Assume we have already stored a 3D map (each cell holding an int) into a Morton-coded array, and we want to extract the value at (x, y, z) = (5, 9, 1) = (0101b, 1001b, 0001b). There is an easy way to find the array position to look up: (010 001 000 111b). This is because we interleave the bits of z, y, x, from the most significant bit of each dimension down to the least significant, so each group of three bits holds one bit of z, y, x (in that order).

Morton Code 3D Conversion: above is the 3D coordinate and below is the corresponding 1D coordinate

Note that to invert a Morton code back to a 3D coordinate, we only need to pass code >> 0 for x, code >> 1 for y, and code >> 2 for z into the same bit-compaction (decoding) function.

With a for loop, we can write the encoder like this:

#include <stdint.h>
#include <limits.h>
using namespace std;

inline uint64_t mortonEncode_for(unsigned int x, unsigned int y, unsigned int z) {
  uint64_t answer = 0;
  for (uint64_t i = 0; i < (sizeof(uint64_t)* CHAR_BIT)/3; ++i) {
    answer |= ((x & ((uint64_t)1 << i)) << 2*i) | ((y & ((uint64_t)1 << i)) << (2*i + 1)) | ((z & ((uint64_t)1 << i)) << (2*i + 2));
  }
  return answer;
}

To achieve better performance, we could use magic bits:

#include <stdint.h>
#include <limits.h>
using namespace std;

// method to separate the bits of a given integer 3 positions apart
inline uint64_t splitBy3(unsigned int a){
  uint64_t x = a & 0x1fffff; // we only look at the first 21 bits
  x = (x | x << 32) & 0x1f00000000ffff; // shift left 32 bits, OR with self, and mask
  x = (x | x << 16) & 0x1f0000ff0000ff; // shift left 16 bits, OR with self, and mask
  x = (x | x << 8) & 0x100f00f00f00f00f; // shift left 8 bits, OR with self, and mask
  x = (x | x << 4) & 0x10c30c30c30c30c3; // shift left 4 bits, OR with self, and mask
  x = (x | x << 2) & 0x1249249249249249; // shift left 2 bits, OR with self, and mask
  return x;
}

inline uint64_t mortonEncode_magicbits(unsigned int x, unsigned int y, unsigned int z){
  uint64_t answer = 0;
  answer |= splitBy3(x) | splitBy3(y) << 1 | splitBy3(z) << 2;
  return answer;
}

or we can use a giant lookup table to achieve the best performance

#include <stdint.h>
#include <limits.h>
using namespace std;

static const uint32_t morton256_x[256] = {
0x00000000,
0x00000001, 0x00000008, 0x00000009, 0x00000040, 0x00000041, 0x00000048, 0x00000049, 0x00000200,
0x00000201, 0x00000208, 0x00000209, 0x00000240, 0x00000241, 0x00000248, 0x00000249, 0x00001000,
0x00001001, 0x00001008, 0x00001009, 0x00001040, 0x00001041, 0x00001048, 0x00001049, 0x00001200,
0x00001201, 0x00001208, 0x00001209, 0x00001240, 0x00001241, 0x00001248, 0x00001249, 0x00008000,
0x00008001, 0x00008008, 0x00008009, 0x00008040, 0x00008041, 0x00008048, 0x00008049, 0x00008200,
0x00008201, 0x00008208, 0x00008209, 0x00008240, 0x00008241, 0x00008248, 0x00008249, 0x00009000,
0x00009001, 0x00009008, 0x00009009, 0x00009040, 0x00009041, 0x00009048, 0x00009049, 0x00009200,
0x00009201, 0x00009208, 0x00009209, 0x00009240, 0x00009241, 0x00009248, 0x00009249, 0x00040000,
0x00040001, 0x00040008, 0x00040009, 0x00040040, 0x00040041, 0x00040048, 0x00040049, 0x00040200,
0x00040201, 0x00040208, 0x00040209, 0x00040240, 0x00040241, 0x00040248, 0x00040249, 0x00041000,
0x00041001, 0x00041008, 0x00041009, 0x00041040, 0x00041041, 0x00041048, 0x00041049, 0x00041200,
0x00041201, 0x00041208, 0x00041209, 0x00041240, 0x00041241, 0x00041248, 0x00041249, 0x00048000,
0x00048001, 0x00048008, 0x00048009, 0x00048040, 0x00048041, 0x00048048, 0x00048049, 0x00048200,
0x00048201, 0x00048208, 0x00048209, 0x00048240, 0x00048241, 0x00048248, 0x00048249, 0x00049000,
0x00049001, 0x00049008, 0x00049009, 0x00049040, 0x00049041, 0x00049048, 0x00049049, 0x00049200,
0x00049201, 0x00049208, 0x00049209, 0x00049240, 0x00049241, 0x00049248, 0x00049249, 0x00200000,
0x00200001, 0x00200008, 0x00200009, 0x00200040, 0x00200041, 0x00200048, 0x00200049, 0x00200200,
0x00200201, 0x00200208, 0x00200209, 0x00200240, 0x00200241, 0x00200248, 0x00200249, 0x00201000,
0x00201001, 0x00201008, 0x00201009, 0x00201040, 0x00201041, 0x00201048, 0x00201049, 0x00201200,
0x00201201, 0x00201208, 0x00201209, 0x00201240, 0x00201241, 0x00201248, 0x00201249, 0x00208000,
0x00208001, 0x00208008, 0x00208009, 0x00208040, 0x00208041, 0x00208048, 0x00208049, 0x00208200,
0x00208201, 0x00208208, 0x00208209, 0x00208240, 0x00208241, 0x00208248, 0x00208249, 0x00209000,
0x00209001, 0x00209008, 0x00209009, 0x00209040, 0x00209041, 0x00209048, 0x00209049, 0x00209200,
0x00209201, 0x00209208, 0x00209209, 0x00209240, 0x00209241, 0x00209248, 0x00209249, 0x00240000,
0x00240001, 0x00240008, 0x00240009, 0x00240040, 0x00240041, 0x00240048, 0x00240049, 0x00240200,
0x00240201, 0x00240208, 0x00240209, 0x00240240, 0x00240241, 0x00240248, 0x00240249, 0x00241000,
0x00241001, 0x00241008, 0x00241009, 0x00241040, 0x00241041, 0x00241048, 0x00241049, 0x00241200,
0x00241201, 0x00241208, 0x00241209, 0x00241240, 0x00241241, 0x00241248, 0x00241249, 0x00248000,
0x00248001, 0x00248008, 0x00248009, 0x00248040, 0x00248041, 0x00248048, 0x00248049, 0x00248200,
0x00248201, 0x00248208, 0x00248209, 0x00248240, 0x00248241, 0x00248248, 0x00248249, 0x00249000,
0x00249001, 0x00249008, 0x00249009, 0x00249040, 0x00249041, 0x00249048, 0x00249049, 0x00249200,
0x00249201, 0x00249208, 0x00249209, 0x00249240, 0x00249241, 0x00249248, 0x00249249
};

// pre-shifted table for Y coordinates (1 bit to the left)
static const uint32_t morton256_y[256] = {
0x00000000,
0x00000002, 0x00000010, 0x00000012, 0x00000080, 0x00000082, 0x00000090, 0x00000092, 0x00000400,
0x00000402, 0x00000410, 0x00000412, 0x00000480, 0x00000482, 0x00000490, 0x00000492, 0x00002000,
0x00002002, 0x00002010, 0x00002012, 0x00002080, 0x00002082, 0x00002090, 0x00002092, 0x00002400,
0x00002402, 0x00002410, 0x00002412, 0x00002480, 0x00002482, 0x00002490, 0x00002492, 0x00010000,
0x00010002, 0x00010010, 0x00010012, 0x00010080, 0x00010082, 0x00010090, 0x00010092, 0x00010400,
0x00010402, 0x00010410, 0x00010412, 0x00010480, 0x00010482, 0x00010490, 0x00010492, 0x00012000,
0x00012002, 0x00012010, 0x00012012, 0x00012080, 0x00012082, 0x00012090, 0x00012092, 0x00012400,
0x00012402, 0x00012410, 0x00012412, 0x00012480, 0x00012482, 0x00012490, 0x00012492, 0x00080000,
0x00080002, 0x00080010, 0x00080012, 0x00080080, 0x00080082, 0x00080090, 0x00080092, 0x00080400,
0x00080402, 0x00080410, 0x00080412, 0x00080480, 0x00080482, 0x00080490, 0x00080492, 0x00082000,
0x00082002, 0x00082010, 0x00082012, 0x00082080, 0x00082082, 0x00082090, 0x00082092, 0x00082400,
0x00082402, 0x00082410, 0x00082412, 0x00082480, 0x00082482, 0x00082490, 0x00082492, 0x00090000,
0x00090002, 0x00090010, 0x00090012, 0x00090080, 0x00090082, 0x00090090, 0x00090092, 0x00090400,
0x00090402, 0x00090410, 0x00090412, 0x00090480, 0x00090482, 0x00090490, 0x00090492, 0x00092000,
0x00092002, 0x00092010, 0x00092012, 0x00092080, 0x00092082, 0x00092090, 0x00092092, 0x00092400,
0x00092402, 0x00092410, 0x00092412, 0x00092480, 0x00092482, 0x00092490, 0x00092492, 0x00400000,
0x00400002, 0x00400010, 0x00400012, 0x00400080, 0x00400082, 0x00400090, 0x00400092, 0x00400400,
0x00400402, 0x00400410, 0x00400412, 0x00400480, 0x00400482, 0x00400490, 0x00400492, 0x00402000,
0x00402002, 0x00402010, 0x00402012, 0x00402080, 0x00402082, 0x00402090, 0x00402092, 0x00402400,
0x00402402, 0x00402410, 0x00402412, 0x00402480, 0x00402482, 0x00402490, 0x00402492, 0x00410000,
0x00410002, 0x00410010, 0x00410012, 0x00410080, 0x00410082, 0x00410090, 0x00410092, 0x00410400,
0x00410402, 0x00410410, 0x00410412, 0x00410480, 0x00410482, 0x00410490, 0x00410492, 0x00412000,
0x00412002, 0x00412010, 0x00412012, 0x00412080, 0x00412082, 0x00412090, 0x00412092, 0x00412400,
0x00412402, 0x00412410, 0x00412412, 0x00412480, 0x00412482, 0x00412490, 0x00412492, 0x00480000,
0x00480002, 0x00480010, 0x00480012, 0x00480080, 0x00480082, 0x00480090, 0x00480092, 0x00480400,
0x00480402, 0x00480410, 0x00480412, 0x00480480, 0x00480482, 0x00480490, 0x00480492, 0x00482000,
0x00482002, 0x00482010, 0x00482012, 0x00482080, 0x00482082, 0x00482090, 0x00482092, 0x00482400,
0x00482402, 0x00482410, 0x00482412, 0x00482480, 0x00482482, 0x00482490, 0x00482492, 0x00490000,
0x00490002, 0x00490010, 0x00490012, 0x00490080, 0x00490082, 0x00490090, 0x00490092, 0x00490400,
0x00490402, 0x00490410, 0x00490412, 0x00490480, 0x00490482, 0x00490490, 0x00490492, 0x00492000,
0x00492002, 0x00492010, 0x00492012, 0x00492080, 0x00492082, 0x00492090, 0x00492092, 0x00492400,
0x00492402, 0x00492410, 0x00492412, 0x00492480, 0x00492482, 0x00492490, 0x00492492
};

// Pre-shifted table for z (2 bits to the left)
static const uint32_t morton256_z[256] = {
0x00000000,
0x00000004, 0x00000020, 0x00000024, 0x00000100, 0x00000104, 0x00000120, 0x00000124, 0x00000800,
0x00000804, 0x00000820, 0x00000824, 0x00000900, 0x00000904, 0x00000920, 0x00000924, 0x00004000,
0x00004004, 0x00004020, 0x00004024, 0x00004100, 0x00004104, 0x00004120, 0x00004124, 0x00004800,
0x00004804, 0x00004820, 0x00004824, 0x00004900, 0x00004904, 0x00004920, 0x00004924, 0x00020000,
0x00020004, 0x00020020, 0x00020024, 0x00020100, 0x00020104, 0x00020120, 0x00020124, 0x00020800,
0x00020804, 0x00020820, 0x00020824, 0x00020900, 0x00020904, 0x00020920, 0x00020924, 0x00024000,
0x00024004, 0x00024020, 0x00024024, 0x00024100, 0x00024104, 0x00024120, 0x00024124, 0x00024800,
0x00024804, 0x00024820, 0x00024824, 0x00024900, 0x00024904, 0x00024920, 0x00024924, 0x00100000,
0x00100004, 0x00100020, 0x00100024, 0x00100100, 0x00100104, 0x00100120, 0x00100124, 0x00100800,
0x00100804, 0x00100820, 0x00100824, 0x00100900, 0x00100904, 0x00100920, 0x00100924, 0x00104000,
0x00104004, 0x00104020, 0x00104024, 0x00104100, 0x00104104, 0x00104120, 0x00104124, 0x00104800,
0x00104804, 0x00104820, 0x00104824, 0x00104900, 0x00104904, 0x00104920, 0x00104924, 0x00120000,
0x00120004, 0x00120020, 0x00120024, 0x00120100, 0x00120104, 0x00120120, 0x00120124, 0x00120800,
0x00120804, 0x00120820, 0x00120824, 0x00120900, 0x00120904, 0x00120920, 0x00120924, 0x00124000,
0x00124004, 0x00124020, 0x00124024, 0x00124100, 0x00124104, 0x00124120, 0x00124124, 0x00124800,
0x00124804, 0x00124820, 0x00124824, 0x00124900, 0x00124904, 0x00124920, 0x00124924, 0x00800000,
0x00800004, 0x00800020, 0x00800024, 0x00800100, 0x00800104, 0x00800120, 0x00800124, 0x00800800,
0x00800804, 0x00800820, 0x00800824, 0x00800900, 0x00800904, 0x00800920, 0x00800924, 0x00804000,
0x00804004, 0x00804020, 0x00804024, 0x00804100, 0x00804104, 0x00804120, 0x00804124, 0x00804800,
0x00804804, 0x00804820, 0x00804824, 0x00804900, 0x00804904, 0x00804920, 0x00804924, 0x00820000,
0x00820004, 0x00820020, 0x00820024, 0x00820100, 0x00820104, 0x00820120, 0x00820124, 0x00820800,
0x00820804, 0x00820820, 0x00820824, 0x00820900, 0x00820904, 0x00820920, 0x00820924, 0x00824000,
0x00824004, 0x00824020, 0x00824024, 0x00824100, 0x00824104, 0x00824120, 0x00824124, 0x00824800,
0x00824804, 0x00824820, 0x00824824, 0x00824900, 0x00824904, 0x00824920, 0x00824924, 0x00900000,
0x00900004, 0x00900020, 0x00900024, 0x00900100, 0x00900104, 0x00900120, 0x00900124, 0x00900800,
0x00900804, 0x00900820, 0x00900824, 0x00900900, 0x00900904, 0x00900920, 0x00900924, 0x00904000,
0x00904004, 0x00904020, 0x00904024, 0x00904100, 0x00904104, 0x00904120, 0x00904124, 0x00904800,
0x00904804, 0x00904820, 0x00904824, 0x00904900, 0x00904904, 0x00904920, 0x00904924, 0x00920000,
0x00920004, 0x00920020, 0x00920024, 0x00920100, 0x00920104, 0x00920120, 0x00920124, 0x00920800,
0x00920804, 0x00920820, 0x00920824, 0x00920900, 0x00920904, 0x00920920, 0x00920924, 0x00924000,
0x00924004, 0x00924020, 0x00924024, 0x00924100, 0x00924104, 0x00924120, 0x00924124, 0x00924800,
0x00924804, 0x00924820, 0x00924824, 0x00924900, 0x00924904, 0x00924920, 0x00924924
};

inline uint64_t mortonEncode_LUT(unsigned int x, unsigned int y, unsigned int z){
  uint64_t answer = 0;
  answer = morton256_z[(z >> 16) & 0xFF ] | // we start by shifting the third byte, since we only look at the first 21 bits
  morton256_y[(y >> 16) & 0xFF ] |
  morton256_x[(x >> 16) & 0xFF ];
  answer = answer << 48 | morton256_z[(z >> 8) & 0xFF ] | // shifting second byte
  morton256_y[(y >> 8) & 0xFF ] |
  morton256_x[(x >> 8) & 0xFF ];
  answer = answer << 24 |
  morton256_z[(z) & 0xFF ] | // first byte
  morton256_y[(y) & 0xFF ] |
  morton256_x[(x) & 0xFF ];
  return answer;
}

Here is the source code for octree:

// Expands a 10-bit integer into 30 bits
// by inserting 2 zeros after each bit.
__host__ __device__ inline uint32_t expand_bits(uint32_t v) {
  v = (v * 0x00010001u) & 0xFF0000FFu;
  v = (v * 0x00000101u) & 0x0F00F00Fu;
  v = (v * 0x00000011u) & 0xC30C30C3u;
  v = (v * 0x00000005u) & 0x49249249u;
  return v;
}

// Calculates a 30-bit Morton code for the
// given 3D point located within the unit cube [0,1].
__host__ __device__ inline uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
  uint32_t xx = expand_bits(x);
  uint32_t yy = expand_bits(y);
  uint32_t zz = expand_bits(z);
  return xx | (yy << 1) | (zz << 2);
}

__host__ __device__ inline uint32_t morton3D_invert(uint32_t x) {
  x = x               & 0x49249249;
  x = (x | (x >> 2))  & 0xc30c30c3;
  x = (x | (x >> 4))  & 0x0f00f00f;
  x = (x | (x >> 8))  & 0xff0000ff;
  x = (x | (x >> 16)) & 0x0000ffff;
  return x;
}

For details, read Out-of-Core Construction of Sparse Voxel Octrees and Morton encoding/decoding through bit interleaving: Implementations

Linear Congruential Generator

idx = ((i+step*n_elements) * 56924617 + j * 19349663 + 96925573) % (NERF_GRIDSIZE()*NERF_GRIDSIZE()*NERF_GRIDSIZE());

A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generator algorithms. The theory behind them is relatively easy to understand, and they are easily implemented and fast, especially on computer hardware which can provide modular arithmetic by storage-bit truncation. However, the statistical properties are bad.

Here, we don't actually care much about its statistical properties. Rather, we care about its property of producing a permutation: this use case distributes the density grid update samples more-or-less uniformly over space (due to the pseudo-random nature), but ensures good coverage by never visiting a grid cell twice without having visited all other cells (due to the permutation property).
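
A toy Python sketch of that permutation property: because the multiplier is odd and the grid size is a power of two, the affine map is a bijection modulo the grid size, so each sweep visits every cell exactly once (the constants mirror the snippet above; the 8^3 grid and the step/j values are just hypothetical stand-ins to keep the check fast):

grid_cells = 8 ** 3                       # stand-in for NERF_GRIDSIZE()**3
step, n_elements, j = 0, grid_cells, 1    # hypothetical values for the other inputs
visited = {((i + step * n_elements) * 56924617 + j * 19349663 + 96925573) % grid_cells
           for i in range(grid_cells)}
assert len(visited) == grid_cells         # every cell is hit exactly once per sweep
print("visited all", len(visited), "cells exactly once")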

MIT Vision and Graphics Seminar - Atlas Wang

Background: general-modality models (Transformers, MLP-Mixer, Perceiver); can they beat domain-specific architectures?

over-simplified rendering equation

over-simplified spherical harmonics (NeRF for reflection materials: NeRF: integrate optical reflection model into volume rendering, Ref-NeRF: Efficient directional encoding for structured view dependence)

NeRF: data distillation by overfitting, cross-view interpolation

PixelNeRF & IBRNet: acquire the RGBA of each point by a weighted sum of the image features at its 2D projections

MVSNeRF: predict RGBA of each point from a cost volume induced by MVSNet using 2 cameras

Side note: Transformers ("Attention Is All You Need") replacing LSTMs and earlier RNNs.

Generalizable NeRF Transformer (GNT):

Preliminary Result

There are 9 additional articles in my reading list.

Paper:

  1. Breaking images up into per-pixel rays is clever. It creates more data and makes good use of the assumption that every pixel shares the same rendering function (with the 3D object baked into that function).
    1. Can we break the rendering function down further by separating the model part from the true rendering function part? Then we could generalize the rendering function across different models, freeze it, and only overfit the model part, so fewer parameters are trained for each new model.
  2. Instead of "x" being a point along the ray, what if we let it be the camera position?
    1. If it were the camera position, the input would carry data that is not relevant to the object, which decreases generalization for a given voxel.
  3. There is no separation between "base color" and "resulting color". You are only asking for the resulting color, so the environment is baked in, which is bad.
    1. The paper assumes "opacity" is the voxel's intrinsic property. Why not base color, specular, etc.? That would increase the output space and make generalization harder.
    2. Because the paper uses a volumetric representation, most voxels are "completely transparent", so the model generalizes better when opacity is assumed to be the model's basic property.
      1. Is there a way to avoid the voxel representation, since most voxels are blank? Maybe we can shrink (regularize, or smooth) the representation space to achieve better generalization?
  4. For a ray, do you query one point or all points? Is transparency an issue?
    1. For opaque objects, if we do not use a voxel representation, we might not need to query every point.
    2. For opaque objects, if we do use a voxel representation, we only need to query up to the first opaque voxel (given our representation is very good, i.e. it satisfies the stationary equation; but during training we don't know for sure whether a voxel is opaque, so we need a majority vote on opacity. Can we do a hard-coded majority vote first? Because we are in 3D, if we imagine querying a transparent voxel outside of a cube, then 2 views will vote opaque and 4 will vote transparent. However, for a transparent voxel inside a bowl, all 6 will vote opaque. This is a problem.)
    3. The integration along the ray to get a color should be weighted by transparency. However, transparency is itself the thing we are training... Would a large batch size help?
  5. The overfitted model can't guess unseen voxels.
  6. Can we actually put them into games? What problems will arise?
  7. What, the paper is ancient technology.
  8. Why do you bake in the rendering function? Could you have directly optimized the model?
  9. Tricks
    1. separate coarse and fine networks
    2. cosine decomposition of the voxel representation

Report

  1. 16 cameras with uniform (semi-random) angles over the hemisphere is the minimum to get an okay-looking result
  2. training time is independent of the number of cameras; quality is not
  3. With 1 GPU, 3 seconds of training is enough to get an okay-looking result. That means, with the naive original implementation and no optimization, we would need at least 90 GPUs to reach 30 FPS. As you may realize, this naive implementation is not practical. One intuitive next step is to keep training the same model with new data. This is what I am working on.

Hands-on Experience without Code

Ideas and Questions

Questions

  1. what kind of sampling do we need for large open environments (I guess we cannot pre-define it)
  2. game engine: alternatives to the octree representation? (moving vs. non-moving objects, objects changing shape?)

Image De-blur

De-blur reconstruction

Video De-blur Methods Categories

NerfStudio

NerfStudio: a Python framework for running different kinds of NeRFs using PyTorch. (Video instruction)

Monocular Dynamic View Synthesis: A Reality Check

Monocular: using only one camera to capture either an animated or a static scene, with animated or static objects

Current Research:

Depth Supervision

Dense Depth Priors for Neural Radiance Fields from Sparse Input Views (CVPR'2022):

  1. SfM produces camera orientations, locations, and a sparse (few points, non-uniform density), noisy point-cloud reconstruction
  2. augment each image with depth computed from the point cloud
  3. feed the augmented image to a network that outputs a dense depth map and an "uncertainty" (high at object edges)
  4. use the depth map and "uncertainty" to guide NeRF

Supervision with sparse depth to achieve better sampling strategy

\begin{align*} \mathcal{L} =& \sum_{r \in R} (\mathcal{L}_{MSE-color}(r) + \lambda \mathcal{L}_{GaussianNLL-depth}(r))\\ \mathcal{L}_{GaussianNLL-depth}(r) =& \log(\hat{s}(r)^2) + \frac{(\hat{z}(r)-z(r))^2}{\hat{s}(r)^2}\\ \hat{z}(r) =& \sum_{k = 1}^K w_k t_k \tag{weighted "average" of predicted depth, computed the same way as color}\\ \hat{s}(r)^2 =& \sum_{k = 1}^K w_k(t_k - \hat{z}(r))^2 \tag{weighted variance}\\ \end{align*}

where r is a ray, K is the number of samples along the ray, and t_k is the depth of sample k from the camera. (Gaussian NLL is used because plain MSE on depth works poorly.)
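
A direct sketch of the two quantities and the depth term above, assuming the rendering weights w_k and sample depths t_k for one ray are already available (the small eps is my addition to avoid division by zero):

import numpy as np

def depth_gnll(weights, t, target_depth, eps=1e-6):
    """Gaussian-NLL style depth term from the formulas above.

    weights: (K,) rendering weights along the ray (the same w_k used for color)
    t:       (K,) sample depths along the ray
    """
    z_hat = (weights * t).sum()                          # predicted depth
    s2_hat = (weights * (t - z_hat) ** 2).sum() + eps    # predicted variance
    return np.log(s2_hat) + (z_hat - target_depth) ** 2 / s2_hat

weights = np.array([0.05, 0.1, 0.7, 0.15])
t = np.array([1.0, 1.5, 2.0, 2.5])
print(depth_gnll(weights, t, target_depth=2.0))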

Depth Completion

The authors do not mention until Section 4.5 (Limitations) that the depth-completion network is trained separately on a large dataset. The depth network takes not just sparse depth but the images too.

They use a ResNet + Convolutional Spatial Propagation Network (CSPN), which spreads depth to neighboring pixels (since the input depth is sparse), to predict dense depth.

They simulated the SfM output using sensor depth plus random perturbation to avoid running SfM on the large dataset.

Sparse vs Completion

Depth with Uncertainty: the uncertainty map acts as a switch in the loss to filter out noisy depth

Depth supervision is applied when: 1) the difference between the predicted and target depth is greater than the target standard deviation (Eq. (12)), or 2) the predicted standard deviation is greater than the target standard deviation.

Bound depth within 1 standard deviation, while giving some freedom to fit color.

// QUESTION: what is a latent code?

Depth-supervised NeRF: Fewer Views and Faster Training for Free

So from the above we know:

Notice h(t) always looks like an impulse, regardless of how many layers of objects there are along the ray.

Say we know that, in reality, the ray terminates exactly at distance d. Then we want our h(t) distribution to look like an impulse distribution (a \delta function) centered at d with a very high peak. We want the two distributions to be the same, so we measure the Kullback–Leibler (KL) divergence between them.

By definition, KL[P|Q] = \sum_{x \in X} P(x) \log(\frac{P(x)}{Q(x)})

But we don't know whether the ray terminates exactly at distance d, even with the help of the depth input, so d is itself a random variable. We can model it with a normal distribution D \sim N(\hat{d}, \hat{\sigma}), where \hat{d}, \hat{\sigma} are the depth and uncertainty estimated by COLMAP.

With the above uncertainty, our delta function becomes \delta(t - D), and we want h(t) to be close to it.

E_D[KL[\delta(t - D)|h(t)]] = KL[N(\hat{d}, \hat{\sigma})|h(t)] + \text{const}
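
One way to see this identity (treating the infinite self-entropy of the \delta as part of the constant):

\begin{align*} E_D\big[KL[\delta(t - D)|h(t)]\big] = E_D[-\log h(D)] + \text{const} = -\int N(d; \hat{d}, \hat{\sigma}) \log h(d)\, dd + \text{const} = KL[N(\hat{d}, \hat{\sigma})|h(t)] + H(N(\hat{d}, \hat{\sigma})) + \text{const} \end{align*}

and the Gaussian's entropy H(N(\hat{d}, \hat{\sigma})) does not depend on h, so it is absorbed into the constant.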

KL divergence by fixing P(x)

Although the method is correct, it is not intuitive to me why it is a good choice of loss function. So far, the only difference between this and the previous paper is that this one uses KL as the loss while the previous paper uses Gaussian NLL with uncertainty. I want to show that the two ways of thinking are actually quite similar, but that's math and math takes time. So let's end here. // QUESTION

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo:

PNeRF: Probabilistic Neural Scene Representations for Uncertain 3D Visual Mapping

PNeRF:

RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs

RegNeRF:

Simulated annealing searching for a maximum. The objective here is to get to the highest point. In this example, it is not enough to use a simple hill climb algorithm, as there are many local maxima. By cooling the temperature slowly the global maximum is found.

360FusionNeRF: Panoramic Neural Radiance Fields with Joint Guidance

360FusionNeRF

\mathcal{L}_{depth} = \sum_{r \in R} \frac{|\hat{D}(r) - D(r)|}{\sqrt{\hat{D}_{var}(r)}}

where \hat{D}_{var}(r) = \sum_{i = 1}^N T_i(1 - \exp(-\sigma_i \delta_i))(\hat{D}(r) - t_i)^2.

By the way, this is done by CMU Engineering too

Other Notable Papers

Some NeRF Variants: two works that tell us how to use depth for translating 3D keypoints through time.

Some Other NeRF Variants:

Classical Depth Processing: how to clean up captured depth (incremental updating, representation of directional uncertainty, filling gaps in the reconstruction, robustness to outliers); typically involves SDFs; didn't read too much into it

Some Other Papers about Depth: ones that I can't classify

Rendering Speedup:

Regularization from Novel View:

Bad Paper: don't read them

Didn't Read:

// QUESTION: And there is one work that I don't know how to comment on - FWD: Real-time Novel View Synthesis with Forward Warping and Depth; I didn't quite understand their approach because of the way the paper is written

Assumptions

  1. only one querying camera
  2. the moving object is small in the frame: mask updates by (color/depth) delta
    1. depth voxels at the boundary should change quickly
  3. assume Lambertian objects with no semi-transparency: global impulse regularization
  4. About dense scenes: don't strive for sparse scenes, and therefore don't use captured data; use synthetic simulation of captured data (with added noise), since sparse scenes will not improve training time dramatically and lead to poor generalization. Our goal is to make it really fast and produce a cool demo video to attract attention.
  5. About dense depth: but we should strive for sparse depth. NeRF uses color to help density reconstruction; if we had dense depth, there would probably be no point in using NeRF, the architecture would be very different, and we should not call it NeRF. The idea of a depth delta does not make sense either: if you have a depth delta, you must already have a dense depth map, so why bother.

Introduction: we are trying to speed up both training and rendering for NeRF, eventually leading toward interactivity. One application is 3D monitoring by reconstructing on the fly. To do so, we make use of sensor depth, since it is readily available and can speed up NeRF greatly through regularization guidance and smarter sampling. Translation (moving objects) is challenging since NeRF is fundamentally more like a gridded representation of the scene than particles (see Graphics), but being a neural network gives us flexibility.

Related Work:

Steps:

Ideas that shouldn't work:

Ideas that should work:

So, the problem is that the density fades out or comes in too slowly. I suspect this is due to diminishing gradients for the values in the grid. For example, when we see transparency in the input image, we should immediately know that region is completely transparent, right? So we want a way to directly set the grid representation there to transparent. But since NGP has a neural network processing the grid values, we can't do so directly. What's the solution? We force one layer of the grid to store only density by adding a regularization to the architecture, so whenever the network sees 0 density, it should output 0 density. Then, after we freeze the network, the behavior stays the same and we can train the grid faster.

sample fewer rays, clamp depth

Some reading on Plenoxel

Residual Field

The (differentiable version of the) volumetric rendering equation looks like this (fixing the direction d and parameterizing points along the ray as p(o, t) = o + td):

\hat{C} = \sum_i \exp\left(-\sum_{j = 0}^{i-1} \alpha_j \right)\alpha_i C_i

The simple MSE loss:

\begin{align*} \left(C - \sum_i \exp\left(-\sum_{j = 0}^{i-1} \alpha_j \right)\alpha_i C_i\right)^2 \end{align*}

with a residual added to the field, where \alpha_j', C_i' are constants:

\begin{align*} \left(C - \sum_i \exp\left(-\sum_{j = 0}^{i-1} \alpha_j \right)\alpha_i C_i - \sum_i \exp\left(-\sum_{j = 0}^{i-1} \alpha_j' \right)\alpha_i' C_i'\right)^2 \end{align*}

or if you do field-space instead of image space

\begin{align*} \left(C - \sum_i \exp\left(-\sum_{j = 0}^{i-1} (\alpha_j+\alpha_j') \right)(\alpha_i+\alpha_i') (C_i + C_i')\right)^2 \end{align*}

So the image-space version simplifies to (the rendered residual term is a constant):

\begin{align*} \left(C - \text{some constant} - \sum_i \exp\left(-\sum_{j = 0}^{i-1} \alpha_j \right)\alpha_i C_i\right)^2 \end{align*}

So we can do image space?

Below is the training pipeline of NerfStudio's instant-ngp implementation:

entrypoint() in script/train.py
train_loop() in script/train.py
train() in trainer.py - output
train_iteration() in trainer.py - forward, backward
get_train_loss_dict() in base_pipeline.py - generate ray, feed model, get metric and loss
forward(), get_outputs() in base_model.py - any output
get_outputs() in instant_ngp.py - return rgb,acc,depth
forward() in base_field.py
get_density()->get_outputs() in base_field.py
get_density()->get_outputs() in instant_ngp_field.py

render_weight_from_density() in instant_ngp.py

renderer_rgb() in instant_ngp.py - start accumulate for loss
forward() in renderers.py

get_loss_dict() in instant_ngp.py - feed image and get rgb loss

FART-NeRF - Fast Accumulative Realtime Training of NeRF

Problem:

Hardness:

Assume:

Limitation:
