NeRF

A good paper feeds millions of researchers who can plagiarize one simple idea to publish papers.

CT Scan Reconstruction

CT Scan Reconstruction; MRI Reconstruction Course; Deep Learning for MRI; MRI Basics Part 1 - Image Formation

It uses an analytical method that relies on signal processing.

NeRF: Neural Radiance Fields for View Synthesis

NeRF Paper Reading: Youtube

Big Idea of NeRF: You get a bunch of images with labeled shooting directions. The algorithm gives you back a function that can be sampled to retrieve geometry information.

NeRF's objective is to reconstruct 3D geometry from an array of images. However, the product of a NeRF is not actual geometry, but a neural network that represents both the rendering function and the geometry.

Radiance Fields: A radiance field is a field over $\mathbb{R}^3$ in which each point holds a spherical function rather than a single value. It essentially represents a 3D volume with a 2D view-dependent appearance at each point, i.e. geometry data as seen from different angles. At each point, it is a function from view direction to color.

Approach

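Input: $((\theta, \phi), (x, y, z))$ . Output: $((r, g, b), \alpha)$ . Loss: the difference between $\int_D f(\cdot)_\alpha \cdot f(\cdot)_{rgb}$ and the ground truth $(r, g, b)$ , where $D$ is the line along the ray, $f(\cdot)_\alpha$ is the network's $\alpha$ output, and $f(\cdot)_{rgb}$ is its color output.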

Procedural:

Prepare a bunch of images with known shooting directions. First break each image down into pixels, so you have a bunch of pixels, each associated with a 3D vector representing the shooting direction and the pixel's RGB value.
Randomly initialize the weights of the neural network. The network represents the rendering function (viewing direction and pixel location as input, color as output) with the geometry baked in. The network can be viewed as a generic radiance field function, except for the extra input $(x, y, z)$ and the extra output $\alpha$ , which exist for architectural purposes (calculating the loss). Initializing the weights amounts to picking a starting point in function space.
For each pixel, we hold the viewing direction $(\theta, \phi)$ constant and integrate the network function along $(x, y, z) \in D$ (summing the outputs at each sampled $(x, y, z)$ ), weighted by the predicted $\alpha$ , which encodes how transparent each point is, to get a final color $(r, g, b)$ . We compare it with the ground truth color and take a step along the negative gradient.

// QUESTION: I'm not sure exactly how to integrate $\alpha$ .

We don't select sample locations along the ray uniformly. We do two passes: the first pass uses uniformly spaced sample locations, and the second pass concentrates samples near the surface of the object, as located by the first pass.

However, if you do just that, the result will be poor because, for some reason, networks have a hard time overfitting (they struggle to reproduce high-frequency detail). Below is an example of a network trained with input $(x, y)$ and output $(r, g, b)$ . The result is not great.

Sinusoid Encoding explained in "Fourier Features Let Networks Learn High Frequency"

So the idea is to split the signal into different frequency bands to, in a sense, amplify the loss at higher frequencies. This strategy also appears in transformers (as positional encoding).
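A minimal sketch of such a sinusoidal encoding for one scalar coordinate, assuming the NeRF-style form $\gamma(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \dots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))$ . The names `positional_encoding` and `num_freqs` are made up for illustration and are not from any library.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a NeRF-style sinusoidal encoding for one scalar coordinate p.
// Output layout: [sin(2^0*pi*p), cos(2^0*pi*p), ..., sin(2^(L-1)*pi*p), cos(2^(L-1)*pi*p)].
// `positional_encoding` and `num_freqs` (= L) are illustrative names.
inline std::vector<double> positional_encoding(double p, int num_freqs) {
  const double pi = 3.14159265358979323846;
  std::vector<double> out;
  out.reserve(2 * static_cast<std::size_t>(num_freqs));
  double freq = pi;  // 2^0 * pi
  for (int k = 0; k < num_freqs; ++k) {
    out.push_back(std::sin(freq * p));
    out.push_back(std::cos(freq * p));
    freq *= 2.0;  // move up one octave
  }
  return out;
}
```

Each input coordinate would be encoded this way and the results concatenated before being fed to the network, so a small change in `p` produces a large change in the high-frequency entries.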

Further Readings: Fourier Features Let Networks Learn High Frequency

NTK (Neural Tangent Kernel): describes an infinitely wide fully connected network, initialized with reasonable weights and trained with infinitesimally small steps. It is a good mathematical tool for gaining insight into fully connected layers.

The NTK view shows that raw coordinate input is poorly suited to learning high-frequency content, as the quality in the picture above shows. Therefore you need a Fourier feature mapping to preserve high-frequency details.

How to differentiate rendering with density? We know:

$\hat{C} = \int_0^\infty T(t) \alpha(t) c(t) dt$

which says that the final color we render onto the screen is the integral of the transmittance $T$ (how much of the ray is not yet blocked) multiplied by the density $\alpha$ and the color $c$ . The blockage (occlusion) term is itself another integral:

$T(t) = \exp(-\int_0^t \alpha(s) ds)$

You should think of the density $\alpha(t)$ as the differential probability that the ray is stopped at a specific point along the ray, and of $T(t)$ as the probability that the ray reaches $t$ without being blocked. Then $h(t) := T(t)\alpha(t)$ is the probability density that the ray terminates at time $t$ , and we need to ensure that the ray terminates at some point: $\int T(t)\alpha(t) dt = 1$ . Proving that $h(t)$ is indeed a distribution is just a small math exercise.

Due to practical constraints, NeRFs assume that the scene lies between near and far bounds $(t_n, t_f)$ .

We often write

$\hat{C} = \int_0^\infty h(t)c(t) dt = E_{h(t)}[c(t)]$
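In practice the integral is approximated with a sum over samples along the ray. Below is a sketch of the usual piecewise-constant quadrature, assuming the density is constant within each interval of length $\delta_i$ , so that the discrete weight is $w_i = T_i (1 - e^{-\alpha_i \delta_i})$ with $T_i = \exp(-\sum_{j<i} \alpha_j \delta_j)$ . The names `Rgb` and `render_ray` are illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Rgb { double r, g, b; };

// Discretized volume rendering along one ray (a sketch, not a reference implementation).
// density[i]: alpha at sample i (>= 0); color[i]: c at sample i;
// delta[i]: length of the i-th interval along the ray.
inline Rgb render_ray(const std::vector<double>& density,
                      const std::vector<Rgb>& color,
                      const std::vector<double>& delta) {
  Rgb out{0.0, 0.0, 0.0};
  double transmittance = 1.0;  // T at the start of the ray: nothing blocked yet
  for (std::size_t i = 0; i < density.size(); ++i) {
    const double opacity = 1.0 - std::exp(-density[i] * delta[i]);
    const double w = transmittance * opacity;  // discrete version of h(t_i)
    out.r += w * color[i].r;
    out.g += w * color[i].g;
    out.b += w * color[i].b;
    transmittance *= 1.0 - opacity;  // multiply by exp(-density_i * delta_i)
  }
  return out;
}
```

The weights `w` sum to at most 1, and they sum to exactly 1 only when the ray is fully absorbed, which is the discrete counterpart of requiring $h(t)$ to be a distribution.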

Camera Convention

Why does the focal length in the camera intrinsics matrix have two dimensions?

In the pinhole camera model there is only one focal length which is between the principal point and the camera center.

However, after calculating the camera's intrinsic parameters, the matrix contains

(fx,  0,  offsetx,  0,
 0,  fy,  offsety,  0,
 0,   0,  1,        0)

where fx and fy are the focal lengths in the x and y directions. This is the OpenCV convention, and it also appears in Plenoxel's code.

struct PackedCameraSpec {
    PackedCameraSpec(CameraSpec& cam) :
        c2w(cam.c2w.packed_accessor32<float, 2, torch::RestrictPtrTraits>()),
        fx(cam.fx), fy(cam.fy),
        cx(cam.cx), cy(cam.cy),
        width(cam.width), height(cam.height),
        ndc_coeffx(cam.ndc_coeffx), ndc_coeffy(cam.ndc_coeffy) {}
    const torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits>
        c2w;
    float fx;
    float fy;
    float cx;
    float cy;
    int width;
    int height;

    float ndc_coeffx;
    float ndc_coeffy;
};

sx is the scaling factor in the x-direction (horizontally), it is a value that scales the horizontal dimensions of the image plane.

sy is the scaling factor in the y-direction (vertically), it is a value that scales the vertical dimensions of the image plane.

So the calculation looks like

fx = f*sx
fy = f*sy

sx = 10 pixel/mm (how many pixels per millimeter; the conversion factor)
f = 35mm (focal length)
then fx = sx*f = 35mm * 10 pixel/mm = 350 pixels

So fx is the focal length in pixels, where the unit is the pixel size in the x direction.
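To see where fx and fy are actually used, here is a small sketch of the pinhole projection under the OpenCV-style convention, assuming the point is already in camera space with Z > 0 and that cx, cy play the role of offsetx, offsety in the matrix above. The helper names `project` and `focal_in_pixels` are hypothetical.

```cpp
struct Pixel { double u, v; };

// Focal length in pixels from focal length in mm and pixel density in pixels/mm.
inline double focal_in_pixels(double f_mm, double pixels_per_mm) {
  return f_mm * pixels_per_mm;
}

// Project a camera-space point (X, Y, Z), Z > 0, to pixel coordinates:
//   u = fx * X / Z + cx,  v = fy * Y / Z + cy
inline Pixel project(double X, double Y, double Z,
                     double fx, double fy, double cx, double cy) {
  return Pixel{fx * X / Z + cx, fy * Y / Z + cy};
}
```

With f = 35 mm and 10 pixel/mm in both directions, `focal_in_pixels` gives 350 for both fx and fy; the two values only differ when the pixels are not square.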

Norm

The actual loss uses the L2 norm, which is a topic of its own, but I will briefly mention it here.

The L2 norm is the Euclidean distance we are familiar with: given a vector $X$ , its size (norm) can be written as

$(\sum_{i = 1}^k |X_i|^2)^\frac{1}{2}$

and similarly, the L1 norm is the Manhattan distance

$(\sum_{i = 1}^k |X_i|^1)^\frac{1}{1} = \sum_{i = 1}^k |X_i|$

So now we can define the Ln norm as follows:

$(\sum_{i = 1}^k |X_i|^n)^\frac{1}{n}$
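A direct transcription of this formula, as a sketch that assumes $n \ge 1$ and does not handle the $L_\infty$ limit (the function name `lp_norm` is illustrative):

```cpp
#include <cmath>
#include <vector>

// L_n norm of a vector for n >= 1: (sum_i |x_i|^n)^(1/n).
inline double lp_norm(const std::vector<double>& x, double n) {
  double sum = 0.0;
  for (double xi : x) {
    sum += std::pow(std::fabs(xi), n);
  }
  return std::pow(sum, 1.0 / n);
}
```

For the vector (3, -4), n = 1 gives the Manhattan length 7 and n = 2 gives the Euclidean length 5.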

But why is it useful? Why does it matter so much in machine learning? We need to penalize weight vectors that have big norms because they tend to lead to overfitting. To see why, imagine your weights $w_0, w_1, w_2, ...$ are used in this way:

$f(x) = w_0 + w_1x^1 + w_2x^2 + ...$

Then we want the polynomial to have a lower degree, so we want to set, for example, as many $w_i$ to $0$ as possible. But counting the nonzero weights is not differentiable, so we instead require the vector $[w_0, w_1, w_2, ...]^T$ to be "small". What is the formula to measure "small"? That's the norm.

$L_1, L_2, L_p, L_\infty$ : each point on the graph represents a vector $[w_0, w_1]^T$ that has the same norm (i.e. will receive the same penalty in the loss).

When you add a regularization term in your loss function, you get the following landscape.

And because of this, our neural network might favor one weight over the other when using different regularization.

Choose a different $[w_0, w_1]^T$ for each regularization. Notice that $L_1$ is a very strong regularization that enforces fewer parameters. $L_2$ is a bit more relaxed.

The L1 regularization solution is sparse. The L2 regularization solution is non-sparse (although in practical ML we never look for an exact solution). L2 regularization doesn’t perform feature selection, since weights are only reduced to values near 0 rather than exactly 0. L1 regularization has built-in feature selection.

Mean Squared Error (MSE) is another common place where the L1 and L2 norms appear: the function below is MSE with a $\lambda$ -weighted L2 regularization term.

$MSE = \frac{1}{n}\sum_{i = 1}^n (\hat{y}_i(x_i, W) - y_i)^2 + \lambda \sum_{j = 1}^k |w_j|^2$
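As a sketch of how this loss would be evaluated, assuming the residuals are squared and the penalty sums over all weights (the name `mse_l2` and the argument layout are made up for illustration):

```cpp
#include <cstddef>
#include <vector>

// loss = (1/n) * sum_i (pred_i - target_i)^2 + lambda * sum_j w_j^2
// pred and target must have the same, non-zero length.
inline double mse_l2(const std::vector<double>& pred,
                     const std::vector<double>& target,
                     const std::vector<double>& weights,
                     double lambda) {
  double data_term = 0.0;
  for (std::size_t i = 0; i < pred.size(); ++i) {
    const double r = pred[i] - target[i];
    data_term += r * r;
  }
  data_term /= static_cast<double>(pred.size());

  double penalty = 0.0;
  for (double w : weights) {
    penalty += w * w;  // squared L2 norm of the weight vector
  }
  return data_term + lambda * penalty;
}
```

Raising `lambda` makes large weights more expensive relative to the data term, which is the knob that trades fit for smoothness.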

Dirac delta function

The delta function was introduced by physicist Paul Dirac as a tool for the normalization of state vectors. It also has uses in probability theory and signal processing.

The Dirac delta as the limit as $a \to 0$

In terms of NeRF, ideally we want the distribution that weights the color integral along the ray to look like a delta function (so instead of volumetric rendering, we only take the color at the exact ray hit):

$\delta(x) \simeq \begin{cases} \infty & x = 0\\ 0 & x \neq 0\\ \end{cases}$

For most objects, the variance of such distribution decreases with more training views from different directions (in terms of $\delta$ function, it means $a \to 0$ )

Various components of the ray rendering equation plotted along the ray

Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields (ICCV 2021)

Mip-NeRF: It integrates the entire cone area by a weighted 3D gaussian-like distribution

Big Idea: Deal with small, real-life images that have aliasing and poor resolution. It also improves on NeRF by not requiring the camera to be centered on the object's central location.

pixel: becomes an area on the image
ray: becomes a cone
sample point: becomes a weighted sampling region

Mip-NeRF Sampling Result: It keeps low frequencies while averaging out high frequencies. This way we can calculate the expected high-frequency content more accurately, especially with non-uniform sampling.

// TODO: what is blurpool

It uses a characteristic (indicator) function to determine whether a point is in the cone, and then calculates the expectation of the Fourier-encoded coordinates under a 3D Gaussian. // TODO: think about math if you have time

NeRF++: Analyzing and Improving Neural Radiance Fields

Idea: the reason the MLP works is that its layers serve as a prior that assumes smoothness of color (and therefore shape) on the geometry's surface.

In the original design of NeRF, color $c$ is smoother with respect to direction $d$ than with respect to position $x$ . Therefore, feeding the input $d$ into later layers helps the accuracy of the model.

Background and Foreground Tradeoff in Original NeRF

For 360 degree captures of unbounded scenes, NeRF’s parameterization of space either models only a portion of the scene, leading to significant artifacts in background elements (a), or models the full scene and suffers from an overall loss of detail due to finite sampling resolution (b).

So we split NeRF into two MLPs: one for the foreground and one for the background.

Re-Parameterization for Background Scene

We re-parameterize the location encoding for the background scene so that space outside the unit sphere is encoded as a 4D coordinate whose 4th coordinate $1/r$ decreases with distance. The idea is that the originally sparse encoding of the background becomes denser, and therefore more images can contribute to the color of the far background, resolving background ambiguity.
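A sketch of this re-parameterization, assuming the form $(x/r, y/r, z/r, 1/r)$ with $r = \sqrt{x^2 + y^2 + z^2} > 1$ , so the first three coordinates lie on the unit sphere (the name `inverted_sphere` is illustrative):

```cpp
#include <array>
#include <cmath>

// Map a background point (outside the unit sphere, r > 1) to (x/r, y/r, z/r, 1/r).
// The first three entries are a unit direction; the fourth shrinks toward 0 with distance.
inline std::array<double, 4> inverted_sphere(double x, double y, double z) {
  const double r = std::sqrt(x * x + y * y + z * z);
  return {{x / r, y / r, z / r, 1.0 / r}};
}
```

All four outputs stay in a bounded range no matter how far away the point is, which is what lets an unbounded background be sampled with a finite budget.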

Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields (CVPR 2022)

Re-Parameterization: shrink the background into a sphere of radius 2 while keeping the mapping monotonic
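A sketch of such a contraction, assuming the form in which points with norm at most 1 are left untouched and a point $x$ with $\|x\| > 1$ is mapped to $(2 - 1/\|x\|)(x/\|x\|)$ . The function name `contract` is chosen here for illustration.

```cpp
#include <array>
#include <cmath>

// Contract an unbounded 3D point into the ball of radius 2.
// Points with norm <= 1 are unchanged; farther points are squashed into the shell (1, 2).
inline std::array<double, 3> contract(const std::array<double, 3>& p) {
  const double n = std::sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
  if (n <= 1.0) {
    return p;
  }
  const double s = (2.0 - 1.0 / n) / n;  // new norm (2 - 1/n) divided by old norm n
  return {{p[0] * s, p[1] * s, p[2] * s}};
}
```

The output norm $2 - 1/\|x\|$ increases with $\|x\|$ , so farther points stay farther after contraction, and it never reaches 2.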

Parameterization: roughly Mip-NeRF plus another way of doing what NeRF++ does

Convergence: Split into two networks trained with one gradient: the first predicts density only, and the second predicts density and color, using the first network's density to reduce training cost.

// QUESTION: don't quite understand how this would work

Distillation: // TODO: didn't read

PlenOctrees for Real-time Rendering of Neural Radiance Fields

Background and Survey

Representing Geometry as Multi-plane Images

The above methods are topology-free and can be rendered in real time, but are memory intensive (they can't capture detailed resolution).

NeRFs can be sampled at arbitrary resolution since the function is continuous. However, they are slow to train and test. Methods to accelerate NeRF include

train on datasets of similar scenes
skip empty region during testing (Neural Sparse Voxel Fields)
decompose scene into smaller networks (Decomposed Radiance Fields)
Quality Tradeoff (AutoInt)
...

PlenOctrees

Spherical harmonics: used to speed up the process of converting NeRF to PlenOctree. This is because we move the view-dependent calculation to evaluation time instead of PlenOctree-conversion time. (Also, the model looks cleaner.)

Comments:

The article makes use of a spherical harmonic basis to decompose view-direction-dependent color into spherical functions anchored at points in 3D space. Although most objects can be encoded well with spherical harmonics, 3D scenes with pores, or with the camera inside the geometry, can hardly be viewed as spherical functions. That is not an issue here, since only color is encoded this way; encoding the geometry itself as a spherical function is not possible in those cases.
I am thinking of throwing away the network entirely and parameterizing color by spherical harmonics, and density by spherical harmonics too, but with continuous functional coefficients (each coefficient is a function that takes in a radius and spits out the actual coefficient).
Also, the PlenOctree isn't spherical, and it is not continuous. It is not a good representation because it isn't rotationally invariant, meaning the object might look bad when rotated to a specific angle. Because it is not continuous, the voxels will look Minecraft-like if the tree isn't deep enough. On the other hand, PNG compression is more popular than "Fourier compression" in practice.
What is the benefit of representing geometry in a neural network and then in an octree? Why not directly in an octree? Tuning the octree directly is not possible because the tree structure is fixed. You can instead generate the tree structure first and then tune the leaf values. This is the same as the paper's approach, in which a neural network is used to obtain a coarse tree structure.
The paper uses tree leaves separated by occupancy, but we can do that for colors too, in fact for each channel, to further compress the model. So we have a total of 4 octrees with different shapes, each containing one of the $(r, g, b, \alpha)$ values. This might give better compression, though it might be more costly to evaluate.
The octree extraction is too slow: 15 minutes

Plenoxels: Radiance Fields without Neural Networks

Normalized device coordinates: ?

multi-sphere images: ?

Trilinear interpolation is crucial: it converts the discrete representation into a continuous one to minimize reconstruction loss.
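For reference, a sketch of trilinear interpolation inside a single voxel. The corner ordering (index = i + 2j + 4k for the corner at x = i, y = j, z = k) and the names `mix` and `trilinear` are assumptions made for this example, not the layout used in the paper's code.

```cpp
#include <array>

// Linear blend between a and b with t in [0, 1].
inline double mix(double a, double b, double t) { return a + (b - a) * t; }

// Trilinear interpolation inside one voxel.
// c[i + 2*j + 4*k] is the value stored at corner (x=i, y=j, z=k);
// (tx, ty, tz) in [0, 1] is the position inside the voxel.
inline double trilinear(const std::array<double, 8>& c,
                        double tx, double ty, double tz) {
  const double c00 = mix(c[0], c[1], tx);  // edge at y=0, z=0
  const double c10 = mix(c[2], c[3], tx);  // edge at y=1, z=0
  const double c01 = mix(c[4], c[5], tx);  // edge at y=0, z=1
  const double c11 = mix(c[6], c[7], tx);  // edge at y=1, z=1
  const double c0 = mix(c00, c10, ty);     // face at z=0
  const double c1 = mix(c01, c11, ty);     // face at z=1
  return mix(c0, c1, tz);
}
```

Because the result is a smooth function of the eight corner values, gradients of the reconstruction loss flow back to every corner that touches a sample, which is what makes direct optimization of the grid possible.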

// QUESTION: I don't understand how optimizing the voxel coefficients and the regularization formula work; I haven't read into it.

Comments:

This paper gives me a different idea of how to define a neural network. The main difference between the octree approach and NeRF trained with SGD is the architecture, just like how a CNN is more specialized for images than a plain NN.

Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

Procedural:

We integrate along the ray $D$ the same way as in NeRF. However, for each point $(x, y, z)$ we calculate its value by interpolating from the values of its nearby vertices (at multiple levels, coarse to fine), which are randomly distributed in the hash table.
Since they are randomly distributed, we need a neural network to allocate the available hash cells to the vertices. This way, we can let the neural network decide where the hash cells should be used in the geometry, and also calculate the RGB and density target from the interpolated latent space.

Morton Code

Morton code is used to map n-dimensional space to linear space. Morton code defines a Z-shaped space-filling curve, which largely preserves n-dimensional locality.

Assume we have already stored a 3D map, where each cell is an int, in a Morton-coded array, and we want to extract the int value at (x, y, z) = (5, 9, 1) = (0101b, 1001b, 0001b). There is an easy method to find which array position to look up: position (010 001 000 111b). Each group of three bits holds the (z, y, x) bits of one bit position, running from the most significant bit of every dimension down to the least significant.

Morton Code 3D Conversion: above is the 3D coordinate and below is the corresponding 1D coordinate

Note that to invert a Morton code back to a 3D coordinate, we only need to pass code >> 0 for x, code >> 1 for y, and code >> 2 for z to the same decoding function.
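To check the (5, 9, 1) example and the shift trick for decoding, here is a deliberately slow, loop-based round trip limited to 10 bits per axis. The names `morton_encode_loop`, `compact_by_3`, `morton_decode_loop` and `Coord3` are made up for this sketch.

```cpp
#include <cstdint>

struct Coord3 { uint32_t x, y, z; };

// Interleave the low 10 bits of x, y, z, with x in the lowest position of each triple.
inline uint32_t morton_encode_loop(uint32_t x, uint32_t y, uint32_t z) {
  uint32_t code = 0;
  for (uint32_t i = 0; i < 10; ++i) {
    code |= ((x >> i) & 1u) << (3 * i);
    code |= ((y >> i) & 1u) << (3 * i + 1);
    code |= ((z >> i) & 1u) << (3 * i + 2);
  }
  return code;
}

// Collect every third bit of `code`, starting at bit 0, into a 10-bit value.
inline uint32_t compact_by_3(uint32_t code) {
  uint32_t v = 0;
  for (uint32_t i = 0; i < 10; ++i) {
    v |= ((code >> (3 * i)) & 1u) << i;
  }
  return v;
}

// The same decoding function serves all three axes; only the shift differs.
inline Coord3 morton_decode_loop(uint32_t code) {
  return Coord3{compact_by_3(code >> 0), compact_by_3(code >> 1), compact_by_3(code >> 2)};
}
```

Encoding (5, 9, 1) gives 010 001 000 111b, which is 0x447, and decoding 0x447 gives back (5, 9, 1).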

With a for loop, we can write the encoder like this:

#include <stdint.h>
#include <limits.h>
using namespace std;

inline uint64_t mortonEncode_for(unsigned int x, unsigned int y, unsigned int z) {
  uint64_t answer = 0;
  for (uint64_t i = 0; i < (sizeof(uint64_t)* CHAR_BIT)/3; ++i) {
    answer |= ((x & ((uint64_t)1 << i)) << 2*i) | ((y & ((uint64_t)1 << i)) << (2*i + 1)) | ((z & ((uint64_t)1 << i)) << (2*i + 2));
  }
  return answer;
}

To achieve better performance, we could use magic bits:

#include <stdint.h>
#include <limits.h>
using namespace std;

// method to separate bits from a given integer 3 positions apart
inline uint64_t splitBy3(unsigned int a){
  uint64_t x = a & 0x1fffff; // we only look at the first 21 bits
  x = (x | x << 32) & 0x1f00000000ffff; // shift left 32 bits, OR with self, and 00011111000000000000000000000000000000001111111111111111
  x = (x | x << 16) & 0x1f0000ff0000ff; // shift left 16 bits, OR with self, and 00011111000000000000000011111111000000000000000011111111
  x = (x | x << 8) & 0x100f00f00f00f00f; // shift left 8 bits, OR with self, and 0001000000001111000000001111000000001111000000001111000000001111
  x = (x | x << 4) & 0x10c30c30c30c30c3; // shift left 4 bits, OR with self, and 0001000011000011000011000011000011000011000011000011000011000011
  x = (x | x << 2) & 0x1249249249249249; // shift left 2 bits, OR with self, and keep every third bit
  return x;
}

inline uint64_t mortonEncode_magicbits(unsigned int x, unsigned int y, unsigned int z){
  uint64_t answer = 0;
  answer |= splitBy3(x) | splitBy3(y) << 1 | splitBy3(z) << 2;
  return answer;
}

or to use a giant table to achieve the best performance

#include <stdint.h>
#include <limits.h>
using namespace std;

static const uint32_t morton256_x[256] = {
0x00000000,
0x00000001, 0x00000008, 0x00000009, 0x00000040, 0x00000041, 0x00000048, 0x00000049, 0x00000200,
0x00000201, 0x00000208, 0x00000209, 0x00000240, 0x00000241, 0x00000248, 0x00000249, 0x00001000,
0x00001001, 0x00001008, 0x00001009, 0x00001040, 0x00001041, 0x00001048, 0x00001049, 0x00001200,
0x00001201, 0x00001208, 0x00001209, 0x00001240, 0x00001241, 0x00001248, 0x00001249, 0x00008000,
0x00008001, 0x00008008, 0x00008009, 0x00008040, 0x00008041, 0x00008048, 0x00008049, 0x00008200,
0x00008201, 0x00008208, 0x00008209, 0x00008240, 0x00008241, 0x00008248, 0x00008249, 0x00009000,
0x00009001, 0x00009008, 0x00009009, 0x00009040, 0x00009041, 0x00009048, 0x00009049, 0x00009200,
0x00009201, 0x00009208, 0x00009209, 0x00009240, 0x00009241, 0x00009248, 0x00009249, 0x00040000,
0x00040001, 0x00040008, 0x00040009, 0x00040040, 0x00040041, 0x00040048, 0x00040049, 0x00040200,
0x00040201, 0x00040208, 0x00040209, 0x00040240, 0x00040241, 0x00040248, 0x00040249, 0x00041000,
0x00041001, 0x00041008, 0x00041009, 0x00041040, 0x00041041, 0x00041048, 0x00041049, 0x00041200,
0x00041201, 0x00041208, 0x00041209, 0x00041240, 0x00041241, 0x00041248, 0x00041249, 0x00048000,
0x00048001, 0x00048008, 0x00048009, 0x00048040, 0x00048041, 0x00048048, 0x00048049, 0x00048200,
0x00048201, 0x00048208, 0x00048209, 0x00048240, 0x00048241, 0x00048248, 0x00048249, 0x00049000,
0x00049001, 0x00049008, 0x00049009, 0x00049040, 0x00049041, 0x00049048, 0x00049049, 0x00049200,
0x00049201, 0x00049208, 0x00049209, 0x00049240, 0x00049241, 0x00049248, 0x00049249, 0x00200000,
0x00200001, 0x00200008, 0x00200009, 0x00200040, 0x00200041, 0x00200048, 0x00200049, 0x00200200,
0x00200201, 0x00200208, 0x00200209, 0x00200240, 0x00200241, 0x00200248, 0x00200249, 0x00201000,
0x00201001, 0x00201008, 0x00201009, 0x00201040, 0x00201041, 0x00201048, 0x00201049, 0x00201200,
0x00201201, 0x00201208, 0x00201209, 0x00201240, 0x00201241, 0x00201248, 0x00201249, 0x00208000,
0x00208001, 0x00208008, 0x00208009, 0x00208040, 0x00208041, 0x00208048, 0x00208049, 0x00208200,
0x00208201, 0x00208208, 0x00208209, 0x00208240, 0x00208241, 0x00208248, 0x00208249, 0x00209000,
0x00209001, 0x00209008, 0x00209009, 0x00209040, 0x00209041, 0x00209048, 0x00209049, 0x00209200,
0x00209201, 0x00209208, 0x00209209, 0x00209240, 0x00209241, 0x00209248, 0x00209249, 0x00240000,
0x00240001, 0x00240008, 0x00240009, 0x00240040, 0x00240041, 0x00240048, 0x00240049, 0x00240200,
0x00240201, 0x00240208, 0x00240209, 0x00240240, 0x00240241, 0x00240248, 0x00240249, 0x00241000,
0x00241001, 0x00241008, 0x00241009, 0x00241040, 0x00241041, 0x00241048, 0x00241049, 0x00241200,
0x00241201, 0x00241208, 0x00241209, 0x00241240, 0x00241241, 0x00241248, 0x00241249, 0x00248000,
0x00248001, 0x00248008, 0x00248009, 0x00248040, 0x00248041, 0x00248048, 0x00248049, 0x00248200,
0x00248201, 0x00248208, 0x00248209, 0x00248240, 0x00248241, 0x00248248, 0x00248249, 0x00249000,
0x00249001, 0x00249008, 0x00249009, 0x00249040, 0x00249041, 0x00249048, 0x00249049, 0x00249200,
0x00249201, 0x00249208, 0x00249209, 0x00249240, 0x00249241, 0x00249248, 0x00249249
};

// pre-shifted table for Y coordinates (1 bit to the left)
static const uint32_t morton256_y[256] = {
0x00000000,
0x00000002, 0x00000010, 0x00000012, 0x00000080, 0x00000082, 0x00000090, 0x00000092, 0x00000400,
0x00000402, 0x00000410, 0x00000412, 0x00000480, 0x00000482, 0x00000490, 0x00000492, 0x00002000,
0x00002002, 0x00002010, 0x00002012, 0x00002080, 0x00002082, 0x00002090, 0x00002092, 0x00002400,
0x00002402, 0x00002410, 0x00002412, 0x00002480, 0x00002482, 0x00002490, 0x00002492, 0x00010000,
0x00010002, 0x00010010, 0x00010012, 0x00010080, 0x00010082, 0x00010090, 0x00010092, 0x00010400,
0x00010402, 0x00010410, 0x00010412, 0x00010480, 0x00010482, 0x00010490, 0x00010492, 0x00012000,
0x00012002, 0x00012010, 0x00012012, 0x00012080, 0x00012082, 0x00012090, 0x00012092, 0x00012400,
0x00012402, 0x00012410, 0x00012412, 0x00012480, 0x00012482, 0x00012490, 0x00012492, 0x00080000,
0x00080002, 0x00080010, 0x00080012, 0x00080080, 0x00080082, 0x00080090, 0x00080092, 0x00080400,
0x00080402, 0x00080410, 0x00080412, 0x00080480, 0x00080482, 0x00080490, 0x00080492, 0x00082000,
0x00082002, 0x00082010, 0x00082012, 0x00082080, 0x00082082, 0x00082090, 0x00082092, 0x00082400,
0x00082402, 0x00082410, 0x00082412, 0x00082480, 0x00082482, 0x00082490, 0x00082492, 0x00090000,
0x00090002, 0x00090010, 0x00090012, 0x00090080, 0x00090082, 0x00090090, 0x00090092, 0x00090400,
0x00090402, 0x00090410, 0x00090412, 0x00090480, 0x00090482, 0x00090490, 0x00090492, 0x00092000,
0x00092002, 0x00092010, 0x00092012, 0x00092080, 0x00092082, 0x00092090, 0x00092092, 0x00092400,
0x00092402, 0x00092410, 0x00092412, 0x00092480, 0x00092482, 0x00092490, 0x00092492, 0x00400000,
0x00400002, 0x00400010, 0x00400012, 0x00400080, 0x00400082, 0x00400090, 0x00400092, 0x00400400,
0x00400402, 0x00400410, 0x00400412, 0x00400480, 0x00400482, 0x00400490, 0x00400492, 0x00402000,
0x00402002, 0x00402010, 0x00402012, 0x00402080, 0x00402082, 0x00402090, 0x00402092, 0x00402400,
0x00402402, 0x00402410, 0x00402412, 0x00402480, 0x00402482, 0x00402490, 0x00402492, 0x00410000,
0x00410002, 0x00410010, 0x00410012, 0x00410080, 0x00410082, 0x00410090, 0x00410092, 0x00410400,
0x00410402, 0x00410410, 0x00410412, 0x00410480, 0x00410482, 0x00410490, 0x00410492, 0x00412000,
0x00412002, 0x00412010, 0x00412012, 0x00412080, 0x00412082, 0x00412090, 0x00412092, 0x00412400,
0x00412402, 0x00412410, 0x00412412, 0x00412480, 0x00412482, 0x00412490, 0x00412492, 0x00480000,
0x00480002, 0x00480010, 0x00480012, 0x00480080, 0x00480082, 0x00480090, 0x00480092, 0x00480400,
0x00480402, 0x00480410, 0x00480412, 0x00480480, 0x00480482, 0x00480490, 0x00480492, 0x00482000,
0x00482002, 0x00482010, 0x00482012, 0x00482080, 0x00482082, 0x00482090, 0x00482092, 0x00482400,
0x00482402, 0x00482410, 0x00482412, 0x00482480, 0x00482482, 0x00482490, 0x00482492, 0x00490000,
0x00490002, 0x00490010, 0x00490012, 0x00490080, 0x00490082, 0x00490090, 0x00490092, 0x00490400,
0x00490402, 0x00490410, 0x00490412, 0x00490480, 0x00490482, 0x00490490, 0x00490492, 0x00492000,
0x00492002, 0x00492010, 0x00492012, 0x00492080, 0x00492082, 0x00492090, 0x00492092, 0x00492400,
0x00492402, 0x00492410, 0x00492412, 0x00492480, 0x00492482, 0x00492490, 0x00492492
};

// Pre-shifted table for z (2 bits to the left)
static const uint32_t morton256_z[256] = {
0x00000000,
0x00000004, 0x00000020, 0x00000024, 0x00000100, 0x00000104, 0x00000120, 0x00000124, 0x00000800,
0x00000804, 0x00000820, 0x00000824, 0x00000900, 0x00000904, 0x00000920, 0x00000924, 0x00004000,
0x00004004, 0x00004020, 0x00004024, 0x00004100, 0x00004104, 0x00004120, 0x00004124, 0x00004800,
0x00004804, 0x00004820, 0x00004824, 0x00004900, 0x00004904, 0x00004920, 0x00004924, 0x00020000,
0x00020004, 0x00020020, 0x00020024, 0x00020100, 0x00020104, 0x00020120, 0x00020124, 0x00020800,
0x00020804, 0x00020820, 0x00020824, 0x00020900, 0x00020904, 0x00020920, 0x00020924, 0x00024000,
0x00024004, 0x00024020, 0x00024024, 0x00024100, 0x00024104, 0x00024120, 0x00024124, 0x00024800,
0x00024804, 0x00024820, 0x00024824, 0x00024900, 0x00024904, 0x00024920, 0x00024924, 0x00100000,
0x00100004, 0x00100020, 0x00100024, 0x00100100, 0x00100104, 0x00100120, 0x00100124, 0x00100800,
0x00100804, 0x00100820, 0x00100824, 0x00100900, 0x00100904, 0x00100920, 0x00100924, 0x00104000,
0x00104004, 0x00104020, 0x00104024, 0x00104100, 0x00104104, 0x00104120, 0x00104124, 0x00104800,
0x00104804, 0x00104820, 0x00104824, 0x00104900, 0x00104904, 0x00104920, 0x00104924, 0x00120000,
0x00120004, 0x00120020, 0x00120024, 0x00120100, 0x00120104, 0x00120120, 0x00120124, 0x00120800,
0x00120804, 0x00120820, 0x00120824, 0x00120900, 0x00120904, 0x00120920, 0x00120924, 0x00124000,
0x00124004, 0x00124020, 0x00124024, 0x00124100, 0x00124104, 0x00124120, 0x00124124, 0x00124800,
0x00124804, 0x00124820, 0x00124824, 0x00124900, 0x00124904, 0x00124920, 0x00124924, 0x00800000,
0x00800004, 0x00800020, 0x00800024, 0x00800100, 0x00800104, 0x00800120, 0x00800124, 0x00800800,
0x00800804, 0x00800820, 0x00800824, 0x00800900, 0x00800904, 0x00800920, 0x00800924, 0x00804000,
0x00804004, 0x00804020, 0x00804024, 0x00804100, 0x00804104, 0x00804120, 0x00804124, 0x00804800,
0x00804804, 0x00804820, 0x00804824, 0x00804900, 0x00804904, 0x00804920, 0x00804924, 0x00820000,
0x00820004, 0x00820020, 0x00820024, 0x00820100, 0x00820104, 0x00820120, 0x00820124, 0x00820800,
0x00820804, 0x00820820, 0x00820824, 0x00820900, 0x00820904, 0x00820920, 0x00820924, 0x00824000,
0x00824004, 0x00824020, 0x00824024, 0x00824100, 0x00824104, 0x00824120, 0x00824124, 0x00824800,
0x00824804, 0x00824820, 0x00824824, 0x00824900, 0x00824904, 0x00824920, 0x00824924, 0x00900000,
0x00900004, 0x00900020, 0x00900024, 0x00900100, 0x00900104, 0x00900120, 0x00900124, 0x00900800,
0x00900804, 0x00900820, 0x00900824, 0x00900900, 0x00900904, 0x00900920, 0x00900924, 0x00904000,
0x00904004, 0x00904020, 0x00904024, 0x00904100, 0x00904104, 0x00904120, 0x00904124, 0x00904800,
0x00904804, 0x00904820, 0x00904824, 0x00904900, 0x00904904, 0x00904920, 0x00904924, 0x00920000,
0x00920004, 0x00920020, 0x00920024, 0x00920100, 0x00920104, 0x00920120, 0x00920124, 0x00920800,
0x00920804, 0x00920820, 0x00920824, 0x00920900, 0x00920904, 0x00920920, 0x00920924, 0x00924000,
0x00924004, 0x00924020, 0x00924024, 0x00924100, 0x00924104, 0x00924120, 0x00924124, 0x00924800,
0x00924804, 0x00924820, 0x00924824, 0x00924900, 0x00924904, 0x00924920, 0x00924924
};

inline uint64_t mortonEncode_LUT(unsigned int x, unsigned int y, unsigned int z){
  uint64_t answer = 0;
  answer = morton256_z[(z >> 16) & 0xFF ] | // we start by shifting the third byte, since we only look at the first 21 bits
  morton256_y[(y >> 16) & 0xFF ] |
  morton256_x[(x >> 16) & 0xFF ];
  answer = answer << 24 | morton256_z[(z >> 8) & 0xFF ] | // shifting second byte (each input byte expands to 24 interleaved bits, so shift by 24, not 48)
  morton256_y[(y >> 8) & 0xFF ] |
  morton256_x[(x >> 8) & 0xFF ];
  answer = answer << 24 |
  morton256_z[(z) & 0xFF ] | // first byte
  morton256_y[(y) & 0xFF ] |
  morton256_x[(x) & 0xFF ];
  return answer;
}

Here is the source code for octree:

// Expands a 10-bit integer into 30 bits
// by inserting 2 zeros after each bit.
__host__ __device__ inline uint32_t expand_bits(uint32_t v) {
  v = (v * 0x00010001u) & 0xFF0000FFu;
  v = (v * 0x00000101u) & 0x0F00F00Fu;
  v = (v * 0x00000011u) & 0xC30C30C3u;
  v = (v * 0x00000005u) & 0x49249249u;
  return v;
}

// Calculates a 30-bit Morton code for the
// given 3D integer coordinates (10 bits each, in [0, 1023]).
__host__ __device__ inline uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
  uint32_t xx = expand_bits(x);
  uint32_t yy = expand_bits(y);
  uint32_t zz = expand_bits(z);
  return xx | (yy << 1) | (zz << 2);
}

__host__ __device__ inline uint32_t morton3D_invert(uint32_t x) {
  x = x               & 0x49249249;
  x = (x | (x >> 2))  & 0xc30c30c3;
  x = (x | (x >> 4))  & 0x0f00f00f;
  x = (x | (x >> 8))  & 0xff0000ff;
  x = (x | (x >> 16)) & 0x0000ffff;
  return x;
}

For details, read Out-of-Core Construction of Sparse Voxel Octrees and Morton encoding/decoding through bit interleaving: Implementations

Linear Congruential Generator

idx = ((i+step*n_elements) * 56924617 + j * 19349663 + 96925573) % (NERF_GRIDSIZE()*NERF_GRIDSIZE()*NERF_GRIDSIZE());

A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generator algorithms. The theory behind them is relatively easy to understand, and they are easily implemented and fast, especially on computer hardware which can provide modular arithmetic by storage-bit truncation. However, the statistical properties are bad.

Here, we don't actually care much about its statistical properties. Rather, we care about its property of producing a permutation: this use case distributes the density grid update samples more-or-less uniformly over space (due to the pseudo-random nature), but ensures good coverage by never visiting a grid cell twice without having visited all other cells (due to the permutation property).
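The permutation property can be checked on a small grid. The sketch below assumes the number of cells is a power of two (the value of NERF_GRIDSIZE() is not shown in these notes) and ignores the `j` term; it reuses the multiplier and offset from the line above. For a power-of-two modulus, the map i -> (a * i + c) mod m is a bijection exactly when a is odd. The helper names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Affine index map used as a permutation: i -> (a * i + c) mod m.
inline uint64_t lcg_index(uint64_t i, uint64_t a, uint64_t c, uint64_t m) {
  return (i * a + c) % m;
}

// Returns true if i = 0..m-1 visits every cell in [0, m) exactly once.
inline bool visits_every_cell_once(uint64_t a, uint64_t c, uint64_t m) {
  std::vector<bool> seen(m, false);
  for (uint64_t i = 0; i < m; ++i) {
    const uint64_t idx = lcg_index(i, a, c, m);
    if (seen[idx]) {
      return false;  // a cell was revisited before all cells were covered
    }
    seen[idx] = true;
  }
  return true;
}
```

With the odd multiplier 56924617 every cell of a 4096-cell grid is hit once per sweep, while an even multiplier such as 2 revisits cells and leaves others untouched.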

MIT Vision and Graphics Seminar - Atlas Wang

Background: general modality models (transformers, MLP mixer, Perceiver), can they beat domain-specific architecture?

over-simplified spherical harmonics (NeRF for reflection materials: NeRF: integrate optical reflection model into volume rendering, Ref-NeRF: Efficient directional encoding for structured view dependence)

NeRF: data distillation by overfitting, cross-view interpolation

photorealistic result
view-consistent geometry with view-dependent lighting
network optimization cost is high
limited material and optical effects
interpolation is restricting our representation
over-simplified rendering equation

PixelNeRF & IBRNet: acquire RGBA of each point by weighted summing the image features of its 2D projections

MVSNeRF: predict RGBA of each point from a cost volume induced by MVSNet using 2 cameras

Side note: Transformer - attention is all you need replacing LSTM and early RNNs.

Generalizable NeRF Transformer (GNT):

breaking down into two problems
- View Transformer: aggregate images to represent the scene into one latent vector (therefore you can query the latent vector for any view direction's feature, not color, you want)
- Ray Transformer: aggregates points along the ray to represent pixel RGB

Preliminary Result

There are 9 additional articles in my reading list.

Paper:

breaking up images into pixel rays is clever. It creates more data and well used the assumption that each pixel shares the same rendering function (with 3d object baked into rendering function).
1. can be break the rendering function even more by separating model part and true rendering function part? So like we can generalize the rendering function accross different models. And we freeze the rendering function part and only overfit the model part, so train less parameters for each new model.
instead of "x" being center of ray, what if you let it be camera view position?
1. if it were the camera position, then there would be data that is not relevant to the object, which decreases generalization for one voxel.
there is no separation between "base color" and "resulting color". You are only asking for resulting color. So environment is baked into it, which is bad.
1. the paper assumes "opacity" to be a voxel's internal property. Why not base color, specular, etc.? This might enlarge the output space, making generalization hard.
2. because the paper uses a voxel representation, most of the voxels are "completely transparent", so the model will generalize better when you assume opacity is the model's basic property
  1. is there a way to avoid voxel representation as most of them are blank? I mean, we can maybe shrink (regularize, or smooth) representation space to achieve better generalization?
for a ray, do you query one point or all points? transparency issue?
1. for opaque objects, if we do not use voxel representation, we might not need query every point
2. for opaque objects, if we do use a voxel representation, we only need to query points up to the first opaque voxel (this assumes our representation is very good, i.e. it satisfies only the stationary equation. But in training we don't know for sure whether a voxel is opaque, so we need a majority vote on opacity. Can we do a hard-coded majority vote first? Because we are in 3D, if we imagine querying a transparent voxel outside of a cube, then 2 views will vote opaque and 4 will vote transparent. However, if you have a transparent voxel in a bowl, then all 6 will vote opaque. This is a problem.)
3. The integration along the ray to get a color should be weighted by transparency. However, transparency is the very thing we are going to train... Would a large batch size help?
the overfitted model can't guess unseen voxels
can we actually implement them in games, and what problems will arise?
What, the paper is ancient technology
Why do you bake the rendering function? Couldn't you have directly optimized the model?
Tricks
1. separate coarse and fine networks
2. cos decomposition of voxel representation

Report

16 cameras at uniform (semi-random) angles over the hemisphere is the minimum to get an okay-looking result
training time is independent of the number of cameras, but quality depends on it
With 1 GPU, 3 seconds of training is enough to get an okay-looking result. That means that with the naive original implementation and no optimization, we would need at least 90 GPUs (3 s per frame × 30 frames per second) to achieve 30 FPS. As you may realize, this naive implementation is not practical. One intuitive next step is to keep training the same model on new data. This is what I am working on.

Hands on Experience without Code

Short Video: Youtube
COLMAP with NGP with real camera data and a detailed procedure: Youtube
From Video Automatic Tool: Youtube

Ideas and Questions

Questions

what kind of sampling do we need for a large open environment (I guess we cannot pre-define it)
game engine: alternative to octree representation? (moving vs non-moving objects, objects changing shape?)

Image De-blur

NerfStudio

NerfStudio: a Python framework for running different kinds of NeRF using PyTorch. (Video instruction)

polycam, phone, insta360

Monocular Dynamic View Synthesis: A Reality Check

Monocular: using only one camera, to capture either animated or static scene with animated or static object

Current Research:

static object: unrealistic
animated object, but with camera teleportation: assume an Olympic runner taking a video of a moving scene without introducing any motion blur, unrealistic.

Depth Supervision

Dense Depth Priors for Neural Radiance Fields from Sparse Input Views (CVPR'2022)

Dense Depth Priors for Neural Radiance Fields from Sparse Input Views (CVPR'2022):

SfM produces camera orientation, location, and a sparse (not many points, non-uniform density), noisy point cloud reconstruction
augment each image with depth computed from point cloud
feed augmented image to a network for dense depth map and "uncertainty" at object edge
Use depth map and "uncertainty" to guide NeRF

Supervision with sparse depth to achieve better sampling strategy

$\begin{align*} \mathcal{L} =& \sum_{r \in R} (\mathcal{L}_{MSE-color}(r) + \lambda \mathcal{L}_{GaussianNLL-depth}(r))\\ \mathcal{L}_{GaussianNLL-depth}(r) =& \log(\hat{s}(r)^2) + \frac{(\hat{z}(r)-z(r))^2}{\hat{s}(r)^2}\\ \hat{z}(r) =& \sum_{k = 1}^K w_kt_k \tag{weighted "average" of predicted depth, computed same way as color}\\ \hat{s}(r)^2 =& \sum_{k = 1}^K w_k(t_k - \hat{z}(r))^2 \tag{weighted variance}\\ \end{align*}$

where $r$ is a ray in the batch $R$, $K$ is the number of samples along the ray, $w_k$ is the rendering weight of sample $k$, and $t_k$ is its depth from the camera. (GNLL is used because MSE is bad.)
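A minimal per-ray sketch of the depth term above, assuming the rendering weights $w_k$ and sample depths $t_k$ are already available as plain lists. The function names and the `eps` clamp are my own additions for illustration, not part of the paper.

```python
import math


def depth_moments(weights, ts):
    """Weighted mean and weighted variance of depth along one ray."""
    z_hat = sum(w * t for w, t in zip(weights, ts))
    s2_hat = sum(w * (t - z_hat) ** 2 for w, t in zip(weights, ts))
    return z_hat, s2_hat


def gaussian_nll_depth(weights, ts, z_target, eps=1e-8):
    """log(s_hat^2) + (z_hat - z)^2 / s_hat^2 for one ray."""
    z_hat, s2_hat = depth_moments(weights, ts)
    s2_hat = max(s2_hat, eps)  # assumption: clamp to avoid log(0) and division by zero
    return math.log(s2_hat) + (z_hat - z_target) ** 2 / s2_hat


z_hat, s2_hat = depth_moments([0.25, 0.5, 0.25], [1.0, 2.0, 3.0])
```

The total loss would then add `lambda * gaussian_nll_depth(...)` to the color MSE for every ray in the batch.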

The authors do not mention until Section 4.5 (Limitations) that the depth completion network is trained separately on a large dataset. The depth network takes not just sparse depth, but images too.

They use ResNet + Convolutional Spatial Propagation Network (CSPN), which spreads depth to neighboring pixels (since depth is sparse), to predict dense depth.

They simulated SfM results using sensor depth + random perturbation to avoid running SfM on a large dataset.

Depth with Uncertainty: the uncertainty map acts as a switch on the loss to filter noisy depth

Depth Supervision When: 1) the difference between the predicted and target depth is greater than the target standard deviation (Eq. (12)), or 2) the predicted standard deviation is greater than the target standard deviation.

This bounds the depth within 1 STD, while giving some freedom to fit color.
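The two switching conditions can be sketched as a small predicate. This is only my reading of the rule as written in these notes; the function and argument names are hypothetical.

```python
def depth_loss_active(z_hat, s_hat, z_target, s_target):
    """Return True when the depth term should be applied to this ray."""
    depth_off = abs(z_hat - z_target) > s_target  # condition 1: prediction outside 1 STD
    too_uncertain = s_hat > s_target              # condition 2: prediction less certain than target
    return depth_off or too_uncertain
```

Rays for which the predicate is False are already within one target standard deviation and at least as confident as the target, so only the color loss acts on them.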

// QUESTION: what is latent code?

Depth-supervised NeRF: Fewer Views and Faster Training for Free

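From the above we know:

$\sigma(t)$ : the density (opacity per unit length)
$T(t) = \exp(-\int_0^t \sigma(s) ds)$ : the transmittance (how much of the ray survives occlusion up to $t$)
$h(t) = T(t)\sigma(t)$ : the ray termination distribution, i.e. the weights that linearly interpolate color along the ray

Notice $h(t)$ always looks like an impulse regardless of how many layers of objects there are along the ray.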
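A small numerical sketch of why $h(t) = T(t)\sigma(t)$ collapses to a single impulse. It assumes the usual NeRF quadrature, where each sample contributes $T_k (1 - \exp(-\sigma_k \delta_k))$; the function name and the toy densities are made up for illustration.

```python
import math


def termination_distribution(sigmas, deltas):
    """Discrete h_k = T_k * (1 - exp(-sigma_k * delta_k)) along one ray."""
    weights = []
    optical_depth = 0.0  # running sum of sigma_j * delta_j for j < k
    for sigma, delta in zip(sigmas, deltas):
        transmittance = math.exp(-optical_depth)
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        optical_depth += sigma * delta
    return weights


# Two opaque layers along the ray (samples 1 and 3), empty space elsewhere.
h = termination_distribution([0.0, 50.0, 0.0, 50.0, 0.0], [1.0] * 5)
```

Almost all of the mass lands on the first opaque layer; the second layer is hidden because the transmittance has already dropped to nearly zero.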

Say we know that, in reality, the ray terminates exactly at distance $d$ . Then we want our $h(t)$ distribution to look like an impulse distribution ( $\delta$ function) centered at $d$ with a very high peak. Since we want the two distributions to be the same, we measure the Kullback–Leibler (KL) divergence between them.

By definition, $KL[P|Q] = \sum_{x \in X} P(x) \log(\frac{P(x)}{Q(x)})$

But we don't know whether the ray terminates exactly at distance $d$ even with the help of depth input, so $d$ is itself a random variable. We can model it by a normal distribution $D \sim N(\hat{d}, \hat{\sigma})$ where $\hat{d}, \hat{\sigma}$ are the depth and uncertainty estimated by COLMAP.

With the above uncertainty, our delta function looks like $\delta(t - D)$ and we want $h(t)$ to be close to it.

$E_D[KL[\delta(t - D)|h(t)]] = KL[N(\hat{d}, \hat{\sigma})|h(t)] + \text{const}$

KL divergence by fixing $P(x)$
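Since the Gaussian target is fixed, minimizing $KL[N(\hat{d}, \hat{\sigma})|h(t)]$ over $h$ only leaves the cross-entropy part $-\sum_k N(t_k) \log h_k \Delta t_k$. Below is a rough per-ray sketch of that term. Using an unnormalized Gaussian, the `eps` inside the log, and the function name are assumptions of this sketch, not a transcription of the paper's code.

```python
import math


def ds_depth_loss(h, ts, dts, d_hat, sigma_hat, eps=1e-10):
    """Cross-entropy between a Gaussian around d_hat and the termination weights h."""
    loss = 0.0
    for h_k, t_k, dt_k in zip(h, ts, dts):
        # Unnormalized Gaussian target evaluated at the sample depth (assumption).
        target = math.exp(-((t_k - d_hat) ** 2) / (2.0 * sigma_hat ** 2))
        loss -= math.log(h_k + eps) * target * dt_k
    return loss


ts, dts = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
peaked_at_depth = ds_depth_loss([0.05, 0.9, 0.05], ts, dts, d_hat=2.0, sigma_hat=0.5)
peaked_elsewhere = ds_depth_loss([0.9, 0.05, 0.05], ts, dts, d_hat=2.0, sigma_hat=0.5)
```

The loss is smaller when $h$ puts its mass near the measured depth, which is the behavior the impulse argument asks for.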

Although the method is correct, it is not intuitive to me why it is a good choice of loss function. So far, the only difference between this and the previous paper is that this one uses KL as the loss and the previous paper uses GaussianNLL with uncertainty. I want to show that the two ways of thinking are actually quite similar, but that's math and math consumes time. So let's end here. // QUESTION

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo:

fine-tune the depth network for each test scene using sparse SfM
query the depth oracle for each image, project into cube, find the error for depth
and hard clamp the sampling interval so that it scales with the error
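A rough sketch of an error-scaled sampling interval, assuming the depth prior and a relative error estimate are known per ray. The bounds `lo` and `hi` and the exact form `depth * (1 ± e)` are illustrative assumptions here, not values taken from the paper.

```python
def guided_sampling_interval(depth_prior, rel_error, lo=0.05, hi=0.15):
    """Return (near, far) sampling bounds around the depth prior for one ray."""
    e = min(max(rel_error, lo), hi)  # clamp the relative error to [lo, hi]
    return depth_prior * (1.0 - e), depth_prior * (1.0 + e)


near, far = guided_sampling_interval(2.0, 0.5)  # large error, clamped to hi
```

A confident prior gives a narrow interval (few wasted samples), while an unreliable prior widens it up to the `hi` bound.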

PNeRF: Probabilistic Neural Scene Representations for Uncertain 3D Visual Mapping

PNeRF:

Similar to DSNeRF, model captured depth as a Gaussian
Still using coarse and fine networks
Loss has: color term, density term (exactly the same as color), and Gaussian depth term
instead of doing KL divergence, it just subtracts two distributions for a loss

RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs

RegNeRF:

regularize from novel view without ground truth
- regularize for smoothness of predicted depth
- regularize for maximum log-likelihood after feeding a small color patch into a pretrained flow model. (we want the patch to look "real")
using MipNeRF to get rid of the coarse-fine network // QUESTION: which to me is questionable, since coarse-fine has to do with sampling but Mip NeRF has to do with anti-aliasing. Mip trades time for quality.
Sample Space Annealing: sample ray midpoint more in the beginning
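A sketch of sample space annealing as I understand it: the near/far bounds start as a narrow window around the ray midpoint and widen linearly to the full range. The parameter names `anneal_steps` and `start_fraction`, their default values, and the choice of the midpoint as the center are assumptions for illustration.

```python
def annealed_bounds(step, near, far, anneal_steps=512, start_fraction=0.5):
    """Return the (near, far) bounds to sample within at a given training step."""
    mid = 0.5 * (near + far)
    # Fraction of the full range that is open at this step, kept in [start_fraction, 1].
    eta = min(max(step / anneal_steps, start_fraction), 1.0)
    return mid + (near - mid) * eta, mid + (far - mid) * eta


early = annealed_bounds(0, near=2.0, far=6.0)
late = annealed_bounds(512, near=2.0, far=6.0)
```

Early in training only the middle of the ray is sampled; once `step` reaches `anneal_steps` the full range is used.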

Simulated annealing searching for a maximum. The objective here is to get to the highest point. In this example, it is not enough to use a simple hill climb algorithm, as there are many local maxima. By cooling the temperature slowly the global maximum is found.

360FusionNeRF: Panoramic Neural Radiance Fields with Joint Guidance

360FusionNeRF

depth supervision by error divided by the square root of the depth variance (another method compared to above)
CLIP’s Vision Transformer for semantic consistency

$\mathcal{L}_{depth} = \sum_{r \in R} \frac{|\hat{D}(r) - D(r)|}{\sqrt{\hat{D}_{var}(r)}}$

where $\hat{D}_{var}(r) = \sum_{i = 1}^N T_i(1 - \exp(-\sigma_i \delta_i))(\hat{D}(r) - t_i)^2$ .
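A per-ray sketch of this depth term. It assumes $\hat{D}(r)$ is rendered as the weighted average of sample depths $t_i$ with the same weights $T_i(1 - \exp(-\sigma_i \delta_i))$, and that $T_i$ is the usual accumulated transmittance; the function name and `eps` are made up for illustration.

```python
import math


def depth_over_std_loss(sigmas, deltas, ts, d_target, eps=1e-8):
    """|D_hat - D| / sqrt(D_var) for one ray."""
    weights, optical_depth = [], 0.0
    for sigma, delta in zip(sigmas, deltas):
        # T_i * (1 - exp(-sigma_i * delta_i))
        weights.append(math.exp(-optical_depth) * (1.0 - math.exp(-sigma * delta)))
        optical_depth += sigma * delta
    d_hat = sum(w * t for w, t in zip(weights, ts))
    d_var = sum(w * (d_hat - t) ** 2 for w, t in zip(weights, ts))
    return abs(d_hat - d_target) / math.sqrt(d_var + eps)  # eps avoids division by zero


sig = [math.log(2.0), math.log(2.0)]  # gives weights 0.5 and 0.25
loss_far = depth_over_std_loss(sig, [1.0, 1.0], [1.0, 3.0], d_target=2.0)
loss_near = depth_over_std_loss(sig, [1.0, 1.0], [1.0, 3.0], d_target=1.5)
```

Dividing by the predicted spread means a ray with a wide, uncertain depth distribution is penalized less for the same absolute error.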

By the way, this is done by CMU Engineering too

Other Notable Paper

Some NeRF Variants: two works that tell us how to use depth to translate 3D keypoints through time.

Nerfies: Deformable Neural Radiance Fields: model geometry changes using deformation, which I assume is very costly.
Neural Scene Flow Fields (NSFF): similar to above, but encodes time as another dimension.

Some Other NeRF Variants:

DINER: Depth-aware Image-based NEural Radiance fields: a variant of NeRF that separates the depth and color networks, so we can inject depth at test time instead of only at training time (I think it has potential for some generative applications, EXCITING WORK)
FWD: Real-Time Novel View Synthesis With Forward Warping and Depth: using explicit representation, sparse input, no per-scene optimization
Moving in a 360 World: Synthesizing Panoramic Parallaxes from a Single Panorama: simply filter out pixels whose depth values are larger than the local median depth multiplied by a tolerance ratio // QUESTION: don't quite understand either
GeoNeRF: Generalizing NeRF with Geometry Priors: assume ground truth depth for each ray obtainable? // QUESTION: didn't quite understand, looks like the guy writing the paper is coming from the old DL community (I remember using FPN was hot when I was in high school)

Classical Depth Processing: how to clean up captured depth (incremental updating, representation of directional uncertainty, fill gaps in reconstruction, robust to outlier), typically involve SDF, didn't read too much into it

Real-time 3d reconstruction and interaction using a moving depth camera.
A Volumetric Method for Building Complex Models from Range Images
Sparse generative neural networks for self-supervised scene completion of rgb-d scans
Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans.

Some Other Paper about Depth: that I can't classify

SPARF: Neural Radiance Fields from Sparse and Noisy Poses: a good dataset for training small prior network if needed
NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields: added depth loss to instant-ngp, achieve realtime with RTX 2080 Ti with pretrained Droid-SLAM. // QUESTION: it looks like the SLAM produces a depth field? Didn't understand that part. But their result is bad and proposed nothing new.
Handheld Neural Multi-frame Depth Refinement: use hand shake to reconstruct depth from phone video
(Rejected ICLR) MaskNeRF: Masked Neural Radiance Fields for Sparse View Synthesis: mask out stuff

Rendering Speedup:

DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks: for rendering, predict where to sample rays using yet another layer of network (but they are smart enough to predict logarithmically discretized and spherically warped depth values instead of raw depth, 48x speed up and 20FPS)
FastNeRF: High-Fidelity Neural Rendering at 200FPS (2021): speed up rendering by some magnitude with caching, querying
Baking Neural Radiance Fields for Real-Time View Synthesis
PlenOctrees for Real-time Rendering of Neural Radiance Fields (very different approach than plenoxels)
NeX: Real-time View Synthesis with Neural Basis Expansion: multi-plane method speed up original NeRF rendering time from 0.02FPS to 60FPS
Plenoxels: Radiance fields without neural networks
TensoRF: speed up blender dataset training time down to 3min // TODO: read
InstantNGP
AdaNeRF: Adaptive Sampling for Real-time Rendering of Neural Radiance Fields // TODO: read

Regularization from Novel View:

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis: regularize color in semantic space using pre-trained network
RegNeRF(see above)

Bad Paper: don't read them

CDNeRF: A Multi-modal Feature Guided Neural Radiance Fields: fuse predicted depth with image encoding using transformer, which is too much of a stretch
RGB-D NEURAL RADIANCE FIELDS: LOCAL SAMPLING FOR FASTER TRAINING: Stratified sampling and Gaussian sampling (result is trivial)
Neural Sparse Voxel Fields: octree rendering 10x faster than original NeRF
Mip-NeRF RGB-D: Depth Assisted Fast Neural Radiance Fields: no new stuff
INGeo: Accelerating Instant Neural Scene Reconstruction with Noisy Geometry Priors: exactly the same as my summer work, and thankfully it did not get published

Didn't Read:

CLONeR: Camera-Lidar Fusion for Occupancy Grid-aided Neural Representations
TermiNeRF: Ray Termination Prediction for Efficient Neural Rendering

// QUESTION: And there is one work that I don't know how to comment on - FWD: Real-time Novel View Synthesis with Forward Warping and Depth. I didn't quite understand their approach because of the way the paper is written

Assumptions

only one querying camera
moving object is small in frame: mask update by (color/depth) delta
1. depth voxel at boundary should change quickly
assume Lambertian object, no semitransparency: global impulse regularization
About Dense Scene: don't strive for sparse scenes and therefore don't use captured data; use a synthetic simulation of captured data (with added noise), since a sparse scene will not improve training time dramatically and leads to poor generalization. Our goal is to make it really fast, with a cool demo video to attract attention.
About Dense Depth: But we should strive for sparse depth: NeRF uses color to help density reconstruction, so if we have dense depth, there is probably no point in using NeRF. The architecture would be very different and we should not call it NeRF. The idea of depth delta does not make sense. If you have depth delta, then you must have a dense depth map, so why bother.

Introduction: We are trying to speed up both training and rendering time for NeRF to eventually lead toward interactivity. One application is 3D monitoring by reconstruction on the fly. To do so, we make use of sensor depth since it is readily available and can speed up NeRF greatly through regularization guidance and smarter sampling. Translation is challenging since NeRF is fundamentally more like a gridded representation of the scene than particles (see Graphics), but being a neural network gives us flexibility.

Related Work:

Depth: Dense Depth Priors for Neural Radiance Fields from Sparse Input Views uses a pre-trained CNN to complete the sparse depth map and uses that as a loss term with GNLL. DSNeRF directly uses sparse sensor depth but guides the output depth to be an impulse distribution around the uncertain sensor input. 360FusionNeRF added a depth loss divided by the square root of the depth variance. PNeRF added both a density term (calculated the same way as color) and a depth term (normal distribution) to the loss. NerfingMVS applied a hard clamp over the sampling region using depth. RegNeRF regularizes depth smoothness using novel views. Notably, built upon instant-ngp, NeRF-SLAM achieved realtime scene reconstruction for static scenes.
Time: Nerfies models geometry updates as deformation, and Neural Scene Flow Fields (NSFF) encodes time as another dimension. Both approaches add another dimension to the search space for a valid representation.
Mask: (Rejected ICLR) MaskNeRF importance-samples the masked region for better quality, but the methodology is strange.
Rendering Speedup: DONeRF predicts the ray termination region using a NN. FastNeRF uses a cache. NeX, PlenOctrees, Plenoxels, and InstantNGP use acceleration data structures.

Steps:

regularize for impulse distribution simultaneously from all views (one approximation: you might need some volumetric global illumination for the NeRF algorithm for occlusion stuff, something like a fading border region)
continuous learning: since you can't importance sample a neural network backpropagation (especially with hash collision), you need to avoid catastrophic forgetting. Here is where continuous learning comes into play (yeah, I didn't believe I would ever touch that). A continual learning survey: Defying forgetting in classification tasks

Ideas that shouldn't work:

small prior network for regularization: not novel. I did find a dataset to help with this, but it requires time to train and tweak, so it is not worth it.
view-time sampling: only sample rays near the querying view, but don't forget other rays, this is a challenging but interesting approach (inspired by RegNeRF)
main method is to prioritize sampling similar rays as view rays, and delta pixels, but this is very tricky
- we can utilize an edge device to do small computations for quality improvement: train a fading NeRF overlay using only delta input by sampling ground truth rays near the same direction. The idea is that we want to do importance sampling, but doing so with a NN is tricky because of catastrophic forgetting. So we apply a common continuous learning method: parameter separation. But this method will result in significant artifacts when the camera is moving fast.

Ideas that should work:

we can compute a closed form (approximated) depth distribution regularization for every ray in the ray space. It mainly deals with sparse views. Generally:
- soft depth: speed up by modifying sampling distribution along ray
- hard depth: speed up by clamping sampling distribution, major speed up at beginning of training new region
main method is to prioritize sampling similar rays as view rays, and delta pixels, but this is very tricky
- we can do importance sampling with caching loss over the static region to avoid oversampling while maintaining memory of the static region.
- we can try falling back to not using hash encoding; I think it has a chance of still doing okay with plenoxels
could cache depth completion result and use diffusion to update completed depth with delta depth
we could apply Mip-NeRF to depth, but this has nothing to do with this paper.
we could do some network regularization directly on NGP grid, so that we can apply acceleration algorithm directly on NGP grid (especially for synthetic scene)
- compute (inverse interpolate) and store explicit depth grid on NGP, concatenate that to NGP-NN input as constant. So 0 is no density, 0.5 is unknown density, 1 is known density. So a
- freezing final NN in NGP after done pre-training, since info in NN is not view dependent, not location dependent
editable NGP
mipmap combined with multiresolution training
try smarter batch: different angle

So, the problem is that the density fades out or comes in too slowly. I suspect that this is due to the diminishing gradient for the values in the grid. For example, when we have transparency in the input image, we should be able to immediately know that the region is completely transparent, right? So we want a way to directly change the representation of that grid to have transparency. But since we have a NN processing the grid values in NGP, we can't do so directly. What's the solution? We force a layer of the grid to store only density by adding regularization to the architecture, so whenever the network sees 0 density, it should output 0 density. So after we freeze the network, the behavior should be the same, and we are able to train the network faster.

sampling less rays, clamp depth

Some reading on Plenoxel

octrees: [11, 38, 50, 52] (see [16] for a survey)
standard signal processing methods with interpolation: [30]
forward volume rendering formula introduced: Max [22], [12]
other voxel: Neural Volumes [20]
subdividing the 3D volume: [19, 35]
Numerical: [7], [18], [49]
predicting a surface or sampling near the surface: [14, 27, 33]
RMSProp: [10]
Realworld Loss: encourage empty, clear foreground-background decomposition
Speed: CUDA implementation, speedup not supported for volumetric rendering

Residual Field

The (differentiable version of the) volumetric rendering equation looks like this (fixing a direction $d$ and a point in space $p(o, t) = o + td$ )

$\hat{C} = \sum_i \exp\left(-\sum_{j = 0}^i \alpha_j \right)\alpha_i C_i$

The simple MSE loss:

$\begin{align*} \left(C - \sum_i \exp\left(-\sum_{j = 0}^i \alpha_j \right)\alpha_i C_i\right)^2 \end{align*}$

with a residual added to the field, where $\alpha_j', C_i'$ are constant:

$\begin{align*} \left(C - \sum_i \left[\exp\left(-\sum_{j = 0}^i \alpha_j \right)\alpha_i C_i + \exp\left(-\sum_{j = 0}^i \alpha_j' \right)\alpha_i' C_i'\right]\right)^2 \end{align*}$

or if you do field-space instead of image space

$\begin{align*} \left(C - \sum_i \exp\left(-\sum_{j = 0}^i (\alpha_j+\alpha_j') \right)(\alpha_i+\alpha_i') (C_i + C_i')\right)^2 \end{align*}$

Because the primed terms are constant, the image-space version simplifies to

$\begin{align*} \left(C - \text{some constant} - \sum_i \exp\left(-\sum_{j = 0}^i \alpha_j \right)\alpha_i C_i\right)^2 \end{align*}$

So we can do image space?
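A toy check of the image-space idea, assuming the simplified rendering sum $\sum_i \exp(-\sum_{j \le i} \alpha_j)\alpha_i C_i$ with scalar colors. All function names are hypothetical. The frozen (primed) field renders to a constant, so training the residual in image space is the same as fitting a shifted target.

```python
import math


def render(alphas, colors):
    """Simplified rendering sum along one ray (scalar color for brevity)."""
    total, accumulated = 0.0, 0.0
    for alpha, color in zip(alphas, colors):
        accumulated += alpha  # inclusive sum over j = 0..i
        total += math.exp(-accumulated) * alpha * color
    return total


def plain_mse(target, alphas, colors):
    return (target - render(alphas, colors)) ** 2


def image_space_residual_loss(target, alphas, colors, frozen_alphas, frozen_colors):
    """MSE when a frozen field's rendering is added to the trainable one in image space."""
    constant = render(frozen_alphas, frozen_colors)  # does not depend on trainable values
    return (target - constant - render(alphas, colors)) ** 2


frozen = ([0.2, 0.3], [0.4, 0.6])
loss = image_space_residual_loss(0.8, [0.1, 0.5], [1.0, 0.5], *frozen)
```

The field-space variant does not separate this way, because the primed and unprimed densities mix inside the exponential.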

Below is training pipeline of NerfStudio instant-ngp

entrypoint() in script/train.py
train_loop() in script/train.py
train() in trainer.py - output
train_iteration() in trainer.py - forward, backward
get_train_loss_dict() in base_pipeline.py - generate ray, feed model, get metric and loss
forward(), get_outputs() in base_model.py - any output
get_outputs() in instant_ngp.py - return rgb,acc,depth
forward() in base_field.py
get_density()->get_outputs() in base_field.py
get_density()->get_outputs() in instant_ngp_field.py

render_weight_from_density() in instant_ngp.py

renderer_rgb() in instant_ngp.py - start accumulate for loss
forward() in renderers.py

get_loss_dict() in instant_ngp.py - feed image and get rgb loss

FART-NeRF - Fast Accumulative Realtime Training of NeRF

Problem:

can't train in realtime
use case can enable NeRF realtime streaming

Hardness:

smart sampling strategy for speed
multi-frame consistency for flickering removal
catastrophic forgetting of scene elements

Assume:

most objects are non-transparent

Limitation:

model size: but can compress after trained
