A good paper feeds millions of researchers, who can plagiarize its simple idea to publish their own papers.
CT Scan Reconstruction
MRI Reconstruction Course
Deep Learning for MRI
MRI Basics Part 1 - Image Formation
It uses an analytical method that relies on signal processing.
NeRF Paper Reading: Youtube
NeRF's objective is to reconstruct 3D geometry from an array of images. However, the product of a NeRF is not actual geometry, but a neural network that represents both the rendering function and the geometry.
Radiance Fields: a space in \mathbb{R}^3 such that each point contains a spherical function instead of a single value. This essentially represents a 3D volume with 2D view-dependent appearance, basically geometry data viewed from different angles. At each point it is a function from view direction to color.
Input: ((\theta, \phi), (x, y, z)). Output: ((r, g, b), \alpha). Loss: the difference between \int_D T(t)\,\alpha(t)\,c(t)\,dt and the ground-truth (r, g, b), where D is the segment along the ray.
Procedural:
// QUESTION: I'm not quite sure how exactly to integrate \alpha.
We don't select sample locations uniformly. We do two passes: the first pass uses uniformly spaced samples along the ray, and the second pass concentrates samples near the surface of the object.
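A minimal sketch of that second pass, assuming the first pass already produced per-bin weights along the ray (the bin layout and function names are illustrative, not any particular implementation):

#include <random>
#include <vector>

// Draw fine-pass sample positions by inverting the piecewise-constant CDF
// built from the coarse-pass weights (importance sampling).
// bin_edges has N+1 entries, weights has N entries (one per coarse bin).
std::vector<float> sample_pdf(const std::vector<float>& bin_edges,
                              const std::vector<float>& weights,
                              int n_samples, std::mt19937& rng) {
    std::vector<float> cdf(weights.size() + 1, 0.0f);
    for (size_t i = 0; i < weights.size(); ++i)
        cdf[i + 1] = cdf[i] + weights[i];
    float total = cdf.back() + 1e-5f;        // avoid division by zero
    for (float& c : cdf) c /= total;

    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    std::vector<float> samples(n_samples);
    for (int s = 0; s < n_samples; ++s) {
        float u = uni(rng);
        size_t i = 1;                        // find the bin whose CDF interval contains u
        while (i + 1 < cdf.size() && cdf[i] < u) ++i;
        float denom = cdf[i] - cdf[i - 1];
        float frac = denom > 0.0f ? (u - cdf[i - 1]) / denom : 0.5f;
        samples[s] = bin_edges[i - 1] + frac * (bin_edges[i] - bin_edges[i - 1]);
    }
    return samples;
}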
However, if you just do that, the result will be poor, because for some reason networks have a hard time overfitting high-frequency detail. Below is an example of a network trained with input (x, y) and output (r, g, b). The result is not great.
So the idea is to split the input signal into different frequency bands (positional encoding) to, in a sense, amplify the loss at higher frequencies. The same strategy can be found in Transformers.
Further Readings: Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
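A minimal sketch of that positional encoding for a single scalar input (the number of frequency bands is a hyperparameter; the function name is made up):

#include <cmath>
#include <vector>

// Map a scalar x to [sin(2^0 * pi * x), cos(2^0 * pi * x), ..., sin(2^(L-1) * pi * x), cos(2^(L-1) * pi * x)],
// so the network sees the same signal at several frequencies at once.
std::vector<float> positional_encoding(float x, int num_bands) {
    const float pi = 3.14159265358979f;
    std::vector<float> features;
    features.reserve(2 * num_bands);
    float freq = pi;                 // 2^0 * pi
    for (int l = 0; l < num_bands; ++l) {
        features.push_back(std::sin(freq * x));
        features.push_back(std::cos(freq * x));
        freq *= 2.0f;                // next frequency band
    }
    return features;
}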
NTK: an infinite-width fully connected network, initialized with reasonable weights and trained with infinitesimally small steps. It is a good mathematical tool for gaining insight into fully connected layers.
How do we differentiate rendering with respect to density? We know
C = \int_{t_n}^{t_f} T(t)\,\alpha(t)\,c(t)\,dt,
which says the final color we render onto the screen is the integral of the blockage T multiplied by the density \alpha and the color c, and the blockage (occlusion) is itself another integral:
T(t) = \exp\left(-\int_{t_n}^{t} \alpha(s)\,ds\right).
You should think of the density \alpha(t) as the (differential) probability that the ray terminates at a specific point, given that it has made it that far along the ray. Also, h(t) := T(t)\alpha(t) is the probability that the ray terminates at time t, and we need to ensure that the ray will at some point terminate: \int T(t)\alpha(t)\,dt = 1. Proving that h(t) is indeed a distribution is just a small math exercise.
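Sketch of that exercise (assuming the ray is eventually fully occluded, i.e. T(\infty) = 0): since T(t) = \exp(-\int_{t_n}^{t} \alpha(s)\,ds), we have T'(t) = -\alpha(t)\,T(t), so
\int_{t_n}^{\infty} T(t)\,\alpha(t)\,dt = -\int_{t_n}^{\infty} T'(t)\,dt = T(t_n) - T(\infty) = 1 - T(\infty) = 1,
and h(t) \ge 0 everywhere, so h is a valid probability density.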
Due to practical constraints, NeRFs assume that the scene lies between near and far bounds (t_n, t_f).
We often write the discretized version as \hat{C} = \sum_{i=1}^{N} T_i\,(1 - \exp(-\alpha_i \delta_i))\,c_i, where T_i = \exp(-\sum_{j=1}^{i-1} \alpha_j \delta_j) and \delta_i = t_{i+1} - t_i is the spacing between samples.
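A minimal sketch of that discrete compositing as code (per-sample densities, colors, and spacings are assumed given; names are illustrative):

#include <array>
#include <cmath>
#include <vector>

// Composite per-sample (density, color) pairs along a ray into one pixel color:
// C_hat = sum_i T_i * (1 - exp(-density_i * delta_i)) * color_i.
std::array<float, 3> composite_ray(const std::vector<float>& density,
                                   const std::vector<std::array<float, 3>>& color,
                                   const std::vector<float>& delta) {
    std::array<float, 3> out{0.0f, 0.0f, 0.0f};
    float transmittance = 1.0f;                      // T_i: light not yet blocked
    for (size_t i = 0; i < density.size(); ++i) {
        float alpha = 1.0f - std::exp(-density[i] * delta[i]);
        float weight = transmittance * alpha;        // h_i, the blending weight
        for (int k = 0; k < 3; ++k) out[k] += weight * color[i][k];
        transmittance *= 1.0f - alpha;               // accumulate occlusion
    }
    return out;
}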
Why does the focal length in the camera intrinsics matrix have two dimensions?
In the pinhole camera model there is only one focal length, which is the distance between the principal point and the camera center.
However, after calculating the camera's intrinsic parameters, the matrix contains
(fx, 0, offsetx, 0,
0, fy, offsety, 0,
0, 0, 1, 0)
where fx and fy indicate the focal length in the x and y directions. This is the OpenCV convention and also appears in Plenoxels' code.
struct PackedCameraSpec {
PackedCameraSpec(CameraSpec& cam) :
c2w(cam.c2w.packed_accessor32<float, 2, torch::RestrictPtrTraits>()),
fx(cam.fx), fy(cam.fy),
cx(cam.cx), cy(cam.cy),
width(cam.width), height(cam.height),
ndc_coeffx(cam.ndc_coeffx), ndc_coeffy(cam.ndc_coeffy) {}
const torch::PackedTensorAccessor32<float, 2, torch::RestrictPtrTraits>
c2w;
float fx;
float fy;
float cx;
float cy;
int width;
int height;
float ndc_coeffx;
float ndc_coeffy;
};
sx is the scaling factor in the x-direction (horizontally); it is a value that scales the horizontal dimensions of the image plane.
sy is the scaling factor in the y-direction (vertically); it is a value that scales the vertical dimensions of the image plane.
So the calculation looks like
fx = f*sx
fy = f*sy
sx = 10 pixel/mm (how many pixels per millimeter, the conversion factor)
f = 35mm (focal length)
then fx = sx*f = 35mm * 10 pixel/mm = 350 pixels
So fx is the focal length measured in pixels, where the size of a pixel is defined by how big the pixel is in the x direction.
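To tie fx, fy, cx, cy back to ray generation: a minimal sketch of turning a pixel coordinate into a camera-space ray direction for a pinhole camera. The axis convention (here OpenCV-style, +z forward) is an assumption and differs between codebases:

#include <cmath>

struct Vec3 { float x, y, z; };

// Camera-space ray direction through pixel (u, v) for a pinhole camera with
// focal lengths (fx, fy) in pixels and principal point (cx, cy).
// Rotate the result by the camera-to-world matrix (c2w) to get a world-space ray.
Vec3 pixel_to_ray_dir(float u, float v, float fx, float fy, float cx, float cy) {
    Vec3 d{(u - cx) / fx, (v - cy) / fy, 1.0f};      // +z forward (OpenCV convention)
    float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    d.x /= len; d.y /= len; d.z /= len;              // normalize
    return d;
}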
The actual loss uses the L2 norm, which is a topic on its own, but I will briefly mention it here.
The L2 norm is the Euclidean distance we are familiar with: given a vector X, its size (norm) can be written as \|X\|_2 = \sqrt{\sum_i x_i^2},
and similarly, the L1 norm is the Manhattan distance \|X\|_1 = \sum_i |x_i|.
So now we can define the Ln norm as follows: \|X\|_n = \left(\sum_i |x_i|^n\right)^{1/n}.
But why is it useful? Why does it matter so much in machine learning? We need to penalize weight vectors that have big norms because they lead to overfitting. To see why, imagine your weights w_0, w_1, w_2, \ldots are used in a polynomial like y = w_0 + w_1 x + w_2 x^2 + \cdots.
Then we want the polynomial to have a lower degree, so we want, for example, to set as many w_i to 0 as possible. But setting a weight to exactly 0 is not a differentiable operation, so we instead require the vector [w_0, w_1, w_2, \ldots]^T to be "small". What is the formula to measure "small"? That's the norm.
When you add a regularization term to your loss function, you get the following landscape.
And because of this, our neural network might favor one weight over another when using different regularizations.
The L1 regularization solution is sparse; the L2 regularization solution is non-sparse (although in practical ML we never look for an exact solution). L2 regularization doesn't perform feature selection, since weights are only reduced to values near 0 instead of exactly 0. L1 regularization has built-in feature selection.
Mean Squared Error (MSE) is another common place where the L1 and L2 norms appear: the function below is MSE with a \lambda-weighted L2 regularization, L = \frac{1}{N}\sum_{i}(y_i - \hat{y}_i)^2 + \lambda \|w\|_2^2.
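A minimal sketch of that regularized loss; \lambda and the container types are placeholders:

#include <vector>

// MSE between predictions and targets plus a lambda-weighted L2 penalty
// on the weights: L = (1/N) * sum_i (y_i - y_hat_i)^2 + lambda * ||w||_2^2.
float mse_with_l2(const std::vector<float>& pred,
                  const std::vector<float>& target,
                  const std::vector<float>& weights,
                  float lambda) {
    float mse = 0.0f;
    for (size_t i = 0; i < pred.size(); ++i) {
        float diff = pred[i] - target[i];
        mse += diff * diff;
    }
    mse /= static_cast<float>(pred.size());

    float l2 = 0.0f;
    for (float w : weights) l2 += w * w;             // squared L2 norm of the weights

    return mse + lambda * l2;
}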
The delta function was introduced by physicist Paul Dirac as a tool for the normalization of state vectors. It also has uses in probability theory and signal processing.
In terms of NeRF, ideally we want the function modeling the color integral's distribution to look like a delta function (so instead of volumetric rendering, we only take the color at the exact ray hit).
For most objects, the variance of such a distribution decreases with more training views from different directions (in terms of the \delta function, it means a \to 0).
Big Idea: deal with small, real-life images that suffer from aliasing and poor resolution. It also improves NeRF by not requiring the camera to be centered at the object's central location.
pixel: becomes an area on the image
ray: becomes a cone
sample point: becomes a weighted sampling surface
// TODO: what is blurpool
It uses a characteristic function to determine whether a point is in the cone, and then calculates the expectation of the 3D Gaussian over the Fourier-transformed (positional-encoded) coordinates. // TODO: think about the math if you have time
Idea: the reason the MLP works is that the layers serve as a prior that assumes smoothness of color (and therefore shape) on the geometry surface.
In the original design of NeRF, the color c is smoother with respect to the direction d than with respect to the position x. Therefore, feeding the input d into later layers helps the accuracy of the model.
For 360 degree captures of unbounded scenes, NeRF’s parameterization of space either models only a portion of the scene, leading to significant artifacts in background elements (a), or models the full scene and suffers from an overall loss of detail due to finite sampling resolution (b).
So we split NeRF into two MLPs: one for the foreground and one for the background.
We parameterize the location encoding for the background to encode space outside of the sphere as a 4D coordinate whose 4th component 1/r decreases with distance. The idea is that the originally sparse encoding of the background becomes denser, so more images can contribute to the color of far backgrounds and resolve background ambiguity.
Parameterization: kind of Mip-NeRF + another way to do NeRF++.
Convergence: separate into two networks but with one gradient; the first predicts density only and the second predicts density and color, using the first network's density to reduce training cost.
// QUESTION: don't quite understand how this would work
Distillation: // TODO: didn't read
The above methods are topology-free and can be rendered in real time, but they are memory intensive (can't capture detailed resolution).
NeRFs can be sampled at arbitrary resolution since the function is continuous. However, they are slow to train and test. Methods to accelerate NeRF include
Spherical harmonics: used to speed up the process of converting a NeRF to a PlenOctree. This is because we move the view-dependent calculation to evaluation time instead of PlenOctree-conversion time. (Also the model looks cleaner.)
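A minimal sketch of evaluating a degree-2 real spherical harmonic basis for a unit view direction and blending stored per-channel coefficients into a color. The constants and sign convention follow what PlenOctrees/Plenoxels-style code commonly uses, so treat the exact signs as an assumption:

#include <array>

// The 9 real SH basis functions (degree <= 2) at unit direction (x, y, z).
std::array<float, 9> eval_sh_basis(float x, float y, float z) {
    std::array<float, 9> b;
    b[0] = 0.28209479f;                              // l = 0
    b[1] = -0.48860251f * y;                         // l = 1
    b[2] =  0.48860251f * z;
    b[3] = -0.48860251f * x;
    b[4] =  1.09254843f * x * y;                     // l = 2
    b[5] = -1.09254843f * y * z;
    b[6] =  0.31539157f * (2.0f * z * z - x * x - y * y);
    b[7] = -1.09254843f * x * z;
    b[8] =  0.54627422f * (x * x - y * y);
    return b;
}

// View-dependent color from per-channel SH coefficients stored at a leaf/voxel.
std::array<float, 3> sh_to_rgb(const std::array<std::array<float, 9>, 3>& coeffs,
                               float x, float y, float z) {
    std::array<float, 9> b = eval_sh_basis(x, y, z);
    std::array<float, 3> rgb{0.0f, 0.0f, 0.0f};
    for (int c = 0; c < 3; ++c)
        for (int i = 0; i < 9; ++i)
            rgb[c] += coeffs[c][i] * b[i];
    return rgb;                                      // usually passed through a sigmoid afterwards
}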
Comments:
The article uses a spherical harmonic basis to decompose the view-direction-dependent color as a spherical function at each point in 3D space. Although most objects can be encoded well with spherical harmonics, 3D scenes with pores, or with the camera inside the geometry, can hardly be viewed as a single spherical function; this is not an issue here because only color, not geometry, is encoded this way. Encoding the geometry itself as one spherical function would not be possible.
I am thinking of throwing away the network entirely and parameterizing color by spherical harmonics, and density by spherical harmonics with continuous functional coefficients (a function that takes in the radius and spits out the actual coefficient).
Also, the PlenOctree isn't spherical, and it is not continuous. It is not a great representation since it isn't rotation invariant, meaning the object might look bad when rotated to a specific angle, and since it is not continuous the voxels will exhibit a Minecraft-like look if the tree isn't deep enough. Well, on the other hand, png compression is more popular than "Fourier compression" in practice.
What is the benefit of representing geometry in a neural network and then in an octree? Why not directly in an octree? If you want to directly tune the octree, it is not possible because the tree structure is fixed. You can instead generate the tree structure first and then tune the leaf values. This is the same as the paper's approach, in which a neural network is used to obtain the coarse tree structure.
The paper uses tree leaves subdivided by occupancy, but we could do that for colors too (in fact, for each channel) to further compress the model. So we would have a total of 4 octrees, each with a different shape, containing the r, g, b, and alpha values respectively. This might compress better, though it might be more costly to evaluate.
The Octree extraction is too slow: 15 minutes
Normalized device coordinates: ?
multi-sphere images: ?
Trilinear interpolation is crucial: it converts the discrete representation into a continuous one to minimize reconstruction loss.
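A minimal sketch of trilinear interpolation from the 8 corner values of a cell, with (tx, ty, tz) the fractional position inside the cell:

// Linear interpolation between two scalars.
inline float lerp(float a, float b, float t) { return a + t * (b - a); }

// Trilinearly interpolate the 8 corner values c{xyz} of a grid cell at
// fractional offsets (tx, ty, tz) in [0, 1]^3.
float trilerp(float c000, float c100, float c010, float c110,
              float c001, float c101, float c011, float c111,
              float tx, float ty, float tz) {
    float c00 = lerp(c000, c100, tx);   // interpolate along x
    float c10 = lerp(c010, c110, tx);
    float c01 = lerp(c001, c101, tx);
    float c11 = lerp(c011, c111, tx);
    float c0  = lerp(c00, c10, ty);     // then along y
    float c1  = lerp(c01, c11, ty);
    return lerp(c0, c1, tz);            // finally along z
}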
// QUESTION: I don't understand how optimizing the voxel coefficients and the regularization formula work; I haven't read into it.
Comments:
Procedural:
Like NeRF, however, for each point (x, y, z) we calculate its value by interpolating from its nearby vertex values (with coarse-to-fine multi-level grids) that are randomly distributed in the hash table.
The RGBD target is predicted from the interpolated latent space.
Morton code is used to map an n-dimensional space to a linear space. Morton code defines a Z-shaped space-filling curve, which preserves n-dimensional locality.
Assume we already stored a 3D map, where each cell is an int, into a Morton-coded array, and we want to extract the int value at (x, y, z) = (5, 9, 1) = (0101b, 1001b, 0001b).
Then there is an easy method to know which array position we need to look up: it is (010 001 000 111b). This is because, going from z, y, x, we interleave the most significant bits of every dimension down to the least significant bits.
Note that to invert a Morton code back to a 3D coordinate, we only need to take code >> 0 for x, code >> 1 for y, and code >> 2 for z, and pass each to the same decoding function.
With a for loop, we can write the encoding like this:
#include <stdint.h>
#include <limits.h>
using namespace std;
inline uint64_t mortonEncode_for(unsigned int x, unsigned int y, unsigned int z) {
uint64_t answer = 0;
for (uint64_t i = 0; i < (sizeof(uint64_t)* CHAR_BIT)/3; ++i) {
answer |= ((x & ((uint64_t)1 << i)) << 2*i) | ((y & ((uint64_t)1 << i)) << (2*i + 1)) | ((z & ((uint64_t)1 << i)) << (2*i + 2));
}
return answer;
}
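As a counterpart, a minimal for-loop decoder for one coordinate (not from any particular codebase): call it with code >> 0, code >> 1, and code >> 2 to recover x, y, and z as described above.

#include <stdint.h>
#include <limits.h>

// Collect every third bit of a (possibly pre-shifted) Morton code
// back into a single coordinate.
inline unsigned int mortonDecodeCoord_for(uint64_t code) {
    unsigned int coord = 0;
    for (uint64_t i = 0; i < (sizeof(uint64_t) * CHAR_BIT) / 3; ++i) {
        coord |= (unsigned int)(((code >> (3 * i)) & 1) << i);
    }
    return coord;
}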
To achieve better performance, we could use magic bits:
#include <stdint.h>
#include <limits.h>
using namespace std;

// method to separate the bits of a given integer 3 positions apart
inline uint64_t splitBy3(unsigned int a) {
    uint64_t x = a & 0x1fffff;               // we only look at the first 21 bits
    x = (x | x << 32) & 0x1f00000000ffff;    // shift left 32 bits, OR with self, then mask
    x = (x | x << 16) & 0x1f0000ff0000ff;    // shift left 16 bits, OR with self, then mask
    x = (x | x << 8)  & 0x100f00f00f00f00f;  // shift left 8 bits, OR with self, then mask
    x = (x | x << 4)  & 0x10c30c30c30c30c3;  // shift left 4 bits, OR with self, then mask
    x = (x | x << 2)  & 0x1249249249249249;  // shift left 2 bits, OR with self, then mask
    return x;
}

inline uint64_t mortonEncode_magicbits(unsigned int x, unsigned int y, unsigned int z) {
    uint64_t answer = 0;
    answer |= splitBy3(x) | splitBy3(y) << 1 | splitBy3(z) << 2;
    return answer;
}
or use a giant lookup table to achieve the best performance:
#include <stdint.h>
#include <limits.h>
using namespace std;
static const uint32_t morton256_x[256] = {
0x00000000,
0x00000001, 0x00000008, 0x00000009, 0x00000040, 0x00000041, 0x00000048, 0x00000049, 0x00000200,
0x00000201, 0x00000208, 0x00000209, 0x00000240, 0x00000241, 0x00000248, 0x00000249, 0x00001000,
0x00001001, 0x00001008, 0x00001009, 0x00001040, 0x00001041, 0x00001048, 0x00001049, 0x00001200,
0x00001201, 0x00001208, 0x00001209, 0x00001240, 0x00001241, 0x00001248, 0x00001249, 0x00008000,
0x00008001, 0x00008008, 0x00008009, 0x00008040, 0x00008041, 0x00008048, 0x00008049, 0x00008200,
0x00008201, 0x00008208, 0x00008209, 0x00008240, 0x00008241, 0x00008248, 0x00008249, 0x00009000,
0x00009001, 0x00009008, 0x00009009, 0x00009040, 0x00009041, 0x00009048, 0x00009049, 0x00009200,
0x00009201, 0x00009208, 0x00009209, 0x00009240, 0x00009241, 0x00009248, 0x00009249, 0x00040000,
0x00040001, 0x00040008, 0x00040009, 0x00040040, 0x00040041, 0x00040048, 0x00040049, 0x00040200,
0x00040201, 0x00040208, 0x00040209, 0x00040240, 0x00040241, 0x00040248, 0x00040249, 0x00041000,
0x00041001, 0x00041008, 0x00041009, 0x00041040, 0x00041041, 0x00041048, 0x00041049, 0x00041200,
0x00041201, 0x00041208, 0x00041209, 0x00041240, 0x00041241, 0x00041248, 0x00041249, 0x00048000,
0x00048001, 0x00048008, 0x00048009, 0x00048040, 0x00048041, 0x00048048, 0x00048049, 0x00048200,
0x00048201, 0x00048208, 0x00048209, 0x00048240, 0x00048241, 0x00048248, 0x00048249, 0x00049000,
0x00049001, 0x00049008, 0x00049009, 0x00049040, 0x00049041, 0x00049048, 0x00049049, 0x00049200,
0x00049201, 0x00049208, 0x00049209, 0x00049240, 0x00049241, 0x00049248, 0x00049249, 0x00200000,
0x00200001, 0x00200008, 0x00200009, 0x00200040, 0x00200041, 0x00200048, 0x00200049, 0x00200200,
0x00200201, 0x00200208, 0x00200209, 0x00200240, 0x00200241, 0x00200248, 0x00200249, 0x00201000,
0x00201001, 0x00201008, 0x00201009, 0x00201040, 0x00201041, 0x00201048, 0x00201049, 0x00201200,
0x00201201, 0x00201208, 0x00201209, 0x00201240, 0x00201241, 0x00201248, 0x00201249, 0x00208000,
0x00208001, 0x00208008, 0x00208009, 0x00208040, 0x00208041, 0x00208048, 0x00208049, 0x00208200,
0x00208201, 0x00208208, 0x00208209, 0x00208240, 0x00208241, 0x00208248, 0x00208249, 0x00209000,
0x00209001, 0x00209008, 0x00209009, 0x00209040, 0x00209041, 0x00209048, 0x00209049, 0x00209200,
0x00209201, 0x00209208, 0x00209209, 0x00209240, 0x00209241, 0x00209248, 0x00209249, 0x00240000,
0x00240001, 0x00240008, 0x00240009, 0x00240040, 0x00240041, 0x00240048, 0x00240049, 0x00240200,
0x00240201, 0x00240208, 0x00240209, 0x00240240, 0x00240241, 0x00240248, 0x00240249, 0x00241000,
0x00241001, 0x00241008, 0x00241009, 0x00241040, 0x00241041, 0x00241048, 0x00241049, 0x00241200,
0x00241201, 0x00241208, 0x00241209, 0x00241240, 0x00241241, 0x00241248, 0x00241249, 0x00248000,
0x00248001, 0x00248008, 0x00248009, 0x00248040, 0x00248041, 0x00248048, 0x00248049, 0x00248200,
0x00248201, 0x00248208, 0x00248209, 0x00248240, 0x00248241, 0x00248248, 0x00248249, 0x00249000,
0x00249001, 0x00249008, 0x00249009, 0x00249040, 0x00249041, 0x00249048, 0x00249049, 0x00249200,
0x00249201, 0x00249208, 0x00249209, 0x00249240, 0x00249241, 0x00249248, 0x00249249
};
// pre-shifted table for Y coordinates (1 bit to the left)
static const uint32_t morton256_y[256] = {
0x00000000,
0x00000002, 0x00000010, 0x00000012, 0x00000080, 0x00000082, 0x00000090, 0x00000092, 0x00000400,
0x00000402, 0x00000410, 0x00000412, 0x00000480, 0x00000482, 0x00000490, 0x00000492, 0x00002000,
0x00002002, 0x00002010, 0x00002012, 0x00002080, 0x00002082, 0x00002090, 0x00002092, 0x00002400,
0x00002402, 0x00002410, 0x00002412, 0x00002480, 0x00002482, 0x00002490, 0x00002492, 0x00010000,
0x00010002, 0x00010010, 0x00010012, 0x00010080, 0x00010082, 0x00010090, 0x00010092, 0x00010400,
0x00010402, 0x00010410, 0x00010412, 0x00010480, 0x00010482, 0x00010490, 0x00010492, 0x00012000,
0x00012002, 0x00012010, 0x00012012, 0x00012080, 0x00012082, 0x00012090, 0x00012092, 0x00012400,
0x00012402, 0x00012410, 0x00012412, 0x00012480, 0x00012482, 0x00012490, 0x00012492, 0x00080000,
0x00080002, 0x00080010, 0x00080012, 0x00080080, 0x00080082, 0x00080090, 0x00080092, 0x00080400,
0x00080402, 0x00080410, 0x00080412, 0x00080480, 0x00080482, 0x00080490, 0x00080492, 0x00082000,
0x00082002, 0x00082010, 0x00082012, 0x00082080, 0x00082082, 0x00082090, 0x00082092, 0x00082400,
0x00082402, 0x00082410, 0x00082412, 0x00082480, 0x00082482, 0x00082490, 0x00082492, 0x00090000,
0x00090002, 0x00090010, 0x00090012, 0x00090080, 0x00090082, 0x00090090, 0x00090092, 0x00090400,
0x00090402, 0x00090410, 0x00090412, 0x00090480, 0x00090482, 0x00090490, 0x00090492, 0x00092000,
0x00092002, 0x00092010, 0x00092012, 0x00092080, 0x00092082, 0x00092090, 0x00092092, 0x00092400,
0x00092402, 0x00092410, 0x00092412, 0x00092480, 0x00092482, 0x00092490, 0x00092492, 0x00400000,
0x00400002, 0x00400010, 0x00400012, 0x00400080, 0x00400082, 0x00400090, 0x00400092, 0x00400400,
0x00400402, 0x00400410, 0x00400412, 0x00400480, 0x00400482, 0x00400490, 0x00400492, 0x00402000,
0x00402002, 0x00402010, 0x00402012, 0x00402080, 0x00402082, 0x00402090, 0x00402092, 0x00402400,
0x00402402, 0x00402410, 0x00402412, 0x00402480, 0x00402482, 0x00402490, 0x00402492, 0x00410000,
0x00410002, 0x00410010, 0x00410012, 0x00410080, 0x00410082, 0x00410090, 0x00410092, 0x00410400,
0x00410402, 0x00410410, 0x00410412, 0x00410480, 0x00410482, 0x00410490, 0x00410492, 0x00412000,
0x00412002, 0x00412010, 0x00412012, 0x00412080, 0x00412082, 0x00412090, 0x00412092, 0x00412400,
0x00412402, 0x00412410, 0x00412412, 0x00412480, 0x00412482, 0x00412490, 0x00412492, 0x00480000,
0x00480002, 0x00480010, 0x00480012, 0x00480080, 0x00480082, 0x00480090, 0x00480092, 0x00480400,
0x00480402, 0x00480410, 0x00480412, 0x00480480, 0x00480482, 0x00480490, 0x00480492, 0x00482000,
0x00482002, 0x00482010, 0x00482012, 0x00482080, 0x00482082, 0x00482090, 0x00482092, 0x00482400,
0x00482402, 0x00482410, 0x00482412, 0x00482480, 0x00482482, 0x00482490, 0x00482492, 0x00490000,
0x00490002, 0x00490010, 0x00490012, 0x00490080, 0x00490082, 0x00490090, 0x00490092, 0x00490400,
0x00490402, 0x00490410, 0x00490412, 0x00490480, 0x00490482, 0x00490490, 0x00490492, 0x00492000,
0x00492002, 0x00492010, 0x00492012, 0x00492080, 0x00492082, 0x00492090, 0x00492092, 0x00492400,
0x00492402, 0x00492410, 0x00492412, 0x00492480, 0x00492482, 0x00492490, 0x00492492
};
// Pre-shifted table for z (2 bits to the left)
static const uint32_t morton256_z[256] = {
0x00000000,
0x00000004, 0x00000020, 0x00000024, 0x00000100, 0x00000104, 0x00000120, 0x00000124, 0x00000800,
0x00000804, 0x00000820, 0x00000824, 0x00000900, 0x00000904, 0x00000920, 0x00000924, 0x00004000,
0x00004004, 0x00004020, 0x00004024, 0x00004100, 0x00004104, 0x00004120, 0x00004124, 0x00004800,
0x00004804, 0x00004820, 0x00004824, 0x00004900, 0x00004904, 0x00004920, 0x00004924, 0x00020000,
0x00020004, 0x00020020, 0x00020024, 0x00020100, 0x00020104, 0x00020120, 0x00020124, 0x00020800,
0x00020804, 0x00020820, 0x00020824, 0x00020900, 0x00020904, 0x00020920, 0x00020924, 0x00024000,
0x00024004, 0x00024020, 0x00024024, 0x00024100, 0x00024104, 0x00024120, 0x00024124, 0x00024800,
0x00024804, 0x00024820, 0x00024824, 0x00024900, 0x00024904, 0x00024920, 0x00024924, 0x00100000,
0x00100004, 0x00100020, 0x00100024, 0x00100100, 0x00100104, 0x00100120, 0x00100124, 0x00100800,
0x00100804, 0x00100820, 0x00100824, 0x00100900, 0x00100904, 0x00100920, 0x00100924, 0x00104000,
0x00104004, 0x00104020, 0x00104024, 0x00104100, 0x00104104, 0x00104120, 0x00104124, 0x00104800,
0x00104804, 0x00104820, 0x00104824, 0x00104900, 0x00104904, 0x00104920, 0x00104924, 0x00120000,
0x00120004, 0x00120020, 0x00120024, 0x00120100, 0x00120104, 0x00120120, 0x00120124, 0x00120800,
0x00120804, 0x00120820, 0x00120824, 0x00120900, 0x00120904, 0x00120920, 0x00120924, 0x00124000,
0x00124004, 0x00124020, 0x00124024, 0x00124100, 0x00124104, 0x00124120, 0x00124124, 0x00124800,
0x00124804, 0x00124820, 0x00124824, 0x00124900, 0x00124904, 0x00124920, 0x00124924, 0x00800000,
0x00800004, 0x00800020, 0x00800024, 0x00800100, 0x00800104, 0x00800120, 0x00800124, 0x00800800,
0x00800804, 0x00800820, 0x00800824, 0x00800900, 0x00800904, 0x00800920, 0x00800924, 0x00804000,
0x00804004, 0x00804020, 0x00804024, 0x00804100, 0x00804104, 0x00804120, 0x00804124, 0x00804800,
0x00804804, 0x00804820, 0x00804824, 0x00804900, 0x00804904, 0x00804920, 0x00804924, 0x00820000,
0x00820004, 0x00820020, 0x00820024, 0x00820100, 0x00820104, 0x00820120, 0x00820124, 0x00820800,
0x00820804, 0x00820820, 0x00820824, 0x00820900, 0x00820904, 0x00820920, 0x00820924, 0x00824000,
0x00824004, 0x00824020, 0x00824024, 0x00824100, 0x00824104, 0x00824120, 0x00824124, 0x00824800,
0x00824804, 0x00824820, 0x00824824, 0x00824900, 0x00824904, 0x00824920, 0x00824924, 0x00900000,
0x00900004, 0x00900020, 0x00900024, 0x00900100, 0x00900104, 0x00900120, 0x00900124, 0x00900800,
0x00900804, 0x00900820, 0x00900824, 0x00900900, 0x00900904, 0x00900920, 0x00900924, 0x00904000,
0x00904004, 0x00904020, 0x00904024, 0x00904100, 0x00904104, 0x00904120, 0x00904124, 0x00904800,
0x00904804, 0x00904820, 0x00904824, 0x00904900, 0x00904904, 0x00904920, 0x00904924, 0x00920000,
0x00920004, 0x00920020, 0x00920024, 0x00920100, 0x00920104, 0x00920120, 0x00920124, 0x00920800,
0x00920804, 0x00920820, 0x00920824, 0x00920900, 0x00920904, 0x00920920, 0x00920924, 0x00924000,
0x00924004, 0x00924020, 0x00924024, 0x00924100, 0x00924104, 0x00924120, 0x00924124, 0x00924800,
0x00924804, 0x00924820, 0x00924824, 0x00924900, 0x00924904, 0x00924920, 0x00924924
};
inline uint64_t mortonEncode_LUT(unsigned int x, unsigned int y, unsigned int z){
uint64_t answer = 0;
answer = morton256_z[(z >> 16) & 0xFF ] | // we start by shifting the third byte, since we only look at the first 21 bits
morton256_y[(y >> 16) & 0xFF ] |
morton256_x[(x >> 16) & 0xFF ];
answer = answer << 48 | morton256_z[(z >> 8) & 0xFF ] | // shifting second byte
morton256_y[(y >> 8) & 0xFF ] |
morton256_x[(x >> 8) & 0xFF ];
answer = answer << 24 |
morton256_z[(z) & 0xFF ] | // first byte
morton256_y[(y) & 0xFF ] |
morton256_x[(x) & 0xFF ];
return answer;
}
Here is the source code for the octree:
// Expands a 10-bit integer into 30 bits
// by inserting 2 zeros after each bit.
__host__ __device__ inline uint32_t expand_bits(uint32_t v) {
v = (v * 0x00010001u) & 0xFF0000FFu;
v = (v * 0x00000101u) & 0x0F00F00Fu;
v = (v * 0x00000011u) & 0xC30C30C3u;
v = (v * 0x00000005u) & 0x49249249u;
return v;
}
// Calculates a 30-bit Morton code for the
// given 3D point located within the unit cube [0,1].
__host__ __device__ inline uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
uint32_t xx = expand_bits(x);
uint32_t yy = expand_bits(y);
uint32_t zz = expand_bits(z);
return xx | (yy << 1) | (zz << 2);
}
__host__ __device__ inline uint32_t morton3D_invert(uint32_t x) {
x = x & 0x49249249;
x = (x | (x >> 2)) & 0xc30c30c3;
x = (x | (x >> 4)) & 0x0f00f00f;
x = (x | (x >> 8)) & 0xff0000ff;
x = (x | (x >> 16)) & 0x0000ffff;
return x;
}
For details, read Out-of-Core Construction of Sparse Voxel Octrees and Morton encoding/decoding through bit interleaving: Implementations
idx = ((i+step*n_elements) * 56924617 + j * 19349663 + 96925573) % (NERF_GRIDSIZE()*NERF_GRIDSIZE()*NERF_GRIDSIZE());
A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generator algorithms. The theory behind them is relatively easy to understand, and they are easily implemented and fast, especially on computer hardware which can provide modular arithmetic by storage-bit truncation. However, the statistical properties are bad.
Here, we don't actually care much about its statistical properties. Rather, we care about its property of producing a permutation: this use case distributes the density grid update samples more-or-less uniformly over space (due to the pseudo-random nature), but ensures good coverage by never visiting a grid cell twice without having visited all other cells (due to the permutation property).
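A minimal sketch of that permutation property, using a grid size of 128 (so 2^21 cells) and the constants from the snippet above, with the step term dropped for simplicity; the check is only illustrative:

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const uint64_t grid = 128;                        // assumed NERF_GRIDSIZE()
    const uint64_t M = grid * grid * grid;            // 2^21 cells
    const uint64_t A = 56924617, B = 19349663, C = 96925573;
    const uint64_t j = 0;                             // fix the per-level offset

    std::vector<bool> visited(M, false);
    for (uint64_t i = 0; i < M; ++i)
        visited[(i * A + j * B + C) % M] = true;      // the index formula above

    uint64_t count = 0;
    for (bool v : visited) count += v ? 1 : 0;
    // Prints "2097152 / 2097152": every cell is visited exactly once per pass,
    // because gcd(A, M) = 1 (A is odd, M is a power of two), so i -> i*A + const
    // is a bijection mod M.
    std::printf("%llu / %llu\n", (unsigned long long)count, (unsigned long long)M);
    return 0;
}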
Background: general modality models (transformers, MLP mixer, Perceiver), can they beat domain-specific architecture?
NeRF: data distillation by overfitting, cross-view interpolation
photorealistic result
view-consistent geometry with view-dependent lighting
network optimization cost is high
limited material and optical effects
interpolation is restricting our representation
over-simplified rendering equation
PixelNeRF & IBRNet: acquire RGBA of each point by weighted summing the image features of its 2D projections
MVSNeRF: predict RGBA of each point from a cost volume induced by MVSNet using 2 cameras
Side note: the Transformer ("Attention Is All You Need") replaced LSTMs and earlier RNNs.
Generalizable NeRF Transformer (GNT):
There are additional 9 articles in my reading list.
Paper:
Report
Hands on Experience without Code
Short Video: Youtube
COLMAP with NGP, with real camera data and a detailed procedure: Youtube
From Video Automatic Tool: Youtube
Questions
NerfStudio: a Python framework for running different kinds of NeRFs using PyTorch. (Video instruction)
Monocular: using only one camera to capture either an animated or a static scene, with animated or static objects
Current Research:
static object: unrealistic
animated object, but with camera teleportation: assumes an Olympic runner taking a video of a moving scene without introducing any motion blur; unrealistic.
Dense Depth Priors for Neural Radiance Fields from Sparse Input Views (CVPR'2022):
where r is the sample point, K is the space of a ray, and t is the view depth from the camera (and they use GNLL because MSE is bad here).
The authors do not mention until Section 4.5 (Limitations) that the depth completion network is trained separately on a large dataset. The depth network takes not just sparse depth but images too.
They use a ResNet + Convolutional Spatial Propagation Network (CSPN), which spreads depth to neighboring pixels (since depth is sparse), to predict dense depth.
They simulated SfM results using sensor depth plus random perturbation to avoid running SfM on a large dataset.
Depth supervision is applied when: 1) the difference between the predicted and target depth is greater than the target standard deviation (Eq. (12)), or 2) the predicted standard deviation is greater than the target standard deviation.
Bound the depth within 1 standard deviation, while giving some freedom to fit color.
// QUESTION: what is a latent code?
So from the above we know:
\sigma(t): the opacity
T(t) = \exp(-\int_0^t \sigma(s) ds): the occlusion
h(t) = T(t)\sigma(t): the distribution used to linearly blend the colors along the ray
Notice h(t) always looks like an impulse regardless of how many layers of objects there are along the ray.
Say we know that in reality the ray terminates exactly at distance d; then we want our h(t) distribution to look like an impulse distribution (a \delta function) centered at d with a very high peak. We want the two distributions to be the same, so we measure the Kullback-Leibler (KL) divergence between them.
By definition, KL[P \| Q] = \sum_{x \in X} P(x) \log\left(\frac{P(x)}{Q(x)}\right)
But we don't know whether the ray terminates exactly at distance d, even with the help of the depth input, so d is itself a random variable. We can model it by a normal distribution D \sim N(\hat{d}, \hat{\sigma}), where \hat{d}, \hat{\sigma} are the depth and uncertainty estimated by COLMAP.
With the above uncertainty, our delta function looks like \delta(t - D), and we want h(t) to be close to it.
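A minimal sketch of measuring that mismatch in the discrete case: KL between the normalized per-sample termination weights h_i and a Gaussian N(\hat{d}, \hat{\sigma}) discretized at the same sample depths. This only illustrates the idea; it is not the paper's exact loss:

#include <cmath>
#include <vector>

// KL[h || g]: h_i = T_i * sigma_i weights along the ray, g_i a Gaussian
// centered at the estimated depth d_hat with uncertainty sigma_hat,
// both normalized over the same sample depths t_i.
float depth_kl(const std::vector<float>& h,
               const std::vector<float>& t,
               float d_hat, float sigma_hat) {
    const float eps = 1e-8f;
    float h_sum = eps, g_sum = eps;
    std::vector<float> g(h.size());
    for (size_t i = 0; i < h.size(); ++i) {
        float z = (t[i] - d_hat) / sigma_hat;
        g[i] = std::exp(-0.5f * z * z);
        h_sum += h[i];
        g_sum += g[i];
    }
    float kl = 0.0f;
    for (size_t i = 0; i < h.size(); ++i) {
        float p = h[i] / h_sum;                       // normalized h
        float q = g[i] / g_sum;                       // normalized Gaussian
        kl += p * std::log((p + eps) / (q + eps));
    }
    return kl;
}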
Although the method is correct, it is not intuitive to me why it is a good choice of loss function. So far, the only difference between this and the previous paper is that this one uses KL as the loss and the previous paper uses GaussianNLL with uncertainty. I want to show that the two ways of thinking are actually quite similar, but that's math, and math consumes time. So let's end here. // QUESTION
NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo:
fine-tune the depth network for each test scene using sparse SfM
query the depth oracle for each image, project into the cube, and find the depth error
then hand-clamp the sampling interval so that it scales with the error
PNeRF:
Similar to DSNeRF, models captured depth as a Gaussian
Still uses the coarse and fine networks
The loss has: a color term, a density term (computed exactly the same way as color), and a Gaussian depth term
instead of doing KL divergence, it just subtracts the two distributions for a loss
RegNeRF:
regularize from novel view without ground truth
uses MipNeRF to get rid of the coarse-fine networks // QUESTION: which to me is questionable, since coarse-fine has to do with sampling but Mip-NeRF has to do with anti-aliasing. Mip-NeRF trades time for quality.
Sample Space Annealing: sample closer to the ray midpoint at the beginning of training
360FusionNeRF
depth supervision by error divided by depth variance (another method compared to above)
CLIP’s Vision Transformer for semantic consistency
where \hat{D}_{var}(r) = \sum_{i = 1}^N T_i(1 - \exp(-\sigma_i \delta_i))(\hat{D}(r) - t_i)^2.
By the way, this is done by CMU Engineering too
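A minimal sketch of computing the rendered depth \hat{D}(r) and the variance \hat{D}_{var}(r) above from the per-sample weights w_i = T_i(1 - \exp(-\sigma_i \delta_i)); taking \hat{D}(r) = \sum_i w_i t_i is the usual convention, stated here as an assumption:

#include <vector>

// Rendered depth and its variance along a ray, from per-sample weights
// w_i = T_i * (1 - exp(-sigma_i * delta_i)) and sample depths t_i.
void depth_and_variance(const std::vector<float>& w,
                        const std::vector<float>& t,
                        float& depth, float& depth_var) {
    depth = 0.0f;
    for (size_t i = 0; i < w.size(); ++i)
        depth += w[i] * t[i];                         // D_hat(r)

    depth_var = 0.0f;
    for (size_t i = 0; i < w.size(); ++i) {
        float diff = depth - t[i];
        depth_var += w[i] * diff * diff;              // D_var(r)
    }
}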
Some NeRF Variants: two works that tell us how to use depth when 3D keypoints translate through time.
Nerfies: Deformable Neural Radiance Fields: models geometry changes using a deformation field, which I assume is very costly.
Neural Scene Flow Fields (NSFF): similar to the above, but encodes time as another dimension.
Some Other NeRF Variants:
DINER: Depth-aware Image-based NEural Radiance fields: a variant of NeRF that separates the depth and color networks, so we can inject depth at test time instead of only at training time (I think it has the potential for some generative applications, EXCITING WORK)
FWD: Real-Time Novel View Synthesis With Forward Warping and Depth: using explicit representation, sparse input, no per-scene optimization
Moving in a 360 World: Synthesizing Panoramic Parallaxes from a Single Panorama: simply filters out pixels whose depth values are larger than the local median depth multiplied by a tolerance ratio // QUESTION: don't quite understand this either
GeoNeRF: Generalizing NeRF with Geometry Priors: assumes ground-truth depth for each ray is obtainable? // QUESTION: didn't quite understand; it looks like the author is coming from the old DL community (I remember FPN was hot when I was in high school)
Classical Depth Processing: how to clean up captured depth (incremental updating, representation of directional uncertainty, filling gaps in the reconstruction, robustness to outliers); typically involves SDFs; didn't read too much into it
Real-time 3d reconstruction and interaction using a moving depth camera.
A Volumetric Method for Building Complex Models from Range Images
Sparse generative neural networks for self-supervised scene completion of rgb-d scans
Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans.
Some Other Papers about Depth that I can't classify:
SPARF: Neural Radiance Fields from Sparse and Noisy Poses: a good dataset for training small prior network if needed
NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields: added a depth loss to instant-ngp; achieves realtime on an RTX 2080 Ti with a pretrained Droid-SLAM. // QUESTION: it looks like the SLAM produces a depth field? Didn't understand that part. But their results are bad and they proposed nothing new.
Handheld Neural Multi-frame Depth Refinement: uses hand shake to reconstruct depth from phone video
(Rejected ICLR) MaskNeRF: Masked Neural Radiance Fields for Sparse View Synthesis: mask out stuff
Rendering Speedup:
DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks: for rendering, predict where to sample rays using yet another layer of network (but they are smart enough to predict logarithmically discretized and spherically warped depth values instead of raw depth, 48x speed up and 20FPS)
FastNeRF: High-Fidelity Neural Rendering at 200FPS (2021): speeds up rendering by orders of magnitude with caching and fast querying
Baking Neural Radiance Fields for Real-Time View Synthesis
PlenOctrees for Real-time Rendering of Neural Radiance Fields (very different approach than plenoxels)
NeX: Real-time View Synthesis with Neural Basis Expansion: a multi-plane method that speeds up the original NeRF rendering time from 0.02 FPS to 60 FPS
Plenoxels: Radiance fields without neural networks
TensoRF: speeds up Blender dataset training time down to 3 min // TODO: read
InstantNGP
AdaNeRF: Adaptive Sampling for Real-time Rendering of Neural Radiance Fields // TODO: read
Regularization from Novel View:
Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis: regularize color in semantic space using pre-trained network
RegNeRF(see above)
Bad Papers: don't read them
CDNeRF: A Multi-modal Feature Guided Neural Radiance Fields: fuse predicted depth with image encoding using transformer, which is too much of a stretch
RGB-D NEURAL RADIANCE FIELDS: LOCAL SAMPLING FOR FASTER TRAINING: Stratified sampling and Gaussian sampling (result is trivial)
Neural Sparse Voxel Fields: octree rendering 10x faster than original NeRF
Mip-NeRF RGB-D: Depth Assisted Fast Neural Radiance Fields: no new stuff
INGeo: Accelerating Instant Neural Scene Reconstruction with Noisy Geometry Priors: exactly the same as my summer work, and thankfully it did not get published
Didn't Read:
CLONeR: Camera-Lidar Fusion for Occupancy Grid-aided Neural Representations
TermiNeRF: Ray Termination Prediction for Efficient Neural Rendering
// QUESTION: And there is one work that I don't know how to comment on: FWD: Real-time Novel View Synthesis with Forward Warping and Depth. I didn't quite understand their approach because of the way the paper is written.
Assumptions
Introduction: We are trying to speed up both training and rendering time for NeRF to eventually move toward interactivity. One application is to achieve 3D monitoring by reconstructing on the fly. To do so, we make use of sensor depth, since it is readily available and can speed up NeRF greatly by providing regularization guidance and smarter sampling. The translation is challenging since NeRF is fundamentally more like a gridded representation of the scene than particles (see Graphics), but being a neural network gives us flexibility.
Related Work:
Depth: Dense Depth Priors for Neural Radiance Fields from Sparse Input Views uses a pre-trained CNN to complete the sparse depth map and uses that as a loss term with GNLL. DSNeRF directly uses sparse sensor depth but guides the output depth toward an impulse distribution around the uncertain sensor input. 360FusionNeRF added a depth loss divided by the variance. PNeRF added both a density term (calculated the same way as color) and a depth term (normal distribution) to the loss. NerfingMVS applied a hard clamp over the sampling region using depth. RegNeRF regularizes depth smoothness using novel views. Notably, built on instant-ngp, NeRF-SLAM achieved realtime scene reconstruction for static scenes.
Time: Nerfies models geometry updates as deformation, and Neural Scene Flow Fields (NSFF) encodes time as another dimension. Both approaches add another dimension to the search space for a valid representation.
Mask: (Rejected ICLR) MaskNeRF importance-samples masked regions for better quality, but the methodology is strange.
Rendering Speedup: DONeRF predicts the ray termination region using a NN. FastNeRF uses caching. NeX, PlenOctrees, Plenoxels, and InstantNGP use acceleration data structures.
Steps:
regularize for an impulse distribution simultaneously from all views (one approximation: you might need some volumetric global illumination for the NeRF algorithm for occlusion stuff, something like a fading border region)
continual learning: since you can't importance-sample a neural network's backpropagation (especially with hash collisions), you need to avoid catastrophic forgetting. Here is where continual learning comes into play (yeah, I didn't believe I would ever touch that): A continual learning survey: Defying forgetting in classification tasks
Ideas that shouldn't work:
small prior network for regularization: not novel, but I did find a dataset to help with this; it requires time to train and tweak, not worth it.
view-time sampling: only sample rays near the querying view, but don't forget other rays; this is a challenging but interesting approach (inspired by RegNeRF)
the main method is to prioritize sampling rays similar to the view rays, plus delta pixels, but this is very tricky
Ideas that should work:
we can compute a closed-form (approximated) depth distribution regularization for every ray in the ray space. It mainly deals with sparse views. Generally:
the main method is to prioritize sampling rays similar to the view rays, plus delta pixels, but this is very tricky
could cache depth completion result and use diffusion to update completed depth with delta depth
we could apply Mip-NeRF to depth, but this has nothing to do with this paper.
we could do some network regularization directly on the NGP grid, so that we can apply acceleration algorithms directly on the NGP grid (especially for synthetic scenes)
editable NGP
mipmap combined with multiresolution training
try smarter batch: different angle
So, the problem is that the density fades out or comes in too slowly. I suspect this is due to diminishing gradients for the values in the grid. For example, when we have transparency in the input image, we should be able to immediately know that the region is completely transparent, right? So we want a way to directly change the representation of that grid cell to be transparent. But since we have a NN processing grid values in NGP, we can't do so directly. What's the solution? We enforce one layer of the grid to store only density by adding a regularization to the architecture, so whenever the network sees 0 density there, it should output 0 density. Then, after we freeze the network, the behavior should be the same, and we are able to train the network faster.
sample fewer rays, clamp depth
Some reading on Plenoxel
octrees: [11, 38, 50, 52] (see [16] for a survey)
standard signal processing methods with interpolation: [30]
forward volume rendering formula introduced: Max [22], [12]
other voxel: Neural Volumes [20]
subdividing the 3D volume: [19, 35]
Numerical: [7], [18], [49]
predicting a surface or sampling near the surface: [14, 27, 33]
RMSProp: [10]
Real-world loss: encourages empty space and a clear foreground-background decomposition
Speed: CUDA implementation, speedup not supported for volumetric rendering
The (differentiable version of the) volumetric rendering equation looks like this (fixing a direction d and points along the ray p(o, t) = o + td): \hat{C} = \sum_i T_i (1 - \exp(-\sigma_i \delta_i)) c_i, with T_i = \exp(-\sum_{j < i} \sigma_j \delta_j).
The simple MSE loss: L = \sum_{\text{rays}} \lVert \hat{C} - C \rVert_2^2,
with residual added to the field: where a_j', C_i' are constant
or if you do field-space instead of image space
So we can simplify above to
So we can do image space?
Below is the training pipeline of NerfStudio's instant-ngp:
entrypoint() in script/train.py
train_loop() in script/train.py
train() in trainer.py - output
train_iteration() in trainer.py - forward, backward
get_train_loss_dict() in base_pipeline.py - generate ray, feed model, get metric and loss
forward(), get_outputs() in base_model.py - any output
get_outputs() in instant_ngp.py - return rgb,acc,depth
forward() in base_field.py
get_density()->get_outputs() in base_field.py
get_density()->get_outputs() in instant_ngp_field.py
render_weight_from_density() in instant_ngp.py
renderer_rgb() in instant_ngp.py - start accumulate for loss
forward() in renderers.py
get_loss_dict() in instant_ngp.py - feed image and get rgb loss
Problem:
can't train in realtime
this use case could enable realtime NeRF streaming
Challenges:
smart sampling strategy for speed
multi-frame consistency to remove flickering
catastrophic forgetting of scene elements
Assume:
Limitation: