
CSE5519 Advances in Computer Vision (Topic C: 2021 and before: Neural Rendering)

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

link to the paper 

We represent a static scene as a continuous 5D function:

$$F : (\mathbf{x}, \boldsymbol{\theta}) = (x, y, z, \theta, \phi) \mapsto (\sigma, \mathbf{c})$$

where $(x, y, z)$ denotes a 3D position in space, $(\theta, \phi)$ specifies a viewing direction, $\sigma$ is the volume density at point $(x, y, z)$ (which acts as a differential opacity controlling how much radiance is accumulated along a ray), and $\mathbf{c}$ is the emitted RGB radiance in direction $(\theta, \phi)$ at that point.

Our method learns this function $F$ by optimizing a deep, fully-connected neural network (a multilayer perceptron, or MLP) that maps each 5D input coordinate $(x, y, z, \theta, \phi)$ to a corresponding volume density $\sigma$ and view-dependent color $\mathbf{c}$.
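As a concrete illustration, here is a minimal PyTorch sketch of such an MLP. The layer widths and depth are illustrative assumptions, not the paper's architecture (which uses eight 256-channel layers with a skip connection and feeds the viewing direction only into the final color layers).

```python
# Minimal sketch of a NeRF-style MLP (illustrative sizes, not the paper's
# exact 8-layer, 256-channel network with a skip connection).
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, pos_dim=3, dir_dim=3, hidden=128):
        super().__init__()
        # Trunk sees only the 3D position and produces density + features.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)               # volume density
        # Color head additionally conditions on the viewing direction,
        # which is what makes the emitted radiance view-dependent.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),          # RGB in [0, 1]
        )

    def forward(self, x, d):
        h = self.trunk(x)
        sigma = torch.relu(self.sigma_head(h))                # density is non-negative
        rgb = self.color_head(torch.cat([h, d], dim=-1))
        return sigma, rgb

model = TinyNeRF()
sigma, rgb = model(torch.rand(1024, 3), torch.rand(1024, 3))  # a batch of 5D samples
```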

The expected color of a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, where $\mathbf{o}$ is the camera origin and $\mathbf{d}$ is the viewing direction, is:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$$

where $T(t)$ is the accumulated transmittance along the ray:

$$T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$$
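In practice this integral is estimated with a numerical quadrature rule over samples $t_i$ along the ray; the discrete weights that appear here are the same ones reused for hierarchical sampling below. A minimal NumPy sketch, assuming the densities and colors at the sample points have already been produced by the network:

```python
# Sketch of the quadrature estimate of C(r) along a single ray, assuming the
# densities sigma_i and colors c_i at sample depths t_i come from the network.
import numpy as np

def render_ray(t, sigma, rgb):
    """t: (N,) sample depths, sigma: (N,) densities, rgb: (N, 3) colors."""
    delta = np.append(t[1:] - t[:-1], 1e10)           # distances between samples
    alpha = 1.0 - np.exp(-sigma * delta)              # per-sample opacity
    # T_i = prod_{j < i} (1 - alpha_j): fraction of light reaching sample i.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha))[:-1]
    weights = trans * alpha                           # w_i = T_i (1 - exp(-sigma_i delta_i))
    color = (weights[:, None] * rgb).sum(axis=0)      # C(r) ~ sum_i w_i c_i
    return color, weights

# Toy example with random densities/colors along one ray:
t = np.linspace(2.0, 6.0, 64)
color, weights = render_ray(t, np.random.rand(64), np.random.rand(64, 3))
```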

Novelty in NeRF

Positional encoding

Deep networks are biased towards learning lower-frequency functions.

They additionally show that mapping the inputs to a higher dimensional space using high frequency functions before passing them to the network enables better fitting of data that contains high frequency variation.

Let $\gamma(p)$ be the positional encoding of $p$, mapping $\mathbb{R}$ to $\mathbb{R}^{2L}$, where $L$ is the number of frequencies.

$$\gamma(p) = \left[\sin\left(2^0 \pi p\right), \cos\left(2^0 \pi p\right), \ldots, \sin\left(2^{L-1} \pi p\right), \cos\left(2^{L-1} \pi p\right)\right]$$
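A minimal NumPy sketch of $\gamma$, applied independently to each coordinate (the paper uses $L = 10$ for the position coordinates and $L = 4$ for the viewing direction):

```python
# Sketch of the positional encoding gamma(p), applied elementwise to each input
# coordinate before it is fed to the MLP.
import numpy as np

def positional_encoding(p, L):
    """p: (...,) coordinates; returns (..., 2L) sin/cos features per coordinate."""
    freqs = (2.0 ** np.arange(L)) * np.pi             # 2^0 pi, 2^1 pi, ..., 2^(L-1) pi
    angles = p[..., None] * freqs                     # shape (..., L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

x = np.array([0.3, -0.7, 0.5])                        # a 3D position
gamma_x = positional_encoding(x, L=10)                # shape (3, 20)
```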

Hierarchical volume sampling

Optimize the coarse and fine networks simultaneously.

Let $\hat{C}_c(\mathbf{r})$ be the coarse prediction of the camera ray color:

$$\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \quad w_i = T_i\left(1 - \exp(-\sigma_i \delta_i)\right)$$

Normalizing these weights as $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$ produces a piecewise-constant PDF along the ray. We sample a second set of $N_f$ locations from this distribution using inverse transform sampling, evaluate our “fine” network at the union of the first and second sets of samples, and compute the final rendered color of the ray $\hat{C}_f(\mathbf{r})$ using all $N_c + N_f$ samples.
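A minimal NumPy sketch of that resampling step, assuming the coarse weights $w_i$ and their sample depths are already available; for simplicity it snaps each new sample to the nearest coarse bin rather than interpolating within bins:

```python
# Sketch of hierarchical sampling: turn the coarse weights into a piecewise-
# constant PDF along the ray and draw N_f extra samples where weight is high.
import numpy as np

def sample_pdf(t_coarse, weights, n_fine):
    """t_coarse: (N_c,) coarse sample depths, weights: (N_c,) coarse w_i."""
    pdf = weights / (weights.sum() + 1e-8)            # normalized w_i -> PDF
    cdf = np.cumsum(pdf)                              # monotone CDF in [0, 1]
    u = np.random.uniform(size=n_fine)                # uniform draws
    idx = np.searchsorted(cdf, u)                     # invert the CDF
    return t_coarse[np.clip(idx, 0, len(t_coarse) - 1)]

# Toy example: fine samples concentrate where the coarse weights peak (near t = 4).
t_coarse = np.linspace(2.0, 6.0, 64)
weights = np.exp(-((t_coarse - 4.0) ** 2))
t_fine = sample_pdf(t_coarse, weights, n_fine=128)
```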

Tip
  1. This paper reminds me of Gaussian Splatting. In this paper's setting, we can treat the scene as a function of 5D coordinates (all the cameras focus on the world origin). However, in general settings, we have 6D coordinates (3D position and 3D direction). Is there any way to use Gaussian Splatting to reconstruct the scene?
  2. In the positional encoding, the function $\gamma(p)$ reminds me of the Fourier transform. Is there any connection between the two?

Volume Rendering

The renderer composites the color and density output by the network along each camera ray to produce pixel colors.
