
CSE5519 Advances in Computer Vision (Topic C: 2021 and before: Neural Rendering)

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

link to the paper 

We represent a static scene as a continuous 5D function:

$$F : (\mathbf{x}, \boldsymbol{\theta}) = (x, y, z, \theta, \phi) \mapsto (\sigma, \mathbf{c})$$

where $(x, y, z)$ denotes a 3D position in space, $(\theta, \phi)$ specifies a viewing direction, $\sigma$ is the volume density at point $(x, y, z)$ (which acts as a differential opacity controlling how much radiance is accumulated along a ray), and $\mathbf{c}$ is the emitted RGB radiance in direction $(\theta, \phi)$ at that point.

Our method learns this function $F$ by optimizing a deep, fully-connected neural network (a multilayer perceptron, or MLP) that maps each 5D input coordinate $(x, y, z, \theta, \phi)$ to a corresponding volume density $\sigma$ and view-dependent color $\mathbf{c}$.
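As a concrete illustration, here is a minimal PyTorch sketch of such an MLP. The layer widths and depth are illustrative assumptions, not the paper's architecture (which uses eight 256-channel layers with a skip connection and feeds the viewing direction only into the final color layers).

```python
# Minimal sketch of a NeRF-style MLP (illustrative sizes, not the paper's
# exact 8-layer, 256-channel network with a skip connection).
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, pos_dim=3, dir_dim=3, hidden=128):
        super().__init__()
        # Trunk sees only the 3D position and produces density + features.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)               # volume density
        # Color head additionally conditions on the viewing direction,
        # which is what makes the emitted radiance view-dependent.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),          # RGB in [0, 1]
        )

    def forward(self, x, d):
        h = self.trunk(x)
        sigma = torch.relu(self.sigma_head(h))                # density is non-negative
        rgb = self.color_head(torch.cat([h, d], dim=-1))
        return sigma, rgb

model = TinyNeRF()
sigma, rgb = model(torch.rand(1024, 3), torch.rand(1024, 3))  # a batch of 5D samples
```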

The expected color of a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, where $\mathbf{o}$ is the camera origin and $\mathbf{d}$ is the viewing direction, is:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$$

where $T(t)$ is the accumulated transmittance along the ray:

$$T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$$
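In practice this integral is estimated with a numerical quadrature rule over samples $t_i$ along the ray; the discrete weights that appear here are the same ones reused for hierarchical sampling below. A minimal NumPy sketch, assuming the densities and colors at the sample points have already been produced by the network:

```python
# Sketch of the quadrature estimate of C(r) along a single ray, assuming the
# densities sigma_i and colors c_i at sample depths t_i come from the network.
import numpy as np

def render_ray(t, sigma, rgb):
    """t: (N,) sample depths, sigma: (N,) densities, rgb: (N, 3) colors."""
    delta = np.append(t[1:] - t[:-1], 1e10)           # distances between samples
    alpha = 1.0 - np.exp(-sigma * delta)              # per-sample opacity
    # T_i = prod_{j < i} (1 - alpha_j): fraction of light reaching sample i.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha))[:-1]
    weights = trans * alpha                           # w_i = T_i (1 - exp(-sigma_i delta_i))
    color = (weights[:, None] * rgb).sum(axis=0)      # C(r) ~ sum_i w_i c_i
    return color, weights

# Toy example with random densities/colors along one ray:
t = np.linspace(2.0, 6.0, 64)
color, weights = render_ray(t, np.random.rand(64), np.random.rand(64, 3))
```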

Novelty in NeRF

Positional encoding

Deep networks are biased towards learning lower-frequency functions.

They additionally show that mapping the inputs to a higher dimensional space using high frequency functions before passing them to the network enables better fitting of data that contains high frequency variation.

Let $\gamma(p)$ be the positional encoding of $p$, mapping $\mathbb{R}$ to $\mathbb{R}^{2L}$, where $L$ is the number of frequencies.

$$\gamma(p) = \left[\sin\left(2^0 \pi p\right), \cos\left(2^0 \pi p\right), \ldots, \sin\left(2^{L-1} \pi p\right), \cos\left(2^{L-1} \pi p\right)\right]$$
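A minimal NumPy sketch of $\gamma$, applied independently to each coordinate (the paper uses $L = 10$ for the position coordinates and $L = 4$ for the viewing direction):

```python
# Sketch of the positional encoding gamma(p), applied elementwise to each input
# coordinate before it is fed to the MLP.
import numpy as np

def positional_encoding(p, L):
    """p: (...,) coordinates; returns (..., 2L) sin/cos features per coordinate."""
    freqs = (2.0 ** np.arange(L)) * np.pi             # 2^0 pi, 2^1 pi, ..., 2^(L-1) pi
    angles = p[..., None] * freqs                     # shape (..., L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

x = np.array([0.3, -0.7, 0.5])                        # a 3D position
gamma_x = positional_encoding(x, L=10)                # shape (3, 20)
```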

Hierarchical volume sampling

Optimize the coarse and fine networks simultaneously.

Let $\hat{C}_c(\mathbf{r})$ be the coarse prediction of the camera ray color:

$$\hat{C}_c(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \quad w_i = T_i\left(1 - \exp(-\sigma_i \delta_i)\right)$$

Normalizing these weights as $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$ produces a piecewise-constant PDF along the ray. We sample a second set of $N_f$ locations from this distribution using inverse transform sampling, evaluate our “fine” network at the union of the first and second sets of samples, and compute the final rendered color of the ray $\hat{C}_f(\mathbf{r})$ using all $N_c + N_f$ samples.
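A minimal NumPy sketch of that resampling step, assuming the coarse weights $w_i$ and their sample depths are already available; for simplicity it snaps each new sample to the nearest coarse bin rather than interpolating within bins:

```python
# Sketch of hierarchical sampling: turn the coarse weights into a piecewise-
# constant PDF along the ray and draw N_f extra samples where weight is high.
import numpy as np

def sample_pdf(t_coarse, weights, n_fine):
    """t_coarse: (N_c,) coarse sample depths, weights: (N_c,) coarse w_i."""
    pdf = weights / (weights.sum() + 1e-8)            # normalized w_i -> PDF
    cdf = np.cumsum(pdf)                              # monotone CDF in [0, 1]
    u = np.random.uniform(size=n_fine)                # uniform draws
    idx = np.searchsorted(cdf, u)                     # invert the CDF
    return t_coarse[np.clip(idx, 0, len(t_coarse) - 1)]

# Toy example: fine samples concentrate where the coarse weights peak (near t = 4).
t_coarse = np.linspace(2.0, 6.0, 64)
weights = np.exp(-((t_coarse - 4.0) ** 2))
t_fine = sample_pdf(t_coarse, weights, n_fine=128)
```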

Tip
  1. This paper reminds me of Gaussian Splatting. In this paper's setting, we can treat the scene as a function of 5D coordinates (all the cameras focus on the world origin). However, in general settings, we have 6D coordinates (3D position and 3D direction). Is there any way to use Gaussian Splatting to reconstruct the scene?
  2. In the positional encoding, the function $\gamma(p)$ reminds me of the Fourier transform. Is there any connection between the two?

Volume Rendering

The renderer composites the color and density output by the network along each camera ray to produce pixel colors.
