
1. Introduction
- NeRF is often impractical, as it requires a large number of posed images and a lengthy per-scene optimization.
- pixelNeRF addresses this with a framework that learns a scene prior, making it possible to predict a NeRF representation from one or a few images.
- Fully convolutional encoder producing pixel-aligned features
- Desirable properties for few-view novel-view synthesis:
  - Can be trained on multi-view images without additional supervision
  - Predicts the NeRF representation in the camera coordinate system of the input view rather than a canonical frame
  - Fully convolutional, preserving the alignment between image pixels and features
  - Can incorporate a variable number of posed input views at test time

2. pixelNeRF
- Fully convolutional image encoder $E$ → Encodes the input into a pixel-aligned feature grid
- NeRF network $f$ → Outputs color and density given a spatial location and its corresponding encoded feature
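A rough sketch of how these two components might be wired up in PyTorch follows (the ResNet-34 backbone, the truncation point, the feature dimensions, and the module names are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """Fully convolutional encoder E: image -> pixel-aligned feature grid W."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)
        # Assumption: keep only the early convolutional stages so the
        # output remains a spatial feature map rather than a vector.
        self.backbone = nn.Sequential(*list(resnet.children())[:-4])

    def forward(self, image):            # image: (B, 3, H, W)
        feats = self.backbone(image)     # (B, C, H/8, W/8) coarse grid
        # Upsample back to input resolution so features stay pixel-aligned.
        return nn.functional.interpolate(
            feats, size=image.shape[-2:], mode="bilinear", align_corners=False
        )

class NeRFNetwork(nn.Module):
    """MLP f: (gamma(x), d; W(pi(x))) -> (sigma, c)."""
    def __init__(self, pos_dim, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + 3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # (sigma, r, g, b)
        )

    def forward(self, gamma_x, d, feat):
        out = self.mlp(torch.cat([gamma_x, d, feat], dim=-1))
        sigma = torch.relu(out[..., :1])     # density is non-negative
        c = torch.sigmoid(out[..., 1:])      # color in [0, 1]
        return sigma, c
```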
2-1. Single-Image pixelNeRF

- Given an input image of a scene $I$, extract a feature volume $W = E(I)$
- For a point $x$ on a camera ray, retrieve the corresponding image feature by projecting $x$ onto the image plane at image coordinates $\pi(x)$ and bilinearly sampling $W$ to obtain $W(\pi(x))$
- The image feature is passed to the NeRF network along with the positionally encoded location $\gamma(x)$ and the view direction $d$:
$$
f(\gamma(x),d;W(\pi(x))) = (\sigma, c)
$$
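A minimal sketch of this query step, assuming a single input image, points already in the camera coordinate frame, and a known intrinsic matrix `K` (the helpers `project`, `sample_features`, and `positional_encoding` are hypothetical names, not the paper's code):

```python
import torch
import torch.nn.functional as F

def project(x, K):
    """pi(x): project camera-space points (N, 3) to pixel coordinates (N, 2)."""
    uvw = x @ K.T                          # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]        # perspective divide

def sample_features(W, uv, image_size):
    """W(pi(x)): bilinearly sample the feature grid at pixel coordinates."""
    h, w = image_size
    # grid_sample expects coordinates normalized to [-1, 1], x (width) first.
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)
    grid = grid.view(1, -1, 1, 2)                       # (1, N, 1, 2)
    feats = F.grid_sample(W, grid, align_corners=True)  # (1, C, N, 1)
    return feats[0, :, :, 0].T                          # (N, C)

def positional_encoding(x, num_freqs=6):
    """gamma(x): sinusoidal encoding of 3D positions."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    angles = x[..., None] * freqs                       # (N, 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                    # (N, 3 * 2 * num_freqs)

# Usage: sigma, c = f(positional_encoding(x), d, sample_features(W, project(x, K), (H, W)))
```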
2-2. Incorporating Multiple Views
- Allows for an arbitrary number of views at test time, which distinguishes pixelNeRF from existing approaches that are designed to use only a single input view
- Each input view $i$ has a known pose that transforms world space into its view space:
$$
P^{(i)} = [ R^{(i)}\ \ \ t^{(i)}]
$$
- For a new target camera ray, we transform a query point $x$ with view direction $d$ into the coordinate system of each input view $i$ as
$$
x^{(i)} = P^{(i)}x, \qquad d^{(i)} = R^{(i)}d
$$
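A sketch of the multi-view query, reusing the hypothetical `project`, `sample_features`, and `positional_encoding` helpers from the sketch above; in the paper, per-view intermediate features are aggregated inside the NeRF network by average pooling before the final layers predict $(\sigma, c)$, and that structure is mirrored loosely here (the `views` layout and the `f_initial`/`f_final` split are illustrative assumptions):

```python
import torch

def to_view_space(x, d, R, t):
    """Transform world-space points x (N, 3) and directions d (N, 3) into view i."""
    x_i = x @ R.T + t      # x^(i) = P^(i) x = R^(i) x + t^(i)
    d_i = d @ R.T          # d^(i) = R^(i) d; directions ignore translation
    return x_i, d_i

def query_multi_view(x, d, views, f_initial, f_final):
    """Query the NeRF network for world-space points given several posed views.

    views: list of dicts holding each view's rotation 'R', translation 't',
           feature grid 'W', and intrinsics 'K' (a hypothetical layout).
    f_initial / f_final: the initial and final layers of the NeRF network f.
    """
    intermediates = []
    for v in views:
        x_i, d_i = to_view_space(x, d, v["R"], v["t"])
        uv = project(x_i, v["K"])
        feat = sample_features(v["W"], uv, v["W"].shape[-2:])
        intermediates.append(f_initial(positional_encoding(x_i), d_i, feat))
    # Average-pool the per-view intermediate features across views,
    # then let the final layers predict (sigma, c).
    pooled = torch.stack(intermediates, dim=0).mean(dim=0)
    return f_final(pooled)
```

Because the pooling is a symmetric mean over views, the same network handles any number of input views at test time without retraining.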