
1. Introduction
- NeRF is often impractical, as it requires a large number of posed images and a lengthy per-scene optimization.
- pixelNeRF addresses this with a framework that learns a scene prior, making it possible to predict a NeRF representation from one or a few images.
- Fully convolutional encoder producing pixel-aligned features
- Desirable properties for few-view novel-view synthesis:
  - Can be trained on multi-view images without additional supervision
  - Predicts the NeRF representation in the camera coordinate system of the input view rather than a canonical frame
  - Fully convolutional, preserving the alignment between image pixels and features
  - Can incorporate a variable number of posed input views at test time

2. pixelNeRF
- Fully convolutional image encoder $E$ → Encodes the input into a pixel-aligned feature grid
- NeRF network $f$ → Outputs color and density given a spatial location and its corresponding encoded feature
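A rough sketch of how these two components might be wired up in PyTorch follows (the ResNet-34 backbone, the truncation point, the feature dimensions, and the module names are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """Fully convolutional encoder E: image -> pixel-aligned feature grid W."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)
        # Assumption: keep only the early convolutional stages so the
        # output remains a spatial feature map rather than a vector.
        self.backbone = nn.Sequential(*list(resnet.children())[:-4])

    def forward(self, image):            # image: (B, 3, H, W)
        feats = self.backbone(image)     # (B, C, H/8, W/8) coarse grid
        # Upsample back to input resolution so features stay pixel-aligned.
        return nn.functional.interpolate(
            feats, size=image.shape[-2:], mode="bilinear", align_corners=False
        )

class NeRFNetwork(nn.Module):
    """MLP f: (gamma(x), d; W(pi(x))) -> (sigma, c)."""
    def __init__(self, pos_dim, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + 3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # (sigma, r, g, b)
        )

    def forward(self, gamma_x, d, feat):
        out = self.mlp(torch.cat([gamma_x, d, feat], dim=-1))
        sigma = torch.relu(out[..., :1])     # density is non-negative
        c = torch.sigmoid(out[..., 1:])      # color in [0, 1]
        return sigma, c
```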
2-1. Single-Image pixelNeRF

- Given an input image of a scene $I$, extract a feature volume $W = E(I)$
- For a point $x$ on a camera ray, retrieve the corresponding image feature by projecting $x$ onto the image plane at image coordinates $\pi(x)$ and bilinearly sampling $W$ to obtain $W(\pi(x))$
- The image feature is passed to the NeRF network along with the positionally encoded location $\gamma(x)$ and the view direction $d$:
$$
f(\gamma(x),d;W(\pi(x))) = (\sigma, c)
$$
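A minimal sketch of this query step, assuming a single input image, points already in the camera coordinate frame, and a known intrinsic matrix `K` (the helpers `project`, `sample_features`, and `positional_encoding` are hypothetical names, not the paper's code):

```python
import torch
import torch.nn.functional as F

def project(x, K):
    """pi(x): project camera-space points (N, 3) to pixel coordinates (N, 2)."""
    uvw = x @ K.T                          # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]        # perspective divide

def sample_features(W, uv, image_size):
    """W(pi(x)): bilinearly sample the feature grid at pixel coordinates."""
    h, w = image_size
    # grid_sample expects coordinates normalized to [-1, 1], x (width) first.
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)
    grid = grid.view(1, -1, 1, 2)                       # (1, N, 1, 2)
    feats = F.grid_sample(W, grid, align_corners=True)  # (1, C, N, 1)
    return feats[0, :, :, 0].T                          # (N, C)

def positional_encoding(x, num_freqs=6):
    """gamma(x): sinusoidal encoding of 3D positions."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi
    angles = x[..., None] * freqs                       # (N, 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                    # (N, 3 * 2 * num_freqs)

# Usage: sigma, c = f(positional_encoding(x), d, sample_features(W, project(x, K), (H, W)))
```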
2-2. Incorporating Multiple Views
- Allows for an arbitrary number of views at test time, which distinguishes pixelNeRF from existing approaches that are designed to use only a single input view
- Each input view $i$ has a known pose that transforms world space into its view space:
$$
P^{(i)} = [ R^{(i)}\ \ \ t^{(i)}]
$$
- For a new target camera ray, we transform a query point $x$ with view direction $d$ into the coordinate system of each input view $i$ as
$$
x^{(i)} = P^{(i)}x, \qquad d^{(i)} = R^{(i)}d
$$
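A sketch of the multi-view query, reusing the hypothetical `project`, `sample_features`, and `positional_encoding` helpers from the sketch above; in the paper, per-view intermediate features are aggregated inside the NeRF network by average pooling before the final layers predict $(\sigma, c)$, and that structure is mirrored loosely here (the `views` layout and the `f_initial`/`f_final` split are illustrative assumptions):

```python
import torch

def to_view_space(x, d, R, t):
    """Transform world-space points x (N, 3) and directions d (N, 3) into view i."""
    x_i = x @ R.T + t      # x^(i) = P^(i) x = R^(i) x + t^(i)
    d_i = d @ R.T          # d^(i) = R^(i) d; directions ignore translation
    return x_i, d_i

def query_multi_view(x, d, views, f_initial, f_final):
    """Query the NeRF network for world-space points given several posed views.

    views: list of dicts holding each view's rotation 'R', translation 't',
           feature grid 'W', and intrinsics 'K' (a hypothetical layout).
    f_initial / f_final: the initial and final layers of the NeRF network f.
    """
    intermediates = []
    for v in views:
        x_i, d_i = to_view_space(x, d, v["R"], v["t"])
        uv = project(x_i, v["K"])
        feat = sample_features(v["W"], uv, v["W"].shape[-2:])
        intermediates.append(f_initial(positional_encoding(x_i), d_i, feat))
    # Average-pool the per-view intermediate features across views,
    # then let the final layers predict (sigma, c).
    pooled = torch.stack(intermediates, dim=0).mean(dim=0)
    return f_final(pooled)
```

Because the pooling is a symmetric mean over views, the same network handles any number of input views at test time without retraining.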