1. Introduction

Most problems cannot be solved by linear regression, so we extend linear regression to kernel regression in order to learn a nonlinear mapping from samples to labels. Kernel regression, in turn, leads to Neural Network Gaussian Processes (NNGP) and the Neural Tangent Kernel (NTK).

2. Linear to Kernel

Recall the ordinary linear regression loss,

$$ \mathcal{L}(w) = \frac{1}{2}\sum^{n}_{i=1}\left(y^{(i)}-\langle w, x^{(i)}\rangle\right)^2 $$
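
For concreteness, here is a minimal numpy sketch of minimizing this loss in closed form via the normal equations; the data and variable names are synthetic and purely illustrative.

```python
# Minimal sketch: minimize (1/2) * sum_i (y_i - <w, x_i>)^2 in closed form.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # n = 100 samples x^{(i)} in R^3
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # labels y^{(i)} with small noise

# Normal equations: w* = (X^T X)^{-1} X^T y minimizes the squared loss
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                  # approximately recovers w_true
```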

We extend this to nonlinear regression by applying a fixed nonlinear transform to the samples $x$ before performing linear regression.

Consider the following class of functions,

$$ \mathcal{F}=\{ f : \mathbb{R}^d\rightarrow \mathbb{R} \ ; \ f(x) = \langle w,\psi(x)\rangle_{\mathcal{H}},\ \psi : \mathbb{R}^d \rightarrow \mathcal{H},\ w \in \mathcal{H}\} $$

where $\mathcal{H}$ is a Hilbert space with inner product $\langle \cdot\,, \cdot \rangle_\mathcal{H}$ and $\psi$ is a nonlinear feature map.
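
Such an $f$ is still linear in $w$, so fitting it reduces to ordinary least squares on the transformed samples $\psi(x^{(i)})$. Below is a minimal sketch, assuming an explicit polynomial feature map and synthetic data (both illustrative choices, not from the text).

```python
# Regression with a fixed nonlinear feature map psi(x) = (x, x^2, x^3):
# f(x) = <w, psi(x)> is linear in w but nonlinear in x.
import numpy as np

def psi(x):
    # explicit finite-dimensional feature map R -> R^3 (here H = R^3)
    return np.stack([x, x**2, x**3], axis=-1)

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=200)
y = 0.5 * x**3 - x + 0.05 * rng.normal(size=200)   # nonlinear target

Phi = psi(x)                                # n x 3 design matrix of lifted samples
w = np.linalg.lstsq(Phi, y, rcond=None)[0]  # ordinary least squares on psi(x)
print(np.mean((Phi @ w - y) ** 2))          # small training MSE despite nonlinear y
```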

Example 1.

Consider the feature map $\psi : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ given by

$$ \psi\left( \begin{bmatrix} x_1 \\ x_2\end{bmatrix} \right) = \begin{bmatrix} x_1 \\ x_2 \\ \sqrt{x_1^2 + x_2^2}\end{bmatrix} $$


The classes are not linearly separable in $\mathbb{R}^2$, but after applying the nonlinear feature map $\psi$ they become linearly separable in $\mathbb{R}^3$.
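
A small sketch of this example on hypothetical concentric-circle data: the lifted third coordinate is just the radius, so a single threshold on it acts as a linear separator in $\mathbb{R}^3$.

```python
# Example 1's feature map on two concentric circles (illustrative data):
# not linearly separable in R^2, but separable by thresholding the radius after psi.
import numpy as np

def psi(x):
    return np.array([x[0], x[1], np.sqrt(x[0]**2 + x[1]**2)])

rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])   # inner / outer class radii
X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
labels = np.concatenate([np.zeros(100), np.ones(100)])

Z = np.array([psi(x) for x in X])        # lifted points in R^3
pred = (Z[:, 2] > 2.0).astype(float)     # hyperplane z_3 = 2 separates the classes
print((pred == labels).mean())           # 1.0: perfectly separable after psi
```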

→ How can we select an appropriate feature map for an arbitrary dataset when we have no prior information about it?

2.1 Representer Theorem

Theorem 1 (Representer Theorem)

Let $\mathcal{H}$ be a Hilbert space with inner product $\langle \cdot\,, \cdot \rangle_\mathcal{H}$. Let $\{\psi(x^{(i)})\}^{n}_{i=1} \subset \mathcal{H}$ and $\{y^{(i)}\}^{n}_{i=1} \subset \mathbb{R}$. Then there exist $\{ \alpha_i \}^{n}_{i=1}\subset\mathbb{R}$ such that the minimum-$\mathcal{H}$-norm minimizer $w^{*}$ of the loss: