Most problems can’t be solved by linear regression, so we extend linear regression to kernel regression in order to learn a nonlinear mapping from samples to labels. Kernel regression in turn leads to Neural Network Gaussian Processes (NNGP) and the Neural Tangent Kernel (NTK). Recall the least-squares loss for linear regression:
$$ \mathcal{L}(w) = \frac{1}{2}\sum^{n}_{i=1}(y^{(i)}-wx^{(i)})^2 $$
We extend this to nonlinear regression by applying a fixed nonlinear transform to the samples $x$ before performing linear regression.
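As a minimal sketch of this idea (the monomial feature map and the synthetic sine data below are illustrative assumptions, not from the text): apply a fixed nonlinear transform $\psi$ to the inputs, then solve ordinary least squares in feature space.

```python
import numpy as np

# Sketch: fix a nonlinear feature map psi, then run linear regression on psi(x).
# The data and the choice of monomial features are assumptions for illustration.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)      # nonlinear target

def psi(x, degree=5):
    """Fixed nonlinear transform: monomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

Phi = psi(x)                                             # n x (degree+1) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # minimizes (1/2) sum_i (y_i - <w, psi(x_i)>)^2
y_hat = Phi @ w

print("training MSE:", np.mean((y - y_hat) ** 2))
```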
Consider the class of functions
$$ \mathcal{F}=\{ f : \mathbb{R}^d\rightarrow \mathbb{R} \ ; \ f(x) = \lang w,\psi(x)\rang_{\mathcal{H}}, \ \psi : \mathbb{R}^d \rightarrow \mathcal{H}, \ w \isin \mathcal{H}\} $$
where $\mathcal{H}$ is a Hilbert space with inner product $\lang \ \cdot \ , \ \cdot \ \rang_\mathcal{H}$ and $\psi$ is a nonlinear feature map.
As an example, consider the feature map $\psi : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ given by
$$ \psi\left(\begin{bmatrix} x_1 \\ x_2\end{bmatrix}\right) = \begin{bmatrix} x_1 \\ x_2 \\ \sqrt{x_1^2 + x_2^2}\end{bmatrix} $$

The classes are not linearly separable in the original space, but after applying this nonlinear feature map they become linearly separable.
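A minimal numerical sketch of this example (the concentric-circle data and the threshold value are illustrative assumptions, not from the text): lifting radially separated classes with $\psi$ makes them separable by a hyperplane on the third coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_circle(radius, n):
    """Sample n noisy points near a circle of the given radius."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + 0.1 * rng.standard_normal(n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

inner = sample_circle(1.0, 100)   # class -1: points near radius 1
outer = sample_circle(3.0, 100)   # class +1: points near radius 3

def psi(X):
    """Feature map R^2 -> R^3 from the text: [x1, x2, sqrt(x1^2 + x2^2)]."""
    return np.column_stack([X[:, 0], X[:, 1], np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)])

# In feature space the hyperplane z_3 = 2 separates the classes,
# even though no line in R^2 does.
print(psi(inner)[:, 2].max() < 2.0 < psi(outer)[:, 2].min())   # expected: True
```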
→ How can we select an appropriate feature transformation for an arbitrary dataset when no prior information is available?
Theorem 1 (Representer Theorem)
Let $\mathcal{H}$ be a Hilbert space with inner product $\lang \ \cdot \ , \ \cdot \ \rang_\mathcal{H}$. Let $\{\psi(x^{(i)})\}^{n}_{i=1} \subset \mathcal{H}$ and $\{y^{(i)}\}^{n}_{i=1} \subset \mathbb{R}$. Then there exist $\{ \alpha_i \}^{n}_{i=1}\subset\mathbb{R}$ such that the minimum $\mathcal{H}$-norm minimizer, $w^{*}$, of the loss: