Multi-layer perceptron

A multi-layer perceptron (MLP) architecture mapping \(\mathbb{R}\to\mathbb{R}\) can be written as the following function. Each box (\(\square\)) represents a parameter in \(\theta\), to be determined during training.

\[ \mathcal{NN}(x;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh \left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix}\tanh \left( \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} x + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right). \]
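For concreteness, here is a minimal NumPy sketch of this map. The hidden widths, parameter names, and random parameter values are illustrative assumptions, not part of the definition above.

```python
import numpy as np

def mlp_scalar(x, theta):
    """Two-hidden-layer MLP R -> R with tanh activations, matching the boxed parameters:
      w1 : (h1,)    first-layer weights (the input is a scalar)
      b1 : (h1,)    first-layer biases
      W2 : (h2, h1) second-layer weights
      b2 : (h2,)    second-layer biases
      c  : (h2,)    output weights
    """
    w1, b1, W2, b2, c = theta
    h = np.tanh(w1 * x + b1)   # first hidden layer
    h = np.tanh(W2 @ h + b2)   # second hidden layer
    return c @ h               # scalar output

# Illustrative choice: hidden widths h1 = h2 = 8 and random parameters.
rng = np.random.default_rng(0)
h1, h2 = 8, 8
theta = (rng.normal(size=h1), rng.normal(size=h1),
         rng.normal(size=(h2, h1)), rng.normal(size=h2),
         rng.normal(size=h2))
print(mlp_scalar(0.5, theta))
```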

If the input is changed to \((x_1, x_2)\in\mathbb{R}^2\) while the output is still in \(\mathbb{R}\), the corresponding MLP becomes

\[ \mathcal{NN}(x_1, x_2;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh \left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix}\tanh \left( \begin{bmatrix} \square & \square \\ \vdots & \vdots \\ \square & \square \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right). \]
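The sketch above carries over with one change: the first-layer weights become a matrix acting on the input vector. As before, the widths and parameter values are illustrative assumptions.

```python
import numpy as np

def mlp_vector(x, theta):
    """Two-hidden-layer MLP R^2 -> R with tanh activations.
    Compared with the scalar case, only the first layer changes:
      W1 : (h1, 2) acts on the input vector (x1, x2).
    """
    W1, b1, W2, b2, c = theta
    h = np.tanh(W1 @ x + b1)
    h = np.tanh(W2 @ h + b2)
    return c @ h

rng = np.random.default_rng(1)
h1, h2 = 8, 8
theta = (rng.normal(size=(h1, 2)), rng.normal(size=h1),
         rng.normal(size=(h2, h1)), rng.normal(size=h2),
         rng.normal(size=h2))
print(mlp_vector(np.array([0.5, -1.0]), theta))
```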

Figure: graphical representations of the \(\mathbb{R}\to\mathbb{R}\) MLP and the \(\mathbb{R}^2\to\mathbb{R}\) MLP.

Universal approximation theorem (Cybenko, 1989). Let \(X\subset\mathbb{R}^n\) be a compact subset and let \(f\in C(X;\mathbb{R})\). Then for any \(\epsilon>0\), there exists a neural network function \(\mathcal{NN}(x;\theta)\) with suitable depth and width (\(\mathrm{dim}\ \theta\) depends on \(\epsilon\)) and suitable parameters (\(\theta\) also depends on \(\epsilon\)) such that \[ \sup_{x\in X} |f(x) - \mathcal{NN}(x;\theta)| < \epsilon. \]
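As a numerical illustration of this statement (not a proof), the following sketch fits a one-hidden-layer tanh network to \(f(x)=\sin(x)\) on the compact set \([-\pi,\pi]\) by plain gradient descent. The target function, hidden width, learning rate, and step count are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
width, lr, steps = 32, 1e-2, 5000

xs = np.linspace(-np.pi, np.pi, 200)   # training grid on the compact set X
ys = np.sin(xs)                        # target function values f(x)

w = rng.normal(size=width)             # first-layer weights
b = rng.normal(size=width)             # first-layer biases
c = rng.normal(size=width) * 0.1       # output weights

for step in range(steps):
    H = np.tanh(np.outer(xs, w) + b)   # (N, width) hidden activations
    pred = H @ c                       # (N,) network outputs
    err = pred - ys

    # Backpropagate the mean-squared error by hand.
    dpred = 2.0 * err / len(xs)        # (N,)
    dc = H.T @ dpred                   # (width,)
    dH = np.outer(dpred, c)            # (N, width)
    dZ = dH * (1.0 - H ** 2)           # tanh' = 1 - tanh^2
    dw = xs @ dZ                       # (width,)
    db = dZ.sum(axis=0)                # (width,)

    w -= lr * dw
    b -= lr * db
    c -= lr * dc

# Report the sup-norm error on the training grid after training.
pred = np.tanh(np.outer(xs, w) + b) @ c
print("sup-norm error on the grid:", np.max(np.abs(pred - ys)))
```

The theorem only asserts that parameters achieving any prescribed \(\epsilon\) exist; whether gradient descent finds them is a separate question, and this sketch merely shows the fitting procedure in practice.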