Motivation

The Dawning of a New Era in Applied Mathematics

— Weinan E

Current difficulties in FDM/FEM

  • combining data with physics (PDEs)
  • remeshing for complicated boundary geometry
  • higher-dimensional domains (curse of dimensionality)

Multi-layer perceptron

A multi-layer perceptron (MLP) architecture can be written as the following function \(\mathbb{R}\to\mathbb{R}\). The boxes (\(\square\)) represent the parameters \(\theta\) of this MLP (to be determined during training).

\[ \mathcal{NN}(x;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh \left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix}\tanh \left( \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} x + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right). \]

If the input is changed to \((x_1, x_2)\in\mathbb{R}^2\) while the output is still in \(\mathbb{R}\), the corresponding MLP becomes

\[ \mathcal{NN}(x_1, x_2;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh \left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix}\tanh \left( \begin{bmatrix} \square & \square \\ \vdots & \vdots \\ \square & \square \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right). \]
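As a concrete, purely illustrative sketch, the boxed formula corresponds to a small PyTorch module like the one below; the class name, default width, and two-hidden-layer structure are choices made here for illustration, and the module is reused in later sketches.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Two-hidden-layer tanh MLP mirroring the boxed formula, mapping R^{d_in} -> R."""
    def __init__(self, d_in=1, width=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, width), nn.Tanh(),   # inner column of boxes: W1 x + b1, then tanh
            nn.Linear(width, width), nn.Tanh(),  # square matrix of boxes: W2 (.) + b2, then tanh
            nn.Linear(width, 1, bias=False),     # outer column of boxes: final dot product
        )

    def forward(self, x):
        return self.net(x)

model = MLP(d_in=1, width=100)
print(sum(p.numel() for p in model.parameters()))  # 10400 parameters, matching the count used below
```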

[Figures: graphical representations of the \(\mathbb{R}\to\mathbb{R}\) and \(\mathbb{R}^2\to\mathbb{R}\) MLPs]

Universal approximation theorem (Cybenko, 89'). Let \(X\subset\mathbb{R}^n\) be a compact set and \(f\in C(X;\mathbb{R})\). Then for any \(\epsilon>0\), there exists a neural network function \(\mathcal{NN}(x;\theta)\) with proper depth and width (\(\mathrm{dim}\ \theta\) depends on \(\epsilon\)) and proper parameters (\(\theta\) also depends on \(\epsilon\)) such that \[ \sup_{x\in X} |f(x) - \mathcal{NN}(x;\theta)| < \epsilon. \]

Low-rank structure

Goal: to mitigate overparametrization of wide linear layers

We solve a very simple ODE: \[ f'(x)=2x, \hspace{2em} x\in[-1, 1] \] with initial condition \(f(0)=0\). The MLP is chosen to be \[ \mathcal{NN}(x;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh \left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix}\tanh \left( \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} x + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right), \] where the width is chosen to be 100 (\(\dim\theta=10400\)). We choose the collocation points to be \[ x_1=-1, x_2=-0.99, x_3=-0.98, \cdots, x_{199}=0.98, x_{200}=0.99, x_{201}=1. \] Collocation points can be chosen uniformly from the ODE/PDE domain, or selected at random.

The loss function is proposed as \[ L(\theta)= \lambda\cdot(\mathcal{NN}(0;\theta)-0)^2 + \sum_{j=1}^{201} [ \mathcal{NN}'(x_j;\theta) - 2x_j ]^2, \] where \(\lambda\) is a tunable parameter that balances the importance of data and physics. The optimization method here is a gradient-descent-based method with momentum (Adam).
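A minimal PyTorch training loop for this setup might look as follows, reusing the MLP sketched earlier; the learning rate, epoch count, and the value of \(\lambda\) are illustrative choices, not the ones used to produce the results discussed here.

```python
import torch

# Collocation points x_1 = -1, ..., x_201 = 1 (step 0.01), as described above.
x = torch.linspace(-1.0, 1.0, 201).unsqueeze(1).requires_grad_(True)
x0 = torch.zeros(1, 1)          # initial-condition point x = 0
lam = 100.0                     # tunable data/physics weight (illustrative value)

model = MLP(d_in=1, width=100)  # MLP sketched above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5000):
    u = model(x)
    # u'(x) via automatic differentiation with respect to the input
    du = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    loss = lam * (model(x0) - 0.0).pow(2).sum() + ((du - 2 * x) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```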

At each epoch, we track the singular value distribution of the \(100\times100\) square weight matrix in the MLP.

Theorem (Marchenko-Pastur, 67'). Let \(X_{ij}, 1\leq i\leq p, 1\leq j\leq n\) be independent random variables with \(\mathbb{E}[X_{ij}]=0\) and \(\mathbb{E}[X_{ij}^2]=1\). Denote by \(\lambda_{1}\geq\cdots\geq\lambda_{p}\geq0\) the eigenvalues of the symmetric matrix \(W_p:=\frac{1}{n} X X^\intercal\in\mathbb{R}^{p\times p}\). Then, when \(p\to\infty\) and \(n\to\infty\) with fixed ratio \(\frac{p}{n}\to\lambda\in(0,\infty)\), there is convergence in distribution \[ \frac{1}{p} \sum_{k=1}^{p} 1_{\{\lambda_k \leq x\}} \to F_{\lambda}(x) \hspace{1em} \forall x\in\mathbb{R}, \] and its distributional derivative is \[ F'_{\lambda}(x) = \frac{\sqrt{(b-x)(x-a)}}{2\pi \lambda x}, \] where \(a=\left(1-\sqrt{\lambda}\right)^2\) and \(b=\left(1+\sqrt{\lambda}\right)^2\).

So if we denote the singular values of \(\frac{1}{\sqrt{n}}X\in\mathbb{R}^{n\times n}\) by \(\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_n\geq0\), then their empirical distribution satisfies \[ \lim_{n\to\infty} \frac{1}{n}\sum_{k=1}^{n} 1_{\{\sigma_k\leq x\}} =\frac{x\sqrt{4-x^2}}{2\pi} + 1 - \frac{2}{\pi}\arctan\frac{\sqrt{4-x^2}}{x}, \hspace{1em} x\in(0,2), \] the quarter-circle law.
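The following NumPy sketch (illustrative; the matrix size and bin count are arbitrary) compares the empirical singular value histogram of a random matrix \(\frac{1}{\sqrt{n}}X\) with the quarter-circle density \(\sqrt{4-x^2}/\pi\).

```python
import numpy as np

n = 500
X = np.random.default_rng(0).standard_normal((n, n))   # i.i.d. entries, mean 0, variance 1
sigma = np.linalg.svd(X / np.sqrt(n), compute_uv=False)

# Empirical histogram vs. the quarter-circle density p(x) = sqrt(4 - x^2) / pi on (0, 2)
hist, edges = np.histogram(sigma, bins=20, range=(0.0, 2.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
quarter_circle = np.sqrt(4.0 - centers**2) / np.pi
print(np.round(hist, 2))
print(np.round(quarter_circle, 2))
```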

Motivated by this observation, we use an alternative architecture, the Kronecker neural network [Jagtap, Shin, Kawaguchi and Karniadakis (2022)] (rank = 1), \[ \mathcal{KNN}(x;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh \left( (A\otimes B)\tanh \left( \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} x + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) \] to solve the same ODE. The Kronecker product of \(A\in\mathbb{R}^{m\times n}\) and \(B\) is defined as \[ A\otimes B = \begin{bmatrix} a_{1,1} B & \cdots & a_{1,n}B \\ \vdots & \ddots & \vdots \\ a_{m,1}B & \cdots & a_{m,n}B \end{bmatrix}. \]

This architecture is equivalent to replacing \(A\otimes B\) with the rank-1 product \(U_1 V_1^\intercal\), where \(U_1, V_1\) are the leading singular value decomposition (SVD) factors. The resulting architecture has \(\dim\theta=600\), yet it preserves much of the expressive power of the full-rank MLP.
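The parameter count can be checked with a rank-1 factorization of the middle layer. The sketch below is not the full Kronecker network of Jagtap et al.; it is just a minimal low-rank substitute (\(W_2 \approx u v^\intercal\)) written for illustration.

```python
import torch
import torch.nn as nn

class Rank1Linear(nn.Module):
    """Replaces a width x width dense matrix by a rank-1 factor u v^T (200 weights instead of 10000)."""
    def __init__(self, width=100):
        super().__init__()
        self.u = nn.Parameter(torch.randn(width, 1) / width**0.5)
        self.v = nn.Parameter(torch.randn(width, 1) / width**0.5)
        self.bias = nn.Parameter(torch.zeros(width))

    def forward(self, h):
        # h @ v gives the scalar v.h per sample; scaling by u^T reproduces (u v^T) h
        return (h @ self.v) @ self.u.T + self.bias

class LowRankMLP(nn.Module):
    def __init__(self, width=100):
        super().__init__()
        self.inp = nn.Linear(1, width)
        self.mid = Rank1Linear(width)
        self.out = nn.Linear(width, 1, bias=False)

    def forward(self, x):
        return self.out(torch.tanh(self.mid(torch.tanh(self.inp(x)))))

model = LowRankMLP(width=100)
print(sum(p.numel() for p in model.parameters()))  # 600
```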

Loss Function

Type         From
residual     PDE/physics
regression   data
variational  physics

We will mainly focus on PDEs instead of ODEs. To fully visualize the solution, the input of \(u\) is usually of dimension 2, so the selected MLP architecture is typically

[Figure: graphical representation of the two-input, one-output MLP]

where the solution to the PDE under discussion will be denoted by \(u\), and its approximation by \(u(\cdot;\theta)\).

The convention is: whenever a semicolon and \(\theta\) appear in the argument of \(u\), it refers to the ANN approximation; when \(\theta\) is absent, \(u\) denotes the PDE solution.

Denote the loss function associated with the PDE problem by \(L(\theta)\), where \(\theta\in\mathbb{R}^d\) is the parameter vector of an ANN.

The goal is to design \(L(\theta)\) such that when \(u(\cdot;\theta)\) well approximates \(u\), \(L\) attains its global minimum.

Residual loss

The residual loss plugs in the neural network approximation to the PDE and measures the residual. For example, when solving the (1D) heat equation \[ u_t = u_{xx}, \hspace{2em} (t,x)\in[0,T]\times\Omega \] with given initial condition \(u(t=0,x)=u_0(x)\). Residual loss at a collocation point \((t,x)\) is \[ L(\theta; (t,x)) = [u_t(t,x;\theta) - u_{xx}(t,x;\theta)]^2. \] The total residual loss on the domain of interest is usually designed to be \[ L_\mathrm{res}(\theta) \propto \int_{0}^{T} \int_\Omega | u_t(t,x;\theta) - u_{xx}(t,x;\theta)|^2 \mathrm{d}x \mathrm{d}t \]
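As a sketch (assuming a PyTorch model that maps \((t,x)\) pairs to \(u\); names and the mean-squared scaling are illustrative), the residual loss for the heat equation can be evaluated with automatic differentiation:

```python
import torch

def heat_residual_loss(model, t, x):
    """Mean squared residual |u_t - u_xx|^2 at collocation points (t, x).

    model maps an (N, 2) tensor of (t, x) pairs to u of shape (N, 1);
    t and x are plain (N, 1) coordinate tensors.
    """
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    u = model(torch.cat([t, x], dim=1))
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, grad_outputs=ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, grad_outputs=ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, grad_outputs=torch.ones_like(u_x), create_graph=True)[0]
    return ((u_t - u_xx) ** 2).mean()
```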

Computing the total residual loss amounts to performing a numerical integral.

If the PDE is changed to the (inviscid) Burgers' equation \[ u_t + u u_x = 0, \hspace{2em} (t,x)\in (0,\infty)\times\mathbb{R} \] with the Riemann IVP \(u(t=0, x)=1-\mathrm{Heaviside}(x)\), then the residual loss at a collocation point \((t,x)\) is \[ L(\theta; (t,x)) = [u_t(t,x;\theta) + u(t,x;\theta)\cdot u_x(t,x;\theta)]^2. \]

We almost never encounter a lack of training data or overfitting issues with the residual loss: we can always choose sufficiently many collocation points \((t_n,x_n)\), \(n=1,\cdots, N\), from the domain of interest to achieve \(N\gg\dim\theta\).

Regression loss

The regression loss measures the difference between the ANN prediction and observed data. This includes the initial data of the PDE, but it can be more general and provides considerable flexibility.

If the Riemann IVP \(u_0(x)=1-\mathrm{Heaviside}(x)\) is posed as the initial condition, the exact solution to the inviscid Burgers' equation is \[ u(t,x) = u_0(x-t/2), \]

and the corresponding regression loss can be chosen as \[ L_\mathrm{reg}(\theta)\propto \int_{\Omega} |u(0,x;\theta) - u_0(x)|^2 \mathrm{d}x. \]

A minimization of the total loss \[ L(\theta) = L_\mathrm{res}(\theta) + L_\mathrm{reg}(\theta) \] leads to the following approximate solutions.

[Figures: shallow PINN solution, 2D and 3D views]

Increasing \(\dim\theta\) can lead to a more accurate solution.

[Figures: deep PINN solution, 2D and 3D views]

This setup allows much more flexibility in the observed data. Suppose the initial data \(u_0(x)\) in the Riemann IVP are not observed. Instead, a moving sensor is placed in the domain of interest to measure the gas pressure (the inviscid Burgers' equation arises in ideal gas dynamics). We can control the motion of the sensor to be \[ x(t)=\frac{1+\sin(4\pi t)}{4}, \] and the observed data from the sensor become \[ u(t, x(t)) = f(t). \]

In this case, the regression loss is redefined as \[ L_\mathrm{reg}(\theta)\propto \int_{0}^{1} |u(t,x(t);\theta) - f(t)|^2 \mathrm{d}t. \] Minimizing the same total loss \(L(\theta) = L_\mathrm{res}(\theta) + L_\mathrm{reg}(\theta)\), we can recover the full solution on the space-time domain of interest.
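A discretized version of this sensor-path regression loss might look like the following sketch (the model interface, the callable \(f\), and the number of quadrature points are assumptions for illustration):

```python
import math
import torch

def sensor_regression_loss(model, f, n_pts=200):
    """Discretized L_reg along the moving-sensor path x(t) = (1 + sin(4*pi*t)) / 4, t in [0, 1].

    f(t) returns the observed sensor values at times t; model maps (t, x) pairs to u.
    """
    t = torch.linspace(0.0, 1.0, n_pts).unsqueeze(1)
    x = (1.0 + torch.sin(4.0 * math.pi * t)) / 4.0
    u_pred = model(torch.cat([t, x], dim=1))
    return ((u_pred - f(t)) ** 2).mean()
```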

The recovered solution can achieve \[ \|u(t,x;\theta)-u_\mathrm{exact}(t,x)\|_{L^2([0,1]\times[-1,1])} = 1.16\times10^{-2}. \]

As a mesh-free method, PINN can overcome the difficulties that data misalignment causes for traditional methods (e.g., finite differences).


\begin{align} u_t + u u_x &= 0 \\ \frac{u(t+\Delta t, x) - u(t, x)}{\Delta t} + u(t,x)\cdot \frac{u(t, x+\Delta x) - u(t, x-\Delta x)}{2\Delta x} &= 0 \end{align}

The reasons to distinguish the regression loss from the residual loss are:

  • the regression loss can be evaluated without back-propagation, while the residual loss cannot
  • if only a limited amount of data is observed, the regression loss may place constraints on the model complexity (\(\dim\theta\))

Variational loss

A variational loss can be adopted when a PDE has a variational form, in which case the PDE is the Euler-Lagrange equation of the variational quantity. In such problems the PDEs are guaranteed to be elliptic.

A typical variational loss is the Dirichlet energy \[ E(u):=\int_{\Omega} |\nabla u|^2 \mathrm{d} x, \] where \(\Omega\subset\mathbb{R}^n\) is a bounded open domain with smooth boundary. If we restrict the search space to \(u\in H^1(\Omega)\) with boundary condition \(u=u_\mathrm{b}\) on \(\partial\Omega\) for a prescribed \(u_\mathrm{b}\in C^\infty(\partial\Omega)\), then the minimizer of \(E\) exists, is unique, and coincides with the solution to the Laplace equation on \(\Omega\) \[ \begin{cases} \Delta u(x) = 0 & x\in\Omega \\ u(x) = u_\text{b}(x) & x\in\partial\Omega \end{cases}. \]

Since we mainly work in \(\mathbb{R}^2\), we rename the input variables as \(x,y\) instead of \(x\in\mathbb{R}^2\).

To illustrate how conveniently PINN handles complicated boundary geometry, we choose \(\Omega\) to be a Koch snowflake.

The boundary condition is chosen to be \[ u_\mathrm{b}(x,y)= \frac{x^2-y^2}{x^2+y^2}, \] which in polar coordinates reads \[ u_\mathrm{b}(r,\varphi)=\cos(2\varphi). \]

The collocation points are chosen to be triangle vertices of the Koch snowflake at fractal level 5, so \(\Omega\) is approximated by a polygon of \(3\times4^5=3072\) sides. Consequently, there are 45397 collocation points in the interior of this polygon, as the vertices of the triangles. The variational loss in this case is chosen to be \[ L_\mathrm{var}(\theta; A) = \sum_{j=1}^{45397} \left[ u_x^2(x_j,y_j;\theta) + u_y^2(x_j, y_j;\theta) \right], \hspace{1em} A = \{(x_j, y_j)\}_{j=1}^{45397} \subset \mathring{\Omega}. \]

This is a discretized (and rescaled) version of \( \iint_{\Omega} |\nabla u(x,y)|^2 \mathrm{d} x \mathrm{d} y \).

To enforce the boundary condition, we pose the regression loss as \[ L_\mathrm{reg}(\theta; B) = \sum_{j=1}^{3072} \left[ u(x_j,y_j;\theta) - u_\text{b}(x_j, y_j) \right]^2, \hspace{1em} B = \{(x_j, y_j)\}_{j=1}^{3072} \subset \partial\Omega. \]

This is a discretized (and rescaled) version of \( \|u(x,y;\theta) - u_\text{b}(x,y)\|_{L^2(\partial \Omega)}^2 \).

The total loss can be defined as the sum \[ L_\mathrm{total}(\theta; A\cup B) = L_\mathrm{var}(\theta; A) + L_\mathrm{reg}(\theta; B). \]
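Assuming the collocation sets \(A\) and \(B\) are available as coordinate tensors, the total loss can be assembled with automatic differentiation as in the following sketch (function and argument names are illustrative):

```python
import torch

def total_loss(model, interior_pts, boundary_pts, u_b):
    """L_var(theta; A) + L_reg(theta; B) for the Laplace problem on the snowflake.

    interior_pts: (|A|, 2) tensor of interior collocation points (no grad required),
    boundary_pts: (|B|, 2) tensor on the boundary, u_b: boundary values of shape (|B|, 1).
    """
    xy = interior_pts.clone().requires_grad_(True)
    u = model(xy)
    grad_u = torch.autograd.grad(u, xy, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    l_var = (grad_u ** 2).sum()                       # sum over A of u_x^2 + u_y^2
    l_reg = ((model(boundary_pts) - u_b) ** 2).sum()  # sum over B of (u - u_b)^2
    return l_var + l_reg
```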

We can also compare with the solution obtained from finite difference method.

Gradient descent

Gradient Descent is a fundamental optimization algorithm used in machine learning and mathematical optimization.

Consider a loss function \(L(\theta;\text{collocation points})\) of an ANN solving a differential equation (e.g., residual loss plus regression loss), where \(\theta\in\mathbb{R}^d\) is the vector of parameters in the ANN.

We will fix the collocation points and simply write \(L:\mathbb{R}^d\to\mathbb{R}\) as the loss function.

Denote the iteration (epoch) index by a subscript; \(\theta_0\) is the initial guess of the parameters in the ANN, typically drawn i.i.d. from a normal (\(\mathcal{N}\)) or uniform distribution.

The update rule for gradient descent is \[ \theta_{n+1} = \theta_{n} - \gamma \nabla_\theta L(\theta_n), \hspace{1em} n=0,1,2,\cdots \tag{GD} \] where \(\gamma>0\) is a small tunable learning rate.

The caveat is that if \(d\) is large and the number of collocation points is also large, computing the gradient \(\nabla_\theta L\) may become infeasible due to hardware (typically memory) constraints. This leads to stochastic gradient descent (SGD); one version is \begin{align*} \theta_{n+1} &= \theta_{n} - \gamma \nabla_\theta L_\text{res}(\theta_n) && n=0,2,4,\cdots \\ \theta_{n+2} &= \theta_{n+1} - \gamma \nabla_\theta L_\text{reg}(\theta_{n+1}) \tag{SGD} \end{align*} where \(L_\text{total}(\theta)=L_\text{res}(\theta)+L_\text{reg}(\theta)\) is the function to minimize.
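For the simple ODE above, this alternating scheme can be sketched as follows (reusing the MLP from the earlier sketch; the optimizer, learning rate, and epoch count are illustrative):

```python
import torch

# Alternating updates in the spirit of (SGD): one step on L_res, then one step
# on L_reg, instead of a single step on L_res + L_reg.
xs = torch.linspace(-1.0, 1.0, 201).unsqueeze(1)   # collocation points
x0 = torch.zeros(1, 1)                             # initial-condition point
opt = torch.optim.SGD(model.parameters(), lr=1e-4)

def l_res():
    xg = xs.clone().requires_grad_(True)
    u = model(xg)
    du = torch.autograd.grad(u, xg, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    return ((du - 2 * xg) ** 2).sum()

def l_reg():
    return (model(x0) ** 2).sum()

for epoch in range(1000):
    for loss_fn in (l_res, l_reg):   # even step: residual; odd step: regression
        opt.zero_grad()
        loss_fn().backward()
        opt.step()
```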

Ensemble Kalman Filter [Kovachki, Stuart (2019)]

Idea: use differences to approximate the gradient \(\nabla_\theta L\)

Instead of calculating the gradient of the loss \(\nabla_\theta L\) at a given parameter to update \(\theta\), we introduce the Ensemble Kalman Filter (EnKF). Let \(\theta\in\mathbb{R}^d\) be the parameter vector of an ANN, and let the objective be to minimize a loss function associated with solving a PDE using the ANN. EnKF samples \[ \theta^{(1)}, \cdots, \theta^{(J)} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, \Sigma), \] where each \(\theta^{(j)}\) is called a particle and \(J\) is the ensemble size.

Notably, each \(\theta^{(j)}\) has the same dimension as \(\theta\). The loss function is evaluated at all collocation points \((t_1,x_1),\cdots,(t_N,x_N)\) in the PDE domain of interest, with each particle \(\theta^{(j)}\) serving as the ANN parameter, i.e. \[ \mathcal{G}\left(\{\theta^{(j)}\}_{j=1}^{J}\right)= \begin{bmatrix} L(\theta^{(1)}; (t_1,x_1)) & L(\theta^{(1)}; (t_2,x_2)) & \cdots & L(\theta^{(1)}; (t_N,x_N)) \\ L(\theta^{(2)}; (t_1,x_1)) & L(\theta^{(2)}; (t_2,x_2)) & \cdots & L(\theta^{(2)}; (t_N,x_N)) \\ \vdots & \vdots & \ddots & \vdots \\ L(\theta^{(J)}; (t_1,x_1)) & L(\theta^{(J)}; (t_2,x_2)) & \cdots & L(\theta^{(J)}; (t_N,x_N)) \end{bmatrix}. \]

Each row of this matrix is given by the forward map \(\mathcal{G}(\cdot):\mathbb{R}^{d}\to\mathbb{R}^{N}\): the \(j\)-th row is \(\mathcal{G}(\theta^{(j)})^\intercal\). We use the transpose \(^\intercal\) here because we view row vectors as distinct from column vectors, and by default an element of \(\mathbb{R}^N\) is a column vector. Using the convention that an upper bar denotes the ensemble mean, \[ \bar{\mathcal{G}} := \frac{1}{J}\sum_{j=1}^{J} \mathcal{G}(\theta^{(j)}). \]

The most important matrix \(D(\theta)\in\mathbb{R}^{J\times J}\) is defined elementwise; its entry in the \(k\)-th row and \(j\)-th column is \[ [D(\theta)]_{k,j} := \frac{1}{J} \left(\mathcal{G}(\theta^{(k)})-\bar{\mathcal{G}} \right)^\intercal \mathcal{G}(\theta^{(j)}). \]

The update rule for each particle \(\theta^{(j)}\) is \[ \theta^{(j)} \leftarrow \theta^{(j)} - h \sum_{k=1}^{J} [D(\theta)]_{k,j}\, \theta^{(k)} \] (equivalently, \(\Theta \leftarrow \Theta\left(I_J - h D(\theta)\right)\) for the particle matrix \(\Theta=[\theta^{(1)},\cdots,\theta^{(J)}]\in\mathbb{R}^{d\times J}\)), where \(h\) is a scalar that plays the same role as the learning rate in gradient-based optimization methods, and a typical choice is \[ h = \frac{1}{\epsilon+ \|D(\theta)\|_F }, \] where \(\|\cdot\|_F\) denotes the Frobenius norm.

Ensemble Kalman Filter minimizing the Rosenbrock function

We apply EnKF to minimize the Rosenbrock function with \(d=2\) and \(\theta=(x,y)\): \[ L(x,y) = (1-x)^2 + 5 \left(y-x^2\right)^2. \] It has a unique global minimizer \((x,y)=(1,1)\). We choose \(J=10\) such that \[ \theta_0^{(1)},\cdots,\theta_0^{(10)} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}([0,0]^\intercal, I_{2}), \] where \(I_2\) is the \(2\times2\) identity matrix, and the subscript denotes the epoch in \(\{0,1,\cdots,9\}\).
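A NumPy sketch of the particle update above, applied to the Rosenbrock example, is given below. The choice of the forward map \(\mathcal{G}\) as the residual vector with \(L=\|\mathcal{G}\|^2\) is an assumption made here for illustration; the random seed and the printed diagnostic are likewise illustrative.

```python
import numpy as np

def G(theta):
    """Residual map chosen so that L(theta) = ||G(theta)||^2 (one possible choice)."""
    x, y = theta
    return np.array([1.0 - x, np.sqrt(5.0) * (y - x ** 2)])

rng = np.random.default_rng(0)
J, d, eps = 10, 2, 1e-8
Theta = rng.standard_normal((d, J))                  # columns are particles theta^(1..J) ~ N(0, I_2)

for epoch in range(10):
    Gmat = np.stack([G(Theta[:, j]) for j in range(J)], axis=1)   # (N, J), column j = G(theta^(j))
    Gbar = Gmat.mean(axis=1, keepdims=True)
    D = (Gmat - Gbar).T @ Gmat / J                    # [D]_{k,j} = <G^(k) - Gbar, G^(j)> / J
    h = 1.0 / (eps + np.linalg.norm(D, 'fro'))
    Theta = Theta - h * Theta @ D                     # theta^(j) <- theta^(j) - h sum_k D_{k,j} theta^(k)

print(Theta.mean(axis=1))                             # ensemble mean should drift toward (1, 1)
```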

Revisiting the previous simple ODE and comparing with Adam

Main advantages:

  • useful when loss.backward() is unavailable
  • parallel computation
  • faster during early epochs

We revisit the previous simple ODE: \[ f'(x)=2x, \hspace{2em} x\in[-1, 1] \] with initial condition \(f(0)=0\). The collocation points are still \[ x_1=-1, x_2=-0.99, x_3=-0.98, \cdots, x_{199}=0.98, x_{200}=0.99, x_{201}=1 \] The neural architecture is still an MLP, \[ u(x;\theta):= \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}^\intercal \tanh\left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix} \tanh \left( \begin{bmatrix} \square & \cdots & \square \\ \vdots & \ddots & \vdots \\ \square & \cdots & \square \end{bmatrix}\tanh \left( \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} x + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix} \right) + \begin{bmatrix} \square \\ \vdots \\ \square \end{bmatrix}\right) . \]

The corresponding forward map is \[ \mathcal{G}(\theta)=\begin{bmatrix} u'(x_1;\theta)-2x_1 & u'(x_2;\theta)-2x_2 & \cdots & u'(x_{201};\theta)-2x_{201} & \lambda u(0;\theta) \end{bmatrix}^\intercal. \] Here \(\lambda\) is the same tunable parameter that balances the importance of data and physics, and we set \(\lambda=1000\).

If we glue together the solutions from each epoch, we can see that EnKF changes the prediction earlier than Adam does.

Deep Ritz Method [Weinan E & Bing Yu (2018)]

Equivalent formulation: variational PINN [Kharazmi, Zhang and Karniadakis (2021)]

\[ \min_{u\in H} \int_{\Omega} \left(\frac{1}{2} |\nabla u(x)|^2 - f(x)u(x) \right) \mathrm{d}x \iff -\Delta u(x) = f(x) \]

Main idea: replace the trial function by a neural network \(u(x;\theta)\) and search over \(\theta\in\mathbb{R}^d\).

Key advantage:

  • a unified method that handles complicated boundary geometry \(\partial\Omega\)

Ritz-Rayleigh quotient and Laplace Eigenfunction

Consider the eigenvalue problem of the Laplace operator on \(\Omega\subset\mathbb{R}^n\) with Dirichlet boundary condition: \[ \begin{cases} -\Delta u = \lambda u & x\in\Omega \\ u = 0 & x\in\partial\Omega \end{cases}. \]

We begin by illustrating examples for \(n=2\), where \(\Omega\) is a commonly encountered shape. In these scenarios, we use \(x\) and \(y\) to denote the input variables of \(u\) instead of \(x\in\mathbb{R}^2\). The variational formulation via the Ritz-Rayleigh quotient is \[ \lambda_1(\Omega) = \inf_{u\in H_{0}^{1}(\Omega)\setminus\{0\}} \frac{\int_{\Omega} |\nabla u|^2 \mathrm{d} x}{\int_{\Omega} u^2 \mathrm{d} x}, \] where the infimum is attained, and the minimizer is the eigenfunction of the Dirichlet Laplacian on \(\Omega\) associated with the smallest eigenvalue \(\lambda_1(\Omega)\).

Triangle

The first harmonic function on an equilateral triangle with vertices \((0,0)\), \((1,0)\), \((1/2, \sqrt{3}/2)\) is \[ u_\text{exact}(x, y) = \sin\left(2\pi\cdot \frac{2y}{\sqrt{3}}\right) + \sin2\pi\left(x - \frac{y}{\sqrt{3}}\right) + \sin2\pi\left(1-x-\frac{y}{\sqrt{3}} \right), \hspace{1em} (x,y)\in\bigtriangleup. \] The corresponding eigenvalue is \(\lambda=\frac{16}{3}\pi^2\).

Square

The first harmonic function on a unit square with vertices \((0,0)\), \((1,0)\), \((0, 1)\) and \((1,1)\) is \[ u_\text{exact}(x, y) = \sin(\pi x)\sin(\pi y), \hspace{1em} (x,y)\in\square. \] The corresponding eigenvalue is \(2\pi^2\).

Circle/Disk

The first harmonic function on a unit disk \(\Omega = \{(x,y): x^2+y^2\leq1\}\) is \[ u_\text{exact}(r\cos\varphi, r\sin\varphi) = J_0(\beta_{0,1} r), \hspace{1em} r\in[0,1], \] where \(J_0\) is the Bessel function of the first kind and \(\beta_{0,1}\approx2.4048\) is its first zero. There is radial symmetry in this eigenfunction, so the angular component \(\varphi\) in polar coordinates does not appear explicitly.

Koch snowflake [Lapidus, Neuberger, Renka and Griffith (1996)]

Deep Ritz training log

Mesh (triangle) formulation

[Figure: triangular mesh of the Koch snowflake]

The boundary points are indexed in the counter-clockwise order, while the interior points are indexed by the spiral order.

[Figure: spiral order of the interior points in the snowflake at fractal level L=2]

Finite-difference scheme using a 6-point stencil for the Laplace operator:

\[ \sum_{j=2}^{7} \mathbf{u}_j = 6 \mathbf{u}_1 + \frac{3}{2}h^2 \Delta u(x_1,y_1) + O(h^4) \]

First 8 rows and columns of the Laplace matrix \(A\) (finite difference), with rows and columns indexed by the mesh points \(1,\cdots,8\) (blank entries are zero):

\[ A_{1:8,\,1:8} = \begin{bmatrix} 6 & -1 & -1 & -1 & -1 & -1 & -1 & \\ -1 & 6 & -1 & & & & -1 & -1 \\ -1 & -1 & 6 & -1 & & & & \\ -1 & & -1 & 6 & -1 & & & \\ -1 & & & -1 & 6 & -1 & & \\ -1 & & & & -1 & 6 & -1 & \\ -1 & -1 & & & & -1 & 6 & \\ & -1 & & & & & & 6 \end{bmatrix} \]

The eigenvalue problem becomes \[ A\mathbf{u} = \lambda\mathbf{u}, \] so the eigenvector corresponding to each eigenvalue gives the eigenfunction (on the triangular mesh).
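For illustration, the smallest eigenpair of such a matrix can be computed with a sparse eigensolver. The sketch below uses the unit square with the standard 5-point stencil (not the 6-point hexagonal stencil above), so the result can be checked against the exact value \(2\pi^2\) from the Square example.

```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import eigsh

n = 50                                   # interior grid points per direction
h = 1.0 / (n + 1)
T = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) / h**2
A = (kron(identity(n), T) + kron(T, identity(n))).tocsc()   # 2D Dirichlet Laplacian

vals, vecs = eigsh(A, k=1, sigma=0, which='LM')             # eigenvalue closest to 0 = smallest
print(vals[0], 2 * np.pi**2)                                # ~19.73 vs 19.7392...
```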

Boundary regression loss

To enforce \(u|_{\partial \Omega}=0\) required in \(u\in H_0^1(\Omega)\), we need to pose regression loss

\[ L_\text{reg}(\theta; B) = \sum_{B} \left[ u(x_j,y_j;\theta) \right]^2, \hspace{1em} B = \{(x_j, y_j)\}_{j} \subset \partial\Omega. \]

The Ritz-Rayleigh quotient can be discretized to

\[ L_\text{var}(\theta; A) = \frac{\sum_{A} \left[ u_x^2(x_j,y_j;\theta) + u_y^2(x_j, y_j;\theta) \right]}{\sum_{A} u^2(x_j, y_j;\theta) }, \hspace{1em} A = \{(x_j, y_j)\}_{j} \subset \mathring{\Omega}. \]

The total loss to minimize is \[ L_\text{total}(\theta;A\cup B) = L_\text{var}(\theta; A) + \beta\cdot L_\text{reg}(\theta; B). \]

Here the boundary loss must be weighted much more heavily than the variational loss to ensure successful training:

\[ \beta\cdot|B| \gg |A|. \]
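A sketch of this total loss in PyTorch is given below (the weight \(\beta\) and the function and argument names are illustrative):

```python
import torch

def rayleigh_total_loss(model, interior_pts, boundary_pts, beta=500.0):
    """Discretized Ritz-Rayleigh quotient plus weighted boundary regression loss.

    interior_pts: (|A|, 2) tensor in the interior, boundary_pts: (|B|, 2) tensor on the boundary.
    """
    xy = interior_pts.clone().requires_grad_(True)
    u = model(xy)
    grad_u = torch.autograd.grad(u, xy, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    l_var = (grad_u ** 2).sum() / (u ** 2).sum()   # Rayleigh quotient on A
    l_reg = (model(boundary_pts) ** 2).sum()       # enforce u = 0 on B
    return l_var + beta * l_reg
```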

To avoid searching around the zero function \(u(\cdot;\theta)\approx0\), we pretrain \(\theta\) to fit a prior function.

\[ \text{prior}(x,y)\propto \exp\left(-\frac{1}{\text{dist}((x,y), \partial\Omega )} \right) \]

This is because the first eigenfunction doesn't change sign and can be chosen positive.

Triangle (Failed due to boundary)

Obstacle problems: search with PDE constraints

Given an obstacle function \(v\in H^1(\Omega)\), consider the variational problem \[ \inf_{u\geq v,\ u\in H_0^1(\Omega) } \int_{\Omega} |\nabla u|^2 \mathrm{d}x. \]

The Euler-Lagrange equation is \[\begin{align} \Delta u \leq 0 & & \text{in }\Omega \\ \Delta u = 0 & & \{x: u(x)>v(x)\} \end{align}\]

The corresponding regression loss is \[ L_\text{reg}(\theta) \propto \int_{\Omega} (v(x)-u(x;\theta))^+ \mathrm{d}x. \]
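A discretized sketch of this penalty term (assuming the obstacle \(v\) is available as a callable and the collocation points as a tensor) is:

```python
import torch

def obstacle_regression_loss(model, pts, v):
    """Penalize violations of u >= v: mean of (v - u)^+ over collocation points pts.

    v(pts) should return obstacle values with the same shape as model(pts).
    """
    return torch.clamp(v(pts) - model(pts), min=0.0).mean()
```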

Case \(\Omega=[-1,1]\), obstacle is a cylinder

Case \(\Omega=[-1,1]^2\), obstacle is half of a unit ball

Case \(\Omega=[-1,1]^2\), obstacle is a cross (thin)

Inverse problem in Heat PDE

Governing PDE: \[ u_t = \nabla\cdot(a(x) \nabla u), \] where \(a(x)\) is the physical heat conductivity.

Spectral methods have been adopted to solve this with final time \(t=T\) and dense observations [Boumenir (2023)].

We further simplify the PDE to \[ u_t = a(x) u_{xx}, \\ u(t=0,x)=u_0(x), \] with periodic boundary conditions identifying \(x=1\) with \(x=-1\).

Finite difference (Forward-Euler in time, central difference in space) \[ \frac{u(t+\Delta t, x) - u(t,x)}{\Delta t} = a(x)\frac{u(t, x+\Delta x) - 2u(t,x) + u(t,x-\Delta x)}{\Delta x^2} \]

Naive recovery on dense observation

\[ a(x) = \frac{\Delta x^2}{\Delta t} \cdot \frac{u(t+\Delta t, x) - u(t,x)}{u(t, x+\Delta x) - 2u(t,x) + u(t,x-\Delta x)} \]

\(a(x)\) from sparse measurements

Regression loss: \[ L_\text{reg}(\theta)= \sum_{j} [ u(t_j, x_j;\theta) - u(t_j, x_j) ]^2. \]

Residual loss: \[ L_\text{res}(\theta) \propto \int_{0}^{1} \int_{-1}^{1} \left[ \frac{\partial}{\partial t}\left(\frac{u_t(t,x;\theta)}{u_{xx}(t,x;\theta)} \right) \right]^2 \mathrm{d}x \mathrm{d}t. \]

This takes advantage of the form of the PDE: \[ a(x) = \frac{u_t(t,x)}{u_{xx}(t,x)}, \] which is independent of \(t\).
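A sketch of this residual loss with automatic differentiation follows. The model interface is an assumption, and in practice the denominator \(u_{xx}\) may need regularization near zero, which is not shown here.

```python
import torch

def inverse_residual_loss(model, t, x):
    """Discretized residual penalizing time-dependence of u_t / u_xx (the estimate of a(x)).

    model maps (N, 2) tensors of (t, x) pairs to u of shape (N, 1); t, x are plain (N, 1) tensors.
    """
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    u = model(torch.cat([t, x], dim=1))
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, grad_outputs=ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, grad_outputs=ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, grad_outputs=torch.ones_like(u_x), create_graph=True)[0]
    ratio = u_t / u_xx                                       # pointwise estimate of a(x)
    d_ratio_dt = torch.autograd.grad(ratio, t, grad_outputs=torch.ones_like(ratio),
                                     create_graph=True)[0]   # should vanish if ratio depends on x only
    return (d_ratio_dt ** 2).mean()
```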

Future work

  • Advised by Dr. Shiwei Lan

    • improve recovery of \(a(\cdot)\) in the inverse heat (filtering out-of-range prediction, smoothing)
    • build inference of \(u(t,x;\theta), a(x;\theta)\) from PINN, based on residual and regression losses
    • combine the optimization method Ensemble Kalman filter with gradient based method
  • Advised by Dr. Kookjin Lee

  • Application: Phoenix area temperature prediction based on car sensors (Prof. Alex Mahalov)