Gradient descent

Gradient descent is a fundamental optimization algorithm used in machine learning and mathematical optimization.

Consider a loss function \(L(\theta;{\text{collocation points}})\) of an ANN solving a differential equation (\textit{e.g.}, a residual loss plus a regression loss), where \(\theta\in\mathbb{R}^d\) is the vector of parameters of the ANN.

We will fix the collocation points and simply write \(L:\mathbb{R}^d\to\mathbb{R}\) for the loss function.
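
As a concrete illustration, here is a minimal sketch of how such a loss could be assembled, assuming a toy problem not taken from the text: the ODE \(u'(x)=-u(x)\) with \(u(0)=1\), a quadratic ansatz \(u(x;\theta)\) standing in for the ANN, and JAX automatic differentiation for the derivative in \(x\). Once the collocation points `x_col` are fixed, the total loss depends on \(\theta\) alone.

```python
# Sketch only: toy ODE u'(x) = -u(x), u(0) = 1, with a quadratic ansatz in place of an ANN.
import jax
import jax.numpy as jnp

x_col = jnp.linspace(0.0, 1.0, 16)          # fixed collocation points

def u(theta, x):
    # toy ansatz u(x; theta) = theta[0] + theta[1] x + theta[2] x^2 (stand-in for an ANN)
    return theta[0] + theta[1] * x + theta[2] * x**2

du_dx = jax.grad(u, argnums=1)              # derivative of u with respect to x

def L_res(theta):
    # residual loss: mean squared ODE residual u'(x) + u(x) over the collocation points
    r = jax.vmap(lambda x: du_dx(theta, x) + u(theta, x))(x_col)
    return jnp.mean(r**2)

def L_reg(theta):
    # regression / boundary loss: enforce u(0) = 1
    return (u(theta, 0.0) - 1.0) ** 2

def L(theta):
    # with the collocation points fixed, L is a map from R^3 to R
    return L_res(theta) + L_reg(theta)

theta = jnp.array([1.0, -1.0, 0.5])         # an arbitrary parameter vector, for illustration
print(L(theta))
```
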

We denote the iteration (epoch) index by a subscript: \(\theta_0\) is the initial guess of the ANN parameters, whose entries are typically drawn i.i.d. from a normal \(\mathcal{N}\) or uniform distribution.

The update rule for gradient descent is \[ \theta_{n+1} = \theta_{n} - \gamma \nabla_\theta L(\theta_n), \hspace{1em} n=0,1,2,\cdots \tag{GD} \] where \(\gamma>0\) is a small tunable learning rate.
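
A minimal sketch of the (GD) iteration, assuming a stand-alone quadratic toy loss (chosen only so the example is self-contained); the values of \(\gamma\) and the step count are illustrative, not prescriptive.

```python
# Sketch only: gradient descent on a toy quadratic loss L(theta) = 0.5 * ||A theta - b||^2.
import jax
import jax.numpy as jnp

A = jnp.array([[2.0, 0.0], [0.0, 1.0]])
b = jnp.array([1.0, -1.0])

def L(theta):
    return 0.5 * jnp.sum((A @ theta - b) ** 2)

grad_L = jax.grad(L)                          # autodiff computes grad_theta L

key = jax.random.PRNGKey(0)
theta = jax.random.normal(key, (2,))          # theta_0 drawn i.i.d. from a normal distribution
gamma = 0.1                                   # small tunable learning rate

for n in range(200):
    theta = theta - gamma * grad_L(theta)     # theta_{n+1} = theta_n - gamma * grad L(theta_n)

print(theta)                                  # approaches the minimizer A^{-1} b = (0.5, -1.0)
```
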

The caveat is that when \(d\) and the number of collocation points are both large, computing the full gradient \(\nabla_\theta L\) may become infeasible under hardware (typically memory) constraints. This motivates stochastic gradient descent (SGD); one version alternates between the two loss components, \begin{align*} \theta_{n+1} &= \theta_{n} - \gamma \nabla_\theta L_\text{res}(\theta_n), && n=0,2,4,\cdots \\ \theta_{n+2} &= \theta_{n+1} - \gamma \nabla_\theta L_\text{reg}(\theta_{n+1}), \tag{SGD} \end{align*} where \(L(\theta)=L_\text{res}(\theta)+L_\text{reg}(\theta)\) is the total loss to minimize.
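
A minimal sketch of this alternating (SGD) variant, assuming two toy quadratic components \(L_\text{res}\) and \(L_\text{reg}\) in place of the actual residual and regression losses.

```python
# Sketch only: alternate gradient steps on two toy loss components L_res and L_reg.
import jax
import jax.numpy as jnp

def L_res(theta):
    return 0.5 * jnp.sum((theta - jnp.array([1.0, 0.0])) ** 2)

def L_reg(theta):
    return 0.5 * jnp.sum((theta - jnp.array([0.0, 1.0])) ** 2)

grad_res = jax.grad(L_res)
grad_reg = jax.grad(L_reg)

theta = jnp.zeros(2)                          # theta_0
gamma = 0.1                                   # small tunable learning rate

for n in range(0, 200, 2):
    theta = theta - gamma * grad_res(theta)   # even step: gradient of the residual loss only
    theta = theta - gamma * grad_reg(theta)   # next step: gradient of the regression loss only

print(theta)                                  # close to the minimizer of L_res + L_reg, i.e. (0.5, 0.5)
```

Each step needs the gradient of only one loss component, so only that component's collocation or data points are held in memory at once.
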