Neural Network Optimization Notes (1) Hessian

Posted by CL on March 8, 2020

DNN Optimization

Hessian

Hessian Matrix

A common belief is that training gets stuck because the parameters are near a critical point.

  • critical point: the gradient is zero
  • local minimum: the gradient is zero and the point is a local minimum
  • saddle point: the gradient is zero, but the point is neither a local minimum nor a local maximum

We use the Hessian matrix to judge whether a critical point is a minimum or not.

According to Taylor’s formula (assuming all of its requirements are satisfied), on a closed interval containing $x_0$ the function $f(x)$ can be written as:

\[f(x) = \frac{f(x_0)}{0!}+\frac{f'(x_0)}{1!}(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+ \dots\]

In higher dimensions, let $\theta^0$ denote the point we are analysing and $g = \nabla f(\theta^0)$ the gradient at that point:

\[f(\theta) = f(\theta^0)+(\theta-\theta^0)^Tg+\frac{1}{2}(\theta-\theta^0)^TH(\theta-\theta^0)+ \dots\]

The Hessian $H$ is the matrix of second partial derivatives evaluated at $\theta^0$, where

\[H_{ij} = \frac{\partial^2f(\theta^0) }{\partial\theta_i\partial\theta_j}\]

The Hessian is symmetric, since $H_{ij}=H_{ji}$ by the symmetry of second partial derivatives.
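
As a small sanity check of this definition, here is a minimal NumPy sketch (my own illustration, not part of the original notes) that approximates each entry $H_{ij}$ with central finite differences; the toy function is the same $f(x,y)=x^2+3y^2$ used in the example at the end.

```python
import numpy as np

def numerical_hessian(f, theta0, eps=1e-5):
    """Approximate H_ij = d^2 f / (d theta_i d theta_j) at theta0
    with central finite differences (eps is an illustrative choice)."""
    n = theta0.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_j = np.zeros(n)
            e_i[i] = eps
            e_j[j] = eps
            H[i, j] = (f(theta0 + e_i + e_j) - f(theta0 + e_i - e_j)
                       - f(theta0 - e_i + e_j) + f(theta0 - e_i - e_j)) / (4 * eps**2)
    return H

# Toy function: f(x, y) = x^2 + 3y^2 (same as the worked example at the end of these notes)
f = lambda t: t[0]**2 + 3 * t[1]**2
print(numerical_hessian(f, np.array([0.0, 0.0])))  # approximately [[2, 0], [0, 6]]
```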

Newton’s Method

In gradient descent we only care about the first derivative of the loss function, while in Newton’s method we also take the second derivative into account to find the minimum.

\[\begin{aligned} \nabla f(\theta)&\approx \nabla \big[f(\theta^0)+(\theta-\theta^0)^Tg+\frac{1}{2}(\theta-\theta^0)^TH(\theta-\theta^0)\big] \\ &= \nabla(\theta-\theta^0)^Tg+\frac{1}{2}\nabla (\theta-\theta^0)^TH(\theta-\theta^0)\\ &= g + H(\theta-\theta^0) \end{aligned}\]

Finding the point where $\nabla f(\theta) = 0$ amounts to solving $g+H(\theta-\theta^0)=0$, so that

\[\begin{aligned} H(\theta-\theta^0) &= -g \\ \theta-\theta^0 &= -H^{-1}g \end{aligned}\]

so the candidate minimum is $\theta = \theta^0 - H^{-1}g$, while in gradient descent the update is $\theta = \theta^0 - \eta g$.
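
A minimal sketch comparing the two updates on a toy quadratic (the starting point and learning rate below are arbitrary choices of mine): on a quadratic loss the Newton step lands exactly on the critical point in one move, while a gradient descent step with fixed $\eta$ only moves part of the way.

```python
import numpy as np

# Toy quadratic loss f(x, y) = x^2 + 3y^2, gradient g = [2x, 6y], Hessian H = diag(2, 6)
grad = lambda t: np.array([2 * t[0], 6 * t[1]])
H = np.array([[2.0, 0.0], [0.0, 6.0]])

theta0 = np.array([4.0, -2.0])   # arbitrary starting point (assumed)
eta = 0.1                        # arbitrary learning rate (assumed)

theta_newton = theta0 - np.linalg.solve(H, grad(theta0))  # theta^0 - H^{-1} g
theta_gd = theta0 - eta * grad(theta0)                    # theta^0 - eta * g

print(theta_newton)  # [0. 0.]   -> the exact minimum, in one step
print(theta_gd)      # [3.2 -0.8] -> still far from the minimum
```

Note that `np.linalg.solve(H, g)` is used instead of forming $H^{-1}$ explicitly; even so, for a network with millions of parameters this is far too expensive, which is the first drawback listed below.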

What’s the difference?

Newton’s method can be read as a replacement method: it replaces the fixed learning rate $\eta$ of gradient descent with $H^{-1}$, so the step is rescaled by the local curvature.

Drawbacks

  • Computing (and inverting) the Hessian is expensive for models with many parameters
  • It doesn’t guarantee a local minimum; the critical point it jumps to can be a saddle point or a maximum

Use the Hessian and Its Eigenvalues to Verify a Minimum

In linear algebra we have learned:

  • If $Av = \lambda v$:
    • $v$ is an eigenvector of $A$
    • $\lambda$ is the eigenvalue of $A$ that corresponds to $v$
  • If matrix $B$ is symmetric, then for every non-zero vector $x$:
    • $x^TBx>0$ : positive definite <=> all eigenvalues are positive
    • $x^TBx \geq 0$ : positive semi-definite <=> all eigenvalues are non-negative
    • $x^TBx<0$ : negative definite <=> all eigenvalues are negative
    • $x^TBx \leq 0$ : negative semi-definite <=> all eigenvalues are non-positive
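
To make the sign conditions in the list above concrete, here is a quick numerical illustration (the test matrices and the random sampling are my own choices): for a positive definite matrix every sampled $x^TBx$ is positive, while a matrix with mixed-sign eigenvalues produces both signs.

```python
import numpy as np

rng = np.random.default_rng(0)

# A positive definite and an indefinite symmetric matrix (chosen for illustration)
B_pos = np.array([[2.0, 0.0], [0.0, 6.0]])    # eigenvalues 2, 6  -> all positive
B_mix = np.array([[2.0, 0.0], [0.0, -6.0]])   # eigenvalues 2, -6 -> mixed signs

for name, B in [("B_pos", B_pos), ("B_mix", B_mix)]:
    xs = rng.normal(size=(1000, 2))              # random non-zero vectors x
    quad = np.einsum('ni,ij,nj->n', xs, B, xs)   # x^T B x for each sample
    print(name, np.linalg.eigvalsh(B), "all x^T B x > 0:", bool(np.all(quad > 0)))
```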

At a critical point the first derivatives are zero ($g = 0$), so:

\[f(\theta) \approx f(\theta^0)+\frac{1}{2}(\theta-\theta^0)^TH(\theta-\theta^0)\]

Let $x = \theta-\theta^0$. If $H$ is positive definite, then $x^THx > 0$, so around $\theta^0$ we have $f(\theta)>f(\theta^0)$, which means $\theta^0$ is a local minimum! The same reasoning, with $H$ negative definite, identifies a local maximum.

If $x^THx > 0$ for some directions and $x^THx < 0$ for others, then $\theta^0$ is a saddle point.

If $H$ is only positive semi-definite, we cannot confirm whether the point is a local minimum, since that depends on the higher-order derivatives of the function (for example, $f(x)=x^4$ and $f(x)=x^3$ both have a zero second derivative at $x=0$, yet only the first has a minimum there).

Let’s suppose $v$ is a unit eigenvector of $H$ and $\lambda$ is the corresponding eigenvalue, so that $v^THv = v^T(\lambda v) = \lambda\|v\|^2 = \lambda$.

So we have another interesting property of the eigenvectors around the critical point: \(f(\theta^0+v) \approx f(\theta^0) + \frac{1}{2}\lambda\)

Since $H$ is a symmetric matrix, it has eigenvectors $\{v_1,v_2,\dots,v_n\}$ that form an orthonormal basis.

Let $x = a_1v_1 + a_2v_2$ with $v_1,v_2$ orthonormal eigenvectors; then $x^THx = a_1^2\lambda_1 + a_2^2\lambda_2$ and \(f(\theta^0+x) \approx f(\theta^0) + \frac{1}{2}\big[a_1^2\lambda_1 + a_2^2\lambda_2\big]\)

More generally, if $u = a_1v_1 + a_2v_2 + \dots + a_nv_n$ is a linear combination of the orthonormal eigenbasis of the Hessian matrix $H$, we get the final conclusion that \(f(\theta^0+u) \approx f(\theta^0) + \frac{1}{2}\big[a_1^2\lambda_1 + a_2^2\lambda_2+\dots+a_n^2\lambda_n\big]\)
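
A minimal numerical check of this conclusion (the random symmetric matrix and coefficients below are arbitrary choices of mine): writing $u$ in the eigenbasis of $H$, the quadratic form $u^THu$ indeed equals $\sum_i a_i^2\lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary symmetric matrix playing the role of the Hessian H
A = rng.normal(size=(4, 4))
H = (A + A.T) / 2

lambdas, V = np.linalg.eigh(H)   # columns of V form an orthonormal eigenbasis v_1..v_n
a = rng.normal(size=4)           # arbitrary coefficients a_1..a_n
u = V @ a                        # u = a_1 v_1 + ... + a_n v_n

print(u @ H @ u)                 # u^T H u
print(np.sum(a**2 * lambdas))    # sum_i a_i^2 lambda_i -> same value
```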

With the analysis above we can directly determine the nature of a critical point.
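
Assuming the Hessian at the critical point has already been computed, this recipe could be packaged into a small helper (the function name, tolerance, and test matrix below are my own, purely illustrative):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its symmetric Hessian H."""
    lambdas = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
    if np.all(lambdas > tol):
        return "local minimum (H positive definite)"
    if np.all(lambdas < -tol):
        return "local maximum (H negative definite)"
    if np.any(lambdas > tol) and np.any(lambdas < -tol):
        return "saddle point (eigenvalues of both signs)"
    return "inconclusive (H only semi-definite; depends on higher-order terms)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -6.0]])))  # saddle point
```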

Example

\(f(x,y)=x^2+3y^2\)

It is easy to check that $(0,0)$ is a critical point (the gradient $\nabla f = (2x,\,6y)$ vanishes there), where the Hessian matrix is \(H= \begin{bmatrix} 2 & 0 \\ 0 & 6 \end{bmatrix}\)

$H$ is positive definite (its eigenvalues $2$ and $6$ are both positive), so $(0,0)$ is a local minimum.
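
The same eigenvalue test can be run numerically (a trivial sketch, included only to close the loop):

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, 6.0]])          # Hessian of f(x, y) = x^2 + 3y^2 at (0, 0)
print(np.linalg.eigvalsh(H))                    # [2. 6.] -> all positive
print(bool(np.all(np.linalg.eigvalsh(H) > 0)))  # True    -> (0, 0) is a local minimum
```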