Sheet 2

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 16.03.2023
Due date: 12:00 in the exercise on Thursday, 16.03.2023
Exercise 1 (Descent Directions of a Maximum).

Let $x_* \in \mathbb{R}^d$ be a strict local maximum of $f: \mathbb{R}^d \to \mathbb{R}$. Prove that every $d \in \mathbb{R}^d \setminus \{0\}$ is a descent direction of $f$ in $x_*$.

Solution.

Let $\epsilon > 0$ be such that $x_*$ is a strict maximum on $B_\epsilon(x_*) \setminus \{x_*\}$; the existence of such an $\epsilon$ is precisely the strict local maximum property. Any direction $d \neq 0$ is then a descent direction, because with $\bar{\alpha} = \frac{\epsilon}{\|d\|} > 0$ we have for all $\alpha \in (0, \bar{\alpha}]$

\[
f(x_* + \alpha d) < f(x_*),
\]

since $x_* + \alpha d \in B_\epsilon(x_*) \setminus \{x_*\}$ as $\|\alpha d\| \le \bar{\alpha} \|d\| = \epsilon$. ∎
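A minimal numerical sketch of this statement (my own illustration, not part of the sheet): take the strict local maximum $x_* = 0$ of $f(x) = -\|x\|^2$ and check $f(x_* + \alpha d) < f(x_*)$ for random directions $d$ and all step sizes up to $\bar{\alpha} = \epsilon / \|d\|$.

```python
import numpy as np

# Illustration only: f(x) = -||x||^2 has a strict local maximum at x_* = 0,
# so every nonzero direction d should be a descent direction there.
def f(x):
    return -np.dot(x, x)

rng = np.random.default_rng(0)
x_star = np.zeros(3)
eps = 1.0  # radius of the ball from the proof

for _ in range(5):
    d = rng.normal(size=3)                     # arbitrary nonzero direction
    alpha_bar = eps / np.linalg.norm(d)        # step-size threshold from the proof
    for alpha in np.linspace(1e-6, alpha_bar, 10):
        assert f(x_star + alpha * d) < f(x_star)
print("all sampled directions are descent directions at the maximum")
```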

Exercise 2 (Convergence to Stationary Point).

Let $f: \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function.

(i)

Let $(x_k)_{k \in \mathbb{N}}$ be defined by gradient descent

\[
x_{k+1} = x_k - \alpha_k \nabla f(x_k), \qquad x_0 \in \mathbb{R}^d,
\]

with diminishing step sizes $\alpha_k > 0$ such that $\sum_{k=1}^{\infty} \alpha_k = \infty$. Suppose that $(x_k)_{k \in \mathbb{N}}$ converges to some $x_* \in \mathbb{R}^d$. Prove that $x_*$ is a stationary point of $f$, i.e. $\nabla f(x_*) = 0$.

    Hint.

You might want to prove for large enough $i, j$

\[
\langle \nabla f(x_i), \nabla f(x_j) \rangle \ge \|\nabla f(x_*)\|^2 - 2\epsilon \|\nabla f(x_*)\| - \epsilon^2 =: p(\epsilon).
\]
    Solution.

For any $\epsilon > 0$ there exists $n \in \mathbb{N}$ such that for all $i, j \ge n$ we have, by Cauchy-Schwarz and the convergence of $\nabla f(x_i)$ to $\nabla f(x_*)$ (due to continuity of $\nabla f$),

\begin{align*}
\langle \nabla f(x_i), \nabla f(x_j) \rangle
&= \|\nabla f(x_*)\|^2 + \langle \nabla f(x_i) - \nabla f(x_*), \nabla f(x_*) \rangle + \langle \nabla f(x_i), \nabla f(x_j) - \nabla f(x_*) \rangle \\
&\ge \|\nabla f(x_*)\|^2 - \underbrace{\|\nabla f(x_i) - \nabla f(x_*)\|}_{\le \epsilon} \|\nabla f(x_*)\| - \underbrace{\|\nabla f(x_i)\|}_{\le \|\nabla f(x_*)\| + \epsilon} \underbrace{\|\nabla f(x_j) - \nabla f(x_*)\|}_{\le \epsilon} \\
&\ge \|\nabla f(x_*)\|^2 - 2\epsilon \|\nabla f(x_*)\| - \epsilon^2 =: p(\epsilon).
\end{align*}

    This results in

\begin{align*}
\|x_n - x_m\|^2 &= \Bigl\| \sum_{k=n}^{m-1} \alpha_k \nabla f(x_k) \Bigr\|^2 = \sum_{i,j=n}^{m-1} \alpha_i \alpha_j \langle \nabla f(x_i), \nabla f(x_j) \rangle \\
&\ge \Bigl( \sum_{k=n}^{m-1} \alpha_k \Bigr)^2 p(\epsilon).
\end{align*}

Taking the limit $m \to \infty$ results in

\[
\infty > \|x_n - x_*\|^2 \ge \underbrace{\Bigl( \sum_{k=n}^{\infty} \alpha_k \Bigr)^2}_{=\infty} p(\epsilon).
\]

So we necessarily need $p(\epsilon) \le 0$. But as $\epsilon > 0$ was arbitrary, we have

\[
0 \le \|\nabla f(x_*)\|^2 = \lim_{\epsilon \downarrow 0} p(\epsilon) \le 0,
\]

i.e. $\nabla f(x_*) = 0$. ∎
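As a small numerical illustration of (i) (my own sketch; the test function and the step sizes $\alpha_k = 1/(k+1)$ are assumptions, not part of the exercise), gradient descent with diminishing step sizes whose sum diverges drives the iterates to a point with vanishing gradient:

```python
import numpy as np

# Gradient descent with diminishing step sizes alpha_k = 1/(k+1),
# illustrated on the smooth test function f(x) = log(1 + ||x||^2).
def grad_f(x):
    return 2 * x / (1 + np.dot(x, x))

x = np.array([3.0, -2.0])
for k in range(20000):
    alpha_k = 1.0 / (k + 1)          # sum over k diverges (harmonic series)
    x = x - alpha_k * grad_f(x)

print("limit point (approx.):", x)
print("gradient norm there:", np.linalg.norm(grad_f(x)))  # close to 0
```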
(ii)

Assume that $f$ is also $L$-smooth. Prove that for $x_n$ generated by gradient descent with constant step size $\alpha \in (0, \frac{2}{L})$ we have

\[
\sum_{k=n}^{m} \|\nabla f(x_k)\|^2 \le \frac{f(x_n) - f(x_{m+1})}{\alpha (1 - \frac{L}{2}\alpha)} \le \frac{f(x_n) - \min_x f(x)}{\alpha (1 - \frac{L}{2}\alpha)}
\]

for any $n \le m$. Deduce for the case $\min_x f(x) > -\infty$ that we have

\[
\min_{k \le n} \|\nabla f(x_k)\|^2 \in o(1/n).
\]
    Hint.

    Consider the minimizer from Sheet 1, Ex. 6(i).

    Solution.

    By L-smoothness of f, we have

\begin{align*}
f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), \underbrace{x_{k+1} - x_k}_{= -\alpha \nabla f(x_k)} \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2 \\
&= f(x_k) - \bigl( \alpha - \tfrac{L}{2} \alpha^2 \bigr) \|\nabla f(x_k)\|^2
\end{align*}

    and therefore

\[
\sum_{k=n}^{m} \|\nabla f(x_k)\|^2 \le \sum_{k=n}^{m} \frac{f(x_k) - f(x_{k+1})}{\alpha (1 - \frac{L}{2}\alpha)} \overset{\text{telescope}}{=} \frac{f(x_n) - f(x_{m+1})}{\alpha (1 - \frac{L}{2}\alpha)} \le \frac{f(x_n) - \min_x f(x)}{\alpha (1 - \frac{L}{2}\alpha)}.
\]

Now $a_n := \min_{k \le n} \|\nabla f(x_k)\|^2$ is non-increasing and satisfies $a_k \le \|\nabla f(x_k)\|^2$, so $\sum_{k} a_k < \infty$ by the bound above together with $\min_x f(x) > -\infty$. Therefore

\[
n\, a_{2n} \le \sum_{k=n}^{2n} a_k \le \sum_{k=n}^{\infty} a_k \to 0 \qquad (n \to \infty).
\]

Thus $a_{2n} \in o(1/n)$. And we can simply bound the odd elements of the sequence by

\[
a_{2n+1} \le a_{2n} \in o(1/n). \qquad ∎
\]
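The bound in (ii) can also be checked numerically. The following sketch is my own illustration (the quadratic test function, $L = 10$ and $\alpha = 1/L$ are assumed choices): the accumulated squared gradient norms stay below $(f(x_0) - \min_x f(x)) / (\alpha(1 - \frac{L}{2}\alpha))$, and $n \cdot \min_{k \le n} \|\nabla f(x_k)\|^2$ tends to zero.

```python
import numpy as np

# Constant-step gradient descent on f(x) = 0.5 * x^T D x with D = diag(1, 10),
# so f is L-smooth with L = 10 and min f = 0 (own test example).
D = np.array([1.0, 10.0])
L = D.max()
alpha = 1.0 / L                       # any alpha in (0, 2/L) works

x = np.array([5.0, 5.0])
f0 = 0.5 * np.sum(D * x**2)           # f(x_0)
bound = f0 / (alpha * (1 - L / 2 * alpha))

grad_sq_sum = 0.0
min_grad_sq = np.inf
for n in range(1, 2001):
    g = D * x                         # gradient of f at x_k
    grad_sq_sum += g @ g
    min_grad_sq = min(min_grad_sq, g @ g)
    x = x - alpha * g
    assert grad_sq_sum <= bound + 1e-9   # telescoping bound from the solution

print("n * min_k ||grad f(x_k)||^2 at n = 2000:", 2000 * min_grad_sq)  # ~ 0
```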
Exercise 3 (Optimizing Quadratic Functions).

In this exercise we consider functions of the type

\[
f(x) = x^T A x + b^T x + c,
\]

where $x \in \mathbb{R}^d$, $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$, $c \in \mathbb{R}$.

(a)

Let $H := A^T + A$ be invertible. Prove that $f$ can be written in the forms

\begin{align*}
f(x) &= (x - x_*)^T A (x - x_*) + \tilde{c} \tag{1} \\
&= \tfrac{1}{2} (x - x_*)^T \underbrace{(A^T + A)}_{=:H} (x - x_*) + \tilde{c} \tag{2}
\end{align*}

for some $x_* \in \mathbb{R}^d$ and $\tilde{c} \in \mathbb{R}$. Argue that $H$ is always symmetric. Under which circumstances is $x_*$ a minimum?

      Solution.

We want, for some $x_*$,

\[
f(x) \overset{!}{=} (x - x_*)^T A (x - x_*) + \tilde{c}
= x^T A x \underbrace{- x^T A x_* - x_*^T A x}_{= -x^T (A + A^T) x_* \overset{!}{=} b^T x} + \underbrace{x_*^T A x_* + \tilde{c}}_{\overset{!}{=} c}.
\]

      So we simply select

\[
x_* := -(A + A^T)^{-1} b \qquad \text{and} \qquad \tilde{c} := c - x_*^T A x_*.
\]

      This proves our first representation (1). For (2) we simply note

\[
y^T A y = \langle y, A y \rangle \overset{\text{symm.}}{=} \langle A y, y \rangle = y^T A^T y.
\]

Applying this to $y = x - x_*$ in (1) gives $y^T A y = \tfrac{1}{2}\bigl(y^T A y + y^T A^T y\bigr) = \tfrac{1}{2} y^T (A^T + A) y$, which is exactly (2).

Symmetry of $H$ follows directly from its definition, as $H_{ij} = A_{ji} + A_{ij} = H_{ji}$.

Now $x_*$ is a minimum iff $H$ is positive definite. If $H$ is positive definite, then $x_*$ is a minimum by $\nabla^2 f(x) = H$ and the lecture. If $H$ is not positive definite, we show that for every $m \le \tilde{c} = f(x_*)$ there exists some $x$ with $f(x) = m$; in particular $f$ takes values strictly below $f(x_*)$, so $x_*$ is not a minimum. Since $H$ is not positive definite (and invertible, hence without zero eigenvalues), there exists some $y$ such that

\[
y^T H y =: -\epsilon < 0.
\]

Define $x = x_* + y \sqrt{\frac{2(\tilde{c} - m)}{\epsilon}}$. Then

\[
f(x) \overset{(2)}{=} \frac{\tilde{c} - m}{\epsilon}\, \underbrace{y^T H y}_{= -\epsilon} + \tilde{c} = m. \qquad ∎
\]
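A quick numerical sanity check of part (a) (my own sketch; the random $A$, $b$, $c$ are assumptions): with $x_* = -(A + A^T)^{-1} b$ and $\tilde{c} = c - x_*^T A x_*$, both representations (1) and (2) agree with $f$ at random points.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
c = 1.5
H = A.T + A                                   # symmetric by construction

x_star = -np.linalg.solve(A + A.T, b)         # x_* = -(A + A^T)^{-1} b
c_tilde = c - x_star @ A @ x_star             # c~ = c - x_*^T A x_*

def f(x):
    return x @ A @ x + b @ x + c

for _ in range(5):
    x = rng.normal(size=d)
    y = x - x_star
    assert np.isclose(f(x), y @ A @ y + c_tilde)        # representation (1)
    assert np.isclose(f(x), 0.5 * y @ H @ y + c_tilde)  # representation (2)
print("representations (1) and (2) verified at random points")
```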
(b)

Argue that the Newton method (with step size $\alpha_n = 1$) applied to $f$ would jump to $x_*$ in one step and then stop moving.

      Solution.

      Taking the derivative of (2) we get

\[
\nabla f(x) = H (x - x_*). \tag{3}
\]

So with $\nabla^2 f(x) = H$ and

\[
x_* = x - H^{-1} H (x - x_*) = x - [\nabla^2 f(x)]^{-1} \nabla f(x),
\]

we have that the Newton method finds $x_*$ in one step. By (3) we also get $\nabla f(x_*) = 0$, which stops the Newton method afterwards. ∎
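A minimal sketch of this one-step behaviour (my own illustration with a random quadratic): a single Newton step from an arbitrary point lands exactly on $x_*$, where the gradient vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
H = A.T + A                                   # Hessian of f
x_star = -np.linalg.solve(H, b)               # stationary point from part (a)

def grad_f(x):
    return H @ (x - x_star)                   # gradient, cf. representation (3)

x0 = rng.normal(size=d)                       # arbitrary starting point
x1 = x0 - np.linalg.solve(H, grad_f(x0))      # one Newton step with alpha = 1

print(np.allclose(x1, x_star))                # True: x_* reached in one step
print(np.allclose(grad_f(x1), 0.0))           # True: no further movement
```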

(c)

Let $V = (v_1, \dots, v_d)$ be an orthonormal basis such that

\[
H = V \operatorname{diag}[\lambda_1, \dots, \lambda_d] V^T
\]

with $0 < \lambda_1 \le \dots \le \lambda_d$ and write

\[
y^{(i)} := \langle y, v_i \rangle.
\]

Express $(x_n - x_*)^{(i)}$ in terms of $(x_0 - x_*)^{(i)}$, where $x_n$ is given by the gradient descent recursion

\[
x_{n+1} = x_n - h \nabla f(x_n).
\]

For which step sizes $h$ do all the components $(x_n - x_*)^{(i)}$ converge to zero? Which component has the slowest convergence speed? Find the optimal learning rate $h^*$ and deduce for this learning rate

\[
\|x_n - x_*\| \le \Bigl( 1 - \frac{2}{1 + \kappa} \Bigr)^n \|x_0 - x_*\|
\]

with the condition number $\kappa = \frac{\lambda_d}{\lambda_1}$.

      Solution.

Using the representation (3) of the gradient again and subtracting $x_*$ from our recursion, we get

\[
x_{n+1} - x_* = x_n - x_* - h H (x_n - x_*) = [\mathbb{I} - h H](x_n - x_*)
\]

      Therefore