Sheet 4

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 27.04.2023
Due: 12:00 in the exercise on Thursday, 27.04.2023

While there are \totalpoints in total, you may consider all points above the standard 24 to be bonus points.

Exercise 1 (Lower Bounds).

In this exercise, we will bound the convergence rates of algorithms which pick their iterates $x_{k+1}$ from

\operatorname{span}[\nabla f(x_0), \dots, \nabla f(x_k)] + x_0.

We consider the function

f_d(x) = \frac{1}{2}\bigl(x^{(1)} - 1\bigr)^2 + \frac{1}{2}\sum_{i=1}^{d-1}\bigl(x^{(i)} - x^{(i+1)}\bigr)^2
  1. (i)

    To understand our function fd better, we want to view it as a potential on a graph. For this consider the undirected graph G=(V,E) with vertices

V = \{1, \dots, d\}

    and edges

E = \{(i, i+1) : 1 \le i \le d-1\}.

    Draw a picture of this graph.

    Solution.

    The graph is simply a chain: vertex $i$ is connected to vertex $i+1$ for $i = 1, \dots, d-1$. ∎

  2. (ii)

    We now interpret $x^{(i)}$ as a quantity (e.g. of heat) at vertex $i$ of our graph $G$. Our potential $f_d$ decreases if the quantities at connected vertices $i$ and $i+1$ are of similar size, i.e. if $(x^{(i)} - x^{(i+1)})^2$ is small. Additionally there is a pull for $x^{(1)}$ to be equal to $1$. Use this intuition to find the minimizer $x^*$ of $f_d$.

    Solution.

    The minimizer is $x^* = (1, \dots, 1)^T \in \mathbb{R}^d$, since $f_d(x^*) = 0$ and $f_d(x) \ge 0$. ∎

  3. (iii)

    The matrix $A^G \in \mathbb{R}^{d \times d}$ with

    A^G_{i,j} = \begin{cases} \text{degree of vertex } i & i = j \\ -1 & (i,j) \in E \text{ or } (j,i) \in E \\ 0 & \text{else} \end{cases}

    is called the “Graph-Laplacian” of $G$. The degree of vertex $i$ is the number of edges containing $i$. Calculate $A^G$ for $G$ and prove that

    \nabla f_d(x) = A^G x + (x^{(1)} - 1)e_1 = (A^G + e_1 e_1^T)x - e_1.
    Solution.

    The Graph-Laplacian of G is given by

    A^G = \begin{pmatrix}
    1 & -1 & & & \\
    -1 & 2 & -1 & & \\
    & \ddots & \ddots & \ddots & \\
    & & -1 & 2 & -1 \\
    & & & -1 & 1
    \end{pmatrix}

    Let $i \neq 1, d$; then

    \frac{\partial f_d}{\partial x^{(i)}} = [x^{(i)} - x^{(i+1)}] - [x^{(i-1)} - x^{(i)}] = 2x^{(i)} - x^{(i-1)} - x^{(i+1)} = (A^G x)_i.

    Similarly, looking at the cases $i = 1$ and $i = d$ individually immediately reveals

    \nabla f_d(x) = A^G x + (x^{(1)} - 1)e_1.
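    As a quick sanity check (a minimal numpy sketch of our own, not part of the original solution; the helper names, the dimension and the finite-difference tolerance are arbitrary choices), one can build $A^G$ for the chain and compare the formula $\nabla f_d(x) = (A^G + e_1e_1^T)x - e_1$ against a central finite-difference gradient:

```python
import numpy as np

def graph_laplacian(d):
    """Graph-Laplacian of the chain graph 1 - 2 - ... - d."""
    A = np.zeros((d, d))
    for i in range(d - 1):          # edge (i, i+1)
        A[i, i] += 1
        A[i + 1, i + 1] += 1
        A[i, i + 1] -= 1
        A[i + 1, i] -= 1
    return A

def f_d(x):
    return 0.5 * (x[0] - 1) ** 2 + 0.5 * np.sum((x[:-1] - x[1:]) ** 2)

def grad_f_d(x):
    d = len(x)
    e1 = np.eye(d)[0]
    return (graph_laplacian(d) + np.outer(e1, e1)) @ x - e1

# compare against a central finite-difference gradient at a random point
d, eps = 6, 1e-6
x = np.random.default_rng(0).standard_normal(d)
fd_grad = np.array([(f_d(x + eps * np.eye(d)[i]) - f_d(x - eps * np.eye(d)[i])) / (2 * eps)
                    for i in range(d)])
assert np.allclose(grad_f_d(x), fd_grad, atol=1e-5)
```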
  4. (iv)

    Prove that the Hessian $H = \nabla^2 f_d(x)$ is constant and positive definite to show that $f_d$ is convex. Prove that the operator norm of $H$ is smaller than $4$. Argue that

    g_d(x) := \frac{L}{4} f_d(x)

    is therefore L-smooth.

    Solution.

    Taking the derivative of the gradient we calculated previously yields

    H = \nabla^2 f_d(x) = A^G + e_1 e_1^T.

    To show positive definiteness, let $y$ be arbitrary. Then

    y^T H y = (e_1^T y)^2 + y^T A^G y = (y^{(1)})^2 + \sum_{i=1}^{d-1}(y^{(i)} - y^{(i+1)})^2 \ge 0,

    with equality only if $y^{(1)} = 0$ and $y^{(i)} = y^{(i+1)}$ for all $i$, i.e. only if $y = 0$.

    To find the largest eigenvalue of $H$ we want to calculate the operator norm. For this we use $(a-b)^2 \le 2(a^2 + b^2)$ to get

    \langle y, Hy\rangle \le (y^{(1)})^2 + 2\sum_{i=1}^{d-1}\bigl[(y^{(i)})^2 + (y^{(i+1)})^2\bigr] \le 4\sum_{i=1}^{d}(y^{(i)})^2 = 4\|y\|^2.

    Thus we get

    \|H\| = \sup_{y : \|y\| = 1}\langle y, Hy\rangle \le 4.

    Since the operator norm coincides with the largest absolute eigenvalue for symmetric matrices, this proves our claim. Finally, $L$-smoothness of $g_d$ follows from

    \|\nabla g_d(x) - \nabla g_d(y)\| = \frac{L}{4}\|\nabla f_d(x) - \nabla f_d(y)\| = \frac{L}{4}\|H(x-y)\| \le \frac{L}{4}\|H\|\,\|x-y\| \le L\|x-y\|.
  5. (v)

    Assume $x_0 = 0$ and that $(x_n)_n$ is chosen with the restriction

    x_{n+1} \in \mathcal{K}_n := \operatorname{span}[\nabla g_d(x_0), \dots, \nabla g_d(x_n)].

    To make notation easier, we are going to identify $\mathbb{R}^n$ with an isomorphic subset of sequences,

    \mathbb{R}^n := \{x \in \ell^2 : x^{(i)} = 0 \ \forall i > n\};

    then $\mathbb{R}^n$ is a subset of $\mathbb{R}^d$ for $n \le d$. Prove inductively that

    \mathcal{K}_n \subseteq \mathbb{R}^{n+1} \subseteq \mathbb{R}^d.
    Solution.

    We have the induction start $n = 0$ by

    \nabla g_d(x_0) = -\frac{L}{4}e_1 \in \mathbb{R}^1.

    Now assume

    \mathcal{K}_{n-1} \subseteq \mathbb{R}^n,

    then by our selection process $x_n \in \mathbb{R}^n$. But then

    \frac{4}{L}\nabla g_d(x_n) = \underbrace{A^G x_n}_{\in \mathbb{R}^{n+1}} + \underbrace{(x_n^{(1)} - 1)e_1}_{\in \mathbb{R}^1} \in \mathbb{R}^{n+1}.

    We therefore have $\mathcal{K}_n = \operatorname{span}[\mathcal{K}_{n-1}, \nabla g_d(x_n)] \subseteq \mathbb{R}^{n+1}$.

    Notice how the low connectivity of the graph $G$ limits the spread of our quantity $x_n$: each gradient evaluation can only populate one additional coordinate. A higher connectivity would allow information to travel much more quickly. ∎
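    This spreading behaviour is easy to see numerically (a small illustrative sketch of our own; the dimension, step size and threshold are arbitrary): starting from $x_0 = 0$, each gradient step of plain gradient descent on $f_d$ populates at most one new coordinate.

```python
import numpy as np

d = 10
A = np.diag(2 * np.ones(d)) + np.diag(-np.ones(d - 1), 1) + np.diag(-np.ones(d - 1), -1)
A[0, 0] = A[-1, -1] = 1                            # chain graph Laplacian
e1 = np.eye(d)[0]
grad = lambda x: (A + np.outer(e1, e1)) @ x - e1   # gradient of f_d

x = np.zeros(d)
for k in range(5):
    x = x - 0.25 * grad(x)                         # step size 1/L with L = 4
    print(k + 1, np.nonzero(np.abs(x) > 1e-12)[0]) # support grows by one index per step
```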

  6. (vi)

    We now want to bound the convergence speed of $x_n$ to $x^*$. For this we select $d = 2n + 1$.

    Note: We may choose a larger dimension $d$ by defining $f_{2n+1}$ on the subset $\mathbb{R}^{2n+1} \subseteq \mathbb{R}^d$. The important requirement is therefore $2n+1 \le d$. But without loss of generality we assume equality.

    Use the knowledge we have collected so far to argue

    \|x^* - x_n\|^2 \ge d - n \ge \frac{1}{2}\|x^* - x_0\|^2.
    Solution.

    Since $x_n \in \mathbb{R}^n$ we know that

    \begin{aligned}
    \|x^* - x_n\|^2 &= \sum_{i=1}^{d}\bigl(x^{*(i)} - x_n^{(i)}\bigr)^2 \\
    &\ge \sum_{i=n+1}^{d}\bigl(x^{*(i)}\bigr)^2 \\
    &= d - n \overset{d = 2n+1}{=} n + 1 = \frac{n+1}{2n+1}d \ge \frac{1}{2}d = \frac{1}{2}\sum_{i=1}^{d}1^2 = \frac{1}{2}\|x^* - x_0\|^2.
    \end{aligned}
  7. (vii)

    To prevent the convergence of the loss $g_d(x_n)$ to $g_d(x^*)$ we need a more sophisticated argument. For this consider

    \tilde{g}_n(x) := \frac{L}{4}\Bigl[f_n(x) + \frac{1}{2}\bigl(x^{(n)} - 0\bigr)^2\Bigr].

    Argue that on $\mathbb{R}^n \subseteq \mathbb{R}^d$ the functions $\tilde{g}_n$ and $g_d$ are identical. Use this observation to prove

    g_d(x_n) - \inf_x g_d(x) \ge \inf_x \tilde{g}_n(x).
    Solution.

    Let $x \in \mathbb{R}^n$. Then using $x^{(n+1)} = 0$ we have

    \tilde{g}_n(x) = \frac{L}{8}\Bigl[(x^{(1)} - 1)^2 + \sum_{i=1}^{n}(x^{(i)} - x^{(i+1)})^2\Bigr] = g_d(x),

    using $x^{(i)} = 0$ for all $i > n$ for the second equality sign. Since $x_n \in \mathbb{R}^n$ we can therefore replace $g_d$ with $\tilde{g}_n$ at will to get

    g_d(x_n) - \underbrace{\inf_x g_d(x)}_{=0} = \tilde{g}_n(x_n) \ge \inf_x \tilde{g}_n(x).
  8. (viii)

    Our goal is now to calculate $\inf_x \tilde{g}_n(x)$. Prove convexity of $\tilde{g}_n$ and prove that

    \hat{x}_n^{(i)} = \begin{cases} 1 - \frac{i}{n+1} & i \le n+1 \\ 0 & i \ge n+1 \end{cases}

    is its minimum. Then plug our solution into $\tilde{g}_n$ (or $g_d$, since $\hat{x}_n$ is in the subset $\mathbb{R}^n$ after all), to obtain the lower bound

    g_d(x_n) - \inf_x g_d(x) \ge \frac{L\|x_0 - x^*\|^2}{8(n+1)d} \ge \frac{L\|x_0 - x^*\|^2}{16(n+1)^2}.
    Solution.

    We have

    \nabla\tilde{g}_n(x) = \frac{L}{4}\bigl[A^{G_n}x + (x^{(1)} - 1)e_1 + x^{(n)}e_n\bigr] = \frac{L}{4}\bigl[(A^{G_n} + e_1e_1^T + e_ne_n^T)x - e_1\bigr],

    where $A^{G_n}$ is the Graph-Laplacian for $f_n$. Then the Hessian

    \nabla^2\tilde{g}_n(x) = \frac{L}{4}(A^{G_n} + e_1e_1^T + e_ne_n^T)

    is obviously positive definite, as we could apply the same arguments as for $f_n$. So $\tilde{g}_n$ is convex. We now plug $\hat{x}_n$ into $\nabla\tilde{g}_n$ to verify the first order condition, proving it is a minimum:

    \begin{aligned}
    \frac{4}{L}\frac{\partial\tilde{g}_n}{\partial x^{(i)}}(\hat{x}_n) &= (A^{G_n}\hat{x}_n)_i + \underbrace{(\hat{x}_n^{(1)} - 1)}_{=-\frac{1}{n+1}}\delta_{i1} + \underbrace{\hat{x}_n^{(n)}}_{=\frac{1}{n+1}}\delta_{in} \\
    &= \underbrace{[\hat{x}_n^{(i)} - \hat{x}_n^{(i+1)}]\,\mathbb{1}_{i\neq n}}_{=\frac{1}{n+1}\mathbb{1}_{i\neq n}} - \underbrace{[\hat{x}_n^{(i-1)} - \hat{x}_n^{(i)}]\,\mathbb{1}_{i\neq 1}}_{=\frac{1}{n+1}\mathbb{1}_{i\neq 1}} - \frac{1}{n+1}\delta_{i1} + \frac{1}{n+1}\delta_{in} \\
    &= 0.
    \end{aligned}

    We now know that

    \begin{aligned}
    \inf_x \tilde{g}_n(x) = \tilde{g}_n(\hat{x}_n) &= \frac{L}{8}\Bigl[\Bigl(-\frac{1}{n+1}\Bigr)^2 + \Bigl(1 - \frac{n}{n+1}\Bigr)^2 + \sum_{i=1}^{n-1}\Bigl(\frac{i+1}{n+1} - \frac{i}{n+1}\Bigr)^2\Bigr] \\
    &= \frac{L}{8}\sum_{i=0}^{n}\Bigl(\frac{1}{n+1}\Bigr)^2 = \frac{L}{8(n+1)} \overset{d = \|x_0 - x^*\|^2}{=} \frac{L\|x_0 - x^*\|^2}{8(n+1)d} \\
    &\ge \frac{L\|x_0 - x^*\|^2}{16(n+1)^2}
    \end{aligned}

    using $d = 2n+1$ again. ∎
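    To see the lower bound in action, here is a small, hedged numpy experiment (our own addition, not part of the sheet; the dimension, step size and iteration count are arbitrary choices): run plain gradient descent on $g_d$ with $x_0 = 0$ and compare the loss after $n$ steps to $L\|x_0 - x^*\|^2 / (16(n+1)^2)$.

```python
import numpy as np

L = 4.0                                    # with L = 4 we simply have g_d = (L/4) f_d = f_d
d = 101                                    # d = 2n + 1
n = (d - 1) // 2

A = np.diag(2 * np.ones(d)) + np.diag(-np.ones(d - 1), 1) + np.diag(-np.ones(d - 1), -1)
A[0, 0] = A[-1, -1] = 1                    # Graph-Laplacian of the chain
e1 = np.eye(d)[0]

def g(x):                                  # g_d(x) = (L/4) f_d(x), with inf g_d = 0
    return L / 4 * (0.5 * (x[0] - 1) ** 2 + 0.5 * np.sum((x[:-1] - x[1:]) ** 2))

def grad_g(x):
    return L / 4 * ((A + np.outer(e1, e1)) @ x - e1)

x_star = np.ones(d)
x = np.zeros(d)                            # x_0 = 0
for _ in range(n):                         # gradient descent stays inside the Krylov spans
    x = x - (1 / L) * grad_g(x)

bound = L * np.linalg.norm(x_star) ** 2 / (16 * (n + 1) ** 2)
print(f"g_d(x_n) = {g(x):.5f} >= theoretical lower bound {bound:.5f}")
```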

  9. (ix)

    Argue that we only needed

    x_n = x_0 + \sum_{k=0}^{n-1} A_k \nabla f(x_k)

    with upper triangular matrices Ak to make these bounds work. Since adaptive methods (like Adam) use diagonal matrices Ak, they are therefore covered by these bounds.

    Solution.

    We only needed $\mathcal{K}_n \subseteq \mathbb{R}^{n+1}$, which we proved by induction using only this fact about $\mathcal{K}_{n-1}$. Since multiplication by upper triangular matrices preserves the subspaces $\mathbb{R}^{n+1}$, we may as well allow them. ∎

  10. (x)

    Bask in our glory! For we have proven that …? Summarize our results into a theorem.

    Solution.
    Theorem (Nesterov).

    Assume there exist upper triangular matrices $A_{k,n}$ such that the sequence $(x_n)_n$ in $\mathbb{R}^d$ is selected by the rule

    x_n = x_0 + \sum_{k=0}^{n-1} A_{k,n}\nabla f(x_k)

    in order to minimize a convex, $L$-smooth function $f$. Then for every $n \le \frac{d-1}{2}$ there exists a convex, $L$-smooth function $f$ such that

    \begin{aligned}
    \|x_n - x^*\|^2 &\ge \frac{1}{2}\|x_0 - x^*\|^2 \\
    f(x_n) - \inf_x f(x) &\ge \frac{L\|x_0 - x^*\|^2}{16(n+1)^2}
    \end{aligned}

    where $x^*$ is a minimizer, i.e. $f(x^*) = \inf_x f(x)$.

  11. (xi)

    (Bonus) If you wish, you may want to try and repeat those steps for

    G_d(x) = \frac{L - \mu}{L}g_d(x) + \frac{\mu}{2}\|x\|^2

    to prove an equivalent result for $\mu$-strongly convex functions. Unfortunately, finding $x^*$ is much more difficult in this case. Letting $d \to \infty$ makes this problem tractable again, with solution

    x^{*(i)} = \Bigl(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\Bigr)^i, \qquad \kappa := \frac{L}{\mu}.
Exercise 2 (Conjugate Gradient Descent).

Consider a quadratic function

f(x) = \frac{1}{2}(x - x^*)^T H (x - x^*)

for some symmetric and positive definite $H$ and consider the Hilbert space $\mathcal{H} = (\mathbb{R}^d, \langle\cdot,\cdot\rangle_H)$ with

\langle x, y\rangle_H = \langle x, Hy\rangle.
  1. (i)

    Prove that $\langle\cdot,\cdot\rangle_H$ is a well-defined scalar product. Convince yourself that

    f(x) = \frac{1}{2}\|x - x^*\|_H^2.
    Solution.

    Bilinearity and symmetry are trivial; the positive definiteness follows from the corresponding property of $H$. We have

    f(x) = \frac{1}{2}\langle x - x^*, H(x - x^*)\rangle = \frac{1}{2}\langle x - x^*, x - x^*\rangle_H = \frac{1}{2}\|x - x^*\|_H^2.
  2. (ii)

    Determine the derivative $\nabla_H f(x)$ of $f$ in $\mathcal{H}$.

    Hint.

    Recall that $\nabla_H f(x)$ is the unique vector satisfying

    0 = \lim_{v \to 0}\frac{|f(x+v) - f(x) - \langle\nabla_H f(x), v\rangle_H|}{\|v\|_H}.
    Solution.

    We need

    \begin{aligned}
    0 &= \lim_{v\to 0}\frac{|f(x+v) - f(x) - \langle\nabla_H f(x), v\rangle_H|}{\|v\|_H} \\
    &= \lim_{v\to 0}\frac{|f(x+v) - f(x) - \langle H\nabla_H f(x), v\rangle|}{\|v\|}\cdot\underbrace{\frac{\|v\|}{\|v\|_H}}_{\ge c}.
    \end{aligned}

    We can bound the fraction of norms from below by a constant $c > 0$ due to the equivalence of all norms on $\mathbb{R}^d$. This lower bound on the second fraction forces the first fraction to converge to zero. But this implies that

    \nabla f(x) = H\nabla_H f(x)

    by the definition (and uniqueness) of $\nabla f(x)$. Thus the gradient we are looking for is

    \nabla_H f(x) = H^{-1}\nabla f(x).
  3. (iii)

    Since gradient descent in the space $\mathcal{H}$ is therefore computationally the Newton method, we want to find a different method of optimization. Consider an arbitrary set of conjugate ($H$-orthogonal) directions $(v_1, \dots, v_d)$, i.e. $\langle v_i, v_j\rangle_H = \delta_{ij}$, and for some starting point $x_0 \in \mathbb{R}^d$ the following descent algorithm:

    x_{k+1} = x_k - \alpha_k v_{k+1} \quad\text{with}\quad \alpha_k := \operatorname*{argmin}_\alpha f(x_k - \alpha v_{k+1}).

    Optimizing over $\alpha$ in this manner is known as “line-search”. Using $y^{(i)} := \langle y, v_i\rangle_H$ prove that

    (x_k - x^*) = \sum_{i=k+1}^{d}(x_0 - x^*)^{(i)}v_i = \operatorname*{argmin}_x\{f(x) : x \in x_0 + \operatorname{span}[v_1, \dots, v_k]\} - x^*.

    Deduce that conjugate descent (CD) converges in d steps.

    Solution.

    We proceed by induction. The induction start with $k = 0$ is obvious. Let us now consider $x_{k+1}$. By its definition we have

    \begin{aligned}
    2f(x_{k+1}) &= \min_\alpha 2f(x_k - \alpha v_{k+1}) \\
    &= \min_\alpha \|x_k - \alpha v_{k+1} - x^*\|_H^2 \\
    &= \min_\alpha \Bigl\|\sum_{i=1}^{d}(x_k - x^*)^{(i)}v_i - \alpha v_{k+1}\Bigr\|_H^2 \\
    &= \min_\alpha \bigl[(x_k - x^*)^{(k+1)} - \alpha\bigr]^2\|v_{k+1}\|_H^2 + \sum_{i=k+2}^{d}\bigl[(x_k - x^*)^{(i)}\bigr]^2\|v_i\|_H^2 \\
    &= \sum_{i=k+2}^{d}\bigl[(x_k - x^*)^{(i)}\bigr]^2.
    \end{aligned}

    The minimizer is therefore $\alpha_k = (x_k - x^*)^{(k+1)}$. This removes the $v_{k+1}$ component, leaving us with the components $v_{k+2}$ and up. Note that $(x_k - x^*)^{(i)} = (x_0 - x^*)^{(i)}$ for all $i \ge k+1$ by induction. Similarly we can see that this is a minimum in the span of $v_1, \dots, v_{k+1}$, as we have removed those components completely and

    2f(x) = \|x - x^*\|_H^2 = \sum_{i=1}^{d}\bigl[(x - x^*)^{(i)}\bigr]^2.

    Since we cannot touch the other components due to $H$-orthogonality, this is the best we can do. ∎
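    A small numerical illustration of this $d$-step convergence (a hedged sketch of our own, not prescribed by the exercise; the random $H$ and the $H$-orthonormalization of the standard basis via Gram-Schmidt are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)              # symmetric positive definite
x_star = rng.standard_normal(d)

def f(x):
    return 0.5 * (x - x_star) @ H @ (x - x_star)

# build an H-orthonormal basis v_1, ..., v_d by Gram-Schmidt in <.,.>_H
V = []
for w in np.eye(d):
    for v in V:
        w = w - (w @ H @ v) * v
    V.append(w / np.sqrt(w @ H @ w))

x = np.zeros(d)
for v in V:                              # exact line search along each direction
    alpha = (x - x_star) @ H @ v         # minimizer of f(x - alpha v), since ||v||_H = 1
    x = x - alpha * v

assert np.isclose(f(x), 0.0, atol=1e-8)  # converged after exactly d steps
```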

  4. (iv)

    If we had $v_i = \nabla f(x_{i-1})$, then this algorithm would be optimal in the set of algorithms we considered in the previous exercise. Unfortunately the gradients $\nabla f(x_{i-1})$ are generally not conjugate. So while we may select an arbitrary set of conjugate $v_i$, we cannot select the gradients directly.

    Instead we are going to do the next best thing and inductively select $v_{k+1}$ such that

    \mathcal{K}_k := \operatorname{span}[\nabla f(x_0), \dots, \nabla f(x_k)] = \operatorname{span}[v_1, \dots, v_{k+1}],

    using the Gram-Schmidt procedure to make $v_{k+1}$ conjugate to $v_1, \dots, v_k$. Since Gram-Schmidt is still computationally too expensive for our tastes, please inductively prove

    \mathcal{K}_k = \operatorname{span}[H^1(x_0 - x^*), \dots, H^{k+1}(x_0 - x^*)],

    assuming $\mathcal{K}_k$ is $(k+1)$-dimensional. I.e. $\mathcal{K}_k$ is an “$H$-Krylov subspace”.

    Solution.

    The induction start $k = 0$ follows directly from

    \nabla f(x_0) = H(x_0 - x^*)

    and the definition of $\mathcal{K}_0$. Assume we have the claim for $k-1$, then

    \nabla f(x_k) = H(x_k - x^*) = H(x_{k-1} - \alpha_{k-1}v_k - x^*) = \underbrace{H(x_{k-1} - x^*)}_{=\nabla f(x_{k-1}) \in \mathcal{K}_{k-1}} - \alpha_{k-1}\underbrace{Hv_k}_{\in H\mathcal{K}_{k-1}}.

    As $\mathcal{K}_{k-1} = \operatorname{span}[H^1(x_0 - x^*), \dots, H^k(x_0 - x^*)]$ by the induction hypothesis, we therefore have

    \nabla f(x_k) \in \operatorname{span}[H^1(x_0 - x^*), \dots, H^{k+1}(x_0 - x^*)].

    Since $\nabla f(x_0), \dots, \nabla f(x_{k-1}) \in \mathcal{K}_{k-1}$, they are by the induction hypothesis also in this span, so

    \mathcal{K}_k = \operatorname{span}[\nabla f(x_0), \dots, \nabla f(x_k)] \subseteq \operatorname{span}[H^1(x_0 - x^*), \dots, H^{k+1}(x_0 - x^*)].

    Since the space on the left is $(k+1)$-dimensional, we have equality. ∎

  5. (v)

    Argue that $\nabla f(x_{k+1})$ is orthogonal to every vector in $\mathcal{K}_k$ and inductively deduce either

    \nabla f(x_{k+1}) = 0,

    which implies $x_{k+1} = x^*$, or that $\mathcal{K}_{k+1}$ has full rank. Deduce from the $H$-Krylov-subspace property that $\nabla f(x_{k+1})$ is already $H$-orthogonal to $\mathcal{K}_{k-1}$.

    Hint.

    x_{k+1} = \operatorname*{argmin}_x\{f(x) : x \in \mathcal{K}_k + x_0\}.

    Solution.

    By the selection process of $x_{k+1}$, we have

    x_{k+1} = \operatorname*{argmin}_x\{f(x) : x \in \mathcal{K}_k + x_0\}.

    Assume $\nabla f(x_{k+1})$ were not orthogonal to $\mathcal{K}_k$. Then there would exist $v \in \mathcal{K}_k$ (after possibly flipping its sign) such that

    \langle\nabla f(x_{k+1}), v\rangle > 0.

    By the Taylor approximation we therefore have

    f(x_{k+1} - \delta v) = f(x_{k+1}) - \delta\underbrace{\langle\nabla f(x_{k+1}), v\rangle}_{>0} + \mathcal{O}(\delta^2),

    so there exists a small $\delta > 0$ such that $f(x_{k+1} - \delta v) < f(x_{k+1})$. But this is a contradiction, since $x_{k+1}$ was optimal.

    $\nabla f(x_{k+1})$ is therefore orthogonal to $\mathcal{K}_k$. So if it is not zero, $\mathcal{K}_{k+1}$ has (as the span of both) full rank. $\nabla f(x_{k+1})$ being orthogonal to $\mathcal{K}_k$ also implies it is orthogonal to $H\mathcal{K}_{k-1}$, since that is a subspace of $\mathcal{K}_k$ by the Krylov property. But this implies $\nabla f(x_{k+1})$ is $H$-orthogonal to $\mathcal{K}_{k-1}$. ∎

  6. (vi)

    Collect the ideas we have gathered to prove that the recursively defined

    v_{k+1} = \nabla f(x_k) - \frac{\langle\nabla f(x_k), v_k\rangle_H}{\|v_k\|_H^2}v_k

    are $H$-conjugate and have the same span as the gradients up to $\nabla f(x_k)$.

    Solution.

    These $v_k$ are the same $v_k$ we would obtain using Gram-Schmidt on the gradients. In fact this is Gram-Schmidt together with the fact that $\nabla f(x_k)$ is already $H$-orthogonal to $v_1, \dots, v_{k-1} \in \mathcal{K}_{k-2}$. So only the last summand remains. ∎

  7. (vii)

    To make our procedure truly computable, we want to show

    \frac{\langle\nabla f(x_k), v_k\rangle_H}{\|v_k\|_H^2} = -\frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}.

    Hint.

    Proving

    \nabla f(x_k) = \nabla f(x_{k-1}) - \alpha_{k-1}Hv_k

    should allow you to conclude $\langle\nabla f(x_k), v_k\rangle_H = -\frac{\|\nabla f(x_k)\|^2}{\alpha_{k-1}}$. Then it makes sense to calculate

    \alpha_{k-1} = \frac{\langle\nabla f(x_{k-1}), v_k\rangle}{\|v_k\|_H^2}

    by solving its optimization problem. Finally you may want to consider $v_k = \nabla f(x_{k-1}) - cv_{k-1}$ and $v_{k-1} \in \mathcal{K}_{k-2}$.

    Solution.

    We have

    \nabla f(x_k) = H(\underbrace{x_{k-1} - \alpha_{k-1}v_k}_{=x_k} - x^*) = \nabla f(x_{k-1}) - \alpha_{k-1}Hv_k.

    This implies $v_k = \frac{1}{\alpha_{k-1}}H^{-1}[\nabla f(x_{k-1}) - \nabla f(x_k)]$ and therefore

    \langle\nabla f(x_k), v_k\rangle_H = \frac{1}{\alpha_{k-1}}\langle\nabla f(x_k), \nabla f(x_{k-1}) - \nabla f(x_k)\rangle = -\frac{\|\nabla f(x_k)\|^2}{\alpha_{k-1}},

    where we have used $\langle\nabla f(x_k), \nabla f(x_{k-1})\rangle = 0$, which follows from $\nabla f(x_{k-1}) \in \mathcal{K}_{k-1}$ and $\nabla f(x_k) \perp \mathcal{K}_{k-1}$.

    Now we need to find $\alpha_{k-1}$. But the first order condition

    \begin{aligned}
    0 &= \frac{d}{d\alpha}f(x_{k-1} - \alpha v_k) \\
    &= -\langle\nabla f(x_{k-1} - \alpha v_k), v_k\rangle \\
    &= -\langle H(x_{k-1} - x^* - \alpha v_k), v_k\rangle \\
    &= -\langle\nabla f(x_{k-1}), v_k\rangle + \alpha\|v_k\|_H^2
    \end{aligned}

    implies

    \alpha_{k-1} = \frac{\langle\nabla f(x_{k-1}), v_k\rangle}{\|v_k\|_H^2}.

    Before we put things together, note that by the definition of $v_k$

    \langle\nabla f(x_{k-1}), v_k\rangle = \langle\nabla f(x_{k-1}), \nabla f(x_{k-1}) - cv_{k-1}\rangle = \|\nabla f(x_{k-1})\|^2,

    since $\nabla f(x_{k-1})$ is orthogonal to $v_{k-1} \in \mathcal{K}_{k-2}$. From this we get

    \alpha_{k-1} = \frac{\|\nabla f(x_{k-1})\|^2}{\|v_k\|_H^2},

    so we finally get

    \frac{\langle\nabla f(x_k), v_k\rangle_H}{\|v_k\|_H^2} = -\frac{\|\nabla f(x_k)\|^2}{\alpha_{k-1}\|v_k\|_H^2} = -\frac{\|\nabla f(x_k)\|^2\,\|v_k\|_H^2}{\|v_k\|_H^2\,\|\nabla f(x_{k-1})\|^2} = -\frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}.
  8. (viii)

    Summarize everything into a pseudo-algorithm for conjugate gradient descent (CGD) and compare it to heavy-ball momentum with

    \beta_k = \frac{\alpha_k\|\nabla f(x_k)\|^2}{\alpha_{k-1}\|\nabla f(x_{k-1})\|^2},

    using identical $\alpha_k$ as CGD.

    Solution.

    We set $v_1 = \nabla f(x_0)$ or, for later iterations,

    v_{k+1} = \nabla f(x_k) + \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}v_k,

    determine the step-size

    \alpha_k = \operatorname*{argmin}_\alpha f(x_k - \alpha v_{k+1})

    and finally make our step

    x_{k+1} = x_k - \alpha_k v_{k+1}.

    Using the fact $v_k = \frac{x_{k-1} - x_k}{\alpha_{k-1}}$ and inserting $v_{k+1}$ into the last equation, we notice that

    \begin{aligned}
    x_{k+1} &= x_k - \alpha_k\Bigl[\nabla f(x_k) + \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}\cdot\frac{x_{k-1} - x_k}{\alpha_{k-1}}\Bigr] \\
    &= x_k - \alpha_k\nabla f(x_k) + \underbrace{\frac{\alpha_k}{\alpha_{k-1}}\frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}}_{=\beta_k}(x_k - x_{k-1}),
    \end{aligned}

    i.e. CGD is identical to HBM with these particular parameters $\alpha_k$, $\beta_k$. ∎
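    A minimal numpy sketch of this CGD recursion on a random quadratic (our own illustration; the function name `cgd`, the test matrix and the tolerance are arbitrary), using the closed-form line search $\alpha_k = \langle\nabla f(x_k), v_{k+1}\rangle / \|v_{k+1}\|_H^2$ from (vii):

```python
import numpy as np

def cgd(H, x_star, x0, tol=1e-12):
    """Conjugate gradient descent for f(x) = 0.5 (x - x_star)^T H (x - x_star)."""
    x = x0.copy()
    grad = H @ (x - x_star)                      # nabla f(x_0)
    v = grad.copy()                              # v_1 = nabla f(x_0)
    for _ in range(len(x0)):
        if np.linalg.norm(grad) < tol:
            break
        alpha = (grad @ v) / (v @ H @ v)         # exact line search (part (vii))
        x = x - alpha * v
        new_grad = H @ (x - x_star)
        beta = (new_grad @ new_grad) / (grad @ grad)
        v = new_grad + beta * v                  # v_{k+1} = grad + beta * v_k
        grad = new_grad
    return x

rng = np.random.default_rng(2)
d = 20
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)
x_star = rng.standard_normal(d)
x = cgd(H, x_star, np.zeros(d))
print(np.linalg.norm(x - x_star))                # ~ 0 after at most d iterations
```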

    Exercise 3 (Momentum).

    In this exercise, we take a closer look at heavy-ball momentum

    x_{k+1} = x_k + \beta_k(x_k - x_{k-1}) - \alpha_k\nabla f(x_k).
    1. (a)

      Find a continuous function $f: \mathbb{R} \to \mathbb{R}$ such that

      f'(x) = \begin{cases} 25x & x < 1 \\ x + 24 & 1 < x < 2 \\ 25x - 24 & 2 < x. \end{cases}

      Prove that $f$ is $\mu$-strongly convex with $\mu = 1$, $L$-smooth with $L = 25$ and has a minimum in zero.

      Solution.

      We define

      f(x) = \begin{cases} \frac{25}{2}x^2 & x \le 1 \\ \frac{1}{2}x^2 + 24x - 12 & 1 < x < 2 \\ \frac{25}{2}x^2 - 24x + 36 & 2 \le x, \end{cases}

      note that it is continuous in $1$ and $2$ and therefore everywhere, and that it has the correct derivative. Further note that

      f''(x) = \begin{cases} 1 & 1 < x < 2 \\ 25 & \text{else} \end{cases}

      is the derivative of $f'(x)$ in the following sense:

      f'(x) = \int_0^x f''(t)\,dt,

      which follows from differentiability of $f'$ on its segments with the fundamental theorem of calculus and continuity between segments. Thus we have

      \begin{aligned}
      f(y) &= f(x) + \int_x^y f'(t)\,dt = f(x) + f'(x)(y - x) + \int_x^y f'(t) - f'(x)\,dt \\
      &= f(x) + f'(x)(y - x) + \int_x^y\int_x^t f''(s)\,ds\,dt.
      \end{aligned}

      For the Bregman divergence this implies

      \frac{1}{2}|y - x|^2 \le D_f^{(B)}(y, x) = \int_x^y\int_x^t f''(s)\,ds\,dt \le \frac{25}{2}|y - x|^2,

      thus f is μ=1-strongly convex and L=25-smooth. ∎
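      A quick hedged check of this construction in numpy (our own addition; the evaluation points and tolerances are arbitrary): evaluate the piecewise $f$, verify continuity at the break points, and confirm that a central finite difference reproduces the claimed $f'$ away from the kinks.

```python
import numpy as np

def f(x):
    if x <= 1:
        return 12.5 * x**2
    if x < 2:
        return 0.5 * x**2 + 24 * x - 12
    return 12.5 * x**2 - 24 * x + 36

def f_prime(x):
    if x < 1:
        return 25 * x
    if x < 2:
        return x + 24
    return 25 * x - 24

# continuity at the break points x = 1 and x = 2
assert np.isclose(f(1 - 1e-9), f(1 + 1e-9), atol=1e-6)
assert np.isclose(f(2 - 1e-9), f(2 + 1e-9), atol=1e-6)

# the derivative matches a central finite difference away from the kinks
for x in [-3.0, 0.4, 1.5, 2.7, 10.0]:
    fd = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
    assert np.isclose(fd, f_prime(x), atol=1e-4)
```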

    2. (b)

      Recall, we required for convergence of HBM

      1 > \beta \ge \max\bigl\{(1 - \sqrt{\alpha\mu})^2, (1 - \sqrt{\alpha L})^2\bigr\}.

      Calculate the optimal $\alpha$ and $\beta$ to minimize the rate $\beta$.

      Solution.

      To minimize $\beta$, we first set

      \beta = \max\bigl\{(1 - \sqrt{\alpha\mu})^2, (1 - \sqrt{\alpha L})^2\bigr\}

      and then proceed to minimize this over $\alpha$, which results in

      \begin{aligned}
      \alpha^* &= \operatorname*{argmin}_\alpha\max\bigl\{(1 - \sqrt{\alpha\mu})^2, (1 - \sqrt{\alpha L})^2\bigr\} \\
      &= \operatorname*{argmin}_\alpha\max\bigl\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\bigr\} \\
      &= \operatorname*{argmin}_\alpha\max\bigl\{(1 - \sqrt{\alpha\mu}), -(1 - \sqrt{\alpha\mu}), (1 - \sqrt{\alpha L}), -(1 - \sqrt{\alpha L})\bigr\} \\
      &= \operatorname*{argmin}_\alpha\max\bigl\{(1 - \sqrt{\alpha\mu}), -(1 - \sqrt{\alpha L})\bigr\},
      \end{aligned}

      which is monotonically decreasing for

      1 - \sqrt{\alpha\mu} > \sqrt{\alpha L} - 1

      and monotonically increasing otherwise. Therefore its minimum is at equality. Thus

      1 - \sqrt{\alpha^*\mu} = \sqrt{\alpha^* L} - 1 \;\Longrightarrow\; 2 = \sqrt{\alpha^*}(\sqrt{\mu} + \sqrt{L}) \;\Longrightarrow\; \alpha^* = \frac{4}{(\sqrt{\mu} + \sqrt{L})^2}.

      This results in

      \beta^* = \Bigl(1 - \frac{2}{1 + \sqrt{L/\mu}}\Bigr)^2.
    3. (c)

      Prove that using heavy-ball momentum on $f$ with the optimal parameters results in the recursion

      x_{k+1} = \frac{13}{9}x_k - \frac{4}{9}x_{k-1} - \frac{1}{9}f'(x_k).
      Solution.

      Using our previous results about optimal rates, we have for $f$

      \alpha^* = \frac{4}{(\sqrt{\mu} + \sqrt{L})^2} = \frac{4}{(1 + 5)^2} = \frac{1}{9}, \qquad \beta^* = \Bigl(1 - \frac{2}{1 + 5}\Bigr)^2 = \frac{4}{9}.

      Thus

      x_{k+1} = x_k + \frac{4}{9}(x_k - x_{k-1}) - \frac{1}{9}f'(x_k) = \frac{13}{9}x_k - \frac{4}{9}x_{k-1} - \frac{1}{9}f'(x_k).

    4. (d)

      We want to find a cycle of points $p \to q \to r \to p$, such that for $x_0 = p$ we have

      x_{3k} = p, \quad x_{3k+1} = q, \quad x_{3k+2} = r \qquad \forall k \ge 0.

      Assume $p < 1$, $q < 1$ and $r > 2$ and use the heavy-ball recursion to create linear equations for $p, q, r$. Solve this linear system. What does this mean for convergence?

      Solution.

      We have

      \begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 0 & -\frac{4}{9} & \frac{13}{9} \\ \frac{13}{9} & 0 & -\frac{4}{9} \\ -\frac{4}{9} & \frac{13}{9} & 0 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} - \frac{1}{9}\begin{pmatrix} f'(r) \\ f'(p) \\ f'(q) \end{pmatrix}.

      Multiplying both sides by $9$, using $f'(r) = 25r - 24$, $f'(p) = 25p$ (and similarly for $q$) and reordering, we get

      \begin{pmatrix} 9 & 4 & 12 \\ 12 & 9 & 4 \\ 4 & 12 & 9 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 24 \\ 0 \\ 0 \end{pmatrix}.

      Solving this system of equations results in

      p = \frac{792}{1225} \approx 0.65, \qquad q = -\frac{2208}{1225} \approx -1.80, \qquad r = \frac{2592}{1225} \approx 2.12.

      As we have managed to find a cycle of points, HBM does not converge to the minimum at zero in this case. Note: it is also possible to show that this cycle is attractive, i.e. HBM converges to the cycle if you start in an epsilon neighbourhood of these points. ∎
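      One can verify the cycle with a few lines of exact rational arithmetic (our own sanity check, not part of the solution; the helper name and the number of iterations are arbitrary):

```python
from fractions import Fraction as F

def f_prime(x):
    if x < 1:
        return 25 * x
    if x < 2:
        return x + 24
    return 25 * x - 24

p, q, r = F(792, 1225), F(-2208, 1225), F(2592, 1225)

# iterate x_{k+1} = 13/9 x_k - 4/9 x_{k-1} - 1/9 f'(x_k), entering the cycle at (x_{k-1}, x_k) = (r, p)
x_prev, x = r, p
orbit = []
for _ in range(6):
    x_prev, x = x, F(13, 9) * x - F(4, 9) * x_prev - F(1, 9) * f_prime(x)
    orbit.append(x)

assert orbit == [q, r, p, q, r, p]       # the three points repeat exactly
```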

    5. (e)

      Implement Heavy-Ball momentum, Nesterov’s momentum and CGD https://classroom.github.com/a/f3PnRxTs.
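      As a starting point, here is a hedged sketch of heavy-ball and one common form of Nesterov's momentum on a generic smooth function (our own illustration with our own quadratic test problem and parameter choices; the actual assignment lives in the linked repository):

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, steps):
    """x_{k+1} = x_k + beta (x_k - x_{k-1}) - alpha grad(x_k), with x_{-1} := x_0."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_prev, x = x, x + beta * (x - x_prev) - alpha * grad(x)
    return x

def nesterov(grad, x0, alpha, beta, steps):
    """Look-ahead variant: the gradient is evaluated at the extrapolated point."""
    x_prev, x = x0, x0
    for _ in range(steps):
        y = x + beta * (x - x_prev)              # extrapolation step
        x_prev, x = x, y - alpha * grad(y)
    return x

# tiny smoke test on a quadratic with mu = 1, L = 25 (the constants from part (a))
H = np.diag([1.0, 25.0])
grad = lambda x: H @ x
x0 = np.array([1.0, 1.0])
print(heavy_ball(grad, x0, alpha=1/9, beta=4/9, steps=100))   # optimal HBM parameters from (b)
print(nesterov(grad, x0, alpha=1/25, beta=2/3, steps=100))    # usual 1/L and (sqrt(kappa)-1)/(sqrt(kappa)+1)
```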