Sheet 3

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning · Semester: FSS 2023 · Tutorial date: 30.03.2023 · Due: 12:00 in the exercise on Thursday, 30.03.2023
Exercise 1 (Convergence Speed).

Prove that

  (i)

    If we have

    \[ \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} = 0, \]

    then $e(x_k)$ converges super-linearly.

    Solution.

    We define $c_n := \sup_{k \ge n} \frac{e(x_{k+1})}{e(x_k)}$. Then

    \[ \lim_{n\to\infty} c_n = \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} = 0 \]

    and by definition

    \[ e(x_{k+1}) \le c_k\, e(x_k). \]

    Thus we have super-linear convergence. ∎

  (ii)

    If for $c \in (0,1)$ we have

    \[ \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} < c, \]

    then $e(x_k)$ converges linearly with rate $c$.

    Solution.

    We again define $c_n := \sup_{k \ge n} \frac{e(x_{k+1})}{e(x_k)}$. Then

    \[ \lim_{n\to\infty} c_n = \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} < c, \]

    thus there exists $N \ge 0$ such that for all $n \ge N$ we have $c_n \le c$, and therefore for all $n \ge N$

    \[ e(x_{n+1}) \le c_n\, e(x_n) \le c\, e(x_n). \qquad ∎ \]
  (iii)

    If for $c \in (0,1)$ we have

    \[ \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)^2} < c, \]

    then $e(x_k)$ converges quadratically with rate $c$.

    Solution.

    We similarly define $c_n := \sup_{k \ge n} \frac{e(x_{k+1})}{e(x_k)^2}$ and again get $\lim_{n\to\infty} c_n < c$. Thus there exists $N \ge 0$ such that for all $n \ge N$ we have $c_n \le c$, and therefore for all $n \ge N$

    \[ e(x_{n+1}) \le c_n\, e(x_n)^2 \le c\, e(x_n)^2. \qquad ∎ \]
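
To make the three notions concrete, here is a small numerical illustration (not part of the sheet; the example sequences are arbitrary choices): a linearly, a super-linearly and a quadratically converging error sequence, together with the ratios the definitions above refer to.

```python
# Numerical illustration of the convergence speeds in Exercise 1.

def ratios(errors, power=1):
    """Return the ratios e_{k+1} / e_k**power of consecutive errors."""
    return [e_next / e**power for e, e_next in zip(errors, errors[1:])]

def simulate(step, e0=0.5, n=8):
    """Generate errors e_0, ..., e_n via e_{k+1} = step(k, e_k)."""
    errors = [e0]
    for k in range(n):
        errors.append(step(k, errors[-1]))
    return errors

linear = simulate(lambda k, e: 0.5 * e)           # e_{k+1}/e_k = 0.5 < 1
superlinear = simulate(lambda k, e: e / (k + 2))  # e_{k+1}/e_k -> 0
quadratic = simulate(lambda k, e: e**2)           # e_{k+1}/e_k^2 = 1

print("linear ratios:     ", ratios(linear))        # constant 1/2
print("superlinear ratios:", ratios(superlinear))   # decreasing to 0
print("quadratic ratios:  ", ratios(quadratic, 2))  # bounded by c = 1
```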
Exercise 2 (Sub-gradients).

Let $f, g \colon \mathbb{R}^d \to \mathbb{R}$ be convex functions.

  (i)

    Prove that $\partial f(x)$ is a convex set for any $x \in \mathbb{R}^d$.

    Solution.

    Let $g_1, g_2 \in \partial f(x)$. Then for any $\lambda \in [0,1]$ and any $y \in \mathbb{R}^d$

    \begin{align*}
    f(y) &= \lambda f(y) + (1-\lambda) f(y) \\
    &\ge \lambda \bigl(f(x) + \langle g_1, y-x\rangle\bigr) + (1-\lambda)\bigl(f(x) + \langle g_2, y-x\rangle\bigr) \\
    &= f(x) + \langle \lambda g_1 + (1-\lambda) g_2,\, y-x\rangle,
    \end{align*}

    thus $\lambda g_1 + (1-\lambda) g_2 \in \partial f(x)$ by definition. ∎

  (ii)

    Prove for $a > 0$ that $\partial(af) = a\,\partial f$.

    Solution.

    We only need to prove “$\supseteq$”. Using $\tilde{f} = af$ with $\tilde{a} = \frac{1}{a}$, the other inclusion immediately follows.

    Let $a g_x \in a\,\partial f(x)$ with $g_x \in \partial f(x)$. We need to show that $a g_x \in \partial(af)(x)$. But this follows immediately from

    \[ (af)(y) = a\, f(y) \overset{a > 0,\ g_x \in \partial f(x)}{\ge} a\bigl(f(x) + \langle g_x, y-x\rangle\bigr) = (af)(x) + \langle a g_x,\, y-x\rangle. \qquad ∎ \]
  (iii)

    Prove that $\partial(f_1 + f_2) \supseteq \partial f_1 + \partial f_2$.

    Solution.

    Let $g_i \in \partial f_i(x)$ for $i = 1, 2$. Then we have $g_1 + g_2 \in \partial(f_1 + f_2)(x)$ because

    \begin{align*}
    (f_1+f_2)(y) &\ge \bigl(f_1(x) + \langle g_1, y-x\rangle\bigr) + \bigl(f_2(x) + \langle g_2, y-x\rangle\bigr) \\
    &= (f_1+f_2)(x) + \langle g_1 + g_2,\, y-x\rangle. \qquad ∎
    \end{align*}
  (iv)

    For $h(x) = f(Ax+b)$ prove $\partial h(x) \supseteq A^T \partial f(Ax+b)$. Prove equality for invertible $A$.

    Solution.

    Let $g_x \in \partial f(x)$ denote a subgradient of $f$ at $x$; in particular $g_{Ax+b} \in \partial f(Ax+b)$. Then $A^T g_{Ax+b} \in \partial h(x)$ because

    \begin{align*}
    h(y) = f(Ay+b) &\ge f(Ax+b) + \langle g_{Ax+b},\, (Ay+b) - (Ax+b)\rangle \\
    &= h(x) + \langle A^T g_{Ax+b},\, y-x\rangle.
    \end{align*}

    If $A$ is invertible, we have $f(x) = h(A^{-1}x - A^{-1}b)$, so by the previous statement with $\tilde{A} = A^{-1}$ and $\tilde{b} = -A^{-1}b$ we get the other inclusion. ∎
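
These inclusions can also be probed numerically. Below is a quick sanity check of (iv), not part of the sheet: we pick the test function $f = \lVert\cdot\rVert_1$ (whose subgradients appear in Exercise 3), a random $A$ and $b$, and test the subgradient inequality for $A^T g$ at many sampled points.

```python
# Sanity check of Exercise 2 (iv) for the test choice f = ||.||_1.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
b = rng.normal(size=d)

f = lambda z: np.abs(z).sum()     # f(z) = ||z||_1
h = lambda x: f(A @ x + b)        # h(x) = f(Ax + b)

x = rng.normal(size=d)
g = np.sign(A @ x + b)            # a subgradient of ||.||_1 at Ax + b

# h(y) >= h(x) + <A^T g, y - x> must hold for every y.
for _ in range(1000):
    y = rng.normal(size=d, scale=10.0)
    assert h(y) >= h(x) + (A.T @ g) @ (y - x) - 1e-9
print("subgradient inequality held for A^T g at all sampled points")
```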

Exercise 3 (Lasso).

Let

\[ f(x) = \tfrac{1}{2}\lVert x - y\rVert^2 + \lambda \lVert x\rVert_1 \]

for $x \in \mathbb{R}^d$ be the Lagrangian form of the least-squares LASSO method.

  (a)

    Compute a sub-gradient of $f$.

      Solution.

    Using $\partial(g + \lambda h)(x) \supseteq \partial g(x) + \lambda\,\partial h(x)$ (Exercise 2 (ii) and (iii)), we only need to determine the subgradients of $g(x) := \tfrac{1}{2}\lVert x - y\rVert^2$ and

    \[ h(x) := \lVert x\rVert_1 = \sum_{i=1}^d |x_i|. \]

    But $\nabla g(x) = x - y$ as $g$ is differentiable. And since it is also convex, we have

    \[ \partial g(x) = \{\nabla g(x)\}. \]

    Now a subgradient of $h_i(x) = |x_i|$ is given by $\operatorname{sgn}(x_i)\, e_i$, where $\operatorname{sgn}(0) \in [-1,1]$ can be selected arbitrarily, because

    \[ h_i(x) + \langle \operatorname{sgn}(x_i)\, e_i,\, y-x\rangle = \underbrace{|x_i| - \operatorname{sgn}(x_i)\, x_i}_{=0} + \operatorname{sgn}(x_i)\, y_i \le |y_i| = h_i(y). \]

      So again

    \[ \partial h(x) \supseteq \sum_{i=1}^d \partial h_i(x) \ni (\operatorname{sgn}(x_1), \dots, \operatorname{sgn}(x_d))^T =: s(x). \]

      So putting everything together we have

    \[ \partial f(x) \ni x - y + \lambda\, s(x). \]
  (b)

    Prove that $f$ is convex.

      Solution.

    As its set of sub-gradients is nowhere empty, it is convex. ∎

  (c)

    Find a global minimum of $f$.

      Solution.

    By the lecture it is sufficient to find a point $x$ such that $0 \in \partial f(x)$. By the previous exercise we therefore want to solve

    \[ 0 \overset{!}{=} x - y + \lambda\, s(x). \]

    Entry-wise this implies

    \begin{align*}
    x_i = y_i - \lambda \operatorname{sgn}(x_i)
    &= \begin{cases}
    y_i + \lambda & x_i < 0 \\
    y_i - \lambda\,[-1,1] & x_i = 0 \\
    y_i - \lambda & x_i > 0
    \end{cases} \\
    &= \begin{cases}
    y_i + \lambda & y_i + \lambda < 0 \\
    0 & y_i \in [-\lambda, \lambda] \\
    y_i - \lambda & y_i - \lambda > 0
    \end{cases}
    = \begin{cases}
    y_i + \lambda & y_i < -\lambda \\
    0 & y_i \in [-\lambda, \lambda] \\
    y_i - \lambda & y_i > \lambda,
    \end{cases}
    \end{align*}

    i.e. $x$ is the soft-thresholding of $y$.
  (d)

    Implement $f$ as a sub-type of “DifferentiableFunction” (even though it is not differentiable) by returning a single sub-gradient, and apply gradient descent to verify the global minimum: https://classroom.github.com/a/XqNuifmO (a sketch follows below).
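
A minimal Python sketch of (c) and (d) (the classroom template may prescribe a different language and interface; all names here are our own): we pretend $f$ is differentiable by always returning the particular subgradient $x - y + \lambda s(x)$ from (a) with $\operatorname{sgn}(0) = 0$, run descent steps, and compare against the soft-thresholding solution from (c).

```python
# Subgradient descent on the Lasso objective vs. the closed-form minimum.
import numpy as np

lam = 0.7
y = np.array([2.0, -0.3, 0.5, -1.5])

def f(x):
    """Lasso objective f(x) = 1/2 ||x - y||^2 + lam ||x||_1."""
    return 0.5 * np.sum((x - y) ** 2) + lam * np.abs(x).sum()

def subgradient(x):
    """The subgradient x - y + lam * s(x) from (a), with sgn(0) = 0."""
    return x - y + lam * np.sign(x)

def soft_threshold(y, lam):
    """Entry-wise closed-form global minimizer from (c)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

x, best = np.zeros_like(y), np.zeros_like(y)
for k in range(20000):
    x = x - subgradient(x) / (k + 2)  # diminishing step sizes
    if f(x) < f(best):
        best = x                      # track the best iterate seen

print("subgradient method:", best.round(3))
print("soft thresholding: ", soft_threshold(y, lam))  # [ 1.3  0.  0. -0.8]
```

With a constant step size the iterates would keep oscillating around the kinks of $\lVert\cdot\rVert_1$; the diminishing step size is the usual remedy for the subgradient method.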

Exercise 4 (Momentum Matrix).

Let $D = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$, $\alpha, \beta > 0$ and define

\[ T = \begin{pmatrix} (1+\beta)\mathbb{I} - \alpha D & -\beta \mathbb{I} \\ \mathbb{I} & \mathbf{0} \end{pmatrix} \in \mathbb{R}^{2d \times 2d}. \]

Prove there exists a regular $S \in \mathbb{R}^{2d \times 2d}$ such that

\[ S^{-1} T S = \hat{T} = \begin{pmatrix} T_1 & & \\ & \ddots & \\ & & T_d \end{pmatrix} \]

with

\[ T_i = \begin{pmatrix} 1 + \beta - \alpha\lambda_i & -\beta \\ 1 & 0 \end{pmatrix} \in \mathbb{R}^{2 \times 2}. \]
      Solution.

We simply define, with the standard basis $e_i \in \mathbb{R}^d$,

\[ S = \begin{pmatrix} e_1 & 0 & \cdots & e_d & 0 \\ 0 & e_1 & \cdots & 0 & e_d \end{pmatrix} \in \mathbb{R}^{2d \times 2d}; \]

in particular $S^T = S^{-1}$, since $S$ is a permutation matrix. ∎
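
Again not part of the sheet, a quick numerical confirmation with arbitrary test values for $\alpha$, $\beta$ and the $\lambda_i$: conjugating $T$ with this permutation matrix indeed produces the claimed block diagonal $\hat{T}$.

```python
# Verify S^{-1} T S = T_hat for the momentum matrix of Exercise 4.
import numpy as np

d, alpha, beta = 3, 0.1, 0.9
lambdas = np.array([1.0, 2.0, 5.0])
I = np.eye(d)

T = np.block([[(1 + beta) * I - alpha * np.diag(lambdas), -beta * I],
              [I, np.zeros((d, d))]])

# Permutation S with columns (e_1, 0), (0, e_1), ..., (e_d, 0), (0, e_d).
S = np.zeros((2 * d, 2 * d))
for i in range(d):
    S[i, 2 * i] = 1.0          # column 2i   is (e_i, 0)
    S[d + i, 2 * i + 1] = 1.0  # column 2i+1 is (0, e_i)

T_hat = np.zeros_like(T)       # expected block diagonal of the T_i
for i, lam in enumerate(lambdas):
    T_hat[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [
        [1 + beta - alpha * lam, -beta],
        [1.0, 0.0],
    ]

assert np.allclose(S.T @ T @ S, T_hat)  # S^{-1} = S^T for permutations
print("S^{-1} T S equals the block diagonal T_hat")
```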

Exercise 5 (PL-Inequality).

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and satisfies the Polyak-Łojasiewicz inequality

\[ \lVert \nabla f(x)\rVert^2 \ge 2c\,(f(x) - f_*) \tag{PL} \]

for some $c > 0$ and all $x \in \mathbb{R}^d$, with $f_* = \min_x f(x) > -\infty$.

  (I)

    Prove that gradient descent with fixed step size $\alpha_k = \frac{1}{L}$ converges linearly in the sense

    \[ f(x_k) - f_* \le \left(1 - \tfrac{c}{L}\right)^k (f(x_0) - f_*). \]
        Solution.

    By $L$-smoothness and the descent lemma, we have

    \[ f(x_{k+1}) \le f(x_k) - \tfrac{1}{2L}\lVert\nabla f(x_k)\rVert^2 \overset{\text{(PL)}}{\le} f(x_k) - \tfrac{c}{L}\,(f(x_k) - f_*). \]

    Subtracting $f_*$ from both sides, we get

    \[ f(x_{k+1}) - f_* \le \left(1 - \tfrac{c}{L}\right)(f(x_k) - f_*), \]

    and the claim follows by induction over $k$. ∎
  (II)

    Prove that $\mu$-strong convexity and $L$-smoothness imply the PL-inequality.

        Solution.

    Recall by the solution of Sheet 1, Exercise 6 (iii), and strong convexity we have

    \begin{align*}
    \mu \lVert x-y\rVert^2 &\le D_f^{(B)}(x,y) + D_f^{(B)}(y,x) \\
    &= \langle \nabla f(x) - \nabla f(y),\, x-y\rangle
    \end{align*}

    and therefore, by the Cauchy-Schwarz inequality,

    \[ \mu \lVert x-y\rVert \le \lVert \nabla f(x) - \nabla f(y)\rVert. \tag{1} \]

    Finally we know by $L$-smoothness and $\nabla f(x_*) = 0$, where $x_*$ is the minimum,

    \[ f(x) - f(x_*) \overset{\nabla f(x_*) = 0}{=} D_f^{(B)}(x, x_*) \overset{L\text{-smooth}}{\le} \tfrac{L}{2}\lVert x - x_*\rVert^2 \overset{(1)}{\le} \tfrac{L}{2\mu^2}\,\bigl\lVert \nabla f(x) - \underbrace{\nabla f(x_*)}_{=0}\bigr\rVert^2, \]

    i.e. (PL) holds with $c = \frac{\mu^2}{L}$. ∎
  (III)

    Use a graphing calculator to find $c$ such that $f(x) = x^2 + 3\sin^2(x)$ satisfies the PL-condition (argue why $x \to \pm\infty$ is not a problem) and prove it is not convex.

        Solution.

    For $c = \frac{1}{6}$ we have the PL-condition (checked graphically on a bounded range).

    As $f'(x) = 2(x + 3\sin(x)\cos(x))$ and therefore

    \[ f'(x)^2 = 4\bigl(x + 3\underbrace{\sin(x)\cos(x)}_{\in [-1,1]}\bigr)^2 \overset{|x| \ge 3}{\ge} 4(|x| - 3)^2, \]

    the $x^2$ term dominates for large $|x|$, so if we make $c$ small enough we can ensure the inequality for large $|x|$ as well.

    $f$ is not convex because

    \[ f\bigl(\tfrac{1}{2}\pi + \tfrac{1}{2}\cdot 0\bigr) = \tfrac{\pi^2}{4} + 3 > \tfrac{\pi^2}{2} = \tfrac{1}{2}f(\pi) + \tfrac{1}{2}f(0). \qquad ∎ \]
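
A numerical stand-in for the graphing calculator (the grid range and resolution are arbitrary choices): we check the ratio $\frac{f'(x)^2}{2(f(x) - f_*)} \ge c$ with $c = \frac{1}{6}$ and $f_* = 0$ on a grid, and evaluate the midpoint counterexample to convexity.

```python
# Grid check of the PL constant c = 1/6 for f(x) = x^2 + 3 sin(x)^2.
import numpy as np

f = lambda x: x**2 + 3 * np.sin(x) ** 2
df = lambda x: 2 * (x + 3 * np.sin(x) * np.cos(x))

# f_* = 0 is attained at x = 0; exclude the minimizer (0/0 ratio there).
# The tail |x| > 10 is covered by the growth argument above.
x = np.linspace(-10.0, 10.0, 200001)
x = x[np.abs(x) > 1e-6]
ratio = df(x) ** 2 / (2 * f(x))   # (PL) with constant c needs ratio >= c
print("min ratio:", ratio.min())  # ~0.18 > 1/6, so c = 1/6 works here

# Midpoint counterexample to convexity: f(pi/2) > (f(0) + f(pi)) / 2.
print(f(np.pi / 2), ">", (f(0) + f(np.pi)) / 2)  # 5.467... > 4.934...
```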

Exercise 6 (Weak PL-Inequality).

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and satisfies the “weak PL inequality”

\[ \lVert \nabla f(x)\rVert \ge 2c\,(f(x) - f_*) \]

for some $c > 0$ and all $x \in \mathbb{R}^d$, with $f_* = \min_x f(x) > -\infty$.

  A.

    Let $a_0 \in [0, \frac{1}{q}]$ for some $q > 0$ and assume the sequence $(a_n)_{n \in \mathbb{N}}$ is positive and satisfies the diminishing contraction

    \[ 0 \le a_{n+1} \le (1 - q a_n)\, a_n \qquad \forall n \ge 0. \]

    Prove the convergence rate

    \[ a_n \le \frac{1}{nq + 1/a_0} \le \frac{1}{(n+1)q}. \]
          Hint.

          A useful checkpoint might be the telescoping sum of

    \[ \frac{1}{a_{n+1}} - \frac{1}{a_n} \ge q. \]
          Solution.

          Divide the reordered contraction

    \[ a_n \ge a_{n+1} + q\, a_n^2 \]

          by anan+1 to obtain

    \[ \frac{1}{a_{n+1}} \ge \frac{1}{a_n} + q\,\underbrace{\frac{a_n}{a_{n+1}}}_{\ge 1} \ge \frac{1}{a_n} + q, \]

          which leads to

    \[ \frac{1}{a_n} - \frac{1}{a_0} = \sum_{k=0}^{n-1} \Bigl(\frac{1}{a_{k+1}} - \frac{1}{a_k}\Bigr) \ge nq. \]

          Reordering we obtain our claim

    \[ a_n \le \frac{1}{nq + \frac{1}{a_0}} \overset{a_0 \le \frac{1}{q}}{\le} \frac{1}{(n+1)q}. \qquad ∎ \]
  B.

    Prove that $f$ is bounded. More specifically, $e(x) := f(x) - f_* \le \frac{L}{2c^2}$ for all $x$.

          Hint.

          Use Sheet 1 Exercise 1 (i).

          Solution.

          Using Sheet 1 Exercise 1 (i), we get

    \[ f_* \le f(x) - \tfrac{1}{2L}\lVert\nabla f(x)\rVert^2 \]

    and therefore

    \[ e(x) \ge \tfrac{1}{2L}\lVert \nabla f(x)\rVert^2 \overset{\text{weak PL}}{\ge} \tfrac{4c^2}{2L}\, e(x)^2. \]

    Dividing both sides by $e(x)$ (the claim is trivial if $e(x) = 0$), we obtain

    \[ 1 \ge \tfrac{2c^2}{L}\, e(x) \]

    and thus

    \[ e(x) \le \frac{L}{2c^2}. \qquad ∎ \]
  C.

    For gradient descent $x_{n+1} - x_n = -\alpha_n \nabla f(x_n)$ with constant step size $\alpha_n = \frac{1}{L}$ prove the convergence rate

    \[ f(x_n) - f_* \le \frac{L}{2c^2 (n+1)}. \]
          Solution.

          Using L-smoothness, we have

    \begin{align*}
    f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k),\, x_{k+1} - x_k\rangle + \tfrac{L}{2}\lVert x_{k+1} - x_k\rVert^2 \\
    &= f(x_k) - \underbrace{\alpha_k \bigl(1 - \tfrac{L}{2}\alpha_k\bigr)}_{= \frac{1}{2L}}\, \lVert \nabla f(x_k)\rVert^2.
    \end{align*}

    If we subtract $f_*$ from both sides and apply our weak PL inequality ($\lVert \nabla f(x_k)\rVert^2 \ge 4c^2\, e(x_k)^2$), we get

    \[ e(x_{k+1}) \le e(x_k) - \tfrac{4c^2}{2L}\, e(x_k)^2 = \Bigl(1 - \tfrac{2c^2}{L}\, e(x_k)\Bigr)\, e(x_k). \]

    With $q = \frac{2c^2}{L}$ and $e(x_0) \le \frac{L}{2c^2} = \frac{1}{q}$ by part B, we can apply part A to obtain our claim (a numerical sanity check follows below). ∎
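
As referenced above, a numerical sanity check of part A (the values of $q$ and $a_0$ are arbitrary test choices): we iterate the extremal recursion $a_{n+1} = (1 - q a_n) a_n$ and compare against the proven bound; part C corresponds to $a_n = e(x_n)$ with $q = \frac{2c^2}{L}$.

```python
# Check a_n <= 1/(n q + 1/a_0) <= 1/((n+1) q) for the extremal recursion.
q, a0 = 0.5, 1.9          # requires a_0 in [0, 1/q]; here 1.9 <= 2 = 1/q
a = a0
for n in range(1, 101):
    a = (1 - q * a) * a   # extremal case of the diminishing contraction
    assert 0.0 <= a <= 1 / (n * q + 1 / a0) <= 1 / ((n + 1) * q)
print("bound holds for n = 1, ..., 100; a_100 =", a)
```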