Sheet 1

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 02.03.2023
Due: 12:00 in the exercise on Thursday, 02.03.2023
Exercise 1 (Convex Examples).

Prove that the following functions are convex:

(i) affine linear functions, i.e. $f(x) = a^\top x + c$ for $a \in \mathbb{R}^d$, $c \in \mathbb{R}$,

    Solution.

    We have

\begin{align*}
f(\lambda x + (1-\lambda)y) &= a^\top(\lambda x + (1-\lambda)y) + \underbrace{(\lambda + 1 - \lambda)}_{=1} c \\
&= \lambda\underbrace{(a^\top x + c)}_{f(x)} + (1-\lambda)\underbrace{(a^\top y + c)}_{f(y)}. \qquad ∎
\end{align*}
(ii) norms, i.e. $x \mapsto \|x\|$,

    Solution.

    We have

\[
\|\lambda x + (1-\lambda)y\| \overset{\Delta\text{-ineq.}}{\le} \|\lambda x\| + \|(1-\lambda)y\| \overset{\text{scaling}}{=} \lambda\|x\| + (1-\lambda)\|y\|. \qquad ∎
\]
(iii) sums of convex functions $f_k$, i.e. $f(x) = \sum_{k=1}^n f_k(x)$,

    Solution.

    We have

\[
f(\lambda x + (1-\lambda)y) = \sum_{k=1}^n \underbrace{f_k(\lambda x + (1-\lambda)y)}_{\le\, \lambda f_k(x) + (1-\lambda) f_k(y)}
\le \lambda \underbrace{\sum_{k=1}^n f_k(x)}_{=f(x)} + (1-\lambda) \underbrace{\sum_{k=1}^n f_k(y)}_{=f(y)}. \qquad ∎
\]
(iv) $F(x) := \sup_{f \in \mathcal{F}} f(x)$ for a set $\mathcal{F}$ of convex functions.

    Solution.

    We have

\[
F(\lambda x + (1-\lambda)y) \le \sup_{f \in \mathcal{F}} \bigl[\lambda f(x) + (1-\lambda) f(y)\bigr]
\le \sup_{f \in \mathcal{F}} \lambda f(x) + \sup_{g \in \mathcal{F}} (1-\lambda) g(y)
= \lambda F(x) + (1-\lambda) F(y). \qquad ∎
\]
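As a quick numerical sanity check (not a replacement for the proofs), the following NumPy sketch samples random points and verifies the convexity inequality for instances of all four examples; the concrete vector $a$, constant $c$ and the dimension are illustrative choices.

```python
# Sanity check: f(lx + (1-l)y) <= l*f(x) + (1-l)*f(y) on random samples.
import numpy as np

rng = np.random.default_rng(0)
a, c = rng.normal(size=5), 0.7  # illustrative parameters

examples = {
    "affine": lambda x: a @ x + c,
    "norm": np.linalg.norm,
    "sum": lambda x: np.linalg.norm(x) + a @ x + c,                # (ii) + (i)
    "sup": lambda x: max(a @ x + c, -(a @ x), np.linalg.norm(x)),  # finite sup
}

for name, f in examples.items():
    for _ in range(1000):
        x, y = rng.normal(size=5), rng.normal(size=5)
        lam = rng.uniform()
        assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
    print(name, "passed")
```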
    Exercise 2 (Finite Jensen).

Let $\varphi$ be convex and let $\sum_{i=1}^n \lambda_i = 1$ with $\lambda_i \ge 0$. Prove

\[
\varphi\Bigl(\sum_{i=1}^n \lambda_i x_i\Bigr) \le \sum_{i=1}^n \lambda_i \varphi(x_i)
\]

and deduce $\Bigl(\frac{1}{n}\sum_{i=1}^n x_i\Bigr)^2 \le \frac{1}{n}\sum_{i=1}^n x_i^2$.

    Solution.

For $n = 2$ this is simply the definition of convexity. Assume the claim holds for $n$. If $\lambda_{n+1} = 1$ there is nothing to prove, so let $\lambda_{n+1} < 1$. Then the induction step is given by

\begin{align*}
\varphi\Bigl(\sum_{i=1}^{n+1} \lambda_i x_i\Bigr)
&= \varphi\Bigl(\lambda_{n+1} x_{n+1} + (1 - \lambda_{n+1}) \sum_{i=1}^{n} \underbrace{\tfrac{\lambda_i}{1 - \lambda_{n+1}}}_{=:\tilde\lambda_i} x_i\Bigr) \\
&\le \lambda_{n+1}\varphi(x_{n+1}) + (1 - \lambda_{n+1}) \underbrace{\varphi\Bigl(\sum_{i=1}^{n} \tilde\lambda_i x_i\Bigr)}_{\overset{\text{ind.}}{\le} \sum_{i=1}^{n} \tilde\lambda_i \varphi(x_i)} \\
&\le \sum_{i=1}^{n+1} \lambda_i \varphi(x_i),
\end{align*}

where it is easy to check that the $\tilde\lambda_i$ sum to one. For the second part select $\lambda_i = \frac{1}{n}$ and $\varphi(x) = x^2$, which is convex because

\begin{align*}
(\lambda x + (1-\lambda)y)^2
&= \underbrace{\lambda^2}_{=\lambda(1 - (1-\lambda))} x^2 + \underbrace{(1-\lambda)^2}_{=(1-\lambda) - \lambda(1-\lambda)} y^2 + 2\lambda(1-\lambda)xy \\
&= \lambda x^2 + (1-\lambda) y^2 - \lambda(1-\lambda)\underbrace{(x^2 + y^2 - 2xy)}_{=(x-y)^2 \ge 0} \\
&\le \lambda x^2 + (1-\lambda) y^2. \qquad ∎
\end{align*}
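As a quick numerical sanity check (illustrative only), one can sample random weights from the simplex and verify Jensen's inequality for $\varphi(x) = x^2$:

```python
# Sanity check of finite Jensen for phi(x) = x^2 with random simplex weights.
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: x**2

for _ in range(1000):
    n = int(rng.integers(2, 10))
    lam = rng.dirichlet(np.ones(n))  # lam_i >= 0 and sum(lam) == 1
    x = rng.normal(size=n)
    assert phi(lam @ x) <= lam @ phi(x) + 1e-12
print("Jensen holds on all samples; lam_i = 1/n gives (mean x)^2 <= mean(x^2)")
```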
    Exercise 3 (Strict & Strong Convexity).

    Prove the following statements

(a) $\mu$-strong convexity (with $\mu > 0$) implies strict convexity.

      Solution.

Assuming $x \ne y$, we have

\[
0 < \frac{\mu}{2}\|x - y\|^2 \overset{\mu\text{-strong conv.}}{\le} f(x) - f(y) - \langle \nabla f(y), x - y\rangle.
\]

Moving the negative terms to the other side leaves $f(x) > f(y) + \langle \nabla f(y), x - y\rangle$ for all $x \ne y$, which is equivalent to strict convexity by Proposition A.1.8 in the lecture. ∎

(b) For twice differentiable $f$, the following are equivalent:

(I) $\nabla^2 f(x) \succeq \mu\mathbb{I}$,

(II) $z^\top \nabla^2 f(x) z \ge \mu\|z\|^2$ for all $z$,

(III) $f$ is $\mu$-strongly convex,

where $\mathbb{I}$ is the identity matrix and

\[
A \succeq B \;:\Longleftrightarrow\; A - B \text{ is (weakly) positive definite, i.e. positive semi-definite.}
\]
      Solution.

Let us get (I) $\Leftrightarrow$ (II) out of the way first:

\[
z^\top \nabla^2 f(x) z - \mu\|z\|^2 = \langle z, \nabla^2 f(x) z\rangle - \langle z, \mu\mathbb{I}z\rangle = z^\top\bigl(\nabla^2 f(x) - \mu\mathbb{I}\bigr)z.
\]

If we have (I), then the rightmost side is non-negative and thus we also have (II). For the other direction we simply start from the left.

For (III) $\Rightarrow$ (I) let

\[
g(y) := f(y) - \frac{\mu}{2}\|y\|^2.
\]

Because $f$ is $\mu$-strongly convex, we have

\begin{align*}
g(y) &= f(y) - \frac{\mu}{2}\|y\|^2 \\
&\ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 - \frac{\mu}{2}\|y\|^2 \\
&= \underbrace{f(x) - \frac{\mu}{2}\|x\|^2}_{=g(x)} + \bigl\langle \underbrace{\nabla f(x) - \mu x}_{=\nabla g(x)}, y - x\bigr\rangle
+ \underbrace{\frac{\mu}{2}\|x\|^2 + \mu\langle x, y - x\rangle + \frac{\mu}{2}\|y - x\|^2 - \frac{\mu}{2}\|y\|^2}_{=\frac{\mu}{2}\|x + (y - x)\|^2 - \frac{\mu}{2}\|y\|^2 = 0} \\
&= g(x) + \langle \nabla g(x), y - x\rangle.
\end{align*}

By Proposition A.1.8 this shows that $g$ is convex. So we know by the lecture that

\[
\nabla^2 g(x) = \nabla^2 f(x) - \mu\mathbb{I}
\]

is (weakly) positive definite, which is exactly (I).

On the other hand, for (I) $\Rightarrow$ (III): if $\nabla^2 f \succeq \mu\mathbb{I}$, we know that $\nabla^2 g$ is (weakly) positive definite and thus that $g$ is convex by the lecture. Reusing the equations above, with the inequality now supplied by convexity of $g$, we obtain

\begin{align*}
f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 - \frac{\mu}{2}\|y\|^2
&= g(x) + \langle \nabla g(x), y - x\rangle \\
&\overset{\text{conv.}}{\le} g(y) = f(y) - \frac{\mu}{2}\|y\|^2.
\end{align*}

Adding $\frac{\mu}{2}\|y\|^2$ to both sides results in the definition of $\mu$-strong convexity of $f$. ∎
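To see the equivalence in action, here is a small numerical check (not part of the proof) on a quadratic $f(x) = \frac{1}{2}x^\top A x$, where the largest admissible $\mu$ in (I) is the smallest eigenvalue of the Hessian $A$; the random matrix is an illustrative choice.

```python
# Check (I) vs (III) on f(x) = 0.5 x^T A x: A >= mu*I with mu = lambda_min(A),
# and the strong convexity inequality (III) holds with that same mu.
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B @ B.T + np.eye(4)           # symmetric positive definite Hessian
mu = np.linalg.eigvalsh(A).min()  # (I): A - mu*I is positive semi-definite

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    bregman = f(y) - f(x) - grad(x) @ (y - x)
    assert bregman >= 0.5 * mu * (y - x) @ (y - x) - 1e-9  # (III)
print("mu-strong convexity verified with mu =", mu)
```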

      Exercise 4 (Convexity and Minima).

      Prove the following statements

(I) If $f$ is convex, then every local minimum is also a global minimum.

        Solution.

Let $x^*$ be a local minimum and assume it is not global. Then there exists $y$ such that

\[
f(y) < f(x^*).
\]

Select $\lambda \in (0, 1)$ small enough that $x^* + \lambda(y - x^*)$ still lies in the $\epsilon$-neighborhood of $x^*$ on which $x^*$ is minimal. Then we obtain the contradiction

\[
f(x^*) \le f\bigl(x^* + \lambda(y - x^*)\bigr) \overset{\text{conv.}}{\le} (1-\lambda) f(x^*) + \underbrace{\lambda}_{>0}\underbrace{f(y)}_{<f(x^*)} < f(x^*). \qquad ∎
\]
(II) If $f$ is strictly convex, then there exists at most one minimum.

        Solution.

Let $x \ne y$ both be minima (which are global by (I)). Then for any $\lambda \in (0,1)$

\[
f(\lambda x + (1-\lambda)y) \overset{\text{strict conv.}}{<} \lambda f(x) + (1-\lambda) f(y) = \min_z f(z),
\]

        which is a contradiction. ∎

(III) If $f$ is convex and differentiable with $\nabla f(x^*) = 0$, then $x^*$ is a minimum.

        Solution.

By Proposition A.1.8 convexity is equivalent to

\[
f(x) \ge f(x^*) + \bigl\langle \underbrace{\nabla f(x^*)}_{=0}, x - x^*\bigr\rangle = f(x^*) \quad \text{for all } x.
\]

So $x^*$ is a (global) minimum. ∎

      Exercise 5 (Directional Minima).

Let $f$ be some differentiable function. For every direction $d \in \mathbb{R}^n$ define

\[
g_d(\alpha) := f(x^* + \alpha d).
\]

Assume that for every $d$, $g_d$ is minimized (at least locally) by $\alpha = 0$. Prove that

(I) We necessarily have $\nabla f(x^*) = 0$.

        Solution.

        If we select a standard basis vector as the direction, then

\[
0 = g_{e_i}'(\alpha)\big|_{\alpha=0} = \bigl\langle \nabla f(x^* + \alpha e_i), e_i\bigr\rangle\Big|_{\alpha=0} = \langle \nabla f(x^*), e_i\rangle.
\]

So the $i$-th entry of $\nabla f(x^*)$ is zero. As $i$ was arbitrary, we are done. ∎

(II) $f(x^*)$ is not necessarily a minimum of $f$.

        Hint.

Let $0 < p < q$ and define

\[
f(y, z) := (z - p y^2)(z - q y^2),
\]

consider $x^* = (0, 0)$ and prove that $f(y, m y^2) < 0$ for $p < m < q$.

        Solution.

Let $0 < p < q$ and define

\[
f(y, z) := (z - p y^2)(z - q y^2).
\]

Then for $p < m < q$ and $y \ne 0$ we have

\[
f(y, m y^2) = (m y^2 - p y^2)(m y^2 - q y^2) = \underbrace{y^4}_{>0}\underbrace{(m - p)}_{>0}\underbrace{(m - q)}_{<0} < 0.
\]

This means that $f(0,0) = 0$ is not a minimum, since the points $(y, m y^2)$ come arbitrarily close to $x^* = (0,0)$ as $y \to 0$ while $f$ stays negative on them.

Let $d = (d_1, d_2)$ be an arbitrary direction. Then

\begin{align*}
g_d(\alpha) &= \bigl(\alpha d_2 - p(\alpha d_1)^2\bigr)\bigl(\alpha d_2 - q(\alpha d_1)^2\bigr) \\
&= \alpha^2 (d_2 - p\alpha d_1^2)(d_2 - q\alpha d_1^2) \\
&= \alpha^2 \bigl(d_2^2 - \alpha d_2 (p + q) d_1^2 + \alpha^2 p q d_1^4\bigr).
\end{align*}

Thus $g_d'(0) = 0$, and for $d_2 \ne 0$ we have $g_d''(0) = 2 d_2^2 > 0$, which implies $g_d$ has a local minimum at zero. For $d_2 = 0$ we get $g_d(\alpha) = \alpha^4 p q d_1^4 \ge 0 = g_d(0)$, so $\alpha = 0$ is a minimum in this case as well. ∎
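The following NumPy sketch illustrates the counterexample numerically (the values $p = 1$, $q = 3$, $m = 2$ are illustrative choices): every directional restriction has a local minimum at $\alpha = 0$, yet $f$ is negative along the parabola $z = m y^2$ arbitrarily close to the origin.

```python
# Directional restrictions of f(y,z) = (z - p*y^2)(z - q*y^2) at the origin.
import numpy as np

p, q = 1.0, 3.0
f = lambda y, z: (z - p * y**2) * (z - q * y**2)

rng = np.random.default_rng(3)
for _ in range(100):
    d1, d2 = rng.normal(size=2)
    # a neighborhood of alpha = 0 small enough to see the local minimum
    a_max = 0.5 * abs(d2) / (q * d1**2 + 1e-12)
    alpha = np.linspace(-a_max, a_max, 201)
    g = f(alpha * d1, alpha * d2)
    assert g.argmin() == 100  # the middle grid point, i.e. alpha = 0

ys = np.array([1e-1, 1e-2, 1e-3])
print(f(ys, 2.0 * ys**2))  # negative although (y, 2y^2) -> (0, 0)
```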

      Exercise 6 (Bregman Divergence).

The Bregman divergence $D_f^{(B)}$ of a continuously differentiable function $f\colon \mathbb{R}^d \to \mathbb{R}$ is defined as the error of the linear approximation, and it is related to $\mu$-strong convexity and Lipschitz continuous gradients as follows:

\[
\frac{\mu}{2}\|x - x_0\|^2
\;\overset{\substack{\mu\text{-strongly convex}\\ \text{(definition)}}}{\le}\;
\underbrace{f(x) - f(x_0) - \langle \nabla f(x_0), x - x_0\rangle}_{\text{linear approximation error} \,=:\, D_f^{(B)}(x, x_0)}
\;\overset{\substack{L\text{-Lipschitz gradient}\\ \text{(Descent Lemma)}}}{\le}\;
\frac{L}{2}\|x - x_0\|^2.
\]

For $\mu = 0$ the lower bound is simply the convexity condition of Prop. A.1.8, so non-negativity of the Bregman divergence implies convexity. The $L$-Lipschitz gradient, on the other hand, provides an upper bound on the Bregman divergence, which immediately results in an upper bound on $f$:

\[
f(x) = f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + D_f^{(B)}(x, x_0)
\le f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + \frac{L}{2}\|x - x_0\|^2. \tag{1}
\]
(I) Prove that for functions $f$ with $L$-Lipschitz gradients we have, for all $x_0$,

\[
\min_x f(x) \le f(x_0) - \frac{1}{2L}\|\nabla f(x_0)\|^2,
\]

by minimizing the upper bound (1). What is the minimizer of the upper bound?

        Hint.

First minimize over the direction $x - x_0$ subject to the length $\|x - x_0\| = r$ being constant. Then minimize over $r$.

        Solution.

        We first solve the directional minimization problem

\begin{align*}
\operatorname*{arg\,min}_{x\colon \|x - x_0\| = r} f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + \frac{L}{2}\|x - x_0\|^2
&= x_0 + \operatorname*{arg\,min}_{d\colon \|d\| = r} \langle \nabla f(x_0), d\rangle \\
&= x_0 + \operatorname*{arg\,max}_{d\colon \|d\| = r} \langle -\nabla f(x_0), d\rangle \\
&= x_0 - r\,\frac{\nabla f(x_0)}{\|\nabla f(x_0)\|}, \tag{2}
\end{align*}

where the last equation holds because, by Cauchy-Schwarz,

\[
\langle -\nabla f(x_0), d\rangle \overset{\text{C.S.}}{\le} \|\nabla f(x_0)\|\, r,
\]

with equality for $d = -r\,\nabla f(x_0)/\|\nabla f(x_0)\|$.

        So in summary we have

\[
\min_x f(x) \;\le\; \min_r \min_{x\colon \|x - x_0\| = r} f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + \frac{L}{2}\|x - x_0\|^2
\;\overset{(2)}{=}\; \min_r\, f(x_0) - r\|\nabla f(x_0)\| + \frac{L}{2}r^2.
\]

Minimizing over the length $r$ means minimizing a convex parabola, so the first-order condition is sufficient, yielding

\[
r^* = \frac{\|\nabla f(x_0)\|}{L}.
\]

Reinserting $r^*$ into our upper bound yields the claim, and we obtain the minimizer by inserting $r^*$ into (2), resulting in a gradient descent step

\[
x^* = x_0 - \frac{1}{L}\nabla f(x_0).
\]
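The bound and the gradient descent step can be checked numerically on a quadratic, where $L$ is the largest Hessian eigenvalue and the true minimum is known in closed form (all concrete choices below are illustrative):

```python
# Check min_x f(x) <= f(x0) - ||grad f(x0)||^2 / (2L) on a quadratic.
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)          # Hessian of f
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient
b = rng.normal(size=5)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
f_min = f(np.linalg.solve(A, b))  # exact minimum of the quadratic

x0 = rng.normal(size=5)
g = grad(x0)
assert f_min <= f(x0) - g @ g / (2 * L) + 1e-9
x_star = x0 - g / L               # minimizer of the upper bound: a GD step
print(f(x_star) <= f(x0) - g @ g / (2 * L) + 1e-9)  # True by the above
```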
(II) Prove for convex functions $f$ with $L$-Lipschitz gradients that

\[
D_f^{(B)}(x, x_0) \ge \frac{1}{2L}\|\nabla f(x) - \nabla f(x_0)\|^2.
\]
        Hint.

Apply (I) to

\[
\phi(x) := D_f^{(B)}(x, x_0).
\]

Due to convexity you should already know the global minimum of $\phi$.

        Solution.

To apply (I), we want to prove that

\[
\phi(x) := D_f^{(B)}(x, x_0) = f(x) - f(x_0) - \langle \nabla f(x_0), x - x_0\rangle
\]

has an $L$-Lipschitz gradient. But the gradient of $\phi$ is given by

\[
\nabla\phi(x) = \nabla f(x) - \nabla f(x_0),
\]

which is merely a shift of the gradient of $f$ and therefore still $L$-Lipschitz. Due to convexity the Bregman divergence is non-negative, so $x_0$ is a global minimum of $\phi$ with $\phi(x_0) = 0$. So we have

\[
0 = D_f^{(B)}(x_0, x_0) = \min_y \phi(y) \overset{\text{(I)}}{\le} \phi(x) - \frac{1}{2L}\|\nabla\phi(x)\|^2.
\]

Reordering and inserting the definition we obtain our claim

\[
\frac{1}{2L}\|\nabla f(x) - \nabla f(x_0)\|^2 = \frac{1}{2L}\|\nabla\phi(x)\|^2 \le \phi(x) = D_f^{(B)}(x, x_0). \qquad ∎
\]
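Again a quick sanity check (illustrative, on a convex quadratic where the Bregman divergence and $L$ are explicit):

```python
# Check D_f(x, x0) >= ||grad f(x) - grad f(x0)||^2 / (2L) on 0.5 x^T A x.
import numpy as np

rng = np.random.default_rng(5)
B = rng.normal(size=(5, 5))
A = B @ B.T                       # convex quadratic f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
breg = lambda x, x0: f(x) - f(x0) - grad(x0) @ (x - x0)

for _ in range(1000):
    x, x0 = rng.normal(size=5), rng.normal(size=5)
    diff = grad(x) - grad(x0)
    assert breg(x, x0) >= diff @ diff / (2 * L) - 1e-9
print("Bregman lower bound verified")
```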
(III) Prove that for convex functions $f$ with $L$-Lipschitz gradients we have, for all $x, y$,

\[
\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2.
\]
        Hint.

Use (II) twice.

        Solution.

We simply apply (II) twice:

\begin{align*}
\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 &\le D_f^{(B)}(y, x) + D_f^{(B)}(x, y) \\
&= f(y) - f(x) - \langle \nabla f(x), y - x\rangle + f(x) - f(y) - \langle \nabla f(y), x - y\rangle \\
&= \langle \nabla f(x) - \nabla f(y), x - y\rangle. \qquad ∎
\end{align*}
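This cocoercivity inequality can also be spot-checked numerically (illustrative quadratic as before):

```python
# Check <grad f(x) - grad f(y), x - y> >= ||grad f(x) - grad f(y)||^2 / L.
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(5, 5))
A = B @ B.T                       # convex quadratic f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    diff = grad(x) - grad(y)
    assert diff @ (x - y) >= diff @ diff / L - 1e-9
print("cocoercivity verified")
```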
(IV) Prove for convex $f$ that the upper bound

\[
D_f^{(B)}(x, x_0) \le \frac{L}{2}\|x - x_0\|^2
\]

is sufficient for $L$-Lipschitz continuity of the gradient $\nabla f$.

        Hint.

Argue why you can use (III) without circular reasoning.

        Solution.

We deduced this upper bound in (I) and then never used the Lipschitz property again. In (II) we only needed it to apply (I), and in (III) we only needed it to apply (II). So our premise is sufficient for (III) without Lipschitz continuity of the gradient. By (III) and Cauchy-Schwarz we therefore get

\[
\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y\rangle \overset{\text{C.S.}}{\le} \|\nabla f(x) - \nabla f(y)\|\,\|x - y\|.
\]

Multiplying this inequality by $L/\|\nabla f(x) - \nabla f(y)\|$ results in the claim (if $\nabla f(x) = \nabla f(y)$ there is nothing to prove). ∎
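A minimal numerical illustration of the conclusion (again on an illustrative quadratic):

```python
# Check the Lipschitz bound ||grad f(x) - grad f(y)|| <= L * ||x - y||.
import numpy as np

rng = np.random.default_rng(7)
B = rng.normal(size=(5, 5))
A = B @ B.T
L = np.linalg.eigvalsh(A).max()
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
print("gradient is L-Lipschitz with L =", L)
```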

(V) In this last part we assume that $f$ is $\mu$-strongly convex (and its gradient $L$-Lipschitz). Prove for all $x, y$

\[
\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{\mu L}{L + \mu}\|x - y\|^2 + \frac{1}{L + \mu}\|\nabla f(x) - \nabla f(y)\|^2.
\]
        Hint.

Note that strong convexity provides an alternative lower bound to (II). Using this alternative lower bound we could directly obtain a modified version of (III). But we are greedy: we want to use both lower bounds for an even tighter bound on the scalar product. We therefore want to use

\[
g_x(y) := f(y) - \frac{\mu}{2}\|x - y\|^2
\]

to break down the Bregman divergence of $f$, so we can have our cake and eat it too:

\[
D_f^{(B)}(y, z) = \underbrace{D_{g_x}^{(B)}(y, z)}_{\text{use (II)}} + \underbrace{\frac{\mu}{2}\|y - z\|^2}_{\text{strong convexity}}.
\]

Remember to check convexity of $g_x$ and Lipschitz continuity of $\nabla g_x$ before applying (II). When applying (II), the selection $z = x$ might be helpful.

        Solution.

We define

\[
g_x(y) := f(y) - \frac{\mu}{2}\|x - y\|^2,
\]

for which we have

\[
\nabla g_x(y) = \nabla f(y) - \mu(y - x).
\]

        Therefore we have

\begin{align*}
D_{g_x}^{(B)}(y, z) &= g_x(y) - g_x(z) - \langle \nabla g_x(z), y - z\rangle \\
&= f(y) - \frac{\mu}{2}\|x - y\|^2 - f(z) + \frac{\mu}{2}\|x - z\|^2 - \bigl\langle \nabla f(z) - \mu(z - x), y - z\bigr\rangle \\
&= D_f^{(B)}(y, z) + \frac{\mu}{2}\underbrace{\bigl[\|x - z\|^2 - \|x - y\|^2 + 2\langle z - x, y - z\rangle\bigr]}_{=-\|z - y\|^2 \text{, expanding } \|x - y\|^2 = \|(x - z) + (z - y)\|^2} \\
&= D_f^{(B)}(y, z) - \frac{\mu}{2}\|z - y\|^2.
\end{align*}

Convexity of $g_x$ is given because $D_{g_x}^{(B)} \ge 0$, since $D_f^{(B)}(y, z) \ge \frac{\mu}{2}\|y - z\|^2$ by strong convexity. Similarly we obtain $(L - \mu)$-Lipschitz continuity of $\nabla g_x$ (we may assume $L > \mu$ here; for $L = \mu$ the claim follows by continuity) because

\[
D_{g_x}^{(B)}(y, z) = D_f^{(B)}(y, z) - \frac{\mu}{2}\|z - y\|^2 \le \frac{L - \mu}{2}\|z - y\|^2,
\]

which is sufficient by (IV). So we can apply (II) to $D_{g_x}^{(B)}$ with $z = x$ to obtain

\begin{align*}
D_f^{(B)}(y, x) &= D_{g_x}^{(B)}(y, x) + \frac{\mu}{2}\|x - y\|^2 \\
&\ge \frac{1}{2(L - \mu)}\|\nabla g_x(y) - \nabla g_x(x)\|^2 + \frac{\mu}{2}\|x - y\|^2 \\
&= \frac{1}{2(L - \mu)}\underbrace{\|\nabla f(y) - \mu(y - x) - \nabla f(x)\|^2}_{=\|(\nabla f(y) - \mu y) - (\nabla f(x) - \mu x)\|^2} + \frac{\mu}{2}\|x - y\|^2.
\end{align*}

This bound is symmetric in $x$ and $y$, so we can apply the same trick as in (III) to get

\begin{align*}
\langle \nabla f(y) - \nabla f(x), y - x\rangle &= D_f^{(B)}(y, x) + D_f^{(B)}(x, y) \\
&\ge \frac{1}{L - \mu}\underbrace{\|\nabla f(y) - \nabla f(x) - \mu(y - x)\|^2}_{=\|\nabla f(y) - \nabla f(x)\|^2 - 2\mu\langle \nabla f(y) - \nabla f(x),\, y - x\rangle + \mu^2\|y - x\|^2} + \mu\|x - y\|^2.
\end{align*}

        Collecting the scalar product on the left results in

\[
\underbrace{\Bigl(1 + \frac{2\mu}{L - \mu}\Bigr)}_{=\frac{L + \mu}{L - \mu}} \langle \nabla f(y) - \nabla f(x), y - x\rangle
\ge \frac{1}{L - \mu}\|\nabla f(y) - \nabla f(x)\|^2 + \Bigl(\frac{\mu^2}{L - \mu} + \mu\Bigr)\|x - y\|^2.
\]

Multiplying both sides by $\frac{L - \mu}{L + \mu}$ we finally get

\[
\langle \nabla f(y) - \nabla f(x), y - x\rangle
\ge \frac{1}{L + \mu}\|\nabla f(y) - \nabla f(x)\|^2 + \underbrace{\frac{\mu^2 + \mu(L - \mu)}{L + \mu}}_{=\frac{\mu L}{L + \mu}}\|x - y\|^2. \qquad ∎
\]
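As a final sanity check (not part of the proof), the interpolation bound can be verified on a quadratic whose extreme Hessian eigenvalues play the roles of $\mu$ and $L$; the random matrix is an illustrative choice.

```python
# Check <grad f(x)-grad f(y), x-y> >= mu*L/(mu+L)*||x-y||^2
#                                   + 1/(mu+L)*||grad f(x)-grad f(y)||^2.
import numpy as np

rng = np.random.default_rng(8)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)           # Hessian with mu = lambda_min, L = lambda_max
eig = np.linalg.eigvalsh(A)
mu, L = eig.min(), eig.max()
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    d, diff = x - y, grad(x) - grad(y)
    rhs = mu * L / (mu + L) * (d @ d) + (diff @ diff) / (mu + L)
    assert diff @ d >= rhs - 1e-9
print("interpolation bound verified")
```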