Sheet 1
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 02.03.2023
Due date: 12:00 in the exercise on Thursday 02.03.2023

Exercise 1 (Convex Examples).
Prove that the following functions are convex:
-
(i)
affine linear functions, i.e. f(x) = a^Tx + c for a ∈ ℝ^n, c ∈ ℝ.
Solution.
We have
f(λx + (1-λ)y) = a^T(λx + (1-λ)y) + c = λ(a^Tx + c) + (1-λ)(a^Ty + c) = λf(x) + (1-λ)f(y). ∎
(ii)
norms, i.e. f(x) = ∥x∥ for a norm ∥·∥ on ℝ^n.
Solution.
We have, by the triangle inequality and absolute homogeneity,
f(λx + (1-λ)y) = ∥λx + (1-λ)y∥ ≤ ∥λx∥ + ∥(1-λ)y∥ = λ∥x∥ + (1-λ)∥y∥ = λf(x) + (1-λ)f(y). ∎
-
(iii)
sums of convex functions f_1, …, f_k, i.e. f(x) = ∑_i f_i(x).
Solution.
We have
f(λx + (1-λ)y) = ∑_i f_i(λx + (1-λ)y) ≤ ∑_i [λf_i(x) + (1-λ)f_i(y)] = λf(x) + (1-λ)f(y). ∎
-
(iv)
f = max_i f_i for a finite set of convex functions f_1, …, f_k.
Solution.
We have
f(λx + (1-λ)y) = max_i f_i(λx + (1-λ)y) ≤ max_i [λf_i(x) + (1-λ)f_i(y)] ≤ λ max_i f_i(x) + (1-λ) max_i f_i(y) = λf(x) + (1-λ)f(y). ∎
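The four claims above can be spot-checked numerically. Below is a minimal sketch; the particular affine function, norm, sum, and maximum (and the random sampling) are illustrative choices, not part of the exercise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instances of the four families: an affine function, a norm,
# a sum of convex functions, and a pointwise maximum of convex functions.
a, c = rng.normal(size=3), 0.7
f_affine = lambda x: a @ x + c
f_norm = lambda x: np.linalg.norm(x)
f_sum = lambda x: f_affine(x) + f_norm(x)
f_max = lambda x: max(f_affine(x), f_norm(x))

def convexity_gap(f, x, y, lam):
    """lam*f(x) + (1-lam)*f(y) - f(lam*x + (1-lam)*y); non-negative if f is convex."""
    return lam * f(x) + (1 - lam) * f(y) - f(lam * x + (1 - lam) * y)

# Smallest gap found over random points and mixing weights.
min_gap = min(
    convexity_gap(f, rng.normal(size=3), rng.normal(size=3), rng.uniform())
    for f in (f_affine, f_norm, f_sum, f_max)
    for _ in range(100)
)
print(min_gap)
```

A negative `min_gap` (beyond floating-point noise) would witness a violated convexity inequality; none occurs here.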
Exercise 2 (Finite Jensen).
Let f be convex and λ_i ≥ 0 with ∑_{i=1}^n λ_i = 1. Prove
f(∑_{i=1}^n λ_i x_i) ≤ ∑_{i=1}^n λ_i f(x_i)
and deduce f(1/n ∑_{i=1}^n x_i) ≤ 1/n ∑_{i=1}^n f(x_i).
Solution.
For n = 2 this is simply the definition of convexity. Assume the claim holds for n-1 points. Then (for λ_n < 1, otherwise there is nothing to prove) the induction step is given by
f(∑_{i=1}^n λ_i x_i) = f(λ_n x_n + (1-λ_n) ∑_{i=1}^{n-1} λ_i/(1-λ_n) x_i) ≤ λ_n f(x_n) + (1-λ_n) ∑_{i=1}^{n-1} λ_i/(1-λ_n) f(x_i) = ∑_{i=1}^n λ_i f(x_i),
where it is easy to check that the weights λ_i/(1-λ_n) sum to one. For the second part select λ_i = 1/n, which is a valid choice of weights because ∑_{i=1}^n 1/n = 1. ∎
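The finite Jensen inequality admits a quick numerical sanity check. The quartic test function and the random weights below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda t: t ** 4          # a convex function on the reals

lam = rng.uniform(size=10)
lam /= lam.sum()              # non-negative weights summing to one
x = rng.normal(size=10)

lhs = f(lam @ x)              # f(sum_i lam_i x_i)
rhs = lam @ f(x)              # sum_i lam_i f(x_i)
print(lhs <= rhs)
```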
Exercise 3 (Strict & Strong Convexity).
Prove the following statements
-
(a)
μ-strong convexity implies strict convexity.
Solution.
Assuming x ≠ y and λ ∈ (0, 1), we have
f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) - μ/2 λ(1-λ)∥x-y∥^2,
where μ/2 λ(1-λ)∥x-y∥^2 > 0. But after moving the negative part to the left we are left with a condition which is equivalent to strict convexity by Proposition A.1.8 in the lecture. ∎
-
(b)
For twice differentiable f: ℝ^n → ℝ, the following are equivalent:
-
(I)
∇²f(x) - μI_n is positive semidefinite for all x
-
(II)
λ_min(∇²f(x)) ≥ μ for all x
-
(III)
f is μ-strongly convex
where I_n is the identity matrix and λ_min(∇²f(x)) denotes the smallest eigenvalue of the Hessian.
Solution.
Let us get (I) ⇔ (II) out of the way first. For any unit vector v we have
v^T(∇²f(x) - μI_n)v = v^T∇²f(x)v - μ ≥ λ_min(∇²f(x)) - μ.
If we have (II), then the rightmost side is non-negative for every unit vector v, and thus we also have (I). For the other direction we simply select v as the eigenvector corresponding to λ_min(∇²f(x)), so that the inequality becomes an equality, and (I) forces λ_min(∇²f(x)) - μ ≥ 0.
For (III) ⇒ (I) let
g(y) := f(y) - μ/2 ∥y∥^2.
Then, because f is μ-strongly convex,
g(y) = f(y) - μ/2 ∥y∥^2 ≥ f(x) + ⟨∇f(x), y-x⟩ + μ/2 ∥y-x∥^2 - μ/2 ∥y∥^2 = g(x) + ⟨∇g(x), y-x⟩,
where the last equation follows from ∇g(x) = ∇f(x) - μx by expanding the squared norms. So g is still convex, and we know by the lecture that
∇²g(x) = ∇²f(x) - μI_n
is positive semidefinite.
On the other hand, for (I) ⇒ (III): if ∇²f(x) - μI_n is positive semidefinite for all x, we know that ∇²g is positive semidefinite and thus that g is convex by the lecture. Reusing the equations above, with the convexity of g supplying the inequality, we obtain
f(y) - μ/2 ∥y∥^2 = g(y) ≥ g(x) + ⟨∇g(x), y-x⟩ = f(x) + ⟨∇f(x), y-x⟩ + μ/2 ∥y-x∥^2 - μ/2 ∥y∥^2.
Adding μ/2 ∥y∥^2 on both sides results in the definition of μ-strong convexity of f. ∎
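For a quadratic the equivalence can be checked directly, since the Hessian is constant. A small sketch; the random positive definite matrix is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic test function f(x) = 0.5 * x^T A x with constant Hessian A.
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                  # symmetric positive definite
mu = np.linalg.eigvalsh(A).min()         # (II): smallest eigenvalue of the Hessian

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# (III): mu-strong convexity, f(y) >= f(x) + <grad f(x), y-x> + mu/2 ||y-x||^2.
worst = min(
    f(y) - f(x) - grad(x) @ (y - x) - 0.5 * mu * (y - x) @ (y - x)
    for x, y in (rng.normal(size=(2, 4)) for _ in range(200))
)
print(worst)
```

A negative `worst` (beyond floating-point noise) would contradict μ-strong convexity with μ = λ_min(A).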
Exercise 4 (Convexity and Minima).
Prove the following statements
-
(I)
If f is convex, then every local minimum is also a global minimum.
Solution.
Let x be a local minimum and assume it were not global. Then there exists y such that
f(y) < f(x).
Select λ ∈ (0, 1) small enough such that λy + (1-λ)x is still in the neighborhood of x on which x is a local minimum. Then we obtain a contradiction:
f(λy + (1-λ)x) ≤ λf(y) + (1-λ)f(x) < λf(x) + (1-λ)f(x) = f(x). ∎
-
(II)
If f is strictly convex, then there exists at most one minimum.
Solution.
Assume x ≠ y were both minima, i.e. f(x) = f(y) = min f. Then strict convexity yields the contradiction
f(1/2 x + 1/2 y) < 1/2 f(x) + 1/2 f(y) = min f. ∎
-
(III)
If f is convex and differentiable and ∇f(x*) = 0, then x* is a minimum.
Solution.
By Proposition A.1.8 convexity is equivalent to
f(y) ≥ f(x) + ⟨∇f(x), y-x⟩ for all x, y.
Applying this with x = x* and ∇f(x*) = 0 yields f(y) ≥ f(x*) for all y. So x* is a (global) minimum. ∎
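Statement (III) in action: a minimal sketch for a shifted quadratic, which is an illustrative choice of convex differentiable function:

```python
import numpy as np

rng = np.random.default_rng(3)
b = rng.normal(size=5)

f = lambda x: 0.5 * (x - b) @ (x - b)   # convex and differentiable
grad = lambda x: x - b

x_star = b                              # the gradient vanishes exactly here
stationary = bool(np.allclose(grad(x_star), 0.0))
is_global_min = all(f(x_star) <= f(rng.normal(size=5)) for _ in range(100))
print(stationary, is_global_min)
```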
Exercise 5 (Directional Minima).
Let f: ℝ^n → ℝ be some differentiable function. For every direction d ∈ ℝ^n define
h_d(λ) := f(x* + λd).
Assume that for every d, h_d is minimized by λ = 0. Prove that
-
(I)
We have necessarily ∇f(x*) = 0.
Solution.
If we select a standard basis vector e_i as the direction, then
0 = h'_{e_i}(0) = ⟨∇f(x*), e_i⟩ = ∂f/∂x_i(x*).
So the i-th entry of ∇f(x*) is zero. As i was arbitrary, we are done. ∎
-
(II)
x* is not necessarily a minimum of f.
Hint.
Let x* = 0 and define
f(x_1, x_2) := (x_2 - x_1^2)(x_2 - 2x_1^2);
consider the points (x_1, 3/2 x_1^2) and prove that f(x_1, 3/2 x_1^2) < 0 for x_1 ≠ 0.
Solution.
Let x* = 0 and define
f(x_1, x_2) := (x_2 - x_1^2)(x_2 - 2x_1^2),
then for x_2 = 3/2 x_1^2 we have for x_1 ≠ 0
f(x_1, 3/2 x_1^2) = (3/2 x_1^2 - x_1^2)(3/2 x_1^2 - 2x_1^2) = -1/4 x_1^4 < 0.
This means that 0 is not a minimum, since we can get arbitrarily close to 0 with the points (x_1, 3/2 x_1^2) for smaller and smaller x_1.
Let d = (d_1, d_2) be an arbitrary direction. Then
h_d(λ) = f(λd_1, λd_2) = (λd_2 - λ^2 d_1^2)(λd_2 - 2λ^2 d_1^2) = λ^2 d_2^2 - 3λ^3 d_1^2 d_2 + 2λ^4 d_1^4.
Thus h_d(0) = 0, h_d'(0) = 0 and h_d''(0) = 2d_2^2 > 0 for d_2 ≠ 0, which implies h_d has a minimum in zero. For d_2 = 0 we have h_d(λ) = 2λ^4 d_1^4 ≥ 0 = h_d(0) directly. ∎
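Both claims of the counterexample can be verified numerically. A small sketch; the step sizes along each line are chosen small enough for the case analysis above to apply:

```python
import numpy as np

# The two-dimensional counterexample from the solution.
f = lambda x1, x2: (x2 - x1**2) * (x2 - 2 * x1**2)

# Along the curve x2 = 3/2 x1^2 the function is negative arbitrarily close to 0, ...
curve_negative = all(f(t, 1.5 * t**2) < 0 for t in (0.1, 0.01, 0.001))

# ... yet along every line through the origin, small steps never go below f(0, 0) = 0.
rng = np.random.default_rng(4)
line_nonnegative = True
for _ in range(50):
    d1, d2 = rng.normal(size=2)
    # Step small enough that lam * d1^2 < d2 / 2 whenever d2 > 0.
    lam = abs(d2) / (4 * d1**2) if d1 != 0 and d2 != 0 else 1e-3
    line_nonnegative &= f(lam * d1, lam * d2) >= 0.0
print(curve_negative, line_nonnegative)
```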
Exercise 6 (Bregman Divergence).
The Bregman divergence of a continuously differentiable function f is defined as the error of the linear approximation
D^(B)_f(y, x) := f(y) - f(x) - ⟨∇f(x), y-x⟩
and is related to μ-strong convexity and L-Lipschitz continuous gradients as follows:
μ/2 ∥y-x∥^2 ≤ D^(B)_f(y, x) ≤ L/2 ∥y-x∥^2.
For μ = 0 the lower bound is simply the convexity condition by Prop. A.1.8, so non-negativity of the Bregman divergence implies convexity. The L-Lipschitz gradients, on the other hand, provide us with an upper bound on the Bregman divergence, which immediately results in an upper bound on f(y):
f(y) ≤ f(x) + ⟨∇f(x), y-x⟩ + L/2 ∥y-x∥^2. (1)
-
(I)
Prove for functions f with L-Lipschitz gradients that we have for all x
inf_y f(y) ≤ f(x) - 1/(2L) ∥∇f(x)∥^2
by minimizing the upper bound (1). What is the minimizer of the upper bound?
Hint.
First minimize over the direction d subject to the length ∥d∥ = r being constant. Then minimize over the length r.
Solution.
We first solve the directional minimization problem: among all d with fixed length ∥d∥ = r, the scalar product ⟨∇f(x), d⟩ is minimized by
d = -r ∇f(x)/∥∇f(x)∥, (2)
where this is true because by Cauchy-Schwarz
⟨∇f(x), d⟩ ≥ -∥∇f(x)∥ ∥d∥ = -r ∥∇f(x)∥,
with equality for the d above. So in summary we have
inf_y f(y) ≤ min_{r ≥ 0} [f(x) - r ∥∇f(x)∥ + L/2 r^2].
Minimizing over the length r means minimizing a convex parabola, so the first-order condition is sufficient, yielding
r = ∥∇f(x)∥/L.
Reinserting r into our upper bound yields the claim, and we get the minimizer by inserting r into (2), resulting in a gradient descent step
y = x - 1/L ∇f(x). ∎
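The resulting descent guarantee can be sanity-checked on a quadratic. A minimal sketch; the matrix and the starting point are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)               # Hessian of the test quadratic
L = np.linalg.eigvalsh(A).max()       # its gradient is L-Lipschitz

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.normal(size=4)
y = x - grad(x) / L                   # the gradient descent step from the solution
slack = f(x) - grad(x) @ grad(x) / (2 * L) - f(y)
print(slack)                          # non-negative: f(y) <= f(x) - ||grad f(x)||^2 / (2L)
```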
-
(II)
Prove for convex functions f with L-Lipschitz gradients
D^(B)_f(y, x) ≥ 1/(2L) ∥∇f(y) - ∇f(x)∥^2.
Hint.
Apply (I) to the shifted function z ↦ f(z) - ⟨∇f(x), z⟩.
Solution.
To apply (I), we want to prove that
g(z) := f(z) - ⟨∇f(x), z⟩
has an L-Lipschitz gradient. But the gradient of g is given by
∇g(z) = ∇f(z) - ∇f(x),
which is merely a shift of the gradient of f and therefore still L-Lipschitz. Due to convexity of f we know the Bregman divergence D^(B)_g = D^(B)_f is non-negative, so g is convex as well, and ∇g(x) = 0 makes x a global minimum of g. So we have by (I)
g(x) = inf_z g(z) ≤ g(y) - 1/(2L) ∥∇g(y)∥^2.
Reordering and inserting the definitions we obtain our claim:
D^(B)_f(y, x) = g(y) - g(x) ≥ 1/(2L) ∥∇f(y) - ∇f(x)∥^2. ∎
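This lower bound on the Bregman divergence can also be checked numerically for a convex quadratic; the rank-deficient test matrix below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 3))
A = M @ M.T                            # positive semidefinite: f is convex
L = np.linalg.eigvalsh(A).max()        # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

worst = min(
    (f(y) - f(x) - grad(x) @ (y - x))                       # Bregman divergence
    - (grad(y) - grad(x)) @ (grad(y) - grad(x)) / (2 * L)   # claimed lower bound
    for x, y in (rng.normal(size=(2, 4)) for _ in range(200))
)
print(worst)
```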
-
(III)
Prove for convex functions f with L-Lipschitz gradients that we have for all x, y
⟨∇f(y) - ∇f(x), y-x⟩ ≥ 1/L ∥∇f(y) - ∇f(x)∥^2.
Hint.
Use (II) twice.
Solution.
We simply apply (II) twice:
1/L ∥∇f(y) - ∇f(x)∥^2 = 1/(2L) ∥∇f(y) - ∇f(x)∥^2 + 1/(2L) ∥∇f(x) - ∇f(y)∥^2 ≤ D^(B)_f(y, x) + D^(B)_f(x, y) = ⟨∇f(y) - ∇f(x), y-x⟩. ∎
-
(IV)
Prove for convex f that the upper bound
f(y) ≤ f(x) + ⟨∇f(x), y-x⟩ + L/2 ∥y-x∥^2
is sufficient for L-Lipschitz continuity of the gradient ∇f.
Hint.
Argue why you can use (III) without circular reasoning.
Solution.
We deduced the claim of (I) from this upper bound and then never used the Lipschitz property again. In (II) we only needed the Lipschitz property to apply (I), and in (III) we only needed it to apply (II). So our premise is sufficient for (III) without Lipschitz continuity of the gradient. So by (III) and Cauchy-Schwarz we get
1/L ∥∇f(y) - ∇f(x)∥^2 ≤ ⟨∇f(y) - ∇f(x), y-x⟩ ≤ ∥∇f(y) - ∇f(x)∥ ∥y-x∥.
Multiplying this inequality by L/∥∇f(y) - ∇f(x)∥ results in the claim (for ∇f(y) = ∇f(x) the claim is trivial). ∎
-
(V)
In this last part we want to assume f is μ-strongly convex (and its gradient L-Lipschitz with μ < L). Prove for all x, y
⟨∇f(y) - ∇f(x), y-x⟩ ≥ μL/(μ+L) ∥y-x∥^2 + 1/(μ+L) ∥∇f(y) - ∇f(x)∥^2.
Hint.
Note that strong convexity provides an alternative lower bound to (II). Using this alternative lower bound we could directly obtain a modified version of (III). But we are greedy: we want to use both lower bounds for an even tighter bound on the scalar product. We therefore want to use
g_x(y) := f(y) - μ/2 ∥y-x∥^2
to break down the Bregman divergence of f, so we can have our cake and eat it:
D^(B)_f(y, x) = D^(B)_{g_x}(y, x) + μ/2 ∥y-x∥^2.
Remember to check convexity of g_x and Lipschitz continuity of ∇g_x before applying (II). When applying (II), the selection L - μ for the Lipschitz constant might be helpful.
Solution.
We define
g_x(y) := f(y) - μ/2 ∥y-x∥^2,
for which we have
∇g_x(y) = ∇f(y) - μ(y-x), and in particular g_x(x) = f(x) and ∇g_x(x) = ∇f(x).
Therefore we have
D^(B)_{g_x}(y, x) = g_x(y) - g_x(x) - ⟨∇g_x(x), y-x⟩ = D^(B)_f(y, x) - μ/2 ∥y-x∥^2.
Convexity of g_x is a given because ∇²g_x = ∇²f - μI_n ⪰ 0, since f is μ-strongly convex (Exercise 3). Similarly we obtain (L-μ)-Lipschitz continuity of ∇g_x because
∇²g_x = ∇²f - μI_n ⪯ (L-μ)I_n.
So we can apply (II) to g_x with Lipschitz constant L - μ to obtain
D^(B)_f(y, x) = D^(B)_{g_x}(y, x) + μ/2 ∥y-x∥^2 ≥ 1/(2(L-μ)) ∥∇g_x(y) - ∇g_x(x)∥^2 + μ/2 ∥x-y∥^2,
which is symmetric in x and y, so we can apply the same trick as in (III) to get
⟨∇f(y) - ∇f(x), y-x⟩ = D^(B)_f(y, x) + D^(B)_f(x, y) ≥ 1/(L-μ) ∥∇f(y) - ∇f(x) - μ(y-x)∥^2 + μ ∥x-y∥^2.
Expanding the square and collecting the scalar product on the left results in
(L+μ)/(L-μ) ⟨∇f(y) - ∇f(x), y-x⟩ ≥ 1/(L-μ) ∥∇f(y) - ∇f(x)∥^2 + μL/(L-μ) ∥y-x∥^2.
Multiplying both sides by (L-μ)/(L+μ) we finally get
⟨∇f(y) - ∇f(x), y-x⟩ ≥ 1/(μ+L) ∥∇f(y) - ∇f(x)∥^2 + μL/(μ+L) ∥y-x∥^2. ∎
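Finally, the combined bound of (V) can be sanity-checked on a strongly convex quadratic, where μ and L are the extreme eigenvalues of the (constant) Hessian; the random matrix is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
eig = np.linalg.eigvalsh(A)
mu, L = eig.min(), eig.max()          # strong convexity / Lipschitz constants

grad = lambda x: A @ x

# Slack of <grad f(y) - grad f(x), y - x>
#   >= mu*L/(mu+L) ||y-x||^2 + 1/(mu+L) ||grad f(y) - grad f(x)||^2.
worst = min(
    (grad(y) - grad(x)) @ (y - x)
    - mu * L / (mu + L) * (y - x) @ (y - x)
    - (grad(y) - grad(x)) @ (grad(y) - grad(x)) / (mu + L)
    for x, y in (rng.normal(size=(2, 4)) for _ in range(200))
)
print(worst)
```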