Sheet 2
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 16.03.2023
Due date: 12:00 in the exercise on Thursday 16.03.2023

Exercise 1 (Descent Directions of a Maximum).
Let $x^*$ be a strict local maximum of $f : \mathbb{R}^n \to \mathbb{R}$. Prove that every $d \in \mathbb{R}^n \setminus \{0\}$ is a descent direction of $f$ in $x^*$.
Solution.
Let $\epsilon > 0$ be such that $x^*$ is a strict maximum of $f$ in $B_\epsilon(x^*)$, where the existence of such an $\epsilon$ is precisely the strict local maximum property. We now have for any direction $d \neq 0$ that it is a descent direction, because with $\bar{\alpha} := \frac{\epsilon}{\|d\|}$ we have for all $\alpha \in (0, \bar{\alpha}]$
$$ f(x^* + \alpha d) < f(x^*), $$
since $x^* + \alpha d \in B_\epsilon(x^*) \setminus \{x^*\}$, as $\|\alpha d\| = \alpha \|d\| \le \epsilon$ and $\alpha d \neq 0$. ∎
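As a quick numerical illustration (not part of the original sheet), the following Python sketch checks the statement on the hypothetical example $f(x) = -\|x\|^2$ with strict local maximum $x^* = 0$; the function and the sampled directions are purely illustrative.

import numpy as np

f = lambda x: -np.dot(x, x)              # toy function with a strict local maximum at x* = 0
x_star = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(5):
    d = rng.normal(size=3)               # an arbitrary nonzero direction
    alphas = np.linspace(1e-3, 1e-1, 10) # small positive step lengths
    assert all(f(x_star + a * d) < f(x_star) for a in alphas)
print("every tested direction decreases f near the maximum")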
Exercise 2 (Convergence to Stationary Point).
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuously differentiable function.
(i) Let $(x_i)_{i \in \mathbb{N}}$ be defined by gradient descent
$$ x_{i+1} = x_i - \alpha_i \nabla f(x_i) $$
with diminishing step sizes $\alpha_i > 0$ such that $\sum_{i=0}^{\infty} \alpha_i = \infty$. Suppose that $(x_i)_{i \in \mathbb{N}}$ converges to some $x_* \in \mathbb{R}^n$. Prove that $x_*$ is a stationary point of $f$, i.e. $\nabla f(x_*) = 0$.
Hint.
You might want to prove, for $i, j$ large enough, a lower bound on $\langle \nabla f(x_i), \nabla f(x_j) \rangle$ in terms of $\|\nabla f(x_*)\|$.
Solution.
For any $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that for all $i, j \ge N$ we have, by Cauchy-Schwarz and convergence of $\nabla f(x_i)$ to $\nabla f(x_*)$ due to continuity of $\nabla f$,
$$ \langle \nabla f(x_i), \nabla f(x_j) \rangle = \|\nabla f(x_*)\|^2 + \langle \nabla f(x_i) - \nabla f(x_*), \nabla f(x_*) \rangle + \langle \nabla f(x_i), \nabla f(x_j) - \nabla f(x_*) \rangle $$
$$ \ge \|\nabla f(x_*)\|^2 - \underbrace{\|\nabla f(x_i) - \nabla f(x_*)\|}_{\le \epsilon}\, \|\nabla f(x_*)\| - \underbrace{\|\nabla f(x_i)\|}_{\le \|\nabla f(x_*)\| + \epsilon}\, \underbrace{\|\nabla f(x_j) - \nabla f(x_*)\|}_{\le \epsilon}. $$
This results in
$$ \langle \nabla f(x_i), \nabla f(x_j) \rangle \ge \|\nabla f(x_*)\|^2 - \epsilon\,\big(2 \|\nabla f(x_*)\| + \epsilon\big) =: c_\epsilon \qquad \text{for all } i, j \ge N, $$
and therefore, for $n > N$,
$$ \langle x_N - x_n, \nabla f(x_N) \rangle = \sum_{i=N}^{n-1} \alpha_i \,\langle \nabla f(x_i), \nabla f(x_N) \rangle \ge c_\epsilon \sum_{i=N}^{n-1} \alpha_i. $$
Taking the limit over $n \to \infty$ results in a finite left-hand side $\langle x_N - x_*, \nabla f(x_N) \rangle$, while $\sum_{i=N}^{\infty} \alpha_i = \infty$. So we necessarily need $c_\epsilon \le 0$, i.e. $\|\nabla f(x_*)\|^2 \le \epsilon\,\big(2 \|\nabla f(x_*)\| + \epsilon\big)$. But as $\epsilon > 0$ was arbitrary, we have $\nabla f(x_*) = 0$.
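The following small Python sketch (illustrative only, not from the sheet) runs gradient descent with diminishing step sizes $\alpha_i = \frac{1}{i+2}$, so that $\sum_i \alpha_i = \infty$, on the toy example $f(x) = \frac{1}{2}\|x\|^2$; the gradient norm at the final iterate is tiny, matching the claim that the limit point is stationary.

import numpy as np

grad = lambda x: x                        # gradient of the toy function f(x) = 0.5 * ||x||^2
x = np.array([5.0, -3.0])
for i in range(100000):
    x = x - grad(x) / (i + 2)             # diminishing step sizes alpha_i = 1/(i+2), sum diverges
print(np.linalg.norm(grad(x)))            # tiny: the limit point is stationary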
(ii) Assume that $f$ is also $L$-smooth. Prove that for $(x_i)_{i \in \mathbb{N}}$ generated by gradient descent with constant step size $\alpha \in (0, \frac{2}{L})$ we have
$$ \min_{0 \le i \le n} \|\nabla f(x_i)\|^2 \le \frac{f(x_0) - f(x_{n+1})}{(n+1)\, \alpha \left(1 - \frac{L \alpha}{2}\right)} $$
for any $n \in \mathbb{N}$. Deduce for the case $\alpha = \frac{1}{L}$, that we have
$$ \min_{0 \le i \le n} \|\nabla f(x_i)\| \le \sqrt{\frac{2 L \big( f(x_0) - f(x_{n+1}) \big)}{n+1}}. $$
Hint.
Consider the minimizer from Sheet 1, Ex. 6(i).
Solution.
By $L$-smoothness of $f$, we have
$$ f(x_{i+1}) \le f(x_i) + \langle \nabla f(x_i), x_{i+1} - x_i \rangle + \frac{L}{2} \|x_{i+1} - x_i\|^2 $$
and therefore, inserting $x_{i+1} - x_i = -\alpha \nabla f(x_i)$,
$$ f(x_{i+1}) \le f(x_i) - \alpha \left(1 - \frac{L \alpha}{2}\right) \|\nabla f(x_i)\|^2. $$
Now $(f(x_i))_{i \in \mathbb{N}}$ is non-increasing, and summing this inequality over $i = 0, \dots, n$ telescopes the function values, therefore
$$ (n+1)\, \alpha \left(1 - \frac{L \alpha}{2}\right) \min_{0 \le i \le n} \|\nabla f(x_i)\|^2 \le \alpha \left(1 - \frac{L \alpha}{2}\right) \sum_{i=0}^{n} \|\nabla f(x_i)\|^2 \le f(x_0) - f(x_{n+1}). $$
Thus $\min_{0 \le i \le n} \|\nabla f(x_i)\|^2 \le \frac{f(x_0) - f(x_{n+1})}{(n+1)\, \alpha (1 - \frac{L \alpha}{2})}$, and for $\alpha = \frac{1}{L}$ this gives the claimed bound $\min_{0 \le i \le n} \|\nabla f(x_i)\| \le \sqrt{\frac{2L (f(x_0) - f(x_{n+1}))}{n+1}}$. And we can simply bound the odd elements of the sequence
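As a sanity check of the $\alpha = \frac{1}{L}$ bound, here is a small Python sketch on a randomly generated quadratic (the matrix, dimension and iteration count are arbitrary illustrative choices, not from the sheet); it verifies numerically that $\min_{0 \le i \le n} \|\nabla f(x_i)\| \le \sqrt{2L(f(x_0) - f(x_{n+1}))/(n+1)}$.

import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)                   # random symmetric positive definite matrix
L = np.linalg.eigvalsh(A).max()           # smoothness constant of f(x) = 0.5 * x^T A x
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

n = 50
xs = [rng.normal(size=5)]
for _ in range(n + 1):
    xs.append(xs[-1] - grad(xs[-1]) / L)  # gradient descent with constant step size 1/L
min_grad = min(np.linalg.norm(grad(xi)) for xi in xs[: n + 1])
bound = np.sqrt(2 * L * (f(xs[0]) - f(xs[n + 1])) / (n + 1))
print(min_grad, "<=", bound)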
Exercise 3 (Optimizing Quadratic Functions).
In this exercise we consider functions of the type
$$ f(x) = \frac{1}{2} x^T A x + b^T x + c, $$
where $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$ and $c \in \mathbb{R}$.
(a) Let the symmetric part $\tilde{A} := \frac{1}{2}(A + A^T)$ of $A$ be invertible. Prove that $f$ can be written in the forms
$$ f(x) = \frac{1}{2} (x - x_*)^T A \,(x - x_*) + c_* \qquad (1) $$
$$ f(x) = \frac{1}{2} (x - x_*)^T \tilde{A} \,(x - x_*) + c_* \qquad (2) $$
for some $x_* \in \mathbb{R}^d$ and $c_* \in \mathbb{R}$. Argue that $\tilde{A}$ is always symmetric. Under which circumstances is $x_*$ a minimum?
Solution.
We want
$$ f(x) = \frac{1}{2}(x - x_*)^T A (x - x_*) + c_* $$
for some $x_* \in \mathbb{R}^d$ and $c_* \in \mathbb{R}$. Expanding the right-hand side gives
$$ \frac{1}{2}(x - x_*)^T A (x - x_*) + c_* = \frac{1}{2} x^T A x - (\tilde{A} x_*)^T x + \frac{1}{2} x_*^T A x_* + c_*. $$
So we simply select
$$ x_* := -\tilde{A}^{-1} b \qquad \text{and} \qquad c_* := c - \frac{1}{2} x_*^T A x_*. $$
This proves our first representation (1). For (2) we simply note that for every $y \in \mathbb{R}^d$
$$ y^T A y = (y^T A y)^T = y^T A^T y, \qquad \text{hence} \qquad y^T A y = \frac{1}{2} y^T (A + A^T) y = y^T \tilde{A} y. $$
Applying this to $(x - x_*)^T A (x - x_*)$ in (1) we are done.
Symmetry of $\tilde{A}$ follows directly from its definition as $\frac{1}{2}(A + A^T)$: $\tilde{A}^T = \frac{1}{2}(A^T + A) = \tilde{A}$.
Now $x_*$ is a minimum iff $\tilde{A}$ is positive definite. If it is positive definite, then $x_*$ is a minimum by (2) and the lecture. If $\tilde{A}$ is not positive definite, we need to show that there exists some direction $v \neq 0$ such that $f(x_* + t v) < f(x_*)$ for all $t > 0$. Since $\tilde{A}$ is not positive definite (and is invertible by assumption), it has a negative eigenvalue, so there exists some $v \neq 0$ with $v^T \tilde{A} v < 0$;
define $x_t := x_* + t v$. Then by (2)
$$ f(x_t) = \frac{t^2}{2}\, v^T \tilde{A} v + c_* < c_* = f(x_*) \qquad \text{for all } t > 0, $$
so $x_*$ is not a minimum.
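A short numerical sketch of representation (2) (illustrative only; the randomly drawn $A$, $b$, $c$ are not part of the exercise): it forms $\tilde{A}$, $x_*$ and $c_*$ as in the solution above and checks that both expressions for $f$ agree at a random point.

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))               # deliberately not symmetric
b = rng.normal(size=4)
c = 1.5
A_tilde = 0.5 * (A + A.T)                 # symmetric part of A
x_star = -np.linalg.solve(A_tilde, b)     # the point x_* chosen in the solution
c_star = c - 0.5 * x_star @ A @ x_star    # the constant c_*

f = lambda x: 0.5 * x @ A @ x + b @ x + c
g = lambda x: 0.5 * (x - x_star) @ A_tilde @ (x - x_star) + c_star   # representation (2)

x = rng.normal(size=4)
print(np.isclose(f(x), g(x)))             # True: both representations agree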
(b) Argue that the Newton method (with step size $1$) applied to $f$ would jump to $x_*$ in one step and then stop moving.
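A minimal Python sketch of this claim (again with arbitrary random data, not from the sheet): one Newton step $x - (\nabla^2 f(x))^{-1} \nabla f(x)$ on the quadratic lands exactly at $x_*$, where the gradient vanishes, so the next step moves nowhere.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
b = rng.normal(size=4)
A_tilde = 0.5 * (A + A.T)                      # Hessian of f(x) = 0.5 x^T A x + b^T x + c
x_star = -np.linalg.solve(A_tilde, b)          # stationary point from (a)
grad = lambda x: A_tilde @ x + b               # gradient of f

x = rng.normal(size=4)
x_new = x - np.linalg.solve(A_tilde, grad(x))  # one Newton step with step size 1
print(np.allclose(x_new, x_star), np.allclose(grad(x_new), 0.0))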
(c) Let $v_1, \dots, v_d$ be an orthonormal basis of $\mathbb{R}^d$ such that
$$ \tilde{A} v_k = \lambda_k v_k, \qquad k = 1, \dots, d, $$
with $0 < \lambda_1 \le \dots \le \lambda_d$, and write
$$ x_i - x_* = \sum_{k=1}^{d} c_k^{(i)} v_k. $$
Express $c_k^{(i)}$ in terms of $c_k^{(0)}$, where $(x_i)_{i \in \mathbb{N}}$ is given by the gradient descent recursion
$$ x_{i+1} = x_i - \alpha \nabla f(x_i). $$
For which step sizes $\alpha$ do all the components $c_k^{(i)}$ converge to zero? Which component has the slowest convergence speed? Find the optimal learning rate $\alpha^*$ and deduce for this learning rate
$$ \|x_i - x_*\| \le \left( \frac{\kappa - 1}{\kappa + 1} \right)^{i} \|x_0 - x_*\| $$
with the condition number $\kappa := \frac{\lambda_d}{\lambda_1}$.
Solution.
Using the representation $\nabla f(x) = \tilde{A}\,(x - x_*)$ of the gradient again and subtracting $x_*$ from our recursion, we get
$$ x_{i+1} - x_* = (x_i - x_*) - \alpha \tilde{A}\,(x_i - x_*) = (I - \alpha \tilde{A})(x_i - x_*). $$
Therefore
$$ c_k^{(i+1)} = \langle x_{i+1} - x_*, v_k \rangle = (1 - \alpha \lambda_k)\, c_k^{(i)}, \qquad \text{and hence} \qquad c_k^{(i)} = (1 - \alpha \lambda_k)^{i}\, c_k^{(0)}. $$
In particular, all components converge to zero iff $|1 - \alpha \lambda_k| < 1$ for every $k$, i.e. iff $0 < \alpha < \frac{2}{\lambda_d}$, and the slowest component is the one with the largest contraction factor $|1 - \alpha \lambda_k|$.
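To illustrate (c) numerically, the following sketch (random data, illustrative only, not from the sheet) checks the componentwise contraction $c_k^{(i+1)} = (1 - \alpha \lambda_k)\, c_k^{(i)}$ and that for the step size $\alpha = \frac{2}{\lambda_1 + \lambda_d}$, which balances the extreme eigenvalues, the worst contraction factor equals $\frac{\kappa - 1}{\kappa + 1}$.

import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5))
A_tilde = M @ M.T + np.eye(5)                # symmetric positive definite, so 0 < lambda_1 <= ... <= lambda_d
lams, V = np.linalg.eigh(A_tilde)            # eigenvalues and an orthonormal eigenbasis
b = rng.normal(size=5)
x_star = -np.linalg.solve(A_tilde, b)
grad = lambda x: A_tilde @ x + b

alpha = 2.0 / (lams.min() + lams.max())      # step size balancing the extreme eigenvalues
kappa = lams.max() / lams.min()              # condition number

x = rng.normal(size=5)
c_old = V.T @ (x - x_star)                   # error components in the eigenbasis
x = x - alpha * grad(x)                      # one gradient descent step
c_new = V.T @ (x - x_star)
print(np.allclose(c_new, (1 - alpha * lams) * c_old))                         # componentwise contraction
print(np.isclose(np.abs(1 - alpha * lams).max(), (kappa - 1) / (kappa + 1)))  # worst factor = (kappa-1)/(kappa+1)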