Sheet 2

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 16.03.2023
Due date: 12:00 in the exercise on Thursday, 16.03.2023
Exercise 1 (Descent Directions of a Maximum).

Let $x_* \in \mathbb{R}^d$ be a strict local maximum of $f: \mathbb{R}^d \to \mathbb{R}$. Prove that every $d \in \mathbb{R}^d \setminus \{0\}$ is a descent direction of $f$ in $x_*$.

Solution.

Let $\epsilon > 0$ be such that $x_*$ is a strict maximum on $B_\epsilon(x_*) \setminus \{x_*\}$; the existence of such an $\epsilon$ is precisely the strict local maximum property. Any direction $d \neq 0$ is then a descent direction, because with $\bar{\alpha} = \frac{\epsilon}{\|d\|} > 0$ we have for all $\alpha \in (0, \bar{\alpha}]$

\[
f(x_* + \alpha d) < f(x_*),
\]

since $x_* + \alpha d \in B_\epsilon(x_*) \setminus \{x_*\}$ as $\|\alpha d\| \le \bar{\alpha} \|d\| = \epsilon$. ∎
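A minimal numerical sketch of this statement (my own illustration, not part of the sheet): take the strict local maximum $x_* = 0$ of $f(x) = -\|x\|^2$ and check $f(x_* + \alpha d) < f(x_*)$ for random directions $d$ and all step sizes up to $\bar{\alpha} = \epsilon / \|d\|$.

```python
import numpy as np

# Illustration only: f(x) = -||x||^2 has a strict local maximum at x_* = 0,
# so every nonzero direction d should be a descent direction there.
def f(x):
    return -np.dot(x, x)

rng = np.random.default_rng(0)
x_star = np.zeros(3)
eps = 1.0  # radius of the ball from the proof

for _ in range(5):
    d = rng.normal(size=3)                     # arbitrary nonzero direction
    alpha_bar = eps / np.linalg.norm(d)        # step-size threshold from the proof
    for alpha in np.linspace(1e-6, alpha_bar, 10):
        assert f(x_star + alpha * d) < f(x_star)
print("all sampled directions are descent directions at the maximum")
```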

Exercise 2 (Convergence to Stationary Point).

Let $f: \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function.

(i)

Let $(x_k)_{k \in \mathbb{N}}$ be defined by gradient descent

\[
x_{k+1} = x_k - \alpha_k \nabla f(x_k), \qquad x_0 \in \mathbb{R}^d,
\]

with diminishing step sizes $\alpha_k > 0$ such that $\sum_{k=1}^{\infty} \alpha_k = \infty$. Suppose that $(x_k)_{k \in \mathbb{N}}$ converges to some $x_* \in \mathbb{R}^d$. Prove that $x_*$ is a stationary point of $f$, i.e. $\nabla f(x_*) = 0$.

    Hint.

You might want to prove for large enough $i, j$

\[
\langle \nabla f(x_i), \nabla f(x_j) \rangle \ge \|\nabla f(x_*)\|^2 - 2\epsilon \|\nabla f(x_*)\| - \epsilon^2 =: p(\epsilon).
\]
    Solution.

For any $\epsilon > 0$ there exists $n \in \mathbb{N}$ such that for all $i, j \ge n$ we have, by Cauchy-Schwarz and the convergence of $\nabla f(x_i)$ to $\nabla f(x_*)$ (due to continuity of $\nabla f$),

\begin{align*}
\langle \nabla f(x_i), \nabla f(x_j) \rangle
&= \|\nabla f(x_*)\|^2 + \langle \nabla f(x_i) - \nabla f(x_*), \nabla f(x_*) \rangle + \langle \nabla f(x_i), \nabla f(x_j) - \nabla f(x_*) \rangle \\
&\ge \|\nabla f(x_*)\|^2 - \underbrace{\|\nabla f(x_i) - \nabla f(x_*)\|}_{\le \epsilon} \|\nabla f(x_*)\| - \underbrace{\|\nabla f(x_i)\|}_{\le \|\nabla f(x_*)\| + \epsilon} \underbrace{\|\nabla f(x_j) - \nabla f(x_*)\|}_{\le \epsilon} \\
&\ge \|\nabla f(x_*)\|^2 - 2\epsilon \|\nabla f(x_*)\| - \epsilon^2 =: p(\epsilon).
\end{align*}

    This results in

\begin{align*}
\|x_n - x_m\|^2 &= \Bigl\| \sum_{k=n}^{m-1} \alpha_k \nabla f(x_k) \Bigr\|^2 = \sum_{i,j=n}^{m-1} \alpha_i \alpha_j \langle \nabla f(x_i), \nabla f(x_j) \rangle \\
&\ge \Bigl( \sum_{k=n}^{m-1} \alpha_k \Bigr)^2 p(\epsilon).
\end{align*}

Taking the limit $m \to \infty$ results in

\[
\infty > \|x_n - x_*\|^2 \ge \underbrace{\Bigl( \sum_{k=n}^{\infty} \alpha_k \Bigr)^2}_{=\infty} p(\epsilon).
\]

So we necessarily need $p(\epsilon) \le 0$. But as $\epsilon > 0$ was arbitrary, we have

\[
0 \le \|\nabla f(x_*)\|^2 = \lim_{\epsilon \downarrow 0} p(\epsilon) \le 0,
\]

i.e. $\nabla f(x_*) = 0$. ∎
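As a small numerical illustration of (i) (my own sketch; the test function and the step sizes $\alpha_k = 1/(k+1)$ are assumptions, not part of the exercise), gradient descent with diminishing step sizes whose sum diverges drives the iterates to a point with vanishing gradient:

```python
import numpy as np

# Gradient descent with diminishing step sizes alpha_k = 1/(k+1),
# illustrated on the smooth test function f(x) = log(1 + ||x||^2).
def grad_f(x):
    return 2 * x / (1 + np.dot(x, x))

x = np.array([3.0, -2.0])
for k in range(20000):
    alpha_k = 1.0 / (k + 1)          # sum over k diverges (harmonic series)
    x = x - alpha_k * grad_f(x)

print("limit point (approx.):", x)
print("gradient norm there:", np.linalg.norm(grad_f(x)))  # close to 0
```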
(ii)

Assume that $f$ is also $L$-smooth. Prove that for $x_n$ generated by gradient descent with constant step size $\alpha \in (0, \frac{2}{L})$ we have

\[
\sum_{k=n}^{m} \|\nabla f(x_k)\|^2 \le \frac{f(x_n) - f(x_{m+1})}{\alpha (1 - \frac{L}{2}\alpha)} \le \frac{f(x_n) - \min_x f(x)}{\alpha (1 - \frac{L}{2}\alpha)}
\]

for any $n \le m$. Deduce for the case $\min_x f(x) > -\infty$ that we have

\[
\min_{k \le n} \|\nabla f(x_k)\|^2 \in o(1/n).
\]
    Hint.

    Consider the minimizer from Sheet 1, Ex. 6(i).

    Solution.

    By L-smoothness of f, we have

\begin{align*}
f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), \underbrace{x_{k+1} - x_k}_{= -\alpha \nabla f(x_k)} \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2 \\
&= f(x_k) - \bigl( \alpha - \tfrac{L}{2} \alpha^2 \bigr) \|\nabla f(x_k)\|^2
\end{align*}

    and therefore

\[
\sum_{k=n}^{m} \|\nabla f(x_k)\|^2 \le \sum_{k=n}^{m} \frac{f(x_k) - f(x_{k+1})}{\alpha (1 - \frac{L}{2}\alpha)} \overset{\text{telescope}}{=} \frac{f(x_n) - f(x_{m+1})}{\alpha (1 - \frac{L}{2}\alpha)} \le \frac{f(x_n) - \min_x f(x)}{\alpha (1 - \frac{L}{2}\alpha)}.
\]

Now $a_n := \min_{k \le n} \|\nabla f(x_k)\|^2$ is non-increasing and satisfies $a_k \le \|\nabla f(x_k)\|^2$, so $\sum_{k} a_k < \infty$ by the bound above together with $\min_x f(x) > -\infty$. Therefore

\[
n\, a_{2n} \le \sum_{k=n}^{2n} a_k \le \sum_{k=n}^{\infty} a_k \to 0 \qquad (n \to \infty).
\]

Thus $a_{2n} \in o(1/n)$. And we can simply bound the odd elements of the sequence by

\[
a_{2n+1} \le a_{2n} \in o(1/n). \qquad ∎
\]
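The bound in (ii) can also be checked numerically. The following sketch is my own illustration (the quadratic test function, $L = 10$ and $\alpha = 1/L$ are assumed choices): the accumulated squared gradient norms stay below $(f(x_0) - \min_x f(x)) / (\alpha(1 - \frac{L}{2}\alpha))$, and $n \cdot \min_{k \le n} \|\nabla f(x_k)\|^2$ tends to zero.

```python
import numpy as np

# Constant-step gradient descent on f(x) = 0.5 * x^T D x with D = diag(1, 10),
# so f is L-smooth with L = 10 and min f = 0 (own test example).
D = np.array([1.0, 10.0])
L = D.max()
alpha = 1.0 / L                       # any alpha in (0, 2/L) works

x = np.array([5.0, 5.0])
f0 = 0.5 * np.sum(D * x**2)           # f(x_0)
bound = f0 / (alpha * (1 - L / 2 * alpha))

grad_sq_sum = 0.0
min_grad_sq = np.inf
for n in range(1, 2001):
    g = D * x                         # gradient of f at x_k
    grad_sq_sum += g @ g
    min_grad_sq = min(min_grad_sq, g @ g)
    x = x - alpha * g
    assert grad_sq_sum <= bound + 1e-9   # telescoping bound from the solution

print("n * min_k ||grad f(x_k)||^2 at n = 2000:", 2000 * min_grad_sq)  # ~ 0
```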
Exercise 3 (Optimizing Quadratic Functions).

In this exercise we consider functions of the type

\[
f(x) = x^T A x + b^T x + c,
\]

where $x \in \mathbb{R}^d$, $A \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}^d$, $c \in \mathbb{R}$.

(a)

Let $H := A^T + A$ be invertible. Prove that $f$ can be written in the forms

\begin{align*}
f(x) &= (x - x_*)^T A (x - x_*) + \tilde{c} \tag{1} \\
&= \tfrac{1}{2} (x - x_*)^T \underbrace{(A^T + A)}_{=:H} (x - x_*) + \tilde{c} \tag{2}
\end{align*}

for some $x_* \in \mathbb{R}^d$ and $\tilde{c} \in \mathbb{R}$. Argue that $H$ is always symmetric. Under which circumstances is $x_*$ a minimum?

      Solution.

We want, for some $x_*$,

\[
f(x) \overset{!}{=} (x - x_*)^T A (x - x_*) + \tilde{c}
= x^T A x \underbrace{- x^T A x_* - x_*^T A x}_{= -x^T (A + A^T) x_* \overset{!}{=} b^T x} + \underbrace{x_*^T A x_* + \tilde{c}}_{\overset{!}{=} c}.
\]

      So we simply select

\[
x_* := -(A + A^T)^{-1} b \qquad \text{and} \qquad \tilde{c} := c - x_*^T A x_*.
\]

      This proves our first representation (1). For (2) we simply note

\[
y^T A y = \langle y, A y \rangle \overset{\text{symm.}}{=} \langle A y, y \rangle = y^T A^T y.
\]

Applying this to $y = x - x_*$ in (1) gives $y^T A y = \tfrac{1}{2}\bigl(y^T A y + y^T A^T y\bigr) = \tfrac{1}{2} y^T (A^T + A) y$, which is exactly (2).

Symmetry of $H$ follows directly from its definition, as $H_{ij} = A_{ji} + A_{ij} = H_{ji}$.

Now $x_*$ is a minimum iff $H$ is positive definite. If $H$ is positive definite, then $x_*$ is a minimum by $\nabla^2 f(x) = H$ and the lecture. If $H$ is not positive definite, we show that for every $m \le \tilde{c} = f(x_*)$ there exists some $x$ with $f(x) = m$; in particular $f$ takes values strictly below $f(x_*)$, so $x_*$ is not a minimum. Since $H$ is not positive definite (and invertible, hence without zero eigenvalues), there exists some $y$ such that

\[
y^T H y =: -\epsilon < 0.
\]

Define $x = x_* + y \sqrt{\frac{2(\tilde{c} - m)}{\epsilon}}$. Then

\[
f(x) \overset{(2)}{=} \frac{\tilde{c} - m}{\epsilon}\, \underbrace{y^T H y}_{= -\epsilon} + \tilde{c} = m. \qquad ∎
\]
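A quick numerical sanity check of part (a) (my own sketch; the random $A$, $b$, $c$ are assumptions): with $x_* = -(A + A^T)^{-1} b$ and $\tilde{c} = c - x_*^T A x_*$, both representations (1) and (2) agree with $f$ at random points.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
c = 1.5
H = A.T + A                                   # symmetric by construction

x_star = -np.linalg.solve(A + A.T, b)         # x_* = -(A + A^T)^{-1} b
c_tilde = c - x_star @ A @ x_star             # c~ = c - x_*^T A x_*

def f(x):
    return x @ A @ x + b @ x + c

for _ in range(5):
    x = rng.normal(size=d)
    y = x - x_star
    assert np.isclose(f(x), y @ A @ y + c_tilde)        # representation (1)
    assert np.isclose(f(x), 0.5 * y @ H @ y + c_tilde)  # representation (2)
print("representations (1) and (2) verified at random points")
```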
(b)

Argue that the Newton method (with step size $\alpha_n = 1$) applied to $f$ would jump to $x_*$ in one step and then stop moving.

      Solution.

      Taking the derivative of (2) we get

\[
\nabla f(x) = H (x - x_*). \tag{3}
\]

So with $\nabla^2 f(x) = H$ and

\[
x_* = x - H^{-1} H (x - x_*) = x - [\nabla^2 f(x)]^{-1} \nabla f(x),
\]

we have that the Newton method finds $x_*$ in one step. By (3) we also get $\nabla f(x_*) = 0$, which stops the Newton method afterwards. ∎
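A minimal sketch of this one-step behaviour (my own illustration with a random quadratic): a single Newton step from an arbitrary point lands exactly on $x_*$, where the gradient vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
H = A.T + A                                   # Hessian of f
x_star = -np.linalg.solve(H, b)               # stationary point from part (a)

def grad_f(x):
    return H @ (x - x_star)                   # gradient, cf. representation (3)

x0 = rng.normal(size=d)                       # arbitrary starting point
x1 = x0 - np.linalg.solve(H, grad_f(x0))      # one Newton step with alpha = 1

print(np.allclose(x1, x_star))                # True: x_* reached in one step
print(np.allclose(grad_f(x1), 0.0))           # True: no further movement
```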

(c)

Let $V = (v_1, \dots, v_d)$ be an orthonormal basis such that

\[
H = V \operatorname{diag}[\lambda_1, \dots, \lambda_d] V^T
\]

with $0 < \lambda_1 \le \dots \le \lambda_d$ and write

\[
y^{(i)} := \langle y, v_i \rangle.
\]

Express $(x_n - x_*)^{(i)}$ in terms of $(x_0 - x_*)^{(i)}$, where $x_n$ is given by the gradient descent recursion

\[
x_{n+1} = x_n - h \nabla f(x_n).
\]

For which step sizes $h$ do all the components $(x_n - x_*)^{(i)}$ converge to zero? Which component has the slowest convergence speed? Find the optimal learning rate $h^*$ and deduce for this learning rate

\[
\|x_n - x_*\| \le \Bigl( 1 - \frac{2}{1 + \kappa} \Bigr)^n \|x_0 - x_*\|
\]

with the condition number $\kappa = \frac{\lambda_d}{\lambda_1}$.

      Solution.

Using the representation (3) of the gradient again and subtracting $x_*$ from our recursion, we get

\[
x_{n+1} - x_* = x_n - x_* - h H (x_n - x_*) = [\mathbb{I} - h H](x_n - x_*)
\]

      Therefore