Sheet 4

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 27.04.2023
Due: 12:00 in the exercise on Thursday, 27.04.2023

While there are \totalpoints in total, you may consider all points above the standard 24 to be bonus points.

Exercise 1 (Lower Bounds).

In this exercise, we will bound the convergence rates of algorithms which pick their iterates $x_{k+1}$ from

\operatorname{span}[\nabla f(x_0), \dots, \nabla f(x_k)] + x_0.

We consider the function

f_d(x) = \frac{1}{2}\bigl(x^{(1)} - 1\bigr)^2 + \frac{1}{2}\sum_{i=1}^{d-1}\bigl(x^{(i)} - x^{(i+1)}\bigr)^2
  1. (i)

    To understand our function fd better, we want to view it as a potential on a graph. For this consider the undirected graph G=(V,E) with vertices

V = \{1, \dots, d\}

    and edges

E = \{(i, i+1) : 1 \le i \le d-1\}.

    Draw a picture of this graph.

    Solution.

    The graph is simply a chain: vertex $i$ is connected to vertex $i+1$ for $i = 1, \dots, d-1$. ∎

  2. (ii)

    We now interpret $x^{(i)}$ as a quantity (e.g. of heat) at vertex $i$ of our graph $G$. Our potential $f_d$ decreases if the quantities at connected vertices $i$ and $i+1$ are of similar size, i.e. if $(x^{(i)} - x^{(i+1)})^2$ is small. Additionally there is a pull for $x^{(1)}$ to be equal to $1$. Use this intuition to find the minimizer $x^*$ of $f_d$.

    Solution.

    The minimizer is $x^* = (1, \dots, 1)^T \in \mathbb{R}^d$, since $f_d(x^*) = 0$ and $f_d(x) \ge 0$. ∎

  3. (iii)

    The matrix $A^G \in \mathbb{R}^{d \times d}$ with

    A^G_{i,j} = \begin{cases} \text{degree of vertex } i & i = j \\ -1 & (i,j) \in E \text{ or } (j,i) \in E \\ 0 & \text{else} \end{cases}

    is called the “Graph-Laplacian” of $G$. The degree of vertex $i$ is the number of edges containing $i$. Calculate $A^G$ for $G$ and prove that

    \nabla f_d(x) = A^G x + (x^{(1)} - 1)e_1 = (A^G + e_1 e_1^T)x - e_1.
    Solution.

    The Graph-Laplacian of G is given by

    A^G = \begin{pmatrix}
    1 & -1 & & & \\
    -1 & 2 & -1 & & \\
    & \ddots & \ddots & \ddots & \\
    & & -1 & 2 & -1 \\
    & & & -1 & 1
    \end{pmatrix}

    Let $i \neq 1, d$; then

    \frac{\partial f_d}{\partial x^{(i)}} = [x^{(i)} - x^{(i+1)}] - [x^{(i-1)} - x^{(i)}] = 2x^{(i)} - x^{(i-1)} - x^{(i+1)} = (A^G x)_i.

    Similarly, looking at the cases $i = 1$ and $i = d$ individually immediately reveals

    \nabla f_d(x) = A^G x + (x^{(1)} - 1)e_1.
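    As a quick sanity check (a minimal numpy sketch of our own, not part of the original solution; the helper names, the dimension and the finite-difference tolerance are arbitrary choices), one can build $A^G$ for the chain and compare the formula $\nabla f_d(x) = (A^G + e_1e_1^T)x - e_1$ against a central finite-difference gradient:

```python
import numpy as np

def graph_laplacian(d):
    """Graph-Laplacian of the chain graph 1 - 2 - ... - d."""
    A = np.zeros((d, d))
    for i in range(d - 1):          # edge (i, i+1)
        A[i, i] += 1
        A[i + 1, i + 1] += 1
        A[i, i + 1] -= 1
        A[i + 1, i] -= 1
    return A

def f_d(x):
    return 0.5 * (x[0] - 1) ** 2 + 0.5 * np.sum((x[:-1] - x[1:]) ** 2)

def grad_f_d(x):
    d = len(x)
    e1 = np.eye(d)[0]
    return (graph_laplacian(d) + np.outer(e1, e1)) @ x - e1

# compare against a central finite-difference gradient at a random point
d, eps = 6, 1e-6
x = np.random.default_rng(0).standard_normal(d)
fd_grad = np.array([(f_d(x + eps * np.eye(d)[i]) - f_d(x - eps * np.eye(d)[i])) / (2 * eps)
                    for i in range(d)])
assert np.allclose(grad_f_d(x), fd_grad, atol=1e-5)
```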
  4. (iv)

    Prove that the Hessian $H = \nabla^2 f_d(x)$ is constant and positive definite to show that $f_d$ is convex. Prove that the operator norm of $H$ is smaller than $4$. Argue that

    g_d(x) := \frac{L}{4} f_d(x)

    is therefore L-smooth.

    Solution.

    Taking the derivative of the gradient we calculated previously yields

    H = \nabla^2 f_d(x) = A^G + e_1 e_1^T.

    To show positive definiteness, let $y$ be arbitrary. Then

    y^T H y = (e_1^T y)^2 + y^T A^G y = (y^{(1)})^2 + \sum_{i=1}^{d-1}(y^{(i)} - y^{(i+1)})^2 \ge 0,

    with equality only if $y^{(1)} = 0$ and $y^{(i)} = y^{(i+1)}$ for all $i$, i.e. only if $y = 0$.

    To find the largest eigenvalue of $H$ we want to calculate the operator norm. For this we use $(a-b)^2 \le 2(a^2 + b^2)$ to get

    \langle y, Hy\rangle \le (y^{(1)})^2 + 2\sum_{i=1}^{d-1}\bigl[(y^{(i)})^2 + (y^{(i+1)})^2\bigr] \le 4\sum_{i=1}^{d}(y^{(i)})^2 = 4\|y\|^2.

    Thus we get

    \|H\| = \sup_{y : \|y\| = 1}\langle y, Hy\rangle \le 4.

    Since the operator norm coincides with the largest absolute eigenvalue for symmetric matrices, this proves our claim. Finally, $L$-smoothness of $g_d$ follows from

    \|\nabla g_d(x) - \nabla g_d(y)\| = \frac{L}{4}\|\nabla f_d(x) - \nabla f_d(y)\| = \frac{L}{4}\|H(x-y)\| \le \frac{L}{4}\|H\|\,\|x-y\| \le L\|x-y\|.
  5. (v)

    Assume $x_0 = 0$ and that $(x_n)_n$ is chosen with the restriction

    x_{n+1} \in \mathcal{K}_n := \operatorname{span}[\nabla g_d(x_0), \dots, \nabla g_d(x_n)].

    To make notation easier, we are going to identify $\mathbb{R}^n$ with an isomorphic subset of sequences,

    \mathbb{R}^n := \{x \in \ell^2 : x^{(i)} = 0 \ \forall i > n\};

    then $\mathbb{R}^n$ is a subset of $\mathbb{R}^d$ for $n \le d$. Prove inductively that

    \mathcal{K}_n \subseteq \mathbb{R}^{n+1} \subseteq \mathbb{R}^d.
    Solution.

    We have the induction start $n = 0$ by

    \nabla g_d(x_0) = -\frac{L}{4}e_1 \in \mathbb{R}^1.

    Now assume

    \mathcal{K}_{n-1} \subseteq \mathbb{R}^n,

    then by our selection process $x_n \in \mathbb{R}^n$. But then

    \frac{4}{L}\nabla g_d(x_n) = \underbrace{A^G x_n}_{\in \mathbb{R}^{n+1}} + \underbrace{(x_n^{(1)} - 1)e_1}_{\in \mathbb{R}^1} \in \mathbb{R}^{n+1}.

    We therefore have $\mathcal{K}_n = \operatorname{span}[\mathcal{K}_{n-1}, \nabla g_d(x_n)] \subseteq \mathbb{R}^{n+1}$.

    Notice how the low connectivity of the graph $G$ limits the spread of our quantity $x_n$: each gradient evaluation can only populate one additional coordinate. A higher connectivity would allow information to travel much more quickly. ∎
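    This spreading behaviour is easy to see numerically (a small illustrative sketch of our own; the dimension, step size and threshold are arbitrary): starting from $x_0 = 0$, each gradient step of plain gradient descent on $f_d$ populates at most one new coordinate.

```python
import numpy as np

d = 10
A = np.diag(2 * np.ones(d)) + np.diag(-np.ones(d - 1), 1) + np.diag(-np.ones(d - 1), -1)
A[0, 0] = A[-1, -1] = 1                            # chain graph Laplacian
e1 = np.eye(d)[0]
grad = lambda x: (A + np.outer(e1, e1)) @ x - e1   # gradient of f_d

x = np.zeros(d)
for k in range(5):
    x = x - 0.25 * grad(x)                         # step size 1/L with L = 4
    print(k + 1, np.nonzero(np.abs(x) > 1e-12)[0]) # support grows by one index per step
```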

  6. (vi)

    We now want to bound the convergence speed of $x_n$ to $x^*$. For this we select $d = 2n + 1$.

    Note: We may choose a larger dimension $d$ by defining $f_{2n+1}$ on the subset $\mathbb{R}^{2n+1} \subseteq \mathbb{R}^d$. The important requirement is therefore $2n+1 \le d$. But without loss of generality we assume equality.

    Use the knowledge we have collected so far to argue

    \|x^* - x_n\|^2 \ge d - n \ge \frac{1}{2}\|x^* - x_0\|^2.
    Solution.

    Since $x_n \in \mathbb{R}^n$ we know that

    \begin{aligned}
    \|x^* - x_n\|^2 &= \sum_{i=1}^{d}\bigl(x^{*(i)} - x_n^{(i)}\bigr)^2 \\
    &\ge \sum_{i=n+1}^{d}\bigl(x^{*(i)}\bigr)^2 \\
    &= d - n \overset{d = 2n+1}{=} n + 1 = \frac{n+1}{2n+1}d \ge \frac{1}{2}d = \frac{1}{2}\sum_{i=1}^{d}1^2 = \frac{1}{2}\|x^* - x_0\|^2.
    \end{aligned}
  7. (vii)

    To prevent the convergence of the loss $g_d(x_n)$ to $g_d(x^*)$ we need a more sophisticated argument. For this consider

    \tilde{g}_n(x) := \frac{L}{4}\Bigl[f_n(x) + \frac{1}{2}\bigl(x^{(n)} - 0\bigr)^2\Bigr].

    Argue that on $\mathbb{R}^n \subseteq \mathbb{R}^d$ the functions $\tilde{g}_n$ and $g_d$ are identical. Use this observation to prove

    g_d(x_n) - \inf_x g_d(x) \ge \inf_x \tilde{g}_n(x).
    Solution.

    Let $x \in \mathbb{R}^n$. Then using $x^{(n+1)} = 0$ we have

    \tilde{g}_n(x) = \frac{L}{8}\Bigl[(x^{(1)} - 1)^2 + \sum_{i=1}^{n}(x^{(i)} - x^{(i+1)})^2\Bigr] = g_d(x),

    using $x^{(i)} = 0$ for all $i > n$ for the second equality sign. Since $x_n \in \mathbb{R}^n$ we can therefore replace $g_d$ with $\tilde{g}_n$ at will to get

    g_d(x_n) - \underbrace{\inf_x g_d(x)}_{=0} = \tilde{g}_n(x_n) \ge \inf_x \tilde{g}_n(x).
  8. (viii)

    Our goal is now to calculate $\inf_x \tilde{g}_n(x)$. Prove convexity of $\tilde{g}_n$ and prove that

    \hat{x}_n^{(i)} = \begin{cases} 1 - \frac{i}{n+1} & i \le n+1 \\ 0 & i \ge n+1 \end{cases}

    is its minimum. Then plug our solution into $\tilde{g}_n$ (or $g_d$, since $\hat{x}_n$ is in the subset $\mathbb{R}^n$ after all), to obtain the lower bound

    g_d(x_n) - \inf_x g_d(x) \ge \frac{L\|x_0 - x^*\|^2}{8(n+1)d} \ge \frac{L\|x_0 - x^*\|^2}{16(n+1)^2}.
    Solution.

    We have

    \nabla\tilde{g}_n(x) = \frac{L}{4}\bigl[A^{G_n}x + (x^{(1)} - 1)e_1 + x^{(n)}e_n\bigr] = \frac{L}{4}\bigl[(A^{G_n} + e_1e_1^T + e_ne_n^T)x - e_1\bigr],

    where $A^{G_n}$ is the Graph-Laplacian for $f_n$. Then the Hessian

    \nabla^2\tilde{g}_n(x) = \frac{L}{4}(A^{G_n} + e_1e_1^T + e_ne_n^T)

    is obviously positive definite, as we could apply the same arguments as for $f_n$. So $\tilde{g}_n$ is convex. We now plug $\hat{x}_n$ into $\nabla\tilde{g}_n$ to verify the first order condition, proving it is a minimum:

    \begin{aligned}
    \frac{4}{L}\frac{\partial\tilde{g}_n}{\partial x^{(i)}}(\hat{x}_n) &= (A^{G_n}\hat{x}_n)_i + \underbrace{(\hat{x}_n^{(1)} - 1)}_{=-\frac{1}{n+1}}\delta_{i1} + \underbrace{\hat{x}_n^{(n)}}_{=\frac{1}{n+1}}\delta_{in} \\
    &= \underbrace{[\hat{x}_n^{(i)} - \hat{x}_n^{(i+1)}]\,\mathbb{1}_{i\neq n}}_{=\frac{1}{n+1}\mathbb{1}_{i\neq n}} - \underbrace{[\hat{x}_n^{(i-1)} - \hat{x}_n^{(i)}]\,\mathbb{1}_{i\neq 1}}_{=\frac{1}{n+1}\mathbb{1}_{i\neq 1}} - \frac{1}{n+1}\delta_{i1} + \frac{1}{n+1}\delta_{in} \\
    &= 0.
    \end{aligned}

    We now know that

    \begin{aligned}
    \inf_x \tilde{g}_n(x) = \tilde{g}_n(\hat{x}_n) &= \frac{L}{8}\Bigl[\Bigl(-\frac{1}{n+1}\Bigr)^2 + \Bigl(1 - \frac{n}{n+1}\Bigr)^2 + \sum_{i=1}^{n-1}\Bigl(\frac{i+1}{n+1} - \frac{i}{n+1}\Bigr)^2\Bigr] \\
    &= \frac{L}{8}\sum_{i=0}^{n}\Bigl(\frac{1}{n+1}\Bigr)^2 = \frac{L}{8(n+1)} \overset{d = \|x_0 - x^*\|^2}{=} \frac{L\|x_0 - x^*\|^2}{8(n+1)d} \\
    &\ge \frac{L\|x_0 - x^*\|^2}{16(n+1)^2}
    \end{aligned}

    using $d = 2n+1$ again. ∎
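    To see the lower bound in action, here is a small, hedged numpy experiment (our own addition, not part of the sheet; the dimension, step size and iteration count are arbitrary choices): run plain gradient descent on $g_d$ with $x_0 = 0$ and compare the loss after $n$ steps to $L\|x_0 - x^*\|^2 / (16(n+1)^2)$.

```python
import numpy as np

L = 4.0                                    # with L = 4 we simply have g_d = (L/4) f_d = f_d
d = 101                                    # d = 2n + 1
n = (d - 1) // 2

A = np.diag(2 * np.ones(d)) + np.diag(-np.ones(d - 1), 1) + np.diag(-np.ones(d - 1), -1)
A[0, 0] = A[-1, -1] = 1                    # Graph-Laplacian of the chain
e1 = np.eye(d)[0]

def g(x):                                  # g_d(x) = (L/4) f_d(x), with inf g_d = 0
    return L / 4 * (0.5 * (x[0] - 1) ** 2 + 0.5 * np.sum((x[:-1] - x[1:]) ** 2))

def grad_g(x):
    return L / 4 * ((A + np.outer(e1, e1)) @ x - e1)

x_star = np.ones(d)
x = np.zeros(d)                            # x_0 = 0
for _ in range(n):                         # gradient descent stays inside the Krylov spans
    x = x - (1 / L) * grad_g(x)

bound = L * np.linalg.norm(x_star) ** 2 / (16 * (n + 1) ** 2)
print(f"g_d(x_n) = {g(x):.5f} >= theoretical lower bound {bound:.5f}")
```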

  9. (ix)

    Argue that we only needed

    x_n = x_0 + \sum_{k=0}^{n-1} A_k \nabla f(x_k)

    with upper triangular matrices Ak to make these bounds work. Since adaptive methods (like Adam) use diagonal matrices Ak, they are therefore covered by these bounds.

    Solution.

    We only needed $\mathcal{K}_n \subseteq \mathbb{R}^{n+1}$, which we proved by induction using only this fact about $\mathcal{K}_{n-1}$. Since multiplication by upper triangular matrices preserves the subspaces $\mathbb{R}^{n+1}$, we may as well allow them. ∎

  10. (x)

    Bask in our glory! For we have proven that …? Summarize our results into a theorem.

    Solution.
    Theorem (Nesterov).

    Assume there exist upper triangular matrices $A_{k,n}$ such that the sequence $(x_n)_n$ in $\mathbb{R}^d$ is selected by the rule

    x_n = x_0 + \sum_{k=0}^{n-1} A_{k,n}\nabla f(x_k)

    in order to minimize a convex, $L$-smooth function $f$. Then for every $n \le \frac{d-1}{2}$ there exists a convex, $L$-smooth function $f$ such that

    \begin{aligned}
    \|x_n - x^*\|^2 &\ge \frac{1}{2}\|x_0 - x^*\|^2 \\
    f(x_n) - \inf_x f(x) &\ge \frac{L\|x_0 - x^*\|^2}{16(n+1)^2}
    \end{aligned}

    where $x^*$ is a minimizer, i.e. $f(x^*) = \inf_x f(x)$.

  11. (xi)

    (Bonus) If you wish, you may want to try and repeat those steps for

    G_d(x) = \frac{L - \mu}{L}g_d(x) + \frac{\mu}{2}\|x\|^2

    to prove an equivalent result for $\mu$-strongly convex functions. Unfortunately, finding $x^*$ is much more difficult in this case. Letting $d \to \infty$ makes this problem tractable again, with solution

    x^{*(i)} = \Bigl(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\Bigr)^i, \qquad \kappa := \frac{L}{\mu}.
Exercise 2 (Conjugate Gradient Descent).

Consider a quadratic function

f(x) = \frac{1}{2}(x - x^*)^T H (x - x^*)

for some symmetric and positive definite $H$ and consider the Hilbert space $\mathcal{H} = (\mathbb{R}^d, \langle\cdot,\cdot\rangle_H)$ with

\langle x, y\rangle_H = \langle x, Hy\rangle.
  1. (i)

    Prove that $\langle\cdot,\cdot\rangle_H$ is a well-defined scalar product. Convince yourself that

    f(x) = \frac{1}{2}\|x - x^*\|_H^2.
    Solution.

    Bilinearity and symmetry are trivial; the positive definiteness follows from the corresponding property of $H$. We have

    f(x) = \frac{1}{2}\langle x - x^*, H(x - x^*)\rangle = \frac{1}{2}\langle x - x^*, x - x^*\rangle_H = \frac{1}{2}\|x - x^*\|_H^2.
  2. (ii)

    Determine the derivative $\nabla_H f(x)$ of $f$ in $\mathcal{H}$.

    Hint.

    Recall that $\nabla_H f(x)$ is the unique vector satisfying

    0 = \lim_{v \to 0}\frac{|f(x+v) - f(x) - \langle\nabla_H f(x), v\rangle_H|}{\|v\|_H}.
    Solution.

    We need

    \begin{aligned}
    0 &= \lim_{v\to 0}\frac{|f(x+v) - f(x) - \langle\nabla_H f(x), v\rangle_H|}{\|v\|_H} \\
    &= \lim_{v\to 0}\frac{|f(x+v) - f(x) - \langle H\nabla_H f(x), v\rangle|}{\|v\|}\cdot\underbrace{\frac{\|v\|}{\|v\|_H}}_{\ge c}.
    \end{aligned}

    We can bound the fraction of norms from below by a constant $c > 0$ due to the equivalence of all norms on $\mathbb{R}^d$. This lower bound on the second fraction forces the first fraction to converge to zero. But this implies that

    \nabla f(x) = H\nabla_H f(x)

    by the definition (and uniqueness) of $\nabla f(x)$. Thus the gradient we are looking for is

    \nabla_H f(x) = H^{-1}\nabla f(x).
  3. (iii)

    Since gradient descent in the space $\mathcal{H}$ is therefore computationally the Newton method, we want to find a different method of optimization. Consider an arbitrary set of conjugate ($H$-orthogonal) directions $(v_1, \dots, v_d)$, i.e. $\langle v_i, v_j\rangle_H = \delta_{ij}$, and for some starting point $x_0 \in \mathbb{R}^d$ the following descent algorithm:

    x_{k+1} = x_k - \alpha_k v_{k+1} \quad\text{with}\quad \alpha_k := \operatorname*{argmin}_\alpha f(x_k - \alpha v_{k+1}).

    Optimizing over $\alpha$ in this manner is known as “line-search”. Using $y^{(i)} := \langle y, v_i\rangle_H$ prove that

    (x_k - x^*) = \sum_{i=k+1}^{d}(x_0 - x^*)^{(i)}v_i = \operatorname*{argmin}_x\{f(x) : x \in x_0 + \operatorname{span}[v_1, \dots, v_k]\} - x^*.

    Deduce that conjugate descent (CD) converges in d steps.

    Solution.

    We proceed by induction. The induction start with $k = 0$ is obvious. Let us now consider $x_{k+1}$. By its definition we have

    \begin{aligned}
    2f(x_{k+1}) &= \min_\alpha 2f(x_k - \alpha v_{k+1}) \\
    &= \min_\alpha \|x_k - \alpha v_{k+1} - x^*\|_H^2 \\
    &= \min_\alpha \Bigl\|\sum_{i=1}^{d}(x_k - x^*)^{(i)}v_i - \alpha v_{k+1}\Bigr\|_H^2 \\
    &= \min_\alpha \bigl[(x_k - x^*)^{(k+1)} - \alpha\bigr]^2\|v_{k+1}\|_H^2 + \sum_{i=k+2}^{d}\bigl[(x_k - x^*)^{(i)}\bigr]^2\|v_i\|_H^2 \\
    &= \sum_{i=k+2}^{d}\bigl[(x_k - x^*)^{(i)}\bigr]^2.
    \end{aligned}

    The minimizer is therefore $\alpha_k = (x_k - x^*)^{(k+1)}$. This removes the $v_{k+1}$ component, leaving us with the components $v_{k+2}$ and up. Note that $(x_k - x^*)^{(i)} = (x_0 - x^*)^{(i)}$ for all $i \ge k+1$ by induction. Similarly we can see that this is a minimum in the span of $v_1, \dots, v_{k+1}$, as we have removed those components completely and

    2f(x) = \|x - x^*\|_H^2 = \sum_{i=1}^{d}\bigl[(x - x^*)^{(i)}\bigr]^2.

    Since we cannot touch the other components due to $H$-orthogonality, this is the best we can do. ∎
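    A small numerical illustration of this $d$-step convergence (a hedged sketch of our own, not prescribed by the exercise; the random $H$ and the $H$-orthonormalization of the standard basis via Gram-Schmidt are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)              # symmetric positive definite
x_star = rng.standard_normal(d)

def f(x):
    return 0.5 * (x - x_star) @ H @ (x - x_star)

# build an H-orthonormal basis v_1, ..., v_d by Gram-Schmidt in <.,.>_H
V = []
for w in np.eye(d):
    for v in V:
        w = w - (w @ H @ v) * v
    V.append(w / np.sqrt(w @ H @ w))

x = np.zeros(d)
for v in V:                              # exact line search along each direction
    alpha = (x - x_star) @ H @ v         # minimizer of f(x - alpha v), since ||v||_H = 1
    x = x - alpha * v

assert np.isclose(f(x), 0.0, atol=1e-8)  # converged after exactly d steps
```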

  4. (iv)

    If we had $v_i = \nabla f(x_{i-1})$, then this algorithm would be optimal in the set of algorithms we considered in the previous exercise. Unfortunately the gradients $\nabla f(x_{i-1})$ are generally not conjugate. So while we may select an arbitrary set of conjugate $v_i$, we cannot select the gradients directly.

    Instead we are going to do the next best thing and inductively select $v_{k+1}$ such that

    \mathcal{K}_k := \operatorname{span}[\nabla f(x_0), \dots, \nabla f(x_k)] = \operatorname{span}[v_1, \dots, v_{k+1}],

    using the Gram-Schmidt procedure to make $v_{k+1}$ conjugate to $v_1, \dots, v_k$. Since Gram-Schmidt is still computationally too expensive for our tastes, please inductively prove

    \mathcal{K}_k = \operatorname{span}[H^1(x_0 - x^*), \dots, H^{k+1}(x_0 - x^*)],

    assuming $\mathcal{K}_k$ is $(k+1)$-dimensional. I.e. $\mathcal{K}_k$ is an “$H$-Krylov subspace”.

    Solution.

    The induction start $k = 0$ follows directly from

    \nabla f(x_0) = H(x_0 - x^*)

    and the definition of $\mathcal{K}_0$. Assume we have the claim for $k-1$, then

    \nabla f(x_k) = H(x_k - x^*) = H(x_{k-1} - \alpha_{k-1}v_k - x^*) = \underbrace{H(x_{k-1} - x^*)}_{=\nabla f(x_{k-1}) \in \mathcal{K}_{k-1}} - \alpha_{k-1}\underbrace{Hv_k}_{\in H\mathcal{K}_{k-1}}.

    As $\mathcal{K}_{k-1} = \operatorname{span}[H^1(x_0 - x^*), \dots, H^k(x_0 - x^*)]$ by the induction hypothesis, we therefore have

    \nabla f(x_k) \in \operatorname{span}[H^1(x_0 - x^*), \dots, H^{k+1}(x_0 - x^*)].

    Since $\nabla f(x_0), \dots, \nabla f(x_{k-1}) \in \mathcal{K}_{k-1}$, they are by the induction hypothesis also in this span, so

    \mathcal{K}_k = \operatorname{span}[\nabla f(x_0), \dots, \nabla f(x_k)] \subseteq \operatorname{span}[H^1(x_0 - x^*), \dots, H^{k+1}(x_0 - x^*)].

    Since the space on the left is $(k+1)$-dimensional, we have equality. ∎

  5. (v)

    Argue that $\nabla f(x_{k+1})$ is orthogonal to every vector in $\mathcal{K}_k$ and inductively deduce either

    \nabla f(x_{k+1}) = 0,

    which implies $x_{k+1} = x^*$, or that $\mathcal{K}_{k+1}$ has full rank. Deduce from the $H$-Krylov-subspace property that $\nabla f(x_{k+1})$ is already $H$-orthogonal to $\mathcal{K}_{k-1}$.

    Hint.

    x_{k+1} = \operatorname*{argmin}_x\{f(x) : x \in \mathcal{K}_k + x_0\}.

    Solution.

    By the selection process of $x_{k+1}$, we have

    x_{k+1} = \operatorname*{argmin}_x\{f(x) : x \in \mathcal{K}_k + x_0\}.

    Assume $\nabla f(x_{k+1})$ were not orthogonal to $\mathcal{K}_k$. Then there would exist $v \in \mathcal{K}_k$ (after possibly flipping its sign) such that

    \langle\nabla f(x_{k+1}), v\rangle > 0.

    By the Taylor approximation we therefore have

    f(x_{k+1} - \delta v) = f(x_{k+1}) - \delta\underbrace{\langle\nabla f(x_{k+1}), v\rangle}_{>0} + \mathcal{O}(\delta^2),

    so there exists a small $\delta > 0$ such that $f(x_{k+1} - \delta v) < f(x_{k+1})$. But this is a contradiction, since $x_{k+1}$ was optimal.

    $\nabla f(x_{k+1})$ is therefore orthogonal to $\mathcal{K}_k$. So if it is not zero, $\mathcal{K}_{k+1}$ has (as the span of both) full rank. $\nabla f(x_{k+1})$ being orthogonal to $\mathcal{K}_k$ also implies it is orthogonal to $H\mathcal{K}_{k-1}$, since that is a subspace of $\mathcal{K}_k$ by the Krylov property. But this implies $\nabla f(x_{k+1})$ is $H$-orthogonal to $\mathcal{K}_{k-1}$. ∎

  6. (vi)

    Collect the ideas we have gathered to prove that the recursively defined

    v_{k+1} = \nabla f(x_k) - \frac{\langle\nabla f(x_k), v_k\rangle_H}{\|v_k\|_H^2}v_k

    are $H$-conjugate and have the same span as the gradients up to $\nabla f(x_k)$.

    Solution.

    These $v_k$ are the same $v_k$ we would obtain using Gram-Schmidt on the gradients. In fact this is Gram-Schmidt together with the fact that $\nabla f(x_k)$ is already $H$-orthogonal to $v_1, \dots, v_{k-1} \in \mathcal{K}_{k-2}$. So only the last summand remains. ∎

  7. (vii)

    To make our procedure truly computable, we want to show

    \frac{\langle\nabla f(x_k), v_k\rangle_H}{\|v_k\|_H^2} = -\frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}.

    Hint.

    Proving

    \nabla f(x_k) = \nabla f(x_{k-1}) - \alpha_{k-1}Hv_k

    should allow you to conclude $\langle\nabla f(x_k), v_k\rangle_H = -\frac{\|\nabla f(x_k)\|^2}{\alpha_{k-1}}$. Then it makes sense to calculate

    \alpha_{k-1} = \frac{\langle\nabla f(x_{k-1}), v_k\rangle}{\|v_k\|_H^2}

    by solving its optimization problem. Finally you may want to consider $v_k = \nabla f(x_{k-1}) - cv_{k-1}$ and $v_{k-1} \in \mathcal{K}_{k-2}$.

    Solution.

    We have

    \nabla f(x_k) = H(\underbrace{x_{k-1} - \alpha_{k-1}v_k}_{=x_k} - x^*) = \nabla f(x_{k-1}) - \alpha_{k-1}Hv_k.

    This implies $v_k = \frac{1}{\alpha_{k-1}}H^{-1}[\nabla f(x_{k-1}) - \nabla f(x_k)]$ and therefore

    \langle\nabla f(x_k), v_k\rangle_H = \frac{1}{\alpha_{k-1}}\langle\nabla f(x_k), \nabla f(x_{k-1}) - \nabla f(x_k)\rangle = -\frac{\|\nabla f(x_k)\|^2}{\alpha_{k-1}},

    where we have used $\langle\nabla f(x_k), \nabla f(x_{k-1})\rangle = 0$, which follows from $\nabla f(x_{k-1}) \in \mathcal{K}_{k-1}$ and $\nabla f(x_k) \perp \mathcal{K}_{k-1}$.

    Now we need to find $\alpha_{k-1}$. But the first order condition

    \begin{aligned}
    0 &= \frac{d}{d\alpha}f(x_{k-1} - \alpha v_k) \\
    &= -\langle\nabla f(x_{k-1} - \alpha v_k), v_k\rangle \\
    &= -\langle H(x_{k-1} - x^* - \alpha v_k), v_k\rangle \\
    &= -\langle\nabla f(x_{k-1}), v_k\rangle + \alpha\|v_k\|_H^2
    \end{aligned}

    implies

    \alpha_{k-1} = \frac{\langle\nabla f(x_{k-1}), v_k\rangle}{\|v_k\|_H^2}.

    Before we put things together, note that by the definition of $v_k$

    \langle\nabla f(x_{k-1}), v_k\rangle = \langle\nabla f(x_{k-1}), \nabla f(x_{k-1}) - cv_{k-1}\rangle = \|\nabla f(x_{k-1})\|^2,

    since $\nabla f(x_{k-1})$ is orthogonal to $v_{k-1} \in \mathcal{K}_{k-2}$. From this we get

    \alpha_{k-1} = \frac{\|\nabla f(x_{k-1})\|^2}{\|v_k\|_H^2},

    so we finally get

    \frac{\langle\nabla f(x_k), v_k\rangle_H}{\|v_k\|_H^2} = -\frac{\|\nabla f(x_k)\|^2}{\alpha_{k-1}\|v_k\|_H^2} = -\frac{\|\nabla f(x_k)\|^2\,\|v_k\|_H^2}{\|v_k\|_H^2\,\|\nabla f(x_{k-1})\|^2} = -\frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}.
  8. (viii)

    Summarize everything into a pseudo-algorithm for conjugate gradient descent (CGD) and compare it to heavy-ball momentum with

    \beta_k = \frac{\alpha_k\|\nabla f(x_k)\|^2}{\alpha_{k-1}\|\nabla f(x_{k-1})\|^2},

    using identical $\alpha_k$ as CGD.

    Solution.

    We set $v_1 = \nabla f(x_0)$ or, for later iterations,

    v_{k+1} = \nabla f(x_k) + \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}v_k,

    determine the step-size

    \alpha_k = \operatorname*{argmin}_\alpha f(x_k - \alpha v_{k+1})

    and finally make our step

    x_{k+1} = x_k - \alpha_k v_{k+1}.

    Using the fact $v_k = \frac{x_{k-1} - x_k}{\alpha_{k-1}}$ and inserting $v_{k+1}$ into the last equation, we notice that

    \begin{aligned}
    x_{k+1} &= x_k - \alpha_k\Bigl[\nabla f(x_k) + \frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}\cdot\frac{x_{k-1} - x_k}{\alpha_{k-1}}\Bigr] \\
    &= x_k - \alpha_k\nabla f(x_k) + \underbrace{\frac{\alpha_k}{\alpha_{k-1}}\frac{\|\nabla f(x_k)\|^2}{\|\nabla f(x_{k-1})\|^2}}_{=\beta_k}(x_k - x_{k-1}),
    \end{aligned}

    i.e. CGD is identical to HBM with these particular parameters $\alpha_k$, $\beta_k$. ∎
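    A minimal numpy sketch of this CGD recursion on a random quadratic (our own illustration; the function name `cgd`, the test matrix and the tolerance are arbitrary), using the closed-form line search $\alpha_k = \langle\nabla f(x_k), v_{k+1}\rangle / \|v_{k+1}\|_H^2$ from (vii):

```python
import numpy as np

def cgd(H, x_star, x0, tol=1e-12):
    """Conjugate gradient descent for f(x) = 0.5 (x - x_star)^T H (x - x_star)."""
    x = x0.copy()
    grad = H @ (x - x_star)                      # nabla f(x_0)
    v = grad.copy()                              # v_1 = nabla f(x_0)
    for _ in range(len(x0)):
        if np.linalg.norm(grad) < tol:
            break
        alpha = (grad @ v) / (v @ H @ v)         # exact line search (part (vii))
        x = x - alpha * v
        new_grad = H @ (x - x_star)
        beta = (new_grad @ new_grad) / (grad @ grad)
        v = new_grad + beta * v                  # v_{k+1} = grad + beta * v_k
        grad = new_grad
    return x

rng = np.random.default_rng(2)
d = 20
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)
x_star = rng.standard_normal(d)
x = cgd(H, x_star, np.zeros(d))
print(np.linalg.norm(x - x_star))                # ~ 0 after at most d iterations
```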

    Exercise 3 (Momentum).

    In this exercise, we take a closer look at heavy-ball momentum

    x_{k+1} = x_k + \beta_k(x_k - x_{k-1}) - \alpha_k\nabla f(x_k).
    1. (a)

      Find a continuous function $f: \mathbb{R} \to \mathbb{R}$ such that

      f'(x) = \begin{cases} 25x & x < 1 \\ x + 24 & 1 < x < 2 \\ 25x - 24 & 2 < x. \end{cases}

      Prove that $f$ is $\mu$-strongly convex with $\mu = 1$, $L$-smooth with $L = 25$ and has a minimum in zero.

      Solution.

      We define

      f(x) = \begin{cases} \frac{25}{2}x^2 & x \le 1 \\ \frac{1}{2}x^2 + 24x - 12 & 1 < x < 2 \\ \frac{25}{2}x^2 - 24x + 36 & 2 \le x, \end{cases}

      note that it is continuous in $1$ and $2$ and therefore everywhere, and that it has the correct derivative. Further note that

      f''(x) = \begin{cases} 1 & 1 < x < 2 \\ 25 & \text{else} \end{cases}

      is the derivative of $f'(x)$ in the following sense:

      f'(x) = \int_0^x f''(t)\,dt,

      which follows from differentiability of $f'$ on its segments with the fundamental theorem of calculus and continuity between segments. Thus we have

      \begin{aligned}
      f(y) &= f(x) + \int_x^y f'(t)\,dt = f(x) + f'(x)(y - x) + \int_x^y f'(t) - f'(x)\,dt \\
      &= f(x) + f'(x)(y - x) + \int_x^y\int_x^t f''(s)\,ds\,dt.
      \end{aligned}

      For the Bregman divergence this implies

      \frac{1}{2}|y - x|^2 \le D_f^{(B)}(y, x) = \int_x^y\int_x^t f''(s)\,ds\,dt \le \frac{25}{2}|y - x|^2,

      thus f is μ=1-strongly convex and L=25-smooth. ∎
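      A quick hedged check of this construction in numpy (our own addition; the evaluation points and tolerances are arbitrary): evaluate the piecewise $f$, verify continuity at the break points, and confirm that a central finite difference reproduces the claimed $f'$ away from the kinks.

```python
import numpy as np

def f(x):
    if x <= 1:
        return 12.5 * x**2
    if x < 2:
        return 0.5 * x**2 + 24 * x - 12
    return 12.5 * x**2 - 24 * x + 36

def f_prime(x):
    if x < 1:
        return 25 * x
    if x < 2:
        return x + 24
    return 25 * x - 24

# continuity at the break points x = 1 and x = 2
assert np.isclose(f(1 - 1e-9), f(1 + 1e-9), atol=1e-6)
assert np.isclose(f(2 - 1e-9), f(2 + 1e-9), atol=1e-6)

# the derivative matches a central finite difference away from the kinks
for x in [-3.0, 0.4, 1.5, 2.7, 10.0]:
    fd = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
    assert np.isclose(fd, f_prime(x), atol=1e-4)
```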

    2. (b)

      Recall, we required for convergence of HBM

      1 > \beta \ge \max\bigl\{(1 - \sqrt{\alpha\mu})^2, (1 - \sqrt{\alpha L})^2\bigr\}.

      Calculate the optimal $\alpha$ and $\beta$ to minimize the rate $\beta$.

      Solution.

      To minimize $\beta$, we first set

      \beta = \max\bigl\{(1 - \sqrt{\alpha\mu})^2, (1 - \sqrt{\alpha L})^2\bigr\}

      and then proceed to minimize this over $\alpha$, which results in

      \begin{aligned}
      \alpha^* &= \operatorname*{argmin}_\alpha\max\bigl\{(1 - \sqrt{\alpha\mu})^2, (1 - \sqrt{\alpha L})^2\bigr\} \\
      &= \operatorname*{argmin}_\alpha\max\bigl\{|1 - \sqrt{\alpha\mu}|, |1 - \sqrt{\alpha L}|\bigr\} \\
      &= \operatorname*{argmin}_\alpha\max\bigl\{(1 - \sqrt{\alpha\mu}), -(1 - \sqrt{\alpha\mu}), (1 - \sqrt{\alpha L}), -(1 - \sqrt{\alpha L})\bigr\} \\
      &= \operatorname*{argmin}_\alpha\max\bigl\{(1 - \sqrt{\alpha\mu}), -(1 - \sqrt{\alpha L})\bigr\},
      \end{aligned}

      which is monotonically decreasing for

      1 - \sqrt{\alpha\mu} > \sqrt{\alpha L} - 1

      and monotonically increasing otherwise. Therefore its minimum is at equality. Thus

      1 - \sqrt{\alpha^*\mu} = \sqrt{\alpha^* L} - 1 \;\Longrightarrow\; 2 = \sqrt{\alpha^*}(\sqrt{\mu} + \sqrt{L}) \;\Longrightarrow\; \alpha^* = \frac{4}{(\sqrt{\mu} + \sqrt{L})^2}.

      This results in

      \beta^* = \Bigl(1 - \frac{2}{1 + \sqrt{L/\mu}}\Bigr)^2.
    3. (c)

      Prove that using heavy-ball momentum on $f$ with the optimal parameters results in the recursion

      x_{k+1} = \frac{13}{9}x_k - \frac{4}{9}x_{k-1} - \frac{1}{9}f'(x_k).
      Solution.

      Using our previous results about optimal rates, we have for $f$

      \alpha^* = \frac{4}{(\sqrt{\mu} + \sqrt{L})^2} = \frac{4}{(1 + 5)^2} = \frac{1}{9}, \qquad \beta^* = \Bigl(1 - \frac{2}{1 + 5}\Bigr)^2 = \frac{4}{9}.

      Thus

      x_{k+1} = x_k + \frac{4}{9}(x_k - x_{k-1}) - \frac{1}{9}f'(x_k) = \frac{13}{9}x_k - \frac{4}{9}x_{k-1} - \frac{1}{9}f'(x_k).

    4. (d)

      We want to find a cycle of points $p \to q \to r \to p$, such that for $x_0 = p$ we have

      x_{3k} = p, \quad x_{3k+1} = q, \quad x_{3k+2} = r \qquad \forall k \ge 0.

      Assume $p < 1$, $q < 1$ and $r > 2$ and use the heavy-ball recursion to create linear equations for $p, q, r$. Solve this linear system. What does this mean for convergence?

      Solution.

      We have

      \begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 0 & -\frac{4}{9} & \frac{13}{9} \\ \frac{13}{9} & 0 & -\frac{4}{9} \\ -\frac{4}{9} & \frac{13}{9} & 0 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} - \frac{1}{9}\begin{pmatrix} f'(r) \\ f'(p) \\ f'(q) \end{pmatrix}.

      Multiplying both sides by $9$, using $f'(r) = 25r - 24$, $f'(p) = 25p$ (and similarly for $q$) and reordering, we get

      \begin{pmatrix} 9 & 4 & 12 \\ 12 & 9 & 4 \\ 4 & 12 & 9 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 24 \\ 0 \\ 0 \end{pmatrix}.

      Solving this system of equations results in

      p = \frac{792}{1225} \approx 0.65, \qquad q = -\frac{2208}{1225} \approx -1.80, \qquad r = \frac{2592}{1225} \approx 2.12.

      As we have managed to find a cycle of points, HBM does not converge to the minimum at zero in this case. Note: it is also possible to show that this cycle is attractive, i.e. HBM converges to the cycle if you start in an epsilon neighbourhood of these points. ∎
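      One can verify the cycle with a few lines of exact rational arithmetic (our own sanity check, not part of the solution; the helper name and the number of iterations are arbitrary):

```python
from fractions import Fraction as F

def f_prime(x):
    if x < 1:
        return 25 * x
    if x < 2:
        return x + 24
    return 25 * x - 24

p, q, r = F(792, 1225), F(-2208, 1225), F(2592, 1225)

# iterate x_{k+1} = 13/9 x_k - 4/9 x_{k-1} - 1/9 f'(x_k), entering the cycle at (x_{k-1}, x_k) = (r, p)
x_prev, x = r, p
orbit = []
for _ in range(6):
    x_prev, x = x, F(13, 9) * x - F(4, 9) * x_prev - F(1, 9) * f_prime(x)
    orbit.append(x)

assert orbit == [q, r, p, q, r, p]       # the three points repeat exactly
```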

    5. (e)

      Implement Heavy-Ball momentum, Nesterov’s momentum and CGD https://classroom.github.com/a/f3PnRxTs.
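      As a starting point, here is a hedged sketch of heavy-ball and one common form of Nesterov's momentum on a generic smooth function (our own illustration with our own quadratic test problem and parameter choices; the actual assignment lives in the linked repository):

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, steps):
    """x_{k+1} = x_k + beta (x_k - x_{k-1}) - alpha grad(x_k), with x_{-1} := x_0."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_prev, x = x, x + beta * (x - x_prev) - alpha * grad(x)
    return x

def nesterov(grad, x0, alpha, beta, steps):
    """Look-ahead variant: the gradient is evaluated at the extrapolated point."""
    x_prev, x = x0, x0
    for _ in range(steps):
        y = x + beta * (x - x_prev)              # extrapolation step
        x_prev, x = x, y - alpha * grad(y)
    return x

# tiny smoke test on a quadratic with mu = 1, L = 25 (the constants from part (a))
H = np.diag([1.0, 25.0])
grad = lambda x: H @ x
x0 = np.array([1.0, 1.0])
print(heavy_ball(grad, x0, alpha=1/9, beta=4/9, steps=100))   # optimal HBM parameters from (b)
print(nesterov(grad, x0, alpha=1/25, beta=2/3, steps=100))    # usual 1/L and (sqrt(kappa)-1)/(sqrt(kappa)+1)
```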