Sheet 3

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning · Semester: FSS 2023 · Tutorial date: 30.03.2023 · Due: 12:00 in the exercise on Thursday, 30.03.2023
Exercise 1 (Convergence Speed).

Prove that

  (i)

    If we have

    \[ \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} = 0, \]

    then $e(x_k)$ converges super-linearly.

    Solution.

    We define $c_n := \sup_{k \ge n} \frac{e(x_{k+1})}{e(x_k)}$. Then

    \[ \lim_{n\to\infty} c_n = \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} = 0 \]

    and by definition

    \[ e(x_{k+1}) \le c_k\, e(x_k). \]

    Thus we have super-linear convergence. ∎

  (ii)

    If for $c \in (0,1)$ we have

    \[ \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} < c, \]

    then $e(x_k)$ converges linearly with rate $c$.

    Solution.

    We again define $c_n := \sup_{k \ge n} \frac{e(x_{k+1})}{e(x_k)}$. Then

    \[ \lim_{n\to\infty} c_n = \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)} < c, \]

    thus there exists $N \ge 0$ such that for all $n \ge N$ we have $c_n \le c$, and therefore for all $n \ge N$

    \[ e(x_{n+1}) \le c_n\, e(x_n) \le c\, e(x_n). \qquad ∎ \]
  (iii)

    If for $c \in (0,1)$ we have

    \[ \limsup_{k\to\infty} \frac{e(x_{k+1})}{e(x_k)^2} < c, \]

    then $e(x_k)$ converges quadratically with rate $c$.

    Solution.

    We similarly define $c_n := \sup_{k \ge n} \frac{e(x_{k+1})}{e(x_k)^2}$ and again get $\lim_{n\to\infty} c_n < c$. Thus there exists $N \ge 0$ such that for all $n \ge N$ we have $c_n \le c$, and therefore for all $n \ge N$

    \[ e(x_{n+1}) \le c_n\, e(x_n)^2 \le c\, e(x_n)^2. \qquad ∎ \]
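
To make the three notions concrete, here is a small numerical illustration (not part of the sheet; the example sequences are arbitrary choices): a linearly, a super-linearly and a quadratically converging error sequence, together with the ratios the definitions above refer to.

```python
# Numerical illustration of the convergence speeds in Exercise 1.

def ratios(errors, power=1):
    """Return the ratios e_{k+1} / e_k**power of consecutive errors."""
    return [e_next / e**power for e, e_next in zip(errors, errors[1:])]

def simulate(step, e0=0.5, n=8):
    """Generate errors e_0, ..., e_n via e_{k+1} = step(k, e_k)."""
    errors = [e0]
    for k in range(n):
        errors.append(step(k, errors[-1]))
    return errors

linear = simulate(lambda k, e: 0.5 * e)           # e_{k+1}/e_k = 0.5 < 1
superlinear = simulate(lambda k, e: e / (k + 2))  # e_{k+1}/e_k -> 0
quadratic = simulate(lambda k, e: e**2)           # e_{k+1}/e_k^2 = 1

print("linear ratios:     ", ratios(linear))        # constant 1/2
print("superlinear ratios:", ratios(superlinear))   # decreasing to 0
print("quadratic ratios:  ", ratios(quadratic, 2))  # bounded by c = 1
```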
Exercise 2 (Sub-gradients).

Let $f, g \colon \mathbb{R}^d \to \mathbb{R}$ be convex functions.

  (i)

    Prove that $\partial f(x)$ is a convex set for any $x \in \mathbb{R}^d$.

    Solution.

    Let $g_1, g_2 \in \partial f(x)$. Then for any $\lambda \in [0,1]$ and any $y \in \mathbb{R}^d$

    \begin{align*}
    f(y) &= \lambda f(y) + (1-\lambda) f(y) \\
    &\ge \lambda \bigl(f(x) + \langle g_1, y-x\rangle\bigr) + (1-\lambda)\bigl(f(x) + \langle g_2, y-x\rangle\bigr) \\
    &= f(x) + \langle \lambda g_1 + (1-\lambda) g_2,\, y-x\rangle,
    \end{align*}

    thus $\lambda g_1 + (1-\lambda) g_2 \in \partial f(x)$ by definition. ∎

  (ii)

    Prove for $a > 0$ that $\partial(af) = a\,\partial f$.

    Solution.

    We only need to prove “$\supseteq$”. Using $\tilde{f} = af$ with $\tilde{a} = \frac{1}{a}$, the other inclusion immediately follows.

    Let $a g_x \in a\,\partial f(x)$ with $g_x \in \partial f(x)$. We need to show that $a g_x \in \partial(af)(x)$. But this follows immediately from

    \[ (af)(y) = a\, f(y) \overset{a > 0,\ g_x \in \partial f(x)}{\ge} a\bigl(f(x) + \langle g_x, y-x\rangle\bigr) = (af)(x) + \langle a g_x,\, y-x\rangle. \qquad ∎ \]
  (iii)

    Prove that $\partial(f_1 + f_2) \supseteq \partial f_1 + \partial f_2$.

    Solution.

    Let $g_i \in \partial f_i(x)$ for $i = 1, 2$. Then we have $g_1 + g_2 \in \partial(f_1 + f_2)(x)$ because

    \begin{align*}
    (f_1+f_2)(y) &\ge \bigl(f_1(x) + \langle g_1, y-x\rangle\bigr) + \bigl(f_2(x) + \langle g_2, y-x\rangle\bigr) \\
    &= (f_1+f_2)(x) + \langle g_1 + g_2,\, y-x\rangle. \qquad ∎
    \end{align*}
  (iv)

    For $h(x) = f(Ax+b)$ prove $\partial h(x) \supseteq A^T \partial f(Ax+b)$. Prove equality for invertible $A$.

    Solution.

    Let $g_x \in \partial f(x)$ denote a subgradient of $f$ at $x$; in particular $g_{Ax+b} \in \partial f(Ax+b)$. Then $A^T g_{Ax+b} \in \partial h(x)$ because

    \begin{align*}
    h(y) = f(Ay+b) &\ge f(Ax+b) + \langle g_{Ax+b},\, (Ay+b) - (Ax+b)\rangle \\
    &= h(x) + \langle A^T g_{Ax+b},\, y-x\rangle.
    \end{align*}

    If $A$ is invertible, we have $f(x) = h(A^{-1}x - A^{-1}b)$, so by the previous statement with $\tilde{A} = A^{-1}$ and $\tilde{b} = -A^{-1}b$ we get the other inclusion. ∎
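
These inclusions can also be probed numerically. Below is a quick sanity check of (iv), not part of the sheet: we pick the test function $f = \lVert\cdot\rVert_1$ (whose subgradients appear in Exercise 3), a random $A$ and $b$, and test the subgradient inequality for $A^T g$ at many sampled points.

```python
# Sanity check of Exercise 2 (iv) for the test choice f = ||.||_1.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
b = rng.normal(size=d)

f = lambda z: np.abs(z).sum()     # f(z) = ||z||_1
h = lambda x: f(A @ x + b)        # h(x) = f(Ax + b)

x = rng.normal(size=d)
g = np.sign(A @ x + b)            # a subgradient of ||.||_1 at Ax + b

# h(y) >= h(x) + <A^T g, y - x> must hold for every y.
for _ in range(1000):
    y = rng.normal(size=d, scale=10.0)
    assert h(y) >= h(x) + (A.T @ g) @ (y - x) - 1e-9
print("subgradient inequality held for A^T g at all sampled points")
```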

Exercise 3 (Lasso).

Let

\[ f(x) = \tfrac{1}{2}\lVert x - y\rVert^2 + \lambda \lVert x\rVert_1 \]

for $x \in \mathbb{R}^d$ be the Lagrangian form of the least-squares LASSO method.

  (a)

    Compute a sub-gradient of $f$.

      Solution.

    Using $\partial(g + \lambda h)(x) \supseteq \partial g(x) + \lambda\,\partial h(x)$ (Exercise 2 (ii) and (iii)), we only need to determine the subgradients of $g(x) := \tfrac{1}{2}\lVert x - y\rVert^2$ and

    \[ h(x) := \lVert x\rVert_1 = \sum_{i=1}^d |x_i|. \]

    But $\nabla g(x) = x - y$ as $g$ is differentiable. And since it is also convex, we have

    \[ \partial g(x) = \{\nabla g(x)\}. \]

    Now a subgradient of $h_i(x) = |x_i|$ is given by $\operatorname{sgn}(x_i)\, e_i$, where $\operatorname{sgn}(0) \in [-1,1]$ can be selected arbitrarily, because

    \[ h_i(x) + \langle \operatorname{sgn}(x_i)\, e_i,\, y-x\rangle = \underbrace{|x_i| - \operatorname{sgn}(x_i)\, x_i}_{=0} + \operatorname{sgn}(x_i)\, y_i \le |y_i| = h_i(y). \]

      So again

    \[ \partial h(x) \supseteq \sum_{i=1}^d \partial h_i(x) \ni (\operatorname{sgn}(x_1), \dots, \operatorname{sgn}(x_d))^T =: s(x). \]

      So putting everything together we have

    \[ \partial f(x) \ni x - y + \lambda\, s(x). \]
  (b)

    Prove that $f$ is convex.

      Solution.

    As its set of sub-gradients is nowhere empty, it is convex. ∎

  (c)

    Find a global minimum of $f$.

      Solution.

    By the lecture it is sufficient to find a point $x$ such that $0 \in \partial f(x)$. By the previous exercise we therefore want to solve

    \[ 0 \overset{!}{=} x - y + \lambda\, s(x). \]

    Entry-wise this implies

    \begin{align*}
    x_i = y_i - \lambda \operatorname{sgn}(x_i)
    &= \begin{cases}
    y_i + \lambda & x_i < 0 \\
    y_i - \lambda\,[-1,1] & x_i = 0 \\
    y_i - \lambda & x_i > 0
    \end{cases} \\
    &= \begin{cases}
    y_i + \lambda & y_i + \lambda < 0 \\
    0 & y_i \in [-\lambda, \lambda] \\
    y_i - \lambda & y_i - \lambda > 0
    \end{cases}
    = \begin{cases}
    y_i + \lambda & y_i < -\lambda \\
    0 & y_i \in [-\lambda, \lambda] \\
    y_i - \lambda & y_i > \lambda,
    \end{cases}
    \end{align*}

    i.e. $x$ is the soft-thresholding of $y$.
  (d)

    Implement $f$ as a sub-type of “DifferentiableFunction” (even though it is not differentiable) by returning a single sub-gradient, and apply gradient descent to verify the global minimum: https://classroom.github.com/a/XqNuifmO (a sketch follows below).
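
A minimal Python sketch of (c) and (d) (the classroom template may prescribe a different language and interface; all names here are our own): we pretend $f$ is differentiable by always returning the particular subgradient $x - y + \lambda s(x)$ from (a) with $\operatorname{sgn}(0) = 0$, run descent steps, and compare against the soft-thresholding solution from (c).

```python
# Subgradient descent on the Lasso objective vs. the closed-form minimum.
import numpy as np

lam = 0.7
y = np.array([2.0, -0.3, 0.5, -1.5])

def f(x):
    """Lasso objective f(x) = 1/2 ||x - y||^2 + lam ||x||_1."""
    return 0.5 * np.sum((x - y) ** 2) + lam * np.abs(x).sum()

def subgradient(x):
    """The subgradient x - y + lam * s(x) from (a), with sgn(0) = 0."""
    return x - y + lam * np.sign(x)

def soft_threshold(y, lam):
    """Entry-wise closed-form global minimizer from (c)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

x, best = np.zeros_like(y), np.zeros_like(y)
for k in range(20000):
    x = x - subgradient(x) / (k + 2)  # diminishing step sizes
    if f(x) < f(best):
        best = x                      # track the best iterate seen

print("subgradient method:", best.round(3))
print("soft thresholding: ", soft_threshold(y, lam))  # [ 1.3  0.  0. -0.8]
```

With a constant step size the iterates would keep oscillating around the kinks of $\lVert\cdot\rVert_1$; the diminishing step size is the usual remedy for the subgradient method.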

Exercise 4 (Momentum Matrix).

Let $D = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$, $\alpha, \beta > 0$ and define

\[ T = \begin{pmatrix} (1+\beta)\mathbb{I} - \alpha D & -\beta \mathbb{I} \\ \mathbb{I} & \mathbf{0} \end{pmatrix} \in \mathbb{R}^{2d \times 2d}. \]

Prove there exists a regular $S \in \mathbb{R}^{2d \times 2d}$ such that

\[ S^{-1} T S = \hat{T} = \begin{pmatrix} T_1 & & \\ & \ddots & \\ & & T_d \end{pmatrix} \]

with

\[ T_i = \begin{pmatrix} 1 + \beta - \alpha\lambda_i & -\beta \\ 1 & 0 \end{pmatrix} \in \mathbb{R}^{2 \times 2}. \]
      Solution.

We simply define, with the standard basis $e_i \in \mathbb{R}^d$,

\[ S = \begin{pmatrix} e_1 & 0 & \cdots & e_d & 0 \\ 0 & e_1 & \cdots & 0 & e_d \end{pmatrix} \in \mathbb{R}^{2d \times 2d}; \]

in particular $S^T = S^{-1}$, since $S$ is a permutation matrix. ∎
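
Again not part of the sheet, a quick numerical confirmation with arbitrary test values for $\alpha$, $\beta$ and the $\lambda_i$: conjugating $T$ with this permutation matrix indeed produces the claimed block diagonal $\hat{T}$.

```python
# Verify S^{-1} T S = T_hat for the momentum matrix of Exercise 4.
import numpy as np

d, alpha, beta = 3, 0.1, 0.9
lambdas = np.array([1.0, 2.0, 5.0])
I = np.eye(d)

T = np.block([[(1 + beta) * I - alpha * np.diag(lambdas), -beta * I],
              [I, np.zeros((d, d))]])

# Permutation S with columns (e_1, 0), (0, e_1), ..., (e_d, 0), (0, e_d).
S = np.zeros((2 * d, 2 * d))
for i in range(d):
    S[i, 2 * i] = 1.0          # column 2i   is (e_i, 0)
    S[d + i, 2 * i + 1] = 1.0  # column 2i+1 is (0, e_i)

T_hat = np.zeros_like(T)       # expected block diagonal of the T_i
for i, lam in enumerate(lambdas):
    T_hat[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [
        [1 + beta - alpha * lam, -beta],
        [1.0, 0.0],
    ]

assert np.allclose(S.T @ T @ S, T_hat)  # S^{-1} = S^T for permutations
print("S^{-1} T S equals the block diagonal T_hat")
```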

Exercise 5 (PL-Inequality).

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and satisfies the Polyak-Łojasiewicz inequality

\[ \lVert \nabla f(x)\rVert^2 \ge 2c\,(f(x) - f_*) \tag{PL} \]

for some $c > 0$ and all $x \in \mathbb{R}^d$, with $f_* = \min_x f(x) > -\infty$.

  (I)

    Prove that gradient descent with fixed step size $\alpha_k = \frac{1}{L}$ converges linearly in the sense

    \[ f(x_k) - f_* \le \left(1 - \tfrac{c}{L}\right)^k (f(x_0) - f_*). \]
        Solution.

    By $L$-smoothness and the descent lemma, we have

    \[ f(x_{k+1}) \le f(x_k) - \tfrac{1}{2L}\lVert\nabla f(x_k)\rVert^2 \overset{\text{(PL)}}{\le} f(x_k) - \tfrac{c}{L}\,(f(x_k) - f_*). \]

    Subtracting $f_*$ from both sides, we get

    \[ f(x_{k+1}) - f_* \le \left(1 - \tfrac{c}{L}\right)(f(x_k) - f_*), \]

    and the claim follows by induction over $k$. ∎
  (II)

    Prove that $\mu$-strong convexity and $L$-smoothness imply the PL-inequality.

        Solution.

    Recall by the solution of Sheet 1, Exercise 6 (iii), and strong convexity we have

    \begin{align*}
    \mu \lVert x-y\rVert^2 &\le D_f^{(B)}(x,y) + D_f^{(B)}(y,x) \\
    &= \langle \nabla f(x) - \nabla f(y),\, x-y\rangle
    \end{align*}

    and therefore, by the Cauchy-Schwarz inequality,

    \[ \mu \lVert x-y\rVert \le \lVert \nabla f(x) - \nabla f(y)\rVert. \tag{1} \]

    Finally we know by $L$-smoothness and $\nabla f(x_*) = 0$, where $x_*$ is the minimum,

    \[ f(x) - f(x_*) \overset{\nabla f(x_*) = 0}{=} D_f^{(B)}(x, x_*) \overset{L\text{-smooth}}{\le} \tfrac{L}{2}\lVert x - x_*\rVert^2 \overset{(1)}{\le} \tfrac{L}{2\mu^2}\,\bigl\lVert \nabla f(x) - \underbrace{\nabla f(x_*)}_{=0}\bigr\rVert^2, \]

    i.e. (PL) holds with $c = \frac{\mu^2}{L}$. ∎
  (III)

    Use a graphing calculator to find $c$ such that $f(x) = x^2 + 3\sin^2(x)$ satisfies the PL-condition (argue why $x \to \pm\infty$ is not a problem) and prove it is not convex.

        Solution.

    For $c = \frac{1}{6}$ we have the PL-condition (checked graphically on a bounded range).

    As $f'(x) = 2(x + 3\sin(x)\cos(x))$ and therefore

    \[ f'(x)^2 = 4\bigl(x + 3\underbrace{\sin(x)\cos(x)}_{\in [-1,1]}\bigr)^2 \overset{|x| \ge 3}{\ge} 4(|x| - 3)^2, \]

    the $x^2$ term dominates for large $|x|$, so if we make $c$ small enough we can ensure the inequality for large $|x|$ as well.

    $f$ is not convex because

    \[ f\bigl(\tfrac{1}{2}\pi + \tfrac{1}{2}\cdot 0\bigr) = \tfrac{\pi^2}{4} + 3 > \tfrac{\pi^2}{2} = \tfrac{1}{2}f(\pi) + \tfrac{1}{2}f(0). \qquad ∎ \]
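
A numerical stand-in for the graphing calculator (the grid range and resolution are arbitrary choices): we check the ratio $\frac{f'(x)^2}{2(f(x) - f_*)} \ge c$ with $c = \frac{1}{6}$ and $f_* = 0$ on a grid, and evaluate the midpoint counterexample to convexity.

```python
# Grid check of the PL constant c = 1/6 for f(x) = x^2 + 3 sin(x)^2.
import numpy as np

f = lambda x: x**2 + 3 * np.sin(x) ** 2
df = lambda x: 2 * (x + 3 * np.sin(x) * np.cos(x))

# f_* = 0 is attained at x = 0; exclude the minimizer (0/0 ratio there).
# The tail |x| > 10 is covered by the growth argument above.
x = np.linspace(-10.0, 10.0, 200001)
x = x[np.abs(x) > 1e-6]
ratio = df(x) ** 2 / (2 * f(x))   # (PL) with constant c needs ratio >= c
print("min ratio:", ratio.min())  # ~0.18 > 1/6, so c = 1/6 works here

# Midpoint counterexample to convexity: f(pi/2) > (f(0) + f(pi)) / 2.
print(f(np.pi / 2), ">", (f(0) + f(np.pi)) / 2)  # 5.467... > 4.934...
```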

Exercise 6 (Weak PL-Inequality).

Assume $f \colon \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and satisfies the “weak PL inequality”

\[ \lVert \nabla f(x)\rVert \ge 2c\,(f(x) - f_*) \]

for some $c > 0$ and all $x \in \mathbb{R}^d$, with $f_* = \min_x f(x) > -\infty$.

  A.

    Let $a_0 \in [0, \frac{1}{q}]$ for some $q > 0$ and assume the sequence $(a_n)_{n \in \mathbb{N}}$ is positive and satisfies the diminishing contraction

    \[ 0 \le a_{n+1} \le (1 - q a_n)\, a_n \qquad \forall n \ge 0. \]

    Prove the convergence rate

    \[ a_n \le \frac{1}{nq + 1/a_0} \le \frac{1}{(n+1)q}. \]
          Hint.

          A useful checkpoint might be the telescoping sum of

    \[ \frac{1}{a_{n+1}} - \frac{1}{a_n} \ge q. \]
          Solution.

          Divide the reordered contraction

    \[ a_n \ge a_{n+1} + q\, a_n^2 \]

          by anan+1 to obtain

    \[ \frac{1}{a_{n+1}} \ge \frac{1}{a_n} + q\,\underbrace{\frac{a_n}{a_{n+1}}}_{\ge 1} \ge \frac{1}{a_n} + q, \]

          which leads to

    \[ \frac{1}{a_n} - \frac{1}{a_0} = \sum_{k=0}^{n-1} \Bigl(\frac{1}{a_{k+1}} - \frac{1}{a_k}\Bigr) \ge nq. \]

          Reordering we obtain our claim

    \[ a_n \le \frac{1}{nq + \frac{1}{a_0}} \overset{a_0 \le \frac{1}{q}}{\le} \frac{1}{(n+1)q}. \qquad ∎ \]
  B.

    Prove that $f$ is bounded. More specifically, $e(x) := f(x) - f_* \le \frac{L}{2c^2}$ for all $x$.

          Hint.

          Use Sheet 1 Exercise 1 (i).

          Solution.

          Using Sheet 1 Exercise 1 (i), we get

    \[ f_* \le f(x) - \tfrac{1}{2L}\lVert\nabla f(x)\rVert^2 \]

    and therefore

    \[ e(x) \ge \tfrac{1}{2L}\lVert \nabla f(x)\rVert^2 \overset{\text{weak PL}}{\ge} \tfrac{4c^2}{2L}\, e(x)^2. \]

    Dividing both sides by $e(x)$ (the claim is trivial if $e(x) = 0$), we obtain

    \[ 1 \ge \tfrac{2c^2}{L}\, e(x) \]

    and thus

    \[ e(x) \le \frac{L}{2c^2}. \qquad ∎ \]
  C.

    For gradient descent $x_{n+1} - x_n = -\alpha_n \nabla f(x_n)$ with constant step size $\alpha_n = \frac{1}{L}$ prove the convergence rate

    \[ f(x_n) - f_* \le \frac{L}{2c^2 (n+1)}. \]
          Solution.

          Using L-smoothness, we have

    \begin{align*}
    f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k),\, x_{k+1} - x_k\rangle + \tfrac{L}{2}\lVert x_{k+1} - x_k\rVert^2 \\
    &= f(x_k) - \underbrace{\alpha_k \bigl(1 - \tfrac{L}{2}\alpha_k\bigr)}_{= \frac{1}{2L}}\, \lVert \nabla f(x_k)\rVert^2.
    \end{align*}

    If we subtract $f_*$ from both sides and apply our weak PL inequality ($\lVert \nabla f(x_k)\rVert^2 \ge 4c^2\, e(x_k)^2$), we get

    \[ e(x_{k+1}) \le e(x_k) - \tfrac{4c^2}{2L}\, e(x_k)^2 = \Bigl(1 - \tfrac{2c^2}{L}\, e(x_k)\Bigr)\, e(x_k). \]

    With $q = \frac{2c^2}{L}$ and $e(x_0) \le \frac{L}{2c^2} = \frac{1}{q}$ by part B, we can apply part A to obtain our claim (a numerical sanity check follows below). ∎
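
As referenced above, a numerical sanity check of part A (the values of $q$ and $a_0$ are arbitrary test choices): we iterate the extremal recursion $a_{n+1} = (1 - q a_n) a_n$ and compare against the proven bound; part C corresponds to $a_n = e(x_n)$ with $q = \frac{2c^2}{L}$.

```python
# Check a_n <= 1/(n q + 1/a_0) <= 1/((n+1) q) for the extremal recursion.
q, a0 = 0.5, 1.9          # requires a_0 in [0, 1/q]; here 1.9 <= 2 = 1/q
a = a0
for n in range(1, 101):
    a = (1 - q * a) * a   # extremal case of the diminishing contraction
    assert 0.0 <= a <= 1 / (n * q + 1 / a0) <= 1 / ((n + 1) * q)
print("bound holds for n = 1, ..., 100; a_100 =", a)
```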