Sheet 1

Prof. Simon Weißmann, Felix Benning
Course: Optimization in Machine Learning
Semester: FSS 2023
Tutorial date: 02.03.2023
Due: 12:00 in the exercise on Thursday, 02.03.2023
Exercise 1 (Convex Examples).

Prove that the following functions are convex:

(i) affine linear functions, i.e. $f(x) = a^\top x + c$ for $a \in \mathbb{R}^d$, $c \in \mathbb{R}$,

    Solution.

    We have

\begin{align*}
f(\lambda x + (1-\lambda)y) &= a^\top(\lambda x + (1-\lambda)y) + \underbrace{(\lambda + 1 - \lambda)}_{=1} c \\
&= \lambda\underbrace{(a^\top x + c)}_{f(x)} + (1-\lambda)\underbrace{(a^\top y + c)}_{f(y)}. \qquad ∎
\end{align*}
(ii) norms, i.e. $x \mapsto \|x\|$,

    Solution.

    We have

\[
\|\lambda x + (1-\lambda)y\| \overset{\Delta\text{-ineq.}}{\le} \|\lambda x\| + \|(1-\lambda)y\| \overset{\text{scaling}}{=} \lambda\|x\| + (1-\lambda)\|y\|. \qquad ∎
\]
(iii) sums of convex functions $f_k$, i.e. $f(x) = \sum_{k=1}^n f_k(x)$,

    Solution.

    We have

\[
f(\lambda x + (1-\lambda)y) = \sum_{k=1}^n \underbrace{f_k(\lambda x + (1-\lambda)y)}_{\le\, \lambda f_k(x) + (1-\lambda) f_k(y)}
\le \lambda \underbrace{\sum_{k=1}^n f_k(x)}_{=f(x)} + (1-\lambda) \underbrace{\sum_{k=1}^n f_k(y)}_{=f(y)}. \qquad ∎
\]
(iv) $F(x) := \sup_{f \in \mathcal{F}} f(x)$ for a set $\mathcal{F}$ of convex functions.

    Solution.

    We have

\[
F(\lambda x + (1-\lambda)y) \le \sup_{f \in \mathcal{F}} \bigl[\lambda f(x) + (1-\lambda) f(y)\bigr]
\le \sup_{f \in \mathcal{F}} \lambda f(x) + \sup_{g \in \mathcal{F}} (1-\lambda) g(y)
= \lambda F(x) + (1-\lambda) F(y). \qquad ∎
\]
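As a quick numerical sanity check (not a replacement for the proofs), the following NumPy sketch samples random points and verifies the convexity inequality for instances of all four examples; the concrete vector $a$, constant $c$ and the dimension are illustrative choices.

```python
# Sanity check: f(lx + (1-l)y) <= l*f(x) + (1-l)*f(y) on random samples.
import numpy as np

rng = np.random.default_rng(0)
a, c = rng.normal(size=5), 0.7  # illustrative parameters

examples = {
    "affine": lambda x: a @ x + c,
    "norm": np.linalg.norm,
    "sum": lambda x: np.linalg.norm(x) + a @ x + c,                # (ii) + (i)
    "sup": lambda x: max(a @ x + c, -(a @ x), np.linalg.norm(x)),  # finite sup
}

for name, f in examples.items():
    for _ in range(1000):
        x, y = rng.normal(size=5), rng.normal(size=5)
        lam = rng.uniform()
        assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
    print(name, "passed")
```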
    Exercise 2 (Finite Jensen).

Let $\varphi$ be convex and let $\sum_{i=1}^n \lambda_i = 1$ with $\lambda_i \ge 0$. Prove

\[
\varphi\Bigl(\sum_{i=1}^n \lambda_i x_i\Bigr) \le \sum_{i=1}^n \lambda_i \varphi(x_i)
\]

and deduce $\Bigl(\frac{1}{n}\sum_{i=1}^n x_i\Bigr)^2 \le \frac{1}{n}\sum_{i=1}^n x_i^2$.

    Solution.

For $n = 2$ this is simply the definition of convexity. Assume the claim holds for $n$. If $\lambda_{n+1} = 1$ there is nothing to prove, so let $\lambda_{n+1} < 1$. Then the induction step is given by

\begin{align*}
\varphi\Bigl(\sum_{i=1}^{n+1} \lambda_i x_i\Bigr)
&= \varphi\Bigl(\lambda_{n+1} x_{n+1} + (1 - \lambda_{n+1}) \sum_{i=1}^{n} \underbrace{\tfrac{\lambda_i}{1 - \lambda_{n+1}}}_{=:\tilde\lambda_i} x_i\Bigr) \\
&\le \lambda_{n+1}\varphi(x_{n+1}) + (1 - \lambda_{n+1}) \underbrace{\varphi\Bigl(\sum_{i=1}^{n} \tilde\lambda_i x_i\Bigr)}_{\overset{\text{ind.}}{\le} \sum_{i=1}^{n} \tilde\lambda_i \varphi(x_i)} \\
&\le \sum_{i=1}^{n+1} \lambda_i \varphi(x_i),
\end{align*}

where it is easy to check that the $\tilde\lambda_i$ sum to one. For the second part select $\lambda_i = \frac{1}{n}$ and $\varphi(x) = x^2$, which is convex because

\begin{align*}
(\lambda x + (1-\lambda)y)^2
&= \underbrace{\lambda^2}_{=\lambda(1 - (1-\lambda))} x^2 + \underbrace{(1-\lambda)^2}_{=(1-\lambda) - \lambda(1-\lambda)} y^2 + 2\lambda(1-\lambda)xy \\
&= \lambda x^2 + (1-\lambda) y^2 - \lambda(1-\lambda)\underbrace{(x^2 + y^2 - 2xy)}_{=(x-y)^2 \ge 0} \\
&\le \lambda x^2 + (1-\lambda) y^2. \qquad ∎
\end{align*}
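As a quick numerical sanity check (illustrative only), one can sample random weights from the simplex and verify Jensen's inequality for $\varphi(x) = x^2$:

```python
# Sanity check of finite Jensen for phi(x) = x^2 with random simplex weights.
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: x**2

for _ in range(1000):
    n = int(rng.integers(2, 10))
    lam = rng.dirichlet(np.ones(n))  # lam_i >= 0 and sum(lam) == 1
    x = rng.normal(size=n)
    assert phi(lam @ x) <= lam @ phi(x) + 1e-12
print("Jensen holds on all samples; lam_i = 1/n gives (mean x)^2 <= mean(x^2)")
```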
    Exercise 3 (Strict & Strong Convexity).

    Prove the following statements

(a) $\mu$-strong convexity (with $\mu > 0$) implies strict convexity.

      Solution.

Assuming $x \ne y$, we have

\[
0 < \frac{\mu}{2}\|x - y\|^2 \overset{\mu\text{-strong conv.}}{\le} f(x) - f(y) - \langle \nabla f(y), x - y\rangle.
\]

Moving the negative terms to the other side leaves $f(x) > f(y) + \langle \nabla f(y), x - y\rangle$ for all $x \ne y$, which is equivalent to strict convexity by Proposition A.1.8 in the lecture. ∎

(b) For twice differentiable $f$, the following are equivalent:

(I) $\nabla^2 f(x) \succeq \mu\mathbb{I}$,

(II) $z^\top \nabla^2 f(x) z \ge \mu\|z\|^2$ for all $z$,

(III) $f$ is $\mu$-strongly convex,

where $\mathbb{I}$ is the identity matrix and

\[
A \succeq B \;:\Longleftrightarrow\; A - B \text{ is (weakly) positive definite, i.e. positive semi-definite.}
\]
      Solution.

Let us get (I) $\Leftrightarrow$ (II) out of the way first:

\[
z^\top \nabla^2 f(x) z - \mu\|z\|^2 = \langle z, \nabla^2 f(x) z\rangle - \langle z, \mu\mathbb{I}z\rangle = z^\top\bigl(\nabla^2 f(x) - \mu\mathbb{I}\bigr)z.
\]

If we have (I), then the rightmost side is non-negative and thus we also have (II). For the other direction we simply start from the left.

For (III) $\Rightarrow$ (I) let

\[
g(y) := f(y) - \frac{\mu}{2}\|y\|^2.
\]

Because $f$ is $\mu$-strongly convex, we have

\begin{align*}
g(y) &= f(y) - \frac{\mu}{2}\|y\|^2 \\
&\ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 - \frac{\mu}{2}\|y\|^2 \\
&= \underbrace{f(x) - \frac{\mu}{2}\|x\|^2}_{=g(x)} + \bigl\langle \underbrace{\nabla f(x) - \mu x}_{=\nabla g(x)}, y - x\bigr\rangle
+ \underbrace{\frac{\mu}{2}\|x\|^2 + \mu\langle x, y - x\rangle + \frac{\mu}{2}\|y - x\|^2 - \frac{\mu}{2}\|y\|^2}_{=\frac{\mu}{2}\|x + (y - x)\|^2 - \frac{\mu}{2}\|y\|^2 = 0} \\
&= g(x) + \langle \nabla g(x), y - x\rangle.
\end{align*}

By Proposition A.1.8 this shows that $g$ is convex. So we know by the lecture that

\[
\nabla^2 g(x) = \nabla^2 f(x) - \mu\mathbb{I}
\]

is (weakly) positive definite, which is exactly (I).

On the other hand, for (I) $\Rightarrow$ (III): if $\nabla^2 f \succeq \mu\mathbb{I}$, we know that $\nabla^2 g$ is (weakly) positive definite and thus that $g$ is convex by the lecture. Reusing the equations above, with the inequality now supplied by convexity of $g$, we obtain

\begin{align*}
f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2 - \frac{\mu}{2}\|y\|^2
&= g(x) + \langle \nabla g(x), y - x\rangle \\
&\overset{\text{conv.}}{\le} g(y) = f(y) - \frac{\mu}{2}\|y\|^2.
\end{align*}

Adding $\frac{\mu}{2}\|y\|^2$ to both sides results in the definition of $\mu$-strong convexity of $f$. ∎
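To see the equivalence in action, here is a small numerical check (not part of the proof) on a quadratic $f(x) = \frac{1}{2}x^\top A x$, where the largest admissible $\mu$ in (I) is the smallest eigenvalue of the Hessian $A$; the random matrix is an illustrative choice.

```python
# Check (I) vs (III) on f(x) = 0.5 x^T A x: A >= mu*I with mu = lambda_min(A),
# and the strong convexity inequality (III) holds with that same mu.
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B @ B.T + np.eye(4)           # symmetric positive definite Hessian
mu = np.linalg.eigvalsh(A).min()  # (I): A - mu*I is positive semi-definite

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    bregman = f(y) - f(x) - grad(x) @ (y - x)
    assert bregman >= 0.5 * mu * (y - x) @ (y - x) - 1e-9  # (III)
print("mu-strong convexity verified with mu =", mu)
```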

      Exercise 4 (Convexity and Minima).

      Prove the following statements

(I) If $f$ is convex, then every local minimum is also a global minimum.

        Solution.

Let $x^*$ be a local minimum and assume it is not global. Then there exists $y$ such that

\[
f(y) < f(x^*).
\]

Select $\lambda \in (0, 1)$ small enough that $x^* + \lambda(y - x^*)$ still lies in the $\epsilon$-neighborhood of $x^*$ on which $x^*$ is minimal. Then we obtain the contradiction

\[
f(x^*) \le f\bigl(x^* + \lambda(y - x^*)\bigr) \overset{\text{conv.}}{\le} (1-\lambda) f(x^*) + \underbrace{\lambda}_{>0}\underbrace{f(y)}_{<f(x^*)} < f(x^*). \qquad ∎
\]
(II) If $f$ is strictly convex, then there exists at most one minimum.

        Solution.

Let $x \ne y$ both be minima (which are global by (I)). Then for any $\lambda \in (0,1)$

\[
f(\lambda x + (1-\lambda)y) \overset{\text{strict conv.}}{<} \lambda f(x) + (1-\lambda) f(y) = \min_z f(z),
\]

        which is a contradiction. ∎

(III) If $f$ is convex and differentiable with $\nabla f(x^*) = 0$, then $x^*$ is a minimum.

        Solution.

By Proposition A.1.8 convexity is equivalent to

\[
f(x) \ge f(x^*) + \bigl\langle \underbrace{\nabla f(x^*)}_{=0}, x - x^*\bigr\rangle = f(x^*) \quad \text{for all } x.
\]

So $x^*$ is a (global) minimum. ∎

      Exercise 5 (Directional Minima).

Let $f$ be some differentiable function. For every direction $d \in \mathbb{R}^n$ define

\[
g_d(\alpha) := f(x^* + \alpha d).
\]

Assume that for every $d$, $g_d$ is minimized (at least locally) by $\alpha = 0$. Prove that

(I) We necessarily have $\nabla f(x^*) = 0$.

        Solution.

        If we select a standard basis vector as the direction, then

\[
0 = g_{e_i}'(\alpha)\big|_{\alpha=0} = \bigl\langle \nabla f(x^* + \alpha e_i), e_i\bigr\rangle\Big|_{\alpha=0} = \langle \nabla f(x^*), e_i\rangle.
\]

So the $i$-th entry of $\nabla f(x^*)$ is zero. As $i$ was arbitrary, we are done. ∎

(II) $f(x^*)$ is not necessarily a minimum of $f$.

        Hint.

Let $0 < p < q$ and define

\[
f(y, z) := (z - p y^2)(z - q y^2),
\]

consider $x^* = (0, 0)$ and prove that $f(y, m y^2) < 0$ for $p < m < q$.

        Solution.

Let $0 < p < q$ and define

\[
f(y, z) := (z - p y^2)(z - q y^2).
\]

Then for $p < m < q$ and $y \ne 0$ we have

\[
f(y, m y^2) = (m y^2 - p y^2)(m y^2 - q y^2) = \underbrace{y^4}_{>0}\underbrace{(m - p)}_{>0}\underbrace{(m - q)}_{<0} < 0.
\]

This means that $f(0,0) = 0$ is not a minimum, since the points $(y, m y^2)$ come arbitrarily close to $x^* = (0,0)$ as $y \to 0$ while $f$ stays negative on them.

Let $d = (d_1, d_2)$ be an arbitrary direction. Then

\begin{align*}
g_d(\alpha) &= \bigl(\alpha d_2 - p(\alpha d_1)^2\bigr)\bigl(\alpha d_2 - q(\alpha d_1)^2\bigr) \\
&= \alpha^2 (d_2 - p\alpha d_1^2)(d_2 - q\alpha d_1^2) \\
&= \alpha^2 \bigl(d_2^2 - \alpha d_2 (p + q) d_1^2 + \alpha^2 p q d_1^4\bigr).
\end{align*}

Thus $g_d'(0) = 0$, and for $d_2 \ne 0$ we have $g_d''(0) = 2 d_2^2 > 0$, which implies $g_d$ has a local minimum at zero. For $d_2 = 0$ we get $g_d(\alpha) = \alpha^4 p q d_1^4 \ge 0 = g_d(0)$, so $\alpha = 0$ is a minimum in this case as well. ∎
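The following NumPy sketch illustrates the counterexample numerically (the values $p = 1$, $q = 3$, $m = 2$ are illustrative choices): every directional restriction has a local minimum at $\alpha = 0$, yet $f$ is negative along the parabola $z = m y^2$ arbitrarily close to the origin.

```python
# Directional restrictions of f(y,z) = (z - p*y^2)(z - q*y^2) at the origin.
import numpy as np

p, q = 1.0, 3.0
f = lambda y, z: (z - p * y**2) * (z - q * y**2)

rng = np.random.default_rng(3)
for _ in range(100):
    d1, d2 = rng.normal(size=2)
    # a neighborhood of alpha = 0 small enough to see the local minimum
    a_max = 0.5 * abs(d2) / (q * d1**2 + 1e-12)
    alpha = np.linspace(-a_max, a_max, 201)
    g = f(alpha * d1, alpha * d2)
    assert g.argmin() == 100  # the middle grid point, i.e. alpha = 0

ys = np.array([1e-1, 1e-2, 1e-3])
print(f(ys, 2.0 * ys**2))  # negative although (y, 2y^2) -> (0, 0)
```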

      Exercise 6 (Bregman Divergence).

The Bregman divergence $D_f^{(B)}$ of a continuously differentiable function $f\colon \mathbb{R}^d \to \mathbb{R}$ is defined as the error of the linear approximation, and it is related to $\mu$-strong convexity and Lipschitz continuous gradients as follows:

\[
\frac{\mu}{2}\|x - x_0\|^2
\;\overset{\substack{\mu\text{-strongly convex}\\ \text{(definition)}}}{\le}\;
\underbrace{f(x) - f(x_0) - \langle \nabla f(x_0), x - x_0\rangle}_{\text{linear approximation error} \,=:\, D_f^{(B)}(x, x_0)}
\;\overset{\substack{L\text{-Lipschitz gradient}\\ \text{(Descent Lemma)}}}{\le}\;
\frac{L}{2}\|x - x_0\|^2.
\]

For $\mu = 0$ the lower bound is simply the convexity condition of Prop. A.1.8, so non-negativity of the Bregman divergence implies convexity. The $L$-Lipschitz gradient, on the other hand, provides an upper bound on the Bregman divergence, which immediately results in an upper bound on $f$:

\[
f(x) = f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + D_f^{(B)}(x, x_0)
\le f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + \frac{L}{2}\|x - x_0\|^2. \tag{1}
\]
(I) Prove that for functions $f$ with $L$-Lipschitz gradients we have, for all $x_0$,

\[
\min_x f(x) \le f(x_0) - \frac{1}{2L}\|\nabla f(x_0)\|^2,
\]

by minimizing the upper bound (1). What is the minimizer of the upper bound?

        Hint.

First minimize over the direction $x - x_0$ subject to the length $\|x - x_0\| = r$ being constant. Then minimize over $r$.

        Solution.

        We first solve the directional minimization problem

\begin{align*}
\operatorname*{arg\,min}_{x\colon \|x - x_0\| = r} f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + \frac{L}{2}\|x - x_0\|^2
&= x_0 + \operatorname*{arg\,min}_{d\colon \|d\| = r} \langle \nabla f(x_0), d\rangle \\
&= x_0 + \operatorname*{arg\,max}_{d\colon \|d\| = r} \langle -\nabla f(x_0), d\rangle \\
&= x_0 - r\,\frac{\nabla f(x_0)}{\|\nabla f(x_0)\|}, \tag{2}
\end{align*}

where the last equation holds because, by Cauchy-Schwarz,

\[
\langle -\nabla f(x_0), d\rangle \overset{\text{C.S.}}{\le} \|\nabla f(x_0)\|\, r,
\]

with equality for $d = -r\,\nabla f(x_0)/\|\nabla f(x_0)\|$.

        So in summary we have

\[
\min_x f(x) \;\le\; \min_r \min_{x\colon \|x - x_0\| = r} f(x_0) + \langle \nabla f(x_0), x - x_0\rangle + \frac{L}{2}\|x - x_0\|^2
\;\overset{(2)}{=}\; \min_r\, f(x_0) - r\|\nabla f(x_0)\| + \frac{L}{2}r^2.
\]

Minimizing over the length $r$ means minimizing a convex parabola, so the first-order condition is sufficient, yielding

\[
r^* = \frac{\|\nabla f(x_0)\|}{L}.
\]

Reinserting $r^*$ into our upper bound yields the claim, and we obtain the minimizer by inserting $r^*$ into (2), resulting in a gradient descent step

\[
x^* = x_0 - \frac{1}{L}\nabla f(x_0).
\]
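The bound and the gradient descent step can be checked numerically on a quadratic, where $L$ is the largest Hessian eigenvalue and the true minimum is known in closed form (all concrete choices below are illustrative):

```python
# Check min_x f(x) <= f(x0) - ||grad f(x0)||^2 / (2L) on a quadratic.
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)          # Hessian of f
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient
b = rng.normal(size=5)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
f_min = f(np.linalg.solve(A, b))  # exact minimum of the quadratic

x0 = rng.normal(size=5)
g = grad(x0)
assert f_min <= f(x0) - g @ g / (2 * L) + 1e-9
x_star = x0 - g / L               # minimizer of the upper bound: a GD step
print(f(x_star) <= f(x0) - g @ g / (2 * L) + 1e-9)  # True by the above
```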
(II) Prove for convex functions $f$ with $L$-Lipschitz gradients that

\[
D_f^{(B)}(x, x_0) \ge \frac{1}{2L}\|\nabla f(x) - \nabla f(x_0)\|^2.
\]
        Hint.

Apply (I) to

\[
\phi(x) := D_f^{(B)}(x, x_0).
\]

Due to convexity you should already know the global minimum of $\phi$.

        Solution.

To apply (I), we want to prove that

\[
\phi(x) := D_f^{(B)}(x, x_0) = f(x) - f(x_0) - \langle \nabla f(x_0), x - x_0\rangle
\]

has an $L$-Lipschitz gradient. But the gradient of $\phi$ is given by

\[
\nabla\phi(x) = \nabla f(x) - \nabla f(x_0),
\]

which is merely a shift of the gradient of $f$ and therefore still $L$-Lipschitz. Due to convexity the Bregman divergence is non-negative, so $x_0$ is a global minimum of $\phi$ with $\phi(x_0) = 0$. So we have

\[
0 = D_f^{(B)}(x_0, x_0) = \min_y \phi(y) \overset{\text{(I)}}{\le} \phi(x) - \frac{1}{2L}\|\nabla\phi(x)\|^2.
\]

Reordering and inserting the definition we obtain our claim

\[
\frac{1}{2L}\|\nabla f(x) - \nabla f(x_0)\|^2 = \frac{1}{2L}\|\nabla\phi(x)\|^2 \le \phi(x) = D_f^{(B)}(x, x_0). \qquad ∎
\]
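Again a quick sanity check (illustrative, on a convex quadratic where the Bregman divergence and $L$ are explicit):

```python
# Check D_f(x, x0) >= ||grad f(x) - grad f(x0)||^2 / (2L) on 0.5 x^T A x.
import numpy as np

rng = np.random.default_rng(5)
B = rng.normal(size=(5, 5))
A = B @ B.T                       # convex quadratic f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
breg = lambda x, x0: f(x) - f(x0) - grad(x0) @ (x - x0)

for _ in range(1000):
    x, x0 = rng.normal(size=5), rng.normal(size=5)
    diff = grad(x) - grad(x0)
    assert breg(x, x0) >= diff @ diff / (2 * L) - 1e-9
print("Bregman lower bound verified")
```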
(III) Prove that for convex functions $f$ with $L$-Lipschitz gradients we have, for all $x, y$,

\[
\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2.
\]
        Hint.

Use (II) twice.

        Solution.

We simply apply (II) twice:

\begin{align*}
\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 &\le D_f^{(B)}(y, x) + D_f^{(B)}(x, y) \\
&= f(y) - f(x) - \langle \nabla f(x), y - x\rangle + f(x) - f(y) - \langle \nabla f(y), x - y\rangle \\
&= \langle \nabla f(x) - \nabla f(y), x - y\rangle. \qquad ∎
\end{align*}
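This cocoercivity inequality can also be spot-checked numerically (illustrative quadratic as before):

```python
# Check <grad f(x) - grad f(y), x - y> >= ||grad f(x) - grad f(y)||^2 / L.
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(5, 5))
A = B @ B.T                       # convex quadratic f(x) = 0.5 x^T A x
L = np.linalg.eigvalsh(A).max()
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    diff = grad(x) - grad(y)
    assert diff @ (x - y) >= diff @ diff / L - 1e-9
print("cocoercivity verified")
```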
(IV) Prove for convex $f$ that the upper bound

\[
D_f^{(B)}(x, x_0) \le \frac{L}{2}\|x - x_0\|^2
\]

is sufficient for $L$-Lipschitz continuity of the gradient $\nabla f$.

        Hint.

Argue why you can use (III) without circular reasoning.

        Solution.

We deduced this upper bound in (I) and then never used the Lipschitz property again. In (II) we only needed it to apply (I), and in (III) we only needed it to apply (II). So our premise is sufficient for (III) without Lipschitz continuity of the gradient. By (III) and Cauchy-Schwarz we therefore get

\[
\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y\rangle \overset{\text{C.S.}}{\le} \|\nabla f(x) - \nabla f(y)\|\,\|x - y\|.
\]

Multiplying this inequality by $L/\|\nabla f(x) - \nabla f(y)\|$ results in the claim (if $\nabla f(x) = \nabla f(y)$ there is nothing to prove). ∎
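A minimal numerical illustration of the conclusion (again on an illustrative quadratic):

```python
# Check the Lipschitz bound ||grad f(x) - grad f(y)|| <= L * ||x - y||.
import numpy as np

rng = np.random.default_rng(7)
B = rng.normal(size=(5, 5))
A = B @ B.T
L = np.linalg.eigvalsh(A).max()
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
print("gradient is L-Lipschitz with L =", L)
```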

(V) In this last part we assume that $f$ is $\mu$-strongly convex (and its gradient $L$-Lipschitz). Prove for all $x, y$

\[
\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{\mu L}{L + \mu}\|x - y\|^2 + \frac{1}{L + \mu}\|\nabla f(x) - \nabla f(y)\|^2.
\]
        Hint.

Note that strong convexity provides an alternative lower bound to (II). Using this alternative lower bound we could directly obtain a modified version of (III). But we are greedy: we want to use both lower bounds for an even tighter bound on the scalar product. We therefore want to use

\[
g_x(y) := f(y) - \frac{\mu}{2}\|x - y\|^2
\]

to break down the Bregman divergence of $f$, so we can have our cake and eat it too:

\[
D_f^{(B)}(y, z) = \underbrace{D_{g_x}^{(B)}(y, z)}_{\text{use (II)}} + \underbrace{\frac{\mu}{2}\|y - z\|^2}_{\text{strong convexity}}.
\]

Remember to check convexity of $g_x$ and Lipschitz continuity of $\nabla g_x$ before applying (II). When applying (II), the selection $z = x$ might be helpful.

        Solution.

We define

\[
g_x(y) := f(y) - \frac{\mu}{2}\|x - y\|^2,
\]

for which we have

\[
\nabla g_x(y) = \nabla f(y) - \mu(y - x).
\]

        Therefore we have

\begin{align*}
D_{g_x}^{(B)}(y, z) &= g_x(y) - g_x(z) - \langle \nabla g_x(z), y - z\rangle \\
&= f(y) - \frac{\mu}{2}\|x - y\|^2 - f(z) + \frac{\mu}{2}\|x - z\|^2 - \bigl\langle \nabla f(z) - \mu(z - x), y - z\bigr\rangle \\
&= D_f^{(B)}(y, z) + \frac{\mu}{2}\underbrace{\bigl[\|x - z\|^2 - \|x - y\|^2 + 2\langle z - x, y - z\rangle\bigr]}_{=-\|z - y\|^2 \text{, expanding } \|x - y\|^2 = \|(x - z) + (z - y)\|^2} \\
&= D_f^{(B)}(y, z) - \frac{\mu}{2}\|z - y\|^2.
\end{align*}

Convexity of $g_x$ is given because $D_{g_x}^{(B)} \ge 0$, since $D_f^{(B)}(y, z) \ge \frac{\mu}{2}\|y - z\|^2$ by strong convexity. Similarly we obtain $(L - \mu)$-Lipschitz continuity of $\nabla g_x$ (we may assume $L > \mu$ here; for $L = \mu$ the claim follows by continuity) because

\[
D_{g_x}^{(B)}(y, z) = D_f^{(B)}(y, z) - \frac{\mu}{2}\|z - y\|^2 \le \frac{L - \mu}{2}\|z - y\|^2,
\]

which is sufficient by (IV). So we can apply (II) to $D_{g_x}^{(B)}$ with $z = x$ to obtain

\begin{align*}
D_f^{(B)}(y, x) &= D_{g_x}^{(B)}(y, x) + \frac{\mu}{2}\|x - y\|^2 \\
&\ge \frac{1}{2(L - \mu)}\|\nabla g_x(y) - \nabla g_x(x)\|^2 + \frac{\mu}{2}\|x - y\|^2 \\
&= \frac{1}{2(L - \mu)}\underbrace{\|\nabla f(y) - \mu(y - x) - \nabla f(x)\|^2}_{=\|(\nabla f(y) - \mu y) - (\nabla f(x) - \mu x)\|^2} + \frac{\mu}{2}\|x - y\|^2.
\end{align*}

This bound is symmetric in $x$ and $y$, so we can apply the same trick as in (III) to get

\begin{align*}
\langle \nabla f(y) - \nabla f(x), y - x\rangle &= D_f^{(B)}(y, x) + D_f^{(B)}(x, y) \\
&\ge \frac{1}{L - \mu}\underbrace{\|\nabla f(y) - \nabla f(x) - \mu(y - x)\|^2}_{=\|\nabla f(y) - \nabla f(x)\|^2 - 2\mu\langle \nabla f(y) - \nabla f(x),\, y - x\rangle + \mu^2\|y - x\|^2} + \mu\|x - y\|^2.
\end{align*}

        Collecting the scalar product on the left results in

\[
\underbrace{\Bigl(1 + \frac{2\mu}{L - \mu}\Bigr)}_{=\frac{L + \mu}{L - \mu}} \langle \nabla f(y) - \nabla f(x), y - x\rangle
\ge \frac{1}{L - \mu}\|\nabla f(y) - \nabla f(x)\|^2 + \Bigl(\frac{\mu^2}{L - \mu} + \mu\Bigr)\|x - y\|^2.
\]

Multiplying both sides by $\frac{L - \mu}{L + \mu}$ we finally get

\[
\langle \nabla f(y) - \nabla f(x), y - x\rangle
\ge \frac{1}{L + \mu}\|\nabla f(y) - \nabla f(x)\|^2 + \underbrace{\frac{\mu^2 + \mu(L - \mu)}{L + \mu}}_{=\frac{\mu L}{L + \mu}}\|x - y\|^2. \qquad ∎
\]
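As a final sanity check (not part of the proof), the interpolation bound can be verified on a quadratic whose extreme Hessian eigenvalues play the roles of $\mu$ and $L$; the random matrix is an illustrative choice.

```python
# Check <grad f(x)-grad f(y), x-y> >= mu*L/(mu+L)*||x-y||^2
#                                   + 1/(mu+L)*||grad f(x)-grad f(y)||^2.
import numpy as np

rng = np.random.default_rng(8)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)           # Hessian with mu = lambda_min, L = lambda_max
eig = np.linalg.eigvalsh(A)
mu, L = eig.min(), eig.max()
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    d, diff = x - y, grad(x) - grad(y)
    rhs = mu * L / (mu + L) * (d @ d) + (diff @ diff) / (mu + L)
    assert diff @ d >= rhs - 1e-9
print("interpolation bound verified")
```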