Stochastic Gradient Descent
10-701 Recitation 3
             Mu Li
    Computer Science Department
     Carnegie Mellon University
       February 5, 2013
The problem
    -   A typical machine learning problem has a penalty/regularizer
        + loss form

            $$\min_w F(w) = g(w) + \frac{1}{n}\sum_{i=1}^n f(w; y_i, x_i),$$

        where $x_i, w \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and both $g$ and $f$ are convex
    -   Today we only consider differentiable $f$, and let $g = 0$ for
        simplicity
    -   For example, let $f(w; y_i, x_i) = -\log p(y_i \mid x_i, w)$; then we are
        maximizing the log likelihood:

            $$\max_w \frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i, w)$$
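As a concrete instance of the loss above, the sketch below assumes logistic regression with labels $y_i \in \{-1, +1\}$, so that $-\log p(y_i \mid x_i, w) = \log(1 + \exp(-y_i \langle w, x_i \rangle))$. The model choice, the function name, and the toy data are illustrative assumptions, not part of the slides:

```python
import numpy as np

def neg_log_lik(w, X, y):
    """Average negative log-likelihood, one concrete choice of f(w; y_i, x_i).

    Assumes logistic regression with labels y in {-1, +1}:
    p(y | x, w) = sigmoid(y * <w, x>), so
    -log p(y_i | x_i, w) = log(1 + exp(-y_i * <w, x_i>)).
    """
    margins = y * (X @ w)                  # y_i * <w, x_i> for every sample
    return np.mean(np.log1p(np.exp(-margins)))

# Sanity check: w = 0 gives probability 1/2 for every sample,
# so the average loss is log 2 regardless of the data.
X = np.array([[1.0, 2.0], [-1.0, 0.5]])   # toy inputs, p = 2
y = np.array([1.0, -1.0])                 # toy labels in {-1, +1}
loss_at_zero = neg_log_lik(np.zeros(2), X, y)
```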
Gradient Descent
  -   Choose initial $w^{(0)}$, repeat

          $$w^{(t+1)} = w^{(t)} - \eta_t \nabla F(w^{(t)})$$

      until stop
      (Figure: a two-dimensional example of the iterates)
  -   $\eta_t$ is the learning rate, and
      $\nabla F(w^{(t)}) = \frac{1}{n}\sum_i \nabla_w f(w^{(t)}; y_i, x_i)$
  -   How to stop? $\|w^{(t+1)} - w^{(t)}\| \le \epsilon$ or
      $\|\nabla F(w^{(t)})\| \le \epsilon$
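The update and stopping rules above can be sketched as follows; the function name `gradient_descent`, the fixed step size, and the toy quadratic objective are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_F, w0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Plain gradient descent: w <- w - eta * grad_F(w).

    Stops when the step size ||w(t+1) - w(t)|| drops below eps,
    one of the two stopping rules on the slide. grad_F is assumed
    to already average the per-sample gradients.
    """
    w = w0.astype(float)
    for _ in range(max_iter):
        step = eta * grad_F(w)
        w = w - step
        if np.linalg.norm(step) <= eps:    # ||w(t+1) - w(t)|| <= eps
            break
    return w

# Toy objective F(w) = 0.5 * ||w - a||^2, whose gradient is w - a,
# so the minimizer is w = a.
a = np.array([3.0, -2.0])
w_star = gradient_descent(lambda w: w - a, np.zeros(2))
```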
Learning rate matters
   (Figures: a too-big $\eta_t$ diverges; a too-small $\eta_t$ has made
   little progress after 100 iterations)
Backtracking line search
   Adaptively choose the learning rate
     -   choose a parameter $0 < \beta < 1$
     -   for $t = 0, 1, \ldots$:
           -   start with $\eta = 1$; while

                   $$F(w^{(t)} - \eta \nabla F(w^{(t)})) > F(w^{(t)}) - \frac{\eta}{2}\|\nabla F(w^{(t)})\|^2$$

               update $\eta = \beta\eta$
           -   $w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)})$
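A minimal sketch of the rule above; the helper name `gd_backtracking` and the toy quadratic objective are hypothetical, while the shrink-while-too-big loop follows the slide:

```python
import numpy as np

def gd_backtracking(F, grad_F, w0, beta=0.8, eps=1e-6, max_iter=1000):
    """Gradient descent with backtracking line search.

    Each iteration starts from eta = 1 and shrinks eta <- beta * eta
    while F(w - eta*g) > F(w) - (eta/2) * ||g||^2, as on the slide.
    """
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad_F(w)
        if np.linalg.norm(g) <= eps:       # ||grad F|| <= eps: stop
            break
        eta = 1.0
        while F(w - eta * g) > F(w) - 0.5 * eta * np.dot(g, g):
            eta *= beta                    # step too big, shrink it
        w = w - eta * g
    return w

# Same style of toy objective: F(w) = 0.5 * ||w - a||^2.
a = np.array([1.0, 4.0])
w_star = gd_backtracking(lambda w: 0.5 * np.sum((w - a) ** 2),
                         lambda w: w - a, np.zeros(2))
```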
Backtracking line search
   A typical choice is $\beta = 0.8$; the example converges after 13
   iterations: (figure)
Stochastic Gradient Descent
    -   We call $\frac{1}{n}\sum_i f(w; y_i, x_i)$ the empirical loss; the thing we
        hope to minimize is the expected loss

            $$f(w) = \mathbb{E}_{y_i, x_i}\, f(w; y_i, x_i)$$

    -   Suppose we receive an infinite stream of samples $(y_t, x_t)$ from
        the distribution; one way to optimize the objective is

            $$w^{(t+1)} = w^{(t)} - \eta_t \nabla_w f(w^{(t)}; y_t, x_t)$$

    -   In practice, we simulate the stream by randomly picking
        $(y_t, x_t)$ from the samples we have
    -   Compare with the average gradient used by GD,
        $\frac{1}{n}\sum_i \nabla_w f(w^{(t)}; y_i, x_i)$: each SGD step uses
        the gradient of a single sample instead
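The stream-simulating update can be sketched as below. The decaying schedule $\eta_t = \eta_0 / t$ is an assumption (the slides do not fix a schedule), and the function name and toy data are illustrative:

```python
import numpy as np

def sgd(grad_f, X, y, eta0=1.0, n_steps=20_000, seed=0):
    """SGD: each step picks one sample uniformly at random (simulating
    the stream) and moves along that single sample's gradient, with a
    decaying learning rate eta_t = eta0 / t (an assumed schedule)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        i = rng.integers(len(y))                  # random (y_t, x_t)
        w = w - (eta0 / t) * grad_f(w, y[i], X[i])
    return w

# Toy per-sample loss f(w; y, x) = 0.5 * (<w, x> - y)^2,
# with gradient (<w, x> - y) * x.
grad = lambda w, yi, xi: (xi @ w - yi) * xi
X = np.ones((3, 1))                # x_i = 1, so the minimizer is mean(y)
y = np.array([1.0, 2.0, 3.0])
w_hat = sgd(grad, X, y)            # should approach mean(y) = 2
```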
More about SGD
    I   the objective does not always decrease for each step
    I   comparing to GD, SGD needs more steps, but each step is
        cheaper
    I   mini-batch, say pick up 100 samples and do average, may
        accelerate the convergence
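The mini-batch idea is a small variant of the SGD step: average the per-sample gradient over a random batch before updating, which reduces the variance of the gradient estimate. The batch size of 100 follows the slide; the names, schedule, and toy data below are assumptions:

```python
import numpy as np

def minibatch_sgd(grad_f, X, y, batch=100, eta0=1.0, n_steps=2000, seed=0):
    """Mini-batch SGD: each step averages the per-sample gradient over
    a random batch (drawn with replacement) before the update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        idx = rng.integers(len(y), size=min(batch, len(y)))
        g = np.mean([grad_f(w, y[i], X[i]) for i in idx], axis=0)
        w = w - (eta0 / t) * g             # assumed decaying schedule
    return w

# Same toy problem as before: fit the mean of y with squared loss.
grad = lambda w, yi, xi: (xi @ w - yi) * xi
X = np.ones((3, 1))
y = np.array([1.0, 2.0, 3.0])
w_hat = minibatch_sgd(grad, X, y)  # should approach mean(y) = 2
```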
Relation to Perceptron
    -   Recall the Perceptron: initialize $w$, then repeat

            $$w = w + \begin{cases} y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$

    -   Fix the learning rate $\eta = 1$ and let
        $f(w; y_i, x_i) = \max(0, -y_i \langle w, x_i \rangle)$; then

            $$\nabla_w f(w; y_i, x_i) = \begin{cases} -y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$

        so the SGD step $w \leftarrow w - \eta \nabla_w f$ derives the Perceptron
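The derivation above can be checked in code. One caveat in this sketch: the slide's strict condition $y_i \langle w, x_i \rangle < 0$ never fires from $w = 0$, so the code treats a zero margin as a mistake; the function name and toy data are illustrative:

```python
import numpy as np

def perceptron_sgd(X, y, n_epochs=10):
    """Perceptron recovered as SGD with eta = 1 on the loss
    f(w; y, x) = max(0, -y * <w, x>): the (sub)gradient is -y*x on a
    misclassified sample and 0 otherwise, so w <- w - 1 * grad is
    exactly the Perceptron update w <- w + y*x."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            # <= 0 (not the slide's strict <) so training starts from w = 0
            if yi * (xi @ w) <= 0:
                w = w + yi * xi            # = w - eta * (-yi * xi), eta = 1
    return w

# Toy linearly separable data; any w with all y_i * <w, x_i> > 0 works.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_sgd(X, y)
```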
Questions?