Probability Theory
                        Sargur N. Srihari
                   srihari@cedar.buffalo.edu
           Probability Theory in Machine Learning
•  Probability is a key concept in dealing with
   uncertainty
      –  Uncertainty arises due to the finite size of data sets
         and noise in measurements
•  Probability Theory
     –  Framework for quantification and manipulation of
        uncertainty
     –  One of the central foundations of machine learning
                   Random Variable (R.V.)
•  Takes values subject to chance
      –  E.g., X is the result of a coin toss, with values Head
         and Tail, which are non-numeric
            •  X can be represented by an r.v. x which takes values 1 and 0
      –  Each value of x has an associated probability
•  Probability Distribution
     –  Mathematical function that describes
           1. possible values of a r.v.
           2. and associated probabilities
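A minimal sketch of such a random variable and its distribution (the coin here is assumed fair; the values are illustrative):

  import numpy as np

  # Binary r.v. x encoding a coin toss: x = 1 for Head, x = 0 for Tail.
  values = np.array([1, 0])        # possible values of x
  probs  = np.array([0.5, 0.5])    # associated probabilities (must sum to 1)
  assert np.isclose(probs.sum(), 1.0)

  # Sample the r.v. many times; empirical frequencies approach the probabilities.
  samples = np.random.choice(values, size=10_000, p=probs)
  print((samples == 1).mean())     # close to 0.5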
              Probability with Two Variables
•  Key concepts:
      –  conditional & joint probabilities of variables
•  Random Variables: B and F
      –  Box B, Fruit F
            •  F has two values orange (o) or apple (a)
            •  B has values red (r) or blue (b)
   Red box: 2 apples, 6 oranges.  Blue box: 3 apples, 1 orange.
   So p(F=o|B=r)=3/4 and p(F=a|B=r)=1/4
   Let p(B=r)=4/10 and p(B=b)=6/10
   Given the above data we are interested in several
   probabilities of interest: marginal, conditional and joint,
   described next
                   Probabilities of Interest
   •  Marginal Probability
         –  What is the probability of an
            apple? P(F=a)
               •  Note that we have to consider P(B)
   [Boxes: red has 2 apples, 6 oranges; blue has 3 apples, 1 orange]
   •  Conditional Probability
         –  Given that we have an orange
            what is the probability that we
            chose the blue box? P(B=b|F=o)
   •  Joint Probability
         –  What is the probability of orange
            AND blue box? P(B=b,F=o)
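A short numerical sketch of these three quantities using the box contents and priors given above (the joint table is built as p(F,B) = p(F|B) p(B), the product rule described below):

  import numpy as np

  # p(F|B): rows indexed by B in {r, b}, columns by F in {a, o}
  p_F_given_B = np.array([[2/8, 6/8],    # red box: 2 apples, 6 oranges
                          [3/4, 1/4]])   # blue box: 3 apples, 1 orange
  p_B = np.array([0.4, 0.6])             # p(B=r), p(B=b)

  p_FB = p_F_given_B * p_B[:, None]      # joint p(F,B) = p(F|B) p(B)
  p_F = p_FB.sum(axis=0)                 # marginal p(F) by the sum rule

  print(p_F[0])               # marginal    P(F=a)      = 11/20
  print(p_FB[1, 1])           # joint       P(B=b,F=o)  = 0.15
  print(p_FB[1, 1] / p_F[1])  # conditional P(B=b|F=o)  = 1/3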
       Sum Rule of Probability Theory
   •  Consider two random variables
   •  X can take on values xi, i = 1, …, M
   •  Y can take on values yj, j = 1, …, L
   •  N trials sampling both X and Y
   •  Number of trials with X = xi and Y = yj is nij
   •  Joint Probability:     p(X = xi, Y = yj) = nij / N
   •  Marginal Probability:  p(X = xi) = ci / N, where ci is the number of trials with X = xi
   •  Since ci = Σ_j nij,    p(X = xi) = Σ_{j=1}^{L} p(X = xi, Y = yj)
       Product Rule of Probability Theory
•  Consider only those instances for which X=xi
•  Then fraction of those instances for which Y=yj is
   written as p(Y=yj|X=xi)
•  Called conditional probability
•  Relationship between joint and conditional probability:
       p(Y = yj | X = xi) = nij / ci

       p(X = xi, Y = yj) = nij / N = (nij / ci) · (ci / N)
                         = p(Y = yj | X = xi) p(X = xi)
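A small simulation (the distribution and sizes are illustrative) that builds the counts nij and checks that the empirical joint, marginal and conditional probabilities obey the sum and product rules:

  import numpy as np

  rng = np.random.default_rng(0)
  M, L, N = 3, 2, 100_000

  # An arbitrary joint distribution over (X, Y) used to generate the trials.
  p_joint = rng.random((M, L))
  p_joint /= p_joint.sum()

  # Draw N trials and tally the counts n_ij.
  idx = rng.choice(M * L, size=N, p=p_joint.ravel())
  n = np.bincount(idx, minlength=M * L).reshape(M, L)

  joint    = n / N                               # p(X=xi, Y=yj) = nij / N
  marginal = n.sum(axis=1) / N                   # p(X=xi) = ci / N   (sum rule)
  cond     = n / n.sum(axis=1, keepdims=True)    # p(Y=yj | X=xi) = nij / ci

  # Product rule: joint = conditional x marginal
  print(np.allclose(joint, cond * marginal[:, None]))   # True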
                     Bayes Theorem
•  From the product rule together with the symmetry
   property p(X,Y)=p(Y,X) we get
                    p(Y | X) = p(X | Y) p(Y) / p(X)
•  Which is called Bayes’ theorem
•  Using the sum rule, the denominator is expressed as
                    p(X) = Σ_Y p(X | Y) p(Y)
   –  This is the normalization constant that ensures the conditional
      probability on the LHS sums to 1 over all values of Y
                          Rules of Probability
  •  Given random variables X and Y
  •  Sum Rule gives Marginal Probability
      p(X = xi) = Σ_{j=1}^{L} p(X = xi, Y = yj) = ci / N
   •  Product Rule: joint probability in terms of conditional and
      marginal
      p(X, Y) = nij / N = (nij / ci) × (ci / N) = p(Y | X) p(X)
   •  Combining we get Bayes Rule
      p(Y | X) = p(X | Y) p(Y) / p(X),   where p(X) = Σ_Y p(X | Y) p(Y)
      Viewed as:  posterior ∝ likelihood × prior
Ex: Joint Distribution over two Variables
                    [Figure: joint distribution of X (nine possible values) and Y (two values)
                     from N = 60 data points, shown as histograms of Y, of X, and of X given
                     Y = 1; each histogram gives the fraction of data points having each value]
                    Fractions would equal the probabilities as N → ∞
      Bayes rule applied to Fruit Problem
•  Probability that box is red given
   that fruit picked is orange
              p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
                               = (3/4 × 4/10) / (9/20) = 2/3 ≈ 0.67

              The a posteriori probability of 2/3 is different from the a priori probability of 4/10
•  Probability that fruit is orange
      –  From sum and product rules
    p(F = o) = p(F = o, B = r) + p(F = o, B = b)
             = p(F = o | B = r) p(B = r) + p(F = o | B = b) p(B = b)
             = 6/8 × 4/10 + 1/4 × 6/10 = 9/20 = 0.45

    The marginal probability of 0.45 is lower than the overall fraction
    of oranges in the two boxes, 7/12 ≈ 0.58
                   Independent Variables
•  If p(X,Y)=p(X)p(Y) then X and Y are said to be
   independent
•  Why?
•  From product rule:   p(Y | X) = p(X, Y) / p(X) = p(Y)
•  In fruit example if each box contained same
   fraction of apples and oranges then p(F|B)=p(F)
   Probability Density Function (pdf)
   [Figure: a pdf p(x) and its cumulative distribution function P(x)]
   •  Continuous Variables
   •  If the probability that x falls in the
      interval (x, x+δx) is given
      by p(x)δx as δx → 0,
      then p(x) is a pdf of x
   •  Probability that x lies in the
      interval (a,b) is
         p(x ∈ (a,b)) = ∫_a^b p(x) dx
   •  Probability that x lies in the
      interval (−∞, z) is the cumulative distribution function
         P(z) = ∫_{−∞}^z p(x) dx
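A quick numerical illustration with a simple pdf, p(x) = 2x on [0, 1] (chosen here only for illustration): check normalization, an interval probability, and one value of the CDF.

  import numpy as np

  x = np.linspace(0.0, 1.0, 10_001)
  p = 2.0 * x                                    # an illustrative pdf on [0, 1]

  def integrate(y, xs):
      # Trapezoidal rule, written out explicitly.
      return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(xs)))

  print(integrate(p, x))                         # total probability ≈ 1.0

  a, b = 0.25, 0.5                               # exact P(x in (a,b)) = b² − a² = 0.1875
  mask = (x >= a) & (x <= b)
  print(integrate(p[mask], x[mask]))

  z = 0.5                                        # exact CDF value P(z) = z² = 0.25
  print(integrate(p[x <= z], x[x <= z]))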
                   Several Variables
•  If there are several continuous variables x1,…,xD
   denoted by vector x then we can define a joint
   probability density p(x)=p(x1,..,xD)
•  Multivariate probability density must satisfy
         p(x) ≥ 0
         ∫ p(x) dx = 1   (integral over the whole space)
      Sum, Product, Bayes for Continuous
•  Rules apply for continuous, or combinations of
   discrete and continuous variables
                   p(x) = ∫ p(x,y)dy
                   p(x,y) = p(y | x)p(x)
                    p(y | x) = p(x | y) p(y) / p(x)
•  Formal justification of sum, product rules for
   continuous variables requires measure theory
                               Expectation
•  Expectation is average value of some function f(x) under the
   probability distribution p(x) denoted E[f]
•  For a discrete distribution
        E[f] = Σx p(x) f(x)
•  For a continuous distribution
        E[f] = ∫ p(x) f(x) dx
   Examples of f(x) of use in ML:
     f(x) = x:                E[f] is the mean
     f(x) = −ln p(x):         E[f] is the entropy
     f(x) = −ln[q(x)/p(x)]:   E[f] is the Kullback-Leibler divergence
•  If there are N points drawn from the pdf, the expectation can be
   approximated as
       E[f] ≈ (1/N) Σ_{n=1}^{N} f(xn)
   This approximation is extremely important when we use
   sampling to determine expected values (see the sketch below)
•  Conditional Expectation with respect to a conditional distribution
           Ex[f] =   Σx p(x|y) f(x)
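A minimal sketch of the sampling approximation to an expectation (the distribution and f are chosen for illustration):

  import numpy as np

  rng = np.random.default_rng(0)

  # x ~ N(0, 1) and f(x) = x**2, so the exact expectation E[f] is 1.
  samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
  f = samples ** 2

  print(f.mean())        # (1/N) Σ f(x_n) ≈ 1.0; improves as N grows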
                   Variance
•  Measures how much variability there is in f(x)
   around its mean value E[f(x)]
•  Variance of f(x) is denoted as
       var[f] = E[(f(x) − E[f(x)])²]
•  Expanding the square
       var[f] = E[f(x)²] − E[f(x)]²
•  Variance of the variable x itself
       var[x] = E[x²] − E[x]²
                           Covariance
•  For two random variables x and y their covariance is
•     cov[x,y] = Ex,y [{x-E[x]} {y-E[y]}]
                   = Ex,y [xy] - E[x]E[y]
     –  Expresses how x and y vary together
•  If x and y are independent then their covariance
   vanishes
•  If x and y are two vectors of random variables
   covariance is a matrix
•  If we consider covariance of components of vector x
   with each other then we denote it as cov[x] =cov [x,x]
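A short numerical check of these properties (the distributions are illustrative): covariance of dependent variables, near-zero covariance for independent ones, and the covariance matrix of a vector of random variables.

  import numpy as np

  rng = np.random.default_rng(0)
  N = 100_000

  x = rng.normal(size=N)
  y = 2 * x + rng.normal(size=N)       # y depends on x
  z = rng.normal(size=N)               # independent of x

  cov_xy = (x * y).mean() - x.mean() * y.mean()   # E[xy] − E[x]E[y], ≈ 2
  cov_xz = (x * z).mean() - x.mean() * z.mean()   # ≈ 0 since x, z are independent
  print(cov_xy, cov_xz)

  # Covariance matrix of the vector (x, y, z):  cov[x] = cov[x, x]
  print(np.cov(np.stack([x, y, z])))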
                      Bayesian Probabilities
 •  Classical or Frequentist view of Probabilities
       –  Probability is frequency of random, repeatable event
       –  Frequency of a tossed coin coming up heads is 1/2
 •  Bayesian View
       –  Probability is a quantification of uncertainty
       –  Degree of belief in propositions that do not involve random
          variables
       –  Examples of uncertain events as probabilities:
             •     Whether Arctic Sea ice cap will disappear
             •     Whether moon was once in its own orbit around the sun
             •     Whether Thomas Jefferson had a child by one of his slaves
              •     Whether a signature on a check is genuine
  Whether Arctic Sea ice cap will disappear
                                        •  We have some idea of how
                                           quickly polar ice is melting
                                        •  Revise it on the basis of
                                           fresh evidence (satellite
                                           observations)
NASA Video                              •  Assessment will affect
                                           actions we take (to reduce
                                           greenhouse gases)
   An uncertain event
   Answered by general Bayesian interpretation
     Bayesian Representation of Uncertainty
•  Use of probability to represent uncertainty is not an
   ad-hoc choice
•  If numerical values are used to represent degrees of
   belief, then simple set of axioms for manipulating
   degrees of belief leads to sum and product rules of
   probability (Cox’s theorem)
•  Probability theory can be regarded as an extension of
   Boolean logic to situations involving uncertainty
   (Jaynes)
                        Bayesian Approach
•  Quantify uncertainty around choice of parameters w
     –  E.g., w is vector of parameters in curve fitting
                          y(x, w) = w0 + w1x + w2x² + … + wM x^M = Σ_{j=0}^{M} wj x^j
•  Uncertainty before observing data expressed by p(w)
•  Given observed data D ={ t1, . . tN }
     –  Uncertainty in w after observing D, by Bayes rule:
                               p(w | D) = p(D | w) p(w) / p(D)
     –  Quantity p(D|w) is evaluated for observed data
           •    It can be viewed as function of w
           •    It represents how probable the data set is for different parameters w
           •    It is called the Likelihood function
            •    Not a probability distribution over w
                   Bayes theorem in words
•  Uncertainty in w expressed as
                         p(w | D) = p(D | w) p(w) / p(D)
•  Bayes theorem in words:
                      posterior ∝ likelihood × prior
•  Denominator is normalization factor
     •  Involves marginalization over w
                   p(D) = ∫ p(D | w)p(w)d w by Sum Rule
               Role of Likelihood Function
•  Likelihood Function plays central role in both
   Bayesian and frequentist paradigms
•  Frequentist:
     •  w is a fixed parameter determined by an estimator;
     •  Error bars on estimate are obtained from possible
        data sets D
•  Bayesian:
     •  There is a single data set D
     •  Uncertainty in parameters expressed as probability
        distribution over w
        Maximum Likelihood Approach
•  In frequentist setting w is a fixed parameter
     –  w is set to value that maximizes likelihood function p(D|w)
     –  In ML, negative log of likelihood function is called error
        function since maximizing likelihood is equivalent to
        minimizing error
•  Error Bars
     –  Bootstrap approach to creating L data sets
           •  From N data points new data sets are created by drawing N points at
              random with replacement
           •  Repeat L times to generate L data sets
           •  Accuracy of parameter estimate can be evaluated by variability of
              predictions between different bootstrap sets
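A minimal sketch of bootstrap error bars (the data and the estimator, here the sample mean, are illustrative):

  import numpy as np

  rng = np.random.default_rng(0)
  data = rng.normal(loc=2.0, scale=1.0, size=50)     # N = 50 illustrative points

  L = 1000                                           # number of bootstrap data sets
  estimates = np.empty(L)
  for i in range(L):
      # Draw N points at random with replacement to form a new data set.
      resample = rng.choice(data, size=data.size, replace=True)
      estimates[i] = resample.mean()                 # parameter estimate on this set

  print(data.mean())       # estimate on the original data
  print(estimates.std())   # error bar: variability across the L bootstrap sets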
                   Bayesian: Prior and Posterior
     •  Inclusion of prior knowledge arises naturally
     •  Coin Toss Example
            –  Fair-looking coin is tossed three times and lands Heads each time
            –  Classical maximum-likelihood estimate of the probability of landing
               Heads is 1, implying all future tosses will land Heads
            –  Bayesian approach with a reasonable prior will lead to a less
               extreme conclusion
                    [Plots: prior density p(µ) and posterior density p(µ|H), where µ = p(H)]
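A sketch of this comparison under an assumed Beta prior over µ (the Beta form and its parameters are my choice, not from the slides; with a Beta(a, b) prior and h heads in n tosses the posterior is Beta(a+h, b+n−h)):

  h, n = 3, 3                        # three tosses, all Heads

  mu_mle = h / n                     # classical maximum-likelihood estimate: 1.0

  a, b = 2.0, 2.0                    # assumed Beta(2, 2) prior, favouring fair coins
  a_post, b_post = a + h, b + n - h  # posterior is Beta(5, 2)
  mu_posterior_mean = a_post / (a_post + b_post)

  print(mu_mle)                      # 1.0: predicts every future toss lands Heads
  print(mu_posterior_mean)           # 5/7 ≈ 0.71: a less extreme conclusion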
          Practicality of Bayesian Approach
•  Marginalization over whole parameter space is
   required to make predictions or compare
   models
•  Factors making it practical:
           •  Sampling Methods such as Markov Chain Monte Carlo
              methods
           •  Increased speed and memory of computers
•  Deterministic approximation schemes such as
   Variational Bayes and Expectation propagation
   are alternatives to sampling
                   The Gaussian Distribution
•  For single real-valued variable x
         N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp{ −(1/(2σ²)) (x − µ)² }
      What is an exponential: y = e^x, where e = 2.718…
      Its value for argument 0 is 1; it is its own derivative
•  It has two parameters:
      –  Mean µ, variance σ²
      –  Standard deviation σ
            •  Precision β = 1/σ²
      The maximum of a distribution is its mode; for a Gaussian, the mode coincides with the mean
•  Can find expectations of functions of
   x under the Gaussian
      E[x]  = ∫_{−∞}^{∞} N(x | µ, σ²) x dx = µ
      E[x²] = ∫_{−∞}^{∞} N(x | µ, σ²) x² dx = µ² + σ²
      var[x] = E[x²] − E[x]² = σ²
   [Figure: Gaussian pdf with µ = 0, σ = 1]
    Multivariate Gaussian Distribution
•  For a D-dimensional vector x
    N(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{ −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) }
•  It has parameters:
     –  Mean µ, a D-dimensional vector
     –  Covariance matrix Σ
           •  Which is a D ×D matrix
          Likelihood Function for Gaussian
•  Given N scalar observations x = [x1, …, xN]ᵀ
     –  Which are independent and identically
        distributed
•  Probability of data set is given by
   likelihood function
              p(x | µ, σ²) = ∏_{n=1}^{N} N(xn | µ, σ²)
              [Figure: data shown as black points; the likelihood is the product of the
               blue values N(xn | µ, σ²); pick the mean and variance to maximize this product]
•  Log-likelihood function is                                             this product
          ln p(x | µ, σ²) = −(1/(2σ²)) Σ_{n=1}^{N} (xn − µ)² − (N/2) ln σ² − (N/2) ln(2π)
•  Maximum likelihood solutions are given by
              µML = (1/N) Σ_{n=1}^{N} xn             which is the sample mean

              σ²ML = (1/N) Σ_{n=1}^{N} (xn − µML)²   which is the sample variance
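A small sketch computing these maximum-likelihood solutions on illustrative data and checking them against the log-likelihood above:

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(loc=1.5, scale=2.0, size=1000)   # illustrative i.i.d. observations

  mu_ml = x.mean()                                # sample mean
  var_ml = ((x - mu_ml) ** 2).mean()              # sample variance (divides by N)

  def log_likelihood(mu, var):
      N = x.size
      return (-((x - mu) ** 2).sum() / (2 * var)
              - N / 2 * np.log(var) - N / 2 * np.log(2 * np.pi))

  print(mu_ml, var_ml)
  # The ML solution has higher log-likelihood than nearby parameter values.
  print(log_likelihood(mu_ml, var_ml) > log_likelihood(mu_ml + 0.1, var_ml))   # True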
             Bias in Maximum Likelihood
•  Maximum likelihood
   systematically
   underestimates variance
      –  E[µML] = µ
      –  E[σ²ML] = ((N−1)/N) σ²
      –  Not an issue as N increases
•  Problem is related to the over-fitting problem
   [Figure: green curve is the true distribution; averaged across three data sets
    the mean is correct, but the variance is underestimated because it is estimated
    relative to the sample mean and not the true mean]
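A quick simulation of this bias (numbers are illustrative): averaging the ML variance over many data sets of size N gives ((N−1)/N)σ² rather than σ².

  import numpy as np

  rng = np.random.default_rng(0)
  N, trials, true_var = 5, 50_000, 1.0

  ml_vars = np.empty(trials)
  for t in range(trials):
      x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=N)
      ml_vars[t] = ((x - x.mean()) ** 2).mean()     # σ²ML for this data set

  print(ml_vars.mean())              # ≈ 0.8
  print((N - 1) / N * true_var)      # = 0.8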
           Curve Fitting Probabilistically
•  Goal is to predict for target
   variable t given a new value
   of the input variable x
     –  Given N input values
        x=(x1,..xN)T and corresponding
        target values t=(t1,..,tN)T
      –  Assume that, given the value of x, the value
         of t has a Gaussian distribution
         with mean equal to y(x,w) of the
         polynomial curve

     p(t | x, w, β) = N(t | y(x, w), β⁻¹)
     y(x, w) = w0 + w1x + w2x² + … + wM x^M = Σ_{j=0}^{M} wj x^j

     [Figure: Gaussian conditional distribution for t given x; the mean is given by
      the polynomial function y(x, w) and the precision by β]
 Curve Fitting with Maximum Likelihood
•  Likelihood Function is
       p(t | x, w, β) = ∏_{n=1}^{N} N(tn | y(xn, w), β⁻¹)
•  Logarithm of the Likelihood function is
       ln p(t | x, w, β) = −(β/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (N/2) ln β − (N/2) ln(2π)
•  To find maximum likelihood solution for
   polynomial coefficients wML
     –  Maximize w.r.t w
     –  Can omit last two terms -- don’t depend on w
     –  Can replace β/2 with ½ (since it is constant wrt w)
     –  Minimize negative log-likelihood
     –  Identical to sum-of-squares error function
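A sketch of this on synthetic data (the data and polynomial order are illustrative): maximizing the likelihood over w reduces to least squares on the design matrix, and the residuals also give the maximum-likelihood precision β described below.

  import numpy as np

  rng = np.random.default_rng(0)
  N, M, beta_true = 20, 3, 25.0

  x = np.linspace(0, 1, N)
  t = np.sin(2 * np.pi * x) + rng.normal(scale=1/np.sqrt(beta_true), size=N)

  # Design matrix with columns x^0, x^1, ..., x^M
  Phi = np.vander(x, M + 1, increasing=True)

  # w_ML minimizes the sum-of-squares error  Σ {y(x_n, w) − t_n}²
  w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

  residuals = t - Phi @ w_ml
  beta_ml = 1.0 / (residuals ** 2).mean()     # 1/βML = mean squared residual
  print(w_ml, beta_ml)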
             Precision parameter with MLE
•  Maximum likelihood can also be used to
   determine β of Gaussian conditional distribution
•  Maximizing likelihood wrt β gives
                     1/βML = (1/N) Σ_{n=1}^{N} {y(xn, wML) − tn}²
•  First determine parameter vector wML governing
   the mean and subsequently use this to find
   precision βML
                   Predictive Distribution
•  Knowing parameters w and β
•  Predictions for new values of x can be made
   using
          p(t | x, wML, βML) = N(t | y(x, wML), βML⁻¹)
•  Instead of a point estimate we are now giving a
   probability distribution over t
             A More Bayesian Treatment
•  Introducing a prior distribution over polynomial
   coefficients w
          p(w | α) = N(w | 0, α⁻¹I) = (α / (2π))^{(M+1)/2} exp{ −(α/2) wᵀw }
     –  where α is the precision of the distribution
     –  M+1 is total no. of parameters for an Mth order polynomial
      –  α is a model parameter, also called a hyperparameter
            •  hyperparameters control the distribution of model parameters
                   Posterior Distribution
•  Using Bayes theorem, posterior distribution for w is
   proportional to product of prior distribution and
   likelihood function
    p(w|x,t,α,β) ∝ p(t|x,w,β) p(w|α)
•  w can be determined by finding the most probable
   value of w given the data, ie. maximizing posterior
   distribution
•  This is equivalent (by taking logs) to minimizing
                    (β/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (α/2) wᵀw
•  Same as sum of squared errors function with a
   regularization parameter given by λ=α/β
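A sketch of this MAP estimate on synthetic data (the data generation is illustrative; α and β are taken from the example values used later in these slides). Setting the gradient of the regularized error to zero gives the ridge-regression form (βΦᵀΦ + αI) w = βΦᵀt:

  import numpy as np

  rng = np.random.default_rng(0)
  N, M, alpha, beta = 20, 9, 5e-3, 11.1

  x = np.linspace(0, 1, N)
  t = np.sin(2 * np.pi * x) + rng.normal(scale=1/np.sqrt(beta), size=N)
  Phi = np.vander(x, M + 1, increasing=True)          # columns x^0 .. x^M

  # Minimize (β/2) Σ {y(x_n, w) − t_n}² + (α/2) wᵀw
  A = beta * Phi.T @ Phi + alpha * np.eye(M + 1)
  w_map = np.linalg.solve(A, beta * Phi.T @ t)
  print(w_map)        # same as sum-of-squares with regularizer λ = α/β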
                       Bayesian Curve Fitting
•  Previous treatment still makes point estimate of w
     –  In fully Bayesian approach consistently apply sum and
        product rules and integrate over all values of w
•  Given training data x and t and new test point x , goal
   is to predict value of t
     –  i.e, wish to evaluate predictive distribution p(t|x,x,t)
•  Applying sum and product rules of probability
     –  Predictive distribution can be written in the form
      p(t | x, x, t) = ∫ p(t, w | x, x, t) dw                 by Sum Rule (marginalizing over w)
                     = ∫ p(t | x, w, x, t) p(w | x, x, t) dw  by Product Rule
                     = ∫ p(t | x, w) p(w | x, t) dw           by eliminating unnecessary variables
      where p(t | x, w) = N(t | y(x, w), β⁻¹) and p(w | x, t) is the posterior distribution
      over the parameters, which is also a Gaussian
                        Bayesian Curve Fitting
•  Predictive distribution is also Gaussian
              p(t | x, x,t) = N(t | m(x),s 2(x))
     –  Where the Mean and Variance are dependent on x
      m(x)  = β φ(x)ᵀ S Σ_{n=1}^{N} φ(xn) tn
      s²(x) = β⁻¹ + φ(x)ᵀ S φ(x)
      S⁻¹   = αI + β Σ_{n=1}^{N} φ(xn) φ(xn)ᵀ
      φ(x) has elements φi(x) = xⁱ for i = 0, …, M
      In s²(x), the first term is the uncertainty in the predicted value due to noise on the
      target; the second term is the uncertainty in the parameters due to the Bayesian treatment
      [Figure: predictive distribution for an M = 9 polynomial with α = 5 × 10⁻³ and β = 11.1;
       the red curve is the mean m(x) and the red region spans ±1 standard deviation s(x)]
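A sketch of these predictive-distribution formulas on synthetic data (the data are illustrative; α and β as in the caption above):

  import numpy as np

  rng = np.random.default_rng(0)
  N, M, alpha, beta = 10, 9, 5e-3, 11.1

  x = np.linspace(0, 1, N)
  t = np.sin(2 * np.pi * x) + rng.normal(scale=1/np.sqrt(beta), size=N)

  def phi(x_val):                              # φi(x) = x**i, i = 0..M
      return np.power(x_val, np.arange(M + 1))

  Phi = np.stack([phi(xn) for xn in x])        # N × (M+1) design matrix
  S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

  def predictive(x_new):
      ph = phi(x_new)
      mean = beta * ph @ S @ Phi.T @ t         # m(x)
      var = 1/beta + ph @ S @ ph               # s²(x)
      return mean, np.sqrt(var)

  print(predictive(0.5))                       # predictive mean and std dev at x = 0.5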
                   Model Selection
                   Models in Curve Fitting
•  In polynomial curve fitting:
     –  an optimal order of polynomial gives best
       generalization
•  Order of the polynomial controls
     –  the number of free parameters in the model and
        thereby model complexity
•  With regularized least squares, the regularization
   coefficient λ also controls model complexity
         Validation Set to Select Model
     •  Performance on training set is not a good
        indicator of predictive performance
      •  If there is plenty of data,
            –  use some of the data to train a range of models, or a
               given model with a range of values for its parameters
           –  Compare them on an independent set, called
              validation set
           –  Select one having best predictive performance
     •  If data set is small then some over-fitting can
        occur and it is necessary to keep aside a test set
                   S-fold Cross Validation
      •  Used when the supply of data is limited
      •  All available data is
         partitioned into S groups
      •  S−1 groups are used to train a
         model, which is evaluated on the
         remaining group
      •  Repeat for all S choices of
         held-out group
         [Figure: illustration with S = 4; if S = N this is the
          leave-one-out method]
     •  Performance scores from S
        runs are averaged
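A minimal sketch of S-fold cross-validation for choosing a polynomial order (the data, candidate orders and squared-error score are illustrative):

  import numpy as np

  rng = np.random.default_rng(0)
  N, S = 40, 4
  x = np.linspace(0, 1, N)
  t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

  folds = np.array_split(rng.permutation(N), S)       # partition data into S groups

  for M in (1, 3, 9):                                 # candidate models
      scores = []
      for s in range(S):
          test = folds[s]
          train = np.concatenate([folds[j] for j in range(S) if j != s])
          w = np.polyfit(x[train], t[train], M)       # train on S−1 groups
          pred = np.polyval(w, x[test])
          scores.append(np.mean((pred - t[test]) ** 2))
      print(M, np.mean(scores))                       # average score over the S runs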
         Bayesian Information Criterion
      •  Criterion for choosing among models
      •  Akaike Information Criterion (AIC) chooses the
         model for which the quantity
             ln p(D|wML) − M
         is highest
      •  Where M is the number of adjustable parameters
      •  BIC is a variant of this quantity
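A sketch of this comparison for the Gaussian polynomial models used earlier (the data are illustrative; the log-likelihood is evaluated at the ML noise precision, and counting the precision as an adjustable parameter is one reasonable choice):

  import numpy as np

  rng = np.random.default_rng(0)
  N = 30
  x = np.linspace(0, 1, N)
  t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

  for order in (1, 3, 9):
      w = np.polyfit(x, t, order)
      resid = t - np.polyval(w, x)
      beta_ml = 1.0 / np.mean(resid ** 2)
      # Maximized Gaussian log-likelihood ln p(D | w_ML)
      log_lik = 0.5 * N * np.log(beta_ml) - 0.5 * N * np.log(2 * np.pi) - 0.5 * N
      M = (order + 1) + 1                 # polynomial coefficients plus noise precision
      print(order, log_lik - M)           # choose the model where this is highest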
            The Curse of Dimensionality
             Need to deal with spaces with many
               variables in machine learning
         Example Classification Problem
     •  Three classes
     •  12 variables:
        two shown
     •  100 points
      •  Learn to classify from data
         [Figure: which class should a new point x belong to?]
                   Cell-based Classification
     •  Naïve approach of
        cell based voting
        will fail because of
        exponential growth
        of cells with
        dimensionality
     •  Hardly any points in
        each cell
  Volume of Sphere in High Dimensions
      •  Sphere is of radius r = 1 in
         D dimensions
      •  What fraction of volume
         lies between radius
         r = 1−ε and r = 1?
      •  VD(r) = KD r^D
      •  This fraction is given by
         1 − (1−ε)^D
      •  As D increases a high
         proportion of volume lies
         near the outer shell
         [Figure: fraction of volume of the sphere lying in the range r = 1−ε to r = 1
          for various dimensions D]
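A one-line check of this fraction for a few dimensions (ε = 0.01 is chosen only for illustration):

  eps = 0.01
  for D in (1, 2, 20, 200, 1000):
      print(D, 1 - (1 - eps) ** D)    # fraction of the volume within ε of the surface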
      Gaussian in High-dimensional Space
     •  x-y space converted to r-
        space using polar
        coordinates
     •  Most of the probability
        mass is located in a thin
        shell at a specific radius