Cheat Sheet
Introduction
It is very difficult to come up with a single, consistent notation to cover the wide variety of data, models and algorithms
that we discuss. Furthermore, conventions differ between machine learning and statistics, and between different books
and papers. Nevertheless, we have tried to be as consistent as possible. Below we summarize most of the notation
used in this book, although individual sections may introduce new notation. Note also that the same symbol may have
different meanings depending on the context, although we try to avoid this where possible.
Symbol        Meaning
⌊x⌋           Floor of x, i.e., round down to nearest integer
⌈x⌉           Ceiling of x, i.e., round up to nearest integer
x ⊛ y         Convolution of x and y
x ⊙ y         Hadamard (elementwise) product of x and y
a ∧ b         Logical AND
a ∨ b         Logical OR
¬a            Logical NOT
I(x)          Indicator function, I(x) = 1 if x is true, else I(x) = 0
∞             Infinity
→             Tends towards, e.g., n → ∞
∝             Proportional to, so y = ax can be written as y ∝ x
|x|           Absolute value
|S|           Size (cardinality) of a set
n!            Factorial function
∇             Vector of first derivatives
∇²            Hessian matrix of second derivatives
≜             Defined as
O(·)          Big-O: roughly means order of magnitude
R             The real numbers
1 : n         Range (Matlab convention): 1 : n = 1, 2, ..., n
≈             Approximately equal to
arg max_x f(x)   Argmax: the value x that maximizes f
B(a, b)       Beta function, B(a, b) = Γ(a)Γ(b)/Γ(a + b)
B(α)          Multivariate beta function, ∏_k Γ(α_k) / Γ(∑_k α_k)
(n k)         n choose k, equal to n!/(k!(n − k)!)
δ(x)          Dirac delta function, δ(x) = ∞ if x = 0, else δ(x) = 0
exp(x)        Exponential function e^x
Γ(x)          Gamma function, Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du
Ψ(x)          Digamma function, Ψ(x) = (d/dx) log Γ(x)
Notation
We use boldface lower-case to denote vectors, such as x, and boldface upper-case to denote matrices, such as X. We denote entries in a matrix by non-bold upper case letters, such as Xij.
Vectors are assumed to be column vectors, unless noted otherwise. We use (x1, ..., xD) to denote a column vector created by stacking D scalars. If we write X = (x1, ..., xn), where the left hand side is a matrix, we mean to stack the xi along the columns, creating a matrix.
Symbol        Meaning
X ≻ 0         X is a positive definite matrix
tr(X)         Trace of a matrix
det(X)        Determinant of matrix X
|X|           Determinant of matrix X
X⁻¹           Inverse of a matrix
X†            Pseudo-inverse of a matrix
Xᵀ            Transpose of a matrix
xᵀ            Transpose of a vector
diag(x)       Diagonal matrix made from vector x
diag(X)       Diagonal vector extracted from matrix X
I or I_d      Identity matrix of size d × d (ones on diagonal, zeros off)
1 or 1_d      Vector of ones (of length d)
0 or 0_d      Vector of zeros (of length d)
||x|| = ||x||₂   Euclidean or ℓ2 norm, sqrt(∑_{j=1}^d x_j²)
||x||₁        ℓ1 norm, ∑_{j=1}^d |x_j|
X_{:,j}       jth column of matrix
X_{i,:}       transpose of ith row of matrix (a column vector)
X_{i,j}       Element (i, j) of matrix X
x ⊗ y         Tensor product of x and y
Probability notation
We denote random and fixed scalars by lower case, random and fixed vectors by bold lower case, and random and fixed matrices by bold upper case. Occasionally we use non-bold upper case to denote scalar random variables. Also, we use p(·) for both discrete and continuous random variables.
Symbol    Meaning
X, Y      Random variables
P()       Probability of a random event
F()       Cumulative distribution function (CDF), also called distribution function
p(x)      Probability mass function (PMF)
f(x)      Probability density function (PDF)
F(x, y)   Joint CDF
p(x, y)   Joint PMF
f(x, y)   Joint PDF
In general, we use upper case letters to denote constants, such as C, K, M, N, T , etc. We use lower case letters as dummy
indexes of the appropriate range, such as c = 1 : C to index classes, i = 1 : M to index data cases, j = 1 : N to index
input features, k = 1 : K to index states or clusters, t = 1 : T to index time, etc.
We use x to represent an observed data vector. In a supervised problem, we use y or y to represent the desired output
label. We use z to represent a hidden variable. Sometimes we also use q to represent a hidden discrete variable.
Symbol Meaning
C Number of classes
D Dimensionality of data vector (number of features)
N Number of data cases
Nc        Number of examples of class c, Nc = ∑_{i=1}^N I(yi = c)
R Number of outputs (response variables)
D Training data D = {(xi , yi )|i = 1 : N}
Dtest Test data
X Input space
Y Output space
In the predictive or supervised learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(xi, yi)}_{i=1}^N. Here D is called the training set, and N is the number of training examples. In the simplest setting, each training input xi is a D-dimensional vector of numbers, representing, say, the height and weight of a person; these are called features, attributes, or covariates.
Similarly, the form of the output or response variable can in principle be anything, but most methods assume that yi is a categorical or nominal variable from some finite set, yi ∈ {1, ..., C}. When yi is categorical, the problem is known as classification or pattern recognition, and when yi is real-valued, it is known as regression.
1.2.1 Classification
Here the goal is to learn a mapping from inputs x to outputs y, where y ∈ {1, ..., C}, with C being the number of classes. If C = 2, this is called binary classification; if C > 2, this is called multiclass classification. If the class labels are not mutually exclusive, we call it multi-label classification, but this is best viewed as predicting multiple related binary class labels (a so-called multiple output model). One way to formalize the problem is as function approximation: assume y = f(x) for some unknown function f, and the goal of learning is to estimate the function f given a labeled training set, and then to make predictions (estimates) using ŷ = f̂(x). Our main goal is to make predictions on novel inputs, meaning ones that we have not seen before (generalization).
We denote the probability distribution over possible labels, given the input vector x and training set D, by p(y|x, D). Given a probabilistic output, we can always compute our best guess as to the true label using

    ŷ = f̂(x) = argmax_{c=1,...,C} p(y = c|x, D)    (1.1)

This corresponds to the most probable class label, and is called the mode of the distribution p(y|x, D); it is also known as a MAP estimate (MAP stands for maximum a posteriori).
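As a concrete illustration, the following minimal sketch computes the MAP estimate of Equation 1.1 in Python. The posterior vector here is invented for the example; in practice it would come from a trained model:

import numpy as np

# Hypothetical posterior p(y = c | x, D) over C = 4 classes for one input x.
posterior = np.array([0.1, 0.6, 0.2, 0.1])

# MAP estimate: the index of the most probable class (Equation 1.1).
y_hat = np.argmax(posterior)
print(y_hat)  # -> 1, the mode of p(y | x, D)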
1.2.1.2 Applications
1.2.2 Regression
The descriptive or unsupervised learning approach is sometimes called knowledge discovery. We will formalize our task as one of density estimation, that is, we want to build models of the form p(xi|θ), instead of p(yi|xi, θ).
Let zi ∈ {1, ..., K} represent the cluster to which data point i is assigned (zi is an example of a hidden or latent variable).
Although the data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to latent factors. The most common approach to dimensionality reduction is called principal components analysis or PCA.
1.4.1 Representation
In supervised learning, a model must be represented as a conditional probability distribution P(y|x)(usually we call
it classifier) or a decision function f (x). The set of classifiers(or decision functions) is called the hypothesis space of
the model. Choosing a representation for a model is tantamount to choosing the hypothesis space that it can possibly
learn.
1 Domingos, P. A few useful things to know about machine learning. Commun. ACM 55(10):78-87 (2012).
1.4.2 Evaluation
In the hypothesis space, an evaluation function (also called objective function or risk function) is needed to distinguish
good classifiers(or decision functions) from bad ones.
Definition 1.1. In order to measure how well a function fits the training data, a loss function L : Y × Y → R≥0 is defined. For training example (xi, yi), the loss of predicting the value ŷ is L(yi, ŷ).

Definition 1.3. The risk function Rexp(f) can be estimated from the training data as

    Remp(f) = (1/N) ∑_{i=1}^N L(yi, f(xi))    (1.3)
You can define your own loss function, but if you're a novice, you're probably better off using one from the literature. There are conditions that loss functions should meet²:
1. They should approximate the actual loss you're trying to minimize. As was said in the other answer, the standard loss function for classification is zero-one loss (misclassification rate), and the ones used for training classifiers are approximations of that loss.
2. The loss function should work with your intended optimization algorithm. That's why zero-one loss is not used directly: it doesn't work with gradient-based optimization methods since it doesn't have a well-defined gradient (or even a subgradient, like the hinge loss for SVMs has).
The main algorithm that optimizes the zero-one loss directly is the old perceptron algorithm (chapter ??); a small illustration of the two losses is sketched below.
2 http://t.cn/zTrDxLO
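The following minimal sketch (not from the text; the labels and scores are invented for illustration) contrasts the zero-one loss with the hinge loss used as its convex surrogate, assuming labels y ∈ {−1, +1} and a real-valued score f(x):

import numpy as np

def zero_one_loss(y, score):
    # 1 if the sign of the score disagrees with the label, else 0.
    return (np.sign(score) != y).astype(float)

def hinge_loss(y, score):
    # Convex upper bound on the zero-one loss, used e.g. by SVMs.
    return np.maximum(0.0, 1.0 - y * score)

y = np.array([+1, -1, +1, -1])            # hypothetical labels
score = np.array([2.3, 0.4, -0.1, -1.7])  # hypothetical scores f(x)
print(zero_one_loss(y, score))  # [0. 1. 1. 0.]
print(hinge_loss(y, score))     # [0.  1.4 1.1 0. ]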
    min_{f∈F} Remp(f) = min_{f∈F} (1/N) ∑_{i=1}^N L(yi, f(xi))    (1.4)

    min_{f∈F} Rsrm(f) = min_{f∈F} (1/N) ∑_{i=1}^N L(yi, f(xi)) + λJ(f)    (1.6)
1.4.3 Optimization
Finally, we need a training algorithm (also called a learning algorithm) to search among the classifiers in the hypothesis space for the highest-scoring one. The choice of optimization technique is key to the efficiency of the model.
If data is plentiful, then one approach to avoiding over-fitting is to use some of the available data to train a range of models, or a given model with a range of values for its complexity parameters, and then to compare them on independent data, sometimes called a validation set, and select the one having the best predictive performance. If the model design is iterated many times using a limited size data set, then some over-fitting to the validation data can occur, and so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated.
Definition 1.7. Cross validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set³. One round of cross-validation involves partitioning the data into complementary subsets, performing the analysis on one subset (the training set), and validating the analysis on the other subset (the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
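A minimal K-fold cross-validation sketch follows. The model used here is a hypothetical 1-nearest-neighbour classifier on toy one-dimensional data; nothing in this snippet comes from the text:

import numpy as np

def k_fold_indices(n, k, seed=0):
    # Shuffle the indices once, then split them into k roughly equal folds.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k=5):
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Hypothetical model: 1-nearest-neighbour on the training fold.
        dists = np.abs(X[val][:, None] - X[train][None, :])
        preds = y[train][np.argmin(dists, axis=1)]
        scores.append(np.mean(preds == y[val]))
    return np.mean(scores)  # average validation accuracy over the k rounds

X = np.linspace(0, 1, 40)
y = (X > 0.5).astype(int)        # toy labels
print(cross_validate(X, y, k=5))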
The common theme that arises when analyzing and organizing data in high-dimensional spaces is that when the dimensionality increases, the volume of the space increases so fast (it grows exponentially with the dimensionality) that the available data become sparse.

Fig. 1.1: Illustration of the curse of dimensionality: the number of regions of a regular grid grows exponentially with the dimensionality D of the space.

    V_D(r) = K_D r^D    (1.7)

The fraction of the volume of the sphere that lies between radius r = 1 − ε and r = 1 is

    [V_D(1) − V_D(1 − ε)] / V_D(1) = 1 − (1 − ε)^D    (1.8)

For large D this fraction tends to 1 even for small values of ε. Thus most of the volume of a sphere is concentrated in a thin shell near the surface!
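A quick numerical check of Equation 1.8 (plain Python, written for this note):

# Fraction of a unit sphere's volume lying in a thin shell of thickness eps.
eps = 0.01
for D in (1, 10, 100, 1000):
    frac = 1 - (1 - eps) ** D
    print(D, round(frac, 4))
# D = 1    -> 0.01
# D = 10   -> 0.0956
# D = 100  -> 0.634
# D = 1000 -> 1.0 (to 4 decimal places)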
Probability theory provides us with a consistent mathematical framework for quantifying and manipulating uncertainty. And when combined with probability theory, decision theory allows us to make optimal decisions in situations involving uncertainty, such as those encountered in pattern recognition.
Suppose we have an input vector x together with a corresponding vector t of target variables, and our goal is to
predict t given a new value for x. For regression problems, t will comprise continuous variables, whereas for classifi-
cation problems t will represent class labels. The joint probability distribution p(x, t) provides a complete summary of
the uncertainty associated with these variables. Determination of p(x, t) from a set of training data is an example of
inference and is typically a very difficult problem.
The decision step, is the subject of decision theory to tell us how to make optimal decisions given the appropriate
probabilities. We shall see that the decision stage is generally very simple, even trivial, once we have solved the
inference problem.
Consider a classification problem in which the inputs are vectors x and we denote the classes by Ck. Using Bayes' theorem, these probabilities can be expressed in the form

    p(Ck|x) = p(x|Ck) p(Ck) / p(x)    (1.9)

Note that any of the quantities appearing in Bayes' theorem can be obtained from the joint distribution p(x, Ck) by either marginalizing or conditioning with respect to the appropriate variables. We can now interpret p(Ck) as the prior probability for the class Ck, and p(Ck|x) as the corresponding posterior probability; p(x|Ck) is the likelihood function. If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability.
We have broken the classification problem down into two separate stages: the inference stage, in which we use training data to learn a model for p(Ck|x), and the subsequent decision stage, in which we use these posterior probabilities to make optimal class assignments. An alternative possibility would be to solve both problems together and simply learn a function that maps inputs x directly into decisions. Such a function is called a discriminant function.
In fact, we can identify three distinct approaches to solving decision problems, given, in decreasing order of complexity, by
1. First solve the inference problem of determining the class-conditional densities p(x|Ck) for each class Ck individually. Also separately infer the prior class probabilities p(Ck). Then use Bayes' theorem in the form

    p(Ck|x) = p(x|Ck) p(Ck) / p(x)    (1.10)

to find the posterior class probabilities p(Ck|x). As usual, the denominator in Bayes' theorem can be found in terms of the quantities appearing in the numerator, because p(x) = ∑_k p(x|Ck) p(Ck). Equivalently, we can model the joint distribution p(x, Ck) directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input x. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.
2. First solve the inference problem of determining the posterior class probabilities p(Ck|x), and then subsequently use decision theory to assign each new x to one of the classes. Approaches that model the posterior probabilities directly are called discriminative models.
3. Find a function f(x), called a discriminant function, which maps each input x directly into a class label. For instance, in the case of two-class problems, f(·) might be binary valued and such that f = 0 represents class C1 and so on.
Now consider the relative merits of these three alternatives. Approach (a) is the most demanding because it involves finding the joint distribution over both x and Ck. For many applications, x will have high dimensionality, and consequently we may need a large training set in order to be able to determine the class-conditional densities to reasonable accuracy. Note that the class priors p(Ck) can often be estimated simply from the fractions of the training set data points in each of the classes. One advantage of approach (a), however, is that it also allows the marginal density of data p(x) to be determined from (1.83). This can be useful for detecting new data points that have low probability under the model and for which the predictions may be of low accuracy, which is known as outlier detection or novelty detection (Bishop, 1994; Tarassenko, 1995).
However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution p(x, Ck) when in fact we only really need the posterior probabilities p(Ck|x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in exploring the relative merits of generative and discriminative approaches to machine learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006). An even simpler approach is (c), in which we use the training data to find a discriminant function f(x) that maps each x directly onto a class label, thereby combining the inference and decision stages into a single learning problem. In the example of Figure 1.27, this would correspond to finding the value of x shown by the vertical green line, because this is the decision boundary giving the minimum probability of misclassification.
Reasons for wanting to compute the posterior probabilities include:
• Minimizing risk.
• The reject option.
• Compensating for class priors.
• Combining models. For complex applications, we may wish to break the problem into a number of smaller subproblems, each of which can be tackled by a separate module. As long as each of the two models gives posterior probabilities for the classes, we can combine the outputs systematically using the rules of probability. One simple way to do this is to assume that, for each class separately, the distributions of inputs are independent, so that, writing x1 and x2 for the inputs seen by the two modules, p(x1, x2|Ck) = p(x1|Ck) p(x2|Ck). This is an example of a conditional independence property. The posterior probability, given both sources of data (from different modules), is then given by p(Ck|x1, x2) ∝ p(x1|Ck) p(x2|Ck) p(Ck) ∝ p(Ck|x1) p(Ck|x2) / p(Ck).
Consider a discrete random variable x, and ask how much information is received when we observe a specific value for this variable. Our measure of information content will therefore depend on the probability distribution p(x), and we therefore look for a quantity h(x) that is a monotonic function of the probability p(x) and that expresses the information content. The form of h(·) can be found by noting that if we have two events x and y that are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately, so that h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent and so p(x, y) = p(x)p(y). From these two relationships, it is easily shown that h(x) must be given by the logarithm of p(x), and so we have

    h(x) = −log p(x)    (1.16)

where the negative sign ensures that information is positive or zero. Note that low probability events x correspond to high information content. The choice of basis for the logarithm is arbitrary.
Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of 1.16 with respect to the distribution p(x), and is given by the entropy

    H[x] = −∑_x p(x) log p(x)
In the case of discrete distributions, we see that the maximum entropy configuration corresponds to an equal distribution of probabilities across the possible states of the variable. Now consider the maximum entropy configuration for a continuous variable. It will be necessary to constrain the first and second moments of p(x) as well as preserving the normalization constraint. Therefore we need to maximize the differential entropy H[x] = −∫ p(x) ln p(x) dx with three constraints

    ∫ p(x) dx = 1    (1.21)
    ∫ x p(x) dx = μ    (1.22)
    ∫ (x − μ)² p(x) dx = σ²    (1.23)

The constrained maximization can be performed using Lagrange multipliers, so that we maximize the following functional with respect to p(x):

    −∫ p(x) ln p(x) dx + λ₁(∫ p(x) dx − 1) + λ₂(∫ x p(x) dx − μ) + λ₃(∫ (x − μ)² p(x) dx − σ²)    (1.24)

Using the calculus of variations, we set the derivative of this functional to zero, giving

    p(x) = exp{−1 + λ₁ + λ₂ x + λ₃ (x − μ)²}    (1.25)

The Lagrange multipliers can be found by back substitution of this result into the three constraint equations, leading finally to the result

    p(x) = (1/(2πσ²)^{1/2}) exp{−(x − μ)² / (2σ²)}    (1.26)

and so the distribution that maximizes the differential entropy is the Gaussian.
If we evaluate the differential entropy of the Gaussian, we obtain

    H[x] = (1/2){1 + ln(2πσ²)}    (1.27)

Thus we see again that the entropy increases as the distribution becomes broader, i.e., as σ² increases.
Suppose we have a joint distribution p(x, y) from which we draw pairs of values of x and y. If a value of x is already known, then the additional information needed to specify the corresponding value of y is given by −ln p(y|x); thus the average additional information needed can be written as

    H[y|x] = −∫∫ p(y, x) ln p(y|x) dy dx    (1.28)

which is called the conditional entropy of y given x. Using the product rule, the conditional entropy satisfies the relation

    H[x, y] = H[y|x] + H[x]    (1.29)

where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of the marginal distribution p(x).

Proof.

    H[x, y] − H[y|x] = −∫∫ p(y, x) ln p(y, x) dy dx + ∫∫ p(y, x) ln p(y|x) dy dx    (1.30)
                     = ∫∫ p(y, x) ln (p(y|x)/p(y, x)) dy dx    (1.31)
                     = ∫∫ p(y, x) ln (1/p(x)) dy dx    (1.32)
                     = −∫∫ p(y, x) ln p(x) dy dx    (1.33)
                     = −∫ p(x) ln p(x) dx    (1.34)
                     = H[x]    (1.35)
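Although the derivation above is for differential entropy, the same identity holds for discrete entropy; a quick numerical check on an arbitrary 2 × 2 joint distribution (invented for this note) is:

import numpy as np

p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])        # joint p(x, y), sums to 1
p_x = p_xy.sum(axis=1)               # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]    # conditional p(y|x)

H_xy = -(p_xy * np.log(p_xy)).sum()                  # joint entropy H[x, y]
H_x = -(p_x * np.log(p_x)).sum()                     # marginal entropy H[x]
H_y_given_x = -(p_xy * np.log(p_y_given_x)).sum()    # conditional entropy H[y|x]

print(np.isclose(H_xy, H_y_given_x + H_x))           # True: H[x,y] = H[y|x] + H[x]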
Consider some unknown distribution p(x), and suppose that we have modelled this using an approximating distribution q(x). Then the average additional amount of information (in nats) required to specify the value of x as a result of using q(x) instead of the true distribution is given by

    KL(p‖q) = −∫ p(x) ln q(x) dx − (−∫ p(x) ln p(x) dx)    (1.36)
            = −∫ p(x) ln (q(x)/p(x)) dx    (1.37)

This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence (Kullback and Leibler, 1951), between the distributions p(x) and q(x). Note that it is not a symmetrical quantity. Now introduce the concept of convex functions. A function f(x) is said to be convex if it has the property that every chord lies on or above the function, as shown in Figure 1.31. Any value of x in the interval from x = a to x = b can be written in the form λa + (1 − λ)b where 0 ≤ λ ≤ 1. The corresponding point on the chord is given by λf(a) + (1 − λ)f(b), and the corresponding value of the function is f(λa + (1 − λ)b). Convexity then implies

    f(λa + (1 − λ)b) ≤ λf(a) + (1 − λ)f(b)    (1.38)

This is equivalent to the requirement that the second derivative of the function be everywhere positive. A function is called strictly convex if the equality is satisfied only for λ = 0 and λ = 1. If a function has the opposite property, namely that every chord lies on or below the function, it is called concave, with a corresponding definition for strictly concave. If a function f(x) is convex, then −f(x) will be concave.
Using the technique of proof by induction, we can show that a convex function f(x) satisfies

    f(∑_{i=1}^M λᵢ xᵢ) ≤ ∑_{i=1}^M λᵢ f(xᵢ)    (1.39)

where λᵢ ≥ 0 and ∑ᵢ λᵢ = 1, for any set of points {xᵢ}; this is known as Jensen's inequality. If we interpret the λᵢ as the probability distribution over a discrete variable x taking the values {xᵢ}, then 1.39 can be written

    f(E[x]) ≤ E[f(x)]    (1.40)

where E(·) denotes the expectation. For continuous variables, Jensen's inequality takes the form

    f(∫ x p(x) dx) ≤ ∫ f(x) p(x) dx    (1.41)
Suppose we try to approximate this distribution using some parametric distribution q(x|θ). Then the expectation of the KL divergence with respect to p(x) can be approximated by a finite sum over these training points, so that

    KL(p‖q) ≈ ∑_{n=1}^N {−ln q(xₙ|θ) + ln p(xₙ)}    (1.48)

The second term on the right-hand side is independent of θ, and the first term is the negative log likelihood function for θ under the distribution q(x|θ) evaluated using the training set. Thus minimizing this KL divergence is equivalent to maximizing the likelihood function.
If, for a joint distribution, the variables x and y are not independent, we can gain some idea of whether they are close to being independent by considering the KL divergence between the joint distribution and the product of the marginals, given by

    I[x, y] ≜ KL(p(x, y)‖p(x)p(y)) = −∫∫ p(x, y) ln (p(x)p(y) / p(x, y)) dx dy

which is called the mutual information between the variables x and y. From the properties of the KL divergence, we see that I(x, y) ≥ 0, with equality if and only if x and y are independent. Using the sum and product rules of probability, we see that

    I[x, y] = H[x] − H[x|y] = H[y] − H[y|x]

Thus we can view the mutual information as the reduction in the uncertainty about x by virtue of being told the value of y (or vice versa).
1.9.2.1 Representation
1.9.2.2 Evaluation
No training is needed.
1.9.2.3 Optimization
No training is needed.
1.9.3 Overfitting
When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one? A natural approach is to compute the misclassification rate on the training set for each method.
Chapter 2
Probability and Statistics
One role for the distributions is to model the probability distribution p(x) of a random variable x, given a finite set of observations. This problem is known as density estimation. We shall assume that the data points are independent and identically distributed.
Parametric distributions are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian, for example. In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.
An important role is played by conjugate priors, which lead to posterior distributions having the same functional form as the prior, and which therefore lead to a greatly simplified Bayesian analysis.
Nonparametric density estimation methods are ones in which the form of the distribution typically depends on the size of the data set. Parameters in such models control the model complexity rather than the form of the distribution. We cover three nonparametric methods based respectively on histograms, nearest neighbours, and kernels.
There are two different interpretations of probability. One is called the frequentist interpretation. In this view, prob-
abilities represent long run frequencies of events. For example, the above statement means that, if we flip the coin
many times, we expect it to land heads about half the time.
The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to
quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials
(Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or
tails on the next toss.
One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event. To give another machine learning oriented example, we might have observed a blip on our radar screen, and want to compute the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted.
2.2.1 Concepts
The expression p(A) denotes the probability that event A is true. We require that 0 ≤ p(A) ≤ 1, where p(A) = 0 means the event definitely will not happen, and p(A) = 1 means the event definitely will happen. p(Ā) denotes the probability of the event not A; this is defined to be p(Ā) = 1 − p(A).
We denote a random event by defining a random variable X.
Discrete random variable: X can take on any value from a finite or countably infinite set X. We denote the probability of the event that X = x by p(X = x), or just p(x) for short. Here p(·) is called a probability mass function or pmf. The pmf is defined on the state space. I denotes the binary indicator function.
Continuous random variable: the value of X is real-valued.
probability: Probability is the measure of the likeliness that an event will occur.
conditional probability: A conditional probability measures the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred.
joint probability: Joint probability is a measure of two events happening at the same time, and can only be applied to situations where more than one observation can occur at the same time.
prior probability distribution: In Bayesian statistical inference, a prior probability distribution, often called simply the prior, of an uncertain quantity p is the probability distribution that would express one's uncertainty about p before some evidence is taken into account.
posterior probability distribution: In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account.
likelihood function: In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model. The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability or probability density of those observed outcomes given those parameter values.
Define the conditional probability of event A, given that event B is true, as follows:

    p(A|B) = p(A, B) / p(B), if p(B) > 0    (2.4)

    p(A, B|C) = p(A, B, C) / p(C)    (2.8)
              = p(A, C) p(B|A, C) / p(C)    (2.9)
              = p(A|C) p(B|A, C)    (2.10)

    p(C|A, B) = p(A, B, C) / p(A, B)    (2.11)
              = [p(A, B, C)/p(B)] / [p(A, B)/p(B)]    (2.12)
              = p(A, C|B) / p(A|B)    (2.13)
2.2.2.6 CDF

    F(x) ≜ P(X ≤ x) = { ∑_{u≤x} p(u),        discrete
                        ∫_{−∞}^x f(u) du,    continuous    (2.14)

For a discrete random variable, we denote the probability of the event that X = x by P(X = x), or just p(x) for short. Here p(x) is called a probability mass function or PMF. A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value⁴. This satisfies the properties 0 ≤ p(x) ≤ 1 and ∑_{x∈X} p(x) = 1.
For a continuous variable, in the equation F(x) = ∫_{−∞}^x f(u) du, the function f(x) is called a probability density function or PDF. A probability density function is a function that describes the relative likelihood for this random variable to take on a given value⁵. This satisfies the properties f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1.

Probability densities:

    p(x ∈ (a, b)) = ∫_a^b p(x) dx    (2.15)

The probability density function p(x) must satisfy the two conditions

    p(x) ≥ 0 and ∫ p(x) dx = 1    (2.16)

4 http://en.wikipedia.org/wiki/Probability_mass_function
5 http://en.wikipedia.org/wiki/Probability_density_function
Marginal CDF:

    F_X(x) ≜ F(x, +∞) = { ∑_{xi≤x} P(X = xi) = ∑_{xi≤x} ∑_{j=1}^{+∞} P(X = xi, Y = yj),    discrete
                          ∫_{−∞}^x f_X(u) du = ∫_{−∞}^x ∫_{−∞}^{+∞} f(u, v) dv du,        continuous    (2.22)

    F_Y(y) ≜ F(+∞, y) = { ∑_{yj≤y} P(Y = yj) = ∑_{yj≤y} ∑_{i=1}^{+∞} P(X = xi, Y = yj),    discrete
                          ∫_{−∞}^y f_Y(v) dv = ∫_{−∞}^{+∞} ∫_{−∞}^y f(u, v) dv du,        continuous    (2.23)

Conditional PMF:

    p(X = xi|Y = yj) = p(X = xi, Y = yj) / p(Y = yj), if p(Y = yj) > 0    (2.26)

Bayes' theorem:

    p(Y|X) = p(X|Y) p(Y) / p(X)    (2.28)

Denominator in Bayes' theorem:

    p(X) = ∑_Y p(X|Y) p(Y)    (2.29)
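As a toy numerical illustration of Equations 2.28 and 2.29 (the numbers are invented; think of Y as a disease indicator and X as a positive test result):

# Prior p(Y) and likelihood p(X = positive | Y), both hypothetical.
p_y = {True: 0.01, False: 0.99}
p_x_given_y = {True: 0.9, False: 0.05}

# Denominator (2.29): p(X) = sum_Y p(X|Y) p(Y)
p_x = sum(p_x_given_y[y] * p_y[y] for y in p_y)

# Bayes' theorem (2.28): p(Y = True | X = positive)
posterior = p_x_given_y[True] * p_y[True] / p_x
print(round(posterior, 3))  # -> 0.154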
Bayesian probabilities. So far, we have viewed probabilities in terms of the frequencies of random, repeatable events, which we shall refer to as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.
We can adopt a similar approach when making inferences about quantities such as the parameters w in polynomial curve fitting. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D = {t1, ..., tN} is expressed through the conditional probability p(D|w). Bayes' theorem, which takes the form

    p(w|D) = p(D|w) p(w) / p(D)    (2.30)

then allows us to evaluate the uncertainty in w after we have observed D, in the form of the posterior probability p(w|D).
The quantity p(D|w) on the right-hand side of Bayes' theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w.
We can state Bayes' theorem in words as posterior ∝ likelihood × prior, where all of these quantities are viewed as functions of w. Integrating both sides with respect to w, we can express the denominator in Bayes' theorem in terms of the prior distribution and the likelihood function.
A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D|w). In the machine learning literature, the negative log of the likelihood function is called an error function.
One approach to determining frequentist error bars is the bootstrap, in which multiple data sets are created as follows. Suppose our original data set X consists of N data points. We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. This process can be repeated L times to generate L data sets, each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets.
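A minimal sketch of the bootstrap for the mean of a sample (the data are synthetic, and numpy is assumed):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)    # original data set, N = 100

L = 1000                                        # number of bootstrap data sets
boot_means = np.empty(L)
for l in range(L):
    # Draw N points from X at random, with replacement.
    XB = rng.choice(X, size=len(X), replace=True)
    boot_means[l] = XB.mean()

# Variability of the estimate across bootstrap data sets gives an error bar.
print(X.mean(), boot_means.std())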
We say X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the joint as the product of the two marginals, i.e.,

    X ⊥ Y ⟺ P(X, Y) = P(X) P(Y)

We say X and Y are conditionally independent (CI) given Z if the conditional joint can be written as a product of conditional marginals:

    X ⊥ Y | Z ⟺ P(X, Y|Z) = P(X|Z) P(Y|Z)    (2.39)

2.2.6 Quantiles

Since the cdf F is a monotonically increasing function, it has an inverse; let us denote this by F⁻¹. If F is the cdf of X, then F⁻¹(α) is the value of xα such that P(X ≤ xα) = α; this is called the α quantile of F. The value F⁻¹(0.5) is the median of the distribution, with half of the probability mass on the left, and half on the right. The values F⁻¹(0.25) and F⁻¹(0.75) are the lower and upper quartiles.
    var[X] ≜ E[(X − μ)²] = E[X²] − μ²    (2.41)

Approximation of an expectation by a finite sum over samples:

    E[f] ≈ (1/N) ∑_{n=1}^N f(xₙ)    (2.45)

Covariance:

    cov[x, y] ≜ E_{x,y}[{x − E[x]}{y − E[y]}]    (2.48)

In the case of two vectors of random variables x and y,

    cov[x, y] ≜ E_{x,y}[{x − E[x]}{yᵀ − E[yᵀ]}] = E_{x,y}[xyᵀ] − E[x]E[yᵀ]    (2.49)
In this section, we review some commonly used parametric distributions defined on discrete state spaces, both finite
and countably infinite.
Definition 2.1. Now suppose we toss a coin only once. Let X ∈ {0, 1} be a binary random variable, with probability of success or heads of θ. We say that X has a Bernoulli distribution. This is written as X ∼ Ber(θ), where the pmf is defined as

    Ber(x|θ) ≜ θ^{I(x=1)} (1 − θ)^{I(x=0)}    (2.50)

Definition 2.2. Suppose we toss a coin n times. Let X ∈ {0, 1, ..., n} be the number of heads. If the probability of heads is θ, then we say X has a binomial distribution, written as X ∼ Bin(n, θ). The pmf is given by

    Bin(k|n, θ) ≜ (n choose k) θᵏ (1 − θ)^{n−k}    (2.51)
Definition 2.3. The Bernoulli distribution can be used to model the outcome of one coin toss. To model the outcome of tossing a K-sided die, let x = (I(x = 1), ..., I(x = K)) ∈ {0, 1}ᴷ be a random vector (this is called dummy encoding or one-hot encoding); then we say X has a multinoulli distribution (or categorical distribution), written as X ∼ Cat(θ). The pmf is given by:

    p(x) ≜ ∏_{k=1}^K θₖ^{I(xₖ=1)}    (2.52)

Definition 2.4. Suppose we toss a K-sided die n times. Let x = (x1, x2, ..., xK) ∈ {0, 1, ..., n}ᴷ be a random vector, where xⱼ is the number of times side j of the die occurs; then we say X has a multinomial distribution, written as X ∼ Mu(n, θ). The pmf is given by

    p(x) ≜ (n choose x1 ... xK) ∏_{k=1}^K θₖ^{xₖ}    (2.53)

where the multinomial coefficient is

    (n choose x1 ... xK) ≜ n! / (x1! x2! ... xK!)

The Bernoulli distribution is just a special case of the binomial distribution with n = 1, and the multinoulli distribution is likewise a special case of the multinomial distribution. See Table 2.1 for a summary.
Name          K    n    X
Bernoulli     1    1    x ∈ {0, 1}
Binomial      1    -    x ∈ {0, 1, ..., n}
Multinoulli   -    1    x ∈ {0, 1}ᴷ, ∑_{k=1}^K xₖ = 1
Multinomial   -    -    x ∈ {0, 1, ..., n}ᴷ, ∑_{k=1}^K xₖ = n
Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events
occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently
of the time since the last event.
Definition 2.5. We say that X ∈ {0, 1, 2, ...} has a Poisson distribution with parameter λ > 0, written as X ∼ Poi(λ), if its pmf is

    p(x|λ) = e^{−λ} λˣ / x!    (2.54)
The first term is just the normalization constant, required to ensure the distribution sums to 1.
The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic acci-
dents.
The Poisson distribution can be derived as a limiting case of the binomial distribution as the number of trials goes to infinity while the expected number of successes remains fixed; this is known as the law of rare events. Assume that there exists a small enough subinterval for which the probability of an event occurring twice is negligible. Given only the expected number of total events in the whole interval, denoted λ, divide the whole interval into n subintervals I1, I2, ..., In of equal size, such that n > λ. The occurrence of an event in the whole interval can be seen as a Bernoulli trial, where the ith trial corresponds to looking at whether an event happens in the subinterval Iᵢ with probability p = λ/n. Then the number of events x that occur obeys a binomial distribution, X ∼ B(x; n, p).

    P(x) = B(x; n, p)
         = C(n, x) pˣ (1 − p)^{n−x}
         = [n! / (x!(n − x)!)] (λ/n)ˣ (1 − λ/n)^{n−x}
         = [n(n − 1)(n − 2)···(n − x + 1) / nˣ] (λˣ/x!) (1 − λ/n)ⁿ (1 − λ/n)^{−x}

Taking the limit as n → ∞, the first and last factors tend to 1 and (1 − λ/n)ⁿ → e^{−λ}, so

    lim_{n→∞} P(x) = e^{−λ} λˣ / x!    (2.58)
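A quick numerical check of this limit using scipy (assumed available; the values of λ and x are arbitrary):

from scipy.stats import binom, poisson

lam, x = 3.0, 2
for n in (10, 100, 10000):
    print(n, binom.pmf(x, n, lam / n))   # approaches the Poisson pmf as n grows
print(poisson.pmf(x, lam))               # e^{-lam} lam^x / x! = 0.2240...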
The empirical distribution function⁶, or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. Let D = {x1, x2, ..., xN} be a sample set; it is defined as

    Fₙ(x) ≜ (1/N) ∑_{i=1}^N I(xᵢ ≤ x)    (2.59)
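A direct numpy sketch of Equation 2.59 (the sample values are invented for illustration):

import numpy as np

def empirical_cdf(samples, x):
    # F_n(x) = (1/N) * number of samples <= x
    samples = np.asarray(samples)
    return np.mean(samples <= x)

D = [2.1, 0.4, 3.3, 1.7, 2.8]
print(empirical_cdf(D, 2.0))   # 2 of the 5 samples are <= 2.0, so 0.4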
In this section we present some commonly used univariate (one-dimensional) continuous probability distributions.
6 http://en.wikipedia.org/wiki/Empirical_distribution_function
Table 2.3: Summary of Gaussian distribution.
1. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a
distribution, namely its mean and variance.
2. Second, the central limit theorem (Section TODO) tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or noise.
3. Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the
constraint of having a specified mean and variance, as we show in Section TODO; this makes it a good default
choice in many cases.
4. Finally, it has a simple mathematical form, which results in easy to implement, but often highly effective, methods,
as we will see.
See (Jaynes 2003, ch 7) for a more extensive discussion of why Gaussians are so widely used. More about the Gaussian distribution is discussed in Section 2.5.
Here μ is the mean, σ² > 0 is the scale parameter, and ν > 0 is called the degrees of freedom of the Student t distribution. See Figure 2.1 for some plots.
The variance is only defined if ν > 2. The mean is only defined if ν > 1.
As an illustration of the robustness of the Student distribution, consider Figure 2.2. We see that the Gaussian is affected a lot, whereas the Student distribution hardly changes. This is because the Student has heavier tails, at least for small ν (see Figure 2.1).
If ν = 1, this distribution is known as the Cauchy or Lorentz distribution. This is notable for having such heavy tails that the integral that defines the mean does not converge.
To ensure finite variance, we require ν > 2. It is common to use ν = 4, which gives good performance in a range of problems (Lange et al. 1989). For ν ≫ 5, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.
Here μ is a location parameter of the Laplace distribution and b > 0 is a scale parameter. See Figure 2.1 for a plot.
Fig. 2.1: (a) The pdfs for N(0, 1), T(0, 1, 1) and Lap(0, 1/√2). The mean is 0 and the variance is 1 for both the Gaussian and Laplace. The mean and variance of the Student is undefined when ν = 1. (b) Log of these pdfs. Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution, which is always log-concave (and log-convex...). Nevertheless, both are unimodal.
Its robustness to outliers is illustrated in Figure 2.2. It also puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model, as we will see in Section TODO.
Here a > 0 is called the shape parameter and b > 0 is called the rate parameter. See Figure 2.3 for some plots.
Fig. 2.2: Illustration of the effect of outliers on fitting Gaussian, Student and Laplace distributions. (a) No outliers (the Gaussian and Student curves are on top of each other). (b) With outliers. We see that the Gaussian is more affected by outliers than the Student and Laplace distributions.
Fig. 2.3: (a) Some Ga(a, b = 1) distributions. If a ≤ 1, the mode is at 0, otherwise it is > 0. As we increase the rate b, we reduce the horizontal scale, thus squeezing everything leftwards and upwards. (b) An empirical pdf of some rainfall data, with a fitted Gamma distribution superimposed.

Pareto distribution: X ∼ Pareto(k, m) has pdf k mᵏ x^{−(k+1)} I(x ≥ m), mean km/(k − 1) if k > 1, and variance m²k / ((k − 1)²(k − 2)) if k > 2.
The Pareto distribution is used to model the distribution of quantities that exhibit long tails, also called heavy tails.
As k → ∞, the distribution approaches δ(x − m). See Figure 2.5(a) for some plots. If we plot the distribution on a log-log scale, it forms a straight line, of the form log p(x) = a log x + c for some constants a and c. See Figure 2.5(b) for an illustration (this is known as a power law).
The Gaussian distribution, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian takes the form

    N(x|μ, σ²) = (1/(2πσ²)^{1/2}) exp{−(1/(2σ²)) (x − μ)²}    (2.62)

where μ is the mean and σ² is the variance. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form

    N(x|μ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) exp{−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)}    (2.63)

Recall from Section 2.8.2 that for a D-dimensional vector x, the multivariate Gaussian distribution (MVN) pdf takes the form:

    N(x|μ, Σ) ≜ (1/((2π)^{D/2} |Σ|^{1/2})) exp[−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)]    (2.64)

where μ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ.
Fig. 2.5: (a) The Pareto distribution Pareto(x|m, k) for m = 1. (b) The pdf on a log-log scale.
The Gaussian distribution arises in many different contexts and can be motivated from a variety of different perspectives, for example as the distribution that maximizes the entropy. Another situation in which the Gaussian distribution arises is when we consider the sum of multiple random variables. The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases (Walker, 1969).
The functional dependence of the Gaussian on x is through the quadratic form in the exponent,

    Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ)    (2.65)

The quantity Δ is the Mahalanobis distance between a data vector x and the mean vector μ, and reduces to the Euclidean distance when Σ is the identity matrix. We can gain a better understanding of this quantity by performing an eigendecomposition of Σ. That is, we write Σ = UΛUᵀ, where U is an orthonormal matrix of eigenvectors satisfying UᵀU = I, and Λ is a diagonal matrix of eigenvalues. Using the eigendecomposition, we have that

    Σ⁻¹ = U⁻ᵀ Λ⁻¹ U⁻¹ = U Λ⁻¹ Uᵀ = ∑_{i=1}^D (1/λᵢ) uᵢ uᵢᵀ    (2.66)

where uᵢ is the ith column of U, containing the ith eigenvector. Hence we can rewrite the Mahalanobis distance as follows:
    (x − μ)ᵀ Σ⁻¹ (x − μ) = (x − μ)ᵀ (∑_{i=1}^D (1/λᵢ) uᵢ uᵢᵀ) (x − μ)    (2.67)
                         = ∑_{i=1}^D (1/λᵢ) (x − μ)ᵀ uᵢ uᵢᵀ (x − μ)    (2.68)
                         = ∑_{i=1}^D yᵢ² / λᵢ    (2.69)

where yᵢ ≜ uᵢᵀ(x − μ). In two dimensions, the contour along which the Mahalanobis distance equals 1 is the ellipse

    y₁²/λ₁ + y₂²/λ₂ = 1    (2.70)

Hence we see that the contours of equal probability density of a Gaussian lie along ellipses. This is illustrated in Figure 2.6. The eigenvectors determine the orientation of the ellipse, and the eigenvalues determine how elongated it is.
Fig. 2.6: Visualization of a 2 dimensional Gaussian density. The major and minor axes of the ellipse are defined by the
first two eigenvectors of the covariance matrix, namely u1 and u2 . Based on Figure 2.7 of (Bishop 2006a)
In general, we see that the Mahalanobis distance corresponds to Euclidean distance in a transformed coordinate system, where we shift by μ and rotate by U.
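A small numpy sketch of Equations 2.65-2.66 (the covariance matrix, mean, and test point are made up for the example):

import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # hypothetical covariance matrix
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.0])

# Squared Mahalanobis distance via the inverse covariance (Equation 2.65).
d = x - mu
maha2 = d @ np.linalg.inv(Sigma) @ d

# The same quantity via the eigendecomposition Sigma = U Lam U^T (Equation 2.66).
lam, U = np.linalg.eigh(Sigma)
y = U.T @ d
maha2_eig = np.sum(y**2 / lam)

print(np.isclose(maha2, maha2_eig))     # True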
Proof. First of all, we note that the matrix Σ can be taken to be symmetric, without loss of generality. Now we consider the eigenvector equation for the covariance matrix

    Σuᵢ = λᵢuᵢ    (2.71)

where i = 1, ..., D. Because Σ is a real, symmetric matrix, its eigenvalues will be real, and its eigenvectors can be chosen to form an orthonormal set, so that

    uᵢᵀuⱼ = Iᵢⱼ    (2.72)

where Iᵢⱼ is the (i, j) element of the identity matrix and satisfies

    Iᵢⱼ = 1 if i = j, and 0 otherwise    (2.73)

The covariance matrix Σ can be expressed as an expansion in terms of its eigenvectors in the form

    Σ = UΛUᵀ    (2.74)
      = ∑_{i=1}^D λᵢ uᵢ uᵢᵀ    (2.75)

where uᵢ is the ith column of U, containing the ith eigenvector. Similarly, the inverse covariance matrix Σ⁻¹ can be expressed as

    Σ⁻¹ = ∑_{i=1}^D (1/λᵢ) uᵢ uᵢᵀ    (2.76)
Define

    y = U(x − μ)    (2.79)

where U is the matrix whose rows are the transposed eigenvectors,

    U = [u₁ᵀ; u₂ᵀ; ...; u_Dᵀ]    (2.80)

U is an orthogonal matrix. A matrix whose eigenvalues are strictly positive is said to be positive definite, and if all the eigenvalues are nonnegative, then the covariance matrix is said to be positive semidefinite.
Now consider the form of the Gaussian distribution in the new coordinate system defined by the yᵢ. In going from the x to the y coordinate system, we have a Jacobian matrix J with elements given by

    Jᵢⱼ = ∂xᵢ/∂yⱼ = Uⱼᵢ    (2.81)
    J = ∂x/∂y    (2.82)

where Uⱼᵢ are the elements of the matrix Uᵀ. Using the orthonormality property of the matrix U, we see that the square of the determinant of the Jacobian matrix is

    |J|² = |UᵀU| = |I| = 1    (2.83)

Also, the determinant |Σ| of the covariance matrix can be written as the product of its eigenvalues, and hence

    |Σ|^{1/2} = ∏_{j=1}^D λⱼ^{1/2}    (2.84)

In the yⱼ coordinate system the Gaussian therefore factorizes into the product of D independent univariate Gaussian distributions, and so we have normalized the multivariate Gaussian.
Now we can check the first and second order moments of the Gaussian.
2.5.1.1 Limitations
The number of parameters of the Gaussian distribution grows quadratically with the dimension. A further limitation of the Gaussian distribution is that it is intrinsically unimodal (i.e., has a single maximum) and so is unable to provide a good approximation to multimodal distributions. We will introduce latent variables, also called hidden variables or unobserved variables, to address both of these problems.
Theorem 2.1. (MLE for an MVN) If we have N iid samples xᵢ ∼ N(μ, Σ), then the MLE for the parameters is given by

    μ̂ = (1/N) ∑_{i=1}^N xᵢ ≜ x̄    (2.86)
    Σ̂ = (1/N) ∑_{i=1}^N (xᵢ − x̄)(xᵢ − x̄)ᵀ    (2.87)
      = (1/N) (∑_{i=1}^N xᵢ xᵢᵀ) − x̄ x̄ᵀ    (2.88)
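A direct numpy check of Equations 2.86-2.88 on synthetic data (the true parameters below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # N x D samples

mu_hat = X.mean(axis=0)                                        # Equation 2.86
centered = X - mu_hat
Sigma_hat = (centered.T @ centered) / len(X)                   # Equation 2.87
Sigma_hat_alt = (X.T @ X) / len(X) - np.outer(mu_hat, mu_hat)  # Equation 2.88

print(mu_hat)                                  # close to true_mu
print(np.allclose(Sigma_hat, Sigma_hat_alt))   # True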
In this section, we show that the multivariate Gaussian is the distribution with maximum entropy subject to having a specified mean and covariance (see also Section TODO). This is one reason the Gaussian is so widely used: the first two moments are usually all that we can reliably estimate from data, so we want a distribution that captures these properties, but otherwise makes as few additional assumptions as possible.
To simplify notation, we will assume the mean is zero. The pdf has the form

    f(x) = (1/Z) exp(−(1/2) xᵀ Σ⁻¹ x)    (2.89)
Theorem 2.2. If two sets of variables are jointly multivariate Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
The multivariate normal distribution is given by

    N(x|μ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))    (2.90)

Write x and μ as partitioned vectors

    x = [xa; xb],  μ = [μa; μb]    (2.91)

Consider the conditional distribution p(xa|xb) and the marginal distribution p(xa). Separate the components of the covariance matrix into the partitioned block matrix

    Σ = [Σaa Σab; Σba Σbb]    (2.92)

From the product rule of probability, we see that this conditional distribution can be evaluated from the joint distribution p(x) = p(xa, xb) simply by fixing xb to the observed value and normalizing the resulting expression to obtain a valid probability distribution over xa. To evaluate the conditional probability from the joint probability, for any arbitrary xb, we can use the product and sum rules of probability:

    p(xb) = ∫ p(xa, xb) dxa    (2.93)
    p(xa|xb) = p(xa, xb) / p(xb)    (2.94)
             ∝ p(xa, xb)    (2.95, 2.96)

which has the exponential quadratic form, and hence the corresponding conditional distribution will be Gaussian.
    −(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)    (2.97)
      = −(1/2) [(xa − μa)ᵀ (xb − μb)ᵀ] [Λaa Λab; Λba Λbb] [(xa − μa); (xb − μb)]    (2.98)
      = −(1/2) ((xa − μa)ᵀ Λaa (xa − μa) + (xb − μb)ᵀ Λba (xa − μa)
              + (xa − μa)ᵀ Λab (xb − μb) + (xb − μb)ᵀ Λbb (xb − μb))    (2.99-2.101)

where Λ ≜ Σ⁻¹ is the precision matrix, partitioned conformably with x into blocks Λaa, Λab, Λba, Λbb.
With the operation called completing the square, the exponent of a general Gaussian distribution N(x|μ, Σ) can be written

    −(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ) = −(1/2) xᵀ Σ⁻¹ x + xᵀ Σ⁻¹ μ + const    (2.102)

Based on this, we can apply the method of undetermined coefficients to determine the parameters: the covariance matrix is read off from the second-order term in xa, and the mean vector from the linear term in xa.
We need to make use of the Schur complement for the inverse of a partitioned matrix (the precision matrix):

    [A B; C D]⁻¹ = [M, −M B D⁻¹; −D⁻¹ C M, D⁻¹ + D⁻¹ C M B D⁻¹]    (2.103, 2.104)

where

    M = (A − B D⁻¹ C)⁻¹    (2.105)

is the Schur complement, so that the required blocks can be evaluated.
Wrapping up: given a joint Gaussian distribution over (xa, xb), we obtain the partitioned Gaussians.
Conditional distribution: p(xa|xb) = N(xa | μa + Σab Σbb⁻¹ (xb − μb), Σaa − Σab Σbb⁻¹ Σba).
Marginal distribution: p(xa) = N(xa | μa, Σaa).
Before defining the exponential family, we mention several reasons why it is important:
• It can be shown that, under certain regularity conditions, the exponential family is the only family of distributions with finite-sized sufficient statistics, meaning that we can compress the data into a fixed-sized summary without loss of information. This is particularly useful for online learning, as we will see later.
• The exponential family is the only family of distributions for which conjugate priors exist, which simplifies the computation of the posterior (see Section 2.6.8).
• The exponential family can be shown to be the family of distributions that makes the least set of assumptions subject to some user-chosen constraints (see Section 2.6.9).
• The exponential family is at the core of generalized linear models, as discussed in Section ??.
• The exponential family is at the core of variational inference, as discussed in Section TODO.
2.6.1 Definition

A pdf or pmf p(x|θ), for x ∈ Rᵐ and θ ∈ Rᴰ, is said to be in the exponential family if it is of the form

    p(x|θ) = (1/Z(θ)) h(x) exp[θᵀ φ(x)] = h(x) exp[θᵀ φ(x) − A(θ)]    (2.112)

where

    Z(θ) = ∫ h(x) exp[θᵀ φ(x)] dx    (2.113)
    A(θ) = log Z(θ)    (2.114)

Here θ are called the natural parameters or canonical parameters, φ(x) ∈ Rᴰ is called a vector of sufficient statistics, Z(θ) is called the partition function, A(θ) is called the log partition function or cumulant function, and h(x) is a scaling constant, often 1. If φ(x) = x, we say it is a natural exponential family.
Equation 2.112 can be generalized by writing

    p(x|φ) = h(x) exp[η(φ)ᵀ φ(x) − A(η(φ))]    (2.115)

where η is a function that maps the parameters φ to the canonical parameters η = η(φ). If dim(φ) < dim(η(φ)), it is called a curved exponential family, which means we have more sufficient statistics than parameters. If η(φ) = φ, the model is said to be in canonical form. We will assume models are in canonical form unless we state otherwise.
In many cases, however, we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible (Jeffreys, 1946; Box and Tiao, 1973; Bernardo and Smith, 1994).
Suppose we have a distribution p(x|λ) governed by a parameter λ; we might be tempted to propose a prior distribution p(λ) = const as a suitable prior. Improper priors are ones in which the domain is unbounded and the prior distribution cannot be correctly normalized because the integral over λ diverges. A second difficulty arises from the transformation behaviour of a probability density under a nonlinear change of variables. For example, if λ = η², then

    p_η(η) = p_λ(λ) |dλ/dη| = 2η p_λ(η²)    (2.116)

If the density has the form

    p(x|μ) = f(x − μ)    (2.117)

then the parameter μ is known as a location parameter. This family of densities exhibits translation invariance, because if we shift x by a constant to give x̂ = x + c, then

    p(x̂|μ̂) = f(x̂ − μ̂)    (2.118)

where μ̂ = μ + c.
A second example:

    p(x|σ) = (1/σ) f(x/σ)    (2.119)

where σ > 0. Note that this will be a normalized density provided f(x) is correctly normalized. The parameter σ is known as a scale parameter, and the density exhibits scale invariance.
2.6.5 Examples
2.6.5.1 Bernoulli
The Bernoulli for x ∈ {0, 1} can be written in exponential family form as follows:

    Ber(x|μ) = μˣ (1 − μ)^{1−x}
             = exp[x log μ + (1 − x) log(1 − μ)]    (2.120)

where φ(x) = (I(x = 0), I(x = 1)) and θ = (log μ, log(1 − μ)).
However, this representation is over-complete since 1ᵀφ(x) = I(x = 0) + I(x = 1) = 1. Consequently θ is not uniquely identifiable. It is common to require that the representation be minimal, which means there is a unique θ associated with the distribution. In this case, we can just write

    Ber(x|μ) = (1 − μ) exp[x log(μ/(1 − μ))]    (2.121)

where now φ(x) = x and θ = log(μ/(1 − μ)), the log-odds ratio.
We can recover the mean parameter μ from the canonical parameter θ using

    μ = sigm(θ) = 1/(1 + e^{−θ})    (2.122)
2.6.5.2 Multinoulli

We can recover the mean parameters from the canonical parameters using

    μₖ = e^{θₖ} / (1 + ∑_{j=1}^{K−1} e^{θⱼ})    (2.126)

    μ_K = 1 − (∑_{j=1}^{K−1} e^{θⱼ}) / (1 + ∑_{j=1}^{K−1} e^{θⱼ}) = 1 / (1 + ∑_{j=1}^{K−1} e^{θⱼ})    (2.127)

and hence

    A(θ) = −log μ_K = log(1 + ∑_{j=1}^{K−1} e^{θⱼ})    (2.128)
2.6.5.3 Univariate Gaussian

The univariate Gaussian can be written in exponential family form, where

    θ = (μ/σ², −1/(2σ²))ᵀ    (2.130)
    φ(x) = (x, x²)ᵀ    (2.131)
    Z(μ, σ²) = √(2π) σ exp(μ²/(2σ²))    (2.132)

2.6.5.4 Non-examples
2.6.5.4 Non-examples
Not all distributions of interest belong to the exponential family. For example, the uniform distribution, X ∼ U(a, b), does not, since the support of the distribution depends on the parameters. Also, the Student t distribution (Section TODO) does not belong, since it does not have the required form.
An important property of the exponential family is that derivatives of the log partition function can be used to generate
cumulants of the sufficient statistics.7 For this reason, A() is sometimes called a cumulant function. We will prove
this for a 1-parameter distribution; this can be generalized to a K-parameter distribution in a straightforward way. For
the first derivative we have
    dA/dθ = d/dθ { log ∫ exp[θ φ(x)] h(x) dx }
          = (d/dθ ∫ exp[θ φ(x)] h(x) dx) / (∫ exp[θ φ(x)] h(x) dx)
          = (∫ φ(x) exp[θ φ(x)] h(x) dx) / exp(A(θ))
          = ∫ φ(x) exp[θ φ(x) − A(θ)] h(x) dx
          = ∫ φ(x) p(x) dx = E[φ(x)]    (2.133)

For the second derivative we have

    d²A/dθ² = ∫ φ(x) exp[θ φ(x) − A(θ)] h(x) [φ(x) − A′(θ)] dx
            = ∫ φ(x) p(x) [φ(x) − A′(θ)] dx
            = ∫ φ²(x) p(x) dx − A′(θ) ∫ φ(x) p(x) dx
            = E[φ²(x)] − E[φ(x)]² = var[φ(x)]

since A′(θ) = E[φ(x)]. In the multivariate case, we have that

    ∂²A/∂θᵢ∂θⱼ = E[φᵢ(x) φⱼ(x)] − E[φᵢ(x)] E[φⱼ(x)]    (2.135)

and hence

    ∇²A(θ) = cov[φ(x)]    (2.136)

7 The first and second cumulants of a distribution are its mean E[X] and variance var[X], whereas the first and second moments are its mean E[X] and E[X²].
Since the covariance is positive definite, we see that A(θ) is a convex function (see Section B.1).
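As a quick check of Equation 2.133 for the Bernoulli family in its minimal form (θ = log(μ/(1 − μ)), φ(x) = x, A(θ) = log(1 + e^θ); plain Python, written for this note):

import math

theta = 0.7
# Analytic derivative of A(theta) = log(1 + e^theta):
dA = math.exp(theta) / (1.0 + math.exp(theta))
# Mean of the sufficient statistic phi(x) = x under Ber(mu), with mu = sigm(theta):
mu = 1.0 / (1.0 + math.exp(-theta))
print(abs(dA - mu) < 1e-12)   # True: dA/dtheta = E[phi(x)]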
The Pitman-Koopman-Darmois theorem states that, under certain regularity conditions, the exponential family
is the only family of distributions with finite sufficient statistics. (Here, finite means of a size independent of the size
of the data set.)
One of the conditions required in this theorem is that the support of the distribution not be dependent on the
parameter.
TODO
2.6.8.1 Likelihood
Given a multivariate random variable or random vector⁸ X ∈ Rᴰ, the joint probability distribution⁹ is a probability distribution that gives the probability that each of X1, X2, ..., XD falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.
The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in
terms of a joint probability density function (in the case of continuous variables) or joint probability mass function
(in the case of discrete variables).
Definition 2.6. The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related. Covariance is defined as

    cov[X, Y] ≜ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

Definition 2.7. If X is a D-dimensional random vector, its covariance matrix is defined to be the following symmetric, positive definite matrix:

    cov[x] ≜ E[(x − E[x])(x − E[x])ᵀ]    (2.140)
           = [ var[x1]       Cov[x1, x2]   ...   Cov[x1, xD]
               Cov[x2, x1]   var[x2]       ...   Cov[x2, xD]
               ...           ...           ...   ...
               Cov[xD, x1]   Cov[xD, x2]   ...   var[xD]   ]    (2.141)

The (Pearson) correlation coefficient between X and Y is defined as

    corr[X, Y] ≜ Cov[X, Y] / √(var[X] var[Y])    (2.142)

A correlation matrix has the form
8 http://en.wikipedia.org/wiki/Multivariate_random_variable
9 http://en.wikipedia.org/wiki/Joint_probability_distribution
    R ≜ [ corr[X1, X1]   corr[X1, X2]   ...   corr[X1, XD]
          corr[X2, X1]   corr[X2, X2]   ...   corr[X2, XD]
          ...            ...            ...   ...
          corr[XD, X1]   corr[XD, X2]   ...   corr[XD, XD]   ]    (2.143)
The correlation coefficient can be viewed as a degree of linearity between X and Y; see Figure 2.7.
Fig. 2.7: Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero. Source: http://en.wikipedia.org/wiki/Correlation
Uncorrelated does not imply independent. For example, let X ∼ U(−1, 1) and Y = X². Clearly Y is dependent on X (in fact, Y is uniquely determined by X), yet one can show that corr[X, Y] = 0. Some striking examples of this fact
are shown in Figure 2.7. This shows several data sets where there is clear dependence between X and Y , and yet the
correlation coefficient is 0. A more general measure of dependence between random variables is mutual information,
see Section TODO.
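The following minimal sketch (my own illustration, assuming NumPy; not from the text) checks the X ∼ U(−1, 1), Y = X² example by Monte Carlo: the sample correlation is close to zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)   # X ~ U(-1, 1)
y = x**2                                   # Y is completely determined by X

# Pearson correlation coefficient of (X, Y)
r = np.corrcoef(x, y)[0, 1]
print(r)   # ~ 0: uncorrelated, yet clearly not independent
```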
The multivariate Gaussian or multivariate normal(MVN) is the most widely used joint probability density function
for continuous variables. We discuss MVNs in detail in Chapter 4; here we just give some definitions and plots.
The pdf of the MVN in D dimensions is defined by the following:
N(x|μ, Σ) ≜ 1/((2π)^{D/2} |Σ|^{1/2}) exp[ −½ (x − μ)^T Σ^{−1} (x − μ) ]    (2.144)
where μ = E[x] ∈ ℝ^D is the mean vector, and Σ = Cov[x] is the D×D covariance matrix. The normalization constant (2π)^{D/2}|Σ|^{1/2} just ensures that the pdf integrates to 1.
Figure 2.8 plots some MVN densities in 2d for three different kinds of covariance matrices. A full covariance matrix has D(D + 1)/2 parameters (we divide by 2 since Σ is symmetric). A diagonal covariance matrix has D parameters, and has 0s in the off-diagonal terms. A spherical or isotropic covariance, Σ = σ²I_D, has one free parameter.
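As a concrete sketch of these three parameterizations (my own example, assuming SciPy is available; the matrices below are hypothetical), we can evaluate the MVN pdf of Equation 2.144 under a full, diagonal, and spherical covariance:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
full      = np.array([[2.0, 1.8], [1.8, 2.0]])   # D(D+1)/2 = 3 free parameters
diagonal  = np.diag([2.0, 1.0])                  # D = 2 free parameters
spherical = 1.5 * np.eye(2)                      # 1 free parameter (sigma^2)

x = np.array([1.0, -0.5])
for name, Sigma in [("full", full), ("diag", diagonal), ("spherical", spherical)]:
    # Density of x under N(mu, Sigma)
    print(name, multivariate_normal.pdf(x, mean=mu, cov=Sigma))
```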
Fig. 2.8: We show the level sets for 2d Gaussians. (a) A full covariance matrix has elliptical contours.(b) A diagonal
covariance matrix is an axis aligned ellipse. (c) A spherical covariance matrix has a circular shape. (d) Surface plot for
the spherical Gaussian in (c).
A more robust alternative to the MVN is the multivariate Student t-distribution, whose pdf is given by
T(x|μ, Σ, ν) ≜ [Γ(ν/2 + D/2) / Γ(ν/2)] · |Σ|^{−1/2} / (ν^{D/2} π^{D/2}) · [1 + (1/ν)(x − μ)^T Σ^{−1} (x − μ)]^{−(ν+D)/2}    (2.145)
             = [Γ(ν/2 + D/2) / Γ(ν/2)] · |πV|^{−1/2} · [1 + (x − μ)^T V^{−1} (x − μ)]^{−(ν+D)/2}    (2.146)
where Σ is called the scale matrix (since it is not exactly the covariance matrix) and V = νΣ. This has fatter tails than a Gaussian. The smaller ν is, the fatter the tails. As ν → ∞, the distribution tends towards a Gaussian. The distribution has the following properties
mean = μ,   mode = μ,   Cov = ν/(ν − 2) Σ    (2.147)
A multivariate generalization of the beta distribution is the Dirichlet distribution, which has support over the probability simplex, defined by
S_K = { x : 0 ≤ x_k ≤ 1, Σ_{k=1}^K x_k = 1 }    (2.148)
The pdf is defined as follows:
Dir(x|α) ≜ 1/B(α) ∏_{k=1}^K x_k^{α_k − 1} I(x ∈ S_K)    (2.149)
where B(α) is the multivariate beta function,
B(α) ≜ ∏_{k=1}^K Γ(α_k) / Γ(α_0),   where α_0 ≜ Σ_{k=1}^K α_k    (2.150)
Figure 2.9 shows some plots of the Dirichlet when K = 3, and Figure 2.10 some sampled probability vectors. We see that α_0 controls the strength of the distribution (how peaked it is), and the α_k control where the peak occurs. For example, Dir(1, 1, 1) is a uniform distribution, Dir(2, 2, 2) is a broad distribution centered at (1/3, 1/3, 1/3), and Dir(20, 20, 20) is a narrow distribution centered at (1/3, 1/3, 1/3). If α_k < 1 for all k, we get spikes at the corners of the simplex.
For future reference, the distribution has these properties
E[x_k] = α_k/α_0,   mode[x_k] = (α_k − 1)/(α_0 − K),   var[x_k] = α_k(α_0 − α_k) / (α_0²(α_0 + 1))    (2.151)
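These moments are easy to check by simulation. The following sketch (my own, assuming NumPy) draws Dirichlet samples and compares the empirical mean and variance to Equation 2.151:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 2.0])
alpha0 = alpha.sum()

samples = rng.dirichlet(alpha, size=200_000)   # each row lies on the simplex S_K

# Compare empirical moments with E[x_k] = alpha_k / alpha_0 and
# var[x_k] = alpha_k (alpha_0 - alpha_k) / (alpha_0^2 (alpha_0 + 1))
print(samples.mean(axis=0), alpha / alpha0)
print(samples.var(axis=0), alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1)))
```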
If X ∼ p() is some random variable, and Y = f(X), what is the distribution of Y? This is the question we address in
this section.
Fig. 2.9: (a) The Dirichlet distribution when K = 3 defines a distribution over the simplex, which can be represented by the triangular surface. Points on this surface satisfy 0 ≤ θ_k ≤ 1 and Σ_{k=1}^K θ_k = 1. (b) Plot of the Dirichlet density when α = (2, 2, 2). (c) α = (20, 2, 2).
Fig. 2.10: Samples from a 5-dimensional symmetric Dirichlet distribution for different parameter values.
If X is a discrete rv, we can derive the pmf for Y by simply summing up the probability mass for all the x's such that f(x) = y:
p_Y(y) = Σ_{x: f(x) = y} p_X(x)    (2.155)
If X is continuous, we cannot use Equation 2.155 since p_X(x) is a density, not a pmf, and we cannot sum up densities. Instead, we work with cdfs, and write
F_Y(y) = P(Y ≤ y) = P(f(X) ≤ y) = ∫_{f(x) ≤ y} f_X(x) dx    (2.156)
Differentiating, we get
f_Y(y) = f_X(x) |dx/dy|    (2.157)
This is called the change of variables formula. We leave the proof of this as an exercise.
For example, suppose X ∼ U(−1, 1), and Y = X². Then p_Y(y) = ½ y^{−1/2}.
Let f be a function f : ℝⁿ → ℝⁿ, and let y = f(x). Then its Jacobian matrix J is given by
J_{x→y} ≜ ∂y/∂x = [ ∂y₁/∂x₁ ⋯ ∂y₁/∂x_n
                    ⋮        ⋱  ⋮
                    ∂y_n/∂x₁ ⋯ ∂y_n/∂x_n ]    (2.158)
|det(J )| measures how much a unit cube changes in volume when we apply f .
If f is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse
mapping y x:
p_y(y) = p_x(x) |det(∂x/∂y)| = p_x(x) |det(J_{y→x})|    (2.159)
Given N random variables X₁, X₂, ..., X_N that are independent and identically distributed¹⁰ (iid for short), each with mean μ and variance σ², the central limit theorem states that
(Σ_{i=1}^N X_i − Nμ) / (σ√N) → N(0, 1)    (2.160)
This can also be written as
(X̄ − μ) / (σ/√N) → N(0, 1),   where X̄ ≜ (1/N) Σ_{i=1}^N X_i    (2.161)
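A quick simulation (my own sketch, assuming NumPy) shows the effect: standardized means of uniform samples behave like draws from N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 100_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)   # mean and std of U(0, 1)

x = rng.uniform(0.0, 1.0, size=(trials, N))
z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(N))   # standardized sample means

# If the CLT holds, z should be approximately N(0, 1)
print(z.mean(), z.std())   # ~ 0 and ~ 1
```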
In general, computing the distribution of a function of an rv using the change of variables formula can be difficult. One simple but powerful alternative is as follows. First we generate S samples from the distribution, call them x₁, ..., x_S. (There are many ways to generate such samples; one popular method, for high dimensional distributions, is called Markov chain Monte Carlo or MCMC; this will be explained in Chapter TODO.) Given the samples, we can approximate the distribution of f(X) by using the empirical distribution of {f(x_s)}_{s=1}^S. This is called a Monte Carlo approximation¹¹, named after a city in Europe known for its plush gambling casinos.
10 http://en.wikipedia.org/wiki/Independent_identically_distributed
We can use Monte Carlo to approximate the expected value of any function of a random variable. We simply draw samples, and then compute the arithmetic mean of the function applied to the samples. This can be written as follows:
E[g(X)] = ∫ g(x) p(x) dx ≈ (1/S) Σ_{s=1}^S g(x_s)    (2.162)
where x_s ∼ p(X).
This is called Monte Carlo integration12 , and has the advantage over numerical integration (which is based on
evaluating the function at a fixed grid of points) that the function is only evaluated in places where there is non-
negligible probability.
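As a small illustration of Equation 2.162 (my own sketch, assuming NumPy; the test functions are arbitrary choices), we approximate two expectations under a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 500_000

# Approximate E[g(X)] for X ~ N(0, 1) and g(x) = x**2 (exact value is 1)
x = rng.standard_normal(S)
print(np.mean(x**2))

# Same idea for g(x) = cos(x); the exact answer is exp(-1/2)
print(np.mean(np.cos(x)), np.exp(-0.5))
```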
2.11.1 Entropy
The entropy of a random variable X with distribution p, denoted by H(X) or sometimes H(p), is a measure of its
uncertainty. In particular, for a discrete variable with K states, it is defined by
H(X) ≜ −Σ_{k=1}^K p(X = k) log₂ p(X = k)    (2.163)
Usually we use log base 2, in which case the units are called bits (short for binary digits). If we use log base e, the units are called nats.
The discrete distribution with maximum entropy is the uniform distribution (see Section XXX for a proof). Hence
for a K-ary random variable, the entropy is maximized if p(x = k) = 1/K; in this case, H(X) = log2 K.
Conversely, the distribution with minimum entropy (which is zero) is any delta-function that puts all its mass on
one state. Such a distribution has no uncertainty.
2.11.2 KL divergence
One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy. This is defined as follows:
KL(p||q) ≜ Σ_x p(x) log₂ [p(x)/q(x)]    (2.164)
where the sum gets replaced by an integral for pdfs¹³. The KL divergence is only defined if P and Q both sum to 1 and if q(x) = 0 implies p(x) = 0 for all x (absolute continuity). If the quantity 0 ln 0 appears in the formula, it is interpreted as zero because lim_{x→0} x ln x = 0. We can rewrite this as
KL(p||q) = Σ_x p(x) log p(x) − Σ_x p(x) log q(x) = −H(p) + H(p, q)
where H(p, q) ≜ −Σ_x p(x) log q(x) is called the cross entropy.
11 http://en.wikipedia.org/wiki/Monte_Carlo_method
12 http://en.wikipedia.org/wiki/Monte_Carlo_integration
13 The KL divergence is not a distance, since it is asymmetric. One symmetric version of the KL divergence is the Jensen-Shannon divergence, defined as JS(p₁, p₂) ≜ ½KL(p₁||q) + ½KL(p₂||q), where q = ½(p₁ + p₂).
One can show (Cover and Thomas 2006) that the cross entropy is the average number of bits needed to encode
data coming from a source with distribution p when we use model q to define our codebook. Hence the regular
entropy H(p) = H(p, p), defined in Section 2.11.1, is the expected number of bits if we use the true model, so the
KL divergence is the difference between these. In other words, the KL divergence is the average number of extra bits
needed to encode the data, due to the fact that we used distribution q to encode the data instead of the true distribution
p.
The extra number of bits interpretation should make it clear that KL(p||q) ≥ 0, and that the KL is only equal to
zero if q = p. We now give a proof of this important result.
One important consequence of this result is that the discrete distribution with the maximum entropy is the uniform
distribution.
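The following short sketch (my own illustration, assuming NumPy) computes the entropy, cross entropy, and KL divergence of two small discrete distributions, and checks the decomposition KL(p||q) = H(p, q) − H(p):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 is treated as 0
    return -np.sum(p * np.log2(p))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # assumes q > 0 wherever p > 0 (absolute continuity)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])

print(entropy(q))                      # uniform distribution maximizes entropy: log2(3)
print(kl(p, q), kl(q, p))              # both >= 0, and asymmetric
cross = -np.sum(p * np.log2(q))        # cross entropy H(p, q)
print(entropy(p) + kl(p, q), cross)    # KL(p||q) = H(p, q) - H(p)
```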
We have I(X; Y) ≥ 0, with equality iff p(X, Y) = p(X)p(Y). That is, the MI is zero iff the variables are independent.
To gain insight into the meaning of MI, it helps to re-express it in terms of joint and conditional entropies. One can show that the above expression is equivalent to the following:
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)
where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X, Y)
is the joint entropy of X and Y , see Fig. 2.1114 .
Intuitively, we can interpret the MI between X and Y as the reduction in uncertainty about X after observing Y , or,
by symmetry, the reduction in uncertainty about Y after observing X.
A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) x and y, this is defined as
PMI(x, y) ≜ log [ p(x, y) / (p(x) p(y)) ]
14 http://en.wikipedia.org/wiki/Mutual_information
Fig. 2.11: Individual H(X), H(Y ), joint H(X,Y ), and conditional entropies for a pair of correlated subsystems X,Y
with mutual information I(X;Y ).
This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI of X and Y is just the expected value of the PMI. Interestingly, we can rewrite the PMI as follows:
PMI(x, y) = log [p(x|y)/p(x)] = log [p(y|x)/p(y)]    (2.173)
This is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, updating the prior p(y) into the posterior p(y|x).
Chapter 3
Generative models for discrete data
p(y = c|x, θ) = p(y = c|θ) p(x|y = c, θ) / Σ_{c′} p(y = c′|θ) p(x|y = c′, θ)    (3.1)
This is called a generative classifier, since it specifies how to generate the data using the class conditional density p(x|y = c) and the class prior p(y = c). An alternative approach is to directly fit the class posterior, p(y = c|x); this is known as a discriminative classifier.
Psychological research has shown that people can learn concepts from positive examples alone (Xu and Tenenbaum
2007).
We can think of learning the meaning of a word as equivalent to concept learning, which in turn is equivalent to binary classification. To see this, define f(x) = 1 if x is an example of the concept C, and f(x) = 0 otherwise. Then the goal is to learn the indicator function f, which just defines which elements are in the set C.
3.2.1 Likelihood
p(D|h) = (1/size(h))^N = (1/|h|)^N    (3.2)
This crucial equation embodies what Tenenbaum calls the size principle, which means the model favours the simplest (smallest) hypothesis consistent with the data. This is more commonly known as Occam's razor¹⁵.
3.2.2 Prior
The prior is decided by humans, not machines, so it is subjective. The subjectivity of the prior is controversial. For example, a child and a math professor will reach different answers. In fact, they presumably not only have different priors, but also different hypothesis spaces. However, we can finesse that by defining the hypothesis space of the child and the math professor to be the same, and then setting the child's prior weight to be zero on certain advanced concepts.
Thus there is no sharp distinction between the prior and the hypothesis space.
However, the prior is the mechanism by which background knowledge can be brought to bear on a problem. Without
this, rapid learning (i.e., from small sample sizes) is impossible.
3.2.3 Posterior
15 http://en.wikipedia.org/wiki/Occam%27s_razor
p(h|D) = p(D|h)p(h) / Σ_{h′∈H} p(D|h′)p(h′) = I(D ∈ h)p(h) / Σ_{h′∈H} I(D ∈ h′)p(h′)    (3.3)
where I(D ∈ h) is 1 iff (if and only if) all the data are in the extension of the hypothesis h.
In general, when we have enough data, the posterior p(h|D) becomes peaked on a single concept, namely the MAP estimate, i.e.,
p(h|D) → δ_{ĥ^MAP}(h)    (3.4)
where ĥ^MAP ≜ arg max_h p(h|D) is the posterior mode.
Since the likelihood term depends exponentially on N, and the prior stays constant, as we get more and more data, the MAP estimate converges towards the maximum likelihood estimate or MLE, ĥ^MLE ≜ arg max_h p(D|h).
In other words, if we have enough data, we see that the data overwhelms the prior.
The concept of posterior predictive distribution16 is normally used in a Bayesian context, where it makes use of
the entire posterior distribution of the parameters given the observed data to yield a probability distribution over an
interval rather than simply a point estimate.
p(x|D) ≜ E_{h|D}[p(x|h)] = { Σ_h p(x|h) p(h|D)        (discrete hypothesis space)
                             ∫ p(x|h) p(h|D) dh        (continuous hypothesis space)    (3.7)
This is just a weighted average of the predictions of each individual hypothesis and is called Bayes model averaging (Hoeting et al. 1999).
3.3.1 Likelihood
3.3.2 Prior
16 http://en.wikipedia.org/wiki/Posterior_predictive_distribution
3.3.3 Posterior
This makes Bayesian inference particularly well-suited to online learning, as we will see later.
The mean and mode are point estimates, but it is useful to know how much we can trust them. The variance of the
posterior is one way to measure this. The variance of the Beta posterior is given by
var(θ|D) = (a + N₁)(b + N₀) / [ (a + N₁ + b + N₀)² (a + N₁ + b + N₀ + 1) ]    (3.14)
For N ≫ a, b this simplifies to
var(θ|D) ≈ N₁N₀ / (N·N·N) = θ̂_MLE(1 − θ̂_MLE) / N    (3.15)
So far, we have been focusing on inference of the unknown parameter(s). Let us now turn our attention to prediction
of future observable data.
Consider predicting the probability of heads in a single future trial under a Beta(a, b) posterior. We have
p(x|D) = ∫₀¹ p(x|θ) p(θ|D) dθ = ∫₀¹ θ Beta(θ|a, b) dθ = E[θ|D] = a/(a + b)    (3.16)
Let us now derive a simple Bayesian solution to the problem. We will use a uniform prior, so a = b = 1. In this case, plugging in the posterior mean gives Laplace's rule of succession
p(x|D) = (N₁ + 1) / (N₁ + N₀ + 2)    (3.17)
This justifies the common practice of adding 1 to the empirical counts, normalizing and then plugging them in, a
technique known as add-one smoothing. (Note that plugging in the MAP parameters would not have this smoothing
effect, since the mode becomes the MLE if a = b = 1, see Section 3.3.3.1.)
Suppose now we were interested in predicting the number of heads, x, in M future trials. This is given by
p(x|D) = ∫₀¹ Bin(x|M, θ) Beta(θ|a, b) dθ    (3.18)
       = (M choose x) 1/B(a, b) ∫₀¹ θ^x (1 − θ)^{M−x} θ^{a−1} (1 − θ)^{b−1} dθ    (3.19)
We recognize the integral as the normalization constant for a Beta(a + x, M − x + b) distribution. Hence
∫₀¹ θ^x (1 − θ)^{M−x} θ^{a−1} (1 − θ)^{b−1} dθ = B(x + a, M − x + b)    (3.20)
Thus we find that the posterior predictive is given by the following, known as the (compound) beta-binomial distribution:
Bb(x|a, b, M) ≜ (M choose x) B(x + a, M − x + b) / B(a, b)    (3.21)
This distribution has the following mean and variance
mean = M a/(a + b),   var = M ab/(a + b)² · (a + b + M)/(a + b + 1)    (3.22)
This process is illustrated in Figure 3.1. We start with a Beta(2, 2) prior, and plot the posterior predictive density after seeing N₁ = 3 heads and N₀ = 17 tails. Figure 3.1(b) plots a plug-in approximation using a MAP estimate. We see that the Bayesian prediction has longer tails, spreading its probability mass more widely, and is therefore less prone to overfitting and black-swan type paradoxes.
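The sketch below (my own illustration, assuming SciPy; the Beta(2, 2) prior and the counts N₁ = 3, N₀ = 17 come from the scenario above, everything else is an assumption) compares the beta-binomial predictive of Equation 3.21 with the plug-in approximation based on the MAP estimate:

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.stats import binom

def log_nchoosek(M, x):
    return gammaln(M + 1) - gammaln(x + 1) - gammaln(M - x + 1)

def beta_binom_pmf(x, a, b, M):
    # Posterior predictive Bb(x | a, b, M) = C(M, x) B(x + a, M - x + b) / B(a, b)
    return np.exp(log_nchoosek(M, x) + betaln(x + a, M - x + b) - betaln(a, b))

a, b = 2 + 3, 2 + 17          # Beta(2, 2) prior updated with N1 = 3, N0 = 17
M = 10
x = np.arange(M + 1)

bayes = beta_binom_pmf(x, a, b, M)
theta_map = (a - 1) / (a + b - 2)          # MAP estimate of theta
plugin = binom.pmf(x, M, theta_map)        # plug-in binomial predictive

print(np.round(bayes, 3))    # heavier tails
print(np.round(plugin, 3))   # more concentrated
```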
Fig. 3.1: (a) Posterior predictive distributions after seeing N1 = 3, N0 = 17. (b) MAP estimation.
In the previous section, we discussed how to infer the probability that a coin comes up heads. In this section, we generalize these results to infer the probability that a die with K sides comes up as face k.
3.4.1 Likelihood
Suppose we observe N dice rolls, D = {x₁, x₂, ..., x_N}, where x_i ∈ {1, 2, ..., K}. The likelihood has the form
p(D|θ) = (N choose N₁ ⋯ N_K) ∏_{k=1}^K θ_k^{N_k},   where N_k = Σ_{i=1}^N I(x_i = k)    (3.23)
3.4.2 Prior
Dir(θ|α) = 1/B(α) ∏_{k=1}^K θ_k^{α_k − 1} I(θ ∈ S_K)    (3.24)
3.4.3 Posterior
The posterior predictive distribution for a single multinoulli trial is given by the following expression:
p(X = j|D) = ∫ p(X = j|θ) p(θ|D) dθ    (3.30)
           = ∫ θ_j [ ∫ p(θ_j, θ_{−j}|D) dθ_{−j} ] dθ_j    (3.31)
           = ∫ θ_j p(θ_j|D) dθ_j = E[θ_j|D] = (α_j + N_j)/(α_0 + N)    (3.32)
where θ_{−j} are all the components of θ except θ_j.
The above expression avoids the zero-count problem. In fact, this form of Bayesian smoothing is even more im-
portant in the multinomial case than the binary case, since the likelihood of data sparsity increases once we start
partitioning the data into many categories.
Assume the features are conditionally independent given the class label, then the class conditional density has the
following form
p(x|y = c, θ) = ∏_{j=1}^D p(x_j|y = c, θ_jc)    (3.33)
Obviously we can handle other kinds of features, or use different distributional assumptions. Also, it is easy to mix
and match features of different types.
3.5.1 Optimization
We now discuss how to train a naive Bayes classifier. This usually means computing the MLE or the MAP estimate
for the parameters. However, we will also discuss how to compute the full posterior, p(|D).
θ̂_jc = N_jc / N_c    (3.37)
θ̂_jc = (N_{j1c}/N_c, N_{j2c}/N_c, ..., N_{jS_jc}/N_c)^T    (3.38)
where N_jkc ≜ Σ_{i=1}^N I(x_ij = a_jk, y_i = c) is the number of times that feature x_j takes value a_jk in class c.
We can then estimate the parameters using MLE or MAP, and obtain the posterior predictive density by simply plugging in the point estimates (MLE or MAP).
Alternatively, we can use BMA, and integrate out the unknown parameters.
When using generative classifiers of any kind, computing the posterior over class labels using Equation 3.1 can fail due to numerical underflow. The problem is that p(x|y = c) is often a very small number, especially if x is a high-dimensional vector. This is because we require that Σ_x p(x|y) = 1, so the probability of observing any particular high-dimensional vector is small. The obvious solution is to take logs when applying Bayes rule, as follows:
log p(y = c|x, θ) = b_c − log ( Σ_{c′} e^{b_{c′}} ),   where b_c ≜ log p(x|y = c, θ) + log p(y = c|θ)    (3.40)
In general, we have
log Σ_{c′} e^{b_{c′}} = log [ ( Σ_{c′} e^{b_{c′} − B} ) e^B ] = [ log ( Σ_{c′} e^{b_{c′} − B} ) ] + B    (3.42)
where B ≜ max_c b_c.
This is called the log-sum-exp trick, and is widely used.
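The following minimal sketch (my own, assuming NumPy; the log scores are made-up values) implements the trick and contrasts it with the naive computation, which underflows:

```python
import numpy as np

def log_sum_exp(b):
    """Compute log(sum(exp(b))) while avoiding numerical underflow/overflow."""
    B = np.max(b)
    return np.log(np.sum(np.exp(b - B))) + B

# Unnormalized log posteriors b_c = log p(x|y=c) + log p(y=c); very negative here
b = np.array([-1000.0, -1001.0, -1005.0])

log_post = b - log_sum_exp(b)          # normalized log p(y=c|x)
print(np.exp(log_post))                # sums to 1, no underflow
print(np.exp(b) / np.sum(np.exp(b)))   # naive version: 0/0 = nan
```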
Since an NBC is fitting a joint distribution over potentially many features, it can suffer from overfitting. In addition,
the run-time cost is O(D), which may be too high for some applications.
One common approach to tackling both of these problems is to perform feature selection, to remove irrelevant
features that do not help much with the classification problem. The simplest approach to feature selection is to evaluate
the relevance of each feature separately, and then take the top K, where K is chosen based on some tradeoff between
accuracy and complexity. This approach is known as variable ranking, filtering, or screening.
One way to measure relevance is to use the mutual information (Section 2.11.3) between feature X_j and the class label Y:
I(X_j, Y) = Σ_{x_j} Σ_y p(x_j, y) log [ p(x_j, y) / (p(x_j) p(y)) ]    (3.43)
If the features are binary, it is easy to show that the MI can be computed as follows
I_j = Σ_c [ θ_jc π_c log(θ_jc/θ_j) + (1 − θ_jc) π_c log((1 − θ_jc)/(1 − θ_j)) ]    (3.44)
where π_c = p(y = c), θ_jc = p(x_j = 1|y = c), and θ_j = p(x_j = 1) = Σ_c π_c θ_jc.
Document classification is the problem of classifying text documents into different categories.
One simple approach is to represent each document as a binary vector, which records whether each word is present
or not, so xi j = 1 iff word j occurs in document i, otherwise xi j = 0. We can then use the following class conditional
density:
p(x_i|y_i = c, θ) = ∏_{j=1}^D Ber(x_ij|θ_jc) = ∏_{j=1}^D θ_jc^{x_ij} (1 − θ_jc)^{1 − x_ij}    (3.45)
This is called the Bernoulli product model, or the binary independence model.
However, ignoring the number of times each word occurs in a document loses some information (McCallum and
Nigam 1998). A more accurate representation counts the number of occurrences of each word. Specifically, let x_i be a vector of counts for document i, so x_ij ∈ {0, 1, ..., N_i}, where N_i is the number of terms in document i (so Σ_{j=1}^D x_ij = N_i). For the class conditional densities, we can use a multinomial distribution:
p(x_i|y_i = c, θ) = Mu(x_i|N_i, θ_c) = [ N_i! / ∏_{j=1}^D x_ij! ] ∏_{j=1}^D θ_jc^{x_ij}    (3.46)
where we have implicitly assumed that the document length N_i is independent of the class. Here θ_jc is the probability of generating word j in documents of class c; these parameters satisfy the constraint that Σ_{j=1}^D θ_jc = 1 for each class c.
Although the multinomial classifier is easy to train and easy to use at test time, it does not work particularly well
for document classification. One reason for this is that it does not take into account the burstiness of word usage. This
refers to the phenomenon that most words never appear in any given document, but if they do appear once, they are
likely to appear more than once, i.e., words occur in bursts.
The multinomial model cannot capture the burstiness phenomenon. To see why, note that Equation 3.46 has the form ∏_j θ_jc^{x_ij}, and since θ_jc ≪ 1 for rare words, it becomes increasingly unlikely to generate many of them. For more frequent words, the decay rate is not as fast. To see why intuitively, note that the most frequent words are function words which are not specific to the class, such as and, the, and but; the chance of the word and occurring is pretty much the same no matter how many times it has previously occurred (modulo document length), so the independence assumption is more reasonable for common words. However, since rare words are the ones that matter most for classification purposes, these are the ones we want to model the most carefully.
Various ad hoc heuristics have been proposed to improve the performance of the multinomial document classifier
(Rennie et al. 2003). We now present an alternative class conditional density that performs as well as these ad hoc
methods, yet is probabilistically sound (Madsen et al. 2005).
Suppose we simply replace the multinomial class conditional density with the Dirichlet Compound Multinomial
or DCM density, defined as follows:
p(x_i|y_i = c, α) = ∫ Mu(x_i|N_i, θ_c) Dir(θ_c|α_c) dθ_c
                  = [ N_i! / ∏_{j=1}^D x_ij! ] B(x_i + α_c) / B(α_c)    (3.47)
(This equation is derived in Equation TODO.) Surprisingly, this simple change is all that is needed to capture the burstiness phenomenon. The intuitive reason for this is as follows: after seeing one occurrence of a word, say word j, the posterior counts on θ_j get updated, making another occurrence of word j more likely. By contrast, if θ_j is fixed, then the occurrences of each word are independent. The multinomial model corresponds to drawing a ball from an urn with K colors of ball, recording its color, and then replacing it. By contrast, the DCM model corresponds to drawing a ball, recording its color, and then replacing it with one additional copy; this is called the Polya urn.
Using the DCM as the class conditional density gives much better results than using the multinomial, and has
performance comparable to state of the art methods, as described in (Madsen et al. 2005). The only disadvantage is
that fitting the DCM model is more complex; see (Minka 2000e; Elkan 2006) for the details.
Chapter 4
Gaussian Models
The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form
N(x|μ, σ²) = 1/(2πσ²)^{1/2} exp{ −(1/(2σ²)) (x − μ)² }    (4.1)
where μ is the mean and σ² is the variance. Now we will discuss the multivariate Gaussian or multivariate normal (MVN), which is the most widely used joint probability density function for continuous variables. It will form the basis for many of the models.
One important application of MVNs is to define the class conditional densities in a generative classifier, i.e.,
p(x|y = c, θ) = N(x|μ_c, Σ_c)    (4.2)
The resulting technique is called (Gaussian) discriminant analysis or GDA (even though it is a generative, not discriminative, classifier; see Section TODO for more on this distinction). If Σ_c is diagonal, this is equivalent to naive Bayes.
We can classify a feature vector using the following decision rule, derived from Equation 3.1:
ŷ(x) = arg max_c [ log p(y = c|π) + log p(x|θ_c) ]    (4.3)
When we compute the probability of x under each class conditional density, we are measuring the distance from x to the center of each class, μ_c, using Mahalanobis distance. This can be thought of as a nearest centroids classifier.
As an example, Figure 4.1 shows two Gaussian class-conditional densities in 2d, representing the height and weight
of men and women. We can see that the features are correlated, as is to be expected (tall people tend to weigh more).
The ellipses for each class contain 95% of the probability mass. If we have a uniform prior over classes, we can classify
a new test vector as follows:
ŷ = arg min_c (x − μ_c)^T Σ_c^{−1} (x − μ_c)    (4.4)
By plugging in the definition of the Gaussian density into Equation 3.1, we get
p(y = c|x, θ) = π_c |2πΣ_c|^{−1/2} exp[ −½ (x − μ_c)^T Σ_c^{−1} (x − μ_c) ] / Σ_{c′} π_{c′} |2πΣ_{c′}|^{−1/2} exp[ −½ (x − μ_{c′})^T Σ_{c′}^{−1} (x − μ_{c′}) ]    (4.5)
Thresholding this results in a quadratic function of x. The result is known as quadratic discriminant analysis(QDA).
Figure 4.2 gives some examples of what the decision boundaries look like in 2D.
Fig. 4.1: (a) Height/weight data. (b) Visualization of 2d Gaussians fit to each class. 95% of the probability mass is
inside the ellipse.
We now consider a special case in which the covariance matrices are tied or shared across classes, c = . In this
case, we can simplify Equation 4.5 as follows:
p(y = c|x, θ) ∝ π_c exp[ μ_c^T Σ^{−1} x − ½ x^T Σ^{−1} x − ½ μ_c^T Σ^{−1} μ_c ]
             = exp[ μ_c^T Σ^{−1} x − ½ μ_c^T Σ^{−1} μ_c + log π_c ] exp[ −½ x^T Σ^{−1} x ]
             ∝ exp[ μ_c^T Σ^{−1} x − ½ μ_c^T Σ^{−1} μ_c + log π_c ]    (4.6)
Since the quadratic term xT 1 x is independent of c, it will cancel out in the numerator and denominator. If we
define
Fig. 4.2: Quadratic decision boundaries in 2D for the 2 and 3 class case.
γ_c ≜ −½ μ_c^T Σ^{−1} μ_c + log π_c    (4.7)
β_c ≜ Σ^{−1} μ_c    (4.8)
then
p(y = c|x, θ) = e^{β_c^T x + γ_c} / Σ_{c′} e^{β_{c′}^T x + γ_{c′}} = σ(η, c)    (4.9)
where η ≜ (β_1^T x + γ_1, ..., β_C^T x + γ_C), and σ() is the softmax activation function¹⁷, defined as follows:
σ(q, i) ≜ exp(q_i) / Σ_{j=1}^n exp(q_j)    (4.10)
When parameterized by some constant λ > 0, the following formulation becomes a smooth, differentiable approximation of the maximum function:
S_λ(x) = Σ_{j=1}^D x_j e^{λ x_j} / Σ_{j=1}^D e^{λ x_j}    (4.11)
S_λ has the following properties:
1. S_λ → max as λ → ∞
17 http://en.wikipedia.org/wiki/Softmax_activation_function
To gain further insight into the meaning of these equations, let us consider the binary case. In this case, the posterior is given by
p(y = 1|x, θ) = e^{β_1^T x + γ_1} / (e^{β_1^T x + γ_1} + e^{β_0^T x + γ_0})    (4.12)
             = 1 / (1 + e^{(β_0 − β_1)^T x + (γ_0 − γ_1)})    (4.13)
             = sigm((β_1 − β_0)^T x + (γ_1 − γ_0))    (4.14)
If we define
w = β_1 − β_0 = Σ^{−1}(μ_1 − μ_0)    (4.17)
x_0 = ½(μ_1 + μ_0) − (μ_1 − μ_0) log(π_1/π_0) / [(μ_1 − μ_0)^T Σ^{−1} (μ_1 − μ_0)]    (4.18)
then we can write p(y = 1|x, θ) = sigm(w^T(x − x_0)).
(This is closely related to logistic regression, which we will discuss in Section TODO.) So the final decision rule is
as follows: shift x by x0 , project onto the line w, and see if the result is positive or negative.
If Σ = σ²I, then w is in the direction of μ_1 − μ_0. So we classify the point based on whether its projection is closer to μ_0 or μ_1. Furthermore, if π_1 = π_0, then x_0 = ½(μ_1 + μ_0), which is halfway between the means. If we make π_1 > π_0, then x_0 gets closer to μ_0, so more of the line belongs to class 1 a priori. Conversely if π_1 < π_0, the boundary shifts right. Thus we see that the class prior, π_c, just changes the decision threshold, and not the overall geometry, as we claimed above. (A similar argument applies in the multi-class case.)
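Here is a small sketch of the two-class shared-covariance rule of Equations 4.17-4.18 (my own illustration, assuming NumPy; the means, covariance, and priors below are made-up values):

```python
import numpy as np

# Shared-covariance (LDA) binary classifier: p(y=1|x) = sigm(w^T (x - x0))
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi0, pi1 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu0)                                         # Equation 4.17
x0 = 0.5 * (mu1 + mu0) - (mu1 - mu0) * np.log(pi1 / pi0) \
     / ((mu1 - mu0) @ Sigma_inv @ (mu1 - mu0))                      # Equation 4.18

def p_class1(x):
    return 1.0 / (1.0 + np.exp(-w @ (x - x0)))

print(p_class1(np.array([1.0, 0.5])))   # exactly 0.5: midpoint of the means (equal priors)
print(p_class1(np.array([2.5, 1.5])))   # close to 1: the point is near mu1
```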
18 http://en.wikipedia.org/wiki/Sigmoid_function
The magnitude of w determines the steepness of the logistic function, and depends on how well-separated the means
are, relative to the variance. In psychology and signal detection theory, it is common to define the discriminability of
a signal from the background noise using a quantity called d-prime:
d′ ≜ (μ_1 − μ_0) / σ    (4.20)
where μ_1 is the mean of the signal and μ_0 is the mean of the noise, and σ is the standard deviation of the noise. If d′ is large, the signal will be easier to discriminate from the noise.
The speed and simplicity of the MLE method is one of its greatest appeals. However, the MLE can badly overfit in
high dimensions. In particular, the MLE for a full covariance matrix is singular if Nc < D. And even when Nc > D, the
MLE can be ill-conditioned, meaning it is close to singular. There are several possible solutions to this problem:
Use a diagonal covariance matrix for each class, which assumes the features are conditionally independent; this is
equivalent to using a naive Bayes classifier (Section 3.5).
Use a full covariance matrix, but force it to be the same for all classes, c = . This is an example of parameter
tying or parameter sharing, and is equivalent to LDA (Section 4.1.2).
Use a diagonal covariance matrix and force it to be shared. This is called diagonal covariance LDA, and is discussed in Section TODO.
Use a full covariance matrix, but impose a prior and then integrate it out. If we use a conjugate prior, this can be
done in closed form, using the results from Section TODO; this is analogous to the Bayesian naive Bayes method
in Section 3.5.1.2. See (Minka 2000f) for details.
Fit a full or diagonal covariance matrix by MAP estimation. We discuss two different kinds of prior below.
Project the data into a low dimensional subspace and fit the Gaussians there. See Section TODO for a way to find
the best (most discriminative) linear projection.
We discuss some of these options below.
One drawback of diagonal LDA is that it depends on all of the features. In high dimensional problems, we might prefer
a method that only depends on a subset of the features, for reasons of accuracy and interpretability. One approach is to
use a screening method, perhaps based on mutual information, as in Section 3.5.4. We now discuss another approach
to this problem known as the nearest shrunken centroids classifier (Hastie et al. 2009, p652).
Given a joint distribution, p(x1 , x2 ), it is useful to be able to compute marginals p(x1 ) and conditionals p(x1 |x2 ). We
discuss how to do this below, and then give some applications. These operations take O(D3 ) time in the worst case.
See Section TODO for faster methods.
Theorem 4.1. (Marginals and conditionals of an MVN). Suppose x = (x₁, x₂) is jointly Gaussian with parameters
μ = (μ₁; μ₂),   Σ = (Σ₁₁ Σ₁₂; Σ₂₁ Σ₂₂),   Λ ≜ Σ^{−1} = (Λ₁₁ Λ₁₂; Λ₂₁ Λ₂₂)    (4.25)
Then the marginals are given by
p(x₁) = N(x₁|μ₁, Σ₁₁),   p(x₂) = N(x₂|μ₂, Σ₂₂)    (4.26)
and the posterior conditional is given by (showing one of several equivalent forms)
p(x₁|x₂) = N(x₁|μ_{1|2}, Σ_{1|2}),   μ_{1|2} = μ₁ + Σ₁₂Σ₂₂^{−1}(x₂ − μ₂),   Σ_{1|2} = Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁    (4.27)
Equation 4.27 is of such crucial importance in this book that we have put a box around it, so you can easily find it.
For the proof, see Section TODO.
We see that both the marginal and conditional distributions are themselves Gaussian. For the marginals, we just
extract the rows and columns corresponding to x1 or x2 . For the conditional, we have to do a bit more work. How-
ever, it is not that complicated: the conditional mean is just a linear function of x2 , and the conditional covariance
is just a constant matrix that is independent of x2 . We give three different (but equivalent) expressions for the poste-
rior mean, and two different (but equivalent) expressions for the posterior covariance; each one is useful in different
circumstances.
4.2.2 Examples
Below we give some examples of these equations in action, which will make them seem more intuitive.
4.2.4 Proof
Suppose we have two variables, x and y. Let x ∈ ℝ^{D_x} be a hidden variable, and y ∈ ℝ^{D_y} be a noisy observation of x. Let us assume we have the following prior and likelihood:
p(x) = N(x|μ_x, Σ_x)
p(y|x) = N(y|Wx + μ_y, Σ_y)    (4.28)
where W is a matrix of size D_y × D_x. This is an example of a linear Gaussian system. We can represent this schemat-
ically as x y, meaning x generates y. In this section, we show how to invert the arrow, that is, how to infer x
from y. We state the result below, then give several examples, and finally we derive the result. We will see many more
applications of these results in later chapters.
Theorem 4.2. (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, as in Equation 4.28, the posterior p(x|y) is given by the following:
p(x|y) = N(x|μ_{x|y}, Σ_{x|y}),   Σ_{x|y}^{−1} = Σ_x^{−1} + W^T Σ_y^{−1} W,   μ_{x|y} = Σ_{x|y} [ W^T Σ_y^{−1}(y − μ_y) + Σ_x^{−1} μ_x ]
4.3.2 Examples
4.3.3 Proof
5.1 Introduction
Using the posterior distribution to summarize everything we know about a set of unknown variables is at the core of
Bayesian statistics. In this chapter, we discuss this approach to statistics in more detail.
The posterior p(|D) summarizes everything we know about the unknown quantities . In this section, we discuss
some simple quantities that can be derived from a probability distribution, such as a posterior. These summary statistics
are often easier to understand and visualize than the full joint.
We can easily compute a point estimate of an unknown quantity by computing the posterior mean, median or mode.
In Section 5.7, we discuss how to use decision theory to choose between these methods. Typically the posterior mean
or median is the most appropriate choice for a real-valued quantity, and the vector of posterior marginals is the best
choice for a discrete quantity. However, the posterior mode, aka the MAP estimate, is the most popular choice because
it reduces to an optimization problem, for which efficient algorithms often exist. Furthermore, MAP estimation can be
interpreted in non-Bayesian terms, by thinking of the log prior as a regularizer (see Section TODO for more details).
Although this approach is computationally appealing, it is important to point out that there are various drawbacks
to MAP estimation, which we briefly discuss below. This will provide motivation for the more thoroughly Bayesian
approach which we will study later in this chapter(and elsewhere in this book).
The most obvious drawback of MAP estimation, and indeed of any other point estimate such as the posterior mean or
median, is that it does not provide any measure of uncertainty. In many applications, it is important to know how much
one can trust a given estimate. We can derive such confidence measures from the posterior, as we discuss in Section
5.2.2.
If we don't model the uncertainty in our parameters, then our predictive distribution will be overconfident. Overconfi-
dence in predictions is particularly problematic in situations where we may be risk averse; see Section 5.7 for details.
Choosing the mode as a summary of a posterior distribution is often a very poor choice, since the mode is usually quite
untypical of the distribution, unlike the mean or median. The basic problem is that the mode is a point of measure zero,
whereas the mean and median take the volume of the space into account. See Figure 5.1.
Fig. 5.1: (a) A bimodal distribution in which the mode is very untypical of the distribution. The thin blue vertical line
is the mean, which is arguably a better summary of the distribution, since it is near the majority of the probability
mass. (b) A skewed distribution in which the mode is quite different from the mean.
How should we summarize a posterior if the mode is not a good choice? The answer is to use decision theory,
which we discuss in Section 5.7. The basic idea is to specify a loss function, where L(θ, θ̂) is the loss you incur if the truth is θ and your estimate is θ̂. If we use 0-1 loss, L(θ, θ̂) = I(θ ≠ θ̂) (see Section 1.4.2.1), then the optimal estimate is the posterior mode. 0-1 loss means you only get points if you make no errors, otherwise you get nothing: there is no partial credit under this loss function! For continuous-valued quantities, we often prefer to use squared error loss, L(θ, θ̂) = (θ − θ̂)²; the corresponding optimal estimator is then the posterior mean, as we show in Section 5.7. Or we can use a more robust loss function, L(θ, θ̂) = |θ − θ̂|, which gives rise to the posterior median.
A more subtle problem with MAP estimation is that the result we get depends on how we parameterize the probability
distribution. Changing from one representation to another equivalent representation changes the result, which is not
very desirable, since the units of measurement are arbitrary (e.g., when measuring distance, we can use centimetres or
inches).
To understand the problem, suppose we compute the posterior for x. If we define y = f(x), the distribution for y is given by Equation 2.157. The |dx/dy| term is called the Jacobian, and it measures the change in size of a unit volume passed through f. Let x̂ = arg max_x p_x(x) be the MAP estimate for x. In general it is not the case that the MAP estimate for y is given by f(x̂). For example, let X ∼ N(6, 1) and y = f(x), where f(x) = 1/(1 + exp(−x + 5)).
We can derive the distribution of y using Monte Carlo simulation (see Section 2.10). The result is shown in Figure 5.2. We see that the original Gaussian has become squashed by the sigmoid nonlinearity. In particular, we see that the
mode of the transformed distribution is not equal to the transform of the original mode.
The MLE does not suffer from this since the likelihood is a function, not a probability density. Bayesian inference
does not suffer from this problem either, since the change of measure is taken into account when integrating over the
parameter space.
Fig. 5.2: Example of the transformation of a density under a nonlinear transform. Note how the mode of the trans-
formed distribution is not the transform of the original mode. Based on Exercise 1.4 of (Bishop 2006b).
In addition to point estimates, we often want a measure of confidence. A standard measure of confidence in some (scalar) quantity θ is the width of its posterior distribution. This can be measured using a 100(1 − α)% credible interval, which is a (contiguous) region C = (ℓ, u) (standing for lower and upper) which contains 1 − α of the posterior probability mass, i.e.,
C_α(D) = (ℓ, u) where P(ℓ ≤ θ ≤ u|D) = 1 − α    (5.1)
There may be many such intervals, so we choose one such that there is α/2 mass in each tail outside the interval; this is called a central interval.
If the posterior has a known functional form, we can compute the posterior central interval using ℓ = F^{−1}(α/2) and u = F^{−1}(1 − α/2), where F is the cdf of the posterior.
If we don't know the functional form, but we can draw samples from the posterior, then we can use a Monte Carlo approximation to the posterior quantiles: we simply sort the S samples, and find the one that occurs at location αS/2 along the sorted list. As S → ∞, this converges to the true quantile.
People often confuse Bayesian credible intervals with frequentist confidence intervals. However, they are not the
same thing, as we discuss in Section TODO. In general, credible intervals are usually what people want to compute,
but confidence intervals are usually what they actually compute, because most people are taught frequentist statistics
but not Bayesian statistics. Fortunately, the mechanics of computing a credible interval is just as easy as computing a
confidence interval.
Sometimes we have multiple parameters, and we are interested in computing the posterior distribution of some function
of these parameters. For example, suppose you are about to buy something from Amazon.com, and there are two sellers
offering it for the same price. Seller 1 has 90 positive reviews and 10 negative reviews. Seller 2 has 2 positive reviews and 0 negative reviews. Who should you buy from?19
On the face of it, you should pick seller 2, but we cannot be very confident that seller 2 is better since it has had
so few reviews. In this section, we sketch a Bayesian analysis of this problem. Similar methodology can be used to
compare rates or proportions across groups for a variety of other settings.
Let θ₁ and θ₂ be the unknown reliabilities of the two sellers. Since we don't know much about them, we'll endow them both with uniform priors, θ_i ∼ Beta(1, 1). The posteriors are p(θ₁|D₁) = Beta(91, 11) and p(θ₂|D₂) = Beta(3, 1). We want to compute p(θ₁ > θ₂|D). For convenience, let us define δ = θ₁ − θ₂ as the difference in the rates. (Alternatively we might want to work in terms of the log-odds ratio.) We can compute the desired quantity using numerical integration
p(δ > 0|D) = ∫₀¹ ∫₀¹ I(θ₁ > θ₂) Beta(θ₁|91, 11) Beta(θ₂|3, 1) dθ₁ dθ₂    (5.2)
We find p( > 0|D) = 0.710, which means you are better off buying from seller 1!
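The same quantity can be approximated by sampling from the two Beta posteriors instead of integrating numerically. A minimal sketch (my own, assuming NumPy; the posterior parameters come from the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 1_000_000

theta1 = rng.beta(91, 11, size=S)   # posterior samples for seller 1
theta2 = rng.beta(3, 1, size=S)     # posterior samples for seller 2

# Monte Carlo estimate of p(theta1 > theta2 | D); should be roughly 0.71
print(np.mean(theta1 > theta2))
```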
In general, when faced with a set of models (i.e., families of parametric distributions) of different complexity, how
should we choose the best one? This is called the model selection problem.
One approach is to use cross-validation to estimate the generalization error of all the candidate models, and then to
pick the model that seems the best. However, this requires fitting each model K times, where K is the number of CV
folds. A more efficient approach is to compute the posterior over models,
p(m|D) = p(D|m) p(m) / Σ_{m′} p(D|m′) p(m′)    (5.3)
From this, we can easily compute the MAP model, m = arg maxm p(m|D). This is called Bayesian model selection.
If we use a uniform prior over models, this amounts to picking the model which maximizes
p(D|m) = ∫ p(D|θ) p(θ|m) dθ    (5.4)
This quantity is called the marginal likelihood, the integrated likelihood, or the evidence for model m. The details
on how to perform this integral will be discussed in Section 5.3.2. But first we give an intuitive interpretation of what
this quantity means.
One might think that using p(D|m) to select models would always favour the model with the most parameters. This is true if we use p(D|θ̂_m) to select models, where θ̂_m is the MLE or MAP estimate of the parameters for model m, because models with more parameters will fit the data better, and hence achieve higher likelihood. However, if we integrate out the parameters, rather than maximizing them, we are automatically protected from overfitting: models with more parameters do not necessarily have higher marginal likelihood. This is called the Bayesian Occam's razor effect (MacKay 1995b; Murray and Ghahramani 2005), named after the principle known as Occam's razor, which says one should pick the simplest model that adequately explains the data.
One way to understand the Bayesian Occam's razor is to notice that the marginal likelihood can be rewritten as follows, based on the chain rule of probability (Equation 2.20):
p(D) = p(y₁) p(y₂|y₁) p(y₃|y₁:₂) ⋯ p(y_N|y₁:N−1)
This is similar to a leave-one-out cross-validation estimate (Section 1.5.0.1) of the likelihood, since we predict each
future point given all the previous ones. (Of course, the order of the data does not matter in the above expression.) If a
model is too complex, it will overfit the early examples and will then predict the remaining ones poorly.
Another way to understand the Bayesian Occam's razor effect is to note that probabilities must sum to one. Hence Σ_{D′} p(D′|m) = 1, where the sum is over all possible data sets. Complex models, which can predict many things, must spread their probability mass thinly, and hence will not obtain as large a probability for any given data set as simpler models. This is sometimes called the conservation of probability mass principle, and is illustrated in Figure
5.3.
Fig. 5.3: A schematic illustration of the Bayesian Occams razor. The broad (green) curve corresponds to a complex
model, the narrow (blue) curve to a simple model, and the middle (red) curve is just right. Based on Figure 3.13 of
(Bishop 2006a).
When using the Bayesian approach, we are not restricted to evaluating the evidence at a finite grid of values. Instead, we can use numerical optimization to find λ̂ = arg max_λ p(D|λ), where λ is the hyper-parameter being tuned. This technique is called empirical Bayes or type II maximum likelihood (see Section 5.6 for details). An example is shown in Figure TODO(b): we see that the curve has a similar shape to the CV estimate, but it can be computed more efficiently.
When inferring the parameters of a specific model, we often write p(θ|D) ∝ p(D|θ)p(θ), thus ignoring the normalization constant p(D|m). This is valid since p(D|m) is constant wrt θ. However, when com-
paring models, we need to know how to compute the marginal likelihood, p(D|m). In general, this can be quite hard,
since we have to integrate over all possible parameter values, but when we have a conjugate prior, it is easy to compute,
as we now show.
Let p(θ) = q(θ)/Z₀ be our prior, where q(θ) is an unnormalized distribution, and Z₀ is the normalization constant of the prior. Let p(D|θ) = q(D|θ)/Zℓ be the likelihood, where Zℓ contains any constant factors in the likelihood. Finally let p(θ|D) = q(θ|D)/Z_N be our posterior, where q(θ|D) = q(D|θ)q(θ) is the unnormalized posterior, and Z_N is the normalization constant of the posterior. We have
p(θ|D) = p(D|θ) p(θ) / p(D)    (5.7)
q(θ|D)/Z_N = q(D|θ) q(θ) / (Zℓ Z₀ p(D))    (5.8)
p(D) = Z_N / (Z₀ Zℓ)    (5.9)
So assuming the relevant normalization constants are tractable, we have an easy way to compute the marginal
likelihood. We give some examples below.
Let us apply the above result to the Beta-binomial model. Since we know p(θ|D) = Beta(θ|a′, b′), where a′ = a + N₁, b′ = b + N₀, we know the normalization constant of the posterior is B(a′, b′). Hence
p(θ|D) = p(D|θ) p(θ) / p(D)    (5.10)
       = 1/p(D) [ 1/B(a, b) θ^{a−1} (1 − θ)^{b−1} ] [ (N choose N₁) θ^{N₁} (1 − θ)^{N₀} ]    (5.11)
       = (N choose N₁) 1/p(D) 1/B(a, b) [ θ^{a+N₁−1} (1 − θ)^{b+N₀−1} ]    (5.12)
So
1/B(a + N₁, b + N₀) = (N choose N₁) 1/p(D) 1/B(a, b)    (5.13)
p(D) = (N choose N₁) B(a + N₁, b + N₀) / B(a, b)    (5.14)
The marginal likelihood for the Beta-Bernoulli model is the same as above, except it is missing the (N choose N₁) term.
By the same reasoning as the Beta-Bernoulli case, one can show that the marginal likelihood for the Dirichlet-multinoulli model is given by
p(D) = B(N + α) / B(α)    (5.15)
     = [ Γ(Σ_k α_k) / Γ(N + Σ_k α_k) ] ∏_k [ Γ(N_k + α_k) / Γ(α_k) ]    (5.16)
Consider the case of an MVN with a conjugate NIW prior. Let Z₀ be the normalizer for the prior, Z_N be the normalizer for the posterior, and let Zℓ = (2π)^{ND/2} be the normalizer for the likelihood. Then it is easy to see that
p(D) = Z_N / (Z₀ Zℓ)    (5.17)
     = 1/(2π)^{ND/2} · [ (2π/κ_N)^{D/2} |S_N|^{−ν_N/2} 2^{(ν₀+N)D/2} Γ_D(ν_N/2) ] / [ (2π/κ₀)^{D/2} |S₀|^{−ν₀/2} 2^{ν₀D/2} Γ_D(ν₀/2) ]    (5.18)
     = (1/π^{ND/2}) (κ₀/κ_N)^{D/2} [ |S₀|^{ν₀/2} / |S_N|^{ν_N/2} ] [ Γ_D(ν_N/2) / Γ_D(ν₀/2) ]    (5.19)
In general, computing the integral in Equation 5.4 can be quite difficult. One simple but popular approximation is
known as the Bayesian information criterion or BIC, which has the following form (Schwarz 1978):
BIC ≜ log p(D|θ̂) − (dof(θ̂)/2) log N    (5.20)
where dof(θ̂) is the number of degrees of freedom in the model, and θ̂ is the MLE for the model. We see that this has the form of a penalized log likelihood, where the penalty term depends on the model's complexity. See Section
TODO for the derivation of the BIC score.
As an example, consider linear regression. As we show in Section TODO, the MLE is given by ŵ = (X^TX)^{−1}X^Ty and σ̂² = (1/N) Σ_{i=1}^N (y_i − ŵ^Tx_i)². The corresponding log likelihood is given by
log p(D|θ̂) = −(N/2) log(2πσ̂²) − N/2    (5.21)
Hence the BIC score is as follows (dropping constant terms)
BIC = −(N/2) log(σ̂²) − (D/2) log N    (5.22)
where D is the number of variables in the model. In the statistics literature, it is common to use an alternative definition
of BIC, which we call the BIC cost (since we want to minimize it):
BIC-cost ≜ −2 log p(D|θ̂) + dof(θ̂) log N
The BIC method is very closely related to the minimum description length or MDL principle, which characterizes
the score for a model in terms of how well it fits the data, minus how complex the model is to define. See (Hansen and
Yu 2001) for details.
There is a very similar expression to BIC/MDL called the Akaike information criterion or AIC, defined as
AIC(m, D) ≜ log p(D|θ̂_MLE) − dof(m)
This is derived from a frequentist framework, and cannot be interpreted as an approximation to the marginal likeli-
hood. Nevertheless, the form of this expression is very similar to BIC. We see that the penalty for AIC is less than for
BIC. This causes AIC to pick more complex models. However, this can result in better predictive accuracy. See e.g.,
(Clarke et al. 2009, sec 10.2) for further discussion on such information criteria.
Sometimes it is not clear how to set the prior. When we are performing posterior inference, the details of the prior may
not matter too much, since the likelihood often overwhelms the prior anyway. But when computing the marginal like-
lihood, the prior plays a much more important role, since we are averaging the likelihood over all possible parameter
settings, as weighted by the prior.
If the prior is unknown, the correct Bayesian procedure is to put a prior on the prior.
Suppose our prior on models is uniform, p(m) ∝ 1. Then model selection is equivalent to picking the model with the highest marginal likelihood. Now suppose we just have two models we are considering, call them the null hypothesis, M₀, and the alternative hypothesis, M₁. Define the Bayes factor as the ratio of marginal likelihoods:
BF₁,₀ ≜ p(D|M₁) / p(D|M₀)
5.4 Priors
The most controversial aspect of Bayesian statistics is its reliance on priors. Bayesians argue this is unavoidable, since
nobody is a tabula rasa or blank slate: all inference must be done conditional on certain assumptions about the world.
Nevertheless, one might be interested in minimizing the impact of one's prior assumptions. We briefly discuss some
ways to do this below.
If we don't have strong beliefs about what θ should be, it is common to use an uninformative or non-informative prior, and to let the data speak for itself.
In many cases, we are not very confident in our prior, so we want to make sure it does not have an undue influence on
the result. This can be done by using robust priors (Insua and Ruggeri 2000), which typically have heavy tails, and which avoid forcing things to be too close to the prior mean.
Robust priors are useful, but can be computationally expensive to use. Conjugate priors simplify the computation, but
are often not robust, and not flexible enough to encode our prior knowledge. However, it turns out that a mixture
of conjugate priors is also conjugate, and can approximate any kind of prior (Dallal and Hall 1983; Diaconis and
Ylvisaker 1985). Thus such priors provide a good compromise between computational convenience and flexibility.
A key requirement for computing the posterior p(θ|D) is the specification of a prior p(θ|η), where η are the hyper-parameters. What if we don't know how to set η? In some cases, we can use uninformative priors, as we discussed above. A more Bayesian approach is to put a prior on our priors! In terms of graphical models (Chapter TODO), we can represent the situation as follows:
η → θ → D    (5.27)
This is an example of a hierarchical Bayesian model, also called a multi-level model, since there are multiple levels of unknown quantities.
Method                    Definition
Maximum likelihood        θ̂ = arg max_θ p(D|θ)
MAP estimation            θ̂ = arg max_θ p(D|θ) p(θ|η)
ML-II (Empirical Bayes)   η̂ = arg max_η ∫ p(D|θ) p(θ|η) dθ = arg max_η p(D|η)
MAP-II                    η̂ = arg max_η ∫ p(D|θ) p(θ|η) p(η) dθ = arg max_η p(D|η) p(η)
Full Bayes                p(θ, η|D) ∝ p(D|θ) p(θ|η)
We have seen how probability theory can be used to represent and update our beliefs about the state of the world. However, ultimately our goal is to convert our beliefs into actions. In this section, we discuss the optimal way to do this.
Our goal is to devise a decision procedure or policy, f (x) : X Y, which minimizes the expected loss Rexp ( f )(see
Equation 1.2).
In the Bayesian approach to decision theory, the optimal output, having observed x, is defined as the output a that minimizes the posterior expected loss:
ρ(f) = E_{p(y|x)}[L(y, f(x))] = { Σ_y L[y, f(x)] p(y|x)        (discrete y)
                                  ∫_y L[y, f(x)] p(y|x) dy     (continuous y)    (5.28)
Hence the Bayes estimator, also called the Bayes decision rule, is given by
δ(x) = arg min_{a ∈ A} E[L(y, a)|x]    (5.29)
When L(y, f(x)) is 0-1 loss (Section 1.4.2.1), we can show that the MAP estimate minimizes the 0-1 loss:
arg min_{f∈H} ρ(f) = arg min_{f∈H} Σ_{k=1}^K L[C_k, f(x)] p(C_k|x)
                   = arg min_{f∈H} Σ_{k=1}^K I(f(x) ≠ C_k) p(C_k|x)
                   = arg min_{f∈H} [1 − p(f(x)|x)]
                   = arg max_{f∈H} p(f(x)|x)
For continuous parameters, a more appropriate loss function is squared error, ℓ2 loss, or quadratic loss, defined as L(y, f(x)) = [y − f(x)]².
The posterior expected loss is given by
ρ(f) = ∫_y L[y, f(x)] p(y|x) dy
     = ∫_y [y − f(x)]² p(y|x) dy    (5.30)
     = ∫_y [y² − 2y f(x) + f(x)²] p(y|x) dy
Setting the derivative with respect to f(x) to zero gives ∫_y [−2y + 2f(x)] p(y|x) dy = 0, and hence
f(x) = E[y|x] = ∫_y y p(y|x) dy
i.e., the optimal estimate is the posterior mean. This is often called the minimum mean squared error estimate or MMSE estimate.
The ℓ2 loss penalizes deviations from the truth quadratically, and thus is sensitive to outliers. A more robust alternative is the absolute or ℓ1 loss. The optimal estimate is the posterior median, i.e., a value a such that P(y < a|x) = P(y ≥ a|x) = 0.5.
Proof.
ρ(f) = ∫_y L[y, f(x)] p(y|x) dy = ∫_y |y − f(x)| p(y|x) dy
     = ∫_y [f(x) − y] p(y < f(x)|x) + [y − f(x)] p(y ≥ f(x)|x) dy
∂ρ/∂f(x) = ∫_y [p(y < f(x)|x) − p(y ≥ f(x)|x)] dy = 0
which implies P(y < f(x)|x) = P(y ≥ f(x)|x) = 0.5, i.e., f(x) is the posterior median.
In classification problems where p(y|x) is very uncertain, we may prefer to choose a reject action, in which we refuse to classify the example as any of the specified classes, and instead say "don't know". Such ambiguous cases can be handled by, e.g., a human expert. This is useful in risk-averse domains such as medicine and finance.
We can formalize the reject option as follows. Let choosing f(x) = C_{K+1} correspond to picking the reject action, and choosing f(x) ∈ {C₁, ..., C_K} correspond to picking one of the classes. Suppose we define the loss function as
L(f(x), y) = { 0     if f(x) = y and f(x), y ∈ {C₁, ..., C_K}
               λ_s   if f(x) ≠ y and f(x), y ∈ {C₁, ..., C_K}
               λ_r   if f(x) = C_{K+1} }    (5.32)
where λ_s is the cost of a substitution error, and λ_r is the cost of the reject action.
We can define the loss incurred by f(x) (i.e., using this predictor) when the unknown state of nature is θ (the parameters of the data generating mechanism) as follows:
L(θ, f) ≜ E_{p(x,y|θ)}[L(y, f(x))]
This is known as the generalization error. Our goal is to minimize the posterior expected loss, given by
ρ(f|D) = ∫ p(θ|D) L(θ, f) dθ    (5.34)
This should be contrasted with the frequentist risk which is defined in Equation TODO.
In this section, we focus on binary decision problems, such as hypothesis testing, two-class classification, object/event detection, etc. There are two types of error we can make: a false positive (aka false alarm), or a false negative (aka missed detection). The 0-1 loss treats these two kinds of errors equivalently. However, we can consider the following
more general loss matrix:
TODO
Chapter 6
Linear Regression
6.1 Introduction
Linear regression is the work horse of statistics and (supervised) machine learning. When augmented with kernels or
other forms of basis function expansion, it can also model nonlinear relationships. And when the Gaussian output is
replaced with a Bernoulli or multinoulli distribution, it can be used for classification, as we will see below. So it pays
to study this model in detail.
In the simplest approach, we can directly construct an appropriate function y(x) whose values for new inputs x constitute the predictions for the corresponding values of t. More generally, from a probabilistic perspective, we aim to model the predictive distribution p(t|x), because this expresses the uncertainty about the value of t for each value of x.
6.2 Representation
The simplest linear model for regression involves a linear combination of the input variables. More generally, we apply a fixed set of basis functions to the inputs:
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
where φ_j(x) are known as basis functions, w = (w₀, ..., w_{M−1})^T and φ = (φ₀, ..., φ_{M−1})^T. This is known as basis function expansion. (Note that the model is still linear in the parameters w, so it is still called linear regression; the importance of this will become clear below.) A simple example is polynomial basis functions, where the model has the form
φ(x) = (1, x, x², ..., x^d)    (6.3)
As before, we assume that the target variable is given by a deterministic function y(x, w) with additive Gaussian noise so that
t = y(x, w) + ε    (6.4)
where ε is a zero mean Gaussian random variable with precision (inverse variance) β. Thus
p(t|x, w, β) = N(t|y(x, w), β^{−1})    (6.5)
or
y(x) = w^T x + ε    (6.6)
p(y|x, θ) = N(y|w^T x, σ²)    (6.7)
where w (the weight vector) and x are extended vectors, x = (1, x), w = (b, w), and ε has a Gaussian or normal distribution. w₀ is the intercept or bias term.
Assuming the training examples are independently and identically distributed (IID), we obtain the likelihood function, which is a function of the adjustable parameters w and β, in the form
p(t|X, w, β) = ∏_{n=1}^N N(t_n|w^T φ(x_n), β^{−1})    (6.9)
A common way to estimate the parameters of a statistical model is to compute the MLE (maximum likelihood estimate), which is defined as
θ̂ = arg max_θ log p(D|θ)    (6.11)
Instead of maximizing the log-likelihood, we can equivalently minimize the negative log likelihood or NLL:
The NLL formulation is sometimes more convenient, since many optimization software packages are designed to
find the minima of functions, rather than maxima.
Now let us apply the method of MLE to the linear regression setting. Inserting the definition of the Gaussian into the above, we find that maximizing the log likelihood is equivalent to minimizing the sum-of-squares error function
E_D(w) = ½ Σ_{n=1}^N {t_n − w^T φ(x_n)}²    (6.18)
There are multiple ways to compute the optimal solution of the likelihood function: matrix calculus, sum-of-terms derivatives, or a geometric interpretation.
Stacking the basis-function vectors row-wise gives the design matrix
Φ = [ φ(x₁)^T
      φ(x₂)^T
      ⋮
      φ(x_N)^T ]
and hence
Φ^T = [ φ₁ φ₂ ⋯ φ_N ],   Φ^T Φ = Σ_{i=1}^N φ_i φ_i^T    (6.21)
We see that maximizing the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing the sum-of-squares error function E_D with respect to w, so this method is known as least squares. The gradient of the log likelihood function takes the form
\nabla_w \ln p(t|w, \beta) = \sum_{n=1}^{N}\{t_n - w^T\phi(x_n)\}\phi(x_n)^T \qquad (6.22)
Setting this gradient to zero,
\nabla_w E_D(w) = 0 \qquad (6.23)
\sum_{n=1}^{N}\{t_n - w^T\phi(x_n)\}\phi(x_n)^T = 0 \qquad (6.25)
\sum_{n=1}^{N}t_n\phi(x_n)^T - w^T\Big(\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T\Big) = 0 \qquad (6.26)
t^T\Phi - w^T\Phi^T\Phi = 0 \qquad (6.27)
The corresponding solution w_{OLS} to this linear system of equations is called the ordinary least squares (OLS) or normal equation solution. When the dataset is small (for example, N < 1000), we can use the normal equations to compute w directly; a sketch is given below.
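A minimal NumPy sketch of OLS via the normal equations; the design matrix Φ is assumed to already contain a constant column, and for numerical robustness we solve the linear system rather than forming an explicit inverse (the data below are synthetic and only illustrative):

```python
import numpy as np

def ols_fit(Phi, t):
    """Solve the normal equations Phi^T Phi w = Phi^T t for w."""
    A = Phi.T @ Phi
    b = Phi.T @ t
    return np.linalg.solve(A, b)   # or: np.linalg.lstsq(Phi, t, rcond=None)[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Phi = np.column_stack([np.ones(100), X])      # prepend the bias column
w_true = np.array([1.0, 2.0, -1.0, 0.5])
t = Phi @ w_true + 0.1 * rng.normal(size=100)
print(ols_fit(Phi, t))                        # close to w_true
```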
Making the bias parameter explicit, the error function becomes
E_D(w) = \frac{1}{2}\sum_{n=1}^{N}\Big\{t_n - w_0 - \sum_{j=1}^{M-1}w_j\phi_j(x_n)\Big\}^2 \qquad (6.29)
Setting the derivative with respect to w_0 to zero and solving gives w_0 = \bar{t} - \sum_{j=1}^{M-1}w_j\bar{\phi}_j, where \bar{t} and \bar{\phi}_j denote the sample means of the targets and of the j-th basis function. Then we have
y = \sum_{j=1}^{M-1}w_j\phi_j + w_0 \qquad (6.32)
= \sum_{j=1}^{M-1}w_j\phi_j + \bar{t} - \sum_{j=1}^{M-1}w_j\bar{\phi}_j \qquad (6.33)
y - \bar{t} = \sum_{j=1}^{M-1}w_j(\phi_j - \bar{\phi}_j) \qquad (6.34)
which indicates that we can center the features (subtract the mean of each basis function) and the targets during training, which reduces computation while yielding the same weight parameters.
With regard to matrix calculus, when performing gradient-based updates, we represent matrix derivatives in denominator layout. Thus the dimension of the matrix with respect to which we are computing derivatives is consistent with that of its gradient matrix. The denominator layout convention makes it convenient for us to compute complex derivatives with the chain rule and to use the result to update the old parameters directly, without a transposing operation.
There are multiple ways to derive matrix differential calculus:
1. From the definition of the differential: take the difference df = f(x + dx) - f(x) and compute the limit of df/dx as dx approaches 0.
2. Using the definition of matrix multiplication, expand each matrix cell into a sum of terms, apply one-dimensional calculus, and finally rewrite the result in matrix form.
3. Construct the full derivative from a sum of partial derivatives. In this way, we can decompose a complex formula into several simple ones.
We now state without proof some facts of matrix calculus (we won't need all of these in this section).
\frac{\partial AB}{\partial A} = B^T \qquad (6.35)
\frac{\partial}{\partial A^T}f(A) = \Big[\frac{\partial}{\partial A}f(A)\Big]^T \qquad (6.36)
\frac{\partial}{\partial A}\,\mathrm{tr}(ABA^TC) = CAB + C^TAB^T \qquad (6.37)
\frac{\partial}{\partial A}|A| = |A|(A^{-1})^T \qquad (6.38)
\mathrm{tr}\,A = \sum_{i=1}^{n}A_{ii}
To see why \partial AB/\partial A = B^T, we use dimension analysis. Denote f = f(C) with C = AB. Assume A is of dimension m \times n and B is of dimension n \times p, so C is of dimension m \times p. Take the derivative of f with respect to A:
\frac{\partial f}{\partial A} = \frac{\partial f}{\partial C}\frac{\partial C}{\partial A} \qquad (6.39)
Here \partial f/\partial A has size m \times n and \partial f/\partial C has size m \times p, so the factor contributed by \partial C/\partial A must have size p \times n. Indeed B^T is of size p \times n, and the whole mapping is linear. Alternatively, we can explicitly write the matrix elements as sums of terms to show the same result.
Another vector differentiation identity: if A is not a function of X, then
\frac{dX^TAX}{dX} \qquad (6.41)
= \frac{dY^TAX}{dX}\bigg|_{Y=X} \qquad (6.42)
= \frac{\partial Y^TAX}{\partial Y}\frac{\partial Y}{\partial X} + \frac{\partial Y^TAX}{\partial X}\frac{\partial X}{\partial X} \qquad (6.43)
= \frac{\partial X^TA^TY}{\partial Y}I + \frac{\partial Y^TAX}{\partial X}I \qquad (6.44)
= X^TA^T + Y^TA \qquad (6.45)
= X^TA^T + X^TA \qquad (6.46)
= X^T(A^T + A) \qquad (6.47)
= (A + A^T)X \quad \text{in denominator layout} \qquad (6.48)
The previous derivation was based on sum-of-terms arithmetic. Now we derive the same result using matrix calculus.
Proof. Dropping constants with respect to w, the NLL can be written as
NLL(w) = \frac{1}{2}\sum_{i=1}^{N}(t_i - w^T\phi_i)^2
= \frac{1}{2}(\Phi w - t)^T(\Phi w - t)
= \frac{1}{2}(w^T\Phi^T\Phi w - w^T\Phi^Tt - t^T\Phi w + t^Tt)
= \frac{1}{2}(w^T\Phi^T\Phi w - 2w^T\Phi^Tt) + \text{const}
Differentiating in denominator layout,
\frac{\partial NLL}{\partial w} = \frac{1}{2}\big((\Phi^T\Phi + \Phi^T\Phi)w - 2\Phi^Tt\big) = \Phi^T\Phi w - \Phi^Tt
Setting \partial NLL/\partial w = 0 gives
\Phi^T\Phi w = \Phi^Tt
w_{ML} = (\Phi^T\Phi)^{-1}\Phi^Tt
Fig. 6.1: Graphical interpretation of least squares for N = 3 examples and D = 2 features. \tilde{x}_1 and \tilde{x}_2 are vectors in R^3; together they define a 2D plane. t is also a vector in R^3 but does not lie on this 2D plane. The orthogonal projection of t onto this plane is denoted \hat{t}. The red line from t to \hat{t} is the residual, whose norm we want to minimize. For visual clarity, all vectors have been scaled to unit norm.
X = \begin{pmatrix}x_1^T \\ x_2^T \\ \vdots \\ x_N^T\end{pmatrix} = \begin{pmatrix}\tilde{x}_1 & \tilde{x}_2 & \cdots & \tilde{x}_D\end{pmatrix} \qquad (6.49)
t = (t_1, t_2, \dots, t_N)^T \qquad (6.50)
We seek a vector \hat{t} \in R^N that lies in the column space of X and is as close as possible to t, i.e., we want to find
\hat{t} \in \mathrm{span}(X) \qquad (6.51)
\hat{t} = Xw = w_1\tilde{x}_1 + \dots + w_D\tilde{x}_D \qquad (6.52)
\hat{t} = \arg\min_{\hat{t}\in\mathrm{span}(\{\tilde{x}_1,\dots,\tilde{x}_D\})}\|t - \hat{t}\|_2 \qquad (6.53)
To minimize the norm of the residual, t - \hat{t}, we want the residual vector to be orthogonal to every column of X, so \tilde{x}_j^T(t - \hat{t}) = 0 for j = 1 : D. Hence
\tilde{x}_j^T(t - \hat{t}) = 0 \;\Rightarrow\; X^T(t - Xw) = 0 \;\Rightarrow\; w = (X^TX)^{-1}X^Tt \qquad (6.54)
Stochastic gradient descent (SGD) is also known as sequential gradient descent. Batch techniques, such as maximum likelihood, involve processing the entire training set in one go (one pass), which can be computationally costly for large data sets. For sufficiently large datasets, it is worthwhile to use sequential algorithms, also known as on-line algorithms.
If the error function comprises a sum over data points, E = \sum_n E_n, then after presentation of pattern n the stochastic gradient descent algorithm updates the parameter vector w using
\nabla E = \sum_n \nabla E_n \qquad (6.55)
w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_n \qquad (6.56)
where \tau denotes the iteration number and \eta is a learning rate parameter. For the case of the sum-of-squares error function, this gives
w^{(\tau+1)} = w^{(\tau)} + \eta(t_n - w^{(\tau)T}\phi_n)\phi_n \qquad (6.57)
where \phi_n = \phi(x_n). This is known as the least-mean-squares or LMS algorithm. Notice that the gradient vector \nabla E_n here differs from the one we used to derive the closed-form solution, which was written in numerator layout to make the equation convenient to solve.
\frac{\partial}{\partial w_j}NLL(w) = \sum_{i=1}^{N}(w^Tx_i - y_i)x_{ij} \qquad (6.58)
w_j := w_j - \eta\frac{\partial}{\partial w_j}NLL(w) = w_j - \eta\sum_{i=1}^{N}(w^Tx_i - y_i)x_{ij} \qquad (6.59)
w := w - \eta(w^Tx_i - y_i)x_i \qquad (6.60)
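A minimal sketch of the LMS update (6.57) in NumPy; the learning rate and number of passes are illustrative choices, not values from the text:

```python
import numpy as np

def lms(Phi, t, eta=0.01, n_epochs=50, seed=0):
    """Least-mean-squares: stochastic gradient descent on the sum-of-squares error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):   # visit the patterns in random order
            err = t[n] - w @ Phi[n]         # t_n - w^T phi_n
            w += eta * err * Phi[n]         # w <- w + eta (t_n - w^T phi_n) phi_n
    return w
```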
One problem with ML estimation is that it can result in overfitting. In this section, we discuss a way to ameliorate this problem by using MAP estimation with a Gaussian prior.
We can encourage the parameters to be small, thus resulting in a smoother curve, by using a zero-mean Gaussian prior
p(w) = \prod_j \mathcal{N}(w_j \mid 0, \tau^2)
where 1/\tau^2 controls the strength of the prior. The corresponding MAP estimation problem becomes
\arg\max_w \sum_{i=1}^{N}\log\mathcal{N}(t_i \mid w_0 + w^T\phi_i, \sigma^2) + \sum_{j=1}^{D}\log\mathcal{N}(w_j \mid 0, \tau^2) \qquad (6.62)
where the first term is the usual MSE/NLL, and the second term acts as a complexity penalty whose strength, when rewritten as a penalized least-squares objective, is a regularization coefficient \lambda \ge 0 given by the ratio of the noise variance to the prior variance. Here E_W(w) is one of the simplest forms of regularizer, given by the sum of squares of the weight vector elements,
E_W(w) = \frac{1}{2}w^Tw \qquad (6.65)
This particular choice of regularizer is known as weight decay.
The corresponding solution is given by
w_{ridge} = (\lambda I_D + \Phi^T\Phi)^{-1}\Phi^Tt \qquad (6.66)
This technique is known as ridge regression, or penalized least squares. In general, adding a Gaussian prior to the parameters of a model to encourage them to be small is called \ell_2 regularization or weight decay. Note that the offset term w_0 is not regularized, since this just affects the height of the function, not its complexity.
We will consider a variety of different priors in this book. Each of these corresponds to a different form of regularization. This technique is very widely used to prevent overfitting.
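A minimal sketch of the ridge solution (6.66); following the text, the bias column is left unregularized (the assumption that the first column of Φ is the bias is illustrative):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """w_ridge = (lam*I + Phi^T Phi)^{-1} Phi^T t, with the bias column unpenalized."""
    D = Phi.shape[1]
    penalty = lam * np.eye(D)
    penalty[0, 0] = 0.0   # do not regularize the offset term w_0
    return np.linalg.solve(penalty + Phi.T @ Phi, Phi.T @ t)
```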
Now consider the case where we wish to predict K > 1 target variables, which we denote collectively by the target vector t. Using the same set of basis functions to model all of the components of the target vector, the maximum likelihood solution becomes
W_{ML} = (\Phi^T\Phi)^{-1}\Phi^TT \qquad (6.71)
and the corresponding ridge solution can be written as
w_{ridge} = V(Z^TZ + \lambda I_N)^{-1}Z^Ty \qquad (6.72)
Regularization is the most common way to avoid overfitting. However, another effective approach which is not always
available is to use lots of data. It should be intuitively obvious that the more training data we have, the better we will
be able to learn.
In domains with lots of data, simple methods can work surprisingly well (Halevy et al. 2009). However, there are
still reasons to study more sophisticated learning methods, because there will always be problems for which we have
little data. For example, even in such a data-rich domain as web search, as soon as we want to start personalizing the
results, the amount of data available for any given user starts to look small again (relative to the complexity of the
problem).
6.5.0.1 representation
Data representation: N observations of x, written X \equiv (x_1, x_2, ..., x_N)^T, together with corresponding observations of the values of t, denoted t \equiv (t_1, t_2, ..., t_N)^T.
t_i = f(x_i) + \epsilon \qquad (6.73)
where the noise \epsilon has zero mean and variance \sigma^2. The task is to find an estimate \hat{f}(x) of the underlying function. With a quadratic regularizer, the regularized error function is
\tilde{E}(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2}\|w\|^2 \qquad (6.76)
Solving for y(x) using the sum and product rules of probability, we obtain
y(x) = \frac{\int t\,p(x,t)\,dt}{p(x)} = \int t\,p(t|x)\,dt = E[t|x] \qquad (6.81)
Let us derive this result in a slightly different way. Armed with the knowledge that the optimal solution is the conditional expectation, we can expand the squared term as follows:
\{y(x) - t\}^2 = \{y(x) - E[t|x] + E[t|x] - t\}^2 = \{y(x) - E[t|x]\}^2 + 2\{y(x) - E[t|x]\}\{E[t|x] - t\} + \{E[t|x] - t\}^2 \qquad (6.82)
where E[t|x] denotes E_t[t|x]. Substituting into the loss function and performing the integral over t, the cross-term vanishes:
E[L] = \int\!\!\int\{y(x) - t\}^2 p(x,t)\,dx\,dt \qquad (6.83)
= \int\{y(x) - E[t|x]\}^2 p(x)\,dx + \int\!\!\int\{E[t|x] - t\}^2 p(x,t)\,dx\,dt \qquad (6.84)
= \int\{y(x) - h(x)\}^2 p(x)\,dx + \int\!\!\int\{h(x) - t\}^2 p(x,t)\,dx\,dt \qquad (6.85)
where we write h(x) \equiv E[t|x].
6.5.0.3 Decomposition
For the popular choice of a squared loss function, the optimal prediction is given by the conditional expectation, which we denote by h(x) and which is given by
h(x) = E[t|x] = \int t\,p(t|x)\,dt \qquad (6.88)
Consider the integrand of the first term of 6.83, which for a particular data set D takes the form \{y(x; D) - h(x)\}^2. This quantity depends on the particular data set D, so we take its average over the ensemble of data sets. If we add and subtract the quantity E_D[y(x; D)] inside the braces and then expand, we obtain
E_D[\{y(x; D) - h(x)\}^2] = \underbrace{\{E_D[y(x; D)] - h(x)\}^2}_{(\mathrm{bias})^2} + \underbrace{E_D[\{y(x; D) - E_D[y(x; D)]\}^2]}_{\mathrm{variance}} \qquad (6.93)
since the cross-term vanishes when averaged over D.
Overall, the expected squared loss therefore decomposes as expected loss = (bias)^2 + variance + noise.
The function y(x) we seek to determine enters only the first term of 6.84, which will be minimized when y(x) is equal to E[t|x], in which case this term will vanish. The second term is the variance of the distribution of t, averaged over x; it represents the intrinsic variability of the target data and can be regarded as noise. It is the irreducible minimum value of the loss function.
A more sophisticated loss function is the Minkowski loss,
E[L_q] = \int\!\!\int|y(x) - t|^q\,p(x,t)\,dx\,dt \qquad (6.99)
Hold-out data can be used to determine model complexity, but this is computationally expensive and wasteful of valuable data. We therefore turn to a Bayesian treatment of linear regression, which avoids the over-fitting problem of maximum likelihood and which also leads to automatic methods of determining model complexity using the training data alone.
First we introduce a prior probability distribution over the model parameters. The likelihood function p(t|w) defined by 6.9 is the exponential of a quadratic function of w, so the corresponding conjugate prior is Gaussian,
p(w) = \mathcal{N}(w \mid m_0, S_0)
The posterior distribution is proportional to the product of the likelihood and the prior, and it will also be Gaussian due to the choice of a conjugate Gaussian prior (the derivation is given in 2.5),
p(w|t) = \mathcal{N}(w \mid m_N, S_N) \qquad (6.101)
where
m_N = S_N(S_0^{-1}m_0 + \beta\Phi^Tt) \qquad (6.102)
S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi \qquad (6.103)
Thus the maximum posterior weight vector is simply given by w_{MAP} = m_N. If N = 0, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, so that the updated posterior is again given by the same form.
Consider a zero-mean isotropic Gaussian prior governed by a single precision parameter \alpha, p(w|\alpha) = \mathcal{N}(w \mid 0, \alpha^{-1}I), so that
m_N = \beta S_N\Phi^Tt \qquad (6.105)
S_N^{-1} = \alpha I + \beta\Phi^T\Phi \qquad (6.106)
The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of w, takes the form
\ln p(w|t) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - w^T\phi(x_n)\}^2 - \frac{\alpha}{2}w^Tw + \text{const} \qquad (6.107)
Maximization of this posterior with respect to w is therefore equivalent to minimization of the sum-of-squares error function with the addition of a quadratic regularization term, corresponding to \lambda = \alpha/\beta.
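A minimal sketch of computing the posterior (6.105)-(6.106) and the resulting predictive mean and variance under the isotropic prior; alpha and beta are assumed known hyperparameters:

```python
import numpy as np

def bayes_linreg_posterior(Phi, t, alpha, beta):
    """Return m_N, S_N for the prior N(w|0, alpha^{-1} I) and noise precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def bayes_linreg_predict(phi_x, m_N, S_N, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```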
In practice we are usually interested in the predictive distribution for new inputs, which is obtained by marginalizing the joint distribution p(t, w|t, \alpha, \beta) over w; the joint distribution can be handled using the Bayesian results for Gaussians. The conditional distribution p(t|x, w, \beta) of the target variable is given by 6.5, and the posterior weight distribution is given by 6.101. The marginalization involves the convolution of two Gaussian distributions, and the predictive distribution takes the form
p(t|x, t, \alpha, \beta) = \mathcal{N}(t \mid m_N^T\phi(x),\; \sigma_N^2(x)), \qquad \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^TS_N\phi(x)
The posterior mean prediction m_N^T\phi(x) has an interpretation that will set the stage for kernel methods, including Gaussian processes. The predictive mean can be written
y(x, m_N) = m_N^T\phi(x) = \beta\phi(x)^TS_N\Phi^Tt = \sum_{n=1}^{N}\beta\phi(x)^TS_N\phi(x_n)t_n = \sum_{n=1}^{N}k(x, x_n)t_n \qquad (6.111)
where
k(x, x') = \beta\phi(x)^TS_N\phi(x') \qquad (6.112)
is known as the smoother matrix or equivalent kernel. Regression functions, such as this, which make predictions by taking linear combinations of the training set target values are known as linear smoothers. The kernel function is localized around x (local evidence weighs more than distant evidence).
The covariance between y(x) and y(x') is given by \mathrm{cov}[y(x), y(x')] = \beta^{-1}k(x, x'). We see that the predictive means at nearby points will be highly correlated.
The formulation of linear regression in terms of kernel functions suggests an alternative approach, called a Gaussian process: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localized kernel directly and use it to make predictions for new input vectors x, given the observed training set.
The effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of x, and it can be shown that these weights sum to one,
\sum_{n=1}^{N}k(x, x_n) = 1 \qquad (6.115)
The equivalent kernel can also be written in inner-product form, k(x, z) = \psi(x)^T\psi(z), where \psi(x) = \beta^{1/2}S_N^{1/2}\phi(x).
The over-fitting associated with maximum likelihood can be avoided by marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values; no separate validation set is then needed.
Suppose we wish to compare a set of L models \{M_i\}, i = 1, ..., L, over observed data D. We evaluate the posterior distribution
p(M_i|D) \propto p(M_i)\,p(D|M_i)
The prior allows us to express a preference for different models. The quantity p(D|M_i) is the model evidence, which expresses the preference shown by the data for different models; it is also called the marginal likelihood. The ratio of model evidences p(D|M_i)/p(D|M_j) for two models is the Bayes factor. The predictive distribution is
p(t|x, D) = \sum_{i=1}^{L}p(t|x, M_i, D)\,p(M_i|D)
This is an example of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions of the individual models, weighted by the posterior probabilities p(M_i|D) of those models.
Using the single most probable model alone to make predictions is known as model selection. The model evidence is
p(D|M_i) = \int p(D|w, M_i)\,p(w|M_i)\,dw \qquad (6.122)
which can also be seen from Bayes' theorem for the parameters,
p(w|D, M_i) = \frac{p(D|w, M_i)\,p(w|M_i)}{p(D|M_i)} \qquad (6.123)
Assume the posterior distribution over parameters is sharply peaked around the most probable value w_{MAP}, with width \Delta w_{posterior}. Then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width \Delta w_{prior}, so that p(w) = 1/\Delta w_{prior}, then we have
p(D) = \int p(D|w)\,p(w)\,dw \approx p(D|w_{MAP})\frac{\Delta w_{posterior}}{\Delta w_{prior}} \qquad (6.124)
\ln p(D) \approx \ln p(D|w_{MAP}) + \ln\Big(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\Big) \qquad (6.125)
If we have a set of M parameters, assuming that all parameters have the same ratio \Delta w_{posterior}/\Delta w_{prior}, we obtain
\ln p(D) \approx \ln p(D|w_{MAP}) + M\ln\Big(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\Big) \qquad (6.126)
As we increase the complexity of the model, the first term, representing the fit to the data, typically increases, whereas the second term decreases due to the dependence on M (the log ratio is negative because \Delta w_{posterior} < \Delta w_{prior}). The optimal model complexity is given by a trade-off between these two competing terms. Averaging the Bayes factor over the distribution of data sets drawn from M_1,
\int p(D|M_1)\ln\frac{p(D|M_1)}{p(D|M_2)}\,dD \qquad (6.127)
This quantity is the Kullback-Leibler divergence, and it is greater than or equal to 0 if M_1 is the correct model, so on average the Bayes factor will favour the correct model.
The goal in classification is to take an input vector x and to assign it to one of K discrete classes C_k, where k = 1, ..., K. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable.
In Chapter 1, we identified three distinct approaches to the classification problem. The simplest involves constructing a discriminant function that directly assigns each vector x to a specific class. A more powerful approach, however, models the conditional probability distribution p(C_k|x) in an inference stage, and then subsequently uses this distribution to make optimal decisions. By separating inference and decision we gain numerous benefits, as discussed in Chapter 1. There are two different approaches to determining the conditional probabilities p(C_k|x). One technique is to model them directly, for example by representing them as parametric models and then optimizing the parameters using a training set. Alternatively, we can adopt a generative approach in which we model the class-conditional densities p(x|C_k), together with the prior probabilities p(C_k) for the classes, and then compute the required posterior probabilities using Bayes' theorem,
p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)} \qquad (7.1)
In the linear regression models considered earlier, the model prediction y(x, w) was given by a linear function of the parameters w. In the simplest case, the model is also linear in the input variables and therefore takes the form y(x) = w^Tx + w_0, so that y is a real number. For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range (0, 1). To achieve this, we consider a generalization of this model in which we transform the linear function of w using a nonlinear function f(\cdot), so that
y(x) = f(w^Tx + w_0) \qquad (7.2)
In the machine learning literature, f(\cdot) is known as an activation function, whereas its inverse is called a link function in the statistics literature. The decision surfaces correspond to y(x) = \text{constant}, so that w^Tx + w_0 = \text{constant}, and hence the decision surfaces are linear functions of x, even if the function f(\cdot) is nonlinear. For this reason, the class of models described by 7.2 are called generalized linear models (McCullagh and Nelder, 1996). However, in contrast to the models used for regression, they are no longer linear in the parameters, due to the presence of the nonlinear function f(\cdot).
A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k. The simplest case is a linear discriminant,
y(x) = w^Tx + w_0 \qquad (7.3)
where w is called a weight vector and w_0 is the bias. The negative of the bias is sometimes called the threshold.
The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface. Consider an arbitrary point x and let x_\perp be its orthogonal projection onto the decision surface; then
x = x_\perp + r\frac{w}{\|w\|} \qquad (7.4)
r = \frac{y(x)}{\|w\|} \qquad (7.5)
Consider the extension of linear discriminants to K > 2 classes. Building a K-class discriminant by combining a number of two-class discriminant functions leads to some serious difficulties (Duda and Hart, 1973).
Consider the use of K - 1 classifiers, each of which solves a two-class problem of separating points in a particular class C_k from points not in that class. This is known as a one-versus-the-rest classifier. An alternative is to introduce K(K - 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Both of these approaches run into the problem of ambiguous regions. We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form
y_k(x) = w_k^Tx + w_{k0}
and then assigning a point x to class C_k if y_k(x) > y_j(x) for all j \ne k. The decision boundary between class C_k and class C_j is therefore given by y_k(x) = y_j(x) and hence corresponds to a (D - 1)-dimensional hyperplane defined by
(w_k - w_j)^Tx + (w_{k0} - w_{j0}) = 0
The decision regions of such a discriminant are always singly connected and convex.
Even as a discriminant function (where we use it to make decisions directly and dispense with any probabilistic interpretation), least squares suffers from some severe problems. Least-squares solutions lack robustness to outliers, and this applies equally to the classification setting: additional points (outliers) produce a significant change in the location of the decision boundary. The sum-of-squares error function penalizes predictions that are "too correct", in that they lie a long way on the correct side of the decision boundary. Beyond this lack of robustness, least-squares solutions may also give poor results on classification problems.
The failure of least squares lies in the fact that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian (a Bernoulli).
7.1.4.1 representation
One way to view a linear classification model is in terms of dimensionality reduction. Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using
y = w^Tx \qquad (7.11)
If we place a threshold on y and classify y \ge -w_0 as class C_1, and otherwise as class C_2, then we obtain our standard linear classifier discussed in the previous section.
7.1.4.2 evaluation
By adjusting the components of the weight vector w, we can select a projection that maximizes the class separation. To begin with, consider a two-class problem in which there are N_1 points of class C_1 and N_2 points of class C_2, so that the mean vectors of the two classes are given by
\mathbf{m}_1 = \frac{1}{N_1}\sum_{n\in C_1}x_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n\in C_2}x_n \qquad (7.12)
The simplest measure of the separation of the classes, when projected onto w, is the separation of the projected class means. This suggests that we might choose w so as to maximize
m_2 - m_1 = w^T(\mathbf{m}_2 - \mathbf{m}_1) \qquad (7.13)
where
m_k = w^T\mathbf{m}_k \qquad (7.14)
is the mean of the projected data from class C_k. However, this expression can be made arbitrarily large simply by increasing the magnitude of w. To solve this, we can constrain w to have unit length, so that \sum_i w_i^2 = 1. Using a Lagrange multiplier to perform the constrained maximization, we then find that w \propto (\mathbf{m}_2 - \mathbf{m}_1). There is still a problem with this approach: the projected data can have considerable overlap when the class distributions have strongly nondiagonal covariances.
7.1.4.3 optimization
The idea proposed by Fisher is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection formula 7.11 transforms the set of labelled data points in x into a labelled set in the one-dimensional space y. The within-class variance of the transformed data from class C_k is given by
s_k^2 = \sum_{n\in C_k}(y_n - m_k)^2
where
y_n = w^Tx_n \qquad (7.16)
We define the total within-class variance for the whole data set to be simply s_1^2 + s_2^2. The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by
J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} \qquad (7.17)
We can make the dependence on w explicit by rewriting the Fisher criterion in the form
J(w) = \frac{w^TS_Bw}{w^TS_Ww} \qquad (7.18)
where S_B is the between-class covariance matrix, S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T, so that
w^TS_Bw = w^T(\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^Tw = (m_2 - m_1)^2 \qquad (7.20)
and S_W is the total within-class covariance matrix, S_W = \sum_k\sum_{n\in C_k}(x_n - \mathbf{m}_k)(x_n - \mathbf{m}_k)^T, so that
w^TS_Ww = \sum_k\sum_{n\in C_k}w^T(x_n - \mathbf{m}_k)(x_n - \mathbf{m}_k)^Tw = \sum_k\sum_{n\in C_k}(y_n - m_k)^2 \qquad (7.22)
Differentiating 7.18 with respect to w, we find that J(w) is maximized when
(w^TS_Bw)S_Ww = (w^TS_Ww)S_Bw
Rewriting this as
S_Bw = \lambda S_Ww \qquad (7.24)
where
\lambda = \frac{w^TS_Bw}{w^TS_Ww} \qquad (7.25)
we recognize a generalized eigenvalue problem.
Proof.
\frac{\partial}{\partial x}\frac{f(x)}{g(x)} = \frac{f'g - fg'}{g^2}, \quad \text{where } f' = \frac{\partial f(x)}{\partial x} \text{ and } g' = \frac{\partial g(x)}{\partial x} \qquad (7.26)
\frac{\partial}{\partial x}x^TAx = (A + A^T)x \qquad (7.27)
In particular, since S_Bw = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^Tw is always in the direction of (\mathbf{m}_2 - \mathbf{m}_1), and we care only about the direction of w, we can drop the scalar factors, giving
S_Ww \propto (\mathbf{m}_2 - \mathbf{m}_1) \qquad (7.29)
Multiplying both sides of 7.24 by S_W^{-1}, we then obtain
w \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1) \qquad (7.30)
Note that if the within-class covariance is isotropic, so that S_W is proportional to the unit matrix (S_W \propto I), we find that w is proportional to the difference of the class means.
The result 7.30 is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projecting the data down to one dimension. However, the projected data can subsequently be used to construct a discriminant by choosing a threshold y_0, so that we classify a new point as belonging to C_1 if y(x) \ge y_0 and as belonging to C_2 otherwise.
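A minimal two-class sketch of Fisher's discriminant direction (7.30); choosing the threshold halfway between the projected class means is an illustrative convention, not prescribed by the text:

```python
import numpy as np

def fisher_lda(X1, X2):
    """Return the projection direction w proportional to S_W^{-1}(m2 - m1) and a midpoint threshold."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)
    y0 = 0.5 * (w @ m1 + w @ m2)                               # threshold on the projection
    return w, y0
```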
We can extend the above idea to multiple classes, and to higher-dimensional subspaces, by finding a projection matrix W which maps from D dimensions to L dimensions so as to maximize
J(W) = \frac{|W\Sigma_BW^T|}{|W\Sigma_WW^T|} \qquad (7.31)
where
\Sigma_B \triangleq \sum_c\frac{N_c}{N}(\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^T \qquad (7.32)
\Sigma_W \triangleq \sum_c\frac{N_c}{N}\Sigma_c \qquad (7.33)
\Sigma_c \triangleq \sum_{i: y_i = c}(x_i - \mathbf{m}_c)(x_i - \mathbf{m}_c)^T \qquad (7.34)
Another example of a linear discriminant model is the perceptron of Rosenblatt (1962), which corresponds to a two-class model in which the input vector x is first transformed using a fixed nonlinear transformation to give a feature vector \phi(x), and this is then used to construct a generalized linear model of the form
y(x) = f(w^T\phi(x)) \qquad (7.36)
where the nonlinear activation function f(\cdot) is given by a step function of the form
f(a) = \begin{cases}+1, & a \ge 0 \\ -1, & a < 0\end{cases} \qquad (7.37)
A natural choice of error function would be the total number of misclassified patterns, but this does not lead to a simple learning algorithm because the error is a piecewise constant function of w, with discontinuities wherever a change in w causes the decision boundary to move across one of the data points. Methods based on the gradient cannot then be applied. An alternative error function is known as the perceptron criterion, given by
E_P(w) = -\sum_{n\in\mathcal{M}}w^T\phi_nt_n
where we use the t \in \{-1, +1\} coding scheme and \mathcal{M} denotes the set of all misclassified patterns. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern x_n it tries to minimize the quantity -w^T\phi_nt_n. The perceptron learning rule is not guaranteed to reduce the total error function at each stage.
However, the perceptron convergence theorem states that if there exists an exact solution (i.e., the data are linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
Aside from difficulties with the learning algorithm, the perceptron does not provide probabilistic outputs, nor does it generalize readily to K > 2 classes. The most important limitation, however, arises from the fact that it is based on linear combinations of fixed basis functions.
7.1.9.2 optimization
We now apply the stochastic gradient descent algorithm to this error function. The change in the weight vector w is then given by
w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_P(w) = w^{(\tau)} + \eta\phi_nt_n \qquad (7.39)
where \eta is the learning rate parameter and \tau is an integer that indexes the steps of the algorithm. Because the perceptron function y(x, w) is unchanged if we multiply w by a constant, we can set the learning rate \eta equal to 1 without loss of generality.
We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. We adopt a generative approach in which we model the class-conditional densities p(x|C_k), as well as the class priors p(C_k), and then use these to compute posterior probabilities p(C_k|x) through Bayes' theorem.
Consider first of all the case of two classes. The posterior probability for class C_1 can be written as
p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)} \qquad (7.40)
= \frac{1}{1 + \exp(-a)} = \sigma(a) \qquad (7.41)
where we have defined a = \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}, and \sigma(a) is the logistic sigmoid function defined by
\sigma(a) = \frac{1}{1 + \exp(-a)} \qquad (7.43)
The term sigmoid means S-shaped; the logistic sigmoid is sometimes called a squashing function because it maps the whole real axis into a finite interval. It satisfies the symmetry property \sigma(-a) = 1 - \sigma(a). For the case of K > 2 classes, we have
p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_j p(x|C_j)p(C_j)} \qquad (7.46)
= \frac{\exp(a_k)}{\sum_j\exp(a_j)} \qquad (7.47)
which is known as the normalized exponential and can be regarded as a multiclass generalization of the logistic sigmoid. Here the quantities a_k are defined by a_k = \ln\big(p(x|C_k)p(C_k)\big). The normalized exponential is also known as the softmax function, since it represents a smoothed version of the max function: if a_k \gg a_j for all j \ne k, then p(C_k|x) \simeq 1 and p(C_j|x) \simeq 0.
The following sections investigate the consequences of choosing specific forms for the class-conditional densities, for both continuous and discrete inputs.
Assume that the class-conditional densities are Gaussian, sharing the same covariance matrix, and then explore the resulting form of the posterior probabilities. The density (pdf) for class C_k is given by
p(x|C_k) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\Big\{-\frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)\Big\} \qquad (7.49)
For two classes, the posterior is p(C_1|x) = \sigma(w^Tx + w_0), where
w = \Sigma^{-1}(\mu_1 - \mu_2) \qquad (7.54)
w_0 = -\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)} \qquad (7.55)
We see that the quadratic terms in x from the exponents of the Gaussian densities have cancelled (due to the common covariance matrix), leading to a linear function of x in the argument of the logistic sigmoid. The prior probabilities p(C_k) enter only through the bias parameter w_0, so they have the effect of shifting the parallel contours of constant posterior probability.
For the general case of K classes, the quadratic term x^T\Sigma^{-1}x again cancels in the exponents of the softmax function. Therefore we can write a_k(x) = w_k^Tx + w_{k0}, where
w_k = \Sigma^{-1}\mu_k \qquad (7.59)
w_{k0} = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln p(C_k) \qquad (7.60)
We see that the a_k(x) are again linear functions of x as a consequence of the shared covariance matrix; without that assumption we obtain quadratic functions of x, giving rise to a quadratic discriminant.
Having specified a parametric functional form for the class-conditional densities p(x|C_k), we can determine the values of the parameters, together with the prior class probabilities p(C_k), using maximum likelihood.
Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set \{x_n, t_n\}, n = 1, ..., N. Here t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2. Denote the prior class probability p(C_1) = \pi, so that p(C_2) = 1 - \pi. For a data point x_n from class C_1, we have t_n = 1 and hence p(x_n, C_1) = p(C_1)p(x_n|C_1) = \pi\mathcal{N}(x_n|\mu_1, \Sigma); similarly, for class C_2, p(x_n, C_2) = (1 - \pi)\mathcal{N}(x_n|\mu_2, \Sigma).
Setting the derivative of the log likelihood with respect to \pi equal to zero and rearranging, we obtain
\sum_{n=1}^{N}\Big\{\frac{t_n}{\pi} + \frac{t_n - 1}{1 - \pi}\Big\} = 0 \qquad (7.66)
(1 - \pi)\sum_{n=1}^{N}t_n + \pi\sum_{n=1}^{N}(t_n - 1) = 0 \qquad (7.67)
\pi = \frac{1}{N}\sum_{n=1}^{N}t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} \qquad (7.68)
where N_k denotes the total number of data points in class C_k. Thus the maximum likelihood estimate for \pi is simply the fraction of points in class C_1, as expected; this is easily generalized to the multiclass case, where \pi_c = N_c/N.
Now consider the maximization with respect to \mu_1. Keeping only the terms of the log likelihood that depend on \mu_1, we have
\sum_{n=1}^{N}t_n\ln\mathcal{N}(x_n|\mu_1, \Sigma) = -\frac{1}{2}\sum_{n=1}^{N}t_n(x_n - \mu_1)^T\Sigma^{-1}(x_n - \mu_1) + \text{const} \qquad (7.69)
Setting the derivative with respect to \mu_1 to zero and rearranging, we obtain
\mu_1 = \frac{1}{N_1}\sum_{n=1}^{N}t_nx_n \qquad (7.70)
which is simply the mean of all the input vectors x_n assigned to class C_1. Similarly,
\mu_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)x_n \qquad (7.71)
Finally, consider the maximum likelihood solution for the shared covariance matrix \Sigma. Picking out the terms of the log likelihood that depend on \Sigma, we have
-\frac{1}{2}\sum_{n=1}^{N}t_n\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}t_n(x_n - \mu_1)^T\Sigma^{-1}(x_n - \mu_1) \qquad (7.72)
-\frac{1}{2}\sum_{n=1}^{N}(1 - t_n)\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(1 - t_n)(x_n - \mu_2)^T\Sigma^{-1}(x_n - \mu_2) \qquad (7.73)
= -\frac{N}{2}\ln|\Sigma| - \frac{N}{2}\mathrm{tr}\{\Sigma^{-1}S\} \qquad (7.74)
where we have defined
S = \frac{N_1}{N}S_1 + \frac{N_2}{N}S_2 \qquad (7.75)
S_1 = \frac{1}{N_1}\sum_{n\in C_1}(x_n - \mu_1)(x_n - \mu_1)^T \qquad (7.76)
S_2 = \frac{1}{N_2}\sum_{n\in C_2}(x_n - \mu_2)(x_n - \mu_2)^T \qquad (7.77)
Using the standard result for the maximum likelihood solution of a Gaussian distribution, we see that
\Sigma = S \qquad (7.78)
which represents a weighted average of the covariance matrices associated with each of the two classes separately. This result is easily extended to the K-class problem. Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust.
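A minimal sketch of the maximum likelihood estimates (7.68), (7.70)-(7.71) and (7.75)-(7.78) for the two-class shared-covariance model, with targets t in {0, 1}:

```python
import numpy as np

def fit_gaussian_shared_cov(X, t):
    """Return pi, mu1, mu2 and the shared covariance Sigma = (N1/N) S1 + (N2/N) S2."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    d1, d2 = X - mu1, X - mu2
    S1 = (t[:, None] * d1).T @ d1 / N1
    S2 = ((1 - t)[:, None] * d2).T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma
```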
Now assume the features are binary, x_i \in \{0, 1\}, and Bernoulli distributed, treated as independent conditional on the class (the naive Bayes assumption), so that the class-conditional distributions have the form
p(x|C_k) = \prod_{i=1}^{D}\mu_{ki}^{x_i}(1 - \mu_{ki})^{1 - x_i} \qquad (7.79)
Substituting into a_k = \ln\big(p(x|C_k)p(C_k)\big) gives
a_k(x) = \sum_{i=1}^{D}\{x_i\ln\mu_{ki} + (1 - x_i)\ln(1 - \mu_{ki})\} + \ln p(C_k)
which again are linear functions of the input values x_i. Analogous results are obtained for discrete variables, each of which can take M > 2 states.
As we have seen so far, for both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (K = 2 classes) or softmax (K \ge 2 classes) activation functions. These are particular cases of a more general result obtained by assuming that the class-conditional densities p(x|C_k) are members of the exponential family of distributions. Using the form 2.6 for members of the exponential family, the distribution of x can be written as
p(x|\lambda_k) = h(x)g(\lambda_k)\exp\{\lambda_k^Tu(x)\}
We restrict attention to the subclass for which u(x) = x. Making use of 2.119 to introduce a scaling parameter s, we then have
p(x|\lambda_k, s) = \frac{1}{s}h\Big(\frac{x}{s}\Big)g(\lambda_k)\exp\Big\{\frac{1}{s}\lambda_k^Tx\Big\} \qquad (7.82)
Note that each class has its own parameter vector \lambda_k, but all classes share the same scale parameter s.
For the two-class problem, the posterior is given by a logistic sigmoid acting on a linear function of x.
We have seen that the posterior probabilities for the two-class and multiclass cases can be written as a logistic and a softmax function, respectively, for a wide choice of class-conditional distributions p(x|C_k). For specific class-conditional densities p(x|C_k), we can use maximum likelihood to determine the parameters of the densities as well as the class priors p(C_k), and then use Bayes' theorem to find the posterior class probabilities. However, an alternative approach is to use the functional form of the generalized linear model explicitly and to determine its parameters directly by maximum likelihood, for example using iteratively reweighted least squares (IRLS).
The indirect approach of finding the parameters of a generalized linear model, by fitting class-conditional densities and class priors separately and then applying Bayes' theorem, is an example of generative modelling, because we could take such a model and generate synthetic data by drawing values of x from the marginal distribution p(x). In the direct approach, we maximize a likelihood function defined through the conditional distribution p(C_k|x), which represents a form of discriminative training.
So far, we have considered classification models that work directly with the original input vector x. However, all of the algorithms are equally applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions \phi(x). The resulting decision boundaries will be linear in the feature space \phi, corresponding to nonlinear decision boundaries in the original x space. One of the basis functions is typically set to a constant, say \phi_0(x) = 1, so that the corresponding parameter w_0 plays the role of a bias.
Note that nonlinear transformations cannot remove class overlap between the class-conditional densities p(x|C_k).
7.3.2.1 representation
We begin with two-class classification. The posterior probability of class C_1 can be written as a logistic sigmoid acting on a linear function of the feature vector \phi, so that
p(C_1|\phi) = y(\phi) = \sigma(w^T\phi)
so the number of adjustable parameters grows only linearly with the dimensionality of the feature space. A useful property of the logistic sigmoid is that its derivative can be written as
\frac{d\sigma}{da} = \sigma(1 - \sigma) \qquad (7.87)
7.3.2.2 evaluation
For a data set \{\phi_n, t_n\}, where t_n \in \{0, 1\} and \phi_n = \phi(x_n), with n = 1, ..., N, the likelihood function can be written as
p(t|w) = \prod_{n=1}^{N}y_n^{t_n}\{1 - y_n\}^{1 - t_n} \qquad (7.88)
where t = (t_1, ..., t_N)^T and y_n = p(C_1|\phi_n). The negative logarithm of the likelihood, which gives the cross-entropy error function, takes the form
E(w) = -\ln p(t|w) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\} \qquad (7.89)
Taking the gradient of the error function with respect to w, we obtain
\nabla E(w) = -\sum_{n=1}^{N}\Big\{\frac{t_n}{y_n}y_n(1 - y_n) - \frac{1 - t_n}{1 - y_n}y_n(1 - y_n)\Big\}\phi_n \qquad (7.90)
= -\sum_{n=1}^{N}\{t_n(1 - y_n) - (1 - t_n)y_n\}\phi_n \qquad (7.91)
= \sum_{n=1}^{N}(y_n - t_n)\phi_n \qquad (7.92)
We see that the factor involving the derivative of the logistic sigmoid has cancelled. In particular, the contribution to the gradient from data point n is given by the error y_n - t_n between the target value and the prediction of the model, times the basis function vector \phi_n, taking the same form as the gradient of the sum-of-squares error function for the linear regression model.
Note that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. The singularity can be avoided by including a prior and finding a MAP (maximum a posteriori) solution for w, or equivalently by adding a regularization term to the error function.
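A minimal sketch of batch gradient descent on the cross-entropy error (7.89), using the gradient (7.92); the step size and iteration count are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_gd(Phi, t, eta=0.1, n_iters=1000):
    """Minimize E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)] by gradient descent."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)      # eq. (7.92): sum_n (y_n - t_n) phi_n
        w -= eta * grad / len(t)    # averaging keeps the step size scale-free
    return w
```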
7.3.2.3 optimization
For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. The error function is convex, however, and can be minimized by an efficient iterative technique based on the Newton-Raphson optimization scheme, which uses a local quadratic approximation to the log likelihood function. The update takes the form
w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)
where H is the Hessian matrix whose elements comprise the second derivatives of E(w) with respect to the components of w.
Let us first apply the Newton-Raphson method to the linear regression model with the sum-of-squares error function. The gradient and Hessian of this error function are given by
\nabla E(w) = \sum_{n=1}^{N}(w^T\phi_n - t_n)\phi_n = \Phi^T\Phi w - \Phi^Tt \qquad (7.94)
H = \nabla\nabla E(w) = \sum_{n=1}^{N}\phi_n\phi_n^T = \Phi^T\Phi \qquad (7.95)
where \Phi is the N \times M design matrix, whose nth row is given by \phi_n^T. The Newton-Raphson update then recovers the standard least-squares solution in a single step. For logistic regression, the gradient and Hessian are \nabla E(w) = \Phi^T(y - t) and H = \Phi^TR\Phi, where R is the N \times N diagonal weighting matrix with elements
R_{nn} = y_n(1 - y_n) \qquad (7.100)
The Newton-Raphson update for logistic regression then takes the form w^{(new)} = (\Phi^TR\Phi)^{-1}\Phi^TRz, where
z = \Phi w^{(old)} - R^{-1}(y - t) \qquad (7.104)
We see that the update formula takes the form of a set of normal equations for a weighted least-squares problem. Because the weighting matrix R depends on the parameter vector w, we must apply the normal equations iteratively, each time using the new weight vector w to compute a revised weighting matrix R. For this reason, the algorithm is known as iteratively reweighted least squares, or IRLS. The elements of the diagonal weighting matrix R can be interpreted as variances, because the mean and variance of t in the logistic regression model are given by E[t] = y and var[t] = E[t^2] - E[t]^2 = y - y^2 = y(1 - y), using t^2 = t for t \in \{0, 1\}.
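A minimal sketch of IRLS for logistic regression, iterating the weighted least-squares update w <- (Phi^T R Phi)^{-1} Phi^T R z; the small ridge term and the clipping of R are illustrative safeguards against singular Hessians (e.g. for separable data), not part of the algorithm as stated in the text:

```python
import numpy as np

def irls_logreg(Phi, t, n_iters=20, ridge=1e-8):
    """Iteratively reweighted least squares for the logistic regression model."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = y * (1 - y)                                 # diagonal of the weighting matrix
        z = Phi @ w - (y - t) / np.maximum(R, 1e-12)    # working response
        H = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(M)
        w = np.linalg.solve(H, Phi.T @ (R * z))
    return w
```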
7.3.3.2 Linearization
In fact, we can interpret IRLS as the solution to a linearized problem in the space of the variable a = w^T\phi. Applying a Taylor expansion around w^{(old)}, we obtain
a_n(w) \simeq a_n(w^{(old)}) + \frac{da_n}{dy_n}\bigg|_{w^{(old)}}(t_n - y_n) \qquad (7.107)
= \phi_n^Tw^{(old)} - \frac{y_n - t_n}{y_n(1 - y_n)} = z_n \qquad (7.108)
In our discussion of generative models for multiclass classification, we saw that the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that
p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j\exp(a_j)} \qquad (7.109)
The softmax function is
\sigma(z_j) = \frac{\exp z_j}{\sum_{k=1}^{K}\exp z_k}, \quad j = 1, ..., K \qquad (7.112)
= \frac{\exp(z_j - \max z)}{\sum_{k=1}^{K}\exp(z_k - \max z)} \qquad (7.113)
\log\sigma(z_j) = z_j - \max z - \log\sum_{k=1}^{K}\exp(z_k - \max z) \qquad (7.114)
where the second form, with the maximum subtracted, is the one typically used in implementations because it avoids numerical overflow. Here the activations are given by
a_k = w_k^T\phi \qquad (7.115)
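A minimal NumPy sketch of the max-subtraction trick in (7.113)-(7.114); the stability motivation is an added remark, and the example values are only illustrative:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax: z_j - max(z) - log(sum_k exp(z_k - max(z)))."""
    z = z - np.max(z, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def softmax(z):
    return np.exp(log_softmax(z))

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # no overflow despite large scores
```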
Using the 1-of-K coding scheme, the likelihood function is
p(T|w_1, ..., w_K) = \prod_{n=1}^{N}\prod_{k=1}^{K}y_{nk}^{t_{nk}}
where y_{nk} = y_k(\phi_n), and T is an N \times K matrix of target variables with elements t_{nk}. Taking the negative logarithm then gives
E(w_1, ..., w_K) = -\ln p(T|w_1, ..., w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{nk} \qquad (7.119)
which is known as the cross-entropy error function for the multiclass classification problem.
Taking the gradient of the error function, and making use of the derivatives of the softmax function (writing p_k for the predicted probabilities of a single example with targets t_k), we obtain
\frac{\partial E_n}{\partial a_j} = -\sum_k t_k\frac{\partial\log p_k}{\partial a_j} \qquad (7.120)
= -\sum_k\frac{t_k}{p_k}\frac{\partial p_k}{\partial a_j} \qquad (7.121)
= -t_j(1 - p_j) - \sum_{k\ne j}\frac{t_k}{p_k}(-p_kp_j) \qquad (7.122)
= -t_j(1 - p_j) + \sum_{k\ne j}t_kp_j \qquad (7.123)
= -t_j + t_jp_j + \sum_{k\ne j}t_kp_j \qquad (7.124)
= p_j\Big(\sum_kt_k\Big) - t_j \qquad (7.125)
= p_j - t_j \qquad (7.126)
= y_j - t_j \qquad (7.127)
\nabla_{w_j}E(w_1, ..., w_K) = \sum_{n=1}^{N}(y_{nj} - t_{nj})\phi_n \qquad (7.129)
Stacking the per-class gradients into a matrix,
\nabla_WE(w_1, ..., w_K) = \Phi^T(Y - T) \qquad (7.130)
where \phi_n = \phi(x_n) is the basis-function vector of x_n, of dimension M \times 1, Y is the N \times K prediction matrix, T is the N \times K target label matrix, \Phi is the N \times M design matrix, and W is the M \times K weight matrix of parameters.
The Hessian matrix is also positive definite and so the error function has a unique minimum.
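A minimal sketch of the batch gradient (7.130), with one gradient-descent step; shapes follow the text (Phi: N x M, W: M x K, T: N x K one-hot), and the step size is illustrative:

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def softmax_regression_step(Phi, T, W, eta=0.1):
    """One gradient-descent step on the multiclass cross-entropy error."""
    Y = softmax_rows(Phi @ W)     # N x K predicted class probabilities
    grad = Phi.T @ (Y - T)        # eq. (7.130)
    return W - eta * grad / Phi.shape[0]
```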
For a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities.
The cumulative distribution function of the standard Gaussian is given by
\Phi(a) = \int_{-\infty}^{a}\mathcal{N}(\theta|0, 1)\,d\theta \qquad (7.133)
which is known as the probit function. It has a sigmoidal shape. A closely related function is
\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta \qquad (7.134)
known as the erf function or error function. The generalized linear model based on a probit activation function is known as probit regression.
One issue that occurs in practical applications is that of outliers. Because the tails of the probit activation function decay like \exp(-x^2), faster than the \exp(-x) decay of the logistic sigmoid, the probit model can be significantly more sensitive to outliers. However, the effect of mislabelling is easily incorporated into a probabilistic model by introducing a probability \epsilon that the target value t has been flipped to the wrong value,
p(t|x) = (1 - \epsilon)\sigma(x) + \epsilon(1 - \sigma(x)) = \epsilon + (1 - 2\epsilon)\sigma(x)
where \sigma(x) is the activation function with input vector x. Here \epsilon may be set in advance, or it may be treated as a hyperparameter whose value is inferred from the data.
We now show that there is a general result: if we assume a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function, the gradient of the error function takes a particularly simple form. Making use of the restricted form 7.82 of exponential family distributions for the target variable t,
p(t|\eta, s) = \frac{1}{s}h\Big(\frac{t}{s}\Big)g(\eta)\exp\Big\{\frac{\eta t}{s}\Big\} \qquad (7.137)
The conditional mean of t, which we denote by y, is given by
y \equiv E[t|\eta] = -s\frac{d}{d\eta}\ln g(\eta) \qquad (7.138)
where we assume that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance), so s is independent of n. The derivative of the log likelihood with respect to the model parameters w is then given by
\nabla_w\ln p(t|\eta, s) = \sum_{n=1}^{N}\Big\{\frac{d}{d\eta_n}\ln g(\eta_n) + \frac{t_n}{s}\Big\}\frac{d\eta_n}{dy_n}\frac{dy_n}{da_n}\nabla a_n \qquad (7.141)
= \sum_{n=1}^{N}\frac{1}{s}\{t_n - y_n\}\psi'(y_n)f'(a_n)\phi_n \qquad (7.142)
where a_n = w^T\phi_n, and we have used y_n = f(a_n) together with \eta = \psi(y). There is a considerable simplification if we choose a particular form for the link function f^{-1}(y), namely
f^{-1}(y) = \psi(y) \qquad (7.143)
In this case, the gradient of the error function reduces to
\nabla E(w) = \frac{1}{s}\sum_{n=1}^{N}\{y_n - t_n\}\phi_n \qquad (7.144)
The applicability of models comprising linear combinations of fixed basis functions is limited by the curse of dimensionality. In order to apply such models to large-scale problems, it is necessary to adapt the basis functions to the data.
Support vector machines and relevance vector machines choose a subset from a fixed set of basis functions, resulting in much sparser models.
An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive, in other words to use parametric forms for the basis functions.
The linear models discussed so far are based on linear combinations of fixed nonlinear basis functions \phi_j(x) and take the form
y(x, w) = f\Big(\sum_{j=1}^{M}w_j\phi_j(x)\Big) = f(w^T\phi(x)) \qquad (8.1)
where f(\cdot) is a nonlinear activation function in the case of classification and is the identity in the case of regression.
A neural network model can be described as a series of functional transformations. First construct M linear combinations of the input variables x_1, ..., x_D in the form
a_j^{(l+1)} = \sum_{i=1}^{D}w_{ji}^{(l)}x_i + w_{j0}^{(l)} \qquad (8.2)
= w^{(l)}x \qquad (8.3)
where j = 1, ..., M, and the superscript (l) indicates the lth layer of the network. We refer to the parameters w_{ji}^{(l)} as weights and to w_{j0}^{(l)} as biases. The quantities a_j are known as activations. Each of them is then transformed using a nonlinear activation function h(\cdot) to give
z_j = h(a_j) \qquad (8.4)
These quantities correspond to the outputs of the basis functions in the linear model and, in the context of neural networks, are called hidden units. The nonlinear functions h(\cdot) are generally chosen to be sigmoidal or the tanh function.
The transformations in the following layer of the network combine these values to give the output unit activations
a_k^{(2)} = \sum_{j=1}^{M}w_{kj}^{(2)}z_j + w_{k0}^{(2)} \qquad (8.5)
z_k = h\Big(\sum_jw_{kj}z_j\Big) \qquad (8.6)
where the sum runs over all units that send connections to unit k (including the bias parameter).
Consider a two-layer network as shown before. For M hidden units with odd activation functions such as tanh, there will be M sign-flip symmetries, i.e., 2^M equivalent weight vectors. Similarly, interchanging all of the weights (and the bias) leading both into and out of one hidden unit with the corresponding values for a different hidden unit gives rise to the same input-output mapping; this yields M! equivalent weight vectors. The network therefore has an overall weight-space symmetry factor of M!\,2^M.
Given a training set comprising a set of input vectors x_n, where n = 1, ..., N, together with a corresponding set of target vectors t_n, we minimize an error function. We can provide a probabilistic interpretation of the network outputs.
For regression, consider for the moment a single target variable t that can take any real value. We assume that t has a Gaussian distribution with an x-dependent mean, which is given by the output of the neural network,
p(t|x, w) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
where \beta is the precision (inverse variance) of the Gaussian noise. We maximize the likelihood function as in the linear models. Viewing the network as having an output activation function that is the identity, so that y_k = a_k, the corresponding sum-of-squares error function has the property
\frac{\partial E}{\partial a_k} = \frac{\partial E}{\partial y_k} = y_k - t_k \qquad (8.8)
For binary classification in which t \in \{0, 1\}, consider a network having a single output whose activation function is a logistic sigmoid,
y = \sigma(a) = \frac{1}{1 + \exp(-a)} \qquad (8.9)
We can interpret y(x, w) as the conditional probability p(C_1|x). The conditional distribution of targets given inputs is then a Bernoulli distribution of the form
p(t|x, w) = y(x, w)^t\{1 - y(x, w)\}^{1 - t}
Taking the negative logarithm of the corresponding likelihood function (here for K separate binary classifications), we get the cross-entropy error function
E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K}\{t_{nk}\ln y_{nk} + (1 - t_{nk})\ln(1 - y_{nk})\} \qquad (8.12)
The derivative of the error function with respect to the activation for a particular output unit again takes the form y_k - t_k. For multiclass classification, following multiclass logistic regression, the output unit activation function is given by the softmax function.
We turn next to the task of finding a weight vector w which minimizes the chosen function E(w), and resort to iterative numerical procedures.
Fig. 8.2: Geometrical view of the error function E(w) as a surface sitting over weight space. A is a local minimum and B is the global minimum. The local gradient of the error surface is given by the vector \nabla E.
Most techniques involve choosing some initial value w^{(0)} for the weight vector and then moving through weight space in a succession of steps of the form
w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)} \qquad (8.17)
Many algorithms make use of gradient information, evaluating \nabla E(w) at the new weight vector w^{(\tau+1)}. Consider the Taylor expansion of E(w) around some point in weight space. It is possible to evaluate the gradient of an error function efficiently by means of the backpropagation procedure. Because each evaluation of \nabla E brings W items of information, where W is the total number of weights and biases, we might hope to find the minimum of the function in O(W) gradient evaluations. Each such evaluation takes only O(W) steps, and so the minimum can then be found in O(W^2) steps.
The goal in this section is to find an efficient technique for evaluating the gradient of an error function E(w) for a feed-forward neural network. This can be achieved using a local message-passing scheme in which information is sent alternately forwards and backwards through the network, known as error backpropagation or backprop.
Many error functions of practical interest, for instance those defined by maximum likelihood for a set of i.i.d. data, comprise a sum of terms, one for each data point, so that
E(w) = \sum_{n=1}^{N}E_n(w) \qquad (8.20)
In a general feed-forward network, each unit computes a weighted sum of its inputs of the form
a_j = \sum_iw_{ji}z_i \qquad (8.21)
where z_i is the activation of a unit that sends a connection to unit j, and w_{ji} is the weight associated with that connection. The sum is transformed by an activation function h(\cdot) to give the activation z_j of unit j in the form
z_j = h(a_j) \qquad (8.22)
We evaluate the derivative of E_n with respect to a weight w_{ji} by applying the chain rule for partial derivatives,
\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} \qquad (8.23)
Introducing the notation \delta_j \equiv \partial E_n/\partial a_j and noting that
\frac{\partial a_j}{\partial w_{ji}} = z_i \qquad (8.25)
we obtain
\frac{\partial E_n}{\partial w_{ji}} = \delta_jz_i \qquad (8.26)
For the output units,
\delta_k = y_k - t_k \qquad (8.27)
k = yk tk (8.27)
For hidden units,make use of the chain rule for partial derivatives
En En ak
aj
j = (8.28)
k ak a j
(8.29)
where the sum runs over all units k to which unit j sends connections.
En
k = (8.30)
ak
En En ak
aj
j = (8.31)
k ak a j
ak
= k (8.32)
k aj
= k h (a j )wk j (8.33)
k
= h (a j ) wk j k
(8.34)
k
We'll use w^l_{jk} to denote the weight for the connection from the kth neuron in the (l-1)th layer to the jth neuron in the lth layer, b^l_j for the bias of the jth neuron in the lth layer, and a^l_j for the activation (weighted input) of the jth neuron in the lth layer; \sigma denotes the element-wise activation function, and z^l_j denotes the output of the jth neuron in the lth layer. Then we have
a^l \triangleq W^lz^{l-1} + b^l \qquad (8.35)
z^l \triangleq \sigma(a^l) \qquad (8.36)
(s\odot t)_j = s_jt_j \qquad (8.37)
\delta^l_j \triangleq \frac{\partial C}{\partial a^l_j} \qquad (8.38)
For the output layer L,
\delta^L_j = \frac{\partial C}{\partial a^L_j} \qquad (8.39)
= \sum_k\frac{\partial C}{\partial z^L_k}\frac{\partial z^L_k}{\partial a^L_j} \qquad (8.40)
= \frac{\partial C}{\partial z^L_j}\frac{\partial z^L_j}{\partial a^L_j}, \quad \text{since each output depends only on the input of the same neuron} \qquad (8.41)
= \frac{\partial C}{\partial z^L_j}\sigma'(a^L_j) \qquad (8.42)
In matrix form,
\delta^L = \nabla_zC\odot\sigma'(a^L) \qquad (8.43)
\delta^L = \Sigma'(a^L)\nabla_zC \qquad (8.44)
where \Sigma'(a^L) is a square matrix whose diagonal entries are the values \sigma'(a^L_j), and whose off-diagonal entries are zero.
Propagation (recursive relationship) in the hidden layers:
\delta^l_j = \frac{\partial C}{\partial a^l_j} \qquad (8.45)
= \sum_k\frac{\partial C}{\partial a^{l+1}_k}\frac{\partial a^{l+1}_k}{\partial a^l_j} \qquad (8.46)
= \sum_k\frac{\partial a^{l+1}_k}{\partial a^l_j}\delta^{l+1}_k \qquad (8.47)
Since
a^{l+1}_k = \sum_jw^{l+1}_{kj}z^l_j + b^{l+1}_k = \sum_jw^{l+1}_{kj}\sigma(a^l_j) + b^{l+1}_k \qquad (8.48)
we have
\frac{\partial a^{l+1}_k}{\partial a^l_j} = w^{l+1}_{kj}\sigma'(a^l_j) \qquad (8.49)
\delta^l_j = \sum_kw^{l+1}_{kj}\delta^{l+1}_k\sigma'(a^l_j) \qquad (8.50)
In matrix form,
\delta^l = \big((W^{l+1})^T\delta^{l+1}\big)\odot\sigma'(a^l) \qquad (8.51)
= \Sigma'(a^l)(W^{l+1})^T\delta^{l+1} \qquad (8.52)
Rate of change of the cost with respect to any bias in the network:
\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j}\frac{\partial a^l_j}{\partial b^l_j} = \delta^l_j\cdot 1 = \delta^l_j \qquad (8.55)
Rate of change of the cost with respect to any weight in the network:
\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial a^l_j}\frac{\partial a^l_j}{\partial w^l_{jk}} = \delta^l_jz^{l-1}_k \qquad (8.57)
or, compactly,
\frac{\partial C}{\partial w} = \delta_{out}\,z_{in}^T \qquad (8.58)
When dealing with a batch of training instances, we sum the derivatives with respect to the parameters over these instances; a sketch is given below.
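A minimal sketch of the forward/backward passes (8.35)-(8.58) for a two-layer network with sigmoid hidden units, an identity output and a sum-of-squares cost, so that the output delta is y - t; the layer sizes implied by the caller are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, t, W1, b1, W2, b2):
    """Return prediction and gradients of C = 0.5*||y - t||^2 for one training example."""
    # Forward pass
    a1 = W1 @ x + b1          # hidden-layer activations
    z1 = sigmoid(a1)          # hidden-layer outputs
    y = W2 @ z1 + b2          # identity output unit: y = a2
    # Backward pass
    delta2 = y - t                               # output delta for sum-of-squares + identity
    delta1 = (W2.T @ delta2) * z1 * (1 - z1)     # eq. (8.51) with sigma'(a) = z(1 - z)
    grads = {
        "W2": np.outer(delta2, z1), "b2": delta2,   # eqs. (8.57)-(8.58)
        "W1": np.outer(delta1, x),  "b1": delta1,   # eq. (8.55)
    }
    return y, grads
```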
Backpropagation can also be applied to the calculation of other derivatives. Consider the evaluation of the Jacobian matrix, whose elements are given by the derivatives of the network outputs with respect to the inputs,
J_{ki} = \frac{\partial y_k}{\partial x_i} \qquad (8.59)
9.1 Introduction
One significant limitation of kernel methods is that the kernel function \kappa(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points, which can be computationally infeasible. Kernel-based algorithms that have sparse solutions, so that predictions for new inputs depend only on the kernel function evaluated at a subset of the training data points, include the support vector machine (SVM).
The functional margin satisfies t_ny(x_n) > 0 for correctly classified data points. We maximize the margin of the point closest to the surface,
w, b = \arg\max_{w,b}\Big\{\frac{1}{\|w\|}\min_n\big[t_n(w^T\phi(x_n) + b)\big]\Big\} \qquad (9.2)
Rescaling w and b so that the closest point satisfies t_n(w^T\phi(x_n) + b) = 1, all data points then satisfy the constraints
t_n(w^T\phi(x_n) + b) \ge 1, \quad n = 1, ..., N
This is the canonical representation of the decision hyperplane. The optimization problem is then equivalent to
\arg\min_{w,b}\frac{1}{2}\|w\|^2 \qquad (9.5)
subject to the above constraints. Introducing Lagrange multipliers a_n \ge 0 gives the Lagrangian
L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}a_n\{t_n(w^T\phi(x_n) + b) - 1\}
where a = (a_1, ..., a_N)^T. Setting the derivatives of L with respect to w and b equal to zero, we obtain the conditions
w = \sum_{n=1}^{N}a_nt_n\phi(x_n) \qquad (9.7)
0 = \sum_{n=1}^{N}a_nt_n \qquad (9.8)
Eliminating w and b gives the dual representation of the maximum margin problem, in which we maximize
\tilde{L}(a) = \sum_{n=1}^{N}a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_na_mt_nt_mk(x_n, x_m) \qquad (9.9)
subject to
a_n \ge 0, \quad n = 1, ..., N \qquad (9.10)
\sum_{n=1}^{N}a_nt_n = 0 \qquad (9.11)
The Karush-Kuhn-Tucker (KKT) conditions require that the following properties hold:
a_n \ge 0 \qquad (9.13)
t_ny(x_n) - 1 \ge 0 \qquad (9.14)
a_n\{t_ny(x_n) - 1\} = 0 \qquad (9.15)
Data points for which a_n = 0 do not contribute to the predictive model; the remaining ones are called support vectors. They lie on the maximum-margin hyperplanes in feature space.
Having solved the quadratic programming problem, the threshold parameter b is determined from the condition t_ny(x_n) = 1 for the support vectors, where S denotes the set of indices of the support vectors. Multiplying through by t_n, making use of t_n^2 = 1, and averaging these equations over all support vectors,
b = \frac{1}{N_S}\sum_{n\in S}\Big(t_n - \sum_{m\in S}a_mt_mk(x_n, x_m)\Big) \qquad (9.17)
where N_S is the total number of support vectors. We can also express the maximum-margin classifier in terms of the minimization of an error function with a quadratic regularizer,
\sum_{n=1}^{N}E_\infty(y(x_n)t_n - 1) + \lambda\|w\|^2 \qquad (9.18)
where E_\infty(z) is zero if z \ge 0 and \infty otherwise, so that the constraints are enforced exactly.
In practice, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization. We therefore introduce slack variables \xi_n \ge 0, n = 1, ..., N, one for each training data point. We allow points to lie on the wrong side of the margin boundary, but with a penalty that increases with the distance from that boundary; for such points
\xi_n = |t_n - y(x_n)| \qquad (9.19)
This is described as relaxing the hard margin constraint to give a soft margin, allowing some training set data points to be misclassified.
We now maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary, minimizing
C\sum_{n=1}^{N}\xi_n + \frac{1}{2}\|w\|^2 \qquad (9.21)
where the parameter C > 0 controls the trade-off between the slack-variable penalty and the margin, i.e., between minimizing training errors and controlling model complexity. In the limit C \to \infty, we recover the earlier hard-margin classifier. The constrained minimization gives the Lagrangian
L(w, b, a) = \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N}a_n\{t_ny(x_n) - 1 + \xi_n\} - \sum_{n=1}^{N}\mu_n\xi_n \qquad (9.22)
where \{a_n \ge 0\} and \{\mu_n \ge 0\} are Lagrange multipliers. The KKT conditions are given by
a_n \ge 0 \qquad (9.23)
t_ny(x_n) - 1 + \xi_n \ge 0 \qquad (9.24)
a_n(t_ny(x_n) - 1 + \xi_n) = 0 \qquad (9.25)
\mu_n \ge 0 \qquad (9.26)
\xi_n \ge 0 \qquad (9.27)
\mu_n\xi_n = 0 \qquad (9.28)
where n = 1, ..., N.
We now optimize out w, b, and \{\xi_n\}:
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{N}a_nt_n\phi(x_n) \qquad (9.29)
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N}a_nt_n = 0 \qquad (9.30)
\frac{\partial L}{\partial\xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n \qquad (9.31)
Eliminating these variables, the dual Lagrangian takes the form
\tilde{L}(a) = \sum_{n=1}^{N}a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_na_mt_nt_mk(x_n, x_m) \qquad (9.32)
which we maximize subject to
0 \le a_n \le C \qquad (9.33)
\sum_{n=1}^{N}a_nt_n = 0 \qquad (9.34)
Support vectors with a_n > 0 satisfy
t_ny(x_n) = 1 - \xi_n \qquad (9.35)
If a_n < C, then 9.31 implies that \mu_n > 0, which requires \xi_n = 0, and hence such points lie on the margin. Points with a_n = C can lie inside the margin and are either correctly classified if \xi_n \le 1 or misclassified if \xi_n > 1.
To determine b, we note that support vectors for which 0 < a_n < C have \xi_n = 0, so that t_ny(x_n) = 1, and hence satisfy the same condition used before to compute b.
A related formulation is the multiclass SVM loss used with a linear score function: the score for the j-th class is the j-th element, s_j = f(x_i, W)_j. The multiclass SVM loss for the i-th example is then
L_i = \sum_{j\ne y_i}\max(0, s_j - s_{y_i} + \Delta)
With the linear score function f(x_i; W) = Wx_i, we can rewrite the loss function in the equivalent form
L_i = \sum_{j\ne y_i}\max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta)
The thresholding at zero, \max(0, \cdot), is called the hinge loss, and there is also a squared hinge loss (L2-SVM). The gradient of the loss function for a single point i with respect to the row w_{y_i} is
\nabla_{w_{y_i}}L_i = -\Big(\sum_{j\ne y_i}\mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\Big)x_i \qquad (9.40)
where \mathbb{1} is the indicator function that is one if the condition inside is true and zero otherwise. For the other rows, where j \ne y_i, the gradient is
\nabla_{w_j}L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\,x_i
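A minimal NumPy sketch of the multiclass hinge loss and its gradient (9.40) for a single example, with scores s = W x_i; the margin value Delta = 1 is an illustrative choice:

```python
import numpy as np

def multiclass_svm_loss_grad(W, x, y, delta=1.0):
    """Hinge loss L_i = sum_{j != y} max(0, s_j - s_y + delta) and its gradient dL/dW."""
    s = W @ x                             # class scores, shape (K,)
    margins = s - s[y] + delta
    margins[y] = 0.0                      # the correct class does not contribute
    active = (margins > 0).astype(float)  # indicator of violated margins
    loss = np.sum(margins[margins > 0])
    dW = np.outer(active, x)              # rows j != y: 1(margin_j > 0) * x
    dW[y] = -active.sum() * x             # row y: -(number of violated margins) * x
    return loss, dW
```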
9.2.2.1 \nu-SVM
An alternative, equivalent formulation maximizes
\tilde{L}(a) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}a_na_mt_nt_mk(x_n, x_m) \qquad (9.42)
subject to the constraints
0 \le a_n \le 1/N \qquad (9.43)
\sum_{n=1}^{N}a_nt_n = 0 \qquad (9.44)
\sum_{n=1}^{N}a_n \ge \nu \qquad (9.45)
The parameter \nu, which replaces C, can be interpreted as both an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
9.2.2.2 Optimization
Chunking: the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix corresponding to Lagrange multipliers that are zero. Chunking can be implemented using protected conjugate gradients.
Decomposition methods solve a series of smaller quadratic programming problems, designed so that each of these is of a fixed size.
Sequential minimal optimization, or SMO, takes the concept of chunking to the extreme limit and considers just two Lagrange multipliers at a time. Heuristics are given for choosing the pair of Lagrange multipliers to be considered at each step.
Support vector machines manage to avoid the curse of dimensionality because there are constraints amongst the feature values that restrict the effective dimensionality of the feature space.
We can construct an error function by taking the negative logarithm of the likelihood function with a quadratic regularizer,
\sum_{n=1}^{N}E_{LR}(y_nt_n) + \lambda\|w\|^2 \qquad (9.49)
For regression, the analogous regularized sum-of-squares error is
\frac{1}{2}\sum_{n=1}^{N}\{y_n - t_n\}^2 + \frac{\lambda}{2}\|w\|^2 \qquad (9.51)
To obtain sparse solutions, the quadratic error function is replaced by an \epsilon-insensitive error function, for example
E_\epsilon(y(x) - t) = \begin{cases}0, & \text{if } |y(x) - t| < \epsilon \\ |y(x) - t| - \epsilon, & \text{otherwise}\end{cases} \qquad (9.52)
We therefore minimize
C\sum_{n=1}^{N}E_\epsilon(y(x_n) - t_n) + \frac{1}{2}\|w\|^2 \qquad (9.53)
where y(x) is the prediction function. The (inverse) regularization parameter, denoted C, appears in front of the error term.
We introduce two slack variables \xi_n \ge 0 and \hat{\xi}_n \ge 0 for each data point x_n. The condition for a target point to lie inside the \epsilon-tube is that y(x_n) - \epsilon \le t_n \le y(x_n) + \epsilon; for points outside the tube,
t_n \le y(x_n) + \epsilon + \xi_n \qquad (9.55)
t_n \ge y(x_n) - \epsilon - \hat{\xi}_n \qquad (9.56)
The error function for support vector regression can then be written as
C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|w\|^2 \qquad (9.57)
Substituting for y(x) using 9.1 and setting the derivatives of the Lagrangian with respect to w, b, \xi_n, and \hat{\xi}_n to zero gives
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{N}(a_n - \hat{a}_n)\phi(x_n) \qquad (9.60)
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N}(a_n - \hat{a}_n) = 0 \qquad (9.61)
\frac{\partial L}{\partial\xi_n} = 0 \;\Rightarrow\; a_n + \mu_n = C \qquad (9.62)
\frac{\partial L}{\partial\hat{\xi}_n} = 0 \;\Rightarrow\; \hat{a}_n + \hat{\mu}_n = C \qquad (9.63)
Using these to eliminate the corresponding variables, we see that the dual problem involves maximizing
\tilde{L}(a, \hat{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_n - \hat{a}_n)(a_m - \hat{a}_m)k(x_n, x_m) - \epsilon\sum_{n=1}^{N}(a_n + \hat{a}_n) + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \qquad (9.65)
subject to
0 \le a_n \le C \qquad (9.67)
0 \le \hat{a}_n \le C \qquad (9.68)
The corresponding Karush-Kuhn-Tucker (KKT) conditions, which state that at the solution the product of the dual variables and the constraints must vanish, are given by
a_n(\epsilon + \xi_n + y_n - t_n) = 0 \qquad (9.70)
\hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) = 0 \qquad (9.71)
(C - a_n)\xi_n = 0 \qquad (9.72)
(C - \hat{a}_n)\hat{\xi}_n = 0 \qquad (9.73)
The support vectors are those data points that contribute to predictions, in other words those for which either a_n \ne 0 or \hat{a}_n \ne 0. These are points lying on the boundary of the \epsilon-tube or outside the tube.
The parameter b can be found from a point with 0 < a_n < C, for which \xi_n = 0 and hence \epsilon + y_n - t_n = 0, giving
b = t_n - \epsilon - w^T\phi(x_n) \qquad (9.75)
= t_n - \epsilon - \sum_{m=1}^{N}(a_m - \hat{a}_m)k(x_n, x_m) \qquad (9.76)
As with classification, there is a \nu formulation of support vector regression, in which we maximize the dual subject to the constraints
0 \le a_n \le C/N \qquad (9.79)
0 \le \hat{a}_n \le C/N \qquad (9.80)
\sum_{n=1}^{N}(a_n - \hat{a}_n) = 0 \qquad (9.81)
\sum_{n=1}^{N}(a_n + \hat{a}_n) \le \nu C \qquad (9.82)
There are at most \nu N data points falling outside the insensitive tube, while at least \nu N data points are support vectors and so lie either on the tube or outside it.
TODO
Chapter 10
Graphical Models
10.1 Introduction
The joint distribution over a set of variables can be written in the form of a product of factors

p(x) = \prod_{s} f_s(x_s)

where x_s denotes a subset of the variables. We denote individual variables by x_i, which can comprise groups of variables (such as vectors or matrices). Each factor f_s is a function of the corresponding set of variables x_s. In directed graphs the factors f_s(x_s) are local conditional distributions; in undirected graphs the factors are potential functions over the maximal cliques.
The sum-product algorithm solves the problem of evaluating local marginals over nodes or subsets of nodes, and the max-sum algorithm finds the most probable state by evaluating the maximum of the joint distribution.
The marginal at a variable node x is obtained from the incoming factor-to-variable messages by

p(x) = \prod_{s \in ne(x)} \mu_{f_s \to x}(x) (10.6)
Recursive inference in the sub-graphs, interchanging sums and products, leads to the message-passing equations, where ne(x) denotes the set of neighbours of a node. The messages from variable nodes to factor nodes are

\mu_{x_m \to f_s}(x_m) = \prod_{l \in ne(x_m)\setminus f_s} \mu_{f_l \to x_m}(x_m) (10.12)
The sum-product algorithm allows us to take a joint distribution p(x) expressed as a factor graph and efficiently find the marginals of component variables. Max-sum finds a setting of the variables that has the largest probability, and is an application of dynamic programming: it maximizes the joint distribution by making use of the distributive law for multiplication, with the max operator playing the role of the sum operator.
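As a concrete sketch of sum-product on the simplest possible graph (a three-node chain with potentials of my own choosing, not from the text), the marginal of the middle variable is the product of its incoming messages, which matches brute-force summation of the joint.

# Sum-product on a chain x1 - x2 - x3 of K-state variables, p(x) ∝ psi1(x1) psi12(x1,x2) psi23(x2,x3).
import numpy as np

K = 3
rng = np.random.default_rng(1)
psi1 = rng.uniform(size=K)              # unary potential on x1
psi12 = rng.uniform(size=(K, K))        # pairwise potential on (x1, x2)
psi23 = rng.uniform(size=(K, K))        # pairwise potential on (x2, x3)

mu_12 = psi12.T @ psi1                  # message from factor psi12 into x2 (sum over x1)
mu_32 = psi23 @ np.ones(K)              # message from factor psi23 into x2 (sum over x3)

p_x2 = mu_12 * mu_32                    # unnormalized marginal = product of incoming messages
p_x2 /= p_x2.sum()

# Brute-force check by summing the full joint over x1 and x3.
joint = psi1[:, None, None] * psi12[:, :, None] * psi23[None, :, :]
p_x2_brute = joint.sum(axis=(0, 2))
p_x2_brute /= p_x2_brute.sum()
assert np.allclose(p_x2, p_x2_brute)
print(p_x2)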
11.1 Introduction
If we define a joint distribution over observed and latent variables, the corresponding distribution of the observed variables alone is obtained by marginalization. Mixture distributions can be interpreted in terms of discrete latent variables. As well as providing a framework for building more complex probability distributions, mixture models can also be used to cluster data.
11.2.1 representation
Suppose we have a data set {x1 , ..., xN } consisting of N observations of a random D-dimensional Euclidean variable
x.Our goal is to partition the data set into some number K of clusters.First introduce a set of D-dimensional vectors
k ,where k = 1, ..., K,in which k is a prototype associated with the kth cluster,representing the centers of the clusters.
Our goal is then to find an assignment of data points to clusters,as well as a set of vectors {k },such that the sum
of squares of the distances of each data point to its closest vector k ,is a minimum.
A corresponding set of binary indicator variables rnk {0, 1},describing which of the K clusters the data point xn is
assigned to,so that if data point xn is assigned to cluster k then rnk = 1,and rn j = 0 for j = k.This is the 1-of-K coding
scheme.
11.2.2 evaluation
The objective function, sometimes called a distortion measure, is the sum of squared distances of each data point to its assigned prototype,

J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\|x_n - \mu_k\|^2
11.2.3 optimization
Our goal is to find values for the {r_{nk}} and the {\mu_k} so as to minimize J. We can do this through an iterative procedure in which each iteration involves two successive steps, corresponding to successive optimizations with respect to the r_{nk} and the \mu_k, which can be interpreted as the parameters of each cluster's probability distribution.
Consider first the determination of the r_{nk}. We simply assign the nth data point to the closest cluster centre:
r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0, & \text{otherwise} \end{cases} (11.2)
Note that

\arg\min_j \|x_n - \mu_j\|^2 = \arg\max_j\left(x_n^T\mu_j - \frac{\mu_j^T\mu_j}{2}\right) (11.5)
Now consider the optimization of the \mu_k with the r_{nk} held fixed. Setting the derivative of the objective function J with respect to \mu_k to zero gives

\nabla_{\mu_k} J = -2\sum_{n=1}^{N} r_{nk}(x_n - \mu_k) = 0 (11.6)
\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}} (11.7)
\mu_k^T = \frac{\sum_n r_{nk} x_n^T}{\sum_n r_{nk}} (11.8)

which can be vectorized in an implementation.
Algorithm 1: K-means coordinate descent
Input: A set X = {x_1, x_2, ..., x_N}
Output: {r_{nk}} and {\mu_k} such that J is minimized.
1 initialize \mu_k;
2 repeat
3   E (expectation) step. Minimize J with respect to the r_{nk}, keeping \mu_k fixed: r_{nk} = 1 if k = \arg\min_j \|x_n - \mu_j\|^2, and 0 otherwise;
4   M (maximization) step. Minimize J with respect to the \mu_k, keeping r_{nk} fixed: \mu_k = \sum_n r_{nk}x_n / \sum_n r_{nk};
5 until converged;
The objective may converge to a local rather than global minimum (MacQueen 1967).
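A compact numpy sketch of the two-step coordinate descent above (variable names, the toy data set and the random initialization are my own choices, not the text's):

# K-means: E step assigns r_nk (hard assignments), M step re-estimates the prototypes mu_k.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]                 # initialize mu_k from the data
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)     # ||x_n - mu_k||^2
        r = d2.argmin(axis=1)                                    # E step: closest prototype
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])                   # M step: cluster means
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    r = d2.argmin(axis=1)
    J = d2[np.arange(len(X)), r].sum()                           # distortion measure J
    return mu, r, J

X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(50, 2)) for c in (0, 3)])
mu, r, J = kmeans(X, K=2)
print(mu, J)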
We can also derive an on-line stochastic, rather than batch, version of K-means by applying the Robbins-Monro procedure:

\mu_k^{new} = \mu_k^{old} + \eta_n(x_n - \mu_k^{old}) (11.9)

where \eta_n is the learning rate parameter, which is typically made to decrease monotonically as more data points are considered.
The K-means algorithm can be generalized by introducing a general dissimilarity measure \mathcal{V}(x, x'):

\tilde{J} = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\mathcal{V}(x_n, \mu_k) (11.10)

which gives the K-medoids algorithm. The computational cost of the E step is O(KN). The M step is potentially more complex, so it is common to restrict each cluster prototype to be equal to one of the data vectors assigned to that cluster, requiring O(N_k^2) evaluations of \mathcal{V}(\cdot,\cdot).
11.3.1 representation
The Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form
p(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x|\mu_k, \Sigma_k) (11.11)

Introduce a binary latent variable z with a 1-of-K representation:

z_k \in \{0, 1\}, \quad \sum_k z_k = 1 (11.12, 11.13)

The marginal distribution over z is specified in terms of the mixing coefficients \pi_k, such that

p(z_k = 1) = \pi_k (11.14)
0 \leq \pi_k \leq 1 (11.15)
\sum_{k=1}^{K}\pi_k = 1 (11.16)
The joint distribution is given by p(z)p(x|z),and the marginal distribution of x is obtained by summing the joint
distribution over states of z:
p(x) = \sum_{z} p(z)p(x|z) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x|\mu_k, \Sigma_k) (11.20)

\gamma(z_k) \equiv p(z_k = 1|x) = \frac{p(z_k = 1)p(x|z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)p(x|z_j = 1)} (11.21)
= \frac{\pi_k\,\mathcal{N}(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x|\mu_j, \Sigma_j)} (11.22)
We shall view \pi_k as the prior probability of z_k = 1, and the quantity \gamma(z_k) as the corresponding posterior probability. \gamma(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x.
Assuming that the data points are drawn independently from the same distribution (i.i.d.), the log likelihood function is given by

\ln p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)\right\} (11.23)
It is worth emphasizing that there are significant problems associated with applying the maximum likelihood framework to GMMs, due to the presence of singularities. If one of the components of the mixture model collapses onto a specific data point (\sigma_j \to 0), the log likelihood function goes to infinity. These singularities provide another example of the severe over-fitting that can occur with the maximum likelihood approach. A further issue is identifiability (Casella and Berger, 2002): a K-component mixture has K! equivalent solutions.
An elegant and powerful method for finding the maximum likelihood solutions for models with latent variables is
called the expectation-maximization,or EM algorithm.
Begin by writing down the conditions that must be satisfied at a maximum of the likelihood function. Setting the derivative of \ln p(X|\pi, \mu, \Sigma) in 11.23 with respect to the means \mu_k of the Gaussian components to zero, we obtain

0 = \sum_{n=1}^{N}\frac{\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n|\mu_j, \Sigma_j)}\,\Sigma_k^{-1}(x_n - \mu_k) (11.24)
0 = \sum_{n=1}^{N}\gamma(z_{nk})(x_n - \mu_k) (11.25)
N_k = \sum_{n=1}^{N}\gamma(z_{nk}) (11.26)
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})x_n (11.27)

where we have made use of \frac{d e^x}{dx} = e^x and 2.63. We can interpret N_k as the effective number of points assigned to cluster k.
Now set the derivative of \ln p(X|\pi, \mu, \Sigma) with respect to \Sigma_k to zero:

\frac{\partial \ln p(X|\pi, \mu, \Sigma)}{\partial \Sigma_k} = \sum_{n=1}^{N}\frac{\partial(\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k))/\partial\Sigma_k}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n|\mu_j, \Sigma_j)} (11.29)

Using the identities

\frac{\partial \ln y}{\partial x} = \frac{1}{y}\frac{\partial y}{\partial x} (11.30)
\frac{\partial y}{\partial x} = y\frac{\partial \ln y}{\partial x} (11.31)

we have

\frac{\partial(\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k))}{\partial \Sigma_k} = \pi_k\frac{\partial\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\partial \Sigma_k} (11.32)
= \pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)\frac{\partial \ln\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\partial \Sigma_k} (11.33)

\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\} (11.34)

\frac{\partial \ln\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\partial \Sigma_k} = -\frac{1}{2}\left(\frac{\partial \ln|\Sigma_k|}{\partial \Sigma_k} + \frac{\partial\,(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)}{\partial \Sigma_k}\right) (11.35)

Using the matrix identities

\frac{\partial \det(X)}{\partial X} = \det(X)(X^{-1})^T (11.36)
\frac{\partial\, a^T X^{-1} b}{\partial X} = -X^{-T} a b^T X^{-T} (11.37)

we get

\frac{\partial \ln|\Sigma_k|}{\partial \Sigma_k} = \frac{1}{|\Sigma_k|}\frac{\partial|\Sigma_k|}{\partial \Sigma_k} = \Sigma_k^{-T} = \Sigma_k^{-1} (11.38)

\frac{\partial \ln\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\partial \Sigma_k} = -\frac{1}{2}\left(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n-\mu_k)(x_n-\mu_k)^T\Sigma_k^{-1}\right) (11.39)

Setting the derivative to zero,

\sum_{n=1}^{N}\gamma(z_{nk})\left(-\frac{1}{2}\right)\left[\Sigma_k^{-1} - \Sigma_k^{-1}(x_n-\mu_k)(x_n-\mu_k)^T\Sigma_k^{-1}\right] = 0 (11.40)

\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n-\mu_k)(x_n-\mu_k)^T (11.41)

where again we see the appearance of the responsibilities 11.21. Finally, maximize \ln p(X|\pi, \mu, \Sigma) with respect to the mixing coefficients \pi_k subject to the constraint \sum_{k=1}^{K}\pi_k = 1, using a Lagrange multiplier \lambda. If we multiply both sides of the stationarity condition by \pi_k and sum over k, making use of the constraint, we find \lambda = -N. Then

\pi_k = \frac{N_k}{N} (11.44)
These results suggest a simple iterative scheme for finding a solution to the maximum likelihood problem.
Algorithm 2: EM for Gaussian Mixtures
1 1. Initialize the means \mu_k, covariances \Sigma_k and mixing coefficients \pi_k, and evaluate the initial value of the log likelihood.
2 repeat
3   2. E step. Evaluate the responsibilities using the current parameter values:

\gamma(z_{nk}) \equiv p(z_{nk} = 1|x_n) = \frac{\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n|\mu_j, \Sigma_j)} (11.45, 11.46)

4   3. M step. Re-estimate the parameters using the current responsibilities:

\mu_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})x_n (11.47)
\Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T (11.48)
\pi_k^{new} = \frac{N_k}{N} (11.49)

5   where

N_k = \sum_{n=1}^{N}\gamma(z_{nk}) (11.50)

6   4. Evaluate the log likelihood 11.23 and check for convergence of either the parameters or the log likelihood;
7 until converged;
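A minimal numpy/scipy sketch of the updates 11.45-11.50 (the initialization, the small regularizer on the covariances, and all variable names are my own choices, not the text's):

# EM for a Gaussian mixture: E step computes responsibilities, M step re-estimates parameters.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, K, replace=False)]                       # means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)                                      # mixing coefficients
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), Eq. 11.45-11.46
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: Eq. 11.47-11.50
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    log_lik = np.log(dens.sum(axis=1)).sum()                      # Eq. 11.23
    return pi, mu, Sigma, log_lik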
11.4.1 representation
We denote the set of all observed data by X, in which the nth row represents x_n^T, and the set of all latent variables by Z, with corresponding rows z_n^T. The set of all model parameters is denoted by \theta, so the log likelihood function is given by

\ln p(X|\theta) = \ln\left\{\sum_{Z} p(X, Z|\theta)\right\}
Replace the sum over Z with an integral for continuous latent variables.
{X, Z} is called the complete data set, and {X} is incomplete. We consider the expected value of the complete-data log likelihood under the posterior distribution of the latent variables, which corresponds to the E step. In the subsequent M step we maximize this expectation,

Q(\theta, \theta^{old}) = \sum_{Z} p(Z|X, \theta^{old})\ln p(X, Z|\theta),

with respect to \theta to obtain \theta^{new}. We then check for convergence of the log likelihood or the parameter values; if not converged, set \theta^{old} \leftarrow \theta^{new} and return to the E step.
When finding MAP (maximum a posteriori) solutions, we maximize the quantity Q(\theta, \theta^{old}) + \ln p(\theta) in the M step.
The expectation maximization algorithm, or EM, is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables.
Given the same setting as before, our goal is to maximize the likelihood function p(X|\theta) = \sum_Z p(X, Z|\theta). Multiplying both sides of \ln p(X|\theta) = \ln p(X, Z|\theta) - \ln p(Z|X, \theta) by a distribution q(Z) and summing (or integrating) over Z, we obtain the decomposition

\ln p(X|\theta) = \mathcal{L}(q, \theta) + KL(q\,\|\,p)
where

\mathcal{L}(q, \theta) = \sum_{Z} q(Z)\ln\left\{\frac{p(X, Z|\theta)}{q(Z)}\right\} (11.66)
KL(q\,\|\,p) = -\sum_{Z} q(Z)\ln\left\{\frac{p(Z|X, \theta)}{q(Z)}\right\} (11.67)
Note that \mathcal{L}(q, \theta) is a functional of the distribution q(Z), and KL(q\|p) is the Kullback-Leibler divergence between q(Z) and the posterior p(Z|X, \theta). The KL divergence is nonnegative by Jensen's inequality for convex functions.
If we substitute q(Z) = p(Z|X, \theta^{old}) into the lower bound, then after the E step

\mathcal{L}(q, \theta) = \sum_{Z} p(Z|X, \theta^{old})\ln p(X, Z|\theta) - \sum_{Z} p(Z|X, \theta^{old})\ln p(Z|X, \theta^{old}) (11.70)
EM can also be used to maximize the posterior distribution p(\theta|X) for models with a prior p(\theta) over the parameters. The log posterior can be decomposed as

\ln p(\theta|X) = \mathcal{L}(q, \theta) + KL(q\,\|\,p) + \ln p(\theta) - \ln p(X)

where \ln p(X) is constant. We can optimize the right-hand side alternately with respect to q in the E step and \theta in the M step.
11.5.2 generalized EM
The generalized EM, or GEM, algorithm addresses the problem of an intractable M step. Instead of aiming to maximize \mathcal{L}(q, \theta) with respect to \theta, it seeks only to change the parameters so as to increase its value, for example by
1 nonlinear optimization, such as the conjugate gradients algorithm, in the M step;
2 expectation conditional maximization (ECM), making several constrained optimizations within each M step.
Similarly, the E step can be generalized by performing a partial, rather than complete, optimization of \mathcal{L}(q, \theta) with respect to q(Z).
Chapter 12
Continuous Latent Variables
12.1.1 Introduction
Principal component analysis is widely used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization. It is also known as the Karhunen-Loeve transform. There are two common definitions that give rise to the same algorithm. PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized. Equivalently, it can be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections.
Consider a data set of observations {x_n} where n = 1, ..., N, and x_n is a Euclidean variable with dimensionality D. Our goal is to project the data onto a space having dimensionality M < D while maximizing the variance of the projected data. We define the direction of this space using a D-dimensional unit vector u_1 satisfying u_1^T u_1 = 1. Each data point x_n is then projected onto a scalar value u_1^T x_n. The mean of the projected data is u_1^T\bar{x}, where \bar{x} is the sample mean given by

\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n (12.1)

and the variance of the projected data is

\frac{1}{N}\sum_{n=1}^{N}\{u_1^T x_n - u_1^T\bar{x}\}^2 = \frac{1}{N}\sum_{n=1}^{N}\{u_1^T(x_n - \bar{x})\}^2 (12.2)
= \frac{1}{N}\sum_{n=1}^{N} u_1^T(x_n - \bar{x})(x_n - \bar{x})^T u_1 = u_1^T S u_1 (12.3)

where S is the data covariance matrix

S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})^T (12.5)
We now maximize the projected variance u_1^T S u_1 with respect to u_1. This must be a constrained maximization to prevent \|u_1\| \to \infty; the appropriate constraint comes from the normalization condition u_1^T u_1 = 1. To enforce this constraint, we introduce a Lagrange multiplier denoted \lambda_1 and make an unconstrained maximization of

u_1^T S u_1 + \lambda_1(1 - u_1^T u_1)
By setting the derivative with respect to u_1 equal to zero, we see that this quantity has a stationary point when

S u_1 = \lambda_1 u_1 (12.7)
which says that u_1 must be an eigenvector of S. If we left-multiply by u_1^T and make use of u_1^T u_1 = 1, we see that the variance is given by

u_1^T S u_1 = \lambda_1 (12.8)

and so the variance will be a maximum when we set u_1 equal to the eigenvector having the largest eigenvalue \lambda_1. This eigenvector is known as the first principal component.
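A short numpy check of Eqs. 12.5-12.8 (the toy data and variable names are my own): the leading eigenvector of S does achieve projected variance equal to the largest eigenvalue.

# First principal component via the eigendecomposition of the data covariance S.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=500)

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)     # data covariance, Eq. 12.5
eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
u1 = eigvecs[:, -1]                           # first principal component
lambda1 = eigvals[-1]

# Projected variance equals the largest eigenvalue (Eq. 12.8).
proj_var = ((X - x_bar) @ u1).var()
print(lambda1, proj_var)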
13.1.1 representation
Hidden Markov models are state space models, represented as directed probabilistic graphical models with latent variables.
Introduce latent variables z_n with a 1-of-K coding scheme. The transition probabilities, or transition matrix, are A_{jk} \equiv p(z_{nk} = 1|z_{n-1,j} = 1), satisfying 0 \leq A_{jk} \leq 1 and \sum_k A_{jk} = 1, so that the matrix A has K(K-1) independent parameters. Written explicitly, the conditional distribution is

p(z_n|z_{n-1}, A) = \prod_{k=1}^{K}\prod_{j=1}^{K} A_{jk}^{z_{n-1,j}\,z_{nk}} (13.1)
The initial latent node z_1 does not have a parent node, so its marginal distribution is represented by a vector of probabilities \pi with \pi_k \equiv p(z_{1k} = 1), so that

p(z_1|\pi) = \prod_{k=1}^{K}\pi_k^{z_{1k}} (13.2)
where \sum_k \pi_k = 1. The emission probabilities can be represented in the form

p(x_n|z_n, \phi) = \prod_{k=1}^{K} p(x_n|\phi_k)^{z_{nk}} (13.3)
For homogeneous models, the joint distribution over both latent and observed variables is given by

p(X, Z|\theta) = p(z_1|\pi)\left[\prod_{n=2}^{N} p(z_n|z_{n-1}, A)\right]\prod_{m=1}^{N} p(x_m|z_m, \phi) (13.4)

where \theta = \{\pi, A, \phi\}.
13.1.2 evaluation
The likelihood function is obtained from the joint distribution by marginalizing over the latent variables:

p(X|\theta) = \sum_{Z} p(X, Z|\theta)
13.1.3 optimization
13.1.3.1 EM algorithm
Direct maximization of the likelihood function leads to complex expressions with no closed-form solutions (the observations are not i.i.d.), so we turn to the expectation maximization algorithm (which can be viewed as a simplified variational Bayesian treatment). In the E step, we take the current parameters \theta^{old} and find the posterior distribution of the latent variables, p(Z|X, \theta^{old}). We then use this posterior distribution to evaluate the expectation of the logarithm of the complete-data likelihood, as a function of the parameters \theta.
The required posterior quantities \gamma (the marginal posterior of a single latent variable) and \xi (the joint posterior of two successive latent variables) can be obtained using a two-stage message passing algorithm known as Baum-Welch, a.k.a. the alpha-beta algorithm.
In the recursion these probabilities are significantly less than unity, so as we work our way forward along the chain the values of \alpha can go to zero exponentially quickly; in practice rescaled variables are used.
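To make the scaling remark concrete, here is a sketch of the scaled alpha (forward) recursion for a discrete-emission HMM; the toy parameters and the function name are illustrative choices, not from the text.

# Scaled forward recursion: returns normalized alphas and the log likelihood via the scaling factors.
import numpy as np

def forward_scaled(x, pi, A, B):
    """x: observation indices; pi: (K,) initial probs; A: (K,K) transitions; B: (K,M) emissions."""
    N, K = len(x), len(pi)
    alpha = np.zeros((N, K))
    c = np.zeros(N)                       # scaling factors c_n
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[:, x[n]]   # sum over previous states, then emit
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]                  # rescale to avoid underflow
    return alpha, np.log(c).sum()         # log p(X | theta) = sum_n log c_n

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
alpha, loglik = forward_scaled([0, 1, 1, 0], pi, A, B)
print(loglik)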
13.1.4 predict
13.1.5 title
TODO
Chapter 14
Combining Models
14.1 Introduction
committees Train L different models and then make predictions using the average of the predictions made by each
model.
boosting Train multiple models in sequence in which the error function used to train a particular model depends on
the performance of the previous models.
decision trees Different models are responsible for making predictions in different regions of input space.
mixtures of experts Models are viewed as mixture distributions in which the component densities,as well as the
mixing coefficients,are conditioned on the input variables.
p(t|x) = \sum_{k=1}^{K}\pi_k(x)\,p(t|x, k) (14.1)

in which \pi_k(x) = p(k|x) are the input-dependent mixing coefficients and k indexes the model.
In Bayesian model averaging the whole data set is generated by a single model. By contrast, when we combine multiple models, different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components.
14.3 Committees
The simplest way to construct a committee is to average the predictions of a set of individual models; averaging cancels, at least partially, the contribution to the error arising from the variance of the individual models.
Bootstrap the data to introduce variability between the different models within the committee. Suppose we generate M bootstrap data sets and train a model y_m(x) on each; the committee prediction is

y_{COM}(x) = \frac{1}{M}\sum_{m=1}^{M} y_m(x) (14.2)
14.4 Boosting
Here we describe the most widely used form of boosting algorithm: AdaBoost, short for adaptive boosting. The base classifiers are known as weak learners and are trained in sequence using a weighted form of the data set, in which the weighting coefficient associated with each data point depends on the performance of the previous classifiers.
Consider a two-class classification problem in which the training data comprise input vectors x_1, ..., x_N along with corresponding binary target variables t_1, ..., t_N where t_n \in \{-1, 1\}. Each data point is given an associated weighting parameter w_n, initially set to 1/N for all points. Each base classifier provides a function y(x) \in \{-1, 1\}.
Algorithm 4: AdaBoost
Input: A set X = {x_1, x_2, ..., x_N} with targets {t_1, ..., t_N}
Output: the combined classifier Y_M(x).
1 1. Initialize the data weighting coefficients {w_n} by setting w_n^{(1)} = 1/N for n = 1, ..., N.
2 2. for m = 1, ..., M do
3   (a) Fit a classifier y_m(x) to the training data by minimizing the weighted error function

J_m = \sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n) (14.3)

where I(y_m(x_n) \neq t_n) is the indicator function, equal to 1 when y_m(x_n) \neq t_n and 0 otherwise.
4   (b) Evaluate the quantities

\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}} (14.4)

and then use these to evaluate

\alpha_m = \ln\left\{\frac{1 - \epsilon_m}{\epsilon_m}\right\}. (14.5)

5   (c) Update the data weighting coefficients

w_n^{(m+1)} = w_n^{(m)}\exp\{\alpha_m I(y_m(x_n) \neq t_n)\} (14.6)

6 3. Make predictions using the final model

Y_M(x) = \text{sign}\left(\sum_{m=1}^{M}\alpha_m y_m(x)\right) (14.7)
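A hedged sketch of Algorithm 4 using scikit-learn decision stumps as the weak learners y_m(x); the data set, M, and the small clipping of \epsilon_m are my own illustrative choices, not the text's.

# AdaBoost with depth-1 trees: weighted fit, reweighting (14.4-14.6), and weighted vote (14.7).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, t, M=20):
    N = len(t)
    w = np.full(N, 1.0 / N)                              # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
        miss = (stump.predict(X) != t).astype(float)
        eps = np.clip((w * miss).sum() / w.sum(), 1e-10, 1 - 1e-10)   # Eq. 14.4
        alpha = np.log((1 - eps) / eps)                  # Eq. 14.5
        w = w * np.exp(alpha * miss)                     # Eq. 14.6
        stumps.append(stump)
        alphas.append(alpha)
    def predict(Xq):
        votes = sum(a * s.predict(Xq) for a, s in zip(alphas, stumps))
        return np.sign(votes)                            # Eq. 14.7
    return predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)               # a simple nonlinear target
predict = adaboost(X, t)
print("training accuracy:", (predict(X) == t).mean())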
AdaBoost can be derived by the sequential minimization of an exponential error function (14.10 below), where f_m(x) is a classifier defined in terms of a linear combination of base classifiers y_l(x) of the form

f_m(x) = \frac{1}{2}\sum_{l=1}^{m}\alpha_l y_l(x) (14.9)

and t_n \in \{-1, 1\} are the training set target values. Our goal is to minimize E with respect to the weighting coefficients \alpha_l and the parameters of the base classifiers y_l(x).
The exponential error is

E = \sum_{n=1}^{N}\exp\{-t_n f_m(x_n)\} (14.10)
= \sum_{n=1}^{N}\exp\left\{-t_n\frac{1}{2}\sum_{l=1}^{m}\alpha_l y_l(x_n)\right\} (14.11)

Separating off the contribution from the base classifier y_m(x),

E = \sum_{n=1}^{N}\exp\left\{-t_n f_{m-1}(x_n) - \frac{1}{2}t_n\alpha_m y_m(x_n)\right\} (14.12)
= \sum_{n=1}^{N} w_n^{(m)}\exp\left\{-\frac{1}{2}t_n\alpha_m y_m(x_n)\right\} (14.13)

where the coefficients w_n^{(m)} = \exp\{-t_n f_{m-1}(x_n)\} can be viewed as constants because we are optimizing only \alpha_m and y_m(x). Denoting by T_m the set of data points correctly classified by y_m(x), and by M_m the misclassified points, we can rewrite the error function as

E = e^{-\alpha_m/2}\sum_{n\in T_m} w_n^{(m)} + e^{\alpha_m/2}\sum_{n\in M_m} w_n^{(m)} (14.14)
= \left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right)\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n) + e^{-\alpha_m/2}\sum_{n=1}^{N} w_n^{(m)} (14.15)
A.1 LU Decomposition
A.2 QR Decomposition
Definition A.1. An eigenvector of an n × n matrix A is a nonzero vector x such that Ax = \lambda x for some scalar \lambda. A scalar \lambda is called an eigenvalue of A if there is a nontrivial solution x of Ax = \lambda x; such an x is called an eigenvector corresponding to \lambda.
A.5.1 Definition
X = U\,\Sigma\,V^T (A.1)

where X is N × D, U is an N × N matrix whose columns are orthonormal (so U^T U = I), V is a D × D matrix whose rows and columns are orthonormal (so V^T V = V V^T = I_D), and \Sigma is an N × D matrix containing the r = \min(N, D) singular values \sigma_i \geq 0 on the main diagonal, with zeros filling the rest of the matrix.
A.5.2 Proof
Theorem A.1. Suppose v_1, v_2, ..., v_n is an orthogonal basis of R^n consisting of eigenvectors of A^T A, arranged so that the corresponding eigenvalues of A^T A satisfy \lambda_1 \geq \lambda_2 \geq ... \geq \lambda_n \geq 0, and suppose A has r nonzero singular values. Then Av_1, ..., Av_r is an orthogonal basis for Col A, and rank A = r.
We therefore have

Av_i = \sigma_i u_i (A.4)
For a general vector x, since the eigenvectors are orthogonal unit vectors, x can be expanded in the basis {v_i}. Remembering that the dot product can be computed using the vector transpose,

v \cdot u = v^T u (A.8)

it follows that u_i is an eigenvector of the symmetric matrix AA^T and v_i is an eigenvector of the symmetric matrix A^T A. In summary, the u_i and v_i are the left and right singular vectors of the matrix A.
A.5.3 Application
For principal component analysis of an N × D data matrix X (rows are data points), the projection directions are given by the right singular vectors, i.e. the columns of V, since X^T X = V\,\Sigma^T\Sigma\,V^T.
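A quick numpy check of this relation on a small synthetic data set of my own choosing: the right singular vectors of the centered data matrix coincide (up to sign) with the eigenvectors of the covariance, and the squared singular values divided by N give the eigenvalues.

# SVD of the centered data vs. eigendecomposition of the covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))    # ascending order

# Largest-eigenvalue direction vs. first right singular vector (equal up to sign).
print(np.allclose(np.abs(eigvecs[:, -1]), np.abs(Vt[0]), atol=1e-8))
print(s ** 2 / len(Xc), eigvals[::-1])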
Appendix B
Optimization methods
B.1 Convexity
Definition B.1. (Convex set) We say a set S is convex if for any x_1, x_2 \in S and any \lambda \in [0, 1], we have \lambda x_1 + (1 - \lambda)x_2 \in S.
Definition B.2. (Convex function) A function f(x) is called convex if its epigraph (the set of points above the function) defines a convex set. Equivalently, a function f(x) is called convex if it is defined on a convex set and if, for any x_1, x_2 \in S and any \lambda \in [0, 1], we have

f(\lambda x_1 + (1 - \lambda)x_2) \leq \lambda f(x_1) + (1 - \lambda)f(x_2).
Definition B.3. A function f (x) is said to be strictly convex if the inequality is strict
Definition B.4. A function f (x) is said to be (strictly) concave if f (x) is (strictly) convex.
Theorem B.1. If f(x) is twice differentiable on [a, b] and f''(x) \geq 0 on [a, b], then f(x) is convex on [a, b].
Proposition B.1. -\log(x) is strictly convex on (0, \infty).
Intuitively, a (strictly) convex function has a bowl shape, and hence has a unique global minimum x^* corresponding to the bottom of the bowl. Hence its second derivative must be positive everywhere, \frac{d^2}{dx^2}f(x) > 0. A twice-continuously differentiable, multivariate function f is convex iff its Hessian is positive definite for all x. In the machine learning context, the function f often corresponds to the NLL.
Models where the NLL is convex are desirable, since this means we can always find the globally optimal MLE. We
will see many examples of this later in the book. However, many models of interest will not have concave likelihoods.
In such cases, we will discuss ways to derive locally optimal parameter estimates.
The line search approach (see http://en.wikipedia.org/wiki/Line_search) first finds a descent direction along which the objective function f will be reduced, and then computes a step size that determines how far x should move along that direction. The descent direction can be computed by various methods, such as gradient descent (Section B.2), Newton's method (Section B.4) and quasi-Newton methods (Section B.5). The step size can be determined either exactly or inexactly.
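As a small sketch of an inexact step-size rule (the objective, constants and names are illustrative choices, not from the text), the following performs gradient descent with a backtracking line search based on the Armijo sufficient-decrease condition.

# Gradient descent with backtracking line search.
import numpy as np

def backtracking_gd(f, grad, x0, alpha0=1.0, rho=0.5, c=1e-4, n_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        d = -g                                       # descent direction (steepest descent)
        alpha = alpha0
        # Shrink the step until the sufficient-decrease condition holds.
        while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
            alpha *= rho
        x = x + alpha * d
    return x

f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(backtracking_gd(f, grad, [0.0, 0.0]))          # approximately [1, -2]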
Consider the following, which we'll call the primal optimization problem:

xyz (B.4)
Newton's method is based on a second-order Taylor approximation of f around the current iterate x_k:

f(x) \approx f(x_k) + g_k^T(x - x_k) + \frac{1}{2}(x - x_k)^T H_k (x - x_k)

where g_k \triangleq g(x_k) = \nabla f(x_k), H_k \triangleq H(x_k), and

H(x) \triangleq \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]_{D\times D} \quad \text{(Hessian matrix)}

Setting the gradient of the approximation to zero gives

\nabla f(x) = g_k + H_k(x - x_k) = 0 (B.5)
x_{k+1} = x_k - H_k^{-1} g_k (B.6)
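A short numpy sketch of the update B.6 on a toy objective of my own choosing (in practice one solves the linear system rather than inverting H_k):

# Newton's method on f(x, y) = (x - 2)^4 + (y + 1)^2.
import numpy as np

def newton(grad, hess, x0, n_iter=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))   # x_{k+1} = x_k - H_k^{-1} g_k
    return x

grad = lambda v: np.array([4 * (v[0] - 2) ** 3, 2 * (v[1] + 1)])
hess = lambda v: np.array([[12 * (v[0] - 2) ** 2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, [0.0, 0.0]))               # converges towards (2, -1)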
B.5 Quasi-Newton method
From Equation B.5 we can infer the quasi-Newton condition as follows:

\nabla f(x) - g_k = H_k(x - x_k)
g_{k-1} - g_k = H_k(x_{k-1} - x_k)
g_k - g_{k-1} = H_k(x_k - x_{k-1})
g_{k+1} - g_k = H_{k+1}(x_{k+1} - x_k) \quad \text{(quasi-Newton condition)} (B.7)
B.5.1 DFP
Writing \delta_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k, and letting B_{k+1} approximate the inverse Hessian H_{k+1}^{-1}, the quasi-Newton condition B.7 becomes

B_{k+1} y_k = \delta_k (B.8)

Updating rule:

B_{k+1} = B_k + P_k + Q_k (B.9)

From Equation B.8 we can get

B_{k+1} y_k = B_k y_k + P_k y_k + Q_k y_k = \delta_k

To make the equation above hold, it suffices to let

P_k y_k = \delta_k
Q_k y_k = -B_k y_k

which are satisfied by

P_k = \frac{\delta_k \delta_k^T}{\delta_k^T y_k} (B.10)
Q_k = -\frac{B_k y_k y_k^T B_k}{y_k^T B_k y_k} (B.11)
B.5.2 BFGS
Here B_{k+1} approximates the Hessian itself, so the quasi-Newton condition reads

B_{k+1}\delta_k = y_k (B.12)

The updating rule is similar to DFP, but P_k and Q_k are different. Let

P_k\delta_k = y_k
Q_k\delta_k = -B_k\delta_k

Then

P_k = \frac{y_k y_k^T}{y_k^T\delta_k} (B.13)
Q_k = -\frac{B_k\delta_k\delta_k^T B_k}{\delta_k^T B_k\delta_k} (B.14)
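A sketch of one BFGS-style update of the approximate Hessian using Eqs. B.12-B.14; the toy quadratic and the function name are my own, and the check verifies that the secant (quasi-Newton) condition holds after the update.

# One BFGS update: B_{k+1} = B_k + y y^T / (y^T delta) - B delta delta^T B / (delta^T B delta).
import numpy as np

def bfgs_update(B, delta, y):
    term_p = np.outer(y, y) / (y @ delta)                               # P_k, Eq. B.13
    term_q = -np.outer(B @ delta, B @ delta) / (delta @ B @ delta)      # Q_k, Eq. B.14 (B symmetric)
    return B + term_p + term_q

# On a quadratic f(x) = 0.5 x^T H x the gradient is g = H x, so y = H delta.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
delta = np.array([0.5, -1.0])
y = H @ delta
B_new = bfgs_update(np.eye(2), delta, y)
print(np.allclose(B_new @ delta, y))   # True: quasi-Newton condition B.12 satisfied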
B.5.3 Broyden