Introduction To Machine Learning
Alexander J. Smola and S. V. N. Vishwanathan
Contents

Preface

1 Introduction
  1.1 A Taste of Machine Learning
    1.1.1 Applications
    1.1.2 Data
    1.1.3 Problems
  1.2 Probability Theory
    1.2.1 Random Variables
    1.2.2 Distributions
    1.2.3 Mean and Variance
    1.2.4 Marginalization, Independence, Conditioning, and Bayes Rule
  1.3 Basic Algorithms
    1.3.1 Naive Bayes
    1.3.2 Nearest Neighbor Estimators
    1.3.3 A Simple Classifier
    1.3.4 Perceptron
    1.3.5 K-Means

2 Density Estimation
  2.1 Limit Theorems
    2.1.1 Fundamental Laws
    2.1.2 The Characteristic Function
    2.1.3 Tail Bounds
    2.1.4 An Example
  2.2 Parzen Windows
    2.2.1 Discrete Density Estimation
    2.2.2 Smoothing Kernel
    2.2.3 Parameter Estimation
    2.2.4 Silverman's Rule
    2.2.5 Watson-Nadaraya Estimator
  2.3 Exponential Families
    2.3.1 Basics
    2.3.2 Examples
  2.4 Estimation
    2.4.1 Maximum Likelihood Estimation
    2.4.2 Bias, Variance and Consistency
    2.4.3 A Bayesian Approach
    2.4.4 An Example
  2.5 Sampling
    2.5.1 Inverse Transformation
    2.5.2 Rejection Sampler

3 Optimization
  3.1 Preliminaries
    3.1.1 Convex Sets
    3.1.2 Convex Functions
    3.1.3 Subgradients
    3.1.4 Strongly Convex Functions
    3.1.5 Convex Functions with Lipschitz Continuous Gradient
    3.1.6 Fenchel Duality
    3.1.7 Bregman Divergence
  3.2 Unconstrained Smooth Convex Minimization
    3.2.1 Minimizing a One-Dimensional Convex Function
    3.2.2 Coordinate Descent
    3.2.3 Gradient Descent
    3.2.4 Mirror Descent
    3.2.5 Conjugate Gradient
    3.2.6 Higher Order Methods
    3.2.7 Bundle Methods
  3.3 Constrained Optimization
    3.3.1 Projection Based Methods
    3.3.2 Lagrange Duality
    3.3.3 Linear and Quadratic Programs
  3.4 Stochastic Optimization
    3.4.1 Stochastic Gradient Descent
  3.5 Nonconvex Optimization
    3.5.1 Concave-Convex Procedure
  3.6 Some Practical Advice

5 Conditional Densities
  5.1 Logistic Regression
  5.2 Regression
    5.2.1 Conditionally Normal Models
    5.2.2 Posterior Distribution
    5.2.3 Heteroscedastic Estimation
  5.3 Multiclass Classification
    5.3.1 Conditionally Multinomial Models
  5.4 What is a CRF?
    5.4.1 Linear Chain CRFs
    5.4.2 Higher Order CRFs
    5.4.3 Kernelized CRFs
  5.5 Optimization Strategies
    5.5.1 Getting Started
    5.5.2 Optimization Algorithms
    5.5.3 Handling Higher Order CRFs
  5.6 Hidden Markov Models
  5.7 Further Reading
    5.7.1 Optimization

7 Linear Models
  7.1 Support Vector Classification

Appendix 1
Appendix 2 Conjugate Distributions
Appendix 3 Loss Functions
Bibliography
Preface
[Figure: overview of the book's chapters (Introduction, Density Estimation, Graphical Models, Duality and Estimation, Conditional Densities, Kernels, Moment Methods, Linear Models, Optimization, Conditional Random Fields, Structured Estimation, Reinforcement Learning) and the order in which they build on each other.]
1
Introduction
Over the past two decades Machine Learning has become one of the mainstays of information technology and with that, a rather central, albeit usually
hidden, part of our life. With the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
The purpose of this chapter is to provide the reader with an overview of
the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. After
that, we will discuss some basic tools from statistics and probability theory,
since they form the language in which many machine learning problems must
be phrased to become amenable to solving. Finally, we will outline a set of
fairly basic yet effective algorithms to solve an important problem, namely
that of classification. More sophisticated tools, a discussion of more general
problems and a detailed analysis will follow in later parts of the book.
1.1 A Taste of Machine Learning

1.1.1 Applications
Most readers will be familiar with the concept of web page ranking. That
is, the process of submitting a query to a search engine, which then finds
webpages relevant to the query and which returns them in their order of
relevance. See e.g. Figure 1.1 for an example of the query results for machine learning. That is, the search engine returns a sorted list of webpages
given a query. To achieve this goal, a search engine needs to know which
pages are relevant and which pages match the query. Such knowledge can be gained from several sources: the link structure of webpages, their content, the frequency with which users will follow the suggested links in a query, or from examples of queries in combination with manually ranked webpages. Increasingly machine learning rather than guesswork and clever engineering is used to automate the process of designing a good search engine [RPB06].

Fig. 1.1. Results of a web search for the query "machine learning": the search engine returns a sorted list of webpages deemed relevant to the query.
A rather related application is collaborative filtering. Internet bookstores such as Amazon, or video rental sites such as Netflix, use this information extensively to entice users to purchase additional goods (or rent more
movies). The problem is quite similar to the one of web page ranking. As
before, we want to obtain a sorted list (in this case of articles). The key difference is that an explicit query is missing and instead we can only use past
purchase and viewing decisions of the user to predict future viewing and
purchase habits. The key side information here is the decisions made by
similar users, hence the collaborative nature of the process. See Figure 1.2
for an example. It is clearly desirable to have an automatic system to solve
this problem, thereby avoiding guesswork and time [BK07].
An equally ill-defined problem is that of automatic translation of documents. At one extreme, we could aim at fully understanding a text before
translating it using a curated set of rules crafted by a computational linguist
well versed in the two languages we would like to translate. This is a rather
arduous task, in particular given that text is not always grammatically correct, nor is the document understanding part itself a trivial one. Instead, we
could simply use examples of translated documents, such as the proceedings
of the Canadian parliament or other multilingual entities (United Nations,
European Union, Switzerland) to learn how to translate between the two languages.
Fig. 1.2. Books recommended by Amazon.com when viewing Tom Mitchell's Machine Learning Book [Mit97]. It is desirable for the vendor to recommend relevant
books which a user might purchase.
Fig. 1.3. 11 Pictures of the same person taken from the Yale face recognition
database. The challenge is to recognize that we are dealing with the same person in all 11 cases.
Fig. 1.4. Named entity tagging of a news article (using LingPipe). The relevant
locations, organizations and persons are tagged for further information extraction.
Named entity tagging (Figure 1.4) illustrates the role of context: while bush and rice are clearly terms from agriculture, it is equally clear that in the context of contemporary politics they refer to members of the Republican Party.
Other applications which take advantage of learning are speech recognition (annotate an audio sequence with text, such as the system shipping
with Microsoft Vista), the recognition of handwriting (annotate a sequence
of strokes with text, a feature common to many PDAs), trackpads of computers (e.g. Synaptics, a major manufacturer of such pads, derives its name
from the synapses of a neural network), the detection of failure in jet engines, avatar behavior in computer games (e.g. Black and White), direct
marketing (companies use past purchase behavior to guesstimate whether
you might be willing to purchase even more) and floor cleaning robots (such
as iRobot's Roomba). The overarching theme of learning problems is that
there exists a nontrivial dependence between some observations, which we
will commonly refer to as x and a desired response, which we refer to as y,
for which a simple set of deterministic rules is not known. By using learning
we can infer such a dependency between x and y in a systematic fashion.
We conclude this section by discussing the problem of classification,
since it will serve as a prototypical problem for a significant part of this
book. It occurs frequently in practice: for instance, when performing spam
filtering, we are interested in a yes/no answer as to whether an e-mail contains relevant information or not. Note that this issue is quite user dependent: for a frequent traveller e-mails from an airline informing him about
recent discounts might prove valuable information, whereas for many other
recipients this might prove more of a nuisance (e.g. when the e-mail relates
to products available only overseas). Moreover, the nature of annoying emails might change over time, e.g. through the availability of new products
(Viagra, Cialis, Levitra, . . . ), different opportunities for fraud (the Nigerian
419 scam which took a new twist after the Iraq war), or different data types
(e.g. spam which consists mainly of images). To combat these problems we
want to build a system which is able to learn how to classify new e-mails.

Fig. 1.5. Binary classification; separate stars from diamonds. In this example we are able to do so by drawing a straight line which separates both sets. We will see later that this is an important example of what is called a linear classifier.
A seemingly unrelated problem, that of cancer diagnosis, shares a common structure: given histological data (e.g. from a microarray analysis of a patient's tissue) infer whether a patient is healthy or not. Again, we are asked
to generate a yes/no answer given a set of observations. See Figure 1.5 for
an example.
1.1.2 Data
It is useful to characterize learning problems according to the type of data
they use. This is a great help when encountering new challenges, since quite
often problems on similar data types can be solved with very similar techniques. For instance natural language processing and bioinformatics use very
similar tools for strings of natural language text and for DNA sequences.
Vectors constitute the most basic entity we might encounter in our work.
For instance, a life insurance company might be interested in obtaining the
vector of variables (blood pressure, heart rate, height, weight, cholesterol
level, smoker, gender) to infer the life expectancy of a potential customer.
A farmer might be interested in determining the ripeness of fruit based on
(size, weight, spectral data). An engineer might want to find dependencies
in (voltage, current) pairs. Likewise one might want to represent documents
by a vector of counts which describe the occurrence of words. The latter is
commonly referred to as bag of words features.
One of the challenges in dealing with vectors is that the scales and units of different coordinates may vary widely. For instance, we could measure weight in kilograms, pounds, grams, tons, or stones, all of which would amount to multiplicative changes. Likewise, when representing temperatures, we have a full class of affine transformations, depending on whether we represent them in terms of Celsius, Kelvin or Fahrenheit. One way of dealing with those differences in an automatic fashion is to normalize the data.
1.1.3 Problems
The range of learning problems is clearly large, as we saw when discussing
applications. That said, researchers have identified an ever growing number
of templates which can be used to address a large set of situations. It is those
templates which make deployment of machine learning in practice easy and
our discussion will largely focus on a choice set of such problems. We now
give a by no means complete list of templates.
Binary Classification is probably the most frequently studied problem
in machine learning and it has led to a large number of important algorithmic
and theoretic developments over the past century. In its simplest form it
reduces to the question: given a pattern x drawn from a domain X, estimate
which value an associated binary random variable y ∈ {±1} will assume.
For instance, given pictures of apples and oranges, we might want to state
whether the object in question is an apple or an orange. Equally well, we
might want to predict whether a home owner might default on his loan,
given income data, his credit history, or whether a given e-mail is spam or
ham. The ability to solve this basic problem already allows us to address a
large variety of practical settings.
Many variants exist with regard to the protocol in which we are required to make our estimation.
Fig. 1.6. Left: binary classification. Right: 3-class classification. Note that in the latter case we have a much larger degree of ambiguity. For instance, being able to distinguish stars from diamonds may not suffice to identify either of them correctly, since we also need to distinguish both of them from triangles.
The cost of a mistake typically depends on which type of error we make. For instance, in the problem of assessing the risk of cancer, it
makes a significant difference whether we mis-classify an early stage of cancer as healthy (in which case the patient is likely to die) or as an advanced
stage of cancer (in which case the patient is likely to be inconvenienced by
overly aggressive treatment).
Structured Estimation goes beyond simple multiclass estimation by
assuming that the labels y have some additional structure which can be used
in the estimation process. For instance, y might be a path in an ontology,
when attempting to classify webpages, y might be a permutation, when
attempting to match objects, to perform collaborative filtering, or to rank
documents in a retrieval setting. Equally well, y might be an annotation of
a text, when performing named entity recognition. Each of those problems
has its own properties in terms of the set of y which we might consider
admissible, or how to search this space. We will discuss a number of those
problems in Chapter ??.
Regression is another prototypical application. Here the goal is to estimate a real-valued variable y ∈ ℝ given a pattern x (see e.g. Figure 1.7). For
instance, we might want to estimate the value of a stock the next day, the
yield of a semiconductor fab given the current process, the iron content of
ore given mass spectroscopy measurements, or the heart rate of an athlete,
given accelerometer data. One of the key issues in which regression problems
differ from each other is the choice of a loss. For instance, when estimating
stock values our loss for a put option will be decidedly one-sided. On the
other hand, a hobby athlete might only care that our estimate of the heart
rate matches the actual one on average.
Novelty Detection is a rather ill-defined problem. It describes the issue
of determining unusual observations given a set of past measurements.
Clearly, the choice of what is to be considered unusual is very subjective.
A commonly accepted notion is that unusual events occur rarely. Hence a
possible goal is to design a system which assigns to each observation a rating
as to how novel it is. Readers familiar with density estimation might contend that the latter would be a reasonable solution. However, we neither need a score which sums up to 1 on the entire domain, nor do we care particularly much about novelty scores for typical observations. We will later see how this somewhat easier goal can be achieved directly. Figure 1.8 has an example of novelty detection when applied to an optical character recognition database.

Fig. 1.8. Left: typical digits contained in the database of the US Postal Service. Right: unusual digits found by a novelty detection algorithm [SPST+01] (for a description of the algorithm see Section 7.4). The score below the digits indicates the degree of novelty. The numbers on the lower right indicate the class associated with the digit.
Fig. 1.9. The random variable maps from the set of outcomes of an experiment (denoted here by X) to the real numbers. As an illustration, here X consists of the patients a physician might encounter, and they are mapped to their weight and height.
1.2.2 Distributions
Perhaps the most important way to characterize a random variable is to
associate probabilities with the values it can take. If the random variable is
discrete, i.e., it takes on a finite number of values, then this assignment of
probabilities is called a probability mass function or PMF for short. A PMF
must be, by definition, non-negative and must sum to one. For instance,
if the coin is fair, i.e., heads and tails are equally likely, then the random
variable X described above takes on values of +1 and −1 with probability 0.5. This can be written as

Pr(X = +1) = 0.5 and Pr(X = −1) = 0.5.    (1.1)

When there is no danger of confusion we will use the slightly informal notation p(x) := Pr(X = x).
In case of a continuous random variable the assignment of probabilities
results in a probability density function or PDF for short. With some abuse
of terminology, but keeping in line with convention, we will often use density
or distribution instead of probability density function. As in the case of the
PMF, a PDF must also be non-negative and integrate to one. Figure 1.10
shows two distributions: the uniform distribution
p(x) = 1/(b − a) if x ∈ [a, b] and 0 otherwise,    (1.2)
and the normal (Gaussian) distribution

p(x) = (2πσ²)^(−1/2) exp(−(x − μ)²/(2σ²)).    (1.3)

Fig. 1.10. Two common densities. Left: uniform distribution over the interval [−1, 1]. Right: normal distribution with zero mean and unit variance.
Closely associated with a PDF is the indefinite integral over p. It is commonly referred to as the cumulative distribution function (CDF).
Definition 1.1 (Cumulative Distribution Function) For a real valued
random variable X with PDF p the associated Cumulative Distribution Function F is given by
F(x₀) := Pr{X ≤ x₀} = ∫_{−∞}^{x₀} dp(x).    (1.4)
The values of x0 for which F (x0 ) assumes a specific value, such as 0.1 or 0.5
have a special name. They are called the quantiles of the distribution p.
Definition 1.2 (Quantiles) Let q ∈ (0, 1). Then the value of x₀ for which Pr(X < x₀) ≤ q and Pr(X > x₀) ≤ 1 − q is the q-quantile of the distribution p. Moreover, the value x₀ associated with q = 0.5 is called the median.
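As a quick numerical illustration of this definition, the following sketch draws samples from a standard normal distribution and reads off empirical quantiles; the data and the use of numpy are illustrative choices, not part of the text.

```python
# Empirical quantiles of made-up data (standard normal samples).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10000)
print(np.quantile(x, [0.1, 0.5, 0.9]))
# Close to the 0.1, 0.5 and 0.9 quantiles of the standard normal
# distribution (about -1.28, 0 and 1.28); the 0.5 quantile is the median.
```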
Fig. 1.11. Quantiles of a distribution correspond to the area under the integral of
the density p(x) for which the integral takes on a pre-specified value. Illustrated
are the 0.1, 0.5 and 0.9 quantiles respectively.
1.2.3 Mean and Variance

A random variable is frequently summarized by its mean, i.e. the expected value E[X] of the values it takes, weighted by their probabilities. For instance, in the case of a dice we have equal probabilities of 1/6 for all
6 possible outcomes. It is easy to see that this translates into a mean of
(1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
The mean of a random variable is useful in assessing expected losses and
benefits. For instance, as a stock broker we might be interested in the expected value of our investment in a year's time. In addition to that, however,
we also might want to investigate the risk of our investment. That is, how
likely it is that the value of the investment might deviate from its expectation since this might be more relevant for our decisions. This means that we
need a variable to quantify the risk inherent in a random variable. One such
measure is the variance of a random variable.
Definition 1.4 (Variance) We define the variance of a random variable X as

Var[X] := E[(X − E[X])²].    (1.9)

As before, if f : ℝ → ℝ is a function, then the variance of f(X) is given by

Var[f(X)] := E[(f(X) − E[f(X)])²].    (1.10)
The variance measures by how much on average f(X) deviates from its expected value. As we shall see in Section 2.1, an upper bound on the variance can be used to give guarantees on the probability that f(X) will be within ε of its expected value. This is one of the reasons why the variance is often
associated with the risk of a random variable. Note that often one discusses
properties of a random variable in terms of its standard deviation, which is
defined as the square root of the variance.
1.2.4 Marginalization, Independence, Conditioning, and Bayes Rule

We say that X and Y are independent, i.e., the values that X takes do not depend on the values that Y takes, whenever

p(x, y) = p(x)p(y).    (1.13)
Independence is useful when it comes to dealing with large numbers of random variables whose behavior we want to estimate jointly. For instance,
whenever we perform repeated measurements of a quantity, such as when
measuring the voltage of a device, we will typically assume that the individual measurements are drawn from the same distribution and that they are independent of each other. That is, having measured the voltage a number of times will not affect the value of the next measurement. We call such random variables independently and identically distributed, or in short, iid random variables. See Figure 1.12 for an example of a pair of random variables drawn from dependent and independent distributions respectively.

Fig. 1.12. Left: a sample from two dependent random variables. Knowing about the first coordinate allows us to improve our guess about the second coordinate. Right: a sample drawn from two independent random variables, obtained by randomly permuting the dependent sample.
Conversely, dependence can be vital in classification and regression problems. For instance, the traffic lights at an intersection are dependent on each
other. This allows a driver to perform the inference that when the lights are
green in his direction there will be no traffic crossing his path, i.e. the other
lights will indeed be red. Likewise, whenever we are given a picture x of a
digit, we hope that there will be dependence between x and its label y.
Especially in the case of dependent random variables, we are interested
in conditional probabilities, i.e., the probability that X takes on a particular
value given the value of Y . Clearly P r(X = rain|Y = cloudy) is higher than
P r(X = rain|Y = sunny). In other words, knowledge about the value of Y
significantly influences the distribution of X. This is captured via conditional
probabilities:
p(x|y) := p(x, y)/p(y).    (1.14)
Conditional probabilities also lead to one of the most important tools of statistical inference, Bayes rule:

p(y|x) = p(x|y)p(y)/p(x).    (1.15)
This follows from the fact that p(x, y) = p(x|y)p(y) = p(y|x)p(x). The key
consequence of (1.15) is that we may reverse the conditioning between a
pair of random variables.
1.2.4.1 An Example
We illustrate our reasoning by means of a simple example inference using
an AIDS test. Assume that a patient would like to have such a test carried
out on him. The physician recommends a test which is guaranteed to detect
HIV-positive whenever a patient is infected. On the other hand, for healthy
patients it has a 1% error rate. That is, with probability 0.01 it diagnoses
a patient as HIV-positive even when he is, in fact, HIV-negative. Moreover,
assume that 0.15% of the population is infected.
Now assume that the patient has the test carried out and the test returns HIV-negative. In this case, logic implies that he is healthy, since the
test has 100% detection rate. In the converse case things are not quite as
straightforward. Denote by X and T the random variables associated with
the health status of the patient and the outcome of the test respectively. We
are interested in p(X = HIV+|T = HIV+). By Bayes rule we may write
p(X = HIV+|T = HIV+) = p(T = HIV+|X = HIV+) p(X = HIV+) / p(T = HIV+).

While we know all terms in the numerator, p(T = HIV+) itself is unknown. That said, it can be computed via

p(T = HIV+) = Σ_{x∈{HIV+,HIV−}} p(T = HIV+, x) = Σ_{x∈{HIV+,HIV−}} p(T = HIV+|x) p(x) = 1.0 · 0.0015 + 0.01 · 0.9985.

Plugging everything back into the expression above yields

p(X = HIV+|T = HIV+) = (1.0 · 0.0015) / (1.0 · 0.0015 + 0.01 · 0.9985) ≈ 0.1306.
In other words, even though our test is quite reliable, there is such a low
prior probability of having been infected with AIDS that there is not much
evidence to accept the hypothesis even after this test.
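The computation above is easy to reproduce. The following is a minimal sketch (not part of the original text) that evaluates Bayes rule with the detection rate, false positive rate and prior quoted above; the function name and default arguments are made up for illustration.

```python
# Posterior Pr(X = HIV+ | T = HIV+) via Bayes rule, using the numbers
# from the text: detection rate 1.0, false positive rate 0.01, prior 0.0015.

def posterior_positive(p_prior, p_detect=1.0, p_false_pos=0.01):
    """Pr(infected | test positive) for a single test."""
    evidence = p_detect * p_prior + p_false_pos * (1.0 - p_prior)
    return p_detect * p_prior / evidence

print(posterior_positive(0.0015))  # roughly 0.13, as computed above
```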
Fig. 1.13. A graphical description of our HIV testing scenario. Knowing the age of
the patient influences our prior on whether the patient is HIV positive (the random
variable X). The outcomes of the tests 1 and 2 are independent of each other given
the status X. We observe the shaded random variables (age, test 1, test 2) and
would like to infer the un-shaded random variable X. This is a special case of a
graphical model which we will discuss in Chapter ??.
Let us now think how we could improve the diagnosis. One way is to obtain further information about the patient and to use this in the diagnosis.
For instance, information about his age is quite useful. Suppose the patient
is 35 years old. In this case we would want to compute p(X = HIV+|T =
HIV+, A = 35) where the random variable A denotes the age. The corresponding expression yields:
p(X = HIV+|T = HIV+, A) = p(T = HIV+|X = HIV+, A) p(X = HIV+|A) / p(T = HIV+|A).
Here we simply conditioned all random variables on A in order to take additional information into account. We may assume that the test is independent
of the age of the patient, i.e.
p(t|x, a) = p(t|x).
What remains therefore is p(X = HIV+|A). Recent US census data pegs this
number at approximately 0.9%. Plugging all data back into the conditional
expression yields (1.0 · 0.009)/(1.0 · 0.009 + 0.01 · 0.991) ≈ 0.48. What has happened here is that
by including additional observed random variables our estimate has become
more reliable. Combination of evidence is a powerful tool. In our case it
helped us make the classification problem of whether the patient is HIV-positive or not more reliable.
A second tool in our arsenal is the use of multiple measurements. After
the first test the physician is likely to carry out a second test to confirm the
diagnosis. We denote by T1 and T2 (and t1 , t2 respectively) the two tests.
Obviously, what we want is that T2 will give us an independent second
opinion of the situation. In other words, we want to ensure that T2 does
not make the same mistakes as T1 . For instance, it is probably a bad idea
to repeat T1 without changes, since it might perform the same diagnostic mistake. Instead, we want the outcome of the second test to be independent of the first one, given the true infection status of the patient:

p(t1, t2|x) = p(t1|x) p(t2|x).    (1.16)
See Figure 1.13 for a graphical illustration of the setting. Random variables
satisfying the condition (1.16) are commonly referred to as conditionally
independent. In shorthand we write T1 ⊥⊥ T2 | X. For the sake of the argument we assume that the statistics for T2 are given by
p(t2|x)        x = HIV−    x = HIV+
t2 = HIV−      0.95        0.01
t2 = HIV+      0.05        0.99
Clearly this test is less reliable than the first one. However, we may now
combine both estimates to obtain a very reliable estimate based on the
combination of both events. For instance, for t1 = t2 = HIV+ we have

p(X = HIV+|T1 = HIV+, T2 = HIV+) = (1.0 · 0.99 · 0.009) / (1.0 · 0.99 · 0.009 + 0.01 · 0.05 · 0.991) ≈ 0.95.
In other words, by combining two tests we can now confirm with very high
confidence that the patient is indeed diseased. What we have carried out is a
combination of evidence. Strong experimental evidence of two positive tests
effectively overcame an initially very strong prior which suggested that the
patient might be healthy.
Tests such as in the example we just discussed are fairly common. For
instance, we might need to decide which manufacturing procedure is preferable, which choice of parameters will give better results in a regression estimator, or whether to administer a certain drug. Note that often our tests
may not be conditionally independent and we would need to take this into
account.
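A small sketch of this combination of evidence, plugging in the prior p(X = HIV+|A) = 0.009 and the two test statistics used above; the dictionary layout and variable names are illustrative choices, not part of the text.

```python
# Combined evidence from two conditionally independent tests.
prior = 0.009
# Pr(t = HIV+ | x) for each test, as quoted in the text.
p_t1_pos = {'HIV+': 1.00, 'HIV-': 0.01}
p_t2_pos = {'HIV+': 0.99, 'HIV-': 0.05}

num = p_t1_pos['HIV+'] * p_t2_pos['HIV+'] * prior
den = num + p_t1_pos['HIV-'] * p_t2_pos['HIV-'] * (1.0 - prior)
print(num / den)  # roughly 0.95, matching the calculation above
```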
1.3 Basic Algorithms

1.3.1 Naive Bayes

Fig. 1.15. Two documents represented as bag of words vectors. Each coordinate counts how often the corresponding dictionary word (the, quick, brown, fox, jumped, over, lazy, dog, hunts, ...) occurs in the document.
Assume that we are given a set of m e-mails xi, denoted by X := {x1, . . . , xm}, and associated labels yi, denoted by Y := {y1, . . . , ym}. Here the labels satisfy yi ∈ {spam, ham}. The key assumption we make here is that the pairs
(xi , yi ) are drawn jointly from some distribution p(x, y) which represents
the e-mail generating process for a user. Moreover, we assume that there
is sufficiently strong dependence between x and y that we will be able to
estimate y given x and a set of labeled instances X, Y.
Before we do so we need to address the fact that e-mails such as Figure 1.14
are text, whereas the three algorithms we present will require data to be
represented in a vectorial fashion. One way of converting text into a vector
is by using the so-called bag of words representation [Mar61, Lew98]. In its
simplest version it works as follows: assume we have a list of all possible words occurring in X, that is, a dictionary; then we are able to assign a unique number to each of those words (e.g. the position in the dictionary). Now we may simply count for each document xi the number of times a given word j occurs. This is then used as the value of the j-th coordinate
of xi . Figure 1.15 gives an example of such a representation. Once we have
the latter it is easy to compute distances, similarities, and other statistics
directly from the vectorial representation.
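The following sketch turns this description into code. The tokenization (lower-casing and splitting on whitespace) and the helper names are illustrative assumptions rather than part of the text.

```python
# A minimal bag-of-words sketch along the lines described above.
from collections import Counter

def build_dictionary(documents):
    """Assign a unique index to every word occurring in the corpus X."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: j for j, w in enumerate(words)}

def bag_of_words(doc, dictionary):
    """Count how often each dictionary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts.get(w, 0) for w in sorted(dictionary, key=dictionary.get)]

docs = ["the quick brown fox jumped over the lazy dog",
        "the lazy dog hunts the quick brown fox"]
d = build_dictionary(docs)
x1, x2 = (bag_of_words(doc, d) for doc in docs)
print(x1, x2)
```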
To decide whether a new e-mail x is spam we would like to compute p(y|x), which by Bayes rule can be written as

p(y|x) = p(x|y)p(y)/p(x).
We may have a good estimate of p(y), that is, the probability of receiving
a spam or ham mail. Denote by mham and mspam the number of ham and
spam e-mails in X. In this case we can estimate

p(ham) ≈ mham/m  and  p(spam) ≈ mspam/m.
The key problem, however, is that we do not know p(x|y) or p(x). We may
dispose of the requirement of knowing p(x) by settling for a likelihood ratio
L(x) := p(spam|x)/p(ham|x) = [p(x|spam) p(spam)] / [p(x|ham) p(ham)].    (1.17)
To estimate p(x|y) we make the (naive) assumption

p(x|y) = ∏_{j=1}^{# of words in x} p(wj|y),    (1.18)
where wj denotes the j-th word in document x. This amounts to the assumption that the probability of occurrence of a word in a document is
independent of all other words given the category of the document. Even though this assumption does not hold in general (for instance, the word York is much more likely to follow the word New), it suffices for our
purposes (see Figure 1.16).
This assumption reduces the difficulty of knowing p(x|y) to that of estimating the probabilities of occurrence of individual words w. Estimates for
Fig. 1.16. Naive Bayes model. The occurrence of individual words is independent
of each other, given the category of the text. For instance, the word Viagra is fairly
frequent if y = spam but it is considerably less frequent if y = ham, except when
considering the mailbox of a Pfizer sales representative.
p(w|y) can be obtained, for instance, by simply counting the frequency of occurrence of the word within documents of a given class. That is, we estimate

p(w|spam) ≈ [ Σ_{i=1}^m Σ_{j=1}^{# of words in xi} {yi = spam and wij = w} ] / [ Σ_{i=1}^m Σ_{j=1}^{# of words in xi} {yi = spam} ].

Here {yi = spam and wij = w} equals 1 if and only if xi is labeled as spam and w occurs as the j-th word in xi. The denominator is simply the total number of words in spam documents. Similarly one can compute p(w|ham).
In principle we could perform the above summation whenever we see a new
document x. This would be terribly inefficient, since each such computation
requires a full pass through X and Y. Instead, we can perform a single pass
through X and Y and store the resulting statistics as a good estimate of the
conditional probabilities. Algorithm 1.1 has details of an implementation.
Note that we performed a number of optimizations: Firstly, the normalization by 1/mspam and 1/mham respectively is independent of x, hence we incorporate it as a fixed offset. Secondly, since we are computing a product over
a large number of factors the numbers might lead to numerical overflow or
underflow. This can be addressed by summing over the logarithm of terms
rather than computing products. Thirdly, we need to address the issue of
estimating p(w|y) for words w which we might not have seen before. One
way of dealing with this is to increment all counts by 1. This method is
commonly referred to as Laplace smoothing. We will encounter a theoretical
justification for this heuristic in Section 2.3.
This simple algorithm is known to perform surprisingly well, and variants
of it can be found in most modern spam filters. It amounts to what is
commonly known as Bayesian spam filtering. Obviously, we may apply it
to problems other than document categorization, too.
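The sketch below follows the same recipe in a compact form: training counts word occurrences per class with Laplace smoothing, and classification sums logarithms of the estimated probabilities. It is an illustration in the spirit of Algorithm 1.1 rather than a transcription of it; the toy data and tokenization are made up.

```python
# A compact Bayesian spam filter sketch: log p(w|y) with add-one smoothing,
# classification by summing log probabilities.
import math
from collections import Counter

def train(docs, labels):
    """Estimate log p(w|y) with Laplace smoothing and log p(y)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    log_pw = {}
    for y, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)
        log_pw[y] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    log_py = {y: math.log(class_counts[y] / len(labels)) for y in class_counts}
    return log_pw, log_py, vocab

def classify(doc, log_pw, log_py, vocab):
    """Return the label maximizing log p(y) + sum_j log p(w_j|y)."""
    scores = {y: log_py[y] + sum(log_pw[y][w] for w in doc.lower().split()
                                 if w in vocab) for y in log_py}
    return max(scores, key=scores.get)

docs = ["cheap viagra deal", "meeting agenda attached", "win money now"]
labels = ["spam", "ham", "spam"]
model = train(docs, labels)
print(classify("cheap money", *model))
```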
1.3.2 Nearest Neighbor Estimators
Fig. 1.17. 1 nearest neighbor classifier. Depending on whether the query point x is
closest to the star, diamond or triangles, it uses one of the three labels for it.
Fig. 1.18. k-Nearest neighbor classifiers using Euclidean distances. Left: decision
boundaries obtained from a 1-nearest neighbor classifier. Middle: color-coded sets
of where the number of red / blue points ranges between 7 and 0. Right: decision
boundary determining where the blue or red dots are in the majority.
The distance measure d(x, x′) used to determine which training points are the nearest neighbors can be chosen to be quite flexible. For instance, we could use string edit distances to compare two
documents or information theory based measures.
However, the problem with nearest neighbor classification is that the estimates can be very noisy whenever the data itself is very noisy. For instance,
if a spam email is erroneously labeled as nonspam then all emails which
are similar to this email will share the same fate. See Figure 1.18 for an
example. In this case it is beneficial to pool together a number of neighbors,
say the k-nearest neighbors of x and use a majority vote to decide the class
membership of x. Algorithm 1.2 has a description of the algorithm. Note
that nearest neighbor algorithms can yield excellent performance when used
with a good distance measure. For instance, the technology underlying the
Netflix progress prize [BK07] was essentially nearest neighbours based.
Note that it is trivial to extend the algorithm to regression. All we need
to change in Algorithm 1.2 is to return the average of the values yi instead
of their majority vote. Figure 1.19 has an example.
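A compact sketch of k-nearest neighbor classification and its regression variant, assuming Euclidean distances as in Figure 1.18; the toy data, the choice k = 3 and the use of numpy are illustrative and not taken from Algorithm 1.2 verbatim.

```python
# k-nearest neighbor classification (majority vote) and regression (average).
import numpy as np
from collections import Counter

def knn_classify(X, Y, x, k=3):
    """Majority vote among the k training points closest to the query x."""
    distances = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(Y[i] for i in nearest).most_common(1)[0][0]

def knn_regress(X, Y, x, k=3):
    """Average the values y_i of the k nearest neighbors."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return float(np.mean([Y[i] for i in nearest]))

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 0.8]])
Y = np.array(['blue', 'blue', 'red', 'red'])
print(knn_classify(X, Y, np.array([0.2, 0.1]), k=3))  # 'blue'
```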
Note that the distance computation d(xi, x) for all observations can become rather expensive when the number of observations is large.
Fig. 1.19. k-Nearest neighbor regression estimator using Euclidean distances. Left:
some points (x, y) drawn from a joint distribution. Middle: 1-nearest neighbour
classifier. Right: 7-nearest neighbour classifier. Note that the regression estimate is
much smoother.
1.3.3 A Simple Classifier

Fig. 1.20. A trivial classifier. Classification is carried out in accordance to which of the two means μ− or μ+ is closer to the test point x. Note that the sets of positive and negative labels respectively form a half space.

A particularly simple classifier compares a new observation x to the means of the two classes,

μ± := m±^{−1} Σ_{yi=±1} xi,    (1.19)

where m+ and m− denote the number of positive and negative training examples, and assigns x to the class whose mean is closer (Figure 1.20). To decide this we compute

‖μ+ − x‖² = ‖μ+‖² + ‖x‖² − 2⟨μ+, x⟩    (1.20)

and the analogous expression for μ−. Here ⟨·, ·⟩ denotes the standard dot product between vectors. Taking differences between the two distances yields

f(x) := ‖μ− − x‖² − ‖μ+ − x‖² = 2⟨μ+ − μ−, x⟩ + ‖μ−‖² − ‖μ+‖².    (1.21)
This is a linear function in x and its sign corresponds to the labels we estimate for x. Our algorithm sports an important property: the classification rule can be expressed via dot products. This follows from

‖μ+‖² = ⟨μ+, μ+⟩ = m+^{−2} Σ_{yi=yj=1} ⟨xi, xj⟩  and  ⟨μ+, x⟩ = m+^{−1} Σ_{yi=1} ⟨xi, x⟩.
Fig. 1.21. The feature map φ maps observations x from X into a feature space H. The map φ is a convenient way of encoding pre-processing steps systematically.

Analogous expressions can be computed for μ−. Consequently we may express the classification rule (1.21) as

f(x) = Σ_{i=1}^m αi ⟨xi, x⟩ + b    (1.22)

where αi = 2yi/m_{yi} and b = m−^{−2} Σ_{yi=yj=−1} ⟨xi, xj⟩ − m+^{−2} Σ_{yi=yj=1} ⟨xi, xj⟩. This offers an important extension: we may replace the dot product by a kernel function

k(x, x′) := ⟨φ(x), φ(x′)⟩,    (1.23)

where φ maps x into a feature space H, as illustrated in Figure 1.21.
This small modification leads to a number of very powerful algorithms and it is at the foundation of an area of research called kernel methods. We will encounter a number of such algorithms for regression, classification, segmentation, and density estimation over the course of the book. Examples of suitable k are the polynomial kernel k(x, x′) = ⟨x, x′⟩^d for d ∈ ℕ and the Gaussian RBF kernel k(x, x′) = exp(−γ‖x − x′‖²) for γ > 0.
The upshot of (1.23) is that our basic algorithm can be kernelized. That
is, we may rewrite (1.21) as
f (x) =
m
X
i k(xi , x) + b
(1.24)
i=1
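The following sketch implements the mean-based rule (1.21) and its kernelized form (1.24), assuming the coefficients αi = 2yi/m_{yi} and the offset b as read off from (1.22) above; the toy data and the RBF bandwidth γ = 1 are illustrative choices.

```python
# Mean-based classifier (1.21) and its kernelized version (1.24).
import numpy as np

def fit_means(X, Y):
    """Class means mu_plus and mu_minus as in (1.19)."""
    return X[Y == 1].mean(axis=0), X[Y == -1].mean(axis=0)

def f_linear(x, mu_plus, mu_minus):
    """f(x) = 2<mu_+ - mu_-, x> + ||mu_-||^2 - ||mu_+||^2, cf. (1.21)."""
    return (2 * np.dot(mu_plus - mu_minus, x)
            + np.dot(mu_minus, mu_minus) - np.dot(mu_plus, mu_plus))

def f_kernel(x, X, Y, k):
    """Kernelized rule (1.24) with alpha_i = 2 y_i / m_{y_i} and offset b."""
    m_pos, m_neg = (Y == 1).sum(), (Y == -1).sum()
    alpha = np.where(Y == 1, 2.0 / m_pos, -2.0 / m_neg)
    Kpp = np.mean([[k(a, c) for c in X[Y == 1]] for a in X[Y == 1]])
    Knn = np.mean([[k(a, c) for c in X[Y == -1]] for a in X[Y == -1]])
    b = Knn - Kpp
    return sum(a_i * k(x_i, x) for a_i, x_i in zip(alpha, X)) + b

rbf = lambda a, c, gamma=1.0: np.exp(-gamma * np.sum((a - c) ** 2))

X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
Y = np.array([1, 1, -1, -1])
mu_p, mu_m = fit_means(X, Y)
x = np.array([0.5, 0.5])
print(np.sign(f_linear(x, mu_p, mu_m)), np.sign(f_kernel(x, X, Y, rbf)))
```

With the linear kernel k(x, x′) = ⟨x, x′⟩ the two rules coincide, which is a convenient sanity check.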
1.3.4 Perceptron
In the previous sections we assumed that our classifier had access to a training set of spam and non-spam emails. In real life, such a set might be difficult
to obtain all at once. Instead, a user might want to have instant results whenever a new e-mail arrives and he would like the system to learn immediately
from any corrections to mistakes the system makes.
To overcome both these difficulties one could envisage working with the
following protocol: As emails arrive our algorithm classifies them as spam or
non-spam, and the user provides feedback as to whether the classification is
correct or incorrect. This feedback is then used to improve the performance
of the classifier over a period of time.
This intuition can be formalized as follows: Our classifier maintains a
parameter vector. At the t-th time instance it receives a data point xt , to
which it assigns a label yt using its current parameter vector. The true label
yt is then revealed, and used to update the parameter vector of the classifier.
Such algorithms are said to be online. We will now describe perhaps the
simplest classifier of this kind namely the Perceptron [Heb49, Ros58].
Let us assume that the data points xt ∈ ℝd, and labels yt ∈ {±1}. As before we represent an email as a bag-of-words vector and we assign +1 to spam emails and −1 to non-spam emails. The Perceptron maintains a weight
vector w and classifies a new observation xt according to the rule

ŷt := sign(⟨w, xt⟩ + b),    (1.25)

where ⟨w, xt⟩ denotes the usual Euclidean dot product and b is an offset. Note the similarity of (1.25) to (1.21) of the simple classifier. Just as the latter, the Perceptron is a linear classifier which separates its domain ℝd into two halfspaces, namely {x | ⟨w, x⟩ + b > 0} and its complement. If ŷt = yt then no updates are made. On the other hand, if ŷt ≠ yt the weight vector is updated as

w ← w + yt xt  and  b ← b + yt.    (1.26)

Fig. 1.22. The Perceptron without bias. Left: at time t we have a weight vector wt denoted by the dashed arrow with corresponding separating plane (also dashed). For reference we include the linear separator w∗ and its separating plane (both denoted by a solid line). As a new observation xt arrives which happens to be mis-classified by the current weight vector wt we perform an update. Also note the margin between the point xt and the separating hyperplane defined by w∗. Right: This leads to the weight vector wt+1 which is more aligned with w∗.

Figure 1.22 shows an update step of the Perceptron algorithm. For simplicity we illustrate the case without bias, that is, where b = 0 and where it remains unchanged. A detailed description of the algorithm is given in Algorithm 1.3.
An important property of the algorithm is that it performs updates on w
by multiples of the observations xi on which it makes a mistake. Hence we
may express w as w = Σ_{i∈Error} yi xi. Just as before, we can replace xi and x by φ(xi) and φ(x) to obtain a kernelized version of the Perceptron algorithm
[FS99] (Algorithm 1.4).
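A minimal sketch of the mistake-driven updates (1.26); Algorithm 1.3 in the text contains the full description, and the loop structure and toy data below are illustrative.

```python
# Perceptron training: apply the update (1.26) on every mistake, repeat
# over the data until no mistakes remain (or a maximum number of passes).
import numpy as np

def perceptron(X, Y, max_epochs=100):
    """Return (w, b) after running the mistake-driven updates (1.26)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            if y * (np.dot(w, x) + b) <= 0:   # mis-classified (or on boundary)
                w, b = w + y * x, b + y       # update rule (1.26)
                mistakes += 1
        if mistakes == 0:                      # converged: all points correct
            break
    return w, b

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-0.5, -1.0]])
Y = np.array([1, 1, -1, -1])
w, b = perceptron(X, Y)
print(np.sign(X @ w + b))  # reproduces Y for linearly separable data
```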
If the dataset (X, Y) is linearly separable, then the Perceptron algorithm
31
eventually converges and correctly classifies all the points in X. The rate of
convergence however depends on the margin. Roughly speaking, the margin
quantifies how linearly separable a dataset is, and hence how easy it is to
solve a given classification problem.
Definition 1.6 (Margin) Let w ∈ ℝd be a weight vector and let b ∈ ℝ be an offset. The margin of an observation x ∈ ℝd with associated label y is

γ(x, y) := y (⟨w, x⟩ + b).    (1.27)

Moreover, the margin of an entire set of observations is the minimum margin over the set,

γ := min_t γ(xt, yt).    (1.28)
Geometrically speaking (see Figure 1.22) the margin measures the distance of x from the hyperplane defined by {x | ⟨w, x⟩ + b = 0}. The larger the margin, the better separated the data, and hence the easier it is to find a hyperplane which correctly classifies the dataset. The following theorem asserts that if
there exists a linear classifier which can classify a dataset with a large margin, then the Perceptron will also correctly classify the same dataset after
making a small number of mistakes.
Theorem 1.7 (Novikoff's theorem) Let (X, Y) be a dataset with at least one example labeled +1 and one example labeled −1. Let R := max_t ‖xt‖, and assume that there exists (w∗, b∗) such that ‖w∗‖ = 1 and γt := yt(⟨w∗, xt⟩ + b∗) ≥ γ for all t. Then, the Perceptron will make at most (1 + R²)(1 + (b∗)²)/γ² mistakes.
This result is remarkable since it does not depend on the dimensionality
of the problem. Instead, it only depends on the geometry of the setting,
as quantified via the margin γ and the radius R of a ball enclosing the
observations. Interestingly, a similar bound can be shown for Support Vector
Machines [Vap95] which we will be discussing in Chapter 7.
Proof We can safely ignore the iterations where no mistakes were made
and hence no updates were carried out. Therefore, without loss of generality
assume that the t-th update was made after seeing the t-th observation and
let wt denote the weight vector after the update. Furthermore, for simplicity
assume that the algorithm started with w0 = 0 and b0 = 0. By the update
equation (1.26) we have
⟨wt, w∗⟩ + bt b∗ = ⟨wt−1, w∗⟩ + bt−1 b∗ + yt (⟨xt, w∗⟩ + b∗) ≥ ⟨wt−1, w∗⟩ + bt−1 b∗ + γ.
Since the algorithm starts with w0 = 0 and b0 = 0, iterating this inequality over the t updates made so far yields ⟨wt, w∗⟩ + bt b∗ ≥ tγ. Moreover, since an update is only carried out on a mis-classified observation, i.e. when yt(⟨wt−1, xt⟩ + bt−1) ≤ 0, we also have

‖wt‖² + bt² = ‖wt−1‖² + bt−1² + 2yt (⟨wt−1, xt⟩ + bt−1) + ‖xt‖² + 1 ≤ ‖wt−1‖² + bt−1² + R² + 1 ≤ t(R² + 1).

Combining both bounds via the Cauchy-Schwarz inequality gives

tγ ≤ ⟨wt, w∗⟩ + bt b∗ = ⟨(wt, bt), (w∗, b∗)⟩ ≤ √(‖wt‖² + bt²) √(1 + (b∗)²) ≤ √(t(R² + 1)) √(1 + (b∗)²).
Squaring both sides of the inequality and rearranging the terms yields an
upper bound on the number of updates and hence the number of mistakes.
The Perceptron was the building block of research on Neural Networks
[Hay98, Bis95]. The key insight was to combine large numbers of such networks, often in a cascading fashion, to larger objects and to fashion optimization algorithms which would lead to classifiers with desirable properties.
In this book we will take a complementary route. Instead of increasing the
number of nodes we will investigate what happens when increasing the complexity of the feature map and its associated kernel k. The advantage of
doing so is that we will reap the benefits from convex analysis and linear
models, possibly at the expense of a slightly more costly function evaluation.
1.3.5 K-Means
All the algorithms we discussed so far are supervised, that is, they assume
that labeled training data is available. In many applications this is too much
to hope for; labeling may be expensive, error prone, or sometimes impossible. For instance, it is very easy to crawl and collect every page within the
www.purdue.edu domain, but rather time consuming to assign a topic to
each page based on its contents. In such cases, one has to resort to unsupervised learning. A prototypical unsupervised learning algorithm is K-means, which is a clustering algorithm. Given X = {x1, . . . , xm} the goal of K-means is to partition it into k clusters such that each point in a cluster is more similar to points from its own cluster than to points from some other cluster.
Denote by μ1, . . . , μk the cluster centers and by rij ∈ {0, 1} an indicator variable which is 1 if xi is assigned to cluster j and 0 otherwise. The K-means objective is

J(r, μ) := (1/2) Σ_{i=1}^m Σ_{j=1}^k rij ‖xi − μj‖²,    (1.29)

which is minimized by alternating between two stages.

Stage 1 Keep the μ fixed and determine r. Each point is assigned to its closest cluster center, that is

rij = 1 if j = argmin_{j′} ‖xi − μ_{j′}‖²    (1.30)

and 0 otherwise.

Stage 2 Keep the r fixed and determine μ. Since the r's are fixed, J is a quadratic function of μ. It can be minimized by setting the derivative with respect to μj to be 0:

Σ_{i=1}^m rij (xi − μj) = 0.    (1.31)

Rearranging obtains

μj = Σ_i rij xi / Σ_i rij.    (1.32)

Since Σ_i rij counts the number of points assigned to cluster j, we are essentially setting μj to be the sample mean of the points assigned to cluster j.
The algorithm stops when the cluster assignments do not change significantly. Detailed pseudo-code can be found in Algorithm 1.5.
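The following sketch spells out the two stages; Algorithm 1.5 in the text is the authoritative description, and the initialization by picking k random points from X is just one of the heuristics discussed in the next paragraph. The toy data and k are illustrative.

```python
# K-means: alternate assignment (1.30) and center updates (1.32).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial centers
    assign = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Stage 1: assign each point to its closest center, cf. (1.30)
        assign = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Stage 2: recompute each center as the mean of its points, cf. (1.32)
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                           else mu[j] for j in range(k)])
        if np.allclose(new_mu, mu):          # assignments no longer change
            break
        mu = new_mu
    return mu, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = kmeans(X, k=2)
print(centers)
```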
Two issues with K-Means are worth noting. First, it is sensitive to the
choice of the initial cluster centers μ. A number of practical heuristics have
been developed. For instance, one could randomly choose k points from the
given dataset as cluster centers. Other methods try to pick k points from X
which are farthest away from each other. Second, it makes a hard assignment
of every point to a cluster center. Variants which we will encounter later in
34
1 Introduction
the book will relax this. Instead of letting rij ∈ {0, 1} these soft variants
will replace it with the probability that a given xi belongs to cluster j.
The K-Means algorithm concludes our discussion of a set of basic machine
learning methods for classification and regression. They provide a useful
starting point for an aspiring machine learning researcher. In this book we
will see many more such algorithms as well as connections between these
basic algorithms and their more advanced counterparts.
Problems
Problem 1.1 (Eyewitness) Assume that an eyewitness is 90% certain
that a given person committed a crime in a bar. Moreover, assume that
there were 50 people in the restaurant at the time of the crime. What is the
posterior probability of the person actually having committed the crime?
Problem 1.2 (DNA Test) Assume the police have a DNA library of 10
million records. Moreover, assume that the false recognition probability is
below 0.00001% per record. Suppose a match is found after a database search
for an individual. What are the chances that the identification is correct? You
can assume that the total population is 100 million people. Hint: compute
the probability of no match occurring first.
Problem 1.3 (Bomb Threat) Suppose that the probability that one of a
thousand passengers on a plane has a bomb is 1 : 1, 000, 000. Assuming that
the probability to have a bomb is evenly distributed among the passengers,
35
the probability that two passengers have a bomb is roughly equal to 10^{−12}.
Therefore, one might decide to take a bomb on a plane to decrease chances
that somebody else has a bomb. What is wrong with this argument?
Problem 1.4 (Monty-Hall Problem) Assume that in a TV show the
candidate is given the choice between three doors. Behind two of the doors
there is a pencil and behind one there is the grand prize, a car. The candidate chooses one door. After that, the showmaster opens another door behind
which there is a pencil. Should the candidate switch doors after that? What
is the probability of winning the car?
Problem 1.5 (Mean and Variance for Random Variables) Denote by
Xi independent random variables. Prove that in this case

E_{X1,...,XN}[ Σ_i xi ] = Σ_i E_{Xi}[xi]  and  Var_{X1,...,XN}[ Σ_i xi ] = Σ_i Var_{Xi}[xi].
Problem 1.9 Prove that the Normal distribution (1.3) has mean μ and variance σ². Hint: exploit the fact that p is symmetric around μ.
Problem 1.10 (Cauchy Distribution) Prove that for the density
p(x) = 1/(π (1 + x²))    (1.33)
mean and variance are undefined. Hint: show that the integral diverges.
Problem 1.11 (Quantiles) Find a distribution for which the mean exceeds the median. Hint: the mean depends on the value of the high-quantile
terms, whereas the median does not.
Problem 1.12 (Multicategory Naive Bayes) Prove that for multicategory Naive Bayes the optimal decision is given by
y∗ := argmax_y p(y) ∏_{i=1}^n p([x]i |y).    (1.34)
2
Density Estimation
Suppose we cast a dice 100 times and observe a six in no more than 11 of the cases. Should we suspect that the dice is loaded? The probability that a fair dice produces at most 11 sixes in 100 casts is

Σ_{i=0}^{11} p(i) = Σ_{i=0}^{11} (100 choose i) (1/6)^i (5/6)^{100−i} ≈ 7.0%.    (2.1)

After looking at this figure you decide that things are probably reasonable. And, in fact, they are consistent with the convergence behavior of a simulated dice in Figure 2.1. In computing (2.1) we have learned something useful: the expansion is a special case of a binomial series.
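The figure quoted in (2.1) can be checked directly; the short sketch below sums the binomial probabilities using only the standard library.

```python
# Probability of at most 11 sixes in 100 casts of a fair dice, cf. (2.1).
from math import comb

p = sum(comb(100, i) * (1/6)**i * (5/6)**(100 - i) for i in range(12))
print(f"{p:.3f}")  # matches the roughly 7% quoted in (2.1)
```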
Fig. 2.1. Convergence of empirical means to expectations. From left to right: empirical frequencies of occurrence obtained by casting a dice 10, 20, 50, 100, 200, and 500 times respectively. Note that after 20 throws we still have not observed a single 6, an event which occurs with probability (5/6)^20 ≈ 2.6% only.
Theorem 2.1 (Weak Law of Large Numbers) Denote by X1, . . . , Xm random variables with mean μ = E[Xi], and let

X̄m := (1/m) Σ_{i=1}^m Xi    (2.2)

be the empirical average over the random variables Xi. Then for any ε > 0 the following holds:

lim_{m→∞} Pr(|X̄m − μ| ≤ ε) = 1.    (2.3)
Fig. 2.2. The mean of a number of casts of a dice. The horizontal straight line denotes the mean 3.5. The uneven solid line denotes the actual mean X̄n as a function of the number of draws, given as a semilogarithmic plot. The crosses denote the outcomes of the dice. Note how X̄n ever more closely approaches the mean 3.5 as we obtain an increasing number of observations.
This establishes that, indeed, for large enough sample sizes, the average will
converge to the expectation. The strong law strengthens this as follows:
Theorem 2.2 (Strong Law of Large Numbers) Under the conditions of Theorem 2.1 we have Pr(lim_{m→∞} X̄m = μ) = 1.

The strong law implies that almost surely (in a measure theoretic sense) X̄m converges to μ, whereas the weak law only states that for every ε the random variable X̄m will be within the interval [μ − ε, μ + ε]. Clearly the strong implies the weak law since the measure of the events X̄m = μ converges to 1, hence any ε-ball around μ would capture this.
Both laws justify that we may take sample averages, e.g. over a number
of events such as the outcomes of a dice and use the latter to estimate their
means, their probabilities (here we treat the indicator variable of the event
as a {0; 1}-valued random variable), their variances or related quantities. We
postpone a proof until Section 2.1.2, since an effective way of proving Theorem 2.1 relies on the theory of characteristic functions which we will discuss
in the next section. For the moment, we only give a pictorial illustration in
Figure 2.2.
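In the same spirit as Figures 2.1-2.3, the following sketch simulates dice throws and prints running averages; the sample sizes and the random seed are arbitrary choices.

```python
# Running empirical means of dice throws approaching the expectation 3.5.
import numpy as np

rng = np.random.default_rng(0)
throws = rng.integers(1, 7, size=1000)            # casts of a fair dice
running_mean = np.cumsum(throws) / np.arange(1, len(throws) + 1)

for m in (10, 100, 1000):
    print(m, running_mean[m - 1])
# The printed averages move towards 3.5, and their fluctuations shrink
# roughly like sqrt(Var[x]/m), the envelopes shown in Figure 2.3.
```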
Once we established that the random variable X̄m = (1/m) Σ_{i=1}^m Xi converges to its mean μ, a natural second question is to establish how quickly it converges and what the properties of the limiting distribution of X̄m are.
Note in Figure 2.2 that the initial deviation from the mean is large whereas
as we observe more data the empirical mean approaches the true one.
Fig. 2.3. Five instantiations of a running average over outcomes of a toss of a dice. Note that all of them converge to the mean 3.5. Moreover note that they all are well contained within the upper and lower envelopes given by μ ± √(VarX[x]/m).
This is made precise by the Central Limit Theorem: for independent random variables Xi with means μi and variances σi², the rescaled sum

Zm := [Σ_{i=1}^m σi²]^{−1/2} Σ_{i=1}^m (Xi − μi)    (2.4)

converges to a normal distribution with zero mean and unit variance. As an example consider the outcomes of casts of a dice, i.e. random variables taking the values {1, . . . , 6} with equal probability,
41
which have all mean = 3.5 and variance (see Problem 2.1)
VarX [x] = EX [x2 ] EX [x]2 = (1 + 4 + 9 + 16 + 25 + 36)/6 3.52 2.92.
P
We now study the random variable Wm := m1 m
i=1 [Xi 3.5]. Since each
of the terms in the sum has zero mean, also Wm s mean vanishes. Moreover,
Wm is a multiple of Zm of (2.4). Hence we have that Wm converges to a
1
normal distribution with zero mean and standard deviation 2.92m 2 .
Consequently the average of m tosses of the dice yields a random variable with mean 3.5 and it will approach a normal distribution with variance
1
m 2 2.92. In other words, the empirical mean converges to its average at
1
rate O(m 2 ). Figure 2.3 gives an illustration of the quality of the bounds
implied by the CLT.
One remarkable property of functions of random variables is that in many
conditions convergence properties of the random variables are bestowed upon
the functions, too. This is manifest in the following two results: a variant
of Slutsky's theorem and the so-called delta method. The former deals with
limit behavior whereas the latter deals with an extension of the central limit
theorem.
Theorem 2.4 (Slutsky's Theorem) Denote by Xi, Yi sequences of random variables with Xi → X and Yi → c for c ∈ ℝ in probability. Moreover,
denote by g(x, y) a function which is continuous for all (x, c). In this case
the random variable g(Xi , Yi ) converges in probability to g(X, c).
For a proof see e.g. [Bil68]. Theorem 2.4 is often referred to as the continuous
mapping theorem (Slutsky only proved the result for affine functions). It
means that for functions of random variables it is possible to pull the limiting
procedure into the function. Such a device is useful when trying to prove
asymptotic normality and in order to obtain characterizations of the limiting
distribution.
Theorem 2.5 (Delta Method) Assume that Xn ∈ ℝd is asymptotically normal with a_n^{−2}(Xn − b) → N(0, Σ) for a_n → 0. Moreover, assume that g : ℝd → ℝl is a mapping which is continuously differentiable at b. In this case the random variable g(Xn) is also asymptotically normal with

a_n^{−2}(g(Xn) − g(b)) → N(0, [∇x g(b)] Σ [∇x g(b)]^⊤).    (2.5)
Fig. 2.4. A working example of the central limit theorem. The top row contains distributions of sums of uniformly distributed random variables on the interval [−0.5, 0.5]. From left to right we have sums of 1, 2, 4, 8 and 16 random variables. The bottom row contains the same distribution with the means rescaled by √m, where m is the number of observations. Note how the distribution converges increasingly to the normal distribution.
Given $m$ random variables $X_i$ with mean $\mathbf{E}_X[x] = \mu$ this means that their
average $\bar{X}_m := \frac{1}{m}\sum_{i=1}^m X_i$ has the characteristic function
$$\phi_{\bar{X}_m}(\omega) = \left(1 + \frac{i}{m}\mu\omega + o(m^{-1}|\omega|)\right)^m. \qquad (2.11)$$
In the limit of $m \to \infty$ this converges to $\exp(i\mu\omega)$, the characteristic function of the constant distribution with mean $\mu$. This proves the claim that in
the large sample limit $\bar{X}_m$ is essentially constant with mean $\mu$.
Proof [Central Limit Theorem] We use the same idea as above to prove
the CLT. The main difference, though, is that we need to assume that the
second moments of the random variables $X_i$ exist. To avoid clutter we only
prove the case of constant mean $\mathbf{E}_{X_i}[x_i] = \mu$ and variance $\mathrm{Var}_{X_i}[x_i] = \sigma^2$.
Let $Z_m := \frac{1}{\sigma\sqrt{m}}\sum_{i=1}^m (X_i - \mu)$. Our proof relies on showing convergence
of the characteristic function of $Z_m$, i.e. $\phi_{Z_m}$, to that of a normally distributed random variable $W$ with zero mean and unit variance. Expanding
the exponential to second order yields
$$\exp(i\omega x) = 1 + i\omega x - \frac{1}{2}\omega^2 x^2 + o(|\omega|^2)$$
$$\text{and hence } \phi_X(\omega) = 1 + i\omega\mathbf{E}_X[x] - \frac{1}{2}\omega^2\mathrm{Var}_X[x] + o(|\omega|^2).$$
Since the mean of $Z_m$ vanishes by centering $(X_i - \mu)$ and the variance per
variable is $m^{-1}$ we may write the characteristic function of $Z_m$ via
$$\phi_{Z_m}(\omega) = \left(1 - \frac{1}{2m}\omega^2 + o(m^{-1}|\omega|^2)\right)^m.$$
As before, taking limits $m \to \infty$ yields the exponential function. We have
that $\lim_{m\to\infty}\phi_{Z_m}(\omega) = \exp(-\frac{1}{2}\omega^2)$ which is the characteristic function of
the normal distribution with zero mean and variance 1. Since the characteristic function transform is injective this proves our claim.
Note that the characteristic function has a number of useful properties. For
instance, it can also be used as moment generating function via the identity:
$$\frac{d^n}{d\omega^n}\phi_X(0) = i^n\,\mathbf{E}_X[x^n]. \qquad (2.12)$$
Its proof is left as an exercise. See Problem 2.2 for details. This connection
also implies (subject to regularity conditions) that if we know the moments
of a distribution we are able to reconstruct it directly since it allows us
to reconstruct its characteristic function. This idea has been exploited in
density estimation [Cra46] in the form of Edgeworth and Gram-Charlier
expansions [Hal92].
Theorem 2.9 (Gauss-Markov) Denote by $X \geq 0$ a random variable with mean $\mu$. Then for any $\epsilon > 0$
$$\Pr(X \geq \epsilon) \leq \frac{\mu}{\epsilon}. \qquad (2.13)$$
Proof We use the fact that for nonnegative random variables
$$\Pr(X \geq \epsilon) = \int_\epsilon^\infty dp(x) \leq \int_\epsilon^\infty \frac{x}{\epsilon}\, dp(x) \leq \epsilon^{-1}\int_0^\infty x\, dp(x) = \frac{\mu}{\epsilon}.$$
This means that for random variables with a small mean, the proportion of
samples with large values has to be small.
Consequently deviations from the mean are bounded by $O(\epsilon^{-1})$. However, note that this
bound does not depend on the number of observations. A useful application
of the Gauss-Markov inequality is Chebyshev's inequality. It is a statement
on the range of random variables in terms of their variance.
Theorem 2.10 (Chebyshev) Denote by $X$ a random variable with mean $\mu$
and variance $\sigma^2$. Then the following holds for $\epsilon > 0$:
$$\Pr(|x - \mu| \geq \epsilon) \leq \frac{\sigma^2}{\epsilon^2}. \qquad (2.14)$$
Applied to the average $\bar{X}_m$ of $m$ iid random variables this yields, using
$$\mathrm{Var}[\bar{X}_m] = \sum_{i=1}^m m^{-2}\,\mathrm{Var}_{X_i}[x_i] = m^{-1}\sigma^2,$$
the bound $\delta \leq \epsilon^{-2} m^{-1}\sigma^2$ and equivalently $\epsilon \leq \sigma/\sqrt{\delta m}$.
This bound is quite reasonable for large $\delta$ but it means that for high levels
of confidence we need a huge number of observations.
Much stronger results can be obtained if we are able to bound the range
of the random variables. Using the latter, we reap an exponential improvement in the quality of the bounds in the form of the McDiarmid [McD89]
inequality. We state the latter without proof:
Theorem 2.11 (McDiarmid) Denote by $f : \mathcal{X}^m \to \mathbb{R}$ a function on $\mathcal{X}^m$
and let $X_i$ be independent random variables. In this case the following holds:
$$\Pr\left(|f(x_1,\dots,x_m) - \mathbf{E}_{X_1,\dots,X_m}[f(x_1,\dots,x_m)]| > \epsilon\right) \leq 2\exp\left(-2\epsilon^2 C^{-2}\right).$$
Here the constant $C^2$ is given by $C^2 = \sum_{i=1}^m c_i^2$ where
$$\left|f(x_1,\dots,x_i,\dots,x_m) - f(x_1,\dots,x'_i,\dots,x_m)\right| \leq c_i$$
for all $x_1,\dots,x_m, x'_i$ and for all $i$.
This bound can be used for averages of a number of observations when
they are computed according to some algorithm as long as the latter can be
encoded in f . In particular, we have the following bound [Hoe63]:
Theorem 2.12 (Hoeffding) Denote by $X_i$ iid random variables with bounded
range $X_i \in [a, b]$ and mean $\mu$. Let $\bar{X}_m := \frac{1}{m}\sum_{i=1}^m X_i$ be their average.
Then the following bound holds:
$$\Pr\left(\left|\bar{X}_m - \mu\right| > \epsilon\right) \leq 2\exp\left(-\frac{2m\epsilon^2}{(b-a)^2}\right). \qquad (2.15)$$
Proof This is a corollary of Theorem 2.11. In $\bar{X}_m$ each individual random
variable has range $[a/m, b/m]$ and we set $f(X_1,\dots,X_m) := \bar{X}_m$. Straightforward algebra shows that $C^2 = m^{-1}(b-a)^2$. Plugging this back into
McDiarmid's theorem proves the claim.
Note that (2.15) is exponentially better than the previous bounds. With
increasing sample size the confidence level also increases exponentially.
Example 2.3 (Hoeffding bound) As in Example 2.2 assume that $X_i$ are
iid random variables and let $\bar{X}_m$ be their average. Moreover, assume that
2.1.4 An Example
It is probably easiest to illustrate the various bounds using a concrete example. In a semiconductor fab processors are produced on a wafer. A typical
300mm wafer holds about 400 chips. A large number of processing steps
are required to produce a finished microprocessor and often it is impossible
to assess the effect of a design decision until the finished product has been
produced.
Assume that the production manager wants to change some step from
process A to some other process B. The goal is to increase the yield of
the process, that is, the number of chips of the 400 potential chips on the
wafer which can be sold. Unfortunately this number is a random variable,
i.e. the number of working chips per wafer can vary widely between different
wafers. Since process A has been running in the factory for a very long
time we may assume that the yield is well known, say it is A = 350 out
of 400 processors on average. It is our goal to determine whether process
B is better and what its yield may be. Obviously, since production runs
are expensive we want to be able to determine this number as quickly as
possible, i.e. using as few wafers as possible. The production manager is risk
averse and wants to ensure that the new process is really better. Hence he
requires a confidence level of 95% before he will change the production.
we can solve for $m = \sigma^2/(\delta\epsilon^2) = 40{,}000/(0.05\cdot 400) = 20{,}000$. In other words,
we would typically need 20,000 wafers to assess with reasonable confidence
whether process B is better than process A. This is completely unrealistic.
Slightly better bounds can be obtained if we are able to make better
assumptions on the variance. For instance, if we can be sure that the yield
of process B is at least 300, then the largest possible variance is
$0.25\,(300 - 0)^2 + 0.75\,(300 - 400)^2 = 30{,}000$, leading to a minimum of 15,000 wafers,
which is not much better.
Hoeffding Since the yields are in the interval $\{0,\dots,400\}$ we have an explicit bound on the range of observations. Recall the inequality (2.16) which
bounds the failure probability $\delta = 0.05$ by an exponential term. Solving this
for $m$ yields
$$m \geq 0.5\,|b - a|^2\,\epsilon^{-2}\log(2/\delta) \approx 737.8. \qquad (2.19)$$
Central Limit Theorem Approximating the distribution of $\bar{X}_m$ by a normal distribution yields
$$\epsilon = 2.96\,\sigma/\sqrt{m} \text{ and hence } m = 8.76\,\sigma^2\epsilon^{-2}. \qquad (2.21)$$
Again, our problem is that we do not know the variance of the distribution.
Using the worst-case bound on the variance, i.e. $\sigma^2 = 40{,}000$, would lead to
a requirement of at least $m = 876$ wafers for testing. However, while we do
not know the variance, we may estimate it along with the mean and use the
empirical estimate, possibly plus some small constant to ensure we do not
underestimate the variance, instead of the upper bound.
Assuming that fluctuations turn out to be in the order of 50 processors,
i.e. $\sigma^2 = 2500$, we are able to reduce our requirement to approximately 55
wafers. This is probably an acceptable number for a practical test.
Rates and Constants The astute reader will have noticed that all three
confidence bounds had scaling behavior $m = O(\epsilon^{-2})$. That is, in all cases
the number of observations was a fairly ill-behaved function of the amount
of confidence required. If we were just interested in convergence per se, a
statement like that of the Chebyshev inequality would have been entirely
sufficient. The various laws and bounds can often be used to obtain considerably better constants for statistical confidence guarantees. For more
complex estimators, such as methods to classify, rank, or annotate data,
a reasoning such as the one above can become highly nontrivial. See e.g.
[MYA94, Vap98] for further details.
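As a concrete check on the wafer example, the short Python sketch below evaluates the Hoeffding solve (2.19) and the CLT-based solve (2.21). The accuracy $\epsilon = 20$ chips and confidence $\delta = 0.05$ are assumptions chosen here to reproduce the numbers quoted above:

import math

delta, eps = 0.05, 20.0
b_minus_a = 400.0                      # yields lie in {0, ..., 400}

# Hoeffding bound (2.19)
m_hoeffding = 0.5 * b_minus_a**2 * eps**-2 * math.log(2 / delta)
print(f"Hoeffding: m >= {m_hoeffding:.1f}")        # about 737.8

# CLT-based bound (2.21) for the worst-case and the estimated variance
for var in (40_000.0, 2_500.0):
    m_clt = 8.76 * var / eps**2
    print(f"CLT bound with sigma^2 = {var:.0f}: m >= {m_clt:.1f}")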
where $\hat{p}_X(x) := \frac{1}{m}\sum_{i=1}^m \{x_i = x\}$ denotes the empirical frequency of $x$ among the observations. $\qquad$ (2.23)
This means that the word dog occurs the following number of times:
Occurrences of dog | Number of documents | Frequency of occurrence | Laplace smoothing
0 | 4 | 0.33  | 0.26
1 | 2 | 0.17  | 0.16
2 | 2 | 0.17  | 0.16
3 | 1 | 0.083 | 0.11
4 | 1 | 0.083 | 0.11
5 | 0 | 0     | 0.05
6 | 2 | 0.17  | 0.16
The problem with this method is that as |X| increases we need increasingly
more observations to obtain even a modicum of precision. On average, we
will need at least one observation for every x X. This can be infeasible for
large domains as the following example shows.
Example 2.4 (Curse of Dimensionality) Assume that X = {0, 1}d , i.e.
x consists of binary bit vectors of dimensionality d. As d increases the size of
X increases exponentially, requiring an exponential number of observations
to perform density estimation. For instance, if we work with images, a $100\times
100$ black and white picture would require in the order of $10^{3010}$ observations
to model such fairly low-resolution images accurately. This is clearly utterly
infeasible: the number of particles in the known universe is in the order
of $10^{80}$. Bellman [Bel61] was one of the first to formalize this dilemma by
coining the term curse of dimensionality.
This example clearly shows that we need better tools to deal with high-dimensional data. We will present one such tool in the next section.
we may choose to smooth it out by a smoothing kernel $h(x)$ such that the
probability mass becomes somewhat more spread out. For a density estimate
on $\mathcal{X} \subseteq \mathbb{R}^d$ this is achieved by
$$\hat{p}(x) = \frac{1}{m}\sum_{i=1}^m r^{-d}\, h\!\left(\frac{x - x_i}{r}\right). \qquad (2.26)$$
Out of the $m$ samples drawn from $p(x)$, the probability that $k$ of them fall
in region $R$ is given by the binomial distribution
$$\binom{m}{k} q^k (1-q)^{m-k}.$$
The expected fraction of points falling inside the region can easily be computed from the expected value of the binomial distribution: $\mathbf{E}[k/m] = q$.
Similarly, the variance can be computed as $\mathrm{Var}[k/m] = q(1-q)/m$. As
$m \to \infty$ the variance goes to 0 and hence the estimate peaks around the
expectation. We can therefore set
$$k \approx mq.$$
If we assume that $R$ is so small that $p(x)$ is constant over $R$, then
$$q \approx p(x)\cdot V,$$
where $V$ is the volume of $R$. Rearranging we obtain
$$p(x) \approx \frac{k}{mV}. \qquad (2.27)$$
Let us now set $R$ to be a cube with side length $r$, and define a function
$$h(u) = \begin{cases} 1 & \text{if } |u_i| \leq \frac{1}{2} \text{ for all } i\\ 0 & \text{otherwise.}\end{cases}$$
Observe that $h\!\left(\frac{x - x_i}{r}\right)$ is 1 if and only if $x_i$ lies inside a cube of side length $r$ centered
around $x$. If we let
$$k = \sum_{i=1}^m h\!\left(\frac{x - x_i}{r}\right),$$
then the density estimate becomes
$$\hat{p}(x) = \frac{1}{m}\sum_{i=1}^m r^{-d}\, h\!\left(\frac{x - x_i}{r}\right),$$
where $r^d$ is the volume of the hypercube of side length $r$ in $d$ dimensions. By symmetry, we can interpret this equation as the sum over $m$ cubes centered around the
$m$ data points $x_i$. If we replace the cube by any smooth kernel function $h(\cdot)$
this recovers (2.26).
There exists a large variety of different kernels which can be used for the
kernel density estimate. [Sil86] has a detailed description of the properties
of a number of kernels. Popular choices are
$$h(x) = (2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2}x^2} \qquad \text{Gaussian kernel} \qquad (2.28)$$
$$h(x) = \tfrac{1}{2} e^{-|x|} \qquad \text{Laplace kernel} \qquad (2.29)$$
$$h(x) = \tfrac{3}{4}\max(0, 1 - x^2) \qquad \text{Epanechnikov kernel} \qquad (2.30)$$
$$h(x) = \tfrac{1}{2}\chi_{[-1,1]}(x) \qquad \text{Uniform kernel} \qquad (2.31)$$
$$h(x) = \max(0, 1 - |x|) \qquad \text{Triangle kernel.} \qquad (2.32)$$
Further kernels are the triweight and the quartic kernel which are basically
powers of the Epanechnikov kernel. For practical purposes the Gaussian kernel (2.28) or the Epanechnikov kernel (2.30) are most suitable. In particular,
the latter has the attractive property of compact support. This means that
for any given density estimate at location $x$ we will only need to evaluate
terms $h((x_i - x)/r)$ for which the distance $\|x_i - x\|$ is less than $r$. Such expansions are computationally much cheaper, in particular when we make use of
fast nearest neighbor search algorithms [GIM99, IM98]. Figure 2.7 has some
examples of kernels.
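A minimal Python sketch of the Parzen windows estimate (2.26) in one dimension is shown below; the sample values and the bandwidth $r$ are made-up numbers chosen purely for illustration:

import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    return 0.75 * np.maximum(0.0, 1.0 - u**2)

def parzen(x, data, r, kernel=gaussian):
    """Evaluate p_hat(x) = 1/m sum_i r^-d h((x - x_i)/r) for d = 1."""
    x = np.atleast_1d(x)[:, None]                 # query points as a column
    return np.mean(kernel((x - data[None, :]) / r), axis=1) / r

data = np.array([62.0, 65.0, 70.0, 71.0, 74.0, 80.0, 85.0, 90.0])
grid = np.linspace(50, 100, 6)
print(parzen(grid, data, r=3.0))                  # Gaussian kernel
print(parzen(grid, data, r=3.0, kernel=epanechnikov))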
2.2.3 Parameter Estimation
So far we have not discussed the issue of parameter selection. It should be
evident from Figure 2.6, though, that it is quite crucial to choose a good
kernel width. Clearly, a kernel that is overly wide will oversmooth any fine
detail that there might be in the density. On the other hand, a very narrow
kernel will not be very useful, since it will be able to make statements only
about the locations where we actually observed data.
Fig. 2.5. Left: a naive density estimate given a sample of the weight of 18 persons.
Right: the underlying weight distribution.
Fig. 2.6. Parzen windows density estimate associated with the 18 observations of
the Figure above. From left to right: Gaussian kernel density estimate with kernel
of width 0.3, 1, 3, and 10 respectively.
Fig. 2.7. Some kernels for Parzen windows density estimation. From left to right:
Gaussian kernel, Laplace kernel, Epanechnikov kernel, and uniform density.
The log-likelihood of the data under the Parzen windows estimate (2.26) is given by
$$\log\prod_{i=1}^m \hat{p}(x_i) = -m\log m + \sum_{i=1}^m \log\sum_{j=1}^m r^{-d}\, h\!\left(\frac{x_i - x_j}{r}\right). \qquad (2.33)$$
Remark 2.13 (Log-likelihood) We consider the logarithm of the likelihood for reasons of computational stability to prevent numerical underflow.
While each term $\hat{p}(x_i)$ might be within a suitable range, say $10^{-2}$, the product of 1000 of such terms will easily exceed the exponent of floating point
representations on a computer. Summing over the logarithms, on the other
hand, is perfectly feasible even for large numbers of observations.
Unfortunately computing the log-likelihood is equally infeasible: for decreasing r the only surviving terms in (2.33) are the functions h((xi xi )/r) =
h(0), since the arguments of all other kernel functions diverge. In other
words, the log-likelihood is maximized when p(x) is peaked exactly at the
locations where we observed the data. The graph on the left of Figure 2.6
shows what happens in such a situation.
What we just experienced is a case of overfitting where our model is too
flexible. This led to a situation where our model was able to explain the
observed data unreasonably well, simply because we were able to adjust
our parameters given the data. We will encounter this situation throughout
the book. There exist a number of ways to address this problem.
Validation Set: We could use a subset of our set of observations as an
estimate of the log-likelihood. That is, we could partition the observations into X := {x1 , . . . , xn } and X0 := {xn+1 , . . . , xm } and use
the second part for a likelihood score according to (2.33). The second
set is typically called a validation set.
n-fold Cross-validation: Taking this idea further, note that there is no
particular reason why any given xi should belong to X or X0 respectively. In fact, we could use all splits of the observations into sets
X and X0 to infer the quality of our estimate. While this is computationally infeasible, we could decide to split the observations into
n equally sized subsets, say X1 , . . . , Xn and use each of them as a
validation set at a time while the remainder is used to generate a
density estimate.
Typically n is chosen to be 10, in which case this procedure is
referred to as 10-fold cross-validation. It is a computationally attractive procedure insofar as it does not require us to change the
basic estimation algorithm. Nonetheless, computation can be costly.
Leave-one-out Estimator: At the extreme end of cross-validation we could
choose n = m. That is, we only remove a single observation at a time
and use the remainder of the data for the estimate. Using the average
over the likelihood scores provides us with an even more fine-grained
estimate. Denote by pi (x) the density estimate obtained by using
X := {x1 , . . . , xm } without xi . For a Parzen windows estimate this
is given by
$$\hat{p}_{-i}(x_i) = (m-1)^{-1}\sum_{j\neq i} r^{-d}\, h\!\left(\frac{x_i - x_j}{r}\right) = \frac{m}{m-1}\left[\hat{p}(x_i) - \frac{r^{-d} h(0)}{m}\right]. \qquad (2.34)$$
Note that this is precisely the term $r^{-d} h(0)$ that is removed from
the estimate. It is this term which led to divergent estimates for
$r \to 0$. This means that the leave-one-out log-likelihood estimate
can be computed easily via
$$L(\mathbf{X}) = m\log\frac{m}{m-1} + \sum_{i=1}^m \log\left[\hat{p}(x_i) - \frac{r^{-d} h(0)}{m}\right]. \qquad (2.35)$$
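As a sketch of bandwidth selection via (2.35), the Python snippet below evaluates the leave-one-out log-likelihood of a one-dimensional Gaussian-kernel Parzen estimate over a few candidate bandwidths; the data values and the candidate list are illustrative assumptions:

import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def loo_log_likelihood(data, r):
    m = len(data)
    diffs = (data[:, None] - data[None, :]) / r
    p_hat = gaussian(diffs).mean(axis=1) / r       # p_hat(x_i), eq. (2.26)
    inner = p_hat - gaussian(0.0) / (r * m)        # remove the x_j = x_i term
    return m * np.log(m / (m - 1)) + np.sum(np.log(inner))

data = np.array([62.0, 65.0, 70.0, 71.0, 74.0, 80.0, 85.0, 90.0])
for r in (0.3, 1.0, 3.0, 10.0):
    print(f"r = {r:5.1f}   LOO log-likelihood = {loo_log_likelihood(data, r):8.3f}")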
Fig. 2.8. Nonuniform density. Left: original density with samples drawn from the
distribution. Middle: density estimate with a uniform kernel. Right: density estimate
using Silverman's adjustment.
$$\hat{p}(x) = \frac{1}{m}\sum_{i=1}^m r_i^{-d}\, h\!\left(\frac{x - x_i}{r_i}\right). \qquad (2.37)$$
Fig. 2.9. Watson Nadaraya estimate. Left: a binary classifier. The optimal solution
would be a straight line since both classes were drawn from a normal distribution
with the same variance. Right: a regression estimator. The data was generated from
a sinusoid with additive noise. The regression tracks the sinusoid reasonably well.
2.3.1 Basics
Densities from the exponential family are defined by
$$p(x;\theta) := p_0(x)\exp\left(\langle\phi(x), \theta\rangle - g(\theta)\right). \qquad (2.40)$$
Here $p_0(x)$ is a density on $\mathcal{X}$, often called the base measure, and $\phi(x)$ is
a map from $x$ to the sufficient statistics $\phi(x)$. $\theta$ is commonly referred to as
the natural parameter. It lives in the space dual to $\phi(x)$. Moreover, $g(\theta)$ is a
normalization constant which ensures that $p(x;\theta)$ is properly normalized. $g$ is
often referred to as the log-partition function. The name stems from physics
where $Z = e^{g(\theta)}$ denotes the number of states of a physical ensemble. $g$ can
be computed as follows:
$$g(\theta) = \log\int_{\mathcal{X}} \exp\left(\langle\phi(x),\theta\rangle\right) dx. \qquad (2.41)$$
The log-partition function generates the moments of the sufficient statistics:
$$\nabla_\theta g(\theta) = \mathbf{E}_x[\phi(x)] \quad\text{and}\quad \nabla_\theta^2 g(\theta) = \mathrm{Var}_x[\phi(x)]. \qquad (2.42)$$
Proof Note that $\nabla_\theta^2 g(\theta) = \mathrm{Var}_x[\phi(x)]$ implies that $g$ is convex, since the
covariance matrix is positive semidefinite. To show (2.42) we expand
$$\nabla_\theta g(\theta) = \frac{\int_{\mathcal{X}} \phi(x)\exp\langle\phi(x),\theta\rangle\, dx}{\int_{\mathcal{X}} \exp\langle\phi(x),\theta\rangle\, dx} = \int_{\mathcal{X}} \phi(x)\, p(x;\theta)\, dx = \mathbf{E}_x[\phi(x)]. \qquad (2.43)$$
Next we take the second derivative to obtain
$$\nabla_\theta^2 g(\theta) = \int_{\mathcal{X}} \phi(x)\left[\phi(x) - \nabla_\theta g(\theta)\right]^\top p(x;\theta)\, dx \qquad (2.44)$$
$$= \mathbf{E}_x\left[\phi(x)\phi(x)^\top\right] - \mathbf{E}_x[\phi(x)]\,\mathbf{E}_x[\phi(x)]^\top, \qquad (2.45)$$
which proves the claim. For the first equality we used (2.43). For the second
line we used the definition of the variance.
One may show that higher derivatives $\nabla_\theta^n g(\theta)$ generate higher order cumulants of $\phi(x)$ under $p(x;\theta)$. This is why $g$ is often also referred to as the
cumulant-generating function. Note that in general, computation of $g(\theta)$
2.3.2 Examples
A large number of densities are members of the exponential family. Note,
however, that in statistics it is not common to express them in the dot
product formulation for historic reasons and for reasons of notational compactness. We discuss a number of common densities below and show why
they can be written in terms of an exponential family. A detailed description
of the most commonly occurring types are given in a table.
Gaussian Let $x, \mu \in \mathbb{R}^d$ and let $\Sigma \in \mathbb{R}^{d\times d}$ where $\Sigma \succ 0$, that is, $\Sigma$ is a
positive definite matrix. In this case the normal distribution can be
expressed via
$$p(x) = (2\pi)^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right) \qquad (2.46)$$
$$= \exp\left(x^\top\Sigma^{-1}\mu + \mathrm{tr}\left\{-\tfrac{1}{2}xx^\top\Sigma^{-1}\right\} - c(\mu,\Sigma)\right),$$
where $c(\mu,\Sigma) = \frac{1}{2}\mu^\top\Sigma^{-1}\mu + \frac{d}{2}\log 2\pi + \frac{1}{2}\log|\Sigma|$. By combining the
terms in $x$ into $\phi(x) := (x, -\frac{1}{2}xx^\top)$ we obtain the sufficient statistics
of $x$. The corresponding linear coefficients $(\Sigma^{-1}\mu, \Sigma^{-1})$ constitute the
natural parameter $\theta$. All that remains to be done to express $p(x)$ in
terms of (2.40) is to rewrite $g(\theta)$ in terms of $c(\mu,\Sigma)$. The summary
table on the following page contains details.
Multinomial Another popular distribution is one over $k$ discrete events.
In this case $\mathcal{X} = \{1,\dots,k\}$ and we have in completely generic terms
$p(x) = \pi_x$ where $\pi_x \geq 0$ and $\sum_x \pi_x = 1$. Now denote by $e_x \in \mathbb{R}^k$ the
$x$-th unit vector of the canonical basis, that is $\langle e_x, e_{x'}\rangle = 1$ if $x = x'$
and 0 otherwise. In this case we may rewrite $p(x)$ via
$$p(x) = \pi_x = \exp\left(\langle e_x, \log\pi\rangle\right). \qquad (2.47)$$
Poisson This distribution over the nonnegative integers can be written as
$$p(x;\lambda) = \frac{e^{-\lambda}\lambda^x}{x!} = \frac{1}{x!}\exp\left(x\log\lambda - \lambda\right) \text{ where } x \in \mathbb{N}_0. \qquad (2.48)$$
For $n \to \infty$ the second term converges to $e^{-\lambda}$. The third term converges to 1, since we have a product of only $2k$ terms, each of which
converge to 1. Using the exponential families notation we may check
that $\mathbf{E}[x] = \lambda$ and that moreover also $\mathrm{Var}[x] = \lambda$.
Fig. 2.10. Left: Poisson distributions with $\lambda \in \{1, 3, 10\}$. Right: Beta distributions
with $a = 2$ and $b \in \{1, 2, 3, 5, 7\}$. Note how with increasing $b$ the distribution
becomes more peaked close to the origin.

Beta This is a distribution on the unit interval $\mathcal{X} = [0, 1]$ which is very
versatile when it comes to modelling unimodal and bimodal distributions. It is given by
$$p(x) = x^{a-1}(1 - x)^{b-1}\,\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}. \qquad (2.50)$$
[Summary table: for each member of the exponential family (Beta, Gamma, Wishart, Dirichlet, Inverse $\chi^2$, Logarithmic, Conjugate, Multinomial, Exponential, Poisson, Laplace, Gaussian, Inverse Normal) the table lists the domain $\mathcal{X}$, the base measure, the sufficient statistics $\phi(x)$, the log-partition function $g(\theta)$, and the domain of the natural parameter $\theta$.]
$S_n$ denotes the probability simplex in $n$ dimensions. $C_n$ is the cone of positive semidefinite matrices in $\mathbb{R}^{n\times n}$.
2.4 Estimation
In many statistical problems the challenge is to estimate parameters of interest. For instance, in the context of exponential families, we may want
to estimate a parameter $\hat\theta$ such that it is close to the true parameter $\theta^*$
in the distribution. While the problem is fully general, we will describe the
relevant steps in obtaining estimates for the special case of the exponential
family. This is done for two reasons: firstly, exponential families are an
important special case and we will encounter slightly more complex variants
of the reasoning in later chapters of the book. Secondly, they are of a sufficiently simple form that we are able to show a range of different techniques.
In more advanced applications only a small subset of those methods may be
practically feasible. Hence exponential families provide us with a working
example based on which we can compare the consequences of a number of
different techniques.
2.4.1 Maximum Likelihood Estimation
Whenever we have a distribution p(x; ) parametrized by some parameter
we may use data to find a value of which maximizes the likelihood that
the data would have been generated by a distribution with this choice of
parameter.
For instance, assume that we observe a set of temperature measurements
X = {x1 , . . . , xm }. In this case, we could try finding a normal distribution
such that the likelihood p(X; ) of the data under the assumption of a normal
distribution is maximized. Note that this does not imply in any way that the
temperature measurements are actually drawn from a normal distribution.
Instead, it means that we are attempting to find the Gaussian which fits the
data in the best fashion.
While this distinction may appear subtle, it is critical: we do not assume
that our model accurately reflects reality. Instead, we simply try doing the
best possible job at modeling the data given a specified model class. Later
we will encounter alternative approaches at estimation, namely Bayesian
methods, which make the assumption that our model ought to be able to
describe the data accurately.
Definition 2.16 (Maximum Likelihood Estimator) For a model $p(\cdot;\theta)$
parametrized by $\theta$ and observations $\mathbf{X}$ the maximum likelihood estimator
(MLE) is
$$\hat\theta_{\mathrm{ML}}[\mathbf{X}] := \operatorname*{argmax}_\theta\, p(\mathbf{X};\theta). \qquad (2.51)$$
For an exponential family model the likelihood of the data factorizes as
$$p(\mathbf{X};\theta) = \prod_{i=1}^m p(x_i;\theta) = \prod_{i=1}^m p_0(x_i)\exp\left(\langle\phi(x_i),\theta\rangle - g(\theta)\right) \qquad (2.52)$$
$$= \exp\left(m\left(\langle\mu[\mathbf{X}],\theta\rangle - g(\theta)\right)\right)\prod_{i=1}^m p_0(x_i), \qquad (2.53)$$
$$\text{where } \mu[\mathbf{X}] := \frac{1}{m}\sum_{i=1}^m \phi(x_i). \qquad (2.54)$$
Here $\mu[\mathbf{X}]$ is the empirical average of the map $\phi(x)$. Maximization of $p(\mathbf{X};\theta)$
is equivalent to minimizing the negative log-likelihood $-\log p(\mathbf{X};\theta)$. The
latter is a common practical choice since for independently drawn data,
the product of probabilities decomposes into the sum of the logarithms of
individual likelihoods. This leads to the following objective function to be
minimized (up to a constant independent of $\theta$):
$$-\log p(\mathbf{X};\theta) = m\left[g(\theta) - \langle\theta, \mu[\mathbf{X}]\rangle\right]. \qquad (2.55)$$
Often the Poisson distribution is specified using $\lambda := e^\theta$ as its rate parameter. In this case we
have $p(x;\lambda) = \lambda^x e^{-\lambda}/x!$ as its parametrization. The advantage of the natural parametrization
using $\theta$ is that we can directly take advantage of the properties of the log-partition function as
generating the cumulants of $x$.
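A tiny Python sketch of this estimator for the Poisson model is shown below; the observed counts are made-up values, and the computation simply matches the empirical average of the sufficient statistic $\phi(x) = x$:

import numpy as np

x = np.array([2, 0, 3, 1, 4, 2, 2, 5, 1, 3])

mu_X = x.mean()                 # empirical average of phi(x) = x
theta = np.log(mu_X)            # theta = log(sum x_i) - log m, cf. (2.57)
lam = np.exp(theta)             # rate parameter lambda = e^theta

print(f"mu[X] = {mu_X:.3f}, theta = {theta:.3f}, lambda = {lam:.3f}")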
For the Poisson distribution the MLE condition $\nabla_\theta g(\theta) = \mu[\mathbf{X}]$ yields
$$e^\theta = \frac{1}{m}\sum_{i=1}^m x_i \text{ and hence } \theta = \log\sum_{i=1}^m x_i - \log m. \qquad (2.57)$$
For the multinomial distribution the same condition reads
$$\frac{e^{\theta_i}}{\sum_j e^{\theta_j}} = \frac{1}{m}\sum_{j=1}^m \{x_j = i\}. \qquad (2.58)$$
It is easy to check that (2.58) is satisfied for $e^{\theta_i} = \sum_{j=1}^m \{x_j = i\}$. In other
words, the MLE for a discrete distribution is simply given by the empirical
frequencies of occurrence.
The multinomial setting also exhibits two rather important aspects of exponential families: firstly, choosing $\theta_i = c + \log\sum_{j=1}^m \{x_j = i\}$ for any $c \in \mathbb{R}$
will lead to an equivalent distribution. This is the case since the sufficient
statistic $\phi(x)$ is not minimal. In our context this means that the coordinates
of $\phi(x)$ are linearly dependent: for any $x$ we have that $\sum_j [\phi(x)]_j = 1$,
hence we could eliminate one dimension. This is precisely the additional
degree of freedom which is reflected in the scaling freedom in $\theta$.
Secondly, for data where some events do not occur at all, the expression
$\log\left[\sum_{j=1}^m \{x_j = i\}\right] = \log 0$ is ill defined. This is due to the fact that this
particular set of counts occurs on the boundary of the convex set within
which the natural parameters are well defined. We will see how different
types of priors can alleviate the issue.
Using the MLE is not without problems. As we saw in Figure 2.1, convergence can be slow, since we are not using any side information. The latter
can provide us with problems which are both numerically better conditioned
and which show better convergence, provided that our assumptions are accurate. Before discussing a Bayesian approach to estimation, let us discuss
basic statistical properties of the estimator.
Fig. 2.11. Left: unbiased estimator; the estimates, denoted by circles have as mean
the true parameter, as denoted by a star. Middle: consistent estimator. While the
true model is not within the class we consider (as denoted by the ellipsoid), the
estimates converge to the white star which is the best model within the class that
approximates the true model, denoted by the solid star. Right: different estimators
have different regions of uncertainty, as made explicit by the ellipses around the
true parameter (solid star).
$$m^{\frac{1}{2}}\left[\mu[\mathbf{X}] - \mu\right] \to \mathcal{N}\left(0, \nabla_\theta^2 g(\theta)\right). \qquad (2.59)$$
This tells us how much variability we should expect in $\mu[\mathbf{X}]$
when performing estimation. The Cramer-Rao bound governs this.
Theorem 2.19 (Cramér and Rao [Rao73]) Assume that $\mathbf{X}$ is drawn from
$p(\mathbf{X};\theta)$ and let $\hat\theta[\mathbf{X}]$ be an asymptotically unbiased estimator. Denote by $I$ the Fisher information matrix and by $B$ the variance of $\hat\theta[\mathbf{X}]$. Then $\det IB \geq 1$.
Proof We use the fact that
$$\partial_\theta\int p(\mathbf{X};\theta)\, d\mathbf{X} = \partial_\theta 1 = 0. \qquad (2.62)$$
Hence we may simplify the covariance formula by dropping the means via
$$\mathbf{E}\left[\hat\theta(\mathbf{X})\,\partial_\theta\log p(\mathbf{X};\theta)\right] = \int p(\mathbf{X};\theta)\,\hat\theta(\mathbf{X})\,\partial_\theta\log p(\mathbf{X};\theta)\, d\mathbf{X}$$
$$= \int \hat\theta(\mathbf{X})\,\partial_\theta p(\mathbf{X};\theta)\, d\mathbf{X} = \partial_\theta\int \hat\theta(\mathbf{X})\,p(\mathbf{X};\theta)\, d\mathbf{X} = \partial_\theta\theta = 1.$$
Here the last equality follows since we may interchange integration by $\mathbf{X}$
and the derivative with respect to $\theta$.
The Cramer-Rao theorem implies that there is a limit to how well we may
estimate a parameter given finite amounts of data. It is also a yardstick by
which we may measure how efficiently an estimator uses data. Formally, we
define the efficiency as the quotient between actual performance and the
Cramer-Rao bound via
$$e := 1/\det IB. \qquad (2.63)$$
Theorem 2.18 implies that for exponential families MLE is asymptotically efficient. This turns out to be true more generally:
Theorem 2.20 (Efficiency of MLE [Cra46, GW92, Ber85]) The maximum likelihood estimator is asymptotically efficient ($e = 1$).
So far we assumed that there exists a true parameter $\theta^*$ generating $p(\cdot;\theta^*)$. If this is not true, we need to settle for less: how well $\hat\theta[\mathbf{X}]$
approaches the best possible choice of $\theta$ within the given model class. Such
behavior is referred to as consistency. Note that it is not possible to define
consistency per se. For instance, we may ask whether $\hat\theta$ converges to the
optimal parameter $\theta^*$, or whether $p(x;\hat\theta)$ converges to the optimal density
$p(x;\theta^*)$, and with respect to which norm. Under fairly general conditions
this turns out to be true for finite-dimensional parameters and smoothly
parametrized densities. See [DGL96, vdG00] for proofs and further details.
2.4.3 A Bayesian Approach
The analysis of the Maximum Likelihood method might suggest that inference is a solved problem. After all, in the limit, MLE is unbiased and it
exhibits the smallest possible variance. Empirical results using a finite amount
of data, as presented in Figure 2.1, indicate otherwise.
While not making any assumptions can lead to interesting and general
theorems, it ignores the fact that in practice we almost always have some
idea about what to expect of our solution. It would be foolish to ignore such
additional information. For instance, when trying to determine the voltage
of a battery, it is reasonable to expect a measurement in the order of 1.5V
or less. Consequently such prior knowledge should be incorporated into the
estimation process. In fact, the use of side information to guide estimation
turns out to be the tool to building estimators which work well in high
dimensions.
Recall Bayes rule (1.15) which states that $p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)}$. In our context this means that if we are interested in the posterior probability of $\theta$
assuming a particular value, we may obtain this using the likelihood (often
referred to as evidence) of $x$ having been generated by $\theta$ via $p(x|\theta)$ and our
prior belief $p(\theta)$ that $\theta$ might be chosen in the distribution generating $x$.
Observe the subtle but important difference to MLE: instead of treating $\theta$
as a parameter of a density model, we treat $\theta$ as an unobserved random
variable which we may attempt to infer given the observations $\mathbf{X}$.
This can be done for a number of different purposes: we might want to
infer the most likely value of the parameter given the posterior distribution
$p(\theta|\mathbf{X})$. This is achieved by
$$\hat\theta_{\mathrm{MAP}}(\mathbf{X}) := \operatorname*{argmax}_\theta\, p(\theta|\mathbf{X}) = \operatorname*{argmin}_\theta\, -\log p(\mathbf{X}|\theta) - \log p(\theta). \qquad (2.64)$$
The second equality follows since $p(\mathbf{X})$ does not depend on $\theta$. This estimator
is also referred to as the Maximum a Posteriori, or MAP estimator. It differs
from the maximum likelihood estimator by adding the negative log-prior
to the optimization problem. For this reason it is sometimes also referred
to as Penalized MLE. Effectively we are penalizing unlikely choices via
$-\log p(\theta)$.
Note that using $\hat\theta_{\mathrm{MAP}}(\mathbf{X})$ as the parameter of choice is not quite accurate.
After all, we can only infer a distribution over $\theta$ and in general there is no
guarantee that the posterior is indeed concentrated around its mode. A more
accurate treatment is to use the distribution $p(\theta|\mathbf{X})$ directly via
$$p(x|\mathbf{X}) = \int p(x|\theta)\,p(\theta|\mathbf{X})\, d\theta. \qquad (2.65)$$
In other words, we integrate out the unknown parameter and obtain the
density estimate directly. As we will see, it is generally impossible to solve
(2.65) exactly, an important exception being conjugate priors. In the other
cases one may resort to sampling from the posterior distribution to approximate the integral.
While it is possible to design a wide variety of prior distributions, this book
focuses on two important families: norm-constrained and conjugate priors. Norm-constrained priors take the form
$$p(\theta) \propto \exp\left(-\lambda\|\theta - \theta_0\|_p^d\right). \qquad (2.66)$$
That is, they restrict the deviation of the parameter value $\theta$ from some guess
$\theta_0$. The intuition is that extreme values of $\theta$ are much less likely than more
moderate choices of $\theta$ which will lead to more smooth and even distributions
$p(x|\theta)$.
A popular choice is the Gaussian prior which we obtain for $p = d = 2$
and $\lambda = 1/2\sigma^2$. Typically one sets $\theta_0 = 0$ in this case. Note that in (2.66)
we did not spell out the normalization of $p(\theta)$; in the context of MAP
estimation this is not needed since it simply becomes a constant offset in
the optimization problem (2.64). We have
$$\hat\theta_{\mathrm{MAP}}[\mathbf{X}] = \operatorname*{argmin}_\theta\, m\left[g(\theta) - \langle\theta, \mu[\mathbf{X}]\rangle\right] + \lambda\|\theta - \theta_0\|_p^d. \qquad (2.67)$$
The second important family are conjugate priors, which for an exponential family model take the form
$$p(\theta|n,\nu) = \exp\left(\langle n\nu, \theta\rangle - n\,g(\theta) - h(\nu, n)\right), \qquad (2.68)$$
where $n > 0$ and $\nu$ are parameters of the prior and $h(\nu, n)$ is the corresponding log-normalization
$$h(\nu, n) = \log\int\exp\left(\langle n\nu, \theta\rangle - n\,g(\theta)\right) d\theta. \qquad (2.69)$$
Note that $p(\theta|n,\nu)$ itself is a member of the exponential family with the
feature map $\phi(\theta) = (\theta, -g(\theta))$. Hence $h(\nu, n)$ is convex in $(n\nu, n)$. Moreover,
the posterior distribution has the form
$$p(\theta|\mathbf{X}) \propto p(\mathbf{X}|\theta)\,p(\theta|n,\nu) \propto \exp\left(\langle m\mu[\mathbf{X}] + n\nu, \theta\rangle - (m+n)\,g(\theta)\right). \qquad (2.70)$$
Fig. 2.12. From left to right: regions of equal prior probability in $\mathbb{R}^2$ for priors using
the $\ell_1$, $\ell_2$ and $\ell_\infty$ norm. Note that only the $\ell_2$ norm is invariant with regard to the
coordinate system. As we shall see later, the $\ell_1$ norm prior leads to solutions where
only a small number of coordinates is nonzero.
That is, the posterior distribution has the same form as a conjugate prior
with parameters $\frac{m\mu[\mathbf{X}] + n\nu}{m+n}$ and $m + n$. In other words, $n$ acts like a phantom
sample size and $\nu$ is the corresponding mean parameter. Such an interpretation is reasonable given our desire to design a prior which, when combined
with the likelihood, remains in the same model class: we treat prior knowledge as having observed virtual data beforehand which is then added to the
actual set of observations. In this sense data and prior become completely
equivalent: we obtain our knowledge either from actual observations or
from virtual observations which describe our belief into how the data generation process is supposed to behave.
Eq. (2.70) has the added benefit of allowing us to provide an exact normalized version of the posterior. Using (2.68) we obtain that
$$p(\theta|\mathbf{X}) = \exp\left(\langle m\mu[\mathbf{X}] + n\nu, \theta\rangle - (m+n)\,g(\theta) - h\!\left(\tfrac{m\mu[\mathbf{X}] + n\nu}{m+n}, m+n\right)\right).$$
The main remaining challenge is to compute the normalization $h$ for a range
of important conjugate distributions. The table on the following page provides details. Besides attractive algebraic properties, conjugate priors also
have a second advantage: the integral (2.65) can be solved exactly:
$$p(x|\mathbf{X}) = \int p(x|\theta)\,p(\theta|\mathbf{X})\, d\theta.$$
Combining terms one may check that the integrand amounts to the normalization of a conjugate distribution with parameters $\frac{m\mu[\mathbf{X}] + n\nu + \phi(x)}{m+n+1}$ and $m+n+1$, so that the integral evaluates to a ratio of the corresponding normalizations $h$.
Such an expansion is very useful whenever we would like to draw $x$ from
$p(x|\mathbf{X})$ without the need to obtain an instantiation of the latent variable $\theta$.
We provide explicit expansions in appendix 2. [GS04] use the fact that $\theta$
can be integrated out to obtain what is called a collapsed Gibbs sampler for
topic models [BNJ03].
2.4.4 An Example
Assume we would like to build a language model based on available documents. For instance, a linguist might be interested in estimating the frequency of words in Shakespeares collected works, or one might want to
compare the change with respect to a collection of webpages. While models describing documents by treating them as bags of words which all have
been obtained independently of each other are exceedingly simple, they are
valuable for quick-and-dirty content filtering and categorization, e.g. a spam
filter on a mail server or a content filter for webpages.
Hence we model a document $d$ as a multinomial distribution: denote by
$w_i$ for $i \in \{1,\dots,m_d\}$ the words in $d$. Moreover, denote by $p(w|\theta)$ the
probability of occurrence of word $w$; then under the assumption that the
words are independently drawn, we have
$$p(d|\theta) = \prod_{i=1}^{m_d} p(w_i|\theta). \qquad (2.71)$$
It is our goal to find parameters $\theta$ such that $p(d|\theta)$ is accurate. For a given
collection $D$ of documents denote by $m_w$ the number of counts for word $w$
in the entire collection. Moreover, denote by $m$ the total number of words
in the entire collection. In this case we have
$$p(D|\theta) = \prod_i p(d_i|\theta) = \prod_w p(w|\theta)^{m_w}. \qquad (2.72)$$
Using a conjugate prior the estimate of the word probability becomes
$$p(w|\hat\theta) = \frac{m_w + n_w}{m + n}. \qquad (2.74)$$
In other words, we add the pseudo counts $n_w$ to the actual word counts $m_w$.
This is particularly useful when the document we are dealing with is brief,
that is, whenever we have little data: it is quite unreasonable to infer from
a webpage of approximately 1000 words that words not occurring in this
page have zero probability. This is exactly what is mitigated by means of
the conjugate prior $(\nu, n)$.
Finally, let us consider norm-constrained priors of the form (2.66). In this
case, the integral required for
$$p(D) = \int p(D|\theta)\,p(\theta)\, d\theta \propto \int \exp\left(-\lambda\|\theta - \theta_0\|_p^d + m\langle\mu[D],\theta\rangle - m\,g(\theta)\right) d\theta$$
is intractable and we need to resort to an approximation. A popular choice
is to replace the integral by $p(D|\theta^*)$ where $\theta^*$ maximizes the integrand. This
is precisely the MAP approximation of (2.64). Hence, in order to perform
estimation we need to solve
$$\operatorname*{minimize}_\theta\; g(\theta) - \langle\mu[D],\theta\rangle + \frac{\lambda}{m}\|\theta - \theta_0\|_p^d. \qquad (2.75)$$
A very simple strategy for minimizing (2.75) is gradient descent. That is, for
a given value of $\theta$ we compute the gradient of the objective function and take
a fixed step towards its minimum. For simplicity assume that $d = p = 2$ and
$\lambda = 1/2\sigma^2$, that is, we assume that $\theta$ is normally distributed with variance
$\sigma^2$ and mean $\theta_0$. The gradient is given by
$$\partial_\theta\left[-\log p(D,\theta)\right] = \mathbf{E}_{x\sim p(x|\theta)}\left[\phi(x)\right] - \mu[D] + \frac{1}{m\sigma^2}\left[\theta - \theta_0\right]. \qquad (2.76)$$
In other words, it depends on the discrepancy between the mean of $\phi(x)$
with respect to our current model and the empirical average $\mu[D]$, and the
difference between $\theta$ and the prior mean $\theta_0$.
Unfortunately, convergence of the procedure [. . .] is usually
very slow, even if we adjust the steplength efficiently. The reason is that
the gradient need not point towards the minimum as the space is most likely
2.5 Sampling
So far we considered the problem of estimating the underlying probability
density, given a set of samples drawn from that density. Now let us turn to
the converse problem, that is, how to generate random variables given the
underlying probability density. In other words, we want to design a random
variable generator. This is useful for a number of reasons:
We may encounter probability distributions where optimization over suitable model parameters is essentially impossible and where it is equally impossible to obtain a closed form expression of the distribution. In these cases
it may still be possible to perform sampling to draw examples of the kind
of data we expect to see from the model. Chapter ?? discusses a number of
graphical models where this problem arises.
Secondly, assume that we are interested in testing the performance of a
network router under different load conditions. Instead of introducing the
under-development router in a live network and wreaking havoc, one could
estimate the probability density of the network traffic under various load
conditions and build a model. The behavior of the network can then be
simulated by using a probabilistic model. This involves drawing random
variables from an estimated probability distribution.
Carrying on, suppose that we generate data packets by sampling and see
an anomalous behavior in your router. In order to reproduce and debug
this problem one needs access to the same set of random packets which
caused the problem in the first place. In other words, it is often convenient
if our random variable generator is reproducible; At first blush this seems
like a contradiction. After all, our random number generator is supposed
to generate random variables. This is less of a contradiction if we consider
how random numbers are generated in a computer given a particular
initialization (which typically depends on the state of the system, e.g. time,
disk size, bios checksum, etc.) the random number algorithm produces a
sequence of numbers which, for all practical purposes, can be treated as iid.
A simple method is the linear congruential generator [PTVF94]
$$x_{i+1} = (a x_i + b) \bmod c.$$
The performance of these iterations depends significantly on the choice of the
constants $a, b, c$. For instance, the GNU C compiler uses $a = 1103515245$, $b =
12345$ and $c = 2^{32}$. In general $b$ and $c$ need to be relatively prime and $a - 1$
needs to be divisible by all prime factors of $c$ and by 4. It is very much
advisable not to attempt implementing such generators on one's own unless
it is absolutely necessary.
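For illustration only, the recurrence with the constants quoted above can be written in a few lines of Python (the seed and the number of draws are arbitrary):

a, b, c = 1103515245, 12345, 2**32

def lcg(seed, n):
    """Yield n pseudo random numbers in [0, 1) via x_{i+1} = (a x_i + b) mod c."""
    x = seed
    for _ in range(n):
        x = (a * x + b) % c
        yield x / c

print(list(lcg(seed=42, n=5)))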
Useful desiderata for a pseudo random number generator (PRNG) are that
for practical purposes it is statistically indistinguishable from a sequence of
iid data. That is, when applying a number of statistical tests, we will accept
the null-hypothesis that the random variables are iid. See Chapter ?? for
a detailed discussion of statistical testing procedures for random variables.
In the following we assume that we have access to a uniform RNG U [0, 1]
which draws random numbers uniformly from the range [0, 1].
Fig. 2.13. Left: discrete probability distribution over 5 possible outcomes. Right:
associated cumulative distribution function. When sampling, we draw $x$ uniformly
at random from $U[0, 1]$ and compute the inverse of $F$.
variable $x := \phi(z)$ is drawn from $\left|\partial_x\phi^{-1}(x)\right|\cdot p(\phi^{-1}(x))$. Here $\left|\partial_x\phi^{-1}(x)\right|$
denotes the determinant of the Jacobian of $\phi^{-1}$.
This follows immediately by applying a variable transformation for a measure, i.e. we change $dp(z)$ to $dp(\phi^{-1}(x))\left|\partial_x\phi^{-1}(x)\right|$. Such a conversion strategy is particularly useful for univariate distributions.
Corollary 2.22 Denote by $p(x)$ a distribution on $\mathbb{R}$ with cumulative distribution function $F(x') = \int_{-\infty}^{x'} dp(x)$. Then the transformation $x = \phi(z) =
F^{-1}(z)$ converts samples $z \sim U[0, 1]$ to samples drawn from $p(x)$.
We now apply this strategy to a number of univariate distributions. One of
the most common cases is sampling from a discrete distribution.
Example 2.8 (Discrete Distribution) In the case of a discrete distribution over {1, . . . , k} the cumulative distribution function is a step-function
with steps at {1, . . . , k} where the height of each step is given by the corresponding probability of the event.
The implementation works as follows: denote by $p \in [0,1]^k$ the vector of
probabilities and denote by $f \in [0,1]^k$ with $f_i = f_{i-1} + p_i$ and $f_1 = p_1$ the
steps of the cumulative distribution function. Then for a random variable $z$
drawn from $U[0,1]$ we obtain $x = \phi(z) := \operatorname{argmin}_i \{f_i \geq z\}$. See Figure 2.13
for an example of a distribution over 5 events.
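A short Python sketch of this procedure is shown below; the probability vector over 5 outcomes and the number of draws are assumptions made for the example:

import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
f = np.cumsum(p)                         # steps of the cumulative distribution

z = rng.uniform(size=10_000)
# x = argmin_i {f_i >= z}; searchsorted returns the first index with f_i >= z
x = np.searchsorted(f, z)

print(np.bincount(x, minlength=len(p)) / len(z))   # close to p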
Fig. 2.14. Exponential distribution.
Example 2.9 (Exponential Distribution) The density of an exponentially distributed random variable is given by
$$p(x|\lambda) = \lambda\exp(-\lambda x) \text{ if } \lambda > 0 \text{ and } x \geq 0. \qquad (2.78)$$
Its cumulative distribution function is $F(x|\lambda) = 1 - e^{-\lambda x}$ (2.79), so that the inverse transform
$$x = F^{-1}(z) = -\lambda^{-1}\log(1 - z) \qquad (2.80)$$
converts uniform samples $z \sim U[0,1]$ into exponentially distributed samples.
Fig. 2.15. The true density of the standard normal distribution (red line) is contrasted with the histogram of 20,000 random variables generated by the Box-Muller
transform.
The key observation is that the joint distribution $p(x, y)$ is radially symmetric, i.e. it only depends on the radius $r^2 = x^2 + y^2$. Hence we may perform
a variable substitution in polar coordinates via the map $\phi$ where
$$x = r\cos\theta \text{ and } y = r\sin\theta, \text{ hence } (x, y) = \phi^{-1}(r, \theta). \qquad (2.81)$$
Observing that $z \sim U[0,1]$ implies that $1 - z \sim U[0,1]$ yields the following
sampler: draw $z_\theta, z_r \sim U[0,1]$ and compute $x$ and $y$ by
$$x = \sqrt{-2\log z_r}\,\cos 2\pi z_\theta \quad\text{and}\quad y = \sqrt{-2\log z_r}\,\sin 2\pi z_\theta. \qquad (2.82)$$
Note that the Box-Muller transform yields two independent Gaussian random variables. See Figure 2.15 for an example of the sampler.
accepting the draw if $r \leq 1$ and rejecting it otherwise. $\qquad$ (2.84)
Fig. 2.16. Rejection sampler. Left: samples drawn from the uniform distribution on
[0, 1]2 . Middle: the samples drawn from the uniform distribution on the unit disc
are all the points in the grey shaded area. Right: the same procedure allows us to
sample uniformly from arbitrary sets.
Fig. 2.17. Accept reject sampling for the Beta(2, 5) distribution. Left: Samples are
generated uniformly from the blue rectangle (shaded area). Only those samples
which fall under the red curve of the Beta(2, 5) distribution (darkly shaded area)
are accepted. Right: The true density of the Beta(2, 5) distribution (red line) is
contrasted with the histogram of 10,000 samples drawn by the rejection sampler.
Note that the algorithm of Example 2.12 is a special case of such a rejection
sampler: we majorize $p_X$ by the uniform distribution rescaled by $\frac{1}{p(X)}$.
Example 2.14 (Beta distribution) Recall that the Beta$(a, b)$ distribution,
as a member of the exponential family with sufficient statistics $(\log x, \log(1 - x))$, is given by
$$p(x|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,x^{a-1}(1 - x)^{b-1}. \qquad (2.86)$$
Its mode is located at
$$M = \frac{a - 1}{a + b - 2}, \qquad (2.87)$$
provided $a > 1$. Hence, if we use as proposal distribution the uniform distribution $U[0,1]$ with scaling factor $c = p(M|a, b)$ we may apply Theorem 2.23.
As illustrated in Figure 2.17, to generate a sample from Beta$(a, b)$ we first
generate a pair $(x, t)$, uniformly at random from the shaded rectangle. A
sample is retained if $ct \leq p(x|a, b)$, and rejected otherwise. The acceptance
rate of this sampler is $\frac{1}{c}$.
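A small Python sketch of this rejection sampler is shown below; the Beta density is coded directly via the Gamma function, and the choice of Beta(2, 5) and the number of samples mirror Figure 2.17:

import math, random

def beta_pdf(x, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def sample_beta(a, b, rng=random.random):
    M = (a - 1) / (a + b - 2)           # mode (2.87), requires a, b > 1
    c = beta_pdf(M, a, b)
    while True:
        x, t = rng(), rng()
        if c * t <= beta_pdf(x, a, b):  # accept; the acceptance rate is 1/c
            return x

samples = [sample_beta(2, 5) for _ in range(10_000)]
print(sum(samples) / len(samples))      # close to a / (a + b) = 2/7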
Example 2.15 (Normal distribution) We may use the Laplace distribution to generate samples from the Normal distribution. That is, we use
$$q(x|\lambda) = \frac{\lambda}{2}\,e^{-\lambda|x|} \qquad (2.88)$$
mean and unit variance it turns out that choosing $\lambda = 1$ yields the most
efficient sampling scheme (see Problem 2.27) with
$$p(x) \leq \sqrt{2e/\pi}\;q(x|\lambda = 1).$$
Fig. 2.18. Rejection sampling for the Normal distribution (red curve). Samples are
generated uniformly from the Laplace distribution rescaled by $\sqrt{2e/\pi}$. Only those
samples which fall under the red curve of the standard normal distribution (darkly
shaded area) are accepted.
Now suppose that we want to draw from $p(x|\sigma^2)$ by sampling from another
Gaussian $q$ with slightly larger variance $\rho^2 > \sigma^2$. In this case the ratio
between both distributions is maximized at 0 and it yields
$$c = \frac{p(0|\sigma^2)}{q(0|\rho^2)} = \left[\frac{\rho}{\sigma}\right]^d. \qquad (2.89)$$
Problem 2.4 (Weak Law of Large Numbers {2}) In analogy to the proof
of the central limit theorem prove the weak law of large numbers. Hint: use
a first order Taylor expansion of $e^{it} = 1 + it + o(t)$ to compute an approximation of the characteristic function. Next compute the limit $m \to \infty$ for
$\bar{X}_m$. Finally, apply the inverse Fourier transform to associate the constant
distribution at the mean $\mu$ with it.
Problem 2.5 (Rates and confidence bounds {3}) Show that the rate
of the Hoeffding bound is tight: derive a bound from the central limit theorem and compare it to
the Hoeffding rate.
Problem 2.6 Why cant we just use each chip on the wafer as a random
variable? Give a counterexample. Give bounds if we actually were allowed to
do this.
Problem 2.7 (Union Bound) Work on many bounds at the same time.
We only have logarithmic penalty.
Problem 2.8 (Randomized Rounding {4}) Solve the linear system of
equations Ax = b for integral x.
Problem 2.9 (Randomized Projections {3}) Prove that the randomized projections converge.
Problem 2.10 (The Count-Min Sketch {5}) Prove the projection trick
Problem 2.11 (Parzen windows with triangle kernels {1}) Suppose
you are given the following data: X = {2, 3, 3, 5, 5}. Plot the estimated density using a kernel density estimator with the following kernel:
$$k(u) = \begin{cases} 0.5 - 0.25\,|u| & \text{if } |u| \leq 2\\ 0 & \text{otherwise.}\end{cases}$$
Problem 2.12 Gaussian process link with Gaussian prior on natural parameters
Problem 2.13 Optimization for Gaussian regularization
Problem 2.14 Conjugate prior (student-t and wishart).
Problem 2.15 (Multivariate Gaussian {1}) Prove that $\Sigma \succ 0$ is a necessary and sufficient condition for the normal distribution to be well defined.
Problem 2.16 (Discrete Exponential Distribution {2}) $\phi(x) = x$ and
uniform measure.
Problem 2.17 Exponential random graphs.
Problem 2.18 (Maximum Entropy Distribution) Show that exponential families arise as the solution of the maximum entropy estimation problem.
Problem 2.19 (Maximum Likelihood Estimates for Normal Distributions)
Derive the maximum likelihood estimates for a normal distribution, that is,
show that they result in
$$\hat\mu = \frac{1}{m}\sum_{i=1}^m x_i \quad\text{and}\quad \hat\sigma^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \hat\mu)^2 \qquad (2.90)$$
using the exponential families parametrization. Next show that while the
mean estimate $\hat\mu$ is unbiased, the variance estimate has a slight bias of $O(\frac{1}{m})$.
To see this, take the expectation with respect to $\hat\sigma^2$.
Problem 2.20 (cdf of Logistic random variable {1}) Show that the cdf
of the Logistic random variable (??) is given by (??).
Problem 2.21 (Double-exponential (Laplace) distribution {1}) Use
the inverse-transform method to generate a sample from the double-exponential
(Laplace) distribution (2.88).
Problem 2.22 (Normal random variables in polar coordinates {1})
Let $X_1$ and $X_2$ be standard normal random variables and let $(R, \theta)$ denote the polar coordinates of the pair $(X_1, X_2)$. Show that $R^2 \sim \chi^2_2$ and
$\theta \sim \mathrm{Unif}[0, 2\pi]$.
Problem 2.23 (Monotonically increasing mappings {1}) A mapping
T : R R is one-to-one if, and only if, T is monotonically increasing, that
is, x > y implies that T (x) > T (y).
Problem 2.24 (Monotonically increasing multi-maps {2}) Let T : Rn
Rn be one-to-one. If X pX (x), then show that the distribution pY (y) of
Y = T (X) can be obtained via (??).
Problem 2.25 (Argmax of the Beta(a, b) distribution {1}) Show that
the mode of the Beta(a, b) distribution is given by (2.87).
Problem 2.26 (Accept reject sampling for the unit disk {2}) Give at
least TWO different accept-reject based sampling schemes to generate samples uniformly at random from the unit disk. Compute their efficiency.
Problem 2.27 (Optimizing Laplace for Standard Normal {1}) Optimize
the ratio $p(x)/g(x|\mu, \lambda)$, with respect to $\mu$ and $\lambda$, where $p(x)$ is the standard
normal distribution (??), and $g(x|\mu, \lambda)$ is the Laplace distribution (2.88).
Problem 2.28 (Normal Random Variable Generation {2}) The aim
of this problem is to write code to generate standard normal random variables (??) by using different methods. To do this generate $U \sim \mathrm{Unif}[0,1]$
and apply
(i) the Box-Muller transformation outlined in Section ??.
(ii) use the following approximation to the inverse CDF
$$\Phi^{-1}(\alpha) \approx t - \frac{a_0 + a_1 t}{1 + b_1 t + b_2 t^2}, \qquad (2.91)$$
3
Optimization
$$R_{\mathrm{emp}}(f) := \frac{1}{m}\sum_{i=1}^m l(f(x_i), y_i). \qquad (3.2)$$
Here $x_i$ are the training instances and $y_i$ are the corresponding labels. The
loss function $l$ measures the discrepancy between $y_i$ and the predictions $f(x_i)$.
Finding the optimal $f$ involves solving an optimization problem.
This chapter provides a self-contained overview of some basic concepts and
tools from optimization, especially geared towards solving machine learning
problems. In terms of concepts, we will cover topics related to convexity,
duality, and Lagrange multipliers. In terms of tools, we will cover a variety
of optimization algorithms including gradient descent, stochastic gradient
descent, Newtons method, and Quasi-Newton methods. We will also look
at some specialized algorithms tailored towards solving Linear Programming
and Quadratic Programming problems which often arise in machine learning
problems.
3.1 Preliminaries
Minimizing an arbitrary function is, in general, very difficult, but if the objective function to be minimized is convex then things become considerably
simpler. As we will see shortly, the key advantage of dealing with convex
functions is that a local optimum is also the global optimum. Therefore, well
developed tools exist to find the global minimum of a convex function. Consequently, many machine learning algorithms are now formulated in terms of
convex optimization problems. We briefly review the concept of convex sets
and functions in this section.
and functions in this section.
Fig. 3.1. The convex set (left) contains the line joining any two points that belong
to the set. A non-convex set (right) does not satisfy this property.
A vector sum $\sum_i \alpha_i x_i$ is called a convex combination if $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$. Convex combinations are helpful in defining a convex hull:
Definition 3.2 (Convex Hull) The convex hull, $\mathrm{conv}(X)$, of a finite subset $X = \{x_1,\dots,x_n\}$ of $\mathbb{R}^n$ consists of all convex combinations of $x_1,\dots,x_n$.
A set $C$ is convex if for any $x, x' \in C$ and $\lambda \in (0, 1)$ we have
$$\lambda x + (1-\lambda)x' \in C. \qquad (3.3)$$
A function $f$ defined on a convex set is convex if for any $x, x'$ in its domain and $\lambda \in (0, 1)$
$$f(\lambda x + (1-\lambda)x') \leq \lambda f(x) + (1-\lambda)f(x'). \qquad (3.4)$$
For a strictly convex function the inequality is strict, that is,
$$f(\lambda x + (1-\lambda)x') < \lambda f(x) + (1-\lambda)f(x') \qquad (3.5)$$
whenever $x \neq x'$.
In fact, the above definition can be extended to show that if $f$ is a convex
function and $\alpha_i \geq 0$ with $\sum_i \alpha_i = 1$ then
$$f\!\left(\sum_i \alpha_i x_i\right) \leq \sum_i \alpha_i f(x_i). \qquad (3.6)$$
Fig. 3.2. A convex function (left) satisfies (3.4); the shaded region denotes its epigraph. A nonconvex function (right) does not satisfy (3.4).
If $f$ is differentiable, then $f$ is convex if, and only if,
$$f(x') \geq f(x) + \langle x' - x, \nabla f(x)\rangle \text{ for all } x, x' \in X. \qquad (3.7)$$
In other words, the first order Taylor approximation lower bounds the convex
function universally (see Figure 3.4). Here and in the rest of the chapter
$\langle x, y\rangle$ denotes the Euclidean dot product between vectors $x$ and $y$, that is,
$$\langle x, y\rangle := \sum_i x_i y_i. \qquad (3.8)$$
If $f$ is twice differentiable, then $f$ is convex if, and only if, its Hessian is
positive semi-definite, that is,
$$\nabla^2 f(x) \succeq 0. \qquad (3.9)$$
For twice differentiable strictly convex functions, the Hessian matrix is positive definite, that is, $\nabla^2 f(x) \succ 0$. We briefly summarize some operations
which preserve convexity:
Addition If $f_1$ and $f_2$ are convex, then so is $f_1 + f_2$.
Scaling If $f$ is convex, then so is $\alpha f$ for $\alpha \geq 0$.
Affine Transform If $f$ is convex, then so is $g(x) = f(Ax + b)$.
Adding a Linear Function If $f$ is convex, then so is $g(x) = f(x) + \langle a, x\rangle$.
Fig. 3.3. Left: Convex function in two variables. Right: the corresponding convex
below-sets $\{x \mid f(x) \leq c\}$, for different values of $c$. This is also called a contour plot.
If $f$ is convex, then the below-set
$$X_c := \{x \mid f(x) \leq c\} \qquad (3.10)$$
is convex.
Proof For any $x, x' \in X_c$, we have $f(x), f(x') \leq c$. Moreover, since $f$ is
convex, we also have
$$f(\lambda x + (1-\lambda)x') \leq \lambda f(x) + (1-\lambda)f(x') \leq c \text{ for all } 0 < \lambda < 1. \qquad (3.11)$$
Hence, for all $0 < \lambda < 1$, we have $(\lambda x + (1-\lambda)x') \in X_c$, which proves the
claim. Figure 3.3 depicts this situation graphically.
A point $x \in X$ minimizes a differentiable convex function $f$ over the convex set $X$ if, and only if,
$$\langle x' - x, \nabla f(x)\rangle \geq 0 \text{ for all } x' \in X. \qquad (3.12)$$
Proof To show the forward implication, suppose that $x$ is the optimum
but (3.12) does not hold, that is, there exists an $x'$ for which
$$\langle x' - x, \nabla f(x)\rangle < 0.$$
Consider the line segment $z(\lambda) = (1-\lambda)x + \lambda x'$, with $0 < \lambda < 1$. Since $X$
is convex, $z(\lambda)$ lies in $X$. On the other hand,
$$\left.\frac{d}{d\lambda}f(z(\lambda))\right|_{\lambda=0} = \langle x' - x, \nabla f(x)\rangle < 0,$$
which shows that for small values of $\lambda$ we have $f(z(\lambda)) < f(x)$, thus showing
that $x$ is not optimal.
The reverse implication follows from (3.7) by noting that $f(x') \geq f(x)$
whenever (3.12) holds.
One way to ensure that (3.12) holds is to set $\nabla f(x) = 0$. In other words,
minimizing a convex function is equivalent to finding an $x$ such that $\nabla f(x) =
0$. Therefore, the first order conditions are both necessary and sufficient
when minimizing a convex function.
3.1.3 Subgradients
So far, we worked with differentiable convex functions. The subgradient is a
generalization of gradients appropriate for convex functions, including those
which are not necessarily smooth.
Definition 3.7 (Subgradient) Suppose $x$ is a point where a convex function $f$ is finite. Then a subgradient is the normal vector of any tangential
supporting hyperplane of $f$ at $x$. Formally $\mu$ is called a subgradient of $f$ at
$x$ if, and only if,
$$f(x') \geq f(x) + \langle x' - x, \mu\rangle \text{ for all } x'. \qquad (3.13)$$
The set of all subgradients at a point is called the subdifferential, and is denoted by $\partial_x f(x)$. If this set is not empty then $f$ is said to be subdifferentiable
at $x$. On the other hand, if this set is a singleton then the function is said
to be differentiable at $x$. In this case we use $\nabla f(x)$ to denote the gradient
of $f$. Convex functions are subdifferentiable everywhere in their domain. We
now state some simple rules of subgradient calculus:
Addition $\partial_x (f_1 + f_2)(x) = \partial_x f_1(x) + \partial_x f_2(x)$.
Scaling $\partial_x (\alpha f)(x) = \alpha\, \partial_x f(x)$ for $\alpha > 0$.
Affine Transform For $g(x) = f(Ax + b)$ we have $\partial_x g(x) = A^\top\partial f(Ax + b)$.
Pointwise Maximum For $f(x) = \max_i f_i(x)$ the subdifferential at $x$ is the convex hull of the subdifferentials of the active functions, i.e. those $f_i$ with $f_i(x) = f(x)$.
Example 3.2 (Negative Entropy) Let $\Delta_n = \{x \text{ s.t. } \sum_i x_i = 1 \text{ and } x_i \geq 0\}$
be the $n$ dimensional simplex, and $f : \Delta_n \to \mathbb{R}$ be the negative entropy:
$$f(x) = \sum_i x_i\log x_i. \qquad (3.15)$$
Then $f$ is 1-strongly convex with respect to the $\|\cdot\|_1$ norm on the simplex
(see Problem 3.7).
If $f$ is a $\sigma$-strongly convex function then one can show the following properties (Exercise 3.8). Here $x, x'$ are arbitrary and $\mu \in \partial f(x)$ and $\mu' \in \partial f(x')$.
$$f(x') \geq f(x) + \langle x' - x, \mu\rangle + \frac{\sigma}{2}\|x' - x\|^2 \qquad (3.16)$$
$$f(x') \leq f(x) + \langle x' - x, \mu\rangle + \frac{1}{2\sigma}\|\mu - \mu'\|^2 \qquad (3.17)$$
$$\langle x - x', \mu - \mu'\rangle \geq \sigma\|x - x'\|^2 \qquad (3.18)$$
$$\langle x - x', \mu - \mu'\rangle \leq \frac{1}{\sigma}\|\mu - \mu'\|^2. \qquad (3.19)$$
If $f$ is convex and its gradient is Lipschitz continuous with constant $L$, that is,
$$\|\nabla f(x) - \nabla f(x')\| \leq L\,\|x - x'\|, \qquad (3.21)$$
then the following properties hold for all $x, x'$:
$$f(x') \leq f(x) + \langle x' - x, \nabla f(x)\rangle + \frac{L}{2}\|x - x'\|^2 \qquad (3.22)$$
$$f(x') \geq f(x) + \langle x' - x, \nabla f(x)\rangle + \frac{1}{2L}\|\nabla f(x) - \nabla f(x')\|^2 \qquad (3.23)$$
$$\langle x - x', \nabla f(x) - \nabla f(x')\rangle \leq L\,\|x - x'\|^2 \qquad (3.24)$$
$$\langle x - x', \nabla f(x) - \nabla f(x')\rangle \geq \frac{1}{L}\|\nabla f(x) - \nabla f(x')\|^2. \qquad (3.25)$$
Fig. 3.4. A convex function is always lower bounded by its first order Taylor approximation. This is true even if the function is not differentiable (see Figure 3.5)
The Bregman divergence defined by a convex differentiable function $f$ is given by
$$\Delta_f(x, x') = f(x) - f(x') - \langle x - x', \nabla f(x')\rangle. \qquad (3.29)$$
Also see Figure 3.6. Here are some well known examples.
Example 3.5 (Square Euclidean Norm) Set $f(x) = \frac{1}{2}\|x\|^2$. Clearly,
$\nabla f(x) = x$ and therefore
$$\Delta_f(x, x') = \frac{1}{2}\|x\|^2 - \frac{1}{2}\|x'\|^2 - \langle x - x', x'\rangle = \frac{1}{2}\|x - x'\|^2.$$
Fig. 3.6. $f(x)$ is the value of the function at $x$, while $f(x') + \langle x - x', \nabla f(x')\rangle$ denotes
the first order Taylor expansion of $f$ around $x'$, evaluated at $x$. The difference
between these two quantities is the Bregman divergence, as illustrated.
One can calculate f (x) = log x, where log x is the component wise logarithm of the entries of x, and write the Bregman divergence
X
X
X
X
f (x, x0 ) =
xi log xi
xi
x0i log x0i +
x0i x x0 , log x0
i
X
xi
0
+ xi xi .
=
xi log
x0i
i
(3.31)
102
3 Optimization
We say that the q-norm is dual to the p-norm whenever p1 + 1q = 1. One can
verify (Problem 3.12) that the i-th component of the gradient f (x) is
xi f (x) =
(3.32)
1
1
2 X
sign(x0i ) |x0i |p1
kxk2p
x0
p
(xi x0i )
.
2
2
kx0 kp2
p
i
f (x, x0 ) is convex in x.
f (x, x0 ) 0.
f may not be symmetric, that is, in general f (x, x0 ) 6= f (x0 , x).
x f (x, x0 ) = f (x) f (x0 ).
(3.33)
Proof
f (x, y) + f (y, z) = f (x) f (y) hx y, f (y)i + f (y) f (z) hy z, f (z)i
= f (x) f (z) hx y, f (y)i hy z, f (z)i
= f (x, z) + hf (z) f (y), x yi .
103
simple problem has many applications. As we will see later, many optimization methods find a direction of descent and minimize the objective function
along this direction1 ; this subroutine is called a line search. Algorithm 3.1
depicts a simple line search routine based on interval bisection.
Before we show that Algorithm 3.1 converges, let us first derive an important property of convex functions of one variable. For a differentiable
one-dimensional convex function J (3.7) reduces to
J(w) J(w0 ) + (w w0 ) J 0 (w0 ),
(3.34)
(3.35)
(3.36)
If w w0 , then this implies that J 0 (w) J 0 (w0 ). In other words, the gradient
of a one dimensional convex function is monotonically non-decreasing.
Recall that minimizing a convex function is equivalent to finding w such
that J 0 (w ) = 0. Furthermore, it is easy to see that the interval bisection
maintains the invariant J 0 (at ) < 0 and J 0 (bt ) > 0. This along with the
monotonicity of the gradient suffices to ensure that w (at , bt ). Setting
w = w in (3.34), and using the monotonicity of the gradient allows us to
1
If the objective function is convex, then the one dimensional function obtained by restricting
it along the search direction is also convex (Exercise 3.10).
104
3 Optimization
(3.37)
Since we halve the interval (at , bt ) at every iteration, it follows that (bt at ) =
(U L)/2t . Therefore
(U L) J 0 (U )
,
(3.38)
2t
for all w0 (at , bt ). In other words, to find an -accurate solution, that is,
J(w0 ) J(w ) we only need log(U L) + log J 0 (U ) + log(1/) < t iterations. An algorithm which converges to an accurate solution in O(log(1/))
iterations is said to be linearly convergent.
For multi-dimensional objective functions, one cannot rely on the monotonicity property of the gradient. Therefore, one needs more sophisticated
optimization algorithms, some of which we now describe.
J(w0 ) J(w )
(3.39)
Here ei denotes the i-th basis vector, that is, a vector with one at the i-th coordinate and zeros everywhere else, while t R is a non-negative scalar step
size. One could, for instance, minimize the one dimensional convex function
J(wt ei ) to obtain the stepsize t . The coordinates can either be selected
cyclically, that is, 1, 2, . . . , n, 1, 2, . . . or greedily, that is, the coordinate which
yields the maximum reduction in function value.
Even though coordinate descent can be shown to converge if J has a Lipschitz continuous gradient [LT92], in practice it can be quite slow. However,
if a high precision solution is not required, as is the case in some machine
learning applications, coordinate descent is often used because a) the cost
per iteration is very low and b) the speed of convergence may be acceptable
especially if the variables are loosely coupled.
3.2.3 Gradient Descent
Gradient descent (also widely known as steepest descent) is an optimization
technique for minimizing multidimensional smooth convex objective functions of the form J : Rn R. The basic idea is as follows: Given a location
105
(3.40)
where t is a scalar stepsize. See Algorithm 3.2 for details. Different variants
of gradient descent depend on how t is chosen:
Exact Line Search: Since J(wt J(wt )) is a one dimensional convex
function in , one can use the Algorithm 3.1 to compute:
t = argmin J(wt J(wt )).
(3.41)
Instead of the simple bisecting line search more sophisticated line searches
such as the More-Thuente line search or the golden bisection rule can also
be used to speed up convergence (see [NW99] Chapter 3 for an extensive
discussion).
Inexact Line Search: Instead of minimizing J(wt J(wt )) we could
simply look for a stepsize which results in sufficient decrease in the objective
function value. One popular set of sufficient decrease conditions is the Wolfe
conditions
J(wt+1 ) J(wt ) + c1 t hJ(wt ), wt+1 wt i (sufficient decrease) (3.42)
hJ(wt+1 ), wt+1 wt i c2 hJ(wt ), wt+1 wt i (curvature) (3.43)
with 0 < c1 < c2 < 1 (see Figure 3.7). The Wolfe conditions are also called
the Armijio-Goldstein conditions. If only sufficient decrease (3.42) alone is
enforced, then it is called the Armijio rule.
acceptable stepsize
acceptable stepsize
Fig. 3.7. The sufficient decrease condition (left) places an upper bound on the
acceptable stepsizes while the curvature condition (right) places a lower bound on
the acceptable stepsizes.
106
3 Optimization
1
L
1
kJ(wt )k2 J(wt ) J(wt+1 )
2L
Summing this inequality
(3.46)
1 X
kJ(wt )k2 J(w0 ) J(wT ) J(w0 ) J(w ),
2L
t=0
107
Solving for
r
2L(J(w0 ) J(w ))
=
T +1
(3.47)
iterations.
Proof Combining (3.46) with kJ(wt )k2 2(J(wt ) J(w )), and using
the definition of c one can write
c(J(wt ) J(w )) J(wt+1 ) J(w ).
Applying the above equation recursively
cT (J(w0 ) J(w )) J(wT ) J(w ).
Solving for
= cT (J(w0 ) J(w ))
and rearranging yields (3.47).
When applied to practical problems which are not strongly convex gradient descent yields a low accuracy solution within a few iterations. However, as the iterations progress the method stalls and no further increase
in accuracy is obtained because of the O(1/2 ) rates of convergence. On
the other hand, if the function is strongly convex, then gradient descent
converges linearly, that is, in O(log(1/)) iterations. However, the number
108
3 Optimization
(3.48)
(3.49)
where, as in the previous section, J() denotes the gradient of J. Minimizing this quadratic model at every iteration entails taking gradients with
109
Fig. 3.8. Convergence of gradient descent with exact line search on two quadratic
problems (3.48). The problem on the left is ill-conditioned, whereas the problem
on the right is well-conditioned. We plot the contours of the objective function,
and the steps taken by gradient descent. As can be seen gradient descent converges
fast on the well conditioned problem, while it zigzags and takes many iterations to
converge on the ill-conditioned problem.
(3.50)
Performing a line search along the direction J(wt ) recovers the familiar
gradient descent update
wt+1 = wt t J(wt ).
(3.51)
The closely related mirror descent method replaces the quadratic penalty
in (3.49) by a Bregman divergence defined by some convex function f to
yield
Qt (w) := J(wt ) + hJ(wt ), w wt i + f (w, wt ).
(3.52)
(3.53)
(3.54)
(3.55)
110
3 Optimization
Theorem 3.14 Let J be a convex function and J(w ) denote its minimum
value. The mirror descent updates (3.54) with a -strongly convex function
f satisfy
2
1 P 2
f (w , w1 ) + 2
t t kJ(wt )k
P
min J(wt ) J(w ).
t
t t
Proof Using the convexity of J (see (3.7)) and (3.54) we can write
J(w ) J(wt ) + hw wt , J(wt )i
1
J(wt ) hw wt , f (wt+1 ) f (wt )i .
t
Now applying Lemma 3.11 and rearranging
f (w , wt ) f (w , wt+1 ) + f (wt , wt+1 ) t (J(wt ) J(w )).
Summing over t = 1, . . . , T
f (w , w1 ) f (w , wT +1 ) +
f (wt , wt+1 )
f (w , wT +1 )
J(w )
Noting that
0, J(wt )
mint J(wt ) J(w ), and
rearranging it follows that
P
f (w , w1 ) + t f (wt , wt+1 )
P
min J(wt ) J(w ).
(3.56)
t
t t
Using (3.17) and (3.54)
1
1 2
kf (wt ) f (wt+1 )k2 =
t kJ(wt )k2 .
2
2
The proof is completed by plugging in (3.57) into (3.56).
f (wt , wt+1 )
(3.57)
.
min J(wt ) J(w ) L
1tT
T
Proof Since J is Lipschitz continuous
1
f (w , w1 ) + 2
P
min J(wt ) J(w )
1tT
t t
2 2
t t L
111
.
min J(wt ) J(w ) L
L
P 1
1tT
2
2
T
t
t
(3.59)
gt> pt
p>
t Apt
(3.60a)
wt+1 = wt + t pt
(3.60b)
gt+1 = Awt+1 b
(3.60c)
> Ap
gt+1
t
p>
Ap
t
t
(3.60d)
(3.60e)
t+1 =
112
3 Optimization
The following theorem asserts that the pt generated by the above procedure
are indeed conjugate directions.
Theorem 3.17 Suppose the t-th iterate generated by the conjugate gradient
method (3.60) is not the solution of (3.59), then the following properties
hold:
span{g0 , g1 , . . . , gt } = span{g0 , Ag0 , . . . , At g0 }.
t
(3.61)
(3.62)
(3.63)
(3.64)
= p>
j (Awt + t pt b)
gt> pt
>
= pj Awt >
Apt b
pt Apt
=
p>
j gt
p>
j Apt
p>
t Apt
gt> pt .
For j = t, both terms cancel out, while for j < t both terms vanish due to
the induction hypothesis.
Step 2: Next we prove that (3.61) holds. Using (3.60c) and (3.60b)
gt+1 = Awt+1 b = Awt + t Apt b = gt + t Apt .
By our induction hypothesis, gt span{g0 , Ag0 , . . . , At g0 }, while Apt
span{Ag0 , A2 g0 , . . . , At+1 g0 }. Combining the two we conclude that gt+1
span{g0 , Ag0 , . . . , At+1 g0 }. On the other hand, we already showed that gt+1
is orthogonal to {p0 , p1 , . . . , pt }. Therefore, gt+1
/ span{p0 , p1 , . . . , pt }. Thus
our induction assumption implies that gt+1
/ span{g0 , Ag0 , . . . , At g0 }. This
allows us to conclude that span{g0 , g1 , . . . , gt+1 } = span{g0 , Ag0 , . . . , At+1 g0 }.
113
By the definition of t+1 (3.60d) the above expression vanishes for j = t. For
j < t, the first term is zero because Apj span{p0 , p1 , . . . , pj+1 }, a subspace
orthogonal to gt+1 as already shown in Step 1. The induction hypothesis
guarantees that the second term is zero.
Step 4 Clearly, (3.61) and (3.60e) imply (3.62). This concludes the proof.
A practical implementation of (3.60) requires two more observations:
First, using (3.60e) and (3.63)
gt> pt = gt> gt t gt> pt1 = gt> gt .
Therefore (3.60a) simplifies to
t =
gt> gt
.
p>
t Apt
(3.65)
> g
gt+1
t+1
.
>
gt gt
(3.66)
114
3 Optimization
g> g
t = p>t Apt
t
t
wt+1 = wt + t pt
gt+1 = gt + t Apt
g > gt+1
t+1 = t+1
gt> gt
pt+1 = gt+1 + t+1 pt
t=t+1
end while
Return: wt
p>
t A(w w0 )
.
p>
t Apt
(3.67)
On the other hand, following the iterative process (3.60b) from w0 until wt
yields
wt w0 = 0 p0 + . . . + t1 pt1 .
Again premultiplying with p>
t A and using conjugacy
p>
t A(wt w0 ) = 0.
(3.68)
p>
gt> pt
t A(w wt )
=
,
p>
p>
t Apt
t Apt
(3.69)
115
g> p
The gradient vanishes if we set = p>t Apt , which recovers (3.60a). In other
t
t = p>tKttpt and t =
t
>
gt+1
Kt pt
p>
t Kt pt
>
gt+1
gt+1
.
gt> gt
Fletcher-Reeves
Polak-Ribi`ere
t = tg>t+1
.
t gt
In practice, Polak-Ribi`ere tends to be better than
Fletcher-Reeves.
Hestenes-Stiefel
yt> gt+1
.
yt> pt
116
3 Optimization
(3.71)
(3.72)
(3.73)
for some constant C > 0. We now show that Newtons method exhibits
quadratic convergence close to the optimum.
Theorem 3.19 (Quadratic convergence of Newtons Method) Suppose
J is twice differentiable, strongly convex, and the Hessian of J is bounded
and Lipschitz continuous with modulus
M in a
neighborhood of the so
2
1
lution w . Furthermore, assume that J(w)
N . The iterations
117
J(wt ) J(w ) =
(3.74)
Next using the fact that 2 J(wt ) is invertible and the gradient vanishes at
the optimum (J(w ) = 0), write
wt+1 w = wt w 2 J(wt )1 J(wt )
= 2 J(wt )1 [2 J(wt )(wt w ) (J(wt ) J(w ))]. (3.75)
Using (3.75), (3.74), and the Lipschitz continuity of 2 J
J(wt ) J(w ) 2 J(wt )(wt w )
Z 1
2
[
J(w
+
t(w
w
))
J(w
)](w
w
)dt
=
t
t
t
t
0
Z 1
2
[ J(wt + t(wt w )) 2 J(wt )]
k(wt w )k dt
0
Z 1
M
2
kwt w k
M t dt =
kwt w k2 .
2
0
(3.76)
M
2 J(wt )1
kwt w k2 N M kwt w k2 .
2
2
118
3 Optimization
1200
1000
800
600
400
200
0
200
400 6
Fig. 3.9. The blue solid line depicts the one dimensional convex function J(w) =
w4 + 20w2 + w. The green dotted-dashed line represents the first order Taylor
approximation to J(w), while the red dashed line represents the second order Taylor
approximation, both evaluated at w = 2.
(3.77)
Unlike Newtons method which uses the Hessian to build its quadratic model
(3.70), BFGS uses the matrix Ht 0, which is a positive-definite estimate
of the Hessian. A quasi-Newton direction of descent is found by minimizing
Qt (w):
w wt = Ht1 J(wt ).
(3.78)
The stepsize t > 0 is found by a line search obeying the Wolfe conditions
119
(3.79)
(3.81)
which implies that Qt+1 (wt+1 ) = J(wt+1 ), and hence our second condition is automatically satisfied. In order to satisfy our first condition, we
require
Qt+1 (wt ) = J(wt+1 ) + Ht+1 (wt wt+1 ) = J(wt ).
(3.82)
(3.83)
(3.84)
(3.85)
(3.86)
(3.87)
120
3 Optimization
if t = 0 : Bt :=
s>
t yt
I
yt> yt
1
t = (s>
t yt )
>
Bt+1 = (I t st yt> )Bt (I t yt s>
t ) + t st st
t=t+1
end while
Return: wt
yt> J(wt+1 )
st ,
yt> st
(3.88)
which recovers the Hestenes-Stiefel update (see (3.60e) and Table 3.2).
Limited-memory BFGS (LBFGS) is a variant of BFGS designed for solving large-scale optimization problems where the O(d2 ) cost of storing and
updating Bt would be prohibitively expensive. LBFGS approximates the
quasi-Newton direction (3.78) directly from the last m pairs of st and yt via
a matrix-free approach. This reduces the cost to O(md) space and time per
iteration, with m freely chosen. Details can be found in Algorithm 3.5.
3.2.6.2 Spectral Gradient Methods
Although spectral gradient methods do not use the Hessian explicitly, they
are motivated by arguments very reminiscent of the Quasi-Newton methods.
Recall the update rule (3.79) and secant equation (3.83). Suppose we want
121
(3.89)
where t+1 is a scalar and I denotes the identity matrix. Then the secant
equation (3.83) becomes
t+1 st = yt .
(3.90)
In general, the above equation cannot be solved. Therefore we use the t+1
which minimizes kt+1 st yt k2 which yields the Barzilai-Borwein (BB) stepsize
t+1 =
s>
t yt
.
s>
t st
(3.91)
As it turns out, t+1 lies between the minimum and maximum eigenvalue of
the average Hessian in the direction st , hence the name Spectral Gradient
method. The parameter update (3.79) is now given by
wt+1 = wt
1
J(wt ).
t
(3.92)
122
3 Optimization
its linearization (i.e., first order Taylor approximation). See Figures 3.4 and
3.5 for geometric intuition, and recall (3.7) and (3.13):
J(w) J(w0 ) + w w0 , s0
w and s0 J(w0 ).
(3.94)
Given subgradients s1 , s2 , . . . , st evaluated at locations w0 , w1 , . . . , wt1 , we
can construct a tighter (piecewise linear) lower bound for J as follows (also
see Figure 3.10):
J(w) JtCP (w) := max {J(wi1 ) + hw wi1 , si i}.
1it
(3.95)
123
CP to obtain
Given iterates {wi }t1
i=0 , the cutting plane method minimizes Jt
the next iterate wt :
(3.96)
This iteratively refines the piecewise linear lower bound J CP and allows us
to get close to the minimum of J (see Figure 3.10 for an illustration).
If w denotes the minimizer of J, then clearly each J(wi ) J(w ) and
hence min0it J(wi ) J(w ). On the other hand, since J JtCP it follows that J(w ) JtCP (wt ). In other words, J(w ) is sandwiched between
min0it J(wi ) and JtCP (wt ) (see Figure 3.11 for an illustration). The cutting
plane method monitors the monotonically decreasing quantity
t := min J(wi ) JtCP (wt ),
0it
(3.97)
Fig. 3.10. A convex function (blue solid curve) is bounded from below by its linearizations (dashed lines). The gray area indicates the piecewise linear lower bound
obtained by using the linearizations. We depict a few iterations of the cutting plane
method. At each iteration the piecewise linear lower bound is minimized and a new
linearization is added at the minimizer (red rectangle). As can be seen, adding more
linearizations improves the lower bound.
124
3 Optimization
Fig. 3.11. A convex function (blue solid curve) with four linearizations evaluated at
four different locations (magenta circles). The approximation gap 3 at the end of
fourth iteration is indicated by the height of the cyan horizontal band i.e., difference
between lowest value of J(w) evaluated so far and the minimum of J4CP (w) (red
diamond).
well known (see e.g., [LNN95, Bel05]) that it can be very slow when new
iterates move too far away from the previous ones (i.e., causing unstable
zig-zag behavior in the iterates). In fact, in the worst case the cutting
plane method might require exponentially many steps to converge to an
optimum solution.
Bundle methods stabilize CPM by augmenting the piecewise linear lower
(e.g., JtCP (w) in (3.95)) with a prox-function (i.e., proximity control function) which prevents overly large steps in the iterates [Kiw90]. Roughly
speaking, there are 3 popular types of bundle methods, namely, proximal
[Kiw90], trust region [SZ92], and level set [LNN95]. All three versions use
2
1
2 kk as their prox-function, but differ in the way they compute the new
iterate:
t
kw w
t1 k2 + JtCP (w)},
2
w
1
trust region: wt := argmin{JtCP (w) | kw w
t1 k2 t },
2
w
1
level set: wt := argmin{ kw w
t1 k2 | JtCP (w) t },
2
w
proximal:
wt := argmin{
(3.98)
(3.99)
(3.100)
where w
t1 is the current prox-center, and t , t , and t are positive tradeoff parameters of the stabilization. Although (3.98) can be shown to be
equivalent to (3.99) for appropriately chosen t and t , tuning t is rather
difficult while a trust region approach can be used for automatically tuning
125
(3.101a)
s. t. ci (w) 0 for i I
(3.101b)
ei (w) = 0 for i E
(3.101c)
where both ci and ei are convex functions. We say that w is feasible if and
only if it satisfies the constraints, that is, ci (w) 0 for i I and ei (w) = 0
for i E.
Recall that w is the minimizer of an unconstrained problem if and only if
kJ(w)k = 0 (see Lemma 3.6). Unfortunately, when constraints are present
one cannot use this simple characterization of the solution. For instance, the
w at which kJ(w)k = 0 may not be a feasible point. To illustrate, consider
the following simple minimization problem (see Figure 3.12):
1 2
w
2
s. t. 1 w 2.
min
w
(3.102a)
(3.102b)
Clearly, 12 w2 is minimized at w = 0, but because of the presence of the constraints, the minimum of (3.102) is attained at w = 1 where J(w) = w is
equal to 1. Therefore, we need other ways to detect convergence. In Section
3.3.1 we discuss some general purpose algorithms based on the concept of orthogonal projection. In Section 3.3.2 we will discuss Lagrange duality, which
can be used to further characterize the solutions of constrained optimization
problems.
(3.103)
3 Optimization
J(w)
126
14
12
10
8
6
4
2
0
0
w
(3.104)
(3.105)
for some pre-defined tolerance > 0. Of course, J is unknown and hence the
gap J(w) J cannot be computed in practice. Furthermore, as we showed
in Section 3.3, for constrained optimization problems kJ(w)k does not
vanish at the optimal solution. Therefore, we will use the following stopping
127
(3.106)
iE
128
3 Optimization
0,
Proof First assume that w is feasible, that is, ci (w) 0 for i I and
ei (w) = 0 for i E. Since i 0 we have
X
X
i ci (w) +
i ei (w) 0,
(3.108)
iI
iE
0,
iI
iE
(3.109)
for 0 and , then one can prove the following property, which is often
called as weak duality.
Theorem 3.21 (Weak Duality) The Lagrange dual function (3.109) satisfies
D(, ) J(w)
for all feasible w and 0 and . In particular
D := max min L(w, , ) min max L(w, , ) = J .
0,
0,
(3.110)
129
iE
Therefore
D(, ) = min L(w, , ) = min J(w) +
w
i ci (w) +
iI
i ei (w) J(w)
iE
0,
(3.111)
The proof of the above theorem is quite technical and can be found in
any standard reference (e.g., [BV04]). Therefore we will omit the proof and
proceed to discuss various implications of strong duality. First note that
min max L(w, , ) = max min L(w, , ).
w
0,
0,
(3.112)
In other words, one can switch the order of minimization over w with maximization over and . This is called the saddle point property of convex
functions.
Suppose strong duality holds. Given any 0 and such that D(, ) >
and a feasible w we can immediately write the duality gap
J(w) J = J(w) D J(w) D(, ),
where J and D were defined in (3.111). Below we show that if w is primal
optimal and ( , ) are dual optimal then J(w ) D( , ) = 0. This
provides a non-heuristic stopping criterion for constrained optimization: stop
when J(w) D(, ) , where is a pre-specified tolerance.
130
3 Optimization
Suppose the primal and dual optimal values are attained at w and
( , ) respectively, and consider the following line of argument:
J(w ) = D( , )
= min J(w) +
(3.113a)
X
i ci (w) +
iI
J(w ) +
i ej (w)
(3.113b)
iE
i ci (w ) +
iI
i ei (w )
(3.113c)
iE
J(w ).
(3.113d)
iE
iE
ej (w ) = 0 i E
i
J(w ) +
iI
i ci (w ) +
(3.114b)
(3.114c)
)=0
(3.114d)
i ei (w ) = 0.
(3.114e)
i ci (w
X
(3.114a)
iE
The above conditions are called the KKT conditions. If the primal problem is
convex, then the KKT conditions are both necessary and sufficient. In other
satisfy (3.114) then w
are primal and dual
words, if w
and (
, )
and (
, )
optimal with zero duality gap. To see this note that the first two conditions
show that w
is feasible. Since i 0, L(w, , ) is convex in w. Finally the
Since
last condition states that w
minimizes L(w,
, ).
i ci (w)
= 0 and
131
ej (w)
= 0, we have
= min L(w,
D(
, )
, )
w
= J(w)
+
n
X
i ci (w)
+
i=1
m
X
j ej (w)
j=1
= J(w).
(3.115a)
s. t. Aw = b, w 0.
(3.115b)
(3.116a)
s. t. Aw b,
(3.116b)
(3.117a)
s. t. Aw = b, 0.
(3.117b)
w,
Next, we split w into its positive and negative parts w+ and w respectively by setting wi+ = max(0, wi ) and wi = max(0, wi ). Using these new
132
3 Optimization
min
c
w
w+ ,w ,
0
+
+
w
w
s. t. A A I w = b, w 0,
(3.118a)
(3.118b)
(3.119)
Taking gradients with respect to the primal and dual variables and setting
them to zero obtains
A> = c
(3.120a)
Aw = b
(3.120b)
>
w=0
(3.120c)
w0
(3.120d)
0.
(3.120e)
Condition (3.120c) can be simplified by noting that both w and are constrained to be non-negative, therefore > w = 0 if, and only if, i wi = 0 for
i = 1, . . . , n.
Using (3.120a), (3.120c), and (3.120b) we can write
c> w = (A> )> w = > Aw = > b.
Substituting this into (3.115) and eliminating the primal variable w yields
the following dual LP
max b>
,
s.t. A> = c, 0.
(3.121a)
(3.121b)
133
b
0
max
, + ,
(3.122a)
+
+
= c, 0.
s.t.
A> A> I
(3.122b)
min
a>
i w
bi for i I
(3.123a)
(3.123b)
(3.123c)
min
(3.124a)
(3.124b)
(3.125)
To find the saddle point of the Lagrangian we take gradients with respect
134
3 Optimization
b
The matrix in the above equation is called the KKT matrix, and we can use
it to characterize the conditions under which (3.124) has a unique solution.
Theorem 3.23 Let Z be a n (n m) matrix whose columns form a basis
for the null space of A, that is, AZ = 0. If A has full row rank, and the
reduced-Hessian matrix Z > GZ is positive definite, then there exists a unique
pair (w , ) which solves (3.126). Furthermore, w also minimizes (3.124).
Proof Note that a unique (w , ) exists whenever the KKT matrix is
non-singular. Suppose this is not the case, then there exist non-zero vectors
a and b such that
G A>
a
= 0.
A 0
b
Since Aa = 0 this implies that a lies in the null space of A and hence there
exists a u such that a = Zu. Therefore
G A>
Zu
= u> Z > GZu = 0.
Zu 0
A 0
0
Positive definiteness of Z > GZ implies that u = 0 and hence a = 0. On the
other hand, the full row rank of A and A> b = 0 implies that b = 0. In
summary, both a and b are zero, a contradiction.
Let w 6= w be any other feasible point and w = w w. Since Aw =
Aw = b we have that Aw = 0. Hence, there exists a non-zero u such that
w = Zu. The objective function J(w) can be written as
1
J(w) = (w w)> G(w w) + (w w)> d
2
1
= J(w ) + w> Gw (Gw + d)> w.
2
First note that 21 w> Gw = 12 u> Z > GZu > 0 by positive definiteness of
the reduced Hessian. Second, since w solves (3.126) it follows that (Gw +
135
d)> w = > Aw = 0. Together these two observations imply that J(w) >
J(w ).
If the technical conditions of the above theorem are met, then solving the
equality constrained QP (3.124) is equivalent to solving the linear system
(3.126). See [NW99] for a extensive discussion of algorithms that can be
used for this task.
Next we turn our attention to the general QP (3.123) which also contains
inequality constraints. The Lagrangian in this case can be written as
X
X
1
L(w, ) = w> Gw + w> d +
i (a>
i (a>
i w bi ) +
i w bi ). (3.127)
2
iI
Let
iE
then the KKT conditions (3.114) for this problem can be written as
a>
i w bi < 0 i I \ A(w )
(3.128a)
a>
i w
(3.128b)
bi = 0 i E A(w )
i
Gw + d +
X
iA(w )
i ai +
0 i A(w )
i ai = 0.
(3.128c)
(3.128d)
iE
Classical optimization techniques must compute this sum in its entirety for
each evaluation of the objective, respectively its gradient. As available data
sets grow ever larger, such batch optimizers therefore become increasingly
inefficient. They are also ill-suited for the incremental setting, where partial
data must be modeled as it arrives.
136
3 Optimization
Stochastic gradient-based methods, by contrast, work with gradient estimates obtained from small subsamples (mini-batches) of training data. This
can greatly reduce computational requirements: on large, redundant data
sets, simple stochastic gradient descent routinely outperforms sophisticated
second-order batch methods by orders of magnitude.
The key idea here is that J(w) is replaced by an instantaneous estimate
Jt which is computed from a mini-batch of size k comprising of a subset of
points (xti , yit ) with i = 1, . . . , k drawn from the dataset:
k
1X
Jt (w) = (w) +
l(w, xti , yit ).
k
(3.129)
i=1
(3.130)
converges to the true minimizer of J(w) if the stepsize t decays as O(1/ t).
For instance, one could set
r
t =
,
(3.131)
+t
where > 0 is a tuning parameter. See Algorithm 3.8 for details.
3.4.1.1 Practical Considerations
One simple yet effective rule of thumb to tune is to select a small subset
of data, try various values of on this subset, and choose the that most
reduces the objective function.
In some cases letting t to decay as O(1/t) has been found to be more
effective:
t =
.
(3.132)
+t
The free parameter > 0 can be tuned as described above. If (w) is strongly convex, then dividing the stepsize t by yields good practical
performance.
137
Compute stepsize t =
6:
wt+1 = wt t Jt (wt )
t=t+1
end while
Return: wT
7:
8:
9:
+t
(3.133)
where f and g are convex functions. Clearly, J is not convex, but there
exists a reasonably simple algorithm namely the Concave-Convex Procedure
(CCP) for finding a local minima of J. The basic idea is simple: In the
tth iteration replace g by its first order Taylor expansion at wt , that is,
g(wt ) + hw wt , g(wt )i and minimize
Jt (w) = f (w) g(wt ) hw wt , g(wt )i .
(3.134)
(3.135)
138
3 Optimization
10
200
20
150
30
100
40
50
50
60
0
70
801.0
1.5
2.0
2.5
3.0
3.5
4.0
501.0
1.5
2.0
2.5
3.0
3.5
4.0
Fig. 3.13. Given the function on the left we decompose it into the difference of two
convex functions depicted on the right panel. The CCP algorithm generates iterates
by matching points on the two convex curves which have the same tangent vectors.
As can be seen, the iterates approach the solution x = 2.0.
Theorem 3.24 Let J be a function which can be decomposed into a difference of two convex functions e.g., (3.133). The iterates generated by (3.135)
monotically decrease J. Furthermore, the stationary point of the iterates is
a local minima of J.
Proof Since f and g are convex
f (wt ) f (wt+1 ) + hwt wt+1 , f (wt+1 )i
g(wt+1 ) g(wt ) + hwt+1 wt , g(wt )i .
Adding the two inequalities, rearranging, and using (3.135) shows that J(wt ) =
f (wt ) g(wt ) f (wt+1 ) g(wt+1 ) = J(wt+1 ), as claimed.
Let w be a stationary point of the iterates. Then f (w ) = g(w ),
which in turn implies that w is a local minima of J because J(w ) = 0.
There are a number of extensions to CCP. We mention only a few in the
passing. First, it can be shown that all instances of the EM algorithm (Section ??) can be shown to be special cases of CCP. Second, the rate of con-
139
140
3 Optimization
Problems
Problem 3.1 (Intersection of Convex Sets {1}) If C1 and C2 are convex sets, then show that C1 C2 is also convex. Extend your result to show
T
that ni=1 Ci are convex if Ci are convex.
Problem 3.2 (Linear Transform of Convex Sets {1}) Given a set C
Rn and a linear transform A Rmn , define AC := {y = Ax : x C}. If
C is convex then show that AC is also convex.
Problem 3.3 (Convex Combinations {1}) Show that a subset of Rn is
convex if and only if it contains all the convex combination of its elements.
Problem 3.4 (Convex Hull {2}) Show that the convex hull, conv(X) is
the smallest convex set which contains X.
Problem 3.5 (Epigraph of a Convex Function {2}) Show that a function satisfies Definition 3.3 if, and only if, its epigraph is convex.
Problem 3.6 Prove the Jensens inequality (3.6).
Problem 3.7 (Strong convexity of the negative entropy {3}) Show that
the negative entropy (3.15) is 1-strongly convex with respect to the kk1 norm
2
on the simplex. Hint: First show that (t) := (t 1) log t 2 (t1)
t+1 0 for
all t 0. Next substitute t = xi /yi to show that
X
xi
(xi yi ) log
kx yk21 .
yi
i
141
Problem 3.8 (Strongly Convex Functions {2}) Prove 3.16, 3.17, 3.18
and 3.19.
Problem 3.9 (Convex Functions with Lipschitz Continuous Gradient {2})
Prove 3.22, 3.23, 3.24 and 3.25.
Problem 3.10 (One Dimensional Projection {1}) If f : Rd R is
convex, then show that for an arbitrary x and p in Rd the one dimensional
function () := f (x + p) is also convex.
Problem 3.11 (Quasi-Convex Functions {2}) In Section 3.1 we showed
that the below-sets of a convex function Xc := {x | f (x) c} are convex. Give
a counter-example to show that the converse is not true, that is, there exist
non-convex functions whose below-sets are convex. This class of functions is
called Quasi-Convex.
Problem 3.12 (Gradient of the p-norm {1}) Show that the gradient of
the p-norm (3.31) is given by (3.32).
Problem 3.13 Derive the Fenchel conjugate of the following functions
(
0
if x C
f (x) =
where C is a convex set
otherwise.
f (x) = ax + b
1
f (x) = x> Ax where A is a positive definite matrix
2
f (x) = log(x)
f (x) = exp(x)
f (x) = x log(x)
1
t
142
3 Optimization
4
Online Learning and Boosting
So far the learning algorithms we considered assumed that all the training
data is available before building a model for predicting labels on unseen data
points. In many modern applications data is available only in a streaming
fashion, and one needs to predict labels on the fly. To describe a concrete
example, consider the task of spam filtering. As emails arrive the learning
algorithm needs to classify them as spam or ham. Tasks such as these are
tackled via online learning. Online learning proceeds in rounds. At each
round a training example is revealed to the learning algorithm, which uses
its current model to predict the label. The true label is then revealed to
the learner which incurs a loss and updates its model based on the feedback
provided. This protocol is summarized in Algorithm 4.1. The goal of online
learning is to minimize the total loss incurred. By an appropriate choice
of labels and loss functions, this setting encompasses a large number of
tasks such as classification, regression, and density estimation. In our spam
detection example, if an email is misclassified the user can provide feedback
which is used to update the spam filter, and the goal is to minimize the
number of misclassified emails.
(4.1)
The prediction on a new data point is computed via a majority vote amongst
the consistent experts: yt = majority(Ct ).
Lemma 4.1 The Halving algorithm makes at most log2 (n) mistakes.
143
144
T
X
ft (w).
(4.2)
t=1
(w, w0 ) = (w) (w0 ) w w0 , (w0 ) .
We can also generalize the orthogonal projection (3.104) by replacing the
square Euclidean norm with the above Bregman divergence:
P, (w0 ) = argmin (w, w0 ).
w
(4.3)
145
(4.4)
(4.5)
For the rest of this chapter we will make the following standard assumptions:
146
f (w) f (w0 ) w w0 , (w, w0 ) for all f (w0 ).
(4.6)
The usual notion of strong convexity is recovered by setting () =
1
2
kk2 .
t2
kgt k2
2
(4.7)
t2
kgt k2 . (4.8)
2
Proof We prove the result in three steps. First we upper bound (w, wt+1 )
by (w, w
t+1 ). This is a consequence of (4.4) and the non-negativity of the
Bregman divergence which allows us to write
(w, wt+1 ) (w, w
t+1 ).
(4.9)
(4.10)
147
(4.7).
Now we are ready to prove regret bounds.
Lemma 4.4 Let w denote the best parameter chosen in hindsight,
and let kgt k L for all t. Then the regret of Algorithm 4.2 can be bounded
via
T
T
X
1
L2 X
ft (wt ) ft (w ) F
T +
t .
(4.12)
T
2
t=1
t=1
t
1
((1 t ) (w , wt ) (w , wt+1 )) +
kgt k2 .
t
2
Summing over t
T
X
ft (wt ) ft (w )
t=1
T
T
X
X
t
1
((1 t ) (w , wt ) (w , wt+1 )) +
kgt k2 .
t
2
|t=1
{z
} |t=1 {z
}
T1
T2
(w , w1 )
(w , wT +1 ) +
(w , wt )
T1 =
1
T
t t1
t=2
T
X
1
1
1
(w , w1 ) +
(w , wt )
1
t t1
t=2
T
X
1
1
1
1
F +
F
=F
T .
1
t t1
T
t=2
On the other hand, since the subgradients are Lipschitz continuous with
constant L it follows that
T2
T
L2 X
t .
2
t=1
ft (xt ) ft (x )
1
t
then
L2
(1 + log(T )),
2
148
t=1
then
L2
T.
ft (xt ) ft (x ) F +
1
t .
In this case
1
T
= T , and
L2 X 1
L2
ft (wt ) ft (w )
(1 + log(T )).
2
t
2
t=1
T
X
L2 X 1
L2
F T +
T.
ft (wt ) ft (w ) F T +
2 t
t=1
t=1
Problems
Problem 4.1 (Generalized Cauchy-Schwartz {1}) Show that hx, yi
2
2
n
1
Pb
t=a 2 t
5
Conditional Densities
A number of machine learning algorithms can be derived by using conditional exponential families of distribution (Section 2.3). Assume that the
training set {(x1 , y1 ), . . . , (xm , ym )} was drawn iid from some underlying
distribution. Using Bayes rule (1.15) one can write the likelihood
p(|X, Y ) p()p(Y |X, ) = p()
m
Y
p(yi |xi , ),
(5.1)
i=1
m
X
(5.2)
i=1
Because we do not have any prior knowledge about the data, we choose a
zero mean unit variance isotropic normal distribution for p(). This yields
m
log p(|X, Y ) =
X
1
kk2
log p(yi |xi , ) + const.
2
(5.3)
i=1
(5.4)
then
m
log p(|X, Y ) =
X
1
kk2 +
g(|xi ) h(xi , yi ), i + const.
2
(5.5)
i=1
where
g(|x) = log
(5.6)
yY
150
5 Conditional Densities
can be used to obtain the maximum aposteriori (MAP) estimate for . Given
the optimal , the class label at any given x can be predicted using
y = argmax p(y|x, ).
(5.7)
(5.8)
Define (x)
:= (x, +1) (x, 1). Plugging (5.8) into (5.4), using the
1
D
E and
1 + exp (x),
p(y = 1|x, ) =
1 + exp
1
D
(x),
E ,
or more compactly
1
p(y|x, ) =
1 + exp
D
y (x),
E .
(5.9)
Since p(y|x, ) is a logistic function, hence the name logistic regression. The
classification rule (5.7) in this case specializes as follows: predict +1 whenever p(y = +1|x, ) p(y = 1|x, ) otherwise predict 1. However
E
p(y = +1|x, ) D
log
= (x), ,
p(y = 1|x, )
E
D
as our prediction function. Using (5.9) we can write the objective function of logistic regression
as
m
D
E
X
1
i ),
kk2 +
log 1 + exp yi (x
2
i=1
5.2 Regression
151
yi (xi ),
i=1 1 + exp
=+
m
X
i ).
(p(yi |xi , ) 1)yi (x
i=1
Notice that the second term of the gradient vanishes whenever p(yi |xi , ) =
1. Therefore, one way to interpret logistic regression is to view it as a method
to maximize p(yi |xi , ) for each point (xi , yi ) in the training set. Since the
objective function of logistic regression is twice differentiable one can also
compute its Hessian
2 J() = I
m
X
i )(x
i )> ,
p(yi |xi , )(1 p(yi |xi , ))(x
i=1
where we used yi2 = 1. The Hessian can be used in the Newton method
(Section 3.2.6) to obtain the optimal parameter .
5.2 Regression
5.2.1 Conditionally Normal Models
fixed variance
152
5 Conditional Densities
153
Definition
Discuss that they are modeling joint distribution p(x, y)
The way they predict is by marginalizing out x
Why they are wasteful and why CRFs generally outperform them
6
Kernels and Function Spaces
Kernels are measures of similarity. Broadly speaking, machine learning algorithms which rely only on the dot product between instances can be kernelized by replacing all instances of hx, x0 i by a kernel function k(x, x0 ).
We saw examples of such algorithms in Sections 1.3.3 and 1.3.4 and we will
see many more examples in Chapter 7. Arguably, the design of a good kernel underlies the success of machine learning in many applications. In this
chapter we will lay the ground for the theoretical properties of kernels and
present a number of examples. Algorithms which use these kernels can be
found in later chapters.
(6.1)
where is a feature map which maps X into some dot product space H. In
other words, kernels correspond to dot products in some dot product space.
The main advantage of using a kernel as a similarity measure are threefold:
First, if the feature space is rich enough, then simple estimators such as
hyperplanes and half-spaces may be sufficient. For instance, to classify the
points in Figure BUGBUG, we need a nonlinear decision boundary, but
once we map the points to a 3 dimensional space a hyperplane suffices.
Second, kernels allow us to construct machine learning algorithms in the
dot product space H without explicitly computing (x). Third, we need not
make any assumptions about the input space X other than for it to be a
set. As we will see later in this chapter, this allows us to compute similarity
between discrete objects such as strings, trees, and graphs. In the first half
of this chapter we will present some examples of kernels, and discuss some
theoretical properties of kernels in the second half.
155
156
6.1.1 Examples
6.1.1.1 Linear Kernel
Linear kernels are perhaps the simplest of all kernels. We assume that x Rn
and define
X
k(x, x0 ) = x, x0 =
xi x0i .
i
If x and x0 are dense then computing the kernel takes O(n) time. On the
other hand, for sparse vectors this can be reduced to O(|nnz(x) nnz(x0 )|),
where nnz() denotes the set of non-zero indices of a vector and | | denotes the size of a set. Linear kernels are a natural representation to use for
vectorial data. They are also widely used in text mining where documents
are represented by a vector containing the frequency of occurrence of words
(Recall that we encountered this so-called bag of words representation in
Chapter 1). Instead of a simple bag of words, one can also map a text to the
set of pairs of words that co-occur in a sentence for a richer representation.
6.1.1.2 Polynomial Kernel
Given x Rn , we can compute a feature map by taking all the d-th
order products (also called the monomials) of the entries of x. To illustrate
with a concrete example, let usconsider x = (x1 , x2 ) and d = 2, in which
case (x) = x21 , x22 , x1 x2 , x2 x1 . Although it is tedious to compute (x)
and (x0 ) explicitly in order to compute k(x, x), there is a shortcut as the
following proposition shows.
Proposition 6.1 Let (x) (resp. (x0 )) denote the vector whose entries
are all possible d-th degree ordered products of the entries of x (resp. x0 ).
Then
d
k(x, x0 ) = (x), (x0 ) = x, x0
.
(6.2)
Proof By direct computation
X
X
(x), (x0 ) =
...
xj1 . . . xjd x0j1 . . . x0jd
j1
jd
xj1 x0j1 . . .
j1
d
x, x0
X
jd
xjd x0jd =
X
j
xj x0j
157
The kernel (6.2) is called the polynomial kernel. An useful extension is the
inhomogeneous polynomial kernel
d
k(x, x0 ) = x, x0 + c ,
(6.3)
which computes all monomials up to degree d (problem 6.2).
6.1.1.3 Radial Basis Function Kernels
6.1.1.4 Convolution Kernels
The framework of convolution kernels is a general way to extend the notion
of kernels to structured objects such as strings, trees, and graphs. Let x X
be a discrete object which can be decomposed into P parts xp Xp in many
different ways. As a concrete example consider the string x = abc which can
be split into two sets of substrings of size two namely {a, bc} and {ab, c}.
We denote the set of all such decompositions as R(x), and use it to build a
kernel on X as follows:
[k1 ? . . . ? kP ] (x, x0 ) =
P
Y
x
R(x),
x0 R(x0 )
p=1
kp (
xp , x
0p ).
(6.4)
Here, the sum is over all possible ways in which we can decompose x and
x0 into x
1 , . . . , x
P and x
01 , . . . , x
0P respectively. If the cardinality of R(x) is
finite, then it can be shown that (6.4) results in a valid kernel. Although
convolution kernels provide the abstract framework, specific instantiations
of this idea lead to a rich set of kernels on discrete objects. We will now
discuss some of them in detail.
6.1.1.5 String Kernels
The basic idea behind string kernels is simple: Compare the strings by
means of the subsequences they contain. More the number of common subsequences, the more similar two strings are. The subsequences need not have
equal weights. For instance, the weight of a subsequence may be given by the
inverse frequency of its occurrence. Similarly, if the first and last characters
of a subsequence are rather far apart, then its contribution to the kernel
must be down-weighted.
Formally, a string x is composed of characters from a finite alphabet
and |x| denotes its length. We say that s is a subsequence of x = x1 x2 . . . x|x|
if s = xi1 xi2 . . . xi|s| for some 1 i1 < i2 < . . . < i|s| |x|. In particular, if
ii+1 = ii + 1 then s is a substring of x. For example, acb is not a subsequence
of adbc while abc is a subsequence and adc is a substring. Assume that there
exists a function #(x, s) which returns the number of times a subsequence
158
that is, we count all occurrences of s in x but now the weight associated with
a subsequence depends on its length. To illustrate, consider the subsequence
abc which occurs in the string abcebc twice, namely, abcebc and abcebc. The
first occurrence is counted with weight 3 while the second occurrence is
counted with the weight 6 . As it turns out, this kernel can be computed
by a dynamic programming algorithm (problem BUGBUG) in O(|x| |x0 |)
time.
159
(6.7)
(6.8)
160
21
31
2
1
3
34
12
24
22
14
32
G1
1
33
3
G2
23
G
13
Fig. 6.1. Two graphs (G1 & G2 ) and their direct product (G ). Each node of the
direct product graph is labeled with a pair of nodes (6.7); an edge exists in the
direct product if and only if the corresponding nodes are adjacent in both original
graphs (6.8). For instance, nodes 110 and 320 are adjacent because there is an edge
between nodes 1 and 3 in the first, and 10 and 20 in the second graph.
sum might not converge, leaving the kernel value undefined. To overcome
this problem, we introduce appropriately chosen non-negative coefficients
(t), and define the kernel between G and G0 as
0
k(G, G ) :=
> t
(t) q
A p .
(6.9)
t=0
This idea can be extended to graphs whose nodes are associated with labels
by replacing the matrix A with a matrix of label similarities. For appropriate choices of (t) the above sum converges and efficient algorithms for
computing the kernel can be devised. See [?] for details.
As it turns out, the simple idea of performing a random walk on the prod-
6.2 Kernels
161
(6.10)
i,j
(6.11)
162
i,j
We now establish the converse, that is, we show that every positive semidefinite kernel function can be written as (6.1). Towards this end, define a
map from X into the space of functions mapping X to R (denoted RX ) via
(x) = k(, x). In other words, (x) : X R is a function which assigns the
value k(x0 , x) to x0 X. Next construct a vector space by taking all possible
linear combinations of (x)
f () =
n
X
i (xi ) =
i=1
n
X
i k(, xi ),
(6.12)
i=1
hf, gi =
n X
n
X
i j k(xi , x0j ).
(6.13)
i=1 j=1
To see that the above dot product is well defined even though it contains
the expansion coefficients (which need not be unique), note that hf, gi =
Pn0
Pn
0
i=1 i f (xi ),
j=1 j f (xj ), independent of i . Similarly, for g, note that hf, gi =
this time independent of j . This also shows that hf, gi is bilinear. Symmetry follows because hf, gi = hg, f i, while the positive semi-definiteness of k
implies that
X
hf, f i =
i j k(xi , xj ) 0.
(6.14)
i,j
(6.15)
k(, x), k(, x0 ) = k(x, x0 ).
(6.16)
In particular
163
(6.17)
(6.18)
6.4.3 Regularization
Representer theorem, regularization
164
7
Linear Models
A hyperplane in a space H endowed with a dot product h, i is described by
the set
{x H| hw, xi + b = 0}
(7.1)
(7.2)
i=1,...m
166
7 Linear Models
{x | hw, xi + b = 1}
{x | hw, xi + b = 1}
yi = +1
x2
x1
yi = 1
hw, x1 i + b = +1
hw, x2 i + b = 1
D hw, x1 x2Ei = 2
w
2
kwk , x1 x2 = kwk
{x | hw, xi + b = 0}
Fig. 7.1. A linearly separable toy binary classification problem of separating the
diamonds from the circles. We normalize (w, b) to ensure that mini=1,...m | hw, xi i +
1
b | = 1. In this case, the margin is given by kwk
as the calculation in the inset shows.
whenever yi = +1 and hw, xi i+b < 0 whenever yi = 1. Furthermore, as discussed above, we fix the scaling of w by requiring mini=1,...m | hw, xi i+b | = 1.
A compact way to write our desiderata is to require yi (hw, xi i + b) 1 for
all i (also see Figure 7.1). The problem of maximizing the margin therefore
reduces to
max
w,b
1
kwk
(7.3a)
(7.3b)
or equivalently
1
kwk2
2
s.t. yi (hw, xi i + b) 1 for all i.
min
w,b
(7.4a)
(7.4b)
This is a constrained convex optimization problem with a quadratic objective function and linear constraints (see Section 3.3). In deriving (7.4) we
implicitly assumed that the data is linearly separable, that is, there is a
hyperplane which correctly classifies the training data. Such a classifier is
called a hard margin classifier. If the data is not linearly separable, then
(7.4) does not have a solution. To deal with this situation we introduce
167
1
CX
min kwk2 +
i
w,b, 2
m
(7.5a)
(7.5b)
i=1
i 0,
(7.5c)
i=1
i=1
i=1
X
X
CX
1
i +
i (1 i yi (hw, xi i + b))
i i .
L(w, b, , , ) = kwk2 +
2
m
Next take gradients with respect to w, b and and set them to zero.
w L = w
m
X
i yi xi = 0
(7.6a)
i=1
b L =
m
X
i yi = 0
(7.6b)
i=1
i L =
C
i i = 0.
m
(7.6c)
Substituting (7.6) into the Lagrangian and simplifying yields the dual objective function:
m
X
1X
yi yj i j hxi , xj i +
i ,
2
i,j
(7.7)
i=1
168
7 Linear Models
boils down to
m
min
s.t.
X
1X
yi yj i j hxi , xj i
i
2
i,j
m
X
(7.8a)
i=1
i yi = 0
(7.8b)
C
.
m
(7.8c)
i=1
0 i
2
s.t. > y = 0
C
0 i .
m
min
(7.9a)
(7.9b)
(7.9c)
m
X
i yi (xi ),
i=1
m
X
i yi k(xi , x) + b.
(7.10)
i=1
{x | hw, xi + b = 1}
169
{x | hw, xi + b = 1}
Fig. 7.2. The picture depicts the well classified points (yi (hw, xi i + b) > 1 in black,
the support vectors yi (hw, xi i + b) = 1 in blue, and margin errors yi (hw, xi i + b) < 1
in red.
yi (hw, xi i + b) < 1: In this case, i > 0, and hence the KKT conditions
C
imply that i = 0. Consequently, i = m
(see (7.6c)). Such points
are said to be margin errors.
yi (hw, xi i + b) > 1: In this case, i = 0, (1i yi (hw, xi i+b)) < 0, and by
the KKT conditions i = 0. Such points are said to be well classified.
It is easy to see that the decision boundary (7.10) does not change
even if these points are removed from the training set.
yi (hw, xi i + b) = 1: In this case i = 0 and i 0. Since i is non-negative
C
and satisfies (7.6c) it follows that 0 i m
. Such points are said
to be on the margin. They are also sometimes called support vectors.
Since the support vectors satisfy yi (hw, xi i + b) = 1 and yi {1} it follows
that b = yi hw, xi i for any support vector xi . However, in practice to
recover b we average
b = yi
hw, xi i .
(7.11)
i
C
over all support vectors, that is, points xi for which 0 < i < m
. Because
it uses support vectors, the overall algorithm is called C-Support Vector
classifier or C-SV classifier for short.
170
7 Linear Models
min
w,b
1
CX
kwk2 +
max(0, 1 yi (hw, xi i + b)).
2
m
(7.12)
i=1
(7.13)
It is easy to verify that the binary hinge loss (7.13) is convex but nondifferentiable (see Figure 7.3) which renders the overall objective function
(7.12) to be convex but non-smooth. There are two different strategies to
minimize such an objective function. If minimizing (7.12) in the primal, one
can employ non-smooth convex optimizers such as bundle methods (Section
3.2.7). This yields a d dimensional problem where d is the dimension of x.
On the other hand, since (7.12) is strongly convex because of the presence
of the 21 kwk2 term, its Fenchel dual has a Lipschitz continuous gradient
(see Lemma 3.10). The dual problem is m dimensional and contains linear
constraints. This strategy is particularly attractive when the kernel trick is
used or whenever d m. In fact, the dual problem obtained via Fenchel
duality is very related to the Quadratic programming problem (7.9) obtained
via Lagrange duality (problem 7.4).
m
Y
i=1
p(yi |xi , ),
(7.14)
171
loss
y(hw, xi + b)
Fig. 7.3. The binary hinge loss. Note that the loss is convex but non-differentiable
at the kink point. Furthermore, it increases linearly as the distance from the decision
hyperplane y(hw, xi + b) decreases.
m
X
(7.15)
i=1
In the absence of any prior knowledge about the data, we choose a zero
mean unit variance isotropic normal distribution for p(). This yields
m
X
1
log p(|X, Y ) = kk2
log p(yi |xi , ) + const.
2
(7.16)
i=1
(7.17)
Of course, our aim is not just to maximize p(yi |xi , ) but also to ensure
that p(y|xi , ) is small for all y 6= yi . This, for instance, can be achieved by
requiring
p(yi |xi , )
, for all y 6= yi and some 1.
p(y|xi , )
(7.18)
As we saw in Section 2.3 exponential families of distributions are rather flexible modeling tools. We could, for instance, model p(yi |xi , ) as a conditional
exponential family distribution. Recall the definition:
p(y|x, ) = exp (h(x, y), i g(|x)) .
(7.19)
172
7 Linear Models
Here (x, y) is a joint feature map which depends on both the input data x
and the label y, while g(|x) is the log-partition function. Now (7.18) boils
down to
p(yi |xi , )
= exp
(xi , yi ) max (xi , y),
.
(7.20)
y6=yi
maxy6=yi p(y|xi , )
If we choose such that log = 1, set (x, y) = y2 (x), and observe that
y {1} we can rewrite (7.20) as
Dy
y
E
i
i
(xi )
(xi ), = yi h(xi ), i 1.
(7.21)
2
2
By replacing log p(yi |xi , ) in (7.16) with the condition (7.21) we obtain
the following objective function:
min
s.t.
1
kk2
2
yi h(xi ), i 1 for all i,
(7.22a)
(7.22b)
which recovers (7.4), but without the bias b. The prediction function is
recovered by noting that (7.17) specializes to
y = argmax h(x, y), i = argmax
y{1}
y{1}
y
h(x), i = sign(h(x), i).
2
(7.23)
As before, we can replace (7.21) by a linear penalty for constraint violai |xi ,)
tion in order to recover (7.5). The quantity log maxp(y
is sometimes
y6=yi p(y|xi ,)
called the log-odds ratio, and the above discussion shows that SVMs can
be interpreted as maximizing the log-odds ratio in the exponential family.
This interpretation will be developed further when we consider extensions of
SVMs to tackle multiclass, multilabel, and structured prediction problems.
173
B
HBB
HB B
B
2
>
>
B B
s.t.
y=0
(7.24b)
C
for all i B.
(7.24c)
0 i
m
HBB HB B
Here,
is a permutation of the matrix H. By eliminating
HBB
HB B
constant terms and rearranging, one can simplify the above problem to
1 >
>
HBB B + B
(HBB
B
e)
2 B
>
>
s.t. B
yB = B
yB
C
for all i B.
0 i
m
min
B
(7.25a)
(7.25b)
(7.25c)
An extreme case of a decomposition method is the Sequential Minimal Optimization (SMO) algorithm of Platt [Pla99], which updates only two coefficients per iteration. The advantage of this strategy as we will see below is
that the resultant sub-problem can be solved analytically.
Without loss of
>
generality let B = {i, j}, and define s = yi /yj , ci cj = (HBB
B
e)
> y /y ). Then (7.25) specializes to
and d = (B
j
B
1
(Hii i2 + Hjj j2 + 2Hij j i ) + ci i + cj j
2
s.t. si + j = d
C
0 i , j .
m
min
i ,j
(7.26a)
(7.26b)
(7.26c)
H=
(
d C
max(0, sm )
d
s)
max(0,
(
C d
min( m
, s)
C
C d m
min( m
, s
if s > 0
(7.27)
otherwise
if s > 0
)
otherwise,
(7.28)
174
7 Linear Models
(7.29)
(7.30)
(7.31)
(7.32)
175
C
i ) = 0,
m
(7.33)
(7.34a)
(7.34b)
C
.
m
(7.34c)
Iup = {i : i <
Idown
(7.35a)
(7.35b)
iIdown
(7.36)
Therefore, a natural stopping criterion is to stop when the KKT gap falls
below a desired tolerance , that is,
m() M () + .
(7.37)
Finally, we turn our attention to the issue of working set selection. The
first order approximation to the objective function J() can be written as
J( + d) J() + J()> d.
Since we are only interested
in updating coefficients in the working set B
>
>
we set d = dB 0 , in which case we can rewrite the above first order
176
7 Linear Models
approximation as
J()>
B dB J( + d) J().
From among all possible directions dB we wish to choose one which decreases
the objective function the most while maintaining feasibility. This is best
expressed as the following optimization problem:
min J()>
B dB
(7.38a)
>
s.t. yB
dB = 0
(7.38b)
dB
di 0 if i = 0 and i B
C
di 0 if i =
and i B
m
1 di 1.
(7.38c)
(7.38d)
(7.38e)
Here (7.38b) comes from y > ( + d) = 0 and y > = 0, while (7.38c) and
C
. Finally, (7.38e) prevents the objective
(7.38d) comes from 0 i m
function from diverging to . If we specialize (7.38) to SMO, we obtain
min J()i di + J()j dj
(7.39a)
s.t. yi di + yj dj = 0
(7.39b)
i,j
dk 0 if k = 0 and k {i, j}
C
dk 0 if k =
and k {i, j}
m
1 dk 1 for k {i, j}.
(7.39c)
(7.39d)
(7.39e)
At first glance, it seems that choosing the optimal i and j from the set
{1, . . . , m}{1, . . . m} requires O(m2 ) effort. We now show that O(m) effort
suffices.
Define new variables dk = yk dk for k {i, j}, and use the observation
yk {1} to rewrite the objective function as
(yi J()i + yj J()j ) dj .
Consider the case J()i yi J()j yj . Because of the constraints
(7.39c) and (7.39d) if we choose i Iup and j Idown , then dj = 1 and
di = 1 is feasible and the objective function attains a negative value. For
all other choices of i and j (i, j Iup ; i, j Idown ; i Idown and j Iup )
the objective function value of 0 is attained by setting di = dj = 0. The
case J()j yj J()i yi is analogous. In summary, the optimization
7.2 Extensions
177
iIup ,jIdown
jIdown
which clearly can be solved in O(m) time. Comparison with (7.36) shows
that at every iteration of SMO we choose to update coefficients i and j
which maximally violate the KKT conditions.
7.2 Extensions
7.2.1 The trick
In the soft margin formulation the parameter C is a trade-off between two
conflicting requirements namely maximizing the margin and minimizing the
training error. Unfortunately, this parameter is rather unintuitive and hence
difficult to tune. The -SVM was proposed to address this issue. As Theorem
7.3 below shows, controls the number of support vectors and margin errors.
The primal problem for the -SVM can be written as
m
min
w,b,,
1
1 X
kwk2 +
i
2
m
(7.40a)
i=1
(7.40b)
(7.40c)
s.t.
1X
yi yj i j hxi , xj i
2
(7.41a)
i,j
m
X
i=1
m
X
i yi = 0
(7.41b)
i 1
(7.41c)
i=1
0 i
1
.
m
(7.41d)
It turns out that the dual can be further simplified via the following lemma.
178
7 Linear Models
Lemma 7.2 Let [0, 1] and (7.41) be feasible. Then there is at least one
P
solution which satisfies i i = 1. Furthermore, if the final objective value
P
of (7.41) is non-zero then all solutions satisfy i i = 1.
Proof The feasible region of (7.41) is bounded, therefore if it is feasible
then there exists an optimal solution. Let denote this solution and assume
P
that i i > 1. In this case we can define
=P
2
2
2
2
j j
This implies that either 12 > H = 0, in which case
is an optimal solution
1 >
with the desired property or 2 H 6= 0, in which case all optimal solutions
P
satisfy i i = 1.
In view of the above theorem one can equivalently replace (7.41) by the
following simplified optimization problem with two equality constraints
1X
min
yi yj i j hxi , xj i
(7.42a)
2
s.t.
i,j
m
X
i yi = 0
(7.42b)
i = 1
(7.42c)
i=1
m
X
i=1
1
.
(7.42d)
m
The following theorems, which we state without proof, explain the significance of and the connection between -SVM and the soft margin formulation.
0 i
Theorem 7.3 Suppose we run -SVM with kernel k on some data and
obtain > 0. Then
(i) is an upper bound on the fraction of margin errors, that is points
for which yi (hw, xi i + bi ) < .
7.2 Extensions
179
y(hw, xi + b)
Fig. 7.4. The 0-1 loss which is non-convex and intractable is depicted in red. The
hinge loss is a convex upper bound to the 0-1 loss and shown in blue. The square
hinge loss is a differentiable convex upper bound to the 0-1 loss and is depicted in
green.
fact, it has been shown that finding the optimal (w, b) pair which minimizes
the 0-1 loss on a training dataset of m labeled points is NP hard [BDEL03].
Therefore various proxy functions such as the binary hinge loss (7.13) which
we discussed in Section 7.1.1 are used. Another popular proxy is the square
180
7 Linear Models
hinge loss:
l(w, x, y) = max(0, 1 y(hw, xi + b))2 .
(7.44)
Besides being a proxy for the 0-1 loss, the squared hinge loss, unlike the
hinge loss, is also differentiable everywhere. This sometimes makes the optimization in the primal easier. Just like in the case of the hinge loss one can
derive the dual of the regularized risk minimization problem and show that
it is a quadratic programming problem (problem 7.5).
(7.45)
parameterized by s 0 is another proxy for the 0-1 loss (see Figure 7.5).
Although not convex, it can be expressed as the difference of two convex
functions
lconc (w, x, y) = max(0, 1 y(hw, xi + b)) and
lcave (w, x, y) = max(0, s y(hw, xi + b)).
Fig. 7.5. The ramp loss, depicted here with s = 0.3, can be viewed as the sum
of a convex function, namely the binary hinge loss (left), and a concave function
min(0, 1 − y(⟨w, x⟩ + b)) (right). Viewed alternatively, the ramp loss can be written
as the difference of two convex functions.

Therefore the Convex-Concave Procedure (CCP) we discussed in Section
3.5.1 can be used to solve the resulting regularized risk minimization problem
with the ramp loss. Towards this end write
J(w) = (1/2)‖w‖² + (C/m) Σ_{i=1}^m l_conc(w, x_i, y_i) − (C/m) Σ_{i=1}^m l_cave(w, x_i, y_i)    (7.46)
     = J_conc(w) − J_cave(w),

where J_conc(w) := (1/2)‖w‖² + (C/m) Σ_i l_conc(w, x_i, y_i) and
J_cave(w) := (C/m) Σ_i l_cave(w, x_i, y_i).
Recall that at every iteration of the CCP we replace J_cave(w) by its first
order Taylor approximation, computing which requires

∂_w J_cave(w) = (C/m) Σ_{i=1}^m ∂_w l_cave(w, x_i, y_i).    (7.47)

A direct calculation shows that ∂_w l_cave(w, x_i, y_i) = −δ_i y_i x_i, where

δ_i = 1 if s > y_i(⟨w, x_i⟩ + b) and δ_i = 0 otherwise.    (7.48)
Ignoring constant terms, each iteration of the CCP algorithm involves solving the following minimization problem (also see (3.134)):

J(w) = (1/2)‖w‖² + (C/m) Σ_{i=1}^m l_conc(w, x_i, y_i) + (C/m) Σ_{i=1}^m δ_i y_i ⟨x_i, w⟩.    (7.49)

Dualizing as in Section 7.1 yields

min_α (1/2) Σ_{i,j} y_i y_j α_i α_j ⟨x_i, x_j⟩ − Σ_i α_i    (7.50a)
s.t. αᵀy = 0    (7.50b)
−(C/m) δ_i ≤ α_i ≤ (C/m)(1 − δ_i).    (7.50c)
In fact, this problem can be solved by an SMO solver with minor modifications. Putting everything together yields Algorithm 7.1.

Algorithm 7.1 CCP for Ramp Loss
1: Initialize δ⁰ and α⁰
2: repeat
3:   Solve (7.50) to find α^{t+1}
4:   Compute δ^{t+1} using (7.48)
5: until δ^{t+1} = δ^t
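The structure of this iteration can be illustrated with a rough numpy sketch. It replaces the SMO inner solver of Algorithm 7.1 by plain subgradient descent on the convexified objective (7.49), so it only mirrors the outer CCP loop under the assumptions stated in the comments; learning rate and iteration counts are arbitrary.

import numpy as np

def ccp_ramp(X, y, C=1.0, s=0.0, lr=0.01, inner=200, outer=10):
    # Sketch of Algorithm 7.1: delta_i = 1 if s > y_i(<w, x_i> + b), cf. (7.48);
    # the concave part is linearized and the convex subproblem is minimized
    # approximately by subgradient descent instead of SMO.
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    delta = np.zeros(m)
    for t in range(outer):
        new_delta = (s > y * (X @ w + b)).astype(float)
        if t > 0 and np.array_equal(new_delta, delta):
            break                                       # delta^{t+1} = delta^t
        delta = new_delta
        for _ in range(inner):
            margin = y * (X @ w + b)
            active = (margin < 1).astype(float)         # hinge subgradient mask
            gw = w - (C / m) * X.T @ (active * y) + (C / m) * X.T @ (delta * y)
            gb = -(C / m) * np.sum(active * y) + (C / m) * np.sum(delta * y)
            w -= lr * gw
            b -= lr * gb
    return w, b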
Fig. 7.6. Loss as a function of y − (⟨w, x⟩ + b). The ε-insensitive loss: all points
which lie within the ε-tube shaded in gray incur zero loss, while points outside incur
a linear loss.
|y_i − ⟨w, x_i⟩ − b| ≤ ε for all i.    (7.51)

In other words, we want to find a hyperplane such that all the training data
lies within an ε-tube around the hyperplane. We may not always be able to
find such a hyperplane, hence we relax the above condition by introducing
slack variables ξ_i^+ and ξ_i^- and write the corresponding primal problem as

min_{w,b,ξ^+,ξ^-} (1/2)‖w‖² + (C/m) Σ_{i=1}^m (ξ_i^+ + ξ_i^-)    (7.52a)
s.t. y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i^+ and ξ_i^+ ≥ 0, and    (7.52b)
⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i^- for all i    (7.52c)
ξ_i^- ≥ 0.    (7.52d)
To compute the dual we form the Lagrangian

L(w, b, ξ^+, ξ^-, α^+, α^-, η^+, η^-) = (1/2)‖w‖² + (C/m) Σ_{i=1}^m (ξ_i^+ + ξ_i^-) − Σ_{i=1}^m (η_i^+ ξ_i^+ + η_i^- ξ_i^-)
  + Σ_{i=1}^m α_i^+ (y_i − (⟨w, x_i⟩ + b) − ε − ξ_i^+)
  + Σ_{i=1}^m α_i^- ((⟨w, x_i⟩ + b) − y_i − ε − ξ_i^-).
Taking gradients with respect to the primal variables and setting them to
0, we obtain the following conditions:

w = Σ_{i=1}^m (α_i^+ − α_i^-) x_i    (7.53)
Σ_{i=1}^m α_i^+ = Σ_{i=1}^m α_i^-    (7.54)
α_i^+ + η_i^+ = C/m    (7.55)
α_i^- + η_i^- = C/m.    (7.56)
Noting that α_i^±, η_i^± ≥ 0 and substituting the above conditions into
the Lagrangian yields the dual

min_{α^+,α^-} (1/2) Σ_{i,j} (α_i^+ − α_i^-)(α_j^+ − α_j^-) ⟨x_i, x_j⟩    (7.57a)
  − Σ_{i=1}^m y_i (α_i^+ − α_i^-) + ε Σ_{i=1}^m (α_i^+ + α_i^-)    (7.57b)
s.t. Σ_{i=1}^m α_i^+ = Σ_{i=1}^m α_i^-    (7.57c)
0 ≤ α_i^± ≤ C/m.    (7.57d)
This is a quadratic programming problem with one equality constraint, and
hence an SMO-like decomposition method can be derived for finding the
optimal coefficients α^+ and α^- (Problem 7.7).
As a consequence of (7.53), analogous to the classification case, one can
map the data via a feature map φ into an RKHS with kernel k and recover
the decision boundary f(x) = ⟨w, φ(x)⟩ + b via
f(x) = Σ_{i=1}^m (α_i^+ − α_i^-) ⟨φ(x_i), φ(x)⟩ + b = Σ_{i=1}^m (α_i^+ − α_i^-) k(x_i, x) + b.
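As a quick illustration of this kernel expansion (not from the text), the sketch below fits an ε-insensitive support vector regression with scikit-learn's SVR on arbitrary synthetic data; dual_coef_ holds the differences of the dual variables for the support vectors, so the prediction is exactly an expansion of the above form.

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, (80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
# reg.dual_coef_ contains the (alpha_i^+ - alpha_i^-) of the support vectors.
print(len(reg.support_), "support vectors out of", len(y))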
(7.59)
It turns out that the support vector regression framework can easily be
extended to handle other, more general, convex loss functions such as the
ones found in Table 7.1. Different losses have different properties and hence
lead to different estimators. For instance, the square loss leads to penalized
least squares (LS) regression, while the Laplace loss leads to the penalized
least absolute deviations (LAD) estimator. Huber's loss, on the other hand, is
a combination of the penalized LS and LAD estimators, and the pinball loss
with parameter τ ∈ [0, 1] is used to estimate τ-quantiles. Setting τ = 0.5
in the pinball loss leads to a scaled version of the Laplace loss. If we define
ξ = y − ⟨w, x⟩, then it is easily verified that all these losses can be written
as

l(w, x, y) = l^+(ξ)  if ξ > ε,
             l^-(ξ)  if ξ < −ε,    (7.60)
             0       if ξ ∈ [−ε, ε].
For all these different loss functions, the support vector regression formulation can be written as

min_{w,b,ξ^+,ξ^-} (1/2)‖w‖² + (C/m) Σ_{i=1}^m (l^+(ξ_i^+) + l^-(ξ_i^-))    (7.61a)
s.t. y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i^+ and ξ_i^+ ≥ 0, and    (7.61b)
⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i^- for all i    (7.61c)
ξ_i^- ≥ 0.    (7.61d)
The corresponding dual is

min_{α^+,α^-} (1/2) Σ_{i,j} (α_i^+ − α_i^-)(α_j^+ − α_j^-) ⟨x_i, x_j⟩    (7.62a)
  − Σ_{i=1}^m y_i (α_i^+ − α_i^-) + ε Σ_{i=1}^m (α_i^+ + α_i^-) − (C/m) Σ_{i=1}^m (T^+(ξ_i^+) + T^-(ξ_i^-))    (7.62b)
s.t. Σ_{i=1}^m α_i^+ = Σ_{i=1}^m α_i^-    (7.62c)
0 ≤ α_i^± ≤ (C/m) ∂_ξ l^±(ξ_i^±)    (7.62d)
where T^±(ξ) := l^±(ξ) − ξ ∂_ξ l^±(ξ) and ξ_i^± = inf{ξ^± | (C/m) ∂_ξ l^±(ξ^±) ≥ α_i^±}.    (7.62e)
Table 7.1. Convex loss functions for regression, written as functions of ξ = y − ⟨w, x⟩.

ε-insensitive:   max(0, |ξ| − ε)
Laplace:         |ξ|
Square:          (1/2) ξ²
Huber's robust:  (1/2) ξ² if |ξ| ≤ 1, and |ξ| − 1/2 otherwise
Pinball:         τξ if ξ ≥ 0, and (τ − 1)ξ otherwise
If we observe that ε = 0 for the pinball loss, then (7.62) specializes as follows:

min_α (1/2) Σ_{i,j} α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^m y_i α_i    (7.63a)
s.t. Σ_{i=1}^m α_i = 0    (7.63b)
(C/m)(τ − 1) ≤ α_i ≤ (C/m) τ.    (7.63c)
Similar specializations of (7.62) for other loss functions in Table 7.1 can be
derived.
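For intuition, the following is a minimal numpy sketch of quantile estimation with the pinball loss (not the dual QP (7.63)); it minimizes the regularized pinball loss directly by subgradient descent, with an arbitrary regularization constant and step size.

import numpy as np

def pinball_fit(X, y, tau=0.9, lam=1e-3, lr=0.01, steps=2000):
    """Estimate a linear tau-quantile by subgradient descent on the
    regularized pinball loss (illustrative sketch only)."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        xi = y - X @ w                       # residuals xi = y - <w, x>
        # pinball subgradient in xi: tau where xi > 0, (tau - 1) where xi < 0
        g = np.where(xi > 0, tau, tau - 1.0)
        w -= lr * (lam * w - X.T @ g / m)    # chain rule: d xi / d w = -x
    return w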
7.3.2 Incorporating the ν-Trick
One can also incorporate the ν-trick into support vector regression. The
primal problem obtained after incorporating the ν-trick can be written as
min_{w,b,ξ^+,ξ^-,ε} (1/2)‖w‖² + ε + 1/(νm) Σ_{i=1}^m (ξ_i^+ + ξ_i^-)    (7.64a)
s.t. y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i^+ and ξ_i^+ ≥ 0, and    (7.64b)
⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i^- and ξ_i^- ≥ 0 for all i    (7.64c)
ε ≥ 0.    (7.64d)

Proceeding as before, the corresponding dual is

min_{α^+,α^-} (1/2) Σ_{i,j} (α_i^- − α_i^+)(α_j^- − α_j^+) ⟨x_i, x_j⟩ − Σ_{i=1}^m y_i (α_i^- − α_i^+)    (7.65a)
s.t. Σ_{i=1}^m (α_i^- − α_i^+) = 0    (7.65b)
Σ_{i=1}^m (α_i^- + α_i^+) = 1    (7.65c)
0 ≤ α_i^+ ≤ 1/(νm)    (7.65d)
0 ≤ α_i^- ≤ 1/(νm).    (7.65e)
(7.66)
Given the input data X one can compute the empirical density

p(x) = 1/m if x ∈ X, and p(x) = 0 otherwise,
and estimate its (not necessarily unique) ν-quantiles. Unfortunately, such
estimates are very brittle and do not generalize well to unseen data. One
possible way to address this issue is to restrict C to be simple subsets such
as spheres or half-spaces. In other words, we estimate simple sets which
contain a ν fraction of the dataset. For our purposes, we specifically work
with half-spaces defined by hyperplanes. While half-spaces may seem rather
restrictive, remember that the kernel trick can be used to map data into
a high-dimensional space; half-spaces in the mapped space correspond to
non-linear decision boundaries in the input space. Furthermore, instead of
explicitly identifying C we will learn an indicator function for C, that is, a
function f which takes on the value +1 inside C and −1 elsewhere.
With (1/2)‖w‖² as a regularizer, the problem of estimating a hyperplane such
that a large fraction of the points in the input data X lie on one of its sides
can be written as:
min_{w,ξ,ρ} (1/2)‖w‖² − ρ + 1/(νm) Σ_{i=1}^m ξ_i    (7.67a)
s.t. ⟨w, x_i⟩ ≥ ρ − ξ_i    (7.67b)
ξ_i ≥ 0.    (7.67c)
Clearly, we want ρ to be as large as possible so that the volume of the halfspace ⟨w, x⟩ < ρ is minimized. Furthermore, ν ∈ [0, 1] is a parameter which
is analogous to the ν we introduced for the ν-SVM earlier. Roughly speaking,
it denotes the fraction of input data for which ⟨w, x_i⟩ ≤ ρ. An alternative
interpretation of (7.67) is to assume that we are separating the data set X
from the origin (see Figure 7.7 for an illustration). Therefore, this method
is also widely known as the one-class SVM.
Fig. 7.7. The novelty detection problem can be viewed as finding a large margin
hyperplane which separates a ν fraction of the data points away from the origin.
To compute the dual of (7.67) we form the Lagrangian

L(w, ξ, ρ, α, β) = (1/2)‖w‖² − ρ + 1/(νm) Σ_{i=1}^m ξ_i + Σ_{i=1}^m α_i (ρ − ξ_i − ⟨w, x_i⟩) − Σ_{i=1}^m β_i ξ_i.
By taking gradients with respect to the primal variables and setting them
to 0 we obtain

w = Σ_{i=1}^m α_i x_i    (7.68)
α_i = 1/(νm) − β_i ≤ 1/(νm)    (7.69)
Σ_{i=1}^m α_i = 1.    (7.70)
Noting that α_i, β_i ≥ 0 and substituting the above conditions into the Lagrangian yields the dual

min_α (1/2) Σ_{i,j} α_i α_j ⟨x_i, x_j⟩    (7.71a)
s.t. 0 ≤ α_i ≤ 1/(νm)    (7.71b)
Σ_{i=1}^m α_i = 1.    (7.71c)
In the multiclass setting the label of an instance x is predicted via

ŷ := argmax_{y ∈ {1,...,k}} f(x, y).    (7.72)

Recall that the joint feature map φ(x, y) was introduced in Section 7.1.2.
One way to interpret the above equation is to view f(x, y) = ⟨w, φ(x, y)⟩ as a compatibility
score between instance x and label y; we assign the label with the highest
compatibility score to x. There are a number of extensions of the binary
hinge loss (7.13) which can be used to estimate this score function. In all
these cases the objective function is written as

min_w J(w) := (λ/2)‖w‖² + (1/m) Σ_{i=1}^m l(w, x_i, y_i).    (7.73)
Here λ is a scalar which trades off the regularizer (1/2)‖w‖² with the empirical
risk (1/m) Σ_{i=1}^m l(w, x_i, y_i). Plugging in different loss functions yields classifiers
for different settings. Two strategies exist for finding the optimal w. Just
like in the binary SVM case, one can compute and maximize the dual of
(7.73). However, the number of dual variables becomes m|Y|, where m is the
number of training points and |Y| denotes the size of the label set. The second
strategy is to optimize (7.73) directly. However, the loss functions we discuss
below are non-smooth, therefore non-smooth optimization algorithms such
as bundle methods (Section 3.2.7) need to be used.
The first variant penalizes every label y′ ≠ y whose score comes within the margin of the correct label,

l(w, x, y) = Σ_{y′ ≠ y} max(0, 1 − ⟨φ(x, y) − φ(x, y′), w⟩),    (7.74)

while the second variant only penalizes the worst offender,

l(w, x, y) := max(0, max_{y′ ≠ y} (1 − ⟨φ(x, y) − φ(x, y′), w⟩)).    (7.75)

Both losses are zero only when

f(x, y) = ⟨φ(x, y), w⟩ ≥ 1 + max_{y′ ≠ y} ⟨φ(x, y′), w⟩ = 1 + max_{y′ ≠ y} f(x, y′).    (7.76)
The margin condition (7.76) can also be interpreted as a log odds ratio in the exponential family. Towards this end choose c such that log c = 1 and rewrite (7.20):

log [ p(y|x, w) / max_{y′ ≠ y} p(y′|x, w) ] = ⟨φ(x, y), w⟩ − max_{y′ ≠ y} ⟨φ(x, y′), w⟩ ≥ 1.

Rearranging yields (7.76).
In the multilabel setting each instance x may be assigned a set of labels Y_x. The corresponding extensions of the above losses are

l(w, x, y) = Σ_{y ∈ Y_x, y′ ∉ Y_x} max(0, 1 − ⟨φ(x, y) − φ(x, y′), w⟩)    (7.77)

and

l(w, x, y) = max(0, max_{y ∈ Y_x, y′ ∉ Y_x} (1 − ⟨φ(x, y) − φ(x, y′), w⟩)).    (7.78)
One can immediately verify that specializing the above losses to the multiclass case recovers (7.74) and (7.75) respectively, while the binary case
recovers (7.13). The above losses are zero only when

min_{y ∈ Y_x} f(x, y) = min_{y ∈ Y_x} ⟨φ(x, y), w⟩ ≥ 1 + max_{y′ ∉ Y_x} ⟨φ(x, y′), w⟩ = 1 + max_{y′ ∉ Y_x} f(x, y′).

This can be interpreted as follows: the losses ensure that all the labels
assigned to x have larger scores than the labels not assigned to x, with
a margin of separation of at least 1.
Although the above loss functions are compatible with multiple labels,
the prediction function argmax_y f(x, y) only takes into account the label
with the highest score. This is a significant drawback of such models, which
can be overcome by using a multiclass approach instead. Let |Y| be the
size of the label set and let z ∈ R^{|Y|} denote a vector with ±1 entries. We set
l(w, x, z) = max(0, 1 − min_{z′_i ≠ z_i} φ(x)ᵀ W (z_i − z′_i) e_i).    (7.79)

An analogous specialization of (7.77) can also be derived wherein the minimum is replaced by a summation. Since the minimum (or summation as the
case may be) is over |Y| possible labels, computing the loss is tractable even
if the set of labels Y is large.
More generally, one can define a ranking loss over preference graphs,

l(w, x, y) = max_{G ∈ A(y)} Σ_{(i,j) ∈ G} max(0, 1 − (f(x, i) − f(x, j))),    (7.80)

where A(y) denotes the set of all possible subgraphs of y. The maximum
margin version, on the other hand, is given by

l(w, x, y) = max_{G ∈ A(y)} max_{(i,j) ∈ G} max(0, 1 − (f(x, i) − f(x, j))).    (7.81)
7.8 Applications
7.8.1 Sequence Annotation
7.8.2 Matching
7.8.3 Ranking
7.8.4 Shortest Path Planning
7.8.5 Image Annotation
7.8.6 Contingency Table Loss
7.9 Optimization
7.9.1 Column Generation
subdifferentials
(7.82)
min_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξ_i    (7.83a)
s.t. y_i ⟨w, x_i⟩ ≥ 1 − ξ_i    (7.83b)
ξ_i ≥ 0.    (7.83c)

Derive the dual of (7.83) and contrast it with (7.9). What changes to the
SMO algorithm would you make to solve this dual?
Problem 7.3 (Deriving the simplified ν-SVM dual {2}) In Lemma 7.2
we used (7.41) to show that the constraint Σ_i α_i ≥ 1 can be replaced by
Σ_i α_i = 1. Show that an equivalent way to arrive at the same conclusion is
by arguing that the constraint ρ ≥ 0 is redundant in the primal (7.40). Hint:
Observe that whenever ρ < 0 the objective function is always non-negative.
On the other hand, setting w = ξ = b = ρ = 0 yields an objective function
value of 0.
Problem 7.4 (Fenchel and Lagrange Duals {2}) We derived the Lagrange dual of (7.12) in Section 7.1 and showed that it is (7.9). Derive the
Fenchel dual of (7.12) and relate it to (7.9). Hint: See Theorem 3.3.5 of
[BL00].
Problem 7.5 (Dual of the square hinge loss {1}) The analog of (7.5)
when working with the square hinge loss is the following:

min_{w,b,ξ} (1/2)‖w‖² + (C/m) Σ_{i=1}^m ξ_i²    (7.84a)
s.t. y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i    (7.84b)
ξ_i ≥ 0.    (7.84c)

Derive the Lagrange dual of the above optimization problem and show that
it is a quadratic programming problem.
Problem 7.6 (Dual of the ramp loss {1}) Derive the Lagrange dual of
(7.49) and show that it is the quadratic programming problem (7.50).

Problem 7.7 (SMO for various SVM formulations {2}) Derive an SMO-like decomposition algorithm for solving the dual of the following problems:
• ν-SVM (7.41).
• SV regression (7.57).
• SV novelty detection (7.71).
Problem 7.8 (Novelty detection with Balls {2}) In Section 7.4 we assumed that we wanted to estimate a halfspace which contains a major fraction of the input data. An alternative approach is to use balls, that is, we
estimate a ball of small radius in feature space which encloses a majority of
the input data. Write the corresponding optimization problem and its dual.
Show that if the kernel is translation invariant, that is, k(x, x′) depends only
on ‖x − x′‖, then the optimization problem with balls is equivalent to (7.71).
Explain why this happens geometrically.
Problem 7.9 (Multiclass and Multilabel loss from Ranking Loss {1})
Show how the multiclass (resp. multilabel) losses (7.74) and (7.75) (resp.
(7.77) and (7.79)) can be derived as special cases of (7.80) and (7.81) respectively.
Problem 7.10 Invariances (basic loss)
Problem 7.11 Polynomial transformations - SDP constraints
Appendix 1
Linear Algebra and Functional Analysis
k ≥ (4 + 2β)(ε²/2 − ε³/3)^{-1} log n    (1.1)
(1.2)
(1.3)
Our proof presentation by and large follows [?]. We first show that
Lemma 1.2 For any arbitrary vector α ∈ R^d let q_i denote the i-th component of f(α). Then q_i ∼ N(0, ‖α‖²/k) and hence

E[‖f(α)‖²] = Σ_{i=1}^k E[q_i²] = ‖α‖².    (1.4)
In other words, the expected lengths of vectors are preserved even after embedding them into a k-dimensional space. Next we show that the lengths of
the embedded vectors are tightly concentrated around their mean.
Lemma 1.3 For any ε > 0 and any unit vector α ∈ R^d we have

Pr[‖f(α)‖² > 1 + ε] < exp(−(k/2)(ε²/2 − ε³/3))    (1.5)
Pr[‖f(α)‖² < 1 − ε] < exp(−(k/2)(ε²/2 − ε³/3)).    (1.6)
2 / n^{2+β}.    (1.7)
E[q_i²] = E[((1/√k) Σ_{j=1}^d R_{ij} α_j)²] = (1/k) Σ_{j=1}^d Σ_{l=1}^d α_j α_l E[R_{ij} R_{il}] = (1/k) Σ_{j=1}^d α_j² = (1/k) ‖α‖².
(1.8)

The last equality is because the q_i's are i.i.d. Since α is a unit vector, from
the previous lemma q_i ∼ N(0, 1/k). Therefore, k q_i² is a χ² random variable
with moment generating function

E[exp(λ q_i²)] = E[exp((λ/k) · k q_i²)] = 1 / √(1 − 2λ/k).

Plugging this into (1.8)

Pr[exp(λ‖f(α)‖²) > exp(λ(1 + ε))] ≤ exp(−λ(1 + ε)) (1 − 2λ/k)^{-k/2}.

Setting λ = kε/(2(1 + ε)) in the above inequality and simplifying

Pr[exp(λ‖f(α)‖²) > exp(λ(1 + ε))] ≤ (exp(−ε)(1 + ε))^{k/2}.
Appendix 2
Conjugate Distributions
Binomial ⇄ Beta

φ(x) = x
e^{h(nμ, n)} = Γ(nμ + 1) Γ(n(1 − μ) + 1) / Γ(n + 2) = B(nμ + 1, n(1 − μ) + 1)

In traditional notation one represents the conjugate as

p(z; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] z^{α−1} (1 − z)^{β−1},

where α = nμ and β = n(1 − μ).
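As a small illustration of this conjugacy (not from the text), the sketch below performs the standard Beta posterior update for a binomial observation using SciPy; the prior pseudo-counts and data are arbitrary.

from scipy.stats import beta

# Beta(a, b) prior on the success probability combined with n observations
# containing s successes yields the posterior Beta(a + s, b + n - s).
a, b = 1.0, 1.0          # uniform prior
n, s = 20, 14            # hypothetical observations
posterior = beta(a + s, b + n - s)
print(posterior.mean(), posterior.interval(0.95))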
Multinomial / Binomial
Gaussian
Laplace
Poisson
Dirichlet
Wishart
Student-t
Beta
Gamma
Appendix 3
Loss Functions
(3.1)
Table A3.1. Scalar loss functions and their derivatives, depending on f := ⟨w, x⟩ and on y.

Loss                        l(f, y)                                                Derivative l′(f, y)
Logistic [CSS00]            log(1 + exp(−yf))                                      −y/(1 + exp(yf))
Novelty                     max(0, ρ − f)                                          0 if f ≥ ρ and −1 otherwise
Least mean squares          (1/2)(f − y)²                                          f − y
Least absolute deviation    |f − y|                                                sign(f − y)
ε-insensitive [VGS97]       max(0, |f − y| − ε)                                    0 if |f − y| ≤ ε, else sign(f − y)
Huber's robust loss         (1/2)(f − y)² if |f − y| ≤ 1, else |f − y| − 1/2        f − y if |f − y| ≤ 1, else sign(f − y)
Poisson regression          exp(f) − yf                                            exp(f) − y

Vectorial loss functions and their derivatives, depending on the vector f := W x and on y.

Loss                                                              Derivative
Soft-margin multiclass: max_{y′}(f_{y′} − f_y + Δ(y, y′))         e_{y*} − e_y, where y* is the argmax of the loss
Scaled soft-margin: max_{y′} Γ(y, y′)(f_{y′} − f_y + Δ(y, y′))    Γ(y, y*)(e_{y*} − e_y), where y* is the argmax of the loss
Softmax multiclass: log Σ_{y′} exp(f_{y′}) − f_y                  [Σ_{y′} e_{y′} exp(f_{y′})] / [Σ_{y′} exp(f_{y′})] − e_y
Multivariate regression: (1/2)(f − y)ᵀ M(f − y), M ⪰ 0            M(f − y)
For instance, the soft margin loss of the binary SVM takes on the value max(0, 1 − y⟨w, x⟩). The latter is a convex function in
w and x. Likewise, we may rewrite the ε-insensitive loss, Huber's robust
loss, the quantile regression loss, and the novelty detection loss in terms of
loss functions rather than a constrained optimization problem. In all cases,
⟨w, x⟩ will play a key role insofar as the loss is convex in terms of the scalar
quantity ⟨w, x⟩. A large number of loss functions fall into this category,
as described in Table A3.1. Note that not all functions of this type are
continuously differentiable. In this case we adopt the convention that

∂_x max(f(x), g(x)) = ∂_x f(x) if f(x) ≥ g(x), and ∂_x g(x) otherwise.    (3.2)

Since we are only interested in obtaining an arbitrary element of the subdifferential this convention is consistent with our requirements.
Let us discuss the issue of efficient computation. For all scalar losses we
may write l(x, y, w) = l(⟨w, x⟩, y), as described in Table A3.1. In this case a
simple application of the chain rule yields that ∂_w l(x, y, w) = l′(⟨w, x⟩, y) x.
For instance, for squared loss we have

l(⟨w, x⟩, y) = (1/2)(⟨w, x⟩ − y)² and l′(⟨w, x⟩, y) = ⟨w, x⟩ − y.
Consequently, the derivative of the empirical risk term is given by

∂_w R_emp(w) = (1/m) Σ_{i=1}^m l′(⟨w, x_i⟩, y_i) x_i.    (3.3)
(3.4)
This is possible for any of the loss functions listed in Table A3.1, and many
other similar losses. The advantage of this unified representation is that implementation of each individual loss can be done in very little time. The
computational infrastructure for computing Xw and gᵀX is shared. Evaluating l(f_i, y_i) and l′(f_i, y_i) for all i can be done in O(m) time and it is
not time-critical in comparison to the remaining operations. Algorithm 3.1
describes the details.
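The following is a minimal numpy sketch of this shared infrastructure (not the book's Algorithm 3.1): any scalar loss plugs in via the two callables loss and dloss, and the dominant costs are the two matrix-vector products.

import numpy as np

def emp_risk_and_grad(X, y, w, loss, dloss):
    """Evaluate R_emp(w) and its gradient for a scalar loss l(f, y), reusing
    the products f = X w and g^T X as in (3.3)."""
    f = X @ w                        # O(md): the dominant cost
    g = dloss(f, y)                  # elementwise l'(f_i, y_i), O(m)
    risk = np.mean(loss(f, y))
    grad = X.T @ g / len(y)          # (1/m) sum_i l'(f_i, y_i) x_i
    return risk, grad

# e.g. squared loss: l(f, y) = (f - y)^2 / 2 and l'(f, y) = f - y
sq = lambda f, y: 0.5 * (f - y) ** 2
dsq = lambda f, y: f - y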
An important but often neglected issue is worth mentioning. Computing f
requires us to right multiply the matrix X with the vector w, while computing
g requires the left multiplication of X with the vector gᵀ. If X is stored in a
row major format then Xw can be computed rather efficiently, while gᵀX is
l(x, y, w) = log Σ_{y′ ∈ Y} exp(⟨w, φ(x, y′)⟩) − ⟨w, φ(x, y)⟩,    (3.5)

l(x, y, w) = max_{y′ ∈ Y} Γ(y, y′) ⟨w, φ(x, y′) − φ(x, y)⟩ + Δ(y, y′).    (3.6)

Here φ(x, y) is a joint feature map, Δ(y, y′) ≥ 0 describes the cost of misclassifying y by y′, and Γ(y, y′) ≥ 0 is a scaling term which indicates by how
much the large margin property should be enforced. For instance, [TGK04]
choose Γ(y, y′) = 1. On the other hand [TJHA05] suggest Γ(y, y′) = Δ(y, y′),
which reportedly yields better performance. Finally, [McA07] recently suggested generic functions Γ(y, y′).
The logistic loss can also be interpreted as the negative log-likelihood of
a conditional exponential family model:

p(y|x; w) := exp(⟨w, φ(x, y)⟩ − g(w|x)),    (3.7)

where the normalizing constant g(w|x), often called the log-partition function, reads

g(w|x) := log Σ_{y′ ∈ Y} exp(⟨w, φ(x, y′)⟩).    (3.8)
As a consequence of the Hammersley-Clifford theorem [Jor08] every exponential family distribution corresponds to an undirected graphical model. In
our case this implies that the labels y factorize according to an undirected
graphical model. A large number of problems have been addressed by this
setting, amongst them named entity tagging [LMP01], sequence alignment
[TJHA05], segmentation [RSS+ 07] and path planning [RBZ06]. It is clearly
impossible to give examples of all settings in this section, nor would a brief
summary do this field any justice. We therefore refer the reader to the edited
volume [BHS+ 07] and the references therein.
If the underlying graphical model is tractable then efficient inference algorithms based on dynamic programming can be used to compute (3.5) and
(3.6). We discuss intractable graphical models in Section A3.1.2.1, and now
turn our attention to the derivatives of the above structured losses.
When it comes to computing derivatives of the logistic loss, (3.5), we have

∂_w l(x, y, w) = [Σ_{y′} φ(x, y′) exp(⟨w, φ(x, y′)⟩)] / [Σ_{y′} exp(⟨w, φ(x, y′)⟩)] − φ(x, y)    (3.9)
             = E_{y′ ∼ p(y′|x)}[φ(x, y′)] − φ(x, y),    (3.10)

where p(y|x) is the exponential family model (3.7). In the case of (3.6) we
denote by ȳ(x) the argmax of the RHS, that is

ȳ(x) := argmax_{y′} Γ(y, y′) ⟨w, φ(x, y′) − φ(x, y)⟩ + Δ(y, y′).    (3.11)
(3.12)
In the case where the loss is maximized for more than one distinct value ȳ(x)
we may average over the individual values, since any convex combination of
such terms lies in the subdifferential.
Note that (3.6) majorizes Δ(y, y*), where y* := argmax_{y′} ⟨w, φ(x, y′)⟩
[TJHA05]. This can be seen via the following series of inequalities:

Δ(y, y*) ≤ Γ(y, y*) ⟨w, φ(x, y*) − φ(x, y)⟩ + Δ(y, y*) ≤ l(x, y, w).

The first inequality follows because Γ(y, y*) ≥ 0 and y* maximizes ⟨w, φ(x, y′)⟩,
thus implying that Γ(y, y*) ⟨w, φ(x, y*) − φ(x, y)⟩ ≥ 0. The second inequality follows by definition of the loss.
We conclude this section with a simple lemma which is at the heart of
several derivations of [Joa05]. While the proof in the original paper is far
from trivial, it is straightforward in our setting:
Lemma 3.1 Denote by Δ(y, y′) a loss and let φ(x_i, y_i) be a feature map for
observations (x_i, y_i) with 1 ≤ i ≤ m. Moreover, denote by X, Y the set of
all m patterns and labels respectively. Finally let

Φ(X, Y) := Σ_{i=1}^m φ(x_i, y_i) and Δ(Y, Y′) := Σ_{i=1}^m Δ(y_i, y_i′).    (3.13)

Then the following two losses are equivalent:

Σ_{i=1}^m max_{y′} ⟨w, φ(x_i, y′) − φ(x_i, y_i)⟩ + Δ(y_i, y′)  and  max_{Y′} ⟨w, Φ(X, Y′) − Φ(X, Y)⟩ + Δ(Y, Y′).
This is immediately obvious, since both feature map and loss decompose,
which allows us to perform maximization over Y 0 by maximizing each of its
m components. In doing so, we showed that aggregating all data and labels
into a single feature map and loss yields results identical to minimizing
the sum over all individual losses. This holds, in particular, for the sample
error loss of [Joa05]. Also note that this equivalence does not hold whenever
Γ(y, y′) is not constant.
A3.1.2.1 Intractable Models
We now discuss cases where computing l(x, y, w) itself is too expensive. For
instance, for intractable graphical models, the sum Σ_y exp(⟨w, φ(x, y)⟩)
cannot be computed efficiently. [WJ03] propose the use of a convex majorization of the log-partition function in those cases. In our setting this means
that instead of dealing with

l(x, y, w) = g(w|x) − ⟨w, φ(x, y)⟩ where g(w|x) := log Σ_y exp(⟨w, φ(x, y)⟩)    (3.14)
one uses a more easily computable convex upper bound on g via
sup_{μ ∈ MARG(x)} ⟨w, μ⟩ + H_Gauss(μ|x).    (3.15)
Likewise note that [TGK04] use relaxations when solving structured estimation problems of the form

l(x, y, w) = max_{y′} Γ(y, y′) ⟨w, φ(x, y′) − φ(x, y)⟩ + Δ(y, y′).    (3.16)
(3.18)
Obviously, we could compute R_emp(w) and its derivative by an O(m²) operation. However [Joa05] showed that both can be computed in O(m log m)
time using a sorting operation, which we now describe.
Denote by c = f − (1/2)y an auxiliary variable and let i and j be indices such
Using the same convex majorization as above when we were maximizing the
ROC score, we obtain an empirical risk of the form

R_emp(w) = (1/M) Σ_{y_i < y_j} C(y_i, y_j) max(0, 1 + ⟨w, x_i − x_j⟩).    (3.20)
Now the goal is to find an efficient algorithm for obtaining the number of
times the individual losses are nonzero, so as to compute both the
value and the gradient of R_emp(w). The complication arises from the fact
that observations x_i with label y_i may appear on either side of the inequality
depending on whether y_j < y_i or y_j > y_i. This problem can be solved as
follows: sort f = Xw in ascending order and traverse it while keeping track
of how many items with a lower value yj are no more than 1 apart in terms
of their value of fi . This way we may compute the count statistics efficiently.
Algorithm 3.3 describes the details, generalizing the results of [Joa06]. Again,
its runtime is O(m log m), thus allowing for efficient computation.
(3.21)
(i,j)P
(3.23)
The implementation is straightforward, as given in Algorithm 3.4.
A3.1.3.4 Ranking
In webpage and document ranking we are often in a situation similar to that
described in Section A3.1.3.2, however with the difference that we do not
only care about objects x_i being ranked according to scores y_i but moreover
that different degrees of importance are placed on different documents.
The information retrieval literature is full of a large number of different scoring functions. Examples are criteria such as Normalized Discounted
Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), Precision@n, or
Expected Rank Utility (ERU). They are used to address the issue of evaluating rankers, search engines or recommender systems [Voo01, JK02, BHK98,
BH04]. For instance, in webpage ranking only the first k retrieved documents matter, since users are unlikely to look beyond the first k, say
10, retrieved webpages in an internet search. [LS07] show that these scores
can be optimized directly by minimizing the following loss:
l(X, y, w) = max_π Σ_i c_i ⟨w, x_{π(i)} − x_i⟩ + ⟨a − a(π), b(y)⟩.    (3.24)
(3.26)
Here π⁻¹ denotes the inverse permutation, such that π ∘ π⁻¹ = 1. Finding the
permutation maximizing cᵀf(π) − a(π)ᵀb(y) is a linear assignment problem
which can be easily solved by the Hungarian Marriage algorithm, that is,
the Kuhn-Munkres algorithm.
The original papers by [Kuh55] and [Mun57] implied an algorithm with
O(m³) cost in the number of terms. Later, [Kar80] suggested an algorithm
with expected quadratic time in the size of the assignment problem (ignoring log-factors). Finally, [OL93] propose a linear time algorithm for large
problems. Since in our case the number of pages is fairly small (on the order
of 50 to 200 per query) the scaling behavior per query is not too important.
We used an existing implementation due to [JV87].
Note also that training sets consist of a collection of ranking problems,
that is, we have several ranking problems of size 50 to 200. By means of
parallelization we are able to distribute the work onto a cluster of workstations, which is able to overcome the issue of the rather costly computation
per collection of queries. Algorithm 3.5 spells out the steps in detail.
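For orientation, the sketch below performs only the assignment step described above, using SciPy's Hungarian-type solver rather than the implementations cited in the text. It assumes the per-position weights c, b and the per-item scores f, a are given as equal-length numpy vectors; everything else in Algorithm 3.5 is omitted.

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_permutation(c, f, a, b):
    """Find the permutation maximizing c^T f(pi) - a(pi)^T b by solving a
    linear assignment problem."""
    # gain[i, j] = contribution of placing item j at position i
    gain = np.outer(c, f) - np.outer(b, a)
    row, col = linear_sum_assignment(-gain)   # maximize = minimize the negative
    return col                                # col[i] is the item put at position i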
A3.1.3.5 Contingency Table Scores
[Joa05] observed that F_β scores and related quantities dependent on a contingency table can also be computed efficiently by means of structured estimation. Such scores depend in general on the number of true and false
positives and negatives alike. Algorithm 3.6 shows how a corresponding empirical risk and subgradient can be computed efficiently. As with the previous losses, here again we use convex majorization to obtain a tractable
optimization problem.
           y > 0    y < 0
y′ > 0     T₊       F₊
y′ < 0     F₋       T₋

In this case the F_β score is given by

(1 + β²) T₊ / (T₊ + m₋ − T₋ + β² m₊).    (3.27)
max_{y′} ⟨y′ − y, f⟩ + Δ(T₊, T₋).    (3.28)
18: y′_i ← 1, i = 1, . . . , T
19:
20: g ← (y′ − y)ᵀ X
return Risk r and subgradient g
l(x, y, W) = max_{y′} Γ(y, y′) ⟨W_{y′} − W_y, x⟩ + Δ(y, y′)    (3.29)
(3.30)
Here Γ and Δ are defined as in Section A3.1.2 and y* denotes the value of y′
for which the RHS of (3.29) is maximized. This means that for unstructured
multiclass settings we may simply compute xᵀW. Since this needs to be performed for all observations x_i we may take advantage of fast linear algebra
routines and compute f = XW for efficiency. Likewise note that computing the gradient over m observations is now a matrix-matrix multiplication,
too: denote by G the matrix whose rows are the gradients Γ(y_i, y_i*)(e_{y_i*} − e_{y_i}). Then
∂_W R_emp(X, y, W) = GᵀX. Note that G is very sparse with at most two
nonzero entries per row, which makes the computation of GᵀX essentially
as expensive as two matrix-vector multiplications. Whenever we have many
classes, this may yield significant computational gains.
Log-likelihood scores of exponential families share similar expansions. We
have

l(x, y, W) = log Σ_{y′} exp(⟨w, φ(x, y′)⟩) − ⟨w, φ(x, y)⟩ = log Σ_{y′} exp(⟨W_{y′}, x⟩) − ⟨W_y, x⟩    (3.31)

∂_W l(x, y, W) = [Σ_{y′} (e_{y′} ⊗ x) exp(⟨W_{y′}, x⟩)] / [Σ_{y′} exp(⟨W_{y′}, x⟩)] − e_y ⊗ x.    (3.32)

The main difference to the soft-margin setting is that the gradients are
not sparse in the number of classes. This means that the computation of
gradients is slightly more costly.
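A compact numpy sketch of this matrix formulation (not from the text) is given below; it assumes one weight vector per class stored in the columns of W (a transposed convention relative to the rows used above), so the returned gradient has the same shape as W.

import numpy as np

def softmax_risk_and_grad(X, y, W):
    """Empirical risk and gradient of the multiclass log-loss (3.31)/(3.32)
    using the shared products f = X W and G^T X discussed above."""
    m = X.shape[0]
    F = X @ W                                   # scores f_{i,y'} = <W[:, y'], x_i>
    F -= F.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(F)
    P /= P.sum(axis=1, keepdims=True)           # p(y'|x_i)
    risk = -np.mean(np.log(P[np.arange(m), y]))
    G = P.copy()
    G[np.arange(m), y] -= 1.0                   # rows E[e_{y'}] - e_{y_i}
    grad = X.T @ G / m                          # (1/m) X^T G, same shape as W
    return risk, grad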
A3.1.4.2 Ontologies
Fig. A3.1. Two ontologies. Left: a binary hierarchy with internal nodes {1, . . . , 7}
and labels {8, . . . , 15}. Right: a generic directed acyclic graph with internal nodes
{1, . . . , 6, 12} and labels {7, . . . , 11, 13, . . . , 15}. Note that node 5 has two parents,
namely nodes 2 and 3. Moreover, the labels need not be found at the same level of
the tree: nodes 14 and 15 are one level lower than the rest of the nodes.
The same reasoning applies to estimation when using an exponential family model. The only difference is that we need to compute a soft-max
over paths rather than exclusively choosing the best path over the ontology. Again, a breadth-first recursion suffices: each of the leaves y of the
DAG is associated with a probability p(y|x). To obtain E_{y∼p(y|x)}[φ(y)] all
we need to do is perform a bottom-up traversal of the DAG summing over
all probability weights on the path. Wherever a node has more than one
parent, we distribute the probability weight equally over its parents.
Bibliography
[ABB+ 00] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M.
Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris,
D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, Gene ontology: tool for the
unification of biology. the gene ontology consortium, Nat Genet 25 (2000), 25
29.
[AGML90] S. F. Altschul, W. Gish, E. W. Myers, and D. J. Lipman, Basic local
alignment search tool, Journal of Molecular Biology 215 (1990), no. 3, 403
410.
[BBL05] O. Bousquet, S. Boucheron, and G. Lugosi, Theory of classification: a survey of recent advances, ESAIM: Probab. Stat. 9 (2005), 323 375.
[BCR84] C. Berg, J. P. R. Christensen, and P. Ressel, Harmonic analysis on semigroups, Springer, New York, 1984.
[BDEL03] S. Ben-David, N. Eiron, and P.M. Long, On the difficulty of approximately
maximizing agreements, J. Comput. System Sci. 66 (2003), no. 3, 496–514.
[Bel61] R. E. Bellman, Adaptive control processes, Princeton University Press,
Princeton, NJ, 1961.
[Bel05] Alexandre Belloni, Introduction to bundle methods, Tech. report, Operation
Research Center, M.I.T., 2005.
[Ber85] J. O. Berger, Statistical decision theory and Bayesian analysis, Springer,
New York, 1985.
[BH04] J. Basilico and T. Hofmann, Unifying collaborative and content-based filtering, Proc. Intl. Conf. Machine Learning (New York, NY), ACM Press, 2004,
pp. 65–72.
[BHK98] J. S. Breese, D. Heckerman, and C. Kardie, Empirical analysis of predictive
algorithms for collaborative filtering, Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, 1998, pp. 43–52.
[BHS+ 07] G. Bakir, T. Hofmann, B. Scholkopf, A. Smola, B. Taskar, and S. V. N.
Vishwanathan, Predicting structured data, MIT Press, Cambridge, Massachusetts, 2007.
[Bil68] Patrick Billingsley, Convergence of probability measures, John Wiley and
Sons, 1968.
[Bis95] C. M. Bishop, Neural networks for pattern recognition, Clarendon Press,
Oxford, 1995.
[BK07] R. M. Bell and Y. Koren, Lessons from the netflix prize challenge, SIGKDD
Explorations 9 (2007), no. 2, 7579.
[BKL06] A. Beygelzimer, S. Kakade, and J. Langford, Cover trees for nearest neighbor, International Conference on Machine Learning, 2006.
[BL00] J. M. Borwein and A. S. Lewis, Convex analysis and nonlinear optimization:
Theory and examples, CMS books in Mathematics, Canadian Mathematical
Society, 2000.
[BM92] K. P. Bennett and O. L. Mangasarian, Robust linear programming discrimination of two linearly inseparable sets, Optim. Methods Softw. 1 (1992), 2334.
[BNJ03] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 9931022.
[BT03] D.P. Bertsekas and J.N. Tsitsiklis, Introduction to probability, Athena Scientific, 2003.
[BV04] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University
Press, Cambridge, England, 2004.
[CDLS99] R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter, Probabilistic
networks and expert sytems, Springer, New York, 1999.
[CH04] Lijuan Cai and T. Hofmann, Hierarchical document categorization with support vector machines, Proceedings of the Thirteenth ACM conference on Information and knowledge management (New York, NY, USA), ACM Press, 2004,
pp. 7887.
[Cra46] H. Cramer, Mathematical methods of statistics, Princeton University Press,
1946.
[Cre93] N. A. C. Cressie, Statistics for spatial data, John Wiley and Sons, New York,
1993.
[CS03] K. Crammer and Y. Singer, Ultraconservative online algorithms for multiclass problems, Journal of Machine Learning Research 3 (2003), 951991.
[CSS00] M. Collins, R. E. Schapire, and Y. Singer, Logistic regression, AdaBoost
and Bregman distances, Proc. 13th Annu. Conference on Comput. Learning
Theory, Morgan Kaufmann, San Francisco, 2000, pp. 158–169.
[CV95] Corinna Cortes and V. Vapnik, Support vector networks, Machine Learning
20 (1995), no. 3, 273297.
[DG03] S. Dasgupta and A. Gupta, An elementary proof of a theorem of johnson
and lindenstrauss, Random Struct. Algorithms 22 (2003), no. 1, 6065.
[DG08] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large
clusters, CACM 51 (2008), no. 1, 107113.
[DGL96] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern
recognition, Applications of mathematics, vol. 31, Springer, New York, 1996.
[Fel71] W. Feller, An introduction to probability theory and its applications, 2 ed.,
John Wiley and Sons, New York, 1971.
[FJ95] A. Frieze and M. Jerrum, An analysis of a monte carlo algorithm for estimating the permanent, Combinatorica 15 (1995), no. 1, 6783.
[FS99] Y. Freund and R. E. Schapire, Large margin classification using the perceptron algorithm, Machine Learning 37 (1999), no. 3, 277296.
[FT94] L. Fahrmeir and G. Tutz, Multivariate statistical modelling based on generalized linear models, Springer, 1994.
[GIM99] A. Gionis, P. Indyk, and R. Motwani, Similarity search in high dimensions
via hashing, Proceedings of the 25th VLDB Conference (Edinburgh, Scotland)
(M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie,
eds.), Morgan Kaufmann, 1999, pp. 518529.
[GS04] T.L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the
National Academy of Sciences 101 (2004), 52285235.
[GW92] P. Groeneboom and J. A. Wellner, Information bounds and nonparametric
maximum likelihood estimation, DMV, vol. 19, Springer, 1992.
[Hal92] P. Hall, The bootstrap and edgeworth expansions, Springer, New York, 1992.
[Hay98] S. Haykin, Neural networks : A comprehensive foundation, Macmillan, New
York, 1998, 2nd edition.
[Heb49] D. O. Hebb, The organization of behavior, John Wiley and Sons, New York,
1949.
[Hoe63] W. Hoeffding, Probability inequalities for sums of bounded random variables,
Journal of the American Statistical Association 58 (1963), 1330.
[HUL93] J.B. Hiriart-Urruty and C. Lemarechal, Convex analysis and minimization
algorithms, I and II, vol. 305 and 306, Springer-Verlag, 1993.
[IM98] P. Indyk and R. Motawani, Approximate nearest neighbors: Towards removing the curse of dimensionality, Proceedings of the 30th Symposium on Theory
of Computing, 1998, pp. 604613.
[JK02] K. Järvelin and J. Kekäläinen, IR evaluation methods for retrieving highly
relevant documents, ACM Special Interest Group in Information Retrieval (SIGIR), New York: ACM, 2002, pp. 41–48.
[Joa05] T. Joachims, A support vector method for multivariate performance measures, Proc. Intl. Conf. Machine Learning (San Francisco, California), Morgan
Kaufmann Publishers, 2005, pp. 377–384.
[Joa06]
, Training linear SVMs in linear time, Proc. ACM Conf. Knowledge
Discovery and Data Mining (KDD), ACM, 2006.
[Jor08] M. I. Jordan, An introduction to probabilistic graphical models, MIT Press,
2008, To Appear.
[JV87] R. Jonker and A. Volgenant, A shortest augmenting path algorithm for dense
and sparse linear assignment problems, Computing 38 (1987), 325–340.
[Kar80] R.M. Karp, An algorithm to solve the m × n assignment problem in expected
time O(mn log n), Networks 10 (1980), no. 2, 143–152.
[KD05] S. S. Keerthi and D. DeCoste, A modified finite Newton method for fast
solution of large scale linear SVMs, J. Mach. Learn. Res. 6 (2005), 341361.
[Kel60] J. E. Kelly, The cutting-plane method for solving convex programs, Journal
of the Society for Industrial and Applied Mathematics 8 (1960), no. 4, 703712.
[Kiw90] Krzysztof C. Kiwiel, Proximity control in bundle methods for convex nondifferentiable minimization, Mathematical Programming 46 (1990), 105122.
[KM00] Paul Komarek and Andrew Moore, A dynamic adaptation of AD-trees for
efficient machine learning on large data sets, Proc. Intl. Conf. Machine Learning, Morgan Kaufmann, San Francisco, CA, 2000, pp. 495502.
[Koe05] R. Koenker, Quantile regression, Cambridge University Press, 2005.
[Kuh55] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955), 83–97.
[Lew98] D. D. Lewis, Naive (Bayes) at forty: The independence assumption in information retrieval, Proceedings of ECML-98, 10th European Conference on
Machine Learning (Chemnitz, DE) (C. Nedellec and C. Rouveirol, eds.), no.
1398, Springer Verlag, Heidelberg, DE, 1998, pp. 415.
[LK03] C. Leslie and R. Kuang, Fast kernels for inexact string matching, Proc.
Annual Conf. Computational Learning Theory, 2003.
[LMP01] J. D. Lafferty, A. McCallum, and F. Pereira, Conditional random fields:
Probabilistic modeling for segmenting and labeling sequence data, Proceedings
of International Conference on Machine Learning (San Francisco, CA), vol. 18,
Morgan Kaufmann, 2001, pp. 282–289.
[LNN95] Claude Lemarechal, Arkadii Nemirovskii, and Yurii Nesterov, New variants
of bundle methods, Mathematical Programming 69 (1995), 111147.
[LS07] Q. Le and A.J. Smola, Direct optimization of ranking measures, J. Mach.
Learn. Res. (2007), submitted.
[LT92] Z. Q. Luo and P. Tseng, On the convergence of coordinate descent method
[RSS+ 07] G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Müller, R. J.
Sommer, and B. Schölkopf, Improving the Caenorhabditis elegans genome annotation using machine learning, PLoS Computational Biology 3 (2007), no. 2,
e20 doi:10.1371/journal.pcbi.0030020.
[Rud73] W. Rudin, Functional analysis, McGraw-Hill, New York, 1973.
[Sil86] B. W. Silverman, Density estimation for statistical and data analysis, Monographs on statistics and applied probability, Chapman and Hall, London, 1986.
[SPST+ 01] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C.
Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (2001), no. 7, 1443–1471.
[SS02] B. Schölkopf and A. Smola, Learning with kernels, MIT Press, Cambridge,
MA, 2002.
[SW86] G.R. Shorack and J.A. Wellner, Empirical processes with applications to
statistics, Wiley, New York, 1986.
[SZ92] Helga Schramm and Jochem Zowe, A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical
results, SIAM J. Optimization 2 (1992), 121152.
[TGK04] B. Taskar, C. Guestrin, and D. Koller, Max-margin Markov networks,
Advances in Neural Information Processing Systems 16 (Cambridge, MA)
(S. Thrun, L. Saul, and B. Schölkopf, eds.), MIT Press, 2004, pp. 25–32.
[TJHA05] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large margin
methods for structured and interdependent output variables, J. Mach. Learn.
Res. 6 (2005), 1453–1484.
[Vap82] V. Vapnik, Estimation of dependences based on empirical data, Springer,
Berlin, 1982.
[Vap95]
, The nature of statistical learning theory, Springer, New York, 1995.
[Vap98]
, Statistical learning theory, John Wiley and Sons, New York, 1998.
[vdG00] S. van de Geer, Empirical processes in M-estimation, Cambridge University
Press, 2000.
[vdVW96] A. W. van der Vaart and J. A. Wellner, Weak convergence and empirical
processes, Springer, 1996.
[VGS97] V. Vapnik, S. Golowich, and A. J. Smola, Support vector method for function approximation, regression estimation, and signal processing, Advances in
Neural Information Processing Systems 9 (Cambridge, MA) (M. C. Mozer,
M. I. Jordan, and T. Petsche, eds.), MIT Press, 1997, pp. 281–287.
[Voo01] E. Voorhees, Overview of the TRECT 2001 question answering track,
TREC, 2001.
[VS04] S. V. N. Vishwanathan and A. J. Smola, Fast kernels for string and
tree matching, Kernel Methods in Computational Biology (Cambridge, MA)
(B. Schölkopf, K. Tsuda, and J. P. Vert, eds.), MIT Press, 2004, pp. 113–130.
[VSV07] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal, Binet-Cauchy kernels
on dynamical systems and its application to the analysis of dynamic scenes,
International Journal of Computer Vision 73 (2007), no. 1, 95119.
[Wah97] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces and
the randomized GACV, Tech. Report 984, Department of Statistics, University
of Wisconsin, Madison, 1997.
[Wat64] G. S. Watson, Smooth regression analysis, Sankhya A 26 (1964), 359372.
[Wil98] C. K. I. Williams, Prediction with Gaussian processes: From linear regression
to linear prediction and beyond, Learning and Inference in Graphical Models
(M. I. Jordan, ed.), Kluwer Academic, 1998, pp. 599621.
[WJ03] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, Tech. Report 649, UC Berkeley, Department of
Statistics, September 2003.