
Top 9 Data Science Algorithms

An algorithm is a step-by-step method for solving a problem. It is commonly used for data processing, calculation, and other related computer and mathematical operations. In this article, we have listed some of the most important algorithms in the history of Data Science. We compiled these algorithms based on recommendations by big data enthusiasts on various social media channels.

Contents

1. Apriori algorithm

2. Artificial Neural Networks

3. Decision Trees

4. The k-means clustering algorithm

5. Linear Regression

6. Logistic Regression

7. Principal components analysis

8. Recurrent Neural Networks

9. Support Vector Machines


Apriori algorithm
Seminar of Popular Algorithms in Data Mining and
Machine Learning, TKK

Presentation 12.3.2008
Lauri Lahti
Association rules
• Techniques for data mining and knowledge discovery in databases

Five important algorithms in the development of association rules (Yilmaz et al., 2003):
• AIS algorithm 1993
• SETM algorithm 1995
• Apriori, AprioriTid and AprioriHybrid 1994
Apriori algorithm
• Developed by Agrawal and Srikant 1994
• Innovative way to find association rules on
large scale, allowing implication outcomes
that consist of more than one item
• Based on minimum support threshold
(already used in AIS algorithm)
• Three versions:
– Apriori (basic version) faster in first
iterations
– AprioriTid faster in later iterations
– AprioriHybrid can change from Apriori to
AprioriTid after first iterations
Limitations of Apriori algorithm
• Needs several iterations of the data
• Uses a uniform minimum support threshold
• Difficulties in finding rarely occurring events
• Alternative methods (other than Apriori)
can address this by using a non-uniform
minimum support threshold
• Some competing alternative approaches
focus on partition and sampling
Phases of knowledge discovery
1) data selection
2) data cleansing
3) data enrichment (integration with
additional resources)
4) data transformation or encoding
5) data mining
6) reporting and display (visualization) of the
discovered knowledge
(Elmasri and Navathe, 2000)
Application of data mining
• Data mining can typically be used with
transactional databases (for ex. in
shopping cart analysis)
• Aim can be to build association rules
about the shopping events
• Based on item sets, such as
{milk, cocoa powder} 2-itemset
{milk, corn flakes, bread} 3-itemset
Association rules
• Items that occur often together can be
associated to each other
• These co-occurring items form a
frequent itemset
• Conclusions based on the frequent
itemsets form association rules
• For ex. {milk, cocoa powder} can yield a
rule cocoa powder ⇒ milk
Sets of database
• Transactional database D
• All products an itemset I = {i1, i2,…, im}
• Unique shopping event T ⊆ I
• T contains itemset X iff X ⊆ T
• Based on itemsets X and Y an
association rule can be X ⇒ Y
• It is required that X ⊂ I, Y ⊂ I and
X∩Y=∅
Properties of rules
• Types of item values: boolean, quantitative,
categorical
• Dimensions of rules
– 1D: buys(cocoa powder) ⇒ buys(milk)
– 3D: age(X,’under 12’) ∧ gender(X,’male’)
⇒ buys(X,’comic book’)
• Latter one is an example of a profile association
rule
• Intradimension rules, interdimension rules, hybrid-
dimension rules (Han and Kamber, 2001)
• Concept hierarchies and multilevel association
rules
Quality of rules
• Interestingness problem (Liu et al.,
1999):
– some generated rules can be self-evident
– some marginal events can dominate
– interesting events can be rarely occurring
• Need to estimate how interesting the
rules are
• Subjective and objective measures
Subjective measures
• Often based on earlier user
experiences and beliefs
• Unexpectedness: rules are interesting
if they are unknown or contradict the
existing knowledge (or expectations).
• Actionability: rules are interesting if
users can gain an advantage by acting on
them
• Weak and strong beliefs
Objective measures
• Based on threshold values controlled
by the user
• Some typical measures (Han and
Kamber, 2001):
– simplicity
– support (utility)
– confidence (certainty)
Simplicity
• Focus on generating simple association
rules
• Length of rule can be limited by user-
defined threshold
• With smaller itemsets the interpretation of
rules is more intuitive
• Unfortunately this can increase the amount
of rules too much
• Quantitative values can be quantized (for
ex. age groups)
Simplicity, example
• One association rule that holds an
association between cocoa powder
and milk is
buys(cocoa powder) ⇒ buys(bread, milk, salt)
• Simpler and more intuitive might be
buys(cocoa powder) ⇒ buys(milk)
Support (utility)
• Usefulness of a rule can be measured with a
minimum support threshold
• This parameter measures how many
events have itemsets that match both
sides of the implication in the association
rule
• Rules for events whose itemsets do not
match both sides sufficiently often (as defined
by a threshold value) can be excluded
Support (utility) (2)
• Database D consists of events T1, T2,… Tm,
that is D = {T1, T2,…, Tm}
• Let there be an itemset X that is a subset
of event Tk, that is X ⊆ Tk
• The support can be defined as
sup(X) = |{Tk ∈ D | X ⊆ Tk}| / |D|
• This relation compares the number of events
containing itemset X to the number of all events
in the database
Support (utility), example
• Let’s assume D = {(1,2,3), (2,3,4), (1,2,4),
(1,2,5), (1,3,5)}
• The support for itemset (1,2) is
sup((1,2)) = |{Tk ∈ D | (1,2) ⊆ Tk}| / |D| = 3/5
• That is: the ratio of the number of events
containing itemset (1,2) to the number of all
events in the database
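
To make the calculation concrete, here is a minimal Python sketch of the support computation; the toy database, function name and variable names are illustrative choices, not part of the original slides.

# Minimal sketch: support of an itemset in a transactional database.
D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]   # toy database

def support(itemset, database):
    # fraction of events (transactions) that contain the whole itemset
    hits = sum(1 for event in database if itemset <= event)
    return hits / len(database)

print(support({1, 2}, D))   # 0.6, i.e. 3/5 as in the example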
Confidence (certainty)
• Certainty of a rule can be measured with a
threshold for confidence
• This parameter measures how often
an event’s itemset that matches the left side
of the implication in the association rule
also matches the right side
• Rules for events whose itemsets do not
match the right side sufficiently often while
matching the left (as defined by a threshold
value) can be excluded
Confidence (certainty) (2)
• Database D consists of events T1, T2,… Tm, that is:
D = {T1, T2,…, Tm}
• Let there be a rule Xa ⇒ Xb so that itemsets Xa and Xb
are subsets of event Tk, that is: Xa ⊆ Tk ∧ Xb ⊆ Tk
• Also let Xa ∩ Xb = ∅
• The confidence can be defined as
conf(Xa ⇒ Xb) = sup(Xa ∪ Xb) / sup(Xa)
• This relation compares the number of events containing
both itemsets Xa and Xb to the number of events containing
the itemset Xa
Confidence (certainty), example
• Let’s assume D = {(1,2,3), (2,3,4), (1,2,4),
(1,2,5), (1,3,5)}
• The confidence for rule 1 ⇒ 2 is
conf(1 ⇒ 2) = sup({1} ∪ {2}) / sup({1}) = (3/5) / (4/5) = 3/4
• That is: the ratio of the number of events
containing both itemsets {1} and {2} to the
number of events containing the itemset {1}
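
The confidence of the example rule can be checked the same way; again a minimal Python sketch with illustrative names, reusing the same toy database.

# Sketch of conf(Xa => Xb) = sup(Xa ∪ Xb) / sup(Xa) on the toy database.
D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]

def support(itemset, database):
    return sum(1 for event in database if itemset <= event) / len(database)

def confidence(antecedent, consequent, database):
    return support(antecedent | consequent, database) / support(antecedent, database)

print(confidence({1}, {2}, D))   # 0.75 (up to rounding), i.e. (3/5) / (4/5) = 3/4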
Support and confidence
• If confidence gets a value of 100 % the rule
is an exact rule
• Even if confidence reaches high values the
rule is not useful unless the support value is
high as well
• Rules that have both high confidence and
support are called strong rules
• Some competing alternative approaches
(other than Apriori) can generate useful
rules even with low support values
Generating association rules
• Usually consists of two subproblems (Han and
Kamber, 2001):
1) Finding frequent itemsets whose occurrences
exceed a predefined minimum support
threshold
2) Deriving association rules from those frequent
itemsets (with the constraint of a minimum
confidence threshold)
• These two subproblems are solved iteratively
until no new rules emerge
• The second subproblem is quite straightforward
and most of the research focus is on the first
subproblem
Use of Apriori algorithm
• Initial information: transactional
database D and user-defined numeric
minimum support threshold min_sup
• The algorithm uses knowledge from the previous
iteration phase to produce frequent
itemsets
• This is reflected in the Latin origin of the
name that means ”from what comes
before”
Creating frequent sets
• Let’s define:
Ck as a candidate itemset of size k
Lk as a frequent itemset of size k
• Main steps of iteration are:
1) Find frequent set Lk-1
2) Join step: Ck is generated by joining Lk-1 with
itself (Cartesian product Lk-1 × Lk-1)
3) Prune step (Apriori property): any (k − 1)-size
itemset that is not frequent cannot be a subset
of a frequent k-size itemset, hence it should be
removed
4) Frequent set Lk has been achieved
Creating frequent sets (2)
• Algorithm uses breadth-first search
and a hash tree structure to make
candidate itemsets efficiently
• Then the occurrence frequency for each
candidate itemset is counted
• Those candidate itemsets that have
higher frequency than minimum
support threshold are qualified to be
frequent itemsets
Apriori algorithm in pseudocode
L1 = {frequent items};
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = candidates generated from Lk-1 (that is:
         Cartesian product Lk-1 × Lk-1, eliminating any
         (k-1)-size itemset that is not frequent);
    for each transaction t in database do
        increment the count of all candidates in
        Ck that are contained in t
    Lk = candidates in Ck with min_sup
end
return ∪k Lk;

(www.cs.sunysb.edu/~cse634/lecture_notes/07apriori.pdf)
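
As a rough illustration of the pseudocode above, the following compact and unoptimized Python sketch follows the same join/prune/count structure. It uses plain set operations instead of a hash tree, and the function and variable names are our own, not part of the original lecture material.

from itertools import combinations

def apriori(database, min_sup):
    # database: list of sets of items; min_sup: minimum support as a fraction
    n = len(database)
    def sup(itemset):
        return sum(1 for t in database if itemset <= t) / n
    # L1: frequent 1-itemsets
    items = {i for t in database for i in t}
    frequent = [{frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup}]
    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # join step: candidate k-itemsets built from frequent (k-1)-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # keep candidates whose support reaches the minimum support threshold
        frequent.append({c for c in candidates if sup(c) >= min_sup})
        k += 1
    return set().union(*frequent)

D = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {1, 2, 5}, {1, 3, 5}]
print(apriori(D, min_sup=0.6))   # frequent itemsets {1}, {2}, {3} and {1, 2}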
Apriori algorithm in pseudocode (2)

[Figure not reproduced: annotated pseudocode from Wikipedia highlighting the join step and the prune step.]
Lecture 6

Artificial Neural Networks

1 Artificial Neural Networks


In this note we provide an overview of the key concepts that have led to
the emergence of Artificial Neural Networks as a major paradigm for Data
Mining applications. Neural nets have gone through two major development
periods - the early 60’s and the mid 80’s. They were a key development in
the field of machine learning. Artificial Neural Networks were inspired by
biological findings relating to the behavior of the brain as a network of units
called neurons. The human brain is estimated to have around 10 billion
neurons each connected on average to 10,000 other neurons. Each neuron
receives signals through synapses that control the effects of the signal on
the neuron. These synaptic connections are believed to play a key role in
the behavior of the brain. The fundamental building block in an Artificial
Neural Network is the mathematical model of a neuron as shown in Figure
1. The three basic components of the (artificial) neuron are:
1. The synapses or connecting links that provide weights, wj, to the
input values, xj for j = 1, ..., m;
2. An adder that sums the weighted input values to compute the
input to the activation function v = w0 + Σ_{j=1..m} wj xj, where w0, called the
bias (not to be confused with statistical bias in prediction or estimation), is
a numerical value associated with the neuron. It is convenient to think of
the bias as the weight for an input x0 whose value is always equal to one,
so that v = Σ_{j=0..m} wj xj;
3. An activation function g (also called a squashing function) that
maps v to g(v) the output value of the neuron. This function is a monotone
function.

Figure 1: mathematical model of a neuron (figure not reproduced).

While there are numerous different (artificial) neural network architectures
that have been studied by researchers, the most successful applications
in data mining of neural networks have been multilayer feedforward
networks. These are networks in which there is an input layer consisting
of nodes that simply accept the input values and successive layers of nodes
that are neurons as depicted in Figure 1. The outputs of neurons in a layer
are inputs to neurons in the next layer. The last layer is called the output
layer. Layers between the input and output layers are known as hidden
layers. Figure 2 is a diagram for this architecture.
Figure 2: a multilayer feedforward network (figure not reproduced).

In a supervised setting where a neural net is used to predict a numerical


quantity there is one neuron in the output layer and its output is the prediction.
When the network is used for classification, the output layer typically
has as many nodes as the number of classes and the output layer node with

the largest output value gives the network’s estimate of the class for a given
input. In the special case of two classes it is common to have just one node
in the output layer, the classification between the two classes being made
by applying a cut-off to the output value at the node.

1.1 Single layer networks


Let us begin by examining neural networks with just one layer of neurons
(output layer only, no hidden layers). The simplest network consists of just
one neuron with the function g chosen to be the identity function, g(v) = v
for all v. In this case notice that the output of the network is Σ_{j=0..m} wj xj, a
linear function of the input vector x with components xj. If we are modeling
the dependent variable y using multiple linear regression, we can interpret
the neural network as a structure that predicts a value ŷ for a given input
vector x with the weights being the coefficients. If we choose these weights
to minimize the mean square error using observations in a training set, these
weights would simply be the least squares estimates of the coefficients. The
weights in neural nets are also often designed to minimize mean square error
in a training data set. There is, however, a different orientation in the case
of neural nets: the weights are ”learned”. The network is presented with
cases from the training data one at a time and the weights are revised after
each case in an attempt to minimize the mean square error. This process of
incremental adjustment of weights is based on the error made on training
cases and is known as ’training’ the neural net. The almost universally used
dynamic updating algorithm for the neural net version of linear regression is
known as the Widrow-Hoff rule or the least-mean-square (LMS) algorithm.
It is simply stated. Let x(i) denote the input vector x for the ith case used to
train the network, and the weights before this case is presented to the net by
the vector w(i). The updating rule is w(i+1) = w(i) + η(y(i) − ŷ(i))x(i) with
w(0) = 0. It can be shown that if the network is trained in this manner by
repeatedly presenting training data observations one-at-a-time then for suitably
small (absolute) values of η the network will learn (converge to) the optimal
values of w. Note that the training data may have to be presented several
times for w(i) to be close to the optimal w. The advantage of dynamic
updating is that the network tracks moderate time trends in the underlying
linear model quite effectively.
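
A small numeric sketch of the Widrow-Hoff (LMS) updating rule just described, assuming numpy is available; the synthetic data, learning rate and function name are made up for illustration only.

import numpy as np

def lms_train(X, y, eta=0.01, epochs=50):
    # X holds one case per row; a leading column of ones plays the role of x0,
    # so the first weight acts as the bias. Weights start at w(0) = 0.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                      # data may need several presentations
        for x_i, y_i in zip(X, y):
            y_hat = w @ x_i                      # identity activation: linear output
            w = w + eta * (y_i - y_hat) * x_i    # w(i+1) = w(i) + eta*(y(i) - yhat(i))*x(i)
    return w

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(0, 0.05, 200)         # synthetic data: y = 1 + 2x + noise
print(lms_train(X, y))   # should come close to the least squares coefficients [1, 2]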

If we consider using the single layer neural net for classification into c
classes, we would use c nodes in the output layer. If we think of classical

discriminant analysis in neural network terms, the coefficients in Fisher’s
classification functions give us weights for the network that are optimal
if the input vectors come from Multivariate Normal distributions with a
common covariance matrix.

For classification into two classes, the linear optimization approach that
we examined in class, can be viewed as choosing optimal weights in a single
layer neural network using the appropriate objective function.

Maximum likelihood coefficients for logistic regression can also be considered
as weights in a neural network to minimize a function of the residuals
called the deviance. In this case the logistic function g(v) = e^v / (1 + e^v) is the
activation function for the output node.

1.2 Multilayer Neural networks

Multilayer neural networks are undoubtedly the most popular networks used
in applications. While it is possible to consider many activation functions, in
practice it has been found that the logistic (also called the sigmoid) function
g(v) = e^v / (1 + e^v) as the activation function (or minor variants such as the
tanh function) works best. In fact the revival of interest in neural nets was
sparked by successes in training neural networks using this function in place
of the historically (biologically inspired) step function (the ”perceptron”).
Notice that using a linear function does not achieve anything in multilayer
networks that is beyond what can be done with single layer networks with
linear activation functions. The practical value of the logistic function arises
from the fact that it is almost linear in the range where g is between 0.1 and
0.9 but has a squashing effect on very small or very large values of v.
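
A quick numeric check of this squashing behaviour (a throwaway sketch, assuming numpy; g is simply the logistic function written in its equivalent 1/(1+e^-v) form).

import numpy as np

def g(v):
    # logistic (sigmoid) activation: e^v / (1 + e^v) == 1 / (1 + e^-v)
    return 1.0 / (1.0 + np.exp(-v))

for v in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(v, round(float(g(v)), 4))
# roughly: -10 -> 0.0, -1 -> 0.2689, 0 -> 0.5, 1 -> 0.7311, 10 -> 1.0
# nearly linear around v = 0, squashed toward 0 or 1 at the extremes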
In theory it is sufficient to consider networks with two layers of neurons–
one hidden and one output layer–and this is certainly the case for most
applications. There are, however, a number of situations where three and
sometimes four and five layers have been more effective. For prediction the
output node is often given a linear activation function to provide forecasts
that are not limited to the zero to one range. An alternative is to scale the
output to the linear part (0.1 to 0.9) of the logistic function.
Unfortunately there is no clear theory to guide us on choosing the number
of nodes in each hidden layer or indeed the number of layers. The common
practice is to use trial and error, although there are schemes for combining

optimization methods such as genetic algorithms with network training for
these parameters.

Since trial and error is a necessary part of neural net applications it is


important to have an understanding of the standard method used to train
a multilayered network: backpropagation. It is no exaggeration to say that
the speed of the backprop algorithm made neural nets a practical tool in
the manner that the simplex method made linear optimization a practical
tool. The revival of strong interest in neural nets in the mid 80s was in large
measure due to the efficiency of the backprop algorithm.

1.3 Example1: Fisher’s Iris data


Let us look at the Iris data that Fisher analyzed using Discriminant Analysis.
Recall that the data consisted of four measurements on three types of iris
flowers. There are 50 observations for each class of iris. A part of the data
is reproduced below.

OBS# SPECIES CLASSCODE SEPLEN SEPW PETLEN PETW
1 Iris-setosa 1 5.1 3.5 1.4 0.2
2 Iris-setosa 1 4.9 3 1.4 0.2
3 Iris-setosa 1 4.7 3.2 1.3 0.2
4 Iris-setosa 1 4.6 3.1 1.5 0.2
5 Iris-setosa 1 5 3.6 1.4 0.2
6 Iris-setosa 1 5.4 3.9 1.7 0.4
7 Iris-setosa 1 4.6 3.4 1.4 0.3
8 Iris-setosa 1 5 3.4 1.5 0.2
9 Iris-setosa 1 4.4 2.9 1.4 0.2
10 Iris-setosa 1 4.9 3.1 1.5 0.1
... ... ... ... ... ... ...
51 Iris-versicolor 2 7 3.2 4.7 1.4
52 Iris-versicolor 2 6.4 3.2 4.5 1.5
53 Iris-versicolor 2 6.9 3.1 4.9 1.5
54 Iris-versicolor 2 5.5 2.3 4 1.3
55 Iris-versicolor 2 6.5 2.8 4.6 1.5
56 Iris-versicolor 2 5.7 2.8 4.5 1.3
57 Iris-versicolor 2 6.3 3.3 4.7 1.6
58 Iris-versicolor 2 4.9 2.4 3.3 1
59 Iris-versicolor 2 6.6 2.9 4.6 1.3
60 Iris-versicolor 2 5.2 2.7 3.9 1.4
... ... ... ... ... ... ...
101 Iris-virginica 3 6.3 3.3 6 2.5
102 Iris-virginica 3 5.8 2.7 5.1 1.9
103 Iris-virginica 3 7.1 3 5.9 2.1
104 Iris-virginica 3 6.3 2.9 5.6 1.8
105 Iris-virginica 3 6.5 3 5.8 2.2
106 Iris-virginica 3 7.6 3 6.6 2.1
107 Iris-virginica 3 4.9 2.5 4.5 1.7
108 Iris-virginica 3 7.3 2.9 6.3 1.8
109 Iris-virginica 3 6.7 2.5 5.8 1.8
110 Iris-virginica 3 7.2 3.6 6.1 2.5

If we use a neural net architecture for this classification problem we will


need 4 nodes (not counting the bias node) one for each of the 4 independent
variables in the input layer and 3 neurons (one for each class) in the output
layer. Let us select one hidden layer with 25 neurons. Notice that there will
be a total of 25 connections from each node in the input layer to nodes in
the hidden layer. This makes a total of 4 x 25 = 100 connections between

the input layer and the hidden layer. In addition there will be a total of 3
connections from each node in the hidden layer to nodes in the output layer.
This makes a total of 25 x 3 = 75 connections between the hidden layer and
the output layer. Using the standard logistic activation functions, the net-
work was trained with a run consisting of 60,000 iterations. Each iteration
consists of presentation to the input layer of the independent variables in a
case, followed by successive computations of the outputs of the neurons of
the hidden layer and the output layer using the appropriate weights. The
output values of neurons in the output layer are used to compute the error.
This error is used to adjust the weights of all the connections in the
network using the backward propagation (“backprop”) to complete the iteration.
Since the training data has 150 cases, each case was presented to
the network 400 times. Another way of stating this is to say the network
was trained for 400 epochs where an epoch consists of one sweep through
the entire training data. The results for the last epoch of training the neural
net on this data are shown below:

Iris Output 1

Classification Confusion Matrix

Desired                Computed Class
Class        1      2      3    Total
1           50      0      0      50
2            0     49      1      50
3            0      1     49      50
Total       50     50     50     150

Error Report

Class Patterns # Errors % Errors StdDev


1 50 0 0.00 ( 0.00)

2 50 1 2.00 ( 1.98)

3 50 1 2.00 ( 1.98)

Overall 150 2 1.3 ( 0.92)

The classification error of 1.3% is better than the error using discriminant
analysis which was 2% (See lecture note on Discriminant Analysis). Notice
that had we stopped after only one pass of the data (150 iterations) the

error is much worse (75%) as shown below:

Iris Output 2

Classification Confusion Matrix

Desired                Computed Class
Class        1      2      3    Total
1           10      7      2      19
2           13      1      6      20
3           12      5      4      21
Total       35     13     12      60

The classification error rate of 1.3% was obtained by careful choice of


key control parameters for the training run by trial and error. If we set
the control parameters to poor values we can have terrible results. To understand
the parameters involved we need to understand how the backward
propagation algorithm works.
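
As a rough modern counterpart to the run described above, a feedforward network with a single 25-neuron hidden layer and logistic activations can be fitted to the Iris data with scikit-learn. This is only a sketch: the optimizer, control parameters and random initialization differ from the training run in the text, so the error rates will not match the 1.3% exactly.

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                 # 150 cases, 4 inputs, 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(25,), activation="logistic",
                    max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("training error:", 1 - net.score(X_tr, y_tr))
print("holdout error: ", 1 - net.score(X_te, y_te))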

1.4 The Backward Propagation Algorithm


We will discuss the backprop algorithm for classification problems. There is
a minor adjustment for prediction problems where we are trying to predict
a continuous numerical value. In that situation we change the activation
function for output layer neurons to the identity function that has output
value=input value. (An alternative is to rescale and recenter the logistic
function to permit the outputs to be approximately linear in the range of
dependent variable values).
The backprop algorithm cycles through two distinct passes, a forward
pass followed by a backward pass through the layers of the network. The
algorithm alternates between these passes several times as it scans the training
data. Typically, the training data has to be scanned several times before
the network ”learns” to make good classifications.

Forward Pass: Computation of outputs of all the neurons in the network.
The algorithm starts with the first hidden layer using as
input values the independent variables of a case (often called an exemplar
in the machine learning community) from the training data set. The neuron
outputs are computed for all neurons in the first hidden layer by performing

the relevant sum and activation function evaluations. These outputs are
the inputs for neurons in the second hidden layer. Again the relevant sum
and activation function calculations are performed to compute the outputs
of second layer neurons. This continues layer by layer until we reach the
output layer and compute the outputs for this layer. These output values
constitute the neural net’s guess at the value of the dependent variable. If
we are using the neural net for classification, and we have c classes, we will
have c neuron outputs from the activation functions and we use the largest
value to determine the net’s classification. (If c = 2, we can use just one
output node with a cut-off value to map a numerical output value to one
of the two classes).
Let us denote by wij the weight of the connection from node i to node j.
The values of wij are initialized to small (generally random) numbers in the
range 0.00 ± 0.05. These weights are adjusted to new values in the backward
pass as described below.

Backward pass: Propagation of error and adjustment of weights


This phase begins with the computation of error at each neuron in the output
layer. A popular error function is the squared difference between ok the
output of node k and yk the target value for that node. The target value
is just 1 for the output node corresponding to the class of the exemplar
and zero for other output nodes. (In practice it has been found better to use
values of 0.9 and 0.1 respectively.) For each output layer node compute its
error term as δk = ok (1 − ok )(yk − ok ). These errors are used to adjust the
weights of the connections between the last-but-one layer of the network and
the output layer. The adjustment is similar to the simple Widrow-Hoff rule
that we saw earlier in this note. The new value of the weight wjk of the
connection from node j to node k is given by: wjk(new) = wjk(old) + η·oj·δk. Here η is
an important tuning parameter that is chosen by trial and error by repeated
runs on the training data. Typical values for η are in the range 0.1 to 0.9.
Low values give slow but steady learning, high values give erratic learning
and may lead to an unstable network.
The process is repeated for the connections between nodes in the last
hidden layer and the last-but-one hidden layer. The weight for the connection
between nodes i and j is given by: wij(new) = wij(old) + η·oi·δj, where
δj = oj(1 − oj)·Σk wjk δk, for each node j in the last hidden layer.
The backward propagation of weight adjustments along these lines continues
until we reach the input layer. At this time we have a new set of
weights on which we can make a new forward pass when presented with a
training data observation.
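
The error terms and weight updates described above can be written out compactly. The sketch below performs one backward pass for a network with a single hidden layer; it assumes numpy, uses our own variable names, and takes the forward-pass outputs (o_hidden, o_out) as given.

import numpy as np

def backward_pass(x, o_hidden, o_out, target, W_in_hidden, W_hidden_out, eta=0.5):
    # output layer error terms: delta_k = o_k * (1 - o_k) * (y_k - o_k)
    delta_out = o_out * (1 - o_out) * (target - o_out)
    # hidden layer error terms: delta_j = o_j * (1 - o_j) * sum_k w_jk * delta_k
    delta_hidden = o_hidden * (1 - o_hidden) * (W_hidden_out @ delta_out)
    # weight updates: w_jk(new) = w_jk(old) + eta * o_j * delta_k
    W_hidden_out = W_hidden_out + eta * np.outer(o_hidden, delta_out)
    # and likewise for input-to-hidden weights, with the inputs playing the role of o_i
    W_in_hidden = W_in_hidden + eta * np.outer(x, delta_hidden)
    return W_in_hidden, W_hidden_out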

1.4.1 Multiple Local Optima and Epochs
The backprop algorithm is a version of the steepest descent optimization
method applied to the problem of finding the weights that minimize the
error function of the network output. Due to the complexity of the function
and the large numbers of weights that are being “trained” as the network
“learns”, there is no assurance that the backprop algorithm (and indeed any
practical algorithm) will find the optimum weights that minimize error. The
procedure can get stuck at a local minimum. It has been found useful to
randomize the order of presentation of the cases in a training set between
different scans. It is possible to speed up the algorithm by batching, that is
updating the weights for several exemplars in a pass. However, at least the
extreme case of using the entire training data set on each update has been
found to get stuck frequently at poor local minima.
A single scan of all cases in the training data is called an epoch. Most
applications of feedforward networks and backprop require several epochs
before errors are reasonably small. A number of modifications have been
proposed to reduce the epochs needed to train a neural net. One commonly
employed idea is to incorporate a momentum term that injects some inertia
in the weight adjustment on the backward pass. This is done by adding a
term to the expression for weight adjustment for a connection that is a fraction
of the previous weight adjustment for that connection. This fraction is
called the momentum control parameter. High values of the momentum parameter
will force successive weight adjustments to be in similar directions.
Another idea is to vary the adjustment parameter η so that it decreases as
the number of epochs increases. Intuitively this is useful because it avoids
overfitting that is more likely to occur at later epochs than earlier ones.

1.4.2 Overfitting and the choice of training epochs


A weakness of the neural network is that it can be easily overfitted, causing
the error rate on validation data to be much larger than the error rate on the
training data. It is therefore important not to overtrain the data. A good
method for choosing the number of training epochs is to use the validation
data set periodically to compute the error rate for it while the network is
being trained. The validation error decreases in the early epochs of backprop
but after a while it begins to increase. The point of minimum validation
error is a good indicator of the best number of epochs for training and the
weights at that stage are likely to provide the best error rate in new data.


1.5 Adaptive Selection of Architecture


One of the time consuming and complex aspects of using backprop is that
we need to decide on an architecture before we can use backprop. The
usual procedure is to make intelligent guesses using past experience and to
do several trial and error runs on different architectures. Algorithms exist
that grow the number of nodes selectively during training or trim them in a
manner analogous to what we have seen with CART. Research continues on
such methods. However, as of now there seems to be no automatic method
that is clearly superior to the trial and error approach.

1.6 Successful Applications


There have been a number of very successful applications of neural nets in
engineering applications. One of the well known ones is ALVINN that is an
autonomous vehicle driving application for normal speeds on highways. The
neural net uses a 30x32 grid of pixel intensities from a fixed camera on the
vehicle as input, the output is the direction of steering. It uses 30 output
units representing classes such as “sharp left”, “straight ahead”, and “bear
right”. It has 960 input units and a single layer of 4 hidden neurons. The
backprop algorithm is used to train ALVINN.
A number of successful applications have been reported in financial applications
(see reference 2) such as bankruptcy predictions, currency market
trading, picking stocks and commodity trading. Credit card and CRM applications
have also been reported.

2 References
1. Bishop, Christopher: Neural Networks for Pattern Recognition, Ox-
ford, 1995.

2. Trippi, Robert and Turban, Efraim (editors): Neural Networks in Finance
and Investing, McGraw Hill, 1996.


Chapter 9

DECISION TREES

Lior Rokach
Department of Industrial Engineering
Tel-Aviv University
liorr@eng.tau.ac.il

Oded Maimon
Department of Industrial Engineering
Tel-Aviv University
maimon@eng.tau.ac.il

Abstract Decision Trees are considered to be one of the most popular approaches for
representing classifiers. Researchers from various disciplines such as statistics,
machine learning, pattern recognition, and Data Mining have dealt with the issue of
growing a decision tree from available data. This paper presents an updated survey
of current methods for constructing decision tree classifiers in a top-down
manner. The chapter suggests a unified algorithmic framework for presenting
these algorithms and describes various splitting criteria and pruning methodologies.

Keywords: Decision tree, Information Gain, Gini Index, Gain Ratio, Pruning, Minimum
Description Length, C4.5, CART, Oblivious Decision Trees

1. Decision Trees
A decision tree is a classifier expressed as a recursive partition of the in-
stance space. The decision tree consists of nodes that form a rooted tree,
meaning it is a directed tree with a node called “root” that has no incoming
edges. All other nodes have exactly one incoming edge. A node with outgoing
edges is called an internal or test node. All other nodes are called leaves (also
known as terminal or decision nodes). In a decision tree, each internal node
splits the instance space into two or more sub-spaces according to a certain
discrete function of the input attributes values. In the simplest and most frequent
case, each test considers a single attribute, such that the instance space is
partitioned according to the attribute’s value. In the case of numeric attributes,
the condition refers to a range.
Each leaf is assigned to one class representing the most appropriate target
value. Alternatively, the leaf may hold a probability vector indicating the prob-
ability of the target attribute having a certain value. Instances are classified by
navigating them from the root of the tree down to a leaf, according to the
outcome of the tests along the path. Figure 9.1 describes a decision tree that
reasons whether or not a potential customer will respond to a direct mailing.
Internal nodes are represented as circles, whereas leaves are denoted as tri-
angles. Note that this decision tree incorporates both nominal and numeric at-
tributes. Given this classifier, the analyst can predict the response of a potential
customer (by sorting it down the tree), and understand the behavioral charac-
teristics of the entire potential customers population regarding direct mailing.
Each node is labeled with the attribute it tests, and its branches are labeled with
its corresponding values.

[Figure not reproduced: a decision tree with root node Age (branches <=30 and >30), internal nodes Gender and Last R., and Yes/No leaves.]

Figure 9.1. Decision Tree Presenting Response to Direct Mailing.



In case of numeric attributes, decision trees can be geometrically interpreted


as a collection of hyperplanes, each orthogonal to one of the axes. Naturally,
decision-makers prefer less complex decision trees, since they may be consid-
ered more comprehensible. Furthermore, according to Breiman et al. (1984)
the tree complexity has a crucial effect on its accuracy. The tree complexity
is explicitly controlled by the stopping criteria used and the pruning method
employed. Usually the tree complexity is measured by one of the following
metrics: the total number of nodes, total number of leaves, tree depth and
number of attributes used. Decision tree induction is closely related to rule
induction. Each path from the root of a decision tree to one of its leaves can be
transformed into a rule simply by conjoining the tests along the path to form
the antecedent part, and taking the leaf’s class prediction as the class value.
For example, one of the paths in Figure 9.1 can be transformed into the rule:
“If the customer’s age is less than or equal to 30, and the gender of
the customer is “Male” – then the customer will respond to the mail”. The
resulting rule set can then be simplified to improve its comprehensibility to a
human user, and possibly its accuracy (Quinlan, 1987).

2. Algorithmic Framework for Decision Trees


Decision tree inducers are algorithms that automatically construct a decision
tree from a given dataset. Typically the goal is to find the optimal decision tree
by minimizing the generalization error. However, other target functions can be
also defined, for instance, minimizing the number of nodes or minimizing the
average depth.
Induction of an optimal decision tree from a given data is considered to be
a hard task. It has been shown that finding a minimal decision tree consistent
with the training set is NP–hard (Hancock et al., 1996). Moreover, it has been
shown that constructing a minimal binary tree with respect to the expected
number of tests required for classifying an unseen instance is NP–complete
(Hyafil and Rivest, 1976). Even finding the minimal equivalent decision tree
for a given decision tree (Zantema and Bodlaender, 2000) or building the op-
timal decision tree from decision tables is known to be NP–hard (Naumov,
1991).
The above results indicate that using optimal decision tree algorithms is
feasible only in small problems. Consequently, heuristics methods are required
for solving the problem. Roughly speaking, these methods can be divided into
two groups: top–down and bottom–up with clear preference in the literature to
the first group.

There are various top–down decision trees inducers such as ID3 (Quinlan,
1986), C4.5 (Quinlan, 1993), CART (Breiman et al., 1984). Some consist of
two conceptual phases: growing and pruning (C4.5 and CART). Other inducers
perform only the growing phase.
Figure 9.2 presents a typical algorithmic framework for top–down induc-
ing of a decision tree using growing and pruning. Note that these algorithms
are greedy by nature and construct the decision tree in a top–down, recursive
manner (also known as “divide and conquer“). In each iteration, the algorithm
considers the partition of the training set using the outcome of a discrete func-
tion of the input attributes. The selection of the most appropriate function is
made according to some splitting measures. After the selection of an appro-
priate split, each node further subdivides the training set into smaller subsets,
until no split gains sufficient splitting measure or a stopping criteria is satisfied.

3. Univariate Splitting Criteria


3.1 Overview
In most of the cases, the discrete splitting functions are univariate. Uni-
variate means that an internal node is split according to the value of a single
attribute. Consequently, the inducer searches for the best attribute upon which
to split. There are various univariate criteria. These criteria can be character-
ized in different ways, such as:

According to the origin of the measure: information theory, dependence,


and distance.
According to the measure structure: impurity based criteria, normalized
impurity based criteria and Binary criteria.

The following section describes the most common criteria in the literature.

3.2 Impurity-based Criteria


Given a random variable x with k discrete values, distributed according to
P = (p1 , p2 , . . . , pk ), an impurity measure is a function φ:[0, 1]k → R that
satisfies the following conditions:
φ(P) ≥ 0
φ(P) is minimum if ∃i such that component pi = 1.
φ(P) is maximum if ∀i, 1 ≤ i ≤ k, pi = 1/k.
φ(P) is symmetric with respect to the components of P.
φ(P) is smooth (differentiable everywhere) in its range.

TreeGrowing (S,A,y)
Where:
  S - Training Set
  A - Input Feature Set
  y - Target Feature
Create a new tree T with a single root node.
IF one of the Stopping Criteria is fulfilled THEN
  Mark the root node in T as a leaf with the most
  common value of y in S as a label.
ELSE
  Find a discrete function f(A) of the input
  attributes values such that splitting S
  according to f(A)'s outcomes (v1,...,vn) gains
  the best splitting metric.
  IF best splitting metric > threshold THEN
    Label the root node of T with f(A)
    FOR each outcome vi of f(A):
      Set Subtree_i = TreeGrowing (σf(A)=vi S, A, y).
      Connect the root node of T to Subtree_i with
      an edge that is labelled as vi
    END FOR
  ELSE
    Mark the root node in T as a leaf with the most
    common value of y in S as a label.
  END IF
END IF
RETURN T

TreePruning (S,T,y)
Where:
  S - Training Set
  y - Target Feature
  T - The tree to be pruned
DO
  Select a node t in T such that pruning it
  maximally improves some evaluation criterion
  IF t ≠ Ø THEN T = pruned(T,t)
UNTIL t = Ø
RETURN T

Figure 9.2. Top-Down Algorithmic Framework for Decision Trees Induction.
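
For readers who prefer code, the growing phase of Figure 9.2 can be sketched in Python. This rendering is illustrative only: it is restricted to nominal attributes, hard-wires an information-gain splitting metric, and uses a simple dict-based tree representation of our own choosing.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def tree_growing(S, attributes, y, threshold=1e-9):
    # S: list of dicts (one per instance), attributes: candidate attribute names,
    # y: name of the target attribute
    labels = [row[y] for row in S]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:      # stopping criteria
        return majority                               # leaf labelled with the majority class
    def gain(a):                                      # splitting metric for attribute a
        parts = Counter(row[a] for row in S)
        remainder = sum((cnt / len(S)) * entropy([r[y] for r in S if r[a] == v])
                        for v, cnt in parts.items())
        return entropy(labels) - remainder
    best = max(attributes, key=gain)
    if gain(best) <= threshold:                       # best splitting metric below threshold
        return majority
    node = {"attribute": best, "children": {}}
    for v in {row[best] for row in S}:                # one subtree per outcome v
        subset = [row for row in S if row[best] == v]
        node["children"][v] = tree_growing(subset,
                                           [a for a in attributes if a != best], y)
    return node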



Note that if the probability vector has a component of 1 (the variable x gets
only one value), then the variable is defined as pure. On the other hand, if all
components are equal, the level of impurity reaches maximum.
Given a training set S, the probability vector of the target attribute y is
defined as:

Py(S) = ( |σy=c1 S| / |S| , ... , |σy=c|dom(y)| S| / |S| )

The goodness–of–split due to discrete attribute ai is defined as the reduction in
impurity of the target attribute after partitioning S according to the values
vi,j ∈ dom(ai):

∆Φ(ai, S) = φ(Py(S)) − Σ_{j=1..|dom(ai)|} (|σai=vi,j S| / |S|) · φ(Py(σai=vi,j S))

3.3 Information Gain


Information gain is an impurity-based criterion that uses the entropy measure
(origin: information theory) as the impurity measure (Quinlan, 1987):

InformationGain(ai, S) = Entropy(y, S) − Σ_{vi,j ∈ dom(ai)} (|σai=vi,j S| / |S|) · Entropy(y, σai=vi,j S)

where:

Entropy(y, S) = − Σ_{cj ∈ dom(y)} (|σy=cj S| / |S|) · log2(|σy=cj S| / |S|)
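
A direct transcription of these two formulas into Python; the names are ours, and values/classes are parallel lists holding one candidate attribute and the target class for each record.

import math
from collections import Counter

def entropy(classes):
    n = len(classes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(classes).values())

def information_gain(values, classes):
    n = len(classes)
    remainder = sum(
        (count / n) * entropy([c for v, c in zip(values, classes) if v == val])
        for val, count in Counter(values).items())
    return entropy(classes) - remainder

# tiny illustration: a perfectly informative binary attribute
print(information_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))   # 1.0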

3.4 Gini Index


Gini index is an impurity-based criterion that measures the divergences between
the probability distributions of the target attribute’s values. The Gini index
has been used in various works such as (Breiman et al., 1984) and (Gelfand
et al., 1991) and it is defined as:

Gini(y, S) = 1 − Σ_{cj ∈ dom(y)} (|σy=cj S| / |S|)²

Consequently the evaluation criterion for selecting the attribute ai is defined as:

GiniGain(ai, S) = Gini(y, S) − Σ_{vi,j ∈ dom(ai)} (|σai=vi,j S| / |S|) · Gini(y, σai=vi,j S)
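
The Gini counterpart in the same style (again an illustrative sketch with our own names):

from collections import Counter

def gini(classes):
    n = len(classes)
    return 1 - sum((c / n) ** 2 for c in Counter(classes).values())

def gini_gain(values, classes):
    n = len(classes)
    weighted = sum(
        (count / n) * gini([c for v, c in zip(values, classes) if v == val])
        for val, count in Counter(values).items())
    return gini(classes) - weighted

print(gini_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))   # 0.5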

3.5 Likelihood-Ratio Chi–Squared Statistics


The likelihood–ratio is defined as (Attneave, 1959)

G²(ai, S) = 2 · ln(2) · |S| · InformationGain(ai, S)

This ratio is useful for measuring the statistical significance of the information
gain criterion. The null hypothesis (H0) is that the input attribute and the
target attribute are conditionally independent. If H0 holds, the test statistic is
distributed as χ² with degrees of freedom equal to (|dom(ai)| − 1) · (|dom(y)| − 1).

3.6 DKM Criterion


The DKM criterion is an impurity-based splitting criterion designed for bi-
nary class attributes (Dietterich et al., 1996) and (Kearns and Mansour, 1999).
The impurity-based function is defined as:

DKM(y, S) = 2 · √( (|σy=c1 S| / |S|) · (|σy=c2 S| / |S|) )

It has been theoretically proved (Kearns and Mansour, 1999) that this cri-
terion requires smaller trees for obtaining a certain error than other impurity
based criteria (information gain and Gini index).

3.7 Normalized Impurity Based Criteria


The impurity-based criterion described above is biased towards attributes
with larger domain values. Namely, it prefers input attributes with many values
over attributes with fewer values (Quinlan, 1986). For instance, an input attribute
that represents the national security number will probably get the highest in-
formation gain. However, adding this attribute to a decision tree will result in
a poor generalized accuracy. For that reason, it is useful to “normalize” the
impurity based measures, as described in the following sections.

3.8 Gain Ratio


The gain ratio “normalizes” the information gain as follows (Quinlan, 1993):

GainRatio(ai, S) = InformationGain(ai, S) / Entropy(ai, S)
Note that this ratio is not defined when the denominator is zero. Also the ratio
may tend to favor attributes for which the denominator is very small. Consequently,
it is suggested to proceed in two stages. First, the information gain is calculated
for all attributes. Then, taking into consideration only attributes
that have performed at least as well as the average information gain, the
attribute that has obtained the best gain ratio is selected. It has been shown that
the gain ratio tends to outperform simple information gain criteria, both from
the accuracy aspect, as well as from classifier complexity aspects (Quinlan,
1988).
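
The two-stage selection can be sketched as follows; the function and the toy numbers are purely illustrative, and the information-gain and split-entropy values are assumed to have been computed beforehand.

def gain_ratio_selection(candidates, info_gain, split_entropy):
    # candidates: attribute names; info_gain[a] = InformationGain(a, S),
    # split_entropy[a] = Entropy(a, S), both computed in advance.
    avg = sum(info_gain[a] for a in candidates) / len(candidates)
    eligible = [a for a in candidates
                if info_gain[a] >= avg and split_entropy[a] > 0]   # ratio undefined if denominator is 0
    return max(eligible, key=lambda a: info_gain[a] / split_entropy[a])

ig = {"age": 0.60, "gender": 0.05, "id": 0.90}     # made-up gains; "id" is a many-valued attribute
se = {"age": 0.90, "gender": 1.00, "id": 7.20}
print(gain_ratio_selection(list(ig), ig, se))      # "age" wins once the gain is normalized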

3.9 Distance Measure


The distance measure, like the gain ratio, normalizes the impurity measure.
However, it suggests normalizing it in a different way (Lopez de Mantras,
1991):

∆Φ(ai, S) / ( − Σ_{vi,j ∈ dom(ai)} Σ_{ck ∈ dom(y)} (|σai=vi,j AND y=ck S| / |S|) · log2(|σai=vi,j AND y=ck S| / |S|) )

3.10 Binary Criteria


The binary criteria are used for creating binary decision trees. These mea-
sures are based on division of the input attribute domain into two sub-domains.
Let β(ai , dom1 (ai ), dom2 (ai ), S) denote the binary criterion value for at-
tribute ai over sample S when dom1 (ai ) and dom2 (ai ) are its correspond-
ing subdomains. The value obtained for the optimal division of the attribute
domain into two mutually exclusive and exhaustive sub-domains is used for
comparing attributes.

3.11 Twoing Criterion


The gini index may encounter problems when the domain of the target at-
tribute is relatively wide (Breiman et al., 1984). In this case it is possible to
employ binary criterion called twoing criterion. This criterion is defined as:

twoing(ai, dom1(ai), dom2(ai), S) =

0.25 · (|σai∈dom1(ai) S| / |S|) · (|σai∈dom2(ai) S| / |S|) ·
( Σ_{ci ∈ dom(y)} | |σai∈dom1(ai) AND y=ci S| / |σai∈dom1(ai) S| − |σai∈dom2(ai) AND y=ci S| / |σai∈dom2(ai) S| | )²

When the target attribute is binary, the gini and twoing criteria are equivalent.
For multi–class problems, the twoing criteria prefer attributes with evenly di-
vided splits.

3.12 Orthogonal (ORT) Criterion


The ORT criterion was presented by Fayyad and Irani (1992). This binary
criterion is defined as:

ORT (ai , dom1 (ai ), dom2 (ai ), S) = 1 − cosθ(Py,1 , Py,2 )

where θ(Py,1 , Py,2 ) is the angle between two vectors Py,1 and Py,2 . These vec-
tors represent the probability distribution of the target attribute in the partitions
σai ∈dom1 (ai ) S and σai ∈dom2 (ai ) S respectively.
It has been shown that this criterion performs better than the information
gain and the Gini index for specific problem constellations.

3.13 Kolmogorov–Smirnov Criterion


A binary criterion that uses Kolmogorov–Smirnov distance has been pro-
posed in Friedman (1977) and Rounds (1980). Assuming a binary target at-
tribute, namely dom(y) = {c1 , c2 }, the criterion is defined as:

KS(ai, dom1(ai), dom2(ai), S) =

| |σai∈dom1(ai) AND y=c1 S| / |σy=c1 S| − |σai∈dom1(ai) AND y=c2 S| / |σy=c2 S| |

This measure was extended in (Utgoff and Clouse, 1996) to handle target
attributes with multiple classes and missing data values. Their results indicate
that the suggested method outperforms the gain ratio criteria.

3.14 AUC–Splitting Criteria


The idea of using the AUC metric as a splitting criterion was recently pro-
posed in (Ferri et al., 2002). The attribute that obtains the maximal area under
the convex hull of the ROC curve is selected. It has been shown that the AUC–
based splitting criterion outperforms other splitting criteria both with respect
to classification accuracy and area under the ROC curve. It is important to
note that unlike impurity criteria, this criterion does not perform a compari-
son between the impurity of the parent node with the weighted impurity of the
children after splitting.

3.15 Other Univariate Splitting Criteria


Additional univariate splitting criteria can be found in the literature, such
as permutation statistics (Li and Dubes, 1986), mean posterior improvements
(Taylor and Silverman, 1993) and hypergeometric distribution measures (Mar-
tin, 1997).

3.16 Comparison of Univariate Splitting Criteria


Comparative studies of the splitting criteria described above, and others,
have been conducted by several researchers during the last thirty years, such
as (Baker and Jain, 1976; BenBassat, 1978; Mingers, 1989; Fayyad and Irani,
1992; Buntine and Niblett, 1992; Loh and Shih, 1997; Loh and Shih, 1999;
Lim et al., 2000). Most of these comparisons are based on empirical results,
although there are some theoretical conclusions.
Many of the researchers point out that in most of the cases, the choice of
splitting criteria will not make much difference on the tree performance. Each
criterion is superior in some cases and inferior in others, as the “No–Free–
Lunch” theorem suggests.

4. Multivariate Splitting Criteria


In multivariate splitting criteria, several attributes may participate in a single
node split test. Obviously, finding the best multivariate criteria is more com-
plicated than finding the best univariate split. Furthermore, although this type
of criteria may dramatically improve the tree’s performance, these criteria are
much less popular than the univariate criteria.
Most of the multivariate splitting criteria are based on the linear combination
of the input attributes. Finding the best linear combination can be performed
using a greedy search (Breiman et al., 1984; Murthy, 1998), linear program-
ming (Duda and Hart, 1973; Bennett and Mangasarian, 1994), linear discrim-
inant analysis (Duda and Hart, 1973; Friedman, 1977; Sklansky and Wassel,
1981; Lin and Fu, 1983; Loh and Vanichsetakul, 1988; John, 1996) and others
(Utgoff, 1989a; Lubinsky, 1993; Sethi and Yoo, 1994).

5. Stopping Criteria
The growing phase continues until a stopping criterion is triggered. The
following conditions are common stopping rules:

1. All instances in the training set belong to a single value of y.

2. The maximum tree depth has been reached.

3. The number of cases in the terminal node is less than the minimum num-
ber of cases for parent nodes.

4. If the node were split, the number of cases in one or more child nodes
would be less than the minimum number of cases for child nodes.

5. The best splitting criterion is not greater than a certain threshold.



6. Pruning Methods
6.1 Overview
Employing tight stopping criteria tends to create small and under–fitted
decision trees. On the other hand, using loose stopping criteria tends to generate
large decision trees that are over–fitted to the training set. Pruning methods
originally suggested in (Breiman et al., 1984) were developed for solving
this dilemma. According to this methodology, a loose stopping criterion is
used, letting the decision tree overfit the training set. Then the over-fitted
tree is cut back into a smaller tree by removing sub–branches that are not con-
tributing to the generalization accuracy. It has been shown in various studies
that employing pruning methods can improve the generalization performance
of a decision tree, especially in noisy domains.
Another key motivation of pruning is “trading accuracy for simplicity” as
presented in (Bratko and Bohanec, 1994). When the goal is to produce a suffi-
ciently accurate compact concept description, pruning is highly useful. Within
this process, the initial decision tree is seen as a completely accurate one. Thus
the accuracy of a pruned decision tree indicates how close it is to the initial
tree.
There are various techniques for pruning decision trees. Most of them per-
form top-down or bottom-up traversal of the nodes. A node is pruned if this
operation improves a certain criteria. The following subsections describe the
most popular techniques.

6.2 Cost–Complexity Pruning


Cost-complexity pruning (also known as weakest link pruning or error-
complexity pruning) proceeds in two stages (Breiman et al., 1984). In the first
stage, a sequence of trees T0 , T1 , . . . , Tk is built on the training data where T0
is the original tree before pruning and Tk is the root tree.
In the second stage, one of these trees is chosen as the pruned tree, based on
its generalization error estimation.
The tree Ti+1 is obtained by replacing one or more of the sub–trees in the
predecessor tree Ti with suitable leaves. The sub–trees that are pruned are
those that obtain the lowest increase in apparent error rate per pruned leaf:
α = ( ε(pruned(T, t), S) − ε(T, S) ) / ( |leaves(T)| − |leaves(pruned(T, t))| )
where ε(T, S) indicates the error rate of the tree T over the sample S and
|leaves(T)| denotes the number of leaves in T . pruned(T,t) denotes the tree
obtained by replacing the node t in T with a suitable leaf.
In the second phase the generalization error of each pruned tree T0 , T1 , . . . ,
Tk is estimated. The best pruned tree is then selected. If the given dataset is
large enough, the authors suggest breaking it into a training set and a prun-
ing set. The trees are constructed using the training set and evaluated on the
pruning set. On the other hand, if the given dataset is not large enough, they
propose to use cross–validation methodology, despite the computational com-
plexity implications.
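
The quantity α that drives the first stage can be computed directly from error rates and leaf counts; a minimal sketch with made-up numbers.

def alpha(err_pruned, err_tree, leaves_tree, leaves_pruned):
    # alpha = (eps(pruned(T,t), S) - eps(T, S)) / (|leaves(T)| - |leaves(pruned(T,t))|)
    return (err_pruned - err_tree) / (leaves_tree - leaves_pruned)

# replacing a subtree with a leaf raises the error from 0.10 to 0.12
# while removing 4 leaves (9 -> 5): about 0.005 extra error per pruned leaf
print(alpha(0.12, 0.10, 9, 5))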

6.3 Reduced Error Pruning


A simple procedure for pruning decision trees, known as reduced error prun-
ing, has been suggested by Quinlan (1987). While traversing over the internal
nodes from the bottom to the top, the procedure checks, for each internal node,
whether replacing it with the most frequent class reduces the tree’s
accuracy. If it does not, the node is pruned. The procedure continues until any
further pruning would decrease the accuracy.
In order to estimate the accuracy, Quinlan (1987) proposes to use a pruning
set. It can be shown that this procedure ends with the smallest accurate sub–
tree with respect to a given pruning set.

6.4 Minimum Error Pruning (MEP)


The minimum error pruning has been proposed in (Olaru and Wehenkel,
2003). It performs bottom–up traversal of the internal nodes. In each node it
compares the l-probability error rate estimation with and without pruning.
The l-probability error rate estimation is a correction to the simple probabil-
ity estimation using frequencies. If St denotes the instances that have reached
a leaf t, then the expected error rate in this leaf is:

ε′(t) = 1 − max_{ci ∈ dom(y)} ( (|σy=ci St| + l · papr(y = ci)) / (|St| + l) )

where papr (y = ci ) is the a–priori probability of y getting the value ci , and


l denotes the weight given to the a–priori probability.
The error rate of an internal node is the weighted average of the error rate of
its branches. The weight is determined according to the proportion of instances
along each branch. The calculation is performed recursively up to the leaves.
If an internal node is pruned, then it becomes a leaf and its error rate is
calculated directly using the last equation. Consequently, we can compare the
error rate before and after pruning a certain internal node. If pruning this node
does not increase the error rate, the pruning should be accepted.

6.5 Pessimistic Pruning


Pessimistic pruning avoids the need of pruning set or cross validation and
uses the pessimistic statistical correlation test instead (Quinlan, 1993).

The basic idea is that the error ratio estimated using the training set is not
reliable enough. Instead, a more realistic measure, known as the continuity
correction for binomial distribution, should be used:
ε′(T, S) = ε(T, S) + |leaves(T)| / (2 · |S|)
However, this correction still produces an optimistic error rate. Conse-
quently, one should consider pruning an internal node t if its error rate is within
one standard error from a reference tree, namely (Quinlan, 1993):
ε′(pruned(T, t), S) ≤ ε′(T, S) + √( ε′(T, S) · (1 − ε′(T, S)) / |S| )

The last condition is based on statistical confidence interval for proportions.


Usually the last condition is used such that T refers to a sub–tree whose root
is the internal node t and S denotes the portion of the training set that refers to
the node t.
The pessimistic pruning procedure performs top–down traversing over the
internal nodes. If an internal node is pruned, then all its descendants are re-
moved from the pruning process, resulting in a relatively fast pruning.

6.6 Error–based Pruning (EBP)


Error–based pruning is an evolution of pessimistic pruning. It is imple-
mented in the well–known C4.5 algorithm.
As in pessimistic pruning, the error rate is estimated using the upper bound
of the statistical confidence interval for proportions.
εUB(T, S) = ε(T, S) + Zα · √( ε(T, S) · (1 − ε(T, S)) / |S| )

where ε(T, S) denotes the misclassification rate of the tree T on the training
set S. Z is the inverse of the standard normal cumulative distribution and α is
the desired significance level.
Let subtree(T,t) denote the subtree rooted by the node t. Let maxchild(T,t)
denote the most frequent child node of t (namely most of the instances in S
reach this particular child) and let St denote all instances in S that reach the
node t.
The procedure performs bottom–up traversal over all nodes and compares
the following values:
1. εU B (subtree(T, t), St )
2. εU B (pruned(subtree(T, t), t), St )

3. εU B (subtree(T, maxchild(T, t)), Smaxchild(T,t) )

According to the lowest value, the procedure either leaves the tree as is, prunes
the node t, or replaces the node t with the subtree rooted by maxchild(T,t).

6.7 Optimal Pruning


The issue of finding optimal pruning has been studied in (Bratko and Bo-
hanec, 1994) and (Almuallim, 1996). The first research introduced an al-
gorithm which guarantees optimality, known as OPT. This algorithm finds
the optimal pruning based on dynamic programming, with the complexity of
Θ(|leaves(T)|²), where T is the initial decision tree. The second research in-
troduced an improvement of OPT called OPT–2, which also performs optimal
pruning using dynamic programming. However, the time and space complexi-
ties of OPT–2 are both Θ(|leaves(T*)| · |internal(T)|), where T* is the target
(pruned) decision tree and T is the initial decision tree.
Since the pruned tree is habitually much smaller than the initial tree and
the number of internal nodes is smaller than the number of leaves, OPT–2 is
usually more efficient than OPT in terms of computational complexity.

6.8 Minimum Description Length (MDL) Pruning


The minimum description length can be used for evaluating the generalized
accuracy of a node (Rissanen, 1989; Quinlan and Rivest, 1989; Mehta et al.,
1995). This method measures the size of a decision tree by means of the num-
ber of bits required to encode the tree. The MDL method prefers decision
trees that can be encoded with fewer bits. The cost of a split at a leaf t can be
estimated as (Mehta et al., 1995):

Cost(t) = Σ_{ci ∈ dom(y)} |σ_{y=ci} St| · ln( |St| / |σ_{y=ci} St| )
          + ((|dom(y)| − 1) / 2) · ln( |St| / 2 )
          + ln( π^{|dom(y)|/2} / Γ(|dom(y)|/2) )

where St denotes the instances that have reached node t. The splitting cost
of an internal node is calculated based on the cost aggregation of its children.
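
As a rough sketch, the leaf cost above can be computed as follows, with the class labels of the instances reaching the leaf passed in as a plain list and n_classes standing for |dom(y)| (both names are illustrative, not from the original text).

import math
from collections import Counter

def mdl_leaf_cost(labels, n_classes):
    # Sketch of the leaf cost from the equation above: labels are the target
    # values of the instances reaching node t, n_classes is |dom(y)|.
    n = len(labels)
    cost = sum(c * math.log(n / c) for c in Counter(labels).values())
    cost += (n_classes - 1) / 2.0 * math.log(n / 2.0)
    cost += math.log(math.pi ** (n_classes / 2.0) / math.gamma(n_classes / 2.0))
    return cost

# Hypothetical leaf with 20 instances: 15 of class "yes" and 5 of class "no".
print(mdl_leaf_cost(["yes"] * 15 + ["no"] * 5, n_classes=2))
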

6.9 Other Pruning Methods


There are other pruning methods reported in the literature, such as the MML
(Minimum Message Length) pruning method (Wallace and Patrick, 1993) and
Critical Value Pruning (Mingers, 1989).

6.10 Comparison of Pruning Methods


Several studies aim to compare the performance of different pruning tech-
niques (Quinlan, 1987; Mingers, 1989; Esposito et al., 1997). The results
indicate that some methods (such as cost–complexity pruning and reduced error
pruning) tend to over–prune, i.e., create smaller but less accurate decision
trees. Other methods (like error–based pruning, pessimistic error pruning and
minimum error pruning) are biased toward under–pruning. Most of the comparisons
concluded that the “no free lunch” theorem applies here as well, namely
there is no pruning method that outperforms all other pruning methods in every case.

7. Other Issues
7.1 Weighting Instances
Some decision tree inducers may give different treatment to different in-
stances. This is done by weighting the contribution of each instance in
the analysis according to a provided weight (between 0 and 1).

7.2 Misclassification costs


Several decision tree inducers can be provided with numeric penalties for
classifying an item into one class when it really belongs in another.

7.3 Handling Missing Values


Missing values are common in real-world data sets. This sit-
uation can complicate both induction (a training set where some of its values
are missing) and classification (a new instance that is missing certain values).
This problem has been addressed by several researchers (Friedman, 1977;
Breiman et al., 1984; Quinlan, 1989). One can handle missing values in the
training set in the following way: let σai =? S indicate the subset of instances
in S whose ai values are missing. When calculating the splitting criteria using
attribute ai , simply ignore all instances whose values in attribute ai are unknown,
that is, instead of using the splitting criteria ∆Φ(ai , S) it uses ∆Φ(ai , S −
σai =? S).
On the other hand, in case of missing values, the splitting criteria should be
reduced proportionally as nothing has been learned from these instances (Quin-
lan, 1989). In other words, instead of using the splitting criteria ∆Φ(ai , S), it
uses the following correction:
( |S − σai=? S| / |S| ) · ∆Φ(ai , S − σai=? S).
In a case where the criterion value is normalized (as in the case of gain ratio),
the denominator should be calculated as if the missing values represent an
additional value in the attribute domain. For instance, the Gain Ratio with
missing values should be calculated as follows:

GainRatio(ai , S) =
  [ ( |S − σai=? S| / |S| ) · InformationGain(ai , S − σai=? S) ] /
  [ − ( |σai=? S| / |S| ) · log( |σai=? S| / |S| ) − Σ_{vi,j ∈ dom(ai)} ( |σai=vi,j S| / |S| ) · log( |σai=vi,j S| / |S| ) ]

Once a node is split, it is required to add σai =? S to each one of the outgoing
edges with the following corresponding weight:
|σai=vi,j S| / |S − σai=? S|

The same idea is used for classifying a new instance with missing attribute
values. When an instance encounters a node where its splitting criterion cannot be
evaluated due to a missing value, it is passed through to all outgoing edges. The
predicted class will be the class with the highest probability in the weighted
union of all the leaf nodes at which this instance ends up.
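
The following sketch (assuming missing values are encoded as None and that the base-2 logarithm is used; all names are illustrative) computes this corrected gain ratio for a single nominal attribute.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def corrected_gain_ratio(values, labels):
    # values[i] is the (possibly missing, i.e. None) attribute value of instance i,
    # labels[i] is its target value.
    n = len(values)
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    frac_known = len(known) / n
    # Information gain computed only on instances with a known attribute value,
    # down-weighted by the fraction of known values.
    gain = entropy([y for _, y in known])
    for v in set(v for v, _ in known):
        subset = [y for u, y in known if u == v]
        gain -= len(subset) / len(known) * entropy(subset)
    gain *= frac_known
    # Split information treating the missing values as one additional category.
    split_info = -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
    return gain / split_info if split_info > 0 else 0.0
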
Another approach known as surrogate splits was presented by Breiman et al.
(1984) and is implemented in the CART algorithm. The idea is to find for
each split in the tree a surrogate split which uses a different input attribute and
which most resembles the original split. If the value of the input attribute used
in the original split is missing, then it is possible to use the surrogate split. The
resemblance between two binary splits over sample S is formally defined as:

res(ai , dom1(ai), dom2(ai), aj , dom1(aj), dom2(aj), S) =
  |σ_{ai ∈ dom1(ai) AND aj ∈ dom1(aj)} S| / |S| + |σ_{ai ∈ dom2(ai) AND aj ∈ dom2(aj)} S| / |S|

where the first split refers to attribute ai and splits dom(ai ) into dom1 (ai )
and dom2 (ai ), while the alternative split refers to attribute aj and splits its
domain into dom1 (aj ) and dom2 (aj ).
The missing value can be estimated based on other instances (Loh and Shih,
1997). On the learning phase, if the value of a nominal attribute ai in tuple q
is missing, then it is estimated by its mode over all instances having the same
target attribute value. Formally,
estimate(ai , yq , S) = argmax_{vi,j ∈ dom(ai)} |σ_{ai=vi,j AND y=yq} S|

where yq denotes the value of the target attribute in the tuple q. If the missing
attribute ai is numeric, then instead of using mode of ai it is more appropriate
to use its mean.

8. Decision Trees Inducers


8.1 ID3
The ID3 algorithm is considered a very simple decision tree algorithm
(Quinlan, 1986). ID3 uses information gain as its splitting criterion. The growing
stops when all instances belong to a single value of target feature or when
best information gain is not greater than zero. ID3 does not apply any pruning
procedures nor does it handle numeric attributes or missing values.
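
As a small illustration (not ID3 itself, just its splitting criterion), the information gain of a nominal attribute can be computed as below; rows is assumed to be a list of attribute tuples and labels the corresponding class values.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Entropy of the whole node minus the weighted entropy of the partitions
    # induced by the attribute at attr_index.
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr_index] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attr_index] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# ID3 grows the tree by repeatedly choosing the attribute with the highest gain,
# stopping when all labels at a node agree or no attribute yields a positive gain.
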

8.2 C4.5
C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993).
It uses gain ratio as splitting criteria. The splitting ceases when the number
of instances to be split is below a certain threshold. Error–based pruning is
performed after the growing phase. C4.5 can handle numeric attributes. It can
induce from a training set that incorporates missing values by using corrected
gain ratio criteria as presented above.

8.3 CART
CART stands for Classification and Regression Trees (Breiman et al., 1984).
It is characterized by the fact that it constructs binary trees, namely each in-
ternal node has exactly two outgoing edges. The splits are selected using the
twoing criteria and the obtained tree is pruned by cost–complexity Pruning.
When provided, CART can consider misclassification costs in the tree induc-
tion. It also enables users to provide prior probability distribution.
An important feature of CART is its ability to generate regression trees.
Regression trees are trees where their leaves predict a real number and not a
class. In case of regression, CART looks for splits that minimize the prediction
squared error (the least–squared deviation). The prediction in each leaf is based
on the weighted mean for node.

8.4 CHAID
Starting from the early seventies, researchers in applied statistics devel-
oped procedures for generating decision trees, such as: AID (Sonquist et al.,
1971), MAID (Gillo, 1972), THAID (Morgan and Messenger, 1973) and
CHAID (Kass, 1980). CHAID (Chisquare–Automatic–Interaction–Detection)
was originally designed to handle nominal attributes only. For each input at-
tribute ai , CHAID finds the pair of values in Vi that is least significantly differ-
ent with respect to the target attribute. The significant difference is measured
by the p value obtained from a statistical test. The statistical test used depends
on the type of target attribute. If the target attribute is continuous, an F test is
used. If it is nominal, then a Pearson chi–squared test is used. If it is ordinal,
then a likelihood–ratio test is used.
For each selected pair, CHAID checks if the p value obtained is greater than
a certain merge threshold. If the answer is positive, it merges the values and
searches for an additional potential pair to be merged. The process is repeated
until no significant pairs are found.
The best input attribute to be used for splitting the current node is then se-
lected, such that each child node is made of a group of homogeneous values of
the selected attribute. Note that no split is performed if the adjusted p value of
the best input attribute is not less than a certain split threshold. This procedure
also stops when one of the following conditions is fulfilled:

1. Maximum tree depth is reached.

2. Minimum number of cases in node for being a parent is reached, so it
can not be split any further.

3. Minimum number of cases in node for being a child node is reached.

CHAID handles missing values by treating them all as a single valid category.
CHAID does not perform pruning.

8.5 QUEST
The QUEST (Quick, Unbiased, Efficient, Statistical Tree) algorithm sup-
ports univariate and linear combination splits (Loh and Shih, 1997). For each
split, the association between each input attribute and the target attribute is
computed using the ANOVA F–test or Levene’s test (for ordinal and contin-
uous attributes) or Pearson’s chi–square (for nominal attributes). If the target
attribute is multinomial, two–means clustering is used to create two super–
classes. The attribute that obtains the highest association with the target at-
tribute is selected for splitting. Quadratic Discriminant Analysis (QDA) is
applied to find the optimal splitting point for the input attribute. QUEST has
negligible bias and it yields binary decision trees. Ten–fold cross–validation is
used to prune the trees.

8.6 Reference to Other Algorithms


Table 9.1 describes other decision trees algorithms available in the litera-
ture. Obviously there are many other algorithms which are not included in this
table. Nevertheless, most of these algorithms are a variation of the algorithmic
framework presented above. A profound comparison of the above algorithms
and many others has been conducted in (Lim et al., 2000).

Table 9.1. Additional Decision Tree Inducers.

Algorithm   Description                                                         Reference
CAL5        Designed specifically for numerical-valued attributes.              Muller and Wysotzki (1994)
FACT        An earlier version of QUEST. Uses statistical tests to select an    Loh and Vanichsetakul (1988)
            attribute for splitting each node and then uses discriminant
            analysis to find the split point.
LMDT        Constructs a decision tree based on multivariate tests that are     Brodley and Utgoff (1995)
            linear combinations of the attributes.
T1          A one-level decision tree that classifies instances using only      Holte (1993)
            one attribute. Missing values are treated as a “special value”.
            Supports both continuous and nominal attributes.
PUBLIC      Integrates the growing and pruning by using MDL cost in order       Rastogi and Shim (2000)
            to reduce the computational complexity.
MARS        A multiple regression function is approximated using linear         Friedman (1991)
            splines and their tensor products.

9. Advantages and Disadvantages of Decision Trees


Several advantages of the decision tree as a classification tool have been
pointed out in the literature:

1. Decision trees are self–explanatory and when compacted they are also
easy to follow. In other words if the decision tree has a reasonable num-
ber of leaves, it can be grasped by non–professional users. Furthermore
decision trees can be converted to a set of rules. Thus, this representation
is considered as comprehensible.
2. Decision trees can handle both nominal and numeric input attributes.
3. Decision tree representation is rich enough to represent any discrete–
value classifier.
4. Decision trees are capable of handling datasets that may have errors.
5. Decision trees are capable of handling datasets that may have missing
values.

Figure 9.3. Illustration of Decision Tree with Replication.

6. Decision trees are considered to be a nonparametric method. This means
that decision trees have no assumptions about the space distribution and
the classifier structure.

On the other hand, decision trees have such disadvantages as:

1. Most of the algorithms (like ID3 and C4.5) require that the target at-
tribute will have only discrete values.

2. As decision trees use the “divide and conquer” method, they tend to per-
form well if a few highly relevant attributes exist, but less so if many
complex interactions are present. One of the reasons for this is that other
classifiers can compactly describe a classifier that would be very chal-
lenging to represent using a decision tree. A simple illustration of this
phenomenon is the replication problem of decision trees (Pagallo and
Huassler, 1990). Since most decision trees divide the instance space into
mutually exclusive regions to represent a concept, in some cases the tree
should contain several duplications of the same sub-tree in order to rep-
resent the classifier. For instance if the concept follows the following
binary function: y = (A1 ∩ A2 ) ∪ (A3 ∩ A4 ) then the minimal univari-
ate decision tree that represents this function is illustrated in Figure 9.3.
Note that the tree contains two copies of the same sub-tree.

3. The greedy characteristic of decision trees leads to another disadvantage
that should be pointed out: their over–sensitivity to the training set,
to irrelevant attributes and to noise (Quinlan, 1993).

10. Decision Tree Extensions


In the following sub-sections, we discuss some of the most popular exten-
sions to the classical decision tree induction paradigm.

10.1 Oblivious Decision Trees


Oblivious decision trees are decision trees for which all nodes at the same
level test the same feature. Despite this restriction, oblivious decision trees are
found to be effective for feature selection. Almuallim and Dietterich (1994) as
well as Schlimmer (1993) have proposed a forward feature selection procedure
by constructing oblivious decision trees. Langley and Sage (1994) suggested
backward selection using the same means. It has been shown that oblivious
decision trees can be converted to a decision table (Kohavi and Sommerfield,
1998). Recently Last et al. (2002) have suggested a new algorithm for con-
structing oblivious decision trees, called IFN (Information Fuzzy Network)
that is based on information theory.
Figure 9.4 illustrates a typical oblivious decision tree with four input fea-
tures: glucose level (G), age (A), hypertension (H) and pregnant (P) and the
Boolean target feature representing whether that patient suffers from diabetes.
Each layer is uniquely associated with an input feature by representing the
interaction of that feature and the input features of the previous layers. The
number that appears in the terminal nodes indicates the number of instances
that fit this path. For example, regarding patients whose glucose level is less
than 107 and their age is greater than 50, 10 of them are positively diagnosed
with diabetes while 2 of them are not diagnosed with diabetes.
The principal difference between the oblivious decision tree and a regular
decision tree structure is the constant ordering of input attributes at every ter-
minal node of the oblivious decision tree, the property which is necessary for
minimizing the overall subset of input attributes (resulting in dimensionality
reduction). The arcs that connect the terminal nodes and the nodes of the tar-
get layer are labelled with the number of records that fit this path.
An oblivious decision tree is usually built by a greedy algorithm, which tries
to maximize the mutual information measure in every layer. The recursive
search for explaining attributes is terminated when there is no attribute that
explains the target with statistical significance.

10.2 Fuzzy Decision Trees


In classical decision trees, an instance can be associated with only one
branch of the tree. Fuzzy decision trees (FDT) may simultaneously assign
more than one branch to the same instance with gradual certainty.

Figure 9.4. Illustration of Oblivious Decision Tree.

FDTs preserve the symbolic structure of the tree and its comprehensibility.
Nevertheless, FDTs can represent concepts with graduated characteristics by
producing real-valued outputs with gradual shifts.
Janikow (1998) presented a complete framework for building a fuzzy tree in-
cluding several inference procedures based on conflict resolution in rule-based
systems and efficient approximate reasoning methods.
Olaru and Wehenkel (2003) presented a new type of fuzzy decision tree called soft
decision trees (SDT). This approach combines tree-growing and pruning, to
determine the structure of the soft decision tree, with refitting and backfit-
ting, to improve its generalization capabilities. They empirically showed that
soft decision trees are significantly more accurate than standard decision trees.
Moreover, a global model variance study shows a much lower variance for soft
decision trees than for standard trees as a direct cause of the improved accu-
racy.
Peng (2004) has used FDT to improve the performance of the classical in-
ductive learning approach in manufacturing processes. Peng (2004) proposed
to use soft discretization of continuous-valued attributes. It has been shown
that FDT can deal with the noise or uncertainties existing in the data collected
in industrial systems.

10.3 Decision Trees Inducers for Large Datasets


With the recent growth in the amount of data collected by information sys-
tems, there is a need for decision trees that can handle large datasets. Catlett
(1991) has examined two methods for efficiently growing decision trees from a
large database by reducing the computation complexity required for induction.
However, the Catlett method requires that all data will be loaded into the main
memory before induction. That is to say, the largest dataset that can be induced
is bounded by the memory size. Fifield (1992) suggests parallel implementa-
tion of the ID3 Algorithm. However, like Catlett, it assumes that all dataset
can fit in the main memory. Chan and Stolfo (1997) suggest partitioning the
datasets into several disjointed datasets, so that each dataset is loaded sepa-
rately into the memory and used to induce a decision tree. The decision trees
are then combined to create a single classifier. However, the experimental re-
sults indicate that partition may reduce the classification performance, meaning
that the classification accuracy of the combined decision trees is not as good as
the accuracy of a single decision tree induced from the entire dataset.
The SLIQ algorithm (Mehta et al., 1996) does not require loading the entire
dataset into the main memory, instead it uses a secondary memory (disk). In
other words, a certain instance is not necessarily resident in the main mem-
ory all the time. SLIQ creates a single decision tree from the entire dataset.
However, this method also has an upper limit for the largest dataset that can
be processed, because it uses a data structure that scales with the dataset size
and this data structure must be resident in main memory all the time. The
SPRINT algorithm uses a similar approach (Shafer et al., 1996). This algo-
rithm induces decision trees relatively quickly and removes all of the memory
restrictions from decision tree induction. SPRINT scales any impurity based
split criteria for large datasets. Gehrke et al (2000) introduced RainForest; a
unifying framework for decision tree classifiers that are capable of scaling any
specific algorithms from the literature (including C4.5, CART and CHAID). In
addition to its generality, RainForest improves SPRINT by a factor of three. In
contrast to SPRINT, however, RainForest requires a certain minimum amount
of main memory, proportional to the set of distinct values in a column of the
input relation. However, this requirement is considered modest and reasonable.
Other decision tree inducers for large datasets can be found in the literature
(Alsabti et al., 1998; Freitas and Lavington, 1998; Gehrke et al., 1999).

10.4 Incremental Induction


Most decision tree inducers require rebuilding the tree from scratch
to reflect new data that has become available. Several researchers have
addressed the issue of updating decision trees incrementally. Utgoff (1989b,
1997) presents several methods for updating decision trees incrementally. An
extension to the CART algorithm that is capable of inducing incrementally is
described in (Crawford et al., 2002).

References
Almuallim H., An Efficient Algorithm for Optimal Pruning of Decision Trees.
Artificial Intelligence 83(2): 347-362, 1996.
Almuallim H., and Dietterich T.G., Learning Boolean concepts in the presence
of many irrelevant features. Artificial Intelligence, 69: 1-2, 279-306, 1994.
Alsabti K., Ranka S. and Singh V., CLOUDS: A Decision Tree Classifier
for Large Datasets, Conference on Knowledge Discovery and Data Mining
(KDD-98), August 1998.
Attneave F., Applications of Information Theory to Psychology. Holt, Rinehart
and Winston, 1959.
Baker E., and Jain A. K., On feature ordering in practice and some finite sam-
ple effects. In Proceedings of the Third International Joint Conference on
Pattern Recognition, pages 45-49, San Diego, CA, 1976.
BenBassat M., Myopic policies in sequential classification. IEEE Trans. on
Computing, 27(2):170-174, February 1978.
Bennett X. and Mangasarian O.L., Multicategory discrimination via linear pro-
gramming. Optimization Methods and Software, 3:29-39, 1994.
Bratko I., and Bohanec M., Trading accuracy for simplicity in decision trees,
Machine Learning 15: 223-250, 1994.
Breiman L., Friedman J., Olshen R., and Stone C.. Classification and Regres-
sion Trees. Wadsworth Int. Group, 1984.
Brodley C. E. and Utgoff. P. E., Multivariate decision trees. Machine Learning,
19:45-77, 1995.
Buntine W., Niblett T., A Further Comparison of Splitting Rules for Decision-
Tree Induction. Machine Learning, 8: 75-85, 1992.
Catlett J., Mega induction: Machine Learning on Vary Large Databases, PhD,
University of Sydney, 1991.
Chan P.K. and Stolfo S.J, On the Accuracy of Meta-learning for Scalable Data
Mining, J. Intelligent Information Systems, 8:5-28, 1997.
Crawford S. L., Extensions to the CART algorithm. Int. J. of ManMachine
Studies, 31(2):197-217, August 1989.
Dietterich, T. G., Kearns, M., and Mansour, Y., Applying the weak learning
framework to understand and improve C4.5. Proceedings of the Thirteenth
International Conference on Machine Learning, pp. 96-104, San Francisco:
Morgan Kaufmann, 1996.
Duda, R., and Hart, P., Pattern Classification and Scene Analysis, New-York,
Wiley, 1973.
Esposito F., Malerba D. and Semeraro G., A Comparative Analysis of Meth-
ods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 19(5):476-492, 1997.
Fayyad U., and Irani K. B., The attribute selection problem in decision tree
generation. In proceedings of Tenth National Conference on Artificial Intel-
ligence, pp. 104–110, Cambridge, MA: AAAI Press/MIT Press, 1992.
Ferri C., Flach P., and Hernández-Orallo J., Learning Decision Trees Using the
Area Under the ROC Curve. In Claude Sammut and Achim Hoffmann, ed-
itors, Proceedings of the 19th International Conference on Machine Learn-


ing, pp. 139-146. Morgan Kaufmann, July 2002
Fifield D. J., Distributed Tree Construction From Large Datasets, Bachelor’s
Honor Thesis, Australian National University, 1992.
Freitas X., and Lavington S. H., Mining Very Large Databases With Parallel
Processing, Kluwer Academic Publishers, 1998.
Friedman J. H., A recursive partitioning decision rule for nonparametric clas-
sifiers. IEEE Trans. on Comp., C26:404-408, 1977.
Friedman, J. H., “Multivariate Adaptive Regression Splines”, The Annual Of
Statistics, 19, 1-141, 1991.
Gehrke J., Ganti V., Ramakrishnan R., Loh W., BOAT-Optimistic Decision
Tree Construction. SIGMOD Conference 1999: pp. 169-180, 1999.
Gehrke J., Ramakrishnan R., Ganti V., RainForest - A Framework for Fast
Decision Tree Construction of Large Datasets,Data Mining and Knowledge
Discovery, 4(2/3):127-162, 2000.
Gelfand S. B., Ravishankar C. S., and Delp E. J., An iterative growing and
pruning algorithm for classification tree design. IEEE Transaction on Pattern
Analysis and Machine Intelligence, 13(2):163-174, 1991.
Gillo M. W., MAID: A Honeywell 600 program for an automatised survey
analysis. Behavioral Science 17: 251-252, 1972.
Hancock T. R., Jiang T., Li M., Tromp J., Lower Bounds on Learning Decision
Lists and Trees. Information and Computation 126(2): 114-122, 1996.
Holte R. C., Very simple classification rules perform well on most commonly
used datasets. Machine Learning, 11:63-90, 1993.
Hyafil L. and Rivest R.L., Constructing optimal binary decision trees is NP-
complete. Information Processing Letters, 5(1):15-17, 1976
Janikow, C.Z., Fuzzy Decision Trees: Issues and Methods, IEEE Transactions
on Systems, Man, and Cybernetics, Vol. 28, Issue 1, pp. 1-14. 1998.
John G. H., Robust linear discriminant trees. In D. Fisher and H. Lenz, editors,
Learning From Data: Artificial Intelligence and Statistics V, Lecture Notes
in Statistics, Chapter 36, pp. 375-385. Springer-Verlag, New York, 1996.
Kass G. V., An exploratory technique for investigating large quantities of cate-
gorical data. Applied Statistics, 29(2):119-127, 1980.
Kearns M. and Mansour Y., A fast, bottom-up decision tree pruning algorithm
with near-optimal generalization, in J. Shavlik, ed., ‘Machine Learning:
Proceedings of the Fifteenth International Conference’, Morgan Kaufmann
Publishers, Inc., pp. 269-277, 1998.
Kearns M. and Mansour Y., On the boosting ability of top-down decision tree
learning algorithms. Journal of Computer and Systems Sciences, 58(1): 109-
128, 1999.
Kohavi R. and Sommerfield D., Targeting business users with decision table
classifiers, in R. Agrawal, P. Stolorz & G. Piatetsky-Shapiro, eds, ‘Proceed-
ings of the Fourth International Conference on Knowledge Discovery and
Data Mining’, AAAI Press, pp. 249-253, 1998.
Langley, P. and Sage, S., Oblivious decision trees and abstract cases. in Work-
ing Notes of the AAAI-94 Workshop on Case-Based Reasoning, pp. 113-
117, Seattle, WA: AAAI Press, 1994.
Last, M., Maimon, O. and Minkov, E., Improving Stability of Decision Trees,
International Journal of Pattern Recognition and Artificial Intelligence, 16:
2,145-159, 2002.
Li X. and Dubes R. C., Tree classifier design with a Permutation statistic, Pat-
tern Recognition 19:229-235, 1986.
Lim X., Loh W.Y., and Shih X., A comparison of prediction accuracy, com-
plexity, and training time of thirty-three old and new classification algo-
rithms . Machine Learning 40:203-228, 2000.
Lin Y. K. and Fu K., Automatic classification of cervical cells using a binary
tree classifier. Pattern Recognition, 16(1):69-80, 1983.
Loh W.Y.,and Shih X., Split selection methods for classification trees. Statis-
tica Sinica, 7: 815-840, 1997.
Loh W.Y. and Shih X., Families of splitting criteria for classification trees.
Statistics and Computing 9:309-315, 1999.
Loh W.Y. and Vanichsetakul N., Tree-structured classification via generalized
discriminant Analysis. Journal of the American Statistical Association, 83:
715-728, 1988.
Lopez de Mantras R., A distance-based attribute selection measure for decision
tree induction, Machine Learning 6:81-92, 1991.
Lubinsky D., Algorithmic speedups in growing classification trees by using an
additive split criterion. Proc. AI&Statistics93, pp. 435-444, 1993.
Martin J. K., An Exact Probability Metric for Decision Tree Splitting and Stopping,
Machine Learning, 28(2-3):257-291, 1997.
Mehta M., Rissanen J., Agrawal R., MDL-Based Decision Tree Pruning. KDD
1995: pp. 216-221, 1995.
Mehta M., Agrawal R. and Rissanen J., SLIQ: A fast scalable classifier for
Data Mining: In Proc. If the fifth Int’l Conference on Extending Database
Technology (EDBT), Avignon, France, March 1996.
Mingers J., An empirical comparison of pruning methods for decision tree in-
duction. Machine Learning, 4(2):227-243, 1989.
Morgan J. N. and Messenger R. C., THAID: a sequential search program for
the analysis of nominal scale dependent variables. Technical report, Institute
for Social Research, Univ. of Michigan, Ann Arbor, MI, 1973.
Muller W., and Wysotzki F., Automatic construction of decision trees for clas-
sification. Annals of Operations Research, 52:231-247, 1994.
Murthy S. K., Automatic Construction of Decision Trees from Data: A Multi-
Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
Naumov G.E., NP-completeness of problems of construction of optimal deci-
sion trees. Soviet Physics: Doklady, 36(4):270-271, 1991.
Niblett T. and Bratko I., Learning Decision Rules in Noisy Domains, Proc.
Expert Systems 86, Cambridge: Cambridge University Press, 1986.
Olaru C., Wehenkel L., A complete fuzzy decision tree technique, Fuzzy Sets
and Systems, 138(2):221–254, 2003.
Pagallo, G. and Huassler, D., Boolean feature discovery in empirical learning,
Machine Learning, 5(1): 71-99, 1990.
Peng Y., Intelligent condition monitoring using fuzzy inductive learning, Jour-
nal of Intelligent Manufacturing, 15 (3): 373-380, June 2004.
Quinlan, J.R., Induction of decision trees, Machine Learning 1, 81-106, 1986.
Quinlan, J.R., Simplifying decision trees, International Journal of Man-
Machine Studies, 27, 221-234, 1987.
Quinlan, J.R., Decision Trees and Multivalued Attributes, J. Richards, ed., Ma-
chine Intelligence, V. 11, Oxford, England, Oxford Univ. Press, pp. 305-318,
1988.
Quinlan, J. R., Unknown attribute values in induction. In Segre, A. (Ed.), Pro-
ceedings of the Sixth International Machine Learning Workshop Cornell,
New York. Morgan Kaufmann, 1989.
Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, Los
Altos, 1993.
Quinlan, J. R. and Rivest, R. L., Inferring Decision Trees Using The Minimum
Description Length Principle. Information and Computation, 80:227-248,
1989.
Rastogi, R., and Shim, K., PUBLIC: A Decision Tree Classifier that Integrates
Building and Pruning,Data Mining and Knowledge Discovery, 4(4):315-
344, 2000.
Rissanen, J., Stochastic complexity and statistical inquiry. World Scientific,
1989.
Rounds, E., A combined non-parametric approach to feature selection and bi-
nary decision tree design, Pattern Recognition 12, 313-317, 1980.
Schlimmer, J. C. , Efficiently inducing determinations: A complete and sys-
tematic search algorithm that uses optimal pruning. In Proceedings of the
1993 International Conference on Machine Learning: pp 284-290, San Ma-
teo, CA, Morgan Kaufmann, 1993.
Sethi, K., and Yoo, J. H., Design of multicategory, multifeature split decision
trees using perceptron learning. Pattern Recognition, 27(7):939-947, 1994.
Shafer, J. C., Agrawal, R. and Mehta, M. , SPRINT: A Scalable Parallel Clas-
sifier for Data Mining, Proc. 22nd Int. Conf. Very Large Databases, T. M.
Vijayaraman and Alejandro P. Buchmann and C. Mohan and Nandlal L.
Sarda (eds), 544-555, Morgan Kaufmann, 1996.
Sklansky, J. and Wassel, G. N., Pattern classifiers and trainable machines.
SpringerVerlag, New York, 1981.
Sonquist, J. A., Baker E. L., and Morgan, J. N., Searching for Structure. Insti-
tute for Social Research, Univ. of Michigan, Ann Arbor, MI, 1971.
Taylor P. C., and Silverman, B. W., Block diagrams and splitting criteria for
classification trees. Statistics and Computing, 3(4):147-161, 1993.
Utgoff, P. E., Perceptron trees: A case study in hybrid concept representations.
Connection Science, 1(4):377-391, 1989.
Utgoff, P. E., Incremental induction of decision trees. Machine Learning, 4:
161-186, 1989.
Utgoff, P. E., Decision tree induction based on efficient tree restructuring, Ma-
chine Learning, 29(1):5-44, 1997.
Utgoff, P. E., and Clouse, J. A., A Kolmogorov-Smirnoff Metric for Decision
Tree Induction, Technical Report 96-3, University of Massachusetts, De-
partment of Computer Science, Amherst, MA, 1996.
Wallace, C. S., and Patrick J., Coding decision trees, Machine Learning 11:
7-22, 1993.
Zantema, H., and Bodlaender H. L., Finding Small Equivalent Decision Trees
is Hard, International Journal of Foundations of Computer Science, 11(2):
343-354, 2000.
CS229 Lecture notes
Andrew Ng

The k-means clustering algorithm


In the clustering problem, we are given a training set {x(1) , . . . , x(m) }, and
want to group the data into a few cohesive “clusters.” Here, x(i) ∈ Rn
as usual; but no labels y (i) are given. So, this is an unsupervised learning
problem.
The k-means clustering algorithm is as follows:

1. Initialize cluster centroids µ1 , µ2 , . . . , µk ∈ Rn randomly.

2. Repeat until convergence: {

For every i, set

    c(i) := arg min_j ||x(i) − µj ||² .

For each j, set

    µj := ( Σ_{i=1}^m 1{c(i) = j} x(i) ) / ( Σ_{i=1}^m 1{c(i) = j} ).

}

In the algorithm above, k (a parameter of the algorithm) is the number


of clusters we want to find; and the cluster centroids µj represent our current
guesses for the positions of the centers of the clusters. To initialize the cluster
centroids (in step 1 of the algorithm above), we could choose k training
examples randomly, and set the cluster centroids to be equal to the values of
these k examples. (Other initialization methods are also possible.)
The inner-loop of the algorithm repeatedly carries out two steps: (i)
“Assigning” each training example x(i) to the closest cluster centroid µj , and
(ii) Moving each cluster centroid µj to the mean of the points assigned to it.
Figure 1 shows an illustration of running k-means.


Figure 1: K-means algorithm. Training examples are shown as dots, and
cluster centroids are shown as crosses. (a) Original dataset. (b) Random ini-
tial cluster centroids (in this instance, not chosen to be equal to two training
examples). (c-f) Illustration of running two iterations of k-means. In each
iteration, we assign each training example to the closest cluster centroid
(shown by “painting” the training examples the same color as the cluster
centroid to which it is assigned); then we move each cluster centroid to the
mean of the points assigned to it. (Best viewed in color.) Images courtesy
Michael Jordan.

Is the k-means algorithm guaranteed to converge? Yes it is, in a certain


sense. In particular, let us define the distortion function to be:
J(c, µ) = Σ_{i=1}^m ||x(i) − µ_{c(i)} ||²

Thus, J measures the sum of squared distances between each training exam-
ple x(i) and the cluster centroid µc(i) to which it has been assigned. It can
be shown that k-means is exactly coordinate descent on J. Specifically, the
inner-loop of k-means repeatedly minimizes J with respect to c while holding
µ fixed, and then minimizes J with respect to µ while holding c fixed. Thus,
J must monotonically decrease, and the value of J must converge. (Usu-
ally, this implies that c and µ will converge too. In theory, it is possible for
k-means to oscillate between a few different clusterings—i.e., a few different


values for c and/or µ—that have exactly the same value of J, but this almost
never happens in practice.)
The distortion function J is a non-convex function, and so coordinate
descent on J is not guaranteed to converge to the global minimum. In other
words, k-means can be susceptible to local optima. Very often k-means will
work fine and come up with very good clusterings despite this. But if you
are worried about getting stuck in bad local minima, one common thing to
do is run k-means many times (using different random initial values for the
cluster centroids µj ). Then, out of all the different clusterings found, pick
the one that gives the lowest distortion J(c, µ).
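
The following is a minimal NumPy sketch of this procedure (random restarts, keeping the clustering with the lowest distortion J); the function name and parameters are illustrative, not part of the notes.

import numpy as np

def kmeans(X, k, n_restarts=10, max_iter=100, seed=0):
    # Run k-means n_restarts times and keep the clustering with the lowest
    # distortion J(c, mu).
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Step 1: initialize the centroids as k randomly chosen training examples.
        mu = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Step (i): assign each example to its closest centroid.
            c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
            # Step (ii): move each centroid to the mean of the points assigned to it.
            new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                               for j in range(k)])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        J = ((X - mu[c]) ** 2).sum()  # distortion of this clustering
        if best is None or J < best[0]:
            best = (J, c, mu)
    return best  # (distortion, assignments, centroids)
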
Chapter 3

Linear Regression

Once we’ve acquired data with multiple variables, one very important question is how the
variables are related. For example, we could ask for the relationship between people’s weights
and heights, or study time and test scores, or two animal populations. Regression is a set
of techniques for estimating relationships, and we’ll focus on them for the next two chapters.
In this chapter, we’ll focus on finding one of the simplest type of relationship: linear. This
process is unsurprisingly called linear regression, and it has many applications. For exam-
ple, we can relate the force for stretching a spring and the distance that the spring stretches
(Hooke’s law, shown in Figure 3.1a), or explain how many transistors the semiconductor
industry can pack into a circuit over time (Moore’s law, shown in Figure 3.1b).
Despite its simplicity, linear regression is an incredibly powerful tool for analyzing data.
While we’ll focus on the basics in this chapter, the next chapter will show how just a few
small tweaks and extensions can enable more complex analyses.

(a) In classical mechanics, one could empirically verify Hooke’s law by dangling a mass
with a spring and seeing how much the spring is stretched; the plot of amount of stretch (mm)
against force on the spring (Newtons) has the fit y = −0.044 + 35 · x, r² = 0.999. (b) In the
semiconductor industry, Moore’s law is an observation that the number of transistors on an
integrated circuit doubles roughly every two years.

Figure 3.1: Examples of where a line fit explains physical phenomena and engineering feats.1

1
The Moore’s law image is by Wgsimon (own work) [CC-BY-SA-3.0 or GFDL], via Wikimedia Commons.


But just because fitting a line is easy doesn’t mean that it always makes sense. Let’s take
another look at Anscombe’s quartet to underscore this point.

Example: Anscombe’s Quartet Revisited


Recall Anscombe’s Quartet: 4 datasets with very similar statistical properties under a simple quanti-
tative analysis, but that look very different. Here they are again, but this time with linear regression
lines fitted to each one:

[Four scatter plots, one per dataset of Anscombe’s quartet, each shown with its fitted regression line.]

For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is
3.00 (to two decimal places). This just goes to show: visualizing data can often reveal patterns that are
hidden by pure numeric analysis!

We begin with simple linear regression in which there are only two variables of interest
(e.g., weight and height, or force used and distance stretched). After developing intuition
for this setting, we’ll then turn our attention to multiple linear regression, where there
are more variables.
Disclaimer: While some of the equations in this chapter might be a little intimidating,
it’s important to keep in mind that as a user of statistics, the most important thing is to
understand their uses and limitations. Toward this end, make sure not to get bogged down
in the details of the equations, but instead focus on understanding how they fit in to the big
picture.

 3.1 Simple linear regression

We’re going to fit a line y = β0 + β1 x to our data. Here, x is called the independent
variable or predictor variable, and y is called the dependent variable or response
variable.
Before we talk about how to do the fit, let’s take a closer look at the important quantities
from the fit:

• β1 is the slope of the line: this is one of the most important quantities in any linear
regression analysis. A value very close to 0 indicates little to no relationship; large
positive or negative values indicate large positive or negative relationships, respectively.
For our Hooke’s law example earlier, the slope is the spring constant2 .
2
Since the spring constant k is defined as F = −kx (where F is the force and x is the stretch), the slope
in Figure 3.1a is actually the inverse of the spring constant.


• β0 is the intercept of the line.

In order to actually fit a line, we’ll start with a way to quantify how good a line is. We’ll
then use this to fit the “best” line we can.
One way to quantify a line’s “goodness” is to propose a probabilistic model that generates
data from lines. Then the “best” line is the one for which data generated from the line is
“most likely”. This is a commonly used technique in statistics: proposing a probabilistic
model and using the probability of data to evaluate how good a particular model is. Let’s
make this more concrete.

A probabilistic model for linearly related data

We observe paired data points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), where we assume that as a func-
tion of xi , each yi is generated by using some true underlying line y = β0 + β1 x that we
evaluate at xi , and then adding some Gaussian noise. Formally,

yi = β0 + β1 xi + εi . (3.1)

Here, the noise εi represents the fact that our data won’t fit the model perfectly. We’ll model
εi as being Gaussian: ε ∼ N (0, σ 2 ). Note that the intercept β0 , the slope β1 , and the noise
variance σ 2 are all treated as fixed (i.e., deterministic) but unknown quantities.

Solving for the fit: least-squares regression

Assuming that this is actually how the data (x1 , y1 ), . . . , (xn , yn ) we observe are generated,
then it turns out that we can find the line for which the probability of the data is highest
by solving the following optimization problem3 :
min_{β0 ,β1} Σ_{i=1}^n [yi − (β0 + β1 xi )]² ,   (3.2)

where minβ0 ,β1 means “minimize over β0 and β1 ”. This is known as the least-squares linear
regression problem. Given a set of points, the solution is:
β̂1 = ( Σ_{i=1}^n xi yi − (1/n) Σ_{i=1}^n xi Σ_{i=1}^n yi ) / ( Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi)² )   (3.3)
   = r · sy / sx ,   (3.4)
β̂0 = ȳ − β̂1 x̄ ,   (3.5)
3
This is an important point: the assumption of Gaussian noise leads to squared error as our minimization
criterion. We’ll see more regression techniques later that use different distributions and therefore different
cost functions.


Figure 3.2: An illustration of correlation strength. Each plot shows data with a particular correla-
tion coefficient r (from left to right: r = −0.9, −0.2, 0.2, 0.9). Values farther from 0 indicate a
stronger relationship than values closer to 0. Negative values (left) indicate an inverse relationship,
while positive values (right) indicate a direct relationship.

where x̄, ȳ, sx and sy are the sample means and standard deviations for x values and y
values, respectively, and r is the correlation coefficient, defined as
r = (1/(n − 1)) Σ_{i=1}^n ((xi − x̄)/sx)((yi − ȳ)/sy) .   (3.6)

By examining the second equation for the estimated slope β̂1 , we see that since sample
standard deviations sx and sy are positive quantities, the correlation coefficient r, which is
always between −1 and 1, measures how much x is related to y and whether the trend is
positive or negative. Figure 3.2 illustrates different correlation strengths.
The square of the correlation coefficient r2 will always be positive and is called the coefficient
of determination. As we’ll see later, this also is equal to the proportion of the total
variability that’s explained by a linear model.
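
As a small illustration, the estimates in Equations (3.3)-(3.6) can be computed directly; the following NumPy sketch (with made-up data) is not part of the original text.

import numpy as np

def simple_linear_regression(x, y):
    # Least-squares estimates from Equations (3.3)-(3.5) and the correlation (3.6).
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    beta1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
    beta0 = y.mean() - beta1 * x.mean()
    r = np.sum((x - x.mean()) / x.std(ddof=1) * (y - y.mean()) / y.std(ddof=1)) / (n - 1)
    return beta0, beta1, r

# Made-up points roughly following y = 3 + 0.5x; r**2 is the fraction of
# variability explained by the fitted line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.6, 3.9, 4.4, 5.1, 5.4])
beta0, beta1, r = simple_linear_regression(x, y)
print(beta0, beta1, r ** 2)
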
As an extremely crucial remark, correlation does not imply causation! We devote the entire
next page to this point, which is one of the most common sources of error in interpreting
statistics.


Example: Correlation and Causation

Just because there’s a strong correlation between two variables, there isn’t necessarily a causal rela-
tionship between them. For example, drowning deaths and ice-cream sales are strongly correlated, but
that’s because both are affected by the season (summer vs. winter). In general, there are several possible
cases, as illustrated below:

(a) Causal link: Even if there is a causal link between x and y, correlation alone cannot tell us
whether y causes x or x causes y.
(b) Hidden cause: A hidden variable z causes both x and y, creating the correlation.
(c) Confounding factor: A hidden variable z and x both affect y, so the results also depend on
the value of z.
(d) Coincidence: The correlation just happened by chance (e.g. the strong correlation between
sun cycles and number of Republicans in Congress, as shown below).

(e) The number of Republican senators in congress (red) and the sunspot number (blue, before
1986)/inverted sunspot number (blue, after 1986). This figure comes from
http://www.realclimate.org/index.php/archives/2007/05/fun-with-correlations/.

Figure 3.3: Different explanations for correlation between two variables. In this diagram, arrows represent causation.


 3.2 Tests and Intervals

Recall from last time that in order to do hypothesis tests and compute confidence intervals,
we need to know our test statistic, its standard error, and its distribution. We’ll look at the
standard errors for the most important quantities and their interpretation. Any statistical
analysis software can compute these quantities automatically, so we’ll focus on interpreting
and understanding what comes out.
Warning: All the statistical tests here crucially depend on the assumption that the observed
data actually comes from the probabilistic model defined in Equation (3.1)!

 3.2.1 Slope
For the slope β1 , our test statistic is

tβ1 = (β̂1 − β1 ) / sβ1 ,   (3.7)

which has a Student’s t distribution with n − 2 degrees of freedom. The standard error of
the slope sβ1 is
sβ1 = σ̂ / √( Σ_{i=1}^n (xi − x̄)² )   (3.8)

(the sum in the denominator measures how close together the x values are)

and the mean squared error σ̂ 2 is


how large the errors are
z }| {
X n
(ŷi − yi )2
i=1
σ̂ 2 = (3.9)
n−2
These terms make intuitive sense: if the x-values are all really close together, it’s harder to
fit a line. This will also make our standard error sβ1 larger, so we’ll be less confident about
our slope. The standard error also gets larger as the errors grow, as we should expect it to:
larger errors should indicate a worse fit.
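
A small sketch of this test (assuming SciPy for the t distribution, and a previously fitted β̂0 and β̂1) might look like the following.

import numpy as np
from scipy import stats

def slope_t_test(x, y, beta0_hat, beta1_hat):
    # Test H0: beta1 = 0 using Equations (3.7)-(3.9).
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    y_hat = beta0_hat + beta1_hat * x
    sigma2_hat = np.sum((y_hat - y) ** 2) / (n - 2)              # mean squared error, (3.9)
    s_beta1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # standard error of the slope, (3.8)
    t = (beta1_hat - 0.0) / s_beta1                              # test statistic under H0, (3.7)
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)                   # two-sided p-value
    return t, p_value
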

 3.2.2 Intercept
For the intercept β0 , our test statistic is

tβ0 = (β̂0 − β0 ) / sβ0 ,   (3.10)


Figure 3.4: The test statistic tr for the correlation coefficient r, for n = 10 (blue) and n = 100 (green).

which is also t-distributed with n − 2 degrees of freedom. The standard error is


sβ0 = σ̂ · √( 1/n + x̄² / Σ_i (xi − x̄)² ) ,   (3.11)

and σ̂ is given by Equation (3.9).

 3.2.3 Correlation
For the correlation coefficient r, our test statistic is the standardized correlation
tr = r · √( (n − 2) / (1 − r²) ) ,   (3.12)
which is t-distributed with n − 2 degrees of freedom. Figure 3.4 plots tr against r.

 3.2.4 Prediction
Let’s look at the prediction at a particular value x∗ , which we’ll call ŷ(x∗ ). In particular:

ŷ(x∗ ) = β̂0 + β̂1 x∗ .

We can do this even if x∗ wasn’t in our original dataset.


Let’s introduce some notation that will help us distinguish between predicting the line versus
predicting a particular point generated from the model. From the probabilistic model given
by Equation (3.1), we can similarly write how y is generated for the new point x∗ :

y(x∗ ) = β0 + β1 x∗ + ε ,   (3.13)

where β0 + β1 x∗ is defined as µ(x∗ ) and ε ∼ N (0, σ² ).


Then it turns out that the standard error sµ̂ for estimating µ(x∗ ) (i.e., the mean of the line
at point x∗ ) using ŷ(x∗ ) is:
sµ̂ = σ̂ · √( 1/n + (x∗ − x̄)² / Σ_{i=1}^n (xi − x̄)² )

(the second term under the square root measures the distance from the “comfortable prediction region”)

This makes sense because if we’re trying to predict for a point that’s far from the mean,
then we should be less sure, and our prediction should have more variance. To compute the
standard error for estimating a particular point y(x∗ ) and not just its mean µ(x∗ ), we’d also
need to factor in the extra noise term ε in Equation (3.13):
sŷ = σ̂ · √( 1/n + (x∗ − x̄)² / Σ_i (xi − x̄)² + 1 )

(the added 1 accounts for the extra noise term ε)

While both of these quantities have the same value when computed from the data, when
analyzing them, we have to remember that they’re different random variables: ŷ has more
variation because of the extra ε.
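
A brief sketch of these two standard errors (the names are illustrative only, and σ̂ is assumed to come from Equation (3.9)):

import numpy as np

def prediction_standard_errors(x, x_star, sigma_hat):
    # Standard error for the mean of the line at x_star (s_mu_hat) and for a
    # single new observation there (s_y_hat), which adds the extra noise term.
    x = np.asarray(x, float)
    leverage = 1.0 / len(x) + (x_star - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    s_mu_hat = sigma_hat * np.sqrt(leverage)
    s_y_hat = sigma_hat * np.sqrt(leverage + 1.0)
    return s_mu_hat, s_y_hat
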

Interpolation vs. extrapolation

As a reminder, everything here crucially depends on the probabilistic model given by Equa-
tion (3.1) being true. In practice, when we do prediction for some value of x we haven’t seen
before, we need to be very careful. Predicting y for a value of x that is within the interval
of points that we saw in the original data (the data that we fit our model with) is called
interpolation. Predicting y for a value of x that’s outside the range of values we actually
saw for x in the original data is called extrapolation.
For real datasets, even if a linear fit seems appropriate, we need to be extremely careful
about extrapolation, which can often lead to false predictions!


Example: The perils of extrapolation

By fitting a line to the Rotten Tomatoes ratings for movies that M. Night Shyamalan directed over time,
one may erroneously be led to believe that in 2014 and onward, Shyamalan’s movies will have negative
ratings, which isn’t even possible!

[Scatter plot titled “M. Night Shyamalan's Rotten Tomatoes Movie Ratings”: Rating versus Year
(1995-2015), with the fitted line y = -4.6531x + 9367.7, R² = 0.6875.]

 3.3 Multiple Linear Regression

Now, let’s talk about the case when instead of just a single scalar value x, we have a vector
(x1 , . . . , xp ) for every data point i. So, we have n data points (just like before), each with p
different predictor variables or features. We’ll then try to predict y for each data point as
a linear function of the different x variables:

y = β1 x1 + β2 x2 + · · · + βp xp . (3.14)

Even though it’s still linear, this representation is very versatile; here are just a few of the
things we can represent with it:


• Multiple dependent variables: for example, suppose we’re trying to predict medical
outcome as a function of several variables such as age, genetic susceptibility, and clinical
diagnosis. Then we might say that for each patient, x1 = age, x2 = genetics, x3 =
diagnosis, and y = outcome.

• Nonlinearities: Suppose we want to predict a quadratic function y = ax2 + bx + c: then


for each data point we might say x1 = 1, x2 = x, and x3 = x2 . This can easily be
extended to any nonlinear function we want.

One may ask: why not just use multiple linear regression and fit an extremely high-degree
polynomial to our data? While the model then would be much richer, one runs the risk of
overfitting, where the model is so rich that it ends up fitting to the noise! We illustrate
this with an example; it’s also illustrated by a song4 .

Example: Overfitting
Using too many features or too complex of a model can often lead to overfitting. Suppose we want to
fit a model to the points in Figure 3.3(a). If we fit a linear model, it might look like Figure 3.3(b). But,
the fit isn’t perfect. What if we use our newly acquired multiple regression powers to fit a 6th order
polynomial to these points? The result is shown in Figure 3.3(c). While our errors are definitely smaller
than they were with the linear model, the new model is far too complex, and will likely go wrong for
values too far outside the range.

(a) A set of points with a simple linear relationship. (b) The same set of points with a linear fit
(blue). (c) The same points with a 6th-order polynomial fit (green). As before, the linear fit is
shown in blue.

We’ll talk a little more about this in Chapters 4 and 5.

We’ll represent our input data in matrix form as X, an n × p matrix where each row corre-
sponds to a data point and each column corresponds to a feature. Since each output yi is
just a single number, we’ll represent the collection as an n-element column vector y. Then
our linear model can be expressed as

y = Xβ + ε (3.15)
4
Machine Learning A Cappella, Udacity. https://www.youtube.com/watch?v=DQWI1kvmwRg


where β is a p-element vector of coefficients, and ε is an n-element vector where each element,
like εi earlier, is normal with mean 0 and variance σ 2 . Notice that in this version, we haven’t
explicitly written out a constant term like β0 from before. We’ll often add a column of
1s to the matrix X to accomplish this (try multiplying things out and making sure you
understand why this solves the problem). The software you use might do this automatically,
so it’s something worth checking in the documentation.
This leads to the following optimization problem:
min_β Σ_{i=1}^n (yi − Xi β)² ,   (3.16)

where minβ just means “find values of β that minimize the following”, and Xi refers to row
i of the matrix X.
We can use some basic linear algebra to solve this problem and find the optimal estimates:

β̂ = (X T X)−1 X T y, (3.17)

which most computer programs will do for you. Once we have this, what conclusions can
we make with the help of statistics? We can obtain confidence intervals and/or hypothesis
tests for each coefficient, which most statistical software will do for you. The test statistics
are very similar to their counterparts for simple linear regression.
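
As a sketch, Equation (3.17) can be evaluated directly; here a column of 1s is added for the constant term, and a linear solve is used instead of an explicit matrix inverse for numerical stability. The function name is illustrative.

import numpy as np

def fit_multiple_regression(X, y, add_intercept=True):
    # Solve the least-squares problem (3.16) via the normal equations (3.17):
    # beta_hat = (X^T X)^{-1} X^T y.
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    if add_intercept:
        X = np.column_stack([np.ones(len(X)), X])  # column of 1s plays the role of beta_0
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return beta_hat
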
It’s important not to blindly test whether all the coefficients are greater than zero: since this
involves doing multiple comparisons, we’d need to correct appropriately using Bonferroni
correction or FDR correction as described in the last chapter. But before even doing that,
it’s often smarter to measure whether the model even explains a significant amount of the
variability in the data: if it doesn’t, then it isn’t even worth testing any of the coefficients
individually. Typically, we’ll use an analysis of variance (ANOVA) test to measure this.
If the ANOVA test determines that the model explains a significant portion of the variability
in the data, then we can consider testing each of the hypotheses and correcting for multiple
comparisons.
We can also ask about which features have the most effect: if a feature’s coefficient is 0 or
close to 0, then that feature has little to no impact on the final result. We need to avoid the
effect of scale: for example, if one feature is measured in feet and another in inches, even if
they’re the same, the coefficient for the feet feature will be twelve times larger. In order to
avoid this problem, we’ll usually look at the standardized coefficients β̂k / sβ̂k .

 3.4 Model Evaluation

How can we measure the performance of our model? Suppose for a moment that every point
yi was very close to the mean ȳ: this would mean that each yi wouldn’t depend on xi , and
that there wasn’t much random error in the value either. Since we expect that this shouldn’t
be the case, we can try to understand how much the prediction from xi and random error


Figure 3.5: An illustration of the components contributing to the difference between the average
y-value ȳ and a particular point (xi , yi ) (blue). Some of the difference, ŷi − ȳ, can be explained by
the model (orange), and the remainder, yi − ŷi , is known as the residual (green).

contribute to yi . In particular, let’s look at how far yi is from the mean ȳ. We’ll write this
difference as:

yi − ȳ = (ŷi − ȳ) + (yi − ŷi )   (3.18)

where the first term is the difference explained by the model and the second term is the
difference not explained by the model.

In particular, the residual is defined to be yi − ŷi : the distance from the original data point
to the predicted value on the line. You can think of it as the error left over after the model
has done its work. This difference is shown graphically in Figure 3.5. Note that the residual
yi − ŷi isn’t quite the same as the noise ε! We’ll talk a little more about analyzing residuals
(and why this distinction matters) in the next chapter.
If our model is doing a good job, then it should explain most of the difference from ȳ, and
the first term should be bigger than the second term. If the second term is much bigger,
then the model is probably not as useful.
If we square the quantity on the left, work through some algebra, and use some facts about
linear regression, we’ll find that
Σi (yi − ȳ)²  =  Σi (ŷi − ȳ)²  +  Σi (yi − ŷi)²,        (3.19)

with the three sums denoted SStotal, SSmodel, and SSerror respectively,

where “SS” stands for “sum of squares”. These terms are often abbreviated as SST, SSM,
and SSE respectively.
If we divide through by SST, we obtain
1 = SSM/SST + SSE/SST = r² + (1 − r²),
where we note that r2 is precisely the coefficient of determination mentioned earlier. Here,
we see why r2 can be interpreted as the fraction of variability in the data that is explained
by the model.
One way we might evaluate a model’s performance is to compare the ratio SSM/SSE. We’ll
do this with a slight tweak: we’ll instead consider the mean values, MSM = SSM/(p − 1) and
MSE = SSE/(n − p), where the denominators correspond to the degrees of freedom. These
new variables MSM and MSE have χ2 distributions, and their ratio
f = MSM / MSE        (3.20)
has what’s known as an F distribution with parameters p − 1 and n − p. The widely used
ANOVA test for categorical data, which we’ll see in Chapter 6, is based on this F statistic:
it’s a way of measuring how much of the variability in the data is from the model and how
much is from random error, and comparing the two.
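A minimal sketch of this decomposition and the F statistic on the same kind of made-up data (the variable names here are illustrative assumptions, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 3                                  # n observations, p coefficients (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # equation (3.17)
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)             # SStotal
ssm = np.sum((y_hat - y.mean()) ** 2)         # SSmodel
sse = np.sum((y - y_hat) ** 2)                # SSerror; equation (3.19): sst = ssm + sse

r2 = ssm / sst                                # coefficient of determination
f = (ssm / (p - 1)) / (sse / (n - p))         # equation (3.20): MSM / MSE
p_value = stats.f.sf(f, p - 1, n - p)         # overall (ANOVA-style) F test
print(r2, f, p_value)
```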

LOGISTIC REGRESSION

Nitin R Patel

Logistic regression extends the ideas of multiple linear regression to the situation
where the dependent variable, y, is binary (for convenience we often code these values as
0 and 1). As with multiple linear regression the independent variables x1 , x2 · · · xk may
be categorical or continuous variables or a mixture of these two types.

Let us take some examples to illustrate [1]:

Example 1: Market Research

The data in Table 1 were obtained in a survey conducted by AT & T in the US


from a national sample of co-operating households. Interest was centered on the adoption

of a new telecommunications service as it related to education, residential stability and


income.

Table 1: Adoption of New Telephone Service

                       High School or below                              Some College or above
               No change in residence   Change in residence      No change in residence   Change in residence
               during last five years   during last five years   during last five years   during last five years
Low Income     153/2160 = 0.071         226/1137 = 0.199         61/886 = 0.069           233/1091 = 0.214
High Income    147/1363 = 0.108         139/547 = 0.254          287/1925 = 0.149         382/1415 = 0.270

(For fractions in cells above, the numerator is the number of adopters out of the
number in the denominator).

Note that the overall probability of adoption in the sample is 1628/10524 = 0.155.
However, the adoption probability varies depending on the categorical independent vari-
ables education, residential stability and income. The lowest value is 0.069 for low-income
no-residence-change households with some college education while the highest is 0.270 for

high-income residence changers with some college education.
The standard multiple linear regression model is inappropriate to model this data for
the following reasons:

1. The model’s predicted probabilities could fall outside the range 0 to 1.

2. The dependent variable is not normally distributed. In fact a binomial model would
be more appropriate. For example, if a cell total is 11 then this variable can take
on only 12 distinct values 0, 1, 2 · · · 11. Think of the response of the households in
a cell being determined by independent flips of a coin with, say, heads representing
adoption with the probability of heads varying between cells.

3. If we consider the normal distribution as an approximation for the binomial model,
   the variance of the dependent variable is not constant across all cells: it will be
   higher for cells where the probability of adoption, p, is near 0.5 than where it is
   near 0 or 1. It will also increase with the total number of households, n, falling in
   the cell. The variance equals np(1 − p).

The logistic regression model was developed to account for all these difficulties. It
has become very popular in describing choice behavior in econometrics and in modeling
risk factors in epidemiology. In the context of choice behavior it can be shown to follow

from the random utility theory developed by Manski [2] as an extension of the standard
economic theory of consumer behavior.
In essence the consumer theory states that when faced with a set of choices a consumer
makes a choice which has the highest utility ( a numeric measure of worth with arbitrary

zero and scale). It assumes that the consumer has a preference order on the list of choices
that satisfies reasonable criteria such as transitivity. The preference order can depend on
the individual (e.g. socioeconomic characteristics as in the Example 1 above) as well as

attributes of the choice. The random utility model considers the utility of a choice to
incorporate a random element. When we model the random element as coming from a
”reasonable” distribution, we can logically derive the logistic model for predicting choice
behavior.

If we let y = 1 represent choosing an option versus y = 0 for not choosing it, the
logistic regression model stipulates:

Probability(Y = 1 | x1, x2, ..., xk) = exp(β0 + β1 x1 + ... + βk xk) / [1 + exp(β0 + β1 x1 + ... + βk xk)]

where β0 , β1 , β2 · · · βk are unknown constants analogous to the multiple linear regression

model.

The independent variables for our model would be:

x1 ≡ Education: High School or below = 0, Some College or above = 1

x2 ≡ Residential Stability: No change over past five years = 0, Change over past five years = 1

x3 ≡ Income: Low = 0, High = 1

The data in Table 1 is shown below in the format typically required by regression

programs.

x1 x2 x3 # in sample #adopters # Non-adopters Fraction adopters


0 0 0 2160 153 2007 .071
0 0 1 1363 147 1216 .108
0 1 0 1137 226 911 .199
0 1 1 547 139 408 .254
1 0 0 886 61 825 .069
1 1 0 1091 233 858 .214
1 0 1 1925 287 1638 .149
1 1 1 1415 382 1033 .270
10524 1628 8896 1.000

The logistic model for this example is:

Prob(Y = 1 | x1, x2, x3) = exp(β0 + β1 x1 + β2 x2 + β3 x3) / [1 + exp(β0 + β1 x1 + β2 x2 + β3 x3)].
We obtain a useful interpretation for the coefficients by noting that:

exp(β0) = Prob(Y = 1 | x1 = x2 = x3 = 0) / Prob(Y = 0 | x1 = x2 = x3 = 0)
        = Odds of adopting in the base case (x1 = x2 = x3 = 0)

exp(β1) = Odds of adopting when x1 = 1, x2 = x3 = 0 / Odds of adopting in the base case

exp(β2) = Odds of adopting when x2 = 1, x1 = x3 = 0 / Odds of adopting in the base case

exp(β3) = Odds of adopting when x3 = 1, x1 = x2 = 0 / Odds of adopting in the base case

The logistic model is multiplicative in odds in the following sense:


Odds of adopting for a given x1, x2, x3

    = exp(β0) ∗ exp(β1 x1) ∗ exp(β2 x2) ∗ exp(β3 x3)

    = [Odds for base case] × [Factor due to x1] × [Factor due to x2] × [Factor due to x3]

If x1 = 1 the odds of adoption get multiplied by the same factor for any given level of

x2 and x3 . Similarly the multiplicative factors for x2 and x3 do not vary with the levels

of the remaining factors. The factor for a variable gives us the impact of the presence of
that factor on the odds of adopting.
If βi = 0, the presence of the corresponding factor has no effect (multiplication by
one). If βi < 0, presence of the factor reduces the odds (and the probability) of adoption,
whereas if βi > 0, presence of the factor increases the probability of adoption.
The computations required to produce these maximum likelihood estimates require
iterations using a computer program. The output of a typical program is shown below:

95% Conf. Intvl. for odds
Variable Coeff. Std. Error p-Value Odds Lower Limit Upper Limit
Constant -2.500 0.058 0.000 0.082 0.071 0.095
x1 0.161 0.058 0.006 1.175 1.048 1.316
x2 0.992 0.056 0.000 2.698 2.416 3.013
x3 0.444 0.058 0.000 1.560 1.393 1.746

From the estimated values of the coefficients, we see that the estimated probability of
adoption for a household with values x1 , x2 and x3 for the independent variables is:

Prob(Y = 1 | x1, x2, x3) = exp(−2.500 + 0.161 x1 + 0.992 x2 + 0.444 x3) / [1 + exp(−2.500 + 0.161 x1 + 0.992 x2 + 0.444 x3)].

The estimated number of adopters from this model will be the total number of households
with values x1 , x2 and x3 for the independent variables multiplied by the above probability.
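A small sketch (in Python) of this calculation, using the fitted coefficients from the output above and the cell sizes from the earlier table; it reproduces the estimated probabilities and adopter counts shown in the table that follows.

```python
import numpy as np

b = np.array([-2.500, 0.161, 0.992, 0.444])      # constant, x1, x2, x3 from the output above

# Each cell: (x1, x2, x3, number of households in the cell).
cells = [(0, 0, 0, 2160), (0, 0, 1, 1363), (0, 1, 0, 1137), (0, 1, 1, 547),
         (1, 0, 0, 886), (1, 1, 0, 1091), (1, 0, 1, 1925), (1, 1, 1, 1415)]

for x1, x2, x3, n in cells:
    z = b[0] + b[1] * x1 + b[2] * x2 + b[3] * x3
    p = np.exp(z) / (1 + np.exp(z))               # estimated Prob(Y = 1 | x1, x2, x3)
    print(x1, x2, x3, round(p, 3), round(n * p))  # estimated number of adopters in the cell
```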

The table below shows the estimated number of adopters for the various combinations

of the independent variables.

x1  x2  x3  # in sample  # adopters  Estimated # adopters  Fraction adopters  Estimated Prob(Y = 1 | x1, x2, x3)
0 0 0 2160 153 164 0.071 0.076
0 0 1 1363 147 155 0.108 0.113
0 1 0 1137 226 206 0.199 0.181
0 1 1 547 139 140 0.254 0.257
1 0 0 886 61 78 0.069 0.088
1 1 0 1091 233 225 0.214 0.206
1 0 1 1925 287 252 0.149 0.131
1 1 1 1415 382 408 0.270 0.289

In data mining applications we will have validation data that is a hold-out sample not
used in fitting the model.

Let us suppose we have the following validation data consisting of 598 households:

x1  x2  x3  # in validation sample  # adopters in validation sample  Estimated # adopters  Error (Estimate − Actual)  Absolute Value of Error
0 0 0 29 3 2.200 -0.800 0.800
0 0 1 23 7 2.610 -4.390 4.390
0 1 0 112 25 20.302 -4.698 4.698
0 1 1 143 27 36.705 9.705 9.705
1 0 0 27 2 2.374 0.374 0.374
1 1 0 54 12 11.145 -0.855 0.855
1 0 1 125 13 16.338 3.338 3.338
1 1 1 85 30 24.528 -5.472 5.472
Totals 598 119 116.202

The total error is -2.8 adopters, or a percentage error in estimating adopters of -2.8/119 ≈ -2.4%.

The average percentage absolute error is

(0.800 + 4.390 + 4.698 + 9.705 + 0.374 + 0.855 + 3.338 + 5.472) / 119 = 0.249 = 24.9% of adopters.

The confusion matrix for households in the validation data set is given below:

Observed
Adopters Non-adopters Total
Predicted:
Adopters 103 13 116
Non-adopters 16 466 482
Total 119 479 598

As with multiple linear regression we can build more complex models that reflect

interactions between independent variables by including factors that are calculated from
the interacting factors. For example, if we felt that there is an interactive effect between
x1 and x2 we would add an interaction term x4 = x1 × x2.

Example 2: Financial Conditions of Banks [2]
Table 2 gives data on a sample of banks. The second column records the judgment of
an expert on the financial condition of each bank. The last two columns give the values
of two ratios commonly used in financial analysis of banks.

Table 2: Financial Conditions of Banks

Financial Total Loans & Leases/ Total Expenses /


Obs Condition∗ Total Assets Total Assets
(y) (x1 ) (x2 )
1 1 0.64 0.13
2 1 1.04 0.10
3 1 0.66 0.11
4 1 0.80 0.09
5 1 0.69 0.11
6 1 0.74 0.14
7 1 0.63 0.12
8 1 0.75 0.12
9 1 0.56 0.16
10 1 0.65 0.12
11 0 0.55 0.10
12 0 0.46 0.08
13 0 0.72 0.08
14 0 0.43 0.08
15 0 0.52 0.07
16 0 0.54 0.08
17 0 0.30 0.09
18 0 0.67 0.07
19 0 0.51 0.09
20 0 0.79 0.13

∗ Financial Condition = 1 for financially weak banks;

= 0 for financially strong banks.

Let us first consider a simple logistic regression model with just one independent
variable. This is analogous to the simple linear regression model in which we fit a straight
line to relate the dependent variable, y, to a single independent variable, x.

Let us construct a simple logistic regression model for classification of banks using
the Total Loans & Leases to Total Assets ratio as the independent variable in our model.
This model would have the following variables:

Dependent variable:

Y = 1, if financially distressed,

= 0, otherwise.

Independent (or Explanatory) variable:

x1 = Total Loans & Leases/Total Assets Ratio

The equation relating the dependent variable with the explanatory variable is:

Prob(Y = 1 | x1) = exp(β0 + β1 x1) / [1 + exp(β0 + β1 x1)]

or, equivalently,

Odds (Y = 1 versus Y = 0) = exp(β0 + β1 x1).

The Maximum Likelihood Estimates of the coefficients for the model are: β̂0 = −6.926,

β̂1 = 10.989

So that the fitted model is:

Prob(Y = 1 | x1) = exp(−6.926 + 10.989 x1) / [1 + exp(−6.926 + 10.989 x1)].

Figure 1 displays the data points and the fitted logistic regression model.

We can think of the model as a multiplicative model of odds ratios as we did for
Example 1. The odds that a bank with a Loan & Leases/Assets Ratio that is zero will
be in financial distress = exp(−6.926) = 0.001. These are the base case odds. The
odds of distress for a bank with a ratio of 0.6 will increase by a multiplicative factor of

exp(10.989∗0.6) = 730 over the base case, so the odds that such a bank will be in financial
distress = 0.730.
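The arithmetic in this paragraph can be reproduced directly; a short sketch (the small differences from the quoted values come from rounding):

```python
import numpy as np

b0, b1 = -6.926, 10.989
base_odds = np.exp(b0)                  # odds of distress at a ratio of zero (about 0.001)
odds_06 = base_odds * np.exp(b1 * 0.6)  # multiplied by exp(10.989 * 0.6), roughly 730
prob_06 = odds_06 / (1 + odds_06)       # converting the odds back to a probability
print(base_odds, odds_06, prob_06)
```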
Notice that there is a small difference in interpretation of the multiplicative factors for
this example compared to Example 1. While the interpretation of the sign of βi remains
as before, its magnitude gives the amount by which the odds of Y = 1 against Y = 0

are changed for a unit change in xi . If we construct a simple logistic regression model for
classification of banks using the Total Expenses/ Total Assets ratio as the independent

variable we would have the following variables:

Dependent variable:

Y = 1, if financially distressed,

= 0, otherwise.

Independent (or Explanatory) variables:

x2 = Total Expenses/ Total Assets Ratio

The equation relating the dependent variable with the explanatory variable is:

Prob(Y = 1 | x2) = exp(β0 + β2 x2) / [1 + exp(β0 + β2 x2)]

or, equivalently,

Odds (Y = 1 versus Y = 0) = exp(β0 + β2 x2).

The Maximum Likelihood Estimates of the coefficients for the model are: β0 = −9.587,

β2 = 94.345

Figure 2 displays the data points and the fitted logistic regression model.

Computation of Estimates

As illustrated in Examples 1 and 2, estimation of coefficients is usually carried out


based on the principle of maximum likelihood which ensures good asymptotic (large sam-
ple) properties for the estimates. Under very general conditions maximum likelihood

estimators are:

• Consistent : the probability of the estimator differing from the true value approaches

zero with increasing sample size;

• Asymptotically Efficient : the variance is the smallest possible among consistent


estimators

• Asymptotically Normally Distributed: This allows us to compute confidence intervals


and perform statistical tests in a manner analogous to the analysis of linear multiple
regression models, provided the sample size is ’large’.

Algorithms to compute the coefficient estimates and confidence intervals are iterative
and less robust than algorithms for linear regression. Computed estimates are generally
reliable for well-behaved datasets where the number of observations with dependent
variable values of both 0 and 1 are ‘large’; their ratio is ‘not too close’ to either zero or

one; and when the number of coefficients in the logistic regression model is small relative
to the sample size (say, no more than 10%). As with linear regression collinearity (strong
correlation amongst the independent variables) can lead to computational difficulties.
Computationally intensive algorithms have been developed recently that circumvent some

of these difficulties [3].

Appendix A
Computing Maximum Likelihood Estimates and Confidence Intervals
for Regression Coefficients
We denote the coefficients by the p × 1 column vector β with row element i equal to βi.
The n observed values of the dependent variable will be denoted by the n × 1 column
vector y with row element j equal to yj; and the corresponding values of the independent
variable i by xij, for i = 1, ..., p and j = 1, ..., n.

Data : yj , x1j , x2j , · · · , xpj , j = 1, 2, · · · , n.


Likelihood Function: The likelihood function, L, is the probability of the observed

data viewed as a function of the parameters (the βi in a logistic regression).

L = ∏_{j=1}^{n} e^{yj (β0 + β1 x1j + β2 x2j + ... + βp xpj)} / [1 + e^{β0 + β1 x1j + β2 x2j + ... + βp xpj}]

  = ∏_{j=1}^{n} e^{Σi yj βi xij} / [1 + e^{Σi βi xij}]

  = e^{Σi (Σj yj xij) βi} / ∏_{j=1}^{n} [1 + e^{Σi βi xij}]

  = e^{Σi βi ti} / ∏_{j=1}^{n} [1 + e^{Σi βi xij}],

where ti = Σj yj xij.

These are the sufficient statistics for a logistic regression model analogous to ŷ and S in
linear regression.
Loglikelihood Function: This is the logarithm of the likelihood function,

l = Σi βi ti − Σj log[1 + e^{Σi βi xij}].

We find the maximum likelihood estimates, β̂i , of βi by maximizing the loglikelihood
function for the observed values of yj and xij in our data. Since maximizing the log of

a function is equivalent to maximizing the function, we often work with the loglikeli-
hood because it is generally less cumbersome to use for mathematical operations such as

differentiation.

Since the likelihood function can be shown to be concave, we will find the global max-
imum of the function (if it exists) by equating the partial derivatives of the loglikelihood
to zero and solving the resulting nonlinear equations for β̂i .

∂l/∂βi = ti − Σj xij e^{Σi βi xij} / [1 + e^{Σi βi xij}]
       = ti − Σj xij π̂j = 0,   i = 1, 2, ..., p,

or Σj xij π̂j = ti,

where π̂j = e^{Σi βi xij} / [1 + e^{Σi βi xij}] = E(Yj).

An intuitive way to understand these equations is to note that

Σj xij E(Yj ) = Σj xij yj

In words, the maximum likelihood estimates are such that the expected value of the
sufficient statistics are equal to their observed values.

Note : If the model includes the constant term xij = 1 for all j then Σj E(Yj ) = Σj yj , i.e.
the expected number of successes (responses of one) using MLE estimates of βi equals the
observed number of successes. The β̂i ’s are consistent, asymptotically efficient and follow
a multivariate Normal distribution (subject to mild regularity conditions).

Algorithm : A popular algorithm for computing β̂i uses the Newton-Raphson method
for maximizing twice differentiable functions of several variables (see Appendix B).

The Newton-Raphson method involves computing the following successive approximations
to find β̂, the maximizer of the loglikelihood function:

β^{t+1} = β^t + [I(β^t)]^{-1} ∇l(β^t)

where I(β) is the (observed) information matrix with elements

Iij = −∂²l / (∂βi ∂βj)

• On convergence, the diagonal elements of I(β t )−1 give squared standard errors (ap-
proximate variance) for β̂i .

• Confidence intervals and hypothesis tests are based on asymptotic normal distribu-

tion of β̂i .

The loglikelihood function is always negative and does not have a maximum when it

can be made arbitrarily close to zero. In that case the likelihood function can be made
arbitrarily close to one and the first term of the loglikelihood function given above
approaches infinity. In this situation the predicted probabilities for observations
with yj = 0 can be made arbitrarily close to 0 and those for yj = 1 can be made
arbitrarily close to 1 by choosing suitable very large absolute values of some βi . This
is the situation when we have a perfect model (at least in terms of the training data

set)! This phenomenon is more likely to occur when the number of parameters is a
large fraction (say > 20%) of the number of observations.

Appendix B
The Newton-Raphson Method

This method finds the values of βi that maximize a twice differentiable concave

function, g(β). If the function is not concave, it finds a local maximum. The
method uses successive quadratic approximations to g based on Taylor series. It
converges rapidly if the starting value, β 0 , is reasonably close to the maximizing
value, β̂, of β.

The gradient vector ∇g and the Hessian matrix H, as defined below, are used to update
an estimate β^t to β^{t+1}: ∇g(β^t) is the column vector with elements ∂g/∂βi, and H(β^t)
is the matrix with elements ∂²g/(∂βi ∂βk), both evaluated at β = β^t.
The Taylor series expansion around β t gives us:

g(β) ≈ g(β^t) + ∇g(β^t)′(β − β^t) + (1/2)(β − β^t)′ H(β^t)(β − β^t)

Provided −H(β^t) is positive definite (i.e., H(β^t) is negative definite), the maximum of this
approximation occurs when its derivative is zero:

∇g(β^t) + H(β^t)(β − β^t) = 0

or

β = β t − [H(β t )]−1 ∇g(β t ).

This gives us a way to compute β t+1 , the next value in our iterations.

β^{t+1} = β^t − [H(β^t)]^{-1} ∇g(β^t).

To use this equation H should be non-singular. This is generally not a problem


although sometimes numerical difficulties can arise due to collinearity.

Near the maximum the rate of convergence is quadratic as it can be shown that

|βi^{t+1} − β̂i| ≤ c |βi^t − β̂i|²  for some c ≥ 0 when βi^t is near β̂i, for all i.
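A compact sketch of this iteration applied to the logistic loglikelihood on synthetic data. It uses the fact that the Hessian of l is −X′WX with W = diag(π̂j(1 − π̂j)), so the Newton step can be computed in a few lines; the data and names here are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # includes a constant term
beta_true = np.array([-0.5, 1.0, 2.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(p)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ beta))                 # pi_hat_j = E(Y_j)
    grad = X.T @ (y - pi)                            # score: element i is t_i - sum_j x_ij pi_hat_j
    hess = -(X * (pi * (1 - pi))[:, None]).T @ X     # Hessian of the loglikelihood
    step = np.linalg.solve(hess, grad)               # Newton-Raphson: beta - H^{-1} grad
    beta = beta - step
    if np.max(np.abs(step)) < 1e-8:
        break

std_err = np.sqrt(np.diag(np.linalg.inv(-hess)))     # approximate standard errors for beta_hat
print(beta, std_err)
```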

CS229 Lecture notes
Andrew Ng

Part XI
Principal components analysis
In our discussion of factor analysis, we gave a way to model data x ∈ Rn as
“approximately” lying in some k-dimension subspace, where k ≪ n. Specif-
ically, we imagined that each point x(i) was created by first generating some
z (i) lying in the k-dimension affine space {Λz + µ; z ∈ Rk }, and then adding
Ψ-covariance noise. Factor analysis is based on a probabilistic model, and
parameter estimation used the iterative EM algorithm.
In this set of notes, we will develop a method, Principal Components
Analysis (PCA), that also tries to identify the subspace in which the data
approximately lies. However, PCA will do so more directly, and will require
only an eigenvector calculation (easily done with the eig function in Matlab),
and does not need to resort to EM.
Suppose we are given a dataset {x(i) ; i = 1, . . . , m} of attributes of m dif-
ferent types of automobiles, such as their maximum speed, turn radius, and
so on. Let x(i) ∈ Rn for each i (n ≪ m). But unknown to us, two different
attributes—some xi and xj —respectively give a car’s maximum speed mea-
sured in miles per hour, and the maximum speed measured in kilometers per
hour. These two attributes are therefore almost linearly dependent, up to
only small differences introduced by rounding off to the nearest mph or kph.
Thus, the data really lies approximately on an n − 1 dimensional subspace.
How can we automatically detect, and perhaps remove, this redundancy?
For a less contrived example, consider a dataset resulting from a survey of
(i)
pilots for radio-controlled helicopters, where x1 is a measure of the piloting
(i)
skill of pilot i, and x2 captures how much he/she enjoys flying. Because
RC helicopters are very difficult to fly, only the most committed students,
ones that truly enjoy flying, become good pilots. So, the two attributes
x1 and x2 are strongly correlated. Indeed, we might posit that the


data actually lies along some diagonal axis (the u1 direction) capturing the
intrinsic piloting “karma” of a person, with only a small amount of noise
lying off this axis. (See figure.) How can we automatically compute this u1
direction?

[Figure: data points scattered along a diagonal direction u1 in the plane of x1 (skill) and x2 (enjoyment), with u2 the orthogonal direction.]

We will shortly develop the PCA algorithm. But prior to running PCA
per se, typically we first pre-process the data to normalize its mean and
variance, as follows:

1. Let µ = (1/m) Σ_{i=1}^m x^(i).

2. Replace each x^(i) with x^(i) − µ.

3. Let σj² = (1/m) Σi (xj^(i))².

4. Replace each xj^(i) with xj^(i)/σj.

Steps (1-2) zero out the mean of the data, and may be omitted for data
known to have zero mean (for instance, time series corresponding to speech
or other acoustic signals). Steps (3-4) rescale each coordinate to have unit
variance, which ensures that different attributes are all treated on the same
“scale.” For instance, if x1 was cars’ maximum speed in mph (taking values
in the high tens or low hundreds) and x2 were the number of seats (taking
values around 2-4), then this renormalization rescales the different attributes
to make them more comparable. Steps (3-4) may be omitted if we had
a priori knowledge that the different attributes are all on the same scale. One
example of this is if each data point represented a grayscale image, and each
xj^(i) took a value in {0, 1, . . . , 255} corresponding to the intensity value of
pixel j in image i.
Now, having carried out the normalization, how do we compute the “ma-
jor axis of variation” u—that is, the direction on which the data approxi-
mately lies? One way to pose this problem is as finding the unit vector u so
that when the data is projected onto the direction corresponding to u, the
variance of the projected data is maximized. Intuitively, the data starts off
with some amount of variance/information in it. We would like to choose a
direction u so that if we were to approximate the data as lying in the direc-
tion/subspace corresponding to u, as much as possible of this variance is still
retained.
Consider the following dataset, on which we have already carried out the
normalization steps:

Now, suppose we pick u to correspond to the direction shown in the


figure below. The circles denote the projections of the original data onto this
line.
We see that the projected data still has a fairly large variance, and the
points tend to be far from zero. In contrast, suppose we had instead picked the
following direction:


Here, the projections have a significantly smaller variance, and are much
closer to the origin.
We would like to automatically select the direction u corresponding to
the first of the two figures shown above. To formalize this, note that given a

unit vector u and a point x, the length of the projection of x onto u is given
by xT u. I.e., if x(i) is a point in our dataset (one of the crosses in the plot),
then its projection onto u (the corresponding circle in the figure) is distance
xT u from the origin. Hence, to maximize the variance of the projections, we
would like to choose a unit-length u so as to maximize:
(1/m) Σ_{i=1}^m (x^(i)T u)²  =  (1/m) Σ_{i=1}^m uT x^(i) x^(i)T u  =  uT [ (1/m) Σ_{i=1}^m x^(i) x^(i)T ] u.

We easily recognize that maximizing this subject to ||u||² = 1 gives the
principal eigenvector of Σ = (1/m) Σ_{i=1}^m x^(i) x^(i)T, which is just the empirical
covariance matrix of the data (assuming it has zero mean).1
To summarize, we have found that if we wish to find a 1-dimensional
subspace with which to approximate the data, we should choose u to be the
principal eigenvector of Σ. More generally, if we wish to project our data
into a k-dimensional subspace (k < n), we should choose u1 , . . . , uk to be the
top k eigenvectors of Σ. The ui ’s now form a new, orthogonal basis for the
data.2
Then, to represent x^(i) in this basis, we need only compute the corresponding vector

    y^(i) = ( u1T x^(i), u2T x^(i), . . . , ukT x^(i) )T ∈ R^k.
Thus, whereas x(i) ∈ Rn , the vector y (i) now gives a lower, k-dimensional,
approximation/representation for x(i) . PCA is therefore also referred to as
a dimensionality reduction algorithm. The vectors u1 , . . . , uk are called
the first k principal components of the data.
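A small sketch of the whole procedure (mean/variance normalization, covariance matrix, top-k eigenvectors, projection) on made-up data; the nearly redundant second attribute plays the role of the mph/kph example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                            # m = 100 examples, n = 5 attributes
X[:, 1] = 1.6 * X[:, 0] + 0.01 * rng.normal(size=100)    # nearly redundant attribute
k = 2

X = X - X.mean(axis=0)                                   # steps 1-2: zero out the mean
X = X / X.std(axis=0)                                    # steps 3-4: unit variance per coordinate

Sigma = (X.T @ X) / X.shape[0]                           # empirical covariance (1/m) sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(Sigma)                 # eigh, since Sigma is symmetric
U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]            # columns u_1, ..., u_k: principal components

Y = X @ U                                                # y_i = (u_1^T x_i, ..., u_k^T x_i)
print(Y.shape)                                           # (100, 2): the reduced representation
```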
Remark. Although we have shown it formally only for the case of k = 1, using
well-known properties of eigenvectors it is straightforward to show that of all
possible orthogonal bases u1, . . . , uk, the one that we have chosen maximizes
Σi ||y^(i)||². Thus, our choice of a basis preserves as much variability as possible
in the original data.

Footnote 1: If you haven't seen this before, try using the method of Lagrange multipliers to
maximize uT Σu subject to uT u = 1. You should be able to show that Σu = λu, for some
λ, which implies u is an eigenvector of Σ, with eigenvalue λ.

Footnote 2: Because Σ is symmetric, the ui's will (or always can be chosen to be) orthogonal to
each other.
In problem set 4, you will see that PCA can also be derived by picking
the basis that minimizes the approximation error arising from projecting the
data onto the k-dimensional subspace spanned by them.
PCA has many applications; we will close our discussion with a few exam-
ples. First, compression—representing x(i) ’s with lower dimension y (i) ’s—is
an obvious application. If we reduce high dimensional data to k = 2 or 3 di-
mensions, then we can also plot the y (i) ’s to visualize the data. For instance,
if we were to reduce our automobiles data to 2 dimensions, then we can plot
it (one point in our plot would correspond to one car type, say) to see what
cars are similar to each other and what groups of cars may cluster together.
Another standard application is to preprocess a dataset to reduce its
dimension before running a supervised learning algorithm with the
x(i) ’s as inputs. Apart from computational benefits, reducing the data’s
dimension can also reduce the complexity of the hypothesis class considered
and help avoid overfitting (e.g., linear classifiers over lower dimensional input
spaces will have smaller VC dimension).
Lastly, as in our RC pilot example, we can also view PCA as a noise re-
duction algorithm. In our example it estimates the intrinsic “piloting karma”
from the noisy measures of piloting skill and enjoyment. In class, we also saw
the application of this idea to face images, resulting in the eigenfaces method.
Here, each point x(i) ∈ R100×100 was a 10000 dimensional vector, with each co-
ordinate corresponding to a pixel intensity value in a 100x100 image of a face.
Using PCA, we represent each image x(i) with a much lower-dimensional y (i) .
In doing so, we hope that the principal components we found retain the inter-
esting, systematic variations between faces that capture what a person really
looks like, but not the “noise” in the images introduced by minor lighting
variations, slightly different imaging conditions, and so on. We then measure
distances between faces i and j by working in the reduced dimension, and
computing ||y (i) − y (j) ||2 . This resulted in a surprisingly good face-matching
and retrieval algorithm.
Recurrent Neural Networks
Neural Computation : Lecture 12

© John A. Bullinaria, 2015

1. Recurrent Neural Network Architectures

2. State Space Models and Dynamical Systems

3. Backpropagation Through Time

4. The NARX Model

5. Computational Power of Recurrent Networks

6. Applications – Reading Model, Associative Memory

7. Hopfield Networks and Boltzmann Machines


Recurrent Neural Network Architectures
The fundamental feature of a Recurrent Neural Network (RNN) is that the network
contains at least one feed-back connection, so the activations can flow round in a loop.
That enables the networks to do temporal processing and learn sequences, e.g., perform
sequence recognition/reproduction or temporal association/prediction.

Recurrent neural network architectures can have many different forms. One common
type consists of a standard Multi-Layer Perceptron (MLP) plus added loops. These can
exploit the powerful non-linear mapping capabilities of the MLP, and also have some
form of memory. Others have more uniform structures, potentially with every neuron
connected to all the others, and may also have stochastic activation functions.

For simple architectures and deterministic activation functions, learning can be achieved
using similar gradient descent procedures to those leading to the back-propagation
algorithm for feed-forward networks. When the activations are stochastic, simulated
annealing approaches may be more appropriate. The following will look at a few of the
most important types and features of recurrent networks.

A Fully Recurrent Network
The simplest form of fully recurrent neural network is an MLP with the previous set of
hidden unit activations feeding back into the network along with the inputs:

[Diagram: the inputs x(t) and the delayed hidden activations h(t−1) both feed into the hidden
units, which produce h(t); h(t) drives the outputs and is also passed through a delay unit so that
it is available as h(t−1) at the next time step.]

Note that the time t has to be discretized, with the activations updated at each time step.
The time scale might correspond to the operation of real neurons, or for artificial systems
any time step size appropriate for the given problem can be used. A delay unit needs to
be introduced to hold activations until they are processed at the next time step.

The State Space Model
If the neural network inputs and outputs are the vectors x(t) and y(t), the three connection
weight matrices are W IH, W HH and WHO, and the hidden and output unit activation
functions are fH and fO, the behaviour of the recurrent network can be described as a
dynamical system by the pair of non-linear matrix equations:

h(t) = fH( WIH x(t) + WHH h(t−1) )

y(t) = fO( WHO h(t) )

In general, the state of a dynamical system is a set of values that summarizes all the
information about the past behaviour of the system that is necessary to provide a unique

description of its future behaviour, apart from the effect of any external factors. In this
case the state is defined by the set of hidden unit activations h(t).

Thus, in addition to the input and output spaces, there is also a state space. The order of
the dynamical system is the dimensionality of the state space, the number of hidden units.
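A minimal sketch of these two state-space equations for one pass over an input sequence; the weight values, dimensions and the choice of tanh/linear activation functions are arbitrary placeholders.

```python
import numpy as np

def rnn_forward(xs, W_IH, W_HH, W_HO):
    """Iterate h(t) = f_H(W_IH x(t) + W_HH h(t-1)) and y(t) = f_O(W_HO h(t))."""
    h = np.zeros(W_HH.shape[0])             # initial state h(0)
    ys = []
    for x in xs:
        h = np.tanh(W_IH @ x + W_HH @ h)    # hidden activation function f_H = tanh
        ys.append(W_HO @ h)                 # output activation f_O taken as linear here
    return np.array(ys), h

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
W_IH = rng.normal(scale=0.5, size=(n_hid, n_in))
W_HH = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_HO = rng.normal(scale=0.5, size=(n_out, n_hid))
ys, h_final = rnn_forward(rng.normal(size=(10, n_in)), W_IH, W_HH, W_HO)
print(ys.shape)                             # one output vector per time step
```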

Stability, Controllability and Observability
Since one can think about recurrent networks in terms of their properties as dynamical
systems, it is natural to ask about their stability, controllability and observability:

Stability concerns the boundedness over time of the network outputs, and the response of
the network outputs to small changes (e.g., to the network inputs or weights).

Controllability is concerned with whether it is possible to control the dynamic behaviour.


A recurrent neural network is said to be controllable if an initial state is steerable to any
desired state within a finite number of time steps.

Observability is concerned with whether it is possible to observe the results of the control
applied. A recurrent network is said to be observable if the state of the network can be
determined from a finite set of input/output measurements.

A rigorous treatment of these issues is way beyond the scope of this module!

Learning and Universal Approximation
The Universal Approximation Theorem tells us that:

Any non-linear dynamical system can be approximated to any accuracy by a


recurrent neural network, with no restrictions on the compactness of the state
space, provided that the network has enough sigmoidal hidden units.

This underlies the computational power of recurrent neural networks.

However, knowing that a recurrent neural network can approximate any dynamical
system does not tell us how to achieve it. As with feed-forward neural networks, we
generally want them to learn from a set of training data to perform appropriately.

We can use Continuous Training, for which the network state is never reset during
training, or Epochwise Training, which has the network state reset at each epoch. We
now look at one particular learning approach that works for both training modes.

Unfolding Over Time
The recurrent network can be converted into a feed-forward network by unfolding over time:

[Diagram: the unfolded network stacks copies of the hidden layer for times t−2, t−1 and t; the
inputs at each time enter through WIH, each hidden layer feeds the next through WHH, and the
final hidden layer drives the outputs through WOH.]

This means all the earlier theory about feed-forward network learning follows through.

Backpropagation Through Time
The Backpropagation Through Time (BPTT) learning algorithm is a natural extension of
standard backpropagation that performs gradient descent on a complete unfolded network.

If a network training sequence starts at time t0 and ends at time t1, the total cost function is
simply the sum over time of the standard error function Esse/ce(t) at each time-step:
Etotal(t0, t1) = Σ_{t=t0}^{t1} Esse/ce(t)

and the gradient descent weight updates have contributions from each time-step:

Δwij = −η ∂Etotal(t0, t1)/∂wij = −η Σ_{t=t0}^{t1} ∂Esse/ce(t)/∂wij

The constituent partial derivatives ∂Esse/ce/∂wij now have contributions from the multiple
instances of each weight wij ∈ {WIH, WHH} and depend on the inputs and hidden unit
activations at previous time steps. The errors now have to be back-propagated through
time as well as through the network.
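A sketch of one BPTT pass for the fully recurrent network above, under some simplifying assumptions (tanh hidden units, a linear output layer, sum-of-squares error, no truncation); it unfolds the forward pass, then propagates the errors back through both the network and time.

```python
import numpy as np

def bptt(xs, targets, W_IH, W_HH, W_HO):
    """Return gradients of a sum-of-squares error w.r.t. W_IH, W_HH, W_HO."""
    T, n_hid = len(xs), W_HH.shape[0]
    hs = [np.zeros(n_hid)]                                # h(0)
    for x in xs:                                          # forward pass, storing all states
        hs.append(np.tanh(W_IH @ x + W_HH @ hs[-1]))
    gIH, gHH, gHO = np.zeros_like(W_IH), np.zeros_like(W_HH), np.zeros_like(W_HO)
    delta_next = np.zeros(n_hid)                          # dE/da(t+1); zero beyond the last step
    for t in range(T, 0, -1):                             # backward through time
        err = W_HO @ hs[t] - targets[t - 1]               # y(t) - target(t)
        gHO += np.outer(err, hs[t])
        dh = W_HO.T @ err + W_HH.T @ delta_next           # error arriving at h(t)
        delta = dh * (1 - hs[t] ** 2)                     # through the tanh derivative
        gIH += np.outer(delta, xs[t - 1])
        gHH += np.outer(delta, hs[t - 1])
        delta_next = delta
    return gIH, gHH, gHO

rng = np.random.default_rng(0)
W_IH, W_HH, W_HO = (0.3 * rng.normal(size=s) for s in [(4, 2), (4, 4), (1, 4)])
xs, targets = rng.normal(size=(8, 2)), rng.normal(size=(8, 1))
eta = 0.1
grads = bptt(xs, targets, W_IH, W_HH, W_HO)
W_IH, W_HH, W_HO = (W - eta * g for W, g in zip((W_IH, W_HH, W_HO), grads))
```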

Practical Considerations for BPTT
Even for the relatively simple recurrent network shown above, the unfolded network is
quite complex, and keeping track of all the components at different points of time can
become unwieldy. Most useful networks are even more problematic in this regard.

Typically the updates are made in an online fashion with the weights being updated at
each time step. This requires storage of the history of the inputs and past network states
at earlier times. For this to be computationally feasible, it requires truncation at a
certain number of time steps, with the earlier information being ignored.

Assuming the network is stable, the contributions to the weight updates should become
smaller the further back in time they come from. This is because they depend on higher
powers of small feedback strengths (corresponding to the sigmoid derivatives multiplied
by the feedback weights). This means that truncation is not as problematic as it sounds,
though many time steps may be needed in practice (e.g., ~30). Despite its complexity,
BPTT has been shown many times to be an effective learning algorithm.

Simple Recurrent Networks
Truncating the unfolded network to just one time step reduces it to a so-called Simple
Recurrent Network (which is also commonly known as an Elman network):

[Diagram: the inputs at time t and the hidden unit activations from time t−1 both feed the hidden
units, through WIH and WHH respectively; the hidden units drive the outputs through WOH.]

In this case, each set of weights now appears only once, so it is possible to apply the
gradient descent approach using the standard backpropagation algorithm rather than full
BPTT. This means that the error signal will not get propagated back very far, and it will
be difficult for the network to learn how to use information from far back in time. In
practice, this approximation proves to be too great for many practical applications.

The NARX Model
An alternative RNN formulation has a single input and a single output, with a delay line
on the inputs, and the outputs fed back to the input by another delay line. This is known
as the Non-linear Auto-Regressive with eXogeneous inputs (NARX) model:

[Diagram: a Multi-Layer Perceptron whose inputs are the current input x(t), delayed inputs
x(t−1), x(t−2), and delayed outputs y(t), y(t−1), y(t−2); its output is the prediction y(t+1),
which is fed back through the output delay line.]

It is particularly effective for Time Series Prediction, where the target y(t+1) is x(t+1).
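A sketch of how the NARX input rows might be assembled for time-series prediction; the lag depths and the example series are arbitrary choices for illustration.

```python
import numpy as np

def narx_rows(x, y, x_lags=3, y_lags=3):
    """Build rows [x(t), x(t-1), ..., y(t), y(t-1), ...] paired with targets y(t+1)."""
    rows, targets = [], []
    for t in range(max(x_lags, y_lags) - 1, len(x) - 1):
        past_x = x[t - x_lags + 1:t + 1][::-1]     # x(t), x(t-1), ... from the input delay line
        past_y = y[t - y_lags + 1:t + 1][::-1]     # y(t), y(t-1), ... from the output delay line
        rows.append(np.concatenate([past_x, past_y]))
        targets.append(y[t + 1])                   # the value the MLP is trained to predict
    return np.array(rows), np.array(targets)

# For pure time-series prediction the fed-back outputs are the series itself, so y = x.
series = np.sin(np.linspace(0, 20, 200))
inputs, targets = narx_rows(series, series)
print(inputs.shape, targets.shape)
```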

Computational Power of Recurrent Networks
Recurrent neural networks exemplified by the fully recurrent network and the NARX
model have an inherent ability to simulate finite state automata. Automata represent
abstractions of information processing devices such as computers. The computational
power of a recurrent network is embodied in two main theorems:

Theorem 1

All Turing machines may be simulated by fully connected recurrent networks built of neurons
with sigmoidal activation functions.
Proof: Siegelmann & Sontag, 1991, Applied Mathematics Letters, vol 4, pp 77-80.

Theorem 2

NARX Networks with one layer of hidden neurons with bounded, one sided saturated (BOSS)
activation functions and a linear output neuron can simulate fully connected recurrent networks
with bounded one-sided saturated activation functions, except for a linear slowdown.
Proof: Siegelmann et al., 1997, Systems, Man and Cybernetics, Part B, vol 27, pp 208-215.

Application – Triangular Model of Reading
Any realistic model of reading must consider the interaction of orthography (letters),
phonology (sounds) and semantics (meanings). A recurrent neural network of the form:

[Diagram: a triangle linking PHONOLOGY, SEMANTICS and ORTHOGRAPHY, with connections in both directions between each pair]

is required. The various sub-tasks (e.g., reading and spelling) need to be trained in line
with human experience, and the dominant emergent pathways are found to depend on the
nature of the language (e.g., English versus Chinese).

Application – Associative Memory
Associative Memory is the general idea of accessing memory through content rather than
by address or location. One approach to this is for the memory content to be the pattern
of activations on the nodes of a recurrent neural network. The idea is to start the network
off with a pattern of activations that is a partial or noisy representation of the required
memory content, and have the network settle down to the required content.

The basic associative memory problem can be stated as:

Store a set of P binary valued patterns {tp} = {tip} in such a way that, when it is
presented with a new pattern s = {si}, the system (e.g., recurrent neural network)
responds by producing whichever stored pattern tp most closely resembles s.

One could have separate input and output nodes to achieve this, or one could have a single
set of nodes that act both as inputs and outputs. Either way, one needs to specify suitable
node activation functions, and weight definition or update equations, so that the recurrent
connections allow the learned memories to be accessed.

Hopfield Networks
The Hopfield Network or Hopfield Model is one good way to implement an associative
memory. It is simply a fully connected recurrent network of N McCulloch-Pitts neurons.
Activations are normally ±1, rather than 0 and 1, so the neuron activation equation is:

xi = sgn( Σj wij xj − θi ),   where sgn(x) = +1 if x ≥ 0 and −1 if x < 0.

Unlike the earlier feed-forward McCulloch-Pitts networks, the activations here depend on
time, because the activations keep changing till they have all settled down to some stable

pattern. Those activations can be updated either synchronously or asynchronously.

It can be shown that the required associative memory can be achieved by simply setting
the weights wij and thresholds θj in relation to the target outputs tp as follows:

wij = (1/N) Σ_{p=1}^P ti^p tj^p,     θi = 0

A stored pattern tq will then be stable if the neuron activations are not changing, i.e.

ti^q = sgn( Σj wij tj^q − θi ) = sgn( (1/N) Σj Σp ti^p tj^p tj^q ),

which is best analyzed by separating out the q term from the p sum to give

ti^q = sgn( ti^q + (1/N) Σj Σ_{p≠q} ti^p tj^p tj^q ).

If the second term in this is zero, it is clear that pattern number q is stable. It will also be
stable if the second term is non-zero but has magnitude less than 1, because that will not

be enough to move the argument over the step of the sign function sgn(x). In practice,
this happens in most cases of interest as long as the number of stored patterns P is small
enough. Moreover, not only will the stored patterns be stable, but they will also be
attractors of patterns close by. Estimates of what constitutes a small enough number P
leads to the idea of the storage capacity of a Hopfield network. A full discussion of
Hopfield Networks can be found in most introductory books on neural networks.
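A small sketch of a Hopfield associative memory using the weight rule above (zero thresholds; setting the self-connections wii to zero is an extra convention adopted here) with asynchronous sign updates:

```python
import numpy as np

def hopfield_weights(patterns):
    """w_ij = (1/N) sum_p t_i^p t_j^p, theta_i = 0; self-connections zeroed by convention."""
    N = patterns.shape[1]
    W = (patterns.T @ patterns) / N
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, start, sweeps=10, seed=0):
    """Asynchronous updates x_i = sgn(sum_j w_ij x_j) until the activations settle."""
    rng = np.random.default_rng(seed)
    x = start.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(x)):
            x[i] = 1 if W[i] @ x >= 0 else -1
    return x

rng = np.random.default_rng(1)
stored = rng.choice([-1, 1], size=(3, 50)).astype(float)   # P = 3 patterns, N = 50 units
W = hopfield_weights(stored)
probe = stored[0].copy()
probe[:8] *= -1                                            # a noisy version of pattern 0
print(np.array_equal(recall(W, probe), stored[0]))         # expected to settle back to pattern 0
```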

Boltzmann Machines
A Boltzmann Machine is a variant of the Hopfield Network composed of N units with
activations {xi}. The state of each unit i is updated asynchronously according to the rule:

xi = +1 with probability pi, and −1 with probability 1 − pi,

where

pi = 1 / [ 1 + exp( −( Σ_{j=1}^N wij xj − θi ) / T ) ]

with positive temperature constant T, network weights wij and thresholds θi.

The fundamental difference between the Boltzmann Machine and a standard Hopfield
Network is the stochastic activation of the units. If T is very small, the dynamics of the
Boltzmann Machine approximates the dynamics of the discrete Hopfield Network, but
when T is large, the network visits the whole state space. Another difference is that the
nodes of a Boltzmann Machine are split between visible input and output nodes, and
hidden nodes, and the aim is to have the machine learn input-output mappings.

Training proceeds by updating the weights using the Boltzmann learning algorithm
Δwij = (η/T) ( ⟨xi xj⟩fixed − ⟨xi xj⟩free )

where ⟨xi xj⟩fixed is the expected/average product of xi and xj during training with the input
and output nodes fixed at a training pattern and the hidden nodes free to update, and
⟨xi xj⟩free is the corresponding quantity when the output nodes are also free.

For both Hopfield Networks and Boltzmann Machines one can define an energy function

E = −(1/2) Σ_{i=1}^N Σ_{j=1}^N wij xi xj + Σ_{i=1}^N θi xi

and the network activation updates cause the network to settle into a local minimum of
this energy. This implies that the stored patterns will be local minima of the energy. If a
Boltzmann Machine starts off with a high temperature and is gradually cooled (known as
simulated annealing), it will tend to stay longer in the basins of attraction of the deepest
minima, and have a good chance of ending up in a global minimum at the end. Further
details about Boltzmann Machines can be found in most introductory textbooks.

Restricted Boltzmann Machines and Deep Learning
One particularly important variation of the standard Boltzmann Machine is the Restricted
Boltzmann Machine that has distinct sets of visible and hidden nodes, with no visible-to-
visible or hidden-to-hidden connections allowed. This restriction makes it possible to
formulate more efficient training algorithms, in particular, the Contrastive Divergence
learning algorithm that performs an approximation to gradient descent.

These networks have proved effective in the field of Deep Learning, and were what
started much of the current interest in all forms of deep learning neural networks. Each
layer of a many-layered feed-forward neural network can be efficiently pre-trained as an
unsupervised Restricted Boltzmann Machine, and the whole network can then be fine-
tuned using standard supervised Backpropagation.

Such use of semi-supervised feature learning in deep hierarchical neural networks is


replacing hand-crafted feature creation in many areas. This is a major ongoing research
area, with many new developments published each year.

Overview and Reading
1. We began by noting the important features of recurrent neural networks
and their properties as fully fledged dynamical systems.
2. Then we studied a basic Fully Recurrent Network and unfolding it over
time to give the Backpropagation Through Time learning algorithm.
3. This was followed by a brief overview of the NARX model and two
theorems about the computational power of recurrent networks.
4. We looked at a Reading Model and Associative Memory as examples.
5. Finally, Hopfield Networks and Boltzmann Machines were introduced.

Reading

1. Haykin-2009: Sections 11.7, 13.7, 15.1 to 15.14


2. Hertz, Krogh & Palmer: Sections 2.1, 2.2, 7.1, 7.2, 7.3
3. Ham & Kostanic: Sections 5.1 to 5.9

CS229 Lecture notes
Andrew Ng

Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning al-
gorithm. SVMs are among the best (and many believe are indeed the best)
“off-the-shelf” supervised learning algorithms. To tell the SVM story, we’ll
need to first talk about margins and the idea of separating data with a large
“gap.” Next, we’ll talk about the optimal margin classifier, which will lead
us into a digression on Lagrange duality. We’ll also see kernels, which give
a way to apply SVMs efficiently in very high dimensional (such as infinite-
dimensional) feature spaces, and finally, we’ll close off the story with the
SMO algorithm, which gives an efficient implementation of SVMs.

1 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will
give the intuitions about margins and about the “confidence” of our predic-
tions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability p(y = 1|x; θ) is mod-
eled by hθ (x) = g(θT x). We would then predict “1” on an input x if and
only if hθ (x) ≥ 0.5, or equivalently, if and only if θT x ≥ 0. Consider a
positive training example (y = 1). The larger θT x is, the larger also is
hθ (x) = p(y = 1|x; w, b), and thus also the higher our degree of “confidence”
that the label is 1. Thus, informally we can think of our prediction as being
a very confident one that y = 1 if θT x ≫ 0. Similarly, we think of logistic
regression as making a very confident prediction of y = 0, if θT x ≪ 0. Given
a training set, again informally it seems that we’d have found a good fit to
the training data if we can find θ so that θT x(i) ≫ 0 whenever y (i) = 1, and


θT x(i) ≪ 0 whenever y (i) = 0, since this would reflect a very confident (and
correct) set of classifications for all the training examples. This seems to be
a nice goal to aim for, and we’ll soon formalize this idea using the notion of
functional margins.
For a different type of intuition, consider the following figure, in which x’s
represent positive training examples, o’s denote negative training examples,
a decision boundary (this is the line given by the equation θT x = 0, and
is also called the separating hyperplane) is also shown, and three points
have also been labeled A, B and C.

[Figure: a separating hyperplane with three labeled points A, B and C, at decreasing distances from the boundary.]

Notice that the point A is very far from the decision boundary. If we are
asked to make a prediction for the value of y at A, it seems we should be
quite confident that y = 1 there. Conversely, the point C is very close to
the decision boundary, and while it’s on the side of the decision boundary
on which we would predict y = 1, it seems likely that just a small change to
the decision boundary could easily have caused our prediction to be y = 0.
Hence, we’re much more confident about our prediction at A than at C. The
point B lies in-between these two cases, and more broadly, we see that if
a point is far from the separating hyperplane, then we may be significantly
more confident in our predictions. Again, informally we think it’d be nice if,
given a training set, we manage to find a decision boundary that allows us
to make all correct and confident (meaning far from the decision boundary)
predictions on the training examples. We’ll formalize this later using the
notion of geometric margins.

2 Notation
To make our discussion of SVMs easier, we’ll first need to introduce a new
notation for talking about classification. We will be considering a linear
classifier for a binary classification problem with labels y and features x.
From now, we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.
Also, rather than parameterizing our linear classifier with the vector θ, we
will use parameters w, b, and write our classifier as

hw,b (x) = g(wT x + b).

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. This “w, b” notation


allows us to explicitly treat the intercept term b separately from the other
parameters. (We also drop the convention we had previously of letting x0 = 1
be an extra coordinate in the input feature vector.) Thus, b takes the role of
what was previously θ0 , and w takes the role of [θ1 . . . θn ]T .
Note also that, from our definition of g above, our classifier will directly
predict either 1 or −1 (cf. the perceptron algorithm), without first going
through the intermediate step of estimating the probability of y being 1
(which was what logistic regression did).

3 Functional and geometric margins


Let’s formalize the notions of the functional and geometric margins. Given a
training example (x(i) , y (i) ), we define the functional margin of (w, b) with
respect to the training example as

γ̂(i) = y(i)(wT x(i) + b).

Note that if y (i) = 1, then for the functional margin to be large (i.e., for
our prediction to be confident and correct), we need wT x + b to be a large
positive number. Conversely, if y (i) = −1, then for the functional margin
to be large, we need wT x + b to be a large negative number. Moreover, if
y (i) (wT x + b) > 0, then our prediction on this example is correct. (Check
this yourself.) Hence, a large functional margin represents a confident and a
correct prediction.
For a linear classifier with the choice of g given above (taking values in
{−1, 1}), there’s one property of the functional margin that makes it not a
very good measure of confidence, however. Given our choice of g, we note that
if we replace w with 2w and b with 2b, then since g(wT x + b) = g(2wT x + 2b),

this would not change hw,b (x) at all. I.e., g, and hence also hw,b (x), depends
only on the sign, but not on the magnitude, of wT x + b. However, replacing
(w, b) with (2w, 2b) also results in multiplying our functional margin by a
factor of 2. Thus, it seems that by exploiting our freedom to scale w and b,
we can make the functional margin arbitrarily large without really changing
anything meaningful. Intuitively, it might therefore make sense to impose
some sort of normalization condition such as that ||w||2 = 1; i.e., we might
replace (w, b) with (w/||w||2 , b/||w||2 ), and instead consider the functional
margin of (w/||w||2 , b/||w||2 ). We’ll come back to this later.
Given a training set S = {(x(i) , y (i) ); i = 1, . . . , m}, we also define the
functional margin of (w, b) with respect to S as the smallest of the functional
margins of the individual training examples. Denoted by γ̂, this can therefore
be written:

γ̂ = min_{i=1,...,m} γ̂(i).

Next, let’s talk about geometric margins. Consider the picture below:

[Figure: the decision boundary with its normal vector w; a positive training example at point A
lies at distance γ(i) from the boundary, and B is its projection onto the boundary.]

The decision boundary corresponding to (w, b) is shown, along with the


vector w. Note that w is orthogonal (at 90◦ ) to the separating hyperplane.
(You should convince yourself that this must be the case.) Consider the
point at A, which represents the input x(i) of some training example with
label y (i) = 1. Its distance to the decision boundary, γ (i) , is given by the line
segment AB.
How can we find the value of γ (i) ? Well, w/||w|| is a unit-length vector
pointing in the same direction as w. Since A represents x(i) , we therefore

find that the point B is given by x(i) − γ (i) · w/||w||. But this point lies on
the decision boundary, and all points x on the decision boundary satisfy the
equation wT x + b = 0. Hence,
 
wT ( x(i) − γ(i) w/||w|| ) + b = 0.

Solving for γ(i) yields

γ(i) = (wT x(i) + b) / ||w|| = (w/||w||)T x(i) + b/||w||.
This was worked out for the case of a positive training example at A in the
figure, where being on the “positive” side of the decision boundary is good.
More generally, we define the geometric margin of (w, b) with respect to a
training example (x(i), y(i)) to be

γ(i) = y(i) ( (w/||w||)T x(i) + b/||w|| ).

Note that if ||w|| = 1, then the functional margin equals the geometric
margin—this thus gives us a way of relating these two different notions of
margin. Also, the geometric margin is invariant to rescaling of the parame-
ters; i.e., if we replace w with 2w and b with 2b, then the geometric margin
does not change. This will in fact come in handy later. Specifically, because
of this invariance to the scaling of the parameters, when trying to fit w and b
to training data, we can impose an arbitrary scaling constraint on w without
changing anything important; for instance, we can demand that ||w|| = 1, or
|w1 | = 5, or |w1 + b| + |w2 | = 2, and any of these can be satisfied simply by
rescaling w and b.
Finally, given a training set S = {(x(i) , y (i) ); i = 1, . . . , m}, we also define
the geometric margin of (w, b) with respect to S to be the smallest of the
geometric margins on the individual training examples:

γ = min_{i=1,...,m} γ(i).
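A tiny sketch computing the functional and geometric margins of a fixed (w, b) over a toy training set (all values here are made up for illustration):

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 2.5], [-2.0, -1.0], [-1.5, -3.0]])   # toy features
y = np.array([1, 1, -1, -1])                                         # labels in {-1, +1}
w, b = np.array([1.0, 1.0]), -0.5                                    # some candidate classifier

functional = y * (X @ w + b)                 # gamma_hat_i = y_i (w^T x_i + b)
geometric = functional / np.linalg.norm(w)   # gamma_i = gamma_hat_i / ||w||

print(functional.min(), geometric.min())     # the margins with respect to the whole set
```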

4 The optimal margin classifier


Given a training set, it seems from our previous discussion that a natural
desideratum is to try to find a decision boundary that maximizes the (ge-
ometric) margin, since this would reflect a very confident set of predictions

on the training set and a good “fit” to the training data. Specifically, this
will result in a classifier that separates the positive and the negative training
examples with a “gap” (geometric margin).
For now, we will assume that we are given a training set that is linearly
separable; i.e., that it is possible to separate the positive and negative ex-
amples using some separating hyperplane. How will we find the one that
achieves the maximum geometric margin? We can pose the following opti-
mization problem:

$$\begin{aligned}
\max_{\gamma, w, b}\quad & \gamma \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge \gamma,\quad i = 1,\dots,m \\
& \|w\| = 1.
\end{aligned}$$

I.e., we want to maximize γ, subject to each training example having func-


tional margin at least γ. The ||w|| = 1 constraint moreover ensures that the
functional margin equals the geometric margin, so we are also guaranteed
that all the geometric margins are at least γ. Thus, solving this problem will
result in (w, b) with the largest possible geometric margin with respect to the
training set.
If we could solve the optimization problem above, we’d be done. But the
“||w|| = 1” constraint is a nasty (non-convex) one, and this problem certainly
isn’t in any format that we can plug into standard optimization software to
solve. So, let’s try transforming the problem into a nicer one. Consider:
$$\begin{aligned}
\max_{\hat{\gamma}, w, b}\quad & \frac{\hat{\gamma}}{\|w\|} \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge \hat{\gamma},\quad i = 1,\dots,m
\end{aligned}$$

Here, we’re going to maximize γ̂/||w||, subject to the functional margins all
being at least γ̂. Since the geometric and functional margins are related by
γ = γ̂/||w||, this will give us the answer we want. Moreover, we’ve gotten rid
of the constraint ||w|| = 1 that we didn’t like. The downside is that we now
have a nasty (again, non-convex) objective γ̂/||w||; and, we still don’t
have any off-the-shelf software that can solve this form of an optimization
problem.
Let’s keep going. Recall our earlier discussion that we can add an arbi-
trary scaling constraint on w and b without changing anything. This is the
key idea we’ll use now. We will introduce the scaling constraint that the
functional margin of w, b with respect to the training set must be 1:

γ̂ = 1.

Since multiplying w and b by some constant results in the functional margin


being multiplied by that same constant, this is indeed a scaling constraint,
and can be satisfied by rescaling w, b. Plugging this into our problem above,
and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing
||w||2 , we now have the following optimization problem:
$$\begin{aligned}
\min_{w, b}\quad & \frac{1}{2}\|w\|^2 \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge 1,\quad i = 1,\dots,m
\end{aligned}$$
We’ve now transformed the problem into a form that can be efficiently
solved. The above is an optimization problem with a convex quadratic ob-
jective and only linear constraints. Its solution gives us the optimal mar-
gin classifier. This optimization problem can be solved using commercial
quadratic programming (QP) code.1
While we could call the problem solved here, what we will instead do is
make a digression to talk about Lagrange duality. This will lead us to our
optimization problem’s dual form, which will play a key role in allowing us to
use kernels to get optimal margin classifiers to work efficiently in very high
dimensional spaces. The dual form will also allow us to derive an efficient
algorithm for solving the above optimization problem that will typically do
much better than generic QP software.
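Before moving on, here is a minimal sketch of handing this primal QP to a generic solver (Python with NumPy and the cvxopt package; the toy dataset is hypothetical and assumed linearly separable). It is only meant to make the problem statement concrete, not to replace the dual/SMO machinery developed below.

```python
import numpy as np
from cvxopt import matrix, solvers

# Hypothetical linearly separable toy data: rows of X are x^(i), y^(i) in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

# Decision variable z = [w; b] in R^(n+1).
# Objective (1/2) z^T P z: penalize only w (a tiny ridge on b keeps the solver numerically happy).
P = matrix(np.diag(np.r_[np.ones(n), 1e-8]))
q = matrix(np.zeros(n + 1))

# Constraints y^(i)(w^T x^(i) + b) >= 1, rewritten in the solver's form G z <= h.
G = matrix(-np.c_[y[:, None] * X, y])
h = matrix(-np.ones(m))

sol = solvers.qp(P, q, G, h)
z = np.array(sol['x']).ravel()
w, b = z[:n], z[n]
print("w =", w, "b =", b, "min functional margin =", (y * (X @ w + b)).min())
```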

5 Lagrange duality
Let’s temporarily put aside SVMs and maximum margin classifiers, and talk
about solving constrained optimization problems.
Consider a problem of the following form:
$$\begin{aligned}
\min_{w}\quad & f(w) \\
\text{s.t.}\quad & h_i(w) = 0,\quad i = 1,\dots,l.
\end{aligned}$$
Some of you may recall how the method of Lagrange multipliers can be used
to solve it. (Don’t worry if you haven’t seen it before.) In this method, we
define the Lagrangian to be
$$L(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$
¹ You may be familiar with linear programming, which solves optimization problems
that have linear objectives and linear constraints. QP software is also widely available,
and allows convex quadratic objectives and linear constraints.

Here, the βi ’s are called the Lagrange multipliers. We would then find
and set L’s partial derivatives to zero:
$$\frac{\partial L}{\partial w_i} = 0; \qquad \frac{\partial L}{\partial \beta_i} = 0,$$
and solve for w and β.
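As a concrete (purely illustrative) instance of this recipe, the sketch below uses SymPy to minimize f(w) = w₁² + w₂² subject to the single equality constraint w₁ + w₂ − 1 = 0, by forming the Lagrangian and setting its partial derivatives to zero. The problem and variable names here are our own, not from the notes.

```python
import sympy as sp

w1, w2, beta = sp.symbols('w1 w2 beta', real=True)

# f(w) = w1^2 + w2^2, with one equality constraint h(w) = w1 + w2 - 1 = 0.
f = w1**2 + w2**2
h = w1 + w2 - 1

# Lagrangian L(w, beta) = f(w) + beta * h(w).
L = f + beta * h

# Set all partial derivatives of L to zero and solve for w and beta.
stationary = sp.solve([sp.diff(L, v) for v in (w1, w2, beta)], [w1, w2, beta], dict=True)
print(stationary)  # [{w1: 1/2, w2: 1/2, beta: -1}]
```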
In this section, we will generalize this to constrained optimization prob-
lems in which we may have inequality as well as equality constraints. Due to
time constraints, we won’t really be able to do the theory of Lagrange duality
justice in this class,2 but we will give the main ideas and results, which we
will then apply to our optimal margin classifier’s optimization problem.
Consider the following, which we’ll call the primal optimization problem:
$$\begin{aligned}
\min_{w}\quad & f(w) \\
\text{s.t.}\quad & g_i(w) \le 0,\quad i = 1,\dots,k \\
& h_i(w) = 0,\quad i = 1,\dots,l.
\end{aligned}$$
To solve it, we start by defining the generalized Lagrangian
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w).$$

Here, the αi ’s and βi ’s are the Lagrange multipliers. Consider the quantity
$$\theta_P(w) = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} L(w, \alpha, \beta).$$

Here, the “P” subscript stands for “primal.” Let some w be given. If w
violates any of the primal constraints (i.e., if either gi (w) > 0 or hi (w) ≠ 0
for some i), then you should be able to verify that
$$\begin{aligned}
\theta_P(w) &= \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \; f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w) \qquad (1)\\
&= \infty. \qquad (2)
\end{aligned}$$
Conversely, if the constraints are indeed satisfied for a particular value of w,
then θP (w) = f (w). Hence,

$$\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies primal constraints} \\ \infty & \text{otherwise.} \end{cases}$$
² Readers interested in learning more about this topic are encouraged to read, e.g., R.
T. Rockafellar (1970), Convex Analysis, Princeton University Press.

Thus, θP takes the same value as the objective in our problem for all val-
ues of w that satisfy the primal constraints, and is positive infinity if the
constraints are violated. Hence, if we consider the minimization problem
$$\min_{w} \theta_P(w) = \min_{w} \max_{\alpha, \beta \,:\, \alpha_i \ge 0} L(w, \alpha, \beta),$$

we see that it is the same problem as (and has the same solutions as) our
original, primal problem. For later use, we also define the optimal value of
the objective to be p∗ = minw θP (w); we call this the value of the primal
problem.
Now, let’s look at a slightly different problem. We define
$$\theta_D(\alpha, \beta) = \min_{w} L(w, \alpha, \beta).$$
Here, the “D” subscript stands for “dual.” Note also that whereas in the
definition of θP we were optimizing (maximizing) with respect to α, β, here
we are minimizing with respect to w.
We can now pose the dual optimization problem:
$$\max_{\alpha, \beta \,:\, \alpha_i \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \min_{w} L(w, \alpha, \beta).$$

This is exactly the same as our primal problem shown above, except that the
order of the “max” and the “min” are now exchanged. We also define the
optimal value of the dual problem’s objective to be d∗ = max_{α,β : αi ≥0} θD (α, β).
How are the primal and the dual problems related? It can easily be shown
that
$$d^* = \max_{\alpha, \beta \,:\, \alpha_i \ge 0} \min_{w} L(w, \alpha, \beta) \;\le\; \min_{w} \max_{\alpha, \beta \,:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*.$$

(You should convince yourself of this; this follows from the “max min” of a
function always being less than or equal to the “min max.”) However, under
certain conditions, we will have
$$d^* = p^*,$$
so that we can solve the dual problem in lieu of the primal problem. Let’s
see what these conditions are.
Suppose f and the gi ’s are convex,3 and the hi ’s are affine.4 Suppose
further that the constraints gi are (strictly) feasible; this means that there
exists some w so that gi (w) < 0 for all i.
³ When f has a Hessian, then it is convex if and only if the Hessian is positive semi-
definite. For instance, f (w) = wT w is convex; similarly, all linear (and affine) functions
are also convex. (A function f can also be convex without being differentiable, but we
won’t need those more general definitions of convexity here.)
⁴ I.e., there exist ai, bi, so that hi(w) = aiT w + bi. “Affine” means the same thing as
linear, except that we also allow the extra intercept term bi.

Under our above assumptions, there must exist w∗ , α∗ , β ∗ so that w∗ is the


solution to the primal problem, α∗ , β ∗ are the solution to the dual problem,
and moreover p∗ = d∗ = L(w∗ , α∗ , β ∗ ). Moreover, w∗ , α∗ and β ∗ satisfy the
Karush-Kuhn-Tucker (KKT) conditions, which are as follows:

$$\begin{aligned}
\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) &= 0, \quad i = 1,\dots,n \qquad (3)\\
\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) &= 0, \quad i = 1,\dots,l \qquad (4)\\
\alpha_i^* g_i(w^*) &= 0, \quad i = 1,\dots,k \qquad (5)\\
g_i(w^*) &\le 0, \quad i = 1,\dots,k \qquad (6)\\
\alpha_i^* &\ge 0, \quad i = 1,\dots,k \qquad (7)
\end{aligned}$$

Moreover, if some w∗, α∗, β∗ satisfy the KKT conditions, then they are also a
solution to the primal and dual problems.
We draw attention to Equation (5), which is called the KKT dual com-
plementarity condition. Specifically, it implies that if αi∗ > 0, then gi (w∗ ) =
0. (I.e., the “gi (w) ≤ 0” constraint is active, meaning it holds with equality
rather than with inequality.) Later on, this will be key for showing that the
SVM has only a small number of “support vectors”; the KKT dual comple-
mentarity condition will also give us our convergence test when we talk about
the SMO algorithm.

6 Optimal margin classifiers


Previously, we posed the following (primal) optimization problem for finding
the optimal margin classifier:
$$\begin{aligned}
\min_{w, b}\quad & \frac{1}{2}\|w\|^2 \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge 1,\quad i = 1,\dots,m
\end{aligned}$$

We can write the constraints as

gi (w) = −y (i) (wT x(i) + b) + 1 ≤ 0.

We have one such constraint for each training example. Note that from the
KKT dual complementarity condition, we will have αi > 0 only for the train-
ing examples that have functional margin exactly equal to one (i.e., the ones

corresponding to constraints that hold with equality, gi (w) = 0). Consider
the figure below, in which a maximum margin separating hyperplane is
shown by the solid line.

The points with the smallest margins are exactly the ones closest to the
decision boundary; here, these are the three points (one negative and two pos-
itive examples) that lie on the dashed lines parallel to the decision boundary.
Thus, only three of the αi ’s—namely, the ones corresponding to these three
training examples—will be non-zero at the optimal solution to our optimiza-
tion problem. These three points are called the support vectors in this
problem. The fact that the number of support vectors can be much smaller
than the size of the training set will be useful later.
Let’s move on. Looking ahead, as we develop the dual form of the prob-
lem, one key idea to watch out for is that we’ll try to write our algorithm
in terms of only the inner product hx(i) , x(j) i (think of this as (x(i) )T x(j) )
between points in the input feature space. The fact that we can express our
algorithm in terms of these inner products will be key when we apply the
kernel trick.
When we construct the Lagrangian for our optimization problem we have:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right]. \qquad (8)$$

Note that there are only “αi ” but no “βi ” Lagrange multipliers, since the
problem has only inequality constraints.
Let’s find the dual form of the problem. To do so, we need to first
minimize L(w, b, α) with respect to w and b (for fixed α), to get θD , which

we’ll do by setting the derivatives of L with respect to w and b to zero. We


have:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0$$

This implies that

$$w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}. \qquad (9)$$

As for the derivative with respect to b, we obtain

$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0. \qquad (10)$$

If we take the definition of w in Equation (9) and plug that back into the
Lagrangian (Equation 8), and simplify, we get
$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)}.$$

But from Equation (10), the last term must be zero, so we obtain

$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}.$$

Recall that we got to the equation above by minimizing L with respect to w


and b. Putting this together with the constraints αi ≥ 0 (that we always had)
and the constraint (10), we obtain the following dual optimization problem:
$$\begin{aligned}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.}\quad & \alpha_i \ge 0, \quad i = 1,\dots,m \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
\end{aligned}$$

You should also be able to verify that the conditions required for p∗ =
d∗ and the KKT conditions (Equations 3–7) to hold are indeed satisfied in
our optimization problem. Hence, we can solve the dual in lieu of solving
the primal problem. Specifically, in the dual problem above, we have a
maximization problem in which the parameters are the αi ’s. We’ll talk later

about the specific algorithm that we’re going to use to solve the dual problem,
but if we are indeed able to solve it (i.e., find the α’s that maximize W (α)
subject to the constraints), then we can use Equation (9) to go back and find
the optimal w’s as a function of the α’s. Having found w∗ , by considering
the primal problem, it is also straightforward to find the optimal value for
the intercept term b as
$$b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*T} x^{(i)} + \min_{i: y^{(i)} = 1} w^{*T} x^{(i)}}{2}. \qquad (11)$$
(Check for yourself that this is correct.)
Before moving on, let’s also take a more careful look at Equation (9),
which gives the optimal value of w in terms of (the optimal value of) α.
Suppose we’ve fit our model’s parameters to a training set, and now wish to
make a prediction at a new input point x. We would then calculate wT x + b,
and predict y = 1 if and only if this quantity is bigger than zero. But
using (9), this quantity can also be written:
$$\begin{aligned}
w^T x + b &= \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b \qquad (12)\\
&= \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b. \qquad (13)
\end{aligned}$$
Hence, if we’ve found the αi ’s, in order to make a prediction, we have to
calculate a quantity that depends only on the inner product between x and
the points in the training set. Moreover, we saw earlier that the αi ’s will all
be zero except for the support vectors. Thus, many of the terms in the sum
above will be zero, and we really need to find only the inner products between
x and the support vectors (of which there is often only a small number) in
order to calculate (13) and make our prediction.
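To make these pieces concrete, here is a minimal sketch (Python with NumPy and cvxopt, on a hypothetical separable toy dataset of our own) that solves the dual for the α's, recovers w via Equation (9) and b via Equation (11), and then predicts using only inner products with the support vectors as in Equation (13):

```python
import numpy as np
from cvxopt import matrix, solvers

# Hypothetical separable toy data: rows of X are x^(i), y^(i) in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Dual: max_a sum(a) - 1/2 sum_ij y_i y_j a_i a_j <x_i, x_j>
#       s.t. a_i >= 0 and sum_i a_i y_i = 0.
# cvxopt minimizes (1/2) a^T P a + q^T a, so we negate the objective.
K = X @ X.T                                            # Gram matrix of inner products
P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(m))      # tiny ridge for numerical stability
q = matrix(-np.ones(m))
G = matrix(-np.eye(m))                                 # -a_i <= 0
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, -1))                           # sum_i a_i y_i = 0
b_eq = matrix(np.zeros(1))

alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()

# Equation (9): w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
# Equation (11) for the intercept.
b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2.0

# Equation (13): predict using only inner products with the (few) support vectors.
sv = alpha > 1e-6
def predict(x_new):
    return np.sign(np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b)

print("support vectors:", np.where(sv)[0], "prediction:", predict(np.array([2.5, 2.0])))
```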
By examining the dual form of the optimization problem, we gained sig-
nificant insight into the structure of the problem, and were also able to write
the entire algorithm in terms of only inner products between input feature
vectors. In the next section, we will exploit this property to apply kernels
to our classification problem. The resulting algorithm, support vector
machines, will be able to efficiently learn in very high dimensional spaces.

7 Kernels
Back in our discussion of linear regression, we had a problem in which the
input x was the living area of a house, and we considered performing regres-

sion using the features x, x^2 and x^3 (say) to obtain a cubic function. To


distinguish between these two sets of variables, we’ll call the “original” input
value the input attributes of a problem (in this case, x, the living area).
When that is mapped to some new set of quantities that are then passed to
the learning algorithm, we’ll call those new quantities the input features.
(Unfortunately, different authors use different terms to describe these two
things, but we’ll try to use this terminology consistently in these notes.) We
will also let φ denote the feature mapping, which maps from the attributes
to the features. For instance, in our example, we had
 
$$\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \end{bmatrix}.$$

Rather than applying SVMs using the original input attributes x, we may
instead want to learn using some features φ(x). To do so, we simply need to
go over our previous algorithm, and replace x everywhere in it with φ(x).
Since the algorithm can be written entirely in terms of the inner prod-
ucts hx, zi, this means that we would replace all those inner products with
hφ(x), φ(z)i. Specifically, given a feature mapping φ, we define the corre-
sponding Kernel to be

K(x, z) = φ(x)T φ(z).

Then, everywhere we previously had hx, zi in our algorithm, we could simply


replace it with K(x, z), and our algorithm would now be learning using the
features φ.
Now, given φ, we could easily compute K(x, z) by finding φ(x) and φ(z)
and taking their inner product. But what’s more interesting is that often,
K(x, z) may be very inexpensive to calculate, even though φ(x) itself may
be very expensive to calculate (perhaps because it is an extremely high di-
mensional vector). In such settings, by using in our algorithm an efficient
way to calculate K(x, z), we can get SVMs to learn in the high dimensional
feature space given by φ, but without ever having to explicitly find or
represent vectors φ(x).
Let’s see an example. Suppose x, z ∈ Rn , and consider

K(x, z) = (xT z)2 .



We can also write this as


$$\begin{aligned}
K(x, z) &= \left( \sum_{i=1}^{n} x_i z_i \right) \left( \sum_{j=1}^{n} x_j z_j \right) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j z_i z_j \\
&= \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j).
\end{aligned}$$

Thus, we see that K(x, z) = φ(x)T φ(z), where the feature mapping φ is given
(shown here for the case of n = 3) by
 
$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \end{bmatrix}.$$

Note that whereas calculating the high-dimensional φ(x) requires O(n2 ) time,
finding K(x, z) takes only O(n) time—linear in the dimension of the input
attributes.
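As a quick numerical check of this identity, here is a small sketch (Python with NumPy; the helper names are ours, not from the notes) comparing the explicit feature-space inner product with the O(n) kernel computation:

```python
import numpy as np

def phi(x):
    """Explicit O(n^2)-dimensional feature map: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def kernel(x, z):
    """The same quantity computed in O(n) time as (x^T z)^2."""
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.standard_normal(3), rng.standard_normal(3)

print(phi(x) @ phi(z))   # explicit inner product in the feature space
print(kernel(x, z))      # identical value, without ever forming phi
```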
For a related kernel, also consider

$$\begin{aligned}
K(x, z) &= (x^T z + c)^2 \\
&= \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j) + \sum_{i=1}^{n} (\sqrt{2c}\, x_i)(\sqrt{2c}\, z_i) + c^2.
\end{aligned}$$

(Check this yourself.) This corresponds to the feature mapping (again shown
for n = 3)

$$\phi(x) = \begin{bmatrix} x_1 x_1 \\ x_1 x_2 \\ x_1 x_3 \\ x_2 x_1 \\ x_2 x_2 \\ x_2 x_3 \\ x_3 x_1 \\ x_3 x_2 \\ x_3 x_3 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ \sqrt{2c}\, x_3 \\ c \end{bmatrix},$$
and the parameter c controls the relative weighting between the xi (first
order) and the xi xj (second order) terms.
More broadly, the kernel K(x, z) = (x^T z + c)^d corresponds to a feature
mapping to an $\binom{n+d}{d}$-dimensional feature space, consisting of all monomials of the
form x_{i1} x_{i2} · · · x_{ik} that are up to order d. However, despite working in this
O(n^d)-dimensional space, computing K(x, z) still takes only O(n) time, and
hence we never need to explicitly represent feature vectors in this very high
dimensional feature space.
Now, let’s talk about a slightly different view of kernels. Intuitively, (and
there are things wrong with this intuition, but never mind), if φ(x) and φ(z)
are close together, then we might expect K(x, z) = φ(x)T φ(z) to be large.
Conversely, if φ(x) and φ(z) are far apart—say nearly orthogonal to each
other—then K(x, z) = φ(x)T φ(z) will be small. So, we can think of K(x, z)
as some measurement of how similar are φ(x) and φ(z), or of how similar are
x and z.
Given this intuition, suppose that for some learning problem that you’re
working on, you’ve come up with some function K(x, z) that you think might
be a reasonable measure of how similar x and z are. For instance, perhaps
you chose
$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).$$
This is a reasonable measure of x and z’s similarity, and is close to 1 when
x and z are close, and near 0 when x and z are far apart. Can we use this
definition of K as the kernel in an SVM? In this particular example, the
answer is yes. (This kernel is called the Gaussian kernel, and corresponds

to an infinite dimensional feature mapping φ.) But more broadly, given some
function K, how can we tell if it’s a valid kernel; i.e., can we tell if there is
some feature mapping φ so that K(x, z) = φ(x)T φ(z) for all x, z?
Suppose for now that K is indeed a valid kernel corresponding to some
feature mapping φ. Now, consider some finite set of m points (not necessarily
the training set) {x(1) , . . . , x(m) }, and let a square, m-by-m matrix K be
defined so that its (i, j)-entry is given by Kij = K(x(i) , x(j) ). This matrix
is called the Kernel matrix. Note that we’ve overloaded the notation and
used K to denote both the kernel function K(x, z) and the kernel matrix K,
due to their obvious close relationship.
Now, if K is a valid Kernel, then Kij = K(x(i) , x(j) ) = φ(x(i) )T φ(x(j) ) =
φ(x(j) )T φ(x(i) ) = K(x(j) , x(i) ) = Kji , and hence K must be symmetric. More-
over, letting φk (x) denote the k-th coordinate of the vector φ(x), we find that
for any vector z, we have
$$\begin{aligned}
z^T K z &= \sum_i \sum_j z_i K_{ij} z_j \\
&= \sum_i \sum_j z_i \phi(x^{(i)})^T \phi(x^{(j)}) z_j \\
&= \sum_i \sum_j z_i \sum_k \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j \\
&= \sum_k \sum_i \sum_j z_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j \\
&= \sum_k \left( \sum_i z_i \phi_k(x^{(i)}) \right)^2 \\
&\ge 0.
\end{aligned}$$

The second-to-last step above used the same trick as you saw in Problem
set 1 Q1. Since z was arbitrary, this shows that K is positive semi-definite
(K ≥ 0).
Hence, we’ve shown that if K is a valid kernel (i.e., if it corresponds to
some feature mapping φ), then the corresponding Kernel matrix K ∈ Rm×m
is symmetric positive semidefinite. More generally, this turns out to be not
only a necessary, but also a sufficient, condition for K to be a valid kernel
(also called a Mercer kernel). The following result is due to Mercer.5
⁵ Many texts present Mercer’s theorem in a slightly more complicated form involving
L2 functions, but when the input attributes take values in Rn, the version given here is
equivalent.

Theorem (Mercer). Let K : Rn × Rn → R be given. Then for K


to be a valid (Mercer) kernel, it is necessary and sufficient that for any
{x(1) , . . . , x(m) }, (m < ∞), the corresponding kernel matrix is symmetric
positive semi-definite.

Given a function K, apart from trying to find a feature mapping φ that


corresponds to it, this theorem therefore gives another way of testing if it is
a valid kernel. You’ll also have a chance to play with these ideas more in
problem set 2.
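For intuition, here is a small sketch (Python with NumPy, on a hypothetical set of random points) that builds the kernel matrix for the Gaussian kernel above and checks the two properties Mercer's theorem asks for, symmetry and positive semi-definiteness:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # 20 arbitrary points x^(i) in R^5
sigma = 1.0

# Kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to round-off
```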
In class, we also briefly talked about a couple of other examples of ker-
nels. For instance, consider the digit recognition problem, in which given an
image (16x16 pixels) of a handwritten digit (0-9), we have to figure out which
digit it was. Using either a simple polynomial kernel K(x, z) = (xT z)d or
the Gaussian kernel, SVMs were able to obtain extremely good performance
on this problem. This was particularly surprising since the input attributes
x were just 256-dimensional vectors of the image pixel intensity values, and
the system had no prior knowledge about vision, or even about which pixels
are adjacent to which other ones. Another example that we briefly talked
about in lecture was that if the objects x that we are trying to classify are
strings (say, x is a list of amino acids, which strung together form a protein),
then it seems hard to construct a reasonable, “small” set of features for
most learning algorithms, especially if different strings have different lengths.
However, consider letting φ(x) be a feature vector that counts the number
of occurrences of each length-k substring in x. If we’re considering strings
of English letters, then there are 26^k such strings. Hence, φ(x) is a 26^k-
dimensional vector; even for moderate values of k, this is probably too big for
us to efficiently work with. (e.g., 26^4 ≈ 460000.) However, using (dynamic
programming-ish) string matching algorithms, it is possible to efficiently
compute K(x, z) = φ(x)T φ(z), so that we can now implicitly work in this
26^k-dimensional feature space, but without ever explicitly computing feature
vectors in this space.
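To illustrate, here is a toy sketch (Python; the helper names are ours) of a substring-count kernel: it computes K(x, z) = φ(x)ᵀφ(z) by matching only the length-k substrings the two strings actually share, so the 26^k-dimensional vector φ(x) is never materialized. (A real string kernel would use the dynamic-programming algorithms mentioned above; this naive version is only for intuition.)

```python
from collections import Counter

def substring_counts(s, k):
    """phi(s): counts of every length-k substring that actually occurs in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def string_kernel(x, z, k=3):
    """K(x, z) = <phi(x), phi(z)>, computed over shared substrings only."""
    cx, cz = substring_counts(x, k), substring_counts(z, k)
    return sum(cx[s] * cz[s] for s in cx.keys() & cz.keys())

print(string_kernel("gattacagatta", "attacatta"))  # inner product without a 26^k vector
```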
The application of kernels to support vector machines should already
be clear and so we won’t dwell too much longer on it here. Keep in mind
however that the idea of kernels has significantly broader applicability than
SVMs. Specifically, if you have any learning algorithm that you can write
in terms of only inner products hx, zi between input attribute vectors, then
by replacing this with K(x, z) where K is a kernel, you can “magically”
allow your algorithm to work efficiently in the high dimensional feature space
corresponding to K. For instance, this kernel trick can be applied with the
perceptron to derive a kernel perceptron algorithm. Many of the algorithms

that we’ll see later in this class will also be amenable to this method, which
has come to be known as the “kernel trick.”
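As an example of this broader applicability, here is a minimal sketch of a kernel perceptron (Python with NumPy; we assume labels in {-1, +1} and reuse the Gaussian kernel from above, and the data is a hypothetical toy problem). The algorithm keeps one coefficient per training example and only ever touches the data through kernel evaluations K(x, z):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def kernel_perceptron(X, y, kernel, epochs=10):
    """Mistake-driven updates; alpha[i] counts mistakes made on example i."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            # The prediction uses only kernel evaluations, never an explicit phi(x).
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(m))
            if y[i] * score <= 0:      # mistake: add this example to the expansion
                alpha[i] += 1
    return alpha

# Tiny hypothetical toy problem.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.8, 3.2]])
y = np.array([-1, -1, 1, 1])
print(kernel_perceptron(X, y, gaussian_kernel))
```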

8 Regularization and the non-separable case


The derivation of the SVM as presented so far assumed that the data is
linearly separable. While mapping data to a high dimensional feature space
via φ does generally increase the likelihood that the data is separable, we
can’t guarantee that it always will be so. Also, in some cases it is not clear
that finding a separating hyperplane is exactly what we’d want to do, since
that might be susceptible to outliers. For instance, the left figure below
shows an optimal margin classifier, and when a single outlier is added in the
upper-left region (right figure), it causes the decision boundary to make a
dramatic swing, and the resulting classifier has a much smaller margin.

To make the algorithm work for non-linearly separable datasets as well


as be less sensitive to outliers, we reformulate our optimization (using ℓ1
regularization) as follows:
$$\begin{aligned}
\min_{\gamma, w, b}\quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i,\quad i = 1,\dots,m \\
& \xi_i \ge 0,\quad i = 1,\dots,m.
\end{aligned}$$

Thus, examples are now permitted to have (functional) margin less than 1,
and if an example has functional margin 1 − ξi (with ξi > 0), we would pay
a cost of the objective function being increased by Cξi . The parameter C
controls the relative weighting between the twin goals of making the ||w||2
small (which we saw earlier makes the margin large) and of ensuring that
most examples have functional margin at least 1.

As before, we can form the Lagrangian:


$$L(w, b, \xi, \alpha, r) = \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(x^{(i)T} w + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i.$$

Here, the αi ’s and ri ’s are our Lagrange multipliers (constrained to be ≥ 0).


We won’t go through the derivation of the dual again in detail, but after
setting the derivatives with respect to w and b to zero as before, substituting
them back in, and simplifying, we obtain the following dual form of the
problem:
$$\begin{aligned}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1,\dots,m \\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0.
\end{aligned}$$

As before, we also have that w can be expressed in terms of the αi ’s


as given in Equation (9), so that after solving the dual problem, we can
continue to use Equation (13) to make our predictions. Note that, somewhat
surprisingly, in adding ℓ1 regularization, the only change to the dual problem
is that what was originally a constraint that 0 ≤ αi has now become 0 ≤
αi ≤ C. The calculation for b∗ also has to be modified (Equation 11 is no
longer valid); see the comments in the next section/Platt’s paper.
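To see the role of C in practice, here is a quick sketch using scikit-learn's SVC, a standard soft-margin SVM solver (the overlapping toy data is hypothetical): a small C tolerates more margin violations and typically keeps more support vectors, while a large C approaches hard-margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so the data is not linearly separable.
X = np.vstack([rng.normal(0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```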
Also, the KKT dual-complementarity conditions (which in the next sec-
tion will be useful for testing for the convergence of the SMO algorithm)
are:

$$\begin{aligned}
\alpha_i = 0 &\;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \ge 1 \qquad (14)\\
\alpha_i = C &\;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \le 1 \qquad (15)\\
0 < \alpha_i < C &\;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) = 1. \qquad (16)
\end{aligned}$$

Now, all that remains is to give an algorithm for actually solving the dual
problem, which we will do in the next section.

9 The SMO algorithm


The SMO (sequential minimal optimization) algorithm, due to John Platt,
gives an efficient way of solving the dual problem arising from the derivation

of the SVM. Partly to motivate the SMO algorithm, and partly because it’s
interesting in its own right, let’s first take another digression to talk about
the coordinate ascent algorithm.

9.1 Coordinate ascent


Consider trying to solve the unconstrained optimization problem

$$\max_{\alpha} W(\alpha_1, \alpha_2, \dots, \alpha_m).$$

Here, we think of W as just some function of the parameters αi ’s, and for now
ignore any relationship between this problem and SVMs. We’ve already seen
two optimization algorithms, gradient ascent and Newton’s method. The
new algorithm we’re going to consider here is called coordinate ascent:

Loop until convergence: {
    For i = 1, . . . , m, {
        αi := arg max_α̂i W(α1, . . . , αi−1, α̂i, αi+1, . . . , αm).
    }
}

Thus, in the innermost loop of this algorithm, we will hold all the vari-
ables except for some αi fixed, and reoptimize W with respect to just the
parameter αi . In the version of this method presented here, the inner-loop
reoptimizes the variables in order α1 , α2 , . . . , αm , α1 , α2 , . . .. (A more sophis-
ticated version might choose other orderings; for instance, we may choose
the next variable to update according to which one we expect to allow us to
make the largest increase in W (α).)
When the function W happens to be of such a form that the “arg max”
in the inner loop can be performed efficiently, then coordinate ascent can be
a fairly efficient algorithm. Here’s a picture of coordinate ascent in action:
[Figure: contours of a quadratic function, with the zig-zag path taken by coordinate ascent from its initialization to the global maximum.]

The ellipses in the figure are the contours of a quadratic function that
we want to optimize. Coordinate ascent was initialized at (2, −2), and also
plotted in the figure is the path that it took on its way to the global maximum.
Notice that on each step, coordinate ascent takes a step that’s parallel to one
of the axes, since only one variable is being optimized at a time.
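Here is a small sketch of the coordinate ascent loop (Python with NumPy, on a hypothetical two-variable concave quadratic of our own choosing): each inner step maximizes W exactly over a single coordinate while the other coordinates stay fixed.

```python
import numpy as np

# A hypothetical concave quadratic W(a) = -0.5 a^T A a + b^T a (A positive definite).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])

def W(a):
    return -0.5 * a @ A @ a + b @ a

a = np.array([2.0, -2.0])          # initialization, as in the figure
for _ in range(20):
    for i in range(len(a)):
        # Exact maximization over coordinate i with the others held fixed:
        # dW/da_i = b_i - sum_j A_ij a_j = 0  =>  a_i = (b_i - sum_{j!=i} A_ij a_j) / A_ii
        a[i] = (b[i] - A[i] @ a + A[i, i] * a[i]) / A[i, i]

# Coordinate ascent converges to the true maximizer A^{-1} b.
print(a, W(a), np.linalg.solve(A, b))
```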

9.2 SMO
We close off the discussion of SVMs by sketching the derivation of the SMO
algorithm. Some details will be left to the homework, and for others you
may refer to the paper excerpt handed out in class.
Here’s the (dual) optimization problem that we want to solve:
$$\begin{aligned}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \qquad (17)\\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1,\dots,m \qquad (18)\\
& \sum_{i=1}^{m} \alpha_i y^{(i)} = 0. \qquad (19)
\end{aligned}$$

Let’s say we have a set of αi ’s that satisfy the constraints (18-19). Now,
suppose we want to hold α2 , . . . , αm fixed, and take a coordinate ascent step
and reoptimize the objective with respect to α1 . Can we make any progress?
The answer is no, because the constraint (19) ensures that
$$\alpha_1 y^{(1)} = -\sum_{i=2}^{m} \alpha_i y^{(i)}.$$

Or, by multiplying both sides by y (1) , we equivalently have


$$\alpha_1 = -y^{(1)} \sum_{i=2}^{m} \alpha_i y^{(i)}.$$

(This step used the fact that y (1) ∈ {−1, 1}, and hence (y (1) )2 = 1.) Hence,
α1 is exactly determined by the other αi ’s, and if we were to hold α2 , . . . , αm
fixed, then we can’t make any change to α1 without violating the constraint
(19) in the optimization problem.
Thus, if we want to update some subset of the αi ’s, we must update at
least two of them simultaneously in order to keep satisfying the constraints.
This motivates the SMO algorithm, which simply does the following:
Repeat till convergence {
1. Select some pair αi and αj to update next (using a heuristic that
tries to pick the two that will allow us to make the biggest progress
towards the global maximum).
2. Reoptimize W (α) with respect to αi and αj , while holding all the
other αk ’s (k ≠ i, j) fixed.
}
To test for convergence of this algorithm, we can check whether the KKT
conditions (Equations 14-16) are satisfied to within some tol. Here, tol is
the convergence tolerance parameter, and is typically set to around 0.01 to
0.001. (See the paper and pseudocode for details.)
The key reason that SMO is an efficient algorithm is that the update to
αi , αj can be computed very efficiently. Let’s now briefly sketch the main
ideas for deriving the efficient update.
Let’s say we currently have some setting of the αi ’s that satisfy the con-
straints (18-19), and suppose we’ve decided to hold α3 , . . . , αm fixed, and
want to reoptimize W (α1 , α2 , . . . , αm ) with respect to α1 and α2 (subject to
the constraints). From (19), we require that
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)}.$$

Since the right hand side is fixed (as we’ve fixed α3 , . . . αm ), we can just let
it be denoted by some constant ζ:
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta. \qquad (20)$$
We can thus picture the constraints on α1 and α2 as follows:
[Figure: the box [0, C] × [0, C] in the (α1, α2)-plane, with the line α1 y(1) + α2 y(2) = ζ crossing it; the feasible values of α2 along that line lie between a lower bound L and an upper bound H.]

From the constraints (18), we know that α1 and α2 must lie within the box
[0, C] × [0, C] shown. Also plotted is the line α1 y (1) + α2 y (2) = ζ, on which we
know α1 and α2 must lie. Note also that, from these constraints, we know
L ≤ α2 ≤ H; otherwise, (α1 , α2 ) can’t simultaneously satisfy both the box
and the straight line constraint. In this example, L = 0. But depending on
what the line α1 y (1) + α2 y (2) = ζ looks like, this won’t always necessarily be
the case; but more generally, there will be some lower-bound L and some
upper-bound H on the permissible values for α2 that will ensure that α1 , α2
lie within the box [0, C] × [0, C].
Using Equation (20), we can also write α1 as a function of α2 :

α1 = (ζ − α2 y (2) )y (1) .

(Check this derivation yourself; we again used the fact that y (1) ∈ {−1, 1} so
that (y (1) )2 = 1.) Hence, the objective W (α) can be written

W (α1 , α2 , . . . , αm ) = W ((ζ − α2 y (2) )y (1) , α2 , . . . , αm ).

Treating α3 , . . . , αm as constants, you should be able to verify that this is


just some quadratic function in α2 . I.e., this can also be expressed in the
form aα22 + bα2 + c for some appropriate a, b, and c. If we ignore the “box”
constraints (18) (or, equivalently, that L ≤ α2 ≤ H), then we can easily
maximize this quadratic function by setting its derivative to zero and solving.
We’ll let α2new,unclipped denote the resulting value of α2 . You should also be
able to convince yourself that if we had instead wanted to maximize W with
respect to α2 but subject to the box constraint, then we can find the resulting
optimal value simply by taking α2^{new,unclipped} and “clipping” it to lie in the

[L, H] interval, to get



$$\alpha_2^{new} = \begin{cases} H & \text{if } \alpha_2^{new,unclipped} > H \\ \alpha_2^{new,unclipped} & \text{if } L \le \alpha_2^{new,unclipped} \le H \\ L & \text{if } \alpha_2^{new,unclipped} < L \end{cases}$$

Finally, having found the α2new , we can use Equation (20) to go back and find
the optimal value of α1new .
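The per-step update is cheap enough to write out directly. Here is a minimal sketch of just the clipping step (Python; the values of the unclipped optimum, L, and H are assumed to have been computed as described above):

```python
def clip_alpha(alpha2_unclipped, L, H):
    """Clip the unconstrained maximizer of the 1-D quadratic back into [L, H]."""
    if alpha2_unclipped > H:
        return H
    if alpha2_unclipped < L:
        return L
    return alpha2_unclipped

# Example: an unconstrained optimum of 1.7 gets clipped to the box upper bound H = 1.0.
print(clip_alpha(1.7, L=0.0, H=1.0))
```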
There are a couple more details that are quite easy but that we’ll leave you
to read about yourself in Platt’s paper: One is the choice of the heuristics
used to select the next αi , αj to update; the other is how to update b as the
SMO algorithm is run.
