Top 9 Data Science Algorithms
An algorithm is a step-by-step method for solving a problem, commonly used for data processing, calculation,
and other computational and mathematical operations. In this article, we list some of the most important
algorithms in the history of Data Science, compiled from recommendations by big data practitioners.
Contents
1. Apriori algorithm
2. Neural Networks
3. Decision Trees
4. K-means clustering
5. Linear Regression
6. Logistic Regression
1. Apriori algorithm
Presentation 12.3.2008, Lauri Lahti
Association rules: techniques for data mining and knowledge discovery in databases
(www.cs.sunysb.edu/~cse634/lecture_notes/07apriori.pdf)
Apriori algorithm in pseudocode (Wikipedia)
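The pseudocode itself did not survive extraction, so as an illustration only, here is a minimal Python sketch of the Apriori idea: repeatedly build candidate itemsets one item larger than the current frequent itemsets, prune candidates that have an infrequent subset, and keep those meeting a minimum support. The transaction data and the min_support value are made up for the example.

from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing the itemset) is at least min_support."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = list(frequent)

    k = 2
    while frequent:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent.
        prev = set(frequent)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        result.extend(frequent)
        k += 1
    return result

# Hypothetical market-basket data.
baskets = [["bread", "milk"], ["bread", "diapers", "beer"],
           ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]
print(apriori(baskets, min_support=0.5))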
2. Neural Networks
the largest output value gives the network’s estimate of the class for a given
input. In the special case of two classes it is common to have just one node
in the output layer, the classification between the two classes being made
by applying a cut-off to the output value at the node.
If we consider using the single layer neural net for classification into c
classes, we would use c nodes in the output layer. If we think of classical
discriminant analysis in neural network terms, the coefficients in Fisher’s
classification functions give us weights for the network that are optimal
if the input vectors come from Multivariate Normal distributions with a
common covariance matrix.
For classification into two classes, the linear optimization approach that
we examined in class, can be viewed as choosing optimal weights in a single
layer neural network using the appropriate objective function.
Multilayer neural networks are undoubtedly the most popular networks used
in applications. While it is possible to consider many activation functions, in
practice it has been found that the logistic (also called the sigmoid) function
g(v) = e^v / (1 + e^v) (or minor variants such as the tanh function) works best
as the activation function. In fact the revival of interest in neural nets was
sparked by successes in training neural networks using this function in place
of the historically (biologically inspired) step function (the “perceptron”).
Notice that using a linear function does not achieve anything in multilayer
networks that is beyond what can be done with single layer networks with
linear activation functions. The practical value of the logistic function arises
from the fact that it is almost linear in the range where g is between 0.1 and
0.9 but has a squashing effect on very small or very large values of v.
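As a small aside (not part of the original notes), the logistic activation and its squashing behavior are easy to see numerically, assuming numpy is available:

import numpy as np

def logistic(v):
    # g(v) = e^v / (1 + e^v), equivalently 1 / (1 + e^(-v))
    return 1.0 / (1.0 + np.exp(-v))

v = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(logistic(v))   # nearly linear around 0, squashed toward 0 or 1 for large |v|
print(np.tanh(v))    # a common minor variant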
In theory it is sufficient to consider networks with two layers of neurons–
one hidden and one output layer–and this is certainly the case for most
applications. There are, however, a number of situations where three and
sometimes four and five layers have been more effective. For prediction the
output node is often given a linear activation function to provide forecasts
that are not limited to the zero to one range. An alternative is to scale the
output to the linear part (0.1 to 0.9) of the logistic function.
Unfortunately there is no clear theory to guide us on choosing the number
of nodes in each hidden layer or indeed the number of layers. The common
practice is to use trial and error, although there are schemes for combining
optimization methods such as genetic algorithms with network training for
these parameters.
the input layer and the hidden layer. In addition there will be a total of 3
connections from each node in the hidden layer to nodes in the output layer.
This makes a total of 25 x 3 = 75 connections between the hidden layer and
the output layer. Using the standard logistic activation functions, the net-
work was trained with a run consisting of 60,000 iterations. Each iteration
consists of presentation to the input layer of the independent variables in a
case, followed by successive computations of the outputs of the neurons of
the hidden layer and the output layer using the appropriate weights. The
output values of neurons in the output layer are used to compute the error.
This error is used to adjust the weights of all the connections in the
network using backward propagation (“backprop”) to complete the iteration.
Since the training data has 150 cases, each case was presented to
the network 400 times. Another way of stating this is to say the network
was trained for 400 epochs where an epoch consists of one sweep through
the entire training data. The results for the last epoch of training the neural
net on this data are shown below:
Iris Output 1 (classification matrix, last epoch of training)

        1     2     3     Total
1       50    0     0     50
2       0     49    1     50
3       0     1     49    50
Total   50    50    50    150

Error Report

Class    # Cases    # Errors    % Error (Std. Err.)
2        50         1           2.00 (1.98)
3        50         1           2.00 (1.98)
The classification error of 1.3% is better than the error using discriminant
analysis which was 2% (See lecture note on Discriminant Analysis). Notice
that had we stopped after only one pass of the data (150 iterations) the
error is much worse (75%) as shown below:
Iris Output 2 (classification matrix after a single pass through the data)

        1     2     3     Total
1       10    7     2     19
2       13    1     6     20
3       12    5     4     21
Total   35    13    12    60
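For readers who want to reproduce a similar experiment today, a rough modern equivalent (not the software used in these notes) is scikit-learn's MLPClassifier. The 25-node hidden layer, logistic activation, and roughly 400 epochs below mirror the setup described above, but the exact error rates will differ from the numbers reported in the original run.

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)            # 150 cases, 4 inputs, 3 classes
net = MLPClassifier(hidden_layer_sizes=(25,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1,
                    max_iter=400, random_state=0)   # roughly 400 epochs of training
net.fit(X, y)                                 # in practice the inputs would also be scaled
print(confusion_matrix(y, net.predict(X)))
print("training error:", 1 - net.score(X, y))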
the relevant sum and activation function evaluations. These outputs are
the inputs for neurons in the second hidden layer. Again the relevant sum
and activation function calculations are performed to compute the outputs
of second layer neurons. This continues layer by layer until we reach the
output layer and compute the outputs for this layer. These output values
constitute the neural net’s guess at the value of the dependent variable. If
we are using the neural net for classification, and we have c classes, we will
have c neuron outputs from the activation functions and we use the largest
value to determine the net’s classification. (If c = 2, we can use just one
output node with a cut-off value to map a numerical output value to one
of the two classes).
Let us denote by wij the weight of the connection from node i to node j.
The values of wij are initialized to small (generally random) numbers in the
range 0.00 ± 0.05. These weights are adjusted to new values in the backward
pass as described below.
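The feedforward computation just described can be sketched in a few lines of numpy. The layer sizes (4 inputs, 25 hidden nodes, 3 outputs) and the ±0.05 weight initialization follow the text; the function names are ours, and bias terms are omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_in, n_out):
    # small random weights w_ij in the range 0.00 +/- 0.05
    return rng.uniform(-0.05, 0.05, size=(n_in, n_out))

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights):
    """Propagate an input case layer by layer: weighted sum, then activation.
    The largest output-layer value gives the network's predicted class."""
    a = x
    for W in weights:
        a = logistic(a @ W)
    return a

weights = [init_weights(4, 25), init_weights(25, 3)]   # input->hidden, hidden->output
x = np.array([5.1, 3.5, 1.4, 0.2])                     # one hypothetical input case
outputs = forward(x, weights)
print(outputs, "-> predicted class", int(np.argmax(outputs)))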
1.4.1 Multiple Local Optima and Epochs
The backprop algorithm is a version of the steepest descent optimization
method applied to the problem of finding the weights that minimize the
error function of the network output. Due to the complexity of the function
and the large numbers of weights that are being “trained” as the network
“learns”, there is no assurance that the backprop algorithm (and indeed any
practical algorithm) will find the optimum weights that minimize error. The
procedure can get stuck at a local minimum. It has been found useful to
randomize the order of presentation of the cases in a training set between
different scans. It is possible to speed up the algorithm by batching, that is
updating the weights for several exemplars in a pass. However, at least the
extreme case of using the entire training data set on each update has been
found to get stuck frequently at poor local minima.
A single scan of all cases in the training data is called an epoch. Most
applications of feedforward networks and backprop require several epochs
before errors are reasonably small. A number of modifications have been
proposed to reduce the epochs needed to train a neural net. One commonly
employed idea is to incorporate a momentum term that injects some inertia
in the weight adjustment on the backward pass. This is done by adding a
term to the expression for weight adjustment for a connection that is a
fraction of the previous weight adjustment for that connection. This fraction
is called the momentum control parameter. High values of the momentum
parameter will force successive weight adjustments to be in similar directions.
Another idea is to vary the adjustment parameter δ so that it decreases as
the number of epochs increases. Intuitively this is useful because it avoids
overfitting that is more likely to occur at later epochs than earlier ones.
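A minimal sketch of the momentum idea (the function name, learning rate, and momentum value are ours, not from the notes; the learning_rate here plays the role of the adjustment parameter δ): each weight adjustment is the usual gradient step plus a fraction of the previous adjustment, so successive adjustments tend to point in similar directions.

import numpy as np

def momentum_update(w, grad, prev_delta, learning_rate=0.1, momentum=0.9):
    """One weight adjustment with a momentum term: the new step is the
    gradient step plus a fraction of the previous step."""
    delta = -learning_rate * grad + momentum * prev_delta
    return w + delta, delta

w = np.zeros(3)
prev = np.zeros(3)
for epoch in range(5):
    grad = np.array([0.2, -0.1, 0.05])   # hypothetical error gradient
    w, prev = momentum_update(w, grad, prev)
print(w)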
Chapter 9
DECISION TREES
Lior Rokach
Department of Industrial Engineering
Tel-Aviv University
liorr@eng.tau.ac.il
Oded Maimon
Department of Industrial Engineering
Tel-Aviv University
maimon@eng.tau.ac.il
Abstract Decision Trees are considered to be one of the most popular approaches for rep-
resenting classifiers. Researchers from various disciplines such as statistics, ma-
chine learning, pattern recognition, and Data Mining have dealt with the issue of
growing a decision tree from available data. This paper presents an updated sur-
vey of current methods for constructing decision tree classifiers in a top-down
manner. The chapter suggests a unified algorithmic framework for presenting
these algorithms and describes various splitting criteria and pruning methodolo-
gies.
Keywords: Decision tree, Information Gain, Gini Index, Gain Ratio, Pruning, Minimum
Description Length, C4.5, CART, Oblivious Decision Trees
1. Decision Trees
A decision tree is a classifier expressed as a recursive partition of the in-
stance space. The decision tree consists of nodes that form a rooted tree,
meaning it is a directed tree with a node called “root” that has no incoming
edges. All other nodes have exactly one incoming edge. A node with outgoing
edges is called an internal or test node. All other nodes are called leaves (also
known as terminal or decision nodes). In a decision tree, each internal node
splits the instance space into two or more sub-spaces according to a certain
discrete function of the input attributes values. In the simplest and most fre-
quent case, each test considers a single attribute, such that the instance space is
partitioned according to the attribute’s value. In the case of numeric attributes,
the condition refers to a range.
Each leaf is assigned to one class representing the most appropriate target
value. Alternatively, the leaf may hold a probability vector indicating the prob-
ability of the target attribute having a certain value. Instances are classified by
navigating them from the root of the tree down to a leaf, according to the
outcome of the tests along the path. Figure 9.1 describes a decision tree that
reasons whether or not a potential customer will respond to a direct mailing.
Internal nodes are represented as circles, whereas leaves are denoted as tri-
angles. Note that this decision tree incorporates both nominal and numeric at-
tributes. Given this classifier, the analyst can predict the response of a potential
customer (by sorting it down the tree), and understand the behavioral charac-
teristics of the entire potential customers population regarding direct mailing.
Each node is labeled with the attribute it tests, and its branches are labeled with
its corresponding values.
[Figure 9.1: a decision tree describing whether a potential customer will respond to a direct mailing. Internal nodes (circles) test Age (<=30 / >30), Gender (male / Female), and Last R.; leaves (triangles) are labelled Yes or No.]
There are various top–down decision trees inducers such as ID3 (Quinlan,
1986), C4.5 (Quinlan, 1993), CART (Breiman et al., 1984). Some consist of
two conceptual phases: growing and pruning (C4.5 and CART). Other inducers
perform only the growing phase.
Figure 9.2 presents a typical algorithmic framework for top–down induc-
ing of a decision tree using growing and pruning. Note that these algorithms
are greedy by nature and construct the decision tree in a top–down, recursive
manner (also known as “divide and conquer”). In each iteration, the algorithm
considers the partition of the training set using the outcome of a discrete func-
tion of the input attributes. The selection of the most appropriate function is
made according to some splitting measures. After the selection of an appro-
priate split, each node further subdivides the training set into smaller subsets,
until no split gains sufficient splitting measure or a stopping criteria is satisfied.
The following section describes the most common criteria in the literature.
TreeGrowing (S,A,y)
Where:
S - Training Set
A - Input Feature Set
y - Target Feature
Create a new tree T with a single root node.
IF One of the Stopping Criteria is fulfilled THEN
Mark the root node in T as a leaf with the most
common value of y in S as a label.
ELSE
Find a discrete function f(A) of the input
attributes values such that splitting S
according to f(A)’s outcomes (v1 ,...,vn ) gains
the best splitting metric.
IF best splitting metric > threshold THEN
Label the root node in T with f(A)
FOR each outcome vi of f(A):
Set Subtreei = TreeGrowing (σf(A)=vi S, A, y).
Connect the root node of T to Subtreei with
an edge that is labelled as vi
END FOR
ELSE
Mark the root node in T as a leaf with the most
common value of y in S as a label.
END IF
END IF
RETURN T
TreePruning (S,T,y)
Where:
S - Training Set
y - Target Feature
T - The tree to be pruned
DO
Select a node t in T such that pruning it
maximally improves some evaluation criterion
IF t ≠ Ø THEN T = pruned(T,t)
UNTIL t = Ø
RETURN T
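The TreeGrowing framework above translates almost line for line into code. Below is a minimal Python sketch for nominal attributes, using information gain as the splitting metric; the function names, the toy data, and the dictionary representation of the tree are our own choices, not part of the chapter.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in split.values())

def tree_growing(rows, attrs, labels, threshold=1e-9):
    """Top-down induction: stop when the labels are pure or no attribute helps;
    otherwise split on the attribute with the best information gain."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority                                    # leaf
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    if info_gain(rows, labels, best) <= threshold:
        return majority                                    # leaf
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = tree_growing(list(sub_rows),
                                         [a for a in attrs if a != best],
                                         list(sub_labels), threshold)
    return tree

# Toy direct-mailing data (made up).
rows = [{"Age": "<=30", "Gender": "male"}, {"Age": "<=30", "Gender": "female"},
        {"Age": ">30", "Gender": "male"}, {"Age": ">30", "Gender": "female"}]
labels = ["Yes", "No", "No", "No"]
print(tree_growing(rows, ["Age", "Gender"], labels))

Pruning is omitted here; the TreePruning loop above would repeatedly replace internal nodes with leaves whenever doing so improves the chosen evaluation criterion.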
Note that if the probability vector has a component of 1 (the variable x gets
only one value), then the variable is defined as pure. On the other hand, if all
components are equal, the level of impurity reaches maximum.
Given a training set S, the probability vector of the target attribute y is
defined as:

Py(S) = ( |σy=c1 S| / |S| , ... , |σy=c|dom(y)| S| / |S| )

where the entropy of y in S is:

Entropy(y, S) = − Σ_{cj∈dom(y)} ( |σy=cj S| / |S| ) · log2( |σy=cj S| / |S| )
It has been theoretically proved (Kearns and Mansour, 1999) that this cri-
terion requires smaller trees for obtaining a certain error than other impurity
based criteria (information gain and Gini index).
that have performed at least as well as the average information gain, the at-
tribute that has obtained the best gain ratio is selected. It has been shown that
the gain ratio tends to outperform simple information gain criteria, both from
the accuracy aspect, as well as from classifier complexity aspects (Quinlan,
1988).
∆Φ(ai, S) / ( − Σ_{vi,j∈dom(ai)} Σ_{ck∈dom(y)} ( |σai=vi,j AND y=ck S| / |S| ) · log2( |σai=vi,j AND y=ck S| / |S| ) )
When the target attribute is binary, the gini and twoing criteria are equivalent.
For multi–class problems, the twoing criteria prefer attributes with evenly di-
vided splits.
where θ(Py,1 , Py,2 ) is the angle between two vectors Py,1 and Py,2 . These vec-
tors represent the probability distribution of the target attribute in the partitions
σai ∈dom1 (ai ) S and σai ∈dom2 (ai ) S respectively.
It has been shown that this criterion performs better than the information
gain and the Gini index for specific problem constellations.
This measure was extended in (Utgoff and Clouse, 1996) to handle target
attributes with multiple classes and missing data values. Their results indicate
that the suggested method outperforms the gain ratio criteria.
5. Stopping Criteria
The growing phase continues until a stopping criterion is triggered. The
following conditions are common stopping rules:
3. The number of cases in the terminal node is less than the minimum num-
ber of cases for parent nodes.
4. If the node were split, the number of cases in one or more child nodes
would be less than the minimum number of cases for child nodes.
6. Pruning Methods
6.1 Overview
Employing tight stopping criteria tends to create small and under-fitted
decision trees. On the other hand, using loose stopping criteria tends to gen-
erate large decision trees that are over-fitted to the training set. Pruning meth-
ods originally suggested in (Breiman et al., 1984) were developed for solving
this dilemma. According to this methodology, a loose stopping criterion is
used, letting the decision tree overfit the training set. Then the over-fitted
tree is cut back into a smaller tree by removing sub–branches that are not con-
tributing to the generalization accuracy. It has been shown in various studies
that employing pruning methods can improve the generalization performance
of a decision tree, especially in noisy domains.
Another key motivation of pruning is “trading accuracy for simplicity” as
presented in (Bratko and Bohanec, 1994). When the goal is to produce a suffi-
ciently accurate compact concept description, pruning is highly useful. Within
this process, the initial decision tree is seen as a completely accurate one. Thus
the accuracy of a pruned decision tree indicates how close it is to the initial
tree.
There are various techniques for pruning decision trees. Most of them per-
form top-down or bottom-up traversal of the nodes. A node is pruned if this
operation improves a certain criterion. The following subsections describe the
most popular techniques.
large enough, the authors suggest breaking it into a training set and a prun-
ing set. The trees are constructed using the training set and evaluated on the
pruning set. On the other hand, if the given dataset is not large enough, they
propose to use cross–validation methodology, despite the computational com-
plexity implications.
ε′(t) = 1 − max_{ci∈dom(y)} ( |σy=ci St| + l · papr(y = ci) ) / ( |St| + l )
The basic idea is that the error ratio estimated using the training set is not
reliable enough. Instead, a more realistic measure, known as the continuity
correction for binomial distribution, should be used:
ε′(T, S) = ε(T, S) + |leaves(T)| / (2 · |S|)
However, this correction still produces an optimistic error rate. Conse-
quently, one should consider pruning an internal node t if its error rate is within
one standard error from a reference tree, namely (Quinlan, 1993):
ε′(pruned(T, t), S) ≤ ε′(T, S) + √( ε′(T, S) · (1 − ε′(T, S)) / |S| )
where ε(T, S) denotes the misclassification rate of the tree T on the training
set S. Z is the inverse of the standard normal cumulative distribution and α is
the desired significance level.
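A minimal sketch of this rule with made-up counts (helper names are ours): a node is a candidate for pruning when the corrected error of the pruned tree is within one standard error of the corrected error of the original tree.

import math

def corrected_error(misclassified, n_leaves, n):
    # epsilon'(T, S) = epsilon(T, S) + |leaves(T)| / (2 * |S|)
    return misclassified / n + n_leaves / (2.0 * n)

def within_one_standard_error(err_full, err_pruned, n):
    """Prune if the pruned tree's corrected error does not exceed the full
    tree's corrected error by more than one standard error."""
    se = math.sqrt(err_full * (1.0 - err_full) / n)
    return err_pruned <= err_full + se

n = 150                                                   # hypothetical training-set size
err_full = corrected_error(misclassified=2, n_leaves=5, n=n)
err_pruned = corrected_error(misclassified=4, n_leaves=3, n=n)
print(within_one_standard_error(err_full, err_pruned, n))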
Let subtree(T,t) denote the subtree rooted by the node t. Let maxchild(T,t)
denote the most frequent child node of t (namely most of the instances in S
reach this particular child) and let St denote all instances in S that reach the
node t.
The procedure performs bottom–up traversal over all nodes and compares
the following values:
1. εU B (subtree(T, t), St )
2. εU B (pruned(subtree(T, t), t), St )
According to the lowest value, the procedure either leaves the tree as is, prunes
the node t, or replaces the node t with the subtree rooted by maxchild(T,t).
Cost(t) = Σ_{ci∈dom(y)} |σy=ci St| · ln( |St| / |σy=ci St| ) + ( (|dom(y)| − 1) / 2 ) · ln( |St| / 2 ) + ln( π^(|dom(y)|/2) / Γ( |dom(y)| / 2 ) )
where St denotes the instances that have reached node t. The splitting cost
of an internal node is calculated based on the cost aggregation of its children.
7. Other Issues
7.1 Weighting Instances
Some decision tree inducers may give different treatments to different in-
stances. This is performed by weighting the contribution of each instance in
the analysis according to a provided weight (between 0 and 1).
additional value in the attribute domain. For instance, the Gain Ratio with
missing values should be calculated as follows:
GainRatio(ai, S) =
[ ( |S − σai=? S| / |S| ) · InformationGain(ai, S − σai=? S) ] /
[ − ( |σai=? S| / |S| ) · log( |σai=? S| / |S| ) − Σ_{vi,j∈dom(ai)} ( |σai=vi,j S| / |S| ) · log( |σai=vi,j S| / |S| ) ]
Once a node is split, it is required to add σai =? S to each one of the outgoing
edges with the following corresponding weight:
|σai=vi,j S| / |S − σai=? S|
The same idea is used for classifying a new instance with missing attribute
values. When an instance encounters a node whose splitting criterion cannot be
evaluated due to a missing value, it is passed through to all outgoing edges. The
predicted class will be the class with the highest probability in the weighted
union of all the leaf nodes at which this instance ends up.
Another approach known as surrogate splits was presented by Breiman et al.
(1984) and is implemented in the CART algorithm. The idea is to find for
each split in the tree a surrogate split which uses a different input attribute and
which most resembles the original split. If the value of the input attribute used
in the original split is missing, then it is possible to use the surrogate split. The
resemblance between two binary splits over sample S is formally defined as:
res(ai, dom1(ai), dom2(ai), aj, dom1(aj), dom2(aj), S) =
|σai∈dom1(ai) AND aj∈dom1(aj) S| / |S| + |σai∈dom2(ai) AND aj∈dom2(aj) S| / |S|
where the first split refers to attribute ai, splitting dom(ai) into dom1(ai)
and dom2(ai), and the alternative split refers to attribute aj, splitting its domain
into dom1(aj) and dom2(aj).
The missing value can be estimated based on other instances (Loh and Shih,
1997). In the learning phase, if the value of a nominal attribute ai in tuple q
is missing, then it is estimated by its mode over all instances having the same
target attribute value. Formally,
estimate(ai, yq, S) = argmax_{vi,j∈dom(ai)} |σai=vi,j AND y=yq S|
where yq denotes the value of the target attribute in the tuple q. If the missing
attribute ai is numeric, then instead of using mode of ai it is more appropriate
to use its mean.
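A small sketch of this estimation rule (function and field names are ours): take the mode of the attribute over instances with the same target value, or the mean if the attribute is numeric.

from collections import Counter
from statistics import mean

def estimate_missing(rows, labels, attr, target_value, numeric=False):
    """Estimate a missing value of `attr` from instances with the same target
    value: the mode for nominal attributes, the mean for numeric ones."""
    values = [r[attr] for r, y in zip(rows, labels)
              if y == target_value and r.get(attr) is not None]
    return mean(values) if numeric else Counter(values).most_common(1)[0][0]

rows = [{"income": "high"}, {"income": "low"}, {"income": "high"}, {"income": None}]
labels = ["adopter", "non-adopter", "adopter", "adopter"]
print(estimate_missing(rows, labels, "income", "adopter"))   # -> 'high'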
8.2 C4.5
C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993).
It uses the gain ratio as its splitting criterion. The splitting ceases when the number
of instances to be split is below a certain threshold. Error–based pruning is
performed after the growing phase. C4.5 can handle numeric attributes. It can
induce from a training set that incorporates missing values by using corrected
gain ratio criteria as presented above.
8.3 CART
CART stands for Classification and Regression Trees (Breiman et al., 1984).
It is characterized by the fact that it constructs binary trees, namely each in-
ternal node has exactly two outgoing edges. The splits are selected using the
twoing criteria and the obtained tree is pruned by cost–complexity pruning.
When provided, CART can consider misclassification costs in the tree induc-
tion. It also enables users to provide prior probability distribution.
An important feature of CART is its ability to generate regression trees.
Regression trees are trees where their leaves predict a real number and not a
class. In case of regression, CART looks for splits that minimize the prediction
squared error (the least–squared deviation). The prediction in each leaf is based
on the weighted mean of the node.
8.4 CHAID
Starting from the early seventies, researchers in applied statistics devel-
oped procedures for generating decision trees, such as: AID (Sonquist et al.,
1971), MAID (Gillo, 1972), THAID (Morgan and Messenger, 1973) and
CHAID (Kass, 1980). CHAID (Chisquare–Automatic–Interaction–Detection)
was originally designed to handle nominal attributes only. For each input at-
tribute ai , CHAID finds the pair of values in Vi that is least significantly differ-
ent with respect to the target attribute. The significant difference is measured
by the p value obtained from a statistical test. The statistical test used depends
on the type of target attribute. If the target attribute is continuous, an F test is
used.
CHAID handles missing values by treating them all as a single valid category.
CHAID does not perform pruning.
8.5 QUEST
The QUEST (Quick, Unbiased, Efficient, Statistical Tree) algorithm sup-
ports univariate and linear combination splits (Loh and Shih, 1997). For each
split, the association between each input attribute and the target attribute is
computed using the ANOVA F–test or Levene’s test (for ordinal and contin-
uous attributes) or Pearson’s chi–square (for nominal attributes). If the target
attribute is multinomial, two–means clustering is used to create two super–
classes. The attribute that obtains the highest association with the target at-
tribute is selected for splitting. Quadratic Discriminant Analysis (QDA) is
applied to find the optimal splitting point for the input attribute. QUEST has
negligible bias and it yields binary decision trees. Ten–fold cross–validation is
used to prune the trees.
1. Decision trees are self–explanatory and when compacted they are also
easy to follow. In other words if the decision tree has a reasonable num-
ber of leaves, it can be grasped by non–professional users. Furthermore
decision trees can be converted to a set of rules. Thus, this representation
is considered as comprehensible.
2. Decision trees can handle both nominal and numeric input attributes.
3. Decision tree representation is rich enough to represent any discrete–
value classifier.
4. Decision trees are capable of handling datasets that may have errors.
5. Decision trees are capable of handling datasets that may have missing
values.
1. Most of the algorithms (like ID3 and C4.5) require that the target at-
tribute has only discrete values.
2. As decision trees use the “divide and conquer” method, they tend to per-
form well if a few highly relevant attributes exist, but less so if many
complex interactions are present. One of the reasons for this is that other
classifiers can compactly describe a classifier that would be very chal-
lenging to represent using a decision tree. A simple illustration of this
phenomenon is the replication problem of decision trees (Pagallo and
Haussler, 1990). Since most decision trees divide the instance space into
mutually exclusive regions to represent a concept, in some cases the tree
should contain several duplications of the same sub-tree in order to rep-
resent the classifier. For instance if the concept follows the following
binary function: y = (A1 ∩ A2 ) ∪ (A3 ∩ A4 ) then the minimal univari-
ate decision tree that represents this function is illustrated in Figure 9.3.
Note that the tree contains two copies of the same sub-tree.
Fuzzy decision trees (FDTs) preserve the symbolic structure of the tree and its comprehensibility.
Nevertheless, FDTs can represent concepts with graduated characteristics by
producing real-valued outputs with gradual shifts.
Janikow (1998) presented a complete framework for building a fuzzy tree in-
cluding several inference procedures based on conflict resolution in rule-based
systems and efficient approximate reasoning methods.
Olaru and Wehenkel (2003) presented a new type of fuzzy decision tree called soft
decision trees (SDT). This approach combines tree-growing and pruning, to
determine the structure of the soft decision tree, with refitting and backfit-
ting, to improve its generalization capabilities. They empirically showed that
soft decision trees are significantly more accurate than standard decision trees.
Moreover, a global model variance study shows a much lower variance for soft
decision trees than for standard trees as a direct cause of the improved accu-
racy.
Peng (2004) has used FDT to improve the performance of the classical in-
ductive learning approach in manufacturing processes. Peng (2004) proposed
to use soft discretization of continuous-valued attributes. It has been shown
that FDT can deal with the noise or uncertainties existing in the data collected
in industrial systems.
memory before induction. That is to say, the largest dataset that can be induced
is bounded by the memory size. Fifield (1992) suggests parallel implementa-
tion of the ID3 Algorithm. However, like Catlett, it assumes that the entire dataset
can fit in the main memory. Chan and Stolfo (1997) suggest partitioning the
datasets into several disjointed datasets, so that each dataset is loaded sepa-
rately into the memory and used to induce a decision tree. The decision trees
are then combined to create a single classifier. However, the experimental re-
sults indicate that partition may reduce the classification performance, meaning
that the classification accuracy of the combined decision trees is not as good as
the accuracy of a single decision tree induced from the entire dataset.
The SLIQ algorithm (Mehta et al., 1996) does not require loading the entire
dataset into the main memory, instead it uses a secondary memory (disk). In
other words, a certain instance is not necessarily resident in the main mem-
ory all the time. SLIQ creates a single decision tree from the entire dataset.
However, this method also has an upper limit for the largest dataset that can
be processed, because it uses a data structure that scales with the dataset size
and this data structure must be resident in main memory all the time. The
SPRINT algorithm uses a similar approach (Shafer et al., 1996). This algo-
rithm induces decision trees relatively quickly and removes all of the memory
restrictions from decision tree induction. SPRINT scales any impurity based
split criteria for large datasets. Gehrke et al (2000) introduced RainForest; a
unifying framework for decision tree classifiers that are capable of scaling any
specific algorithms from the literature (including C4.5, CART and CHAID). In
addition to its generality, RainForest improves SPRINT by a factor of three. In
contrast to SPRINT, however, RainForest requires a certain minimum amount
of main memory, proportional to the set of distinct values in a column of the
input relation. However, this requirement is considered modest and reasonable.
Other decision tree inducers for large datasets can be found in the literature
(Alsabti et al., 1998; Freitas and Lavington, 1998; Gehrke et al., 1999).
References
Almuallim H., An Efficient Algorithm for Optimal Pruning of Decision Trees.
Artificial Intelligence 83(2): 347-362, 1996.
Almuallim H., and Dietterich T.G., Learning Boolean concepts in the presence
of many irrelevant features. Artificial Intelligence, 69: 1-2, 279-306, 1994.
Alsabti K., Ranka S. and Singh V., CLOUDS: A Decision Tree Classifier
for Large Datasets, Conference on Knowledge Discovery and Data Mining
(KDD-98), August 1998.
Attneave F., Applications of Information Theory to Psychology. Holt, Rinehart
and Winston, 1959.
Baker E., and Jain A. K., On feature ordering in practice and some finite sam-
ple effects. In Proceedings of the Third International Joint Conference on
Pattern Recognition, pages 45-49, San Diego, CA, 1976.
BenBassat M., Myopic policies in sequential classification. IEEE Trans. on
Computing, 27(2):170-174, February 1978.
Bennett X. and Mangasarian O.L., Multicategory discrimination via linear pro-
gramming. Optimization Methods and Software, 3:29-39, 1994.
Bratko I., and Bohanec M., Trading accuracy for simplicity in decision trees,
Machine Learning 15: 223-250, 1994.
Breiman L., Friedman J., Olshen R., and Stone C.. Classification and Regres-
sion Trees. Wadsworth Int. Group, 1984.
Brodley C. E. and Utgoff. P. E., Multivariate decision trees. Machine Learning,
19:45-77, 1995.
Buntine W., Niblett T., A Further Comparison of Splitting Rules for Decision-
Tree Induction. Machine Learning, 8: 75-85, 1992.
Catlett J., Mega induction: Machine Learning on Very Large Databases, PhD,
University of Sydney, 1991.
Chan P.K. and Stolfo S.J, On the Accuracy of Meta-learning for Scalable Data
Mining, J. Intelligent Information Systems, 8:5-28, 1997.
Crawford S. L., Extensions to the CART algorithm. Int. J. of Man-Machine
Studies, 31(2):197-217, August 1989.
Dietterich, T. G., Kearns, M., and Mansour, Y., Applying the weak learning
framework to understand and improve C4.5. Proceedings of the Thirteenth
International Conference on Machine Learning, pp. 96-104, San Francisco:
Morgan Kaufmann, 1996.
Duda, R., and Hart, P., Pattern Classification and Scene Analysis, New-York,
Wiley, 1973.
Esposito F., Malerba D. and Semeraro G., A Comparative Analysis of Meth-
ods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 19(5):476-492, 1997.
Fayyad U., and Irani K. B., The attribute selection problem in decision tree
generation. In proceedings of Tenth National Conference on Artificial Intel-
ligence, pp. 104–110, Cambridge, MA: AAAI Press/MIT Press, 1992.
Ferri C., Flach P., and Hernández-Orallo J., Learning Decision Trees Using the
Area Under the ROC Curve. In Claude Sammut and Achim Hoffmann, ed-
4. K-means clustering

J(c, µ) = Σ_i ||x(i) − µc(i)||²
Thus, J measures the sum of squared distances between each training exam-
ple x(i) and the cluster centroid µc(i) to which it has been assigned. It can
be shown that k-means is exactly coordinate descent on J. Specifically, the
inner-loop of k-means repeatedly minimizes J with respect to c while holding
µ fixed, and then minimizes J with respect to µ while holding c fixed. Thus,
J must monotonically decrease, and the value of J must converge. (Usu-
ally, this implies that c and µ will converge too. In theory, it is possible for
Linear Regression
Once we’ve acquired data with multiple variables, one very important question is how the
variables are related. For example, we could ask for the relationship between people’s weights
and heights, or study time and test scores, or two animal populations. Regression is a set
of techniques for estimating relationships, and we’ll focus on them for the next two chapters.
In this chapter, we’ll focus on finding one of the simplest types of relationship: linear. This
process is unsurprisingly called linear regression, and it has many applications. For exam-
ple, we can relate the force for stretching a spring and the distance that the spring stretches
(Hooke’s law, shown in Figure 3.1a), or explain how many transistors the semiconductor
industry can pack into a circuit over time (Moore’s law, shown in Figure 3.1b).
Despite its simplicity, linear regression is an incredibly powerful tool for analyzing data.
While we’ll focus on the basics in this chapter, the next chapter will show how just a few
small tweaks and extensions can enable more complex analyses.
Figure 3.1: Examples of where a line fit explains physical phenomena and engineering feats.
(a) In classical mechanics, one could empirically verify Hooke's law by dangling a mass with a spring and seeing how much the spring is stretched (plotted as distance stretched against the force on the spring, in Newtons).
(b) In the semiconductor industry, Moore's law is an observation that the number of transistors on an integrated circuit doubles roughly every two years.
(The Moore's law image is by Wgsimon (own work) [CC-BY-SA-3.0 or GFDL], via Wikimedia Commons.)
But just because fitting a line is easy doesn’t mean that it always makes sense. Let’s take
another look at Anscombe’s quartet to underscore this point.
[Anscombe's quartet: four scatterplots with identical fitted regression lines.]
For all 4 of them, the slope of the regression line is 0.500 (to three decimal places) and the intercept is
3.00 (to two decimal places). This just goes to show: visualizing data can often reveal patterns that are
hidden by pure numeric analysis!
We begin with simple linear regression in which there are only two variables of interest
(e.g., weight and height, or force used and distance stretched). After developing intuition
for this setting, we’ll then turn our attention to multiple linear regression, where there
are more variables.
Disclaimer: While some of the equations in this chapter might be a little intimidating,
it’s important to keep in mind that as a user of statistics, the most important thing is to
understand their uses and limitations. Toward this end, make sure not to get bogged down
in the details of the equations, but instead focus on understanding how they fit in to the big
picture.
We’re going to fit a line y = β0 + β1 x to our data. Here, x is called the independent
variable or predictor variable, and y is called the dependent variable or response
variable.
Before we talk about how to do the fit, let’s take a closer look at the important quantities
from the fit:
• β1 is the slope of the line: this is one of the most important quantities in any linear
regression analysis. A value very close to 0 indicates little to no relationship; large
positive or negative values indicate large positive or negative relationships, respectively.
For our Hooke’s law example earlier, the slope is the spring constant2 .
2. Since the spring constant k is defined as F = −kx (where F is the force and x is the stretch), the slope in Figure 3.1a is actually the inverse of the spring constant.
In order to actually fit a line, we’ll start with a way to quantify how good a line is. We’ll
then use this to fit the “best” line we can.
One way to quantify a line’s “goodness” is to propose a probabilistic model that generates
data from lines. Then the “best” line is the one for which data generated from the line is
“most likely”. This is a commonly used technique in statistics: proposing a probabilistic
model and using the probability of data to evaluate how good a particular model is. Let’s
make this more concrete.
We observe paired data points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), where we assume that as a func-
tion of xi , each yi is generated by using some true underlying line y = β0 + β1 x that we
evaluate at xi , and then adding some Gaussian noise. Formally,
yi = β0 + β1 xi + εi . (3.1)
Here, the noise εi represents the fact that our data won’t fit the model perfectly. We’ll model
εi as being Gaussian: ε ∼ N (0, σ 2 ). Note that the intercept β0 , the slope β1 , and the noise
variance σ 2 are all treated as fixed (i.e., deterministic) but unknown quantities.
Assuming that this is actually how the data (x1 , y1 ), . . . , (xn , yn ) we observe are generated,
then it turns out that we can find the line for which the probability of the data is highest
by solving the following optimization problem3 :
min_{β0,β1} Σ_{i=1}^n [yi − (β0 + β1 xi)]²,   (3.2)
where minβ0 ,β1 means “minimize over β0 and β1 ”. This is known as the least-squares linear
regression problem. Given a set of points, the solution is:
β̂1 = ( Σ_{i=1}^n xi yi − (1/n)(Σ_{i=1}^n xi)(Σ_{i=1}^n yi) ) / ( Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi)² )   (3.3)
    = r · sy / sx,   (3.4)

β̂0 = ȳ − β̂1 x̄,   (3.5)
3. This is an important point: the assumption of Gaussian noise leads to squared error as our minimization criterion. We’ll see more regression techniques later that use different distributions and therefore different cost functions.
Figure 3.2: An illustration of correlation strength. Each plot shows data with a particular correla-
tion coefficient r. Values farther from 0 (outside) indicate a stronger relationship than values closer
to 0 (inside). Negative values (left) indicate an inverse relationship, while positive values (right)
indicate a direct relationship.
where x̄, ȳ, sx and sy are the sample means and standard deviations for x values and y
values, respectively, and r is the correlation coefficient, defined as
r = (1 / (n − 1)) Σ_{i=1}^n ( (xi − x̄) / sx ) · ( (yi − ȳ) / sy ).   (3.6)
By examining the second equation for the estimated slope β̂1 , we see that since sample
standard deviations sx and sy are positive quantities, the correlation coefficient r, which is
always between −1 and 1, measures how much x is related to y and whether the trend is
positive or negative. Figure 3.2 illustrates different correlation strengths.
The square of the correlation coefficient r2 will always be positive and is called the coefficient
of determination. As we’ll see later, this also is equal to the proportion of the total
variability that’s explained by a linear model.
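The estimates in Equations (3.3) through (3.6) are easy to compute directly. The numpy sketch below uses made-up data and checks that the two forms of the slope estimate agree; variable names are ours.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=x.size)   # hypothetical data

n = x.size
beta1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)  # (3.3)
r = np.sum((x - x.mean()) / x.std(ddof=1) * (y - y.mean()) / y.std(ddof=1)) / (n - 1)        # (3.6)
beta1_alt = r * y.std(ddof=1) / x.std(ddof=1)                                                # (3.4)
beta0 = y.mean() - beta1 * x.mean()                                                          # (3.5)

print(beta0, beta1, beta1_alt)   # the two slope formulas agree
print("r^2 =", r ** 2)           # coefficient of determination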
As an extremely crucial remark, correlation does not imply causation! We devote the entire
next page to this point, which is one of the most common sources of error in interpreting
statistics.
Just because there’s a strong correlation between two variables, there isn’t necessarily a causal rela-
tionship between them. For example, drowning deaths and ice-cream sales are strongly correlated, but
that’s because both are affected by the season (summer vs. winter). In general, there are several possible
cases, as illustrated below:
(a) Causal link: Even if there is a causal link between x and y, correlation alone cannot tell us whether y causes x or x causes y.
(b) Hidden cause: A hidden variable z causes both x and y, creating the correlation.
(c) Confounding factor: A hidden variable z and x both affect y, so the results also depend on the value of z.
(d) Coincidence: The correlation just happened by chance (e.g. the strong correlation between sun cycles and number of Republicans in Congress, as shown below).
(e) The number of Republican senators in Congress (red) and the sunspot number (blue, before 1986) / inverted sunspot number (blue, after 1986). This figure comes from http://www.realclimate.org/index.php/archives/2007/05/fun-with-correlations/.
Figure 3.3: Different explanations for correlation between two variables. In this diagram, arrows represent causation.
Recall from last time that in order to do hypothesis tests and compute confidence intervals,
we need to know our test statistic, its standard error, and its distribution. We’ll look at the
standard errors for the most important quantities and their interpretation. Any statistical
analysis software can compute these quantities automatically, so we’ll focus on interpreting
and understanding what comes out.
Warning: All the statistical tests here crucially depend on the assumption that the observed
data actually comes from the probabilistic model defined in Equation (3.1)!
3.2.1 Slope
For the slope β1 , our test statistic is
tβ1 = (β̂1 − β1) / sβ1,   (3.7)
which has a Student’s t distribution with n − 2 degrees of freedom. The standard error of
the slope sβ1 is
sβ1 = σ̂ / √( Σ_{i=1}^n (xi − x̄)² ),   (3.8)

where the quantity under the square root reflects how close together the x values are.
3.2.2 Intercept
For the intercept β0 , our test statistic is
tβ0 = (β̂0 − β0) / sβ0,   (3.10)
Figure 3.4: The test statistic for the correlation coefficient r for n = 10 (blue) and n = 100 (green).
3.2.3 Correlation
For the correlation coefficient r, our test statistic is the standardized correlation
tr = r · √( (n − 2) / (1 − r²) ),   (3.12)
which is t-distributed with n − 2 degrees of freedom. Figure 3.4 plots tr against r.
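As a numerical illustration with made-up data (names ours), the slope and correlation test statistics from Equations (3.7), (3.8), and (3.12) can be computed directly and compared against a t distribution with n − 2 degrees of freedom; scipy is assumed to be available.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, size=x.size)
n = x.size

beta1, beta0 = np.polyfit(x, y, 1)                         # fitted slope and intercept
resid = y - (beta0 + beta1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))          # estimated noise std deviation

s_beta1 = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2)) # (3.8)
t_beta1 = beta1 / s_beta1                                  # (3.7) under H0: beta1 = 0
p_slope = 2 * stats.t.sf(abs(t_beta1), df=n - 2)

r = np.corrcoef(x, y)[0, 1]
t_r = r * np.sqrt((n - 2) / (1 - r ** 2))                  # (3.12)
p_corr = 2 * stats.t.sf(abs(t_r), df=n - 2)

print(t_beta1, p_slope, t_r, p_corr)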
3.2.4 Prediction
Let’s look at the prediction at a particular value x∗ , which we’ll call ŷ(x∗ ). In particular:
y(x∗) = β0 + β1 x∗ + ε,   (3.13)

where ε ∼ N(0, σ²).
Then it turns out that the standard error sµ̂ for estimating µ(x∗ ) (i.e., the mean of the line
at point x∗ ) using ŷ(x∗ ) is:
sµ̂ = σ̂ · √( 1/n + (x∗ − x̄)² / Σ_{i=1}^n (xi − x̄)² ),

where the second term under the square root measures the distance of x∗ from the “comfortable prediction region”.
This makes sense because if we’re trying to predict for a point that’s far from the mean,
then we should be less sure, and our prediction should have more variance. To compute the
standard error for estimating a particular point y(x∗ ) and not just its mean µ(x∗ ), we’d also
need to factor in the extra noise term ε in Equation (3.13):
sŷ = σ̂ · √( 1/n + (x∗ − x̄)² / Σ_i (xi − x̄)² + 1 ),

where the added 1 under the square root accounts for the extra noise term ε.
While both of these quantities have the same value when computed from the data, when
analyzing them, we have to remember that they’re different random variables: ŷ has more
variation because of the extra ε.
As a reminder, everything here crucially depends on the probabilistic model given by Equa-
tion (3.1) being true. In practice, when we do prediction for some value of x we haven’t seen
before, we need to be very careful. Predicting y for a value of x that is within the interval
of points that we saw in the original data (the data that we fit our model with) is called
interpolation. Predicting y for a value of x that’s outside the range of values we actually
saw for x in the original data is called extrapolation.
For real datasets, even if a linear fit seems appropriate, we need to be extremely careful
about extrapolation, which can often lead to false predictions!
By fitting a line to the Rotten Tomatoes ratings for movies that M. Night Shyamalan directed over time,
one may erroneously be led to believe that in 2014 and onward, Shyamalan’s movies will have negative
ratings, which isn’t even possible!
[Scatterplot of the ratings against Year, with fitted line y = -4.6531x + 9367.7 and R² = 0.6875.]
Now, let’s talk about the case when instead of just a single scalar value x, we have a vector
(x1 , . . . , xp ) for every data point i. So, we have n data points (just like before), each with p
different predictor variables or features. We’ll then try to predict y for each data point as
a linear function of the different x variables:
y = β1 x1 + β2 x2 + · · · + βp xp . (3.14)
Even though it’s still linear, this representation is very versatile; here are just a few of the
things we can represent with it:
• Multiple predictor variables: for example, suppose we’re trying to predict medical
outcome as a function of several variables such as age, genetic susceptibility, and clinical
diagnosis. Then we might say that for each patient, x1 = age, x2 = genetics, x3 =
diagnosis, and y = outcome.
One may ask: why not just use multiple linear regression and fit an extremely high-degree
polynomial to our data? While the model then would be much richer, one runs the risk of
overfitting, where the model is so rich that it ends up fitting to the noise! We illustrate
this with an example; it’s also illustrated by a song4 .
Example: Overfitting
Using too many features or too complex of a model can often lead to overfitting. Suppose we want to
fit a model to the points in Figure 3.3(a). If we fit a linear model, it might look like Figure 3.3(b). But,
the fit isn’t perfect. What if we use our newly acquired multiple regression powers to fit a 6th order
polynomial to these points? The result is shown in Figure 3.3(c). While our errors are definitely smaller
than they were with the linear model, the new model is far too complex, and will likely go wrong for
values too far outside the range.
(a) A set of points with a simple linear relationship.
(b) The same set of points with a linear fit (blue).
(c) The same points with a 6th-order polynomial fit (green). As before, the linear fit is shown in blue.
We’ll represent our input data in matrix form as X, an n × p matrix where each row corre-
sponds to a data point and each column corresponds to a feature. Since each output yi is
just a single number, we’ll represent the collection as an n-element column vector y. Then
our linear model can be expressed as
y = Xβ + ε (3.15)
4. Machine Learning A Cappella, Udacity. https://www.youtube.com/watch?v=DQWI1kvmwRg
where β is a p-element vector of coefficients, and ε is an n-element vector where each element,
like εi earlier, is normal with mean 0 and variance σ 2 . Notice that in this version, we haven’t
explicitly written out a constant term like β0 from before. We’ll often add a column of
1s to the matrix X to accomplish this (try multiplying things out and making sure you
understand why this solves the problem). The software you use might do this automatically,
so it’s something worth checking in the documentation.
This leads to the following optimization problem:
min_β Σ_{i=1}^n (yi − Xi β)²,   (3.16)

where min_β just means “find values of β that minimize the following”, and Xi refers to row
i of the matrix X.
We can use some basic linear algebra to solve this problem and find the optimal estimates:
β̂ = (X^T X)^(-1) X^T y,   (3.17)
which most computer programs will do for you. Once we have this, what conclusions can
we make with the help of statistics? We can obtain confidence intervals and/or hypothesis
tests for each coefficient, which most statistical software will do for you. The test statistics
are very similar to their counterparts for simple linear regression.
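A minimal numpy sketch of Equation (3.17) with made-up data. Note the column of 1s appended to X to supply the constant term, as discussed above; solving the linear system is used instead of forming the matrix inverse explicitly, which is the numerically preferable way to evaluate the same expression.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                        # n data points, p features
beta_true = np.array([1.0, 2.0, -0.5, 0.7])        # intercept plus p coefficients
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.3, size=n)

X1 = np.column_stack([np.ones(n), X])              # add a column of 1s for the intercept
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)    # same as (X^T X)^(-1) X^T y
print(beta_hat)                                    # close to beta_true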
It’s important not to blindly test whether all the coefficients are greater than zero: since this
involves doing multiple comparisons, we’d need to correct appropriately using Bonferroni
correction or FDR correction as described in the last chapter. But before even doing that,
it’s often smarter to measure whether the model even explains a significant amount of the
variability in the data: if it doesn’t, then it isn’t even worth testing any of the coefficients
individually. Typically, we’ll use an analysis of variance (ANOVA) test to measure this.
If the ANOVA test determines that the model explains a significant portion of the variability
in the data, then we can consider testing each of the hypotheses and correcting for multiple
comparisons.
We can also ask about which features have the most effect: if a feature’s coefficient is 0 or
close to 0, then that feature has little to no impact on the final result. We need to avoid the
effect of scale: for example, if one feature is measured in feet and another in inches, even if
they’re the same, the coefficient for the feet feature will be twelve times larger. In order to
avoid this problem, we’ll usually look at the standardized coefficients β̂k / sβ̂k.
How can we measure the performance of our model? Suppose for a moment that every point
yi was very close to the mean ȳ: this would mean that each yi wouldn’t depend on xi , and
that there wasn’t much random error in the value either. Since we expect that this shouldn’t
be the case, we can try to understand how much the prediction from xi and random error
Figure 3.5: An illustration of the components contributing to the difference between the average
y-value ȳ and a particular point (xi , yi ) (blue). Some of the difference, ŷi − ȳ, can be explained by
the model (orange), and the remainder, yi − ŷi , is known as the residual (green).
contribute to yi . In particular, let’s look at how far yi is from the mean ȳ. We’ll write this
difference as:

yi − ȳ = (ŷi − ȳ) + (yi − ŷi).   (3.18)
In particular, the residual is defined to be yi − ŷi : the distance from the original data point
to the predicted value on the line. You can think of it as the error left over after the model
has done its work. This difference is shown graphically in Figure 3.5. Note that the residual
yi − ŷ isn’t quite the same as the noise ε! We’ll talk a little more about analyzing residuals
(and why this distinction matters) in the next chapter.
If our model is doing a good job, then it should explain most of the difference from ȳ, and
the first term should be bigger than the second term. If the second term is much bigger,
then the model is probably not as useful.
If we square the quantity on the left, work through some algebra, and use some facts about
linear regression, we’ll find that
Σ_i (yi − ȳ)² = Σ_i (ŷi − ȳ)² + Σ_i (yi − ŷi)²,   (3.19)

with the three sums denoted SStotal, SSmodel, and SSerror,
where “SS” stands for “sum of squares”. These terms are often abbreviated as SST, SSM,
and SSE respectively.
If we divide through by SST, we obtain
1 = SSM/SST + SSE/SST = r² + (1 − r²),
where we note that r2 is precisely the coefficient of determination mentioned earlier. Here,
we see why r2 can be interpreted as the fraction of variability in the data that is explained
by the model.
One way we might evaluate a model’s performance is to compare the ratio SSM/SSE. We’ll
do this with a slight tweak: we’ll instead consider the mean values, MSM = SSM/(p − 1) and
MSE = SSE/(n − p), where the denominators correspond to the degrees of freedom. These
new variables MSM and MSE have χ2 distributions, and their ratio
f = MSM / MSE   (3.20)
has what’s known as an F distribution with parameters p − 1 and n − p. The widely used
ANOVA test for categorical data, which we’ll see in Chapter 6, is based on this F statistic:
it’s a way of measuring how much of the variability in the data is from the model and how
much is from random error, and comparing the two.
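Continuing the made-up multiple-regression example above, the sums of squares, r², and the F statistic can be computed directly; here p counts all fitted coefficients, including the intercept, so the degrees of freedom match the text, and scipy supplies the F distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -0.5, 0.7]) + rng.normal(0, 0.3, size=n)

X1 = np.column_stack([np.ones(n), X])              # intercept column
p = X1.shape[1]                                    # number of fitted coefficients
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_hat = X1 @ beta_hat

sst = np.sum((y - y.mean()) ** 2)                  # total sum of squares
ssm = np.sum((y_hat - y.mean()) ** 2)              # explained by the model
sse = np.sum((y - y_hat) ** 2)                     # residual
r2 = ssm / sst

f = (ssm / (p - 1)) / (sse / (n - p))              # MSM / MSE, Equation (3.20)
p_value = stats.f.sf(f, p - 1, n - p)
print(r2, f, p_value)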
LOGISTIC REGRESSION
Nitin R Patel
Logistic regression extends the ideas of multiple linear regression to the situation
where the dependent variable, y, is binary (for convenience we often code these values as
0 and 1). As with multiple linear regression the independent variables x1 , x2 · · · xk may
be categorical or continuous variables or a mixture of these two types.
(For fractions in cells above, the numerator is the number of adopters out of the
number in the denominator).
Note that the overall probability of adoption in the sample is 1628/10524 = 0.155.
However, the adoption probability varies depending on the categorical independent vari-
ables education, residential stability and income. The lowest value is 0.069 for low- income
no-residence-change households with some college education while the highest is 0.270 for
1
high-income residence changers with some college education.
The standard multiple linear regression model is inappropriate to model this data for
the following reasons:
2. The dependent variable is not normally distributed. In fact a binomial model would
be more appropriate. For example, if a cell total is 11 then this variable can take
on only 12 distinct values 0, 1, 2 · · · 11. Think of the response of the households in
a cell being determined by independent flips of a coin with, say, heads representing
adoption with the probability of heads varying between cells.
3. The variance of the dependent variable is not constant across cells: it is
higher for cells where the probability of adoption, p, is near 0.5 than where it is
near 0 or 1. It will also increase with the total number of households, n, falling in
the cell.
The logistic regression model was developed to account for all these difficulties. It
has become very popular in describing choice behavior in econometrics and in modeling
risk factors in epidemiology. In the context of choice behavior it can be shown to follow
from the random utility theory developed by Manski [2] as an extension of the standard
economic theory of consumer behavior.
In essence the consumer theory states that when faced with a set of choices a consumer
makes a choice which has the highest utility ( a numeric measure of worth with arbitrary
zero and scale). It assumes that the consumer has a preference order on the list of choices
that satisfies reasonable criteria such as transitivity. The preference order can depend on
the individual (e.g. socioeconomic characteristics as in the Example 1 above) as well as
2
attributes of the choice. The random utility model considers the utility of a choice to
incorporate a random element. When we model the random element as coming from a
”reasonable” distribution, we can logically derive the logistic model for predicting choice
behavior.
If we let y = 1 represent choosing an option versus y = 0 for not choosing it, the
logistic regression model stipulates:
Probability(Y = 1|x1, x2 · · · xk) = exp(β0 + β1 ∗ x1 + · · · + βk ∗ xk) / ( 1 + exp(β0 + β1 ∗ x1 + · · · + βk ∗ xk) )
model.
x2 ≡ (Residential Stability: No change over past five years = 0, Change over past five
years = 1)
The data in Table 1 is shown below in the format typically required by regression
programs.
3
Prob(Y = 1|x1, x2, x3) = exp(β0 + β1 ∗ x1 + β2 ∗ x2 + β3 ∗ x3) / ( 1 + exp(β0 + β1 ∗ x1 + β2 ∗ x2 + β3 ∗ x3) ).
We obtain a useful interpretation for the coefficients by noting that:
exp(β0) = Prob(Y = 1|x1 = x2 = x3 = 0) / Prob(Y = 0|x1 = x2 = x3 = 0)
        = Odds of adopting in the base case (x1 = x2 = x3 = 0)

exp(β1) = (Odds of adopting when x1 = 1, x2 = x3 = 0) / (Odds of adopting in the base case)

exp(β2) = (Odds of adopting when x2 = 1, x1 = x3 = 0) / (Odds of adopting in the base case)

exp(β3) = (Odds of adopting when x3 = 1, x1 = x2 = 0) / (Odds of adopting in the base case)
If x1 = 1 the odds of adoption get multiplied by the same factor for any given level of
x2 and x3 . Similarly the multiplicative factors for x2 and x3 do not vary with the levels
of the remaining factors. The factor for a variable gives us the impact of the presence of
that factor on the odds of adopting.
If βi = 0, the presence of the corresponding factor has no effect (multiplication by
one). If βi < 0, presence of the factor reduces the odds (and the probability) of adoption,
whereas if βi > 0, presence of the factor increases the probability of adoption.
The computations required to produce these maximum likelihood estimates require
iterations using a computer program. The output of a typical program is shown below:
4
95% Conf. Intvl. for odds
Variable Coeff. Std. Error p-Value Odds Lower Limit Upper Limit
Constant -2.500 0.058 0.000 0.082 0.071 0.095
x1 0.161 0.058 0.006 1.175 1.048 1.316
x2 0.992 0.056 0.000 2.698 2.416 3.013
x3 0.444 0.058 0.000 1.560 1.393 1.746
From the estimated values of the coefficients, we see that the estimated probability of adoption for a household with values x1, x2 and x3 of the independent variables is:

Prob(Y = 1 | x1, x2, x3) = exp(−2.500 + 0.161 x1 + 0.992 x2 + 0.444 x3) / [1 + exp(−2.500 + 0.161 x1 + 0.992 x2 + 0.444 x3)].

The estimated number of adopters from this model is the total number of households with values x1, x2 and x3 for the independent variables multiplied by the above probability.
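As a concrete check, the short sketch below (Python with NumPy; not part of the original text, and the cell chosen is just an example) plugs the estimated coefficients from the output above into the fitted equation and reproduces the expected number of adopters for one of the cells that appears in the validation table further below.

```python
import numpy as np

# Fitted coefficients from the output above: constant, x1, x2, x3.
beta = np.array([-2.500, 0.161, 0.992, 0.444])

def adoption_probability(x1, x2, x3):
    """Estimated Prob(Y = 1 | x1, x2, x3) from the fitted logistic model."""
    z = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3
    return np.exp(z) / (1.0 + np.exp(z))

# One validation cell from the table below: x1 = 0, x2 = 1, x3 = 1 with 143 households.
p = adoption_probability(0, 1, 1)
print(round(p, 4))           # estimated probability of adoption for that cell
print(round(143 * p, 3))     # estimated number of adopters, roughly 36.7
```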
The estimated number of adopters can be tabulated in this way for each of the eight combinations of values of the independent variables.
In data mining applications we will have validation data that is a hold-out sample not
used in fitting the model.
Let us suppose we have the following validation data consisting of 598 households:
x1  x2  x3   # in validation   # adopters in        Estimated      Error                 Absolute value
             sample            validation sample    (# adopters)   (Estimate − Actual)   of error
0   0   0        29                  3                 2.200            -0.800              0.800
0   0   1        23                  7                 2.610            -4.390              4.390
0   1   0       112                 25                20.302            -4.698              4.698
0   1   1       143                 27                36.705             9.705              9.705
1   0   0        27                  2                 2.374             0.374              0.374
1   1   0        54                 12                11.145            -0.855              0.855
1   0   1       125                 13                16.338             3.338              3.338
1   1   1        85                 30                24.528            -5.472              5.472
Totals          598                119               116.202
The total error is -2.8 adopters, i.e. a percentage error in estimating adopters of -2.8/119 ≈ -2.4%.
The confusion matrix for the households in the validation data set is given below:

                         Observed
Predicted        Adopters   Non-adopters   Total
Adopters           103          13          116
Non-adopters        16         466          482
Total              119         479          598
As with multiple linear regression, we can build more complex models that reflect interactions between independent variables by including factors that are calculated from the interacting factors. For example, if we felt that there is an interactive effect between x1 and x2, we would add an interaction term x4 = x1 × x2.
Example 2: Financial Conditions of Banks [2]
Table 2 gives data on a sample of banks. The second column records the judgment of an expert on the financial condition of each bank. The last two columns give the values of two ratios commonly used in the financial analysis of banks.
Let us first consider a simple logistic regression model with just one independent
variable. This is analogous to the simple linear regression model in which we fit a straight
line to relate the dependent variable, y, to a single independent variable, x.
Let us construct a simple logistic regression model for classification of banks using
the Total Loans & Leases to Total Assets ratio as the independent variable in our model.
This model would have the following variables:
Dependent variable:
Y = 1, if financially distressed,
Y = 0, otherwise.

Independent (explanatory) variable:
x1 = Total Loans & Leases to Total Assets ratio.

The equation relating the dependent variable to the explanatory variable is:

Prob(Y = 1 | x1) = exp(β0 + β1 x1) / [1 + exp(β0 + β1 x1)]
or, equivalently,

log [ Prob(Y = 1 | x1) / Prob(Y = 0 | x1) ] = β0 + β1 x1.

The Maximum Likelihood Estimates of the coefficients for the model are: β̂0 = −6.926, β̂1 = 10.989, so that the fitted model is

Prob(Y = 1 | x1) = exp(−6.926 + 10.989 x1) / [1 + exp(−6.926 + 10.989 x1)].
Figure 1 displays the data points and the fitted logistic regression model.
We can think of the model as a multiplicative model of odds ratios as we did for
Example 1. The odds that a bank with a Loan & Leases/Assets Ratio that is zero will
be in financial distress = exp(−6.926) = 0.001. These are the base case odds. The
odds of distress for a bank with a ratio of 0.6 will increase by a multiplicative factor of
exp(10.989∗0.6) = 730 over the base case, so the odds that such a bank will be in financial
distress = 0.730.
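A short sketch (Python; purely illustrative, not part of the original text) reproduces these odds calculations from the fitted coefficients:

```python
import math

# Fitted coefficients of the bank-distress model given above.
beta0, beta1 = -6.926, 10.989

base_odds = math.exp(beta0)              # odds of distress when the ratio is zero
factor_at_06 = math.exp(beta1 * 0.6)     # multiplicative change in the odds for a ratio of 0.6
odds_at_06 = base_odds * factor_at_06

# Converting odds back to a probability: p = odds / (1 + odds).
prob_at_06 = odds_at_06 / (1.0 + odds_at_06)

print(round(base_odds, 3))    # ~0.001 (base case odds)
print(round(factor_at_06))    # ~730
print(round(prob_at_06, 3))   # ~0.42 (probability of distress at a ratio of 0.6)
```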
Notice that there is a small difference in interpretation of the multiplicative factors for
this example compared to Example 1. While the interpretation of the sign of βi remains
as before, its magnitude gives the amount by which the odds of Y = 1 against Y = 0
are changed for a unit change in xi. If we construct a simple logistic regression model for classification of banks using the Total Expenses / Total Assets ratio as the independent variable, the model has the following variables:

Dependent variable:
Y = 1, if financially distressed,
Y = 0, otherwise.

Independent (explanatory) variable:
x2 = Total Expenses to Total Assets ratio.

The equation relating the dependent variable to the explanatory variable is:

Prob(Y = 1 | x2) = exp(β0 + β2 x2) / [1 + exp(β0 + β2 x2)]
or, equivalently,

log [ Prob(Y = 1 | x2) / Prob(Y = 0 | x2) ] = β0 + β2 x2.

The Maximum Likelihood Estimates of the coefficients for this model are: β̂0 = −9.587, β̂2 = 94.345.
Figure 2 displays the data points and the fitted logistic regression model.
Computation of Estimates

The coefficients of a logistic regression model are estimated by the method of maximum likelihood (see Appendix A). Under quite general conditions, maximum likelihood estimators are:

• Consistent: the probability of the estimator differing from the true value approaches zero as the sample size increases.
• Asymptotically efficient: for large samples, the estimates have approximately the smallest possible standard errors.
• Asymptotically normal: for large samples, the estimates follow an approximately Normal distribution, which is the basis for the confidence intervals and p-values reported by regression programs.
Algorithms to compute the coefficient estimates and confidence intervals are iterative and less robust than algorithms for linear regression. Computed estimates are generally reliable for well-behaved datasets where the number of observations with dependent variable values of both 0 and 1 is 'large', their ratio is 'not too close' to either zero or one, and the number of coefficients in the logistic regression model is small relative to the sample size (say, no more than 10%). As with linear regression, collinearity (strong correlation amongst the independent variables) can lead to computational difficulties.
Computationally intensive algorithms have been developed recently that circumvent some of these difficulties.
Appendix A
Computing Maximum Likelihood Estimates and Confidence Intervals
for Regression Coefficients
We denote the coefficients by the p × 1 column vector β, with the row element i equal to βi, and write xij for the value of the i-th variable and yj for the 0/1 response of observation j, i = 1 · · · p; j = 1 · · · n.

Likelihood Function: The likelihood of the observed data, viewed as a function of the coefficients, is

L(β) = ∏_{j=1}^{n}  e^{yj (β0 + β1 x1j + β2 x2j + · · · + βp xpj)} / [ 1 + e^{β0 + β1 x1j + β2 x2j + · · · + βp xpj} ]

     = ∏_{j=1}^{n}  e^{yj Σi βi xij} / [ 1 + e^{Σi βi xij} ]

     = e^{Σi βi ti} / ∏_{j=1}^{n} [ 1 + e^{Σi βi xij} ]

where ti = Σj yj xij
These are the sufficient statistics for a logistic regression model analogous to ŷ and S in
linear regression.
Loglikelihood Function: This is the logarithm of the likelihood function,

l(β) = Σi βi ti − Σ_{j=1}^{n} log[ 1 + e^{Σi βi xij} ].
We find the maximum likelihood estimates, β̂i , of βi by maximizing the loglikelihood
function for the observed values of yj and xij in our data. Since maximizing the log of
a function is equivalent to maximizing the function, we often work with the loglikeli-
hood because it is generally less cumbersome to use for mathematical operations such as
differentiation.
Since the loglikelihood function can be shown to be concave, we will find the global maximum of the function (if it exists) by equating the partial derivatives of the loglikelihood to zero and solving the resulting nonlinear equations for β̂i:

∂l/∂βi = ti − Σj xij π̂j = 0,   i = 1 · · · p,

or Σj xij π̂j = ti,

where π̂j = e^{Σi β̂i xij} / [ 1 + e^{Σi β̂i xij} ] is the estimated probability that Yj = 1.
In words, the maximum likelihood estimates are such that the expected value of the
sufficient statistics are equal to their observed values.
Note : If the model includes the constant term xij = 1 for all j then Σj E(Yj ) = Σj yj , i.e.
the expected number of successes (responses of one) using MLE estimates of βi equals the
observed number of successes. The β̂i ’s are consistent, asymptotically efficient and follow
a multivariate Normal distribution (subject to mild regularity conditions).
Algorithm : A popular algorithm for computing β̂i uses the Newton-Raphson method
for maximizing twice differentiable functions of several variables (see Appendix B).
The Newton-Raphson method involves computing the following successive approximations to find β̂, the maximizer of the loglikelihood function:

β^{t+1} = β^t + I(β^t)^{-1} ∇l(β^t),

where I(β) is the (observed) information matrix, the negative of the matrix of second partial derivatives of the loglikelihood:

Iij = −∂²l / (∂βi ∂βj).
• On convergence, the diagonal elements of I(β t )−1 give squared standard errors (ap-
proximate variance) for β̂i .
• Confidence intervals and hypothesis tests are based on asymptotic normal distribu-
tion of β̂i .
The loglikelihood function is always negative and does not have a maximum when it can be made arbitrarily close to zero. In that case the likelihood function can be made
arbitrarily close to one and the first term of the loglikelihood function given above
approaches infinity. In this situation the predicted probabilities for observations
with yj = 0 can be made arbitrarily close to 0 and those for yj = 1 can be made
arbitrarily close to 1 by choosing suitable very large absolute values of some βi . This
is the situation when we have a perfect model (at least in terms of the training data
set)! This phenomenon is more likely to occur when the number of parameters is a
large fraction (say > 20%) of the number of observations.
Appendix B
The Newton-Raphson Method
This method finds the values of βi that maximize a twice differentiable concave
function, g(β). If the function is not concave, it finds a local maximum. The
method uses successive quadratic approximations to g based on Taylor series. It
converges rapidly if the starting value, β 0 , is reasonably close to the maximizing
value, β̂, of β.
The gradient vector ∇ and the Hessian matrix, H, as defined below, are used to
update an estimate β t to β t+1 .
∇g(β^t) is the column vector of first partial derivatives ∂g/∂βi, and H(β^t) is the matrix of second partial derivatives ∂²g/(∂βi ∂βk), both evaluated at β = β^t.
The Taylor series expansion around β^t gives us:

g(β) ≈ g(β^t) + ∇g(β^t)′(β − β^t) + (1/2)(β − β^t)′ H(β^t)(β − β^t).

Provided H(β^t) is negative definite (as it is for a strictly concave g), the maximum of this quadratic approximation occurs where its derivative is zero:

∇g(β^t) + H(β^t)(β − β^t) = 0,

or

β = β^t − H(β^t)^{-1} ∇g(β^t).

This gives us a way to compute β^{t+1}, the next value in our iterations.
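As a concrete illustration of these iterations (a minimal sketch in Python with NumPy; the synthetic data, variable names and fixed iteration count are my own, not from the text), here is the Newton-Raphson update applied to the logistic regression loglikelihood of Appendix A:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Newton-Raphson iterations for logistic regression.

    X : (n, p) matrix whose first column is all ones (the constant term).
    y : (n,) vector of 0/1 responses.
    Returns the estimated coefficients and the inverse information matrix.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))       # estimated probabilities pi_j
        gradient = X.T @ (y - pi)                   # t_i - sum_j x_ij * pi_j
        w = pi * (1.0 - pi)                         # Bernoulli variances
        information = X.T @ (X * w[:, None])        # I(beta) = X' diag(w) X
        beta = beta + np.linalg.solve(information, gradient)
    return beta, np.linalg.inv(information)

# Tiny synthetic example, for illustration only.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.binomial(1, 0.5, 500)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-1.0, 0.8])))))
beta_hat, cov = fit_logistic_newton(X, y)
print(beta_hat)                 # estimated coefficients
print(np.sqrt(np.diag(cov)))    # approximate standard errors
```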
Near the maximum the rate of convergence is quadratic as it can be shown that
|βit+1 − β̂i | ≤ c|βit − β̂i |2 for some c ≥ 0 when βit is near β̂i for all i.
CS229 Lecture notes
Andrew Ng
Part XI
Principal components analysis
In our discussion of factor analysis, we gave a way to model data x ∈ Rn as
“approximately” lying in some k-dimension subspace, where k ≪ n. Specif-
ically, we imagined that each point x(i) was created by first generating some
z (i) lying in the k-dimension affine space {Λz + µ; z ∈ Rk }, and then adding
Ψ-covariance noise. Factor analysis is based on a probabilistic model, and
parameter estimation used the iterative EM algorithm.
In this set of notes, we will develop a method, Principal Components
Analysis (PCA), that also tries to identify the subspace in which the data
approximately lies. However, PCA will do so more directly, and will require
only an eigenvector calculation (easily done with the eig function in Matlab),
and does not need to resort to EM.
Suppose we are given a dataset {x(i) ; i = 1, . . . , m} of attributes of m dif-
ferent types of automobiles, such as their maximum speed, turn radius, and
so on. Let x(i) ∈ Rn for each i (n ≪ m). But unknown to us, two different
attributes—some xi and xj —respectively give a car’s maximum speed mea-
sured in miles per hour, and the maximum speed measured in kilometers per
hour. These two attributes are therefore almost linearly dependent, up to
only small differences introduced by rounding off to the nearest mph or kph.
Thus, the data really lies approximately on an n − 1 dimensional subspace.
How can we automatically detect, and perhaps remove, this redundancy?
For a less contrived example, consider a dataset resulting from a survey of
pilots for radio-controlled helicopters, where x1^(i) is a measure of the piloting skill of pilot i, and x2^(i) captures how much he/she enjoys flying. Because RC helicopters are very difficult to fly, only the most committed students, ones that truly enjoy flying, become good pilots. So, the two attributes x1 and x2 are strongly correlated. Indeed, we might posit that the
data actually lies along some diagonal axis (the u1 direction) capturing the
intrinsic piloting “karma” of a person, with only a small amount of noise
lying off this axis. (See figure.) How can we automatically compute this u1
direction?
[Figure: data points scattered along a diagonal direction u1, with axes x1 (skill) and x2 (enjoyment); u2 is the orthogonal direction.]
We will shortly develop the PCA algorithm. But prior to running PCA
per se, typically we first pre-process the data to normalize its mean and
variance, as follows:
1. Let µ = (1/m) Σ_{i=1}^{m} x^(i).
2. Replace each x^(i) with x^(i) − µ.
3. Let σj² = (1/m) Σ_i (xj^(i))².
4. Replace each xj^(i) with xj^(i) / σj.
Steps (1-2) zero out the mean of the data, and may be omitted for data
known to have zero mean (for instance, time series corresponding to speech
or other acoustic signals). Steps (3-4) rescale each coordinate to have unit
variance, which ensures that different attributes are all treated on the same
“scale.” For instance, if x1 was cars’ maximum speed in mph (taking values
in the high tens or low hundreds) and x2 were the number of seats (taking
values around 2-4), then this renormalization rescales the different attributes
to make them more comparable. Steps (3-4) may be omitted if we had
a priori knowledge that the different attributes are all on the same scale. One
example of this is if each data point represented a grayscale image, and each xj^(i) took a value in {0, 1, . . . , 255} corresponding to the intensity value of pixel j in image i.
Now, having carried out the normalization, how do we compute the “ma-
jor axis of variation” u—that is, the direction on which the data approxi-
mately lies? One way to pose this problem is as finding the unit vector u so
that when the data is projected onto the direction corresponding to u, the
variance of the projected data is maximized. Intuitively, the data starts off
with some amount of variance/information in it. We would like to choose a
direction u so that if we were to approximate the data as lying in the direc-
tion/subspace corresponding to u, as much as possible of this variance is still
retained.
Consider the following dataset, on which we have already carried out the
normalization steps:
[Figure: the normalized data set with each point projected onto a direction u of maximum variance; the projected points have large variance and lie far from the origin.]
We see that the projected data still has a fairly large variance, and the
points tend to be far from zero. In contrast, suppose we had instead picked the
following direction:
[Figure: the same data projected onto a different direction; the projections have much smaller variance and lie close to the origin.]
Here, the projections have a significantly smaller variance, and are much
closer to the origin.
We would like to automatically select the direction u corresponding to
the first of the two figures shown above. To formalize this, note that given a
unit vector u and a point x, the length of the projection of x onto u is given
by xT u. I.e., if x(i) is a point in our dataset (one of the crosses in the plot),
then its projection onto u (the corresponding circle in the figure) is distance
xT u from the origin. Hence, to maximize the variance of the projections, we
would like to choose a unit-length u so as to maximize:
(1/m) Σ_{i=1}^{m} (x^(i)T u)²  =  (1/m) Σ_{i=1}^{m} uT x^(i) x^(i)T u
                               =  uT [ (1/m) Σ_{i=1}^{m} x^(i) x^(i)T ] u.
We easily recognize that maximizing this subject to ||u||_2 = 1 gives the principal eigenvector of Σ = (1/m) Σ_{i=1}^{m} x^(i) x^(i)T, which is just the empirical covariance matrix of the data (assuming it has zero mean).¹
To summarize, we have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose u to be the
principal eigenvector of Σ. More generally, if we wish to project our data
into a k-dimensional subspace (k < n), we should choose u1 , . . . , uk to be the
top k eigenvectors of Σ. The ui ’s now form a new, orthogonal basis for the
data.2
Then, to represent x^(i) in this basis, we need only compute the corresponding vector

y^(i) = [ u1^T x^(i), u2^T x^(i), . . . , uk^T x^(i) ]^T ∈ R^k.
Thus, whereas x(i) ∈ Rn , the vector y (i) now gives a lower, k-dimensional,
approximation/representation for x(i) . PCA is therefore also referred to as
a dimensionality reduction algorithm. The vectors u1 , . . . , uk are called
the first k principal components of the data.
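A rough sketch of the whole procedure (Python with NumPy, added here for illustration; the example data and names are invented) is: normalize the data, form the empirical covariance matrix Σ, and take its top-k eigenvectors.

```python
import numpy as np

def pca(X, k):
    """Return the first k principal components and the reduced representation y^(i)."""
    X = X - X.mean(axis=0)           # steps (1-2): zero out the mean
    X = X / X.std(axis=0)            # steps (3-4): rescale to unit variance

    sigma = (X.T @ X) / X.shape[0]   # empirical covariance matrix (1/m) sum_i x_i x_i^T

    eigvals, eigvecs = np.linalg.eigh(sigma)   # ascending eigenvalues for a symmetric matrix
    U = eigvecs[:, ::-1][:, :k]                # top-k eigenvectors u_1, ..., u_k

    Y = X @ U                        # rows are y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
    return U, Y

# Illustrative data: speed in mph, the (nearly redundant) speed in km/h, and seat count.
rng = np.random.default_rng(0)
mph = rng.uniform(80, 160, size=200)
X = np.column_stack([mph, mph * 1.609 + rng.normal(0, 0.5, 200), rng.integers(2, 5, 200)])
U, Y = pca(X, k=2)
print(U.shape, Y.shape)              # (3, 2) (200, 2)
```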
Remark. Although we have shown it formally only for the case of k = 1,
using well-known properties of eigenvectors it is straightforward to show that
¹ If you haven't seen this before, try using the method of Lagrange multipliers to maximize uT Σu subject to uT u = 1. You should be able to show that Σu = λu, for some λ, which implies u is an eigenvector of Σ, with eigenvalue λ.
² Because Σ is symmetric, the ui's will (or always can be chosen to be) orthogonal to each other.
of all possible orthogonal bases u1, . . . , uk, the one that we have chosen maximizes Σ_i ||y^(i)||²_2. Thus, our choice of a basis preserves as much variability as possible in the original data.
In problem set 4, you will see that PCA can also be derived by picking
the basis that minimizes the approximation error arising from projecting the
data onto the k-dimensional subspace spanned by them.
PCA has many applications; we will close our discussion with a few exam-
ples. First, compression—representing x(i) ’s with lower dimension y (i) ’s—is
an obvious application. If we reduce high dimensional data to k = 2 or 3 di-
mensions, then we can also plot the y (i) ’s to visualize the data. For instance,
if we were to reduce our automobiles data to 2 dimensions, then we can plot
it (one point in our plot would correspond to one car type, say) to see what
cars are similar to each other and what groups of cars may cluster together.
Another standard application is to preprocess a dataset to reduce its
dimension before running a supervised learning algorithm with the
x(i) ’s as inputs. Apart from computational benefits, reducing the data’s
dimension can also reduce the complexity of the hypothesis class considered
and help avoid overfitting (e.g., linear classifiers over lower dimensional input
spaces will have smaller VC dimension).
Lastly, as in our RC pilot example, we can also view PCA as a noise re-
duction algorithm. In our example it estimates the intrinsic “piloting karma”
from the noisy measures of piloting skill and enjoyment. In class, we also saw
the application of this idea to face images, resulting in the eigenfaces method.
Here, each point x(i) ∈ R100×100 was a 10000 dimensional vector, with each co-
ordinate corresponding to a pixel intensity value in a 100x100 image of a face.
Using PCA, we represent each image x(i) with a much lower-dimensional y (i) .
In doing so, we hope that the principal components we found retain the inter-
esting, systematic variations between faces that capture what a person really
looks like, but not the “noise” in the images introduced by minor lighting
variations, slightly different imaging conditions, and so on. We then measure
distances between faces i and j by working in the reduced dimension, and
computing ||y (i) − y (j) ||2 . This resulted in a surprisingly good face-matching
and retrieval algorithm.
Recurrent Neural Networks
Neural Computation : Lecture 12
Recurrent neural network architectures can have many different forms. One common
type consists of a standard Multi-Layer Perceptron (MLP) plus added loops. These can
exploit the powerful non-linear mapping capabilities of the MLP, and also have some
form of memory. Others have more uniform structures, potentially with every neuron
connected to all the others, and may also have stochastic activation functions.
For simple architectures and deterministic activation functions, learning can be achieved
using similar gradient descent procedures to those leading to the back-propagation
algorithm for feed-forward networks. When the activations are stochastic, simulated
annealing approaches may be more appropriate. The following will look at a few of the
most important types and features of recurrent networks.
A Fully Recurrent Network
The simplest form of fully recurrent neural network is an MLP with the previous set of
hidden unit activations feeding back into the network along with the inputs:
[Figure: the hidden units receive the current inputs x(t) together with the previous hidden activations h(t-1) fed back through a delay, and produce h(t), which drives the outputs.]
Note that the time t has to be discretized, with the activations updated at each time step.
The time scale might correspond to the operation of real neurons, or for artificial systems
any time step size appropriate for the given problem can be used. A delay unit needs to
be introduced to hold activations until they are processed at the next time step.
The State Space Model
If the neural network inputs and outputs are the vectors x(t) and y(t), the three connection weight matrices are WIH, WHH and WHO, and the hidden and output unit activation functions are fH and fO, the behaviour of the recurrent network can be described as a dynamical system by the pair of non-linear matrix equations:

h(t) = fH( WIH x(t) + WHH h(t-1) )
y(t) = fO( WHO h(t) )

In general, the state of a dynamical system is a set of values that summarizes all the information about the past behaviour of the system that is necessary to provide a unique description of its future behaviour, apart from the effect of any external factors. In this case the state is defined by the set of hidden unit activations h(t).
Thus, in addition to the input and output spaces, there is also a state space. The order of
the dynamical system is the dimensionality of the state space, the number of hidden units.
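A minimal sketch of this state update (Python with NumPy; tanh hidden units, a linear output layer and random weights are assumptions made here only for illustration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_IH, W_HH, W_HO):
    """One discrete time step of the fully recurrent network sketched above."""
    h_t = np.tanh(W_IH @ x_t + W_HH @ h_prev)   # new state from input and previous state
    y_t = W_HO @ h_t                             # output read out from the state
    return h_t, y_t

# Run a short sequence through randomly initialized weights.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 2
W_IH = rng.normal(0, 0.5, (n_hidden, n_in))
W_HH = rng.normal(0, 0.5, (n_hidden, n_hidden))
W_HO = rng.normal(0, 0.5, (n_out, n_hidden))

h = np.zeros(n_hidden)                           # initial state
for t in range(4):
    x = rng.normal(size=n_in)
    h, y = rnn_step(x, h, W_IH, W_HH, W_HO)
    print(t, y)
```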
Stability, Controllability and Observability
Since one can think about recurrent networks in terms of their properties as dynamical
systems, it is natural to ask about their stability, controllability and observability:
Stability concerns the boundedness over time of the network outputs, and the response of the network outputs to small changes (e.g., to the network inputs or weights).

Controllability concerns whether it is possible to control the dynamical behaviour: a recurrent network is said to be controllable if an initial state can be steered to any desired state within a finite number of time steps.

Observability is concerned with whether it is possible to observe the results of the control applied. A recurrent network is said to be observable if the state of the network can be determined from a finite set of input/output measurements.
A rigorous treatment of these issues is way beyond the scope of this module!
Learning and Universal Approximation
The Universal Approximation Theorem tells us that any non-linear dynamical system can be approximated to any accuracy by a recurrent neural network, provided that it has enough hidden units.
However, knowing that a recurrent neural network can approximate any dynamical
system does not tell us how to achieve it. As with feed-forward neural networks, we
generally want them to learn from a set of training data to perform appropriately.
We can use Continuous Training, for which the network state is never reset during
training, or Epochwise Training, which has the network state reset at each epoch. We
now look at one particular learning approach that works for both training modes.
Unfolding Over Time
The recurrent network can be converted into a feed-forward network by unfolding over time:
[Figure: the unfolded network. The outputs are read out through WOH from the hidden units at time t; the hidden units at each time step receive the inputs at that step through WIH and the hidden units at the previous step through WHH, repeated back through times t-1, t-2, ...]
This means all the earlier theory about feed-forward network learning follows through.
Backpropagation Through Time
The Backpropagation Through Time (BPTT) learning algorithm is a natural extension of
standard backpropagation that performs gradient descent on a complete unfolded network.
If a network training sequence starts at time t0 and ends at time t1, the total cost function is simply the sum over time of the standard error function Esse/ce(t) at each time-step:

Etotal(t0, t1) = Σ_{t=t0}^{t1} Esse/ce(t)

and the gradient descent weight updates have contributions from each time-step:

Δwij = −η ∂Etotal(t0, t1)/∂wij = −η Σ_{t=t0}^{t1} ∂Esse/ce(t)/∂wij

The constituent partial derivatives ∂Esse/ce(t)/∂wij now have contributions from the multiple instances of each weight wij ∈ {WIH, WHH} and depend on the inputs and hidden unit activations at previous time steps. The errors now have to be back-propagated through time as well as through the network.
Practical Considerations for BPTT
Even for the relatively simple recurrent network shown above, the unfolded network is
quite complex, and keeping track of all the components at different points of time can
become unwieldy. Most useful networks are even more problematic in this regard.
Typically the updates are made in an online fashion with the weights being updated at
each time step. This requires storage of the history of the inputs and past network states
at earlier times. For this to be computationally feasible, it requires truncation at a
certain number of time steps, with the earlier information being ignored.
Assuming the network is stable, the contributions to the weight updates should become
smaller the further back in time they come from. This is because they depend on higher
powers of small feedback strengths (corresponding to the sigmoid derivatives multiplied
by the feedback weights). This means that truncation is not as problematic as it sounds,
though many time steps may be needed in practice (e.g., ~30). Despite its complexity,
BPTT has been shown many times to be an effective learning algorithm.
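The sketch below (Python with NumPy; it assumes tanh hidden units, linear outputs and a sum-of-squared-errors cost, and all names are illustrative) shows how the gradient contributions from every time-step of the unfolded network are accumulated by a forward pass followed by a backward pass through time:

```python
import numpy as np

def bptt_gradients(xs, ds, W_IH, W_HH, W_OH):
    """Backpropagation Through Time for a simple fully recurrent network.

    xs, ds : lists of input and target vectors for t = 0 .. T-1.
    Returns the gradients of the summed squared error w.r.t. each weight matrix.
    """
    T = len(xs)
    hs = [np.zeros(W_HH.shape[0])]          # hs[t] holds h(t-1) for step t
    ys = []
    for t in range(T):                       # forward pass through the unfolded network
        h = np.tanh(W_IH @ xs[t] + W_HH @ hs[-1])
        hs.append(h)
        ys.append(W_OH @ h)

    gIH, gHH, gOH = np.zeros_like(W_IH), np.zeros_like(W_HH), np.zeros_like(W_OH)
    delta_next = np.zeros(W_HH.shape[0])     # gradient flowing back from step t+1
    for t in reversed(range(T)):             # backward pass through time
        dy = ys[t] - ds[t]                   # derivative of 0.5 * ||y - d||^2
        gOH += np.outer(dy, hs[t + 1])
        dh = W_OH.T @ dy + W_HH.T @ delta_next
        da = dh * (1.0 - hs[t + 1] ** 2)     # through the tanh non-linearity
        gIH += np.outer(da, xs[t])
        gHH += np.outer(da, hs[t])
        delta_next = da
    return gIH, gHH, gOH

# Gradient descent update, as in the equations above: w := w - eta * dE/dw.
rng = np.random.default_rng(1)
W_IH = rng.normal(0, 0.3, (4, 2))
W_HH = rng.normal(0, 0.3, (4, 4))
W_OH = rng.normal(0, 0.3, (1, 4))
xs = [rng.normal(size=2) for _ in range(5)]
ds = [np.array([0.1 * t]) for t in range(5)]
gIH, gHH, gOH = bptt_gradients(xs, ds, W_IH, W_HH, W_OH)
W_IH -= 0.1 * gIH
W_HH -= 0.1 * gHH
W_OH -= 0.1 * gOH
```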
Simple Recurrent Networks
Truncating the unfolded network to just one time step reduces it to a so-called Simple
Recurrent Network (which is also commonly known as an Elman network):
[Figure: the truncated network. The outputs are read out through WOH from the hidden units at time t, which receive the inputs at time t through WIH and a copy of the hidden unit activations from time t-1 through WHH.]
In this case, each set of weights now appears only once, so it is possible to apply the
gradient descent approach using the standard backpropagation algorithm rather than full
BPTT. This means that the error signal will not get propagated back very far, and it will
be difficult for the network to learn how to use information from far back in time. In
practice, this approximation proves to be too great for many practical applications.
The NARX Model
An alternative RNN formulation has a single input and a single output, with a delay line
on the inputs, and the outputs fed back to the input by another delay line. This is known
as the Non-linear Auto-Regressive with eXogeneous inputs (NARX) model:
[Figure: a Multi-Layer Perceptron producing the output y(t+1) from a delay line of past inputs x(t), x(t-1), ... and a delay line of past outputs y(t), y(t-1), ... fed back from the output.]
It is particularly effective for Time Series Prediction, where the target y(t+1) is x(t+1).
Computational Power of Recurrent Networks
Recurrent neural networks exemplified by the fully recurrent network and the NARX
model have an inherent ability to simulate finite state automata. Automata represent
abstractions of information processing devices such as computers. The computational
power of a recurrent network is embodied in two main theorems:
Theorem 1
All Turing machines may be simulated by fully connected recurrent networks built of neurons
with sigmoidal activation functions.
Proof: Siegelmann & Sontag, 1991, Applied Mathematics Letters, vol 4, pp 77-80.
Theorem 2
NARX Networks with one layer of hidden neurons with bounded, one sided saturated (BOSS)
activation functions and a linear output neuron can simulate fully connected recurrent networks
with bounded one-sided saturated activation functions, except for a linear slowdown.
Proof: Siegelmann et al., 1997, Systems, Man and Cybernetics, Part B, vol 27, pp 208-215.
Application – Triangular Model of Reading
Any realistic model of reading must consider the interaction of orthography (letters),
phonology (sounds) and semantics (meanings). A recurrent neural network of the form:
[Figure: a recurrent network with mutually connected ORTHOGRAPHY, PHONOLOGY and SEMANTICS components]
is required. The various sub-tasks (e.g., reading and spelling) need to be trained in line
with human experience, and the dominant emergent pathways are found to depend on the
nature of the language (e.g., English versus Chinese).
Application – Associative Memory
Associative Memory is the general idea of accessing memory through content rather than
by address or location. One approach to this is for the memory content to be the pattern
of activations on the nodes of a recurrent neural network. The idea is to start the network
off with a pattern of activations that is a partial or noisy representation of the required
memory content, and have the network settle down to the required content.
Store a set of P binary valued patterns {tp} = {tip} in such a way that, when it is
presented with a new pattern s = {si}, the system (e.g., recurrent neural network)
responds by producing whichever stored pattern tp most closely resembles s.
One could have separate input and output nodes to achieve this, or one could have a single
set of nodes that act both as inputs and outputs. Either way, one needs to specify suitable
node activation functions, and weight definition or update equations, so that the recurrent
connections allow the learned memories to be accessed.
Hopfield Networks
The Hopfield Network or Hopfield Model is one good way to implement an associative
memory. It is simply a fully connected recurrent network of N McCulloch-Pitts neurons.
Activations are normally ±1, rather than 0 and 1, so the neuron activation equation is:

xi = sgn( Σ_j wij xj − θi ) ,   where   sgn(x) = +1 if x ≥ 0, and −1 if x < 0.
Unlike the earlier feed-forward McCulloch-Pitts networks, the activations here depend on time, because the activations keep changing until they have all settled down to some stable pattern. Those activations can be updated either synchronously or asynchronously.
It can be shown that the required associative memory can be achieved by simply setting the weights wij and thresholds θi in relation to the target outputs tp as follows:

wij = (1/N) Σ_{p=1}^{P} ti^p tj^p ,    θi = 0
A stored pattern tq will then be stable if the neuron activations are not changing, i.e.

ti^q = sgn( Σ_j wij tj^q − θi ) = sgn( (1/N) Σ_j Σ_p ti^p tj^p tj^q )

which is best analyzed by separating out the q term from the p sum to give

ti^q = sgn( ti^q + (1/N) Σ_j Σ_{p≠q} ti^p tj^p tj^q )
If the second term in this is zero, it is clear that pattern number q is stable. It will also be stable if the second term is non-zero but has magnitude less than 1, because that will not be enough to move the argument across the step of the sign function sgn(x). In practice,
this happens in most cases of interest as long as the number of stored patterns P is small
enough. Moreover, not only will the stored patterns be stable, but they will also be
attractors of patterns close by. Estimates of what constitutes a small enough number P
leads to the idea of the storage capacity of a Hopfield network. A full discussion of
Hopfield Networks can be found in most introductory books on neural networks.
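A small sketch of this (Python with NumPy; the no-self-connection convention, the asynchronous update schedule and the probe pattern are my own illustrative choices):

```python
import numpy as np

def hopfield_weights(patterns):
    """Store binary (+/-1) patterns using w_ij = (1/N) sum_p t_i^p t_j^p."""
    N = patterns.shape[1]
    W = (patterns.T @ patterns) / N
    np.fill_diagonal(W, 0.0)          # conventionally no self-connections
    return W

def recall(W, s, n_sweeps=10):
    """Asynchronously update the activations until they settle (thresholds are zero)."""
    x = s.copy()
    for _ in range(n_sweeps):
        for i in np.random.permutation(len(x)):
            x[i] = 1 if W[i] @ x >= 0 else -1
    return x

# Store two random +/-1 patterns and recover one of them from a noisy probe.
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(2, 50))
W = hopfield_weights(patterns)
probe = patterns[0].copy()
probe[:10] *= -1                      # flip 10 of the 50 bits
print(np.array_equal(recall(W, probe), patterns[0]))
```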
Boltzmann Machines
A Boltzmann Machine is a variant of the Hopfield Network composed of N units with
activations {xi}. The state of each unit i is updated asynchronously according to the rule:

xi = +1 with probability pi ,  −1 with probability 1 − pi ,    where    pi = 1 / [ 1 + exp( −( Σ_{j=1}^{N} wij xj − θi ) / T ) ]

with positive temperature constant T, network weights wij and thresholds θi.
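As a sketch of this stochastic update rule (Python with NumPy; the symmetric random weights and zero thresholds are arbitrary illustrative assumptions):

```python
import numpy as np

def boltzmann_update(x, W, theta, T, rng):
    """One asynchronous sweep of stochastic unit updates at temperature T."""
    for i in rng.permutation(len(x)):
        p_i = 1.0 / (1.0 + np.exp(-(W[i] @ x - theta[i]) / T))
        x[i] = 1 if rng.random() < p_i else -1
    return x

# At low T the updates behave almost deterministically (like a Hopfield net);
# at high T the state wanders over the whole state space.
rng = np.random.default_rng(0)
N = 20
W = rng.normal(0, 0.3, (N, N))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
theta = np.zeros(N)
x = rng.choice([-1, 1], size=N)
for T in (0.05, 5.0):
    print(T, boltzmann_update(x.copy(), W, theta, T, rng)[:5])
```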
The fundamental difference between the Boltzmann Machine and a standard Hopfield
Network is the stochastic activation of the units. If T is very small, the dynamics of the
Boltzmann Machine approximates the dynamics of the discrete Hopfield Network, but
when T is large, the network visits the whole state space. Another difference is that the
nodes of a Boltzmann Machine are split between visible input and output nodes, and
hidden nodes, and the aim is to have the machine learn input-output mappings.
Training proceeds by updating the weights using the Boltzmann learning algorithm

Δwij = −(η/T) ( ⟨xi xj⟩fixed − ⟨xi xj⟩free )

where ⟨xi xj⟩fixed is the expected/average product of xi and xj during training with the input and output nodes fixed at a training pattern and the hidden nodes free to update, and ⟨xi xj⟩free is the corresponding quantity when the output nodes are also free.
For both Hopfield Networks and Boltzmann Machines one can define an energy function

E = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} wij xi xj + Σ_{i=1}^{N} θi xi
and the network activation updates cause the network to settle into a local minimum of
this energy. This implies that the stored patterns will be local minima of the energy. If a
Boltzmann Machine starts off with a high temperature and is gradually cooled (known as
simulated annealing), it will tend to stay longer in the basins of attraction of the deepest
minima, and have a good chance of ending up in a global minimum at the end. Further
details about Boltzmann Machines can be found in most introductory textbooks.
Restricted Boltzmann Machines and Deep Learning
One particularly important variation of the standard Boltzmann Machine is the Restricted
Boltzmann Machine that has distinct sets of visible and hidden nodes, with no visible-to-
visible or hidden-to-hidden connections allowed. This restriction makes it possible to
formulate more efficient training algorithms, in particular, the Contrastive Divergence
learning algorithm that performs an approximation to gradient descent.
These networks have proved effective in the field of Deep Learning, and were what
started much of the current interest in all forms of deep learning neural networks. Each
layer of a many-layered feed-forward neural network can be efficiently pre-trained as an
unsupervised Restricted Boltzmann Machine, and the whole network can then be fine-
tuned using standard supervised Backpropagation.
Overview and Reading
1. We began by noting the important features of recurrent neural networks
and their properties as fully fledged dynamical systems.
2. Then we studied a basic Fully Recurrent Network and unfolding it over
time to give the Backpropagation Through Time learning algorithm.
3. This was followed by a brief overview of the NARX model and two
theorems about the computational power of recurrent networks.
4. We looked at a Reading Model and Associative Memory as examples.
5. Finally, Hopfield Networks and Boltzmann Machines were introduced.
CS229 Lecture notes
Andrew Ng
Part V
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning al-
gorithm. SVMs are among the best (and many believe are indeed the best)
“off-the-shelf” supervised learning algorithms. To tell the SVM story, we’ll
need to first talk about margins and the idea of separating data with a large
“gap.” Next, we’ll talk about the optimal margin classifier, which will lead
us into a digression on Lagrange duality. We’ll also see kernels, which give
a way to apply SVMs efficiently in very high dimensional (such as infinite-
dimensional) feature spaces, and finally, we’ll close off the story with the
SMO algorithm, which gives an efficient implementation of SVMs.
1 Margins: Intuition
We’ll start our story on SVMs by talking about margins. This section will
give the intuitions about margins and about the “confidence” of our predic-
tions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability p(y = 1|x; θ) is mod-
eled by hθ (x) = g(θT x). We would then predict “1” on an input x if and
only if hθ (x) ≥ 0.5, or equivalently, if and only if θT x ≥ 0. Consider a
positive training example (y = 1). The larger θT x is, the larger also is
hθ (x) = p(y = 1|x; w, b), and thus also the higher our degree of “confidence”
that the label is 1. Thus, informally we can think of our prediction as being
a very confident one that y = 1 if θT x ≫ 0. Similarly, we think of logistic
regression as making a very confident prediction of y = 0, if θT x ≪ 0. Given
a training set, again informally it seems that we’d have found a good fit to
the training data if we can find θ so that θT x(i) ≫ 0 whenever y (i) = 1, and
θT x(i) ≪ 0 whenever y (i) = 0, since this would reflect a very confident (and
correct) set of classifications for all the training examples. This seems to be
a nice goal to aim for, and we’ll soon formalize this idea using the notion of
functional margins.
For a different type of intuition, consider the following figure, in which x’s
represent positive training examples, o’s denote negative training examples,
a decision boundary (this is the line given by the equation θT x = 0, and
is also called the separating hyperplane) is also shown, and three points
have also been labeled A, B and C.
[Figure: positive (x) and negative (o) training examples with the separating hyperplane θT x = 0; point A lies far from the boundary, C lies very close to it, and B lies in between.]
Notice that the point A is very far from the decision boundary. If we are
asked to make a prediction for the value of y at A, it seems we should be
quite confident that y = 1 there. Conversely, the point C is very close to
the decision boundary, and while it’s on the side of the decision boundary
on which we would predict y = 1, it seems likely that just a small change to
the decision boundary could easily have caused our prediction to be y = 0.
Hence, we’re much more confident about our prediction at A than at C. The
point B lies in-between these two cases, and more broadly, we see that if
a point is far from the separating hyperplane, then we may be significantly
more confident in our predictions. Again, informally we think it’d be nice if,
given a training set, we manage to find a decision boundary that allows us
to make all correct and confident (meaning far from the decision boundary)
predictions on the training examples. We’ll formalize this later using the
notion of geometric margins.
2 Notation
To make our discussion of SVMs easier, we’ll first need to introduce a new
notation for talking about classification. We will be considering a linear
classifier for a binary classification problem with labels y and features x.
From now, we’ll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels.
Also, rather than parameterizing our linear classifier with the vector θ, we
will use parameters w, b, and write our classifier as

hw,b(x) = g(wT x + b),

where g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise.

3 Functional and geometric margins

Let's formalize the notions of the functional and geometric margins. Given a training example (x^(i), y^(i)), we define the functional margin of (w, b) with respect to the training example as

γ̂^(i) = y^(i) (wT x^(i) + b).

Note that if y^(i) = 1, then for the functional margin to be large (i.e., for
our prediction to be confident and correct), we need wT x + b to be a large
positive number. Conversely, if y (i) = −1, then for the functional margin
to be large, we need wT x + b to be a large negative number. Moreover, if
y (i) (wT x + b) > 0, then our prediction on this example is correct. (Check
this yourself.) Hence, a large functional margin represents a confident and a
correct prediction.
For a linear classifier with the choice of g given above (taking values in
{−1, 1}), there’s one property of the functional margin that makes it not a
very good measure of confidence, however. Given our choice of g, we note that
if we replace w with 2w and b with 2b, then since g(wT x + b) = g(2wT x + 2b),
this would not change hw,b (x) at all. I.e., g, and hence also hw,b (x), depends
only on the sign, but not on the magnitude, of wT x + b. However, replacing
(w, b) with (2w, 2b) also results in multiplying our functional margin by a
factor of 2. Thus, it seems that by exploiting our freedom to scale w and b,
we can make the functional margin arbitrarily large without really changing
anything meaningful. Intuitively, it might therefore make sense to impose
some sort of normalization condition such as that ||w||2 = 1; i.e., we might
replace (w, b) with (w/||w||2 , b/||w||2 ), and instead consider the functional
margin of (w/||w||2 , b/||w||2 ). We’ll come back to this later.
Given a training set S = {(x^(i), y^(i)); i = 1, . . . , m}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

γ̂ = min_{i=1,...,m} γ̂^(i).
Next, let's talk about geometric margins. Consider the picture below:

[Figure: the separating hyperplane with its normal vector w; a positive training example at point A, whose distance γ^(i) to its projection B onto the boundary is its geometric margin.]

The geometric margin γ^(i) of a training example (x^(i), y^(i)) with y^(i) = 1 is its distance from the decision boundary. Since w/||w|| is a unit-length vector pointing in the same direction as w, we find that the point B is given by x^(i) − γ^(i) · w/||w||. But this point lies on the decision boundary, and all points x on the decision boundary satisfy the equation wT x + b = 0. Hence,

wT ( x^(i) − γ^(i) w/||w|| ) + b = 0.

Solving for γ^(i) yields γ^(i) = (w/||w||)T x^(i) + b/||w||; more generally, for a training example with either label we define the geometric margin of (w, b) with respect to (x^(i), y^(i)) to be

γ^(i) = y^(i) ( (w/||w||)T x^(i) + b/||w|| ).
Note that if ||w|| = 1, then the functional margin equals the geometric
margin—this thus gives us a way of relating these two different notions of
margin. Also, the geometric margin is invariant to rescaling of the parame-
ters; i.e., if we replace w with 2w and b with 2b, then the geometric margin
does not change. This will in fact come in handy later. Specifically, because
of this invariance to the scaling of the parameters, when trying to fit w and b
to training data, we can impose an arbitrary scaling constraint on w without
changing anything important; for instance, we can demand that ||w|| = 1, or
|w1 | = 5, or |w1 + b| + |w2 | = 2, and any of these can be satisfied simply by
rescaling w and b.
Finally, given a training set S = {(x(i) , y (i) ); i = 1, . . . , m}, we also define
the geometric margin of (w, b) with respect to S to be the smallest of the
geometric margins on the individual training examples:
γ = min_{i=1,...,m} γ^(i).
Given a training set, a natural desideratum is to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good "fit" to the training data. Specifically, this will result in a classifier that separates the positive and the negative training examples with a "gap" (geometric margin).
For now, we will assume that we are given a training set that is linearly
separable; i.e., that it is possible to separate the positive and negative ex-
amples using some separating hyperplane. How will we find the one that
achieves the maximum geometric margin? We can pose the following opti-
mization problem:
max_{γ,w,b}   γ
s.t.   y^(i) (wT x^(i) + b) ≥ γ,   i = 1, . . . , m
       ||w|| = 1.
The "||w|| = 1" constraint is nasty (non-convex), so consider transforming the problem into:

max_{γ̂,w,b}   γ̂ / ||w||
s.t.   y^(i) (wT x^(i) + b) ≥ γ̂,   i = 1, . . . , m

Here, we're going to maximize γ̂/||w||, subject to the functional margins all being at least γ̂. Since the geometric and functional margins are related by γ = γ̂/||w||, this will give us the answer we want. Moreover, we've gotten rid of the constraint ||w|| = 1 that we didn't like. The downside is that we now have a nasty (again, non-convex) objective γ̂/||w||; and, we still don't have any off-the-shelf software that can solve this form of an optimization problem.
Let’s keep going. Recall our earlier discussion that we can add an arbi-
trary scaling constraint on w and b without changing anything. This is the
key idea we’ll use now. We will introduce the scaling constraint that the
functional margin of (w, b) with respect to the training set must be 1:

γ̂ = 1.

Plugging this into the problem above, and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing ||w||², we now have the following optimization problem:

min_{w,b}   (1/2) ||w||²
s.t.   y^(i) (wT x^(i) + b) ≥ 1,   i = 1, . . . , m.

Its solution gives us the optimal margin classifier. The problem has a convex quadratic objective and only linear constraints, and can be solved using commercial quadratic programming (QP) code.¹
5 Lagrange duality
Let’s temporarily put aside SVMs and maximum margin classifiers, and talk
about solving constrained optimization problems.
Consider a problem of the following form:
min_w   f(w)
s.t.   hi(w) = 0,   i = 1, . . . , l.
Some of you may recall how the method of Lagrange multipliers can be used
to solve it. (Don’t worry if you haven’t seen it before.) In this method, we
define the Lagrangian to be
L(w, β) = f(w) + Σ_{i=1}^{l} βi hi(w)
¹ You may be familiar with linear programming, which solves optimization problems that have linear objectives and linear constraints. QP software is also widely available, which allows convex quadratic objectives and linear constraints.
Here, the βi ’s are called the Lagrange multipliers. We would then find
and set L’s partial derivatives to zero:
∂L/∂wi = 0 ;    ∂L/∂βi = 0 ,
and solve for w and β.
In this section, we will generalize this to constrained optimization prob-
lems in which we may have inequality as well as equality constraints. Due to
time constraints, we won’t really be able to do the theory of Lagrange duality
justice in this class,2 but we will give the main ideas and results, which we
will then apply to our optimal margin classifier’s optimization problem.
Consider the following, which we’ll call the primal optimization problem:
minw f (w)
s.t. gi (w) ≤ 0, i = 1, . . . , k
hi (w) = 0, i = 1, . . . , l.
To solve it, we start by defining the generalized Lagrangian

L(w, α, β) = f(w) + Σ_{i=1}^{k} αi gi(w) + Σ_{i=1}^{l} βi hi(w).
Here, the αi ’s and βi ’s are the Lagrange multipliers. Consider the quantity
θP(w) = max_{α,β : αi ≥ 0} L(w, α, β).
Here, the “P” subscript stands for “primal.” Let some w be given. If w
violates any of the primal constraints (i.e., if either gi(w) > 0 or hi(w) ≠ 0
for some i), then you should be able to verify that
θP(w) = max_{α,β : αi ≥ 0} [ f(w) + Σ_{i=1}^{k} αi gi(w) + Σ_{i=1}^{l} βi hi(w) ]   (1)
       = ∞.   (2)
Conversely, if the constraints are indeed satisfied for a particular value of w,
then θP (w) = f (w). Hence,
θP(w) = f(w) if w satisfies the primal constraints, and θP(w) = ∞ otherwise.
² Readers interested in learning more about this topic are encouraged to read, e.g., R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.
Thus, θP takes the same value as the objective in our problem for all val-
ues of w that satisfies the primal constraints, and is positive infinity if the
constraints are violated. Hence, if we consider the minimization problem
min_w θP(w) = min_w max_{α,β : αi ≥ 0} L(w, α, β),
we see that it is the same problem (i.e., and has the same solutions as) our
original, primal problem. For later use, we also define the optimal value of
the objective to be p∗ = minw θP (w); we call this the value of the primal
problem.
Now, let’s look at a slightly different problem. We define
θD(α, β) = min_w L(w, α, β).
Here, the “D” subscript stands for “dual.” Note also that whereas in the
definition of θP we were optimizing (maximizing) with respect to α, β, here
we are minimizing with respect to w.
We can now pose the dual optimization problem:
max_{α,β : αi ≥ 0} θD(α, β) = max_{α,β : αi ≥ 0} min_w L(w, α, β).
This is exactly the same as our primal problem shown above, except that the
order of the “max” and the “min” are now exchanged. We also define the
optimal value of the dual problem's objective to be d* = max_{α,β : αi ≥ 0} θD(α, β).
How are the primal and the dual problems related? It can easily be shown
that
d* = max_{α,β : αi ≥ 0} min_w L(w, α, β)  ≤  min_w max_{α,β : αi ≥ 0} L(w, α, β) = p*.
(You should convince yourself of this; this follows from the “max min” of a
function always being less than or equal to the “min max.”) However, under
certain conditions, we will have
d ∗ = p∗ ,
so that we can solve the dual problem in lieu of the primal problem. Let’s
see what these conditions are.
Suppose f and the gi ’s are convex,3 and the hi ’s are affine.4 Suppose
further that the constraints gi are (strictly) feasible; this means that there
exists some w so that gi (w) < 0 for all i.
³ When f has a Hessian, then it is convex if and only if the Hessian is positive semi-definite. For instance, f(w) = wT w is convex; similarly, all linear (and affine) functions are also convex. (A function f can also be convex without being differentiable, but we won't need those more general definitions of convexity here.)
⁴ I.e., there exist ai, bi, so that hi(w) = ai^T w + bi. "Affine" means the same thing as linear, except that we also allow the extra intercept term bi.
Under the above assumptions, there must exist w*, α*, β* so that w* is the solution to the primal problem, (α*, β*) is the solution to the dual problem, and p* = d* = L(w*, α*, β*); moreover, w*, α* and β* satisfy the Karush-Kuhn-Tucker (KKT) conditions, among which is the dual complementarity condition αi* gi(w*) = 0.

Returning to the optimal margin classifier, we can write its constraints as

gi(w) = −y^(i) (wT x^(i) + b) + 1 ≤ 0.

We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have αi > 0 only for the training examples that have functional margin exactly equal to one (i.e., the ones whose constraints hold with equality, gi(w) = 0).
The points with the smallest margins are exactly the ones closest to the
decision boundary; here, these are the three points (one negative and two pos-
itive examples) that lie on the dashed lines parallel to the decision boundary.
Thus, only three of the αi ’s—namely, the ones corresponding to these three
training examples—will be non-zero at the optimal solution to our optimiza-
tion problem. These three points are called the support vectors in this
problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.
Let’s move on. Looking ahead, as we develop the dual form of the prob-
lem, one key idea to watch out for is that we’ll try to write our algorithm
in terms of only the inner product ⟨x^(i), x^(j)⟩ (think of this as (x^(i))T x^(j))
between points in the input feature space. The fact that we can express our
algorithm in terms of these inner products will be key when we apply the
kernel trick.
When we construct the Lagrangian for our optimization problem we have:

L(w, b, α) = (1/2) ||w||² − Σ_{i=1}^{m} αi [ y^(i) (wT x^(i) + b) − 1 ].   (8)
Note that there are only "αi" but no "βi" Lagrange multipliers, since the problem has only inequality constraints.
Let's find the dual form of the problem. To do so, we need to first minimize L(w, b, α) with respect to w and b (for fixed α), to get θD, which we do by setting the corresponding derivatives of L to zero. Setting the derivative with respect to w to zero gives

w = Σ_{i=1}^{m} αi y^(i) x^(i) ,   (9)

and setting the derivative with respect to b to zero gives

Σ_{i=1}^{m} αi y^(i) = 0.   (10)
If we take the definition of w in Equation (9) and plug that back into the
Lagrangian (Equation 8), and simplify, we get
L(w, b, α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y^(i) y^(j) αi αj (x^(i))T x^(j) − b Σ_{i=1}^{m} αi y^(i).
But from Equation (10), the last term must be zero, so we obtain
L(w, b, α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y^(i) y^(j) αi αj (x^(i))T x^(j).

Combining this with the constraints αi ≥ 0 and constraint (10), we obtain the following dual optimization problem:

max_α   W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} y^(i) y^(j) αi αj ⟨x^(i), x^(j)⟩
s.t.    αi ≥ 0,   i = 1, . . . , m
        Σ_{i=1}^{m} αi y^(i) = 0.
You should also be able to verify that the conditions required for p∗ =
d∗ and the KKT conditions (Equations 3–7) to hold are indeed satisfied in
our optimization problem. Hence, we can solve the dual in lieu of solving
the primal problem. Specifically, in the dual problem above, we have a
maximization problem in which the parameters are the αi ’s. We’ll talk later
about the specific algorithm that we’re going to use to solve the dual problem,
but if we are indeed able to solve it (i.e., find the α’s that maximize W (α)
subject to the constraints), then we can use Equation (9) to go back and find
the optimal w’s as a function of the α’s. Having found w∗ , by considering
the primal problem, it is also straightforward to find the optimal value for
the intercept term b as
b* = − [ max_{i : y^(i) = −1} w*T x^(i) + min_{i : y^(i) = 1} w*T x^(i) ] / 2.   (11)
(Check for yourself that this is correct.)
Before moving on, let’s also take a more careful look at Equation (9),
which gives the optimal value of w in terms of (the optimal value of) α.
Suppose we’ve fit our model’s parameters to a training set, and now wish to
make a prediction at a new point input x. We would then calculate wT x + b,
and predict y = 1 if and only if this quantity is bigger than zero. But
using (9), this quantity can also be written:
wT x + b = ( Σ_{i=1}^{m} αi y^(i) x^(i) )T x + b   (12)
         = Σ_{i=1}^{m} αi y^(i) ⟨x^(i), x⟩ + b.   (13)
Hence, if we’ve found the αi ’s, in order to make a prediction, we have to
calculate a quantity that depends only on the inner product between x and
the points in the training set. Moreover, we saw earlier that the αi ’s will all
be zero except for the support vectors. Thus, many of the terms in the sum
above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to calculate (13) and make our prediction.
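To make this concrete, here is a minimal sketch (Python with NumPy; the support vectors, multipliers and intercept are invented, and the kernel argument anticipates the next section) of evaluating the decision rule via Equation (13), using only inner products between x and the support vectors:

```python
import numpy as np

def svm_predict(x, support_vectors, support_labels, support_alphas, b, kernel):
    """Evaluate the decision function of Equation (13) using only the support vectors.

    The alphas, b and support vectors would come from solving the dual problem;
    here they are made-up values for illustration.
    """
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(support_alphas, support_labels, support_vectors)) + b
    return 1 if score > 0 else -1

def linear_kernel(u, v):
    return float(np.dot(u, v))

# Toy illustration with made-up support vectors and multipliers.
svs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0]), np.array([0.5, -0.5])]
labels = [1, -1, 1]
alphas = [0.7, 0.9, 0.2]
print(svm_predict(np.array([2.0, 0.5]), svs, labels, alphas, b=0.1, kernel=linear_kernel))
```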
By examining the dual form of the optimization problem, we gained sig-
nificant insight into the structure of the problem, and were also able to write
the entire algorithm in terms of only inner products between input feature
vectors. In the next section, we will exploit this property to apply kernels to our classification problem. The resulting algorithm, support vector
machines, will be able to efficiently learn in very high dimensional spaces.
7 Kernels
Back in our discussion of linear regression, we had a problem in which the input x was the living area of a house, and we considered performing regression using features such as x, x² and x³ to obtain a cubic function. To distinguish between these two sets of variables, we'll call the "original" input value the input attributes of the problem, and the quantities passed to the learning algorithm the features, with φ denoting the feature mapping from attributes to features.
Rather than applying SVMs using the original input attributes x, we may
instead want to learn using some features φ(x). To do so, we simply need to
go over our previous algorithm, and replace x everywhere in it with φ(x).
Since the algorithm can be written entirely in terms of the inner prod-
ucts ⟨x, z⟩, this means that we would replace all those inner products with ⟨φ(x), φ(z)⟩. Specifically, given a feature mapping φ, we define the corresponding Kernel to be

K(x, z) = φ(x)T φ(z).

For instance, consider K(x, z) = (xT z)². Expanding the square, this can be written as

K(x, z) = Σ_{i=1}^{n} Σ_{j=1}^{n} (xi xj)(zi zj).

Thus, we see that K(x, z) = φ(x)T φ(z), where the feature mapping φ is given (shown here for the case of n = 3) by
φ(x) = [ x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3 ]T.
Note that whereas calculating the high-dimensional φ(x) requires O(n2 ) time,
finding K(x, z) takes only O(n) time—linear in the dimension of the input
attributes.
For a related kernel, also consider

K(x, z) = (xT z + c)².

(Check this yourself.) This corresponds to the feature mapping (again shown for n = 3)
φ(x) = [ x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3, √(2c) x1, √(2c) x2, √(2c) x3, c ]T,
and the parameter c controls the relative weighting between the xi (first
order) and the xi xj (second order) terms.
More broadly, the kernel K(x, z) = (xT z + c)^d corresponds to a feature mapping to an (n+d choose d)-dimensional feature space, consisting of all monomials of the form xi1 xi2 . . . xik that are up to order d. However, despite working in this O(n^d)-dimensional space, computing K(x, z) still takes only O(n) time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
Now, let’s talk about a slightly different view of kernels. Intuitively, (and
there are things wrong with this intuition, but nevermind), if φ(x) and φ(z)
are close together, then we might expect K(x, z) = φ(x)T φ(z) to be large.
Conversely, if φ(x) and φ(z) are far apart—say nearly orthogonal to each
other—then K(x, z) = φ(x)T φ(z) will be small. So, we can think of K(x, z)
as some measurement of how similar are φ(x) and φ(z), or of how similar are
x and z.
Given this intuition, suppose that for some learning problem that you’re
working on, you’ve come up with some function K(x, z) that you think might
be a reasonable measure of how similar x and z are. For instance, perhaps
you chose
K(x, z) = exp( −||x − z||² / (2σ²) ).
This is a reasonable measure of x and z’s similarity, and is close to 1 when
x and z are close, and near 0 when x and z are far apart. Can we use this
definition of K as the kernel in an SVM? In this particular example, the
answer is yes. (This kernel is called the Gaussian kernel, and corresponds
to an infinite dimensional feature mapping φ.) But more broadly, given some
function K, how can we tell if it’s a valid kernel; i.e., can we tell if there is
some feature mapping φ so that K(x, z) = φ(x)T φ(z) for all x, z?
Suppose for now that K is indeed a valid kernel corresponding to some
feature mapping φ. Now, consider some finite set of m points (not necessarily
the training set) {x(1) , . . . , x(m) }, and let a square, m-by-m matrix K be
defined so that its (i, j)-entry is given by Kij = K(x(i) , x(j) ). This matrix
is called the Kernel matrix. Note that we’ve overloaded the notation and
used K to denote both the kernel function K(x, z) and the kernel matrix K,
due to their obvious close relationship.
Now, if K is a valid Kernel, then Kij = K(x(i) , x(j) ) = φ(x(i) )T φ(x(j) ) =
φ(x(j) )T φ(x(i) ) = K(x(j) , x(i) ) = Kji , and hence K must be symmetric. More-
over, letting φk (x) denote the k-th coordinate of the vector φ(x), we find that
for any vector z, we have

zT K z = Σ_i Σ_j zi Kij zj
       = Σ_i Σ_j zi φ(x^(i))T φ(x^(j)) zj
       = Σ_i Σ_j Σ_k zi φk(x^(i)) φk(x^(j)) zj
       = Σ_k Σ_i Σ_j zi φk(x^(i)) φk(x^(j)) zj
       = Σ_k ( Σ_i zi φk(x^(i)) )²
       ≥ 0.
The second-to-last step above used the same trick as you saw in Problem
set 1 Q1. Since z was arbitrary, this shows that K is positive semi-definite
(K ≥ 0).
Hence, we’ve shown that if K is a valid kernel (i.e., if it corresponds to
some feature mapping φ), then the corresponding Kernel matrix K ∈ Rm×m
is symmetric positive semidefinite. More generally, this turns out to be not
only a necessary, but also a sufficient, condition for K to be a valid kernel
(also called a Mercer kernel). The following result is due to Mercer.⁵

⁵ Many texts present Mercer's theorem in a slightly more complicated form involving L2 functions, but when the input attributes take values in R^n, the version given here is equivalent.
Theorem (Mercer). Let K : R^n × R^n → R be given. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {x^(1), . . . , x^(m)}, (m < ∞), the corresponding kernel matrix is symmetric positive semi-definite.

The application of kernels is not limited to SVMs: many of the algorithms that we'll see later in this class will also be amenable to this method, which has come to be known as the "kernel trick."
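As a quick empirical check of this property (Python with NumPy; the sample points are arbitrary), one can build the kernel matrix for the Gaussian kernel defined earlier and inspect its eigenvalues:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), the Gaussian kernel from above."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def kernel_matrix(points, kernel):
    """Build the m-by-m matrix with K_ij = K(x^(i), x^(j))."""
    m = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(m)] for i in range(m)])

# For a valid (Mercer) kernel the matrix should be symmetric positive semi-definite,
# so its eigenvalues should all be >= 0 (up to numerical round-off).
rng = np.random.default_rng(0)
points = [rng.normal(size=4) for _ in range(10)]
K = kernel_matrix(points, gaussian_kernel)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```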
To make the classifier work for datasets that are not linearly separable, and to make it less sensitive to outliers, the optimization problem is reformulated by giving each example a slack variable ξi ≥ 0, relaxing its constraint to y^(i)(wT x^(i) + b) ≥ 1 − ξi and adding a penalty C Σi ξi to the objective (1/2)||w||². Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 − ξi (with ξi > 0), we would pay a cost of the objective function being increased by C ξi. The parameter C
controls the relative weighting between the twin goals of making the ||w||2
small (which we saw earlier makes the margin large) and of ensuring that
most examples have functional margin at least 1.
Now, all that remains is to give an algorithm for actually solving the dual problem, which we will do in the next section.

9 The SMO algorithm

The SMO (sequential minimal optimization) algorithm gives an efficient way of solving the dual problem arising from the derivation of the SVM. Partly to motivate the SMO algorithm, and partly because it's interesting in its own right, let's first take another digression to talk about the coordinate ascent algorithm.
9.1 Coordinate ascent

Consider trying to solve the unconstrained optimization problem

max_α   W(α1, α2, . . . , αm).
Here, we think of W as just some function of the parameters αi ’s, and for now
ignore any relationship between this problem and SVMs. We’ve already seen
two optimization algorithms, gradient ascent and Newton’s method. The
new algorithm we’re going to consider here is called coordinate ascent:
Loop until convergence: {
    For i = 1, . . . , m, {
        αi := arg max_{α̂i} W(α1, . . . , αi−1, α̂i, αi+1, . . . , αm).
    }
}
Thus, in the innermost loop of this algorithm, we will hold all the vari-
ables except for some αi fixed, and reoptimize W with respect to just the
parameter αi . In the version of this method presented here, the inner-loop
reoptimizes the variables in order α1 , α2 , . . . , αm , α1 , α2 , . . .. (A more sophis-
ticated version might choose other orderings; for instance, we may choose
the next variable to update according to which one we expect to allow us to
make the largest increase in W (α).)
When the function W happens to be of such a form that the “arg max”
in the inner loop can be performed efficiently, then coordinate ascent can be
a fairly efficient algorithm. Here’s a picture of coordinate ascent in action:
[Figure: coordinate ascent in action; contour plot not reproduced in this excerpt.]
The ellipses in the figure are the contours of a quadratic function that
we want to optimize. Coordinate ascent was initialized at (2, −2), and also
plotted in the figure is the path that it took on its way to the global maximum.
Notice that on each step, coordinate ascent takes a step that’s parallel to one
of the axes, since only one variable is being optimized at a time.
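To make the procedure concrete, here is a small sketch of coordinate ascent on a two-variable concave quadratic (Python/NumPy; the particular quadratic and the number of passes are illustrative choices, not taken from the notes). Because the objective is quadratic, the inner arg max over a single coordinate has a closed form obtained by setting the corresponding partial derivative to zero.

import numpy as np

# Maximize W(a) = -0.5 * a^T A a + b^T a with A symmetric positive definite.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])

a = np.array([2.0, -2.0])            # initial point, as in the figure
for _ in range(50):                  # outer loop: repeat passes until (approximate) convergence
    for i in range(len(a)):
        # arg max over a[i] with the other coordinates held fixed:
        # dW/da_i = -A[i, i] * a[i] - sum_{j != i} A[i, j] * a[j] + b[i] = 0
        rest = A[i] @ a - A[i, i] * a[i]
        a[i] = (b[i] - rest) / A[i, i]

print(a)                             # coordinate ascent result
print(np.linalg.solve(A, b))         # exact maximizer, for comparison

Each inner update moves parallel to one coordinate axis, which is exactly the staircase-shaped path described for the figure above.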
9.2 SMO
We close off the discussion of SVMs by sketching the derivation of the SMO
algorithm. Some details will be left to the homework, and for others you
may refer to the paper excerpt handed out in class.
Here’s the (dual) optimization problem that we want to solve:
\max_\alpha \quad W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle        (17)
\text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, m        (18)
\sum_{i=1}^m \alpha_i y^{(i)} = 0.        (19)
Let’s say we have a set of αi’s that satisfy the constraints (18-19). Now,
suppose we want to hold α2 , . . . , αm fixed, and take a coordinate ascent step
and reoptimize the objective with respect to α1 . Can we make any progress?
The answer is no, because the constraint (19) ensures that
m
X
(1)
α1 y =− αi y (i) .
i=2
23
(This step used the fact that y^{(1)} ∈ {−1, 1}, and hence (y^{(1)})^2 = 1.) Hence, α1 is exactly determined by the other αi’s, and if we were to hold α2, . . . , αm fixed, then we can’t make any change to α1 without violating the constraint (19) in the optimization problem.
Thus, if we want to update some subset of the αi’s, we must update at least two of them simultaneously in order to keep satisfying the constraints. This motivates the SMO algorithm, which simply does the following:
This motivates the SMO algorithm, which simply does the following:
Repeat till convergence {
1. Select some pair αi and αj to update next (using a heuristic that
tries to pick the two that will allow us to make the biggest progress
towards the global maximum).
2. Reoptimize W(α) with respect to αi and αj, while holding all the other αk’s (k ≠ i, j) fixed.
}
To test for convergence of this algorithm, we can check whether the KKT
conditions (Equations 14-16) are satisfied to within some tol. Here, tol is
the convergence tolerance parameter, and is typically set to around 0.01 to
0.001. (See the paper and pseudocode for details.)
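The KKT conditions (14-16) referred to here are not reproduced in this excerpt. In the standard dual formulation they say, roughly: αi = 0 should imply y^{(i)} f(x^{(i)}) ≥ 1, αi = C should imply y^{(i)} f(x^{(i)}) ≤ 1, and 0 < αi < C should imply y^{(i)} f(x^{(i)}) = 1, where f(x) = Σ_j αj y^{(j)} K(x^{(j)}, x) + b. Below is a minimal sketch of a tolerance-based check along these lines (Python; illustrative only, and much simpler than the bookkeeping in Platt's actual implementation).

import numpy as np

def kkt_satisfied(alpha, y, K, b, C, tol=1e-3):
    # alpha: (m,) dual variables; y: (m,) labels in {-1, +1};
    # K: (m, m) kernel matrix on the training set; b: intercept; C: box bound.
    f = K @ (alpha * y) + b                 # f(x_i) for every training point
    margins = y * f
    for a_i, m_i in zip(alpha, margins):
        if a_i < tol and m_i < 1 - tol:                  # alpha_i ~ 0 but margin < 1
            return False
        if a_i > C - tol and m_i > 1 + tol:              # alpha_i ~ C but margin > 1
            return False
        if tol < a_i < C - tol and abs(m_i - 1) > tol:   # 0 < alpha_i < C but margin != 1
            return False
    return True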
The key reason that SMO is an efficient algorithm is that the update to
αi , αj can be computed very efficiently. Let’s now briefly sketch the main
ideas for deriving the efficient update.
Let’s say we currently have some setting of the αi ’s that satisfy the con-
straints (18-19), and suppose we’ve decided to hold α3 , . . . , αm fixed, and
want to reoptimize W (α1 , α2 , . . . , αm ) with respect to α1 and α2 (subject to
the constraints). From (19), we require that
\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^m \alpha_i y^{(i)}.
Since the right hand side is fixed (as we’ve fixed α3 , . . . αm ), we can just let
it be denoted by some constant ζ:
\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta.        (20)
We can thus picture the constraints on α1 and α2 as follows:
[Figure: the box [0, C] × [0, C] in the (α1, α2) plane, with the line α1 y^{(1)} + α2 y^{(2)} = ζ passing through it; L and H mark the smallest and largest values of α2 on the segment of the line inside the box.]
From the constraints (18), we know that α1 and α2 must lie within the box
[0, C] × [0, C] shown. Also plotted is the line α1 y (1) + α2 y (2) = ζ, on which we
know α1 and α2 must lie. Note also that, from these constraints, we know
L ≤ α2 ≤ H; otherwise, (α1 , α2 ) can’t simultaneously satisfy both the box
and the straight-line constraint. In this example, L = 0, but depending on what the line α1 y^{(1)} + α2 y^{(2)} = ζ looks like, this won't always be the case; more generally, there will be some lower bound L and some upper bound H on the permissible values of α2 that ensure α1 and α2 lie within the box [0, C] × [0, C].
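For concreteness, the bounds L and H can be computed directly from the current α1, α2 and the labels: when y^{(1)} ≠ y^{(2)} the constraint line has slope +1 in the (α1, α2) plane, and when y^{(1)} = y^{(2)} it has slope −1. The sketch below uses the standard SMO bound formulas (stated here without derivation; the helper name is ours).

def alpha2_bounds(alpha1, alpha2, y1, y2, C):
    # Bounds L <= alpha2 <= H implied by the box [0, C] x [0, C]
    # and the line alpha1*y1 + alpha2*y2 = zeta (Equation 20).
    if y1 != y2:                              # line of slope +1
        L = max(0.0, alpha2 - alpha1)
        H = min(C, C + alpha2 - alpha1)
    else:                                     # line of slope -1
        L = max(0.0, alpha1 + alpha2 - C)
        H = min(C, alpha1 + alpha2)
    return L, H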
Using Equation (20), we can also write α1 as a function of α2 :
\alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}.

(Check this derivation yourself; we again used the fact that y^{(1)} ∈ {−1, 1} so that (y^{(1)})^2 = 1.) Hence, the objective W(α) can be written as a function of α2 alone,

W(\alpha_1, \alpha_2, \ldots, \alpha_m) = W((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \ldots, \alpha_m),

which, treating α3, . . . , αm as constants, is a quadratic function of α2. Maximizing this quadratic and then clipping the result to the interval [L, H], so that the box constraint stays satisfied, gives the new value α2^{new}.
Finally, having found α2^{new}, we can use Equation (20) to go back and find the optimal value α1^{new}.
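Putting these last two steps together, here is a sketch of the final part of the pair update (Python; the unconstrained maximizer alpha2_unclipped is taken as given, since the quadratic it maximizes is not derived in this excerpt): clip the candidate α2 to [L, H], then recover α1 from Equation (20).

def smo_pair_update(alpha1, alpha2, y1, y2, alpha2_unclipped, L, H):
    zeta = alpha1 * y1 + alpha2 * y2               # fixed by the old values (Equation 20)
    alpha2_new = min(max(alpha2_unclipped, L), H)  # clip to the feasible segment [L, H]
    alpha1_new = (zeta - alpha2_new * y2) * y1     # uses y1 in {-1, +1}, so y1**2 == 1
    return alpha1_new, alpha2_new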
There are a couple more details that are quite easy but that we'll leave you to read about yourself in Platt's paper: one is the choice of the heuristics used to select the next αi, αj to update; the other is how to update b as the SMO algorithm is run.