Lecture Notes
These are notes for a one-semester undergraduate course on machine learning given by Prof.
Miguel Á. Carreira-Perpiñán at the University of California, Merced.
These notes may be used for educational, non-commercial purposes.
©2015–2023 Miguel Á. Carreira-Perpiñán
1 Introduction
What is machine learning (ML)?
• Data is being produced and stored continuously (“big data”):
– science: genomics, astronomy, materials science, particle accelerators. . .
– sensor networks: weather measurements, traffic. . .
– people: social networks, blogs, mobile phones, purchases, bank transactions. . .
– etc.
• Data is not random; it contains structure that can be used to predict outcomes, or gain knowledge in some way.
Ex: patterns of Amazon purchases can be used to recommend items.
• It is more difficult to design algorithms for such tasks (compared to, say, sorting an array or
calculating a payroll). Such algorithms need data.
Ex: construct a spam filter, using a collection of email messages labelled as spam/not spam.
Examples of ML problems
• Supervised learning: labels provided.
[Figures: a classification example (classify customers as low-risk or high-risk from attributes such as income) and a regression example (predict the price y of a used car from its mileage x).]
• Unsupervised learning: no labels provided, only the inputs.
– Learning associations:
∗ Basket analysis: let p(Y |X) = “probability that a customer who buys product X
also buys product Y ”, estimated from past purchases. If p(Y |X) is large (say 0.7),
associate “X → Y ”. When someone buys X, recommend them Y .
– Clustering: group similar data points.
– Density estimation: where are data points likely to lie?
– Dimensionality reduction: data lies in a low-dimensional manifold.
– Feature selection: keep only useful features.
– Outlier/novelty detection.
• Reinforcement learning: find a sequence of actions (policy) that reaches a goal. No supervised
output but delayed reward.
Ex: playing chess or a computer game, robot in a maze.
2 Supervised learning: classification & regression
Classification
Binary classification (two classes)
• We are given a training set of labeled examples (positive and negative) and want to learn a
classifier that we can use to predict unseen examples, or to understand the data.
• Input representation: we need to decide what attributes (features) to use to describe the input
patterns (examples, instances, data points). This implies ignoring other attributes as irrelevant.
[Figures: positive and negative training examples in the price–engine power plane (x1: price) and a candidate hypothesis given by an axis-aligned rectangle p1 ≤ x1 ≤ p2, e1 ≤ x2 ≤ e2.]
• Multiclass classification (K > 2 classes): one way is to solve K separate two-class problems, one per class (class k vs. the rest), where yn is coded as one-of-K and hk is the two-class classifier for problem k, i.e., hk(x) ∈ {0, 1}.
• Ideally, for a given pattern x only one hk (x) is one. When no, or more than one, hk (x) is one
then the classifier is in doubt and may reject the pattern.
[Figures: a multiclass example with classes such as family car and luxury sedan, and a regression example (price y as a function of mileage x).]
Regression
• Training set X = {(xn, yn)}_{n=1}^N where the label for a pattern xn ∈ R^D is a real value yn ∈ R.
In multivariate regression, yn ∈ R^d is a real vector.
• Interpolation: we learn a function h that passes through each training pair (xn , yn ): yn = h(xn ),
n = 1, . . . , N. Not advisable if the data is noisy.
Ex: polynomial interpolation (requires a polynomial of degree N − 1 with N points in general position).
Noise
• Noise is any unwanted anomaly in the data. It can be due to:
– Imprecision in recording the input attributes: xn .
– Errors in labeling the input vectors: yn .
– Attributes not considered that affect the label (hidden or latent attributes, may be unobservable).
• Noise makes learning harder.
• Should we keep the hypothesis class simple rather than complex? A simpler class:
– Is easier to use and to train (fewer parameters, faster).
– Is easier to explain or interpret.
– Has less variance in the learned model than for a complex class (less affected by single
instances), but also has higher bias.
Given comparable empirical error, a simple model will generalize better than a complex one.
(Occam’s razor : simpler explanations are more plausible; eliminate unnecessary complexity.)
In the regression example, a line with low enough error may be preferable to a parabola with a slightly lower error.
Outlier/novelty detection
• Outliers (novelties, anomalies): instances that are very different from the rest of the data. Ex:
– Abnormal behaviour: fraud in credit card transactions, intruder in network traffic, etc.
• Not usually cast as a two-class classification problem because there are typically few outliers
and they don’t fit a consistent pattern that can be easily learned.
• We can also identify outliers as points that are far away from the training samples.
Missing values
• Some instances may be missing the value of one or more features. Two basic ways to handle this:
– Discard examples having any missing values. Easy but throws away data.
– Fill in the missing values by estimating them (imputation).
Ex: mean imputation: impute feature d as its average value over the examples where it is present. Independent of the
present features in xn .
• Whether a feature is missing for xn may depend on the values of the other features at xn .
Ex: in a census survey, rich people may wish not to give their salary.
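As a quick illustration (not from the notes), a minimal numpy sketch of mean imputation with missing entries marked as NaN; the toy matrix is made up:

    import numpy as np

    X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])   # toy data with missing entries
    col_means = np.nanmean(X, axis=0)                          # per-feature mean over present values
    X_imputed = np.where(np.isnan(X), col_means, X)            # replace each NaN by its feature's mean
    print(X_imputed)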
• In order to find a unique solution, and learn something useful, we must make assumptions (=
inductive bias of the learning algorithm).
Ex: the use of a hypothesis class H; the use of the largest margin; the use of the least-squares objective function.
• We can always enlarge the class of functions that can be learned by using a larger hypothesis
class H (higher capacity, or complexity of H).
Ex: a union of rectangles; a polynomial of order N .
• How to choose the right inductive bias, in particular the right hypothesis class H? This is the
model selection problem.
• The goal of ML is not to replicate the training data, but to predict unseen data well, i.e., to
generalize well.
• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data:
– If H is less complex: underfitting. Ex: fitting a line to data generated from a cubic polynomial.
– If H is more complex: overfitting. Ex: fitting a cubic polynomial to data generated from a line.
• Validation set:
– Used to minimize the generalization error.
– Optimize hyperparameters or model structure.
– Usually done with a “grid search”. Ex: try all values of H ∈ {10, 50, 100} and λ ∈ {10−5 , 10−3 , 10−1 }.
Ex: select the number of hidden units H, the architecture, the regularization parameter λ, or how long to train for a neural net;
select k for a k-nearest-neighbor classifier.
• Test set:
– Used to report the generalization error.
– We optimize nothing on it, we just evaluate the final model on it.
Then:
1. For each class Hi , fit its optimal hypothesis hi using the training set.
2. Of all the optimal hypotheses, pick the one that is most accurate in the validation set.
We can then train the selected one on the training and validation set together (useful if we have little data altogether).
Dimensions of a supervised ML algorithm
• We have a sample X = {(xn, yn)}_{n=1}^N (usually independent and identically distributed, “iid”) drawn from an unknown distribution.
• We want to learn a useful approximation to the underlying function that generated the data.
• We must choose:
1. A model h(x; Θ) with parameters Θ (the hypothesis class).
2. A loss function L(·, ·) to compute the difference between the desired output (label) yn and our prediction h(xn; Θ). Approximation error (loss):
E(Θ; X) = Σ_{n=1}^N L(yn, h(xn; Θ)) = sum of errors over instances.
3. A learning algorithm (optimization procedure) to find the parameters Θ that minimize the error E(Θ; X).
• The model, loss and learning algorithm are chosen by the ML system designer so that:
– The model class is large enough to contain a good approximation to the underlying function
that generated the data in X in a noisy form.
– The learning algorithm is efficient and accurate.
– We must have sufficient training data to pinpoint the right model.
3 Bayesian decision theory
Probability review (appendix A):
Joint distribution: p(X = x, Y = y).
Conditioning (product rule): p(Y = y | X = x) = p(X = x, Y = y) / p(X = x).
Marginalizing (sum rule): p(X = x) = Σ_y p(X = x, Y = y).
Bayes’ theorem (inverse probability): p(X = x | Y = y) = p(Y = y | X = x) p(X = x) / p(Y = y).
Probability theory (and Bayes’ rule) is the framework for making decisions under uncertainty. It can
also be used to make rational decisions among multiple actions to minimize expected risk.
Classification
• Binary classification with random variables x ∈ RD (example) and C ∈ {0, 1} (label).
– Joint probability p(x, C): how likely it is to observe an example x and a class label C.
– Prior probability p(C): how likely it is to observe a class label C, regardless of x.
p(C) ≥ 0 and p(C = 1) + p(C = 0) = 1.
– Class likelihood p(x|C): how likely it is that, having observed an example with class label
C, the example is at x.
This represents how the examples are distributed for each class. We need a model of x for each class.
– Posterior probability p(C|x): how likely it is that, having observed an example x, its class
label is C.
p(C = 1|x) + p(C = 0|x) = 1.
This is what we need to classify x. We infer it from p(C) and p(x|C) using Bayes’ rule.
• Making a decision on a new example: given x, classify it as class 1 iff p(C = 1|x) > p(C = 0|x).
• Examples:
– Gaussian classes in 1D: p(x|Ck) = (1/(√(2π) σk)) exp(−½ ((x − µk)/σk)²).   (✐ p(C1|x) = ?)
– Prior probability: p(Ck) ≥ 0, k = 1, . . . , K, and Σ_{k=1}^K p(Ck) = 1.
– Class likelihood: p(x|Ck).
– Bayes’ rule: p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / Σ_{i=1}^K p(x|Ci) p(Ci).
[Figure: class likelihoods p(x|C1) and p(x|C2) for two Gaussian classes in 1D.]
• We learn the distributions p(C) and p(x|C) for each class from data using an algorithm.
• Naive Bayes classifier: a very simple classifier where we assume p(x|Ck) = p(x1|Ck) · · · p(xD|Ck) for each class k = 1, . . . , K, i.e., within each class the features are independent of each other.
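As a concrete illustration (parameter values made up, not from the notes), a small numpy sketch that evaluates the posterior p(Ck|x) via Bayes’ rule for two Gaussian classes in 1D:

    import numpy as np

    # Hypothetical 1D Gaussian classes: means, standard deviations and priors (assumed values).
    mu    = np.array([-2.0, 2.0])   # mu_1, mu_2
    sigma = np.array([ 1.0, 1.5])   # sigma_1, sigma_2
    prior = np.array([ 0.6, 0.4])   # p(C_1), p(C_2)

    def posterior(x):
        """Return p(C_k|x) for k = 1, 2 via Bayes' rule."""
        lik = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)  # p(x|C_k)
        joint = lik * prior                                                          # p(x|C_k) p(C_k)
        return joint / joint.sum()                                                   # normalize by p(x)

    p = posterior(0.5)
    print(p, "-> choose class", p.argmax() + 1)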
Losses and risks
• Sometimes the decision of what class to choose for x based on p(Ck |x) has a cost that depends
on the class. In that case, we should choose the class with lowest risk.
Ex: medical diagnosis, earthquake prediction.
• Define:
– λik : loss (cost) incurred for choosing class i when the input actually belongs to class k.
– Expected risk for choosing class i: Ri(x) = Σ_{k=1}^K λik p(Ck|x).
• Decision: we choose the class with minimum risk: arg min_{k=1,...,K} Rk(x).
Ex: C1 = no cancer, C2 = cancer; (λik) = [0 1000; 1 0] (i.e., λ11 = 0, λ12 = 1000, λ21 = 1, λ22 = 0); p(C1|x) = 0.8. ✐ R1 = ?, R2 = ?
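A minimal sketch of this computation with the numbers of the example above:

    import numpy as np

    lam = np.array([[0, 1000],    # lambda_ik: rows = chosen class i, columns = true class k
                    [1,    0]])
    post = np.array([0.8, 0.2])   # p(C_1|x), p(C_2|x)

    R = lam @ post                # R_i(x) = sum_k lambda_ik p(C_k|x)
    print(R)                      # [200.  0.8] -> choose C_2 (diagnose cancer) despite p(C_1|x) = 0.8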
Discriminant functions
• Define a discriminant function gk(x) for each class k = 1, . . . , K and classify x as the class with maximum gk(x).
[Figure: discriminant functions g1(x), g2(x), g3(x) and the resulting decision regions for classes C1, C2, C3.]
• Examples:
– Risk based on Bayes’ classifier: gk = −Rk.
– With the 0/1 loss: gk(x) = p(Ck|x), or? gk(x) = p(x|Ck)p(Ck), or? gk(x) = log p(x|Ck) + log p(Ck).
– Many others: SVMs, neural nets, etc.
• They divide the feature space into K decision regions Rk = {x: k = arg maxi gi (x)}, k =
1, . . . , K. The regions are separated by decision boundaries, where ties occur.
• With two classes we can define a single discriminant? g(x) = g1 (x) − g2 (x) and choose class 1
iff g(x) > 0.
Association rules
• Association rule: implication of the form X → Y (X: antecedent, Y : consequent).
• Application: basket analysis (recommend product Y when a customer buys product X).
• Support of the rule X → Y: p(X, Y) = #{baskets containing both X and Y} / #{baskets}.
Confidence of the rule X → Y: p(Y|X) = p(X, Y)/p(X) = #{baskets containing both X and Y} / #{baskets containing X}.
• Generalization to more than 2 items: X, Z → Y has confidence p(Y |X, Z), etc.
• Example: given this database of purchases. . .
purchase 1: milk, bananas, chocolate
purchase 2: milk, chocolate
purchase 3: milk, bananas
purchase 4: chocolate
purchase 5: chocolate
purchase 6: milk, chocolate
. . . consider the following association rules:
milk → bananas: support 2/6, confidence 2/4
bananas → milk: support 2/6, confidence 2/2
milk → chocolate: support ?, confidence ?
chocolate → milk: support ?, confidence ?
etc.
• Given a database of purchases, we want to find all possible rules having enough support and
confidence.
• Algorithm Apriori :
1. Find itemsets with enough support (by finding frequent itemsets).
We don’t need to enumerate all possible subsets of items: for (X, Y, Z) to have enough support, all its subsets (X, Y ), (Y, Z), (X, Z)
must have enough support? . Hence: start by finding frequent one-item sets (by doing a pass over the database). Then,
inductively, from frequent k-item sets generate k + 1-item sets and check whether they have enough support (by doing a
pass over the database).
2. Convert the found itemsets into rules with enough confidence (by splitting the itemset into antecedent and consequent).
Again, we don’t need to enumerate all possible subsets: for X → Y, Z to have enough confidence, X, Y → Z must have
enough confidence? . Hence: start by considering a single consequent and test the confidence for all possible single consequents
(adding as a valid rule if it has enough confidence). Then, inductively, consider two consequents for the added rules; etc.
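A small sketch (not the full Apriori algorithm, only the counting of support and confidence) on the toy basket database above:

    from itertools import permutations

    baskets = [{"milk", "bananas", "chocolate"}, {"milk", "chocolate"}, {"milk", "bananas"},
               {"chocolate"}, {"chocolate"}, {"milk", "chocolate"}]
    items = set().union(*baskets)
    N = len(baskets)

    def count(itemset):
        return sum(itemset <= b for b in baskets)   # number of baskets containing the itemset

    # Support and confidence of every rule X -> Y with a single antecedent and consequent.
    for X, Y in permutations(items, 2):
        both = count({X, Y})
        print(f"{X} -> {Y}: support {both}/{N}, confidence {both}/{count({X})}")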
• A more general way to establish rules (which may include hidden variables) are graphical
models.
Measuring classifier performance
Binary classification problems (K = 2 classes)
• The most basic measure is the classification error (or accuracy) in %: (# misclassified patterns) / (# patterns).
• It is often of interest to distinguish the different types of errors: confusing the positive class with the negative one, or vice versa.
Ex: voice authentication to log into a user account. A false positive (allowing an impostor) is much worse than a false negative (refusing a valid user).
                 Predicted class
True class       Positive                Negative                Total
Positive         tp: true positive       fn: false negative      p
Negative         fp: false positive      tn: true negative       n
Total            p′                      n′                      N
• Having trained a classifier, we can control fn vs fp through a threshold θ ∈ [0, 1]. Let C1 be the positive class and C2 the negative class. If the classifier returns p(C1|x), let us choose the positive class if p(C1|x) > θ (so θ = 1/2 gives the usual classifier):
– θ ≈ 1: almost always choose C2 , nearly no false positives but nearly no true positives.
– Decreasing θ increases the number of true positives but risks introducing false positives.
• ROC curve (“receiver operating characteristic” curve): the pair values (fp-rate, tp-rate) as a function of θ ∈ [0, 1]. Properties:
– It is always increasing.
– An ideal classifier is at (0, 1) (top left corner).
– Diagonal (fp-rate = tp-rate): random classifier, which outputs p(C1 |x) = u ∼ U(0, 1)? .
This is the worst we can do.
– Any classifier that is below the diagonal can be improved by flipping its decision? .
– For a dataset with N points, the ROC curve is really a set of N points (fp-rate, tp-rate)?. To construct it we don’t need to know the classifier, just its outputs p(C1|xn) ∈ [0, 1] on the points {(xn, yn)}_{n=1}^N of the dataset.
• Area under the curve (AUC): reduces the ROC curve to a number. Ideal classifier: AUC = 1.
• ROC and AUC allow us to compare classifiers over different loss conditions, and choose a value
of θ accordingly. Often, there is not a dominant classifier.
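A minimal sketch (scores and labels made up for illustration) of building the ROC points and the AUC from the classifier outputs p(C1|xn) and the true labels:

    import numpy as np

    scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])   # p(C1|x_n) for each point
    labels = np.array([1,   1,   0,   1,    0,   1,   0,   0  ])   # 1 = positive, 0 = negative

    # Sweep the threshold over the observed scores (plus 1.0) to get one (fp-rate, tp-rate) pair each.
    thresholds = np.r_[1.0, np.sort(scores)[::-1]]
    fpr = [np.mean(scores[labels == 0] >= t) for t in thresholds]
    tpr = [np.mean(scores[labels == 1] >= t) for t in thresholds]

    auc = np.trapz(tpr, fpr)     # area under the piecewise-linear ROC curve
    print(list(zip(fpr, tpr)), "AUC =", auc)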
Information retrieval
• We have a database of records (documents, images. . . ) and a query, for which some of the records are relevant (“positive”) and the rest are not (“negative”).
Precision = #(relevant and retrieved) / #(retrieved); recall = #(relevant and retrieved) / #(relevant).
[Figure: precision and recall illustrated with Venn diagrams.]
• With K > 2 classes, we use a confusion matrix: a K × K matrix whose entry (i, j) is the number of instances of class Ci that are classified as Cj.
4 Univariate parametric methods
• How to learn probability distributions from data (in order to use them to make decisions).
• We assume such distributions follow a particular parametric form (e.g. Gaussian), so we need to estimate its parameters (µ, σ).
(Probability review, appendix A; recall also: X and Y are independent ⇔ p(X = x, Y = y) = p(X = x) p(Y = y).)
For more complicated distributions, we usually need an algorithm to find the MLE.
• Resulting estimate for the probability at a new point x?: p(x|X) = ∫ p(x|Θ) p(Θ|X) dΘ. Hence, rather than using the prediction of a single Θ value (“frequentist statistics”), we average the prediction of every parameter value Θ using its posterior distribution (“Bayesian statistics”).
✓ works well when the sample size N is small (if the prior is helpful).
✗ computationally harder (needs to compute, usually approximately, integrals or summations); needs to define a prior.
Maximum likelihood estimation: parametric classification
• From ch. 3, for classes k = 1, . . . , K, we use:
• Ex: assume x|Ck ∼ N(x; µk, σk²). Estimate the parameters from a data sample {(xn, yn)}_{n=1}^N: MLE? = class proportions for p(Ck), and sample mean and sample variance of each class for µk and σk².
[Figures: (a) estimated class likelihoods p(x|Ci) for two Gaussian classes C1, C2 in 1D; (b) posteriors p(Ci|x) with equal priors; (c) expected risks R(αi|x), including a reject region.]
Maximum likelihood estimation: parametric regression
• Assume there exists an unknown function f that maps inputs x to outputs y = f(x), but that what we observe as output is a noisy version y = f(x) + ǫ, where ǫ is a random error. We want to estimate f by a parametric function h(x; Θ). In ch. 2 we saw the least-squares error was a good loss function to use for that purpose. We will now show that maximum likelihood estimation under Gaussian noise is equivalent to that.
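A small sketch of least-squares fitting of a line (equivalently, the MLE under Gaussian noise) on synthetic data made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)      # y = f(x) + eps with f linear

    A = np.column_stack([x, np.ones_like(x)])            # design matrix for h(x) = w1*x + w0
    w, *_ = np.linalg.lstsq(A, y, rcond=None)            # minimizes sum_n (y_n - h(x_n))^2
    print("w1, w0 =", w)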
• Examples: [Figure: regression of car price as a function of mileage x.]
5 Multivariate parametric methods
• Sample X = {xn}_{n=1}^N where each example xn contains D features (continuous or discrete).
• We often write the sample (or dataset) as a matrix X = (x1 , . . . , xN ) (examples = columns).
Sometimes as its transpose (examples = rows).
• Mean vector: µ = (1/N) Σ_{n=1}^N xn.
• Covariance matrix: Σ = (1/N) Σ_{n=1}^N (xn − µ)(xn − µ)^T =? (1/N) Σ_{n=1}^N xn xn^T − µµ^T.
Symmetric of D × D with elements σij, positive definite, diagonal = variances along each variable (σ11 = σ1², . . . , σDD = σD²).
• Correlation matrix: corr(Xi, Xj) = σij / (σi σj) ∈ [−1, 1].
If corr(Xi , Xj ) = 0 (equivalently σij = 0) then Xi and Xj are uncorrelated (but not necessarily independent).
If corr(Xi , Xj ) = ±1 then Xi and Xj are linearly related.
• Properties:
– If Σ is diagonal ⇒ the D features are uncorrelated (and, if x is Gaussian, also independent: p(x) = p(x1) · · · p(xD)).
– If x is Gaussian ⇒ each marginal p(xi ) and conditional distribution p(xi |xj ) is also Gaussian.
– If x ∼ ND (µ, Σ), WD×K ⇒ WT x ∼ NK (WT µ, WT ΣW). A linear projection of a Gaussian is Gaussian.
[Figure: a two-dimensional Gaussian density p(x1, x2) and its contours.]
Multivariate classification with Gaussian classes
Assume the class-conditional densities are Gaussian: x|Ck ∼ N (µk , Σk ). The maths carry over from
the 1D case in a straightforward way: p. 100
• MLE?: class proportions for p(Ck); sample mean and sample covariance for µk and Σk, resp.
• Discriminant function?: gk(x) = log p(x|Ck) + log p(Ck) = quadratic form on x.
• Number of parameters per class k = 1, . . . , K: p(Ck): 1; µk: D; Σk: D(D+1)/2?.
A lot of parameters for the covariances! Estimating them may need a large dataset.
• Diagonal covariances (not shared): Σk = diag(σk1², . . . , σkD²). Total KD parameters.
This assumes the features are independent within each class, and gives a naive Bayes classifier.
– ✐ ⇒ MLE for σkd² = dth diagonal element of Σk = variance of class k for feature d.
[Figures: class likelihoods p(x|C1) and posteriors p(C1|x) for two Gaussian classes under different covariance assumptions: Σ1 ≠ Σ2 full; Σ1 = Σ2 full; Σ1 = Σ2 diagonal; Σ1 = Σ2 = σ²I.]
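A brief sketch of the MLE fit and the quadratic discriminant for Gaussian classes (synthetic 2D data, full covariances; not from the notes):

    import numpy as np

    def fit_gaussian_classes(X, y, K):
        """MLE: class proportions, sample mean and sample covariance per class."""
        priors, means, covs = [], [], []
        for k in range(K):
            Xk = X[y == k]
            priors.append(len(Xk) / len(X))
            means.append(Xk.mean(axis=0))
            covs.append(np.cov(Xk, rowvar=False, bias=True))   # (1/Nk) sum (x-mu)(x-mu)^T
        return priors, means, covs

    def g(x, prior, mu, S):
        """Discriminant g_k(x) = log p(x|C_k) + log p(C_k) (quadratic in x)."""
        d = x - mu
        return (-0.5 * np.log(np.linalg.det(S))
                - 0.5 * d @ np.linalg.solve(S, d) + np.log(prior))

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([3, 3], 1, (100, 2))])
    y = np.r_[np.zeros(100, int), np.ones(100, int)]
    priors, means, covs = fit_gaussian_classes(X, y, 2)
    x_new = np.array([1.0, 1.0])
    scores = [g(x_new, priors[k], means[k], covs[k]) for k in range(2)]
    print("predicted class:", int(np.argmax(scores)))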
Tuning complexity
• Again a bias-variance dilemma:
– Too few parameters (e.g. Σ1 = Σ2 = σ²I): high bias.
– Too many parameters (e.g. Σ1 ≠ Σ2, full): high variance.
• We can use cross-validation, regularization, Bayesian priors on the covariances, etc.
Discrete features
• With binary features xd ∈ {0, 1}, model p(xd|Ck) within each class with a Bernoulli distribution and assume the features are independent within each class.
– This is a naive Bayes classifier. Its discriminant gk(x) is linear? (but x is binary). p. 108
Multivariate regression
• Linear regression where x ∈ RD : y = f (x) + ǫ with f (x) = wD xD + · · · + w1 x1 + w0 = wT x
(where we define x0 = 1).
• The maths carry over from the 1D case in a straightforward way: p. 110
• We can fit a nonlinear function h(x) similarly to a linear regression by defining additional, nonlinear features such as x2 = x², x3 = e^x, etc. (just as happens with polynomial regression). Then, using a linear model in the augmented space x = (x, x², e^x)^T will correspond to a nonlinear model in the original space x. Radial basis function networks, kernel SVMs. . .
6 Bias, variance and model selection
Evaluating an estimator: bias and variance
• Example: assume a Bernoulli distribution p(x; θ) with true parameter θ = 0.3. We want to estimate θ from a sample {x1, . . . , xN} of N iid tosses using θ̂ = (1/N) Σ_{n=1}^N xn as estimator. Ex:
X1 = 1, 1, 0, 0, 0, 0, 0, 0, 1, 1 → θ̂ = 0.4
X2 = 0, 1, 0, 0, 1, 0, 0, 0, 0, 0 → θ̂ = 0.2
...
But the estimated θ̂ itself varies depending on the sample!
Indeed, θ̂ is a r.v. with a certain average and variance (over all possible samples X). We’d like its average to be close to θ and its variance to be small.
Another ex: repeated measurements of your weight x ∈ R in a balance, assuming x ∼ N(µ, σ²) and an estimator µ̂ = (1/N) Σ_{n=1}^N xn.
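A quick simulation sketch of this: draw many samples of size N from a Bernoulli with θ = 0.3 and look at the distribution of θ̂ over samples (numbers made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, N, n_samples = 0.3, 10, 100000

    samples = rng.binomial(1, theta, size=(n_samples, N))   # each row = one sample X of N tosses
    theta_hat = samples.mean(axis=1)                        # estimator for each sample

    bias = theta_hat.mean() - theta
    var = theta_hat.var()
    print(f"bias ~ {bias:.4f}, variance ~ {var:.4f}")       # bias ~ 0, variance ~ theta(1-theta)/N = 0.021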
• Statistic φ(X ): any value that is calculated from a sample X (e.g. average, maximum. . . ). It is a
r.v. with an expectation (over samples) EX {φ(X )} and a variance EX {(φ(X ) − EX {φ(X )})2 }.
Ex.: if the sample X has N points x1, . . . , xN:
EX{φ(X)} = ∫ φ(x1, . . . , xN) p(x1, . . . , xN) dx1 . . . dxN = (iid) = ∫ φ(x1, . . . , xN) p(x1) · · · p(xN) dx1 . . . dxN.
• X = (x1 , . . . , xN ) iid sample from p(x; θ). Let φ(X ) be an estimator for θ. How good is it?
Mean square error of the estimator φ: error(φ, θ) = EX{(φ(X) − θ)²}.
• Bias of the estimator: bθ (φ) = EX {φ(X )} − θ. How much the expected value of the estimator
over samples differs from the true parameter value.
If bθ (φ) = 0 for all θ values: unbiased estimator.
Ex: the sample average x̄ = (1/N) Σ_{n=1}^N xn is an unbiased estimator of the true mean µ?, regardless of the distribution p(x). The sample variance (1/N) Σ_{n=1}^N (xn − x̄)² is a biased estimator of the true variance?.
• Variance of the estimator: var{φ} = EX{(φ(X) − EX{φ(X)})²}. How much the estimator varies around its expected value from one sample to another.
If var{φ} → 0 as N → ∞: consistent estimator.
Ex: the sample average is a consistent estimator of the true mean?.
[Figure: distribution of an estimator d over samples, showing its bias (E[d] − θ) and its variance.]
• Bias-variance decomposition:
error(φ, θ) = EX{(φ − θ)²} =? EX{(φ − EX{φ})²} + (EX{φ} − θ)² = var{φ} + b_θ(φ)²   (variance + bias²).
We want estimators that have both low bias and low variance; this is difficult.
• Ex: assume a Gaussian distribution p(x; µ, σ²). We want to estimate µ from a sample {x1, . . . , xN} of N iid tosses using each of the following estimators: µ̂ = (1/N) Σ_{n=1}^N xn, µ̂ = 7, µ̂ = x1, µ̂ = x1 x2.
✐ What is their bias, variance and error?
Illustration: estimate the bullseye location given the location of N shots at it.
Tuning model complexity: bias-variance dilemma
The ideal regression function: the conditional mean E {y|x}
• Consider the regression setting with a true, unknown regression function f and additive noise ǫ: y = f(x) + ǫ. Given a sample X = {(xn, yn)}_{n=1}^N drawn iid from p(x, y), we construct a regression estimate h(x). Then, for any h:
Expected square error at x: E{(y − h(x))² | x} =? E{(y − E{y|x})² | x} + (E{y|x} − h(x))²
(first term: noise; second term: squared error wrt E{y|x}).
The expectations are over the joint density p(x, y) for fixed x, or, equivalently?, over p(y|x).
– Noise term: variance of y given x. Equal to the variance of the noise ǫ added, independent
of h or X . Can never be removed no matter which estimator we use.
– Squared error term: how much the estimate h(x) deviates (at each x) from the regression
function E {y|x}. Depends on h and X . It is zero if h(x) = E {y|x} for each x.
The ideal? , optimal regression function is h(x) = f (x) = E {y|x} (conditional mean of p(y|x)).
• Consider h(x) as an estimate (for each x) of the true f (x). Bias-variance decomposition at x:
EX{(E{y|x} − h(x))² | x} = (E{y|x} − EX{h(x)})² + EX{(h(x) − EX{h(x)})²}   (bias² + variance).
The expectation EX {·} is over samples X of size N drawn from p(x, y). The other expectations are over p(y|x).
– hm (x) = 2 ∀x (constant fit): zero var, high bias (unless f (x) ≈ 2 ∀x), high total error.
– hm(x) = (1/N) Σ_{n=1}^N yn^(m) (average of sample Xm): higher var, lower bias, lower total error.
– hm (x) = polynomial of degree k: if k ↑ then bias↓, var↑.
• Low bias requires sufficiently flexible models, but flexible models have high variance. As we
increase complexity, bias decreases (better fit to data) and variance increases (fit varies more
with data). The optimal model has the best trade-off between bias and variance.
– Underfitting: model class doesn’t contain the solution (it is not flexible enough) ⇒ bias.
– Overfitting: model class too general and also learns the noise ⇒ variance.
Also, the variance due to the sample decreases as the sample size increases.
Model selection procedures
• Methods to find the model complexity that is best suited to the data.
• Cross-validation: the method most used in practice. We cannot calculate the bias and variance
(since we don’t know the true function f ), but we can estimate the total error.
– Given a dataset X , divide it into 3 disjoint parts (by sampling at random without replace-
ment from X ) as training, validation and test sets: T , V and T ’.
– Train candidate models of different complexities on T . Ex: polynomials of degree 0, 1, . . . , K.
– Pick the trained model that gives lowest error on V. This is the final model.
– Estimate the generalization error of the final model by its error on T ’.
If X is small in sample size, use K-fold cross-validation, to make better use of the available data.
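A minimal sketch of this procedure for selecting a polynomial degree (synthetic data; a single train/validation/test split rather than K-fold):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 6, 100)
    y = np.sin(x) + rng.normal(0, 0.3, 100)                     # unknown f plus noise

    idx = rng.permutation(100)
    tr, va, te = idx[:60], idx[60:80], idx[80:]                 # training, validation, test sets

    def rmse(k, fit_idx, eval_idx):
        w = np.polyfit(x[fit_idx], y[fit_idx], k)               # fit degree-k polynomial on T
        return np.sqrt(np.mean((np.polyval(w, x[eval_idx]) - y[eval_idx]) ** 2))

    best_k = min(range(6), key=lambda k: rmse(k, tr, va))       # lowest error on V
    print("selected degree:", best_k, "test RMSE:", rmse(best_k, np.r_[tr, va], te))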
• Regularization: instead of minimizing just the error on the data, minimize error + penalty:
min_h Σ_{n=1}^N (yn − h(xn))² + λ C(h)
– Model selection criteria (AIC, BIC. . . ): C(h) ∝ number of parameters in h (model size).
λ is set to a certain constant depending on the criterion. We try multiple model sizes.
– Smoothness penalty: e.g. C(h) = Σ_{i=0}^k wi² for polynomials of degree k, h(x) = Σ_{i=0}^k wi x^i.
λ is set by cross-validation. We try a single, relatively large model size.
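A short sketch of regularized least squares with the quadratic penalty C(h) = Σ_i wi² (ridge regression), on made-up data:

    import numpy as np

    def ridge_fit(x, y, degree, lam):
        """Minimize sum_n (y_n - h(x_n))^2 + lam * ||w||^2 for a degree-k polynomial h."""
        A = np.vander(x, degree + 1)                     # columns x^k, ..., x, 1
        w = np.linalg.solve(A.T @ A + lam * np.eye(degree + 1), A.T @ y)
        return w                                         # use np.polyval(w, x_new) to predict

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 6, 30)
    y = np.sin(x) + rng.normal(0, 0.3, 30)
    print(ridge_fit(x, y, degree=9, lam=1e-2))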
Illustration of cross-validation for polynomials
Task: regression problem in 1D using polynomials of degree k = 0, 1, . . . , K: pk(x; {ai}_{i=0}^k) = Σ_{i=0}^k ai x^i.
[Figures: fits of polynomials of degree k = 0, 1, 2, 3, 4 to the training and validation sets, and training/validation RMSE as a function of k; low degrees underfit, high degrees overfit.]
7 Nonparametric methods
Parametric methods
• Examples: a parametric density (e.g. Gaussian with parameters µ, σ), a linear regression function with parameters (w0, w1), etc.
• Training is the process of choosing the best parameter values Θ from a given dataset.
• The number of parameters is fixed and independent of (typically much smaller than) the training set size N. The model “compresses” the dataset into the parameters Θ, and the complexity of the model doesn’t grow with N. Ex: a linear function f(x) = w1 x + w0 has 2 parameters (w0, w1) regardless of the size N of the training set.
• After training, we discard the dataset, and use only the parameters to apply the model to
future data. Ex: keep the 2 parameters (w0 , w1 ) of the learned linear function and discard the training set {(xn , yn )}Nn=1 .
• The hypothesis space is restricted: only models of a certain parametric form (linear, etc.).
• Small complexity at test time as a function of N: both memory and time are Θ(1).
Nonparametric methods
• There is no training.
• The size of the model is given by the size of the dataset N, hence the complexity of the model
can grow with N.
• We keep the training set, which we need to apply the model to future data.
• The hypothesis space is less restricted (fewer assumptions). Often based on a smoothness
assumption: similar instances have similar outputs, and the output for an instance is a local
function of its neighboring instances. Essentially, the model behaves like a lookup table that we
use to interpolate future instances.
• Large complexity at test time as a function of N: both memory and time are Θ(N), because
we have to store all training instances and look through them to make predictions.
• They are particularly preferable to parametric methods with small training sets, where they
give better models and are relatively efficient computationally.
• Examples: kernel density estimate, nearest-neighbor classifier, running mean smoother, etc.
Nonparametric density estimation
• Given a sample X = {xn}_{n=1}^N drawn iid from an unknown density, we want to construct an estimator p(x) of the density.
• Histogram (consider first x ∈ R): split the real line into bins [x0 + mh, x0 + (m + 1)h] of width
h for m ∈ Z, and count points in each bin:
p(x) = (1/(Nh)) · (number of xn in the same bin as x),   x ∈ R.
– We need to select the bin width h and the origin x0 .
– x0 has a small but annoying effect on the histogram (near bin boundaries).
– h controls the histogram smoothness: spiky if h ↓ and smooth if h ↑.
– p(x) is discontinuous at bin boundaries.
– We don’t have to retain the training set once we have computed the counts.
– They generalize to D dimensions, but are practically useful only for D ≲ 2.
In D dimensions, it requires an exponential number of bins, most of which are empty.
• Kernel density estimate (KDE): place a kernel K (e.g. Gaussian) of bandwidth h on each data point and average: p(x) = (1/(Nh)) Σ_{n=1}^N K((x − xn)/h).
– Also possible to define a different bandwidth hn for each data point xn (adaptive KDE).
– The KDE quality degrades as the dimension D increases (no matter how h is chosen).
Could be improved by using a full covariance Σn per point, but it is preferable to use a mixture with K < N components.
• k-nearest-neighbor density estimate: p(x) = k / (2N dk(x)) for x ∈ R^D, where dk(x) = (Euclidean) distance of x to its kth nearest sample in X.
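A small sketch of a Gaussian-kernel KDE in 1D (data and bandwidth made up for illustration):

    import numpy as np

    def kde(x, data, h):
        """p(x) = (1/(N h)) * sum_n K((x - x_n)/h) with a Gaussian kernel K."""
        u = (x[:, None] - data[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
        return K.sum(axis=1) / (len(data) * h)

    data = np.array([1.0, 1.2, 2.5, 3.0, 3.1, 5.8])
    grid = np.linspace(0, 8, 5)
    print(kde(grid, data, h=0.5))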
[Figures: density estimates on a small 1D sample: histograms and KDEs for several bandwidths h, and k-nearest-neighbor estimates for k = 1, 3.]
Nonparametric classification
• Given {(xn, yn)}_{n=1}^N where xn ∈ R^D is a feature vector and yn ∈ {1, . . . , K} a class label.
• Estimate the class-conditional densities p(x|Ck ) with a KDE each. The resulting discriminant
function can be written (ignoring constant factors) as? :
gk(x) = Σ_{n=1, yn=k}^N K((x − xn)/h),   k = 1, . . . , K
so each data point xn votes only for its class, with a weight given by K(·) (instances closer to
x have bigger weight).
• k-nearest-neighbor classifier: assigns x to the class k having most instances among the k nearest neighbors of x.
– Most common case: k = 1, nearest-neighbor classifier.
It defines a Voronoi tessellation in R^D.
It works with as few as one data point per class!
✐ How does the k-nearest-neighbor classifier differ from the Euclidean distance classifier?
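A compact sketch of a k-nearest-neighbor classifier (Euclidean distance, majority vote; toy data):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        """Classify x by majority vote among its k nearest training points."""
        d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances to all training points
        nearest = y_train[np.argsort(d)[:k]]             # labels of the k nearest neighbors
        return np.bincount(nearest).argmax()             # most frequent class

    X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))   # -> 1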
Nonparametric regression
• Consider a sample {(xn, yn)}_{n=1}^N with xn ∈ R^D and yn ∈ R^E drawn iid from an unknown function plus noise y = f(x) + ǫ.
We construct a smoother (nonparametric regression estimate) g(x) of f(x).
• Kernel smoother: weighted average of the labels of instances near x (local average):
g(x) = Σ_{n=1}^N [ K((x − xn)/h) / Σ_{n'=1}^N K((x − xn')/h) ] yn,
i.e., g(x) = Σ_{n=1}^N wn(x) yn is a weighted average of {yn}_{n=1}^N with weights satisfying w1(x), . . . , wN(x) ≥ 0, Σ_{n=1}^N wn(x) = 1, and wn(x) large if xn is near x and small otherwise.
g(x) can be seen as the conditional mean E {y|x} of a KDE p(x, y) constructed on the sample? .
Also, as with KDEs:
– Regressogram: with an origin x0 and bin width h, and discontinuous at boundaries. eq. (8.24)
– Running mean smoother : K = uniform kernel. No origin x0 , only bin width h, but still discontinuous at boundaries. eq. (8.25)
– Running median smoother : like the running mean but using the median instead; more robust to outliers and noise.
– k-nearest-neighbor smoother : fixes k instead of h, adapting to the density around x.
• Running-line smoother : we use the data points around x (as defined by h or k) to estimate a
line rather than a constant, hence obtaining a local regression line.
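A short sketch of the kernel smoother with a Gaussian kernel (synthetic data made up for illustration):

    import numpy as np

    def kernel_smoother(x_query, x, y, h):
        """g(x) = sum_n w_n(x) y_n with weights w_n(x) proportional to K((x - x_n)/h)."""
        K = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)   # N weights per query point
        W = K / K.sum(axis=1, keepdims=True)                            # normalize so weights sum to 1
        return W @ y

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 8, 60))
    y = np.sin(x) + rng.normal(0, 0.2, 60)
    print(kernel_smoother(np.array([1.0, 4.0, 7.0]), x, y, h=0.5))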
[Figures: regressogram, running-mean smoother, kernel smoother and running-line smoother on a 1D sample, each for several values of the smoothing parameter h.]
How to choose the smoothing parameter
• Smoothing parameter : number of neighbors k or kernel bandwidth h.
8 Clustering
• Unsupervised problem: given a sample {xn}_{n=1}^N ⊂ R^D, partition it into groups such that points within each group are similar and groups are dissimilar (“hard partition”).
Or, determine soft assignments znk of point xn to cluster k for n = 1, . . . , N and k = 1, . . . , K, where znk ∈ [0, 1] and Σ_{k=1}^K znk = 1 (“soft partition”).
• Useful to understand structure in the data, or as preprocessing for supervised learning (e.g.
train a classifier for each cluster). Clustering does not need labels, which are usually costly to obtain.
• Examples:
– Customer segmentation: group customers according to demographic and transaction at-
tributes, then provide different strategies for each segment.
– Image segmentation: partition pixels into meaningful objects in the image.
Represent each pixel xn by a feature vector, typically position & intensity (i, j, I) or color (i, j, L∗ , u∗ , v∗ ).
• Basic types:
– Centroid-based : find a prototype (“centroid”) for each cluster.
k-means, k-medoids, k-modes, etc.
– Probabilistic models: mixture densities, where each component models one cluster.
Gaussian mixtures. Usually trained with the EM algorithm.
– Graph-based (or distance- or similarity-based ): construct a graph with the data points as
vertices and edges weighted by distance values, and partition it.
Hierarchical clustering, connected-components clustering, spectral clustering, etc.
• Ill-defined problem: what does “similar” mean, and how similar should points be to be in the
same cluster? Hence, many different definitions of clustering and many different algorithms.
This is typical of exploratory tasks, where we want to understand the structure of the data before
committing to specific analyses. In practice, one should try different clustering algorithms.
k-means clustering
• Hard partition of the sample into K clusters by (locally) minimizing the squared-error objective E({µk}, Z) = Σ_{n=1}^N Σ_{k=1}^K znk ||xn − µk||², alternating two steps: an assignment step (assign each point xn to its closest centroid µk) and a centroid step (set each µk to the mean of the points assigned to it).
[Figures: k-means on a 2D dataset: initial centroids and assignments, and the result after successive iterations.]
• Each iteration (centroid + assignment step) reduces E or leaves it unchanged. After a finite number of iterations (there are at most K^N clusterings), no more changes occur and we stop.
• The result depends on the initialization. We usually initialize Z by a random assignment of
points to clusters, run k-means from several such random Z, and pick the best result.
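A minimal sketch of the k-means iteration (assignment step + centroid step) in numpy, on made-up data:

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]        # initialize with random points
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            z = d.argmin(axis=1)                                   # assignment step
            new = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centroids[k]
                            for k in range(K)])                    # centroid step
            if np.allclose(new, centroids):
                break                                              # no more changes: stop
            centroids = new
        return centroids, z

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in ([0, 0], [4, 4], [0, 4])])
    centroids, z = kmeans(X, K=3)
    print(centroids)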
• Vector quantization: compress a continuous space of dimension D, represented by a sample X = {xn}_{n=1}^N, into a finite set of reference vectors {µk}_{k=1}^K (codebook), with minimal distortion.
– Ex: color quantization. Represent 24-bit color images using only 256 colors with minimal distortion.
– A new point x ∈ R^D is quantized as µ_k* with k* = arg min_{k=1,...,K} ||x − µk||.
This partitions the space into a Voronoi tessellation.
– Lossy compression: encode x ∈ RD (D floats) as an integer k ∈ {1, . . . , K} (⌈log2 K⌉ bits).
Also needs to store the codebook.
– We can learn the codebook with k-means. Better than uniform quantization, because it adapts to density
variations in the data and does not require a codebook of size exponential on D. Usually K is larger than for clustering.
[Figure: vector quantization as a communication system: the encoder finds the codebook vector µi closest to x and sends its index i over the communication line; the decoder reconstructs x′ = µi.]
Mixture densities and the EM algorithm
• Mixture density with K components: p(x; Θ) = Σ_{k=1}^K p(x|k) p(k), where the p(x|k) are the component densities and the p(k) = πk the mixture proportions.
Similar to k-means, where the assignment and centroid steps correspond to the E and M steps.
But in EM the assignments are soft (znk ∈ [0, 1]), while in k-means they are hard (znk ∈ {0, 1}).
• If we knew which component xn came from for each n = 1, . . . , N, we’d not need the E step:
a single M step that estimates each component’s parameters on its set of points would suffice.
This was the case in classification (where x|Ck is Gaussian): we were given (xn , yn ).
• Each EM step increases L or leaves it unchanged, but it takes an infinite number of iterations
to converge. In practice, we stop when the parameters don’t change much, or when the number
of iterations reaches a limit.
• EM converges to a local optimum that depends on the initial value of Θ. Usually initialized from k-means?.
• User parameter: number of clusters K. Output: posterior probabilities {p(k|xn)} and {πk, µk, Σk}_{k=1}^K.
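Since the E and M steps are not written out above, here is a compact sketch of EM for a Gaussian mixture (full covariances, numpy only, synthetic data) to make the alternation concrete:

    import numpy as np

    def gaussian(X, mu, S):
        d = X.shape[1]
        diff = X - mu
        e = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(S), diff)
        return np.exp(-0.5 * e) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

    def em_gmm(X, K, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        pi = np.full(K, 1 / K)                              # mixture proportions
        mu = X[rng.choice(N, K, replace=False)]             # initial means (could use k-means)
        S = np.array([np.cov(X, rowvar=False)] * K)         # initial covariances
        for _ in range(n_iter):
            # E step: responsibilities p(k|x_n)
            r = np.column_stack([pi[k] * gaussian(X, mu[k], S[k]) for k in range(K)])
            r /= r.sum(axis=1, keepdims=True)
            # M step: update proportions, means, covariances
            Nk = r.sum(axis=0)
            pi = Nk / N
            mu = (r.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
        return pi, mu, S, r

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([3, 3], 0.8, (100, 2))])
    pi, mu, S, r = em_gmm(X, K=2)
    print(pi, mu)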
Density-based clustering: mean-shift clustering
[Figure: a density p(x) with three modes, each defining a cluster C1, C2, C3.]
• Define a function p(x) that represents the density of the dataset {xn}_{n=1}^N ⊂ R^D, then declare each mode (maximum) of p as a cluster representative and assign each xn to a mode via the mean-shift algorithm. Most useful with low-dimensional data, e.g. image segmentation (where xn = features of nth pixel).
• Kernel density estimate with bandwidth σ: a mixture having one component for each data point:
p(x) = Σ_{n=1}^N p(x|n) p(n) = (1/(N σ^D)) Σ_{n=1}^N K((x − xn)/σ),   x ∈ R^D.
Usually the kernel K is Gaussian: K((x − xn)/σ) = (2π)^{−D/2} exp(−½ ||(x − xn)/σ||²).
• Mean-shift algorithm: starting from an initial value of x, it iterates the following expression:
x ← Σ_{n=1}^N p(n|x) xn   where   p(n|x) = p(x|n) p(n) / p(x) = exp(−½ ||(x − xn)/σ||²) / Σ_{n'=1}^N exp(−½ ||(x − xn')/σ||²).
“Σ_{n=1}^N p(n|x) xn” can be understood as the weighted average of the N data points using as weights the posterior probabilities p(n|x). The mean-shift algorithm converges to a mode of p(x). Which one it converges to depends on the initialization. By running mean-shift starting at a data point xn, we effectively assign xn to a mode. We repeat for all points x1, . . . , xN.
• User parameter: σ, which determines the number of clusters (σ ↓: N clusters, σ ↑: 1 cluster).
Output: modes and clusters.
• Nonparametric clustering: no assumption on the shape of the clusters or their number.
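A brief sketch of the mean-shift iteration with a Gaussian kernel, run from each data point (toy data):

    import numpy as np

    def mean_shift_point(x, data, sigma, n_iter=100):
        """Iterate x <- sum_n p(n|x) x_n until convergence to a mode of the KDE."""
        for _ in range(n_iter):
            w = np.exp(-0.5 * np.sum(((x - data) / sigma) ** 2, axis=1))   # unnormalized p(n|x)
            x_new = (w[:, None] * data).sum(axis=0) / w.sum()
            if np.linalg.norm(x_new - x) < 1e-6:
                break
            x = x_new
        return x

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal([0, 0], 0.3, (40, 2)), rng.normal([3, 0], 0.3, (40, 2))])
    modes = np.array([mean_shift_point(x, data, sigma=0.5) for x in data])   # one mode per point
    print(np.round(modes, 1))   # points assigned to the same (rounded) mode form one cluster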
Graph-based clustering: connected-components clustering
• Define a distance d(x, y) between points. Ex:
– Minkowski (or ℓp) distance d(x, y) = (Σ_{d=1}^D |xd − yd|^p)^{1/p}, if x, y ∈ R^D.
– Number of words that two documents have in common; “edit” distance between strings (Exe. 6).
[Figure: unit balls of the ℓp distance in 2D for several values of p.]
• Define a neighborhood graph with vertices = data points and edges xn ∼ xm if:
– ǫ-ball graph: d(xn , xm ) ≤ ǫ.
– k-nearest-neighbor graph: xn and xm are among the k-nearest-neighbors of each other.
• Clusters = connected-components of the graph (which can be found by depth-first search).
• User parameter: ǫ > 0 or k ∈ N. The number of clusters depends on that (ǫ ↓: N, ǫ ↑: 1).
• It can handle complex cluster shapes, but works only with non-overlapping clusters.
Other algorithms are able to partition the graph more effectively even with overlapping clusters (e.g. spectral clustering).
Graph-based clustering: hierarchical clustering
• Generates a nested sequence of clusterings.
• Agglomerative clustering: start with N clusters (= singleton points) and repeatedly merge pairs
of clusters until there is only one cluster left containing all points.
Divisive clustering: start with one cluster containing all points and repeatedly divide it until we have N clusters (= singleton points).
• We merge the two closest clusters Ci , Cj according to a distance δ(Ci , Cj ) between clusters:
– Single-link clustering: δ(Ci , Cj ) = minxn ∈Ci ,xm ∈Cj d(xn , xm ).
Tends to produce elongated clusters (“chaining” effect).
Equivalent to Kruskal’s algorithm to find a minimum spanning tree of a weighted graph G = (V, E, w) where V = {xn}_{n=1}^N, E = V × V and wnm = d(xn, xm).
[Figure: single-link agglomerative clustering of a small set of points a–f and the corresponding dendrogram.]
9 Dimensionality reduction and feature selection
• If we want to train a classifier (or regressor) on a sample {(xn, yn)}_{n=1}^N where the number of features D in x (the dimension of x) is large:
– Training will be slow.
– Learning a good classifier will require a large sample.
• It is then convenient to transform each example xn ∈ RD into a new example zn = F(xn ) ∈ RL
having lower dimension L < D (as long as we don’t lose much information). This would work
perfectly if the data points did lie on a manifold of dimension L contained in RD .
• Two basic ways to do this:
– Dimensionality reduction (DR): F(x) = a linear combination or some other function of all of x1, . . . , xD. It constructs L new features and discards the original D features.
Ex: x = (x1, x2, x3, x4, x5)^T ⇒ for L = 2: F(x) = (1x1 + 3x2 − 5x3 + 5x4 − 4x5, 2x1 + 3x2 − 1x3 + 0x4 + 2x5).
– Feature selection: F(x) = a subset of L of the original features x1, . . . , xD; the rest are discarded.
• Useful when some features are unnecessary (e.g. irrelevant for classification or pure noise) or
redundant (so we don’t need them all).
Useful with e.g. microarray data. Not useful with e.g. image pixels.
• Best-subset selection: for each subset of features, train a classifier and evaluate it on a validation
set. Pick the subset having highest accuracy and up to L features.
Combinatorial optimization: 2^D possible subsets of D features?. Ex: D = 3: {∅, {x1}, {x2}, {x3}, {x1, x2}, {x1, x3}, {x2, x3}, {x1, x2, x3}}.
Brute-force search only possible for small D ⇒ approximate search.
• Forward selection: starting with an empty subset F , sequentially add one new feature at a
time. We add the feature d ∈ {1, . . . , D} such that the classifier trained on F ∪ {d} has highest
classification accuracy in the validation set. Stop when the accuracy improves little, or when
we reach L features. Backward selection: same thing but start with F = {1, . . . , D} and remove one feature at a time.
It is a greedy algorithm that is not guaranteed to find an optimal subset, but gives good results.
It trains Θ(L2 ) classifiers if we try up to L features, so it is convenient when we expect the
optimal subset to contain few features.
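A sketch of forward selection; `validation_accuracy` is a hypothetical helper (not from the notes) that trains the chosen classifier on the training set restricted to the given features and returns its accuracy on the validation set:

    def forward_selection(D, L, validation_accuracy):
        """Greedily add the feature that most improves validation accuracy, up to L features."""
        selected, remaining = [], set(range(D))
        best_acc = 0.0
        while remaining and len(selected) < L:
            d_best = max(remaining, key=lambda d: validation_accuracy(selected + [d]))
            acc = validation_accuracy(selected + [d_best])
            if acc <= best_acc:            # stop when the accuracy no longer improves
                break
            selected.append(d_best)
            remaining.remove(d_best)
            best_acc = acc
        return selected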
• Lasso (for regression): min_w Σ_{n=1}^N (yn − w^T xn)² + λ ||w||₁ (where λ ≥ 0 is set by cross-validation).
The ℓ1 norm ||w||₁ = |w1| + · · · + |wD| makes many wd be exactly zero if λ is large enough (unlike the squared ℓ2 norm ||w||₂² = w1² + · · · + wD², which makes most wd small but none exactly zero).
• These feature selection algorithms are supervised : they use the labels yn when training the
classifier. The features selected depend on the classifier we use. There are also unsupervised algorithms.
Ex: forward selection on the Iris dataset (D = 4 features, K = 3 classes). Result: features {F4,F3}.
[Figure: pairwise scatterplots of the four Iris features F1–F4, colored by class.]
Dimensionality reduction: principal component analysis (PCA)
[Figures: (a) scree graph (eigenvalues) for Optdigits, (b) proportion of variance explained as a function of the number of eigenvectors, and the Optdigits dataset projected onto its first two principal components.]
Dimensionality reduction: linear discriminant analysis (LDA)
[Figures: 2D illustration of LDA (class means m1, m2, within-class scatters s1, s2 and projection direction w), and comparison of the PCA and LDA projections of a dataset.]
• Aims at preserving most of the signal information that is useful to discriminate among the classes.
Find a low-dimensional space such that when x is projected there, classes are well separated.
• Define:
– Number of points in class k: Nk. Mean of class k: µk = (1/Nk) Σ_{yn=k} xn.
– Within-class scatter matrix for class k: Sk = Σ_{yn=k} (xn − µk)(xn − µk)^T.
– Total within-class scatter matrix: SW = Σ_{k=1}^K Sk.
– Between-class scatter matrix: SB = Σ_{k=1}^K Nk (µk − µ)(µk − µ)^T where µ = (1/K) Σ_{k=1}^K µk.
• In the latent space, the between-class and within-class scatter matrices are W^T SB W and W^T SW W (of L × L).
• Fisher discriminant: max_W J(W) = |W^T SB W| / |W^T SW W| = between-class scatter / within-class scatter.
• This is an eigenproblem whose solution is W = (u1, . . . , uL) = eigenvectors associated with the largest L eigenvalues of SW^{-1} SB.
rank(SB) ≤ K − 1 ⇒ rank(SW^{-1} SB) ≤ K − 1?. So we can only use values of L that satisfy 1 ≤ L ≤ K − 1.
SW must be invertible (if it is not, apply PCA to the data and eliminate directions with zero variance).
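A compact sketch of the computation (scatter matrices and the eigenproblem) with numpy, on synthetic data:

    import numpy as np

    def lda(X, y, L):
        """Fisher discriminant: W = eigenvectors of SW^{-1} SB with the L largest eigenvalues."""
        classes = np.unique(y)
        means = {k: X[y == k].mean(axis=0) for k in classes}
        mu = np.mean(list(means.values()), axis=0)                  # mean of the class means
        D = X.shape[1]
        SW, SB = np.zeros((D, D)), np.zeros((D, D))
        for k in classes:
            Xk = X[y == k]
            SW += (Xk - means[k]).T @ (Xk - means[k])               # within-class scatter S_k
            SB += len(Xk) * np.outer(means[k] - mu, means[k] - mu)  # between-class scatter
        evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))       # eigenproblem for SW^{-1} SB
        W = evecs[:, np.argsort(-evals.real)[:L]].real
        return W                                                    # project with Z = X @ W

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0, 0], 1, (50, 3)), rng.normal([3, 3, 0], 1, (50, 3))])
    y = np.r_[np.zeros(50, int), np.ones(50, int)]
    print(lda(X, y, L=1))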
Dimensionality reduction: multidimensional scaling (MDS)
[Figure: true distances along the earth’s surface between European cities (Helsinki, Moscow, Dublin, London, Paris, Berlin, Zurich, Lisbon, Madrid, Rome, Istanbul, Athens) and their estimated positions on a 2D map found by MDS.]
• Unsupervised DR method: given the matrix of squared Euclidean distances d²nm = ||xn − xm||² between N data points, MDS finds points z1, . . . , zN ∈ R^L that approximate those distances:
min_Z Σ_{n,m=1}^N (d²nm − ||zn − zm||²)².
• MDS does not use as training data the actual feature vectors xn ∈ RD , only the pairwise
distances dnm . Hence, it is applicable even when the “distances” are computed between objects
that are not represented by features. Ex: perceptual distance between two different colors according to a subject.
• If d²nm = ||xn − xm||² where xn ∈ R^D and D ≥ L, then MDS is equivalent to PCA on {xn}_{n=1}^N. p. 137
• MDS does not produce a projection or reconstruction mapping, only the actual L-dimensional
projections z1 , . . . , zN ∈ RL for the N training points x1 , . . . , xN ∈ RD .
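A short sketch of classical (metric) MDS via an eigendecomposition of the double-centered squared-distance matrix, one standard way to solve the problem above (data made up):

    import numpy as np

    def classical_mds(D2, L):
        """Embed N points in L dimensions from the N x N matrix of squared distances D2."""
        N = D2.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N
        B = -0.5 * J @ D2 @ J                      # Gram matrix of the centered points
        evals, evecs = np.linalg.eigh(B)
        idx = np.argsort(-evals)[:L]               # keep the L largest eigenvalues
        return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 5))
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    Z = classical_mds(D2, L=2)                     # 2D points whose distances approximate D2
    print(Z.shape)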
• How to learn a projection mapping F: x ∈ RD → z ∈ RL with parameters Θ?
Direct fit: find the projections z1, . . . , zN by MDS and then solve a nonlinear regression:
min_Θ Σ_{n=1}^N (zn − F(xn; Θ))².
Parametric embedding: requires nonlinear optimization:
min_Θ Σ_{n,m=1}^N (d²nm − ||F(xn; Θ) − F(xm; Θ)||²)².
• Generalizations of MDS:
– Spectral methods: Isomap, Locally Linear Embedding, Laplacian eigenmaps. . .
Require solving an eigenproblem.
Isomap: define dnm = geodesic distances (approximated by shortest paths in a nearest-neighbor graph of the sample).
10 Decision trees
• Applicable to classification and regression.
– An input instance follows a single root-leaf path in the tree, ignoring the rest of it.
– This path (and the whole tree) may not even use all the features in the training set.
• A decision tree is a model that (as long as it is not too big) can be interpreted by people (unlike
black-box models such as neural nets):
– We can inspect the tree visually regardless of the dimensionality of the feature vector.
– We can track the root-leaf path followed by a particular input instance to understand how
the tree made its decision.
– The tree can be transformed into a set of IF-THEN rules.
• Widely used in practice, sometimes preferred over more accurate but less interpretable models.
– Internal decision nodes, each having ≥ 2 children. Decision node m selects one of its
children based on a test (a split) applied to the input x.
∗ Continuous feature xd : “go right if xd > sm ” for some sm ∈ R.
∗ Discrete feature xd : n-way split for the n possible values of xd .
– Leaves, each containing a value (class label or output value y). Instances xn falling in the
same leaf should have identical (or similar) output values yn .
∗ Classification: class label y ∈ {1, . . . , K} (or proportion of each class p1 , . . . , pK ).
∗ Regression: numeric value y ∈ R (average of the outputs for the leaf’s instances).
The predicted output for an instance x is obtained by following a path from the root to a leaf.
In the best case (balanced tree) for binary trees, the path has length log2 L if there are L leaves.
– Nonparametric: if the tree is very big, having Θ(N) nodes if each leaf represents one (or
a few) instances. Still, inference in the tree is much faster than, say, in kernel regression
or k-nearest-neighbors classification (no need to scan the whole training set).
– Parametric: if the tree is much smaller than the training set N.
Univariate trees
[Figure: a univariate decision tree with tests x1 > w10 and x2 > w20, and the axis-aligned regions it defines in the (x1, x2) plane for classes C1 and C2.]
• The test at node m compares one feature with a threshold: “xd > sm ” for some d ∈ {1, . . . , D}
and sm ∈ R. This defines a binary split into two regions: {x ∈ RD : xd ≤ sm } and {x ∈
RD : xd > sm }. The overall tree defines box-shaped, axis-aligned regions in input space.
Simplest and most often used. More complex tests exist, e.g. “wm^T x > sm”, which define oblique regions (multivariate trees).
• With discrete features, the number of children equals the number nd of values the feature can
take, and the test selects the child corresponding to the value of xd (n-way split).
Ex: xd ∈ {red, green, blue} ⇒ nd = 3 children.
• Tree induction is learning the tree from a training set {(xn, yn)}_{n=1}^N, i.e., determining its nodes and structure:
– For each internal node, its test (feature d ∈ {1, . . . , D} and threshold sm ).
– For each leaf, its output value y.
• For a given sample, many (sufficiently big) trees exist that code it with zero error. We want to
find the smallest such tree. This is NP-hard so an approximate, greedy algorithm is used: Fig. 9.3
pseudocode
– Starting at the root with the entire training set, select a best split according to a purity
criterion: (Nleft φleft + Nright φright )/(Nleft + Nright ), where N• is the number of instances
going to child •. Associate each child with the subset of instances that fall in it.
– Continue splitting each child (with its subset of the training set) recursively until each
child is pure (hence a leaf) and no more splits are necessary.
– Prevent overfitting by either early stopping or pruning.
Ex. algorithms: CART, ID3, C4.5.
Classification trees
• Impurity: consider a node and all the training set instances that reach it, and let p = (p1, . . . , pK) be the proportions of each class among them. We measure impurity as the entropy of p: φ(p) = −Σ_{k=1}^K pk log2 pk (where 0 log2 0 ≡ 0?). This is maximum if all classes are equally represented (p1 = · · · = pK = 1/K) and zero if the node is pure (a single class).
Other measures satisfying those conditions are possible: Gini index φ(p) = Σ_{i≠j} pi pj = Σ_{i=1}^K pi (1 − pi), misclassification error φ(p) = 1 − max(p1, . . . , pK).
[Figure: entropy −p log2 p − (1 − p) log2(1 − p) of a two-class node as a function of the proportion p of one class.]
• If a node is pure, i.e., all its instances are of the same class k, there is no need to split it. It
becomes a leaf with output value k.
We can also store the proportions p = (p1 , . . . , pK ) in the node (e.g. to compute risks).
• If a node m is not pure, we split it. We evaluate (Nleft φleft + Nright φright )/(Nleft + Nright ) for all
possible features d = 1, . . . , D and all possible split thresholds sm for each feature, and pick the
split with minimum impurity.
– If the number of instances that reach node m is Nm, there are Nm − 1 possible thresholds (the midpoints between consecutive values of xd, assuming we have sorted them).
– For discrete features, there is no threshold but an n-way split.
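A small sketch of choosing the best split of one continuous feature by the weighted entropy expression above (toy data):

    import numpy as np

    def entropy(y):
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def best_split(x, y):
        """Try the midpoints between consecutive sorted values of feature x as thresholds."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        best = (np.inf, None)
        for s in (x[:-1] + x[1:]) / 2:
            left, right = y[x <= s], y[x > s]
            if len(left) == 0 or len(right) == 0:
                continue
            imp = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if imp < best[0]:
                best = (imp, s)
        return best     # (weighted impurity, threshold)

    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    y = np.array([0,   0,   0,   1,   1,   1 ])
    print(best_split(x, y))   # threshold 4.5, impurity 0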
Regression trees
• Purity criterion: the squared error E(g) = Σ_{n∈node} (yn − g)² (where g ∈ R) is minimal when g is the mean of the yn values?. Then φ = E(g) is the variance of the yn values at a node.
If there is much noise or outliers, it is preferable to set g to the median of the yn values.
• We consider a node to be pure if E ≤ θ for a user threshold θ > 0. In that case, we do not
split it. It becomes a leaf with output value g.
• If a node m is not pure, we split it. We evaluate all possible features d = 1, . . . , D and all
possible split thresholds sm for each feature and pick the split with minimum impurity (= sum
of the variances E of each of the children), as in the classification case.
• Rather than assigning a constant output value to a leaf, we can assign it a regression function
(e.g. linear), as in a running-mean smoother.
• Pruning is slower than early stopping but it usually leads to trees that generalize better.
An intuitive reason why is as follows. In the greedy algorithm used to grow the tree, once we make a decision at a given node (to
select a split) we never backtrack and try a different, maybe better, possibility.
[Figures: regression trees grown on a 1D dataset for two values of the purity threshold θr (0.5 and 0.2), with their tests (e.g. x < 3.16) and leaf output values, and the resulting piecewise-constant fits.]
• Path root → leaf = conjunction of tests. This and the leaf’s output value give a rule.
• The set of extracted rules allows us to extract knowledge from the dataset.
Ex: features x1: Age, x2: Years in job, x3: Gender, x4: Job type. Rule extracted from one root–leaf path:
R1: IF (age > 38.5) AND (years-in-job > 2.5) THEN y = 0.8
[Figure: the corresponding decision tree, with root test x1 > 38.5.]
11 Ensemble models: combining multiple learners
• By suitably combining multiple learners the accuracy can be improved (but it need not be).
– How to generate base learners that complement each other?
– How to combine their outputs for maximum accuracy?
• Ex: train an ensemble of L decision trees on L different subsets of the training set and define
the ensemble output for a test instance as the majority vote (for classification) or the average
(for regression) of the L trees.
• Ensembles of decision trees (random forest, boosted decision trees) are practically among the
most accurate models in machine learning.
• Disadvantages:
– An ensemble of learners is computationally more costly in time and space than a single
learner, both at training and test time.
– A decision tree is interpretable (as rules), but an ensemble of trees is hard to interpret.
• Consider L iid random variables y1, . . . , yL with expected value E{yl} = µ and variance var{yl} = σ² (expectations wrt p(yl)). Then, the average ȳ = (1/L) Σ_{l=1}^L yl has the following moments (expectations wrt p(y1, . . . , yL)):
✐ E{ȳ} = E{(1/L) Σ_{l=1}^L yl} = (1/L) Σ_{l=1}^L E{yl} = µ
✐ var{ȳ} = var{(1/L) Σ_{l=1}^L yl} = (1/L²) var{Σ_{l=1}^L yl} = (1/L²) Σ_{l=1}^L var{yl} = σ²/L.
So the expected value (hence the bias) doesn’t change, but the variance (hence the mean squared error) decreases as L increases. If y1, . . . , yL are identically but not independently distributed:
✐ var{ȳ} = (1/L²) var{Σ_{l=1}^L yl} = (1/L²) (Σ_{l=1}^L var{yl} + Σ_{i,j=1, i≠j} cov{yi, yj}) = σ²/L + (1/L²) Σ_{i,j=1, i≠j} σij.
So, if the learners are positively correlated (σij > 0), the variance (and error) is larger than
if they are independent. Hence, a well-designed ensemble combination will aim at reducing, if
not completely eliminating, positive correlation between learners.
Further decrease in variance is possible with negatively correlated learners. However, it is
impossible to have many learners that are both accurate and negatively correlated.
Voting or averaging over models with low bias and high variance produces an ensemble with low
bias but lower variance, if the models are (somewhat) uncorrelated.
Bagging
• Bootstrap: given a training set X = {x1 , . . . , xN } containing N samples, the bootstrap generates
L subsets X1 , . . . , XL of X , each of size N, as follows: subset Xl is obtained by sampling at
random with replacement N points from X .
– This means that some points xn will appear repeated multiple times in Xl and some other
points xn will not appear in Xl .
– The L subsets are partly different from each other. On average, each subset contains ≈ 63% of the training set. Proof: the probability we don’t pick instance xn in N draws is (1 − 1/N)^N ≈ e^{−1} ≈ 0.37.
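A quick sketch of generating bootstrap subsets and checking the ≈ 63% coverage empirically:

    import numpy as np

    rng = np.random.default_rng(0)
    N, L = 1000, 50

    subsets = [rng.integers(0, N, size=N) for _ in range(L)]        # sample N indices with replacement
    coverage = np.mean([len(np.unique(s)) / N for s in subsets])
    print(coverage)    # ~ 1 - 1/e ~ 0.63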
• Bagging (bootstrap aggregating): applicable to classification and regression.
– We generate L (partly different) subsets of the training set with the bootstrap.
Also possible to have each subset be a sample (say 90%) of the training set without replacement.
– The ensemble output is defined as the vote or average (or median) of the learners’ outputs.
✐ What kind of model results from averaging learners of the following type: polynomial; RBF network; logistic regression;
tree; neural network; etc.?
• Random forest: a variation of bagging using decorrelated decision trees as base learners.
– As in bagging, each tree is trained on a bootstrap sample of the training set, and the
ensemble output is defined as the vote or average (or median) of the learners’ outputs.
– In addition, the lth tree is constructed with a randomized CART algorithm, independently
of the other trees: at each node of the tree we use only a random subset of m ≤ D of the
original D features. The tree is fully grown (no pruning) so that it has low bias.
Typically one sets m = ⌊√D⌋ for classification.
Random forests are among the best classifiers in practice, and simpler to train than boosting.
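A minimal usage sketch with scikit-learn (assuming it is available; dataset chosen just for illustration):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    rf.fit(X, y)
    print(rf.predict(X[:5]))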
Boosting
• Bagging generates complementary learners through random sampling and unstable learning
algorithms. Boosting actively generates complementary learners by training the next learner on
the mistakes of the previous learners.
• Weak learner: a learner that has probability of error < 1/2 (i.e., better than random guessing on binary classification). Ex: decision trees, decision stumps (trees grown to only 1 or 2 levels).
Strong learner : a learner that can have arbitrarily small probability of error. Ex: neural net.
• There are many versions of boosting, we focus on AdaBoost.M1, for classification (it can be
applied to regression with some modifications):
– It combines L weak learners. They have high bias, but the decrease in variance in the
ensemble compensates for that.
– Each learner l = 1, . . . , L is trained on the entire training set X = {x1, . . . , xN}, but each point xn has an associated probability p_n^(l) indicating how "important" it is for that learner.
The training algorithm must be able to use these probabilities p^(l) together with the training set X. Otherwise, this can be simulated by sampling a training set of size N from X according to p^(l).
– The first learner uses p_n^(1) = 1/N (all points equally important).
– After training learner l on (X, p^(l)), let its error rate be εl = Σ_{n misclassified by learner l} p_n^(l) ∈ [0, 1]. We update the probabilities as follows: for each point xn, n = 1, . . . , N:
p_n^(l+1) = βl p_n^(l) if learner l correctly classifies xn, and p_n^(l+1) = p_n^(l) otherwise, where βl = εl / (1 − εl) ∈ [0, 1),
and then renormalize the probabilities so they sum to 1: p_n^(l+1) ← p_n^(l+1) / Σ_{n'=1}^N p_{n'}^(l+1).
This decreases the probability of a point if it is correctly classified, so the next learner
focuses on the misclassified points. That is why learners are chosen to be weak. If they are strong (= very
accurate), the next learner’s training set will emphasize a few points, many of which could be very noisy or outliers.
• After training, the ensemble output for a given instance is given by a weighted vote, where the
weight of learner l is wl = log (1/βl ) (instead of 1), so the weaker learners have a lower weight.
Other variations of boosting make the overall decision by applying learners in sequence, as in cascading.
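A sketch of this AdaBoost.M1 loop (decision stumps as weak learners via scikit-learn, which accepts the per-point probabilities as sample_weight; 0/1 integer labels are assumed):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1(X, y, L=50):
        N = len(y)
        p = np.full(N, 1.0 / N)                        # p^(1): all points equally important
        learners, weights = [], []
        for _ in range(L):
            h = DecisionTreeClassifier(max_depth=1)    # decision stump = weak learner
            h.fit(X, y, sample_weight=p)
            wrong = h.predict(X) != y
            eps = np.clip(p[wrong].sum(), 1e-12, None) # weighted error rate
            if eps >= 0.5:                             # no longer better than random: stop
                break
            beta = eps / (1.0 - eps)
            learners.append(h)
            weights.append(np.log(1.0 / beta))         # vote weight w_l = log(1/beta_l)
            p[~wrong] *= beta                          # shrink correctly classified points
            p /= p.sum()                               # renormalize
        return learners, weights

    def adaboost_predict(learners, weights, X):
        votes = np.zeros((len(X), 2))                  # assumes two classes {0, 1}
        for h, w in zip(learners, weights):
            votes[np.arange(len(X)), h.predict(X).astype(int)] += w   # weighted vote
        return votes.argmax(axis=1)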
• Boosting usually achieves very good classification performance, although it does depend on the
dataset and the type of learner used (it should be weak but not too weak). It is sensitive to
noise and outliers.
A very successful application in computer vision: the Viola-Jones face detector. This is a cascade of AdaBoost ensembles of decision
stumps, each trained on a large set of Haar features. It is robust and (with clever image-based operations) very fast for test images.
Cascading is similar to boosting, but the learners are trained sequentially. For example, in boosting
the next learner could be trained on the residual error of the previous learner. Another way:
• When applied to an instance x, each learner gives an output (e.g. class label) and a confidence
(which can be defined as the largest posterior probability p(Ck |x)).
• Learner l is trained on instances for which the previous learner is not confident enough.
• When applying the ensemble to a test instance, we apply the learners in sequence until one is confident enough. We use learner l only if the previous learners 1, . . . , l − 1 are not confident enough on their outputs.
• The goal is to order the learners in increasing complexity, so the early learners are simple and
classify the easy instances quickly.
Error-correcting output codes (ECOC)
• An ensemble learning method for K-class classification.
• Idea: instead of solving the main classification task directly with one classifier, which may be difficult, create simpler classification
subtasks which can be combined to get the main classifier.
• Code matrix W of K × L with elements in {−1, +1}: if wkl = −1 (+1) then class k should be on the negative (positive) side of
learner l. Hence, each learner tries to classify one subset of classes vs the rest (it partitions the K classes into two groups). Ex.:
one-vs-all (K = 4, L = 4):          one-vs-one (K = 4, L = 6):          all possible learners (K = 4, L = 7):
+1 −1 −1 −1                         +1 +1 +1  0  0  0                   −1 −1 −1 −1 −1 −1 −1
−1 +1 −1 −1                         −1  0  0 +1 +1  0                   −1 −1 −1 +1 +1 +1 +1
−1 −1 +1 −1                          0 −1  0 −1  0 +1                   −1 +1 +1 −1 −1 +1 +1
−1 −1 −1 +1                          0  0 −1  0 −1 −1                   +1 −1 +1 −1 +1 −1 +1
An ECOC code matrix has L between K (one-vs-all, fewest learners) and 2^{K−1} − 1 (all possible learners), because negating a column gives the same learner, and a column of all −1s (or all +1s) is useless.
• The code matrix allows us to define a K-class classification problem in terms of several binary classification problems. We can use
any binary classifier for the latter (decision trees, etc.).
• To classify a test input x, we apply the L classifiers to it and obtain a row vector y with L entries in {−1, +1}. Ideally, if the
classifiers were perfect, this would equal the row of W corresponding to x’s class, but in practice some classifiers will classify x
incorrectly. Then, we find the row 1 ≤ k ≤ K in W that is closest to y in Hamming distance, and output k as the label (as in error-correcting codes). Equivalently✐, we can compute a weighted vote Σ_{l=1}^L wkl yl for class k (where the weights wkl are the elements of W) and then pick the class with most votes.
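A small sketch of this decoding step, using the K = 4 "all possible learners" code matrix above and a hypothetical vector of learner outputs that contains one error:

    import numpy as np

    # "All possible learners" code matrix for K = 4 classes (rows = classes, L = 7 learners).
    W = np.array([[-1, -1, -1, -1, -1, -1, -1],
                  [-1, -1, -1, +1, +1, +1, +1],
                  [-1, +1, +1, -1, -1, +1, +1],
                  [+1, -1, +1, -1, +1, -1, +1]])

    # Hypothetical learner outputs: class 2's codeword with its first bit flipped (one error).
    y = np.array([+1, -1, -1, +1, +1, +1, +1])

    hamming = (W != y).sum(axis=1)             # mismatches with each class codeword
    votes = W @ y                              # equivalent weighted vote sum_l w_kl * y_l
    print(hamming.argmin(), votes.argmax())    # both recover class 2 (index 1) despite the error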
• The point of ECOC is to introduce robustness against errors of the learners by making their codewords be farther from each other
in Hamming distance (number of mismatching bits). This requires sufficiently many learners, and introduces redundancy.
• Given a value of L (the number of classifiers, with L > K) selected by the user, we generate W so its rows are as different as
possible in Hamming distance (redundant codes so protection against errors), and its columns are as different as possible (so the
learners are diverse).
• An ECOC can also be seen as an ensemble of classifiers where, to obtain each classifier, we manipulate the output targets (rather
than the input features or the training set, which in ECOC are equal to the original training set for each classifier).
3. We train a model f to predict the ground-truth output for a given instance xn from the
learners’ outputs for that instance.
• If we choose f to be a linear function, this produces a weighted average or vote, where the
weights are learned on the validation set. But we can choose f to be nonlinear.
• By training f on the validation set (rather than the training set where the learners were trained),
it learns to correct for mistakes that the learners make.
• Whether it works better than a fixed, simple combination rule (such as majority vote or average)
depends on the problem.
Fine-tuning an ensemble
• Constructing an ensemble that does decrease the error requires some art in selecting the right
number of learners and in making them be diverse and of the right accuracy.
• Typically, the resulting ensemble contains learners that are somewhat correlated with each
other, or that are useless.
• It often helps to postprocess the ensemble by removing some learners (ensemble pruning), so
we obtain a smaller ensemble (hence faster at test time) having about the same error.
• This is similar to feature selection and we can indeed apply feature selection algorithms. The
most usual is forward selection: starting with an empty ensemble, we sequentially add one
learner at a time, the one that gives highest accuracy when added to the previous ones.
12 Linear discrimination
• Consider learning a classifier given a sample {(xn, yn)}_{n=1}^N where xn ∈ R^D and yn ∈ {1, . . . , K}.
– Discriminative approach: we learn only the class boundaries, through discriminant func-
tions gk (x), k = 1, . . . , K. We don’t learn the class densities, and gk (x) need not be
modeled using probabilities. Hence, it requires assumptions only about the class bound-
aries but not about their densities. It solves a simpler problem. More effective in practice.
This and future chapters, using linear and nonlinear discriminant functions.
• Define a parametric model gk(x; Θk) for each class discriminant. In linear discrimination, this model is linear: gk(x; wk) = Σ_{d=1}^D wkd xd + wk0 = wk^T x (assuming an extra constant feature x0 = 1).
• Linear discriminants are less accurate than nonlinear ones, but simpler:
– Easier optimization problem (often convex, so unique solution).
– Faster to train, can scale to large datasets.
– Low space and time complexity at test time: Θ(D) per class (to store wk and multiply by it).
– Interpretable: the output is a weighted sum of the features xd (separable contributions):
∗ magnitude of wkd : importance of xd in the decision;
∗ sign of wkd : whether the effect of xd is positive or negative.
– Accurate enough in many applications.
• In practice, try linear discrimination first, before trying nonlinear discrimination.
The result is a model that is nonlinear on the features x but linear on the parameters w.
Radial basis function (RBF) networks and kernel support vector machines (SVMs) exploit this. ch. 12–13
Geometry of the linear discriminant
Figure: left, a linear discriminant g(x) = w1 x1 + w2 x2 + w0 = 0 in 2D separating C1 (g(x) > 0) from C2 (g(x) < 0); w is orthogonal to the hyperplane, which lies at distance |w0|/‖w‖ from the origin, and a point x lies at distance |g(x)|/‖w‖ from it. Middle and right: K = 3 classes separated by one-vs-all hyperplanes H1, H2, H3 or one-vs-one hyperplanes H12, H23, H31.
Two classes
• One discriminant function is sufficient?: g1(x) − g2(x) = w^T x + w0 = g(x).
Testing: choose C1 if g(x) > 0 and C2 if g(x) < 0.
• This defines a hyperplane where w is the weight vector and w0 the threshold (or bias). It divides
the input space RD into two half-spaces, the decision regions R1 for C1 (positive side) and R2
for C2 (negative side). The hyperplane itself is the boundary or decision surface.
• The origin x = 0 is on the positive side of the hyperplane if w0 > 0, on the boundary if w0 = 0, and on the negative side if w0 < 0.
• w is orthogonal to the hyperplane. Pf. Pick x, y on the hyperplane.
• So w determines the orientation of the hyperplane and w0 its location wrt the origin.
K > 2 classes It is possible to optimize jointly all the discriminants (e.g. see later the softmax
classifier), but a simpler, commonly used strategy is to construct a K-class classifier by “ensembling”
several binary classifiers trained separately. Ex:
• One-vs-all (or one-vs-rest): we use K discriminant functions gk (x) = wkT x + wk0.
– Training: gk is trained to classify the points of class Ck vs the points of all other classes.
Training time: K binary classifiers each on the entire training set.
– Testing: choose Ck if k = arg max_{i=1,...,K} gi(x), i.e., pick the class having the largest discriminant. Ideal case: gk(x) > 0 and gi(x) < 0 ∀i ≠ k. Test time: O(DK).
• One-vs-one (for each pair of classes): we use K(K − 1)/2 discriminants gij(x) = wij^T x + wij0.
– Training: gij (i ≠ j) is trained to classify the points of class Ci vs the points of class Cj (points from other classes are not used). Training time: K(K − 1)/2 binary classifiers, each on a portion of the training set (a fraction 2/K of it if the classes are balanced). ✐ Is this faster or slower than one-vs-all?
– Testing: choose Ck if k = arg max_{i=1,...,K} Σ_{j≠i} gij(x), i.e., pick the class having the largest summed discriminant. Test time: O(DK²).
Also possible: for each class k, count the number of times gkj (x) > 0 for j 6= k, and pick the class with most votes.
• All the above divide the input space RD into K convex decision regions R1 , . . . , RK (polytopes).
Minimizing functions by (stochastic) gradient descent
• When the minimization problem minw∈RD E(w) cannot be solved in closed form (solution
w∗ = some formula), we use iterative methods (w(0) → w(1) → · · · → w(∞) = w∗ ). Many
such methods exist (Newton’s method, conjugate gradients. . . ). One of the simplest ones,
reasonably effective in many cases (although sometimes very slow), is gradient descent.
– Stop iterating:
∗ When ∇E(w) ≈ 0 (since ∇E(w∗ ) = 0 at a minimizer w∗ of E). This may overfit and may
take many iterations.
∗ When the error on a validation set starts increasing (early stopping), which will
happen before the training error is minimized. The model w generalizes better to unseen data.
– It will find a local minimizer (not necessarily global ).
• Stochastic gradient descent (SGD): applicable when E(w) = Σ_{n=1}^N e(w; xn), i.e., the total error is the sum of the error at each data point. We update w as soon as we process xn: repeatedly iterate w ← w + ∆w with an update ∆w = −η ∇e(w; xn).
– Epoch = one pass over the whole dataset {x1, . . . , xN}. It corresponds to one "noisy" iteration of gradient descent: it need not always decrease E(w) because it doesn't use the correct gradient ∇E(w) = Σ_{n=1}^N ∇e(w; xn).
– Much faster than (batch) gradient descent to get an approximate solution for w if N
is large (so there is redundancy in the dataset), or if data points come one at a time
and we don’t store them (online learning). However, very slow convergence thereafter.
– The step size has to decrease slowly over epochs for convergence to occur. One typically takes η(t) = α/(β + t) at epoch t for suitable α, β > 0.
– Shuffling: in each epoch, process the N points in a random order. Better than a fixed order.
– Minibatches: computing the update ∆w = −η Σ_{n∈B} ∇e(w; xn) based on subsets B of 1 < |B| < N points works better than with |B| = 1 (pure online learning) or |B| = N (pure batch learning). In practice |B| = 10 to 1 000 points, typically. May need to adapt to GPU memory size.
• Ex: E(w) = (1/2) Σ_{n=1}^N (yn − w^T xn)² (least-squares error for linear regression). The updates:
GD: ∆w = η Σ_{n=1}^N (yn − w^T xn) xn        SGD: ∆w = η (yn − w^T xn) xn
have the form (ignoring η): "Error × Input", where Error = DesiredOutput − ActualOutput.
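A minimal sketch of these updates on synthetic data (minibatch SGD with |B| = 32; the data, step size and number of epochs are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 1000, 5
    X = rng.normal(size=(N, D))
    w_true = rng.normal(size=D)
    y = X @ w_true + 0.1 * rng.normal(size=N)     # noisy linear data

    w, eta, B = np.zeros(D), 0.01, 32
    for epoch in range(20):
        order = rng.permutation(N)                # shuffle each epoch
        for start in range(0, N, B):
            idx = order[start:start + B]
            err = y[idx] - X[idx] @ w             # DesiredOutput - ActualOutput
            w += eta * X[idx].T @ err             # "Error x Input" update over the minibatch
    print(np.linalg.norm(w - w_true))             # should be small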
This "Error × Input" update has the following effect: if the desired output is larger than the actual output (positive error), the weight vector moves towards xn, increasing w^T xn; if it is smaller, the weight vector moves away from xn; in both cases in proportion to the size of the error.
• Logit function or log odds of θ: logit(θ) = log (θ / (1 − θ)) ∈ (−∞, ∞) for θ ∈ (0, 1). (Figure: the logistic function σ(t) = 1/(1 + e^{−t}), the inverse of the logit, plotted for t ∈ [−10, 10].)
• Softmax function: S(t) = (e^{t1}, . . . , e^{tK})^T / Σ_{k=1}^K e^{tk} ∈ (0, 1)^K, t ∈ R^K. It satisfies Σ_{k=1}^K Sk(t) = 1; it maps K real values ("scores") to a probability distribution over {1, . . . , K}. "Soft" max: if tk ≫ tj ∀j ≠ k then Sk(t) ≈ 1, Sj(t) ≈ 0 ∀j ≠ k. It is differentiable (unlike the max function). If tk ≥ tj ∀j ≠ k then max(t1, . . . , tK) = tk and arg max(t1, . . . , tK) = k.
• Geometry of the logistic classifier: like the geometry of the linear discriminant (w determines
the orientation of the decision boundary and w0 its location wrt the origin), and the magnitude
of w and w0 determines how steep the logistic function becomes.
• Testing: given x ∈ R^D, compute (θ1, . . . , θK) = softmax(w1^T x + w10, . . . , wK^T x + wK0) and choose Ck if k = arg max_{i=1,...,K} {θi}, or equivalently?, k = arg max_{i=1,...,K} {wi^T x + wi0}.
• Training: we learn its parameters {wk, wk0}_{k=1}^K from a training set {(xn, yn)}_{n=1}^N ⊂ R^D × {1, . . . , K} by minimizing the cross-entropy.
• The last layer of a neural net for classification is typically a softmax linear layer. With many
classes (large K), it can take a lot of training and test time (ex: large language models).
Figure: decision boundaries for K = 3 classes; the linear discriminants wk^T x + wk0; and the posterior probabilities P(Ci|x1, x2) (softmax of the linear discriminants).
Two classes: yn ∈ {0, 1}
• We model yn|xn as a Bernoulli distribution with parameter θn = p(C1|xn) = σ(w^T xn + w0). Maximum likelihood (equivalently, minimizing the cross-entropy) gives the error function
E(w, w0) = −Σ_{n∈C1} log θn − Σ_{n∈C0} log (1 − θn) = −Σ_{n=1}^N ( yn log θn + (1 − yn) log (1 − θn) )
with gradient
∂E/∂w = −Σ_{n=1}^N (yn − θn) xn,        ∂E/∂w0 = −Σ_{n=1}^N (yn − θn),
which we minimize iteratively by gradient descent.
Newton’s method (suitably modified) is much more effective than gradient descent for this problem.
• For better generalization, we can add a regularization term λ‖w‖² and cross-validate λ ≥ 0.
Using instead λ‖w‖1 forces some weight values to exactly zero and achieves feature selection (lasso).
• If the classes are linearly separable, as training proceeds ‖w‖ → ∞ and θn → yn ∈ {0, 1}. To prevent this, we can stop early (when the number of misclassifications is zero) or add a regularization term λ‖w‖² or λ‖w‖1.
K > 2 classes: yn ∈ {1, . . . , K}
• As before, but we model yn|xn as a multinomial distribution with parameter θnk = p(Ck|xn) = exp (wk^T xn + wk0) / Σ_{j=1}^K exp (wj^T xn + wj0). The error function is again the cross-entropy (maximum likelihood with a change of sign), where we represent the labels yn as a 1-of-K encoding (a binary vector yn ∈ {0, 1}^K containing a single 1 in position k if yn corresponds to class k):
cross-entropy:   min_{ {wk, wk0}_{k=1}^K }  E( {wk, wk0}_{k=1}^K ; {(xn, yn)}_{n=1}^N ) = −Σ_{n=1}^N Σ_{k=1}^K ynk log θnk.
Say that for point n we have ynj = 1 (so ynk = 0 if k ≠ j). Then −Σ_{k=1}^K ynk log θnk = −log θnj, and minimizing this pushes θnj (the probability predicted for class j for point n) to be as close to 1 as possible. So the cross-entropy tries to match θnk with ynk for every n = 1, . . . , N and k = 1, . . . , K. But we cannot modify θnk directly; we modify it indirectly by modifying the parameters {wk, wk0}_{k=1}^K, which we can do via gradient descent on E.
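A minimal sketch of minimizing this cross-entropy by batch gradient descent on synthetic data (the gradient step W ← W + η Σ_n (ynk − θnk) xn matches the softmax perceptron update used later for MLPs; the data, step size and iteration count are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 300, 2, 3
    X = np.vstack([rng.normal(loc=m, size=(N // K, D)) for m in ([0, 0], [3, 0], [0, 3])])
    Y = np.zeros((N, K)); Y[np.arange(N), np.repeat(np.arange(K), N // K)] = 1   # 1-of-K labels
    Xb = np.hstack([X, np.ones((N, 1))])             # constant feature for the bias w_k0

    W = np.zeros((K, D + 1))
    eta = 0.1 / N
    for it in range(2000):
        scores = Xb @ W.T                            # N x K scores w_k^T x_n + w_k0
        scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
        theta = np.exp(scores); theta /= theta.sum(axis=1, keepdims=True)   # softmax
        W += eta * (Y - theta).T @ Xb                # gradient descent on the cross-entropy
    print((theta.argmax(axis=1) == Y.argmax(axis=1)).mean())   # training accuracy, should be high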
Multinomial distribution:
• A die with K faces, each face k having probability θk ∈ [0, 1], k = 1, . . . , K, with Σ_{k=1}^K θk = 1.
• p(y; θ) = θ1^{y1} · · · θK^{yK} = θk if yk = 1, where y ∈ {0, 1}^K has exactly one yk = 1 and the rest are 0.
By least-squares regression
Two classes: yn ∈ {0, 1}
• We define a least-squares regression problem min_Θ E(Θ) = (1/2) Σ_{n=1}^N (yn − f(xn; Θ))² where:
– The labels {yn } are considered as real values (which happen to be either 0 or 1), and can
be seen as the desired posterior probabilities for the points {xn }.
– The parametric function to be learned is f(x; w, w0) = σ(w^T x + w0) = 1 / (1 + exp (−(w^T x + w0))).
✐ Why not do least-squares regression to labels {0, 1} directly on the function wT x + w0 ?
– Hence, learning the classifier by regression means trying to estimate the desired posterior probabilities with the function f (whose outputs are in [0, 1]). Gradient descent then uses the update ∆w = η Σ_{n=1}^N (yn − θn) θn (1 − θn) xn, where θn = σ(w^T xn + w0).
• As before, if the classes are linearly separable, this would drive ‖w‖ → ∞ and θn → yn ∈ {0, 1}. Instead, we stop iterating similarly to before or add a regularization term.
13 Multilayer perceptrons (artificial neural nets)
• Parametric nonlinear function approximators f(x; Θ) for classification or regression.
• Originally inspired by neuronal circuits in the brain (McCulloch-Pitts neurons, Rosenblatt's
perceptron, etc.), and by the brain’s ability to solve intelligent problems (visual and speech
perception, navigation, planning, etc.).
Synapses between neurons = MLP weight parameters (which are modified during learning). Neuron firing = MLP unit nonlinearity.
Neurons that feed into another neuron = receptive field.
• They are usually implemented in software in serial computers and more recently in parallel or
distributed computers. There also exist VLSI implementations.
Figure: a neuron; the organization of neurons in the retina.
The perceptron
• Linear perceptron: computes a linear function y = Σ_{d=1}^D wd xd + w0 ∈ R of an input vector x ∈ R^D with weight vector w ∈ R^D (or y = w^T x with w ∈ R^{D+1} if we augment x with a 0th component of value 1). To have K outputs: y = Wx with W ∈ R^{K×(D+1)}.
• For classification, we can use:
– Two classes: y = σ(w^T x) = 1 / (1 + exp (−w^T x)) ∈ (0, 1) (logistic).
– K > 2 classes: yk = exp (wk^T x) / Σ_{j=1}^K exp (wj^T x) ∈ (0, 1), k = 1, . . . , K, with Σ_{k=1}^K yk = 1 (softmax).
Training a perceptron
• Apply stochastic gradient descent: to minimize the error E(w) = Σ_{n=1}^N e(w; xn, yn), repeatedly update the weights w ← w + ∆w with ∆w = −η ∇e(w; xn, yn) and n = 1, . . . , N (one epoch).
– Regression by least-squares error: e(w; xn, yn) = (1/2)(yn − w^T xn)² ⇒ ∆w = η (yn − w^T xn) xn.
– Classification by maximum likelihood, or equivalently cross-entropy:
∗ For two classes: yn ∈ {0, 1} and e(w; xn, yn) = −yn log θn − (1 − yn) log (1 − θn) where θn = σ(w^T xn) ⇒ ∆w = η (yn − θn) xn.
∗ For K > 2 classes: yn ∈ {0, 1}^K coded as 1-of-K and e(w; xn, yn) = −Σ_{k=1}^K ykn log θkn where θkn = exp (wk^T xn) / Σ_{j=1}^K exp (wj^T xn) ⇒ ∆wkd = η (ykn − θkn) xdn for d = 0, . . . , D and k = 1, . . . , K.
The original perceptron algorithm was a variation of stochastic gradient descent. For linearly separable problems, it converges in a finite (possibly large) number of iterations. For problems that are not linearly separable, it never converges.
AND problem (linearly separable):
x1 x2 | y
 0  0 | 0
 0  1 | 0
 1  0 | 0
 1  1 | 1
A single perceptron solves it, e.g. y = σ(x1 + x2 − 3/2) (with bias input x0 = +1): the line x1 + x2 = 1.5 separates (1,1) from (0,0), (0,1), (1,0).
XOR problem (not linearly separable, so no single perceptron can solve it):
x1 x2 | y
 0  0 | 0
 0  1 | 1
 1  0 | 1
 1  1 | 0
An MLP with a single nonlinear hidden layer (two hidden units z1, z2 and output y = σ(z1 + z2 − 0.5)) solves the XOR problem.
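A tiny check of this construction (the hidden-unit weights below are one possible choice consistent with the output unit above, stated here as an assumption; the sigmoid is steepened so the units act like thresholds):

    import numpy as np

    def sigma(t):                        # logistic function, steepened to act like a step
        return 1.0 / (1.0 + np.exp(-10 * t))

    for x1 in (0, 1):
        for x2 in (0, 1):
            z1 = sigma(x1 - x2 - 0.5)    # roughly "x1 and not x2"
            z2 = sigma(x2 - x1 - 0.5)    # roughly "x2 and not x1"
            y = sigma(z1 + z2 - 0.5)     # OR of the two hidden units = XOR(x1, x2)
            print(x1, x2, round(y))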
• Many models are universal approximators (over a large class of target functions): neural nets,
decision trees and forests, RBFs, polynomials, sines/cosines, piecewise constant functions, etc.
They achieve this by having many “parts” and combining them in a suitable way. Good models
should strike a good tradeoff between approximation accuracy and number of parameters with
high-dimensional feature vectors.
Learning MLPs: the backpropagation algorithm
The backpropagation algorithm is (stochastic) gradient descent applied to an MLP using a least-
squares or cross-entropy error function (more generally, to a neural net with any error function):
• The updates using stochastic gradient descent with a minibatch B to minimize an error function E(Θ; {(xn, yn)}_{n=1}^N) = Σ_{n=1}^N e(Θ; xn, yn) are of the form Θ ← Θ + ∆Θ with ∆Θ = −η Σ_{n∈B} ∇Θ e(Θ; xn, yn), where the gradients are given below. B = {1, . . . , N} gives batch gradient descent.
• Since the MLP consists of a nested sequence of functions (layers) x →(W1) z1 →(W2) z2 → · · · →(WK) y = f(x; Θ) with Θ = {W1, . . . , WK}, each being a nonlinear perceptron, the gradient ∇Θ e(Θ) is computed with the chain rule, and can be interpreted as backpropagating the error at the output layer "yn − f(xn; Θ)" through the hidden layers back to the input.
Ex: an MLP with one hidden layer having H units, with inputs x ∈ R^D and outputs y ∈ R^{D′}, where the hidden units are sigmoidal and the output units are linear. The gradient is as follows:
• wrt a second-layer weight vih:   ∂E/∂vih = (∂E/∂fi)(∂fi/∂vih) = δi ∂fi/∂vih   (as with a perceptron fi given fixed inputs zh), with "error" δi = ∂E/∂fi.
• wrt a first-layer weight whd:   ∂E/∂whd = Σ_{i=1}^{D′} (∂E/∂fi)(∂fi/∂zh)(∂zh/∂whd) = Σ_{i=1}^{D′} δi (∂fi/∂zh)(∂zh/∂whd).
With more hidden layers, the gradient wrt each layer's weights is computed recursively.
Note that ∂en/∂vih is the same as for linear regression (squared error), or logistic regression or softmax (cross-entropy).?
For the least-squares (regression) case:
e = e({fi}_{i=1}^{D′}) = (1/2) Σ_{i=1}^{D′} (yi − fi)²
fi = fi({vih, zh}_{h=0}^H) = Σ_{h=0}^H vih zh   (output of output unit i; δi = error of output unit i)
zh = zh({whd}_{d=0}^D) = σ(Σ_{d=0}^D whd xd)   (output of hidden unit h)
so, using ∂en/∂fi = −(yin − fi(xn)), ∂fi/∂vih = zhn, ∂fi/∂zh = vih and ∂zh/∂whd = zhn (1 − zhn) xdn (since σ′ = σ(1 − σ)):
∂en/∂vih = −(yin − fi(xn)) zhn        ∂en/∂whd = −Σ_{i=1}^{D′} (yin − fi(xn)) vih zhn (1 − zhn) xdn.
Classification, two classes  We need a single logistic output unit f(xn). Θ = {W, v}. Cross-entropy:
e(W, v; xn, yn) = en(W, v) = −yn log f(xn; W, v) − (1 − yn) log (1 − f(xn; W, v)),   f(xn) = σ(Σ_{h=0}^H vh zhn)
∂en/∂vh = −(yn − f(xn)) zhn        ∂en/∂whd = −(yn − f(xn)) vh zhn (1 − zhn) xdn.
Classification, K > 2 classes  We need K softmax output units fi(xn), i = 1, . . . , K (one per class). Θ = {W, V}. Cross-entropy error:
e(W, V; xn, yn) = en(W, V) = −Σ_{i=1}^K yin log fi(xn; W, V),   fi(xn) = exp (oin) / Σ_{k=1}^K exp (okn),   oin = Σ_{h=0}^H vih zhn
∂en/∂oi = −(yin − fi(xn))        ∂en/∂vih = −(yin − fi(xn)) zhn        ∂en/∂whd = −Σ_{i=1}^K (yin − fi(xn)) vih zhn (1 − zhn) xdn.
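A minimal numpy sketch of these gradients for the regression case (one sigmoidal hidden layer, linear outputs), with a finite-difference check; the array shapes and the bias-as-last-component convention are choices of this sketch:

    import numpy as np

    def forward(x, W, V):                       # W: H x (D+1), V: D' x (H+1); bias as last column
        z = 1.0 / (1.0 + np.exp(-(W @ np.append(x, 1.0))))   # hidden outputs (sigmoid)
        f = V @ np.append(z, 1.0)                             # linear outputs
        return z, f

    def gradients(x, y, W, V):
        z, f = forward(x, W, V)
        delta = -(y - f)                        # d e / d f_i, the output "error"
        dV = np.outer(delta, np.append(z, 1.0)) # d e / d v_ih = -(y_i - f_i) z_h
        dhid = (V[:, :-1].T @ delta) * z * (1 - z)            # error backpropagated to hidden units
        dW = np.outer(dhid, np.append(x, 1.0))  # d e / d w_hd = -sum_i (y_i - f_i) v_ih z_h (1-z_h) x_d
        return dV, dW

    rng = np.random.default_rng(0)
    D, H, Dp = 3, 4, 2
    x, y = rng.normal(size=D), rng.normal(size=Dp)
    W, V = rng.normal(size=(H, D + 1)), rng.normal(size=(Dp, H + 1))

    dV, dW = gradients(x, y, W, V)
    e = lambda W_, V_: 0.5 * np.sum((y - forward(x, W_, V_)[1]) ** 2)
    eps, (h, d) = 1e-6, (1, 2)
    Wp = W.copy(); Wp[h, d] += eps
    print(dW[h, d], (e(Wp, V) - e(W, V)) / eps)   # the two numbers should match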
Figure: an MLP with H = 2 hidden units; the error on the training and validation sets over the training epochs; and, after 100, 200 and 300 epochs, (a) the hyperplanes of the hidden units in layer 1, (b) the hidden unit outputs, (c) the hidden unit outputs × the weights in layer 2.
Training procedures
Training neural nets in practice is tricky and requires some expertise and trial-and-error.
• Gradient descent can converge very slowly in optimization problems with many parameters.
• Vanishing gradients problem: there is one product σ ′ (z) = σ(z) (1 − σ(z)) per layer. If the
value of z is large, which will happen if the weights or inputs at that layer are large, then the
sigmoid saturates and σ ′ (z) ≈ 0, so the gradient becomes tiny, and it takes many iterations to
make progress. This is particularly problematic with deep networks (having several layers of
hidden units). Other nonlinearities have less of a problem, e.g. ReLU.
• Initial weights: small random values (e.g. uniform in [−0.01, 0.01]) so as not to saturate the
sigmoids.
• Normalizing the inputs so they have zero mean and unit variance speeds up the optimization
(since we use a single step size η for all parameters).
• Rescaling the learning rate ηw for each weight w by using a diagonal approximation to the
Hessian of the objective function E(w). This helps to correct for the fact that weights in
different layers, or subject to weight sharing, may have different effects on the output.
• It is also possible to use other optimization methods (modified Newton’s method, conjugate
gradients, L-BFGS, etc.) but, for problems with large N and redundant samples, they don’t
seem to improve significantly over SGD (with properly tuned learning rate, minibatch size and
momentum term).
Overtraining
• Model selection for the number of hidden units (hence weights): the bias-variance tradeoff
applies as usual:
– Neural nets with many hidden units can achieve a very low training error but memorize
the noise as well as the signal, hence overfit.
– Neural nets with few hidden units can have a large bias, hence underfit.
The number of hidden units can be estimated by cross-validation.
• More generally, one needs to select the overall architecture of the neural net (number of layers
and of hidden units in each layer, connectivity between them, choice of nonlinearity, etc.). This
is done by trial and error and can be very time-consuming.
• Early stopping: during the (stochastic) gradient descent optimization (with a given number of
hidden units), we monitor the error on a validation set and stop training when the validation
error doesn’t keep decreasing. This helps to prevent overfitting as well.
• Weight decay: we discourage weights from taking large values by adding to the objective
function, or equivalently to the update, a penalty term:
E′(w) = E(w) + (λ/2) Σ_i wi²   ⟺   ∆wi = −η (∂E/∂wi + λ wi).
This has the effect of choosing networks with small weights as long as they have a low training
error (depending on λ). Such networks are smoother and generalize better to unseen data. The
value of λ is set by cross-validation.
Figure: training and validation mean squared error as a function of the number of hidden units (left) and of the training epochs (right).
• Weight sharing: the values of certain weights are constrained to be equal. For example, with
convolutional neural nets, the weights in the window are the same for every hidden unit. This corresponds to using a filter that is homogeneous in space or time, and helps to detect features
regardless of their location in the input. Ex: edges at different locations and of different orientations.
• Convolutional neural nets with weight sharing have a much smaller number of weights, can be
trained faster, and usually give better results than fully-connected networks.
• This process may be repeated in successive layers and make it possible for the network to learn
a hierarchical representation of the data with each layer learning progressively more complex
and abstract concepts.
Ex: pixels → edges in different orientations → corners, T-junctions, contour segments → parts → objects → classes of objects.
• With temporal data, as in speech recognition or machine translation, one may use time-delay
neural nets or recurrent neural nets.
• Deep learning refers to neural networks having multiple layers that can learn hierarchical rep-
resentations of complex data such as images. Properly trained on large, labeled datasets (using
GPUs), they have achieved impressive results in recent years in difficult tasks such as object
recognition, speech recognition, machine translation or game learning. This success is due to
the ability of the network to learn nonlinear mappings in high-dimensional spaces, and to the
fact that the network is learned end-to-end, i.e., all its weights are optimized with little human
contribution (rather than e.g. having a preprocessing layer that is fixed by hand by an expert).
Ex: instead of extracting preselected features from an image (e.g. SIFT features) or from the speech waveform (e.g. MFCCs), the
first layer(s) of the net learn those features optimally.
Hints
• Hints are properties of the target function that are known to us independent of the training
data. They should be built into the neural net if possible, because they help to learn a good
network, particularly when the training data is limited.
• Ex: invariance. The output of a classifier is often invariant to certain known transformations
of the input:
Dimensionality reduction: autoencoders
• Autoencoder: an MLP that, given a training set {xn}_{n=1}^N ⊂ R^D without labels, tries to reconstruct its input ("labels" yn = xn), but has a bottleneck layer, whose number of hidden units H is smaller than the dimension D of the input x. Its output layer is linear and it is trained by least-squares error (effectively, it is a regression problem using as mapping f ◦ F):
min_{W1, W2} E(W1, W2) = (1/2) Σ_{n=1}^N ‖xn − f(F(xn; W1); W2)‖².
• This forces the overall network f(F(x)) to learn an optimal low-dimensional representation
zn = F(xn ) ∈ RH of each instance xn ∈ RD in the bottleneck (or “code”) layer. The codes (=
outputs of the bottleneck hidden layer) are optimal for reconstructing the input instances.
• If the hidden layers are all linear then the network becomes equivalent to PCA? .
The hidden unit weights need not be the H leading eigenvectors of the data covariance matrix, but they span the same subspace.
The JPEG image compression standard uses a linear encoder and decoder, but the weights of these are not learned from a dataset
(as PCA would do); instead, they are fixed to the coefficients of the Discrete Cosine Transform.
If they are nonlinear (e.g. sigmoidal) then it learns a nonlinear dimensionality reduction.
Figure: autoencoder architecture (encoder from the inputs x1, . . . , xd to the bottleneck code z1, . . . , zH, and decoder from the code back to the outputs y1, . . . , yd; linear and nonlinear versions), and the 2D hidden representation learned for a set of handwritten digit images, where digits of the same class cluster together.
• If a neural network is trained for classification and has a bottleneck (a hidden layer with
dimension smaller than the input), then the network will learn a low-dimensional representation
in that hidden layer that is optimal for classification (similar to LDA if using linear layers).
14 Radial basis function networks
• Assume a dataset {(xn, yn)}_{n=1}^N with xn ∈ R^D and yn ∈ R^{D′} for regression or yn ∈ {1, . . . , K} for classification.
• They can be seen as a feedforward neural net with a single hidden layer of Gaussian units and
an output layer of linear units. They are not usually constructed with more hidden layers, unlike MLPs.
• Two types of representation of information in neural nets: assume for simplicity that units
output either 0 (not activated) or 1 (activated):
Each unit is locally tuned to a small area of the input space called its receptive field, and
the input space is paved with such units.
Neurons in the primary visual cortex respond only to highly localized stimuli in retinal position and angle of visual orientation.
Figure: left, an RBF network (inputs x, Gaussian hidden units ph with centroids µh and a constant unit p0 = +1, and linear output units yi with weights wih); right, Gaussian receptive fields with centroids m1, m2, m3 paving the input space, with sample points xa, xb, xc.
• Training for regression: we minimize the least-squares error
E({wh, µh}_{h=1}^H, σ) = (1/2) Σ_{n=1}^N ‖yn − f(xn)‖² + λ Σ_{h=1}^H ‖wh‖²
where λ ≥ 0 is a regularization user parameter which controls the smoothness of f. We can use gradient descent, where the gradient is computed using the chain rule and is similar to that of an MLP (p. 330). However, RBF nets can be trained in an approximate but much simpler and faster way as follows:
1. Set the centroids {µh}_{h=1}^H in an unsupervised way using only the input points {xn}_{n=1}^N, usually by running K-means with H centroids, or by selecting H points at random.
With a suitable value for σ, this effectively "covers" the training data with Gaussians.
3. Given the centroids and width, the values φh(xn) are fixed, and the weights {wh}_{h=1}^H are determined by optimizing E, which reduces to a simple linear regression. We solve the linear system✐ (recall the normal equations of chapter 5):
(ΦΦ^T + λI) W = ΦY^T,   with Φ of H × N with entries (φh(xn))_{hn}, W of H × D′ equal to (w1, . . . , wH)^T, and Y of D′ × N equal to (y1, . . . , yN).
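A minimal sketch of this two-step training for regression (k-means from scikit-learn for step 1; σ and λ are fixed by hand here, in practice chosen by cross-validation; the code stores Y as N × D′, the transpose of the convention above):

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_rbf(X, Y, H=20, sigma=1.0, lam=1e-3):
        # X: N x D inputs, Y: N x D' outputs.
        mu = KMeans(n_clusters=H, n_init=10, random_state=0).fit(X).cluster_centers_   # step 1
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # N x H squared distances
        Phi = np.exp(-d2 / (2 * sigma ** 2)).T                       # H x N matrix of phi_h(x_n)
        # step 3: weights from the regularized normal equations (Phi Phi^T + lam I) W = Phi Y
        W = np.linalg.solve(Phi @ Phi.T + lam * np.eye(H), Phi @ Y)  # H x D'
        return mu, W

    def predict_rbf(X, mu, W, sigma=1.0):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        Phi = np.exp(-d2 / (2 * sigma ** 2))                         # N x H
        return Phi @ W                                               # f(x_n) = sum_h w_h phi_h(x_n)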
• Training for classification: we minimize the cross-entropy error
E({wh, µh}_{h=1}^H, σ) = −Σ_{n=1}^N Σ_{i=1}^K yin log θin + λ Σ_{h=1}^H ‖wh‖²
θin = exp (fi(xn)) / Σ_{k=1}^K exp (fk(xn)),   fi(x) = Σ_{h=1}^H wih φh(x).
That is, we use an RBF net with H BFs and K outputs, which we pass through a softmax. This can either be optimized by gradient descent, or trained approximately as above (but the optimization over the weights cannot be solved by a linear system and requires gradient descent itself).
– The width σ (if not optimized over or set by hand): the larger, the smoother the RBF net.
✐ What happens if σ → ∞? And if σ → 0?
The hyperparameters (H, λ, σ) are all set by cross-validation. We usually search over a range of values of H, log λ and log σ.
• Because they are localized, RBF nets may need many basis functions to achieve a desired
accuracy.
• Universal approximation: any continuous function (satisfying mild assumptions) from R^D to R^{D′} can be approximated by an RBF net with an error as small as desired (by using sufficiently many basis functions).
Intuitive proof: we can enclose every input instance (or region) with a basis function having as centroid that instance and a small
enough width so it has little overlap with other BFs. This way we “grid” the input space RD with sufficiently many BFs, so that
only one BF is active for a given instance xn . Finally, we set the weight of each BF to the desired function value. This gives
an approximately piecewise constant approximation to the function (essentially equal to a nonparametric kernel smoother). Its
accuracy may be increased by using more hidden units and placing a finer grid in the input.
Normalized basis functions
• With Gaussian BFs, it is possible that φh(x) = 0 for all h = 1, . . . , H for some x ∈ R^D. This happens if x is far enough (wrt σ) from each centroid. Normalized BFs divide each BF by their sum, φh(x) / Σ_{h′=1}^H φh′(x). Now, even if x is far (wrt σ) from each centroid, at least one BF will have a nonzero value.
• φh(x) can be seen as the posterior probability p(h|x) of BF h given input x assuming a Gaussian mixture model p(x), where component h has mean µh, and all components have the same proportion 1/H and the same, isotropic covariance matrix Σh = σ²I.
• The nonparametric kernel smoother of chapter 7 using a Gaussian kernel of bandwidth h > 0,
g(x) = Σ_{n=1}^N [ K(‖x − xn‖/h) / Σ_{n′=1}^N K(‖x − xn′‖/h) ] yn,
has the same form as an RBF net with normalized BFs, one centroid per data point and weights wn = yn.
15 Kernel machines (support vector machines, SVMs)
• Discriminant-based method: it models the classification boundary, not the classes themselves.
• It can produce linear classifiers (linear SVMs) or nonlinear classifiers (kernel SVMs).
• Very effective in practice with certain problems:
– It gives good generalization to test cases.
– The optimization problem is convex and has a unique solution (no local optima).
– It can be trained on reasonably large datasets. Large datasets require an approximate solution.
– Special kernels can be defined for many applications (where feature vectors are not naturally defined).
• It can be extended beyond classification to solve regression, dimensionality reduction, outlier
detection and other problems.
The basic approach is the same in all cases: maximize the margin and penalize deviations, resulting in a convex quadratic program.
• We have seen several linear classifiers: logistic regression, Gaussian classes with common co-
variance, now linear SVMs. . . They give different results because they have different inductive
bias, i.e., make different assumptions (the objective function, etc.).
• This problem has a unique minimizer (w, w0). The closest instances to each side of the hyperplane satisfy? yn (w^T xn + w0) = 1 ⇔ w^T xn + w0 = yn and they are at a distance? 1/‖w‖ from the hyperplane. If the dataset is not linearly separable, then the QP is infeasible: it has no solution for {w, w0}.
• This is a constrained optimization problem, specifically a convex quadratic program (QP), de-
fined on D + 1 variables (w, w0) with N linear constraints and a quadratic objective function.
Its solution can be found with different QP algorithms.
• Solving the primal problem, a QP on D + 1 variables, is equivalent to solving the following dual problem, another convex QP but on N variables α = (α1, . . . , αN)^T and N + 1 constraints (α1, . . . , αN are the Lagrange multipliers of the N constraints in the primal QP):
Dual QP:   min_{α∈R^N} (1/2) Σ_{n,m=1}^N yn ym (xn^T xm) αn αm − Σ_{n=1}^N αn   s.t.   Σ_{n=1}^N yn αn = 0,  αn ≥ 0,  n = 1, . . . , N.
This can be solved in Θ(N³ + N²D) time and Θ(N²) space (because we need the dot products xn^T xm for n, m = 1, . . . , N). The optimal α ∈ R^N has the property that αn > 0 only for those instances that lie on the margin (the closest ones to the hyperplane), i.e., yn (w^T xn + w0) = 1, called the support vectors (SVs). For all other instances, which lie beyond the margin, αn = 0.
Each SV exerts a "force" yn αn w/‖w‖ on the margin hyperplane. The forces are balanced, keeping the hyperplanes in place.
The optimal weight vector is w = Σ_{n∈SVs} αn yn xn, so it can be written as a linear combination of the SVs. This allows kernelization later.
Figure: left, a soft-margin SVM with instances (a)–(d) at different positions relative to the margin; right, the 0/1 loss, the hinge loss and the cross-entropy as functions of yn g(xn).
In terms of ξn: (a) ξn = 0, (b) ξn = 0, (c) 0 < ξn < 1, (d) ξn > 1.
In terms of αn: (a) αn = 0, (b) 0 < αn < C, (c)–(d) αn = C.
(b)–(d) are SVs (circled instances ⊕ ⊙, on or beyond the margin).
Binary classification, not linearly separable case: soft margin hyperplane
• If the dataset is not linearly separable, we look for the hyperplane that incurs the least classification error while having the largest margin. For each data point xn, we allow a deviation ξn ≥ 0 (a slack variable) from the margin, but penalize such deviations proportionally to C > 0:
Primal QP:   min_{w∈R^D, w0∈R, ξ∈R^N} (1/2)‖w‖² + C Σ_{n=1}^N ξn   s.t.   yn (w^T xn + w0) ≥ 1 − ξn,  ξn ≥ 0,  n = 1, . . . , N.
C > 0 is a user parameter that controls the tradeoff between maximizing the margin (2/‖w‖) and minimizing the deviations or "soft error" (Σ_{n=1}^N ξn):
– C ↑: tries to avoid every single error but reduces the margin, so it can overfit.
If the dataset is linearly separable, {w, w0} will separate the data for large enough C?. If the dataset is not linearly separable, finding the hyperplane with fewest misclassifications (0/1 loss) is NP-hard, and no value of C will find it in general.
which differs from the dual QP of the linearly separable case in the extra “αn ≤ C” constraints.
The optimal α ∈ RN has the property that αn > 0 only for those instances that lie on or within
the margin or are misclassified, i.e., yn (wT xn + w0 ) ≤ 1, called the support vectors (SVs). For
all other instances, which lie beyond the margin, αn = 0. In summary, for each x1 , . . . , xN :
• The primal problem can be written equivalently but without constraints by eliminating ξn as:
min_{w∈R^D, w0∈R} (1/2)‖w‖² + C Σ_{n=1}^N [1 − yn g(xn)]+   where g(x) = w^T x + w0.
This defines the hinge loss, which penalizes errors linearly:
[1 − yn g(xn)]+ = max (0, 1 − yn g(xn)) = 0 if yn g(xn) ≥ 1, and 1 − yn g(xn) otherwise,
and is more robust against outliers compared to the quadratic loss (1 − yn g(xn))².
Kernelization: kernel SVMs for nonlinear binary classification
• We map the data points x to a higher-dimensional space. Learning a linear model in the new
space corresponds to learning a nonlinear model in the original space.
• Assume we map x ∈ R^D → z = φ(x) = (φ1(x), . . . , φK(x))^T ∈ R^K using K fixed basis functions, where usually K ≫ D (or even K = ∞), and z1 = φ1(x) ≡ 1 (so we don't have to write a bias term explicitly). The discriminant is now g(x) = w^T φ(x) = Σ_{i=1}^K wi φi(x).
• The primal and dual soft-margin QPs are as before but replacing xn with φ(xn):
Primal QP:   min_{w∈R^K, ξ∈R^N} (1/2)‖w‖² + C Σ_{n=1}^N ξn   s.t.   yn (w^T φ(xn)) ≥ 1 − ξn,  ξn ≥ 0,  n = 1, . . . , N
Dual QP:   min_{α∈R^N} (1/2) Σ_{n,m=1}^N yn ym (φ(xn)^T φ(xm)) αn αm − Σ_{n=1}^N αn   s.t.   Σ_{n=1}^N yn αn = 0,  C ≥ αn ≥ 0,  n = 1, . . . , N
where φ(xn)^T φ(xm) = K(xn, xm).
Optimal solution:   w = Σ_{n=1}^N αn yn φ(xn) = Σ_{n∈SVs} αn yn φ(xn)
Discriminant:   g(x) = w^T φ(x) = Σ_{n=1}^N αn yn (φ(xn)^T φ(x)), with φ(xn)^T φ(x) = K(xn, x).
• Kernelization: we replace the inner product φ(xn)^T φ(xm) between basis functions with a kernel function K(xn, xm) that operates on a pair of instances in the original input space. This saves the work of having to construct φ(·), whose dimension can be very high or infinite, and taking the dot product. All we need in practice is the kernel K(·, ·), not the basis functions φ(·).
• The resulting, nonlinear discriminant (the kernel SVM) is
g(x) = w^T φ(x) = Σ_{n=1}^N αn yn φ(xn)^T φ(x)   (requires w, φ)   = Σ_{n=1}^N αn yn K(xn, x) = Σ_{n∈SVs} αn yn K(xn, x)   (requires only K, α).
Again, we can use the kernel K(·, ·) directly and don’t need the basis functions.
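A short sketch checking this on a toy problem with scikit-learn's SVC (its dual_coef_ stores αn yn for the SVs, and it adds an intercept w0, a slight difference from the bias-free formulation above; the toy data is arbitrary):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
    y = np.hstack([-np.ones(50), +np.ones(50)])

    gamma = 0.5
    svm = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

    x = rng.normal(size=2)                                     # a test point
    sv, coef = svm.support_vectors_, svm.dual_coef_.ravel()    # coef_n = alpha_n * y_n
    K = np.exp(-gamma * ((sv - x) ** 2).sum(axis=1))           # Gaussian kernel K(x_n, x)
    g = coef @ K + svm.intercept_[0]                           # sum_{n in SVs} alpha_n y_n K(x_n, x) + w_0
    print(g, svm.decision_function([x])[0])                    # same value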
• Many other algorithms that depend on dot products of input instances have been kernelized
(hence made nonlinear) in the same way: PCA, LDA, etc.
• The fundamental object in a kernel SVM, which determines the resulting discriminant, is the
kernel, and the N × N Gram matrix of dot products it defines on a training set of N points:
K = (K(xn , xm ))nm = (φ(xn )T φ(xm ))nm .
For this matrix and the dual QP to be well defined, the kernel K(·, ·) must be a symmetric positive definite function, i.e., it must satisfy u^T Ku ≥ 0 ∀u ∈ R^N, u ≠ 0.
• A kernel SVM can be seen as a basis function expansion (as in RBF networks), but the number
of BFs is not set by the user (possibly by cross-validation); it is determined automatically
during training, and each BF corresponds to one input instance that is a support vector.
• At test time, a kernel SVM works like a template classifier, by "comparing" the test pattern x with the SVs (templates) by means of the kernel, and combining these comparisons into g(x) to compute the final discriminant. Ex: for MNIST handwritten digits with SV images x1, x2, . . . :
g(x) = Σ_{n∈SVs} αn yn K(xn, x) = α1 y1 K(x1, x) + α2 y2 K(x2, x) + · · ·
where each term K(xn, x) measures how similar x is to the nth SV.
This is also true of RBF networks. Very different from neural nets, which learn nested functions of the input vector x.
Figure: kernel SVM decision boundaries with a polynomial kernel of degree 2 (left) and with a Gaussian kernel of different widths s² = 2, 0.5, 0.25, 0.1 (right). Circled instances ⊕, ⊙ are SVs; +, · are non-SVs.
Kernels
• Intuitively, the kernel function K(x, y) ≥ 0 measures the similarity between two input points
x, y, and determines the form of the discriminant g(x). Different kernels may be useful in
different applications.
• The most practically used kernels (and their hyperparameters) are, for x, y ∈ RD :
– Polynomial kernel of degree q: K(x, y) = (x^T y + 1)^q.
Ex. for D = 2 and q = 2: K(x, y) = (x1 y1 + x2 y2 + 1)² = 1 + 2 x1 y1 + 2 x2 y2 + 2 x1 x2 y1 y2 + x1² y1² + x2² y2², which corresponds to the inner product of the basis function φ(x) = (1, √2 x1, √2 x2, √2 x1 x2, x1², x2²)^T ∈ R⁶ ✐.
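A quick numeric check of this identity on arbitrary 2D vectors:

    import numpy as np

    phi = lambda x: np.array([1, np.sqrt(2)*x[0], np.sqrt(2)*x[1],
                              np.sqrt(2)*x[0]*x[1], x[0]**2, x[1]**2])
    x, y = np.array([0.3, -1.2]), np.array([2.0, 0.7])
    print((x @ y + 1) ** 2, phi(x) @ phi(y))    # identical values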
Multiclass kernel machines
With K > 2 classes, there are two common ways to define the decision for a point x ∈ R^D:
• One-vs-all: train K SVMs.
– Training: SVM gk is trained to classify the training points of class Ck (label +1) vs the training points of all other classes (label −1).
– Testing: choose Ck if k = arg max_{i=1,...,K} gi(x).
• One-vs-one: train K(K − 1)/2 SVMs.
– Training: SVM gij is trained to classify the training points of class Ci (label +1) vs the training points of class Cj (label −1); training points from other classes are not used.
– Testing: choose Ck if k = arg max_{i=1,...,K} Σ_{j≠i} gij(x). Also possible: for each class k, count the number of times gkj(x) > 0 for j ≠ k, and pick the class with most votes.
Kernel machines for regression: support vector regression
• We are given a sample {(xn, yn)}_{n=1}^N with xn ∈ R^D and yn ∈ R.
• Instead of the least-squares error (yn − f(xn))², in support vector regression we use the ε-sensitive loss function:
[|yn − f(xn)| − ε]+ = max (0, |yn − f(xn)| − ε) = 0 if |yn − f(xn)| < ε, and |yn − f(xn)| − ε otherwise,
which means we tolerate errors up to ε > 0 and that errors beyond ε have a linear penalty and not a quadratic one. This error function is more robust to noise and outliers; the estimated f will be less affected by them.
• As with the soft-margin hyperplane, we introduce slack variables ξn+, ξn− to account for deviations (positive and negative) out of the ε-zone. We get the following primal QP:
min_{w∈R^D, w0∈R, ξ+,ξ−∈R^N} (1/2)‖w‖² + C Σ_{n=1}^N (ξn+ + ξn−)   s.t.   −ε − ξn− ≤ yn − (w^T xn + w0) ≤ ε + ξn+,   ξn+, ξn− ≥ 0,  n = 1, . . . , N.
• This is a convex QP on D + 1 + 2N variables and its unique minimizer (w, w0, ξ+, ξ−) can be found with different QP algorithms. It is equivalent to solving the dual QP on 2N variables:
min_{α+,α−∈R^N} (1/2) Σ_{n,m=1}^N (αn+ − αn−)(αm+ − αm−)(xn^T xm) + ε Σ_{n=1}^N (αn+ + αn−) − Σ_{n=1}^N yn (αn+ − αn−)
s.t.   Σ_{n=1}^N (αn+ − αn−) = 0,   C ≥ αn+, αn− ≥ 0,  n = 1, . . . , N.
– αn+ = αn− = 0: xn is fitted within the ε-tube.
– Either αn+ or αn− is in (0, C): xn is fitted on the boundary of the ε-tube. We use these instances to calculate w0, since they satisfy yn = w^T xn + w0 ± ε if αn∓ > 0.
– αn+ = C or αn− = C: xn is fitted outside the ε-tube.
• Hence, the fitted line can be written as a weighted sum of the support vectors:
f(x) = w^T x + w0 = Σ_{n=1}^N (αn+ − αn−)(xn^T x) + w0.
Figure: support vector regression fits; (a) points with αn+ = αn− = 0 (non-SVs, ×), (b) points with αn+ or αn− in (0, C) (SVs on the margin, ⊗), (c) points with αn+ or αn− = C (SVs beyond the margin, ⊠); and a comparison of the least-squares and ε-sensitive fits.
Joint distribution: p(X = x, Y = y).
Conditioning (product rule): p(Y = y | X = x) = p(X = x, Y = y) / p(X = x).
Marginalizing (sum rule): p(X = x) = Σ_y p(X = x, Y = y).
A graphical model represents the joint probability P (all variables) of several random variables (ob-
served or hidden) in a visual way, via a graph that indicates their assumed interaction (or dependence
structure) and has parametric models at the nodes. It is useful to represent uncertainty and to com-
pute complex queries of the form P (some vars | some vars).
Like a database query on variables X = age, Y = salary, etc. To answer p(X = x, Y = y) we count all records satisfying X = x and Y = y and divide by the total number of records; to answer p(X = x|Y = y) we count all records satisfying X = x among all that satisfy Y = y and divide by the number of records satisfying Y = y; etc. But these are very simple probability estimates. With a graphical model we introduce parametric models (Bernoulli, Gaussian, etc.) and apply probability calculus to compute the result.
• Types of independence of random variables: for all values of the variables involved,
– X and Y are independent iff p(X, Y ) = p(X) p(Y ).
Equivalently? p(X|Y ) = p(X) and p(Y |X) = p(Y ). Knowing Y tells me nothing about X and vice versa.
– X and Y are conditionally independent given Z iff p(X, Y |Z) = p(X|Z) p(Y |Z).
Equivalently? p(X|Y, Z) = p(X|Z) and p(Y |X, Z) = p(Y |Z).
Compact notation: p(R̄) means p(R = 0), p(W̄|R) means p(W = 0|R = 1), etc., but only for the examples about R, S, W and C. Otherwise, X, Y or Z mean generic random variables.
Example (two variables)
Binary variables "rain" R and "wet grass" W (and later "cloudy weather" C, "sprinkler" S), with a graphical model R → W ("rain causes wet grass") having completely specified conditional distributions at each node (conditional probability tables (CPT)), giving p(R) and p(W|R) for all combinations of values of R and W: P(R) = 0.4, P(W|R) = 0.9, P(W|R̄) = 0.2. This specifies the joint distribution over all variables: p(R, W) = p(R) p(W|R) ∀R, W ∈ {0, 1}. From this we can compute any specific distribution over groups of variables:
• ✐ Complements: p(R̄) = 0.6, p(W̄|R) = 0.1, p(W̄|R̄) = 0.8. These can be derived from the values above, so they are not stored explicitly.
• ✐ Marginal: p(W) = Σ_{R∈{0,1}} p(R, W) = p(W|R) p(R) + p(W|R̄) p(R̄) = 0.48.
• ✐ Bayes' rule: p(R|W) = p(W|R) p(R) / p(W) = 0.75. Inverts the dependency to give a diagnosis.
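The same computations in a few lines (only the three stored CPT values are given; everything else is derived):

    p_R = 0.4                          # P(R = 1)
    p_W_given_R = {1: 0.9, 0: 0.2}     # P(W = 1 | R)

    # Marginal by summing over R, and Bayes' rule to "invert" the dependency.
    p_W = sum(p_W_given_R[r] * (p_R if r else 1 - p_R) for r in (0, 1))   # = 0.48
    p_R_given_W = p_W_given_R[1] * p_R / p_W                              # = 0.75
    print(p_W, p_R_given_W)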
Example (three variables): canonical cases for conditional independence
• By repeated application of the rule of conditional probability, we can write any joint distribution
over d random variables without assumptions as follows (✐ What graphical model corresponds to this?):
p(X1, X2, . . . , Xd) = p(X1) p(X2|X1) p(X3|X1, X2) · · · p(Xd|X1, X2, . . . , Xd−1).
• In particular, consider 3 random variables X, Y and Z. Their joint distribution can always be
written, without assumption, as p(X, Y, Z) = p(X) p(Y |X) p(Z|X, Y ) (or any permutation of X, Y, Z).
This can be restricted with assumptions given by conditional independencies, i.e., by removing
arrows from the full graphical model X Y Z . There are 3 canonical cases.
Summary
For d variables X1 , X2 , . . . , Xd (binary, but this generalizes to discrete or continuous variables):
• A node for a variable Xi of the form p(Xi|parents(Xi)) needs a CPT with 2^{|parents(Xi)|} parameters, giving the probability of Xi = 1 for every combination of the values of parents(Xi).
• The general expression for a joint distribution without assumptions requires 2^d − 1 parameters?:
p(X1, X2, . . . , Xd) = p(X1) p(X2|X1) p(X3|X1, X2) · · · p(Xd|X1, X2, . . . , Xd−1)
which is computationally intractable unless d is very small. In a graphical model, we simplify it by applying conditional independence assumptions (based on domain knowledge):
p(X1, X2, . . . , Xd) = Π_{i=1}^d p(Xi|parents(Xi))
which requires 2^{|parents(Xi)|} parameters at each node Xi, which is much smaller in total.
• In turn, this simplifies the computations required for:
– Inference (testing): answering questions of the form “p(X1 , X7 |X3 , X5 ) = ?”.
– Learning (training): learning the parameters at each node (tables of probability values).
• In graphical models, we need not specify explicitly some variables as “inputs” and others as
“outputs” as in a supervised problem (e.g. classification). Having the trained graphical model
(i.e., with known values for the conditional distribution table at each node), we can set the values of any subset of the random variables Sinputs ⊂ {X1, . . . , Xd} (e.g. based on observed data) and do inference over any other subset Soutputs ⊂ {X1, . . . , Xd} (unobserved variables), i.e., compute p(Soutputs|Sinputs), where Soutputs ∩ Sinputs = ∅. The graphical model is a "probabilistic database", a machine that can answer queries regarding the values of random variables.
But rather than search for items in a database that satisfy the query, we use probability calculus to compute a probability.
• We can also have hidden variables, which are never observed, but which help in defining the
dependency structure.
• Training, i.e., estimating the parameters of the graphical model given a dataset, is typically
done by maximum likelihood.
• The computational complexity of training and inference depends on the type of graphical model:
– If the graph is a tree: exact inference takes polynomial time (junction-tree algorithm).
– Otherwise, exact inference is NP-hard and becomes intractable for large models. Then we
use approximate algorithms (belief propagation, variational methods, Markov chain Monte Carlo, etc.).
Generative models as graphical models
• Generative model : a graphical model that represents the process we believe created the data.
• Convenient to design specific graphical models for supervised and unsupervised problems in
machine learning (classification, regression, clustering, etc.).
• The joint distribution is not defined using parents (which imply a directed graph), but using:
– Cliques (clique = set of nodes such that there is a link between every pair of nodes in the clique).
– Potential functions ψC (XC ), where XC is the set of variables in clique C.
For X = (X1, . . . , Xd):   p(X) = (1/Z) Π_{all cliques C} ψC(XC),   Z = Σ_X Π_{all cliques C} ψC(XC).
The partition function Z is a normalization constant that ensures that Σ_X p(X) = 1.
• Ex: denoising an image Y. Consider an image X = (X1, . . . , Xd) with d black or white pixels Xi ∈ {0, 1}, so X can take 2^d different values. Connecting each pixel to its 4 neighboring pixels defines two types of cliques: single pixels Xi, and pairs of neighboring pixels Xi ∼ Xj. Then:
p(X) = (1/Z) ( Π_{i=1}^d ψ1(Xi) ) ( Π_{i∼j} ψ2(Xi, Xj) ),   ψ1(Xi) = e^{−|Yi−Xi|} (encourages Xi = Yi),   ψ2(Xi, Xj) = e^{−α|Xi−Xj|} (encourages Xi = Xj)
where Y1, . . . , Yd are the pixel values for an observed, noisy image Y and α > 0. Then, p(X) is high if X is close to Y (single-pixel cliques), and if X is smooth (neighboring pixels tend to have the same value), as controlled by α. We can compute a denoised image as X∗ = arg max_X p(X).
If α = 0 then X∗ = Y (no denoising); if α → ∞ then X∗ tends to a constant image.
In statistical mechanics, the Ising model for magnetic materials is similar to an MRF: pixel = atom with spin ∈ {↑, ↓}.
17 Discrete Markov models and hidden Markov models
• Up to now, we have regarded data points x1, . . . , xN as independently and identically distributed (iid): each xn is an independent sample from a (possibly unknown) distribution p(x; Θ) with parameters Θ. Hence, the log-likelihood is simply a sum of the individual log-likelihoods for each data point: log P(x1, . . . , xN) = Σ_{n=1}^N log p(xn; Θ). As a graphical model: x1  x2  · · ·  xN (disconnected nodes).
• Now, we relax that assumption: x1 , . . . , xN depend on each other. Ex: sampling from a Bernoulli coin is
likely to generate something like X = 000110001111001 (iid samples) but not like X = 00000011111100011111110000 (not iid).
• First-order Markov model : assumes the state at time t + 1 depends only on the state at time t
(it is conditionally independent of all other times given time t, or “the future is independent of the past given the present”):
• Homogeneous Markov model: assumes the transition probability from Si to Sj (for any pair of states Si, Sj) is independent of the time (so going from Si to Sj has the same probability no matter when it happens):
aij ≡ p(xt+1 = Sj | xt = Si),   Σ_{j=1}^K aij = 1, i = 1, . . . , K,   aij ∈ [0, 1], i, j = 1, . . . , K.
• Sampling: given A and π, we can generate a sequence of states of length T as follows: sample
x1 from π; sample x2 from Ax1 • ; . . . ; sample xT from AxT −1 • .
• Training: given N observed sequences of length T, we want to estimate the parameters (A, π). This can be done by maximum likelihood:
max_{π,A} Σ_{n=1}^N log P(sequence n; A, π)   where   P(sequence n; A, π) = π_{xn1} Π_{t=2}^T a_{xn,t−1, xnt}.
The solution is?:   πi = (# sequences starting with Si) / (# sequences),   aij = (# transitions from Si to Sj) / (# transitions from Si).
• Application: language models (e.g. sequences of English words), often as part of an HMM (cf. the Google Ngram Viewer). Each state is a word from a vocabulary of size K (> 10⁴ usually). Sequences of m = 1, 2, 3 . . . words are called unigrams, bigrams, trigrams, etc.
• A discrete Markov model of order m can generate more and more realistic sequences as we
increase m. But this needs an (m + 1)-way table with K^{m+1} transition probabilities, which
quickly becomes impractical in terms of memory. Another problem is that sequences of
length m + 1 which are possible but happen not to occur in the training data will receive a
transition-matrix value of zero. If such a sequence appears at test time, the model will thus say
it has probability zero and fail to recognize it. This is bound to occur with language models once
m ≥ 2, however large the training set. To prevent this, in practice one sets every zero value in
the transition matrix (= unobserved sequence) to a small probability ε > 0 (“smoothing”), as sketched below.
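A minimal sketch of that fix (ε is a user-chosen small constant, applied here to an already-estimated transition matrix and renormalized row by row):

    import numpy as np

    def smooth_zeros(A, eps=1e-6):
        # Give every unobserved (zero-probability) transition a small probability eps,
        # then renormalize each row so it still sums to 1.
        A = np.where(A == 0, eps, A)
        return A / A.sum(axis=1, keepdims=True)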
• Altogether, the HMM is characterized by 3 groups of parameters: the transition matrix A, the
observation matrix B (assuming discrete observations), and the initial distribution π.
• Training: given a training set of observation sequences, learn the HMM parameters (A, B, π)
that maximize the probability of generating those sequences. If the sequences are labeled (we
do have the state xt as well as the observation yt for each t), training is easy: A, π can be
trained separately from B. The former as in a discrete Markov model; the latter by training a
classifier separately for each state on its corresponding observations ✐. Ex: in ASR, a phonetician can
(painstakingly) label the acoustic observations with their corresponding phonemes. If the sequences are not labeled
(we only have the observations {yt }), training can be solved with a form of the EM algorithm
(the Baum-Welch algorithm). E step: compute “p(X|Y ; A, B, π)” for the current parameters (A, B, π). M step:
reestimate the parameters (A, B, π), as in a discrete Markov process. ✐
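The quantities needed in the E step are computed with the forward–backward recursions; as an illustration, here is a minimal sketch of the forward pass, which already gives the sequence likelihood p(Y; A, B, π) for discrete observations (scaling or log-space arithmetic, needed in practice for long sequences, is omitted):

    import numpy as np

    def forward_likelihood(Y, pi, A, B):
        # p(Y; A, B, pi) for an HMM with K states and discrete observation symbols.
        #   pi: (K,)   initial state distribution
        #   A : (K,K)  transition matrix, A[i,j] = p(x_{t+1}=S_j | x_t=S_i)
        #   B : (K,M)  observation matrix, B[i,m] = p(y_t=m | x_t=S_i)
        alpha = pi * B[:, Y[0]]              # alpha_1(i) = pi_i * b_i(y_1)
        for y in Y[1:]:
            alpha = (alpha @ A) * B[:, y]    # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(y_{t+1})
        return alpha.sum()                   # p(Y) = sum_i alpha_T(i)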
In an HMM, the state variables follow a first-order Markov process: p(xt | x1, . . . , xt−1) = p(xt | xt−1).
But the observed variables do not: p(yt | yt−1, . . . , y1) ≠ p(yt | yt−1). In fact, yt depends (implicitly) on
all previous observations. An HMM achieves this long-range dependency with fewer parameters than
using a Markov process of order m.
Continuous states and observations: tracking (dynamical models)
• Both the states and the observations are continuous. We want to predict the state at time t + 1
given the state xt and the observation yt at time t.
• Many applications in control, computer vision, robotics, etc. Ex: predicting the trajectory of
a guided missile, mobile robot, drone, pedestrian, 3D pose of a person, etc. State xt = 3D spatial
coordinates / velocity / orientation of robot, observation yt = sensor information (camera image, depth sensor, etc.).
18 Reinforcement learning
• Reinforcement learning studies how an autonomous agent that senses and acts in its environment can learn to choose optimal
actions to achieve a goal. Ex.: board games (e.g. chess, backgammon), robot navigation (e.g.
looking for the exit of a maze).
– Each time the agent performs an action, it may receive a reward (or penalty) to indicate
how good the resulting state of the environment is.
– Often the reward is delayed.
At the end of the game (positive reward if we win, negative if we lose); or when (if) the robot reaches the exit.
– The task of the agent is to learn from this reward to choose sequences of actions that
produce the greatest cumulative reward.
• As in other machine learning settings, the task is to learn a function, here a control policy
π: S → A that maps states s ∈ S to actions a ∈ A, but with the following differences:
– Delayed reward: we are not given (state, action) pairs to train the function, as in supervised
learning. Instead, we (may) receive a reward when the agent executes an action. There is
no such thing as the best move at any given time; what matters is the sequence of moves.
Supervised learning is “learning with a teacher”. Reinforcement learning is “learning with a critic”. The critic doesn’t tell
us what to do but only how well we have done in the past. Its feedback is scarce and when it comes, it comes late.
– Credit assignment problem: after taking several actions and getting the reward, which
actions helped, so that we can record and recall them later on?
A reinforcement learning program learns to generate an internal value for the intermediate
states and actions in terms of how good they are in leading us to the goal. Having learned
this internal reward mechanism, the agent can just take local actions to maximize it.
– Exploration vs exploitation: the agent influences the distribution of training examples by
the actions it chooses. It should trade off exploration of unknown states and actions (to
gain information) vs exploitation of states and actions that it has already learned will
yield high reward (to maximize its cumulative reward).
– Partially observed states: in practice, sensors provide only partial information about the
environment state (e.g. a camera does not directly give the robot's location).
• Basic elements of a reinforcement learning problem:
– Agent: the decision maker (e.g. game player, robot). It has
sensors to observe the environment (e.g. robot camera).
• The task of the agent: perform sequences of actions, observe their consequences and learn a
control policy that, from any initial state, chooses actions that maximize the reward accumulated
over time.
Ex.: grid world with GOAL state, squares = states, arrows = actions (γ = 0.9). [Figure: the grid world annotated with the one-step rewards (0 or 100, the latter for reaching GOAL) and the resulting discounted values (e.g. 100, 90, 81, 72).]
Q-learning
• Learning π*: S → A directly is difficult because we don't have training examples of pairs (s, a). Instead, we
learn a numerical evaluation function Q(s, a) of states and actions, and derive π from it.
• Define the Q function: Q(s, a) = r(s, a) + γ V*(δ(s, a)), where r(s, a) is the immediate reward for taking action a in state s, δ(s, a) is the resulting next state, and V*(s′) is the maximum cumulative discounted reward achievable from state s′ (i.e. by following the optimal policy thereafter).
• We want to find π*(s) = arg max_a Q(s, a). This can be done with the following iterative
algorithm. Noting that V*(s) = max_{a′} Q(s, a′), we can write the following recursive expression
for Q: Q(s, a) = r(s, a) + γ max_{a′} Q(δ(s, a), a′). The algorithm starts with a table Q̂(s, a) = 0
for every (s, a), and repeatedly does the following: observe the current state s, choose and execute an action a, observe the reward r and the new state s′ = δ(s, a), update the table entry Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′), and set s ← s′.
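A minimal Python sketch of this tabular algorithm for a deterministic environment; the environment interface (reset/step) and the ε-greedy action choice are assumptions made here for illustration, not part of the algorithm itself:

    import numpy as np

    def q_learning(env, n_states, n_actions, gamma=0.9, episodes=1000, eps=0.1, seed=0):
        # Tabular Q-learning. Assumed (hypothetical) environment interface:
        #   env.reset() -> initial state s;  env.step(s, a) -> (s_next, reward, done).
        rng = np.random.default_rng(seed)
        Q = np.zeros((n_states, n_actions))                # start with Q_hat(s, a) = 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if rng.random() < eps:                     # explore: try a random action
                    a = int(rng.integers(n_actions))
                else:                                      # exploit: currently best action
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(s, a)
                Q[s, a] = r + gamma * np.max(Q[s_next])    # update only the old state s
                s = s_next
        return Q                                           # greedy policy: pi(s) = argmax_a Q[s, a]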
• Each update of Q̂(s, a) affects the old state s (not the new one s′ ).
• Assuming r(s, a) is bounded and every possible (state,action) pair is visited infinitely often,
one can prove: 1) the Q̂ values never decrease, and are between 0 and their optimal values Q;
2) Q-learning converges to the correct Q function.
• Note the agent need not know the functions r and δ ahead of time (which is often not practical,
e.g. for a mobile robot). It just needs to know their particular values as it takes actions, moves
to new states and receives the corresponding rewards (i.e., it samples those functions).
In choosing new actions to apply, the agent has an exploration-exploitation tradeoff (repeat actions that seem currently good vs
trying new ones). Commonly, one moves from pure exploration at the beginning towards exploitation as training proceeds.
If we have perfect knowledge of the functions r and δ (i.e., we can evaluate them for any (s, a)),
we can find the optimal policy more efficiently with a dynamic programming algorithm to solve
Bellman’s equation: V ∗ (s) = E {r(s, π(s)) + γ V ∗ (δ(s, π(s)))} ∀s ∈ S. This is convenient in
some applications, e.g. factory automation and job scheduling problems.
Training consists of a series of episodes, each beginning at a random state and executing actions
until reaching the goal state. The first nonzero update of Q̂ occurs when the goal state is reached and a
nonzero reward received; in subsequent episodes the updates propagate backward from the goal, eventually filling the entire
Q̂(·, ·) table.
– Optimal policy (as before): π* = arg max_π V^π(s), ∀s ∈ S.
– Q function: Q(s, a) = E{r(s, a) + γ V*(δ(s, a))} = E{r(s, a)} + γ E{V*(δ(s, a))} = E{r(s, a)} + γ ∑_{s′} P(s′ | s, a) V*(s′).
– Recursive expression for Q: Q(s, a) = E{r(s, a)} + γ ∑_{s′} P(s′ | s, a) max_{a′} Q(s′, a′).
– Update for Q̂ at iteration k: Q̂_k(s, a) ← (1 − α_k) Q̂_{k−1}(s, a) + α_k (r + γ max_{a′} Q̂_{k−1}(s′, a′)),
where the step size α_k decreases towards zero as k → ∞ for convergence to occur (see the sketch after this list).
This is similar to SGD compared to GD. Convergence is slow and may require many thousands of iterations.
– In practice, there may be a large or infinite number of states and actions (e.g. turning the
steering wheel by a continuous angle).
– Even if we could store the table, the previous Q-learning algorithm performs a kind of rote
learning: it simply stores seen (state,action) pairs and doesn’t generalize to unseen ones.
Indeed, the convergence proof assumes every possible (state,action) pair is visited (infinitely often).
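In code, the only change with respect to the deterministic sketch given earlier is this running-average update with a decaying step size; the schedule α_k = 1/(1 + number of visits to (s, a)) used below is one common choice, shown here purely for illustration:

    import numpy as np

    def q_update_stochastic(Q, visits, s, a, r, s_next, gamma=0.9):
        # Running-average Q-learning update for a nondeterministic MDP.
        # visits is a table of counts with the same shape as Q, initialized to zero.
        visits[s, a] += 1
        alpha = 1.0 / (1.0 + visits[s, a])                 # step size alpha_k, decaying towards 0
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # stochastic Q-learning update
        return Q, visits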
• Instead, we can represent Q(s, a) with a function approximator, such as a neural net Q(s, a; W)
with inputs (s, a), weights W, and a single output corresponding to the value of Q(s, a). Each
Q̂(s, a) update in the Q-learning algorithm provides one training example, which is used to
update the weights via SGD.
A classic example: the 1995 TD-gammon program for playing backgammon (which has ≈ 10^20 states and randomness because of
the dice roll) used a neural net trained on 1.5 million self-generated games (playing against itself) and achieved master-level play.
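A minimal sketch of one such weight update, using a linear approximator Q(s, a; w) = wᵀφ(s, a) instead of a neural net to keep it short (the feature map phi is an assumption made for illustration); as is standard, the target is treated as a constant when taking the gradient:

    import numpy as np

    def q_sgd_step(w, phi, s, a, r, s_next, actions, gamma=0.9, lr=0.01):
        # One SGD step on 0.5*(Q(s,a;w) - target)^2 for a linear approximator Q = w . phi(s,a).
        target = r + gamma * max(w @ phi(s_next, a2) for a2 in actions)   # Q-learning target
        error = w @ phi(s, a) - target
        return w - lr * error * phi(s, a)                                 # gradient step on the weights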
• The case of partially observed states can be modeled using a Partially Observable Markov Decision Process (POMDP), which
uses P(o | s, a) to model an observation o given the current state s (unobserved) and action a,
rather than P(s′ | s, a) (e.g. o = camera image, s = 3D location of the robot).
This is similar to the difference between a discrete (observable) Markov process and a hidden Markov model (HMM).