Foundations and Linear Methods
Machine Learning
Michael Wand
TA: Eric Alcaide
{michael.wand, eric.alcaide}@idsia.ch
Dalle Molle Institute for Artificial Intelligence Studies (IDSIA) USI - SUPSI
Fall Semester 2024
Contents
In this lecture we will cover the two basic tasks of supervised learning, as
well as some fundamental aspects. Our schedule is as follows:
Introduction to Regression
Introduction to Classification
Elements of the Foundations of Machine Learning
Besides introducing elementary examples of algorithms, we will also get
to know many basic concepts and definitions which we will frequently
encounter later on!
Attribution: Many figures taken from Bishop’s Machine Learning
textbook https://www.microsoft.com/en-us/research/people/
cmbishop/prml-book.
Also highly recommended as a reference!
Foundations and Linear Methods 2
Introduction to Regression
Polynomial Fitting
We are given a training set of observations
D = {dn }n=1,...,N = {(xn , tn )}n=1,...,N (represented as blue circles).
Task: Predict the value of the target variable t from the input
variable x, for an unknown input x (blue ‘?’).
For now, both input and target are one-dimensional.
Since we have observations which include the target, this is a
supervised task.
It is also a regression task (real-valued output).
Img src: Bishop, figure 1.2 (modified)
Foundations and Linear Methods 4
Polynomial Fitting
Spoiler: the training data was generated from a sine wave with
added (Gaussian) noise.
? Is there a “perfect” solution to our prediction problem?
? Can we find it, based on the available data? Can we approximate it?
? What do we expect from our solution?
Img src: Bishop, figure 1.2 (modified)
Foundations and Linear Methods 5
Polynomial Fitting
Consider the example: The
input x exactly matches one of
the training samples.
Which of the two ‘X’ is the
“correct” prediction?
Img src: Bishop, figure 1.2 (modified)
Foundations and Linear Methods 6
Polynomial Fitting
Consider the example: The
input x exactly matches one of
the training samples.
Which of the two ‘X’ is the
“correct” prediction?
Img src: Bishop, figure 1.2 (modified)
We usually aim at finding underlying structure in the task.
In practical cases, the training data is frequently noisy.
The lower ‘X’ is probably the better prediction.
The answer may also depend on our knowledge or assumptions of
the underlying data.
Foundations and Linear Methods 6
Polynomial Fitting
What about the general case in
which the input x does not
match any training sample?
Given our knowledge of the
data, the prediction should be
given by the red ‘X’.
Img src: Bishop, figure 1.2 (modified)
Foundations and Linear Methods 7
Polynomial Fitting
What about the general case in
which the input x does not
match any training sample?
Given our knowledge of the
data, the prediction should be
given by the red ‘X’.
Img src: Bishop, figure 1.2 (modified)
This is called generalization: the system should be able to make
reasonable predictions on new data, provided that this data is
“similar” to the training data.
Later on, we will formally define what “similarity” means.
For this prediction we have assumed knowledge about the structure
of the data.
It turns out that we always have to make some kind of such
assumption, otherwise generalization is impossible (No Free Lunch
Theorem).
Foundations and Linear Methods 7
Polynomial Fitting
Our goal is to approximate the underlying structure of the data.
For this purpose we make a model assumption: we describe the
relationship between input and target by a polynomial
M
X
y (x, w) = wj x j .
j=0
After fitting, we wish to use y as estimator for t.
We now need to fit the model to the input observations
{(xn , tn )}n=1,...,N by determining the coefficients w = {wj }j=0,...,M .
(We also need to choose the order M, but for now assume that M is
fixed.)
Note that for the given task, this particular model will never exactly
fit unseen data. (Why? Two reasons!)
Foundations and Linear Methods 8
Polynomial Fitting
We define the Mean Squared Error (MSE) as
1 X 1 X1
E (w) = en (w) = (y (xn , w) − tn )2
N n N n 2
where en (w) = 21 (y (xn , w) − tn )2 is the error for sample n.
The MSE is our loss function and our training criterion: We aim to
minimize it by choosing suitable parameters w.
We frequently do not write the explicit dependence of the error on w.
There are underlying reasons for choosing this error. For now, let us
just say that it is computationally easy. We include the factor 12 for
later convenience.
Foundations and Linear Methods 9
Solving the Fitting Task
To summarize:
M
1 X 1 X
E = en , en = (y (xn , w) − tn )2 , y (x, w) = wj x j = wT ϕ(x).
N n 2 j=0
T 0 1 M T
with ϕ(x) = (ϕ0 (x), . . . , ϕM (x)) = (x , x , . . . , x ) (useful later).
We compute the derivative of the error:
dE ∂E ∂E
= ,...,
dw ∂w0 ∂wM
with
∂E 1 X ∂en 1 X ∂y (xn , w) 1 X T
= = (y (xn , w)−tn )· = (w ϕ(xn )−tn )·ϕj (xn )
∂wj N n ∂wj N n ∂wj N n
and use matrix calculus to simplify:
!
dE 1 T
X T
X T
= w ϕ(xn )ϕ(xn ) − tn ϕ(xn ) .
dw N n n
Foundations and Linear Methods 10
Solving the Fitting Task
Using the design matrix
···
ϕ0 (x1 ) ϕ1 (x1 ) ϕM (x1 )
ϕ0 (x2 ) ϕ1 (x2 ) ··· ϕM (x2 )
Φ= .
.. .. ..
.. . . .
ϕ0 (xN ) ϕ1 (xN ) ··· ϕM (xN )
and the target vector T T = (t1 , . . . , tN )T , we can further simplify:
dE 1 T T
= w (Φ Φ) − T T Φ .
dw N
The fitting task is solved by minimizing the error, which we do by
setting the derivative of the error to zero:
! dE 1 T T
0= = w (Φ Φ) − T T Φ .
dw N
Foundations and Linear Methods 11
Solving the Fitting Task
1
Cancelling the factor N and transposing, the equation
! !
0 = wT ΦT Φ − T T Φ ⇔ 0 = ΦT Φw − ΦT T
is a linear system of (M + 1) equations for (M + 1) variables, so we
will assume that there is a unique solution. (Q: What is the
condition for a solution to exist?)
It can be seen easily that the solution is given by
wmin = (ΦT Φ)−1 ΦT T .
The matrix (ΦT Φ)−1 ΦT is called the Moore-Penrose Pseudo-Inverse
of the matrix Φ (Φ is in general is not a square matrix).
If Φ is square and invertible, the Moore-Penrose Pseudo-Inverse
coincides with Φ−1 (the proper inverse).
Foundations and Linear Methods 12
Polynomial Fitting
We have solved our first regression task!
Now we can use y (x, wmin ) to predict targets t from inputs x.
Key observation: we had to solve an equation which was nonlinear in
the inputs, but linear in the parameters, i.e. y depends linearly on w.
⇒ This allowed us to find a closed-form solution!
Unfortunately, finding solutions for nonlinear models is not so easy.
During this lecture we will get to know several ways to approximate
such solutions!
Also remember that this is the “best” solution in terms of the
criterion which we used, but it is not necessarily a perfect solution,
and there may be other solutions for other criteria!
Foundations and Linear Methods 13
Polynomial Fitting: Order
So how should we choose the order M of the fitting polynomial?
Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.
Foundations and Linear Methods 14
Polynomial Fitting: Order
So how should we choose the order M of the fitting polynomial?
Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.
Img src: Bishop, figure 1.4
Which one would you choose?
Foundations and Linear Methods 14
Polynomial Fitting: Order
So how should we choose the order M of the fitting polynomial?
Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.
Img src: Bishop, figure 1.4
Which one would you choose?
Clearly, the linear approximations with orders 0 and 1 do not well
represent the data.
These orders are too low for the task: these polynomials are not
expressive enough, they underfit the data.
Foundations and Linear Methods 14
Polynomial Fitting: Order
Img src: Bishop, figure 1.4
The polynomial of order 9 exactly fits
the training samples (why?), but does
not recover the underlying structure.
It overfits the training data, yielding
bad predictions for new data points.
One can also say that it has been
tuned to the random noise present in
the data.
Foundations and Linear Methods 15
Polynomial Fitting: Order
Img src: Bishop, figure 1.4
The polynomial of order 9 exactly fits Table: Coefficients of the fitted
polynomial
the training samples (why?), but does
not recover the underlying structure. Order M=1 M=3 M=9
It overfits the training data, yielding w0 0.82 0.31 0.35
w1 -1.27 7.99 232.37
bad predictions for new data points. w2 -25.43 -5321.83
w3 17.37 48568.31
One can also say that it has been w4 -231639.30
tuned to the random noise present in w5 640042.26
w6 -1061800.52
the data. w7 1042400.18
w8 -557682.99
The overfitted polynomial has very w9 125201.43
large coefficients.
Foundations and Linear Methods 15
Polynomial Fitting: Order
Img src: Bishop, figure 1.4
Within the set of models which we considered, the order 3
polynomial is optimal.
: Even though the training error is higher than for higher-order
polynomials.
: A perfect solution with a polynomial is not possible (since the “true”
shape of the data is not a polynomial, and since the data has noise).
Foundations and Linear Methods 16
Reducing Overfitting
Img src: Bishop, figure 1.4 and 1.6
Easiest way to reduce overfitting: Get more training data
Figures: estimated regression polynomial for M = 9 with
N = 10, 15, 100 data points
In the latter case, the curve does not overfit: better generalization
Don’t have enough data? Maybe you can artificially create it
: by injecting random noise to samples
: by domain-specific transformations (e.g. in image recognition, you
could slightly deform images without changing their content)
: relevant for all machine learning tasks (classification, regression, and
others), and for a variety of methods
Foundations and Linear Methods 17
Reducing Overfitting
Limiting the number of model parameters (in this case, polynomial
coefficients) helps against overfitting.
Parameters can also be constrained by penalizing their absolute
value.
In the case of linear regression, optimize a modified error function
with a suitable regularization term, e.g.:
N N
1 X1 λ 1 X1 λ
Ẽ (w) = (y (xn , w)−tn )2 + ||w||22 = (y (xn , w)−tn )2 + wT w
N n=1 2 2 N n=1 2 2
λ determines the relative weight of the regularization term.
This regularization is called L2 regularization.
The solution has a closed form (Ridge Regression):
wmin = (λI + ΦT Φ)−1 ΦT T .
Other forms of regularization exist and are regularly used.
Foundations and Linear Methods 18
Features
Important observation: The solution depends only on the features,
i.e. the elements of the design matrix:
ϕ0 (x1 ) ϕ1 (x1 ) ··· ϕM (x1 )
ϕ0 (x2 ) ϕ1 (x2 ) ··· ϕM (x2 )
Φ= . . .
..
.. . .
. . .
ϕ0 (xN ) ϕ1 (xN ) ··· ϕM (xN )
We had ϕj (x) = x j , but we see that can choose features ϕj (x)
arbitrarily, leading to fitting with arbitrary functions using the same
method as before.
The choice of features often depends on knowledge of the task.
In general, computing features amounts to preprocessing the data in
a way to make relevant information accessible, and to remove
irrelevant content.
Foundations and Linear Methods 19
Multivariate Regression / Curse of Dimensionality
Assume we have multiple input variables x (1) , x (2) , . . . , x (L) .
For polynomial fitting, we have to take into account all possible
products between variables, up to the desired order M:
X (ℓ) X (ℓ ,ℓ )
y (x, w) = w0 + w1 x (ℓ) + w2 1 2 x (ℓ1 ) x (ℓ2 ) +
ℓ ℓ1 ,ℓ2
ℓ1 ≤ℓ2
(ℓ ,ℓ2 ,...,ℓM ) (ℓ1 ) (ℓ2 )
X
+... + wM 1 x x · · · x (ℓM ) .
ℓ1 ,ℓ2 ,...,ℓM
ℓ1 ≤ℓ2 ≤...≤ℓM
Thus we need to estimate O(M L ) coefficients, which may be
inefficient and can cause severe overfitting.
⇒ This problem in high-dimensional input spaces is called the Curse of
Dimensionality: the number of model parameters rises exponentially
with the input dimension!
Foundations and Linear Methods 20
Multivariate Regression / Curse of Dimensionality
Fortunately, dimensionality problems can be tackled:
Real data often lies in a subspace of lower dimensionality
Example: An image of 100 · 100 pixels has 10000 dimensions. . . but
how many theoretically possible images show anything sensible?
⇒ Use dimensionality reduction techniques on the data.
⇒ Define a suitable set of features: ϕk (x1 , x2 , . . . , xL ), where the
number of features can be much smaller than LM .
⇒ In realistic tasks, features can be very complex, and feature
optimization plays a crucial role.
Neural networks can intrinsically deals with high-dimensional input
and often do not require sophisticated features.
Multidimensional output: less of a problem, can estimate output variables
independently (but versatile algorithms estimate them jointly).
Foundations and Linear Methods 21
Examples of Features
Features play a central role in almost all machine learning
applications:
: they reduce the dimensionality of the data
: they make hidden information accessible
: they suppress irrelevant information.
Example: The speech spectrogram, computed by taking the Fourier
transformation of short speech units (frames)
Img src: https://www.opentextbooks.org.hk/ditatopic/9759
Foundations and Linear Methods 22
The Linear Regression Model
Step by step, we have gotten to know a technique which is called
Linear Regression.
Idea: model the relationship between input features and targets as a
linear equation:
T = Wϕ + ϵ,
where the vector ϵ is the error term (also called noise).
We aim to minimize the noise. If we use the MSE loss to measure
the error, we have already derived the closed-form solution for this
task, including a regularization term if desired.
However, other strategies for solving linear regression are possible.
The features are predefined and could be generic (like polynomial
coefficients), or application-specific. Note that the features do not
need to depend linearly on the raw input data.
Foundations and Linear Methods 23
Regression: Summary
You have learned about
fitting a regression model which is linear in its parameters: an exact
solution
examples of overfitting and underfitting, regularization
the role of features
issues of dimensionality.
Foundations and Linear Methods 24
Regression: Summary
We summarize the following important definitions:
generalization: the capability of a system to adapt to novel, unseen
data
underfitting: when the model does not reflect the structure of the
training data
overfitting: when the model has learned the structure of the training
data very well, but fails at generalizing to novel test data
regularization: changing the training criterion of a system to
improve generalization.
Foundations and Linear Methods 25
Introduction to Classification
Classification
Classification: Assign categorical
values to input data points.
1
Here we have two classes, e.g.
0= ,1= . ϕ1
Multiple classes: often use 1-of-K (or 0
one-hot) coding:
: (1, 0, 0, ...) for class 0,
: (0, 1, 0, ...) for class 1, etc. -1
As before, we assume we have some
training data and some test data. -1 0 ϕ0 1
Have a look at the training data given
for a two-class problem with two
features ϕ0 and ϕ1 .
Foundations and Linear Methods 27
Classification
How would you assign classes to the ‘?’
test data points?
1
If in some cases you find the answer ?
“obvious”, think a second why. ϕ1
0 ?
?
?
-1 ?
-1 0 ϕ0 1
Foundations and Linear Methods 28
Classification
How would you assign classes to the ‘?’
test data points?
1
Two easy cases: follow class ?
assignments of surrounding data ϕ1
point(s) 0 ?
?
?
-1 ?
-1 0 ϕ0 1
Foundations and Linear Methods 28
Classification
How would you assign classes to the ‘?’
test data points?
1
This one is closer to the class, but ?
note that the are much more ϕ1
compact than the . Could be 0 ?
ambiguous.
?
?
-1 ?
-1 0 ϕ0 1
Foundations and Linear Methods 28
Classification
How would you assign classes to the ‘?’
test data points?
1
This one is far away from any of the ?
classes. Probably should be a , but ϕ1
realistically, the training data does not 0 ?
prepare us well for classifying this
?
particular sample. ?
-1 ?
-1 0 ϕ0 1
Foundations and Linear Methods 28
Classification
How would you assign classes to the ‘?’
test data points?
1
This one is close to a , which seems ?
however an outlier - a nontypical ϕ1
example, maybe due to data noise. 0 ?
Remember that data is never perfect, ?
and that ambiguities are quite normal! ?
-1 ?
Even if it is not an outlier, we can
intuitively say that the “probability” of
seems higher than . -1 0 ϕ0 1
Probabilistic modeling will allow us to
formalize this idea.
⇒ In this (simple) case, we can solve the
problem by considering multiple
neighbors (not just the closest one).
Foundations and Linear Methods 28
K-Nearest-Neighbors classifier
Define the simple, nonparametric k-Nearest-Neighbors (kNN) classifier:
for a test sample x, consider the k nearest training samples (for
fixed k)
determine the class assignment for x by “majority vote” (i.e. it
corresponds to the class of the majority of the k nearest training
samples)
one could also weigh the influence of the k nearest neighbors (e.g.
by distance)
The algorithm is called nonparametric because works directly with the
data, as opposed to parametric methods like linear regression which are
based on learning parameters of a model.
What do you think are the problems of this approach?
Foundations and Linear Methods 31
K-Nearest-Neighbors classifier - Problems
Here are some major problems of the kNN classifier:
requires to store all training data, and to compute distances between
test sample and all training samples
does not discover structure (e.g. shape, “compactness” of the
classes)
very sensitive to outliers
problematic in high-dimensional feature spaces (Curse of
Dimensionality: samples do not cover the space well)
difficult to define suitable metric of closeness.
Foundations and Linear Methods 32
K-Nearest-Neighbors classifier - Problems
Consider a k-nearest-neighbors classifier with which we want to
distinguish a swim tyre and a donut.
Images are represented at 250x250 pixels, 3 colors, ranging from
0..255.
Thus, 250x250x3 = 187500 features in raw pixel space.
Foundations and Linear Methods 33
K-Nearest-Neighbors classifier - Problems
Absolute difference (=distance) between swim tyre and donut, in
raw pixel space: average ≈ 45.
Absolute difference between rescaled donut and shifted identical
donut, in raw pixel space: average ≈ 146.
This cannot be a good way to perform (image) classification.
Foundations and Linear Methods 34
K-Nearest-Neighbors classifier - Problems
In practice, one would define suitable features (search internet for
HOG, SIFT, etc. if you are interested), and one would normalize
(e.g. center) the images.
But what about rotation, scaling, photos from different angles . . . ?
Clearly, complex realistic cases cannot be solved this way. Need a
much more versatile approach.
We will get to know such approaches in this lecture. They are
always based on some kind of model, i.e. they are parametric.
Foundations and Linear Methods 35
The Linear Classifier
Parametric classifier: structure of data gets summarized by a set of
parameters of (usually) fixed size, independent of the number of
training samples
Two fundamental concepts:
: Discriminant function: A function (with trainable parameters) which
assigns a class to a test sample
C = f (ϕ)
: Probabilistic modeling: A function models the probabilities of a test
sample belonging to any class
p(Ck |ϕ) = f (ϕ, Ck )
It is easy to compute the discriminant function from a probabilistic
model (take the class whose probability is highest).
Foundations and Linear Methods 36
The Linear Classifier
Consider the example below (the outlier has been removed).
Most simple discriminant function: a hyperplane in feature space.
⇒ In the two-dimensional case: a straight line.
1
ϕ1
0
-1
-1 0 ϕ0 1
Foundations and Linear Methods 37
The Linear Classifier
Consider the example below (the outlier has been removed).
Most simple discriminant function: a hyperplane in feature space.
⇒ In the two-dimensional case: a straight line.
In the given example, many
solutions are possible.
: We will see later how to
1 choose a good one.
ϕ1 : We will see later what to
do if there is no perfect
0 solution.
-1
-1 0 ϕ0 1
Foundations and Linear Methods 37
The Linear Classifier
The discriminant function takes the form
(
T C0 , if y (ϕ) ≥ 0
y (ϕ) = w ϕ + w0 with C(ϕ) =
C1 , if y (ϕ) < 0
w is called a weight vector, w0 is
the bias. The hyperplane with the
equation wT ϕ + w0 = 0 is called
1
decision boundary, it separates the
ϕ1 space into decision regions.
0
-1
-1 0 ϕ0 1
Foundations and Linear Methods 38
The Linear Classifier
The discriminant function takes the form
(
T C0 , if y (ϕ) ≥ 0
y (ϕ) = w ϕ + w0 with C(ϕ) =
C1 , if y (ϕ) < 0
Note that w is orthogonal to the
decision boundary. The distance of
any point x to the decision bound-
1 y (x)
ary is ||w|| , in particular, the dis-
ϕ1 tance of the decision boundary to
|w0 |
0 the origin (0,0) is ||w|| .
-1 w
-1 0 ϕ0 1
Foundations and Linear Methods 38
The Linear Classifier - Observations
We now have a mathematical formulation for the classification
problem: the decision regions are separated by a hyperplane
parametrized by w and w0 .
: We will get to know different ways to choose w and w0 .
The decision regions of the kNN classifier have a much more
complex shape!
The parametrized model is tendentially robust against outliers.
Foundations and Linear Methods 39
The Linear Classifier - Observations
We now have a mathematical formulation for the classification
problem: the decision regions are separated by a hyperplane
parametrized by w and w0 .
: We will get to know different ways to choose w and w0 .
The decision regions of the kNN classifier have a much more
complex shape!
The parametrized model is tendentially robust against outliers.
This comes at the price of lower flexibility: This classifier is prone to
drastic underfitting.
We could get a better fit by choosing a decision boundary described
by a polynomial
⇒ just like for regression, we can instead define suitable nonlinear
features, so that the classifier remains linear in its parameters.
Foundations and Linear Methods 39
Multiple Classes
How to extend linear classification to K > 2 classes?
One-vs-rest classification (left figure): use K classifiers, each of
which distinguishes points in class k from points not in class k.
One-vs-one classification (right figure): use K (K − 1)/2 classifiers,
one for each pair of classes.
Each of these methods leads to ambiguous areas (shaded green).
Img src: Bishop, figure 4.2
Foundations and Linear Methods 40
Multiple Classes
How to extend linear classification to K > 2 classes?
Better solution: Use K linear functions of the form
yk (ϕ) = wkT ϕ + wk0
and assign a data point ϕ(x) to class Ck if
yk (ϕ(x)) > yj (ϕ(x))
for all j ̸= k.
This method avoids ambiguous regions, all decision regions are
singly connected and convex.
Foundations and Linear Methods 41
Logistic Regression
We present one specific way to define a linear classifier.
We would like to use the ideas we have developed for linear
regression to perform classification.
Assume that our training data consists of observations
{(ϕn , tn )}n=1,...,N , where tn only takes the values 0 and 1. Train a
regressor on this data.
While performing “regression for classification” in this way is
technically possible, it leads to issues:
: What does it mean if the estimate of t for an input data point is not
equal 0 or 1?
: Assume a data point belonging to class 1. The MSE loss will be the
same regardless of whether the estimate of t is 0.6 or 1.4 – do you
think this makes sense?
Clearly, we need to deal with the regression output in a different way.
Foundations and Linear Methods 42
Logistic Regression
We bridge the gap from regression to classification by imposing a
simple probabilistic model.
Idea: We assume that each data point has a certain probability of
belonging to class 0 or 1:
p(C1 |ϕ) = y (ϕ)
p(C0 |ϕ) = 1 − y (ϕ)
From the probabilistic model, we derive the discriminant function:
C(ϕ) = 1 if p(C1 |ϕ) = y (ϕ) ≥ 0.5
C(ϕ) = 0 if p(C1 |ϕ) = y (ϕ) < 0.5.
With this model we can also express degrees of uncertainty about
classification results!
Foundations and Linear Methods 43
Logistic Regression
We derive the function y from Linear Regression.
We make the output of regression probabilistic by passing it through
a “squeezing function” which normalizes the real-valued output to
the range [0.0, 1.0].
The classical choice of the squeezing function is the logistic function
or logistic sigmoid
1
σ(x) = .
1 + e −x
Image source: Wikipedia, Sigmoid Function
Foundations and Linear Methods 44
Logistic Regression
We define
ỹ (ϕ) = wT ϕ + w0
y (ϕ) = σ(ỹ (ϕ)).
Taking y = y (ϕ) = y (ϕ, w, w0 ) as the probability that a sample ϕ
belongs to class 1, we see that
: greater absolute values of y (ϕ) imply less uncertainty about the
classification of the sample ϕ,
: the decision boundary between the classes is given by the hyperplane
ỹ (ϕ) = 0.
This model is called Logistic Regression.
Foundations and Linear Methods 45
Logistic Regression
Consider the following example task.
Note that the classes overlap.
1
ϕ1
0
-1
-1 0 ϕ0 1
Foundations and Linear Methods 46
Logistic Regression
Consider the following example task.
Note that the classes overlap.
This is the decision boundary which 1
the model finds.
ϕ1
0
-1
-1 0 ϕ0 1
Foundations and Linear Methods 46
Logistic Regression
Consider the following example task.
Note that the classes overlap.
This is the decision boundary which 1
the model finds.
ϕ1
Items which are more distant from the
decision boundary are assigned to their 0
respective classes with higher
probability.
-1
-1 0 ϕ0 1
Foundations and Linear Methods 46
Logistic Regression
Consider the following example task.
Note that the classes overlap.
This is the decision boundary which 1
the model finds.
ϕ1
Items which are more distant from the
decision boundary are assigned to their 0
respective classes with higher
probability.
-1
Clearly, the model does not classify the
data perfectly.
-1 0 ϕ0 1
Foundations and Linear Methods 46
Logistic Regression
We need a training criterion for logistic regression.
This criterion will be based on probabilities, just like the model itself.
We define the likelihood of a single training sample (ϕ, t) as its
probability under the logistic regression model, as a function of the
regression parameters w and w0 :
(
y if t = 1
Lϕ (w, w0 ) = = y t (1 − y )1−t
1 − y if t = 0
where y = y (ϕ, w, w0 ).
Assuming that the training samples are statistically independent, the
likelihood of the training data {(ϕn , tn )}n=1,...,N is
N
Y
L(w, w0 ) = yntn (1 − yn )1−tn .
n=1
Foundations and Linear Methods 47
Logistic Regression
It is computationally simpler to work in logarithmic space, thus we
compute the log-likelihood
N
X
L̃(w, w0 ) = tn ln yn + (1 − tn ) ln(1 − yn ).
n=1
Note that always L̃(w, w0 ) ≤ 0 (why?).
We now define an error function as the negative log-likelihood
N
X
E (w, w0 ) = −L̃(w, w0 ) = − tn ln yn + (1 − tn ) ln(1 − yn ).
n=1
Our training criterion will be to minimize E , which is the same as
maximizing L or L̃: the Maximum Likelihood criterion.
Foundations and Linear Methods 48
Logistic Regression
Unfortunately, in contrast to “normal” linear regression, there is no
closed-form solution to determine the parameter vector w and the
bias w0 for logistic regression.
We will get to know a general way to solve such tasks during this
course.
During our lecture on probabilistic modeling, we will also reconsider
the assumptions which underly Logistic Regression, and learn that
under certain conditions it arises very naturally from imposing a
basic probabilistic model.
Foundations and Linear Methods 49
Logistic Regression
Unfortunately, in contrast to “normal” linear regression, there is no
closed-form solution to determine the parameter vector w and the
bias w0 for logistic regression.
We will get to know a general way to solve such tasks during this
course.
During our lecture on probabilistic modeling, we will also reconsider
the assumptions which underly Logistic Regression, and learn that
under certain conditions it arises very naturally from imposing a
basic probabilistic model.
If there is a decision boundary hyperplane which perfectly separates
C0 and C1 , the classes are said to be separable.
In this case, the model is prone to overfitting : It will estimate the
assignment probabilities badly.
Even data points very close to the decision boundary will be
assigned with high probability to their respective class.
Again, probabilistic modeling offers a way to resolve this problem.
Foundations and Linear Methods 49
Classification: Summary
You have learned about
nonparametric and parametric classification
discriminant functions
representation of a linear decision boundary in feature space
Logistic regression as an example of a linear classifier.
Foundations and Linear Methods 50
Elements of the Foundations of Machine
Learning
Foundations of Machine Learning
We have gotten to know some elementary examples of Machine
Learning algorithms.
Before going on, we introduce some basic concepts which will help
us to formalize what we have seen so far, and what we will get to
know in the future.
We start with the most basic question: How to formulate
mathematically what learning means?
You have already seen that it does not mean to memorize the
training samples!
Foundations and Linear Methods 52
A Framework for Machine Learning
Assume that our data is described by a probability distribution
P(x, t) over inputs and targets, where we assume both inputs and
targets to be real-valued vectors: x ∈ RM , t ∈ RN .
This key idea allows to express, in a mathematically sound way
: the variability of data (even for a fixed class)
: the ambigousness of the mapping between input data and target
: and also our own lack of knowledge about this mapping!
It also gives us a powerful set of tools to create practical algorithms.
The probability calculus is at the heart of many important machine
learning algorithms!
Bousquet: Introduction to Statistical Learning Theory. Machine Learning 2003, LNAI 3176, pp. 169–207, 2004.
Foundations and Linear Methods 53
A Framework for Machine Learning
Assume a probability distribution P(x, t) over inputs and targets.
Let y (x) be a predictor for t. Assuming any loss L(t, y (x)), define
the risk as the expected loss over the entire probability space:
Z
R(y ) = EP(x,t) L(t, y (x)) = L(t, y (x))dP(x, t).
The goal of Machine Learning is now to find the predictor of t given
x which minimizes the risk:
y ∗ = arg min R(y ) = arg min EP(x,t) L(t, y (x)).
y :RM →RN y :RM →RN
Note that by using this probabilistic framework, we avoid referring to
specific data sets (like train and test dataset)!
Vapnik: Principles of Risk Minimization for Learning Theory. Proc. NIPS 1991.
Foundations and Linear Methods 54
The Irreducible Error
Assume for a moment that we know the distribution P(x, t).
In some cases, we can directly obtain y ∗ , for example in the case of
classification:
y ∗ (x) = arg max P(x, t)
t
Q: Will the risk, i.e. the expected loss, be zero for the predictor y ∗ ?
Foundations and Linear Methods 55
The Irreducible Error
Assume for a moment that we know the distribution P(x, t).
In some cases, we can directly obtain y ∗ , for example in the case of
classification:
y ∗ (x) = arg max P(x, t)
t
Q: Will the risk, i.e. the expected loss, be zero for the predictor y ∗ ?
No, because t might not be deterministic for a given x.
: Only if x completely determines t, the risk of the optimal predictor is
zero.
: Mathematically, this means that for all x, P(t|x) = 1 for exactly one
tTrue (x) and P(t|x) = 0 for all other t.
Foundations and Linear Methods 55
The Irreducible Error
The prediction error which stems from the fact that the input does
not completely determine the target is called the irreducible error.
The irreducible error can be considered a form of noise affecting the
inputs, the targets, or both. It is independent of any concrete
method which we use for the prediction of targets.
In practice, such noise can derive from inexact measurements of
data, from unaccounted sources of variation, and many other factors.
In the case of classification, we (usually) map a continuous input to
discrete output, so there will be border regions with unclear class
assignments.
As an example, consider a subset of the famous MNIST corpus of
handwritten digits. Do you think all these examples are clearly the
digit “7”?
Foundations and Linear Methods 56
A Framework for Machine Learning
Clearly, we do not know the true distribution of the data P(x, t).
How can we estimate (and subsequently minimize) the risk?
Foundations and Linear Methods 57
A Framework for Machine Learning
Clearly, we do not know the true distribution of the data P(x, t).
How can we estimate (and subsequently minimize) the risk?
We are given a finite set of observations D = {(xn , tn )}n=1,...,N .
Idea: Approximate expectations over P by averaging over the data,
specifically:
N
1 X
R(y ) = EP(x,t) L(t, y (x)) ≈ L(tn , y (xn )) =: Remp (y ).
N n=1
Remp (y ) is called empirical risk. The observations D are usually
called training data.
Foundations and Linear Methods 57
A Framework for Machine Learning
The entire field of Machine Learning deals with minimizing the true
risk, based on estimations using the empirical risk.
We will get to know a large variety of algorithms and methods where
we give the predictor y (x) a specific mathematical form, allowing
minimization of the empirical risk.
However, does the empirical risk always reflect the true risk?
Foundations and Linear Methods 58
A Framework for Machine Learning
The entire field of Machine Learning deals with minimizing the true
risk, based on estimations using the empirical risk.
We will get to know a large variety of algorithms and methods where
we give the predictor y (x) a specific mathematical form, allowing
minimization of the empirical risk.
However, does the empirical risk always reflect the true risk?
In general, no!
: Remember that we allow any function from the input space to the
target space as a predictor.
: For example, a function which correctly maps all training data points
and is random on all other points achieves zero empirical risk, but
high true risk.
We want to avoid such pathological cases. Which functions are
“reasonable” as predictors?
Foundations and Linear Methods 58
A Framework for Machine Learning
A function like the one on the last slide has a high complexity,
whereas a structured representation of the data should have a much
simpler form.
Thus, we restrict the predictors y (x) which we take into account to
a subset of the functions from RM to RN : the hypothesis class H.
This generalizes and formalizes the concept of regularization which
we covered earlier!
We hope that this helps us to obtain a predictor whose empirical risk
(on the training data) reflects the true risk (on unseen test data).
Clearly, this is not automatically true. . .
Foundations and Linear Methods 59
A Framework for Machine Learning
We now define the optimal predictor within the hypothesis class H:
∗
yH = arg min R(y ) = arg min EP(x,t) L(t, y (x))
y ∈H y ∈H
as well as the empirically optimal predictor given a dataset
D = {(xn , tn )}n=1,...,N :
N
emp 1 X
yH = arg min Remp (y ) = arg min L(tn , y (xn )).
y ∈H y ∈H N n=1
In almost all cases, exactly computing the optimal predictor in H is
intractable, and it needs to be approximated.
All Machine Learning algorithms aim at optimally defining the
hypothesis class H, such that efficient optimization over a wide
range of functions is possible, while retaining generalization ability.
Foundations and Linear Methods 60
A Framework for Machine Learning
Based on the probabilistic framework, many bounds on the
difference between empirical and true risk have been proposed,
typically of the form
r !
cap(H)
∀y ∈ H R(y ) − Remp (y ) < O
N
where N is the number of samples, and cap(H) measures the
“capacity” or “richness” of the function class H (not covered in this
course).
Note that the true risk is typically assumed to be greater than the
empirical risk.
We see that it is necessary to balance the capacity of the hypothesis
class and the number of data samples:
: If there are not enough elements in H, even the optimal predictor
might not be very good.
: If there are too many elements in H, the empirically best predictor
might be far away from the true optimal predictor.
Foundations and Linear Methods 61
Definitions
A predictor whose true risk is close to the empirical risk on the
training data is said to generalize well.
A predictor whose empirical risk is low, but whose true risk is high,
is said to overfit to the training data.
A predictor whose empirical risk is high is said to underfit the data.
Which of the predictors in the figure overfit, and which ones
underfit?
Img src: Bishop, figure 1.4
Foundations and Linear Methods 62
Using Data Sets
We tune our Machine Learning system on the training data,
obtaining the empirical risk. But how can we now estimate the true
risk?
This requires a separate dataset which is not used for tuning the
system, but only for measuring the loss after training: the test data.
In practical scenarios, even this might not guarantee a good estimate
of the true risk: If we perform system tuning many times with
slightly different metaparameters, always reusing the same test data,
we implicitly tune the system to this test data.
Good practice requires to use three different data sets:
: The training data is used for tuning the system.
: The validation data is used to determine the quality of the trained
system and to guide further optimization steps.
: The test data is used only when the system is completely tuned to
determine the final quality (e.g. for writing a final report or similar).
Foundations and Linear Methods 63
Foundations: Summary
In this section, you have gotten to know
a mathematical formulation of the task of Machine Learning
generalization, overfitting, and underfitting revisited
splitting your data set to obtain statistically valid results.
Foundations and Linear Methods 64