
Foundations and Linear Methods

Machine Learning
Michael Wand
TA: Eric Alcaide

{michael.wand, eric.alcaide}@idsia.ch

Dalle Molle Institute for Artificial Intelligence Studies (IDSIA) USI - SUPSI

Fall Semester 2024


Contents

In this lecture we will cover the two basic tasks of supervised learning, as
well as some fundamental aspects. Our schedule is as follows:
Introduction to Regression
Introduction to Classification
Elements of the Foundations of Machine Learning
Besides introducing elementary examples of algorithms, we will also get
to know many basic concepts and definitions which we will frequently
encounter later on!
Attribution: Many figures are taken from Bishop's Pattern Recognition and Machine Learning textbook, https://www.microsoft.com/en-us/research/people/cmbishop/prml-book.
Also highly recommended as a reference!

Foundations and Linear Methods 2


Introduction to Regression
Polynomial Fitting

We are given a training set of observations D = {d_n}_{n=1,...,N} = {(x_n, t_n)}_{n=1,...,N} (represented as blue circles).
Task: Predict the value of the target variable t from the input
variable x, for an unknown input x (blue ‘?’).
For now, both input and target are one-dimensional.
Since we have observations which include the target, this is a
supervised task.
It is also a regression task (real-valued output).

Img src: Bishop, figure 1.2 (modified)

Foundations and Linear Methods 4


Polynomial Fitting

Spoiler: the training data was generated from a sine wave with
added (Gaussian) noise.
? Is there a “perfect” solution to our prediction problem?
? Can we find it, based on the available data? Can we approximate it?
? What do we expect from our solution?

Img src: Bishop, figure 1.2 (modified)

Foundations and Linear Methods 5



Polynomial Fitting

Consider the example: the input x exactly matches one of the training samples.
Which of the two ‘X’ is the
“correct” prediction?
Img src: Bishop, figure 1.2 (modified)

We usually aim at finding underlying structure in the task.


In practical cases, the training data is frequently noisy.
The lower ‘X’ is probably the better prediction.
The answer may also depend on our knowledge or assumptions of
the underlying data.

Foundations and Linear Methods 6



Polynomial Fitting

What about the general case in which the input x does not match any training sample?
Given our knowledge of the
data, the prediction should be
given by the red ‘X’.
Img src: Bishop, figure 1.2 (modified)

This is called generalization: the system should be able to make reasonable predictions on new data, provided that this data is "similar" to the training data.
Later on, we will formally define what “similarity” means.
For this prediction we have assumed knowledge about the structure
of the data.
It turns out that we always have to make some kind of such
assumption, otherwise generalization is impossible (No Free Lunch
Theorem).
Foundations and Linear Methods 7
Polynomial Fitting

Our goal is to approximate the underlying structure of the data.


For this purpose we make a model assumption: we describe the
relationship between input and target by a polynomial

    y(x, w) = Σ_{j=0}^{M} w_j x^j.

After fitting, we wish to use y as estimator for t.


We now need to fit the model to the observations {(x_n, t_n)}_{n=1,...,N} by determining the coefficients w = {w_j}_{j=0,...,M}.
(We also need to choose the order M, but for now assume that M is
fixed.)
Note that for the given task, this particular model will never exactly
fit unseen data. (Why? Two reasons!)

Foundations and Linear Methods 8


Polynomial Fitting

We define the Mean Squared Error (MSE) as

    E(w) = (1/N) Σ_n e_n(w) = (1/N) Σ_n (1/2) (y(x_n, w) − t_n)²

where e_n(w) = (1/2) (y(x_n, w) − t_n)² is the error for sample n.


The MSE is our loss function and our training criterion: We aim to
minimize it by choosing suitable parameters w.
We frequently do not write the explicit dependence of the error on w.
There are underlying reasons for choosing this error. For now, let us just say that it is computationally easy. We include the factor 1/2 for later convenience.
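As a small illustration, here is a hedged NumPy sketch of this loss for the polynomial model (illustrative only; the function and variable names are my own, not from the slides):

```python
import numpy as np

def mse_loss(w, x, t):
    """Average of the per-sample errors e_n = 0.5 * (y(x_n, w) - t_n)^2
    for the polynomial model y(x, w) = sum_j w_j * x^j."""
    # np.vander with increasing=True gives columns x^0, x^1, ..., x^M
    Phi = np.vander(x, N=len(w), increasing=True)
    y = Phi @ w
    return np.mean(0.5 * (y - t) ** 2)
```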

Foundations and Linear Methods 9


Solving the Fitting Task

To summarize:
    E = (1/N) Σ_n e_n,   e_n = (1/2) (y(x_n, w) − t_n)²,   y(x, w) = Σ_{j=0}^{M} w_j x^j = w^T ϕ(x),

with ϕ(x) = (ϕ_0(x), . . . , ϕ_M(x))^T = (x^0, x^1, . . . , x^M)^T (useful later).

We compute the derivative of the error:

    dE/dw = ( ∂E/∂w_0 , . . . , ∂E/∂w_M )

with

    ∂E/∂w_j = (1/N) Σ_n ∂e_n/∂w_j = (1/N) Σ_n (y(x_n, w) − t_n) · ∂y(x_n, w)/∂w_j = (1/N) Σ_n (w^T ϕ(x_n) − t_n) · ϕ_j(x_n)

and use matrix calculus to simplify:

    dE/dw = (1/N) ( w^T Σ_n ϕ(x_n) ϕ(x_n)^T − Σ_n t_n ϕ(x_n)^T ).

Foundations and Linear Methods 10


Solving the Fitting Task

Using the design matrix

    Φ = ( ϕ_0(x_1)   ϕ_1(x_1)   · · ·   ϕ_M(x_1) )
        ( ϕ_0(x_2)   ϕ_1(x_2)   · · ·   ϕ_M(x_2) )
        (    ...        ...      ...       ...   )
        ( ϕ_0(x_N)   ϕ_1(x_N)   · · ·   ϕ_M(x_N) )

and the target vector T = (t_1, . . . , t_N)^T, we can further simplify:

    dE/dw = (1/N) ( w^T (Φ^T Φ) − T^T Φ ).

The fitting task is solved by minimizing the error, which we do by setting the derivative of the error to zero:

    0 = dE/dw = (1/N) ( w^T (Φ^T Φ) − T^T Φ ).

Foundations and Linear Methods 11


Solving the Fitting Task

Cancelling the factor 1/N and transposing, the equation

    0 = w^T Φ^T Φ − T^T Φ   ⇔   0 = Φ^T Φ w − Φ^T T

is a linear system of (M + 1) equations for (M + 1) variables, so we will assume that there is a unique solution. (Q: What is the condition for a solution to exist?)
It can be seen easily that the solution is given by

    w_min = (Φ^T Φ)^{−1} Φ^T T.

The matrix (Φ^T Φ)^{−1} Φ^T is called the Moore-Penrose Pseudo-Inverse of the matrix Φ (Φ is in general not a square matrix).
If Φ is square and invertible, the Moore-Penrose Pseudo-Inverse coincides with Φ^{−1} (the proper inverse).
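As a concrete illustration, here is a minimal NumPy sketch of this closed-form fit (a sketch under the assumptions of this slide; the synthetic data mirrors the sine-plus-noise running example, and all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: sine wave plus Gaussian noise, as in the running example
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix Phi with phi_j(x) = x^j, shape (N, M + 1)
Phi = np.vander(x, N=M + 1, increasing=True)

# Closed-form solution w_min = (Phi^T Phi)^{-1} Phi^T t;
# np.linalg.pinv computes the Moore-Penrose pseudo-inverse
w_min = np.linalg.pinv(Phi) @ t

# Predict the target for a new input x
x_new = 0.35
y_new = np.vander([x_new], N=M + 1, increasing=True) @ w_min
```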

Foundations and Linear Methods 12


Polynomial Fitting

We have solved our first regression task!


Now we can use y (x, wmin ) to predict targets t from inputs x.
Key observation: we had to solve an equation which was nonlinear in
the inputs, but linear in the parameters, i.e. y depends linearly on w.
⇒ This allowed us to find a closed-form solution!
Unfortunately, finding solutions for nonlinear models is not so easy.
During this lecture we will get to know several ways to approximate
such solutions!
Also remember that this is the “best” solution in terms of the
criterion which we used, but it is not necessarily a perfect solution,
and there may be other solutions for other criteria!

Foundations and Linear Methods 13



Polynomial Fitting: Order

So how should we choose the order M of the fitting polynomial?


Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.

Img src: Bishop, figure 1.4

Which one would you choose?

Clearly, the linear approximations with orders 0 and 1 do not represent the data well.
These orders are too low for the task: these polynomials are not expressive enough; they underfit the data.
Foundations and Linear Methods 14

Polynomial Fitting: Order

Img src: Bishop, figure 1.4

The polynomial of order 9 exactly fits the training samples (why?), but does not recover the underlying structure.
It overfits the training data, yielding bad predictions for new data points.
One can also say that it has been tuned to the random noise present in the data.
The overfitted polynomial has very large coefficients.

Table: Coefficients of the fitted polynomial

Order    M=1      M=3          M=9
w0        0.82     0.31           0.35
w1       -1.27     7.99         232.37
w2                -25.43       -5321.83
w3                 17.37       48568.31
w4                           -231639.30
w5                            640042.26
w6                          -1061800.52
w7                           1042400.18
w8                           -557682.99
w9                            125201.43
Foundations and Linear Methods 15
Polynomial Fitting: Order

Img src: Bishop, figure 1.4

Within the set of models which we considered, the order-3 polynomial is optimal.
: Even though the training error is higher than for higher-order
polynomials.
: A perfect solution with a polynomial is not possible (since the “true”
shape of the data is not a polynomial, and since the data has noise).

Foundations and Linear Methods 16


Reducing Overfitting

Img src: Bishop, figure 1.4 and 1.6

Easiest way to reduce overfitting: Get more training data


Figures: estimated regression polynomial for M = 9 with
N = 10, 15, 100 data points
In the latter case, the curve does not overfit: better generalization
Don’t have enough data? Maybe you can artificially create it
: by injecting random noise into the samples (see the sketch after this list)
: by domain-specific transformations (e.g. in image recognition, you
could slightly deform images without changing their content)
: relevant for all machine learning tasks (classification, regression, and
others), and for a variety of methods
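A hedged sketch of noise-based augmentation for the regression setting above (purely illustrative: the noise scales and the number of copies are arbitrary choices, not values from the slides):

```python
import numpy as np

def augment_with_noise(x, t, copies=5, x_noise=0.01, t_noise=0.05, seed=0):
    """Create additional training pairs by jittering existing samples.

    Each original (x_n, t_n) is duplicated `copies` times with small
    Gaussian perturbations added to the inputs and targets."""
    rng = np.random.default_rng(seed)
    x_aug = np.concatenate([x + rng.normal(scale=x_noise, size=x.shape)
                            for _ in range(copies)])
    t_aug = np.concatenate([t + rng.normal(scale=t_noise, size=t.shape)
                            for _ in range(copies)])
    # Keep the original samples as well
    return np.concatenate([x, x_aug]), np.concatenate([t, t_aug])
```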
Foundations and Linear Methods 17
Reducing Overfitting

Limiting the number of model parameters (in this case, polynomial coefficients) helps against overfitting.
Parameters can also be constrained by penalizing their absolute
value.
In the case of linear regression, optimize a modified error function
with a suitable regularization term, e.g.:
    Ẽ(w) = (1/N) Σ_{n=1}^{N} (1/2) (y(x_n, w) − t_n)² + (λ/2) ||w||₂²  =  (1/N) Σ_{n=1}^{N} (1/2) (y(x_n, w) − t_n)² + (λ/2) w^T w

λ determines the relative weight of the regularization term.


This regularization is called L2 regularization.
The solution has a closed form (Ridge Regression):

    w_min = (λI + Φ^T Φ)^{−1} Φ^T T.

Other forms of regularization exist and are regularly used.
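A minimal NumPy sketch of this regularized closed form (assumes a design matrix Phi and target vector t as in the earlier fitting sketch; the value of lambda is a free hyperparameter):

```python
import numpy as np

def ridge_fit(Phi, t, lam=1e-3):
    """Closed-form Ridge Regression: w_min = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    num_coeffs = Phi.shape[1]                  # M + 1 coefficients
    A = lam * np.eye(num_coeffs) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)       # more stable than forming the inverse explicitly
```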


Foundations and Linear Methods 18
Features

Important observation: The solution depends only on the features, i.e. the elements of the design matrix:

    Φ = ( ϕ_0(x_1)   ϕ_1(x_1)   · · ·   ϕ_M(x_1) )
        ( ϕ_0(x_2)   ϕ_1(x_2)   · · ·   ϕ_M(x_2) )
        (    ...        ...      ...       ...   )
        ( ϕ_0(x_N)   ϕ_1(x_N)   · · ·   ϕ_M(x_N) )

We had ϕ_j(x) = x^j, but we see that we can choose the features ϕ_j(x) arbitrarily, leading to fitting with arbitrary functions using the same method as before.
The choice of features often depends on knowledge of the task.
In general, computing features amounts to preprocessing the data in
a way to make relevant information accessible, and to remove
irrelevant content.
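For instance, a hedged sketch of the same closed-form fit with a hand-picked, non-polynomial feature set (the particular features below are an arbitrary illustrative choice, not prescribed by the slides):

```python
import numpy as np

def design_matrix(x, features):
    """Stack arbitrary feature functions phi_j(x) into a design matrix of shape (N, num_features)."""
    return np.column_stack([phi(x) for phi in features])

# Example feature set: a bias term, a sine, and a cosine (illustrative only)
features = [lambda x: np.ones_like(x),
            lambda x: np.sin(2 * np.pi * x),
            lambda x: np.cos(2 * np.pi * x)]

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + np.random.default_rng(0).normal(scale=0.3, size=x.shape)

Phi = design_matrix(x, features)
w_min = np.linalg.pinv(Phi) @ t   # same pseudo-inverse solution as before
```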

Foundations and Linear Methods 19


Multivariate Regression / Curse of Dimensionality

Assume we have multiple input variables x^(1), x^(2), . . . , x^(L).

For polynomial fitting, we have to take into account all possible products between variables, up to the desired order M:

    y(x, w) = w_0 + Σ_ℓ w_1^(ℓ) x^(ℓ) + Σ_{ℓ1 ≤ ℓ2} w_2^(ℓ1, ℓ2) x^(ℓ1) x^(ℓ2) + . . . + Σ_{ℓ1 ≤ ℓ2 ≤ . . . ≤ ℓM} w_M^(ℓ1, ℓ2, . . ., ℓM) x^(ℓ1) x^(ℓ2) · · · x^(ℓM).

Thus we need to estimate O(M^L) coefficients, which may be inefficient and can cause severe overfitting.
⇒ This problem in high-dimensional input spaces is called the Curse of
Dimensionality: the number of model parameters rises exponentially
with the input dimension!

Foundations and Linear Methods 20


Multivariate Regression / Curse of Dimensionality

Fortunately, dimensionality problems can be tackled:


Real data often lies in a subspace of lower dimensionality
Example: An image of 100 · 100 pixels has 10000 dimensions. . . but
how many theoretically possible images show anything sensible?
⇒ Use dimensionality reduction techniques on the data.
⇒ Define a suitable set of features ϕ_k(x^(1), x^(2), . . . , x^(L)), where the number of features can be much smaller than the number of polynomial coefficients above.
⇒ In realistic tasks, features can be very complex, and feature
optimization plays a crucial role.
Neural networks can intrinsically deal with high-dimensional input and often do not require sophisticated features.
Multidimensional output: less of a problem, can estimate output variables
independently (but versatile algorithms estimate them jointly).

Foundations and Linear Methods 21


Examples of Features

Features play a central role in almost all machine learning applications:
: they reduce the dimensionality of the data
: they make hidden information accessible
: they suppress irrelevant information.
Example: The speech spectrogram, computed by taking the Fourier
transformation of short speech units (frames)

Img src: https://www.opentextbooks.org.hk/ditatopic/9759

Foundations and Linear Methods 22


The Linear Regression Model

Step by step, we have gotten to know a technique which is called Linear Regression.
Idea: model the relationship between input features and targets as a
linear equation:
T = Wϕ + ϵ,
where the vector ϵ is the error term (also called noise).
We aim to minimize the noise. If we use the MSE loss to measure
the error, we have already derived the closed-form solution for this
task, including a regularization term if desired.
However, other strategies for solving linear regression are possible.
The features are predefined and could be generic (like polynomial
coefficients), or application-specific. Note that the features do not
need to depend linearly on the raw input data.

Foundations and Linear Methods 23


Regression: Summary

You have learned about


fitting a regression model which is linear in its parameters: an exact
solution
examples of overfitting and underfitting, regularization
the role of features
issues of dimensionality.

Foundations and Linear Methods 24


Regression: Summary

We summarize the following important definitions:


generalization: the capability of a system to adapt to novel, unseen
data
underfitting: when the model does not reflect the structure of the
training data
overfitting: when the model has learned the structure of the training
data very well, but fails at generalizing to novel test data
regularization: changing the training criterion of a system to
improve generalization.

Foundations and Linear Methods 25


Introduction to Classification
Classification

Classification: Assign categorical values to input data points.
Here we have two classes, e.g. labeled 0 and 1 (drawn with two different markers in the figure).
Multiple classes: often use 1-of-K (or one-hot) coding (see the sketch below):
: (1, 0, 0, ...) for class 0,
: (0, 1, 0, ...) for class 1, etc.
As before, we assume we have some training data and some test data.
Have a look at the training data given for a two-class problem with two features ϕ0 and ϕ1.

[Figure: scatter plot of the two-class training data in the (ϕ0, ϕ1) feature plane]
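A minimal sketch of 1-of-K (one-hot) coding (illustrative; class labels are assumed to be integers 0, ..., K−1):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to 1-of-K target vectors."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# one_hot([0, 2, 1], 3) ->
# [[1, 0, 0],
#  [0, 0, 1],
#  [0, 1, 0]]
```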

Foundations and Linear Methods 27


Classification

How would you assign classes to the '?' test data points?
If in some cases you find the answer "obvious", think a second why.

[Figure: the training data from before, with several unlabeled '?' test points added]

Foundations and Linear Methods 28


Classification

Two easy cases: follow the class assignments of the surrounding data point(s).

Foundations and Linear Methods 28


Classification

This one is closer to one of the two classes, but note that the points of the other class are much more compact. Could be ambiguous.

Foundations and Linear Methods 28


Classification

This one is far away from both classes. It probably should be assigned to one of them, but realistically, the training data does not prepare us well for classifying this particular sample.

Foundations and Linear Methods 28


Classification

This one is close to a training point which however seems to be an outlier - a nontypical example, maybe due to data noise.
Remember that data is never perfect, and that ambiguities are quite normal!
Even if it is not an outlier, we can intuitively say that the "probability" of the surrounding class seems higher than that of the outlier's class. Probabilistic modeling will allow us to formalize this idea.
⇒ In this (simple) case, we can solve the problem by considering multiple neighbors (not just the closest one).
Foundations and Linear Methods 28
K-Nearest-Neighbors classifier

Define the simple, nonparametric k-Nearest-Neighbors (kNN) classifier:


for a test sample x, consider the k nearest training samples (for
fixed k)
determine the class assignment for x by “majority vote” (i.e. it
corresponds to the class of the majority of the k nearest training
samples)
one could also weigh the influence of the k nearest neighbors (e.g.
by distance)
The algorithm is called nonparametric because it works directly with the data, as opposed to parametric methods like linear regression which are based on learning the parameters of a model.
What do you think are the problems of this approach?
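A minimal sketch of the kNN decision rule described above (illustrative only: Euclidean distance is assumed as the closeness metric, labels are assumed to be integer class indices, and ties go to the smaller label via np.bincount):

```python
import numpy as np

def knn_predict(x_test, X_train, y_train, k=3):
    """Classify one test sample by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distances to all training samples
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    votes = np.asarray(y_train)[nearest].astype(int)
    return np.bincount(votes).argmax()                 # majority class
```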

Foundations and Linear Methods 31


K-Nearest-Neighbors classifier - Problems

Here are some major problems of the kNN classifier:


requires storing all training data, and computing distances between the test sample and all training samples
does not discover structure (e.g. shape, “compactness” of the
classes)
very sensitive to outliers
problematic in high-dimensional feature spaces (Curse of
Dimensionality: samples do not cover the space well)
difficult to define a suitable metric of closeness.

Foundations and Linear Methods 32


K-Nearest-Neighbors classifier - Problems

Consider a k-nearest-neighbors classifier with which we want to distinguish a swim tyre and a donut.
Images are represented at 250x250 pixels, 3 colors, ranging from
0..255.
Thus, 250x250x3 = 187500 features in raw pixel space.

Foundations and Linear Methods 33


K-Nearest-Neighbors classifier - Problems

Absolute difference (= distance) between swim tyre and donut, in raw pixel space: average ≈ 45.
Absolute difference between rescaled donut and shifted identical donut, in raw pixel space: average ≈ 146.

This cannot be a good way to perform (image) classification.


Foundations and Linear Methods 34
K-Nearest-Neighbors classifier - Problems

In practice, one would define suitable features (search the internet for HOG, SIFT, etc. if you are interested), and one would normalize (e.g. center) the images.
But what about rotation, scaling, photos from different angles . . . ?
Clearly, complex realistic cases cannot be solved this way. Need a
much more versatile approach.
We will get to know such approaches in this lecture. They are
always based on some kind of model, i.e. they are parametric.

Foundations and Linear Methods 35


The Linear Classifier

Parametric classifier: the structure of the data gets summarized by a set of parameters of (usually) fixed size, independent of the number of training samples.
Two fundamental concepts:
: Discriminant function: A function (with trainable parameters) which
assigns a class to a test sample

C = f (ϕ)

: Probabilistic modeling: A function models the probabilities of a test sample belonging to any class:

    p(C_k | ϕ) = f(ϕ, C_k)

It is easy to compute the discriminant function from a probabilistic model (take the class whose probability is highest).

Foundations and Linear Methods 36


The Linear Classifier

Consider the example below (the outlier has been removed).
Most simple discriminant function: a hyperplane in feature space.
⇒ In the two-dimensional case: a straight line.
In the given example, many solutions are possible.
: We will see later how to choose a good one.
: We will see later what to do if there is no perfect solution.

[Figure: the two-class training data with a straight-line decision boundary separating the classes]
Foundations and Linear Methods 37
The Linear Classifier

The discriminant function takes the form

    y(ϕ) = w^T ϕ + w_0   with   C(ϕ) = C_0 if y(ϕ) ≥ 0,  C_1 if y(ϕ) < 0.

w is called a weight vector, w_0 is the bias. The hyperplane with the equation w^T ϕ + w_0 = 0 is called the decision boundary; it separates the space into decision regions.

[Figure: the decision boundary drawn as a straight line in the (ϕ0, ϕ1) plane]
Foundations and Linear Methods 38
The Linear Classifier

The discriminant function takes the form

    y(ϕ) = w^T ϕ + w_0   with   C(ϕ) = C_0 if y(ϕ) ≥ 0,  C_1 if y(ϕ) < 0.

Note that w is orthogonal to the decision boundary. The distance of any point x to the decision boundary is y(x)/||w||; in particular, the distance of the decision boundary to the origin (0,0) is |w_0|/||w||.

[Figure: the decision boundary with the weight vector w drawn orthogonal to it]
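A minimal sketch of this discriminant function and the signed distance to the decision boundary (illustrative; the values of w and w0 are made up):

```python
import numpy as np

w = np.array([1.0, -2.0])   # weight vector (hypothetical values)
w0 = 0.5                    # bias (hypothetical value)

def linear_discriminant(phi):
    """y(phi) = w^T phi + w0; class C0 if y >= 0, else C1."""
    return w @ phi + w0

def signed_distance(phi):
    """Signed distance of a point to the decision boundary y(phi) = 0."""
    return linear_discriminant(phi) / np.linalg.norm(w)

phi = np.array([0.3, 0.8])
label = 0 if linear_discriminant(phi) >= 0 else 1
```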
Foundations and Linear Methods 38

The Linear Classifier - Observations

We now have a mathematical formulation for the classification problem: the decision regions are separated by a hyperplane parametrized by w and w_0.
: We will get to know different ways to choose w and w0 .
The decision regions of the kNN classifier have a much more
complex shape!
The parametrized model tends to be robust against outliers.
This comes at the price of lower flexibility: This classifier is prone to
drastic underfitting.
We could get a better fit by choosing a decision boundary described
by a polynomial
⇒ just like for regression, we can instead define suitable nonlinear
features, so that the classifier remains linear in its parameters.

Foundations and Linear Methods 39


Multiple Classes

How to extend linear classification to K > 2 classes?


One-vs-rest classification (left figure): use K classifiers, each of
which distinguishes points in class k from points not in class k.
One-vs-one classification (right figure): use K (K − 1)/2 classifiers,
one for each pair of classes.
Each of these methods leads to ambiguous areas (shaded green).

Img src: Bishop, figure 4.2

Foundations and Linear Methods 40


Multiple Classes

How to extend linear classification to K > 2 classes?


Better solution: Use K linear functions of the form

    y_k(ϕ) = w_k^T ϕ + w_{k0}

and assign a data point ϕ(x) to class C_k if

    y_k(ϕ(x)) > y_j(ϕ(x))

for all j ≠ k (see the sketch below).
This method avoids ambiguous regions; all decision regions are singly connected and convex.
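A minimal sketch of this argmax decision rule (illustrative only; W stacks the K weight vectors row-wise, w0 holds the K biases, and all values are made up):

```python
import numpy as np

def multiclass_predict(phi, W, w0):
    """Assign phi to the class k with the largest linear score y_k(phi) = w_k^T phi + w_k0."""
    scores = W @ phi + w0      # shape (K,)
    return int(np.argmax(scores))

# Example with K = 3 classes and 2 features (made-up parameters)
W = np.array([[ 1.0,  0.5],
              [-0.3,  1.2],
              [ 0.0, -1.0]])
w0 = np.array([0.1, -0.2, 0.0])
k = multiclass_predict(np.array([0.4, 0.7]), W, w0)
```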

Foundations and Linear Methods 41


Logistic Regression

We present one specific way to define a linear classifier.


We would like to use the ideas we have developed for linear
regression to perform classification.
Assume that our training data consists of observations
{(ϕn , tn )}n=1,...,N , where tn only takes the values 0 and 1. Train a
regressor on this data.
While performing “regression for classification” in this way is
technically possible, it leads to issues:
: What does it mean if the estimate of t for an input data point is not equal to 0 or 1?
: Assume a data point belonging to class 1. The MSE loss will be the
same regardless of whether the estimate of t is 0.6 or 1.4 – do you
think this makes sense?
Clearly, we need to deal with the regression output in a different way.

Foundations and Linear Methods 42


Logistic Regression

We bridge the gap from regression to classification by imposing a simple probabilistic model.
Idea: We assume that each data point has a certain probability of
belonging to class 0 or 1:

    p(C_1 | ϕ) = y(ϕ)
    p(C_0 | ϕ) = 1 − y(ϕ)

From the probabilistic model, we derive the discriminant function:

    C(ϕ) = 1 if p(C_1 | ϕ) = y(ϕ) ≥ 0.5
    C(ϕ) = 0 if p(C_1 | ϕ) = y(ϕ) < 0.5.

With this model we can also express degrees of uncertainty about


classification results!

Foundations and Linear Methods 43


Logistic Regression

We derive the function y from Linear Regression.


We make the output of regression probabilistic by passing it through
a “squeezing function” which normalizes the real-valued output to
the range [0.0, 1.0].
The classical choice of the squeezing function is the logistic function or logistic sigmoid

    σ(x) = 1 / (1 + e^{−x}).

Image source: Wikipedia, Sigmoid Function

Foundations and Linear Methods 44


Logistic Regression

We define

    ỹ(ϕ) = w^T ϕ + w_0,
    y(ϕ) = σ(ỹ(ϕ)).

Taking y = y(ϕ) = y(ϕ, w, w_0) as the probability that a sample ϕ belongs to class 1, we see that
: greater absolute values of ỹ(ϕ) imply less uncertainty about the classification of the sample ϕ,
: the decision boundary between the classes is given by the hyperplane ỹ(ϕ) = 0.
This model is called Logistic Regression.
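A minimal sketch of the resulting probabilistic classifier (illustrative; the parameters w and w0 would come from training, which is discussed next):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, squeezing real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(phi, w, w0):
    """p(C1 | phi) under the logistic regression model."""
    return sigmoid(w @ phi + w0)

def predict_class(phi, w, w0):
    """Discriminant function: class 1 iff p(C1 | phi) >= 0.5."""
    return int(predict_proba(phi, w, w0) >= 0.5)
```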

Foundations and Linear Methods 45




Logistic Regression

Consider the following example task. Note that the classes overlap.
This is the decision boundary which the model finds.
Items which are more distant from the decision boundary are assigned to their respective classes with higher probability.
Clearly, the model does not classify the data perfectly.

[Figure: overlapping two-class data in the (ϕ0, ϕ1) plane with the fitted linear decision boundary]

Foundations and Linear Methods 46


Logistic Regression

We need a training criterion for logistic regression.


This criterion will be based on probabilities, just like the model itself.
We define the likelihood of a single training sample (ϕ, t) as its probability under the logistic regression model, as a function of the regression parameters w and w_0:

    L_ϕ(w, w_0) = y if t = 1, and 1 − y if t = 0;  compactly,  L_ϕ(w, w_0) = y^t (1 − y)^{1−t},

where y = y(ϕ, w, w_0).
Assuming that the training samples are statistically independent, the likelihood of the training data {(ϕ_n, t_n)}_{n=1,...,N} is

    L(w, w_0) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}.

Foundations and Linear Methods 47


Logistic Regression

It is computationally simpler to work in logarithmic space, thus we compute the log-likelihood

    L̃(w, w_0) = Σ_{n=1}^{N} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ].

Note that always L̃(w, w_0) ≤ 0 (why?).
We now define an error function as the negative log-likelihood

    E(w, w_0) = −L̃(w, w_0) = − Σ_{n=1}^{N} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ].

Our training criterion will be to minimize E, which is the same as maximizing L or L̃: the Maximum Likelihood criterion.
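As the next slide notes, this error has no closed-form minimizer. Purely as a hedged illustration (not the method prescribed by the slides, which is introduced later in the course), here is a plain gradient-descent sketch; it uses the standard gradients dE/dw = Σ_n (y_n − t_n) ϕ_n and dE/dw_0 = Σ_n (y_n − t_n), which are not derived here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(Phi, t, lr=0.1, steps=1000):
    """Minimize the negative log-likelihood by plain gradient descent.

    Phi: (N, D) feature matrix, t: (N,) targets in {0, 1}."""
    N, D = Phi.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(steps):
        y = sigmoid(Phi @ w + w0)   # predicted p(C1 | phi_n) for all samples
        grad_w = Phi.T @ (y - t)
        grad_w0 = np.sum(y - t)
        w -= lr / N * grad_w        # averaged gradient step
        w0 -= lr / N * grad_w0
    return w, w0
```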

Foundations and Linear Methods 48




Logistic Regression

Unfortunately, in contrast to "normal" linear regression, there is no closed-form solution to determine the parameter vector w and the bias w_0 for logistic regression.
We will get to know a general way to solve such tasks during this
course.
During our lecture on probabilistic modeling, we will also reconsider the assumptions which underlie Logistic Regression, and learn that under certain conditions it arises very naturally from imposing a basic probabilistic model.
If there is a decision boundary hyperplane which perfectly separates
C0 and C1 , the classes are said to be separable.
In this case, the model is prone to overfitting: it will estimate the assignment probabilities badly.
Even data points very close to the decision boundary will be
assigned with high probability to their respective class.
Again, probabilistic modeling offers a way to resolve this problem.
Foundations and Linear Methods 49
Classification: Summary

You have learned about


nonparametric and parametric classification
discriminant functions
representation of a linear decision boundary in feature space
Logistic regression as an example of a linear classifier.

Foundations and Linear Methods 50


Elements of the Foundations of Machine
Learning
Foundations of Machine Learning

We have gotten to know some elementary examples of Machine Learning algorithms.
Before going on, we introduce some basic concepts which will help
us to formalize what we have seen so far, and what we will get to
know in the future.
We start with the most basic question: How to formulate
mathematically what learning means?
You have already seen that it does not mean to memorize the
training samples!

Foundations and Linear Methods 52


A Framework for Machine Learning

Assume that our data is described by a probability distribution P(x, t) over inputs and targets, where we assume both inputs and targets to be real-valued vectors: x ∈ R^M, t ∈ R^N.
This key idea allows us to express, in a mathematically sound way,
: the variability of the data (even for a fixed class)
: the ambiguity of the mapping between input data and target
: and also our own lack of knowledge about this mapping!
It also gives us a powerful set of tools to create practical algorithms.
The probability calculus is at the heart of many important machine
learning algorithms!
Bousquet: Introduction to Statistical Learning Theory. Machine Learning 2003, LNAI 3176, pp. 169–207, 2004.

Foundations and Linear Methods 53


A Framework for Machine Learning

Assume a probability distribution P(x, t) over inputs and targets.
Let y(x) be a predictor for t. Assuming any loss L(t, y(x)), define the risk as the expected loss over the entire probability space:

    R(y) = E_{P(x,t)} L(t, y(x)) = ∫ L(t, y(x)) dP(x, t).

The goal of Machine Learning is now to find the predictor of t given x which minimizes the risk:

    y* = argmin_{y: R^M → R^N} R(y) = argmin_{y: R^M → R^N} E_{P(x,t)} L(t, y(x)).

Note that by using this probabilistic framework, we avoid referring to specific data sets (like train and test datasets)!
Vapnik: Principles of Risk Minimization for Learning Theory. Proc. NIPS 1991.

Foundations and Linear Methods 54




The Irreducible Error

Assume for a moment that we know the distribution P(x, t).


In some cases, we can directly obtain y*, for example in the case of classification:

    y*(x) = argmax_t P(x, t)

Q: Will the risk, i.e. the expected loss, be zero for the predictor y ∗ ?
No, because t might not be deterministic for a given x.
: Only if x completely determines t is the risk of the optimal predictor zero.
: Mathematically, this means that for all x, P(t|x) = 1 for exactly one t_true(x) and P(t|x) = 0 for all other t.

Foundations and Linear Methods 55


The Irreducible Error

The prediction error which stems from the fact that the input does
not completely determine the target is called the irreducible error.
The irreducible error can be considered a form of noise affecting the
inputs, the targets, or both. It is independent of any concrete
method which we use for the prediction of targets.
In practice, such noise can derive from inexact measurements of
data, from unaccounted sources of variation, and many other factors.
In the case of classification, we (usually) map a continuous input to
discrete output, so there will be border regions with unclear class
assignments.
As an example, consider a subset of the famous MNIST corpus of
handwritten digits. Do you think all these examples are clearly the
digit “7”?

Foundations and Linear Methods 56




A Framework for Machine Learning

Clearly, we do not know the true distribution of the data P(x, t).
How can we estimate (and subsequently minimize) the risk?
We are given a finite set of observations D = {(x_n, t_n)}_{n=1,...,N}.
Idea: Approximate expectations over P by averaging over the data, specifically:

    R(y) = E_{P(x,t)} L(t, y(x)) ≈ (1/N) Σ_{n=1}^{N} L(t_n, y(x_n)) =: R_emp(y).

R_emp(y) is called the empirical risk. The observations D are usually called the training data.
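A minimal sketch of the empirical risk for a generic predictor and loss (illustrative only; the predictor, the loss, and the data arrays are placeholders):

```python
import numpy as np

def empirical_risk(predictor, loss, X, T):
    """R_emp(y) = (1/N) * sum_n L(t_n, y(x_n)) over the observed data."""
    return np.mean([loss(t, predictor(x)) for x, t in zip(X, T)])

# Example with a squared-error loss and a trivial placeholder predictor:
# squared_loss = lambda t, y: 0.5 * (y - t) ** 2
# risk = empirical_risk(lambda x: 0.0, squared_loss, X_train, T_train)
```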

Foundations and Linear Methods 57




A Framework for Machine Learning

The entire field of Machine Learning deals with minimizing the true
risk, based on estimations using the empirical risk.
We will get to know a large variety of algorithms and methods where
we give the predictor y (x) a specific mathematical form, allowing
minimization of the empirical risk.
However, does the empirical risk always reflect the true risk?
In general, no!
: Remember that we allow any function from the input space to the
target space as a predictor.
: For example, a function which correctly maps all training data points
and is random on all other points achieves zero empirical risk, but
high true risk.
We want to avoid such pathological cases. Which functions are
“reasonable” as predictors?

Foundations and Linear Methods 58


A Framework for Machine Learning

A function like the one on the last slide has a high complexity,
whereas a structured representation of the data should have a much
simpler form.
Thus, we restrict the predictors y(x) which we take into account to a subset of the functions from R^M to R^N: the hypothesis class H.
This generalizes and formalizes the concept of regularization which
we covered earlier!
We hope that this helps us to obtain a predictor whose empirical risk
(on the training data) reflects the true risk (on unseen test data).
Clearly, this is not automatically true. . .

Foundations and Linear Methods 59


A Framework for Machine Learning

We now define the optimal predictor within the hypothesis class H:

    y_H = argmin_{y ∈ H} R(y) = argmin_{y ∈ H} E_{P(x,t)} L(t, y(x)),

as well as the empirically optimal predictor given a dataset D = {(x_n, t_n)}_{n=1,...,N}:

    y_H^emp = argmin_{y ∈ H} R_emp(y) = argmin_{y ∈ H} (1/N) Σ_{n=1}^{N} L(t_n, y(x_n)).

In almost all cases, exactly computing the optimal predictor in H is


intractable, and it needs to be approximated.
All Machine Learning algorithms aim at optimally defining the
hypothesis class H, such that efficient optimization over a wide
range of functions is possible, while retaining generalization ability.

Foundations and Linear Methods 60


A Framework for Machine Learning

Based on the probabilistic framework, many bounds on the difference between empirical and true risk have been proposed, typically of the form

    ∀ y ∈ H:   R(y) − R_emp(y) < O( √( cap(H) / N ) ),

where N is the number of samples, and cap(H) measures the "capacity" or "richness" of the function class H (not covered in this course).
Note that the true risk is typically assumed to be greater than the
empirical risk.
We see that it is necessary to balance the capacity of the hypothesis
class and the number of data samples:
: If there are not enough elements in H, even the optimal predictor
might not be very good.
: If there are too many elements in H, the empirically best predictor
might be far away from the true optimal predictor.
Foundations and Linear Methods 61
Definitions

A predictor whose true risk is close to the empirical risk on the training data is said to generalize well.
A predictor whose empirical risk is low, but whose true risk is high,
is said to overfit to the training data.
A predictor whose empirical risk is high is said to underfit the data.
Which of the predictors in the figure overfit, and which ones
underfit?

Img src: Bishop, figure 1.4

Foundations and Linear Methods 62


Using Data Sets

We tune our Machine Learning system on the training data, obtaining the empirical risk. But how can we now estimate the true risk?
This requires a separate dataset which is not used for tuning the
system, but only for measuring the loss after training: the test data.
In practical scenarios, even this might not guarantee a good estimate
of the true risk: If we perform system tuning many times with
slightly different metaparameters, always reusing the same test data,
we implicitly tune the system to this test data.
Good practice requires using three different data sets (see the sketch after this list):
: The training data is used for tuning the system.
: The validation data is used to determine the quality of the trained
system and to guide further optimization steps.
: The test data is used only when the system is completely tuned to
determine the final quality (e.g. for writing a final report or similar).
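A minimal sketch of such a three-way split (illustrative; X and T are assumed to be NumPy arrays, and the 70/15/15 proportions are an arbitrary choice, not from the slides):

```python
import numpy as np

def train_val_test_split(X, T, val_frac=0.15, test_frac=0.15, seed=0):
    """Randomly split a dataset into training, validation, and test portions."""
    N = len(X)
    idx = np.random.default_rng(seed).permutation(N)
    n_test, n_val = int(N * test_frac), int(N * val_frac)
    test, val = idx[:n_test], idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], T[train]), (X[val], T[val]), (X[test], T[test])
```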

Foundations and Linear Methods 63


Foundations: Summary

In this section, you have gotten to know


a mathematical formulation of the task of Machine Learning
generalization, overfitting, and underfitting revisited
splitting your data set to obtain statistically valid results.

Foundations and Linear Methods 64
