0% found this document useful (0 votes)

20 views81 pages

Slides Foundations

foundations

Uploaded by

Francesco Gualdi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views81 pages

Slides Foundations

foundations

Uploaded by

Francesco Gualdi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 81

Foundations and Linear Methods

Machine Learning
Michael Wand
TA: Eric Alcaide

{michael.wand, eric.alcaide}@idsia.ch

Dalle Molle Institute for Artificial Intelligence Studies (IDSIA) USI - SUPSI

Fall Semester 2024

Contents

In this lecture we will cover the two basic tasks of supervised learning, as
well as some fundamental aspects. Our schedule is as follows:
Introduction to Regression
Introduction to Classification
Elements of the Foundations of Machine Learning
Besides introducing elementary examples of algorithms, we will also get
to know many basic concepts and definitions which we will frequently
encounter later on!
Attribution: Many figures taken from Bishop’s Machine Learning
textbook https://www.microsoft.com/en-us/research/people/
cmbishop/prml-book.
Also highly recommended as a reference!

Foundations and Linear Methods 2

Introduction to Regression
Polynomial Fitting

We are given a training set of observations

D = {dn }n=1,...,N = {(xn , tn )}n=1,...,N (represented as blue circles).
Task: Predict the value of the target variable t from the input
variable x, for an unknown input x (blue ‘?’).
For now, both input and target are one-dimensional.
Since we have observations which include the target, this is a
supervised task.
It is also a regression task (real-valued output).

Img src: Bishop, figure 1.2 (modified)

Foundations and Linear Methods 4

Polynomial Fitting

Spoiler: the training data was generated from a sine wave with
added (Gaussian) noise.
? Is there a “perfect” solution to our prediction problem?
? Can we find it, based on the available data? Can we approximate it?
? What do we expect from our solution?

Img src: Bishop, figure 1.2 (modified)

Foundations and Linear Methods 5

Polynomial Fitting

Consider the example: The

input x exactly matches one of
the training samples.
Which of the two ‘X’ is the
“correct” prediction?
Img src: Bishop, figure 1.2 (modified)

Foundations and Linear Methods 6

Polynomial Fitting

Consider the example: The

input x exactly matches one of
the training samples.
Which of the two ‘X’ is the
“correct” prediction?
Img src: Bishop, figure 1.2 (modified)

We usually aim at finding underlying structure in the task.

In practical cases, the training data is frequently noisy.
The lower ‘X’ is probably the better prediction.
The answer may also depend on our knowledge or assumptions of
the underlying data.

Foundations and Linear Methods 6

Polynomial Fitting

What about the general case in

which the input x does not
match any training sample?
Given our knowledge of the
data, the prediction should be
given by the red ‘X’.
Img src: Bishop, figure 1.2 (modified)

Foundations and Linear Methods 7

Polynomial Fitting

What about the general case in

which the input x does not
match any training sample?
Given our knowledge of the
data, the prediction should be
given by the red ‘X’.
Img src: Bishop, figure 1.2 (modified)

This is called generalization: the system should be able to make

reasonable predictions on new data, provided that this data is
“similar” to the training data.
Later on, we will formally define what “similarity” means.
For this prediction we have assumed knowledge about the structure
of the data.
It turns out that we always have to make some kind of such
assumption, otherwise generalization is impossible (No Free Lunch
Theorem).
Foundations and Linear Methods 7
Polynomial Fitting

Our goal is to approximate the underlying structure of the data.

For this purpose we make a model assumption: we describe the
relationship between input and target by a polynomial
M
X
y (x, w) = wj x j .
j=0

After fitting, we wish to use y as estimator for t.

We now need to fit the model to the input observations
{(xn , tn )}n=1,...,N by determining the coefficients w = {wj }j=0,...,M .
(We also need to choose the order M, but for now assume that M is
fixed.)
Note that for the given task, this particular model will never exactly
fit unseen data. (Why? Two reasons!)

Foundations and Linear Methods 8

Polynomial Fitting

We define the Mean Squared Error (MSE) as

1 X 1 X1
E (w) = en (w) = (y (xn , w) − tn )2
N n N n 2

where en (w) = 21 (y (xn , w) − tn )2 is the error for sample n.

The MSE is our loss function and our training criterion: We aim to
minimize it by choosing suitable parameters w.
We frequently do not write the explicit dependence of the error on w.
There are underlying reasons for choosing this error. For now, let us
just say that it is computationally easy. We include the factor 12 for
later convenience.

Foundations and Linear Methods 9

Solving the Fitting Task

To summarize:
M
1 X 1 X
E = en , en = (y (xn , w) − tn )2 , y (x, w) = wj x j = wT ϕ(x).
N n 2 j=0

T 0 1 M T
with ϕ(x) = (ϕ0 (x), . . . , ϕM (x)) = (x , x , . . . , x ) (useful later).

We compute the derivative of the error:

dE ∂E ∂E
= ,...,
dw ∂w0 ∂wM
with
∂E 1 X ∂en 1 X ∂y (xn , w) 1 X T
= = (y (xn , w)−tn )· = (w ϕ(xn )−tn )·ϕj (xn )
∂wj N n ∂wj N n ∂wj N n

and use matrix calculus to simplify:

!
dE 1 T
X T
X T
= w ϕ(xn )ϕ(xn ) − tn ϕ(xn ) .
dw N n n

Foundations and Linear Methods 10

Solving the Fitting Task

Using the design matrix

···
 
ϕ0 (x1 ) ϕ1 (x1 ) ϕM (x1 )
 ϕ0 (x2 ) ϕ1 (x2 ) ··· ϕM (x2 ) 
Φ= .
 
.. .. .. 
 .. . . . 
ϕ0 (xN ) ϕ1 (xN ) ··· ϕM (xN )

and the target vector T T = (t1 , . . . , tN )T , we can further simplify:

dE 1 T T
= w (Φ Φ) − T T Φ .
dw N
The fitting task is solved by minimizing the error, which we do by
setting the derivative of the error to zero:
! dE 1 T T
0= = w (Φ Φ) − T T Φ .
dw N

Foundations and Linear Methods 11

Solving the Fitting Task

1
Cancelling the factor N and transposing, the equation

! !
0 = wT ΦT Φ − T T Φ ⇔ 0 = ΦT Φw − ΦT T

is a linear system of (M + 1) equations for (M + 1) variables, so we

will assume that there is a unique solution. (Q: What is the
condition for a solution to exist?)
It can be seen easily that the solution is given by

wmin = (ΦT Φ)−1 ΦT T .

The matrix (ΦT Φ)−1 ΦT is called the Moore-Penrose Pseudo-Inverse

of the matrix Φ (Φ is in general is not a square matrix).
If Φ is square and invertible, the Moore-Penrose Pseudo-Inverse
coincides with Φ−1 (the proper inverse).

Foundations and Linear Methods 12

Polynomial Fitting

We have solved our first regression task!

Now we can use y (x, wmin ) to predict targets t from inputs x.
Key observation: we had to solve an equation which was nonlinear in
the inputs, but linear in the parameters, i.e. y depends linearly on w.
⇒ This allowed us to find a closed-form solution!
Unfortunately, finding solutions for nonlinear models is not so easy.
During this lecture we will get to know several ways to approximate
such solutions!
Also remember that this is the “best” solution in terms of the
criterion which we used, but it is not necessarily a perfect solution,
and there may be other solutions for other criteria!

Foundations and Linear Methods 13

Polynomial Fitting: Order

So how should we choose the order M of the fitting polynomial?

Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.

Foundations and Linear Methods 14

Polynomial Fitting: Order

So how should we choose the order M of the fitting polynomial?

Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.

Img src: Bishop, figure 1.4

Which one would you choose?

Foundations and Linear Methods 14

Polynomial Fitting: Order

So how should we choose the order M of the fitting polynomial?

Remember our goal: predict the target variable for new data points.
Here are some fitted curves (in red) for different polynomial orders M.

Img src: Bishop, figure 1.4

Which one would you choose?

Clearly, the linear approximations with orders 0 and 1 do not well

represent the data.
These orders are too low for the task: these polynomials are not
expressive enough, they underfit the data.
Foundations and Linear Methods 14
Polynomial Fitting: Order

Img src: Bishop, figure 1.4

The polynomial of order 9 exactly fits

the training samples (why?), but does
not recover the underlying structure.
It overfits the training data, yielding
bad predictions for new data points.
One can also say that it has been
tuned to the random noise present in
the data.

Foundations and Linear Methods 15

Polynomial Fitting: Order

Img src: Bishop, figure 1.4

The polynomial of order 9 exactly fits Table: Coefficients of the fitted

polynomial
the training samples (why?), but does
not recover the underlying structure. Order M=1 M=3 M=9
It overfits the training data, yielding w0 0.82 0.31 0.35
w1 -1.27 7.99 232.37
bad predictions for new data points. w2 -25.43 -5321.83
w3 17.37 48568.31
One can also say that it has been w4 -231639.30
tuned to the random noise present in w5 640042.26
w6 -1061800.52
the data. w7 1042400.18
w8 -557682.99
The overfitted polynomial has very w9 125201.43
large coefficients.
Foundations and Linear Methods 15
Polynomial Fitting: Order

Img src: Bishop, figure 1.4

Within the set of models which we considered, the order 3

polynomial is optimal.
: Even though the training error is higher than for higher-order
polynomials.
: A perfect solution with a polynomial is not possible (since the “true”
shape of the data is not a polynomial, and since the data has noise).

Foundations and Linear Methods 16

Reducing Overfitting

Img src: Bishop, figure 1.4 and 1.6

Easiest way to reduce overfitting: Get more training data

Figures: estimated regression polynomial for M = 9 with
N = 10, 15, 100 data points
In the latter case, the curve does not overfit: better generalization
Don’t have enough data? Maybe you can artificially create it
: by injecting random noise to samples
: by domain-specific transformations (e.g. in image recognition, you
could slightly deform images without changing their content)
: relevant for all machine learning tasks (classification, regression, and
others), and for a variety of methods
Foundations and Linear Methods 17
Reducing Overfitting

Limiting the number of model parameters (in this case, polynomial

coefficients) helps against overfitting.
Parameters can also be constrained by penalizing their absolute
value.
In the case of linear regression, optimize a modified error function
with a suitable regularization term, e.g.:
N N
1 X1 λ 1 X1 λ
Ẽ (w) = (y (xn , w)−tn )2 + ||w||22 = (y (xn , w)−tn )2 + wT w
N n=1 2 2 N n=1 2 2

λ determines the relative weight of the regularization term.

This regularization is called L2 regularization.
The solution has a closed form (Ridge Regression):

wmin = (λI + ΦT Φ)−1 ΦT T .

Other forms of regularization exist and are regularly used.

Foundations and Linear Methods 18
Features

Important observation: The solution depends only on the features,

i.e. the elements of the design matrix:
ϕ0 (x1 ) ϕ1 (x1 ) ··· ϕM (x1 )
 
 ϕ0 (x2 ) ϕ1 (x2 ) ··· ϕM (x2 ) 
Φ= . . .
 
..
 .. . .

. . . 
ϕ0 (xN ) ϕ1 (xN ) ··· ϕM (xN )

We had ϕj (x) = x j , but we see that can choose features ϕj (x)

arbitrarily, leading to fitting with arbitrary functions using the same
method as before.
The choice of features often depends on knowledge of the task.
In general, computing features amounts to preprocessing the data in
a way to make relevant information accessible, and to remove
irrelevant content.

Foundations and Linear Methods 19

Multivariate Regression / Curse of Dimensionality

Assume we have multiple input variables x (1) , x (2) , . . . , x (L) .

For polynomial fitting, we have to take into account all possible
products between variables, up to the desired order M:
X (ℓ) X (ℓ ,ℓ )
y (x, w) = w0 + w1 x (ℓ) + w2 1 2 x (ℓ1 ) x (ℓ2 ) +
ℓ ℓ1 ,ℓ2
ℓ1 ≤ℓ2
(ℓ ,ℓ2 ,...,ℓM ) (ℓ1 ) (ℓ2 )
X
+... + wM 1 x x · · · x (ℓM ) .
ℓ1 ,ℓ2 ,...,ℓM
ℓ1 ≤ℓ2 ≤...≤ℓM

Thus we need to estimate O(M L ) coefficients, which may be

inefficient and can cause severe overfitting.
⇒ This problem in high-dimensional input spaces is called the Curse of
Dimensionality: the number of model parameters rises exponentially
with the input dimension!

Foundations and Linear Methods 20

Multivariate Regression / Curse of Dimensionality

Fortunately, dimensionality problems can be tackled:

Real data often lies in a subspace of lower dimensionality
Example: An image of 100 · 100 pixels has 10000 dimensions. . . but
how many theoretically possible images show anything sensible?
⇒ Use dimensionality reduction techniques on the data.
⇒ Define a suitable set of features: ϕk (x1 , x2 , . . . , xL ), where the
number of features can be much smaller than LM .
⇒ In realistic tasks, features can be very complex, and feature
optimization plays a crucial role.
Neural networks can intrinsically deals with high-dimensional input
and often do not require sophisticated features.
Multidimensional output: less of a problem, can estimate output variables
independently (but versatile algorithms estimate them jointly).

Foundations and Linear Methods 21

Examples of Features

Features play a central role in almost all machine learning

applications:
: they reduce the dimensionality of the data
: they make hidden information accessible
: they suppress irrelevant information.
Example: The speech spectrogram, computed by taking the Fourier
transformation of short speech units (frames)

Img src: https://www.opentextbooks.org.hk/ditatopic/9759

Foundations and Linear Methods 22

The Linear Regression Model

Step by step, we have gotten to know a technique which is called

Linear Regression.
Idea: model the relationship between input features and targets as a
linear equation:
T = Wϕ + ϵ,
where the vector ϵ is the error term (also called noise).
We aim to minimize the noise. If we use the MSE loss to measure
the error, we have already derived the closed-form solution for this
task, including a regularization term if desired.
However, other strategies for solving linear regression are possible.
The features are predefined and could be generic (like polynomial
coefficients), or application-specific. Note that the features do not
need to depend linearly on the raw input data.

Foundations and Linear Methods 23

Regression: Summary

You have learned about

fitting a regression model which is linear in its parameters: an exact
solution
examples of overfitting and underfitting, regularization
the role of features
issues of dimensionality.

Foundations and Linear Methods 24

Regression: Summary

We summarize the following important definitions:

generalization: the capability of a system to adapt to novel, unseen
data
underfitting: when the model does not reflect the structure of the
training data
overfitting: when the model has learned the structure of the training
data very well, but fails at generalizing to novel test data
regularization: changing the training criterion of a system to
improve generalization.

Foundations and Linear Methods 25

Introduction to Classification
Classification

Classification: Assign categorical

values to input data points.
1
Here we have two classes, e.g.
0= ,1= . ϕ1
Multiple classes: often use 1-of-K (or 0
one-hot) coding:
: (1, 0, 0, ...) for class 0,
: (0, 1, 0, ...) for class 1, etc. -1
As before, we assume we have some
training data and some test data. -1 0 ϕ0 1
Have a look at the training data given
for a two-class problem with two
features ϕ0 and ϕ1 .

Foundations and Linear Methods 27

Classification

How would you assign classes to the ‘?’

test data points?
1
If in some cases you find the answer ?
“obvious”, think a second why. ϕ1
0 ?

?
?
-1 ?

-1 0 ϕ0 1

Foundations and Linear Methods 28

Classification

How would you assign classes to the ‘?’

test data points?
1
Two easy cases: follow class ?
assignments of surrounding data ϕ1
point(s) 0 ?

?
?
-1 ?

-1 0 ϕ0 1

Foundations and Linear Methods 28

Classification

How would you assign classes to the ‘?’

test data points?
1
This one is closer to the class, but ?
note that the are much more ϕ1
compact than the . Could be 0 ?
ambiguous.
?
?
-1 ?

-1 0 ϕ0 1

Foundations and Linear Methods 28

Classification

How would you assign classes to the ‘?’

test data points?
1
This one is far away from any of the ?
classes. Probably should be a , but ϕ1
realistically, the training data does not 0 ?
prepare us well for classifying this
?
particular sample. ?
-1 ?

-1 0 ϕ0 1

Foundations and Linear Methods 28

Classification

How would you assign classes to the ‘?’

test data points?
1
This one is close to a , which seems ?
however an outlier - a nontypical ϕ1
example, maybe due to data noise. 0 ?

Remember that data is never perfect, ?

and that ambiguities are quite normal! ?
-1 ?
Even if it is not an outlier, we can
intuitively say that the “probability” of
seems higher than . -1 0 ϕ0 1
Probabilistic modeling will allow us to
formalize this idea.
⇒ In this (simple) case, we can solve the
problem by considering multiple
neighbors (not just the closest one).
Foundations and Linear Methods 28
K-Nearest-Neighbors classifier

Define the simple, nonparametric k-Nearest-Neighbors (kNN) classifier:

for a test sample x, consider the k nearest training samples (for
fixed k)
determine the class assignment for x by “majority vote” (i.e. it
corresponds to the class of the majority of the k nearest training
samples)
one could also weigh the influence of the k nearest neighbors (e.g.
by distance)
The algorithm is called nonparametric because works directly with the
data, as opposed to parametric methods like linear regression which are
based on learning parameters of a model.
What do you think are the problems of this approach?

Foundations and Linear Methods 31

K-Nearest-Neighbors classifier - Problems

Here are some major problems of the kNN classifier:

requires to store all training data, and to compute distances between
test sample and all training samples
does not discover structure (e.g. shape, “compactness” of the
classes)
very sensitive to outliers
problematic in high-dimensional feature spaces (Curse of
Dimensionality: samples do not cover the space well)
difficult to define suitable metric of closeness.

Foundations and Linear Methods 32

K-Nearest-Neighbors classifier - Problems

Consider a k-nearest-neighbors classifier with which we want to

distinguish a swim tyre and a donut.
Images are represented at 250x250 pixels, 3 colors, ranging from
0..255.
Thus, 250x250x3 = 187500 features in raw pixel space.

Foundations and Linear Methods 33

K-Nearest-Neighbors classifier - Problems

Absolute difference (=distance) between swim tyre and donut, in

raw pixel space: average ≈ 45.

Absolute difference between rescaled donut and shifted identical

donut, in raw pixel space: average ≈ 146.

This cannot be a good way to perform (image) classification.

Foundations and Linear Methods 34
K-Nearest-Neighbors classifier - Problems

In practice, one would define suitable features (search internet for

HOG, SIFT, etc. if you are interested), and one would normalize
(e.g. center) the images.
But what about rotation, scaling, photos from different angles . . . ?
Clearly, complex realistic cases cannot be solved this way. Need a
much more versatile approach.
We will get to know such approaches in this lecture. They are
always based on some kind of model, i.e. they are parametric.

Foundations and Linear Methods 35

The Linear Classifier

Parametric classifier: structure of data gets summarized by a set of

parameters of (usually) fixed size, independent of the number of
training samples
Two fundamental concepts:
: Discriminant function: A function (with trainable parameters) which
assigns a class to a test sample

C = f (ϕ)

: Probabilistic modeling: A function models the probabilities of a test

sample belonging to any class

p(Ck |ϕ) = f (ϕ, Ck )

It is easy to compute the discriminant function from a probabilistic

model (take the class whose probability is highest).

Foundations and Linear Methods 36

The Linear Classifier

Consider the example below (the outlier has been removed).

Most simple discriminant function: a hyperplane in feature space.
⇒ In the two-dimensional case: a straight line.

1
ϕ1
0

-1

-1 0 ϕ0 1
Foundations and Linear Methods 37
The Linear Classifier

Consider the example below (the outlier has been removed).

Most simple discriminant function: a hyperplane in feature space.
⇒ In the two-dimensional case: a straight line.

In the given example, many

solutions are possible.
: We will see later how to
1 choose a good one.
ϕ1 : We will see later what to
do if there is no perfect
0 solution.

-1

-1 0 ϕ0 1
Foundations and Linear Methods 37
The Linear Classifier

The discriminant function takes the form

(
T C0 , if y (ϕ) ≥ 0
y (ϕ) = w ϕ + w0 with C(ϕ) =
C1 , if y (ϕ) < 0

w is called a weight vector, w0 is

the bias. The hyperplane with the
equation wT ϕ + w0 = 0 is called
1
decision boundary, it separates the
ϕ1 space into decision regions.
0

-1

-1 0 ϕ0 1
Foundations and Linear Methods 38
The Linear Classifier

The discriminant function takes the form

(
T C0 , if y (ϕ) ≥ 0
y (ϕ) = w ϕ + w0 with C(ϕ) =
C1 , if y (ϕ) < 0

Note that w is orthogonal to the

decision boundary. The distance of
any point x to the decision bound-
1 y (x)
ary is ||w|| , in particular, the dis-
ϕ1 tance of the decision boundary to
|w0 |
0 the origin (0,0) is ||w|| .

-1 w

-1 0 ϕ0 1
Foundations and Linear Methods 38
The Linear Classifier - Observations

We now have a mathematical formulation for the classification

Foundations and Linear Methods 39

The Linear Classifier - Observations

We now have a mathematical formulation for the classification

problem: the decision regions are separated by a hyperplane
parametrized by w and w0 .
: We will get to know different ways to choose w and w0 .
The decision regions of the kNN classifier have a much more
complex shape!
The parametrized model is tendentially robust against outliers.
This comes at the price of lower flexibility: This classifier is prone to
drastic underfitting.
We could get a better fit by choosing a decision boundary described
by a polynomial
⇒ just like for regression, we can instead define suitable nonlinear
features, so that the classifier remains linear in its parameters.

Foundations and Linear Methods 39

Multiple Classes

How to extend linear classification to K > 2 classes?

One-vs-rest classification (left figure): use K classifiers, each of
which distinguishes points in class k from points not in class k.
One-vs-one classification (right figure): use K (K − 1)/2 classifiers,
one for each pair of classes.
Each of these methods leads to ambiguous areas (shaded green).

Img src: Bishop, figure 4.2

Foundations and Linear Methods 40

Multiple Classes

How to extend linear classification to K > 2 classes?

Better solution: Use K linear functions of the form

yk (ϕ) = wkT ϕ + wk0

and assign a data point ϕ(x) to class Ck if

yk (ϕ(x)) > yj (ϕ(x))

for all j ̸= k.
This method avoids ambiguous regions, all decision regions are
singly connected and convex.

Foundations and Linear Methods 41

Logistic Regression

We present one specific way to define a linear classifier.

We would like to use the ideas we have developed for linear
regression to perform classification.
Assume that our training data consists of observations
{(ϕn , tn )}n=1,...,N , where tn only takes the values 0 and 1. Train a
regressor on this data.
While performing “regression for classification” in this way is
technically possible, it leads to issues:
: What does it mean if the estimate of t for an input data point is not
equal 0 or 1?
: Assume a data point belonging to class 1. The MSE loss will be the
same regardless of whether the estimate of t is 0.6 or 1.4 – do you
think this makes sense?
Clearly, we need to deal with the regression output in a different way.

Foundations and Linear Methods 42

Logistic Regression

We bridge the gap from regression to classification by imposing a

simple probabilistic model.
Idea: We assume that each data point has a certain probability of
belonging to class 0 or 1:

p(C1 |ϕ) = y (ϕ)

p(C0 |ϕ) = 1 − y (ϕ)

From the probabilistic model, we derive the discriminant function:

C(ϕ) = 1 if p(C1 |ϕ) = y (ϕ) ≥ 0.5

C(ϕ) = 0 if p(C1 |ϕ) = y (ϕ) < 0.5.

With this model we can also express degrees of uncertainty about

classification results!

Foundations and Linear Methods 43

Logistic Regression

We derive the function y from Linear Regression.

We make the output of regression probabilistic by passing it through
a “squeezing function” which normalizes the real-valued output to
the range [0.0, 1.0].
The classical choice of the squeezing function is the logistic function
or logistic sigmoid
1
σ(x) = .
1 + e −x

Image source: Wikipedia, Sigmoid Function

Foundations and Linear Methods 44

Logistic Regression

We define

ỹ (ϕ) = wT ϕ + w0
y (ϕ) = σ(ỹ (ϕ)).

Taking y = y (ϕ) = y (ϕ, w, w0 ) as the probability that a sample ϕ

belongs to class 1, we see that
: greater absolute values of y (ϕ) imply less uncertainty about the
classification of the sample ϕ,
: the decision boundary between the classes is given by the hyperplane
ỹ (ϕ) = 0.
This model is called Logistic Regression.

Foundations and Linear Methods 45

Logistic Regression

Consider the following example task.

Note that the classes overlap.
1
ϕ1
0

-1

-1 0 ϕ0 1

Foundations and Linear Methods 46

Logistic Regression

Consider the following example task.

Note that the classes overlap.
This is the decision boundary which 1
the model finds.
ϕ1
0

-1

-1 0 ϕ0 1

Foundations and Linear Methods 46

Logistic Regression

Consider the following example task.

Note that the classes overlap.
This is the decision boundary which 1
the model finds.
ϕ1
Items which are more distant from the
decision boundary are assigned to their 0
respective classes with higher
probability.
-1

-1 0 ϕ0 1

Foundations and Linear Methods 46

Logistic Regression

Consider the following example task.

Foundations and Linear Methods 46

Logistic Regression

We need a training criterion for logistic regression.

This criterion will be based on probabilities, just like the model itself.
We define the likelihood of a single training sample (ϕ, t) as its
probability under the logistic regression model, as a function of the
regression parameters w and w0 :
(
y if t = 1
Lϕ (w, w0 ) = = y t (1 − y )1−t
1 − y if t = 0

where y = y (ϕ, w, w0 ).
Assuming that the training samples are statistically independent, the
likelihood of the training data {(ϕn , tn )}n=1,...,N is
N
Y
L(w, w0 ) = yntn (1 − yn )1−tn .
n=1

Foundations and Linear Methods 47

Logistic Regression

It is computationally simpler to work in logarithmic space, thus we

compute the log-likelihood
N
X
L̃(w, w0 ) = tn ln yn + (1 − tn ) ln(1 − yn ).
n=1

Note that always L̃(w, w0 ) ≤ 0 (why?).

We now define an error function as the negative log-likelihood
N
X
E (w, w0 ) = −L̃(w, w0 ) = − tn ln yn + (1 − tn ) ln(1 − yn ).
n=1

Our training criterion will be to minimize E , which is the same as

maximizing L or L̃: the Maximum Likelihood criterion.

Foundations and Linear Methods 48

Logistic Regression

Unfortunately, in contrast to “normal” linear regression, there is no

Foundations and Linear Methods 49

Logistic Regression

Unfortunately, in contrast to “normal” linear regression, there is no

closed-form solution to determine the parameter vector w and the
bias w0 for logistic regression.
We will get to know a general way to solve such tasks during this
course.
During our lecture on probabilistic modeling, we will also reconsider
the assumptions which underly Logistic Regression, and learn that
under certain conditions it arises very naturally from imposing a
basic probabilistic model.
If there is a decision boundary hyperplane which perfectly separates
C0 and C1 , the classes are said to be separable.
In this case, the model is prone to overfitting : It will estimate the
assignment probabilities badly.
Even data points very close to the decision boundary will be
assigned with high probability to their respective class.
Again, probabilistic modeling offers a way to resolve this problem.
Foundations and Linear Methods 49
Classification: Summary

You have learned about

nonparametric and parametric classification
discriminant functions
representation of a linear decision boundary in feature space
Logistic regression as an example of a linear classifier.

Foundations and Linear Methods 50

Elements of the Foundations of Machine
Learning
Foundations of Machine Learning

We have gotten to know some elementary examples of Machine

Learning algorithms.
Before going on, we introduce some basic concepts which will help
us to formalize what we have seen so far, and what we will get to
know in the future.
We start with the most basic question: How to formulate
mathematically what learning means?
You have already seen that it does not mean to memorize the
training samples!

Foundations and Linear Methods 52

A Framework for Machine Learning

Assume that our data is described by a probability distribution

P(x, t) over inputs and targets, where we assume both inputs and
targets to be real-valued vectors: x ∈ RM , t ∈ RN .
This key idea allows to express, in a mathematically sound way
: the variability of data (even for a fixed class)
: the ambigousness of the mapping between input data and target
: and also our own lack of knowledge about this mapping!
It also gives us a powerful set of tools to create practical algorithms.
The probability calculus is at the heart of many important machine
learning algorithms!
Bousquet: Introduction to Statistical Learning Theory. Machine Learning 2003, LNAI 3176, pp. 169–207, 2004.

Foundations and Linear Methods 53

A Framework for Machine Learning

Assume a probability distribution P(x, t) over inputs and targets.

Let y (x) be a predictor for t. Assuming any loss L(t, y (x)), define
the risk as the expected loss over the entire probability space:
Z
R(y ) = EP(x,t) L(t, y (x)) = L(t, y (x))dP(x, t).

The goal of Machine Learning is now to find the predictor of t given

x which minimizes the risk:

y ∗ = arg min R(y ) = arg min EP(x,t) L(t, y (x)).

y :RM →RN y :RM →RN

Note that by using this probabilistic framework, we avoid referring to

specific data sets (like train and test dataset)!
Vapnik: Principles of Risk Minimization for Learning Theory. Proc. NIPS 1991.

Foundations and Linear Methods 54

The Irreducible Error

Assume for a moment that we know the distribution P(x, t).

In some cases, we can directly obtain y ∗ , for example in the case of
classification:
y ∗ (x) = arg max P(x, t)
t

Q: Will the risk, i.e. the expected loss, be zero for the predictor y ∗ ?

Foundations and Linear Methods 55

The Irreducible Error

Assume for a moment that we know the distribution P(x, t).

In some cases, we can directly obtain y ∗ , for example in the case of
classification:
y ∗ (x) = arg max P(x, t)
t

Q: Will the risk, i.e. the expected loss, be zero for the predictor y ∗ ?
No, because t might not be deterministic for a given x.
: Only if x completely determines t, the risk of the optimal predictor is
zero.
: Mathematically, this means that for all x, P(t|x) = 1 for exactly one
tTrue (x) and P(t|x) = 0 for all other t.

Foundations and Linear Methods 55

The Irreducible Error

The prediction error which stems from the fact that the input does
not completely determine the target is called the irreducible error.
The irreducible error can be considered a form of noise affecting the
inputs, the targets, or both. It is independent of any concrete
method which we use for the prediction of targets.
In practice, such noise can derive from inexact measurements of
data, from unaccounted sources of variation, and many other factors.
In the case of classification, we (usually) map a continuous input to
discrete output, so there will be border regions with unclear class
assignments.
As an example, consider a subset of the famous MNIST corpus of
handwritten digits. Do you think all these examples are clearly the
digit “7”?

Foundations and Linear Methods 56

A Framework for Machine Learning

Clearly, we do not know the true distribution of the data P(x, t).
How can we estimate (and subsequently minimize) the risk?

Foundations and Linear Methods 57

A Framework for Machine Learning

Clearly, we do not know the true distribution of the data P(x, t).
How can we estimate (and subsequently minimize) the risk?
We are given a finite set of observations D = {(xn , tn )}n=1,...,N .
Idea: Approximate expectations over P by averaging over the data,
specifically:
N
1 X
R(y ) = EP(x,t) L(t, y (x)) ≈ L(tn , y (xn )) =: Remp (y ).
N n=1

Remp (y ) is called empirical risk. The observations D are usually

called training data.

Foundations and Linear Methods 57

A Framework for Machine Learning

Foundations and Linear Methods 58

A Framework for Machine Learning

The entire field of Machine Learning deals with minimizing the true
risk, based on estimations using the empirical risk.
We will get to know a large variety of algorithms and methods where
we give the predictor y (x) a specific mathematical form, allowing
minimization of the empirical risk.
However, does the empirical risk always reflect the true risk?
In general, no!
: Remember that we allow any function from the input space to the
target space as a predictor.
: For example, a function which correctly maps all training data points
and is random on all other points achieves zero empirical risk, but
high true risk.
We want to avoid such pathological cases. Which functions are
“reasonable” as predictors?

Foundations and Linear Methods 58

A Framework for Machine Learning

A function like the one on the last slide has a high complexity,
whereas a structured representation of the data should have a much
simpler form.
Thus, we restrict the predictors y (x) which we take into account to
a subset of the functions from RM to RN : the hypothesis class H.
This generalizes and formalizes the concept of regularization which
we covered earlier!
We hope that this helps us to obtain a predictor whose empirical risk
(on the training data) reflects the true risk (on unseen test data).
Clearly, this is not automatically true. . .

Foundations and Linear Methods 59

A Framework for Machine Learning

We now define the optimal predictor within the hypothesis class H:

∗
yH = arg min R(y ) = arg min EP(x,t) L(t, y (x))
y ∈H y ∈H

as well as the empirically optimal predictor given a dataset

D = {(xn , tn )}n=1,...,N :
N
emp 1 X
yH = arg min Remp (y ) = arg min L(tn , y (xn )).
y ∈H y ∈H N n=1

In almost all cases, exactly computing the optimal predictor in H is

intractable, and it needs to be approximated.
All Machine Learning algorithms aim at optimally defining the
hypothesis class H, such that efficient optimization over a wide
range of functions is possible, while retaining generalization ability.

Foundations and Linear Methods 60

A Framework for Machine Learning

Based on the probabilistic framework, many bounds on the

difference between empirical and true risk have been proposed,
typically of the form
r !
cap(H)
∀y ∈ H R(y ) − Remp (y ) < O
N
where N is the number of samples, and cap(H) measures the
“capacity” or “richness” of the function class H (not covered in this
course).
Note that the true risk is typically assumed to be greater than the
empirical risk.
We see that it is necessary to balance the capacity of the hypothesis
class and the number of data samples:
: If there are not enough elements in H, even the optimal predictor
might not be very good.
: If there are too many elements in H, the empirically best predictor
might be far away from the true optimal predictor.
Foundations and Linear Methods 61
Definitions

A predictor whose true risk is close to the empirical risk on the

training data is said to generalize well.
A predictor whose empirical risk is low, but whose true risk is high,
is said to overfit to the training data.
A predictor whose empirical risk is high is said to underfit the data.
Which of the predictors in the figure overfit, and which ones
underfit?

Img src: Bishop, figure 1.4

Foundations and Linear Methods 62

Using Data Sets

We tune our Machine Learning system on the training data,

obtaining the empirical risk. But how can we now estimate the true
risk?
This requires a separate dataset which is not used for tuning the
system, but only for measuring the loss after training: the test data.
In practical scenarios, even this might not guarantee a good estimate
of the true risk: If we perform system tuning many times with
slightly different metaparameters, always reusing the same test data,
we implicitly tune the system to this test data.
Good practice requires to use three different data sets:
: The training data is used for tuning the system.
: The validation data is used to determine the quality of the trained
system and to guide further optimization steps.
: The test data is used only when the system is completely tuned to
determine the final quality (e.g. for writing a final report or similar).

Foundations and Linear Methods 63

Foundations: Summary

In this section, you have gotten to know

a mathematical formulation of the task of Machine Learning
generalization, overfitting, and underfitting revisited
splitting your data set to obtain statistically valid results.

Foundations and Linear Methods 64

3 Polyreg
No ratings yet
3 Polyreg
22 pages
Curve Fitting
No ratings yet
Curve Fitting
38 pages
Lec19 Introduction2LinearRegression
No ratings yet
Lec19 Introduction2LinearRegression
53 pages
Lecture 1-Polynomial Curve Fitting
No ratings yet
Lecture 1-Polynomial Curve Fitting
47 pages
Neural Networks for Engineers
No ratings yet
Neural Networks for Engineers
44 pages
ML Unit 1 Part 2
No ratings yet
ML Unit 1 Part 2
18 pages
Unit-1 PCF
No ratings yet
Unit-1 PCF
18 pages
Machinelearning
No ratings yet
Machinelearning
59 pages
Least Squares Fit To Polynomial
No ratings yet
Least Squares Fit To Polynomial
12 pages
Lecture 2 2022
No ratings yet
Lecture 2 2022
34 pages
Linear Regression
No ratings yet
Linear Regression
20 pages
Machine Learning Regression Techniques
No ratings yet
Machine Learning Regression Techniques
16 pages
21Csc305P-Machine Learning: Offline
No ratings yet
21Csc305P-Machine Learning: Offline
8 pages
GML Slides 2024 04 29
No ratings yet
GML Slides 2024 04 29
206 pages
Polynomial Curve Fitting Lecture
No ratings yet
Polynomial Curve Fitting Lecture
43 pages
Physics 114: Lecture 17 Least Squares Fit To Polynomial: Dale E. Gary
No ratings yet
Physics 114: Lecture 17 Least Squares Fit To Polynomial: Dale E. Gary
12 pages
Polynomial Fitting Techniques
No ratings yet
Polynomial Fitting Techniques
12 pages
Lecture 3 - Regression
No ratings yet
Lecture 3 - Regression
47 pages
BITS F464 ML Lecture Notes
No ratings yet
BITS F464 ML Lecture Notes
86 pages
QSRI Lecture1
No ratings yet
QSRI Lecture1
45 pages
Curs 1 SSL - Introduction
No ratings yet
Curs 1 SSL - Introduction
57 pages
Skript Opt Mach
No ratings yet
Skript Opt Mach
49 pages
Notes5 Regression
No ratings yet
Notes5 Regression
14 pages
Machine Learning for Math Students
No ratings yet
Machine Learning for Math Students
60 pages
Pattern Recognition Machine Learning: Chapter 3: Linear Models For Regression
100% (1)
Pattern Recognition Machine Learning: Chapter 3: Linear Models For Regression
48 pages
Week 2
No ratings yet
Week 2
43 pages
Machine Learning and Data Mining
No ratings yet
Machine Learning and Data Mining
88 pages
Regression Analysis
No ratings yet
Regression Analysis
11 pages
CSE 440 AI Volume1 (p1)
No ratings yet
CSE 440 AI Volume1 (p1)
4 pages
03 Linear Regression
No ratings yet
03 Linear Regression
54 pages
Linear Regression in Machine Learning
No ratings yet
Linear Regression in Machine Learning
38 pages
Lecture16 Crossvalidation
No ratings yet
Lecture16 Crossvalidation
32 pages
Lect 1
No ratings yet
Lect 1
24 pages
PRML Exercise Solutions Guide
No ratings yet
PRML Exercise Solutions Guide
87 pages
ML 3
No ratings yet
ML 3
66 pages
Lecture3 Upload
No ratings yet
Lecture3 Upload
28 pages
Vahid
No ratings yet
Vahid
18 pages
Lecture 13 - Least Squares
No ratings yet
Lecture 13 - Least Squares
28 pages
Introduction: Geometric Models: - Page 1 of 25
No ratings yet
Introduction: Geometric Models: - Page 1 of 25
25 pages
Introduction ML
No ratings yet
Introduction ML
65 pages
Week 4 Linear Regression
No ratings yet
Week 4 Linear Regression
38 pages
Lecture 1 2022
No ratings yet
Lecture 1 2022
55 pages
Polynomial Regression
No ratings yet
Polynomial Regression
16 pages
Lecture 1
No ratings yet
Lecture 1
56 pages
Linear Regression
No ratings yet
Linear Regression
91 pages
Machine Learning Lecture 1
No ratings yet
Machine Learning Lecture 1
5 pages
02 - Overfitting Underfitting and Model Selection - en
No ratings yet
02 - Overfitting Underfitting and Model Selection - en
1 page
Unit 02 - Nonlinear Classification, Linear Regression, Collaborative Filtering - MD
No ratings yet
Unit 02 - Nonlinear Classification, Linear Regression, Collaborative Filtering - MD
14 pages
2 LinearRegression2
No ratings yet
2 LinearRegression2
45 pages
Lecture 5 Curve FittingMS
No ratings yet
Lecture 5 Curve FittingMS
63 pages
Model Fitting
No ratings yet
Model Fitting
19 pages
RADL TQKhoat
No ratings yet
RADL TQKhoat
50 pages
Lecture Slides - Linear Regression (2025)
No ratings yet
Lecture Slides - Linear Regression (2025)
45 pages
Machine Learning Lecture Notes Undergrad
No ratings yet
Machine Learning Lecture Notes Undergrad
19 pages
Least-Squares Data Fitting Guide
No ratings yet
Least-Squares Data Fitting Guide
17 pages
PRML Slides 3
No ratings yet
PRML Slides 3
57 pages
Statlearn PDF
No ratings yet
Statlearn PDF
123 pages
Lecture 02
No ratings yet
Lecture 02
33 pages
Machine Learning (CSO851) - Lecture 02
No ratings yet
Machine Learning (CSO851) - Lecture 02
74 pages
Slides NN
No ratings yet
Slides NN
59 pages
Convolutional Networks 2024
No ratings yet
Convolutional Networks 2024
44 pages
Amanote
No ratings yet
Amanote
33 pages
Analysis: Understanding Multicellular Function and Disease With Human Tissue-Specific Networks
No ratings yet
Analysis: Understanding Multicellular Function and Disease With Human Tissue-Specific Networks
11 pages
WJ o JWJWJWJ
No ratings yet
WJ o JWJWJWJ
3 pages
(Ebook) Psychophysiological Thought Reading by Banachek ISBN 9780970643810, 0970643810 Ready To Read
No ratings yet
(Ebook) Psychophysiological Thought Reading by Banachek ISBN 9780970643810, 0970643810 Ready To Read
167 pages
Creating Stems: Complete Ableton Live Guide To
No ratings yet
Creating Stems: Complete Ableton Live Guide To
8 pages
LOTTERY
50% (4)
LOTTERY
2 pages
Analyst IDC MarketScape ServiceNow
No ratings yet
Analyst IDC MarketScape ServiceNow
9 pages
Unit 1
100% (3)
Unit 1
13 pages
1 Familiarisation of Xilinx Ise
No ratings yet
1 Familiarisation of Xilinx Ise
6 pages
2.8" TFT Touchscreen: Created by Lady Ada
No ratings yet
2.8" TFT Touchscreen: Created by Lady Ada
34 pages
Chapter 8
No ratings yet
Chapter 8
19 pages
Jadual Mba Odl Sem 1 2022-2023 (Update 13102022) 3
No ratings yet
Jadual Mba Odl Sem 1 2022-2023 (Update 13102022) 3
16 pages
SAP MDG RFM To Upload
No ratings yet
SAP MDG RFM To Upload
15 pages
ASME PCC-1 Appendix-O-Calculation
50% (2)
ASME PCC-1 Appendix-O-Calculation
23 pages
Taniguchi 2001
No ratings yet
Taniguchi 2001
14 pages
Letter Structure and Examples
No ratings yet
Letter Structure and Examples
6 pages
Manual Tribeca
100% (1)
Manual Tribeca
2,453 pages
A Simplified Model To Predict Smoke Movement in Vertical Shafts During A High-Rise Structural Fire
No ratings yet
A Simplified Model To Predict Smoke Movement in Vertical Shafts During A High-Rise Structural Fire
10 pages
TQM Supplier Selection
No ratings yet
TQM Supplier Selection
13 pages
25 06 23 JR Iit Star Co Scmodel A Jee Adv 2017p I Wat 9 QP
No ratings yet
25 06 23 JR Iit Star Co Scmodel A Jee Adv 2017p I Wat 9 QP
20 pages
Ruxandra-Laura Boşilcă, Susana Ferreira, Barry J. Ryan - Routledge Handbook of Maritime Security (2022, Routledge) - Libgen - Li
No ratings yet
Ruxandra-Laura Boşilcă, Susana Ferreira, Barry J. Ryan - Routledge Handbook of Maritime Security (2022, Routledge) - Libgen - Li
409 pages
Study Guide - 1 - Math 102 DC Dianne Oliver
No ratings yet
Study Guide - 1 - Math 102 DC Dianne Oliver
8 pages
Chemistry An Atoms First Approach 2nd Edition Steven S Zumdahl Susan A Zumdahl Digital Access
100% (1)
Chemistry An Atoms First Approach 2nd Edition Steven S Zumdahl Susan A Zumdahl Digital Access
405 pages
Blue Brain Report
No ratings yet
Blue Brain Report
25 pages
lLOWARA SV 3392N - 5 - P - Feb08 - EdD
No ratings yet
lLOWARA SV 3392N - 5 - P - Feb08 - EdD
10 pages
Recent Changes in Software Patenting in India
No ratings yet
Recent Changes in Software Patenting in India
4 pages
Case Study - Bakery House App
No ratings yet
Case Study - Bakery House App
4 pages
GSM Switching Services and Protocols p1 8309
0% (1)
GSM Switching Services and Protocols p1 8309
8 pages
EM 5.7 Parts Manual 2013
No ratings yet
EM 5.7 Parts Manual 2013
37 pages
SAP HR Forms Editor Tutorial - SAP Stack
No ratings yet
SAP HR Forms Editor Tutorial - SAP Stack
2 pages
Robbins mgmt14 PPT 01 PDF
100% (1)
Robbins mgmt14 PPT 01 PDF
40 pages
Criterios para Flanges API SPEC 6A
No ratings yet
Criterios para Flanges API SPEC 6A
33 pages