CS-E4715 Supervised Machine Learning
Lecture 4: Model selection
Model selection and Occam’s Razor
• Occam’s razor principle ”Entities
should not be multiplied
unnecessarily” captures the trade-off
between generalization error and
complexity
• Model selection in machine learning
can be seen to implement Occam’s
razor
William of Ockham (1285–1347): ”Pluralitas non est ponenda sine necessitate”
1
Stochastic scenario
• The analysis so far assumed that the labels are deterministic
functions of the input
• Stochastic scenario relaxes this assumption by assuming the output
is a probabilistic function of the input
• The input and output are generated by a joint probability distribution
D over X × Y
• This setup covers different cases when the same input x can have
different labels y
• In the stochastic scenario, there may not always exist a target
concept f that has zero generalization error R(f ) = 0
2
Sources of stochasticity
The stochastic dependency between input and output can arise from
various sources
• Imprecision in recording the input data (e.g. measurement error),
shifting our examples in the input space
• Errors in the labeling of the training data (e.g. human annotation
errors), flipping the labels of some examples
• There may be additional variables that affect the labels that are not
part of our input data
All of these sources could be characterized as adding noise (or hiding
signal)
3
Noise and complexity
• The effect of noise is typically to make the decision boundary more
complex
• To obtain a consistent hypothesis on noisy data, we can use a more
complex model e.g. a spline curve instead of a hyperplane
• But this may not give a better generalization error, if we end up
merely re-classifying points corrupted by noise
4
Noise and complexity
In practice, we need to balance the complexity of the hypothesis and the
empirical error carefully
• A too simple model does not allow the optimal empirical error to be
obtained; this is called underfitting
• A too complex model may obtain zero empirical error, but have
worse than optimal generalization error; this is called overfitting
5
Controlling complexity
Two general approaches to control the complexity
• Selecting a hypothesis class, e.g. the maximum degree of polynomial
to fit the regression model
• Regularization: penalizing the use of too many parameters, e.g. by
bounding the norm of the weights (used in SVMs and neural
networks)
6
Measuring complexity
What is a good measure of complexity of a hypothesis class?
We have already looked at some measures:
• Number of distinct hypotheses |H|: works for finite H (e.g. models
built from binary data), but not for infinite classes (e.g. geometric
hypotheses such as polygons, hyperplanes, ellipsoids)
• Vapnik-Chervonenkis dimension (VCdim): the maximum number of
examples that can be classified in all possible ways by choosing
different hypotheses h ∈ H
• Rademacher complexity: measures the capability to classify after
randomizing the labels
Lots of other complexity measures and model selection methods exist, cf.
https://en.wikipedia.org/wiki/Model_selection (these are not
within the scope of this course)
7
Bayes error
Bayes error
• In the stochastic scenario, there is a minimal non-zero error for any
hypothesis, called the Bayes error
• Bayes error is the minimum achievable error, given a distribution D
over X × Y, by measurable functions h : X → Y
R∗ = inf_{h measurable} R(h)
• Note that we cannot actually compute R∗:
• We cannot compute the generalization error R(h) exactly (c.f. PAC
learning)
• We cannot evaluate all measurable functions (intuitively: the hypothesis
class containing all functions that are mathematically well-behaved
enough to allow defining probabilities on them)
• Bayes error serves as a theoretical measure of the best possible
performance
8
Bayes error and noise
• A hypothesis with R(h) = R ∗ is called the Bayes classifier
• The Bayes classifier can be defined in terms of conditional
probabilities as
hBayes(x) = argmax_{y∈{0,1}} Pr(y|x)
• The average error made by the Bayes classifier at x ∈ X is called the
noise
noise(x) = min(Pr(1|x), Pr(0|x))
• Its expectation E[noise(x)] = R∗ is the Bayes error
• Similarly to the Bayes error, the Bayes classifier is a theoretical tool, not
something we can compute in practice
9
Bayes error example
• We have a univariate input space:
x ∈ {1, 2, 3}, and two classes
y ∈ {0, 1}.
• We assume a uniform distribution of
data: Pr (x) = 1/3
• The Bayes classifier will predict the most probable class for each
possible input value

X             1    2    3
P(Y = 1|X)    0.6  0.2  0.7
P(Y = 0|X)    0.4  0.8  0.3
hBayes(x)     1    0    1
noise(x)      0.4  0.2  0.3
The Bayes error is the expectation of the noise over the input domain X:
R∗ = Σ_x P(x) noise(x) = 1/3 · (0.4 + 0.2 + 0.3) = 0.3
10
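These numbers can be reproduced in a few lines of Python; a minimal sketch (the probability table comes from the slide above, the variable names are mine):

# Bayes classifier, noise and Bayes error for the three-point example
p_x = {1: 1/3, 2: 1/3, 3: 1/3}                           # uniform input distribution P(x)
p_y1 = {1: 0.6, 2: 0.2, 3: 0.7}                          # P(Y = 1 | X = x)

h_bayes = {x: int(p >= 0.5) for x, p in p_y1.items()}    # predict the more probable class
noise = {x: min(p, 1 - p) for x, p in p_y1.items()}      # error of the Bayes classifier at x
bayes_error = sum(p_x[x] * noise[x] for x in p_x)

print(h_bayes)        # {1: 1, 2: 0, 3: 1}
print(bayes_error)    # 0.3 (up to floating point)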
Decomposing the error of a hypothesis
The excess error of a hypothesis compared to the Bayes error R ∗ can be
decomposed as:
R(h) − R ∗ = estimation + approximation
• estimation = R(h) − R(h∗) is the excess generalization error h has
over the optimal hypothesis h∗ = argmin_{h′∈H} R(h′) in the
hypothesis class H
• approximation = R(h∗ ) − R ∗ is the approximation error due to
selecting the hypothesis class H instead of the best possible
hypothesis class (which is generally unknown to us)
Note: The approximation error is sometimes called the bias and the
estimation error the variance, and the decomposition is then called the
bias-variance decomposition
11
Decomposing the error of a hypothesis
Figure on the right depicts the concepts:
• hBayes is the Bayes classifier, with
R(hBayes ) = R ∗
• h∗ = argmin_{h∈H} R(h) is the hypothesis
with the lowest generalization error in
the hypothesis class H
• h has both non-zero estimation error
R(h) − R(h∗ ) and approximation error
R(h∗ ) − R(hBayes )
12
Example: Approximation error
• Assume the hypothesis class of univariate threshold functions:
H = {ha,θ : X → {0, 1} | a ∈ {−1, +1}, θ ∈ ℝ},
where ha,θ(x) = 1{ax ≥ θ} (i.e. ha,θ(x) = 1 if ax ≥ θ, and 0 otherwise)
• The classifier from H with the lowest generalization error separates
the data with x = 3 from the data with x < 3 by, e.g., choosing
a = 1, θ = 2.5, giving h∗(1) = h∗(2) = 0, h∗(3) = 1
• The generalization error is
R(h∗) = 1/3 · (0.6 + 0.2 + 0.3) ≈ 0.367

X             1    2    3
P(y = 1|x)    0.6  0.2  0.7
P(y = 0|x)    0.4  0.8  0.3
h∗(x)         0    0    1
• Approximation error satisfies:
approximation = R(h∗ ) − R ∗ ≈ 0.367 − 0.3 ≈ 0.067
13
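Because the hypothesis class is so small, h∗ can be found by brute force; a sketch, assuming it is enough to try thresholds placed between and beyond the three input values:

# Generalization error R(h) of every threshold classifier h_{a,theta}(x) = 1{ax >= theta}
p_x = {1: 1/3, 2: 1/3, 3: 1/3}
p_y1 = {1: 0.6, 2: 0.2, 3: 0.7}                          # P(y = 1 | x)

def gen_error(a, theta):
    # R(h) = sum_x P(x) * P(y != h(x) | x)
    return sum(p_x[x] * ((1 - p_y1[x]) if a * x >= theta else p_y1[x]) for x in p_x)

candidates = [(a, t + 0.5) for a in (-1, 1) for t in range(-4, 4)]
a_star, theta_star = min(candidates, key=lambda c: gen_error(*c))
print(a_star, theta_star, gen_error(a_star, theta_star))  # 1 2.5 0.3666...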
Example: estimation error
• Consider now a training set, shown in the table below, consisting of
m = 13 examples from the same distribution as before

x        1   2   3
y = 1    3   1   3
y = 0    1   3   2
Σ        4   4   5
• The classifier h ∈ H with the lowest training error, R̂(h) = 5/13 ≈ 0.385,
separates the data with x = 1 from the data with x > 1 by, e.g., choosing
a = −1, θ = −1.5, giving h(1) = 1, h(2) = h(3) = 0
• However, the generalization error of h is
R(h) = 1/3 · (0.4 + 0.2 + 0.7) ≈ 0.433
• The estimation error satisfies
estimation = R(h) − R(h∗ ) ≈ 0.433 − 0.367 ≈ 0.067
• The decomposition of the error of h is given by:
R(h) = R ∗ + approximation + estimation ≈ 0.3 + 0.067 + 0.067 ≈ 0.433
14
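The same brute-force search on the 13-example sample makes the gap between training and generalization error concrete; a self-contained sketch (the counts come from the table above):

# Empirical risk minimization over the threshold class on the m = 13 sample,
# followed by the generalization error of the selected hypothesis
counts = {1: (3, 1), 2: (1, 3), 3: (3, 2)}               # x -> (# examples with y = 1, # with y = 0)
p_x = {1: 1/3, 2: 1/3, 3: 1/3}
p_y1 = {1: 0.6, 2: 0.2, 3: 0.7}
m = sum(n1 + n0 for n1, n0 in counts.values())           # 13

def predict(a, theta, x):
    return int(a * x >= theta)                           # h_{a,theta}(x) = 1{ax >= theta}

def train_error(a, theta):
    return sum((n0 if predict(a, theta, x) else n1) for x, (n1, n0) in counts.items()) / m

def gen_error(a, theta):
    return sum(p_x[x] * ((1 - p_y1[x]) if predict(a, theta, x) else p_y1[x]) for x in p_x)

candidates = [(a, t + 0.5) for a in (-1, 1) for t in range(-4, 4)]
a_hat, theta_hat = min(candidates, key=lambda c: train_error(*c))
print(a_hat, theta_hat, train_error(a_hat, theta_hat))   # -1 -1.5, training error 5/13 ~ 0.385
print(gen_error(a_hat, theta_hat))                       # ~ 0.433, worse than R(h*) ~ 0.367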
Error decomposition and model selection
• We can bound the estimation error
estimation = R(h) − R(h∗ ) by
generalization bounds arising from
PAC theory
• However, we cannot do the same for
the approximation error since
approximation = R(h∗ ) − R ∗ remains
unknown to us
• In other words, we do not know how
good the hypothesis class is for
approximating the label distribution
15
Complexity and model selection
In model selection, we have a trade-off: increasing the complexity of the
hypothesis class
• decreases the approximation
error as the class is more
likely to contain a
hypothesis with error close to
the Bayes error
• increases the estimation error
as finding a good
hypothesis becomes harder
and the generalization
bounds become looser (due to
increasing log |H| or the VC
dimension)
To minimize the generalization error over all hypothesis classes, we
should find a balance between the two terms
16
Learning with complex hypothesis classes
• One strategy for model selection is to initially choose a very complex
hypothesis class H with zero or very low empirical risk
• Assume in addition that the class can be decomposed into a union of
increasingly complex hypothesis classes H = ∪_{γ∈Γ} Hγ, parametrized
by γ, e.g.
• γ = number of variables in a
boolean monomial
• γ = degree of a polynomial
function
• γ = size of a neural network
• γ = regularization parameter
penalizing large weights
• We expect the approximation error to go down and the estimation
error to go up when γ increases
• Model selection entails choosing γ ∗ that gives the best trade-off
17
Half-time poll: Inequalities on different errors
Let R∗ denote the Bayes error, R(h) the generalization error and R̂S (h)
the training error of hypothesis h on training set S.
Which of the inequalities always hold true?
1. R ∗ ≤ R̂S (h)
2. R̂S (h) ≤ R(h)
3. R ∗ ≤ R(h)
Answer to the poll in Mycourses by 11:15: Go to Lectures page and scroll
down to ”Lecture 4 poll”:
Answers are anonymous and do not affect grading of the course.
Regularization-based algorithms
Regularization-based algorithms
• Regularization is a technique that penalizes large weights in a model
based on weighting input features:
• linear regression, support vector machines and neural networks all
fall under this category
• An important example is the class of linear functions x ↦ wᵀx
• The classes are parametrized by the norm ‖w‖ of the weight vector,
bounded by γ: Hγ = {x ↦ wᵀx : ‖w‖ ≤ γ}
• The norm is typically either
• the L2 norm (also called the Euclidean norm or 2-norm):
‖w‖₂ = (∑_{j=1}^n w_j²)^{1/2}, used e.g. in support vector machines
and ridge regression
• the L1 norm (also called the Manhattan norm or 1-norm):
‖w‖₁ = ∑_{j=1}^n |w_j|, used e.g. in LASSO regression
18
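As a quick sanity check of the two definitions, both norms are available directly in NumPy (the weight vector below is an arbitrary example):

import numpy as np

w = np.array([0.5, -2.0, 1.5])           # arbitrary weight vector for illustration
print(np.linalg.norm(w, ord=2))          # L2 norm: sqrt(0.25 + 4.0 + 2.25) ~ 2.55
print(np.linalg.norm(w, ord=1))          # L1 norm: 0.5 + 2.0 + 1.5 = 4.0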
Regularization-based algorithms
• For the L2 -norm case, we have an important computational shortcut:
the empirical Rademacher complexity of this class can be bounded
analytically!
• Let S ⊂ {x : ‖x‖ ≤ r} be a sample of size m and let
Hγ = {x ↦ wᵀx : ‖w‖₂ ≤ γ}. Then
R̂S(Hγ) ≤ √(r²γ²/m) = rγ/√m
• Thus the Rademacher complexity depends linearly on the upper
bound γ on the norm of the weight vector, since r and m are constant
for any fixed training set
• We can use ‖w‖ as an efficiently computable upper bound of R̂m(Hγ)
19
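A small numeric illustration of how the bound rγ/√m behaves; the values of r, γ and m below are arbitrary:

import math

r, gamma, m = 1.0, 10.0, 1000            # data radius, bound on the weight norm, sample size
print(r * gamma / math.sqrt(m))          # ~ 0.316: linear in gamma, shrinks as 1/sqrt(m)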
Regularization-based algorithms
• A regularized learning problem is to solve
argmin_{h∈H} R̂S(h) + λΩ(h)
• R̂S (h) is the empirical error
• Ω(h) is the regularization term which increases when the complexity
of the hypothesis class increases
• λ is a regularization parameter, which is usually set by
cross-validation
• For the linear functions h : x ↦ wᵀx, usually Ω(h) = ‖w‖₂² or
Ω(h) = ‖w‖₁
• We will study regularization-based algorithms during the next part of
the course
20
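As one concrete instance of the objective R̂S(h) + λΩ(h), here is a minimal ridge regression sketch (linear model, squared loss, Ω(h) = ‖w‖₂²) on synthetic data; this is only an illustration of the general scheme, not a reference implementation:

import numpy as np

# Ridge regression: minimize sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2,
# which has the closed-form solution w = (X^T X + lam * I)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

for lam in [0.01, 1.0, 100.0]:           # regularization parameter lambda
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(lam, np.linalg.norm(w))        # larger lam -> smaller weight norm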
Model selection using a
validation set
Model selection by using a validation set
We can use the given dataset for empirical model selection if the
algorithm has input parameters (hyperparameters) that define or affect the
model complexity
• Split the data into training, validation and test sets
• For the hyperparameters, use grid search to find the parameter
combination that gives the best performance on the validation set
• Retrain a final model using the optimal parameter combination, use
both the training and validation data for training
• Evaluate the performance of the final model on the test set
21
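A sketch of this workflow with scikit-learn; the synthetic dataset, the SVM classifier and the grid of C values are placeholders chosen for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split into training (60%), validation (20%) and test (20%) sets
X, y = make_classification(n_samples=500, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

# Grid search over one hyperparameter: keep the value with the best validation accuracy
best_C = max([0.01, 0.1, 1, 10, 100],
             key=lambda C: SVC(C=C).fit(X_tr, y_tr).score(X_val, y_val))

# Retrain on training + validation data, then evaluate once on the test set
final_model = SVC(C=best_C).fit(X_trval, y_trval)
print(best_C, final_model.score(X_test, y_test))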
Grid search
• Grid search is a technique frequently used to optimize
hyperparameters, including those that define the complexity of the
models
• In its basic form it goes through all combinations of parameter
values, given a set of candidate values for each parameter
• For two parameters, the set of value combinations (v, u) ∈ V × U,
where V and U are the sets of candidate values for the two parameters,
defines a two-dimensional grid to be searched
• Even more parameters can be optimized, but the exhaustive search
becomes computationally hard due to the exponentially growing
search space

Figure by Alexander Elvers - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=842554
22
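In code, the grid is simply the Cartesian product of the candidate value sets; a minimal sketch with two illustrative parameter grids (scikit-learn's GridSearchCV wraps the same enumeration together with cross-validation):

from itertools import product

C_values = [0.1, 1, 10]                       # candidate values for the first parameter
gamma_values = [0.01, 0.1, 1]                 # candidate values for the second parameter
grid = list(product(C_values, gamma_values))  # all (C, gamma) pairs, |V| * |U| = 9 combinations
print(len(grid))                              # each combination is evaluated on the validation set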
Model selection by using a validation set
• The need for the validation set comes from the need to avoid
overfitting
• If we only use a simple training/test split and select the
hyperparameter values by repeated evaluation on the test set, the
performance estimate will be optimistic
• A reliable performance estimate can only be obtained from the test
set
23
How large should the training set be in comparison to the validation set?
• The larger the training set, the better the generalization error will be
(e.g. by PAC theory)
• The larger the validation set, the less variance there is in the test
error estimate.
• When the dataset is small, the training set is generally taken to be as
large as possible, typically 90% or more of the total
• When the dataset is large, the training set size is often taken to be as
large as the computational resources allow
24
Stratification
• Class distributions of the training and validation sets should be as
similar to each other as possible, otherwise there will be extra
unwanted variance
• When the data contains classes with a very low number of examples,
random splitting might result in a class having no examples in the
validation set
• Stratification is a process that tries to ensure similar class
distributions across the different sets
• A simple stratification approach is to split each class separately into
training and validation parts and then merge the class-specific
training sets into a global training set and the class-specific validation
sets into a global validation set (see the sketch below)
25
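In scikit-learn, stratification is built into the splitting utilities; a sketch on an imbalanced synthetic dataset (roughly 95% / 5% classes), where the dataset and split sizes are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
# Passing the labels to `stratify` keeps the class proportions similar in both parts
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print(np.bincount(y_tr) / len(y_tr))     # roughly [0.95, 0.05]
print(np.bincount(y_val) / len(y_val))   # roughly [0.95, 0.05]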
Cross-validation
The need of multiple data splits
One split of data into training, validation and test sets may not be
enough, due to randomness:
• The training and validation sets might be small and contain noise or
outliers
• There might be some randomness in the training procedure (e.g.
initialization)
• We need to fight the randomness by averaging the evaluation
measure over multiple (training, validation) splits
• The best hyperparameter values are chosen as those that have the
best average performance over the n validation sets.
26
Generating multiple data splits
• Let us first consider generating a number of training and validation
set pairs, after first setting aside a separate test set
• Given a dataset S, we would like to generate n random splits into
training and validation set
• Two general approaches:
• Repeated random splitting
• n-fold cross-validation
27
n-Fold Cross-Validation
• The dataset S is split randomly into n equal-sized parts (or folds)
• We keep one of the n folds as the validation set (light blue in the
Figure) and combine the remaining n − 1 folds to form the training
set for the split
• n = 5 or n = 10 are typical numbers used in practice
28
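A sketch of n-fold cross-validation for a single hyperparameter value; the synthetic dataset and the SVM classifier are placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = SVC(C=1.0).fit(X[train_idx], y[train_idx])   # train on the remaining 4 folds
    scores.append(model.score(X[val_idx], y[val_idx]))   # evaluate on the held-out fold
print(np.mean(scores))   # average validation accuracy; repeated for each candidate hyperparameter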
Leave-one-out cross-validation (LOO)
• The extreme case of cross-validation is leave-one-out (LOO): given a
dataset of m examples, only one example is left out as the validation
set and training uses the remaining m − 1 examples
• This gives an unbiased estimate of the average generalization error
over samples of size m − 1 (Mohri et al. 2018, Theorem 5.4)
• However, it is computationally demanding to compute if m is large
29
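Leave-one-out is just n-fold cross-validation with n = m; scikit-learn exposes it directly (the tiny dataset below is only for illustration):

from sklearn.model_selection import LeaveOneOut

X = [[1], [2], [3], [4]]                 # m = 4 examples
loo = LeaveOneOut()
print(loo.get_n_splits(X))               # 4: one model is trained per left-out example
for train_idx, val_idx in loo.split(X):
    print(train_idx, val_idx)            # e.g. [1 2 3] [0], [0 2 3] [1], ...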
Nested cross-validation
• n-fold cross-validation gives
us a well-founded way for
model selection
• However, only using a single
test set may result in
unwanted variation
• Nested cross-validation solves
this problem by using two
cross-validation loops
30
Nested cross-validation
The dataset is initially divided into n outer folds (n = 5 in the figure)
• Outer loop uses 1 fold at a
time as a test set, and the
rest of the data is used in the
inner fold
• Inner loop splits the
remaining examples into k
folds, 1 fold for validation,
k − 1 for training (k = 2 in
the figure)
The average performance over the n test sets is computed as the final
performance estimate
31
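A nested cross-validation sketch matching the figure's setup (n = 5 outer folds, k = 2 inner folds); the dataset, classifier and parameter grid are placeholders:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
# Inner loop: GridSearchCV selects the hyperparameter using k = 2 folds
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                     cv=KFold(n_splits=2, shuffle=True, random_state=0))
# Outer loop: estimate the performance of the whole selection procedure on n = 5 test folds
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(outer_scores.mean())   # final performance estimate, averaged over the outer test folds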
Summary
• Model selection concerns the trade-off between model complexity
and empirical error on training data
• Regularization-based methods are based on a continuous
parametrization of the complexity of the hypothesis classes
• Empirical model selection can be achieved by grid search on a
validation dataset
• Various cross-validation schemes can be used to tackle the variance
of the performance estimates
32