Lecture 12

Statistical Learning: First Steps

Sasha Rakhlin

Oct 17, 2019

Outline

Setup

Perceptron

What is Generalization?

log(1 + 2 + 3) = log(1) + log(2) + log(3)

log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)

log(2 + 2) = log(2) + log(2)

log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)

Supervised Learning: data S = {(X1 , Y1 ), . . . , (Xn , Yn )} are i.i.d. from
unknown distribution P.

Learning algorithm: a mapping {(X1 , Y1 ), . . . , (Xn , Yn )} ↦ f̂n .

Goals:
▸ Prediction: small expected loss

L(f̂n ) = E_{X,Y} ℓ(Y, f̂n (X)).

Here (X, Y) ∼ P. Interpretation: good prediction on a random example from the same population.
▸ Estimation: small ∥f̂n − f∗ ∥, or ∥θ̂ − θ∗ ∥, where f∗ or θ∗ are parameters of P (e.g. regression function f∗ (x) = E[Y∣X = x], or f∗ (x) = ⟨θ∗ , x⟩, etc).

In this course, we mostly focus on prediction, but will also outline connections between prediction and estimation.

Why not estimate the underlying distribution P first?
This is in general a harder problem than prediction. Consider classification.
We might be attempting to learn parts/properties of the distribution that
are irrelevant, while all we care about is the “boundary” between the two
classes.

Key difficulty: our goals are in terms of unknown quantities related to
unknown P. Have to use empirical data instead. Purview of statistics.
For instance, we can calculate the empirical loss of f ∶ X → Y

L̂(f) = (1/n) ∑_{i=1}^{n} ℓ(Yi , f(Xi ))
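As an added illustration (not from the slides), the following numpy sketch computes L̂(f) under the indicator loss for a fixed rule f on a training sample, and compares it with a Monte Carlo estimate of the expected loss L(f); the toy distribution P and the rule f are assumptions made only for this example.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_P(n):
        # Assumed toy distribution P: X ~ Uniform[-1, 1],
        # Y = 1 with probability 0.5 + 0.4 * X, else Y = 0.
        X = rng.uniform(-1.0, 1.0, size=n)
        Y = (rng.uniform(size=n) < 0.5 + 0.4 * X).astype(int)
        return X, Y

    def f(x):
        # A fixed classification rule (not learned from data).
        return (x >= 0).astype(int)

    def zero_one_loss(y, yhat):
        # Indicator loss, one particular choice of the generic loss ell.
        return (y != yhat).astype(float)

    # Empirical loss L_hat(f) on a training sample of size n.
    Xtr, Ytr = sample_P(n=50)
    L_hat = zero_one_loss(Ytr, f(Xtr)).mean()

    # Monte Carlo estimate of the expected loss L(f) on fresh draws from P.
    Xte, Yte = sample_P(n=200_000)
    L_approx = zero_one_loss(Yte, f(Xte)).mean()

    print(f"empirical loss  {L_hat:.3f}")
    print(f"expected loss   {L_approx:.3f}  (Monte Carlo)")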

The function x ↦ f̂n (x) = f̂n (x; X1 , Y1 , . . . , Xn , Yn ) is random, since it
depends on the random data S = (X1 , Y1 , . . . , Xn , Yn ). Thus, the risk

L(f̂n ) = E [ℓ(f̂n (X), Y) ∣ S]
        = E [ℓ(f̂n (X; X1 , Y1 , . . . , Xn , Yn ), Y) ∣ S]

is a random variable. We might aim for EL(f̂n ) small, or L(f̂n ) small with
high probability (over the training data).
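To see concretely that L(f̂n ) is a random variable, the sketch below (an added illustration with an assumed toy distribution) draws many independent training sets S, fits a deliberately crude f̂n on each, and records the resulting risk; because P is fully specified here, each risk can be evaluated exactly.

    import numpy as np

    rng = np.random.default_rng(1)

    # Assumed toy distribution (for illustration only): X ~ Uniform[0, 1] and,
    # independently of X, Y ~ Bernoulli(0.6).
    P_Y1 = 0.6

    def sample_S(n):
        X = rng.uniform(size=n)
        Y = (rng.uniform(size=n) < P_Y1).astype(int)
        return X, Y

    def fit_majority(Ytr):
        # A deliberately crude learning algorithm: f_hat_n predicts the majority
        # training label, ignoring x entirely.
        return int(Ytr.mean() >= 0.5)

    def true_risk(label):
        # Because P is known here, L(f_hat_n) can be evaluated exactly:
        # under the indicator loss it is P(Y != label).
        return 1.0 - P_Y1 if label == 1 else P_Y1

    risks = np.array([true_risk(fit_majority(sample_S(20)[1])) for _ in range(1000)])
    print("distinct values of L(f_hat_n):", np.unique(risks))
    print("mean over training sets (approximates E L(f_hat_n)):", risks.mean())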

Quiz: what is random here?

1. L̂(f) for a given fixed f
2. f̂n
3. L̂(f̂n )
4. L(f̂n )
5. L(f) for a given fixed f

It is important that these are understood before we proceed further.

Theoretical analysis of performance is typically easier if f̂n has a closed form (in terms of the training data).

E.g. ordinary least squares: f̂n (x) = x⊤(X⊤X)⁻¹X⊤Y.
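A minimal numpy sketch of the closed-form least-squares predictor above; the synthetic data, dimensions, and θ∗ are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic regression data (assumed for illustration): Y = <theta*, X> + noise.
    n, d = 100, 3
    theta_star = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(n, d))                    # design matrix, rows are X_i
    Y = X @ theta_star + 0.1 * rng.normal(size=n)  # response vector

    # Closed-form ordinary least squares: theta_hat = (X^T X)^{-1} X^T Y,
    # so that f_hat_n(x) = x^T theta_hat.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    def f_hat(x):
        return x @ theta_hat

    x_new = np.array([0.2, -0.4, 1.0])
    print("prediction at x_new:", f_hat(x_new))
    print("estimation error ||theta_hat - theta*||:", np.linalg.norm(theta_hat - theta_star))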

Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as
▸ solutions to an optimization objective (e.g. logistic regression)
▸ iterative procedures without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)

The Gold Standard

Within the framework we set up, the smallest expected loss is achieved by
the Bayes optimal function

f∗ = arg min_f L(f)

where the minimization is over all (measurable) prediction rules f ∶ X → Y.

The value of the lowest expected loss is called the Bayes error:

L(f∗ ) = inf_f L(f)

Of course, we cannot calculate any of these quantities since P is unknown.

Bayes Optimal Function

Bayes optimal function f∗ takes on the following forms in these two particular cases:
▸ Binary classification (Y = {0, 1}) with the indicator loss:

f∗ (x) = I{η(x) ≥ 1/2}, where η(x) = E[Y∣X = x]

[Figure: plot of η(x)]

▸ Regression (Y = R) with squared loss:

f∗ (x) = η(x), where η(x) = E[Y∣X = x]
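When η is known exactly, both Bayes rules above can be written down directly. The sketch below uses an assumed η (a logistic curve, chosen only for illustration) to form the Bayes classifier and estimate the Bayes error by Monte Carlo, compared against the best constant classifier.

    import numpy as np

    rng = np.random.default_rng(3)

    def eta(x):
        # Assumed regression function eta(x) = P(Y = 1 | X = x) for this example.
        return 1.0 / (1.0 + np.exp(-4.0 * x))

    def f_star_classification(x):
        # Bayes classifier under the indicator loss: predict 1 iff eta(x) >= 1/2.
        return (eta(x) >= 0.5).astype(int)

    def f_star_regression(x):
        # Bayes predictor under the squared loss is eta(x) itself.
        return eta(x)

    # Monte Carlo estimate of the Bayes error L(f*), with X ~ N(0, 1) assumed.
    X = rng.normal(size=500_000)
    Y = (rng.uniform(size=X.size) < eta(X)).astype(int)

    bayes_error = np.mean(Y != f_star_classification(X))
    constant_error = min(np.mean(Y != 0), np.mean(Y != 1))
    print(f"Bayes error        ~ {bayes_error:.3f}")
    print(f"best constant rule ~ {constant_error:.3f}")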

The big question: is there a way to construct a learning algorithm with a
guarantee that
L(f̂n ) − L(f∗ )
is small for large enough sample size n?

Consistency

An algorithm that ensures

lim_{n→∞} L(f̂n ) = L(f∗ ) almost surely

is called consistent. Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases.

The good news: consistency is possible to achieve.


▸ easy if X is a finite or countable set
▸ not too hard if X is infinite, and the underlying relationship between x
and y is “continuous”

The bad news...
In general, we cannot prove anything quantitative about L(f̂n ) − L(f∗ ),
unless we make further assumptions (incorporate prior knowledge).

“No Free Lunch” Theorems: unless we posit assumptions,


▸ For any algorithm f̂n , any n, and any ε > 0, there exists a distribution P such that L(f∗ ) = 0 and

EL(f̂n ) ≥ 1/2 − ε

▸ For any algorithm f̂n and any sequence a_n that converges to 0, there exists a probability distribution P such that L(f∗ ) = 0 and for all n

EL(f̂n ) ≥ a_n

References: Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition; Bousquet, Boucheron, Lugosi (2004).

Is this really “bad news”?

Not really. We always have some domain knowledge.

Two ways of incorporating prior knowledge:


▸ Direct way: assumptions on distribution P (e.g. margin, smoothness of
regression function, etc)
▸ Indirect way: redefine the goal to perform as well as a reference set F of predictors:

L(f̂n ) − inf_{f∈F} L(f)

F encapsulates our inductive bias.

We often make both of these assumptions.

We start our study of Statistical Learning with the classical Perceptron
algorithm.

Reason: simplicity. We will give a three-line proof of Perceptron, followed by two interesting consequences with one-line proofs each. These consequences are, perhaps, the easiest nontrivial statistical guarantees I can think of.

Perceptron

(x1 , y1 ), . . . , (xT , yT ) ∈ X × {±1} (T may or may not be the same as n)

Maintain a hypothesis wt ∈ Rd (initialize w1 = 0).

On round t,
▸ Consider (xt , yt )
▸ Form prediction ŷt = sign(⟨wt , xt ⟩)
▸ If ŷt ≠ yt , update
wt+1 = wt + yt xt

else
wt+1 = wt
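Below is a direct transcription of the update rule into numpy; the separable toy data used in the demo are an assumption added for illustration, not part of the slides.

    import numpy as np

    def perceptron(xs, ys):
        """Run one pass of the Perceptron over (x_1, y_1), ..., (x_T, y_T).

        xs: (T, d) array, ys: (T,) array with entries in {+1, -1}.
        Returns the final weight vector and the number of mistakes made.
        """
        T, d = xs.shape
        w = np.zeros(d)                      # w_1 = 0
        mistakes = 0
        for t in range(T):
            y_hat = np.sign(w @ xs[t])       # sign(0) = 0 counts as a mistake
            if y_hat != ys[t]:
                w = w + ys[t] * xs[t]        # w_{t+1} = w_t + y_t x_t
                mistakes += 1
            # otherwise w_{t+1} = w_t (no update)
        return w, mistakes

    # Tiny separable example (assumed data): label is the sign of the first coordinate.
    rng = np.random.default_rng(4)
    xs = rng.uniform(-1, 1, size=(200, 2))
    xs[:, 0] += np.where(xs[:, 0] > 0, 0.3, -0.3)                          # create a gap around x_1 = 0
    xs = xs / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)   # keep ||x_t|| <= 1
    ys = np.where(xs[:, 0] > 0, 1, -1)

    w, m = perceptron(xs, ys)
    print("mistakes:", m)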

Perceptron

[Figure: geometric illustration of the Perceptron update, showing wt and xt]
For simplicity, suppose all data are in a unit ball, ∥xt ∥ ≤ 1.

Margin with respect to (x1 , y1 ), . . . , (xT , yT ):

γ = max_{∥w∥=1} min_{i∈[T]} (yi ⟨w, xi ⟩)+ ,

where (a)+ = max{0, a}.

Theorem (Novikoff ’62).

Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.

Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t,

∥wt+1 ∥² = ∥wt + yt xt ∥² ≤ ∥wt ∥² + 2 yt ⟨wt , xt ⟩ + 1 ≤ ∥wt ∥² + 1.

Hence,
∥wT ∥² ≤ m.
For the optimal hyperplane w∗ ,

γ ≤ ⟨w∗ , yt xt ⟩ = ⟨w∗ , wt+1 − wt ⟩ .

Hence (adding and canceling),

mγ ≤ ⟨w∗ , wT ⟩ ≤ ∥wT ∥ ≤ √m,

so m ≤ 1/γ².

Recap

For any T and (x1 , y1 ), . . . , (xT , yT ),


∑_{t=1}^{T} I{yt ⟨wt , xt ⟩ ≤ 0} ≤ D²/γ²

where γ = γ(x1∶T , y1∶T ) is the margin and D = D(x1∶T , y1∶T ) = max_t ∥xt ∥.

Let w∗ denote the max margin hyperplane, ∥w∗ ∥ = 1.
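As a sanity check of the bound above (an added illustration), the following sketch builds a separable toy data set, lower-bounds the margin γ using one particular unit direction, runs the Perceptron, and confirms that the mistake count stays below D²/γ². The data and the reference direction are assumptions for the example; since γ is a maximum over unit vectors, using a lower bound on γ can only make D²/γ² larger, so the check remains valid.

    import numpy as np

    rng = np.random.default_rng(5)

    # Separable toy data (assumed for the check): label = sign of the first coordinate,
    # with a gap around zero so the margin is bounded away from 0.
    xs = rng.uniform(-1, 1, size=(500, 2))
    xs[:, 0] = np.where(xs[:, 0] > 0, xs[:, 0] + 0.3, xs[:, 0] - 0.3)
    xs = xs / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)   # enforce ||x_t|| <= 1
    ys = np.where(xs[:, 0] > 0, 1.0, -1.0)

    # Lower bound on gamma from one particular unit vector w = (1, 0).
    w_ref = np.array([1.0, 0.0])
    gamma_lb = np.min(ys * (xs @ w_ref))
    D = np.max(np.linalg.norm(xs, axis=1))

    # Run the Perceptron and count mistakes.
    w = np.zeros(2)
    mistakes = 0
    for x, y in zip(xs, ys):
        if y * (w @ x) <= 0:          # the slides' indicator I{y_t <w_t, x_t> <= 0}
            w = w + y * x
            mistakes += 1

    print(f"mistakes = {mistakes},  D^2 / gamma^2 <= {D**2 / gamma_lb**2:.1f}")
    assert mistakes <= D**2 / gamma_lb**2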

