
Instructor: Adarsh Barik Lecture 2, COV878

Scribed by: Soumyodeep Dey August 21, 2025

LECTURE 2: PAC LEARNING, HOEFFDING'S INEQUALITY AND APPLICATIONS

In supervised learning, our goal is to learn a mapping from an input space X to a label space Y . We are
given a training set S = {( x1 , y1 ), . . . , ( xn , yn )}, where each pair ( xi , yi ) ∈ X × Y is assumed to be drawn
independently and identically distributed (i.i.d.) from an unknown joint probability distribution Pxy .

Our objective is to find a function, called a hypothesis h : X → Y , that performs well on new, unseen
data from the same distribution. We typically search for this hypothesis within a predefined set of possible
functions, known as the hypothesis class H.

To measure the performance of a hypothesis, we use a loss function L : Y × Y → R+. Following is a common example for classification tasks:

L(h(x), y) = I[h(x) ≠ y] =
    1, if h(x) ≠ y
    0, if h(x) = y

This loss function simply indicates whether a prediction is correct or not.
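As a concrete illustration (not from the original notes), here is a minimal Python sketch of the 0-1 loss; the function name zero_one_loss is our own choice.

def zero_one_loss(prediction, label):
    """0-1 loss: 1 if the prediction disagrees with the label, 0 otherwise."""
    return 1 if prediction != label else 0

# Example: one correct and one incorrect prediction.
print(zero_one_loss(+1, +1))  # 0, prediction matches the label
print(zero_one_loss(+1, -1))  # 1, prediction disagrees with the label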

True Risk and Empirical Risk

There are two fundamental ways to quantify the error of a hypothesis.

Definition 1 (True Risk). The true risk (or generalization error) of a hypothesis h ∈ H is its expected loss
over the true data distribution Pxy . It is defined as:

R(h) = E_{(x,y)∼Pxy} [L(h(x), y)]

For binary classification with the 0-1 loss, this simplifies to the probability of misclassification:
R(h) = P_{(x,y)∼Pxy} [h(x) ≠ y]

The true risk is what we ultimately care about, but we cannot compute it directly because the distribution Pxy
is unknown. Instead, we use the training data to calculate an estimate of the true risk.

Definition 2 (Empirical Risk). The empirical risk (or training error) of a hypothesis h ∈ H is its average
loss on the training set S. It is defined as:

R̂(h) = (1/n) ∑_{i=1}^{n} L(h(x_i), y_i)

For binary classification with the 0-1 loss, this is the fraction of misclassified training examples:

R̂(h) = (1/n) ∑_{i=1}^{n} I[h(x_i) ≠ y_i]
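To make the definition concrete, here is a small Python sketch (ours, with a made-up toy dataset and an illustrative threshold hypothesis) computing the empirical risk under the 0-1 loss.

# Toy training set of (x, y) pairs; the data and the hypothesis are purely illustrative.
S = [(0.2, 0), (0.7, 1), (0.4, 0), (0.9, 1), (0.6, 0)]

def h(x):
    """A hypothetical threshold hypothesis: predict 1 if x > 0.5, else 0."""
    return 1 if x > 0.5 else 0

# Empirical risk: fraction of training examples the hypothesis misclassifies.
empirical_risk = sum(1 for x, y in S if h(x) != y) / len(S)
print(empirical_risk)  # 0.2 -- only the example (0.6, 0) is misclassified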


1 Probably Approximately Correct (PAC) Learning

The PAC framework provides a formal definition of what it means for a hypothesis class to be "learnable".
Definition 3 (Agnostic PAC-Learnable). A hypothesis class H is agnostic PAC-learnable if ∃ a function
N : (0, 1)2 → N and a learning algorithm such that for every ϵ, δ ∈ (0, 1) and for every probability
distribution Pxy over X × Y : if the algorithm is given n ≥ N (ϵ, δ) i.i.d. samples from Pxy , it returns a
hypothesis ĥ ∈ H such that, with probability at least 1 − δ,

R(ĥ) ≤ inf_{h∈H} R(h) + ϵ

The term "agnostic" means we make no assumptions about the data, not even that the best hypothesis in our class can achieve zero error. A simpler, special case is the "realizable" setting.
Definition 4 (Realizable PAC-Learnable). A hypothesis class H is realizable PAC-learnable if it is PAC-
learnable under the assumption of realizability. Realizability means there exists at least one hypothesis
h∗ ∈ H with zero true risk, i.e., R(h∗ ) = 0. In this case, the PAC guarantee simplifies to:

R(ĥ) ≤ ϵ

2 Empirical Risk Minimization (ERM)

A natural and powerful learning strategy is to find the hypothesis that best fits the training data.
Definition 5 (Empirical Risk Minimization (ERM)). The Empirical Risk Minimization (ERM) algorithm is a
learning rule that, given a training set S, outputs a hypothesis ĥ that minimizes the empirical risk:

ĥ ∈ argmin_{h∈H} R̂(h)

For the analysis below, we assume a finite hypothesis class, that is |H| < ∞, and that the problem is realizable.
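As an illustrative sketch (our own, not part of the notes), ERM over a small finite hypothesis class can be carried out by exhaustive search; here the class consists of hypothetical threshold classifiers.

# A finite hypothesis class of threshold classifiers: predict 1 if x > t, else 0,
# for a handful of candidate thresholds t (an illustrative choice).
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

def predict(t, x):
    """Threshold classifier parameterized by t."""
    return 1 if x > t else 0

def empirical_risk(t, S):
    """Average 0-1 loss of the threshold-t classifier on the training set S."""
    return sum(1 for x, y in S if predict(t, x) != y) / len(S)

# Toy training set (made up for illustration).
S = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]

# ERM: return a hypothesis in the class that minimizes the empirical risk.
t_hat = min(thresholds, key=lambda t: empirical_risk(t, S))
print(t_hat, empirical_risk(t_hat, S))  # 0.5 0.0 -- this toy problem is realizable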
Theorem 1. Let H be a finite hypothesis class. Assume the learning problem is realizable, i.e., there exists
an h∗ ∈ H such that R(h∗ ) = 0. Let ĥ be the hypothesis returned by the ERM algorithm on a set of n i.i.d.
samples. Then, for all ϵ > 0,

P[R(ĥ) ≤ ϵ] ≥ 1 − |H| exp(−nϵ)

Proof. Let H_B ⊆ H be the set of "bad" hypotheses, which are those with a true risk greater than ϵ:

H_B = {h ∈ H | R(h) > ϵ}

Our goal is to bound the probability of the "bad event" where the ERM algorithm returns a hypothesis ĥ from this set. Let this event be E_bad = {ĥ ∈ H_B}.

From the realizability assumption, we know there exists an h∗ ∈ H with R(h∗) = 0. This means h∗ makes no errors on the true distribution, so, with probability 1 over the draw of the training set, its empirical risk is also zero, i.e., R̂(h∗) = 0.

The ERM algorithm finds a hypothesis ĥ that minimizes the empirical risk. Since h∗ is a candidate hypothesis
in H with R̂(h∗ ) = 0, the hypothesis ĥ returned by ERM must also satisfy R̂(ĥ) = 0.


The bad event E_bad can only happen if ERM selects a hypothesis from H_B. However, for any hypothesis to be selected by ERM in this setting, it must have an empirical risk of 0. Therefore, the bad event E_bad implies that at least one hypothesis in H_B must have had an empirical risk of 0 on the training data. This leads to the following crucial step, which is an application of the union bound:

P[R(ĥ) > ϵ] = P[ĥ ∈ H_B]
            ≤ P[∃h ∈ H_B : R̂(h) = 0]
            ≤ ∑_{h∈H_B} P[R̂(h) = 0]        (1)

Now we just need to bound the probability P[R̂(h) = 0] for any single bad hypothesis h ∈ H_B. By definition of H_B, we have R(h) = P[h(x) ≠ y] > ϵ. This means the probability of h correctly classifying a single, randomly drawn example is P[h(x) = y] = 1 − R(h) < 1 − ϵ.

The event R̂(h) = 0 means that h correctly classifies all n i.i.d. samples in the training set. The probability
of this is:

P[R̂(h) = 0] = P[h(x_1) = y_1, . . . , h(x_n) = y_n]
            = ∏_{i=1}^{n} P[h(x_i) = y_i]        (due to i.i.d. samples)
            ≤ (1 − ϵ)^n

Substituting this result back into our sum from (1):


P[R(ĥ) > ϵ] ≤ ∑_{h∈H_B} (1 − ϵ)^n
            = |H_B| (1 − ϵ)^n
            ≤ |H| (1 − ϵ)^n        (since H_B ⊆ H)

Finally, we use the well-known inequality 1 − x ≤ e^{−x} for any x ∈ R, which gives (1 − ϵ)^n ≤ exp(−nϵ). This gives us the final bound:

P[R(ĥ) > ϵ] ≤ |H| exp(−nϵ)

This completes the proof.

This theorem shows that the probability of ERM selecting a "bad" hypothesis decreases exponentially with
the number of training samples n.


2.1 Sample Complexity

From the theorem, we can derive the sample complexity: the number of samples n required to guarantee a
certain level of performance. We want the failure probability to be at most δ:

|H| exp(−nϵ) ≤ δ
exp(−nϵ) ≤ δ/|H|
−nϵ ≤ ln(δ/|H|)
nϵ ≥ −ln(δ/|H|) = ln(|H|/δ)
n ≥ (1/ϵ) ln(|H|/δ)

This gives us a concrete number of samples N (ϵ, δ) needed to ensure that with probability 1 − δ, our ERM-
learned hypothesis has true risk at most ϵ. This confirms that for finite hypothesis classes under realizability,
ERM is a valid PAC learning algorithm. The analysis for the agnostic case is more involved and requires
stronger concentration inequalities like Hoeffding’s inequality.
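For concreteness, a minimal Python sketch (ours) that evaluates this sample-complexity bound for some illustrative values of |H|, ϵ and δ:

import math

def realizable_sample_complexity(H_size, eps, delta):
    """Smallest integer n satisfying n >= (1/eps) * ln(H_size / delta)."""
    return math.ceil(math.log(H_size / delta) / eps)

# Illustrative numbers: |H| = 1000, accuracy eps = 0.05, confidence delta = 0.01.
print(realizable_sample_complexity(1000, 0.05, 0.01))  # 231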

2.2 Hoeffding’s Lemma and Inequality

Lemma 1 (Hoeffding’s Lemma). Let X be a random variable with support on [0, 1] and mean E [ X ] = µ.
Then for any t ∈ R, we have:

E[exp(t(X − µ))] ≤ exp(t²/8)

Proof Sketch. The proof relies on the convexity of the exponential function.

1. Define f ( x ) = exp(t( x − µ)). This function is convex.

2. For any x ∈ [0, 1], we can write x = (1 − x ) · 0 + x · 1. By Jensen’s inequality (or the definition of
convexity), f ( x ) ≤ (1 − x ) f (0) + x f (1).

3. Taking the expectation over X, we get E [ f ( X )] ≤ (1 − µ) f (0) + µ f (1).

4. Let ϕ(t) = ln(E[exp(t(X − µ))]). After some algebra, one can show using a Taylor expansion of ϕ(t) around 0 that ϕ(t) ≤ t²/8, which proves the lemma.

This lemma can be extended to a random variable bounded in any interval [ a, b].

Corollary 1. Let X be a random variable with support on [ a, b] and mean E [ X ] = µ. Then for any t ∈ R:
E[exp(t(X − µ))] ≤ exp(t²(b − a)²/8)


Using Hoeffding’s Lemma and the Chernoff bounding technique (which involves applying Markov’s inequal-
ity to the exponential), we arrive at Hoeffding’s inequality.
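For completeness, here is a brief sketch of the one-sided Chernoff argument (the two-sided statement then follows by applying the same bound to −X_i and taking a union bound). Writing X̄ = (1/n) ∑_{i=1}^{n} X_i, for any t > 0:

P[X̄ − µ ≥ ϵ] = P[exp(t(X̄ − µ)) ≥ exp(tϵ)]
             ≤ exp(−tϵ) E[exp(t(X̄ − µ))]                      (Markov's inequality)
             = exp(−tϵ) ∏_{i=1}^{n} E[exp((t/n)(X_i − µ))]     (independence)
             ≤ exp(−tϵ) exp(t²/(8n))                            (Hoeffding's Lemma)

Choosing t = 4nϵ minimizes the exponent and yields the bound exp(−2nϵ²).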

Theorem 2 (Hoeffding's Inequality). Let X_1, . . . , X_n be independent random variables with support on [0, 1] and common mean E[X_i] = µ. For any ϵ > 0, we have:

P[ |(1/n) ∑_{i=1}^{n} X_i − µ| ≥ ϵ ] ≤ 2 exp(−2nϵ²)
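A quick Monte Carlo sanity check (our own sketch, with arbitrary parameter choices) comparing the empirical deviation frequency to the Hoeffding bound for Bernoulli variables:

import math
import random

random.seed(0)
n, mu, eps, trials = 100, 0.5, 0.1, 10000

# Estimate P(|sample mean - mu| >= eps) for n i.i.d. Bernoulli(mu) variables.
deviations = 0
for _ in range(trials):
    sample_mean = sum(random.random() < mu for _ in range(n)) / n
    if abs(sample_mean - mu) >= eps:
        deviations += 1

print(deviations / trials)              # empirical deviation frequency (around 0.05 here)
print(2 * math.exp(-2 * n * eps ** 2))  # Hoeffding bound: 2*exp(-2), about 0.27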

A more general version allows for different bounds and means for each random variable.

Theorem 3 (Hoeffding's Inequality, General Version). Let X_1, . . . , X_n be independent random variables with X_i having support on [a_i, b_i]. For any ϵ > 0, we have:

P[ |(1/n) ∑_{i=1}^{n} X_i − (1/n) ∑_{i=1}^{n} E[X_i]| ≥ ϵ ] ≤ 2 exp( −2n²ϵ² / ∑_{i=1}^{n} (b_i − a_i)² )

3 ERM in the Agnostic Setting

We can now use Hoeffding’s inequality to provide a PAC guarantee for ERM in the agnostic setting.

Theorem 4. Let H be a finite hypothesis class. Let ĥ be the hypothesis returned by the ERM algorithm on a
set of n i.i.d. samples. Then, for any ϵ > 0,

P[ R(ĥ) ≤ inf_{h∈H} R(h) + 2ϵ ] ≥ 1 − 2|H| exp(−2nϵ²)

Proof. The proof consists of two main steps. First, we show that with high probability, the empirical risk is
close to the true risk for all hypotheses simultaneously. Second, we use this fact to bound the excess risk of
the ERM hypothesis.

Step 1: Uniform Convergence

Fix an arbitrary hypothesis h ∈ H. Let's define a set of random variables Z_1, . . . , Z_n where Z_i = L(h(x_i), y_i) = I[h(x_i) ≠ y_i]. Since the loss is 0 or 1, each Z_i is a random variable with support on [0, 1]. The true risk is the mean of this random variable: R(h) = E[Z_i]. The empirical risk is the sample mean: R̂(h) = (1/n) ∑_{i=1}^{n} Z_i. Since the Z_i's are i.i.d. (because the data samples are i.i.d.), we can apply Hoeffding's Inequality (Theorem 2) to this specific, fixed h:

P[ |R̂(h) − R(h)| ≥ ϵ ] ≤ 2 exp(−2nϵ²)

This bound holds for one hypothesis. We need it to hold for all hypotheses in H at the same time. We use the
union bound to achieve this:

P[ ∃h ∈ H : |R̂(h) − R(h)| ≥ ϵ ] ≤ ∑_{h∈H} P[ |R̂(h) − R(h)| ≥ ϵ ]
                                 ≤ |H| · 2 exp(−2nϵ²)


This is the probability of a "bad event" where at least one hypothesis has its empirical risk far from its true risk. The complementary "good event" is that for all hypotheses, the risks are close.

P[ ∀h ∈ H : |R̂(h) − R(h)| ≤ ϵ ] ≥ 1 − 2|H| exp(−2nϵ²)

This property is known as uniform convergence. For the rest of the proof, we assume we are in this
high-probability "good event".

Step 2: Bounding the Excess Risk

Let h∗ = argmin_{h∈H} R(h) be the best possible hypothesis in our class. We want to bound the excess risk, R(ĥ) − R(h∗). We can decompose this term as follows:

R(ĥ) − R(h∗) = (R(ĥ) − R̂(ĥ)) + (R̂(ĥ) − R̂(h∗)) + (R̂(h∗) − R(h∗))
Now let's bound each of the three terms, assuming the uniform convergence property holds:

1. (R(ĥ) − R̂(ĥ)): By uniform convergence, we know |R(h) − R̂(h)| ≤ ϵ for all h, including ĥ. Thus, R(ĥ) − R̂(ĥ) ≤ ϵ.

2. (R̂(ĥ) − R̂(h∗)): By definition, ĥ is the ERM solution, meaning it minimizes the empirical risk. Therefore, R̂(ĥ) ≤ R̂(h) for all h ∈ H, including h∗. This implies R̂(ĥ) − R̂(h∗) ≤ 0.

3. (R̂(h∗) − R(h∗)): Again, by uniform convergence, |R̂(h∗) − R(h∗)| ≤ ϵ. This implies R̂(h∗) − R(h∗) ≤ ϵ.

Combining these bounds, we get:

R(ĥ) − R(h∗) ≤ ϵ + 0 + ϵ = 2ϵ

This inequality holds whenever the uniform convergence event occurs. The probability of this event is at least 1 − 2|H| exp(−2nϵ²). Therefore, we have shown that with high probability,

R(ĥ) ≤ R(h∗ ) + 2ϵ

which completes the proof.
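As in Section 2.1, a sample complexity can be read off from this guarantee: requiring 2|H| exp(−2nϵ²) ≤ δ gives n ≥ ln(2|H|/δ)/(2ϵ²), so that with probability at least 1 − δ the ERM hypothesis satisfies R(ĥ) ≤ inf_{h∈H} R(h) + 2ϵ. A small Python sketch (ours) evaluating this bound:

import math

def agnostic_sample_complexity(H_size, eps, delta):
    """Smallest integer n satisfying 2 * H_size * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# Same illustrative numbers as before: |H| = 1000, eps = 0.05, delta = 0.01.
print(agnostic_sample_complexity(1000, 0.05, 0.01))  # 2442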

Disclaimer: These notes have not been scrutinized with the level of rigor usually applied to formal publica-
tions. Readers should verify the results before use.
