
Instructor: Adarsh Barik Lecture 2, COV878

Scribed by: Soumyodeep Dey August 21, 2025

LECTURE 2: PAC LEARNING, HOEFFDING'S INEQUALITY AND APPLICATIONS

In supervised learning, our goal is to learn a mapping from an input space X to a label space Y . We are
given a training set S = {( x1 , y1 ), . . . , ( xn , yn )}, where each pair ( xi , yi ) ∈ X × Y is assumed to be drawn
independently and identically distributed (i.i.d.) from an unknown joint probability distribution Pxy .

Our objective is to find a function, called a hypothesis h : X → Y , that performs well on new, unseen
data from the same distribution. We typically search for this hypothesis within a predefined set of possible
functions, known as the hypothesis class H.

To measure the performance of a hypothesis, we use a loss function L : Y × Y → R+. Following is a common example for classification tasks:

L(h(x), y) = I[h(x) ≠ y] =
    1, if h(x) ≠ y
    0, if h(x) = y

This loss function simply indicates whether a prediction is correct or not.
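As a concrete illustration (not from the original notes), here is a minimal Python sketch of the 0-1 loss; the function name zero_one_loss is our own choice.

def zero_one_loss(prediction, label):
    """0-1 loss: 1 if the prediction disagrees with the label, 0 otherwise."""
    return 1 if prediction != label else 0

# Example: one correct and one incorrect prediction.
print(zero_one_loss(+1, +1))  # 0, prediction matches the label
print(zero_one_loss(+1, -1))  # 1, prediction disagrees with the label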

True Risk and Empirical Risk

There are two fundamental ways to quantify the error of a hypothesis.

Definition 1 (True Risk). The true risk (or generalization error) of a hypothesis h ∈ H is its expected loss
over the true data distribution Pxy . It is defined as:

R(h) = E_{(x,y)∼Pxy} [L(h(x), y)]

For binary classification with the 0-1 loss, this simplifies to the probability of misclassification:
R(h) = P_{(x,y)∼Pxy} [h(x) ≠ y]

The true risk is what we ultimately care about, but we cannot compute it directly because the distribution Pxy
is unknown. Instead, we use the training data to calculate an estimate of the true risk.

Definition 2 (Empirical Risk). The empirical risk (or training error) of a hypothesis h ∈ H is its average
loss on the training set S. It is defined as:

R̂(h) = (1/n) ∑_{i=1}^{n} L(h(x_i), y_i)

For binary classification with the 0-1 loss, this is the fraction of misclassified training examples:

R̂(h) = (1/n) ∑_{i=1}^{n} I[h(x_i) ≠ y_i]
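To make the definition concrete, here is a small Python sketch (ours, with a made-up toy dataset and an illustrative threshold hypothesis) computing the empirical risk under the 0-1 loss.

# Toy training set of (x, y) pairs; the data and the hypothesis are purely illustrative.
S = [(0.2, 0), (0.7, 1), (0.4, 0), (0.9, 1), (0.6, 0)]

def h(x):
    """A hypothetical threshold hypothesis: predict 1 if x > 0.5, else 0."""
    return 1 if x > 0.5 else 0

# Empirical risk: fraction of training examples the hypothesis misclassifies.
empirical_risk = sum(1 for x, y in S if h(x) != y) / len(S)
print(empirical_risk)  # 0.2 -- only the example (0.6, 0) is misclassified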


1 Probably Approximately Correct (PAC) Learning

The PAC framework provides a formal definition of what it means for a hypothesis class to be "learnable".
Definition 3 (Agnostic PAC-Learnable). A hypothesis class H is agnostic PAC-learnable if ∃ a function
N : (0, 1)2 → N and a learning algorithm such that for every ϵ, δ ∈ (0, 1) and for every probability
distribution Pxy over X × Y : if the algorithm is given n ≥ N (ϵ, δ) i.i.d. samples from Pxy , it returns a
hypothesis ĥ ∈ H such that, with probability at least 1 − δ,

R(ĥ) ≤ inf_{h∈H} R(h) + ϵ

The term "agnostic" means we make no assumptions about the data, not even that the best hypothesis in our class can achieve zero error. A simpler, special case is the "realizable" setting.
Definition 4 (Realizable PAC-Learnable). A hypothesis class H is realizable PAC-learnable if it is PAC-
learnable under the assumption of realizability. Realizability means there exists at least one hypothesis
h∗ ∈ H with zero true risk, i.e., R(h∗ ) = 0. In this case, the PAC guarantee simplifies to:

R(ĥ) ≤ ϵ

2 Empirical Risk Minimization (ERM)

A natural and powerful learning strategy is to find the hypothesis that best fits the training data.
Definition 5 (Empirical Risk Minimization (ERM)). The Empirical Risk Minimization (ERM) algorithm is a
learning rule that, given a training set S, outputs a hypothesis ĥ that minimizes the empirical risk:

ĥ ∈ argmin_{h∈H} R̂(h)

For the analysis below, we assume a finite hypothesis class, that is |H| < ∞, and that the problem is realizable.
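As an illustrative sketch (our own, not part of the notes), ERM over a small finite hypothesis class can be carried out by exhaustive search; here the class consists of hypothetical threshold classifiers.

# A finite hypothesis class of threshold classifiers: predict 1 if x > t, else 0,
# for a handful of candidate thresholds t (an illustrative choice).
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]

def predict(t, x):
    """Threshold classifier parameterized by t."""
    return 1 if x > t else 0

def empirical_risk(t, S):
    """Average 0-1 loss of the threshold-t classifier on the training set S."""
    return sum(1 for x, y in S if predict(t, x) != y) / len(S)

# Toy training set (made up for illustration).
S = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]

# ERM: return a hypothesis in the class that minimizes the empirical risk.
t_hat = min(thresholds, key=lambda t: empirical_risk(t, S))
print(t_hat, empirical_risk(t_hat, S))  # 0.5 0.0 -- this toy problem is realizable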
Theorem 1. Let H be a finite hypothesis class. Assume the learning problem is realizable, i.e., there exists
an h∗ ∈ H such that R(h∗ ) = 0. Let ĥ be the hypothesis returned by the ERM algorithm on a set of n i.i.d.
samples. Then, for all ϵ > 0,

P[R(ĥ) ≤ ϵ] ≥ 1 − |H| exp(−nϵ)

Proof. Let H_B ⊆ H be the set of "bad" hypotheses, which are those with a true risk greater than ϵ:

H_B = {h ∈ H | R(h) > ϵ}

Our goal is to bound the probability of the "bad event" where the ERM algorithm returns a hypothesis ĥ from this set. Let this event be E_bad = {ĥ ∈ H_B}.

From the realizability assumption, we know there exists an h∗ ∈ H with R(h∗) = 0. This means h∗ makes no errors on the true distribution, so, with probability 1 over the draw of the training set, its empirical risk is also zero, i.e., R̂(h∗) = 0.

The ERM algorithm finds a hypothesis ĥ that minimizes the empirical risk. Since h∗ is a candidate hypothesis
in H with R̂(h∗ ) = 0, the hypothesis ĥ returned by ERM must also satisfy R̂(ĥ) = 0.


The bad event E_bad can only happen if ERM selects a hypothesis from H_B. However, for any hypothesis to be selected by ERM in this setting, it must have an empirical risk of 0. Therefore, the bad event E_bad implies that at least one hypothesis in H_B must have had an empirical risk of 0 on the training data. This leads to the following crucial step, which is an application of the union bound:

P[R(ĥ) > ϵ] = P[ĥ ∈ H_B]
            ≤ P[∃h ∈ H_B : R̂(h) = 0]
            ≤ ∑_{h∈H_B} P[R̂(h) = 0]        (1)

Now we just need to bound the probability P[R̂(h) = 0] for any single bad hypothesis h ∈ H_B. By definition of H_B, we have R(h) = P[h(x) ≠ y] > ϵ. This means the probability of h correctly classifying a single, randomly drawn example is P[h(x) = y] = 1 − R(h) < 1 − ϵ.

The event R̂(h) = 0 means that h correctly classifies all n i.i.d. samples in the training set. The probability
of this is:

P[R̂(h) = 0] = P[h(x_1) = y_1, . . . , h(x_n) = y_n]
            = ∏_{i=1}^{n} P[h(x_i) = y_i]        (due to i.i.d. samples)
            ≤ (1 − ϵ)^n

Substituting this result back into our sum from (1):


P[R(ĥ) > ϵ] ≤ ∑_{h∈H_B} (1 − ϵ)^n
            = |H_B| (1 − ϵ)^n
            ≤ |H| (1 − ϵ)^n        (since H_B ⊆ H)

Finally, we use the well-known inequality 1 − x ≤ e^{−x} for any x ∈ R, which gives (1 − ϵ)^n ≤ exp(−nϵ). This gives us the final bound:

P[R(ĥ) > ϵ] ≤ |H| exp(−nϵ)

This completes the proof.

This theorem shows that the probability of ERM selecting a "bad" hypothesis decreases exponentially with
the number of training samples n.


2.1 Sample Complexity

From the theorem, we can derive the sample complexity: the number of samples n required to guarantee a
certain level of performance. We want the failure probability to be at most δ:

|H| exp(−nϵ) ≤ δ
exp(−nϵ) ≤ δ/|H|
−nϵ ≤ ln(δ/|H|)
nϵ ≥ −ln(δ/|H|) = ln(|H|/δ)
n ≥ (1/ϵ) ln(|H|/δ)

This gives us a concrete number of samples N (ϵ, δ) needed to ensure that with probability 1 − δ, our ERM-
learned hypothesis has true risk at most ϵ. This confirms that for finite hypothesis classes under realizability,
ERM is a valid PAC learning algorithm. The analysis for the agnostic case is more involved and requires
stronger concentration inequalities like Hoeffding’s inequality.
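For concreteness, a minimal Python sketch (ours) that evaluates this sample-complexity bound for some illustrative values of |H|, ϵ and δ:

import math

def realizable_sample_complexity(H_size, eps, delta):
    """Smallest integer n satisfying n >= (1/eps) * ln(H_size / delta)."""
    return math.ceil(math.log(H_size / delta) / eps)

# Illustrative numbers: |H| = 1000, accuracy eps = 0.05, confidence delta = 0.01.
print(realizable_sample_complexity(1000, 0.05, 0.01))  # 231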

2.2 Hoeffding’s Lemma and Inequality

Lemma 1 (Hoeffding’s Lemma). Let X be a random variable with support on [0, 1] and mean E [ X ] = µ.
Then for any t ∈ R, we have:

E[exp(t(X − µ))] ≤ exp(t²/8)

Proof Sketch. The proof relies on the convexity of the exponential function.

1. Define f ( x ) = exp(t( x − µ)). This function is convex.

2. For any x ∈ [0, 1], we can write x = (1 − x ) · 0 + x · 1. By Jensen’s inequality (or the definition of
convexity), f ( x ) ≤ (1 − x ) f (0) + x f (1).

3. Taking the expectation over X, we get E [ f ( X )] ≤ (1 − µ) f (0) + µ f (1).

4. Let ϕ(t) = ln(E[exp(t(X − µ))]). After some algebra, one can show using a Taylor expansion of ϕ(t) around 0 that ϕ(t) ≤ t²/8, which proves the lemma.

This lemma can be extended to a random variable bounded in any interval [ a, b].

Corollary 1. Let X be a random variable with support on [ a, b] and mean E [ X ] = µ. Then for any t ∈ R:
E[exp(t(X − µ))] ≤ exp(t²(b − a)²/8)


Using Hoeffding’s Lemma and the Chernoff bounding technique (which involves applying Markov’s inequal-
ity to the exponential), we arrive at Hoeffding’s inequality.
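For completeness, here is a brief sketch of the one-sided Chernoff argument (the two-sided statement then follows by applying the same bound to −X_i and taking a union bound). Writing X̄ = (1/n) ∑_{i=1}^{n} X_i, for any t > 0:

P[X̄ − µ ≥ ϵ] = P[exp(t(X̄ − µ)) ≥ exp(tϵ)]
             ≤ exp(−tϵ) E[exp(t(X̄ − µ))]                      (Markov's inequality)
             = exp(−tϵ) ∏_{i=1}^{n} E[exp((t/n)(X_i − µ))]     (independence)
             ≤ exp(−tϵ) exp(t²/(8n))                            (Hoeffding's Lemma)

Choosing t = 4nϵ minimizes the exponent and yields the bound exp(−2nϵ²).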

Theorem 2 (Hoeffding's Inequality). Let X_1, . . . , X_n be independent random variables with support on [0, 1] and common mean E[X_i] = µ. For any ϵ > 0, we have:

P[ |(1/n) ∑_{i=1}^{n} X_i − µ| ≥ ϵ ] ≤ 2 exp(−2nϵ²)
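A quick Monte Carlo sanity check (our own sketch, with arbitrary parameter choices) comparing the empirical deviation frequency to the Hoeffding bound for Bernoulli variables:

import math
import random

random.seed(0)
n, mu, eps, trials = 100, 0.5, 0.1, 10000

# Estimate P(|sample mean - mu| >= eps) for n i.i.d. Bernoulli(mu) variables.
deviations = 0
for _ in range(trials):
    sample_mean = sum(random.random() < mu for _ in range(n)) / n
    if abs(sample_mean - mu) >= eps:
        deviations += 1

print(deviations / trials)              # empirical deviation frequency (around 0.05 here)
print(2 * math.exp(-2 * n * eps ** 2))  # Hoeffding bound: 2*exp(-2), about 0.27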

A more general version allows for different bounds and means for each random variable.

Theorem 3 (Hoeffding's Inequality, General Version). Let X_1, . . . , X_n be independent random variables with X_i having support on [a_i, b_i]. For any ϵ > 0, we have:

P[ |(1/n) ∑_{i=1}^{n} X_i − (1/n) ∑_{i=1}^{n} E[X_i]| ≥ ϵ ] ≤ 2 exp( −2n²ϵ² / ∑_{i=1}^{n} (b_i − a_i)² )

3 ERM in the Agnostic Setting

We can now use Hoeffding’s inequality to provide a PAC guarantee for ERM in the agnostic setting.

Theorem 4. Let H be a finite hypothesis class. Let ĥ be the hypothesis returned by the ERM algorithm on a
set of n i.i.d. samples. Then, for any ϵ > 0,

P[ R(ĥ) ≤ inf_{h∈H} R(h) + 2ϵ ] ≥ 1 − 2|H| exp(−2nϵ²)

Proof. The proof consists of two main steps. First, we show that with high probability, the empirical risk is
close to the true risk for all hypotheses simultaneously. Second, we use this fact to bound the excess risk of
the ERM hypothesis.

Step 1: Uniform Convergence

Fix an arbitrary hypothesis h ∈ H. Let's define a set of random variables Z_1, . . . , Z_n where Z_i = L(h(x_i), y_i) = I[h(x_i) ≠ y_i]. Since the loss is 0 or 1, each Z_i is a random variable with support on [0, 1]. The true risk is the mean of this random variable: R(h) = E[Z_i]. The empirical risk is the sample mean: R̂(h) = (1/n) ∑_{i=1}^{n} Z_i. Since the Z_i's are i.i.d. (because the data samples are i.i.d.), we can apply Hoeffding's Inequality (Theorem 2) to this specific, fixed h:

P[ |R̂(h) − R(h)| ≥ ϵ ] ≤ 2 exp(−2nϵ²)

This bound holds for one hypothesis. We need it to hold for all hypotheses in H at the same time. We use the
union bound to achieve this:

P[ ∃h ∈ H : |R̂(h) − R(h)| ≥ ϵ ] ≤ ∑_{h∈H} P[ |R̂(h) − R(h)| ≥ ϵ ]
                                 ≤ |H| · 2 exp(−2nϵ²)


This is the probability of a "bad event" where at least one hypothesis has its empirical risk far from its true risk. The complementary "good event" is that for all hypotheses, the risks are close.

P[ ∀h ∈ H : |R̂(h) − R(h)| ≤ ϵ ] ≥ 1 − 2|H| exp(−2nϵ²)

This property is known as uniform convergence. For the rest of the proof, we assume we are in this
high-probability "good event".

Step 2: Bounding the Excess Risk

Let h∗ = argmin_{h∈H} R(h) be the best possible hypothesis in our class. We want to bound the excess risk, R(ĥ) − R(h∗). We can decompose this term as follows:

R(ĥ) − R(h∗) = (R(ĥ) − R̂(ĥ)) + (R̂(ĥ) − R̂(h∗)) + (R̂(h∗) − R(h∗))
Now let's bound each of the three terms, assuming the uniform convergence property holds:

1. (R(ĥ) − R̂(ĥ)): By uniform convergence, we know |R(h) − R̂(h)| ≤ ϵ for all h, including ĥ. Thus, R(ĥ) − R̂(ĥ) ≤ ϵ.

2. (R̂(ĥ) − R̂(h∗)): By definition, ĥ is the ERM solution, meaning it minimizes the empirical risk. Therefore, R̂(ĥ) ≤ R̂(h) for all h ∈ H, including h∗. This implies R̂(ĥ) − R̂(h∗) ≤ 0.

3. (R̂(h∗) − R(h∗)): Again, by uniform convergence, |R̂(h∗) − R(h∗)| ≤ ϵ. This implies R̂(h∗) − R(h∗) ≤ ϵ.

Combining these bounds, we get:

R(ĥ) − R(h∗) ≤ ϵ + 0 + ϵ = 2ϵ

This inequality holds whenever the uniform convergence event occurs. The probability of this event is at least 1 − 2|H| exp(−2nϵ²). Therefore, we have shown that with high probability,

R(ĥ) ≤ R(h∗ ) + 2ϵ

which completes the proof.
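As in Section 2.1, a sample complexity can be read off from this guarantee: requiring 2|H| exp(−2nϵ²) ≤ δ gives n ≥ ln(2|H|/δ)/(2ϵ²), so that with probability at least 1 − δ the ERM hypothesis satisfies R(ĥ) ≤ inf_{h∈H} R(h) + 2ϵ. A small Python sketch (ours) evaluating this bound:

import math

def agnostic_sample_complexity(H_size, eps, delta):
    """Smallest integer n satisfying 2 * H_size * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# Same illustrative numbers as before: |H| = 1000, eps = 0.05, delta = 0.01.
print(agnostic_sample_complexity(1000, 0.05, 0.01))  # 2442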

Disclaimer: These notes have not been scrutinized with the level of rigor usually applied to formal publica-
tions. Readers should verify the results before use.
