Lecture 12

Statistical Learning: First Steps

Sasha Rakhlin

Oct 17, 2019

Outline

Setup

Perceptron

What is Generalization?

log(1 + 2 + 3) = log(1) + log(2) + log(3)

log(1 + 1.5 + 5) = log(1) + log(1.5) + log(5)

log(2 + 2) = log(2) + log(2)

log(1 + 1.25 + 9) = log(1) + log(1.25) + log(9)

Supervised Learning: data S = {(X1 , Y1 ), . . . , (Xn , Yn )} are i.i.d. from
unknown distribution P.

Learning algorithm: a mapping {(X1 , Y1 ), . . . , (Xn , Yn )} ↦ f̂n .

Goals:
▸ Prediction: small expected loss

L(f̂n ) = E_{X,Y} ℓ(Y, f̂n (X)).

Here (X, Y) ∼ P. Interpretation: good prediction on a random example from the same population.
▸ Estimation: small ∥f̂n − f∗ ∥, or ∥θ̂ − θ∗ ∥, where f∗ or θ∗ are parameters of P (e.g. regression function f∗ (x) = E[Y∣X = x], or f∗ (x) = ⟨θ∗ , x⟩, etc).

In this course, we mostly focus on prediction, but will also outline connections between prediction and estimation.

Why not estimate the underlying distribution P first?
This is in general a harder problem than prediction. Consider classification.
We might be attempting to learn parts/properties of the distribution that
are irrelevant, while all we care about is the “boundary” between the two
classes.

Key difficulty: our goals are in terms of unknown quantities related to
unknown P. Have to use empirical data instead. Purview of statistics.
For instance, we can calculate the empirical loss of f ∶ X → Y

L̂(f) = (1/n) ∑_{i=1}^{n} ℓ(Yi , f(Xi ))
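As an added illustration (not from the slides), the following numpy sketch computes L̂(f) under the indicator loss for a fixed rule f on a training sample, and compares it with a Monte Carlo estimate of the expected loss L(f); the toy distribution P and the rule f are assumptions made only for this example.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_P(n):
        # Assumed toy distribution P: X ~ Uniform[-1, 1],
        # Y = 1 with probability 0.5 + 0.4 * X, else Y = 0.
        X = rng.uniform(-1.0, 1.0, size=n)
        Y = (rng.uniform(size=n) < 0.5 + 0.4 * X).astype(int)
        return X, Y

    def f(x):
        # A fixed classification rule (not learned from data).
        return (x >= 0).astype(int)

    def zero_one_loss(y, yhat):
        # Indicator loss, one particular choice of the generic loss ell.
        return (y != yhat).astype(float)

    # Empirical loss L_hat(f) on a training sample of size n.
    Xtr, Ytr = sample_P(n=50)
    L_hat = zero_one_loss(Ytr, f(Xtr)).mean()

    # Monte Carlo estimate of the expected loss L(f) on fresh draws from P.
    Xte, Yte = sample_P(n=200_000)
    L_approx = zero_one_loss(Yte, f(Xte)).mean()

    print(f"empirical loss  {L_hat:.3f}")
    print(f"expected loss   {L_approx:.3f}  (Monte Carlo)")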

The function x ↦ f̂n (x) = f̂n (x; X1 , Y1 , . . . , Xn , Yn ) is random, since it
depends on the random data S = (X1 , Y1 , . . . , Xn , Yn ). Thus, the risk

L(f̂n ) = E [ℓ(f̂n (X), Y) ∣ S]
        = E [ℓ(f̂n (X; X1 , Y1 , . . . , Xn , Yn ), Y) ∣ S]

is a random variable. We might aim for EL(f̂n ) small, or L(f̂n ) small with
high probability (over the training data).
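To see concretely that L(f̂n ) is a random variable, the sketch below (an added illustration with an assumed toy distribution) draws many independent training sets S, fits a deliberately crude f̂n on each, and records the resulting risk; because P is fully specified here, each risk can be evaluated exactly.

    import numpy as np

    rng = np.random.default_rng(1)

    # Assumed toy distribution (for illustration only): X ~ Uniform[0, 1] and,
    # independently of X, Y ~ Bernoulli(0.6).
    P_Y1 = 0.6

    def sample_S(n):
        X = rng.uniform(size=n)
        Y = (rng.uniform(size=n) < P_Y1).astype(int)
        return X, Y

    def fit_majority(Ytr):
        # A deliberately crude learning algorithm: f_hat_n predicts the majority
        # training label, ignoring x entirely.
        return int(Ytr.mean() >= 0.5)

    def true_risk(label):
        # Because P is known here, L(f_hat_n) can be evaluated exactly:
        # under the indicator loss it is P(Y != label).
        return 1.0 - P_Y1 if label == 1 else P_Y1

    risks = np.array([true_risk(fit_majority(sample_S(20)[1])) for _ in range(1000)])
    print("distinct values of L(f_hat_n):", np.unique(risks))
    print("mean over training sets (approximates E L(f_hat_n)):", risks.mean())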

Quiz: what is random here?

1. L̂(f) for a given fixed f
2. f̂n
3. L̂(f̂n )
4. L(f̂n )
5. L(f) for a given fixed f

It is important that these are understood before we proceed further.

Theoretical analysis of performance is typically easier if f̂n has a closed form (in terms of the training data).

E.g. ordinary least squares: f̂n (x) = x⊤(X⊤X)⁻¹X⊤Y.
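A minimal numpy sketch of the closed-form least-squares predictor above; the synthetic data, dimensions, and θ∗ are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic regression data (assumed for illustration): Y = <theta*, X> + noise.
    n, d = 100, 3
    theta_star = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(n, d))                    # design matrix, rows are X_i
    Y = X @ theta_star + 0.1 * rng.normal(size=n)  # response vector

    # Closed-form ordinary least squares: theta_hat = (X^T X)^{-1} X^T Y,
    # so that f_hat_n(x) = x^T theta_hat.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    def f_hat(x):
        return x @ theta_hat

    x_new = np.array([0.2, -0.4, 1.0])
    print("prediction at x_new:", f_hat(x_new))
    print("estimation error ||theta_hat - theta*||:", np.linalg.norm(theta_hat - theta_star))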

Unfortunately, most ML and many statistical procedures are not explicitly defined but arise as
▸ solutions to an optimization objective (e.g. logistic regression)
▸ iterative procedures without an immediately obvious objective function (e.g. AdaBoost, Random Forests, etc.)

The Gold Standard

Within the framework we set up, the smallest expected loss is achieved by
the Bayes optimal function

f∗ = arg min_f L(f)

where the minimization is over all (measurable) prediction rules f ∶ X → Y.

The value of the lowest expected loss is called the Bayes error:

L(f∗ ) = inf_f L(f)

Of course, we cannot calculate any of these quantities since P is unknown.

Bayes Optimal Function

Bayes optimal function f∗ takes on the following forms in these two particular cases:
▸ Binary classification (Y = {0, 1}) with the indicator loss:

f∗ (x) = I{η(x) ≥ 1/2}, where η(x) = E[Y∣X = x]

[Figure: plot of η(x)]

▸ Regression (Y = R) with squared loss:

f∗ (x) = η(x), where η(x) = E[Y∣X = x]
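When η is known exactly, both Bayes rules above can be written down directly. The sketch below uses an assumed η (a logistic curve, chosen only for illustration) to form the Bayes classifier and estimate the Bayes error by Monte Carlo, compared against the best constant classifier.

    import numpy as np

    rng = np.random.default_rng(3)

    def eta(x):
        # Assumed regression function eta(x) = P(Y = 1 | X = x) for this example.
        return 1.0 / (1.0 + np.exp(-4.0 * x))

    def f_star_classification(x):
        # Bayes classifier under the indicator loss: predict 1 iff eta(x) >= 1/2.
        return (eta(x) >= 0.5).astype(int)

    def f_star_regression(x):
        # Bayes predictor under the squared loss is eta(x) itself.
        return eta(x)

    # Monte Carlo estimate of the Bayes error L(f*), with X ~ N(0, 1) assumed.
    X = rng.normal(size=500_000)
    Y = (rng.uniform(size=X.size) < eta(X)).astype(int)

    bayes_error = np.mean(Y != f_star_classification(X))
    constant_error = min(np.mean(Y != 0), np.mean(Y != 1))
    print(f"Bayes error        ~ {bayes_error:.3f}")
    print(f"best constant rule ~ {constant_error:.3f}")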

The big question: is there a way to construct a learning algorithm with a
guarantee that
L(f̂n ) − L(f∗ )
is small for large enough sample size n?

Consistency

An algorithm that ensures

lim_{n→∞} L(f̂n ) = L(f∗ ) almost surely

is called consistent. Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases.

The good news: consistency is possible to achieve.


▸ easy if X is a finite or countable set
▸ not too hard if X is infinite, and the underlying relationship between x
and y is “continuous”

The bad news...
In general, we cannot prove anything quantitative about L(f̂n ) − L(f∗ ),
unless we make further assumptions (incorporate prior knowledge).

“No Free Lunch” Theorems: unless we posit assumptions,


▸ For any algorithm f̂n , any n, and any ε > 0, there exists a distribution P such that L(f∗ ) = 0 and

EL(f̂n ) ≥ 1/2 − ε

▸ For any algorithm f̂n and any sequence a_n that converges to 0, there exists a probability distribution P such that L(f∗ ) = 0 and for all n

EL(f̂n ) ≥ a_n

References: Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition; Bousquet, Boucheron, Lugosi (2004).

Is this really “bad news”?

Not really. We always have some domain knowledge.

Two ways of incorporating prior knowledge:


▸ Direct way: assumptions on distribution P (e.g. margin, smoothness of
regression function, etc)
▸ Indirect way: redefine the goal to perform as well as a reference set F of predictors:

L(f̂n ) − inf_{f∈F} L(f)

F encapsulates our inductive bias.

We often make both of these assumptions.

We start our study of Statistical Learning with the classical Perceptron
algorithm.

Reason: simplicity. We will give a three-line proof of Perceptron, followed by two interesting consequences with one-line proofs each. These consequences are, perhaps, the easiest nontrivial statistical guarantees I can think of.

Perceptron

(x1 , y1 ), . . . , (xT , yT ) ∈ X × {±1} (T may or may not be the same as n)

Maintain a hypothesis wt ∈ Rd (initialize w1 = 0).

On round t,
▸ Consider (xt , yt )
▸ Form prediction ŷt = sign(⟨wt , xt ⟩)
▸ If ŷt ≠ yt , update
wt+1 = wt + yt xt

else
wt+1 = wt
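Below is a direct transcription of the update rule into numpy; the separable toy data used in the demo are an assumption added for illustration, not part of the slides.

    import numpy as np

    def perceptron(xs, ys):
        """Run one pass of the Perceptron over (x_1, y_1), ..., (x_T, y_T).

        xs: (T, d) array, ys: (T,) array with entries in {+1, -1}.
        Returns the final weight vector and the number of mistakes made.
        """
        T, d = xs.shape
        w = np.zeros(d)                      # w_1 = 0
        mistakes = 0
        for t in range(T):
            y_hat = np.sign(w @ xs[t])       # sign(0) = 0 counts as a mistake
            if y_hat != ys[t]:
                w = w + ys[t] * xs[t]        # w_{t+1} = w_t + y_t x_t
                mistakes += 1
            # otherwise w_{t+1} = w_t (no update)
        return w, mistakes

    # Tiny separable example (assumed data): label is the sign of the first coordinate.
    rng = np.random.default_rng(4)
    xs = rng.uniform(-1, 1, size=(200, 2))
    xs[:, 0] += np.where(xs[:, 0] > 0, 0.3, -0.3)                          # create a gap around x_1 = 0
    xs = xs / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)   # keep ||x_t|| <= 1
    ys = np.where(xs[:, 0] > 0, 1, -1)

    w, m = perceptron(xs, ys)
    print("mistakes:", m)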

Perceptron

[Figure: geometric illustration of the Perceptron update, showing wt and xt]
For simplicity, suppose all data are in a unit ball, ∥xt ∥ ≤ 1.

Margin with respect to (x1 , y1 ), . . . , (xT , yT ):

γ = max_{∥w∥=1} min_{i∈[T]} (yi ⟨w, xi ⟩)+ ,

where (a)+ = max{0, a}.

Theorem (Novikoff ’62).

Perceptron makes at most 1/γ² mistakes (and corrections) on any sequence of examples with margin γ.

Proof: Let m be the number of mistakes after T iterations. If a mistake is made on round t,

∥wt+1 ∥² = ∥wt + yt xt ∥² ≤ ∥wt ∥² + 2 yt ⟨wt , xt ⟩ + 1 ≤ ∥wt ∥² + 1.

Hence,
∥wT ∥² ≤ m.
For the optimal hyperplane w∗ ,

γ ≤ ⟨w∗ , yt xt ⟩ = ⟨w∗ , wt+1 − wt ⟩ .

Hence (adding and canceling),

mγ ≤ ⟨w∗ , wT ⟩ ≤ ∥wT ∥ ≤ √m,

so m ≤ 1/γ².

Recap

For any T and (x1 , y1 ), . . . , (xT , yT ),


∑_{t=1}^{T} I{yt ⟨wt , xt ⟩ ≤ 0} ≤ D²/γ²

where γ = γ(x1∶T , y1∶T ) is the margin and D = D(x1∶T , y1∶T ) = max_t ∥xt ∥.

Let w∗ denote the max margin hyperplane, ∥w∗ ∥ = 1.
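As a sanity check of the bound above (an added illustration), the following sketch builds a separable toy data set, lower-bounds the margin γ using one particular unit direction, runs the Perceptron, and confirms that the mistake count stays below D²/γ². The data and the reference direction are assumptions for the example; since γ is a maximum over unit vectors, using a lower bound on γ can only make D²/γ² larger, so the check remains valid.

    import numpy as np

    rng = np.random.default_rng(5)

    # Separable toy data (assumed for the check): label = sign of the first coordinate,
    # with a gap around zero so the margin is bounded away from 0.
    xs = rng.uniform(-1, 1, size=(500, 2))
    xs[:, 0] = np.where(xs[:, 0] > 0, xs[:, 0] + 0.3, xs[:, 0] - 0.3)
    xs = xs / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0)   # enforce ||x_t|| <= 1
    ys = np.where(xs[:, 0] > 0, 1.0, -1.0)

    # Lower bound on gamma from one particular unit vector w = (1, 0).
    w_ref = np.array([1.0, 0.0])
    gamma_lb = np.min(ys * (xs @ w_ref))
    D = np.max(np.linalg.norm(xs, axis=1))

    # Run the Perceptron and count mistakes.
    w = np.zeros(2)
    mistakes = 0
    for x, y in zip(xs, ys):
        if y * (w @ x) <= 0:          # the slides' indicator I{y_t <w_t, x_t> <= 0}
            w = w + y * x
            mistakes += 1

    print(f"mistakes = {mistakes},  D^2 / gamma^2 <= {D**2 / gamma_lb**2:.1f}")
    assert mistakes <= D**2 / gamma_lb**2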

