Information Theory meets Machine Learning
Kedar Tatwawadi
EE376a Course
Table of contents
1. Introduction
2. Unsupervised Learning
3. Learning Data Distribution
4. ML for Lossy Compression
Introduction
ML and IT
ML/Statistics & Information theory are two sides of the same coin!
Figure 1: ML and IT
A short (very) intro to ML
Figures 2–4: The ML zoo
Supervised Learning
Given data tuples (X_1, y_1), (X_2, y_2), ..., (X_N, y_N), find a function F such that:
\[
F(X) = y
\]
Supervised Learning
1. What is the function F?
2. SVM, ConvNet, Recurrent Neural Network, Decision Tree, ...
Take the CS229 and CS231n courses!
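To make the setup concrete, here is a minimal sketch (illustrative only, not course material) that fits one possible choice of F, a decision tree, on toy (X, y) tuples with scikit-learn:

```python
# Minimal supervised-learning sketch: learn F from (X_i, y_i) pairs,
# then use it to predict y = F(X) on a new input.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]                # feature tuples X_i (toy 1-D inputs)
y = [0, 0, 1, 1]                        # labels y_i

F = DecisionTreeClassifier().fit(X, y)  # "learn" F such that F(X_i) = y_i
print(F.predict([[2.5]]))               # predicted label for a new input X
```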
A short (very) intro to ML
Figures 5–6: The ML zoo
Unsupervised Learning
Unsupervised Learning
Given data: X_1, X_2, X_3, ..., X_N
"Learn" something useful about X:
1. Clustering
2. Data Representation
3. Distribution of the data
Clustering
Figure 7: Clustering
Data Representation
Figure 8: Word2Vec Representation
Learning Data Distribution
Learning the distribution
"Learn" the underlying distribution of the data.
Given data X^(1), X^(2), ..., X^(N) with distribution p_X(X), how do we learn p_X(X)?
Use cases:
1. Sampling
2. Prediction
3. De-noising
4. Compression
Sampling
Prediction
Denoising
Learning the distribution
"Learn" the underlying distribution of the data:
1. Sampling
2. Prediction
3. De-noising
4. Compression
Learning the distribution
Data: X^(1), X^(2), ..., X^(N) i.i.d. (independent and identically distributed) with distribution p_X(X)
• X_i ∈ X
• Potentially |X| can be high
How do we learn p_X(X)?
• We can use the log-loss (cross-entropy loss) to learn p_X(X):
\[
p_X = \arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right] \tag{1}
\]
Learning the distribution
Data: X^(1), X^(2), ..., X^(N) with distribution p_X(X)
\[
\begin{aligned}
\mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log\!\left(\frac{1}{p(x)} \cdot \frac{p(x)}{q(x)}\right) \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\
&= H_p(X) + D_{\mathrm{KL}}(p_X \,\|\, q)
\end{aligned}
\]
\[
p_X = \arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
\]
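A quick numerical check of this decomposition (a toy example of my own, not from the slides):

```python
# Verify numerically that E_p[log 1/q(X)] = H_p(X) + D_KL(p || q)
# for a small hand-picked pair of distributions p and q.
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # true distribution p_X
q = np.array([0.4, 0.4, 0.2])   # model distribution q

log_loss = np.sum(p * np.log2(1.0 / q))   # E_p[log 1/q(X)]
entropy  = np.sum(p * np.log2(1.0 / p))   # H_p(X)
kl       = np.sum(p * np.log2(p / q))     # D_KL(p || q)

print(log_loss, entropy + kl)   # the two numbers agree; minimized iff q = p
```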
Learning the distribution
\[
p_X = \arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
\]
• In practice we consider the empirical expectation instead:
\[
\arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
\;\approx\;
\arg\min_{q(X)} \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
\]
Learning the distribution
• In practice we consider the empirical expectation instead:
\[
\begin{aligned}
\arg\min_{q(X)} \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
&= \arg\min_{q(X)} \frac{1}{N} \log \frac{1}{q(X^{(1)})\, q(X^{(2)}) \cdots q(X^{(N)})} \\
&= \arg\min_{q(X)} \sum_{x \in \mathcal{X}} \frac{n_x}{N} \log \frac{1}{q(x)} \\
&= \arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\end{aligned}
\]
Learning the distribution
\[
\arg\min_{q(X)} \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
= \arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
= \hat{p}_X, \qquad \text{where } \hat{p}_X(x) = \frac{n_x}{N}
\]
• When X = (Y_1, Y_2, ..., Y_d), |X| = k^d
• For high |X|, p̂_X is not useful!
• We need more data, or ... some regularization.
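A small sketch (my own, assuming binary coordinates) of why the raw empirical distribution breaks down in high dimensions:

```python
# Empirical distribution p_hat(x) = n_x / N: with d binary coordinates the
# alphabet size is 2**d, so for moderate d almost every symbol is unseen and
# gets probability zero, which is useless for prediction or compression.
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 1000
data = [tuple(row) for row in rng.integers(0, 2, size=(N, d))]

counts = Counter(data)
p_hat = {x: n / N for x, n in counts.items()}

print(len(p_hat), "distinct symbols observed out of", 2 ** d)
# roughly 1000 symbols out of ~10^6 get nonzero probability; the rest get 0
```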
Data Example
• X = (Y_1, Y_2, ..., Y_d), |X| = k^d, N ≈ the number of dimensions d.
Regularization
\[
\arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(x)}\right]
\;\approx\;
\arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\;\approx\;
\arg\min_{q(X) \in \mathcal{Q}} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\]
• q(X) = q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_2, Y_1) ... q_d(Y_d|Y_1, ..., Y_{d-1})
• Q restricts the set of allowed distributions,
e.g.: q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2) q_3(Y_3) ... q_d(Y_d)
Q_I: independent distributions
• Q_I restricts the distribution over the d dimensions to be independent,
e.g.: q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2) q_3(Y_3) ... q_d(Y_d)
\[
\arg\min_{q(X) \in \mathcal{Q}_I} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
= \hat{q}_1(y_1)\, \hat{q}_2(y_2) \cdots \hat{q}_d(y_d),
\]
where each q̂_j is the empirical marginal of Y_j.
• This product form q(Y_1, Y_2, ..., Y_d) = q̂_1(y_1) q̂_2(y_2) ... q̂_d(y_d) is not very useful for the tabular dataset.
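A minimal sketch of the Q_I fit (my own example, assuming binary coordinates): each marginal q̂_j is estimated separately from its column.

```python
# Under the independence restriction Q_I, the optimal q_j is the empirical
# marginal of coordinate Y_j, so the fit reduces to d separate column averages.
import numpy as np

rng = np.random.default_rng(1)
Y = rng.integers(0, 2, size=(1000, 8))     # N samples of (Y_1, ..., Y_d), binary

marginals = Y.mean(axis=0)                 # q_hat_j(Y_j = 1) for each coordinate j

def q_independent(y):
    """q(y) = prod_j q_hat_j(y_j): the product-of-marginals model."""
    return np.prod(np.where(y == 1, marginals, 1.0 - marginals))

print(q_independent(Y[0]))                 # model probability of the first sample
```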
Tabular Example
Tree-based distributions
• We restrict distributions to T, e.g.:
T = { q | q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2|Y_{i_2}) q_3(Y_3|Y_{i_3}) ... q_d(Y_d|Y_{i_d}) }
• For every Y_j, we allow dependence on one of the other variables Y_{i_j}, with i_j < j
• This corresponds exactly to a "tree distribution"
Tree-based distributions — Examples
• Example tree distribution:
q(Y_1, Y_2, ..., Y_5) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_1) q_4(Y_4|Y_2) q_5(Y_5|Y_2)
Tree-based distributions — Examples
• Example tree distribution:
q(Y_1, Y_2, Y_3) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_2)
Figure 11: Graph example
Tree-based distributions
Figure 12: Graph example
• Tree distributions are practical! Number of parameters = d k^2
• Sampling is easy (in breadth-first-search order): Y_1 → Y_2 → Y_3 → Y_4 → Y_5
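A sketch of ancestral sampling for the five-variable example tree (the conditional tables are toy numbers of my own choosing, binary alphabet):

```python
# Sample from q(Y1,...,Y5) = q1(Y1) q2(Y2|Y1) q3(Y3|Y1) q4(Y4|Y2) q5(Y5|Y2)
# by always sampling a parent before its children (BFS order Y1 -> ... -> Y5).
import numpy as np

rng = np.random.default_rng(0)
parent = {2: 1, 3: 1, 4: 2, 5: 2}                 # tree structure
p_root = np.array([0.6, 0.4])                     # q1(Y1)
p_cond = {i: np.array([[0.8, 0.2],                # row = parent value,
                       [0.3, 0.7]])               # column = child value
          for i in parent}                        # same toy table for every edge

def sample_tree():
    y = {1: rng.choice(2, p=p_root)}
    for i in (2, 3, 4, 5):                        # parents are always sampled first
        y[i] = rng.choice(2, p=p_cond[i][y[parent[i]]])
    return [y[i] for i in (1, 2, 3, 4, 5)]

print(sample_tree())
```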
Tree-based distributions
• Can be used for compression, together with arithmetic coding:
• q(Y_1, Y_2, ..., Y_5) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_1) q_4(Y_4|Y_2) q_5(Y_5|Y_2)
Figure 13: HW3 Q3(f)
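A rough sketch of the connection to compression (my own toy numbers, not the HW3 solution): with an arithmetic coder, the code length of a sample is essentially the model's log-loss on that sample, summed conditional by conditional.

```python
# Ideal code length under the tree model: sum of log2(1 / q_i(y_i | parent)) bits.
import numpy as np

parent = {2: 1, 3: 1, 4: 2, 5: 2}                                # tree edges
p_root = np.array([0.6, 0.4])                                    # q1(Y1)
p_cond = {i: np.array([[0.8, 0.2], [0.3, 0.7]]) for i in parent} # q_i(. | parent)

def code_length_bits(y):
    """Approximate arithmetic-coding length (bits) of one sample y = {index: value}."""
    bits = -np.log2(p_root[y[1]])
    for i, par in parent.items():
        bits += -np.log2(p_cond[i][y[par], y[i]])
    return bits

print(code_length_bits({1: 0, 2: 1, 3: 0, 4: 1, 5: 0}))          # a few bits per sample
```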
Chow-Liu Tree Algorithm
• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distribution p̂_X(X) = p̂_X(Y_1, Y_2, ..., Y_d).
The best tree graph representing the data can be found by:
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{2}
\]
• Intuition: add the edges which have "high" correlation.
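A compact implementation outline (my own sketch, assuming small discrete alphabets): estimate all pairwise empirical mutual informations, then greedily grow a maximum-weight spanning tree.

```python
# Chow-Liu sketch: plug-in pairwise MI + Prim-style maximum spanning tree.
import numpy as np
from itertools import combinations

def empirical_mi(a, b):
    """Plug-in mutual information (bits) between two discrete columns a, b."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def chow_liu_tree(data):
    """data: (N, d) integer array. Returns the d-1 tree edges (i, j)."""
    d = data.shape[1]
    mi = {(i, j): empirical_mi(data[:, i], data[:, j])
          for i, j in combinations(range(d), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Prim's algorithm: take the heaviest edge crossing the current cut.
        i, j = max((e for e in mi if (e[0] in in_tree) != (e[1] in in_tree)),
                   key=lambda e: mi[e])
        edges.append((i, j))
        in_tree |= {i, j}
    return edges

rng = np.random.default_rng(0)
y1 = rng.integers(0, 2, 500)
demo = np.stack([y1, y1, rng.integers(0, 2, 500)], axis=1)  # Y2 copies Y1, Y3 independent
print(chow_liu_tree(demo))   # the strongly dependent pair (0, 1) is picked first
```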
Data Example
• X = (Y_1, Y_2, ..., Y_d), |X| = k^d, N ≈ the number of dimensions d.
Chow-Liu Tree Algorithm
• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distribution p̂_X(X) = p̂_X(Y_1, Y_2, ..., Y_d).
The best tree graph representing the data can be found by:
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{3}
\]
• Intuition: add the edges which have "high" correlation.
• We will prove this in class!
Practical Considerations
• Exhaustive search over all trees is not possible; instead use a maximum-weight spanning tree algorithm such as Kruskal's or Prim's (roughly O(d^2 log d) on the complete graph over the d variables).
• We also need to compute the O(d^2) pairwise mutual informations, which is the more costly part.
Practical Considerations
• The tree
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{4}
\]
is a solution to the problem
\[
\arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\;\approx\;
\arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(x)}\right],
\]
and thus "approximates" p̂_X(X).
• What we really want to solve is
\[
\arg\max \sum_{\text{edges}(i,j)} I(Y_i; Y_j)
\]
with the true mutual informations.
• Using N samples, we can build better estimators of I(Y_i; Y_j) than the empirical plug-in estimator Î(Y_i; Y_j).
• Information theory helps us get better estimators!
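A quick simulation (my own, not from the slides) of why there is room to improve on the plug-in estimator: for two independent binary variables the true I(Y_i; Y_j) is zero, yet the empirical Î is positive on average.

```python
# Average plug-in MI estimate over many trials of N samples of two
# independent fair bits; the true mutual information is exactly 0.
import numpy as np

rng = np.random.default_rng(0)
N, trials, estimates = 50, 2000, []
for _ in range(trials):
    a, b = rng.integers(0, 2, N), rng.integers(0, 2, N)
    joint = np.zeros((2, 2))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= N
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    estimates.append(float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum()))

print(np.mean(estimates))   # clearly above 0: the plug-in estimator is biased upward
```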
Practical Considerations
• When we solve the optimization problem
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j), \tag{5}
\]
we are not penalizing model complexity.
• Practically this is important. For example in compression, along with the arithmetic coding we also need space to store the distributions p̂(Y_i | Y_{i_j}) themselves!
• The BIC (Bayesian Information Criterion) alters the optimization by adding a penalty for model complexity:
\[
\arg\max \sum_{\text{edges}(i,j)} \left[ \hat{I}(Y_i; Y_j) - \frac{\log N}{2N}\, |\mathcal{Y}_i|\, |\mathcal{Y}_j| \right]
\]
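As a rough illustration only (the exact penalty normalization here is my assumption and may differ from the lecture's constant), a BIC-style edge score simply shrinks each edge's empirical MI by a term that grows with log N and with the size of the conditional table that would have to be stored:

```python
# Hypothetical BIC-style edge score: empirical MI minus a complexity penalty.
# The (log N)/(2N) * |Y_i||Y_j| normalization is one plausible per-sample
# convention, not necessarily the exact constant used in the lecture.
import numpy as np

def bic_edge_score(mi_hat, N, k_i, k_j):
    return mi_hat - (np.log2(N) / (2 * N)) * k_i * k_j

print(bic_edge_score(mi_hat=0.05, N=1000, k_i=2, k_j=2))  # strong edge: stays positive
print(bic_edge_score(mi_hat=0.01, N=1000, k_i=2, k_j=2))  # weak edge: goes negative, dropped
```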
General Bayesian Networks
Chow-Liu Algorithm for Bayesian Networks
• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distributions.
For general Bayesian networks:
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{6}
\]
• The Chow-Liu algorithm for Bayesian networks is an approximation based on the intuition from the tree-based algorithm.
• Exact solutions are no longer possible; heuristic greedy schemes are applied instead.
Learning Distributions — Language model
Figures 14–23: Slides borrowed from the CS224n lecture, Jan 22
Learning Distributions — Language model
Figure 24: DeepZip, a compressor based on language models
ML for Lossy Compression
Representation learning for Compression
• More recently, ML is being used for lossy compression of data.
• Why use ML?
1. Unclear, high-dimensional data models, e.g. natural images.
2. Unclear loss functions for lossy compression, e.g. image compression → "human perception loss".
ML for Compression
Figures 25–26: WaveOne image compression using neural networks
ICLR Compression Contest
Figure 27: More details: https://www.compression.cc
ML for Compression
• Lots of very different techniques are used. The core idea is:
1. Learn a "smooth" representation of the image
2. Quantize the representation
3. Entropy-code the quantized representation
• The "smoothness" of the representation is the key to good lossy compression.
• Different techniques are used for compression: autoencoders, VAEs, GANs.
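A toy sketch of the three-step recipe (my own stand-in numbers, not any of the systems cited below): a representation z, uniform quantization, and an entropy estimate of the quantized symbols that an entropy coder would approximately achieve.

```python
# 1. "representation": here just a Gaussian stand-in for a learned latent z
# 2. quantize it with a uniform step
# 3. estimate the entropy of the quantized symbols (the achievable rate)
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)                  # stand-in for a learned representation
step = 0.5
z_q = np.round(z / step).astype(int)         # quantized representation

counts = Counter(z_q.tolist())
probs = np.array([c / len(z_q) for c in counts.values()])
bits_per_symbol = -(probs * np.log2(probs)).sum()
print(f"{bits_per_symbol:.2f} bits/symbol")  # larger step -> fewer bits, more distortion
```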
ML for Compression
Some notable papers:
• Toderici, George, et al. "Full resolution image compression with recurrent neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Figure 28: Compression over multiple iterations
ML for Compression
Some notable papers:
• Rippel et al. "Real-time adaptive image compression." ICML 2017 Proceedings.
Figure 29: Using a GAN-based loss
ML for Channel Coding
Figures 30–32: Using an RNN for channel coding
ML for Joint Source-Channel Coding
Figure 33: NECST, Neural Joint Source-Channel Coding (Choi et al., arXiv)
Conclusion?
ML/Statistics & Information theory are two sides of the same coin!
Figure 34: ML and IT
Thank You!