Information Theory meets Machine Learning
Kedar Tatwawadi
EE376a Course
Table of contents
1. Introduction
2. Unsupervised Learning
3. Learning Data Distribution
4. ML for Lossy Compression
Introduction
ML and IT
ML/Statistics & Information theory are two sides of the same coin!
Figure 1: ML and IT
A short (very) intro to ML
Figures 2–4: The ML zoo
Supervised Learning
Given data tuples (X_1, y_1), (X_2, y_2), ..., (X_N, y_N), find a function F such that:
\[
F(X) = y
\]
Supervised Learning
1. What is the function F?
2. SVM, ConvNet, Recurrent Neural Network, Decision Tree, ...
Take the CS229 and CS231n courses!
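To make the setup concrete, here is a minimal sketch (illustrative only, not course material) that fits one possible choice of F, a decision tree, on toy (X, y) tuples with scikit-learn:

```python
# Minimal supervised-learning sketch: learn F from (X_i, y_i) pairs,
# then use it to predict y = F(X) on a new input.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]                # feature tuples X_i (toy 1-D inputs)
y = [0, 0, 1, 1]                        # labels y_i

F = DecisionTreeClassifier().fit(X, y)  # "learn" F such that F(X_i) = y_i
print(F.predict([[2.5]]))               # predicted label for a new input X
```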
A short (very) intro to ML
Figures 5–6: The ML zoo
Unsupervised Learning
Unsupervised Learning
Given data: X_1, X_2, X_3, ..., X_N
"Learn" something useful about X:
1. Clustering
2. Data Representation
3. Distribution of the data
Clustering
Figure 7: Clustering
Data Representation
Figure 8: Word2Vec Representation
Learning Data Distribution
Learning the distribution
"Learn" the underlying distribution of the data.
Given data X^(1), X^(2), ..., X^(N) with distribution p_X(X), how do we learn p_X(X)?
Use cases:
1. Sampling
2. Prediction
3. De-noising
4. Compression
Sampling
Prediction
Denoising
Learning the distribution
"Learn" the underlying distribution of the data:
1. Sampling
2. Prediction
3. De-noising
4. Compression
Learning the distribution
Data: X^(1), X^(2), ..., X^(N) i.i.d. (independent and identically distributed) with distribution p_X(X)
• X_i ∈ X
• Potentially |X| can be high
How do we learn p_X(X)?
• We can use the log-loss (cross-entropy loss) to learn p_X(X):
\[
p_X = \arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right] \tag{1}
\]
Learning the distribution
Data: X^(1), X^(2), ..., X^(N) with distribution p_X(X)
\[
\begin{aligned}
\mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log\!\left(\frac{1}{p(x)} \cdot \frac{p(x)}{q(x)}\right) \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\
&= H_p(X) + D_{\mathrm{KL}}(p_X \,\|\, q)
\end{aligned}
\]
\[
p_X = \arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
\]
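A quick numerical check of this decomposition (a toy example of my own, not from the slides):

```python
# Verify numerically that E_p[log 1/q(X)] = H_p(X) + D_KL(p || q)
# for a small hand-picked pair of distributions p and q.
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # true distribution p_X
q = np.array([0.4, 0.4, 0.2])   # model distribution q

log_loss = np.sum(p * np.log2(1.0 / q))   # E_p[log 1/q(X)]
entropy  = np.sum(p * np.log2(1.0 / p))   # H_p(X)
kl       = np.sum(p * np.log2(p / q))     # D_KL(p || q)

print(log_loss, entropy + kl)   # the two numbers agree; minimized iff q = p
```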
Learning the distribution
\[
p_X = \arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
\]
• In practice we consider the empirical expectation instead:
\[
\arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(X)}\right]
\;\approx\;
\arg\min_{q(X)} \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
\]
Learning the distribution
• In practice we consider the empirical expectation instead:
\[
\begin{aligned}
\arg\min_{q(X)} \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
&= \arg\min_{q(X)} \frac{1}{N} \log \frac{1}{q(X^{(1)})\, q(X^{(2)}) \cdots q(X^{(N)})} \\
&= \arg\min_{q(X)} \sum_{x \in \mathcal{X}} \frac{n_x}{N} \log \frac{1}{q(x)} \\
&= \arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\end{aligned}
\]
Learning the distribution
\[
\arg\min_{q(X)} \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{q(X^{(i)})}
= \arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
= \hat{p}_X, \qquad \text{where } \hat{p}_X(x) = \frac{n_x}{N}
\]
• When X = (Y_1, Y_2, ..., Y_d), |X| = k^d
• For high |X|, p̂_X is not useful!
• We need more data, or ... some regularization.
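A small sketch (my own, assuming binary coordinates) of why the raw empirical distribution breaks down in high dimensions:

```python
# Empirical distribution p_hat(x) = n_x / N: with d binary coordinates the
# alphabet size is 2**d, so for moderate d almost every symbol is unseen and
# gets probability zero, which is useless for prediction or compression.
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 1000
data = [tuple(row) for row in rng.integers(0, 2, size=(N, d))]

counts = Counter(data)
p_hat = {x: n / N for x, n in counts.items()}

print(len(p_hat), "distinct symbols observed out of", 2 ** d)
# roughly 1000 symbols out of ~10^6 get nonzero probability; the rest get 0
```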
Data Example
• X = (Y_1, Y_2, ..., Y_d), |X| = k^d, N ≈ the number of dimensions d.
Regularization
\[
\arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(x)}\right]
\;\approx\;
\arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\;\approx\;
\arg\min_{q(X) \in \mathcal{Q}} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\]
• q(X) = q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_2, Y_1) ... q_d(Y_d|Y_1, ..., Y_{d-1})
• Q restricts the set of allowed distributions,
e.g.: q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2) q_3(Y_3) ... q_d(Y_d)
Q_I: independent distributions
• Q_I restricts the distribution over the d dimensions to be independent,
e.g.: q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2) q_3(Y_3) ... q_d(Y_d)
\[
\arg\min_{q(X) \in \mathcal{Q}_I} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
= \hat{q}_1(y_1)\, \hat{q}_2(y_2) \cdots \hat{q}_d(y_d),
\]
where each q̂_j is the empirical marginal of Y_j.
• This product form q(Y_1, Y_2, ..., Y_d) = q̂_1(y_1) q̂_2(y_2) ... q̂_d(y_d) is not very useful for the tabular dataset.
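A minimal sketch of the Q_I fit (my own example, assuming binary coordinates): each marginal q̂_j is estimated separately from its column.

```python
# Under the independence restriction Q_I, the optimal q_j is the empirical
# marginal of coordinate Y_j, so the fit reduces to d separate column averages.
import numpy as np

rng = np.random.default_rng(1)
Y = rng.integers(0, 2, size=(1000, 8))     # N samples of (Y_1, ..., Y_d), binary

marginals = Y.mean(axis=0)                 # q_hat_j(Y_j = 1) for each coordinate j

def q_independent(y):
    """q(y) = prod_j q_hat_j(y_j): the product-of-marginals model."""
    return np.prod(np.where(y == 1, marginals, 1.0 - marginals))

print(q_independent(Y[0]))                 # model probability of the first sample
```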
Tabular Example
Tree-based distributions
• We restrict distributions to T, e.g.:
T = { q | q(Y_1, Y_2, ..., Y_d) = q_1(Y_1) q_2(Y_2|Y_{i_2}) q_3(Y_3|Y_{i_3}) ... q_d(Y_d|Y_{i_d}) }
• For every Y_j, we allow dependence on one of the other variables Y_{i_j}, with i_j < j
• This corresponds exactly to a "tree distribution"
Tree-based distributions — Examples
• Example tree distribution:
q(Y_1, Y_2, ..., Y_5) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_1) q_4(Y_4|Y_2) q_5(Y_5|Y_2)
Tree-based distributions — Examples
• Example tree distribution:
q(Y_1, Y_2, Y_3) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_2)
Figure 11: Graph example
Tree-based distributions
Figure 12: Graph example
• Tree distributions are practical! Number of parameters = d k^2
• Sampling is easy (in breadth-first-search order): Y_1 → Y_2 → Y_3 → Y_4 → Y_5
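A sketch of ancestral sampling for the five-variable example tree (the conditional tables are toy numbers of my own choosing, binary alphabet):

```python
# Sample from q(Y1,...,Y5) = q1(Y1) q2(Y2|Y1) q3(Y3|Y1) q4(Y4|Y2) q5(Y5|Y2)
# by always sampling a parent before its children (BFS order Y1 -> ... -> Y5).
import numpy as np

rng = np.random.default_rng(0)
parent = {2: 1, 3: 1, 4: 2, 5: 2}                 # tree structure
p_root = np.array([0.6, 0.4])                     # q1(Y1)
p_cond = {i: np.array([[0.8, 0.2],                # row = parent value,
                       [0.3, 0.7]])               # column = child value
          for i in parent}                        # same toy table for every edge

def sample_tree():
    y = {1: rng.choice(2, p=p_root)}
    for i in (2, 3, 4, 5):                        # parents are always sampled first
        y[i] = rng.choice(2, p=p_cond[i][y[parent[i]]])
    return [y[i] for i in (1, 2, 3, 4, 5)]

print(sample_tree())
```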
Tree-based distributions
• Can be used for compression, together with arithmetic coding:
• q(Y_1, Y_2, ..., Y_5) = q_1(Y_1) q_2(Y_2|Y_1) q_3(Y_3|Y_1) q_4(Y_4|Y_2) q_5(Y_5|Y_2)
Figure 13: HW3 Q3(f)
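A rough sketch of the connection to compression (my own toy numbers, not the HW3 solution): with an arithmetic coder, the code length of a sample is essentially the model's log-loss on that sample, summed conditional by conditional.

```python
# Ideal code length under the tree model: sum of log2(1 / q_i(y_i | parent)) bits.
import numpy as np

parent = {2: 1, 3: 1, 4: 2, 5: 2}                                # tree edges
p_root = np.array([0.6, 0.4])                                    # q1(Y1)
p_cond = {i: np.array([[0.8, 0.2], [0.3, 0.7]]) for i in parent} # q_i(. | parent)

def code_length_bits(y):
    """Approximate arithmetic-coding length (bits) of one sample y = {index: value}."""
    bits = -np.log2(p_root[y[1]])
    for i, par in parent.items():
        bits += -np.log2(p_cond[i][y[par], y[i]])
    return bits

print(code_length_bits({1: 0, 2: 1, 3: 0, 4: 1, 5: 0}))          # a few bits per sample
```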
Chow-Liu Tree Algorithm
• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distribution p̂_X(X) = p̂_X(Y_1, Y_2, ..., Y_d).
The best tree graph representing the data can be found by:
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{2}
\]
• Intuition: add the edges which have "high" correlation.
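A compact implementation outline (my own sketch, assuming small discrete alphabets): estimate all pairwise empirical mutual informations, then greedily grow a maximum-weight spanning tree.

```python
# Chow-Liu sketch: plug-in pairwise MI + Prim-style maximum spanning tree.
import numpy as np
from itertools import combinations

def empirical_mi(a, b):
    """Plug-in mutual information (bits) between two discrete columns a, b."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def chow_liu_tree(data):
    """data: (N, d) integer array. Returns the d-1 tree edges (i, j)."""
    d = data.shape[1]
    mi = {(i, j): empirical_mi(data[:, i], data[:, j])
          for i, j in combinations(range(d), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Prim's algorithm: take the heaviest edge crossing the current cut.
        i, j = max((e for e in mi if (e[0] in in_tree) != (e[1] in in_tree)),
                   key=lambda e: mi[e])
        edges.append((i, j))
        in_tree |= {i, j}
    return edges

rng = np.random.default_rng(0)
y1 = rng.integers(0, 2, 500)
demo = np.stack([y1, y1, rng.integers(0, 2, 500)], axis=1)  # Y2 copies Y1, Y3 independent
print(chow_liu_tree(demo))   # the strongly dependent pair (0, 1) is picked first
```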
Data Example
• X = (Y_1, Y_2, ..., Y_d), |X| = k^d, N ≈ the number of dimensions d.
Chow-Liu Tree Algorithm
• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distribution p̂_X(X) = p̂_X(Y_1, Y_2, ..., Y_d).
The best tree graph representing the data can be found by:
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{3}
\]
• Intuition: add the edges which have "high" correlation.
• We will prove this in class!
Practical Considerations
• Exhaustive search over all trees is not possible; instead use a maximum-weight spanning tree algorithm such as Kruskal's or Prim's (roughly O(d^2 log d) on the complete graph over the d variables).
• We also need to compute the O(d^2) pairwise mutual informations, which is the more costly part.
Practical Considerations
• The tree
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{4}
\]
is a solution to the problem
\[
\arg\min_{q(X)} \mathbb{E}_{\hat{p}_X}\!\left[\log \frac{1}{q(x)}\right]
\;\approx\;
\arg\min_{q(X)} \mathbb{E}_{p_X}\!\left[\log \frac{1}{q(x)}\right],
\]
and thus "approximates" p̂_X(X).
• What we really want to solve is
\[
\arg\max \sum_{\text{edges}(i,j)} I(Y_i; Y_j)
\]
with the true mutual informations.
• Using N samples, we can build better estimators of I(Y_i; Y_j) than the empirical plug-in estimator Î(Y_i; Y_j).
• Information theory helps us get better estimators!
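A quick simulation (my own, not from the slides) of why there is room to improve on the plug-in estimator: for two independent binary variables the true I(Y_i; Y_j) is zero, yet the empirical Î is positive on average.

```python
# Average plug-in MI estimate over many trials of N samples of two
# independent fair bits; the true mutual information is exactly 0.
import numpy as np

rng = np.random.default_rng(0)
N, trials, estimates = 50, 2000, []
for _ in range(trials):
    a, b = rng.integers(0, 2, N), rng.integers(0, 2, N)
    joint = np.zeros((2, 2))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= N
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    estimates.append(float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum()))

print(np.mean(estimates))   # clearly above 0: the plug-in estimator is biased upward
```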
Practical Considerations
• When we solve the optimization problem
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j), \tag{5}
\]
we are not penalizing model complexity.
• Practically this is important. For example in compression, along with the arithmetic coding we also need space to store the distributions p̂(Y_i | Y_{i_j}) themselves!
• The BIC (Bayesian Information Criterion) alters the optimization by adding a penalty for model complexity:
\[
\arg\max \sum_{\text{edges}(i,j)} \left[ \hat{I}(Y_i; Y_j) - \frac{\log N}{2N}\, |\mathcal{Y}_i|\, |\mathcal{Y}_j| \right]
\]
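As a rough illustration only (the exact penalty normalization here is my assumption and may differ from the lecture's constant), a BIC-style edge score simply shrinks each edge's empirical MI by a term that grows with log N and with the size of the conditional table that would have to be stored:

```python
# Hypothetical BIC-style edge score: empirical MI minus a complexity penalty.
# The (log N)/(2N) * |Y_i||Y_j| normalization is one plausible per-sample
# convention, not necessarily the exact constant used in the lecture.
import numpy as np

def bic_edge_score(mi_hat, N, k_i, k_j):
    return mi_hat - (np.log2(N) / (2 * N)) * k_i * k_j

print(bic_edge_score(mi_hat=0.05, N=1000, k_i=2, k_j=2))  # strong edge: stays positive
print(bic_edge_score(mi_hat=0.01, N=1000, k_i=2, k_j=2))  # weak edge: goes negative, dropped
```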
General Bayesian Networks
Chow-Liu Algorithm for Bayesian Networks
• Let Î(Y_i; Y_j) be the mutual information computed using the "empirical" distributions.
For general Bayesian networks:
\[
G = \arg\max \sum_{\text{edges}(i,j)} \hat{I}(Y_i; Y_j) \tag{6}
\]
• The Chow-Liu algorithm for Bayesian networks is an approximation based on the intuition from the tree-based algorithm.
• Exact solutions are no longer possible; heuristic greedy schemes are applied instead.
Learning Distributions — Language model
Figures 14–23: Slides borrowed from the CS224n lecture, Jan 22
Learning Distributions — Language model
Figure 24: DeepZip, a compressor based on language models
ML for Lossy Compression
Representation learning for Compression
• More recently, ML is being used for lossy compression of data.
• Why use ML?
1. Unclear, high-dimensional data models, e.g. natural images.
2. Unclear loss functions for lossy compression, e.g. image compression → "human perception loss".
ML for Compression
Figures 25–26: WaveOne image compression using neural networks
ICLR Compression Contest
Figure 27: More details: https://www.compression.cc
ML for Compression
• Lots of very different techniques are used. The core idea is:
1. Learn a "smooth" representation of the image
2. Quantize the representation
3. Entropy-code the quantized representation
• The "smoothness" of the representation is the key to good lossy compression.
• Different techniques are used for compression: autoencoders, VAEs, GANs.
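A toy sketch of the three-step recipe (my own stand-in numbers, not any of the systems cited below): a representation z, uniform quantization, and an entropy estimate of the quantized symbols that an entropy coder would approximately achieve.

```python
# 1. "representation": here just a Gaussian stand-in for a learned latent z
# 2. quantize it with a uniform step
# 3. estimate the entropy of the quantized symbols (the achievable rate)
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)                  # stand-in for a learned representation
step = 0.5
z_q = np.round(z / step).astype(int)         # quantized representation

counts = Counter(z_q.tolist())
probs = np.array([c / len(z_q) for c in counts.values()])
bits_per_symbol = -(probs * np.log2(probs)).sum()
print(f"{bits_per_symbol:.2f} bits/symbol")  # larger step -> fewer bits, more distortion
```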
ML for Compression
Some notable papers:
• Toderici, George, et al. "Full resolution image compression with recurrent neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Figure 28: Compression over multiple iterations
ML for Compression
Some notable papers:
• Rippel et al. "Real-time adaptive image compression." ICML 2017 Proceedings.
Figure 29: Using a GAN-based loss
ML for Channel Coding
Figures 30–32: Using an RNN for channel coding
ML for Joint Source-Channel Coding
Figure 33: NECST, Neural Joint Source-Channel Coding (Choi et al., arXiv)
Conclusion?
ML/Statistics & Information theory are two sides of the same coin!
Figure 34: ML and IT
Thank You!