Machine learning
Lecture 1
Lecturer: Haim Permuter Scribe: Gal Rattner
In this lecture we introduce machine learning, define the relevant notations, and
examine the popular machine learning algorithm K-Nearest Neighbors. Most of the
material for this lecture was taken from the work of Cover and Hart on K-NN [1].
The goal of machine learning is to program computers to use training data or gained
experience to solve given problems. We can broadly say that a machine learns whenever
it changes its structure, program or data such that its future expected performance is
improved. Many successful applications of machine learning already exist today, including
systems that predict stock prices, face recognition technology incorporated in digital
cameras or Facebook, speech recognition in Google Assistant, Siri or Alexa, safety
systems in cars (e.g., Mobileye), advertising on the web, text auto-completion, and "spam"
mail detection programs. In recent years, machine learning is entering every engineering
field that you may think about and is expected to have a great impact in the next decades
since the technology is progressing significantly.
The type of problems that we try to solve by machine learning can be divided into
two categories:
• Classification problems: Given stochastic observation x, classification is the prob-
lem of associating x with one class from among a set of classes.
For example:
- Associating a recorded speech with the speaker (speaker recognition).
- Associating an image with a letter (digit recognition).
- Associating an image with an object, e.g., recognizing a car (object recognition).
B. Training/Validation/Test Data
Classification and regression are both based on a training session comprising a set
of training samples x_i, where each sample has its own fixed label l_i. The success rate
of the classification is then evaluated with a set of validation samples {x1 , x2 , . . . , xn }
whose properties are similar to those of the training set. The evaluation of the classifi-
cation/regression is done by comparing the original labels (using a loss function that we
will explain later in the lecture) with the labels associated by the trained model. Usually,
the learning model might still be changed (especially hyper-parameters, like the size or
depth of the machine learning model) during the validation and therefore there is a need
for final testing of the machine learning system on additional data called the test data,
on which the model cannot be changed.
C. Types of learning
Machine learning methods can be categorized into three main types according to the data
they learn from.
Supervised - In the training stage, the system is presented with labels for each sample.
Sample-label couples {(x1 , l1 ), (x2 , l2 ), . . . , (xN , lN )} are given to the system along with
the unlabeled test set {y1 , y2 , . . . , yM }. The system then associates labels with the test
samples according to the statistic model based on the training set. The supervised learning
system checks the correctness of its associating process by comparing the association
results with the original test labels. Supervised learning is often used in classification
and regression problems, for instance, in digit recognition, house price estimations, etc.
Unsupervised - In the training stage, the system is not given any fixed labels as
inputs, and as such, the system must define relevant labels. This usually requires that the
system learn the distribution of the sample set. Unsupervised learning is often used for
segmentation, for instance, edge detection in image processing tasks, etc.
Reinforcement learning - An intermediate stage to supervised and unsupervised learn-
ing, reinforcement learning entails the system learning “on the fly” through its experience.
For each attempt, the system receives some reward and then determines whether the
attempt was a failure or a success. The system thus gains experience and learns which of
its attempts were “good” by comparing between the rewards it received for the attempts.
Reinforcement learning is used to train computers to be experts in defined tasks, for
example, playing a game (e.g. https://www.youtube.com/watch?v=V1eYniJ0Rnk).
In this course, we will focus on supervised learning, though unsupervised learning
often constitutes an integral step in training the supervised model. We have a whole
course only on Reinforcement learning (http://www.ee.bgu.ac.il/∼haimp/RL/index.html).
D. Types of models
Generative model: The learning model is probabilistic and it models the joint
distribution of the samples and the labels, i.e., P (x, l). It is called generative since often
one can use it in order to generate data similar to the one it learns from. An example of
generative model is the Gaussian Mixture Model that we learn later in the course.
Discriminative model: The learning model learns to discriminate the feature x into
labels l, namely, it learns a mapping φ : x 7→ l, or the conditional probability P (l|x) but
not the joint as in the generative model. Discriminative models are used for supervised
learning and very rarely for unsupervised learning. Examples of discriminative models
that we will learn in the course are Logistic regression model and Neural network model.
II. NOTATION
Similarly, E[g(X)] is

E[g(X)] ≜ \sum_{x ∈ X} g(x) P(x).   (2)
III. CLASSIFICATION WHEN THE PROBABILITY IS KNOWN
Let {η_1, η_2, . . . , η_M}, η_i > 0 ∀i, \sum_i η_i = 1, be the prior probabilities of the M classes,
and f_i(x) be the probability density of each class at x. The distribution of X is thus:
X ∼
Class                     P(x|class)
η_1 = P(Class = 1)        f_1(x) = f(x|Class = 1)
η_2 = P(Class = 2)        f_2(x) = f(x|Class = 2)
  ⋮                         ⋮
η_M = P(Class = M)        f_M(x) = f(x|Class = M).
Definition 1 (Loss Function) Let X be a random variable with the classes set
{1, 2, . . . , n}. The loss function L(i, j) is the loss incurred by associating the observation
to class j when in fact it belongs to class i, ∀i, j ∈ {1, 2, . . . , n}.
Example 1 (Right/wrong loss function) Consider M=2, and the loss function is the
right/wrong function, such that a correct association yields no loss and an incorrect
association, or an error, yields a loss of 1. The loss matrix in this case is:
L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.
For the case of right/wrong loss function, we notice that in the case of an error, the
loss always counts as 1, while a correct association counts as 0 loss. The loss matrix
is therefore the 0-1 matrix of M × M size. For example, consider the right/wrong case
where M = 3, then

L = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.
In general (not the right/wrong case), a loss matrix can also be asymmetric, for instance,
L(1, 2) > L(2, 1). Such an asymmetric loss matrix is warranted when, for example, the system is tasked with
deciding whether a given person is a terrorist based on a recorded phone call. Failure to
identify a terrorist may result in much greater loss than falsely deciding that an innocent
person is a terrorist. Therefore, we can expect the representative loss matrix to be
asymmetric in many cases.
Let us recall the joint probability equation

P(j, x) = P(j)P(x|j) = η_j f_j(x).   (3)

We can describe P(x) as the sum of the joint probability over all the alphabet of X, and
using eq. (3) we get

P(x) = \sum_{j=1}^{M} P(j, x) = \sum_{j=1}^{M} P(j)P(x|j) = \sum_{j=1}^{M} η_j f_j(x).   (4)
The probability that X belongs to class i, ∀i ∈ {1, 2, . . . , M}, given the sample x, is the
posterior probability η̂_i(x):

η̂_i(x) = P(class = i|x) = \frac{P(class = i, x)}{P(x)} = \frac{P(i)P(x|i)}{P(x)} = \frac{η_i f_i(x)}{\sum_{j=1}^{M} η_j f_j(x)}.   (5)
We can finally collect the class probabilities in vectors. These include the prior probability
vector:
P (class) = [η1 , η2 , . . . , ηM ] , (6)
Definition 2 (Conditional loss) The conditional loss denoted by rj (x) is the loss in-
curred by associating observation x with class j, then:
r_j(x) = E[L(I, j) | X = x] = \sum_i P(class = i|x) L(i, j) = \sum_{i=1}^{M} η̂_i(x) L(i, j).   (8)
Because our goal is to minimize the conditional loss, we therefore define r*(x) and R*
as the quantities corresponding to the minimizing choice of class j ∈ {1, 2, . . . , M}.
Definition 3 (Conditional Bayes risk) The conditional Bayes risk denoted by r ∗ (x), is
the loss incurred by associating x with class j that has the lowest cost out of all classes,
i.e.,

r*(x) ≜ min_j {r_j(x)} = min_j \left\{ \sum_{i=1}^{M} η̂_i(x) L(i, j) \right\}.   (9)
Definition 4 (Bayes risk) The Bayes risk denoted by R∗ is the resulting overall mini-
mum expected risk, i.e.,

R* ≜ E[r*(X)].   (10)
Example 2 (Decision due to minimum loss) Consider the next loss matrix L, where
all failures have the same loss value:
L = \begin{pmatrix} 0 & 1 & 1 & \cdots & 1 \\ 1 & 0 & 1 & \cdots & 1 \\ 1 & 1 & 0 & \cdots & 1 \\ \vdots & & & \ddots & \vdots \\ 1 & 1 & \cdots & 1 & 0 \end{pmatrix},
and therefore, by choosing the class that produces the minimum loss, we will minimize
the Bayes risk.
Minimizing the Bayes risk for the loss function of right/wrong yields the decision rule
given in (13), which is also known as the Maximum a Posteriori (MAP) rule and is
formally defined in the next definition.
Given that the denominator is the same for all the elements, the maximizing vector element is
given by argmax_j η_j f_j(x).
Now, for the case in which η_1 = η_2 = · · · = η_M, meaning that all classes have the
same prior probability, using the MAP method is equivalent to choosing the maximum of
f_j(x) only, i.e.,

j* = argmax_j f_j(x),

and we refer to this decision method as Maximum Likelihood Estimation (MLE).
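To make the two rules concrete, here is a minimal numerical sketch (not part of the original notes; the two class densities, the priors and the query point are assumed purely for illustration, and numpy is assumed to be available). It classifies a point by argmax_j η_j f_j(x) for MAP and by argmax_j f_j(x) for ML, and with unequal priors the two rules can disagree.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed two-class example: class 1 ~ N(-2, 1), class 2 ~ N(+1, 1).
mus, sigmas = np.array([-2.0, 1.0]), np.array([1.0, 1.0])
priors = np.array([0.7, 0.3])        # eta_1, eta_2 (assumed, unequal on purpose)

def map_decision(x):
    """MAP rule: argmax_j eta_j * f_j(x)."""
    return int(np.argmax(priors * gaussian_pdf(x, mus, sigmas))) + 1

def ml_decision(x):
    """ML rule: argmax_j f_j(x); coincides with MAP when all priors are equal."""
    return int(np.argmax(gaussian_pdf(x, mus, sigmas))) + 1

x = -0.3
print(map_decision(x), ml_decision(x))   # 1 and 2: the prior tilts the MAP decision
```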
Example 3 (Two Gaussian distributed classes) Given two classes distributed normally
over two dimensions. Using MLE, we choose the class that has the higher probability
density function value at point x. Using this method entails an obligated error probability,
in this case presented by the area trapped under the two Gaussian graphs:
[Figure 1 plots η_1 f_1(x) and η_2 f_2(x) over x; class 1 is chosen (j* = 1) where η_1 f_1(x) > η_2 f_2(x) and class 2 (j* = 2) otherwise, and the marked regions indicate the risk caused by choosing j* = 1 while j = 2 and by choosing j* = 2 while j = 1.]
Figure 1. The total area trapped under the overlap between the two Gaussians (marked in color) is the total
obligated error probability.
Now we can calculate the overall risk by integrating min{η_1 f_1(x), η_2 f_2(x)}
to obtain the trapped area:

R* = E[r*(X)]
   = \int r*(x) f(x) dx
   = \int_{η_1 f_1(x) > η_2 f_2(x)} η̂_2(x) f(x) dx + \int_{η_1 f_1(x) < η_2 f_2(x)} η̂_1(x) f(x) dx   (18)
   = \int_{η_1 f_1(x) > η_2 f_2(x)} η_2 f_2(x) dx + \int_{η_1 f_1(x) < η_2 f_2(x)} η_1 f_1(x) dx.
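As a sanity check of Eq. (18), the following sketch (my own illustration, not from the notes; equal priors and two unit-variance Gaussians are assumed, and numpy is assumed) approximates R* by numerically integrating the smaller of the two weighted densities over a grid.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed example: eta_1 = eta_2 = 0.5, f_1 = N(-2, 1), f_2 = N(+2, 1).
eta1, eta2 = 0.5, 0.5
x = np.linspace(-10, 10, 20001)                      # integration grid
g1 = eta1 * gaussian_pdf(x, -2.0, 1.0)
g2 = eta2 * gaussian_pdf(x, 2.0, 1.0)

# Eq. (18): R* is the area under the smaller of the two weighted densities.
bayes_risk = np.trapz(np.minimum(g1, g2), x)
print(bayes_risk)    # about 0.0228 for this choice of parameters
```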
In the previous section we defined the conditional Bayes risk r*(x) in Eq. (9) and
the Bayes risk R* = E[r*(X)] in (10), which are optimal, but one needs to know the
probabilities of the classes, i.e., η_i, and the conditional probability f_i(x) for all possible i
and x. In many cases, we have several classifiers (or a set of classifiers), which can also
be called hypotheses. The hypothesis set may not include the optimal one, namely, the
Bayes classifier that we saw in the previous subsection. The ERM idea is a very simple
idea that tells us which hypothesis to choose from the set of hypotheses.
For each classifier/hypothesis h let's define a risk

R(h) ≜ E[L(h(X), Y)],

where L(·, ·) is the loss function (as defined in the previous subsection in Def. 1) and y is
the label associated with x. In general, we would like to choose the classifier/hypothesis
h from the possible set H that minimizes the risk, i.e.,

h* = argmin_{h ∈ H} R(h).
However, in order to compute the risk for a specific hypothesis h, i.e., R(h), one needs
to know the joint probability p(x, y) for all possible samples x and labels y. In practice,
the joint probability p(x, y) is unknown (this situation is called agnostic learning).
However, one can assume that we have samples and labels {x_i, y_i}_{i=1}^{n} drawn from the
joint p(x, y). The set of samples that are available is called the training set. The empirical
risk of a specific classifier h is defined as

R_{emp}^{(n)}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i),   (21)
and the minimum empirical risk idea is

ĥ = argmin_{h ∈ H} R_{emp}^{(n)}(h).   (22)
Assuming {x_i, y_i}_{i=1}^{n} are i.i.d. (or at least stationary and ergodic), then by the law of
large numbers lim_{n→∞} R_{emp}^{(n)}(h) = R(h) with probability 1, hence the minimum empirical
risk converges to the minimum risk if the number of samples is large enough.
Summary of the ERM idea: The ERM idea is very simple and extremely useful. We
have a set of classifiers/hypotheses H and a set of samples (a.k.a. the training set). The ERM
principle tells us to choose the classifier that minimizes the empirical risk, i.e., Eq. (22).
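A minimal sketch of the ERM principle (my own illustration, not from the notes; the toy data, the family of threshold classifiers, and numpy are all assumed): given a finite hypothesis set and a labeled training set, pick the hypothesis with the smallest empirical risk under the 0-1 loss, as in Eq. (22).

```python
import numpy as np

# Assumed toy data: 1-D samples x_i with labels y_i in {0, 1}.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x > 0.2).astype(int)                  # unknown "true" rule generating the labels

# Finite hypothesis set H: threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(-1, 1, 41)

def empirical_risk(t):
    """R_emp(h_t) = (1/n) * sum of 0-1 losses over the training set, Eq. (21)."""
    predictions = (x > t).astype(int)
    return np.mean(predictions != y)

# ERM: choose the hypothesis minimizing the empirical risk, Eq. (22).
risks = np.array([empirical_risk(t) for t in thresholds])
best_t = thresholds[np.argmin(risks)]
print(best_t, risks.min())
```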
d(x, x′ ) ≤ d(x, xi ) ∀ i = 1, . . . , n
η_1 = P(θ = 1)
η_2 = P(θ = 2)
  ⋮
η_n = P(θ = n)
P(x, θ, x_1, θ_1, . . . , x_n, θ_n) = P(θ)P(x|θ) P(θ_1)P(x_1|θ_1) P(θ_2)P(x_2|θ_2) · · ·   (23)
We define the n-sample Nearest Neighbor procedure risk to be (n stands for the number
of training samples):
R∗ ≤ R ≤ 2R∗ · (1 − R∗ ). (26)
In the next part of this lecture, we will use a lemma and the definitions given above
to show that these bounds hold. Before we get to the proof, note that the Bayes risk R*
can only take values in the interval [0, 1/2], so that 0 ≤ R* ≤ R ≤ 2R*(1 − R*) ≤ 1/2. In
particular, for the edge cases, we get that R* = 0 if and only if R = 0, and that R* = 1/2
if and only if R = 1/2.
Exercise 1 Prove the equation used above: P (θ, θn′ |x, x′n ) = P (θ|x) · P (θn′ |x′n )
Proof: Let S_x(r), r > 0, be the sphere of radius r centered at x, where d(·, ·) is the metric
defined on X. Considering the case where S_x(r), r > 0, has a non-zero probability
measure, then for any δ > 0
The distance of the nearest neighbor x'_n from x decreases monotonically with the increase
in n.
We can now use the fact that limn→∞ x′n = x with probability one to show that the
conditional NN risk r(x, x′n ) converges to the limit 2r ∗ (x)(1 − r ∗ (x)). For large numbers
of training samples n, we get
lim_{n→∞} r(x, x'_n) (a)= lim_{n→∞} ( η̂_1(x) η̂_2(x'_n) + η̂_2(x) η̂_1(x'_n) )   (29a)
                   (b)= 2 η̂_1(x) η̂_2(x)   (29b)
                   (c)= 2 η̂_1(x)(1 − η̂_1(x))   (29c)
                   (d)= 2 r*(x)(1 − r*(x)),   (29d)
where:
(a) holds using equation (27).
Now R is the limit of the expectation of r(x, x'_n), and we can use the fact that r(x, x'_n) < 1
is bounded to switch the order of expectation and the limit, to get

R = lim_{n→∞} E[r(x, x'_n)] = E[ lim_{n→∞} r(x, x'_n) ].   (31)
R = E[r(x)] (33a)
where:
(a) holds by the linearity of the expectation.
(b) holds since r*(x) ∈ [0, 1/2] and the expression inside the expectation is non-negative over
this interval, and equality is achieved only if r*(x)(1 − 2r*(x)) = 0.
Using the first part of the equation above and the fact that R* is the expectation of r*,
and that Var(r*(x)) ≥ 0, we can then write

R = E[2r*(x)(1 − r*(x))] = 2R* − 2E[(r*(x))^2] = 2R* − 2( Var(r*(x)) + (R*)^2 )
  ≤ 2R*(1 − R*),   (34)
and

R* = E[r*(x)]   (35a)
  (a)≤ E[2r*(x)(1 − r*(x))]   (35b)
  (b)≤ 2R*(1 − R*),   (35c)

where:
(a) holds from equation (33e).
(b) holds since r*(x) ≤ 1/2.
To conclude, we collect equations (33),(34) to obtain the bounds of the overall NN
risk R(n):
R∗ ≤ R ≤ 2R∗ (1 − R∗ ). (36)
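For completeness, here is a minimal 1-NN classifier sketch (my own illustration, not the notation of [1]; the toy training set and numpy are assumed). It labels a query point with the label of its nearest training sample, which is the rule whose risk R is bounded in (36).

```python
import numpy as np

def nearest_neighbor_classify(x_train, y_train, x_query):
    """1-NN rule: return the label of the training sample closest to x_query."""
    distances = np.linalg.norm(x_train - x_query, axis=1)   # Euclidean metric d(., .)
    return y_train[np.argmin(distances)]

# Assumed toy training set in R^2 with labels in {1, 2}.
x_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([1, 1, 2, 2])
print(nearest_neighbor_classify(x_train, y_train, np.array([0.5, 0.2])))   # -> 1
print(nearest_neighbor_classify(x_train, y_train, np.array([4.6, 4.9])))   # -> 2
```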
REFERENCES
[1] T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
Machine Learning
Lecture 2: GMM and EM
I. INTRODUCTION
This lecture comprises an introduction to the Gaussian Mixture Model (GMM) and the
Expectation-Maximization (EM) algorithm. Parts of this lecture are based on lecture
notes of Stanford's CS229 machine learning course by Andrew Ng [1]. This lecture
assumes you are familiar with basic probability theory. The notation here is similar to
that in Lecture 1.
In supervised classification, our target is to analyze the labelled training data we get,
and to use it to generate a model to map and classify new examples. Below is a general
model of supervised learning.
observation    label
r_1            c_1
r_2            c_2
 ⋮              ⋮
r_N            c_N

r ⟹ feature extraction ⟹ x ⟹ statistical model ⟹ y ⟹ decision ⟹ ĉ
Here r represents the raw observation vectors with their respective label c, and ĉ is the
estimation of the model.
1) Feature extraction: The goal of the feature extraction block is to extract the features
that contribute most to the classification and to eliminate the rest. For instance, in
speech recognition a well-known feature extraction technique is called Cepstrum,
and in texture classification, it is based on the discrete cosine transform (DCT) or
wavelet. All the above-mentioned features are based on the frequency domain. In
some cases, one might also use dimension reduction in addition to these features,
when, for example, the feature vector is too large or there is high redundancy. Two
good feature reduction methods are PCA and Autoencoders.
2) Statistical model: The goal of the statistical model is to represent the statistics of
each class, which allows the classes to be separated from each other. The statistical
model usually has some probability justification (such as the GMM), but sometimes
it might just be a technique for separating the different classes. Usually a good statistical
model can also be used for additional tasks such as data compression or denoising.
3) Decision: The decision component is responsible for using the statistical model
output to classify the input. In some cases, we may generate more than one
statistical model, and the decision component uses all of the outputs.
The GMM is a statistical model which assumes that every probability distribution can
be approximated with a set of gaussians.
In supervised learning, GMM models the distribution of each class as a set of weighted
gaussians, P (xi , zi |c) = P (xi |zi , c)P (zi |c), where zi is a hidden random variable that is
not observed. One assumption that is made is that any distribution can be well modelled
by a set of gaussians. In the figure below, you can see three weighted gaussians in R1
and their sum which models the distribution of a random variable. Another assumption
of the GMM is that any sample, x ∈ Rl , was generated from a single gaussian.
P(x_i | z_i = j, c) = \frac{1}{\sqrt{(2π)^l |Σ_j|}} \exp\left( -\frac{1}{2} (x_i − µ_j)^T Σ_j^{-1} (x_i − µ_j) \right)   (1)
Fig. 1. Gaussian mixture model built from three weighted gaussians. The figure shows three weighted gaussians with
different µ and σ. The sum represents the mixture model.
From now on we only refer to a specific class c, since each class has its own model.
Our goal is to estimate the model's parameters φ_j, µ_j and Σ_j, ∀j ∈ {1, ..., k}, and
we use the maximum likelihood [2] criterion to maximize the expression P(X^n; θ), where θ
represents the parameters.
= argmax_θ log\left( \prod_i P(x_i | θ) \right)   (3b)
= argmax_θ \sum_i log P(x_i | θ)   (3c)
First, let’s try to model our data using a single Gaussian. This example is in one
dimensional space.
ℓ(φ, µ, Σ) = \sum_{i=1}^{m} log \sum_{j=1}^{k} P(x_i | z = j; µ_j, σ_j) P(z = j; µ_j, σ_j)   (5a)
         (a)= \sum_{i=1}^{m} log P(x_i | µ, σ)   (5b)
           = \sum_{i=1}^{m} log \frac{1}{\sqrt{2πσ^2}} e^{-\frac{(x_i − µ)^2}{2σ^2}},   (5c)
where (a) follows from k = 1. Now let's take derivatives in order to find the parameters
that maximize the log-likelihood. First we take the derivative with respect to µ:
\frac{∂}{∂µ} ℓ(φ, µ, Σ) = \frac{1}{σ^2} \sum_{i=1}^{m} (x_i − µ)   (6a)
                       = 0.   (6b)
Therefore

µ̂ = \frac{1}{m} \sum_{i=1}^{m} x_i.   (7)
= 0.   (8c)

We get

σ̂^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i − µ)^2.   (9)
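Equations (7) and (9) are just the sample mean and the (biased) sample variance; a short sketch (my own, assuming numpy and toy 1-D data) makes that concrete:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=1000)   # assumed 1-D data
mu_hat = x.mean()                          # Eq. (7): (1/m) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()    # Eq. (9): (1/m) * sum_i (x_i - mu)^2
print(mu_hat, sigma2_hat)                  # close to 3 and 4 for this data
```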
Two Gaussians example: Now let's try to find an analytic solution that maximizes the
log-likelihood of the two-Gaussian model.

ℓ(φ, µ, Σ) = \sum_{i=1}^{m} log \sum_{j=1}^{k} P(x_i | z = j; µ, σ) P(z = j; θ)   (10a)
           = \sum_{i=1}^{m} log( φ_1 N(x_i; µ_1, σ_1^2) + φ_2 N(x_i; µ_2, σ_2^2) )   (10b)
The EM algorithm was first explained in a 1977 paper, Maximum Likelihood from
Incomplete Data via the EM Algorithm [3]. It is an iterative algorithm for using maximum
likelihood to estimate the parameters of a statistical model with unobserved (hidden)
variables (also called latent variables, from the Latin lateo, "to lie hidden"). It has two
main steps. The first is the E-step, which stands for expectation: we compute some probability
distribution of the latent variables so we can use it for expectations. The second is the
M-step, which stands for maximization. The EM algorithm finds a local maximum that
depends on the initial model that we start with.
Problem Definition: Let x^n be a random vector, distributed i.i.d., that we observe. Let
Z^n be a hidden vector that X^n depends on, s.t. P(x^n | z^n) = \prod_{i=1}^{n} p(x_i | z_i), also distributed
i.i.d., i.e., P(z^n) = \prod_{i=1}^{n} P(z_i). The distribution of X^n and Z^n has some parameters θ
that we are interested in finding. In GMM, for instance, Z_i stands for the Gaussian number
of sample i, i.e., z_i ∈ {1, 2, ..., k}, where k is the total number of Gaussians. Our goal is
to maximize the log-likelihood

log P_{X^n}(x^n; θ) = log\left( \sum_{z^n} P_{X^n, Z^n}(x^n, z^n; θ) \right).   (12)
The stopping criterion can be changed: it can be either the one that is written,
log P(x^n | θ^{(t)}) − log P(x^n | θ^{(t−1)}) < ε, a fixed number of iterations, or ||θ^{(t)} − θ^{(t−1)}|| < ε
for some norm || · ||.
Derivation of the EM algorithm: Now, we are going to show that in each iteration
the algorithm increases the log likelihood, i.e., log P(x^n | θ^{(t)}) increases as t increases.
Hence, the EM algorithm converges to a local maximum that depends on the initial
model θ^{(0)}.
For any Q(z^n) we have

log P(x^n; θ^{(t)}) = log \sum_{z^n} Q(z^n) \frac{P(x^n, z^n; θ^{(t)})}{Q(z^n)}   (14)
                 = log E_{Q_{Z^n}}\left[ \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right]   (15)
              (a)≥ E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right],   (16)
where (a) follows from Jensen's inequality. Now, using algebra and information measures we
obtain

E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right] = E_{Q_{Z^n}}[ log P(x^n; θ^{(t)}) ] + E_{Q_{Z^n}}\left[ log \frac{P(Z^n | x^n; θ^{(t)})}{Q(Z^n)} \right]
   = log P(x^n; θ^{(t)}) + E_{Q_{Z^n}}\left[ log \frac{P(Z^n | x^n; θ^{(t)})}{Q(Z^n)} \right]   (17)
   = log P(x^n; θ^{(t)}) − D(Q_{Z^n} || P_{Z^n | x^n, θ^{(t)}}).   (18)
Hence, we can choose Q(z^n) = P(z^n | x^n, θ^{(t)}), i.e., Q(z_i) = P(z_i | x_i, θ^{(t)}), and we obtain
that (a) in (16) holds with equality. So now, when Q(z^n) is fixed, we can maximize over all
θ the following expression:

θ^{(t+1)} = arg max_θ E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ)}{Q(Z^n)} \right] = arg max_θ E_{Q_{Z^n}}[ log P(x^n, Z^n; θ) ].
Hence we have

log P(x^n; θ^{(t+1)}) − D(Q_{Z^n} || P_{Z^n | x^n, θ^{(t+1)}}) (a)= E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t+1)})}{Q(Z^n)} \right]
   (b)≥ E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right]
   (c)= log P(x^n; θ^{(t)}),
where (a) and (c) follow from the E-step derivation, and (b) from the M-step. Finally, the last
sequence of equations implies that log P(x^n; θ^{(t+1)}) ≥ log P(x^n; θ^{(t)}), because the divergence
is non-negative.
Now, we wish to find the new parameters θ which maximize the log-likelihood with
respect to the expectations (the M-step). First let's define φ(j) ≜ P(j; θ). The model
parameters are θ = {φ(j), µ_j, σ_j^2}.
Finding the optimal φ(j) is a bit more difficult since it is a probability distribution
function and therefore it has some constraints. The constraints are

\sum_j φ(j) = 1,
φ(j) ≥ 0 ∀j.
The solution for this comes from the field of convex optimization. Further discussion
about convex optimization problems and Lagrange multipliers can be found in the
appendix and in Boyd's book [4]. Let's define the problem as a standard convex
optimization problem (Definition 3):

minimize    −\sum_{i=1}^{m} \sum_{j=1}^{k} w(j, i) log φ(j)
subject to  −φ(j) ≤ 0,  1 ≤ j ≤ k
            \sum_j φ(j) = 1
1) ∇_φ L(φ*, λ*, ν*) = 0
2) −φ*(j) ≤ 0, ∀j
3) \sum_j φ*(j) = 1
4) λ*_j φ*(j) = 0, ∀j
5) λ*_j ≥ 0, ∀j
After differentiating the Lagrangian with respect to a specific φ(j) we get

φ(j) = \frac{\sum_{i=1}^{m} w(j, i)}{ν}.   (25)
We do not know ν, so we use the fact that \sum_j φ(j) = 1:

\sum_j φ(j) = \frac{\sum_j \sum_{i=1}^{m} w(j, i)}{ν} = 1.   (26)
Therefore

φ(j) = \frac{\sum_{i=1}^{m} w(j, i)}{\sum_{i=1}^{m} \sum_{j=1}^{k} w(j, i)}   (27)
     = \frac{\sum_{i=1}^{m} w(j, i)}{m}.   (28)
We can see that the KKT conditions hold.
Algorithm: In the E-step, for every Gaussian j and sample i compute the responsibility

w(j, i) := P(z_i = j | x_i; θ),   (29)

and in the M-step update

φ(j) := \frac{1}{m} \sum_{i=1}^{m} w(j, i)   (30)

µ_j := \frac{\sum_{i=1}^{m} w(j, i) x_i}{\sum_{i=1}^{m} w(j, i)}   (31)

Σ_j := \frac{\sum_{i=1}^{m} w(j, i)(x_i − µ_j)(x_i − µ_j)^T}{\sum_{i=1}^{m} w(j, i)}   (32)
Now we must consider when to stop the algorithm. One way is to repeat the two steps
until convergence θ(t+1) ≈ θ(t) . Another option is to repeat them for a fixed number
of times. Repeating the steps a fixed number of times can prevent over-fitting of the
training data set.
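The following is a compact sketch of the E- and M-steps above for a 1-D GMM (my own illustration, not code from the notes; numpy, the toy data and the simple initialization are all assumed, and the covariance update (32) reduces to a scalar variance in one dimension).

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def em_gmm_1d(x, k, iterations=50):
    """EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(0)
    phi = np.full(k, 1.0 / k)                     # mixture weights phi(j)
    mu = rng.choice(x, size=k, replace=False)     # initialize means from the data
    sigma2 = np.full(k, x.var())
    for _ in range(iterations):
        # E-step: responsibilities w[j, i] = P(z_i = j | x_i; theta), Eq. (29).
        w = phi[:, None] * gaussian_pdf(x[None, :], mu[:, None], sigma2[:, None])
        w /= w.sum(axis=0, keepdims=True)
        # M-step: Eqs. (30)-(32), with a scalar variance instead of a covariance matrix.
        n_j = w.sum(axis=1)
        phi = n_j / len(x)
        mu = (w * x[None, :]).sum(axis=1) / n_j
        sigma2 = (w * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / n_j
    return phi, mu, sigma2

# Assumed toy data drawn from two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x, k=2))
```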
F. K-means algorithm
K-means is very similar to EM except that it gives a ’hard’ decision for each sample
to the centroid to which it belongs. Below we discuss K-means briefly. We are given
a training set {x(1) , ..., x(m) } and we wish to group it into K different components.
Algorithm:
1) Initialize centroids µ_1, ..., µ_k ∈ R^n
2) Repeat until convergence: {
   a) for every i, set c^{(i)} := argmin_j ||x^{(i)} − µ_j||^2
   b) for every j, set µ_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\} x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}
}
To initiate the centroid, we can choose K random training samples. There are other
initialization methods. The inner loop assigns each sample to the ’closest’ centroid, and
moves the centroid to the mean of the points assigned to it.
The K-means algorithm guarantees convergence to a local optimum (except in very
rare cases, where K-means oscillates between a few different values), but it does not
guarantee convergence to a global optimum. A common way to deal with this problem is
to run K-means many times with different initializations, and then to pick the one with the
lowest distortion. K-means is often used to initialize GMM modelling.
A common use of the K-means algorithm is as an initial estimate of the GMM parameters.
We use the centroids to initialize the Gaussians' µ and set arbitrary Σ and φ.
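A minimal K-means sketch (my own illustration, not from the notes; numpy and the toy two-blob data are assumed) following the two steps above: assign each sample to its closest centroid, then move each centroid to the mean of its assigned points.

```python
import numpy as np

def kmeans(x, k, iterations=100):
    """Basic K-means on the rows of x (shape [m, d]); empty clusters are not handled."""
    rng = np.random.default_rng(0)
    centroids = x[rng.choice(len(x), size=k, replace=False)]   # init from K random samples
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every sample.
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Assumed toy data: two well-separated blobs in R^2.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
print(kmeans(x, k=2)[0])
```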
1) Choosing the number of GMM components: [6] Choosing the number of compo-
nents for modelling the distribution is an important issue. There is a trade-off between
choosing too many components which may cause over-fitting (for example, what happens
if we use the number of samples?), and choosing too few, which can render a model that
is not flexible enough. Selecting the number of components is crucial for unsupervised
segmentation tasks, in which each component represents a separate class. However, when
each GMM models a different class and the segmentation is supervised, the selection of
the number of components becomes less critical.
2) Full or diagonal covariance?: [5] In the model that is the focus of this lecture,
we used unrestricted covariance for the GMMs. The GMM can also be restricted to have
diagonal covariance which has two practical advantages:
• There are fewer parameters to be estimated.
• Calculation of the inverse matrix is trivial.
In some applications (e.g., speaker verification), the diagonal model is sufficient. The
obvious disadvantage of assuming diagonality, however, is that the different features
may in practice be strongly correlated. In this case, the model might be incapable of
representing the feature distribution accurately. The figure below illustrates this point.
Fig. 2. Modelling the distributions of two different features (first column) using a full covariance (second column) and
a diagonal covariance (third column).
On the first row, we see that the features are highly correlated, and therefore the
GMM must use a full covariance matrix. On the second row, the difference is far less
pronounced, and it seems that a diagonal covariance matrix is sufficient.
3) GMM for unsupervised learning: So far we only talked about using GMM for
supervised learning, where each sample has its own label, and a model is built for each
class. In many cases, GMM is used for unsupervised clustering. It is useful to gather data
from a group into sub-groups that can be modelled well by a gaussian distribution. Using
GMM, we can cluster the data into sub-groups that do not have labels or identity. For
example, we have a class with n students and we want to split it into three different study
groups according to student averages. An additional application of GMM in unsupervised
learning is the Universal Background Model (UBM). In the UBM, we first generate a
general class-independent statistical model by using all samples. We often use the UBM
to generate the class-dependent models. This is particularly useful for cases in which we
lack a sufficient number of samples of the class we want to identify, but we have many
samples of other classes. A good example of this application is in the field of speaker
verification, when we may have only a few voice samples of the person we are trying to
identify.
APPENDIX
REVIEW: CONVEX FUNCTIONS AND JENSEN'S INEQUALITY [4]
Definition (Convex set): A set C is convex if for every x, y ∈ C and every θ ∈ [0, 1],

θx + (1 − θ)y ∈ C.   (35)
Fig. 3. The hexagon on the left is a convex set; the set on the right is non-convex.
Definition 3 (Standard optimization problem): The problem

minimize    f_0(x)
subject to  f_i(x) ≤ 0,  1 ≤ i ≤ m
            h_j(x) = 0,  1 ≤ j ≤ k

describes the problem of finding the x that minimizes f_0(x) among all x that satisfy the
constraints. It is called a convex optimization problem if f_0, ..., f_m are convex functions
and h_1, ..., h_k are affine.
Definition 4 (Lagrangian) :
For a constrained optimization problem as seen in definition 3 we define the Lagrangian
L(x, λ, ν) = f_0(x) + \sum_{i=1}^{m} λ_i f_i(x) + \sum_{j=1}^{k} ν_j h_j(x)   (38)
The idea of the Lagrangian duality is to take the constrains into account by augmenting
the objective function with a weighted sum of the constraint functions.
∇_x L(x*, λ*, ν*) = 0
f_i(x*) ≤ 0, ∀i
h_j(x*) = 0, ∀j
λ*_i f_i(x*) = 0, ∀i
λ*_i ≥ 0, ∀i

where x* and (λ*, ν*) are primal and dual optimal points with zero duality gap. These
are called the Karush-Kuhn-Tucker (KKT) conditions.
You can find a more rigorous discussion of convex optimization in Boyd’s book on
convex optimization in chapters 1-5[4].
REFERENCES
[1] Andrew Ng's machine learning course, lectures on Unsupervised Learning, k-means clustering, Mixture of Gaussians and the EM Algorithm. http://cs229.stanford.edu/materials.html.
[2] Haim Permuter's Machine Learning course, Lecture 1. http://www.ee.bgu.ac.il/~haimp/ml/lectures.html.
[3] Arthur Dempster, Nan Laird, and Donald Rubin (1977), Maximum Likelihood from Incomplete Data via the EM Algorithm. https://www.jstor.org/stable/2984875.
[4] Stephen Boyd's Convex Optimization. https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf.
[5] Haim Permuter, Joseph Francos, Ian Jermyn, A study of Gaussian mixture models of color and texture features for image classification and segmentation, Section 3.4. http://www.ee.bgu.ac.il/~francos/gmm_seg_f.pdf.
[6] Haim Permuter, Joseph Francos, Ian Jermyn, A study of Gaussian mixture models of color and texture features for image classification and segmentation, Section 3.5. http://www.ee.bgu.ac.il/~francos/gmm_seg_f.pdf.
[7] Ramesh Sridharan's Gaussian mixture models and the EM algorithm. https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf.
Machine Learning
Lecture 3: Linear and Logistic Regression
Throughout this lecture we talk about how to use regression analysis in machine
learning problems. First, we introduce regression analysis in general. Then, we talk about
linear regression, and we use this model to review some optimization techniques that
will serve us in the remainder of the course. Finally, we will discuss classification using
logistic regression and softmax regression. Parts of this lecture are based on lecture notes
from Stanford's CS229 machine learning course by Andrew Ng [1]. This lecture assumes
you are familiar with basic probability theory. The notation here is similar to Lecture 1.
to be the amount of precipitation, we will choose a hypothesis that maps X_1, X_2 to the
actual precipitation amount. In this case, we refer to the problem as a regression problem.
When we define the dependent variable to be the "weather type" (clear, cloudy, rainy,
etc.), our hypothesis will use the independent variables to classify X_1, X_2 to the correct
group of weather types. In this case we refer to the problem as a classification problem.
Now that we have discussed the terminology, let's delve into a commonly used
regression model, linear regression.
In linear regression, as its name implies, we try to estimate the value of y with
a linear combination of the new feature vector x ∈ R^m with respect to our training
set {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(n)}, y^{(n)})}. Let's define h_θ(x) as the hypothesis
hθ (x) = θ0 + x1 θ1 + · · · + xm θm , (1)
where θ0 is the bias term. For convenience let’s define x0 = 1 so we can rewrite (1) as
h_θ(x) = \sum_{k=0}^{m} x_k θ_k = θ^T x.   (2)
Our goal is to find the linear combination coefficients θ that will yield the hypothesis
which maps x^{(i)} to y^{(i)} most accurately. In order to do that, we will have to assume that
the training set is representative of the whole problem. If that is not the case, we should
enrich our training set appropriately (if it is possible). Next, we need to choose θ that will
make h_θ(x^{(i)}) ≈ y^{(i)}, for all i = 1, . . . , n (the ≈ sign stands for approximately equal).
By achieving a hypothesis which accurately maps x^{(i)} to y^{(i)} for the entire training set,
combined with the assumption that our training set is representative of the whole problem,
we can say that our hypothesis is optimal, with the limits of a linear model.
Now, given the training set, how should we choose θ? In order to do that, we need to
define a function which measures the error between our hypothesis based on the x^{(i)}'s and
the corresponding y^{(i)}'s for all i = 1, . . . , n. Let's define a cost function that measures
the error of our hypothesis:

C(θ) = \frac{1}{2n} \sum_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2.   (3)
We can see that when hθ (x(i) ) ≈ y (i) for all i = 1, . . . , n , then our cost function
satisfies C(θ) ≈ 0. Moreover, every term in our cost function is positive, therefore C(θ) ≈
0 ⇐⇒ hθ (x(i) ) ≈ y (i) for all i = 1, . . . , n. These are two important properties, that lead
us to the conclusion that finding θ∗ that minimizes the cost function, eventually forces
hθ (x(i) ) ≈ y (i) for all i = 1, . . . , n. By achieving that, we are essentially accomplishing
our goal.
Let’s formulate our goal,
θ∗ = argmin C(θ). (4)
θ∈Rm+1
Note that if we find θ∗ , our linear regression model can be used on new sample x by the
following equation:
hθ (x) = (θ∗ )T x. (5)
Now, after formulating our problem, let’s survey some minimization methods to serve us
in minimizing our cost function.
In this section we discuss various ways of minimizing C(θ) with respect
to θ. The methods that we focus on are:
1) Gradient Descent:
   • Batch.
   • Stochastic (Incremental).
2) Newton-Raphson Method.
3) Analytically.
The gradient descent update rule is given by

θ^{(t+1)} := θ^{(t)} − α ∇_θ C(θ^{(t)}),   (6)

where:
• t = 0, 1, . . . is the iteration number.
• α is the "learning rate" and it determines how large a step the algorithm takes at
every iteration.
• ∇_θ is the gradient of C(θ) with respect to the parameters θ and is defined by:

∇_θ C(θ) = \left[ \frac{∂C(θ)}{∂θ_0}, \frac{∂C(θ)}{∂θ_1}, . . . , \frac{∂C(θ)}{∂θ_m} \right]^T
Intuitively, we can think of the algorithm as guessing an initial place in the domain.
Then, it calculates the direction and magnitude of the steepest descent of the function
for the current θ (∇θ C(θ)) and finally, in the domain, it takes a step, proportional to the
descent magnitude and to a fixed predefined step size α to the next θ. In case we want to
use the algorithm for maximization, we will exchange the '−' sign with a '+' sign and
then take a step in the direction of the steepest ascent of the function. In this case
the algorithm is called gradient ascent. Note that, generally, the algorithm searches for 'a'
minimum and not ’the’ minimum of the function. If we want to claim that the algorithm
finds the global minimum of the function we will need to prove that the function has
only one minimum, which means that the function is convex.
Here are some examples of gradient descent as it runs to minimize convex or non-convex
functions.
The leftmost figure shows us the value of f (θ0 ). The middle figure shows the algorithm
progress after three iterations and the rightmost figure shows the algorithm after achieving
convergence. We can visualize the fact that gradient descent applied on convex function
will converge to the global minimum of the function, assuming that α is relatively small.
Therefore, we can conclude that by choosing a convex cost function to our problems,
we can assure finding its global minimum by an iterative process such as gradient descent.
Luckily, the MSE cost function is a convex function with respect to θ so we can assure
that gradient descent will find its global minimum.
Now, let’s implement the gradient descent method on our linear regression problem.
For doing that, we have to calculate the term ∇θ C(θ). Let’s find an expression for the
j th coordinate of ∇θ C(θ) in our problem.
\frac{∂C(θ)}{∂θ_j} = \frac{∂}{∂θ_j} \frac{1}{2n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)})^2
            (a)= \frac{1}{2n} \sum_{i=1}^{n} \frac{∂}{∂θ_j} (θ^T x^{(i)} − y^{(i)})^2
            (b)= \frac{1}{2n} \sum_{i=1}^{n} 2 (θ^T x^{(i)} − y^{(i)}) x_j^{(i)}
            (c)= \frac{1}{n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)}) x_j^{(i)},   (7)
where in (a) we used the linearity of the derivative, and in (b) and (c) we differentiated and
simplified the expression. By inserting Equation (7) in Equation (6) we get the following
update rule, which is given by

θ_j^{(t+1)} := θ_j^{(t)} − \frac{α}{n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)}) x_j^{(i)},  ∀j.   (8)
Now let's understand the intuition of Eq. (8). First, for simplicity, let's examine it for
n = 1. We get θ_j^{(t+1)} := θ_j^{(t)} − α(ŷ^{(i)} − y^{(i)}) x_j^{(i)}, where ŷ^{(i)} is the estimate of y^{(i)} using the
linear regression model of the current iteration. Now, if we denote error^{(i)} = (ŷ^{(i)} − y^{(i)}),
then the update is θ_j^{(t+1)} := θ_j^{(t)} − α · error^{(i)} · x_j^{(i)}. Namely, θ_j^{(t)} is updated by a linear offset
that is proportional to the estimation error multiplied by the input. And if we have
n > 1, then we average over all the updates, so the noise is less effective at each
iteration.
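A short sketch of the batch update (8) follows (my own illustration, not from the notes; numpy and the toy linear data are assumed, and here the rows of the data matrix are training examples, unlike the design matrix defined later). Each iteration uses the whole training set to compute the gradient of the MSE cost.

```python
import numpy as np

# Assumed toy data: y = 2*x1 + 1 plus noise, with x_0 = 1 appended as the bias feature.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=100)
y = 2 * x1 + 1 + rng.normal(0, 0.1, size=100)
X = np.column_stack([np.ones_like(x1), x1])       # rows are [x_0, x_1] = [1, x1]

theta = np.zeros(2)
alpha, n = 0.1, len(y)
for _ in range(2000):
    # Batch update, Eq. (8): theta_j <- theta_j - (alpha/n) * sum_i (theta^T x_i - y_i) * x_ij
    gradient = X.T @ (X @ theta - y) / n
    theta -= alpha * gradient
print(theta)    # should be close to [1, 2]
```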
It can be easily verified that at every update, the algorithm uses the entire training set
to compute the gradient. For that reason we call this variant of gradient descent
batch gradient descent. Alternatively, we can use only one training example to update θ
at every iteration. The update rule in this case is given by

for i = 1 to n, {
    θ_j := θ_j − α (θ^T x^{(i)} − y^{(i)}) x_j^{(i)},  ∀j   (9)
}
In this case, we call the algorithm Stochastic Gradient Descent (or Incremental Gradient
Descent). Next, let's talk about our second method for optimizing C(θ). Our next method
uses the fact that finding an extremum of a function is equivalent to finding its derivative's
zeros.

θ^{(t+1)} := θ^{(t)} − \frac{f(θ^{(t)})}{f'(θ^{(t)})},   (10)
where:
• t = 0, 1, . . . is the iteration number.
• f ′ (θ) is the first derivative of f (θ).
First, to get some intuition about the algorithm's process, let's examine the particular case
where f(x) = \frac{1}{3} x^3. In this case the update rule is given by

x^{(t+1)} := x^{(t)} − \frac{1}{3} x^{(t)}.   (11)
The following figures present the algorithm's progress in the first iterations.
[Fig. 10: initial guess of the Newton-Raphson method; Fig. 11: 1 iteration of the Newton-Raphson method; Fig. 12: 5 iterations of the Newton-Raphson method.]
In the leftmost figure, we can see the initial guess for x, denoted by x_0. In the middle
figure, we can see how the algorithm calculates its next guess of x. First, it calculates the
tangent of f(x) at x_0, and then sets the next guess at the point where the tangent crosses
the x axis. We can think of that as estimating the zeros of f(x) as if f(x) were a linear
function. From this point of view let's derive the update rule.
The tangent's slope is the derivative of f(x) at x^{(t)}, so we can conclude that

f'(x^{(t)}) = \frac{f(x^{(t)}) − 0}{x^{(t)} − x^{(t+1)}}.   (12)
To use the method for minimizing the cost function, we substitute

x ↔ θ,   f(·) ≜ C'(·),   f'(·) ≜ C''(·).

Therefore, the update rule for minimizing the cost function will be

θ^{(t+1)} := θ^{(t)} − \frac{C'(θ^{(t)})}{C''(θ^{(t)})}.   (14)
In case θ is a vector we can generalize the update rule to be

θ^{(t+1)} := θ^{(t)} − H_θ^{-1} ∇_θ C(θ^{(t)}),   (15)

where
• H_θ is the Hessian matrix of C with respect to θ, whose entries are given by
  H_{i,j} = \frac{∂^2 C(θ)}{∂θ_i ∂θ_j}
• ∇_θ C(θ^{(t)}) is the gradient of C with respect to θ.
Note that for a quadratic cost function, the gradient is a linear function of θ and
therefore the algorithm will converge after one iteration (verify that you understand why
this is true). More generally, the N-R method converges faster than gradient descent (this will
not be proven here), but it is more computationally expensive. That is because at every
update the algorithm needs to find and invert the Hessian matrix (∼ m^3 operations).
Analytic Minimization
In some cases, an analytic, closed-form solution can be found for the θ* which
minimizes the cost function. This is not always the case, but for linear regression there
is an analytic solution. In order to find a solution we take the derivatives of the cost
function and set them to zero. For convenience, let's introduce some notation that
will ease our process of finding θ*.
Let X be the design matrix whose columns are the training examples, so that X_{ji}
represents the j-th feature of the i-th training example. X ∈ R^{(m+1)×n} is given by

X = [ x^{(1)}  x^{(2)}  · · ·  x^{(n)} ].
Now, let \vec{y} be the vector of the target values, such that (\vec{y})_i = y^{(i)}. This means that \vec{y} ∈ R^n
is given by

\vec{y} = [ y^{(1)}, y^{(2)}, . . . , y^{(n)} ]^T.
Next, given that h_θ(x^{(i)}) = (x^{(i)})^T θ, we can express the term h_θ(x^{(i)}) − y^{(i)} by

h_θ(x^{(i)}) − y^{(i)} = [ X^T θ − \vec{y} ]_i.
Combined with the fact that for any vector \vec{a} the property \vec{a}^T \vec{a} = \sum_i a_i^2 holds, we can
conclude that

\frac{1}{2n} (X^T θ − \vec{y})^T (X^T θ − \vec{y}) = \frac{1}{2n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)})^2 ≜ C(θ).   (16)
X X^T θ* = X \vec{y}
⇓
θ* = (X X^T)^{-1} X \vec{y},   (17)

where the term (X X^T)^{-1} X is an (m+1)-by-n matrix and is called the pseudo-inverse of
X. So, by taking derivatives and setting them to zero, we can try to find θ* explicitly for
any model that maps x ∈ X to y ∈ Y, or any type of cost function. So, why should we
use an iterative process to find θ*? The first reason is that inverting a matrix can be a
very inaccurate computation on a computer due to rounding errors. The second reason
is that inverting an m-by-m matrix takes ∼ m^3 operations, and for large m it could be
more time-efficient to calculate θ* with an iterative process. The third reason is that it
is sometimes not possible to find a closed-form solution for θ* due to the complexity of
the cost function or the hypothesis.
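The closed-form solution (17) takes only a couple of lines (my own sketch, not from the notes; numpy and the toy data are assumed, and the training examples are stacked as columns of the design matrix, as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(-1, 1, size=n)
y = 2 * x1 + 1 + rng.normal(0, 0.1, size=n)

X = np.vstack([np.ones(n), x1])                  # X in R^{(m+1) x n}, columns are examples
theta_star = np.linalg.solve(X @ X.T, X @ y)     # Eq. (17): theta* = (X X^T)^{-1} X y
print(theta_star)                                # close to [1, 2]

# Equivalently, via the pseudo-inverse of X^T (numerically more robust):
print(np.linalg.pinv(X.T) @ y)
```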
In the previous sections, when using linear regression on a regression problem,
we used some heuristic explanations for choosing the cost function to be the quadratic
cost function, and explained why we minimize it. Now, let's make some probabilistic
assumptions on our problem and try to back up our heuristic explanations with
probabilistic justifications; specifically, we will use the Maximum Likelihood principle
that we introduced in the first lecture.
Let's assume that our target variables, the y^{(i)}'s, are given by the equation

y^{(i)} = θ^T x^{(i)} + ε^{(i)},   (18)

where ε^{(i)} is the error, which can be interpreted as random noise, inaccuracy in
measurements, limitation of the linear model, etc. Moreover, let's assume that the ε^{(i)}'s
are i.i.d. (independently and identically distributed) and their distribution is given by

ε^{(i)} ∼ N(0, σ^2).   (19)
Now, let's examine the probability P(y^{(i)} | x^{(i)}; θ) (the semi-colon indicates that y^{(i)}
is parameterized by θ, since θ is not a random variable). Given x^{(i)} and θ, the term θ^T x^{(i)}
is deterministic, and hence P(y^{(i)} | x^{(i)}; θ) is a Normal distribution with expectation θ^T x^{(i)}
and variance σ^2, i.e., N(θ^T x^{(i)}, σ^2), ∀i. Let L(θ) be the likelihood function, which is
given by

L(θ) = p(\vec{y} | X; θ).   (20)

The likelihood function represents the probability of all the y^{(i)}'s given all the x^{(i)}'s. Intuitively it
measures how likely it is that y^{(i)} is the correct value for x^{(i)} for all i, under the probabilistic
assumptions we made. Since the entire training set is given, the likelihood function
depends exclusively on θ. Likewise, note that due to the fact that the ε^{(i)}'s are i.i.d., we have

L(θ) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ).
Note that the exact value of the error's variance, σ, doesn't affect the value of θ*. In
fact, it tells us that adjusting θ can't reduce the errors which are caused by the ε^{(i)}'s.
Finally, we can conclude that under the probabilistic assumptions we made earlier, we
derived the same goal: minimizing the term (23), which is exactly C(θ) as we defined
earlier.
V. LOGISTIC REGRESSION
Logistic regression, despite its misleading name, is used for binary classification
problems, which means that y^{(i)} can take values from the set {0, 1}. As we clarified in
the introduction section, the term "regression" is rooted in regression analysis and not
in machine learning's conventions. Note that for classification problems we cannot use
the linear regression model because in that case y could take any value in the interval
(−∞, ∞). Therefore, we should try another approach to solving this problem.
Let's adopt the probabilistic interpretation from the last section. First, we will make some
probabilistic assumptions about the problem and then we will use the maximum likelihood
criterion to adjust θ.
Assume that P (y (i)|x(i) ; θ) is Bernoulli(φ(i) ), where φ(i) is the Bernoulli parameter
for the ith training example. Therefore,
p(y^{(i)} | x^{(i)}; θ) = (φ^{(i)})^{y^{(i)}} (1 − φ^{(i)})^{1 − y^{(i)}}.
Now we use our hypothesis to estimate the probability that x^{(i)} belongs to a certain class.
Let's choose our hypothesis to be

φ̂^{(i)} = h_θ(x^{(i)}) = σ(θ^T x^{(i)}) = \frac{1}{1 + e^{−θ^T x^{(i)}}},   (24)
where σ(z) = \frac{1}{1 + e^{−z}} is called the sigmoid function or the logistic function. Note that
σ(z) outputs values in the interval (0, 1), such that

σ(z) → 1,  z → ∞
σ(0) = \frac{1}{2}
σ(z) → 0,  z → −∞.
Moreover, σ(z) is differentiable over R and its derivative satisfies the following property:

σ'(z) = −\frac{1}{(1 + e^{−z})^2} (−e^{−z})
      = \frac{1}{1 + e^{−z}} \cdot \frac{e^{−z}}{1 + e^{−z}}
      = \frac{1}{1 + e^{−z}} \left( 1 − \frac{1}{1 + e^{−z}} \right)
      = σ(z)(1 − σ(z)).   (25)
Our last assumption is that the training set was generated independently. Backed up
by these assumptions, let's adjust θ to maximize the classification accuracy over the
training set, hopefully yielding accurate results for unseen x. Let's use the maximum
likelihood criterion to adjust θ. The likelihood function in this case is given by

L(θ) = p(y^n | x^n; θ)
     = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ)
     = \prod_{i=1}^{n} (σ(θ^T x^{(i)}))^{y^{(i)}} (1 − σ(θ^T x^{(i)}))^{1 − y^{(i)}}.   (26)
As we stated in the previous section, we can equivalently maximize any strictly increasing
function of L(θ). So, to simplify our calculations, let's maximize the log likelihood
function

l(θ) = log \prod_{i=1}^{n} (σ(θ^T x^{(i)}))^{y^{(i)}} (1 − σ(θ^T x^{(i)}))^{1 − y^{(i)}}
     = \sum_{i=1}^{n} \left[ y^{(i)} log σ(θ^T x^{(i)}) + (1 − y^{(i)}) log(1 − σ(θ^T x^{(i)})) \right],   (27)
where the last equality follows from the fact that, for a fixed x,
−\sum_y p_{Y|X}(y|x) log p_{Y|X,θ}(y|x, θ) is the cross entropy H(p_{Y|X=x}, p_{Y|X=x,θ}). Hence
our objective is, for any x, to minimize H(p_{Y|X=x}, p_{Y|X=x,θ}). Recall that cross entropy
has the property

H(p, q) = H(p) + D(p||q),   (30)

hence in the objective we minimize, for any x, the divergence D(p_{Y|X=x} || p_{Y|X=x,θ}), where
p_{Y|X}(y|x) is the true conditional pmf and p_{Y|X,θ}(y|x, θ) is the one given by the model.
After finding the expression for the log likelihood function, we can use any of the
optimization methods we discussed earlier. Let's implement the batch gradient ascent
method to maximize the log likelihood function of the binary case, Eq. (27). To do so,
we will need to find ∇_θ l(θ):
\frac{∂l(θ)}{∂θ_j} = \sum_{i=1}^{n} \frac{∂}{∂θ_j} \left[ y^{(i)} log σ(θ^T x^{(i)}) + (1 − y^{(i)}) log(1 − σ(θ^T x^{(i)})) \right]
            (a)= \sum_{i=1}^{n} \left[ y^{(i)} \frac{σ'(θ^T x^{(i)})}{σ(θ^T x^{(i)})} x_j^{(i)} + (1 − y^{(i)}) \frac{−σ'(θ^T x^{(i)})}{1 − σ(θ^T x^{(i)})} x_j^{(i)} \right]
            (b)= \sum_{i=1}^{n} \left[ y^{(i)} (1 − σ(θ^T x^{(i)})) x_j^{(i)} − (1 − y^{(i)}) σ(θ^T x^{(i)}) x_j^{(i)} \right]
            (c)= \sum_{i=1}^{n} ( y^{(i)} − σ(θ^T x^{(i)}) ) x_j^{(i)},   (31)
Where in (a) we took derivatives, in (b) we used property (25), and in (c) we simplified
the expression. After finding the gradient, we can write the logistic regression update
rule:

θ_j^{(t+1)} := θ_j^{(t)} + α \sum_{i=1}^{n} ( y^{(i)} − σ(θ^T x^{(i)}) ) x_j^{(i)}.   (32)
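A minimal sketch of the update (32) (my own illustration, not from the notes; numpy, the toy data, and the averaging of the gradient for a stable step size are all assumptions): batch gradient ascent on the log likelihood of a binary logistic regression model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy binary data: label 1 when 2*x1 - 1 + noise > 0, with x_0 = 1 as bias feature.
rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
y = (2 * x1 - 1 + rng.normal(0, 0.3, size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x1), x1])       # rows are training examples

theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    # Eq. (32): theta_j <- theta_j + alpha * sum_i (y_i - sigma(theta^T x_i)) * x_ij
    # (here divided by the number of examples so the step size stays well behaved)
    theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)
print(theta)    # roughly proportional to [-1, 2], the assumed separating direction
```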
Next we will talk about how to generalize the logistic regression algorithm to a non-binary
classifier.
In the first part of this lecture we talked about linear regression and we introduced the
MSE cost function. The MSE cost function is widely used in applied mathematics and
machine learning. Using different cost functions will end up giving us different results
when using our optimization methods such as Gradient Descent for our regression and
classification problems.
In the second part of this lecture we introduced Logistic Regression and showed that
maximizing the log likelihood function resulted in achieving the Cross Entropy cost
function. The Cross Entropy cost function has an advantage over the MSE cost function
in some cases. In order to understand the advantage we shall first address a problem.
Assume we’re solving a Logistic regression problem, while using the logistic function
σ(θ^T x^{(i)}) and choosing our cost function to be the MSE cost function:

C(θ) = \frac{1}{2n} \sum_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2
     = \frac{1}{2n} \sum_{i=1}^{n} ( σ(θ^T x^{(i)}) − y^{(i)} )^2.   (33)
Now differentiate C(θ) in order to use gradient descent:

\frac{∂C(θ)}{∂θ_j} = \frac{1}{n} \sum_{i=1}^{n} ( σ(θ^T x^{(i)}) − y^{(i)} ) σ'(θ^T x^{(i)}) x_j^{(i)}.   (34)
Now let's try a different approach. What if we could choose a cost function so that
the term σ'(θ^T x^{(i)}) disappeared? In that case, the derivative of the cost for a single training
example x would be:

\frac{∂C(θ)}{∂θ_j} = ( σ(θ^T x) − y ) x_j.   (36)
For simplicity, define z = θ^T x. By the chain rule, ∂C(θ)/∂θ_j = (∂C(θ)/∂σ(z)) σ'(z) x_j, so to
obtain (36) the cost function must satisfy

\frac{∂C(θ)}{∂σ(z)} = \frac{σ(z) − y}{σ(z)(1 − σ(z))} = \frac{−y}{σ(z)} + \frac{1 − y}{1 − σ(z)}.   (38)
Integrating this expression with respect to σ(z) gives us

C = −( y log σ(z) + (1 − y) log(1 − σ(z)) ) + const.   (39)

This is the contribution to the cost from a single example. To get the full cost function
we must average over all training examples, obtaining

C = −\frac{1}{n} \sum_x ( y log σ(z) + (1 − y) log(1 − σ(z)) ).   (40)
We achieved the Cross-Entropy cost function.
To summarize, we showed in this section that the Cross-Entropy cost function has an
advantage. When using cross-entropy cost function the larger the error (σ(z) − y), the
larger the change in θ. This means that the learning process will potentially accelerate.
Softmax regression is used when our target value y can take values from the discrete
set {1, 2, . . . , k}. In this case we assume that
p(y = 1 | x; θ) = φ_1
p(y = 2 | x; θ) = φ_2
   ⋮
p(y = k − 1 | x; θ) = φ_{k−1}
p(y = k | x; θ) = 1 − \sum_{i=1}^{k−1} φ_i = φ_k,   (41)
where 1{·} is the indicator function that returns 1 if the argument statement is true and
0 otherwise. Now we will estimate φ1 , . . . , φk−1 with the following functions
φ̂_1 = h_{θ^{(1)}}(x) = \frac{\exp((θ^{(1)})^T x)}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)}
φ̂_2 = h_{θ^{(2)}}(x) = \frac{\exp((θ^{(2)})^T x)}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)}
   ⋮
φ̂_{k−1} = h_{θ^{(k−1)}}(x) = \frac{\exp((θ^{(k−1)})^T x)}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)}
φ̂_k = h_{θ^{(k)}}(x) = \frac{1}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)},
where θ^{(i)} for all i = 1, . . . , k−1, and θ^{(k)} = \vec{0}, are the parameters used to generate
the hypotheses. Let Θ be the parameter matrix that contains θ^{(i)} for all i = 1, . . . , k, such
that

Θ = [ θ^{(1)}  θ^{(2)}  · · ·  θ^{(k−1)}  0 ].   (43)

Moreover, note that \sum_{i=1}^{k} φ̂_i = 1 and 0 ≤ φ̂_i ≤ 1, so the φ̂_i's form a probability distribution.
Next, let's use the maximum likelihood criterion on the log likelihood function. The log
likelihood is given by

l(Θ) = log p(y^n | x^n; Θ)
    (a)= log \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; Θ)
    (b)= log \prod_{i=1}^{n} h_{θ^{(1)}}(x^{(i)})^{1\{y^{(i)}=1\}} h_{θ^{(2)}}(x^{(i)})^{1\{y^{(i)}=2\}} · · · h_{θ^{(k)}}(x^{(i)})^{1\{y^{(i)}=k\}}
    (c)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = 1\} log h_{θ^{(1)}}(x^{(i)}) + · · · + 1\{y^{(i)} = k\} log h_{θ^{(k)}}(x^{(i)}) \right]
    (d)= \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} log h_{θ^{(q)}}(x^{(i)}),   (44)
where in (a) we used the assumption that the training set was generated independently,
in (b), (c) we used the logarithm properties, and in (d) we simplified the expression. Let's
substitute the hypothesis with its explicit function to get

l(Θ) = \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} log h_{θ^{(q)}}(x^{(i)})
     = \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} log \frac{\exp((θ^{(q)})^T x^{(i)})}{\sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)})}
  (a)= \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} \left[ log \exp((θ^{(q)})^T x^{(i)}) − log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (b)= \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − \sum_{q=1}^{k} 1\{y^{(i)} = q\} log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (c)= \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right],   (45)
where in (a), (b) we used logarithm properties and in (c) we used the fact that the indicator
function returns 1 only once for every training example. Now, let's find the derivative of
l(Θ) with respect to θ_j^{(r)} to form our gradient ascent update rule that will maximize our
log likelihood function:
\frac{∂l(Θ)}{∂θ_j^{(r)}} = \frac{∂}{∂θ_j^{(r)}} \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (a)= \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} \frac{∂}{∂θ_j^{(r)}} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − \frac{∂}{∂θ_j^{(r)}} log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (b)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} x_j^{(i)} − \frac{\exp((θ^{(r)})^T x^{(i)})}{\sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)})} x_j^{(i)} \right]
  (c)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} − \frac{\exp((θ^{(r)})^T x^{(i)})}{\sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)})} \right] x_j^{(i)}
  (d)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} − p( y^{(i)} = r | x^{(i)}; Θ ) \right] x_j^{(i)},   (46)
Where in (a) we used the linearity of the derivative, in (b) we took derivatives, and in
(c), (d) we simplified the expression. Finally we can write our batch gradient ascent
update rule to be

θ_j^{(r)} := θ_j^{(r)} + α \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} − p( y^{(i)} = r | x^{(i)}; Θ ) \right] x_j^{(i)},   (47)
so the gradient stops adjusting Θ when p(y^{(i)} = r | x^{(i)}; Θ) is close to 1 for the correct
class and close to 0 for the incorrect classes, which confirms that our result makes sense.
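A compact sketch of the softmax update (47) follows (my own illustration, not from the notes; numpy, the toy three-class data, and the averaging of the gradient are assumptions). The update is vectorized over all classes, and the last parameter vector θ^{(k)} is held at zero as in (43).

```python
import numpy as np

def softmax_probs(Theta, X):
    """p(y = r | x; Theta) for every example (rows of X) and class r."""
    scores = X @ Theta                              # shape [n, k]; last column uses theta^(k) = 0
    scores -= scores.max(axis=1, keepdims=True)     # numerical stabilization
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# Assumed toy data: 3 classes separated along one feature, bias feature x_0 = 1 included.
rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=300)
y = np.digitize(x1, [-1.0, 1.0])                    # labels 0, 1, 2
X = np.column_stack([np.ones_like(x1), x1])
n, k = len(y), 3

Theta = np.zeros((2, k))                            # columns are theta^(1), ..., theta^(k)
Y = np.eye(k)[y]                                    # one-hot indicators 1{y_i = r}
alpha = 0.5
for _ in range(2000):
    # Eq. (47): theta^(r) <- theta^(r) + alpha * sum_i (1{y_i = r} - p(y_i = r | x_i)) x_i
    grad = X.T @ (Y - softmax_probs(Theta, X)) / n  # averaged for a stable step size
    grad[:, -1] = 0.0                               # keep theta^(k) fixed at zero, Eq. (43)
    Theta += alpha * grad
print((softmax_probs(Theta, X).argmax(axis=1) == y).mean())   # training accuracy
```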
REFERENCES
[1] Andrew Ng's machine learning course, lecture on Supervised Learning. http://cs229.stanford.edu/materials.html.
Machine learning
Lecture 4: Neural Networks
Throughout this lecture we introduce Neural Networks, starting from a single neuron
and ending with the Backpropagation method. Most of the material for this lecture is
based on the online book of Michael Nielsen [1].
I. NEURAL NETWORKS
Neural networks are limited imitations of how our own brains work. They have had a big
recent resurgence because of advances in computer hardware. There is evidence that the
brain uses only one "learning algorithm" for all its different functions. At a very simple
level, neurons are basically computational units that take inputs (dendrites) as electrical
signals (called "spikes") that are channeled to outputs (axons).
Neural neworks are typically organized in layers. Layers are made up of a number of
interconnected ’nodes’ which contain an ’activation function’. Patterns are presented to
the network via the ’input layer’, which communicates to one or more ’hidden layers’
where the actual processing is done via a system of weighted ’connections’. The hidden
layers then link to an ’output layer’, which is the output of the network.
Neural Networks can be applied to many problems, such as function approximation,
classification, regression, data processing, etc. We will start by looking at a single neuron,
define its model, and then combine neurons into a complete network.
A neuron is depicted in Fig. 1. The Neuron has k inputs, x1 , x2 , ..., xk , a set of weights
w1 , w2 , ..., wk corresponding to the inputs, a bias b and an activation function σ(z) that
produces a single output y. The output of the neuron y is determined by
y = σ(z),
z = \sum_{i=1}^{k} w_i x_i + b.   (1)
[Fig. 1: a single neuron with inputs x_1, x_2, . . . , x_k, weights w_1, w_2, . . . , w_k, and output y = σ(z).]
There are many different options for the choice of the activation function σ(z). A few
of them are given below and are depicted in Fig. 2.
• σ(z) = \frac{1}{1 + e^{−z}} - sigmoid function
• σ(z) = sign(z) - sign function
• σ(z) = tanh(z) = \frac{e^z − e^{−z}}{e^z + e^{−z}} - hyperbolic tangent
• σ(z) = max(0, z) - rectified linear unit (RLU)
[Fig. 2: the logistic, sign, tanh, and RLU activation functions σ(z) plotted as functions of z.]
As we mentioned before, a Neural Network is organized in layers, where the first layer
contains the inputs of the network, the last layer is the output of the network, and the
layers in between are called hidden layers. Each layer gets its inputs from the layer
before, and passes its outputs to the next. We call this step forward propagation. Fig. 3
depicts a fully connected neural network with an input layer, a hidden layer and an output layer.
It is called fully connected because the neurons in each layer are connected to all
the neurons in the next layer.
[Fig. 3: a fully connected network with inputs x_1, . . . , x_{K_1}, hidden activations a^l_j, weights w^l_{jk}, and outputs a^L_1, . . . , a^L_{K_L}.]
The parameters and variables that define a neural network are the following:
• w^l_{jk} - the weight for the connection from the k-th neuron in the (l − 1)-th layer to the
  j-th neuron in the l-th layer.
• b^l_j - the bias we add to the j-th neuron in the l-th layer.
• K_l - the number of neurons in the l-th layer.
• z^l_j = \sum_{k=1}^{K_{l−1}} w^l_{jk} a^{l−1}_k + b^l_j
• a^l_j = σ(z^l_j)
By organizing our parameters in matrices and using matrix-vector operations, we can take
advantage of fast linear algebra routines to quickly perform calculations in our network.
• z^l = [z^l_1, z^l_2, . . . , z^l_{K_l}]^T
• b^l = [b^l_1, b^l_2, . . . , b^l_{K_l}]^T
• w^l = \begin{pmatrix} w^l_{11} & w^l_{12} & \cdots & w^l_{1K_{l−1}} \\ w^l_{21} & w^l_{22} & \cdots & w^l_{2K_{l−1}} \\ \vdots & \vdots & \ddots & \vdots \\ w^l_{K_l 1} & w^l_{K_l 2} & \cdots & w^l_{K_l K_{l−1}} \end{pmatrix}
• a^l = [a^l_1, a^l_2, . . . , a^l_{K_l}]^T
• z^l = w^l a^{l−1} + b^l
• a^l = σ(z^l)
• σ([x_1, . . . , x_n]^T) = [σ(x_1), . . . , σ(x_n)]^T
[Fig. 4: the XOR function over inputs x_1, x_2 ∈ {0, 1}; the 0 and 1 outputs cannot be separated by a single line.]
Example 1 (XOR function) One neuron is able to separate two sets only by a linear
separation. Now consider the XOR function that we are all familiar with, depicted
in Fig. 4. It is easy to see that the XOR function cannot be approximated using a linear
function (there is no line that could separate the two groups of answers). But now we
will show that it is possible to approximate the XOR function using a Neural Network.
We build a Network as depicted in Fig. 5 with three inputs, x1 , x2 , and the third set to
’1’, a hidden layer of two neurons, and a single neuron output layer. All biases are set
to zero. And the set the weights that we choose is given in Fig. 5.
x1 a21 a31
1 1
1
−2
1
x2 a22
1
0
−1
We choose the activation function to be the RLU function. Let’s take the input
[x1 , x2 ] = [0, 0] and insert it to the network:
a21 = max(0, 1 ∗ x1 + 1 ∗ x2 + 0 ∗ 1) = 0
a22 = max(0, 1 ∗ x1 + 1 ∗ x2 + −1 ∗ 1) = 0
The same goes for the other options for the inputs: [0, 1], [1, 0], [1, 1]:
We can see that the output a31 fits the XOR function perfectly. Note that the output of
each layer is not dependant pof previous layers, only it’s own inputs and weights.
C. Cost Function
A very important part of defining the Neural Network is the goal that it is designed
to achieve. Hence, one need to define a cost function that quantify how close we’re to
achieving the goal. The cost function is a measure of how close is the output of the neural
network to the desired label. For instance a mean square error function (or a quadratic
cost function) is define as a cost function as follows:
N
1 X
C(w, b) = ka(xi ) − yi k2 , (2)
2N i=1
where x are the input vectors and y are the corresponding labels. The input and the label
are determined by the problem setting, hence are fixed. The parameters w and biases b
are determined by the neural network, hence we can consider that the cost function is a
function of the weights w and biases b.
Our main goal is to minimize the cost function, so that the the output from the network
will be close to the desired output as possible. To minimize the cost function, we will use
a method called Gradient Decent. Gradient descent is a iterative optimization algorithm.
It says that to find a local minimum of a function, one should takes steps proportional
to the negative of the gradient (or of the approximate gradient) of the function at the
current point.
D. Backpropagation
where the sum is over all neurons k in the output layer. Of course, the output activation
aLk of the k t h neuron depends only on the input weight zjL for the j t h neuron when k = j.
∂aL
And so k
∂zjL
vanishes when k 6= j. As a result we can simplify the previous equation to
∂C ∂aLj
δjL = L L. (8)
∂aj ∂zj
4 - Neural Networks-8
Recalling that aLj = σ(zjL ), the second term on the right can be written as σ ′ (zjL ), and
the equation becomes
∂C ′ L
δjL = σ (zj ). (9)
∂aLj
We can rewrite the equation in a matrix-based form, as
δ L = ∇a C ⊙ σ ′ (z L ). (10)
∂C
Here, ∇a C is defined to be a vector whose components are the partial derivatives ∂aL
.
j
Differentiating, we obtain
∂zkl+1 l+1 ′ l
= wkj σ (zj ). (15)
∂zjl
Substituting back into (13) we obtain
X
δjl = l+1 l+1 ′ l
wkj δk σ (zj ). (16)
k
In a matrix-based form,
T
δl = w l+1 δ l+1 ⊙ σ ′ (z l ), (17)
T
where w l+1 is the transpose of the weight matrix w l+1 for the (l + 1)th layer.
4 - Neural Networks-9
By combining (17) with (10) we can copmute the error δ l for any layer in the network.
We start by using (10) to compute δ L , then apply Equation (17) to compute δ L−1 , then
Equation (17) again to compute δ L−1 , and so on, all the way back through the network.
Now that we have the errors δjl of all the layers of the network, we can compute the
∂C ∂C
partial derivatives l
∂wjk
and ∂blj
as a function of δjl :
∂C ∂C ∂zjl
l
= = δjl akl−1 , (18)
∂wjk ∂zjl ∂wjk
l
∂C ∂C ∂zjl
l
= l l
= δjl . (19)
∂bj ∂zj ∂bj
For each iteration we use (3) and (4) and compute the new values of the parameters.
To summerize the backpropagation algorithm:
1) Perform a feedforward pass, computing the activations for layers 2,3, and so on up
to the output layer L.
2) For each output unit j in layer L (the output layer), set
∂C ′ L
δjL = σ (zj ). (20)
∂aLj
3) For layers l = L − 1, L − 1, ..., 2, for each node j in layer l, set
X
l+1 l+1 ′ l
δjl = wkj δk σ (zj ). (21)
k
R EFERENCES
[1] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
5-1
Machine learning
Lecture 5
Lecturer: Haim Permuter Scribe: Gal Rattner
I. I NTRODUCTION
6
y
0 1 2 3 4 5 6 7 8 9 10
x
C = C0 + λ kwkL , (1)
Among the most common L-norm regularization methods are the L2 and L1 regularization
terms.
L2 regularization: Using the L2 regularization term, the total cost function is given
by
λ
C = C0 + kwk22
2n
5-3
λ X 2
= C0 + w , (2)
2n i i
where C0 can be the quadratic, cross-entropy or other cost function, wi are the weights in
the net, and the hyper-parameter λ scale the regularization term to have gentle influence
on the total cost C, and holds λ > 0. The influence of the L2 regularization term is clear
when observing the partial derivatives of the cost function in equation (2), i.e.
∂C ∂C0 λ
= + w, (3)
∂w ∂w n
and
∂C ∂C0
= . (4)
∂b ∂b
The partial derivatives can be computed using backpropagation, as described in lecture
4, and the update rules are given by
∂C0
b→b−η , (5)
∂b
and
∂C0 ηλ
w →w−η − w
∂w n
ηλ ∂C0
= 1− w−η . (6)
n ∂w
This update rule is the same as the usual gradient descent update rule, except we rescale
ηλ
the weights first by factor n
. This rescaling factor is sometimes referred to as the weight
decay factor.
L1 regularization: Using the L1 regularization term the total cost argument is given
by
λX
C = C0 + |wi |, (7)
n i
where λ > 0.
The partial derivatives of the cost function with respect to w is given by
∂C ∂C0 λ
= + sign(w). (8)
∂w ∂w n
5-4
∂C0 ηλ
w →w−η − sign(w). (9)
∂w n
We notice that both L2 and L1 regularization are penalizing large weights, and causing the
weights to decrease toward zero. The difference is in the rate of decrement, where with
L1 regularization the decrement rate is constant, with L2 regularization the decrement
rate is proportional to the size of the weight w.
III. D ROPOUT
A very common regularization method of deep neural network which have proven its
improvement skills is the dropout, as presented in [4]. Unlike the regularization methods
described above, dropout modify the network’s connections rather than the cost function.
The way dropout works is by setting each activation node to zero with some prefixed
probability p ∈ [0, 1], for each training iteration. The hyper-parameter p is generally set
before the training session starts. On that way, for an input or hidden layer with Nl
nodes, at each training iteration we get an average of only p × Nl active nodes in the
net. Notice that we do not drop any node at the net’s output layer. In order to keep the
total sum of activations coming out of each layer constant, we multiply any un-dropped
activation by factor 1p . Repeating this method over and over during the training operation,
the deep neural network will learn a set of weights and biases which generalize better
than the regular training.
The motivation of using this method can be given in three main points:
Figure 3. Fully connected neural network after applying dropout on the hidden layer.
R EFERENCES
Machine Learning
This lecture discusses decision trees and the minimum description length principle. the
goal of desicion trees is to have a series of questions(tree), that ends in decisions(leaves).
Discussed as a regulating mechanism, the MDL shows that the best hypothesis for a
given set of data is that which leads to the best compression of the data. The last part of
the lecture discuses two methods for avoiding overfitting: Random Forest and Pruning.
I. D ECISION T REE
Here we have a method in which we create a k’-nary tree. Each node represents a
question about the features, Θn . Each branch represents an answer to its origin node,
one of k options. Each leaf represents a decision. A decision tree is easily implemented
in a variety of applications and it is useful for both classification and regression. This
lecture introduces the notation of the binary decision tree, but extending it to k’-nary is
straightforward. Let {xi } be the i’th sample, i = 1, 2, .., m, each of which has n features,
and Θ is the question asked in each node. Note that a large part of this section is taken
from the book and lecture by Prof. Shai Shalev-Shwartz[1], [2].
Firm but
not hard
Tasty
Pale green to
light yellow
Softness
Not tasty
Other
Fig. 1. Classic Decision tree for papaya classification,using two features: color and softness.
A. Cost functions
Let {xi,j } be the j’th feature of the i’th sample, and {yi }i=1..m be the classifications
(labels). Define gain, where C(x) is the cost function, as
X
Gain(Y ; Θ) = C(Y ) − C(Y |Θ) = C(Y ) − P (Θ = θ)C(Y |Θ = θ) (1)
θ∈Θ
Cost functions
1
H(a)
0.9 G(a)
0.8
E(a)
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
a
Gain with entropy as the cost is called information gain, and it is widely used in
applications. Note that information gain is the mutual information we learned about
in the first lecture of the course.
Red
θ=1 Tasty 48
Not tasty 12
Color
Tasty 2
Otherwise Not tasty 38
θ=0
2 2 3 12
Inf ormation Gain = H(Y ) − H(Y |Θ) = 1 − Hbinary ( ) − Hbinary ( ) (3)
5 40 5 60
Set cost to be the Gini index and we have
2 2 3 12
Gain(Y ; Θ) = 1 − G( ) − G( ) (4)
5 40 5 60
B. Algorithm
ID3(S,A)
create a new node;
if all samples are ones then
label new node as leaf labeled 1;
else
if all samples are zeros then
label new node as leaf labeled 0 ;
else
if A ∈ ∅ then
label new node as leaf labeled as the majority of labels in S;
else
p = argmax G(S, 1xa =1 ) ;
a∈A
C. Regression
Like with classification problems, decision trees can also be simply applied in
regression problems. Generally speaking, we substitute the label decision with the average
or mean. So instead of asking questions with k answers, we ask about k ranges. And
instead of taking the majority of the labels we have the average or mean of the leaf’s data.
For example when putting a price on a new house, its estimated value will be the mean
of the values of all houses with the same features (i.e. identical answers to questions on
the tree).
6-Decision Tree & MDL-5
II. MDL
s
− log(w(h)) + log( 2δ )
∀h ∈ H, LD (h) ≤ LS (h) +
2m
Or alternately
s
− log(w(h)) + log( 2δ )
∀h ∈ H, Pr Ld (h) − LS (h) ≤ ≥1−δ
2m
6-Decision Tree & MDL-6
The proof of the MDL bound is based mainly on Hoeffding’s inequality, wherein setting
some free parameters we bring the inequality to seen like the bound we need. From here
on the proof is straightforward.
Or similarly:
m
!
1 X 2mǫ2
Pr | Xi − µ| > ǫ ≤ 2 exp − Pn 2
(7)
m i=1 i=1 (bi − ai )
s
m 2 log( δ2 )
1 X log( δn ) 2m 2m h
Denote that
m
" # m
1 X 1 X
E[LS ] = E 1{h(xi )6=yi } = E[1{h(xi )6=yi } ] = Pr(Y 6= h(X)) = LD . (11)
m i=1 m i=1
So for a fixed h
s
− log(w(h)) + log( 2δ )
∀h ∈ H Pr |LS (h) − LD (h)| > ≤ δh (12)
2m
6-Decision Tree & MDL-7
Hence, we have
s
− log(w(h)) + log( 2δ )
Pr |LD − LS | ≤ ≥1−δ (14)
2m
When exploiting the MDL bound to optimize a decision tree, the most common method
employed is pruning. The simple algorithm suggests testing each node of the tree with
the following statement, which we define as a cost for the algorithm:
s
− log(w(h)) + log( 2δ )
Cost(h) = . (15)
2m
Now all that’s left is to calculate the cost with and without each node. If the cost difference
is significant, prune (i.e. delete the node). Pruning will be farther discussed in the next
section. An alternative approach is to use the MDL bound to set the tree’s optimal depth,
and then to use a modified algorithm to construct the tree by using combinations of
features.
A. Random forest
The general method of random decision forests was first proposed in 1995 by Ho[7]
who established that if forests of trees split with oblique hyperplanes, are randomly
restricted to be sensitive to only selected feature dimensions, they can gain accuracy
as they grow without suffering from over-training. Random forests are an ensemble
learning method for classification, regression and other tasks that operate by constructing
a multitude of decision trees at training time. The output of the trees comprises the
class, i.e. the mode (the value that appears most often), of the classes for classification
problems, or the mean prediction of the individual trees for regression. Random decision
forests correct the habit of decision trees to overfit to their training set.[5]
6-Decision Tree & MDL-8
1) Algorithm: The idea behind random decision forests is to train many small trees
with subsets of the samples. For each tree, we randomly select n samples of all the
samples, for those n samples we choose (again randomly) k features that will be used to
train the specific tree. When we have the new subsets, we should use the ID3 algorithm
and Feedforward for each tree. For classification, we take the majority of all trees to
make a decision, whereas for regression problems, we calculate the average of all trees
to reach a decision. The reason for the randomness is to decrease the correlation between
trees, and in so doing, to contribute to gains in accuracy. For example, if one or a few
features are very strong predictors for the response variable (the label), these features
will be selected in many of the B trees, causing them to become correlated.
√
Typically, for a classification problem with p features, p (rounded down) features are
p
used in each split. For regression problems the inventors recommend 3
(rounded down)
with a minimum node size of 5 as the default.
B. Pruning
The optimal final tree size in a decision tree algorithm is not a trivial matter. A tree that
is too large risks overfitting the training data, thus rendering it poorly generalizable to
new samples. Conversely, a tree that is too small might not capture important structural
information about the sample space. Determining when a tree algorithm should stop,
however, is difficult because one cannot know whether the addition of a single extra
node will dramatically decrease error. This problem, which is known as the horizon
6-Decision Tree & MDL-9
effect,, is commonly addressed by growing the tree until each node contains a small
number of instances and then by using pruning to remove nodes that do not provide
additional information. Pruning should reduce the size of a learning tree without reducing
its predictive accuracy as measured by a cross-validation set. There are many techniques
for tree pruning that differ in terms of the measurement that is used to optimize
performance.[6]
1) Techniques: Generally speaking, tree pruning techniques can be classified as either
top down - traverse nodes and trim subtrees starting at the root, or up - start at the
leaf nodes. Each pruning technique has advantages and weaknesses that, if known and
understood, can aid us in selecting the most appropriate method in every instance.Here we
will learn two popular algorithms, one each from the top-down and bottom-up approaches.
• {Ti }i=0,...,m - series of trees where T0 is the original tree and Tm is the root
alone.
• Error Rate of a tree - err(T, S) where T is the current tree and S is the
data set.
• Function - prune(T, t) - defines the resultant tree after removal of subtrees t
from T .
• Function - leaves(T ) - defines the quantity of leaves at tree T .
At each step,i, the tree is created by removing a subtree t from tree i − 1 and
replacing it with a leaf node whose value is chosen as in the tree building algorithm.
First we find the error rate err(Ti , S) for the current tree Ti and then we search
err(prune(Ti ,t),S)−err(Ti ,S)
for the subtree that minimizes the expression: |leaves(Ti )|−|leaves(prune(Ti ,t))|
Once the series of trees has been created, the best tree is chosen base on generalized
accuracy as measured by a training set or by cross-validation.
2) Reduced error pruning (bottom-up) -
Starting at the leaves, each node is replaced with its most popular class. If the
prediction accuracy is not affected then the change is kept. Very simple to apply,
6-Decision Tree & MDL-10
IV. A PPENDIX
Sn = X1 + X2 + ... + Xn . (17)
We have:
2t2
P (|Sn − E[Sn ]| ≥ t) ≤ 2 exp − Pn 2
(18)
i=1 (bi − ai )
Lemma 2 (Hoeffding Lemma) Suppose X is a real random variable with mean equal
to zero such that P (X ∈ [a, b]) = 1. Then
1
E[esX ≤ exp( s2 (b − a)2 ).
8
Another inequality we need is
Lemma 3 (Markov’s inequality) Let X be a non negative r.v. and a > 0, then:
E[X]
P (X ≥ a) ≤ .
a
Proof 2 Suppose X1 , X2 , ..., Xn are n independent r.v. such that
P (Xi ∈ [ai , bi ]) = 1, 1 ≤ i ≤ n.
Let
Sn = X1 + X2 + ... + Xn .
−st s(Sn −E[Sn ])
≤e E e (20)
n
Y
−st s(Xi −E[Xi ])
=e E e (21)
i=1
n
Y s2 (bi −ai )2
≤ e−st e 8 (22)
i=1
n
!
1 2X
= exp −st + s (bi − ai )2 . (23)
8 i=1
Now, for us to obtain the best possible upper bound, we need to find the minima of
g
right-hand side of the last inequality as a function of s. Define g : R+ −
→ R as
n
s2 X
g(s) = −st + (bi − ai )2
8 i=1
Note that g(s) is a quadric function, and hence, its minimum is at
4t
s = Pn 2
.
i=1 (bi − ai )
Thus we get
2t2
P (Sn − E[Sn ] ≥ t) ≤ exp − Pn 2
(24)
i=1 (bi − ai )
2t2
P (|Sn − E[Sn ]| ≥ t) ≤ 2 exp − Pn 2
(25)
i=1 (bi − ai )
Refer to[4] for further reading and better understanding about this subject.
R EFERENCES
[1] Shai Shalev-Shwartz. Understanding Machine Learning course, Lecture No.4,
Hebrew University of Jerusalem, 2014.
[2] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms
May,2015.
[3] Quinlan, J. R. 1986. Induction of Decision Trees
http://hunch.net/∼coms-4771/quinlan.pdf Netherlands, 1986
[4] Hoeffding, Wassily. ”Probability inequalities for sums of bounded random variables”. Journal of the American
Statistical Association ,Vol. 58, No. 301 (Mar., 1963), pp. 13-30
[5] Wikipedia https://en.wikipedia.org/wiki/Random forest
[6] Wikipedia https://en.wikipedia.org/wiki/Pruning (decision trees)
[7] Ho, Tin Kam (1995). Proceedings of the 3rd International Conference on Document Analysis and Recognition,
Montreal, QC, 1416 August 1995. pp. 278282. http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf
7: Tree distribution-1
This lecture discusses tree distribution and the method of types. We will introduce a
method developed by Chow and Liu [1] to fit the best tree to the data, wherein the ”Best
tree” refers to the tree that have the minimum divergence. We will use some principles that
originated in the field of the method of types, in order to show that minimum divergence
is equivalent to maximum likelihood. The method of types introduced here was fully
developed by Csiszar and Korner [2], who derived the main theorems of information
theory from this perspective.
Z Y
U W V
Fig. 1. A tree with distributions of the structure Pt (x, y, z, u, v, w) = P (x)P (y|x)P (z|x)P (u|z)P (w|z)P (v|y).
Chow and Liu used the tree distribution which have only one father for every node
(excluding the root), they assumed that the criteria of the optimal tree is minimum
divergence, and they proved that the minimum divergence achieved by maximum mutual
information over the air of variables. Note that a tree structure has a much smaller number
of parameters (linear in the number of nodes) when compared to the exponentially many
parameters needed for a general distribution. We get less complex sparse approximator
when we use the tree structure but we also get less accuracy compared to the general
distribution.
Assuming we have information represented as follows:
x1 y1 z 1 w 1
x2 y2 z 2 w 2
.. .. .. ..
. . . .
xn yn z n w n
by Csiszar and Korner [2], who derived the main theorems of information theory from
a method of types perspective.
Let xn = (x1 , x2 , ..., xn ) be a sequence from the alphabet X = (a1 , a2 , a3 , ...a|X | ). Let
N (a|xn ) be the number of times that a appears in sequence xn .
Definition 2 (Type) The type Pxn (or empirical probability distribution) of a sequence
N (a|xn )
xn is the relative proportion of occurrences of each symbol of X , i.e., Pxn (a) = n
for all a ∈ X .
Example 1 Let X = {0, 1, 2}, let n = 5 and x5 = (1, 1, 2, 2, 0). Then N (0|x5 ) = 1,
N (1|x5 ) = 2 and N (2|x5 ) = 2. Hence, Pxn = 51 , 25 , 52 .
Definition 3 (All possible types) Let Pn be the collection of all possible types of
sequences of length n.
For example, if X = {0, 1}, the set of possible types with denominator n is
0 n 1 n−1 n 0
Pn = (P (0), P (1)) : , , , , ..., , . (3)
n n n n n n
Lemma 1 An upper bound for |Pn |:
Proof:
There are |X | components in the vector that specifies Pxn . The numerator in each
component can take on only n + 1 values. So there are at most (n + 1)|X | choices
for the type vector.
Definition 4 (Type class) Let P ∈ Pn . The set of sequences of length n with type P is
called type class of P, denoted T (P ):
Proof:
Since {Xi }i≥1 are i.i.d,
n
Y
Qn (xn ) = Q(xi ). (7)
i=1
Now consider
n
X
log Qn (xn ) = log Q(xi ) (8)
i=1
(a) X
= N (a|xn ) log Q(a) (9)
a∈X
(b) X
=n Pxn (a) log Q(a) (10)
a∈X
X Q(a)
=n Pxn (a) log · Pxn (a) (11)
a∈X
Pxn (a)
where
(a) follows because each a ∈ X contributes exactly log Q(a) times it’s number of
occurrences in xn to the sum in (8).
(b) follows from the definition of Pxn (a).
Hence, we obtained
where xn1 , xn2 ,. . . , xnm are the random processes. And QX1 ,X2 ,...,Xm is the joint distribution
function.
For more information on method of types, see a whole lecture on the subject:
http://www.ee.bgu.ac.il/˜haimp/multi2/lec1/lec1.pdf
7: Tree distribution-5
We want to find the tree distribution that have the maximum probability (maximum
likelihood principle). We saw in the last section about method of types, that is achieved
by minimum divergence. From equation (13):
Pt (xn1 , xn2 , ..., xnm ) = 2−n(H(Pemp )+D(Pemp ||Pt (x1 ,x2 ,...,xm ))) , (15)
where Pemp = Pxn1 ,xn2 ,...,xnm (x1 , x2 , ..., xm ). Since the empirical entropy H(Pemp ) does not
depend on the selected tree but only on samples, then to obtain the maximum probability,
we should look for the minimum of the divergence D(Pemp ||Pt (x1 , x2 , ..., xm )).
X X Pemp
D Pemp ||Pt (x1 , x2 , ..., xm ) = ... Pemp log
x ∈X x ∈X
Pt (x1 , x2 , ..., xm )
1 1 m m
X X
= H(Pemp ) − ... Pemp log Pt (x1 , ..., xm ). (16)
x1 ∈X1 xm ∈Xm
In a tree distribution, it can be said that each node in the tree depends only on its parent
and not on any other node in the tree.
m
Y
Pt (x1 , x2 , ..., xm ) = Pt (xi |xj(i) ), (17)
i=1
where xj(i) is the parent of xi , see Fig 1.
The Chow and Liu algorithm of creating a tree based on the assumption that minimum
divergence is the criteria of the optimal tree. We will define the next lemma of minimum
cross entropy for the following proof which show that the minimum divergence is
equivalent to maximum sum of the mutual information.
Lemma 2 (Minimum cross entropy) The minimum by q of the cross entropy between
(p, q) is equal to the entropy of p
and it is achieved by q = p.
Proof:
X
min H(p, q) = min − p(x) log q(x)
q q
x
7: Tree distribution-6
X q(x)p(x)
= min − p(x) log
q
x
p(x)
X p(x) X
= min p(x) log − p(x) log p(x)
q
x
q(x) x
where (a) follows from the fact that KL divergence D(p||q) is always non-negative and
it get zero if and only if q = p.
Theorem 2 (Chow and Liu results [1] ) The minimum divergence achieved by maxi-
mum sum of mutual information, and it given by,
m
X
min D Pemp ||Pt (x1 , x2 , ..., xm ) = const − I(Xi , Xj(i) ),
Pt
i=1
Proof:
D Pemp ||Pt (x1 , x2 , ..., xm )
(a) X
= H(Pemp ) − Pemp (xm ) log Pt (x1 , ..., xm )
x1 ∈X1 ..xm ∈Xm
m
(b) X
m
Y
= H(Pemp ) − Pemp (x ) log Pt (xi |xj(i) )
xm i=1
m
X Y Pt (xi |xj(i) )Pemp (xi )
= H(Pemp ) − Pemp (xm ) log
xm i=1
Pemp (xi )
m
X
m
X Pt (xi |xj(i) )
= H(Pemp ) − Pemp (x ) log + log Pemp (xi )
xm i=1
Pemp (xi )
m
XX Pt (xi |xj(i) )
= H(Pemp ) − Pemp (xm ) log + log Pemp (xi )
xm i=1
Pemp (xi )
7: Tree distribution-7
m X m
X
m Pt (xi |xj(i) ) X X
= H(Pemp ) − Pemp (x ) log − Pemp (xm ) log Pemp (xi ),
i=1 xm
Pemp (xi ) i=1 xm
(19)
m X m
X Pt (xi |xj(i) ) X X
= H(Pemp ) − Pemp (xi , xj(i) ) log − Pemp (xi ) log Pemp (xi ),
i=1 xi ,xj(i)
Pemp (xi ) i=1 x i
(20)
m X m
X X Pt (xi |xj(i) ) X X
= H(Pemp ) − Pemp (xj ) Pemp (xi |xj(i) ) log − Pemp (xi ) log Pemp (xi ),
i=1 xj(i) xi
Pemp (xi ) i=1 xi
(21)
m X m
(d) X X Pemp (xi |xj(i) ) X X
= H(Pemp ) − Pemp (xj ) Pemp (xi |xj ) log − Pemp (xi ) log Pemp (xi ),
i=1 xj(i) xi
Pemp (xi ) i=1 x i
(22)
m
(c) X
= Const − I(Xi ; Xj(i) ), (23)
i=1
where
(a) - follows from equation (16)
(b) - follows from equation (17)
P Pt (xi |xj(i) )
(c) - Note that xi Pemp (xi |xj ) log Pemp (xi ) is a divergence and the minimum is
achieved when Pt (xi |xj(i) ) = Pemp (xi |xj(i) )
(d) we define Const = H(Pemp ) − m
P P
i=1 xi Pemp (xi ) log Pemp (xi ) which do not
The problem is to create a tree that best describe the data, i.e. the tree that will take
the divergence to minimum. In the previous section, we showed that minimizing the
divergence is equivalent to maximizing the sum of all the mutual information between
7: Tree distribution-8
each node and its parent in the tree. Our goal is to find an algorithm to create the tree.
Kruskal [4] proposed the following algorithm called ”Minimum spanning tree algorithm”
to assemble the desired tree. This is a greedy algorithm which doesn’t promise that an
optimal tree will be found. The Greedy Choice is to pick the smallest weight edge that
does not cause a cycle in the MST constructed so far. This is a generic pseudo-code for
”Minimum spanning tree algorithm”, when V is a set of all the vertices, E is a set of all
the possible edges, A is a sub set of E which contain all the edges that included in the tree.
KRUSKAL(V):
A=∅
for each vertex v,u in V: do
E = calculate weight of edge(u,v);
end
E = SORT-INC(E);
for each edge in E: do
if cycle is not formed: then
A = A ∪ (u, v)
end
if length(A)=length(V)-1: then
break;
end
end
In our problem we need to use ”Maximum spanning tree algorithm” which means
instead of order the wights(mutual information) by increasing order, we need to order
the wights by decreasing order. Before starting the routine we must compute all the
mutual information and arrange the resulting process pairs in a list from the largest
wight to the smallest wight.
Stage 1: Find the pair of nodes with the greatest mutual information in the list.
P (xi |xj(i) )P (xj(i) )
I(i,j) = P (xi , xj(i) ) log , i, j ∈ [1, 2 . . . , N ] , i 6= j. (24)
P (xj(i) )P (xi )
Stage 2: Connect the pair of nodes found in stage 1, update the list of mutual
7: Tree distribution-9
information and return to Stage 1 if the list still contains pair of processes. The update
includes the following:
1. Delete the mutual information of the selected pair.
2. Delete all items in the list that represent prohibited connections, i.e., connections that
create loops in the graph.
Stage 3: Decide which of the two processes will be at the head of the tree and determine
the direction of the arrows. In fact, for either choice, we obtain the same sum of mutual
information, and therefore, either choice is possible.
Example 2 Let us assume that we have the random processes X Y Z W . The mutual
information between each possible pair was calculated. The results are shown in the table
below and in Figure 2.
X
I=0.5 I=0.6
I=0.1
W I=0.3 Y
I=0.4 I=0.4
Fig. 2. Calculate the mutual information between each pair of random processes.
We perform step 1 of the routine and see that the greatest mutual information is
7: Tree distribution-10
between X and Y . Thus, we connect them with a line. We then Proceed to step 2,
delete X, Y from the table and return to step 1. See Figure 3.
X X
I=0.5 I=0.6
I=0.1
W I=0.3 Y Y
I=0.4 I=0.4
Fig. 3. Connect the two random processes with the largest mutual information, and remove that mutual information
from the list.
We re-apply step 1 and find that we have two lines with the same mutual information,
so we can arbitrarily choose between the two options.
After step 2, it appears that the table is empty, so we proceed to step 3 and select the
7: Tree distribution-11
X X
I=0.5 I=0.6
I=0.1
W I=0.3 Y W Y
I=0.4 I=0.4
Fig. 4. Connect the next two random processes with the largest mutual information in the list, and remove it and all
connections that might create loops on the graph.
tree head and the directions of the arrows. See Figure 5 and Figure 6.
X X
I=0.5 I=0.6
I=0.1
W I=0.3 Y W Y
I=0.4 I=0.4
Z Z
Fig. 5. Connect the next two random processes with the largest mutual information in the list, and remove it and all
connections that might create loops on the graph.
7: Tree distribution-12
W Y
R EFERENCES
[1] Chow, C. K., Liu, C.N. (1968), ”Approximating discrete probability distributions with dependence trees”, IEEE
Transactions on Information Theory, IT-14 (3): 462-467
[2] I. Csiszar and J. Korner. Information Theory: ”Coding Theorems for Discrete Memoryless Systems”. Academic
Press, New York, 1981.
[3] I Csiszar. ”Sanov property, generalized I-projection and a conditional limit theorem”. Ann. Prob., 12:768793,
1984.
[4] Kruskal, J. B. (1956). ”On the shortest spanning subtree of a graph and the traveling salesman problem”.
Proceedings of the American Mathematical Society. 7: 4850. doi:10.1090/S0002-9939-1956-0078686-7. JSTOR
2033241.
[5] I. N. Sanov. ”On the probability of large deviations of random variables”. Mat. Sbornik, 42:1144, 1957. English
translation in Sel. Transl. Math. Stat. Prob., Vol. 1, pp. 213-244, 1961.
[6] J. Wolfowitz. ”Coding Theorems of Information Theory”. Springer-Verlag, Berlin, and Prentice-Hall, Englewood
Cliffs, NJ, 1978.
[7] T. M. Cover and J. A. ”Thomas, Elements of Information Theory”, 2nd ed. New-York: Wiley, 2006.
[8] T Weissman. Information Theory: ”Conditional Differential Entropy, Info. Theory in ML”, Lecture 20,
EE376A/STATS376A, stanford university, 2018.
https://web.stanford.edu/class/ee376a/files/lecture_20.pdf
Sequential Machine Learning-1
Throughout this lecture we discuss the task of modeling sequential data. This task
comprises on overcoming the fact that training examples are no longer i.i.d. Therefore, we
have to build models that are capable of modeling the time dependencies between samples
in order to model the data optimally in the sense of maximum likelihood. First, we define
the problem mathematically and derive the necessity of Recurrent Neural Network (RNN)
models. Then, we will elaborate on two types of Recurrent Neural Networks (RNNs):
the Elman Network and the Long-Short-Term-Memory (LSTM) cell.
I. I NTRODUCTION
Given a training set of n ∈ N examples {(xi , yi )}ni=1 drawn from the joint probability
PX n ,Y n , we want to model the generating distribution of the data. In order to do so, we
seek to estimate PX n ,Y n with a parametric model QθX n ,Y n , whose parameters are denoted
by θ.
We distinct the sample elements into two groups: (1) the features X, those elements
that we sample freely from the environment, and (2) the labels Y , those elements that
we can not sample from the environment and seek to predict. We now decompose the
joint probability, namely PX n ,Y n , using the chain rule of probabilities as follows,
n
Y
n n
PXi ,Yi |X i−1 ,Y i−1 xi , yi |xi−1 , y i−1
P X n ,Y n (x , y ) = (1)
i=1
Yn
PXi |X i−1 ,Y i−1 xi |xi−1 , y i−1 PYi |X i ,Y i−1 yi |xi , y i−1 .
= (2)
i=1
Generally, want to model only PYi |X i ,Y i−1 since we sample X n freely from the real
distribution, i.e PXi |X i−1 ,Y i−1 . Hence we can model QθX n ,Y n (xn , y n ) by
n
Y
QθX n ,Y n (xn , y n ) PXi |X i−1 ,Y i−1 xi |xi−1 , y i−1 QθYi |X i ,Y i−1 yi |xi , y i−1 .
= (3)
i=1
Sequential Machine Learning-2
In the rest of the lecture we use this setting, but the derivation could be extended to
estimating the term PXi |X i−1 ,Y i−1 (xi |xi−1 , y i−1 ) as well.
Now, we find the model parameters θ by the maximum likelihood estimator which is
given by
If the samples were sampled i.i.d the term QθYi |X i ,Y i−1 (yi |xi , y i−1 ) would collapse to
QθYi |Xi (yi |xi ), which is feasible to approximate with a parametric model. Unfortunately,
when modeling QθYi |X i ,Y i−1 (yi |xi , y i−1 ), there are two major difficulties with the modeling
task: (1) the model needs more parameters to combine the features from past times, and
(2) the function QθYi |X i ,Y i−1 varies (even in the number of arguments) as i changes.
For example, if we use a logistic model, in the i.i.d case, if x ∈ Rd , y ∈ R, we
can parameterize QθYi |Xi (yi |xi ) the model with d + 1 parameters. However, in the time
dependent case, if we want to model QθYi |X i ,Y i−1 we would need di + d(i − 1) + 1
parameters.
These problems encourage us to make assumptions on the distribution that generated
the data, that eventually lead us to build feasible models which are capable on
encapsulating time dependencies. In the next section we develop a model with shared
weights by approximating the underlying distribution of the data as Markov process.
In order to deal with the parameterization problem, we approximate the process that
generated X n , Y n as a stationary Markov process.
Sequential Machine Learning-3
Let us assume that there exists a R.V Si , f (X i−1 , Y i−1 ), for some deterministic func-
tion f (·), that summarizes the history of inputs and labels, such that, PXi ,Yi |X i−1 ,Y i−1 =
PXi ,Yi |Si . That is, we assume that the following Markov chain holds,
where (x, y, s) ∈ X × Y × S, the alphabet of inputs, targets and states respectively. The
last assumption is that this stationary Markov chain is Ergodic, which basically means that
we could recover the stationary distribution of the Markov chain by sampling observation
of the data.
Under the three preceding assumptions, and by the law of large numbers, the
optimization criteria in (6) becomes
n
X
log QθYi |X i ,Y i−1 yi |xi , y i−1 =
(9)
i=1
n
n→∞
X
log QθYi |Xi ,Si−1 (yi |xi , si−1 ) −−−→ (10)
i=1
Using this approximation, we can build a feasible model that would be able to encapsulate
time dependencies in the data.
In this section we show the equations of the Elman (vanilla) RNN. Then, we delve into
the error propagation analysis of the Elman RNN and explain the vanishing/exploding
gradient phenomena, which will motivate us finally derive the architecture of the Long-
Short-Memory-Term (LSTM) cell.
Sequential Machine Learning-4
A. Elman Network
1) Description: The Elman network uses the Markov settings from the preceding
section. That is, the Elman networks generates a state S that summarizes the history and
is used to generate along with the input X the prediction for Y and the next state S 0 .
Fig. 1. Depiction of single-layered Elman network and its time unrolling representation
zt = Wx xt + Wh ht−1 + bx (13)
ht = σ (zt ) (14)
zty = Wy ht + by (15)
yt = σ (zty ) (16)
t = dt − yt , (17)
Sequential Machine Learning-5
where dt denotes the true label and the term yt denotes the model prediction. For
computational simplicity we assume that the target is a scalar. The goal is to minimize
the MSE loss, namely J(θ) which is given by
T
X
J(θ) = Jt (θ) (18)
t=1
T
X 1
= 2t (19)
t=1
2
We would like to adjust the network weights by propagating the error back in time.
Let us calculate the error signal as it propagates backwards in the network. The error
propagated to the network output is denoted by δty ∈ R1×ny and is given by
∂
δty , Jt (θ) (20)
∂zty
∂ 1
= y (dt − yt )2 (21)
∂zt 2
∂ 1 ∂yt
= (dt − yt )2 y (22)
∂yt 2 ∂zt
= (dt − yt ) σ 0 (zty )T (23)
where
h iT
σ 0 (zty ) = σ 0 (zty 1 ), σ 0 (zty 2 ), . . . , σ 0 (zty ny ) . (24)
The error that is propagated to the RNN state is denoted by δt ∈ R1×m and is given by
∂
δt , Jt (θ)
∂zt
∂ ∂z y
= y Jt (θ) t
∂zt ∂zt
∂
= δty (Wy ht + by )
∂zt
= δty Wy diag (σ 0 (zt )) (25)
Next, we can calculate the error propagated backwards in time inside the network by
δt−1 ∈ R1×nL and δt ∈ R1×n respectively. The time propagated error is given by
∂
δt−1 , Jt (θ) (26)
∂zt−1
Sequential Machine Learning-6
∂zt
= δt (27)
∂zt−1
∂
= δt (Wx xt + Wh ht−1 + bx ) (28)
∂zt−1
= δt Wh diag (σ 0 (zt−1 )) , (29)
Now, we calculate the error that is propagated k time steps backwards in time. We will
get that
Let us examine the norm of the error as it propagates backwards in time. Since that σ 0
is bounded, say by M > 0, and assume that Wh has maximal eigen value of λ we can
claim that
We can see that if |λM | > 1.0 the error explodes as we keep propagating the error
back in time, and if |λM | < 1.0 the error vanishes in time. The naive approach is to
set λM = 1 but this does not work in practice since it causes saturated units that drives
the activation to zero. This property of the Elman network is the main motivation for
the LSTM cell, which addresses the vanishing/exploding gradient problem by enforcing
constant error propagation in time.
1) Motivation - Constant Error Flow: For simplicity, let us examine the naive elman
RNN with a single unit. In that case, the propagated error from time t to time t − 1 is
given by
∂zt
δt−1 = δt (33)
∂zt−1
∂
= δt (wx xt + wh ht−1 + b) (34)
∂zt−1
= δt wh f 0 (zt−1 ) . (35)
Sequential Machine Learning-7
In order to achieve constant error flow we demand that the activation function will satisfy
wh f 0 (zt−1 ) = 1. (36)
z
By solving the differential equation we get that f (z) = whl , i.e the activation function
must be linear. So by setting f to be the identity mapping and wh = 1 we can ensure a
constant error flow, namely Constant Error Carousel (CEC). However a unit is connected
to other units except itself that introduce two conflicts.
1) Input weight conflict: Excitation of input units enter the state by a linear projection
and can contaminate the state if the input is not correlated with the targets.
2) Output weight conflict: State units are used to predict the target by a linear
transformation. Nevertheless, at different time steps different units are correlated
with the target. Hence, a linear transformation of the states essentially uses
uncorrelated units to predict the target.
These conflicts, and the conclusion that linear activation can enforce constant error flow
motivates us for the architecture of the LSTM cell.
2) The Original Model: The architecture of the LSTM is depicted in Figure 2, an its
equations are given by
c t = at it + ct−1 (40)
These equations were modified to the last common LSTM architecture by adding a
”forget” gate that can forget certain cell units when propagating the error through time.
3) The Current Model: The LSTM equations are given by
c t = at i t + ft ct−1 (46)
This structure enables constant gradient flow through time by removing the non-linear
activation from the memory update as depicted in figure 3, namely ct tre of the LSTM
cell enab
Machine Learning
I. I NTRODUCTION
In this lecture we introduce an estimation method for the Mutual Information between
two random variables using the power of neural networks. First, we recall the required
definitions from information theory, and expand on their properties. Then, we introduce
a new and a very useful way of representing information measures, which is called
the variational formulation. Using the variational formulation we will be able to apply
maximization methods that we have previously applied in Machine Learning algorithms
and, hence, to develop an iterative algorithm that will output an estimation of Mutual
Information. Lastly, we will consolidate our understanding of this methodology and view
it from the the standpoint of Hypothesis Testing.
Remark 1 (Non-negativity of the Kullback Liebler Divergence) For any two distri-
butions P, Q the Kullback Liebler Divergence is non-negative. i.e.
Definition 2 (Mutual Information) Let X and Y be two random variables with a joint
distribution P (x, y). The Mutual Information I(X; Y ) is defined as
P (X, Y )
I(X; Y ) , EPXY log . (3)
P (X)P (Y )
Remark 2 (Defining Mutual Information using Kullback Liebler Divergence)
Mutual Information can be easily defined using the Kullback Liebler Divergence as
follows:
I(X; Y ) = DKL (PXY ||PX PY ). (4)
Remark 3 From the previous remark it is easy to show that I(X; Y ) = 0 if and only if
X, Y are statistically independent.
In this section we introduce a new and a very useful way of representing the
Kullback Liebler Divergence ,which is called the variational formulation. The variation
formulation is a way of representing some measures as a supremum or infimum over a
set of functions. A general form of variational formulation is as follows:
This representation has some distinct advantages. First, it provides upper / lower
bounds for the represented measure which could not be obtained beforehand. Second, in
many cases the variational formulation might be easier to compute analytically. Third,
the use of optimization methods in order to achieve certain approximations might be
a handy solution. The following theorem introduces a variational formulation for the
Kullback Liebler Divergence, and perform as a key ingredient of neural estimation of
Mutual Information.
functions. The Kullback Liebler Divergence admits the following dual representation:
Proof:
The proof consists of two parts which we will formulate and prove in the following
two lemmas:
Proof:
P (x)
Let us choose T ∗ (x) = log Q(x) . Note that the following series of equalities hold:
h i
(a) P (X) P (X)
∗
EP [T (X)] − log(EQ [e T ∗ (X)
]) = EP log − log EQ elog Q(X) (8)
Q(X)
(b) P (X)
= DKL (P ||Q) − log EQ (9)
Q(X)
Z
(c) P (x)
= DKL (P ||Q) − log( Q(x) dx) (10)
X Q(x)
Z
= DKL (P ||Q) − log( P (x)dx) (11)
X
(d)
= DKL (P ||Q) − log(1) (12)
where
P (x)
(a) Follows from the specific choice of T ∗ (x) = log Q(x) .
(b) Follows from the definition of the Kullback Liebler Divergence.
(c) Follows from the definition of the expectation of continuous random variable.
(d) Integration of any probability density is always 1.
1-4
Lemma 2 (Lower bound for the Kullback Liebler Divergence) For any function T :
X → R the following inequality holds:
Proof:
Let us define a new probability density function by:
Q(x)eT (x)
G(x) , . (15)
EQ [eT (X) ]
Note that G(x) ≥ 0 and forms a probability density function since
Z Z
Q(x)eT (x) EQ [eT (X) ]
G(x)dx = dx = = 1. (16)
X X EQ [eT (X) ] EQ [eT (X) ]
where
(a) Follows from the definition of Divergence and linearity of expectation.
(b) Follows from the fact that log(EQ [eT (X) ]) is a deterministic constant.
Q(x)eT (x)
(c) Follows from the definition of G(x) = EQ [eT (X) ]
.
(d) Follows from the non-negativity of Kullback Liebler Divergence (see Definition 1 in
section II).
1-5
Hence,
DKL (P ||Q) = sup EP [T (X)] − log(EQ [eT (X) ]). (19)
T :X →R
Problem definition:
Let X ∼ PX and Y ∼ PY be two random variables with alphabets X , Y, respectively,
and let (Xi , Yi )ni=1 ∼ PXY be i.i.d samples. Our goal is to estimate the Mutual
Information I(X; Y ) from the n given samples.
In order to overcome the first challenge we assumed a set of i.i.d samples which are
drawn according to PXY is given. Under the assumption that the number of samples
n is large enough, we can use the Law of Large Numbers to obtain the following
approximation:
n
1X
EPXY [T (X, Y )] ≈ T (Xi , Yi ). (21)
n i=1
Note that an evaluation of log(EPX PY [eT (X,Y ) ]) is still required. Now we are facing
a new challenge since the given samples (Xi , Yi ) are drawn according to PXY and not
according to PX PY . Therefore, direct use of the Law of Large Number is not correct.
This can still be overcome by artificially constructing tuples of the form (Xi , Yei ), where
Yei is taken from the randomly shuffled set of all samples (Yi )ni=1 . From here we obtain
that Xi and Yei are statistically independent, which means that (Xi , Yei )n are i.i.d samples
i=1
" n
#
T (X,Y )
1 X T (Xi ,Yei )
log EPX PY [e ] ≈ log e . (22)
n i=1
Finally, using equations (21) and (22) we shall define our Mutual Information estimator
as:
n
" n #
1 X 1 X e
ˆ
I(X; Y) , sup T (Xi , Yi ) − log eT (Xi ,Yi ) . (23)
T :X ×Y→R n n
i=1 i=1
function Tθ (X, Y ) as the output of the neural network The network’s cost function is
defined as follows:
n n
!
1X 1 X Tθ (Xi ,Yei )
Iˆθ (X; Y ) = Tθ (Xi , Yi ) − log e . (24)
n i=1 n i=1
It is our goal to maximize this function. The back propagation algorithm makes it
possible for us to compute partial derivatives with respect to the network’s parameters θ,
and then use the gradient ascent algorithm in order to move step by step towards local
maxima of Iˆθ (X; Y ). Since, practically speaking, the number of given samples is large,
we may use mini-batch gradient ascent in order to train the network to reach a local
maximum of Iˆθ (X; Y ) by adjusting the network’s parameters θ. The algorithm steps are
as follows:
2: repeat
3: Draw mini-batch of samples: (X1 , Y1 ), (X2 , Y2 ), ..., (Xm , Ym ) ∼ PXY .
4: Draw m samples from the marginal distribution: Ye1 , Ye2 , ..., Yf m ∼ PY .
P P Tθ (Xi ,Yei )
5: Evaluate: Iˆθ (X; Y ) ← m1 m 1
i=1 Tθ (Xi , Yi ) − log( m
m
i=1 e ).
6: Update network parameters: θ ← θ + ∇θ Iˆθ (X; Y ).
7: until convergence
1-8
V. H YPOTHESIS T ESTING
A. Introduction
and since logarithms are monotonically increasing functions, we can take the logarithm
of both sides of the inequality written above and receive:
n
!
n
P (x ) (a) Y P (xi )
log n
= log
Q(x ) i=1
Q(xi )
n
(b) X P (xi )
= log
i=1
Q(xi )
> log(T ),
where (a) follows from the fact that the samples are i.i.d and (b) follows from the fact
that the logarithm of a product is the sum of the individual logarithms.
P (xi )
But log Q(x i)
is the output of the neural network that we found through the
optimization problem presented in the previous sections. Hence, when using the neural
network to approximate the Kullback Liebler Divergence, we can simultaneously sum
over the network’s outputs, as a byproduct, find from which distribution our samples
were generated!
1-9
B. Hypothesis Testing
We will now discuss how we can identify from which distribution our samples were
generated and why the inequality given in the introduction provides us with the optimal
decision region.
A set of samples (x1 , x2 , ..., xn ) are given and our goal is to decide from which of two
given distributions, P1 (x) or P2 (x), the samples were generated. In order to do this, we
first define two hypotheses:
H1 : X ∼ P1 ,
H2 : X ∼ P2 .
α = P (H2 |X is from P1 ),
β = P (H1 |X is from P2 ).
α is the resultant error when we decided that X was generated from P2 (X) when it was
actually generated from P1 (X), and β is the resultant error when we decided that X was
generated from P1 (X) when it was actually generated from P2 (X).
We will now also define a function of the samples, G : xn → 1, 2, to be equal to
1 when our hypothesis for the samples is H1 , and to be equal to 2 when when our
hypothesis for the samples is H2 . We can now re-write α and β with the help of G(xn ):
Remark 4 From this representation of α and β, we can see that there is a trade-off. As
α decreases in value, β increases and vice-versa. For example, if α were to be small,
G(xn ) would have to be equal to 1 most of the time, thus causing β to increase.
Now that we have defined our errors, we must determine the decision region in order to
1-10
Proof:
Let us first define two indicator function, φA (X) and φB (X), which will get the value 0
or 1 according to which of the two decision regions X belongs to. The explicit definition
of these functions is given by:
1,
1,
X∈A X∈B
φA (X) = , φB (X) =
0,
0,
X∈
/A X∈
/B
Let us consider
(φA (x) − φB (x)) ∗ (P1 (x) − T ∗ P2 (x)) ≥ 0.
where
(a) Is obtained by opening the parentheses of the product.
(b) Is obtained by splitting the sum into two by summing over A and B and by
remembering that φA (x) and φB (x) are indicators for each of these two groups.
(c) Follows from the linearity of the sums.
(d) Is obtained as follows:
Let us first consider
P
x∈A P1 (x).
Summing over the probability P1 (X) when X ∈ A is the same as taking one minus the
/ A. But this sum is the exact definition of α∗
sum of the probability P1 (X) when X ∈
that was given earlier since we are taking the probability of X with regards to P1 (X)
when it was really generated by P2 (X). Therefore, we can write the sum as follows:
P P
x∈A P1 (x) = 1 − x∈A
/ P1 (x) = 1 − α∗ .
This time, we are summing over the probability P2 (X) when X was really generated
from P1 (X). But this is the exact definition of β ∗ which was given previously. Therefore,
P
x∈A T ∗ P2 (x) = T ∗ β ∗ .
As for the sum over B, in the exact manner explained above, we can show that
P
x∈B (P1 (x) − T ∗ P2 (x)) = 1 − α − T ∗ β.
Our equality is now obtained by substituting the sums for the equations derived above.
(e) Is obtained by tidying up the equation.
(f ) Follows from the fact that the equation that we were originally summing,
(α − α∗ ) + T ∗ (β − β ∗ ) ≥ 0.
If we look at this phrase, it is easy to see that if α < α∗ , then β must be greater than
β ∗ in order for the inequality to hold. This concludes our proof.
Remark 5 As a result of this lemma, we can conclude that the decision region defined
by An is the optimal decision region for deciding from which probability distribution
our samples were generated. Let us plot an example of the error defined by the optimal
decision region An on a two-dimensional grid with axes α and β representing our two
forms of error.
1-13
β
Error of Decision Region An
β∗
Forbidden
α
∗
α
Fig. 1. The error given by An with an illegal region drawn beneath it.
If we were to pick any point on the plot of the error given by An and draw two
perpendicular lines to each of the axes, as is shown in the plot, then any other decision
region’s error would have to fall outside this region, which can be seen in the plot as
the area enclosed by the square. This is due to the fact that the Neyman-Pearson lemma
tells us that if the error in one axis is smaller than the error of the optimal region in that
same axis (e.g. α < α∗ ), then the error in the other axis must be larger (in this case,
β > β ∗ ). Since the point on the line is chosen arbitrarily, we can conclude that at no
point beneath the plot of the error defined by our optimal decision region can there be
an error from another decision region. Therefore, if we were to plot the error given by
any other decision region, the plot must be above the plot of the error defined by An ,
thus showing us that this is indeed the optimal decision region.
1-14
R EFERENCES
[1] M. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R.D Hjelm. Mine: mutual information
neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[2] T. M. Cover and J. A. Thomas. Elements of information theory, chapter 11, pages 375–379. John Wiley & Sons,
second edition, 2012.
10-1
Machine Learning
Lecture 10
Lecturer:Haim Permuter Scribe: Omer Luxembourg
I. I NTRODUCTION
In this lecture we introduce the f-Divergence definition which generalizes the Kullback-
Leibler Divergence, and the data processing inequality theorem. Parts of this lecture are
guided by the work of T. Cover’s book [1], Y. Polyanskiy’s lecture notes [3] and Z.
Goldfeld’s lecture 6 about f-Divergences [2]. This lecture assumes the student is familiar
with basic probability theory. The notations here are similar to those of the previous
lectures.
II. f-Divergence
There are two main properties for Divergence, which were proved in previous lectures.
a. DKL (PX ||QX ) ≥ 0, and equality hold if and only if P = Q.
b. DKL (PX ||QX ) is convex in (PX , QX ).
10-2
where (a) follows from the definition of f. Note that f (1) = 0 and f is convex for
all t ≥ 0. (f 00 (t) = 1t ).
b. Negative Log: f (x) = − log(x),
P (x)
Df (PX ||QX ) , EQ f (6)
Q(x)
(a) X P (x)
= −Q(x) log
x∈X
Q(x)
, D(QX ||PX ),
P (x)
= EQ fT V
Q(x)
X 1 P (x)
= Q(x) · −1
x∈X
2 Q(x)
1X
= |P (x) − Q(x)| .
2 x
Note that f (1) = 0 and f is convex for all t ≥ 0. In addition DT V (P, Q) =
DT V (Q, P ) means that the total variation is a metric on the space of probability
distributions. That is because it is a divergence function and a symmetric function
of P and Q .
2x 2
d. Jensen-Shannon divergence (symmetrized KL): f (x) = x log x+1 + log x+1 ,
(c)
= 0,
where (a) is from Jensen’s inequality for a convex function f, (b) is due to the fact
P (x)
that Q(x)
is fixed ∀x because P = Q, (c) is from the definition of f. Note that if f
is not strictly convex around 1, the equality can hold from Jensen’s inequality and
not from P = Q.
• Joint convexity: (P, Q) 7−→ Df (P ||Q) is a jointly convex function. Consequently,
P 7−→ Df (P ||Q) for fixed Q and Q 7−→ Df (P ||Q) are also convex functions.
Proof: From the Perspective Transform Preserve Convexity lemma we learned that
if f (x) is convex ⇒ t · f xt is convex in (x, t).
X P (x)
Df (P ||Q) = Q(x)f , (10)
x
Q(x)
Let PY be the output of the system PY |X for input PX , and QY be the output of the
system QY |X for input PX , see figure 1.
PY |X PY
PX
QY |X QY
Then
Df (PY ||QY ) ≤ Df PY |X ||QY |X |PX . (12)
One can view PY and QY as the output distributions after passing PX through the channel
transition matrices PY |X and QY |X , respectively. The above relation tells us that the
average f-Divergence between the corresponding channel transition rows is at least the
f-Divergence between the output distributions.
Proof:
X X P (Y |X)
Df (PY |X ||QY |X |PX ) , PX Q(Y |X)f (13)
x y
Q(Y |X)
(a) X
= PX Df (P (Y |X = x)||Q(Y |X = x))
x
! !!
(b) X X
≥ Df PX P (Y |X = x) || PX Q(Y |X = x)
x x
(c)
= Df (EPX [P (Y |X)] ||EPX [Q(Y |X)])
(d)
= Df (P (Y )||Q(Y )) ,
where (a) follows from the definition of f-Divergence, (b) follows from Jensen’s
inequality, because Df is convex in P, Q, (c) is the definition of expectation, and (d)
follows from the Law of Total Expectation.
Remark 1 (equality for Df (PY |X ||QY |X |PX )): We can notice the following equality
holds:
" #
PY,X
Df (PY,X ||Q̃Y,X ) , EQ̃Y,X f (14)
Q̃Y,X
X P (y, x)
= Q̃(y, x)f
y,x
Q̃(y, x)
X X P (y, x)
= P (x) Q(y|x)f
x y
Q(y, x)
(a) X X P (y|x)P (x)
= P (x) Q(y|x)f
x y
Q(y|x)P (x)
10-6
(b) X X P (y|x)
= P (x) Q(y|x)f
x y
Q(y|x)
= Df (PY |X ||QY |X |PX ),
where (a) follows from the definition of conditional probability, and Q̃( y, x) ,
P (x)Q(y|x), and (b) is from the definition of divergence.
P (x) P (y)
W (y|x)
Q(x) Q(y)
The intuition behind the following inequality is that processing the observation x by a
channel WY |X makes it more difficult to determine whether it came from PX or QX . In
neural networks, for instance, the divergence of the system output will decrease as we
move to the next layer.
(a) X X P (x, y)
= Q(y) Q(x|y)f
y x
Q(x, y)
!
(b) X X P (x, y)
≥ Q(y)f Q(x|y)
y x
Q(x, y)
!
X X P (x, y)
= Q(y)f Q(x|y)
y x
Q(y)Q(x|y)
(c) X P (y)
= Q(y)f
y
Q(y)
= Df (PY ||QY ),
where (a) follows from conditioning, (b) is Jensen’s inequality for convex f in P, Q, and
(c) is from Law of Total Probability. Note that PX,Y = PX WY |X and QX,Y = QX WY |X .
10-8
R EFERENCES
[1] T. M. Cover and J. A. Thomas. Elements of Information Theory, Chap. 1. ISBN, 1991.
[2] Z. Goldfeld. Lecture 6: f-divergences.
Available at http://people.ece.cornell.edu/zivg/ECE 5630 Lectures6.pdf, 2020.
[3] Y. Polyanskiy. Lecture notes on information theory, chap. 6.
Available at http://people.lids.mit.edu/yp/homepage/data/itlectures v5.pdf, 2017.
1-1
Machine Learning
I. I NTRODUCTION [2]
The Bayes theorem tells us that the computation of the posterior requires three terms:
a prior, a likelihood and an evidence. The first two can be expressed easily as they are
part of the assumed model. However, the third term, requires to be computed such that
Z
P (x) = P (x|θ)P (θ). (1)
θ
Although in low dimension this integral can be computed without too much difficulties,
it can become intractable in higher dimensions.
II. N OTATION
A Bayesian mixture of Gaussians is a model that assumes that the data is distributed
as a mixture of k Gaussians with the following parameters [3]:
• the expectation is also a random variable, normally distributed - µi ∼ N (0, σ 2 ), i =
1, 2...k, for some known σ.
• given the expectation µi , the standard deviation of the Gaussian is 1 - Gi |µi ∼
N (µi , 1).
1-3
• P (ci ) - the probability that xi belongs to some Gaussian, i.e, the assignment of each
i’th observation, has a uniform distribution, ci ∼ U nif orm(k).
We encode ci into ’one hot’ k sized vector. We then draw that xi |ci , µk ∼ N (cTi µ, 1).
define z m = (µk , cn ), m = k + n, an m sized vector of hidden variables.
For a sample of size n, the joint density of latent and observed variables is,
P (xn , z m ) =P (xn , cn , µk )
n
Y (3)
=P (µk ) P (ci )P (xi |ci , µk ).
i=1
D(q(z m )||P (z m |xn )) = Eq(zm ) [log q(Z m )] − Eq(zm ) [log P (Z m |xn )]. (6)
D(q(z m )||P (z m |xn )) = Eq [log q(Z m )] − Eq [log P (Z m , xn )] + Eq [log P (xn )]. (7)
Expand the conditional knowing that p(xn ) is not a function of the random variable Z m
and therefor Eq [log P (xn )] = log P (xn ),
log P (xn ) = D(q(z m )||P (z m |xn )) + Eq [log P (Z m , xn )] − Eq [log q(Z m )]. (8)
1-4
Denoting the term E_q[log P(Z^m, x^n)] - E_q[log q(Z^m)] as the ELBO and using the non-negativity of D, we get
$$\log P(x^n) \ge \mathrm{ELBO}(q), \qquad (9)$$
and thereby its name - the Evidence (P(x^n)) Lower Bound. Using eq. (9) and the fact that, w.r.t. q(z^m), log P(x^n) = const, we can write D as
$$D(q(z^m)\|P(z^m|x^n)) = \log P(x^n) - \mathrm{ELBO}(q).$$
So, by maximizing the ELBO we actually minimize D(q(z^m)||P(z^m|x^n)), and therefore we may solve the optimization problem
$$q^*(z^m) = \operatorname*{argmax}_{q(z^m)} \mathrm{ELBO}(q).$$
The ELBO can also be decomposed as
$$\mathrm{ELBO}(q) = \sum_{z^m} q(z^m)\log P(x^n|z^m) - D(q(z^m)\|P(z^m)). \qquad (13)$$
Looking at eq. (13) we can see that there is a trade-off between minimizing D and maximizing the sum. For minimizing D, we would like q(z^m) to be as close as possible to P(z^m), while for maximizing the sum, a q(z^m) that gives more weight to values of z^m that make the term log P(x^n|z^m) larger, i.e., z^m that contain more information about x^n, will give better results. As we can see, the more samples there are, the more significant the term $\sum_{z^m} q(z^m)\log P(x^n|z^m)$ becomes relative to the divergence. The identity in eq. (8) can also be checked numerically, as in the short sketch below.
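As a sanity check (not part of the lecture; all numbers are arbitrary choices), here is a tiny discrete example verifying the identity log P(x^n) = ELBO(q) + D(q(z^m)||P(z^m|x^n)) for a single observation and a latent variable with three states:

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])          # prior P(z) over 3 latent states
p_x_given_z = np.array([0.9, 0.4, 0.1])  # likelihood P(x | z) of one fixed observation x
q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational distribution q(z)

p_xz = p_z * p_x_given_z                 # joint P(x, z)
p_x = p_xz.sum()                         # evidence P(x)
p_z_given_x = p_xz / p_x                 # posterior P(z | x)

elbo = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))
kl = np.sum(q * np.log(q / p_z_given_x))
print(np.log(p_x), elbo + kl)            # both values coincide
```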
We will use a method called coordinate ascent, a maximization method for multi-variable functions. In this method we fix all the variables except one, maximize the function as an ordinary one-variable function, then fix all the variables except the next one, and repeat. If the function is concave in all of its variables, the method reaches the global maximum; otherwise, it reaches a local one [4], [5]. A minimal sketch of the idea is given below.
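For intuition only, the following sketch applies coordinate ascent to a toy concave quadratic (the function and its closed-form coordinate updates are my own illustrative choices, not from the lecture):

```python
def coordinate_ascent(n_iter=50):
    # Toy example: maximize the concave function
    # f(a, b) = -(a^2 + b^2 + a*b) + 3a + 2b
    # by alternating exact one-variable maximizations.
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        a = (3.0 - b) / 2.0   # argmax over a with b fixed (set df/da = 0)
        b = (2.0 - a) / 2.0   # argmax over b with a fixed (set df/db = 0)
    return a, b

print(coordinate_ascent())    # converges to the global maximizer (4/3, 1/3)
```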
while the second transition is because P(x^n) is not random with respect to q(z^m), and the third one is by using eq. (14). Applying coordinate ascent and fixing q(z_{-i}) (all the factors q(z_j) except the i'th coordinate), we get
$$\operatorname*{argmax}_{q(z_i)} \mathrm{ELBO} = \operatorname*{argmax}_{q(z_i)}\; \mathbb{E}_{q(z_i)}\!\big[\mathbb{E}_{q(z_{-i})}[\log P(Z_i, Z_{-i}\,|\,x^n)]\big] - \mathbb{E}_{q(z_i)}[\log q(Z_i)] + \mathrm{const}. \qquad (16)$$
Differentiating and equating to zero,
$$\frac{\partial\, \mathrm{ELBO}}{\partial q(z_i)} = \mathbb{E}_{q(z_{-i})}[\log P(z_i\,|\,Z_{-i}, x^n)] - \log q(z_i) - 1 = 0, \qquad (17)$$
which yields
$$q^*(z_i) \propto \exp\!\big(\mathbb{E}_{q(z_{-i})}[\log P(z_i\,|\,Z_{-i}, x^n)]\big). \qquad (18)$$
Therefore, the coordinate ascent variational inference (CAVI) algorithm may be written as:
Algorithm 2: CAVI
Input: model P(x^n, z^m), data x^n.
Output: variational density q(z^m) = ∏_{i=1}^{m} q(z_i) and ELBO (evidence lower bound of P(x^n)).
Initialization: initialize the factors q(z_i), i = 1, 2, ..., m.
while the ELBO has not converged do
    for i = 1, 2, ..., m do
        Set q(z_i) ∝ exp(E_{q(z_{-i})}[log P(z_i | Z_{-i}, x^n)])
    end
    Compute ELBO = E_q[log P(Z^m, x^n)] − E_q[log q(Z^m)]
end
Return ∏_{i=1}^{m} q(z_i), ELBO
As we saw before, we need to find P(z^m|x^n); this term is hard to compute, so we will approximate it with q(z^m). In order to do so we will use the CAVI algorithm, adapted to our example. We now assume that the variational approximation for the mixture of Gaussians is defined by the parameters ϕ, m^k, s^k as follows:
• the expectation is normally distributed - µ_i ∼ N(m_i, s_i²), i = 1, 2, ..., k.
• P(c_i) has a categorical distribution, c_i ∼ Categorical(ϕ_i), where ϕ_i is a k-sized vector of non-negative numbers that sums to 1. Therefore, ϕ is an n × k matrix whose i'th row is the k-sized vector ϕ_i.
That said, we initialize the algorithm like the model presented in section IV: µ_i ∼ N(0, σ²), i = 1, 2, ..., k, for the expectation of each Gaussian (σ² is a known hyper-parameter), and ϕ_i ∼ Uniform(k), i = 1, 2, ..., n. In each iteration we update the distribution parameters ϕ, m^k, s^k.
Let us evaluate the ELBO of the mixture assuming mean field family,
$$
\begin{aligned}
\mathrm{ELBO}(\phi, m^k, s^k) = &\sum_{i=1}^{k} \mathbb{E}[\log P(\mu_i);\, m_i, s_i^2]\\
&+ \sum_{j=1}^{n} \Big(\mathbb{E}[\log P(c_j);\, \phi_j] + \mathbb{E}[\log P(x_j\,|\,c_j, \mu^k);\, \phi_j, m^k, (s^2)^k]\Big) \qquad (21)\\
&- \sum_{j=1}^{n} \mathbb{E}[\log q(c_j; \phi_j)] - \sum_{i=1}^{k} \mathbb{E}[\log q(\mu_i; m_i, s_i^2)].
\end{aligned}
$$
Expanding equation (21) using equation (19), we derive the update for the assignment probabilities, ϕ_{ji} ∝ exp(E[µ_i; m_i, s_i²] x_j − E[µ_i²; m_i, s_i²]/2). Continuing to develop these equations, we eventually get that the update for q(µ_i) is
$$m_i = \frac{\sum_{j=1}^{n}\phi_{ji}\, x_j}{1/\sigma^2 + \sum_{j=1}^{n}\phi_{ji}}, \qquad s_i^2 = \frac{1}{1/\sigma^2 + \sum_{j=1}^{n}\phi_{ji}}. \qquad (24)$$
Therefore, we can write the algorithm as follows (a minimal Python sketch is given after the algorithm):
Algorithm 3: CAVI for the mixture of Gaussians model
Input: Data x^n, number of components K, prior variance of the component means σ².
Output: Variational densities q(µ_i; m_i, s_i²) (Gaussian) and q(c_i; ϕ_i) (K-categorical).
Initialization: as described at the beginning of this section.
while the ELBO has not converged do
    for j = 1, 2, ..., n do
        Set ϕ_{ji} ∝ exp(E[µ_i; m_i, s_i²] x_j − E[µ_i²; m_i, s_i²]/2)
    end
    for i = 1, 2, ..., k do
        Set m_i = (Σ_{j=1}^{n} ϕ_{ji} x_j) / (1/σ² + Σ_{j=1}^{n} ϕ_{ji})
        Set s_i² = 1 / (1/σ² + Σ_{j=1}^{n} ϕ_{ji})
    end
    Compute ELBO(ϕ, m^k, (s²)^k)
end
Return q(µ^k; m^k, (s²)^k), q(c^n; ϕ)
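As a concrete illustration, here is a minimal NumPy sketch of Algorithm 3 (the function name, the fixed iteration count used instead of an ELBO-convergence test, and the synthetic data at the end are my own choices):

```python
import numpy as np

def cavi_gmm(x, K, sigma2, n_iter=100, seed=0):
    """Minimal CAVI sketch for the Bayesian mixture of Gaussians above.
    x: 1-D data array of length n; K: number of components;
    sigma2: prior variance of the component means (the hyper-parameter sigma^2)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize the variational parameters: means m, variances s2, assignments phi.
    m = rng.normal(0.0, np.sqrt(sigma2), size=K)
    s2 = np.ones(K)
    phi = rng.dirichlet(np.ones(K), size=n)            # n x K, each row sums to 1

    for _ in range(n_iter):                            # fixed iterations instead of an ELBO test
        # Update q(c_j): phi_{ji} proportional to exp(E[mu_i] x_j - E[mu_i^2]/2),
        # with E[mu_i] = m_i and E[mu_i^2] = m_i^2 + s_i^2.
        log_phi = np.outer(x, m) - 0.5 * (m**2 + s2)
        log_phi -= log_phi.max(axis=1, keepdims=True)  # for numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        # Update q(mu_i) according to eq. (24).
        denom = 1.0 / sigma2 + phi.sum(axis=0)
        m = (phi * x[:, None]).sum(axis=0) / denom
        s2 = 1.0 / denom
    return m, s2, phi

# Example on synthetic data drawn from three well-separated components.
x = np.concatenate([np.random.normal(mu, 1.0, 200) for mu in (-5.0, 0.0, 5.0)])
m, s2, phi = cavi_gmm(x, K=3, sigma2=10.0)
print("estimated component means:", np.sort(m))
```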
REFERENCES
[1] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112:859-877, 2017.
[2] https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29
[3] https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
[4] I. Naiss and H. H. Permuter, ”Alternating maximization procedure for finding the global maximum of directed
information,” 2010 IEEE 26-th Convention of Electrical and Electronics Engineers in Israel, 2010, pp. 000545-
000549, doi: 10.1109/EEEI.2010.5662161.
[5] I. Naiss and H. H. Permuter, ”Extension of the Blahut–Arimoto Algorithm for Maximizing Directed In-
formation,” in IEEE Transactions on Information Theory, vol. 59, no. 1, pp. 204-222, Jan. 2013, doi:
10.1109/TIT.2012.2214202.
Lecture 12
Lecturer: Haim Permuter Scribe: Tom Galili
I. VARIATIONAL INFERENCE
$$P(z^m|x^n) = \frac{P(z^m, x^n)}{P(x^n)} = \frac{P(z^m)\,P(x^n|z^m)}{\int P(z^m)\,P(x^n|z^m)\, dz^m} \qquad (1)$$
• z^m - the latent (hidden) variables
• x^n - the observations (evidence)
Notice that the estimation of P(z^m|x^n) is not trivial; therefore we simplify the term:
$$
\begin{aligned}
\operatorname*{argmin}_{q(z^m)\in\mathcal{Q}} D(q(z^m)\|P(z^m|x^n)) &\stackrel{(a)}{=} \operatorname*{argmin}_{q(z^m)\in\mathcal{Q}}\; \mathbb{E}_{q(z^m)}[\log q(Z^m)] - \mathbb{E}_{q}\!\left[\log \frac{P(Z^m, x^n)}{P(x^n)}\right]\\
&\stackrel{(b)}{=} \operatorname*{argmin}_{q(z^m)\in\mathcal{Q}}\; \mathbb{E}_{q}[\log q(Z^m)] - \mathbb{E}_{q}[\log P(Z^m, x^n)] + \log P(x^n)\\
&\stackrel{(c)}{=} \operatorname*{argmin}_{q(z^m)\in\mathcal{Q}}\; \big(-\mathrm{ELBO} + \log P(x^n)\big), \qquad (2)
\end{aligned}
$$
where
(a) follows from the definition of divergence and from conditional probability, P(Z^m|x^n) = P(Z^m, x^n)/P(x^n).
(b) follows from the logarithm rules.
(c) follows from the definition of the evidence lower bound (ELBO) as defined in the previous lectures.
Note that P(z^m) is the prior probability and q(z^m) ≈ P(z^m|x^n) is the posterior probability (which we want to estimate) of the latent space given the evidence. A first interpretation: by finding the maximum, we want q(z^m) to be as close as possible to the prior, and on the other hand q(z^m) will be greater for values of z^m that give more information about x^n.
Autoencoders are unsupervised learning models. The general idea of autoencoders consists of setting an encoder and a decoder as neural networks and learning the best encoding-decoding scheme using an iterative optimization process [1].
In this way, the architecture creates an information bottleneck for the data, ensuring that only the main structured part of the information, from which the input can be restored reasonably well, passes through and is reconstructed. Therefore, we would like to use dimension reduction (feature reduction). In many cases, the data we want to analyze has a high dimension, which means that each sample has a large number of features. For the most
part, not all characteristics are equally significant. Because it is difficult to analyze data
from a high dimension and build models for such data, in many cases we will try to
reduce the dimension of the data with as little information loss as possible. As illustrated
in Fig. 1, after the encoder part of the neural network we get z = e(x) which is the
latent vector of the input, characterized by a lower dimension than the data, represented
by the important features from which it can be reconstructed in the decoder.
The AE model objective is to make the recovery error between the input data and the reconstructed output data as small as possible,
$$(e^*, d^*) = \operatorname*{argmin}_{e,\,d}\; \|x - d(e(x))\|^2.$$
If the equality x = d(e(x)) holds, then no information was lost in the encoder-decoder process. On the other hand, if x ≠ d(e(x)), then some information is lost due to the dimension reduction, and the complete reconstruction of the encoded information is not possible in the decoder. A minimal code sketch of such an autoencoder is given below.
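For concreteness, here is a minimal PyTorch sketch of an autoencoder trained on the reconstruction error above (the class name, layer sizes, and the random dummy batch are illustrative assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Sketch of an autoencoder: e(x) compresses the input to a low-dimensional
    latent vector z, and d(z) tries to reconstruct x from it."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # z = e(x), the latent vector
        return self.decoder(z)     # d(e(x)), the reconstruction

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()             # the recovery error ||x - d(e(x))||^2

x = torch.rand(64, 784)            # dummy batch standing in for real data
for _ in range(10):                # a few gradient steps on the reconstruction loss
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    optimizer.step()
```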
Unlike the AE, which takes data and performs dimension reduction, the VAE [3] imposes a prior distribution on the latent space z, for example a Gaussian distribution z ∼ N(0, I), where I is the identity covariance matrix. The encoder network is trained to receive data x and output the parameters µ(x), σ(x) of z (z|x ∼ N(µ_x, σ_x)), in order to minimize as much as possible the distance between the encoder's distribution N(µ(x), σ(x)) and the posterior P(z|x). We then sample vectors from z|x (given by the parameters calculated by the encoder) and pass them through the decoder to produce the parameters of P(x|z) [1], [2].
It is important to mention that, in comparison to the AE decoder, which is used for the training process only, the VAE decoder is as important as the encoder, since it is used to generate new data at inference time; this is what makes the whole Variational Auto Encoder a generative model.
Let us derive the loss function of the VAE. First, define the link between the encoder and the decoder: the prior is P(z) ∼ N(0, I) and the encoder produces the approximate posterior q_x(z) = N(g(x), h(x)). Then
$$(g^*, h^*) = \operatorname*{argmin}_{g,h} D\big(q_x(z)\,\|\,P(z|x)\big) \stackrel{(a)}{=} \operatorname*{argmin}_{g,h}\; \big[\mathbb{E}_{z\sim q_x(z)}[\log q_x(z)] - \mathbb{E}_{q}[\log P(z)] - \mathbb{E}_{q}[\log P(x|z)]\big],$$
where
(a) follows from the definition of divergence and Bayes' theorem, together with the fact that P(x) does not depend on q, so it is a constant that does not affect the minimization.
We know that
$$P(x|z) = \frac{1}{\sqrt{2\pi c}}\, e^{-\frac{(x - f(z))^2}{2c}}, \qquad P(z) \sim \mathcal{N}(0, I). \qquad (10)$$
Thus,
$$(g^*, h^*) = \operatorname*{argmin}_{g,h}\; \mathbb{E}_{z\sim q_x(z)}\!\left[\frac{(x - f(z))^2}{2c}\right] + D\big(\mathcal{N}(g(x), h(x))\,\|\,\mathcal{N}(0, I)\big). \qquad (11)$$
$$\max_f\; \mathbb{E}_{z\sim q_x(z)}[\log P(x|z)] = \max_f\; \mathbb{E}_{z\sim q_x(z)}\big[\log \mathcal{N}(x; f(z), cI)\big] = \max_f\; \mathbb{E}_{z\sim q_x(z)}\!\left[-\frac{(x - f(z))^2}{2c}\right]. \qquad (12)$$
Gathering all the pieces together, we are looking for optimal f ∗ , g ∗ and h∗ such that
$$(g^*, h^*, f^*) = \operatorname*{argmin}_{g,h,f}\; \mathbb{E}_{z\sim q_x(z)}\!\left[\frac{(x - f(z))^2}{2c}\right] + D\big(\mathcal{N}(g(x), h(x))\,\|\,\mathcal{N}(0, I)\big). \qquad (13)$$
In Eq. (13) we get two terms: the first one is for the reconstruction of x using the decoder part, and the second term uses the KL divergence to keep the approximate posterior close to the prior probability P(z). The overall architecture is then obtained by concatenating the encoder and the decoder parts, and we can use gradient descent optimization to find the optimal parameters of the VAE encoder and decoder, with the loss function defined as
$$\mathrm{Loss} = \mathbb{E}_{z\sim q_x(z)}\!\left[\frac{(x - f(z))^2}{2c}\right] + D\big(\mathcal{N}(g(x), h(x))\,\|\,\mathcal{N}(0, I)\big). \qquad (14)$$
The second term can also be treated as a regularisation term, given by the KL divergence between two Gaussian distributions, which helps regularize the VAE encoder's approximation of the latent distribution. A minimal code sketch of this loss is given below.
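Here is a minimal PyTorch sketch of the loss in eq. (14); the function name, the log-variance parameterization of the diagonal covariance, and the dummy tensors are my own illustrative choices:

```python
import torch

def vae_loss(x, x_hat, mu, log_var, c=1.0):
    """Sketch of the loss in eq. (14), assuming the encoder outputs mu = g(x) and
    log_var = log h(x) (a diagonal covariance), and x_hat = f(z) is the decoder
    reconstruction; c is the assumed likelihood variance."""
    # Reconstruction term: E[(x - f(z))^2 / (2c)], estimated with one sample of z.
    recon = ((x - x_hat) ** 2).sum(dim=1) / (2.0 * c)
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=1)
    return (recon + kl).mean()

# Dummy tensors standing in for a batch of data and network outputs.
x = torch.rand(8, 784); x_hat = torch.rand(8, 784)
mu = torch.zeros(8, 16); log_var = torch.zeros(8, 16)
print(vae_loss(x, x_hat, mu, log_var))
```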
A. Reparametrization Trick
We note that we still need to be very careful about the way we sample from the
distribution returned by the encoder during the training. The sampling process has to be
expressed in a way that allows the error to be backpropagated through the network to
compute the gradients for the Gradient Descent process as part of the training. Thus, the
reparametrization trick [1] is used as illustrated in Fig. 4 to make the gradient descent
possible despite the random sampling that occurs halfway through the architecture (after
the encoder). Using the fact that z is a random variable following a Gaussian distribution with mean g(x) and covariance h(x), it can be expressed as
$$z = g(x) + h(x)^{1/2}\,\zeta, \qquad \zeta \sim \mathcal{N}(0, I).$$
In this approach the whole process becomes deterministic: ζ is sampled in advance, and all that remains is to propagate the value through the network deterministically, as in the short sketch below.
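A minimal PyTorch sketch of this reparametrization (the function name and the log-variance convention are my own assumptions):

```python
import torch

def reparameterize(mu, log_var):
    """Reparametrization trick sketch: instead of sampling z ~ N(mu, diag(exp(log_var)))
    directly, sample eps ~ N(0, I) and set z = mu + std * eps, so gradients can flow
    back through mu and log_var (the encoder outputs g(x) and h(x))."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)     # the only source of randomness, independent of the network
    return mu + std * eps

mu = torch.zeros(4, 16, requires_grad=True)
log_var = torch.zeros(4, 16, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                  # gradients reach mu and log_var
```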
REFERENCES
Lecture 12
Lecturer: Haim Permuter Scribe: Moshe Bunker
SUMMARY
In just three years, Variational Autoencoders (VAEs), section IV, [3], [1] have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. VAEs are appealing because they are built on top of standard function approximators (neural networks), and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentation,
and predicting the future from static images. This lecture introduces the intuitions behind
VAEs, explains the mathematics behind them, and describes some empirical behavior.
No prior knowledge of variational Bayesian methods is assumed.
INTRODUCTION
Variational Autoencoders belong to the family of generative models [1]. The generator
of VAEs is able to produce meaningful outputs while navigating its continuous latent
space. The possible attributes of the decoder outputs are explored through the latent
vector. VAEs attempt to model the input distribution from a decodable continuous latent
space. Within VAEs, the focus is on the variational inference of latent codes.
Therefore, VAEs provide a suitable framework for both learning and efficient Bayesian
inference with latent variables. For example, VAEs with disentangled representations
enable latent code reuse for transfer learning. In terms of structure, VAE bears a
resemblance to an autoencoder. It is also made up of an encoder (also known as a
recognition or inference model) and a decoder (also known as a generative model). Both
VAEs and autoencoders attempt to reconstruct the input data while learning the latent
Fig. 1. Illustration of an autoencoder (input → latent variable → output); the dimensionality reduction principle can be seen in the diagram.
vector. However, unlike autoencoders, the latent space of VAE is continuous, and the
decoder itself is used as a generative model.
I. AUTO ENCODER
In its simplest form, an autoencoder [3] will learn the representation or code by trying
to copy the input to output. However, using an autoencoder is not as simple as copying the
input to output. Otherwise, the neural network would not be able to uncover the hidden
structure in the input distribution. An autoencoder will encode the input distribution into a
low-dimensional tensor, which usually takes the form of a vector. This will approximate
the hidden structure that is commonly referred to as the latent representation, code, or
vector. This process constitutes the encoding part. The latent vector will then be decoded
by the decoder part to recover the original input.
As a result of the latent vector being a low-dimensional compressed representation
of the input distribution, it should be expected that the output recovered by the decoder
can only approximate the input. The dissimilarity between the input and the output can
be measured by a loss function. But why would we use autoencoders? Simply put, autoencoders have practical applications both in their original form and as part of more complex neural networks. They are a key tool in understanding the advanced topics of deep learning, as they give you a low-dimensional latent vector, and are therefore used for dimension reduction [3].
Dimensionality reduction is the process of reducing the number of features that describe some data. This reduction is done either by selection (only some existing features are conserved) or by extraction (a reduced number of new features are created based on the old features) and can be useful in many situations that require low-dimensional data (data visualisation, data storage, heavy computation, etc.). Although there exist many different methods of dimensionality reduction, we can set a global framework that is matched by most of these methods.
Here, we should however keep two things in mind. First, an important dimensionality reduction with no reconstruction loss often comes with a price: the lack of interpretable and exploitable structures in the latent space (lack of regularity) [2]. Second, most of the time the final purpose of dimensionality reduction is not only to reduce the number of dimensions of the data, but to do so while keeping the major part of the data structure information in the reduced representations. For these two reasons, the dimension of the latent space and the "depth" of autoencoders (which define the degree and quality of compression) have to be carefully controlled and adjusted depending on the final purpose of the dimensionality reduction.
$$z = f(x) \qquad (1)$$
$$p_\theta(x) = \int p_\theta(x, z)\, dz \qquad (2)$$
In other words, considering all of the possible attributes, we end up with the distribution that describes the inputs. The problem is that Equation 2 is intractable. The equation does not have an analytic form or an efficient estimator, and it cannot be differentiated with respect to its parameters. Therefore, optimization by a neural network is not feasible. Using Bayes' theorem, we can find an alternative expression for Equation 2:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad (3)$$
which can, for instance, be estimated by Monte Carlo sampling from p(z), as sketched below.
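For intuition only (this example is not from the lecture), the evidence in eq. (3) can be approximated by Monte Carlo sampling from p(z) when the model is simple enough to check against a closed form:

```python
import numpy as np
from scipy.stats import norm

# Toy Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ] for a 1-D model
# with p(z) = N(0, 1) and p(x | z) = N(z, 1); the exact value is p(x) = N(x; 0, 2).
x_obs = 0.7
z = np.random.standard_normal(200_000)        # samples from the prior p(z)
p_x_mc = norm.pdf(x_obs, loc=z, scale=1.0).mean()
p_x_exact = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(2.0))
print(p_x_mc, p_x_exact)                      # the estimate approaches the exact value
```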
Therefore, we return to Equation 6 and substitute the development we made in Equation 8. We get that the minimization condition becomes the maximization of the expression:
$$\max_q\; \mathbb{E}_q[\log p(x^n \mid z^m)] - D\big(q(z^m)\,\|\,p(z^m)\big). \qquad (10)$$
Let us now make the assumption that p(z) is a standard Gaussian distribution and that p(x|z) is a Gaussian distribution whose mean is defined by a deterministic function f of the variable z and whose covariance matrix has the form of a positive constant c that multiplies the identity matrix I. The function f is left unspecified for the moment and will be chosen later. Thus, we have
$$p(z) = \mathcal{N}(0, I), \qquad p(x \mid z) = \mathcal{N}(f(z), cI).$$
Here we are going to approximate p(z|x) by a Gaussian distribution qx (z) whose mean
and covariance are defined by two functions, g and h, of the parameter x. These two
functions are supposed to belong, respectively, to families of functions G and H that will be specified later but that are supposed to be parametrised. Thus we can denote
$$q_x(z) = \mathcal{N}(g(x), h(x)), \qquad g \in G, \; h \in H.$$
So, we have defined in this way a family of candidates for variational inference, and we now need to find the best approximation within this family by optimising the functions g and h (in fact, their parameters) to minimise the divergence between the approximation and the target p(z|x). In other words, we are looking for the optimal g* and h* such that:
$$
\begin{aligned}
(g^*, h^*) &= \operatorname*{argmin}_{g,h}\; D\big(q_x(z)\,\|\,p(z|x)\big)\\
&= \operatorname*{argmin}_{g,h}\; \mathbb{E}_q[\log q_x(z)] - \mathbb{E}_q\!\left[\log \frac{p(x|z)\,p(z)}{p(x)}\right]\\
&= \operatorname*{argmin}_{g,h}\; \mathbb{E}_q[\log q_x(z)] - \mathbb{E}_q[\log p(x, z)]\\
&= \operatorname*{argmin}_{g,h}\; \mathbb{E}_q[\log q_x(z)] - \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log p(x|z)] \qquad (13)\\
&= \operatorname*{argmin}_{g,h}\; D\big(q_x(z)\,\|\,p(z)\big) - \mathbb{E}_q[\log p(x|z)].
\end{aligned}
$$
V. REPARAMETERIZATION TRICK
The left-hand side of Figure 3 below shows the VAE network. The encoder takes
the input, and estimates the mean, g(x), and the standard deviation, h(x), of the
multivariate Gaussian distribution of the latent vector, z, to reconstruct the input as x̃.
This seems straightforward until the gradient updates happen during backpropagation.
Backpropagation gradients will not pass through the stochastic Sampling block. While
it’s fine to have stochastic inputs for neural networks, it’s not possible for the gradients
to go through a stochastic layer.
The solution to this problem is to push the sampling process out to the input, as shown on the right side of Figure 3. Then, compute the sample as
$$z = g(x) + \varepsilon * h(x), \qquad \varepsilon \sim \mathcal{N}(0, I). \qquad (14)$$
If ε and h(x) are expressed in vector format, then ε ∗ h(x) is element-wise multiplication. Using Equation 14, it appears as if the sampling is coming directly from the latent space.
Fig. 3. On the left side, a VAE network [2] without the reparameterization trick; on the right, a diagram with the reparameterization trick, where the sample is computed as g(x) + ε ∗ h(x) with ε ∼ N(0, 1).
REFERENCES
[1] C. Doersch. Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908, 2016.
[2] R. Atienza. Advanced Deep Learning with Keras. Packt Publishing, Birmingham, UK, 2018.
[3] J. Rocca. Understanding Variational Autoencoders (VAEs), Sep. 24, 2019.
https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73