Machine learning
Lecture 1
Lecturer: Haim Permuter Scribe: Gal Rattner
In this lecture we introduce machine learning, define the relevant notations, and
examine the popular machine learning algorithm K-Nearest Neighbors. Most of the
material for this lecture was taken from the work of Cover and Hart on K-NN [1].
The goal of machine learning is to program computers to use training data or gained
experience to solve given problems. We can broadly say that a machine learns whenever
it changes its structure, program or data such that its future expected performance is
improved. Many successful applications of machine learning already exist today, including
systems that predict stock prices, face recognition technology incorporated in digital
cameras or Facebook, speech recognition in Google Assistant, Siri or Alexa, safety
systems in cars (e.g., Mobileye), advertising on the web, text auto-completion, and "spam"
mail detection programs. In recent years, machine learning is entering every engineering
field that you may think about and is expected to have a great impact in the next decades
since the technology is progressing significantly.
The type of problems that we try to solve by machine learning can be divided into
two categories:
• Classification problems: Given stochastic observation x, classification is the prob-
lem of associating x with one class from among a set of classes.
For example:
- Associating a recorded speech with the speaker (speaker recognition).
- Associating an image with a letter (digit recognition).
- Associating an image with an object, e.g., recognizing a car (object recognition).
B. Training/Validation/Test Data
Classification and regression are both based on a training session comprising a set
of training samples x_i, where each sample has its own fixed label l_i. The success rate
of the classification is then evaluated with a set of validation samples {x1 , x2 , . . . , xn }
whose properties are similar to those of the training set. The evaluation of the classifi-
cation/regression is done by comparing the original labels (using a loss function that we
will explain later in the lecture) with the labels associated by the trained model. Usually,
the learning model might still be changed (especially hyper-parameters, like the size or
depth of the machine learning model) during the validation and therefore there is a need
for final testing of the machine learning system on additional data called the test data,
on which the model cannot be changed.
C. Types of learning
Machine learning methods can be categorized into three main types according to the data
they learn from.
Supervised - In the training stage, the system is presented with labels for each sample.
Sample-label couples {(x1 , l1 ), (x2 , l2 ), . . . , (xN , lN )} are given to the system along with
the unlabeled test set {y1 , y2 , . . . , yM }. The system then associates labels with the test
samples according to the statistic model based on the training set. The supervised learning
system checks the correctness of its associating process by comparing the association
results with the original test labels. Supervised learning is often used in classification
and regression problems, for instance, in digit recognition, house price estimations, etc.
Unsupervised - In the training stage, the system is not given any fixed labels as
inputs, and as such, the system must define relevant labels. This usually requires that the
system learn the distribution of the sample set. Unsupervised learning is often used for
segmentation, for instance, edge detection in image processing tasks, etc.
Reinforcement learning - An intermediate stage to supervised and unsupervised learn-
ing, reinforcement learning entails the system learning “on the fly” through its experience.
For each attempt, the system receives some reward and then determines whether the
attempt was a failure or a success. The system thus gains experience and learns which of
its attempts were “good” by comparing between the rewards it received for the attempts.
Reinforcement learning is used to train computers to be experts in defined tasks, for
example, playing a game (e.g. https://www.youtube.com/watch?v=V1eYniJ0Rnk).
In this course, we will focus on supervised learning, though unsupervised learning
often constitutes an integral step in training the supervised model. We have a whole
course only on Reinforcement learning (http://www.ee.bgu.ac.il/∼haimp/RL/index.html).
D. Types of models
Generative model: The learning model is probabilistic and it models the joint
distribution of the samples and the labels, i.e., P (x, l). It is called generative since often
one can use it in order to generate data similar to the one it learns from. An example of
generative model is the Gaussian Mixture Model that we learn later in the course.
Discriminative model: The learning model learns to discriminate the feature x into
labels l, namely, it learns a mapping φ : x 7→ l, or the conditional probability P (l|x) but
not the joint as in the generative model. Discriminative models are used for supervised
learning and very rarely for unsupervised learning. Examples of discriminative models
that we will learn in the course are Logistic regression model and Neural network model.
II. NOTATION
Similarly, E[g(X)] is

E[g(X)] ≜ \sum_{x ∈ X} g(x) P(x).   (2)
III. CLASSIFICATION WHEN THE PROBABILITY IS KNOWN
Let {η_1, η_2, . . . , η_M}, η_i > 0 ∀i, \sum_i η_i = 1, be the prior probabilities of the M classes,
and f_i(x) be the probability density of each class at x. The distribution of X is thus:
X ∼
Class                     P(x|class)
η_1 = P(Class = 1)        f_1(x) = f(x|Class = 1)
η_2 = P(Class = 2)        f_2(x) = f(x|Class = 2)
  ⋮                         ⋮
η_M = P(Class = M)        f_M(x) = f(x|Class = M).
Definition 1 (Loss Function) Let X be a random variable with the classes set
{1, 2, . . . , n}. The loss function L(i, j) is the loss incurred by associating the observation
to class j when in fact it belongs to class i, ∀i, j ∈ {1, 2, . . . , n}.
Example 1 (Right/wrong loss function) Consider M=2, and the loss function is the
right/wrong function, such that a correct association yields no loss and an incorrect
association, or an error, yields a loss of 1. The loss matrix in this case is:
L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.
For the case of right/wrong loss function, we notice that in the case of an error, the
loss always counts as 1, while a correct association counts as 0 loss. The loss matrix
is therefore the 0-1 matrix of M × M size. For example, consider the right/wrong case
where M = 3, then

L = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.
In general (not the right/wrong case), a loss matrix can also be asymmetric, for instance,
L(1, 2) > L(2, 1). Such an asymmetric loss matrix is warranted when, for example, the system is tasked with
deciding whether a given person is a terrorist based on a recorded phone call. Failure to
identify a terrorist may result in much greater loss than falsely deciding that an innocent
person is a terrorist. Therefore, we can expect the representative loss matrix to be
asymmetric in many cases.
Let us recall the joint probability equation

P(j, x) = P(j)P(x|j) = η_j f_j(x).   (3)

We can describe P(x) as the sum of the joint probability over all the alphabet of X, and
using eq. (3) we get

P(x) = \sum_{j=1}^{M} P(j, x) = \sum_{j=1}^{M} P(j)P(x|j) = \sum_{j=1}^{M} η_j f_j(x).   (4)
The probability that X belongs to class i, ∀i ∈ {1, 2, . . . , M}, given the sample x, is the
posterior probability η̂_i(x):

η̂_i(x) = P(class = i|x) = \frac{P(class = i, x)}{P(x)} = \frac{P(i)P(x|i)}{P(x)} = \frac{η_i f_i(x)}{\sum_{j=1}^{M} η_j f_j(x)}.   (5)
We can finally collect the class probabilities in vectors. These include the prior probability
vector:
P (class) = [η1 , η2 , . . . , ηM ] , (6)
Definition 2 (Conditional loss) The conditional loss denoted by rj (x) is the loss in-
curred by associating observation x with class j, then:
r_j(x) = E[L(I, j) | X = x] = \sum_i P(class = i|x) L(i, j) = \sum_{i=1}^{M} η̂_i(x) L(i, j).   (8)
Because our goal is to minimize the conditional loss, we therefore define r*(x) and R*
as the quantities corresponding to the minimizing choice of class j ∈ {1, 2, . . . , M}.
Definition 3 (Conditional Bayes risk) The conditional Bayes risk denoted by r ∗ (x), is
the loss incurred by associating x with class j that has the lowest cost out of all classes,
i.e.,

r*(x) ≜ min_j {r_j(x)} = min_j \left\{ \sum_{i=1}^{M} η̂_i(x) L(i, j) \right\}.   (9)
Definition 4 (Bayes risk) The Bayes risk denoted by R∗ is the resulting overall mini-
mum expected risk, i.e.,

R* ≜ E[r*(X)].   (10)
Example 2 (Decision due to minimum loss) Consider the next loss matrix L, where
all failures have the same loss value:
L = \begin{pmatrix} 0 & 1 & 1 & \cdots & 1 \\ 1 & 0 & 1 & \cdots & 1 \\ 1 & 1 & 0 & \cdots & 1 \\ \vdots & & & \ddots & \vdots \\ 1 & 1 & \cdots & 1 & 0 \end{pmatrix},
and therefore, by choosing the class that produces the minimum loss, we will minimize
the Bayes risk.
Minimizing the Bayes risk for the loss function of right/wrong yields the decision rule
given in (13), which is also known as the Maximum a Posteriori (MAP) rule and is
formally defined in the next definition.
Given that the denominator is the same for all the elements, the maximizing vector element is
given by argmax_j η_j f_j(x).
Now, for the case in which η_1 = η_2 = · · · = η_M, meaning that all classes have the
same prior probability, using the MAP method is equivalent to choosing the maximum of
f_j(x) only, i.e.,

j* = argmax_j f_j(x),

and we refer to this decision method as Maximum Likelihood Estimation (MLE).
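To make the two rules concrete, here is a minimal numerical sketch (not part of the original notes; the two class densities, the priors and the query point are assumed purely for illustration, and numpy is assumed to be available). It classifies a point by argmax_j η_j f_j(x) for MAP and by argmax_j f_j(x) for ML, and with unequal priors the two rules can disagree.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed two-class example: class 1 ~ N(-2, 1), class 2 ~ N(+1, 1).
mus, sigmas = np.array([-2.0, 1.0]), np.array([1.0, 1.0])
priors = np.array([0.7, 0.3])        # eta_1, eta_2 (assumed, unequal on purpose)

def map_decision(x):
    """MAP rule: argmax_j eta_j * f_j(x)."""
    return int(np.argmax(priors * gaussian_pdf(x, mus, sigmas))) + 1

def ml_decision(x):
    """ML rule: argmax_j f_j(x); coincides with MAP when all priors are equal."""
    return int(np.argmax(gaussian_pdf(x, mus, sigmas))) + 1

x = -0.3
print(map_decision(x), ml_decision(x))   # 1 and 2: the prior tilts the MAP decision
```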
Example 3 (Two Gaussian distributed classes) Given two classes distributed normally
over two dimensions. Using MLE, we choose the class that has the higher probability
density function value at point x. Using this method entails an obligated error probability,
in this case presented by the area trapped under the two Gaussian graphs:
[Figure 1 plots η_1 f_1(x) and η_2 f_2(x) over x; class 1 is chosen (j* = 1) where η_1 f_1(x) > η_2 f_2(x) and class 2 (j* = 2) otherwise, and the marked regions indicate the risk caused by choosing j* = 1 while j = 2 and by choosing j* = 2 while j = 1.]
Figure 1. The total area trapped under the overlap between the two Gaussians (marked in color) is the total
obligated error probability.
Now we can calculate the overall risk by integrating min{η_1 f_1(x), η_2 f_2(x)}
to obtain the trapped area:

R* = E[r*(X)]
   = \int r*(x) f(x) dx
   = \int_{η_1 f_1(x) > η_2 f_2(x)} η̂_2(x) f(x) dx + \int_{η_1 f_1(x) < η_2 f_2(x)} η̂_1(x) f(x) dx   (18)
   = \int_{η_1 f_1(x) > η_2 f_2(x)} η_2 f_2(x) dx + \int_{η_1 f_1(x) < η_2 f_2(x)} η_1 f_1(x) dx.
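As a sanity check of Eq. (18), the following sketch (my own illustration, not from the notes; equal priors and two unit-variance Gaussians are assumed, and numpy is assumed) approximates R* by numerically integrating the smaller of the two weighted densities over a grid.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed example: eta_1 = eta_2 = 0.5, f_1 = N(-2, 1), f_2 = N(+2, 1).
eta1, eta2 = 0.5, 0.5
x = np.linspace(-10, 10, 20001)                      # integration grid
g1 = eta1 * gaussian_pdf(x, -2.0, 1.0)
g2 = eta2 * gaussian_pdf(x, 2.0, 1.0)

# Eq. (18): R* is the area under the smaller of the two weighted densities.
bayes_risk = np.trapz(np.minimum(g1, g2), x)
print(bayes_risk)    # about 0.0228 for this choice of parameters
```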
In the previous section we defined the conditional Bayes risk r*(x) in Eq. (9) and
the Bayes risk R* = E[r*(X)] in (10), which are optimal, but one needs to know the
probabilities of the classes, i.e., η_i, and the conditional probability f_i(x) for all possible i
and x. In many cases, we have several classifiers (or a set of classifiers), which can also
be called hypotheses. The hypothesis set may not include the optimal one, namely, the
Bayes classifier that we saw in the previous subsection. The ERM idea is a very simple
idea that tells us which hypothesis to choose from the set of hypotheses.
For each classifier/hypothesis h let's define a risk

R(h) ≜ E[L(h(X), Y)],

where L(·, ·) is the loss function (as defined in the previous subsection in Def. 1) and y is
the label associated with x. In general, we would like to choose the classifier/hypothesis
h from the possible set H that minimizes the risk, i.e.,

h* = argmin_{h ∈ H} R(h).
However, in order to compute the risk for a specific hypothesis h, i.e., R(h), one needs
to know the joint probability p(x, y) for all possible samples x and labels y. In practice,
the joint probability p(x, y) is unknown (this situation is called agnostic learning).
However, one can assume that we have samples and labels {x_i, y_i}_{i=1}^{n} drawn from the
joint p(x, y). The set of samples that are available is called the training set. The empirical
risk of a specific classifier h is defined as

R_{emp}^{(n)}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i),   (21)
and the minimum empirical risk idea is

ĥ = argmin_{h ∈ H} R_{emp}^{(n)}(h).   (22)
Assuming {x_i, y_i}_{i=1}^{n} are i.i.d. (or at least stationary and ergodic), then by the law of
large numbers lim_{n→∞} R_{emp}^{(n)}(h) = R(h) with probability 1, hence the minimum empirical
risk converges to the minimum risk if the number of samples is large enough.
Summary of the ERM idea: The ERM idea is very simple and extremely useful. We
have a set of classifiers/hypotheses H and a set of samples (a.k.a. the training set). The ERM
principle tells us to choose the classifier that minimizes the empirical risk, i.e., Eq. (22).
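A minimal sketch of the ERM principle (my own illustration, not from the notes; the toy data, the family of threshold classifiers, and numpy are all assumed): given a finite hypothesis set and a labeled training set, pick the hypothesis with the smallest empirical risk under the 0-1 loss, as in Eq. (22).

```python
import numpy as np

# Assumed toy data: 1-D samples x_i with labels y_i in {0, 1}.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x > 0.2).astype(int)                  # unknown "true" rule generating the labels

# Finite hypothesis set H: threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(-1, 1, 41)

def empirical_risk(t):
    """R_emp(h_t) = (1/n) * sum of 0-1 losses over the training set, Eq. (21)."""
    predictions = (x > t).astype(int)
    return np.mean(predictions != y)

# ERM: choose the hypothesis minimizing the empirical risk, Eq. (22).
risks = np.array([empirical_risk(t) for t in thresholds])
best_t = thresholds[np.argmin(risks)]
print(best_t, risks.min())
```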
d(x, x′ ) ≤ d(x, xi ) ∀ i = 1, . . . , n
η_1 = P(θ = 1)
η_2 = P(θ = 2)
  ⋮
η_n = P(θ = n)
P(x, θ, x_1, θ_1, . . . , x_n, θ_n) = P(θ)P(x|θ) P(θ_1)P(x_1|θ_1) P(θ_2)P(x_2|θ_2) · · ·   (23)
We define the n-sample Nearest Neighbor procedure risk to be (n stands for the number
of training samples):
R∗ ≤ R ≤ 2R∗ · (1 − R∗ ). (26)
In the next part of this lecture, we will use a lemma and the definitions given above
to show that these bounds hold. Before we get to the proof, note that the Bayes risk R*
can only take values in the interval [0, 1/2], so that 0 ≤ R* ≤ R ≤ 2R*(1 − R*) ≤ 1/2. In
particular, for the edge cases, we get that R* = 0 if and only if R = 0, and that R* = 1/2
if and only if R = 1/2.
Exercise 1 Prove the equation used above: P (θ, θn′ |x, x′n ) = P (θ|x) · P (θn′ |x′n )
Proof: Let S_x(r), r > 0, be the sphere of radius r centered at x, where d(·, ·) is the metric
defined on X. Considering the case where S_x(r), r > 0, has a non-zero probability
measure, then for any δ > 0
The distance of the nearest neighbor x'_n from x decreases monotonically with the increase
in n.
We can now use the fact that limn→∞ x′n = x with probability one to show that the
conditional NN risk r(x, x′n ) converges to the limit 2r ∗ (x)(1 − r ∗ (x)). For large numbers
of training samples n, we get
lim_{n→∞} r(x, x'_n) (a)= lim_{n→∞} ( η̂_1(x) η̂_2(x'_n) + η̂_2(x) η̂_1(x'_n) )   (29a)
                   (b)= 2 η̂_1(x) η̂_2(x)   (29b)
                   (c)= 2 η̂_1(x)(1 − η̂_1(x))   (29c)
                   (d)= 2 r*(x)(1 − r*(x)),   (29d)
where:
(a) holds using equation (27).
Now R is the limit of the expectation of r(x, x'_n), and we can use the fact that r(x, x'_n) < 1
is bounded to switch the order of expectation and the limit, to get

R = lim_{n→∞} E[r(x, x'_n)] = E[ lim_{n→∞} r(x, x'_n) ].   (31)
R = E[r(x)] (33a)
where:
(a) holds by the linearity of the expectation.
(b) holds since r*(x) ∈ [0, 1/2] and the expression inside the expectation is non-negative over
this interval, and equality is achieved only if r*(x)(1 − 2r*(x)) = 0.
Using the first part of the equation above and the fact that R* is the expectation of r*,
and that Var(r*(x)) ≥ 0, we can then write

R = E[2r*(x)(1 − r*(x))] = 2R* − 2E[(r*(x))^2] = 2R* − 2( Var(r*(x)) + (R*)^2 )
  ≤ 2R*(1 − R*),   (34)
and

R* = E[r*(x)]   (35a)
  (a)≤ E[2r*(x)(1 − r*(x))]   (35b)
  (b)≤ 2R*(1 − R*),   (35c)

where:
(a) holds from equation (33e).
(b) holds since r*(x) ≤ 1/2.
To conclude, we collect equations (33),(34) to obtain the bounds of the overall NN
risk R(n):
R∗ ≤ R ≤ 2R∗ (1 − R∗ ). (36)
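For completeness, here is a minimal 1-NN classifier sketch (my own illustration, not the notation of [1]; the toy training set and numpy are assumed). It labels a query point with the label of its nearest training sample, which is the rule whose risk R is bounded in (36).

```python
import numpy as np

def nearest_neighbor_classify(x_train, y_train, x_query):
    """1-NN rule: return the label of the training sample closest to x_query."""
    distances = np.linalg.norm(x_train - x_query, axis=1)   # Euclidean metric d(., .)
    return y_train[np.argmin(distances)]

# Assumed toy training set in R^2 with labels in {1, 2}.
x_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([1, 1, 2, 2])
print(nearest_neighbor_classify(x_train, y_train, np.array([0.5, 0.2])))   # -> 1
print(nearest_neighbor_classify(x_train, y_train, np.array([4.6, 4.9])))   # -> 2
```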
REFERENCES
[1] T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
Machine Learning
Lecture 2: GMM and EM
I. INTRODUCTION
This lecture comprises an introduction to the Gaussian Mixture Model (GMM) and the
Expectation-Maximization (EM) algorithm. Parts of this lecture are based on lecture
notes of Stanford's CS229 machine learning course by Andrew Ng [1]. This lecture
assumes you are familiar with basic probability theory. The notation here is similar to
that in Lecture 1.
In supervised classification, our target is to analyze the labelled training data we get,
and to use it to generate a model to map and classify new examples. Below is a general
model of supervised learning.
observation    label
r_1            c_1
r_2            c_2
 ⋮              ⋮
r_N            c_N

r ⟹ feature extraction ⟹ x ⟹ statistical model ⟹ y ⟹ decision ⟹ ĉ
Here r represents the raw observation vectors with their respective label c, and ĉ is the
estimation of the model.
1) Feature extraction: The goal of the feature extraction block is to extract the features
that contribute most to the classification and to eliminate the rest. For instance, in
speech recognition a well-known feature extraction technique is called Cepstrum,
and in texture classification, it is based on the discrete cosine transform (DCT) or
wavelet. All the above-mentioned features are based on the frequency domain. In
some cases, one might also use dimension reduction in addition to these features,
when, for example, the feature vector is too large or there is high redundancy. Two
good feature reduction methods are PCA and Autoencoders.
2) Statistical model: The goal of the statistical model is to represent the statistics of
each class, which allows the classes to be separated from each other. The statistical
model usually has some probability justification (such as the GMM), but sometimes
it might just be a technique for separating the different classes. Usually a good statistical
model can also be used for additional tasks such as data compression or denoising.
3) Decision: The decision component is responsible for using the statistical model
output to classify the input. In some cases, we may generate more than one
statistical model, and the decision component uses all of the outputs.
The GMM is a statistical model which assumes that every probability distribution can
be approximated with a set of gaussians.
In supervised learning, GMM models the distribution of each class as a set of weighted
gaussians, P (xi , zi |c) = P (xi |zi , c)P (zi |c), where zi is a hidden random variable that is
not observed. One assumption that is made is that any distribution can be well modelled
by a set of gaussians. In the figure below, you can see three weighted gaussians in R1
and their sum which models the distribution of a random variable. Another assumption
of the GMM is that any sample, x ∈ Rl , was generated from a single gaussian.
P(x_i | z_i = j, c) = \frac{1}{\sqrt{(2π)^l |Σ_j|}} \exp\left( -\frac{1}{2} (x_i − µ_j)^T Σ_j^{-1} (x_i − µ_j) \right)   (1)
Fig. 1. Gaussian mixture model built from three weighted gaussians. The figure shows three weighted gaussians with
different µ and σ. The sum represents the mixture model.
From now on we only refer to a specific class c, since each class has its own model.
Our goal is to estimate the model's parameters φ_j, µ_j and Σ_j, ∀j ∈ {1, ..., k}, and
we use the maximum likelihood [2] criterion to maximize the expression P(X^n; θ), where θ
represents the parameters.
= argmax_θ log\left( \prod_i P(x_i | θ) \right)   (3b)
= argmax_θ \sum_i log P(x_i | θ)   (3c)
First, let’s try to model our data using a single Gaussian. This example is in one
dimensional space.
ℓ(φ, µ, Σ) = \sum_{i=1}^{m} log \sum_{j=1}^{k} P(x_i | z = j; µ_j, σ_j) P(z = j; µ_j, σ_j)   (5a)
         (a)= \sum_{i=1}^{m} log P(x_i | µ, σ)   (5b)
           = \sum_{i=1}^{m} log \frac{1}{\sqrt{2πσ^2}} e^{-\frac{(x_i − µ)^2}{2σ^2}},   (5c)
where (a) follows from k = 1. Now let's take derivatives in order to find the parameters
that maximize the log-likelihood. First we take the derivative with respect to µ:
\frac{∂}{∂µ} ℓ(φ, µ, Σ) = \frac{1}{σ^2} \sum_{i=1}^{m} (x_i − µ)   (6a)
                       = 0.   (6b)
Therefore

µ̂ = \frac{1}{m} \sum_{i=1}^{m} x_i.   (7)
= 0.   (8c)

We get

σ̂^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i − µ)^2.   (9)
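Equations (7) and (9) are just the sample mean and the (biased) sample variance; a short sketch (my own, assuming numpy and toy 1-D data) makes that concrete:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=1000)   # assumed 1-D data
mu_hat = x.mean()                          # Eq. (7): (1/m) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()    # Eq. (9): (1/m) * sum_i (x_i - mu)^2
print(mu_hat, sigma2_hat)                  # close to 3 and 4 for this data
```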
Two Gaussians example: Now let's try to find an analytic solution that maximizes the
log-likelihood of the two-Gaussian model.

ℓ(φ, µ, Σ) = \sum_{i=1}^{m} log \sum_{j=1}^{k} P(x_i | z = j; µ, σ) P(z = j; θ)   (10a)
           = \sum_{i=1}^{m} log( φ_1 N(x_i; µ_1, σ_1^2) + φ_2 N(x_i; µ_2, σ_2^2) )   (10b)
The EM algorithm was first explained in a 1977 paper, Maximum Likelihood from
Incomplete Data via the EM Algorithm [3]. It is an iterative algorithm for using maximum
likelihood to estimate the parameters of a statistical model with unobserved (hidden)
variables (also called latent variables, from the Latin lateo, "to lie hidden"). It has two
main steps. The first is the E-step, which stands for expectation: we compute some probability
distribution of the latent variables so we can use it for expectations. The second is the
M-step, which stands for maximization. The EM algorithm finds a local maximum that
depends on the initial model that we start with.
Problem Definition: Let x^n be a random vector, distributed i.i.d., that we observe. Let
Z^n be a hidden vector that X^n depends on, s.t. P(x^n | z^n) = \prod_{i=1}^{n} p(x_i | z_i), also distributed
i.i.d., i.e., P(z^n) = \prod_{i=1}^{n} P(z_i). The distribution of X^n and Z^n has some parameters θ
that we are interested in finding. In GMM, for instance, Z_i stands for the Gaussian number
of sample i, i.e., z_i ∈ {1, 2, ..., k}, where k is the total number of Gaussians. Our goal is
to maximize the log-likelihood

log P_{X^n}(x^n; θ) = log\left( \sum_{z^n} P_{X^n, Z^n}(x^n, z^n; θ) \right).   (12)
The stopping criterion can be changed: it can be either the one that is written,
log P(x^n | θ^{(t)}) − log P(x^n | θ^{(t−1)}) < ε, a fixed number of iterations, or ||θ^{(t)} − θ^{(t−1)}|| < ε
for some norm || · ||.
Derivation of the EM algorithm: Now, we are going to show that in each iteration
the algorithm increases the log likelihood, i.e., log P(x^n | θ^{(t)}) increases as t increases.
Hence, the EM algorithm converges to a local maximum that depends on the initial
model θ^{(0)}.
For any Q(z^n) we have

log P(x^n; θ^{(t)}) = log \sum_{z^n} Q(z^n) \frac{P(x^n, z^n; θ^{(t)})}{Q(z^n)}   (14)
                 = log E_{Q_{Z^n}}\left[ \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right]   (15)
              (a)≥ E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right],   (16)
where (a) follows from Jensen's inequality. Now, using algebra and information measures we
obtain

E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right] = E_{Q_{Z^n}}[ log P(x^n; θ^{(t)}) ] + E_{Q_{Z^n}}\left[ log \frac{P(Z^n | x^n; θ^{(t)})}{Q(Z^n)} \right]
   = log P(x^n; θ^{(t)}) + E_{Q_{Z^n}}\left[ log \frac{P(Z^n | x^n; θ^{(t)})}{Q(Z^n)} \right]   (17)
   = log P(x^n; θ^{(t)}) − D(Q_{Z^n} || P_{Z^n | x^n, θ^{(t)}}).   (18)
Hence, we can choose Q(z^n) = P(z^n | x^n, θ^{(t)}), i.e., Q(z_i) = P(z_i | x_i, θ^{(t)}), and we obtain
that (a) in (16) holds with equality. So now, when Q(z^n) is fixed, we can maximize over all
θ the following expression:

θ^{(t+1)} = arg max_θ E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ)}{Q(Z^n)} \right] = arg max_θ E_{Q_{Z^n}}[ log P(x^n, Z^n; θ) ].
Hence we have

log P(x^n; θ^{(t+1)}) − D(Q_{Z^n} || P_{Z^n | x^n, θ^{(t+1)}}) (a)= E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t+1)})}{Q(Z^n)} \right]
   (b)≥ E_{Q_{Z^n}}\left[ log \frac{P(x^n, Z^n; θ^{(t)})}{Q(Z^n)} \right]
   (c)= log P(x^n; θ^{(t)}),
where (a) and (c) follow from the E-step derivation, and (b) from the M-step. Finally, the last
sequence of equations implies that log P(x^n; θ^{(t+1)}) ≥ log P(x^n; θ^{(t)}), because the divergence
is non-negative.
Now, we wish to find the new parameters θ which maximize the log-likelihood with
respect to the expectations (the M-step). First let's define φ(j) ≜ P(j; θ). The model
parameters are θ = {φ(j), µ_j, σ_j^2}.
Finding the optimal φ(j) is a bit more difficult since it is a probability distribution
function and therefore it has some constraints. The constraints are

\sum_j φ(j) = 1,
φ(j) ≥ 0 ∀j.
The solution for this comes from the field of convex optimization. Further discussion
about convex optimization problems and Lagrange multipliers can be found in the
appendix and in Boyd's book [4]. Let's define the problem as a standard convex
optimization problem (Definition 3):

minimize    −\sum_{i=1}^{m} \sum_{j=1}^{k} w(j, i) log φ(j)
subject to  −φ(j) ≤ 0,  1 ≤ j ≤ k
            \sum_j φ(j) = 1
1) ∇_φ L(φ*, λ*, ν*) = 0
2) −φ*(j) ≤ 0, ∀j
3) \sum_j φ*(j) = 1
4) λ*_j φ*(j) = 0, ∀j
5) λ*_j ≥ 0, ∀j
After differentiating the Lagrangian with respect to a specific φ(j) we get

φ(j) = \frac{\sum_{i=1}^{m} w(j, i)}{ν}.   (25)
We do not know ν, so we use the fact that \sum_j φ(j) = 1:

\sum_j φ(j) = \frac{\sum_j \sum_{i=1}^{m} w(j, i)}{ν} = 1.   (26)
Therefore

φ(j) = \frac{\sum_{i=1}^{m} w(j, i)}{\sum_{i=1}^{m} \sum_{j=1}^{k} w(j, i)}   (27)
     = \frac{\sum_{i=1}^{m} w(j, i)}{m}.   (28)
We can see that the KKT conditions hold.
Algorithm: In the E-step, for every Gaussian j and sample i compute the responsibility

w(j, i) := P(z_i = j | x_i; θ),   (29)

and in the M-step update

φ(j) := \frac{1}{m} \sum_{i=1}^{m} w(j, i)   (30)

µ_j := \frac{\sum_{i=1}^{m} w(j, i) x_i}{\sum_{i=1}^{m} w(j, i)}   (31)

Σ_j := \frac{\sum_{i=1}^{m} w(j, i)(x_i − µ_j)(x_i − µ_j)^T}{\sum_{i=1}^{m} w(j, i)}   (32)
Now we must consider when to stop the algorithm. One way is to repeat the two steps
until convergence θ(t+1) ≈ θ(t) . Another option is to repeat them for a fixed number
of times. Repeating the steps a fixed number of times can prevent over-fitting of the
training data set.
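The following is a compact sketch of the E- and M-steps above for a 1-D GMM (my own illustration, not code from the notes; numpy, the toy data and the simple initialization are all assumed, and the covariance update (32) reduces to a scalar variance in one dimension).

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def em_gmm_1d(x, k, iterations=50):
    """EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(0)
    phi = np.full(k, 1.0 / k)                     # mixture weights phi(j)
    mu = rng.choice(x, size=k, replace=False)     # initialize means from the data
    sigma2 = np.full(k, x.var())
    for _ in range(iterations):
        # E-step: responsibilities w[j, i] = P(z_i = j | x_i; theta), Eq. (29).
        w = phi[:, None] * gaussian_pdf(x[None, :], mu[:, None], sigma2[:, None])
        w /= w.sum(axis=0, keepdims=True)
        # M-step: Eqs. (30)-(32), with a scalar variance instead of a covariance matrix.
        n_j = w.sum(axis=1)
        phi = n_j / len(x)
        mu = (w * x[None, :]).sum(axis=1) / n_j
        sigma2 = (w * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / n_j
    return phi, mu, sigma2

# Assumed toy data drawn from two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x, k=2))
```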
F. K-means algorithm
K-means is very similar to EM except that it gives a ’hard’ decision for each sample
to the centroid to which it belongs. Below we discuss K-means briefly. We are given
a training set {x(1) , ..., x(m) } and we wish to group it into K different components.
Algorithm:
1) Initialize centroids µ_1, ..., µ_k ∈ R^n
2) Repeat until convergence: {
   a) for every i, set c^{(i)} := argmin_j ||x^{(i)} − µ_j||^2
   b) for every j, set µ_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\} x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}
}
To initiate the centroid, we can choose K random training samples. There are other
initialization methods. The inner loop assigns each sample to the ’closest’ centroid, and
moves the centroid to the mean of the points assigned to it.
The K-means algorithm guarantees convergence to a local optimum (except in very
rare cases, where K-means oscillates between a few different values), but it does not
guarantee convergence to a global optimum. A common way to deal with this problem is
to run K-means many times with different initializations, and then to pick the one with the
lowest distortion. K-means is often used to initialize GMM modelling.
A common use of the K-means algorithm is as an initial estimate of the GMM parameters.
We use the centroids to initialize the Gaussians' µ and set arbitrary Σ and φ.
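A minimal K-means sketch (my own illustration, not from the notes; numpy and the toy two-blob data are assumed) following the two steps above: assign each sample to its closest centroid, then move each centroid to the mean of its assigned points.

```python
import numpy as np

def kmeans(x, k, iterations=100):
    """Basic K-means on the rows of x (shape [m, d]); empty clusters are not handled."""
    rng = np.random.default_rng(0)
    centroids = x[rng.choice(len(x), size=k, replace=False)]   # init from K random samples
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every sample.
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Assumed toy data: two well-separated blobs in R^2.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
print(kmeans(x, k=2)[0])
```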
1) Choosing the number of GMM components: [6] Choosing the number of compo-
nents for modelling the distribution is an important issue. There is a trade-off between
choosing too many components which may cause over-fitting (for example, what happens
if we use the number of samples?), and choosing too few, which can render a model that
is not flexible enough. Selecting the number of components is crucial for unsupervised
segmentation tasks, in which each component represents a separate class. However, when
each GMM models a different class and the segmentation is supervised, the selection of
the number of components becomes less critical.
2) Full or diagonal covariance?: [5] In the model that is the focus of this lecture,
we used unrestricted covariance for the GMMs. The GMM can also be restricted to have
diagonal covariance which has two practical advantages:
• There are fewer parameters to be estimated.
• Calculation of the inverse matrix is trivial.
In some applications (e.g., speaker verification), the diagonal model is sufficient. The
obvious disadvantage of assuming diagonality, however, is that the different features
may in practice be strongly correlated. In this case, the model might be incapable of
representing the feature distribution accurately. The figure below illustrates this point.
Fig. 2. Modelling the distributions of two different features (first column) using a full covariance (second column) and
a diagonal covariance (third column).
On the first row, we see that the features are highly correlated, and therefore the
GMM must use a full covariance matrix. On the second row, the difference is far less
pronounced, and it seems that a diagonal covariance matrix is sufficient.
3) GMM for unsupervised learning: So far we only talked about using GMM for
supervised learning, where each sample has its own label, and a model is built for each
class. In many cases, GMM is used for unsupervised clustering. It is useful to gather data
from a group into sub-groups that can be modelled well by a gaussian distribution. Using
GMM, we can cluster the data into sub-groups that do not have labels or identity. For
example, we have a class with n students and we want to split it into three different study
groups according to student averages. An additional application of GMM in unsupervised
learning is the Universal Background Model (UBM). In the UBM, we first generate a
general class-independent statistical model by using all samples. We often use the UBM
to generate the class-dependent models. This is particularly useful for cases in which we
lack a sufficient number of samples of the class we want to identify, but we have many
samples of other classes. A good example of this application is in the field of speaker
verification, when we may have only a few voice samples of the person we are trying to
identify.
APPENDIX
REVIEW: CONVEX FUNCTIONS AND JENSEN'S INEQUALITY [4]
Definition (Convex set): A set C is convex if for every x, y ∈ C and every θ ∈ [0, 1],

θx + (1 − θ)y ∈ C.   (35)
Fig. 3. The hexagon on the left is a convex set; the set on the right is non-convex.
Definition 3 (Standard optimization problem): The problem

minimize    f_0(x)
subject to  f_i(x) ≤ 0,  1 ≤ i ≤ m
            h_j(x) = 0,  1 ≤ j ≤ k

describes the problem of finding the x that minimizes f_0(x) among all x that satisfy the
constraints. It is called a convex optimization problem if f_0, ..., f_m are convex functions
and h_1, ..., h_k are affine.
Definition 4 (Lagrangian) :
For a constrained optimization problem as seen in definition 3 we define the Lagrangian
L(x, λ, ν) = f_0(x) + \sum_{i=1}^{m} λ_i f_i(x) + \sum_{j=1}^{k} ν_j h_j(x)   (38)
The idea of the Lagrangian duality is to take the constrains into account by augmenting
the objective function with a weighted sum of the constraint functions.
∇_x L(x*, λ*, ν*) = 0
f_i(x*) ≤ 0, ∀i
h_j(x*) = 0, ∀j
λ*_i f_i(x*) = 0, ∀i
λ*_i ≥ 0, ∀i

where x* and (λ*, ν*) are primal and dual optimal points with zero duality gap. These
are called the Karush-Kuhn-Tucker (KKT) conditions.
You can find a more rigorous discussion of convex optimization in Boyd’s book on
convex optimization in chapters 1-5[4].
REFERENCES
[1] Andrew Ng's machine learning course, lectures on Unsupervised Learning, k-means clustering, Mixture of Gaussians and the EM Algorithm. http://cs229.stanford.edu/materials.html.
[2] Haim Permuter's Machine Learning course, Lecture 1. http://www.ee.bgu.ac.il/~haimp/ml/lectures.html.
[3] Arthur Dempster, Nan Laird, and Donald Rubin (1977), Maximum Likelihood from Incomplete Data via the EM Algorithm. https://www.jstor.org/stable/2984875.
[4] Stephen Boyd's Convex Optimization. https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf.
[5] Haim Permuter, Joseph Francos, Ian Jermyn, A study of Gaussian mixture models of color and texture features for image classification and segmentation, Section 3.4. http://www.ee.bgu.ac.il/~francos/gmm_seg_f.pdf.
[6] Haim Permuter, Joseph Francos, Ian Jermyn, A study of Gaussian mixture models of color and texture features for image classification and segmentation, Section 3.5. http://www.ee.bgu.ac.il/~francos/gmm_seg_f.pdf.
[7] Ramesh Sridharan's Gaussian mixture models and the EM algorithm. https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf.
Machine Learning
Lecture 3: Linear and Logistic Regression
Throughout this lecture we talk about how to use regression analysis in machine
learning problems. First, we introduce regression analysis in general. Then, we talk about
linear regression, and we use this model to review some optimization techniques that
will serve us in the remainder of the course. Finally, we will discuss classification using
logistic regression and softmax regression. Parts of this lecture are based on lecture notes
from Stanford's CS229 machine learning course by Andrew Ng [1]. This lecture assumes
you are familiar with basic probability theory. The notation here is similar to Lecture 1.
to be the amount of precipitation, we will choose a hypothesis that maps X_1, X_2 to the
actual precipitation amount. In this case, we refer to the problem as a regression problem.
When we define the dependent variable to be the "weather type" (clear, cloudy, rainy,
etc.), our hypothesis will use the independent variables to classify X_1, X_2 to the correct
group of weather types. In this case we refer to the problem as a classification problem.
Now that we have discussed the terminology, let's delve into a commonly used
regression model, linear regression.
In linear regression, as its name implies, we try to estimate the value of y with
a linear combination of the new feature vector x ∈ R^m with respect to our training
set {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(n)}, y^{(n)})}. Let's define h_θ(x) as the hypothesis
hθ (x) = θ0 + x1 θ1 + · · · + xm θm , (1)
where θ0 is the bias term. For convenience let’s define x0 = 1 so we can rewrite (1) as
h_θ(x) = \sum_{k=0}^{m} x_k θ_k = θ^T x.   (2)
Our goal is to find the linear combination coefficients θ that will yield the hypothesis
which maps x^{(i)} to y^{(i)} most accurately. In order to do that, we will have to assume that
the training set is representative of the whole problem. If that is not the case, we should
enrich our training set appropriately (if it is possible). Next, we need to choose θ that will
make h_θ(x^{(i)}) ≈ y^{(i)}, for all i = 1, . . . , n (the ≈ sign stands for approximately equal).
By achieving a hypothesis which accurately maps x^{(i)} to y^{(i)} for the entire training set,
combined with the assumption that our training set is representative of the whole problem,
we can say that our hypothesis is optimal, with the limits of a linear model.
Now, given the training set, how should we choose θ? In order to do that, we need to
define a function which measures the error between our hypothesis based on the x^{(i)}'s and
the corresponding y^{(i)}'s for all i = 1, . . . , n. Let's define a cost function that measures
the error of our hypothesis:

C(θ) = \frac{1}{2n} \sum_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2.   (3)
We can see that when hθ (x(i) ) ≈ y (i) for all i = 1, . . . , n , then our cost function
satisfies C(θ) ≈ 0. Moreover, every term in our cost function is positive, therefore C(θ) ≈
0 ⇐⇒ hθ (x(i) ) ≈ y (i) for all i = 1, . . . , n. These are two important properties, that lead
us to the conclusion that finding θ∗ that minimizes the cost function, eventually forces
hθ (x(i) ) ≈ y (i) for all i = 1, . . . , n. By achieving that, we are essentially accomplishing
our goal.
Let’s formulate our goal,
θ∗ = argmin C(θ). (4)
θ∈Rm+1
Note that if we find θ∗ , our linear regression model can be used on new sample x by the
following equation:
hθ (x) = (θ∗ )T x. (5)
Now, after formulating our problem, let’s survey some minimization methods to serve us
in minimizing our cost function.
In this section we discuss various ways of minimizing C(θ) with respect
to θ. The methods that we focus on are:
1) Gradient Descent:
   • Batch.
   • Stochastic (Incremental).
2) Newton-Raphson Method.
3) Analytically.
The gradient descent update rule is given by

θ^{(t+1)} := θ^{(t)} − α ∇_θ C(θ^{(t)}),   (6)

where:
• t = 0, 1, . . . is the iteration number.
• α is the "learning rate" and it determines how large a step the algorithm takes at
every iteration.
• ∇_θ is the gradient of C(θ) with respect to the parameters θ and is defined by:

∇_θ C(θ) = \left[ \frac{∂C(θ)}{∂θ_0}, \frac{∂C(θ)}{∂θ_1}, . . . , \frac{∂C(θ)}{∂θ_m} \right]^T
Intuitively, we can think of the algorithm as guessing an initial place in the domain.
Then, it calculates the direction and magnitude of the steepest descent of the function
for the current θ (∇θ C(θ)) and finally, in the domain, it takes a step, proportional to the
descent magnitude and to a fixed predefined step size α to the next θ. In case we want to
use the algorithm for maximization, we will exchange the '−' sign with a '+' sign and
then take a step in the direction of the steepest ascent of the function. In this case
the algorithm is called gradient ascent. Note that, generally, the algorithm searches for 'a'
minimum and not ’the’ minimum of the function. If we want to claim that the algorithm
finds the global minimum of the function we will need to prove that the function has
only one minimum, which means that the function is convex.
Here are some examples of gradient descent as it runs to minimize convex or non-convex
functions.
The leftmost figure shows us the value of f (θ0 ). The middle figure shows the algorithm
progress after three iterations and the rightmost figure shows the algorithm after achieving
convergence. We can visualize the fact that gradient descent applied on convex function
will converge to the global minimum of the function, assuming that α is relatively small.
Therefore, we can conclude that by choosing a convex cost function to our problems,
we can assure finding its global minimum by an iterative process such as gradient descent.
Luckily, the MSE cost function is a convex function with respect to θ so we can assure
that gradient descent will find its global minimum.
Now, let’s implement the gradient descent method on our linear regression problem.
For doing that, we have to calculate the term ∇θ C(θ). Let’s find an expression for the
j th coordinate of ∇θ C(θ) in our problem.
\frac{∂C(θ)}{∂θ_j} = \frac{∂}{∂θ_j} \frac{1}{2n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)})^2
            (a)= \frac{1}{2n} \sum_{i=1}^{n} \frac{∂}{∂θ_j} (θ^T x^{(i)} − y^{(i)})^2
            (b)= \frac{1}{2n} \sum_{i=1}^{n} 2 (θ^T x^{(i)} − y^{(i)}) x_j^{(i)}
            (c)= \frac{1}{n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)}) x_j^{(i)},   (7)
where in (a) we used the linearity of the derivative, and in (b) and (c) we differentiated and
simplified the expression. By inserting Equation (7) in Equation (6) we get the following
update rule, which is given by

θ_j^{(t+1)} := θ_j^{(t)} − \frac{α}{n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)}) x_j^{(i)},  ∀j.   (8)
Now let's understand the intuition of Eq. (8). First, for simplicity, let's examine it for
n = 1. We get θ_j^{(t+1)} := θ_j^{(t)} − α(ŷ^{(i)} − y^{(i)}) x_j^{(i)}, where ŷ^{(i)} is the estimate of y^{(i)} using the
linear regression model of the current iteration. Now, if we denote error^{(i)} = (ŷ^{(i)} − y^{(i)}),
then the update is θ_j^{(t+1)} := θ_j^{(t)} − α · error^{(i)} · x_j^{(i)}. Namely, θ_j^{(t)} is updated by a linear offset
that is proportional to the estimation error multiplied by the input. And if we have
n > 1, then we average over all the updates, so the noise is less effective at each
iteration.
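A short sketch of the batch update (8) follows (my own illustration, not from the notes; numpy and the toy linear data are assumed, and here the rows of the data matrix are training examples, unlike the design matrix defined later). Each iteration uses the whole training set to compute the gradient of the MSE cost.

```python
import numpy as np

# Assumed toy data: y = 2*x1 + 1 plus noise, with x_0 = 1 appended as the bias feature.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=100)
y = 2 * x1 + 1 + rng.normal(0, 0.1, size=100)
X = np.column_stack([np.ones_like(x1), x1])       # rows are [x_0, x_1] = [1, x1]

theta = np.zeros(2)
alpha, n = 0.1, len(y)
for _ in range(2000):
    # Batch update, Eq. (8): theta_j <- theta_j - (alpha/n) * sum_i (theta^T x_i - y_i) * x_ij
    gradient = X.T @ (X @ theta - y) / n
    theta -= alpha * gradient
print(theta)    # should be close to [1, 2]
```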
It can be easily verified that at every update, the algorithm uses the entire training set
to compute the gradient. For that reason we call this variant of gradient descent
batch gradient descent. Alternatively, we can use only one training example to update θ
at every iteration. The update rule in this case is given by

for i = 1 to n, {
    θ_j := θ_j − α (θ^T x^{(i)} − y^{(i)}) x_j^{(i)},  ∀j   (9)
}
In this case, we call the algorithm Stochastic Gradient Descent (or Incremental Gradient
Descent). Next, let's talk about our second method for optimizing C(θ). Our next method
uses the fact that finding an extremum of a function is equivalent to finding its derivative's
zeros.

θ^{(t+1)} := θ^{(t)} − \frac{f(θ^{(t)})}{f'(θ^{(t)})},   (10)
where:
• t = 0, 1, . . . is the iteration number.
• f ′ (θ) is the first derivative of f (θ).
First, to get some intuition about the algorithm's process, let's examine the particular case
where f(x) = \frac{1}{3} x^3. In this case the update rule is given by

x^{(t+1)} := x^{(t)} − \frac{1}{3} x^{(t)}.   (11)
The following figures present the algorithm's progress in the first iterations.
[Fig. 10: initial guess of the Newton-Raphson method; Fig. 11: 1 iteration of the Newton-Raphson method; Fig. 12: 5 iterations of the Newton-Raphson method.]
In the leftmost figure, we can see the initial guess for x, denoted by x_0. In the middle
figure, we can see how the algorithm calculates its next guess of x. First, it calculates the
tangent of f(x) at x_0, and then sets the next guess at the point where the tangent crosses
the x axis. We can think of that as estimating the zeros of f(x) as if f(x) were a linear
function. From this point of view let's derive the update rule.
The tangent's slope is the derivative of f(x) at x^{(t)}, so we can conclude that

f'(x^{(t)}) = \frac{f(x^{(t)}) − 0}{x^{(t)} − x^{(t+1)}}.   (12)
To use the method for minimizing the cost function, we substitute

x ↔ θ,   f(·) ≜ C'(·),   f'(·) ≜ C''(·).

Therefore, the update rule for minimizing the cost function will be

θ^{(t+1)} := θ^{(t)} − \frac{C'(θ^{(t)})}{C''(θ^{(t)})}.   (14)
In case θ is a vector we can generalize the update rule to be

θ^{(t+1)} := θ^{(t)} − H_θ^{-1} ∇_θ C(θ^{(t)}),   (15)

where
• H_θ is the Hessian matrix of C with respect to θ, whose entries are given by
  H_{i,j} = \frac{∂^2 C(θ)}{∂θ_i ∂θ_j}
• ∇_θ C(θ^{(t)}) is the gradient of C with respect to θ.
Note that for a quadratic cost function, the gradient is a linear function of θ and
therefore the algorithm will converge after one iteration (verify that you understand why
this is true). More generally, the N-R method converges faster than gradient descent (this will
not be proven here), but it is more computationally expensive. That is because at every
update the algorithm needs to find and invert the Hessian matrix (∼ m^3 operations).
Analytic Minimization
In some cases, an analytic, closed-form solution can be found for the θ* which
minimizes the cost function. This is not always the case, but for linear regression there
is an analytic solution. In order to find a solution we take the derivatives of the cost
function and set them to zero. For convenience, let's introduce some notation that
will ease our process of finding θ*.
Let X be the design matrix whose columns are the training examples, so that X_{ji}
represents the j-th feature of the i-th training example. X ∈ R^{(m+1)×n} is given by

X = [ x^{(1)}  x^{(2)}  · · ·  x^{(n)} ].
Now, let \vec{y} be the vector of the target values, such that (\vec{y})_i = y^{(i)}. This means that \vec{y} ∈ R^n
is given by

\vec{y} = [ y^{(1)}, y^{(2)}, . . . , y^{(n)} ]^T.
Next, given that h_θ(x^{(i)}) = (x^{(i)})^T θ, we can express the term h_θ(x^{(i)}) − y^{(i)} by

h_θ(x^{(i)}) − y^{(i)} = [ X^T θ − \vec{y} ]_i.
Combined with the fact that for any vector \vec{a} the property \vec{a}^T \vec{a} = \sum_i a_i^2 holds, we can
conclude that

\frac{1}{2n} (X^T θ − \vec{y})^T (X^T θ − \vec{y}) = \frac{1}{2n} \sum_{i=1}^{n} (θ^T x^{(i)} − y^{(i)})^2 ≜ C(θ).   (16)
X X^T θ* = X \vec{y}
⇓
θ* = (X X^T)^{-1} X \vec{y},   (17)

where the term (X X^T)^{-1} X is an (m+1)-by-n matrix and is called the pseudo-inverse of
X. So, by taking derivatives and setting them to zero, we can try to find θ* explicitly for
any model that maps x ∈ X to y ∈ Y, or any type of cost function. So, why should we
use an iterative process to find θ*? The first reason is that inverting a matrix can be a
very inaccurate computation on a computer due to rounding errors. The second reason
is that inverting an m-by-m matrix takes ∼ m^3 operations, and for large m it could be
more time-efficient to calculate θ* with an iterative process. The third reason is that it
is sometimes not possible to find a closed-form solution for θ* due to the complexity of
the cost function or the hypothesis.
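The closed-form solution (17) takes only a couple of lines (my own sketch, not from the notes; numpy and the toy data are assumed, and the training examples are stacked as columns of the design matrix, as in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(-1, 1, size=n)
y = 2 * x1 + 1 + rng.normal(0, 0.1, size=n)

X = np.vstack([np.ones(n), x1])                  # X in R^{(m+1) x n}, columns are examples
theta_star = np.linalg.solve(X @ X.T, X @ y)     # Eq. (17): theta* = (X X^T)^{-1} X y
print(theta_star)                                # close to [1, 2]

# Equivalently, via the pseudo-inverse of X^T (numerically more robust):
print(np.linalg.pinv(X.T) @ y)
```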
In the previous sections, when using linear regression on a regression problem,
we used some heuristic explanations for choosing the cost function to be the quadratic
cost function, and explained why we minimize it. Now, let's make some probabilistic
assumptions on our problem and try to back up our heuristic explanations with
probabilistic justifications; specifically, we will use the Maximum Likelihood principle
that we introduced in the first lecture.
Let's assume that our target variables, the y^{(i)}'s, are given by the equation

y^{(i)} = θ^T x^{(i)} + ε^{(i)},   (18)

where ε^{(i)} is the error, which can be interpreted as random noise, inaccuracy in
measurements, limitation of the linear model, etc. Moreover, let's assume that the ε^{(i)}'s
are i.i.d. (independently and identically distributed) and their distribution is given by

ε^{(i)} ∼ N(0, σ^2).   (19)
Now, let's examine the probability P(y^{(i)} | x^{(i)}; θ) (the semi-colon indicates that y^{(i)}
is parameterized by θ, since θ is not a random variable). Given x^{(i)} and θ, the term θ^T x^{(i)}
is deterministic, and hence P(y^{(i)} | x^{(i)}; θ) is a Normal distribution with expectation θ^T x^{(i)}
and variance σ^2, i.e., N(θ^T x^{(i)}, σ^2), ∀i. Let L(θ) be the likelihood function, which is
given by

L(θ) = p(\vec{y} | X; θ).   (20)

The likelihood function represents the probability of all the y^{(i)}'s given all the x^{(i)}'s. Intuitively it
measures how likely it is that y^{(i)} is the correct value for x^{(i)} for all i, under the probabilistic
assumptions we made. Since the entire training set is given, the likelihood function
depends exclusively on θ. Likewise, note that due to the fact that the ε^{(i)}'s are i.i.d., we have

L(θ) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ).
Note that the exact value of the error's variance, σ, doesn't affect the value of θ*. In
fact, it tells us that adjusting θ can't reduce the errors which are caused by the ε^{(i)}'s.
Finally, we can conclude that under the probabilistic assumptions we made earlier, we
derived the same goal: minimizing the term (23), which is exactly C(θ) as we defined
earlier.
V. LOGISTIC REGRESSION
Logistic regression, despite its misleading name, is used for binary classification
problems, which means that y^{(i)} can take values from the set {0, 1}. As we clarified in
the introduction section, the term "regression" is rooted in regression analysis and not
in machine learning's conventions. Note that for classification problems we cannot use
the linear regression model because in that case y could take any value in the interval
(−∞, ∞). Therefore, we should try another approach to solving this problem.
Let's adopt the probabilistic interpretation from the last section. First, we will make some
probabilistic assumptions about the problem and then we will use the maximum likelihood
criterion to adjust θ.
Assume that P (y (i)|x(i) ; θ) is Bernoulli(φ(i) ), where φ(i) is the Bernoulli parameter
for the ith training example. Therefore,
p(y^{(i)} | x^{(i)}; θ) = (φ^{(i)})^{y^{(i)}} (1 − φ^{(i)})^{1 − y^{(i)}}.
Now we use our hypothesis to estimate the probability that x^{(i)} belongs to a certain class.
Let's choose our hypothesis to be

φ̂^{(i)} = h_θ(x^{(i)}) = σ(θ^T x^{(i)}) = \frac{1}{1 + e^{−θ^T x^{(i)}}},   (24)
where σ(z) = \frac{1}{1 + e^{−z}} is called the sigmoid function or the logistic function. Note that
σ(z) outputs values in the interval (0, 1), such that

σ(z) → 1,  z → ∞
σ(0) = \frac{1}{2}
σ(z) → 0,  z → −∞.
Moreover, σ(z) is differentiable over R and its derivative satisfies the following property:

σ'(z) = −\frac{1}{(1 + e^{−z})^2} (−e^{−z})
      = \frac{1}{1 + e^{−z}} \cdot \frac{e^{−z}}{1 + e^{−z}}
      = \frac{1}{1 + e^{−z}} \left( 1 − \frac{1}{1 + e^{−z}} \right)
      = σ(z)(1 − σ(z)).   (25)
Our last assumption is that the training set was generated independently. Backed up
by these assumptions, let's adjust θ to maximize the classification accuracy over the
training set, hopefully yielding accurate results for unseen x. Let's use the maximum
likelihood criterion to adjust θ. The likelihood function in this case is given by

L(θ) = p(y^n | x^n; θ)
     = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; θ)
     = \prod_{i=1}^{n} (σ(θ^T x^{(i)}))^{y^{(i)}} (1 − σ(θ^T x^{(i)}))^{1 − y^{(i)}}.   (26)
As we stated in the previous section, we can equivalently maximize any strictly increasing
function of L(θ). So, to simplify our calculations, let's maximize the log likelihood
function

l(θ) = log \prod_{i=1}^{n} (σ(θ^T x^{(i)}))^{y^{(i)}} (1 − σ(θ^T x^{(i)}))^{1 − y^{(i)}}
     = \sum_{i=1}^{n} \left[ y^{(i)} log σ(θ^T x^{(i)}) + (1 − y^{(i)}) log(1 − σ(θ^T x^{(i)})) \right],   (27)
where the last equality follows from the fact that, for a fixed x,
−\sum_y p_{Y|X}(y|x) log p_{Y|X,θ}(y|x, θ) is the cross entropy H(p_{Y|X=x}, p_{Y|X=x,θ}). Hence
our objective is, for any x, to minimize H(p_{Y|X=x}, p_{Y|X=x,θ}). Recall that cross entropy
has the property

H(p, q) = H(p) + D(p||q),   (30)

hence in the objective we minimize, for any x, the divergence D(p_{Y|X=x} || p_{Y|X=x,θ}), where
p_{Y|X}(y|x) is the true conditional pmf and p_{Y|X,θ}(y|x, θ) is the one given by the model.
After finding the expression for the log likelihood function, we can use any of the
optimization methods we discussed earlier. Let's implement the batch gradient ascent
method to maximize the log likelihood function of the binary case, Eq. (27). To do so,
we will need to find ∇_θ l(θ):
\frac{∂l(θ)}{∂θ_j} = \sum_{i=1}^{n} \frac{∂}{∂θ_j} \left[ y^{(i)} log σ(θ^T x^{(i)}) + (1 − y^{(i)}) log(1 − σ(θ^T x^{(i)})) \right]
            (a)= \sum_{i=1}^{n} \left[ y^{(i)} \frac{σ'(θ^T x^{(i)})}{σ(θ^T x^{(i)})} x_j^{(i)} + (1 − y^{(i)}) \frac{−σ'(θ^T x^{(i)})}{1 − σ(θ^T x^{(i)})} x_j^{(i)} \right]
            (b)= \sum_{i=1}^{n} \left[ y^{(i)} (1 − σ(θ^T x^{(i)})) x_j^{(i)} − (1 − y^{(i)}) σ(θ^T x^{(i)}) x_j^{(i)} \right]
            (c)= \sum_{i=1}^{n} ( y^{(i)} − σ(θ^T x^{(i)}) ) x_j^{(i)},   (31)
Where in (a) we took derivatives, in (b) we used property (25), and in (c) we simplified
the expression. After finding the gradient, we can write the logistic regression update
rule:

θ_j^{(t+1)} := θ_j^{(t)} + α \sum_{i=1}^{n} ( y^{(i)} − σ(θ^T x^{(i)}) ) x_j^{(i)}.   (32)
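A minimal sketch of the update (32) (my own illustration, not from the notes; numpy, the toy data, and the averaging of the gradient for a stable step size are all assumptions): batch gradient ascent on the log likelihood of a binary logistic regression model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy binary data: label 1 when 2*x1 - 1 + noise > 0, with x_0 = 1 as bias feature.
rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
y = (2 * x1 - 1 + rng.normal(0, 0.3, size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x1), x1])       # rows are training examples

theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    # Eq. (32): theta_j <- theta_j + alpha * sum_i (y_i - sigma(theta^T x_i)) * x_ij
    # (here divided by the number of examples so the step size stays well behaved)
    theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)
print(theta)    # roughly proportional to [-1, 2], the assumed separating direction
```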
Next we will talk about how to generalize the logistic regression algorithm to a non-binary
classifier.
In the first part of this lecture we talked about linear regression and we introduced the
MSE cost function. The MSE cost function is widely used in applied mathematics and
machine learning. Using different cost functions will end up giving us different results
when using our optimization methods such as Gradient Descent for our regression and
classification problems.
In the second part of this lecture we introduced Logistic Regression and showed that
maximizing the log likelihood function resulted in achieving the Cross Entropy cost
function. The Cross Entropy cost function has an advantage over the MSE cost function
in some cases. In order to understand the advantage we shall first address a problem.
Assume we’re solving a Logistic regression problem, while using the logistic function
σ(θ^T x^{(i)}) and choosing our cost function to be the MSE cost function:

C(θ) = \frac{1}{2n} \sum_{i=1}^{n} ( h_θ(x^{(i)}) − y^{(i)} )^2
     = \frac{1}{2n} \sum_{i=1}^{n} ( σ(θ^T x^{(i)}) − y^{(i)} )^2.   (33)
Now differentiate C(θ) in order to use gradient descent:

\frac{∂C(θ)}{∂θ_j} = \frac{1}{n} \sum_{i=1}^{n} ( σ(θ^T x^{(i)}) − y^{(i)} ) σ'(θ^T x^{(i)}) x_j^{(i)}.   (34)
Now let's try a different approach. What if we could choose a cost function so that
the term σ'(θ^T x^{(i)}) disappeared? In that case, the derivative of the cost for a single training
example x would be:

\frac{∂C(θ)}{∂θ_j} = ( σ(θ^T x) − y ) x_j.   (36)
For simplicity, define z = θ^T x. By the chain rule, ∂C(θ)/∂θ_j = (∂C(θ)/∂σ(z)) σ'(z) x_j, so to
obtain (36) the cost function must satisfy

\frac{∂C(θ)}{∂σ(z)} = \frac{σ(z) − y}{σ(z)(1 − σ(z))} = \frac{−y}{σ(z)} + \frac{1 − y}{1 − σ(z)}.   (38)
Integrating this expression with respect to σ(z) gives us

C = −( y log σ(z) + (1 − y) log(1 − σ(z)) ) + const.   (39)

This is the contribution to the cost from a single example. To get the full cost function
we must average over all training examples, obtaining

C = −\frac{1}{n} \sum_x ( y log σ(z) + (1 − y) log(1 − σ(z)) ).   (40)
We achieved the Cross-Entropy cost function.
To summarize, we showed in this section that the Cross-Entropy cost function has an
advantage. When using cross-entropy cost function the larger the error (σ(z) − y), the
larger the change in θ. This means that the learning process will potentially accelerate.
Softmax regression is used when our target value y can take values from the discrete
set {1, 2, . . . , k}. In this case we assume that
p(y = 1 | x; θ) = φ_1
p(y = 2 | x; θ) = φ_2
   ⋮
p(y = k − 1 | x; θ) = φ_{k−1}
p(y = k | x; θ) = 1 − \sum_{i=1}^{k−1} φ_i = φ_k,   (41)
where 1{·} is the indicator function that returns 1 if the argument statement is true and
0 otherwise. Now we will estimate φ1 , . . . , φk−1 with the following functions
φ̂_1 = h_{θ^{(1)}}(x) = \frac{\exp((θ^{(1)})^T x)}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)}
φ̂_2 = h_{θ^{(2)}}(x) = \frac{\exp((θ^{(2)})^T x)}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)}
   ⋮
φ̂_{k−1} = h_{θ^{(k−1)}}(x) = \frac{\exp((θ^{(k−1)})^T x)}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)}
φ̂_k = h_{θ^{(k)}}(x) = \frac{1}{\sum_{i=1}^{k} \exp((θ^{(i)})^T x)},
where θ^{(i)} for all i = 1, . . . , k−1, and θ^{(k)} = \vec{0}, are the parameters used to generate
the hypotheses. Let Θ be the parameter matrix that contains θ^{(i)} for all i = 1, . . . , k, such
that

Θ = [ θ^{(1)}  θ^{(2)}  · · ·  θ^{(k−1)}  0 ].   (43)

Moreover, note that \sum_{i=1}^{k} φ̂_i = 1 and 0 ≤ φ̂_i ≤ 1, so the φ̂_i's form a probability distribution.
Next, let's use the maximum likelihood criterion on the log likelihood function. The log
likelihood is given by

l(Θ) = log p(y^n | x^n; Θ)
    (a)= log \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; Θ)
    (b)= log \prod_{i=1}^{n} h_{θ^{(1)}}(x^{(i)})^{1\{y^{(i)}=1\}} h_{θ^{(2)}}(x^{(i)})^{1\{y^{(i)}=2\}} · · · h_{θ^{(k)}}(x^{(i)})^{1\{y^{(i)}=k\}}
    (c)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = 1\} log h_{θ^{(1)}}(x^{(i)}) + · · · + 1\{y^{(i)} = k\} log h_{θ^{(k)}}(x^{(i)}) \right]
    (d)= \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} log h_{θ^{(q)}}(x^{(i)}),   (44)
where in (a) we used the assumption that the training set was generated independently,
in (b), (c) we used the logarithm properties, and in (d) we simplified the expression. Let's
substitute the hypothesis with its explicit function to get

l(Θ) = \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} log h_{θ^{(q)}}(x^{(i)})
     = \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} log \frac{\exp((θ^{(q)})^T x^{(i)})}{\sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)})}
  (a)= \sum_{i=1}^{n} \sum_{q=1}^{k} 1\{y^{(i)} = q\} \left[ log \exp((θ^{(q)})^T x^{(i)}) − log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (b)= \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − \sum_{q=1}^{k} 1\{y^{(i)} = q\} log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (c)= \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right],   (45)
where in (a), (b) we used logarithm properties and in (c) we used the fact that the indicator
function returns 1 only once for every training example. Now, let's find the derivative of
l(Θ) with respect to θ_j^{(r)} to form our gradient ascent update rule that will maximize our
log likelihood function:
\frac{∂l(Θ)}{∂θ_j^{(r)}} = \frac{∂}{∂θ_j^{(r)}} \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (a)= \sum_{i=1}^{n} \left[ \sum_{q=1}^{k} \frac{∂}{∂θ_j^{(r)}} 1\{y^{(i)} = q\} (θ^{(q)})^T x^{(i)} − \frac{∂}{∂θ_j^{(r)}} log \sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)}) \right]
  (b)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} x_j^{(i)} − \frac{\exp((θ^{(r)})^T x^{(i)})}{\sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)})} x_j^{(i)} \right]
  (c)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} − \frac{\exp((θ^{(r)})^T x^{(i)})}{\sum_{p=1}^{k} \exp((θ^{(p)})^T x^{(i)})} \right] x_j^{(i)}
  (d)= \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} − p( y^{(i)} = r | x^{(i)}; Θ ) \right] x_j^{(i)},   (46)
Where in (a) we used the linearity of the derivative, in (b) we took derivatives, and in
(c), (d) we simplified the expression. Finally we can write our batch gradient ascent
update rule to be

θ_j^{(r)} := θ_j^{(r)} + α \sum_{i=1}^{n} \left[ 1\{y^{(i)} = r\} − p( y^{(i)} = r | x^{(i)}; Θ ) \right] x_j^{(i)},   (47)
so the gradient stops adjusting Θ when p(y^{(i)} = r | x^{(i)}; Θ) is close to 1 for the correct
class and close to 0 for the incorrect classes, which confirms that our result makes sense.
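A compact sketch of the softmax update (47) follows (my own illustration, not from the notes; numpy, the toy three-class data, and the averaging of the gradient are assumptions). The update is vectorized over all classes, and the last parameter vector θ^{(k)} is held at zero as in (43).

```python
import numpy as np

def softmax_probs(Theta, X):
    """p(y = r | x; Theta) for every example (rows of X) and class r."""
    scores = X @ Theta                              # shape [n, k]; last column uses theta^(k) = 0
    scores -= scores.max(axis=1, keepdims=True)     # numerical stabilization
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# Assumed toy data: 3 classes separated along one feature, bias feature x_0 = 1 included.
rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=300)
y = np.digitize(x1, [-1.0, 1.0])                    # labels 0, 1, 2
X = np.column_stack([np.ones_like(x1), x1])
n, k = len(y), 3

Theta = np.zeros((2, k))                            # columns are theta^(1), ..., theta^(k)
Y = np.eye(k)[y]                                    # one-hot indicators 1{y_i = r}
alpha = 0.5
for _ in range(2000):
    # Eq. (47): theta^(r) <- theta^(r) + alpha * sum_i (1{y_i = r} - p(y_i = r | x_i)) x_i
    grad = X.T @ (Y - softmax_probs(Theta, X)) / n  # averaged for a stable step size
    grad[:, -1] = 0.0                               # keep theta^(k) fixed at zero, Eq. (43)
    Theta += alpha * grad
print((softmax_probs(Theta, X).argmax(axis=1) == y).mean())   # training accuracy
```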
REFERENCES
[1] Andrew Ng's machine learning course, lecture on Supervised Learning. http://cs229.stanford.edu/materials.html.
Machine learning
Lecture 4: Neural Networks
Throughout this lecture we introduce Neural Networks, starting from a single neuron
and ending with the Backpropagation method. Most of the material for this lecture is
based on the online book of Michael Nielsen [1].
I. NEURAL NETWORKS
Neural networks are limited imitations of how our own brains work. They have had a big
recent resurgence because of advances in computer hardware. There is evidence that the
brain uses only one "learning algorithm" for all its different functions. At a very simple
level, neurons are basically computational units that take inputs (dendrites) as electrical
signals (called "spikes") that are channeled to outputs (axons).
Neural neworks are typically organized in layers. Layers are made up of a number of
interconnected ’nodes’ which contain an ’activation function’. Patterns are presented to
the network via the ’input layer’, which communicates to one or more ’hidden layers’
where the actual processing is done via a system of weighted ’connections’. The hidden
layers then link to an ’output layer’, which is the output of the network.
Neural Networks can be applied to many problems, such as function approximation,
classification, regression, data processing, etc. We will start by looking at a single neuron,
define its model, and then combine neurons into a complete network.
A neuron is depicted in Fig. 1. The Neuron has k inputs, x1 , x2 , ..., xk , a set of weights
w1 , w2 , ..., wk corresponding to the inputs, a bias b and an activation function σ(z) that
produces a single output y. The output of the neuron y is determined by
y = σ(z),
z = \sum_{i=1}^{k} w_i x_i + b.   (1)
[Fig. 1: a single neuron with inputs x_1, x_2, . . . , x_k, weights w_1, w_2, . . . , w_k, and output y = σ(z).]
There are many different options for the choice of the activation function σ(z). A few
of them are given below and are depicted in Fig. 2.
• σ(z) = \frac{1}{1 + e^{−z}} - sigmoid function
• σ(z) = sign(z) - sign function
• σ(z) = tanh(z) = \frac{e^z − e^{−z}}{e^z + e^{−z}} - hyperbolic tangent
• σ(z) = max(0, z) - rectified linear unit (RLU)
[Fig. 2: the logistic, sign, tanh, and RLU activation functions σ(z) plotted as functions of z.]
As we mentioned before, a Neural Network is organized in layers, where the first layer
contains the inputs of the network, the last layer is the output of the network, and the
layers in between are called hidden layers. Each layer gets its inputs from the layer
before, and passes its outputs to the next. We call this step forward propagation. Fig. 3
depicts a fully connected neural network with an input layer, a hidden layer and an output layer.
It is called fully connected because the neurons in each layer are connected to all
the neurons in the next layer.
[Fig. 3: a fully connected network with inputs x_1, . . . , x_{K_1}, hidden activations a^l_j, weights w^l_{jk}, and outputs a^L_1, . . . , a^L_{K_L}.]
The parameters and variables that define a neural network are the following:
• w^l_{jk} - the weight for the connection from the k-th neuron in the (l − 1)-th layer to the
  j-th neuron in the l-th layer.
• b^l_j - the bias we add to the j-th neuron in the l-th layer.
• K_l - the number of neurons in the l-th layer.
• z^l_j = \sum_{k=1}^{K_{l−1}} w^l_{jk} a^{l−1}_k + b^l_j
• a^l_j = σ(z^l_j)
By organizing our parameters in matrices and using matrix-vector operations, we can take
advantage of fast linear algebra routines to quickly perform calculations in our network.
• z^l = [z^l_1, z^l_2, . . . , z^l_{K_l}]^T
• b^l = [b^l_1, b^l_2, . . . , b^l_{K_l}]^T
• w^l = \begin{pmatrix} w^l_{11} & w^l_{12} & \cdots & w^l_{1K_{l−1}} \\ w^l_{21} & w^l_{22} & \cdots & w^l_{2K_{l−1}} \\ \vdots & \vdots & \ddots & \vdots \\ w^l_{K_l 1} & w^l_{K_l 2} & \cdots & w^l_{K_l K_{l−1}} \end{pmatrix}
• a^l = [a^l_1, a^l_2, . . . , a^l_{K_l}]^T
• z^l = w^l a^{l−1} + b^l
• a^l = σ(z^l)
• σ([x_1, . . . , x_n]^T) = [σ(x_1), . . . , σ(x_n)]^T
[Fig. 4: the XOR function over inputs x_1, x_2 ∈ {0, 1}; the 0 and 1 outputs cannot be separated by a single line.]
Example 1 (XOR function) One neuron is able to separate two sets only by a linear
separation. Now consider the XOR function that we are all familiar with, depicted
in Fig. 4. It is easy to see that the XOR function cannot be approximated using a linear
function (there is no line that could separate the two groups of answers). But now we
will show that it is possible to approximate the XOR function using a Neural Network.
We build a Network as depicted in Fig. 5 with three inputs, x1 , x2 , and the third set to
’1’, a hidden layer of two neurons, and a single neuron output layer. All biases are set
to zero. And the set the weights that we choose is given in Fig. 5.
x1 a21 a31
1 1
1
−2
1
x2 a22
1
0
−1
We choose the activation function to be the RLU function. Let’s take the input
[x1 , x2 ] = [0, 0] and insert it to the network:
a21 = max(0, 1 ∗ x1 + 1 ∗ x2 + 0 ∗ 1) = 0
a22 = max(0, 1 ∗ x1 + 1 ∗ x2 + −1 ∗ 1) = 0
The same goes for the other options for the inputs: [0, 1], [1, 0], [1, 1]:
We can see that the output a31 fits the XOR function perfectly. Note that the output of
each layer is not dependant pof previous layers, only it’s own inputs and weights.
C. Cost Function
A very important part of defining the Neural Network is the goal that it is designed
to achieve. Hence, one need to define a cost function that quantify how close we’re to
achieving the goal. The cost function is a measure of how close is the output of the neural
network to the desired label. For instance a mean square error function (or a quadratic
cost function) is define as a cost function as follows:
N
1 X
C(w, b) = ka(xi ) − yi k2 , (2)
2N i=1
where x are the input vectors and y are the corresponding labels. The input and the label
are determined by the problem setting, hence are fixed. The parameters w and biases b
are determined by the neural network, hence we can consider that the cost function is a
function of the weights w and biases b.
Our main goal is to minimize the cost function, so that the the output from the network
will be close to the desired output as possible. To minimize the cost function, we will use
a method called Gradient Decent. Gradient descent is a iterative optimization algorithm.
It says that to find a local minimum of a function, one should takes steps proportional
to the negative of the gradient (or of the approximate gradient) of the function at the
current point.
D. Backpropagation
where the sum is over all neurons k in the output layer. Of course, the output activation
aLk of the k t h neuron depends only on the input weight zjL for the j t h neuron when k = j.
∂aL
And so k
∂zjL
vanishes when k 6= j. As a result we can simplify the previous equation to
∂C ∂aLj
δjL = L L. (8)
∂aj ∂zj
4 - Neural Networks-8
Recalling that aLj = σ(zjL ), the second term on the right can be written as σ ′ (zjL ), and
the equation becomes
∂C ′ L
δjL = σ (zj ). (9)
∂aLj
We can rewrite the equation in a matrix-based form, as
δ L = ∇a C ⊙ σ ′ (z L ). (10)
∂C
Here, ∇a C is defined to be a vector whose components are the partial derivatives ∂aL
.
j
Differentiating, we obtain
∂zkl+1 l+1 ′ l
= wkj σ (zj ). (15)
∂zjl
Substituting back into (13) we obtain
X
δjl = l+1 l+1 ′ l
wkj δk σ (zj ). (16)
k
In a matrix-based form,
T
δl = w l+1 δ l+1 ⊙ σ ′ (z l ), (17)
T
where w l+1 is the transpose of the weight matrix w l+1 for the (l + 1)th layer.
4 - Neural Networks-9
By combining (17) with (10) we can copmute the error δ l for any layer in the network.
We start by using (10) to compute δ L , then apply Equation (17) to compute δ L−1 , then
Equation (17) again to compute δ L−1 , and so on, all the way back through the network.
Now that we have the errors δjl of all the layers of the network, we can compute the
∂C ∂C
partial derivatives l
∂wjk
and ∂blj
as a function of δjl :
∂C ∂C ∂zjl
l
= = δjl akl−1 , (18)
∂wjk ∂zjl ∂wjk
l
∂C ∂C ∂zjl
l
= l l
= δjl . (19)
∂bj ∂zj ∂bj
For each iteration we use (3) and (4) and compute the new values of the parameters.
To summerize the backpropagation algorithm:
1) Perform a feedforward pass, computing the activations for layers 2,3, and so on up
to the output layer L.
2) For each output unit j in layer L (the output layer), set
∂C ′ L
δjL = σ (zj ). (20)
∂aLj
3) For layers l = L − 1, L − 1, ..., 2, for each node j in layer l, set
X
l+1 l+1 ′ l
δjl = wkj δk σ (zj ). (21)
k
R EFERENCES
[1] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.
5-1
Machine learning
Lecture 5
Lecturer: Haim Permuter Scribe: Gal Rattner
I. I NTRODUCTION
6
y
0 1 2 3 4 5 6 7 8 9 10
x
C = C0 + λ kwkL , (1)
Among the most common L-norm regularization methods are the L2 and L1 regularization
terms.
L2 regularization: Using the L2 regularization term, the total cost function is given
by
λ
C = C0 + kwk22
2n
5-3
λ X 2
= C0 + w , (2)
2n i i
where C0 can be the quadratic, cross-entropy or other cost function, wi are the weights in
the net, and the hyper-parameter λ scale the regularization term to have gentle influence
on the total cost C, and holds λ > 0. The influence of the L2 regularization term is clear
when observing the partial derivatives of the cost function in equation (2), i.e.
∂C ∂C0 λ
= + w, (3)
∂w ∂w n
and
∂C ∂C0
= . (4)
∂b ∂b
The partial derivatives can be computed using backpropagation, as described in lecture
4, and the update rules are given by
∂C0
b→b−η , (5)
∂b
and
∂C0 ηλ
w →w−η − w
∂w n
ηλ ∂C0
= 1− w−η . (6)
n ∂w
This update rule is the same as the usual gradient descent update rule, except we rescale
ηλ
the weights first by factor n
. This rescaling factor is sometimes referred to as the weight
decay factor.
L1 regularization: Using the L1 regularization term the total cost argument is given
by
λX
C = C0 + |wi |, (7)
n i
where λ > 0.
The partial derivatives of the cost function with respect to w is given by
∂C ∂C0 λ
= + sign(w). (8)
∂w ∂w n
5-4
∂C0 ηλ
w →w−η − sign(w). (9)
∂w n
We notice that both L2 and L1 regularization are penalizing large weights, and causing the
weights to decrease toward zero. The difference is in the rate of decrement, where with
L1 regularization the decrement rate is constant, with L2 regularization the decrement
rate is proportional to the size of the weight w.
III. D ROPOUT
A very common regularization method of deep neural network which have proven its
improvement skills is the dropout, as presented in [4]. Unlike the regularization methods
described above, dropout modify the network’s connections rather than the cost function.
The way dropout works is by setting each activation node to zero with some prefixed
probability p ∈ [0, 1], for each training iteration. The hyper-parameter p is generally set
before the training session starts. On that way, for an input or hidden layer with Nl
nodes, at each training iteration we get an average of only p × Nl active nodes in the
net. Notice that we do not drop any node at the net’s output layer. In order to keep the
total sum of activations coming out of each layer constant, we multiply any un-dropped
activation by factor 1p . Repeating this method over and over during the training operation,
the deep neural network will learn a set of weights and biases which generalize better
than the regular training.
The motivation of using this method can be given in three main points:
Figure 3. Fully connected neural network after applying dropout on the hidden layer.
R EFERENCES
Machine Learning
This lecture discusses decision trees and the minimum description length principle. the
goal of desicion trees is to have a series of questions(tree), that ends in decisions(leaves).
Discussed as a regulating mechanism, the MDL shows that the best hypothesis for a
given set of data is that which leads to the best compression of the data. The last part of
the lecture discuses two methods for avoiding overfitting: Random Forest and Pruning.
I. D ECISION T REE
Here we have a method in which we create a k’-nary tree. Each node represents a
question about the features, Θn . Each branch represents an answer to its origin node,
one of k options. Each leaf represents a decision. A decision tree is easily implemented
in a variety of applications and it is useful for both classification and regression. This
lecture introduces the notation of the binary decision tree, but extending it to k’-nary is
straightforward. Let {xi } be the i’th sample, i = 1, 2, .., m, each of which has n features,
and Θ is the question asked in each node. Note that a large part of this section is taken
from the book and lecture by Prof. Shai Shalev-Shwartz[1], [2].
Firm but
not hard
Tasty
Pale green to
light yellow
Softness
Not tasty
Other
Fig. 1. Classic Decision tree for papaya classification,using two features: color and softness.
A. Cost functions
Let {xi,j } be the j’th feature of the i’th sample, and {yi }i=1..m be the classifications
(labels). Define gain, where C(x) is the cost function, as
X
Gain(Y ; Θ) = C(Y ) − C(Y |Θ) = C(Y ) − P (Θ = θ)C(Y |Θ = θ) (1)
θ∈Θ
Cost functions
1
H(a)
0.9 G(a)
0.8
E(a)
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
a
Gain with entropy as the cost is called information gain, and it is widely used in
applications. Note that information gain is the mutual information we learned about
in the first lecture of the course.
Red
θ=1 Tasty 48
Not tasty 12
Color
Tasty 2
Otherwise Not tasty 38
θ=0
2 2 3 12
Inf ormation Gain = H(Y ) − H(Y |Θ) = 1 − Hbinary ( ) − Hbinary ( ) (3)
5 40 5 60
Set cost to be the Gini index and we have
2 2 3 12
Gain(Y ; Θ) = 1 − G( ) − G( ) (4)
5 40 5 60
B. Algorithm
ID3(S,A)
create a new node;
if all samples are ones then
label new node as leaf labeled 1;
else
if all samples are zeros then
label new node as leaf labeled 0 ;
else
if A ∈ ∅ then
label new node as leaf labeled as the majority of labels in S;
else
p = argmax G(S, 1xa =1 ) ;
a∈A
C. Regression
Like with classification problems, decision trees can also be simply applied in
regression problems. Generally speaking, we substitute the label decision with the average
or mean. So instead of asking questions with k answers, we ask about k ranges. And
instead of taking the majority of the labels we have the average or mean of the leaf’s data.
For example when putting a price on a new house, its estimated value will be the mean
of the values of all houses with the same features (i.e. identical answers to questions on
the tree).
6-Decision Tree & MDL-5
II. MDL
s
− log(w(h)) + log( 2δ )
∀h ∈ H, LD (h) ≤ LS (h) +
2m
Or alternately
s
− log(w(h)) + log( 2δ )
∀h ∈ H, Pr Ld (h) − LS (h) ≤ ≥1−δ
2m
6-Decision Tree & MDL-6
The proof of the MDL bound is based mainly on Hoeffding’s inequality, wherein setting
some free parameters we bring the inequality to seen like the bound we need. From here
on the proof is straightforward.
Or similarly:
m
!
1 X 2mǫ2
Pr | Xi − µ| > ǫ ≤ 2 exp − Pn 2
(7)
m i=1 i=1 (bi − ai )
s
m 2 log( δ2 )
1 X log( δn ) 2m 2m h
Denote that
m
" # m
1 X 1 X
E[LS ] = E 1{h(xi )6=yi } = E[1{h(xi )6=yi } ] = Pr(Y 6= h(X)) = LD . (11)
m i=1 m i=1
So for a fixed h
s
− log(w(h)) + log( 2δ )
∀h ∈ H Pr |LS (h) − LD (h)| > ≤ δh (12)
2m
6-Decision Tree & MDL-7
Hence, we have
s
− log(w(h)) + log( 2δ )
Pr |LD − LS | ≤ ≥1−δ (14)
2m
When exploiting the MDL bound to optimize a decision tree, the most common method
employed is pruning. The simple algorithm suggests testing each node of the tree with
the following statement, which we define as a cost for the algorithm:
s
− log(w(h)) + log( 2δ )
Cost(h) = . (15)
2m
Now all that’s left is to calculate the cost with and without each node. If the cost difference
is significant, prune (i.e. delete the node). Pruning will be farther discussed in the next
section. An alternative approach is to use the MDL bound to set the tree’s optimal depth,
and then to use a modified algorithm to construct the tree by using combinations of
features.
A. Random forest
The general method of random decision forests was first proposed in 1995 by Ho[7]
who established that if forests of trees split with oblique hyperplanes, are randomly
restricted to be sensitive to only selected feature dimensions, they can gain accuracy
as they grow without suffering from over-training. Random forests are an ensemble
learning method for classification, regression and other tasks that operate by constructing
a multitude of decision trees at training time. The output of the trees comprises the
class, i.e. the mode (the value that appears most often), of the classes for classification
problems, or the mean prediction of the individual trees for regression. Random decision
forests correct the habit of decision trees to overfit to their training set.[5]
6-Decision Tree & MDL-8
1) Algorithm: The idea behind random decision forests is to train many small trees
with subsets of the samples. For each tree, we randomly select n samples of all the
samples, for those n samples we choose (again randomly) k features that will be used to
train the specific tree. When we have the new subsets, we should use the ID3 algorithm
and Feedforward for each tree. For classification, we take the majority of all trees to
make a decision, whereas for regression problems, we calculate the average of all trees
to reach a decision. The reason for the randomness is to decrease the correlation between
trees, and in so doing, to contribute to gains in accuracy. For example, if one or a few
features are very strong predictors for the response variable (the label), these features
will be selected in many of the B trees, causing them to become correlated.
√
Typically, for a classification problem with p features, p (rounded down) features are
p
used in each split. For regression problems the inventors recommend 3
(rounded down)
with a minimum node size of 5 as the default.
B. Pruning
The optimal final tree size in a decision tree algorithm is not a trivial matter. A tree that
is too large risks overfitting the training data, thus rendering it poorly generalizable to
new samples. Conversely, a tree that is too small might not capture important structural
information about the sample space. Determining when a tree algorithm should stop,
however, is difficult because one cannot know whether the addition of a single extra
node will dramatically decrease error. This problem, which is known as the horizon
6-Decision Tree & MDL-9
effect,, is commonly addressed by growing the tree until each node contains a small
number of instances and then by using pruning to remove nodes that do not provide
additional information. Pruning should reduce the size of a learning tree without reducing
its predictive accuracy as measured by a cross-validation set. There are many techniques
for tree pruning that differ in terms of the measurement that is used to optimize
performance.[6]
1) Techniques: Generally speaking, tree pruning techniques can be classified as either
top down - traverse nodes and trim subtrees starting at the root, or up - start at the
leaf nodes. Each pruning technique has advantages and weaknesses that, if known and
understood, can aid us in selecting the most appropriate method in every instance.Here we
will learn two popular algorithms, one each from the top-down and bottom-up approaches.
• {Ti }i=0,...,m - series of trees where T0 is the original tree and Tm is the root
alone.
• Error Rate of a tree - err(T, S) where T is the current tree and S is the
data set.
• Function - prune(T, t) - defines the resultant tree after removal of subtrees t
from T .
• Function - leaves(T ) - defines the quantity of leaves at tree T .
At each step,i, the tree is created by removing a subtree t from tree i − 1 and
replacing it with a leaf node whose value is chosen as in the tree building algorithm.
First we find the error rate err(Ti , S) for the current tree Ti and then we search
err(prune(Ti ,t),S)−err(Ti ,S)
for the subtree that minimizes the expression: |leaves(Ti )|−|leaves(prune(Ti ,t))|
Once the series of trees has been created, the best tree is chosen base on generalized
accuracy as measured by a training set or by cross-validation.
2) Reduced error pruning (bottom-up) -
Starting at the leaves, each node is replaced with its most popular class. If the
prediction accuracy is not affected then the change is kept. Very simple to apply,
6-Decision Tree & MDL-10
IV. A PPENDIX
Sn = X1 + X2 + ... + Xn . (17)
We have:
2t2
P (|Sn − E[Sn ]| ≥ t) ≤ 2 exp − Pn 2
(18)
i=1 (bi − ai )
Lemma 2 (Hoeffding Lemma) Suppose X is a real random variable with mean equal
to zero such that P (X ∈ [a, b]) = 1. Then
1
E[esX ≤ exp( s2 (b − a)2 ).
8
Another inequality we need is
Lemma 3 (Markov’s inequality) Let X be a non negative r.v. and a > 0, then:
E[X]
P (X ≥ a) ≤ .
a
Proof 2 Suppose X1 , X2 , ..., Xn are n independent r.v. such that
P (Xi ∈ [ai , bi ]) = 1, 1 ≤ i ≤ n.
Let
Sn = X1 + X2 + ... + Xn .
−st s(Sn −E[Sn ])
≤e E e (20)
n
Y
−st s(Xi −E[Xi ])
=e E e (21)
i=1
n
Y s2 (bi −ai )2
≤ e−st e 8 (22)
i=1
n
!
1 2X
= exp −st + s (bi − ai )2 . (23)
8 i=1
Now, for us to obtain the best possible upper bound, we need to find the minima of
g
right-hand side of the last inequality as a function of s. Define g : R+ −
→ R as
n
s2 X
g(s) = −st + (bi − ai )2
8 i=1
Note that g(s) is a quadric function, and hence, its minimum is at
4t
s = Pn 2
.
i=1 (bi − ai )
Thus we get
2t2
P (Sn − E[Sn ] ≥ t) ≤ exp − Pn 2
(24)
i=1 (bi − ai )
2t2
P (|Sn − E[Sn ]| ≥ t) ≤ 2 exp − Pn 2
(25)
i=1 (bi − ai )
Refer to[4] for further reading and better understanding about this subject.
R EFERENCES
[1] Shai Shalev-Shwartz. Understanding Machine Learning course, Lecture No.4,
Hebrew University of Jerusalem, 2014.
[2] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms
May,2015.
[3] Quinlan, J. R. 1986. Induction of Decision Trees
http://hunch.net/∼coms-4771/quinlan.pdf Netherlands, 1986
[4] Hoeffding, Wassily. ”Probability inequalities for sums of bounded random variables”. Journal of the American
Statistical Association ,Vol. 58, No. 301 (Mar., 1963), pp. 13-30
[5] Wikipedia https://en.wikipedia.org/wiki/Random forest
[6] Wikipedia https://en.wikipedia.org/wiki/Pruning (decision trees)
[7] Ho, Tin Kam (1995). Proceedings of the 3rd International Conference on Document Analysis and Recognition,
Montreal, QC, 1416 August 1995. pp. 278282. http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf
7: Tree distribution-1
This lecture discusses tree distribution and the method of types. We will introduce a
method developed by Chow and Liu [1] to fit the best tree to the data, wherein the ”Best
tree” refers to the tree that have the minimum divergence. We will use some principles that
originated in the field of the method of types, in order to show that minimum divergence
is equivalent to maximum likelihood. The method of types introduced here was fully
developed by Csiszar and Korner [2], who derived the main theorems of information
theory from this perspective.
Z Y
U W V
Fig. 1. A tree with distributions of the structure Pt (x, y, z, u, v, w) = P (x)P (y|x)P (z|x)P (u|z)P (w|z)P (v|y).
Chow and Liu used the tree distribution which have only one father for every node
(excluding the root), they assumed that the criteria of the optimal tree is minimum
divergence, and they proved that the minimum divergence achieved by maximum mutual
information over the air of variables. Note that a tree structure has a much smaller number
of parameters (linear in the number of nodes) when compared to the exponentially many
parameters needed for a general distribution. We get less complex sparse approximator
when we use the tree structure but we also get less accuracy compared to the general
distribution.
Assuming we have information represented as follows:
x1 y1 z 1 w 1
x2 y2 z 2 w 2
.. .. .. ..
. . . .
xn yn z n w n
by Csiszar and Korner [2], who derived the main theorems of information theory from
a method of types perspective.
Let xn = (x1 , x2 , ..., xn ) be a sequence from the alphabet X = (a1 , a2 , a3 , ...a|X | ). Let
N (a|xn ) be the number of times that a appears in sequence xn .
Definition 2 (Type) The type Pxn (or empirical probability distribution) of a sequence
N (a|xn )
xn is the relative proportion of occurrences of each symbol of X , i.e., Pxn (a) = n
for all a ∈ X .
Example 1 Let X = {0, 1, 2}, let n = 5 and x5 = (1, 1, 2, 2, 0). Then N (0|x5 ) = 1,
N (1|x5 ) = 2 and N (2|x5 ) = 2. Hence, Pxn = 51 , 25 , 52 .
Definition 3 (All possible types) Let Pn be the collection of all possible types of
sequences of length n.
For example, if X = {0, 1}, the set of possible types with denominator n is
0 n 1 n−1 n 0
Pn = (P (0), P (1)) : , , , , ..., , . (3)
n n n n n n
Lemma 1 An upper bound for |Pn |:
Proof:
There are |X | components in the vector that specifies Pxn . The numerator in each
component can take on only n + 1 values. So there are at most (n + 1)|X | choices
for the type vector.
Definition 4 (Type class) Let P ∈ Pn . The set of sequences of length n with type P is
called type class of P, denoted T (P ):
Proof:
Since {Xi }i≥1 are i.i.d,
n
Y
Qn (xn ) = Q(xi ). (7)
i=1
Now consider
n
X
log Qn (xn ) = log Q(xi ) (8)
i=1
(a) X
= N (a|xn ) log Q(a) (9)
a∈X
(b) X
=n Pxn (a) log Q(a) (10)
a∈X
X Q(a)
=n Pxn (a) log · Pxn (a) (11)
a∈X
Pxn (a)
where
(a) follows because each a ∈ X contributes exactly log Q(a) times it’s number of
occurrences in xn to the sum in (8).
(b) follows from the definition of Pxn (a).
Hence, we obtained
where xn1 , xn2 ,. . . , xnm are the random processes. And QX1 ,X2 ,...,Xm is the joint distribution
function.
For more information on method of types, see a whole lecture on the subject:
http://www.ee.bgu.ac.il/˜haimp/multi2/lec1/lec1.pdf
7: Tree distribution-5
We want to find the tree distribution that have the maximum probability (maximum
likelihood principle). We saw in the last section about method of types, that is achieved
by minimum divergence. From equation (13):
Pt (xn1 , xn2 , ..., xnm ) = 2−n(H(Pemp )+D(Pemp ||Pt (x1 ,x2 ,...,xm ))) , (15)
where Pemp = Pxn1 ,xn2 ,...,xnm (x1 , x2 , ..., xm ). Since the empirical entropy H(Pemp ) does not
depend on the selected tree but only on samples, then to obtain the maximum probability,
we should look for the minimum of the divergence D(Pemp ||Pt (x1 , x2 , ..., xm )).
X X Pemp
D Pemp ||Pt (x1 , x2 , ..., xm ) = ... Pemp log
x ∈X x ∈X
Pt (x1 , x2 , ..., xm )
1 1 m m
X X
= H(Pemp ) − ... Pemp log Pt (x1 , ..., xm ). (16)
x1 ∈X1 xm ∈Xm
In a tree distribution, it can be said that each node in the tree depends only on its parent
and not on any other node in the tree.
m
Y
Pt (x1 , x2 , ..., xm ) = Pt (xi |xj(i) ), (17)
i=1
where xj(i) is the parent of xi , see Fig 1.
The Chow and Liu algorithm of creating a tree based on the assumption that minimum
divergence is the criteria of the optimal tree. We will define the next lemma of minimum
cross entropy for the following proof which show that the minimum divergence is
equivalent to maximum sum of the mutual information.
Lemma 2 (Minimum cross entropy) The minimum by q of the cross entropy between
(p, q) is equal to the entropy of p
and it is achieved by q = p.
Proof:
X
min H(p, q) = min − p(x) log q(x)
q q
x
7: Tree distribution-6
X q(x)p(x)
= min − p(x) log
q
x
p(x)
X p(x) X
= min p(x) log − p(x) log p(x)
q
x
q(x) x
where (a) follows from the fact that KL divergence D(p||q) is always non-negative and
it get zero if and only if q = p.
Theorem 2 (Chow and Liu results [1] ) The minimum divergence achieved by maxi-
mum sum of mutual information, and it given by,
m
X
min D Pemp ||Pt (x1 , x2 , ..., xm ) = const − I(Xi , Xj(i) ),
Pt
i=1
Proof:
D Pemp ||Pt (x1 , x2 , ..., xm )
(a) X
= H(Pemp ) − Pemp (xm ) log Pt (x1 , ..., xm )
x1 ∈X1 ..xm ∈Xm
m
(b) X
m
Y
= H(Pemp ) − Pemp (x ) log Pt (xi |xj(i) )
xm i=1
m
X Y Pt (xi |xj(i) )Pemp (xi )
= H(Pemp ) − Pemp (xm ) log
xm i=1
Pemp (xi )
m
X
m
X Pt (xi |xj(i) )
= H(Pemp ) − Pemp (x ) log + log Pemp (xi )
xm i=1
Pemp (xi )
m
XX Pt (xi |xj(i) )
= H(Pemp ) − Pemp (xm ) log + log Pemp (xi )
xm i=1
Pemp (xi )
7: Tree distribution-7
m X m
X
m Pt (xi |xj(i) ) X X
= H(Pemp ) − Pemp (x ) log − Pemp (xm ) log Pemp (xi ),
i=1 xm
Pemp (xi ) i=1 xm
(19)
m X m
X Pt (xi |xj(i) ) X X
= H(Pemp ) − Pemp (xi , xj(i) ) log − Pemp (xi ) log Pemp (xi ),
i=1 xi ,xj(i)
Pemp (xi ) i=1 x i
(20)
m X m
X X Pt (xi |xj(i) ) X X
= H(Pemp ) − Pemp (xj ) Pemp (xi |xj(i) ) log − Pemp (xi ) log Pemp (xi ),
i=1 xj(i) xi
Pemp (xi ) i=1 xi
(21)
m X m
(d) X X Pemp (xi |xj(i) ) X X
= H(Pemp ) − Pemp (xj ) Pemp (xi |xj ) log − Pemp (xi ) log Pemp (xi ),
i=1 xj(i) xi
Pemp (xi ) i=1 x i
(22)
m
(c) X
= Const − I(Xi ; Xj(i) ), (23)
i=1
where
(a) - follows from equation (16)
(b) - follows from equation (17)
P Pt (xi |xj(i) )
(c) - Note that xi Pemp (xi |xj ) log Pemp (xi ) is a divergence and the minimum is
achieved when Pt (xi |xj(i) ) = Pemp (xi |xj(i) )
(d) we define Const = H(Pemp ) − m
P P
i=1 xi Pemp (xi ) log Pemp (xi ) which do not
The problem is to create a tree that best describe the data, i.e. the tree that will take
the divergence to minimum. In the previous section, we showed that minimizing the
divergence is equivalent to maximizing the sum of all the mutual information between
7: Tree distribution-8
each node and its parent in the tree. Our goal is to find an algorithm to create the tree.
Kruskal [4] proposed the following algorithm called ”Minimum spanning tree algorithm”
to assemble the desired tree. This is a greedy algorithm which doesn’t promise that an
optimal tree will be found. The Greedy Choice is to pick the smallest weight edge that
does not cause a cycle in the MST constructed so far. This is a generic pseudo-code for
”Minimum spanning tree algorithm”, when V is a set of all the vertices, E is a set of all
the possible edges, A is a sub set of E which contain all the edges that included in the tree.
KRUSKAL(V):
A=∅
for each vertex v,u in V: do
E = calculate weight of edge(u,v);
end
E = SORT-INC(E);
for each edge in E: do
if cycle is not formed: then
A = A ∪ (u, v)
end
if length(A)=length(V)-1: then
break;
end
end
In our problem we need to use ”Maximum spanning tree algorithm” which means
instead of order the wights(mutual information) by increasing order, we need to order
the wights by decreasing order. Before starting the routine we must compute all the
mutual information and arrange the resulting process pairs in a list from the largest
wight to the smallest wight.
Stage 1: Find the pair of nodes with the greatest mutual information in the list.
P (xi |xj(i) )P (xj(i) )
I(i,j) = P (xi , xj(i) ) log , i, j ∈ [1, 2 . . . , N ] , i 6= j. (24)
P (xj(i) )P (xi )
Stage 2: Connect the pair of nodes found in stage 1, update the list of mutual
7: Tree distribution-9
information and return to Stage 1 if the list still contains pair of processes. The update
includes the following:
1. Delete the mutual information of the selected pair.
2. Delete all items in the list that represent prohibited connections, i.e., connections that
create loops in the graph.
Stage 3: Decide which of the two processes will be at the head of the tree and determine
the direction of the arrows. In fact, for either choice, we obtain the same sum of mutual
information, and therefore, either choice is possible.
Example 2 Let us assume that we have the random processes X Y Z W . The mutual
information between each possible pair was calculated. The results are shown in the table
below and in Figure 2.
X
I=0.5 I=0.6
I=0.1
W I=0.3 Y
I=0.4 I=0.4
Fig. 2. Calculate the mutual information between each pair of random processes.
We perform step 1 of the routine and see that the greatest mutual information is
7: Tree distribution-10
between X and Y . Thus, we connect them with a line. We then Proceed to step 2,
delete X, Y from the table and return to step 1. See Figure 3.
X X
I=0.5 I=0.6
I=0.1
W I=0.3 Y Y
I=0.4 I=0.4
Fig. 3. Connect the two random processes with the largest mutual information, and remove that mutual information
from the list.
We re-apply step 1 and find that we have two lines with the same mutual information,
so we can arbitrarily choose between the two options.
After step 2, it appears that the table is empty, so we proceed to step 3 and select the
7: Tree distribution-11
X X
I=0.5 I=0.6
I=0.1
W I=0.3 Y W Y
I=0.4 I=0.4
Fig. 4. Connect the next two random processes with the largest mutual information in the list, and remove it and all
connections that might create loops on the graph.
tree head and the directions of the arrows. See Figure 5 and Figure 6.
X X
I=0.5 I=0.6
I=0.1
W I=0.3 Y W Y
I=0.4 I=0.4
Z Z
Fig. 5. Connect the next two random processes with the largest mutual information in the list, and remove it and all
connections that might create loops on the graph.
7: Tree distribution-12
W Y
R EFERENCES
[1] Chow, C. K., Liu, C.N. (1968), ”Approximating discrete probability distributions with dependence trees”, IEEE
Transactions on Information Theory, IT-14 (3): 462-467
[2] I. Csiszar and J. Korner. Information Theory: ”Coding Theorems for Discrete Memoryless Systems”. Academic
Press, New York, 1981.
[3] I Csiszar. ”Sanov property, generalized I-projection and a conditional limit theorem”. Ann. Prob., 12:768793,
1984.
[4] Kruskal, J. B. (1956). ”On the shortest spanning subtree of a graph and the traveling salesman problem”.
Proceedings of the American Mathematical Society. 7: 4850. doi:10.1090/S0002-9939-1956-0078686-7. JSTOR
2033241.
[5] I. N. Sanov. ”On the probability of large deviations of random variables”. Mat. Sbornik, 42:1144, 1957. English
translation in Sel. Transl. Math. Stat. Prob., Vol. 1, pp. 213-244, 1961.
[6] J. Wolfowitz. ”Coding Theorems of Information Theory”. Springer-Verlag, Berlin, and Prentice-Hall, Englewood
Cliffs, NJ, 1978.
[7] T. M. Cover and J. A. ”Thomas, Elements of Information Theory”, 2nd ed. New-York: Wiley, 2006.
[8] T Weissman. Information Theory: ”Conditional Differential Entropy, Info. Theory in ML”, Lecture 20,
EE376A/STATS376A, stanford university, 2018.
https://web.stanford.edu/class/ee376a/files/lecture_20.pdf
Sequential Machine Learning-1
Throughout this lecture we discuss the task of modeling sequential data. This task
comprises on overcoming the fact that training examples are no longer i.i.d. Therefore, we
have to build models that are capable of modeling the time dependencies between samples
in order to model the data optimally in the sense of maximum likelihood. First, we define
the problem mathematically and derive the necessity of Recurrent Neural Network (RNN)
models. Then, we will elaborate on two types of Recurrent Neural Networks (RNNs):
the Elman Network and the Long-Short-Term-Memory (LSTM) cell.
I. I NTRODUCTION
Given a training set of n ∈ N examples {(xi , yi )}ni=1 drawn from the joint probability
PX n ,Y n , we want to model the generating distribution of the data. In order to do so, we
seek to estimate PX n ,Y n with a parametric model QθX n ,Y n , whose parameters are denoted
by θ.
We distinct the sample elements into two groups: (1) the features X, those elements
that we sample freely from the environment, and (2) the labels Y , those elements that
we can not sample from the environment and seek to predict. We now decompose the
joint probability, namely PX n ,Y n , using the chain rule of probabilities as follows,
n
Y
n n
PXi ,Yi |X i−1 ,Y i−1 xi , yi |xi−1 , y i−1
P X n ,Y n (x , y ) = (1)
i=1
Yn
PXi |X i−1 ,Y i−1 xi |xi−1 , y i−1 PYi |X i ,Y i−1 yi |xi , y i−1 .
= (2)
i=1
Generally, want to model only PYi |X i ,Y i−1 since we sample X n freely from the real
distribution, i.e PXi |X i−1 ,Y i−1 . Hence we can model QθX n ,Y n (xn , y n ) by
n
Y
QθX n ,Y n (xn , y n ) PXi |X i−1 ,Y i−1 xi |xi−1 , y i−1 QθYi |X i ,Y i−1 yi |xi , y i−1 .
= (3)
i=1
Sequential Machine Learning-2
In the rest of the lecture we use this setting, but the derivation could be extended to
estimating the term PXi |X i−1 ,Y i−1 (xi |xi−1 , y i−1 ) as well.
Now, we find the model parameters θ by the maximum likelihood estimator which is
given by
If the samples were sampled i.i.d the term QθYi |X i ,Y i−1 (yi |xi , y i−1 ) would collapse to
QθYi |Xi (yi |xi ), which is feasible to approximate with a parametric model. Unfortunately,
when modeling QθYi |X i ,Y i−1 (yi |xi , y i−1 ), there are two major difficulties with the modeling
task: (1) the model needs more parameters to combine the features from past times, and
(2) the function QθYi |X i ,Y i−1 varies (even in the number of arguments) as i changes.
For example, if we use a logistic model, in the i.i.d case, if x ∈ Rd , y ∈ R, we
can parameterize QθYi |Xi (yi |xi ) the model with d + 1 parameters. However, in the time
dependent case, if we want to model QθYi |X i ,Y i−1 we would need di + d(i − 1) + 1
parameters.
These problems encourage us to make assumptions on the distribution that generated
the data, that eventually lead us to build feasible models which are capable on
encapsulating time dependencies. In the next section we develop a model with shared
weights by approximating the underlying distribution of the data as Markov process.
In order to deal with the parameterization problem, we approximate the process that
generated X n , Y n as a stationary Markov process.
Sequential Machine Learning-3
Let us assume that there exists a R.V Si , f (X i−1 , Y i−1 ), for some deterministic func-
tion f (·), that summarizes the history of inputs and labels, such that, PXi ,Yi |X i−1 ,Y i−1 =
PXi ,Yi |Si . That is, we assume that the following Markov chain holds,
where (x, y, s) ∈ X × Y × S, the alphabet of inputs, targets and states respectively. The
last assumption is that this stationary Markov chain is Ergodic, which basically means that
we could recover the stationary distribution of the Markov chain by sampling observation
of the data.
Under the three preceding assumptions, and by the law of large numbers, the
optimization criteria in (6) becomes
n
X
log QθYi |X i ,Y i−1 yi |xi , y i−1 =
(9)
i=1
n
n→∞
X
log QθYi |Xi ,Si−1 (yi |xi , si−1 ) −−−→ (10)
i=1
Using this approximation, we can build a feasible model that would be able to encapsulate
time dependencies in the data.
In this section we show the equations of the Elman (vanilla) RNN. Then, we delve into
the error propagation analysis of the Elman RNN and explain the vanishing/exploding
gradient phenomena, which will motivate us finally derive the architecture of the Long-
Short-Memory-Term (LSTM) cell.
Sequential Machine Learning-4
A. Elman Network
1) Description: The Elman network uses the Markov settings from the preceding
section. That is, the Elman networks generates a state S that summarizes the history and
is used to generate along with the input X the prediction for Y and the next state S 0 .
Fig. 1. Depiction of single-layered Elman network and its time unrolling representation
zt = Wx xt + Wh ht−1 + bx (13)
ht = σ (zt ) (14)
zty = Wy ht + by (15)
yt = σ (zty ) (16)
t = dt − yt , (17)
Sequential Machine Learning-5
where dt denotes the true label and the term yt denotes the model prediction. For
computational simplicity we assume that the target is a scalar. The goal is to minimize
the MSE loss, namely J(θ) which is given by
T
X
J(θ) = Jt (θ) (18)
t=1
T
X 1
= 2t (19)
t=1
2
We would like to adjust the network weights by propagating the error back in time.
Let us calculate the error signal as it propagates backwards in the network. The error
propagated to the network output is denoted by δty ∈ R1×ny and is given by
∂
δty , Jt (θ) (20)
∂zty
∂ 1
= y (dt − yt )2 (21)
∂zt 2
∂ 1 ∂yt
= (dt − yt )2 y (22)
∂yt 2 ∂zt
= (dt − yt ) σ 0 (zty )T (23)
where
h iT
σ 0 (zty ) = σ 0 (zty 1 ), σ 0 (zty 2 ), . . . , σ 0 (zty ny ) . (24)
The error that is propagated to the RNN state is denoted by δt ∈ R1×m and is given by
∂
δt , Jt (θ)
∂zt
∂ ∂z y
= y Jt (θ) t
∂zt ∂zt
∂
= δty (Wy ht + by )
∂zt
= δty Wy diag (σ 0 (zt )) (25)
Next, we can calculate the error propagated backwards in time inside the network by
δt−1 ∈ R1×nL and δt ∈ R1×n respectively. The time propagated error is given by
∂
δt−1 , Jt (θ) (26)
∂zt−1
Sequential Machine Learning-6
∂zt
= δt (27)
∂zt−1
∂
= δt (Wx xt + Wh ht−1 + bx ) (28)
∂zt−1
= δt Wh diag (σ 0 (zt−1 )) , (29)
Now, we calculate the error that is propagated k time steps backwards in time. We will
get that
Let us examine the norm of the error as it propagates backwards in time. Since that σ 0
is bounded, say by M > 0, and assume that Wh has maximal eigen value of λ we can
claim that
We can see that if |λM | > 1.0 the error explodes as we keep propagating the error
back in time, and if |λM | < 1.0 the error vanishes in time. The naive approach is to
set λM = 1 but this does not work in practice since it causes saturated units that drives
the activation to zero. This property of the Elman network is the main motivation for
the LSTM cell, which addresses the vanishing/exploding gradient problem by enforcing
constant error propagation in time.
1) Motivation - Constant Error Flow: For simplicity, let us examine the naive elman
RNN with a single unit. In that case, the propagated error from time t to time t − 1 is
given by
∂zt
δt−1 = δt (33)
∂zt−1
∂
= δt (wx xt + wh ht−1 + b) (34)
∂zt−1
= δt wh f 0 (zt−1 ) . (35)
Sequential Machine Learning-7
In order to achieve constant error flow we demand that the activation function will satisfy
wh f 0 (zt−1 ) = 1. (36)
z
By solving the differential equation we get that f (z) = whl , i.e the activation function
must be linear. So by setting f to be the identity mapping and wh = 1 we can ensure a
constant error flow, namely Constant Error Carousel (CEC). However a unit is connected
to other units except itself that introduce two conflicts.
1) Input weight conflict: Excitation of input units enter the state by a linear projection
and can contaminate the state if the input is not correlated with the targets.
2) Output weight conflict: State units are used to predict the target by a linear
transformation. Nevertheless, at different time steps different units are correlated
with the target. Hence, a linear transformation of the states essentially uses
uncorrelated units to predict the target.
These conflicts, and the conclusion that linear activation can enforce constant error flow
motivates us for the architecture of the LSTM cell.
2) The Original Model: The architecture of the LSTM is depicted in Figure 2, an its
equations are given by
c t = at it + ct−1 (40)
These equations were modified to the last common LSTM architecture by adding a
”forget” gate that can forget certain cell units when propagating the error through time.
3) The Current Model: The LSTM equations are given by
c t = at i t + ft ct−1 (46)
This structure enables constant gradient flow through time by removing the non-linear
activation from the memory update as depicted in figure 3, namely ct tre of the LSTM
cell enab
Machine Learning
I. I NTRODUCTION
In this lecture we introduce an estimation method for the Mutual Information between
two random variables using the power of neural networks. First, we recall the required
definitions from information theory, and expand on their properties. Then, we introduce
a new and a very useful way of representing information measures, which is called
the variational formulation. Using the variational formulation we will be able to apply
maximization methods that we have previously applied in Machine Learning algorithms
and, hence, to develop an iterative algorithm that will output an estimation of Mutual
Information. Lastly, we will consolidate our understanding of this methodology and view
it from the the standpoint of Hypothesis Testing.
Remark 1 (Non-negativity of the Kullback Liebler Divergence) For any two distri-
butions P, Q the Kullback Liebler Divergence is non-negative. i.e.
Definition 2 (Mutual Information) Let X and Y be two random variables with a joint
distribution P (x, y). The Mutual Information I(X; Y ) is defined as
P (X, Y )
I(X; Y ) , EPXY log . (3)
P (X)P (Y )
Remark 2 (Defining Mutual Information using Kullback Liebler Divergence)
Mutual Information can be easily defined using the Kullback Liebler Divergence as
follows:
I(X; Y ) = DKL (PXY ||PX PY ). (4)
Remark 3 From the previous remark it is easy to show that I(X; Y ) = 0 if and only if
X, Y are statistically independent.
In this section we introduce a new and a very useful way of representing the
Kullback Liebler Divergence ,which is called the variational formulation. The variation
formulation is a way of representing some measures as a supremum or infimum over a
set of functions. A general form of variational formulation is as follows:
This representation has some distinct advantages. First, it provides upper / lower
bounds for the represented measure which could not be obtained beforehand. Second, in
many cases the variational formulation might be easier to compute analytically. Third,
the use of optimization methods in order to achieve certain approximations might be
a handy solution. The following theorem introduces a variational formulation for the
Kullback Liebler Divergence, and perform as a key ingredient of neural estimation of
Mutual Information.
functions. The Kullback Liebler Divergence admits the following dual representation:
Proof:
The proof consists of two parts which we will formulate and prove in the following
two lemmas:
Proof:
P (x)
Let us choose T ∗ (x) = log Q(x) . Note that the following series of equalities hold:
h i
(a) P (X) P (X)
∗
EP [T (X)] − log(EQ [e T ∗ (X)
]) = EP log − log EQ elog Q(X) (8)
Q(X)
(b) P (X)
= DKL (P ||Q) − log EQ (9)
Q(X)
Z
(c) P (x)
= DKL (P ||Q) − log( Q(x) dx) (10)
X Q(x)
Z
= DKL (P ||Q) − log( P (x)dx) (11)
X
(d)
= DKL (P ||Q) − log(1) (12)
where
P (x)
(a) Follows from the specific choice of T ∗ (x) = log Q(x) .
(b) Follows from the definition of the Kullback Liebler Divergence.
(c) Follows from the definition of the expectation of continuous random variable.
(d) Integration of any probability density is always 1.
1-4
Lemma 2 (Lower bound for the Kullback Liebler Divergence) For any function T :
X → R the following inequality holds:
Proof:
Let us define a new probability density function by:
Q(x)eT (x)
G(x) , . (15)
EQ [eT (X) ]
Note that G(x) ≥ 0 and forms a probability density function since
Z Z
Q(x)eT (x) EQ [eT (X) ]
G(x)dx = dx = = 1. (16)
X X EQ [eT (X) ] EQ [eT (X) ]
where
(a) Follows from the definition of Divergence and linearity of expectation.
(b) Follows from the fact that log(EQ [eT (X) ]) is a deterministic constant.
Q(x)eT (x)
(c) Follows from the definition of G(x) = EQ [eT (X) ]
.
(d) Follows from the non-negativity of Kullback Liebler Divergence (see Definition 1 in
section II).
1-5
Hence,
DKL (P ||Q) = sup EP [T (X)] − log(EQ [eT (X) ]). (19)
T :X →R
Problem definition:
Let X ∼ PX and Y ∼ PY be two random variables with alphabets X , Y, respectively,
and let (Xi , Yi )ni=1 ∼ PXY be i.i.d samples. Our goal is to estimate the Mutual
Information I(X; Y ) from the n given samples.
In order to overcome the first challenge we assumed a set of i.i.d samples which are
drawn according to PXY is given. Under the assumption that the number of samples
n is large enough, we can use the Law of Large Numbers to obtain the following
approximation:
n
1X
EPXY [T (X, Y )] ≈ T (Xi , Yi ). (21)
n i=1
Note that an evaluation of log(EPX PY [eT (X,Y ) ]) is still required. Now we are facing
a new challenge since the given samples (Xi , Yi ) are drawn according to PXY and not
according to PX PY . Therefore, direct use of the Law of Large Number is not correct.
This can still be overcome by artificially constructing tuples of the form (Xi , Yei ), where
Yei is taken from the randomly shuffled set of all samples (Yi )ni=1 . From here we obtain
that Xi and Yei are statistically independent, which means that (Xi , Yei )n are i.i.d samples
i=1
" n
#
T (X,Y )
1 X T (Xi ,Yei )
log EPX PY [e ] ≈ log e . (22)
n i=1
Finally, using equations (21) and (22) we shall define our Mutual Information estimator
as:
n
" n #
1 X 1 X e
ˆ
I(X; Y) , sup T (Xi , Yi ) − log eT (Xi ,Yi ) . (23)
T :X ×Y→R n n
i=1 i=1
function Tθ (X, Y ) as the output of the neural network The network’s cost function is
defined as follows:
n n
!
1X 1 X Tθ (Xi ,Yei )
Iˆθ (X; Y ) = Tθ (Xi , Yi ) − log e . (24)
n i=1 n i=1
It is our goal to maximize this function. The back propagation algorithm makes it
possible for us to compute partial derivatives with respect to the network’s parameters θ,
and then use the gradient ascent algorithm in order to move step by step towards local
maxima of Iˆθ (X; Y ). Since, practically speaking, the number of given samples is large,
we may use mini-batch gradient ascent in order to train the network to reach a local
maximum of Iˆθ (X; Y ) by adjusting the network’s parameters θ. The algorithm steps are
as follows:
2: repeat
3: Draw mini-batch of samples: (X1 , Y1 ), (X2 , Y2 ), ..., (Xm , Ym ) ∼ PXY .
4: Draw m samples from the marginal distribution: Ye1 , Ye2 , ..., Yf m ∼ PY .
P P Tθ (Xi ,Yei )
5: Evaluate: Iˆθ (X; Y ) ← m1 m 1
i=1 Tθ (Xi , Yi ) − log( m
m
i=1 e ).
6: Update network parameters: θ ← θ + ∇θ Iˆθ (X; Y ).
7: until convergence
1-8
V. H YPOTHESIS T ESTING
A. Introduction
and since logarithms are monotonically increasing functions, we can take the logarithm
of both sides of the inequality written above and receive:
n
!
n
P (x ) (a) Y P (xi )
log n
= log
Q(x ) i=1
Q(xi )
n
(b) X P (xi )
= log
i=1
Q(xi )
> log(T ),
where (a) follows from the fact that the samples are i.i.d and (b) follows from the fact
that the logarithm of a product is the sum of the individual logarithms.
P (xi )
But log Q(x i)
is the output of the neural network that we found through the
optimization problem presented in the previous sections. Hence, when using the neural
network to approximate the Kullback Liebler Divergence, we can simultaneously sum
over the network’s outputs, as a byproduct, find from which distribution our samples
were generated!
1-9
B. Hypothesis Testing
We will now discuss how we can identify from which distribution our samples were
generated and why the inequality given in the introduction provides us with the optimal
decision region.
A set of samples (x1 , x2 , ..., xn ) are given and our goal is to decide from which of two
given distributions, P1 (x) or P2 (x), the samples were generated. In order to do this, we
first define two hypotheses:
H1 : X ∼ P1 ,
H2 : X ∼ P2 .
α = P (H2 |X is from P1 ),
β = P (H1 |X is from P2 ).
α is the resultant error when we decided that X was generated from P2 (X) when it was
actually generated from P1 (X), and β is the resultant error when we decided that X was
generated from P1 (X) when it was actually generated from P2 (X).
We will now also define a function of the samples, G : xn → 1, 2, to be equal to
1 when our hypothesis for the samples is H1 , and to be equal to 2 when when our
hypothesis for the samples is H2 . We can now re-write α and β with the help of G(xn ):
Remark 4 From this representation of α and β, we can see that there is a trade-off. As
α decreases in value, β increases and vice-versa. For example, if α were to be small,
G(xn ) would have to be equal to 1 most of the time, thus causing β to increase.
Now that we have defined our errors, we must determine the decision region in order to
1-10
Proof:
Let us first define two indicator function, φA (X) and φB (X), which will get the value 0
or 1 according to which of the two decision regions X belongs to. The explicit definition
of these functions is given by:
1,
1,
X∈A X∈B
φA (X) = , φB (X) =
0,
0,
X∈
/A X∈
/B
Let us consider
(φA (x) − φB (x)) ∗ (P1 (x) − T ∗ P2 (x)) ≥ 0.
where
(a) Is obtained by opening the parentheses of the product.
(b) Is obtained by splitting the sum into two by summing over A and B and by
remembering that φA (x) and φB (x) are indicators for each of these two groups.
(c) Follows from the linearity of the sums.
(d) Is obtained as follows:
Let us first consider
P
x∈A P1 (x).
Summing over the probability P1 (X) when X ∈ A is the same as taking one minus the
/ A. But this sum is the exact definition of α∗
sum of the probability P1 (X) when X ∈
that was given earlier since we are taking the probability of X with regards to P1 (X)
when it was really generated by P2 (X). Therefore, we can write the sum as follows:
P P
x∈A P1 (x) = 1 − x∈A
/ P1 (x) = 1 − α∗ .
This time, we are summing over the probability P2 (X) when X was really generated
from P1 (X). But this is the exact definition of β ∗ which was given previously. Therefore,
P
x∈A T ∗ P2 (x) = T ∗ β ∗ .
As for the sum over B, in the exact manner explained above, we can show that
P
x∈B (P1 (x) − T ∗ P2 (x)) = 1 − α − T ∗ β.
Our equality is now obtained by substituting the sums for the equations derived above.
(e) Is obtained by tidying up the equation.
(f ) Follows from the fact that the equation that we were originally summing,
(α − α∗ ) + T ∗ (β − β ∗ ) ≥ 0.
If we look at this phrase, it is easy to see that if α < α∗ , then β must be greater than
β ∗ in order for the inequality to hold. This concludes our proof.
Remark 5 As a result of this lemma, we can conclude that the decision region defined
by An is the optimal decision region for deciding from which probability distribution
our samples were generated. Let us plot an example of the error defined by the optimal
decision region An on a two-dimensional grid with axes α and β representing our two
forms of error.
1-13
β
Error of Decision Region An
β∗
Forbidden
α
∗
α
Fig. 1. The error given by An with an illegal region drawn beneath it.
If we were to pick any point on the plot of the error given by An and draw two
perpendicular lines to each of the axes, as is shown in the plot, then any other decision
region’s error would have to fall outside this region, which can be seen in the plot as
the area enclosed by the square. This is due to the fact that the Neyman-Pearson lemma
tells us that if the error in one axis is smaller than the error of the optimal region in that
same axis (e.g. α < α∗ ), then the error in the other axis must be larger (in this case,
β > β ∗ ). Since the point on the line is chosen arbitrarily, we can conclude that at no
point beneath the plot of the error defined by our optimal decision region can there be
an error from another decision region. Therefore, if we were to plot the error given by
any other decision region, the plot must be above the plot of the error defined by An ,
thus showing us that this is indeed the optimal decision region.
1-14
R EFERENCES
[1] M. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R.D Hjelm. Mine: mutual information
neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[2] T. M. Cover and J. A. Thomas. Elements of information theory, chapter 11, pages 375–379. John Wiley & Sons,
second edition, 2012.
10-1
Machine Learning
Lecture 10
Lecturer:Haim Permuter Scribe: Omer Luxembourg
I. I NTRODUCTION
In this lecture we introduce the f-Divergence definition which generalizes the Kullback-
Leibler Divergence, and the data processing inequality theorem. Parts of this lecture are
guided by the work of T. Cover’s book [1], Y. Polyanskiy’s lecture notes [3] and Z.
Goldfeld’s lecture 6 about f-Divergences [2]. This lecture assumes the student is familiar
with basic probability theory. The notations here are similar to those of the previous
lectures.
II. f-Divergence
There are two main properties for Divergence, which were proved in previous lectures.
a. DKL (PX ||QX ) ≥ 0, and equality hold if and only if P = Q.
b. DKL (PX ||QX ) is convex in (PX , QX ).
10-2
where (a) follows from the definition of f. Note that f (1) = 0 and f is convex for
all t ≥ 0. (f 00 (t) = 1t ).
b. Negative Log: f (x) = − log(x),
P (x)
Df (PX ||QX ) , EQ f (6)
Q(x)
(a) X P (x)
= −Q(x) log
x∈X
Q(x)
, D(QX ||PX ),
P (x)
= EQ fT V
Q(x)
X 1 P (x)
= Q(x) · −1
x∈X
2 Q(x)
1X
= |P (x) − Q(x)| .
2 x
Note that f (1) = 0 and f is convex for all t ≥ 0. In addition DT V (P, Q) =
DT V (Q, P ) means that the total variation is a metric on the space of probability
distributions. That is because it is a divergence function and a symmetric function
of P and Q .
2x 2
d. Jensen-Shannon divergence (symmetrized KL): f (x) = x log x+1 + log x+1 ,
(c)
= 0,
where (a) is from Jensen’s inequality for a convex function f, (b) is due to the fact
P (x)
that Q(x)
is fixed ∀x because P = Q, (c) is from the definition of f. Note that if f
is not strictly convex around 1, the equality can hold from Jensen’s inequality and
not from P = Q.
• Joint convexity: (P, Q) 7−→ Df (P ||Q) is a jointly convex function. Consequently,
P 7−→ Df (P ||Q) for fixed Q and Q 7−→ Df (P ||Q) are also convex functions.
Proof: From the Perspective Transform Preserve Convexity lemma we learned that
if f (x) is convex ⇒ t · f xt is convex in (x, t).
X P (x)
Df (P ||Q) = Q(x)f , (10)
x
Q(x)
Let PY be the output of the system PY |X for input PX , and QY be the output of the
system QY |X for input PX , see figure 1.
PY |X PY
PX
QY |X QY
Then
Df (PY ||QY ) ≤ Df PY |X ||QY |X |PX . (12)
One can view PY and QY as the output distributions after passing PX through the channel
transition matrices PY |X and QY |X , respectively. The above relation tells us that the
average f-Divergence between the corresponding channel transition rows is at least the
f-Divergence between the output distributions.
Proof:
X X P (Y |X)
Df (PY |X ||QY |X |PX ) , PX Q(Y |X)f (13)
x y
Q(Y |X)
(a) X
= PX Df (P (Y |X = x)||Q(Y |X = x))
x
! !!
(b) X X
≥ Df PX P (Y |X = x) || PX Q(Y |X = x)
x x
(c)
= Df (EPX [P (Y |X)] ||EPX [Q(Y |X)])
(d)
= Df (P (Y )||Q(Y )) ,
where (a) follows from the definition of f-Divergence, (b) follows from Jensen’s
inequality, because Df is convex in P, Q, (c) is the definition of expectation, and (d)
follows from the Law of Total Expectation.
Remark 1 (equality for Df (PY |X ||QY |X |PX )): We can notice the following equality
holds:
" #
PY,X
Df (PY,X ||Q̃Y,X ) , EQ̃Y,X f (14)
Q̃Y,X
X P (y, x)
= Q̃(y, x)f
y,x
Q̃(y, x)
X X P (y, x)
= P (x) Q(y|x)f
x y
Q(y, x)
(a) X X P (y|x)P (x)
= P (x) Q(y|x)f
x y
Q(y|x)P (x)
10-6
(b) X X P (y|x)
= P (x) Q(y|x)f
x y
Q(y|x)
= Df (PY |X ||QY |X |PX ),
where (a) follows from the definition of conditional probability, and Q̃( y, x) ,
P (x)Q(y|x), and (b) is from the definition of divergence.
P (x) P (y)
W (y|x)
Q(x) Q(y)
The intuition behind the following inequality is that processing the observation x by a
channel WY |X makes it more difficult to determine whether it came from PX or QX . In
neural networks, for instance, the divergence of the system output will decrease as we
move to the next layer.
(a) X X P (x, y)
= Q(y) Q(x|y)f
y x
Q(x, y)
!
(b) X X P (x, y)
≥ Q(y)f Q(x|y)
y x
Q(x, y)
!
X X P (x, y)
= Q(y)f Q(x|y)
y x
Q(y)Q(x|y)
(c) X P (y)
= Q(y)f
y
Q(y)
= Df (PY ||QY ),
where (a) follows from conditioning, (b) is Jensen’s inequality for convex f in P, Q, and
(c) is from Law of Total Probability. Note that PX,Y = PX WY |X and QX,Y = QX WY |X .
10-8
R EFERENCES
[1] T. M. Cover and J. A. Thomas. Elements of Information Theory, Chap. 1. ISBN, 1991.
[2] Z. Goldfeld. Lecture 6: f-divergences.
Available at http://people.ece.cornell.edu/zivg/ECE 5630 Lectures6.pdf, 2020.
[3] Y. Polyanskiy. Lecture notes on information theory, chap. 6.
Available at http://people.lids.mit.edu/yp/homepage/data/itlectures v5.pdf, 2017.
1-1
Machine Learning
I. I NTRODUCTION [2]
The Bayes theorem tells us that the computation of the posterior requires three terms:
a prior, a likelihood and an evidence. The first two can be expressed easily as they are
part of the assumed model. However, the third term, requires to be computed such that
Z
P (x) = P (x|θ)P (θ). (1)
θ
Although in low dimension this integral can be computed without too much difficulties,
it can become intractable in higher dimensions.
II. N OTATION
A Bayesian mixture of Gaussians is a model that assumes that the data is distributed
as a mixture of k Gaussians with the following parameters [3]:
• the expectation is also a random variable, normally distributed - µi ∼ N (0, σ 2 ), i =
1, 2...k, for some known σ.
• given the expectation µi , the standard deviation of the Gaussian is 1 - Gi |µi ∼
N (µi , 1).
1-3
• P (ci ) - the probability that xi belongs to some Gaussian, i.e, the assignment of each
i’th observation, has a uniform distribution, ci ∼ U nif orm(k).
We encode ci into ’one hot’ k sized vector. We then draw that xi |ci , µk ∼ N (cTi µ, 1).
define z m = (µk , cn ), m = k + n, an m sized vector of hidden variables.
For a sample of size n, the joint density of latent and observed variables is,
P (xn , z m ) =P (xn , cn , µk )
n
Y (3)
=P (µk ) P (ci )P (xi |ci , µk ).
i=1
D(q(z m )||P (z m |xn )) = Eq(zm ) [log q(Z m )] − Eq(zm ) [log P (Z m |xn )]. (6)
D(q(z m )||P (z m |xn )) = Eq [log q(Z m )] − Eq [log P (Z m , xn )] + Eq [log P (xn )]. (7)
Expand the conditional knowing that p(xn ) is not a function of the random variable Z m
and therefor Eq [log P (xn )] = log P (xn ),
log P (xn ) = D(q(z m )||P (z m |xn )) + Eq [log P (Z m , xn )] − Eq [log q(Z m )]. (8)
1-4
Denoting the term E_q[log P(Z^m, x^n)] - E_q[log q(Z^m)] as the ELBO and using the non-negativity of D, we get
$$\log P(x^n) \ge \mathrm{ELBO}(q), \qquad (9)$$
and thereby its name - the Evidence (P(x^n)) Lower Bound. Using eq. (9) and the fact that, w.r.t. q(z^m), log P(x^n) = const, we can write D as
$$D(q(z^m)\|P(z^m|x^n)) = \log P(x^n) - \mathrm{ELBO}(q).$$
So, by maximizing the ELBO we actually minimize D(q(z^m)||P(z^m|x^n)), and therefore we may solve the optimization problem
$$q^*(z^m) = \operatorname*{argmax}_{q(z^m)} \mathrm{ELBO}(q).$$
The ELBO can also be decomposed as
$$\mathrm{ELBO}(q) = \sum_{z^m} q(z^m)\log P(x^n|z^m) - D(q(z^m)\|P(z^m)). \qquad (13)$$
Looking at eq. (13) we can see that there is a trade-off between minimizing D and maximizing the sum. For minimizing D, we would like q(z^m) to be as close as possible to P(z^m), while for maximizing the sum, a q(z^m) that gives more weight to values of z^m that make the term log P(x^n|z^m) larger, i.e., z^m that contain more information about x^n, will give better results. As we can see, the more samples there are, the more significant the term $\sum_{z^m} q(z^m)\log P(x^n|z^m)$ becomes relative to the divergence. The identity in eq. (8) can also be checked numerically, as in the short sketch below.
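As a sanity check (not part of the lecture; all numbers are arbitrary choices), here is a tiny discrete example verifying the identity log P(x^n) = ELBO(q) + D(q(z^m)||P(z^m|x^n)) for a single observation and a latent variable with three states:

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])          # prior P(z) over 3 latent states
p_x_given_z = np.array([0.9, 0.4, 0.1])  # likelihood P(x | z) of one fixed observation x
q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational distribution q(z)

p_xz = p_z * p_x_given_z                 # joint P(x, z)
p_x = p_xz.sum()                         # evidence P(x)
p_z_given_x = p_xz / p_x                 # posterior P(z | x)

elbo = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))
kl = np.sum(q * np.log(q / p_z_given_x))
print(np.log(p_x), elbo + kl)            # both values coincide
```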
We will use a method called coordinate ascent, a maximization method for multi-variable functions. In this method we fix all the variables except one, maximize the function as an ordinary one-variable function, then fix all the variables except the next one, and repeat. If the function is concave in all of its variables, the method reaches the global maximum; otherwise, it reaches a local one [4], [5]. A minimal sketch of the idea is given below.
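For intuition only, the following sketch applies coordinate ascent to a toy concave quadratic (the function and its closed-form coordinate updates are my own illustrative choices, not from the lecture):

```python
def coordinate_ascent(n_iter=50):
    # Toy example: maximize the concave function
    # f(a, b) = -(a^2 + b^2 + a*b) + 3a + 2b
    # by alternating exact one-variable maximizations.
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        a = (3.0 - b) / 2.0   # argmax over a with b fixed (set df/da = 0)
        b = (2.0 - a) / 2.0   # argmax over b with a fixed (set df/db = 0)
    return a, b

print(coordinate_ascent())    # converges to the global maximizer (4/3, 1/3)
```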
while the second transition is because P(x^n) is not random with respect to q(z^m), and the third one is by using eq. (14). Applying coordinate ascent and fixing q(z_{-i}) (all the factors q(z_j) except the i'th coordinate), we get
$$\operatorname*{argmax}_{q(z_i)} \mathrm{ELBO} = \operatorname*{argmax}_{q(z_i)}\; \mathbb{E}_{q(z_i)}\!\big[\mathbb{E}_{q(z_{-i})}[\log P(Z_i, Z_{-i}\,|\,x^n)]\big] - \mathbb{E}_{q(z_i)}[\log q(Z_i)] + \mathrm{const}. \qquad (16)$$
Differentiating and equating to zero,
$$\frac{\partial\, \mathrm{ELBO}}{\partial q(z_i)} = \mathbb{E}_{q(z_{-i})}[\log P(z_i\,|\,Z_{-i}, x^n)] - \log q(z_i) - 1 = 0, \qquad (17)$$
which yields
$$q^*(z_i) \propto \exp\!\big(\mathbb{E}_{q(z_{-i})}[\log P(z_i\,|\,Z_{-i}, x^n)]\big). \qquad (18)$$
Therefore, the coordinate ascent variational inference (CAVI) algorithm may be written as:
Algorithm 2: CAVI
Input: model P(x^n, z^m), data x^n.
Output: variational density q(z^m) = ∏_{i=1}^{m} q(z_i) and ELBO (evidence lower bound of P(x^n)).
Initialization: initialize the factors q(z_i), i = 1, 2, ..., m.
while the ELBO has not converged do
    for i = 1, 2, ..., m do
        Set q(z_i) ∝ exp(E_{q(z_{-i})}[log P(z_i | Z_{-i}, x^n)])
    end
    Compute ELBO = E_q[log P(Z^m, x^n)] − E_q[log q(Z^m)]
end
Return ∏_{i=1}^{m} q(z_i), ELBO
As we saw before, we need to find P(z^m|x^n); this term is hard to compute, so we will approximate it with q(z^m). In order to do so we will use the CAVI algorithm, adapted to our example. We now assume that the variational approximation for the mixture of Gaussians is defined by the parameters ϕ, m^k, s^k as follows:
• the expectation is normally distributed - µ_i ∼ N(m_i, s_i²), i = 1, 2, ..., k.
• P(c_i) has a categorical distribution, c_i ∼ Categorical(ϕ_i), where ϕ_i is a k-sized vector of non-negative numbers that sums to 1. Therefore, ϕ is an n × k matrix whose i'th row is the k-sized vector ϕ_i.
That said, we initialize the algorithm like the model presented in section IV: µ_i ∼ N(0, σ²), i = 1, 2, ..., k, for the expectation of each Gaussian (σ² is a known hyper-parameter), and ϕ_i ∼ Uniform(k), i = 1, 2, ..., n. In each iteration we update the distribution parameters ϕ, m^k, s^k.
Let us evaluate the ELBO of the mixture assuming mean field family,
$$
\begin{aligned}
\mathrm{ELBO}(\phi, m^k, s^k) = &\sum_{i=1}^{k} \mathbb{E}[\log P(\mu_i);\, m_i, s_i^2]\\
&+ \sum_{j=1}^{n} \Big(\mathbb{E}[\log P(c_j);\, \phi_j] + \mathbb{E}[\log P(x_j\,|\,c_j, \mu^k);\, \phi_j, m^k, (s^2)^k]\Big) \qquad (21)\\
&- \sum_{j=1}^{n} \mathbb{E}[\log q(c_j; \phi_j)] - \sum_{i=1}^{k} \mathbb{E}[\log q(\mu_i; m_i, s_i^2)].
\end{aligned}
$$
Expanding equation (21) using equation (19), we derive the update for the assignment probabilities, ϕ_{ji} ∝ exp(E[µ_i; m_i, s_i²] x_j − E[µ_i²; m_i, s_i²]/2). Continuing to develop these equations, we eventually get that the update for q(µ_i) is
$$m_i = \frac{\sum_{j=1}^{n}\phi_{ji}\, x_j}{1/\sigma^2 + \sum_{j=1}^{n}\phi_{ji}}, \qquad s_i^2 = \frac{1}{1/\sigma^2 + \sum_{j=1}^{n}\phi_{ji}}. \qquad (24)$$
Therefore, we can write the algorithm as follows (a minimal Python sketch is given after the algorithm):
Algorithm 3: CAVI for the mixture of Gaussians model
Input: Data x^n, number of components K, prior variance of the component means σ².
Output: Variational densities q(µ_i; m_i, s_i²) (Gaussian) and q(c_i; ϕ_i) (K-categorical).
Initialization: as described at the beginning of this section.
while the ELBO has not converged do
    for j = 1, 2, ..., n do
        Set ϕ_{ji} ∝ exp(E[µ_i; m_i, s_i²] x_j − E[µ_i²; m_i, s_i²]/2)
    end
    for i = 1, 2, ..., k do
        Set m_i = (Σ_{j=1}^{n} ϕ_{ji} x_j) / (1/σ² + Σ_{j=1}^{n} ϕ_{ji})
        Set s_i² = 1 / (1/σ² + Σ_{j=1}^{n} ϕ_{ji})
    end
    Compute ELBO(ϕ, m^k, (s²)^k)
end
Return q(µ^k; m^k, (s²)^k), q(c^n; ϕ)
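As a concrete illustration, here is a minimal NumPy sketch of Algorithm 3 (the function name, the fixed iteration count used instead of an ELBO-convergence test, and the synthetic data at the end are my own choices):

```python
import numpy as np

def cavi_gmm(x, K, sigma2, n_iter=100, seed=0):
    """Minimal CAVI sketch for the Bayesian mixture of Gaussians above.
    x: 1-D data array of length n; K: number of components;
    sigma2: prior variance of the component means (the hyper-parameter sigma^2)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize the variational parameters: means m, variances s2, assignments phi.
    m = rng.normal(0.0, np.sqrt(sigma2), size=K)
    s2 = np.ones(K)
    phi = rng.dirichlet(np.ones(K), size=n)            # n x K, each row sums to 1

    for _ in range(n_iter):                            # fixed iterations instead of an ELBO test
        # Update q(c_j): phi_{ji} proportional to exp(E[mu_i] x_j - E[mu_i^2]/2),
        # with E[mu_i] = m_i and E[mu_i^2] = m_i^2 + s_i^2.
        log_phi = np.outer(x, m) - 0.5 * (m**2 + s2)
        log_phi -= log_phi.max(axis=1, keepdims=True)  # for numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        # Update q(mu_i) according to eq. (24).
        denom = 1.0 / sigma2 + phi.sum(axis=0)
        m = (phi * x[:, None]).sum(axis=0) / denom
        s2 = 1.0 / denom
    return m, s2, phi

# Example on synthetic data drawn from three well-separated components.
x = np.concatenate([np.random.normal(mu, 1.0, 200) for mu in (-5.0, 0.0, 5.0)])
m, s2, phi = cavi_gmm(x, K=3, sigma2=10.0)
print("estimated component means:", np.sort(m))
```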
REFERENCES
[1] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112:859-877, 2017.
[2] https://towardsdatascience.com/bayesian-inference-problem-mcmc-and-variational-inference-25a8aa9bce29
[3] https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
[4] I. Naiss and H. H. Permuter, ”Alternating maximization procedure for finding the global maximum of directed
information,” 2010 IEEE 26-th Convention of Electrical and Electronics Engineers in Israel, 2010, pp. 000545-
000549, doi: 10.1109/EEEI.2010.5662161.
[5] I. Naiss and H. H. Permuter, ”Extension of the Blahut–Arimoto Algorithm for Maximizing Directed In-
formation,” in IEEE Transactions on Information Theory, vol. 59, no. 1, pp. 204-222, Jan. 2013, doi:
10.1109/TIT.2012.2214202.
Lecture 12
Lecturer: Haim Permuter Scribe: Tom Galili
I. VARIATIONAL INFERENCE
$$P(z^m|x^n) = \frac{P(z^m, x^n)}{P(x^n)} = \frac{P(z^m)\,P(x^n|z^m)}{\int P(z^m)\,P(x^n|z^m)\, dz^m} \qquad (1)$$
• z^m - the latent (hidden) variables
• x^n - the observations (evidence)
Notice that the estimation of P(z^m|x^n) is not trivial; therefore we simplify the term:
$$
\begin{aligned}
\operatorname*{argmin}_{q(z^m)\in\mathcal{Q}} D(q(z^m)\|P(z^m|x^n)) &\stackrel{(a)}{=} \operatorname*{argmin}_{q(z^m)\in\mathcal{Q}}\; \mathbb{E}_{q(z^m)}[\log q(Z^m)] - \mathbb{E}_{q}\!\left[\log \frac{P(Z^m, x^n)}{P(x^n)}\right]\\
&\stackrel{(b)}{=} \operatorname*{argmin}_{q(z^m)\in\mathcal{Q}}\; \mathbb{E}_{q}[\log q(Z^m)] - \mathbb{E}_{q}[\log P(Z^m, x^n)] + \log P(x^n)\\
&\stackrel{(c)}{=} \operatorname*{argmin}_{q(z^m)\in\mathcal{Q}}\; \big(-\mathrm{ELBO} + \log P(x^n)\big), \qquad (2)
\end{aligned}
$$
where
(a) follows from the definition of divergence and from conditional probability, P(Z^m|x^n) = P(Z^m, x^n)/P(x^n).
(b) follows from the logarithm rules.
(c) follows from the definition of the evidence lower bound (ELBO) as defined in the previous lectures.
Note that P(z^m) is the prior probability and q(z^m) ≈ P(z^m|x^n) is the posterior probability (which we want to estimate) of the latent space given the evidence. A first interpretation: by finding the maximum, we want q(z^m) to be as close as possible to the prior, and on the other hand q(z^m) will be greater for values of z^m that give more information about x^n.
Autoencoders are unsupervised learning models. The general idea of autoencoders consists of setting an encoder and a decoder as neural networks and learning the best encoding-decoding scheme using an iterative optimization process [1].
In this way, the architecture creates an information bottleneck for the data, ensuring that only the main structured part of the information, from which the input can be restored reasonably well, passes through and is reconstructed. Therefore, we would like to use dimension reduction (feature reduction). In many cases, the data we want to analyze has a high dimension, which means that each sample has a large number of features. For the most
part, not all characteristics are equally significant. Because it is difficult to analyze data
from a high dimension and build models for such data, in many cases we will try to
reduce the dimension of the data with as little information loss as possible. As illustrated
in Fig. 1, after the encoder part of the neural network we get z = e(x) which is the
latent vector of the input, characterized by a lower dimension than the data, represented
by the important features from which it can be reconstructed in the decoder.
The AE model objective is to make the recovery error between the input data and the reconstructed output data as small as possible,
$$(e^*, d^*) = \operatorname*{argmin}_{e,\,d}\; \|x - d(e(x))\|^2.$$
If the equality x = d(e(x)) holds, then no information was lost in the encoder-decoder process. On the other hand, if x ≠ d(e(x)), then some information is lost due to the dimension reduction, and the complete reconstruction of the encoded information is not possible in the decoder. A minimal code sketch of such an autoencoder is given below.
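For concreteness, here is a minimal PyTorch sketch of an autoencoder trained on the reconstruction error above (the class name, layer sizes, and the random dummy batch are illustrative assumptions, not from the lecture):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Sketch of an autoencoder: e(x) compresses the input to a low-dimensional
    latent vector z, and d(z) tries to reconstruct x from it."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # z = e(x), the latent vector
        return self.decoder(z)     # d(e(x)), the reconstruction

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()             # the recovery error ||x - d(e(x))||^2

x = torch.rand(64, 784)            # dummy batch standing in for real data
for _ in range(10):                # a few gradient steps on the reconstruction loss
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    optimizer.step()
```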
Unlike the AE, which takes data and performs dimension reduction, the VAE [3] imposes a prior distribution on the latent space z, for example a Gaussian distribution z ∼ N(0, I), where I is the identity covariance matrix. The encoder network is trained to receive data x and output the parameters µ(x), σ(x) of z (z|x ∼ N(µ_x, σ_x)), in order to minimize as much as possible the distance between the encoder's distribution N(µ(x), σ(x)) and the posterior P(z|x). We then sample vectors from z|x (given by the parameters calculated by the encoder) and pass them through the decoder to produce the parameters of P(x|z) [1], [2].
It is important to mention that, in comparison to the AE decoder, which is used for the training process only, the VAE decoder is as important as the encoder, since it is used to generate new data at inference time; this is what makes the whole Variational Auto Encoder a generative model.
Let us derive the loss function of the VAE. First, define the link between the encoder and the decoder: the prior is P(z) ∼ N(0, I) and the encoder produces the approximate posterior q_x(z) = N(g(x), h(x)). Then
$$(g^*, h^*) = \operatorname*{argmin}_{g,h} D\big(q_x(z)\,\|\,P(z|x)\big) \stackrel{(a)}{=} \operatorname*{argmin}_{g,h}\; \big[\mathbb{E}_{z\sim q_x(z)}[\log q_x(z)] - \mathbb{E}_{q}[\log P(z)] - \mathbb{E}_{q}[\log P(x|z)]\big],$$
where
(a) follows from the definition of divergence and Bayes' theorem, together with the fact that P(x) does not depend on q, so it is a constant that does not affect the minimization.
We know that
$$P(x|z) = \frac{1}{\sqrt{2\pi c}}\, e^{-\frac{(x - f(z))^2}{2c}}, \qquad P(z) \sim \mathcal{N}(0, I). \qquad (10)$$
Thus,
$$(g^*, h^*) = \operatorname*{argmin}_{g,h}\; \mathbb{E}_{z\sim q_x(z)}\!\left[\frac{(x - f(z))^2}{2c}\right] + D\big(\mathcal{N}(g(x), h(x))\,\|\,\mathcal{N}(0, I)\big). \qquad (11)$$
$$\max_f\; \mathbb{E}_{z\sim q_x(z)}[\log P(x|z)] = \max_f\; \mathbb{E}_{z\sim q_x(z)}\big[\log \mathcal{N}(x; f(z), cI)\big] = \max_f\; \mathbb{E}_{z\sim q_x(z)}\!\left[-\frac{(x - f(z))^2}{2c}\right]. \qquad (12)$$
Gathering all the pieces together, we are looking for optimal f ∗ , g ∗ and h∗ such that
$$(g^*, h^*, f^*) = \operatorname*{argmin}_{g,h,f}\; \mathbb{E}_{z\sim q_x(z)}\!\left[\frac{(x - f(z))^2}{2c}\right] + D\big(\mathcal{N}(g(x), h(x))\,\|\,\mathcal{N}(0, I)\big). \qquad (13)$$
In Eq. (13) we get two terms: the first one is for the reconstruction of x using the decoder part, and the second term uses the KL divergence to keep the approximate posterior close to the prior probability P(z). The overall architecture is then obtained by concatenating the encoder and the decoder parts, and we can use gradient descent optimization to find the optimal parameters of the VAE encoder and decoder, with the loss function defined as
$$\mathrm{Loss} = \mathbb{E}_{z\sim q_x(z)}\!\left[\frac{(x - f(z))^2}{2c}\right] + D\big(\mathcal{N}(g(x), h(x))\,\|\,\mathcal{N}(0, I)\big). \qquad (14)$$
The second term can also be treated as a regularisation term, given by the KL divergence between two Gaussian distributions, which helps regularize the VAE encoder's approximation of the latent distribution. A minimal code sketch of this loss is given below.
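Here is a minimal PyTorch sketch of the loss in eq. (14); the function name, the log-variance parameterization of the diagonal covariance, and the dummy tensors are my own illustrative choices:

```python
import torch

def vae_loss(x, x_hat, mu, log_var, c=1.0):
    """Sketch of the loss in eq. (14), assuming the encoder outputs mu = g(x) and
    log_var = log h(x) (a diagonal covariance), and x_hat = f(z) is the decoder
    reconstruction; c is the assumed likelihood variance."""
    # Reconstruction term: E[(x - f(z))^2 / (2c)], estimated with one sample of z.
    recon = ((x - x_hat) ** 2).sum(dim=1) / (2.0 * c)
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.
    kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=1)
    return (recon + kl).mean()

# Dummy tensors standing in for a batch of data and network outputs.
x = torch.rand(8, 784); x_hat = torch.rand(8, 784)
mu = torch.zeros(8, 16); log_var = torch.zeros(8, 16)
print(vae_loss(x, x_hat, mu, log_var))
```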
A. Reparametrization Trick
We note that we still need to be very careful about the way we sample from the
distribution returned by the encoder during the training. The sampling process has to be
expressed in a way that allows the error to be backpropagated through the network to
compute the gradients for the Gradient Descent process as part of the training. Thus, the
reparametrization trick [1] is used as illustrated in Fig. 4 to make the gradient descent
possible despite the random sampling that occurs halfway through the architecture (after
the encoder). Using the fact that z is a random variable following a Gaussian distribution with mean g(x) and covariance h(x), it can be expressed as
$$z = g(x) + h(x)^{1/2}\,\zeta, \qquad \zeta \sim \mathcal{N}(0, I).$$
In this approach the whole process becomes deterministic: ζ is sampled in advance, and all that remains is to propagate the value through the network deterministically, as in the short sketch below.
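A minimal PyTorch sketch of this reparametrization (the function name and the log-variance convention are my own assumptions):

```python
import torch

def reparameterize(mu, log_var):
    """Reparametrization trick sketch: instead of sampling z ~ N(mu, diag(exp(log_var)))
    directly, sample eps ~ N(0, I) and set z = mu + std * eps, so gradients can flow
    back through mu and log_var (the encoder outputs g(x) and h(x))."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)     # the only source of randomness, independent of the network
    return mu + std * eps

mu = torch.zeros(4, 16, requires_grad=True)
log_var = torch.zeros(4, 16, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                  # gradients reach mu and log_var
```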
REFERENCES
Lecture 12
Lecturer: Haim Permuter Scribe: Moshe Bunker
SUMMARY
In just three years, Variational Autoencoders (VAEs), section IV, [3], [1] have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. VAEs are appealing because they are built on top of standard function approximators (neural networks), and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentation,
and predicting the future from static images. This lecture introduces the intuitions behind
VAEs, explains the mathematics behind them, and describes some empirical behavior.
No prior knowledge of variational Bayesian methods is assumed.
INTRODUCTION
Variational Autoencoders belong to the family of generative models [1]. The generator
of VAEs is able to produce meaningful outputs while navigating its continuous latent
space. The possible attributes of the decoder outputs are explored through the latent
vector. VAEs attempt to model the input distribution from a decodable continuous latent
space. Within VAEs, the focus is on the variational inference of latent codes.
Therefore, VAEs provide a suitable framework for both learning and efficient Bayesian
inference with latent variables. For example, VAEs with disentangled representations
enable latent code reuse for transfer learning. In terms of structure, VAE bears a
resemblance to an autoencoder. It is also made up of an encoder (also known as a
recognition or inference model) and a decoder (also known as a generative model). Both
VAEs and autoencoders attempt to reconstruct the input data while learning the latent
Fig. 1. Illustration of an autoencoder (input → latent variable → output); the dimensionality reduction principle can be seen in the diagram.
vector. However, unlike autoencoders, the latent space of VAE is continuous, and the
decoder itself is used as a generative model.
I. AUTO ENCODER
In its simplest form, an autoencoder [3] will learn the representation or code by trying
to copy the input to output. However, using an autoencoder is not as simple as copying the
input to output. Otherwise, the neural network would not be able to uncover the hidden
structure in the input distribution. An autoencoder will encode the input distribution into a
low-dimensional tensor, which usually takes the form of a vector. This will approximate
the hidden structure that is commonly referred to as the latent representation, code, or
vector. This process constitutes the encoding part. The latent vector will then be decoded
by the decoder part to recover the original input.
As a result of the latent vector being a low-dimensional compressed representation
of the input distribution, it should be expected that the output recovered by the decoder
can only approximate the input. The dissimilarity between the input and the output can
be measured by a loss function. But why would we use autoencoders? Simply put, autoencoders have practical applications both in their original form and as part of more complex neural networks. They are a key tool in understanding the advanced topics of deep learning, as they give you a low-dimensional latent vector, and are therefore used for dimension reduction [3].
Dimensionality reduction is the process of reducing the number of features that describe some data. This reduction is done either by selection (only some existing features are conserved) or by extraction (a reduced number of new features are created based on the old features) and can be useful in many situations that require low-dimensional data (data visualisation, data storage, heavy computation, etc.). Although there exist many different methods of dimensionality reduction, we can set a global framework that is matched by most of these methods.
Here, we should however keep two things in mind. First, an important dimensionality reduction with no reconstruction loss often comes with a price: the lack of interpretable and exploitable structures in the latent space (lack of regularity) [2]. Second, most of the time the final purpose of dimensionality reduction is not only to reduce the number of dimensions of the data, but to do so while keeping the major part of the data structure information in the reduced representations. For these two reasons, the dimension of the latent space and the "depth" of autoencoders (which define the degree and quality of compression) have to be carefully controlled and adjusted depending on the final purpose of the dimensionality reduction.
$$z = f(x) \qquad (1)$$
$$p_\theta(x) = \int p_\theta(x, z)\, dz \qquad (2)$$
In other words, considering all of the possible attributes, we end up with the distribution that describes the inputs. The problem is that Equation 2 is intractable. The equation does not have an analytic form or an efficient estimator, and it cannot be differentiated with respect to its parameters. Therefore, optimization by a neural network is not feasible. Using Bayes' theorem, we can find an alternative expression for Equation 2:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad (3)$$
which can, for instance, be estimated by Monte Carlo sampling from p(z), as sketched below.
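For intuition only (this example is not from the lecture), the evidence in eq. (3) can be approximated by Monte Carlo sampling from p(z) when the model is simple enough to check against a closed form:

```python
import numpy as np
from scipy.stats import norm

# Toy Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ] for a 1-D model
# with p(z) = N(0, 1) and p(x | z) = N(z, 1); the exact value is p(x) = N(x; 0, 2).
x_obs = 0.7
z = np.random.standard_normal(200_000)        # samples from the prior p(z)
p_x_mc = norm.pdf(x_obs, loc=z, scale=1.0).mean()
p_x_exact = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(2.0))
print(p_x_mc, p_x_exact)                      # the estimate approaches the exact value
```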
Therefore, we return to Equation 6 and substitute the development we made in Equation 8. We get that the minimization condition becomes the maximization of the expression:
$$\max_q\; \mathbb{E}_q[\log p(x^n \mid z^m)] - D\big(q(z^m)\,\|\,p(z^m)\big). \qquad (10)$$
Let us now make the assumption that p(z) is a standard Gaussian distribution and that p(x|z) is a Gaussian distribution whose mean is defined by a deterministic function f of the variable z and whose covariance matrix has the form of a positive constant c that multiplies the identity matrix I. The function f is left unspecified for the moment and will be chosen later. Thus, we have
$$p(z) = \mathcal{N}(0, I), \qquad p(x \mid z) = \mathcal{N}(f(z), cI).$$
Here we are going to approximate p(z|x) by a Gaussian distribution qx (z) whose mean
and covariance are defined by two functions, g and h, of the parameter x. These two
functions are supposed to belong, respectively, to families of functions G and H that will be specified later but that are supposed to be parametrised. Thus we can denote
$$q_x(z) = \mathcal{N}(g(x), h(x)), \qquad g \in G, \; h \in H.$$
So, we have defined in this way a family of candidates for variational inference, and we now need to find the best approximation within this family by optimising the functions g and h (in fact, their parameters) to minimise the divergence between the approximation and the target p(z|x). In other words, we are looking for the optimal g* and h* such that:
$$
\begin{aligned}
(g^*, h^*) &= \operatorname*{argmin}_{g,h}\; D\big(q_x(z)\,\|\,p(z|x)\big)\\
&= \operatorname*{argmin}_{g,h}\; \mathbb{E}_q[\log q_x(z)] - \mathbb{E}_q\!\left[\log \frac{p(x|z)\,p(z)}{p(x)}\right]\\
&= \operatorname*{argmin}_{g,h}\; \mathbb{E}_q[\log q_x(z)] - \mathbb{E}_q[\log p(x, z)]\\
&= \operatorname*{argmin}_{g,h}\; \mathbb{E}_q[\log q_x(z)] - \mathbb{E}_q[\log p(z)] - \mathbb{E}_q[\log p(x|z)] \qquad (13)\\
&= \operatorname*{argmin}_{g,h}\; D\big(q_x(z)\,\|\,p(z)\big) - \mathbb{E}_q[\log p(x|z)].
\end{aligned}
$$
V. REPARAMETERIZATION TRICK
The left-hand side of Figure 3 below shows the VAE network. The encoder takes
the input, and estimates the mean, g(x), and the standard deviation, h(x), of the
multivariate Gaussian distribution of the latent vector, z, to reconstruct the input as x̃.
This seems straightforward until the gradient updates happen during backpropagation.
Backpropagation gradients will not pass through the stochastic Sampling block. While
it’s fine to have stochastic inputs for neural networks, it’s not possible for the gradients
to go through a stochastic layer.
The solution to this problem is to push the sampling process out to the input, as shown on the right side of Figure 3. Then, compute the sample as
$$z = g(x) + \varepsilon * h(x), \qquad \varepsilon \sim \mathcal{N}(0, I). \qquad (14)$$
If ε and h(x) are expressed in vector format, then ε ∗ h(x) is element-wise multiplication. Using Equation 14, it appears as if the sampling is coming directly from the latent space.
Fig. 3. On the left side, a VAE network [2] without the reparameterization trick; on the right, a diagram with the reparameterization trick, where the sample is computed as g(x) + ε ∗ h(x) with ε ∼ N(0, 1).
REFERENCES
[1] C. Doersch. Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908, 2016.
[2] R. Atienza. Advanced Deep Learning with Keras. Packt Publishing, Birmingham, UK, 2018.
[3] J. Rocca. Understanding Variational Autoencoders (VAEs), Sep. 24, 2019.
https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73