Unit 04 - Unsupervised Learning - MD
We will then move on to recognising data structures where each point can belong to
multiple clusters, trying to discover the different components embedded in the
data and looking at the likelihood that a certain point was generated by each
component.
13.2. Objectives
Understand the definition of clustering
Understand clustering cost with different similarity measures
Understand the K-means algorithm
So in this case, we are provided with just a set of feature vectors, and we are
trying to learn some meaningful structure which would represent this set. In this
lesson in particular we will cover clustering algorithms.
It is intuitive that the points belong to two different categories, in this case
based on their position on the (x,y) axes, even if the points have no label (colour)
associated with them.
Finding a structure is however not enough. We need to be able to describe how well
the structure we found represents the data, i.e. how good the clustering, the
partitioning of the data, is (the "clustering cost"). For that we will need a
measure of similarity between points.
We will finish the lecture dealing with the K-means algorithm, an important and
widely used algorithm which can take any set of training points and find
a partitioning into K clusters.
To conclude the segment with an example, let's consider again Google News. In
Google News we have both an example of classification and one of clustering.

The grouping of the news into one of the categories (sport, science, business, ...) is
a categorisation task. The categories are fixed, and someone has trained the
algorithm by manually setting the category of each news item in a training set, so that
now the category can be predicted automatically.
The grouping of news around individual news events (Covid-19, Brexit, the
US elections, ...) is instead a clustering problem: here the categories are not
predefined and there has been no training set with manual labelling of all possible
events.
A typical 1024 x 1024 pixel image coded in 24 bits per pixel (8 bits for each of
the RGB colours) would take 1024 x 1024 x 24 = 25,165,824 bits = 3 MB.

If we create a palette restricted to only 32 colours, and represent each pixel with
one of the colours in this palette, the image would then take only
1024 x 1024 x 5 + 32 x 24 = 5,243,648 bits ≈ 640 KB (this, by the way, is the
compression technique used in GIF and palette-based PNG images).
Note that the image quality depends on the number of colours in the palette
we choose, i.e. the number of clusters K. While cartoons, screenshots or logos
would likely require only a very small K to appear of a quality similar to the original
image, for photos we would likely need a higher K to achieve the same quality.
Still, to get the compressed image we need to go through the three steps described
in the previous segment. We need to represent the image (as a vector of pixels
with a colour code). The second step is to find some similarity measure between
colours, and finally we can run the clustering algorithm to replace each pixel's
24-bit representation with the new 5-bit one, choosing a map from the 16,777,216
original colours (24 bits) to the 32 "representative" ones (also to be found by the
algorithm!).
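A minimal sketch of this colour-quantisation pipeline, assuming scikit-learn is available; the random image, the K = 32 choice and all parameter values are illustrative, not part of the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical image: 256 x 256 pixels, 3 RGB channels in [0, 255]
# (smaller than the 1024 x 1024 example above, just to keep the sketch fast)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3)).astype(float)

# Step 1: represent the image as one feature vector (R, G, B) per pixel
pixels = image.reshape(-1, 3)

# Steps 2-3: squared Euclidean distance in RGB space + K-means with K = 32 colours
kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(pixels)

palette = kmeans.cluster_centers_      # the 32 "representative" colours
indices = kmeans.labels_               # one 5-bit palette index per pixel
quantised = palette[indices].reshape(image.shape)
```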
Partitioning
Let $S_n = \{x^{(i)} \mid i = 1, \dots, n\}$ be the set of $n$ feature vectors indexed by $i$ and
$C_K = \{C_j \mid j = 1, \dots, K\}$ be the set of $K$ clusters containing the indexes of the
data points. Then $C_1, \dots, C_j, \dots, C_K$ form a K-partition of the data if, at the
same time, their union contains all the data indexes and their pairwise intersections are
empty: $C_1 \cup \dots \cup C_K = \{1, 2, \dots, n\}$ and $C_j \cap C_{j'} = \emptyset$ for $j \neq j'$.
This partitioning defines a so-called hard clustering, where every element belongs
to exactly one cluster.
Representativeness
The representativeness view of clustering is the problem of how to select the
feature vector to be used as representative of the cluster, like the news story to
put as the header of each event in Google News or the colour to use for each
palette item. Note that the representative element of each cluster doesn't
necessarily have to be an existing element of the cluster set; it can also be a
combination of them, like the average.
We will see later in this lesson what the connection between these two views is.
Clearly they are connected, because if you know what your partitioning is, you
may be able to guess who the representative is; if you know who the
representative is, you may be able to guess who the constituents are. We will see
how we can actually unify these two views together.
Note that finding the optimal number of K is itself another problem, as K is not
part of the clustering algorithm, it is an input (parameter) to it. That is, K is a
hyperparameter of the clustering algorithm.
cosine distance, $\text{dist}(x^{(i)}, x^{(j)}) = \dfrac{x^{(i)} \cdot x^{(j)}}{\|x^{(i)}\| \, \|x^{(j)}\|}$

squared Euclidean distance, $\text{dist}(x^{(i)}, x^{(j)}) = \|x^{(i)} - x^{(j)}\|^2$
The cosine distance has the merit of being invariant to the magnitude of the
vectors. For example, in terms of the Google News application, if we choose
cosine to compare two vectors representing two different stories, this
measurement will not take into account how long the stories are. On the other
hand, the Euclidean distance would be sensitive to the lengths of the stories.

While cosine looks at the angle between vectors (thus disregarding their weight
or magnitude), the Euclidean distance is similar to using a ruler to actually
measure the distance (and so cosine is essentially the same as Euclidean on
normalised data).
Cosine similarity is generally used as a metric for measuring distance when the
magnitude of the vectors does not matter. This happens for example when
working with text data represented by word counts. We could assume that when
a word (e.g. "science") occurs more frequently in document 1 than it does in
document 2, document 1 is more related to the topic of science. However, it
could also be the case that we are working with documents of uneven lengths
(Wikipedia articles for example). Then, "science" probably occurred more in
document 1 just because it is way longer than document 2. Cosine similarity
corrects for this (from https://cmry.github.io/notes/euclidean-v-cosine).
While we'll need to understand which metric is most suited for a specific
application (and whether it matters in the first place), for now we will employ
the Euclidean distance.
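As a minimal numpy sketch (with made-up word-count vectors), here is how the two measures react to a document that is simply three times longer but has the same word proportions:

```python
import numpy as np

# Hypothetical word-count vectors for two documents over the same vocabulary
doc1 = np.array([10.0, 4.0, 0.0, 2.0])
doc2 = np.array([30.0, 12.0, 0.0, 6.0])   # same proportions, three times longer

squared_euclidean = np.sum((doc1 - doc2) ** 2)
cosine_similarity = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))

print(squared_euclidean)   # large: sensitive to document length
print(cosine_similarity)   # 1.0: identical direction, length is ignored
```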
Given these two choices, our cost function for a specific partition becomes:

$$\text{cost}(C_1, \dots, C_K; z^{(1)}, \dots, z^{(K)}) = \sum_{j=1}^{K} \sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2$$
Note that for now we are giving this cost function as a function of both the
partition and the representative vectors, but once we determine how to
compute the representative vectors $z^{(j)}$ endogenously, it will become a function of
the partition alone.
The number of possible partitions of $n$ points into $k$ non-empty clusters is given by the Stirling number of the second kind:

$$\left\{ {n \atop k} \right\} = \frac{1}{k!} \sum_{i=0}^{k} (-1)^i \binom{k}{i} (k-i)^n$$

For example, $\left\{ {3 \atop 2} \right\} = 3$, but $\left\{ {100 \atop 3} \right\} \approx 8.58 \times 10^{46}$.
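A small Python sketch to verify these numbers with exact integer arithmetic (the function name is just illustrative):

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: ways to partition n items into k non-empty clusters."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1)) // factorial(k)

print(stirling2(3, 2))     # 3
print(stirling2(100, 3))   # ~8.58e46
```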
representative at the center of its newly acquired cluster (where "center" again
depends on the metric). Steps (b) and (c) are reiterated until the
algorithm converges, i.e. the tentative K representative points (and their relative
clusters) don't move any more.
Graphically:
Redefinition of the constituencies (note that some points have changed their
representative):
Let’s describe this process a bit more formally in terms of the cluster costs.
1. Randomly select the representatives $z^{(1)}, \dots, z^{(j)}, \dots, z^{(K)}$
2. Iterate:

   2.1. Given $z^{(1)}, \dots, z^{(j)}, \dots, z^{(K)}$, assign each data point $x^{(i)}$ to the closest representative, so that the resulting cost will be $\text{cost}(z^{(1)}, \dots, z^{(K)}) = \sum_{i=1}^{n} \min_{j=1,\dots,K} \|x^{(i)} - z^{(j)}\|^2$

   2.2. Given the clusters $C_1, \dots, C_K$ resulting from this assignment, find for each cluster the new representative $z^{(j)}$ that minimises its cluster cost $\sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2$
Note that in both step 1 and step 2.2 we are not restricted to having the initial and
final representatives be part of the data set. The fact that the final
representatives are guaranteed to be part of the dataset is instead one of the
advantages of the K-medoids algorithm presented in the next lesson.
Using the squared Euclidean distance, the "new" representative vector $z^{(j)}$ for
each cluster must satisfy $\min_{z^{(j)}} \text{cost}(z^{(j)}; C_j) = \min_{z^{(j)}} \sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2$.

When we compute the gradient of the cost function with respect to $z^{(j)}$ and set it
to zero, we retrieve the optimal $z^{(j)}$ as

$$z^{(j)} = \frac{\sum_{i \in C_j} x^{(i)}}{|C_j|}$$

where $|C_j|$ is the number of elements of the cluster $C_j$. Intuitively, the optimal representative vector is at the
center of the cluster, i.e. it is the centroid of the group.
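A compact numpy sketch of the algorithm under the squared Euclidean distance; the initialisation (picking K random data points) and the data are illustrative assumptions:

```python
import numpy as np

def k_means(X: np.ndarray, K: int, n_iter: int = 100, seed: int = 0):
    """Plain K-means: step 2.1 assigns points to the closest representative,
    step 2.2 moves each representative to the centroid of its cluster."""
    rng = np.random.default_rng(seed)
    z = X[rng.choice(len(X), size=K, replace=False)]      # step 1: random representatives
    for _ in range(n_iter):
        # step 2.1: squared Euclidean distance of every point to every representative
        d2 = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
        assignment = d2.argmin(axis=1)
        # step 2.2: each representative becomes the centroid of its cluster
        new_z = np.array([X[assignment == j].mean(axis=0) if np.any(assignment == j) else z[j]
                          for j in range(K)])
        if np.allclose(new_z, z):      # converged: representatives no longer move
            break
        z = new_z
    return z, assignment

X = np.random.default_rng(1).normal(size=(200, 2))        # hypothetical 2-D data
centroids, labels = k_means(X, K=3)
```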
Impact of initialisation
Note that the K-means algorithm is guaranteed to converge and find a local cost
minimum, because at each iteration the cost can only decrease, the cost
function is non-negative and the number of possible partitions, however large, is
finite.
In the upper part we see the initial random assignment of the representative
vectors, and in the bottom picture we see the final assignment of the clusters.
While the representative vectors moved a bit, the final partition is clearly not the
optimal one (where the representative vectors would be at the centers of the
three groups).
In particular, we may run into trouble when the initial representative vectors
are close to each other rather than spread out across the multidimensional
space.
The first class of problems is that the measure doesn't describe what we consider a
natural classification, so the cost function doesn't represent what we would like
the cost of the partition to be: it does not return useful information about
the ranking of partitions.
While the K-means algorithm scales well to large datasets, it has several other drawbacks.
One is that the vanilla K-means algorithm tries to find spherical clusters in the data, even
when the groups have a different spatial arrangement:
To account for this, K-means can be kernelised, where separation of arbitrary
shapes can theoretically be achieved using higher-dimensional spaces. Alternatively,
a regularised Gaussian mixture model could be used, as in this paper or in
these slides.

These and further drawbacks are discussed, for example, here, here or here.
In some applications this is not a concern, but it is in others. For example,
looking at Google News, if the representative we create for a cluster of stories
doesn't correspond to any actual story, we have nothing to show.
We will now introduce the K-medoids algorithm, which modifies the K-means one to
allow any kind of distance (not only the squared Euclidean one) and to return a
set of representative vectors that are always part of the original data set.
Step 2.1 (determining the constituencies of the representatives) remains the
same, while in step 2.2, instead of choosing the new representatives as
centroids, we again constrain them to be one of the points of the
cluster. This allows us to just loop over all the points of the cluster to see which one
minimises the cluster cost. And since in both steps 2.1 and 2.2 we explicitly use a
"distance" function, we can employ any kind of distance definition we want, i.e.
we are no longer restricted to the squared Euclidean one.
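A sketch of this medoid update for a single cluster, assuming an arbitrary pairwise distance function (plain Euclidean here, just as an example):

```python
import numpy as np

def update_medoid(cluster_points: np.ndarray, dist=lambda a, b: np.linalg.norm(a - b)):
    """Step 2.2 of K-medoids: the new representative is the cluster member
    that minimises the total distance to the other members."""
    costs = [sum(dist(z, x) for x in cluster_points) for z in cluster_points]
    return cluster_points[int(np.argmin(costs))]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])   # hypothetical cluster
print(update_medoid(cluster))   # [1. 0.] has the smallest total distance within the cluster
```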
example, if the number of steps involved is $5n^2 + n + 1$, then we say it is "of
order $n^2$" and denote this by $O(n^2)$. When $n$ is large, the highest order term $5n^2$
dominates and we drop the scaling constant 5.
More formally, a function f(n) is of order g(n), and we write f(n) ∼ O(g(n)), if
there exists a constant C such that f(n) < Cg(n) as n grows large.
In other words, the function f does not grow faster than the function g as n
grows large.
The big-O notation can be used also when there are more input variables. For
example, in this problem, the number of steps necessary to complete one
iteration depends on the number of data points n, the number of clusters K, the
dimension $d$ of each vector $x^{(i)}$. Hence, the number of steps required is
$O(g(n, K, d))$ for some function $g(n, K, d)$.
The order of computation for step 2.1 of the K-means algorithm is
$O(n \cdot K \cdot d)$, as we need to go through each point ($n$), compute the distance to
each representative ($K$) and account for the fact that we deal with multidimensional vectors
($d$). Step 2.2 has the same order and, because we're talking about asymptotic
growth, when we sum the two steps up we're still staying within the same order of
complexity.
For the K-medoids algorithm instead we can see that step 2.2 is much more complex, as
we need to go through all the clusters, and then through all the points of each cluster.
Depending on how the data is distributed across the clusters, we can go from the
worst case of all points in one cluster, where we would have a computational
complexity of $O(n^2 \cdot d)$, to a situation where the points are homogeneously
distributed across the clusters; in that case we would have $n/K$ points in
each cluster and a total complexity of $O(\frac{n}{K} \cdot \frac{n}{K} \cdot K \cdot d) = O(\frac{n^2}{K} \cdot d)$.
Given that normally $n \gg K$, what makes the difference is the $n^2$ term, and even
in the best scenario the computational complexity of the K-medoids algorithm is
much higher than that of the K-means one, so for some applications with big
datasets the K-means algorithm would be preferable.
At this point we already have seen two clustering algorithms, but there are
hundreds of them available for our use. Each one has different strengths and
weaknesses, and when selecting a clustering algorithm that fits our
application, we should account for all of them in order to find the best
clustering algorithm for our needs.
decrease, until the degenerate case where the number of clusters equals the
number of data points and the cost is zero.

When is it then time to stop increasing the number of clusters? How do we decide the
"optimal" number K? There are three general settings. In the first one the
number of clusters is fixed, really given to us by the application. For example,
we need to cluster our course content into 5 recitations; this is the space we have.
Problem "solved".
In clustering, for example, we decide which similarity measure to use and, as
we saw, how many clusters to provide. And if you're thinking about the
bag-of-words approach to representing text, these algorithms cannot figure out
which words are more semantically important than others. So it would be up to us
to do some weighting, so that the clustering result is actually acceptable.
15.1. Objectives
Understand what Generative Models are and how they work
Understand estimation and prediction phases of generative models
Derive a relation connecting generative and discriminative models
Derive Maximum Likelihood Estimates (MLE) for multinomial and Gaussian
generative models
For example, your task may be not just to do classification but to generate new
samples of such (feature, label) pairs based on the information provided in the
training data.
Given such a probabilistic model, the generative model $p(w \mid \theta)$ is nothing else than
the statistical model used to estimate the parameters $\theta$ given the observed data $w$,
or more formally the statistical model described by the pair $(E, \{P_\theta\}_{\theta \in \Theta})$, where
$E$ is the sample space of the data and $\{P_\theta\}_{\theta \in \Theta}$ is the family of distributions
parametrised by $\theta$.
For example, the probabilistic model may be a word-generating model, given a
fixed vocabulary of W words, where each word has its own
probability. Then we could see a document as a series of random (independent)
extractions of these words, so that a document made of common words has a higher
probability of being generated than a document made of uncommon
words. As the words are independent in this model, the likelihood of the whole
document being generated is just the product of the likelihoods of each word
being generated.
Note that the categorical and multinomial distributions are nothing else than
extensions of, respectively, the Bernoulli and binomial distributions to multiple
classes (rather than just binary 0/1 ones). They model the probability of,
respectively, one or multiple independent categorical trials, i.e. the outcome of
the categorical distribution is a single word, while that of the multinomial
distribution is a whole document (encoded as word counts).
For completeness, there are $\frac{n!}{w_1! \cdots w_i! \cdots w_k!}$ possible sequences for a
document with word counts $w_1, \dots, w_i, \dots, w_k$ (with $\sum_{i=1}^{k} w_i = n$), so the PMF
of the multinomial is

$$p(w_1, \dots, w_i, \dots, w_k; \theta_1, \dots, \theta_i, \dots, \theta_k) = \frac{n!}{w_1! \cdots w_i! \cdots w_k!} \prod_{i=1}^{k} \theta_i^{w_i}$$
Note indeed that in some fields, such as machine learning and natural language
processing, the categorical and multinomial distributions are conflated, and it is
common to speak of a “multinomial distribution” when a “categorical
distribution” would be more precise.
having generated the observed document.
Let's consider for example a vocabulary of just two words, "cat" and "dog", and
an observed document D = "cat cat dog". Let's also assume that we have just two
competing models to choose from. The first (probabilistic) model has $\theta_{cat} = 0.3$
and $\theta_{dog} = 0.7$. The second model has $\theta_{cat} = 0.9$ and $\theta_{dog} = 0.1$. It is intuitive
that it is more probable that the second model generated the document,
but let's compute it. The probability of the document being generated by the
first model is $0.3^2 \cdot 0.7 = 0.063$. The probability that D has been generated
by the second model is instead $0.9^2 \cdot 0.1 = 0.081$, which is higher than that of
the first model.
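As a sanity check, the same computation in code (the constant combinatorial factor is dropped, since it is identical for both models and does not affect the comparison):

```python
# Word counts of the observed document "cat cat dog"
counts = {"cat": 2, "dog": 1}

model_1 = {"cat": 0.3, "dog": 0.7}
model_2 = {"cat": 0.9, "dog": 0.1}

def likelihood(counts, theta):
    """Likelihood of the document under a word-generating model (combinatorial factor omitted)."""
    p = 1.0
    for word, n in counts.items():
        p *= theta[word] ** n
    return p

print(likelihood(counts, model_1))   # 0.063
print(likelihood(counts, model_2))   # 0.081 -> model 2 is the more likely generator
```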
While often used synonymously in common speech, the terms “likelihood” and
“probability” have distinct meanings in statistics. Probability is a property of the
sample, specifically how probable it is to obtain a particular sample for a given
value of the parameters of the distribution; likelihood is a property of the
parameter values.
We now want to find the parameters that maximise the likelihood.

For the multinomial model this means finding $\arg\max_\theta L(D; \theta) = \arg\max_\theta \prod_{w \in W} \theta_w^{\text{count}(w)}$.

We hence compute the first order conditions of the log-likelihood in terms of theta, to find the theta that maximises it. Consider, for example, a vocabulary of just two words, 0 and 1, with parameters $\theta_0$ and $\theta_1 = 1 - \theta_0$, so that the log-likelihood is $lL = \text{count}(0)\log\theta_0 + \text{count}(1)\log(1 - \theta_0)$.

Taking the derivative with respect to $\theta_0$ and setting it equal to zero we obtain:
$$\frac{\partial lL}{\partial \theta_0} = \frac{\text{count}(0)}{\theta_0} - \frac{\text{count}(1)}{1 - \theta_0}$$

$$\frac{\partial lL}{\partial \theta_0} = 0 \;\Rightarrow\; \tilde{\theta}_0 = \frac{\text{count}(0)}{\text{count}(0) + \text{count}(1)}$$
Note the symbol of the tilde to indicate an estimated parameter. This is our “best
guess” parameter given the observed data.
The problem is however that the thetas are not actually free: we have the
constraint that the thetas have to sum to one (we actually also have the constraint
that the thetas need to be non-negative, but given the nature of this maximisation
problem we can ignore it).
$$\begin{aligned} \mathcal{L}_x &= f_x - \lambda g_x = 0 \\ \mathcal{L}_y &= f_y - \lambda g_y = 0 \\ \mathcal{L}_\lambda &= c - g(x, y) = 0 \end{aligned}$$
third equation in the FOC guarantees that the constraint is indeed respected, i.e.
at the stationary point $g(x, y)$ and $c$ have the same value.
$$\begin{aligned} \mathcal{L}_x &= y - \lambda = 0 \\ \mathcal{L}_y &= x - \lambda = 0 \\ \mathcal{L}_\lambda &= 6 - x - y = 0 \end{aligned}$$
$$\begin{aligned} \mathcal{L}_w &= \frac{n_w}{\theta_w} - \lambda = 0 \\ \mathcal{L}_\lambda &= \sum_{w \in W} \theta_w - 1 = 0 \end{aligned}$$
Where the first equation is actually a gradient, i.e. one equation for each $\theta_w$.
The first set of equations gives $\theta_w = n_w / \lambda$; substituting into the constraint we get
$\lambda = \sum_{w \in W} n_w = n$, so that $\tilde{\theta}_w = \frac{n_w}{n}$, the relative frequency of word $w$ in the
observed data. This set of $\tilde{\theta}_w$ parameters are the maximum likelihood estimates, the values
of $\theta$ that maximise the likelihood function for the observed data under this
multinomial generative model.
15.7. Prediction
Let's now see how we can actually use generative multinomial models to make
predictions, for example deciding whether a document is in the Spanish or Portuguese
language.
For that we are first given a large set of documents in both languages, together
with their label (the language).

From there we can estimate the PMF of the words using the MLE estimator (i.e.,
counting the relative proportions) for both Spanish and Portuguese (so our
vocabulary would include both Spanish and Portuguese terms).
We are now given a new document without the language label. Calling the two
languages "+" and "-", we compute the likelihood of the document under the
first hypothesis ($P(D \mid \theta^+)$) as the joint probability using the first PMF, the
likelihood under the second hypothesis ($P(D \mid \theta^-)$) using the second PMF, and we
decide how to classify the document according to which one is bigger.
But

$$\log \frac{P(D \mid \theta^+)}{P(D \mid \theta^-)} = \log P(D \mid \theta^+) - \log P(D \mid \theta^-) = \sum_{w \in W} \text{count}(w) \log \frac{\theta_w^+}{\theta_w^-} = \sum_{w \in W} \text{count}(w) \, \hat{\theta}_w$$

where $\hat{\theta}_w$ is just a symbol for $\log \frac{\theta_w^+}{\theta_w^-}$.
The last expression shows that this classifier, derived by looking at classification
through a generative view, should actually remind us of a linear classifier
through the origin with parameters $\hat{\theta}_w$.
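A small sketch of this decision rule, with made-up word probabilities and counts; the vocabulary and all numbers are illustrative assumptions:

```python
import math

# Hypothetical MLE word probabilities for the two classes "+" and "-"
theta_plus  = {"w1": 0.5, "w2": 0.3, "w3": 0.2}
theta_minus = {"w1": 0.2, "w2": 0.3, "w3": 0.5}

def log_likelihood_ratio(counts, t_plus, t_minus):
    """sum_w count(w) * log(theta+_w / theta-_w); a positive value means '+' is more likely."""
    return sum(n * (math.log(t_plus[w]) - math.log(t_minus[w])) for w, n in counts.items())

document = {"w1": 3, "w2": 1, "w3": 1}      # word counts of the new document
score = log_likelihood_ratio(document, theta_plus, theta_minus)
print("+" if score > 0 else "-")            # classify according to the sign
```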
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

$$P(X = x) = \sum_y P(Y = y) \, P(X = x \mid Y = y)$$

$$P(\text{lang} = \text{Spanish} \mid D) = \frac{P(D \mid \text{lang} = \text{Spanish}) \, P(\text{lang} = \text{Spanish})}{P(D)}$$

where

$$P(D) = P(D \mid \text{lang} = \text{Spanish}) \, P(\text{lang} = \text{Spanish}) + P(D \mid \text{lang} = \text{Portuguese}) \, P(\text{lang} = \text{Portuguese})$$
Going back to our plus/minus class notation instead of the languages, we can
compute the log of the ratio between the posteriors for the two classes as:

$$\log \frac{P(y = + \mid D)}{P(y = - \mid D)} = \log \frac{P(D \mid \theta^+)}{P(D \mid \theta^-)} + \log \frac{P(y = +)}{P(y = -)}$$

Using the expression from the previous segment for the first term and $\hat{\theta}_0$ as a
symbol representing the second term, we can rewrite the equation as:

$$\log \frac{P(y = + \mid D)}{P(y = - \mid D)} = \sum_{w \in W} \text{count}(w) \, \hat{\theta}_w + \hat{\theta}_0$$
So what we've seen here is that in this model we can very easily incorporate our
prior knowledge about the likelihood of certain classes. And in the end, even
though we're talking about generative models and we're using a different
mechanism, in some ways, for estimating the parameters of this multinomial, we
actually get the same linear separators that we see in our discriminative
modelling.
While the categorical distribution was appropriate for modelling discrete random
variables, like classes, the Normal is the analogous counterpart for continuous
random variables.
$$p_X(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \; e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}, \quad x \in \mathbb{R}^d$$
where x ∈ Rd , μ ∈ Rd is the vector of the means in the d dimensions and
Σ ∈ Rd×d is the covariance matrix (with det(Σ) being its determinant).
When all the components (dimensions) are uncorrelated (i.e. the covariance
matrix Σ is diagonal) and have the same standard deviation σ, the Normal PDF
reduces to:
$$p_X(x; \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}} \; e^{-\frac{1}{2\sigma^2}\|x - \mu\|^2}$$

The likelihood is

$$p(\mu, \sigma^2; S_n) = \prod_{t=1}^{n} p(\mu, \sigma^2; x^{(t)}) = \prod_{t=1}^{n} \frac{1}{(2\pi\sigma^2)^{d/2}} \; e^{-\frac{1}{2\sigma^2}\|x^{(t)} - \mu\|^2}$$

and the log-likelihood is

$$\log p(\mu, \sigma^2; S_n) = \log \left( \prod_{t=1}^{n} \frac{1}{(2\pi\sigma^2)^{d/2}} \; e^{-\frac{1}{2\sigma^2}\|x^{(t)} - \mu\|^2} \right) = -\frac{nd}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{n}\|x^{(t)} - \mu\|^2$$

From this expression we can take the derivatives with respect to $\mu$ and $\sigma^2$ and set them equal to
zero to find the MLE estimators

$$\hat{\mu} = \frac{1}{n}\sum_{t=1}^{n} x^{(t)} \qquad\qquad \hat{\sigma}^2 = \frac{\sum_{t=1}^{n}\|x^{(t)} - \hat{\mu}\|^2}{nd}$$
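A quick numpy check of these estimators on made-up data (a spherical Gaussian, so a single shared $\sigma^2$ across dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
true_mu, true_sigma = np.array([1.0, -2.0]), 0.5
X = rng.normal(loc=true_mu, scale=true_sigma, size=(n, d))    # hypothetical sample S_n

mu_hat = X.mean(axis=0)                                       # (1/n) * sum of x^(t)
sigma2_hat = ((X - mu_hat) ** 2).sum() / (n * d)              # sum ||x^(t) - mu_hat||^2 / (n d)

print(mu_hat)        # close to [1.0, -2.0]
print(sigma2_hat)    # close to 0.25
```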
16.1. Mixture Models and the Expectation
Maximization (EM) Algorithm
Objectives:
While a "hard" clustering algorithm like K-means or K-medoids can only provide
a cluster ID for each data point, the EM algorithm, along with the generative
model driving its equations, can provide the posterior probability ("soft"
assignments) that every data point belongs to any cluster.
Today, we will talk about mixture models and the EM algorithm. The rest of the
segment is just a recap of the two probabilistic models we saw, the multinomial
(actually the categorical) and the multidimensional Gaussian.
The corresponding statistical model is: given observed data and the assumption
that the data has been generated by the given probabilistic model, what is the
most likely parameter of the distribution, like the individual category
probabilities $\theta_w$ for the categorical or $(\mu, \sigma^2)$ for the Gaussian?
Please also consider that normal random variables remain normal under linear
transformations (and that the sum of independent normal random variables is also normal):

$$X \sim N(\mu, \sigma^2) \;\Rightarrow\; aX + b \sim N(a\mu + b, a^2\sigma^2)$$
16.3. Introduction to Mixture Models
Let's now consider a generative (probabilistic) model consisting of two
subsequent steps. We first draw a class j using a K-categorical distribution with
probabilities $p_1, \dots, p_j, \dots, p_K$.

We then draw the actual data point from a Normal distribution with parameters
$(\mu^{(j)}, \sigma_j^2)$, i.e. specific to the class sampled in the first step.
This generative model goes under the name of mixture model and mixture
models are actually very useful for describing many real-life scenarios.
The full set of parameters of this mixture model is then collectively represented
by $\theta = \{p_1, \dots, p_j, \dots, p_K;\; \mu^{(1)}, \dots, \mu^{(j)}, \dots, \mu^{(K)};\; \sigma_1^2, \dots, \sigma_j^2, \dots, \sigma_K^2\}$,
where the non-negative $p_j$ are called mixture weights (and sum to 1) and the
individual Gaussians with parameters $\mu^{(j)}$ and $\sigma_j^2$ are the mixture
components (so this is more specifically a Gaussian mixture model (GMM),
not to be confused with the "Generalized Method of Moments"). K is the size
of the mixture model.
The output of sampling a set of points from such model, using 2 components
(two classes of the categorical distribution) and a two-dimensional Gaussian,
would look something like this:
We can intuitively distinguish the two clusters and see that one is bigger than
the other, both in terms of points (i.e. its $p_j$ is larger) and in terms of spread (its
$\sigma_j^2$ is bigger).
But where is the border between the two clusters? While the K-means or K-medoids
algorithm would have provided a classifier, returning one cluster ID for each
point (hard clustering model), a clustering based on a mixture generative
model (to be implemented using the EM algorithm we will discuss in the next
segment) would instead return for each point the probability of belonging to each
cluster (soft clustering model). For example, the points in the top-left corner
of the picture will have a probability of belonging to the first cluster of almost one,
and those in the bottom-right corner a probability of belonging to the second cluster
(i.e. of being generated by a Normal distribution with parameters $(\mu^{(2)}, \sigma_2^2)$) of almost one,
but those in the center will have much more balanced probabilities.
So here, we have a much more refined way to talk about our distribution,
because every point kind of belongs to different components, just with different
likelihoods.
If we know all of the parameters of the K Gaussians and the mixture weights, the
probability density function $p(x \mid \theta)$ can be computed using the law of total
probability as

$$p(x \mid \theta) = \sum_{j=1}^{K} p_j \, N(x; \mu^{(j)}, \sigma_j^2)$$
Using Bayes' rule we can also find the posterior probability that x belongs to
the Gaussian component j: $p(j \mid x) = \frac{p_j \, N(x; \mu^{(j)}, \sigma_j^2)}{p(x \mid \theta)}$.
Now, given this expression, we want to take its derivatives with respect to the
parameters, set them equal to 0, and solve for the parameters. It turns out
however that this is actually a pretty complex task.
We will first start to reason in a setting where we know the category associated
with each point, and then generalise, with the EM algorithm, to a context where we
don't know which class each point comes from.
Note also that while we will consider different means of the Gaussian across its
dimensions (i.e. μ(j) is a vector) we will assume the same variance across them
(i.e. σj2 is a scalar).
For that, let's define an indicator function which, for a given pair (i, j),
returns 1 if cluster j is the one point i belongs to, and 0 otherwise:

$$\mathbb{1}(j; i) := \begin{cases} 1 & \text{if } x^{(i)} \in \text{cluster } j \\ 0 & \text{otherwise} \end{cases}$$
The log-likelihood can then be rewritten using this indicator function, inverting the
summation, as:

$$\log p(\theta; S_n) = \sum_{j=1}^{K} \sum_{i=1}^{n} \mathbb{1}(j; i) \, \log\left[ p_j \, N(x^{(i)}; \mu^{(j)}, \sigma_j^2) \right]$$
Note that in general log(sum(x)) is not the same as sum(log(x)), but the
presence of the indicator function guarantees that in reality there is no
summation over the clusters for a specific data example.
What this expression actually shows is that, when we are trying
to find the parameters that are optimal for each mixture component, we can
do this computation for each cluster independently. We can
independently find the best parameters that fit each cluster because, these points
being observed, we actually know which cluster each point belongs to. So
we can separately find, for each cluster, its $\mu$, $\sigma^2$ and $p$.
$$\hat{n}_j = \sum_{i=1}^{n} \mathbb{1}(j; i) \qquad \hat{p}_j = \frac{\hat{n}_j}{n} \qquad \hat{\mu}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \mathbb{1}(j; i) \, x^{(i)} \qquad \hat{\sigma}_j^2 = \frac{1}{\hat{n}_j \, d} \sum_{i=1}^{n} \mathbb{1}(j; i) \, \|x^{(i)} - \hat{\mu}^{(j)}\|^2$$
These equations show how, in the observed case, i.e. when we know to which
component each point belongs, we can estimate all the parameters that we need
to define our mixture of Gaussians.
The notation of the indicator function implies a hard count: either the point belongs
to one cluster with certainty or it doesn't. But one of the powers of
mixture models is that we can see each point as not belonging solely to one
single cluster: we know it can be generated by different components with
different probabilities. So now, instead of talking about hard counts (belongs or
doesn't), we'll actually talk about soft counts: how much does this point
belong to this particular cluster?
What we want is $p(j \mid x^{(i)})$: the probability that, having observed point i, it
was generated by cluster j.

If we knew all the parameters we could compute it using Bayes' rule:

$$p(j \mid x^{(i)}) = \frac{p_j \, N(x^{(i)}; \mu^{(j)}, \sigma_j^2)}{\sum_{l=1}^{K} p_l \, N(x^{(i)}; \mu^{(l)}, \sigma_l^2)}$$

Indeed $p(j)$ would be just our $p_j$ and $p(x^{(i)} \mid j)$ would be $N(x^{(i)}; \mu^{(j)}, \sigma_j^2)$.
as the parameters are very interconnected.
$$\hat{n}_j = \sum_{i=1}^{n} p(j \mid i) \qquad \hat{p}_j = \frac{\hat{n}_j}{n} \qquad \hat{\mu}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} p(j \mid i) \, x^{(i)} \qquad \hat{\sigma}_j^2 = \frac{1}{\hat{n}_j \, d} \sum_{i=1}^{n} p(j \mid i) \, \|x^{(i)} - \hat{\mu}^{(j)}\|^2$$
At this point I can re-compute the p(j|x(i) ) and so on until my results don’t
change any more or change very little.
Finding the posterior $p(j \mid x^{(i)})$ given the parameters, and deriving the expected
likelihood under this $p(j \mid x^{(i)})$, is called the E step (for "expectation"), while
then finding the parameters by maximising the likelihood given the estimated
$p(j \mid x)$ is called the M step (for "maximisation"). Note that the EM algorithm is
not specific to the GMM: it can be used any time we need to estimate the
parameters of a statistical model that depends on unobserved latent variables.
Of course we are still left with the problem of how to initialise all the parameters
for the first E step. The EM algorithm is indeed very sensitive to the initial
conditions, as it is only guaranteed to find a local maximum of $p(\theta; S_n)$, not a
global one.
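A compact numpy sketch of the two EM steps for a spherical Gaussian mixture, following the soft-count formulas above; the data, K and the initialisation scheme are illustrative assumptions:

```python
import numpy as np

def gaussian_pdf(X, mu, s2):
    """Spherical d-dimensional Gaussian density N(x; mu, s2 * I) for every row of X."""
    d = X.shape[1]
    sq = ((X - mu) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * s2)) / (2 * np.pi * s2) ** (d / 2)

def em_gmm(X, K, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    p = np.full(K, 1 / K)                              # mixture weights
    mu = X[rng.choice(n, size=K, replace=False)]       # initial means: random data points
    s2 = np.full(K, X.var())                           # initial (shared-size) variances
    for _ in range(n_iter):
        # E step: posterior p(j | i) for every point i and component j
        dens = np.stack([p[j] * gaussian_pdf(X, mu[j], s2[j]) for j in range(K)], axis=1)
        post = dens / dens.sum(axis=1, keepdims=True)
        # M step: soft-count updates of p_j, mu^(j), sigma_j^2
        n_hat = post.sum(axis=0)
        p = n_hat / n
        mu = (post.T @ X) / n_hat[:, None]
        s2 = np.array([(post[:, j] * ((X - mu[j]) ** 2).sum(axis=1)).sum() / (n_hat[j] * d)
                       for j in range(K)])
    return p, mu, s2, post

X = np.random.default_rng(1).normal(size=(300, 2))     # hypothetical data
weights, means, variances, posteriors = em_gmm(X, K=2)
```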
Homework 5
P4.3. Expectation–maximization algorithm