
[MITx 6.86x Notes Index]

Unit 04 - Unsupervised Learning

Lecture 13. Clustering 1

13.1. Unit 4: Unsupervised Learning


In this unit we start looking at problems where we don't have labels and need to find structure in the data itself. For example, Google News groups stories about the same event into clusters, introducing a structure in the feed of news stories.

We start with hard-assignment clustering algorithms, where we assume that each data point belongs to a single cluster of (somehow to be defined) "similar" points.

We will then move on to recognising data structures where each point can belong to multiple clusters, trying to discover the different components embedded in the data and looking at the likelihood that a certain point was generated by each component.

We will also introduce generative models, which provide us with a probability distribution over the points. We will look at multinomials, Gaussians and mixtures of Gaussians. By the end of this unit, we will get to a very powerful unsupervised learning algorithm for uncovering the underlying structure, called the expectation-maximization (EM) algorithm.

We will finally practice both clustering and the EM algorithm in a Netflix-recommendation project in Project 4.

13.2. Objectives
Understand the definition of clustering
Understand clustering cost with different similarity measures
Understand the K-means algorithm

13.3. Introduction to Clustering


The topic in this unit is unsupervised machine-learning algorithms. In
unsupervised learning, we will still have a training set, but we will not have a
label like in supervised learning.

So in this case we are provided with just a set of feature vectors, and we are trying to learn some meaningful structure that represents this set. In this lesson in particular we will cover clustering algorithms.

We will first start from the definition of a cluster.

While in a classification problem we already have the possible categories and the labels associated with them in the training data, in clustering we need to find a structure in the data and associate it with the features of the data itself, in order to predict the category from the data alone, without the need for an explicit label.

For example take this chart:

It is intuitive that the points belong to two different categories, in this case based on their position in the (x,y) plane, even if the points have no label (colour) associated with them.

Similarly, in many cases the meaningful structure is actually quite intuitive in


the context of a specific application.

Finding a structure is however not enough. We also need to be able to describe how well the structure we found represents the data, i.e. how good the clustering (the partitioning of the data) is: its "clustering cost". For that we will need a measure of similarity between points.

We will finish the lecture with the K-means algorithm, an important and widely used algorithm which can take any set of training points and find a partitioning into K clusters.

To conclude the segment with an example, let's consider Google News again. In Google News we have both an example of classification and one of clustering.

The grouping of the news into one of the categories (sport, science, business, …) is a classification task. The categories are fixed, and someone has trained the algorithm by manually setting the category of each news item in a training set, so that now categories can be predicted automatically.

The grouping of news around individual events (COVID-19, Brexit, the US elections, …) is instead a clustering problem: here the categories are not predefined and there has been no training with manual labelling of all possible events.

What do we need for such task?

We first need to provide a representation of the news stories, for example employing a bag-of-words approach, giving each story a (sparse) vector of word counts;
Second, we need to decide how to compare these representations, in this case vectors;
Finally, we need to implement the clustering algorithm itself, based on the pairwise measure of proximity found in step two, and, for example, select one representative story to show as the central story of each cluster.

13.4. Another Clustering Example: Image Quantization
If the previous example dealt with text clustering for visualisation purposes, let's now deal with image clustering for compression purposes, and more specifically with creating a colour palette for the image, a form of image quantisation ("quantisation" refers to the process of mapping input values from a large set to output values in a smaller, countable set, often with a finite number of elements).

A typical 1024 x 1024 pixel image coded at 24 bits per pixel (8 bits for each of the RGB colours) would take 1024 x 1024 x 24 = 25,165,824 bits = 3 MB.

If we create a palette restricted to only 32 colours, and represent each pixel with the 5-bit index (2^5 = 32) of one of the colours in this palette, the image would take only 1024 x 1024 x 5 + 32 x 24 = 5,243,648 bits ≈ 640 KB (this, by the way, is the compression technique used in PNG and GIF images).

Note that the image quality depends on the number of colours in the palette we choose, i.e. the number of clusters K. While cartoons, screenshots and logos would likely require a very small K to appear of a quality similar to the original image, for photos we would likely need a higher K to achieve the same quality.

Still, to get the compressed image we need to go through the three steps described in the previous segment. We need to represent the image (as a vector of pixels with a colour code). The second step is to find some similarity measure between colours. Finally we can run the clustering algorithm to replace each pixel's 24-bit representation with the new 5-bit one, choosing a map from the 16,777,216 original colours (24 bits) to the 32 "representative" ones (also to be found by the algorithm!).

13.5. Clustering Definition


There are two concepts related to a cluster. The first one is the partitioning of the data: how the different data records are distributed among the clusters. The second one is representativeness: how we aggregate the cluster members into one representative element per cluster.

Partitioning
Let $S_n = \{x^{(i)} | i = 1, \ldots, n\}$ be the set of $n$ feature vectors indexed by $i$ and $C_K = \{C_j | j = 1, \ldots, K\}$ the set of $K$ clusters containing the indexes of the data points. Then $C_1, \ldots, C_j, \ldots, C_K$ form a K-partition of the data if, at the same time, their union contains all the data indexes and they are pairwise disjoint: $C_1 \cup \cdots \cup C_K = \{1, 2, \ldots, n\}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.

This partitioning defines a so-called hard clustering, where every element belongs to just one cluster.

Representativeness
The representativeness view of clustering is the problem of how to select the feature vector to be used as representative of each cluster, like the news story to put as header for each event in Google News or the colour to use for each palette item. Note that the representative element of a cluster doesn't necessarily have to be an existing element of the cluster set; it can also be a combination of them, like their average.

We will see later in this lesson what the connection between these two views is. Clearly they are connected: if you know the partitioning, you may be able to guess the representatives; if you know the representatives, you may be able to guess their constituents. We will see how we can actually unify these two views.

Note that finding the optimal number K of clusters is itself another problem, as K is not determined by the clustering algorithm; it is an input (parameter) to it. That is, K is a hyperparameter of the clustering algorithm.

13.6. Similarity Measures-Cost functions


Given the same set of feature vectors, we can have multiple clustering results, multiple ways to partition the set. We need to assign a "cost" to each possible partition of our data set, in order to rank them (and find the one that minimises the cost). We will make the assumption that the cost of a given partition is the sum of the costs of its individual clusters, where we can intuitively associate the cost with some measure of the cluster's heterogeneity, like (a) the cluster diameter (the distance between the most extreme feature vectors, i.e. the outliers), (b) the average distance, or (c) (what we will use) the sum of the distances between every member and $z^{(j)}$, the representative vector of cluster $C_j$.

In turn we have to choose which metric we use to measure such a distance. We have several approaches, among which:

cosine distance, $\text{dist}(x^{(i)}, x^{(j)}) = \frac{x^{(i)} \cdot x^{(j)}}{\|x^{(i)}\| \, \|x^{(j)}\|}$
squared Euclidean distance, $\text{dist}(x^{(i)}, x^{(j)}) = \|x^{(i)} - x^{(j)}\|^2$

The cosine distance has the merit of being invariant to the magnitude of the vectors. For example, in terms of the Google News application, if we use cosine to compare two vectors representing two different stories, the measurement will not take into account how long the stories are. The Euclidean distance, on the other hand, would be sensitive to the lengths of the stories.

While cosine looks at the angle between vectors (thus disregarding their weight or magnitude), Euclidean distance is similar to using a ruler to actually measure the distance (so cosine is essentially the same as Euclidean on normalised data).

Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter. This happens for example when working with text data represented by word counts. We could assume that when a word (e.g. "science") occurs more frequently in document 1 than in document 2, document 1 is more related to the topic of science. However, it could also be that we are working with documents of uneven lengths (Wikipedia articles for example); then "science" probably occurred more in document 1 just because it is much longer than document 2. Cosine similarity corrects for this (from https://cmry.github.io/notes/euclidean-v-cosine).

While we'll need to understand which metric is best suited for a specific application (and whether it matters in the first place), for now we will employ the Euclidean distance.

Given these two choices, our cost function for a specific partition becomes:

$$\text{cost}(C_1, \ldots, C_j, \ldots, C_K, z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}) = \sum_{j=1}^K \sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2$$

Note that for now we are writing this cost function as a function of both the partition and the representative vectors, but once we determine how to compute the representative vectors $z^{(j)}$ endogenously, it will become a function of the partition alone.
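A small sketch of this cost function, assuming the partition is given as a vector of index sets and the representatives as rows of a matrix (names are just illustrative):

```julia
# X: n × d data matrix; clusters: a vector of K vectors of data indexes;
# Z: a K × d matrix of representatives. Returns the cost with the squared Euclidean distance.
function clustering_cost(X, clusters, Z)
    cost = 0.0
    for (j, Cj) in enumerate(clusters)
        for i in Cj
            cost += sum((X[i, :] .- Z[j, :]) .^ 2)
        end
    end
    return cost
end
```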

13.7. The K-Means Algorithm: The Big Picture


How do we find the specific partition that minimises the cost? We can't iterate over all partitions, measure the cost and select the smallest one, as the number of (non-empty) K-partitions of a set of n elements is huge.

To be exact, this number is known as the "Stirling number of the second kind" and is equal to:

$$\left\{ {n \atop k} \right\} = \frac{1}{k!} \sum_{i=0}^{k} (-1)^i \binom{k}{i} (k-i)^n$$

For example, $\left\{ {3 \atop 2} \right\} = 3$, but $\left\{ {100 \atop 3} \right\} \approx 8.6 \times 10^{46}$.
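For the curious, the formula is easy to evaluate directly; a sketch in Julia (using BigInt arithmetic, since the numbers explode quickly):

```julia
# Number of ways to partition n elements into k non-empty clusters
# (Stirling numbers of the second kind), straight from the formula above.
stirling2(n, k) = sum((-1)^i * binomial(big(k), i) * big(k - i)^n for i in 0:k) ÷ factorial(big(k))

stirling2(3, 2)     # 3
stirling2(100, 3)   # ≈ 8.6e46: far too many partitions to enumerate
```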

The K-means algorithm helps in finding a (locally) minimal-cost partition in a computationally feasible way.

In a nutshell, it involves (a) first randomly selecting K representative points. Then (b) iterating over each point, assigning it to the cluster of the closest representative (according to the adopted metric), and (c) moving each representative to the center of its newly acquired cluster (where "center" again depends on the metric). Steps (b) and (c) are repeated until the algorithm converges, i.e. the tentative K representative points (and their associated clusters) don't move any more.

Graphically:

Random selection of 3 representative points in 2D:

Assignment of the "constituents" of each representative:

Moving each representative to the center of its (previous-step) constituency:

Redefinition of the constituencies (note that some points have changed their representative):

Moving the representatives again and redefining the constituencies:

Let’s describe this process a bit more formally in terms of the cluster costs.

1. Randomly select the representatives $z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}$
2. Iterate:
   2.1. Given $z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}$, assign each data point $x^{(i)}$ to the closest representative, so that the resulting cost will be $\text{cost}(z^{(1)}, \ldots, z^{(K)}) = \sum_{i=1}^n \min_{j=1,\ldots,K} \|x^{(i)} - z^{(j)}\|^2$
   2.2. Given the partition $C_1, \ldots, C_j, \ldots, C_K$, find the best representatives $z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}$ so as to minimise the total cluster cost, where now the cost is driven by the clusters: $\text{cost}(C_1, \ldots, C_j, \ldots, C_K) = \min_{z^{(1)}, \ldots, z^{(K)}} \sum_{j=1}^K \sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2$

Note that in both step 1 and step 2.2 we are not restricted to choosing the initial and final representatives from among the data points. The guarantee that the final representatives are part of the data set is instead one of the advantages of the K-medoids algorithm presented in the next lecture.
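A minimal sketch of the vanilla algorithm just described (squared Euclidean distance, random initialisation drawn from the data points for simplicity); the clusters.jl file linked at the end of this lecture contains a fuller version:

```julia
using Random, Statistics

function kmeans(X, K; maxiter = 100)
    n, d = size(X)
    Z = float(X[randperm(n)[1:K], :])      # 1. random initial representatives
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # 2.1 assign each point to the closest representative
        new_assignments = [argmin([sum((X[i, :] .- Z[j, :]) .^ 2) for j in 1:K]) for i in 1:n]
        new_assignments == assignments && break    # converged: the partition no longer changes
        assignments = new_assignments
        # 2.2 move each representative to the centroid of its cluster
        for j in 1:K
            members = findall(==(j), assignments)
            isempty(members) || (Z[j, :] = vec(mean(X[members, :], dims = 1)))
        end
    end
    return assignments, Z
end
```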

13.8. The K-Means Algorithm: The Specifics

Finding the best representatives


While for step 2.1 we can just iterate over each point and each representative to find, for each point, the representative that minimises the cost, we still need to define how exactly to do step 2.2. We will later see extensions of the K-means algorithm with other distance metrics, but for now let's stick with the squared Euclidean distance and note that each cluster selects its own representative independently.

Using the squared Euclidean distance, the "new" representative vector $z^{(j)}$ of each cluster $j$ must satisfy $\min_{z^{(j)}} \text{cost}(z^{(j)}; C_j) = \min_{z^{(j)}} \sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2$.

When we compute the gradient of the cost function with respect to $z^{(j)}$ and set it to zero, we retrieve the optimal $z^{(j)}$ as $z^{(j)} = \frac{\sum_{i \in C_j} x^{(i)}}{|C_j|}$, where $|C_j|$ is the number of elements of cluster $C_j$. Intuitively, the optimal representative vector is at the center of the cluster, i.e. it is the centroid of the group.

We stress however that this solution is linked to the specific definition of


distance used, the squared Euclidean one.
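For completeness, the gradient computation behind the centroid result is short:

$$\frac{\partial}{\partial z^{(j)}} \sum_{i \in C_j} \|x^{(i)} - z^{(j)}\|^2 = \sum_{i \in C_j} -2\,(x^{(i)} - z^{(j)}) = 0 \;\Rightarrow\; |C_j|\, z^{(j)} = \sum_{i \in C_j} x^{(i)} \;\Rightarrow\; z^{(j)} = \frac{\sum_{i \in C_j} x^{(i)}}{|C_j|}$$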

Impact of initialisation
Note that the K-means algorithm is guaranteed to converge and find a local cost minimum, because at each iteration the cost can only decrease, the cost function is non-negative and the number of possible partitions, however large, is finite.

It is however not guaranteed to find a global cost minimum, and it is indeed very sensitive to the choice of the initial representative vectors. If we start with a different initialization, we may get a very different partitioning.

Take the case of the above picture:

In the upper part we see the initial random assignment of the representative vectors, and in the bottom picture we see the final assignment of the clusters. While the representative vectors moved a bit, the final partition is clearly not the optimal one (in which the representative vectors would be at the centers of the three groups).

In particular, we may run into trouble when the initial representative vectors are close to each other rather than spread across the multidimensional space.

While there are improvements to the K-means algorithm that perform a better initialisation than a random one, taking this consideration into account, in class we will use a vanilla K-means algorithm with simple random initialisation. An example of such a complete K-means algorithm in Julia can be found at https://github.com/sylvaticus/lmlj/blob/master/km.jl

Other drawbacks of K-M algorithm


In general we can say there are two classes of drawbacks of the K-means algorithm.

The first class is that the measure doesn't describe what we would consider a natural classification: the cost function doesn't represent what we would like the cost of the partition to be, so it does not return useful information concerning the ranking of partitions.

The second class of problems involves instead the computational aspects of reaching the minimum of this cost, like the ability to find a global minimum.

While the K-means algorithm scales well to large datasets, it has many other drawbacks.

One is that the vanilla K-means algorithm tries to find spherical clusters in the data, even when the groups have other spatial shapes:

To account for this, k-means can be kernelized, where separation of arbitrary shapes can theoretically be achieved using higher-dimensional spaces. Alternatively, a regularized Gaussian mixture model could be used, as in this paper or in these slides.

Other drawbacks include the so-called "curse of dimensionality", whereby the k-means algorithm becomes less effective at distinguishing between examples as the dimension grows, the manual choice of the number of clusters (which needs to be solved using cross-validation), and the fact of not being robust to outliers.

These further drawbacks are discussed, for example, here, here or here.

An implementation of the K-means algorithm in Julia: https://github.com/sylvaticus/lmlj.jl/blob/master/src/clusters.jl#L65

Lecture 14. Clustering 2

14.1. Clustering Lecture 2


Objectives:

Understand the limitations of the K-Means algorithm


Understand how K-Medoids algorithm is different from the K-Means
algorithm
Understand the computational complexity of the K-Means and the K-Medoids
algorithms
Understand the importance of choosing the right number of clusters
Understand elements that can be supervised in unsupervised learning

14.2. Limitations of the K Means Algorithm


On top of the limitations already described in segment 13.8 (the computational problem of reaching the minimum and the cost function not really measuring what we want), there is a further significant limitation of the K-means algorithm: the z's are actually not guaranteed to be members of the original set of points x.

In some applications this is not a concern, but in others it is. For example, looking at Google News, if we create a representative of a cluster of stories which doesn't correspond to any actual story, we have nothing to show.

We will now introduce the K-medoids algorithm, which modifies the K-means one to work with any kind of distance (not only the squared Euclidean one) and to return a set of representative vectors that are always part of the original data set.

We will finish the lesson discussing how to choose K, the number of clusters.

14.3. Introduction to the K-Medoids Algorithm


In K-means, when we randomly selected the initial representative points, we were not constrained to select them from among the data points; we could select any point in the plane.

In the K-medoids algorithm, instead, we start by selecting the representatives only from within the data points.

Step 2.1 (determining the constituencies of the representatives) remains the same, while in step 2.2, instead of choosing the new representatives as centroids, we again constrain them to be one of the points of the cluster. This allows us to just loop over all the points of the cluster to see which one minimises the cluster cost. And as in both steps 2.1 and 2.2 we explicitly use a "distance" function, we can employ any distance definition we want, i.e. we are no longer restricted to the squared Euclidean one.

The algorithm then becomes:

1. Randomly select the representatives $z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}$ from within $x^{(1)}, \ldots, x^{(i)}, \ldots, x^{(n)}$
2. Iterate:
   2.1. Given $z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}$, assign each data point $x^{(i)}$ to the closest representative, so that the resulting cost will be $\text{cost}(z^{(1)}, \ldots, z^{(K)}) = \sum_{i=1}^n \min_{j=1,\ldots,K} \text{dist}(x^{(i)}, z^{(j)})$
   2.2. Given the partition $C_1, \ldots, C_j, \ldots, C_K$, find the best representatives $z^{(1)}, \ldots, z^{(j)}, \ldots, z^{(K)}$ from within $x^{(1)}, \ldots, x^{(i)}, \ldots, x^{(n)}$ so as to minimise the total cluster cost, where now the cost is driven by the clusters: $\text{cost}(C_1, \ldots, C_j, \ldots, C_K) = \sum_{j=1}^K \min_{z^{(j)}} \sum_{i \in C_j} \text{dist}(x^{(i)}, z^{(j)})$
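A sketch of step 2.2 for a single cluster, with `dist` being whatever distance function we have chosen (the function name and arguments are illustrative):

```julia
# Cj: the vector of data indexes belonging to cluster j; X: the n × d data matrix.
# Returns the index of the member of Cj that minimises the sum of distances to the others.
function best_medoid(X, Cj, dist)
    costs = [sum(dist(X[i, :], X[m, :]) for i in Cj) for m in Cj]
    return Cj[argmin(costs)]
end

# e.g. with the Manhattan (L1) distance:
# best_medoid(X, Cj, (x, y) -> sum(abs.(x .- y)))
```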

14.4. Computational Complexity of K-Means and K-Medoids

The Big-O notation


Let's now compare the computational complexity of the two algorithms, using the capital O (or "Big-O") notation, which describes the order of growth: we look at the asymptotic growth and eliminate all the constants.

We often describe computational complexity using the "Big-O" notation. For example, if the number of steps involved is $5n^2 + n + 1$, then we say it is "of order $n^2$" and denote this by $O(n^2)$. When $n$ is large, the highest-order term $5n^2$ dominates and we drop the scaling constant 5.

More formally, a function $f(n)$ is of order $g(n)$, and we write $f(n) \sim O(g(n))$, if there exists a constant $C$ such that $f(n) < C g(n)$ as $n$ grows large.

In other words, the function f does not grow faster than the function g as n
grows large.

The big-O notation can also be used when there are more input variables. For example, in this problem the number of steps necessary to complete one iteration depends on the number of data points $n$, the number of clusters $K$, and the dimension $d$ of each vector $x^{(i)}$. Hence, the number of steps required is $O(g(n, K, d))$ for some function $g(n, K, d)$.

The computational complexity of the two algorithms


We don’t know how many iterations the algorithms take, so let’s compare only
the complexity of a single iteration.

The order of computation of step 2.1 of the K-means algorithm is $O(n \cdot K \cdot d)$, as we need to go through each point ($n$), compute the distance to each representative ($K$), and account for the fact that we deal with $d$-dimensional vectors. Step 2.2 has the same order, and because we're talking about asymptotic growth, when we sum them up we still stay within the same order of complexity.

For the K-medoids algorithm we can instead see that step 2.2 is much more expensive, as we need to go through all the clusters and then, within each cluster, through all pairs of points. Depending on how the data is distributed across the clusters, we can go from the worst case of all points in one cluster, where we would have a computational complexity of $O(n^2 d)$, to the situation where the points are homogeneously distributed across the clusters, in which case we would have $n/K$ points in each cluster and a total complexity of $O(\frac{n}{K} \cdot \frac{n}{K} \cdot K \cdot d) = O(\frac{n^2}{K} d)$.

Given that normally $n \gg K$, what makes the difference is the $n^2$ term, and even in the best scenario the computational complexity of the K-medoids algorithm is much higher than that of K-means, so for applications with big datasets the K-means algorithm may be preferable.

At this point we have already seen two clustering algorithms, but there are hundreds of them available. Each one has different strengths and weaknesses, and whenever we are selecting a clustering algorithm for our application, we should account for all of them in order to find the best clustering algorithm for our needs.

14.5. Determining the Number of Clusters


First, let's recognise that the more clusters K we add to a model, the more the cost function decreases, as the distance from each point to its closest representative will decrease, until the degenerate case where the number of clusters equals the number of data points and the cost is zero.

When is it time, then, to stop adding clusters? How do we decide the "optimal" number K? There are three general settings. In the first one the number of clusters is fixed, really given to us by the application. For example, we need to cluster our course content into 5 recitations; this is the space we have. Problem "solved".

In the second setting, the number of clusters is driven by the specific application, in the sense that the optimal trade-off between cluster cost and K is determined by the application. For example, how many colours to include in a palette versus the compression rate of the image depends on the context of the application and our needs.

Finally, the "optimal" number of clusters could be determined in a cross-validation step when clustering is used as a pre-processing tool for a supervised learning task (for example, adding a dimension "distance from the cluster centroid" when we have few labelled data points and many unlabelled ones). The supervised algorithm would then have more information to perform its separation task. Here it is the result of the supervised learning task during validation that determines the "optimal" number of clusters.

Finally, let's have a thought about the word "unsupervised". One common misunderstanding is that in "unsupervised" tasks, as there are no labels, we don't provide our system with any knowledge: we just "let the data speak", pick the algorithm and let it provide a solution. In reality, the people who develop and apply those unsupervised algorithms actually provide quite a bit of indirect supervision, of expert knowledge.

In clustering, for example, we decide which similarity measure to use and, as we saw, how many clusters to provide. And if you're thinking about the bag-of-words approach to representing text, these algorithms cannot figure out which words are more semantically important than others, so it would be up to us to do some weighting so that the clustering result is actually acceptable.

Therefore, we should think very carefully about these design choices, so that our clustering is consistent with our expectations.

An implementation of the K-medoids algorithm in Julia: https://github.com/sylvaticus/lmlj.jl/blob/master/src/clusters.jl#L133

Lecture 15. Generative Models

15.1. Objectives
Understand what Generative Models are and how they work
Understand estimation and prediction phases of generative models
Derive a relation connecting generative and discriminative models

Derive Maximum Likelihood Estimates (MLE) for multinomial and Gaussian
generative models

15.2. Generative vs Discriminative models


The models we have studied up to now were discriminative models, which learn an explicit decision boundary between classes. For instance, the SVM classifier, a discriminative model, learns its decision boundary by maximising the distance between the training data points and the learned decision boundary.

On the contrary, generative models work by explicitly modelling the probability distribution of each of the individual classes in the training data. For instance, Gaussian generative models fit a Gaussian probability distribution to the training data in order to estimate the probability of a new data point belonging to the different classes during prediction.

We will study two types of generative models, multinomial generative models and Gaussian generative models, and for both we will ask two types of questions: (1) how do we estimate the model, i.e. how do we fit our data (the "estimation" question), and (2) how do we actually make predictions with a (fitted) model (the "prediction" question)?

What are the advantages of generative models compared to discriminative ones?

They are useful, for example, if your task is not just to do classification but also to generate new samples of (feature, label) pairs based on the information provided in the training data.

15.3. Simple Multinomial Generative model


We now think of a data-generating probabilistic model where we have different categories and hence a discrete probability distribution (a PMF, Probability Mass Function).

We name $\theta_w$ the probability for a given class $w$, so that $\theta_w \geq 0 \;\forall w$ and $\sum_w \theta_w = 1$.

Given such a probabilistic model, the generative model $p(w|\theta)$ is nothing else than the statistical model used to estimate the parameters $\theta_w$ given the observed data $w$, or more formally the statistical model described by the pair $(E, \{P_\theta\}_{\theta \in \Theta})$, where $E$ is the sample space of the data and $\{P_\theta\}_{\theta \in \Theta}$ is the family of distributions parametrized by $\theta$.

For example, the probabilistic model may be a word-generating model, given a fixed vocabulary of W words, where each word has its own probability. We could then see a document as a series of random (independent) draws of these words, so that a document made of common words has a higher probability of being generated than a document made of uncommon words. As the words are independent in this model, the likelihood of the whole document being generated is just the product of the likelihoods of each word being generated.

If we are interested in the document as a specific sequence of words, for example "IT WILL RAIN IT WILL", then its probability is P("IT") * P("WILL") * P("RAIN") * P("IT") * P("WILL") = P("IT")² * P("WILL")² * P("RAIN"). More in general it is $P(D|\theta) = \prod_{w \in W} \theta_w^{\text{count}(w)}$. The underlying probabilistic model is then the categorical distribution, where the sample space $E$ is the set $\{1, \ldots, K\}$ mapped to all the possible words, and $\{P_\theta\}$ is the categorical distribution parametrized by the probabilities $\theta$. $\Theta$, the space of the possible parameters of the probability distribution, is the set of K positive floats that sum to one.

If we are interested instead in all the possible documents that use that specific set of word counts (in whatever order), {"IT": 2, "WILL": 2, "RAIN": 1}, we have to consider and count all the possible orderings, like "IT WILL IT WILL RAIN", and there are $\frac{n!}{w_1! \cdots w_i! \cdots w_k!}$ of them. The sample space $E$ is then the set of all the possible documents, encoded as counts of the different words, and $\{P_\theta\}$ is the multinomial distribution parametrized by the probabilities $\theta$. $\Theta$, the space of the possible parameters of the probability distribution, remains the set of K positive floats that sum to one.

Note that the categorical and multinomial distributions are nothing else than extensions of, respectively, the Bernoulli and binomial distributions to multiple classes (rather than just binary 0/1 outcomes). They model the outcome of, respectively, one or multiple independent categorical trials, i.e. the outcome of the categorical distribution is a single word, while that of the multinomial distribution is the whole document (encoded as word counts).

For completeness, there are $\frac{n!}{w_1! \cdots w_i! \cdots w_k!}$ possible sequences for a document with word counts $w_1, \ldots, w_i, \ldots, w_k$ (with $\sum_{i=1}^k w_i = n$), so the PMF of the multinomial is

$$p(w_1, \ldots, w_i, \ldots, w_k; \theta_1, \ldots, \theta_i, \ldots, \theta_k) = \frac{n!}{w_1! \cdots w_i! \cdots w_k!} \prod_{i=1}^k \theta_i^{w_i}$$

15.4. Likelihood Function


Even though we will use the term "multinomial generative model", in this lecture we will consider the first way of encoding a document and its corresponding probabilistic model (the categorical distribution).

Note indeed that in some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise.

So, for a particular document D, its probability of being generated by a sequence of samples from a categorical distribution with parameter $\theta$ is $P(D|\theta) = \prod_{w \in W} \theta_w^{\text{count}(w)}$.

Given an observed document and competing probabilistic models (with different thetas), we can then determine which is the one with the highest probability of having generated the observed document.

Let's consider for example a vocabulary of just two words, "cat" and "dog", and an observed document "cat cat dog". Let's also assume that we have just two competing models to choose from. The first (probabilistic) model has $\theta_{cat} = 0.3$ and $\theta_{dog} = 0.7$. The second model has $\theta_{cat} = 0.9$ and $\theta_{dog} = 0.1$. It is intuitive that it is more probable that the second model generated the document, but let's compute it. The probability of the document being generated by the first model is $0.3^2 * 0.7 = 0.063$. The probability that D has been generated by the second model is instead $0.9^2 * 0.1 = 0.081$, which is higher than that of the first model.
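The same computation in Julia, just to check the two numbers:

```julia
# Likelihood of a document (given as word counts) under a categorical model θ
likelihood(counts, θ) = prod(θ[w]^c for (w, c) in counts)

counts = Dict("cat" => 2, "dog" => 1)
likelihood(counts, Dict("cat" => 0.3, "dog" => 0.7))   # 0.063
likelihood(counts, Dict("cat" => 0.9, "dog" => 0.1))   # 0.081 → the more likely generator
```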

15.5. Maximum Likelihood Estimate


The joint probability of a set of given outcomes, when we want to stress it as a function of the parameters of the probabilistic model (thus treating the random variables as fixed at the observed values), is known as the likelihood.

While often used synonymously in common speech, the terms “likelihood” and
“probability” have distinct meanings in statistics. Probability is a property of the
sample, specifically how probable it is to obtain a particular sample for a given
value of the parameters of the distribution; likelihood is a property of the
parameter values.

We now want to find the parameters that maximise the likelihood.

For the multinomial model it is $\arg\max_\theta \; L(D|\theta) = \prod_{w \in W} \theta_w^{\text{count}(w)}$.

It turns out that maximising its log (known as the log-likelihood) is computationally simpler while equivalent (in terms of the argmax, not in terms of its value), because the log function is monotonically increasing: wherever the likelihood is at its maximum, its log is at its maximum as well.

We hence compute the first-order conditions in terms of $\theta$ for the log-likelihood, to find the $\theta$ that maximises it:

$$\arg\max_\theta \; lL(D|\theta) = \sum_{w} \text{count}(w) \log(\theta_w)$$

Max likelihood estimate for the cat/dog example


Let’s first consider the specific case of just two categories to later generalise
(and let’s calling the outcomes just 0/1 instead of cat/dog).

In such a setting we really have just one parameter, $\theta_0$ (the probability of a 0), as we can write $\theta_1$ as $1 - \theta_0$. The log-likelihood of a given document is then:

$$lL(D|\theta_0) = \text{count}(0) \log(\theta_0) + \text{count}(1) \log(1 - \theta_0)$$

Taking the derivative with respect to $\theta_0$ and setting it equal to zero we obtain:

$$\frac{\partial lL}{\partial \theta_0} = \frac{\text{count}(0)}{\theta_0} - \frac{\text{count}(1)}{1 - \theta_0}$$

$$\frac{\partial lL}{\partial \theta_0} = 0 \;\Rightarrow\; \tilde\theta_0 = \frac{\text{count}(0)}{\text{count}(0) + \text{count}(1)}$$

Note the tilde symbol indicating an estimated parameter. This is our "best guess" of the parameter given the observed data.

15.6. MLE for Multinomial Distribution


Let's now consider the full multinomial case. Given a document (or, as the words are assumed to be independent, a whole set of documents, just concatenated one after the other) and a vocabulary W, we want to find $\theta_w \;\forall w \in W$ so as to maximise $L(D|\theta) = \prod_{w \in W} \theta_w^{\text{count}(w)}$.

The problem however is that the thetas are not actually free: we have the constraint that the thetas must sum to one (we actually also have the constraint that the thetas must be non-negative, but for the nature of this maximisation problem we can ignore it).

The method of Lagrange multipliers for constrained optimisation

We hence have a constrained optimisation problem, which we can solve with the method of Lagrange multipliers, a method to find the stationary values (min or max) of a function subject to equality constraints.

Let's consider the case of an objective function of two variables $z = f(x, y)$ subject to a single constraint $g(x, y) = c$, where $c$ is a constant (neither the objective function nor the constraint needs to be linear).

We can then write the so-called Lagrangian function as $\mathcal{L} = f(x, y) + \lambda [c - g(x, y)]$, i.e. we "augment" the original function whose extremes we want to find with a term made of a new variable (the Lagrange multiplier) multiplying the corresponding constraint (if we had multiple constraints, we would have multiple additional terms and Lagrange multipliers).

All we need to do now is to find the stationary values of $\mathcal{L}$, regarded as a function of the original variables $x$ and $y$ but also of the new variable(s) $\lambda$ we introduced.

The First Order necessary Conditions (FOC) are:

$$\mathcal{L}_x = f_x - \lambda g_x = 0$$
$$\mathcal{L}_y = f_y - \lambda g_y = 0$$
$$\mathcal{L}_\lambda = c - g(x, y) = 0$$

Solving this system of 3 equations in 3 unknowns finds the stationary points of the unconstrained function $\mathcal{L}$ and, equivalently, of the original function $f$, as the third equation in the FOC guarantees that the constraint is respected and, at the stationary point, the two functions have the same value.

For a numerical example, let the function to optimise be $z = xy$, subject to the constraint $x + y = 6$. We can write the Lagrangian as $\mathcal{L} = xy + \lambda [6 - x - y]$, whose FOC are:

$$\mathcal{L}_x = y - \lambda = 0$$
$$\mathcal{L}_y = x - \lambda = 0$$
$$\mathcal{L}_\lambda = 6 - x - y = 0$$

Solving for $(x, y, \lambda)$ we find the optimal values $x^* = 3$, $y^* = 3$, $\lambda^* = 3$ and $z^* = 9$.

The constrained multinomial log-likelihood

As the sum of the thetas must be one, the Lagrangian of the log-likelihood is:

$$\mathcal{L} = \log P(D|\theta) + \lambda \left( \sum_{w \in W} \theta_w - 1 \right) = \sum_{w \in W} n_w \log \theta_w + \lambda \left( \sum_{w \in W} \theta_w - 1 \right)$$

where $n_w$ is the count of word $w$ in the document.

The FOC of the Lagrangian are then:

$$\mathcal{L}_w = \frac{n_w}{\theta_w} - \lambda = 0$$
$$\mathcal{L}_\lambda = \sum_{w \in W} \theta_w - 1 = 0$$

where the first equation is actually a gradient with respect to all the $w$.

Solving the above system of equations we obtain $\hat\theta_w = \frac{n_w}{\sum_{w \in W} n_w} \;\forall w \in W$.

This set of $\theta_w$ parameters gives the maximum likelihood estimates, the values of $\theta$ that maximise the likelihood function for the observed data under this multinomial generative model.
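In code, the closed-form solution is just the relative word frequencies; a tiny sketch:

```julia
# MLE of the categorical parameters from word counts: each theta is n_w / Σ n_w
function mle_thetas(counts)
    total = sum(values(counts))
    return Dict(w => c / total for (w, c) in counts)
end

mle_thetas(Dict("cat" => 2, "dog" => 1))   # cat => 2/3, dog => 1/3
```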

15.7. Prediction
Let’s now see how can we actually use generative Multinomial Models to make
predictions, for example deciding if a document is in Spanish of Portuguese
language.

For that we are first given a large set of documents in both languages, together with their labels (the language).

From there we can estimate the PMF of the words using the MLE estimator (i.e., counting the relative proportions) for both Spanish and Portuguese (so our vocabulary would have both Spanish and Portuguese terms).

We are now given a new document without the language label. Calling the two languages "+" and "-", we compute the likelihood of the document under the first hypothesis, $P(D|\theta^+)$, as the joint probability using the first PMF, and the likelihood under the second hypothesis, $P(D|\theta^-)$, using the second PMF, and we classify the document according to which one is bigger.

The probability of a data item's feature vector conditional on the class, $P(D|\theta^\pm)$, is also known as the class-conditional probability.

We can equivalently say that we classify the document as "+" if $\log \frac{P(D|\theta^+)}{P(D|\theta^-)} > 0$.

But

$$\log \frac{P(D|\theta^+)}{P(D|\theta^-)} = \log P(D|\theta^+) - \log P(D|\theta^-)$$
$$= \log \prod_{w \in W} (\theta^+_w)^{\text{count}(w)} - \log \prod_{w \in W} (\theta^-_w)^{\text{count}(w)}$$
$$= \sum_{w \in W} \text{count}(w) \log \theta^+_w - \sum_{w \in W} \text{count}(w) \log \theta^-_w$$
$$= \sum_{w \in W} \text{count}(w) \log \frac{\theta^+_w}{\theta^-_w}$$
$$= \sum_{w \in W} \text{count}(w)\, \hat\theta_w$$

where $\hat\theta_w$ is just a symbol for $\log \frac{\theta^+_w}{\theta^-_w}$.

The last expression shows that this classifier, derived through a generative view of classification, should actually remind us of a linear classifier through the origin with parameters $\hat\theta_w$.

15.8. Prior, Posterior and Likelihood


Let's now try to compute $P(\theta|D)$, i.e. the probability of the language of the document (that is, the probability that the generating PMF is that of the Spanish or Portuguese language) given the observed document.

To compute it, we will apply Bayes' rule, which states:

$$P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}$$

and the law of total probability, which states:

$$P(X = x) = \sum_y P(Y = y)\, P(X = x | Y = y)$$

$$P(lang = Spanish | D) = \frac{P(D | lang = Spanish)\, P(lang = Spanish)}{P(D)}$$

where

$$P(D) = P(D|lang = Spanish)\, P(lang = Spanish) + P(D|lang = Portuguese)\, P(lang = Portuguese)$$

We can think, for example, of a bilingual colleague giving us a document in either Spanish or Portuguese. $P(lang = Spanish)$ is the prior probability that the document is Spanish (because maybe we know that this colleague works more with Spanish documents than Portuguese ones). Then $P(D|lang = Spanish)$ and $P(D|lang = Portuguese)$ are the two conditional probabilities that depend on the word frequencies of the two languages, and $P(D)$ is the probability that our colleague gave us exactly that document. $P(lang = Spanish|D)$ is known as the posterior probability, as it is the one we have once we observe document D.

Going back to our plus/minus class notation instead of the languages, we can compute the log of the ratio between the posteriors of the two classes as:

$$\log \frac{P(y = +|D)}{P(y = -|D)} = \log \frac{P(D|y = +)\, P(y = +)}{P(D|y = -)\, P(y = -)} = \log \frac{P(D|y = +)}{P(D|y = -)} + \log \frac{P(y = +)}{P(y = -)}$$

Using the expression from the previous segment for the first term and $\hat\theta_0$ as a symbol representing the second term, we can rewrite the equation as:

$$\log \frac{P(y = +|D)}{P(y = -|D)} = \sum_{w \in W} \text{count}(w)\, \hat\theta_w + \hat\theta_0$$

So what we actually see here is that we have again translated it into a linear classifier. But in contrast to the previous case, when we had a linear classifier through the origin, now we have a linear classifier with an offset, and the offset is guided by our prior, which drives the location of the separator.

So what we've seen here is that in this model we can very easily incorporate our prior knowledge about the likelihood of certain classes. And in the end, even though we're talking about generative models and using a somewhat different mechanism for estimating the parameters of this multinomial, we actually get the same linear separators that we see in our discriminative modelling.

15.9. Gaussian Generative models


We will now switch, in place of the categorical distribution, to a generative (probabilistic) model based on the multidimensional Normal (or "Gaussian") distribution.

While the categorical distribution was suitable for modelling discrete random variables, like classes, the Normal is the analogous counterpart for continuous random variables.

The multidimensional Normal has Probability Density Function (PDF):

$$p_X(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}, \quad x \in \mathbb{R}^d$$

where $x \in \mathbb{R}^d$, $\mu \in \mathbb{R}^d$ is the vector of the means in the d dimensions and $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix (with $\det(\Sigma)$ being its determinant).

So the PDF has a single peak at $x = \mu$ and then gradually declines in a typical "bell shape". The "speed" of the decline is regulated by the sigma parameter: the higher sigma is, the lower the peak and the fatter the tails, making the whole curve "flatter".

When all the components (dimensions) are uncorrelated (i.e. the covariance matrix $\Sigma$ is diagonal) and have the same standard deviation $\sigma$, the Normal PDF reduces to:

$$p_X(x; \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\frac{1}{2\sigma^2} \|x - \mu\|^2}$$

We should also specify that a random vector $X = (X^{(1)}, \ldots, X^{(d)})^T$ is defined as a Gaussian vector, or "multivariate Gaussian" or "normal" random vector, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e., if $\alpha^T X$ is (univariate) Gaussian or constant for any constant non-zero vector $\alpha \in \mathbb{R}^d$.

It is important to note that in generative models we are introducing a specific functional form for the probabilistic model (here the Gaussian), and our freedom relates only to the parameters of the model, not to the form itself. Hence we first need to consider whether our data, our specific case, is appropriate to be modelled by such a functional form or not.

15.10. MLE for Gaussian Distribution


As for the multinomial case, if we have a training set $S_n = \{x^{(t)} | t = 1, \ldots, n\}$ we can compute the log-likelihood of this set and find the $(\mu, \sigma^2)$ that maximise it.

The likelihood is

$$p(\mu, \sigma^2; S_n) = \prod_{t=1}^n p(\mu, \sigma^2; x^{(t)}) = \prod_{t=1}^n \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\frac{1}{2\sigma^2} \|x^{(t)} - \mu\|^2}$$

The log-likelihood is then:

$$\log p(\mu, \sigma^2; S_n) = -\frac{nd}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^n \|x^{(t)} - \mu\|^2$$

From this expression we can take the derivatives with respect to $\mu$ and $\sigma^2$ and set them to zero to find the MLE estimators $\hat\mu = \frac{1}{n} \sum_{t=1}^n x^{(t)}$ and $\hat\sigma^2 = \frac{\sum_{t=1}^n \|x^{(t)} - \hat\mu\|^2}{nd}$.

Lecture 16. Mixture Models; EM algorithm

16.1. Mixture Models and the Expectation Maximization (EM) Algorithm
Objectives:

Review Maximum Likelihood Estimation (MLE) of mean and variance in


Gaussian statistical model
Define Mixture Models
Understand and derive ML estimates of mean and variance of Gaussians in
an Observed Gaussian Mixture Model
Understand Expectation Maximization (EM) algorithm to estimate mean and
variance of Gaussians in an Unobserved Gaussian Mixture Model

16.2. Recap of Maximum Likelihood Estimation for Multinomial and Gaussian Models
So far, in clustering we have assumed that the data has no probabilistic generative model attached to it, and we have used various iterative algorithms based on similarity measures to come up with a way to group similar data points into clusters. In this lecture, we will assume an underlying probabilistic generative model that will lead us to a "natural" clustering algorithm called the EM algorithm.

While a "hard" clustering algorithm like k-means or k-medoids can only provide a cluster ID for each data point, the EM algorithm, along with the generative model driving its equations, can provide the posterior probability ("soft" assignments) that every data point belongs to any cluster.

Today we will talk about mixture models and the EM algorithm. The rest of the segment is just a recap of the two probabilistic models we saw: the multinomial (actually the categorical) and the multidimensional Gaussian.

The corresponding statistical question is: given observed data and the assumption that the data has been generated by the given probabilistic model, what is the most likely parameter of the distribution, like the individual category probabilities $\theta_w$ for the categorical or $(\mu, \sigma^2)$ for the Gaussian?

This is known in statistics as parametric estimation, and it can be solved by interpreting the joint probability function of a set of observed data as a function of the parameters of the model given the observed data (in doing this we call the joint probability function the "likelihood") and then finding the parameter(s) of the model that maximise it, using first-order conditions, possibly employing Lagrange multipliers if the maximisation involves respecting some constraints, like the sum of all the categorical probabilities being equal to one.

Please also consider that the sum of independent normal random variables remains normal, and so does an affine transformation of a normal variable:

$$X \sim N(\mu, \sigma^2) \;\Rightarrow\; aX + b \sim N(a\mu + b, a^2 \sigma^2)$$

16.3. Introduction to Mixture Models
Let's now consider a generative (probabilistic) model consisting of two subsequent steps. We first draw a class $j$ using a K-categorical distribution with probabilities $p_1, \ldots, p_j, \ldots, p_K$.

We then draw the final data point from a Normal distribution with parameters $(\mu^{(j)}, \sigma_j^2)$, i.e. specific to the class sampled in the first step.

This generative model goes under the name of mixture model, and mixture models are actually very useful for describing many real-life scenarios.

The full set of parameters of this mixture model is then collectively represented by $\theta = \{p_1, \ldots, p_K; \mu^{(1)}, \ldots, \mu^{(K)}; \sigma_1^2, \ldots, \sigma_K^2\}$, where the non-negative $p_j$ are called mixture weights (and sum to 1) and the individual Gaussians with parameters $\mu^{(j)}$ and $\sigma_j^2$ are the mixture components (so this is more specifically a Gaussian mixture model (GMM), not to be confused with the "Generalized Method of Moments"). K is the size of the mixture model.

The output of sampling a set of points from such a model, using 2 components (two classes of the categorical distribution) and a two-dimensional Gaussian, would look something like this:

We can intuitively distinguish the two clusters and see that one is bigger than the other, both in terms of points (i.e. its $p_j$ is larger) and in terms of spread (its $\sigma_j^2$ is bigger).
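A sketch of the two-step sampling process with made-up parameter values (2 components in 2 dimensions):

```julia
p  = [0.7, 0.3]                  # mixture weights p_j
μs = [[0.0, 0.0], [4.0, 4.0]]    # component means
σs = [1.0, 0.5]                  # component standard deviations

function sample_gmm(n)
    X = zeros(n, 2)
    for i in 1:n
        j = searchsortedfirst(cumsum(p), rand())   # step 1: draw the component j
        X[i, :] = μs[j] .+ σs[j] .* randn(2)       # step 2: draw the point from N(μ_j, σ_j² I)
    end
    return X
end
```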

But where is the border between the two groups? While the K-means or K-medoids algorithms would have provided a classifier, returning one cluster id for each point (hard clustering model), a clustering based on a mixture generative model (to be implemented using the EM algorithm we will discuss in the next segments) would instead return, for each point, the probability of belonging to each cluster (soft clustering model). For example, the points in the top-left corner of the picture will have a probability of belonging to the first cluster of almost one, and those in the bottom-right corner a probability of belonging to the second cluster (i.e. being generated by a Normal distribution with parameters $(\mu^{(2)}, \sigma_2^2)$) of almost one, but those in the center will have much more balanced probabilities.

So here we have a much more refined way to talk about our distribution, because every point belongs, in a sense, to all the components, just with different likelihoods.

If we know all of the parameters of the K Gaussians and the mixture weights, the probability density function $p(x|\theta)$ can be computed using the law of total probability as $p(x|\theta) = \sum_{j=1}^K p_j N(x; \mu^{(j)}, \sigma_j^2)$.

Using the Bayes rule we can also find the posterior probability that x belongs to
a Gaussian component j.

Likelihood of Gaussian Mixture Model


Given the GMM, we can find the PDF associated with each point x using the total probability theorem: $p(x; \theta) = \sum_{j=1}^K p_j N(x; \mu^{(j)}, \sigma_j^2)$

The likelihood associated with a set $S_n$ of observed, independent data is then:

$$p(S_n; \theta) = \prod_{i=1}^n \sum_{j=1}^K p_j N(x^{(i)}; \mu^{(j)}, \sigma_j^2)$$

Now, given this expression, we would like to take its derivatives with respect to the parameters, set them equal to 0, and find the parameters. It turns out however that this is actually a pretty complex task.

We will first reason in a setting where we know the category associated with each point, and then generalise to a context where we don't know the class to which each point belongs, with the EM algorithm.

Note also that while we will consider different means of the Gaussian across its dimensions (i.e. $\mu^{(j)}$ is a vector), we will assume the same variance across them (i.e. $\sigma_j^2$ is a scalar).

16.4. Mixture Model - Observed Case


Let's hence first assume that we know the cluster j to which each point i belongs, and we need to estimate the parameters of the likelihood using MLE.

For that, let's define as indicator function a function that, for a given pair (i, j), returns 1 if cluster j is the one to which point i belongs, and 0 otherwise.

$$\mathbb{1}(j; i) := \begin{cases} 1 & \text{if } x^{(i)} \in \text{cluster } j, \\ 0 & \text{otherwise.} \end{cases}$$

The log-likelihood can then be rewritten using this indicator function, inverting the summations, as:

$$\log p(S_n; \theta) = \sum_{j=1}^K \sum_{i=1}^n \mathbb{1}(j; i) \log\left[ p_j N(x^{(i)}; \mu^{(j)}, \sigma_j^2) \right]$$

Note that in general log(sum(x)) is not the same as sum(log(x)), but the presence of the indicator function guarantees that in reality there is no summation over clusters for a specific data example.

What this expression actually shows is that, when we are trying to find the parameters that are optimal for each mixture component, we can do the computation for each cluster independently. We can independently find the best parameters that fit each cluster because, the assignments being observed, we actually know which cluster each point belongs to. So we can separately find the mu, sigma and p of each component.

By maximising the likelihood, we find the following estimators:

$\hat{n}_j$, the number of points of each cluster: $\hat{n}_j = \sum_{i=1}^n \mathbb{1}(j; i)$ (this actually just derives from our assumption of knowing the indicator function output)
$\hat{p}_j = \frac{\hat{n}_j}{n}$
$\hat\mu^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n \mathbb{1}(j; i)\, x^{(i)}$
$\hat\sigma_j^2 = \frac{1}{\hat{n}_j d} \sum_{i=1}^n \mathbb{1}(j; i)\, \|x^{(i)} - \hat\mu^{(j)}\|^2$

These equations show how, in the observed case, when we know to which component each point belongs, we can estimate all the parameters that we need to define our mixture of Gaussians.

16.5. Mixture Model - Unobserved Case: EM Algorithm
We now consider the unobserved case, where we don't know the mapping between data points $x^{(i)}$ and clusters j.

The notation of the indicator function implies hard counting: a point either belongs to one cluster with certainty or it doesn't belong at all. But one of the powers of a mixture model is that we can see each point as not solely belonging to one single cluster; we know it could have been generated by different components with different probabilities. So now, instead of talking about hard counts (belongs or doesn't), we'll talk about soft counts: how much does this point belong to this particular cluster?

What we want is $p(j|x^{(i)})$: the probability that, having observed point i, it was generated by cluster j.

If we knew all the parameters we could compute it using Bayes' rule:

$$p(j|x^{(i)}) = \frac{p(j, x^{(i)})}{p(x^{(i)})} = \frac{p(j)\, p(x^{(i)}|j)}{\sum_{k=1}^K p(k)\, p(x^{(i)}|k)}$$

Indeed $p(j)$ would just be our $p_j$ and $p(x^{(i)}|j)$ would be $N(x^{(i)}; \mu^{(j)}, \sigma_j^2)$.

But unfortunately there is no closed-form solution for maximising the likelihood $p(S_n; \theta)$ with respect to the full set of parameters, even with a very large observed $S_n$, as the parameters are very interconnected.

We hence need to break this dependence with a solving procedure in two parts: given my best current estimate of the parameters, I first compute $p(j|x^{(i)})$. Then, treating the estimated $p(j|x^{(i)})$ as given, I maximise the likelihood $p(S_n; \theta)$ to retrieve the parameters. Solving for the parameters, I find something very similar to the results of the previous segment:

$\hat{n}_j$, the (soft) number of points of each cluster: $\hat{n}_j = \sum_{i=1}^n p(j|x^{(i)})$ (this is actually a weighted sum of the points, where the weights are the probabilities of the points belonging to cluster j)
$\hat{p}_j = \frac{\hat{n}_j}{n}$
$\hat\mu^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^n p(j|x^{(i)})\, x^{(i)}$
$\hat\sigma_j^2 = \frac{1}{\hat{n}_j d} \sum_{i=1}^n p(j|x^{(i)})\, \|x^{(i)} - \hat\mu^{(j)}\|^2$

At this point I can re-compute the $p(j|x^{(i)})$, and so on, until the results don't change any more or change very little.

Finding the posterior $p(j|x^{(i)})$ given the parameters, and deriving the expected likelihood under this $p(j|x^{(i)})$, is called the E step (for "expectation"), while finding the parameters by maximising the likelihood given the estimated $p(j|x^{(i)})$ is called the M step (for "maximisation"). Note that the EM algorithm is not specific to the GMM; it can be used any time we need to estimate the parameters of a statistical model that depends on unobserved latent variables.
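A compact sketch of this alternation for the spherical Gaussian mixture of this lecture (crude random initialisation, no numerical safeguards; the Julia implementation linked below is more complete):

```julia
using Statistics

function em_gmm(X, K; maxiter = 100)
    n, d = size(X)
    μ    = [float(X[rand(1:n), :]) for _ in 1:K]   # initial means: random data points
    σ2   = fill(var(X), K)                         # initial variances: global variance
    p    = fill(1 / K, K)                          # initial weights: uniform
    post = zeros(n, K)                             # the soft counts p(j | x_i)
    for _ in 1:maxiter
        # E step: posterior of each component for each point, via Bayes' rule
        for i in 1:n, j in 1:K
            post[i, j] = p[j] * exp(-sum((X[i, :] .- μ[j]) .^ 2) / (2 * σ2[j])) / (2 * π * σ2[j])^(d / 2)
        end
        post ./= sum(post, dims = 2)
        # M step: re-estimate weights, means and variances from the soft counts
        for j in 1:K
            nj    = sum(post[:, j])
            p[j]  = nj / n
            μ[j]  = vec(sum(post[:, j] .* X, dims = 1)) ./ nj
            σ2[j] = sum(post[i, j] * sum((X[i, :] .- μ[j]) .^ 2) for i in 1:n) / (nj * d)
        end
    end
    return p, μ, σ2, post
end
```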

Of course we are still left with the problem of how to initialise all the parameters for the first E step. The EM algorithm is indeed very sensitive to the initial conditions, as it is guaranteed to find only a local maximum of $p(S_n; \theta)$, not a global one.

We could either do a random initialization of the parameter set $\theta$, or we could employ k-means to find the initial cluster centers of the K clusters and use the global variance of the dataset as the initial variance of all K clusters. In the latter case, the mixture weights can be initialized to the proportion of data points in the clusters as found by the k-means algorithm.

An implementation of the EM algorithm in Julia: https://github.com/sylvaticus/lmlj.jl/blob/master/src/clusters.jl#L222

Homework 5

Project 4: Collaborative Filtering via Gaussian Mixtures

P4.3. Expectation–maximization algorithm

Data Generation Models


[MITx 6.86x Notes Index]
