Text Clustering, K-Means, Gaussian
Mixture Models, Expectation-
Maximization, Hierarchical Clustering
Sameer Maskey
Week 3, Sept 19, 2012
1
Topics for Today
Text Clustering
Gaussian Mixture Models
K-Means
Expectation Maximization
Hierarchical Clustering
2
Announcement
Proposal due tonight (11:59pm); not graded
Feedback by Friday
Final proposal due next Wednesday (11:59pm)
5% of the project grade
Email me the proposal with the title
"Project Proposal: Statistical NLP for the Web"
Homework 1 is out
Due Thursday, October 4th (11:59pm)
Please use Courseworks
3
Course Initial Survey
[Bar chart: Class Survey, showing percentage of Yes/No responses for each survey category (NLP, SLP, ML, NLP for ML, Adv ML, Pace, Math, Matlab, Matlab Tutorial/Project, industry mentors, larger audience, etc.)]
4
Perceptron Algorithm
We are given $(x_i, y_i)$
Initialize $w$
Do until converged
  if error($y_i$, sign($w \cdot x_i$)) == TRUE
    $w \leftarrow w + y_i x_i$
  end if
End do
If the predicted class is wrong, add or subtract that point to/from the weight vector
5
Perceptron (cont.)
y is the prediction based on the weights and is either 0 or 1 in this case
$y_j(t) = f[\mathbf{w}(t) \cdot \mathbf{x}_j]$
$w_i(t+1) = w_i(t) + (d_j - y_j(t))\, x_{i,j}$
The error $(d_j - y_j(t))$ is either 1, 0, or -1
Example from Wikipedia
6
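As a concrete illustration, here is a minimal Python/NumPy sketch of the perceptron update rule from the slides above; the toy data, the fixed number of epochs, and the convergence check are hypothetical choices, not part of the lecture.

import numpy as np

def perceptron_train(X, y, epochs=10):
    """Minimal perceptron sketch: X is (N, D) features, y is labels in {-1, +1}."""
    w = np.zeros(X.shape[1])              # initialize w
    for _ in range(epochs):               # "do until converged" (capped at a fixed number of epochs)
        errors = 0
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:   # error(y_i, sign(w . x_i)) == TRUE
                w = w + y_i * x_i         # w <- w + y_i * x_i
                errors += 1
        if errors == 0:                   # converged: no misclassifications in a full pass
            break
    return w

# Hypothetical toy data: two linearly separable points
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
print(perceptron_train(X, y))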
Naïve Bayes Classifier for Text
$P(Y = y_k | X_1, X_2, ..., X_N) = \frac{P(Y = y_k)\, P(X_1, X_2, ..., X_N | Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, X_2, ..., X_N | Y = y_j)}$
$= \frac{P(Y = y_k) \prod_i P(X_i | Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i | Y = y_j)}$
$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i | Y = y_k)$
7
Naïve Bayes Classifier for Text
Given the training data, what are the parameters to be estimated?

P(Y):             P(X|Y1):           P(X|Y2):
Diabetes: 0.8     the: 0.001         the: 0.001
Hepatitis: 0.2    diabetic: 0.02     diabetic: 0.0001
                  blood: 0.0015      water: 0.0118
                  sugar: 0.02        fever: 0.01
                  weight: 0.018      weight: 0.008

$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i | Y = y_k)$
8
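A small sketch of the argmax decision rule above, using the illustrative probabilities from the table (the exact word-to-column assignment is assumed from the slide layout); words missing from a class's table get a hypothetical floor probability, and smoothing is ignored for simplicity.

import math

# Parameters from the slide (assumed column assignment)
prior = {"Diabetes": 0.8, "Hepatitis": 0.2}
likelihood = {
    "Diabetes":  {"the": 0.001, "diabetic": 0.02,   "blood": 0.0015, "sugar": 0.02, "weight": 0.018},
    "Hepatitis": {"the": 0.001, "diabetic": 0.0001, "water": 0.0118, "fever": 0.01, "weight": 0.008},
}
UNSEEN = 1e-6   # hypothetical floor probability for words absent from a class table

def classify(words):
    """Return argmax_y  log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for y in prior:
        score = math.log(prior[y])
        for w in words:
            score += math.log(likelihood[y].get(w, UNSEEN))
        scores[y] = score
    return max(scores, key=scores.get)

print(classify(["the", "blood", "sugar", "weight"]))   # -> "Diabetes"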
Data without Labels
Data with corresponding human scores (Regression):
  is writing a paper        0.5
  has flu                   0.1
  is happy, yankees won!    0.87
Data with corresponding human class labels (Perceptron, Naïve Bayes, Fisher's Linear Discriminant):
  is writing a paper        SAD
  has flu                   SAD
  is happy, yankees won!    HAPPY
Data with NO corresponding labels:
  is writing a paper        ?
  has flu                   ?
  is happy, yankees won!    ?
9
Document Clustering
Previously we classified Documents into Two Classes
Diabetes (Class1) and Hepatitis (Class2)
We had human labeled data
Supervised learning
What if we do not have manually tagged documents?
Can we still classify documents?
Document clustering
Unsupervised Learning
10
Classification vs. Clustering
Supervised training of a classification algorithm vs. unsupervised training of a clustering algorithm
11
Clusters for Classification
Automatically Found Clusters
can be used for Classification
12
Document Clustering
[Figure: two clusters, Baseball Docs and Hockey Docs, and a new unlabeled document]
Which cluster does the new document belong to?
13
Document Clustering
Cluster the documents into N clusters/categories
For classification we were able to estimate parameters using labeled data
Perceptrons find the parameters that define the separating hyperplane
Naïve Bayes counts the number of times a word occurs in the given class and normalizes
It is not evident how to find a separating hyperplane when no labeled data is available
It is also not evident how many classes the data has when we do not have labels
14
Document Clustering Application
Even though we do not know the human labels, automatically induced clusters can be useful
[Figure: News Clusters]
15
Document Clustering Application
A Map of Yahoo!, Mappa.Mundi Magazine, February 2000; Map of the Market with Headlines, SmartMoney [2]
16
How to Cluster Documents with No
Labeled Data?
Treat cluster IDs or class labels as hidden variables
Maximize the likelihood of the unlabeled data
Cannot simply count for MLE as we do not know
which point belongs to which class
Use an iterative algorithm such as K-Means or EM
Hidden Variables?
What do we mean by this?
17
Hidden vs. Observed Variables
Assuming our observed data is in $R^2$
How many observed variables?
How many hidden variables?
18
Clustering
If we have data with labels: (30, 1), (55, 2), (24, 1), (40, 1), (35, 2)
Find out $\mu_i$ and $\Sigma_i$ from the data for both classes: $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$
If we have data with NO labels but know the data comes from 2 classes: (30, ?), (55, ?), (24, ?), (40, ?), (35, ?)
Find out $\mu_i$ and $\Sigma_i$ from the data for both classes: $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$
19
K-Means in Words
Parameters to estimate for K classes (e.g. Baseball vs. Hockey documents)
Let us assume we can model this data with a mixture of two Gaussians
Start with 2 Gaussians (initialize mu values)
Compute the distance of each point to the mu of the 2 Gaussians and assign it to the closest Gaussian (class label Ck)
Use the assigned points to recompute mu for the 2 Gaussians
20
K-Means Clustering
Let us define a dataset in D dimensions: $\{x_1, x_2, ..., x_N\}$
We want to cluster the data into K clusters
Let $\mu_k$ be a D-dimensional vector representing cluster k
Let us define $r_{nk}$ for each $x_n$ such that
$r_{nk} \in \{0, 1\}$, where $k = 1, ..., K$, and
$r_{nk} = 1$ if $x_n$ is assigned to cluster k
21
Distortion Measure
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, ||x_n - \mu_k||^2$
Represents the sum of squared distances from each data point to its assigned $\mu_k$
We want to minimize J
22
Estimating Parameters
We can estimate the parameters with a 2-step iterative process
Step 1: Minimize J with respect to $r_{nk}$, keeping $\mu_k$ fixed
Step 2: Minimize J with respect to $\mu_k$, keeping $r_{nk}$ fixed
23
Minimize J with respect to $r_{nk}$
Step 1: Keep $\mu_k$ fixed
Optimize for each n separately by choosing the $r_{nk}$ for the k that gives the minimum $||x_n - \mu_k||^2$
$r_{nk} = 1$ if $k = \arg\min_j ||x_n - \mu_j||^2$
$\quad\;\;\, = 0$ otherwise
Assign each data point to the cluster that is the closest
Hard decision on cluster assignment
24
Minimize J with respect to $\mu_k$
Step 2: Keep $r_{nk}$ fixed
J is quadratic in $\mu_k$; minimize by setting the derivative w.r.t. $\mu_k$ to zero
$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$
Take all the points assigned to cluster k and re-estimate the mean for cluster k
25
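A minimal NumPy sketch of the two alternating steps above (hard assignment, then mean re-estimation); the random initialization, iteration cap, and stopping rule are simple choices for illustration, not prescribed by the slides.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """X: (N, D) data. Returns cluster means mu (K, D) and hard assignments r (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize mu with K random points
    for _ in range(n_iters):
        # Step 1: minimize J w.r.t. r_nk, keeping mu_k fixed (assign each point to the closest mean)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        r = dists.argmin(axis=1)
        # Step 2: minimize J w.r.t. mu_k, keeping r_nk fixed (mean of the assigned points)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: means no longer move
            break
        mu = new_mu
    return mu, r

# Hypothetical usage on two well-separated blobs
X = np.vstack([np.random.randn(50, 2) + 5.0, np.random.randn(50, 2) - 5.0])
mu, r = kmeans(X, K=2)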
Document Clustering with K-means
Assume we have unlabeled data for Hockey and Baseball documents
We want to be able to categorize a new document into one of the 2 classes (K=2)
We can represent each document as a feature vector
Features can be word IDs or other NLP features such as POS tags, word context, etc. (D = total dimension of the feature vectors)
N documents are available
Randomly initialize the 2 class means
Compute the squared distance of each point $x_n$ (D-dimensional) to the class means $\mu_k$
Assign the point to the k for which the distance to $\mu_k$ is lowest
Re-compute $\mu_k$ and re-iterate
26
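For document clustering specifically, a short sketch along these lines using bag-of-words features: scikit-learn's TfidfVectorizer and KMeans are used here as stand-ins for the feature extraction and clustering steps described above, and the toy documents are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical unlabeled documents (baseball-like and hockey-like)
docs = [
    "home run pitcher batter inning",
    "puck goalie ice rink stick",
    "batter strikes out at the plate",
    "hockey players skate across the ice",
]

vectorizer = TfidfVectorizer()          # documents -> D-dimensional feature vectors
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0)   # K = 2 clusters
labels = km.fit_predict(X)              # cluster ID for each document
print(labels)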
K-Means Example
K-means algorithm Illustration [1]
27
Clusters
Number of documents
clustered together
28
Hard Assignment to Clusters
K-means algorithm assigns each point to the closest
cluster
Hard decision
Each data point affects the mean computation equally
How do points that are almost equidistant from the 2 clusters affect the algorithm?
Soft decision?
Fractional counts?
29
Gaussian Mixture Models (GMMs)
30
Mixtures of 2 Gaussians
$p(x) = \pi\, N(x|\mu_1, \Sigma_1) + (1 - \pi)\, N(x|\mu_2, \Sigma_2)$
GMM with 2 Gaussians
31
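A minimal sketch evaluating the two-component mixture density above for 1-D data with scipy.stats.norm; the mixing weight, means, and standard deviations below are hypothetical.

from scipy.stats import norm

# Hypothetical parameters of a 1-D mixture of two Gaussians
pi_1, mu_1, sigma_1 = 0.6, 0.0, 1.0
mu_2, sigma_2 = 5.0, 2.0

def mixture_pdf(x):
    """p(x) = pi * N(x | mu_1, sigma_1) + (1 - pi) * N(x | mu_2, sigma_2)"""
    return pi_1 * norm.pdf(x, mu_1, sigma_1) + (1 - pi_1) * norm.pdf(x, mu_2, sigma_2)

print(mixture_pdf(0.0), mixture_pdf(5.0))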
Mixture Models
Mixture of Gaussians [1]
1 Gaussian may not fit the data
2 Gaussians may fit the data better
Each Gaussian can be a class category
When labeled data is not available, we can treat the class category as a hidden variable
32
Mixture Model Classifier
Given a new data point, find the posterior probability of each class
$p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}$
$p(y=1|x) \propto N(x|\mu_1, \Sigma_1)\, p(y=1)$
33
Cluster ID/Class Label as Hidden
Variables
$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x|z)$
We can treat the class category as a hidden variable z
z is a K-dimensional binary random variable in which $z_k = 1$ and the other elements are 0
z = [0 0 1 0 0 ...]
$p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$
Also, the priors sum to 1: $\sum_{k=1}^{K} \pi_k = 1$
The conditional distribution of x given a particular z can be written as
$p(x|z_k = 1) = N(x|\mu_k, \Sigma_k)$
$p(x|z) = \prod_{k=1}^{K} N(x|\mu_k, \Sigma_k)^{z_k}$
34
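The generative story above (draw a component indicator z from p(z), then draw x from the corresponding Gaussian) can be sketched directly; the mixing weights, means, and covariances below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for K = 2 components in 2-D
pi = np.array([0.3, 0.7])                       # p(z_k = 1) = pi_k, sums to 1
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]

def sample_one():
    """Sample z ~ p(z), then x ~ N(x | mu_z, Sigma_z)."""
    k = rng.choice(len(pi), p=pi)               # latent component indicator
    x = rng.multivariate_normal(mus[k], sigmas[k])
    return k, x

samples = [sample_one() for _ in range(5)]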
Mixture of Gaussians with Hidden Variables
$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x|z)$
$p(x) = \sum_{k=1}^{K} \pi_k\, N(x|\mu_k, \Sigma_k)$
where $\pi_k$ is the mixing component, $\mu_k$ the mean, $\Sigma_k$ the covariance, and $N(x|\mu_k, \Sigma_k)$ a component of the mixture
$p(x) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$
Mixture models can be linear combinations of other distributions as well
Mixture of binomial distributions, for example
35
Conditional Probability of Label Given
Data
A mixture model with parameters $\mu$, $\Sigma$, and prior $\pi$ can represent the data
We can maximize the likelihood of the data given the model parameters to find the best parameters
If we know the best parameters we can estimate
$\gamma(z_k) \equiv p(z_k = 1|x) = \frac{p(z_k = 1)\, p(x|z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x|z_j = 1)} = \frac{\pi_k N(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x|\mu_j, \Sigma_j)}$
This essentially gives us the probability of the class given the data, i.e. a label for the given data point
36
Maximizing Likelihood
If we had labeled data we could maximize likelihood simply by
counting and normalizing to get mean and variance of
Gaussians for the given classes
$l = \sum_{n=1}^{N} \log p(x_n, y_n | \pi, \mu, \Sigma)$
$l = \sum_{n=1}^{N} \log \pi_{y_n} N(x_n | \mu_{y_n}, \Sigma_{y_n})$
Example data (x, y): (30, 1), (55, 2), (24, 1), (40, 1), (35, 2)
If we have two classes C1 and C2
Let's say we have a feature x: x = number of times the word "field" occurs
And a class label y: y = 1 (hockey documents) or 2 (baseball documents)
Find out $\mu_i$ and $\Sigma_i$ from the data for both classes: $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$
37
Maximizing Likelihood for Mixture Model with
Hidden Variables
For a mixture model with a hidden variable
representing 2 classes, log likelihood is
$l = \sum_{n=1}^{N} \log p(x_n | \pi, \mu, \Sigma)$
$l = \sum_{n=1}^{N} \log \sum_{y=0}^{1} p(x_n, y | \pi, \mu, \Sigma)$
$\;\, = \sum_{n=1}^{N} \log \left(\pi_0 N(x_n|\mu_0, \Sigma_0) + \pi_1 N(x_n|\mu_1, \Sigma_1)\right)$
38
Log-likelihood for Mixture of Gaussians
$\log p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k N(x_n|\mu_k, \Sigma_k) \right)$
We want to find the parameters that maximize the above log-likelihood, i.e. the parameters that best explain the data under the model
We can again use an iterative process to maximize the above function
This 2-step iterative process is called Expectation-Maximization
39
Explaining Expectation Maximization
EM is like fuzzy K-means (e.g. Baseball vs. Hockey documents)
Parameters to estimate for K classes
Let us assume we can model this data with a mixture of two Gaussians (K=2)
Start with 2 Gaussians (initialize mu and sigma values)
Expectation: compute the distance of each point to the mu of the 2 Gaussians and assign it a soft class label (Ck)
Maximization: use the assigned points to recompute mu and sigma for the 2 Gaussians, but weight the updates with the soft labels
40
Expectation Maximization
An expectation-maximization (EM) algorithm is used in statistics for
finding maximum likelihood estimates of parameters in
probabilistic models, where the model depends on unobserved
hidden variables.
EM alternates between performing an expectation (E) step, which
computes an expectation of the likelihood by including the latent
variables as if they were observed, and a maximization (M) step,
which computes the maximum likelihood estimates of the
parameters by maximizing the expected likelihood found on the E
step. The parameters found on the M step are then used to begin
another E step, and the process is repeated.
The EM algorithm was explained and given its name in a classic
1977 paper by A. Dempster, N. Laird, and D. Rubin in the Journal of the
Royal Statistical Society.
41
Estimating Parameters
E-Step
$\gamma(z_{nk}) = E[z_{nk}|x_n] = p(z_k = 1|x_n)$
$\gamma(z_{nk}) = \frac{\pi_k N(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n|\mu_j, \Sigma_j)}$
42
Estimating Parameters
M-Step
$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$
$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T$
$\pi_k = \frac{N_k}{N}$
where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
Iterate until convergence of the log likelihood
$\log p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k N(x_n|\mu_k, \Sigma_k) \right)$
43
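Putting the E-step and M-step formulas together, a compact NumPy/SciPy sketch of EM for a mixture of Gaussians; the initialization, the small regularization added to the covariances, and the convergence threshold are simple choices for illustration, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """X: (N, D). Returns mixing weights pi, means mu, covariances Sigma, responsibilities gamma."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]          # initialize means with K random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: gamma(z_nk) = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k using the soft counts N_k
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # Check convergence of the log likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma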
EM Iterations
EM iterations [1]
44
Clustering Documents with EM
Clustering documents requires a representation of documents as a set of features
The set of features can be a bag-of-words model
Or features such as POS, word similarity, number of sentences, etc.
Can we use a mixture of Gaussians for any kind of features?
How about a mixture of multinomials for document clustering?
How do we get an EM algorithm for a mixture of multinomials?
45
Clustering Algorithms
We just described two kinds of clustering algorithms
K-means
Expectation Maximization
Expectation-Maximization is a general way to
maximize log likelihood for distributions with hidden
variables
For example, in EM for HMMs, the state sequences were hidden
For document clustering, other kinds of clustering algorithms exist
46
Hierarchical Clustering
Build a binary tree that groups similar data in an iterative manner
K-means: distance of a data point to the center of the Gaussian
EM: posterior of a data point w.r.t. the Gaussian
Hierarchical: similarity? Similarity across groups of data
47
Types of Hierarchical Clustering
Agglomerative (bottom-up):
Assign each data point as its own cluster
Iteratively combine sub-clusters
Eventually, all data points are part of 1 cluster
Divisive (top-down):
Assign all data points to the same cluster
Iteratively split clusters
Eventually each data point forms its own cluster
One advantage: we do not need to define K, the number of clusters, before we begin clustering
48
Hierarchical Clustering Algorithm
Step 1: Assign each data point to its own cluster
Step 2: Compute the similarity between clusters
Step 3: Merge the two most similar clusters to form one fewer cluster (repeat from Step 2)
49
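A brief sketch of this agglomerative merge loop using SciPy's hierarchy module; the linkage method ("single", "complete", "average") corresponds to the similarity metrics discussed on the following slides, and the toy data and the cut into 2 flat clusters are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data: two loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)), rng.normal(8, 1, size=(10, 2))])

# Steps 1-3: start with every point as its own cluster, then repeatedly
# merge the two most similar clusters; "average" = average group linkage
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)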
Hierarchical Clustering Demo
Animation source [4]
50
Similar Clusters?
How do we compute similar clusters?
Distance between 2 points in the clusters?
Distance from means of two clusters?
Distance between two closest points in the clusters?
Different similarity metrics can produce different types of clusters
Commonly used similarity metrics:
Single Linkage
Complete Linkage
Average Group Linkage
51
Single Linkage
Cluster1
Cluster2
52
Complete Linkage
Cluster1
Cluster2
53
Average Group Linkage
Cluster1
Cluster2
54
Hierarchical Cluster for Documents
Figure: Ho, Qirong, et al. [3]
55
Hierarchical Document Clusters
High-level multi-view of the corpus
Taxonomy useful for various purposes
Q&A related to a subtopic
Finding broadly important topics
Recursive drill down on topics
Filter irrelevant topics
56
Summary
Unsupervised clustering algorithms
K-means
Expectation Maximization
Hierarchical clustering
EM is a general algorithm that can be used to
estimate maximum likelihood of functions with
hidden variables
Similarity Metric is important when clustering
segments of text
57
References
[1] Christopher Bishop, Pattern Recognition and Machine Learning,
2006
[2] http://www.smartmoney.com/map-of-the-market/
[3] Ho, Qirong, et al., Document Hierarchies from Text and Links, 2012
58