Text Clustering, K-Means, Gaussian
Mixture Models, Expectation-
Maximization, Hierarchical Clustering
Sameer Maskey
Week 3, Sept 19, 2012
1
Topics for Today
Text Clustering
Gaussian Mixture Models
K-Means
Expectation Maximization
Hierarchical Clustering
2
Announcement
Proposal due tonight (11:59pm); not graded
Feedback by Friday
Final proposal due next Wednesday (11:59pm)
5% of the project grade
Email me the proposal with the title
"Project Proposal: Statistical NLP for the Web"
Homework 1 is out
Due Thursday, October 4th (11:59pm)
Please use Courseworks
3
Course Initial Survey
[Bar chart: Class Survey, showing percentage of Yes/No responses for each survey category (NLP, SLP, ML, NLP for ML, Adv ML, Pace, Math, Matlab, Matlab Tutorial/Project, industry mentors, larger audience, etc.)]
4
Perceptron Algorithm
We are given $(x_i, y_i)$
Initialize $w$
Do until converged
  if error($y_i$, sign($w \cdot x_i$)) == TRUE
    $w \leftarrow w + y_i x_i$
  end if
End do
If the predicted class is wrong, add or subtract that point to/from the weight vector
5
Perceptron (cont.)
y is the prediction based on the weights and is either 0 or 1 in this case
$y_j(t) = f[\mathbf{w}(t) \cdot \mathbf{x}_j]$
$w_i(t+1) = w_i(t) + (d_j - y_j(t))\, x_{i,j}$
The error $(d_j - y_j(t))$ is either 1, 0, or -1
Example from Wikipedia
6
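As a concrete illustration, here is a minimal Python/NumPy sketch of the perceptron update rule from the slides above; the toy data, the fixed number of epochs, and the convergence check are hypothetical choices, not part of the lecture.

import numpy as np

def perceptron_train(X, y, epochs=10):
    """Minimal perceptron sketch: X is (N, D) features, y is labels in {-1, +1}."""
    w = np.zeros(X.shape[1])              # initialize w
    for _ in range(epochs):               # "do until converged" (capped at a fixed number of epochs)
        errors = 0
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:   # error(y_i, sign(w . x_i)) == TRUE
                w = w + y_i * x_i         # w <- w + y_i * x_i
                errors += 1
        if errors == 0:                   # converged: no misclassifications in a full pass
            break
    return w

# Hypothetical toy data: two linearly separable points
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
print(perceptron_train(X, y))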
Naïve Bayes Classifier for Text
$P(Y = y_k | X_1, X_2, ..., X_N) = \frac{P(Y = y_k)\, P(X_1, X_2, ..., X_N | Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, X_2, ..., X_N | Y = y_j)}$
$= \frac{P(Y = y_k) \prod_i P(X_i | Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i | Y = y_j)}$
$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i | Y = y_k)$
7
Naïve Bayes Classifier for Text
Given the training data, what are the parameters to be estimated?

P(Y):             P(X|Y1):           P(X|Y2):
Diabetes: 0.8     the: 0.001         the: 0.001
Hepatitis: 0.2    diabetic: 0.02     diabetic: 0.0001
                  blood: 0.0015      water: 0.0118
                  sugar: 0.02        fever: 0.01
                  weight: 0.018      weight: 0.008

$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i | Y = y_k)$
8
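A small sketch of the argmax decision rule above, using the illustrative probabilities from the table (the exact word-to-column assignment is assumed from the slide layout); words missing from a class's table get a hypothetical floor probability, and smoothing is ignored for simplicity.

import math

# Parameters from the slide (assumed column assignment)
prior = {"Diabetes": 0.8, "Hepatitis": 0.2}
likelihood = {
    "Diabetes":  {"the": 0.001, "diabetic": 0.02,   "blood": 0.0015, "sugar": 0.02, "weight": 0.018},
    "Hepatitis": {"the": 0.001, "diabetic": 0.0001, "water": 0.0118, "fever": 0.01, "weight": 0.008},
}
UNSEEN = 1e-6   # hypothetical floor probability for words absent from a class table

def classify(words):
    """Return argmax_y  log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for y in prior:
        score = math.log(prior[y])
        for w in words:
            score += math.log(likelihood[y].get(w, UNSEEN))
        scores[y] = score
    return max(scores, key=scores.get)

print(classify(["the", "blood", "sugar", "weight"]))   # -> "Diabetes"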
Data without Labels
Data with corresponding human scores (Regression):
  is writing a paper        0.5
  has flu                   0.1
  is happy, yankees won!    0.87
Data with corresponding human class labels (Perceptron, Naïve Bayes, Fisher's Linear Discriminant):
  is writing a paper        SAD
  has flu                   SAD
  is happy, yankees won!    HAPPY
Data with NO corresponding labels:
  is writing a paper        ?
  has flu                   ?
  is happy, yankees won!    ?
9
Document Clustering
Previously we classified Documents into Two Classes
Diabetes (Class1) and Hepatitis (Class2)
We had human labeled data
Supervised learning
What if we do not have manually tagged documents?
Can we still classify documents?
Document clustering
Unsupervised Learning
10
Classification vs. Clustering
Supervised training of a classification algorithm vs. unsupervised training of a clustering algorithm
11
Clusters for Classification
Automatically Found Clusters
can be used for Classification
12
Document Clustering
[Figure: two clusters, Baseball Docs and Hockey Docs, and a new unlabeled document]
Which cluster does the new document belong to?
13
Document Clustering
Cluster the documents into N clusters/categories
For classification we were able to estimate parameters using labeled data
Perceptrons find the parameters that define the separating hyperplane
Naïve Bayes counts the number of times a word occurs in the given class and normalizes
It is not evident how to find a separating hyperplane when no labeled data is available
It is also not evident how many classes the data has when we do not have labels
14
Document Clustering Application
Even though we do not know the human labels, automatically induced clusters can be useful
[Figure: News Clusters]
15
Document Clustering Application
A Map of Yahoo!, Mappa.Mundi Magazine, February 2000; Map of the Market with Headlines, SmartMoney [2]
16
How to Cluster Documents with No
Labeled Data?
Treat cluster IDs or class labels as hidden variables
Maximize the likelihood of the unlabeled data
Cannot simply count for MLE as we do not know
which point belongs to which class
Use an iterative algorithm such as K-Means or EM
Hidden Variables?
What do we mean by this?
17
Hidden vs. Observed Variables
Assuming our observed data is in $R^2$
How many observed variables?
How many hidden variables?
18
Clustering
If we have data with labels: (30, 1), (55, 2), (24, 1), (40, 1), (35, 2)
Find out $\mu_i$ and $\Sigma_i$ from the data for both classes: $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$
If we have data with NO labels but know the data comes from 2 classes: (30, ?), (55, ?), (24, ?), (40, ?), (35, ?)
Find out $\mu_i$ and $\Sigma_i$ from the data for both classes: $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$
19
K-Means in Words
Parameters to estimate for K classes (e.g. Baseball vs. Hockey documents)
Let us assume we can model this data with a mixture of two Gaussians
Start with 2 Gaussians (initialize mu values)
Compute the distance of each point to the mu of the 2 Gaussians and assign it to the closest Gaussian (class label Ck)
Use the assigned points to recompute mu for the 2 Gaussians
20
K-Means Clustering
Let us define a dataset in D dimensions: $\{x_1, x_2, ..., x_N\}$
We want to cluster the data into K clusters
Let $\mu_k$ be a D-dimensional vector representing cluster k
Let us define $r_{nk}$ for each $x_n$ such that
$r_{nk} \in \{0, 1\}$, where $k = 1, ..., K$, and
$r_{nk} = 1$ if $x_n$ is assigned to cluster k
21
Distortion Measure
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, ||x_n - \mu_k||^2$
Represents the sum of squared distances from each data point to its assigned $\mu_k$
We want to minimize J
22
Estimating Parameters
We can estimate the parameters with a 2-step iterative process
Step 1: Minimize J with respect to $r_{nk}$, keeping $\mu_k$ fixed
Step 2: Minimize J with respect to $\mu_k$, keeping $r_{nk}$ fixed
23
Minimize J with respect to $r_{nk}$
Step 1: Keep $\mu_k$ fixed
Optimize for each n separately by choosing the $r_{nk}$ for the k that gives the minimum $||x_n - \mu_k||^2$
$r_{nk} = 1$ if $k = \arg\min_j ||x_n - \mu_j||^2$
$\quad\;\;\, = 0$ otherwise
Assign each data point to the cluster that is the closest
Hard decision on cluster assignment
24
Minimize J with respect to $\mu_k$
Step 2: Keep $r_{nk}$ fixed
J is quadratic in $\mu_k$; minimize by setting the derivative w.r.t. $\mu_k$ to zero
$\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$
Take all the points assigned to cluster k and re-estimate the mean for cluster k
25
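A minimal NumPy sketch of the two alternating steps above (hard assignment, then mean re-estimation); the random initialization, iteration cap, and stopping rule are simple choices for illustration, not prescribed by the slides.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """X: (N, D) data. Returns cluster means mu (K, D) and hard assignments r (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize mu with K random points
    for _ in range(n_iters):
        # Step 1: minimize J w.r.t. r_nk, keeping mu_k fixed (assign each point to the closest mean)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        r = dists.argmin(axis=1)
        # Step 2: minimize J w.r.t. mu_k, keeping r_nk fixed (mean of the assigned points)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: means no longer move
            break
        mu = new_mu
    return mu, r

# Hypothetical usage on two well-separated blobs
X = np.vstack([np.random.randn(50, 2) + 5.0, np.random.randn(50, 2) - 5.0])
mu, r = kmeans(X, K=2)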
Document Clustering with K-means
Assume we have unlabeled data for Hockey and Baseball documents
We want to be able to categorize a new document into one of the 2 classes (K=2)
We can represent each document as a feature vector
Features can be word IDs or other NLP features such as POS tags, word context, etc. (D = total dimension of the feature vectors)
N documents are available
Randomly initialize the 2 class means
Compute the squared distance of each point $x_n$ (D-dimensional) to the class means $\mu_k$
Assign the point to the k for which the distance to $\mu_k$ is lowest
Re-compute $\mu_k$ and re-iterate
26
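For document clustering specifically, a short sketch along these lines using bag-of-words features: scikit-learn's TfidfVectorizer and KMeans are used here as stand-ins for the feature extraction and clustering steps described above, and the toy documents are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical unlabeled documents (baseball-like and hockey-like)
docs = [
    "home run pitcher batter inning",
    "puck goalie ice rink stick",
    "batter strikes out at the plate",
    "hockey players skate across the ice",
]

vectorizer = TfidfVectorizer()          # documents -> D-dimensional feature vectors
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0)   # K = 2 clusters
labels = km.fit_predict(X)              # cluster ID for each document
print(labels)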
K-Means Example
K-means algorithm Illustration [1]
27
Clusters
Number of documents
clustered together
28
Hard Assignment to Clusters
K-means algorithm assigns each point to the closest
cluster
Hard decision
Each data point affects the mean computation equally
How do points that are almost equidistant from the 2 clusters affect the algorithm?
Soft decision?
Fractional counts?
29
Gaussian Mixture Models (GMMs)
30
Mixtures of 2 Gaussians
$p(x) = \pi\, N(x|\mu_1, \Sigma_1) + (1 - \pi)\, N(x|\mu_2, \Sigma_2)$
GMM with 2 Gaussians
31
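A minimal sketch evaluating the two-component mixture density above for 1-D data with scipy.stats.norm; the mixing weight, means, and standard deviations below are hypothetical.

from scipy.stats import norm

# Hypothetical parameters of a 1-D mixture of two Gaussians
pi_1, mu_1, sigma_1 = 0.6, 0.0, 1.0
mu_2, sigma_2 = 5.0, 2.0

def mixture_pdf(x):
    """p(x) = pi * N(x | mu_1, sigma_1) + (1 - pi) * N(x | mu_2, sigma_2)"""
    return pi_1 * norm.pdf(x, mu_1, sigma_1) + (1 - pi_1) * norm.pdf(x, mu_2, sigma_2)

print(mixture_pdf(0.0), mixture_pdf(5.0))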
Mixture Models
Mixture of Gaussians [1]
1 Gaussian may not fit the data
2 Gaussians may fit the data better
Each Gaussian can be a class category
When labeled data is not available, we can treat the class category as a hidden variable
32
Mixture Model Classifier
Given a new data point, find the posterior probability of each class
$p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}$
$p(y=1|x) \propto N(x|\mu_1, \Sigma_1)\, p(y=1)$
33
Cluster ID/Class Label as Hidden
Variables
$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x|z)$
We can treat the class category as a hidden variable z
z is a K-dimensional binary random variable in which $z_k = 1$ and the other elements are 0
z = [0 0 1 0 0 ...]
$p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$
Also, the priors sum to 1: $\sum_{k=1}^{K} \pi_k = 1$
The conditional distribution of x given a particular z can be written as
$p(x|z_k = 1) = N(x|\mu_k, \Sigma_k)$
$p(x|z) = \prod_{k=1}^{K} N(x|\mu_k, \Sigma_k)^{z_k}$
34
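The generative story above (draw a component indicator z from p(z), then draw x from the corresponding Gaussian) can be sketched directly; the mixing weights, means, and covariances below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for K = 2 components in 2-D
pi = np.array([0.3, 0.7])                       # p(z_k = 1) = pi_k, sums to 1
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]

def sample_one():
    """Sample z ~ p(z), then x ~ N(x | mu_z, Sigma_z)."""
    k = rng.choice(len(pi), p=pi)               # latent component indicator
    x = rng.multivariate_normal(mus[k], sigmas[k])
    return k, x

samples = [sample_one() for _ in range(5)]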
Mixture of Gaussians with Hidden Variables
$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x|z)$
$p(x) = \sum_{k=1}^{K} \pi_k\, N(x|\mu_k, \Sigma_k)$
where $\pi_k$ is the mixing component, $\mu_k$ the mean, $\Sigma_k$ the covariance, and $N(x|\mu_k, \Sigma_k)$ a component of the mixture
$p(x) = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$
Mixture models can be linear combinations of other distributions as well
Mixture of binomial distributions, for example
35
Conditional Probability of Label Given
Data
A mixture model with parameters $\mu$, $\Sigma$, and prior $\pi$ can represent the data
We can maximize the likelihood of the data given the model parameters to find the best parameters
If we know the best parameters we can estimate
$\gamma(z_k) \equiv p(z_k = 1|x) = \frac{p(z_k = 1)\, p(x|z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x|z_j = 1)} = \frac{\pi_k N(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x|\mu_j, \Sigma_j)}$
This essentially gives us the probability of the class given the data, i.e. a label for the given data point
36
Maximizing Likelihood
If we had labeled data we could maximize likelihood simply by
counting and normalizing to get mean and variance of
Gaussians for the given classes
$l = \sum_{n=1}^{N} \log p(x_n, y_n | \pi, \mu, \Sigma)$
$l = \sum_{n=1}^{N} \log \pi_{y_n} N(x_n | \mu_{y_n}, \Sigma_{y_n})$
Example data (x, y): (30, 1), (55, 2), (24, 1), (40, 1), (35, 2)
If we have two classes C1 and C2
Let's say we have a feature x: x = number of times the word "field" occurs
And a class label y: y = 1 (hockey documents) or 2 (baseball documents)
Find out $\mu_i$ and $\Sigma_i$ from the data for both classes: $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$
37
Maximizing Likelihood for Mixture Model with
Hidden Variables
For a mixture model with a hidden variable
representing 2 classes, log likelihood is
$l = \sum_{n=1}^{N} \log p(x_n | \pi, \mu, \Sigma)$
$l = \sum_{n=1}^{N} \log \sum_{y=0}^{1} p(x_n, y | \pi, \mu, \Sigma)$
$\;\, = \sum_{n=1}^{N} \log \left(\pi_0 N(x_n|\mu_0, \Sigma_0) + \pi_1 N(x_n|\mu_1, \Sigma_1)\right)$
38
Log-likelihood for Mixture of Gaussians
$\log p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k N(x_n|\mu_k, \Sigma_k) \right)$
We want to find the parameters that maximize the above log-likelihood, i.e. the parameters that best explain the data under the model
We can again use an iterative process to maximize the above function
This 2-step iterative process is called Expectation-Maximization
39
Explaining Expectation Maximization
EM is like fuzzy K-means (e.g. Baseball vs. Hockey documents)
Parameters to estimate for K classes
Let us assume we can model this data with a mixture of two Gaussians (K=2)
Start with 2 Gaussians (initialize mu and sigma values)
Expectation: compute the distance of each point to the mu of the 2 Gaussians and assign it a soft class label (Ck)
Maximization: use the assigned points to recompute mu and sigma for the 2 Gaussians, but weight the updates with the soft labels
40
Expectation Maximization
An expectation-maximization (EM) algorithm is used in statistics for
finding maximum likelihood estimates of parameters in
probabilistic models, where the model depends on unobserved
hidden variables.
EM alternates between performing an expectation (E) step, which
computes an expectation of the likelihood by including the latent
variables as if they were observed, and a maximization (M) step,
which computes the maximum likelihood estimates of the
parameters by maximizing the expected likelihood found on the E
step. The parameters found on the M step are then used to begin
another E step, and the process is repeated.
The EM algorithm was explained and given its name in a classic
1977 paper by A. Dempster, N. Laird, and D. Rubin in the Journal of the
Royal Statistical Society.
41
Estimating Parameters
E-Step
$\gamma(z_{nk}) = E[z_{nk}|x_n] = p(z_k = 1|x_n)$
$\gamma(z_{nk}) = \frac{\pi_k N(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n|\mu_j, \Sigma_j)}$
42
Estimating Parameters
M-Step
$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$
$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T$
$\pi_k = \frac{N_k}{N}$
where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
Iterate until convergence of the log likelihood
$\log p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k N(x_n|\mu_k, \Sigma_k) \right)$
43
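Putting the E-step and M-step formulas together, a compact NumPy/SciPy sketch of EM for a mixture of Gaussians; the initialization, the small regularization added to the covariances, and the convergence threshold are simple choices for illustration, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """X: (N, D). Returns mixing weights pi, means mu, covariances Sigma, responsibilities gamma."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]          # initialize means with K random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: gamma(z_nk) = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k using the soft counts N_k
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # Check convergence of the log likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma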
EM Iterations
EM iterations [1]
44
Clustering Documents with EM
Clustering documents requires a representation of documents as a set of features
The set of features can be a bag-of-words model
Or features such as POS, word similarity, number of sentences, etc.
Can we use a mixture of Gaussians for any kind of features?
How about a mixture of multinomials for document clustering?
How do we get an EM algorithm for a mixture of multinomials?
45
Clustering Algorithms
We just described two kinds of clustering algorithms
K-means
Expectation Maximization
Expectation-Maximization is a general way to
maximize log likelihood for distributions with hidden
variables
For example, in EM for HMMs, the state sequences were hidden
For document clustering, other kinds of clustering algorithms exist
46
Hierarchical Clustering
Build a binary tree that groups similar data in an iterative manner
K-means: distance of a data point to the center of the Gaussian
EM: posterior of a data point w.r.t. the Gaussian
Hierarchical: similarity? Similarity across groups of data
47
Types of Hierarchical Clustering
Agglomerative (bottom-up):
Assign each data point as its own cluster
Iteratively combine sub-clusters
Eventually, all data points are part of 1 cluster
Divisive (top-down):
Assign all data points to the same cluster
Iteratively split clusters
Eventually each data point forms its own cluster
One advantage: we do not need to define K, the number of clusters, before we begin clustering
48
Hierarchical Clustering Algorithm
Step 1: Assign each data point to its own cluster
Step 2: Compute the similarity between clusters
Step 3: Merge the two most similar clusters to form one fewer cluster (repeat from Step 2)
49
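A brief sketch of this agglomerative merge loop using SciPy's hierarchy module; the linkage method ("single", "complete", "average") corresponds to the similarity metrics discussed on the following slides, and the toy data and the cut into 2 flat clusters are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data: two loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)), rng.normal(8, 1, size=(10, 2))])

# Steps 1-3: start with every point as its own cluster, then repeatedly
# merge the two most similar clusters; "average" = average group linkage
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)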
Hierarchical Clustering Demo
Animation source [4]
50
Similar Clusters?
How do we compute similar clusters?
Distance between 2 points in the clusters?
Distance from means of two clusters?
Distance between two closest points in the clusters?
Different similarity metrics can produce different types of clusters
Commonly used similarity metrics:
Single Linkage
Complete Linkage
Average Group Linkage
51
Single Linkage
Cluster1
Cluster2
52
Complete Linkage
Cluster1
Cluster2
53
Average Group Linkage
Cluster1
Cluster2
54
Hierarchical Cluster for Documents
Figure: Ho, Qirong, et al. [3]
55
Hierarchical Document Clusters
High-level multi-view of the corpus
Taxonomy useful for various purposes
Q&A related to a subtopic
Finding broadly important topics
Recursive drill down on topics
Filter irrelevant topics
56
Summary
Unsupervised clustering algorithms
K-means
Expectation Maximization
Hierarchical clustering
EM is a general algorithm that can be used to
estimate maximum likelihood of functions with
hidden variables
Similarity Metric is important when clustering
segments of text
57
References
[1] Christopher Bishop, Pattern Recognition and Machine Learning,
2006
[2] http://www.smartmoney.com/map-of-the-market/
[3] Ho, Qirong, et al., Document Hierarchies from Text and Links, 2012
58