
Statistical Computing with R:

Masters in Data Sciences 503 (S27)


Third Batch, SMS, TU, 2024
Shital Bhandary
Associate Professor
Statistics/Bio-statistics, Demography and Public Health Informatics
Patan Academy of Health Sciences, Lalitpur, Nepal
Faculty, Data Analysis and Decision Modeling, MBA, Pokhara University, Nepal
Faculty, FAIMER Fellowship in Health Professions Education, India/USA.
Review / Preview: Unsupervised models
• Review: Clustering (K-means clustering)
• Preview: Clustering (Hierarchical clustering)
Cluster analysis (Clustering): Chapter 12, "An Introduction to Statistical Learning" book!
• Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
• When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.
• Both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different:
• PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance;
• Clustering looks to find homogeneous subgroups among the observations or cases.
Cluster analysis (Clustering): Chapter 12, "An Introduction to Statistical Learning" book!
• Since clustering is popular in many fields, there exist a great number of clustering methods.
• In this section we focus on perhaps the two best-known clustering approaches: K-means clustering and hierarchical clustering.
• In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
• In hierarchical clustering, on the other hand, we do not know in advance how many clusters we want; we use a "dendrogram" to find the number of clusters for the data.
k-means clustering: which “k” is the best?
k-means clustering with random data: ISLR

# ISLR book: simulate random data with two groups
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4
# We are creating two groups!

# K-means clustering
# We use K = 2 as we have created random data with 2 groups
km.out <- kmeans(x, 2, nstart = 20)
km.out            # check the variance explained!

# Checking the clusters
km.out$cluster
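A minimal sketch (not on the original slide) of how to pull the pieces mentioned above directly out of the fitted object; all of these components are part of the standard kmeans() return value:

# Inspecting the kmeans() fit object
km.out$centers                    # cluster means (one row per cluster)
km.out$cluster                    # cluster assignment for each observation
km.out$tot.withinss               # total within-cluster sum of squares
km.out$betweenss / km.out$totss   # the "variance explained" ratio printed as (between_SS / total_SS)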
ISLR book: We strongly recommend always running K-means
clustering with a large value of nstart, such as 20 or 50, since
otherwise an undesirable local optimum may be obtained.
Plot the clusters:

# Plot
plot(x, col = (km.out$cluster + 1),
     main = "K-Means Clustering Results with K = 2",
     xlab = "", ylab = "", pch = 20, cex = 2)
Let us use k = 3 in the data and see what happens:

# Clustering with 3 clusters in the random data
set.seed(4)
km.out <- kmeans(x, 3, nstart = 20)
km.out

Output: K-means clustering with 3 clusters of sizes 17, 23, 10

Cluster means:
        [,1]        [,2]
1  3.7789567 -4.56200798
2 -0.3820397 -0.08740753
3  2.3001545 -2.69622023

Within cluster sum of squares by cluster:
[1] 25.74089 52.67700 19.56137
(between_SS / total_SS = 79.3 %)
Plot:

# Plot
plot(x, col = (km.out$cluster + 1),
     main = "K-Means Clustering Results with K = 3",
     xlab = "", ylab = "", pch = 20, cex = 2)
Comparing two plots: 2-cluster and 3-cluster

# Plot the clusters
par(mfrow = c(1, 2))
# ...code of the plot with 2 clusters...
# ...code of the plot with 3 clusters...
# (see the full sketch below)
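A sketch of the side-by-side comparison referenced above. The object names km.out2 and km.out3 are assumptions introduced here so that both fits can be held at once (the slides reuse the single name km.out):

# Refit both models and plot them side by side
km.out2 <- kmeans(x, 2, nstart = 20)
km.out3 <- kmeans(x, 3, nstart = 20)
par(mfrow = c(1, 2))
plot(x, col = (km.out2$cluster + 1),
     main = "K-Means Clustering Results with K = 2",
     xlab = "", ylab = "", pch = 20, cex = 2)
plot(x, col = (km.out3$cluster + 1),
     main = "K-Means Clustering Results with K = 3",
     xlab = "", ylab = "", pch = 20, cex = 2)
par(mfrow = c(1, 1))   # reset the plotting layout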

How to decide: which k is best?

We need to use hierarchical clustering!


Comparing different nstarts: Plot them and see the changes!

# nstart = 1
set.seed(4)
km.out <- kmeans(x, 3, nstart = 1)
km.out$tot.withinss
# [1] 104.3319   (between_SS / total_SS = 78.0 %)

# nstart = 20
km.out <- kmeans(x, 3, nstart = 20)
km.out$tot.withinss
# [1] 97.97927   (between_SS / total_SS = 79.3 %)
Question:
• What is the difference when nstart = 1 and nstart = 20 are used?
• Which one should we use?
• Why does nstart = 1 leave more within-cluster variance (a higher tot.withinss) than nstart = 20?
• https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter
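A small illustration of the point behind these questions, not taken from the slides: with nstart = 1 each call keeps a single random start, so the result can land in a different local optimum from run to run, while nstart = 20 keeps the best of 20 starts and is far more stable:

# Repeat each fit a few times and compare the total within-cluster sum of squares
set.seed(1)
replicate(5, kmeans(x, 3, nstart = 1)$tot.withinss)    # values vary across runs
replicate(5, kmeans(x, 3, nstart = 20)$tot.withinss)   # essentially constant, usually the smallest value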
Let's do k-means clustering with "iris" data:

# Load two packages for the special plot
library(ClusterR)
library(cluster)

# Get, check and prepare the data
data(iris)
str(iris)
iris_1 <- iris[, -5]

# Fitting the K-means clustering model to the training dataset
set.seed(240)
kmeans.res <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.res

We have used k = 3 as we know that there are 3 types of flowers!
k-means fit:

K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
(between_SS / total_SS = 88.4 %)
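The "(between_SS / total_SS)" value printed above does not have to be read off the console; a minimal sketch of extracting it directly from the fit object, useful later when comparing candidate models:

# R-square-style ratio for the iris fit
kmeans.res$betweenss / kmeans.res$totss   # about 0.884 for this 3-cluster fit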
Confusion matrix: Possible here as we also have the dependent variable (Species) to compare!

# Confusion matrix (not usual for clustering!)
cm <- table(iris$Species, kmeans.res$cluster)
cm

              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

# Accuracy and misclassification error
(accuracy <- sum(diag(cm)) / sum(cm))
[1] 0.8933333
(mce <- 1 - accuracy)
[1] 0.1066667

Do the same for the Decision Tree based models fitted with the CTG data in the previous class and compare!
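One caveat worth keeping in mind, added here as an assumption rather than slide content: kmeans() numbers its clusters arbitrarily, so the diagonal of the table is only meaningful when the cluster numbers happen to line up with the species order (as they do above). A label-order-free sketch is to take the best-matching cluster for each species:

# Accuracy based on the best-matching cluster per species (cluster "purity")
cm <- table(iris$Species, kmeans.res$cluster)
sum(apply(cm, 1, max)) / sum(cm)   # equals the diagonal-based accuracy when the labels align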
Model Evaluation and Visualization:

# Model evaluation and visualization
plot(iris_1[c("Sepal.Length", "Sepal.Width")])

plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.res$cluster)

plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.res$cluster,
     main = "K-means with 3 clusters")
Adding cluster centers (use points() after plot()):

# Getting cluster centers
kmeans.res$centers
kmeans.res$centers[, c("Sepal.Length", "Sepal.Width")]

# Plotting cluster centers
points(kmeans.res$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)
Visualizing clusters:

# Visualizing clusters
y_kmeans <- kmeans.res$cluster
library(cluster)
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
         y_kmeans,
         lines = 0,
         shade = TRUE, color = TRUE,
         labels = 2,
         plotchar = FALSE, span = TRUE,
         main = paste("Cluster iris"),
         xlab = 'Sepal.Length',
         ylab = 'Sepal.Width')
Hierarchical cluster analysis (HCA):
• One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters K.
• Hierarchical clustering is an alternative approach which does not require that we commit to a particular choice of K.
• Hierarchical clustering has an added advantage over K-means clustering in that it results in an attractive tree-based representation of the observations, called a dendrogram.
HCA algorithm:
• The hierarchical clustering dendrogram is obtained via an extremely simple algorithm.
• We begin by defining some sort of dissimilarity measure between each pair of observations. Most often, Euclidean distance is used.
• The algorithm proceeds iteratively.
• Starting out at the bottom of the dendrogram, each of the n observations is treated as its own cluster.
HCA algorithm:
• The two clusters that are most similar to each other are then fused, so that there now are n − 1 clusters.
• Next, the two clusters that are most similar to each other are fused again, so that there now are n − 2 clusters.
• The algorithm proceeds in this fashion until all of the observations belong to one single cluster, and the dendrogram is complete.
Hierarchical clustering: How many clusters?
We need to find k by "cutting" the dendrogram at a chosen height (distance)!
HCA: Linkage methods for measuring the "dissimilarity" between clusters (e.g., single, complete, average, centroid)
HCA with "single" linkage: USArrests.1 data:

# Hierarchical clustering with single linkage
# US Arrests data
USArrests.1 <- USArrests[, -3]
state.disimilarity <- dist(USArrests.1)
hirar.1 <- hclust(state.disimilarity, method = 'single')

plot(hirar.1, labels = rownames(USArrests.1), ylab = "Distance")

hirar.1

Call:
hclust(d = state.disimilarity, method = "single")

Cluster method   : single
Distance         : euclidean
Number of objects: 50
HCA with "single" linkage in USArrests.1 data (dendrogram)
How many clusters (k) do we get if we cut at a distance of 20 and 30?
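A minimal sketch of answering that question in code; cutree() is not shown on the slides but is the standard way to cut an hclust tree at a given height:

# Cut the single-linkage tree at distances 20 and 30
cut20 <- cutree(hirar.1, h = 20)    # cluster membership when the tree is cut at height 20
cut30 <- cutree(hirar.1, h = 30)
table(cut20)                        # cluster sizes; the number of entries is the number of clusters k
length(unique(cut30))               # k obtained from the cut at height 30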
HCA with "complete" linkage: USArrests.1 data:

# Hierarchical clustering with complete linkage
# US Arrests data
hirar.2 <- hclust(state.disimilarity, method = 'complete')

plot(hirar.2, labels = rownames(USArrests.1), ylab = "Distance")

Call:
hclust(d = state.disimilarity, method = "complete")

Cluster method   : complete
Distance         : euclidean
Number of objects: 50
HCA with "complete" linkage in USArrests.1 data (dendrogram)
How many clusters (k) do we get if we cut at a distance of 75, 150 and 200?
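A similar sketch for the complete-linkage tree; rect.hclust() (from base R's stats package) draws boxes around the clusters produced by a cut on the dendrogram that was just plotted:

# Cut the complete-linkage tree and visualise the cuts
table(cutree(hirar.2, h = 150))                               # cluster sizes for the cut at height 150
plot(hirar.2, labels = rownames(USArrests.1), ylab = "Distance")
rect.hclust(hirar.2, h = 150, border = "red")                 # boxes for the cut at 150
rect.hclust(hirar.2, h = 200, border = "blue")                # boxes for the cut at 200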
HCA with "complete" linkage in USArrests.1 data, with cuts at distances of 200 and 150!

So, it is always better to use HCA to determine K and then use that K to fit the k-means clustering.

In Data Science, we need to try all four linkage methods (e.g., single, complete, average and centroid), find the best k for each, fit k-means for each of these best K's, and then select the best clustering model based on the highest R-square value (between_SS / total_SS).
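A hedged sketch of that workflow, under the assumption that the dendrogram cuts have suggested a handful of candidate K values (the values 2, 3 and 4 below are placeholders, not results from the slides):

# Fit k-means for each candidate K and compare the R-square (between_SS / total_SS)
candidate.k <- c(2, 3, 4)   # replace with the counts obtained from your dendrogram cuts
r2 <- sapply(candidate.k, function(k) {
  fit <- kmeans(USArrests.1, centers = k, nstart = 20)
  fit$betweenss / fit$totss
})
data.frame(k = candidate.k, r.square = r2)   # pick the model with the highest r.square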
Questions/queries?

• Next two classes:
  1. Association rules
  2. Monte Carlo Simulations

• Final class: Projects in R
  • Install "git" on Windows so that we can use "GitHub" in RStudio while creating online projects
  • We will also need to use offline projects in RStudio
Thank you!
@shitalbhandary
