
Ch. 10 Exercises: Unsupervised Learning

library(e1071)
library(ISLR)
require(caTools)

## Loading required package: caTools

## Warning: package 'caTools' was built under R version 3.6.2

require(plotrix)

## Loading required package: plotrix

library(tidyr)
library(knitr)

## Warning: package 'knitr' was built under R version 3.6.2

Conceptual
1. (a)
We have the following equation:
$$\frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2 \qquad (10.12)$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$.

Expanding the LHS over i:

$$\frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \frac{1}{|C_k|} \sum_{i' \in C_k} \sum_{j=1}^{p} (x_{1j} - x_{i'j})^2 + \frac{1}{|C_k|} \sum_{i' \in C_k} \sum_{j=1}^{p} (x_{2j} - x_{i'j})^2 + \ldots$$

Further expanding a single term from the summation:

$$\begin{aligned}
\frac{1}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2
&= \frac{1}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} \big((x_{ij} - \bar{x}_j) - (x_{i'j} - \bar{x}_j)\big)^2 \\
&= \frac{1}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} \big((x_{ij} - \bar{x}_j)^2 - 2(x_{ij} - \bar{x}_j)(x_{i'j} - \bar{x}_j) + (x_{i'j} - \bar{x}_j)^2\big) \\
&= \frac{1}{|C_k|} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2 - \frac{2}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)(x_{i'j} - \bar{x}_j) + \frac{1}{|C_k|} \sum_{i' \in C_k} \sum_{j=1}^{p} (x_{i'j} - \bar{x}_j)^2 \\
&= \frac{2}{|C_k|} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2 - \frac{2}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)(x_{i'j} - \bar{x}_j)
\end{aligned}$$

Substituting the final term from above in the summation:

$$\frac{2|C_k|}{|C_k|} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2 \;-\; \underbrace{\frac{2}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} (x_{1j} - \bar{x}_j)(x_{i'j} - \bar{x}_j) \;-\; \frac{2}{|C_k|} \sum_{i' \in C_k,\, i' \neq i} \sum_{j=1}^{p} (x_{2j} - \bar{x}_j)(x_{i'j} - \bar{x}_j) \;-\; \ldots}_{=\,0}$$

The underbraced cross-product terms are equal to zero because $\sum_{i' \in C_k} x_{i'j} - |C_k|\bar{x}_j = 0$.

The remaining term is:

$$\frac{2|C_k|}{|C_k|} \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2,$$

which proves (10.12).
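As a quick sanity check (not part of the exercise), the identity can also be verified numerically on a small simulated cluster; the sketch below uses made-up data.

# Minimal sketch: numerically verify identity (10.12) on one simulated cluster.
set.seed(42)
Ck = matrix(rnorm(5 * 3), nrow = 5)          # 5 observations, p = 3 features
lhs = sum(as.matrix(dist(Ck))^2) / nrow(Ck)  # (1/|Ck|) * sum over all pairs (i, i') of squared distances
rhs = 2 * sum(scale(Ck, scale = FALSE)^2)    # 2 * sum of squared deviations from the cluster means
all.equal(lhs, rhs)                          # should be TRUE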

(b)

• At the end of each iteration, the K-means algorithm assigns each observation to the centroid that is closest in Euclidean distance. Minimizing the RHS of the identity reduces these squared Euclidean distances, and so the LHS, which is the objective in (10.11), is also reduced. In other words, the identity shows that minimizing the sum of squared Euclidean distances within each cluster is the same as minimizing the within-cluster variance of each cluster; a small numerical illustration follows below.
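As an illustration of this equivalence (my own sketch with simulated data, not part of the exercise), half of the pairwise quantity on the LHS, summed over clusters, matches the total within-cluster sum of squares reported by kmeans():

# Sketch: the pairwise form of (10.12), summed over clusters and halved,
# equals kmeans()'s total within-cluster sum of squares.
set.seed(42)
X = matrix(rnorm(30 * 2), ncol = 2)
km = kmeans(X, centers = 2, nstart = 20)
pairwise = sapply(1:2, function(k) {
  Xk = X[km$cluster == k, , drop = FALSE]
  sum(as.matrix(dist(Xk))^2) / nrow(Xk)      # (1/|Ck|) * sum of pairwise squared distances
})
all.equal(sum(pairwise) / 2, km$tot.withinss)  # should be TRUE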

2. (a)
Using algorithm 10.2 and complete linkage (maximal intercluster dissimilarity) results in the following dendrogram.
The initial dissimilarity matrix is:
$$\begin{bmatrix}  & 0.3 & 0.4 & 0.7 \\ 0.3 &  & 0.5 & 0.8 \\ 0.4 & 0.5 &  & 0.45 \\ 0.7 & 0.8 & 0.45 &  \end{bmatrix}$$

Treating each observation as its own cluster and fusing the two most similar results in clusters 1 and 2 being fused at height 0.3, giving the following matrix:

$$\begin{bmatrix}  & 0.5 & 0.8 \\ 0.5 &  & 0.45 \\ 0.8 & 0.45 &  \end{bmatrix}$$
As before, clusters 3 and 4 are fused at 0.45, leaving the highest dissimilarity value of 0.8, at which the two clusters (1 & 2) and (3 & 4) are fused.

$$\begin{bmatrix}  & 0.8 \\ 0.8 &  \end{bmatrix}$$

Dendrogram sketch:

m = as.dist(matrix(c(0, 0.3, 0.4, 0.7,
                     0.3, 0, 0.5, 0.8,
                     0.4, 0.5, 0.0, 0.45,
                     0.7, 0.8, 0.45, 0.0), nrow = 4))
plot(hclust(m, method = "complete"))

[Figure: Cluster Dendrogram, hclust(m, "complete") — observations 1 and 2 fuse at height 0.3, observations 3 and 4 at 0.45, and the two pairs at 0.8.]
(b)
Repeating (a) with single linkage (minimal intercluster dissimilarity).
As in (a), clusters 1 and 2 are the most similar and are fused at height 0.3. After this first fusion we are left with the matrix below:

$$\begin{bmatrix}  & 0.4 & 0.7 \\ 0.4 &  & 0.45 \\ 0.7 & 0.45 &  \end{bmatrix}$$
The lowest dissimilarity value is 0.4, between cluster (1 & 2) and observation 3, so we fuse them together, leaving the following matrix:
$$\begin{bmatrix}  & 0.45 \\ 0.45 &  \end{bmatrix}$$

Finally, we are left with 0.45, at which cluster ((1, 2) & 3) is fused with 4.

m = as.dist(matrix(c(0, 0.3, 0.4, 0.7,
                     0.3, 0, 0.5, 0.8,
                     0.4, 0.5, 0.0, 0.45,
                     0.7, 0.8, 0.45, 0.0), nrow = 4))
plot(hclust(m, method = "single"))

[Figure: Cluster Dendrogram, hclust(m, "single") — observations 1 and 2 fuse at height 0.3, observation 3 joins at 0.4, and observation 4 joins at 0.45.]
(c)

• (1,2) & (3,4)

(d)

• ((1,2),3) & 4
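These cuts can be verified with cutree() on the trees built above (a quick check, not required by the exercise):

cutree(hclust(m, method = "complete"), 2)  # expected grouping: (1,2) vs (3,4)
cutree(hclust(m, method = "single"), 2)    # expected grouping: ((1,2),3) vs 4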

(e)

plot(hclust(m, method = "complete"), labels = c(2,1,4,3))

[Figure: Cluster Dendrogram, hclust(m, "complete") re-plotted with the leaves relabeled 2, 1, 4, 3 — the structure and fusion heights are unchanged.]
3. (a)

df = data.frame(x1 = c(1, 1, 0, 5, 6, 4), x2 = c(4, 3, 4, 1, 2, 0))


colnames(df)=c('X1','X2')
plot(df, pch = 19, col = "black")

[Figure: scatter plot of the six observations, X1 on the horizontal axis and X2 on the vertical axis.]

(b)

set.seed(3)

# Assigning observations randomly to two clusters


cluster = sample(2, nrow(df), replace=T)

# Table of assigned clusters


cbind(df,cluster)

## X1 X2 cluster
## 1 1 4 1
## 2 1 3 2
## 3 0 4 2
## 4 5 1 1
## 5 6 2 2
## 6 4 0 2

(c)
The cluster centroids are simply the mean of the observations assigned to each cluster.
So for cluster 1, we can calculate it as follows:
$$\bar{x}_{11} = \tfrac{1}{2}(1 + 5) = 3 \quad \text{and} \quad \bar{x}_{12} = \tfrac{1}{2}(4 + 1) = 2.5$$
And for cluster 2:
$$\bar{x}_{21} = \tfrac{1}{4}(1 + 0 + 6 + 4) = 2.75 \quad \text{and} \quad \bar{x}_{22} = \tfrac{1}{4}(3 + 4 + 2 + 0) = 2.25$$
# Computing centroids and plotting them alongside the clusters
c1 = c(mean(df[cluster==1, 1]), mean(df[cluster==1, 2])) #Centroid 1
c2 = c(mean(df[cluster==2, 1]), mean(df[cluster==2, 2])) #Centroid 2

plot(df, pch = 19, col = (cluster+1))


points(c1[1], c1[2], col="red", pch=4)
points(c2[1], c2[2], col="green", pch=4)
[Figure: the six observations coloured by their randomly assigned cluster, with the two centroids marked by crosses.]

(d)

# Function to calculate the euclidean distance between two points


euclidean = function(a, b) {
(sqrt((a[1] - b[1])^2 + (a[2]-b[2])^2))
}

# Function that loops over all observations and assigns each one to the closest
# centroid, by calling the euclidean() function above.
new_labels = function(df, c1, c2) {
  labels = rep(NA, nrow(df))
  for (i in 1:nrow(df)) {
    if (euclidean(df[i, ], c1) < euclidean(df[i, ], c2)) {
      labels[i] = 1
    } else {
      labels[i] = 2
    }
  }
  return(labels)
}

#Assigning observations to their closest centroids


new_cluster = new_labels(df, c1, c2)

#Table of original and new clusters


cbind(df,cluster,new_cluster)

## X1 X2 cluster new_cluster
## 1 1 4 1 2
## 2 1 3 2 2
## 3 0 4 2 2
## 4 5 1 1 1
## 5 6 2 2 1
## 6 4 0 2 2

(e) (f)

#Running the new labels function until the assigned clusters stop changing.
final_cluster = rep(-1, 6)
while (!all(final_cluster == cluster)) {
final_cluster = cluster
c1 = c(mean(df[cluster==1, 1]), mean(df[cluster==1, 2]))
c2 = c(mean(df[cluster==2, 1]), mean(df[cluster==2, 2]))
cluster = new_labels(df, c1, c2)
}

#Plot of the final cluster assignments


plot(df, pch = 19, col = (cluster+1))
points(c1[1], c1[2], col="red", pch=4)
points(c2[1], c2[2], col="green", pch=4)

[Figure: final cluster assignments after convergence, with the two centroids marked by crosses.]

4. (a)
Not enough information to tell.
Since complete linkage uses the highest inter-cluster dissimilarity and single linkage uses the lowest, in most cases the fusion with complete linkage will occur higher on the tree.
For example, if d(1,4)=2, d(1,5)=3, d(2,4)=1, d(2,5)=3, d(3,4)=4 and d(3,5)=1, the single linkage dissimilarity between {1,2,3} and {4,5} would be 1 and the complete linkage dissimilarity would be 4. So, with single linkage they would fuse at a height of 1, and with complete linkage at a height of 4.
In some special cases (when the inter-cluster distances are all the same) the fusion will occur at the same
height for both linkage methods.
For example, if all inter-cluster distances are equal to 2, we would have that the single and complete linkage
dissimilarities between {1,2,3} and {4,5} are equal to 2.
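A trivial numeric check of the first example above (the dissimilarity values are the hypothetical ones stated there):

d = c(2, 3, 1, 3, 4, 1)  # hypothetical d(1,4), d(1,5), d(2,4), d(2,5), d(3,4), d(3,5)
min(d)                   # single-linkage fusion height for {1,2,3} and {4,5}: 1
max(d)                   # complete-linkage fusion height for {1,2,3} and {4,5}: 4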
(b)
They would fuse at the same height because linkage type does not affect leaf-to-leaf fusion. For example, if
d(5,6)=2, the single and complete linkage dissimilarities between {5} and {6} would be equal to 2. So, they
would fuse at a height of 2 for single and complete linkage.
5.

• For the figure on the left, the variables are unscaled and the number of socks has the biggest impact on the resulting clusters. The observations would likely be split between a cluster of customers purchasing the most socks and computers and a cluster purchasing the fewest.
• For the central figure, the variables have been scaled, so the number of computers purchased has a much bigger impact than before. The clusters would likely split on the number of computers purchased.
• For the figure on the right, the price of the computers has an overwhelmingly large impact. The clusters will be split between 'purchased a computer' and 'did not purchase a computer'.

6.
(a)

• PCA finds the direction in the data space with the largest variance out of the overall variance. This is the first principal component, which here explains 10% of the total variance in the data set. PCA then finds the direction with the next largest variance (the second principal component), orthogonal to the first, out of the remaining 90% of the variance, and so on.
• Another way of looking at it is that 90% of the variance in the data is not contained in the first principal component, and other components are required to explain more of the variance. In a situation with many variables, PCA helps identify which linear combinations of these variables matter the most. Ideally, we would want a few principal components that explain most of the variability in the data set.

(b)

• The use of machine A at earlier times and machine B at later times could be responsible for the time-wise linear trend found in the pre-analysis of the tissue samples' first principal component. I would therefore advise the researcher to include the machine used (A vs B) as a feature of the data set. This should enhance the PVE (proportion of variance explained) of the first principal component before applying the two-sample t-test.

(c)
A toy data set to show that adding a feature for the machine used (A vs B) improves the PVE of the first principal component. This is a very basic simulation, so the results should not be over-interpreted.

set.seed(101)

Control = matrix(rnorm(50 * 1000), ncol = 50)


Treatment = matrix(rnorm(50 * 1000), ncol = 50)

X = cbind(Control, Treatment)
X[1, ] = seq(-18, 18 - .36, .36) # linear trend in one dimension
pr.out = prcomp(scale(X))
summary(pr.out)$importance[, 1]

## Standard deviation Proportion of Variance Cumulative Proportion


## 3.174779 0.100790 0.100790

• From the summary, we can observe that 10.08% of the variance is explained by the first principal component.

Adding in machine used (A vs B) using 10 vs 0 encoding.

set.seed(101)

X = rbind(X, c(rep(10, 50), rep(0, 50)))


pr.out = prcomp(scale(X))
summary(pr.out)$importance[, 1]

## Standard deviation Proportion of Variance Cumulative Proportion


## 3.425154 0.117320 0.117320

• We can see that the PVE of the first principal component increased to about 11.7% after adding a feature for the machine used.

Applied
7.
As per the hint, the 'cor' and 'dist' functions are used to calculate the correlation-based distance and the squared Euclidean distance between the USArrests observations.

scaledArrests = scale(USArrests)

corr_dis = as.dist(1-cor(t(scaledArrests))) #correlation based distance


euclidean_dis = dist(scaledArrests)^2 #squared euclidean distance

summary(corr_dis/euclidean_dis)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.000086 0.069135 0.133943 0.234193 0.262589 4.887686

If the two quantities are approximately proportional, then corr_dis ≈ K · euclidean_dis for a constant K; the summary of the ratio above suggests using the median, K ≈ 0.1339.

summary(corr_dis-0.1339*euclidean_dis)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## -4.569956 -0.459715 0.000393 -0.059043 0.557616 1.808901

With K = 0.1339 the two quantities are approximately proportional, with a mean difference of only about -0.06.
8. (a)

pr.out = prcomp(scaledArrests)
pr.var = pr.out$sdev^2

pve = pr.var/sum(pr.var)
pve

## [1] 0.62006039 0.24744129 0.08914080 0.04335752

• PVE calculated using the prcomp() function. As expected, the first principal component explains 62%
of the variance in this dataset.

(b)
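For reference, the PVE formula being implemented below (Equation 10.8 in the text, for centred data with loadings $\phi_{jm}$) is:

$$\text{PVE}_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p} \phi_{jm} x_{ij}\right)^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2}$$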

loadings = pr.out$rotation # principal component loadings

num = apply((as.matrix(scaledArrests) %*% loadings)^2, 2, sum) # sum of squared scores (numerator of the PVE formula)

denom = sum(apply(as.matrix(scaledArrests)^2, 2, sum)) # total variability of the data (denominator of the PVE formula)

pve2 = num/denom
pve2

## PC1 PC2 PC3 PC4


## 0.62006039 0.24744129 0.08914080 0.04335752

• As expected, the resulting PVE’s are exactly the same as when using prcomp().

9. (a)

set.seed(1)

hc.complete = hclust(dist(USArrests), method = 'complete')


plot(hc.complete, cex = 0.7)

[Figure: Cluster Dendrogram, hclust(dist(USArrests), "complete") — complete-linkage dendrogram of the 50 states on the unscaled data, with heights ranging up to roughly 300.]
(b)

hc.cut = cutree(hc.complete, k=3) #Cutting tree with K=3, so we get 3 clusters.

tibble(
states = rownames(USArrests),
cluster = hc.cut
)

## # A tibble: 50 x 2
## states cluster
## <chr> <int>
## 1 Alabama 1
## 2 Alaska 1
## 3 Arizona 1
## 4 Arkansas 2
## 5 California 1
## 6 Colorado 2
## 7 Connecticut 3
## 8 Delaware 1
## 9 Florida 1
## 10 Georgia 2
## # ... with 40 more rows

(c)
# As I have already scaled the USArrests data set, I'm simply going to use the existing variable 'scaledArrests'.

hc.complete.scaled = hclust(dist(scaledArrests), method = 'complete')


plot(hc.complete.scaled, cex = 0.7)

[Figure: Cluster Dendrogram, hclust(dist(scaledArrests), "complete") — complete-linkage dendrogram of the 50 states on the scaled data, with heights ranging from roughly 0 to 6.]
(d)

• The scaled dendrogram has a greatly reduced height, and the clusters obtained are somewhat different (see the cross-tabulation below). However, the bushiness of the tree does not appear to be affected.
• As a general rule, variables measured in different units should be scaled before computing the inter-observation dissimilarities, which applies to this data set.
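A minimal sketch of how the assignments shift between the unscaled and scaled trees, cutting both at three clusters (this uses the hc.complete and hc.complete.scaled objects fitted above):

# Cross-tabulate 3-cluster assignments from the unscaled and scaled complete-linkage trees.
table(unscaled = cutree(hc.complete, 3),
      scaled = cutree(hc.complete.scaled, 3))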

10.
(a)

set.seed(1)
# Simulated data with mean shift to create three distinct classes.
# Increasing the difference in means for each class would create greater separation between the classes.
simulated.data = matrix(c(rnorm(20 * 50, mean = 1),
rnorm(20 * 50, mean = 2),
rnorm(20 * 50, mean = 3)), ncol = 50, byrow = TRUE)

class = unlist(lapply(1:3,function(x){rep(x,20)}))

(b)

pr.out2 = prcomp(simulated.data)
plot(pr.out2$x[,1:2],col=class)
[Figure: scatter plot of the first two principal component scores (PC1 vs PC2), coloured by true class; the three classes are well separated.]

(c)

kmeans.out = kmeans(simulated.data, 3, nstart = 60)
table(class, kmeans.out$cluster)

##
## class 1 2 3
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0

• All observations are correctly classified. This is expected as the observations in the three classes are
well separated.

(d)

kmeans.out = kmeans(simulated.data, 2, nstart = 60)


table(kmeans.out$cluster)

##
## 1 2
## 24 36

• The observations are now assigned to two clusters, one of which contains many more observations than the other. A plot of the newly assigned clusters is shown below.

plot(pr.out2$x[,1:2],col=kmeans.out$cluster)
[Figure: first two principal component scores, coloured by the K = 2 cluster assignments.]
(e)

kmeans.out = kmeans(simulated.data, 4, nstart = 60)


table(kmeans.out$cluster)

##
## 1 2 3 4
## 12 20 20 8

• We already know that there are three distinct classes, so with K = 4 the observations are split into more clusters than the data supports, which is what has happened here. In other words, with four clusters it becomes harder to relate the resulting clusters to the actual class values.
• The plot below shows the newly assigned clusters.

plot(pr.out2$x[,1:2],col=kmeans.out$cluster)
[Figure: first two principal component scores, coloured by the K = 4 cluster assignments.]

(f)

kmeans.out = kmeans(pr.out2$x[,1:2], 3, nstart = 60)

table(kmeans.out$cluster)

##
## 1 2 3
## 20 20 20

table(class)

## class
## 1 2 3
## 20 20 20

• As expected, and like in part (c), the clusters obtained have a perfect mapping to the original classes.

(g)

set.seed(1)
kmeans.out = kmeans(scale(simulated.data), 3, nstart = 60)
table(kmeans.out$cluster)

##
## 1 2 3
## 20 20 20

table(class)

## class
## 1 2 3
## 20 20 20

plot(pr.out2$x[,1:2],col=kmeans.out$cluster)
[Figure: first two principal component scores, coloured by the K = 3 cluster assignments on the scaled data.]
• The results are consistent with part (b): the assigned clusters map perfectly onto the original classes. This is likely because the simulated data set I created is very well separated; data sets with overlapping observations would likely give a different outcome.

11. (a)

set.seed(1)
gene.data = read.csv("Ch10Ex11.csv", header = F)

(b)

# Using complete linkage and correlation based distance

corr.dis = as.dist(1 - cor(gene.data))


hc.complete = hclust(corr.dis, method = "complete")
plot(hc.complete)

[Figure: Cluster Dendrogram, hclust(corr.dis, "complete") — the 40 tissue samples split into two main groups (roughly V1-V20 and V21-V40), with heights ranging from about 0.6 to 1.1.]

# Using single linkage and correlation based distance

hc.single = hclust(corr.dis, method = "single")


plot(hc.single)

[Figure: Cluster Dendrogram, hclust(corr.dis, "single") — a much more unbalanced tree, with heights ranging from about 0.6 to 0.9.]

# Using average linkage and correlation based distance

hc.average = hclust(corr.dis, method = "average")


plot(hc.average)

[Figure: Cluster Dendrogram, hclust(corr.dis, "average") — heights ranging from about 0.6 to 1.0.]

• The resulting trees are quite different, so the results are affected by the type of linkage. Single linkage produces a very unbalanced tree and average linkage suggests three clusters, whereas complete linkage gives the most balanced tree, splitting the samples into two roughly 50/50 groups (the group sizes can be checked with the cut shown below). Since we already know the samples are split evenly between healthy and diseased groups, complete linkage will likely give the best results.
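A quick way to inspect the group sizes produced by each linkage (a sketch using the three trees fitted above, cutting each into two clusters):

# Sizes of the two groups produced by each linkage when the trees are cut at K = 2.
sapply(list(complete = hc.complete, single = hc.single, average = hc.average),
       function(h) table(cutree(h, 2)))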

(c)

• We could use PCA to see which genes differ the most between the healthy and diseased groups. We examine the absolute values of the total loadings for each gene, as these characterize the weight given to each gene.

pr.gene = prcomp(t(gene.data))
summary(pr.gene)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 11.9409 6.06818 5.93476 5.83115 5.75209 5.70031 5.63448
## Proportion of Variance 0.1267 0.03271 0.03129 0.03021 0.02939 0.02887 0.02821
## Cumulative Proportion 0.1267 0.15939 0.19068 0.22089 0.25029 0.27915 0.30736
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 5.57726 5.54943 5.50625 5.48852 5.46025 5.40230 5.33441
## Proportion of Variance 0.02764 0.02736 0.02694 0.02676 0.02649 0.02593 0.02528
## Cumulative Proportion 0.33499 0.36236 0.38929 0.41605 0.44254 0.46847 0.49375
## PC15 PC16 PC17 PC18 PC19 PC20 PC21

## Standard deviation 5.27756 5.21594 5.20000 5.15140 5.11600 5.05591 5.03836
## Proportion of Variance 0.02475 0.02417 0.02402 0.02358 0.02325 0.02271 0.02255
## Cumulative Proportion 0.51850 0.54267 0.56669 0.59027 0.61352 0.63623 0.65878
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 5.01868 4.95965 4.91393 4.86397 4.81796 4.80811 4.73485
## Proportion of Variance 0.02238 0.02185 0.02145 0.02102 0.02062 0.02054 0.01992
## Cumulative Proportion 0.68116 0.70301 0.72447 0.74548 0.76611 0.78665 0.80656
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 4.70098 4.65564 4.61621 4.56733 4.53032 4.49528 4.36502
## Proportion of Variance 0.01963 0.01926 0.01893 0.01853 0.01823 0.01795 0.01693
## Cumulative Proportion 0.82620 0.84545 0.86439 0.88292 0.90115 0.91910 0.93603
## PC36 PC37 PC38 PC39 PC40
## Standard deviation 4.35858 4.26700 4.20277 4.13922 5.251e-15
## Proportion of Variance 0.01688 0.01618 0.01569 0.01522 0.000e+00
## Cumulative Proportion 0.95291 0.96909 0.98478 1.00000 1.000e+00

total.load = apply(pr.gene$rotation, 1, sum)


index = order(abs(total.load), decreasing = TRUE)
index[1:10]

## [1] 865 68 911 428 624 11 524 803 980 822

• These are the 10 genes that differ most across the two groups.
