
Name: Divya Jha

Rollno: L021
Practical: Cluster Analysis

# Load the data


install.packages(c("factoextra","cluster","NbClust"))

library(factoextra)
library(cluster)
library(NbClust)

data=read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Sales_Product_Details.csv")
View(data)

str(data)

# Standardize the data

data$Product_Description=as.numeric(as.factor(data$Product_Description))
data$Product_Category=as.numeric(as.factor(data$Product_Category))
data$Product_Line=as.numeric(as.factor(data$Product_Line))
data$Raw_Material=as.numeric(as.factor(data$Raw_Material))
data$Region=as.numeric(as.factor(data$Region))
df <- scale(data)

The scale() function centers and scales each column so that it has a mean of 0 and a standard deviation of 1. The result is stored in a new variable, df.

# Show the first 6 rows


head(df, n = 6)

# Compute the dissimilarity matrix


# df = the standardized data
res.dist <- dist(df, method = "euclidean")
The dist() function in R computes the distance matrix between the rows of a data matrix. In this case, df contains the pre-processed and scaled data. This line calculates the Euclidean distance between all pairs of rows.

as.matrix(res.dist)[1:6, 1:6]

This line extracts the upper-left 6x6 block of the distance matrix and converts it into a standard matrix format using as.matrix(), in order to inspect a subset of the distances.

kmeans(df, centers=2, iter.max = 10, nstart = 25)

kmeans(df, centers=3, iter.max = 10, nstart = 25)


kmeans(df, centers=4, iter.max = 10, nstart = 25)

1. 2 clusters:
Cluster 1 has 15 observations.
Cluster 2 has 15 observations.
Within-cluster sums of squares: 93.64579 and 91.32353.
Proportion of variance explained: 20.3%.

2. 3 clusters:
Cluster 1 has 14 observations.
Cluster 2 has 14 observations.
Cluster 3 has 2 observations.
Within-cluster sums of squares: 65.49310, 70.31447, and 19.55407.
Proportion of variance explained: 33.0%.

3. 4 clusters:
Cluster 1 has 5 observations.
Cluster 2 has 14 observations.
Cluster 3 has 1 observation.
Cluster 4 has 10 observations.
Within-cluster sums of squares: 29.53349, 65.49310, 0.00000, and 38.44444.
Proportion of variance explained: 42.5%.

The "Cluster means" section provides the means for each variable within each cluster. The
"Clustering vector" indicates the cluster assignment for each observation.
You can choose the number of clusters based on various criteria such as the proportion of variance
explained.
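The proportions of variance explained above can be computed directly from each fitted object: kmeans() returns betweenss (the between-cluster sum of squares) and totss (the total sum of squares), and their ratio is the proportion of variance explained. A sketch, reusing the scaled data df from above (the km2/km3/km4 names are illustrative, not from the original):

```r
# Refit k-means for k = 2, 3, 4 and compute the proportion of
# variance explained as between_SS / total_SS for each fit.
km2 <- kmeans(df, centers = 2, iter.max = 10, nstart = 25)
km3 <- kmeans(df, centers = 3, iter.max = 10, nstart = 25)
km4 <- kmeans(df, centers = 4, iter.max = 10, nstart = 25)
sapply(list(km2, km3, km4), function(km) km$betweenss / km$totss)
```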

# Visualize the optimal number of clusters based on the silhouette method


fviz_nbclust(df, FUN = hcut, method = "silhouette")
Interpretation: Use this plot to identify the number of clusters that maximizes the average silhouette width, which indicates the optimal number of clusters for the dataset. Placing the vertical line at 3 indicates an interest in partitioning the data into 3 clusters. If the silhouette scores are relatively high and consistent up to 3 clusters, it suggests that partitioning the data into 3 clusters yields well-separated and internally cohesive clusters. However, if the silhouette scores start to decrease after 3 clusters, further partitioning the data into more clusters does not improve the overall quality of the clustering.
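The vertical line mentioned above can be drawn on the plot itself: fviz_nbclust() returns a ggplot object, so a reference line can be layered on. A sketch (the choice of xintercept = 3 and the dashed styling are assumptions mirroring the discussion above):

```r
library(factoextra)
library(ggplot2)

# Silhouette plot with a dashed vertical reference line at k = 3,
# the number of clusters chosen for this dataset.
fviz_nbclust(df, FUN = hcut, method = "silhouette") +
  geom_vline(xintercept = 3, linetype = "dashed")
```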

There are various methods for performing hierarchical clustering. I tried the average linkage, single linkage, Ward, complete linkage, and centroid methods. Of the five, the best method is determined by calculating the cophenetic correlation coefficient of each method. The cophenetic correlation coefficient is the correlation between the elements of the original dissimilarity matrix (the Euclidean distance matrix) and the corresponding elements generated by the dendrogram (the cophenetic matrix). Its values range from -1 to 1; the closer the value is to 1, the better the clustering preserves the original distances.
Interpretation: Based on the cophenetic correlation coefficients of the five methods, the average linkage method has the largest value, 0.7913062. The clustering will therefore use the average linkage method.
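The comparison described above can be sketched as follows, reusing the res.dist matrix computed earlier. The exact method strings passed to hclust() (e.g. "ward.D2" for Ward) and the metode_al object name (which the plotting code below relies on) are assumptions, since the fitting code is not shown in the original:

```r
# Fit hierarchical clustering with each of the five linkage methods,
# then correlate the cophenetic distances of each resulting dendrogram
# with the original dissimilarity matrix res.dist.
methods <- c("average", "single", "ward.D2", "complete", "centroid")
coph <- sapply(methods, function(m) {
  hc <- hclust(res.dist, method = m)
  cor(res.dist, cophenetic(hc))
})
coph

# Average linkage has the largest coefficient, so fit it for the
# dendrogram and profiling steps that follow.
metode_al <- hclust(res.dist, method = "average")
```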

plot(metode_al)
rect.hclust(metode_al, 3)
Interpretation: Based on the dendrogram and figure above, cluster 1 consists of 1 ID, cluster 2 consists of 1 ID, and cluster 3 consists of 28 IDs.

library(factoextra)
km <- kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km, data = df)

Cluster Profiling
Based on the members of each cluster that have been obtained, then profiling is carried out to
determine the characteristics of each cluster that has been obtained. Profiling is done by finding the
average value for each variable in each cluster.

group3 = cutree(metode_al, 3)
table(group3)
table(group3) / nrow(df)
aggregate(df, list(group3), mean)

Interpretation:
 Cluster 1 (Group.1 = 1):
 It has relatively low values for Quantity, but very high values for Unit_Price, Sales_Revenue, and Region.
 Product_Description and Product_Line have negative mean values, indicating that they are skewed towards certain categories.
 Raw_Material and Product_Category are close to zero, suggesting less influence in defining this cluster's characteristics.
 Cluster 2 (Group.1 = 2):
 It has average values for Quantity, Unit_Price, and Sales_Revenue.
 Product_Description and Product_Line have positive mean values, indicating a preference for certain categories.
 Raw_Material has a slightly negative mean value, suggesting a different distribution compared to Cluster 1.
 Region is close to zero, indicating less influence in defining this cluster's characteristics.
 Cluster 3 (Group.1 = 3):
 It has high values for Quantity and Unit_Price, and also the highest Sales_Revenue.
 Product_Description and Product_Line have very negative mean values, indicating a different preference compared to the other clusters.
 Raw_Material and Product_Category are significantly negative, suggesting a distinct distribution compared to the other clusters.
 Region has a negative mean value, but less influence than in Cluster 1.
