
Name: Divya Jha

Rollno: L021
Practical: Cluster Analysis

# Load the data


install.packages(c("factoextra","cluster","NbClust"))

library(factoextra)
library(cluster)
library(NbClust)

data=read.csv("C:/ProgramData/Microsoft/Windows/Start Menu/Programs/RStudio/Sales_Product_Details.csv")
View(data)

str(data)

# Standardize the data

data$Product_Description=as.numeric(as.factor(data$Product_Description))
data$Product_Category=as.numeric(as.factor(data$Product_Category))
data$Product_Line=as.numeric(as.factor(data$Product_Line))
data$Raw_Material=as.numeric(as.factor(data$Raw_Material))
data$Region=as.numeric(as.factor(data$Region))
df <- scale(data)

The scale() function centers and scales each column so that it has a mean of 0 and a standard deviation of 1. The result is stored in a new variable, df.

# Show the first 6 rows


head(df, n = 6)

# Compute the dissimilarity matrix


# df = the standardized data
res.dist <- dist(df, method = "euclidean")
The dist() function in R computes the distance matrix between the rows of a data matrix. In this case, df contains the pre-processed and scaled data. This line calculates the Euclidean distance between all pairs of rows.

as.matrix(res.dist)[1:6, 1:6]

This line extracts the upper-left 6x6 block of the distance matrix and converts it into a standard matrix format using as.matrix(), in order to inspect a subset of the distances.

kmeans(df, centers=2, iter.max = 10, nstart = 25)

kmeans(df, centers=3, iter.max = 10, nstart = 25)


kmeans(df, centers=4, iter.max = 10, nstart = 25)

1. 2 clusters:
Cluster 1 has 15 observations.
Cluster 2 has 15 observations.
Within-cluster sums of squares: 93.64579 and 91.32353.
Proportion of variance explained: 20.3%.

2. 3 clusters:
Cluster 1 has 14 observations.
Cluster 2 has 14 observations.
Cluster 3 has 2 observations.
Within-cluster sums of squares: 65.49310, 70.31447, and 19.55407.
Proportion of variance explained: 33.0%.

3. 4 clusters:
Cluster 1 has 5 observations.
Cluster 2 has 14 observations.
Cluster 3 has 1 observation.
Cluster 4 has 10 observations.
Within-cluster sums of squares: 29.53349, 65.49310, 0.00000, and 38.44444.
Proportion of variance explained: 42.5%.

The "Cluster means" section provides the means for each variable within each cluster. The
"Clustering vector" indicates the cluster assignment for each observation.
You can choose the number of clusters based on various criteria such as the proportion of variance
explained.
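The proportions of variance explained above can be computed directly from each fitted object: kmeans() returns betweenss (the between-cluster sum of squares) and totss (the total sum of squares), and their ratio is the proportion of variance explained. A sketch, reusing the scaled data df from above (the km2/km3/km4 names are illustrative, not from the original):

```r
# Refit k-means for k = 2, 3, 4 and compute the proportion of
# variance explained as between_SS / total_SS for each fit.
km2 <- kmeans(df, centers = 2, iter.max = 10, nstart = 25)
km3 <- kmeans(df, centers = 3, iter.max = 10, nstart = 25)
km4 <- kmeans(df, centers = 4, iter.max = 10, nstart = 25)
sapply(list(km2, km3, km4), function(km) km$betweenss / km$totss)
```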

# Visualize the optimal number of clusters based on the silhouette method


fviz_nbclust(df, FUN = hcut, method = "silhouette")
Interpretation: Use this plot to identify the number of clusters that maximizes the average silhouette width, which indicates the optimal number of clusters for the dataset. Placing the vertical line at 3 indicates an interest in partitioning the data into 3 clusters. If the silhouette scores are relatively high and consistent up to 3 clusters, it suggests that partitioning the data into 3 clusters yields well-separated and internally cohesive clusters. However, if the silhouette scores start to decrease after 3 clusters, further partitioning the data into more clusters does not improve the overall quality of the clustering.
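The vertical line mentioned above can be drawn on the plot itself: fviz_nbclust() returns a ggplot object, so a reference line can be layered on. A sketch (the choice of xintercept = 3 and the dashed styling are assumptions mirroring the discussion above):

```r
library(factoextra)
library(ggplot2)

# Silhouette plot with a dashed vertical reference line at k = 3,
# the number of clusters chosen for this dataset.
fviz_nbclust(df, FUN = hcut, method = "silhouette") +
  geom_vline(xintercept = 3, linetype = "dashed")
```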

There are various methods for performing hierarchical clustering. I tried the average linkage, single linkage, Ward, complete linkage, and centroid methods. Of the five, the best method is determined by calculating the cophenetic correlation coefficient of each method. The cophenetic correlation coefficient is the correlation between the elements of the original dissimilarity matrix (the Euclidean distance matrix) and the corresponding elements generated by the dendrogram (the cophenetic matrix). Its values range from -1 to 1; the closer the value is to 1, the better the clustering preserves the original distances.
Interpretation: Based on the cophenetic correlation coefficients of the five methods, the average linkage method has the largest value, 0.7913062. The clustering will therefore use the average linkage method.
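The comparison described above can be sketched as follows, reusing the res.dist matrix computed earlier. The exact method strings passed to hclust() (e.g. "ward.D2" for Ward) and the metode_al object name (which the plotting code below relies on) are assumptions, since the fitting code is not shown in the original:

```r
# Fit hierarchical clustering with each of the five linkage methods,
# then correlate the cophenetic distances of each resulting dendrogram
# with the original dissimilarity matrix res.dist.
methods <- c("average", "single", "ward.D2", "complete", "centroid")
coph <- sapply(methods, function(m) {
  hc <- hclust(res.dist, method = m)
  cor(res.dist, cophenetic(hc))
})
coph

# Average linkage has the largest coefficient, so fit it for the
# dendrogram and profiling steps that follow.
metode_al <- hclust(res.dist, method = "average")
```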

plot(metode_al)
rect.hclust(metode_al, 3)
Interpretation: Based on the dendrogram and figure above, cluster 1 consists of 1 ID, cluster 2 consists of 1 ID, and cluster 3 consists of 28 IDs.

library(factoextra)
km <- kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km, data = df)

Cluster Profiling
Based on the members of each cluster that have been obtained, then profiling is carried out to
determine the characteristics of each cluster that has been obtained. Profiling is done by finding the
average value for each variable in each cluster.

group3 = cutree(metode_al, 3)
table(group3)
table(group3) / nrow(df)
aggregate(df, list(group3), mean)

Interpretation:
 Cluster 1 (Group.1 = 1):
 It has relatively low values for Quantity, but very high values for Unit_Price, Sales_Revenue, and Region.
 Product_Description and Product_Line have negative mean values, indicating that they are skewed towards certain categories.
 Raw_Material and Product_Category are close to zero, suggesting less influence in defining this cluster's characteristics.
 Cluster 2 (Group.1 = 2):
 It has average values for Quantity, Unit_Price, and Sales_Revenue.
 Product_Description and Product_Line have positive mean values, indicating a preference for certain categories.
 Raw_Material has a slightly negative mean value, suggesting a different distribution compared to Cluster 1.
 Region is close to zero, indicating less influence in defining this cluster's characteristics.
 Cluster 3 (Group.1 = 3):
 It has high values for Quantity and Unit_Price, and also the highest Sales_Revenue.
 Product_Description and Product_Line have very negative mean values, indicating a different preference compared to the other clusters.
 Raw_Material and Product_Category are significantly negative, suggesting a distinct distribution compared to the other clusters.
 Region has a negative mean value, but less influence than in Cluster 1.
