Unsupervised Learning Algorithms
ETEL71A-Machine Learning and AI
Class: BE
Name: Adya Kastwar
UID : 2016120024
Sem: VII
Experiment: Unsupervised learning algorithms
Objective: Apply the EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using the k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can make use of Python ML library
classes/APIs in the program.
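The EM step of the objective is not shown in the code listings below; the following is a minimal sketch (an assumption, not the original submission) that uses scikit-learn's GaussianMixture, which is fitted by EM, on the same two petal features:
#EM clustering sketch: GaussianMixture is fitted with the EM algorithm
#(assumption: the Iris data is loaded via sklearn here instead of the original .CSV file)
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
X = load_iris().data[:, 2:4] #petal length and petal width; class column not used
gmm = GaussianMixture(n_components = 3, random_state = 42)
y_gmm = gmm.fit_predict(X) #E-step/M-step iterations run inside fit
print(gmm.means_) #estimated mean of each Gaussian component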
Outcomes:
1. Understand unsupervised and semi-supervised learning and the methods of clustering.
2. Understand the Expectation Maximization algorithm used to maximize likelihoods.
3. Apply K-means algorithm in Python to form clusters of unlabelled data.
4. Apply the hierarchical agglomerative algorithm to form clusters and plot a dendrogram.
5. Compare the unsupervised learning algorithms.
System Requirements:
Linux OS with Python and its libraries, or R, or Windows with MATLAB
Task 2: Write Python code to implement the K-means and agglomerative algorithms to form clusters for
the 'Iris' flower data. Assume the value of 'K' from the number of classes and remove the
class column so the data can be used for unsupervised learning.
#kmeans
#adya
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#load the Iris data (loaded via sklearn here, standing in for the original .CSV read);
#keep only petal length and petal width, class column removed
from sklearn.datasets import load_iris
X = load_iris().data[:, 2:4]
#fit k-means with k = 3, taken from the number of Iris species
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
#plot clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of flower species')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend()
plt.show()
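The conclusion below notes that the elbow method was used to pick an approximate k; a minimal sketch of that step (assuming X and KMeans as in the listing above) is:
#elbow method sketch: plot the within-cluster sum of squares (inertia) for k = 1..10
#and look for the 'elbow' where the curve flattens
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters = k, init = 'k-means++', random_state = 42)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker = 'o')
plt.title('Elbow method')
plt.xlabel('number of clusters k')
plt.ylabel('WCSS')
plt.show()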
#hierarchical agglomerative
#Adya Kastwar
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#load the Iris data (loaded via sklearn here, standing in for the original .CSV read);
#keep only petal length and petal width
from sklearn.datasets import load_iris
X = load_iris().data[:, 2:4]
#dendrogram of the Ward-linkage merge tree
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('flower')
plt.ylabel('Euclidean distances')
plt.show()
#fit model
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 3, linkage = 'ward') #Ward linkage implies Euclidean distances
y_hc = hc.fit_predict(X)
#plot clusters
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.title('Clusters of flowers')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend()
plt.show()
Task 3: Plot a dendrogram for agglomerative clustering, plot the clusters formed, and compare the
iterations required by both models.
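A minimal sketch of the iteration comparison (assuming the kmeans model and X from the listings above; note that agglomerative clustering does not iterate in the k-means sense, it always performs a fixed number of merges):
#k-means records how many Lloyd iterations were needed to converge
print('k-means iterations:', kmeans.n_iter_)
#agglomerative clustering always performs n - 1 merge steps for n samples
print('agglomerative merge steps:', X.shape[0] - 1)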
Task 4: State the pros and cons of both the algorithms.
Kmeans
Pros:
1) Fast, robust and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the dimension of each object, and t the number of iterations. Normally k, t, d << n.
3) Gives the best results when the clusters in the data set are distinct or well separated from each other.
Cons :
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) Exclusive (hard) assignment: if two clusters overlap heavily, k-means cannot resolve that there are
two clusters.
3) Euclidean distance measures can unequally weight underlying factors.
4) Unable to handle noisy data and outliers.
Agglomerative hierarchical
Pros:
1) Flexible: the number of clusters need not be fixed in advance; the dendrogram can be cut at any level.
2) Useful and easy to interpret for smaller data sets.
Cons:
1) Not very scalable: it needs O(n^2) memory for the distance matrix and roughly O(n^3) time in the general case.
2) Impractical for large data sets because every pairwise distance must be computed and stored.
Dataset: Iris flower data set, using only the 'petal length' and 'petal width' attributes to limit it to two dimensions.
Conclusion:
Both algorithms group the data by similarity, and in this experiment the same number of clusters
(k = 3, from the number of Iris species) was used for both.
Both give clusterings of very similar quality on this data set.
K-means requires prior specification of k, whereas the dendrogram lets the number of clusters be chosen after agglomerative clustering has run.
The elbow method was used to find an approximate value of k for k-means.
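One way to put a number on the similarity of the two clusterings (a sketch, assuming y_kmeans, y_hc and load_iris from the listings above) is the adjusted Rand index from scikit-learn:
#compare the two clusterings with each other and with the true Iris species labels
from sklearn.metrics import adjusted_rand_score
y_true = load_iris().target
print('k-means vs true labels:', adjusted_rand_score(y_true, y_kmeans))
print('agglomerative vs true labels:', adjusted_rand_score(y_true, y_hc))
print('k-means vs agglomerative:', adjusted_rand_score(y_kmeans, y_hc))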