
HIERARCHICAL CLUSTERING PROJECT 1

DATA SCIENCE PERSONIFWY


Definition
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyse large
amounts of data. This analysis helps data scientists to ask and answer questions like what
happened, why it happened, what will happen, and what can be done with the results.

History of data science


While the term data science is not new, the meanings and connotations have changed
over time. The word first appeared in the ’60s as an alternative name for statistics. In the late
’90s, computer science professionals formalized the term. A proposed definition for data
science saw it as a separate field with three aspects: data design, collection, and analysis. It
still took another decade for the term to be used outside of academia.

What is data science used for?


Data science is used to study data in four main ways:
1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is happening
in the data environment. It is characterized by data visualizations such as pie charts, bar
charts, line graphs, tables, or generated narratives. For example, a flight booking service
may record data like the number of tickets booked each day. Descriptive analysis will reveal
booking spikes, booking slumps, and high-performing months for this service.
2. Diagnostic analysis
Diagnostic analysis is a deep-dive or detailed data examination to understand why
something happened. It is characterized by techniques such as drill-down, data discovery,
data mining, and correlations. Multiple data operations and transformations may be
performed on a given data set to discover unique patterns in each of these techniques. For
example, the flight service might drill down on a particularly high-performing month to
better understand the booking spike. This may lead to the discovery that many customers
visit a particular city to attend a monthly sporting event.


3. Predictive analysis
Predictive analysis uses historical data to make accurate forecasts about data patterns that
may occur in the future. It is characterized by techniques such as machine learning,
forecasting, pattern matching, and predictive modeling. In each of these techniques,
computers are trained to reverse engineer causality connections in the data. For example, the
flight service team might use data science to predict flight booking patterns for the coming
year at the start of each year. The computer program or algorithm may look at past data and
predict booking spikes for certain destinations in May. Having anticipated their customer’s
future travel requirements, the company could start targeted advertising for those cities from
February.
4. Prescriptive analysis
Prescriptive analytics takes predictive data to the next level. It not only predicts what is
likely to happen but also suggests an optimum response to that outcome. It can analyze the
potential implications of different choices and recommend the best course of action. It uses
graph analysis, simulation, complex event processing, neural networks, and recommendation
engines from machine learning.

Back to the flight booking example, prescriptive analysis could look at historical marketing
campaigns to maximize the advantage of the upcoming booking spike. A data scientist could
project booking outcomes for different levels of marketing spend on various marketing
channels. These data forecasts would give the flight booking company greater confidence in
their marketing decisions.

The process of Data Science


A business problem typically initiates the data science process. A data scientist will
work with business stakeholders to understand what the business needs. Once the problem has
been defined, the data scientist may solve it using the OSEMN data science process:
O – Obtain data
Data can be pre-existing, newly acquired, or a data repository downloadable from the
internet. Data scientists can extract data from internal or external databases, company CRM
software, web server logs, social media, or purchase it from trusted third-party sources.


S – Scrub data
Data scrubbing, or data cleaning, is the process of standardizing the data according to a
predetermined format. It includes handling missing data, fixing data errors, and removing
any data outliers. Some examples of data scrubbing, illustrated in the short pandas sketch after this list, are:

• Changing all date values to a common standard format.

• Fixing spelling mistakes or additional spaces.

• Fixing mathematical inaccuracies or removing commas from large numbers.
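As a minimal illustration of these scrubbing steps, here is a short pandas sketch. The DataFrame, its column names, and the values in it are hypothetical and exist only to demonstrate the operations listed above.

import pandas as pd

# Hypothetical raw data with inconsistent dates, stray spaces and comma-formatted numbers
raw = pd.DataFrame({
    'order_date': ['05/01/2023', ' 06/01/2023', '07/01/2023 '],
    'city': ['Hyderabad ', ' hyderabad', 'HYDERABAD'],
    'amount': ['1,200', '1200', '1,200.50'],
})

# Change all date values to a common standard format (here ISO yyyy-mm-dd)
raw['order_date'] = pd.to_datetime(raw['order_date'].str.strip(), dayfirst=True).dt.strftime('%Y-%m-%d')

# Fix additional spaces and inconsistent capitalisation
raw['city'] = raw['city'].str.strip().str.title()

# Remove commas from large numbers and convert to a numeric type
raw['amount'] = pd.to_numeric(raw['amount'].str.replace(',', ''))

print(raw)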


E – Explore data
Data exploration is preliminary data analysis that is used for planning further data modeling
strategies. Data scientists gain an initial understanding of the data using descriptive statistics
and data visualization tools. Then they explore the data to identify interesting patterns that
can be studied or actioned.
M – Model data
Software and machine learning algorithms are used to gain deeper insights, predict
outcomes, and prescribe the best course of action. Machine learning techniques like
association, classification, and clustering are applied to the training data set. The model
might be tested against predetermined test data to assess result accuracy. The data model can
be fine-tuned many times to improve result outcomes.
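As a small sketch of this modelling step, the snippet below fits a simple classifier on synthetic data and checks it against held-out test data. The data set, the model choice and the hyperparameter are illustrative assumptions, not part of the original project.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a prepared (scrubbed and explored) data set
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Hold out predetermined test data to assess result accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training set; max_depth is one knob that can be fine-tuned
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))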

N – Interpret results

Data scientists work together with analysts and businesses to convert data insights into
action. They make diagrams, graphs, and charts to represent trends and predictions. Data
summarization helps stakeholders understand and implement results effectively.


Data Science Technologies


Data science practitioners work with complex technologies such as:

1. Artificial intelligence: Machine learning models and related software are used for predictive
and prescriptive analysis.
2. Cloud computing: Cloud technologies have given data scientists the flexibility and
processing power required for advanced data analytics.
3. Internet of things: IoT refers to various devices that can automatically connect to the
internet. These devices collect data for data science initiatives. They generate massive
amounts of data, which can be used for data mining and data extraction.
4. Quantum computing: Quantum computers can perform complex calculations at high speed.
Skilled data scientists use them for building complex quantitative algorithms.

Tools for Data Science


AWS has a range of tools to support data scientists around the globe:
Data storage
For data warehousing, Amazon Redshift can run complex queries against structured or
unstructured data. Analysts and data scientists can use AWS Glue to manage and search for
data. AWS Glue automatically creates a unified catalogue of all data in the data lake, with
metadata attached to make it discoverable.
Machine learning
Amazon SageMaker is a fully managed machine learning service that runs on the Amazon
Elastic Compute Cloud (EC2). It allows users to organize data, build, train and deploy
machine learning models, and scale operations.
Analytics
• Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon
S3 or Glacier. It is fast, serverless, and works using standard SQL queries.
• Amazon EMR (Elastic MapReduce) processes big data using open-source frameworks such as
Spark and Hadoop.

• Amazon Kinesis allows aggregation and processing of streaming data in real time. It ingests
website clickstreams, application logs, and telemetry data from IoT devices.

• Amazon OpenSearch allows search, analysis, and visualization of petabytes of data.


Challenges faced by Data Science


Multiple data sources
Different types of apps and tools generate data in various formats. Data scientists have to
clean and prepare data to make it consistent. This can be tedious and time-consuming.
Understanding the business problem
Data scientists have to work with multiple stakeholders and business managers to define the
problem to be solved. This can be challenging—especially in large companies with multiple
teams that have varying requirements.
Elimination of bias
Machine learning tools are not completely accurate, and some uncertainty or bias can exist
as a result. Biases are imbalances in the training data or prediction behavior of the model
across different groups, such as age or income bracket. For instance, if the tool is trained
primarily on data from middle-aged individuals, it may be less accurate when making
predictions involving younger and older people. The field of machine learning provides an
opportunity to address biases by detecting them and measuring them in the data and model.


HIERARCHICAL CLUSTERING
INTRODUCTION
➢ It is crucial to understand customer behaviour in any industry. I realized this last year when
my chief marketing officer asked me – “Can you tell me which existing customers we
should target for our new product?”
➢ That was quite a learning curve for me. I quickly realized as a data scientist how important it
is to segment customers so my organization can tailor and build targeted strategies. This is
where the concept of clustering came in ever so handy!
➢ Problems like segmenting customers are often deceptively tricky because we are not
working with any target variable in mind. We are officially in the land of unsupervised
learning where we need to figure out patterns and structures without a set outcome in mind.
It’s both challenging and thrilling as a data scientist.

➢ Now, there are a few different ways to perform clustering. I will introduce you to one such
type – hierarchical clustering.
➢ We will learn what hierarchical clustering is, its advantage over the other clustering
algorithms, the different types of hierarchical clustering and the steps to perform it. We will
finally take up a customer segmentation dataset and then implement hierarchical clustering
in Python.


What is Hierarchical Clustering?


Let’s say we have the below points and we want to cluster them into groups:

We can assign each of these points to a separate cluster:

Now, based on the similarity of these clusters, we can combine the most similar clusters
together and repeat this process until only a single cluster is left:

We are essentially building a hierarchy of clusters. That’s why this algorithm is called
hierarchical clustering. I will discuss how to decide the number of clusters in a later section.
For now, let’s look at the different types of hierarchical clustering.


Types of Hierarchical Clustering


There are mainly two types of hierarchical clustering:

1. Agglomerative hierarchical clustering


2. Divisive Hierarchical clustering

Agglomerative Hierarchical Clustering:


We assign each point to an individual cluster in this technique. Suppose there are 4
data points. We will assign each of these points to a cluster and hence will have 4 clusters in
the beginning:

Then, at each iteration, we merge the closest pair of clusters and repeat this step until only a
single cluster is left:

We are merging (or adding) the clusters at each step. Hence, this type of clustering is also
known as Additive hierarchical clustering.


Divisive Hierarchical Clustering:


Divisive hierarchical clustering works in the opposite way. Instead of starting with n
clusters (in case of n observations), we start with a single cluster and assign all the points to
that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to the
same cluster at the beginning:

Now, at each iteration, we split the farthest point in the cluster and repeat this process until
each cluster only contains a single point:

We are splitting (or dividing) the clusters at each step, hence the name divisive hierarchical
clustering.
Agglomerative clustering is widely used in industry, and it will be our focus here. Divisive
hierarchical clustering will be a piece of cake once we have a handle on the agglomerative
type.


Steps to Perform Hierarchical Clustering


➢ We merge the most similar points or clusters in hierarchical clustering – we know this. Now
the question is – how do we decide which points are similar and which are not? It’s one of
the most important questions in clustering!
➢ Here’s one way to calculate similarity – Take the distance between the centroids of these
clusters. The points having the least distance are referred to as similar points and we can
merge them. We can refer to this as a distance-based algorithm as well (since we are
calculating the distances between the clusters).
➢ In hierarchical clustering, we have a concept called a proximity matrix. This stores the
distance between every pair of points. Let’s take an example to understand this matrix as
well as the steps to perform hierarchical clustering.
Step 1: First, we assign all the points to an individual cluster:

Different colors here represent different clusters. You can see that we have 5 different
clusters for the 5 points in our data.
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the
points with the smallest distance. We then update the proximity matrix:

Here, the smallest distance is 3 and hence we will merge points 1 and 2:

Let’s look at the updated clusters and accordingly update the proximity matrix:


Here, we have taken the maximum of the two values (7, 10) to represent the merged cluster
when recomputing distances. Instead of the maximum, we could also take the minimum or
the average. Now, we will again calculate the proximity matrix for these clusters:

Step 3: We will repeat step 2 until only a single cluster is left.


So, we will first look at the minimum distance in the proximity matrix and then merge the
closest pair of clusters. We will get the merged clusters as shown below after repeating these
steps:

We started with 5 clusters and finally have a single cluster. This is how Agglomerative
hierarchical clustering works.
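To make these steps concrete, here is a minimal sketch that builds a proximity matrix for five made-up one-dimensional points and then asks SciPy for the merge history. The values are illustrative (not the exact points from the figures above), and complete linkage is used to mirror the “take the maximum distance” choice described earlier.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# Five illustrative points with a single feature each (hypothetical values)
points = np.array([[7.0], [10.0], [20.0], [28.0], [35.0]])

# Proximity matrix: pairwise Euclidean distances between every pair of points
proximity = squareform(pdist(points, metric='euclidean'))
print(proximity)

# Merge history: each row merges the two closest clusters and reports
# (cluster i, cluster j, merge distance, size of the new cluster)
merges = linkage(points, method='complete')
print(merges)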


Why Hierarchical Clustering?


We should first know how K-means works before we dive into hierarchical clustering. Trust
me, it will make the concept of hierarchical clustering all the easier.
Here’s a brief overview of how K-means works:

1. Decide the number of clusters (k)


2. Select k random points from the data as centroids
3. Assign all the points to the nearest cluster centroid
4. Calculate the centroid of newly formed clusters
5. Repeat steps 3 and 4
➢ It is an iterative process. It keeps running until the centroids of the newly formed clusters
stop changing or the maximum number of iterations is reached.
➢ But there are certain challenges with K-means. It tends to make clusters of similar size.
Also, we must decide the number of clusters at the start of the algorithm, and ideally we
would not know how many clusters we need at that point, so this is a challenge with
K-means.
➢ This is a gap hierarchical clustering bridges with aplomb. It takes away the problem of
having to pre-define the number of clusters. Sounds like a dream! So, let’s see what
hierarchical clustering is and how it improves on K-means.
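For reference, here is a minimal K-means sketch on synthetic data; the data, the value of k and the random seeds are purely illustrative. Note how the number of clusters has to be supplied up front, which is exactly the limitation discussed above.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data used only for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k (n_clusters) must be chosen before running the algorithm
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)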

How it works
1. Make each data point a cluster.


2. Take the two closest clusters and make them one cluster.

3. Repeat step 2 until there is only one cluster.

Dendrograms
We can use a dendrogram to visualize the history of groupings and figure out the optimal
number of clusters.
1. Determine the largest vertical distance that is not crossed by any horizontal merge line
2. Draw a horizontal line through that gap
3. The optimal number of clusters is equal to the number of vertical lines that this horizontal
line crosses


For example, in a case like the one sketched below, the best choice for the number of clusters is 4.
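The short sketch below shows, on made-up data, how the dendrogram is drawn and how cutting it at a height inside the largest gap recovers the cluster labels. The data, the Ward linkage choice and the threshold value are assumptions for illustration only; in practice the threshold is read off the plotted dendrogram.

import numpy as np
from matplotlib import pyplot as plt
import scipy.cluster.hierarchy as sch

# Illustrative data: four well-separated groups of points (hypothetical values)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in (0, 4, 8, 12)])

Z = sch.linkage(X, method='ward')
sch.dendrogram(Z)
plt.title('Dendrogram')
plt.show()

# Cutting the tree inside the largest vertical gap (here at height 10)
# assigns every point to one of the resulting clusters
labels = sch.fcluster(Z, t=10, criterion='distance')
print(np.unique(labels))   # expect four cluster labels for this data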

Linkage Criteria
Similar to gradient descent, you can tweak certain parameters to get drastically different
results.

The linkage criteria refer to how the distance between clusters is calculated.


Single Linkage
The distance between two clusters is the shortest distance between any pair of points, one from each cluster.

Complete Linkage
The distance between two clusters is the longest distance between any pair of points, one from each cluster.

Average Linkage
The distance between clusters is the average of the distances between every point in one
cluster and every point in the other cluster.


Ward Linkage
The distance between two clusters is the increase in the total within-cluster sum of squared differences that results from merging them.
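The choice of linkage changes the hierarchy you get. The sketch below, on a small made-up data set, runs SciPy's linkage function with each criterion and prints the merge heights so the differences can be compared; the data values are illustrative only.

import numpy as np
import scipy.cluster.hierarchy as sch

# Small illustrative data set (hypothetical values)
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# The same data can produce different hierarchies depending on the linkage criterion
for method in ('single', 'complete', 'average', 'ward'):
    Z = sch.linkage(X, method=method)
    print(method, '-> merge heights:', np.round(Z[:, 2], 2))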

Euclidean Distance
The shortest distance between two points. For example, if x = (a, b) and y = (c, d), the Euclidean
distance between x and y is √((a−c)² + (b−d)²)

Manhattan Distance
Imagine you were in the downtown center of a big city and you wanted to get from point A to
point B. You wouldn’t be able to cut across buildings, rather you’d have to make your way
by walking along the various streets. For example, if x=(a,b) and y=(c,d), the Manhattan
distance between x and y is |a−c|+|b−d|
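Both distances are one line of NumPy; the two points below are hypothetical and chosen so the results are easy to verify by hand.

import numpy as np

x = np.array([1.0, 2.0])   # (a, b)
y = np.array([4.0, 6.0])   # (c, d)

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt((a-c)^2 + (b-d)^2) = 5.0
manhattan = np.sum(np.abs(x - y))           # |a-c| + |b-d| = 7.0
print(euclidean, manhattan)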


Example 1 for Hierarchical Clustering


Let’s look at a concrete example of how we could go about labelling data using hierarchical
agglomerative clustering.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

In this tutorial, we use the csv file containing a list of customers with their gender, age,
annual income and spending score.

If you want to follow along, you can get the dataset from the super data science website.
Because we will display our data on a two-dimensional graph later, we take only two variables
(annual income and spending score).

dataset = pd.read_csv('./data.csv')
# Columns 3 and 4 of this dataset hold annual income and spending score
X = dataset.iloc[:, [3, 4]].values

Looking at the dendrogram, the highest vertical distance that doesn’t intersect with any
clusters is the middle green one. Given that 5 vertical lines cross the threshold, the optimal
number of clusters is 5.
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))


We create an instance of AgglomerativeClustering using the Euclidean distance as the
measure of distance between points and Ward linkage to calculate the proximity of clusters.

# Note: recent scikit-learn versions rename the 'affinity' parameter to 'metric'
model = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')


model.fit(X)
labels = model.labels_

The labels property returns an array of integers where the values correspond to the distinct
categories.

We can use a shorthand notation to display all the samples belonging to a category as a
specific color.

plt.scatter(X[labels==0, 0], X[labels==0, 1], s=50, marker='o', color='red')


plt.scatter(X[labels==1, 0], X[labels==1, 1], s=50, marker='o', color='blue')
plt.scatter(X[labels==2, 0], X[labels==2, 1], s=50, marker='o', color='green')
plt.scatter(X[labels==3, 0], X[labels==3, 1], s=50, marker='o', color='purple')


plt.scatter(X[labels==4, 0], X[labels==4, 1], s=50, marker='o', color='orange')


plt.show()

Example 2
Hierarchical Clustering for Customer Data
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly as py
import plotly.graph_objs as go

import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing


import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering


Data Exploration

In [2]:
df = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df.head()
Out [2]:

CustomerID Gender Age Annual Income (k$) Spending Score (1-100)

0 1 Male 19 15 39

1 2 Male 21 15 81

2 3 Female 20 16 6

3 4 Female 23 16 77

4 5 Female 31 17 40

In [3]:
df.isnull().sum()
Out [3]:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype : int64

In [4]:
df.describe()
Out [4]:


CustomerID Age Annual Income (k$) Spending Score (1-100)

count 200.000000 200.000000 200.000000 200.000000

mean 100.500000 38.850000 60.560000 50.200000

std 57.879185 13.969007 26.264721 25.823522

min 1.000000 18.000000 15.000000 1.000000

25% 50.750000 28.750000 41.500000 34.750000

50% 100.500000 36.000000 61.500000 50.000000

75% 150.250000 49.000000 78.000000 73.000000

max 200.000000 70.000000 137.000000 99.000000

In [5]:
plt.figure(1, figsize=(15, 6))
n = 0
for x in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    # sns.distplot is deprecated in newer seaborn; sns.histplot(df[x], bins=15, kde=True) is the replacement
    sns.distplot(df[x], bins=15)
    plt.title('Distplot of {}'.format(x))
plt.show()


Label Encoding
Label Encoding refers to converting categorical labels into numeric form so that they become
machine-readable. Machine learning algorithms can then decide, in a better way, how those
labels should be handled.

In [6]:
label_encoder = preprocessing.LabelEncoder()

df['Gender'] = label_encoder.fit_transform(df['Gender'])
df.head()
Out [6]:

CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 1 19 15 39

1 2 1 21 15 81

2 3 0 20 16 6

3 4 0 23 16 77

4 5 0 31 17 40


Heatmap
A heat map is a data visualization technique that shows magnitude of a phenomenon as color
in two dimensions. The variation in color may be by hue or intensity, giving obvious visual
cues to the reader about how the phenomenon is clustered or varies over space.

In [7]:
plt.figure(1, figsize=(16, 8))
# Heatmap of the raw feature values; a correlation heatmap would instead use sns.heatmap(df.corr(), annot=True)
sns.heatmap(df)
plt.show()

Dendrogram
A dendrogram is a diagram representing a tree. This diagrammatic representation is
frequently used in different contexts: in hierarchical clustering, it illustrates the arrangement
of the clusters produced by the corresponding analyses.

In [8]:
plt.figure(1, figsize = (16 ,8))
dendrogram = sch.dendrogram(sch.linkage(df, method = "ward"))

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')


plt.show()

Agglomerative Clustering
This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.

In [9]:
# Note: recent scikit-learn versions rename the 'affinity' parameter to 'metric'
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average')

y_hc = hc.fit_predict(df)
y_hc
Out [9]:
array ([ 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4,
3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 2,
3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1])

In [10]:
df['cluster'] = pd.DataFrame(y_hc)


In [11]:
trace1 = go.Scatter3d(
    x=df['Age'],
    y=df['Spending Score (1-100)'],
    z=df['Annual Income (k$)'],
    mode='markers',
    marker=dict(
        color=df['cluster'],
        size=10,
        line=dict(
            color=df['cluster'],
            width=12
        ),
        opacity=0.8
    )
)
data = [trace1]
layout = go.Layout(
    title='Clusters using Agglomerative Clustering',
    scene=dict(
        xaxis=dict(title='Age'),
        yaxis=dict(title='Spending Score'),
        zaxis=dict(title='Annual Income')
    )
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)

In [12]:
X = df.iloc[:, [3,4]].values
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')


plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label ='Cluster 4')


plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label ='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income(k$)')
plt.ylabel('Spending Score(1-100)')
plt.show()

Cluster Analysis

1. Green - Low Income, Low Spending


2. Yellow - Low Income, High Spending
3. Red - Medium Income, Medium Spending
4. Purple - High Income, Low Spending
5. Blue - High Income, High Spending

In [13]:
df.head()
Out [13]:

CustomerID Gender Age Annual Income (k$) Spending Score (1-100) cluster

0 1 1 19 15 39 3

1 2 1 21 15 81 4

2 3 0 20 16 6 3

3 4 0 23 16 77 4

4 5 0 31 17 40 3

In [14]:
df.to_csv("segmented_customers.csv", index = False)

Conclusion
Thus, we have analysed customer data and performed hierarchical clustering using the
agglomerative clustering algorithm. This kind of cluster analysis helps design better
customer acquisition strategies and supports business growth.

DATA SCIENCE PERSONIFWY BATCH 6


FROM:
Bhumika Reddy Goddilla
bhumikareddy.1445050@gmail.com

