Test Bank + Answer Key

Test Bank for Introduction to Data Mining, 2nd Edition, by Pang-Ning Tan


1

Introduction
1. [Fall 2008]
For each data set given below, give specific examples of classification,
clustering, association rule mining, and anomaly detection tasks that
can be performed on the data. For each task, state how the data matrix
should be constructed (i.e., specify the rows and columns of the matrix).

(a) Ambulatory Medical Care data,¹ which contains the demographic and medical visit information for each patient (e.g., gender, age, duration of visit, physician's diagnosis, symptoms, medication, etc.).
Answer:
Classification
Task: Diagnose whether a patient has a disease.
Row: Patient
Column: Patient’s demographic and hospital visit information (e.g., symptoms), along with
a class attribute that indicates whether the patient has the disease.
Clustering
Task: Find groups of patients with similar medical conditions
Row: A patient visit
Column: List of medical conditions of each patient
Association rule mining
Task: Identify the symptoms and medical conditions that co-occur together frequently
Row: A patient visit
Column: List of symptoms and diagnosed medical conditions of the patient
Anomaly detection
Task: Identify healthy looking patients with rare medical disorders
Row: A patient visit
Column: List of demographic attributes, symptoms, and medical test results of the patient
¹See, for example, the National Hospital Ambulatory Medical Care Survey: http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm

(b) Stock market data, which include the prices and volumes of various
stocks on different trading days.
Answer:
Classification
Task: Predict whether the stock price will go up or down on the next trading day
Row: A trading day
Column: Trading volume and closing price of the stock over the previous 5 days, along with a class attribute that indicates whether the stock went up or down
Clustering
Task: Identify groups of stocks with similar price fluctuations
Row: A company’s stock
Column: Changes in the daily closing price of the stock over the past ten years
Association rule mining
Task: Identify stocks with similar fluctuation patterns (e.g., {Google-Up, Yahoo-Up})
Row: A trading day
Column: List of all stock-up and stock-down events on the given day.
Anomaly detection
Task: Identify unusual trading days for a given stock (e.g., unusually high volume)
Row: A trading day
Column: Trading volume, change in daily stock price (daily high − low prices), and average
price change of its competitor stocks
(c) Database of Major League Baseball (MLB).

Classification
Task: Predict the winner of a game between two MLB teams.
Row: A game.
Column: Statistics of the home and visiting teams over the past 10 games they played (e.g., average winning percentage and hitting percentage of their players)
Clustering
Task: Identify groups of players with similar statistics
Row: A player
Column: Statistics of the player
Association rule mining
Task: Identify interesting player statistics (e.g., 40% of right-handed players have a batting
percentage below 20% when facing left-handed pitchers)
Row: A player
Column: Discretized statistics of the player
Anomaly detection
Task: Identify players who performed considerably better than expected in a given season
Row: A (player, season) pair, e.g., (player1, 2007)
Column: Ratio statistics of a player (e.g., ratio of average batting percentage in 2007 to
career average batting percentage)

2

Data
2.1 Types of Attributes
1. Classify the following attributes as binary, discrete, or continuous. Also
classify them as qualitative (nominal or ordinal) or quantitative (interval
or ratio). Some cases may have more than one interpretation, so briefly
indicate your reasoning if you think there may be some ambiguity.

(a) Number of courses registered by a student in a given semester.
Answer: Discrete, quantitative, ratio.
(b) Speed of a car (in miles per hour).
Answer: Continuous, quantitative, ratio.
(c) Decibel as a measure of sound intensity.
Answer: Continuous, quantitative, interval or ratio. It is actually a log-ratio type (which is somewhere between interval and ratio).
(d) Hurricane intensity according to the Saffir-Simpson Hurricane Scale.
Answer: Discrete, qualitative, ordinal.
(e) Social security number.
Answer: Discrete, qualitative, nominal.

2. Classify the following attributes as:

• discrete or continuous.
• qualitative or quantitative
• nominal, ordinal, interval, or ratio

Some cases may have more than one interpretation, so briefly indicate
your reasoning if you think there may be some ambiguity.

(a) Julian Date, which is the number of days elapsed since 12 noon
Greenwich Mean Time of January 1, 4713 BC.
Answer: Continuous, quantitative, interval
(b) Movie ratings provided by users (1-star, 2-star, 3-star, or 4-star).
Answer: Discrete, qualitative, ordinal
(c) Mood level of a blogger (cheerful, calm, relaxed, bored, sad, angry
or frustrated).
Answer: Discrete, qualitative, nominal
(d) Average number of hours a user spent on the Internet in a week.
Answer: Continuous, quantitative, ratio
(e) IP address of a machine.
Answer: Discrete, qualitative, nominal
(f) Richter scale (in terms of energy release during an earthquake).
Answer: Continuous, qualitative, ordinal
In terms of energy release, the difference between 0.0 and 1.0 is not the same as between 1.0 and 2.0. Ordinal attributes are qualitative, yet they can be continuous.
(g) Salary above the median salary of all employees in an organization.
Answer: Continuous, quantitative, interval
(h) Undergraduate level (freshman, sophomore, junior, and senior) for
measuring years in college.
Answer: Discrete, qualitative, ordinal

3. For each attribute given, classify its type as:

• discrete or continuous AND
• qualitative or quantitative AND
• nominal, ordinal, interval, or ratio

Indicate your reasoning if you think there may be some ambiguity in some cases.
Example: Age in years.
Answer: Discrete, quantitative, ratio.


(a) Daily user traffic volume at YouTube.com (i.e., number of daily visitors who visited the Web site).
Answer: Discrete, quantitative, ratio.
(b) Air pressure of a car/bicycle tire (in psi).
Answer: Continuous, quantitative, ratio.
(c) Homeland Security Advisory System ratings - code red/orange/etc.
Answer: Discrete, qualitative, ordinal.
(d) Amount of seismic energy release, measured in Richter scale.
Answer: Continuous, qualitative, ordinal.
(e) Credit card number.
Answer: Discrete, qualitative, nominal.
(f) The wealth of a nation measured in terms of gross domestic product
(GDP) per capita above the world’s average of $10,500.
Answer: Continuous, quantitative, interval.

4. For each attribute given, classify its type as:

• discrete or continuous AND
• qualitative or quantitative AND
• nominal, ordinal, interval, or ratio

Indicate your reasoning if you think there may be some ambiguity in some cases.
Example: Age in years.
Answer: Discrete, quantitative, ratio.

(a) Favorite movie of each person.
Answer: Discrete, qualitative, nominal.
(b) Number of days since Jan 1, 2011.
Answer: Discrete, quantitative, interval.
(c) Category of a hurricane (The Saffir-Simpson Hurricane Wind Scale
ranges from category 1 to category 5).
Answer: Discrete, qualitative, ordinal.
(d) Number of students enrolled in a class.
Answer: Discrete, quantitative, ratio


5. For each attribute given, classify its type as:

• discrete or continuous AND
• qualitative or quantitative AND
• nominal, ordinal, interval, or ratio

Indicate your reasoning if you think there may be some ambiguity in some cases.
Example: Temperature in Kelvin.
Answer: Continuous, quantitative, ratio.

(a) Number of years since 1 BC. For example, 2 BC is year -1, 1 BC is year 0, 1 AD is year 1, and 2013 AD is year 2013 (note: there is no year 0 AD in the Gregorian calendar).
Answer: Discrete/Continuous, quantitative, interval.
(b) GPA of a student.
Answer: Continuous, qualitative, ordinal.
(c) Mood level of a blogger (cheerful, calm, relaxed, bored, sad, angry
or frustrated).
Answer: Discrete, qualitative, nominal.
(d) Sound intensity in decibel scale.
Answer: Continuous, qualitative, ordinal. In terms of sound in-
tensity, the difference between 0dB and 1dB is not the same as the
difference between 10 dB and 11 dB (decibels are in log scale); thus,
it is not an interval attribute.

6. State the type of each attribute given below before and after we have
performed the following transformation.

(a) Hair color of a person is mapped to the following values: black = 0, brown = 1, red = 2, blonde = 3, grey = 4, white = 5.
Answer: Nominal (both before and after transformation).
(b) Grade of a student (from 0 to 100) is mapped to the following scale:
A = 4.0, A- = 3.5, B = 3.0, B- = 2.5, C = 2.0, C- = 1.5, D = 1.0,
D- = 0.5, E = 0.0
Answer: Ratio (before transformation) to ordinal (after transfor-
mation).


(c) Age of a person is discretized to the following scale: Age < 12, 12
≤ Age < 21, 21 ≤ Age < 45, 45 ≤ Age < 65, Age > 65.
Answer: Ratio (before transformation) to ordinal (after transfor-
mation)
(d) Annual income of a person is discretized to the following scale: Income < $20K, $20K ≤ Income < $60K, $60K ≤ Income < $120K, $120K ≤ Income < $250K, Income ≥ $250K.

Answer: Ratio (before transformation) to ordinal (after transfor-


mation).
(e) Height of a person is changed from meters to feet.
Answer: Ratio (both before and after transformation)
(f) Height of a person is changed from meters to {Short, Medium, Tall}.
Answer: Ratio (before transformation) to ordinal (after transfor-
mation).
(g) Height of a person is changed from feet to number of inches above
4 feet.
Answer: Ratio (before transformation) to interval (after transfor-
mation).
(h) Weight of a person is standardized by subtracting it with the mean
of the weight for all people and dividing by its standard deviation.
Answer: Ratio (before transformation) to interval (after transfor-
mation)

7. State whether it is meaningful (based on the properties of the attribute values) to apply the following operations to the data given below.

(a) Average amplitude of seismic waves (in Richter scale) for the 10
deadliest earthquakes in Asia.
Answer: No because Richter scale is ordinal.
(b) Average number of characters in a collection of spam messages.
Answer: Yes because number of characters is a ratio attribute.
(c) Pearson’s correlation between shirt size and height of an individual.
Answer: No because shirt size is ordinal.
(d) Median zipcode of households in the United States.
Answer: No because zipcode is nominal.


(e) Entropy of students (based on the GPA they obtained for a given
course).
Answer: Yes because entropy is applicable to nominal attributes.
(f) Geometric mean of temperature (in Fahrenheit) for a given city.
Answer: No because temperature (in Fahrenheit) is not a ratio
attribute.

2.2 Data Preprocessing

1. Consider the following dataset, which contains the age and gender information for 9 users who visited a given website.

UserID    1       2     3     4     5       6       7       8     9
Age       17      24    25    28    32      38      39      49    68
Gender    Female  Male  Male  Male  Female  Female  Female  Male  Male

(a) Suppose you apply equal interval width approach to discretize the
Age attribute into 3 bins. Show the userIDs assigned to each of the
3 bins.
Answer: Bin width = (68 − 17)/3 = 51/3 = 17, giving bins [17, 34), [34, 51), and [51, 68].
Bin 1: 1, 2, 3, 4, 5
Bin 2: 6, 7, 8
Bin 3: 9
(b) Repeat the previous question using the equal frequency approach.
Answer: Since there are 9 users and 3 bins, every bin must contain
3 users.
Bin 1: 1, 2, 3
Bin 2: 4, 5, 6
Bin 3: 7, 8, 9
(c) Repeat question (a) using a supervised discretization approach (with
Gender as class attribute). Specifically, choose the bins in such a
way that their members are as “pure” as possible (i.e., belonging
to the same class).
Answer:
Bin 1: 1, 2, 3, 4
Bin 2: 5, 6, 7
Bin 3: 8, 9
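
The two unsupervised schemes above can be checked with a short Matlab sketch (the binning logic below is our own illustration, not part of the exercise):

age = [17 24 25 28 32 38 39 49 68];
k = 3;
% Equal interval width: width = (max - min) / k = 17.
width = (max(age) - min(age)) / k;
wbin = min(ceil((age - min(age)) / width), k);
wbin(wbin == 0) = 1;                  % the minimum value belongs to bin 1
% Equal frequency: sort the values and cut into k equal-sized groups.
[~, order] = sort(age);
fbin = zeros(1, numel(age));
fbin(order) = ceil((1:numel(age)) / (numel(age) / k));
disp(wbin)   % 1 1 1 1 1 2 2 2 3  (users 1-5, 6-8, 9)
disp(fbin)   % 1 1 1 2 2 2 3 3 3  (users 1-3, 4-6, 7-9)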


2. Consider an attribute X of a data set that takes the values {x1, x2, ..., x9} (sorted in increasing order of magnitude). We apply two methods (equal interval width and equal frequency) to discretize the attribute into 3 bins. The bins obtained are shown below:

Equal Width: {x1, x2, x3}, {x4, x5, x6, x7, x8}, {x9}
Equal Frequency: {x1, x2, x3}, {x4, x5, x6}, {x7, x8, x9}

Explain the effect of applying the following transformations on each discretization method, i.e., whether the elements assigned to each bin can change if you discretize the attribute after applying the transformation. Note that X̄ denotes the average value and σ_x the standard deviation of attribute X.

(a) X → X − X̄ (i.e., if the attribute values are centered).
Answer: No change for equal width because the distance between x_i and x_{i+1} is unchanged. No change for equal frequency because the relative ordering of data points remains the same (i.e., if x_i < x_{i+1} then x_i − X̄ < x_{i+1} − X̄).
(b) X → (X − X̄)/σ_x (i.e., if the attribute values are standardized).
Answer: Since the distances between every pair of points (x_i, x_{i+1}) change uniformly (by a constant factor of 1/σ_x), the elements in the bins are unchanged for equal width discretization. No change for equal frequency because the relative ordering of data points remains the same.
(c) X → exp((X − X̄)/σ_x) (i.e., if the values are standardized and exponentiated).
Answer: The bin elements may change for equal width because the distances between x_i and x_{i+1} do not change uniformly. No change for equal frequency because the relative ordering of data points remains the same.
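
A quick numerical illustration of part (c), using a small made-up set of attribute values: the transform preserves order (so equal-frequency bins are unchanged) but stretches gaps non-uniformly (so equal-width memberships can change).

x = [1 2 3 10 11 12 13 14 30];       % hypothetical attribute values
z = exp((x - mean(x)) / std(x));     % standardize, then exponentiate
issorted(z)                          % true: order preserved, so equal
                                     % frequency bins are unchanged
diff(z) ./ diff(x)                   % gap stretch factors are far from
                                     % constant, so equal width bins can change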

3. Consider a dataset that has 3 attributes (x1, x2, and x3). The distribution of each attribute is as follows:

• x1 has a uniform distribution in the range between 0 and 1.
• x2 is generated from a mixture of 3 Gaussian distributions centered at 0.1, 0.5, and 0.9, respectively. The standard deviations of the distributions are 0.02, 0.1, and 0.02, respectively. Assume each point is generated from one of the 3 distributions and the number of points associated with each distribution is different.
• x3 is generated from an exponential distribution with mean 0.1.

(a) Which attribute(s) is likely to produce the same bins regardless of whether you use the equal width or equal frequency approach (assuming the number of bins is not too large)?
Answer: x1.
(b) Which attribute(s) is more suitable for equal frequency than equal
width discretization approaches.
Answer: x3 .
(c) Which attribute(s) is not appropriate for both equal width and
equal frequency discretization approaches.
Answer: x2 .
(d) If all 3 are initially ratio attributes, what are their attribute types
after discretization?
Answer: Ordinal.
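
For intuition, samples matching these three descriptions can be generated as follows; the sample size and mixture weights are our own choices, since the exercise leaves them unspecified:

n = 1000;
x1 = rand(n, 1);                     % uniform on [0, 1]
mu = [0.1; 0.5; 0.9];  sd = [0.02; 0.1; 0.02];
u = rand(n, 1);
idx = 1 + (u > 0.5) + (u > 0.8);     % mixture weights 0.5, 0.3, 0.2 (assumed)
x2 = mu(idx) + sd(idx) .* randn(n, 1);
x3 = -0.1 * log(rand(n, 1));         % exponential with mean 0.1 (inverse CDF)
hist(x2, 50);                        % three well-separated peaks: neither
                                     % equal width nor equal frequency fits x2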

4. An e-commerce company is interested in identifying its highest-spending customers at its online store using association rule mining. One of the rules identified is:

21 ≤ Age < 45 AND NumberOfVisits > 50 → AmountSpent > $500,

where the Age attribute was discretized into 5 bins, NumberOfVisits was discretized into 8 bins, and AmountSpent was discretized into 8 bins. The confidence of an association rule A, B → C is defined as

Confidence(A, B → C) = P(C|A, B) = P(A, B, C) / P(A, B)     (2.1)

where P(C|A, B) is the conditional probability of C given A and B, P(A, B, C) is the joint probability of A, B, and C, and P(A, B) is the joint probability of A and B. The probabilities are empirically estimated from their relative frequencies in the data. For example, P(AmountSpent > $500) is given by the proportion of online users who visited the store and spent more than $500.


(a) Suppose we increase the number of bins for the Age attribute from
5 to 6 so that the discretized Age in the rule becomes 21 ≤ Age
< 30 instead of 21 ≤ Age < 45, will the confidence of the rule be
non-increasing, non-decreasing, stays the same, or could go either
way (increase/decrease)?
Answer: Can increase/decrease.
(b) Suppose we increase the number of bins for the AmountSpent at-
tribute from 8 to 10, so that the right hand side of the rule becomes
$500 < AmountSpent < $1000, will the confidence of the rule be
non-increasing, non-decreasing, stays the same, or could go either
way (increase/decrease)?
Answer: Non-increasing.
(c) Suppose the values for NumberOfVisits attribute are distributed
according to a Poisson distribution with a mean value equals to 4.
If we discretize the attribute into 4 bins using the equal frequency
approach, what are the bin values after discretization? Hint: you
need to refer to the cumulative distribution table for Poisson dis-
tribution to answer the question.
Answer: Choose the bin values such that the cumulative distribu-
tion is close to 0.25, 0.5, and 0.75. This corresponds to bin values:
0 to 2, 3, 4 to 5, and greater than 5.
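
The Poisson(4) cumulative distribution used in the hint can be computed directly (no statistics toolbox needed); a minimal sketch:

lambda = 4;
k = 0:10;
cdf = cumsum(exp(-lambda) * lambda.^k ./ factorial(k));
disp([k; cdf])
% The CDF at k = 2, 3, 5 is about 0.238, 0.433, 0.785, which yields the
% equal-frequency bins {0-2}, {3}, {4-5}, {>5} quoted above.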

5. Null values in data records may refer to missing or inapplicable values. Consider the following table of employees for a hypothetical organization:

Name   Sales commission   Occupation
John   5000               Sales
Mary   1000               Sales
Bob    null               Non-sales
Lisa   null               Non-sales

The null values in the table refer to inapplicable values, since sales commissions are calculated for sales employees only. Suppose we are interested in computing the similarity between employees based on their sales commission.

(a) Explain the limitation of this similarity computation if we replace the null values in sales commission by 0.
Answer: Mary will be more similar to Bob and Lisa than to John.


(b) Explain the limitation of this similarity computation if we replace the null values in sales commission by the average value of sales commission (i.e., 3000).
Answer: Both Mary and John are less similar to each other than to Bob and Lisa.
(c) Propose a method that can handle null values in the sales commis-
sion so that employees that have the same occupation are closer to
each other than to employees that have different occupations.
Answer: One way is to change the similarity function as follows:

Similarity(a, b) =
    ∞,        if both a and b are null;
    0,        if exactly one of a and b is null;
    s(a, b),  otherwise,

where s(a, b) is the original similarity measure used for the sales commission.
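
A sketch of this measure in Matlab, encoding null as NaN; the base similarity s(a, b) below is a placeholder of our own choosing:

% Save as commission_similarity.m (or as a local function, R2016b+).
function sim = commission_similarity(a, b)
    s = @(p, q) 1 / (1 + abs(p - q));   % placeholder base similarity
    if isnan(a) && isnan(b)
        sim = Inf;                      % both null: same occupation group
    elseif isnan(a) || isnan(b)
        sim = 0;                        % exactly one null: different groups
    else
        sim = s(a, b);                  % both applicable: usual measure
    end
end

With this definition, Bob and Lisa (both null) are maximally similar, while Mary remains closer to John than to Bob or Lisa.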

6. Consider a data set from an online social media Web site that contains
information about the age and number of friends for 5,000 users.

(a) Suppose the number of friends for each user is known, but only 4,000 out of the 5,000 users provide their age information. The average age of the 4,000 users is 30 years old. If you replace the missing values for age with the value 30, will the average age computed for the 5,000 users increase, decrease, or stay the same (as 30)?
Answer: Average age does not change.

x̄_old = (1/4000) Σ_{i=1}^{4000} x_i

x̄_new = (1/5000) Σ_{i=1}^{5000} x_i = (1/5000) [ Σ_{i=1}^{4000} x_i + Σ_{i=4001}^{5000} x_i ]

Since x_i = x̄_old for i = 4001, 4002, ..., 5000 and Σ_{i=1}^{4000} x_i = 4000 x̄_old, we have

x̄_new = (1/5000) (4000 x̄_old + 1000 x̄_old) = x̄_old


(b) Suppose the covariance between age and number of friends calculated using the 4,000 users (with no missing values) is 20. If you replace the missing values for age with the average age of the 4,000 users, will the covariance between age and number of friends increase, decrease, or stay the same (as 20)? Assume that the average number of friends for all 5,000 users is the same as the average for the 4,000 users.
Answer: Covariance will decrease. Let C1 = (1/3999) Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) be the covariance computed using the 4,000 users without missing values. If we impute the missing values for age with the average age, x̄ remains unchanged according to part (a). Furthermore, ȳ is assumed to be unchanged. Thus, the new covariance is

C2 = (1/4999) Σ_{i=1}^{5000} (x_i − x̄)(y_i − ȳ)
   = (1/4999) [ Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) + Σ_{i=4001}^{5000} (x_i − x̄)(y_i − ȳ) ]
   = (1/4999) [ Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) + Σ_{i=4001}^{5000} (x̄ − x̄)(y_i − ȳ) ]
   = (1/4999) Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) < C1     (2.2)

7. Consider the following data matrix, in which two of its values are missing (left: true values; right: observed matrix with missing entries marked ?):

  True matrix              Observed matrix
  −0.2326   0.2270         −0.2326   0.2270
  −0.0847   0.7125         −0.0847   0.7125
   0.1275   0.3902          0.1275   0.3902
   0.1329  −0.1461             ?    −0.1461
   0.3724   0.1756          0.3724   0.1756
   0.4975   0.8536          0.4975   0.8536
   0.6926   0.7834          0.6926   0.7834
   0.7933   0.7375          0.7933   0.7375
   0.8229   0.2147          0.8229   0.2147
   0.8497   0.4980          0.8497   0.4980
   1.0592   0.7600          1.0592      ?
   1.5028   1.0122          1.5028   1.0122


(a) Impute the missing values in the matrix on the right by their respective column averages. Show the imputed values and calculate the root-mean-square error (RMSE):

RMSE = √{ [ (A_{4,1} − Ã_{4,1})² + (A_{11,2} − Ã_{11,2})² ] / 2 }

where A_{i,j} denotes the true value of the (i, j)-th element of the data matrix and Ã_{i,j} denotes its corresponding imputed value.
Answer: The column averages are [0.5819, 0.4962]. The imputed matrix is

  −0.2326   0.2270
  −0.0847   0.7125
   0.1275   0.3902
   0.5819  −0.1461
   0.3724   0.1756
   0.4975   0.8536
   0.6926   0.7834
   0.7933   0.7375
   0.8229   0.2147
   0.8497   0.4980
   1.0592   0.4962
   1.5028   1.0122

and the RMSE value is

RMSE = √{ [ (0.1329 − 0.5819)² + (0.7600 − 0.4962)² ] / 2 } = 0.3683
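
The computation in part (a) takes only a few lines of Matlab, with the missing entries encoded as NaN:

A = [-0.2326  0.2270; -0.0847  0.7125;  0.1275  0.3902;
        NaN  -0.1461;  0.3724  0.1756;  0.4975  0.8536;
      0.6926  0.7834;  0.7933  0.7375;  0.8229  0.2147;
      0.8497  0.4980;  1.0592     NaN;  1.5028  1.0122];
B = A;
for j = 1:size(A, 2)
    col = A(:, j);
    col(isnan(col)) = mean(col(~isnan(col)));  % impute by column average
    B(:, j) = col;
end
truth   = [0.1329; 0.7600];                    % true values of the two holes
imputed = [B(4, 1); B(11, 2)];
rmse = sqrt(mean((truth - imputed).^2))        % = 0.3683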

(b) The Expectation-Maximization (E-M) algorithm is a well-known approach for imputing missing values. Assuming the data is generated from a multivariate Gaussian distribution, E-M iteratively computes the following conditional mean for each attribute and uses it to impute the missing values:

μ_{i|j} = μ̂_i + Σ_{ij} Σ_{jj}^{−1} (x_j − μ̂_j)

where the indices i, j ∈ {1, 2} refer to the two attributes of the data and Σ^{−1} denotes the inverse of the covariance matrix. Repeat the previous question by applying the E-M algorithm iteratively for

5 times. Assume the covariance matrix of the data is known and given by

Σ = [ 0.25   0.1  ]
    [ 0.1    0.15 ]

In the first iteration, compute the mean value for each column using only the non-missing values. In subsequent iterations, compute the mean value for each column using both the non-missing and imputed values. Show the imputed values after each iteration and compute the root-mean-square error. Compare the error against the answer in part (a).
Answer:
The inverse of the covariance matrix is

Σ^{−1} = [  5.4545   −3.6364 ]
         [ −3.6364    9.0909 ]

The results after each iteration are shown below:

Iteration   μ̂1       μ̂2       Imputed x_{4,1}   Imputed x_{11,2}   RMSE
1           0.5819   0.4962   0.2315            0.9301             0.1390
2           0.5527   0.5324   0.1826            0.9928             0.1683
3           0.5486   0.5376   0.1756            1.0018             0.1736
4           0.5480   0.5384   0.1746            1.0030             0.1743
5           0.5479   0.5385   0.1745            1.0032             0.1745

The root-mean-square error for the E-M algorithm is considerably lower than that of mean imputation.

8. The purpose of this exercise is to illustrate the relationship between PCA and SVD. Let A be an N × d rectangular data matrix and C be its d × d covariance matrix.

(a) Suppose I_N is an N × N identity matrix and 1_N is an N × N matrix whose elements are all equal to 1, i.e., ∀i, j : (1_N)_{ij} = 1. Show that the covariance matrix C can be expressed in the following form:

C = (1/(N−1)) A^T (I_N − (1/N) 1_N) A


Answer: The covariance between columns i and j of matrix A is given by

C_{ij} = [ Σ_k (A_{ki} − Ā_i)(A_{kj} − Ā_j) ] / (N − 1),     (2.3)

where Ā_i and Ā_j are the corresponding column averages. A matrix of column averages for A can be computed as follows:

(1/N) 1_N A = [ (1/N) Σ_j A_{j1}   (1/N) Σ_j A_{j2}   ⋯   (1/N) Σ_j A_{jd} ]
              [ (1/N) Σ_j A_{j1}   (1/N) Σ_j A_{j2}   ⋯   (1/N) Σ_j A_{jd} ]
              [        ⋮                  ⋮            ⋱        ⋮          ]
              [ (1/N) Σ_j A_{j1}   (1/N) Σ_j A_{j2}   ⋯   (1/N) Σ_j A_{jd} ]     (2.4)

i.e., every row of (1/N) 1_N A contains the column averages of A.

Thus, each term (A_{ki} − Ā_i) in Equation (2.3) can be expressed in matrix notation as A_{ki} − (1/N) Σ_j A_{ji} = [A − (1/N) 1_N A]_{ki}. The covariance matrix C can therefore be computed as follows:

C = (1/(N−1)) (A − (1/N) 1_N A)^T (A − (1/N) 1_N A)
  = (1/(N−1)) [(I_N − (1/N) 1_N) A]^T [(I_N − (1/N) 1_N) A]
  = (1/(N−1)) A^T (I_N − (1/N) 1_N)^T (I_N − (1/N) 1_N) A     (2.5)

where we have used the property of matrix transpose (XY)^T = Y^T X^T on the last line. Furthermore, since the identity matrix and the matrix of all ones are symmetric, i.e., I_N^T = I_N and 1_N^T = 1_N, it follows that (I_N − (1/N) 1_N)^T = I_N − (1/N) 1_N. Finally, it can be shown that the matrix I_N − (1/N) 1_N is idempotent, which means it is the

same as the square of the matrix:

(I_N − (1/N) 1_N)(I_N − (1/N) 1_N) = I_N − (2/N) 1_N + (1/N²) 1_N 1_N
                                   = I_N − (2/N) 1_N + (1/N²) N 1_N
                                   = I_N − (2/N) 1_N + (1/N) 1_N
                                   = I_N − (1/N) 1_N,     (2.6)

where 1_N 1_N = N 1_N is an N × N matrix whose elements are all equal to N. Substituting (2.6) into (2.5), we obtain:

C = (1/(N−1)) A^T (I_N − (1/N) 1_N) A     (2.7)

(b) Using singular value decomposition, the matrix A can be factorized as A = UΣV^T, where U is the N × N left singular matrix, Σ is the N × d matrix containing the singular values, and V is the d × d right singular matrix. Similarly, using eigenvalue decomposition, the covariance matrix can be factorized as C = XΛX^T. Show that the relationship between SVD and PCA is given by the following equation:

VΣ²V^T − (1/N) A^T 1_N A = (N − 1) XΛX^T

Answer: From the previous question, we can write:

C = (1/(N−1)) A^T (I_N − (1/N) 1_N) A = (1/(N−1)) [ A^T A − (1/N) A^T 1_N A ]     (2.8)
Since A = UΣV^T and U is an orthogonal matrix,

A^T A = [UΣV^T]^T [UΣV^T] = VΣ^T U^T UΣV^T = VΣ^T ΣV^T.

If N > d, then Σ has N − d rows of all zeros. If we remove these rows, Σ becomes a d × d square matrix and Σ^T Σ = Σ². By substituting C = XΛX^T and A^T A = VΣ²V^T into Equation (2.8), we have:

XΛX^T = (1/(N−1)) [ VΣ²V^T − (1/N) A^T 1_N A ].


(c) Find the relationship between the right singular matrix V and the matrix of principal components X if the data matrix A has been column-centered (i.e., the column mean has been subtracted from every column of A) before applying SVD.
Answer: If the matrix A has been column-centered, then its column means are all zero, which means A^T 1_N is a matrix of all zeros. Thus, the last equation in the previous question reduces to:

XΛX^T = (1/(N−1)) VΣ²V^T.

This shows that the right singular matrix V corresponds to the matrix of principal components X, while the squared singular values equal (N − 1) times the eigenvalues.
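
The relationship in part (c) is easy to verify numerically; a minimal sketch with a small random matrix (dimensions chosen arbitrarily):

N = 8;  d = 3;
A = randn(N, d);
Ac = A - repmat(mean(A), N, 1);     % column-center the data
C = (Ac' * Ac) / (N - 1);           % covariance matrix
[~, S, V] = svd(Ac, 'econ');        % SVD of the centered matrix
lam = eig(C);                       % eigenvalues of C
sort(lam, 'descend')'               % should match the next line...
(diag(S).^2 / (N - 1))'             % ...squared singular values / (N - 1)
% The columns of V match the eigenvectors of C up to sign and ordering.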

9. Principal component analysis (PCA) can be used for image compression by transforming a high-resolution image into its lower-rank approximation. In this exercise, you are provided with three images of size 1080 × 1920 pixels each (the filenames are img1.jpg, img2.jpg, and img3.jpg).

Figure 2.1. Image data set: (a) img1, (b) img2, (c) img3.

You will use Matlab to apply PCA to each of the following images.

(a) Load each image using the imread command. For example:

matlab> A = imread(’img1.jpg’);

(b) Plot the image in gray scale.

matlab> imagesc(A);
matlab> colormap(gray);


Answer: See Figure 2.1.


(c) Apply principal component analysis to obtain a reduced rank ap-
proximation of the image.
For example, to obtain a rank-10 approximation (i.e., using the first
10 principal components), use the following commands:
matlab> A = double(A); % convert A from uint8 to double format
matlab> [U,V] = princomp(A); % apply PCA: U holds the component coefficients,
matlab> % V the scores (princomp is superseded by pca in newer Matlab releases)
matlab> rank = 10; % set rank to be 10
matlab> B = V(:,1:rank)*U(:,1:rank)'; % B is the compressed image of A
matlab> figure;
matlab> imagesc(B);
matlab> colormap(gray);
For each image, vary the rank (i.e., number of principal compo-
nents) as follows: 10, 30, 50, and 100. Save each image as follows:

matlab> saveas(gcf, ’filename.jpg’, ’jpeg’);

Insert the compressed (reduced-rank) images into the solution file of your homework (don't submit the jpg files individually).
Answer: See Figure 2.2.
(d) Compare the size of matrix A (in bytes) to the total size of matrices U and V (in bytes). Compute the compression ratio

Compression ratio = Size of matrix A / (Size of matrix U + Size of matrix V)

for each reduced rank (10, 30, 50, 100) of the images. You can use the whos command to determine the size of the matrices:
matlab> whos A U V
Answer: See Table 2.1.

rank   size of A   size of U   size of V   compression ratio
10     16588800    153600      86400       69.12
30     16588800    460800      259200      23.04
50     16588800    768000      432000      13.824
100    16588800    1536000     864000      6.912

Table 2.1. Compression ratio at various ranks
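
As a sanity check on the first row of Table 2.1 (assuming 8-byte double entries, as reported by whos): A occupies 1080 × 1920 × 8 = 16,588,800 bytes, the rank-10 slice of U occupies 1920 × 10 × 8 = 153,600 bytes, the rank-10 slice of V occupies 1080 × 10 × 8 = 86,400 bytes, and 16588800/(153600 + 86400) = 69.12.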


Figure 2.2. Reduced-rank images using PCA: ranks 10, 30, 50, and 100 (rows) for img1, img2, and img3 (columns).

(e) Compute the reconstruction error ‖A − B‖_F of each reduced-rank image, where ‖·‖_F denotes the Frobenius norm of a matrix. Note that the higher the reconstruction error, the lower the quality of the compressed image. Plot a graph of reconstruction error (y-axis) versus compression ratio (x-axis) for each image.
Answer: See Table 2.2 and Figure 2.3.
image   rank   reconstruction error
img1    10     4.9565 × 10^4
img1    30     3.7198 × 10^4
img1    50     3.0998 × 10^4
img1    100    2.2135 × 10^4
img2    10     1.7798 × 10^4
img2    30     1.2190 × 10^4
img2    50     1.0236 × 10^4
img2    100    7.4063 × 10^3
img3    10     3.9544 × 10^3
img3    30     3.1775 × 10^3
img3    50     2.8146 × 10^3
img3    100    2.2397 × 10^3

Table 2.2. Reconstruction error for various images

Figure 2.3. Reconstruction error versus compression ratio: (a) img1, (b) img2, (c) img3

(f) State the minimum number of principal components (10, 30, 50, or 100) needed to (visually) retain most of the salient features of each image (i.e., the city square in img1.jpg, the shape of the face in img2.jpg, and the shape of the apple in img3.jpg). Which image requires the fewest principal components? Which image requires the most?
Answer:
img1.jpg: 50 components
img2.jpg: 30 components
img3.jpg: 10 components


2.3 Measures of Similarity and Dissimilarity


1. Consider the following binary vectors:
x1 = (1, 1, 1, 1, 1)
x2 = (1, 1, 1, 0, 0)
y1 = (0, 0, 0, 0, 0)
y2 = (0, 0, 0, 1, 1)

(a) According to the Jaccard coefficient, which pair of vectors—(x1, x2) or (y1, y2)—is more similar?
Answer:
Jaccard(x1, x2) = 3/5 = 0.6.
Jaccard(y1, y2) = 0/2 = 0.
Therefore, according to the Jaccard coefficient, (x1, x2) are more similar.

(b) According to the simple matching coefficient, which pair of vectors—(x1, x2) or (y1, y2)—is more similar?
Answer:
SMC(x1, x2) = 3/5 = 0.6.
SMC(y1, y2) = 3/5 = 0.6.
Therefore, according to the simple matching coefficient, the two pairs are equally similar.

(c) According to Euclidean distance, which pair of vectors—(x1, x2) or (y1, y2)—is more similar?
Answer:
Euclidean(x1, x2) = √2 = 1.4142.
Euclidean(y1, y2) = √2 = 1.4142.
Therefore, according to Euclidean distance, the two pairs are equally similar.
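
These values can be reproduced with a few anonymous functions in Matlab:

x1 = [1 1 1 1 1];  x2 = [1 1 1 0 0];
y1 = [0 0 0 0 0];  y2 = [0 0 0 1 1];
jac = @(a, b) sum(a & b) / max(sum(a | b), 1);  % guard against 0/0
smc = @(a, b) sum(a == b) / numel(a);
euc = @(a, b) sqrt(sum((a - b).^2));
[jac(x1, x2)  jac(y1, y2)]    % 0.6000       0
[smc(x1, x2)  smc(y1, y2)]    % 0.6000  0.6000
[euc(x1, x2)  euc(y1, y2)]    % 1.4142  1.4142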

2. Consider a weighted, undirected graph G (see Figure 2.4 as an example). Let e(u, v) be the weight of the edge between nodes u and v, where e(u, u) = 0 and e(u, v) = ∞ if there is no edge between u and v. Assume the


graph is a connected component, i.e., there exists a path between every two nodes. Suppose the path length d(u, v) is defined as follows:

d(u, v) =
    0,                                    if u = v;
    e(u, v),                              if there is an edge between u and v;
    min_{w ≠ u, v} [ d(u, w) + d(w, v) ],   otherwise.

Is d(u, v) a metric? State your reasons clearly. (Check whether the positivity, symmetry, and triangle inequality properties are preserved.)

Figure 2.4. Weighted undirected graph.

Answer:

(a) The positivity property is preserved by definition, since d(u, u) = 0 and d(u, v) > 0 if u ≠ v.
(b) The symmetry property is preserved since the graph is undirected.
(c) The triangle inequality is not preserved: when a direct edge exists between two nodes, d(u, v) uses its weight even if a shorter path exists through an intermediate node. In Figure 2.4, a counter-example is d(K, J) > d(K, I) + d(I, J).

Therefore d(u, v) is not a metric.

3. For document analysis, numerous measures have been proposed to determine the semantic similarity between two words using a domain ontology such as WordNet. For example, words such as dog and cat have higher semantic similarity than dog and money (since the former pair refers to two types of carnivores). Figure 2.5 below shows an example of computing the Wu-Palmer similarity between dog and cat based on their path length in the WordNet hypernym hierarchy. The depth h refers to the length of the shortest path from the root to the lowest common hypernym of the two words (e.g., carnivore for the word pair dog and cat), whereas k is the minimum path length between the two words.

Figure 2.5. Sample of the hypernym hierarchy in WordNet.

The Wu-Palmer similarity measure is defined as follows:

W = 2h / (k + 2h)

For example,¹ for dog and cat, W = 26/(4 + 26) = 0.867 (h = 13, k = 4), whereas for dog and money, W = 4/(19 + 4) = 0.174 (h = 2, k = 19).

(a) What are the maximum and minimum possible values of the Wu-Palmer similarity?

¹In this simplified example, we assume each word has exactly 1 sense. In general, a word can have multiple senses. As a result, the Wu-Palmer measure is given by the highest similarity that can be achieved using one of its possible senses.


Answer: Maximum value is 1; minimum value approaches 0.


(b) Let 1 − W be the Wu-Palmer distance measure.
• Does 1 − W satisfy the positivity property?
Answer: Yes. Since 1 − W = k/(k + 2h), which is 0 only when k = 0, d(u, v) = 0 if and only if u = v.
• Does 1 − W satisfy the symmetry property?
Answer: Yes, because W is a symmetric measure.
• Does 1 − W satisfy the triangle inequality property?
Answer: No, because each node can have more than one path to the root, some shorter than others. For example, the words (money, statute) are very dissimilar to each other, but (money, bill) and (bill, statute) are very similar, thus violating the triangle inequality. The actual paths for these words in the WordNet ontology are shown in Figure 2.6.

Figure 2.6. Sample of the hypernym hierarchy in WordNet.

4. Suppose you are given census data, where every data object corresponds to a household and the following continuous attributes are used to characterize each household: total household income, number of household residents, property value, number of bedrooms, and number of vehicles owned. Suppose we are interested in clustering the households based on these attributes.


(a) Explain why cosine is not a good measure for clustering the data.
Answer: These attributes are all numerical and can have widely varying ranges of values, depending on the scale used to measure them. As a result, the cosine measure will be dominated by the attributes with the largest magnitudes (e.g., total household income and property value).
(b) Explain why correlation is not a good measure for clustering the data.
Answer: The same argument as part (a). Because each attribute has a different range, correlating the data objects is meaningless.
(c) Explain what preprocessing steps and corresponding proximity measure you should use to do the clustering.
Answer: Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.
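
A minimal sketch of this preprocessing; the household values below are made up for illustration:

% Rows: households; columns: income, residents, property value,
% bedrooms, vehicles (hypothetical numbers).
H = [52000 3 180000 3 2;
     87000 4 320000 4 3;
     23000 1  90000 1 1];
Z = (H - repmat(mean(H), size(H, 1), 1)) ./ repmat(std(H), size(H, 1), 1);
d12 = norm(Z(1, :) - Z(2, :))    % distance between households 1 and 2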
5. Consider the following distance measure:

d(x, y) = 1 − c(x, y),

where c(x, y) is the cosine similarity between two data objects, x and
y. Does the distance measure satisfy the positivity, symmetry, and tri-
angle inequality properties? For each property, show your steps clearly.
Assume x and y are non-negative vectors (e.g., term vectors for a pair
of documents).
Answer:
(a) Positivity. You need to show that ∀x, y : d(x, y) = 1 − (x · y)/(‖x‖‖y‖) ≥ 0 and that d(x, y) = 0 if and only if x = y.
By definition, x · y = ‖x‖‖y‖ cos θ, where θ is the angle between x and y. Since cos θ ≤ 1 (from trigonometry),

d(x, y) = 1 − (x · y)/(‖x‖‖y‖) = 1 − (‖x‖‖y‖ cos θ)/(‖x‖‖y‖) = 1 − cos θ ≥ 0,

which completes the first part of the proof.


If x = y, then

d(x, y) = 1 − (x · x)/(‖x‖‖x‖) = 1 − (‖x‖‖x‖ cos 0)/(‖x‖‖x‖) = 0.


However, if d(x, y) = 0, then

1 − cos θ = 0 ⇒ cos θ = 1 ⇒ θ = 0

In other words, as long as x and y are co-linear, d(x, y) = 0 (even when x ≠ y). The distance measure therefore does not satisfy the positivity property.
(b) Symmetry. Because x · y = y · x,

d(x, y) = 1 − (x · y)/(‖x‖‖y‖) = 1 − (y · x)/(‖y‖‖x‖) = d(y, x)

Hence, the distance measure satisfies the symmetry property.


(c) Triangle Inequality. First, note that cos θ decreases with increasing θ for 0 ≤ θ ≤ π/2 (we focus only on this range of values for θ because the vectors are non-negative). Since the distance measure d(x, y) = 1 − cos θ depends on the angle between the two vectors x and y, the larger the angle, the larger the distance. We can show that the distance measure violates the triangle inequality by choosing the angles in such a way that d(x, z) > d(x, y) + d(y, z). Consider the situation shown in Figure 2.7 below.

Figure 2.7. Triangle inequality violation example

In this case, we have d(x, z) = 1 − cos θ and d(x, y) = d(y, z) = 1 − cos(θ/2). From trigonometric identities, cos θ = 2 cos²(θ/2) − 1. Therefore, d(x, z) = 1 − cos θ = 2 − 2 cos²(θ/2). On the other hand, d(x, y) + d(y, z) = 2 − 2 cos(θ/2). Since cos²(θ/2) < cos(θ/2) as long as 0 < cos(θ/2) < 1, we have found a counter-example where

d(x, z) = 2 − 2 cos²(θ/2) > d(x, y) + d(y, z) = 2 − 2 cos(θ/2).

Here is a simple example. Suppose we have 3 documents and 2 words, data and mining. Document x contains the word data only and document z contains the word mining only. Document y contains both words. We can represent the documents as follows:

x = (1, 0), y = (1, 1), z = (0, 1).

In this case, d(x, z) = 1 and d(x, y) = d(y, z) = 1 − 1/√2 = 0.2929. Therefore, d(x, z) > d(x, y) + d(y, z), which is a violation of the triangle inequality.
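
The document example can be checked directly:

x = [1 0];  y = [1 1];  z = [0 1];
dcos = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));
dcos(x, z)                 % 1.0000
dcos(x, y) + dcos(y, z)    % 0.5858 < 1, so the triangle inequality fails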

6. Consider a database of web graphs. Each graph is unweighted and contains a set of nodes and directed edges. A node corresponds to a web page, while an edge is a transition from one page to another when a user clicks on a hyperlink or enters the URL directly into the location bar of the Web browser. Each web graph represents the Web session of a user. Consider the following approaches for defining the similarity between two Web sessions, s1 and s2.

Approach 1: Node-based similarity

Simn(s1, s2) = [ Σ_i I(wi ∈ s1) × I(wi ∈ s2) ] / max(|s1|, |s2|),

where wi is a web page, |si| is the number of web pages visited during session si, max(a, b) is a function that returns the maximum of a and b, and I(wi ∈ sj) is an indicator function whose value is 1 if session sj visited web page wi and 0 otherwise.
Approach 2: Link-based similarity

Siml(s1, s2) = [ Σ_{i,j} I(wi → wj ∈ s1) × I(wi → wj ∈ s2) ] / max(|s1|, |s2|),


where wi → wj is a transition from page wi to wj, |si| is the number of transitions in session si, max(a, b) is a function that returns the maximum of a and b, and I(wi → wj ∈ sk) is an indicator function whose value is 1 if session sk contains a transition from web page wi to wj and 0 otherwise.

(a) Consider the following two Web sessions: s1 = (A → B → C → B → D → E) and s2 = (A → C → B → E). Compute the node-based and link-based similarities for the Web graphs constructed from the two sessions.
Answer: Simn(s1, s2) = 4/5 and Siml(s1, s2) = 1/5.
(b) Suppose the node-based similarity for s1 and s2 equals 1. Can the web graphs for s1 and s2 be different? State your reasons clearly.
Answer: Yes. As long as both graphs contain the same set of nodes, the node-based similarity is equal to 1. But the graphs may still be different because their links could differ.
(c) Suppose Siml (s1 , s2 ) = 1 according to approach 2. Can the web
graphs for s1 and s2 be different? State your reasons clearly.
Answer: No. The web graphs are the same because all the node
transitions in s1 must also be present in s2 , and vice-versa.
(d) Which approach do you think is more effective at measuring similarity between two web sessions? State your reasons clearly.
Answer: Link-based similarity is more effective because its value is 1 only if the two web graphs contain exactly the same set of transitions.
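
A sketch reproducing part (a); encoding each transition as a string such as 'A>B' is our own device:

s1_nodes = {'A','B','C','D','E'};
s2_nodes = {'A','C','B','E'};
s1_links = {'A>B','B>C','C>B','B>D','D>E'};
s2_links = {'A>C','C>B','B>E'};
sim_n = numel(intersect(s1_nodes, s2_nodes)) / ...
        max(numel(s1_nodes), numel(s2_nodes))    % 4/5
sim_l = numel(intersect(s1_links, s2_links)) / ...
        max(numel(s1_links), numel(s2_links))    % 1/5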

7. Consider the following distance measure D between two clusters of data points, X and Y:

D(X, Y) = min{ d(x, y) : x ∈ X, y ∈ Y },

where d(x, y) is the Euclidean distance between two data points, x and
y. Intuitively, D measures the distance between clusters in terms of
the closest two points from each cluster (see Figure 2.8). Does the dis-
tance measure satisfy the positivity, symmetry, and triangle inequality
properties? For each property, show your proof clearly or give a counter-
example if the property is not satisfied.
Answer:


Figure 2.8. Cluster distance measure

(a) Positivity: Since the Euclidean distance between any two data points is always non-negative, D(X, Y) ≥ 0. However, D(X, Y) can be zero even when X ≠ Y if a data point is assigned to both clusters X and Y (i.e., if overlapping clusters are allowed). So the distance measure satisfies the positivity property for disjoint clusters but not for overlapping clusters.
(b) Symmetry: Since Euclidean distance is a symmetric measure, D(X, Y) = min{d(x, y) : x ∈ X, y ∈ Y} = min{d(y, x) : x ∈ X, y ∈ Y} = D(Y, X). Thus, the measure is symmetric.
(c) Triangle Inequality: The triangle inequality can be violated. A counter-example is shown in Figure 2.9.

Figure 2.9. Violation of triangle inequality

8. For this question, assume each object is characterized by a set of continuous-valued attributes.

(a) If two objects have a cosine similarity of 1, must their attribute values be identical? Explain.


Answer: No. A cosine similarity of 1 simply implies that the two attribute vectors are parallel to each other. For example, when x = (1, 2) and y = (2, 4), their cosine similarity is 1.
(b) If two objects have a correlation value of 1, must their attribute
values be identical? Explain.
Answer: No. A correlation value of 1 simply implies that there is a
linear relationship between the two attribute vectors. For example,
when x = (1, 2) and y = (3, 5), then their correlation is 1.
(c) If two objects have a Euclidean distance of 0, must their attribute values be identical? Explain.
Answer: Yes. Consider a pair of objects with attribute vectors x and y. Their Euclidean distance is d(x, y) = √(Σ_i (x_i − y_i)²) = 0, which is true only if x_i = y_i for all i.
(d) Let x and y be the attribute vectors of two objects. State whether the following proximity measures—cosine, correlation, and Euclidean distance—are invariant (unchanged) under the following transformations. Specifically, if x → x′ and y → y′, would cosine(x, y) = cosine(x′, y′), correlation(x, y) = correlation(x′, y′), and Euclidean(x, y) = Euclidean(x′, y′)?
i. Translation: x → x + c and y → y + c, where c is a constant added to each attribute value in x and y.
Answer: Cosine is not invariant, because

cosine(x + c, y + c) = Σ_i (x_i + c)(y_i + c) / [ √(Σ_i (x_i + c)²) √(Σ_i (y_i + c)²) ] ≠ Σ_i x_i y_i / [ √(Σ_i x_i²) √(Σ_i y_i²) ]

unless c = 0. Euclidean distance is invariant, since √(Σ_i [(x_i + c) − (y_i + c)]²) = √(Σ_i (x_i − y_i)²). Correlation is also invariant: when x → x + c, the mean shifts to x̄ + c but the standard deviation is unchanged, since σ_{x+c} = √(Σ_i (x_i + c − x̄ − c)²) = √(Σ_i (x_i − x̄)²) = σ_x. Thus,

correlation(x + c, y + c) = Σ_i (x_i + c − x̄ − c)(y_i + c − ȳ − c) / (σ_x σ_y) = Σ_i (x_i − x̄)(y_i − ȳ) / (σ_x σ_y) = correlation(x, y).
ii. Scaling: x → cx and y → cy, where c is a constant multiplied with each attribute value in x and y.
Answer: Cosine is invariant, because

Σ_i (cx_i)(cy_i) / [ √(Σ_i (cx_i)²) √(Σ_j (cy_j)²) ] = Σ_i x_i y_i / [ √(Σ_i x_i²) √(Σ_j y_j²) ].

Correlation is also invariant, because when x → cx, both the mean and the standard deviation are re-scaled by the same factor: x̄′ = (Σ_i cx_i)/n = cx̄ and σ_{cx} = √(Σ_i (cx_i − cx̄)²) = |c| σ_x. Euclidean distance is not invariant, because √(Σ_i (cx_i − cy_i)²) = |c| √(Σ_i (x_i − y_i)²).


iii. Standardization: x → (x − c)/d and y → (y − c)/d, where c and d are constants.
Answer: Standardization is a combination of translation (by the mean of the vector) and scaling (by the standard deviation). Since correlation is invariant with respect to both operations, it is also invariant with respect to standardization. However, cosine and Euclidean distance are not invariant, since each is not preserved by one of the two operations.
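
A numerical spot-check of these claims, with vectors chosen arbitrarily:

x = [1 2 3 5];  y = [2 1 4 4];  c = 3;
cosim = @(a, b) dot(a, b) / (norm(a) * norm(b));
corrv = @(a, b) dot(a - mean(a), b - mean(b)) / ...
        (norm(a - mean(a)) * norm(b - mean(b)));
euc = @(a, b) norm(a - b);
% Translation: correlation and Euclidean unchanged, cosine changes.
[corrv(x, y)  corrv(x + c, y + c)]
[euc(x, y)    euc(x + c, y + c)]
[cosim(x, y)  cosim(x + c, y + c)]
% Scaling: cosine and correlation unchanged, Euclidean scales by c.
[cosim(x, y)  cosim(c * x, c * y)]
[euc(x, y)    euc(c * x, c * y)]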

9. Consider the following survey data about users who joined an online
community. The sample covariance between the user’s height (in mm)
and number of years being a member of the community is 5.0.

(a) Suppose the sample covariance between the user’s age and number
of years being a member of the community is only 0.5. Does this
imply that user’s height is more correlated with number of years in
the community than user’s age? Answer yes or no and explain your
reasons clearly.
Answer: No. Covariance is not a dimensionless quantity, so its
magnitude depends on the scale of measurement.
(b) Suppose the height attribute is re-defined as height above the av-
erage for all users who participated in the survey. For example, a
user who is 1650 mm tall has a height value of -50 mm (assuming
the average height of all users is 1700 mm). Would the covariance
between the re-defined height attribute and number of years in the
community be greater than, smaller than, or equal to 5.0?
Answer: Equal. Let x_h denote the height attribute and x_y the number of years in the community. The sample covariance between the two attributes is given by:

Σ_{x_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y),

where x̄_h and x̄_y are the average height and average number of years, respectively. If we re-define the height attribute as x′_h = x_h − x̄_h,

then x̄′_h = 0. Hence, the covariance between x′_h and x_y becomes

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h − 0)(x_{iy} − x̄_y)
             = Σ_{x_h, x_y}

This result means centering the height attribute has no effect on its covariance with other attributes.
(c) If the measurement unit for height is converted from mm to inches
(where 1 inch = 25.4 mm), will the covariance between height (in
inches) and number of years in the community be greater than,
smaller than, or equal to 5.0?
Answer: Re-scaling the height attribute is equivalent to multiplying the original attribute by a constant C, i.e., x′_h = Cx_h. Furthermore, x̄′_h = Cx̄_h. Thus the covariance between the rescaled height and the number of years in the community will be:

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (Cx_{ih} − Cx̄_h)(x_{iy} − x̄_y)
             = (C/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y)
             = CΣ_{x_h, x_y}

In this case, C = 1/25.4, which is smaller than 1. Therefore, the covariance value will be smaller when the unit is converted from mm to inches.
(d) Suppose you standardize both the height and number of years in
the community attributes (by subtracting their respective means
and dividing by their corresponding standard deviations). Would
their covariance value be greater than, smaller than, or equal to


5.0? To obtain full credit, you must prove your answer by showing the computations clearly.
Answer: The re-defined attributes after standardization are x′_h = (x_h − x̄_h)/σ_h and x′_y = (x_y − x̄_y)/σ_y. Furthermore, x̄′_h = 0 and x̄′_y = 0. Then,

Σ_{x′_h, x′_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x′_{iy} − x̄′_y)
             = (1/(N−1)) Σ_{i=1}^{N} [(x_{ih} − x̄_h)/σ_h] [(x_{iy} − x̄_y)/σ_y]
             = [ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) ] / (σ_h σ_y)
             = Σ_{x_h, x_y} / (σ_h σ_y)     (2.9)

Note that Σ_{x_h, x_y}/(σ_h σ_y) is the correlation coefficient between x_h and x_y. Since the correlation coefficient is always less than or equal to 1, whereas the original covariance value is +5, the covariance value is smaller after standardization.
Next, we prove that the correlation coefficient is always between −1 and +1. First, note that

σ_h = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)² ],    σ_y = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{iy} − x̄_y)² ].

Thus, Equation (2.9) can be re-written as follows:

Σ_{x′_h, x′_y} = Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) / √[ (Σ_{i=1}^{N} (x_{ih} − x̄_h)²)(Σ_{i=1}^{N} (x_{iy} − x̄_y)²) ]     (2.10)

Let h_i = x_{ih} − x̄_h and y_i = x_{iy} − x̄_y. Equation (2.10) becomes

Σ_{x′_h, x′_y} = Σ_{i=1}^{N} h_i y_i / √[ (Σ_{i=1}^{N} h_i²)(Σ_{i=1}^{N} y_i²) ] = (h · y) / (‖h‖‖y‖)     (2.11)
According to the Cauchy-Schwarz inequality, for any vectors h and y, we have

h · y ≤ ‖h‖‖y‖.

Thus the ratio on the right-hand side of Equation (2.11) is less than or equal to 1, which completes the proof.
10. Suppose you are given a database of patients' demographic information from a healthcare provider. The covariance matrix obtained for three attributes—age, weight, and systolic blood pressure (bp)—is shown below:

          age      weight   bp
age     [ 389.75   199.37   135.12 ]
weight  [ 199.37   610.52   426.30 ]
bp      [ 135.12   426.30   359.36 ]

(a) Does this imply that a patient's age is more correlated with weight than with systolic blood pressure? Answer yes or no and explain your reasons clearly.
Answer: No. Covariance is not a dimensionless quantity, so its magnitude depends on the scale of measurement. Even though the covariance between age and weight is higher than that between age and systolic blood pressure, it is possible that the correlation is lower.
(b) Suppose the weight attribute is centered by subtracting it with the
average weight of all patients in the database. For example, a 200-
pound patient has a weight recorded as 50 (if the average weight
of the patients is 150 pounds). Would the covariance between the
centered weight attribute and age be greater than, smaller than, or
equal to 199.37?
Answer: Equal. Let x_h denote the weight attribute and x_y the age attribute. The sample covariance between the two attributes is given by:

Σ_{x_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y),


where x̄_h and x̄_y are the average weight and average age, respectively. If we re-define the weight attribute as x′_h = x_h − x̄_h, then x̄′_h = 0. Hence, the covariance between x′_h and x_y becomes

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h − 0)(x_{iy} − x̄_y)
             = Σ_{x_h, x_y}

This result means centering the weight attribute has no effect on its covariance with other attributes.
(c) If the measurement unit for weight is converted from pounds to
kilograms (where 1 kg = 2.2 pounds), will the covariance between
weight (in kilogram) and age be greater than, smaller than, or equal
to 199.37?
Answer: Re-scaling the weight attribute is equivalent to multiplying the original attribute by a constant C, i.e., x′_h = Cx_h. Furthermore, x̄′_h = Cx̄_h. Thus the covariance between the rescaled weight and age will be:

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (Cx_{ih} − Cx̄_h)(x_{iy} − x̄_y)
             = (C/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y)
             = CΣ_{x_h, x_y}

In this case, C = 1/2.2, which is smaller than 1. Therefore, the covariance value will be smaller when the unit is converted from pounds to kilograms.
(d) Suppose you standardize both the age and weight attributes (by
subtracting their respective means and dividing by their corre-
sponding standard deviations). Would their covariance value be
greater than, smaller than, or equal to 199.37?


Answer: The re-defined attributes after standardization are x′_h = (x_h − x̄_h)/σ_h and x′_y = (x_y − x̄_y)/σ_y. Furthermore, x̄′_h = 0 and x̄′_y = 0. Then,

Σ_{x′_h, x′_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x′_{iy} − x̄′_y)
             = (1/(N−1)) Σ_{i=1}^{N} [(x_{ih} − x̄_h)/σ_h] [(x_{iy} − x̄_y)/σ_y]
             = [ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) ] / (σ_h σ_y)
             = Σ_{x_h, x_y} / (σ_h σ_y)     (2.12)

Note that Σ_{x_h, x_y}/(σ_h σ_y) is the correlation coefficient between x_h and x_y. Since the correlation coefficient is always less than or equal to 1, whereas the original covariance value is +199.37, the covariance value is smaller after standardization.
Next, we prove that the correlation coefficient is always between −1 and +1. First, note that

σ_h = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)² ],    σ_y = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{iy} − x̄_y)² ].

Thus, Equation (2.12) can be re-written as follows:

Σ_{x′_h, x′_y} = Σ_{x_h, x_y} / (σ_h σ_y)
             = Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) / √[ (Σ_{i=1}^{N} (x_{ih} − x̄_h)²)(Σ_{i=1}^{N} (x_{iy} − x̄_y)²) ]     (2.13)

Let h_i = x_{ih} − x̄_h and y_i = x_{iy} − x̄_y. Equation (2.13) becomes

Σ_{x′_h, x′_y} = Σ_{i=1}^{N} h_i y_i / √[ (Σ_{i=1}^{N} h_i²)(Σ_{i=1}^{N} y_i²) ] = h^T y / (‖h‖‖y‖)     (2.14)


According to the Cauchy-Schwarz inequality, for any vectors h and y, we have

h^T y ≤ ‖h‖‖y‖.

Thus the ratio on the right-hand side of Equation (2.14) is less than or equal to 1, which completes the proof.

11. Consider the following distance measure for two sets, X and Y:

D(X, Y) = 1 − |X ∩ Y| / |X ∪ Y|,

where ∩ denotes the intersection of the two sets, ∪ denotes their union, and |·| denotes the cardinality of a set. This measure is equivalent to 1 minus the Jaccard similarity. Does the distance measure satisfy the positivity, symmetry, and triangle inequality properties? For each property, explain your reasoning clearly or give a counter-example if the property is not satisfied.
Answer:
Answer:
(a) Positivity: Since |X ∩ Y| ≤ |X ∪ Y|, we have |X ∩ Y|/|X ∪ Y| ≤ 1 and

D(X, Y) = 1 − |X ∩ Y|/|X ∪ Y| ≥ 0.

Furthermore, if X = Y, then |X ∪ Y| = |X ∩ Y| = |X|. Hence,

D(X, Y) = 1 − |X ∩ Y|/|X ∪ Y| = 1 − 1 = 0.

Similarly, if D(X, Y) = 0, then

1 − |X ∩ Y|/|X ∪ Y| = 0,

which means |X ∪ Y| = |X ∩ Y|, or equivalently, X = Y.


Hence, the positivity property holds for the distance measure.
(b) Symmetry:

D(X, Y) = 1 − |X ∩ Y|/|X ∪ Y| = 1 − |Y ∩ X|/|Y ∪ X| = D(Y, X).

Hence, the symmetry property holds for the distance measure.


(c) Triangle inequality:

D(X, Y) + D(Y, Z) = [1 − |X ∩ Y|/|X ∪ Y|] + [1 − |Y ∩ Z|/|Y ∪ Z|]
                  = (|X ∪ Y| − |X ∩ Y|)/|X ∪ Y| + (|Y ∪ Z| − |Y ∩ Z|)/|Y ∪ Z|
                  ≥ (|X ∪ Y| − |X ∩ Y|)/|X ∪ Y ∪ Z| + (|Y ∪ Z| − |Y ∩ Z|)/|X ∪ Y ∪ Z|

and

D(X, Z) = 1 − |X ∩ Z|/|X ∪ Z| ≤ 1 − |X ∩ Z|/|X ∪ Y ∪ Z| = (|X ∪ Y ∪ Z| − |X ∩ Z|)/|X ∪ Y ∪ Z|.

Figure 2.10. Illustration of triangle inequality

Figure 2.10 shows the Venn diagram for sets X, Y, and Z. The number of data points in each subregion of the Venn diagram is labeled A through G. From this figure, it can be seen that

|X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z| = A + C + D + F + B + C + D + G,

whereas

|X ∪ Y ∪ Z| − |X ∩ Z| = A + B + C + F + G.


The preceding equations show that

|X ∪ Y ∪ Z| − |X ∩ Z| ≤ |X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z|.

Putting the inequalities together, we have

D(X, Z) ≤ (|X ∪ Y ∪ Z| − |X ∩ Z|)/|X ∪ Y ∪ Z|
        ≤ (|X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z|)/|X ∪ Y ∪ Z|
        ≤ D(X, Y) + D(Y, Z).
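
A quick spot-check of the inequality on three arbitrary sets:

X = [1 2 3 4];  Y = [3 4 5];  Z = [4 5 6 7];
jd = @(a, b) 1 - numel(intersect(a, b)) / numel(union(a, b));
jd(X, Z)               % 0.8571
jd(X, Y) + jd(Y, Z)    % 1.2000, consistent with the proof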

12. Which similarity or distance measure is most effective for each of the
domains given below:

(a) Which measure, Jaccard or simple matching coefficient, is more appropriate for comparing the similarity of the answers provided by students in an exam? Assume that the answers to all the questions in the exam are either True or False.
Answer: Simple matching coefficient. The values True and False are equally important when computing similarity.
(b) Which measure, Jaccard or Simple Matching Coefficient, is most
appropriate to compare how similar are the locations visited by
tourists at an amusement park. Assume the location information is
stored as binary yes/no attributes (yes means a location was visited
by the tourist and no means a location has not been visited).
Answer: Jaccard. Here places visited by the tourists should play
a more significant role in computing similarity than places they did
not visit.
(c) Which measure, Euclidean distance or correlation coefficient, is most appropriate for comparing two flows in network traffic? For each flow, we record the number of packets transmitted, number of bytes transferred, number of acknowledgments sent, and duration of the session.
Answer: Euclidean distance (after standardizing each attribute). The correlation coefficient is not meaningful here because it is not meaningful to correlate two flows that have different attributes (i.e., correlating the attributes is meaningful, but correlating the flows is not).


(d) Which measure, Euclidean distance or cosine similarity, is most appropriate for comparing the coordinates of a moving object in a 2-dimensional space? For example, using GPS data, the object may be located at (31.4° West, 12.4° North) at time t1 and at (29.4° West, 12.5° North) at another time t2. Note: we may use +/− to indicate East/West or North/South directions when computing the similarity or distance measures.
Answer: Euclidean distance. Cosine measures only the angle between the two location vectors. Thus, if two locations lie along the same line through the origin, their cosine similarity will be 1 (the maximum) even though they may be located far away from each other.
(e) Which measure, Euclidean distance or cosine similarity, is most ap-
propriate to compare the similarity of items bought by customers at
a grocery store. Assume each customer is represented by a 0/1 bi-
nary vector of items (where a 1 means the customer had previously
bought the item).
Answer: Cosine similarity because presence of an item in the trans-
action plays a more important role in determining similarity than
absence of the item.
