Test Bank + Answer Key

Test Bank for Introduction to Data Mining, 2nd Edition, by Pang-Ning Tan


1

Introduction
1. [Fall 2008]
For each data set given below, give specific examples of classification,
clustering, association rule mining, and anomaly detection tasks that
can be performed on the data. For each task, state how the data matrix
should be constructed (i.e., specify the rows and columns of the matrix).

(a) Ambulatory Medical Care data,¹ which contains the demographic and medical visit information for each patient (e.g., gender, age, duration of visit, physician's diagnosis, symptoms, medication, etc.).
Answer:
Classification
Task: Diagnose whether a patient has a disease.
Row: Patient
Column: Patient’s demographic and hospital visit information (e.g., symptoms), along with
a class attribute that indicates whether the patient has the disease.
Clustering
Task: Find groups of patients with similar medical conditions
Row: A patient visit
Column: List of medical conditions of each patient
Association rule mining
Task: Identify the symptoms and medical conditions that co-occur together frequently
Row: A patient visit
Column: List of symptoms and diagnosed medical conditions of the patient
Anomaly detection
Task: Identify healthy looking patients with rare medical disorders
Row: A patient visit
Column: List of demographic attributes, symptoms, and medical test results of the patient
¹See, for example, the National Hospital Ambulatory Medical Care Survey: http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm

(b) Stock market data, which include the prices and volumes of various
stocks on different trading days.
Answer:
Classification
Task: Predict whether the stock price will go up or down on the next trading day
Row: A trading day
Column: Trading volume and closing price of the stock over the previous 5 days, along with a class attribute that indicates whether the stock went up or down
Clustering
Task: Identify groups of stocks with similar price fluctuations
Row: A company’s stock
Column: Changes in the daily closing price of the stock over the past ten years
Association rule mining
Task: Identify stocks with similar fluctuation patterns (e.g., {Google-Up, Yahoo-Up})
Row: A trading day
Column: List of all stock-up and stock-down events on the given day.
Anomaly detection
Task: Identify unusual trading days for a given stock (e.g., unusually high volume)
Row: A trading day
Column: Trading volume, change in daily stock price (daily high − low prices), and average
price change of its competitor stocks
(c) Database of Major League Baseball (MLB).

Classification
Task: Predict the winner of a game between two MLB teams.
Row: A game.
Column: Statistics of the home and visiting teams over the past 10 games they played (e.g., average winning percentage and hitting percentage of their players)
Clustering
Task: Identify groups of players with similar statistics
Row: A player
Column: Statistics of the player
Association rule mining
Task: Identify interesting player statistics (e.g., 40% of right-handed players have a batting
percentage below 20% when facing left-handed pitchers)
Row: A player
Column: Discretized statistics of the player
Anomaly detection
Task: Identify players who performed considerably better than expected in a given season
Row: A (player, season) pair, e.g., (player1, 2007)
Column: Ratio statistics of a player (e.g., ratio of average batting percentage in 2007 to
career average batting percentage)

2

Data
2.1 Types of Attributes
1. Classify the following attributes as binary, discrete, or continuous. Also
classify them as qualitative (nominal or ordinal) or quantitative (interval
or ratio). Some cases may have more than one interpretation, so briefly
indicate your reasoning if you think there may be some ambiguity.

(a) Number of courses registered by a student in a given semester.
Answer: Discrete, quantitative, ratio.
(b) Speed of a car (in miles per hour).
Answer: Continuous, quantitative, ratio.
(c) Decibel as a measure of sound intensity.
Answer: Continuous, quantitative, interval or ratio. It is actually a log-ratio type (which is somewhere between interval and ratio).
(d) Hurricane intensity according to the Saffir-Simpson Hurricane Scale.
Answer: Discrete, qualitative, ordinal.
(e) Social security number.
Answer: Discrete, qualitative, nominal.

2. Classify the following attributes as:

• discrete or continuous.
• qualitative or quantitative
• nominal, ordinal, interval, or ratio

Some cases may have more than one interpretation, so briefly indicate
your reasoning if you think there may be some ambiguity.

(a) Julian Date, which is the number of days elapsed since 12 noon
Greenwich Mean Time of January 1, 4713 BC.
Answer: Continuous, quantitative, interval
(b) Movie ratings provided by users (1-star, 2-star, 3-star, or 4-star).
Answer: Discrete, qualitative, ordinal
(c) Mood level of a blogger (cheerful, calm, relaxed, bored, sad, angry
or frustrated).
Answer: Discrete, qualitative, nominal
(d) Average number of hours a user spent on the Internet in a week.
Answer: Continuous, quantitative, ratio
(e) IP address of a machine.
Answer: Discrete, qualitative, nominal
(f) Richter scale (in terms of energy release during an earthquake).
Answer: Continuous, qualitative, ordinal
In terms of energy release, the difference between 0.0 and 1.0 is not the same as between 1.0 and 2.0. Ordinal attributes are qualitative, yet they can be continuous.
(g) Salary above the median salary of all employees in an organization.
Answer: Continuous, quantitative, interval
(h) Undergraduate level (freshman, sophomore, junior, and senior) for
measuring years in college.
Answer: Discrete, qualitative, ordinal

3. For each attribute given, classify its type as:

• discrete or continuous AND
• qualitative or quantitative AND
• nominal, ordinal, interval, or ratio

Indicate your reasoning if you think there may be some ambiguity in some cases.
Example: Age in years.
Answer: Discrete, quantitative, ratio.


(a) Daily user traffic volume at YouTube.com (i.e., number of daily visitors who visited the Web site).
Answer: Discrete, quantitative, ratio.
(b) Air pressure of a car/bicycle tire (in psi).
Answer: Continuous, quantitative, ratio.
(c) Homeland Security Advisory System ratings - code red/orange/etc.
Answer: Discrete, qualitative, ordinal.
(d) Amount of seismic energy release, measured in Richter scale.
Answer: Continuous, qualitative, ordinal.
(e) Credit card number.
Answer: Discrete, qualitative, nominal.
(f) The wealth of a nation measured in terms of gross domestic product
(GDP) per capita above the world’s average of $10,500.
Answer: Continuous, quantitative, interval.

4. For each attribute given, classify its type as:

• discrete or continuous AND
• qualitative or quantitative AND
• nominal, ordinal, interval, or ratio

Indicate your reasoning if you think there may be some ambiguity in some cases.
Example: Age in years.
Answer: Discrete, quantitative, ratio.

(a) Favorite movie of each person.
Answer: Discrete, qualitative, nominal.
(b) Number of days since Jan 1, 2011.
Answer: Discrete, quantitative, interval.
(c) Category of a hurricane (The Saffir-Simpson Hurricane Wind Scale
ranges from category 1 to category 5).
Answer: Discrete, qualitative, ordinal.
(d) Number of students enrolled in a class.
Answer: Discrete, quantitative, ratio


5. For each attribute given, classify its type as:

• discrete or continuous AND
• qualitative or quantitative AND
• nominal, ordinal, interval, or ratio

Indicate your reasoning if you think there may be some ambiguity in some cases.
Example: Temperature in Kelvin.
Answer: Continuous, quantitative, ratio.

(a) Number of years since 1 BC. For example, 2 BC is year -1, 1 BC is year 0, 1 AD is year 1, and 2013 AD is year 2013 (note: there is no year 0 AD in the Gregorian calendar).
Answer: Discrete/Continuous, quantitative, interval.
(b) GPA of a student.
Answer: Continuous, qualitative, ordinal.
(c) Mood level of a blogger (cheerful, calm, relaxed, bored, sad, angry
or frustrated).
Answer: Discrete, qualitative, nominal.
(d) Sound intensity in decibel scale.
Answer: Continuous, qualitative, ordinal. In terms of sound in-
tensity, the difference between 0dB and 1dB is not the same as the
difference between 10 dB and 11 dB (decibels are in log scale); thus,
it is not an interval attribute.

6. State the type of each attribute given below before and after we have
performed the following transformation.

(a) Hair color of a person is mapped to the following values: black = 0, brown = 1, red = 2, blonde = 3, grey = 4, white = 5.
Answer: Nominal (both before and after transformation).
(b) Grade of a student (from 0 to 100) is mapped to the following scale:
A = 4.0, A- = 3.5, B = 3.0, B- = 2.5, C = 2.0, C- = 1.5, D = 1.0,
D- = 0.5, E = 0.0
Answer: Ratio (before transformation) to ordinal (after transfor-
mation).


(c) Age of a person is discretized to the following scale: Age < 12, 12
≤ Age < 21, 21 ≤ Age < 45, 45 ≤ Age < 65, Age > 65.
Answer: Ratio (before transformation) to ordinal (after transfor-
mation)
(d) Annual income of a person is discretized to the following scale: Income < $20K, $20K ≤ Income < $60K, $60K ≤ Income < $120K, $120K ≤ Income < $250K, Income ≥ $250K.

Answer: Ratio (before transformation) to ordinal (after transfor-


mation).
(e) Height of a person is changed from meters to feet.
Answer: Ratio (both before and after transformation)
(f) Height of a person is changed from meters to {Short, Medium, Tall}.
Answer: Ratio (before transformation) to ordinal (after transfor-
mation).
(g) Height of a person is changed from feet to number of inches above
4 feet.
Answer: Ratio (before transformation) to interval (after transfor-
mation).
(h) Weight of a person is standardized by subtracting it with the mean
of the weight for all people and dividing by its standard deviation.
Answer: Ratio (before transformation) to interval (after transfor-
mation)

7. State whether it is meaningful (based on the properties of the attribute values) to apply the following operations to the data given below.

(a) Average amplitude of seismic waves (in Richter scale) for the 10
deadliest earthquakes in Asia.
Answer: No because Richter scale is ordinal.
(b) Average number of characters in a collection of spam messages.
Answer: Yes because number of characters is a ratio attribute.
(c) Pearson’s correlation between shirt size and height of an individual.
Answer: No because shirt size is ordinal.
(d) Median zipcode of households in the United States.
Answer: No because zipcode is nominal.


(e) Entropy of students (based on the GPA they obtained for a given
course).
Answer: Yes because entropy is applicable to nominal attributes.
(f) Geometric mean of temperature (in Fahrenheit) for a given city.
Answer: No because temperature (in Fahrenheit) is not a ratio
attribute.

2.2 Data Preprocessing

1. Consider the following dataset, which contains the age and gender information for 9 users who visited a given website.

UserID    1       2     3     4     5       6       7       8     9
Age       17      24    25    28    32      38      39      49    68
Gender    Female  Male  Male  Male  Female  Female  Female  Male  Male

(a) Suppose you apply equal interval width approach to discretize the
Age attribute into 3 bins. Show the userIDs assigned to each of the
3 bins.
Answer: Bin width = (68 − 17)/3 = 51/3 = 17, giving bins [17, 34), [34, 51), and [51, 68].
Bin 1: 1, 2, 3, 4, 5
Bin 2: 6, 7, 8
Bin 3: 9
(b) Repeat the previous question using the equal frequency approach.
Answer: Since there are 9 users and 3 bins, every bin must contain
3 users.
Bin 1: 1, 2, 3
Bin 2: 4, 5, 6
Bin 3: 7, 8, 9
(c) Repeat question (a) using a supervised discretization approach (with
Gender as class attribute). Specifically, choose the bins in such a
way that their members are as “pure” as possible (i.e., belonging
to the same class).
Answer:
Bin 1: 1, 2, 3, 4
Bin 2: 5, 6, 7
Bin 3: 8, 9
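
The two unsupervised schemes above can be checked with a short Matlab sketch (the binning logic below is our own illustration, not part of the exercise):

age = [17 24 25 28 32 38 39 49 68];
k = 3;
% Equal interval width: width = (max - min) / k = 17.
width = (max(age) - min(age)) / k;
wbin = min(ceil((age - min(age)) / width), k);
wbin(wbin == 0) = 1;                  % the minimum value belongs to bin 1
% Equal frequency: sort the values and cut into k equal-sized groups.
[~, order] = sort(age);
fbin = zeros(1, numel(age));
fbin(order) = ceil((1:numel(age)) / (numel(age) / k));
disp(wbin)   % 1 1 1 1 1 2 2 2 3  (users 1-5, 6-8, 9)
disp(fbin)   % 1 1 1 2 2 2 3 3 3  (users 1-3, 4-6, 7-9)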


2. Consider an attribute X of a data set that takes the values {x1, x2, ..., x9} (sorted in increasing order of magnitude). We apply two methods (equal interval width and equal frequency) to discretize the attribute into 3 bins. The bins obtained are shown below:

Equal Width: {x1, x2, x3}, {x4, x5, x6, x7, x8}, {x9}
Equal Frequency: {x1, x2, x3}, {x4, x5, x6}, {x7, x8, x9}

Explain the effect of applying the following transformations on each discretization method, i.e., whether the elements assigned to each bin can change if you discretize the attribute after applying the transformation. Note that X̄ denotes the average value and σ_x the standard deviation of attribute X.

(a) X → X − X̄ (i.e., if the attribute values are centered).
Answer: No change for equal width because the distance between x_i and x_{i+1} is unchanged. No change for equal frequency because the relative ordering of data points remains the same (i.e., if x_i < x_{i+1} then x_i − X̄ < x_{i+1} − X̄).
(b) X → (X − X̄)/σ_x (i.e., if the attribute values are standardized).
Answer: Since the distances between every pair of points (x_i, x_{i+1}) change uniformly (by a constant factor of 1/σ_x), the elements in the bins are unchanged for equal width discretization. No change for equal frequency because the relative ordering of data points remains the same.
(c) X → exp((X − X̄)/σ_x) (i.e., if the values are standardized and exponentiated).
Answer: The bin elements may change for equal width because the distances between x_i and x_{i+1} do not change uniformly. No change for equal frequency because the relative ordering of data points remains the same.
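
A quick numerical illustration of part (c), using a small made-up set of attribute values: the transform preserves order (so equal-frequency bins are unchanged) but stretches gaps non-uniformly (so equal-width memberships can change).

x = [1 2 3 10 11 12 13 14 30];       % hypothetical attribute values
z = exp((x - mean(x)) / std(x));     % standardize, then exponentiate
issorted(z)                          % true: order preserved, so equal
                                     % frequency bins are unchanged
diff(z) ./ diff(x)                   % gap stretch factors are far from
                                     % constant, so equal width bins can change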

3. Consider a dataset that has 3 attributes (x1, x2, and x3). The distribution of each attribute is as follows:

• x1 has a uniform distribution in the range between 0 and 1.
• x2 is generated from a mixture of 3 Gaussian distributions centered at 0.1, 0.5, and 0.9, respectively. The standard deviations of the distributions are 0.02, 0.1, and 0.02, respectively. Assume each point is generated from one of the 3 distributions and the number of points associated with each distribution is different.
• x3 is generated from an exponential distribution with mean 0.1.

(a) Which attribute(s) is likely to produce the same bins regardless of whether you use the equal width or equal frequency approach (assuming the number of bins is not too large)?
Answer: x1.
(b) Which attribute(s) is more suitable for equal frequency than equal
width discretization approaches.
Answer: x3 .
(c) Which attribute(s) is not appropriate for both equal width and
equal frequency discretization approaches.
Answer: x2 .
(d) If all 3 are initially ratio attributes, what are their attribute types
after discretization?
Answer: Ordinal.
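
For intuition, samples matching these three descriptions can be generated as follows; the sample size and mixture weights are our own choices, since the exercise leaves them unspecified:

n = 1000;
x1 = rand(n, 1);                     % uniform on [0, 1]
mu = [0.1; 0.5; 0.9];  sd = [0.02; 0.1; 0.02];
u = rand(n, 1);
idx = 1 + (u > 0.5) + (u > 0.8);     % mixture weights 0.5, 0.3, 0.2 (assumed)
x2 = mu(idx) + sd(idx) .* randn(n, 1);
x3 = -0.1 * log(rand(n, 1));         % exponential with mean 0.1 (inverse CDF)
hist(x2, 50);                        % three well-separated peaks: neither
                                     % equal width nor equal frequency fits x2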

4. An e-commerce company is interested in identifying its highest-spending customers at its online store using association rule mining. One of the rules identified is:

21 ≤ Age < 45 AND NumberOfVisits > 50 → AmountSpent > $500,

where the Age attribute was discretized into 5 bins, NumberOfVisits was discretized into 8 bins, and AmountSpent was discretized into 8 bins. The confidence of an association rule A, B → C is defined as

Confidence(A, B → C) = P(C|A, B) = P(A, B, C) / P(A, B)     (2.1)

where P(C|A, B) is the conditional probability of C given A and B, P(A, B, C) is the joint probability of A, B, and C, and P(A, B) is the joint probability of A and B. The probabilities are empirically estimated from their relative frequencies in the data. For example, P(AmountSpent > $500) is given by the proportion of online users who visited the store and spent more than $500.


(a) Suppose we increase the number of bins for the Age attribute from
5 to 6 so that the discretized Age in the rule becomes 21 ≤ Age
< 30 instead of 21 ≤ Age < 45, will the confidence of the rule be
non-increasing, non-decreasing, stays the same, or could go either
way (increase/decrease)?
Answer: Can increase/decrease.
(b) Suppose we increase the number of bins for the AmountSpent at-
tribute from 8 to 10, so that the right hand side of the rule becomes
$500 < AmountSpent < $1000, will the confidence of the rule be
non-increasing, non-decreasing, stays the same, or could go either
way (increase/decrease)?
Answer: Non-increasing.
(c) Suppose the values for NumberOfVisits attribute are distributed
according to a Poisson distribution with a mean value equals to 4.
If we discretize the attribute into 4 bins using the equal frequency
approach, what are the bin values after discretization? Hint: you
need to refer to the cumulative distribution table for Poisson dis-
tribution to answer the question.
Answer: Choose the bin values such that the cumulative distribu-
tion is close to 0.25, 0.5, and 0.75. This corresponds to bin values:
0 to 2, 3, 4 to 5, and greater than 5.
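
The Poisson(4) cumulative distribution used in the hint can be computed directly (no statistics toolbox needed); a minimal sketch:

lambda = 4;
k = 0:10;
cdf = cumsum(exp(-lambda) * lambda.^k ./ factorial(k));
disp([k; cdf])
% The CDF at k = 2, 3, 5 is about 0.238, 0.433, 0.785, which yields the
% equal-frequency bins {0-2}, {3}, {4-5}, {>5} quoted above.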

5. Null values in data records may refer to missing or inapplicable values. Consider the following table of employees for a hypothetical organization:

Name   Sales commission   Occupation
John   5000               Sales
Mary   1000               Sales
Bob    null               Non-sales
Lisa   null               Non-sales

The null values in the table refer to inapplicable values, since sales commissions are calculated for sales employees only. Suppose we are interested in computing the similarity between employees based on their sales commission.

(a) Explain the limitation of this similarity computation if we replace the null values in sales commission by 0.
Answer: Mary will be more similar to Bob and Lisa than to John.


(b) Explain the limitation of this similarity computation if we replace the null values in sales commission by the average value of sales commission (i.e., 3000).
Answer: Both Mary and John are less similar to each other than to Bob and Lisa.
(c) Propose a method that can handle null values in the sales commis-
sion so that employees that have the same occupation are closer to
each other than to employees that have different occupations.
Answer: One way is to change the similarity function as follows:

Similarity(a, b) =
    ∞,        if both a and b are null;
    0,        if exactly one of a and b is null;
    s(a, b),  otherwise,

where s(a, b) is the original similarity measure used for the sales commission.
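
A sketch of this measure in Matlab, encoding null as NaN; the base similarity s(a, b) below is a placeholder of our own choosing:

% Save as commission_similarity.m (or as a local function, R2016b+).
function sim = commission_similarity(a, b)
    s = @(p, q) 1 / (1 + abs(p - q));   % placeholder base similarity
    if isnan(a) && isnan(b)
        sim = Inf;                      % both null: same occupation group
    elseif isnan(a) || isnan(b)
        sim = 0;                        % exactly one null: different groups
    else
        sim = s(a, b);                  % both applicable: usual measure
    end
end

With this definition, Bob and Lisa (both null) are maximally similar, while Mary remains closer to John than to Bob or Lisa.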

6. Consider a data set from an online social media Web site that contains
information about the age and number of friends for 5,000 users.

(a) Suppose the number of friends for each user is known, but only 4,000 out of the 5,000 users provide their age information. The average age of the 4,000 users is 30 years old. If you replace the missing values for age with the value 30, will the average age computed for the 5,000 users increase, decrease, or stay the same (as 30)?
Answer: Average age does not change.

x̄_old = (1/4000) Σ_{i=1}^{4000} x_i

x̄_new = (1/5000) Σ_{i=1}^{5000} x_i = (1/5000) [ Σ_{i=1}^{4000} x_i + Σ_{i=4001}^{5000} x_i ]

Since x_i = x̄_old for i = 4001, 4002, ..., 5000 and Σ_{i=1}^{4000} x_i = 4000 x̄_old, we have

x̄_new = (1/5000) (4000 x̄_old + 1000 x̄_old) = x̄_old


(b) Suppose the covariance between age and number of friends calculated using the 4,000 users (with no missing values) is 20. If you replace the missing values for age with the average age of the 4,000 users, will the covariance between age and number of friends increase, decrease, or stay the same (as 20)? Assume that the average number of friends for all 5,000 users is the same as the average for the 4,000 users.
Answer: Covariance will decrease. Let C1 = (1/3999) Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) be the covariance computed using the 4,000 users without missing values. If we impute the missing values for age with the average age, x̄ remains unchanged according to part (a). Furthermore, ȳ is assumed to be unchanged. Thus, the new covariance is

C2 = (1/4999) Σ_{i=1}^{5000} (x_i − x̄)(y_i − ȳ)
   = (1/4999) [ Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) + Σ_{i=4001}^{5000} (x_i − x̄)(y_i − ȳ) ]
   = (1/4999) [ Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) + Σ_{i=4001}^{5000} (x̄ − x̄)(y_i − ȳ) ]
   = (1/4999) Σ_{i=1}^{4000} (x_i − x̄)(y_i − ȳ) < C1     (2.2)

7. Consider the following data matrix, in which two of its values are missing (left: true values; right: observed matrix with missing entries marked ?):

  True matrix              Observed matrix
  −0.2326   0.2270         −0.2326   0.2270
  −0.0847   0.7125         −0.0847   0.7125
   0.1275   0.3902          0.1275   0.3902
   0.1329  −0.1461             ?    −0.1461
   0.3724   0.1756          0.3724   0.1756
   0.4975   0.8536          0.4975   0.8536
   0.6926   0.7834          0.6926   0.7834
   0.7933   0.7375          0.7933   0.7375
   0.8229   0.2147          0.8229   0.2147
   0.8497   0.4980          0.8497   0.4980
   1.0592   0.7600          1.0592      ?
   1.5028   1.0122          1.5028   1.0122


(a) Impute the missing values in the matrix on the right by their respective column averages. Show the imputed values and calculate the root-mean-square error (RMSE):

RMSE = √{ [ (A_{4,1} − Ã_{4,1})² + (A_{11,2} − Ã_{11,2})² ] / 2 }

where A_{i,j} denotes the true value of the (i, j)-th element of the data matrix and Ã_{i,j} denotes its corresponding imputed value.
Answer: The column averages are [0.5819, 0.4962]. The imputed matrix is

  −0.2326   0.2270
  −0.0847   0.7125
   0.1275   0.3902
   0.5819  −0.1461
   0.3724   0.1756
   0.4975   0.8536
   0.6926   0.7834
   0.7933   0.7375
   0.8229   0.2147
   0.8497   0.4980
   1.0592   0.4962
   1.5028   1.0122

and the RMSE value is

RMSE = √{ [ (0.1329 − 0.5819)² + (0.7600 − 0.4962)² ] / 2 } = 0.3683
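
The computation in part (a) takes only a few lines of Matlab, with the missing entries encoded as NaN:

A = [-0.2326  0.2270; -0.0847  0.7125;  0.1275  0.3902;
        NaN  -0.1461;  0.3724  0.1756;  0.4975  0.8536;
      0.6926  0.7834;  0.7933  0.7375;  0.8229  0.2147;
      0.8497  0.4980;  1.0592     NaN;  1.5028  1.0122];
B = A;
for j = 1:size(A, 2)
    col = A(:, j);
    col(isnan(col)) = mean(col(~isnan(col)));  % impute by column average
    B(:, j) = col;
end
truth   = [0.1329; 0.7600];                    % true values of the two holes
imputed = [B(4, 1); B(11, 2)];
rmse = sqrt(mean((truth - imputed).^2))        % = 0.3683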

(b) The Expectation-Maximization (E-M) algorithm is a well-known approach for imputing missing values. Assuming the data is generated from a multivariate Gaussian distribution, E-M iteratively computes the following conditional mean for each attribute and uses it to impute the missing values:

μ_{i|j} = μ̂_i + Σ_{ij} Σ_{jj}^{−1} (x_j − μ̂_j)

where the indices i, j ∈ {1, 2} refer to the two attributes of the data and Σ^{−1} denotes the inverse of the covariance matrix. Repeat the previous question by applying the E-M algorithm iteratively for

5 times. Assume the covariance matrix of the data is known and given by

Σ = [ 0.25   0.1  ]
    [ 0.1    0.15 ]

In the first iteration, compute the mean value for each column using only the non-missing values. In subsequent iterations, compute the mean value for each column using both the non-missing and imputed values. Show the imputed values after each iteration and compute the root-mean-square error. Compare the error against the answer in part (a).
Answer:
The inverse of the covariance matrix is

Σ^{−1} = [  5.4545   −3.6364 ]
         [ −3.6364    9.0909 ]

The results after each iteration are shown below:

Iteration   μ̂1       μ̂2       Imputed x_{4,1}   Imputed x_{11,2}   RMSE
1           0.5819   0.4962   0.2315            0.9301             0.1390
2           0.5527   0.5324   0.1826            0.9928             0.1683
3           0.5486   0.5376   0.1756            1.0018             0.1736
4           0.5480   0.5384   0.1746            1.0030             0.1743
5           0.5479   0.5385   0.1745            1.0032             0.1745

The root-mean-square error for the E-M algorithm is considerably lower than that of mean imputation.

8. The purpose of this exercise is to illustrate the relationship between PCA and SVD. Let A be an N × d rectangular data matrix and C be its d × d covariance matrix.

(a) Suppose I_N is an N × N identity matrix and 1_N is an N × N matrix whose elements are all equal to 1, i.e., ∀i, j : (1_N)_{ij} = 1. Show that the covariance matrix C can be expressed in the following form:

C = (1/(N−1)) A^T (I_N − (1/N) 1_N) A


Answer: The covariance between columns i and j of matrix A is given by

C_{ij} = [ Σ_k (A_{ki} − Ā_i)(A_{kj} − Ā_j) ] / (N − 1),     (2.3)

where Ā_i and Ā_j are the corresponding column averages. A matrix of column averages for A can be computed as follows:

(1/N) 1_N A = [ (1/N) Σ_j A_{j1}   (1/N) Σ_j A_{j2}   ⋯   (1/N) Σ_j A_{jd} ]
              [ (1/N) Σ_j A_{j1}   (1/N) Σ_j A_{j2}   ⋯   (1/N) Σ_j A_{jd} ]
              [        ⋮                  ⋮            ⋱        ⋮          ]
              [ (1/N) Σ_j A_{j1}   (1/N) Σ_j A_{j2}   ⋯   (1/N) Σ_j A_{jd} ]     (2.4)

i.e., every row of (1/N) 1_N A contains the column averages of A.

Thus, each term (A_{ki} − Ā_i) in Equation (2.3) can be expressed in matrix notation as A_{ki} − (1/N) Σ_j A_{ji} = [A − (1/N) 1_N A]_{ki}. The covariance matrix C can therefore be computed as follows:

C = (1/(N−1)) (A − (1/N) 1_N A)^T (A − (1/N) 1_N A)
  = (1/(N−1)) [(I_N − (1/N) 1_N) A]^T [(I_N − (1/N) 1_N) A]
  = (1/(N−1)) A^T (I_N − (1/N) 1_N)^T (I_N − (1/N) 1_N) A     (2.5)

where we have used the property of matrix transpose (XY)^T = Y^T X^T on the last line. Furthermore, since the identity matrix and the matrix of all ones are symmetric, i.e., I_N^T = I_N and 1_N^T = 1_N, it follows that (I_N − (1/N) 1_N)^T = I_N − (1/N) 1_N. Finally, it can be shown that the matrix I_N − (1/N) 1_N is idempotent, which means it is the

same as the square of the matrix:

(I_N − (1/N) 1_N)(I_N − (1/N) 1_N) = I_N − (2/N) 1_N + (1/N²) 1_N 1_N
                                   = I_N − (2/N) 1_N + (1/N²) N 1_N
                                   = I_N − (2/N) 1_N + (1/N) 1_N
                                   = I_N − (1/N) 1_N,     (2.6)

where 1_N 1_N = N 1_N is an N × N matrix whose elements are all equal to N. Substituting (2.6) into (2.5), we obtain:

C = (1/(N−1)) A^T (I_N − (1/N) 1_N) A     (2.7)

(b) Using singular value decomposition, the matrix A can be factorized as A = UΣV^T, where U is the N × N left singular matrix, Σ is the N × d matrix containing the singular values, and V is the d × d right singular matrix. Similarly, using eigenvalue decomposition, the covariance matrix can be factorized as C = XΛX^T. Show that the relationship between SVD and PCA is given by the following equation:

VΣ²V^T − (1/N) A^T 1_N A = (N − 1) XΛX^T

Answer: From the previous question, we can write:

C = (1/(N−1)) A^T (I_N − (1/N) 1_N) A = (1/(N−1)) [ A^T A − (1/N) A^T 1_N A ]     (2.8)
Since A = UΣV^T and U is an orthogonal matrix,

A^T A = [UΣV^T]^T [UΣV^T] = VΣ^T U^T UΣV^T = VΣ^T ΣV^T.

If N > d, then Σ has N − d rows of all zeros. If we remove these rows, Σ becomes a d × d square matrix and Σ^T Σ = Σ². By substituting C = XΛX^T and A^T A = VΣ²V^T into Equation (2.8), we have:

XΛX^T = (1/(N−1)) [ VΣ²V^T − (1/N) A^T 1_N A ].


(c) Find the relationship between the right singular matrix V and the matrix of principal components X if the data matrix A has been column-centered (i.e., the column mean has been subtracted from every column of A) before applying SVD.
Answer: If the matrix A has been column-centered, then its column means are all zero, which means A^T 1_N is a matrix of all zeros. Thus, the last equation in the previous question reduces to:

XΛX^T = (1/(N−1)) VΣ²V^T.

This shows that the right singular matrix V corresponds to the matrix of principal components X, while the squared singular values equal (N − 1) times the eigenvalues.
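
The relationship in part (c) is easy to verify numerically; a minimal sketch with a small random matrix (dimensions chosen arbitrarily):

N = 8;  d = 3;
A = randn(N, d);
Ac = A - repmat(mean(A), N, 1);     % column-center the data
C = (Ac' * Ac) / (N - 1);           % covariance matrix
[~, S, V] = svd(Ac, 'econ');        % SVD of the centered matrix
lam = eig(C);                       % eigenvalues of C
sort(lam, 'descend')'               % should match the next line...
(diag(S).^2 / (N - 1))'             % ...squared singular values / (N - 1)
% The columns of V match the eigenvectors of C up to sign and ordering.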

9. Principal component analysis (PCA) can be used for image compression by transforming a high-resolution image into its lower-rank approximation. In this exercise, you are provided with three images of size 1080 × 1920 pixels each (the filenames are img1.jpg, img2.jpg, and img3.jpg).

Figure 2.1. Image data set: (a) img1, (b) img2, (c) img3.

You will use Matlab to apply PCA to each of the following images.

(a) Load each image using the imread command. For example:

matlab> A = imread(’img1.jpg’);

(b) Plot the image in gray scale.

matlab> imagesc(A);
matlab> colormap(gray);


Answer: See Figure 2.1.


(c) Apply principal component analysis to obtain a reduced rank ap-
proximation of the image.
For example, to obtain a rank-10 approximation (i.e., using the first
10 principal components), use the following commands:
matlab> A = double(A); % convert A from uint8 to double format
matlab> [U,V] = princomp(A); % apply PCA: U holds the component coefficients,
matlab> % V the scores (princomp is superseded by pca in newer Matlab releases)
matlab> rank = 10; % set rank to be 10
matlab> B = V(:,1:rank)*U(:,1:rank)'; % B is the compressed image of A
matlab> figure;
matlab> imagesc(B);
matlab> colormap(gray);
For each image, vary the rank (i.e., number of principal compo-
nents) as follows: 10, 30, 50, and 100. Save each image as follows:

matlab> saveas(gcf, ’filename.jpg’, ’jpeg’);

Insert the compressed (reduced-rank) images into the solution file of your homework (don't submit the jpg files individually).
Answer: See Figure 2.2.
(d) Compare the size of matrix A (in bytes) to the total size of matrices U and V (in bytes). Compute the compression ratio

Compression ratio = Size of matrix A / (Size of matrix U + Size of matrix V)

for each reduced rank (10, 30, 50, 100) of the images. You can use the whos command to determine the size of the matrices:
matlab> whos A U V
Answer: See Table 2.1.

rank   size of A   size of U   size of V   compression ratio
10     16588800    153600      86400       69.12
30     16588800    460800      259200      23.04
50     16588800    768000      432000      13.824
100    16588800    1536000     864000      6.912

Table 2.1. Compression ratio at various ranks
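
As a sanity check on the first row of Table 2.1 (assuming 8-byte double entries, as reported by whos): A occupies 1080 × 1920 × 8 = 16,588,800 bytes, the rank-10 slice of U occupies 1920 × 10 × 8 = 153,600 bytes, the rank-10 slice of V occupies 1080 × 10 × 8 = 86,400 bytes, and 16588800/(153600 + 86400) = 69.12.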


Figure 2.2. Reduced-rank images using PCA: ranks 10, 30, 50, and 100 (rows) for img1, img2, and img3 (columns).

(e) Compute the reconstruction error ‖A − B‖_F of each reduced-rank image, where ‖·‖_F denotes the Frobenius norm of a matrix. Note that the higher the reconstruction error, the lower the quality of the compressed image. Plot a graph of reconstruction error (y-axis) versus compression ratio (x-axis) for each image.
Answer: See Table 2.2 and Figure 2.3.
image   rank   reconstruction error
img1    10     4.9565 × 10^4
img1    30     3.7198 × 10^4
img1    50     3.0998 × 10^4
img1    100    2.2135 × 10^4
img2    10     1.7798 × 10^4
img2    30     1.2190 × 10^4
img2    50     1.0236 × 10^4
img2    100    7.4063 × 10^3
img3    10     3.9544 × 10^3
img3    30     3.1775 × 10^3
img3    50     2.8146 × 10^3
img3    100    2.2397 × 10^3

Table 2.2. Reconstruction error for various images

Figure 2.3. Reconstruction error versus compression ratio: (a) img1, (b) img2, (c) img3

(f) State the minimum number of principal components (10, 30, 50, or 100) needed to (visually) retain most of the salient features of each image (i.e., the city square in img1.jpg, the shape of the face in img2.jpg, and the shape of the apple in img3.jpg). Which image requires the fewest principal components? Which image requires the most?
Answer:
img1.jpg: 50 components
img2.jpg: 30 components
img3.jpg: 10 components


2.3 Measures of Similarity and Dissimilarity


1. Consider the following binary vectors:
x1 = (1, 1, 1, 1, 1)
x2 = (1, 1, 1, 0, 0)
y1 = (0, 0, 0, 0, 0)
y2 = (0, 0, 0, 1, 1)

(a) According to the Jaccard coefficient, which pair of vectors—(x1, x2) or (y1, y2)—is more similar?
Answer:
Jaccard(x1, x2) = 3/5 = 0.6.
Jaccard(y1, y2) = 0/2 = 0.
Therefore, according to the Jaccard coefficient, (x1, x2) are more similar.

(b) According to the simple matching coefficient, which pair of vectors—(x1, x2) or (y1, y2)—is more similar?
Answer:
SMC(x1, x2) = 3/5 = 0.6.
SMC(y1, y2) = 3/5 = 0.6.
Therefore, according to the simple matching coefficient, the two pairs are equally similar.

(c) According to Euclidean distance, which pair of vectors—(x1, x2) or (y1, y2)—is more similar?
Answer:
Euclidean(x1, x2) = √2 = 1.4142.
Euclidean(y1, y2) = √2 = 1.4142.
Therefore, according to Euclidean distance, the two pairs are equally similar.
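
These values can be reproduced with a few anonymous functions in Matlab:

x1 = [1 1 1 1 1];  x2 = [1 1 1 0 0];
y1 = [0 0 0 0 0];  y2 = [0 0 0 1 1];
jac = @(a, b) sum(a & b) / max(sum(a | b), 1);  % guard against 0/0
smc = @(a, b) sum(a == b) / numel(a);
euc = @(a, b) sqrt(sum((a - b).^2));
[jac(x1, x2)  jac(y1, y2)]    % 0.6000       0
[smc(x1, x2)  smc(y1, y2)]    % 0.6000  0.6000
[euc(x1, x2)  euc(y1, y2)]    % 1.4142  1.4142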

2. Consider a weighted, undirected graph G (see Figure 2.4 as an example). Let e(u, v) be the weight of the edge between nodes u and v, where e(u, u) = 0 and e(u, v) = ∞ if there is no edge between u and v. Assume the


graph is a connected component, i.e., there exists a path between every two nodes. Suppose the path length d(u, v) is defined as follows:

d(u, v) =
    0,                                    if u = v;
    e(u, v),                              if there is an edge between u and v;
    min_{w ≠ u, v} [ d(u, w) + d(w, v) ],   otherwise.

Is d(u, v) a metric? State your reasons clearly. (Check whether the positivity, symmetry, and triangle inequality properties are preserved.)

Figure 2.4. Weighted undirected graph.

Answer:

(a) The positivity property is preserved by definition, since d(u, u) = 0 and d(u, v) > 0 if u ≠ v.
(b) The symmetry property is preserved since the graph is undirected.
(c) The triangle inequality is not preserved: when a direct edge exists between two nodes, d(u, v) uses its weight even if a shorter path exists through an intermediate node. In Figure 2.4, a counter-example is d(K, J) > d(K, I) + d(I, J).

Therefore d(u, v) is not a metric.

3. For document analysis, numerous measures have been proposed to determine the semantic similarity between two words using a domain ontology such as WordNet. For example, words such as dog and cat have higher semantic similarity than dog and money (since the former pair refers to two types of carnivores). Figure 2.5 below shows an example of computing the Wu-Palmer similarity between dog and cat based on their path length in the WordNet hypernym hierarchy. The depth h refers to the length of the shortest path from the root to the lowest common hypernym of the two words (e.g., carnivore for the word pair dog and cat), whereas k is the minimum path length between the two words.

Figure 2.5. Sample of the hypernym hierarchy in WordNet.

The Wu-Palmer similarity measure is defined as follows:

W = 2h / (k + 2h)

For example,¹ for dog and cat, W = 26/(4 + 26) = 0.867 (h = 13, k = 4), whereas for dog and money, W = 4/(19 + 4) = 0.174 (h = 2, k = 19).

(a) What are the maximum and minimum possible values of the Wu-Palmer similarity?

¹In this simplified example, we assume each word has exactly 1 sense. In general, a word can have multiple senses. As a result, the Wu-Palmer measure is given by the highest similarity that can be achieved using one of its possible senses.


Answer: Maximum value is 1; minimum value approaches 0.


(b) Let 1 − W be the Wu-Palmer distance measure.
• Does 1 − W satisfy the positivity property?
Answer: Yes. Since 1 − W = k/(k + 2h), which is 0 only when k = 0, d(u, v) = 0 if and only if u = v.
• Does 1 − W satisfy the symmetry property?
Answer: Yes, because W is a symmetric measure.
• Does 1 − W satisfy the triangle inequality property?
Answer: No, because each node can have more than one path to the root, some shorter than others. For example, the words (money, statute) are very dissimilar to each other, but (money, bill) and (bill, statute) are very similar, thus violating the triangle inequality. The actual paths for these words in the WordNet ontology are shown in Figure 2.6.

Figure 2.6. Sample of the hypernym hierarchy in WordNet.

4. Suppose you are given census data, where every data object corresponds to a household and the following continuous attributes are used to characterize each household: total household income, number of household residents, property value, number of bedrooms, and number of vehicles owned. Suppose we are interested in clustering the households based on these attributes.


(a) Explain why cosine is not a good measure for clustering the data.
Answer: These attributes are all numerical and can have widely varying ranges of values, depending on the scale used to measure them. As a result, the cosine measure will be dominated by the attributes with the largest magnitudes (e.g., total household income and property value).
(b) Explain why correlation is not a good measure for clustering the data.
Answer: The same argument as part (a). Because each attribute has a different range, correlating the data objects is meaningless.
(c) Explain what preprocessing steps and corresponding proximity measure you should use to do the clustering.
Answer: Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.
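
A minimal sketch of this preprocessing; the household values below are made up for illustration:

% Rows: households; columns: income, residents, property value,
% bedrooms, vehicles (hypothetical numbers).
H = [52000 3 180000 3 2;
     87000 4 320000 4 3;
     23000 1  90000 1 1];
Z = (H - repmat(mean(H), size(H, 1), 1)) ./ repmat(std(H), size(H, 1), 1);
d12 = norm(Z(1, :) - Z(2, :))    % distance between households 1 and 2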
5. Consider the following distance measure:

d(x, y) = 1 − c(x, y),

where c(x, y) is the cosine similarity between two data objects, x and
y. Does the distance measure satisfy the positivity, symmetry, and tri-
angle inequality properties? For each property, show your steps clearly.
Assume x and y are non-negative vectors (e.g., term vectors for a pair
of documents).
Answer:
(a) Positivity. You need to show that ∀x, y : d(x, y) = 1 − (x · y)/(‖x‖‖y‖) ≥ 0 and that d(x, y) = 0 if and only if x = y.
By definition, x · y = ‖x‖‖y‖ cos θ, where θ is the angle between x and y. Since cos θ ≤ 1 (from trigonometry),

d(x, y) = 1 − (x · y)/(‖x‖‖y‖) = 1 − (‖x‖‖y‖ cos θ)/(‖x‖‖y‖) = 1 − cos θ ≥ 0,

which completes the first part of the proof.


If x = y, then

d(x, y) = 1 − (x · x)/(‖x‖‖x‖) = 1 − (‖x‖‖x‖ cos 0)/(‖x‖‖x‖) = 0.


However, if d(x, y) = 0, then

1 − cos θ = 0 ⇒ cos θ = 1 ⇒ θ = 0

In other words, as long as x and y are co-linear, d(x, y) = 0 (even when x ≠ y). The distance measure therefore does not satisfy the positivity property.
(b) Symmetry. Because x · y = y · x,

d(x, y) = 1 − (x · y)/(‖x‖‖y‖) = 1 − (y · x)/(‖y‖‖x‖) = d(y, x)

Hence, the distance measure satisfies the symmetry property.


(c) Triangle Inequality. First, note that cos θ decreases with increasing θ for 0 ≤ θ ≤ π/2 (we focus only on this range of values for θ because the vectors are non-negative). Since the distance measure d(x, y) = 1 − cos θ depends on the angle between the two vectors x and y, the larger the angle, the larger the distance. We can show that the distance measure violates the triangle inequality by choosing the angles in such a way that d(x, z) > d(x, y) + d(y, z). Consider the situation shown in Figure 2.7 below.

Figure 2.7. Triangle inequality violation example

In this case, we have d(x, z) = 1 − cos θ and d(x, y) = d(y, z) = 1 − cos(θ/2). From trigonometric identities, cos θ = 2 cos²(θ/2) − 1. Therefore, d(x, z) = 1 − cos θ = 2 − 2 cos²(θ/2). On the other hand, d(x, y) + d(y, z) = 2 − 2 cos(θ/2). Since cos²(θ/2) < cos(θ/2) as long as 0 < cos(θ/2) < 1, we have found a counter-example where

d(x, z) = 2 − 2 cos²(θ/2) > d(x, y) + d(y, z) = 2 − 2 cos(θ/2).

Here is a simple example. Suppose we have 3 documents and 2 words, data and mining. Document x contains the word data only and document z contains the word mining only. Document y contains both words. We can represent the documents as follows:

x = (1, 0), y = (1, 1), z = (0, 1).

In this case, d(x, z) = 1 and d(x, y) = d(y, z) = 1 − 1/√2 = 0.2929. Therefore, d(x, z) > d(x, y) + d(y, z), which is a violation of the triangle inequality.
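
The document example can be checked directly:

x = [1 0];  y = [1 1];  z = [0 1];
dcos = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));
dcos(x, z)                 % 1.0000
dcos(x, y) + dcos(y, z)    % 0.5858 < 1, so the triangle inequality fails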

6. Consider a database of web graphs. Each graph is unweighted and contains a set of nodes and directed edges. A node corresponds to a web page, while an edge is a transition from one page to another when a user clicks on a hyperlink or enters the URL directly into the location bar of the Web browser. Each web graph represents the Web session of a user. Consider the following approaches for defining the similarity between two Web sessions, s1 and s2.

Approach 1: Node-based similarity

Simn(s1, s2) = [ Σ_i I(wi ∈ s1) × I(wi ∈ s2) ] / max(|s1|, |s2|),

where wi is a web page, |si| is the number of web pages visited during session si, max(a, b) is a function that returns the maximum of a and b, and I(wi ∈ sj) is an indicator function whose value is 1 if session sj visited web page wi and 0 otherwise.
Approach 2: Link-based similarity

Siml(s1, s2) = [ Σ_{i,j} I(wi → wj ∈ s1) × I(wi → wj ∈ s2) ] / max(|s1|, |s2|),


where wi → wj is a transition from page wi to wj, |si| is the number of transitions in session si, max(a, b) is a function that returns the maximum of a and b, and I(wi → wj ∈ sk) is an indicator function whose value is 1 if session sk contains a transition from web page wi to wj and 0 otherwise.

(a) Consider the following two Web sessions: s1 = (A → B → C → B → D → E) and s2 = (A → C → B → E). Compute the node-based and link-based similarities for the Web graphs constructed from the two sessions.
Answer: Simn(s1, s2) = 4/5 and Siml(s1, s2) = 1/5.
(b) Suppose the node-based similarity for s1 and s2 equals 1. Can the web graphs for s1 and s2 be different? State your reasons clearly.
Answer: Yes. As long as both graphs contain the same set of nodes, the node-based similarity is equal to 1. But the graphs may still be different because their links could differ.
(c) Suppose Siml (s1 , s2 ) = 1 according to approach 2. Can the web
graphs for s1 and s2 be different? State your reasons clearly.
Answer: No. The web graphs are the same because all the node
transitions in s1 must also be present in s2 , and vice-versa.
(d) Which approach do you think is more effective at measuring similarity between two web sessions? State your reasons clearly.
Answer: Link-based similarity is more effective because its value is 1 only if the two web graphs contain exactly the same set of transitions.
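
A sketch reproducing part (a); encoding each transition as a string such as 'A>B' is our own device:

s1_nodes = {'A','B','C','D','E'};
s2_nodes = {'A','C','B','E'};
s1_links = {'A>B','B>C','C>B','B>D','D>E'};
s2_links = {'A>C','C>B','B>E'};
sim_n = numel(intersect(s1_nodes, s2_nodes)) / ...
        max(numel(s1_nodes), numel(s2_nodes))    % 4/5
sim_l = numel(intersect(s1_links, s2_links)) / ...
        max(numel(s1_links), numel(s2_links))    % 1/5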

7. Consider the following distance measure D between two clusters of data points, X and Y:

D(X, Y) = min{ d(x, y) : x ∈ X, y ∈ Y },

where d(x, y) is the Euclidean distance between two data points, x and
y. Intuitively, D measures the distance between clusters in terms of
the closest two points from each cluster (see Figure 2.8). Does the dis-
tance measure satisfy the positivity, symmetry, and triangle inequality
properties? For each property, show your proof clearly or give a counter-
example if the property is not satisfied.
Answer:


Figure 2.8. Cluster distance measure

(a) Positivity: Since the Euclidean distance between any two data points is always non-negative, D(X, Y) ≥ 0. However, D(X, Y) can be zero even when X ≠ Y if a data point is assigned to both clusters X and Y (i.e., if overlapping clusters are allowed). So the distance measure satisfies the positivity property for disjoint clusters but not for overlapping clusters.
(b) Symmetry: Since Euclidean distance is a symmetric measure, D(X, Y) = min{d(x, y) : x ∈ X, y ∈ Y} = min{d(y, x) : x ∈ X, y ∈ Y} = D(Y, X). Thus, the measure is symmetric.
(c) Triangle Inequality: The triangle inequality can be violated. A counter-example is shown in Figure 2.9.

Figure 2.9. Violation of triangle inequality

8. For this question, assume each object is characterized by a set of continuous-valued attributes.

(a) If two objects have a cosine similarity of 1, must their attribute values be identical? Explain.


Answer: No. A cosine similarity of 1 simply implies that the two attribute vectors are parallel to each other. For example, when x = (1, 2) and y = (2, 4), their cosine similarity is 1.
(b) If two objects have a correlation value of 1, must their attribute
values be identical? Explain.
Answer: No. A correlation value of 1 simply implies that there is a
linear relationship between the two attribute vectors. For example,
when x = (1, 2) and y = (3, 5), then their correlation is 1.
(c) If two objects have a Euclidean distance of 0, must their attribute values be identical? Explain.
Answer: Yes. Consider a pair of objects with attribute vectors x and y. Their Euclidean distance is d(x, y) = √(Σ_i (x_i − y_i)²) = 0, which is true only if x_i = y_i for all i.
(d) Let x and y be the attribute vectors of two objects. State whether the following proximity measures—cosine, correlation, and Euclidean distance—are invariant (unchanged) under the following transformations. Specifically, if x → x′ and y → y′, would cosine(x, y) = cosine(x′, y′), correlation(x, y) = correlation(x′, y′), and Euclidean(x, y) = Euclidean(x′, y′)?
i. Translation: x → x + c and y → y + c, where c is a constant added to each attribute value in x and y.
Answer: Cosine is not invariant, because

cosine(x + c, y + c) = Σ_i (x_i + c)(y_i + c) / [ √(Σ_i (x_i + c)²) √(Σ_i (y_i + c)²) ] ≠ Σ_i x_i y_i / [ √(Σ_i x_i²) √(Σ_i y_i²) ]

unless c = 0. Euclidean distance is invariant, since √(Σ_i [(x_i + c) − (y_i + c)]²) = √(Σ_i (x_i − y_i)²). Correlation is also invariant: when x → x + c, the mean shifts to x̄ + c but the standard deviation is unchanged, since σ_{x+c} = √(Σ_i (x_i + c − x̄ − c)²) = √(Σ_i (x_i − x̄)²) = σ_x. Thus,

correlation(x + c, y + c) = Σ_i (x_i + c − x̄ − c)(y_i + c − ȳ − c) / (σ_x σ_y) = Σ_i (x_i − x̄)(y_i − ȳ) / (σ_x σ_y) = correlation(x, y).
ii. Scaling: x → cx and y → cy, where c is a constant multiplied with each attribute value in x and y.
Answer: Cosine is invariant, because

Σ_i (cx_i)(cy_i) / [ √(Σ_i (cx_i)²) √(Σ_j (cy_j)²) ] = Σ_i x_i y_i / [ √(Σ_i x_i²) √(Σ_j y_j²) ].

Correlation is also invariant, because when x → cx, both the mean and the standard deviation are re-scaled by the same factor: x̄′ = (Σ_i cx_i)/n = cx̄ and σ_{cx} = √(Σ_i (cx_i − cx̄)²) = |c| σ_x. Euclidean distance is not invariant, because √(Σ_i (cx_i − cy_i)²) = |c| √(Σ_i (x_i − y_i)²).


iii. Standardization: x → (x − c)/d and y → (y − c)/d, where c and d are constants.
Answer: Standardization is a combination of translation (by the mean of the vector) and scaling (by the standard deviation). Since correlation is invariant with respect to both operations, it is also invariant with respect to standardization. However, cosine and Euclidean distance are not invariant, since each is not preserved by one of the two operations.
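
A numerical spot-check of these claims, with vectors chosen arbitrarily:

x = [1 2 3 5];  y = [2 1 4 4];  c = 3;
cosim = @(a, b) dot(a, b) / (norm(a) * norm(b));
corrv = @(a, b) dot(a - mean(a), b - mean(b)) / ...
        (norm(a - mean(a)) * norm(b - mean(b)));
euc = @(a, b) norm(a - b);
% Translation: correlation and Euclidean unchanged, cosine changes.
[corrv(x, y)  corrv(x + c, y + c)]
[euc(x, y)    euc(x + c, y + c)]
[cosim(x, y)  cosim(x + c, y + c)]
% Scaling: cosine and correlation unchanged, Euclidean scales by c.
[cosim(x, y)  cosim(c * x, c * y)]
[euc(x, y)    euc(c * x, c * y)]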

9. Consider the following survey data about users who joined an online
community. The sample covariance between the user’s height (in mm)
and number of years being a member of the community is 5.0.

(a) Suppose the sample covariance between the user’s age and number
of years being a member of the community is only 0.5. Does this
imply that user’s height is more correlated with number of years in
the community than user’s age? Answer yes or no and explain your
reasons clearly.
Answer: No. Covariance is not a dimensionless quantity, so its
magnitude depends on the scale of measurement.
(b) Suppose the height attribute is re-defined as height above the av-
erage for all users who participated in the survey. For example, a
user who is 1650 mm tall has a height value of -50 mm (assuming
the average height of all users is 1700 mm). Would the covariance
between the re-defined height attribute and number of years in the
community be greater than, smaller than, or equal to 5.0?
Answer: Equal. Let x_h denote the height attribute and x_y the number of years in the community. The sample covariance between the two attributes is given by:

Σ_{x_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y),

where x̄_h and x̄_y are the average height and average number of years, respectively. If we re-define the height attribute as x′_h = x_h − x̄_h,

then x̄′_h = 0. Hence, the covariance between x′_h and x_y becomes

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h − 0)(x_{iy} − x̄_y)
             = Σ_{x_h, x_y}

This result means centering the height attribute has no effect on its covariance with other attributes.
(c) If the measurement unit for height is converted from mm to inches
(where 1 inch = 25.4 mm), will the covariance between height (in
inches) and number of years in the community be greater than,
smaller than, or equal to 5.0?
Answer: Re-scaling the height attribute is equivalent to multiplying the original attribute by a constant C, i.e., x′_h = Cx_h. Furthermore, x̄′_h = Cx̄_h. Thus the covariance between the rescaled height and the number of years in the community will be:

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (Cx_{ih} − Cx̄_h)(x_{iy} − x̄_y)
             = (C/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y)
             = CΣ_{x_h, x_y}

In this case, C = 1/25.4, which is smaller than 1. Therefore, the covariance value will be smaller when the unit is converted from mm to inches.
(d) Suppose you standardize both the height and number of years in
the community attributes (by subtracting their respective means
and dividing by their corresponding standard deviations). Would
their covariance value be greater than, smaller than, or equal to


5.0? To obtain full credit, you must prove your answer by showing the computations clearly.
Answer: The re-defined attributes after standardization are x′_h = (x_h − x̄_h)/σ_h and x′_y = (x_y − x̄_y)/σ_y. Furthermore, x̄′_h = 0 and x̄′_y = 0. Then,

Σ_{x′_h, x′_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x′_{iy} − x̄′_y)
             = (1/(N−1)) Σ_{i=1}^{N} [(x_{ih} − x̄_h)/σ_h] [(x_{iy} − x̄_y)/σ_y]
             = [ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) ] / (σ_h σ_y)
             = Σ_{x_h, x_y} / (σ_h σ_y)     (2.9)

Note that Σ_{x_h, x_y}/(σ_h σ_y) is the correlation coefficient between x_h and x_y. Since the correlation coefficient is always less than or equal to 1, whereas the original covariance value is +5, the covariance value is smaller after standardization.
Next, we prove that the correlation coefficient is always between −1 and +1. First, note that

σ_h = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)² ],    σ_y = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{iy} − x̄_y)² ].

Thus, Equation (2.9) can be re-written as follows:

Σ_{x′_h, x′_y} = Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) / √[ (Σ_{i=1}^{N} (x_{ih} − x̄_h)²)(Σ_{i=1}^{N} (x_{iy} − x̄_y)²) ]     (2.10)

Let h_i = x_{ih} − x̄_h and y_i = x_{iy} − x̄_y. Equation (2.10) becomes

Σ_{x′_h, x′_y} = Σ_{i=1}^{N} h_i y_i / √[ (Σ_{i=1}^{N} h_i²)(Σ_{i=1}^{N} y_i²) ] = (h · y) / (‖h‖‖y‖)     (2.11)
According to the Cauchy-Schwarz inequality, for any vectors h and y, we have

h · y ≤ ‖h‖‖y‖.

Thus the ratio on the right-hand side of Equation (2.11) is less than or equal to 1, which completes the proof.
10. Suppose you are given a database of patients' demographic information from a healthcare provider. The covariance matrix obtained for three attributes—age, weight, and systolic blood pressure (bp)—is shown below:

          age      weight   bp
age     [ 389.75   199.37   135.12 ]
weight  [ 199.37   610.52   426.30 ]
bp      [ 135.12   426.30   359.36 ]

(a) Does this imply that a patient's age is more correlated with weight than with systolic blood pressure? Answer yes or no and explain your reasons clearly.
Answer: No. Covariance is not a dimensionless quantity, so its magnitude depends on the scale of measurement. Even though the covariance between age and weight is higher than that between age and systolic blood pressure, it is possible that the correlation is lower.
(b) Suppose the weight attribute is centered by subtracting it with the
average weight of all patients in the database. For example, a 200-
pound patient has a weight recorded as 50 (if the average weight
of the patients is 150 pounds). Would the covariance between the
centered weight attribute and age be greater than, smaller than, or
equal to 199.37?
Answer: Equal. Let x_h denote the weight attribute and x_y the age attribute. The sample covariance between the two attributes is given by:

Σ_{x_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y),


where x̄_h and x̄_y are the average weight and average age, respectively. If we re-define the weight attribute as x′_h = x_h − x̄_h, then x̄′_h = 0. Hence, the covariance between x′_h and x_y becomes

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h − 0)(x_{iy} − x̄_y)
             = Σ_{x_h, x_y}

This result means centering the weight attribute has no effect on its covariance with other attributes.
(c) If the measurement unit for weight is converted from pounds to
kilograms (where 1 kg = 2.2 pounds), will the covariance between
weight (in kilogram) and age be greater than, smaller than, or equal
to 199.37?
Answer: Re-scaling the weight attribute is equivalent to multiplying the original attribute by a constant C, i.e., x′_h = Cx_h. Furthermore, x̄′_h = Cx̄_h. Thus the covariance between the rescaled weight and age will be:

Σ_{x′_h, x_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x_{iy} − x̄_y)
             = (1/(N−1)) Σ_{i=1}^{N} (Cx_{ih} − Cx̄_h)(x_{iy} − x̄_y)
             = (C/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y)
             = CΣ_{x_h, x_y}

In this case, C = 1/2.2, which is smaller than 1. Therefore, the covariance value will be smaller when the unit is converted from pounds to kilograms.
(d) Suppose you standardize both the age and weight attributes (by
subtracting their respective means and dividing by their corre-
sponding standard deviations). Would their covariance value be
greater than, smaller than, or equal to 199.37?


Answer: The re-defined attributes after standardization are x′_h = (x_h − x̄_h)/σ_h and x′_y = (x_y − x̄_y)/σ_y. Furthermore, x̄′_h = 0 and x̄′_y = 0. Then,

Σ_{x′_h, x′_y} = (1/(N−1)) Σ_{i=1}^{N} (x′_{ih} − x̄′_h)(x′_{iy} − x̄′_y)
             = (1/(N−1)) Σ_{i=1}^{N} [(x_{ih} − x̄_h)/σ_h] [(x_{iy} − x̄_y)/σ_y]
             = [ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) ] / (σ_h σ_y)
             = Σ_{x_h, x_y} / (σ_h σ_y)     (2.12)

Note that Σ_{x_h, x_y}/(σ_h σ_y) is the correlation coefficient between x_h and x_y. Since the correlation coefficient is always less than or equal to 1, whereas the original covariance value is +199.37, the covariance value is smaller after standardization.
Next, we prove that the correlation coefficient is always between −1 and +1. First, note that

σ_h = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{ih} − x̄_h)² ],    σ_y = √[ (1/(N−1)) Σ_{i=1}^{N} (x_{iy} − x̄_y)² ].

Thus, Equation (2.12) can be re-written as follows:

Σ_{x′_h, x′_y} = Σ_{x_h, x_y} / (σ_h σ_y)
             = Σ_{i=1}^{N} (x_{ih} − x̄_h)(x_{iy} − x̄_y) / √[ (Σ_{i=1}^{N} (x_{ih} − x̄_h)²)(Σ_{i=1}^{N} (x_{iy} − x̄_y)²) ]     (2.13)

Let h_i = x_{ih} − x̄_h and y_i = x_{iy} − x̄_y. Equation (2.13) becomes

Σ_{x′_h, x′_y} = Σ_{i=1}^{N} h_i y_i / √[ (Σ_{i=1}^{N} h_i²)(Σ_{i=1}^{N} y_i²) ] = h^T y / (‖h‖‖y‖)     (2.14)


According to the Cauchy-Schwarz inequality, for any vectors h and y, we have

h^T y ≤ ‖h‖‖y‖.

Thus the ratio on the right-hand side of Equation (2.14) is less than or equal to 1, which completes the proof.

11. Consider the following distance measure for two sets, X and Y:

D(X, Y) = 1 − |X ∩ Y| / |X ∪ Y|,

where ∩ denotes the intersection of the two sets, ∪ denotes their union, and |·| denotes the cardinality of a set. This measure is equivalent to 1 minus the Jaccard similarity. Does the distance measure satisfy the positivity, symmetry, and triangle inequality properties? For each property, explain your reasoning clearly or give a counter-example if the property is not satisfied.
Answer:
Answer:
(a) Positivity: Since |X ∩ Y| ≤ |X ∪ Y|, we have |X ∩ Y|/|X ∪ Y| ≤ 1 and

D(X, Y) = 1 − |X ∩ Y|/|X ∪ Y| ≥ 0.

Furthermore, if X = Y, then |X ∪ Y| = |X ∩ Y| = |X|. Hence,

D(X, Y) = 1 − |X ∩ Y|/|X ∪ Y| = 1 − 1 = 0.

Similarly, if D(X, Y) = 0, then

1 − |X ∩ Y|/|X ∪ Y| = 0,

which means |X ∪ Y| = |X ∩ Y|, or equivalently, X = Y.


Hence, the positivity property holds for the distance measure.
(b) Symmetry:

D(X, Y) = 1 − |X ∩ Y|/|X ∪ Y| = 1 − |Y ∩ X|/|Y ∪ X| = D(Y, X).

Hence, the symmetry property holds for the distance measure.


(c) Triangle inequality:

D(X, Y) + D(Y, Z) = [1 − |X ∩ Y|/|X ∪ Y|] + [1 − |Y ∩ Z|/|Y ∪ Z|]
                  = (|X ∪ Y| − |X ∩ Y|)/|X ∪ Y| + (|Y ∪ Z| − |Y ∩ Z|)/|Y ∪ Z|
                  ≥ (|X ∪ Y| − |X ∩ Y|)/|X ∪ Y ∪ Z| + (|Y ∪ Z| − |Y ∩ Z|)/|X ∪ Y ∪ Z|

and

D(X, Z) = 1 − |X ∩ Z|/|X ∪ Z| ≤ 1 − |X ∩ Z|/|X ∪ Y ∪ Z| = (|X ∪ Y ∪ Z| − |X ∩ Z|)/|X ∪ Y ∪ Z|.

Figure 2.10. Illustration of triangle inequality

Figure 2.10 shows the Venn diagram for sets X, Y, and Z. The number of data points in each subregion of the Venn diagram is labeled A through G. From this figure, it can be seen that

|X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z| = A + C + D + F + B + C + D + G,

whereas

|X ∪ Y ∪ Z| − |X ∩ Z| = A + B + C + F + G.


The preceding equations show that

|X ∪ Y ∪ Z| − |X ∩ Z| ≤ |X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z|.

Putting the inequalities together, we have

D(X, Z) ≤ (|X ∪ Y ∪ Z| − |X ∩ Z|)/|X ∪ Y ∪ Z|
        ≤ (|X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z|)/|X ∪ Y ∪ Z|
        ≤ D(X, Y) + D(Y, Z).
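
A quick spot-check of the inequality on three arbitrary sets:

X = [1 2 3 4];  Y = [3 4 5];  Z = [4 5 6 7];
jd = @(a, b) 1 - numel(intersect(a, b)) / numel(union(a, b));
jd(X, Z)               % 0.8571
jd(X, Y) + jd(Y, Z)    % 1.2000, consistent with the proof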

12. Which similarity or distance measure is most effective for each of the
domains given below:

(a) Which measure, Jaccard or simple matching coefficient, is more appropriate for comparing the similarity of the answers provided by students in an exam? Assume that the answers to all the questions in the exam are either True or False.
Answer: Simple matching coefficient. The values True and False are equally important when computing similarity.
(b) Which measure, Jaccard or Simple Matching Coefficient, is most
appropriate to compare how similar are the locations visited by
tourists at an amusement park. Assume the location information is
stored as binary yes/no attributes (yes means a location was visited
by the tourist and no means a location has not been visited).
Answer: Jaccard. Here places visited by the tourists should play
a more significant role in computing similarity than places they did
not visit.
(c) Which measure, Euclidean distance or correlation coefficient, is most appropriate for comparing two flows in network traffic? For each flow, we record the number of packets transmitted, number of bytes transferred, number of acknowledgments sent, and duration of the session.
Answer: Euclidean distance (after standardizing each attribute). The correlation coefficient is not meaningful here because it is not meaningful to correlate two flows that have different attributes (i.e., correlating the attributes is meaningful, but correlating the flows is not).


(d) Which measure, Euclidean distance or cosine similarity, is most appropriate for comparing the coordinates of a moving object in a 2-dimensional space? For example, using GPS data, the object may be located at (31.4° West, 12.4° North) at time t1 and at (29.4° West, 12.5° North) at another time t2. Note: we may use +/− to indicate East/West or North/South directions when computing the similarity or distance measures.
Answer: Euclidean distance. Cosine measures only the angle between the two location vectors. Thus, if two locations lie along the same line through the origin, their cosine similarity will be 1 (the maximum) even though they may be located far away from each other.
(e) Which measure, Euclidean distance or cosine similarity, is most ap-
propriate to compare the similarity of items bought by customers at
a grocery store. Assume each customer is represented by a 0/1 bi-
nary vector of items (where a 1 means the customer had previously
bought the item).
Answer: Cosine similarity because presence of an item in the trans-
action plays a more important role in determining similarity than
absence of the item.
