Test Bank For Introduction To Data Mining 2nd Edition
Introduction
1. [Fall 2008]
For each data set given below, give specific examples of classification,
clustering, association rule mining, and anomaly detection tasks that
can be performed on the data. For each task, state how the data matrix
should be constructed (i.e., specify the rows and columns of the matrix).
(b) Stock market data, which include the prices and volumes of various
stocks on different trading days.
Answer:
Classification
Task: Predict whether the stock price will go up or down the next trading day
Row: A trading day
Column: Trading volume and closing price of the stock over the previous 5 days, and a class
attribute that indicates whether the stock went up or down
Clustering
Task: Identify groups of stocks with similar price fluctuations
Row: A company’s stock
Column: Changes in the daily closing price of the stock over the past ten years
Association rule mining
Task: Identify stocks with similar fluctuation patterns (e.g., {Google-Up, Yahoo-Up})
Row: A trading day
Column: List of all stock-up and stock-down events on the given day.
Anomaly detection
Task: Identify unusual trading days for a given stock (e.g., unusually high volume)
Row: A trading day
Column: Trading volume, change in daily stock price (daily high − low prices), and average
price change of its competitor stocks
(c) Database of Major League Baseball (MLB).
Classification
Task: Predict the winner of a game between two MLB teams.
Row: A game.
Column: Statistics of the home and visiting teams over their past 10 games
(e.g., average winning percentage and hitting percentage of their players)
Clustering
Task: Identify groups of players with similar statistics
Row: A player
Column: Statistics of the player
Association rule mining
Task: Identify interesting player statistics (e.g., 40% of right-handed players have a batting
percentage below 20% when facing left-handed pitchers)
Row: A player
Column: Discretized statistics of the player
Anomaly detection
Task: Identify players who performed considerably better than expected in a given season
Row: A (player, season) pair, e.g., (player1, 2007)
Column: Ratio statistics of a player (e.g., ratio of average batting percentage in 2007 to
career average batting percentage)
2 Data
2.1 Types of Attributes
1. Classify the following attributes as binary, discrete, or continuous. Also
classify them as qualitative (nominal or ordinal) or quantitative (interval
or ratio). Some cases may have more than one interpretation, so briefly
indicate your reasoning if you think there may be some ambiguity.
• discrete or continuous.
• qualitative or quantitative
• nominal, ordinal, interval, or ratio
(a) Julian Date, which is the number of days elapsed since 12 noon
Greenwich Mean Time of January 1, 4713 BC.
Answer: Continuous, quantitative, interval
(b) Movie ratings provided by users (1-star, 2-star, 3-star, or 4-star).
Answer: Discrete, qualitative, ordinal
(c) Mood level of a blogger (cheerful, calm, relaxed, bored, sad, angry
or frustrated).
Answer: Discrete, qualitative, nominal
(d) Average number of hours a user spent on the Internet in a week.
Answer: Continuous, quantitative, ratio
(e) IP address of a machine.
Answer: Discrete, qualitative, nominal
(f) Richter scale (in terms of energy release during an earthquake).
Answer: Continuous, qualitative, ordinal
In terms of energy release, the difference between 0.0 and 1.0 is not
the same as the difference between 1.0 and 2.0. Ordinal attributes are
qualitative, yet they can be continuous.
(g) Salary above the median salary of all employees in an organization.
Answer: Continuous, quantitative, interval
(h) Undergraduate level (freshman, sophomore, junior, and senior) for
measuring years in college.
Answer: Discrete, qualitative, ordinal
6. State the type of each attribute given below before and after we have
performed the following transformation.
(c) Age of a person is discretized to the following scale: Age < 12,
12 ≤ Age < 21, 21 ≤ Age < 45, 45 ≤ Age < 65, Age ≥ 65.
Answer: Ratio (before transformation) to ordinal (after transformation)
(d) Annual income of a person is discretized to the following scale:
Income < $20K, $20K ≤ Income < $60K, $60K ≤ Income < $120K,
$120K ≤ Income < $250K, Income ≥ $250K.
Answer: Ratio (before transformation) to ordinal (after transformation)
(a) Average amplitude of seismic waves (in Richter scale) for the 10
deadliest earthquakes in Asia.
Answer: No because Richter scale is ordinal.
(b) Average number of characters in a collection of spam messages.
Answer: Yes because number of characters is a ratio attribute.
(c) Pearson’s correlation between shirt size and height of an individual.
Answer: No because shirt size is ordinal.
(d) Median zipcode of households in the United States.
Answer: No because zipcode is nominal.
(e) Entropy of students (based on the GPA they obtained for a given
course).
Answer: Yes because entropy is applicable to nominal attributes.
(f) Geometric mean of temperature (in Fahrenheit) for a given city.
Answer: No because temperature (in Fahrenheit) is not a ratio
attribute.
UserID 1 2 3 4 5 6 7 8 9
Age 17 24 25 28 32 38 39 49 68
Gender Female Male Male Male Female Female Female Male Male
(a) Suppose you apply equal interval width approach to discretize the
Age attribute into 3 bins. Show the userIDs assigned to each of the
3 bins.
Answer: Bin width = (68 − 17)/3 = 51/3 = 17.
Bin 1: 1, 2, 3, 4, 5
Bin 2: 6, 7, 8
Bin 3: 9
(b) Repeat the previous question using the equal frequency approach.
Answer: Since there are 9 users and 3 bins, every bin must contain
3 users.
Bin 1: 1, 2, 3
Bin 2: 4, 5, 6
Bin 3: 7, 8, 9
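The two unsupervised discretizations above can be reproduced with a short Python sketch (not part of the original solution; the `ages` dictionary encodes the table of 9 users):

```python
# Equal-width vs. equal-frequency binning of the Age attribute.
ages = {1: 17, 2: 24, 3: 25, 4: 28, 5: 32, 6: 38, 7: 39, 8: 49, 9: 68}

def equal_width_bins(values, k):
    """Assign each (id, value) pair to one of k bins of equal width."""
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / k
    bins = {b: [] for b in range(k)}
    for uid, v in sorted(values.items(), key=lambda kv: kv[1]):
        b = min(int((v - lo) // width), k - 1)  # the maximum value falls in the last bin
        bins[b].append(uid)
    return bins

def equal_freq_bins(values, k):
    """Sort by value and split into k bins holding equally many objects."""
    ordered = [uid for uid, _ in sorted(values.items(), key=lambda kv: kv[1])]
    n = len(ordered) // k
    return {b: ordered[b * n:(b + 1) * n] for b in range(k)}

print(equal_width_bins(ages, 3))  # {0: [1, 2, 3, 4, 5], 1: [6, 7, 8], 2: [9]}
print(equal_freq_bins(ages, 3))   # {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}
```

Both outputs match the bin assignments given in parts (a) and (b).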
(c) Repeat question (a) using a supervised discretization approach (with
Gender as class attribute). Specifically, choose the bins in such a
way that their members are as “pure” as possible (i.e., belonging
to the same class).
Answer:
Bin 1: 1, 2, 3, 4
Bin 2: 5, 6, 7
Bin 3: 8, 9
2.2 Data Preprocessing
Confidence(A, B → C) = P(C | A, B) = P(A, B, C) / P(A, B)    (2.1)
(a) Suppose we increase the number of bins for the Age attribute from
5 to 6, so that the discretized Age in the rule becomes 21 ≤ Age
< 30 instead of 21 ≤ Age < 45. Will the confidence of the rule be
non-increasing, non-decreasing, stay the same, or could it go either
way (increase or decrease)?
Answer: It could go either way (increase or decrease).
(b) Suppose we increase the number of bins for the AmountSpent
attribute from 8 to 10, so that the right-hand side of the rule becomes
$500 < AmountSpent < $1000. Will the confidence of the rule be
non-increasing, non-decreasing, stay the same, or could it go either
way (increase or decrease)?
Answer: Non-increasing.
(c) Suppose the values for the NumberOfVisits attribute are distributed
according to a Poisson distribution with mean 4.
If we discretize the attribute into 4 bins using the equal frequency
approach, what are the bin values after discretization? Hint: you
need to refer to the cumulative distribution table for Poisson dis-
tribution to answer the question.
Answer: Choose the bin values such that the cumulative distribu-
tion is close to 0.25, 0.5, and 0.75. This corresponds to bin values:
0 to 2, 3, 4 to 5, and greater than 5.
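A quick Python check of the Poisson CDF values behind this answer (illustrative, not part of the original solution):

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson random variable with mean lam."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

# With mean 4, the CDF is ~0.238 at k = 2, ~0.433 at k = 3, and ~0.785 at k = 5.
# These are the cutoffs closest to 0.25, 0.5, and 0.75, giving the bins
# {0-2}, {3}, {4-5}, and {>5}.
for k in range(7):
    print(k, round(poisson_cdf(k, 4), 4))
```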
The null values in the table refer to inapplicable values, since sales
commissions are calculated for sales employees only. Suppose we are
interested in calculating the similarity between users based on their
sales commission.
where s(a, b) is the original similarity measure used for the sales
commission.
6. Consider a data set from an online social media Web site that contains
information about the age and number of friends for 5,000 users.
(a) Suppose the number of friends for each user is known. However,
only 4000 out of 5000 users provide their age information. The
average age of the 4,000 users is 30 years old. If you replace the
missing values for age with the value 30, will the average age computed
for the 5,000 users increase, decrease, or stay the same (as 30)?
Answer: Average age does not change.
\bar{x}_{old} = \frac{1}{4000} \sum_{i=1}^{4000} x_i

\bar{x}_{new} = \frac{1}{5000} \sum_{i=1}^{5000} x_i = \frac{1}{5000} \left( \sum_{i=1}^{4000} x_i + \sum_{i=4001}^{5000} x_i \right)

Since x_i = \bar{x}_{old} for i = 4001, 4002, \ldots, 5000 and \sum_{i=1}^{4000} x_i = 4000 \bar{x}_{old},
we have

\bar{x}_{new} = \frac{1}{5000} \left( 4000 \bar{x}_{old} + 1000 \bar{x}_{old} \right) = \bar{x}_{old}
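This invariance is easy to confirm numerically; the sketch below uses made-up ages (not data from the text) and imputes the 1,000 missing values with the mean of the 4,000 observed ones:

```python
import random

random.seed(0)
observed = [random.gauss(30, 8) for _ in range(4000)]  # hypothetical known ages
mean_observed = sum(observed) / len(observed)

# Impute the 1,000 missing ages with the observed mean.
all_ages = observed + [mean_observed] * 1000
mean_all = sum(all_ages) / len(all_ages)

print(abs(mean_all - mean_observed) < 1e-9)  # True: the mean is unchanged
```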
(b) Suppose the covariance between age and number of friends calculated
using the 4,000 users (with no missing values) is 20. If you
replace the missing values for age with the average age of the 4,000
users, will the covariance between age and number of friends
increase, decrease, or stay the same (as 20)? Assume that the
average number of friends for all 5,000 users is the same as the
average for the 4,000 users.
Answer: The covariance will decrease. Let

C_1 = \frac{1}{3999} \sum_{i=1}^{4000} (x_i - \bar{x})(y_i - \bar{y})

be the covariance computed using the 4,000 users without
missing values. If we impute the missing values for age with the average
age, \bar{x} remains unchanged according to part (a). Furthermore, \bar{y} is
assumed to be unchanged. Thus, the new covariance is

C_2 = \frac{1}{4999} \sum_{i=1}^{5000} (x_i - \bar{x})(y_i - \bar{y})
    = \frac{1}{4999} \left[ \sum_{i=1}^{4000} (x_i - \bar{x})(y_i - \bar{y}) + \sum_{i=4001}^{5000} (\bar{x} - \bar{x})(y_i - \bar{y}) \right]
    = \frac{1}{4999} \sum_{i=1}^{4000} (x_i - \bar{x})(y_i - \bar{y}) < C_1    (2.2)
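The shrinkage can also be seen on synthetic data (not from the text): after mean-imputing the missing ages, the sample covariance drops by exactly the factor 3999/4999 derived in (2.2):

```python
import random

random.seed(1)
age = [random.gauss(30, 8) for _ in range(4000)]
friends = [a * 2 + random.gauss(0, 5) for a in age]  # positively related to age

def sample_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

c1 = sample_cov(age, friends)

# 1,000 more users whose friend counts are known but whose ages are
# imputed with the mean of the 4,000 observed ages.
mean_age = sum(age) / len(age)
extra_friends = [random.gauss(60, 16) for _ in range(1000)]
c2 = sample_cov(age + [mean_age] * 1000, friends + extra_friends)

print(c2 < c1)  # True: the covariance has decreased
```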
7. Consider the following data matrix on the right, in which two of its
values are missing (the matrix on the left shows its true values).
True matrix:              Matrix with missing values:
−0.2326   0.2270          −0.2326   0.2270
−0.0847   0.7125          −0.0847   0.7125
 0.1275   0.3902           0.1275   0.3902
 0.1329  −0.1461               ?   −0.1461
 0.3724   0.1756           0.3724   0.1756
 0.4975   0.8536           0.4975   0.8536
 0.6926   0.7834           0.6926   0.7834
 0.7933   0.7375           0.7933   0.7375
 0.8229   0.2147           0.8229   0.2147
 0.8497   0.4980           0.8497   0.4980
 1.0592   0.7600           1.0592       ?
 1.5028   1.0122           1.5028   1.0122
(a) Impute the missing values for the matrix on the right by their re-
spective column averages. Show the imputed values and calculate
their root-mean-square-error (RMSE).
RMSE = \sqrt{ \frac{ (A_{4,1} - \tilde{A}_{4,1})^2 + (A_{11,2} - \tilde{A}_{11,2})^2 }{ 2 } }
where Ai,j denotes the true value of the (i, j)-th element of the data
matrix and Ãi,j denotes its corresponding imputed value.
Answer: The column averages are [0.5819 0.4962]. The imputed
values are
−0.2326 0.2270
−0.0847 0.7125
0.1275 0.3902
0.5819 −0.1461
0.3724 0.1756
0.4975 0.8536
0.6926 0.7834
0.7933 0.7375
0.8229 0.2147
0.8497 0.4980
1.0592 0.4962
1.5028 1.0122
and the RMSE value is

RMSE = \sqrt{ \frac{ (0.1329 - 0.5819)^2 + (0.7600 - 0.4962)^2 }{ 2 } } = 0.3682
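The imputation and RMSE computation can be verified with a few lines of Python (illustrative, not part of the original solution):

```python
A_true = [
    [-0.2326, 0.2270], [-0.0847, 0.7125], [0.1275, 0.3902],
    [0.1329, -0.1461], [0.3724, 0.1756], [0.4975, 0.8536],
    [0.6926, 0.7834], [0.7933, 0.7375], [0.8229, 0.2147],
    [0.8497, 0.4980], [1.0592, 0.7600], [1.5028, 1.0122],
]
missing = {(3, 0), (10, 1)}  # 0-based positions of the two missing entries

# Column means over the observed (non-missing) entries only.
col_mean = []
for j in range(2):
    vals = [row[j] for i, row in enumerate(A_true) if (i, j) not in missing]
    col_mean.append(sum(vals) / len(vals))
print([round(m, 4) for m in col_mean])  # [0.5819, 0.4962]

# RMSE of imputing each missing entry with its column mean.
sq = [(A_true[i][j] - col_mean[j]) ** 2 for i, j in missing]
rmse = (sum(sq) / len(sq)) ** 0.5
print(round(rmse, 4))  # 0.3682
```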
C = \frac{1}{N-1} \left( A - \frac{1}{N} 1_N A \right)^T \left( A - \frac{1}{N} 1_N A \right)
  = \frac{1}{N-1} \left[ \left( I_N - \frac{1}{N} 1_N \right) A \right]^T \left[ \left( I_N - \frac{1}{N} 1_N \right) A \right]
  = \frac{1}{N-1} A^T \left( I_N - \frac{1}{N} 1_N \right) \left( I_N - \frac{1}{N} 1_N \right) A    (2.5)
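As a sanity check (not part of the original solution), the centering-matrix expression in (2.5) can be compared against a direct covariance computation; `numpy` is assumed available:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
A = rng.normal(size=(N, d))

ones = np.ones((N, N))        # 1_N: the N x N matrix of all ones
H = np.eye(N) - ones / N      # centering matrix I_N - (1/N) 1_N
C = (A.T @ H @ H @ A) / (N - 1)

# Matches numpy's sample covariance (columns as variables, N-1 normalization).
print(np.allclose(C, np.cov(A, rowvar=False)))  # True
```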
(c) Find the relationship between the right singular matrix V and the
matrix of principal components X if the data matrix A has been
column-centered (i.e., every column of A has been subtracted by
the column mean) before applying SVD.
Answer: If the matrix A has been column-centered, then its col-
umn mean is zero, which means AT 1N is a matrix of all zeros.
Thus, the last equation in the previous question reduces to:
X \Lambda X^T = \frac{1}{N-1} V \Sigma^2 V^T .
You will use Matlab to apply PCA to each of the following images.
(a) Load each image using the imread command. For example:
matlab> A = imread('img1.jpg');
matlab> imagesc(A);
matlab> colormap(gray);
Compression ratio = (Size of matrix A) / (Size of matrix U + Size of matrix V)
for each reduced rank (10, 30, 50, 100) of the images. You can use
the whos command to determine the size of the matrices:
matlab> whos A U V
Answer: See Table 2.1.
(j) rank 100 img1 (k) rank 100 img2 (l) rank 100 img3
image (i.e., the city square in img1.jpg, shape of the face in img2.
jpg, and shape of the apple in img3.jpg). Which image requires
the least number of principal components? Which image requires
the most number of principal components?
Answer:
img1.jpg: 50 components
img2.jpg: 30 components
img3.jpg: 10 components
2.3 Measures of Similarity and Dissimilarity
W = \frac{2h}{k + 2h}

For example¹, for dog and cat, W = 26/(4 + 26) = 0.867, whereas for
dog and money, W = 4/(19 + 4) = 0.174.
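The computation can be sketched in Python (illustrative; `wu_palmer` is a hypothetical helper, with k and h taken directly from the formula above and the values implied by the two worked examples):

```python
def wu_palmer(k, h):
    """Wu-Palmer similarity W = 2h / (k + 2h), with k and h as in the text."""
    return 2 * h / (k + 2 * h)

# dog/cat has 2h = 26 (h = 13) and k = 4; dog/money has 2h = 4 (h = 2) and k = 19.
print(round(wu_palmer(4, 13), 3))   # 0.867
print(round(wu_palmer(19, 2), 3))   # 0.174
```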
(a) What is the maximum and minimum possible value for Wu-Palmer
similarity?
1
In this simplified example, we assume each word has exactly 1 sense. In general, a
word can have multiple senses. As a result, the Wu-Palmer measure is given by the highest
similarity that can be achieved using one of its possible senses.
4. Suppose you are given census data, where every data object
corresponds to a household and the following continuous attributes are used to
characterize each household: total household income, number of house-
hold residents, property value, number of bedrooms, and number of ve-
hicles owned. Suppose we are interested in clustering the households
based on these attributes.
(a) Explain why cosine is not a good measure for clustering the data.
Answer: These attributes are all numerical and can have widely
varying ranges of values, depending on the scale used to measure
them. As a result, the cosine measure will be dominated by the
attributes with the largest magnitudes (e.g., total household income
and property value).
(b) Explain why correlation is not a good measure for clustering the
data.
Answer: The same argument as part (a). Because each attribute
has a different range, correlating the data points is meaningless.
(c) Explain what preprocessing steps and corresponding proximity mea-
sure you should use to do the clustering.
Answer: Euclidean distance, applied after standardizing the
attributes to have a mean of 0 and a standard deviation of 1, would
be appropriate.
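That preprocessing pipeline can be sketched in Python (the household values below are made up for illustration):

```python
import math

# Hypothetical mini-census: [income ($), residents, property value ($), bedrooms, vehicles]
households = [
    [85000, 4, 320000, 3, 2],
    [42000, 2, 150000, 2, 1],
    [90000, 5, 310000, 4, 2],
]

def standardize(data):
    """Rescale each column to mean 0 and (sample) standard deviation 1."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

z = standardize(households)
# After standardization, the dollar-valued attributes no longer dominate,
# and the two similar households (rows 0 and 2) come out closest.
print(euclidean(z[0], z[2]) < euclidean(z[0], z[1]))  # True
```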
5. Consider the following distance measure:

d(x, y) = 1 − c(x, y),

where c(x, y) is the cosine similarity between two data objects, x and
y. Does the distance measure satisfy the positivity, symmetry, and tri-
angle inequality properties? For each property, show your steps clearly.
Assume x and y are non-negative vectors (e.g., term vectors for a pair
of documents).
Answer:
(a) Positivity: You need to show that ∀x, y: d(x, y) = 1 − x · y / (‖x‖ ‖y‖) ≥ 0,
and that d(x, y) = 0 if and only if x = y.
By definition, x · y = ‖x‖ ‖y‖ cos θ, where θ is the angle between x
and y. Since cos θ ≤ 1 (from trigonometry), d(x, y) = 1 − cos θ ≥ 0.
For the equality condition,
1 − cos θ = 0 ⇒ cos θ = 1 ⇒ θ = 0
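A small Python check (not part of the original solution) makes the equality condition concrete: θ = 0 means the vectors are parallel, so d(x, y) = 0 can hold even when x ≠ y (unless the vectors are normalized):

```python
import math

def cosine_distance(x, y):
    """d(x, y) = 1 - cosine similarity of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # parallel to x (theta = 0) but not equal to it
print(cosine_distance(x, y) < 1e-12)  # True: distance is zero even though x != y
```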
where d(x, y) is the Euclidean distance between two data points, x and
y. Intuitively, D measures the distance between clusters in terms of
the closest two points from each cluster (see Figure 2.8). Does the dis-
tance measure satisfy the positivity, symmetry, and triangle inequality
properties? For each property, show your proof clearly or give a counter-
example if the property is not satisfied.
Answer:
(a) Positivity: Since the Euclidean distance between any two data points
is always non-negative, D(X, Y) ≥ 0. However, D(X, Y) can be zero
even when X ≠ Y if a data point is assigned to both
clusters X and Y (i.e., if overlapping clusters are allowed). So
the distance measure satisfies the positivity property for disjoint
clusters but not for overlapping clusters.
(b) Symmetry: Since Euclidean distance is a symmetric measure,
D(X, Y) = min{d(x, y) : x ∈ X, y ∈ Y} = min{d(y, x) : x ∈
X, y ∈ Y} = D(Y, X). Thus, the measure is symmetric.
(c) Triangle Inequality: Triangle inequality property can be vio-
lated. A counter-example is shown in Figure 2.9.
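A counterexample in the same spirit as Figure 2.9 can be verified in a few lines (the 1-D clusters below are hypothetical, not the figure's actual points):

```python
import math

def single_link(X, Y):
    """D(X, Y) = minimum Euclidean distance between a point of X and a point of Y."""
    return min(math.dist(x, y) for x in X for y in Y)

# Three clusters of 1-D points (as tuples); Y straddles the gap between X and Z.
X = [(0.0,)]
Y = [(2.0,), (5.0,)]
Z = [(7.0,)]

print(single_link(X, Y) + single_link(Y, Z))  # 4.0
print(single_link(X, Z))                      # 7.0 > 4.0: triangle inequality violated
```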
9. Consider the following survey data about users who joined an online
community. The sample covariance between the user’s height (in mm)
and number of years being a member of the community is 5.0.
(a) Suppose the sample covariance between the user’s age and number
of years being a member of the community is only 0.5. Does this
imply that user’s height is more correlated with number of years in
the community than user’s age? Answer yes or no and explain your
reasons clearly.
Answer: No. Covariance is not a dimensionless quantity, so its
magnitude depends on the scale of measurement.
(b) Suppose the height attribute is re-defined as height above the av-
erage for all users who participated in the survey. For example, a
user who is 1650 mm tall has a height value of -50 mm (assuming
the average height of all users is 1700 mm). Would the covariance
between the re-defined height attribute and number of years in the
community be greater than, smaller than, or equal to 5.0?
Answer: Equal. Let x_h denote the height attribute and x_y the
number of years in the community. The sample covariance between
the two attributes is given by:

\Sigma_{x_h, x_y} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)(x_{iy} - \bar{x}_y),

where \bar{x}_h and \bar{x}_y are the average height and average number of years,
respectively. If we re-define the height attribute as x'_h = x_h - \bar{x}_h,
then \bar{x}'_h = 0 and

\Sigma_{x'_h, x_y} = \frac{1}{N-1} \sum_{i=1}^{N} (x'_{ih} - \bar{x}'_h)(x_{iy} - \bar{x}_y)
                  = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h - 0)(x_{iy} - \bar{x}_y)
                  = \Sigma_{x_h, x_y}
This result means that centering the height attribute has no effect on its
covariance with other attributes.
(c) If the measurement unit for height is converted from mm to inches
(where 1 inch = 25.4 mm), will the covariance between height (in
inches) and number of years in the community be greater than,
smaller than, or equal to 5.0?
Answer: Re-scaling the height attribute is equivalent to
multiplying the original attribute by some constant C, i.e., x'_h = C x_h.
Furthermore, we can show that \bar{x}'_h = C \bar{x}_h. Thus the covariance
between the rescaled height and number of years in the community
will be:
\Sigma_{x'_h, x_y} = \frac{1}{N-1} \sum_{i=1}^{N} (x'_{ih} - \bar{x}'_h)(x_{iy} - \bar{x}_y)
                  = \frac{1}{N-1} \sum_{i=1}^{N} (C x_{ih} - C \bar{x}_h)(x_{iy} - \bar{x}_y)
                  = \frac{C}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)(x_{iy} - \bar{x}_y)
                  = C \, \Sigma_{x_h, x_y}
In this case, C = 1/25.4, which is smaller than 1. Therefore, the
covariance value will be smaller when you convert the unit from
mm to inches.
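The scaling argument can be confirmed on synthetic data (the heights and membership years below are made up):

```python
import random

random.seed(2)
height_mm = [random.gauss(1700, 60) for _ in range(200)]
years = [random.gauss(5, 2) for _ in range(200)]

def sample_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

C = 1 / 25.4                         # mm -> inches conversion factor
height_in = [h * C for h in height_mm]

cov_mm = sample_cov(height_mm, years)
cov_in = sample_cov(height_in, years)
print(abs(cov_in - C * cov_mm) < 1e-9)  # True: covariance scales by C
```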
(d) Suppose you standardize both the height and number of years in
the community attributes (by subtracting their respective means
and dividing by their corresponding standard deviations). Would
their covariance value be greater than, smaller than, or equal to
5.0? To obtain full credit, you must prove your answer by showing
the computations clearly.
Answer: The re-defined attributes after standardization are
x'_h = \frac{x_h - \bar{x}_h}{\sigma_h} and x'_y = \frac{x_y - \bar{x}_y}{\sigma_y}. Furthermore, we can show that \bar{x}'_h = 0 and \bar{x}'_y = 0.
Then,

\Sigma_{x'_h, x'_y} = \frac{1}{N-1} \sum_{i=1}^{N} (x'_{ih} - \bar{x}'_h)(x'_{iy} - \bar{x}'_y)
                   = \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{x_{ih} - \bar{x}_h}{\sigma_h} \right) \left( \frac{x_{iy} - \bar{x}_y}{\sigma_y} \right)
                   = \frac{ \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)(x_{iy} - \bar{x}_y) }{ \sigma_h \sigma_y }
                   = \frac{ \Sigma_{x_h, x_y} }{ \sigma_h \sigma_y }    (2.9)

Note that \frac{1}{\sigma_h \sigma_y} \Sigma_{x_h, x_y} is the correlation coefficient between
x_h and x_y. Since the correlation coefficient is always less than or
equal to 1 whereas the original covariance value is +5, the
covariance value is smaller after standardization.
Next, we will prove that the correlation coefficient is always between
−1 and +1. First, note that
\sigma_h = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)^2 }, \quad \sigma_y = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_{iy} - \bar{x}_y)^2 }.

\Sigma_{x'_h, x'_y} = \frac{ \Sigma_{x_h, x_y} }{ \sigma_h \sigma_y }
                   = \frac{ \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)(x_{iy} - \bar{x}_y) }{ \sqrt{ \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)^2 } \sqrt{ \sum_{i=1}^{N} (x_{iy} - \bar{x}_y)^2 } }    (2.10)
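The standardization result can be confirmed numerically (synthetic data, not from the text): the covariance of the standardized attributes equals the correlation coefficient and lies in [−1, 1]:

```python
import random

random.seed(3)
n = 500
height = [random.gauss(1700, 60) for _ in range(n)]
years = [h / 300 + random.gauss(0, 1) for h in height]  # loosely related to height

def mean(x):
    return sum(x) / len(x)

def std(x):
    m = mean(x)
    return (sum((v - m) ** 2 for v in x) / (len(x) - 1)) ** 0.5

def sample_cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

zh = [(v - mean(height)) / std(height) for v in height]
zy = [(v - mean(years)) / std(years) for v in years]

corr = sample_cov(height, years) / (std(height) * std(years))
print(abs(sample_cov(zh, zy) - corr) < 1e-9)  # True: standardized covariance = correlation
print(-1 <= corr <= 1)                        # True
```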
(a) Does this imply that user’s age is more correlated with his/her
weight than systolic blood pressure? Answer yes or no and explain
your reasons clearly.
Answer: No. Covariance is not a dimensionless quantity, so its
magnitude depends on the scale of measurement. Even though
covariance between age and weight is higher than that between age
and systolic blood pressure, it is possible the correlation is lower.
(b) Suppose the weight attribute is centered by subtracting it with the
average weight of all patients in the database. For example, a 200-
pound patient has a weight recorded as 50 (if the average weight
of the patients is 150 pounds). Would the covariance between the
centered weight attribute and age be greater than, smaller than, or
equal to 199.37?
Answer: Equal. Let x_h denote the weight attribute and x_y the
age attribute. The sample covariance between the two attributes is
given by:

\Sigma_{x_h, x_y} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)(x_{iy} - \bar{x}_y),
where \bar{x}_h and \bar{x}_y are the average weight and average age,
respectively. If we re-define the weight attribute as x'_h = x_h - \bar{x}_h, then
\bar{x}'_h = 0. Hence, the covariance between x'_h and x_y becomes
\Sigma_{x'_h, x_y} = \frac{1}{N-1} \sum_{i=1}^{N} (x'_{ih} - \bar{x}'_h)(x_{iy} - \bar{x}_y)
                  = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h - 0)(x_{iy} - \bar{x}_y)
                  = \Sigma_{x_h, x_y}
(c) If the measurement unit for weight is converted from pounds to
kilograms (where 1 kg ≈ 2.2 pounds), will the covariance between
weight (in kg) and age be greater than, smaller than, or equal to
199.37?
Answer: Re-scaling the weight attribute is equivalent to multiplying
it by the constant C = 1/2.2, which is smaller than 1. Therefore,
the covariance value will be smaller when you convert the unit from
pounds to kilograms.
(d) Suppose you standardize both the age and weight attributes (by
subtracting their respective means and dividing by their corre-
sponding standard deviations). Would their covariance value be
greater than, smaller than, or equal to 199.37?
Note that \frac{1}{\sigma_h \sigma_y} \Sigma_{x_h, x_y} is the correlation coefficient between
x_h and x_y. Since the correlation coefficient is always less than or
equal to 1 whereas the original covariance value is +199.37, the
covariance value is smaller after standardization.
Next, we will prove that the correlation coefficient is always between
−1 and +1. First, note that
\sigma_h = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)^2 }, \quad \sigma_y = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_{iy} - \bar{x}_y)^2 }.

\Sigma_{x'_h, x'_y} = \frac{ \Sigma_{x_h, x_y} }{ \sigma_h \sigma_y }
                   = \frac{ \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)(x_{iy} - \bar{x}_y) }{ \sqrt{ \sum_{i=1}^{N} (x_{ih} - \bar{x}_h)^2 } \sqrt{ \sum_{i=1}^{N} (x_{iy} - \bar{x}_y)^2 } }    (2.13)
11. Consider the following distance measure for two sets, X and Y:

D(X, Y) = 1 − |X ∩ Y| / |X ∪ Y|,

where ∩ denotes the intersection of the two sets, ∪ denotes the union of
the two sets, and | · | denotes the cardinality of a set. This measure is
equivalent to 1 minus the Jaccard similarity. Does the distance measure
satisfy the positivity, symmetry, and triangle inequality properties? For
each property, explain your reason clearly or give a counter-example if
the property is not satisfied.
Answer:
(a) Positivity: Since |X ∩ Y| ≤ |X ∪ Y|, we have |X ∩ Y| / |X ∪ Y| ≤ 1 and

D(X, Y) = 1 − |X ∩ Y| / |X ∪ Y| ≥ 0.

If X = Y, then |X ∩ Y| = |X ∪ Y|, so

D(X, Y) = 1 − |X ∩ Y| / |X ∪ Y| = 1 − 1 = 0.

Conversely, if D(X, Y) = 0, then

1 − |X ∩ Y| / |X ∪ Y| = 0,

which means |X ∩ Y| = |X ∪ Y|, and this implies X = Y.

(b) Symmetry: Since intersection and union are symmetric in their arguments,

D(X, Y) = 1 − |X ∩ Y| / |X ∪ Y| = 1 − |Y ∩ X| / |Y ∪ X| = D(Y, X).
D(X, Y) + D(Y, Z) = (1 − |X ∩ Y| / |X ∪ Y|) + (1 − |Y ∩ Z| / |Y ∪ Z|)
                  = (|X ∪ Y| − |X ∩ Y|) / |X ∪ Y| + (|Y ∪ Z| − |Y ∩ Z|) / |Y ∪ Z|
                  ≥ (|X ∪ Y| − |X ∩ Y|) / |X ∪ Y ∪ Z| + (|Y ∪ Z| − |Y ∩ Z|) / |X ∪ Y ∪ Z|

and

D(X, Z) = 1 − |X ∩ Z| / |X ∪ Z| ≤ 1 − |X ∩ Z| / |X ∪ Y ∪ Z| = (|X ∪ Y ∪ Z| − |X ∩ Z|) / |X ∪ Y ∪ Z|.
Figure 2.10 shows the Venn diagram for sets X, Y and Z. The
number of data points in each subregion in the Venn Diagram is
labeled A through G. From this figure, it can be easily seen that,
|X∪Y|−|X∩Y|+|Y∪Z|−|Y∩Z| = A+C+D+F+B+C+D+G
whereas
|X ∪ Y ∪ Z| − |X ∩ Z| = A + B + C + F + G.
Hence,

|X ∪ Y ∪ Z| − |X ∩ Z| ≤ |X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z|,

and therefore

D(X, Z) ≤ (|X ∪ Y ∪ Z| − |X ∩ Z|) / |X ∪ Y ∪ Z|
        ≤ (|X ∪ Y| − |X ∩ Y| + |Y ∪ Z| − |Y ∩ Z|) / |X ∪ Y ∪ Z|
        ≤ D(X, Y) + D(Y, Z)
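As a sanity check (not part of the original proof), the distance can be implemented directly and the triangle inequality verified exhaustively over all non-empty subsets of a small universe:

```python
from itertools import combinations

def jaccard_distance(X, Y):
    """D(X, Y) = 1 - |X intersect Y| / |X union Y| (1 minus Jaccard similarity)."""
    return 1 - len(X & Y) / len(X | Y)

# Positivity and symmetry on a small example.
X, Y = {1, 2, 3}, {2, 3, 4}
assert jaccard_distance(X, X) == 0
assert jaccard_distance(X, Y) == jaccard_distance(Y, X)

# Exhaustively verify the triangle inequality over all non-empty subsets
# of {1, 2, 3, 4} (the 1e-12 slack only absorbs floating-point rounding).
subsets = [set(c) for r in range(1, 5) for c in combinations([1, 2, 3, 4], r)]
ok = all(jaccard_distance(A, C) <= jaccard_distance(A, B) + jaccard_distance(B, C) + 1e-12
         for A in subsets for B in subsets for C in subsets)
print(ok)  # True
```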
12. Which similarity or distance measure is most effective for each of the
domains given below: