Outline
Attributes and Objects
Types of Data
Data Quality
Similarity and Distance
What is Data?
A collection of data objects and their attributes

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature

A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Attribute Values
Attribute values are numbers or symbols assigned to
an attribute for a particular object
Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
◆ Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
◆ Example: Attribute values for ID and age are integers
– But the properties of an attribute can be different from the properties of the values used to represent it
Attribute Types
Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
◆ e.g., gender
– Asymmetric binary: outcomes not equally important.
◆ e.g., medical test (positive vs. negative)
◆ Convention: assign 1 to the most important outcome (e.g., HIV positive)
Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
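Not part of the slides: a minimal Python sketch, assuming the pandas library, of how nominal, ordinal, and asymmetric binary attributes might be represented in code; the variable names and values are illustrative only.

```python
import pandas as pd

# Nominal: categories without any order.
hair_color = pd.Categorical(
    ["black", "blond", "red"],
    categories=["auburn", "black", "blond", "brown", "grey", "red", "white"])

# Ordinal: categories with a meaningful order, but unknown magnitudes between them.
size = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"], ordered=True)

# Asymmetric binary: 1 for the more important outcome (e.g., a positive test).
test_result = pd.Series([1, 0, 0, 1])

print(size.min(), size.max())   # the order is usable; arithmetic on the codes is not meaningful
```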
Numeric Attribute Types
Interval
◆ Measured on a scale of equal-sized units
◆ Values have order
  – E.g., temperature in °C or °F, calendar dates
◆ No true zero-point

Ratio
◆ Inherent zero-point
◆ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
  – E.g., length, counts, monetary quantities
https://www.graphpad.com/support/faq/what-is-the-difference-between-ordinal-interval-and-ratio-variables-why-should-i-care/
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Basic Statistical Descriptions of Data
Motivation
– To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities
of precision
– Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean (algebraic measure), sample vs. population (n is the sample size, N is the population size):

  sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

– Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Trimmed mean: chopping extreme values before averaging

Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data): $\mathrm{median} = L_1 + \left(\frac{n/2 - (\sum \mathrm{freq})_l}{\mathrm{freq}_{\mathrm{median}}}\right) \times \mathrm{width}$
Mode
– Value that occurs most frequently in the data
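Not part of the slides: a minimal NumPy sketch of the measures above on a small made-up sample; the data values are illustrative only.

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 70, 110])   # made-up sample
w = np.ones_like(x, dtype=float)                           # weights for the weighted mean

mean = x.mean()                                            # (1/n) * sum(x_i)
weighted_mean = np.average(x, weights=w)                   # sum(w_i * x_i) / sum(w_i)

k = 1                                                      # trimmed mean: chop k smallest and k largest values
trimmed_mean = np.sort(x)[k:-k].mean()

median = np.median(x)                                      # middle value, or average of the middle two
vals, counts = np.unique(x, return_counts=True)
mode = vals[counts.argmax()]                               # most frequent value

print(mean, weighted_mean, trimmed_mean, median, mode)
```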
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: distributions labeled “symmetric”, “positively skewed”, and “negatively skewed”]
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile: 25% of the data lie below this point), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation); the sample variance divides by n − 1, the population variance by N:

  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation s (or σ) is the square root of variance s2 (or σ2)
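Not part of the slides: a minimal NumPy sketch computing the five-number summary, IQR, the 1.5 × IQR outlier rule, and the sample variance and standard deviation on made-up data.

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 70, 110])   # made-up sample

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, med, q3, x.max())

# Values more than 1.5 x IQR below Q1 or above Q3 are flagged as outliers.
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

sample_var = x.var(ddof=1)   # divides by n - 1
sample_std = x.std(ddof=1)   # square root of the sample variance

print(five_number, iqr, outliers, sample_var, sample_std)
```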
Boxplot Analysis
Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
Boxplot
– Data is represented with a box
– The ends of the box are at the first and
third quartiles, i.e., the height of the box is
IQR
– The median is marked by a line within the
box
– Outliers: points beyond a specified outlier
threshold, plotted individually.
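Not part of the slides: a minimal sketch, assuming matplotlib is available, that draws the boxplot just described (box ends at Q1/Q3, line at the median, whiskers at 1.5 × IQR, outliers plotted individually); the data are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 10, 200), [120, 130]])  # synthetic data plus two outliers

plt.boxplot(data, whis=1.5)   # whis=1.5 -> whiskers extend to 1.5 x IQR; points beyond are drawn individually
plt.ylabel("value")
plt.show()
```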
Graphic Displays of Basic Statistical Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis shows the values, y-axis represents the frequencies
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of the data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of one
univariant distribution against the corresponding quantiles
of another
Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
Histogram Analysis
Histogram: graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; this is a crucial distinction when the categories are not of uniform width
The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram with bins ranging from 10000 to 100000]
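Not part of the slides: a minimal matplotlib sketch of a histogram over adjacent, non-overlapping bins of uniform width (with uniform widths, bar height and bar area convey the same information); the income data are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
income = rng.normal(55000, 15000, 1000)   # synthetic values

bins = np.arange(10000, 110000, 10000)    # adjacent, non-overlapping, uniform-width bins
plt.hist(income, bins=bins, edgecolor="black")
plt.xlabel("income")
plt.ylabel("frequency")
plt.show()
```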
Histogram vs. Bar Graph
[Figure: the same data shown as a histogram and as a bar graph]
Histograms Often Tell More than Boxplots
◼ The two histograms shown on the left may have the same boxplot representation
◼ The same values for: min, Q1,
median, Q3, max
◼ But they have rather different data
distributions
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
– For data xi sorted in increasing order, fi indicates that approximately 100·fi % of the data are below or equal to the value xi
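Not part of the slides: a minimal NumPy/matplotlib sketch of a quantile plot, pairing each sorted value xi with fi = (i − 0.5)/n so that roughly 100·fi % of the data lie at or below xi; the data values are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([47, 52, 55, 58, 60, 63, 67, 72, 80, 95])   # made-up values

x = np.sort(data)
n = len(x)
f = (np.arange(1, n + 1) - 0.5) / n   # f_i: about 100*f_i % of the data are <= x_i

plt.plot(f, x, marker="o")
plt.xlabel("f-value")
plt.ylabel("data value")
plt.show()
```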
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
View: is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those
at Branch 2.
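Not part of the slides: a minimal sketch of a two-sample q-q plot using NumPy and matplotlib; the branch prices are invented stand-ins for the example, and the dashed y = x line marks where the two distributions would coincide.

```python
import numpy as np
import matplotlib.pyplot as plt

branch1 = np.array([40, 42, 45, 47, 50, 55, 58, 60, 65, 70])   # invented unit prices
branch2 = np.array([45, 48, 52, 55, 58, 62, 66, 70, 75, 82])

q = np.linspace(0, 1, 21)                                       # quantile levels
plt.plot(np.quantile(branch1, q), np.quantile(branch2, q), marker="o")

lims = [min(branch1.min(), branch2.min()), max(branch1.max(), branch2.max())]
plt.plot(lims, lims, linestyle="--")                            # y = x reference line
plt.xlabel("Branch 1 quantiles")
plt.ylabel("Branch 2 quantiles")
plt.show()
```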
Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
The left half of the figure is positively correlated
The right half is negatively correlated
Uncorrelated Data
Important Characteristics of Data
– Dimensionality (number of attributes)
◆ High dimensional data brings a number of challenges
– Sparsity
◆ Only presence counts
– Resolution
◆ Patterns depend on the scale
– Size
◆ Type of analysis may depend on size of data
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Data Matrix
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
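Not part of the slides: the table above written as an m × n NumPy array (m = 2 objects, n = 5 numeric attributes).

```python
import numpy as np

X = np.array([
    [10.23, 5.27, 15.22, 2.7, 1.2],   # object 1
    [12.65, 6.25, 16.22, 2.2, 1.1],   # object 2
])
print(X.shape)   # (2, 5): m rows (objects), n columns (attributes)
```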
Document Data
Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times
the corresponding term occurs in the document.
              team   coach   play   ball   score   game   win   lost   timeout   season
Document 1      3      0       5      0      2       6      0     2       0         2
Document 2      0      7       0      2      1       0      0     3       0         0
Document 3      0      1       0      0      1       2      2     0       3         0
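Not part of the slides: a minimal Python sketch building a term vector with collections.Counter; the vocabulary matches the table above, while the document text itself is made up.

```python
from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

doc = "team play play score game game lost season"   # made-up document
counts = Counter(doc.split())

term_vector = [counts.get(term, 0) for term in vocabulary]
print(term_vector)   # one component per term: its frequency in the document
```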
Transaction Data
A special type of data, where
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
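Not part of the slides: a minimal Python sketch that represents the transactions above as record data, with one asymmetric binary attribute per item (1 = the item appears in the transaction).

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

items = sorted(set().union(*transactions.values()))   # the item "attributes"
for tid, basket in transactions.items():
    row = [1 if item in basket else 0 for item in items]
    print(tid, row)
```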
Graph Data
Examples: Generic graph, a molecule, and webpages
[Figures: a generic graph with labeled nodes, the benzene molecule C6H6, and linked webpages]
Ordered Data
Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Similarity and Dissimilarity Measures
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute:

Attribute Type      Dissimilarity                                      Similarity
Nominal             d = 0 if x = y, d = 1 if x ≠ y                     s = 1 if x = y, s = 0 if x ≠ y
Ordinal             d = |x − y| / (n − 1)                              s = 1 − d
                    (values mapped to integers 0 to n − 1,
                     where n is the number of values)
Interval or Ratio   d = |x − y|                                        s = −d, s = 1/(1 + d), or
                                                                       s = 1 − (d − min_d)/(max_d − min_d)
Euclidean Distance

  $\mathrm{dist}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$

where n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.

Standardization is necessary if scales differ.
Euclidean Distance
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Distance Matrix
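Not part of the slides: a NumPy sketch that recomputes the distance matrix above from the four 2-D points.

```python
import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

diff = points[:, None, :] - points[None, :, :]   # pairwise component differences
dist = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distance matrix
print(np.round(dist, 3))                         # matches the matrix above
```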
Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance:

  $\mathrm{dist}(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}$

where r is a parameter, n is the number of dimensions (attributes), and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this for binary vectors is the
Hamming distance, which is just the number of bits that are
different between two binary vectors
r = 2. Euclidean distance
r → ∞. “supremum” (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of the vectors
Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
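Not part of the slides: a small NumPy sketch of the Minkowski distance for the three cases above (r = 1, r = 2, r → ∞), checked against p1 and p3 from the earlier example.

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance: r=1 city block, r=2 Euclidean, r=np.inf supremum."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if np.isinf(r):
        return np.abs(x - y).max()
    return (np.abs(x - y) ** r).sum() ** (1.0 / r)

p1, p3 = [0, 2], [3, 1]
print(minkowski(p1, p3, 1))        # 4.0   (L1)
print(minkowski(p1, p3, 2))        # 3.162 (L2)
print(minkowski(p1, p3, np.inf))   # 3.0   (L_inf)
```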
Minkowski Distance
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0

Distance Matrix
Mahalanobis Distance
$\mathrm{mahalanobis}(\mathbf{x}, \mathbf{y}) = \left((\mathbf{x} - \mathbf{y})^{T}\, \Sigma^{-1}\, (\mathbf{x} - \mathbf{y})\right)^{0.5}$

where Σ is the covariance matrix of the data.

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
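Not part of the slides: a minimal NumPy sketch of the Mahalanobis distance; the 2-D data and covariance values are made up, not the red points from the figure.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 3.0]], size=500)  # made-up data
cov = np.cov(data, rowvar=False)   # estimate the covariance matrix Sigma from the data

print(mahalanobis(data[0], data[1], cov))
```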
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y.
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between
points (data objects), x and y.
A distance that satisfies these properties is a
metric
Common Properties of a Similarity
Similarities also have some well-known properties:
1. s(x, y) = 1 (or maximum similarity) only if x = y.
(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
where s(x, y) is the similarity between points (data
objects), x and y.
Similarity Between Binary Vectors
Common situation is that objects, x and y, have only
binary attributes
Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1
Simple Matching and Jaccard Coefficients
The SMC counts both presences and absences equally; it is normally used for symmetric binary attributes
SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard Coefficient (J)
Counts only presences; it is frequently used for asymmetric binary attributes
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
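Not part of the slides: a few lines of Python that recompute the SMC and Jaccard values above from the two binary vectors.

```python
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)
```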
Cosine Similarity
If d1 and d2 are two document vectors, then
cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot
product of vectors, d1 and d2, and || d || is the length of
vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
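Not part of the slides: a short NumPy check of the cosine similarity computed above.

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))   # <d1,d2> / (||d1|| ||d2||)
print(round(cos, 4))   # 0.315
```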
Correlation measures the linear relationship between objects: for two data objects x and y,
corr(x, y) = covariance(x, y) / (std(x) · std(y))
Drawback of Correlation (Non-linear Data)
x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
yi = xi²
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
corr(x, y) = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74)
           = 0
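Not part of the slides: a NumPy check of the example above, showing that Pearson correlation misses the perfect but non-linear relationship y = x².

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2                      # (9, 4, 1, 0, 1, 4, 9)

print(np.corrcoef(x, y)[0, 1])  # ~0: no *linear* relationship is detected
```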
Correlation vs cosine vs Euclidean distance
Choice of the right proximity measure depends on the domain
What is the correct choice of proximity measure for the
following situations?
– Comparing documents using the frequencies of words
◆ Documents are considered similar if the word frequencies are similar
– Comparing the temperature in Celsius of two locations
◆ Two locations are considered similar if the temperatures are similar in
magnitude
– Comparing two time series of temperature measured in Celsius
◆ Two time series are considered similar if their “shape” is similar, i.e., they
vary in the same way over time, achieving minimums and maximums at
similar times, etc.