Data Mining:
Data Quality
& Proximity Measures
Data Mining - Lecture 3
Important Characteristics of Structured Data
Dimensionality
Number of attributes each object is described with
Challenge: high dimensionality (curse of dimensionality)
Sparsity
Sparse data: values of most attributes are zero
Challenge: sparse data calls for special handling (only presence counts)
Resolution
Data properties can often be measured at different resolutions
Challenge: decide on the most appropriate resolution
Attribute Values
Attribute values are numbers or symbols assigned
to an attribute
Distinction between attributes and attribute
values
Same attribute can be mapped to different attribute
values
Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of
values
Example: Attribute values for ID and age are integers
But properties of attribute values can be different
ID has no limit, but age has a minimum and maximum value
Types of Attributes
There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
The type of an attribute depends on which of the
following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, −
Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Attribute Types: Description and Examples

Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, gender

Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
Examples: calendar dates, temperature in Celsius or Fahrenheit

Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Attribute Transformation
A function that maps the entire set of values of a
given attribute to a new set of replacement
values such that each old value can be identified
with one of the new values
Simple functions: x^k, log(x), e^x, |x|
Standardization and Normalization
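As a minimal sketch of the two most common transformations, min-max normalization and z-score standardization, in NumPy (the attribute values here are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical attribute values

# Min-max normalization: rescale values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: shift/scale to zero mean and unit standard deviation
x_std = (x - x.mean()) / x.std()

print(x_norm)  # all values now in [0, 1]
print(x_std)   # mean 0, standard deviation 1
```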
Allowed Transformations by Attribute Level

Nominal: Any permutation of values.
Comment: If all employee ID numbers were reassigned, would it make any difference?

Ordinal: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval: new_value = a * old_value + b, where a and b are constants.
Comment: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio: new_value = a * old_value.
Comment: Length can be measured in meters or feet.
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values.
Examples: zip codes, counts, or the set of words in a collection of
documents.
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values.
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Asymmetric Attributes
Only the presence of an attribute (a non-zero value) is important
Example: each object is a student, and each attribute is 1 if the student takes a particular course at the university, 0 otherwise.
Asymmetric binary attributes
Asymmetric discrete attributes
Asymmetric continuous attributes
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen
[Figure: two sine waves, and the same two sine waves with noise added]
Outliers
Outliers are data objects with characteristics that are
considerably different than most of the other data objects
in the data set
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
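As a hedged illustration, the three handling strategies map naturally onto pandas operations (the DataFrame and its columns are invented toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, 33],
                   "income": [50000, 62000, np.nan, 58000]})  # toy data with gaps

# 1. Eliminate data objects: drop rows containing any missing value
dropped = df.dropna()

# 2. Estimate missing values: e.g., fill each gap with the column mean
imputed = df.fillna(df.mean())

# 3. Ignore the missing value during analysis: pandas aggregates skip NaN by default
print(df["age"].mean())  # computed over the non-missing values only
```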
Duplicate Data
Data set may include data objects that are
duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
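One common cleaning step, sketched with pandas (the example records are hypothetical):

```python
import pandas as pd

people = pd.DataFrame({
    "name":  ["Ann Lee", "Ann Lee", "Bo Chen"],
    "email": ["ann@a.com", "ann@b.com", "bo@c.com"],  # same person, two addresses
})

# Exact duplicates are easy to drop:
deduped = people.drop_duplicates()

# Near-duplicates (same person, different email) require choosing a key,
# e.g., keep the first record per name:
deduped_by_name = people.drop_duplicates(subset="name", keep="first")
print(deduped_by_name)
```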
Proximity Measures
Similarity
Dissimilarity
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1].
Dissimilarity
Numerical measure of how different two data objects are.
Lower when objects are more alike.
Minimum dissimilarity is often 0.
Upper limit varies.
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
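As a hedged sketch, the measures typically used for single attributes of each type look like this in Python (the function names and the choice of dissimilarity-to-similarity conversion are mine):

```python
def nominal_sim(p, q):
    # Nominal: objects are similar only if the values match exactly
    return 1.0 if p == q else 0.0

def ordinal_dissim(p, q, n_levels):
    # Ordinal: map ranks onto [0, 1] and take the absolute difference
    return abs(p - q) / (n_levels - 1)

def interval_ratio_dissim(p, q):
    # Interval/ratio: plain absolute difference; a similarity can then be
    # derived, e.g., as -d or 1 / (1 + d)
    return abs(p - q)

print(nominal_sim("blue", "brown"))       # 0.0
print(ordinal_dissim(1, 3, n_levels=3))   # 1.0, e.g., ranks in {1, 2, 3}
print(interval_ratio_dissim(36.6, 39.2))  # 2.6 (up to floating-point rounding)
```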
Euclidean Distance: Example

dist = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

[Scatter plot of the four points:]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
Euclidean Distance

dist = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.
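A short NumPy sketch that reproduces the distance matrix from the example above:

```python
import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

# Pairwise Euclidean distances via broadcasting:
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 3))  # matches the matrix above, e.g., d(p1, p2) = 2.828
```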
Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance:

dist = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
r = 1: Manhattan distance
A common example of this is the Hamming distance, which is just the
number of bits that are different between two binary vectors.
r = 2: Euclidean distance
r → ∞: “supremum” (L∞ norm) distance
This is the maximum difference between corresponding components of the two vectors.
Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Minkowski Distance: Examples (cont.)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Manhattan distance:
L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

Euclidean distance:
L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

“supremum” distance:
L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrices
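The three special cases can be reproduced with one small NumPy helper (minkowski_matrix is my own name for this sketch):

```python
import numpy as np

def minkowski_matrix(points, r):
    # Pairwise Minkowski distances; r = np.inf gives the supremum distance
    diff = np.abs(points[:, np.newaxis, :] - points[np.newaxis, :, :])
    if np.isinf(r):
        return diff.max(axis=-1)
    return (diff ** r).sum(axis=-1) ** (1.0 / r)

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])
print(minkowski_matrix(points, 1))                # L1: Manhattan matrix above
print(np.round(minkowski_matrix(points, 2), 3))   # L2: Euclidean matrix above
print(minkowski_matrix(points, np.inf))           # L∞: supremum matrix above
```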
Mahalanobis Distance

mahalanobis(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T

where \Sigma is the covariance matrix of the input data X:

\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

[Figure: for the red points, the Euclidean distance is 14.7, the Mahalanobis distance is 6.]
Mahalanobis Distance: Example

Covariance matrix:

\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
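A NumPy check of this example (note that the definition above omits the square root, so this computes the squared form):

```python
import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])        # covariance matrix from the example
sigma_inv = np.linalg.inv(sigma)

def mahal(p, q):
    # (p - q) Sigma^{-1} (p - q)^T, as defined above (no square root)
    d = np.asarray(p) - np.asarray(q)
    return d @ sigma_inv @ d

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(mahal(A, B))  # ≈ 5
print(mahal(A, C))  # ≈ 4
```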
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
A distance that satisfies these properties is a
metric
Common Properties of a Similarity
Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data
objects), p and q.
Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes
Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7)
= 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
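The example can be verified directly with a few lines of Python:

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)  # 2
m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)  # 1
m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)  # 7
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)  # 0

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # 0.7
jaccard = m11 / (m01 + m10 + m11)            # 0.0
print(smc, jaccard)
```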
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.315
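The same computation in NumPy:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

# Dot product divided by the product of the vector lengths
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 3))  # 0.315
```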
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count
attributes
Reduces to Jaccard for binary attributes
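For vectors p and q it is commonly defined as EJ(p, q) = (p • q) / (||p||² + ||q||² − p • q); a minimal sketch:

```python
import numpy as np

def extended_jaccard(p, q):
    # EJ(p, q) = p.q / (||p||^2 + ||q||^2 - p.q); equals Jaccard on 0/1 vectors
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    dot = p @ q
    return dot / (p @ p + q @ q - dot)

# On binary vectors this matches the Jaccard coefficient (M11 = 1, M01 = M10 = 1):
print(extended_jaccard([1, 0, 0, 1], [0, 1, 0, 1]))  # 1/3
```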
Correlation
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, p and q, and then take their dot product
p'_k = (p_k - mean(p)) / std(p)
q'_k = (q_k - mean(q)) / std(q)
correlation(p, q) = p' • q'
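A sketch of the standardize-then-dot-product recipe; note that when the sample standard deviation is used, the dot product is scaled by 1/(n − 1) so the result matches the usual Pearson correlation in [−1, 1]:

```python
import numpy as np

def correlation(p, q):
    # Standardize both objects, then take the scaled dot product
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    ps = (p - p.mean()) / p.std(ddof=1)
    qs = (q - q.mean()) / q.std(ddof=1)
    return ps @ qs / (len(p) - 1)

p = [3, 6, 0, 3, 6]
q = [1, 2, 0, 1, 2]          # q = p / 3: perfect linear relationship
print(correlation(p, q))      # 1.0
print(np.corrcoef(p, q)[0, 1])  # agrees with NumPy's Pearson r
```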
Visually Evaluating Correlation
[Figure: scatter plots showing correlations ranging from –1 to 1.]
General Approach for Combining Similarities
Sometimes attributes are of many different
types, but an overall similarity is needed.
Using Weights to Combine Similarities
May not want to treat all attributes the same.
Use weights w_k that are between 0 and 1 and sum to 1 (see the sketch below).
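A hedged sketch of one common form of the combined measure: per-attribute similarities s_k, indicators δ_k that switch off attributes that cannot be compared for a given pair (e.g., asymmetric attributes where both values are 0), and weights w_k; the normalization by the weighted indicator sum is one reasonable choice:

```python
def combined_similarity(sims, deltas, weights):
    # sims:    per-attribute similarities s_k, each in [0, 1]
    # deltas:  1 if attribute k is comparable for this pair of objects, else 0
    # weights: w_k in [0, 1], summing to 1
    num = sum(w * d * s for s, d, w in zip(sims, deltas, weights))
    den = sum(w * d for _, d, w in zip(sims, deltas, weights))
    return num / den if den else 0.0

# e.g., three attributes, the second one not comparable for this pair:
print(combined_similarity([0.9, 0.4, 1.0], [1, 0, 1], [0.5, 0.3, 0.2]))
```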
Issues in Proximity Calculation
How to handle the case in which attributes have different scales and/or are correlated
How to calculate proximity between objects that are
composed of different types of attributes, quantitative and
qualitative
How to handle proximity calculation when attributes have
different weight (when not all attributes contribute equally
to the proximity of objects)
Mahalanobis
The Mahalanobis distance is appropriate when:
Attributes are correlated
Attributes have different ranges of values
The distribution of the data is Gaussian