CS822
Data Mining
Instructor: Dr. Muhammad Tahir
Measuring Data Similarity and
        Dissimilarity
Similarity
• A numerical measure that indicates how alike two data
  objects are.
• Often between 0 and 1, where 1 means identical and
  0 means completely different.
• Example:
  • Suppose we compare two text documents based on word
    usage.
    • Doc1: "Artificial intelligence is transforming industries."
    • Doc2: "Industries are being transformed by artificial intelligence."
  • Using cosine similarity, their similarity score might be
    close to 1 because they contain the same words in a
    different order.
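This idea can be sketched in Python (illustrative, not part of the slides): build word-count vectors with `collections.Counter` and compare them. Note that with exact word matching, "transforming" and "transformed" do not match, so the raw score here is only moderate; a real system would apply stemming or lemmatization to push it closer to 1.

```python
# Cosine similarity over bag-of-words vectors (illustrative sketch).
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc1 = "Artificial intelligence is transforming industries"
doc2 = "Industries are being transformed by artificial intelligence"
print(round(cosine_similarity(doc1, doc2), 2))
```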
Dissimilarity
• A numerical measure that indicates how different two
  data objects are.
• The minimum dissimilarity is often 0 (identical objects),
  but the upper limit varies depending on the method.
• Example:
  • In a customer segmentation task, we compare two
    customers based on their age:
     • Customer A: Age 25
     • Customer B: Age 50
  • Using Euclidean distance (which, for a single attribute, reduces
    to the absolute difference), the dissimilarity is |50 − 25| =
    25, meaning they are quite different in terms of age.
Dissimilarity …
Euclidean Distance (No Fixed Upper Limit)
  • The straight-line distance between two points in a multi-
    dimensional space.
• Example:
  • Consider two points in a 2D space:
     • A (2, 3)
     • B (10, 15)
  • Euclidean distance = √((10 − 2)² + (15 − 3)²) = √(64 + 144) = √208 ≈ 14.42
  • Upper limit: No fixed value; it depends on the scale of the
    data. In another dataset, distances might be in the hundreds
    or thousands.
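The computation above can be sketched in Python (illustrative, not part of the slides):

```python
# Euclidean (straight-line) distance between two points.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean((2, 3), (10, 15)), 2))  # sqrt(208) ≈ 14.42
```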
Dissimilarity …
Manhattan Distance (Upper Limit Depends on Data
 Range)
  • Measures the sum of absolute differences between
    coordinates.
• Example:
  • A (2, 3) and B (10, 15)
  • Manhattan distance = |10 − 2| + |15 − 3| = 8 + 12 = 20
  • Upper limit: Can be very high if data values are large (e.g.,
    city distances in kilometers).
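A minimal Python sketch of the same calculation (not part of the slides):

```python
# Manhattan (city-block) distance: sum of absolute coordinate differences.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((2, 3), (10, 15)))  # |10-2| + |15-3| = 20
```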
Dissimilarity …
Jaccard Distance (Fixed Upper Limit = 1)
  • Used for comparing binary or set-based data. It is calculated as:

    Jaccard Distance = 1 − |A ∩ B| / |A ∪ B|

• Example:
  • Set A = {apple, banana, mango}
  • Set B = {banana, mango, orange}
  • Intersection = {banana, mango} (2 elements)
  • Union = {apple, banana, mango, orange} (4 elements)
  • Jaccard Similarity = 2/4 = 0.5
  • Jaccard Distance = 1 − 0.5 = 0.5
  • Upper limit: Always 1, because the maximum possible difference is total
    dissimilarity (no overlap).
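The set example above maps directly onto Python's set operators (an illustrative sketch, not part of the slides):

```python
# Jaccard distance on Python sets: 1 - |A ∩ B| / |A ∪ B|.
def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

set_a = {"apple", "banana", "mango"}
set_b = {"banana", "mango", "orange"}
print(jaccard_distance(set_a, set_b))  # 1 - 2/4 = 0.5
```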
Proximity
• A general term that can refer to either
  similarity or dissimilarity between data
  objects.
• Example:
  • In a recommendation system (e.g., Netflix,
    Spotify), proximity between users is measured to
    suggest similar content:
    • If User A and User B both watch sci-fi movies, their
      proximity score is high, so they get similar movie
      recommendations.
Data Matrix and Dissimilarity Matrix
Data Matrix
• A data matrix is a structured table where:
  • rows represent objects (instances) and
  • columns represent features (attributes).

Example of a Data Matrix:

  Object   Height (cm)   Weight (kg)   Age (years)
  A        170           65            25
  B        160           55            30
  C        175           70            28

Each row = an entity (e.g., a person).
Each column = an attribute (e.g., height, weight, age).
Dissimilarity Matrix
• A dissimilarity matrix shows how different (or distant) each object
  is from the others.
• Instead of raw data, it contains distance values (e.g., Euclidean,
  Manhattan, Jaccard).
• It is often symmetric (the distance between A and B is the same as
  between B and A).
• Diagonal values (self-comparison) are usually 0 (an object is
  identical to itself).

Example of a Dissimilarity Matrix (using Euclidean distance):

       A     B     C
  A    0    10     8
  B   10     0     5
  C    8     5     0

  A to B = 10 → They are more different.
  B to C = 5  → They are more similar.
  C to A = 8  → Moderate difference.
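A dissimilarity matrix can be computed directly from a data matrix. The sketch below (not part of the slides) uses the height/weight/age data matrix from the previous slide; the resulting distances are therefore illustrative and differ from the 10/8/5 values shown above, which are a separate example.

```python
# Pairwise Euclidean dissimilarity matrix from a small data matrix.
import math

data = {"A": (170, 65, 25), "B": (160, 55, 30), "C": (175, 70, 28)}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = sorted(data)
matrix = {(i, j): round(euclidean(data[i], data[j]), 2)
          for i in names for j in names}

for i in names:
    print(i, [matrix[(i, j)] for j in names])
```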
Proximity Measure for Nominal
          Attributes
Proximity Measure for Nominal
Attributes
• Proximity measures (similarity and dissimilarity) for
  nominal attributes are used when dealing with
  categorical data, where values represent distinct
  categories with no inherent numerical ordering.
Proximity Measure for Nominal
Attributes
Understanding Nominal Attributes
• Nominal attributes are qualitative and describe
  different categories or labels. Examples include:
  • Colors: {Red, Blue, Green}
  • Car Brands: {Toyota, Ford, Honda}
  • Job Titles: {Engineer, Doctor, Teacher}
• Since these values do not have a numeric meaning,
  we use specific techniques to measure proximity
  (similarity/dissimilarity).
Proximity Measure for Nominal
Attributes
Dissimilarity Measure for Nominal Attributes

The simplest method is the Simple Matching Coefficient (SMC):

  Dissimilarity d(i, j) = (number of mismatches) / (total number of attributes)

Example (comparing two people based on their attributes):

  Person   Car Brand   Eye Color   Job Title
  A        Toyota      Brown       Engineer
  B        Honda       Brown       Teacher

  Total attributes = 3
  Mismatches = 2 (Car Brand and Job Title are different)
  Dissimilarity = 2/3 ≈ 0.67 (higher value = more dissimilar)
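A minimal Python sketch of simple matching dissimilarity (not part of the slides); the matching similarity on the next slide is just one minus this value:

```python
# Simple matching dissimilarity over nominal attributes:
# fraction of attribute positions where the two objects differ.
def smc_dissimilarity(x, y):
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

person_a = ("Toyota", "Brown", "Engineer")
person_b = ("Honda", "Brown", "Teacher")
print(round(smc_dissimilarity(person_a, person_b), 2))  # 2/3 ≈ 0.67
```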
Proximity Measure for Nominal
Attributes
Similarity Measure for Nominal Attributes

Simple Matching Similarity (SMC):

  Similarity s(i, j) = (number of matches) / (total number of attributes)

Example:

  Person   Car Brand   Eye Color   Job Title
  A        Toyota      Brown       Engineer
  B        Honda       Brown       Teacher

  Matches = 1 (Eye Color is the same)
  Similarity = 1/3 ≈ 0.33 (lower value = less similar)
Proximity Measure for Nominal
Attributes
Jaccard Similarity for Binary Nominal Data

If nominal attributes are binary (Yes/No, 1/0), Jaccard Similarity is
often used:

  Jaccard Similarity = (1-1 matches) / (total attributes − 0-0 matches)

In the Jaccard Similarity measure, 1-1 matches and 0-0 matches refer
to how binary (yes/no, 1/0) attributes align between two objects.
• 1-1 Match: Both objects have the same attribute as "1" (yes, true,
  present, etc.).
• 0-0 Match: Both objects have the same attribute as "0" (no, false,
  absent, etc.).

Example:

  User   Likes Apple   Likes Samsung   Likes Sony
  A      1             1               0
  B      1             0               1

  1-1 Matches = 1 (both users like the same product: Apple)
  0-0 Matches = 0
  Total Attributes − 0-0 Matches = 3
  Jaccard Similarity = 1/3 ≈ 0.33
Proximity Measure for Nominal
Attributes
• A contingency table is a table that summarizes the
  frequency of different combinations of two
  categorical (or binary) variables.
• It helps in analyzing the relationship between two
  variables.
  • It allows us to compare two objects or variables.
  • It helps in calculating similarity and dissimilarity.
  • Used in data mining, statistics, and machine learning
    to measure how similar or different two entities are.
Contingency Table for Binary Data
• A contingency table for binary data (data with only 1s (Yes) and
  0s (No)) looks like this:

                 Object j = 1   Object j = 0   Total
  Object i = 1        q              r          q + r
  Object i = 0        s              t          s + t
  Total             q + s          r + t        p (total data)

• What do these values mean?
  • q → Both objects are 1 (Yes, Yes)
  • r → Object i is 1, but j is 0 (Yes, No)
  • s → Object i is 0, but j is 1 (No, Yes)
  • t → Both objects are 0 (No, No)
Example: Sports Preferences of
Alice & Bob
Imagine Alice and Bob are being compared based on whether they like
certain sports (Yes = 1, No = 0):

  Sport      Alice (i)   Bob (j)
  Football   1           1
  Cricket    1           0
  Tennis     0           0

In the contingency table:
  q = 1 (Football: both like it)
  r = 1 (Cricket: Alice likes it, Bob does not)
  s = 0 (no case where Bob likes a sport but Alice doesn't)
  t = 1 (Tennis: neither likes it)

Contingency Table for Alice & Bob:

                  Bob (j) = 1        Bob (j) = 0       Total
  Alice (i) = 1   q = 1 (Football)   r = 1 (Cricket)   q + r = 2
  Alice (i) = 0   s = 0              t = 1 (Tennis)    s + t = 1
  Total           q + s = 1          r + t = 2         p = 3

Now, let's use this table to understand similarity and dissimilarity
measures.
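The four counts can be derived mechanically from two binary vectors, as in this Python sketch (not part of the slides):

```python
# Deriving q, r, s, t from two binary vectors (Alice vs. Bob).
alice = [1, 1, 0]  # Football, Cricket, Tennis
bob   = [1, 0, 0]

q = sum(1 for a, b in zip(alice, bob) if a == 1 and b == 1)  # 1-1 matches
r = sum(1 for a, b in zip(alice, bob) if a == 1 and b == 0)  # 1-0 mismatches
s = sum(1 for a, b in zip(alice, bob) if a == 0 and b == 1)  # 0-1 mismatches
t = sum(1 for a, b in zip(alice, bob) if a == 0 and b == 0)  # 0-0 matches

print(q, r, s, t)  # 1 1 0 1
```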
Symmetric vs. Asymmetric Binary
Distance
• When comparing two objects based on binary (Yes/No or
  1/0) attributes, we use different distance measures
  depending on whether both 1s and 0s matter
  equally or not.
Symmetric Binary Distance
• This considers both agreements (1-1) and disagreements (1-0 or 0-1)
  equally.
• It is useful when both presence and absence of an attribute are meaningful.
• Example: Comparing disease symptoms in two patients (having or not having a
  symptom matters equally).
Formula:

  d(i, j) = (r + s) / (q + r + s + t)

where:
• q = both are 1 (1-1 match)
• r = Alice is 1, Bob is 0 (1-0 mismatch)
• s = Alice is 0, Bob is 1 (0-1 mismatch)
• t = both are 0 (0-0 match)
Asymmetric Binary Distance
• Here, 0-0 matches (both saying "No") are ignored
  because only the presence of an attribute matters.
• Used when only positive occurrences (1s) are
  meaningful and 0s are unimportant.
• Example: Diagnosing rare diseases (if both don’t
  have the disease, it doesn’t matter, but if one does and
  the other doesn’t, it does).
Formula:

  d(i, j) = (r + s) / (q + r + s)
Jaccard Similarity
• Jaccard Similarity is a measure of how similar two binary objects are,
  considering only the presence (1s) of attributes and ignoring 0s:

  sim(i, j) = q / (q + r + s)

  • q = Both objects have 1 (1-1 match)
  • r = Object i has 1, but object j has 0 (1-0 mismatch)
  • s = Object i has 0, but object j has 1 (0-1 mismatch)
  • t (0-0 matches) is ignored in Jaccard similarity
Jaccard is useful when only the presence of attributes
  matters, such as in:
  • Text analysis (words present in two documents)
  • Market basket analysis (common items in shopping carts)
  • Genetic similarity (shared mutations between species)
Example: Sports Preferences of
Alice & Bob

  Sport      Alice (i)   Bob (j)
  Football   1           1
  Cricket    1           0
  Tennis     0           0

From the contingency table:

                  Bob (j) = 1        Bob (j) = 0       Total
  Alice (i) = 1   q = 1 (Football)   r = 1 (Cricket)   2
  Alice (i) = 0   s = 0              t = 1 (Tennis)    1
  Total           1                  2                 3

• Symmetric Binary Distance = (r + s) / (q + r + s + t) = (1 + 0) / 3 ≈ 0.33

• Asymmetric Binary Distance = (r + s) / (q + r + s) = (1 + 0) / 2 = 0.5

• Jaccard Similarity = q / (q + r + s) = 1 / 2 = 0.5
  (50% similarity based on shared preferences)
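All three measures follow directly from the q, r, s, t counts, as this Python sketch shows (not part of the slides):

```python
# The three binary proximity measures for Alice and Bob,
# using the q, r, s, t counts from the contingency table.
q, r, s, t = 1, 1, 0, 1

symmetric  = (r + s) / (q + r + s + t)  # counts 0-0 matches: 1/3
asymmetric = (r + s) / (q + r + s)      # ignores 0-0 matches: 1/2
jaccard    = q / (q + r + s)            # similarity version: 1/2

print(round(symmetric, 2), round(asymmetric, 2), round(jaccard, 2))
```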
Standardizing Numeric Data with Z-
               score
Standardizing Numeric Data with Z-
score
What is Standardization?
• Standardization is a technique used to transform
  numerical data so that different features have a
  common scale. This is crucial when:
  • Data has different units (e.g., height in cm vs. weight in kg).
  • Features have different ranges (e.g., one feature ranges from
    1–1000, another from 0–1).
  • Machine learning models are sensitive to magnitudes (e.g.,
    distance-based algorithms like k-NN, K-means).
Standardizing Numeric Data with Z-
score
What is Z-score Standardization?
 Z-score standardization transforms data to have:
  • Mean (μ) = 0
  • Standard deviation (σ) = 1
Formula for Z-score:

  Z = (X − μ) / σ

  • X = Original value
  • μ = Mean of the dataset
  • σ = Standard deviation of the dataset
• This means each value is now represented in terms of how
  many standard deviations it is away from the mean.
Example: Standardizing Exam Scores

Raw Data (Math Exam Scores of 5 Students):

  Student   Score (X)
  A         80
  B         60
  C         75
  D         90
  E         85

Step 1: Compute the mean (μ):
  μ = (80 + 60 + 75 + 90 + 85) / 5 = 78

Step 2: Compute the standard deviation (σ):
  σ = √(((80−78)² + (60−78)² + (75−78)² + (90−78)² + (85−78)²) / 5)
    = √(530 / 5) = √106 ≈ 10.30

Step 3: Compute the Z-score for each student:

  Student   Score (X)   Z-score
  A         80           0.19
  B         60          -1.75
  C         75          -0.29
  D         90           1.17
  E         85           0.68

Interpretation of Z-scores
  • Positive Z-score (Z > 0) → Value is above the mean (e.g., Student D
    scored 1.17 standard deviations above the mean).
  • Negative Z-score (Z < 0) → Value is below the mean (e.g., Student B
    got -1.75, meaning much lower than the mean).
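The steps above can be sketched in Python (not part of the slides). Note it uses the population standard deviation (dividing by n), matching the slide's calculation:

```python
# Z-score standardization of the exam scores.
import math

scores = {"A": 80, "B": 60, "C": 75, "D": 90, "E": 85}

mu = sum(scores.values()) / len(scores)  # 78.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in scores.values()) / len(scores))

z = {k: round((x - mu) / sigma, 2) for k, x in scores.items()}
print(z)  # {'A': 0.19, 'B': -1.75, 'C': -0.29, 'D': 1.17, 'E': 0.68}
```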
Commonly Used Distance Measures/Metrics
  • Euclidean distance measures the straight-line
    distance between two points in a multi-dimensional
    space
  • Manhattan distance is useful when the
    dimensions in the data have different units of
    measurement
  • Chebyshev distance is ideal for applications where
    the maximum difference between two dimensions is
    more important than the individual differences.
  • Mahalanobis distance takes into account the
    covariance between variables. This is especially
    useful in applications where the dimensions are
    correlated
  • Hamming distance is used to measure the
    difference between two strings of equal length.
  • The Haversine distance is used to calculate the
    distance between two points on a sphere
  • Cosine distance is a measure of similarity between
    two non-zero vectors of an inner product space
Commonly Used Distance
Measures/Metrics
• Understanding the strengths and weaknesses of each
  distance metric is crucial in selecting the appropriate
  metric for a given problem.
• By choosing the right distance metric, we can improve
  the accuracy and efficiency of our machine learning
  models.
              May be included in the exam
Minkowski Distance
• Minkowski distance is a generalized distance metric
  that includes Euclidean and Manhattan distances as
  special cases. It is defined as:

  d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)

• x and y are two points in n-dimensional space.
• p is the order of the Minkowski distance.
• |xᵢ − yᵢ| is the absolute difference between the coordinates of
  the two points.
Special Cases of Minkowski Distance
• Minkowski Distance varies depending on the value
  of p:
Manhattan Distance (p = 1) (city block, L1 norm)
• Interpretation: The sum of absolute differences
  between coordinates
• Use case: When movement is restricted to grid-based
  paths (like city blocks).
• Example: Points: A(1, 2) and B(4, 6)
• Distance = |4 − 1| + |6 − 2| = 3 + 4 = 7
Special Cases of Minkowski Distance
• Euclidean Distance (p = 2) (L2 norm)
• Interpretation: The straight-line distance between two
  points.
• Use case: When a direct path is possible.
• Example: Points: A(1, 2) and B(4, 6)
• Distance = √((4 − 1)² + (6 − 2)²) = √(9 + 16) = √25 = 5
Special Cases of Minkowski Distance
• Chebyshev Distance (p → ∞) ("supremum"; Lmax norm, L∞ norm)
• Interpretation: The maximum absolute difference
  along any dimension.
• Use case: When diagonal moves are allowed and have
  the same cost as horizontal/vertical moves.
• Example: Points: A(1, 2) and B(4, 6)
• Distance = max(|4 − 1|, |6 − 2|) = max(3, 4) = 4
Special Cases of Minkowski Distance
• Minkowski Distance generalizes multiple
  distance metrics.
• The choice of p affects the result:
  • p = 1 → Manhattan distance (city block movement)
  • p = 2 → Euclidean distance (straight-line movement)
  • p → ∞ → Chebyshev distance (max absolute difference)
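The three special cases can be computed by one function, as in this Python sketch (not part of the slides):

```python
# Minkowski distance for p = 1, 2, and infinity on A(1, 2), B(4, 6).
import math

def minkowski(x, y, p):
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)  # Chebyshev: max absolute difference
    return sum(d ** p for d in diffs) ** (1 / p)

A, B = (1, 2), (4, 6)
print(minkowski(A, B, 1))         # 7.0  (Manhattan)
print(minkowski(A, B, 2))         # 5.0  (Euclidean)
print(minkowski(A, B, math.inf))  # 4    (Chebyshev)
```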
Properties of Minkowski Distance
• Positive Definiteness
  • Distance is always positive: d(i, j) > 0 if i ≠ j
  • Distance is zero only when the points are the same: d(i, i) = 0
• Symmetry
  • Distance is the same in both directions: d(i, j) = d(j, i)
• Triangle Inequality
  • The direct distance between two points is always less than or
    equal to the sum of the distances via a third point:
    d(i, j) ≤ d(i, k) + d(k, j)
  • Example: If going from A to C directly is 5 units, but going A →
    B → C is 6 units, then direct travel is the shortest path.
Since Minkowski Distance satisfies these three conditions, it is
considered a valid distance metric.
Ordinal Variables
Ordinal Variables
• Ordinal data is a type of categorical data where values are ranked or
  ordered, but the differences between them are not necessarily equal.
• Unlike numerical data, you cannot perform meaningful arithmetic operations like
  addition or subtraction.
Examples of Ordinal Data
   1.Movie Ratings (e.g., 1 star, 2 stars, ..., 5 stars)
   2.Education Level (e.g., High School < Bachelor's < Master's < PhD)
   3.Customer Satisfaction (e.g., Very Dissatisfied < Dissatisfied < Neutral < Satisfied <
     Very Satisfied)
   4.Pain Level in Medical Surveys (e.g., No Pain < Mild Pain < Moderate Pain < Severe
     Pain)
• Even though these values have an order, the difference between "Satisfied" and
  "Very Satisfied" is not necessarily the same as between "Neutral" and "Satisfied."
Cosine Similarity with Ordinal Data:
Example
Scenario: Movie Ratings
Two users (A & B) rate three movies on a scale of 1 to 5
(where 1 = worst, 5 = best):

  Movie     User A Rating   User B Rating
  Movie 1   5               4
  Movie 2   3               2
  Movie 3   4               5

Step 1: Compute the dot product:
  A · B = (5)(4) + (3)(2) + (4)(5) = 20 + 6 + 20 = 46

Step 2: Compute the Euclidean norms:
  • For User A: ||A|| = √(5² + 3² + 4²) = √50 ≈ 7.07
  • For User B: ||B|| = √(4² + 2² + 5²) = √45 ≈ 6.71

Step 3: Compute cosine similarity:
  cos(A, B) = 46 / (7.07 × 6.71) ≈ 0.97

• Cosine similarity = 0.97 → Very high similarity between User A and
  User B's ratings.
• If the similarity were closer to 0, it would mean their ratings are
  quite different.
• If it were negative, it would mean they have opposite preferences.
Cosine Similarity
• A document can be represented by thousands of attributes, each
  recording the frequency of a particular word (such as keywords) or phrase
  in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
        cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
      where · indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
     where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.
   d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
   d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
   ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = √42 ≈ 6.481
   ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = √17 ≈ 4.12
   cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
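The worked example above can be verified with a short Python sketch (not part of the slides):

```python
# Reproducing the slide's term-frequency cosine similarity.
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(d1, d2))   # 25
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(17) ≈ 4.123

print(round(dot / (norm1 * norm2), 2))  # 0.94
```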
You are welcome