CS822
Data Mining
Instructor: Dr. Muhammad Tahir
Measuring Data Similarity and
        Dissimilarity
Similarity
• A numerical measure that indicates how alike two data
  objects are.
• Often between 0 and 1, where 1 means identical and
  0 means completely different.
• Example:
  • Suppose we compare two text documents based on word
    usage.
    • Doc1: "Artificial intelligence is transforming industries."
    • Doc2: "Industries are being transformed by artificial intelligence."
  • Using cosine similarity, their similarity score might be
    close to 1 because they contain the same words in a
    different order.
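This idea can be sketched in Python (illustrative, not part of the slides): build word-count vectors with `collections.Counter` and compare them. Note that with exact word matching, "transforming" and "transformed" do not match, so the raw score here is only moderate; a real system would apply stemming or lemmatization to push it closer to 1.

```python
# Cosine similarity over bag-of-words vectors (illustrative sketch).
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc1 = "Artificial intelligence is transforming industries"
doc2 = "Industries are being transformed by artificial intelligence"
print(round(cosine_similarity(doc1, doc2), 2))
```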
Dissimilarity
• A numerical measure that indicates how different two
  data objects are.
• The minimum dissimilarity is often 0 (identical objects),
  but the upper limit varies depending on the method.
• Example:
  • In a customer segmentation task, we compare two
    customers based on their age:
     • Customer A: Age 25
     • Customer B: Age 50
  • Using Euclidean distance (which, for a single attribute, reduces
    to the absolute difference), the dissimilarity is |50 − 25| =
    25, meaning they are quite different in terms of age.
Dissimilarity …
Euclidean Distance (No Fixed Upper Limit)
  • The straight-line distance between two points in a multi-
    dimensional space.
• Example:
  • Consider two points in a 2D space:
     • A (2, 3)
     • B (10, 15)
  • Euclidean distance = √((10 − 2)² + (15 − 3)²) = √(64 + 144) = √208 ≈ 14.42
  • Upper limit: No fixed value; it depends on the scale of the
    data. In another dataset, distances might be in the hundreds
    or thousands.
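The computation above can be sketched in Python (illustrative, not part of the slides):

```python
# Euclidean (straight-line) distance between two points.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean((2, 3), (10, 15)), 2))  # sqrt(208) ≈ 14.42
```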
Dissimilarity …
Manhattan Distance (Upper Limit Depends on Data
 Range)
  • Measures the sum of absolute differences between
    coordinates.
• Example:
  • A (2, 3) and B (10, 15)
  • Manhattan distance = |10 − 2| + |15 − 3| = 8 + 12 = 20
  • Upper limit: Can be very high if data values are large (e.g.,
    city distances in kilometers).
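A minimal Python sketch of the same calculation (not part of the slides):

```python
# Manhattan (city-block) distance: sum of absolute coordinate differences.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((2, 3), (10, 15)))  # |10-2| + |15-3| = 20
```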
Dissimilarity …
Jaccard Distance (Fixed Upper Limit = 1)
  • Used for comparing binary or set-based data. It is calculated as:

    Jaccard Distance = 1 − |A ∩ B| / |A ∪ B|

• Example:
  • Set A = {apple, banana, mango}
  • Set B = {banana, mango, orange}
  • Intersection = {banana, mango} (2 elements)
  • Union = {apple, banana, mango, orange} (4 elements)
  • Jaccard Similarity = 2/4 = 0.5
  • Jaccard Distance = 1 − 0.5 = 0.5
  • Upper limit: Always 1, because the maximum possible difference is total
    dissimilarity (no overlap).
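The set example above maps directly onto Python's set operators (an illustrative sketch, not part of the slides):

```python
# Jaccard distance on Python sets: 1 - |A ∩ B| / |A ∪ B|.
def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

set_a = {"apple", "banana", "mango"}
set_b = {"banana", "mango", "orange"}
print(jaccard_distance(set_a, set_b))  # 1 - 2/4 = 0.5
```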
Proximity
• A general term that can refer to either
  similarity or dissimilarity between data
  objects.
• Example:
  • In a recommendation system (e.g., Netflix,
    Spotify), proximity between users is measured to
    suggest similar content:
    • If User A and User B both watch sci-fi movies, their
      proximity score is high, so they get similar movie
      recommendations.
Data Matrix and Dissimilarity Matrix
Data Matrix
• A data matrix is a structured table where:
  • rows represent objects (instances) and
  • columns represent features (attributes).

Example of a Data Matrix:

  Object   Height (cm)   Weight (kg)   Age (years)
  A        170           65            25
  B        160           55            30
  C        175           70            28

Each row = an entity (e.g., a person).
Each column = an attribute (e.g., height, weight, age).
Dissimilarity Matrix
• A dissimilarity matrix shows how different (or distant) each object
  is from the others.
• Instead of raw data, it contains distance values (e.g., Euclidean,
  Manhattan, Jaccard).
• It is often symmetric (the distance between A and B is the same as
  between B and A).
• Diagonal values (self-comparison) are usually 0 (an object is
  identical to itself).

Example of a Dissimilarity Matrix (using Euclidean distance):

       A     B     C
  A    0    10     8
  B   10     0     5
  C    8     5     0

  A to B = 10 → They are more different.
  B to C = 5  → They are more similar.
  C to A = 8  → Moderate difference.
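A dissimilarity matrix can be computed directly from a data matrix. The sketch below (not part of the slides) uses the height/weight/age data matrix from the previous slide; the resulting distances are therefore illustrative and differ from the 10/8/5 values shown above, which are a separate example.

```python
# Pairwise Euclidean dissimilarity matrix from a small data matrix.
import math

data = {"A": (170, 65, 25), "B": (160, 55, 30), "C": (175, 70, 28)}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

names = sorted(data)
matrix = {(i, j): round(euclidean(data[i], data[j]), 2)
          for i in names for j in names}

for i in names:
    print(i, [matrix[(i, j)] for j in names])
```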
Proximity Measure for Nominal
          Attributes
Proximity Measure for Nominal
Attributes
• Proximity measures (similarity and dissimilarity) for
  nominal attributes are used when dealing with
  categorical data, where values represent distinct
  categories with no inherent numerical ordering.
Proximity Measure for Nominal
Attributes
Understanding Nominal Attributes
• Nominal attributes are qualitative and describe
  different categories or labels. Examples include:
  • Colors: {Red, Blue, Green}
  • Car Brands: {Toyota, Ford, Honda}
  • Job Titles: {Engineer, Doctor, Teacher}
• Since these values do not have a numeric meaning,
  we use specific techniques to measure proximity
  (similarity/dissimilarity).
Proximity Measure for Nominal
Attributes
Dissimilarity Measure for Nominal Attributes

The simplest method is the Simple Matching Coefficient (SMC):

  Dissimilarity d(i, j) = (number of mismatches) / (total number of attributes)

Example (comparing two people based on their attributes):

  Person   Car Brand   Eye Color   Job Title
  A        Toyota      Brown       Engineer
  B        Honda       Brown       Teacher

  Total attributes = 3
  Mismatches = 2 (Car Brand and Job Title are different)
  Dissimilarity = 2/3 ≈ 0.67 (higher value = more dissimilar)
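A minimal Python sketch of simple matching dissimilarity (not part of the slides); the matching similarity on the next slide is just one minus this value:

```python
# Simple matching dissimilarity over nominal attributes:
# fraction of attribute positions where the two objects differ.
def smc_dissimilarity(x, y):
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

person_a = ("Toyota", "Brown", "Engineer")
person_b = ("Honda", "Brown", "Teacher")
print(round(smc_dissimilarity(person_a, person_b), 2))  # 2/3 ≈ 0.67
```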
Proximity Measure for Nominal
Attributes
Similarity Measure for Nominal Attributes

Simple Matching Similarity (SMC):

  Similarity s(i, j) = (number of matches) / (total number of attributes)

Example:

  Person   Car Brand   Eye Color   Job Title
  A        Toyota      Brown       Engineer
  B        Honda       Brown       Teacher

  Matches = 1 (Eye Color is the same)
  Similarity = 1/3 ≈ 0.33 (lower value = less similar)
Proximity Measure for Nominal
Attributes
Jaccard Similarity for Binary Nominal Data

If nominal attributes are binary (Yes/No, 1/0), Jaccard Similarity is
often used:

  Jaccard Similarity = (1-1 matches) / (total attributes − 0-0 matches)

In the Jaccard Similarity measure, 1-1 matches and 0-0 matches refer
to how binary (yes/no, 1/0) attributes align between two objects.
• 1-1 Match: Both objects have the same attribute as "1" (yes, true,
  present, etc.).
• 0-0 Match: Both objects have the same attribute as "0" (no, false,
  absent, etc.).

Example:

  User   Likes Apple   Likes Samsung   Likes Sony
  A      1             1               0
  B      1             0               1

  1-1 Matches = 1 (both users like the same product: Apple)
  0-0 Matches = 0
  Total Attributes − 0-0 Matches = 3
  Jaccard Similarity = 1/3 ≈ 0.33
Proximity Measure for Nominal
Attributes
• A contingency table is a table that summarizes the
  frequency of different combinations of two
  categorical (or binary) variables.
• It helps in analyzing the relationship between two
  variables.
  • It allows us to compare two objects or variables.
  • It helps in calculating similarity and dissimilarity.
  • Used in data mining, statistics, and machine learning
    to measure how similar or different two entities are.
Contingency Table for Binary Data
• A contingency table for binary data (data with only 1s (Yes) and
  0s (No)) looks like this:

                 Object j = 1   Object j = 0   Total
  Object i = 1        q              r          q + r
  Object i = 0        s              t          s + t
  Total             q + s          r + t        p (total data)

• What do these values mean?
  • q → Both objects are 1 (Yes, Yes)
  • r → Object i is 1, but j is 0 (Yes, No)
  • s → Object i is 0, but j is 1 (No, Yes)
  • t → Both objects are 0 (No, No)
Example: Sports Preferences of
Alice & Bob
Imagine Alice and Bob are being compared based on whether they like
certain sports (Yes = 1, No = 0):

  Sport      Alice (i)   Bob (j)
  Football   1           1
  Cricket    1           0
  Tennis     0           0

In the contingency table:
  q = 1 (Football: both like it)
  r = 1 (Cricket: Alice likes it, Bob does not)
  s = 0 (no case where Bob likes a sport but Alice doesn't)
  t = 1 (Tennis: neither likes it)

Contingency Table for Alice & Bob:

                  Bob (j) = 1        Bob (j) = 0       Total
  Alice (i) = 1   q = 1 (Football)   r = 1 (Cricket)   q + r = 2
  Alice (i) = 0   s = 0              t = 1 (Tennis)    s + t = 1
  Total           q + s = 1          r + t = 2         p = 3

Now, let's use this table to understand similarity and dissimilarity
measures.
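The four counts can be derived mechanically from two binary vectors, as in this Python sketch (not part of the slides):

```python
# Deriving q, r, s, t from two binary vectors (Alice vs. Bob).
alice = [1, 1, 0]  # Football, Cricket, Tennis
bob   = [1, 0, 0]

q = sum(1 for a, b in zip(alice, bob) if a == 1 and b == 1)  # 1-1 matches
r = sum(1 for a, b in zip(alice, bob) if a == 1 and b == 0)  # 1-0 mismatches
s = sum(1 for a, b in zip(alice, bob) if a == 0 and b == 1)  # 0-1 mismatches
t = sum(1 for a, b in zip(alice, bob) if a == 0 and b == 0)  # 0-0 matches

print(q, r, s, t)  # 1 1 0 1
```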
Symmetric vs. Asymmetric Binary
Distance
• When comparing two objects based on binary (Yes/No or
  1/0) attributes, we use different distance measures
  depending on whether both 1s and 0s matter
  equally or not.
Symmetric Binary Distance
• This considers both agreements (1-1) and disagreements (1-0 or 0-1)
  equally.
• It is useful when both presence and absence of an attribute are meaningful.
• Example: Comparing disease symptoms in two patients (having or not having a
  symptom matters equally).
Formula:

  d(i, j) = (r + s) / (q + r + s + t)

where:
• q = both are 1 (1-1 match)
• r = Alice is 1, Bob is 0 (1-0 mismatch)
• s = Alice is 0, Bob is 1 (0-1 mismatch)
• t = both are 0 (0-0 match)
Asymmetric Binary Distance
• Here, 0-0 matches (both saying "No") are ignored
  because only the presence of an attribute matters.
• Used when only positive occurrences (1s) are
  meaningful and 0s are unimportant.
• Example: Diagnosing rare diseases (if both don’t
  have the disease, it doesn’t matter, but if one does and
  the other doesn’t, it does).
Formula:

  d(i, j) = (r + s) / (q + r + s)
Jaccard Similarity
• Jaccard Similarity is a measure of how similar two binary objects are,
  considering only the presence (1s) of attributes and ignoring 0s:

  sim(i, j) = q / (q + r + s)

  • q = Both objects have 1 (1-1 match)
  • r = Object i has 1, but object j has 0 (1-0 mismatch)
  • s = Object i has 0, but object j has 1 (0-1 mismatch)
  • t (0-0 matches) is ignored in Jaccard similarity
Jaccard is useful when only the presence of attributes
  matters, such as in:
  • Text analysis (words present in two documents)
  • Market basket analysis (common items in shopping carts)
  • Genetic similarity (shared mutations between species)
Example: Sports Preferences of
Alice & Bob

  Sport      Alice (i)   Bob (j)
  Football   1           1
  Cricket    1           0
  Tennis     0           0

From the contingency table:

                  Bob (j) = 1        Bob (j) = 0       Total
  Alice (i) = 1   q = 1 (Football)   r = 1 (Cricket)   2
  Alice (i) = 0   s = 0              t = 1 (Tennis)    1
  Total           1                  2                 3

• Symmetric Binary Distance = (r + s) / (q + r + s + t) = (1 + 0) / 3 ≈ 0.33

• Asymmetric Binary Distance = (r + s) / (q + r + s) = (1 + 0) / 2 = 0.5

• Jaccard Similarity = q / (q + r + s) = 1 / 2 = 0.5
  (50% similarity based on shared preferences)
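All three measures follow directly from the q, r, s, t counts, as this Python sketch shows (not part of the slides):

```python
# The three binary proximity measures for Alice and Bob,
# using the q, r, s, t counts from the contingency table.
q, r, s, t = 1, 1, 0, 1

symmetric  = (r + s) / (q + r + s + t)  # counts 0-0 matches: 1/3
asymmetric = (r + s) / (q + r + s)      # ignores 0-0 matches: 1/2
jaccard    = q / (q + r + s)            # similarity version: 1/2

print(round(symmetric, 2), round(asymmetric, 2), round(jaccard, 2))
```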
Standardizing Numeric Data with Z-
               score
Standardizing Numeric Data with Z-
score
What is Standardization?
• Standardization is a technique used to transform
  numerical data so that different features have a
  common scale. This is crucial when:
  • Data has different units (e.g., height in cm vs. weight in kg).
  • Features have different ranges (e.g., one feature ranges from
    1–1000, another from 0–1).
  • Machine learning models are sensitive to magnitudes (e.g.,
    distance-based algorithms like k-NN, K-means).
Standardizing Numeric Data with Z-
score
What is Z-score Standardization?
 Z-score standardization transforms data to have:
  • Mean (μ) = 0
  • Standard deviation (σ) = 1
Formula for Z-score:

  Z = (X − μ) / σ

  • X = Original value
  • μ = Mean of the dataset
  • σ = Standard deviation of the dataset
• This means each value is now represented in terms of how
  many standard deviations it is away from the mean.
Example: Standardizing Exam Scores

Raw Data (Math Exam Scores of 5 Students):

  Student   Score (X)
  A         80
  B         60
  C         75
  D         90
  E         85

Step 1: Compute the mean (μ):
  μ = (80 + 60 + 75 + 90 + 85) / 5 = 78

Step 2: Compute the standard deviation (σ):
  σ = √(((80−78)² + (60−78)² + (75−78)² + (90−78)² + (85−78)²) / 5)
    = √(530 / 5) = √106 ≈ 10.30

Step 3: Compute the Z-score for each student:

  Student   Score (X)   Z-score
  A         80           0.19
  B         60          -1.75
  C         75          -0.29
  D         90           1.17
  E         85           0.68

Interpretation of Z-scores
  • Positive Z-score (Z > 0) → Value is above the mean (e.g., Student D
    scored 1.17 standard deviations above the mean).
  • Negative Z-score (Z < 0) → Value is below the mean (e.g., Student B
    got -1.75, meaning much lower than the mean).
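The steps above can be sketched in Python (not part of the slides). Note it uses the population standard deviation (dividing by n), matching the slide's calculation:

```python
# Z-score standardization of the exam scores.
import math

scores = {"A": 80, "B": 60, "C": 75, "D": 90, "E": 85}

mu = sum(scores.values()) / len(scores)  # 78.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in scores.values()) / len(scores))

z = {k: round((x - mu) / sigma, 2) for k, x in scores.items()}
print(z)  # {'A': 0.19, 'B': -1.75, 'C': -0.29, 'D': 1.17, 'E': 0.68}
```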
Commonly Used Distance Measures/Metrics
  • Euclidean distance measures the straight-line
    distance between two points in a multi-dimensional
    space
  • Manhattan distance is useful when the
    dimensions in the data have different units of
    measurement
  • Chebyshev distance is ideal for applications where
    the maximum difference between two dimensions is
    more important than the individual differences.
  • Mahalanobis distance takes into account the
    covariance between variables. This is especially
    useful in applications where the dimensions are
    correlated
  • Hamming distance is used to measure the
    difference between two strings of equal length.
  • The Haversine distance is used to calculate the
    distance between two points on a sphere
  • Cosine distance is a measure of similarity between
    two non-zero vectors of an inner product space
Commonly Used Distance
Measures/Metrics
• Understanding the strengths and weaknesses of each
  distance metric is crucial in selecting the appropriate
  metric for a given problem.
• By choosing the right distance metric, we can improve
  the accuracy and efficiency of our machine learning
  models.
              May be included in the exam
Minkowski Distance
• Minkowski distance is a generalized distance metric
  that includes Euclidean and Manhattan distances as
  special cases. It is defined as:

  d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)

• x and y are two points in n-dimensional space.
• p is the order of the Minkowski distance.
• |xᵢ − yᵢ| is the absolute difference between the coordinates of
  the two points.
Special Cases of Minkowski Distance
• Minkowski Distance varies depending on the value
  of p:
Manhattan Distance (p = 1) (city block, L1 norm)
• Interpretation: The sum of absolute differences
  between coordinates
• Use case: When movement is restricted to grid-based
  paths (like city blocks).
• Example: Points: A(1, 2) and B(4, 6)
• Distance = |4 − 1| + |6 − 2| = 3 + 4 = 7
Special Cases of Minkowski Distance
• Euclidean Distance (p = 2) (L2 norm)
• Interpretation: The straight-line distance between two
  points.
• Use case: When a direct path is possible.
• Example: Points: A(1, 2) and B(4, 6)
• Distance = √((4 − 1)² + (6 − 2)²) = √(9 + 16) = √25 = 5
Special Cases of Minkowski Distance
• Chebyshev Distance (p → ∞) ("supremum"; Lmax norm, L∞ norm)
• Interpretation: The maximum absolute difference
  along any dimension.
• Use case: When diagonal moves are allowed and have
  the same cost as horizontal/vertical moves.
• Example: Points: A(1, 2) and B(4, 6)
• Distance = max(|4 − 1|, |6 − 2|) = max(3, 4) = 4
Special Cases of Minkowski Distance
• Minkowski Distance generalizes multiple
  distance metrics.
• The choice of p affects the result:
  • p = 1 → Manhattan distance (city block movement)
  • p = 2 → Euclidean distance (straight-line movement)
  • p → ∞ → Chebyshev distance (max absolute difference)
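The three special cases can be computed by one function, as in this Python sketch (not part of the slides):

```python
# Minkowski distance for p = 1, 2, and infinity on A(1, 2), B(4, 6).
import math

def minkowski(x, y, p):
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)  # Chebyshev: max absolute difference
    return sum(d ** p for d in diffs) ** (1 / p)

A, B = (1, 2), (4, 6)
print(minkowski(A, B, 1))         # 7.0  (Manhattan)
print(minkowski(A, B, 2))         # 5.0  (Euclidean)
print(minkowski(A, B, math.inf))  # 4    (Chebyshev)
```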
Properties of Minkowski Distance
• Positive Definiteness
  • Distance is always positive: d(i, j) > 0 if i ≠ j
  • Distance is zero only when the points are the same: d(i, i) = 0
• Symmetry
  • Distance is the same in both directions: d(i, j) = d(j, i)
• Triangle Inequality
  • The direct distance between two points is always less than or
    equal to the sum of the distances via a third point:
    d(i, j) ≤ d(i, k) + d(k, j)
  • Example: If going from A to C directly is 5 units, but going A →
    B → C is 6 units, then direct travel is the shortest path.
Since Minkowski Distance satisfies these three conditions, it is
considered a valid distance metric.
Ordinal Variables
Ordinal Variables
• Ordinal data is a type of categorical data where values are ranked or
  ordered, but the differences between them are not necessarily equal.
• Unlike numerical data, you cannot perform meaningful arithmetic operations like
  addition or subtraction.
Examples of Ordinal Data
   1.Movie Ratings (e.g., 1 star, 2 stars, ..., 5 stars)
   2.Education Level (e.g., High School < Bachelor's < Master's < PhD)
   3.Customer Satisfaction (e.g., Very Dissatisfied < Dissatisfied < Neutral < Satisfied <
     Very Satisfied)
   4.Pain Level in Medical Surveys (e.g., No Pain < Mild Pain < Moderate Pain < Severe
     Pain)
• Even though these values have an order, the difference between "Satisfied" and
  "Very Satisfied" is not necessarily the same as between "Neutral" and "Satisfied."
Cosine Similarity with Ordinal Data:
Example
Scenario: Movie Ratings
Two users (A & B) rate three movies on a scale of 1 to 5
(where 1 = worst, 5 = best):

  Movie     User A Rating   User B Rating
  Movie 1   5               4
  Movie 2   3               2
  Movie 3   4               5

Step 1: Compute the dot product:
  A · B = (5)(4) + (3)(2) + (4)(5) = 20 + 6 + 20 = 46

Step 2: Compute the Euclidean norms:
  • For User A: ||A|| = √(5² + 3² + 4²) = √50 ≈ 7.07
  • For User B: ||B|| = √(4² + 2² + 5²) = √45 ≈ 6.71

Step 3: Compute cosine similarity:
  cos(A, B) = 46 / (7.07 × 6.71) ≈ 0.97

• Cosine similarity = 0.97 → Very high similarity between User A and
  User B's ratings.
• If the similarity were closer to 0, it would mean their ratings are
  quite different.
• If it were negative, it would mean they have opposite preferences.
Cosine Similarity
• A document can be represented by thousands of attributes, each
  recording the frequency of a particular word (such as keywords) or phrase
  in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
        cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
      where · indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
     where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.
   d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
   d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
   ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = √42 ≈ 6.481
   ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = √17 ≈ 4.12
   cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
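The worked example above can be verified with a short Python sketch (not part of the slides):

```python
# Reproducing the slide's term-frequency cosine similarity.
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(d1, d2))   # 25
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(17) ≈ 4.123

print(round(dot / (norm1 * norm2), 2))  # 0.94
```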
You are welcome