Similarity and Dissimilarity Measures
• Similarity measure
   • Numerical measure of how alike two data objects are.
   • Is higher when objects are more alike.
   • Often falls in the range [0,1]
• Dissimilarity measure
   •   Numerical measure of how different two data objects are
   •   Lower when objects are more alike
   •   Minimum dissimilarity is often 0
   •   Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
      The similarity and dissimilarity between two objects, x and y, with
      respect to a single, simple attribute depend on the attribute type
      (nominal, ordinal, or interval/ratio), as sketched below.
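A rough sketch of those per-attribute definitions, assuming the usual conventions (match/mismatch for nominal attributes, normalized rank difference for ordinal attributes, absolute difference for interval or ratio attributes); the function names and the 1/(1+d) similarity are illustrative choices, not from the original slide:

```python
# Sketch of per-attribute proximity under commonly used conventions.
# Names and the 1/(1+d) conversion are illustrative assumptions.

def nominal_proximity(x, y):
    """Nominal: similarity 1 iff the values match; dissimilarity is the complement."""
    s = 1 if x == y else 0
    return s, 1 - s

def ordinal_proximity(x, y, n_values):
    """Ordinal: values mapped to ranks 0..n_values-1, normalized rank difference."""
    d = abs(x - y) / (n_values - 1)
    return 1 - d, d

def interval_proximity(x, y):
    """Interval/ratio: absolute difference; 1/(1+d) is one common similarity."""
    d = abs(x - y)
    return 1 / (1 + d), d

print(nominal_proximity("red", "blue"))      # (0, 1)
print(ordinal_proximity(1, 3, n_values=5))   # (0.5, 0.5)
print(interval_proximity(3.0, 7.5))          # (~0.18, 4.5)
```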
                               Euclidean Distance
• Euclidean Distance

      dist(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }

   where n is the number of dimensions (attributes) and x_k and y_k are,
   respectively, the kth attributes (components) of data objects x and y.
• x = (3, 6, 0, 3, 6)
• y = (1, 2, 0, 1, 2)
• dist(x, y) = \sqrt{ (3 - 1)^2 + (6 - 2)^2 + (0 - 0)^2 + (3 - 1)^2 + (6 - 2)^2 } = \sqrt{40} ≈ 6.32
          Standardization is necessary if scales differ.
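A minimal NumPy sketch of this computation, reusing the x and y above (illustrative, not part of the original slide):

```python
import numpy as np

x = np.array([3, 6, 0, 3, 6])
y = np.array([1, 2, 0, 1, 2])

# Euclidean distance: square root of the sum of squared component differences.
dist = np.sqrt(np.sum((x - y) ** 2))
print(dist)   # 6.3245...  (sqrt of 40)

# If the attribute scales differ, standardize each attribute first
# (e.g. to zero mean and unit variance) before computing distances.
```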
Euclidean Distance
[Scatter plot of the four points p1-p4 in the plane]

        point   x   y
         p1     0   2
         p2     2   0
         p3     3   1
         p4     5   1

                p1      p2      p3      p4
         p1     0       2.828   3.162   5.099
         p2     2.828   0       1.414   3.162
         p3     3.162   1.414   0       2
         p4     5.099   3.162   2       0

                     Distance Matrix
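This matrix can be rebuilt with a short NumPy sketch (the coordinates come from the point table above; broadcasting is just one way to compute it):

```python
import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

# Pairwise Euclidean distances via broadcasting:
# diff[i, j, :] = points[i] - points[j]
diff = points[:, None, :] - points[None, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist_matrix, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```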
Minkowski Distance
    • Minkowski Distance is a generalization of Euclidean Distance:

          dist(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

      where r is a parameter, n is the number of dimensions (attributes), and
      x_k and y_k are, respectively, the kth attributes (components) of data
      objects x and y.
Minkowski Distance: Examples
    • r = 1. City block (Manhattan, taxicab, L1 norm) distance.
       • A common example of this for binary vectors is the Hamming
         distance, which is just the number of bits that differ between
         two binary vectors.
    • r = 2. Euclidean distance.
    • r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
       • This is the maximum difference between any component of the
         vectors (a code sketch of all three cases follows below).
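A small sketch showing how the three special cases fall out of the same formula; the helper name minkowski_dist and the example points (p1 and p2 from the distance-matrix example) are illustrative:

```python
import numpy as np

def minkowski_dist(x, y, r):
    """Minkowski distance with parameter r; r = np.inf gives the supremum distance."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(r):
        return diff.max()
    return (diff ** r).sum() ** (1.0 / r)

x, y = [0, 2], [2, 0]                # points p1 and p2
print(minkowski_dist(x, y, 1))       # 4.0   (city block / Manhattan)
print(minkowski_dist(x, y, 2))       # 2.828 (Euclidean)
print(minkowski_dist(x, y, np.inf))  # 2     (supremum / L-infinity)
```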
 Hamming Distance
• Hamming distance is the number of positions in
  which bit-vectors differ.
   • Example: p1 = 10101
              p2 = 10011.
      • d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
        positions.
      • This also equals the L1 distance between the two bit-vectors.
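A tiny sketch of that computation on the example bit-vectors:

```python
# Hamming distance: number of positions where the bit-vectors differ.
p1 = "10101"
p2 = "10011"

hamming = sum(a != b for a, b in zip(p1, p2))
print(hamming)   # 2  (the vectors differ in the 3rd and 4th positions)
```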
Distances for real vectors
• Vectors x = (x_1, ..., x_d) and y = (y_1, ..., y_d)
• Lp norms or Minkowski distance:

      L_p(x, y) = ( |x_1 - y_1|^p + ... + |x_d - y_d|^p )^{1/p}

• L2 norm: Euclidean distance:

      L_2(x, y) = \sqrt{ (x_1 - y_1)^2 + ... + (x_d - y_d)^2 }

• L1 norm: Manhattan distance:

      L_1(x, y) = |x_1 - y_1| + ... + |x_d - y_d|

• L∞ norm:

      L_∞(x, y) = max( |x_1 - y_1|, ..., |x_d - y_d| )

    • The limit of L_p as p goes to infinity (illustrated below).
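A quick numerical illustration of that limit, using the same pair of points as the "Example of Distances" slide further below:

```python
import numpy as np

x = np.array([5.0, 5.0])
y = np.array([9.0, 8.0])
diff = np.abs(x - y)                 # component differences: [4, 3]

for p in [1, 2, 4, 10, 100]:
    lp = (diff ** p).sum() ** (1.0 / p)
    print(p, round(lp, 4))
# p=1: 7.0, p=2: 5.0, p=4: ~4.285, p=10: ~4.022, p=100: ~4.0
# -> L_p approaches the L-infinity distance max(4, 3) = 4
```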
Minkowski Distance
                point   x   y
                 p1     0   2
                 p2     2   0
                 p3     3   1
                 p4     5   1

                 L1     p1      p2      p3      p4
                 p1     0       4       4       6
                 p2     4       0       2       4
                 p3     4       2       0       2
                 p4     6       4       2       0

                 L2     p1      p2      p3      p4
                 p1     0       2.828   3.162   5.099
                 p2     2.828   0       1.414   3.162
                 p3     3.162   1.414   0       2
                 p4     5.099   3.162   2       0

                 L∞     p1      p2      p3      p4
                 p1     0       2       3       5
                 p2     2       0       1       3
                 p3     3       1       0       2
                 p4     5       3       2       0

                      Distance Matrices (L1, L2, L∞)
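These three matrices can be reproduced with SciPy's pairwise-distance helpers (a sketch, assuming SciPy is available; 'cityblock', 'euclidean', and 'chebyshev' correspond to L1, L2, and L∞):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4

for metric in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, L-infinity
    matrix = squareform(pdist(points, metric=metric))
    print(metric)
    print(np.round(matrix, 3))
```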
Example of Distances
[Figure: right triangle between x and y with legs 4 and 3 and hypotenuse 5]

• x = (5, 5), y = (9, 8)
• L2-norm:  dist(x, y) = \sqrt{ 4^2 + 3^2 } = 5
• L1-norm:  dist(x, y) = 4 + 3 = 7
• L∞-norm:  dist(x, y) = max(4, 3) = 4
Common Properties of a Distance
 •   Distances, such as the Euclidean distance, have
     some well-known properties.
     1.   d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
          x = y.
     2.   d(x, y) = d(y, x) for all x and y. (Symmetry)
     3.   d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
          (Triangle Inequality)
     where d(x, y) is the distance (dissimilarity) between points
     (data objects) x and y.
 •   A distance that satisfies these properties is a metric.
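A quick sketch that spot-checks symmetry and the triangle inequality for the Euclidean distance on the four example points; checking a handful of points is only an illustration, not a proof that the properties hold in general:

```python
import numpy as np
from itertools import permutations

points = [np.array(p) for p in [(0, 2), (2, 0), (3, 1), (5, 1)]]

def d(x, y):
    """Euclidean distance between two points."""
    return np.sqrt(((x - y) ** 2).sum())

# Non-negativity and symmetry
assert all(d(x, y) == d(y, x) >= 0 for x, y in permutations(points, 2))

# Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for every triple
assert all(d(x, z) <= d(x, y) + d(y, z) + 1e-12
           for x, y, z in permutations(points, 3))
print("Euclidean distance behaves like a metric on these points")
```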
Similarity Between Binary Vectors
 •   A common situation is that objects x and y have only binary
     attributes
 •   Compute similarities using the following quantities
     f01 = the number of attributes where x was 0 and y was 1
     f10 = the number of attributes where x was 1 and y was 0
     f00 = the number of attributes where x was 0 and y was 0
     f11 = the number of attributes where x was 1 and y was 1
 •   Simple Matching and Jaccard Coefficients
     SMC = number of matches / number of attributes
            = (f11 + f00) / (f01 + f10 + f11 + f00)
     J   = number of 11 matches / number of non-zero attributes
         = (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example
 x= 1000000000
 y= 0000001001
 f01 = 2   (the number of attributes where x was 0 and y was 1)
 f10 = 1   (the number of attributes where x was 1 and y was 0)
 f00 = 7   (the number of attributes where x was 0 and y was 0)
 f11 = 0   (the number of attributes where x was 1 and y was 1)
 SMC       = (f11 + f00) / (f01 + f10 + f11 + f00)
           = (0+7) / (2+1+0+7) = 0.7
 J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
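A short sketch that recomputes the four counts and both coefficients for this example:

```python
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four match/mismatch types across attributes.
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))   # 2
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))   # 1
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))   # 7
f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))   # 0

smc = (f11 + f00) / (f01 + f10 + f11 + f00)
jaccard = f11 / (f01 + f10 + f11)
print(smc, jaccard)   # 0.7 0.0
```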
             Cosine Similarity
• If d1 and d2 are two document vectors, then
         cos(d1, d2) = <d1, d2> / ( ||d1|| ||d2|| ),
where <d1, d2> denotes the inner product (dot product) of the vectors d1 and d2,
and ||d|| is the length of vector d:

      ||X|| = \sqrt{ \sum_i x_i^2 }

      cos(X, Y) = (X \cdot Y) / ( ||X|| ||Y|| ) = \sum_i x_i y_i / ( \sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2} )
• Example:
          d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
          d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
<d1, d2> = d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
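The same computation as a NumPy sketch:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

# cos(d1, d2) = dot product divided by the product of the vector lengths.
cos_sim = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 4))   # 0.315
```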
Common Properties of a Similarity
 •   Similarities also have some well-known
     properties.
     1.   s(x, y) = 1 (or maximum similarity) only if x = y.
     2.   s(x, y) = s(y, x) for all x and y. (Symmetry)
     where s(x, y) is the similarity between points (data objects),
     x and y.
Similarities into distances
• Jaccard distance:
                      JDist(X, Y) = 1 − JSim(X, Y)
• Jaccard distance is a metric
• Cosine distance:
                      Dist(X, Y) = 1 − cos(X, Y)
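A minimal sketch of these conversions, plugging in the similarity values from the earlier examples (the variable names are illustrative):

```python
# Similarity-to-distance conversions from the slide.
jaccard_sim = 0.0        # from the SMC/Jaccard example
cosine_sim = 0.315       # from the cosine-similarity example

jaccard_dist = 1 - jaccard_sim   # Jaccard distance (a metric)
cosine_dist = 1 - cosine_sim     # cosine distance
print(jaccard_dist, cosine_dist)   # 1.0 0.685
```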