
Data Mining

Chapter 2. Data
SASSI Abdessamed
Data Preprocessing
What is Data?
● Collection of data objects and their attributes.
● An attribute is a property or characteristic of an object.
■ Examples: eye color of a person, temperature, etc.
■ Attribute is also known as variable, field, characteristic, dimension, or feature.
● A collection of attributes describes an object.
■ Object is also known as record, point, case, sample, entity, or instance.
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute for a particular object.
● Distinction between attributes and attribute values:
■ The same attribute can be mapped to different attribute values.
Example: height can be measured in feet or meters.
■ Different attributes can be mapped to the same set of values.
Example: attribute values for ID and age are integers.
■ But the properties of an attribute can be different from the properties of the values used to represent it.
Real World Data
● Because data to be mined in the real world:
■ May come from different sources.
■ May consist of a large number of attributes.
■ May contain a large number of entries (rows).
■ May be of a complex nature.
● This data is susceptible to being:
■ Incomplete. Missing values for certain attributes of some entries.
■ Noisy. Containing outlier or erroneous values.
■ Inconsistent. Some entries may be in different formats or use different encoding schemes for the same attribute (column).
● These issues with real-world data suggest the need for techniques to correct them: Preprocessing Techniques.
Incomplete Data
● Incomplete data is the result of the following possible reasons:
■ The data is an aggregation of several databases that contain different numbers of attributes.
■ The attributes with missing values were added recently to the structure of the database.
■ The attributes were of lesser importance or were optional.
■ The users forgot to enter the missing values.
■ The missing values were deleted because of a software bug or equipment malfunctions.

Age   Blood Pressure   Cholesterol   Max heart rate achieved
71    ?                149           125
43    132              ?             136
34    118              210           192

Noisy Data
● Noisy data is the result of the following possible reasons:
■ Sensor / Equipment malfunctions.
■ Human errors when inputting the data fields.
■ Transmission errors.
■ Old technology limitations.

Age   Blood Pressure   Cholesterol   Max heart rate achieved
255   65535            149           125
43    132              -128          10
34    118              210           192


Inconsistent Data
● Inconsistent data occurs for the following possible reasons:
■ Data from different sources use different coding and formatting schemes.
■ Data entry is carried out by different human users or organizations.
■ Data is collected from unofficial internet sources.

Age     Blood Pressure   Cholesterol   Max heart rate achieved
Old     112              149           > 140
43      132              341           136
31-40   118              210           F


Data Preprocessing
● Preprocessing is the set of techniques used to correct data issues as a
preparation step for the data mining process.
● If the data contains issues, the quality of the data mining system will be
severely degraded.
● Data Cleaning. Removing noise and inferring missing values.
● Data Integration. Combining data from multiple sources without
inconsistency.
● Data Transformation. Changing the values of attributes. (E.g.
Normalization)
● Data Reduction. Removing some data (rows) or attributes (columns)
because they are redundant.
Descriptive Data Summarization (Statistics)
Descriptive Summarization
● Descriptive summarization is a set of visualization and statistical techniques that can help, prior to the preprocessing step, to:
■ Understand the data and its properties.
■ Discover noisy data and outliers.
■ Comprehend the distribution of the data.
■ Decide which set of data cleaning techniques should be applied.
Central Tendency - Arithmetic Mean (Average)
● The most common numerical measure of the center of a set of data
points.
● Given a set of N data points (observations) X1, X2, …, XN, the mean is defined as:

    X̄ = (X1 + X2 + … + XN) / N
● Built in aggregate function of several database management systems.
● SELECT AVG(Salary) FROM EMPLOYEE;
Central Tendency - Weighted Arithmetic Mean
● The N data points may be associated with N real-valued weights W1, W2, …, WN; the mean in this case is defined as:

    X̄ = (W1·X1 + W2·X2 + … + WN·XN) / (W1 + W2 + … + WN)
Central Tendency - Median
● The median is the value in the middle of the set.
● To find the median of a set of N values:
1. Sort the N values.
2. If N is odd, a single value is in the middle and it is the median.
3. If N is even, two values are in the middle, the average of these two
values is the median.
● Median({-1, -3, 2, 5, 9}) = 2
● Median({-1, -3, 4, 9}) = (-1 + 4) / 2 = 1.5
Central Tendency - Mode
● The mode is the most frequent (most repeated) value.
● If a set has a single mode it is called unimodal.
● There may be more than one value with the highest frequency; in that case the dataset is called multimodal.
● A set with two modes is bimodal.
● A set with three modes is trimodal.
Central Tendency - Midrange
● The midrange of a dataset is simply the average of the minimum and
maximum values.
● Midrange = (Min + Max) / 2
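
A minimal Python sketch of these central tendency measures, using the standard statistics module; the sample values and weights are illustrative only.

import statistics

data = [-1, -3, 2, 5, 9, 2]                     # illustrative sample
weights = [1, 1, 2, 1, 1, 1]                    # illustrative weights

mean = sum(data) / len(data)                    # arithmetic mean
weighted = sum(w * x for w, x in zip(weights, data)) / sum(weights)
median = statistics.median(data)                # middle value (average of the two middle ones here)
mode = statistics.mode(data)                    # most frequent value (2 appears twice)
midrange = (min(data) + max(data)) / 2          # average of the extremes

print(mean, weighted, median, mode, midrange)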
Distributive, Algebraic, and Holistic Measures
● Distributive measures.
■ Can be computed on parts of the data (subsets) and then combine the partial results to
obtain the measure of the entire set.
■ E.g. Sum, Count, Min, and Max
● Algebraic Measures.
■ The application of algebraic functions to one or more distributive measures.
■ E.g. Mean = Sum / Count
● Holistic Measures.
■ Cannot be computed on batches (subsets). Computed on the whole dataset.
■ E.g. Median.
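
A small sketch of the distinction, assuming the data arrives in three illustrative batches: Sum and Count are computed per batch and combined (distributive), the Mean follows from them (algebraic), and the Median needs the whole set at once (holistic).

import statistics

batches = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]     # illustrative subsets of the data

# Distributive: computed per batch, then the partial results are combined
total_sum = sum(sum(batch) for batch in batches)
total_count = sum(len(batch) for batch in batches)

# Algebraic: a function of distributive measures
mean = total_sum / total_count                         # 180 / 9 = 20.0

# Holistic: needs the whole dataset at once
median = statistics.median([x for batch in batches for x in batch])   # 21

print(mean, median)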
Data Dispersion Measures (1 / 4)
● Consider a data set of N values sorted (in ascending order) x1, x2, …, xN.
● The range of this set is the difference between the maximum and minimum values: Range = Max - Min.
● The kth percentile is the value xi that has k% of the values below it.
● The first quartile Q1 is the 25th percentile.
● The median is the 50th percentile and the second quartile Q2.
● The third quartile Q3 is the 75th percentile.
● The interquartile range (IQR) is a measurement of data spread: IQR = Q3
- Q1.
Data Dispersion Measures (2 / 4)
● Boxplots constitute an important way of visualizing a summary of a data attribute.
● The two ends of the box (rectangle) are the first and third quartiles.
● The line in the middle corresponds to the median.
● The two lines at the bottom and top (called whiskers) correspond to the minimum and maximum respectively.

[Figure: a boxplot annotated, from bottom to top, with Min, Q1, Median, Q3, and Max]
Data Dispersion Measures (3 / 4)
● The Min and Max whiskers extend at most 1.5 × IQR beyond the box; if a value is outside this range, it is considered an outlier.

[Figure: a boxplot with an outlier point shown beyond the Max whisker]
Data Dispersion Measures (4 / 4)
● Consider a data set of N values x1, x2, …, xN.
● The variance σ² is given by the formula:

    σ² = (1 / N) Σ_{i=1..N} (xi - x̄)²

● The standard deviation σ is the square root of the variance.
● The standard deviation measures the spread of the data (average error).
● Values more than 2σ away from the mean for a given attribute may be considered outliers.
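
A sketch of these dispersion measures with NumPy on illustrative values, using the population variance (division by N) as in the formula above.

import numpy as np

x = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34, 90])   # 90 plays the role of an outlier

q1, q2, q3 = np.percentile(x, [25, 50, 75])             # quartiles (q2 is the median)
iqr = q3 - q1                                           # interquartile range

variance = np.mean((x - x.mean()) ** 2)                 # population variance, divide by N
std = np.sqrt(variance)                                 # same as np.std(x)

outliers = x[np.abs(x - x.mean()) > 2 * std]            # values more than 2 std devs from the mean

print(q1, q2, q3, iqr, std, outliers)                   # outliers -> [90]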
Graphical Data Summarization
Histograms
● Categorical attribute
■ A rectangle for each value.
● Numerical attribute
■ A rectangle represents a fixed-width range of values.

[Figure: a histogram of units sold (10K to 50K) by car color: Yellow, Red, Blue, Gray, White]
Scatter Plot
● Displays two numerical attributes against one another.
● Helpful for detecting correlations and relations between attributes.
● For each point (x, y) in the chart:
■ x corresponds to the value of the first attribute
■ y corresponds to the value of the second attribute

[Figure: a scatter plot of items sold (20 to 100) against price ($0 to $120)]
Quantile Plot
● The whole data for an attribute is displayed.
● Plots quantile information.
● For a dataset {X1, X2, …, XN} sorted in ascending order:

    fi = (i - 0.5) / N

● Approximately (100 × fi)% of the data points have values ≤ Xi.

[Figure: a quantile plot of price ($) against f-value (0.00 to 1.00)]
Data Cleaning
Filling Missing Values (1/ 3)
● Ignore the record (row).
■ Generally used in the classification task when the missing value is the label.
● Manually filling the missing value.
■ Not possible for large datasets.
■ Time consuming.
■ Requires the knowledge of the missing value.
● Fill with a global constant.
■ Replace all missing values for the attribute in question with the same value (e.g. -∞).
■ May lead the data mining process to believe that all records with this value form an
interesting pattern.
● Fill with the attribute’s mean.
■ Replace missing values in some column (attribute) with the column’s average value.
Filling Missing Values (2/ 3)
● Fill with the class's average value.
■ If the records are classified or the task is classification, we can replace the missing values of some attribute with the mean of the values in the same class.

Cholesterol   Label
149           Diseased
?             Healthy
210           Diseased
175           Diseased
120           Healthy
114           Healthy

Here the missing value is replaced with the mean of the Healthy class: (120 + 114) / 2 = 117.
Filling Missing Values (3/ 3)
● Fill with the most probable value.
■ Use a regression model to predict the missing value from existing values
■ E.g. Decision Trees, Bayesian inference, Neural Networks …
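
A minimal pandas sketch of the mean-based strategies, reusing the Cholesterol / Label example above (column names and data are illustrative).

import pandas as pd

df = pd.DataFrame({
    "Cholesterol": [149, None, 210, 175, 120, 114],
    "Label": ["Diseased", "Healthy", "Diseased", "Diseased", "Healthy", "Healthy"],
})

# Fill with the attribute's overall mean
filled_global = df["Cholesterol"].fillna(df["Cholesterol"].mean())

# Fill with the mean of the record's class: the Healthy mean is (120 + 114) / 2 = 117
filled_by_class = df["Cholesterol"].fillna(
    df.groupby("Label")["Cholesterol"].transform("mean")
)

print(filled_by_class.tolist())   # [149.0, 117.0, 210.0, 175.0, 120.0, 114.0]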
Noise Removal / Reduction (1 / 2)
● Binning. Splits the sorted samples of the data into bins (batches) that
correspond to local neighborhoods.
■ Smoothing by bin means. The values in each bin are replaced with the mean of the bin.
■ Smoothing by bin medians. The values in each bin are replaced with the median of the
bin.
■ Smoothing by bin boundaries. The values in each bin are replaced with the closest bin boundary (the bin's minimum or maximum).

{4, 8, 15, 21, 21, 24, 25, 28, 34}

Original bins:    Bin1: {4, 8, 15}    Bin2: {21, 21, 24}    Bin3: {25, 28, 34}
By means:         Bin1: {9, 9, 9}     Bin2: {22, 22, 22}    Bin3: {29, 29, 29}
By medians:       Bin1: {8, 8, 8}     Bin2: {21, 21, 21}    Bin3: {28, 28, 28}
By boundaries:    Bin1: {4, 4, 15}    Bin2: {21, 21, 24}    Bin3: {25, 25, 34}
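
A sketch reproducing the example above: split the sorted values into three equal-size bins, then smooth by means, medians, or boundaries.

import statistics

values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [values[i:i + 3] for i in range(0, len(values), 3)]   # three bins of three values

by_means   = [[round(statistics.mean(b))] * len(b) for b in bins]
by_medians = [[statistics.median(b)] * len(b) for b in bins]
by_bounds  = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_medians)  # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]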
Noise Removal / Reduction (2 / 2)
● Regression. We can fit the values of some attribute to a function. Then
we replace the values of the attribute with values from the function.
■ Example. We can use Linear Regression to fit the data to a linear equation.
● Clustering. Group values into clusters using clustering techniques.
Values outside of all clusters are considered as outliers.
Data Integration
● Combining data collected from multiple sources into one coherent
store.
■ File. Spreadsheet, CSV, JSON, …
■ Database
■ Data warehouse
Schema Integration and Object Matching
● Identifying correspondences between the elements (attributes, tables,
etc.) of different schemas.
● E.g. recognizing that "CustomerID" in one database corresponds to
"ClientID" in another.
● We can use the metadata of the original data sources to resolve this
issue.
Resolving data inconsistencies
● Data Entry Errors
Example: "NYC" and "New York City" referring to the same city, but recorded differently.
Resolution: Standardize the representation to either "NYC" or "New York City" to
maintain consistency.
● Unit Mismatches
Example: Weight measured in kilograms in some records and in pounds in others.
Resolution: Convert all values to a single, consistent unit (e.g., pounds to kilograms).
● Data Format Inconsistencies
Example. Transforming date from the format “dd/mm/yy” to “yyyy-mm-dd”.
Resolution: Standardize the format across the dataset, ensuring consistency (e.g., using
"YYYY-MM-DD").
Detecting Redundant Attributes (1 / 3)
1. Numerical Attributes
● Attributes that can be derived from other attributes.
● To detect that two attributes A and B are redundant, we can calculate the correlation coefficient (Pearson's product moment) of the two attributes:

    rA,B = Σ_{i=1..N} (Ai - Ā) × (Bi - B̄) / (N · σA · σB)
Detecting Redundant Attributes (2 / 3)
1. Numerical Attributes.
● Note that -1 ≤ rA, B ≤ +1.
● If rA, B > 0 then A and B are positively correlated (If A increases, B increases).
● If rA, B < 0 then A and B are negatively correlated (If A increases, B decreases
and vice versa).
● If rA, B = 0 then there is no linear relationship between A and B (the attributes are uncorrelated).

A     0.38   -2.37   -3.65    1.05    3.58   -2.97
B    -2.60    7.78   12.63   -5.14  -14.75   10.09

Detecting Redundant Attributes (3 / 3)
A     0.38   -2.37   -3.65    1.05    3.58   -2.97
B    -2.60    7.78   12.63   -5.14  -14.75   10.09

Ā = -0.66    B̄ = 1.34    σA = 2.56    σB = 9.68

A - Ā    1.04   -1.71   -2.99    1.71    4.24   -2.31
B - B̄   -3.94    6.44   11.29   -6.48  -16.09    8.75

rA, B = -0.9979634771005512 ≈ -0.998
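
A sketch reproducing this calculation with NumPy (np.std uses the population standard deviation by default, matching the formula above).

import numpy as np

A = np.array([0.38, -2.37, -3.65, 1.05, 3.58, -2.97])
B = np.array([-2.60, 7.78, 12.63, -5.14, -14.75, 10.09])

r = np.sum((A - A.mean()) * (B - B.mean())) / (len(A) * A.std() * B.std())

print(r)                          # about -0.998: strong negative correlation
print(np.corrcoef(A, B)[0, 1])    # same value from NumPy's built-in correlation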
Data value conflicts
● Values in different units
■ E.g. Metric (KG, KM, L, …) vs Imperial (Gallon, Pound, Mile, …)
● Prices in different currencies
■ E.g. US Dollars vs Euros vs Algerian Dinars
Normalization (1 / 5)
● Generally data attributes (columns) have different ranges of values.
■ E.g. Age (15 .. 83) vs Salary (15000 .. 300000)
● In several data mining algorithms and techniques, we compare data entries (data points) as n-dimensional vectors using distance measures.
● If some attributes have larger ranges of values they will have more impact
on the value of the distance.
● To avoid this issue, we can normalize the data so that all attributes have
the same range of values.
● In this course, we will see three methods of normalization.
Normalization (2 / 5)
1. Min-Max Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X using the linear transformation defined by the following formula:

    N(xi) = (xi - minX) / (maxX - minX)

● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● min X is the minimum of all the values x1, x2, …, xN.
● max X is the maximum of all the values x1, x2, …, xN.
● The range of normalized values is [0..1].
Normalization (3 / 5)
2. Z-Score Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X using the linear transformation defined by the following formula:

    N(xi) = (xi - X̄) / σX

● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● X̄ is the mean of the values of X and σX is their standard deviation.
● This method is useful when:
● The minimum and maximum of X are unknown.
● There are some outliers in the values of X.
Normalization (4 / 5)
3. Decimal scaling Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X using the linear transformation defined by the following formula:

    N(xi) = xi / 10^p

● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● p is the smallest integer such that max({|x1|, |x2|, …, |xN|}) < 10^p.
● The range of normalized values is [-1..1].
Normalization (5 / 5)
Original          X    897     -67     -360    787     259     752
                  Y    -4      -45     -5      55      37      -7

Min-Max           X    1       0.23    0       0.91    0.49    0.88
                  Y    0.41    0       0.4     1       0.82    0.38

Z-Score           X    1.10    -0.94   -1.56   0.87    -0.25   0.79
                  Y    -0.28   -1.54   -0.31   1.53    0.98    -0.37

Decimal Scaling   X    0.897   -0.067  -0.36   0.787   0.259   0.752
                  Y    -0.04   -0.45   -0.05   0.55    0.37    -0.07
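
A sketch of the three methods applied to the X row of the table above; small differences from the table come from rounding.

import numpy as np

x = np.array([897, -67, -360, 787, 259, 752], dtype=float)

min_max = (x - x.min()) / (x.max() - x.min())   # values in [0, 1]

z_score = (x - x.mean()) / x.std()              # population standard deviation

p = 0                                           # smallest p with max(|x|) < 10**p
while np.abs(x).max() >= 10 ** p:
    p += 1
decimal_scaled = x / 10 ** p                    # here p = 3

print(min_max.round(2))
print(z_score.round(2))
print(decimal_scaled)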
Data Reduction
● The goal of data reduction is to generate a smaller dataset without losing
the essence of the original one.
● Generating a dataset that is smaller in volume but that produces the same analysis results.
● Main strategies for data reduction are:
1. Attribute subset selection.
2. Dimensionality reduction.
3. Data Discretization.
Attribute subset selection (1 / 2)
● Not all attributes are important to the data mining task.
● This suggests that we must select a subset of the whole set of attributes to
keep, and discard the other attributes.
● A human expert can perform this job manually.
● However …
● It will consume a lot of time this way.
● We can design algorithms to perform this task automatically.
● However …
● For a set of n attributes there are 2^n subsets to consider.
● We must use heuristic approaches to achieve this.
Attribute subset selection (2 / 2)

[Figure: greedy (heuristic) methods for attribute subset selection]


Dimensionality Reduction
● Obtain a reduced (compressed) representation of the original data.
● Principal Component Analysis PCA.
Principal Component Analysis (1 / 2)
● PCA can be used to transform a dataset of n attributes (columns) to an
approximate one with k attributes (k ≤ n).
● To achieve this, PCA searches for a projection from the n-dimensional space to the k-dimensional space that minimizes a reconstruction error criterion.
Principal Component Analysis (2 / 2)
● PCA can be summarized in the following steps:
1. Normalization. Give the attributes (columns) the same importance.
2. Finding Principal Components. Find k n-dimensional orthonormal (perpendicular, unit-length) vectors that form a basis for the data.
3. Sorting Principal Components. Sort the principal components by significance (the variance of the data along each component).
4. Discarding Weaker Components. Keeping the most significant (strongest) components results in a good approximation of the original data.
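
A minimal NumPy sketch of these four steps on an illustrative random data matrix; in practice a library implementation (e.g. scikit-learn's PCA) would normally be used instead.

import numpy as np

X = np.random.rand(100, 5)      # illustrative data: n = 100 objects, m = 5 attributes
k = 2                           # number of components to keep

# 1. Normalization: z-score each column so attributes get the same importance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Finding principal components: orthonormal eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# 3. Sorting: order components by the variance they explain, descending
order = np.argsort(eigvals)[::-1]

# 4. Discarding weaker components: project onto the k strongest ones
X_reduced = Z @ eigvecs[:, order[:k]]
print(X_reduced.shape)          # (100, 2)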
Data Discretization
● Data discretization is used to reduce the number of possible values an attribute can have.
● This is done by converting the values into discrete intervals or categories, then replacing each value with the label assigned to its interval.
● Applying discretization to a numerical attribute reduces the number of possible values it can have.
● Data discretization techniques include: binning, histogram analysis, entropy-based discretization, merging by χ² analysis, and cluster analysis.
Discretization with Binning
● Similarly to what we saw in data smoothing, binning consists of splitting a
sorted dataset into bins (batches) and replacing their values.
● If we replace the values inside each bin with their mean or median, we will reduce the number of unique values the attribute will have.
● As a result, in this case, the binning technique is also considered an
unsupervised data discretization technique.
Discretization with Histogram analysis
● Histogram Analysis for numerical attributes can be used as an
unsupervised data discretization technique.
● In the histogram technique numerical values are partitioned into ranges
(called buckets).
● The buckets are selected according to one of the following criteria:
1. Equal width. The values of the attribute are split into ranges of equal width (e.g. 10..20, 20..30, 30..40, 40..50).
2. Equal frequency. The ranges are selected so that they contain approximately equal frequencies.
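
A pandas sketch contrasting the two criteria on illustrative ages: pd.cut gives equal-width buckets, pd.qcut gives roughly equal-frequency buckets.

import pandas as pd

ages = pd.Series([15, 18, 22, 23, 24, 29, 35, 41, 47, 52, 66, 83])

equal_width = pd.cut(ages, bins=4)    # 4 buckets spanning ranges of equal width
equal_freq = pd.qcut(ages, q=4)       # 4 buckets containing (almost) equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())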
Comparing data: Similarity and Distance
Data Matrix (1 / 2)
● The goal of clustering and supervised classification/regression algorithms (which we will see later in this course) is to group into K classes:
■ A set of n objects (patients, customers, cars, products, …)
■ Represented by m attributes (age, blood pressure, color, …)
● Hence, these algorithms accept as input a dataset in the form of a matrix of n rows and m columns.
● The columns of this matrix correspond to the attributes (variables) of each object.
● The rows of this matrix correspond to the objects.
● As a result, each object is an m-dimensional vector (sometimes referred to as a data point).
Data Matrix (2 / 2)
● Each object (data point) Xi is an m-dimensional vector (Xi1, Xi2, Xi3, …, Xim).

    X11   X12   X13   ...   X1m
    X21   X22   X23   ...   X2m
    X31   X32   X33   ...   X3m
    ...
    Xn1   Xn2   Xn3   ...   Xnm
Similarity, Dissimilarity, and Distance
● Several Clustering and supervised classification/regression algorithms
are based on comparing entire data points to each other.
● This raises the need for some functions to evaluate the degree of
resemblance / difference between two vectors.
● Such functions are called similarity / dissimilarity measures.
● The functions that evaluate the difference (dissimilarity) between two
vectors are also called distances.
Dissimilarity Matrix (2 / 2)
● It is an n×n triangular matrix d where each element di j contains the distance between objects Xi and Xj.
● Most clustering algorithms operate on the dissimilarity matrix.
● In practice it is impractical to store all the distances; they are rather computed when needed.
● If it is possible to store such a huge matrix, it will speed up clustering algorithms.

    0
    d21   0
    d31   d32   0
    ...
    dn1   dn2   dn3   ...   0
Type of Variables
Type of Variables
1. Quantitative (numerical) variables: variables that represent numerical values and can be measured or counted. They allow for arithmetic operations (such as addition, subtraction, …). There are two main types:
■ Continuous variables
■ Discrete variables
2. Qualitative (categorical) variables: variables that represent characteristics or qualities. They describe categories or groups and cannot be measured numerically in a meaningful way. There are two main types:
■ Nominal variables
■ Ordinal variables
Quantitative Variables
Quantitative variables
● They typically answer questions related to "how much" or "how many".
1. Continuous variables can take any value within a given range. These values are not restricted to integers and can include fractions and decimals. Examples: height (e.g., 172.5 cm), temperature (e.g., 36.7°C), weight (e.g., 68.4 kg), time (e.g., 2.5 hours).
2. Discrete variables, in contrast to continuous variables (which can take any value within a range), are limited to specific, separate values. They typically represent counts of something. Examples: number of students in a class (e.g., 25 students), number of cars in a parking lot (e.g., 50 cars), number of books on a shelf (e.g., 10 books).
Quantitative variables
● Variables (attributes) with different units and ranges can affect the clustering quality.
● Hence, for numerical attributes, we need to normalize (standardize) the attributes before computing distances.
Euclidean Distance (L2 Norm)
● The most used distance measure for vectors with interval-scaled
attributes.
● The expression of this distance is similar to that of Euclidean distance
between 2-dimensional points used in geometry. It is just extended for
m-dimensional vectors.
● The distance between two m-dimensional vectors Xi and Xj is defined by the following formula:

    di j = √((Xi1 - Xj1)² + (Xi2 - Xj2)² + … + (Xim - Xjm)²)
Manhattan Distance (L1 Norm)
● The distance between two m-dimensional vectors Xi and Xj is defined by the following formula:

    di j = |Xi1 - Xj1| + |Xi2 - Xj2| + … + |Xim - Xjm|
Minkowski Distance (Lp Norm)
● It’s a generalization of the Euclidean and Manhattan distances
● The distance between two m-dimensional vectors Xi and Xj is defined by the following formula:

    di j = (|Xi1 - Xj1|^p + |Xi2 - Xj2|^p + … + |Xim - Xjm|^p)^(1/p)

● Where p is a positive integer number.
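
A sketch of the three distances for two illustrative m-dimensional vectors (NumPy).

import numpy as np

xi = np.array([1.0, 3.0, 5.0])
xj = np.array([4.0, 1.0, 5.0])

euclidean = np.sqrt(np.sum((xi - xj) ** 2))           # L2: sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.sum(np.abs(xi - xj))                   # L1: 3 + 2 + 0 = 5
p = 3
minkowski = np.sum(np.abs(xi - xj) ** p) ** (1 / p)   # Lp with p = 3

print(euclidean, manhattan, minkowski)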


Categorical Variables
What are Categorical Variables?
● Represent attributes with k possible values (categories or groups).
● Generally encoded as integer values 1, 2, 3, …, k.
● Called binary variables if the attribute takes only two values (k = 2), such as true/false, 0/1, or yes/no.
● They often answer questions related to "what kind" or "which type."
● Examples include:
■ Direction: East, West, North, South.
■ Blood Types: O, A, B, AB.
■ Gender: Male, Female
■ Eye Color: Blue, Green, Brown
What are Categorical Variables?
● Types of categorical variables:
■ Nominal variables: purely categorical, with no natural order among the categories.
Example: eye color (blue, green, brown), gender (male, female).
■ Ordinal variables: categorical, but with a meaningful order or ranking.
Example: educational level (high school, bachelor's, master's), satisfaction level (unsatisfied, neutral, satisfied).
Categorical Distance
● A ratio of mismatches can be used as a dissimilarity measure:

    di j = (m - p) / m

● Where p is the number of matching attributes between objects Xi and Xj,
● and m is the total number of attributes.
● This can be written as:

    di j = (number of mismatching attributes) / (total number of attributes)
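
A small sketch of the mismatch ratio for two objects with illustrative categorical attributes.

def categorical_distance(xi, xj):
    """Fraction of attributes on which the two objects disagree."""
    m = len(xi)                                  # total number of attributes
    p = sum(a == b for a, b in zip(xi, xj))      # number of matching attributes
    return (m - p) / m

xi = ("blue", "A", "north")
xj = ("blue", "O", "south")
print(categorical_distance(xi, xj))              # 2 mismatches out of 3 -> 0.666...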
Ordinal Variables
What are Ordinal Variables?
● Attributes with k possible ordered values.
● Generally encoded as integer values 1 < 2 < 3 < … < k.
● Examples include:
■ Level of studies: Bachelor, Master, PhD.
■ Age: Child, Teenage, Young, Old.
■ Experience: Junior, Mid-level, Senior.
Ordinal Distance
● Given an ordinal attribute f with kf possible values:
1. Replace each value xi f of the attribute by its rank ri f ∈ {1, 2, …, kf}.
2. Normalize the ranks to the range [0.0 .. 1.0] using this formula:

    zi f = (ri f - 1) / (kf - 1)

3. Treat the resulting values zi f as a regular numerical variable and use the Euclidean, Manhattan, or any Minkowski distance to compare objects.
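
A sketch of the recipe for a single ordinal attribute (the experience levels are illustrative): map each value to its rank, rescale to [0, 1], then compare numerically.

ranks = {"Junior": 1, "Mid-level": 2, "Senior": 3}   # kf = 3 ordered values
kf = len(ranks)

def z(value):
    return (ranks[value] - 1) / (kf - 1)             # ranks mapped to 0.0, 0.5, 1.0

# Compare two single-attribute objects with a numerical (Manhattan) distance
print(abs(z("Junior") - z("Senior")))                # 1.0: the two extremes of the scale
print(abs(z("Junior") - z("Mid-level")))             # 0.5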
Binary Variables
What are Binary Variables?
● Attributes with only two possible values generally encoded as 0 and 1.
● Example of such variables include:
■ Gender (Male, Female)
■ States (On/Off, Healthy/Diseased, …)
● There are two categories of binary variables:
1. Symmetric. The two values of the variable have the same importance
(weight).
■ E.g. Male vs Female
2. Asymmetric. The two values of the variable are not equally important.
■ E.g. Healthy vs Diseased
Contingency Table (1 / 2)
● Given two binary objects (object with binary attributes only):
■ Xi (Xi1, Xi2, …, Xim)
■ Xj (Xj1, Xj2, …, Xjm)
● A contingency table summarizes the number of bit matches and mismatches:

                  Xi
             1       0       sum
      1      q       r       q + r
Xj    0      s       t       s + t
      sum    q + s   r + t   m
Contingency Table (2 / 2)
● Example.
■ Xi (0, 0, 0, 0, 1, 0, 1, 1)
■ Xj (1, 0, 1, 0, 0, 0, 1, 1)
● A contingency table summarizes the number of bit matches and mismatches:

                  Xi
             1       0       sum
      1      2       2       4
Xj    0      1       3       4
      sum    3       5       8
Symmetric Distance
● A simple symmetric distance is given by the following formula:

    di j = (r + s) / m

● In other words, it can be understood as:

    di j = (number of mismatching bits) / (total number of bits)
Asymmetric Distance
● A simple asymmetric distance is given by the following formula:

    di j = (r + s) / (m - t)

● In other words, it can be understood as:

    di j = (number of mismatching bits) / (total number of bits, ignoring negative matches)
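
A sketch computing q, r, s, t and both distances for the example vectors above.

xi = [0, 0, 0, 0, 1, 0, 1, 1]
xj = [1, 0, 1, 0, 0, 0, 1, 1]
m = len(xi)

q = sum(a == 1 and b == 1 for a, b in zip(xi, xj))   # 1/1 matches: 2
r = sum(a == 0 and b == 1 for a, b in zip(xi, xj))   # Xi = 0, Xj = 1: 2
s = sum(a == 1 and b == 0 for a, b in zip(xi, xj))   # Xi = 1, Xj = 0: 1
t = sum(a == 0 and b == 0 for a, b in zip(xi, xj))   # 0/0 (negative) matches: 3

symmetric = (r + s) / m           # 3 / 8 = 0.375
asymmetric = (r + s) / (m - t)    # 3 / 5 = 0.6, negative matches ignored
print(symmetric, asymmetric)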
Vector Objects
Vector Objects (1 /2)
● If objects are of complex nature, we generally represent them using
feature vectors.
● In these contexts we can use the cosine similarity to compare two vectors Xi and Xj:

    Sim(Xi, Xj) = (XiT · Xj) / (║Xi║ ║Xj║)

● Where · denotes the dot product of two vectors, ║ ║ denotes the L2 norm of a vector, and T denotes the transpose of a vector.
Vector Objects (2 /2)
● We can also use the Tanimoto coefficient, defined as:

    Sim(Xi, Xj) = (XiT · Xj) / (XiT · Xi + XjT · Xj - XiT · Xj)

● Where · denotes the dot product of two vectors and T denotes the transpose of a vector.
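
A sketch of both measures for two illustrative feature vectors (NumPy).

import numpy as np

xi = np.array([1.0, 2.0, 0.0, 3.0])
xj = np.array([2.0, 1.0, 1.0, 2.0])

dot = xi @ xj                                              # Xi^T · Xj
cosine = dot / (np.linalg.norm(xi) * np.linalg.norm(xj))   # ≈ 0.85
tanimoto = dot / (xi @ xi + xj @ xj - dot)                 # 10 / 14 ≈ 0.71

print(cosine, tanimoto)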
