Chapter 2. Data
SASSI Abdessamed
Data Preprocessing
What is Data?
● Collection of data objects and their attributes
● An attribute is a property or characteristic of an object
■ Examples: eye color of a person, temperature, etc.
■ Attribute is also known as variable, field, characteristic, dimension, or feature
● A collection of attributes describes an object
■ Object is also known as record, point, case, sample, entity, or instance
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute for a
particular object
■ But the properties of an attribute can be different from the properties of the values used to
represent it
Real World Data
● Because data to be mined in the real world:
■ May come from different sources.
■ Consist of a large number of attributes.
■ Contain a large number of entries (rows).
■ May be of a complex nature.
● This data is susceptible to be:
■ Incomplete. Contain missing values for certain attributes of some entries.
■ Noisy. Contain outlier or erroneous values.
■ Inconsistent. Some entries may be in different formats or have different encoding
schemes for the same attribute (column).
● These issues with real-world data suggest the need for techniques to correct them.
Preprocessing Techniques
Incomplete Data
● Incomplete data is the result of the following possible reasons:
■ The data is an aggregation of several databases that contain different numbers of attributes.
■ The attributes with missing values were added recently to the structure of the database.
■ The attributes were of less importance or were optional.
■ The users forgot to enter the missing values.
■ The missing values were deleted because of a software bug or equipment malfunctions.
Example rows with missing (?) and erroneous (-128) values:
71   ?     149   125
43   132   ?     136
43   132   -128  10
Central Tendency - Arithmetic Mean
● The mean of N values x1, x2, …, xN is: x̄ = (x1 + x2 + … + xN) / N
● Built-in aggregate function of several database management systems.
● SELECT AVG(Salary) FROM EMPLOYEE;
Central Tendency - Weighted Arithmetic Mean
● The N data points may be associated with N real-valued weights w1, w2, …, wN;
the mean in this case is defined as:
x̄ = (w1·x1 + w2·x2 + … + wN·xN) / (w1 + w2 + … + wN)
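The weighted mean above can be sketched in a few lines of Python; NumPy's `np.average` implements exactly this formula. The sample values and weights below are made up for illustration.

```python
import numpy as np

# Hypothetical sample: 4 data points with real-valued weights.
x = np.array([10.0, 20.0, 30.0, 40.0])
w = np.array([1.0, 1.0, 1.0, 2.0])

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
weighted_mean = np.sum(w * x) / np.sum(w)

# np.average computes the same quantity.
assert weighted_mean == np.average(x, weights=w)
print(weighted_mean)  # (10 + 20 + 30 + 80) / 5 = 28.0
```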
Data Dispersion Measures (3 / 4)
● The min and max whiskers of a boxplot extend at most 1.5 IQR from the quartiles;
if a value is out of this range, it is considered an outlier.
[Figure: a boxplot annotated with Min, Q1, Median, Q3, Max, and an outlier point]
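The 1.5 IQR whisker rule doubles as a simple outlier detector. A minimal sketch, on made-up data:

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([12, 13, 14, 15, 15, 16, 17, 18, 95])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Whiskers extend at most 1.5 * IQR beyond Q1 and Q3;
# anything outside that range is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```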
Data Dispersion Measures (4 / 4)
● Consider a data set of N values x1, x2, …, xN.
● The variance σ² is given by the formula:
σ² = (1/N) Σ_{i=1}^{N} (xi − x̄)²
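The population variance formula above translates directly to code; `np.var` uses the same 1/N convention by default. The sample values are arbitrary:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance: mean of squared deviations from the mean.
mean = x.sum() / len(x)
variance = ((x - mean) ** 2).sum() / len(x)

print(mean, variance)  # 5.0 4.0
assert variance == np.var(x)  # np.var uses the same 1/N formula by default
```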
Histogram
■ A rectangle represents a fixed-width range of values.
[Figure: bar chart of units sold (10K-30K) per car color: Yellow, Red, Blue, Gray, White]
Scatter Plot
● Displays two numerical attributes against one another.
● Shows relations between attributes.
● For each point (x, y) in the chart:
■ x corresponds to the value of the first attribute
■ y corresponds to the value of the second attribute
[Figure: scatter plot of items sold (20-100) against price ($0-120)]
Quantile Plot
● The whole data for an attribute is displayed.
● Plots quantile information.
● For a dataset {x1, x2, …, xN} sorted in ascending order:
fi = (i − 0.5) / N
● Approximately, (100×fi)% of the data points have values ≤ xi.
[Figure: quantile plot of price ($) against f-value (0.00-1.00)]
Data Cleaning
Filling Missing Values (1/ 3)
● Ignore the record (row).
■ Generally used in the classification task when the missing value is the label.
● Manually filling the missing value.
■ Not possible for large datasets.
■ Time consuming.
■ Requires the knowledge of the missing value.
● Fill with a global constant.
■ Replace all missing values for the attribute in question with the same value (e.g. -∞).
■ May lead the data mining process to believe that all records with this value form an
interesting pattern.
● Fill with the attribute’s mean.
■ Replace missing values in some column (attribute) with the column’s average value.
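Mean-filling is a one-liner with pandas. A minimal sketch; the column name and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries.
df = pd.DataFrame({"salary": [1500.0, np.nan, 2100.0, np.nan, 2400.0]})

# Replace missing values with the column's mean: (1500 + 2100 + 2400) / 3 = 2000.
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df["salary"].tolist())  # [1500.0, 2000.0, 2100.0, 2000.0, 2400.0]
```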
Filling Missing Values (2/ 3)
● Fill with the class’s average value.
■ If the records are classified or the task is classification, we can replace the missing values
of some attribute with the mean of the values in the same class.
Example (attribute value, class label):
149  Diseased
?    Healthy
210  Diseased
175  Diseased
120  Healthy
114  Healthy
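Class-conditional mean filling can be sketched with a pandas `groupby`, using the same numbers as the example table (the column names are made up):

```python
import numpy as np
import pandas as pd

# One numerical attribute and a class label, as in the example table.
df = pd.DataFrame({
    "value": [149.0, np.nan, 210.0, 175.0, 120.0, 114.0],
    "class": ["Diseased", "Healthy", "Diseased", "Diseased", "Healthy", "Healthy"],
})

# Replace each missing value with the mean of its own class only.
df["value"] = df.groupby("class")["value"].transform(lambda s: s.fillna(s.mean()))

print(df.loc[1, "value"])  # mean of the Healthy values: (120 + 114) / 2 = 117.0
```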
Filling Missing Values (3/ 3)
● Fill with the most probable value.
■ Use a predictive model to infer the missing value from the existing values.
■ E.g. Decision Trees, Bayesian inference, Neural Networks, …
Noise Removal / Reduction (1 / 2)
● Binning. Splits the sorted samples of the data into bins (batches) that
correspond to local neighborhoods.
■ Smoothing by bin means. The values in each bin are replaced with the mean of the bin.
■ Smoothing by bin medians. The values in each bin are replaced with the median of the
bin.
■ Smoothing by bin boundaries. Each value in a bin is replaced by the closest of the bin's min/max boundaries.
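The two smoothing variants can be sketched as follows, on an already-sorted sample split into equal-depth bins of three values (the data is a common textbook-style example, chosen here for illustration):

```python
import numpy as np

# Hypothetical sorted sample, split into 3 equal-depth bins of 3 neighbors.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)
print(by_means)  # 9, 9, 9, 22, 22, 22, 29, 29, 29

# Smoothing by bin boundaries: every value snaps to the closer of
# its bin's min and max (ties go to the min here).
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()
print(by_bounds)  # 4, 4, 15, 21, 21, 24, 25, 25, 34
```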
Detecting Redundant Attributes (1 / 3)
1. Numerical Attributes.
● The correlation coefficient between attributes A and B, with means Ā, B̄ and standard deviations σA, σB:
rA,B = Σ_{i=1}^{N} (Ai − Ā)(Bi − B̄) / (N σA σB)
Detecting Redundant Attributes (2 / 3)
1. Numerical Attributes.
● Note that -1 ≤ rA, B ≤ +1.
● If rA, B > 0 then A and B are positively correlated (If A increases, B increases).
● If rA, B < 0 then A and B are negatively correlated (If A increases, B decreases
and vice versa).
● If rA, B = 0 then A and B are uncorrelated (no linear relationship; this does not by itself imply independence).
● Example: rA, B = -0.9979634771005512 (strong negative correlation).
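The correlation coefficient is straightforward to compute by hand and to cross-check against `np.corrcoef`. A minimal sketch on made-up data with an exact negative linear relationship:

```python
import numpy as np

# Two hypothetical attributes: B decreases exactly as A increases.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

# r_{A,B} = sum((a_i - mean_a) * (b_i - mean_b)) / (N * sigma_a * sigma_b)
n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())

print(r)  # -1.0: a perfect negative linear relationship
assert np.isclose(r, np.corrcoef(a, b)[0, 1])
```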
Data value conflicts
● Values in different units
■ E.g. Metric (KG, KM, L, …) vs Imperial (Gallon, Pound, Mile, …)
● Prices in different currencies
■ E.g. US Dollars vs Euros vs Algerian Dinars
Normalization (1 / 5)
● Generally data attributes (columns) have different ranges of values.
■ E.g. Age (15 .. 83) vs Salary (15000 .. 300000)
● In several data mining algorithms and techniques, we compare data
entries (data points) as n-dimensional vectors using distance measures.
● If some attributes have larger ranges of values they will have more impact
on the value of the distance.
● To avoid this issue, we can normalize the data so that all attributes have
the same range of values.
● In this course, we will see three methods of normalization.
Normalization (2 / 5)
1. Min-Max Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X
using the linear transformation defined by the following formula:
N(xi) = (xi − min X) / (max X − min X)
● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● min X is the minimum of all the values x1, x2, …, xN.
● max X is the maximum of all the values x1, x2, …, xN.
● The range of normalized values is [0..1].
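The min-max formula can be sketched in one line of NumPy, here applied to the X values from the example in part 5 of this section:

```python
import numpy as np

x = np.array([897.0, -67.0, -360.0, 787.0, 259.0, 752.0])

# Min-max normalization: min(X) maps to 0 and max(X) maps to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(normalized.min(), normalized.max())  # 0.0 1.0
```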
Normalization (3 / 5)
2. Z-Score Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X
using the linear transformation defined by the following formula:
N(xi) = (xi − X̄) / σX
● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● X̄ and σX are the mean and standard deviation of the values of X.
● This method is useful when:
● The minimum and maximum of X are unknown.
● There are some outliers in the values of X.
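Z-score normalization can be sketched the same way; by construction the result has mean 0 and standard deviation 1:

```python
import numpy as np

x = np.array([897.0, -67.0, -360.0, 787.0, 259.0, 752.0])

# Z-score normalization: subtract the mean, divide by the standard deviation.
normalized = (x - x.mean()) / x.std()

# The normalized values have mean 0 and standard deviation 1.
assert np.isclose(normalized.mean(), 0.0)
assert np.isclose(normalized.std(), 1.0)
```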
Normalization (4 / 5)
3. Decimal scaling Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X
using the linear transformation defined by the following formula:
N(xi) = xi / 10^p
● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● p is the smallest integer such that max({|x1|, |x2|, …, |xN|}) < 10^p
● The range of normalized values is (-1..1).
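Decimal scaling can be sketched as follows; `floor(log10(max|x|)) + 1` yields the smallest p with max|x| strictly below 10^p. Applied to the X values from the example in part 5:

```python
import numpy as np

x = np.array([897.0, -67.0, -360.0, 787.0, 259.0, 752.0])

# Smallest integer p such that max(|x_i|) < 10**p; here max|x| = 897, so p = 3.
p = int(np.floor(np.log10(np.abs(x).max()))) + 1
normalized = x / 10**p

print(p)  # 3
print(normalized)  # all values now lie strictly between -1 and 1
```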
Normalization (5 / 5)
X 897 -67 -360 787 259 752
Original
Y -4 -45 -5 55 37 -7
Xn 1 Xn 2 Xn 3 ... Xn
m
Similarity, Dissimilarity, and Distance
● Several Clustering and supervised classification/regression algorithms
are based on comparing entire data points to each other.
● This raises the need for some functions to evaluate the degree of
resemblance / difference between two vectors.
● Such functions are called similarity / dissimilarity measures.
● The functions that evaluate the difference (dissimilarity) between two
vectors are also called distances.
Dissimilarity Matrix (2 / 2)
● It is an n×n lower-triangular matrix d where each element di j contains the
distance between objects Xi and Xj:
0
d2 1   0
d3 1   d3 2   0
…
dn 1   dn 2   dn 3   …   0
● Most clustering algorithms operate on the dissimilarity matrix.
● In practice it is impractical to store all distances; instead, they are computed
when needed.
● If it is possible to store such a huge matrix, it will speed up clustering algorithms.
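A full (symmetric, zero-diagonal) dissimilarity matrix can be sketched with NumPy broadcasting; Euclidean distance is used here as the measure, and the four 2-D points are made up:

```python
import numpy as np

# Four hypothetical 2-D data points.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0],
              [0.0, 1.0]])

# n x n dissimilarity matrix of Euclidean distances:
# d[i, j] = ||X_i - X_j||; d is symmetric with a zero diagonal.
diff = X[:, None, :] - X[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))

print(d[1, 0])  # 5.0 (distance between (3, 4) and (0, 0))
assert np.allclose(d, d.T) and np.allclose(np.diag(d), 0.0)
```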
Types of Variables
1. Quantitative (numerical) variables: variables that represent numerical
values and can be measured or counted. They allow for arithmetic operations
(such as addition, subtraction, …). There are two main types:
■ Continuous Variables
■ Discrete Variables
2. Qualitative (categorical) variables: variables that represent characteristics
or qualities. They describe categories or groups and cannot be measured
numerically in a meaningful way. There are two main types:
■ Nominal Variables
■ Ordinal Variables
Quantitative Variables
● They typically answer questions related to "how much" or "how many".
1. Continuous Variables can take any value within a given range. These
values are not restricted to integers and can include fractions and
decimals. Examples: Height (e.g., 172.5 cm), Temperature (e.g., 36.7°C),
Weight (e.g., 68.4 kg), Time (e.g., 2.5 hours).
Minkowski Distance
● di j = ( |Xi 1 − Xj 1|^p + |Xi 2 − Xj 2|^p + … + |Xi m − Xj m|^p )^(1/p)
Nominal Attributes
● di j = (m − p) / m, where p here is the number of attributes on which Xi and Xj
match and m is the total number of attributes.
Ordinal Attributes
● Each rank ri f ∈ {1, …, kf} is mapped onto [0..1] by: zi f = (ri f − 1) / (kf − 1)
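The Minkowski distance and the ordinal rank mapping can be sketched as follows; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. The vectors and the number of ordinal states kf are made up:

```python
import numpy as np

def minkowski(xi, xj, p):
    """Minkowski distance: (sum_k |xi_k - xj_k|**p) ** (1/p)."""
    return (np.abs(xi - xj) ** p).sum() ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(minkowski(a, b, 1))  # 7.0 (Manhattan distance)
print(minkowski(a, b, 2))  # 5.0 (Euclidean distance)

# Ordinal attribute: map rank r in {1, ..., k} onto [0, 1].
k = 5
ranks = np.array([1, 3, 5])
z = (ranks - 1) / (k - 1)
print(z)  # 0.0, 0.5, 1.0
```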
Contingency table of two binary objects Xi (columns) and Xj (rows):

          Xi = 1   Xi = 0   sum
Xj = 1    q        r        q + r
Xj = 0    s        t        s + t

Example:

          Xi = 1   Xi = 0   sum
Xj = 1    2        2        4
Xj = 0    1        3        4
sum       3        5        8
Symmetric Distance
● A simple symmetric distance is given by the following formula:
di j = (r + s) / m
Asymmetric Distance
● When matching 0s (the count t) carry no information, they are dropped from
the denominator:
di j = (r + s) / (m − t)
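Both binary distances can be sketched directly from the contingency counts; the counts below (q=2, r=2, s=1, t=3, so m=8) match the example table, with r + s = 3 regardless of how r and s are assigned to the two mismatch cells:

```python
# Binary dissimilarity from the contingency counts of two objects:
# q = both 1, t = both 0, r and s = the two mismatch counts, m = q + r + s + t.
def symmetric_distance(q, r, s, t):
    return (r + s) / (q + r + s + t)

def asymmetric_distance(q, r, s, t):
    # Matching 0s (t) are ignored as uninformative.
    return (r + s) / (q + r + s)

# Counts from the example table: q=2, r=2, s=1, t=3 (m=8).
print(symmetric_distance(2, 2, 1, 3))   # 3/8 = 0.375
print(asymmetric_distance(2, 2, 1, 3))  # 3/5 = 0.6
```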
● Sim(Xi , Xj) = (XiT · Xj) / (XiT · Xi + XjT · Xj − XiT · Xj)
● Where · denotes the dot product of two vectors and T denotes the
transpose of a vector.
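This similarity (often called the Tanimoto coefficient; for 0/1 vectors it coincides with the Jaccard coefficient) can be sketched with dot products. The two binary vectors are made up:

```python
import numpy as np

def tanimoto_similarity(xi, xj):
    """Sim = (xi . xj) / (xi . xi + xj . xj - xi . xj)."""
    dot = np.dot(xi, xj)
    return dot / (np.dot(xi, xi) + np.dot(xj, xj) - dot)

a = np.array([1.0, 1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0, 1.0])

print(tanimoto_similarity(a, a))  # 1.0 (identical vectors)
print(tanimoto_similarity(a, b))  # 2 / (3 + 2 - 2) = 2/3
```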