Chapter 2. Data
SASSI Abdessamed
Data Preprocessing
What is Data?
● Collection of data objects and their attributes
● An attribute is a property or characteristic of an object
■ Examples: eye color of a person, temperature, etc.
■ Attribute is also known as variable, field, characteristic, dimension, or feature
● A collection of attributes describes an object
■ Object is also known as record, point, case, sample, entity, or instance
Attribute Values
● Attribute values are numbers or symbols assigned to an attribute for a
particular object
■ But the properties of an attribute can be different from the properties of the values used to
represent it
Real World Data
● Because data to be mined in the real world:
■ May come from different sources.
■ Consist of a large number of attributes.
■ Contain a large number of entries (rows).
■ May be of a complex nature.
● This data is susceptible to be:
■ Incomplete. Contain missing values for certain attributes of some entries.
■ Noisy. Contain outlier or erroneous values.
■ Inconsistent. Some entries may be in different formats or have different encoding
schemes for the same attribute (column).
● These issues with real-world data suggest the need for techniques to correct them.
Preprocessing Techniques
Incomplete Data
● Incomplete data is the result of the following possible reasons:
■ The data is an aggregation of several databases that contain different numbers of attributes.
■ The attributes with missing values were added recently to the structure of the database.
■ The attributes were of less importance or were optional.
■ The users forgot to enter the missing values.
■ The missing values were deleted because of a software bug or equipment malfunctions.
Example rows with missing (?) and erroneous (-128) values:
71   ?     149   125
43   132   ?     136
43   132   -128  10
Central Tendency - Arithmetic Mean
● The mean of N values x1, x2, …, xN is: x̄ = (x1 + x2 + … + xN) / N
● Built-in aggregate function of several database management systems.
● SELECT AVG(Salary) FROM EMPLOYEE;
Central Tendency - Weighted Arithmetic Mean
● The N data points may be associated with N real-valued weights w1, w2, …, wN;
the mean in this case is defined as:
x̄ = (w1·x1 + w2·x2 + … + wN·xN) / (w1 + w2 + … + wN)
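The weighted mean above can be sketched in a few lines of Python; NumPy's `np.average` implements exactly this formula. The sample values and weights below are made up for illustration.

```python
import numpy as np

# Hypothetical sample: 4 data points with real-valued weights.
x = np.array([10.0, 20.0, 30.0, 40.0])
w = np.array([1.0, 1.0, 1.0, 2.0])

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i).
weighted_mean = np.sum(w * x) / np.sum(w)

# np.average computes the same quantity.
assert weighted_mean == np.average(x, weights=w)
print(weighted_mean)  # (10 + 20 + 30 + 80) / 5 = 28.0
```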
Data Dispersion Measures (3 / 4)
● The min and max whiskers of a boxplot extend at most 1.5 IQR from the quartiles;
if a value is out of this range, it is considered an outlier.
[Figure: a boxplot annotated with Min, Q1, Median, Q3, Max, and an outlier point]
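The 1.5 IQR whisker rule doubles as a simple outlier detector. A minimal sketch, on made-up data:

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([12, 13, 14, 15, 15, 16, 17, 18, 95])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Whiskers extend at most 1.5 * IQR beyond Q1 and Q3;
# anything outside that range is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```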
Data Dispersion Measures (4 / 4)
● Consider a data set of N values x1, x2, …, xN.
● The variance σ² is given by the formula:
σ² = (1/N) Σ_{i=1}^{N} (xi − x̄)²
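The population variance formula above translates directly to code; `np.var` uses the same 1/N convention by default. The sample values are arbitrary:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance: mean of squared deviations from the mean.
mean = x.sum() / len(x)
variance = ((x - mean) ** 2).sum() / len(x)

print(mean, variance)  # 5.0 4.0
assert variance == np.var(x)  # np.var uses the same 1/N formula by default
```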
Histogram
■ A rectangle represents a fixed-width range of values.
[Figure: bar chart of units sold (10K-30K) per car color: Yellow, Red, Blue, Gray, White]
Scatter Plot
● Displays two numerical attributes against one another.
● Shows relations between attributes.
● For each point (x, y) in the chart:
■ x corresponds to the value of the first attribute
■ y corresponds to the value of the second attribute
[Figure: scatter plot of items sold (20-100) against price ($0-120)]
Quantile Plot
● The whole data for an attribute is displayed.
● Plots quantile information.
● For a dataset {x1, x2, …, xN} sorted in ascending order:
fi = (i − 0.5) / N
● Approximately, (100×fi)% of the data points have values ≤ xi.
[Figure: quantile plot of price ($) against f-value (0.00-1.00)]
Data Cleaning
Filling Missing Values (1/ 3)
● Ignore the record (row).
■ Generally used in the classification task when the missing value is the label.
● Manually filling the missing value.
■ Not possible for large datasets.
■ Time consuming.
■ Requires the knowledge of the missing value.
● Fill with a global constant.
■ Replace all missing values for the attribute in question with the same value (e.g. -∞).
■ May lead the data mining process to believe that all records with this value form an
interesting pattern.
● Fill with the attribute’s mean.
■ Replace missing values in some column (attribute) with the column’s average value.
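Mean-filling is a one-liner with pandas. A minimal sketch; the column name and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries.
df = pd.DataFrame({"salary": [1500.0, np.nan, 2100.0, np.nan, 2400.0]})

# Replace missing values with the column's mean: (1500 + 2100 + 2400) / 3 = 2000.
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df["salary"].tolist())  # [1500.0, 2000.0, 2100.0, 2000.0, 2400.0]
```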
Filling Missing Values (2/ 3)
● Fill with the class’s average value.
■ If the records are classified or the task is classification, we can replace the missing values
of some attribute with the mean of the values in the same class.
Example (attribute value, class label):
149  Diseased
?    Healthy
210  Diseased
175  Diseased
120  Healthy
114  Healthy
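Class-conditional mean filling can be sketched with a pandas `groupby`, using the same numbers as the example table (the column names are made up):

```python
import numpy as np
import pandas as pd

# One numerical attribute and a class label, as in the example table.
df = pd.DataFrame({
    "value": [149.0, np.nan, 210.0, 175.0, 120.0, 114.0],
    "class": ["Diseased", "Healthy", "Diseased", "Diseased", "Healthy", "Healthy"],
})

# Replace each missing value with the mean of its own class only.
df["value"] = df.groupby("class")["value"].transform(lambda s: s.fillna(s.mean()))

print(df.loc[1, "value"])  # mean of the Healthy values: (120 + 114) / 2 = 117.0
```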
Filling Missing Values (3/ 3)
● Fill with the most probable value.
■ Use a predictive model to infer the missing value from the existing values.
■ E.g. Decision Trees, Bayesian inference, Neural Networks, …
Noise Removal / Reduction (1 / 2)
● Binning. Splits the sorted samples of the data into bins (batches) that
correspond to local neighborhoods.
■ Smoothing by bin means. The values in each bin are replaced with the mean of the bin.
■ Smoothing by bin medians. The values in each bin are replaced with the median of the
bin.
■ Smoothing by bin boundaries. Each value in a bin is replaced by the closest of the bin's min/max boundaries.
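The two smoothing variants can be sketched as follows, on an already-sorted sample split into equal-depth bins of three values (the data is a common textbook-style example, chosen here for illustration):

```python
import numpy as np

# Hypothetical sorted sample, split into 3 equal-depth bins of 3 neighbors.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)
print(by_means)  # 9, 9, 9, 22, 22, 22, 29, 29, 29

# Smoothing by bin boundaries: every value snaps to the closer of
# its bin's min and max (ties go to the min here).
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()
print(by_bounds)  # 4, 4, 15, 21, 21, 24, 25, 25, 34
```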
Detecting Redundant Attributes (1 / 3)
1. Numerical Attributes.
● The correlation coefficient between attributes A and B, with means Ā, B̄ and standard deviations σA, σB:
rA,B = Σ_{i=1}^{N} (Ai − Ā)(Bi − B̄) / (N σA σB)
Detecting Redundant Attributes (2 / 3)
1. Numerical Attributes.
● Note that -1 ≤ rA, B ≤ +1.
● If rA, B > 0 then A and B are positively correlated (If A increases, B increases).
● If rA, B < 0 then A and B are negatively correlated (If A increases, B decreases
and vice versa).
● If rA, B = 0 then A and B are uncorrelated (no linear relationship; this does not by itself imply independence).
● Example: rA, B = -0.9979634771005512 (strong negative correlation).
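The correlation coefficient is straightforward to compute by hand and to cross-check against `np.corrcoef`. A minimal sketch on made-up data with an exact negative linear relationship:

```python
import numpy as np

# Two hypothetical attributes: B decreases exactly as A increases.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

# r_{A,B} = sum((a_i - mean_a) * (b_i - mean_b)) / (N * sigma_a * sigma_b)
n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())

print(r)  # -1.0: a perfect negative linear relationship
assert np.isclose(r, np.corrcoef(a, b)[0, 1])
```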
Data value conflicts
● Values in different units
■ E.g. Metric (KG, KM, L, …) vs Imperial (Gallon, Pound, Mile, …)
● Prices in different currencies
■ E.g. US Dollars vs Euros vs Algerian Dinars
Normalization (1 / 5)
● Generally data attributes (columns) have different ranges of values.
■ E.g. Age (15 .. 83) vs Salary (15000 .. 300000)
● In several data mining algorithms and techniques, we compare data
entries (data points) as n-dimensional vectors using distance measures.
● If some attributes have larger ranges of values they will have more impact
on the value of the distance.
● To avoid this issue, we can normalize the data so that all attributes have
the same range of values.
● In this course, we will see three methods of normalization.
Normalization (2 / 5)
1. Min-Max Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X
using the linear transformation defined by the following formula:
N(xi) = (xi − min X) / (max X − min X)
● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● min X is the minimum of all the values x1, x2, …, xN.
● max X is the maximum of all the values x1, x2, …, xN.
● The range of normalized values is [0..1].
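The min-max formula can be sketched in one line of NumPy, here applied to the X values from the example in part 5 of this section:

```python
import numpy as np

x = np.array([897.0, -67.0, -360.0, 787.0, 259.0, 752.0])

# Min-max normalization: min(X) maps to 0 and max(X) maps to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(normalized.min(), normalized.max())  # 0.0 1.0
```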
Normalization (3 / 5)
2. Z-Score Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X
using the linear transformation defined by the following formula:
N(xi) = (xi − X̄) / σX
● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● X̄ and σX are the mean and standard deviation of the values of X.
● This method is useful when:
● The minimum and maximum of X are unknown.
● There are some outliers in the values of X.
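Z-score normalization can be sketched the same way; by construction the result has mean 0 and standard deviation 1:

```python
import numpy as np

x = np.array([897.0, -67.0, -360.0, 787.0, 259.0, 752.0])

# Z-score normalization: subtract the mean, divide by the standard deviation.
normalized = (x - x.mean()) / x.std()

# The normalized values have mean 0 and standard deviation 1.
assert np.isclose(normalized.mean(), 0.0)
assert np.isclose(normalized.std(), 1.0)
```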
Normalization (4 / 5)
3. Decimal scaling Normalization.
● In this method, we normalize the values x1, x2, …, xN of an attribute X
using the linear transformation defined by the following formula:
N(xi) = xi / 10^p
● Where:
● xi is the original value we want to normalize.
● N(xi) is the normalized value.
● p is the smallest integer such that max({|x1|, |x2|, …, |xN|}) < 10^p
● The range of normalized values is (-1..1).
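Decimal scaling can be sketched as follows; `floor(log10(max|x|)) + 1` yields the smallest p with max|x| strictly below 10^p. Applied to the X values from the example in part 5:

```python
import numpy as np

x = np.array([897.0, -67.0, -360.0, 787.0, 259.0, 752.0])

# Smallest integer p such that max(|x_i|) < 10**p; here max|x| = 897, so p = 3.
p = int(np.floor(np.log10(np.abs(x).max()))) + 1
normalized = x / 10**p

print(p)  # 3
print(normalized)  # all values now lie strictly between -1 and 1
```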
Normalization (5 / 5)
X 897 -67 -360 787 259 752
Original
Y -4 -45 -5 55 37 -7
Xn 1 Xn 2 Xn 3 ... Xn
m
Similarity, Dissimilarity, and Distance
● Several Clustering and supervised classification/regression algorithms
are based on comparing entire data points to each other.
● This raises the need for some functions to evaluate the degree of
resemblance / difference between two vectors.
● Such functions are called similarity / dissimilarity measures.
● The functions that evaluate the difference (dissimilarity) between two
vectors are also called distances.
Dissimilarity Matrix (2 / 2)
● It is an n×n lower-triangular matrix d where each element di j contains the
distance between objects Xi and Xj:
0
d2 1   0
d3 1   d3 2   0
…
dn 1   dn 2   dn 3   …   0
● Most clustering algorithms operate on the dissimilarity matrix.
● In practice it is impractical to store all distances; instead, they are computed
when needed.
● If it is possible to store such a huge matrix, it will speed up clustering algorithms.
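A full (symmetric, zero-diagonal) dissimilarity matrix can be sketched with NumPy broadcasting; Euclidean distance is used here as the measure, and the four 2-D points are made up:

```python
import numpy as np

# Four hypothetical 2-D data points.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0],
              [0.0, 1.0]])

# n x n dissimilarity matrix of Euclidean distances:
# d[i, j] = ||X_i - X_j||; d is symmetric with a zero diagonal.
diff = X[:, None, :] - X[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))

print(d[1, 0])  # 5.0 (distance between (3, 4) and (0, 0))
assert np.allclose(d, d.T) and np.allclose(np.diag(d), 0.0)
```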
Types of Variables
1. Quantitative (numerical) variables: variables that represent numerical
values and can be measured or counted. They allow for arithmetic operations
(such as addition, subtraction, …). There are two main types:
■ Continuous Variables
■ Discrete Variables
2. Qualitative (categorical) variables: variables that represent characteristics
or qualities. They describe categories or groups and cannot be measured
numerically in a meaningful way. There are two main types:
■ Nominal Variables
■ Ordinal Variables
Quantitative Variables
● They typically answer questions related to "how much" or "how many".
1. Continuous Variables can take any value within a given range. These
values are not restricted to integers and can include fractions and
decimals. Examples: Height (e.g., 172.5 cm), Temperature (e.g., 36.7°C),
Weight (e.g., 68.4 kg), Time (e.g., 2.5 hours).
Minkowski Distance
● di j = ( |Xi 1 − Xj 1|^p + |Xi 2 − Xj 2|^p + … + |Xi m − Xj m|^p )^(1/p)
Nominal Attributes
● di j = (m − p) / m, where p here is the number of attributes on which Xi and Xj
match and m is the total number of attributes.
Ordinal Attributes
● Each rank ri f ∈ {1, …, kf} is mapped onto [0..1] by: zi f = (ri f − 1) / (kf − 1)
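The Minkowski distance and the ordinal rank mapping can be sketched as follows; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. The vectors and the number of ordinal states kf are made up:

```python
import numpy as np

def minkowski(xi, xj, p):
    """Minkowski distance: (sum_k |xi_k - xj_k|**p) ** (1/p)."""
    return (np.abs(xi - xj) ** p).sum() ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(minkowski(a, b, 1))  # 7.0 (Manhattan distance)
print(minkowski(a, b, 2))  # 5.0 (Euclidean distance)

# Ordinal attribute: map rank r in {1, ..., k} onto [0, 1].
k = 5
ranks = np.array([1, 3, 5])
z = (ranks - 1) / (k - 1)
print(z)  # 0.0, 0.5, 1.0
```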
Contingency table of two binary objects Xi (columns) and Xj (rows):

          Xi = 1   Xi = 0   sum
Xj = 1    q        r        q + r
Xj = 0    s        t        s + t

Example:

          Xi = 1   Xi = 0   sum
Xj = 1    2        2        4
Xj = 0    1        3        4
sum       3        5        8
Symmetric Distance
● A simple symmetric distance is given by the following formula:
di j = (r + s) / m
Asymmetric Distance
● When matching 0s (the count t) carry no information, they are dropped from
the denominator:
di j = (r + s) / (m − t)
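Both binary distances can be sketched directly from the contingency counts; the counts below (q=2, r=2, s=1, t=3, so m=8) match the example table, with r + s = 3 regardless of how r and s are assigned to the two mismatch cells:

```python
# Binary dissimilarity from the contingency counts of two objects:
# q = both 1, t = both 0, r and s = the two mismatch counts, m = q + r + s + t.
def symmetric_distance(q, r, s, t):
    return (r + s) / (q + r + s + t)

def asymmetric_distance(q, r, s, t):
    # Matching 0s (t) are ignored as uninformative.
    return (r + s) / (q + r + s)

# Counts from the example table: q=2, r=2, s=1, t=3 (m=8).
print(symmetric_distance(2, 2, 1, 3))   # 3/8 = 0.375
print(asymmetric_distance(2, 2, 1, 3))  # 3/5 = 0.6
```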
● Sim(Xi , Xj) = (XiT · Xj) / (XiT · Xi + XjT · Xj − XiT · Xj)
● Where · denotes the dot product of two vectors and T denotes the
transpose of a vector.
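This similarity (often called the Tanimoto coefficient; for 0/1 vectors it coincides with the Jaccard coefficient) can be sketched with dot products. The two binary vectors are made up:

```python
import numpy as np

def tanimoto_similarity(xi, xj):
    """Sim = (xi . xj) / (xi . xi + xj . xj - xi . xj)."""
    dot = np.dot(xi, xj)
    return dot / (np.dot(xi, xi) + np.dot(xj, xj) - dot)

a = np.array([1.0, 1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0, 1.0])

print(tanimoto_similarity(a, a))  # 1.0 (identical vectors)
print(tanimoto_similarity(a, b))  # 2 / (3 + 2 - 2) = 2/3
```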