Chapter 3: Data Preprocessing
Dong-Kyu Chae
PI of the Data Intelligence Lab @HYU
Department of Computer Science & Data Science
Hanyang University
Contents: Major Tasks in Data Preprocessing
❑ Data Cleaning
❑ Data Integration
❑ Data Reduction
❑ Data Transformation and Data Discretization
Why Pre-process the Data?
❑ A multidimensional view of data quality:
❑ Accuracy: incorrect values make your data mining/machine learning
results inaccurate, despite multiple trials…
❑ Completeness: not recorded, unavailable, …
❑ Consistency: some records modified but others not, …
❑ Timeliness: is the data updated in a timely manner?
❑ Believability: how trustworthy is the data?
❑ Interpretability: how easily can the data be understood?
Overview
❑ Data cleaning
❑ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
❑ Data integration
❑ Integration of multiple databases, data cubes, or files
❑ Data reduction
❑ Dimensionality reduction
❑ Numerosity reduction
❑ Data compression
❑ Data transformation and data discretization
❑ Normalization
❑ Concept hierarchy generation
Data Cleaning
❑ Data in the Real World Is Dirty: lots of potentially
incorrect data, e.g., faulty instruments, human or
computer errors, transmission errors
❑ incomplete: lacking feature values, lacking certain features of
interest, or containing only aggregate data
▪ e.g., Occupation=“ ” (missing data)
❑ noisy: containing noise, errors, or outliers
▪ e.g., Salary=“−10” (an error)
❑ inconsistent: containing discrepancies in codes or names, e.g.,
▪ Age=“42”, Birthday=“03/07/2000”
▪ In some DBs, the rating is “1, 2, 3”, but in other DBs it is “A, B, C”
▪ discrepancies between duplicate records
Missing Data
❑ Remove the object: usually done when the class label is missing;
not effective when the % of missing values is large
❑ Fill in the missing value manually: might be accurate, but
tedious and often infeasible
❑ Fill it in automatically with (see the sketch below):
❑ a simple constant (a default value, or “unknown”)
❑ the feature mean
❑ the feature mean over all samples belonging to the same group (e.g.,
same class, same cluster, etc.)
❑ an inferred value, e.g., one predicted by a regression or
classification model
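A minimal sketch of these automatic fill-in strategies using pandas; the toy DataFrame and its column names are hypothetical, not from the slides:

```python
import pandas as pd

# Hypothetical toy data: NaN marks a missing salary
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "salary": [50.0, None, 42.0, None, 46.0],
})

# 1) A simple constant (here: 0; could also be "unknown" for nominal features)
df["salary_const"] = df["salary"].fillna(0)

# 2) The feature mean over all objects
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# 3) The feature mean within the same group (here: the same class)
df["salary_group"] = df["salary"].fillna(
    df.groupby("class")["salary"].transform("mean")
)

print(df)
```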
Noisy & Inconsistent Data
❑ Noise: random error or variance in a measured feature
❑ Mainly due to faulty data collection instruments
❑ Noisy data often shows up as outliers
▪ Outlier detection -> delete the outliers -> fill them in as missing values
❑ Thus we can apply an outlier detection method (covered later
in the course; a minimal sketch follows below)
❑ Inconsistent data
❑ Age=“42”, but Birthday=“03/07/2000”
❑ For duplicate records, one record uses “cm” but the other uses
“inch”
❑ Human inspection will be needed
▪ The computer performs outlier detection, then a human inspects the flagged records
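As a preview of the outlier detection methods covered later, a minimal sketch of the widely used IQR (interquartile range) rule; the measurement values are made up for illustration:

```python
import numpy as np

values = np.array([48.0, 50.0, 51.0, 49.0, 52.0, -10.0, 50.5])  # -10 is noise

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < low) | (values > high)
print(values[is_outlier])  # [-10.]

# Delete the outliers, then treat them as missing values to be filled in
cleaned = np.where(is_outlier, np.nan, values)
```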
Contents
❑ Data Preprocessing: An Overview
❑ Data Quality
❑ Major Tasks in Data Preprocessing
❑ Data Cleaning
❑ Data Integration
❑ Data Reduction
❑ Data Transformation and Data Discretization
❑ Summary
Data Integration
❑ Data integration:
❑ Combines multiple datasets from multiple sources into a
coherent store
❑ Schema integration: e.g., A.cust-id ≡ B.cust-#
❑ Integrate metadata from different sources
❑ Detecting and resolving data value conflicts
❑ For the same real-world entity, feature values from different
sources may differ
❑ e.g., cm vs. inch, meter vs. mile (see the sketch below)
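A minimal sketch of such an integration step with pandas; the tables and column names (cust_id, cust_no, height_cm, height_in) are hypothetical:

```python
import pandas as pd

# Source A stores height in cm; source B stores it in inches
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_no": [2, 3], "height_in": [71.7, 65.0]})

# Schema integration: A.cust_id corresponds to B.cust_no
b = b.rename(columns={"cust_no": "cust_id"})

# Resolve the data value conflict: convert inches to cm (1 in = 2.54 cm)
b["height_cm"] = b.pop("height_in") * 2.54

# Combine both sources into one coherent store, dropping duplicate entities
merged = pd.concat([a, b], ignore_index=True).drop_duplicates("cust_id")
print(merged)
```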
Handling Redundancy in Data Integration
❑ Redundant data occurs often when integrating multiple
databases
❑ Derivable data: one feature may be a “derived” feature in
another table, e.g., birthdate vs. age
❑ Redundant features can be automatically detected by
correlation analysis and covariance analysis
❑ Reducing/avoiding redundancies and inconsistencies
improves mining speed and quality
Correlation Analysis (Nominal Features)
❑ We want to know whether “like_science_fiction” and
“play_chess” are correlated
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500
❑ χ² (chi-square) test:
$\chi^2 = \sum_{\text{each cell}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
❑ The larger the χ² value, the more likely the features are correlated
❑ The cells that contribute the most to the χ² value are those whose
actual count is very different from the expected count
❑ The expected counts are estimated under the independence assumption
Correlation Analysis (Nominal Features)
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500
❑ χ² (chi-square) calculation
❑ Numbers in parentheses are the expected counts, calculated from
the marginal totals of the two features
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
❑ It shows that like_science_fiction and play_chess are
correlated in the group (verified in the sketch below)
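The same test can be run with scipy on the contingency table above; correction=False disables Yates' continuity correction so the statistic matches the hand calculation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / not like science fiction; columns: play / not play chess
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # 507.93...
print(expected)  # [[ 90. 360.] [210. 840.]] -- the counts in parentheses
print(p_value)   # ~0, so we reject independence: the features are correlated
```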
Correlation Analysis (Numeric Features)
❑ Correlation coefficient (also called Pearson's correlation
coefficient, PCC) between features A and B:
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of data objects, $\bar{A}$ and $\bar{B}$ are the respective means of A and B,
$\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is
the sum of the AB cross-products.
❑ If rA,B > 0: A and B are positively correlated
❑ (A's values increase as B's do); the higher the value, the stronger the correlation
❑ rA,B = 0: uncorrelated (no linear relationship)
❑ rA,B < 0: negatively correlated (a numeric check follows below)
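A minimal numeric check of PCC with numpy, using the same toy values that appear later in the covariance example:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# PCC via the definition; ddof=1 matches the (n-1) in the formula
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (len(a) - 1) * a.std(ddof=1) * b.std(ddof=1)
)

# The same value from numpy's built-in correlation matrix
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)  # both ~0.94: strong positive correlation
```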
Visually Evaluating Correlation
(Figure: scatter plots showing correlations ranging from –1 to 1.)
Correlation does not imply causation
=> “# of hospitals” and “# of car thefts” in a city are correlated.
However, both may be causally linked to a third feature: population.
Covariance (Numeric Features)
❑ Covariance is similar to correlation:
$Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$
Correlation coefficient:
$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$
where n is the number of data objects, $\bar{A}$ and $\bar{B}$ are the respective
means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective
standard deviations of A and B.
❑ Positive covariance: if CovA,B > 0, then A and B both tend to be
larger than their expected values
❑ Negative covariance: if CovA,B < 0, then when A is larger than its
expected value, B is likely to be smaller than its expected value
❑ Independence: if A and B are independent, then CovA,B = 0 (the
converse does not hold in general)
Covariance: An Example
❑ It can be simplified in computation as
$Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
❑ Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
❑ Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
❑ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
❑ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
❑ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
❑ Thus, A and B rise together since Cov(A, B) > 0.
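A quick verification of this arithmetic with numpy; bias=True requests the population covariance (division by n), matching the formula used above:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])    # stock A's prices
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B's prices

# Simplified formula: Cov(A,B) = E(A*B) - E(A)*E(B)
cov_manual = (a * b).mean() - a.mean() * b.mean()

# numpy's covariance matrix; entry [0, 1] is Cov(A,B)
cov_numpy = np.cov(a, b, bias=True)[0, 1]
print(cov_manual, cov_numpy)  # 4.0 4.0 -> positive: A and B rise together
```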
Covariance/Correlation Matrix Visualization
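(Figure: heatmap of a covariance/correlation matrix.) A minimal sketch of how such a heatmap can be produced with pandas and matplotlib; the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric dataset with one deliberately correlated feature
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":    rng.normal(40, 10, 100),
    "salary": rng.normal(60, 15, 100),
})
df["spend"] = 0.5 * df["salary"] + rng.normal(0, 5, 100)

corr = df.corr()  # pairwise Pearson correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
plt.show()
```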
Contents
❑ Data Preprocessing: An Overview
❑ Data Quality
❑ Major Tasks in Data Preprocessing
❑ Data Cleaning
❑ Data Integration
❑ Data Reduction
❑ Data Transformation and Data Discretization
❑ Summary
Thank You