Chapter 3: Data Preprocessing
Dong-Kyu Chae
PI of the Data Intelligence Lab @HYU
Department of Computer Science & Data Science
Hanyang University
Contents: Major Tasks in Data Preprocessing
❑ Data Cleaning
❑ Data Integration
❑ Data Reduction
❑ Data Transformation and Data Discretization
Why Pre-process the Data?
❑ A multidimensional view of data quality:
❑ Accuracy: incorrect values make your data mining/machine learning
results inaccurate, despite multiple trials…
❑ Completeness: not recorded, unavailable, …
❑ Consistency: some records modified but others not, …
❑ Timeliness: is the data updated in a timely manner?
❑ Believability: how trustworthy is the data?
❑ Interpretability: how easily can the data be understood?
Overview
❑ Data cleaning
❑ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
❑ Data integration
❑ Integration of multiple databases, data cubes, or files
❑ Data reduction
❑ Dimensionality reduction
❑ Numerosity reduction
❑ Data compression
❑ Data transformation and data discretization
❑ Normalization
❑ Concept hierarchy generation
Data Cleaning
❑ Data in the Real World Is Dirty: lots of potentially
incorrect data, e.g., faulty instruments, human or
computer errors, transmission errors
❑ incomplete: lacking feature values, lacking certain features of
interest, or containing only aggregate data
▪ e.g., Occupation=“ ” (missing data)
❑ noisy: containing noise, errors, or outliers
▪ e.g., Salary=“−10” (an error)
❑ inconsistent: containing discrepancies in codes or names, e.g.,
▪ Age=“42”, Birthday=“03/07/2000”
▪ In some DBs, the rating is “1, 2, 3”, but in other DBs it is “A, B, C”
▪ discrepancies between duplicate records
Missing Data
❑ Remove the object: usually done when the class label is missing;
not effective when the % of missing values is large
❑ Fill in the missing value manually: might be accurate, but
tedious and often infeasible
❑ Fill it in automatically with (see the sketch below):
❑ a simple constant (a default value, or “unknown”)
❑ the feature mean
❑ the feature mean over all samples belonging to the same group (e.g.,
same class, same cluster, etc.)
❑ an inferred value, e.g., one predicted by a regression or
classification model
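A minimal sketch of these automatic fill-in strategies using pandas; the toy DataFrame and its column names are hypothetical, not from the slides:

```python
import pandas as pd

# Hypothetical toy data: NaN marks a missing salary
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "salary": [50.0, None, 42.0, None, 46.0],
})

# 1) A simple constant (here: 0; could also be "unknown" for nominal features)
df["salary_const"] = df["salary"].fillna(0)

# 2) The feature mean over all objects
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# 3) The feature mean within the same group (here: the same class)
df["salary_group"] = df["salary"].fillna(
    df.groupby("class")["salary"].transform("mean")
)

print(df)
```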
Noisy & Inconsistent Data
❑ Noise: random error or variance in a measured feature
❑ Mainly due to faulty data collection instruments
❑ Noisy data often shows up as outliers
▪ Outlier detection -> delete the outliers -> fill them in as missing values
❑ Thus we can apply an outlier detection method (covered later
in the course; a minimal sketch follows below)
❑ Inconsistent data
❑ Age=“42”, but Birthday=“03/07/2000”
❑ For duplicate records, one record uses “cm” but the other uses
“inch”
❑ Human inspection will be needed
▪ The computer performs outlier detection, then a human inspects the flagged records
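As a preview of the outlier detection methods covered later, a minimal sketch of the widely used IQR (interquartile range) rule; the measurement values are made up for illustration:

```python
import numpy as np

values = np.array([48.0, 50.0, 51.0, 49.0, 52.0, -10.0, 50.5])  # -10 is noise

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < low) | (values > high)
print(values[is_outlier])  # [-10.]

# Delete the outliers, then treat them as missing values to be filled in
cleaned = np.where(is_outlier, np.nan, values)
```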
Contents
❑ Data Preprocessing: An Overview
❑ Data Quality
❑ Major Tasks in Data Preprocessing
❑ Data Cleaning
❑ Data Integration
❑ Data Reduction
❑ Data Transformation and Data Discretization
❑ Summary
Data Integration
❑ Data integration:
❑ Combines multiple datasets from multiple sources into a
coherent store
❑ Schema integration: e.g., A.cust-id ≡ B.cust-#
❑ Integrate metadata from different sources
❑ Detecting and resolving data value conflicts
❑ For the same real-world entity, feature values from different
sources may differ
❑ e.g., cm vs. inch, meter vs. mile (see the sketch below)
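A minimal sketch of such an integration step with pandas; the tables and column names (cust_id, cust_no, height_cm, height_in) are hypothetical:

```python
import pandas as pd

# Source A stores height in cm; source B stores it in inches
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_no": [2, 3], "height_in": [71.7, 65.0]})

# Schema integration: A.cust_id corresponds to B.cust_no
b = b.rename(columns={"cust_no": "cust_id"})

# Resolve the data value conflict: convert inches to cm (1 in = 2.54 cm)
b["height_cm"] = b.pop("height_in") * 2.54

# Combine both sources into one coherent store, dropping duplicate entities
merged = pd.concat([a, b], ignore_index=True).drop_duplicates("cust_id")
print(merged)
```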
Handling Redundancy in Data Integration
❑ Redundant data occurs often when integrating multiple
databases
❑ Derivable data: one feature may be a “derived” feature in
another table, e.g., birthdate vs. age
❑ Redundant features can be automatically detected by
correlation analysis and covariance analysis
❑ Reducing/avoiding redundancies and inconsistencies
improves mining speed and quality
Correlation Analysis (Nominal Features)
❑ We want to know whether “like_science_fiction” and
“play_chess” are correlated
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500
❑ χ² (chi-square) test:
$\chi^2 = \sum_{\text{each cell}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
❑ The larger the χ² value, the more likely the features are correlated
❑ The cells that contribute the most to the χ² value are those whose
actual count is very different from the expected count
❑ The expected counts are estimated under the independence assumption
Correlation Analysis (Nominal Features)
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500
❑ χ² (chi-square) calculation
❑ Numbers in parentheses are the expected counts, calculated from
the marginal totals of the two features
$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
❑ It shows that like_science_fiction and play_chess are
correlated in the group (verified in the sketch below)
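The same test can be run with scipy on the contingency table above; correction=False disables Yates' continuity correction so the statistic matches the hand calculation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / not like science fiction; columns: play / not play chess
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # 507.93...
print(expected)  # [[ 90. 360.] [210. 840.]] -- the counts in parentheses
print(p_value)   # ~0, so we reject independence: the features are correlated
```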
Correlation Analysis (Numeric Features)
❑ Correlation coefficient (also called Pearson's correlation
coefficient, PCC) between features A and B:
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of data objects, $\bar{A}$ and $\bar{B}$ are the respective means of A and B,
$\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is
the sum of the AB cross-products.
❑ If rA,B > 0: A and B are positively correlated
❑ (A's values increase as B's do); the higher the value, the stronger the correlation
❑ rA,B = 0: uncorrelated (no linear relationship)
❑ rA,B < 0: negatively correlated (a numeric check follows below)
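A minimal numeric check of PCC with numpy, using the same toy values that appear later in the covariance example:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# PCC via the definition; ddof=1 matches the (n-1) in the formula
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (len(a) - 1) * a.std(ddof=1) * b.std(ddof=1)
)

# The same value from numpy's built-in correlation matrix
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)  # both ~0.94: strong positive correlation
```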
Visually Evaluating Correlation
(Figure: scatter plots showing correlations ranging from –1 to 1.)
Correlation does not imply causation
=> “# of hospitals” and “# of car thefts” in a city are correlated.
However, both may be causally linked to a third feature: population.
Covariance (Numeric Features)
❑ Covariance is similar to correlation:
$Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$
Correlation coefficient:
$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$
where n is the number of data objects, $\bar{A}$ and $\bar{B}$ are the respective
means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective
standard deviations of A and B.
❑ Positive covariance: if CovA,B > 0, then A and B both tend to be
larger than their expected values
❑ Negative covariance: if CovA,B < 0, then when A is larger than its
expected value, B is likely to be smaller than its expected value
❑ Independence: if A and B are independent, then CovA,B = 0 (the
converse does not hold in general)
Covariance: An Example
❑ It can be simplified in computation as
$Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
❑ Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
❑ Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
❑ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
❑ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
❑ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
❑ Thus, A and B rise together since Cov(A, B) > 0.
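A quick verification of this arithmetic with numpy; bias=True requests the population covariance (division by n), matching the formula used above:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])    # stock A's prices
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B's prices

# Simplified formula: Cov(A,B) = E(A*B) - E(A)*E(B)
cov_manual = (a * b).mean() - a.mean() * b.mean()

# numpy's covariance matrix; entry [0, 1] is Cov(A,B)
cov_numpy = np.cov(a, b, bias=True)[0, 1]
print(cov_manual, cov_numpy)  # 4.0 4.0 -> positive: A and B rise together
```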
Covariance/Correlation Matrix Visualization
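(Figure: heatmap of a covariance/correlation matrix.) A minimal sketch of how such a heatmap can be produced with pandas and matplotlib; the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric dataset with one deliberately correlated feature
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age":    rng.normal(40, 10, 100),
    "salary": rng.normal(60, 15, 100),
})
df["spend"] = 0.5 * df["salary"] + rng.normal(0, 5, 100)

corr = df.corr()  # pairwise Pearson correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
plt.show()
```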
Contents
❑ Data Preprocessing: An Overview
❑ Data Quality
❑ Major Tasks in Data Preprocessing
❑ Data Cleaning
❑ Data Integration
❑ Data Reduction
❑ Data Transformation and Data Discretization
❑ Summary
Thank You