Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments,
human or computer errors, transmission errors
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
How to handle missing data
Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant: e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or decision tree
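As a hedged illustration of these automatic fill-in strategies, the short pandas sketch below uses a made-up table whose "class" and "income" columns are purely hypothetical.

```python
import pandas as pd

# Toy data with missing values; the column names and numbers are made up.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Global constant: fill with a sentinel (a string such as "unknown" also works).
filled_const = df["income"].fillna(-1)

# Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean for all samples belonging to the same class (smarter).
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist())
```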
How to handle noisy data
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, bin medians, bin boundaries, etc. (see the sketch after this list)
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
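The sketch below illustrates the binning idea with pandas; the nine sorted price values are invented, qcut gives the equal-frequency partition, and smoothing is done by bin means.

```python
import pandas as pd

# Sorted toy values (e.g., prices); invented for illustration.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency (equal-depth) bins.
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value by the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```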
Data integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric
vs. British units
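A minimal pandas sketch of schema integration and value-conflict detection, assuming two hypothetical sources whose key columns are named cust_id and cust_no.

```python
import pandas as pd

# Two hypothetical sources describing the same customers under different schemas.
a = pd.DataFrame({"cust_id": [1, 2], "name": ["Bill Clinton", "Ada Lovelace"]})
b = pd.DataFrame({"cust_no": [1, 2], "name": ["William Clinton", "Ada Lovelace"]})

# Schema integration: treat A.cust_id and B.cust_no as the same key.
merged = a.merge(b, left_on="cust_id", right_on="cust_no", suffixes=("_a", "_b"))

# Value conflicts (e.g., "Bill Clinton" vs. "William Clinton") still need an
# entity-resolution step; here they are only flagged for human inspection.
conflicts = merged[merged["name_a"] != merged["name_b"]]
print(conflicts)
```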
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object may have different names in different
databases
Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality
Correlation Analysis
χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected, summed over all cells of the contingency table
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
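A small scipy sketch of the χ² test; the 2×2 contingency table of observed counts below is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts for two
# categorical variables (e.g., plays chess vs. likes science fiction).
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value)  # a large chi2 / tiny p-value suggests the variables are related
print(expected)       # cells far from these expected counts contribute most to chi2
```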
Correlation Analysis
Correlation coefficient (also called Pearson’s product moment coefficient)
r_A,B = Σ (a_i − Ā)(b_i − B̄) / ((n − 1) σ_A σ_B) = (Σ a_i·b_i − n·Ā·B̄) / ((n − 1) σ_A σ_B)
where the sums run over i = 1 to n, n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the sum of the AB cross-products.
If r_A,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
r_A,B = 0: independent; r_A,B < 0: negatively correlated
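The numpy sketch below evaluates the formula above directly and compares it with the library result; the two attribute vectors are made up.

```python
import numpy as np

# Two hypothetical numeric attributes A and B.
a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(a)
# Sample correlation, matching the slide formula ((n - 1) and sample std devs).
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

print(r)
print(np.corrcoef(a, b)[0, 1])  # library value, identical up to rounding
```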
Data Reduction
Data reduction: Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Why data reduction? — A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Dimensionality reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and the distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
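As a brief, hedged illustration of one of these techniques, the scikit-learn sketch below projects a synthetic 100 x 10 data matrix onto its first two principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 tuples with 10 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Keep the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```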
Min-max normalization: to [new_min_A, new_max_A]
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000) / 16,000 ≈ 1.19
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
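A numpy sketch of the three normalization methods, reusing the income figures from the examples above; computing j from log10 is an assumption about how the decimal-scaling exponent would be chosen.

```python
import numpy as np

# Income values, reusing the range and example value from the slides.
v = np.array([12_000.0, 73_000.0, 98_000.0])

# Min-max normalization to [0.0, 1.0].
min_a, max_a, new_min, new_max = 12_000.0, 98_000.0, 0.0, 1.0
minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score normalization with mu = 54,000 and sigma = 16,000.
zscore = (v - 54_000.0) / 16_000.0

# Decimal scaling: v' = v / 10^j with the smallest j giving max(|v'|) < 1
# (assumption: j derived as floor(log10(max |v|)) + 1).
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j

print(minmax)   # approx. [0.0, 0.709, 1.0]
print(zscore)   # [-2.625, 1.1875, 2.75]
print(decimal)  # [0.12, 0.73, 0.98]
```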
Aggregation
Combining two or more records into a single object
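A one-line pandas illustration of aggregation; the monthly sales table is hypothetical.

```python
import pandas as pd

# Hypothetical daily sales records, aggregated (combined) into monthly records.
daily = pd.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb"],
                      "sales": [100, 150, 120, 130]})
monthly = daily.groupby("month", as_index=False)["sales"].sum()
print(monthly)  # one combined record per month
```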
Sampling Techniques
Simple random sampling
There is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
Used in conjunction with skewed data
Progressive Sampling
Starts with a very small sample and then increases the sample size until a sample of sufficient size is obtained
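A pandas sketch of the sampling schemes above; the 90/10 class split is invented to mimic skewed data.

```python
import pandas as pd

# Hypothetical data set with a skewed class attribute (90 "A", 10 "B").
df = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10, "x": range(100)})

# Simple random sampling without replacement (SRSWOR).
srswor = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR).
srswr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each class partition.
stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["cls"].value_counts())  # 9 tuples of "A", 1 of "B"
```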
Dimensionality Reduction
Why is it needed?
Many algorithms work better with low-dimensional data
Allows better data visualization
The amount of processing time and memory required is reduced
Dimensionality reduction reduces dimensionality by creating new attributes that are combinations of the existing attributes.
The reduction of dimensionality by selecting new attributes that are a subset of the old attributes is known as feature subset selection.
Curse of Dimensionality
Feature Subset Selection
There are three standard methods for feature subset selection
Embedded Subset Selection
Filter approach
Wrapper Approach
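As a hedged sketch of the filter approach, the scikit-learn snippet below scores features independently of any classifier and keeps the two best on the Iris data; a wrapper approach would instead evaluate subsets with a target model (e.g., recursive feature elimination).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Filter approach: rank each feature with a univariate score (ANOVA F-test)
# and keep the k highest-scoring features.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.get_support())           # boolean mask of the kept features
```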
Feature Extraction
Creating a new set of features from the original set of features is known as feature extraction.
Discretization and Binarization
Mapping continuous-valued attributes to categorical attributes is called discretization.
Mapping continuous-valued attributes to one or more binary attributes is called binarization.
Discretization of Continuous-Valued Attributes
Unsupervised Discretization
Equal Width intervals
Equal Depth Intervals
Supervised Discretization
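A closing pandas sketch of unsupervised discretization (equal-width and equal-depth) and a simple binarization; the age values and the 50-year threshold are arbitrary.

```python
import pandas as pd

# Hypothetical continuous attribute (ages).
ages = pd.Series([22, 25, 31, 35, 40, 47, 52, 61, 70])

# Unsupervised discretization: equal-width intervals.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Unsupervised discretization: equal-depth (equal-frequency) intervals.
equal_depth = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Binarization: map the continuous attribute to a single binary attribute.
is_senior = (ages >= 50).astype(int)

print(pd.DataFrame({"age": ages, "width": equal_width,
                    "depth": equal_depth, "senior": is_senior}))
```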